Hidden Markov Models and Applications — Spring 2017, February 21 & 23, 2017
Gene finding in prokaryotes
Reading frames
A protein is coded by groups of three nucleotides (codons):
ACGTACGTACGTACGT
ACG-TAC-GTA-CGT-ACG-T → TYVRT
There are two other ways in which this sequence can be decomposed into codons:
A-CGT-ACG-TAC-GTA-CGT
AC-GTA-CGT-ACG-TAC-GT
These are the three reading frames. The complementary strand has three additional reading frames.
Coding for a protein
Three non-overlapping nucleotides (a codon) code for an amino acid. Each amino acid has more than one codon that codes for it. The code is almost universal across organisms. Codons that code for the same amino acid are similar. In total there are six reading frames.
Coding for a protein
Every gene starts with the codon ATG. This specifies the reading frame and the start-of-translation site. The protein sequence continues until a stop codon (UGA, UAA, UAG) is encountered.
DNA (template): TAC CGC GGC TAT TAC TGC CAG GAA GGA ACT
RNA:            AUG GCG CCG AUA AUG ACG GUC CUU CCU UGA
Protein:        Met Ala Pro Ile Met Thr Val Leu Pro Stop
Open Reading Frames (ORFs)
An open reading frame is a sequence whose length is a multiple of 3, starts with the start codon, and ends with a stop codon, with no stop codon in the middle.
How do we determine whether an ORF is a protein-coding gene? If we see a long run of non-stop codons after a start codon, that run has a low probability of arising by chance.
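The ORF definition above can be sketched as a short scanner over the three forward reading frames. This is a minimal illustration (the function name and test sequence are made up for the example; a real gene finder would also scan the reverse complement and apply a minimum-length cutoff):

```python
# Minimal ORF scanner: find stretches ATG ... stop (length a multiple of 3,
# no internal stop codon) in the three forward reading frames.

STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq):
    """Return (start, end) index pairs of ORFs in the forward frames."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i                     # open an ORF at the start codon
            elif codon in STOPS and start is not None:
                orfs.append((start, i + 3))   # close it, including the stop
                start = None
    return orfs

print(find_orfs("CCATGAAATGATT"))  # one ORF: ATG AAA TGA at indices 2..11
```

An ATG that is never followed by an in-frame stop (as in frame 1 of the test sequence) is not reported, matching the definition above.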
A model^1 for DNA sequences
We need a probabilistic model assigning a probability to a DNA sequence. The simplest model: nucleotides are independent and identically distributed. Probability of a sequence s = s_1 s_2 ... s_n:
P(s) = P(s_1) P(s_2) ... P(s_n)
The distribution is characterized by numbers (p_A, p_C, p_G, p_T) that satisfy p_A + p_C + p_G + p_T = 1 (multinomial distribution).
^1 "All models are wrong. Some are useful." (G.E.P. Box)
A method for gene finding
Under our model, assuming that the four nucleotides have equal frequencies:
P(run of k non-stop codons) = (61/64)^k
Choose k such that P(run of k non-stop codons) is smaller than the rate of false positives we are willing to accept. At the 5% level we get k = 63, since (61/64)^62 ≈ 0.051 and (61/64)^63 ≈ 0.049.
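The cutoff can be computed directly from the model. A quick check (note that (61/64)^62 ≈ 0.051 is still just above 5%, so a strict 5% threshold lands on k = 63):

```python
# Under the i.i.d. uniform-nucleotide model, each codon is a non-stop codon
# with probability 61/64 (3 of the 64 codons are stops).  Find the smallest
# run length k whose chance probability drops below a 5% false-positive rate.

p_nonstop = 61 / 64

k = 1
while p_nonstop ** k >= 0.05:
    k += 1

print(k, p_nonstop ** k)
```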
Information we ignored
Coding regions have different nucleotide usage. Di-nucleotides and 3-mers have different statistics in coding and non-coding regions. The promoter region contains sequence signals for binding of TFs and RNA polymerase. We need better models! But this works surprisingly well in prokaryotes. Eukaryotes (introns/exons) need more detailed models.
Hidden Markov Models
Example: The Dishonest Casino
Game:
1. You bet $1
2. You roll
3. Casino player rolls
4. Highest number wins $2
The casino has two dice:
Fair die: P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
Loaded die: P(1) = P(2) = P(3) = P(4) = P(5) = 1/10, P(6) = 1/2
The casino player switches between fair and loaded die (not too often, and not for too long).
The dishonest casino model
Transitions: P(F→F) = 0.95, P(F→L) = 0.05, P(L→L) = 0.9, P(L→F) = 0.1
FAIR: P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6
LOADED: P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10, P(6|L) = 1/2
Question #1: Evaluation
GIVEN: A sequence of rolls by the casino player
1245526462146146136136661664661636616366163616515615115146123562344
QUESTION: How likely is this sequence, given our model of how the casino works? (Prob = 1.3 × 10^-35)
This is the EVALUATION problem in HMMs.
Question #2: Decoding
GIVEN: A sequence of rolls by the casino player
1245526462146146136136661664661636616366163616515615115146123562344
QUESTION: What portion of the sequence was generated with the fair die, and what portion with the loaded die? (FAIR / LOADED / FAIR)
This is the DECODING question in HMMs.
Question #3: Learning
GIVEN: A sequence of rolls by the casino player
1245526462146146136136661664661636616366163616515615115146123562344
QUESTION: How does the casino player work? How loaded is the loaded die? How fair is the fair die? How often does the casino player change from fair to loaded, and back?
This is the LEARNING question in HMMs.
The dishonest casino model
Transitions: P(F→F) = 0.95, P(F→L) = 0.05, P(L→L) = 0.9, P(L→F) = 0.1
FAIR: P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6
LOADED: P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10, P(6|L) = 1/2
Definition of a hidden Markov model
Alphabet Σ = {b_1, b_2, ..., b_M}
Set of states Q = {1, ..., K} (K = |Q|)
Transition probabilities between any two states: a_ij = transition probability from state i to state j, with a_i1 + ... + a_iK = 1 for all states i
Initial probabilities a_0i, with a_01 + ... + a_0K = 1
Emission probabilities within each state: e_k(b) = P(x_i = b | π_i = k), with e_k(b_1) + ... + e_k(b_M) = 1
Hidden states and observed sequence
At time step t, π_t denotes the (hidden) state in the Markov chain and x_t denotes the symbol emitted in state π_t.
A path of length N is π_1, π_2, ..., π_N.
An observed sequence of length N is x_1, x_2, ..., x_N.
A parse of a sequence
Given a sequence x = x_1 ... x_N, a parse of x is a sequence of states π = π_1, ..., π_N.
Likelihood of a parse
Given a sequence x = x_1 ... x_N and a parse π = π_1, ..., π_N, how likely is the parse (given our HMM)?
P(x, π) = P(x_1, ..., x_N, π_1, ..., π_N)
= P(x_N, π_N | x_1 ... x_{N-1}, π_1, ..., π_{N-1}) P(x_1 ... x_{N-1}, π_1, ..., π_{N-1})
= P(x_N, π_N | π_{N-1}) P(x_1 ... x_{N-1}, π_1, ..., π_{N-1})
= ... = P(x_N, π_N | π_{N-1}) P(x_{N-1}, π_{N-1} | π_{N-2}) ... P(x_2, π_2 | π_1) P(x_1, π_1)
= P(x_N | π_N) P(π_N | π_{N-1}) ... P(x_2 | π_2) P(π_2 | π_1) P(x_1 | π_1) P(π_1)
= a_{0π_1} a_{π_1π_2} ... a_{π_{N-1}π_N} e_{π_1}(x_1) ... e_{π_N}(x_N)
= ∏_{i=1}^{N} a_{π_{i-1}π_i} e_{π_i}(x_i)
Example: the dishonest casino
What is the probability of a sequence of rolls x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4 and the parse π = Fair, Fair, ..., Fair? (Say initial probabilities a_{0,Fair} = 1/2, a_{0,Loaded} = 1/2.)
1/2 × P(1|Fair) P(Fair|Fair) P(2|Fair) P(Fair|Fair) ... P(4|Fair)
= 1/2 × (1/6)^10 × (0.95)^9 = 5.2 × 10^-9
Example: the dishonest casino
So, the likelihood that the die is fair in all this run is 5.2 × 10^-9. What about π = Loaded, Loaded, ..., Loaded?
1/2 × P(1|Loaded) P(Loaded|Loaded) ... P(4|Loaded)
= 1/2 × (1/10)^8 × (1/2)^2 × (0.9)^9 = 4.8 × 10^-10
Therefore, it is more likely that the die is fair all the way than loaded all the way.
Example: the dishonest casino
Let the sequence of rolls be x = 1, 6, 6, 5, 6, 2, 6, 6, 3, 6, and let's consider π = F, F, ..., F:
P(x, π) = 1/2 × (1/6)^10 × (0.95)^9 = 5.2 × 10^-9 (same as before)
And for π = L, L, ..., L:
P(x, π) = 1/2 × (1/10)^4 × (1/2)^6 × (0.9)^9 = 3.02 × 10^-7
So the observed sequence is about 60 times more likely if a loaded die is used.
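The three joint probabilities above can be reproduced directly from the product formula P(x, π) = a_{0π_1} ∏ a_{π_{i-1}π_i} e_{π_i}(x_i), using the transition and emission values from the model slide and 1/2 initial probabilities:

```python
# Joint probability P(x, pi) for the dishonest-casino HMM, computed as the
# product of initial, transition, and emission probabilities along the parse.

A = {("F", "F"): 0.95, ("F", "L"): 0.05, ("L", "L"): 0.90, ("L", "F"): 0.10}
A0 = {"F": 0.5, "L": 0.5}
E = {"F": {r: 1 / 6 for r in range(1, 7)},
     "L": {r: (1 / 2 if r == 6 else 1 / 10) for r in range(1, 7)}}

def joint(x, pi):
    p = A0[pi[0]] * E[pi[0]][x[0]]
    for i in range(1, len(x)):
        p *= A[(pi[i - 1], pi[i])] * E[pi[i]][x[i]]
    return p

x1 = [1, 2, 1, 5, 6, 2, 1, 6, 2, 4]
print(joint(x1, "F" * 10))   # ~5.2e-9
print(joint(x1, "L" * 10))   # ~4.8e-10
x2 = [1, 6, 6, 5, 6, 2, 6, 6, 3, 6]
print(joint(x2, "L" * 10))   # ~3.0e-7
```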
What we know
Given a sequence x = x_1, ..., x_N and a parse π = π_1, ..., π_N, we know how to compute P(x, π).
What we would like to know
1. Evaluation: GIVEN an HMM M and a sequence x, FIND Prob[x | M]
2. Decoding: GIVEN an HMM M and a sequence x, FIND the sequence π of states that maximizes P[x, π | M]
3. Learning: GIVEN an HMM M with unspecified transition/emission probabilities and a sequence x, FIND parameters θ = (e_i(·), a_ij) that maximize P[x | θ]
Problem 2: Decoding Find the best parse of a sequence
Decoding
GIVEN x = x_1 x_2 ... x_N, we want to find π = π_1, ..., π_N such that P[x, π] is maximized:
π* = argmax_π P[x, π]
That is, maximize a_{0π_1} e_{π_1}(x_1) a_{π_1π_2} ... a_{π_{N-1}π_N} e_{π_N}(x_N).
We can use dynamic programming! Let
V_k(i) = max_{π_1,...,π_{i-1}} P[x_1 ... x_{i-1}, π_1, ..., π_{i-1}, x_i, π_i = k]
= probability of the maximum-probability path ending at state π_i = k
Decoding, main idea
Inductive assumption: given V_k(i) = max_{π_1,...,π_{i-1}} P[x_1 ... x_{i-1}, π_1, ..., π_{i-1}, x_i, π_i = k], what is V_r(i+1)?
V_r(i+1) = max_{π_1,...,π_i} P[x_1 ... x_i, π_1, ..., π_i, x_{i+1}, π_{i+1} = r]
= max_{π_1,...,π_i} P(x_{i+1}, π_{i+1} = r | x_1 ... x_i, π_1, ..., π_i) P[x_1 ... x_i, π_1, ..., π_i]
= max_{π_1,...,π_i} P(x_{i+1}, π_{i+1} = r | π_i) P[x_1 ... x_{i-1}, π_1, ..., π_{i-1}, x_i, π_i]
= max_k [ P(x_{i+1}, π_{i+1} = r | π_i = k) max_{π_1,...,π_{i-1}} P[x_1 ... x_{i-1}, π_1, ..., π_{i-1}, x_i, π_i = k] ]
= max_k e_r(x_{i+1}) a_kr V_k(i)
= e_r(x_{i+1}) max_k a_kr V_k(i)
The Viterbi Algorithm
Input: x = x_1, ..., x_N
Initialization: V_0(0) = 1 (0 is the imaginary first position); V_k(0) = 0, for all k > 0
Iteration: V_j(i) = e_j(x_i) max_k a_kj V_k(i-1); Ptr_j(i) = argmax_k a_kj V_k(i-1)
Termination: P(x, π*) = max_k V_k(N)
Traceback: π_N* = argmax_k V_k(N); π_{i-1}* = Ptr_{π_i*}(i)
The Viterbi Algorithm
Input: x = x_1, ..., x_N
Initialization: V_0(0) = 1 (0 is the imaginary first position); V_k(0) = 0, for all k > 0
Iteration: V_j(i) = e_j(x_i) max_k a_kj V_k(i-1); Ptr_j(i) = argmax_k a_kj V_k(i-1)
Termination: P(x, π*) = max_k a_k0 V_k(N) (with an end state)
Traceback: π_N* = argmax_k V_k(N); π_{i-1}* = Ptr_{π_i*}(i)
The Viterbi Algorithm
[Trellis figure: states 1, ..., K against positions x_1, x_2, ..., x_N; cell (j, i) holds V_j(i)]
Similar to aligning a set of states to a sequence.
The Viterbi Algorithm
[Trellis figure: states 1, ..., K against positions x_1, x_2, ..., x_N; cell (j, i) holds V_j(i)]
Similar to aligning a set of states to a sequence.
Time: O(K^2 N); Space: O(KN)
Viterbi Algorithm, a practical detail
Underflows are a significant problem:
P[x_1, ..., x_i, π_1, ..., π_i] = a_{0π_1} a_{π_1π_2} ... a_{π_{i-1}π_i} e_{π_1}(x_1) ... e_{π_i}(x_i)
These numbers become extremely small, causing underflow.
Solution: take the logs of all values:
V_r(i) = log e_r(x_i) + max_k [ V_k(i-1) + log a_kr ]
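The log-space recursion can be written out for the casino model. A minimal sketch (the traceback is stored explicitly, as in the algorithm above; test sequences are made up):

```python
import math

# Log-space Viterbi for the casino HMM:
# V_r(i) = log e_r(x_i) + max_k [ V_k(i-1) + log a_kr ], plus traceback.

STATES = ["F", "L"]
LOG_A0 = {"F": math.log(0.5), "L": math.log(0.5)}
LOG_A = {("F", "F"): math.log(0.95), ("F", "L"): math.log(0.05),
         ("L", "L"): math.log(0.90), ("L", "F"): math.log(0.10)}

def log_e(k, b):
    if k == "F":
        return math.log(1 / 6)
    return math.log(1 / 2 if b == 6 else 1 / 10)

def viterbi(x):
    V = [{k: LOG_A0[k] + log_e(k, x[0]) for k in STATES}]
    ptr = []
    for i in range(1, len(x)):
        row, back = {}, {}
        for r in STATES:
            best = max(STATES, key=lambda k: V[-1][k] + LOG_A[(k, r)])
            row[r] = log_e(r, x[i]) + V[-1][best] + LOG_A[(best, r)]
            back[r] = best
        V.append(row)
        ptr.append(back)
    last = max(STATES, key=lambda k: V[-1][k])
    path = [last]
    for back in reversed(ptr):          # follow the pointers backwards
        path.append(back[path[-1]])
    return "".join(reversed(path)), V[-1][last]

path, logp = viterbi([1, 2, 1, 6, 6])
print(path, logp)   # best parse and its log-probability
```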
Example
Let x be a sequence with a portion in which 6s appear with probability 1/6, followed by a portion in which 6s appear with probability 1/2 (every other roll a 6):
x = 123456123456...12345 6626364656 1626364656
It is easy to convince yourself that the optimal parse is FFF...F LLL...L.
Example
Observed sequence: x = 1, 2, 1, 6, 6; P(x) = 0.0002195337
Best 8 paths:
LLLLL 0.0001018
FFFFF 0.0000524
FFFLL 0.0000248
FFLLL 0.0000149
FLLLL 0.0000089
FFFFL 0.0000083
LLLLF 0.0000018
LFFFF 0.0000017
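With only 2^5 = 32 parses, the ranking above can be checked by brute force. A sketch using the model slide's parameters and 1/2 initial probabilities (the slide's exact figures appear to come from slightly different initial/termination conventions, so only the ranking and the order of magnitude of P(x) are checked here):

```python
from itertools import product

# Enumerate all 32 parses of x = 1,2,1,6,6, rank them by joint probability,
# and sum them to obtain P(x).

A = {("F", "F"): 0.95, ("F", "L"): 0.05, ("L", "L"): 0.90, ("L", "F"): 0.10}
A0 = {"F": 0.5, "L": 0.5}

def e(k, b):
    return 1 / 6 if k == "F" else (1 / 2 if b == 6 else 1 / 10)

x = [1, 2, 1, 6, 6]
scores = {}
for pi in product("FL", repeat=len(x)):
    p = A0[pi[0]] * e(pi[0], x[0])
    for i in range(1, len(x)):
        p *= A[(pi[i - 1], pi[i])] * e(pi[i], x[i])
    scores["".join(pi)] = p

ranked = sorted(scores.items(), key=lambda kv: -kv[1])
print(ranked[:3])                 # LLLLL first, FFFFF second
print(sum(scores.values()))       # P(x), summed over all 32 parses
```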
Problem 1: Evaluation
Finding the probability that a sequence is generated by the model
Generating a sequence by the model
Given an HMM, we can generate a sequence of length n as follows:
1. Start at state π_1 according to probability a_{0π_1}
2. Emit letter x_1 according to probability e_{π_1}(x_1)
3. Go to state π_2 according to probability a_{π_1π_2}
4. ... until emitting x_n
The Forward Algorithm
We want to calculate P(x) = probability of x, given the HMM (x = x_1, ..., x_N). Sum over all possible ways of generating x:
P(x) = Σ_{all paths π} P(x, π)
To avoid summing over an exponential number of paths π, define
f_k(i) = P(x_1, ..., x_i, π_i = k) (the forward probability)
The Forward Algorithm, derivation
The forward probability:
f_k(i) = P(x_1 ... x_i, π_i = k)
= Σ_{π_1 ... π_{i-1}} P(x_1 ... x_{i-1}, π_1, ..., π_{i-1}, π_i = k) e_k(x_i)
= Σ_r Σ_{π_1 ... π_{i-2}} P(x_1 ... x_{i-1}, π_1, ..., π_{i-2}, π_{i-1} = r) a_rk e_k(x_i)
= Σ_r P(x_1 ... x_{i-1}, π_{i-1} = r) a_rk e_k(x_i)
= e_k(x_i) Σ_r f_r(i-1) a_rk
The Forward Algorithm
A dynamic programming algorithm:
Initialization: f_0(0) = 1; f_k(0) = 0, for all k > 0
Iteration: f_k(i) = e_k(x_i) Σ_r f_r(i-1) a_rk
Termination: P(x) = Σ_k f_k(N)
The Forward Algorithm
If our model has an end state:
Initialization: f_0(0) = 1; f_k(0) = 0, for all k > 0
Iteration: f_k(i) = e_k(x_i) Σ_r f_r(i-1) a_rk
Termination: P(x) = Σ_k f_k(N) a_k0
where a_k0 is the probability that the terminating state is k (usually = a_0k).
Relation between Forward and Viterbi
VITERBI
Initialization: V_0(0) = 1; V_k(0) = 0, for all k > 0
Iteration: V_j(i) = e_j(x_i) max_k V_k(i-1) a_kj
Termination: P(x, π*) = max_k V_k(N)
FORWARD
Initialization: f_0(0) = 1; f_k(0) = 0, for all k > 0
Iteration: f_l(i) = e_l(x_i) Σ_k f_k(i-1) a_kl
Termination: P(x) = Σ_k f_k(N)
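The forward recursion replaces Viterbi's max with a sum; on a short sequence it can be cross-checked against the explicit sum over all K^N parses. A sketch with the casino parameters (the test sequence is made up):

```python
from itertools import product

# Forward algorithm f_k(i) = e_k(x_i) * sum_r f_r(i-1) * a_rk, cross-checked
# against brute-force summation over every parse.

STATES = "FL"
A = {("F", "F"): 0.95, ("F", "L"): 0.05, ("L", "L"): 0.90, ("L", "F"): 0.10}
A0 = {"F": 0.5, "L": 0.5}

def e(k, b):
    return 1 / 6 if k == "F" else (1 / 2 if b == 6 else 1 / 10)

def forward(x):
    f = {k: A0[k] * e(k, x[0]) for k in STATES}
    for b in x[1:]:
        f = {k: e(k, b) * sum(f[r] * A[(r, k)] for r in STATES)
             for k in STATES}
    return sum(f.values())

def brute(x):
    total = 0.0
    for pi in product(STATES, repeat=len(x)):
        p = A0[pi[0]] * e(pi[0], x[0])
        for i in range(1, len(x)):
            p *= A[(pi[i - 1], pi[i])] * e(pi[i], x[i])
        total += p
    return total

x = [1, 2, 1, 6, 6, 4, 3]
print(forward(x), brute(x))   # the two agree
```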
The most likely state
Given a sequence x, what is the most likely state that emitted x_i? In other words, we want to compute P(π_i = k | x).
Example: the dishonest casino. Say x = 12341623162616364616234161221341. The most likely path is π = FF...F; however, some individual positions (those in the 6-rich stretch) are more likely to be L.
Motivation for the Backward Algorithm
We want to compute P(π_i = k | x). We start by computing
P(π_i = k, x) = P(x_1 ... x_i, π_i = k, x_{i+1} ... x_N)
= P(x_1 ... x_i, π_i = k) P(x_{i+1} ... x_N | x_1 ... x_i, π_i = k)
= P(x_1 ... x_i, π_i = k) P(x_{i+1} ... x_N | π_i = k)
The first factor is the forward probability f_k(i); the second is the backward probability b_k(i). Then
P(π_i = k | x) = P(π_i = k, x) / P(x) = f_k(i) b_k(i) / P(x)
The Backward Algorithm, derivation
Define the backward probability:
b_k(i) = P(x_{i+1} ... x_N | π_i = k)
= Σ_{π_{i+1} ... π_N} P(x_{i+1}, x_{i+2}, ..., x_N, π_{i+1}, ..., π_N | π_i = k)
= Σ_r Σ_{π_{i+2} ... π_N} P(x_{i+1}, x_{i+2}, ..., x_N, π_{i+1} = r, π_{i+2}, ..., π_N | π_i = k)
= Σ_r e_r(x_{i+1}) a_kr Σ_{π_{i+2} ... π_N} P(x_{i+2}, ..., x_N, π_{i+2}, ..., π_N | π_{i+1} = r)
= Σ_r e_r(x_{i+1}) a_kr b_r(i+1)
The Backward Algorithm
A dynamic programming algorithm for b_k(i):
Initialization: b_k(N) = 1, for all k
Iteration: b_k(i) = Σ_r e_r(x_{i+1}) a_kr b_r(i+1)
Termination: P(x) = Σ_r a_0r e_r(x_1) b_r(1)
The Backward Algorithm
In case of an end state:
Initialization: b_k(N) = a_k0, for all k
Iteration: b_k(i) = Σ_r e_r(x_{i+1}) a_kr b_r(i+1)
Termination: P(x) = Σ_r a_0r e_r(x_1) b_r(1)
Example
Observed sequence: x = 1, 2, 1, 6, 6; P(x) = 0.0002195337
Conditional probabilities given x:
Position   F          L
1          0.5029927  0.4970073
2          0.4752192  0.5247808
3          0.4116375  0.5883625
4          0.2945388  0.7054612
5          0.2665376  0.7334624
Computational Complexity
What are the running time and space required for the Forward and Backward algorithms?
Time: O(K^2 N); Space: O(KN)
A useful implementation technique to avoid underflows: rescale at each position by multiplying by a constant.
Scaling
Define scaled forward probabilities: f̂_k(i) = f_k(i) / (s_1 s_2 ... s_i)
Recursion: s_i f̂_k(i) = e_k(x_i) Σ_r f̂_r(i-1) a_rk
Choosing scaling factors s_i such that Σ_k f̂_k(i) = 1
Leads to: P(x) = ∏_i s_i, i.e., log P(x) = Σ_i log s_i
Scaling for the backward probabilities uses the same factors: b̂_k(i) = b_k(i) / (s_{i+1} ... s_N)
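The rescaling idea can be sketched directly: normalize the forward values at each position and accumulate the log of the normalizer, which yields log P(x) without ever forming the underflowing product. A sketch for the casino model (the 9,000-roll test input is made up):

```python
import math

# Scaled forward recursion: at each position divide by s_i = sum_k f_k(i),
# so the stored values sum to 1 and log P(x) = sum_i log s_i.

STATES = "FL"
A = {("F", "F"): 0.95, ("F", "L"): 0.05, ("L", "L"): 0.90, ("L", "F"): 0.10}
A0 = {"F": 0.5, "L": 0.5}

def e(k, b):
    return 1 / 6 if k == "F" else (1 / 2 if b == 6 else 1 / 10)

def log_prob(x):
    f = {k: A0[k] * e(k, x[0]) for k in STATES}
    logp = 0.0
    for i, b in enumerate(x):
        if i > 0:
            f = {k: e(k, b) * sum(f[r] * A[(r, k)] for r in STATES)
                 for k in STATES}
        s = sum(f.values())              # scaling factor s_i
        logp += math.log(s)
        f = {k: v / s for k, v in f.items()}
    return logp

# An unscaled forward pass on 9,000 rolls would underflow to 0.0;
# the scaled version returns a finite log-probability.
print(log_prob([6, 1, 2] * 3000))
```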
Posterior Decoding
We can now calculate
P(π_i = k | x) = f_k(i) b_k(i) / P(x)
and ask: what is the most likely state at position i of sequence x? Using posterior decoding we define
π̂_i = argmax_k P(π_i = k | x)
Posterior Decoding
For each state, posterior decoding gives us a curve of the probability of that state at each position, given the sequence x. That is sometimes more informative than the Viterbi path π*. Posterior decoding may give an invalid sequence of states. Why?
Viterbi vs. posterior decoding
A class takes a multiple-choice test. How does the lazy professor construct the answer key?
Viterbi approach: use the answers of the best student.
Posterior decoding: majority vote.
Viterbi, Forward, Backward
VITERBI
Initialization: V_0(0) = 1; V_k(0) = 0, for all k > 0
Iteration: V_r(i) = e_r(x_i) max_k V_k(i-1) a_kr
Termination: P(x, π*) = max_k V_k(N)
FORWARD
Initialization: f_0(0) = 1; f_k(0) = 0, for all k > 0
Iteration: f_r(i) = e_r(x_i) Σ_k f_k(i-1) a_kr
Termination: P(x) = Σ_k f_k(N) a_k0
BACKWARD
Initialization: b_k(N) = a_k0, for all k
Iteration: b_r(i) = Σ_k e_k(x_{i+1}) a_rk b_k(i+1)
Termination: P(x) = Σ_k a_0k e_k(x_1) b_k(1)
A modeling example
CpG islands in DNA sequences. States: A+, C+, G+, T+ (island) and A-, C-, G-, T- (non-island).
Methylation & Silencing
Methylation: addition of CH3 to C nucleotides. It silences genes in the region. CG (denoted CpG) often mutates to TG when methylated. Methylation is inherited during cell division.
CpG Islands
CpG nucleotides in the genome are frequently methylated: C → methyl-C → T. Methylation is often suppressed around genes and promoters: CpG islands.
CpG Islands
In CpG islands: CG is more frequent, and other dinucleotides (AA, AG, AT, ...) have different frequencies.
Problem: detect CpG islands.
A model of CpG Islands: Architecture
CpG island states: A+, C+, G+, T+
Non-CpG island states: A-, C-, G-, T-
A model of CpG Islands: Transitions
How do we estimate the parameters of the model? Emission probabilities are 1/0 (each state emits its own nucleotide).
1. Transition probabilities within CpG islands, established from known CpG islands:
+    A     C     G     T
A   .180  .274  .426  .120
C   .171  .368  .274  .188
G   .161  .339  .375  .125
T   .079  .355  .384  .182
2. Transition probabilities within non-CpG islands, established from non-CpG islands:
-    A     C     G     T
A   .300  .205  .285  .210
C   .322  .298  .078  .302
G   .248  .246  .298  .208
T   .177  .239  .292  .292
A model of CpG Islands: Transitions
What about transitions between (+) and (-) states? Their probabilities affect the average length of a CpG island and the average separation between two CpG islands.
[Figure: two states X and Y; X has self-loop probability p, Y has self-loop probability q; X → Y with probability 1-p, Y → X with probability 1-q]
Length distribution of the X region:
P[L_X = 1] = 1-p
P[L_X = 2] = p(1-p)
...
P[L_X = k] = p^{k-1}(1-p)
E[L_X] = 1/(1-p)
A geometric distribution, with mean 1/(1-p).
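The geometric duration claim is easy to verify numerically: summing k · p^(k-1) · (1-p) over k reproduces the mean 1/(1-p). A quick check (p = 0.95 is an arbitrary illustrative value):

```python
# A self-loop probability p gives a geometric region-length distribution
# P[L = k] = p**(k-1) * (1-p); its mean should equal 1/(1-p).

def truncated_mean(p, kmax=100_000):
    """Mean of the geometric length distribution via truncated summation."""
    return sum(k * p ** (k - 1) * (1 - p) for k in range(1, kmax + 1))

p = 0.95
print(truncated_mean(p), 1 / (1 - p))   # both ~20
```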
A model of CpG Islands: Transitions
There is no reason to favor exiting or entering the (+) and (-) regions at a particular nucleotide. Estimate the average length L_CpG of a CpG island:
L_CpG = 1/(1-p), so p = 1 - 1/L_CpG
For each pair of (+) states: a_kr ← p · a_kr
For each (+) state k and (-) state r: a_kr = (1-p) a_{0r-}
Do the same for the (-) states.
A problem with this model: CpG islands don't have a geometric length distribution. This is a defect of HMMs, a price we pay for ease of analysis and efficient computation.
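The rescaling recipe can be sketched as code that assembles the full 8-state transition matrix from the two empirical tables. In this sketch the cross-region mass (1-p or 1-q) is split uniformly over the other region's four states, and p = 1 - 1/1000 (a 1,000 bp average island) and the value of q are illustrative assumptions, not values from the lecture; the published tables carry rounding error, so rows sum to 1 only approximately:

```python
# Assemble the 8-state CpG transition matrix: within-region rows are the
# empirical (+)/(-) tables scaled by the stay probability, and the leftover
# mass is spread uniformly over the other region's states (an assumption).

PLUS = {"A": [.180, .274, .426, .120], "C": [.171, .368, .274, .188],
        "G": [.161, .339, .375, .125], "T": [.079, .355, .384, .182]}
MINUS = {"A": [.300, .205, .285, .210], "C": [.322, .298, .078, .302],
         "G": [.248, .246, .298, .208], "T": [.177, .239, .292, .292]}
NT = "ACGT"

def build(p, q):
    """Transition matrix over states A+,...,T+ and A-,...,T-."""
    states = [n + "+" for n in NT] + [n + "-" for n in NT]
    T = {}
    for s in states:
        n, sign = s[0], s[1]
        within = PLUS[n] if sign == "+" else MINUS[n]
        stay = p if sign == "+" else q
        row = {}
        for j, m in enumerate(NT):
            row[m + sign] = stay * within[j]                    # stay in region
            row[m + ("-" if sign == "+" else "+")] = (1 - stay) / 4  # switch
        T[s] = row
    return T

T = build(1 - 1 / 1000, 1 - 1 / 125000)
print(sum(T["A+"].values()))   # each row sums to ~1
```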
Using the model
Given a DNA sequence x, the Viterbi algorithm predicts locations of CpG islands. Given a nucleotide x_i (say x_i = A), the Viterbi parse tells whether x_i is in a CpG island in the most likely parse. Using the Forward/Backward algorithms we can calculate
P(x_i is in a CpG island) = P(π_i = A+ | x)
Posterior decoding can assign locally optimal predictions of CpG islands: π̂_i = argmax_k P(π_i = k | x)
Posterior decoding Results of applying posterior decoding to a part of human chromosome 22
Posterior decoding
Viterbi decoding
Sliding window (size = 100)
Sliding window (size = 200)
Sliding window (size = 300)
Sliding window (size = 400)
Sliding window (size = 600)
Sliding window (size = 1000)
Modeling CpG islands with silent states
What if a new genome comes?
Suppose we just sequenced the porcupine genome. We know CpG islands play the same role in this genome; however, we have no known CpG islands for porcupines, and we suspect the frequency and characteristics of CpG islands are quite different in porcupines. How do we adjust the parameters in our model? LEARNING
Two learning scenarios
Estimation when the right answer is known. Examples:
GIVEN: a genomic region x = x_1 ... x_N where we have good (experimental) annotations of the CpG islands.
GIVEN: the casino player allows us to observe him one evening as he changes the dice, producing 10,000 rolls.
Estimation when the right answer is unknown. Examples:
GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition.
GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice.
GOAL: update the parameters θ of the model to maximize P(x | θ).
When the right answer is known
Given x = x_1 ... x_N for which π = π_1 ... π_N is known, define:
A_kr = number of times the k → r transition occurs in π
E_k(b) = number of times state k in π emits b in x
The maximum likelihood estimates of the parameters are:
a_kr = A_kr / Σ_i A_ki
e_k(b) = E_k(b) / Σ_c E_k(c)
When the right answer is known
Intuition: when we know the underlying states, the best estimate is the average frequency of transitions and emissions that occur in the training data.
Drawback: given little data, there may be overfitting: P(x | θ) is maximized, but θ is unreasonable. Zero probabilities are VERY BAD.
Example: suppose we observe
x = 2 1 5 6 1 2 3 6 2 3 5 3 4 2 1 2 1 6 3 6
π = F F F F F F F F F F F F F F F L L L L L
Then: a_FF = 14/15, a_FL = 1/15, a_LL = 1, a_LF = 0, and e_L(4) = 0.
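The counting recipe is mechanical; a sketch that tallies the transitions and emissions of the training pair above and exposes the zero estimates (note that the L state never emits a 4 in this data, so e_L(4) = 0, and no L → F transition is ever seen):

```python
from collections import Counter

# Maximum-likelihood estimation when the state path is known: count
# transitions A_kr and emissions E_k(b), then normalize each row.
# Counter returns 0 for unseen pairs, which is exactly the overfitting issue.

x  = [2, 1, 5, 6, 1, 2, 3, 6, 2, 3, 5, 3, 4, 2, 1, 2, 1, 6, 3, 6]
pi = list("FFFFFFFFFFFFFFFLLLLL")

A = Counter(zip(pi, pi[1:]))   # transition counts A_kr
E = Counter(zip(pi, x))        # emission counts E_k(b)

def a(k, r):
    return A[(k, r)] / sum(v for (s, _), v in A.items() if s == k)

def e(k, b):
    return E[(k, b)] / sum(v for (s, _), v in E.items() if s == k)

print(a("F", "F"), a("L", "F"), e("L", 4))   # 14/15, 0.0, 0.0
```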
Pseudocounts
Solution for small training sets: add pseudocounts.
A_kr = (number of times the k → r transition occurs) + t_kr
E_k(b) = (number of times state k emits symbol b in x) + t_k(b)
t_kr and t_k(b) are pseudocounts representing our prior belief. Larger pseudocounts mean a strong prior belief; small pseudocounts (ε < 1) just avoid 0 probabilities.
Pseudocounts
a_kr = (A_kr + t_kr) / Σ_i (A_ki + t_ki)
e_k(b) = (E_k(b) + t_k(b)) / Σ_c (E_k(c) + t_k(c))
Pseudocounts: example, the dishonest casino
We will observe the player for one day, 600 rolls. Reasonable pseudocounts:
t_0F = t_0L = t_F0 = t_L0 = 1; t_FL = t_LF = t_FF = t_LL = 1
t_F(1) = t_F(2) = ... = t_F(6) = 20 (strong belief that fair is fair)
t_L(1) = t_L(2) = ... = t_L(6) = 5 (wait and see for loaded)
The numbers above are arbitrary: assigning priors is an art.
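Applying these emission pseudocounts to the small training pair from the earlier slide shows how smoothing removes the zero estimates while still giving a proper distribution:

```python
from collections import Counter

# Pseudocount-smoothed emission estimates: add t_F(b) = 20 to every fair-die
# count and t_L(b) = 5 to every loaded-die count, per the priors above,
# so no emission probability is estimated as zero.

x  = [2, 1, 5, 6, 1, 2, 3, 6, 2, 3, 5, 3, 4, 2, 1, 2, 1, 6, 3, 6]
pi = list("FFFFFFFFFFFFFFFLLLLL")

E = Counter(zip(pi, x))
T = {"F": 20, "L": 5}   # per-symbol emission pseudocounts

def e_smoothed(k, b):
    num = E[(k, b)] + T[k]
    den = sum(E[(k, c)] for c in range(1, 7)) + 6 * T[k]
    return num / den

print(e_smoothed("L", 4))                                  # no longer zero
print(sum(e_smoothed("F", b) for b in range(1, 7)))        # sums to 1
```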
When the right answer is unknown
We don't know the actual A_kr, E_k(b).
Idea: initialize the model; compute (expected) A_kr, E_k(b); update the parameters of the model based on A_kr, E_k(b); repeat until convergence.
Two algorithms: Baum-Welch and Viterbi training.
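The iterate-and-update loop above is Baum-Welch (EM): the E-step computes expected transition and emission counts with forward-backward, and the M-step renormalizes them. A compact sketch for the 2-state casino model, with fixed uniform initial probabilities, no scaling (fine for short sequences), and a made-up roll sequence; a key property of EM, checked below, is that P(x | θ) never decreases across iterations:

```python
# Baum-Welch (EM) sketch for the casino HMM.

STATES = "FL"

def forward_backward(x, A, E):
    n = len(x)
    f = [{k: 0.5 * E[k][x[0]] for k in STATES}]
    for i in range(1, n):
        f.append({k: E[k][x[i]] * sum(f[-1][r] * A[(r, k)] for r in STATES)
                  for k in STATES})
    b = [{k: 1.0 for k in STATES}]
    for i in range(n - 2, -1, -1):
        b.insert(0, {k: sum(E[r][x[i + 1]] * A[(k, r)] * b[0][r]
                            for r in STATES) for k in STATES})
    return f, b, sum(f[-1].values())

def baum_welch_step(x, A, E):
    f, b, px = forward_backward(x, A, E)
    At = {kr: 0.0 for kr in A}
    Et = {k: {s: 0.0 for s in range(1, 7)} for k in STATES}
    for i in range(len(x)):
        for k in STATES:
            Et[k][x[i]] += f[i][k] * b[i][k] / px        # expected emissions
            if i + 1 < len(x):
                for r in STATES:                          # expected transitions
                    At[(k, r)] += f[i][k] * A[(k, r)] * E[r][x[i + 1]] * b[i + 1][r] / px
    A2 = {(k, r): At[(k, r)] / sum(At[(k, s)] for s in STATES)
          for k in STATES for r in STATES}
    E2 = {k: {s: Et[k][s] / sum(Et[k].values()) for s in Et[k]}
          for k in STATES}
    return A2, E2, px      # px is the likelihood under the INPUT parameters

x = [1, 2, 6, 6, 6, 3, 6, 6, 1, 4, 5, 2, 6, 6, 6, 6, 2, 1]
A = {("F", "F"): 0.9, ("F", "L"): 0.1, ("L", "F"): 0.1, ("L", "L"): 0.9}
E = {"F": {s: 1 / 6 for s in range(1, 7)},
     "L": {s: (1 / 2 if s == 6 else 1 / 10) for s in range(1, 7)}}

likelihoods = []
for _ in range(10):
    A, E, px = baum_welch_step(x, A, E)
    likelihoods.append(px)
print(likelihoods[0], likelihoods[-1])   # non-decreasing across iterations
```

Viterbi training replaces the expected counts with hard counts along the current Viterbi path; it is faster but does not maximize P(x | θ).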
Gene finding with HMMs
Example: genes in prokaryotes
EasyGene: a prokaryotic gene finder (Larsen TS, Krogh A). Codons are modeled with 3 looped triplets of states (this converts the length distribution to negative binomial).
Using the gene finder
Apply the gene finder to both strands. Training uses annotated genes.
A simple eukaryotic gene finder