Example: The Dishonest Casino
Hidden Markov Models (Durbin and Eddy, chapter 3)

The game:
1. You bet $1
2. You roll your die
3. The casino player rolls his die
4. Highest number wins $2

The casino has two dice:
- Fair die: P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
- Loaded die: P(1) = P(2) = P(3) = P(4) = P(5) = 1/10, P(6) = 1/2

The casino player switches between the fair and the loaded die (not too often, and not for too long).

The dishonest casino model
Two hidden states, FAIR and LOADED:
- stay FAIR with probability 0.95, switch to LOADED with probability 0.05
- stay LOADED with probability 0.9, switch to FAIR with probability 0.1
- emissions: e_FAIR(b) = 1/6 for b = 1..6; e_LOADED(b) = 1/10 for b = 1..5, e_LOADED(6) = 1/2

Question # 1: Evaluation
GIVEN: a sequence of rolls by the casino player, e.g.
4556464646363666664666366636663665565546356344
QUESTION: How likely is this sequence, given our model of how the casino works?
(For a sequence like this, P(x) comes out on the order of 1.3 x 10^-35.)
This is the EVALUATION problem in HMMs.

Question # 2: Decoding
GIVEN: the same sequence of rolls by the casino player.
QUESTION: What portion of the sequence was generated with the fair die, and what portion with the loaded die (FAIR / LOADED / FAIR ...)?
This is the DECODING problem in HMMs.

Question # 3: Learning
GIVEN: the same sequence of rolls by the casino player.
QUESTION: How does the casino player work? How loaded is the loaded die? How fair is the fair die? How often does the player change from fair to loaded, and back?
This is the LEARNING problem in HMMs.
Definition of a hidden Markov model
- Alphabet Σ = { b_1, ..., b_M }
- Set of states Q = { 1, ..., K }  (K = |Q|)
- Transition probabilities between any two states:
  a_kr = transition probability from state k to state r, with a_k1 + ... + a_kK = 1 for every state k
- Initial probabilities a_0k, with a_01 + ... + a_0K = 1
- Emission probabilities within each state:
  e_k(b) = P(x_i = b | π_i = k), with e_k(b_1) + ... + e_k(b_M) = 1

(For the dishonest casino: states FAIR and LOADED; a_FF = 0.95, a_FL = 0.05, a_LL = 0.9, a_LF = 0.1; emissions 1/6 everywhere for FAIR, and 1/10 on 1..5, 1/2 on 6 for LOADED.)

Hidden states and observed sequence
At time step t, π_t denotes the (hidden) state in the Markov chain and x_t denotes the symbol emitted in state π_t.
A path of length N is π_1, π_2, ..., π_N; an observed sequence of length N is x_1, x_2, ..., x_N.

An HMM is memory-less
At time step t, the only thing that affects the next state is the current state π_t:
P(π_{t+1} = k | whatever happened so far)
= P(π_{t+1} = k | π_1, ..., π_t, x_1, ..., x_t)
= P(π_{t+1} = k | π_t)

A parse of a sequence
Given a sequence x = x_1 ... x_N, a parse of x is a sequence of states π = π_1, ..., π_N.

Likelihood of a parse
Given a sequence x = x_1 ... x_N and a parse π = π_1 ... π_N, how likely is the parse (given our HMM)?
P(x, π) = P(x_1, ..., x_N, π_1, ..., π_N)
= P(x_N, π_N | x_1 ... x_{N-1}, π_1, ..., π_{N-1}) P(x_1 ... x_{N-1}, π_1, ..., π_{N-1})
= P(x_N, π_N | π_{N-1}) P(x_1 ... x_{N-1}, π_1, ..., π_{N-1})
= ...
= P(x_N, π_N | π_{N-1}) P(x_{N-1}, π_{N-1} | π_{N-2}) ... P(x_2, π_2 | π_1) P(x_1, π_1)
= P(x_N | π_N) P(π_N | π_{N-1}) ... P(x_1 | π_1) P(π_1)
= a_{0π_1} a_{π_1π_2} ... a_{π_{N-1}π_N} e_{π_1}(x_1) ... e_{π_N}(x_N)
= a_{0π_1} ∏_{i=1}^{N} e_{π_i}(x_i) a_{π_iπ_{i+1}}   (taking a_{π_Nπ_{N+1}} = 1)
Example: the dishonest casino
What is the probability of a sequence of rolls
x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4
and the parse π = Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair?
(Say the initial probabilities are a_0,Fair = 1/2, a_0,Loaded = 1/2.)
1/2 × P(1|Fair) P(Fair|Fair) P(2|Fair) P(Fair|Fair) ... P(4|Fair)
= 1/2 × (1/6)^10 × (0.95)^9 ≈ 5.2 × 10^-9

So the likelihood that the die is fair throughout this run is about 5.2 × 10^-9.
What about π = Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded?
1/2 × P(1|Loaded) P(Loaded|Loaded) ... P(4|Loaded)
= 1/2 × (1/10)^8 × (1/2)^2 × (0.9)^9 ≈ 4.8 × 10^-10
Therefore, it is more likely that the die is fair all the way than loaded all the way.

Now let the sequence of rolls be x = 1, 6, 6, 5, 6, 2, 6, 6, 3, 6, and consider π = F, F, ..., F:
P(x, π) = 1/2 × (1/6)^10 × (0.95)^9 ≈ 5.2 × 10^-9   (same as before)
And for π = L, L, ..., L:
P(x, π) = 1/2 × (1/10)^4 × (1/2)^6 × (0.9)^9 ≈ 3.0 × 10^-7
So this observed sequence is roughly 60 times more likely (3.0 × 10^-7 vs. 5.2 × 10^-9) if a loaded die is used.

Clarification of notation
P[x | M]: the probability that sequence x was generated by the model.
The model is: architecture (number of states, etc.) + parameters θ = (a_kr, e_k(·)).
So P[x | M] is the same as P[x | θ], and as P[x], when the architecture and the parameters, respectively, are implied.
Similarly, P[x, π | M], P[x, π | θ] and P[x, π] are the same when the architecture and the parameters are implied.
In the LEARNING problem we write P[x | θ] to emphasize that we are seeking the θ* that maximizes P[x | θ].

What we know
Given a sequence x = x_1, ..., x_N and a parse π = π_1, ..., π_N, we know how to compute how likely the parse is: P(x, π).

What we would like to know
1. Evaluation: GIVEN HMM M and a sequence x, FIND Prob[x | M]
2. Decoding: GIVEN HMM M and a sequence x, FIND the sequence π of states that maximizes P[x, π | M]
3. Learning: GIVEN HMM M with unspecified transition/emission probabilities and a sequence x, FIND parameters θ = (e_k(·), a_kr) that maximize P[x | θ]
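The two parse likelihoods above are easy to check numerically. Below is a minimal sketch of the joint probability P(x, π) for the casino model; the dictionary layout and the helper name `joint_prob` are our own choices, not part of the slides.

```python
# Dishonest-casino HMM parameters (from the slides).
init = {"F": 0.5, "L": 0.5}
trans = {("F", "F"): 0.95, ("F", "L"): 0.05,
         ("L", "L"): 0.90, ("L", "F"): 0.10}
emit = {"F": {r: 1 / 6 for r in range(1, 7)},
        "L": {**{r: 1 / 10 for r in range(1, 6)}, 6: 1 / 2}}

def joint_prob(x, path):
    """P(x, pi) = a_{0,pi_1} * prod_i e_{pi_i}(x_i) * a_{pi_i, pi_{i+1}}."""
    p = init[path[0]] * emit[path[0]][x[0]]
    for i in range(1, len(x)):
        p *= trans[(path[i - 1], path[i])] * emit[path[i]][x[i]]
    return p

x = [1, 2, 1, 5, 6, 2, 1, 6, 2, 4]
p_fair = joint_prob(x, "F" * 10)    # ~5.2e-9, the all-Fair parse
p_loaded = joint_prob(x, "L" * 10)  # ~4.8e-10, the all-Loaded parse
```

Running this reproduces the slide's conclusion: the all-Fair parse is roughly ten times more likely than the all-Loaded parse for this sequence.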
Problem 1: Decoding
GIVEN x = x_1, x_2, ..., x_N, we want to find π = π_1, ..., π_N such that P[x, π] is maximized:
π* = argmax_π P[x, π]
That is, maximize a_{0π_1} e_{π_1}(x_1) a_{π_1π_2} ... a_{π_{N-1}π_N} e_{π_N}(x_N).

Find the best parse of a sequence
We can use dynamic programming! Let
V_k(i) = max_{π_1, ..., π_{i-1}} P[x_1 ... x_{i-1}, π_1, ..., π_{i-1}, x_i, π_i = k]
= probability of the maximum-probability path ending at state π_i = k.

Decoding: main idea
Inductive assumption: V_k(i) as defined above is known for every state k. What is V_r(i+1)?
V_r(i+1) = max_{π_1,...,π_i} P[x_1 ... x_i, π_1, ..., π_i, x_{i+1}, π_{i+1} = r]
= max_{π_1,...,π_i} P(x_{i+1}, π_{i+1} = r | x_1 ... x_i, π_1, ..., π_i) P[x_1 ... x_i, π_1, ..., π_i]
= max_{π_1,...,π_i} P(x_{i+1}, π_{i+1} = r | π_i) P[x_1 ... x_{i-1}, π_1, ..., π_{i-1}, x_i, π_i]
= max_k [ P(x_{i+1}, π_{i+1} = r | π_i = k) max_{π_1,...,π_{i-1}} P[x_1 ... x_{i-1}, π_1, ..., π_{i-1}, x_i, π_i = k] ]
= max_k e_r(x_{i+1}) a_kr V_k(i)
= e_r(x_{i+1}) max_k a_kr V_k(i)

The Viterbi Algorithm
Input: x = x_1, ..., x_N
Initialization:
  V_0(0) = 1   (0 is the imaginary first position)
  V_k(0) = 0, for all k > 0
Iteration:
  V_r(i) = e_r(x_i) max_k a_kr V_k(i-1)
  Ptr_r(i) = argmax_k a_kr V_k(i-1)
Termination:
  P(x, π*) = max_k V_k(N)   (or max_k a_k0 V_k(N), with an end state)
Traceback:
  π_N* = argmax_k V_k(N)
  π_{i-1}* = Ptr_{π_i*}(i)

This is similar to aligning a set of states to a sequence.
Time: O(K²N)   Space: O(KN)
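The recurrence and traceback above can be sketched directly in Python. This version works in log space (anticipating the underflow remark that follows); the function name and data layout are ours.

```python
import math

def viterbi(x, states, init, trans, emit):
    """Most probable path via V_r(i) = log e_r(x_i) + max_k [V_k(i-1) + log a_kr]."""
    V = [{k: math.log(init[k]) + math.log(emit[k][x[0]]) for k in states}]
    ptr = []
    for i in range(1, len(x)):
        col, back = {}, {}
        for r in states:
            best = max(states, key=lambda k: V[-1][k] + math.log(trans[(k, r)]))
            back[r] = best
            col[r] = V[-1][best] + math.log(trans[(best, r)]) + math.log(emit[r][x[i]])
        V.append(col)
        ptr.append(back)
    # Traceback from the best final state.
    last = max(states, key=lambda k: V[-1][k])
    path = [last]
    for back in reversed(ptr):
        path.append(back[path[-1]])
    return "".join(reversed(path)), max(V[-1].values())

# Casino parameters from the slides.
init = {"F": 0.5, "L": 0.5}
trans = {("F", "F"): 0.95, ("F", "L"): 0.05,
         ("L", "L"): 0.90, ("L", "F"): 0.10}
emit = {"F": {r: 1 / 6 for r in range(1, 7)},
        "L": {**{r: 1 / 10 for r in range(1, 6)}, 6: 1 / 2}}

path, logp = viterbi([1, 2, 1, 5, 6, 2, 1, 6, 2, 4], "FL", init, trans, emit)
# path == "FFFFFFFFFF": with only two isolated 6s, switching never pays off.
```

The O(K²N) time bound is visible in the nested loops: for each of the N positions we consider every (k, r) state pair.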
The edit graph for the decoding problem
The decoding problem is essentially finding a longest path in a directed acyclic graph (DAG).

Viterbi algorithm: a practical detail
Underflows are a significant problem:
P[x_1, ..., x_i, π_1, ..., π_i] = a_{0π_1} a_{π_1π_2} ... a_{π_{i-1}π_i} e_{π_1}(x_1) ... e_{π_i}(x_i)
These numbers become extremely small: underflow.
Solution: take the logs of all values:
V_r(i) = log e_r(x_i) + max_k [ V_k(i-1) + log a_kr ]

Example
Let x be a sequence with a portion in which 1/6 of the rolls are 6s, followed by a portion in which 1/2 of the rolls are 6s, e.g.
x = 123456123456...123456 626364656...626364656
Then it is not hard to show that the optimal parse is:
FFF...F LLL...L

Example
Observed sequence: x = 1, 2, 1, 6, 6 (in this example the switch probability is 0.05 in each direction)
P(x) = 0.0002195337
Best 8 paths:
LLLLL 0.000102
FFFFF 0.0000524
FFFLL 0.0000248
FFLLL 0.0000149
FLLLL 0.0000089
FFFFL 0.0000083
LLLLF 0.0000018
LFFFF 0.0000017
Conditional probabilities given x:
Position   F          L
1          0.5029927  0.4970073
2          0.4752192  0.5247808
3          0.4116375  0.5883625
4          0.2945388  0.7054612
5          0.2665376  0.7334624

Generating a sequence by the model
Given a HMM, we can generate a sequence of length n as follows:
1. Start at state π_1 according to probability a_{0π_1}
2. Emit letter x_1 according to probability e_{π_1}(x_1)
3. Go to state π_2 according to probability a_{π_1π_2}
4. ... until emitting x_n

Problem 2: Evaluation
Finding the probability that a sequence is generated by the model.
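With only five rolls there are just 2^5 = 32 parses, so the example above can be reproduced by brute force: enumerate P(x, π) for every parse and sum to get P(x). This sketch uses the symmetric 0.95/0.05 switch probabilities of that example; the helper names are ours.

```python
from itertools import product

init = {"F": 0.5, "L": 0.5}
trans = {("F", "F"): 0.95, ("F", "L"): 0.05,
         ("L", "L"): 0.95, ("L", "F"): 0.05}
emit = {"F": {r: 1 / 6 for r in range(1, 7)},
        "L": {**{r: 1 / 10 for r in range(1, 6)}, 6: 1 / 2}}

def joint(x, path):
    """P(x, pi) for one parse."""
    p = init[path[0]] * emit[path[0]][x[0]]
    for i in range(1, len(x)):
        p *= trans[(path[i - 1], path[i])] * emit[path[i]][x[i]]
    return p

x = [1, 2, 1, 6, 6]
parses = {"".join(pi): joint(x, pi) for pi in product("FL", repeat=5)}
p_x = sum(parses.values())          # P(x), summed over all 32 parses
best = max(parses, key=parses.get)  # the single most probable parse
```

Sorting `parses` by probability reproduces the "best 8 paths" list, with LLLLL first and FFFFF second.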
The Forward Algorithm
We want to calculate P(x) = the probability of x, given the HMM (x = x_1, ..., x_N).
Sum over all possible ways of generating x:
P(x) = Σ_{all paths π} P(x, π)
To avoid summing over an exponential number of paths π, define
f_k(i) = P(x_1, ..., x_i, π_i = k)   (the forward probability)

The Forward Algorithm: derivation
f_k(i) = P(x_1 ... x_i, π_i = k)
= Σ_{π_1...π_{i-1}} P(x_1 ... x_{i-1}, π_1, ..., π_{i-1}, π_i = k) e_k(x_i)
= Σ_r Σ_{π_1...π_{i-2}} P(x_1 ... x_{i-1}, π_1, ..., π_{i-2}, π_{i-1} = r) a_rk e_k(x_i)
= Σ_r P(x_1 ... x_{i-1}, π_{i-1} = r) a_rk e_k(x_i)
= e_k(x_i) Σ_r f_r(i-1) a_rk

The Forward Algorithm
A dynamic programming algorithm:
f_0(0) = 1; f_k(0) = 0, for all k > 0
f_k(i) = e_k(x_i) Σ_r f_r(i-1) a_rk
P(x) = Σ_k f_k(N)
If our model has an end state:
P(x) = Σ_k f_k(N) a_k0
where a_k0 is the probability that the terminating state is k (usually = a_0k).

Relation between Forward and Viterbi
VITERBI: V_0(0) = 1; V_k(0) = 0 for all k > 0; V_r(i) = e_r(x_i) max_k V_k(i-1) a_kr; P(x, π*) = max_k V_k(N)
FORWARD: f_0(0) = 1; f_k(0) = 0 for all k > 0; f_r(i) = e_r(x_i) Σ_k f_k(i-1) a_kr; P(x) = Σ_k f_k(N)

The most probable state
Given a sequence x, what is the most likely state that emitted x_i? In other words, we want to compute P(π_i = k | x).
Example: the dishonest casino. Say x = 34636663646634634 (a run of rolls rich in 6s). The most likely path may still be π = FFF...F; however, the marked letters in the 6-rich stretch are individually more likely to be L.
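The forward recursion above is the Viterbi recursion with max replaced by a sum. A minimal sketch (our own function layout), run on the casino model with the slides' 0.95/0.05 and 0.9/0.1 transitions:

```python
def forward(x, states, init, trans, emit):
    """f_k(i) = e_k(x_i) * sum_r f_r(i-1) * a_rk;  P(x) = sum_k f_k(N)."""
    f = [{k: init[k] * emit[k][x[0]] for k in states}]
    for sym in x[1:]:
        f.append({k: emit[k][sym] * sum(f[-1][r] * trans[(r, k)] for r in states)
                  for k in states})
    return f, sum(f[-1].values())

init = {"F": 0.5, "L": 0.5}
trans = {("F", "F"): 0.95, ("F", "L"): 0.05,
         ("L", "L"): 0.90, ("L", "F"): 0.10}
emit = {"F": {r: 1 / 6 for r in range(1, 7)},
        "L": {**{r: 1 / 10 for r in range(1, 6)}, 6: 1 / 2}}

f, px = forward([1, 2, 1, 6, 6], "FL", init, trans, emit)
```

Note that this sums over all 2^5 parses in O(K²N) steps instead of enumerating them.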
Motivation for the Backward Algorithm
We want to compute P(π_i = k | x). We start by computing
P(π_i = k, x) = P(x_1 ... x_i, π_i = k, x_{i+1} ... x_N)
= P(x_1 ... x_i, π_i = k) P(x_{i+1} ... x_N | x_1 ... x_i, π_i = k)
= P(x_1 ... x_i, π_i = k) P(x_{i+1} ... x_N | π_i = k)
Then,
P(π_i = k | x) = P(π_i = k, x) / P(x) = f_k(i) b_k(i) / P(x)

The Backward Algorithm: derivation
Define the backward probability:
b_k(i) = P(x_{i+1} ... x_N | π_i = k)
= Σ_{π_{i+1}...π_N} P(x_{i+1}, x_{i+2}, ..., x_N, π_{i+1}, ..., π_N | π_i = k)
= Σ_r Σ_{π_{i+2}...π_N} P(x_{i+1}, ..., x_N, π_{i+1} = r, π_{i+2}, ..., π_N | π_i = k)
= Σ_r e_r(x_{i+1}) a_kr Σ_{π_{i+2}...π_N} P(x_{i+2}, ..., x_N, π_{i+2}, ..., π_N | π_{i+1} = r)
= Σ_r e_r(x_{i+1}) a_kr b_r(i+1)

The Backward Algorithm
A dynamic programming algorithm for b_k(i):
b_k(N) = 1, for all k
b_k(i) = Σ_r e_r(x_{i+1}) a_kr b_r(i+1)
P(x) = Σ_r a_0r e_r(x_1) b_r(1)
In case of an end state: b_k(N) = a_k0, for all k; the recursion and termination are unchanged.

Computational Complexity
What are the running time and space required for the Forward and Backward algorithms?
Time: O(K²N)   Space: O(KN)

Scaling
A useful implementation technique to avoid underflows: rescale at each position by multiplying by a constant.
Define scaled forward variables and per-position scaling factors s_i:
  f̂_k(i) = f_k(i) / (s_1 s_2 ... s_i)
Recursion:
  f̂_k(i) = (1/s_i) e_k(x_i) Σ_r f̂_r(i-1) a_rk
Choosing the scaling factors s_i such that Σ_k f̂_k(i) = 1 at every position leads to:
  P(x) = ∏_{i=1}^{N} s_i   (in practice, log P(x) = Σ_i log s_i)
Scaling for the backward probabilities uses the same factors: b̂_k(i) = b_k(i) / (s_{i+1} ... s_N).
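A useful sanity check on any forward/backward implementation is the identity Σ_k f_k(i) b_k(i) = P(x) at every position i. The sketch below (our own layout) pairs a backward pass with the forward pass on the casino model:

```python
def forward(x, states, init, trans, emit):
    f = [{k: init[k] * emit[k][x[0]] for k in states}]
    for sym in x[1:]:
        f.append({k: emit[k][sym] * sum(f[-1][r] * trans[(r, k)] for r in states)
                  for k in states})
    return f

def backward(x, states, trans, emit):
    """b_k(i) = sum_r e_r(x_{i+1}) * a_kr * b_r(i+1);  b_k(N) = 1."""
    b = [{k: 1.0 for k in states}]
    for sym in reversed(x[1:]):
        b.insert(0, {k: sum(emit[r][sym] * trans[(k, r)] * b[0][r] for r in states)
                     for k in states})
    return b

init = {"F": 0.5, "L": 0.5}
trans = {("F", "F"): 0.95, ("F", "L"): 0.05,
         ("L", "L"): 0.90, ("L", "F"): 0.10}
emit = {"F": {r: 1 / 6 for r in range(1, 7)},
        "L": {**{r: 1 / 10 for r in range(1, 6)}, 6: 1 / 2}}

x = [1, 2, 1, 6, 6]
f = forward(x, "FL", init, trans, emit)
b = backward(x, "FL", trans, emit)
px = sum(f[-1].values())
```

Both terminations of the slides hold here: P(x) = Σ_k f_k(N) and P(x) = Σ_r a_0r e_r(x_1) b_r(1).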
Posterior Decoding
We can now calculate
P(π_i = k | x) = f_k(i) b_k(i) / P(x)
Then we can ask: what is the most likely state at position i of sequence x? Using posterior decoding we can now define, for each position i:
π̂_i = argmax_k P(π_i = k | x)

Posterior decoding gives us a curve of the probability of each state for each position, given the sequence x. That is sometimes more informative than the Viterbi path π*.
Posterior decoding may give an invalid sequence of states. Why? Because each π̂_i is chosen independently, two consecutive states π̂_i, π̂_{i+1} may have transition probability 0 in the model.

Viterbi vs. posterior decoding
A class takes a multiple choice test. How does the lazy professor construct the answer key?
- Viterbi approach: use the answers of the best student
- Posterior decoding: majority vote

Viterbi, Forward, Backward
VITERBI:  V_0(0) = 1; V_k(0) = 0 for all k > 0; V_r(i) = e_r(x_i) max_k V_k(i-1) a_kr; P(x, π*) = max_k V_k(N)
FORWARD:  f_0(0) = 1; f_k(0) = 0 for all k > 0; f_r(i) = e_r(x_i) Σ_k f_k(i-1) a_kr; P(x) = Σ_k f_k(N) a_k0
BACKWARD: b_k(N) = a_k0 for all k; b_k(i) = Σ_r e_r(x_{i+1}) a_kr b_r(i+1); P(x) = Σ_k a_0k e_k(x_1) b_k(1)

A modeling example: CpG islands in DNA sequences

Methylation & Silencing
- Methylation: addition of CH_3 to C nucleotides
- Silences genes in the region
- CG (denoted CpG) often mutates to TG when methylated
- Methylation is inherited during cell division
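The "best student vs. majority vote" contrast shows up already on the five-roll example: with the symmetric 0.95/0.05 switch probabilities used there, the Viterbi path is LLLLL, but posterior decoding flips position 1 to F. A sketch (our own layout):

```python
init = {"F": 0.5, "L": 0.5}
trans = {("F", "F"): 0.95, ("F", "L"): 0.05,
         ("L", "L"): 0.95, ("L", "F"): 0.05}
emit = {"F": {r: 1 / 6 for r in range(1, 7)},
        "L": {**{r: 1 / 10 for r in range(1, 6)}, 6: 1 / 2}}
states = "FL"
x = [1, 2, 1, 6, 6]

# Forward and backward passes.
f = [{k: init[k] * emit[k][x[0]] for k in states}]
for sym in x[1:]:
    f.append({k: emit[k][sym] * sum(f[-1][r] * trans[(r, k)] for r in states)
              for k in states})
b = [{k: 1.0 for k in states}]
for sym in reversed(x[1:]):
    b.insert(0, {k: sum(emit[r][sym] * trans[(k, r)] * b[0][r] for r in states)
                 for k in states})
px = sum(f[-1].values())

# Posterior P(pi_i = k | x) = f_k(i) b_k(i) / P(x), and the posterior path.
post = [{k: f[i][k] * b[i][k] / px for k in states} for i in range(len(x))]
post_path = "".join(max(states, key=lambda k: post[i][k]) for i in range(len(x)))
# post_path == "FLLLL", while the single best parse is "LLLLL".
```

This reproduces the conditional-probability table from the earlier example (0.503 F at position 1, then L-leaning throughout).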
CpG Islands
CpG nucleotides in the genome are frequently methylated: C → methyl-C → T.
Methylation is often suppressed around genes and promoters: CpG islands.
In CpG islands:
- CG is more frequent
- other dinucleotides (AA, AG, AT, ...) have different frequencies
Problem: detect CpG islands.

A model of CpG islands: architecture
Eight states: A+, C+, G+, T+ (inside an island) and A-, C-, G-, T- (outside).
Emission probabilities are all 1 or 0: each state emits its own nucleotide with probability 1.

A model of CpG islands: transitions
How do we estimate the parameters of the model?
1. Transition probabilities within CpG islands (established from known CpG islands):
+    A     C     G     T
A  .180  .274  .426  .120
C  .171  .368  .274  .188
G  .161  .339  .375  .125
T  .079  .355  .384  .182
2. Transition probabilities within non-CpG islands (established from non-CpG islands):
-    A     C     G     T
A  .300  .205  .285  .210
C  .322  .298  .078  .302
G  .248  .246  .298  .208
T  .177  .239  .292  .292

What about transitions between (+) and (-) states? Their probabilities affect:
- the average length of a CpG island
- the average separation between two CpG islands

Length distribution of a region X with self-transition probability p:
P[L_X = 1] = 1 - p
P[L_X = 2] = p(1 - p)
...
P[L_X = k] = p^{k-1}(1 - p)
E[L_X] = 1/(1 - p): a geometric distribution with mean 1/(1 - p).

There is no reason to favor exiting/entering the (+) and (-) regions at a particular nucleotide. So:
- Estimate the average length L_CpG of a CpG island: E[L] = 1/(1 - p) gives p = 1 - 1/L_CpG
- For each pair of (+) states k, r: a_kr ← p · a+_kr
- For each (+) state k and (-) state r: a_kr = (1 - p) a_0(r-)
- Do the same for the (-) states.

A problem with this model: CpG islands don't have a geometric length distribution. This is a defect of HMMs, a price we pay for ease of analysis and efficient computation.
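The geometric duration claim is easy to verify numerically. A small sketch (the island length 50 is an illustrative value of ours, chosen via the slide's rule p = 1 - 1/L_CpG):

```python
# Geometric state-duration distribution: P[L = k] = p**(k-1) * (1-p).
L_CPG = 50                # assumed average island length (illustrative)
p = 1 - 1 / L_CPG         # slide's rule: p = 1 - 1/L_CpG  -> 0.98

def dur_pmf(k, p):
    """Probability of staying in a state for exactly k steps."""
    return p ** (k - 1) * (1 - p)

# Truncate the infinite sum far into the tail; the remainder is negligible.
ks = range(1, 20000)
total = sum(dur_pmf(k, p) for k in ks)
mean = sum(k * dur_pmf(k, p) for k in ks)
```

The computed mean matches E[L] = 1/(1 - p) = 50, confirming that a single self-looping state can only produce geometric (monotonically decreasing) length distributions.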
Using the model
Given a DNA sequence x, the Viterbi algorithm predicts locations of CpG islands: for a given nucleotide x_i (say x_i = A), the Viterbi parse tells whether x_i is in a CpG island in the most likely parse.
Using the Forward/Backward algorithms we can instead calculate
P(x_i is in a CpG island) = P(π_i = A+ | x)
Posterior decoding can assign locally optimal predictions of CpG islands:
π̂_i = argmax_k P(π_i = k | x)

[Figures: results of applying posterior decoding and Viterbi decoding to a part of a human chromosome, compared with sliding-window predictions of window sizes 100 and 200.]
[Figures: sliding-window predictions with window sizes 300, 400, 600, and 1000.]

[Figure: modeling CpG islands with silent states.]

What if a new genome comes?
Suppose we just sequenced the porcupine genome. We know CpG islands play the same role in this genome. However, we have no known CpG islands for porcupines, and we suspect the frequency and characteristics of CpG islands are quite different in porcupines.
How do we adjust the parameters in our model? LEARNING.
Two learning scenarios
1. Estimation when the right answer is known. Examples:
- GIVEN: a genomic region x = x_1 ... x_N where we have good (experimental) annotations of the CpG islands
- GIVEN: the casino player allows us to observe him one evening as he changes the dice and produces 10,000 rolls
2. Estimation when the right answer is unknown. Examples:
- GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition
- GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice
GOAL: update the parameters θ of the model to maximize P(x | θ).

When the right answer is known
Given x = x_1 ... x_N for which π = π_1 ... π_N is known, define:
A_kr = number of times the k → r transition occurs in π
E_k(b) = number of times state k in π emits b in x
The maximum likelihood estimates of the parameters are:
a_kr = A_kr / Σ_i A_ki        e_k(b) = E_k(b) / Σ_c E_k(c)

Intuition: when we know the underlying states, the best estimate is the average frequency of transitions and emissions that occur in the training data.
Drawback: given little data, there may be overfitting: P(x | θ) is maximized, but θ is unreasonable. Zero probabilities are VERY BAD.
Example: suppose we observe 20 rolls parsed as π = F F F F F F F F F F F F F F F L L L L L. Then:
a_FF = 14/15, a_FL = 1/15, a_LL = 1, a_LF = 0
and any face never observed in a state gets emission probability exactly 0 (e.g. e_L(4) = 0 if no 4 was rolled with the loaded die). The model can then never leave the Loaded state, nor emit those faces, again.

Pseudocounts
Solution for small training sets: add pseudocounts.
A_kr = (# of times the k → r transition occurs in π) + t_kr
E_k(b) = (# of times state k emits symbol b in x) + t_k(b)
t_kr, t_k(b) are pseudocounts representing our prior belief.
Larger pseudocounts: strong prior belief. Small pseudocounts (ε < 1): just to avoid 0 probabilities.
a_kr = (A_kr + t_kr) / Σ_i (A_ki + t_ki)        e_k(b) = (E_k(b) + t_k(b)) / Σ_c (E_k(c) + t_k(c))

Example: dishonest casino. We will observe the player for one day, 600 rolls. Reasonable pseudocounts:
t_0F = t_0L = t_F0 = t_L0 = 1; t_FL = t_LF = t_FF = t_LL = 1;
t_F(1) = t_F(2) = ... = t_F(6) = some large value (strong belief that fair is fair);
t_L(1) = t_L(2) = ... = t_L(6) = 5 (wait and see for loaded).
The numbers above are somewhat arbitrary: assigning priors is an art.
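The counting estimates and the pseudocount fix can be sketched in a few lines. The roll values below are made up for illustration (the point is the 15-Fair/5-Loaded parse and the resulting zero probabilities); the function layout is ours.

```python
from collections import Counter

def estimate(x, path, states, symbols, t_trans=0.0, t_emit=0.0):
    """ML estimates from a labeled sequence, with optional pseudocounts:
    a_kr = (A_kr + t) / sum_i (A_ki + t),  e_k(b) = (E_k(b) + t) / sum_c (E_k(c) + t)."""
    A = Counter(zip(path, path[1:]))   # transition counts A_kr
    E = Counter(zip(path, x))          # emission counts E_k(b)
    a = {(k, r): (A[(k, r)] + t_trans) /
                 (sum(A[(k, i)] for i in states) + t_trans * len(states))
         for k in states for r in states}
    e = {k: {b: (E[(k, b)] + t_emit) /
                (sum(E[(k, c)] for c in symbols) + t_emit * len(symbols))
             for b in symbols} for k in states}
    return a, e

# 20 rolls parsed as 15 Fair then 5 Loaded; no 4 appears in the Loaded stretch.
x = [1, 2, 5, 6, 1, 2, 3, 6, 2, 3, 5, 1, 3, 2, 4, 6, 1, 2, 3, 6]
path = "F" * 15 + "L" * 5

a, e = estimate(x, path, "FL", range(1, 7))
# a[("F","F")] == 14/15, a[("L","F")] == 0, e["L"][4] == 0  -> zero probabilities!
a2, e2 = estimate(x, path, "FL", range(1, 7), t_trans=1, t_emit=1)
# With pseudocounts, every transition and emission stays strictly positive.
```

Note how a single pseudocount is enough to keep the Loaded state escapable while barely perturbing the well-supported estimates.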
When the right answer is unknown
We don't know the actual A_kr, E_k(b).
Idea:
1. Initialize the model
2. Compute A_kr, E_k(b)
3. Update the parameters of the model, based on A_kr, E_k(b)
4. Repeat until convergence
Two algorithms: Baum-Welch, Viterbi training.

Given x = x_1 ... x_N for which the true π = π_1 ... π_N is unknown: EXPECTATION MAXIMIZATION (EM)
1. Pick model parameters θ
2. Estimate A_kr, E_k(b)  (Expectation)
3. Update θ according to A_kr, E_k(b)  (Maximization)
4. Repeat 2 & 3, until convergence

Estimating new parameters
To estimate A_kr: at each position i of sequence x, find the probability that the transition k → r is used:
P(π_i = k, π_{i+1} = r | x) = P(π_i = k, π_{i+1} = r, x_1 ... x_N) / P(x) = Q / P(x)
where
Q = P(x_1 ... x_i, π_i = k, π_{i+1} = r, x_{i+1} ... x_N)
= P(π_{i+1} = r, x_{i+1} ... x_N | π_i = k) P(x_1 ... x_i, π_i = k)
= P(π_{i+1} = r, x_{i+1}, x_{i+2} ... x_N | π_i = k) f_k(i)
= P(x_{i+2} ... x_N | π_{i+1} = r) P(x_{i+1} | π_{i+1} = r) P(π_{i+1} = r | π_i = k) f_k(i)
= b_r(i+1) e_r(x_{i+1}) a_kr f_k(i)
So:
P(π_i = k, π_{i+1} = r | x, θ) = f_k(i) a_kr e_r(x_{i+1}) b_r(i+1) / P(x | θ)

Summing over all positions gives the expected number of times the transition is used:
A_kr = Σ_i P(π_i = k, π_{i+1} = r | x, θ) = Σ_i f_k(i) a_kr e_r(x_{i+1}) b_r(i+1) / P(x | θ)

Emission probabilities:
E_k(b) = Σ_{i : x_i = b} f_k(i) b_k(i) / P(x | θ)
When you have multiple sequences: sum over them.

The Baum-Welch Algorithm
Pick a best-guess for the model parameters (or random). Then iterate:
1. Forward
2. Backward
3. Calculate A_kr, E_k(b), plus pseudocounts
4. Calculate the new model parameters a_kr, e_k(b)
5. Calculate the new log-likelihood P(x | θ)
The likelihood is guaranteed to increase (EM). Repeat until P(x | θ) does not change much.
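The E-step formulas above fit in a short function. This is a sketch of one Baum-Welch iteration for the two-state casino model (the data, the initial guesses, and the function layout are our own; a real implementation would also rescale to avoid underflow on long sequences):

```python
def forward(x, st, init, a, e):
    f = [{k: init[k] * e[k][x[0]] for k in st}]
    for s in x[1:]:
        f.append({k: e[k][s] * sum(f[-1][r] * a[(r, k)] for r in st) for k in st})
    return f

def backward(x, st, a, e):
    b = [{k: 1.0 for k in st}]
    for s in reversed(x[1:]):
        b.insert(0, {k: sum(e[r][s] * a[(k, r)] * b[0][r] for r in st) for k in st})
    return b

def baum_welch_step(x, st, syms, init, a, e):
    """One EM step: A_kr = sum_i f_k(i) a_kr e_r(x_{i+1}) b_r(i+1) / P(x),
    E_k(b) = sum_{i: x_i = b} f_k(i) b_k(i) / P(x); then renormalize."""
    f, b = forward(x, st, init, a, e), backward(x, st, a, e)
    px = sum(f[-1].values())
    A = {(k, r): sum(f[i][k] * a[(k, r)] * e[r][x[i + 1]] * b[i + 1][r]
                     for i in range(len(x) - 1)) / px
         for k in st for r in st}
    E = {k: {s: sum(f[i][k] * b[i][k] for i in range(len(x)) if x[i] == s) / px
             for s in syms} for k in st}
    a_new = {(k, r): A[(k, r)] / sum(A[(k, i)] for i in st) for k in st for r in st}
    e_new = {k: {s: E[k][s] / sum(E[k].values()) for s in syms} for k in st}
    return a_new, e_new, px

# Toy run on a short roll sequence, starting from rough initial guesses.
x = [1, 2, 1, 5, 6, 2, 1, 6, 2, 4, 6, 6, 3, 6, 6, 1, 6, 6, 2, 6]
init = {"F": 0.5, "L": 0.5}
a = {("F", "F"): 0.8, ("F", "L"): 0.2, ("L", "L"): 0.8, ("L", "F"): 0.2}
e = {"F": {s: 1 / 6 for s in range(1, 7)},
     "L": {**{s: 0.08 for s in range(1, 6)}, 6: 0.6}}
likes = []
for _ in range(10):
    a, e, px = baum_welch_step(x, "FL", range(1, 7), init, a, e)
    likes.append(px)  # P(x | theta) of the parameters before each update
```

The recorded likelihoods are non-decreasing, which is exactly the EM guarantee quoted above.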
The Baum-Welch Algorithm
Time complexity: (# iterations) × O(K²N).
Guaranteed to increase the log likelihood P(x | θ), but not guaranteed to find the globally best parameters: it converges to a local optimum. Too many parameters / too large a model: overfitting.

Alternative: Viterbi Training
Same idea:
1. Perform Viterbi, to find π*
2. Calculate A_kr, E_k(b) according to π*, plus pseudocounts
3. Calculate the new parameters a_kr, e_k(b)
Repeat until convergence.
Comments:
- Convergence is guaranteed. Why? Each iteration can only increase P(x, π* | θ), and there are finitely many parses.
- Does not maximize P(x | θ); it maximizes P(x | θ, π*)
- In general, worse performance than Baum-Welch

HMM variants: higher-order HMMs
How do we model memory of more than one time step?
First-order HMM: P(π_{i+1} = r | π_i = k)   (a_kr)
Second-order HMM: P(π_{i+1} = r | π_i = k, π_{i-1} = j)   (a_jkr)
A second-order HMM over K states is equivalent to a first-order HMM over K² states (pairs of states). For a two-state heads/tails example, state H with a_HT(prev = H) and a_HT(prev = T), and state T with a_TH(prev = H) and a_TH(prev = T), become the four states HH, HT, TH, TT with transitions a_HHT, a_HTH, a_HTT, a_THH, a_THT, a_TTH, ...
Note that not all transitions are allowed: from pair state HT, only TH and TT are reachable.

Modeling the Duration of States
Length distribution of a region X: P[L_X = k] = p^{k-1}(1 - p), E[L_X] = 1/(1 - p): geometric, with mean 1/(1 - p).
This is a significant disadvantage of HMMs. Several solutions exist for modeling different length distributions.

Solution 1: chain several states
Chain C copies of the state. Disadvantage: still very inflexible: L_X = C + geometric with mean 1/(1 - p).
Solution 2: negative binomial distribution
Use n copies of state X, each with self-transition probability p. The duration in X is m if, during the first m - 1 turns, exactly n - 1 of the arrows to the next state are followed, and in the m-th turn an arrow to the next state is followed:
P(L_X = m) = C(m-1, n-1) (1 - p)^n p^{m-n}

Example: genes in prokaryotes
EasyGene, a prokaryotic gene-finder (Larsen TS, Krogh A): codons are modeled with 3 looped triplets of states (negative binomial with n = 3).

Solution 3: duration modeling
Upon entering a state:
1. Choose a duration d, according to a probability distribution
2. Generate d letters according to the emission probabilities
3. Take a transition to the next state according to the transition probabilities
Disadvantage: increase in complexity. Time grows by a factor of D² and space by a factor of D, where D = maximum duration of a state.
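The negative binomial duration formula can be checked numerically: chaining n self-looping states gives a duration that is the sum of n independent geometrics, so the mean should be n/(1 - p). A sketch with illustrative values n = 3, p = 0.9 (our choices):

```python
from math import comb

def negbin_duration_pmf(m, n, p):
    """P(L = m) for n chained copies of a state with self-loop probability p:
    C(m-1, n-1) * (1-p)**n * p**(m-n), defined for m >= n."""
    return comb(m - 1, n - 1) * (1 - p) ** n * p ** (m - n)

n, p = 3, 0.9
ms = range(n, 3000)  # truncate far into the (negligible) tail
probs = [negbin_duration_pmf(m, n, p) for m in ms]
total = sum(probs)
mean = sum(m * q for m, q in zip(ms, probs))
```

With these values the mean duration is 3 / (1 - 0.9) = 30 positions, and unlike the single-state geometric case, the distribution has an interior mode rather than being monotonically decreasing.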