An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info

Hidden Markov Models

Slides revised and adapted for Computational Biology, IST 2015/2016
Ana Teresa Freitas

Three classic HMM problems

Evaluation
Given a model and an output sequence, what is the probability that the model generated that output?

Decoding
Given a model and an output sequence, what is the most likely state sequence (path) through the model that produced that sequence?

Learning
Given a model and a set of observed sequences, what should the model parameters be so that the model has a high probability of generating those sequences?
Evaluation Problem

HMM parameters

Σ: set of emission characters.
Ex.: Σ = {H, T} or {1, 0} for coin tossing; Σ = {A, C, T, G} for DNA

Q: set of hidden states, each emitting symbols from Σ.
Q = {F, B} for coin tossing (F = fair coin, B = biased coin)

HMM parameters (cont.)

A = (a_{kl}): a |Q| x |Q| matrix of the probability of changing from state k to state l.
a_{FF} = 0.9   a_{FB} = 0.1
a_{BF} = 0.1   a_{BB} = 0.9

E = (e_k(b)): a |Q| x |Σ| matrix of the probability of emitting symbol b while being in state k.
e_F(0) = ½   e_B(0) = ¼
e_F(1) = ½   e_B(1) = ¾
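For concreteness, these parameters can be written down directly in code. A minimal Python sketch; the variable names (states, alphabet, trans, emit) and the "begin" convention for the initial probabilities are illustrative choices, not part of the slides:

```python
# The coin-tossing HMM above, encoded as plain dictionaries.
states = ["F", "B"]            # F = fair coin, B = biased coin
alphabet = ["0", "1"]

trans = {                      # transition probabilities a_kl
    ("begin", "F"): 0.5, ("begin", "B"): 0.5,   # assumed uniform start
    ("F", "F"): 0.9, ("F", "B"): 0.1,
    ("B", "F"): 0.1, ("B", "B"): 0.9,
}

emit = {                       # emission probabilities e_k(b)
    ("F", "0"): 0.50, ("F", "1"): 0.50,
    ("B", "0"): 0.25, ("B", "1"): 0.75,
}
```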
P(x, π) Calculation

P(x, π): probability that sequence x was generated by the path π:

P(x, π) = P(π_0 → π_1) · ∏_{i=1}^{n} P(x_i | π_i) · P(π_i → π_{i+1})
        = a_{π_0, π_1} · ∏_{i=1}^{n} e_{π_i}(x_i) · a_{π_i, π_{i+1}}

π_0 and π_{n+1} represent the fictitious initial and terminal states (begin and end).

Decoding Problem

Goal: Find an optimal hidden path of states given observations.
Input: Sequence of observations x = x_1...x_n generated by an HMM M(Σ, Q, A, E).
Output: A path that maximizes P(x | π) over all possible paths π.
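The formula translates directly into code. A sketch, assuming the trans dictionary from the previous sketch stores the initial probabilities under a hypothetical "begin" key, and taking the final end transition to be 1 since no explicit end state is modelled:

```python
def joint_prob(x, path, trans, emit, begin="begin"):
    """P(x, path): probability that x was emitted along the given path."""
    p = trans[(begin, path[0])]                   # a_{pi_0, pi_1}
    for i, (symbol, state) in enumerate(zip(x, path)):
        p *= emit[(state, symbol)]                # e_{pi_i}(x_i)
        if i + 1 < len(path):
            p *= trans[(state, path[i + 1])]      # a_{pi_i, pi_{i+1}}
    return p
```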
Edit Graph for Decoding Problem

Building Manhattan for the Decoding Problem

Andrew Viterbi used the Manhattan grid model to solve the Decoding Problem.
Every choice of π = π_1...π_n corresponds to a path in the graph.
The only valid direction in the graph is eastward.
This graph has |Q|²(n − 1) edges.
Decoding Problem vs. Alignment Problem

Valid directions in the alignment problem.
Valid directions in the decoding problem.

Decoding Problem as Finding a Longest Path in a DAG

The Decoding Problem is reduced to finding a longest path in the directed acyclic graph (DAG) above.
Note: the length of a path is defined as the product of its edge weights, not their sum.
Decoding Problem (cont'd)

Every path in the graph has probability P(x | π).
The Viterbi algorithm finds the path that maximizes P(x | π) among all possible paths.
The Viterbi algorithm runs in O(n · |Q|²) time.

Decoding Problem: weights of edges

w_{(k,i) → (l,i+1)}

The weight w is given by: ???
Decoding Problem: weights of edges

P(x | π) = ∏_{i=0}^{n-1} e_{π_{i+1}}(x_{i+1}) · a_{π_i, π_{i+1}}

w_{(k,i) → (l,i+1)}

The weight w is given by: ??

Decoding Problem: weights of edges

i-th term = e_{π_{i+1}}(x_{i+1}) · a_{π_i, π_{i+1}}

w_{(k,i) → (l,i+1)}

The weight w is given by: ?
Decoding Problem: weights of edges

i-th term = e_l(x_{i+1}) · a_{kl} for π_i = k, π_{i+1} = l

w_{(k,i) → (l,i+1)}

The weight w = e_l(x_{i+1}) · a_{kl}

Decoding Problem and Dynamic Programming

s_{k,i} is the probability of the most probable path for the prefix (x_1, ..., x_i) ending in state k.

s_{l,i+1} = max_{k ∈ Q} { s_{k,i} · (weight of the edge between (k,i) and (l,i+1)) }
          = max_{k ∈ Q} { s_{k,i} · a_{kl} · e_l(x_{i+1}) }
          = e_l(x_{i+1}) · max_{k ∈ Q} { s_{k,i} · a_{kl} }
Decoding Problem (cont'd)

Initialization:
s_{begin,0} = 1
s_{k,0} = 0 for k ≠ begin

For each i = 0, ..., L−1 and for each l ∈ Q, recursively calculate:
s_{l,i+1} = e_l(x_{i+1}) · max_{k ∈ Q} { s_{k,i} · a_{kl} }

Let π* be the optimal path. Then,
P(x | π*) = max_{k ∈ Q} { s_{k,n} · a_{k,end} }

Viterbi Algorithm

The value of the product can become extremely small, which leads to underflow.
To avoid underflow, use log values instead:
s_{l,i+1} = log e_l(x_{i+1}) + max_{k ∈ Q} { s_{k,i} + log(a_{kl}) }

Initialization:
s_{begin,0} = 0
s_{k,0} = −∞ for k ≠ begin

The score of the best path:
Score(x, π*) = max_{k ∈ Q} { s_{k,n} + log(a_{k,end}) }
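A log-space implementation of this recurrence, as a sketch: it folds the begin transition into column 0 and defaults a_{k,end} to 1 when no explicit end state is modelled (both are implementation choices, not from the slides).

```python
import math

def viterbi(x, states, trans, emit, begin="begin", end="end"):
    """Log-space Viterbi: returns the most probable path and its log score."""
    n = len(x)
    # Initialization: begin transition folded into column 0.
    s = {(k, 0): math.log(trans[(begin, k)]) + math.log(emit[(k, x[0])])
         for k in states}
    back = {}
    for i in range(1, n):
        for l in states:
            best = max(states, key=lambda k: s[(k, i-1)] + math.log(trans[(k, l)]))
            back[(l, i)] = best
            s[(l, i)] = (s[(best, i-1)] + math.log(trans[(best, l)])
                         + math.log(emit[(l, x[i])]))
    # Termination and traceback.
    last = max(states, key=lambda k: s[(k, n-1)] + math.log(trans.get((k, end), 1.0)))
    path = [last]
    for i in range(n - 1, 0, -1):
        path.append(back[(path[-1], i)])
    return path[::-1], s[(last, n-1)] + math.log(trans.get((last, end), 1.0))
```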
Viterbi Algorithm

Complexity:
We calculate the values of O(|Q| · L) cells of the matrix V.
We spend O(|Q|) operations per cell.
Overall time complexity: O(L · |Q|²)
Space complexity: O(L · |Q|)

Example

Consider an HMM with two hidden states (S1 and S2). The initial probabilities of S1 and S2 are both 0.5, and the transition probabilities are:
a_{S1,S1} = 0.6; a_{S2,S2} = 0.5; a_{S1,S2} = 0.4; a_{S2,S1} = 0.5.
Nucleotides T, C, G, A are emitted with probabilities e_A = e_T = 0.3 and e_C = e_G = 0.2 from state S1, and with probabilities e_A = e_T = 0.2 and e_C = e_G = 0.3 from state S2.
Problem 1 - Evaluation

Given the HMM described in the example and the output sequence x = GGCACTGAA, what is the probability that the HMM generated x using the following path (sequence of hidden states)?
π = S1 S1 S2 S2 S2 S2 S1 S1 S1

Problem 2 - Decoding

Use the Viterbi algorithm to compute the most likely path generating the sequence x = GGCACTGAA with the HMM in the example.
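Using the hypothetical joint_prob and viterbi sketches from earlier, both problems can be set up as follows; the dictionary encoding of the example HMM is an assumption of those sketches:

```python
# The example HMM from the previous slide.
states = ["S1", "S2"]
trans = {
    ("begin", "S1"): 0.5, ("begin", "S2"): 0.5,
    ("S1", "S1"): 0.6, ("S1", "S2"): 0.4,
    ("S2", "S1"): 0.5, ("S2", "S2"): 0.5,
}
emit = {
    ("S1", "A"): 0.3, ("S1", "T"): 0.3, ("S1", "C"): 0.2, ("S1", "G"): 0.2,
    ("S2", "A"): 0.2, ("S2", "T"): 0.2, ("S2", "C"): 0.3, ("S2", "G"): 0.3,
}

x = "GGCACTGAA"
# Problem 1: P(x, pi) for the given path.
pi = ["S1", "S1", "S2", "S2", "S2", "S2", "S1", "S1", "S1"]
print(joint_prob(x, pi, trans, emit))
# Problem 2: most likely path via Viterbi.
print(viterbi(x, states, trans, emit))
```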
Forward Algorithm

For Markov chains we can calculate the probability of a sequence, P(x).
How can we calculate this probability for an HMM as well?
We must add the probabilities over all possible paths:

P(x) = Σ_π P(x, π)

Forward Algorithm

Define f_{k,i} (forward probability) as the probability of emitting the prefix x_1...x_i and eventually reaching the state π_i = k:

f_{k,i} = P(x_1...x_i, π_i = k)

The recurrence for the forward algorithm:

f_{l,i+1} = e_l(x_{i+1}) · Σ_{k ∈ Q} f_{k,i} · a_{kl}
Forward algorithm

Initialization:
f_{begin,0} = 1
f_{k,0} = 0 for k ≠ begin

For each i = 1, ..., L calculate:
f_{l,i} = e_l(x_i) · Σ_{k ∈ Q} f_{k,i-1} · a_{kl}

Termination:
P(x) = Σ_{k ∈ Q} f_{k,L} · a_{k,end}

Example (repeated from above): the two-state HMM over S1 and S2 defined earlier.
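A sketch of this forward recursion under the same conventions as the earlier sketches (initial probabilities under a "begin" key, a_{k,end} defaulting to 1); it returns both the table f and P(x):

```python
def forward(x, states, trans, emit, begin="begin", end="end"):
    """Forward algorithm: returns the table f and P(x) = sum over paths."""
    n = len(x)
    f = {(k, 0): trans[(begin, k)] * emit[(k, x[0])] for k in states}
    for i in range(1, n):
        for l in states:
            f[(l, i)] = emit[(l, x[i])] * sum(f[(k, i-1)] * trans[(k, l)]
                                              for k in states)
    px = sum(f[(k, n-1)] * trans.get((k, end), 1.0) for k in states)
    return f, px
```

For Problem 3 below, `forward("GGC", states, trans, emit)[1]` would give P(x) for the example HMM, under the same dictionary encoding as before.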
Problem 3 - Compute P(x)

Use the Forward algorithm to compute the probability of the sequence x, P(x), using the HMM in the example.
x = GGC

Forward-Backward Problem

Given: a sequence of coin tosses generated by an HMM.
Goal: find the probability that the dealer was using a biased coin at a particular time.
Backward Algorithm

However, the forward probability is not the only factor affecting P(π_i = k | x).
The sequence of transitions and emissions that the HMM undergoes between π_{i+1} and π_n also affects P(π_i = k | x).

[figure: the sequence split at x_i into a forward part and a backward part]

Backward Algorithm (cont'd)

Define the backward probability b_{k,i} as the probability of being in state π_i = k and emitting the suffix x_{i+1}...x_n:

b_{k,i} = P(x_{i+1}...x_n | π_i = k)

The recurrence for the backward algorithm:

b_{k,i} = Σ_{l ∈ Q} e_l(x_{i+1}) · b_{l,i+1} · a_{kl}
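A sketch of the backward recursion, mirroring the forward sketch above (a_{k,end} again defaults to 1):

```python
def backward(x, states, trans, emit, end="end"):
    """Backward algorithm: b[(k, i)] = P(x_{i+1}..x_n | pi_i = k)."""
    n = len(x)
    b = {(k, n-1): trans.get((k, end), 1.0) for k in states}
    for i in range(n - 2, -1, -1):
        for k in states:
            b[(k, i)] = sum(emit[(l, x[i+1])] * b[(l, i+1)] * trans[(k, l)]
                            for l in states)
    return b
```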
Backward-Forward Algorithm

The probability that the dealer used a biased coin at moment i:

P(π_i = k | x) = P(x, π_i = k) / P(x) = f_k(i) · b_k(i) / P(x)

P(x) is the sum of P(x, π_i = k) over all k.

Example

Consider an HMM with two hidden states, S1 and S2. Consider also that the starting probabilities are 0.4 for S1 and 0.6 for S2, and that the transition (T) and emission (E) probability matrices are the following:

[T and E matrices shown on the slide]

Compute the most likely path generating the sequence x = TAC.
What is the probability of generating the sequence x = TACG and being in state S2 when generating the symbol C?
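Combining the two tables gives the posterior probabilities. A sketch using the hypothetical forward and backward helpers above:

```python
def posterior(x, states, trans, emit):
    """Posterior decoding: P(pi_i = k | x) = f_k(i) * b_k(i) / P(x)."""
    fwd, px = forward(x, states, trans, emit)
    bwd = backward(x, states, trans, emit)
    return {(k, i): fwd[(k, i)] * bwd[(k, i)] / px
            for k in states for i in range(len(x))}
```

Note that the last question above asks for the joint probability P(x, π_i = S2), i.e. f_{S2}(i) · b_{S2}(i) without dividing by P(x), with i the position of C in TACG.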
HMM Parameter Estimation

So far, we have assumed that the transition and emission probabilities are known.
However, in most HMM applications the probabilities are not known, and they are very hard to estimate.

HMM Parameter Estimation Problem

Given:
- an HMM with states and alphabet (emission characters)
- independent training sequences x^1, ..., x^m

Find HMM parameters Θ (that is, a_{kl}, e_k(b)) that maximize P(x^1, ..., x^m | Θ), the joint probability of the training sequences.
Maximize the likelihood

P(x^1, ..., x^m | Θ) as a function of Θ is called the likelihood of the model.
The training sequences are assumed independent; therefore

P(x^1, ..., x^m | Θ) = ∏_i P(x^i | Θ)

The parameter estimation problem seeks the Θ that realizes

max_Θ ∏_i P(x^i | Θ)

In practice the log likelihood is computed to avoid underflow errors.

Two situations

Known paths for the training sequences:
- CpG islands marked on the training sequences
- One evening the casino dealer allows us to see when he changes dice

Unknown paths:
- CpG islands are not marked
- We do not see when the casino dealer changes dice
Known paths

A_{kl} = number of times the transition k → l is taken in the training sequences
E_k(b) = number of times b is emitted from state k in the training sequences

Compute a_{kl} and e_k(b) as maximum likelihood estimators:

a_{kl} = A_{kl} / Σ_{l'} A_{kl'}
e_k(b) = E_k(b) / Σ_{b'} E_k(b')

Pseudocounts

Some state k may not appear in any of the training sequences. This means A_{kl} = 0 for every state l, and a_{kl} cannot be computed with the given equation.
To avoid this overfitting, use predetermined pseudocounts r_{kl} and r_k(b):

A_{kl} = number of transitions k → l + r_{kl}
E_k(b) = number of emissions of b from k + r_k(b)

The pseudocounts reflect our prior biases about the probability values.
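A sketch of this estimation with pseudocounts; the uniform pseudocounts r_t and r_e are an illustrative choice, and the (sequence, path) input format is an assumption:

```python
from collections import defaultdict

def estimate_known_paths(seqs_with_paths, states, alphabet, r_t=1.0, r_e=1.0):
    """ML estimation of a_kl and e_k(b) from sequences with known paths.
    seqs_with_paths: list of (x, path) pairs with len(x) == len(path)."""
    A = defaultdict(lambda: r_t)   # A_kl, initialized to the pseudocount
    E = defaultdict(lambda: r_e)   # E_k(b), likewise
    for x, path in seqs_with_paths:
        for i, (symbol, k) in enumerate(zip(x, path)):
            E[(k, symbol)] += 1
            if i + 1 < len(path):
                A[(k, path[i + 1])] += 1
    trans = {(k, l): A[(k, l)] / sum(A[(k, l2)] for l2 in states)
             for k in states for l in states}
    emit = {(k, b): E[(k, b)] / sum(E[(k, b2)] for b2 in alphabet)
            for k in states for b in alphabet}
    return trans, emit
```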
Unknown paths: Viterbi training

Idea: use Viterbi decoding to compute the most probable path for each training sequence x.
Start with some guess for the initial parameters and compute π*, the most probable path for x, under those parameters.
Iterate until there is no change in π* (a sketch of this loop follows the analysis below):
1. Determine A_{kl} and E_k(b) as before
2. Compute the new parameters a_{kl} and e_k(b) using the same formulas as before
3. Compute a new π* for x under the current parameters

Viterbi training analysis

- The algorithm converges, precisely because:
  there are finitely many possible paths, and
  the new parameters are uniquely determined by the current π*.
  There may be several paths for x with the same probability, hence one must compare the new π* with all previous paths having the highest probability.
- It does not maximize the likelihood ∏_x P(x | Θ), but the contribution to the likelihood of the most probable path, ∏_x P(x | Θ, π*).
- In general it performs less well than Baum-Welch.
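The promised sketch of the training loop, built on the hypothetical viterbi and estimate_known_paths helpers above (the begin entries of the initial guess are kept, since the estimator only re-fits state-to-state transitions):

```python
def viterbi_training(seqs, states, alphabet, trans, emit, max_iter=100):
    """Viterbi training: iterate decode / re-estimate until pi* stabilizes."""
    paths = None
    for _ in range(max_iter):
        new_paths = [viterbi(x, states, trans, emit)[0] for x in seqs]
        if new_paths == paths:          # stopping criterion: pi* unchanged
            break
        paths = new_paths
        trans_update, emit = estimate_known_paths(
            list(zip(seqs, paths)), states, alphabet)
        trans.update(trans_update)      # keep begin/end entries intact
    return trans, emit
```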
Unknown paths: Baum-Welch

Idea:
1. Guess initial values for the parameters. (art and experience, not science)
2. Estimate new (better) values for the parameters. (how?)
3. Repeat until a stopping criterion is met. (what criterion?)

Better values for the parameters

We would need the A_{kl} and E_k(b) values, but we cannot count them (the path is unknown) and we do not want to use only a most probable path.
For all states k, l, symbols b, and training sequences x, compute A_{kl} and E_k(b) as expected values, given the current parameters.
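These expected counts can be obtained from the forward and backward tables. The slide alludes to this but does not spell it out; the following is a sketch of the standard Baum-Welch E-step, using the hypothetical forward/backward helpers above:

```python
def expected_counts(seqs, states, alphabet, trans, emit):
    """Baum-Welch E-step, accumulated over all training sequences:
    A_kl = sum_x (1/P(x)) sum_i f_k(i) * a_kl * e_l(x_{i+1}) * b_l(i+1)
    E_k(b) = sum_x (1/P(x)) sum_{i: x_i = b} f_k(i) * b_k(i)"""
    A = {(k, l): 0.0 for k in states for l in states}
    E = {(k, b): 0.0 for k in states for b in alphabet}
    for x in seqs:
        fwd, px = forward(x, states, trans, emit)
        bwd = backward(x, states, trans, emit)
        for i in range(len(x)):
            for k in states:
                E[(k, x[i])] += fwd[(k, i)] * bwd[(k, i)] / px
                if i + 1 < len(x):
                    for l in states:
                        A[(k, l)] += (fwd[(k, i)] * trans[(k, l)]
                                      * emit[(l, x[i+1])] * bwd[(l, i+1)] / px)
    return A, E
```

From these, new a_{kl} and e_k(b) follow by the same normalization formulas as in the known-path case, and the iteration stops when the log likelihood no longer improves.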