Hidden Markov Models
Slides revised and adapted to Bioinformática 55 Engª Biomédica/IST 2005
Ana Teresa Freitas

Forward Algorithm
For Markov chains we calculate the probability of a sequence, $P(x)$.
How do we calculate this probability for an HMM as well?
We must add the probabilities of all possible paths $\pi$:
$P(x) = \sum_{\pi} P(x, \pi)$
Forward Algorithm
Define the forward probability $f_{k,i}$ as the probability of emitting the prefix $x_1 \dots x_i$ and reaching the state $\pi_i = k$:
$f_{k,i} = P(x_1 \dots x_i, \pi_i = k)$
The recurrence for the forward algorithm:
$f_{l,i+1} = e_l(x_{i+1}) \sum_{k \in Q} f_{k,i} \, a_{kl}$

Forward algorithm
Initialization: $f_{\mathrm{begin},0} = 1$, $f_{k,0} = 0$ for $k \neq \mathrm{begin}$.
For each $i = 1, \dots, L$ calculate: $f_{l,i} = e_l(x_i) \sum_k f_{k,i-1} \, a_{kl}$
Termination: $P(x) = \sum_k f_{k,L} \, a_{k,\mathrm{end}}$
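The forward recurrence can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the two-state fair/biased coin model and all of its numbers are hypothetical, `start[k]` stands in for $a_{\mathrm{begin},k}$ and `end[k]` for $a_{k,\mathrm{end}}$. A brute-force sum over every path is included only to check the result on a short sequence.

```python
from itertools import product

def forward(x, states, start, trans, emit, end):
    """Return P(x) using the forward recurrence.

    start[k] plays the role of a_{begin,k}; end[k] plays the role of a_{k,end}.
    """
    # Initialization folded into the first column: f_{k,1} = a_{begin,k} * e_k(x_1)
    f = {k: start[k] * emit[k][x[0]] for k in states}
    for symbol in x[1:]:
        # f_{l,i+1} = e_l(x_{i+1}) * sum_k f_{k,i} * a_{kl}
        f = {l: emit[l][symbol] * sum(f[k] * trans[k][l] for k in states)
             for l in states}
    # Termination: P(x) = sum_k f_{k,L} * a_{k,end}
    return sum(f[k] * end[k] for k in states)

def brute_force(x, states, start, trans, emit, end):
    """Sum P(x, pi) over every possible path pi (exponential; for checking only)."""
    total = 0.0
    for path in product(states, repeat=len(x)):
        p = start[path[0]] * emit[path[0]][x[0]]
        for i in range(1, len(x)):
            p *= trans[path[i - 1]][path[i]] * emit[path[i]][x[i]]
        total += p * end[path[-1]]
    return total

# Hypothetical toy model: F = fair coin, B = biased coin.
states = ["F", "B"]
start = {"F": 0.5, "B": 0.5}
trans = {"F": {"F": 0.85, "B": 0.10}, "B": {"F": 0.10, "B": 0.85}}
end = {"F": 0.05, "B": 0.05}
emit = {"F": {"H": 0.5, "T": 0.5}, "B": {"H": 0.75, "T": 0.25}}

x = "HHTHTTHH"
p_forward = forward(x, states, start, trans, emit, end)
p_brute = brute_force(x, states, start, trans, emit, end)
```

The forward version needs on the order of $L \cdot |Q|^2$ multiplications, while the brute-force sum visits $|Q|^L$ paths.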
Forward-Backward Problem
Given: a sequence of coin tosses generated by an HMM.
Goal: find the probability that the dealer was using a biased coin at a particular time.

Backward Algorithm
However, the forward probability is not the only factor affecting $P(\pi_i = k \mid x)$.
The sequence of transitions and emissions that the HMM undergoes between $\pi_{i+1}$ and $\pi_n$ also affects $P(\pi_i = k \mid x)$.
[Figure: the sequence split at position $x_i$ into a forward part and a backward part]
Backward Algorithm (cont'd)
Define the backward probability $b_{k,i}$ as the probability of emitting the suffix $x_{i+1} \dots x_n$ given that the HMM is in state $\pi_i = k$:
$b_{k,i} = P(x_{i+1} \dots x_n \mid \pi_i = k)$
The recurrence for the backward algorithm:
$b_{k,i} = \sum_{l \in Q} e_l(x_{i+1}) \, b_{l,i+1} \, a_{kl}$

Backward-Forward Algorithm
The probability that the dealer used a biased coin at any moment $i$:
$P(\pi_i = k \mid x) = \frac{P(x, \pi_i = k)}{P(x)} = \frac{f_{k,i} \, b_{k,i}}{P(x)}$
$P(x)$ is the sum of $P(x, \pi_i = k)$ over all $k$.
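The posterior $P(\pi_i = k \mid x)$ can be sketched by combining a forward table and a backward table. This is an illustrative sketch under assumptions: the coin model and its numbers are hypothetical, `start[k]` stands in for $a_{\mathrm{begin},k}$, `end[k]` for $a_{k,\mathrm{end}}$, and positions are 0-based in the code.

```python
def forward_table(x, states, start, trans, emit):
    """f[i][k] = P(x_1..x_{i+1}, state at that position = k), 0-based i."""
    f = [{k: start[k] * emit[k][x[0]] for k in states}]
    for symbol in x[1:]:
        prev = f[-1]
        f.append({l: emit[l][symbol] * sum(prev[k] * trans[k][l] for k in states)
                  for l in states})
    return f

def backward_table(x, states, trans, emit, end):
    """b[i][k] = P(rest of x after position i | state at position i = k)."""
    b = [{k: end[k] for k in states}]          # last column: only the move to end
    for symbol in reversed(x[1:]):
        nxt = b[0]
        # b_{k,i} = sum_l a_{kl} * e_l(x_{i+1}) * b_{l,i+1}
        b.insert(0, {k: sum(trans[k][l] * emit[l][symbol] * nxt[l] for l in states)
                     for k in states})
    return b

def posterior(x, states, start, trans, emit, end):
    """Return a list of dicts with P(pi_i = k | x) for every position i."""
    f = forward_table(x, states, start, trans, emit)
    b = backward_table(x, states, trans, emit, end)
    px = sum(f[-1][k] * end[k] for k in states)  # P(x) from the forward values
    return [{k: f[i][k] * b[i][k] / px for k in states} for i in range(len(x))]

# Hypothetical toy model: F = fair coin, B = biased coin.
states = ["F", "B"]
start = {"F": 0.5, "B": 0.5}
trans = {"F": {"F": 0.85, "B": 0.10}, "B": {"F": 0.10, "B": 0.85}}
end = {"F": 0.05, "B": 0.05}
emit = {"F": {"H": 0.5, "T": 0.5}, "B": {"H": 0.75, "T": 0.25}}

x = "HHHHHTTHTT"
post = posterior(x, states, start, trans, emit, end)
biased_at_each_toss = [row["B"] for row in post]
```

At every position the posteriors over the states sum to 1, which is a convenient sanity check on both tables.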
HMM Parameter Estimation
So far, we have assumed that the transition and emission probabilities are known.
However, in most HMM applications, the probabilities are not known.
It's very hard to estimate the probabilities.

HMM Parameter Estimation Problem
Given:
  an HMM with states and alphabet (emission characters)
  independent training sequences $x^1, \dots, x^m$
Find HMM parameters $\Theta$ (that is, $a_{kl}$, $e_k(b)$) that maximize $P(x^1, \dots, x^m \mid \Theta)$, the joint probability of the training sequences.
Maximize the likelihood
$P(x^1, \dots, x^m \mid \Theta)$ as a function of $\Theta$ is called the likelihood of the model.
The training sequences are assumed independent, therefore
$P(x^1, \dots, x^m \mid \Theta) = \prod_i P(x^i \mid \Theta)$
The parameter estimation problem seeks $\Theta$ that realizes
$\max_{\Theta} \prod_i P(x^i \mid \Theta)$
In practice the log likelihood is computed to avoid underflow errors.

Two situations
Known paths for training sequences:
  CpG islands marked on training sequences
  One evening the casino dealer allows us to see when he changes dice
Unknown paths:
  CpG islands are not marked
  We do not see when the casino dealer changes dice
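The remark about underflow on the likelihood slide can be illustrated with a small sketch. The per-sequence probabilities used here are hypothetical values chosen only to trigger the effect in double precision.

```python
import math

def likelihood(probs):
    """Product of the per-sequence probabilities P(x^i | Theta)."""
    result = 1.0
    for p in probs:
        result *= p
    return result

def log_likelihood(probs):
    """Sum of log P(x^i | Theta); the same quantity on a log scale."""
    return sum(math.log(p) for p in probs)

# Hypothetical probabilities of three long training sequences.
seq_probs = [1e-200, 1e-150, 1e-180]

plain = likelihood(seq_probs)       # too small for a double: underflows to 0.0
logged = log_likelihood(seq_probs)  # stays finite
```

The product collapses to zero, so two different parameter sets could no longer be compared, while the log likelihood remains an ordinary finite number.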
Known paths
$A_{kl}$ = # of times each transition $k \to l$ is taken in the training sequences
$E_k(b)$ = # of times $b$ is emitted from state $k$ in the training sequences
Compute $a_{kl}$ and $e_k(b)$ as maximum likelihood estimators:
$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}$
$e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}$

Pseudocounts
Some state $k$ may not appear in any of the training sequences. This means $A_{kl} = 0$ for every state $l$, and $a_{kl}$ cannot be computed with the given equation.
To avoid this overfitting, use predetermined pseudocounts $r_{kl}$ and $r_k(b)$:
$A_{kl}$ = # of transitions $k \to l$ + $r_{kl}$
$E_k(b)$ = # of emissions of $b$ from $k$ + $r_k(b)$
The pseudocounts reflect our prior biases about the probability values.
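The counting estimators with pseudocounts can be sketched as follows. This is a minimal sketch under assumptions: a single pseudocount value is used for all $r_{kl}$ and another for all $r_k(b)$, transitions from begin and to end are ignored, and the labelled training pair is made up.

```python
def estimate(pairs, states, alphabet, r_trans=1.0, r_emit=1.0):
    """Estimate a_kl and e_k(b) from (sequence, path) pairs with known paths."""
    # Start every count at its pseudocount (assumed uniform here).
    A = {k: {l: r_trans for l in states} for k in states}
    E = {k: {b: r_emit for b in alphabet} for k in states}
    for x, path in pairs:
        for symbol, k in zip(x, path):
            E[k][symbol] += 1                      # b emitted from state k
        for k, l in zip(path, path[1:]):
            A[k][l] += 1                           # transition k -> l taken
    # Normalize each row: a_kl = A_kl / sum_l' A_kl', e_k(b) = E_k(b) / sum_b' E_k(b')
    a = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    e = {k: {b: E[k][b] / sum(E[k].values()) for b in alphabet} for k in states}
    return a, e

# Hypothetical labelled data: tosses with the coin used (F = fair, B = biased).
states = ["F", "B"]
alphabet = ["H", "T"]
pairs = [("HHTH", "FFBB")]

a, e = estimate(pairs, states, alphabet)
```

With pseudocounts of 1, the single observed F to F transition and the single observed F to B transition give $a_{FF} = 2/4$, and a state that never occurs would simply receive uniform probabilities instead of a division by zero.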
Unknown paths: Viterbi training
Idea: use Viterbi decoding to compute the most probable path for training sequence $x$.
Start with some guess for the initial parameters and compute $\pi^*$, the most probable path for $x$ using the initial parameters.
Iterate until no change in $\pi^*$:
1. Determine $A_{kl}$ and $E_k(b)$ as before
2. Compute new parameters $a_{kl}$ and $e_k(b)$ using the same formulas as before
3. Compute new $\pi^*$ for $x$ and the current parameters

Viterbi training analysis
The algorithm converges precisely:
  There are finitely many possible paths.
  New parameters are uniquely determined by the current $\pi^*$.
  There may be several paths for $x$ with the same probability, hence one must compare the new $\pi^*$ with all previous paths having highest probability.
Does not maximize the likelihood $\prod_x P(x \mid \Theta)$ but the contribution to the likelihood of the most probable path, $\prod_x P(x \mid \Theta, \pi^*)$.
In general performs less well than Baum-Welch.
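The Viterbi training loop can be sketched for a single training sequence. This is an illustrative sketch, not a reference implementation: the start distribution is kept fixed, there is no end state, a uniform pseudocount `r` keeps every probability positive, and a `max_iter` cap is added as a safeguard because ties between equally probable paths are not tracked here. The coin model and its numbers are hypothetical.

```python
import math

def viterbi(x, states, start, trans, emit):
    """Most probable state path for x, computed in log space."""
    v = {k: math.log(start[k] * emit[k][x[0]]) for k in states}
    pointers = []
    for symbol in x[1:]:
        new_v, ptr = {}, {}
        for l in states:
            best = max(states, key=lambda k: v[k] + math.log(trans[k][l]))
            ptr[l] = best
            new_v[l] = v[best] + math.log(trans[best][l]) + math.log(emit[l][symbol])
        v = new_v
        pointers.append(ptr)
    path = [max(states, key=lambda k: v[k])]
    for ptr in reversed(pointers):                 # trace back from the last position
        path.append(ptr[path[-1]])
    return path[::-1]

def reestimate(x, path, states, alphabet, r):
    """Steps 1 and 2: count along the path, then normalize (pseudocount r)."""
    A = {k: {l: r for l in states} for k in states}
    E = {k: {b: r for b in alphabet} for k in states}
    for symbol, k in zip(x, path):
        E[k][symbol] += 1
    for k, l in zip(path, path[1:]):
        A[k][l] += 1
    a = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    e = {k: {b: E[k][b] / sum(E[k].values()) for b in alphabet} for k in states}
    return a, e

def viterbi_training(x, states, alphabet, start, trans, emit, r=1.0, max_iter=100):
    path = viterbi(x, states, start, trans, emit)
    for _ in range(max_iter):
        trans, emit = reestimate(x, path, states, alphabet, r)
        new_path = viterbi(x, states, start, trans, emit)   # step 3
        if new_path == path:                                # no change in pi*
            break
        path = new_path
    return path, trans, emit

# Hypothetical starting guess: F = fair coin, B = biased coin.
states, alphabet = ["F", "B"], ["H", "T"]
start = {"F": 0.5, "B": 0.5}
trans0 = {"F": {"F": 0.9, "B": 0.1}, "B": {"F": 0.1, "B": 0.9}}
emit0 = {"F": {"H": 0.5, "T": 0.5}, "B": {"H": 0.8, "T": 0.2}}

x = "HTHTTHTHHHHHHHHTHHHH"
path, a, e = viterbi_training(x, states, alphabet, start, trans0, emit0)
```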
Unknown paths: Baum-Welch
Idea:
1. Guess initial values for parameters. (art and experience, not science)
2. Estimate new (better) values for parameters. (how?)
3. Repeat until a stopping criterion is met. (what criterion?)

Better values for parameters
We would need the $A_{kl}$ and $E_k(b)$ values, but we cannot count (the path is unknown) and do not want to use a most probable path.
For all states $k$, $l$, symbol $b$ and training sequence $x$:
Compute $A_{kl}$ and $E_k(b)$ as expected values, given the current parameters.
Notation
For any sequence of characters $x$ emitted along some unknown path $\pi$, denote by $\pi_i = k$ the assumption that the state at position $i$ (in which $x_i$ is emitted) is $k$.

Probabilistic setting for $A_{kl}$
Given $x^1, \dots, x^m$, consider a discrete probability space with elementary events
$\varepsilon_{k,l}$ = "$k \to l$ is taken in $x^1, \dots, x^m$"
For each $x$ in $\{x^1, \dots, x^m\}$ and each position $i$ in $x$, let $Y_{x,i}$ be a random variable defined by
$Y_{x,i}(\varepsilon_{k,l}) = 1$ if $\pi_i = k$ and $\pi_{i+1} = l$, and $0$ otherwise
Define $Y = \sum_x \sum_i Y_{x,i}$, the random variable that counts the # of times the event $\varepsilon_{k,l}$ happens in $x^1, \dots, x^m$.
The meaning of $A_{kl}$
Let $A_{kl}$ be the expectation of $Y$:
$E(Y) = \sum_x \sum_i E(Y_{x,i}) = \sum_x \sum_i P(Y_{x,i} = 1)$
$= \sum_x \sum_i P(\{\varepsilon_{k,l} \mid \pi_i = k \text{ and } \pi_{i+1} = l\})$
$= \sum_x \sum_i P(\pi_i = k, \pi_{i+1} = l \mid x)$
Need to compute $P(\pi_i = k, \pi_{i+1} = l \mid x)$.

Probabilistic setting for $E_k(b)$
Given $x^1, \dots, x^m$, consider a discrete probability space with elementary events
$\varepsilon_{k,b}$ = "$b$ is emitted in state $k$ in $x^1, \dots, x^m$"
For each $x$ in $\{x^1, \dots, x^m\}$ and each position $i$ in $x$, let $Y_{x,i}$ be a random variable defined by
$Y_{x,i}(\varepsilon_{k,b}) = 1$ if $x_i = b$ and $\pi_i = k$, and $0$ otherwise
Define $Y = \sum_x \sum_i Y_{x,i}$, the random variable that counts the # of times the event $\varepsilon_{k,b}$ happens in $x^1, \dots, x^m$.
The meaning of $E_k(b)$
Let $E_k(b)$ be the expectation of $Y$:
$E(Y) = \sum_x \sum_i E(Y_{x,i}) = \sum_x \sum_i P(Y_{x,i} = 1)$
$= \sum_x \sum_i P(\{\varepsilon_{k,b} \mid x_i = b \text{ and } \pi_i = k\})$
$= \sum_x \sum_{\{i \mid x_i = b\}} P(\{\varepsilon_{k,b} \mid x_i = b, \pi_i = k\})$
$= \sum_x \sum_{\{i \mid x_i = b\}} P(\pi_i = k \mid x)$
Need to compute $P(\pi_i = k \mid x)$.

Computing new parameters
Consider a training sequence $x = x_1 \dots x_n$.
Concentrate on positions $i$ and $i+1$.
Use the forward-backward values:
$f_{ki} = P(x_1 \dots x_i, \pi_i = k)$
$b_{ki} = P(x_{i+1} \dots x_n \mid \pi_i = k)$
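Compute $A_{kl}$ (1)
Probability that $k \to l$ is taken at position $i$ of $x$:
$P(\pi_i = k, \pi_{i+1} = l \mid x_1 \dots x_n) = P(x, \pi_i = k, \pi_{i+1} = l) / P(x)$
Compute $P(x)$ using either forward or backward values.
We'll show that $P(x, \pi_i = k, \pi_{i+1} = l) = b_{l,i+1} \, e_l(x_{i+1}) \, a_{kl} \, f_{ki}$.
Expected # of times $k \to l$ is used in the training sequences:
$A_{kl} = \sum_x \sum_i \left( b_{l,i+1} \, e_l(x_{i+1}) \, a_{kl} \, f_{ki} \right) / P(x)$

Compute $A_{kl}$ (2)
$P(x, \pi_i = k, \pi_{i+1} = l)$
$= P(x_1 \dots x_i, \pi_i = k, \pi_{i+1} = l, x_{i+1} \dots x_n)$
$= P(\pi_{i+1} = l, x_{i+1} \dots x_n \mid x_1 \dots x_i, \pi_i = k) \, P(x_1 \dots x_i, \pi_i = k)$
$= P(\pi_{i+1} = l, x_{i+1} \dots x_n \mid \pi_i = k) \, f_{ki}$
$= P(x_{i+1} \dots x_n \mid \pi_i = k, \pi_{i+1} = l) \, P(\pi_{i+1} = l \mid \pi_i = k) \, f_{ki}$
$= P(x_{i+1} \dots x_n \mid \pi_{i+1} = l) \, a_{kl} \, f_{ki}$
$= P(x_{i+2} \dots x_n \mid x_{i+1}, \pi_{i+1} = l) \, P(x_{i+1} \mid \pi_{i+1} = l) \, a_{kl} \, f_{ki}$
$= P(x_{i+2} \dots x_n \mid \pi_{i+1} = l) \, e_l(x_{i+1}) \, a_{kl} \, f_{ki}$
$= b_{l,i+1} \, e_l(x_{i+1}) \, a_{kl} \, f_{ki}$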
Compute $E_k(b)$
Probability that $x_i$ of $x$ is emitted in state $k$:
$P(\pi_i = k \mid x_1 \dots x_n) = P(\pi_i = k, x_1 \dots x_n) / P(x)$
$P(\pi_i = k, x_1 \dots x_n) = P(x_1 \dots x_i, \pi_i = k, x_{i+1} \dots x_n)$
$= P(x_{i+1} \dots x_n \mid x_1 \dots x_i, \pi_i = k) \, P(x_1 \dots x_i, \pi_i = k)$
$= P(x_{i+1} \dots x_n \mid \pi_i = k) \, f_{ki}$
$= b_{ki} \, f_{ki}$
Expected # of times $b$ is emitted in state $k$:
$E_k(b) = \sum_x \frac{1}{P(x)} \sum_{i : x_i = b} f_{ki} \, b_{ki}$

Finally, new parameters
$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}$
$e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}$
Can add pseudocounts as before.
Stopping criteria
Cannot actually reach the maximum (optimization of continuous functions), therefore we need stopping criteria:
  Compute the log likelihood of the model for the current $\Theta$: $\sum_x \log P(x \mid \Theta)$
  Compare with the previous log likelihood
  Stop if the difference is small
  Stop after a certain number of iterations

The Baum-Welch algorithm
Initialization: pick the best guess for the model parameters (or arbitrary values)
Iteration:
1. Forward for each $x$
2. Backward for each $x$
3. Calculate $A_{kl}$, $E_k(b)$
4. Calculate new $a_{kl}$, $e_k(b)$
5. Calculate new log-likelihood
Until the log-likelihood does not change much
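The five steps of the Baum-Welch iteration can be sketched in plain Python. This is a minimal sketch under several assumptions: the start distribution is fixed, there is no end state (so the last backward column is 1), no pseudocounts are added, and no scaling is applied, which restricts it to short sequences. The coin model, the training sequences and the tolerance are all hypothetical.

```python
import math

def forward(x, S, start, a, e):
    f = [{k: start[k] * e[k][x[0]] for k in S}]
    for sym in x[1:]:
        prev = f[-1]
        f.append({l: e[l][sym] * sum(prev[k] * a[k][l] for k in S) for l in S})
    return f

def backward(x, S, a, e):
    b = [{k: 1.0 for k in S}]                      # no end state modelled
    for sym in reversed(x[1:]):
        nxt = b[0]
        b.insert(0, {k: sum(a[k][l] * e[l][sym] * nxt[l] for l in S) for k in S})
    return b

def baum_welch_step(seqs, S, alphabet, start, a, e):
    """One iteration; also returns the log-likelihood of the parameters passed in."""
    A = {k: {l: 0.0 for l in S} for k in S}
    E = {k: {c: 0.0 for c in alphabet} for k in S}
    loglik = 0.0
    for x in seqs:
        f, b = forward(x, S, start, a, e), backward(x, S, a, e)   # steps 1, 2
        px = sum(f[-1][k] for k in S)
        loglik += math.log(px)
        for i, sym in enumerate(x):                # expected emission counts
            for k in S:
                E[k][sym] += f[i][k] * b[i][k] / px
        for i in range(len(x) - 1):                # expected transition counts
            for k in S:
                for l in S:
                    A[k][l] += f[i][k] * a[k][l] * e[l][x[i + 1]] * b[i + 1][l] / px
    new_a = {k: {l: A[k][l] / sum(A[k].values()) for l in S} for k in S}      # step 4
    new_e = {k: {c: E[k][c] / sum(E[k].values()) for c in alphabet} for k in S}
    return new_a, new_e, loglik

def baum_welch(seqs, S, alphabet, start, a, e, tol=1e-6, max_iter=200):
    history = []                                   # log-likelihood per iteration
    for _ in range(max_iter):
        a, e, loglik = baum_welch_step(seqs, S, alphabet, start, a, e)
        converged = bool(history) and tol > abs(loglik - history[-1])
        history.append(loglik)
        if converged:                              # log-likelihood barely changed
            break
    return a, e, history

# Hypothetical starting guess and training data.
S, alphabet = ["F", "B"], ["H", "T"]
start = {"F": 0.5, "B": 0.5}
a0 = {"F": {"F": 0.8, "B": 0.2}, "B": {"F": 0.3, "B": 0.7}}
e0 = {"F": {"H": 0.5, "T": 0.5}, "B": {"H": 0.7, "T": 0.3}}
seqs = ["HHHHTHTTHH", "TTHTHHHHHH", "HTHHHHHTTT"]

a, e, history = baum_welch(seqs, S, alphabet, start, a0, e0)
```

Both stopping criteria from the slide appear: the loop ends when the log-likelihood changes by less than `tol`, or after `max_iter` iterations.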
Baum-Welch analysis
The log-likelihood is increased (never decreased) by each iteration.
Baum-Welch is a particular case of the EM (expectation maximization) algorithm.
Convergence to a local maximum. The choice of initial parameters determines the local maximum to which the algorithm converges.

Finding Distant Members of a Protein Family
A distant cousin of functionally related sequences in a protein family may have weak pairwise similarities with each member of the family and thus fail a significance test.
However, it may have weak similarities with many members of the family.
The goal is to align a sequence to all members of the family at once.
A family of related proteins can be represented by their multiple alignment and the corresponding profile.
Profile Representation of Protein Families
Aligned DNA sequences can be represented by a $4 \times n$ profile matrix reflecting the frequencies of nucleotides in every aligned position.
A protein family can be represented by a $20 \times n$ profile representing frequencies of amino acids.

Profiles and HMMs
HMMs can also be used for aligning a sequence against a profile representing a protein family.
A $20 \times n$ profile $P$ corresponds to $n$ sequentially linked match states $M_1, \dots, M_n$ in the profile HMM of $P$.
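The $4 \times n$ profile matrix for aligned DNA sequences can be sketched as a table of per-column frequencies. This is an illustrative sketch: the alignment is made up, and skipping gap characters when counting is an assumption of this example, not something the slides specify.

```python
def profile(alignment, alphabet="ACGT"):
    """Return {symbol: [frequency in column 0, column 1, ...]} for an alignment."""
    n = len(alignment[0])
    prof = {c: [0.0] * n for c in alphabet}
    for j in range(n):
        # Symbols in column j; anything outside the alphabet (e.g. '-') is skipped.
        column = [seq[j] for seq in alignment if seq[j] in alphabet]
        total = len(column) or 1                   # avoid dividing by zero
        for c in alphabet:
            prof[c][j] = column.count(c) / total
    return prof

# Hypothetical alignment of four DNA sequences of length n = 4.
alignment = ["ACGT", "ACGA", "AGGT", "ACCT"]
prof = profile(alignment)
```

Each column of `prof` corresponds to one match state in the profile HMM, and its frequencies are the natural starting point for that state's emission probabilities. Passing the 20 amino acid letters as `alphabet` gives the $20 \times n$ protein profile.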
Multiple Alignments and Protein Family Classification
A multiple alignment of a protein family shows variations in conservation along the length of a protein.
Example: after aligning many globin proteins, biologists recognized that the helix regions in globins are more conserved than other regions.

What are Profile HMMs?
A profile HMM is a probabilistic representation of a multiple alignment.
A given multiple alignment (of a protein family) is used to build a profile HMM.
This model may then be used to find and score less obvious potential matches of new protein sequences.
Profile HMMs
[Figure: a profile HMM with a Start state, match states $M_1$ to $M_4$, insert states I, delete states D and an End state; labels in the figure: L M W K E, ILMWKE, ILWK]
Match: a conserved position with specialized emission probability
Insert: an insertion with a general (background) emission probability
Delete: a deletion, silent state without any emission
Caption: A profile HMM
Building a profile HMM
A multiple alignment is used to construct the HMM model.
Assign each column to a Match state in the HMM. Add Insertion and Deletion states.
Estimate the emission probabilities according to amino acid counts in each column. Different positions in the protein will have different emission probabilities.
Estimate the transition probabilities between Match, Deletion and Insertion states.
The HMM model gets trained to derive the optimal parameters.

States of Profile HMM
Match states: $M_1 \dots M_n$ (plus begin/end states)
Insertion states: $I_0, I_1, \dots, I_n$
Deletion states: $D_1 \dots D_n$
Transition Probabilities in Profile HMM
$\log(a_{MI}) + \log(a_{IM})$ = gap initiation penalty
$\log(a_{II})$ = gap extension penalty

Emission Probabilities in Profile HMM
Probability of emitting a symbol $a$ at an insertion state $I_j$:
$e_{I_j}(a) = p(a)$
where $p(a)$ is the frequency of occurrence of the symbol $a$ in all the sequences.
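The two transition terms above combine into an affine gap score. As a worked illustration, an insertion of length $g$ enters the insert state once, takes the self-loop $g-1$ times, and returns to a match state once. The numeric values below are hypothetical and natural logarithms are assumed.

```latex
\[
\text{score of an insertion of length } g
  = \log a_{MI} + (g-1)\,\log a_{II} + \log a_{IM}
\]
% Hypothetical values: a_{MI} = 0.05, a_{II} = 0.4, a_{IM} = 0.6, g = 3
\[
\log 0.05 + 2\,\log 0.4 + \log 0.6
  \approx -2.996 - 1.833 - 0.511
  \approx -5.34
\]
```

Because the insert states emit with the background distribution $p(a)$, their emission log-odds term is $\log(p(a)/p(a)) = 0$, so under this setting the transitions alone determine the cost of the gap.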
Profile HMM Alignment
Define $v^M_j(i)$ as the logarithmic likelihood score of the best path for matching $x_1 \dots x_i$ to the profile HMM, ending with $x_i$ emitted by the state $M_j$.
$v^I_j(i)$ and $v^D_j(i)$ are defined similarly.

Profile HMM Alignment: Dynamic Programming
$v^M_j(i) = \log \frac{e_{M_j}(x_i)}{p(x_i)} + \max \begin{cases} v^M_{j-1}(i-1) + \log(a_{M_{j-1},M_j}) \\ v^I_{j-1}(i-1) + \log(a_{I_{j-1},M_j}) \\ v^D_{j-1}(i-1) + \log(a_{D_{j-1},M_j}) \end{cases}$
$v^I_j(i) = \log \frac{e_{I_j}(x_i)}{p(x_i)} + \max \begin{cases} v^M_j(i-1) + \log(a_{M_j,I_j}) \\ v^I_j(i-1) + \log(a_{I_j,I_j}) \\ v^D_j(i-1) + \log(a_{D_j,I_j}) \end{cases}$
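The slide gives the recurrences for the match and insert states only. For completeness, the delete-state recurrence in the standard textbook formulation of profile HMM Viterbi alignment (an addition here, not taken from the slides) has the same shape. Since a delete state is silent, there is no emission term and the sequence index $i$ does not advance.

```latex
\[
v^D_j(i) = \max
\begin{cases}
  v^M_{j-1}(i) + \log(a_{M_{j-1},D_j}) \\
  v^I_{j-1}(i) + \log(a_{I_{j-1},D_j}) \\
  v^D_{j-1}(i) + \log(a_{D_{j-1},D_j})
\end{cases}
\]
```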
Paths in Edit Graph and Profile HMM
[Figure: a path through an edit graph and the corresponding path through a profile HMM]