CISC 889 Bioinformatics (Spring 24)

Hidden Markov Models (II)
a. Likelihood: forward algorithm
b. Decoding: Viterbi algorithm
c. Model building: Baum-Welch algorithm; Viterbi training

Hidden Markov models
A Markov chain of states; at each state, there is a set of possible observations.
E.g., the fair/loaded dice model:
  Fair die:   1: 1/6   2: 1/6   3: 1/6   4: 1/6   5: 1/6   6: 1/6
  Loaded die: 1: 1/10  2: 1/10  3: 1/10  4: 1/10  5: 1/10  6: 1/2
  Transitions: Fair→Fair 0.95, Fair→Loaded 0.05, Loaded→Loaded 0.9, Loaded→Fair 0.1

Three major problems
- Most probable state path
- The likelihood
- Parameter estimation for HMMs
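As a concrete illustration, here is a minimal sketch (my own Python, not part of the course material) of the fair/loaded dice model as two small lookup tables; the state names and variable names are mine.

```python
# Sketch of the fair/loaded dice HMM as plain Python dictionaries
# (my own code; the numbers are the ones quoted on this slide).
STATES = ["Fair", "Loaded"]

# TRANS[k][l]: probability of moving from state k to state l
TRANS = {
    "Fair":   {"Fair": 0.95, "Loaded": 0.05},
    "Loaded": {"Fair": 0.10, "Loaded": 0.90},
}

# EMIT[k][b]: probability of rolling face b while in state k
EMIT = {
    "Fair":   {face: 1 / 6 for face in "123456"},
    "Loaded": {**{face: 0.10 for face in "12345"}, "6": 0.50},
}

# sanity check: each state's emission probabilities sum to 1
assert all(abs(sum(EMIT[k].values()) - 1.0) < 1e-9 for k in STATES)
```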
A biological example: CpG islands

The higher rate of methyl-C mutating to T in CpG dinucleotides leads to a generally lower CpG presence in the genome, except in some biologically important regions, e.g., promoters -- such regions are called CpG islands.

The conditional probabilities below were collected from ~60,000 bp of human genome sequence; + stands for CpG islands and - for non-CpG islands.

  +     A      C      G      T
  A   .180   .274   .426   .120
  C   .171   .368   .274   .188
  G   .161   .339   .375   .125
  T   .079   .355   .384   .182

  -     A      C      G      T
  A   .300   .205   .285   .210
  C   .322   .298   .078   .302
  G   .248   .246   .298   .208
  T   .177   .239   .292   .292

Task 1: given a sequence x, determine whether it is a CpG island.
Solution: compute the log-odds ratio scored by the two Markov chains:
  S(x) = log [ P(x | model +) / P(x | model -) ]

[Figure: histogram of the length-normalized scores S(x); y-axis: number of sequences]
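A small sketch of the scoring step in Python (my own code, not from the slides): S(x) is accumulated over successive dinucleotides using the two transition tables above; the contribution of the first base is ignored, and the example sequence in the last line is made up.

```python
import math

# Transition probabilities of the + and - Markov chains (values from the tables above).
PLUS = {
    "A": {"A": 0.180, "C": 0.274, "G": 0.426, "T": 0.120},
    "C": {"A": 0.171, "C": 0.368, "G": 0.274, "T": 0.188},
    "G": {"A": 0.161, "C": 0.339, "G": 0.375, "T": 0.125},
    "T": {"A": 0.079, "C": 0.355, "G": 0.384, "T": 0.182},
}
MINUS = {
    "A": {"A": 0.300, "C": 0.205, "G": 0.285, "T": 0.210},
    "C": {"A": 0.322, "C": 0.298, "G": 0.078, "T": 0.302},
    "G": {"A": 0.248, "C": 0.246, "G": 0.298, "T": 0.208},
    "T": {"A": 0.177, "C": 0.239, "G": 0.292, "T": 0.292},
}

def log_odds(x: str) -> float:
    """S(x) = log [P(x | +) / P(x | -)], summed over successive dinucleotides."""
    return sum(math.log(PLUS[a][b] / MINUS[a][b]) for a, b in zip(x, x[1:]))

# Length-normalized score, as used for the histogram of S(x):
print(log_odds("GCGCGCTACGC") / len("GCGCGCTACGC"))
```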
Task 2: For a long genomic sequence x, label the CpG islands, if there are any.

Approach 1: Adopt the method for Task 1 by calculating the log-odds score for a window of, say, 100 bps around every nucleotide and plotting the scores.
Problems with this approach:
- It won't do well if CpG islands have sharp boundaries and variable lengths.
- There is no effective way to choose a good window size.

Approach 2: use a hidden Markov model.

[Figure: two-state HMM. State + (CpG island) emits A: .171, C: .368, G: .274, T: .188; state - (non-island) emits A: .372, C: .198, G: .112, T: .338. Transitions: a_++ = 0.9, a_+- = 0.1, a_-- = 0.95, a_-+ = 0.05.]

The model has two states, + for CpG island and - for non-CpG island. The numbers are made up here, and should be fitted by learning from training examples. A reasonable assignment for the emission frequencies would be the respective eigenvectors (stationary distributions) of the two Markov chains in Approach 1.

Use the same notation as in the text: a_kl is the transition probability from state k to state l; e_k(b) is the emission probability that symbol b is seen when in state k.
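For use in the later sketches, here is the two-state model from the figure written out in the a_kl / e_k(b) notation; this is my own Python, and the numbers are just the made-up values quoted above.

```python
# Two-state CpG HMM (made-up parameters from the figure above).
STATES = ["+", "-"]

# TRANS[k][l] is a_kl, the transition probability from state k to state l.
TRANS = {
    "+": {"+": 0.90, "-": 0.10},
    "-": {"+": 0.05, "-": 0.95},
}

# EMIT[k][b] is e_k(b), the probability of emitting symbol b while in state k.
EMIT = {
    "+": {"A": 0.171, "C": 0.368, "G": 0.274, "T": 0.188},
    "-": {"A": 0.372, "C": 0.198, "G": 0.112, "T": 0.338},
}
```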
[Figure: the same two-state CpG HMM as on the previous slide.]

Joint probability of an observed sequence x and a state path π:

  i:  1 2 3 4 5 6 7 8 9
  x:  T G C G C G T A C
  π:  - - + + + + - - -

  P(x, π) = Π_{i=1..L} e_{π_i}(x_i) a_{π_i π_{i+1}}

  P(x, π) = .338 × .95 × .112 × .05 × .368 × .9 × .274 × .9 × .368 × .9
            × .274 × .1 × .338 × .95 × .372 × .95 × .198

Then the probability of observing sequence x in the model is P(x) = Σ_π P(x, π), which is also called the likelihood of the model.

Q: Given a sequence x of length L, how many state paths are there?
A: N^L, where N stands for the number of states in the model. As an exponential function of the input size, this precludes enumerating all possible state paths to compute P(x).

Example: Assume the model has a set of two states S = {0, 1}. For L = 3, there are 2^3 = 8 paths:
  Φ = {000, 001, 010, 011, 100, 101, 110, 111}
  P(x) = Σ_{π ∈ Φ} P(x, π)
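To make the N^L blow-up concrete, the sketch below (my own code; the 0.5 begin-state probabilities a_0k are an assumption, consistent with the worked Viterbi example two slides ahead) enumerates every path explicitly; it is feasible only for very short sequences, which is exactly why the forward algorithm on the next slide is needed.

```python
from itertools import product

STATES = ["+", "-"]
TRANS = {"+": {"+": 0.90, "-": 0.10}, "-": {"+": 0.05, "-": 0.95}}
EMIT = {"+": {"A": 0.171, "C": 0.368, "G": 0.274, "T": 0.188},
        "-": {"A": 0.372, "C": 0.198, "G": 0.112, "T": 0.338}}
INIT = {"+": 0.5, "-": 0.5}   # assumed begin-state probabilities a_0k

def joint_prob(x, path):
    """P(x, pi) = a_{0,pi_1} * prod_i e_{pi_i}(x_i) * a_{pi_i, pi_{i+1}}."""
    p = INIT[path[0]]
    for i, (sym, state) in enumerate(zip(x, path)):
        p *= EMIT[state][sym]
        if i + 1 < len(path):
            p *= TRANS[state][path[i + 1]]
    return p

def brute_force_likelihood(x):
    """P(x) = sum over all N**L state paths of P(x, pi)."""
    return sum(joint_prob(x, path) for path in product(STATES, repeat=len(x)))

# The path from the slide; note the assumed 0.5 begin factor makes this
# half the product written above.
print(joint_prob("TGCGCGTAC", "--++++---"))
print(brute_force_likelihood("TGC"))   # 2**3 = 8 paths for L = 3
```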
Specifically, group the terms of P(x) = Σ_π P(x, π) position by position, so that position i contributes one bracket of emission terms e_1(x_i), ..., e_N(x_i), joined to the neighboring brackets by the transition probabilities a_jk. Let f_k(i) be the intermediate result computed up to the k-th term in the i-th bracket, which is the probability contributed by all paths from the beginning up to (and including) position i with the state at position i being k:

  f_k(i) = P(x_1 ... x_i, π_i = k)

It can be written recursively as

  f_k(i) = [ Σ_j f_j(i-1) a_jk ] e_k(x_i)

[Figure: trellis of states 1..N over positions x_1, x_2, x_3, ..., x_L; a silent begin state 0 is introduced for better presentation.]

Forward algorithm
  Initialization: f_0(0) = 1, f_k(0) = 0 for k > 0.
  Recursion:      f_k(i) = e_k(x_i) Σ_j f_j(i-1) a_jk.
  Termination:    P(x) = Σ_k f_k(L) a_k0.
Time complexity: O(N^2 L), where N is the number of states and L is the sequence length.

Similarly, we can compute P(x) backwards.

Backward algorithm
  Initialization: b_k(L) = a_k0 for all k.
  Recursion:      b_k(i) = Σ_l a_kl e_l(x_{i+1}) b_l(i+1).
  Termination:    P(x) = Σ_k a_0k e_k(x_1) b_k(1).

Therefore, f_k(i) b_k(i) gives P(x, π_i = k), the probability contributed by all paths that go through state k at position i.
Note: b_k(i) does not include emitting x_i; this avoids double counting e_k(x_i) in the product f_k(i) b_k(i).
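Below is a sketch of both recursions in Python (my own code, not the course's). The begin probabilities a_0k = 0.5 and the end transitions a_k0 = 1 are assumptions for the two-state CpG model; with an explicit end state, END would hold the real a_k0 values.

```python
STATES = ["+", "-"]
TRANS = {"+": {"+": 0.90, "-": 0.10}, "-": {"+": 0.05, "-": 0.95}}
EMIT = {"+": {"A": 0.171, "C": 0.368, "G": 0.274, "T": 0.188},
        "-": {"A": 0.372, "C": 0.198, "G": 0.112, "T": 0.338}}
INIT = {"+": 0.5, "-": 0.5}   # a_0k (assumed)
END = {"+": 1.0, "-": 1.0}    # a_k0 (no explicit end state here)

def forward(x):
    """Forward recursion; f[i][k] corresponds to f_k(i+1) on the slide (0-based lists)."""
    f = [{k: INIT[k] * EMIT[k][x[0]] for k in STATES}]
    for i in range(1, len(x)):
        f.append({k: EMIT[k][x[i]] * sum(f[i - 1][j] * TRANS[j][k] for j in STATES)
                  for k in STATES})
    return f, sum(f[-1][k] * END[k] for k in STATES)

def backward(x):
    """Backward recursion; b[i][k] corresponds to b_k(i+1) on the slide."""
    L = len(x)
    b = [dict() for _ in range(L)]
    b[L - 1] = {k: END[k] for k in STATES}
    for i in range(L - 2, -1, -1):
        b[i] = {k: sum(TRANS[k][l] * EMIT[l][x[i + 1]] * b[i + 1][l] for l in STATES)
                for k in STATES}
    return b, sum(INIT[k] * EMIT[k][x[0]] * b[0][k] for k in STATES)

f, px = forward("TGCTA")
b, px_b = backward("TGCTA")
assert abs(px - px_b) < 1e-12                      # both recursions give P(x)
# f_k(i) * b_k(i) sums over k to P(x) at every position i:
assert all(abs(sum(f[i][k] * b[i][k] for k in STATES) - px) < 1e-12 for i in range(5))
```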
Decoding: Given an observed sequence x, what is the most probable state path, i.e.,
  π* = argmax_π P(x, π)

Let v_k(i) be the probability of the most probable path for the prefix x_1 ... x_i that ends in state k at position i. It plays the same role as f_k(i) in the forward algorithm, with the sum over previous states replaced by a maximization.

Viterbi algorithm
  Initialization: v_0(0) = 1, v_k(0) = 0 for k > 0.
  Recursion:      v_k(i) = e_k(x_i) max_j ( v_j(i-1) a_jk );  ptr_i(k) = argmax_j ( v_j(i-1) a_jk ).
  Termination:    P(x, π*) = max_k ( v_k(L) a_k0 );  π*_L = argmax_k ( v_k(L) a_k0 ).
  Traceback:      π*_{i-1} = ptr_i(π*_i).

Worked example with the two-state CpG model (a_++ = 0.9, a_+- = 0.1, a_-- = 0.95, a_-+ = 0.05; emissions as before; begin probabilities taken here as 0.5 for each state), using
  v_k(i) = e_k(x_i) max_j ( v_j(i-1) a_jk ):

  v_k(i)     T        G        C        T        A
  +        0.094    0.023    0.0076   0.0013   0.0002
  -        0.169    0.018    0.0033   0.0011   0.00038
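A sketch of the recursion plus traceback in Python (my own code; the 0.5 begin probabilities are the assumption that reproduces the v_k(i) table above, and a_k0 is taken as 1 in the termination step).

```python
STATES = ["+", "-"]
TRANS = {"+": {"+": 0.90, "-": 0.10}, "-": {"+": 0.05, "-": 0.95}}
EMIT = {"+": {"A": 0.171, "C": 0.368, "G": 0.274, "T": 0.188},
        "-": {"A": 0.372, "C": 0.198, "G": 0.112, "T": 0.338}}
INIT = {"+": 0.5, "-": 0.5}   # a_0k (assumed)

def viterbi(x):
    """Return (most probable state path pi*, P(x, pi*)) for the observed sequence x."""
    v = [{k: INIT[k] * EMIT[k][x[0]] for k in STATES}]   # v_k(1)
    ptr = [{}]                                           # ptr_i(k) back-pointers
    for i in range(1, len(x)):
        col, back = {}, {}
        for k in STATES:
            best_j = max(STATES, key=lambda j: v[i - 1][j] * TRANS[j][k])
            back[k] = best_j
            col[k] = EMIT[k][x[i]] * v[i - 1][best_j] * TRANS[best_j][k]
        v.append(col)
        ptr.append(back)
    last = max(STATES, key=lambda k: v[-1][k])           # pi*_L (a_k0 taken as 1)
    path = [last]
    for i in range(len(x) - 1, 0, -1):                   # traceback: pi*_{i-1} = ptr_i(pi*_i)
        path.append(ptr[i][path[-1]])
    return "".join(reversed(path)), v[-1][last]

print(viterbi("TGCTA"))   # ('-----', ~0.00038), matching the last column of the table
```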
Posterior decoding

  P(π_i = k | x) = P(x, π_i = k) / P(x) = f_k(i) b_k(i) / P(x)

Algorithm: for i = 1 to L, take argmax_k P(π_i = k | x).

Notes:
1. Posterior decoding may be useful when there are multiple, almost equally probable state paths, or when a function is defined on the states.
2. The state path identified by posterior decoding may not be the most probable path overall, and may not even be a viable path through the model.
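A short sketch (my own code) of posterior decoding on top of the forward() and backward() functions from the sketch after the forward/backward slide; it returns the per-position argmax states, which serve as the posterior-decoded labels.

```python
def posterior_decode(x):
    """Pick argmax_k P(pi_i = k | x) = argmax_k f_k(i) b_k(i) / P(x) at each position i.

    Uses forward() and backward() from the earlier sketch; note the resulting
    label sequence need not be a viable path through the model.
    """
    f, px = forward(x)
    b, _ = backward(x)
    return [max(STATES, key=lambda k: f[i][k] * b[i][k] / px) for i in range(len(x))]

# e.g. posterior_decode("TGCGCGTAC") returns one '+'/'-' label per position
```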
Model building
- Topology
  - Requires domain knowledge
- Parameters
  - When states are labeled
    - Simple counting: a_kl = A_kl / Σ_l' A_kl' and e_k(b) = E_k(b) / Σ_b' E_k(b'),
      where A_kl and E_k(b) are the observed transition and emission counts.
  - When states are not labeled
    Method 1 (Viterbi training)
    1. Use the Viterbi algorithm to label the states;
    2. Do counting to collect new a_kl and e_k(b);
    3. Repeat steps 1 and 2 until a stopping criterion is met.
    Method 2 (Baum-Welch algorithm), below.

Baum-Welch algorithm (Expectation-Maximization)
An iterative procedure similar to Viterbi training, but it uses expected counts over all state paths instead of counts from the single Viterbi path. For a training sequence x,

  P(π_i = k, π_{i+1} = l | x, θ) = f_k(i) a_kl e_l(x_{i+1}) b_l(i+1) / P(x)

Summing over positions i and training sequences x^j gives the expected counts

  A_kl   = Σ_j (1/P(x^j)) [ Σ_i f_k^j(i) a_kl e_l(x^j_{i+1}) b_l^j(i+1) ]
  E_k(b) = Σ_j (1/P(x^j)) [ Σ_{i: x^j_i = b} f_k^j(i) b_k^j(i) ]

which replace the observed counts in the simple-counting formulas above to give new a_kl and e_k(b).
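A sketch of one Baum-Welch iteration (my own Python): it reuses STATES, TRANS, EMIT and the forward()/backward() functions from the earlier sketches, accumulates the expected counts A_kl and E_k(b) exactly as in the two sums above, and renormalizes them into new parameters. The begin probabilities are left fixed for simplicity and no pseudocounts are added; the training sequences in the commented usage line are hypothetical.

```python
def baum_welch_step(training_seqs):
    """One EM update: accumulate expected counts A_kl and E_k(b), then renormalize."""
    A = {k: {l: 0.0 for l in STATES} for k in STATES}
    E = {k: {} for k in STATES}
    for x in training_seqs:
        f, px = forward(x)
        b, _ = backward(x)
        for i in range(len(x)):
            for k in STATES:
                # expected emission count: f_k(i) b_k(i) / P(x), binned by symbol x_i
                E[k][x[i]] = E[k].get(x[i], 0.0) + f[i][k] * b[i][k] / px
                if i + 1 < len(x):
                    for l in STATES:
                        # expected transition count: f_k(i) a_kl e_l(x_{i+1}) b_l(i+1) / P(x)
                        A[k][l] += f[i][k] * TRANS[k][l] * EMIT[l][x[i + 1]] * b[i + 1][l] / px
    # M-step: renormalize the expected counts into new parameters
    new_trans = {k: {l: A[k][l] / sum(A[k].values()) for l in STATES} for k in STATES}
    new_emit = {k: {s: c / sum(E[k].values()) for s, c in E[k].items()} for k in STATES}
    return new_trans, new_emit

# Iterate until the parameters (or the total log-likelihood) stop changing, e.g.:
# TRANS, EMIT = baum_welch_step(["TGCGCGTAC", "ACGTCGCG"])   # hypothetical training data
```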