Lab 3: Practical Hidden Markov Models (HMM)

Advanced Topics in Bioinformatics
Maoying, Wu
Department of Bioinformatics & Biostatistics, Shanghai Jiao Tong University
November 27, 2014

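Hidden Markov Models (HMMs)

Three components of an HMM $\lambda = (\pi, A, B)$ with $N$ hidden states and $M$ observation symbols:

Initial state probability vector $\pi$: $\pi_i$ is the initial probability of state $i$.
State transition matrix $A \in \mathbb{R}^{N \times N}$, independent of $t$: $a_{ij}$ is the probability of the transition $S_i \to S_j$.
Confusion/emission matrix $B \in \mathbb{R}^{N \times M}$, independent of $t$: $b_i(o_t)$ is the probability of symbol $o_t$ given state $i$.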
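As a minimal sketch of how the three components might be stored, the Python snippet below holds $\pi$, $A$ and $B$ as plain lists and checks that each row is a valid probability distribution. The numeric values and the helper names `is_stochastic` and `check_hmm` are illustrative assumptions, not part of the lab material.

```python
import math

# Toy two-state, four-symbol HMM (values chosen for illustration only).
pi = [0.5, 0.5]                 # pi[i]   = initial probability of state i
A = [[0.9, 0.1],                # A[i][j] = P(next state j | current state i)
     [0.2, 0.8]]
B = [[0.4, 0.1, 0.1, 0.4],      # B[i][k] = P(symbol k | state i)
     [0.1, 0.4, 0.4, 0.1]]


def is_stochastic(rows, tol=1e-9):
    """Return True if every row is non-negative and sums to 1."""
    return all(
        all(p >= 0.0 for p in row) and math.isclose(sum(row), 1.0, abs_tol=tol)
        for row in rows
    )


def check_hmm(pi, A, B):
    """Check the shapes (N, N x N, N x M) and the row sums of an HMM."""
    n = len(pi)
    shapes_ok = (
        len(A) == n and all(len(row) == n for row in A)
        and len(B) == n and len({len(row) for row in B}) == 1
    )
    return shapes_ok and is_stochastic([pi]) and is_stochastic(A) and is_stochastic(B)
```

Running such a check before any inference is cheap and catches transcription slips in emission vectors early.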

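Problems and Solutions

Evaluation: compute $\Pr(O_{1:T} \mid \lambda)$. Solved by the forward algorithm or the backward algorithm.
Decoding: find the state path $S_{1:T}$ that maximizes $\Pr(S_{1:T} \mid O_{1:T}, \lambda)$. Solved by the Viterbi algorithm.
Learning: estimate $\lambda$ from $O$ ($O \to \lambda$). Solved by the forward-backward (Baum-Welch) algorithm.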

Forward Algorithm

Input: hidden Markov model $\lambda$ and an observation sequence $O_1 \dots O_T$ of length $T$.

Marginalization (sum rule): $P(A \mid C) = \sum_{B} P(A, B \mid C)$, so

$$P(O_1 O_2 \dots O_T \mid \lambda) = \sum_{q_T=1}^{N} P(O_1 O_2 \dots O_T, q_T \mid \lambda)$$

Forward variable (the joint probability of observing $O_1 \dots O_t$ and being in state $j$ at time $t$, given the hidden Markov model):

$$\alpha_t(j) = P(O_1 O_2 \dots O_t, q_t = j \mid \lambda)$$

Forward algorithm:
1. Initialization: $\alpha_1(j) = \pi_j \, b_j(o_1)$
2. Recursion: $\alpha_t(j) = \left( \sum_{i=1}^{N} \alpha_{t-1}(i) \, a_{ij} \right) b_j(o_t)$, for $t = 2, \dots, T$
3. Output: $p = \sum_{j=1}^{N} \alpha_T(j)$

The time complexity of the forward algorithm is $O(N^2 T)$.
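The three steps of the forward algorithm can be sketched in plain Python as follows. The toy model values and the function name `forward` are assumptions made for illustration. Observations are assumed to be already encoded as column indices into $B$. This unscaled version underflows on long sequences, so a genome-length input would need scaling or log-space arithmetic.

```python
# Toy two-state, four-symbol HMM (illustrative values, not from the lab).
pi = [0.5, 0.5]
A = [[0.9, 0.1],
     [0.2, 0.8]]
B = [[0.4, 0.1, 0.1, 0.4],
     [0.1, 0.4, 0.4, 0.1]]


def forward(obs, pi, A, B):
    """Return (alpha, p): alpha[t][j] = P(o_1..o_t, q_t = j), p = P(O | lambda).

    Note that list index t = 0 corresponds to time 1 in the formulas.
    """
    n = len(pi)
    # 1. Initialization
    alpha = [[pi[j] * B[j][obs[0]] for j in range(n)]]
    # 2. Recursion: N sums of N terms per time step, hence O(N^2 T) overall
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([
            sum(prev[i] * A[i][j] for i in range(n)) * B[j][o]
            for j in range(n)
        ])
    # 3. Output
    return alpha, sum(alpha[-1])


alpha, p = forward([0, 1], pi, A, B)
```

A quick sanity check is that the probabilities of all possible observation sequences of a fixed length sum to 1.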

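Backward Algorithm

Input: hidden Markov model $\lambda$ and an observation sequence $O_1 \dots O_T$ of length $T$.

Using $P(A \mid C) = \sum_{B} P(A, B \mid C)$, this time summing over the first state:

$$P(O_T O_{T-1} \dots O_1 \mid \lambda) = \sum_{q_1=1}^{N} P(O_T O_{T-1} \dots O_1, q_1 \mid \lambda)$$

Backward variable (the probability of observing $O_{t+1} \dots O_T$ given that the state at time $t$ is $i$ and the HMM is $\lambda$):

$$\beta_t(i) = P(O_{t+1} O_{t+2} \dots O_T \mid q_t = i, \lambda)$$

Backward algorithm:
1. Initialization: $\beta_T(i) = 1$
2. Recursion: $\beta_t(i) = \sum_{j=1}^{N} a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j)$, for $t = T-1, \dots, 1$
3. Output: $p = \sum_{i=1}^{N} \pi_i \, b_i(o_1) \, \beta_1(i)$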
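A matching sketch of the backward pass is given below, again with an assumed toy model and an assumed function name `backward`. The termination step weights $\beta_1(i)$ by $\pi_i \, b_i(o_1)$, because $\beta_1(i)$ itself does not account for the first observation or the initial state distribution. With that weighting the result equals the forward probability.

```python
# Toy two-state, four-symbol HMM (illustrative values, not from the lab).
pi = [0.5, 0.5]
A = [[0.9, 0.1],
     [0.2, 0.8]]
B = [[0.4, 0.1, 0.1, 0.4],
     [0.1, 0.4, 0.4, 0.1]]


def backward(obs, pi, A, B):
    """Return (beta, p): beta[t][i] = P(o_{t+1}..o_T | q_t = i), p = P(O | lambda).

    Note that list index t = 0 corresponds to time 1 in the formulas.
    """
    n, T = len(pi), len(obs)
    # 1. Initialization: beta_T(i) = 1 for every state
    beta = [[1.0] * n for _ in range(T)]
    # 2. Recursion, running from the end of the sequence to the start
    for t in range(T - 2, -1, -1):
        beta[t] = [
            sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(n))
            for i in range(n)
        ]
    # 3. Output: include the initial distribution and the first emission
    p = sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in range(n))
    return beta, p


beta, p = backward([0, 1], pi, A, B)
```

Comparing this `p` with the value from a forward pass on the same input is a useful consistency test for both implementations.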

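Viterbi Algorithm: Finding the Most Probable Hidden-State Path

Used for decoding. $\delta_t(i)$ is the probability of the best partial path that ends in state $i$ at time $t$.

Initialization: $\delta_1(i) = \pi_i \, b_i(o_1)$
Recursion: $\delta_t(i) = \max_j \left( \delta_{t-1}(j) \, a_{ji} \, b_i(o_t) \right)$
Backtracking: $\phi_t(i) = \arg\max_j \left( \delta_{t-1}(j) \, a_{ji} \, b_i(o_t) \right)$
Output: the probability of the best path, $p = \max_i \delta_T(i)$, and the most probable path $(s_1, \dots, s_T)$, recovered from $s_T = \arg\max_i \delta_T(i)$ and $s_t = \phi_{t+1}(s_{t+1})$ for $t = T-1, \dots, 1$.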
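The recursion and backtracking can be sketched in Python as below. This version works with log probabilities, which is a deliberate departure from the product form of the formulas: it finds the same path but avoids numerical underflow on long sequences. The toy model and the name `viterbi` are assumptions for illustration.

```python
import math

# Toy two-state, four-symbol HMM (illustrative values, not from the lab).
pi = [0.5, 0.5]
A = [[0.9, 0.1],
     [0.2, 0.8]]
B = [[0.4, 0.1, 0.1, 0.4],
     [0.1, 0.4, 0.4, 0.1]]


def safe_log(p):
    """Natural log that maps probability 0 to -infinity instead of raising."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(obs, pi, A, B):
    """Return (path, log_p): the most probable state path and its log probability."""
    n = len(pi)
    # Initialization: log delta_1(i)
    delta = [safe_log(pi[i]) + safe_log(B[i][obs[0]]) for i in range(n)]
    pointers = []  # pointers[t - 1][i] = best predecessor of state i at step t
    # Recursion
    for o in obs[1:]:
        new_delta, ptr = [], []
        for i in range(n):
            best_j = max(range(n), key=lambda j: delta[j] + safe_log(A[j][i]))
            ptr.append(best_j)
            new_delta.append(delta[best_j] + safe_log(A[best_j][i]) + safe_log(B[i][o]))
        delta = new_delta
        pointers.append(ptr)
    # Backtracking from the best final state
    last = max(range(n), key=lambda i: delta[i])
    path = [last]
    for ptr in reversed(pointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, delta[last]


path, log_p = viterbi([0, 0, 0, 1, 1, 1], pi, A, B)
# path is [0, 0, 0, 1, 1, 1]
```

Ties between equally good predecessors are broken in favour of the lowest state index, which is an arbitrary choice of this sketch.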

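Forward-Backward Algorithm: Baum-Welch

Definition: $\gamma_t(i) = P(q_t = S_i \mid O, \lambda)$ (probability of being in state $S_i$ at time $t$, that is, of a transition starting from $S_i$ at time $t$):

$$\gamma_t(i) = \frac{\alpha_t(i) \, \beta_t(i)}{P(O \mid \lambda)} = \frac{\alpha_t(i) \, \beta_t(i)}{\sum_{i'=1}^{N} \alpha_t(i') \, \beta_t(i')}$$

Definition: $\xi_t(i, j) = P(q_t = S_i, q_{t+1} = S_j \mid O, \lambda)$ (probability of the transition $S_i \to S_j$ at time $t$):

$$\xi_t(i, j) = \frac{\alpha_t(i) \, a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j)}{\sum_{i'=1}^{N} \sum_{j'=1}^{N} \alpha_t(i') \, a_{i'j'} \, b_{j'}(o_{t+1}) \, \beta_{t+1}(j')}$$

Thus $\sum_{t=1}^{T-1} \gamma_t(i)$ is the expected number of transitions starting from $S_i$, and $\sum_{t=1}^{T-1} \xi_t(i, j)$ is the expected number of transitions $S_i \to S_j$.

Inferring model parameters:

$$\hat{\pi}_i = \gamma_1(i) \tag{1}$$

$$\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)} \tag{2}$$

$$\hat{b}_j(o_k) = \frac{\sum_{t=1,\, O_t = o_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)} \tag{3}$$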
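One re-estimation step built from $\gamma$ and $\xi$ can be sketched as follows. This is a minimal, unscaled illustration on an assumed toy model and a made-up observation sequence. The names `baum_welch_step`, `forward` and `backward` are hypothetical, and a practical implementation would iterate until convergence and use scaling for long sequences.

```python
# Toy starting model and observations (illustrative values, not from the lab).
pi = [0.5, 0.5]
A = [[0.9, 0.1],
     [0.2, 0.8]]
B = [[0.4, 0.1, 0.1, 0.4],
     [0.1, 0.4, 0.4, 0.1]]
obs = [0, 3, 0, 1, 2, 1, 2, 0]


def forward(obs, pi, A, B):
    n = len(pi)
    alpha = [[pi[j] * B[j][obs[0]] for j in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(n)) * B[j][o]
                      for j in range(n)])
    return alpha


def backward(obs, A, B):
    n, T = len(A), len(obs)
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(n))
                   for i in range(n)]
    return beta


def baum_welch_step(obs, pi, A, B):
    """Return (new_pi, new_A, new_B, p), where p = P(O | lambda) under the input model."""
    n, m, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(obs, pi, A, B), backward(obs, A, B)
    p = sum(alpha[-1])
    # gamma[t][i]: posterior probability of state i at step t
    gamma = [[alpha[t][i] * beta[t][i] / p for i in range(n)] for t in range(T)]
    # xi[t][i][j]: posterior probability of the transition i -> j at step t
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / p
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    new_pi = gamma[0][:]                                             # equation (1)
    new_A = [[sum(xi[t][i][j] for t in range(T - 1))
              / sum(gamma[t][i] for t in range(T - 1))
              for j in range(n)] for i in range(n)]                  # equation (2)
    new_B = [[sum(gamma[t][j] for t in range(T) if obs[t] == k)
              / sum(gamma[t][j] for t in range(T))
              for k in range(m)] for j in range(n)]                  # equation (3)
    return new_pi, new_A, new_B, p


pi1, A1, B1, p0 = baum_welch_step(obs, pi, A, B)
```

Because Baum-Welch is an expectation-maximization procedure, the likelihood $P(O \mid \lambda)$ should not decrease from one step to the next, which gives a simple check on an implementation.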

Exercise 1

The Bacteriophage lambda genome sequence (NCBI accession NC_001416) has long stretches of either very GC-rich sequence (mostly in the first half of the genome) or very AT-rich sequence (mostly in the second half of the genome). Use an HMM with two different states ("AT-rich" and "GC-rich") to infer which state of the HMM is most likely to have generated each nucleotide position in the Bacteriophage lambda genome sequence.

For the AT-rich state, set $p = (p_A, p_C, p_G, p_T) = (0.27, 0.2084, 0.198, 0.3236)$. For the GC-rich state, set $p = (0.2462, 0.2476, 0.2985, 0.2077)$. Set the probability of switching from the AT-rich state to the GC-rich state to be 0.0002, and the probability of switching from the GC-rich state to the AT-rich state to be 0.0002.

What is the most probable state path?
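One possible way to prepare the inputs for this exercise is sketched below: parse FASTA text, encode nucleotides as column indices, and write down the model parameters. The exercise does not specify initial state probabilities, so the uniform `pi` here is an assumption, as are the helper names `parse_fasta` and `encode`. Obtaining the genome file itself is left out.

```python
ALPHABET = "ACGT"                    # column order of the emission vectors
STATES = ["AT-rich", "GC-rich"]

pi = [0.5, 0.5]                      # ASSUMPTION: not given in the exercise
A = [[0.9998, 0.0002],               # AT-rich -> (AT-rich, GC-rich)
     [0.0002, 0.9998]]               # GC-rich -> (AT-rich, GC-rich)
B = [[0.27, 0.2084, 0.198, 0.3236],      # AT-rich: (p_A, p_C, p_G, p_T)
     [0.2462, 0.2476, 0.2985, 0.2077]]   # GC-rich: (p_A, p_C, p_G, p_T)


def parse_fasta(text):
    """Return the upper-cased sequence of the first record in FASTA-formatted text."""
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if chunks:               # a second header ends the first record
                break
            continue
        chunks.append(line.upper())
    return "".join(chunks)


def encode(seq, alphabet=ALPHABET):
    """Map each symbol to its index in `alphabet`; raises KeyError on other symbols."""
    index = {symbol: k for k, symbol in enumerate(alphabet)}
    return [index[symbol] for symbol in seq]


demo = parse_fasta(">demo record\nacgt\nTTGA\n")
obs = encode(demo)
# obs is [0, 1, 2, 3, 3, 3, 2, 0]
```

The encoded list can then be passed to a Viterbi implementation together with `pi`, `A` and `B`. Raising on unexpected symbols, rather than skipping them, keeps path positions aligned with genome coordinates.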

Exercise 2

Given an HMM with four different states ("A-rich", "C-rich", "G-rich" and "T-rich"), infer which state of the HMM is most likely to have generated each nucleotide position in the Bacteriophage lambda genome sequence.

For the A-rich state, set $p = (p_A, p_C, p_G, p_T) = (0.3236, 0.2084, 0.198, 0.27)$.
For the C-rich state, set $p = (0.2462, 0.2985, 0.2476, 0.2077)$.
For the G-rich state, set $p = (0.2462, 0.2476, 0.2985, 0.2077)$.
For the T-rich state, set $p = (0.27, 0.2084, 0.198, 0.3236)$.

Set the probability of switching between any two different states to be 6.666667e-05.

What is the most probable state path? Do you find differences between these results and the results from simply using a two-state HMM?
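The transition matrix for this exercise has the same switching probability in every off-diagonal cell, so the diagonal is whatever remains. A small sketch follows. The helper names and the mapping used to compare a four-state path with a two-state path are assumptions made for illustration, not part of the exercise.

```python
def uniform_switch_matrix(n_states, p_switch):
    """Transition matrix with p_switch off the diagonal and the remainder on it."""
    stay = 1.0 - (n_states - 1) * p_switch
    return [[stay if i == j else p_switch for j in range(n_states)]
            for i in range(n_states)]


# State order assumed here: 0 = A-rich, 1 = C-rich, 2 = G-rich, 3 = T-rich.
A4 = uniform_switch_matrix(4, 6.666667e-05)

# ASSUMED mapping for comparison with a two-state path
# (0 = AT-rich, 1 = GC-rich): A-rich and T-rich count as AT-rich.
FOUR_TO_TWO = {0: 0, 3: 0, 1: 1, 2: 1}


def agreement(path2, path4):
    """Fraction of positions where the collapsed four-state path matches path2."""
    matches = sum(1 for s2, s4 in zip(path2, path4) if FOUR_TO_TWO[s4] == s2)
    return matches / len(path2)
```

Since 3 × 6.666667e-05 is approximately 0.0002, the total probability of leaving a state is about the same as in a two-state model with switching probability 0.0002.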

Exercise 3

Make a two-state HMM to model protein sequence evolution, with "hydrophilic" and "hydrophobic" states.

For the hydrophilic state, set $p = (p_A, p_R, p_N, p_D, p_C, p_Q, p_E, p_G, p_H, p_I, p_L, p_K, p_M, p_F, p_P, p_S, p_T, p_W, p_Y, p_V) = (0.02, 0.068, 0.068, 0.068, 0.02, 0.068, 0.068, 0.068, 0.068, 0.012, 0.012, 0.012, 0.068, 0.02, 0.02, 0.068, 0.068, 0.068, 0.068, 0.012)$.
For the hydrophobic state, set $p = (0.114, 0.007, 0.007, 0.007, 0.114, 0.007, 0.007, 0.025, 0.007, 0.114, 0.114, 0.007, 0.114, 0.114, 0.025, 0.026, 0.026, 0.025, 0.026, 0.114)$.

Set the probability of switching between the two states to be 0.01. Now infer which state of the HMM is most likely to have generated each amino acid position in the human odorant receptor 5BF1 protein (UniProt accession Q8NHC7). What is the most probable state path?

The odorant receptor is a 7-transmembrane protein, meaning that it crosses the cell membrane seven times. As a consequence, the protein has 7 hydrophobic regions that cross the fatty cell membrane, and 7 hydrophilic segments that touch the watery cytoplasm and extracellular environments. What do you think are the coordinates in the protein of the seven transmembrane regions?
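To turn a decoded state path into region coordinates, consecutive positions with the same state can be merged into runs. The sketch below assumes the path is a list of state indices, with the hydrophobic state encoded as 1, and reports 1-based inclusive coordinates. The function name `segments` and the encoding are illustrative assumptions.

```python
from itertools import groupby


def segments(path, state):
    """Return 1-based inclusive (start, end) pairs for each run of `state` in `path`."""
    result = []
    position = 1
    for value, run in groupby(path):
        length = sum(1 for _ in run)
        if value == state:
            result.append((position, position + length - 1))
        position += length
    return result


# Made-up path for illustration: 0 = hydrophilic, 1 = hydrophobic.
example_path = [0, 0, 1, 1, 1, 0, 1]
regions = segments(example_path, 1)
# regions is [(3, 5), (7, 7)]
```

Applied to the most probable path for the receptor, the runs of the hydrophobic state are the candidate transmembrane regions. Very short runs may need a judgment call before being counted.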