Hidden Markov Models
Slides revised and adapted to Bioinformática 55, Engª Biomédica/IST 2005
Ana Teresa Freitas

CG-Islands
Given 4 nucleotides, the probability of occurrence of each is ~1/4. Thus, the probability of occurrence of a given dinucleotide is ~1/16. However, the frequencies of dinucleotides in DNA sequences vary widely. In particular, CG is typically underrepresented (the frequency of CG is typically < 1/16).
Why CG-Islands?
CG is the least frequent dinucleotide because the C in CG is easily methylated and then has a tendency to mutate into T. However, methylation is suppressed around genes in a genome, so CG appears at relatively high frequency within these CG islands. Finding the CG islands in a genome is therefore an important problem.

Markov Sequence Models
Markov sequence models are like little machines that generate sequences. Markov models can model sequences with specific probability distributions. In an n-order Markov sequence model, the probability distribution of the next letter depends on the previous n letters generated. Markov models of order 5 or more are often needed to model DNA sequences well.
Markov Sequence Models
Markov models are composed of states and transitions, drawn as boxes and arrows.
- States emit letters
- States have labels n letters long, showing the letter the state emits and the previous n-1 letters emitted by the model
- Transitions have probabilities
- Outgoing transitions from a state sum to 1

Markov Sequence Models
There is a 1-1 correspondence between sequences and paths through a (non-hidden) Markov model. The probability of a sequence is the product of the transition probabilities on the arcs. The 0-order model is a special case since it also has emission probabilities at its one state.
Example Markov Sequence Models
- 0-order model: 1 state, no labels
- 1st-order: 4 states, label = next letter
- 2nd-order: 20 states, label = previous+next letters
[State diagrams with begin state S, end state E, and transition probabilities such as Pr(A|A), Pr(T|G), Pr(G|AA); most transitions omitted for clarity]

Example: 1st-order
Let X = x_1 x_2 ... x_L be a sequence. Then the 1st-order Markov model below generates it with probability:

  P(X) = P(x_1|S) · Π_{i=2}^{L} P(x_i|x_{i-1}) · P(E|x_L)

Note: the model is in the start state S at t=0 and in the end state E at t=L+1. [State diagram with transitions P(A|S), P(T|A), P(C|G), P(E|T), etc.; transitions omitted for clarity]
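The 1st-order chain probability above can be evaluated directly. The sketch below is illustrative only: the transition values are hypothetical (the slide's diagram gives no numbers), `S` stands for the start state, and the end transition is omitted for simplicity.

```python
# Hypothetical 1st-order transition table P(next | previous); "S" is the start state.
# These numbers are made up for illustration, not taken from the slides.
trans = {
    "S": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "A": {"A": 0.4, "C": 0.1, "G": 0.2, "T": 0.3},
    "C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "G": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "T": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

def chain_prob(x, trans):
    """P(X) = P(x_1|S) * prod_{i>=2} P(x_i | x_{i-1}), end transition omitted."""
    p = trans["S"][x[0]]
    for prev, cur in zip(x, x[1:]):
        p *= trans[prev][cur]
    return p

print(chain_prob("ACGT", trans))  # 0.3 * 0.1 * 0.3 * 0.2
```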
Problem: Identifying a CpG island
Input: A short DNA sequence X = (x_1,...,x_L) ∈ Σ* (where Σ = {A,C,G,T}).
Question: Decide whether X is a CpG island.
Use two Markov chain models:
- One for dealing with CpG islands (the + model)
- Another for dealing with non-CpG islands (the − model)

Problem: Identifying a CpG island
Let a+_{st} denote the transition probability from s to t (s,t ∈ Σ) inside a CpG island.
Let a−_{st} denote the transition probability from s to t (s,t ∈ Σ) outside a CpG island.
The log-likelihood score for a sequence X is:

  Score(X) = log [ P(X | CpG) / P(X | non-CpG) ] = Σ_{i=1}^{L} log ( a+_{x_{i-1},x_i} / a−_{x_{i-1},x_i} )

The higher this score, the more likely it is that X is a CpG island.
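The score above can be sketched with two small transition tables. The +/− values here are hypothetical (real tables are estimated from annotated training sequences), and pairs not tabulated fall back to a uniform probability, purely to keep the example short.

```python
import math

# Illustrative +/- transition probabilities (hypothetical values):
a_plus  = {("C", "G"): 0.274, ("G", "C"): 0.339}   # inside islands: CG common
a_minus = {("C", "G"): 0.078, ("G", "C"): 0.246}   # outside islands: CG rare
DEFAULT = 0.25  # fall back to a uniform transition for pairs we did not tabulate

def cpg_score(x):
    """Score(X) = sum_i log( a+_{x_{i-1},x_i} / a-_{x_{i-1},x_i} )."""
    s = 0.0
    for prev, cur in zip(x, x[1:]):
        s += math.log(a_plus.get((prev, cur), DEFAULT) /
                      a_minus.get((prev, cur), DEFAULT))
    return s

print(cpg_score("CGCG"))  # positive: looks like a CpG island
```

A positive score favors the + model, a negative score the − model, exactly as on the slide.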
Problem: Locating CpG islands in a DNA sequence
Input: A long DNA sequence X = (x_1,...,x_L) ∈ Σ* (where Σ = {A,C,G,T}).
Question: Locate the CpG islands along X.
The naive approach:
- Extract a sliding window X_k = (x_{k+1},...,x_{k+l}) of a given length l (l << L) from the sequence X
- Calculate Score(X_k) for each one of the resulting subsequences
- Subsequences that receive positive scores are potential CpG islands

Problem: Locating CpG islands in a DNA sequence
Disadvantages:
- We have no information about the lengths of the islands
- We must assume that the islands are at least l nucleotides long
- Different windows may classify the same position differently
A better solution: a unified model.
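The naive sliding-window scan can be sketched as follows. `toy_score` is a hypothetical stand-in scorer (a real one would sum log(a+/a−) over the window's transitions, as on the previous slide); its threshold of 0.5 is likewise made up for illustration.

```python
def toy_score(window):
    """Hypothetical stand-in scorer: counts CG dinucleotides in the window.
    Positive iff the window contains at least one CG pair."""
    cg = sum(1 for a, b in zip(window, window[1:]) if a + b == "CG")
    return cg - 0.5

def sliding_windows(x, l, score):
    """Apply score() to every length-l window of x; positive windows are candidates."""
    return [(k, score(x[k:k + l])) for k in range(len(x) - l + 1)]

# Window start positions whose score is positive (potential islands):
hits = [k for k, s in sliding_windows("ATCGCGATAT", 4, toy_score) if s > 0]
print(hits)
```

Note how windows starting at several consecutive positions all score positive here: this is exactly the "different windows may classify the same position differently" problem that motivates the unified HMM.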
CG Islands and the Fair Bet Casino
The CG islands problem can be modeled after a problem named The Fair Bet Casino. The game is to flip coins, which results in only two possible outcomes: Head or Tail. The Fair coin gives Heads and Tails with the same probability ½. The Biased coin gives Heads with probability ¾.

The Fair Bet Casino (cont'd)
Thus, we define the probabilities:
P(H|F) = P(T|F) = ½
P(H|B) = ¾, P(T|B) = ¼
The crooked dealer changes between the Fair and Biased coins with probability 10%.
The Fair Bet Casino Problem
Input: A sequence x = x_1 x_2 x_3 ... x_n of coin tosses made by two possible coins (F or B).
Output: A sequence π = π_1 π_2 π_3 ... π_n, with each π_i being either F or B, indicating that x_i is the result of tossing the Fair or Biased coin respectively.

Problem: Fair Bet Casino Problem
Any observed outcome of coin tosses could have been generated by any sequence of states! We need to incorporate a way to grade different sequences differently: the Decoding Problem.
P(x|fair coin) vs. P(x|biased coin)
Suppose first that the dealer never changes coins. Some definitions:
P(x|fair coin): prob. of the dealer using the F coin and generating the outcome x.
P(x|biased coin): prob. of the dealer using the B coin and generating the outcome x.

P(x|fair coin) vs. P(x|biased coin)

  P(x|fair coin) = P(x_1...x_n|fair coin) = Π_{i=1}^{n} p(x_i|fair coin) = (1/2)^n

  P(x|biased coin) = P(x_1...x_n|biased coin) = Π_{i=1}^{n} p(x_i|biased coin) = (3/4)^k (1/4)^{n-k} = 3^k/4^n

k — the number of Heads in x.
P(x|fair coin) vs. P(x|biased coin)

  P(x|fair coin) = P(x|biased coin)
  1/2^n = 3^k/4^n
  2^n = 3^k
  n = k log₂3

when k = n / log₂3 (k ~ 0.67n).
If k < 0.67n, the dealer most likely used a fair coin.

Log-odds Ratio
We define the log-odds ratio as follows:

  log₂( P(x|fair coin) / P(x|biased coin) ) = Σ_{i=1}^{n} log₂( p⁺(x_i) / p⁻(x_i) ) = n − k log₂3
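The closed forms (1/2)^n and 3^k/4^n and the log-odds expression n − k·log₂3 can be cross-checked numerically; this sketch assumes the encoding 1 = Heads, 0 = Tails.

```python
import math

def log_odds(x):
    """log2( P(x|fair) / P(x|biased) ) = n - k*log2(3), k = number of Heads in x."""
    n, k = len(x), x.count("1")
    return n - k * math.log2(3)

# Cross-check against the closed forms (1/2)^n and 3^k / 4^n:
x = "0101110100"
n, k = len(x), x.count("1")
direct = math.log2((0.5 ** n) / (3 ** k / 4 ** n))
print(abs(direct - log_odds(x)) < 1e-9)  # the two expressions agree
```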
Computing Log-odds Ratio in Sliding Windows
Consider a sliding window of the outcome sequence x_1 x_2 x_3 ... x_n and find the log-odds ratio for this short window. [Plot of the log-odds value along the sequence: values above 0 indicate the biased coin was most likely used; values below 0, the fair coin]

Hidden Markov Models: HMMs
There is no 1-1 correspondence between states and (sequence) symbols. You can't always tell what state emitted a particular character: states are hidden. There can be more than one path that generates any given sequence. This allows us to model more complex sequence distributions.
Hidden Markov Model (HMM)
Can be viewed as an abstract machine with k hidden states that emits symbols from an alphabet Σ. Each state has its own probability distributions, and the machine switches between states and emits symbols according to them. While in a certain state, the machine makes 2 decisions:
- What state should I move to next?
- What symbol from the alphabet Σ should I emit?

Why Hidden?
Observers can see the emitted symbols of an HMM but have no way to know which state the HMM is currently in. Thus, the goal is to infer the most likely hidden states of an HMM based on the given sequence of emitted symbols.
Example: CpG Island HMM
CpG island DNA states: large C, G transition probabilities. Normal DNA states: small C, G transition probabilities. [Diagram: a CpG Island sub-model with states A+, C+, G+, T+ and a Normal DNA sub-model with states A−, C−, G−, T−, connected to begin (B) and end (E) states; transitions such as a+_{CG}, a+_{GC}, a−_{CG}, a−_{GC} shown; most transitions omitted for clarity]

Example: Modeling CpG Islands
In a CpG island, the probability of a C following a G is much higher than in normal intragenic DNA sequence. We can construct an HMM to model this by combining two HMMs: one for normal sequence and one for CpG island sequence. Transitions between the two sub-models allow the model to switch between CpG island and normal DNA. Because there is more than one state that can generate a given character, the states are hidden when you just see the sequence. For example, a C can be generated by either the C+ or the C− state in this model.
HMM Parameters
Σ: set of emission characters.
Ex.: Σ = {H, T} or {0, 1} for coin tossing; Σ = {A,C,T,G} for DNA
Q: set of hidden states, each emitting symbols from Σ.
Q = {F, B} for coin tossing

HMM Parameters (cont'd)
A = (a_kl): a |Q| x |Q| matrix of the probability of changing from state k to state l.
a_FF = 0.9, a_FB = 0.1, a_BF = 0.1, a_BB = 0.9
E = (e_k(b)): a |Q| x |Σ| matrix of the probability of emitting symbol b while being in state k.
e_F(0) = ½, e_F(1) = ½, e_B(0) = ¼, e_B(1) = ¾
HMM for Fair Bet Casino
The Fair Bet Casino in HMM terms:
Σ = {0, 1} (0 for Tails and 1 for Heads)
Q = {F, B}: F for the Fair and B for the Biased coin.

Transition probabilities A:
          Fair          Biased
  Fair    a_FF = 0.9    a_FB = 0.1
  Biased  a_BF = 0.1    a_BB = 0.9

Emission probabilities E:
          Tails(0)      Heads(1)
  Fair    e_F(0) = ½    e_F(1) = ½
  Biased  e_B(0) = ¼    e_B(1) = ¾

HMM for Fair Bet Casino (cont'd)
HMM model for the Fair Bet Casino Problem.
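The parameter tables above translate directly into data. A minimal sketch, with a sanity check that each row of A and E is a probability distribution:

```python
# Fair Bet Casino HMM parameters, as plain dicts keyed by (state, state) / (state, symbol).
states = ("F", "B")
A = {("F", "F"): 0.9, ("F", "B"): 0.1, ("B", "F"): 0.1, ("B", "B"): 0.9}
E = {("F", "0"): 0.5, ("F", "1"): 0.5, ("B", "0"): 0.25, ("B", "1"): 0.75}

# Sanity checks: outgoing transitions and emissions from each state sum to 1.
for s in states:
    assert abs(sum(A[(s, t)] for t in states) - 1.0) < 1e-12
    assert abs(sum(E[(s, b)] for b in "01") - 1.0) < 1e-12
print("parameters consistent")
```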
Hidden Paths
A path π = π_1...π_n in the HMM is defined as a sequence of states. Consider the path π = FFFBBBBBFFF and the sequence x = 01011101001.

  x                0    1    0    1    1    1    0    1    0    0    1
  π                F    F    F    B    B    B    B    B    F    F    F
  P(x_i|π_i)       ½    ½    ½    ¾    ¾    ¾    ¼    ¾    ½    ½    ½
  P(π_{i-1}→π_i)   ½   9/10 9/10 1/10 9/10 9/10 9/10 9/10 1/10 9/10 9/10

P(x_i|π_i): probability that x_i was emitted from state π_i.
P(π_{i-1}→π_i): transition probability from state π_{i-1} to state π_i.

P(x|π) Calculation
P(x|π): probability that the sequence x was generated by the path π:

  P(x|π) = P(π_0 → π_1) · Π_{i=1}^{n} P(x_i|π_i) · P(π_i → π_{i+1})
         = a_{π_0,π_1} · Π_{i=1}^{n} e_{π_i}(x_i) · a_{π_i,π_{i+1}}

π_0 and π_{n+1} represent the fictitious initial and terminal states (begin and end).
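The table and formula above can be checked by multiplying out the probabilities. This sketch uses the Fair Bet Casino parameters, takes the initial begin transition as ½ (as in the table), and omits the end transition to keep the example short.

```python
# Fair Bet Casino parameters.
A = {("F", "F"): 0.9, ("F", "B"): 0.1, ("B", "F"): 0.1, ("B", "B"): 0.9}
E = {("F", "0"): 0.5, ("F", "1"): 0.5, ("B", "0"): 0.25, ("B", "1"): 0.75}

def path_prob(x, pi, p_start=0.5):
    """P(x|pi) = p_start * prod_i e_{pi_i}(x_i) * a_{pi_i,pi_{i+1}},
    end transition omitted (a simplification for this sketch)."""
    p = p_start
    for i, (sym, st) in enumerate(zip(x, pi)):
        p *= E[(st, sym)]           # emission probability at position i
        if i + 1 < len(pi):
            p *= A[(st, pi[i + 1])]  # transition into the next state
    return p

print(path_prob("01011101001", "FFFBBBBBFFF"))
```

The result equals ½ · (½)⁶ · (¾)⁴ · ¼ · (9/10)⁸ · (1/10)², the product of the table's columns.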
Three classic HMM problems
- Evaluation: given a model and an output sequence, what is the probability that the model generated that output?
- Decoding: given a model and an output sequence, what is the most likely state sequence (path) through the model that produced that sequence?
- Learning: given a model and a set of observed sequences, what should the model parameters be so that the model has a high probability of generating those sequences?

Decoding Problem
Goal: Find an optimal hidden path of states given observations.
Input: Sequence of observations x = x_1...x_n generated by an HMM M(Σ, Q, A, E).
Output: A path π that maximizes P(x|π) over all possible paths π.
Building Manhattan for the Decoding Problem
Andrew Viterbi used the Manhattan grid model to solve the Decoding Problem. Every choice of π = π_1...π_n corresponds to a path in the graph. The only valid direction in the graph is eastward. This graph has |Q|²(n−1) edges.

Edit Graph for the Decoding Problem
Decoding Problem vs. Alignment Problem
[Diagrams contrasting the valid directions in the alignment problem with the single eastward valid direction in the decoding problem]

Decoding Problem as Finding a Longest Path in a DAG
The Decoding Problem is reduced to finding a longest path in the directed acyclic graph (DAG) above.
Note: the length of a path is defined as the product of its edge weights, not the sum.
Decoding Problem (cont'd)
Every path in the graph has probability P(x|π). The Viterbi algorithm finds the path that maximizes P(x|π) among all possible paths. The Viterbi algorithm runs in O(n·|Q|²) time.

Decoding Problem: weights of edges
(k, i) → (l, i+1)
The weight w of this edge is given by: ???
Decoding Problem: weights of edges

  P(x|π) = Π_{i=0}^{n-1} e_{π_{i+1}}(x_{i+1}) · a_{π_i,π_{i+1}}

(k, i) → (l, i+1)
The weight w of this edge is given by: ??

Decoding Problem: weights of edges
The i-th term = e_{π_{i+1}}(x_{i+1}) · a_{π_i,π_{i+1}}
(k, i) → (l, i+1)
The weight w of this edge is given by: ?
Decoding Problem: weights of edges
The i-th term = e_l(x_{i+1}) · a_{kl} for π_i = k, π_{i+1} = l
(k, i) → (l, i+1)
The weight of this edge is w = e_l(x_{i+1}) · a_{kl}

Decoding Problem and Dynamic Programming
s_{k,i} is the probability of the most probable path for the prefix (x_1,...,x_i) ending in state k.

  s_{l,i+1} = max_{k∈Q} { s_{k,i} · (weight of the edge between (k,i) and (l,i+1)) }
            = max_{k∈Q} { s_{k,i} · a_{kl} · e_l(x_{i+1}) }
            = e_l(x_{i+1}) · max_{k∈Q} { s_{k,i} · a_{kl} }
Decoding Problem (cont'd)
Initialization: s_{begin,0} = 1, s_{k,0} = 0 for k ≠ begin.
For each i = 0,...,L−1 and for each l ∈ Q, recursively calculate:

  s_{l,i+1} = e_l(x_{i+1}) · max_{k∈Q} { s_{k,i} · a_{kl} }

Let π* be the optimal path. Then P(x|π*) = max_{k∈Q} { s_{k,n} · a_{k,end} }

Viterbi Algorithm
The value of the product can become extremely small, which leads to underflow. To avoid underflow, use log values instead:

  s_{l,i+1} = log e_l(x_{i+1}) + max_{k∈Q} { s_{k,i} + log(a_{kl}) }

Initialization: s_{begin,0} = 0, s_{k,0} = −∞ for k ≠ begin.
The score of the best path: Score(x, π*) = max_{k∈Q} { s_{k,n} + log(a_{k,end}) }
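The log-space recurrence above can be sketched for the Fair Bet Casino HMM. As assumptions to keep the sketch short, the begin transition is taken as ½ into either state and the end transition is omitted.

```python
import math

# Fair Bet Casino parameters.
states = ("F", "B")
A = {("F", "F"): 0.9, ("F", "B"): 0.1, ("B", "F"): 0.1, ("B", "B"): 0.9}
E = {("F", "0"): 0.5, ("F", "1"): 0.5, ("B", "0"): 0.25, ("B", "1"): 0.75}

def viterbi(x):
    """Most probable state path for observation string x, via log-space Viterbi."""
    # s[l] = log-prob of the best path for the prefix seen so far, ending in state l.
    s = {l: math.log(0.5) + math.log(E[(l, x[0])]) for l in states}
    back = []  # back[j][l] = best predecessor of state l at position j+1
    for sym in x[1:]:
        prev, s, ptr = s, {}, {}
        for l in states:
            k_best = max(states, key=lambda k: prev[k] + math.log(A[(k, l)]))
            s[l] = math.log(E[(l, sym)]) + prev[k_best] + math.log(A[(k_best, l)])
            ptr[l] = k_best
        back.append(ptr)
    # Trace back the most probable path from the best final state.
    path = [max(states, key=lambda l: s[l])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return "".join(reversed(path))

print(viterbi("01011101001"))
```

On a run of Tails the decoder stays in F, and on a run of Heads it switches to B, as the log-odds analysis earlier predicts.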
Viterbi Algorithm
Complexity:
- We calculate the values of O(|Q|·L) cells of the matrix
- We spend O(|Q|) operations per cell
- Overall time complexity: O(L·|Q|²)
- Space complexity: O(L·|Q|)