I529: Machine Learning in Bioinformatics (Spring 2013)
Hidden Markov Models
Yuzhen Ye, School of Informatics and Computing, Indiana University, Bloomington

Outline
Review of Markov chain & CpG island
HMM: three questions & three algorithms
  Q1: most probable state path - Viterbi algorithm
  Q2: probability of a sequence p(x) - Forward algorithm
  Q3: posterior decoding (the distribution of S_i, given x) - Forward/backward algorithm
Applications
  CpG island (problem 2)
  Eukaryotic gene finding (Genscan)
Generalized HMM (GHMM)
  A state emits a string according to a probability distribution
  Viterbi decoding for GHMM

1st order Markov chain
An integer time stochastic process, consisting of a set of m>1 states {s_1,...,s_m} and
1. An m-dimensional initial distribution vector (p(s_1),...,p(s_m)).
2. An m x m transition probabilities matrix M = (a_{s_i s_j}).
For example, for a DNA sequence the states are {A, C, T, G}, p(A) is the probability that A is the 1st letter in the sequence, and a_AG is the probability that G follows A in the sequence.

Example: CpG Island
We consider two questions (and some variants):
Question 1: Given a short stretch of genomic data, does it come from a CpG island?
Question 2: Given a long piece of genomic data, does it contain CpG islands, and if so, where and how long?
We solve the first question by modeling sequences with and without CpG islands as Markov chains over the same states {A,C,G,T} but with different transition probabilities.

Question 2: Finding CpG Islands
Given a long genomic string with possible CpG islands, we define a Markov chain over 8 states, all interconnected: A+, C+, G+, T+, A-, C-, G-, T-.
The problem is that we don't know the sequence of states which are traversed, but just the sequence of letters. Therefore we use a Hidden Markov Model.

The fair bet casino problem
The game is to flip coins, which results in only two possible outcomes: Head or Tail.
The Fair coin will give Heads and Tails with the same probability ½.
The Biased coin will give Heads with probability ¾.
Thus, we define the probabilities:
P(H|F) = P(T|F) = ½
P(H|B) = ¾, P(T|B) = ¼
The crooked dealer changes between Fair and Biased coins with probability 0.1.
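
To make Question 1 concrete, the following minimal sketch (Python is used here only for illustration; it is not part of the original slides) scores a short stretch of DNA under two first-order Markov chains and compares log-likelihoods. The transition values are invented placeholders, not estimates trained on real CpG-island data.

```python
# Question 1 sketch: score a sequence under a "+" (CpG island) chain and a "-"
# (background) chain and compare log-likelihoods. Values below are placeholders.
import math

PLUS = {  # illustrative a[s][t] = P(next = t | current = s) inside CpG islands
    "A": {"A": 0.18, "C": 0.27, "G": 0.43, "T": 0.12},
    "C": {"A": 0.17, "C": 0.37, "G": 0.27, "T": 0.19},
    "G": {"A": 0.16, "C": 0.34, "G": 0.38, "T": 0.12},
    "T": {"A": 0.08, "C": 0.36, "G": 0.38, "T": 0.18},
}
MINUS = {  # illustrative background transition probabilities
    "A": {"A": 0.30, "C": 0.21, "G": 0.28, "T": 0.21},
    "C": {"A": 0.32, "C": 0.30, "G": 0.08, "T": 0.30},
    "G": {"A": 0.25, "C": 0.24, "G": 0.30, "T": 0.21},
    "T": {"A": 0.18, "C": 0.24, "G": 0.29, "T": 0.29},
}

def log_odds(seq):
    """Sum of log P+(x_i | x_{i-1}) - log P-(x_i | x_{i-1}); > 0 favors the CpG-island model."""
    score = 0.0
    for prev, curr in zip(seq, seq[1:]):
        score += math.log(PLUS[prev][curr]) - math.log(MINUS[prev][curr])
    return score

print(log_odds("CGCGCGTATACGCG"))  # positive score -> more CpG-island-like
```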

The fair bet casino problem
Input: A sequence x = x_1 x_2 x_3 ... x_n of coin tosses made by two possible coins (F or B).
Output: A sequence S = s_1 s_2 s_3 ... s_n, with each s_i being either F or B, indicating that x_i is the result of tossing the Fair or Biased coin, respectively.

Hidden Markov model (HMM)
An HMM can be viewed as an abstract machine with k hidden states that emits symbols from an alphabet Σ. Each state has its own probability distribution, and the machine switches between states according to this probability distribution.
While in a certain state, the machine makes two decisions:
What state should I move to next?
What symbol from the alphabet Σ should I emit?

Parameters defining an HMM
An HMM consists of a Markov chain over a set of (hidden) states S_1, S_2, ..., S_{L-1}, S_L emitting symbols x_1, x_2, ..., x_{L-1}, x_L, and, for each state s and observable symbol x, an emission probability p(x_i = x | S_i = s).
The set of parameters defining an HMM:
Markov chain initial probabilities: p(s_1 = t) = b_t, i.e., p(s_1 | s_0) = p(s_1)
Markov chain transition probabilities: p(s_{i+1} = t | s_i = s) = a_st
Emission probabilities: p(x_i = b | s_i = s) = e_s(b)

Why hidden?
Observers can see the emitted symbols of an HMM but have no ability to know which state the HMM is currently in. Thus, the goal is to infer the most likely hidden states of an HMM based on the given sequence of emitted symbols.

Example: CpG island
Question 2: Given a long piece of genomic data, does it contain CpG islands, and if so, where and how long?
Hidden Markov Model: this seems a straightforward model (but we will discuss later why this model is NOT good).
Hidden states: {+, -}
Observable symbols: {A, C, G, T}

The probability of the full chain in HMM
For the full chain in the HMM we assume the probability
$$p(s, x) = p(s_1, \ldots, s_L, x_1, \ldots, x_L) = \prod_{i=1}^{L} p(s_i \mid s_{i-1})\, p(x_i \mid s_i) = \prod_{i=1}^{L} a_{s_{i-1} s_i}\, e_{s_i}(x_i)$$
The probability distribution over all sequences of length L sums to one:
$$\sum_{(s_1,\ldots,s_L;\, x_1,\ldots,x_L)} p(s_1, \ldots, s_L; x_1, \ldots, x_L) = \sum_{(s_1,\ldots,s_L;\, x_1,\ldots,x_L)} \prod_{i=1}^{L} a_{s_{i-1} s_i}\, e_{s_i}(x_i) = 1$$
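
The joint probability p(s, x) above can be computed directly by multiplying transition and emission terms along the path. A minimal sketch, assuming the fair bet casino parameters described earlier (switch probability 0.1, biased coin emitting Heads with probability 3/4):

```python
# Joint probability p(s, x) = p(s_1) e_{s_1}(x_1) * prod a_{s_{i-1} s_i} e_{s_i}(x_i)
# for one hidden path and one observed sequence of the fair bet casino HMM.
init = {"F": 0.5, "B": 0.5}                 # p(s_1)
trans = {"F": {"F": 0.9, "B": 0.1},         # a_{st}
         "B": {"F": 0.1, "B": 0.9}}
emit = {"F": {"H": 0.5, "T": 0.5},          # e_s(x)
        "B": {"H": 0.75, "T": 0.25}}

def joint_probability(states, symbols):
    """p(s_1..s_L, x_1..x_L) for a given hidden path and observed sequence."""
    p = init[states[0]] * emit[states[0]][symbols[0]]
    for prev, curr, x in zip(states, states[1:], symbols[1:]):
        p *= trans[prev][curr] * emit[curr][x]
    return p

print(joint_probability("FFBB", "HTHH"))
```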

Three common questions
Given an HMM with hidden states S_1, ..., S_L emitting x_1, ..., x_L, and given the visible observation sequence x = (x_1,...,x_L), there are 3 questions of interest. Find:
1. A most probable (hidden) path
2. The probability of x
3. For each i = 1,..,L and for each state k, the probability that s_i = k

Q1. Most probable state path
Given an output sequence x = (x_1,...,x_L), a most probable path s* = (s*_1,...,s*_L) is one which maximizes p(s | x):
$$s^* = (s_1^*, \ldots, s_L^*) = \arg\max_{s} p(s \mid x_1, \ldots, x_L)$$
Since
$$p(s \mid x) = \frac{p(s, x)}{p(x)} = \alpha\, p(s, x)$$
and p(x) does not depend on s, we need to find the s which maximizes p(s, x).

Viterbi algorithm
The task: compute argmax_s p(s; x_1,...,x_L). Let the states be {1,...,m}.
Idea: for i = 1,...,L and for each state l, compute
v_l(i) = the probability p(s_1,..,s_i; x_1,..,x_i | s_i = l) of a most probable path up to i which ends in state l.
For i = 1,...,L and for each state l we have
$$v_l(i) = e_l(x_i)\, \max_{k} \{ v_k(i-1)\, a_{kl} \}$$

Viterbi algorithm
Add the special initial state 0.
Initialization: v_0(0) = 1, v_k(0) = 0 for k > 0
For i = 1 to L do, for each state l:
  v_l(i) = e_l(x_i) max_k {v_k(i-1) a_kl}
  ptr_i(l) = argmax_k {v_k(i-1) a_kl}   // storing the previous state for retrieving the path
Termination: s_L* = argmax_k {v_k(L)}
Result: p(s_1*,...,s_L*; x_1,...,x_L), where s_{i-1}* = ptr_i(s_i*)

Example: a fair casino problem
HMM: hidden states {F(air), L(oaded)}, observation symbols {F(ace), B(ack)}
Transition probabilities: the dealer switches between F and L with probability 0.1, as above
Emission probabilities: P(F|F) = P(B|F) = ½; P(F|L) = ¾, P(B|L) = ¼
Initial prob.: P(F) = P(L) = ½
Find the most likely state sequence for the observation sequence: HHH
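
A minimal sketch of this Viterbi recursion, including the traceback pointers. The parameters below (Head/Tail symbols, 0.9/0.1 transitions, a 3/4-heads loaded coin) are illustrative stand-ins rather than the exact values in the slide's table, and log-probabilities are used to avoid numerical underflow.

```python
# Viterbi decoding: v_l(i) = e_l(x_i) * max_k v_k(i-1) a_kl, with traceback pointers.
import math

states = ["F", "L"]
init = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.1, "L": 0.9}}
emit = {"F": {"H": 0.5, "T": 0.5}, "L": {"H": 0.75, "T": 0.25}}

def viterbi(x):
    v = [{s: math.log(init[s]) + math.log(emit[s][x[0]]) for s in states}]
    ptr = [{}]
    for i in range(1, len(x)):
        v.append({})
        ptr.append({})
        for l in states:
            best_k = max(states, key=lambda k: v[i - 1][k] + math.log(trans[k][l]))
            ptr[i][l] = best_k  # remember which previous state gave the maximum
            v[i][l] = math.log(emit[l][x[i]]) + v[i - 1][best_k] + math.log(trans[best_k][l])
    # Termination: best final state, then follow the pointers backwards.
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for i in range(len(x) - 1, 0, -1):
        path.append(ptr[i][path[-1]])
    return "".join(reversed(path)), v[-1][last]  # (most probable path, its log-probability)

print(viterbi("HHHTTTTHHH"))
```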

Q2. Computing p(x)
Given an output sequence x = (x_1,...,x_L), compute the probability that this sequence was generated by the given HMM:
$$p(x) = \sum_{s} p(x, s)$$
with the summation taken over all state paths s generating x.

Forward algorithm
The task: compute p(x) = Σ_s p(x, s).
Idea: for i = 1,...,L and for each state l, compute
F_l(i) = p(x_1,...,x_i; s_i = l), the probability of all the paths which emit (x_1,..,x_i) and end in state s_i = l.
Recursive formula:
$$F_l(i) = e_l(x_i) \sum_{k} F_k(i-1)\, a_{kl}$$

Forward algorithm
Similar to the Viterbi algorithm (use a sum instead of a maximum):
Initialization: F_0(0) := 1, F_k(0) := 0 for k > 0
For i = 1 to L do, for each state l:
  F_l(i) = e_l(x_i) Σ_k F_k(i-1) a_kl
Result: p(x_1,...,x_L) = Σ_k F_k(L)

Q3. Distribution of S_i, given x
Given an output sequence x = (x_1,...,x_L), compute for each i = 1,...,L and for each state k the probability that s_i = k.
This helps to answer queries like: what is the probability that s_i is in a CpG island, etc.

Solution in two stages
1. For a fixed i and each state k, an algorithm to compute p(s_i = k | x_1,...,x_L).
2. An algorithm which performs this task for every i = 1,..,L, without repeating the first task L times.

Computing for a single i:
$$p(s_i \mid x_1, \ldots, x_L) = \frac{p(s_i, x_1, \ldots, x_L)}{p(x_1, \ldots, x_L)} = \alpha\, p(s_i, x_1, \ldots, x_L)$$
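
A minimal sketch of the forward recursion, reusing the same illustrative casino parameters; the only change relative to Viterbi is replacing the maximum with a sum.

```python
# Forward algorithm: F_l(i) = e_l(x_i) * sum_k F_k(i-1) a_kl, and p(x) = sum_k F_k(L).
init = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.1, "L": 0.9}}
emit = {"F": {"H": 0.5, "T": 0.5}, "L": {"H": 0.75, "T": 0.25}}
states = list(init)

def forward(x):
    """Return the table F[i][l] = p(x_1..x_{i+1}, s_{i+1} = l) (0-based i) and p(x)."""
    F = [{l: init[l] * emit[l][x[0]] for l in states}]
    for i in range(1, len(x)):
        F.append({l: emit[l][x[i]] * sum(F[i - 1][k] * trans[k][l] for k in states)
                  for l in states})
    return F, sum(F[-1][k] for k in states)

F, px = forward("HHHTTTT")
print(px)  # probability that the HMM emits this sequence, summed over all paths
```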

Computing for a single i
$$p(x_1, \ldots, x_L, s_i) = p(x_1, \ldots, x_i, s_i)\, p(x_{i+1}, \ldots, x_L \mid x_1, \ldots, x_i, s_i)$$
(by the equality p(A,B) = p(A) p(B|A)).
Here p(x_1,...,x_i, s_i) = F(s_i), which is computed by the forward algorithm.

F(s_i): the forward algorithm
Initialization: F(0) = 1
For i = 1 to L do, for each state s_i:
$$F(s_i) = e_{s_i}(x_i) \sum_{s_{i-1}} F(s_{i-1})\, a_{s_{i-1}, s_i}$$
The algorithm computes F(s_i) = P(x_1,...,x_i, s_i) for i = 1,...,L.

B(s_i): the backward algorithm
p(x_1,...,x_L, s_i) = p(x_1,...,x_i, s_i) p(x_{i+1},...,x_L | x_1,...,x_i, s_i)
We are left with the task to compute B(s_i) = p(x_{i+1},...,x_L | x_1,...,x_i, s_i), and get the desired result:
$$p(x_1, \ldots, x_L, s_i) = p(x_1, \ldots, x_i, s_i)\, p(x_{i+1}, \ldots, x_L \mid s_i) = F(s_i)\, B(s_i)$$

B(s_i): the backward algorithm
From the probability distribution of the hidden Markov chain and the definition of conditional probability:
$$B(s_i) = p(x_{i+1}, \ldots, x_L \mid x_1, \ldots, x_i, s_i) = p(x_{i+1}, \ldots, x_L \mid s_i) = \sum_{s_{i+1}} a_{s_i, s_{i+1}}\, e_{s_{i+1}}(x_{i+1})\, p(x_{i+2}, \ldots, x_L \mid s_{i+1})$$

B(s_i): the backward algorithm
The backward algorithm computes B(s_i) from the values of B(s_{i+1}) for all states s_{i+1}:
$$B(s_i) = p(x_{i+1}, \ldots, x_L \mid s_i) = \sum_{s_{i+1}} a_{s_i, s_{i+1}}\, e_{s_{i+1}}(x_{i+1})\, B(s_{i+1})$$

B(s_i): the backward algorithm
First step, step L-1: compute B(s_{L-1}) for each possible state s_{L-1}:
$$B(s_{L-1}) = p(x_L \mid s_{L-1}) = \sum_{s_L} a_{s_{L-1}, s_L}\, e_{s_L}(x_L)$$
For i = L-2 down to 1, for each possible state s_i, compute B(s_i) from the values of B(s_{i+1}):
$$B(s_i) = p(x_{i+1}, \ldots, x_L \mid s_i) = \sum_{s_{i+1}} a_{s_i, s_{i+1}}\, e_{s_{i+1}}(x_{i+1})\, B(s_{i+1})$$
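
A minimal sketch of the backward recursion under the same illustrative parameters; starting from the base case B(s_L) = 1, the first pass of the recursion reproduces the slide's step for B(s_{L-1}).

```python
# Backward algorithm: B(s_i) = sum_{s_{i+1}} a_{s_i s_{i+1}} e_{s_{i+1}}(x_{i+1}) B(s_{i+1}).
init = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.1, "L": 0.9}}
emit = {"F": {"H": 0.5, "T": 0.5}, "L": {"H": 0.75, "T": 0.25}}
states = list(init)

def backward(x):
    """Return the table B[i][l] = p(x_{i+2}..x_L | s_{i+1} = l) (0-based i)."""
    L = len(x)
    B = [dict() for _ in range(L)]
    B[L - 1] = {l: 1.0 for l in states}  # nothing remains to be emitted after position L
    for i in range(L - 2, -1, -1):
        B[i] = {l: sum(trans[l][k] * emit[k][x[i + 1]] * B[i + 1][k] for k in states)
                for l in states}
    return B

print(backward("HHHTTTT")[0])  # B(s_1) for each state
```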

The combined answer
1. To compute the probability that S_i = s_i given x = (x_1,...,x_L): run the forward algorithm to compute F(s_i) = P(x_1,...,x_i, s_i), run the backward algorithm to compute B(s_i) = P(x_{i+1},...,x_L | s_i); the product F(s_i) B(s_i) is the answer (for every possible value s_i).
2. To compute these probabilities for every s_i, simply run the forward and backward algorithms once, storing F(s_i) and B(s_i) for every i (and every value of s_i), and compute F(s_i) B(s_i) for every i.

Time and space complexity of the forward/backward algorithms
Time complexity is O(m^2 L), where m is the number of states. It is linear in the length of the chain, provided the number of states is a constant. Space complexity is O(mL), for storing the forward and backward tables.

Example: Finding CpG islands
Observed symbols: {A, C, G, T}
Hidden states: {+, -}
Transition probabilities: P(+|+), P(-|+), P(+|-), P(-|-)
Emission probabilities: P(A|+), P(C|+), P(G|+), P(T|+); P(A|-), P(C|-), P(G|-), P(T|-)
Bad model! It does not model the correlation between adjacent nucleotides!

Example: Finding CpG islands
Observed symbols: {A, C, G, T}
Hidden states: {A+, C+, G+, T+, A-, C-, G-, T-}
Emission probabilities: P(A|A+) = P(C|C+) = P(G|G+) = P(T|T+) = P(A|A-) = P(C|C-) = P(G|G-) = P(T|T-) = 1.0; else P(X|Y) = 0
Transition probabilities: 16 probabilities in the + model; 16 probabilities for the - model; 16 probabilities for transitions between the + and - models

Example: Eukaryotic gene finding
In eukaryotes, a gene is a combination of coding segments (exons) that are interrupted by non-coding segments (introns). This makes computational gene prediction in eukaryotes even more difficult. Prokaryotes don't have introns - genes in prokaryotes are continuous.

Example: Eukaryotic gene finding
On average, a vertebrate gene is about 30 kb long.
The coding region takes about 1 kb.
Exon sizes vary from double-digit numbers of bases to kilobases.
An average 5' UTR is about 750 bp.
An average 3' UTR is about 450 bp, but both can be much longer.
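
Combining the two tables gives the posterior decoding described above. A self-contained sketch (same placeholder casino parameters) that runs both recursions and normalizes by p(x):

```python
# Posterior decoding: p(s_i = k | x) = F_k(i) B_k(i) / p(x), combining forward and backward.
init = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.1, "L": 0.9}}
emit = {"F": {"H": 0.5, "T": 0.5}, "L": {"H": 0.75, "T": 0.25}}
states = list(init)

def posterior(x):
    L = len(x)
    # Forward table: F[i][l] = p(x_1..x_{i+1}, s_{i+1} = l)  (0-based positions).
    F = [{l: init[l] * emit[l][x[0]] for l in states}]
    for i in range(1, L):
        F.append({l: emit[l][x[i]] * sum(F[i - 1][k] * trans[k][l] for k in states)
                  for l in states})
    # Backward table: B[i][l] = p(x_{i+2}..x_L | s_{i+1} = l), with B[L-1][l] = 1.
    B = [dict() for _ in range(L)]
    B[L - 1] = {l: 1.0 for l in states}
    for i in range(L - 2, -1, -1):
        B[i] = {l: sum(trans[l][k] * emit[k][x[i + 1]] * B[i + 1][k] for k in states)
                for l in states}
    px = sum(F[L - 1][k] for k in states)  # p(x), the forward-algorithm result
    return [{k: F[i][k] * B[i][k] / px for k in states} for i in range(L)]

for i, dist in enumerate(posterior("HHHHHHHTTT"), start=1):
    print(i, dist)  # posterior probability of each hidden state at position i
```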

Central dogma and splicing
exon = coding, intron = non-coding
[Figure: a pre-mRNA with exon1, intron1, exon2, intron2, exon3 going through transcription, splicing (removal of the introns), and translation]

Splicing signals
Try to recognize the location of splicing signals at exon-intron junctions. This has yielded a weakly conserved donor splice site and acceptor splice site. Profiles for the sites are still weak, which lends the problem to Hidden Markov Model (HMM) approaches, which capture the statistical dependencies between sites.

Modeling splicing signals
[Figure: donor site, 5' to 3', with the position-specific frequencies (%) of A, C, G, T around the exon-intron junction]

Genscan model
Genscan considers the following:
Promoter signals
Polyadenylation signals
Splice signals
Probability of coding and non-coding DNA
Gene, exon and intron length
Chris Burge and Samuel Karlin, Prediction of Complete Gene Structures in Human Genomic DNA, JMB (1997) 268: 78-94.

Genscan model
States correspond to different functional units of a genome (promoter region, intron, exon, ...).
The states for introns and exons are subdivided according to three frames.
There are two symmetric sub-modules for the forward and backward strands.
Performance: 80% exon detection (but if a gene has more than one exon, the detection accuracy decreases rapidly).
Note: there is no edge pointing from a node to itself in the Markov chain model of Genscan. Why? Because Genscan uses the Generalized Hidden Markov Model (GHMM), instead of the regular HMM.

State duration: geometric distribution
In a regular HMM, the length distribution of a hidden state (also called its duration) always follows a geometric distribution: a state with self-transition probability p is occupied for exactly d steps with probability p^(d-1)(1-p). In reality, however, the length distribution may be different.
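
A one-line illustration of the geometric-duration point: with self-transition probability p, a regular HMM assigns duration d the probability p^(d-1)(1-p), which can only decay monotonically. The value p = 0.9 below is an arbitrary illustrative choice.

```python
# Why a regular HMM implies geometric state durations: staying in a state with
# self-transition probability p for exactly d steps has probability p**(d-1) * (1-p).
p = 0.9  # illustrative self-transition probability
duration_dist = {d: p ** (d - 1) * (1 - p) for d in range(1, 11)}
print(duration_dist)  # monotonically decreasing -- real exon lengths are not
```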

Generalized HMMs (hidden semi-Markov models)
Based on a variable-length Markov chain;
The (emitted) output of a state is a string of finite length;
For a given state, the output string and its length are generated according to a probability distribution;
Different states may have different length distributions.

GHMMs
A finite set Σ of hidden states
Initial state probability distribution b_t = p(s_0 = t)
Transition probabilities a_st = p(s_i = t | s_{i-1} = s) for s, t in Σ; a_tt = 0
*Length distributions f_t of the states t (f_t is the length distribution of state t)
*Probabilistic models for each state t, according to which output strings are generated upon visiting the state

Segmentation by GHMMs
A parse ϕ of an observation sequence X = (x_1,...,x_L) of length L is a sequence of hidden states (s_1,...,s_t) with an associated duration d_i for each state s_i, where d_1 + ... + d_t = L.
A parse represents a partition of X, and is equivalent to a hidden state sequence in an HMM.
Intuitively, a parse is an annotation of a sequence, matching each segment with a functional unit of a gene.

Let ϕ = (s_1,...,s_t) be a parse of sequence X, and let P(x_{q_i+1} x_{q_i+2} ... x_{q_i+d_i} | s_i) be the probability of generating x_{q_i+1} x_{q_i+2} ... x_{q_i+d_i} by the sequence generation model of state s_i with length d_i, where q_i = Σ_{j=1}^{i-1} d_j.
The probability of generating X based on ϕ is
$$P(x_1, \ldots, x_L; s_1, \ldots, s_t) = p(s_1)\, f_{s_1}(d_1)\, P(x_1, \ldots, x_{d_1} \mid s_1) \prod_{i=2}^{t} a_{s_{i-1} s_i}\, f_{s_i}(d_i)\, P(x_{q_i+1}, \ldots, x_{q_i+d_i} \mid s_i)$$
We have
$$P(\phi \mid X) = \frac{P(\phi, X)}{P(X)} = \frac{P(\phi, X)}{\sum_{\phi} P(\phi, X)}$$
summing over all parses ϕ of a sequence X of length L.

Viterbi decoding for GHMM
The task: compute argmax_s p(s; x_1,...,x_L). Let the states be {1,...,m}.
Idea: for i = 1,...,L and for each state l, compute
v_l(i) = the probability p(s_1,..,s_i; x_1,..,x_i | s_i = l) of a most probable parse up to position i which ends in state l.

Viterbi decoding for GHMM
For i = 1,...,L and for each state l we have
$$V_l(i) = \max\left\{ P(x_1 x_2 \ldots x_i \mid s_l)\, a_{0l},\ \max_{1 \le q < i}\ \max_{k \ne l} P(x_{q+1} x_{q+2} \ldots x_i \mid s_l)\, V_k(q)\, a_{kl} \right\}$$
Complexity: O(m^2 L^2)
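
A minimal sketch of this GHMM Viterbi recursion. The segment model here (an explicit duration distribution times i.i.d. per-symbol emissions) and the two-coin parameters are assumptions chosen to mirror the casino example on the next slide; they are not Genscan's actual segment models.

```python
# GHMM Viterbi: a segment ending at position i in state l either covers the whole
# prefix (weighted by the initial probability a_{0l}) or follows a segment that
# ended at some earlier position q in a different state k.
import math

states = ["F", "L"]
init = {"F": 0.5, "L": 0.5}                           # a_{0l}
trans = {"F": {"L": 1.0}, "L": {"F": 1.0}}            # a_tt = 0: the two states alternate
emit = {"F": {"H": 0.5, "T": 0.5}, "L": {"H": 0.75, "T": 0.25}}
dur = {"F": {2: 0.5, 3: 0.5}, "L": {2: 0.5, 3: 0.5}}  # f_t(d); any other length has probability 0

def segment_logp(state, segment):
    """log P(segment | state): explicit duration probability times i.i.d. emissions."""
    d = len(segment)
    if dur[state].get(d, 0.0) == 0.0:
        return float("-inf")
    return math.log(dur[state][d]) + sum(math.log(emit[state][c]) for c in segment)

def ghmm_viterbi(x):
    L = len(x)
    V = [dict() for _ in range(L + 1)]     # V[i][l]: best log-prob of a parse of x[:i] ending in state l
    back = [dict() for _ in range(L + 1)]  # back[i][l] = (q, k): previous segment boundary and state
    for i in range(1, L + 1):
        for l in states:
            # Case 1: the whole prefix x[:i] is one segment emitted by state l.
            best, arg = math.log(init[l]) + segment_logp(l, x[:i]), (0, None)
            # Case 2: a segment in state k ends at q, then state l emits x[q:i].
            for q in range(1, i):
                for k in states:
                    a = trans[k].get(l, 0.0)
                    if k == l or a == 0.0:
                        continue
                    cand = V[q][k] + math.log(a) + segment_logp(l, x[q:i])
                    if cand > best:
                        best, arg = cand, (q, k)
            V[i][l], back[i][l] = best, arg
    # Trace back the best parse as (state, segment length) pairs.
    l = max(states, key=lambda s: V[L][s])
    best_logp, parse, i = V[L][l], [], L
    while i > 0:
        q, k = back[i][l]
        parse.append((l, i - q))
        i, l = q, (k if k is not None else l)
    return list(reversed(parse)), best_logp
```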

Example: a fair casino problem (GHMM)
GHMM: hidden states {F(air), L(oaded)}, observation symbols {H(ead), T(ail)}
Transition probabilities between F and L (since a_tt = 0 and there are only two states, a_FL = a_LF = 1)
Length distribution: f_F(2) = f_F(3) = 1/2; the probability of any other length is 0
Emission probabilities: P(H|F) = P(T|F) = 1/2; P(H|L) = 3/4, P(T|L) = 1/4
Initial prob.: P(F) = P(L) = 1/2
Find the most likely hidden state sequence for the observation sequence HHHH.
S* = FFLL or LLFF
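
As a quick check, running the ghmm_viterbi sketch from the previous section (with its assumed loaded-coin length and emission values, which this slide does not fully specify) on the observation HHHH returns one of the two tied parses.

```python
# Using the ghmm_viterbi sketch defined above on the slide's example sequence.
parse, logp = ghmm_viterbi("HHHH")
print(parse, logp)  # e.g. [('L', 2), ('F', 2)] -> state sequence LLFF (FFLL scores the same)
```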