Hidden Markov Models in Bioinformatics
14.11, 60 min

Definition
Three Key Algorithms
  Summing over Unknown States
  Most Probable Unknown States
  Marginalizing Unknown States
Key Bioinformatic Applications
  Pedigree Analysis
  Isochores in Genomes (CG-rich regions)
  Profile HMM Alignment
  Fast/Slowly Evolving States
  Secondary Structure Elements in Proteins
  Gene Finding
  Statistical Alignment

Hidden Markov Models

$(O_1, H_1), (O_2, H_2), \ldots, (O_n, H_n)$ is a sequence of stochastic variables with two components: one that is observed ($O_i$) and one that is hidden ($H_i$).

The marginal distribution of the $H_i$ is described by a homogeneous Markov chain:
  Let $p_{i,j} = P(H_{k+1} = j \mid H_k = i)$.
  Let $\pi_i = P(H_1 = i)$; often $\pi_i$ is the equilibrium distribution of the Markov chain.

Conditional on $H_k$ (all $k$), the $O_k$ are independent.

The distribution of $O_k$ depends only on the value of $H_k$ and is called the emit function: $e(i, j) = P(O_k = i \mid H_k = j)$.

[Figure: observations $O_1, \ldots, O_{10}$ and hidden states 1, 2, 3]

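
The definition above can be read as a recipe for generating data: draw $H_1$ from $\pi$, emit $O_1$, step the chain with $p$, and repeat. The following is a minimal sketch of that recipe, not code from the lecture. The function name `sample_hmm`, the dictionary layout, and the transition values are assumptions made for illustration; the emission values are borrowed from the isochore example later in the deck.

```python
import random


def sample_hmm(pi, p, e, n, rng):
    """Draw hidden states H_1..H_n and observations O_1..O_n.

    pi[j]   : P(H_1 = j)
    p[i][j] : P(H_{k+1} = j | H_k = i)
    e[j][o] : P(O_k = o | H_k = j), i.e. the emit function e(o, j)
    """
    states = list(pi)
    hidden, observed = [], []
    h = rng.choices(states, weights=[pi[s] for s in states])[0]
    for _ in range(n):
        hidden.append(h)
        symbols = list(e[h])
        # The observation depends only on the current hidden state.
        observed.append(rng.choices(symbols, weights=[e[h][o] for o in symbols])[0])
        # The next hidden state depends only on the current one.
        h = rng.choices(states, weights=[p[h][s] for s in states])[0]
    return hidden, observed


# Hypothetical parameters: transition probabilities are made up.
PI = {"poor": 0.5, "rich": 0.5}
P = {"poor": {"poor": 0.9, "rich": 0.1}, "rich": {"poor": 0.1, "rich": 0.9}}
E = {
    "poor": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}

hidden, observed = sample_hmm(PI, P, E, 10, random.Random(0))
print("".join(observed))
print(hidden)
```

Passing an explicit `random.Random` instance keeps the draw reproducible, which is convenient when the sampled sequence is later fed to the inference algorithms.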
What is the probability of the data?

The probability of the observed is $P(O) = \sum_{H} P(O \mid H)\,P(H)$, a sum over all hidden paths $H$, which can be hard to calculate directly. However, these calculations can be considerably accelerated.

Let $P_k^j$ be the probability of the observations $(O_1, \ldots, O_k)$ together with the hidden state $H_k = j$. The following recursion is obeyed:
  i.   $P_k^j = P(O_k \mid H_k = j) \sum_{i} P_{k-1}^i \, p_{i,j}$
  ii.  $P_1^j = P(O_1 \mid H_1 = j)\,\pi_j$ (initial condition)
  iii. $P(O) = \sum_{j} P_n^j$

Example: $P_5^2 = P(O_5 \mid H_5 = 2) \sum_{j} P_4^j \, p_{j,2}$

[Figure: observations $O_1, \ldots, O_{10}$ and hidden states 1, 2, 3]

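
To make the speed-up concrete, here is a minimal sketch of recursion i-iii next to the brute-force sum over all hidden paths. The names `forward` and `brute_force`, the three-state parameters, and the two-letter alphabet are illustrative assumptions, not values from the lecture.

```python
from itertools import product


def forward(obs, pi, p, e):
    """Return (table, P(O)); table[k][j] is P_{k+1}^j (0-based index k)."""
    states = list(pi)
    # ii. initial condition
    table = [{j: e[j][obs[0]] * pi[j] for j in states}]
    # i. recursion over positions 2..n
    for o in obs[1:]:
        prev = table[-1]
        table.append(
            {j: e[j][o] * sum(prev[i] * p[i][j] for i in states) for j in states}
        )
    # iii. sum over the last hidden state
    return table, sum(table[-1].values())


def brute_force(obs, pi, p, e):
    """P(O) as an explicit sum over every hidden path (exponential cost)."""
    states = list(pi)
    total = 0.0
    for path in product(states, repeat=len(obs)):
        prob = pi[path[0]] * e[path[0]][obs[0]]
        for k in range(1, len(obs)):
            prob *= p[path[k - 1]][path[k]] * e[path[k]][obs[k]]
        total += prob
    return total


# Hypothetical parameters for hidden states 1, 2, 3 and symbols "a", "b".
PI = {1: 0.5, 2: 0.3, 3: 0.2}
P = {
    1: {1: 0.8, 2: 0.1, 3: 0.1},
    2: {1: 0.2, 2: 0.7, 3: 0.1},
    3: {1: 0.3, 2: 0.3, 3: 0.4},
}
E = {1: {"a": 0.9, "b": 0.1}, 2: {"a": 0.5, "b": 0.5}, 3: {"a": 0.2, "b": 0.8}}
OBS = "abbab"

table, p_obs = forward(OBS, PI, P, E)
print(p_obs, brute_force(OBS, PI, P, E))  # the two values agree up to rounding
```

With $m$ hidden states and $n$ positions the recursion does on the order of $n m^2$ multiplications, whereas the explicit sum visits $m^n$ paths.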
What is the most probable hidden configuration?

Let $H^*$ be the sequence of hidden states that maximizes the probability of the observed sequence $O$, i.e. $H^* = \mathrm{ArgMax}_H \, [P(O, H)]$.

Let $H_k^j$ be the probability of the most probable path up to $k$ ending in hidden state $j$. Again recursions can be found:
  i.   $H_k^j = \max_i \{ H_{k-1}^i \, p_{i,j} \} \, e(O_k, j)$
  ii.  $H_1^j = \pi_j \, e(O_1, j)$

The actual sequence of hidden states $H_k^*$ can be found recursively by
  iii. $H_{k-1}^* = \{\, i : H_{k-1}^i \, p_{i,j} \, e(O_k, j) = H_k^j,\ j = H_k^* \,\}$

Example: $H_6^1 = \max_j \{ H_5^j \, p_{j,1} \, e(O_6, 1) \}$ and $H_5^* = \{\, i : H_5^i \, p_{i,1} \, e(O_6, 1) = H_6^1 \,\}$

[Figure: observations $O_1, \ldots, O_{10}$ and hidden states 1, 2, 3]

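
The recursion and traceback above can be sketched as follows. This is an illustrative implementation under assumed parameters (the transition values are made up, the emission values are taken from the isochore example later in the deck); the name `viterbi` and the back-pointer bookkeeping are choices of this sketch.

```python
def viterbi(obs, pi, p, e):
    """Return (most probable hidden path, its joint probability with obs)."""
    states = list(pi)
    # ii. initialisation: best[0][j] = pi_j * e(O_1, j)
    best = [{j: pi[j] * e[j][obs[0]] for j in states}]
    back = []  # back[k][j] = best predecessor of state j at position k + 2
    for o in obs[1:]:
        prev = best[-1]
        row, ptr = {}, {}
        for j in states:
            # i. recursion: maximise over the previous hidden state
            i_star = max(states, key=lambda i: prev[i] * p[i][j])
            ptr[j] = i_star
            row[j] = prev[i_star] * p[i_star][j] * e[j][o]
        best.append(row)
        back.append(ptr)
    # iii. traceback from the best final state
    last = max(states, key=lambda j: best[-1][j])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, best[-1][last]


# Hypothetical parameters: transition probabilities are made up.
PI = {"poor": 0.5, "rich": 0.5}
P = {"poor": {"poor": 0.9, "rich": 0.1}, "rich": {"poor": 0.1, "rich": 0.9}}
E = {
    "poor": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}
OBS = "ATTACGGC"

PATH, PROB = viterbi(OBS, PI, P, E)
print(PATH)  # -> ['poor', 'poor', 'poor', 'poor', 'rich', 'rich', 'rich', 'rich']
```

Storing back-pointers during the forward sweep is equivalent to re-solving condition iii at each step of the traceback; it trades a little memory for not recomputing the maximisation.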
What is the probability of a specific hidden state?

Let $Q_k^j$ be the probability of the observations from $k+1$ to $n$ given $H_k = j$. These also obey a recursion, started from $Q_n^j = 1$:
  $Q_k^j = \sum_{i} P(O_{k+1} \mid H_{k+1} = i) \, p_{j,i} \, Q_{k+1}^i$

The probability of the observations and a specific hidden state can be found as:
  $P(O, H_k = j) = P_k^j \, Q_k^j$

And the probability of a specific hidden state given the observations can be found as:
  $P(H_k = j \mid O) = P_k^j \, Q_k^j / P(O)$

Example: $P(H_5 = 2 \mid O) = P_5^2 \, Q_5^2 / P(O)$

[Figure: observations $O_1, \ldots, O_{10}$ and hidden states 1, 2, 3]

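
A minimal sketch of how the forward quantities $P_k^j$ and the backward quantities $Q_k^j$ combine into per-position posteriors. The function names and the transition values are assumptions for illustration; the emission values are borrowed from the isochore example later in the deck.

```python
def forward(obs, pi, p, e):
    """table[k][j] = P(O_1..O_{k+1}, H_{k+1} = j), i.e. P_{k+1}^j."""
    states = list(pi)
    table = [{j: e[j][obs[0]] * pi[j] for j in states}]
    for o in obs[1:]:
        prev = table[-1]
        table.append(
            {j: e[j][o] * sum(prev[i] * p[i][j] for i in states) for j in states}
        )
    return table


def backward(obs, p, e, states):
    """table[k][j] = P(O_{k+2}..O_n | H_{k+1} = j), i.e. Q_{k+1}^j."""
    table = [{j: 1.0 for j in states}]  # Q_n^j = 1
    for o in reversed(obs[1:]):  # o runs over O_n, ..., O_2
        nxt = table[0]
        table.insert(
            0, {j: sum(p[j][i] * e[i][o] * nxt[i] for i in states) for j in states}
        )
    return table


def posterior(obs, pi, p, e):
    """Return a list of dicts with P(H_k = j | O) for every position k."""
    states = list(pi)
    fwd = forward(obs, pi, p, e)
    bwd = backward(obs, p, e, states)
    total = sum(fwd[-1].values())  # P(O)
    return [
        {j: fwd[k][j] * bwd[k][j] / total for j in states} for k in range(len(obs))
    ]


# Hypothetical parameters: transition probabilities are made up.
PI = {"poor": 0.5, "rich": 0.5}
P = {"poor": {"poor": 0.9, "rich": 0.1}, "rich": {"poor": 0.1, "rich": 0.9}}
E = {
    "poor": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}
OBS = "ATTACGGC"

F = forward(OBS, PI, P, E)
B = backward(OBS, P, E, list(PI))
POST = posterior(OBS, PI, P, E)
for k, row in enumerate(POST, start=1):
    print(k, OBS[k - 1], round(row["rich"], 3))
```

A useful sanity check is that $\sum_j P_k^j Q_k^j$ gives the same value $P(O)$ at every position $k$, so each posterior row sums to one.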
Fast/Slowly Evolving States
Felsenstein & Churchill, 1996

[Figure: alignment of $k$ sequences (rows $1, \ldots, k$) by $n$ positions (columns $1, \ldots, n$)]

HMM with two hidden rate states: slow ($r_s$) and fast ($r_f$).
  $\pi_r$: equilibrium distribution of hidden states (rates) at first position
  $p_{i,j}$: transition probabilities between hidden states
  $L^{(j,r)}$: likelihood for the $j$th column given rate $r$
  $L_{(j,r)}$: likelihood for the first $j$ columns given that the $j$th column has rate $r$

Likelihood recursions:
  $L_{(j,f)} = (L_{(j-1,f)} \, p_{f,f} + L_{(j-1,s)} \, p_{s,f}) \, L^{(j,f)}$
  $L_{(j,s)} = (L_{(j-1,f)} \, p_{f,s} + L_{(j-1,s)} \, p_{s,s}) \, L^{(j,s)}$

Likelihood initialisations:
  $L_{(1,f)} = \pi_f \, L^{(1,f)}$
  $L_{(1,s)} = \pi_s \, L^{(1,s)}$

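
The slide stops at the recursion; the step that turns it into a likelihood for the whole alignment is the same final sum as in the general algorithm for the probability of the data. As a short worked statement, writing $L^{(j,r)}$ for the likelihood of column $j$ alone and $L_{(j,r)}$ for the cumulative likelihood of the first $j$ columns (this superscript/subscript split is a notational assumption of the sketch):

```latex
\begin{align*}
L_{(j,r)} &= L^{(j,r)} \sum_{r' \in \{s,f\}} L_{(j-1,r')} \, p_{r',r}
  && \text{(both recursions in one line)} \\
L &= L_{(n,s)} + L_{(n,f)}
  && \text{(likelihood of all $n$ columns)}
\end{align*}
```

Here the column likelihood $L^{(j,r)}$ plays the role of the emit function: it is computed on the phylogeny for column $j$ with all branch lengths scaled by rate $r$.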
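Statistical Alignment
Steel and Hein, 2000 + Holmes and Bruno, 2000

Emit functions:
  $e(\#\#) = \pi(N_1) \, f(N_1, N_2)$
  $e(\#-) = \pi(N_1)$
  $e(-\#) = \pi(N_2)$
where $\pi(N_1)$ is the equilibrium probability of $N_1$ and $f(N_1, N_2)$ is the probability that $N_1$ evolves into $N_2$.

An HMM Generating Alignments

Hidden states are alignment columns: $-/\#$ (insertion), $\#/\#$ (match), $\#/-$ (deletion) and E (end), with $*/*$ as the start state.

Transitions out of the start ($*/*$), match ($\#/\#$) and insertion ($-/\#$) states are identical:
  to $-/\#$: $\lambda\beta$
  to $\#/\#$: $\frac{\lambda}{\mu}(1 - \lambda\beta)\,e^{-\mu}$
  to $\#/-$: $\frac{\lambda}{\mu}(1 - \lambda\beta)(1 - e^{-\mu})$
  to E: $(1 - \frac{\lambda}{\mu})(1 - \lambda\beta)$

Transitions out of the deletion ($\#/-$) state:
  to $-/\#$: $1 - \frac{\mu\beta}{1 - e^{-\mu}}$ (the remainder, so that the row sums to 1)
  to $\#/\#$: $\frac{\lambda\beta\,e^{-\mu}}{1 - e^{-\mu}}$
  to $\#/-$: $\lambda\beta$
  to E: $\frac{(\mu - \lambda)\beta}{1 - e^{-\mu}}$

[Figure: example sequence T C C A C]
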
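
As a quick consistency check on the transition probabilities, the sketch below evaluates both rows numerically and confirms that each sums to one. Two things here are assumptions not stated on the slide: $\beta$ is taken to be the usual TKF91 quantity $\beta = (1 - e^{\lambda-\mu}) / (\mu - \lambda e^{\lambda-\mu})$ with the time set to 1, and the rates $\lambda = 0.03$, $\mu = 0.04$ are arbitrary illustrative values. The function names are hypothetical.

```python
import math


def tkf91_beta(lam, mu):
    """Assumed TKF91 definition of beta at time t = 1 (requires lam < mu)."""
    x = math.exp(lam - mu)
    return (1.0 - x) / (mu - lam * x)


def transition_rows(lam, mu):
    """Outgoing transition probabilities of the alignment-generating HMM."""
    beta = tkf91_beta(lam, mu)
    surv = math.exp(-mu)  # e^{-mu}
    shared = {
        "insert": lam * beta,
        "match": lam / mu * (1 - lam * beta) * surv,
        "delete": lam / mu * (1 - lam * beta) * (1 - surv),
        "end": (1 - lam / mu) * (1 - lam * beta),
    }
    scale = beta / (1 - surv)
    from_delete = {
        "insert": 1 - mu * scale,
        "match": lam * scale * surv,
        "delete": lam * beta,
        "end": (mu - lam) * scale,
    }
    # Start, match and insertion states share the same outgoing row.
    return {
        "start": shared,
        "match": shared,
        "insert": shared,
        "delete": from_delete,
    }


ROWS = transition_rows(0.03, 0.04)  # hypothetical birth and death rates
for state, row in ROWS.items():
    print(state, {k: round(v, 4) for k, v in row.items()}, round(sum(row.values()), 12))
```

The shared row sums to one for any value of $\beta$, so the check is only informative for the deletion row, where the insertion entry is defined as the remainder.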
Probability of Data given a pedigree

Elston-Stewart (1971): Temporal Peeling Algorithm
  [Figure: pedigree with Mother and Father]
  Condition on parental states.
  Recombination and mutation are Markovian.

Lander-Green (1987): Genotype Scanning Algorithm
  [Figure: pedigree with Mother and Father]
  Condition on paternal/maternal inheritance.
  Recombination and mutation are Markovian.

Comment: Obvious parallel to the Wiuf-Hein (1999) reformulation of Hudson's 1983 algorithm.

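Further Examples I

Isochore: Churchill, 1989, 1992

HMM with two hidden states: poor and rich.
Emission probabilities: $L_p(C) = L_p(G) = 0.1$, $L_p(A) = L_p(T) = 0.4$, $L_r(C) = L_r(G) = 0.4$, $L_r(A) = L_r(T) = 0.1$. In the recursions these appear as $P_p(S[j])$ and $P_r(S[j])$, the probability of the base at position $j$ of the sequence $S$.

Likelihood recursions:
  $L_{(j,p)} = (L_{(j-1,p)} \, p_{p,p} + L_{(j-1,r)} \, p_{r,p}) \, P_p(S[j])$
  $L_{(j,r)} = (L_{(j-1,p)} \, p_{p,r} + L_{(j-1,r)} \, p_{r,r}) \, P_r(S[j])$

Likelihood initialisations:
  $L_{(1,p)} = \pi_p \, P_p(S[1])$
  $L_{(1,r)} = \pi_r \, P_r(S[1])$

Gene Finding: Burge and Karlin, 1996
  [Figures: Simple Prokaryotic; Simple Eukaryotic]
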
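
A small worked instance of the isochore recursion may help. It uses the emission probabilities given on the slide, but the sequence $S = \mathrm{CG}$, the initial distribution $\pi_p = \pi_r = 0.5$, and the transition probabilities $p_{p,p} = p_{r,r} = 0.9$, $p_{p,r} = p_{r,p} = 0.1$ are assumptions chosen only to make the arithmetic easy to follow.

```latex
\begin{align*}
L_{(1,p)} &= \pi_p \, P_p(\mathrm{C}) = 0.5 \times 0.1 = 0.05 \\
L_{(1,r)} &= \pi_r \, P_r(\mathrm{C}) = 0.5 \times 0.4 = 0.2 \\
L_{(2,p)} &= (0.05 \times 0.9 + 0.2 \times 0.1) \times P_p(\mathrm{G})
           = 0.065 \times 0.1 = 0.0065 \\
L_{(2,r)} &= (0.05 \times 0.1 + 0.2 \times 0.9) \times P_r(\mathrm{G})
           = 0.185 \times 0.4 = 0.074 \\
P(S) &= L_{(2,p)} + L_{(2,r)} = 0.0805
\end{align*}
```

Almost all of the probability mass ends in the rich state, as expected for a sequence consisting only of C and G.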
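Further Examples II

Secondary Structure Elements: Goldman, 1996

HMM for SSEs, with hidden states $\alpha$ (helix), $\beta$ (strand) and L (loop):

[Figure: chain of hidden states $\alpha$, L, $\alpha$, L, $\beta$, L, $\beta$]

          α       β       L
  α     .909    .005    .062
  β     .0005   .881    .086
  L     .091    .184    .852
        .325    .212    .462

[Figures: Adding Evolution; SSE Prediction]

Profile HMM Alignment: Krogh et al., 1994
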
Summary

[Figure: observations $O_1, \ldots, O_{10}$ and hidden states 1, 2, 3]

Definition
Three Key Algorithms
  Summing over Unknown States
  Most Probable Unknown States
  Marginalizing Unknown States
Key Bioinformatic Applications
  Pedigree Analysis
  Isochores in Genomes (CG-rich regions)
  Profile HMM Alignment
  Fast/Slowly Evolving States
  Secondary Structure Elements in Proteins
  Gene Finding
  Statistical Alignment