Hidden Markov Models. Introduction to. Model Fitting. Hagit Shatkay, Celera. Data. Model. The Many Facets of HMMs... Tübingen, Sept.

Size: px

Start display at page:

Download "Hidden Markov Models. Introduction to. Model Fitting. Hagit Shatkay, Celera. Data. Model. The Many Facets of HMMs... Tübingen, Sept."

Philippa Lindsey
6 years ago
Views:

1 Introduction to Hidden Markov Models Hagit Shatkay, Celera Tübingen, Sept Model Fitting Data Model 2 The Many Facets of Found no match for your criteria. Speech Recognition DNA/Protein Analysis 3

2 What can you say about his condition? Let me check my HMM... Medical Decision Making Robot Navigation and Planning 4 Overview The Components of HMMs Evaluating a sequence WRT an HMM (Problem ) Fitting a sequence to an HMM (Problem 2) Fitting an HMM to sequences (Problem 3) Issues, Extensions, Applications Conclusion 5 HMMs: The Basics 6

3 Models that are: What are HMMS? Stochastic (probability-based) Generative Provide a putative production process for generating data. Satisfying the Markov Property The present state summarizes the past. Future events depend only on the current situation not on the preceding ones. 7 Example: Weather modeling and prediction Breezy & Cold Pr(T<5 )=0.8 Pr(Overcast)=0.5 Pr(Precipitation)=0.3 Pr(Wind>5mph)= Fair Pr(T<5 )= Pr(Overcast)=0.02 Pr(Precipitation)=0. Pr(Wind>5mph)=0.2 Winter Storm Pr(T<5 )=0.999 Pr(Overcast)=0.9 Pr(Precipitation)=0.95 Pr(Wind>5mph)=0.85 Rainy Pr(T<5 )=0.3 Pr(Overcast)=0.8 Pr(Precipitation)=0.95 Pr(Wind>5mph)=0.6 8 The Building Blocks of HMMs An HMM is a tuple: λ = < S, V, A, B, π > States ( S = s,..., s N ) Hidden Observations ( V = v,..., v M ) Parameters: A: Transition matrix A ij = Pr( q t+ =s j q t =s i ) B: Observation matrix B ik = Pr( o t = v k q t =s i ) π: Initial distribution π i = Pr( q =s i ) S P(H)=0.5 P(T)=0.5 OR 0.5 S 0.5 P(H)=0 P(T)= 0.5 S2 P(H)= P(T)=0 0.5 N j = M k = N j= A ij B ik π i = = = 9

4 Examples Revisited Speech Recognition States: Positions for nucleotides deletion/matching/insertion Observations: Nucleotides Transitions: Nucleotides order in the DNA States: Possible phonemes in a word Observations: Uttered Phonemes Transitions: Phoneme order in a word States for modeling purposes. DNA/Protein Analysis 0 States: Patient's (varying) condition Observations: Instrument Readings Transitions: Changes due to treatment Medical Decision Making States: Robot's position Observations: Sensors Readings Transitions: Changes due to movement A physical notion of states. Robot Navigation and Planning HMMs: The Three Problems Problem : Given a model λ and a sequence of observations O=O,...,O T, find O s probability under λ, Pr(O λ). Problem 2: Given a model λ and a sequence of observations O=O,...,O T, find the best state sequence Q=q,...,q T explaining it. Problem 3: Given a sequence of observations O=O,...,O T, find the best model λ that could have generated it. 2

Example: Modeling Protein Families Different protein families (e.g. Globin, Flavodoxin, Kinase), have different characteristic sequences. Families are represented as HMMs.

..) Anchor points for typical AA emition, insertion and deletion Problem 3: Given multiple aligned sequences, learn the family HMM Problem 2: Given a family HMM and a sequence, find the best alignment.

of Conditional Probability: Pr(X Y) = Pr(X,Y) Pr(Y) Bayes Rule: Pr(X Y) = Pr(Y X) Pr(X) Pr(Y) Chain Rule of Conditional Probability: Pr(X, X 2,..., X n ) = Pr(X ) Pr(X 2 X )...Pr(X n X,.

5 Example: Modeling Protein Families Different protein families (e.g. Globin, Flavodoxin, Kinase), have different characteristic sequences. Families are represented as HMMs. Globin Kinase (NDK) Flavodoxin ( Observations States 20 Amino acids (Glu, Gly, Arg,...) Anchor points for typical AA emition, insertion and deletion Problem 3: Given multiple aligned sequences, learn the family HMM Problem 2: Given a family HMM and a sequence, find the best alignment. Problem : Given a family HMM and a protein sequence, calculate how likely the protein is to be in the family. 3 Basic Tools Def. of Conditional Probability: Pr(X Y) = Pr(X,Y) Pr(Y) Bayes Rule: Pr(X Y) = Pr(Y X) Pr(X) Pr(Y) Chain Rule of Conditional Probability: Pr(X, X 2,..., X n ) = Pr(X ) Pr(X 2 X )...Pr(X n X,..., X n- ) The Markov Property (An assumption not a fact!): Pr(q t+ =s j q t =s i ) = Pr(q t+ =s j q =s i, q 2 =s i2,..., q t = s it =s i ) 4 [Rabiner&Juang86, Rabiner89] The ultimate Introduction to HMMs, and application in NLP (speech recognition) [Charniak93] and references therein. HMMs in NLP [Leek97, Ray&Craven0] HMMs in NLP, Infomation Extraction from Biomedical text [Hauskrecht&Fraser98] HMMs in medical decision making [Simmons&Koenig95, Koenig&Simmons96, Shatkay&Kaelbling97,Shatkay&Kaelbling02] HMMs for robot navigation [Churchill89, Krogh et al 94a, Krogh et al 94b, Eddy 98, Burge97, Durbin et al 98] and references therein. HMMs in computational biology. 5

6 Problem Pr(Sequence Model) 6 Pr(Sequence Model) Given: O = O,...,O T λ = < S, V, A, B, π > A sequence of observations An HMM Calculate: The probability of O to be generated under the model λ, Pr(O λ) Example Application: Given a protein sequence, P, and several possible protein families (λ... λ L ), find the most likely family of P. argmax[pr(p λ i )] λ i 7 Example Sequence: TPEE HMM, λ: π = Pr(S)=0.8 Pr(T)= Calculating Pr(O λ) 0.9 A (small) protein motif 2 Pr(P)=0.9 Pr(E)= Pr(E)=Pr(A)= Pr(S)=Pr(G)=0.25 Pr(E)=0.8 Pr(Q)=Pr(D)=0. If we knew that the generating state sequence is: 234 Pr(O=TPEE λ,q =234) = Pr(O =T λ, q =) Pr(O 2 =P λ, q 2 =2) Pr(O 3 =E λ, q 3 =3) Pr(O 4 =E λ, q 4 =4) = = No explicit state sequence Pr(O=TPEE λ) = Pr(O=TPEE, Q = q q 2 q 3 q 4 λ)= q q 2 q 3 q Pr(O=TPEE Q=q q 2 q 3 q 4,λ) Pr(Q=q q 2 q 3 q 4 λ)= π q B qi o i A qj q j+ q q 2 q 3 q 4 q q 2 q 3 q i= j= 4 3 Marginalize:

7 Calculating Pr(O λ) (Cont.) In the general case T Pr(O λ)= Pr(O Q,λ) Pr(Q λ)= π q B qi o Q= q...q i A qj q i= j= j+ T Q= q...q T Computation time N T State-Sequences Q (N states, T time steps) 2T products per sequence O(TN T ) Not Feasible T 9 Calculating Pr(O λ) (Cont.) Solution: An alternative approach, using Dynamic Programming Idea: Sum over all final-states rather than over all state-sequences: N Pr(O = o,...,o T λ)= Pr(O = o,...,o T, q T = s i λ) i = Making it work: Define, for any time t T: α t (i) = Pr(o,...,o t, q t = s i λ). Initialization: α (i) = π i B i o Recursion: α t+ (j) = α t (i) A ij B j ot+ Termination: Pr(O = o,...,o T λ)= α T (i) N i = N i = 20 Calculating Pr(O λ) (last!) α t (i)=pr(o,...,o t, q t = s i λ): Known as the Forward probability. Computation time Calculating each α t+ (j) : N Summands N states, T time points NT such summations. 2 products per summand O(TN 2 ) Efficient Computation! 2

8 Problem 2 Find Best States for Observations 22 Given: Best States for Observations O = O,...,O T A sequence of observations λ = < S, V, A, B, π > An HMM Find: The sequence of states, Q = q,...,q T, that generated O under the model λ Example Application: Given a protein sequence, P, and an HMM model for a family (a profile), find the best alignment of P with the profile. 23 What is the Best State Sequence? Options: Optimize the expected number of correct states. The state at time t, q t, is s i that maximizes Pr(q t =s i O,λ). The resulting sequence Q = q,...,q T may not be a valid one... Optimize the whole state-sequence probability, Pr(Q O,λ). Equivalent to: Q * = argmax[pr(q,o λ)] Q Efficiently done using the Viterbi Algorithm. 24

9 The Viterbi Algorithm Dynamic Programming (again) Q * = argmax[pr(q,o λ)], without explicit expansion of all N T state sequences. Q Define, for any time t T: δ t (i)= max[pr(q,...,q t-,q t = s i,o,...,o t λ). q,...,q t- Initialization: δ (i) = π i B io Recursion: δ t+ (j) = max[δ t (i) A ij ] B j ot+ i ψ (i)=0 ψ t+ (j)=argmax[δ t (i)a ij ] i Termination: P * = max[pr(q,o λ)] )=max[δ T (i)] Q i Q * = q *, q 2 * * *,..., q T q T = argmax[δ T ( i)], q * t = ψ t ( q* t ) i 25 Example Sequence: TPSS HMM, λ: π = Pr(S)=0.8 Pr(T)= Pr(P)=0.9 Pr(E)=0. Find the State Sequence Q * : Pr(E)=Pr(A)= Pr(S)=Pr(G)= Pr(E)=0.8 Pr(Q)=Pr(D)=0. T: δ ()=0.2 δ (2)=δ (3)=δ (4)=0 ψ (i)=0 P: δ 2 (2)= =0.62 δ 2 ()=δ 2 (3)=δ 2 (4)=0 ψ 2 (2)= S: δ 3 (3)= = δ 3 ()=δ 3 (2)=δ 3 (4)=0 ψ 3 (3)=2 S: δ 4 (3)= = δ 3 ()=δ 3 (2)=δ 3 (4)=0 ψ 4 (3)= q=3 * 4 26 Problem 3 Learning an HMM from Data 27

10 Best Model for Observations Given: O = O,...,O T Sequence(s) of observations Implicit also: The set of possible observation values, V = v,..., v M The number of states in the model, N. Find: The model λ = < S, V, A, B, π > that generated O Example Application: Given multiple protein sequences, P,...,P K, from a protein family, find an HMM for the family (a profile). 28 Example Sequences: Finding an HMM V G A H V N V E A D V N A N I A G A D N A (small) Globin motif (Modified from [Durbin et al 98]) Count-based estimates: Match-state M (# of transitions from M A 2 = Pr(q t+ =M 2 q t =M ) to M 2 ) = 4/5 (# of transitions from M ) (# of times V observed in M B V = Pr(o t =V q t =M ) ) = 4/5 (# of visits to M ) 29 Example (Cont.) Sequences: V G A H V N V E A D V N A N I A G A D N M D I Match state Delete state Insert state HMM, λ: 0.5 Pr(A)=Pr(D)=0.5 I π = M M 0.8 M 2 M M 4 Pr(V)=0.8 Pr(I)=0.2 Pr(A)=Pr(N)=0.25 Pr(G)=Pr(E)= Pr(G)=0.25 Pr(A)=0.75 Pr(N)=0.6 Pr(D)=Pr(H)= D Pr( )=

11 Formal & General Given: O = O,...,O T Sequence(s) of observations The number of states in the model, N. Find: Model λ = < S, V, A, B, π >, maximizing the likelihood Pr(O λ) Use Expected Counts: Pick an initial transition and observation model. Iterate: E M Use current model and observations to compute a distribution on state sequences. Use distribution and observations to estimate a new transition and observation model, based on expected frequencies. Increases Pr(O λ) at each iteration, until convergence. 3 Baum-Welch Algorithm [Baum et al 7, Rabiner89] Receives number of states. Picks an initial model. Updates iteratively: π i Expected frequency of s i at time 0; A ij E(# of trans. from s i to s j ) E(# of trans. from s i ) ; B ik E(# of times in s i observing o k ) E(# of times in s i ). 32 Baum-Welch Algorithm (details) Dynamic programming for calculating: α t (i) = Pr(O,..., O t, q t = s i λ) Forward β t (i) = Pr(O t+,..., O T q t = s i, λ) Backward Pr(qt = si,o λ) αt ( i) βt ( i) γ t (i) = Pr(q t = s i O,λ) = Pr(O λ) N = αt ( j) βt ( j) Pr(qt = si,qt + = s j,o λ) ξ t (i,j) = Pr(q t = s i, q t+ = s j O,λ) = Pr(O λ) = j= αt ( i) AijB j, O β + t+ ( j) t = N αt ( j) βt ( j) Sum γ t (i) s and ξ t (i,j) s to obtain expected counts. j= 33

12 Baum-Welch Algorithm (last) General Practical Issues: Reaches local maxima Strongly depends on initial conditions May require a lot of data and many iterations [Rabiner&Juang86, Rabiner89] A comprehensive introduction. [Baum et al 70] and references therein. Baum s original results. [Durbin et al 98] and references therein. Applications in computational biology. 34 Issues, Extensions, Applications 35 State Duration, Semi-Markov Models Staying d time steps in the same state, s i p Standard HMM: -p Pr(d consecutive time steps in state Si) = p (d-) (-p) Geometric distribution over duration in a state. Supporting other distributions: q q S i P i (d)... q 2 λ = < S, V, A, B, π > { P,..., P N } S i P i : A probability distribution over duration d, at state i. P i (d) = Pr(staying d time steps in state Si), for d D P i (d) can be a continuous density function (e.g. Gaussian). 36

13 Multi-Dimensional Data Observations may be structured (see weather example) Standard observation sequence: O=O,...,O T O j V={v,...,v M } V is assumed to be a set of atomic values. Multi-dimensional observations: O j V =,..., v{ v } M v= <,,..., > i V 2 i V K i V i V is a set of k-dimensional vectors. The components V, 2,..., i V K i V i are (typically) assumed to be conditionally independent given the state 37 Applications: Multi-Dimensional Data (cont.) Twinscan: [Korf et al 0] Gene structure prediction over a target DNA sequence, D HMM is used for modeling regions in the DNA 2-dimensional observations: <Nucleotide, Conservation tag>. Nucleotide:{A,C,G,T } Conservation tag: {.,, : } Conservation tag represents alignment of D with an informant sequence. Robotics: [Cassndra et al 96, Shatkay&Kaelbling97] Observations represent the robot s view in each state. They are factored into the view in each cardinal direction: front, left and right (3-dimensional). Multi-Dimensional States: Factorial HMMs [Ghahramani and Jordan 97] 38 Other Topics Pseudo-counts and priors (Never say Never...) Other methods for learning HMMs: Bayesian Model Merging [Stolcke&Omohundro93,94] Viterbi training [Durbin et al 98] Optimizing other measures [Rabiner 89] Comparing HMMs [Rabiner 89] Choosing an initial model [Rabiner 89] Constraining HMM with domain knowledge and data. [Shatkay&Kaelbling02] 39

14 Conclusion Generative probabilistic models. Useful for modeling sequences with variations and/or noise. Efficient ways exist to relate and align sequences with families (HMMs). Generally: Require a lot of data and domain specific knowledge to construct. Versatile, flexible and general. Support extensions, special cases, and a wide variety of applications ,5 [Baum et al 70] [Burge97] [Cassandra et al 96] [Charniak93] [Churchill89] Baum L. E. (970). A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains, The Annals of Mathematical Statistics, 4, #, pp Burge C. (997). Identification of Genes in Human Genomic DNA, PhD Thesis, Stanford University, Stanford, CA. Cassandra A. R., Kaelbling L. P. and Kurien J. A. (996). Acting Under Uncertainty: Discrete Bayesian Models for Mobile-Robot Navigation, Proc. of the IEEE Int. Conf. on Intelligent Robots and Systems. Charniak E. (993). Statistical Language Learning, MIT Press. Churchill G. A. (989). Stochastic Models for Heterogeneous DNA Sequences, Bulletin of Mathematical Biology, 5, #, pp [Durbin et al 98] Durbin R. et al (998). Biological Sequence Analysis, Cambridge U. Press. [Eddy98] Eddy S. (998). Profile Hidden Markov Models, Bioinformatics, 4, pp [Ghahramani&Jordan97] Gharahmani Z. and Jordan M. I. (997). Factorial Hidden Markov Models, Machine Learning, 29, pp. -3. [Hauskrecht&Fraser98] Hauskrecht M. and Fraser H. (998). Planning Medical Therapy Using Partially Observable Markov Decision Processes, Proc. of the Ninth International Workshop on Principles of Diagnostics, pp ,5.439 [Koenig&Simmons 96] Koenig S. and Simmons R. (996). Unsupervised Learning of Probabilistic Models for Robot Navigation, Proc. of the IEEE Int. Conf. on Robotics and Automation. [Korf et al 0] [Krogh et al 94a] [Krogh et al 94b] [Leek97] [Rabiner&Juang86] [Rabiner89] [Ray&Craven0] Korf I. et al (200). Integrating genomic homology into gene structure prediction, Proc. of ISMB 0, pp. S40-S48. Krogh A. et al (994). Hidden Markov Models in Computational Biology, Journal of Molecular Biology, 235, pp Krogh A., Mian S.I. and Haussler D. (994). A Hidden Markov Model that Finds Genes in E. Coli DNA, Nucleic Acids Research, 22, pp Leek T.R. (997). Information Extraction Using Hidden Markov Models, MSc thesis, Dept. of Computer Science, University of California, San Diego. Rabiner L.R., Juang B.H. (986). An Introduction to Hidden Markov Models, IEEE ASSP Magazine, 3, #, pp Rabiner L.R. (989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proc. of the IEEE, 77, #2, pp Ray S. and Craven M. (200). Representing Sentence Structure in Hidden Markov Models for Information Extraction, Proc. of IJCAI 0. 42

15 - 4 7,5.439 [Shatkay&Kaelbling97] [Shatkay&Kaelbling02] Shatkay H. and Kaelbling L. P. (997). Learning Topological Maps with Weak Local Odometric Information, Proc. of IJCAI 97. Shatkay H. and Kaelbling L. P. (2002). Learning Hidden Markov Models with Geometrical Constraints: Bridging the Topological-Geometrical Gap, Journal of AI Research, 6, pp [Simmons&Koenig95] Simmons R. and Koenig S. (995). Probabilistic Navigation in Partially Observable Environments, Proc. of IJCAI 95. [Stolcke&Omohundro93] Stolcke A. and Omohundro S. M. (993). Hidden Markov Model Induction by Bayesian Model Merging, Advances in Neural Information Systems, 5, Morgan Kaufmann, pp. -8 [Stolcke&Omohundro94] Stolcke A. and Omohundro S. M. (994). Best-first Model Merging for Hidden Markov Model Induction, ICSI Technical Report, TR , International Computer Science Institute, Berkeley. 43

Multiscale Systems Engineering Research Group

Multiscale Systems Engineering Research Group Hidden Markov Model Prof. Yan Wang Woodruff School of Mechanical Engineering Georgia Institute of echnology Atlanta, GA 30332, U.S.A. yan.wang@me.gatech.edu Learning Objectives o familiarize the hidden