Introduction to Hidden Markov Modeling (HMM)
Daniel S. Terry, Scott Blanchard and Harel Weinstein labs
HMM is useful for many, many problems: Speech Recognition and Translation, Weather Modeling, Sequence Alignment, Financial Modeling.
So let's say you're riding out a nuclear war in a bunker. To keep sane, you want to know what the weather outside is like, but all you can observe is whether the security guard brings his umbrella.
Probabilistic reasoning
Hidden state X, observation E. We can ask for P(Sunny | Umbrella), P(Cloudy | Umbrella), P(Raining | Umbrella), P(Sunny | No Umbrella), P(Cloudy | No Umbrella), P(Raining | No Umbrella).
P(X | E) = probability of X given that E is observed.
Probabilistic reasoning in stochastic processes
Over time, a sequence of hidden states X_0, X_1, X_2, X_3, X_4 generates a sequence of observations ("emissions") E_0, E_1, E_2, E_3, E_4. This is called a Markov chain.
Assumptions in Markov modeling
Assumption 1: this is a stationary process, specifically a first-order Markov process:
P(X_t | X_{t-1}, X_{t-2}, X_{t-3}, ...) = P(X_t | X_{t-1})
In other words, the current state depends only on the previous state. We call this the transition model.
Assumption 2: the current observation depends only on the current state:
P(E_t | X_t, X_{t-1}, X_{t-2}, ..., E_{t-1}, E_{t-2}, E_{t-3}, ...) = P(E_t | X_t)
We call this the observation (or emission) model.
The initial and transition probability models: π and A
Initial probabilities π: P(Sunny) = 0.7, P(Cloudy) = 0.15, P(Raining) = 0.15.
Transition probabilities P(X_t | X_{t-1}):
X_{t-1}    P(X_t = Sunny)   P(X_t = Cloudy)   P(X_t = Raining)
Sunny      0.7              0.25              0.05
Cloudy     0.33             0.33              0.33
Raining    0.2              0.6               0.2
Encodes prior knowledge about weather trends.
The observation probability model: B
P(E_t = Umbrella | X_t): Sunny 0.05, Cloudy 0.10, Raining 0.85.
Encodes prior knowledge about how likely people are to bring their umbrella depending on weather conditions.
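As a minimal sketch, the tables above can be transcribed as NumPy arrays (the 0/1 encoding of states and observations is an arbitrary choice for illustration):

```python
import numpy as np

# Parameters of the umbrella/weather HMM from the slides.
# States: 0 = Sunny, 1 = Cloudy, 2 = Raining; observations: 0 = Umbrella, 1 = No umbrella.
pi = np.array([0.70, 0.15, 0.15])            # initial state probabilities

A = np.array([[0.70, 0.25, 0.05],            # P(X_t | X_{t-1} = Sunny)
              [0.33, 0.33, 0.33],            # P(X_t | X_{t-1} = Cloudy)
              [0.20, 0.60, 0.20]])           # P(X_t | X_{t-1} = Raining)

B = np.array([[0.05, 0.95],                  # P(E_t | Sunny)
              [0.10, 0.90],                  # P(E_t | Cloudy)
              [0.85, 0.15]])                 # P(E_t | Raining)

# Sanity checks: each distribution sums to ~1 (the Cloudy row is rounded on the slide).
assert np.isclose(pi.sum(), 1.0)
assert np.allclose(A.sum(axis=1), 1.0, atol=0.02)
assert np.allclose(B.sum(axis=1), 1.0)
```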
Together these parameters define a Markov model, λ = {π, A, B}: the initial state probabilities (π), the state transition probabilities (a_{i,j}), and the observation distributions (b_i).
Predicting state sequences from observations
Given an observation sequence E_0..E_T, the Markov model (π, a_{i,j}, b_i) lets us predict the hidden state sequence X_0..X_T.
Finding the optimal state sequence with Viterbi
Given a model λ = {π, A, B} that describes the system, we can determine the optimal state sequence (idealization) as follows. For each state at time t, calculate the probability of X_t being a particular state x_i (sunny, raining, etc.), given the observations and previous states:
P(X_t = x_i | E_t, E_{t-1}, E_{t-2}, ..., X_{t-1}, X_{t-2}, X_{t-3}, ...) = P(X_t = x_i | E_t, X_{t-1} = x_j) = P(E_t | X_t = x_i) P(X_t = x_i | X_{t-1} = x_j)
with the initial condition P(X_0 = x_i) = π_i.
Finding the optimal state sequence with Viterbi
Repeat these calculations for all possible transitions recursively. Then at each point in time we have an estimate of how likely we are to be in a particular state at that time, given all possible previous paths. We also keep track of the most likely previous state at each point in time. (This complex-looking thing is called a trellis. Can you see why?)
Finding the optimal state sequence with Viterbi
Find the most likely end state from the probabilities in the trellis. We can then backtrack to find the most likely state sequence. You have seen a similar procedure with sequence alignment.
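The procedure of the last three slides can be sketched compactly in Python (NumPy assumed; the weather model values are the ones from the earlier slides):

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely hidden-state path through the trellis (Viterbi algorithm)."""
    T, n = len(obs), len(pi)
    delta = np.zeros((T, n))           # best path probability ending in each state
    psi = np.zeros((T, n), dtype=int)  # backpointer to the best previous state
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        trans = delta[t - 1][:, None] * A   # probability of every transition into time t
        psi[t] = trans.argmax(axis=0)
        delta[t] = trans.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]        # most likely end state...
    for t in range(T - 1, 0, -1):           # ...then backtrack
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

# Umbrella model from the slides (0 = Sunny, 1 = Cloudy, 2 = Raining).
pi = np.array([0.70, 0.15, 0.15])
A = np.array([[0.70, 0.25, 0.05], [0.33, 0.33, 0.33], [0.20, 0.60, 0.20]])
B = np.array([[0.05, 0.95], [0.10, 0.90], [0.85, 0.15]])

# Umbrella, umbrella, no umbrella -> Raining, Raining, Cloudy
print(viterbi(pi, A, B, [0, 0, 1]))  # [2, 2, 1]
```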
Predicting state sequences from observations
Given an observation sequence E_0..E_T, the Markov model (π, a_{i,j}, b_i) lets us predict the hidden state sequence X_0..X_T.
Ok, so I'm bored of talking about the weather.
A practical example of Markov modeling: analysis of single-molecule fluorescence trajectories. [Figure: fluorescence and FRET trajectories vs. time (min).]
Neurotransmitter release and reuptake are central to neuronal signaling and proper functioning of the brain. [Figure: NSS-mediated reuptake at a synapse; www.nia.nih.gov, public domain.]
Neurotransmitter:Sodium Symporter (NSS) proteins are the targets of many clinically important drugs, including therapeutic inhibitors and drugs of abuse. [Figure: NSS reuptake; www.nia.nih.gov, public domain.]
A practical example of Markov modeling: analysis of single-molecule fluorescence trajectories
[Figure: neurotransmitter transport across the membrane, with high Na+ outside (extracellular) and low Na+ inside (intracellular).]
Key question: what are the specific conformational changes required for such a mechanism, and how do they mediate transport?
Single-molecule FRET: a tool for examining conformational dynamics. [Figure: FRET efficiency vs. donor-acceptor distance (nm), with R_0 marking 50% efficiency.]
FRET imaging of single molecules can be achieved using a few tricks, including surface immobilization and total internal reflection (TIR) excitation at 532 nm. [Figure: donor and acceptor fluorescence and FRET vs. time (min).]
HMM is a statistical framework for modeling a hidden system using a sequence of observations generated by that system: a sequence of hidden states X_0, X_1, X_2, ... (here, conformations) produces a sequence of observations E_0, E_1, E_2, ... (here, FRET values).
We want to know: 1) How many distinct states are there? 2) What are their FRET values? 3) What are the rates? 4) What is the most likely state at each point in time?
Unlike with the weather, we have to learn the model from the data itself!!
Hidden Markov models have three components, λ = {π, A, B}:
1) Initial state probabilities: π
2) Transition probabilities: A = {a_{i,j}}
3) Observation probability distribution (OPD): b_i(E_t) = (1 / sqrt(2π σ_i²)) exp( -(E_t - μ_i)² / (2σ_i²) ), the FRET distribution for state i, with mean μ_i and width σ_i.
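The Gaussian OPD is simple to evaluate; a minimal sketch (NumPy assumed, and the μ and σ values below are purely illustrative):

```python
import numpy as np

def gaussian_opd(e, mu, sigma):
    """b_i(E_t): probability density of observing FRET value e in a state
    with mean FRET mu and standard deviation sigma."""
    return np.exp(-(e - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

# Example: a state centered at FRET = 0.55 with width 0.05 (hypothetical values).
# At the peak the density is 1 / (sigma * sqrt(2*pi)).
print(gaussian_opd(0.55, 0.55, 0.05))
```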
Goal: the best model to explain the experimental data. In other words, we want to maximize the probability of the model given the data:
λ̂ = argmax_λ P(λ | E)   (where λ is the model and E is the observed FRET trajectory)
But we don't know how to calculate P(λ | E)! Instead, turn it around using Bayes' theorem:
P(λ | E) = P(E | λ) P(λ) / P(E)
The probability P(E) is independent of the model choice and will not affect model ranking. If we assume all models are equally likely a priori, then:
λ̂ = argmax_λ P(λ | E) = argmax_λ P(E | λ)
P(E | λ) is easy to calculate: it comes from the observation distribution. Why is X not here? We have to sum over all possible state sequences!
Segmental k-means (SKM): optimization on the cheap
Starting from an initial model λ_0, iterate two steps: state assignment (Viterbi), then parameter re-estimation to get λ_i.
To get B, simply calculate the mean and standard deviation of the data assigned to each state. To get A, count the number of transitions of each type and normalize. To get π, count the number of times each dwell starts in each state x_i and normalize.
Works only if the starting model is close to the final one. F. Qin (2004), Biophys J 86: 1488.
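The re-estimation half of one SKM iteration follows directly from those counting rules; a minimal sketch (assuming every state appears in the assignment, and folding all dwells into one trace so the π update is omitted):

```python
import numpy as np

def skm_reestimate(obs, path, n_states):
    """Re-estimate B (per-state mean/std) and A (normalized transition counts)
    from a Viterbi state assignment, as in segmental k-means."""
    obs, path = np.asarray(obs, dtype=float), np.asarray(path)
    mu = np.array([obs[path == i].mean() for i in range(n_states)])
    sigma = np.array([obs[path == i].std() for i in range(n_states)])
    counts = np.zeros((n_states, n_states))
    for i, j in zip(path[:-1], path[1:]):
        counts[i, j] += 1                       # count each transition type
    A = counts / counts.sum(axis=1, keepdims=True)
    return mu, sigma, A

# Toy trace with two clearly separated FRET levels (illustrative numbers)
mu, sigma, A = skm_reestimate([0.1, 0.12, 0.9, 0.88], [0, 0, 1, 1], 2)
print(mu)  # [0.11 0.89]
```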
Model optimization: expectation-maximization (EM)
Expectation: calculate the probability of the data given the model:
P(E, X | λ) = P(X_0) ∏_{t=1..T} P(X_t | X_{t-1}) P(E_t | X_t)
LL = log P(X_0) + Σ_{t=1..T} log[ P(X_t | X_{t-1}) P(E_t | X_t) ]
where the initial (π), transition (A), and observation (B) models supply the respective factors.
Maximization: adjust model parameters to better fit the calculated probabilities.
Termination: iterate until the log-likelihood converges (e.g., ΔLL < 10^-4).
Restarts: if the likelihood landscape is very frustrated, restarting from a random initial model can help get out of local minima.
The forward-backward algorithm (Baum-Welch)
Calculating the state probabilities at a particular point in time t, using both the past and the future observations:
P(X_t | E_1..T) = P(X_t | E_1..t, E_{t+1..T}) ∝ P(X_t | E_1..t) · P(E_{t+1..T} | X_t)
where the first factor comes from the forward pass and the second from the backward pass. We can do this because of Bayes' rule and the conditional independence of observations over time. We calculate these much like we did with Viterbi.
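The α·β decomposition can be sketched directly (unscaled, so suitable only for short traces; model values are borrowed from the umbrella example, NumPy assumed):

```python
import numpy as np

def forward_backward(pi, A, B, obs):
    """Posterior state probabilities P(X_t | all observations) via alpha * beta."""
    T, n = len(obs), len(pi)
    alpha = np.zeros((T, n))                 # forward: P(E_0..t, X_t)
    beta = np.zeros((T, n))                  # backward: P(E_{t+1}..E_T | X_t)
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta                     # proportional to the posterior
    return gamma / gamma.sum(axis=1, keepdims=True)

pi = np.array([0.70, 0.15, 0.15])
A = np.array([[0.70, 0.25, 0.05], [0.33, 0.33, 0.33], [0.20, 0.60, 0.20]])
B = np.array([[0.05, 0.95], [0.10, 0.90], [0.85, 0.15]])
post = forward_backward(pi, A, B, [0, 0, 1])  # one distribution per time point
```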
The forward algorithm
Partial probabilities (α) are calculated recursively as:
α_t(j) = P(observation | hidden state is j) × P(all paths to state j at time t)
Initial condition: α_0(j) = π(j) B(j, E_0)
Iterate: α_{t+1}(j) = B(j, E_{t+1}) Σ_{i=1..n} α_t(i) a_{i,j}
The total probability of the observation sequence is then the sum of the α's at the final time point.
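The same recursion in code, with per-step rescaling to avoid numerical underflow on long traces (a common practical addition not shown on the slide; umbrella model values reused for illustration):

```python
import numpy as np

def forward_loglik(pi, A, B, obs):
    """log P(E | lambda) via the forward algorithm with rescaling."""
    alpha = pi * B[:, obs[0]]                # alpha_0(j) = pi(j) B(j, E_0)
    loglik = np.log(alpha.sum())
    alpha = alpha / alpha.sum()              # rescale so alpha sums to 1
    for e in obs[1:]:
        alpha = (alpha @ A) * B[:, e]        # alpha_{t+1}(j) = B(j,E) sum_i alpha_t(i) a_ij
        loglik += np.log(alpha.sum())        # accumulate the discarded scale factors
        alpha = alpha / alpha.sum()
    return loglik

pi = np.array([0.70, 0.15, 0.15])
A = np.array([[0.70, 0.25, 0.05], [0.33, 0.33, 0.33], [0.20, 0.60, 0.20]])
B = np.array([[0.05, 0.95], [0.10, 0.90], [0.85, 0.15]])
print(forward_loglik(pi, A, B, [0, 0, 1]))
```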
Maximization using forward-backward probabilities
From the forward-backward quantities we obtain the probability of transitioning from state i to state j at time t, and the probability of being in state i at time t. The model parameters are then adjusted to maximize the log-likelihood. This is very much like SKM, except we use explicit probabilities instead of just counting.
The problem of bias
You can always get a better fit using more parameters! But it may not be a good model.
Bayesian information criterion (BIC): BIC = -2 LL + k ln(n), where k is the number of free parameters, LL is the log-likelihood of the optimal fit, and n is the number of data points.
Akaike information criterion (AIC): AIC = 2k - 2 LL.
There are also maximum-evidence methods (vbFRET), etc.
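Both criteria are one-liners once the fit's log-likelihood is in hand (note LL is already a log, so it enters as -2·LL; the example fit values below are hypothetical):

```python
import numpy as np

def bic(loglik, k, n):
    """Bayesian information criterion: lower is better; penalty grows with data size."""
    return -2.0 * loglik + k * np.log(n)

def aic(loglik, k):
    """Akaike information criterion: fixed penalty of 2 per free parameter."""
    return 2.0 * k - 2.0 * loglik

# Hypothetical comparison: a 2-state model vs. a 3-state model fit to n = 1000 frames.
# The bigger model fits slightly better, but its extra parameters must earn their penalty.
print(bic(-520.0, 8, 1000), bic(-515.0, 15, 1000))
```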
HMM is a statistical framework for modeling a hidden system using a sequence of observations generated by that system: a sequence of hidden states X_0, X_1, X_2, ... (here, conformations) produces a sequence of observations E_0, E_1, E_2, ... (here, FRET values).
We want to know: 1) How many distinct states are there? 2) What are their FRET values? 3) What are the rates? 4) What is the most likely state at each point in time?
Quantifying kinetics is then useful for understanding how outside factors (ligands) influence dynamics. [Figures: FRET trace at 2 mM Na+ with +2 mM Ala; open- and closed-state occupancy (%) and dwell time (s) as a function of log [Ala] (M).] Zhao and Terry, et al. (2011), Nature 474.
Other important examples of Markov modeling: single-channel recordings (patch clamp), sequence analysis, cardiac electrical modeling, and systems modeling of metabolic networks.
We can do non-equilibrium Markov modeling, too. Geggier et al. (2010), JMB 399: 576.
HMM is useful for many, many problems: Speech Recognition and Translation, Weather Modeling, Sequence Alignment, Financial Modeling.
Some useful references
- Artificial Intelligence: A Modern Approach
- http://www.comp.leeds.ac.uk/roger/hiddenmarkovmodels/html_dev/main.html
- Rabiner (1989), Proc. of the IEEE 77: 257
- Qin F. Principles of single-channel kinetic analysis. Methods Mol Biol (2007) 403
- Bronson et al. (2009), Biophys J 97: 3196
- QuB software suite: www.qub.buffalo.edu