Hidden Markov Models
The Three Basic HMM Problems (note: change in notation)
Mitch Marcus, CSE 391
Parameters of an HMM
States: a set of states $S = s_1, \ldots, s_n$
Transition probabilities: $A = a_{1,1}, a_{1,2}, \ldots, a_{n,n}$. Each $a_{i,j}$ represents the probability of transitioning from state $s_i$ to state $s_j$.
Emission probabilities: a set $B$ of functions of the form $b_i(o_t)$, which is the probability of observation $o_t$ being emitted by state $s_i$.
Initial state distribution: $\pi_i$ is the probability that $s_i$ is a start state.
(This and later slides follow the classic formulation by Rabiner and Juang, following Ferguson, as adapted by Manning and Schutze. Slides adapted from Dorr. Note the change in notation!!)
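To make the notation concrete, here is a minimal sketch of these parameters in Python. The two-state "Hot"/"Cold" ice-cream model and all its numbers are hypothetical, chosen only for illustration; they do not come from the slides.

```python
# Hypothetical two-state HMM for illustration (not from the slides):
# states s_1 = Hot, s_2 = Cold; observation symbols v_1, v_2, v_3
# are the number of ice creams eaten that day (1, 2, or 3).
pi = [0.8, 0.2]          # initial state distribution: pi_i = P(q_1 = s_i)
A = [[0.7, 0.3],         # transition probabilities: a_ij = P(q_{t+1} = s_j | q_t = s_i)
     [0.4, 0.6]]
B = [[0.2, 0.4, 0.4],    # emission probabilities: b_i(v_k) = P(o_t = v_k | q_t = s_i)
     [0.5, 0.4, 0.1]]

# Sanity check: pi and each row of A and B must be probability distributions.
for dist in [pi] + A + B:
    assert abs(sum(dist) - 1.0) < 1e-12
```

Indexing is 0-based in the code, while the slides use 1-based subscripts.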
The Three Basic HMM Problems
Problem 1 (Evaluation): Given the observation sequence $O = o_1, \ldots, o_T$ and an HMM model $\lambda = (A, B, \pi)$, how do we compute the probability of $O$ given the model, $P(O \mid \lambda)$?
Problem 2 (Decoding): Given the observation sequence $O = o_1, \ldots, o_T$ and an HMM model $\lambda = (A, B, \pi)$, how do we find the state sequence that best explains the observations?
Problem 3 (Learning): How do we adjust the model parameters $\lambda = (A, B, \pi)$ to maximize $P(O \mid \lambda)$?
Problem 1: Probability of an Observation Sequence
Q: What is $P(O \mid \lambda)$?
A: The sum of the probabilities of all possible state sequences in the HMM. The probability of each state sequence is itself the product of its state-transition and emission probabilities.
Naive computation is very expensive. Given $T$ observations and $N$ states, there are $N^T$ possible state sequences (for $T = 10$ and $N = 10$, 10 billion different paths!!).
Solution: linear-time dynamic programming!
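The naive sum over all $N^T$ state sequences can be written directly for a tiny model. This sketch assumes the same hypothetical two-state model used throughout (an illustration, not from the slides) and is only feasible for very short sequences:

```python
from itertools import product

# Hypothetical toy model (illustration only): 2 states, 3 observation symbols.
pi = [0.8, 0.2]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]]

def naive_likelihood(obs, pi, A, B):
    """P(O | lambda) by brute-force enumeration of all N**T state sequences."""
    N = len(pi)
    total = 0.0
    for q in product(range(N), repeat=len(obs)):
        # Joint probability of this state sequence with the observations:
        # pi_{q_1} b_{q_1}(o_1) * prod_t a_{q_{t-1}, q_t} b_{q_t}(o_t)
        p = pi[q[0]] * B[q[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[q[t-1]][q[t]] * B[q[t]][obs[t]]
        total += p
    return total

p = naive_likelihood([2, 0, 1], pi, A, B)   # observation sequence v_3, v_1, v_2
```

Summing `naive_likelihood` over every possible observation sequence of a fixed length yields 1, which is a useful sanity check on the enumeration.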
The Crucial Data Structure: The Trellis
Forward Probabilities
For a given HMM, given that the state is $s_i$ at time $t$ (with change of notation: some arbitrary time), what is the probability that the partial observation $o_1 \ldots o_t$ has been generated?
$\alpha_t(i) = P(o_1 \ldots o_t, q_t = s_i \mid \lambda)$
The Forward algorithm computes $\alpha_t(i)$ for $1 \le i \le N$, $1 \le t \le T$ in time $O(N^2 T)$ using the trellis.
Forward Algorithm: Induction Step
$\alpha_t(i) = P(o_1 \ldots o_t, q_t = s_i \mid \lambda)$
$\alpha_t(j) = \left[ \sum_{i=1}^{N} \alpha_{t-1}(i) \, a_{ij} \right] b_j(o_t)$
Forward Algorithm
Initialization: $\alpha_1(i) = \pi_i \, b_i(o_1)$, $1 \le i \le N$
Induction: $\alpha_t(j) = \left[ \sum_{i=1}^{N} \alpha_{t-1}(i) \, a_{ij} \right] b_j(o_t)$, $2 \le t \le T$, $1 \le j \le N$
Termination: $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$
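The three steps above translate almost line-for-line into code. A minimal sketch, again assuming the hypothetical toy model used for illustration:

```python
def forward(obs, pi, A, B):
    """Forward algorithm: returns the trellis alpha and P(O | lambda).
    alpha[t][i] corresponds to alpha_{t+1}(i+1) in the slides' 1-based notation."""
    N, T = len(pi), len(obs)
    alpha = [[0.0] * N for _ in range(T)]
    # Initialization: alpha_1(i) = pi_i * b_i(o_1)
    for i in range(N):
        alpha[0][i] = pi[i] * B[i][obs[0]]
    # Induction: alpha_t(j) = (sum_i alpha_{t-1}(i) * a_ij) * b_j(o_t)
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = sum(alpha[t-1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
    # Termination: P(O | lambda) = sum_i alpha_T(i)
    return alpha, sum(alpha[T-1])

# Hypothetical toy model (illustration only).
pi = [0.8, 0.2]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]]
alpha, p = forward([2, 0, 1], pi, A, B)
```

The two nested loops of the induction step are where the $O(N^2 T)$ running time comes from.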
Forward Algorithm Complexity
The naive approach requires exponential time to evaluate all $N^T$ state sequences.
The Forward algorithm, using dynamic programming, takes $O(N^2 T)$ computations.
Backward Probabilities
For a given HMM, given that the state is $s_i$ at time $t$, what is the probability that the partial observation $o_{t+1} \ldots o_T$ will be generated?
$\beta_t(i) = P(o_{t+1} \ldots o_T \mid q_t = s_i, \lambda)$
Analogous to the forward probability, just in the other direction: the Backward algorithm computes $\beta_t(i)$ for $1 \le i \le N$, $1 \le t \le T$ in time $O(N^2 T)$ using the trellis.
Backward Probabilities
$\beta_t(i) = P(o_{t+1} \ldots o_T \mid q_t = s_i, \lambda)$
$\beta_t(i) = \sum_{j=1}^{N} a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j)$
Backward Algorithm
Initialization: $\beta_T(i) = 1$, $1 \le i \le N$
Induction: $\beta_t(i) = \sum_{j=1}^{N} a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j)$, $t = T-1, \ldots, 1$, $1 \le i \le N$
Termination: $P(O \mid \lambda) = \sum_{i=1}^{N} \pi_i \, b_i(o_1) \, \beta_1(i)$
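The backward pass is the mirror image of the forward pass, filling the trellis from $t = T$ down to $t = 1$. A minimal sketch, assuming the same hypothetical toy model:

```python
def backward(obs, pi, A, B):
    """Backward algorithm: returns the trellis beta and P(O | lambda)."""
    N, T = len(pi), len(obs)
    beta = [[0.0] * N for _ in range(T)]
    # Initialization: beta_T(i) = 1
    for i in range(N):
        beta[T-1][i] = 1.0
    # Induction: beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t+1]] * beta[t+1][j]
                             for j in range(N))
    # Termination: P(O | lambda) = sum_i pi_i * b_i(o_1) * beta_1(i)
    p = sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in range(N))
    return beta, p

# Hypothetical toy model (illustration only).
pi = [0.8, 0.2]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]]
beta, p = backward([2, 0, 1], pi, A, B)
```

Both terminations compute the same quantity, so the backward value of $P(O \mid \lambda)$ must match the forward one; this is a standard debugging check.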
Problem 2: Decoding
The Forward algorithm efficiently computes the sum over all paths through an HMM. Here, we instead want to find the highest-probability path.
We want to find the state sequence $Q = q_1 \ldots q_T$ such that
$Q = \operatorname{argmax}_{Q'} P(Q' \mid O, \lambda)$
Viterbi Algorithm
Just like the Forward algorithm, but instead of summing over transitions from incoming states, compute the maximum.
Forward recursion: $\alpha_t(j) = \left[ \sum_{i=1}^{N} \alpha_{t-1}(i) \, a_{ij} \right] b_j(o_t)$
Viterbi recursion: $\delta_t(j) = \left[ \max_{1 \le i \le N} \delta_{t-1}(i) \, a_{ij} \right] b_j(o_t)$
Core Idea of the Viterbi Algorithm
Not Quite What We Want
The Viterbi recursion computes the maximum probability of a path to state $s_j$ at time $t$, given that the partial observation $o_1 \ldots o_t$ has been generated:
$\delta_t(j) = \left[ \max_{1 \le i \le N} \delta_{t-1}(i) \, a_{ij} \right] b_j(o_t)$
But we want the path itself that gives the maximum probability. Solution:
1. Keep backpointers.
2. Find $q_T^* = \operatorname{argmax}_j \delta_T(j)$.
3. Chase the backpointers from state $q_T^*$ at time $T$ to recover the state sequence (backwards).
Viterbi Algorithm
Initialization: $\delta_1(i) = \pi_i \, b_i(o_1)$, $1 \le i \le N$
Induction: $\delta_t(j) = \left[ \max_{1 \le i \le N} \delta_{t-1}(i) \, a_{ij} \right] b_j(o_t)$, $2 \le t \le T$, $1 \le j \le N$
Backpointers: $\psi_t(j) = \operatorname{argmax}_{1 \le i \le N} \delta_{t-1}(i) \, a_{ij}$
Termination: $q_T^* = \operatorname{argmax}_{1 \le i \le N} \delta_T(i)$ (Final state!)
Backpointer path: $q_t^* = \psi_{t+1}(q_{t+1}^*)$, $t = T-1, \ldots, 1$
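A minimal sketch of the full Viterbi algorithm, including the backpointer array and the final backward chase, again assuming the hypothetical toy model used for illustration:

```python
def viterbi(obs, pi, A, B):
    """Viterbi algorithm: best state sequence and its probability.
    delta[t][j] is the max probability of any path ending in state j at time t;
    psi[t][j] is the backpointer to that path's best predecessor state."""
    N, T = len(pi), len(obs)
    delta = [[0.0] * N for _ in range(T)]
    psi = [[0] * N for _ in range(T)]
    # Initialization: delta_1(i) = pi_i * b_i(o_1)
    for i in range(N):
        delta[0][i] = pi[i] * B[i][obs[0]]
    # Induction with backpointers: max instead of sum
    for t in range(1, T):
        for j in range(N):
            best = max(range(N), key=lambda i: delta[t-1][i] * A[i][j])
            psi[t][j] = best
            delta[t][j] = delta[t-1][best] * A[best][j] * B[j][obs[t]]
    # Termination: pick the best final state, then chase backpointers.
    q = [0] * T
    q[T-1] = max(range(N), key=lambda i: delta[T-1][i])
    for t in range(T - 2, -1, -1):
        q[t] = psi[t+1][q[t+1]]
    return q, delta[T-1][q[T-1]]

# Hypothetical toy model (illustration only).
pi = [0.8, 0.2]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]]
path, p_best = viterbi([2, 0, 1], pi, A, B)
```

The only difference from `forward` is replacing the sum with a max and recording which predecessor achieved it.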
Problem 3: Learning
Up to now, we've assumed that we know the underlying model $\lambda = (A, B, \pi)$.
Often these parameters are estimated on annotated training data, but:
Annotation is often difficult and/or expensive.
Training data is different from the current data.
We want to maximize the parameters with respect to the current data, i.e., we're looking for a model $\lambda'$ such that $\lambda' = \operatorname{argmax}_{\lambda} P(O \mid \lambda)$.
Problem 3: Learning (If Time Allows...)
Unfortunately, there is no known way to analytically find a global maximum, i.e., a model $\lambda'$ such that $\lambda' = \operatorname{argmax}_{\lambda} P(O \mid \lambda)$.
But it is possible to find a local maximum: given an initial model $\lambda$, we can always find a model $\lambda'$ such that $P(O \mid \lambda') \ge P(O \mid \lambda)$.
Forward-Backward (Baum-Welch) Algorithm
Key idea: parameter re-estimation by hill-climbing.
The FB algorithm iteratively re-estimates the parameters, yielding a new $\lambda'$ at each iteration:
1. Initialize $\lambda$ to a random set of values.
2. Estimate $P(O \mid \lambda)$, filling out the trellis for both the Forward and the Backward algorithms.
3. Re-estimate $\lambda$ using both trellises, yielding a new estimate $\lambda'$.
Theorem: $P(O \mid \lambda') \ge P(O \mid \lambda)$
Parameter Re-estimation
Three parameters need to be re-estimated:
Initial state distribution: $\pi_i$
Transition probabilities: $a_{i,j}$
Emission probabilities: $b_i(o_t)$
Re-estimating Transition Probabilities: Step 1
What's the probability of being in state $s_i$ at time $t$ and going to state $s_j$, given the current model and parameters?
$\xi_t(i, j) = P(q_t = s_i, q_{t+1} = s_j \mid O, \lambda)$
Re-estimating Transition Probabilities: Step 1
$\xi_t(i, j) = P(q_t = s_i, q_{t+1} = s_j \mid O, \lambda)$
$\xi_t(i, j) = \dfrac{\alpha_t(i) \, a_{i,j} \, b_j(o_{t+1}) \, \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i) \, a_{i,j} \, b_j(o_{t+1}) \, \beta_{t+1}(j)}$
Re-estimating Transition Probabilities: Step 2
The intuition behind the re-estimation equation for transition probabilities is:
$\hat{a}_{i,j} = \dfrac{\text{expected number of transitions from state } s_i \text{ to state } s_j}{\text{expected number of transitions from state } s_i}$
Formally:
$\hat{a}_{i,j} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \sum_{j'=1}^{N} \xi_t(i, j')}$
Re-estimating Transition Probabilities
Defining $\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i, j)$ as the probability of being in state $s_i$, given the complete observation $O$, we can say:
$\hat{a}_{i,j} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$
Re-estimating Initial State Probabilities
Initial state distribution: $\pi_i$ is the probability that $s_i$ is a start state.
Re-estimation is easy:
$\hat{\pi}_i = \text{expected number of times in state } s_i \text{ at time } 1$
Formally: $\hat{\pi}_i = \gamma_1(i)$
Re-estimation of Emission Probabilities
Emission probabilities are re-estimated as:
$\hat{b}_i(k) = \dfrac{\text{expected number of times in state } s_i \text{ observing symbol } v_k}{\text{expected number of times in state } s_i}$
Formally:
$\hat{b}_i(k) = \dfrac{\sum_{t=1}^{T} \delta(o_t, v_k) \, \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)}$
where $\delta(o_t, v_k) = 1$ if $o_t = v_k$, and 0 otherwise.
Note that $\delta$ here is the Kronecker delta function and is not related to the $\delta$ in the discussion of the Viterbi algorithm!!
The Updated Model
Coming from $\lambda = (A, B, \pi)$, we get to $\lambda' = (\hat{A}, \hat{B}, \hat{\pi})$ by the following update rules:
$\hat{a}_{i,j} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$
$\hat{b}_i(k) = \dfrac{\sum_{t=1}^{T} \delta(o_t, v_k) \, \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)}$
$\hat{\pi}_i = \gamma_1(i)$
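The whole re-estimation step can be sketched end to end: run the forward and backward passes, compute $\gamma$ and $\xi$ from both trellises, and apply the three update rules. This is a minimal single-sequence sketch, assuming the same hypothetical toy model; a practical implementation would also rescale the trellises to avoid underflow on long sequences.

```python
def forward(obs, pi, A, B):
    """Forward trellis and P(O | lambda)."""
    N, T = len(pi), len(obs)
    alpha = [[0.0] * N for _ in range(T)]
    for i in range(N):
        alpha[0][i] = pi[i] * B[i][obs[0]]
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = sum(alpha[t-1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
    return alpha, sum(alpha[T-1])

def backward(obs, pi, A, B):
    """Backward trellis (beta_T(i) = 1)."""
    N, T = len(pi), len(obs)
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t+1]] * beta[t+1][j] for j in range(N))
    return beta

def baum_welch_step(obs, pi, A, B):
    """One Forward-Backward re-estimation step: lambda -> lambda'."""
    N, T, K = len(pi), len(obs), len(B[0])
    alpha, p = forward(obs, pi, A, B)
    beta = backward(obs, pi, A, B)
    # gamma_t(i) = P(q_t = s_i | O, lambda) = alpha_t(i) * beta_t(i) / P(O | lambda)
    gamma = [[alpha[t][i] * beta[t][i] / p for i in range(N)] for t in range(T)]
    # xi_t(i,j) = alpha_t(i) a_ij b_j(o_{t+1}) beta_{t+1}(j) / P(O | lambda)
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t+1]] * beta[t+1][j] / p
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    # The three update rules from the slide above.
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k) /
              sum(gamma[t][i] for t in range(T))
              for k in range(K)] for i in range(N)]
    return new_pi, new_A, new_B

# Hypothetical toy model (illustration only).
pi = [0.8, 0.2]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]]
obs = [2, 0, 1, 1, 2]
_, p_old = forward(obs, pi, A, B)
pi2, A2, B2 = baum_welch_step(obs, pi, A, B)
_, p_new = forward(obs, pi2, A2, B2)
```

By the theorem on the Baum-Welch slide, `p_new` is guaranteed to be at least `p_old` after each step.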
Expectation Maximization
The Forward-Backward algorithm is an instance of the more general EM algorithm.
The E step: compute the forward and backward probabilities for a given model.
The M step: re-estimate the model parameters.