Representing sequence data: Hidden Markov Models
CISC 5800, Professor Daniel Leeds

Examples of sequence data: spoken language, DNA sequences, daily stock values.

Example: spoken language
"F?r plu? fi?e is nine"
Between F and r we expect a vowel: aw, ee, ah; NOT oh, uh
At the end of "plu" we expect a consonant: g, m, s; NOT d, p

Markov Models
Start with:
n states: s_1, ..., s_n
Probability of initial start states: Π_1, ..., Π_n
Probability of transition between states: A_{i,j} = P(q_t = s_i | q_{t-1} = s_j)
[State-transition diagram: three states with labeled transition probabilities]

A dice-y example
Two colored dice, states s_A and s_B.
What is the probability we start at s_A? 0.3
What is the probability of the sequence of die choices s_A, s_A? 0.3 × 0.8 = 0.24
What is the probability of the sequence of die choices s_B, s_A, s_B, s_A? 0.7 × 0.2 × 0.2 × 0.2 = 0.0056
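The dice arithmetic above can be checked with a short script; a minimal sketch, assuming the lecture's example values for Π and A (the function name sequence_prob is my own):

```python
# Two-state Markov chain from the dice example (states "A" and "B").
PI = {"A": 0.3, "B": 0.7}                 # initial-state probabilities Pi_i
A = {("A", "A"): 0.8, ("B", "A"): 0.2,    # A[(i, j)] = P(q_t = s_i | q_{t-1} = s_j)
     ("A", "B"): 0.2, ("B", "B"): 0.8}

def sequence_prob(states):
    """P(q_1, ..., q_T) = Pi(q_1) times the product of transition probabilities."""
    p = PI[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= A[(cur, prev)]
    return p

print(sequence_prob(["A", "A"]))            # ≈ 0.24
print(sequence_prob(["B", "A", "B", "A"]))  # ≈ 0.0056
```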
A dice-y example, continued
What is the probability that the die choice is s_B at time t=5?
Dynamic programming: find the answer for q_t, then compute q_{t+1}.
p_t(i) = P(q_t = s_i) -- probability of state i at time t
p_t(i) = Σ_j P(q_t = s_i | q_{t-1} = s_j) p_{t-1}(j)

State \ Time   t=1    t=2    t=3
s_A            0.3    0.38   0.428
s_B            0.7    0.62   0.572

Hidden Markov Models
The actual state q is hidden; each state produces visible data o.
Probability of observing value x_i when the state is s_j:
φ_{i,j} = P(o_t = x_i | q_t = s_j)
For a state sequence Q and observation sequence O:
P(O, Q | θ) = [p(q_1 | π) Π_{t>1} p(q_t | q_{t-1}, A)] × [Π_t φ_{o_t, q_t}]
(probability of the state sequence × probability of the observation sequence given the states)

Deducing the die based on observed emissions
Each colored die is biased. Intuition: balance transition and emission probabilities.
Observed numbers: 554565254556 -- the 2 is probably from s_B
Observed numbers: 554565213321 -- the 2 is probably from s_A

Deducing the die based on observed emissions
[Emission table: P(o | s_R) and P(o | s_B) for each die face o]
We see: 5. What is the probability of o=5, q=B (blue)?
Π_B φ_{5,B} = 0.7 × 0.2 = 0.14
We see: 5, 3. What is the probability of o=5,3 and q=B,B?
Π_B φ_{5,B} A_{B,B} φ_{3,B} = 0.7 × 0.2 × 0.8 × 0.1 = 0.0112
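The t=2 and t=3 columns of the table can be reproduced by running the recurrence forward; a minimal sketch under the same example parameters (the function name state_marginals is my own):

```python
# Dynamic-programming recurrence p_t(i) = sum_j P(q_t=s_i | q_{t-1}=s_j) p_{t-1}(j)
# for the lecture's two-state dice example.
PI = {"A": 0.3, "B": 0.7}
A = {("A", "A"): 0.8, ("B", "A"): 0.2,
     ("A", "B"): 0.2, ("B", "B"): 0.8}

def state_marginals(T):
    """Return p_t(i) for t = 1..T as a list of dicts."""
    p = [dict(PI)]
    for _ in range(T - 1):
        prev = p[-1]
        p.append({i: sum(A[(i, j)] * prev[j] for j in prev) for i in prev})
    return p

marginals = state_marginals(3)
# t=2: p(A) ≈ 0.38, p(B) ≈ 0.62; t=3: p(A) ≈ 0.428, p(B) ≈ 0.572
```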
Goal: calculate the most likely states given the observable data.

Viterbi algorithm: δ_t(i)
Define and use δ_t(i): the maximum possible value of P(q_1, ..., q_t, o_1, ..., o_t) given that we insist q_t = s_i.
That is, find the most likely path from q_1 to q_t such that q_t = s_i, with outputs o_1, ..., o_t.
δ_1(i) = Π_i P(o_1 | q_1 = s_i) = Π_i φ_{o_1,i}
δ_t(i) = P(o_t | q_t = s_i) max_j δ_{t-1}(j) P(q_t = s_i | q_{t-1} = s_j) = φ_{o_t,i} max_j δ_{t-1}(j) A_{i,j}
Q* = argmax_Q P(Q | O); the final state of the best path is argmax_i δ_T(i).

Viterbi algorithm: bigger picture
Compute all the δ_t(i)'s:
At time 1, compute δ_1(i) for every state i.
At time 2, compute δ_2(i) for every state i (based on the δ_1(j) values).
...
At time T, compute δ_T(i) for every state i (based on the δ_{T-1}(j) values).
Then find the states going from t=T back to t=1 that lead to max_i δ_T(i):
find the state that gives the maximum value of δ_T(·);
find the state k at time T-1 used to maximize δ_T(·);
...
find the state z at time 1 used to maximize δ_2(y).

Viterbi in action: observe 5, 1

State   t=1 (o_1=5)        t=2 (o_2=1)
s_A     0.3 × 0.1 = 0.03   0.0084 (from B)
s_B     0.7 × 0.2 = 0.14   0.0112 (from B)

δ_2(A) = 0.3 × max(0.8 × 0.03, 0.2 × 0.14) = 0.3 × 0.028 = 0.0084
δ_2(B) = 0.1 × max(0.2 × 0.03, 0.8 × 0.14) = 0.1 × 0.112 = 0.0112
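The δ recurrence alone (before any backtracking) can be sketched as follows, using the example's parameter values; delta_table is my own name, and the emission table below covers only the die faces needed here:

```python
# delta[t][i] = max over paths ending in state i of P(q_1..q_t, o_1..o_t).
PI  = {"A": 0.3, "B": 0.7}
A   = {("A", "A"): 0.8, ("B", "A"): 0.2,
       ("A", "B"): 0.2, ("B", "B"): 0.8}
PHI = {"A": {5: 0.1, 1: 0.3}, "B": {5: 0.2, 1: 0.1}}  # PHI[state][face]

def delta_table(obs):
    """Run the Viterbi recurrence, returning the full delta table."""
    delta = [{i: PI[i] * PHI[i][obs[0]] for i in PI}]
    for o in obs[1:]:
        prev = delta[-1]
        delta.append({i: PHI[i][o] * max(A[(i, j)] * prev[j] for j in PI)
                      for i in PI})
    return delta

d = delta_table([5, 1])
# d[0] ≈ {A: 0.03, B: 0.14}; d[1] ≈ {A: 0.0084, B: 0.0112}, as on the slide
```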
Viterbi in action: observe 5, 1, 1
δ_3(A) = 0.3 × max(0.8 × 0.0084, 0.2 × 0.0112) = 0.3 × 0.00672 = 0.00202
δ_3(B) = 0.1 × max(0.2 × 0.0084, 0.8 × 0.0112) = 0.1 × 0.00896 = 0.000896

Viterbi in action: observe 5, 1, 1, 1
δ_4(A) = 0.3 × max(0.8 × 0.00202, 0.2 × 0.000896) = 0.3 × 0.00161 = 0.00048
δ_4(B) = 0.1 × max(0.2 × 0.00202, 0.8 × 0.000896) = 0.1 × 0.000717 = 0.0000717

State   t=1 (o_1=5)   t=2 (o_2=1)       t=3 (o_3=1)        t=4 (o_4=1)
s_A     0.03          0.0084 (from B)   0.00202 (from A)   0.00048 (from A)
s_B     0.14          0.0112 (from B)   0.000896 (from B)  0.0000717 (from B)

Viterbi in action: observe 5, 1, 1, 1, 2
δ_5(A) = 0.2 × max(0.8 × 0.00048, 0.2 × 0.0000717) = 0.2 × 0.000387 = 0.000077
δ_5(B) = 0.1 × max(0.2 × 0.00048, 0.8 × 0.0000717) = 0.1 × 0.0000968 = 0.0000097

State   t=1 (o_1=5)   t=2 (o_2=1)    t=3 (o_3=1)     t=4 (o_4=1)      t=5 (o_5=2)
s_A     0.03          0.0084 (<-B)   0.00202 (<-A)   0.00048 (<-A)    0.000077 (<-A)
s_B     0.14          0.0112 (<-B)   0.000896 (<-B)  0.0000717 (<-B)  0.0000097 (<-A)

Backtracking from the largest final value, δ_5(A), gives the state sequence: B, A, A, A, A.

Parameters in an HMM
Initial probabilities: π_i
Transition probabilities: A_{i,j}
Emission probabilities: φ_{i,j}
How do we learn these values?
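The full computation, including the backpointers used in the backtracking step, can be sketched as below; parameter values follow the worked example, and all names are my own:

```python
# Viterbi with backpointers for the lecture's two-die HMM.
PI  = {"A": 0.3, "B": 0.7}
A   = {("A", "A"): 0.8, ("B", "A"): 0.2,   # A[(i, j)] = P(q_t = s_i | q_{t-1} = s_j)
       ("A", "B"): 0.2, ("B", "B"): 0.8}
PHI = {"A": {5: 0.1, 1: 0.3, 2: 0.2},      # PHI[state][face] = P(o | state)
       "B": {5: 0.2, 1: 0.1, 2: 0.1}}

def viterbi(obs):
    states = list(PI)
    delta = [{i: PI[i] * PHI[i][obs[0]] for i in states}]
    back = []  # back[t][i] = best predecessor of state i at time t+2
    for o in obs[1:]:
        prev = delta[-1]
        back.append({i: max(states, key=lambda j: A[(i, j)] * prev[j])
                     for i in states})
        delta.append({i: PHI[i][o] * max(A[(i, j)] * prev[j] for j in states)
                      for i in states})
    # backtrack from the best final state
    path = [max(states, key=lambda i: delta[-1][i])]
    for bp in reversed(back):
        path.append(bp[path[-1]])
    return delta, path[::-1]

delta, path = viterbi([5, 1, 1, 1, 2])
# path == ["B", "A", "A", "A", "A"], the state sequence concluded on the slide
```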
Learning HMM parameters: π_i
First, assume we know the states:
x_1: A,B,A,A,B
x_2: B,B,B,A,A
x_3: A,A,B,A,B
Compute the MLE for each parameter:
π = argmax_π Π_k [π(q_1^(k)) Π_t p(q_t^(k) | q_{t-1}^(k))]
π_A = #D(q_1 = s_A) / #D

Learning HMM parameters: A_{i,j}
First, assume we know the states (same data as above).
Compute the MLE for each parameter:
A = argmax_A Π_k [π(q_1^(k)) Π_t p(q_t^(k) | q_{t-1}^(k))]
A_{i,j} = #D(q_t = s_i, q_{t-1} = s_j) / #D(q_{t-1} = s_j)

Learning HMM parameters: φ_{i,j}
First, assume we know the states and observations:
x_1: A,B,A,A,B   o_1: 2,5,3,3,6
x_2: B,B,B,A,A   o_2: 4,5,1,3,2
x_3: A,A,B,A,B   o_3: 1,4,5,2,6
Compute the MLE for each parameter:
φ_{i,j} = #D(o_t = x_i, q_t = s_j) / #D(q_t = s_j)

Challenges in HMM learning
Learning the parameters (π, A, φ) with known states is not too hard.
BUT usually the states are unknown.
If we had the parameters and the observations, we could figure out the states with Viterbi:
Q* = argmax_Q P(Q | O)
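These counting formulas can be checked directly on the three labeled sequences above; a minimal sketch (variable names are my own):

```python
from collections import Counter

# MLE by counting, when the state sequences are fully observed.
seqs = [("ABAAB", [2, 5, 3, 3, 6]),
        ("BBBAA", [4, 5, 1, 3, 2]),
        ("AABAB", [1, 4, 5, 2, 6])]

init, trans, emit, state_count = Counter(), Counter(), Counter(), Counter()
for q, o in seqs:
    init[q[0]] += 1
    for prev, cur in zip(q, q[1:]):
        trans[(cur, prev)] += 1          # count q_t = cur with q_{t-1} = prev
    for s, x in zip(q, o):
        emit[(x, s)] += 1
        state_count[s] += 1

pi_A = init["A"] / len(seqs)             # #D(q_1 = s_A) / #D
from_j = Counter()                       # #D(q_{t-1} = s_j)
for (cur, prev), c in trans.items():
    from_j[prev] += c
A_AB = trans[("A", "B")] / from_j["B"]   # P(q_t = A | q_{t-1} = B)
phi_5B = emit[(5, "B")] / state_count["B"]  # P(o = 5 | q = B)
```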
Expectation-Maximization, or EM
Problem: we are uncertain of the y_i (classes) and uncertain of θ (parameters).
Solution: guess the y_i, deduce θ, re-compute the y_i, re-compute θ, etc.
OR: guess θ, deduce the y_i, re-compute θ, re-compute the y_i.
This will converge to a (locally optimal) solution.
E step: fill in expected values for the missing labels y.
M step: regular MLE for θ given the known and filled-in variables.
EM is also useful when there are holes in your data.

Computing states q_t
Instead of picking one state q_t = s_i, find P(q_t = s_i | o):
P(q_t = s_i | o_1, ..., o_T) = α_t(i) β_t(i) / Σ_j α_t(j) β_t(j)
Forward probability: α_t(i) = P(o_1 ... o_t, q_t = s_i)
Backward probability: β_t(i) = P(o_{t+1} ... o_T | q_t = s_i)

Details of the forward probability
α_1(i) = φ_{o_1,i} Π_i = P(o_1 | q_1 = s_i) P(q_1 = s_i)
α_t(i) = φ_{o_t,i} Σ_j A_{i,j} α_{t-1}(j) = P(o_t | q_t = s_i) Σ_j P(q_t = s_i | q_{t-1} = s_j) α_{t-1}(j)

Details of the backward probability
β_t(i) = Σ_j A_{j,i} φ_{o_{t+1},j} β_{t+1}(j) = Σ_j P(q_{t+1} = s_j | q_t = s_i) P(o_{t+1} | q_{t+1} = s_j) β_{t+1}(j)
Base case: β_T(i) = 1, so β_{T-1}(i) = Σ_j A_{j,i} φ_{o_T,j} = Σ_j P(q_T = s_j | q_{T-1} = s_i) P(o_T | q_T = s_j)
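The forward and backward recurrences can be sketched together; parameters follow the lecture's two-die example, and the function name forward_backward is my own:

```python
# Forward-backward: posterior P(q_t = s_i | o_1..o_T) = alpha_t(i) beta_t(i) / Z.
PI  = {"A": 0.3, "B": 0.7}
A   = {("A", "A"): 0.8, ("B", "A"): 0.2,
       ("A", "B"): 0.2, ("B", "B"): 0.8}
PHI = {"A": {5: 0.1, 1: 0.3}, "B": {5: 0.2, 1: 0.1}}

def forward_backward(obs):
    S = list(PI)
    # forward pass: alpha_t(i) = P(o_1..o_t, q_t = s_i)
    alpha = [{i: PI[i] * PHI[i][obs[0]] for i in S}]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({i: PHI[i][o] * sum(A[(i, j)] * prev[j] for j in S)
                      for i in S})
    # backward pass: beta_t(i) = P(o_{t+1}..o_T | q_t = s_i), beta_T(i) = 1
    beta = [{i: 1.0 for i in S}]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, {i: sum(A[(j, i)] * PHI[j][o] * nxt[j] for j in S)
                        for i in S})
    # normalize alpha_t(i) * beta_t(i) to get the state posteriors
    post = []
    for a, b in zip(alpha, beta):
        z = sum(a[j] * b[j] for j in S)
        post.append({i: a[i] * b[i] / z for i in S})
    return post

post = forward_backward([5, 1, 1])  # post[t][i] = P(q_{t+1} = s_i | o)
```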
E-step: state probabilities
One state:
P(q_t = s_i | o_1, ..., o_T) = α_t(i) β_t(i) / Σ_j α_t(j) β_t(j) = S_t(i)
Two states in a row:
P(q_t = s_j, q_{t+1} = s_i | o_1, ..., o_T) = α_t(j) A_{i,j} φ_{o_{t+1},i} β_{t+1}(i) / Σ_f Σ_g α_t(g) A_{f,g} φ_{o_{t+1},f} β_{t+1}(f) = S_t(i, j)

Recall, when the states are known:
π_A = #D(q_1 = s_A) / #D
A_{i,j} = #D(q_t = s_i, q_{t-1} = s_j) / #D(q_{t-1} = s_j)
φ_{i,j} = #D(o_t = x_i AND q_t = s_j) / #D(q_t = s_j)

M-step
π_i = S_1(i)
A_{i,j} = Σ_t S_t(i, j) / Σ_t S_t(j)
φ_{obs,i} = Σ_{t: o_t = obs} S_t(i) / Σ_t S_t(i)

Review of HMMs in action
For classification, find the highest-probability class given the features.
Features for one sound: [q_1, o_1, q_2, o_2, ..., q_T, o_T]
[Diagram: a concluded word generates states Q, which generate observations O: sound1, sound2, sound3]
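The E- and M-step formulas above can be combined into a single Baum-Welch iteration; a hedged sketch under the example parameters (all names are my own, and the emission table covers only the faces observed here):

```python
# One EM (Baum-Welch) iteration: E-step computes S_t(i) and S_t(i, j)
# from forward-backward; M-step re-estimates pi, A, and phi.
PI  = {"A": 0.3, "B": 0.7}
A   = {("A", "A"): 0.8, ("B", "A"): 0.2,
       ("A", "B"): 0.2, ("B", "B"): 0.8}
PHI = {"A": {5: 0.1, 1: 0.3}, "B": {5: 0.2, 1: 0.1}}
S = list(PI)

def em_step(obs):
    # forward and backward passes
    alpha = [{i: PI[i] * PHI[i][obs[0]] for i in S}]
    for o in obs[1:]:
        p = alpha[-1]
        alpha.append({i: PHI[i][o] * sum(A[(i, j)] * p[j] for j in S) for i in S})
    beta = [{i: 1.0 for i in S}]
    for o in reversed(obs[1:]):
        n = beta[0]
        beta.insert(0, {i: sum(A[(j, i)] * PHI[j][o] * n[j] for j in S) for i in S})
    T = len(obs)
    z = sum(alpha[-1][j] for j in S)                    # P(o_1..o_T)
    # E-step: gamma[t][i] = S_t(i), xi[t][(i, j)] = S_t(i, j)
    gamma = [{i: alpha[t][i] * beta[t][i] / z for i in S} for t in range(T)]
    xi = [{(i, j): alpha[t][j] * A[(i, j)] * PHI[i][obs[t + 1]] * beta[t + 1][i] / z
           for i in S for j in S} for t in range(T - 1)]
    # M-step: re-estimate parameters from the expected counts
    pi_new = gamma[0]
    A_new = {(i, j): sum(x[(i, j)] for x in xi) /
                     sum(gamma[t][j] for t in range(T - 1))
             for i in S for j in S}
    phi_new = {i: {o: sum(g[i] for g, x in zip(gamma, obs) if x == o) /
                      sum(g[i] for g in gamma)
                   for o in set(obs)} for i in S}
    return pi_new, A_new, phi_new

pi_new, A_new, phi_new = em_step([5, 1, 1, 5])
```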