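Hidden Markov Models

Namrata Vaswani, Iowa State University

April 24, 2014

1 Hidden Markov Model Definitions and Examples

Definitions:

1. A hidden Markov model (HMM) refers to a set of hidden states $X_0, X_1, \dots, X_t, \dots, X_T$ and a set of observations $Y_1, \dots, Y_t, \dots, Y_T$ with the following joint PMF or PDF:

$$p(x_{0:T}, y_{1:T}) = \Big[ p(x_0) \prod_{\tau=1}^{T} p(x_\tau \mid x_{\tau-1}) \Big] \Big[ \prod_{\tau=1}^{T} p(y_\tau \mid x_\tau) \Big] \tag{1}$$

(When an observation $Y_0$ at time 0 is also available, as in the later sections, the second product starts at $\tau = 0$.)

2. The sequence is an HMM if and only if
(a) given $X_t$, $X_{t+1}$ is independent of $X_{0:t-1}$ (past-x), and
(b) given $X_t$, $Y_t$ is independent of $X_{0:t-1}, X_{t+1:T}$ (all-x) and $Y_{0:t-1}$ (past-y).

This follows by writing out the expression for $p(x_{0:T}, y_{1:T})$ using the chain rule and then using (1) and comparing coefficients. By the chain rule,

$$p(x_{0:T}, y_{1:T}) = p(x_0) \prod_{\tau=1}^{T} p(x_\tau \mid x_{\tau-1}, x_{0:\tau-2}) \prod_{\tau=1}^{T} p(y_\tau \mid x_{0:T}, y_{1:\tau-1}) \tag{2}$$

Compare this with (1). In both equations integrate over $y_{1:T}$ and cancel out $p(x_0)$ and $p(x_1 \mid x_0)$ to get

$$\prod_{\tau=2}^{T} p(x_\tau \mid x_{\tau-1}, x_{0:\tau-2}) = \prod_{\tau=2}^{T} p(x_\tau \mid x_{\tau-1}).$$

Now integrate also over $x_{3:T}$ on both sides to get $p(x_2 \mid x_1, x_0) = p(x_2 \mid x_1)$. Next, integrate over only $x_{4:T}$ and use this to conclude that

$$p(x_2 \mid x_1, x_0)\, p(x_3 \mid x_2, x_1, x_0) = p(x_2 \mid x_1)\, p(x_3 \mid x_2)$$

and so $p(x_3 \mid x_2, x_1, x_0) = p(x_3 \mid x_2)$. Proceed in a similar fashion to conclude that $p(x_t \mid x_{0:t-1}) = p(x_t \mid x_{t-1})$ for each $t$, i.e. item (a) holds.

At the end of the above, we conclude that

$$\prod_{\tau=1}^{T} p(y_\tau \mid x_{0:T}, y_{1:\tau-1}) = \prod_{\tau=1}^{T} p(y_\tau \mid x_\tau).$$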
Integrate over $y_{2:T}$ to conclude that $p(y_1 \mid x_{0:T}) = p(y_1 \mid x_1)$. Use this and integrate over only $y_{3:T}$ to conclude that $p(y_2 \mid x_{0:T}, y_1) = p(y_2 \mid x_2)$. Proceed in a similar fashion to conclude that $p(y_t \mid x_{0:T}, y_{1:t-1}) = p(y_t \mid x_t)$ for each $t$, i.e. item (b) holds.

3. The sequence is an HMM if and only if
(a) given $X_t$, $X_{t+1}$ is independent of $X_{0:t-1}$ (past-x) and $Y_{0:t}$ (past-y), and
(b) given $X_t$, $Y_t$ is independent of $X_{0:t-1}$ (past-x) and $Y_{0:t-1}$ (past-y).

This also follows by writing out the expression for the joint PMF or PDF using the chain rule and then using (1) and comparing coefficients.

Either of the above can also be concluded by using results from the Graphical Models handout.

The following can be shown either using Theorem 2 of the Graphical Models handout or directly.

1. The joint PMF or PDF of the hidden states is given by

$$p(x_{0:T}) = p(x_0) \prod_{\tau=1}^{T} p(x_\tau \mid x_{\tau-1}) \tag{3}$$

This follows using (1) and integrating over $y_{1:T}$.

2. Given $X_t$, $X_{t+1:T}$ are conditionally independent (c.i.) of past-x ($X_{0:t-1}$) and of past-y ($Y_{0:t}$).

3. Given $X_t$, $Y_{t:T}$ are c.i. of past-x ($X_{0:t-1}$) and of past-y ($Y_{0:t-1}$).

4. Given $X_{t-k}$, $Y_{t:T}$ is c.i. of $Y_{0:t-k}$ and of $X_{0:t-k-1}$ for $k > 0$.

5. Given $X_t$, $X_{t+1}$ is c.i. of past-x ($X_{0:t-1}$) and of past-y ($Y_{0:t}$).

6. Given $X_t$, $Y_t$ is c.i. of all-x ($X_{0:t-1}, X_{t+1:T}$) and all-y ($Y_{0:t-1}, Y_{t+1:T}$).

7. By reversing the Markov chain $\{X_t\}$, we can also claim that given $X_t$, $X_{t-1}$ is c.i. of all future-x ($X_{t+1:T}$) and all future-y ($Y_{t:T}$).

8. Given $X_{t-k}$, $Y_{t:T}$ is c.i. of $X_{0:t-k-1}$ and $Y_{0:t-k}$ for $k > 0$. If $k = 0$, replace $Y_{0:t-k}$ by $Y_{0:t-1}$.

9. By reversing the Markov chain $\{X_t\}$, the opposite of 3 can also be shown for the future.

10. Many more.
Let us try to prove item 2. We get

$$\begin{aligned}
p(x_{t+1:T} \mid x_t, x_{0:t-1}, y_{0:t})
&= p(x_{t+1} \mid x_t, x_{0:t-1}, y_{0:t})\, p(x_{t+2:T} \mid x_{t+1}, x_{0:t}, y_{0:t}) \\
&= p(x_{t+1} \mid x_t)\, p(x_{t+2:T} \mid x_{t+1}, x_{0:t}, y_{0:t}) \\
&= p(x_{t+1} \mid x_t)\, p(x_{t+2} \mid x_{t+1}, x_{0:t}, y_{0:t})\, p(x_{t+3:T} \mid x_{t+2}, x_{0:t+1}, y_{0:t}) \\
&= p(x_{t+1} \mid x_t)\, p(x_{t+2} \mid x_{t+1})\, p(x_{t+3:T} \mid x_{t+2}, x_{0:t+1}, y_{0:t})
\end{aligned} \tag{4}$$

The first equality uses the chain rule, the second uses (a) of definition 3, and the third uses the chain rule. The fourth uses (a) of definition 3 and the following fact with $X \equiv X_{t+2}$, $W \equiv X_{t+1}$, $Z \equiv \{X_{0:t}, Y_{0:t}\}$ and $Y \equiv Y_{t+1}$.

Fact 1. $X$ independent of $\{Z, Y\}$ implies that $X$ is independent of $Z$. Similarly, given $W$, $X$ c.i. of $\{Z, Y\}$ implies that given $W$, $X$ is c.i. of $Z$.

The proof of this follows by writing $p(x, z, y \mid w) = p(x \mid w)\, p(z, y \mid w)$ and integrating over $y$.

Proceeding in a similar fashion, we finally get

$$p(x_{t+1:T} \mid x_t, x_{0:t-1}, y_{0:t}) = \prod_{\tau=t+1}^{T} p(x_\tau \mid x_{\tau-1}).$$

Using (a) of Definition 2 and Fact 1, $p(x_{t+1:T} \mid x_t) = \prod_{\tau=t+1}^{T} p(x_\tau \mid x_{\tau-1})$, and thus we get

$$p(x_{t+1:T} \mid x_t, x_{0:t-1}, y_{0:t}) = p(x_{t+1:T} \mid x_t),$$

i.e. the result follows. The other conclusions given above can be proved similarly.

HMM Examples.

1. The state space model used for defining the Kalman filter was an example of an HMM with continuous states $X_t$ and continuous observations $Y_t$.

2. $X_t$ refers to today's weather, which can take one of three possible values, {rainy, cloudy, sunny}. $Y_t$ is a binary random variable which can take two possible values, {class occurs, no class occurs}. It is natural to claim that today's weather depends only on yesterday's weather, i.e. given yesterday's weather, today's weather is c.i. of past weather and of whether class occurred yesterday or in the past. Also, the chance that class will occur today is governed only by today's weather (if it is sunny, it is more likely that the class will not occur!), and given today's weather, the chance is independent of all past or future weather and also of whether classes occurred in the past or in the future. This, of course, models an irresponsible professor who does not care whether the material is covered or not!
3. Speech recognition. $X_t$: different phonemes; $Y_t$: linear prediction coefficients (LPCs) of the AR model describing the observed speech.

4. Gesture recognition. $X_t$: different gestures out of a set; $Y_t$: outer contour of the observed hand shape (for hand gestures).

5. In the last two examples, $X_t$ is discrete and $Y_t$ is continuous; that is allowed too.

Causal Posterior Computation.

1. We refer to $p(x_t \mid y_{0:t})$ as the causal posterior. In real-time applications, there is a need to compute it recursively, for example, to be able to compute the causal MMSE or causal MAP estimate.

2. Recursive computation means using the causal posterior at $t-1$ and the current observation to compute the causal posterior at $t$.

3. Using Bayes rule and HMM properties, the causal posterior satisfies

$$\begin{aligned}
p(x_t \mid y_{0:t}) &\propto p(x_t, y_t \mid y_{0:t-1}) \\
&= p(x_t \mid y_{0:t-1})\, p(y_t \mid x_t, y_{0:t-1}) \\
&= p(x_t \mid y_{0:t-1})\, p(y_t \mid x_t) \quad \text{(using HMM definition 3)} \\
&= p(y_t \mid x_t) \int p(x_t, x_{t-1} \mid y_{0:t-1})\, dx_{t-1} \\
&= p(y_t \mid x_t) \int p(x_{t-1} \mid y_{0:t-1})\, p(x_t \mid x_{t-1}, y_{0:t-1})\, dx_{t-1} \\
&= p(y_t \mid x_t) \int p(x_{t-1} \mid y_{0:t-1})\, p(x_t \mid x_{t-1})\, dx_{t-1} \quad \text{(using HMM definition 3)}
\end{aligned} \tag{5}$$

4. The above recursion is another way to derive the Kalman filter recursion: the causal MMSE estimate, $E[X_t \mid Y_{0:t}]$, is the expectation of $X_t$ under the causal posterior. Since everything there is jointly Gaussian, the posteriors will also be Gaussian and hence completely specified by the mean and covariance. Kay's book does it this way.

5. The same rules apply for discrete states: just replace $\int$ by $\sum$.

2 Discrete-state HMM

We study the set of techniques developed for discrete-state HMMs. The material is based on Rabiner's tutorial (Proc. IEEE, February 1989). Thus any $X_t$ is a discrete random variable which takes one of $N$ possible values, $1, 2, \dots, N$. $Y_t$ is either discrete or continuous.
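For discrete states and discrete observations, recursion (5) with the integral replaced by a sum can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the original notes: the function name `causal_posterior`, the 0-based state and symbol labels, and the toy numbers (loosely modeled on the weather example, with states rainy/cloudy/sunny and observations class/no class) are all assumptions made for the example. Here `pi[i]` stands for $P(X_0 = i)$, `A[i][j]` for $P(X_t = j \mid X_{t-1} = i)$ and `B[i][y]` for $P(Y_t = y \mid X_t = i)$.

```python
def causal_posterior(pi, A, B, ys):
    """Return [p(x_t | y_{0:t}) for t = 0..T] for a discrete HMM.

    All names are hypothetical; states and symbols are 0-based integers.
    """
    n = len(pi)
    posteriors = []
    for t, y in enumerate(ys):
        if t == 0:
            # No past observations yet: the prediction is the prior on X_0.
            pred = list(pi)
        else:
            prev = posteriors[-1]
            # Prediction step: sum_j p(x_{t-1}=j | y_{0:t-1}) p(x_t=i | x_{t-1}=j)
            pred = [sum(prev[j] * A[j][i] for j in range(n)) for i in range(n)]
        # Update step: multiply by p(y_t | x_t = i), then normalize.
        unnorm = [B[i][y] * pred[i] for i in range(n)]
        total = sum(unnorm)
        posteriors.append([u / total for u in unnorm])
    return posteriors


# Toy model (assumed numbers): states 0=rainy, 1=cloudy, 2=sunny;
# symbols 0=class occurs, 1=no class occurs.
pi = [0.3, 0.4, 0.3]
A = [[0.6, 0.3, 0.1],
     [0.3, 0.4, 0.3],
     [0.1, 0.3, 0.6]]
B = [[0.9, 0.1],
     [0.8, 0.2],
     [0.4, 0.6]]
ys = [1, 1, 0]
post = causal_posterior(pi, A, B, ys)
```

Each entry of `post` is a probability vector over the states; taking its argmax gives a causal MAP estimate and, for numeric-valued states, its mean gives the causal MMSE estimate.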
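2.1 Notation

A time-homogeneous discrete-state HMM is completely specified by

$$\begin{aligned}
\pi_i &\triangleq P(X_0 = i) \\
a_{i,j} &\triangleq P(X_t = j \mid X_{t-1} = i) \\
b_j(y) &\triangleq P(Y_t = y \mid X_t = j) \quad \text{(if $Y_t$ is continuous this is replaced by the conditional PDF)}
\end{aligned} \tag{6}$$

The following notation is used in efficient computation of various quantities.

$$\begin{aligned}
\alpha_t(i) &\triangleq p(y_{0:t}, x_t = i) \\
\beta_t(i) &\triangleq p(y_{t+1:T} \mid x_t = i) \\
\gamma_t(i) &\triangleq p(x_t = i \mid y_{0:T}) \quad \text{(note this conditions on all observations)} \\
\xi_t(i, j) &\triangleq p(x_t = i, x_{t+1} = j \mid y_{0:T}) \quad \text{(note this conditions on all observations)}
\end{aligned} \tag{7}$$

2.2 Recursion for $\alpha_t$, $\beta_t$, $\gamma_t$, $\xi_t$

Consider $\alpha_t$:

$$\begin{aligned}
\alpha_t(i) &= p(y_{0:t}, x_t = i) = \sum_j p(y_{0:t}, x_t = i, x_{t-1} = j) \\
&= \sum_j p(y_{0:t-1}, x_{t-1} = j)\, p(x_t = i \mid x_{t-1} = j, y_{0:t-1})\, p(y_t \mid x_t = i, x_{t-1} = j, y_{0:t-1}) \\
&= \sum_j p(y_{0:t-1}, x_{t-1} = j)\, p(x_t = i \mid x_{t-1} = j)\, p(y_t \mid x_t = i) \quad \text{(using HMM definition 3)} \\
&= \sum_j p(y_{0:t-1}, x_{t-1} = j)\, a_{j,i}\, b_i(y_t) \\
&= b_i(y_t) \sum_j \alpha_{t-1}(j)\, a_{j,i}
\end{aligned} \tag{8}$$

The recursion is initialized with $\alpha_0(i) = p(y_0, x_0 = i) = \pi_i\, b_i(y_0)$.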
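Recursion (8) translates almost line by line into code. The following Python sketch is an illustration under stated assumptions rather than a reference implementation: the name `forward`, the 0-based indexing, and the two-state toy model are made up for the example, and the initialization $\alpha_0(i) = \pi_i b_i(y_0)$ is the one implied by the definition of $\alpha_t$ in (7).

```python
def forward(pi, A, B, ys):
    """Return alpha with alpha[t][i] = p(y_{0:t}, x_t = i), using recursion (8).

    pi[i] = P(X_0 = i), A[i][j] = a_{i,j}, B[i][y] = b_i(y); all 0-based.
    """
    n = len(pi)
    # Initialization: alpha_0(i) = pi_i * b_i(y_0)
    alpha = [[pi[i] * B[i][ys[0]] for i in range(n)]]
    for y in ys[1:]:
        prev = alpha[-1]
        # alpha_t(i) = b_i(y_t) * sum_j alpha_{t-1}(j) * a_{j,i}
        alpha.append([B[i][y] * sum(prev[j] * A[j][i] for j in range(n))
                      for i in range(n)])
    return alpha


# Toy two-state model with binary observations (assumed numbers).
pi = [0.6, 0.4]
A = [[0.7, 0.3],
     [0.4, 0.6]]
B = [[0.9, 0.1],
     [0.2, 0.8]]
ys = [0, 1, 0]
alpha = forward(pi, A, B, ys)
likelihood = sum(alpha[-1])  # sum_i alpha_T(i) = p(y_{0:T})
```

Each time step costs $N^2$ multiplications, so the whole pass is $O(N^2 T)$. For long sequences the entries of `alpha` shrink toward zero, so practical implementations usually rescale each $\alpha_t$ or work with logarithms; that refinement is omitted here.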
Consder β t β t () p(y t+:t x t ) p(y t+:t, x t+ j x t ) j p(x t+ j x t )p(y t+ x t+ j, x t )p(y t+2:t x t+ j, x t, y t+ ) j p(x t+ j x t )p(y t+ x t+ j)p(y t+2:t x t+ j) (usng HMM defnton 3) j a,j b j (y t+ )β t+ (j) (9) j Consder γ t. Usng defntons of α t () and β t (), t s clear that γ t () α t ()β t (). Thus, γ t () N j α t(j)β t (j) α t()β t () (0) Consder ξ t ξ t (, j) p(x t, x t+ j y 0:T ) p(y 0:T ) p(y 0:T, x t, x t+ j) p(y 0:T ) p(x t, y 0:t )p(x t+ j x t, y 0:t )p(y t+:t x t+ j, x t, y 0:t ) p(y 0:T ) p(x t, y 0:t )p(x t+ j x t )p(y t+:t x t+ j) (usng HMM defnton 3) p(y 0:T ) α t()a,j β t (j) j α t ( )a,j β t(j ) α t()a,j β t (j) () 2.3 Computng p(y 0:T ): Forward algorthm, Backward algorthm Brute force computaton of p(y 0:T ) wll requre evaluatng p(y 0:T ) x 0:T wll requre O(N T ) computatons. p(x 0 )[ p(x t x t )]p(y 0 x 0 )[ p(y t x t )] (2) t t 6
2.3.1 Forward algorithm

A fast and causal way to compute $p(y_{0:T})$ is to go forward in time:

$$p(y_{0:T}) = \sum_{i=1}^{N} \alpha_T(i) \tag{13}$$

The recursion for $\alpha_t(i)$ is given in (8). This takes only $O(N^2 T)$ computations.

2.3.2 Backward algorithm

Another $O(N^2 T)$ way to compute $p(y_{0:T})$ is to go backwards in time:

$$p(y_{0:T}) = \sum_{i=1}^{N} \pi_i\, b_i(y_0)\, \beta_0(i) \tag{14}$$

The factor $b_i(y_0)$ appears because $\beta_0(i) = p(y_{1:T} \mid x_0 = i)$ does not account for $y_0$. The recursion for $\beta_t(i)$ is given in (9).

Typically one would use the forward algorithm to compute this, since it is also causal. There may be situations, e.g. if this is done offline and if the observations are stored last-in-first-out, where one may need to use the backward algorithm.

2.4 EM algorithm for discrete-state HMM parameter estimation: Baum-Welch algorithm

Let $\theta$ denote the set of parameters. In this case, $\theta$ includes all elements $\{a_{i,j}\}$, $\{\pi_i\}$ and the parameters of $b_i(y)$.

Assumption 1. Assume for the discussion below that the $Y_t$'s are also discrete and take $M$ possible values, $1, 2, \dots, M$. Thus, in $b_i(y)$, $y$ can be $1, 2, \dots, M$. Then

$$\theta = \big\{ \{\pi_i\}_{i=1,\dots,N},\ \{a_{i,j}\}_{i=1,\dots,N,\ j=1,\dots,N},\ \{b_i(y)\}_{i=1,\dots,N,\ y=1,\dots,M} \big\}.$$

Let $\theta^k$ denote the parameter estimate at the $k$-th iteration. Recall that the EM algorithm computes

$$\theta^{k+1} = \arg\max_{\theta} Q(\theta, \theta^k) \ \text{ s.t. constraints on } \theta, \quad \text{where } Q(\theta, \theta^k) \triangleq E[\log p(y_{0:T}, X_{0:T}; \theta) \mid y_{0:T}; \theta^k] \tag{15}$$

i.e. at each iteration EM maximizes the posterior expectation of the logarithm of the complete data likelihood (the posterior expectation is computed using the parameter estimates from the previous iteration). As discussed earlier (when talking about the EM algorithm), under certain assumptions, this leads to maximization of the observed data likelihood, i.e. its solution converges to $\arg\max_\theta p(y_{0:T}; \theta)$. Without such assumptions, only convergence to a local maximum or stationary point of the likelihood can be expected.
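Now for our HMM,

$$\log p(y_{0:T}, X_{0:T}; \theta) = \log \pi_{X_0} + \sum_{t=1}^{T} \log a_{X_{t-1}, X_t} + \sum_{t=0}^{T} \log b_{X_t}(y_t) \tag{16}$$

Thus the first term is only a function of the random variable $X_0$, the $t$-th entry of the second term is only a function of $X_{t-1}, X_t$, and the $t$-th entry of the third term is only a function of $X_t$. Therefore,

$$\begin{aligned}
E[\log p(y_{0:T}, X_{0:T}; \theta) \mid y_{0:T}; \theta^k]
&= E[\log \pi_{X_0} \mid y_{0:T}; \theta^k] + \sum_{t=1}^{T} E[\log a_{X_{t-1}, X_t} \mid y_{0:T}; \theta^k] + \sum_{t=0}^{T} E[\log b_{X_t}(y_t) \mid y_{0:T}; \theta^k] \\
&= \sum_i p(x_0 = i \mid y_{0:T}) \log \pi_i + \sum_{t=1}^{T} \sum_{i,j} p(x_{t-1} = i, x_t = j \mid y_{0:T}) \log a_{i,j} + \sum_{t=0}^{T} \sum_i p(x_t = i \mid y_{0:T}) \log b_i(y_t) \\
&= \sum_i \gamma_0^k(i) \log \pi_i + \sum_{t=1}^{T} \sum_{i,j} \xi_{t-1}^k(i, j) \log a_{i,j} + \sum_{t=0}^{T} \sum_i \gamma_t^k(i) \log b_i(y_t)
\end{aligned} \tag{17}$$

where $\gamma_t^k$, $\xi_t^k$ are computed using $\theta^k$ in the recursions given in (10) and (11). We need to maximize the above subject to the constraints

$$\sum_i \pi_i = 1, \qquad \sum_j a_{i,j} = 1, \ i = 1, \dots, N, \qquad \sum_{y=1}^{M} b_i(y) = 1, \ i = 1, \dots, N \tag{18}$$

Using Lagrange multipliers, differentiating and solving, the final solutions are

$$\begin{aligned}
\pi_i^{k+1} &= \gamma_0^k(i) \\
a_{i,j}^{k+1} &= \frac{\sum_{t=1}^{T} \xi_{t-1}^k(i, j)}{\sum_{j'=1}^{N} \sum_{t=1}^{T} \xi_{t-1}^k(i, j')} = \frac{\sum_{t=1}^{T} \xi_{t-1}^k(i, j)}{\sum_{t=1}^{T} \gamma_{t-1}^k(i)} \\
b_i(m)^{k+1} &= \frac{\sum_{t=0}^{T} I(y_t = m)\, \gamma_t^k(i)}{\sum_{t=0}^{T} \gamma_t^k(i)}
\end{aligned} \tag{19}$$

where $I(A)$ is 1 if $A$ occurs and 0 otherwise. Here $\gamma_t^k$, $\xi_t^k$ are computed using the parameter estimates at iteration $k$ in the recursions given earlier.

Thus, the stepwise EM algorithm is as follows. At iteration $k + 1$: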
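1. Compute $\gamma_t^k(i)$ for all $i$ and all $t$ using (10) and the parameter estimates from iteration $k$, $\theta^k$.

2. Compute $\xi_t^k(i, j)$ for all $i, j$ and all $t$ using (11) and the parameter estimates from iteration $k$, $\theta^k$.

3. Compute the parameter estimates at iteration $k + 1$, $\theta^{k+1}$, using (19).

Now suppose the $Y_t$'s are not discrete, but are continuous r.v.'s with the parameters of their PDF governed by the current state, e.g. the $Y_t$'s can be scalar Gaussians with mean $\mu_i$ and variance $\sigma_i^2$ if the state $X_t = i$. In this case, the estimates of these parameters can be computed as follows. We need to maximize the following w.r.t. $\mu_i, \sigma_i^2$:

$$\sum_{t=0}^{T} \gamma_t^k(i) \log b_i(y_t) = \sum_{t=0}^{T} \gamma_t^k(i) \Big[ -\log\big(\sqrt{2\pi}\, \sigma_i\big) - \frac{(y_t - \mu_i)^2}{2 \sigma_i^2} \Big] \tag{20}$$

The maximizers are

$$\mu_i^{k+1} = \frac{\sum_{t=0}^{T} \gamma_t^k(i)\, y_t}{\sum_{t=0}^{T} \gamma_t^k(i)}, \qquad (\sigma_i^2)^{k+1} = \frac{\sum_{t=0}^{T} \gamma_t^k(i)\, (y_t - \mu_i^{k+1})^2}{\sum_{t=0}^{T} \gamma_t^k(i)} \tag{21}$$

2.5 General idea of Viterbi algorithm / dynamic programming

In dynamic programming / the Viterbi algorithm, the goal is to find

$$\arg\max_{q_{0:T}} f_T(q_{0:T}) \tag{22}$$

where $f_t(q_{0:t})$ at any $t$ satisfies

$$f_t(q_{0:t}) = f_{t-1}(q_{0:t-1}) + h_t(q_{t-1}, q_t) + g_t(q_t) \tag{23}$$

Notice that $f_t(\cdot)$ is a function only of the first $t + 1$ variables. Typically path optimization problems are of this type.

Efficient solution strategy: Let

$$\delta_t(i) \triangleq \max_{q_{0:t-1}} f_t(q_{0:t-1}, i) \tag{24}$$

Then, using (23),

$$\begin{aligned}
\delta_t(i) &= \max_{q_{0:t-1}} \big[ f_{t-1}(q_{0:t-1}) + h_t(q_{t-1}, i) + g_t(i) \big] \\
&= \max_{q_{t-1}} \max_{q_{0:t-2}} \big[ f_{t-1}(q_{0:t-1}) + h_t(q_{t-1}, i) + g_t(i) \big] \\
&= g_t(i) + \max_{q_{t-1}} \big[ h_t(q_{t-1}, i) + \max_{q_{0:t-2}} f_{t-1}(q_{0:t-1}) \big] \\
&= g_t(i) + \max_{q_{t-1}} \big[ h_t(q_{t-1}, i) + \delta_{t-1}(q_{t-1}) \big]
\end{aligned} \tag{25}$$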
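Recursion (25) can be checked numerically on a small problem by comparing it with an exhaustive search over all paths. The Python sketch below is only an illustration: the function names, the calling convention `h(t, q_prev, q)` and `g(t, q)`, the 0-based labels, and the integer-valued toy cost functions are all assumptions made for the example.

```python
from itertools import product


def dp_max(f0, h, g, n, T):
    """Return max over q_{0:T} of f_T(q_{0:T}) via recursion (25).

    f0(i)          : f_0 evaluated at q_0 = i
    h(t, q_prev, q): h_t(q_{t-1} = q_prev, q_t = q)
    g(t, q)        : g_t(q_t = q)
    n              : number of values each q_t can take (0..n-1)
    """
    delta = [f0(i) for i in range(n)]
    for t in range(1, T + 1):
        # delta_t(i) = g_t(i) + max_{q_{t-1}} [h_t(q_{t-1}, i) + delta_{t-1}(q_{t-1})]
        delta = [g(t, i) + max(h(t, qp, i) + delta[qp] for qp in range(n))
                 for i in range(n)]
    return max(delta)


def exhaustive_max(f0, h, g, n, T):
    """Same maximum, found by enumerating all n^(T+1) paths using (23)."""
    best = float("-inf")
    for q in product(range(n), repeat=T + 1):
        value = f0(q[0]) + sum(h(t, q[t - 1], q[t]) + g(t, q[t])
                               for t in range(1, T + 1))
        best = max(best, value)
    return best


# Arbitrary integer-valued toy functions (assumed, chosen only for the check).
def f0(i):
    return -i


def h(t, q_prev, q):
    return -t * abs(q - q_prev)


def g(t, q):
    return (q * (t + 1)) % 3


best_dp = dp_max(f0, h, g, n=3, T=4)
best_exhaustive = exhaustive_max(f0, h, g, n=3, T=4)
```

The dynamic program does $O(N^2)$ work per time step, while the enumeration visits $N^{T+1}$ paths; the two values agree because the maximization over $q_{0:t-2}$ can be pulled inside, exactly as in the third line of (25).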
Also store the optimal path to get to $q_t = i$ for each value of $q_t$. So if $q_t \in \{1, 2, \dots, N\}$, then store the optimal path to get to $q_t = i$ for each $i$ in the set. For the above problem, this can be done efficiently by only storing the optimal value of $q_{t-1}$ that gets you to $q_t = i$, and doing this for each $i$ at each $t$. Thus, at each $t$, for each $q_t = i$, we store

$$\psi_t(i) \triangleq \arg\max_{q_{t-1}} \big[ h_t(q_{t-1}, i) + \delta_{t-1}(q_{t-1}) \big] \tag{26}$$

To summarize the above idea, we do the following.

1. Initialize at $t = 0$ to $\delta_0(i) = f_0(i)$ for all $i = 1, 2, \dots, N$.

2. Starting at $t = 1$, at each $t$, compute the following for all $i = 1, 2, \dots, N$:

$$\delta_t(i) = g_t(i) + \max_{q_{t-1}} \big[ h_t(q_{t-1}, i) + \delta_{t-1}(q_{t-1}) \big] \tag{27}$$

3. Simultaneously, at each $t$, for each $i = 1, 2, \dots, N$ also store the maximizer of the above, i.e. store

$$\psi_t(i) = \arg\max_{q_{t-1}} \big[ h_t(q_{t-1}, i) + \delta_{t-1}(q_{t-1}) \big] \tag{28}$$

4. At $t = T$, find the optimal cost and the optimal value of $q_T$ as

$$\max_i \delta_T(i), \qquad q_T^* = \arg\max_i \delta_T(i) \tag{29}$$

5. Backtrack using $\psi_t$ to find the optimal state sequence, i.e. starting with $t = T - 1$, go backwards:

$$q_t^* = \psi_{t+1}(q_{t+1}^*) \tag{30}$$

2.6 Posterior MAP (non-causal) computation: Viterbi algorithm

We would like to find the non-causal posterior MAP estimate

$$x_{0:T}^* = \arg\max_{x_{0:T}} p(x_{0:T} \mid y_{0:T}) = \arg\max_{x_{0:T}} p(x_{0:T}, y_{0:T}) = \arg\max_{x_{0:T}} \log p(x_{0:T}, y_{0:T}) \tag{31}$$

Using notation from above,

$$f_t(x_{0:t}) := \log p(x_{0:t}, y_{0:t}) \tag{32}$$

Using the HMM definition, it is easy to see that

$$f_t(x_{0:t}) = f_{t-1}(x_{0:t-1}) + \log p(x_t \mid x_{t-1}) + \log p(y_t \mid x_t) \tag{33}$$

Thus,

$$h_t(x_{t-1}, x_t) := \log p(x_t \mid x_{t-1}), \qquad g_t(x_t) := \log p(y_t \mid x_t) \tag{34}$$

Thus, the final Viterbi algorithm is
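1. Initialize at $t = 0$ to $\delta_0(i) = f_0(i) = \log\big(\pi_i\, b_i(y_0)\big)$ for all $i = 1, 2, \dots, N$.

2. Starting at $t = 1$, at each $t$, compute the following for all $i = 1, 2, \dots, N$:

$$\delta_t(i) = \log b_i(y_t) + \max_{x_{t-1} = 1, 2, \dots, N} \big[ \log a_{x_{t-1}, i} + \delta_{t-1}(x_{t-1}) \big] \tag{35}$$

3. Simultaneously, at each $t$, for each $i = 1, 2, \dots, N$ also store the maximizer of the above, i.e. store

$$\psi_t(i) = \arg\max_{x_{t-1} = 1, 2, \dots, N} \big[ \log a_{x_{t-1}, i} + \delta_{t-1}(x_{t-1}) \big] \tag{36}$$

4. At $t = T$, find the optimal cost and the optimal value of $x_T$ as

$$\max_i \delta_T(i), \qquad x_T^* = \arg\max_i \delta_T(i) \tag{37}$$

5. Backtrack using $\psi_t$ to find the optimal state sequence, i.e. starting with $t = T - 1$, go backwards:

$$x_t^* = \psi_{t+1}(x_{t+1}^*) \tag{38}$$

2.7 Direct derivation of Viterbi algorithm for HMMs

Here $\delta_t$ is defined in the probability domain, i.e. it is the exponential of the $\delta_t$ used in Section 2.6. Let

$$\delta_t(i) \triangleq \max_{x_{0:t-1}} p(x_{0:t-1}, x_t = i, y_{0:t}), \qquad \psi_t(x_t) \triangleq \arg\max_{x_{t-1} \in \{1, \dots, N\}} \delta_{t-1}(x_{t-1})\, a_{x_{t-1}, x_t} \tag{39}$$

Recursion for $\delta_t$:

$$\begin{aligned}
\delta_t(i) &= \max_{x_{0:t-1}} p(x_{0:t-1}, x_t = i, y_{0:t}) \\
&= \max_{x_{0:t-1}} p(x_{0:t-1}, y_{0:t-1})\, p(x_t = i \mid x_{0:t-1}, y_{0:t-1})\, p(y_t \mid x_t = i, x_{0:t-1}, y_{0:t-1}) \\
&= \max_{x_{0:t-1}} p(x_{0:t-1}, y_{0:t-1})\, p(x_t = i \mid x_{t-1})\, p(y_t \mid x_t = i) \quad \text{(using HMM definition 3)} \\
&= \max_{x_{0:t-1}} p(x_{0:t-1}, y_{0:t-1})\, a_{x_{t-1}, i}\, b_i(y_t) \\
&= b_i(y_t) \max_{x_{t-1}} \Big( \max_{x_{0:t-2}} p(x_{0:t-1}, y_{0:t-1}) \Big)\, a_{x_{t-1}, i} \\
&= b_i(y_t) \max_j \delta_{t-1}(j)\, a_{j,i}
\end{aligned} \tag{40}$$

Also,

$$\psi_t(i) = \arg\max_j \delta_{t-1}(j)\, a_{j,i} \tag{41}$$

Thus,

$$x_T^* = \arg\max_{i \in \{1, \dots, N\}} \delta_T(i), \qquad x_t^* = \psi_{t+1}(x_{t+1}^*), \quad t = T - 1, T - 2, \dots, 0 \tag{42}$$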
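The Viterbi steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the names `viterbi` and `safe_log`, the 0-based state and symbol labels, and the toy model are made up for the example. The sketch works in the log domain, with $\delta_0(i) = \log \pi_i + \log b_i(y_0)$, and breaks ties in favor of the smallest state index.

```python
import math


def safe_log(p):
    """log(p), with log(0) mapped to -infinity (hypothetical helper)."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(pi, A, B, ys):
    """Return (MAP state sequence x*_{0:T}, log p(x*_{0:T}, y_{0:T})).

    pi[i] = P(X_0 = i), A[i][j] = a_{i,j}, B[i][y] = b_i(y); all 0-based.
    """
    n = len(pi)
    # Initialization: delta_0(i) = log pi_i + log b_i(y_0)
    delta = [safe_log(pi[i]) + safe_log(B[i][ys[0]]) for i in range(n)]
    psi = []  # psi[t-1][i] stores psi_t(i), the best predecessor of state i
    for y in ys[1:]:
        new_delta, back = [], []
        for i in range(n):
            scores = [safe_log(A[j][i]) + delta[j] for j in range(n)]
            j_best = max(range(n), key=lambda j: scores[j])
            back.append(j_best)
            new_delta.append(safe_log(B[i][y]) + scores[j_best])
        psi.append(back)
        delta = new_delta
    # Termination: best final state, then backtrack through psi.
    path = [max(range(n), key=lambda i: delta[i])]
    for back in reversed(psi):
        path.append(back[path[-1]])
    path.reverse()
    return path, max(delta)


# Toy two-state model with binary observations (assumed numbers).
pi = [0.6, 0.4]
A = [[0.7, 0.3],
     [0.4, 0.6]]
B = [[0.9, 0.1],
     [0.2, 0.8]]
ys = [0, 1, 0]
path, log_prob = viterbi(pi, A, B, ys)
```

Working with sums of logarithms instead of products of probabilities avoids numerical underflow on long sequences and returns the same maximizing path, since the logarithm is monotone.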