Introduction to Hidden Markov Models

Alperen Degirmenci

This document contains derivations and algorithms for implementing Hidden Markov Models. The content presented here is a collection of my notes and personal insights from two seminal papers on HMMs by Rabiner in 1989 [2] and Ghahramani in 2001 [1], and also from Kevin Murphy's book [3]. This is an excerpt from my project report for the MIT 6.867 Machine Learning class taught in Fall 2014.

I. HIDDEN MARKOV MODELS (HMMS)

HMMs have been widely used in many applications, such as speech recognition, activity recognition from video, gene finding, and gesture tracking. In this section, we will explain what HMMs are, how they are used for machine learning, their advantages and disadvantages, and how we implemented our own HMM algorithm.

A. Definition

A hidden Markov model is a tool for representing probability distributions over sequences of observations [1]. In this model, an observation $X_t$ at time $t$ is produced by a stochastic process, but the state $Z_t$ of this process cannot be directly observed, i.e. it is hidden [2]. This hidden process is assumed to satisfy the Markov property, where state $Z_t$ at time $t$ depends only on the previous state, $Z_{t-1}$ at time $t-1$. This is, in fact, called the first-order Markov model. The $n$th-order Markov model depends on the $n$ previous states. Fig. 1 shows a Bayesian network representing the first-order HMM, where the hidden states are shaded in gray.

We should note that even though we talk about "time" to indicate that observations occur at discrete "time steps", "time" could also refer to locations within a sequence [3].

The joint distribution of a sequence of states and observations for the first-order HMM can be written as,

$$P(Z_{1:N}, X_{1:N}) = P(Z_1)\,P(X_1 \mid Z_1) \prod_{t=2}^{N} P(Z_t \mid Z_{t-1})\,P(X_t \mid Z_t) \tag{1}$$

where the notation $Z_{1:N}$ is used as a shorthand for $Z_1, \dots, Z_N$. Notice that Eq. 1 can also be written as,

$$P(X_{1:N}, Z_{1:N}) = P(Z_1) \prod_{t=2}^{N} P(Z_t \mid Z_{t-1}) \prod_{t=1}^{N} P(X_t \mid Z_t) \tag{2}$$

which is the same as the expression given in the lecture notes.
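To make the factorization in Eq. 1 concrete, the following minimal sketch evaluates the joint probability of one particular state and observation sequence. It is an illustration rather than the implementation from the report: the names `log_joint`, `initial`, `transition`, and `emission` are hypothetical, states and observations are assumed to be 0-based integer labels, the numbers are made up, and all probabilities that are looked up are assumed to be strictly positive so that the logarithms are defined.

```python
import math

def log_joint(states, observations, initial, transition, emission):
    """Log of Eq. 1 for one sequence of hidden states and observations.

    Assumed (hypothetical) table layout:
      initial[i]        = P(Z_1 = i)
      transition[i][j]  = P(Z_t = j | Z_{t-1} = i)
      emission[k][j]    = P(X_t = k | Z_t = j)
    """
    # First factor pair: P(Z_1) P(X_1 | Z_1)
    total = math.log(initial[states[0]])
    total += math.log(emission[observations[0]][states[0]])
    # Remaining factors: P(Z_t | Z_{t-1}) P(X_t | Z_t) for t = 2..N
    for t in range(1, len(states)):
        total += math.log(transition[states[t - 1]][states[t]])
        total += math.log(emission[observations[t]][states[t]])
    return total

# Made-up 2-state, 2-symbol model, used only for illustration.
initial = [0.6, 0.4]
transition = [[0.7, 0.3], [0.4, 0.6]]
emission = [[0.9, 0.2], [0.1, 0.8]]

log_p = log_joint([0, 0, 1], [0, 0, 1], initial, transition, emission)
print(math.exp(log_p))
```

Summing logarithms instead of multiplying probabilities keeps the computation stable for long sequences, which is the same motivation as the log-domain computations used later in the document.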
Fig. 1. A Bayesian network representing a first-order HMM, with hidden states $Z_1, \dots, Z_N$ and observations $X_1, \dots, X_N$. The hidden states are shaded in gray.

The author is with the School of Engineering and Applied Sciences at Harvard University, Cambridge, MA 02138 USA (adegirmenci@seas.harvard.edu).

There are five elements that characterize a hidden Markov model:

1) Number of states in the model, $K$: This is the number of states that the underlying hidden Markov process has. The states often have some relation to the phenomena being modeled. For example, if an HMM is being used for gesture recognition, each state may be a different gesture, or a part of the gesture. States are represented as integers $1, \dots, K$. We will encode the state $Z_t$ at time $t$ as a $K \times 1$ vector of binary numbers, where the only non-zero element is the $k$-th element (i.e. $Z_{t,k} = 1$), corresponding to state $k \in \{1, \dots, K\}$ at time $t$. While this may seem contrived, it will later on help us in our computations. (Note that [2] uses $N$ instead of $K$.)

2) Number of distinct observations, $\Omega$: Observations are represented as integers $1, \dots, \Omega$. We will encode the observation $X_t$ at time $t$ as an $\Omega \times 1$ vector of binary numbers, where the only non-zero element is the $l$-th element (i.e. $X_{t,l} = 1$), corresponding to observation $l \in \{1, \dots, \Omega\}$ at time $t$. As with the states, this encoding will later on help us in our computations. (Note that [2] uses $M$ instead of $\Omega$, and [1] uses $D$. We decided to use $\Omega$ since this agrees with the lecture notes.)

3) State transition model, $A$: Also called the state transition probability distribution [2] or the transition matrix [3], this is a $K \times K$ matrix whose elements $A_{ij}$ describe the probability of transitioning from state $Z_{t-1,i}$ to $Z_{t,j}$ in one time step, where $i, j \in \{1, \dots, K\}$. This can be written as,

$$A_{ij} = P(Z_{t,j} = 1 \mid Z_{t-1,i} = 1). \tag{3}$$

Each row of $A$ sums to 1, $\sum_j A_{ij} = 1$, and therefore it is called a stochastic matrix.
If any state can reach any other state in a single step (fully-connected), then $A_{ij} > 0$ for all $i, j$; otherwise $A$ will have some zero-valued elements. Fig. 2 shows two state transition diagrams for a 2-state and 3-state first-order Markov chain. For these diagrams, the state transition models are,

$$A^{(a)} = \begin{bmatrix} 1-\alpha & \alpha \\ \beta & 1-\beta \end{bmatrix}, \qquad A^{(b)} = \begin{bmatrix} A_{11} & A_{12} & 0 \\ A_{21} & A_{22} & A_{23} \\ 0 & A_{32} & A_{33} \end{bmatrix}. \tag{4}$$

Fig. 2. A state transition diagram for (a) a 2-state, and (b) a 3-state ergodic Markov chain. For a chain to be ergodic, any state should be reachable from any other state in a finite amount of time.

The conditional probability can be written as

$$P(Z_t \mid Z_{t-1}) = \prod_{i=1}^{K} \prod_{j=1}^{K} A_{ij}^{Z_{t-1,i} Z_{t,j}}. \tag{5}$$

Taking the logarithm, we can write this as

$$\log P(Z_t \mid Z_{t-1}) = \sum_{i=1}^{K} \sum_{j=1}^{K} Z_{t-1,i}\, Z_{t,j} \log A_{ij} \tag{6}$$

$$= Z_{t-1}^\top \log(A)\, Z_t, \tag{7}$$

where $\log(A)$ is taken elementwise. Because $Z_{t-1}$ and $Z_t$ are one-hot vectors, Eq. 7 simply picks out the single entry $\log A_{ij}$ for the transition that actually occurred.

4) Observation model, $B$: Also called the emission probabilities, $B$ is an $\Omega \times K$ matrix whose elements $B_{kj}$ describe the probability of making observation $X_{t,k}$ given state $Z_{t,j}$. This can be written as,

$$B_{kj} = P(X_t = k \mid Z_t = j). \tag{8}$$

The conditional probability can be written as

$$P(X_t \mid Z_t) = \prod_{j=1}^{K} \prod_{k=1}^{\Omega} B_{kj}^{Z_{t,j} X_{t,k}}. \tag{9}$$

Taking the logarithm, we can write this as

$$\log P(X_t \mid Z_t) = \sum_{j=1}^{K} \sum_{k=1}^{\Omega} Z_{t,j}\, X_{t,k} \log B_{kj} \tag{10}$$

$$= X_t^\top \log(B)\, Z_t. \tag{11}$$

5) Initial state distribution, $\pi$: This is a $K \times 1$ vector of probabilities $\pi_i = P(Z_{1,i} = 1)$. The conditional probability can be written as,

$$P(Z_1 \mid \pi) = \prod_{i=1}^{K} \pi_i^{Z_{1,i}}. \tag{12}$$

Given these five parameters presented above, an HMM can be completely specified. In literature, this often gets abbreviated as

$$\lambda = (A, B, \pi). \tag{13}$$

Fig. 3. Factor graph for a slice of the HMM at time $t$.

© 2014 Alperen Degirmenci

B. Three Problems of Interest

In [2] Rabiner states that for the HMM to be useful in real-world applications, the following three problems must be solved:

Problem 1: Given observations $X_1, \dots, X_N$ and a model $\lambda = (A, B, \pi)$, how do we efficiently compute $P(X_{1:N} \mid \lambda)$, the probability of the observations given the model? This is a part of the exact inference problem presented in the lecture notes, and can be solved using forward filtering.
Problem 2: Given observations $X_1, \dots, X_N$ and the model $\lambda$, how do we find the "correct" hidden state sequence $Z_1, \dots, Z_N$ that best "explains" the observations? This corresponds to finding the most probable sequence of hidden states from the lecture notes, and can be solved using the Viterbi algorithm. A related problem is calculating the probability of being in state $Z_{t,k}$ given the observations, $P(Z_t = k \mid X_{1:N})$, which can be calculated using the forward-backward algorithm.

Problem 3: How do we adjust the model parameters $\lambda = (A, B, \pi)$ to maximize $P(X_{1:N} \mid \lambda)$? This corresponds to the learning problem presented in the lecture notes, and can be solved using the Expectation-Maximization (EM) algorithm (in the case of HMMs, this is called the Baum-Welch algorithm).

C. The Forward-Backward Algorithm

The forward-backward algorithm is a dynamic programming algorithm that makes use of message passing (belief propagation). It allows us to compute the filtered and smoothed marginals, which can then be used to perform inference, MAP estimation, sequence classification, anomaly detection, and model-based clustering. We will follow the derivation presented in Murphy [3].

1) The Forward Algorithm: In this part, we compute the filtered marginals, $P(Z_t \mid X_{1:t})$, using the predict-update cycle. The prediction step calculates the one-step-ahead predictive density,

$$P(Z_t = j \mid X_{1:t-1}) = \sum_{i=1}^{K} P(Z_t = j \mid Z_{t-1} = i)\, P(Z_{t-1} = i \mid X_{1:t-1}) \tag{14}$$

which acts as the new prior for time $t$. In the update step, the observed data from time $t$ is absorbed using Bayes' rule:

$$\begin{aligned} \alpha_t(j) &\triangleq P(Z_t = j \mid X_{1:t}) = P(Z_t = j \mid X_t, X_{1:t-1}) \\ &= \frac{P(X_t \mid Z_t = j, X_{1:t-1})\, P(Z_t = j \mid X_{1:t-1})}{\sum_{j'} P(X_t \mid Z_t = j', X_{1:t-1})\, P(Z_t = j' \mid X_{1:t-1})} \\ &= \frac{1}{C_t}\, P(X_t \mid Z_t = j)\, P(Z_t = j \mid X_{1:t-1}) \end{aligned} \tag{15}$$

where the observations $X_{1:t-1}$ cancel out because they are d-separated from $X_t$ given $Z_t$. $C_t$ is the normalization constant (to avoid confusion, we used $C_t$ as opposed to $Z_t$ from [3]) given by,

$$C_t \triangleq P(X_t \mid X_{1:t-1}) = \sum_{j=1}^{K} P(X_t \mid Z_t = j)\, P(Z_t = j \mid X_{1:t-1}). \tag{16}$$

The $K \times 1$ vector $\alpha_t = P(Z_t \mid X_{1:t})$ is called the (filtered) belief state at time $t$. In matrix notation, we can write the recursive update as:

$$\alpha_t \propto \psi_t \odot (A^\top \alpha_{t-1}) \tag{17}$$

where $\psi_t = [\psi_{t1}, \psi_{t2}, \dots, \psi_{tK}] = \{P(X_t \mid Z_t = i)\}_{1 \le i \le K}$ is the local evidence at time $t$ which can be calculated using Eq. 9, $A$ is the transition matrix, and $\odot$ is the Hadamard product, representing elementwise vector multiplication. The pseudo-code in Algorithm 1 outlines the steps of the computation.

Algorithm 1 Forward algorithm
    Input: $A$, $\psi_{1:N}$, $\pi$
    $[\alpha_1, C_1] = \mathrm{normalize}(\psi_1 \odot \pi)$
    for $t = 2 : N$ do
        $[\alpha_t, C_t] = \mathrm{normalize}(\psi_t \odot (A^\top \alpha_{t-1}))$
    Return $\alpha_{1:N}$ and $\log P(X_{1:N}) = \sum_t \log C_t$
    Sub: $[\alpha, C] = \mathrm{normalize}(u)$: $C = \sum_j u_j$; $\alpha_j = u_j / C$

The log probability of the evidence can be computed as

$$\log P(X_{1:N} \mid \lambda) = \sum_{t=1}^{N} \log P(X_t \mid X_{1:t-1}) = \sum_{t=1}^{N} \log C_t \tag{18}$$

This, in fact, is the solution for Problem 1 stated by Rabiner [2]. Working in the log domain allows us to avoid numerical underflow during computations.

2) The Forward-Backward Algorithm: Now that we have the filtered belief states $\alpha$ from the forward messages, we can compute the backward messages to get the smoothed marginals:

$$P(Z_t = j \mid X_{1:N}) \propto P(Z_t = j, X_{t+1:N} \mid X_{1:t}) \propto P(Z_t = j \mid X_{1:t})\, P(X_{t+1:N} \mid Z_t = j, X_{1:t}), \tag{19}$$

which is the probability of being in state $Z_{t,j}$. The conditioning on $X_{1:t}$ in the last factor can be dropped, because the future observations are independent of the past ones given $Z_t$. Given that the hidden state at time $t$ is $j$, define the conditional likelihood of future evidence as

$$\beta_t(j) \triangleq P(X_{t+1:N} \mid Z_t = j). \tag{20}$$

Also define the desired smoothed posterior marginal as

$$\gamma_t(j) \triangleq P(Z_t = j \mid X_{1:N}). \tag{21}$$

Then we can rewrite Eq. 19 as

$$\gamma_t(j) \propto \alpha_t(j)\, \beta_t(j). \tag{22}$$

We can now compute the $\beta$'s recursively from right to left:

$$\begin{aligned} \beta_{t-1}(i) &= P(X_{t:N} \mid Z_{t-1} = i) \\ &= \sum_j P(Z_t = j, X_t, X_{t+1:N} \mid Z_{t-1} = i) \\ &= \sum_j P(X_{t+1:N} \mid Z_t = j, X_t, Z_{t-1} = i)\, P(Z_t = j, X_t \mid Z_{t-1} = i) \\ &= \sum_j P(X_{t+1:N} \mid Z_t = j)\, P(X_t \mid Z_t = j, Z_{t-1} = i)\, P(Z_t = j \mid Z_{t-1} = i) \\ &= \sum_j \beta_t(j)\, \psi_t(j)\, A(i, j) \end{aligned} \tag{23}$$

This can be written as

$$\beta_{t-1} = A\,(\psi_t \odot \beta_t). \tag{24}$$

The base case for $\beta_N$ is

$$\beta_N(i) = P(X_{N+1:N} \mid Z_N = i) = P(\emptyset \mid Z_N = i) = 1. \tag{25}$$

Finally, the smoothed posterior is then

$$\gamma_t = \frac{\alpha_t \odot \beta_t}{\sum_j \alpha_t(j)\, \beta_t(j)} \tag{26}$$

where the denominator ensures that each column of $\gamma$ sums to 1, so that it is a stochastic matrix. The pseudo-code in Algorithm 2 outlines the steps of the computation.

Algorithm 2 Backward algorithm
    Input: $A$, $\psi_{1:N}$, $\alpha$
    $\beta_N = 1$
    for $t = N-1 : 1$ do
        $\beta_t = \mathrm{normalize}(A\,(\psi_{t+1} \odot \beta_{t+1}))$
    $\gamma = \mathrm{normalize}(\alpha \odot \beta, 1)$
    Return $\gamma_{1:N}$

D. The Viterbi Algorithm

In order to compute the most probable sequence of hidden states (Problem 2), we will use the Viterbi algorithm. This algorithm computes the shortest path through the trellis diagram of the HMM. The trellis diagram shows how each state in the model at one time step connects to the states in the next time step. In this section, we again follow the derivation presented in Murphy [3].

The Viterbi algorithm also has a forward and backward pass. In the forward pass, instead of the sum-product algorithm, we utilize the max-product algorithm. The backward pass recovers the most probable path through the trellis diagram using a traceback procedure, propagating the most likely state at time $t$ back in time to recursively find the most likely sequence between times $1 : t$. This can be expressed as,

$$\delta_t(j) \triangleq \max_{Z_1, \dots, Z_{t-1}} P(Z_{1:t-1}, Z_t = j \mid X_{1:t}). \tag{27}$$

This probability can be expressed as a combination of the transition from the previous state $i$ at time $t-1$ and the most probable path leading to $i$,

$$\delta_t(j) = \max_{1 \le i \le K} \delta_{t-1}(i)\, A_{ij}\, B_{X_t}(j). \tag{28}$$

Here $B_{X_t}(j)$ is the emission probability of observation $X_t$ given state $j$. Strictly speaking, the recursion in Eq. 28 tracks the joint probability with $X_{1:t}$, which differs from Eq. 27 only by a factor that does not depend on the states, so the maximizing path is unchanged. We also need to keep track of the most likely previous state $i$,

$$a_t(j) = \arg\max_i\, \delta_{t-1}(i)\, A_{ij}\, B_{X_t}(j). \tag{29}$$

The initial probability is

$$\delta_1(j) = \pi_j\, B_{X_1}(j). \tag{30}$$

The most probable final state is

$$Z_N^* = \arg\max_i\, \delta_N(i). \tag{31}$$

The most probable sequence can be computed using traceback,

$$Z_t^* = a_{t+1}(Z_{t+1}^*). \tag{32}$$

In order to avoid underflow, we can work in the log domain. This is one of the advantages of the Viterbi algorithm, since $\log \max = \max \log$; this is not possible with the forward-backward algorithm since $\log \sum \ne \sum \log$. Therefore

$$\log \delta_t(j) \triangleq \max_i \left[ \log \delta_{t-1}(i) + \log A_{ij} + \log B_{X_t}(j) \right]. \tag{33}$$

The pseudo-code in Algorithm 3 outlines the steps of the computation.

Algorithm 3 Viterbi algorithm
    Input: $X_{1:N}$, $K$, $A$, $B$, $\pi$
    Initialize: $\log \delta_1 = \log \pi + \log B_{X_1}$, $a_1 = 0$
    for $t = 2 : N$ do
        for $j = 1 : K$ do
            $[a_t(j), \log \delta_t(j)] = \max_i\, (\log \delta_{t-1}(i) + \log A_{ij} + \log B_{X_t}(j))$
    $Z_N^* = \arg\max_j\, \log \delta_N(j)$
    for $t = N-1 : 1$ do
        $Z_t^* = a_{t+1}(Z_{t+1}^*)$
    Return $Z_{1:N}^*$

E. The Baum-Welch Algorithm

The Baum-Welch algorithm is in essence the Expectation-Maximization (EM) algorithm for HMMs. Given a sequence of observations $X_{1:N}$, we would like to find

$$\arg\max_\lambda P(X; \lambda) = \arg\max_\lambda \sum_Z P(X, Z; \lambda) \tag{34}$$

by doing maximum-likelihood estimation. Since summing over all possible $Z$ is not feasible in terms of computation time (the number of state sequences grows as $K^N$), we use EM to estimate the model parameters. The algorithm requires us to have the forward and backward probabilities $\alpha$, $\beta$ calculated using the forward-backward algorithm. In this section we follow the derivation presented in Murphy [3] and the lecture notes.
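The forward, backward, and Viterbi recursions of Sections C and D can be sketched in code as follows. This is a minimal pure-Python illustration under stated assumptions, not the implementation from the report: function names are hypothetical, states are 0-based integers, `psi[t][j]` is assumed to hold the local evidence $P(X_t \mid Z_t = j)$, and all probabilities are assumed strictly positive so that every logarithm is defined.

```python
import math

def normalize(u):
    """Return (u / sum(u), sum(u)), as in the Sub step of Algorithm 1."""
    c = sum(u)
    return [v / c for v in u], c

def forward(pi, A, psi):
    """Filtered marginals alpha[t][j] and log P(X_{1:N}) (Eqs. 17 and 18)."""
    K = len(pi)
    alpha, c = normalize([psi[0][j] * pi[j] for j in range(K)])
    alphas, log_evidence = [alpha], math.log(c)
    for t in range(1, len(psi)):
        prev = alphas[-1]
        # Prediction step (A^T alpha_{t-1}), then update with local evidence.
        pred = [sum(A[i][j] * prev[i] for i in range(K)) for j in range(K)]
        alpha, c = normalize([psi[t][j] * pred[j] for j in range(K)])
        alphas.append(alpha)
        log_evidence += math.log(c)
    return alphas, log_evidence

def backward(A, psi):
    """Normalized backward messages beta[t][i] (Eqs. 24 and 25)."""
    K, N = len(A), len(psi)
    betas = [[1.0] * K for _ in range(N)]
    for t in range(N - 1, 0, -1):
        betas[t - 1], _ = normalize(
            [sum(A[i][j] * psi[t][j] * betas[t][j] for j in range(K))
             for i in range(K)])
    return betas

def smooth(alphas, betas):
    """Smoothed marginals gamma[t][j] (Eq. 26)."""
    return [normalize([a * b for a, b in zip(al, be)])[0]
            for al, be in zip(alphas, betas)]

def viterbi(pi, A, psi):
    """Most probable state path, computed in the log domain (Eq. 33)."""
    K = len(pi)
    log_delta = [math.log(pi[j]) + math.log(psi[0][j]) for j in range(K)]
    back_pointers = []
    for t in range(1, len(psi)):
        new_delta, pointers = [], []
        for j in range(K):
            scores = [log_delta[i] + math.log(A[i][j]) for i in range(K)]
            best = max(range(K), key=scores.__getitem__)
            pointers.append(best)
            new_delta.append(scores[best] + math.log(psi[t][j]))
        log_delta = new_delta
        back_pointers.append(pointers)
    # Traceback (Eq. 32), starting from the most probable final state.
    path = [max(range(K), key=log_delta.__getitem__)]
    for pointers in reversed(back_pointers):
        path.insert(0, pointers[path[0]])
    return path

# Made-up model; B[k][j] = P(X_t = k | Z_t = j), observations are 0-based.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.2], [0.1, 0.8]]
psi = [B[x] for x in [0, 0, 1]]

alphas, log_evidence = forward(pi, A, psi)
gammas = smooth(alphas, backward(A, psi))
print(math.exp(log_evidence), gammas, viterbi(pi, A, psi))
```

Normalizing the backward messages at every step, as Algorithm 2 does, changes each $\beta_t$ only by a constant factor, which cancels when $\gamma_t$ is normalized.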
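Algorithm 4 Baum-Welch algorithm
    Input: $X_{1:N}$, $A$, $B$, $\alpha$, $\beta$
    for $t = 1 : N$ do
        $\gamma(:, t) = (\alpha(:, t) \odot \beta(:, t)) \,./\, \mathrm{sum}(\alpha(:, t) \odot \beta(:, t))$
    for $t = 2 : N$ do
        $\xi(:, :, t) = A \odot \big(\alpha(:, t-1)\,(\beta(:, t) \odot B^\top X(:, t))^\top\big)$
        $\xi(:, :, t) = \xi(:, :, t) \,./\, \mathrm{sum}(\mathrm{sum}(\xi(:, :, t)))$
    $\hat{\pi} = \gamma(:, 1) \,./\, \mathrm{sum}(\gamma(:, 1))$
    for $j = 1 : K$ do
        $\hat{A}(j, :) = \mathrm{sum}(\xi(j, :, 2{:}N), 3) \,./\, \mathrm{sum}(\mathrm{sum}(\xi(j, :, 2{:}N), 3), 2)$
    for $j = 1 : \Omega$ do
        $\hat{B}(j, :) = (X(j, :)\,\gamma^\top) \,./\, \mathrm{sum}(\gamma, 2)^\top$
    Return $\hat{\pi}$, $\hat{A}$, $\hat{B}$

In Algorithm 4 we assume that $\alpha$, $\beta$ and $\gamma$ are stored as $K \times N$ matrices with one column per time step, $\xi$ as a $K \times K \times N$ array, and $X$ as an $\Omega \times N$ matrix whose columns are the one-hot observation vectors.

1) E Step:

$$\gamma_{tk} \triangleq P(Z_{t,k} = 1 \mid X, \lambda^{old}) = \frac{\alpha_k(t)\, \beta_k(t)}{\sum_{j=1}^{K} \alpha_j(t)\, \beta_j(t)} \tag{35}$$

$$\xi_{tjk} \triangleq P(Z_{t-1,j} = 1, Z_{t,k} = 1 \mid X, \lambda^{old}) = \frac{\alpha_j(t-1)\, A_{jk}\, B_k(X_t)\, \beta_k(t)}{\sum_{j'=1}^{K} \sum_{k'=1}^{K} \alpha_{j'}(t-1)\, A_{j'k'}\, B_{k'}(X_t)\, \beta_{k'}(t)} \tag{36}$$

Here $\alpha_k(t) = \alpha_t(k)$ and $\beta_k(t) = \beta_t(k)$ are the forward and backward quantities from Section C, and $B_k(X_t)$ is the emission probability of observation $X_t$ in state $k$. The denominator of Eq. 36 normalizes over all state pairs, so the expression is valid for the normalized $\alpha$ and $\beta$ produced by Algorithms 1 and 2.

2) M Step: The parameter estimation problem can be turned into a constrained optimization problem where $P(X_{1:N} \mid \lambda)$ is maximized, subject to the stochastic constraints of the HMM parameters [2]. The technique of Lagrange multipliers can then be used to find the model parameters, yielding the following expressions:

$$\hat{\pi}_k = \frac{E[N_k^1]}{\sum_{j=1}^{K} E[N_j^1]} = \frac{\gamma_{1k}}{\sum_{j=1}^{K} \gamma_{1j}} \tag{37}$$

$$\hat{A}_{jk} = \frac{E[N_{jk}]}{\sum_{k'} E[N_{jk'}]} = \frac{\sum_{t=2}^{N} \xi_{tjk}}{\sum_{l=1}^{K} \sum_{t=2}^{N} \xi_{tjl}} \tag{38}$$

$$\hat{B}_{jl} = \frac{E[M_{jl}]}{E[N_l]} = \frac{\sum_{t=1}^{N} \gamma_{tl}\, X_{t,j}}{\sum_{t=1}^{N} \gamma_{tl}} \tag{39}$$

$$\lambda^{new} = (\hat{A}, \hat{B}, \hat{\pi}) \tag{40}$$

where $E[N_k^1]$ is the expected number of times the chain starts in state $k$, $E[N_{jk}]$ is the expected number of transitions from state $j$ to state $k$, $E[M_{jl}]$ is the expected number of times observation $j$ is emitted from state $l$, and $E[N_l]$ is the expected number of visits to state $l$. The pseudo-code in Algorithm 4 outlines the steps of the computation.

F. Limitations

A fully-connected transition diagram can lead to severe overfitting. [1] explains this by giving an example from computer vision, where objects are tracked in a sequence of images. In problems with large parameter spaces like this, the transition matrix ends up being very large. Unless there are lots of examples in the data set, or unless some a priori knowledge about the problem is used, this leads to severe overfitting. A solution to this is to use other types of HMMs, such as factorial or hierarchical HMMs.

REFERENCES

[1] Z. Ghahramani, "An Introduction to Hidden Markov Models and Bayesian Networks," International Journal of Pattern Recognition and Artificial Intelligence, vol. 15, no. 1, pp. 9-42, 2001.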

[2] L. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, 1989.

[3] K. P. Murphy, Machine Learning: A Probabilistic Perspective. Cambridge, MA: MIT Press, 2012.
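As a supplement to Section E, one iteration of the E and M steps can be sketched as follows. This is a minimal pure-Python illustration, not the implementation from the report: the function names are hypothetical, the model numbers are made up, a single observation sequence of length at least 2 with 0-based integer symbols is assumed, `B[k][j]` is assumed to hold $P(X_t = k \mid Z_t = j)$, and all probabilities are assumed strictly positive. The pairwise posterior is normalized over all state pairs at each time step, so scaled forward and backward messages can be used directly.

```python
import math

def _normalize(u):
    c = sum(u)
    return [v / c for v in u], c

def forward(obs, pi, A, B):
    """Scaled forward pass: filtered marginals and log-likelihood."""
    K = len(pi)
    alpha, c = _normalize([B[obs[0]][j] * pi[j] for j in range(K)])
    alphas, loglik = [alpha], math.log(c)
    for x in obs[1:]:
        prev = alphas[-1]
        alpha, c = _normalize(
            [B[x][j] * sum(A[i][j] * prev[i] for i in range(K))
             for j in range(K)])
        alphas.append(alpha)
        loglik += math.log(c)
    return alphas, loglik

def backward(obs, A, B):
    """Scaled backward pass."""
    K, N = len(A), len(obs)
    betas = [[1.0] * K for _ in range(N)]
    for t in range(N - 1, 0, -1):
        betas[t - 1], _ = _normalize(
            [sum(A[i][j] * B[obs[t]][j] * betas[t][j] for j in range(K))
             for i in range(K)])
    return betas

def em_step(obs, pi, A, B):
    """One Baum-Welch iteration; returns (new_pi, new_A, new_B)."""
    K, M, N = len(pi), len(B), len(obs)
    alphas, _ = forward(obs, pi, A, B)
    betas = backward(obs, A, B)
    # E step: state posteriors gamma and summed pairwise posteriors xi.
    gamma = [_normalize([a * b for a, b in zip(alphas[t], betas[t])])[0]
             for t in range(N)]
    xi_sum = [[0.0] * K for _ in range(K)]
    for t in range(1, N):
        raw = [[alphas[t - 1][j] * A[j][k] * B[obs[t]][k] * betas[t][k]
                for k in range(K)] for j in range(K)]
        total = sum(map(sum, raw))
        for j in range(K):
            for k in range(K):
                xi_sum[j][k] += raw[j][k] / total
    # M step: re-estimate parameters from expected counts.
    new_pi = gamma[0]
    new_A = [_normalize(row)[0] for row in xi_sum]
    visits = [sum(g[l] for g in gamma) for l in range(K)]
    new_B = [[sum(g[l] for g, x in zip(gamma, obs) if x == k) / visits[l]
              for l in range(K)] for k in range(M)]
    return new_pi, new_A, new_B

# Made-up starting model and observation sequence.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.2], [0.1, 0.8]]
obs = [0, 0, 1, 0, 1, 1, 0, 0]

_, loglik_before = forward(obs, pi, A, B)
new_pi, new_A, new_B = em_step(obs, pi, A, B)
_, loglik_after = forward(obs, new_pi, new_A, new_B)
print(loglik_before, loglik_after)
```

Since EM never decreases the likelihood, comparing the log-likelihood before and after a step is a cheap sanity check on an implementation; in practice the step would be repeated until the improvement falls below a chosen tolerance.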