CS281A/Stat241A: Statistical Learning Theory

Hidden Markov Models & The Multivariate Gaussian (10/26/04)

Lecturer: Michael I. Jordan    Scribes: Jonathan W. Hui

1 Hidden Markov Models

As a brief review, hidden Markov models (HMMs) are appropriate for modeling sequential data. Thus, HMMs have been applied to speech recognition, gene finding, and other applications which may involve, but are not restricted to, parsing or segmenting. The formal structure of an HMM is shown below, where the representation can be viewed as a chain of mixture models. The hidden states are denoted by q_t and the observed values are denoted by y_t, where t is a specific point in time.

[Figure: a chain of hidden states q_0 → q_1 → ... → q_T linked by the transition matrix A, with initial distribution π at q_0 and an observed output y_t attached to each hidden state q_t.]

2 HMM Parameter Estimation

Recall from the last lecture the complete log likelihood for the HMM:

    l_c(θ) = Σ_{i=1}^M q_0^i log π_i + Σ_{t=0}^{T-1} Σ_{i,j=1}^M q_t^i q_{t+1}^j log a_{ij} + Σ_{t=0}^T log p(y_t | q_t, η)    (1)

where p(y_t | q_t, η) is the probability distribution of each output node. We observe that the complete probability distribution is in the exponential family, where q_0^i is the sufficient statistic for π_i and Σ_{t=0}^{T-1} q_t^i q_{t+1}^j is the sufficient statistic for a_{ij} (the sufficient statistic for η depends on the distribution chosen for p(y_t | q_t, η)).

Note: In this discussion, we leave the distribution on the output values arbitrary and thus ignore the Σ_t log p(y_t | q_t, η) term. Refer to chapter 12 of the text for an example where the outputs y_t are multinomial variables.
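As a concrete numerical illustration of the complete log likelihood in Eq. (1), the sketch below evaluates it for a single fully observed state/output sequence. It assumes, purely for illustration, scalar Gaussian emissions with a shared variance; the function name and the one-hot encoding of q are our own conventions, not from the lecture.

```python
import numpy as np

def complete_log_likelihood(q, y, pi, A, mu, sigma):
    """Complete log likelihood l_c(theta) of Eq. (1) for one sequence.
    q: (T+1, M) one-hot state indicators q_t^i; y: (T+1,) scalar outputs.
    Emission model (an illustrative choice): y_t | q_t^i = 1 ~ N(mu[i], sigma^2)."""
    ll = q[0] @ np.log(pi)                      # initial-state term
    for t in range(len(q) - 1):                 # transition terms q_t^i q_{t+1}^j log a_ij
        ll += q[t] @ np.log(A) @ q[t + 1]
    states = q.argmax(axis=1)                   # emission terms log p(y_t | q_t)
    ll += np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                 - (y - mu[states])**2 / (2 * sigma**2))
    return ll
```

The three summands in the code correspond one-to-one to the three terms of Eq. (1).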
2.1 E Step

For the E step, we take the expected value of the complete log likelihood, conditioning on the observations y and on the parameters at iteration p, θ^(p):

    ⟨l_c(θ; q, y)⟩_{y,θ^(p)} = ⟨Σ_{i=1}^M q_0^i log π_i + Σ_{t=0}^{T-1} Σ_{i,j=1}^M q_t^i q_{t+1}^j log a_{ij} + Σ_{t=0}^T log p(y_t | q_t, η)⟩_{y,θ^(p)}    (2)

                             = Σ_{i=1}^M ⟨q_0^i⟩_{y,θ^(p)} log π_i + Σ_{t=0}^{T-1} Σ_{i,j=1}^M ⟨q_t^i q_{t+1}^j⟩_{y,θ^(p)} log a_{ij} + Σ_{t=0}^T ⟨log p(y_t | q_t, η)⟩_{y,θ^(p)}    (3)

Thus, we must compute ⟨q_0^i⟩_{y,θ^(p)} and ⟨q_t^i q_{t+1}^j⟩_{y,θ^(p)}. However, these are just marginal probabilities:

    ⟨q_0^i⟩_{y,θ^(p)} = E(q_0^i | y, θ^(p))    (4)
                      = p(q_0^i = 1 | y, θ^(p))    (5)

    ⟨q_t^i q_{t+1}^j⟩_{y,θ^(p)} = E(q_t^i q_{t+1}^j | y, θ^(p))    (6)
                                = p(q_t^i = 1, q_{t+1}^j = 1 | y, θ^(p))    (7)

We can compute these values via the SUM-PRODUCT algorithm. While the marginals can be computed via the SUM-PRODUCT algorithm, we will remain consistent with the HMM literature and show how to calculate these values via the alpha-beta algorithm (also called forward-backward). We first show that the calculation of the β's in the alpha-beta algorithm is identical to the SUM-PRODUCT algorithm. Consider the fragment of the graphical model representation of the HMM below:

[Figure: a two-slice fragment of the HMM, with q_t linked to q_{t+1} by the transition matrix A and outputs y_t and y_{t+1} attached to q_t and q_{t+1}.]

The β's of the alpha-beta algorithm are given by:

    β(q_t) = Σ_{q_{t+1}} p(y_{t+1} | q_{t+1}) β(q_{t+1}) a_{q_t, q_{t+1}}    (8)

By the SUM-PRODUCT algorithm, the message sent from q_{t+1} to q_t is given by:

    m_{q_{t+1}}(q_t) = Σ_{q_{t+1}} m_{q_{t+2}}(q_{t+1}) p(y_{t+1} | q_{t+1}) a_{q_t, q_{t+1}}    (9)
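The β recursion of Eq. (8) can be sketched in a few lines. This is a minimal illustration assuming the emission probabilities p(y_t | q_t = i) have already been evaluated into a table; the array names are our own.

```python
import numpy as np

def backward(A, emission_probs):
    """Unnormalized beta recursion of Eq. (8).
    A: (M, M) transition matrix with A[i, j] = a_{ij};
    emission_probs: (T+1, M) table with entries p(y_t | q_t = i).
    Returns beta with beta[t, i] = p(y_{t+1}, ..., y_T | q_t = i)."""
    T1, M = emission_probs.shape
    beta = np.ones((T1, M))                 # base case: beta(q_T) = 1
    for t in range(T1 - 2, -1, -1):         # sweep backward in time
        beta[t] = A @ (emission_probs[t + 1] * beta[t + 1])
    return beta
```

Each step is exactly Eq. (8): a sum over q_{t+1} of the emission probability, the incoming β, and the transition probability.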
We see that this is exactly the same as the calculation of the β's, where β(q_t) = m_{q_{t+1}}(q_t). Note that we drop the q_{t+1} subscript on the β's since the chain structure of the HMM already implies that the message is sent by q_{t+1}. We can also write β(q_t) as follows:

    β(q_t) ≜ p(y_{t+1}, ..., y_T | q_t)    (10)

which is the probability of emitting the partial sequence of outputs y_{t+1}, ..., y_T given that the system starts in state q_t. In the alpha-beta algorithm, the α's are defined to be:

    α(q_t) ≜ p(y_0, ..., y_t, q_t)    (11)

which is the probability of emitting the partial sequence of outputs y_0, ..., y_t and ending up in state q_t.

2.2 M Step

We define γ_t^i to be equal to ⟨q_t^i⟩, and ξ_{t,t+1}^{ij} to be equal to ⟨q_t^i q_{t+1}^j⟩. We can write the expected complete log likelihood as:

    ⟨l_c(θ)⟩ = Σ_{i=1}^M γ_0^i log π_i + Σ_{t=0}^{T-1} Σ_{i,j=1}^M ξ_{t,t+1}^{ij} log a_{ij} + Σ_{t=0}^T ⟨log p(y_t | q_t, η)⟩    (12)

In deriving the M step for the HMM, we use a lemma that is useful throughout this class, and thus we present it here. Given J(π) as follows:

    J(π) = Σ_i a_i log π_i    (13)

we would like to maximize J(π) such that Σ_i π_i = 1 and π_i > 0. The solution is π̂_i = a_i / Σ_j a_j. To see that this is true, we simply take the derivative with respect to π_i and set it to zero, using a Lagrange multiplier λ to represent the constraint that the π_i's must sum to one:

    J̃(π) = Σ_i a_i log π_i + λ(1 − Σ_i π_i)    (14)

    ∂J̃/∂π_i = a_i/π_i − λ = 0    (15)

    a_i = λ π_i    (16)

Summing Eq. (16) over i gives λ = Σ_j a_j, so that

    π̂_i = a_i/λ = a_i / Σ_j a_j    (17)

Using this lemma, we derive the following equations for the M step:

    π̂_i^(p+1) = γ_0^i / Σ_j γ_0^j    (18)

    â_{ij}^(p+1) = Σ_{t=0}^{T-1} ξ_{t,t+1}^{ij} / Σ_{k=1}^M Σ_{t=0}^{T-1} ξ_{t,t+1}^{ik}    (19)

                 = Σ_{t=0}^{T-1} ξ_{t,t+1}^{ij} / Σ_{t=0}^{T-1} γ_t^i    (20)
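Once the expected sufficient statistics γ and ξ are available from the E step, the M-step updates of Eqs. (18)-(20) are a few lines of arithmetic. A minimal sketch follows; the array layout and function name are our own conventions.

```python
import numpy as np

def m_step(gamma, xi):
    """M-step (Baum-Welch style) updates of Eqs. (18)-(20).
    gamma: (T+1, M) with gamma[t, i] = <q_t^i>;
    xi: (T, M, M) with xi[t, i, j] = <q_t^i q_{t+1}^j>."""
    pi_hat = gamma[0] / gamma[0].sum()          # Eq. (18)
    counts = xi.sum(axis=0)                     # sum_t xi_{t,t+1}^{ij}
    A_hat = counts / counts.sum(axis=1, keepdims=True)   # Eqs. (19)-(20)
    return pi_hat, A_hat
```

The row normalization of `counts` uses the identity in Eq. (20): summing ξ over j (and t) recovers Σ_t γ_t^i.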
In the case of HMMs, these equations are also known as the Baum-Welch updates.

Note: In some cases, we would like to calculate the configuration of states of the HMM that has the highest probability given observed values for y_t. We can solve this by using the well-known Viterbi algorithm, which essentially is the MAX-PRODUCT algorithm.

2.3 Concrete Example

To give a concrete example, we compute the expected complete log likelihood when the probability distribution on the output values is Gaussian (with a shared variance σ², dropping additive constants):

    ⟨l_c(θ)⟩ = ⋯ + Σ_{t=0}^T ⟨log p(y_t | q_t, η)⟩    (22)
             = ⋯ + Σ_{t=0}^T ⟨Σ_i q_t^i log p(y_t | q_t^i = 1, η)⟩    (23)
             = ⋯ + Σ_{t=0}^T Σ_i ⟨q_t^i⟩ log p(y_t | q_t^i = 1, η)    (24)
             = ⋯ + Σ_{t=0}^T Σ_i ⟨q_t^i⟩ (−1/(2σ²)) (y_t − μ_i)²    (25)
             = ⋯ + Σ_{t=0}^T Σ_i γ_t^i (−1/(2σ²)) (y_t − μ_i)²    (26)

Note that this is a weighted least squares problem.

2.4 Numerical Issues

When implementing an HMM on the computer, special care has to be taken with regard to numerical issues. Specifically, the forward and backward recursions involve repeated multiplications of probabilities (i.e., numbers less than one). These repeated multiplications of small numbers generally lead to underflow. However, we can get around this problem simply by normalizing after each recursion step. See chapter 12 for more details on how to derive these normalized update equations.

3 Factor Analysis Models and HMMs

So far, we've viewed HMMs as a chain of mixture models where the states q_t are based on a discrete latent variable. We can also consider factor analysis models, which are based on continuous latent variables. The underlying graphs of these models are identical. Roughly, factor analysis can be considered a probabilistic form of principal component analysis (PCA).
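The normalization idea of Section 2.4 can be sketched for the forward recursion: rescaling α after each step keeps the numbers in range, and the logs of the scale factors accumulate to the log likelihood. This is a hypothetical mini-implementation of that idea, not the text's derivation; see chapter 12 for the derivation.

```python
import numpy as np

def forward_scaled(pi, A, emission_probs):
    """Forward recursion with per-step normalization to avoid underflow.
    pi: (M,) initial distribution; A: (M, M) transitions a_{ij};
    emission_probs: (T+1, M) table of p(y_t | q_t = i).
    Returns filtered posteriors alpha_hat[t, i] = p(q_t = i | y_0..y_t)
    and log p(y_0..y_T) recovered from the scale factors."""
    T1, M = emission_probs.shape
    alpha_hat = np.empty((T1, M))
    loglik = 0.0
    for t in range(T1):
        a = pi * emission_probs[0] if t == 0 else \
            (alpha_hat[t - 1] @ A) * emission_probs[t]
        c = a.sum()                 # scale factor c_t = p(y_t | y_0..y_{t-1})
        alpha_hat[t] = a / c        # normalize so each step stays O(1)
        loglik += np.log(c)         # sum of log c_t = log p(y_0..y_T)
    return alpha_hat, loglik
```

For short sequences the unscaled recursion gives the same likelihood; for long sequences only the scaled version survives in finite precision.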
[Figure: the factor analysis model, with latent variable X and observed variable Y.]

In the dynamical generalization of factor analysis, called the Kalman filter model, the hidden states are represented by x_t and the observed values by y_t. To represent the transition between nodes, we let the mean of the state at time t + 1 be a linear function of the state at time t plus a Gaussian error ε_t with mean 0. The initial state, x_0, is endowed with a Gaussian distribution with mean 0 and covariance Σ_0. The state space model is shown below.

[Figure: a chain of hidden states x_0 → x_1 → ... → x_T linked by the transition matrix A, with each observed output y_t produced from x_t through the output matrix C.]

This dynamical generalization of factor analysis yields time series analysis methods known as the Kalman filter and the Rauch-Tung-Striebel smoother.

4 Multivariate Gaussians

We often express the multivariate Gaussian distribution using the parameters μ and Σ, where μ is a d × 1 vector and Σ is a d × d symmetric matrix. Using these parameters, we have the following form for the density function:

    p(x | μ, Σ) = 1 / ((2π)^{d/2} |Σ|^{1/2}) · exp{ −(1/2) (x − μ)^T Σ^{−1} (x − μ) }    (27)

where x is a vector in R^d. Alternatively, we can use a different parameterization, the canonical parameterization. We define the canonical parameters as follows:

    Λ = Σ^{−1}    (28)

    η = Σ^{−1} μ    (29)
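To check the moment and canonical parameterizations against each other numerically, one can evaluate Eq. (27) and the canonical-form density exp{η^T x − ½ x^T Λ x + a(η, Λ)} at the same point. The closed form for the log normalizer a(η, Λ) used below is implied by, but not derived in, these notes; it follows from matching the two forms term by term.

```python
import numpy as np

def gaussian_moment(x, mu, Sigma):
    """Density of Eq. (27) in the moment parameterization (mu, Sigma)."""
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)          # (x-mu)^T Sigma^{-1} (x-mu)
    return np.exp(-0.5 * quad) / ((2 * np.pi) ** (d / 2)
                                  * np.linalg.det(Sigma) ** 0.5)

def gaussian_canonical(x, eta, Lam):
    """Canonical-form density with Lambda = Sigma^{-1}, eta = Sigma^{-1} mu.
    a(eta, Lam) = (1/2)(log|Lam| - d log 2*pi - eta^T Lam^{-1} eta)."""
    d = len(eta)
    a = 0.5 * (np.linalg.slogdet(Lam)[1] - d * np.log(2 * np.pi)
               - eta @ np.linalg.solve(Lam, eta))
    return np.exp(eta @ x - 0.5 * x @ Lam @ x + a)
```

Agreement of the two functions on random inputs is a useful sanity check that the parameter maps Λ = Σ^{-1}, η = Σ^{-1}μ were applied consistently.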
Note that these transformations are invertible and that we can calculate the moment parameters as follows:

    μ = Λ^{−1} η    (30)

    Σ = Λ^{−1}    (31)

Using the canonical parameterization, we obtain the following form for the density function:

    p(x | η, Λ) = exp{ η^T x − (1/2) x^T Λ x + a(η, Λ) }    (32)

We now introduce the trace trick. We define the trace of a square matrix A to be the sum of the diagonal elements a_{ii} of A:

    tr(A) ≜ Σ_i a_{ii}    (33)

An important property is its invariance to cyclical permutations of matrix products:

    tr(ABC) = tr(CAB) = tr(BCA)    (34)

Using the trace trick, we have the following:

    x^T Λ x = tr(x^T Λ x) = tr(Λ x x^T)    (35)

Thus, we can rewrite the density function as follows:

    p(x | η, Λ) = exp{ η^T x − (1/2) tr(Λ x x^T) + a(η, Λ) }    (36)

We now see that the sufficient statistic is (x, x x^T) and the canonical parameter is (η, −(1/2) Λ).

We can partition the d × 1 vector x into a p × 1 sub-vector x_1 and a q × 1 sub-vector x_2, where d = p + q. The corresponding partitions of the μ and Σ parameters are:

    μ = [ μ_1 ]    (37)
        [ μ_2 ]

    Σ = [ Σ_11  Σ_12 ]    (38)
        [ Σ_21  Σ_22 ]

In the next class, we will talk about inverses of partitioned matrices. Given a block matrix

    M = ( E  F )    (39)
        ( G  H )

the Schur complement of the matrix M with respect to H, denoted M/H, is defined to be E − F H^{−1} G. We will also obtain an important result involving the determinant of M:

    |M| = |H| · |M/H|    (40)
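The determinant identity in Eq. (40) is easy to verify numerically. A small sketch follows; splitting the blocks at an index p is our own convention for mapping Eq. (39) onto an array.

```python
import numpy as np

def schur_complement(M, p):
    """Schur complement of the block matrix of Eq. (39) with respect to H:
    M/H = E - F H^{-1} G, where E is the leading p x p block of M
    and H is the trailing block."""
    E, F = M[:p, :p], M[:p, p:]
    G, H = M[p:, :p], M[p:, p:]
    return E - F @ np.linalg.solve(H, G)   # solve avoids forming H^{-1}
```

A random test matrix with a well-conditioned trailing block then satisfies |M| = |H| · |M/H| to machine precision.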