
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
ARTIFICIAL INTELLIGENCE LABORATORY
and
CENTER FOR BIOLOGICAL AND COMPUTATIONAL LEARNING
DEPARTMENT OF BRAIN AND COGNITIVE SCIENCES

A.I. Memo, January 1996
C.B.C.L. Paper No. 130

Factorial Hidden Markov Models

Zoubin Ghahramani and Michael I. Jordan

This publication can be retrieved by anonymous ftp to publications.ai.mit.edu.

Abstract

We present a framework for learning in hidden Markov models with distributed state representations. Within this framework, we derive a learning algorithm based on the Expectation-Maximization (EM) procedure for maximum likelihood estimation. Analogous to the standard Baum-Welch update rules, the M-step of our algorithm is exact and can be solved analytically. However, due to the combinatorial nature of the hidden state representation, the exact E-step is intractable. A simple and tractable mean field approximation is derived. Empirical results on a set of problems suggest that both the mean field approximation and Gibbs sampling are viable alternatives to the computationally expensive exact algorithm.

Copyright (c) Massachusetts Institute of Technology, 1994

This report describes research done at the Center for Biological and Computational Learning and the Artificial Intelligence Laboratory of the Massachusetts Institute of Technology. Support for the Center is provided in part by a grant from the National Science Foundation under contract ASC. This project was supported in part by a grant from the McDonnell-Pew Foundation, by a grant from ATR Human Information Processing Research Laboratories, by a grant from Siemens Corporation, and by grant N from the Office of Naval Research.

1 Introduction

A problem of fundamental interest to machine learning is time series modeling. Due to the simplicity and efficiency of its parameter estimation algorithm, the hidden Markov model (HMM) has emerged as one of the basic statistical tools for modeling discrete time series, finding widespread application in the areas of speech recognition (Rabiner and Juang, 1986) and computational molecular biology (Baldi et al., 1994). An HMM is essentially a mixture model, encoding information about the history of a time series in the value of a single multinomial variable (the hidden state). This multinomial assumption allows an efficient parameter estimation algorithm to be derived (the Baum-Welch algorithm). However, it also severely limits the representational capacity of HMMs. For example, to represent 30 bits of information about the history of a time sequence, an HMM would need 2^30 distinct states. On the other hand, an HMM with a distributed state representation could achieve the same task with 30 binary units (Williams and Hinton, 1991). This paper addresses the problem of deriving efficient learning algorithms for hidden Markov models with distributed state representations.

The need for distributed state representations in HMMs can be motivated in two ways. First, such representations allow the state space to be decomposed into features that naturally decouple the dynamics of a single process generating the time series. Second, distributed state representations simplify the task of modeling time series generated by the interaction of multiple independent processes. For example, a speech signal generated by the superposition of multiple simultaneous speakers can potentially be modeled with such an architecture.

Williams and Hinton (1991) first formulated the problem of learning in HMMs with distributed state representations and proposed a solution based on deterministic Boltzmann learning. The approach presented in this paper is similar to Williams and Hinton's in that it is also based on a statistical mechanical formulation of hidden Markov models. However, our learning algorithm is quite different in that it makes use of the special structure of HMMs with distributed state representations, resulting in a more efficient learning procedure. Anticipating the results in section 2, this learning algorithm both obviates the need for the two-phase procedure of Boltzmann machines and has an exact M-step. A different approach comes from Saul and Jordan (1995), who derived a set of rules for computing the gradients required for learning in HMMs with distributed state spaces. However, their methods can only be applied to a limited class of architectures.

2 Factorial hidden Markov models

Hidden Markov models are a generalization of mixture models. At any time step, the probability density over the observables defined by an HMM is a mixture of the densities defined by each state in the underlying Markov model. Temporal dependencies are introduced by specifying that the prior probability of the state at time t depends on the state at time t-1 through a transition matrix, P (Figure 1a). Another generalization of mixture models, the cooperative vector quantizer (CVQ; Hinton and Zemel, 1994), provides a natural formalism for distributed state representations in HMMs. Whereas in simple mixture models each data point must be accounted for by a single mixture component, in CVQs each data point is accounted for by the combination of contributions from many mixture components, one from each separate vector quantizer. The total probability density modeled by a CVQ is also a mixture model; however, this mixture density is assumed to factorize into a product of densities, each density associated with one of the vector quantizers. Thus, the CVQ is a mixture model with distributed representations for the mixture components.

Factorial hidden Markov models[1] combine the state transition structure of HMMs with the distributed representations of CVQs (Figure 1b). Each of the d underlying Markov models has a discrete state s_t^{(i)} at time t and transition probability matrix P^{(i)}. As in the CVQ, the states are mutually exclusive within each vector quantizer and we assume real-valued outputs. The sequence of observable output vectors is generated from a normal distribution with mean given by the weighted combination of the states of the underlying Markov models:

    y_t \sim \mathcal{N}\left( \sum_{i=1}^{d} W^{(i)} s_t^{(i)},\ C \right),

where C is a common covariance matrix. The k-valued states s^{(i)} are represented as discrete column vectors with a 1 in one position and 0 everywhere else; the mean of the observable is therefore a combination of columns from each of the W^{(i)} matrices.

Figure 1. a) Hidden Markov model. b) Factorial hidden Markov model.

We capture the above probability model by defining the energy of a sequence of T states and observations, {(s_t, y_t)}_{t=1}^{T}, which we abbreviate to {s, y}, as:

    H(\{s, y\}) = \frac{1}{2} \sum_{t=1}^{T} \left[ y_t - \sum_{i=1}^{d} W^{(i)} s_t^{(i)} \right]' C^{-1} \left[ y_t - \sum_{i=1}^{d} W^{(i)} s_t^{(i)} \right] - \sum_{t=1}^{T} \sum_{i=1}^{d} s_t^{(i)'} A^{(i)} s_{t-1}^{(i)},   (1)

where [A^{(i)}]_{jl} = \log P(s_{t,j}^{(i)} \mid s_{t-1,l}^{(i)}) such that \sum_{j=1}^{k} e^{[A^{(i)}]_{jl}} = 1, and ' denotes matrix transpose. Priors \pi^{(i)} for the initial state, s_1^{(i)}, are introduced by setting the second term in (1) to \sum_{i=1}^{d} s_1^{(i)'} \log \pi^{(i)}. The probability model is defined from this energy by the Boltzmann distribution

    P(\{s, y\}) = \frac{1}{Z} \exp\{ -H(\{s, y\}) \}.   (2)

[1] We refer to HMMs with distributed state as factorial HMMs as the features of the distributed state factorize the total state representation.
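As a concrete illustration of this generative process, the following is a minimal Python sketch, not part of the original memo; the function name sample_fhmm and the array conventions (lists W, P, pi of per-chain parameters) are assumptions made for illustration only.

import numpy as np

def sample_fhmm(W, P, pi, C, T, rng=np.random.default_rng(0)):
    """Sample a length-T observation sequence from a factorial HMM.

    W  : list of d arrays, each (p, k)  -- output weights W^(i)
    P  : list of d arrays, each (k, k)  -- column-stochastic transitions, P[i][j, l] = P(s_t = j | s_{t-1} = l)
    pi : list of d arrays, each (k,)    -- initial state priors
    C  : (p, p) shared output covariance
    """
    d, p = len(W), W[0].shape[0]
    states = np.zeros((T, d), dtype=int)
    Y = np.zeros((T, p))
    for t in range(T):
        for i in range(d):
            probs = pi[i] if t == 0 else P[i][:, states[t - 1, i]]
            states[t, i] = rng.choice(len(probs), p=probs)
        # mean of y_t is one column of each W^(i), selected by that chain's state
        mean = sum(W[i][:, states[t, i]] for i in range(d))
        Y[t] = rng.multivariate_normal(mean, C)
    return states, Y

Called with randomly sampled and appropriately normalized parameters and T = 20, this mirrors the kind of synthetic data generation described in section 3.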

Note that as in the CVQ (Ghahramani, 1995), the unclamped partition function

    Z = \int d\{y\} \sum_{\{s\}} \exp\{ -H(\{s, y\}) \}

evaluates to a constant, independent of the parameters. This can be shown by first integrating out the Gaussian variables, removing all dependency on {y}, and then summing over the states using the constraint \sum_{j=1}^{k} e^{[A^{(i)}]_{jl}} = 1.

The EM algorithm for factorial HMMs

As in HMMs, the parameters of a factorial HMM can be estimated via the EM (Baum-Welch) algorithm. This procedure iterates between using the current parameters to compute probabilities over the hidden states (E-step), and using these probabilities to maximize the expected log likelihood of the parameters (M-step). Using the likelihood (2), the expected log likelihood of the parameters is

    Q(\phi^{new} \mid \phi) = \langle -H(\{s, y\}) - \log Z \rangle_c,   (3)

where \phi = \{W^{(i)}, P^{(i)}, C\}_{i=1}^{d} denotes the current parameters, and \langle \cdot \rangle_c denotes expectation given the clamped observation sequence and \phi. Given the observation sequence, the only random variables are the hidden states. Expanding equation (3) and limiting the expectation to these random variables, we find that the statistics that need to be computed for the E-step are \langle s_t^{(i)} \rangle_c, \langle s_t^{(i)} s_t^{(j)'} \rangle_c, and \langle s_t^{(i)} s_{t-1}^{(i)'} \rangle_c. Note that in standard HMM notation (Rabiner and Juang, 1986), \langle s_t^{(i)} \rangle_c corresponds to \gamma_t and \langle s_t^{(i)} s_{t-1}^{(i)'} \rangle_c corresponds to \xi_t, whereas \langle s_t^{(i)} s_t^{(j)'} \rangle_c has no analogue when there is only a single underlying Markov model. The M-step uses these expectations to maximize Q with respect to the parameters. The constant partition function allowed us to drop the second term in (3). Therefore, unlike the Boltzmann machine, the expected log likelihood does not depend on statistics collected in an unclamped phase of learning, resulting in much faster learning than the traditional Boltzmann machine (Neal, 1992).

M-step

Setting the derivatives of Q with respect to the output weights to zero, we obtain a linear system of equations for W:

    W^{new} = \left[ \sum_{N,t} y_t \langle s_t \rangle_c' \right] \left[ \sum_{N,t} \langle s_t s_t' \rangle_c \right]^{\dagger},

where s_t and W are the vector and matrix of concatenated s_t^{(i)} and W^{(i)}, respectively, \sum_N denotes summation over a data set of N sequences, and \dagger is the Moore-Penrose pseudo-inverse. To estimate the log transition probabilities we solve \partial Q / \partial [A^{(i)}]_{jl} = 0 subject to the constraint \sum_j e^{[A^{(i)}]_{jl}} = 1, obtaining

    [A^{(i)}]^{new}_{jl} = \log \left( \frac{ \sum_{N,t} \langle s_{t,j}^{(i)} s_{t-1,l}^{(i)} \rangle_c }{ \sum_{N,t,j} \langle s_{t,j}^{(i)} s_{t-1,l}^{(i)} \rangle_c } \right).   (4)

The covariance matrix can be similarly estimated:

    C^{new} = \frac{1}{NT} \left[ \sum_{N,t} y_t y_t' - \sum_{N,t} y_t \langle s_t' \rangle_c \left[ \sum_{N,t} \langle s_t s_t' \rangle_c \right]^{\dagger} \sum_{N,t} \langle s_t \rangle_c y_t' \right].

The M-step equations can therefore be solved analytically; furthermore, for a single underlying Markov chain, they reduce to the traditional Baum-Welch re-estimation equations.
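To show how these closed-form updates might look in practice, here is a minimal sketch, assuming the E-step statistics have already been accumulated over the data set; the variable names (Eyy, Ey_s, Ess, Etrans) and the stacking of the d chains into one long state vector are conventions introduced here, not the memo's.

import numpy as np

def m_step(Eyy, Ey_s, Ess, Etrans, NT):
    """Closed-form M-step from accumulated expected sufficient statistics.

    Eyy    : (p, p)   sum over sequences and time of y_t y_t'
    Ey_s   : (p, kd)  sum of y_t <s_t>'  (s_t is the concatenation of the d chain states)
    Ess    : (kd, kd) sum of <s_t s_t'>
    Etrans : list of d (k, k) arrays, sum of <s_t^(i) s_{t-1}^(i)'>
    NT     : total number of (sequence, time) terms in the sums
    """
    W_new = Ey_s @ np.linalg.pinv(Ess)                          # output weights via the pseudo-inverse
    C_new = (Eyy - W_new @ Ey_s.T) / NT                         # shared output covariance
    P_new = [E / E.sum(axis=0, keepdims=True) for E in Etrans]  # transition matrices, columns sum to one
    return W_new, C_new, P_new

For simplicity the sketch returns the transition probabilities P^{(i)} themselves rather than their logarithms [A^{(i)}]; exponentiating equation (4) gives exactly this column normalization.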

E-step

Unfortunately, as in the simpler CVQ, the exact E-step for factorial HMMs is computationally intractable. For example, the expectation of the j-th unit in vector i at time step t, given {y}, is:

    \langle s_{t,j}^{(i)} \rangle_c = P(s_{t,j}^{(i)} = 1 \mid \{y\}, \phi) = \sum_{j_1, \ldots, j_{i-1}, j_{i+1}, \ldots, j_d = 1}^{k} P(s_{t,j_1}^{(1)} = 1, \ldots, s_{t,j}^{(i)} = 1, \ldots, s_{t,j_d}^{(d)} = 1 \mid \{y\}, \phi).

Although the Markov property can be used to obtain a forward-backward-like factorization of this expectation across time steps, the sum over all possible configurations of the other hidden units within each time step is unavoidable. For a data set of N sequences of length T, the full E-step calculated through the forward-backward procedure has time complexity O(NTk^{2d}). Although more careful bookkeeping can reduce the complexity to O(NTdk^{d+1}), the exponential time cannot be avoided. This intractability of the exact E-step is due inherently to the cooperative nature of the model: the setting of one vector only determines the mean of the observable if all the other vectors are fixed.

Rather than summing over all possible hidden state patterns to compute the exact expectations, a natural approach is to approximate them through a Monte Carlo method such as Gibbs sampling. The procedure starts with a clamped observable sequence {y} and a random setting of the hidden states {s_{t,j}^{(i)}}. At each time step, each state vector is updated stochastically according to its probability distribution conditioned on the setting of all the other state vectors:

    s_t^{(i)} \sim P(s_t^{(i)} \mid \{y\}, \{ s_\tau^{(j)} : j \neq i \text{ or } \tau \neq t \}, \phi).

These conditional distributions are straightforward to compute and a full pass of Gibbs sampling requires O(NTkd) operations. The first- and second-order statistics needed to estimate \langle s_t^{(i)} \rangle_c, \langle s_t^{(i)} s_t^{(j)'} \rangle_c and \langle s_t^{(i)} s_{t-1}^{(i)'} \rangle_c are collected using the s_{t,j}^{(i)}'s visited and the probabilities estimated during this sampling process.
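The conditional update above lends itself to a short sketch. The following is a minimal illustration of one Gibbs sweep, written for the Gaussian output model of section 2; it is not taken from the memo, and names such as gibbs_sweep, log_P and C_inv are assumptions made for this example.

import numpy as np

def gibbs_sweep(states, Y, W, log_P, log_pi, C_inv, rng):
    """One full Gibbs pass: resample each chain's state at each time step,
    conditioned on the observation and all other currently assigned states.

    states : (T, d) int array of current state indices
    Y      : (T, p) observations; W: list of d (p, k) output weights
    log_P  : list of d (k, k) log transitions, log_P[i][j, l] = log P(s_t = j | s_{t-1} = l)
    log_pi : list of d (k,) log initial priors; C_inv: (p, p) inverse covariance
    """
    T, d = states.shape
    for t in range(T):
        for i in range(d):
            k = W[i].shape[1]
            # mean contribution of all chains except chain i at time t
            other = sum(W[l][:, states[t, l]] for l in range(d) if l != i)
            logp = np.empty(k)
            for j in range(k):
                resid = Y[t] - (other + W[i][:, j])
                logp[j] = -0.5 * resid @ C_inv @ resid           # Gaussian output term
                logp[j] += log_pi[i][j] if t == 0 else log_P[i][j, states[t - 1, i]]
                if t + 1 < T:
                    logp[j] += log_P[i][states[t + 1, i], j]     # link to the next time step
            probs = np.exp(logp - logp.max())
            states[t, i] = rng.choice(k, p=probs / probs.sum())
    return states

The running averages of the sampled states and of these conditional probabilities, needed to estimate the first- and second-order expectations, would be accumulated across sweeps; that bookkeeping is omitted here.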

Mean field approximation

A different approach to computing the expectations in an intractable system is given by mean field theory. A mean field approximation for factorial HMMs can be obtained by defining the energy function

    \tilde{H}(\{s, y\}) = \frac{1}{2} \sum_t [ y_t - \mu_t ]' C^{-1} [ y_t - \mu_t ] - \sum_{t,i} s_t^{(i)'} \log m_t^{(i)},

which results in a completely factorized approximation to the probability density (2):

    \tilde{P}(\{s, y\}) \propto \prod_t \exp\left\{ -\frac{1}{2} [ y_t - \mu_t ]' C^{-1} [ y_t - \mu_t ] \right\} \prod_{t,i,j} ( m_{t,j}^{(i)} )^{s_{t,j}^{(i)}}.   (5)

In this approximation, the observables are independently Gaussian distributed with mean \mu_t and each hidden state vector is multinomially distributed with mean m_t^{(i)}. This approximation is made as tight as possible by choosing the mean field parameters \mu_t and m_t^{(i)} that minimize the Kullback-Leibler divergence

    KL(\tilde{P} \| P) = \langle \log \tilde{P} \rangle_{\tilde{P}} - \langle \log P \rangle_{\tilde{P}},

where \langle \cdot \rangle_{\tilde{P}} denotes expectation over the mean field distribution (5). With the observables clamped, \mu_t can be set equal to the observable y_t. Minimizing KL(\tilde{P} \| P) with respect to the mean field parameters for the states results in a fixed-point equation which can be iterated until convergence:

    m_t^{(i)\,new} = \sigma\left\{ W^{(i)'} C^{-1} [ y_t - \hat{y}_t ] + W^{(i)'} C^{-1} W^{(i)} m_t^{(i)} - \frac{1}{2} \mathrm{diag}\{ W^{(i)'} C^{-1} W^{(i)} \} + A^{(i)} m_{t-1}^{(i)} + A^{(i)'} m_{t+1}^{(i)} \right\},   (6)

where \hat{y}_t \equiv \sum_i W^{(i)} m_t^{(i)} and \sigma\{\cdot\} is the softmax exponential, normalized over each hidden state vector. The first term is the projection of the error in the observable onto the weights of state vector i: the more a hidden unit can reduce this error, the larger its mean field parameter. The next three terms arise from the fact that \langle (s_j^{(i)})^2 \rangle_{\tilde{P}} is equal to m_j^{(i)} and not (m_j^{(i)})^2. The last two terms introduce dependencies forward and backward in time. Each state vector is asynchronously updated using (6), at a time cost of O(NTkd) per iteration. Convergence is diagnosed by monitoring the KL divergence in the mean field distribution between successive time steps; in practice convergence is very rapid (about 2 to 10 iterations of (6)).
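For concreteness, here is a minimal sketch of one asynchronous pass of the fixed-point updates in equation (6), under the reconstruction of that equation given above; it is illustrative rather than the memo's implementation, and names such as mean_field_sweep are assumptions.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mean_field_sweep(m, Y, W, A, C_inv):
    """One pass of the fixed-point updates (6) over all chains and time steps.

    m : list of d arrays, each (T, k)  -- mean field parameters m_t^(i)
    Y : (T, p) observations; W: list of d (p, k) output weights
    A : list of d (k, k) log transition matrices; C_inv: (p, p) inverse covariance
    """
    T, d = Y.shape[0], len(W)
    for t in range(T):
        for i in range(d):
            y_hat = sum(W[l] @ m[l][t] for l in range(d))   # current reconstruction of y_t
            WC = W[i].T @ C_inv                              # (k, p)
            field = WC @ (Y[t] - y_hat) + WC @ W[i] @ m[i][t] \
                    - 0.5 * np.diag(WC @ W[i])
            if t > 0:
                field += A[i] @ m[i][t - 1]                  # dependence backward in time
            if t + 1 < T:
                field += A[i].T @ m[i][t + 1]                # dependence forward in time
            m[i][t] = softmax(field)                         # normalize over the k states of chain i
    return m

In practice one would iterate mean_field_sweep until the KL divergence (or simply the change in the parameters) between successive passes falls below a threshold, matching the convergence check described above.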

3 Empirical Results

We compared three EM algorithms for learning in factorial HMMs, using Gibbs sampling, the mean field approximation, and the exact (exponential) E-step, on the basis of performance and speed on randomly generated problems. Problems were generated from a factorial HMM structure, the parameters of which were sampled from a uniform [0, 1] distribution and appropriately normalized to satisfy the sum-to-one constraints of the transition matrices and priors. Also included in the comparison was a traditional HMM with as many states (k^d) as the factorial HMM.

Table 1 summarizes the results. Even for moderately large state spaces (d >= 3 and k >= 3) the standard HMM with k^d states suffers from severe overfitting. Furthermore, both the standard HMM and the exact E-step factorial HMM are extremely slow on the larger problems. The Gibbs sampling and mean field approximations offer roughly comparable performance at a great increase in speed.

4 Discussion

The basic contribution of this paper is a learning algorithm for hidden Markov models with distributed state representations. The standard Baum-Welch procedure is intractable for such architectures, as the size of the state space generated from the cross product of d k-valued features is O(k^d), and the time complexity of Baum-Welch is quadratic in this size. More importantly, unless special constraints are applied to this cross-product HMM architecture, the number of parameters also grows as O(k^{2d}), which can result in severe overfitting.

The architecture for factorial HMMs presented in this paper did not include any coupling between the underlying Markov chains. It is possible to extend the algorithm presented here to architectures which incorporate such couplings. However, these couplings must be introduced with caution, as they may result either in an exponential growth in parameters or in a loss of the constant partition function property.

The learning algorithm derived in this paper assumed real-valued observables. The algorithm can also be derived for HMMs with discrete observables, an architecture closely related to sigmoid belief networks (Neal, 1992). However, the nonlinearities induced by discrete observables make both the E-step and M-step of the algorithm more difficult.

Table 1: Comparison of factorial HMMs on four problems of varying size. Each block of rows corresponds to one problem size, (d, k) in {(3, 2), (3, 3), (5, 2), (5, 3)}, with one row per algorithm (HMM, Exact, Gibbs, MF); the columns are d, k, Alg, #, Train, Test, Cycles, and Time/Cycle.

Table 1. Data was generated from a factorial HMM with d underlying Markov models of k states each. The training set was 10 sequences of length 20 where the observable was a 4-dimensional vector; the test set was 20 such sequences. HMM indicates a hidden Markov model with k^d states; the other algorithms are factorial HMMs with d underlying k-state models. Gibbs sampling used 10 samples of each state. The algorithms were run until convergence, as monitored by relative change in the likelihood, or a maximum of 100 cycles. The # column indicates the number of runs. The Train and Test columns show the log likelihood plus or minus one standard deviation on the two data sets. The last column indicates approximate time per cycle in seconds on a Silicon Graphics R4400 processor running Matlab.

In conclusion, we have presented Gibbs sampling and mean field learning algorithms for factorial hidden Markov models. Such models incorporate the time series modeling capabilities of hidden Markov models and the advantages of distributed representations for the state space. Future work will concentrate on a more efficient mean field approximation in which the forward-backward algorithm is used to compute the E-step exactly within each Markov chain, and mean field theory is used to handle interactions between chains (Saul and Jordan, 1996).

References

Baldi, P., Chauvin, Y., Hunkapiller, T., and McClure, M. (1994). Hidden Markov models of biological primary sequence information. Proc. Nat. Acad. Sci. (USA), 91(3):1059-1063.

Hinton, G. and Zemel, R. (1994). Autoencoders, minimum description length, and Helmholtz free energy. In Cowan, J., Tesauro, G., and Alspector, J., editors, Advances in Neural Information Processing Systems 6. Morgan Kaufmann Publishers, San Francisco, CA.

Neal, R. (1992). Connectionist learning of belief networks. Artificial Intelligence, 56:71-113.

Rabiner, L. and Juang, B. (1986). An introduction to hidden Markov models. IEEE Acoustics, Speech & Signal Processing Magazine, 3:4-16.

Saul, L. and Jordan, M. (1995). Boltzmann chains and hidden Markov models. In Tesauro, G., Touretzky, D., and Leen, T., editors, Advances in Neural Information Processing Systems 7. MIT Press, Cambridge, MA.

Saul, L. and Jordan, M. (1996). Exploiting tractable substructures in intractable networks. In Touretzky, D., Mozer, M., and Hasselmo, M., editors, Advances in Neural Information Processing Systems 8. MIT Press.

Williams, C. and Hinton, G. (1991). Mean field networks that learn to discriminate temporally distorted strings. In Touretzky, D., Elman, J., Sejnowski, T., and Hinton, G., editors, Connectionist Models: Proceedings of the 1990 Summer School, pages 18-22. Morgan Kaufmann Publishers, San Mateo, CA.
