
Machine Learning (1997). (c) 1997 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

Factorial Hidden Markov Models

ZOUBIN GHAHRAMANI    zoubin@cs.toronto.edu
Department of Computer Science, University of Toronto, Toronto, ON M5S 3H5, Canada

MICHAEL I. JORDAN    jordan@psyche.mit.edu
Department of Brain & Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139, USA

Editor: Padhraic Smyth

Abstract. Hidden Markov models (HMMs) have proven to be one of the most widely used tools for learning probabilistic models of time series data. In an HMM, information about the past is conveyed through a single discrete variable, the hidden state. We discuss a generalization of HMMs in which this state is factored into multiple state variables and is therefore represented in a distributed manner. We describe an exact algorithm for inferring the posterior probabilities of the hidden state variables given the observations, and relate it to the forward-backward algorithm for HMMs and to algorithms for more general graphical models. Due to the combinatorial nature of the hidden state representation, this exact algorithm is intractable. As in other intractable systems, approximate inference can be carried out using Gibbs sampling or variational methods. Within the variational framework, we present a structured approximation in which the state variables are decoupled, yielding a tractable algorithm for learning the parameters of the model. Empirical comparisons suggest that these approximations are efficient and provide accurate alternatives to the exact methods. Finally, we use the structured approximation to model Bach's chorales and show that factorial HMMs can capture statistical structure in this data set which an unconstrained HMM cannot.

Keywords: Hidden Markov models, time series, EM algorithm, graphical models, Bayesian networks, mean field theory.

1. Introduction

Due to its flexibility and to the simplicity and efficiency of its parameter estimation algorithm, the hidden Markov model (HMM) has emerged as one of the basic statistical tools for modeling discrete time series, finding widespread application in the areas of speech recognition (Rabiner & Juang, 1986) and computational molecular biology (Krogh, Brown, Mian, Sjolander, & Haussler, 1994). An HMM is essentially a mixture model, encoding information about the history of a time series in the value of a single multinomial variable (the hidden state), which can take on one of K discrete values. This multinomial assumption supports an efficient parameter estimation algorithm, the Baum-Welch algorithm, which considers each of the K settings of the hidden state at each time step. However, the multinomial assumption also severely limits the representational capacity of HMMs. For example, to represent 30 bits of information about the history of a time sequence, an HMM would need K = 2^30 distinct states. On the other hand, an HMM with a distributed state representation could achieve the same task with 30 binary state variables (Williams & Hinton, 1991).

This paper addresses the problem of constructing efficient learning algorithms for hidden Markov models with distributed state representations.

The need for distributed state representations in HMMs can be motivated in two ways. First, such representations let the model automatically decompose the state space into features that decouple the dynamics of the process that generated the data. Second, distributed state representations simplify the task of modeling time series that are known a priori to be generated from an interaction of multiple, loosely-coupled processes. For example, a speech signal generated by the superposition of multiple simultaneous speakers can potentially be modeled with such an architecture.

Williams and Hinton (1991) first formulated the problem of learning in HMMs with distributed state representations and proposed a solution based on deterministic Boltzmann learning. The approach presented in this paper is similar to Williams and Hinton's in that it can also be viewed from the framework of statistical mechanics and mean field theory. However, our learning algorithm is quite different in that it makes use of the special structure of HMMs with a distributed state representation, resulting in a significantly more efficient learning procedure. Anticipating the results in Section 3, this learning algorithm obviates the need for the two-phase procedure of Boltzmann machines, has an exact M step, and makes use of the forward-backward algorithm from classical HMMs as a subroutine. A different approach comes from Saul and Jordan (1995), who derived a set of rules for computing the gradients required for learning in HMMs with distributed state spaces. However, their methods can only be applied to a limited class of architectures.

Hidden Markov models with distributed state representations are a particular class of probabilistic graphical model (Pearl, 1988; Lauritzen & Spiegelhalter, 1988), which represent probability distributions as graphs in which the nodes correspond to random variables and the links represent conditional independence relations. The relation between hidden Markov models and graphical models has recently been reviewed in Smyth, Heckerman and Jordan (1997). Although exact probability propagation algorithms exist for general graphical models (Jensen, Lauritzen, & Olesen, 1990), these algorithms are intractable for densely-connected models such as the ones we consider in this paper. One approach to dealing with this issue is to utilize stochastic sampling methods (Kanazawa et al., 1995). Another approach, which provides the basis for algorithms described in the current paper, is to make use of variational methods (cf. Saul, Jaakkola, & Jordan, 1996).

In the following section we define the probabilistic model for factorial HMMs and in Section 3 we present algorithms for inference and learning. In Section 4 we describe empirical results comparing exact and approximate algorithms for learning on the basis of time complexity and model quality. We also apply factorial HMMs to a real time series data set consisting of the melody lines from a collection of chorales by J. S. Bach. We discuss several generalizations of the probabilistic model in Section 5, and we conclude in Section 6. Where necessary, details of derivations are provided in the appendixes.

Figure 1. (a) A directed acyclic graph (DAG) specifying conditional independence relations for a hidden Markov model. Each node is conditionally independent from its non-descendants given its parents. (b) A DAG representing the conditional independence relations in a factorial HMM with M = 3 underlying Markov chains.

2. The probabilistic model

We begin by describing the hidden Markov model, in which a sequence of observations {Y_t}, where t = 1, ..., T, is modeled by specifying a probabilistic relation between the observations and a sequence of hidden states {S_t}, and a Markov transition structure linking the hidden states. The model assumes two sets of conditional independence relations: that Y_t is independent of all other observations and states given S_t, and that S_t is independent of S_1, ..., S_{t-2} given S_{t-1} (the first-order Markov property). Using these independence relations, the joint probability for the sequence of states and observations can be factored as

    P(\{S_t, Y_t\}) = P(S_1) P(Y_1 \mid S_1) \prod_{t=2}^{T} P(S_t \mid S_{t-1}) P(Y_t \mid S_t).    (1)

The conditional independencies specified by equation (1) can be expressed graphically in the form of Figure 1(a). The state is a single multinomial random variable that can take one of K discrete values, S_t \in \{1, ..., K\}. The state transition probabilities, P(S_t \mid S_{t-1}), are specified by a K x K transition matrix. If the observations are discrete symbols taking on one of D values, the observation probabilities P(Y_t \mid S_t) can be fully specified as a K x D observation matrix. For a continuous observation vector, P(Y_t \mid S_t) can be modeled in many different forms, such as a Gaussian, a mixture of Gaussians, or even a neural network.

In the present paper, we generalize the HMM state representation by letting the state be represented by a collection of state variables

    S_t = ( S_t^{(1)}, ..., S_t^{(m)}, ..., S_t^{(M)} ),    (2)

each of which can take on K^{(m)} values. We refer to these models as factorial hidden Markov models, as the state space consists of the cross product of these state variables. For simplicity, we will assume that K^{(m)} = K, for all m, although all the results we present can be trivially generalized to the case of differing K^{(m)}. Given that the state space of this factorial HMM consists of all K^M combinations of the S_t^{(m)} variables, placing no constraints on the state transition structure would result in a K^M x K^M transition matrix. Such an unconstrained system is uninteresting for several reasons: it is equivalent to an HMM with K^M states; it is unlikely to discover any interesting structure in the K state variables, as all variables are allowed to interact arbitrarily; and both the time complexity and sample complexity of the estimation algorithm are exponential in M. We therefore focus on factorial HMMs in which the underlying state transitions are constrained. A natural structure to consider is one in which each state variable evolves according to its own dynamics, and is a priori uncoupled from the other state variables:

    P(S_t \mid S_{t-1}) = \prod_{m=1}^{M} P(S_t^{(m)} \mid S_{t-1}^{(m)}).    (3)

A graphical representation for this model is presented in Figure 1(b). The transition structure for this system can be represented as M distinct K x K matrices. Generalizations that allow coupling between the state variables are briefly discussed in Section 5.

As shown in Figure 1(b), in a factorial HMM the observation at time step t can depend on all the state variables at that time step. For continuous observations, one simple form for this dependence is linear Gaussian; that is, the observation Y_t is a Gaussian random vector whose mean is a linear function of the state variables. We represent the state variables as K x 1 vectors, where each of the K discrete values corresponds to a 1 in one position and 0 elsewhere. The resulting probability density for a D x 1 observation vector Y_t is

    P(Y_t \mid S_t) = |C|^{-1/2} (2\pi)^{-D/2} \exp\left\{ -\tfrac{1}{2} (Y_t - \mu_t)' C^{-1} (Y_t - \mu_t) \right\},    (4a)

where

    \mu_t = \sum_{m=1}^{M} W^{(m)} S_t^{(m)}.    (4b)

Each W^{(m)} matrix is a D x K matrix whose columns are the contributions to the means for each of the settings of S_t^{(m)}, C is the D x D covariance matrix, ' denotes matrix transpose, and | . | is the matrix determinant operator.
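To make the generative process in equations (3)-(4b) concrete, the following sketch samples a state and observation sequence from a factorial HMM with the linear-Gaussian output model. It is an illustrative implementation rather than code from the paper; the argument names and the transition-matrix indexing convention (column j is the previous state) are assumptions chosen to mirror the notation above.

```python
import numpy as np

def sample_fhmm(priors, transitions, W, C, T, rng=np.random.default_rng(0)):
    """Sample {S_t, Y_t} from a factorial HMM with M chains of K states each.

    priors:      list of M arrays of shape (K,),   pi^(m)
    transitions: list of M arrays of shape (K, K), transitions[m][k, j] = P(S_t = k | S_{t-1} = j)
    W:           list of M arrays of shape (D, K), output weight matrices
    C:           (D, D) output covariance matrix
    """
    M, D, K = len(W), W[0].shape[0], W[0].shape[1]
    states = np.zeros((T, M), dtype=int)
    Y = np.zeros((T, D))
    for t in range(T):
        for m in range(M):
            p = priors[m] if t == 0 else transitions[m][:, states[t - 1, m]]
            states[t, m] = rng.choice(K, p=p)
        # mean is the sum of one column chosen from each W^(m), equation (4b)
        mu = sum(W[m][:, states[t, m]] for m in range(M))
        Y[t] = rng.multivariate_normal(mu, C)
    return states, Y
```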

One way to understand the observation model in equations (4a) and (4b) is to consider the marginal distribution for Y_t, obtained by summing over the possible states. There are K settings for each of the M state variables, and thus there are K^M possible mean vectors obtained by forming sums of M columns where one column is chosen from each of the W^{(m)} matrices. The resulting marginal density of Y_t is thus a Gaussian mixture model, with K^M Gaussian mixture components each having a constant covariance matrix C. This static mixture model, without inclusion of the time index and the Markov dynamics, is a factorial parameterization of the standard mixture of Gaussians model that has interest in its own right (Zemel, 1993; Hinton & Zemel, 1994; Ghahramani, 1995). The model we are considering in the current paper extends this model by allowing Markov dynamics in the discrete state variables underlying the mixture. Unless otherwise stated, we will assume the Gaussian observation model throughout the paper.

The hidden state variables at one time step, although marginally independent, become conditionally dependent given the observation sequence. This can be determined by applying the semantics of directed graphs, in particular the d-separation criterion (Pearl, 1988), to the graphical model in Figure 1(b). Consider the Gaussian model in equations (4a)-(4b). Given an observation vector Y_t, the posterior probability of each of the settings of the hidden state variables is proportional to the probability of Y_t under a Gaussian with mean \mu_t. Since \mu_t is a function of all the state variables, the probability of a setting of one of the state variables will depend on the setting of the other state variables. This dependency effectively couples all of the hidden state variables for the purposes of calculating posterior probabilities and makes exact inference intractable for the factorial HMM.

3. Inference and learning

The inference problem in a probabilistic graphical model consists of computing the probabilities of the hidden variables given the observations. In the context of speech recognition, for example, the observations may be acoustic vectors and the goal of inference may be to compute the probability for a particular word or sequence of phonemes (the hidden state). This problem can be solved efficiently via the forward-backward algorithm (Rabiner & Juang, 1986), which can be shown to be a special case of the Jensen, Lauritzen, and Olesen (1990) algorithm for probability propagation in more general graphical models (Smyth et al., 1997). In some cases, rather than a probability distribution over hidden states it is desirable to infer the single most probable hidden state sequence. This can be achieved via the Viterbi (1967) algorithm, a form of dynamic programming that is very closely related to the forward-backward algorithm and also has analogues in the graphical model literature (Dawid, 1992).

The learning problem for probabilistic models consists of two components: learning the structure of the model and learning its parameters. Structure learning is a topic of current research in both the graphical model and machine learning communities (e.g. Heckerman, 1995; Stolcke & Omohundro, 1993). In the current paper we deal exclusively with the problem of learning the parameters for a given structure.
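For reference, a minimal sketch of the scaled forward-backward recursions for an ordinary HMM (the algorithm referred to above) is given below; the algorithms in the following sections either call such a routine as a subroutine (Section 3.5) or generalize it. The scaling scheme and variable names are one common choice, not necessarily those of the papers cited, and the transition-matrix indexing matches the sampling sketch above.

```python
import numpy as np

def forward_backward(pi, P, B):
    """Posterior state marginals for a single HMM.

    pi: (K,) initial state distribution
    P:  (K, K) transition matrix, P[k, j] = P(S_t = k | S_{t-1} = j)
    B:  (T, K) observation likelihoods, B[t, k] = P(Y_t | S_t = k)
    Returns gamma of shape (T, K) with gamma[t, k] = P(S_t = k | Y_1..Y_T).
    """
    T, K = B.shape
    alpha = np.zeros((T, K))
    beta = np.ones((T, K))
    scale = np.zeros(T)

    alpha[0] = pi * B[0]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):                       # forward pass with rescaling
        alpha[t] = B[t] * (P @ alpha[t - 1])
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]
    for t in range(T - 2, -1, -1):              # backward pass, reusing the scales
        beta[t] = P.T @ (B[t + 1] * beta[t + 1]) / scale[t + 1]

    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma
```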

3.1. The EM algorithm

The parameters of a factorial HMM can be estimated via the expectation maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977), which in the case of classical HMMs is known as the Baum-Welch algorithm (Baum, Petrie, Soules, & Weiss, 1970). This procedure iterates between a step that fixes the current parameters and computes posterior probabilities over the hidden states (the E step) and a step that uses these probabilities to maximize the expected log likelihood of the observations as a function of the parameters (the M step). Since the E step of EM is exactly the inference problem as described above, we subsume the discussion of both inference and learning problems into our description of the EM algorithm for factorial HMMs.

The EM algorithm follows from the definition of the expected log likelihood of the complete (observed and hidden) data:

    Q(\phi^{new} \mid \phi) = E\left[ \log P(\{S_t, Y_t\} \mid \phi^{new}) \,\middle|\, \phi, \{Y_t\} \right],    (5)

where Q is a function of the parameters \phi^{new}, given the current parameter estimate \phi and the observation sequence {Y_t}. For the factorial HMM the parameters of the model are \phi = \{W^{(m)}, \pi^{(m)}, P^{(m)}, C\}, where \pi^{(m)} = P(S_1^{(m)}) and P^{(m)} = P(S_t^{(m)} \mid S_{t-1}^{(m)}).

The E step consists of computing Q. By expanding (5) using equations (1)-(4b), we find that Q can be expressed as a function of three types of expectations over the hidden state variables: \langle S_t^{(m)} \rangle, \langle S_t^{(m)} S_t^{(n)'} \rangle, and \langle S_{t-1}^{(m)} S_t^{(m)'} \rangle, where \langle \cdot \rangle has been used to abbreviate E\{\cdot \mid \phi, \{Y_t\}\}. In the HMM notation of Rabiner and Juang (1986), \langle S_t \rangle corresponds to \gamma_t, the vector of state occupation probabilities, \langle S_{t-1} S_t' \rangle corresponds to \xi_t, the K x K matrix of state occupation probabilities at two consecutive time steps, and \langle S_t^{(m)} S_t^{(n)'} \rangle has no analogue when there is only a single underlying Markov model.

The M step uses these expectations to maximize Q as a function of \phi^{new}. Using Jensen's inequality, Baum, Petrie, Soules & Weiss (1970) showed that each iteration of the E and M steps increases the likelihood, P(\{Y_t\} \mid \phi), until convergence to a (local) optimum. As in hidden Markov models, the exact M step for factorial HMMs is simple and tractable. In particular, the M step for the parameters of the output model described in equations (4a)-(4b) can be found by solving a weighted linear regression problem. Similarly, the M steps for the priors, \pi^{(m)}, and state transition matrices, P^{(m)}, are identical to the ones used in the Baum-Welch algorithm. The details of the M step are given in Appendix A. We now turn to the substantially more difficult problem of computing the expectations required for the E step.
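To illustrate why this M step stays tractable, the sketch below solves the weighted linear regression for the output parameters from the three expectations named above, using a concatenated state vector. It is a plausible reading of the standard least-squares solution under the stated assumptions; the paper's exact expressions are in Appendix A, which is not included in this excerpt, so treat the precise normalization as an assumption.

```python
import numpy as np

def m_step_output(Y, ES, ESS):
    """Regression-flavored M step for the output weights and covariance.

    Y:   (T, D) observations
    ES:  (T, M*K) concatenated expectations <S_t^(m)> for all M chains
    ESS: (T, M*K, M*K) expectations <S_t S_t'> over the concatenated state
    Returns W_all of shape (D, M*K) (the M weight matrices side by side)
    and C of shape (D, D); a sketch, not a verbatim reproduction of Appendix A.
    """
    T, D = Y.shape
    A = ESS.sum(axis=0)                       # sum_t <S_t S_t'>
    Bmat = Y.T @ ES                           # sum_t Y_t <S_t>'
    W_all = Bmat @ np.linalg.pinv(A)          # weighted linear regression solution
    C = (Y.T @ Y - W_all @ Bmat.T) / T        # residual covariance of the fit
    return W_all, C
```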

3.2. Exact inference

Unfortunately, the exact E step for factorial HMMs is computationally intractable. This fact can best be shown by making reference to standard algorithms for probabilistic inference in graphical models (Lauritzen & Spiegelhalter, 1988), although it can also be derived readily from direct application of Bayes rule. Consider the computations that are required for calculating posterior probabilities for the factorial HMM shown in Figure 1(b) within the framework of the Lauritzen and Spiegelhalter algorithm. Moralizing and triangulating the graphical structure for the factorial HMM results in a junction tree (in fact a chain) with T(M + 1) - M cliques of size M + 1. The resulting probability propagation algorithm has time complexity O(T M K^{M+1}) for a single observation sequence of length T. We present a forward-backward type recursion that implements the exact E step in Appendix B. The naive exact algorithm, which consists of translating the factorial HMM into an equivalent HMM with K^M states and using the forward-backward algorithm, has time complexity O(T K^{2M}). Like other models with multiple densely-connected hidden variables, this exponential time complexity makes exact learning and inference intractable.

Thus, although the Markov property can be used to obtain forward-backward-like factorizations of the expectations across time steps, the sum over all possible configurations of the other hidden state variables within each time step is unavoidable. This intractability is due inherently to the cooperative nature of the model: for the Gaussian output model, for example, the settings of all the state variables at one time step cooperate in determining the mean of the observation vector.

3.3. Inference using Gibbs sampling

Rather than computing the exact posterior probabilities, one can approximate them using a Monte Carlo sampling procedure, and thereby avoid the sum over exponentially many state patterns at some cost in accuracy. Although there are many possible sampling schemes (for a review see Neal, 1993), here we present one of the simplest: Gibbs sampling (Geman & Geman, 1984). For a given observation sequence {Y_t}, this procedure starts with a random setting of the hidden states {S_t}. At each step of the sampling process, each state vector is updated stochastically according to its probability distribution conditioned on the setting of all the other state vectors. The graphical model is again useful here, as each node is conditionally independent of all other nodes given its Markov blanket, defined as the set of children, parents, and parents of the children of a node. To sample from a typical state variable S_t^{(m)} we only need to examine the states of a few neighboring nodes: S_t^{(m)} is sampled from

    P(S_t^{(m)} \mid \{S_t^{(n)} : n \neq m\}, S_{t-1}^{(m)}, S_{t+1}^{(m)}, Y_t)
        \propto P(S_t^{(m)} \mid S_{t-1}^{(m)}) \, P(S_{t+1}^{(m)} \mid S_t^{(m)}) \, P(Y_t \mid S_t^{(1)}, \ldots, S_t^{(m)}, \ldots, S_t^{(M)}).

Sampling once from each of the T M hidden variables in the model results in a new sample of the hidden state of the model and requires O(T M K) operations. The sequence of overall states resulting from each pass of Gibbs sampling defines a Markov chain over the state space of the model. Assuming that all probabilities are bounded away from zero, this Markov chain is guaranteed to converge to the posterior probabilities of the states given the observations (Geman & Geman, 1984). Thus, after some suitable time, samples from the Markov chain can be taken as approximate samples from the posterior probabilities. The first and second-order statistics needed to estimate \langle S_t^{(m)} \rangle, \langle S_t^{(m)} S_t^{(n)'} \rangle and \langle S_{t-1}^{(m)} S_t^{(m)'} \rangle are collected using the states visited and the probabilities estimated during this sampling process, and are used in the approximate E step of EM.
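A single Gibbs sweep over the hidden variables might look as follows. This is a sketch under the conventions of the earlier sampling code (same parameter layout, Gaussian output model); the function and variable names are assumptions for illustration, and the conditional combines exactly the three factors given above.

```python
import numpy as np

def gibbs_sweep(states, Y, priors, transitions, W, Cinv, rng):
    """One pass of Gibbs sampling over all T*M hidden state variables.

    states: (T, M) current state indices, updated in place.
    Cinv:   (D, D) inverse of the output covariance C.
    The conditional for S_t^(m) combines its incoming and outgoing transition
    terms with the Gaussian likelihood of Y_t, all other chains held fixed.
    """
    T, M = states.shape
    K = W[0].shape[1]
    for t in range(T):
        for m in range(M):
            # contribution of the other chains to the observation mean
            mu_rest = sum(W[l][:, states[t, l]] for l in range(M) if l != m)
            logp = np.zeros(K)
            for k in range(K):
                logp[k] = np.log(priors[m][k]) if t == 0 else \
                          np.log(transitions[m][k, states[t - 1, m]])
                if t + 1 < T:
                    logp[k] += np.log(transitions[m][states[t + 1, m], k])
                diff = Y[t] - (mu_rest + W[m][:, k])
                logp[k] += -0.5 * diff @ Cinv @ diff
            p = np.exp(logp - logp.max())
            states[t, m] = rng.choice(K, p=p / p.sum())
    return states
```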

3.4. Completely factorized variational inference

There also exists a second approximation of the posterior probability of the hidden states that is both tractable and deterministic. The basic idea is to approximate the posterior distribution over the hidden variables P(\{S_t\} \mid \{Y_t\}) by a tractable distribution Q(\{S_t\}). This approximation provides a lower bound on the log likelihood that can be used to obtain an efficient learning algorithm. The argument can be formalized following the reasoning of Saul, Jaakkola, and Jordan (1996). Any distribution over the hidden variables Q(\{S_t\}) can be used to define a lower bound on the log likelihood:

    \log P(\{Y_t\}) = \log \sum_{\{S_t\}} P(\{S_t, Y_t\})
                   = \log \sum_{\{S_t\}} Q(\{S_t\}) \frac{P(\{S_t, Y_t\})}{Q(\{S_t\})}
                   \geq \sum_{\{S_t\}} Q(\{S_t\}) \log \frac{P(\{S_t, Y_t\})}{Q(\{S_t\})},

where we have made use of Jensen's inequality in the last step. The difference between the left-hand side and the right-hand side of this inequality is given by the Kullback-Leibler divergence (Cover & Thomas, 1991):

    KL(Q \| P) = \sum_{\{S_t\}} Q(\{S_t\}) \log \frac{Q(\{S_t\})}{P(\{S_t\} \mid \{Y_t\})}.    (6)

The complexity of exact inference in the approximation given by Q is determined by its conditional independence relations, not by its parameters. Thus, we can choose Q to have a tractable structure: a graphical representation that eliminates some of the dependencies in P. Given this structure, we are free to vary the parameters of Q so as to obtain the tightest possible bound by minimizing (6). We will refer to the general strategy of using a parameterized approximating distribution as a variational approximation and refer to the free parameters of the distribution as variational parameters.

Figure 2. (a) The completely factorized variational approximation assumes that all the state variables are independent (conditional on the observation sequence). (b) The structured variational approximation assumes that the state variables retain their Markov structure within each chain, but are independent across chains.

To illustrate, consider the simplest variational approximation, in which the state variables are assumed independent given the observations (Figure 2(a)). This distribution can be written as

    Q(\{S_t\} \mid \theta) = \prod_{t=1}^{T} \prod_{m=1}^{M} Q(S_t^{(m)} \mid \theta_t^{(m)}).    (7)

The variational parameters, \theta = \{\theta_t^{(m)}\}, are the means of the state variables, where, as before, a state variable S_t^{(m)} is represented as a K-dimensional vector with a 1 in the k-th position and 0 elsewhere, if the m-th Markov chain is in state k at time t. The elements of the vector \theta_t^{(m)} therefore define the state occupation probabilities for the multinomial variable S_t^{(m)} under the distribution Q:

    Q(S_t^{(m)} \mid \theta_t^{(m)}) = \prod_{k=1}^{K} \left( \theta_{t,k}^{(m)} \right)^{S_{t,k}^{(m)}},
    where  S_{t,k}^{(m)} \in \{0, 1\}  and  \sum_{k=1}^{K} S_{t,k}^{(m)} = 1.    (8)

This completely factorized approximation is often used in statistical physics, where it provides the basis for simple yet powerful mean field approximations to statistical mechanical systems (Parisi, 1988).

To make the bound as tight as possible we vary \theta separately for each observation sequence so as to minimize the KL divergence. Taking the derivatives of (6) with respect to \theta_t^{(m)} and setting them to zero, we obtain the set of fixed point equations (see Appendix C) defined by

    \theta_t^{(m), new} = \varphi\left\{ W^{(m)'} C^{-1} \tilde{Y}_t^{(m)} - \tfrac{1}{2} \Delta^{(m)} + (\log P^{(m)}) \theta_{t-1}^{(m)} + (\log P^{(m)})' \theta_{t+1}^{(m)} \right\},    (9a)

where \tilde{Y}_t^{(m)} is the residual error in Y_t given the predictions from all the state variables not including m:

    \tilde{Y}_t^{(m)} \equiv Y_t - \sum_{\ell \neq m} W^{(\ell)} \theta_t^{(\ell)},    (9b)

\Delta^{(m)} is the vector of diagonal elements of W^{(m)'} C^{-1} W^{(m)}, and \varphi\{\cdot\} is the softmax operator, which maps a vector A into a vector B of the same size, with elements

    B_i = \frac{\exp\{A_i\}}{\sum_j \exp\{A_j\}},    (10)

and \log P^{(m)} denotes the elementwise logarithm of the transition matrix P^{(m)}.

The first term of (9a) is the projection of the error in reconstructing the observation onto the weights of state vector m: the more a particular setting of a state vector can reduce this error, the larger its associated variational parameter. The second term arises from the fact that the second order correlation \langle S_t^{(m)} S_t^{(m)'} \rangle evaluated under the variational distribution is a diagonal matrix composed of the elements of \theta_t^{(m)}. The last two terms introduce dependencies forward and backward in time. Therefore, although the posterior distribution over the hidden variables is approximated with a completely factorized distribution, the fixed point equations couple the parameters associated with each node with the parameters of its Markov blanket. In this sense, the fixed point equations propagate information along the same pathways as those defining the exact algorithms for probability propagation.

The following may provide an intuitive interpretation of the approximation being made by this distribution. Given a particular observation sequence, the hidden state variables for the M Markov chains at time step t are stochastically coupled. This stochastic coupling is approximated by a system in which the hidden variables are uncorrelated but have coupled means. The variational or "mean-field" equations solve for the deterministic coupling of the means that best approximates the stochastically coupled system.

Each hidden state vector is updated in turn using (9a), with a time complexity of O(T M K^2) per iteration. Convergence is determined by monitoring the KL divergence in the variational distribution between successive time steps; in practice convergence is very rapid (about 2 to 10 iterations of (9a)). Once the fixed point equations have converged, the expectations required for the E step can be obtained as a simple function of the parameters (equations (C.6)-(C.8) in Appendix C).
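One sweep of the completely factorized fixed point equations (9a)-(9b) can be written compactly as below. This sketch follows the update order and the softmax of equation (10); how the boundary terms at t = 1 and t = T are handled (using the log prior, and dropping the missing neighbor) is an assumption about details that the paper defers to Appendix C.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def mean_field_sweep(theta, Y, W, Cinv, logP, logpi):
    """One update sweep of the fixed point equations (9a)-(9b).

    theta: (T, M, K) variational means, updated in place.
    W:     list of M (D, K) weight matrices; Cinv: (D, D) inverse covariance.
    logP:  list of M (K, K) elementwise-log transition matrices,
           logP[m][k, j] = log P(S_t = k | S_{t-1} = j).
    logpi: list of M (K,) log priors, used at the t = 1 boundary (an assumption).
    """
    T, M, K = theta.shape
    delta = [np.diag(W[m].T @ Cinv @ W[m]) for m in range(M)]
    for t in range(T):
        for m in range(M):
            # residual error: observation minus the other chains' predictions (9b)
            Y_res = Y[t] - sum(W[l] @ theta[t, l] for l in range(M) if l != m)
            a = W[m].T @ Cinv @ Y_res - 0.5 * delta[m]
            # backward and forward coupling terms from (9a)
            a = a + (logP[m] @ theta[t - 1, m] if t > 0 else logpi[m])
            if t + 1 < T:
                a = a + logP[m].T @ theta[t + 1, m]
            theta[t, m] = softmax(a)
    return theta
```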

3.5. Structured variational inference

The approximation presented in the previous section factors the posterior probability such that all the state variables are statistically independent. In contrast to this rather extreme factorization, there exists a third approximation that is both tractable and preserves much of the probabilistic structure of the original system. In this scheme, the factorial HMM is approximated by M uncoupled HMMs as shown in Figure 2(b). Within each HMM, efficient and exact inference is implemented via the forward-backward algorithm. The approach of exploiting such tractable substructures was first suggested in the machine learning literature by Saul and Jordan (1996).

Note that the arguments presented in the previous section did not hinge on the form of the approximating distribution. Therefore, more structured variational approximations can be obtained by using more structured variational distributions Q. Each such Q provides a lower bound on the log likelihood and can be used to obtain a learning algorithm. We write the structured variational approximation as

    Q(\{S_t\} \mid \theta) = \frac{1}{Z_Q} \prod_{m=1}^{M} Q(S_1^{(m)} \mid \theta) \prod_{t=2}^{T} Q(S_t^{(m)} \mid S_{t-1}^{(m)}, \theta),    (11a)

where Z_Q is a normalization constant ensuring that Q integrates to one and

    Q(S_1^{(m)} \mid \theta) = \prod_{k=1}^{K} \left( h_{1,k}^{(m)} \, \pi_k^{(m)} \right)^{S_{1,k}^{(m)}},    (11b)

    Q(S_t^{(m)} \mid S_{t-1}^{(m)}, \theta) = \prod_{k=1}^{K} \left( h_{t,k}^{(m)} \sum_{j=1}^{K} P_{k,j}^{(m)} S_{t-1,j}^{(m)} \right)^{S_{t,k}^{(m)}}
        = \prod_{k=1}^{K} \left( h_{t,k}^{(m)} \prod_{j=1}^{K} \left( P_{k,j}^{(m)} \right)^{S_{t-1,j}^{(m)}} \right)^{S_{t,k}^{(m)}},    (11c)

where the last equality follows from the fact that S_{t-1}^{(m)} is a vector with a 1 in one position and 0 elsewhere. The parameters of this distribution are \theta = \{\pi^{(m)}, P^{(m)}, h_t^{(m)}\}: the original priors and state transition matrices of the factorial HMM and a time-varying bias for each state variable. Comparing equations (11a)-(11c) to equation (1), we can see that the K x 1 vector h_t^{(m)} plays the role of the probability of an observation (P(Y_t \mid S_t) in (1)) for each of the K settings of S_t^{(m)}. For example, Q(S_{1,j}^{(m)} = 1 \mid \theta) = h_{1,j}^{(m)} P(S_{1,j}^{(m)} = 1) corresponds to having an observation at time t = 1 that under state S_{1,j}^{(m)} = 1 has probability h_{1,j}^{(m)}.

Intuitively, this approximation uncouples the M Markov chains and attaches to each state variable a distinct fictitious observation. The probability of this fictitious observation can be varied so as to minimize the KL divergence between Q and P. Applying the same arguments as before, we obtain a set of fixed point equations for h_t^{(m)} that minimize KL(Q \| P), as detailed in Appendix D:

    h_t^{(m), new} = \exp\left\{ W^{(m)'} C^{-1} \tilde{Y}_t^{(m)} - \tfrac{1}{2} \Delta^{(m)} \right\},    (12a)

where \Delta^{(m)} is defined as before, and where we redefine the residual error to be

    \tilde{Y}_t^{(m)} \equiv Y_t - \sum_{\ell \neq m} W^{(\ell)} \langle S_t^{(\ell)} \rangle.    (12b)

The parameter h_t^{(m)} obtained from these fixed point equations is the observation probability associated with state variable S_t^{(m)} in hidden Markov model m. Using these probabilities, the forward-backward algorithm is used to compute a new set of expectations for \langle S_t^{(m)} \rangle, which are fed back into (12a) and (12b). This algorithm is therefore used as a subroutine in the minimization of the KL divergence. Note the similarity between equations (12a)-(12b) and equations (9a)-(9b) for the completely factorized system. In the completely factorized system, since \langle S_t^{(m)} \rangle = \theta_t^{(m)}, the fixed point equations can be written explicitly in terms of the variational parameters. In the structured approximation, the dependence of \langle S_t^{(m)} \rangle on h_t^{(m)} is computed via the forward-backward algorithm. Note also that (12a) does not contain terms involving the prior, \pi^{(m)}, or transition matrix, P^{(m)}. These terms have cancelled by our choice of approximation.
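Putting the pieces together, the structured E step alternates between the bias update (12a)-(12b) and exact per-chain forward-backward passes. The sketch below assumes the forward_backward routine given earlier (same parameter conventions) and replaces the KL-based convergence check with a simple iteration cap; both are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def structured_e_step(Y, priors, transitions, W, Cinv, n_sweeps=10):
    """Structured variational E step: iterate (12a)-(12b) with forward-backward.

    Returns ES with ES[t, m, k] approximating <S_t^(m)_k> under Q.
    Uses forward_backward(pi, P, B) sketched earlier, where B[t, k] is the
    fictitious observation probability h_t^(m)_k for chain m.
    """
    T, D = Y.shape
    M, K = len(W), W[0].shape[1]
    ES = np.full((T, M, K), 1.0 / K)                       # uniform initial expectations
    delta = [np.diag(W[m].T @ Cinv @ W[m]) for m in range(M)]
    for _ in range(n_sweeps):
        for m in range(M):
            # residual errors and fictitious observation probabilities, (12a)-(12b)
            Y_res = Y - sum(ES[:, l, :] @ W[l].T for l in range(M) if l != m)
            h = np.exp(Y_res @ Cinv @ W[m] - 0.5 * delta[m])   # shape (T, K)
            # exact inference in chain m given its fictitious observations
            ES[:, m, :] = forward_backward(priors[m], transitions[m], h)
    return ES
```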

3.6. Choice of approximation

The theory of the EM algorithm as presented in Dempster et al. (1977) assumes the use of an exact E step. For models in which the exact E step is intractable, one must instead use an approximation like those we have just described. The choice among these approximations must take into account several theoretical and practical issues.

Monte Carlo approximations based on Markov chains, such as Gibbs sampling, offer the theoretical assurance that the sampling procedure will converge to the correct posterior distribution in the limit. Although this means that one can come arbitrarily close to the exact E step, in practice convergence can be slow (especially for multimodal distributions) and it is often very difficult to determine how close one is to convergence. However, when sampling is used for the E step of EM, there is a time tradeoff between the number of samples used and the number of EM iterations. It seems wasteful to wait until convergence early on in learning, when the posterior distribution from which samples are drawn is far from the posterior given the optimal parameters. In practice we have found that even approximate E steps using very few Gibbs samples (e.g. around ten samples of each hidden variable) tend to increase the true likelihood.

Variational approximations offer the theoretical assurance that a lower bound on the likelihood is being maximized. Both the minimization of the KL divergence in the E step and the parameter update in the M step are guaranteed not to decrease this lower bound, and therefore convergence can be defined in terms of the bound. An alternative view given by Neal and Hinton (1993) describes EM in terms of the negative free energy, F, which is a function of the parameters, \phi, the observations, Y, and a posterior probability distribution over the hidden variables, Q(S):

    F(Q, \phi) = E_Q\{\log P(Y, S \mid \phi)\} - E_Q\{\log Q(S)\},

where E_Q denotes expectation over S using the distribution Q(S). The exact E step in EM maximizes F with respect to Q given \phi. The variational E steps used here maximize F with respect to Q given \phi, subject to the constraint that Q is of a particular tractable form. Given this view, it seems clear that the structured approximation is preferable to the completely factorized approximation since it places fewer constraints on Q, at no cost in tractability.

4. Experimental results

To investigate learning and inference in factorial HMMs we conducted two experiments. The first experiment compared the different approximate and exact methods of inference on the basis of computation time and the likelihood of the model obtained from synthetic data. The second experiment sought to determine whether the decomposition of the state space in factorial HMMs presents any advantage in modeling a real time series data set that we might assume to have complex internal structure: Bach's chorale melodies.

4.1. Experiment 1: Performance and timing benchmarks

Using data generated from a factorial HMM structure with M underlying Markov models with K states each, we compared the time per EM iteration and the training and test set likelihoods of five models: an HMM trained using the Baum-Welch algorithm; a factorial HMM trained with exact inference for the E step, using a straightforward application of the forward-backward algorithm rather than the more efficient algorithm outlined in Appendix B; a factorial HMM trained using Gibbs sampling for the E step with the number of samples fixed at 10 samples per variable; a factorial HMM trained using the completely factorized variational approximation; and a factorial HMM trained using the structured variational approximation. All factorial HMMs consisted of M underlying Markov models with K states each, whereas the HMM had K^M states.

The data were generated from a factorial HMM structure with M state variables, each of which could take on K discrete values. All of the parameters of this model, except for the output covariance matrix, were sampled from a uniform [0, 1] distribution and appropriately normalized to satisfy the sum-to-one constraints of the transition matrices and priors. The covariance matrix was set to a multiple of the identity matrix, C = 0.0025 I. The training and test sets consisted of 20 sequences of length 20, where the observable was a four-dimensional vector. For each randomly sampled set of parameters, a separate training set and test set were generated and each algorithm was run once.

Fifteen sets of parameters were generated for each of the four problem sizes. Algorithms were run for a maximum of 100 iterations of EM or until convergence, defined as the iteration k at which the log likelihood L(k), or approximate log likelihood if an approximate algorithm was used, satisfied L(k) - L(k-1) < 10^{-5} (L(k-1) - L(2)). At the end of learning, the log likelihoods on the training and test set were computed for all models using the exact algorithm. Also included in the comparison was the log likelihood of the training and test sets under the true model that generated the data. The test set log likelihood for N observation sequences is defined as \sum_{n=1}^{N} \log P(Y_1^{(n)}, \ldots, Y_T^{(n)} \mid \hat{\phi}), where \hat{\phi} is obtained by maximizing the log likelihood over a training set that is disjoint from the test set. This provides a measure of how well the model generalizes to a novel observation sequence from the same distribution as the training data.

Results averaged over 15 runs for each algorithm on each of the four problem sizes (a total of 300 runs) are presented in Table 1. Even for the smallest problem size (M = 3 and K = 2), the standard HMM with K^M states suffers from overfitting: the test set log likelihood is significantly worse than the training set log likelihood. As expected, this overfitting problem becomes worse as the size of the state space increases; it is particularly serious for M = 5 and K = 3.

For the factorial HMMs, the log likelihoods for each of the three approximate EM algorithms were compared to the exact algorithm. Gibbs sampling appeared to have the poorest performance: for each of the three smaller size problems its log likelihood was significantly worse than that of the exact algorithm on both the training sets and test sets (p < 0.05). This may be due to insufficient sampling. However, we will soon see that running the Gibbs sampler for more than 10 samples, while potentially improving performance, makes it substantially slower than the variational methods. Surprisingly, the Gibbs sampler appears to do quite well on the largest size problem, although the differences to the other methods are not statistically significant.

The performance of the completely factorized variational approximation was not statistically significantly different from that of the exact algorithm on either the training set or the test set for any of the problem sizes. The performance of the structured variational approximation was not statistically different from that of the exact method on three of the four problem sizes, and appeared to be better on one of the problem sizes (M = 5, K = 2; p < 0.05). Although this result may be a fluke arising from random variability, there is another more interesting (and speculative) explanation. The exact EM algorithm implements unconstrained maximization of F, as defined in Section 3.6, while the variational methods maximize F subject to a constrained distribution. These constraints could presumably act as regularizers, reducing overfitting.

There was a large amount of variability in the final log likelihoods for the models learned by all the algorithms. We subtracted the log likelihood of the true generative model from that of each trained model to eliminate the main effect of the randomly sampled generative model and to reduce the variability due to training and test sets. One important remaining source of variance was the random seed used in each training run, which determined the initial parameters and the samples used in the Gibbs algorithm.

Table 1. Comparison of the factorial HMM on four problems of varying size. The negative log likelihood for the training and test set, plus or minus one standard deviation, is shown for each problem size and algorithm, measured in bits per observation (log likelihood in bits divided by N T) relative to the log likelihood under the true generative model for that data set. True is the true generative model (the log likelihood per symbol is defined to be zero for this model by our measure); HMM is the hidden Markov model with K^M states; Exact is the factorial HMM trained using an exact E step; Gibbs is the factorial HMM trained using Gibbs sampling; CFVA is the factorial HMM trained using the completely factorized variational approximation; SVA is the factorial HMM trained using the structured variational approximation.

[Table 1 layout: columns M, K, Algorithm, Training Set, Test Set; for each of the four problem sizes (M, K) = (3, 2), (3, 3), (5, 2), (5, 3), one row each for True, HMM, Exact, Gibbs, CFVA, and SVA.]

Figure 3. Learning curves for five runs of each of the four learning algorithms for factorial HMMs: (a) exact; (b) completely factorized variational approximation; (c) structured variational approximation; and (d) Gibbs sampling. A single training set sampled from the M = 3, K = 2 problem size was used for all these runs. The solid lines show the negative log likelihood per observation (in bits) relative to the true model that generated the data, calculated using the exact algorithm. The circles denote the point at which the convergence criterion was met and the run ended. For the three approximate algorithms, the dashed lines show an approximate negative log likelihood.

All algorithms appeared to be very sensitive to this random seed, suggesting that different runs on each training set found different local maxima or plateaus of the likelihood (Figure 3). Some of this variability could be eliminated by explicitly adding a regularization term, which can be viewed as a prior on the parameters in maximum a posteriori parameter estimation. Alternatively, Bayesian (or ensemble) methods could be used to average out this variability by integrating over the parameter space.

The timing comparisons confirm the fact that both the standard HMM and the exact E step factorial HMM are extremely slow for models with large state spaces (Figure 4).

Figure 4. Time per iteration of EM on a Silicon Graphics R4400 processor running Matlab, for the HMM, the exact factorial HMM, and the Gibbs sampling, completely factorized variational, and structured variational factorial HMMs, as a function of the size of the state space (K^M).

Gibbs sampling was slower than the variational methods even when limited to ten samples of each hidden variable per iteration of EM. Since one pass of the variational fixed point equations has the same time complexity as one pass of Gibbs sampling, and since the variational fixed point equations were found to converge very quickly, these experiments suggest that Gibbs sampling is not as competitive time-wise as the variational methods. The time per iteration for the variational methods scaled well to large state spaces.

4.2. Experiment 2: Bach chorales

Musical pieces naturally exhibit complex structure at many different time scales. Furthermore, one can imagine that to represent the "state" of the musical piece at any given time it would be necessary to specify a conjunction of many different features. For these reasons, we chose to test whether a factorial HMM would provide an advantage over a regular HMM in modeling a collection of musical pieces.

The data set consisted of discrete event sequences encoding the melody lines of J. S. Bach's chorales, obtained from the UCI Repository for Machine Learning Databases (Merz & Murphy, 1996) and originally discussed in Conklin and Witten (1995). Each event in the sequence was represented by six attributes, described in Table 2. Sixty-six chorales, with 40 or more events each, were divided into a training set (30 chorales) and a test set (36 chorales). Using the first set, hidden Markov models with state space ranging from 2 to 100 states were trained until convergence. Factorial HMMs of varying sizes (K ranging from 2 to 6; M ranging from 2 to 9) were also trained on the same data.

Table 2. Attributes in the Bach chorale data set. The key signature and time signature attributes were constant over the duration of the chorale. All attributes were treated as real numbers and modeled as linear-Gaussian observations (4a).

    Attribute   Description                  Representation
    pitch       pitch of the event           int [0, 127]
    keysig      key signature                int [-7, 7]
    timesig     time signature               int (1/16 notes)
    fermata     event under fermata?         binary
    st          start time of event          int (1/16 notes)
    dur         duration of event            int (1/16 notes)

To approximate the E step for factorial HMMs we used the structured variational approximation. This choice was motivated by three considerations. First, for the size of state space we wished to explore, the exact algorithms were prohibitively slow. Second, the Gibbs sampling algorithm did not appear to present any advantages in speed or performance and required some heuristic method for determining the number of samples. Third, theoretical arguments suggest that the structured approximation should in general be superior to the completely factorized variational approximation, since more of the dependencies of the original model are preserved.

The test set log likelihoods for the HMMs, shown in Figure 5(a), exhibited the typical U-shaped curve demonstrating a trade-off between bias and variance (Geman, Bienenstock, & Doursat, 1992). HMMs with fewer than 10 states did not predict well, while HMMs with more than 40 states overfit the training data and therefore provided a poor model of the test data. Out of these runs, the highest test set log likelihood per observation was obtained by an HMM with 30 hidden states.

The factorial HMM provides a more satisfactory model of the chorales from three points of view. First, the time complexity is such that it is possible to consider models with significantly larger state spaces; in particular, we fit models with up to 1000 states. Second, given the componential parametrization of the factorial HMM, these large state spaces do not require excessively large numbers of parameters relative to the number of data points. In particular, we saw no evidence of overfitting even for the largest factorial HMM, as seen in Figures 5(c) & (d). Finally, this approach resulted in significantly better predictors; the test set likelihood for the best factorial HMM was an order of magnitude larger than the test set likelihood for the best HMM, as Figure 5(d) reveals.

While the factorial HMM is clearly a better predictor than a single HMM, it should be acknowledged that neither approach produces models that are easily interpretable from a musicological point of view. The situation is reminiscent of that in speech recognition, where HMMs have proved their value as predictive models of the speech signal without necessarily being viewed as causal generative models of speech. A factorial HMM is clearly an impoverished representation of musical structure, but its promising performance as a predictor provides hope that it could serve as a step on the way toward increasingly structured statistical models for music and other complex multivariate time series.

Figure 5. Test set log likelihood per event of the Bach chorale data set as a function of number of states for (a) HMMs, and factorial HMMs with (b) K = 2, (c) K = 3, and (d) K = 4 (heavy dashed line) and K = 5 (solid line). Each symbol represents a single run; the lines indicate the mean performances. The thin dashed line in (b)-(d) indicates the log likelihood per observation of the best run in (a). The factorial HMMs were trained using the structured approximation. For both methods the true likelihood was computed using the exact algorithm.

5. Generalizations of the model

In this section, we describe four variations and generalizations of the factorial HMM.

5.1. Discrete observables

The probabilistic model presented in this paper has assumed real-valued Gaussian observations. One of the advantages arising from this assumption is that the conditional density of a D-dimensional observation, P(Y_t \mid S_t^{(1)}, \ldots, S_t^{(M)}), can be compactly specified through M mean matrices of dimension D x K, and one D x D covariance matrix. Furthermore, the M step for such a model reduces to a set of weighted least squares equations.

The model can be generalized to handle discrete observations in several ways. Consider a single D-valued discrete observation Y_t. In analogy to traditional HMMs, the output probabilities could be modeled using a matrix. However, in the case of a factorial HMM, this matrix would have D x K^M entries for each combination of the state variables and observation. Thus the compactness of the representation would be entirely lost. Standard methods from graphical models suggest approximating such large matrices with "noisy-or" (Pearl, 1988) or "sigmoid" (Neal, 1992) models of interaction. For example, in the softmax model, which generalizes the sigmoid model to D > 2, P(Y_t \mid S_t^{(1)}, \ldots, S_t^{(M)}) is multinomial with mean proportional to \exp\{\sum_m W^{(m)} S_t^{(m)}\}. Like the Gaussian model, this specification is again compact, using M matrices of size D x K. (As in the linear-Gaussian model, the W^{(m)} are overparametrized since they can each model the overall mean of Y_t, as shown in Appendix A.) While the nonlinearity induced by the softmax function makes both the E step and M step of the algorithm more difficult, iterative numerical methods can be used in the M step, whereas Gibbs sampling and variational methods can again be used in the E step (see Neal, 1992; Hinton et al., 1995; and Saul et al., 1996, for discussions of different approaches to learning in sigmoid networks).
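As a concrete reading of the softmax output model just described, the following sketch computes the multinomial output distribution over D symbols for one time step given a setting of the state variables. The function and argument names are assumptions for illustration; only the functional form (mean proportional to exp of the summed weight columns) comes from the text above.

```python
import numpy as np

def softmax_observation_probs(W, state_indices):
    """Multinomial output distribution over D symbols for one time step.

    W: list of M arrays of shape (D, K), one weight matrix per chain.
    state_indices: length-M sequence giving the setting of each S_t^(m).
    Returns a length-D probability vector proportional to
    exp{ sum_m W^(m) S_t^(m) }, as in the softmax model above.
    """
    a = sum(W[m][:, state_indices[m]] for m in range(len(W)))
    e = np.exp(a - a.max())        # subtract max for numerical stability
    return e / e.sum()
```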

5.2. Introducing couplings

The architecture for factorial HMMs presented in Section 2 assumes that the underlying Markov chains interact only through the observations. This constraint can be relaxed by introducing couplings between the hidden state variables (cf. Saul & Jordan, 1997). For example, if S_t^{(m)} depends on S_{t-1}^{(m)} and S_{t-1}^{(m-1)}, equation (3) is replaced by the following factorization:

    P(S_t \mid S_{t-1}) = P(S_t^{(1)} \mid S_{t-1}^{(1)}) \prod_{m=2}^{M} P(S_t^{(m)} \mid S_{t-1}^{(m)}, S_{t-1}^{(m-1)}).    (13)

Similar exact, variational, and Gibbs sampling procedures can be defined for this architecture. However, note that these couplings must be introduced with caution, as they may result in an exponential growth in parameters. For example, the above factorization requires transition matrices of size K^2 x K.

Rather than specifying these higher-order couplings through probability transition matrices, one can introduce second-order interaction terms in the energy (log probability) function. Such terms effectively couple the chains without the number of parameters incurred by a full probability transition matrix. In the graphical model formalism these correspond to symmetric undirected links, making the model a chain graph. While the Jensen, Lauritzen and Olesen (1990) algorithm can still be used to propagate information exactly in chain graphs, such undirected links cause the normalization constant of the probability distribution, the partition function, to depend on the coupling parameters. As in Boltzmann machines (Hinton & Sejnowski, 1986), both a clamped and an unclamped phase are therefore required for learning, where the goal of the unclamped phase is to compute the derivative of the partition function with respect to the parameters (Neal, 1992).

5.3. Conditioning on inputs

Like the hidden Markov model, the factorial HMM provides a model of the unconditional density of the observation sequences. In certain problem domains, some of the observations can be better thought of as inputs or explanatory variables, and the others as outputs or response variables. The goal, in these cases, is to model the conditional density of the output sequence given the input sequence. In machine learning terminology, unconditional density estimation is unsupervised while conditional density estimation is supervised. Several algorithms for learning in hidden Markov models that are conditioned on inputs have been recently presented in the literature (Cacciatore & Nowlan, 1994; Bengio & Frasconi, 1995; Meila & Jordan, 1996). Given a sequence of input vectors {X_t}, the probabilistic model for an input-conditioned factorial HMM is

    P(\{S_t, Y_t\} \mid \{X_t\}) = \left[ \prod_{m=1}^{M} P(S_1^{(m)} \mid X_1) \right] P(Y_1 \mid S_1, X_1)
        \prod_{t=2}^{T} \left[ \prod_{m=1}^{M} P(S_t^{(m)} \mid S_{t-1}^{(m)}, X_t) \right] P(Y_t \mid S_t, X_t).    (14)

The model depends on the specification of P(Y_t \mid S_t, X_t) and P(S_t^{(m)} \mid S_{t-1}^{(m)}, X_t), which are conditioned both on a discrete state variable and on a (possibly continuous) input vector. The approach used in Bengio and Frasconi's Input Output HMMs (IOHMMs) suggests modeling P(S_t^{(m)} \mid S_{t-1}^{(m)}, X_t) as K separate neural networks, one for each setting of S_{t-1}^{(m)}. This decomposition ensures that a valid probability transition matrix is defined at each point in input space if a sum-to-one constraint (e.g., a softmax nonlinearity) is used in the output of these networks.

Using the decomposition of each conditional probability into K networks, inference in input-conditioned factorial HMMs is a straightforward generalization of the algorithms we have presented for factorial HMMs. The exact forward-backward algorithm in Appendix B can be adapted by using the appropriate conditional probabilities. Similarly, the Gibbs sampling procedure is no more complex when conditioned on inputs. Finally, the completely factorized and structured approximations can also be generalized readily if the approximating distribution has a dependence on the input similar to the model's. If the probability transition structure P(S_t^{(m)} \mid S_{t-1}^{(m)}, X_t) is not decomposed as above, but has a complex dependence on the previous state variable and input, inference may become considerably more complex.

Depending on the form of the input conditioning, the Maximization step of learning may also change considerably. In general, if the output and transition probabilities are modeled as neural networks, the M step can no longer be solved exactly and a gradient-based generalized EM algorithm must be used. For log-linear models, the M step can be solved using an inner loop of iteratively reweighted least-squares (McCullagh & Nelder, 1989).

5.4. Hidden Markov decision trees

An interesting generalization of factorial HMMs results if one conditions on an input X_t and orders the M state variables such that S_t^{(m)} depends on S_t^{(n)} for n < m (Figure 6). The resulting architecture can be seen as a probabilistic decision tree with Markovian dynamics linking the decision variables.

Figure 6. The hidden Markov decision tree.

Consider how this probabilistic model would generate data at the first time step, t = 1. Given input X_1, the top node S_1^{(1)} can take on K values. This stochastically partitions X space into K decision regions. The next node down the hierarchy, S_1^{(2)}, subdivides each of these regions into K subregions, and so on. The output Y_1 is generated from the input X_1 and the K-way decisions at each of the M hidden nodes. At the next time step, a similar procedure is used to generate data from the model, except that now each decision in the tree is dependent on the decision taken at that node in the previous time step. Thus, the "hierarchical mixture of experts" architecture (Jordan & Jacobs, 1994) is generalized to include Markovian dynamics for the decisions. Hidden Markov decision trees provide a useful starting point for modeling time series with both temporal and spatial structure at multiple resolutions. We explore this generalization of factorial HMMs in Jordan, Ghahramani, and Saul (1997).

6. Conclusion

In this paper we have examined the problem of learning for a class of generalized hidden Markov models with distributed state representations. This generalization provides both a richer modeling tool and a method for incorporating prior structural information about the state variables underlying the dynamics of the system generating the data. Although exact inference in this class of models is generally intractable, we provided a structured variational approximation that can be computed tractably. This approximation forms the basis of the Expectation step in an EM algorithm for learning the parameters of the model. Empirical comparisons to several other approximations and to the exact algorithm show that this approximation is both efficient to compute and accurate. Finally, we have shown that factorial HMMs can capture statistical structure in a real time series data set, the melody lines of Bach's chorales, which an unconstrained HMM cannot.


More information

) were both constant and we brought them from under the integral.

) were both constant and we brought them from under the integral. YIELD-PER-RECRUIT (coninued The yield-per-recrui model applies o a cohor, bu we saw in he Age Disribuions lecure ha he properies of a cohor do no apply in general o a collecion of cohors, which is wha

More information

Matlab and Python programming: how to get started

Matlab and Python programming: how to get started Malab and Pyhon programming: how o ge sared Equipping readers he skills o wrie programs o explore complex sysems and discover ineresing paerns from big daa is one of he main goals of his book. In his chaper,

More information

Multi-scale 2D acoustic full waveform inversion with high frequency impulsive source

Multi-scale 2D acoustic full waveform inversion with high frequency impulsive source Muli-scale D acousic full waveform inversion wih high frequency impulsive source Vladimir N Zubov*, Universiy of Calgary, Calgary AB vzubov@ucalgaryca and Michael P Lamoureux, Universiy of Calgary, Calgary

More information

t is a basis for the solution space to this system, then the matrix having these solutions as columns, t x 1 t, x 2 t,... x n t x 2 t...

t is a basis for the solution space to this system, then the matrix having these solutions as columns, t x 1 t, x 2 t,... x n t x 2 t... Mah 228- Fri Mar 24 5.6 Marix exponenials and linear sysems: The analogy beween firs order sysems of linear differenial equaions (Chaper 5) and scalar linear differenial equaions (Chaper ) is much sronger

More information

ACE 562 Fall Lecture 5: The Simple Linear Regression Model: Sampling Properties of the Least Squares Estimators. by Professor Scott H.

ACE 562 Fall Lecture 5: The Simple Linear Regression Model: Sampling Properties of the Least Squares Estimators. by Professor Scott H. ACE 56 Fall 005 Lecure 5: he Simple Linear Regression Model: Sampling Properies of he Leas Squares Esimaors by Professor Sco H. Irwin Required Reading: Griffihs, Hill and Judge. "Inference in he Simple

More information

Sequential Importance Resampling (SIR) Particle Filter

Sequential Importance Resampling (SIR) Particle Filter Paricle Filers++ Pieer Abbeel UC Berkeley EECS Many slides adaped from Thrun, Burgard and Fox, Probabilisic Roboics 1. Algorihm paricle_filer( S -1, u, z ): 2. Sequenial Imporance Resampling (SIR) Paricle

More information

hen found from Bayes rule. Specically, he prior disribuion is given by p( ) = N( ; ^ ; r ) (.3) where r is he prior variance (we add on he random drif

hen found from Bayes rule. Specically, he prior disribuion is given by p( ) = N( ; ^ ; r ) (.3) where r is he prior variance (we add on he random drif Chaper Kalman Filers. Inroducion We describe Bayesian Learning for sequenial esimaion of parameers (eg. means, AR coeciens). The updae procedures are known as Kalman Filers. We show how Dynamic Linear

More information

Bias in Conditional and Unconditional Fixed Effects Logit Estimation: a Correction * Tom Coupé

Bias in Conditional and Unconditional Fixed Effects Logit Estimation: a Correction * Tom Coupé Bias in Condiional and Uncondiional Fixed Effecs Logi Esimaion: a Correcion * Tom Coupé Economics Educaion and Research Consorium, Naional Universiy of Kyiv Mohyla Academy Address: Vul Voloska 10, 04070

More information

Econ107 Applied Econometrics Topic 7: Multicollinearity (Studenmund, Chapter 8)

Econ107 Applied Econometrics Topic 7: Multicollinearity (Studenmund, Chapter 8) I. Definiions and Problems A. Perfec Mulicollineariy Econ7 Applied Economerics Topic 7: Mulicollineariy (Sudenmund, Chaper 8) Definiion: Perfec mulicollineariy exiss in a following K-variable regression

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models Probabilisic reasoning over ime So far, we ve mosly deal wih episodic environmens Excepions: games wih muliple moves, planning In paricular, he Bayesian neworks we ve seen so far describe

More information

An EM based training algorithm for recurrent neural networks

An EM based training algorithm for recurrent neural networks An EM based raining algorihm for recurren neural neworks Jan Unkelbach, Sun Yi, and Jürgen Schmidhuber IDSIA,Galleria 2, 6928 Manno, Swizerland {jan.unkelbach,yi,juergen}@idsia.ch hp://www.idsia.ch Absrac.

More information

Testing for a Single Factor Model in the Multivariate State Space Framework

Testing for a Single Factor Model in the Multivariate State Space Framework esing for a Single Facor Model in he Mulivariae Sae Space Framework Chen C.-Y. M. Chiba and M. Kobayashi Inernaional Graduae School of Social Sciences Yokohama Naional Universiy Japan Faculy of Economics

More information

Two Coupled Oscillators / Normal Modes

Two Coupled Oscillators / Normal Modes Lecure 3 Phys 3750 Two Coupled Oscillaors / Normal Modes Overview and Moivaion: Today we ake a small, bu significan, sep owards wave moion. We will no ye observe waves, bu his sep is imporan in is own

More information

Variational Learning for Switching State-Space Models

Variational Learning for Switching State-Space Models LEER Communicaed by Volker resp Variaional Learning for Swiching Sae-Space odels Zoubin Ghahramani Geoffrey E. Hinon Gasby Compuaional euroscience Uni, Universiy College London, London WC 3R, U.K. We inroduce

More information

Zürich. ETH Master Course: L Autonomous Mobile Robots Localization II

Zürich. ETH Master Course: L Autonomous Mobile Robots Localization II Roland Siegwar Margaria Chli Paul Furgale Marco Huer Marin Rufli Davide Scaramuzza ETH Maser Course: 151-0854-00L Auonomous Mobile Robos Localizaion II ACT and SEE For all do, (predicion updae / ACT),

More information

L07. KALMAN FILTERING FOR NON-LINEAR SYSTEMS. NA568 Mobile Robotics: Methods & Algorithms

L07. KALMAN FILTERING FOR NON-LINEAR SYSTEMS. NA568 Mobile Robotics: Methods & Algorithms L07. KALMAN FILTERING FOR NON-LINEAR SYSTEMS NA568 Mobile Roboics: Mehods & Algorihms Today s Topic Quick review on (Linear) Kalman Filer Kalman Filering for Non-Linear Sysems Exended Kalman Filer (EKF)

More information

Physics 235 Chapter 2. Chapter 2 Newtonian Mechanics Single Particle

Physics 235 Chapter 2. Chapter 2 Newtonian Mechanics Single Particle Chaper 2 Newonian Mechanics Single Paricle In his Chaper we will review wha Newon s laws of mechanics ell us abou he moion of a single paricle. Newon s laws are only valid in suiable reference frames,

More information

On Measuring Pro-Poor Growth. 1. On Various Ways of Measuring Pro-Poor Growth: A Short Review of the Literature

On Measuring Pro-Poor Growth. 1. On Various Ways of Measuring Pro-Poor Growth: A Short Review of the Literature On Measuring Pro-Poor Growh 1. On Various Ways of Measuring Pro-Poor Growh: A Shor eview of he Lieraure During he pas en years or so here have been various suggesions concerning he way one should check

More information

Online Convex Optimization Example And Follow-The-Leader

Online Convex Optimization Example And Follow-The-Leader CSE599s, Spring 2014, Online Learning Lecure 2-04/03/2014 Online Convex Opimizaion Example And Follow-The-Leader Lecurer: Brendan McMahan Scribe: Sephen Joe Jonany 1 Review of Online Convex Opimizaion

More information

23.2. Representing Periodic Functions by Fourier Series. Introduction. Prerequisites. Learning Outcomes

23.2. Representing Periodic Functions by Fourier Series. Introduction. Prerequisites. Learning Outcomes Represening Periodic Funcions by Fourier Series 3. Inroducion In his Secion we show how a periodic funcion can be expressed as a series of sines and cosines. We begin by obaining some sandard inegrals

More information

Application of a Stochastic-Fuzzy Approach to Modeling Optimal Discrete Time Dynamical Systems by Using Large Scale Data Processing

Application of a Stochastic-Fuzzy Approach to Modeling Optimal Discrete Time Dynamical Systems by Using Large Scale Data Processing Applicaion of a Sochasic-Fuzzy Approach o Modeling Opimal Discree Time Dynamical Sysems by Using Large Scale Daa Processing AA WALASZE-BABISZEWSA Deparmen of Compuer Engineering Opole Universiy of Technology

More information

Matrix Versions of Some Refinements of the Arithmetic-Geometric Mean Inequality

Matrix Versions of Some Refinements of the Arithmetic-Geometric Mean Inequality Marix Versions of Some Refinemens of he Arihmeic-Geomeric Mean Inequaliy Bao Qi Feng and Andrew Tonge Absrac. We esablish marix versions of refinemens due o Alzer ], Carwrigh and Field 4], and Mercer 5]

More information

Západočeská Univerzita v Plzni, Czech Republic and Groupe ESIEE Paris, France

Západočeská Univerzita v Plzni, Czech Republic and Groupe ESIEE Paris, France ADAPTIVE SIGNAL PROCESSING USING MAXIMUM ENTROPY ON THE MEAN METHOD AND MONTE CARLO ANALYSIS Pavla Holejšovsá, Ing. *), Z. Peroua, Ing. **), J.-F. Bercher, Prof. Assis. ***) Západočesá Univerzia v Plzni,

More information

Lecture 33: November 29

Lecture 33: November 29 36-705: Inermediae Saisics Fall 2017 Lecurer: Siva Balakrishnan Lecure 33: November 29 Today we will coninue discussing he boosrap, and hen ry o undersand why i works in a simple case. In he las lecure

More information

Air Traffic Forecast Empirical Research Based on the MCMC Method

Air Traffic Forecast Empirical Research Based on the MCMC Method Compuer and Informaion Science; Vol. 5, No. 5; 0 ISSN 93-8989 E-ISSN 93-8997 Published by Canadian Cener of Science and Educaion Air Traffic Forecas Empirical Research Based on he MCMC Mehod Jian-bo Wang,

More information

Time series model fitting via Kalman smoothing and EM estimation in TimeModels.jl

Time series model fitting via Kalman smoothing and EM estimation in TimeModels.jl Time series model fiing via Kalman smoohing and EM esimaion in TimeModels.jl Gord Sephen Las updaed: January 206 Conens Inroducion 2. Moivaion and Acknowledgemens....................... 2.2 Noaion......................................

More information

14 Autoregressive Moving Average Models

14 Autoregressive Moving Average Models 14 Auoregressive Moving Average Models In his chaper an imporan parameric family of saionary ime series is inroduced, he family of he auoregressive moving average, or ARMA, processes. For a large class

More information

ACE 562 Fall Lecture 4: Simple Linear Regression Model: Specification and Estimation. by Professor Scott H. Irwin

ACE 562 Fall Lecture 4: Simple Linear Regression Model: Specification and Estimation. by Professor Scott H. Irwin ACE 56 Fall 005 Lecure 4: Simple Linear Regression Model: Specificaion and Esimaion by Professor Sco H. Irwin Required Reading: Griffihs, Hill and Judge. "Simple Regression: Economic and Saisical Model

More information

A variational radial basis function approximation for diffusion processes.

A variational radial basis function approximation for diffusion processes. A variaional radial basis funcion approximaion for diffusion processes. Michail D. Vreas, Dan Cornford and Yuan Shen {vreasm, d.cornford, y.shen}@ason.ac.uk Ason Universiy, Birmingham, UK hp://www.ncrg.ason.ac.uk

More information

References are appeared in the last slide. Last update: (1393/08/19)

References are appeared in the last slide. Last update: (1393/08/19) SYSEM IDEIFICAIO Ali Karimpour Associae Professor Ferdowsi Universi of Mashhad References are appeared in he las slide. Las updae: 0..204 393/08/9 Lecure 5 lecure 5 Parameer Esimaion Mehods opics o be

More information

Linear Response Theory: The connection between QFT and experiments

Linear Response Theory: The connection between QFT and experiments Phys540.nb 39 3 Linear Response Theory: The connecion beween QFT and experimens 3.1. Basic conceps and ideas Q: How do we measure he conduciviy of a meal? A: we firs inroduce a weak elecric field E, and

More information

di Bernardo, M. (1995). A purely adaptive controller to synchronize and control chaotic systems.

di Bernardo, M. (1995). A purely adaptive controller to synchronize and control chaotic systems. di ernardo, M. (995). A purely adapive conroller o synchronize and conrol chaoic sysems. hps://doi.org/.6/375-96(96)8-x Early version, also known as pre-prin Link o published version (if available):.6/375-96(96)8-x

More information

ACE 562 Fall Lecture 8: The Simple Linear Regression Model: R 2, Reporting the Results and Prediction. by Professor Scott H.

ACE 562 Fall Lecture 8: The Simple Linear Regression Model: R 2, Reporting the Results and Prediction. by Professor Scott H. ACE 56 Fall 5 Lecure 8: The Simple Linear Regression Model: R, Reporing he Resuls and Predicion by Professor Sco H. Irwin Required Readings: Griffihs, Hill and Judge. "Explaining Variaion in he Dependen

More information

Let us start with a two dimensional case. We consider a vector ( x,

Let us start with a two dimensional case. We consider a vector ( x, Roaion marices We consider now roaion marices in wo and hree dimensions. We sar wih wo dimensions since wo dimensions are easier han hree o undersand, and one dimension is a lile oo simple. However, our

More information

Temporal probability models

Temporal probability models Temporal probabiliy models CS194-10 Fall 2011 Lecure 25 CS194-10 Fall 2011 Lecure 25 1 Ouline Hidden variables Inerence: ilering, predicion, smoohing Hidden Markov models Kalman ilers (a brie menion) Dynamic

More information

Class Meeting # 10: Introduction to the Wave Equation

Class Meeting # 10: Introduction to the Wave Equation MATH 8.5 COURSE NOTES - CLASS MEETING # 0 8.5 Inroducion o PDEs, Fall 0 Professor: Jared Speck Class Meeing # 0: Inroducion o he Wave Equaion. Wha is he wave equaion? The sandard wave equaion for a funcion

More information

Robust estimation based on the first- and third-moment restrictions of the power transformation model

Robust estimation based on the first- and third-moment restrictions of the power transformation model h Inernaional Congress on Modelling and Simulaion, Adelaide, Ausralia, 6 December 3 www.mssanz.org.au/modsim3 Robus esimaion based on he firs- and hird-momen resricions of he power ransformaion Nawaa,

More information

2.3 SCHRÖDINGER AND HEISENBERG REPRESENTATIONS

2.3 SCHRÖDINGER AND HEISENBERG REPRESENTATIONS Andrei Tokmakoff, MIT Deparmen of Chemisry, 2/22/2007 2-17 2.3 SCHRÖDINGER AND HEISENBERG REPRESENTATIONS The mahemaical formulaion of he dynamics of a quanum sysem is no unique. So far we have described

More information

Isolated-word speech recognition using hidden Markov models

Isolated-word speech recognition using hidden Markov models Isolaed-word speech recogniion using hidden Markov models Håkon Sandsmark December 18, 21 1 Inroducion Speech recogniion is a challenging problem on which much work has been done he las decades. Some of

More information

KINEMATICS IN ONE DIMENSION

KINEMATICS IN ONE DIMENSION KINEMATICS IN ONE DIMENSION PREVIEW Kinemaics is he sudy of how hings move how far (disance and displacemen), how fas (speed and velociy), and how fas ha how fas changes (acceleraion). We say ha an objec

More information

Deep Learning: Theory, Techniques & Applications - Recurrent Neural Networks -

Deep Learning: Theory, Techniques & Applications - Recurrent Neural Networks - Deep Learning: Theory, Techniques & Applicaions - Recurren Neural Neworks - Prof. Maeo Maeucci maeo.maeucci@polimi.i Deparmen of Elecronics, Informaion and Bioengineering Arificial Inelligence and Roboics

More information

Nature Neuroscience: doi: /nn Supplementary Figure 1. Spike-count autocorrelations in time.

Nature Neuroscience: doi: /nn Supplementary Figure 1. Spike-count autocorrelations in time. Supplemenary Figure 1 Spike-coun auocorrelaions in ime. Normalized auocorrelaion marices are shown for each area in a daase. The marix shows he mean correlaion of he spike coun in each ime bin wih he spike

More information

A Specification Test for Linear Dynamic Stochastic General Equilibrium Models

A Specification Test for Linear Dynamic Stochastic General Equilibrium Models Journal of Saisical and Economeric Mehods, vol.1, no.2, 2012, 65-70 ISSN: 2241-0384 (prin), 2241-0376 (online) Scienpress Ld, 2012 A Specificaion Tes for Linear Dynamic Sochasic General Equilibrium Models

More information

Some Basic Information about M-S-D Systems

Some Basic Information about M-S-D Systems Some Basic Informaion abou M-S-D Sysems 1 Inroducion We wan o give some summary of he facs concerning unforced (homogeneous) and forced (non-homogeneous) models for linear oscillaors governed by second-order,

More information

Y. Xiang, Learning Bayesian Networks 1

Y. Xiang, Learning Bayesian Networks 1 Learning Bayesian Neworks Objecives Acquisiion of BNs Technical conex of BN learning Crierion of sound srucure learning BN srucure learning in 2 seps BN CPT esimaion Reference R.E. Neapolian: Learning

More information

2.160 System Identification, Estimation, and Learning. Lecture Notes No. 8. March 6, 2006

2.160 System Identification, Estimation, and Learning. Lecture Notes No. 8. March 6, 2006 2.160 Sysem Idenificaion, Esimaion, and Learning Lecure Noes No. 8 March 6, 2006 4.9 Eended Kalman Filer In many pracical problems, he process dynamics are nonlinear. w Process Dynamics v y u Model (Linearized)

More information

A Variational Learning Algorithm for the Abstract Hidden Markov Model

A Variational Learning Algorithm for the Abstract Hidden Markov Model A Variaional Learning Algorihm for he Absrac Hidden Markov Model Jeff Johns and Sridhar Mahadevan Compuer Science Deparmen Universiy of Massachuses Amhers, MA 01003 {johns, mahadeva}@cs.umass.edu Absrac

More information

Temporal probability models. Chapter 15, Sections 1 5 1

Temporal probability models. Chapter 15, Sections 1 5 1 Temporal probabiliy models Chaper 15, Secions 1 5 Chaper 15, Secions 1 5 1 Ouline Time and uncerainy Inerence: ilering, predicion, smoohing Hidden Markov models Kalman ilers (a brie menion) Dynamic Bayesian

More information

Retrieval Models. Boolean and Vector Space Retrieval Models. Common Preprocessing Steps. Boolean Model. Boolean Retrieval Model

Retrieval Models. Boolean and Vector Space Retrieval Models. Common Preprocessing Steps. Boolean Model. Boolean Retrieval Model 1 Boolean and Vecor Space Rerieval Models Many slides in his secion are adaped from Prof. Joydeep Ghosh (UT ECE) who in urn adaped hem from Prof. Dik Lee (Univ. of Science and Tech, Hong Kong) Rerieval

More information

Lecture 2 October ε-approximation of 2-player zero-sum games

Lecture 2 October ε-approximation of 2-player zero-sum games Opimizaion II Winer 009/10 Lecurer: Khaled Elbassioni Lecure Ocober 19 1 ε-approximaion of -player zero-sum games In his lecure we give a randomized ficiious play algorihm for obaining an approximae soluion

More information

12: AUTOREGRESSIVE AND MOVING AVERAGE PROCESSES IN DISCRETE TIME. Σ j =

12: AUTOREGRESSIVE AND MOVING AVERAGE PROCESSES IN DISCRETE TIME. Σ j = 1: AUTOREGRESSIVE AND MOVING AVERAGE PROCESSES IN DISCRETE TIME Moving Averages Recall ha a whie noise process is a series { } = having variance σ. The whie noise process has specral densiy f (λ) = of

More information

Section 3.5 Nonhomogeneous Equations; Method of Undetermined Coefficients

Section 3.5 Nonhomogeneous Equations; Method of Undetermined Coefficients Secion 3.5 Nonhomogeneous Equaions; Mehod of Undeermined Coefficiens Key Terms/Ideas: Linear Differenial operaor Nonlinear operaor Second order homogeneous DE Second order nonhomogeneous DE Soluion o homogeneous

More information

Random Walk with Anti-Correlated Steps

Random Walk with Anti-Correlated Steps Random Walk wih Ani-Correlaed Seps John Noga Dirk Wagner 2 Absrac We conjecure he expeced value of random walks wih ani-correlaed seps o be exacly. We suppor his conjecure wih 2 plausibiliy argumens and

More information

Designing Information Devices and Systems I Spring 2019 Lecture Notes Note 17

Designing Information Devices and Systems I Spring 2019 Lecture Notes Note 17 EES 16A Designing Informaion Devices and Sysems I Spring 019 Lecure Noes Noe 17 17.1 apaciive ouchscreen In he las noe, we saw ha a capacior consiss of wo pieces on conducive maerial separaed by a nonconducive

More information

RANDOM LAGRANGE MULTIPLIERS AND TRANSVERSALITY

RANDOM LAGRANGE MULTIPLIERS AND TRANSVERSALITY ECO 504 Spring 2006 Chris Sims RANDOM LAGRANGE MULTIPLIERS AND TRANSVERSALITY 1. INTRODUCTION Lagrange muliplier mehods are sandard fare in elemenary calculus courses, and hey play a cenral role in economic

More information

Hidden Markov Models. Adapted from. Dr Catherine Sweeney-Reed s slides

Hidden Markov Models. Adapted from. Dr Catherine Sweeney-Reed s slides Hidden Markov Models Adaped from Dr Caherine Sweeney-Reed s slides Summary Inroducion Descripion Cenral in HMM modelling Exensions Demonsraion Specificaion of an HMM Descripion N - number of saes Q = {q

More information

Estimation of Poses with Particle Filters

Estimation of Poses with Particle Filters Esimaion of Poses wih Paricle Filers Dr.-Ing. Bernd Ludwig Chair for Arificial Inelligence Deparmen of Compuer Science Friedrich-Alexander-Universiä Erlangen-Nürnberg 12/05/2008 Dr.-Ing. Bernd Ludwig (FAU

More information

Product Integration. Richard D. Gill. Mathematical Institute, University of Utrecht, Netherlands EURANDOM, Eindhoven, Netherlands August 9, 2001

Product Integration. Richard D. Gill. Mathematical Institute, University of Utrecht, Netherlands EURANDOM, Eindhoven, Netherlands August 9, 2001 Produc Inegraion Richard D. Gill Mahemaical Insiue, Universiy of Urech, Neherlands EURANDOM, Eindhoven, Neherlands Augus 9, 21 Absrac This is a brief survey of produc-inegraion for biosaisicians. 1 Produc-Inegraion

More information

3.1.3 INTRODUCTION TO DYNAMIC OPTIMIZATION: DISCRETE TIME PROBLEMS. A. The Hamiltonian and First-Order Conditions in a Finite Time Horizon

3.1.3 INTRODUCTION TO DYNAMIC OPTIMIZATION: DISCRETE TIME PROBLEMS. A. The Hamiltonian and First-Order Conditions in a Finite Time Horizon 3..3 INRODUCION O DYNAMIC OPIMIZAION: DISCREE IME PROBLEMS A. he Hamilonian and Firs-Order Condiions in a Finie ime Horizon Define a new funcion, he Hamilonian funcion, H. H he change in he oal value of

More information

α=1 α=2 α=

α=1 α=2 α= To be published in: Advances in Neural Informaion Processing Sysems 9, M C Mozer, M I Jordan, T Pesche (eds.), MIT Press, Cambridge, MA Auhor for correspondence: Peer Sollich Online learning from nie raining

More information

5. Stochastic processes (1)

5. Stochastic processes (1) Lec05.pp S-38.45 - Inroducion o Teleraffic Theory Spring 2005 Conens Basic conceps Poisson process 2 Sochasic processes () Consider some quaniy in a eleraffic (or any) sysem I ypically evolves in ime randomly

More information

IMPLICIT AND INVERSE FUNCTION THEOREMS PAUL SCHRIMPF 1 OCTOBER 25, 2013

IMPLICIT AND INVERSE FUNCTION THEOREMS PAUL SCHRIMPF 1 OCTOBER 25, 2013 IMPLICI AND INVERSE FUNCION HEOREMS PAUL SCHRIMPF 1 OCOBER 25, 213 UNIVERSIY OF BRIISH COLUMBIA ECONOMICS 526 We have exensively sudied how o solve sysems of linear equaions. We know how o check wheher

More information

Robotics I. April 11, The kinematics of a 3R spatial robot is specified by the Denavit-Hartenberg parameters in Tab. 1.

Robotics I. April 11, The kinematics of a 3R spatial robot is specified by the Denavit-Hartenberg parameters in Tab. 1. Roboics I April 11, 017 Exercise 1 he kinemaics of a 3R spaial robo is specified by he Denavi-Harenberg parameers in ab 1 i α i d i a i θ i 1 π/ L 1 0 1 0 0 L 3 0 0 L 3 3 able 1: able of DH parameers of

More information

1.1. Example: Polynomial Curve Fitting 4 1. INTRODUCTION

1.1. Example: Polynomial Curve Fitting 4 1. INTRODUCTION 4. INTRODUCTION Figure.2 Plo of a raining daa se of N = poins, shown as blue circles, each comprising an observaion of he inpu variable along wih he corresponding arge variable. The green curve shows he

More information

. Now define y j = log x j, and solve the iteration.

. Now define y j = log x j, and solve the iteration. Problem 1: (Disribued Resource Allocaion (ALOHA!)) (Adaped from M& U, Problem 5.11) In his problem, we sudy a simple disribued proocol for allocaing agens o shared resources, wherein agens conend for resources

More information

Some Ramsey results for the n-cube

Some Ramsey results for the n-cube Some Ramsey resuls for he n-cube Ron Graham Universiy of California, San Diego Jozsef Solymosi Universiy of Briish Columbia, Vancouver, Canada Absrac In his noe we esablish a Ramsey-ype resul for cerain

More information

Inferring Dynamic Dependency with Applications to Link Analysis

Inferring Dynamic Dependency with Applications to Link Analysis Inferring Dynamic Dependency wih Applicaions o Link Analysis Michael R. Siracusa Massachuses Insiue of Technology 77 Massachuses Ave. Cambridge, MA 239 John W. Fisher III Massachuses Insiue of Technology

More information

Failure of the work-hamiltonian connection for free energy calculations. Abstract

Failure of the work-hamiltonian connection for free energy calculations. Abstract Failure of he work-hamilonian connecion for free energy calculaions Jose M. G. Vilar 1 and J. Miguel Rubi 1 Compuaional Biology Program, Memorial Sloan-Keering Cancer Cener, 175 York Avenue, New York,

More information

SOLUTIONS TO ECE 3084

SOLUTIONS TO ECE 3084 SOLUTIONS TO ECE 384 PROBLEM 2.. For each sysem below, specify wheher or no i is: (i) memoryless; (ii) causal; (iii) inverible; (iv) linear; (v) ime invarian; Explain your reasoning. If he propery is no

More information

GMM - Generalized Method of Moments

GMM - Generalized Method of Moments GMM - Generalized Mehod of Momens Conens GMM esimaion, shor inroducion 2 GMM inuiion: Maching momens 2 3 General overview of GMM esimaion. 3 3. Weighing marix...........................................

More information

CONTROL SYSTEMS, ROBOTICS AND AUTOMATION Vol. XI Control of Stochastic Systems - P.R. Kumar

CONTROL SYSTEMS, ROBOTICS AND AUTOMATION Vol. XI Control of Stochastic Systems - P.R. Kumar CONROL OF SOCHASIC SYSEMS P.R. Kumar Deparmen of Elecrical and Compuer Engineering, and Coordinaed Science Laboraory, Universiy of Illinois, Urbana-Champaign, USA. Keywords: Markov chains, ransiion probabiliies,

More information