Variational Learning for Switching State-Space Models

LETTER  Communicated by Volker Tresp

Variational Learning for Switching State-Space Models

Zoubin Ghahramani
Geoffrey E. Hinton
Gatsby Computational Neuroscience Unit, University College London, London WC1N 3AR, U.K.

We introduce a new statistical model for time series that iteratively segments data into regimes with approximately linear dynamics and learns the parameters of each of these linear regimes. This model combines and generalizes two of the most widely used stochastic time-series models, hidden Markov models and linear dynamical systems, and is closely related to models that are widely used in the control and econometrics literatures. It can also be derived by extending the mixture of experts neural network (Jacobs, Jordan, Nowlan, & Hinton, 1991) to its fully dynamical version, in which both expert and gating networks are recurrent. Inferring the posterior probabilities of the hidden states of this model is computationally intractable, and therefore the exact expectation-maximization (EM) algorithm cannot be applied. However, we present a variational approximation that maximizes a lower bound on the log-likelihood and makes use of both the forward and backward recursions for hidden Markov models and the Kalman filter recursions for linear dynamical systems. We tested the algorithm on artificial data sets and a natural data set of respiration force from a patient with sleep apnea. The results suggest that variational approximations are a viable method for inference and learning in switching state-space models.

1 Introduction

Most commonly used probabilistic models of time series are descendants of either hidden Markov models (HMMs) or stochastic linear dynamical systems, also known as state-space models (SSMs). HMMs represent information about the past of a sequence through a single discrete random variable, the hidden state. The prior probability distribution of this state is derived from the previous hidden state using a stochastic transition matrix. Knowing the state at any time makes the past, present, and future observations statistically independent. This is the Markov independence property that gives the model its name.

SSMs represent information about the past through a real-valued hidden state vector. Again, conditioned on this state vector, the past, present, and future observations are statistically independent. The dependency between

Neural Computation 12 © 2000 Massachusetts Institute of Technology

the present state vector and the previous state vector is specified through the dynamic equations of the system and the noise model. When these equations are linear and the noise model is gaussian, the SSM is also known as a linear dynamical system or Kalman filter model.

Unfortunately, most real-world processes cannot be characterized by either purely discrete or purely linear-gaussian dynamics. For example, an industrial plant may have multiple discrete modes of behavior, each with approximately linear dynamics. Similarly, the pixel intensities in an image of a translating object vary according to approximately linear dynamics for subpixel translations, but as the image moves over a larger range, the dynamics change significantly and nonlinearly.

This article addresses models of dynamical phenomena that are characterized by a combination of discrete and continuous dynamics. We introduce a probabilistic model called the switching SSM inspired by the divide-and-conquer principle underlying the mixture-of-experts neural network (Jacobs, Jordan, Nowlan, & Hinton, 1991). Switching SSMs are a natural generalization of HMMs and SSMs in which the dynamics can transition in a discrete manner from one linear operating regime to another. There is a large literature on models of this kind in econometrics, signal processing, and other fields (Harrison & Stevens, 1976; Chang & Athans, 1978; Hamilton, 1989; Shumway & Stoffer, 1991; Bar-Shalom & Li, 1993). Here we extend these models to allow for multiple real-valued state vectors, draw connections between these fields and the relevant literature on neural computation and probabilistic graphical models, and derive a learning algorithm for all the parameters of the model based on a structured variational approximation that rigorously maximizes a lower bound on the log-likelihood.

In the following section we review the background material on SSMs, HMMs, and hybrids of the two. In section 3, we describe the generative model, the probability distribution defined over the observation sequences, for switching SSMs.
In section 4, we describe a learning algorithm for switching state-space models that is based on a structured variational approximation to the expectation-maximization algorithm. In section 5 we present simulation results in both an artificial domain, to assess the quality of the approximate inference method, and a natural domain. We conclude with section 6.

2 Background

2.1 State-Space Models. An SSM defines a probability density over time series of real-valued observation vectors {Y_t} by assuming that the observations were generated from a sequence of hidden state vectors {X_t}. (Appendix A describes the variables and notation used throughout this article.) In particular, the SSM specifies that given the hidden state vector at one time step, the observation vector at that time step is statistically independent from all other observation vectors, and that the hidden state vectors

Figure 1: A directed acyclic graph (DAG) specifying conditional independence relations for a state-space model. Each node is conditionally independent from its nondescendants given its parents. The output Y_t is conditionally independent from all other variables given the state X_t; and X_t is conditionally independent from X_1, …, X_{t-2} given X_{t-1}. In this and the following figures, shaded nodes represent observed variables, and unshaded nodes represent hidden variables.

obey the Markov independence property. The joint probability for the sequences of states X_t and observations Y_t can therefore be factored as:

P({X_t, Y_t}) = P(X_1) P(Y_1 | X_1) ∏_{t=2}^{T} P(X_t | X_{t-1}) P(Y_t | X_t).   (2.1)

The conditional independencies specified by equation 2.1 can be expressed graphically in the form of Figure 1. The simplest and most commonly used models of this kind assume that the transition and output functions are linear and time invariant and the distributions of the state and observation variables are multivariate gaussian. We use the term state-space model to refer to this simple form of the model. For such models, the state transition function is

X_t = A X_{t-1} + w_t,   (2.2)

where A is the state transition matrix and w_t is zero-mean gaussian noise in the dynamics, with covariance matrix Q. P(X_1) is assumed to be gaussian. Equation 2.2 ensures that if P(X_{t-1}) is gaussian, so is P(X_t). The output function is

Y_t = C X_t + v_t,   (2.3)

where C is the output matrix and v_t is zero-mean gaussian output noise with covariance matrix R; P(Y_t | X_t) is therefore also gaussian:

P(Y_t | X_t) = (2π)^{-D/2} |R|^{-1/2} exp{ -(1/2) (Y_t - C X_t)' R^{-1} (Y_t - C X_t) },   (2.4)

where D is the dimensionality of the Y vectors.
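Equations 2.2 through 2.4 define a generative process that is straightforward to simulate. The following is a minimal sketch assuming NumPy; the function name and the rotating-state example are ours, not the article's.

```python
import numpy as np

def sample_lds(A, C, Q, R, x1, T, rng=None):
    """Sample a trajectory from the linear-gaussian state-space model
    X_t = A X_{t-1} + w_t,  Y_t = C X_t + v_t  (equations 2.2 and 2.3)."""
    rng = np.random.default_rng(rng)
    K, D = A.shape[0], C.shape[0]
    X = np.zeros((T, K))
    Y = np.zeros((T, D))
    X[0] = x1
    for t in range(T):
        if t > 0:
            X[t] = A @ X[t - 1] + rng.multivariate_normal(np.zeros(K), Q)
        Y[t] = C @ X[t] + rng.multivariate_normal(np.zeros(D), R)
    return X, Y

# Example: a slowly rotating two-dimensional state observed in one dimension.
A = np.array([[np.cos(0.1), -np.sin(0.1)],
              [np.sin(0.1),  np.cos(0.1)]])
C = np.array([[1.0, 0.0]])
X, Y = sample_lds(A, C, Q=0.01 * np.eye(2), R=np.array([[0.1]]),
                  x1=np.array([1.0, 0.0]), T=100)
```

Because both noise terms are gaussian and the maps are linear, every marginal and conditional of this process is gaussian, which is what makes the exact inference recursions below tractable.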

Often the observation vector can be divided into input (or predictor) variables and output (or response) variables. To model the input-output behavior of such a system, the conditional probability of output sequences given input sequences, the linear gaussian SSM can be modified to have a state-transition function,

X_t = A X_{t-1} + B U_t + w_t,   (2.5)

where U_t is the input observation vector and B is the (fixed) input matrix.¹

The problem of inference or state estimation for an SSM with known parameters consists of estimating the posterior probabilities of the hidden variables given a sequence of observed variables. Since the local likelihood functions for the observations are gaussian and the priors for the hidden states are gaussian, the resulting posterior is also gaussian. Three special cases of the inference problem are often considered: filtering, smoothing, and prediction (Anderson & Moore, 1979; Goodwin & Sin, 1984). The goal of filtering is to compute the probability of the current hidden state X_t given the sequence of inputs and outputs up to time t, P(X_t | {Y}_1^t, {U}_1^t).² The recursive algorithm used to perform this computation is known as the Kalman filter (Kalman & Bucy, 1961). The goal of smoothing is to compute the probability of X_t given the sequence of inputs and outputs up to time T, where T > t. The Kalman filter is used in the forward direction to compute the probability of X_t given {Y}_1^t and {U}_1^t. A similar set of backward recursions from T to t completes the computation by accounting for the observations after time t (Rauch, 1963). We will refer to the combined forward and backward recursions for smoothing as the Kalman smoothing recursions (also known as the RTS, or Rauch-Tung-Striebel smoother). Finally, the goal of prediction is to compute the probability of future states and observations given observations up to time t. Given P(X_t | {Y}_1^t, {U}_1^t) computed as before, the model is simulated in the forward direction using equations 2.2 (or 2.5 if there are inputs) and 2.3 to compute the probability density of the state or output at future time t + τ.
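The filtering recursions just described can be sketched in a textbook predict/update form. This is our own NumPy sketch (without the input term of equation 2.5), not code from the article.

```python
import numpy as np

def kalman_filter(Y, A, C, Q, R, mu1, V1):
    """Forward (filtering) recursions: returns means and covariances of
    P(X_t | Y_1, ..., Y_t) for the model of equations 2.2 and 2.3."""
    T = Y.shape[0]
    K = A.shape[0]
    mus, Vs = np.zeros((T, K)), np.zeros((T, K, K))
    mu_pred, V_pred = mu1, V1                     # prior on X_1
    for t in range(T):
        # Update: condition the predicted gaussian on the observation Y_t.
        S = C @ V_pred @ C.T + R                  # innovation covariance
        Kt = V_pred @ C.T @ np.linalg.inv(S)      # Kalman gain
        mus[t] = mu_pred + Kt @ (Y[t] - C @ mu_pred)
        Vs[t] = V_pred - Kt @ C @ V_pred
        # Predict: propagate the filtered estimate through the dynamics.
        mu_pred = A @ mus[t]
        V_pred = A @ Vs[t] @ A.T + Q
    return mus, Vs
```

The smoothing recursions add a backward pass over these filtered estimates; we omit that pass here for brevity.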
The problem of learning the parameters of an SSM is known in engineering as the system identification problem; in its most general form it assumes access only to sequences of input and output observations. We focus on maximum likelihood learning in which a single (locally optimal) value of the parameters is estimated, rather than Bayesian approaches that treat the parameters as random variables and compute or approximate the posterior distribution of the parameters given the data. One can also distinguish between on-line and off-line approaches to learning. On-line recursive algorithms, favored in real-time adaptive control applications, can be obtained by computing the gradient or the second derivatives of the log-likelihood (Ljung

¹ One can also define the state such that X_{t+1} = A X_t + B U_t + w_t.
² The notation {Y}_1^t is shorthand for the sequence Y_1, …, Y_t.

& Söderström, 1983). Similar gradient-based methods can be obtained for off-line methods. An alternative method for off-line learning makes use of the expectation-maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977). This procedure iterates between an E-step that fixes the current parameters and computes posterior probabilities over the hidden states given the observations, and an M-step that maximizes the expected log-likelihood of the parameters using the posterior distribution computed in the E-step. For linear gaussian state-space models, the E-step is exactly the Kalman smoothing problem as defined above, and the M-step simplifies to a linear regression problem (Shumway & Stoffer, 1982; Digalakis, Rohlicek, & Ostendorf, 1993). Details on the EM algorithm for SSMs can be found in Ghahramani and Hinton (1996a), as well as in the original Shumway and Stoffer (1982) article.

2.2 Hidden Markov Models. Hidden Markov models also define probability distributions over sequences of observations {Y_t}. The distribution over sequences is obtained by specifying a distribution over observations at each time step t given a discrete hidden state S_t, and the probability of transitioning from one hidden state to another. Using the Markov property, the joint probability for the sequences of states S_t and observations Y_t can be factored in exactly the same manner as equation 2.1, with S_t taking the place of X_t:

P({S_t, Y_t}) = P(S_1) P(Y_1 | S_1) ∏_{t=2}^{T} P(S_t | S_{t-1}) P(Y_t | S_t).   (2.6)

Similarly, the conditional independencies in an HMM can be expressed graphically in the same form as Figure 1. The state is represented by a single multinomial variable that can take one of K discrete values, S_t ∈ {1, …, K}. The state transition probabilities, P(S_t | S_{t-1}), are specified by a K × K transition matrix. If the observables are discrete symbols taking on one of L values, the observation probabilities P(Y_t | S_t) can be fully specified as a K × L observation matrix. For a continuous observation vector, P(Y_t | S_t) can be modeled in many different forms, such as a gaussian, mixture of gaussians, or neural network.
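As an illustration, the likelihood of a sequence under equation 2.6, marginalized over the hidden states, can be computed with the standard forward recursion. This sketch assumes NumPy and a precomputed matrix of observation probabilities; the function name and argument layout are ours.

```python
import numpy as np

def hmm_loglik(pi, Phi, obs):
    """Log-likelihood of a sequence under an HMM via the forward recursion.
    pi[k] = P(S_1 = k); Phi[i, j] = P(S_t = j | S_{t-1} = i);
    obs[t, k] = P(Y_t | S_t = k).  Normalizing alpha at each step keeps the
    recursion numerically stable for long sequences."""
    T, K = obs.shape
    alpha = pi * obs[0]
    ll = np.log(alpha.sum())
    alpha /= alpha.sum()
    for t in range(1, T):
        alpha = obs[t] * (alpha @ Phi)
        ll += np.log(alpha.sum())
        alpha /= alpha.sum()
    return ll
```

For a single-state HMM this reduces to summing the log observation probabilities, which is a convenient sanity check.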
HMMs have been applied extensively to problems in speech recognition (Juang & Rabiner, 1991), computational biology (Baldi, Chauvin, Hunkapiller, & McClure, 1994), and fault detection (Smyth, 1994). Given an HMM with known parameters and a sequence of observations, two algorithms are commonly used to solve two different forms of the inference problem (Rabiner & Juang, 1986). The first computes the posterior probabilities of the hidden states using a recursive algorithm known as the forward-backward algorithm. The computations in the forward pass are exactly analogous to the Kalman filter for SSMs, and the computations in the backward pass are analogous to the backward pass of the Kalman smoothing

equations. As noted by Bridle (pers. comm., 1985) and Smyth, Heckerman, and Jordan (1997), the forward-backward algorithm is a special case of exact inference algorithms for more general graphical probabilistic models (Lauritzen & Spiegelhalter, 1988; Pearl, 1988). The same observation holds true for the Kalman smoothing recursions. The other inference problem commonly posed for HMMs is to compute the single most likely sequence of hidden states. The solution to this problem is given by the Viterbi algorithm, which also consists of a forward and backward pass through the model.

To learn maximum likelihood parameters for an HMM given sequences of observations, one can use the well-known Baum-Welch algorithm (Baum, Petrie, Soules, & Weiss, 1970). This algorithm is a special case of EM that uses the forward-backward algorithm to infer the posterior probabilities of the hidden states in the E-step. The M-step uses expected counts of transitions and observations to reestimate the transition and output matrices (or linear regression equations in the case where the observations are gaussian distributed). Like SSMs, HMMs can be augmented to allow for input variables, such that they model the conditional distribution of sequences of output observations given sequences of inputs (Cacciatore & Nowlan, 1994; Bengio & Frasconi, 1995; Meila & Jordan, 1996).

2.3 Hybrids. A burgeoning literature on models that combine the discrete transition structure of HMMs with the linear dynamics of SSMs has developed in fields ranging from econometrics to control engineering (Harrison & Stevens, 1976; Chang & Athans, 1978; Hamilton, 1989; Shumway & Stoffer, 1991; Bar-Shalom & Li, 1993; Deng, 1993; Kadirkamanathan & Kadirkamanathan, 1996; Chaer, Bishop, & Ghosh, 1997). These models are known alternately as hybrid models, SSMs with switching, and jump-linear systems. We briefly review some of this literature, including some related neural network models.³
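The Viterbi algorithm mentioned above admits a compact dynamic-programming sketch. We work in the log domain and assume NumPy; the function name and argument layout are ours, with `log_obs[t, k] = log P(Y_t | S_t = k)`.

```python
import numpy as np

def viterbi(log_pi, log_Phi, log_obs):
    """Most likely hidden state sequence for an HMM (log domain):
    a forward max-product pass followed by a backward trace."""
    T, K = log_obs.shape
    delta = np.zeros((T, K))
    back = np.zeros((T, K), dtype=int)
    delta[0] = log_pi + log_obs[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_Phi   # scores[i, j]: from i to j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_obs[t]
    states = np.zeros(T, dtype=int)
    states[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):                 # backward trace
        states[t] = back[t + 1, states[t + 1]]
    return states
```

Unlike forward-backward, which yields per-time-step marginals, this returns a single jointly most probable path; the two can disagree.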
Shortly after Kalman and Bucy solved the problem of state estimation for linear gaussian SSMs, attention turned to the analogous problem for switching models (Ackerson & Fu, 1970). Chang and Athans (1978) derive the equations for computing the conditional mean and variance of the state when the parameters of a linear SSM switch according to arbitrary and Markovian dynamics. The prior and transition probabilities of the switching process are assumed to be known. They note that for M models (sets of parameters) and an observation length T, the exact conditional distribution of the state is a gaussian mixture with M^T components. The conditional mean and variance, which require far less computation, are therefore only summary statistics.

³ A review of how SSMs and HMMs are related to simpler statistical models such as principal components analysis, factor analysis, mixture of gaussians, vector quantization, and independent components analysis (ICA) can be found in Roweis and Ghahramani (1999).

Figure 2: Directed acyclic graphs specifying conditional independence relations for various switching state-space models. (a) Shumway and Stoffer (1991): the output matrix (C in equation 2.3) switches independently between a fixed number of choices at each time step. Its setting is represented by the discrete hidden variable S_t; (b) Bar-Shalom and Li (1993): both the output equation and the dynamic equation can switch, and the switches are Markov; (c) Kim (1994); (d) Fraser and Dimitriadis (1993): outputs and states are observed. Here we have shown a simple case where the output depends directly on the current state, previous state, and previous output.

Shumway and Stoffer (1991) consider the problem of learning the parameters of SSMs with a single real-valued hidden state vector and switching output matrices. The probability of choosing a particular output matrix is a prespecified time-varying function, independent of previous choices (see Figure 2a). A pseudo-EM algorithm is derived in which the E-step, which

in its exact form would require computing a gaussian mixture with M^T components, is approximated by a single gaussian at each time step.

Bar-Shalom and Li (1993; sec. 11.6) review models in which both the state dynamics and the output matrices switch, and where the switching follows Markovian dynamics (see Figure 2b). They present several methods for approximately solving the state-estimation problem in switching models (they do not discuss parameter estimation for such models). These methods, referred to as generalized pseudo-Bayesian (GPB) and interacting multiple models (IMM), are all based on the idea of collapsing into one gaussian the mixture of M gaussians that results from considering all the settings of the switch state at a given time step. This avoids the exponential growth of mixture components at the cost of providing an approximate solution. More sophisticated but computationally expensive methods that collapse M² gaussians into M gaussians are also derived. Kim (1994) derives a similar approximation for a closely related model, which also includes observed input variables (see Figure 2c). Furthermore, Kim discusses parameter estimation for this model, although without making reference to the EM algorithm. Other authors have used Markov chain Monte Carlo methods for state and parameter estimation in switching models (Carter & Kohn, 1994; Athaide, 1995) and in other related dynamic probabilistic networks (Dean & Kanazawa, 1989; Kanazawa, Koller, & Russell, 1995).

Hamilton (1989, 1994, sec. 22.4) describes a class of switching models in which the real-valued observation at time t, Y_t, depends on both the observations at times t-1 to t-r and the discrete states at time t to t-r. More precisely, Y_t is gaussian with mean that is a linear function of Y_{t-1}, …, Y_{t-r} and of binary indicator variables for the discrete states, S_t, …, S_{t-r}. The system can therefore be seen as an (r + 1)th order HMM driving an rth order autoregressive process, and is tractable for small r and number of discrete states in S.

Hamilton's models are closely related to hidden filter HMMs (HFHMM; Fraser & Dimitriadis, 1993).
HFHMMs have both discrete and real-valued states. However, the real-valued states are assumed to be either observed or a known, deterministic function of the past observations (i.e., an embedding). The outputs depend on the states and previous outputs, and the form of this dependence can switch randomly (see Figure 2d). Because at any time step the only hidden variable is the switch state, S_t, exact inference in this model can be carried out tractably. The resulting algorithm is a variant of the forward-backward procedure for HMMs. Kehagias and Petridis (1997) and Pawelzik, Kohlmorgen, and Müller (1996) present other variants of this model.

Elliott, Aggoun, and Moore (1995; sec. 12.5) present an inference algorithm for hybrid (Markov switching) systems for which there is a separate observable from which the switch state can be estimated. The true switch states, S_t, are represented as unit vectors in R^M, and the estimated switch state is a vector in the unit square with elements corresponding to the estimated probability of being in each switch state. The real-valued state, X_t, is approximated as a gaussian given the estimated switch state by forming a linear combination of the transition and observation matrices for the different SSMs weighted by the estimated switch state. Elliott et al. also derive control equations for such hybrid systems and discuss applications of the change-of-measures whitening procedure to a large family of models.

With regard to the literature on neural computation, the model presented in this article is a generalization of both the mixture-of-experts neural network (Jacobs et al., 1991; Jordan & Jacobs, 1994) and the related mixture of factor analyzers (Hinton, Dayan, & Revow, 1997; Ghahramani & Hinton, 1996b). Previous dynamical generalizations of the mixture-of-experts architecture consider the case in which the gating network has Markovian dynamics (Cacciatore & Nowlan, 1994; Kadirkamanathan & Kadirkamanathan, 1996; Meila & Jordan, 1996). One limitation of this generalization is that the entire past sequence is summarized in the value of a single discrete variable (the gating activation), which for a system with M experts can convey on average at most log M bits of information about the past. In the models we consider here, both the experts and the gating network have Markovian dynamics. The past is therefore summarized by a state composed of the cross-product of the discrete variable and the combined real-valued state-space of all the experts. This provides a much wider information channel from the past. One advantage of this representation is that the real-valued state can contain componential structure. Thus, attributes such as the position, orientation, and scale of an object in an image, which are most naturally encoded as independent real-valued variables, can be accommodated in the state without the exponential growth required of a discretized HMM-like representation.

It is important to place the work in this article in the context of the literature we have just reviewed.
The hybrid models, state-space models with switching, and jump-linear systems we have described all assume a single real-valued state vector. The model considered in this article generalizes this to multiple real-valued state vectors.⁴ Unlike the models described in Hamilton (1994), Fraser and Dimitriadis (1993), and the current dynamical extensions of mixtures of experts, in the model we present, the real-valued state vectors are hidden. The inference algorithm we derive, which is based on making a structured variational approximation, is entirely novel in the context of switching SSMs. Specifically, our method is unlike all the approximate methods we have reviewed in that it is not based on fitting a single gaussian to a mixture of gaussians by computing the mean and covariance of the mixture.⁵ We derive a learning algorithm for all of the parameters of the model, including the Markov switching parameters. This algorithm maximizes a lower bound on the log-likelihood of the data rather than a heuristically motivated approximation to the likelihood. The algorithm has a simple and intuitive flavor: It decouples into forward-backward recursions on an HMM and Kalman smoothing recursions on each SSM. The states of the HMM determine the soft assignment of each observation to an SSM; the prediction errors of the SSMs determine the observation probabilities for the HMM.

⁴ Note that the state vectors could be concatenated into one large state vector with factorized (block-diagonal) transition matrices (cf. factorial hidden Markov model; Ghahramani & Jordan, 1997). However, this obscures the decoupled structure of the model.
⁵ Both classes of methods can be seen as minimizing Kullback-Leibler (KL) divergences. However, the KL divergence is asymmetrical, and whereas the variational methods minimize it in one direction, the methods that merge gaussians minimize it in the other direction. We return to this point in section 4.2.

3 The Generative Model

In switching SSMs, the sequence of observations {Y_t} is modeled by specifying a probabilistic relation between the observations and a hidden state space comprising M real-valued state vectors, X_t^(m), and one discrete state vector S_t. The discrete state, S_t, is modeled as a multinomial variable that can take on M values: S_t ∈ {1, …, M}; for reasons that will become obvious we refer to it as the switch variable. The joint probability of observations and hidden states can be factored as

P({S_t, X_t^(1), …, X_t^(M), Y_t}) = P(S_1) ∏_{t=2}^{T} P(S_t | S_{t-1}) ∏_{m=1}^{M} P(X_1^(m)) ∏_{t=2}^{T} P(X_t^(m) | X_{t-1}^(m)) ∏_{t=1}^{T} P(Y_t | X_t^(1), …, X_t^(M), S_t),   (3.1)

which corresponds graphically to the conditional independencies represented by Figure 3. Conditioned on a setting of the switch state, S_t = m, the observable is multivariate gaussian with output equation given by state-space model m. Notice that m is used as both an index for the real-valued state variables and a value for the switch state. The probability of the observation vector Y_t is therefore

P(Y_t | X_t^(1), …, X_t^(M), S_t = m) = (2π)^{-D/2} |R|^{-1/2} exp{ -(1/2) (Y_t - C^(m) X_t^(m))' R^{-1} (Y_t - C^(m) X_t^(m)) },   (3.2)

where R is the observation noise covariance matrix and C^(m) is the output matrix for SSM m (cf. equation 2.4 for a single linear-gaussian SSM). Each

Figure 3: (a) Graphical model representation for switching state-space models. S_t is the discrete switch variable, and X_t^(m) are the real-valued state vectors. (b) Switching state-space model depicted as a generalization of the mixture of experts. The dashed arrows correspond to the connections in a mixture of experts. In a switching state-space model, the states of the experts and the gating network also depend on their previous states (solid arrows).

real-valued state vector evolves according to the linear gaussian dynamics of an SSM with differing initial state, transition matrix, and state noise (see equation 2.2). For simplicity we assume that all state vectors have identical dimensionality; the generalization of the algorithms we present to models with different-sized state spaces is immediate. The switch state itself evolves according to the discrete Markov transition structure specified by the initial state probabilities P(S_1) and the state transition matrix P(S_t | S_{t-1}).

An exact analogy can be made to the mixture-of-experts architecture for modular learning in neural networks (see Figure 3b; Jacobs et al., 1991). Each SSM is a linear expert with gaussian output noise model and linear-gaussian dynamics. The switch state gates the outputs of the M SSMs, and therefore plays the role of a gating network with Markovian dynamics.

There are many possible extensions of the model; we shall consider three obvious and straightforward ones:

Ex1) Differing output covariances, R^(m), for each SSM;
Ex2) Differing output means, μ_Y^(m), for each SSM, such that each model is allowed to capture observations in a different operating range;
Ex3) Conditioning on a sequence of observed input vectors, {U_t}.
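The generative process of equations 3.1 and 3.2 (without the three extensions) is easy to simulate: the M state chains evolve independently, and the switch chain picks which output equation generates each observation. A minimal NumPy sketch, with names of our own choosing:

```python
import numpy as np

def sample_switching_ssm(As, Cs, Qs, R, pi, Phi, T, rng=None):
    """Draw one sequence from the switching state-space generative model.
    As, Cs, Qs are length-M lists of per-model dynamics, output, and state
    noise matrices; pi = P(S_1); Phi[i, j] = P(S_t = j | S_{t-1} = i)."""
    rng = np.random.default_rng(rng)
    M = len(As)
    K = As[0].shape[0]
    D = Cs[0].shape[0]
    X = np.zeros((T, M, K))
    Y = np.zeros((T, D))
    S = np.zeros(T, dtype=int)
    S[0] = rng.choice(M, p=pi)
    for t in range(T):
        if t > 0:
            S[t] = rng.choice(M, p=Phi[S[t - 1]])     # Markov switch
        for m in range(M):                            # all M chains evolve
            mean = As[m] @ X[t - 1, m] if t > 0 else np.zeros(K)
            X[t, m] = rng.multivariate_normal(mean, Qs[m])
        m = S[t]                                      # active model emits Y_t
        Y[t] = rng.multivariate_normal(Cs[m] @ X[t, m], R)
    return S, X, Y
```

Note that, as in the model, every chain keeps evolving even when it is not generating the observation; only the output equation is gated by S_t.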

4 Learning

An efficient learning algorithm for the parameters of a switching SSM can be derived by generalizing the EM algorithm (Baum et al., 1970; Dempster et al., 1977). EM alternates between optimizing a distribution over the hidden states (the E-step) and optimizing the parameters given the distribution over hidden states (the M-step). Any distribution over the hidden states, Q({S_t, X_t}), where X_t = [X_t^(1), …, X_t^(M)] is the combined state of the M SSMs, can be used to define a lower bound, B, on the log probability of the observed data:

log P({Y_t} | θ) = log Σ_{{S_t}} ∫ P({S_t, X_t, Y_t} | θ) d{X_t}   (4.1)

= log Σ_{{S_t}} ∫ Q({S_t, X_t}) [ P({S_t, X_t, Y_t} | θ) / Q({S_t, X_t}) ] d{X_t}   (4.2)

≥ Σ_{{S_t}} ∫ Q({S_t, X_t}) log [ P({S_t, X_t, Y_t} | θ) / Q({S_t, X_t}) ] d{X_t} = B(Q, θ),   (4.3)

where θ denotes the parameters of the model and we have made use of Jensen's inequality (Cover & Thomas, 1991) to establish equation 4.3. Both steps of EM increase the lower bound on the log probability of the observed data. The E-step holds the parameters fixed and sets Q to be the posterior distribution over the hidden states given the parameters,

Q({S_t, X_t}) = P({S_t, X_t} | {Y_t}, θ).   (4.4)

This maximizes B with respect to the distribution, turning the lower bound into an equality, which can be easily seen by substitution. The M-step holds the distribution fixed and computes the parameters that maximize B for that distribution. Since B = log P({Y_t} | θ) at the start of the M-step and since the E-step does not affect log P, the two steps combined can never decrease log P. Given the change in the parameters produced by the M-step, the distribution produced by the previous E-step is typically no longer optimal, so the whole procedure must be iterated.

Unfortunately, the exact E-step for switching SSMs is intractable. Like the related hybrid models described in section 2.3, the posterior probability of the real-valued states is a gaussian mixture with M^T terms. This can be seen by using the semantics of directed graphs, in particular the d-separation criterion (Pearl, 1988), which implies that the hidden state variables in Figure 3, while marginally independent, become conditionally dependent given the observation sequence.
This induced dependency effectively couples all of the real-valued hidden state variables to the discrete switch variable, as a

consequence of which the exact posteriors become gaussian mixtures with an exponential number of terms.⁶

In order to derive an efficient learning algorithm for this system, we relax the EM algorithm by approximating the posterior probability of the hidden states. The basic idea is that since expectations with respect to P are intractable, rather than setting Q({S_t, X_t}) = P({S_t, X_t} | {Y_t}) in the E-step, a tractable distribution Q is used to approximate P. This results in an EM learning algorithm that maximizes a lower bound on the log-likelihood. The difference between the bound B and the log-likelihood is given by the Kullback-Leibler (KL) divergence between Q and P:

KL(Q ‖ P) = Σ_{{S_t}} ∫ Q({S_t, X_t}) log [ Q({S_t, X_t}) / P({S_t, X_t} | {Y_t}) ] d{X_t}.   (4.5)

Since the complexity of exact inference in the approximation given by Q is determined by its conditional independence relations, not by its parameters, we can choose Q to have a tractable structure: a graphical representation that eliminates some of the dependencies in P. Given this structure, the parameters of Q are varied to obtain the tightest possible bound by minimizing equation 4.5. Therefore, the algorithm alternates between optimizing the parameters of the distribution Q to minimize equation 4.5 (the E-step) and optimizing the parameters of P given the distribution over the hidden states (the M-step). As in exact EM, both steps increase the lower bound B on the log-likelihood; however, equality is not reached in the E-step.

We will refer to the general strategy of using a parameterized approximating distribution as a variational approximation and refer to the free parameters of the distribution as variational parameters. A completely factorized approximation is often used in statistical physics, where it provides the basis for simple yet powerful mean-field approximations to statistical mechanical systems (Parisi, 1988). Theoretical arguments motivating approximate E-steps are presented in Neal and Hinton (1998; originally in a technical report in 1993).
Saul and Jordan (1996) showed that approximate E-steps could be used to maximize a lower bound on the log-likelihood, and proposed the powerful technique of structured variational approximations to intractable probabilistic networks. The key insight of their work, which this article makes use of, is that by judicious use of an approximation Q, exact inference algorithms can be used on the tractable substructures in an intractable network. A general tutorial on variational approximations can be found in Jordan, Ghahramani, Jaakkola, and Saul (1998).

⁶ The intractability of the E-step or smoothing problem in the simpler single-state switching model has been noted by Ackerson and Fu (1970), Chang and Athans (1978), Bar-Shalom and Li (1993), and others.

Figure 4: Graphical model representation for the structured variational approximation to the posterior distribution of the hidden states of a switching state-space model.

The parameters of the switching SSM are θ = {A^(m), C^(m), Q^(m), μ_X^(m), Q_1^(m), R, π, Φ}, where A^(m) is the state dynamics matrix for model m, C^(m) is its output matrix, Q^(m) is its state noise covariance, μ_X^(m) is the mean of the initial state, Q_1^(m) is the covariance of the initial state, R is the (tied) output noise covariance, π = P(S_1) is the prior for the discrete Markov process, and Φ = P(S_t | S_{t-1}) is the discrete transition matrix. Extensions (Ex1) through (Ex3) can be readily implemented by substituting R^(m) for R, adding means μ_Y^(m), and adding input matrices B^(m).

Although there are many possible approximations to the posterior distribution of the hidden variables that one could use for learning and inference in switching SSMs, we focus on the following:

Q({S_t, X_t}) = (1/Z_Q) [ ψ(S_1) ∏_{t=2}^{T} ψ(S_{t-1}, S_t) ] ∏_{m=1}^{M} ψ(X_1^(m)) ∏_{t=2}^{T} ψ(X_{t-1}^(m), X_t^(m)),   (4.6)

where the ψ are unnormalized probabilities, which we will call potential functions and define soon, and Z_Q is a normalization constant ensuring that Q integrates to one. Although Q has been written in terms of potential functions rather than conditional probabilities, it corresponds to the simple graphical model shown in Figure 4. The terms involving the switch variables S_t define a discrete Markov chain, and the terms involving the state vectors X_t^(m) define M uncoupled SSMs. As in mean-field approximations, we have approximated the stochastically coupled system by removing some of the

couplings of the original system. Specifically, we have removed the stochastic coupling between the chains that results from the fact that the observation at time t depends on all the hidden variables at time t. However, we retain the coupling between the hidden variables at successive time steps since these couplings can be handled exactly using the forward-backward and Kalman smoothing recursions. This approximation is therefore structured, in the sense that not all variables are uncoupled.

The discrete switching process is defined by

ψ(S_1 = m) = P(S_1 = m) q_1^(m)   (4.7)

ψ(S_{t-1}, S_t = m) = P(S_t = m | S_{t-1}) q_t^(m),   (4.8)

where the q_t^(m) are variational parameters of the Q distribution. These parameters scale the probabilities of each of the states of the switch variable at each time step, so that q_t^(m) plays exactly the same role that the observation probability P(Y_t | S_t = m) would play in a regular HMM. We will soon see that minimizing KL(Q ‖ P) results in an equation for q_t^(m) that supports this intuition.

The uncoupled SSMs in the approximation Q are also defined by potential functions that are related to probabilities in the original system. These potentials are the prior and transition probabilities for X^(m) multiplied by a factor that changes these potentials to try to account for the data:

ψ(X_1^(m)) = P(X_1^(m)) [ P(Y_1 | X_1^(m), S_1 = m) ]^{h_1^(m)}   (4.9)

ψ(X_{t-1}^(m), X_t^(m)) = P(X_t^(m) | X_{t-1}^(m)) [ P(Y_t | X_t^(m), S_t = m) ]^{h_t^(m)},   (4.10)

where the h_t^(m) are variational parameters of Q. The vector h_t plays a role very similar to the switch variable S_t. Each component h_t^(m) can range between 0 and 1. When h_t^(m) = 0, the posterior probability of X_t^(m) under Q does not depend on the observation at time t, Y_t. When h_t^(m) = 1, the posterior probability of X_t^(m) under Q includes a term that assumes that SSM m generated Y_t. We call h_t^(m) the responsibility assigned to SSM m for the observation vector Y_t. The difference between h_t^(m) and S_t^(m) is that h_t^(m) is a deterministic parameter, while S_t^(m) is a stochastic random variable.

To maximize the lower bound on the log-likelihood, KL(Q ‖ P) is minimized with respect to the variational parameters h_t^(m) and q_t^(m) separately for each sequence of observations.
Using the definition of P for the switching state-space model (equations 3.1 and 3.2) and the approximating distribution Q, the minimum of KL satisfies the following fixed-point equations for the variational parameters (see appendix B):

h_t^(m) = Q(S_t = m)  (4.11)

q_t^(m) = exp{ −(1/2) ⟨(Y_t − C^(m) X_t^(m))′ R^{−1} (Y_t − C^(m) X_t^(m))⟩ },  (4.12)

where ⟨·⟩ denotes expectation over the Q distribution. Intuitively, the responsibility h_t^(m) is equal to the probability under Q that SSM m generated observation vector Y_t, and q_t^(m) is an unnormalized gaussian function of the expected squared error if SSM m generated Y_t.

To compute h_t^(m), it is necessary to sum Q over all the S_τ variables not including S_t. This can be done efficiently using the forward-backward algorithm on the switch state variables, with q_t^(m) playing exactly the same role as an observation probability associated with each setting of the switch variable. Since q_t^(m) is related to the prediction error of model m on data Y_t, this has the intuitive interpretation that the switch state associated with models with smaller expected prediction error on a particular observation will be favored at that time step. However, the forward-backward algorithm ensures that the final responsibilities for the models are obtained after considering the entire sequence of observations.

To compute q_t^(m), it is necessary to calculate the expectations of X_t^(m) and X_t^(m) X_t^(m)′ under Q. We see this by expanding equation 4.12:

q_t^(m) = exp{ −(1/2) Y_t′ R^{−1} Y_t + Y_t′ R^{−1} C^(m) ⟨X_t^(m)⟩ − (1/2) tr[ C^(m)′ R^{−1} C^(m) ⟨X_t^(m) X_t^(m)′⟩ ] },  (4.13)

where tr is the matrix trace operator, and we have used tr(AB) = tr(BA). The expectations of X_t^(m) and X_t^(m) X_t^(m)′ can be computed efficiently using the Kalman smoothing algorithm on each SSM, where for model m at time t, the data are weighted by the responsibilities h_t^(m).^7 Since the h parameters depend on the q parameters, and vice versa, the whole process has to be iterated, where each iteration involves calls to the forward-backward and Kalman smoothing algorithms. Once the iterations have converged, the E-step outputs the expected values of the hidden variables under the final Q. The M-step computes the model parameters that optimize the expectation of the log-likelihood (see equation B.7), which is a function of the expectations of the hidden variables. For switching SSMs, all the parameter reestimates can be computed analytically.
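As a concrete illustration of these two fixed-point computations, here is a minimal numpy sketch (not from the paper; array shapes, function names, and the scaled forward-backward implementation are our assumptions). The first function evaluates the potentials q_t^(m) of equation 4.13 from Kalman-smoothed moments; the second computes the responsibilities h_t^(m) = Q(S_t = m) of equation 4.11 by running the forward-backward algorithm with q_t^(m) standing in for the observation probabilities:

```python
import numpy as np

def q_potentials(Y, C, R, Ex, Exx):
    """Eq. 4.13 (sketch): unnormalized gaussian potentials q_t^(m) for one
    SSM m, from smoothed moments <X_t> (Ex, shape T x K) and
    <X_t X_t'> (Exx, shape T x K x K). Y: T x D, C: D x K, R: D x D."""
    Rinv = np.linalg.inv(R)
    T = Y.shape[0]
    log_q = np.empty(T)
    for t in range(T):
        log_q[t] = (-0.5 * Y[t] @ Rinv @ Y[t]
                    + Y[t] @ Rinv @ C @ Ex[t]
                    - 0.5 * np.trace(C.T @ Rinv @ C @ Exx[t]))
    return np.exp(log_q)

def responsibilities(q, pi, Phi):
    """Eq. 4.11 (sketch): h_t^(m) = Q(S_t = m) via forward-backward,
    with q (T x M) playing the role of the observation probabilities.
    pi: (M,) prior; Phi[i, j] = P(S_t = j | S_{t-1} = i)."""
    T, M = q.shape
    alpha = np.empty((T, M))
    beta = np.empty((T, M))
    alpha[0] = pi * q[0]
    alpha[0] /= alpha[0].sum()          # scaled forward pass
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ Phi) * q[t]
        alpha[t] /= alpha[t].sum()
    beta[-1] = 1.0                      # scaled backward pass
    for t in range(T - 2, -1, -1):
        beta[t] = Phi @ (q[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    h = alpha * beta
    return h / h.sum(axis=1, keepdims=True)
```

In an E-step sweep, one would alternate these with per-SSM Kalman smoothing until the h and q values stop changing.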
For example, taking derivatives of the expectation of equation B.7 with respect to C^(m) and setting them to zero,

^7 Weighting the data by h_t^(m) is equivalent to running the Kalman smoother on the unweighted data using a time-varying observation noise covariance matrix R_t^(m) = R / h_t^(m).
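The equivalence stated in footnote 7 is easy to see in a single Kalman measurement update. The following sketch (ours, not the paper's; variable names are assumptions) inflates the observation noise to R/h: with h = 1 the observation is weighted fully, and as h → 0 the gain vanishes and the observation is ignored:

```python
import numpy as np

def measurement_update(x_pred, P_pred, y, C, R, h):
    """One Kalman measurement update with responsibility-weighted data,
    implemented as a time-varying noise covariance R_eff = R / h."""
    R_eff = R / max(h, 1e-12)               # footnote 7: R_t = R / h_t
    S = C @ P_pred @ C.T + R_eff            # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)     # Kalman gain
    x = x_pred + K @ (y - C @ x_pred)       # corrected mean
    P = P_pred - K @ C @ P_pred             # corrected covariance
    return x, P
```

With a tiny responsibility the posterior mean stays at the prediction, which is exactly the behavior of the exponent h_t^(m) in equations 4.9 and 4.10.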

Figure 5: Learning algorithm for switching state-space models.

we get

Ĉ^(m) = [ ∑_{t=1}^T ⟨S_t^(m)⟩ Y_t ⟨X_t^(m)⟩′ ] [ ∑_{t=1}^T ⟨S_t^(m)⟩ ⟨X_t^(m) X_t^(m)′⟩ ]^{−1},  (4.14)

which is a weighted version of the reestimation equations for SSMs. Similarly, the reestimation equations for the switch process are analogous to the Baum-Welch update rules for HMMs. The learning algorithm for switching state-space models using the above structured variational approximation is summarized in Figure 5.

4.1 Deterministic Annealing. The KL divergence minimized in the E-step of the variational EM algorithm can have multiple minima in general. One way to visualize these minima is to consider the space of all possible segmentations of an observation sequence of length T, where by segmentation we mean a discrete partition of the sequence between the SSMs. If there are M SSMs, then there are M^T possible segmentations of the sequence. Given one such segmentation, inferring the optimal distribution for the real-valued states of the SSMs is a convex optimization problem, since these real-valued states are conditionally gaussian. So the difficulty in the KL minimization lies in trying to find the best (soft) partition of the data.
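The weighted reestimate of equation 4.14 is a one-liner given the smoothed statistics. Here is a minimal sketch (ours; shapes and names are assumptions) for a single SSM m:

```python
import numpy as np

def reestimate_C(Y, h, Ex, Exx):
    """Eq. 4.14 (sketch): responsibility-weighted reestimate of the
    output matrix C^(m). Y: (T, D) observations; h: (T,) responsibilities
    <S_t^(m)>; Ex: (T, K) smoothed means <X_t^(m)>; Exx: (T, K, K)
    smoothed second moments <X_t^(m) X_t^(m)'>."""
    num = np.einsum('t,td,tk->dk', h, Y, Ex)    # sum_t <S_t> Y_t <X_t>'
    den = np.einsum('t,tjk->jk', h, Exx)        # sum_t <S_t> <X_t X_t'>
    return num @ np.linalg.inv(den)
```

When the data are exactly linear in the smoothed means, the update recovers the true output matrix, which is a quick sanity check on the weighted normal equations.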

As in other combinatorial optimization problems, the possibility of getting trapped in local minima can be reduced by gradually annealing the cost function. We can employ a deterministic variant of the annealing idea by making the following simple modifications to the variational fixed-point equations 4.11 and 4.12:

h_t^(m) = Q(S_t = m)  (4.15)
q_t^(m) = exp{ −(1/2T) ⟨(Y_t − C^(m) X_t^(m))′ R^{−1} (Y_t − C^(m) X_t^(m))⟩ }.  (4.16)

Here T is a temperature parameter, which is initialized to a large value and gradually reduced to 1. The above equations maximize a modified form of the bound B in equation 4.3, where the entropy of Q has been multiplied by T (Ueda & Nakano, 1995).

4.2 Merging Gaussians. Almost all the approximate inference methods that are described in the literature for switching SSMs are based on the idea of merging, at each time step, a mixture of gaussians into one gaussian. The merged gaussian is obtained by setting its mean and covariance equal to the mean and covariance of the mixture. Here we briefly describe, as an alternative to the variational approximation methods we have derived, how this more traditional gaussian merging procedure can be applied to the model we have defined.

In the switching state-space models described in section 3, there are M different SSMs, with possibly different state-space dimensionalities, so it would be inappropriate to merge their states into one gaussian. However, it is still possible to apply a gaussian merging technique by considering each SSM separately. In each SSM m, the hidden state density produces at each time step a mixture of two gaussians: one for the case S_t = m and one for S_t ≠ m. We merge these two gaussians, weighted by the current estimates of P(S_t = m | Y_1, …, Y_t) and P(S_t ≠ m | Y_1, …, Y_t), respectively. This merged gaussian is used to obtain the gaussian prior for X_{t+1}^(m) for the next time step. We implemented a forward-pass version of this approximate inference scheme, which is analogous to the IMM procedure described in Bar-Shalom and Li (1993). This procedure finds at each time step the best gaussian fit to the current mixture of gaussians for each SSM.
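The annealed update only rescales the exponent of equation 4.12 by the temperature, which flattens the distinction between the SSMs early on. A small sketch of the tempered potential and of a geometric-style temperature schedule (our reading of the schedule used later in section 5.1; names are assumptions):

```python
import numpy as np

def annealed_log_q(sq_err, temperature):
    """Eq. 4.16 (sketch): log q_t^(m) = -sq_err / (2 * temperature), where
    sq_err is the expected squared prediction error
    <(Y_t - C X_t)' R^{-1} (Y_t - C X_t)> for each t, m."""
    return -0.5 * np.asarray(sq_err) / temperature

def temperature_schedule(T0=100.0, n_iter=12):
    """Decay T_{i+1} = T_i / 2 + 1/2, taking the temperature from T0
    toward its fixed point at 1 (as in the experiments of section 5.1)."""
    temps = [float(T0)]
    for _ in range(n_iter - 1):
        temps.append(0.5 * temps[-1] + 0.5)
    return temps
```

At high temperature the potentials for models with very different prediction errors are nearly equal, so the segmentation stays soft; as the temperature approaches 1, equation 4.16 reduces to equation 4.12.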
If we denote the approximating gaussian by Q and the mixture being approximated by P, best is defined here as minimizing KL(P‖Q). Furthermore, gaussian merging techniques are greedy, in that the best gaussian is computed at every time step and used immediately for the next time step. For a gaussian Q, KL(P‖Q) has no local minima, and it is very easy to find the optimal Q by computing the first two moments of P. Inaccuracies in this greedy procedure arise because the estimates of P(S_t | Y_1, …, Y_t) are based on this single merged gaussian, not on the real mixture.
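Computing those first two moments is the whole merge step. A minimal sketch (ours; names and shapes are assumptions) of moment-matching a gaussian mixture, which is the KL(P‖Q)-optimal single gaussian:

```python
import numpy as np

def merge_gaussians(w, means, covs):
    """Moment-matching merge of a gaussian mixture into one gaussian.
    w: (n,) mixture weights summing to 1; means: (n, K); covs: (n, K, K).
    Returns the mixture mean and covariance (within + between terms)."""
    w = np.asarray(w, dtype=float)
    mu = np.einsum('i,ik->k', w, means)               # mixture mean
    diff = np.asarray(means) - mu
    cov = (np.einsum('i,ijk->jk', w, covs)            # average covariance
           + np.einsum('i,ij,ik->jk', w, diff, diff)) # spread of the means
    return mu, cov
```

Note how the merged covariance inflates by the spread of the component means, which is what lets a single gaussian summarize a two-component posterior without discarding the uncertainty about the switch.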

In contrast, variational methods seek to minimize KL(Q‖P), which can have many local minima. Moreover, these methods are not greedy in the same sense: they iterate forward and backward in time until obtaining a locally optimal Q.

5 Simulations

5.1 Experiment 1: Variational Segmentation and Deterministic Annealing. The goal of this experiment was to assess the quality of solutions found by the variational inference algorithm and the effect of using deterministic annealing on these solutions. We generated 200 sequences of length 200 from a simple model that switched between two SSMs. These SSMs and the switching process were defined by:

X_t^(1) = 0.99 X_{t−1}^(1) + w_t^(1),  w_t^(1) ~ N(0, 1)  (5.1)
X_t^(2) = 0.9 X_{t−1}^(2) + w_t^(2),  w_t^(2) ~ N(0, 10)  (5.2)
Y_t = X_t^(m) + v_t,  v_t ~ N(0, 0.1),  (5.3)

where the switch state m was chosen using priors π^(1) = π^(2) = 1/2 and transition probabilities Φ_11 = Φ_22 = 0.95; Φ_12 = Φ_21 = 0.05. Five sequences from this data set are shown in Figure 6, along with the true state of the switch variable.

We compared three different inference algorithms: variational inference, variational inference with deterministic annealing (section 4.1), and inference by gaussian merging (section 4.2). For each sequence, we initialized the variational inference algorithms with equal responsibilities for the two SSMs and ran them for 12 iterations. The nonannealed inference algorithm ran at a fixed temperature of T = 1, while the annealed algorithm was initialized to a temperature of T = 100, which decayed down to 1 over the 12 iterations, using the decay function T_{i+1} = (1/2) T_i + 1/2. To eliminate the effect of model inaccuracies, we gave all three inference algorithms the true parameters of the generative model.

The segmentations found by the nonannealed variational inference algorithm showed little similarity to the true segmentations of the data (see Figure 7). Furthermore, the nonannealed algorithm generally underestimated the number of switches, often converging on solutions with no switches at all. Both the annealed variational algorithm and the gaussian merging method found segmentations that were more similar to the true segmentations of the data.
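A generator for this synthetic model is a few lines of numpy; the sketch below (ours, not the paper's code; 0/1 indices stand for SSMs 1/2) samples one sequence from equations 5.1 through 5.3. Both state vectors evolve at every step, and the switch variable selects which one drives the observation:

```python
import numpy as np

def generate(T=200, seed=0):
    """Sample one sequence from the two-SSM switching model of
    eqs. 5.1-5.3 (sketch): returns observations Y and true switches S."""
    rng = np.random.default_rng(seed)
    Phi = np.array([[0.95, 0.05],     # Phi_11 = Phi_22 = 0.95
                    [0.05, 0.95]])    # Phi_12 = Phi_21 = 0.05
    a = [0.99, 0.9]                   # state dynamics of the two SSMs
    q = [1.0, 10.0]                   # state noise variances
    s = rng.choice(2)                 # pi(1) = pi(2) = 1/2
    x = [0.0, 0.0]
    S = np.empty(T, dtype=int)
    Y = np.empty(T)
    for t in range(T):
        # both chains evolve regardless of the switch state
        x = [a[m] * x[m] + rng.normal(0.0, np.sqrt(q[m])) for m in range(2)]
        Y[t] = x[s] + rng.normal(0.0, np.sqrt(0.1))   # v_t ~ N(0, 0.1)
        S[t] = s
        s = rng.choice(2, p=Phi[s])   # Markov switch transition
    return Y, S
```

Because the two processes overlap heavily in value, segmenting such sequences from the dynamics alone is genuinely hard, which is the point of the experiment.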
Comparing percentage correct segmentations, we see that annealing substantially improves the variational inference method and that the gaussian merging and annealed variational methods perform comparably (see Figure 8). The average performance of the annealed variational method is only about 1.3% better than gaussian merging.
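The percentage-correct measure reported in Figure 8 simply counts agreeing time steps; a one-line sketch (ours):

```python
import numpy as np

def percent_correct(true_seg, est_seg):
    """Percentage of time steps at which the true and estimated
    segmentations agree (the measure histogrammed in Figure 8)."""
    true_seg = np.asarray(true_seg)
    est_seg = np.asarray(est_seg)
    return 100.0 * np.mean(true_seg == est_seg)
```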

Figure 6: Five data sequences of length 200, with their true segmentations below them. In the segmentations, switch states 1 and 2 are represented by the presence and absence of dots, respectively. Notice that it is difficult to segment the sequences correctly based only on knowing the dynamics of the two processes.

5.2 Experiment 2: Modeling Respiration in a Patient with Sleep Apnea. Switching state-space models should prove useful in modeling time series that have dynamics characterized by several different regimes. To illustrate this point, we examined a physiological data set from a patient tentatively diagnosed with sleep apnea, a medical condition in which an individual intermittently stops breathing during sleep. The data were obtained from the repository of time-series data sets associated with the Santa Fe Time Series Analysis and Prediction Competition (Weigend & Gershenfeld, 1993) and are described in detail in Rigney et al. (1993).^8 The respiration pattern in sleep apnea is characterized by at least two regimes: no breathing and gasping breathing induced by a reflex arousal. Furthermore, in this patient there also seem to be periods of normal rhythmic breathing (see Figure 9).

^8 The data are available online at http://…aweigend/Time-Series/SantaFe.html#setB. We used 1000 samples for training and 1000 for testing.

Figure 7: For 10 different sequences of length 200, segmentations are shown with presence and absence of dots corresponding to the two SSMs generating these data. The rows are the segmentations found using the variational method with no annealing, the variational method with deterministic annealing, the gaussian merging method, and the true segmentation. All three inference algorithms give real-valued h_t^(m); hard segmentations were obtained by thresholding the final h_t^(m) values at 0.5. The first five sequences are the ones shown in Figure 6.

Figure 8: Histograms of percentage correct segmentations: (a) control, using random segmentation; (b) variational inference without annealing; (c) variational inference with annealing; (d) gaussian merging. Percentage correct segmentation was computed by counting the number of time steps for which the true and estimated segmentations agree.

We trained switching SSMs, varying the random seed, the number of components in the mixture (M = 2 to 5), and the dimensionality of the state space in each component (K = 1 to 10), on a data set consisting of 1000 consecutive measurements of the chest volume. As controls, we also trained simple SSMs (i.e., M = 1), varying the dimension of the state space from K = 1 to 10, and simple HMMs (i.e., K = 0), varying the number of discrete hidden states from M = 2 to M = 50. Simulations were run until convergence or for 200 iterations, whichever came first; convergence was assessed by measuring the change in the likelihood (or bound on the likelihood) over consecutive steps of EM. The likelihood of the simple SSMs and the HMMs was calculated on a test set consisting of 1000 consecutive measurements of the chest volume. For the switching SSMs, the likelihood is intractable, so we calculated the lower bound on the likelihood, B. The simple SSMs modeled the data very poorly for K = 1, and the performance was flat for values of K = 2 to 10 (see

Figure 9: Chest volume (respiration force) of a patient with sleep apnea during two noncontinuous time segments of the same night (measurements sampled at 2 Hz). (a) Training data. Apnea is characterized by extended periods of small variability in chest volume, followed by bursts (gasping). Here we see such behavior around t = 250, followed by normal rhythmic breathing. (b) Test data. In this segment we find several instances of apnea and an approximately rhythmic region. (The thick lines at the bottom of each plot are explained in the main text.)

Figure 10a). The large majority of runs of the switching state-space model resulted in models with higher likelihood than those of the simple SSMs (see Figures 10b–10e). One consistent exception should be noted: for values of M = 2 and K = 6 to 10, the switching SSM performed almost identically to the simple SSM. Exploratory experiments suggest that in these cases, a single component takes responsibility for all the data, so the model has M = 1 effectively. This may be a local minimum problem or a result of poor initialization heuristics. Looking at the learning curves for simple and switching SSMs, it is easy to see that there are plateaus at the solutions found by the simple one-component SSMs that the switching SSM can get caught in (see Figure 11).

Figure 10: Log likelihood (nats per observation) on the test data from a total of almost 400 runs of simple state-space models (a), switching state-space models with differing numbers of components (b–e), and hidden Markov models (f).

The likelihoods for HMMs with around 10 to 20 states were comparable to those of the best switching SSMs (see Figure 10f). Purely in terms of coding efficiency, switching SSMs have little advantage over HMMs on these data. However, it is useful to contrast the solutions learned by HMMs with the solutions learned by the switching SSMs. The thick dots at the bottom of Figures 9a and 9b show the responsibility assigned to one of the two components in a fairly typical switching SSM with M = 2 components of state size K = 2. This component has clearly specialized to modeling the data during periods of apnea, while the other component models the gasps and periods of rhythmic breathing. These two switching components provide a much more intuitive model of the data than the 10 to 20 discrete components needed in an HMM with comparable coding efficiency.^9

^9 By using further assumptions to constrain the model, such as continuity of the real-valued hidden state at switch times, it should be possible to obtain even better performance on these data.

Figure 11: Learning curves for a state-space model (K = 4) and a switching state-space model (M = 2, K = 2).

6 Discussion

The main conclusion we can draw from the first series of experiments is that even when given the correct model parameters, the problem of segmenting a switching time series into its components is difficult. There are combinatorially many alternatives to be considered, and the energy surface suffers from many local minima, so local optimization approaches like the variational method we used are limited by the quality of the initial conditions. Deterministic annealing can be thought of as a sophisticated initialization procedure for the hidden states: the final solution at each temperature provides the initial conditions at the next. We found that annealing substantially improved the quality of the segmentations found.

The first experiment also indicates that the much simpler gaussian merging method performs comparably to annealed variational inference. The gaussian merging methods have the advantage that at each time step, the cost function minimized has no local minima. This may account for how well they perform relative to the nonannealed variational method. On the other hand, the variational methods have the advantage that they iteratively improve their approximation to the posterior, and they define a lower bound

26 856 Zoubin Ghahramani and Geoffrey E. Hinon on he likelihood. Our resuls sugges ha i may be very fruiful o use he gaussian merging mehod o iniialize he variaional inference procedure. Furhermore, i is possible o derive variaional approximaions for oher swiching models described in he lieraure, and a combinaion of gaussian merging and variaional approximaion may provide a fas and robus mehod for learning and inference in hose models. he second series of experimens suggess ha on a real daa se believed o have swiching dynamics, he swiching SS can indeed uncover muliple regimes. When i capures hese regimes, i generalizes o he es se much beer han he simple linear dynamical model. Similar coding efficiency can be obained by using Hs, which due o he discree naure of he sae-space, can model nonlinear dynamics. However, in doing so, he Hs had o use 0 o 20 discree saes, which makes heir soluions less inerpreable. Variaional approximaions provide a powerful ool for inference and learning in complex probabilisic models. We have seen ha when applied o he swiching SS, hey can incorporae wihin a single framework wellknown exac inference mehods like Kalman smoohing and he forwardbackward algorihm. Variaional mehods can be applied o many of he oher classes of inracable swiching models described in secion 2.3. However, raining more complex models also makes apparen he imporance of good mehods for model selecion and iniializaion. o summarize, swiching SSs are a dynamical generalizaion of mixure-of-expers neural neworks, are closely relaed o well-known models in economerics and conrol, and combine he represenaions underlying Hs and linear dynamical sysems. For domains in which we have some a priori belief ha here are muliple, approximaely linear dynamical regimes, swiching SSs provide a naural modeling ool. Variaional approximaions provide a mehod o overcome he mos difficul problem in learning swiching SSs: ha he inference sep is inracable. 
Deterministic annealing further improves on the solutions found by the variational method.

Appendix A: Notation

Symbol     Size       Description
Variables
Y_t        D × 1      observation vector at time t
{Y_t}      D × T      sequence of observation vectors [Y_1, Y_2, …, Y_T]
X_t^(m)    K × 1      state vector of state-space model (SSM) m at time t
X_t        MK × 1     entire real-valued hidden state at time t: X_t = (X_t^(1), …, X_t^(M))


More information

Air Traffic Forecast Empirical Research Based on the MCMC Method

Air Traffic Forecast Empirical Research Based on the MCMC Method Compuer and Informaion Science; Vol. 5, No. 5; 0 ISSN 93-8989 E-ISSN 93-8997 Published by Canadian Cener of Science and Educaion Air Traffic Forecas Empirical Research Based on he MCMC Mehod Jian-bo Wang,

More information

GMM - Generalized Method of Moments

GMM - Generalized Method of Moments GMM - Generalized Mehod of Momens Conens GMM esimaion, shor inroducion 2 GMM inuiion: Maching momens 2 3 General overview of GMM esimaion. 3 3. Weighing marix...........................................

More information

Online Appendix to Solution Methods for Models with Rare Disasters

Online Appendix to Solution Methods for Models with Rare Disasters Online Appendix o Soluion Mehods for Models wih Rare Disasers Jesús Fernández-Villaverde and Oren Levinal In his Online Appendix, we presen he Euler condiions of he model, we develop he pricing Calvo block,

More information

Object tracking: Using HMMs to estimate the geographical location of fish

Object tracking: Using HMMs to estimate the geographical location of fish Objec racking: Using HMMs o esimae he geographical locaion of fish 02433 - Hidden Markov Models Marin Wæver Pedersen, Henrik Madsen Course week 13 MWP, compiled June 8, 2011 Objecive: Locae fish from agging

More information

ACE 562 Fall Lecture 4: Simple Linear Regression Model: Specification and Estimation. by Professor Scott H. Irwin

ACE 562 Fall Lecture 4: Simple Linear Regression Model: Specification and Estimation. by Professor Scott H. Irwin ACE 56 Fall 005 Lecure 4: Simple Linear Regression Model: Specificaion and Esimaion by Professor Sco H. Irwin Required Reading: Griffihs, Hill and Judge. "Simple Regression: Economic and Saisical Model

More information

Let us start with a two dimensional case. We consider a vector ( x,

Let us start with a two dimensional case. We consider a vector ( x, Roaion marices We consider now roaion marices in wo and hree dimensions. We sar wih wo dimensions since wo dimensions are easier han hree o undersand, and one dimension is a lile oo simple. However, our

More information

Time series model fitting via Kalman smoothing and EM estimation in TimeModels.jl

Time series model fitting via Kalman smoothing and EM estimation in TimeModels.jl Time series model fiing via Kalman smoohing and EM esimaion in TimeModels.jl Gord Sephen Las updaed: January 206 Conens Inroducion 2. Moivaion and Acknowledgemens....................... 2.2 Noaion......................................

More information

Physics 235 Chapter 2. Chapter 2 Newtonian Mechanics Single Particle

Physics 235 Chapter 2. Chapter 2 Newtonian Mechanics Single Particle Chaper 2 Newonian Mechanics Single Paricle In his Chaper we will review wha Newon s laws of mechanics ell us abou he moion of a single paricle. Newon s laws are only valid in suiable reference frames,

More information

Chapter 2. First Order Scalar Equations

Chapter 2. First Order Scalar Equations Chaper. Firs Order Scalar Equaions We sar our sudy of differenial equaions in he same way he pioneers in his field did. We show paricular echniques o solve paricular ypes of firs order differenial equaions.

More information

R t. C t P t. + u t. C t = αp t + βr t + v t. + β + w t

R t. C t P t. + u t. C t = αp t + βr t + v t. + β + w t Exercise 7 C P = α + β R P + u C = αp + βr + v (a) (b) C R = α P R + β + w (c) Assumpions abou he disurbances u, v, w : Classical assumions on he disurbance of one of he equaions, eg. on (b): E(v v s P,

More information

Deep Learning: Theory, Techniques & Applications - Recurrent Neural Networks -

Deep Learning: Theory, Techniques & Applications - Recurrent Neural Networks - Deep Learning: Theory, Techniques & Applicaions - Recurren Neural Neworks - Prof. Maeo Maeucci maeo.maeucci@polimi.i Deparmen of Elecronics, Informaion and Bioengineering Arificial Inelligence and Roboics

More information

Inferring Dynamic Dependency with Applications to Link Analysis

Inferring Dynamic Dependency with Applications to Link Analysis Inferring Dynamic Dependency wih Applicaions o Link Analysis Michael R. Siracusa Massachuses Insiue of Technology 77 Massachuses Ave. Cambridge, MA 239 John W. Fisher III Massachuses Insiue of Technology

More information

Bias in Conditional and Unconditional Fixed Effects Logit Estimation: a Correction * Tom Coupé

Bias in Conditional and Unconditional Fixed Effects Logit Estimation: a Correction * Tom Coupé Bias in Condiional and Uncondiional Fixed Effecs Logi Esimaion: a Correcion * Tom Coupé Economics Educaion and Research Consorium, Naional Universiy of Kyiv Mohyla Academy Address: Vul Voloska 10, 04070

More information

ACE 562 Fall Lecture 8: The Simple Linear Regression Model: R 2, Reporting the Results and Prediction. by Professor Scott H.

ACE 562 Fall Lecture 8: The Simple Linear Regression Model: R 2, Reporting the Results and Prediction. by Professor Scott H. ACE 56 Fall 5 Lecure 8: The Simple Linear Regression Model: R, Reporing he Resuls and Predicion by Professor Sco H. Irwin Required Readings: Griffihs, Hill and Judge. "Explaining Variaion in he Dependen

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION SUPPLEMENTARY INFORMATION DOI: 0.038/NCLIMATE893 Temporal resoluion and DICE * Supplemenal Informaion Alex L. Maren and Sephen C. Newbold Naional Cener for Environmenal Economics, US Environmenal Proecion

More information

Block Diagram of a DCS in 411

Block Diagram of a DCS in 411 Informaion source Forma A/D From oher sources Pulse modu. Muliplex Bandpass modu. X M h: channel impulse response m i g i s i Digial inpu Digial oupu iming and synchronizaion Digial baseband/ bandpass

More information

Nature Neuroscience: doi: /nn Supplementary Figure 1. Spike-count autocorrelations in time.

Nature Neuroscience: doi: /nn Supplementary Figure 1. Spike-count autocorrelations in time. Supplemenary Figure 1 Spike-coun auocorrelaions in ime. Normalized auocorrelaion marices are shown for each area in a daase. The marix shows he mean correlaion of he spike coun in each ime bin wih he spike

More information

14 Autoregressive Moving Average Models

14 Autoregressive Moving Average Models 14 Auoregressive Moving Average Models In his chaper an imporan parameric family of saionary ime series is inroduced, he family of he auoregressive moving average, or ARMA, processes. For a large class

More information

Random Walk with Anti-Correlated Steps

Random Walk with Anti-Correlated Steps Random Walk wih Ani-Correlaed Seps John Noga Dirk Wagner 2 Absrac We conjecure he expeced value of random walks wih ani-correlaed seps o be exacly. We suppor his conjecure wih 2 plausibiliy argumens and

More information

References are appeared in the last slide. Last update: (1393/08/19)

References are appeared in the last slide. Last update: (1393/08/19) SYSEM IDEIFICAIO Ali Karimpour Associae Professor Ferdowsi Universi of Mashhad References are appeared in he las slide. Las updae: 0..204 393/08/9 Lecure 5 lecure 5 Parameer Esimaion Mehods opics o be

More information

m = 41 members n = 27 (nonfounders), f = 14 (founders) 8 markers from chromosome 19

m = 41 members n = 27 (nonfounders), f = 14 (founders) 8 markers from chromosome 19 Sequenial Imporance Sampling (SIS) AKA Paricle Filering, Sequenial Impuaion (Kong, Liu, Wong, 994) For many problems, sampling direcly from he arge disribuion is difficul or impossible. One reason possible

More information

Západočeská Univerzita v Plzni, Czech Republic and Groupe ESIEE Paris, France

Západočeská Univerzita v Plzni, Czech Republic and Groupe ESIEE Paris, France ADAPTIVE SIGNAL PROCESSING USING MAXIMUM ENTROPY ON THE MEAN METHOD AND MONTE CARLO ANALYSIS Pavla Holejšovsá, Ing. *), Z. Peroua, Ing. **), J.-F. Bercher, Prof. Assis. ***) Západočesá Univerzia v Plzni,

More information

Two Coupled Oscillators / Normal Modes

Two Coupled Oscillators / Normal Modes Lecure 3 Phys 3750 Two Coupled Oscillaors / Normal Modes Overview and Moivaion: Today we ake a small, bu significan, sep owards wave moion. We will no ye observe waves, bu his sep is imporan in is own

More information

Isolated-word speech recognition using hidden Markov models

Isolated-word speech recognition using hidden Markov models Isolaed-word speech recogniion using hidden Markov models Håkon Sandsmark December 18, 21 1 Inroducion Speech recogniion is a challenging problem on which much work has been done he las decades. Some of

More information

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power Alpaydin Chaper, Michell Chaper 7 Alpaydin slides are in urquoise. Ehem Alpaydin, copyrigh: The MIT Press, 010. alpaydin@boun.edu.r hp://www.cmpe.boun.edu.r/ ehem/imle All oher slides are based on Michell.

More information

Lecture 2-1 Kinematics in One Dimension Displacement, Velocity and Acceleration Everything in the world is moving. Nothing stays still.

Lecture 2-1 Kinematics in One Dimension Displacement, Velocity and Acceleration Everything in the world is moving. Nothing stays still. Lecure - Kinemaics in One Dimension Displacemen, Velociy and Acceleraion Everyhing in he world is moving. Nohing says sill. Moion occurs a all scales of he universe, saring from he moion of elecrons in

More information

Non-parametric techniques. Instance Based Learning. NN Decision Boundaries. Nearest Neighbor Algorithm. Distance metric important

Non-parametric techniques. Instance Based Learning. NN Decision Boundaries. Nearest Neighbor Algorithm. Distance metric important on-parameric echniques Insance Based Learning AKA: neares neighbor mehods, non-parameric, lazy, memorybased, or case-based learning Copyrigh 2005 by David Helmbold 1 Do no fi a model (as do LDA, logisic

More information

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power Alpaydin Chaper, Michell Chaper 7 Alpaydin slides are in urquoise. Ehem Alpaydin, copyrigh: The MIT Press, 010. alpaydin@boun.edu.r hp://www.cmpe.boun.edu.r/ ehem/imle All oher slides are based on Michell.

More information

CONTROL SYSTEMS, ROBOTICS AND AUTOMATION Vol. XI Control of Stochastic Systems - P.R. Kumar

CONTROL SYSTEMS, ROBOTICS AND AUTOMATION Vol. XI Control of Stochastic Systems - P.R. Kumar CONROL OF SOCHASIC SYSEMS P.R. Kumar Deparmen of Elecrical and Compuer Engineering, and Coordinaed Science Laboraory, Universiy of Illinois, Urbana-Champaign, USA. Keywords: Markov chains, ransiion probabiliies,

More information

Chapter 2. Models, Censoring, and Likelihood for Failure-Time Data

Chapter 2. Models, Censoring, and Likelihood for Failure-Time Data Chaper 2 Models, Censoring, and Likelihood for Failure-Time Daa William Q. Meeker and Luis A. Escobar Iowa Sae Universiy and Louisiana Sae Universiy Copyrigh 1998-2008 W. Q. Meeker and L. A. Escobar. Based

More information

2.3 SCHRÖDINGER AND HEISENBERG REPRESENTATIONS

2.3 SCHRÖDINGER AND HEISENBERG REPRESENTATIONS Andrei Tokmakoff, MIT Deparmen of Chemisry, 2/22/2007 2-17 2.3 SCHRÖDINGER AND HEISENBERG REPRESENTATIONS The mahemaical formulaion of he dynamics of a quanum sysem is no unique. So far we have described

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Noes for EE7C Spring 018: Convex Opimizaion and Approximaion Insrucor: Moriz Hard Email: hard+ee7c@berkeley.edu Graduae Insrucor: Max Simchowiz Email: msimchow+ee7c@berkeley.edu Ocober 15, 018 3

More information

Augmented Reality II - Kalman Filters - Gudrun Klinker May 25, 2004

Augmented Reality II - Kalman Filters - Gudrun Klinker May 25, 2004 Augmened Realiy II Kalman Filers Gudrun Klinker May 25, 2004 Ouline Moivaion Discree Kalman Filer Modeled Process Compuing Model Parameers Algorihm Exended Kalman Filer Kalman Filer for Sensor Fusion Lieraure

More information

t is a basis for the solution space to this system, then the matrix having these solutions as columns, t x 1 t, x 2 t,... x n t x 2 t...

t is a basis for the solution space to this system, then the matrix having these solutions as columns, t x 1 t, x 2 t,... x n t x 2 t... Mah 228- Fri Mar 24 5.6 Marix exponenials and linear sysems: The analogy beween firs order sysems of linear differenial equaions (Chaper 5) and scalar linear differenial equaions (Chaper ) is much sronger

More information

Christos Papadimitriou & Luca Trevisan November 22, 2016

Christos Papadimitriou & Luca Trevisan November 22, 2016 U.C. Bereley CS170: Algorihms Handou LN-11-22 Chrisos Papadimiriou & Luca Trevisan November 22, 2016 Sreaming algorihms In his lecure and he nex one we sudy memory-efficien algorihms ha process a sream

More information

Anno accademico 2006/2007. Davide Migliore

Anno accademico 2006/2007. Davide Migliore Roboica Anno accademico 2006/2007 Davide Migliore migliore@ele.polimi.i Today Eercise session: An Off-side roblem Robo Vision Task Measuring NBA layers erformance robabilisic Roboics Inroducion The Bayesian

More information

Approximation Algorithms for Unique Games via Orthogonal Separators

Approximation Algorithms for Unique Games via Orthogonal Separators Approximaion Algorihms for Unique Games via Orhogonal Separaors Lecure noes by Konsanin Makarychev. Lecure noes are based on he papers [CMM06a, CMM06b, LM4]. Unique Games In hese lecure noes, we define

More information

On Measuring Pro-Poor Growth. 1. On Various Ways of Measuring Pro-Poor Growth: A Short Review of the Literature

On Measuring Pro-Poor Growth. 1. On Various Ways of Measuring Pro-Poor Growth: A Short Review of the Literature On Measuring Pro-Poor Growh 1. On Various Ways of Measuring Pro-Poor Growh: A Shor eview of he Lieraure During he pas en years or so here have been various suggesions concerning he way one should check

More information

In this chapter the model of free motion under gravity is extended to objects projected at an angle. When you have completed it, you should

In this chapter the model of free motion under gravity is extended to objects projected at an angle. When you have completed it, you should Cambridge Universiy Press 978--36-60033-7 Cambridge Inernaional AS and A Level Mahemaics: Mechanics Coursebook Excerp More Informaion Chaper The moion of projeciles In his chaper he model of free moion

More information

hen found from Bayes rule. Specically, he prior disribuion is given by p( ) = N( ; ^ ; r ) (.3) where r is he prior variance (we add on he random drif

hen found from Bayes rule. Specically, he prior disribuion is given by p( ) = N( ; ^ ; r ) (.3) where r is he prior variance (we add on he random drif Chaper Kalman Filers. Inroducion We describe Bayesian Learning for sequenial esimaion of parameers (eg. means, AR coeciens). The updae procedures are known as Kalman Filers. We show how Dynamic Linear

More information

Tracking. Announcements

Tracking. Announcements Tracking Tuesday, Nov 24 Krisen Grauman UT Ausin Announcemens Pse 5 ou onigh, due 12/4 Shorer assignmen Auo exension il 12/8 I will no hold office hours omorrow 5 6 pm due o Thanksgiving 1 Las ime: Moion

More information

Non-parametric techniques. Instance Based Learning. NN Decision Boundaries. Nearest Neighbor Algorithm. Distance metric important

Non-parametric techniques. Instance Based Learning. NN Decision Boundaries. Nearest Neighbor Algorithm. Distance metric important on-parameric echniques Insance Based Learning AKA: neares neighbor mehods, non-parameric, lazy, memorybased, or case-based learning Copyrigh 2005 by David Helmbold 1 Do no fi a model (as do LTU, decision

More information

Lecture 2 October ε-approximation of 2-player zero-sum games

Lecture 2 October ε-approximation of 2-player zero-sum games Opimizaion II Winer 009/10 Lecurer: Khaled Elbassioni Lecure Ocober 19 1 ε-approximaion of -player zero-sum games In his lecure we give a randomized ficiious play algorihm for obaining an approximae soluion

More information

Designing Information Devices and Systems I Spring 2019 Lecture Notes Note 17

Designing Information Devices and Systems I Spring 2019 Lecture Notes Note 17 EES 16A Designing Informaion Devices and Sysems I Spring 019 Lecure Noes Noe 17 17.1 apaciive ouchscreen In he las noe, we saw ha a capacior consiss of wo pieces on conducive maerial separaed by a nonconducive

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models Probabilisic reasoning over ime So far, we ve mosly deal wih episodic environmens Excepions: games wih muliple moves, planning In paricular, he Bayesian neworks we ve seen so far describe

More information

Navneet Saini, Mayank Goyal, Vishal Bansal (2013); Term Project AML310; Indian Institute of Technology Delhi

Navneet Saini, Mayank Goyal, Vishal Bansal (2013); Term Project AML310; Indian Institute of Technology Delhi Creep in Viscoelasic Subsances Numerical mehods o calculae he coefficiens of he Prony equaion using creep es daa and Herediary Inegrals Mehod Navnee Saini, Mayank Goyal, Vishal Bansal (23); Term Projec

More information

Lecture Notes 2. The Hilbert Space Approach to Time Series

Lecture Notes 2. The Hilbert Space Approach to Time Series Time Series Seven N. Durlauf Universiy of Wisconsin. Basic ideas Lecure Noes. The Hilber Space Approach o Time Series The Hilber space framework provides a very powerful language for discussing he relaionship

More information

Probabilistic Robotics

Probabilistic Robotics Probabilisic Roboics Bayes Filer Implemenaions Gaussian filers Bayes Filer Reminder Predicion bel p u bel d Correcion bel η p z bel Gaussians : ~ π e p N p - Univariae / / : ~ μ μ μ e p Ν p d π Mulivariae

More information

2.160 System Identification, Estimation, and Learning. Lecture Notes No. 8. March 6, 2006

2.160 System Identification, Estimation, and Learning. Lecture Notes No. 8. March 6, 2006 2.160 Sysem Idenificaion, Esimaion, and Learning Lecure Noes No. 8 March 6, 2006 4.9 Eended Kalman Filer In many pracical problems, he process dynamics are nonlinear. w Process Dynamics v y u Model (Linearized)

More information

Lecture 33: November 29

Lecture 33: November 29 36-705: Inermediae Saisics Fall 2017 Lecurer: Siva Balakrishnan Lecure 33: November 29 Today we will coninue discussing he boosrap, and hen ry o undersand why i works in a simple case. In he las lecure

More information

WATER LEVEL TRACKING WITH CONDENSATION ALGORITHM

WATER LEVEL TRACKING WITH CONDENSATION ALGORITHM WATER LEVEL TRACKING WITH CONDENSATION ALGORITHM Shinsuke KOBAYASHI, Shogo MURAMATSU, Hisakazu KIKUCHI, Masahiro IWAHASHI Dep. of Elecrical and Elecronic Eng., Niigaa Universiy, 8050 2-no-cho Igarashi,

More information

Failure of the work-hamiltonian connection for free energy calculations. Abstract

Failure of the work-hamiltonian connection for free energy calculations. Abstract Failure of he work-hamilonian connecion for free energy calculaions Jose M. G. Vilar 1 and J. Miguel Rubi 1 Compuaional Biology Program, Memorial Sloan-Keering Cancer Cener, 175 York Avenue, New York,

More information

RC, RL and RLC circuits

RC, RL and RLC circuits Name Dae Time o Complee h m Parner Course/ Secion / Grade RC, RL and RLC circuis Inroducion In his experimen we will invesigae he behavior of circuis conaining combinaions of resisors, capaciors, and inducors.

More information

Final Spring 2007

Final Spring 2007 .615 Final Spring 7 Overview The purpose of he final exam is o calculae he MHD β limi in a high-bea oroidal okamak agains he dangerous n = 1 exernal ballooning-kink mode. Effecively, his corresponds o

More information

Linear Gaussian State Space Models

Linear Gaussian State Space Models Linear Gaussian Sae Space Models Srucural Time Series Models Level and Trend Models Basic Srucural Model (BSM Dynamic Linear Models Sae Space Model Represenaion Level, Trend, and Seasonal Models Time Varying

More information