An EM based training algorithm for recurrent neural networks


Jan Unkelbach, Sun Yi, and Jürgen Schmidhuber
IDSIA, Galleria 2, 6928 Manno, Switzerland
{jan.unkelbach,yi,juergen}@idsia.ch
http://www.idsia.ch

Abstract. Recurrent neural networks serve as black-box models for nonlinear dynamical system identification and time series prediction. Training of recurrent networks typically minimizes the quadratic difference between the network output and an observed time series. This implicitly assumes that the dynamics of the underlying system is deterministic, which is not a realistic assumption in many cases. In contrast, state-space models allow for noise in both the internal state transitions and the mapping from internal states to observations. Here, we consider recurrent networks as nonlinear state space models and suggest a training algorithm based on Expectation-Maximization. A nonlinear transfer function for the hidden neurons leads to an intractable inference problem. We investigate the use of a Particle Smoother to approximate the E-step and simultaneously estimate the expectations required in the M-step. The method is demonstrated for a synthetic data set and a time series prediction task arising in radiation therapy, where it is the goal to predict the motion of a lung tumor during respiration.

Key words: Recurrent neural networks, dynamical system identification, EM, Particle Smoother

1 Introduction

Recurrent neural networks (RNNs) represent dynamical systems. With a large enough number of hidden neurons, a given dynamical system can in principle be approximated to arbitrary precision. Hence, recurrent networks have been used as black-box models for dynamical systems, e.g. in the context of time series prediction [1].

When using recurrent neural networks for dynamical system identification, it is typically assumed that the dynamics of the system to be modeled is deterministic. RNNs do not model process noise, i.e. uncertainty in the internal state transition between two time steps. Hence, RNNs may not be suitable models for systems exhibiting process noise as a characterizing feature. In contrast, state space models allow for noise in both the internal state transitions and the mapping from internal states to observations, and therefore represent a richer class of probabilistic models for sequential data.

In the special case of linear-Gaussian state space models [6, chapter 13.3], the parameters of the model can be learned using an instantiation of the Expectation-Maximization (EM) algorithm [2]. However, computationally tractable algorithms for exact inference in the E-step are limited to the linear-Gaussian case. Nonlinear, non-Gaussian models require approximation.

Here, we consider RNNs as nonlinear state space models and suggest a training algorithm based on Expectation-Maximization. We investigate the use of a sequential sampling method, the Particle Smoother [3], to approximate the E-step and simultaneously estimate the expectations required in the M-step. In the M-step, the similarity of linear-Gaussian state space models and simple recurrent networks can be exploited. The proposed method has the advantage that the RNN can model dynamical systems characterized by process noise, and that it represents a generative model of the data. This is not the case for conventionally trained RNNs, e.g. using a quadratic objective function minimized through gradient descent.

Stochasticity in recurrent networks has been introduced before in the context of training algorithms based on Extended Kalman Filtering [4]. However, this work did not aim at training a recurrent network as a generative model for a stochastic dynamical system, but at further developing training algorithms for conventional recurrent networks. Related work includes inference and learning in nonlinear state space models in general. For example, Ghahramani [5] addresses learning in nonlinear state space models where the nonlinearity is represented by radial basis functions. The E-step is approximated via Extended Kalman Filtering. The M-step can be solved in closed form for this model.

Our algorithm is demonstrated for a synthetic data set and a time series prediction task arising in radiation therapy. Lung tumors move during respiration due to expansion and contraction of the lungs. Online adjustment of the radiation beam for tumor tracking during treatment requires the prediction of the tumor motion for about half a second. The variability of the breathing pattern is characterized by process noise rather than measurement noise.

The remainder of this paper is organized as follows: Section 2 reviews Linear Dynamical Systems (Linear Dynamical System and linear-Gaussian state space model refer to the same model) and recurrent neural networks as black-box models for dynamical systems. Section 3 describes the EM-based training algorithm for recurrent networks, where subsection 3.1 details the sampling method used in the E-step. Section 4 discusses two applications of the method.

2 Dynamical system identification

Given is a sequence Y = {y_t}_{t=1}^T, where y_t is a vector of observations at time step t. We wish to find a model of the underlying dynamical system that generated the sequence, e.g. for the purpose of time series prediction. Below, we introduce notation, review linear dynamical systems (LDS) and recurrent neural networks, and address the parameter learning problem in these models.

2.1 Linear Dynamical Systems

A stochastic linear dynamical system is described by

    x_t = A x_{t-1} + \eta_t^p    (1)
    y_t = B x_t + \eta_t^m    (2)

where x_t is an internal state vector, y_t is an observation vector, A and B are matrices with constant coefficients, \eta_t^p is zero-mean Gaussian process noise with covariance matrix \Gamma, and \eta_t^m is zero-mean Gaussian measurement noise with covariance matrix \Sigma. The stochastic LDS can be formulated as a probabilistic model with observed variables y_t and latent variables x_t. The model is defined via the conditional probabilities

    P(x_t \mid x_{t-1}) = \mathcal{N}(x_t \mid A x_{t-1}, \Gamma)    (3)
    P(y_t \mid x_t) = \mathcal{N}(y_t \mid B x_t, \Sigma)    (4)

Hence, the hidden variables x_t form a Markov chain.

2.2 Learning in linear dynamical systems

In LDS the model parameters \theta = (A, B, \Gamma, \Sigma) can be learned using maximum likelihood through an instantiation of the EM algorithm [6, chapter 13.3.2]. In the E-step, we need to solve the inference problem and calculate the marginal probabilities of the hidden variables conditioned on the observation sequence:

    P(X \mid Y; \theta)    (5)

where X = {x_t}_{t=1}^T and Y = {y_t}_{t=1}^T denote the entire sequences of hidden states and observations, respectively. For LDS, the inference problem can be solved exactly as P(X \mid Y; \theta) remains a Gaussian. Its mean and covariance matrix can be calculated with a forward-backward algorithm. Given the posterior distributions over the latent variables, the M-step can also be performed analytically. (Here, we assume for simplicity that x_0 = 0. However, it is straightforward to learn the mean and covariance matrix of a Gaussian distribution over the initial internal state.)

2.3 Recurrent neural networks

If we model the sequence of observations with a recurrent neural network (this type of network is also referred to as an Elman network), equations 1 and 2 are replaced by

    x_t = A f(x_{t-1}) + \eta_t^p    (6)
    y_t = B f(x_t) + \eta_t^m    (7)

where x_t is now interpreted as the vector of net inputs of the hidden neurons, f = \tanh is the transfer function of the hidden neurons so that f(x_t) is the vector of hidden neuron activations, y_t is the network output, A is the weight matrix for the recurrent connections, and B is the output weight matrix. Hence, the structure of the RNN model is very similar to LDS. The measurement equation 7 represents a linear output layer. The network has no external inputs in this formulation (except the process noise, which can be interpreted as an unknown external influence).
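To make the generative model concrete, the following is a minimal sketch (not from the paper) of how a sequence could be sampled from the stochastic RNN of equations 6 and 7; the function and variable names are illustrative.

```python
import numpy as np

def sample_rnn_sequence(A, B, Gamma, Sigma, T, rng=None):
    """Sample a sequence from the stochastic RNN of eqs. (6)-(7):
       x_t = A f(x_{t-1}) + process noise,  y_t = B f(x_t) + measurement noise."""
    rng = np.random.default_rng() if rng is None else rng
    n_hidden, n_obs = A.shape[0], B.shape[0]
    x = np.zeros(n_hidden)          # initial internal state x_0 = 0, as assumed above
    X, Y = [], []
    for t in range(T):
        x = A @ np.tanh(x) + rng.multivariate_normal(np.zeros(n_hidden), Gamma)
        y = B @ np.tanh(x) + rng.multivariate_normal(np.zeros(n_obs), Sigma)
        X.append(x)
        Y.append(y)
    return np.array(X), np.array(Y)
```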

2.4 Conventional learning in recurrent neural networks

When training recurrent networks, the process noise \eta_t^p is typically neglected. In addition, the measurement noise \eta_t^m is assumed to be uncorrelated and of the same variance in all components of y_t. In this case, maximizing the likelihood of the data corresponds to minimizing a quadratic cost function:

    \min_{A,B} \; \frac{1}{2} \sum_{t=1}^{T} \left[ y_t - B f(x_t) \right]^2    (8)

Minimization of the cost function can be performed via gradient descent. Backpropagation through time (BPTT) and Real-Time Recurrent Learning (RTRL) are well-known algorithms to calculate the gradient [7]. Alternative methods include Evolino [8], where the matrix A is determined via evolutionary methods whereas B is determined through linear regression. In Echo-State networks [9], A is chosen to be a fixed sparse random matrix and B is determined via linear regression.

3 EM based learning in recurrent neural networks

Here, we wish to train recurrent networks without neglecting the process noise term. The RNN is considered as a nonlinear state-space model. We exploit the similarities of RNNs and LDS to obtain an EM-based training algorithm. We wish to maximize the following likelihood function with respect to the model parameters \theta = (A, B, \Gamma, \Sigma):

    \max_{\theta} \; L = \int P(X, Y \mid \theta) \, dX    (9)

where dX denotes the integration over all latent variables X = {x_t}_{t=1}^T. The complete data likelihood is given by

    P(X, Y \mid \theta) = \prod_{t=1}^{T} P(x_t \mid x_{t-1}; A, \Gamma) \, P(y_t \mid x_t; B, \Sigma)    (10)

so that the complete data log-likelihood is given by

    \ln P(X, Y \mid \theta) = \sum_{t=1}^{T} \ln \mathcal{N}(x_t \mid A f(x_{t-1}), \Gamma) + \sum_{t=1}^{T} \ln \mathcal{N}(y_t \mid B f(x_t), \Sigma)    (11)
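As a rough illustration of equation 11, the sketch below evaluates the complete-data log-likelihood for a given state sequence and parameter set. It is a sketch under the x_0 = 0 assumption stated in section 2.2; the function name is ours and scipy's Gaussian density is used for the two terms.

```python
import numpy as np
from scipy.stats import multivariate_normal

def complete_data_log_likelihood(X, Y, A, B, Gamma, Sigma):
    """ln P(X, Y | theta) of eq. (11), assuming x_0 = 0."""
    T = len(Y)
    x_prev = np.zeros(A.shape[0])
    ll = 0.0
    for t in range(T):
        # transition term: ln N(x_t | A f(x_{t-1}), Gamma)
        ll += multivariate_normal.logpdf(X[t], mean=A @ np.tanh(x_prev), cov=Gamma)
        # emission term: ln N(y_t | B f(x_t), Sigma)
        ll += multivariate_normal.logpdf(Y[t], mean=B @ np.tanh(X[t]), cov=Sigma)
        x_prev = X[t]
    return ll
```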

In the M-step, we maximize the following Q-function with respect to the model parameters \theta:

    \max_{\theta} \; Q(\theta, \theta^{old}) = \int P(X \mid Y; \theta^{old}) \ln P(X, Y \mid \theta) \, dX    (12)

The structure of the Q-function is, apart from the non-linear transfer function f, identical to the one for linear dynamical systems [6, chapter 13.3.2]. This similarity can be exploited in the M-step as detailed in section 3.2. In the E-step, we need to determine the joint probability distribution over the latent variables X conditioned on the observation sequence Y, given the current parameter values \theta^{old}:

    P(X \mid Y; \theta^{old})    (13)

Since exact inference is intractable, we employ a sampling method, the Particle Smoother, to approximate the distribution 13 as detailed in section 3.1.

3.1 The Particle Smoother for approximate inference

In the E-step of the Expectation-Maximization algorithm, we wish to approximate the probability density of the latent variables X conditioned on the observed sequence Y. However, inspecting the structure of the log-likelihood function 11 shows that we only need the marginals P(x_t \mid Y; \theta) and P(x_t, x_{t-1} \mid Y; \theta).

To start, we consider P(x_t \mid Y_t), i.e. the probability of x_t given the observations Y_t = {y_\tau}_{\tau=1}^{t} until time step t. This can be approximated via a sequential Monte Carlo method, the particle filter. At each time step, the distribution is approximated by

    \hat{P}(x_t \mid Y_t) = \sum_{i=1}^{N} w_t^i \, \delta(x_t - x_t^i)    (14)

where x_t^i is the position of particle i at time t, w_t^i is the particle weight, \delta denotes the delta-function, and N is the number of particles. To proceed to the next time step, we sample new particles x_{t+1}^j from a mixture of Gaussians:

    x_{t+1}^j \sim \sum_{i=1}^{N} w_t^i \, \mathcal{N}\left( x_{t+1} \mid A f(x_t^i), \Gamma \right)    (15)

The new particles are weighted according to the probability of generating the next observation y_{t+1}:

    w_{t+1}^j = \frac{P(y_{t+1} \mid x_{t+1}^j)}{\sum_{k=1}^{N} P(y_{t+1} \mid x_{t+1}^k)}    (16)
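A minimal sketch of the forward (filtering) pass of equations 14-16 might look as follows. The mixture sampling of equation 15 is implemented here by drawing an ancestor index according to the weights and then adding process noise (our choice of implementation, not spelled out in the paper); all names are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def particle_filter(Y, A, B, Gamma, Sigma, n_particles=100, rng=None):
    """Forward pass: particle positions and weights approximating P(x_t | Y_t)."""
    rng = np.random.default_rng() if rng is None else rng
    T, n_hidden = len(Y), A.shape[0]
    particles = np.zeros((T, n_particles, n_hidden))
    weights = np.zeros((T, n_particles))
    x_prev = np.zeros((n_particles, n_hidden))         # x_0 = 0 for all particles
    w_prev = np.full(n_particles, 1.0 / n_particles)
    for t in range(T):
        # eq. (15): sample from the mixture by choosing an ancestor, then propagating
        ancestors = rng.choice(n_particles, size=n_particles, p=w_prev)
        noise = rng.multivariate_normal(np.zeros(n_hidden), Gamma, size=n_particles)
        particles[t] = (A @ np.tanh(x_prev[ancestors]).T).T + noise
        # eq. (16): reweight by the likelihood of the new observation
        lik = np.array([multivariate_normal.pdf(Y[t], mean=B @ np.tanh(p), cov=Sigma)
                        for p in particles[t]])
        weights[t] = lik / lik.sum()
        x_prev, w_prev = particles[t], weights[t]
    return particles, weights
```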

An approximation of the probability density P(x_t \mid Y), now conditioned on the entire observation sequence Y, can be obtained using a forward-backward smoother [3]. Following this method, we first approximate P(x_t \mid Y_t) for every time step using a particle filter, and in a second step correct the particle weights via a backwards recursion. The smoothed distribution P(x_t \mid Y) is then approximated via

    \hat{P}(x_t \mid Y) = \sum_{i=1}^{N} v_t^i \, \delta(x_t - x_t^i)    (17)

with modified weights v_t^i. To derive the backward recursion, we consider the following identity:

    P(x_t \mid Y) = \int P(x_{t+1} \mid Y) \, P(x_{t+1} \mid x_t, Y_t) \, dx_{t+1}
                  = P(x_t \mid Y_t) \int \frac{P(x_{t+1} \mid Y) \, P(x_{t+1} \mid x_t)}{\int P(x_{t+1} \mid x_t') \, P(x_t' \mid Y_t) \, dx_t'} \, dx_{t+1}    (18)

By inserting equations 14 and 17 into 18, we obtain the backward recursion for the corrected weights v_t^i:

    v_t^i = w_t^i \sum_{j=1}^{N} \frac{v_{t+1}^j \, P^{ji}}{\sum_{k=1}^{N} w_t^k \, P^{jk}}    (19)

where P^{ji} = P(x_{t+1}^j \mid x_t^i) is the transition probability from particle i at time t to particle j at time t+1. The backward recursion is initialized as v_T^i = w_T^i. After obtaining particle weights w_t^i in the forward pass and weights v_t^i in the backward pass, equation 18 also gives rise to an approximation of the joint probability for x_t and x_{t+1}:

    \hat{P}(x_t, x_{t+1} \mid Y) = \sum_{i=1}^{N} \sum_{j=1}^{N} \frac{w_t^i \, v_{t+1}^j \, P^{ji}}{\sum_{k=1}^{N} w_t^k \, P^{jk}} \, \delta(x_t - x_t^i) \, \delta(x_{t+1} - x_{t+1}^j)    (20)

3.2 The M-step

In the M-step, we need to maximize the Q-function 12 with respect to the model parameters \theta = (A, B, \Gamma, \Sigma). Here, we can make use of the fact that the structure of the dynamic equation describing the RNN is similar to that of the LDS. This leads to similar update equations for the parameters. As an example, we consider the update equation for the recurrent weight matrix A, which we obtain by setting the derivative of the Q-function with respect to A to zero:

    A^{new} = \left[ \sum_{t=1}^{T} E\left[ x_t f(x_{t-1})^T \right] \right] \left[ \sum_{t=1}^{T} E\left[ f(x_{t-1}) f(x_{t-1})^T \right] \right]^{-1}    (21)

where E denotes the expectation with respect to the distribution P(X \mid Y; \theta^{old}). The expectations in equation 21 are approximated through the particle smoother as

    E\left[ f(x_{t-1}) f(x_{t-1})^T \right] = \sum_{i=1}^{N} v_{t-1}^i \, f(x_{t-1}^i) f(x_{t-1}^i)^T    (22)

    E\left[ x_t f(x_{t-1})^T \right] = \sum_{i=1}^{N} \sum_{j=1}^{N} \frac{w_{t-1}^i \, v_t^j \, P^{ji}}{\sum_{k=1}^{N} w_{t-1}^k \, P^{jk}} \, x_t^j \, f(x_{t-1}^i)^T    (23)

The update equations for the parameters B, \Gamma and \Sigma are obtained in analogy to the update equations for linear-Gaussian models as described in [6, chapter 13.3.2]. They are omitted here due to space limitations.
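The following sketch shows one way the backward recursion of equation 19, the pairwise weights of equation 20, and the resulting update for A (equations 21-23) could be implemented on top of the output of the hypothetical particle_filter sketch above. It is an illustrative approximation with invented names, not the authors' code.

```python
import numpy as np
from scipy.stats import multivariate_normal

def smooth_and_update_A(particles, weights, A, Gamma):
    """Backward recursion (eq. 19) and M-step update for A (eqs. 21-23).
       particles: (T, N, n_hidden) and weights: (T, N) from the forward pass."""
    T, N, n_hidden = particles.shape
    v = np.copy(weights)                      # initialization v_T^i = w_T^i
    num = np.zeros((n_hidden, n_hidden))      # accumulates E[x_t f(x_{t-1})^T]
    den = np.zeros((n_hidden, n_hidden))      # accumulates E[f(x_{t-1}) f(x_{t-1})^T]
    for t in range(T - 2, -1, -1):
        f_t = np.tanh(particles[t])           # hidden activations f(x_t^i), shape (N, n_hidden)
        # transition probabilities P_trans[j, i] = P(x_{t+1}^j | x_t^i)
        P_trans = np.array([[multivariate_normal.pdf(particles[t + 1, j],
                                                     mean=A @ f_t[i], cov=Gamma)
                             for i in range(N)] for j in range(N)])
        denom = P_trans @ weights[t]                              # sum_k w_t^k P^{jk}
        v[t] = weights[t] * (P_trans.T @ (v[t + 1] / denom))      # eq. (19)
        pair = weights[t][None, :] * v[t + 1][:, None] * P_trans / denom[:, None]  # eq. (20)
        num += particles[t + 1].T @ pair @ f_t                    # eq. (23)
        den += f_t.T @ (v[t][:, None] * f_t)                      # eq. (22)
        # (the first term of the sums in eq. 21 vanishes because x_0 = 0 and f(0) = 0)
    A_new = num @ np.linalg.inv(den)                              # eq. (21)
    return v, A_new
```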

4 Applications

The method is demonstrated for a synthetic data set and a real data set describing irregular breathing motion of a lung cancer patient.

4.1 Synthetic data

We consider synthetic data generated by a recurrent network with two hidden neurons with parameters

    A = \begin{pmatrix} 1 & 1 \\ -0.05 & 0.98 \end{pmatrix}, \quad B = \begin{pmatrix} 1 & 0 \end{pmatrix}, \quad \Gamma = \begin{pmatrix} 0.001 & 0 \\ 0 & 0.001 \end{pmatrix}, \quad \Sigma = (0.01)

These parameters were chosen so that the system outputs an oscillation which is perturbed by both process and measurement noise. Without noise, the system generates an (almost harmonic) oscillating output signal with a period of 28 time steps. Figure 1 shows a sample sequence generated by the system with noise.

A recurrent network with 5 hidden neurons was trained on a data set containing 1000 time steps (similar results are obtained with 2 or more hidden neurons). Figure 1 shows parts of a test data set, together with predictions of the trained RNN (blue circles). Every 20 time steps, the network prediction for the next 10 time steps is shown. For prediction, a particle filter is used to estimate the mean of the posterior distribution over the hidden state. Then the deterministic dynamics of the network (without noise) is used for prediction, starting from the estimated hidden state (a prediction method where all particles are propagated forward in time and averaged at each time step yields almost identical results).

In order to compare the predictions of the RNN, figure 1 also shows the predictions based on the true dynamical system (green dots). In this example, the prediction performance of the RNN is similar to the best possible prediction where the true generative process of the data is known: the mean squared errors of the trained network are 0.145/0.259/0.327 for the 1/5/10-step prediction, the corresponding values for the true system are 0.142/0.240/0.307, and the standard deviation of the test data is 0.380.
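As an illustration, the snippet below restates the data-generating parameters of section 4.1 (they could be passed to the hypothetical sample_rnn_sequence sketch of section 2.3 to create the training sequence) and sketches the prediction procedure just described: estimate the posterior mean of the hidden state from the filtering particles, then roll the noise-free dynamics forward. Function names are ours.

```python
import numpy as np

# Parameters of the data-generating RNN from section 4.1 (two hidden neurons)
A_true = np.array([[1.0, 1.0],
                   [-0.05, 0.98]])
B_true = np.array([[1.0, 0.0]])
Gamma_true = 0.001 * np.eye(2)
Sigma_true = np.array([[0.01]])

def predict_k_steps(particles_t, weights_t, A, B, k):
    """k-step prediction: posterior mean of x_t from the filtering particles,
       then iterate the deterministic (noise-free) dynamics of eqs. (6)-(7)."""
    x = weights_t @ particles_t             # estimated mean of the hidden state
    predictions = []
    for _ in range(k):
        x = A @ np.tanh(x)                  # deterministic state transition
        predictions.append(B @ np.tanh(x))  # noise-free network output
    return np.array(predictions)
```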

Figure 2 shows the internal, noise-free dynamics of the network. The network output is shown as the thick red line, whereas the blue thin lines show the dynamics of the internal state. The network was able to learn the underlying harmonic oscillation of the dynamical system and estimate the amount of process and measurement noise.

Fig. 1. Synthetic time series (red line). Every 20 time steps, the network predictions for the next 10 time steps are shown (blue circles), together with the predictions obtained using the true dynamical system (green dots).

Fig. 2. Deterministic dynamics of the network trained on the noisy data set. Red thick line: network output; blue thin lines: internal states.

4.2 Breathing motion data

Figure 3 (red line) shows a surrogate for breathing motion for a lung cancer patient (the expansion of the abdomen was measured as a function of time). The signal is roughly periodic, but has substantial variations in amplitude and period. In radiation therapy practice, the goal is to predict the breathing pattern for about 500 milliseconds (5 time steps), where the length of one breathing cycle is around 3 seconds. It is also of interest to have a generative model of the breathing signal in order to simulate the effect of breathing on the accuracy of a radiation treatment.

A recurrent network with 10 hidden neurons was trained on a sequence of 1000 time steps. Figure 4 shows the intrinsic dynamics of the trained network, i.e. the dynamics without noise.

Figure 3 shows samples of the network prediction on the test data. Every 20 time steps, the network predictions for the next 10 time steps are shown. Like in the synthetic data example above, the training algorithm is able to find solutions that model the oscillatory behaviour of the system. However, it failed to model the dynamics very precisely. Consequently, in order to explain the deviation between data and network output, the estimated amount of measurement noise \Sigma is too large, and the trained network does not represent an adequate generative model.

Fig. 3. Breathing signal time series (red line). Every 20 time steps, the next 10 time steps predicted by a trained network are shown (blue circles).

Fig. 4. Deterministic, noise-free dynamics of a trained network. Red thick line: network output; blue thin lines: internal states.

4.3 Remarks

Initialization of parameters: The weight matrix A was initialized to a diagonal matrix plus random values with a standard deviation of 0.1. The components of B were initialized to random values with mean one and standard deviation 0.1. These values led to solutions qualitatively similar to those shown above (a minimal sketch of such an initialization is given after these remarks).

Local minima: Dynamical system identification tends to be a difficult task. For the breathing motion data set, the network did not learn details of the dynamics. However, the same problem applies to linear-Gaussian state space models and RNNs trained with gradient descent.

Number of particles: The results presented above were generated with N = 100 particles. However, similar results were obtained with only 10 particles, although particle numbers that low are not expected to provide an adequate characterization of the posterior distribution at a given time step. This aspect will be investigated in future work.
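A rough sketch of the initialization described above, assuming the diagonal matrix is the identity (the paper does not specify its values) and with illustrative names:

```python
import numpy as np

def initialize_parameters(n_hidden, n_obs, rng=None):
    """Initialize A as a diagonal matrix (here: identity, an assumption) plus noise
       with standard deviation 0.1, and B with entries of mean one and std 0.1."""
    rng = np.random.default_rng() if rng is None else rng
    A = np.eye(n_hidden) + 0.1 * rng.standard_normal((n_hidden, n_hidden))
    B = 1.0 + 0.1 * rng.standard_normal((n_obs, n_hidden))
    return A, B
```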

5 Conclusion

We address the problem of training recurrent neural networks as black-box models for stochastic nonlinear dynamical systems. In conventional training algorithms based on a quadratic objective function, it is assumed that the underlying dynamics to be modelled is deterministic, i.e. process noise is neglected. In this paper, we generalize RNNs to model stochastic dynamical systems characterized by process noise. We suggest a training algorithm based on Expectation-Maximization, exploiting the similarities between RNNs and linear dynamical systems. We apply a particle smoother for approximate inference in the E-step and simultaneously estimate the expectations required in the M-step. The algorithm is successfully demonstrated for one synthetic and one realistic data set. Further characterization of the performance of the training algorithm, the influence of the number of particles, and comparison to standard training methods for RNNs are the subject of current studies.

References

1. D. P. Mandic and J. A. Chambers. Recurrent Neural Networks for Prediction. John Wiley & Sons, Inc., 2001.
2. G. J. McLachlan and T. Krishnan. The EM Algorithm and Extensions. John Wiley & Sons, Inc., 1997.
3. Mike Klaas, Mark Briers, Arnaud Doucet, and Simon Maskell. Fast particle smoothing: If I had a million particles. In International Conference on Machine Learning (ICML), pages 25-29, 2006.
4. Ronald J. Williams. Training recurrent networks using the extended Kalman filter. In Proceedings of the International Joint Conference on Neural Networks, pages 241-246, 1992. [Online]. Available: citeseer.nj.nec.com/williams92training.html
5. Zoubin Ghahramani and Sam T. Roweis. Learning nonlinear dynamical systems using an EM algorithm. In Advances in Neural Information Processing Systems 11, pages 599-605. MIT Press, 1999.
6. C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
7. R. J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. In Back-propagation: Theory, Architectures and Applications. Hillsdale, NJ: Erlbaum, 1994.
8. J. Schmidhuber, D. Wierstra, M. Gagliolo, and F. Gomez. Training recurrent networks by Evolino. Neural Computation, 19(3):757-779, 2007.
9. H. Jaeger. The echo state approach to analysing and training recurrent neural networks. Technical Report GMD Report 148, German National Research Center for Information Technology, 2001.