Online natural gradient as a Kalman filter

Electronic Journal of Statistics, Vol. 12 (2018). https://doi.org/10.1214/18-EJS1468

Yann Ollivier
Facebook Artificial Intelligence Research
6 rue Ménars, Paris, France
yol@fb.com

Abstract: We cast Amari's natural gradient in statistical learning as a specific case of Kalman filtering. Namely, applying an extended Kalman filter to estimate a fixed unknown parameter of a probabilistic model from a series of observations, is rigorously equivalent to estimating this parameter via an online stochastic natural gradient descent on the log-likelihood of the observations. In the i.i.d. case, this relation is a consequence of the "information filter" phrasing of the extended Kalman filter. In the recurrent (state space, non-i.i.d.) case, we prove that the joint Kalman filter over states and parameters is a natural gradient on top of real-time recurrent learning (RTRL), a classical algorithm to train recurrent models. This exact algebraic correspondence provides relevant interpretations for natural gradient hyperparameters such as learning rates or initialization and regularization of the Fisher information matrix.

MSC 2010 subject classifications: Primary 68T05, 65K10; secondary 93E35, 90C26, 93E11, 49M15.
Keywords and phrases: Statistical learning, natural gradient, Kalman filter, stochastic gradient descent.

Received June.

Contents
1  Problem setting, natural gradient, Kalman filter
   1.1  Problem setting
   1.2  Natural gradient descent
   1.3  Kalman filtering for parameter estimation
2  Natural gradient as a Kalman filter: the static (i.i.d.) case
   2.1  Natural gradient as a Kalman filter: heuristics
   2.2  Statement of the correspondence, static (i.i.d.) case
   2.3  Proofs for the static case
3  Natural gradient as a Kalman filter: the state space (recurrent) case
   3.1  Recurrent models, RTRL
   3.2  Statement of the correspondence, recurrent case
   3.3  Proofs for the recurrent case
A  Reminder on exponential families
References

∗ Work done in part while at CNRS, TAO, Université Paris-Sud.

In statistical learning, stochastic gradient descent is a widely used tool to estimate the parameters of a model from empirical data, especially when the parameter dimension and the amount of data are large [BL03] (such as is typically the case with neural networks, for instance). The natural gradient [Ama98] is a tool from information geometry, which aims at correcting several shortcomings of the widely used ordinary stochastic gradient descent, such as its sensitivity to rescalings or simple changes of variables in parameter space [Oll15]. The natural gradient modifies the ordinary gradient by using the information geometry of the statistical model, via the Fisher information matrix (see formal definition in Section 1.2; see also [Mar14]). The natural gradient comes with a theoretical guarantee of asymptotic optimality [Ama98] that the ordinary gradient lacks, and with the theoretical knowledge and various connections from information geometry, e.g., [AN00, OAAH17]. In large dimension, its computational complexity makes approximations necessary, e.g., [LMB07, Oll15, MCO16, GS15, MG15]; this has limited its adoption despite many desirable theoretical properties.

The extended Kalman filter (see e.g., the textbooks [Sim06, Sä13, Jaz70]) is a generic and effective tool to estimate in real time the state of a nonlinear dynamical system, from noisy measurements of some part or some function of the system. (The ordinary Kalman filter deals with linear systems.) Its use in navigation systems (GPS, vehicle control, spacecraft...), time series analysis, econometrics, etc. [Sä13], is extensive to the point it can be described as one of the great discoveries of mathematical engineering [GA15].

The goal of this text is to show that the natural gradient, when applied online, is a particular case of the extended Kalman filter. Indeed, the extended Kalman filter can be used to estimate the parameters of a statistical model (probability distribution), by viewing the parameters as the hidden state of a static dynamical system, and viewing i.i.d. samples as noisy observations depending on the parameters.¹ We show that doing so is exactly equivalent to performing an online stochastic natural gradient descent (Theorem 2). This results in a rigorous dictionary between the natural gradient objects from statistical learning, and the objects appearing in Kalman filtering; for instance, a larger learning rate for the natural gradient descent exactly corresponds to a fading memory in the Kalman filter (Proposition 3). Table 1 lists a few correspondences between objects from the Kalman filter side and from the natural gradient side, as results from the theorems and propositions below. Note that the correspondence is one-sided: the online natural gradient is exactly an extended Kalman filter, but only corresponds to a particular use of the Kalman filter for parameter estimation problems (i.e., with static dynamics on the parameter part of the system).

¹ For this we slightly extend the definition of the Kalman filter to include discrete observations, by defining (Def. 5) the measurement error as $T(y) - \hat y$ instead of $y - \hat y$, where $T$ is the sufficient statistics of an exponential family model for output noise with mean $\hat y$. This reduces to the standard filter for Gaussian output noise, and naturally covers categorical outputs as often used in statistical learning (with $\hat y$ the class probabilities in a softmax classifier and $T$ a one-hot encoding of $y$).

Table 1. Kalman filter objects vs natural gradient objects. The inputs are $u_t$, the predicted values are $\hat y_t$, and the model parameters are $\theta$.

i.i.d. (static, non-recurrent) model $\hat y_t = h(\theta, u_t)$:
  Extended Kalman filter on static parameter $\theta$  |  Online natural gradient on $\theta$ with learning rate $\eta_t = 1/(t+1)$
  Covariance matrix $P_t$  |  Fisher information matrix $J_t = \eta_t P_t^{-1}$
  Bayesian prior $P_0$  |  Fisher matrix initialization $J_0 = P_0^{-1}$
  Fading memory  |  Larger or constant learning rate
  Fading memory + constant prior  |  Fisher matrix regularization

Recurrent (state space) model $\hat y_t = \Phi(\hat y_{t-1}, \theta, u_t)$:
  Extended Kalman filter on $(\theta, \hat y)$  |  RTRL + natural gradient + state correction
  Covariance of $\theta$ alone, $P_t^\theta$  |  Fisher matrix $J_t = \eta_t (P_t^\theta)^{-1}$
  Correlation between $\theta$ and $\hat y$  |  RTRL gradient estimate $\partial\hat y/\partial\theta$

Beyond the static case, we also consider the learning of the parameters of a general dynamical system, where subsequent observations exhibit temporal patterns instead of being i.i.d.; in statistical learning this is called a recurrent model, for instance, a recurrent neural network. We refer to [Jae02] for an introduction to recurrent models in statistical learning (recurrent neural networks) and the afferent techniques (including Kalman filters), and to [Hay01] for a clear, in-depth treatment of Kalman filtering for recurrent models. We prove (Theorem 12) that the extended Kalman filter applied jointly to the state and parameter, amounts to a natural gradient on top of real-time recurrent learning (RTRL), a classical (and costly) online algorithm for recurrent network training [Jae02].

Thus, we provide a bridge between techniques from large-scale statistical learning (natural gradient, RTRL) and a central object from mathematical engineering, signal processing, and estimation theory. Casting the natural gradient as a specific case of the extended Kalman filter is an instance of the provocative statement from [LS83] that there is only one recursive identification method that is optimal on quadratic functions. Indeed, the online natural gradient descent fits into the framework of [LS83, 3.4.5]. Arguably, this statement is limited to linear models, and for non-linear models one would expect different algorithms to coincide only at a certain order, or asymptotically; however, all the correspondences presented below are exact.

Related work. In the i.i.d. (static) case, the natural gradient/Kalman filter correspondence follows from the information filter phrasing of Kalman filtering [Sim06, 6.2] by relatively direct manipulations. Nevertheless, we could find no reference in the literature explicitly identifying the two. [SW88] is an early example of the use of Kalman filtering for training feedforward neural networks in statistical learning, but does not mention the natural gradient. [RRK+92] argue that for neural networks, backpropagation, i.e., ordinary gradient descent, is a degenerate form of the extended Kalman filter. [Ber96] identifies the extended Kalman filter with a Gauss–Newton gradient descent for the specific case of nonlinear regression. [dFNG00] interprets process noise in the static Kalman filter as an adaptive, per-parameter learning rate, thus akin to a preconditioning matrix.

[ŠKT01] uses the Fisher information matrix to study the variance of parameter estimation in Kalman-like filters, without using a natural gradient; [BL03] comment on the similarity between Kalman filtering and a version of Amari's natural gradient for the specific case of least squares regression; [Mar14] and [Oll15] mention the relationship between natural gradient and the Gauss–Newton Hessian approximation; [Pat16] exploits the relationship between second-order gradient descent and Kalman filtering in specific cases including linear regression; [LCL+17] use a natural gradient descent over Gaussian distributions for an auxiliary problem arising in Kalman-like Bayesian filtering, a problem independent from the one treated here.

For the recurrent (non-i.i.d.) case, our result is that joint Kalman filtering is essentially a natural gradient on top of the classical RTRL algorithm for recurrent models [Jae02]. [Wil92] already observed that starting with the Kalman filter and introducing drastic simplifications (doing away with the covariance matrix) results in RTRL, while [Hay01, 5] contains statements that can be interpreted as relating Kalman filtering and preconditioned RTRL-like gradient descent for recurrent models (Section 3.2).

Perspectives. In this text our goal is to derive the precise correspondence between natural gradient and Kalman filtering for parameter estimation (Thm. 2, Prop. 3, Prop. 4, Thm. 12), and to work out an exact dictionary between the mathematical objects on both sides. This correspondence suggests several possible venues for research, which nevertheless are not explored here.

First, the correspondence with the Kalman filter brings new interpretations and suggestions for several natural gradient hyperparameters, such as Fisher matrix initialization, equality between Fisher matrix decay rate and learning rate, or amount of regularization to the Fisher matrix (Section 2.2). The natural gradient can be quite sensitive to these hyperparameters. A first step would be to test the matrix decay rate and regularization values suggested by the Bayesian interpretation (Prop. 4) and see if they help with the natural gradient, or if these suggestions are overridden by the various approximations needed to apply the natural gradient in practice. These empirical tests are beyond the scope of the present study.

Next, since statistical learning deals with either continuous or categorical data, we had to extend the usual Kalman filter to such a setting. Traditionally, non-Gaussian output models have been treated by applying a nonlinearity to a standard Gaussian noise (Section 2.3). Instead, modeling the measurement noise as an exponential family (Appendix and Def. 5) allows for a unified treatment of the standard case (Gaussian output noise with known variance), of discrete categorical observations, or other exponential noise models (e.g., Gaussian noise with unknown variance). We did not test the empirical consequences of this choice, but it certainly makes the mathematical treatment flow smoothly, in particular the view of the Kalman filter as preconditioned gradient descent (Prop. 6).

Neither the natural gradient nor the extended Kalman filter scale well to large-dimensional models as currently used in machine learning, so that approximations are required.

The correspondence raises the possibility that various methods developed for Kalman filtering (e.g., particle or unscented filters) or for natural gradient approximations (e.g., matrix factorizations such as the Kronecker product [MG15] or quasi-diagonal reductions [Oll15, MCO16]) could be transferred from one viewpoint to the other. In statistical learning, other means have been developed to attain the same asymptotic efficiency as the natural gradient, notably trajectory averaging (e.g., [PJ92], or [Mar14] for the relationship to natural gradient) at little algorithmic cost. One may wonder if these can be generalized to filtering problems. Proof techniques could be transferred as well: for instance, Amari [Ama98] gave a strong but sometimes informal argument that the natural gradient is Fisher-efficient, i.e., the resulting parameter estimate is asymptotically optimal for the Cramér–Rao bound; alternate proofs could be obtained by transferring related statements for the extended Kalman filter, e.g., combining techniques from [ŠKT01, BRD97, LS83].

Organization of the text. In Section 1 we set the notation, recall the definition of the natural gradient (Def. 1), and explain how Kalman filtering can be used for parameter estimation in statistical learning (Section 1.3); the definition of the Kalman filter is included in Def. 5. Section 2 gives the main statements for viewing the natural gradient as an instance of an extended Kalman filter for i.i.d. observations (static systems), first intuitively via a heuristic asymptotic argument (Section 2.1), then rigorously (Thm. 2, Prop. 3, Prop. 4). The proof of these results appears in Section 2.3 and sheds some light on the geometry of Kalman filtering. Finally, the case of non-i.i.d. observations (recurrent or state space model) is treated in Section 3.

Acknowledgments. Many thanks to Silvère Bonnabel, Gaétan Marceau-Caron, and the anonymous reviewers for their careful reading of the manuscript, corrections, and suggestions for the presentation and organization of the text. I would also like to thank Shun-ichi Amari, Frédéric Barbaresco, and Nando de Freitas for additional comments and for pointing out relevant references.

1. Problem setting, natural gradient, Kalman filter

1.1. Problem setting

In statistical learning, we have a series of observation pairs $(u_1, y_1), \ldots, (u_t, y_t), \ldots$ and want to predict $y_t$ from $u_t$ using a probabilistic model $p_\theta$. Assume for now that $y_t$ is real-valued (regression problem) and that the model for $y_t$ is a Gaussian centered on a predicted value $\hat y_t$, with known covariance matrix $R_t$, namely

$$y_t = \hat y_t + \mathcal{N}(0, R_t), \qquad \hat y_t = h(\theta, u_t) \tag{1.1}$$

The function $h$ may represent any computation, for instance, a feedforward neural network with input $u_t$, parameters $\theta$, and output $\hat y_t$.

The goal is to find the parameters $\theta$ such that the prediction $\hat y_t = h(\theta, u_t)$ is as close as possible to $y_t$: the loss function is

$$\ell_t = \tfrac{1}{2}(\hat y_t - y_t)^\top R_t^{-1}(\hat y_t - y_t) = -\ln p(y_t \mid \hat y_t) \tag{1.2}$$

up to an additive constant. For non-Gaussian outputs, we assume that the noise model on $y_t$ given $\hat y_t$ belongs to an exponential family, namely, that $\hat y_t$ is the mean parameter of an exponential family of distributions² over $y_t$; we again define the loss function as $\ell_t := -\ln p(y_t \mid \hat y_t)$, and the output noise $R_t$ can be defined as the covariance matrix of the sufficient statistics of $y_t$ given this mean (Def. 5). For a Gaussian output noise this works as expected. For instance, for a classification problem, the output is categorical, $y_t \in \{1,\ldots,K\}$, and $\hat y_t$ will be the set of probabilities $\hat y_t = (p_1,\ldots,p_{K-1})$ to have $y_t = 1,\ldots,K-1$. In that case $R_t$ is the $(K-1)\times(K-1)$ matrix $(R_t)_{kk'} = \operatorname{diag}(p_k) - p_k p_{k'}$. (The last probability $p_K$ is determined by the others via $\sum p_k = 1$ and has to be excluded to obtain a non-degenerate parameterization and an invertible covariance matrix $R_t$.)

This convention allows us to extend the definition of the Kalman filter to such a setting (Def. 5) in a natural way, just by replacing the measurement error $y_t - \hat y_t$ with $T(y_t) - \hat y_t$, with $T$ the sufficient statistics for the exponential family. (For Gaussian noise this is the same, as $T(y)$ is $y$.) In neural network terms, this means that the output layer of the network is fed to a loss function that is the log-loss of an exponential family, but places no restriction on the rest of the model.

General notation. In statistical learning, the external inputs or regressor variables are often denoted $x$. In Kalman filtering, $x$ often denotes the state of the system, while the external inputs are often $u$. Thus we will avoid $x$ altogether and denote by $u$ the inputs and by $s$ the state of the system. The variable to be predicted at time $t$ will be $y_t$, and $\hat y_t$ is the corresponding prediction. In general $\hat y_t$ and $y_t$ may be different objects in that $\hat y_t$ encodes a full probabilistic prediction for $y_t$. For Gaussians with known variance, $\hat y_t$ is just the predicted mean of $y_t$, so in this case $y_t$ and $\hat y_t$ are the same type of object. For Gaussians with unknown variance, $\hat y_t$ encodes both the mean and second moment of $y_t$. For discrete categorical data, $\hat y_t$ encodes the probability of each possible outcome $y$.

² The Appendix contains a reminder on exponential families. An exponential family of probability distributions on $y$, with sufficient statistics $T_1(y),\ldots,T_K(y)$, and with parameter $\beta\in\mathbb{R}^K$, is given by
$$p_\beta(y) := \frac{1}{Z(\beta)}\, e^{\sum_k \beta_k T_k(y)}\, \lambda(\mathrm{d}y) \tag{1.3}$$
where $Z(\beta)$ is a normalizing constant, and $\lambda(\mathrm{d}y)$ is any reference measure on $y$. For instance, if $y\in\mathbb{R}^K$, $T_k(y) = y_k$ and $\lambda(\mathrm{d}y)$ is a Gaussian measure centered on $0$, by varying $\beta$ one gets all Gaussian measures with the same covariance matrix and another mean. $y$ may be discrete, e.g., Bernoulli distributions correspond to $\lambda$ the uniform measure on $y\in\{0,1\}$ and a single sufficient statistic $T(0)=0$, $T(1)=1$. Often, the mean parameter $\bar T := \mathbb{E}_{y\sim p_\beta} T(y)$ is a more convenient parameterization than $\beta$. Exponential families maximize entropy (minimize information divergence from $\lambda$) for a given mean of $T$.
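To make the categorical case above concrete, here is a small Python sketch (an illustration added to this text, not code from the paper) computing the mean parameter $\hat y$, the sufficient statistics $T(y)$ and the covariance $R_t$ for a softmax model over $K$ classes, parameterized by the probabilities of the first $K-1$ classes; the function name and interface are choices made here.

```python
import numpy as np

def categorical_output_stats(logits, y=None):
    """Mean parameter, sufficient statistics and noise covariance R for a
    categorical output modelled as an exponential family: a softmax over K
    classes, parameterized by the probabilities of the first K-1 classes."""
    p = np.exp(logits - logits.max())
    p /= p.sum()                            # softmax probabilities p_1, ..., p_K
    y_hat = p[:-1]                          # mean parameter: first K-1 probabilities
    # covariance of the sufficient statistics: (R)_{kk'} = p_k 1[k=k'] - p_k p_k'
    R = np.diag(y_hat) - np.outer(y_hat, y_hat)
    T = None
    if y is not None:                       # one-hot encoding of y over the first K-1 classes
        T = np.zeros(len(p) - 1)
        if y < len(p) - 1:                  # 0-based class indices; the last class maps to the zero vector
            T[y] = 1.0
    return y_hat, T, R

# The Kalman measurement error of Def. 5 below is then T(y) - y_hat:
y_hat, T, R = categorical_output_stats(np.array([1.0, 0.5, -0.2]), y=2)
error = T - y_hat
```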

Thus, the formal setting for this text is as follows: we are given a sequence of finite-dimensional observations $(y_t)$ with each $y_t \in \mathbb{R}^{\dim(y)}$, a sequence of inputs $(u_t)$ with each $u_t \in \mathbb{R}^{\dim(u)}$, a parametric model $\hat y_t = h(\theta, u_t)$ with parameter $\theta \in \mathbb{R}^{\dim(\theta)}$ and $h$ some fixed smooth function from $\mathbb{R}^{\dim(\theta)} \times \mathbb{R}^{\dim(u)}$ to $\mathbb{R}^{\dim(\hat y)}$. We are given an exponential family (output noise model) $p(y\mid\hat y)$ on $y$ with mean parameter $\hat y$ and sufficient statistics $T(y)$ (see the Appendix), and we define the loss function $\ell_t := -\ln p(y_t\mid\hat y_t)$.

The natural gradient descent on parameter $\theta$ will use the Fisher matrix $J_t$. The Kalman filter will have posterior covariance matrix $P_t$.

For multidimensional quantities $x$ and $y = f(x)$, we denote by $\frac{\partial y}{\partial x}$ the Jacobian matrix of $y$ w.r.t. $x$, whose $(i,j)$ entry is $\frac{\partial f_i(x)}{\partial x_j}$. This satisfies the chain rule $\frac{\partial z}{\partial y}\frac{\partial y}{\partial x} = \frac{\partial z}{\partial x}$. With this convention, gradients of real-valued functions are row vectors, so that a gradient descent takes the form $x \leftarrow x - \eta\,(\partial f/\partial x)^\top$. For a column vector $u$, $u^{\otimes 2}$ is synonymous with $uu^\top$, and with $u^\top u$ for a row vector.

1.2. Natural gradient descent

A standard approach to optimize the parameter $\theta$ of a probabilistic model, given a sequence of observations $(y_t)$, is an online gradient descent

$$\theta \leftarrow \theta - \eta_t\,\frac{\partial \ell_t(y_t)}{\partial\theta}^\top \tag{1.4}$$

with learning rate $\eta_t$. This simple gradient descent is particularly suitable for large datasets and large-dimensional models [BL03], but has several practical and theoretical shortcomings. For instance, it uses the same non-adaptive learning rate for all parameter components. Moreover, simple changes in parameter encoding or in data presentation (e.g., encoding black and white in images by 0/1 or 1/0) can result in different learning performance.

This motivated the introduction of the natural gradient [Ama98]. It is built to achieve invariance with respect to parameter re-encoding; in particular, learning becomes insensitive to the characteristic scale of each parameter direction, so that different directions naturally get suitable learning rates. The natural gradient is the only general way to achieve such invariance [AN00, 2.4].

The natural gradient preconditions the gradient descent with $J(\theta)^{-1}$ where $J$ is the Fisher information matrix [Kul97] with respect to the parameter $\theta$. For a smooth probabilistic model $p(y\mid\theta)$ over a random variable $y$ with parameter $\theta$, the latter is defined as

$$J(\theta) := \mathbb{E}_{y\sim p(y\mid\theta)}\left[\frac{\partial \ln p(y\mid\theta)}{\partial\theta}^{\otimes 2}\right] = -\,\mathbb{E}_{y\sim p(y\mid\theta)}\left[\frac{\partial^2 \ln p(y\mid\theta)}{\partial\theta^2}\right] \tag{1.5}$$

Definition 1 below formally introduces the online natural gradient. If the model for $y$ involves an input $u$, then an expectation or empirical average over the input is introduced in the definition of $J$ [AN00, 8.2] [Mar14, 5].

However, this comes at a large computational cost for large-dimensional models: just storing the Fisher matrix already costs $O((\dim\theta)^2)$. Various strategies are available to approximate the natural gradient for complex models such as neural networks, using diagonal or block-diagonal approximation schemes for the Fisher matrix, e.g., [LMB07, Oll15, MCO16, GS15, MG15].

Definition 1 (Online natural gradient). Consider a statistical model with parameter $\theta$ that predicts an output $y$ given an input $u$. Suppose that the prediction takes the form $y \sim p(y\mid\hat y)$ where $\hat y = h(\theta, u)$ depends on the input via a model $h$ with parameter $\theta$. Given observation pairs $(u_t, y_t)$, the goal is to minimize, online, the loss function $\ell_t(y_t)$,

$$\ell_t(y) := -\ln p(y\mid\hat y_t) \tag{1.6}$$

as a function of $\theta$.

The online natural gradient maintains a current estimate $\theta_t$ of the parameter $\theta$, and a current approximation $J_t$ of the Fisher matrix. The parameter is estimated by a gradient descent with preconditioning matrix $J_t^{-1}$, namely

$$J_t \leftarrow (1-\gamma_t)\,J_{t-1} + \gamma_t\,\mathbb{E}_{y\sim p(y\mid\hat y_t)}\left[\frac{\partial \ell_t(y)}{\partial\theta}^{\otimes 2}\right] \tag{1.7}$$

$$\theta_t \leftarrow \theta_{t-1} - \eta_t\,J_t^{-1}\left(\frac{\partial \ell_t(y_t)}{\partial\theta}\right)^\top \tag{1.8}$$

with learning rate $\eta_t$ and Fisher matrix decay rate $\gamma_t$.

In the Fisher matrix update, the expectation over all possible values $y\sim p(y\mid\hat y_t)$ can often be computed algebraically, but this is sometimes computationally bothersome (for instance, in neural networks, it requires $\dim(\hat y_t)$ distinct backpropagation steps [Oll15]). A common solution [APF00, LMB07, Oll15, PB13] is to just use the value $y = y_t$ (outer product approximation) instead of the expectation over $y$. Another is to use a Monte Carlo approximation with a single sample of $y\sim p(y\mid\hat y_t)$ [Oll15, MCO16], namely, using the gradient of a synthetic sample instead of the actual observation $y_t$ in the Fisher matrix. These latter two solutions are often confused; only the latter provides an unbiased estimate, see discussion in [Oll15, PB13].

The online smoothed update of the Fisher matrix in (1.7) mixes past and present estimates (this or similar updates are used in [LMB07, MCO16]). The reason is at least twofold. First, the genuine Fisher matrix involves an expectation over the inputs $u$ [AN00, 8.2]: this can be approximated online only via a moving average over inputs (e.g., $\gamma_t = 1/t$ realizes an equal-weight average over all inputs seen so far). Second, the expectation over $y\sim p(y\mid\hat y_t)$ in (1.7) is often replaced with a Monte Carlo estimation with only one value of $y$, and averaging over time compensates for this Monte Carlo sampling. As a consequence, since $\theta_t$ changes over time, this means that the estimate $J_t$ mixes values obtained at different values of $\theta$, and converges to the Fisher matrix only if $\theta$ changes slowly, i.e., if $\eta_t \to 0$. The correspondence below with Kalman filtering suggests using $\gamma_t = \eta_t$.
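As an illustration of Definition 1, the following sketch (not from the paper; the model interface and the Gaussian output assumption are choices made here for concreteness) implements the updates (1.7)–(1.8) for a model with Gaussian output noise of known covariance $R$, for which the expectation in (1.7) can be computed exactly as $H^\top R^{-1}H$ with $H = \partial\hat y/\partial\theta$.

```python
import numpy as np

def online_natural_gradient(theta0, J0, data, h_and_jac, R, lr):
    """Online natural gradient of Definition 1, sketched for a Gaussian output
    model y ~ N(h(theta, u), R) with known covariance R.
    h_and_jac(theta, u) must return (y_hat, H) with H = d y_hat / d theta;
    lr(t) returns the learning rate eta_t, and gamma_t is set equal to eta_t."""
    theta, J = theta0.copy(), J0.copy()
    Rinv = np.linalg.inv(R)
    for t, (u, y) in enumerate(data, start=1):
        y_hat, H = h_and_jac(theta, u)
        eta = gamma = lr(t)
        # (1.7): for Gaussian output noise, the expectation over y ~ p(y | y_hat)
        # of (d l/d theta)^{(x)2} is exactly H^T R^{-1} H, so no sampling is needed.
        J = (1 - gamma) * J + gamma * (H.T @ Rinv @ H)
        # (1.8): natural gradient step, with (d l_t(y_t)/d theta)^T = H^T R^{-1} (y_hat - y)
        grad = H.T @ Rinv @ (y_hat - y)
        theta = theta - eta * np.linalg.solve(J, grad)
    return theta, J
```

With `lr(t) = 1/(t+1)` and `J0` equal to the inverse of the prior covariance, this coincides with the static extended Kalman filter, as stated in Theorem 2 below.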

1.3. Kalman filtering for parameter estimation

One possible definition of the extended Kalman filter is as follows [Sim06, 15.1]. We are trying to estimate the current state of a dynamical system $s_t$ whose evolution equation is known but whose precise value is unknown; at each time step, we have access to a noisy measurement $y_t$ of a quantity $\hat y_t = h(s_t)$ which depends on this state. The Kalman filter maintains an approximation of a Bayesian posterior on $s_t$ given the observations $y_1,\ldots,y_t$. The posterior distribution after $t$ observations is approximated by a Gaussian with mean $s_t$ and covariance matrix $P_t$. (Indeed, Bayesian posteriors always tend to Gaussians asymptotically under mild conditions, by the Bernstein–von Mises theorem [vdV00].) The Kalman filter prescribes a way to update $s_t$ and $P_t$ when new observations become available. The Kalman filter update is summarized in Definition 5 below. It is built to provide the exact value of the Bayesian posterior in the case of linear dynamical systems with Gaussian measurements and a Gaussian prior. In that sense, it is exact at first order.

The Kalman filtering viewpoint on a statistical learning problem is that we are facing a system with hidden variable $\theta$, with an unknown value that does not evolve in time, and that the observations $y_t$ bring more and more information on $\theta$. Thus, a statistical learning problem can be tackled by applying the extended Kalman filter to the unknown variable $s = \theta$, whose underlying dynamics from time $t$ to time $t+1$ is just to remain unchanged ($f = \mathrm{Id}$ and noise on $s$ is $0$ in Definition 5). In such a setting, the posterior covariance matrix $P_t$ will generally tend to $0$ as observations accumulate and the parameter is identified better³ (this occurs at rate $1/t$ for the basic filter, which estimates $\theta$ from all past observations at time $t$, or at other rates if fading memory is included, see below). The initialization $\theta_0$ and its covariance $P_0$ can be interpreted as Bayesian priors on $\theta$ [SW88, LS83]. We will refer to this as a static Kalman filter.

³ But $P_t$ must still be maintained even if it tends to $0$, since it is used to update the parameter at the correct rate.

In the static case and without fading memory, the posterior covariance $P_t$ after $t$ observations will decrease like $O(1/t)$, so that the parameter gets updated by $O(1/t)$ after each new observation. Introducing fading memory for past observations (equivalent to adding noise on $\theta$ at each step, $Q_t \propto P_{t-1}$ in Def. 5) leads to a larger covariance and faster updates.

An example: Feedforward neural networks. The Kalman approach above can be applied to any parametric statistical model. For instance [SW88] treat the case of a feedforward neural network. In our setting this is described as follows. Let $u$ be the input of the model and $y$ the true (desired) output.

A feedforward neural network can be described as a function $\hat y = h(\theta, u)$ where $\theta$ is the set of all parameters of the network, where $h$ represents all computations performed by the network on input $u$, and $\hat y$ encodes the network prediction for the value of the output $y$ on input $u$. For categorical observations $y$, $\hat y$ is usually a set of predicted probabilities for all possible classes; while for regression problems, $\hat y$ is directly the predicted value. In both cases, the error function to be minimized can be defined as $\ell(y) := -\ln p(y\mid\hat y)$: in the regression case, $\hat y$ is interpreted as a mean of a Gaussian model on $y$, so that $-\ln p(y\mid\hat y)$ is the square error up to a constant.

Training the neural network amounts to estimating the network parameter $\theta$ from the observations. Applying a static Kalman filter for this problem [SW88] amounts to using Def. 5 with $s = \theta$, $f = \mathrm{Id}$ and $Q = 0$. At first glance this looks quite different from the common gradient descent (backpropagation) approach for neural networks. The backpropagation operation is represented in the Kalman filter by the computation of $H_t = \frac{\partial h(s,u_t)}{\partial s}$ in (2.17), where $s$ is the parameter. We show that the additional operations of the Kalman filter correspond to using a natural gradient instead of a vanilla gradient. Unfortunately, for models with high-dimensional parameters such as neural networks, the Kalman filter is computationally costly and requires block-diagonal approximations for $P_t$ (which is a square matrix of size $\dim\theta$); moreover, computing $H_t = \partial\hat y_t/\partial\theta$ is needed in the filter, and requires doing one separate backpropagation for each component of the output $\hat y_t$.

2. Natural gradient as a Kalman filter: the static (i.i.d.) case

We now write the explicit correspondence between an online natural gradient to estimate the parameter of a statistical model from i.i.d. observations, and a static extended Kalman filter. We first give a heuristic argument that outlines the main ideas from the proof (Section 2.1). Then we state the formal correspondences.

First, the static Kalman filter corresponds to an online natural gradient with learning rate $1/t$ (Thm. 2). The rate $1/t$ arises because such a filter takes into account all previous evidence without decay factors (and with process noise $Q = 0$ in the Kalman filter), thus the posterior covariance matrix decreases like $O(1/t)$. Asymptotically, this is the optimal rate in statistical learning [Ama98]. (Note, however, that the online natural gradient and extended Kalman filter are identical at every time step, not only asymptotically.)

The $1/t$ rate is often too slow in practical applications, especially when starting far away from an optimal parameter value. The natural gradient/Kalman filter correspondence is not specific to the $O(1/t)$ rate. Larger learning rates in the natural gradient correspond to a fading memory Kalman filter (adding process noise $Q_t$ proportional to the posterior covariance at each step, corresponding to a decay factor for the weight of previous observations); this is Proposition 3. In such a setting, the posterior covariance matrix in the Kalman filter does not decrease like $O(1/t)$; for instance, a fixed decay factor for the fading memory corresponds to a constant learning rate.

Finally, a fading memory in the Kalman filter may erase prior Bayesian information $(\theta_0, P_0)$ too fast; maintaining the weight of the prior in a fading memory Kalman filter is treated in Proposition 4 and corresponds, on the natural gradient side, to a so-called weight decay [Bis06] towards $\theta_0$ together with a regularization of the Fisher matrix, at specific rates.

2.1. Natural gradient as a Kalman filter: heuristics

As a first ingredient in the correspondence, we interpret Kalman filters as gradient descents: the extended Kalman filter actually performs a gradient descent on the log-likelihood of each new observation, with preconditioning matrix equal to the posterior covariance matrix. This is Proposition 6 below. This relies on having an exponential family as the output noise model.

Meanwhile, the natural gradient uses the Fisher matrix as a preconditioning matrix. The Fisher matrix is the average Hessian of log-likelihood, thanks to the classical double definition of the Fisher matrix as square gradient or Hessian,

$$J(\theta) = \mathbb{E}_{y\sim p(y\mid\theta)}\left[\frac{\partial \ln p(y\mid\theta)}{\partial\theta}^{\otimes 2}\right] = -\,\mathbb{E}_{y\sim p(y\mid\theta)}\left[\frac{\partial^2 \ln p(y\mid\theta)}{\partial\theta^2}\right]$$

for any probabilistic model $p(y\mid\theta)$ [Kul97]. Assume that the probability of the data given the parameter $\theta$ is approximately Gaussian, $p(y_1,\ldots,y_t\mid\theta) \approx \exp\bigl(-(\theta-\theta^*)^\top\Sigma^{-1}(\theta-\theta^*)\bigr)$ with covariance $\Sigma$. This often holds asymptotically thanks to the Bernstein–von Mises theorem; moreover, the posterior covariance $\Sigma$ typically decreases like $1/t$. Then the Hessian (w.r.t. $\theta$) of the total log-likelihood of $(y_1,\ldots,y_t)$ is $\Sigma^{-1}$, the inverse covariance of $\theta$. So the average Hessian per data point, the Fisher matrix $J$, is approximately $J \approx \Sigma^{-1}/t$. Since a Kalman filter to estimate $\theta$ is essentially a gradient descent preconditioned with $\Sigma$, it will be the same as using a natural gradient with learning rate $1/t$. Using a fading memory Kalman filter will estimate $\Sigma$ from fewer past observations and provide larger learning rates.

Another way to understand the link between natural gradient and Kalman filter is as a second-order Taylor expansion of data log-likelihood. Assume that the total data log-likelihood at time $t$, $L_t(\theta) := \sum_{s=1}^t \ln p(y_s\mid\theta)$, is approximately quadratic as a function of $\theta$, attaining its optimum at $\theta_t$ with Hessian $-h_t$, namely, $L_t(\theta) \approx -\tfrac12(\theta-\theta_t)^\top h_t\,(\theta-\theta_t)$ up to a constant. Then when new data points become available, this quadratic approximation would be updated as follows (online Newton method):

$$h_t \leftarrow h_{t-1} + \frac{\partial^2\bigl(-\ln p(y_t\mid\theta_{t-1})\bigr)}{\partial\theta^2} \tag{2.1}$$

$$\theta_t \leftarrow \theta_{t-1} - h_t^{-1}\,\frac{\partial\bigl(-\ln p(y_t\mid\theta_{t-1})\bigr)}{\partial\theta}^\top \tag{2.2}$$

and indeed these are equalities for a quadratic log-likelihood. Namely, the update of $\theta$ is a gradient ascent on log-likelihood, preconditioned by the inverse Hessian (Newton method). Note that $h_t$ grows like $t$ (each data point adds its own contribution).

Thus, $h_t$ is $t$ times the empirical average of the Hessian, i.e., approximately $t$ times the Fisher matrix of the model ($h_t \approx tJ$). So this update is approximately a natural gradient descent with learning rate $1/t$.

Meanwhile, the Bayesian posterior on $\theta$ (with uniform prior) after observations $y_1,\ldots,y_t$ is proportional to $e^{L_t}$ by definition of $L_t$. If $L_t \approx -\tfrac12(\theta-\theta_t)^\top h_t(\theta-\theta_t)$, this is a Gaussian distribution centered at $\theta_t$ with covariance matrix $h_t^{-1}$. The Kalman filter is built to maintain an approximation $P_t$ of this covariance matrix $h_t^{-1}$, and then performs a gradient step preconditioned on $P_t$ similar to (2.2). The simplest situation corresponds to an asymptotic rate $O(1/t)$, i.e., estimating the parameter based on all past evidence; the update (2.1) of the Hessian is additive, so that $h_t$ grows like $t$ and $h_t^{-1}$ in (2.2) produces an effective learning rate $O(1/t)$. Introducing a decay factor for older observations, multiplying the term $h_{t-1}$ in (2.1), produces a fading memory effect and results in larger learning rates.

These heuristics justify the statement from [LS83] that there is only one recursive identification method. Close to an optimum (so that the Hessian is positive), all second-order algorithms are essentially an online Newton step (2.1)–(2.2) approximated in various ways. But even though this heuristic argument appears to be approximate or asymptotic, the correspondence between online natural gradient and Kalman filter presented below is exact at every time step.

2.2. Statement of the correspondence, static (i.i.d.) case

For the statement of the correspondence, we assume that the output noise on $y_t$ given $\hat y_t$ is modelled by an exponential family with mean parameter $\hat y_t$. This covers the traditional Gaussian case $y_t = \mathcal{N}(\hat y_t, \Sigma)$ with fixed $\Sigma$ often used in Kalman filters. The Appendix contains necessary background on exponential families.

Theorem 2 (Natural gradient as a static Kalman filter). These two algorithms are identical under the correspondence $(\theta_t, J_t) \leftrightarrow (s_t, P_t^{-1}/(t+1))$:

1. The online natural gradient (Def. 1) with learning rates $\eta_t = \gamma_t = 1/(t+1)$, applied to learn the parameter $\theta$ of a model that predicts observations $(y_t)$ with inputs $(u_t)$, using a probabilistic model $y\sim p(y\mid\hat y)$ with $\hat y = h(\theta, u)$, where $h$ is any model and $p(y\mid\hat y)$ is an exponential family with mean parameter $\hat y$.

2. The extended Kalman filter (Def. 5) to estimate the state $s$ from observations $(y_t)$ and inputs $(u_t)$, using a probabilistic model $y\sim p(y\mid\hat y)$ with $\hat y = h(s, u)$ and $p(y\mid\hat y)$ an exponential family with mean parameter $\hat y$, with static dynamics and no added noise on $s$ ($f(s,u) = s$ and $Q = 0$ in Def. 5).

Namely, if at startup $(\theta_0, J_0) = (s_0, P_0^{-1})$, then $(\theta_t, J_t) = (s_t, P_t^{-1}/(t+1))$ for all $t \ge 0$.
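The correspondence of Theorem 2 can be checked numerically. The sketch below (an illustration with an arbitrary toy model, not taken from the paper) runs a static extended Kalman filter and the online natural gradient with $\eta_t = \gamma_t = 1/(t+1)$ side by side on the same data, and asserts that the parameter iterates coincide and that $J_t = P_t^{-1}/(t+1)$ at every step.

```python
import numpy as np

rng = np.random.default_rng(0)
dim_theta, T_steps = 3, 50
R = np.array([[0.5]])                        # known output noise covariance
Rinv = np.linalg.inv(R)

def h_and_jac(theta, u):                     # toy nonlinear model y_hat = tanh(u . theta)
    z = u @ theta
    return np.array([np.tanh(z)]), ((1 - np.tanh(z) ** 2) * u).reshape(1, -1)

data = [(rng.normal(size=dim_theta), rng.normal(size=1)) for _ in range(T_steps)]

s, P = np.zeros(dim_theta), np.eye(dim_theta)      # static EKF (Def. 5, f = Id, Q = 0)
theta, J = s.copy(), np.linalg.inv(P)              # online natural gradient, J_0 = P_0^{-1}

for t, (u, y) in enumerate(data, start=1):
    # extended Kalman filter observation step
    y_hat, H = h_and_jac(s, u)
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
    P = (np.eye(dim_theta) - K @ H) @ P
    s = s + K @ (y - y_hat)
    # online natural gradient step, with the Fisher matrix updated before the parameter
    y_hat2, H2 = h_and_jac(theta, u)
    eta = gamma = 1.0 / (t + 1)
    J = (1 - gamma) * J + gamma * (H2.T @ Rinv @ H2)
    theta = theta - eta * np.linalg.solve(J, H2.T @ Rinv @ (y_hat2 - y))
    # the iterates coincide, and J_t = P_t^{-1} / (t+1), as in Theorem 2
    assert np.allclose(s, theta) and np.allclose(J, np.linalg.inv(P) / (t + 1))
```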

The correspondence is exact only if the Fisher metric is updated before the parameter in the natural gradient descent (as in Definition 1).

The correspondence with a Kalman filter provides an interpretation for various hyper-parameters of online natural gradient descent. In particular, $J_0 = P_0^{-1}$ can be interpreted as the inverse covariance of a Bayesian prior on $\theta$ [SW88]. This relates the initialization $J_0$ of the Fisher matrix to the initialization of $\theta$: for instance, in neural networks it is recommended to initialize the weights according to a Gaussian of covariance diag(1/fan-in) (number of incoming weights) for each neuron; interpreting this as a Bayesian prior on weights, one may recommend to initialize the Fisher matrix to the inverse of this covariance, namely,

$$J_0 \leftarrow \operatorname{diag}(\text{fan-in}) \tag{2.3}$$

Indeed this seemed to perform quite well in small-scale experiments.

Learning rates, fading memory, and metric decay rate. Theorem 2 exhibits a $1/(t+1)$ learning rate for the online natural gradient. This is because the static Kalman filter for i.i.d. observations approximates the maximum a posteriori (MAP) of the parameter $\theta$ based on all past observations; MAP and maximum likelihood estimators change by $O(1/t)$ when a new data point is observed. However, for nonlinear systems, optimality of the $1/t$ rate only occurs asymptotically, close enough to the optimum. In general, a $1/(t+1)$ learning rate is far from optimal if optimization does not start close to the optimum or if one is not using the exact Fisher matrix $J_t$ or covariance matrix $P_t$.

Larger effective learning rates are achieved thanks to so-called fading memory variants of the Kalman filter, which put less weight on older observations. For instance, one may multiply the log-likelihood of previous points by a forgetting factor $(1-\lambda_t)$ before each new observation. This is equivalent to an additional step $P_{t-1} \leftarrow P_{t-1}/(1-\lambda_t)$ in the Kalman filter, or to the addition of an artificial process noise $Q_t$ proportional to $P_{t-1}$ in the model. Such strategies are reported to often improve performance, especially when the data do not truly follow the model [Sim06, 5.5, 7.4], [Hay01, 5.2.2]. See for instance [Ber96] for the relationship between Kalman fading memory and gradient descent learning rates (in a particular case).

Proposition 3 (Natural gradient rates and fading memory). Under the same model and assumptions as in Theorem 2, the following two algorithms are identical via the correspondence $(\theta_t, J_t) \leftrightarrow (s_t, \eta_t P_t^{-1})$:

- An online natural gradient step with learning rate $\eta_t$ and metric decay rate $\gamma_t$;

- A fading memory Kalman filter with an additional step $P_{t-1} \leftarrow P_{t-1}/(1-\lambda_t)$ before the transition step; such a filter iteratively optimizes a weighted log-likelihood function $L_t$ of recent observations, with decay $(1-\lambda_t)$ at each step, namely:

$$L_t(\theta) = \ln p_\theta(y_t) + (1-\lambda_t)\,L_{t-1}(\theta), \qquad L_0(\theta) := -\tfrac12(\theta-\theta_0)^\top P_0^{-1}(\theta-\theta_0) \tag{2.4}$$

provided the following relations are satisfied:

$$\eta_t = \gamma_t, \qquad P_0 = \eta_0\,J_0^{-1}, \tag{2.5}$$

$$1-\lambda_t = \eta_{t-1}/\eta_t - \eta_{t-1} \quad\text{for } t\ge 1 \tag{2.6}$$

For example, taking $\eta_t = 1/(t+\mathrm{cst})$ corresponds to $\lambda_t = 0$, no decay for older observations, and an initial covariance $P_0 = J_0^{-1}/\mathrm{cst}$. Taking a constant learning rate $\eta_t = \eta_0$ corresponds to a constant decay factor $\lambda_t = \eta_0$.

The proposition above computes the fading memory decay factors $1-\lambda_t$ from the natural gradient learning rates $\eta_t$ via (2.6). In the other direction, one can start with the decay factors $\lambda_t$ and obtain the learning rates $\eta_t$ via the cumulated sum of weights $S_t$: $S_0 := 1/\eta_0$, then $S_t := (1-\lambda_t)S_{t-1} + 1$, then $\eta_t := 1/S_t$. This clarifies how $\lambda_t = 0$ corresponds to $\eta_t = 1/(t+\mathrm{cst})$ where the constant is $S_0$.

The learning rates also control the weight given to the Bayesian prior and to the starting point $\theta_0$. For instance, with $\eta_t = 1/(t+t_0)$ and large $t_0$, the gradient descent will move away slowly from $\theta_0$; in the Kalman interpretation this corresponds to $\lambda_t = 0$ and a small initial covariance $P_0 = J_0^{-1}/t_0$ around $\theta_0$, so that the prior weighs as much as $t_0$ observations.

This result suggests to set $\gamma_t = \eta_t$ in the online natural gradient descent of Definition 1. The intuitive explanation for this setting is as follows: Both the Kalman filter and the natural gradient build a second-order approximation of the log-likelihood of past observations as a function of the parameter $\theta$, as explained in Section 2.1. Using a fading memory corresponds to putting smaller weights on past observations; these weights affect the first-order and the second-order parts of the approximation in the same way. In the gradient viewpoint, the learning rate $\eta_t$ corresponds to the first-order term (comparing (1.8) and (2.2)) while the Fisher matrix decay rate corresponds to the rate at which the second-order information is updated. Thus, the setting $\eta_t = \gamma_t$ in the natural gradient corresponds to using the same decay weights for the first-order and second-order expansion of the log-likelihood of past observations.

Still, one should keep in mind that the extended Kalman filter is itself only an approximation for nonlinear systems. Moreover, from a statistical point of view, the second-order object $J_t$ is higher-dimensional than the first-order information, so that estimating $J_t$ based on more past observations may be more stable. Finally, for large-dimensional problems the Fisher matrix is always approximated, which affects optimality of the learning rates. So in practice, considering $\gamma_t$ and $\eta_t$ as hyperparameters to be tuned independently may still be beneficial, though $\gamma_t = \eta_t$ seems a good place to start.
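The conversion between learning rates and decay factors described above is easy to implement; the small helper below (illustrative code added here, not from the paper) goes from $\eta_t$ to $\lambda_t$ via (2.6) and back via the cumulated sum of weights $S_t$.

```python
def decay_factors_from_learning_rates(etas):
    """lambda_t from eta_t via (2.6): 1 - lambda_t = eta_{t-1}/eta_t - eta_{t-1}."""
    return [1.0 - (etas[t - 1] / etas[t] - etas[t - 1]) for t in range(1, len(etas))]

def learning_rates_from_decay_factors(eta0, lambdas):
    """eta_t from lambda_t via the cumulated sum of weights S_t."""
    S, etas = 1.0 / eta0, [eta0]
    for lam in lambdas:
        S = (1.0 - lam) * S + 1.0          # S_t = (1 - lambda_t) S_{t-1} + 1
        etas.append(1.0 / S)               # eta_t = 1 / S_t
    return etas

# eta_t = 1/(t + cst) gives lambda_t = 0 (no decay of older observations),
# while a constant learning rate eta gives a constant decay factor lambda = eta.
etas = [1.0 / (t + 5.0) for t in range(20)]
assert all(abs(lam) < 1e-12 for lam in decay_factors_from_learning_rates(etas))
```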

Regularization of the Fisher matrix and Bayesian priors. A potential downside of fading memory in the Kalman filter is that the Bayesian interpretation is partially lost, because the Bayesian prior is forgotten too quickly. For instance, with a constant learning rate, the weight of the Bayesian prior decreases exponentially; likewise, with $\eta_t = O(1/\sqrt{t})$, the filter essentially works with the $O(\sqrt{t})$ most recent observations, while the weight of the prior decreases like $e^{-\sqrt{t}}$ (as does the weight of the earliest observations; this is the product $\prod(1-\lambda_t)$). But precisely, when working with fewer data points one may wish the prior to play a greater role.

The Bayesian interpretation can be restored by explicitly optimizing a combination of the log-likelihood of recent points, and the log-likelihood of the prior. This is implemented in Proposition 4. From the natural gradient viewpoint, this translates both as a regularization of the Fisher matrix (often useful in practice to numerically stabilize its inversion) and of the gradient step. With a Gaussian prior $\mathcal{N}(\theta_{\mathrm{prior}}, \mathrm{Id})$, this manifests as an additional step towards $\theta_{\mathrm{prior}}$ and adding $\varepsilon\,\mathrm{Id}$ to the Fisher matrix, known respectively as weight decay and Tikhonov regularization [Bis06, 3.3, 5.5] in statistical learning.

Proposition 4 (Bayesian regularization of the Fisher matrix). Let $\pi = \mathcal{N}(\theta_{\mathrm{prior}}, \Sigma_0)$ be a Gaussian prior on $\theta$. Under the same model and assumptions as in Theorem 2, the following two algorithms are equivalent:

- A modified fading memory Kalman filter that iteratively optimizes $L_t(\theta) + n_{\mathrm{prior}}\ln\pi(\theta)$ where $L_t$ is a weighted log-likelihood function of recent observations with decay $(1-\lambda_t)$:

$$L_t(\theta) = \ln p_\theta(y_t) + (1-\lambda_t)\,L_{t-1}(\theta), \qquad L_0 := 0 \tag{2.7}$$

initialized with $P_0 = \dfrac{\eta_1}{1+n_{\mathrm{prior}}\eta_1}\,\Sigma_0$.

- A regularized online natural gradient step with learning rate $\eta_t$ and metric decay rate $\gamma_t$, initialized with $J_0 = \Sigma_0^{-1}$,

$$\theta_t \leftarrow \theta_{t-1} - \eta_t\left(J_t + \eta_t n_{\mathrm{prior}}\Sigma_0^{-1}\right)^{-1}\left(\frac{\partial\ell_t(y_t)}{\partial\theta}^\top + \lambda_t n_{\mathrm{prior}}\Sigma_0^{-1}(\theta_{t-1}-\theta_{\mathrm{prior}})\right) \tag{2.8}$$

provided the following relations are satisfied:

$$\eta_t = \gamma_t, \qquad 1-\lambda_t = \eta_{t-1}/\eta_t - \eta_{t-1}, \qquad \eta_0 := \eta_1 \tag{2.9}$$

Thus, the regularization terms are fully determined by choosing the learning rates $\eta_t$, a prior such as $\mathcal{N}(0, 1/\text{fan-in})$ (for neural networks), and a value of $n_{\mathrm{prior}}$ such as $n_{\mathrm{prior}} = 1$ (the prior weighs as much as $n_{\mathrm{prior}}$ data points). This holds both for regularization of the Fisher matrix, $J_t + \eta_t n_{\mathrm{prior}}\Sigma_0^{-1}$, and for regularization of the parameter via the extra gradient step $\lambda_t n_{\mathrm{prior}}\Sigma_0^{-1}(\theta-\theta_{\mathrm{prior}})$.

The relative strength of regularization in the Fisher matrix decreases like $\eta_t$. In particular, a constant learning rate results in a constant regularization. The added gradient step $\lambda_t n_{\mathrm{prior}}\Sigma_0^{-1}(\theta-\theta_{\mathrm{prior}})$ is modulated by $\lambda_t$ which depends on $\eta_t$; this extra term pulls towards the prior $\theta_{\mathrm{prior}}$.
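A minimal Python sketch of the regularized step (2.8) as stated above (the function and argument names are illustrative assumptions, not from the paper):

```python
import numpy as np

def regularized_natural_gradient_step(theta, grad, J, eta, lam,
                                      theta_prior, Sigma0_inv, n_prior=1.0):
    """One step of (2.8): Tikhonov-like regularization of the Fisher matrix and a
    weight-decay pull towards the prior mean, with strengths tied to eta_t and lambda_t.
    Here grad stands for (d l_t(y_t)/d theta)^T."""
    A = J + eta * n_prior * Sigma0_inv                              # regularized Fisher matrix
    b = grad + lam * n_prior * Sigma0_inv @ (theta - theta_prior)   # regularized gradient
    return theta - eta * np.linalg.solve(A, b)
```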

The Bayesian viewpoint guarantees that this extra term will not ultimately prevent convergence of the gradient descent (as the influence of the prior vanishes when the number of observations increases).

It is not clear how much these recommendations for natural gradient descent coming from its Bayesian interpretation are sensitive to using only an approximation of the Fisher matrix.

2.3. Proofs for the static case

The proof of Theorem 2 starts with the interpretation of the Kalman filter as a gradient descent (Proposition 6). We first recall the exact definition and the notation we use for the extended Kalman filter.

Definition 5 (Extended Kalman filter). Consider a dynamical system with state $s_t$, inputs $u_t$ and outputs $y_t$,

$$s_t = f(s_{t-1}, u_t) + \mathcal{N}(0, Q_t), \qquad \hat y_t = h(s_t, u_t), \qquad y_t \sim p(y\mid\hat y_t) \tag{2.10}$$

where $p(\cdot\mid\hat y)$ denotes an exponential family with mean parameter $\hat y$ (e.g., $y = \mathcal{N}(\hat y, R)$ with fixed covariance matrix $R$).

The extended Kalman filter for this dynamical system estimates the current state $s_t$ given observations $y_1,\ldots,y_t$ in a Bayesian fashion. At each time $t$, the Bayesian posterior distribution of the state given $y_1,\ldots,y_t$ is approximated by a Gaussian $\mathcal{N}(s_t, P_t)$ so that $s_t$ is the approximate maximum a posteriori, and $P_t$ is the approximate posterior covariance matrix. (The prior is $\mathcal{N}(s_0, P_0)$ at time $0$.)

Each time a new observation $y_t$ is available, these estimates are updated as follows. The transition step (before observing $y_t$) is

$$s_{t|t-1} \leftarrow f(s_{t-1}, u_t) \tag{2.11}$$
$$F_{t-1} \leftarrow \frac{\partial f}{\partial s}(s_{t-1}, u_t) \tag{2.12}$$
$$P_{t|t-1} \leftarrow F_{t-1}\,P_{t-1}\,F_{t-1}^\top + Q_t \tag{2.13}$$

and the observation step after observing $y_t$ is

$$\hat y_t \leftarrow h(s_{t|t-1}, u_t) \tag{2.14}$$
$$E_t \leftarrow \text{sufficient statistics}(y_t) - \hat y_t \tag{2.15}$$
$$R_t \leftarrow \operatorname{Cov}\bigl(\text{sufficient statistics}(y) \mid \hat y_t\bigr) \tag{2.16}$$

(these are just the error $E_t = y_t - \hat y_t$ and the covariance matrix $R_t = R$ for a Gaussian model $y = \mathcal{N}(\hat y, R)$ with known $R$)

$$H_t \leftarrow \frac{\partial h}{\partial s}(s_{t|t-1}, u_t) \tag{2.17}$$
$$K_t \leftarrow P_{t|t-1}\,H_t^\top\bigl(H_t\,P_{t|t-1}\,H_t^\top + R_t\bigr)^{-1} \tag{2.18}$$
$$P_t \leftarrow (\operatorname{Id} - K_tH_t)\,P_{t|t-1} \tag{2.19}$$
$$s_t \leftarrow s_{t|t-1} + K_tE_t \tag{2.20}$$
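For concreteness, here is a compact Python sketch of one step of Definition 5 in the standard Gaussian-output case where the sufficient statistics are $T(y) = y$ (an illustration added here, not code from the paper; the functions `f`, `h` and their Jacobians are placeholders to be supplied by the user).

```python
import numpy as np

def ekf_step(s, P, u, y, f, F_jac, h, H_jac, Q, R):
    """One transition + observation step of the extended Kalman filter of
    Definition 5, in the Gaussian-output case where the sufficient statistics
    are T(y) = y, so that E_t = y_t - y_hat_t and R_t = R."""
    # transition step (2.11)-(2.13)
    F = F_jac(s, u)
    s_pred = f(s, u)
    P_pred = F @ P @ F.T + Q
    # observation step (2.14)-(2.20)
    y_hat = h(s_pred, u)
    E = y - y_hat
    H = H_jac(s_pred, u)
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)
    P_new = (np.eye(len(s)) - K @ H) @ P_pred
    s_new = s_pred + K @ E
    return s_new, P_new

# Static parameter estimation (Section 1.3) corresponds to f(s, u) = s,
# F_jac returning the identity matrix, and Q = 0.
```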

For non-Gaussian output noise, the definition of $E_t$ and $R_t$ above via the mean parameter $\hat y_t$ of an exponential family, differs from the practice of modelling non-Gaussian noise via a nonlinear function applied to Gaussian noise. This allows for a straightforward treatment of various output models, such as discrete outputs or Gaussians with unknown variance. In the Gaussian case with known variance our definition is fully standard.⁴

⁴ Non-Gaussian output noise is often modelled in Kalman filtering via a continuous nonlinear function applied to a Gaussian noise [Sim06, 13.1]; this cannot easily represent discrete random variables. Moreover, since the filter linearizes the function around the $0$ value of the noise [Sim06, 13.1], the noise is still implicitly Gaussian, though with a state-dependent variance.

The proof starts with the interpretation of the Kalman filter as a gradient descent preconditioned by $P_t$. Compare this result and Lemma 9 to [Hay01, (5.68)–(5.73)].

Proposition 6 (Kalman filter as preconditioned gradient descent). The update of the state $s$ in a Kalman filter can be seen as an online gradient descent on data log-likelihood, with preconditioning matrix $P_t$. More precisely, denoting $\ell_t(y) := -\ln p(y\mid\hat y_t)$, the update (2.20) is equivalent to

$$s_t = s_{t|t-1} - P_t\left(\frac{\partial \ell_t(y_t)}{\partial s_{t|t-1}}\right)^\top \tag{2.21}$$

where in the derivative, $\ell_t$ depends on $s_{t|t-1}$ via $\hat y_t = h(s_{t|t-1}, u_t)$.

Lemma 7 (Errors and gradients). When the output model is an exponential family with mean parameter $\hat y_t$, the error $E_t$ is related to the gradient of the log-likelihood of the observation $y_t$ with respect to the prediction $\hat y_t$ by

$$E_t = R_t\left(\frac{\partial \ln p(y_t\mid\hat y_t)}{\partial\hat y_t}\right)^\top$$

Proof of the lemma. For a Gaussian $y = \mathcal{N}(\hat y_t, R)$, this is just a direct computation. For a general exponential family, consider the natural parameter $\beta$ of the exponential family which defines the law of $y_t$, namely, $p(y\mid\beta) = \exp\bigl(\sum_i\beta_iT_i(y)\bigr)/Z(\beta)$ with sufficient statistics $T_i$ and normalizing constant $Z$. An elementary computation (Appendix, (A.3)) shows that

$$\frac{\partial \ln p(y\mid\beta)}{\partial\beta_i} = T_i(y) - \mathbb{E}\,T_i = T_i(y) - \hat y_i \tag{2.22}$$

by definition of the mean parameter $\hat y$.

Thus,

$$E_t = T(y_t) - \hat y_t = \left(\frac{\partial \ln p(y_t\mid\beta)}{\partial\beta}\right)^\top \tag{2.23}$$

where the derivative is with respect to the natural parameter $\beta$. To express the derivative with respect to $\hat y$, we apply the chain rule $\frac{\partial \ln p(y\mid\beta)}{\partial\beta} = \frac{\partial \ln p(y\mid\hat y)}{\partial\hat y}\,\frac{\partial\hat y}{\partial\beta}$ and use the fact that, for exponential families, the Jacobian matrix $\frac{\partial\hat y}{\partial\beta}$ of the mean parameter is equal to the covariance matrix $R_t$ of the sufficient statistics (Appendix, (A.11) and (A.6)).

Lemma 8. The extended Kalman filter satisfies $K_tR_t = P_tH_t^\top$.

Proof of the lemma. This relation is known, e.g., [Sim06, (6.34)]. Indeed, using the definition of $K_t$, we have $K_tR_t = K_t(R_t + H_tP_{t|t-1}H_t^\top) - K_tH_tP_{t|t-1}H_t^\top = P_{t|t-1}H_t^\top - K_tH_tP_{t|t-1}H_t^\top = (\operatorname{Id}-K_tH_t)P_{t|t-1}H_t^\top = P_tH_t^\top$.

Proof of Proposition 6. By definition of the Kalman filter we have $s_t = s_{t|t-1} + K_tE_t$. By Lemma 7, $E_t = -R_t\left(\frac{\partial\ell_t}{\partial\hat y_t}\right)^\top$ (since $\ell_t = -\ln p(y_t\mid\hat y_t)$). Thanks to Lemma 8 we find $s_t = s_{t|t-1} - K_tR_t\left(\frac{\partial\ell_t}{\partial\hat y_t}\right)^\top = s_{t|t-1} - P_tH_t^\top\left(\frac{\partial\ell_t}{\partial\hat y_t}\right)^\top$. But by the definition of $H_t$, $H_t$ is $\frac{\partial\hat y_t}{\partial s_{t|t-1}}$, so that $\frac{\partial\ell_t}{\partial\hat y_t}H_t$ is $\frac{\partial\ell_t}{\partial s_{t|t-1}}$.

The first part of the next lemma is known as the information filter in the Kalman filter literature, and states that the observation step for $P_t$ is additive when considered on $P_t^{-1}$ [Sim06, 6.2]: after each observation, the Fisher information matrix of the latest observation is added to $P_t^{-1}$.

Lemma 9 (Information filter). The update (2.18)–(2.19) of $P_t$ in the extended Kalman filter is equivalent to

$$P_t^{-1} \leftarrow P_{t|t-1}^{-1} + H_t^\top R_t^{-1}H_t \tag{2.24}$$

(assuming $P_{t|t-1}$ and $R_t$ are invertible). In particular, for static dynamical systems ($f(s,u) = s$ and $Q = 0$), the whole extended Kalman filter (2.12)–(2.20) is equivalent to

$$P_t^{-1} \leftarrow P_{t-1}^{-1} + H_t^\top R_t^{-1}H_t \tag{2.25}$$
$$s_t \leftarrow s_{t-1} - P_t\left(\frac{\partial \ell_t(y_t)}{\partial s_{t-1}}\right)^\top \tag{2.26}$$

Proof. The first statement is well-known for Kalman filters [Sim06, (6.33)]. Indeed, expanding the definition of $K_t$ in the update (2.19) of $P_t$, we have

$$P_t = P_{t|t-1} - P_{t|t-1}H_t^\top\bigl(H_tP_{t|t-1}H_t^\top + R_t\bigr)^{-1}H_tP_{t|t-1} \tag{2.27}$$

but this is equal to $\bigl(P_{t|t-1}^{-1} + H_t^\top R_t^{-1}H_t\bigr)^{-1}$ thanks to the Woodbury matrix identity. The second statement follows from Proposition 6 and the fact that for $f(s,u) = s$, the transition step of the Kalman filter is just $s_{t|t-1} = s_{t-1}$ and $P_{t|t-1} = P_{t-1}$.

Lemma 10. For exponential families $p(y\mid\hat y)$, the term $H_t^\top R_t^{-1}H_t$ appearing in Lemma 9 is equal to the Fisher information matrix of $y$ with respect to the state $s$,

$$H_t^\top R_t^{-1}H_t = \mathbb{E}_{y\sim p(y\mid\hat y_t)}\left[\frac{\partial \ell_t(y)}{\partial s_{t|t-1}}^{\otimes 2}\right]$$

where $\ell_t(y) = -\ln p(y\mid\hat y_t)$ depends on $s$ via $\hat y_t = h(s, u_t)$.

Proof. Let us omit time indices for brevity. We have $\frac{\partial\ell(y)}{\partial s} = \frac{\partial\ell(y)}{\partial\hat y}\frac{\partial\hat y}{\partial s} = \frac{\partial\ell(y)}{\partial\hat y}H$. Consequently, $\mathbb{E}_y\left[\frac{\partial\ell(y)}{\partial s}^{\otimes 2}\right] = H^\top\,\mathbb{E}_y\left[\frac{\partial\ell(y)}{\partial\hat y}^{\otimes 2}\right]H$. The middle term $\mathbb{E}_y\left[\frac{\partial\ell(y)}{\partial\hat y}^{\otimes 2}\right]$ is the Fisher matrix of the random variable $y$ with respect to $\hat y$. Now, for an exponential family $y\sim p(y\mid\hat y)$ in mean parameterization $\hat y$, the Fisher matrix with respect to $\hat y$ is equal to the inverse covariance matrix of the sufficient statistics of $y$ (Appendix, (A.16)), that is, $R^{-1}$.

Proof of Theorem 2. By induction on $t$. By the combination of Lemmas 9 and 10, the update of the Kalman filter with static dynamics ($s_{t|t-1} = s_{t-1}$) is

$$P_t^{-1} \leftarrow P_{t-1}^{-1} + \mathbb{E}_{y\sim p(y\mid\hat y_t)}\left[\frac{\partial\ell_t(y)}{\partial s_{t-1}}^{\otimes 2}\right] \tag{2.28}$$
$$s_t \leftarrow s_{t-1} - P_t\left(\frac{\partial\ell_t(y_t)}{\partial s_{t-1}}\right)^\top \tag{2.29}$$

Defining $J_t = P_t^{-1}/(t+1)$, this update is equivalent to

$$J_t \leftarrow \frac{t}{t+1}\,J_{t-1} + \frac{1}{t+1}\,\mathbb{E}_{y\sim p(y\mid\hat y_t)}\left[\frac{\partial\ell_t(y)}{\partial s_{t-1}}^{\otimes 2}\right]$$
$$s_t \leftarrow s_{t-1} - \frac{1}{t+1}\,J_t^{-1}\left(\frac{\partial\ell_t(y_t)}{\partial s_{t-1}}\right)^\top$$

Under the identification $s_{t-1} \leftrightarrow \theta_{t-1}$, this is the online natural gradient update with learning rate $\eta_t = 1/(t+1)$ and metric update rate $\gamma_t = 1/(t+1)$.

The proof of Proposition 3 is similar, with additional factors $(1-\lambda_t)$. Proposition 4 is proved by applying a fading memory Kalman filter to a modified log-likelihood $L_0 := n_{\mathrm{prior}}\ln\pi(\theta)$, $L_t := \ln p_\theta(y_t) + (1-\lambda_t)L_{t-1} + \lambda_t n_{\mathrm{prior}}\ln\pi(\theta)$, so that the prior is kept constant in $L_t$.

3. Natural gradient as a Kalman filter: the state space (recurrent) case

3.1. Recurrent models, RTRL

Let us now consider non-memoryless models, i.e., models defined by a recurrent or state space equation

$$\hat y_t = \Phi(\hat y_{t-1}, \theta, u_t) \tag{3.1}$$

with $u_t$ the observations at time $t$. To save notation, here we dump into $\hat y_t$ the whole state of the model, including both the part that contains the prediction about $y_t$ and all state or internal variables (e.g., all internal and output layers of a recurrent neural network, not only the output layer). The state $\hat y_t$, or a part thereof, defines a loss function $\ell_t(y_t) := -\ln p(y_t\mid\hat y_t)$ for each observation $y_t$.

The current state $\hat y_t$ can be seen as a function which depends on $\theta$ via the whole trajectory. The derivative of the current state with respect to $\theta$ can be computed inductively just by differentiating the recurrent equation (3.1) defining $\hat y_t$:

$$\frac{\partial\hat y_t}{\partial\theta} = \frac{\partial\Phi(\hat y_{t-1},\theta,u_t)}{\partial\theta} + \frac{\partial\Phi(\hat y_{t-1},\theta,u_t)}{\partial\hat y_{t-1}}\,\frac{\partial\hat y_{t-1}}{\partial\theta} \tag{3.2}$$

Real-time recurrent learning [Jae02] uses this equation to keep an estimate $G_t$ of $\frac{\partial\hat y_t}{\partial\theta}$. RTRL then uses $G_t$ to estimate the gradient of the loss function $\ell_t$ with respect to $\theta$ via the chain rule, $\partial\ell_t/\partial\theta = (\partial\ell_t/\partial\hat y_t)(\partial\hat y_t/\partial\theta) = (\partial\ell_t/\partial\hat y_t)\,G_t$.

Definition 11 (Real-time recurrent learning). Given a recurrent model $\hat y_t = \Phi(\hat y_{t-1},\theta_{t-1},u_t)$, real-time recurrent learning (RTRL) learns the parameter $\theta$ via

$$G_t \leftarrow \frac{\partial\Phi}{\partial\theta_{t-1}} + \frac{\partial\Phi}{\partial\hat y_{t-1}}\,G_{t-1}, \qquad G_0 := 0 \tag{3.3}$$
$$g_t \leftarrow \frac{\partial\ell_t(y_t)}{\partial\hat y_t}\,G_t \tag{3.4}$$
$$\theta_t \leftarrow \theta_{t-1} - \eta_t\,g_t^\top \tag{3.5}$$

Since $\theta$ changes at each step, the actual estimate $G_t$ in RTRL is only an approximation of the gradient $\frac{\partial\hat y_t}{\partial\theta}$ at $\theta = \theta_t$, valid in the limit of small learning rates $\eta_t$.

In practice, RTRL has a high computational cost due to the necessary storage of $G_t$, a matrix of size $\dim\theta \times \dim\hat y$. For large-dimensional models, backpropagation through time is usually preferred, truncated to a certain length in the past [Jae02]; [OTC15, TO17] introduce a low-rank, unbiased approximation of $G_t$.
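A minimal Python sketch of Definition 11 (an illustration added here, not code from the paper; the transition function `phi`, its Jacobians, the loss gradient and the learning-rate schedule are assumed to be supplied by the user):

```python
import numpy as np

def rtrl(theta, y0, inputs, targets, phi, dphi_dtheta, dphi_dy, dloss_dyhat, lr):
    """Real-time recurrent learning (Definition 11) for a recurrent model
    y_hat_t = phi(y_hat_{t-1}, theta, u_t); G_t tracks d y_hat_t / d theta."""
    y_hat = y0
    G = np.zeros((len(y0), len(theta)))            # G_0 := 0
    for t, (u, y) in enumerate(zip(inputs, targets), start=1):
        # (3.3): forward accumulation of the Jacobian of the state w.r.t. the parameter,
        # with the Jacobians of phi evaluated at (y_hat_{t-1}, theta_{t-1}, u_t)
        G = dphi_dtheta(y_hat, theta, u) + dphi_dy(y_hat, theta, u) @ G
        y_hat = phi(y_hat, theta, u)
        # (3.4)-(3.5): chain rule through G, then a plain gradient step
        g = dloss_dyhat(y_hat, y) @ G              # row vector, estimate of d l_t / d theta
        theta = theta - lr(t) * g
    return theta, y_hat, G
```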

3.2. Statement of the correspondence, recurrent case

There are several ways in which a Kalman filter can be used to estimate $\theta$ for such recurrent models.

1. A first possibility is to view each $\hat y_t$ as a function of $\theta$ via the whole trajectory, and to apply a Kalman filter on $\theta$. This would require, in principle, recomputing the whole trajectory from time $0$ to time $t$ using the new value of $\theta$ at each step, and using RTRL to compute $\partial\hat y_t/\partial\theta$, which is needed in the filter. In practice, the past trajectory is not updated, and truncated backpropagation through time is used to approximate the derivative $\partial\hat y_t/\partial\theta$ [Jae02, Hay01].

2. A second possibility is the joint Kalman filter, namely, a Kalman filter on the pair $(\theta, \hat y_t)$ [Hay01, 5], [Sim06, 13.4]. This does not require going back in time, as $\hat y_t$ is a function of $\hat y_{t-1}$ and $\theta$. This is the version appearing in Theorem 12 below.

3. A third possibility is the dual Kalman filter [WN96]: a Kalman filter for $\theta$ given $\hat y$, and another one for $\hat y$ given $\theta$. This requires to explicitly couple the two Kalman filters by manually adding RTRL-like terms to account for the (linearized) dependency of $\hat y$ on $\theta$ [Hay01, 5].

Intuitively, the joint Kalman filter maintains a covariance matrix on $(\theta, \hat y_t)$, whose off-diagonal term is the covariance between $\hat y_t$ and $\theta$. This term captures how the current state would change if another value of the parameter had been used. The decomposition (3.13) in the theorem makes this intuition precise in relation to RTRL: the Kalman covariance between $\hat y_t$ and $\theta$ is directly given by the RTRL gradient $G_t$.

Theorem 12 (Kalman filter on $(\theta,\hat y)$ as RTRL + natural gradient + state correction). Consider a recurrent model $\hat y_t = \Phi(\hat y_{t-1},\theta_{t-1},u_t)$. Assume that the observations $y_t$ are predicted with a probabilistic model $p(y_t\mid\hat y_t)$ that is an exponential family with mean parameter a subset of $\hat y_t$. Given an estimate $G_t$ of $\partial\hat y_t/\partial\theta$, and an observation $y_t$, denote

$$g_t(y) := \frac{\partial\ell_t(y)}{\partial\hat y_t}\,G_t \tag{3.6}$$

the corresponding estimate of $\partial\ell_t(y)/\partial\theta$. Then these two algorithms are equivalent:

- The extended Kalman filter on the pair $(\theta,\hat y)$ with transition function $(\mathrm{Id},\Phi)$, initialized with covariance matrix $P_0^{(\theta,\hat y)} = \begin{pmatrix} P_0^\theta & 0 \\ 0 & 0\end{pmatrix}$, and with no process noise ($Q_t = 0$).

- A natural gradient RTRL algorithm with learning rate $\eta_t = 1/(t+1)$, defined as follows. The state, RTRL gradient and Fisher matrix have a transition step

$$\hat y_t \leftarrow \Phi(\hat y_{t-1},\theta_{t-1},u_t) \tag{3.7}$$


More information

Lecture 2-1 Kinematics in One Dimension Displacement, Velocity and Acceleration Everything in the world is moving. Nothing stays still.

Lecture 2-1 Kinematics in One Dimension Displacement, Velocity and Acceleration Everything in the world is moving. Nothing stays still. Lecure - Kinemaics in One Dimension Displacemen, Velociy and Acceleraion Everyhing in he world is moving. Nohing says sill. Moion occurs a all scales of he universe, saring from he moion of elecrons in

More information

14 Autoregressive Moving Average Models

14 Autoregressive Moving Average Models 14 Auoregressive Moving Average Models In his chaper an imporan parameric family of saionary ime series is inroduced, he family of he auoregressive moving average, or ARMA, processes. For a large class

More information

ACE 562 Fall Lecture 4: Simple Linear Regression Model: Specification and Estimation. by Professor Scott H. Irwin

ACE 562 Fall Lecture 4: Simple Linear Regression Model: Specification and Estimation. by Professor Scott H. Irwin ACE 56 Fall 005 Lecure 4: Simple Linear Regression Model: Specificaion and Esimaion by Professor Sco H. Irwin Required Reading: Griffihs, Hill and Judge. "Simple Regression: Economic and Saisical Model

More information

Linear Response Theory: The connection between QFT and experiments

Linear Response Theory: The connection between QFT and experiments Phys540.nb 39 3 Linear Response Theory: The connecion beween QFT and experimens 3.1. Basic conceps and ideas Q: How do we measure he conduciviy of a meal? A: we firs inroduce a weak elecric field E, and

More information

The Simple Linear Regression Model: Reporting the Results and Choosing the Functional Form

The Simple Linear Regression Model: Reporting the Results and Choosing the Functional Form Chaper 6 The Simple Linear Regression Model: Reporing he Resuls and Choosing he Funcional Form To complee he analysis of he simple linear regression model, in his chaper we will consider how o measure

More information

Article from. Predictive Analytics and Futurism. July 2016 Issue 13

Article from. Predictive Analytics and Futurism. July 2016 Issue 13 Aricle from Predicive Analyics and Fuurism July 6 Issue An Inroducion o Incremenal Learning By Qiang Wu and Dave Snell Machine learning provides useful ools for predicive analyics The ypical machine learning

More information

Notes on Kalman Filtering

Notes on Kalman Filtering Noes on Kalman Filering Brian Borchers and Rick Aser November 7, Inroducion Daa Assimilaion is he problem of merging model predicions wih acual measuremens of a sysem o produce an opimal esimae of he curren

More information

Final Spring 2007

Final Spring 2007 .615 Final Spring 7 Overview The purpose of he final exam is o calculae he MHD β limi in a high-bea oroidal okamak agains he dangerous n = 1 exernal ballooning-kink mode. Effecively, his corresponds o

More information

ACE 562 Fall Lecture 8: The Simple Linear Regression Model: R 2, Reporting the Results and Prediction. by Professor Scott H.

ACE 562 Fall Lecture 8: The Simple Linear Regression Model: R 2, Reporting the Results and Prediction. by Professor Scott H. ACE 56 Fall 5 Lecure 8: The Simple Linear Regression Model: R, Reporing he Resuls and Predicion by Professor Sco H. Irwin Required Readings: Griffihs, Hill and Judge. "Explaining Variaion in he Dependen

More information

Outline. lse-logo. Outline. Outline. 1 Wald Test. 2 The Likelihood Ratio Test. 3 Lagrange Multiplier Tests

Outline. lse-logo. Outline. Outline. 1 Wald Test. 2 The Likelihood Ratio Test. 3 Lagrange Multiplier Tests Ouline Ouline Hypohesis Tes wihin he Maximum Likelihood Framework There are hree main frequenis approaches o inference wihin he Maximum Likelihood framework: he Wald es, he Likelihood Raio es and he Lagrange

More information

Robust estimation based on the first- and third-moment restrictions of the power transformation model

Robust estimation based on the first- and third-moment restrictions of the power transformation model h Inernaional Congress on Modelling and Simulaion, Adelaide, Ausralia, 6 December 3 www.mssanz.org.au/modsim3 Robus esimaion based on he firs- and hird-momen resricions of he power ransformaion Nawaa,

More information

10. State Space Methods

10. State Space Methods . Sae Space Mehods. Inroducion Sae space modelling was briefly inroduced in chaper. Here more coverage is provided of sae space mehods before some of heir uses in conrol sysem design are covered in he

More information

Tom Heskes and Onno Zoeter. Presented by Mark Buller

Tom Heskes and Onno Zoeter. Presented by Mark Buller Tom Heskes and Onno Zoeer Presened by Mark Buller Dynamic Bayesian Neworks Direced graphical models of sochasic processes Represen hidden and observed variables wih differen dependencies Generalize Hidden

More information

OBJECTIVES OF TIME SERIES ANALYSIS

OBJECTIVES OF TIME SERIES ANALYSIS OBJECTIVES OF TIME SERIES ANALYSIS Undersanding he dynamic or imedependen srucure of he observaions of a single series (univariae analysis) Forecasing of fuure observaions Asceraining he leading, lagging

More information

( ) ( ) if t = t. It must satisfy the identity. So, bulkiness of the unit impulse (hyper)function is equal to 1. The defining characteristic is

( ) ( ) if t = t. It must satisfy the identity. So, bulkiness of the unit impulse (hyper)function is equal to 1. The defining characteristic is UNIT IMPULSE RESPONSE, UNIT STEP RESPONSE, STABILITY. Uni impulse funcion (Dirac dela funcion, dela funcion) rigorously defined is no sricly a funcion, bu disribuion (or measure), precise reamen requires

More information

Sequential Importance Resampling (SIR) Particle Filter

Sequential Importance Resampling (SIR) Particle Filter Paricle Filers++ Pieer Abbeel UC Berkeley EECS Many slides adaped from Thrun, Burgard and Fox, Probabilisic Roboics 1. Algorihm paricle_filer( S -1, u, z ): 2. Sequenial Imporance Resampling (SIR) Paricle

More information

Bias in Conditional and Unconditional Fixed Effects Logit Estimation: a Correction * Tom Coupé

Bias in Conditional and Unconditional Fixed Effects Logit Estimation: a Correction * Tom Coupé Bias in Condiional and Uncondiional Fixed Effecs Logi Esimaion: a Correcion * Tom Coupé Economics Educaion and Research Consorium, Naional Universiy of Kyiv Mohyla Academy Address: Vul Voloska 10, 04070

More information

Kriging Models Predicting Atrazine Concentrations in Surface Water Draining Agricultural Watersheds

Kriging Models Predicting Atrazine Concentrations in Surface Water Draining Agricultural Watersheds 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Kriging Models Predicing Arazine Concenraions in Surface Waer Draining Agriculural Waersheds Paul L. Mosquin, Jeremy Aldworh, Wenlin Chen Supplemenal Maerial Number

More information

12: AUTOREGRESSIVE AND MOVING AVERAGE PROCESSES IN DISCRETE TIME. Σ j =

12: AUTOREGRESSIVE AND MOVING AVERAGE PROCESSES IN DISCRETE TIME. Σ j = 1: AUTOREGRESSIVE AND MOVING AVERAGE PROCESSES IN DISCRETE TIME Moving Averages Recall ha a whie noise process is a series { } = having variance σ. The whie noise process has specral densiy f (λ) = of

More information

Lecture 2 October ε-approximation of 2-player zero-sum games

Lecture 2 October ε-approximation of 2-player zero-sum games Opimizaion II Winer 009/10 Lecurer: Khaled Elbassioni Lecure Ocober 19 1 ε-approximaion of -player zero-sum games In his lecure we give a randomized ficiious play algorihm for obaining an approximae soluion

More information

INTRODUCTION TO MACHINE LEARNING 3RD EDITION

INTRODUCTION TO MACHINE LEARNING 3RD EDITION ETHEM ALPAYDIN The MIT Press, 2014 Lecure Slides for INTRODUCTION TO MACHINE LEARNING 3RD EDITION alpaydin@boun.edu.r hp://www.cmpe.boun.edu.r/~ehem/i2ml3e CHAPTER 2: SUPERVISED LEARNING Learning a Class

More information

Simulation-Solving Dynamic Models ABE 5646 Week 2, Spring 2010

Simulation-Solving Dynamic Models ABE 5646 Week 2, Spring 2010 Simulaion-Solving Dynamic Models ABE 5646 Week 2, Spring 2010 Week Descripion Reading Maerial 2 Compuer Simulaion of Dynamic Models Finie Difference, coninuous saes, discree ime Simple Mehods Euler Trapezoid

More information

IMPLICIT AND INVERSE FUNCTION THEOREMS PAUL SCHRIMPF 1 OCTOBER 25, 2013

IMPLICIT AND INVERSE FUNCTION THEOREMS PAUL SCHRIMPF 1 OCTOBER 25, 2013 IMPLICI AND INVERSE FUNCION HEOREMS PAUL SCHRIMPF 1 OCOBER 25, 213 UNIVERSIY OF BRIISH COLUMBIA ECONOMICS 526 We have exensively sudied how o solve sysems of linear equaions. We know how o check wheher

More information

Let us start with a two dimensional case. We consider a vector ( x,

Let us start with a two dimensional case. We consider a vector ( x, Roaion marices We consider now roaion marices in wo and hree dimensions. We sar wih wo dimensions since wo dimensions are easier han hree o undersand, and one dimension is a lile oo simple. However, our

More information

Christos Papadimitriou & Luca Trevisan November 22, 2016

Christos Papadimitriou & Luca Trevisan November 22, 2016 U.C. Bereley CS170: Algorihms Handou LN-11-22 Chrisos Papadimiriou & Luca Trevisan November 22, 2016 Sreaming algorihms In his lecure and he nex one we sudy memory-efficien algorihms ha process a sream

More information

t is a basis for the solution space to this system, then the matrix having these solutions as columns, t x 1 t, x 2 t,... x n t x 2 t...

t is a basis for the solution space to this system, then the matrix having these solutions as columns, t x 1 t, x 2 t,... x n t x 2 t... Mah 228- Fri Mar 24 5.6 Marix exponenials and linear sysems: The analogy beween firs order sysems of linear differenial equaions (Chaper 5) and scalar linear differenial equaions (Chaper ) is much sronger

More information

3.1.3 INTRODUCTION TO DYNAMIC OPTIMIZATION: DISCRETE TIME PROBLEMS. A. The Hamiltonian and First-Order Conditions in a Finite Time Horizon

3.1.3 INTRODUCTION TO DYNAMIC OPTIMIZATION: DISCRETE TIME PROBLEMS. A. The Hamiltonian and First-Order Conditions in a Finite Time Horizon 3..3 INRODUCION O DYNAMIC OPIMIZAION: DISCREE IME PROBLEMS A. he Hamilonian and Firs-Order Condiions in a Finie ime Horizon Define a new funcion, he Hamilonian funcion, H. H he change in he oal value of

More information

LAPLACE TRANSFORM AND TRANSFER FUNCTION

LAPLACE TRANSFORM AND TRANSFER FUNCTION CHBE320 LECTURE V LAPLACE TRANSFORM AND TRANSFER FUNCTION Professor Dae Ryook Yang Spring 2018 Dep. of Chemical and Biological Engineering 5-1 Road Map of he Lecure V Laplace Transform and Transfer funcions

More information

Probabilistic Robotics

Probabilistic Robotics Probabilisic Roboics Bayes Filer Implemenaions Gaussian filers Bayes Filer Reminder Predicion bel p u bel d Correcion bel η p z bel Gaussians : ~ π e p N p - Univariae / / : ~ μ μ μ e p Ν p d π Mulivariae

More information

2.160 System Identification, Estimation, and Learning. Lecture Notes No. 8. March 6, 2006

2.160 System Identification, Estimation, and Learning. Lecture Notes No. 8. March 6, 2006 2.160 Sysem Idenificaion, Esimaion, and Learning Lecure Noes No. 8 March 6, 2006 4.9 Eended Kalman Filer In many pracical problems, he process dynamics are nonlinear. w Process Dynamics v y u Model (Linearized)

More information

MATH 5720: Gradient Methods Hung Phan, UMass Lowell October 4, 2018

MATH 5720: Gradient Methods Hung Phan, UMass Lowell October 4, 2018 MATH 5720: Gradien Mehods Hung Phan, UMass Lowell Ocober 4, 208 Descen Direcion Mehods Consider he problem min { f(x) x R n}. The general descen direcions mehod is x k+ = x k + k d k where x k is he curren

More information

Testing for a Single Factor Model in the Multivariate State Space Framework

Testing for a Single Factor Model in the Multivariate State Space Framework esing for a Single Facor Model in he Mulivariae Sae Space Framework Chen C.-Y. M. Chiba and M. Kobayashi Inernaional Graduae School of Social Sciences Yokohama Naional Universiy Japan Faculy of Economics

More information

23.2. Representing Periodic Functions by Fourier Series. Introduction. Prerequisites. Learning Outcomes

23.2. Representing Periodic Functions by Fourier Series. Introduction. Prerequisites. Learning Outcomes Represening Periodic Funcions by Fourier Series 3. Inroducion In his Secion we show how a periodic funcion can be expressed as a series of sines and cosines. We begin by obaining some sandard inegrals

More information

20. Applications of the Genetic-Drift Model

20. Applications of the Genetic-Drift Model 0. Applicaions of he Geneic-Drif Model 1) Deermining he probabiliy of forming any paricular combinaion of genoypes in he nex generaion: Example: If he parenal allele frequencies are p 0 = 0.35 and q 0

More information

Forecasting optimally

Forecasting optimally I) ile: Forecas Evaluaion II) Conens: Evaluaing forecass, properies of opimal forecass, esing properies of opimal forecass, saisical comparison of forecas accuracy III) Documenaion: - Diebold, Francis

More information

hen found from Bayes rule. Specically, he prior disribuion is given by p( ) = N( ; ^ ; r ) (.3) where r is he prior variance (we add on he random drif

hen found from Bayes rule. Specically, he prior disribuion is given by p( ) = N( ; ^ ; r ) (.3) where r is he prior variance (we add on he random drif Chaper Kalman Filers. Inroducion We describe Bayesian Learning for sequenial esimaion of parameers (eg. means, AR coeciens). The updae procedures are known as Kalman Filers. We show how Dynamic Linear

More information

Modal identification of structures from roving input data by means of maximum likelihood estimation of the state space model

Modal identification of structures from roving input data by means of maximum likelihood estimation of the state space model Modal idenificaion of srucures from roving inpu daa by means of maximum likelihood esimaion of he sae space model J. Cara, J. Juan, E. Alarcón Absrac The usual way o perform a forced vibraion es is o fix

More information

2. Nonlinear Conservation Law Equations

2. Nonlinear Conservation Law Equations . Nonlinear Conservaion Law Equaions One of he clear lessons learned over recen years in sudying nonlinear parial differenial equaions is ha i is generally no wise o ry o aack a general class of nonlinear

More information

Understanding the asymptotic behaviour of empirical Bayes methods

Understanding the asymptotic behaviour of empirical Bayes methods Undersanding he asympoic behaviour of empirical Bayes mehods Boond Szabo, Aad van der Vaar and Harry van Zanen EURANDOM, 11.10.2011. Conens 2/20 Moivaion Nonparameric Bayesian saisics Signal in Whie noise

More information

Augmented Reality II - Kalman Filters - Gudrun Klinker May 25, 2004

Augmented Reality II - Kalman Filters - Gudrun Klinker May 25, 2004 Augmened Realiy II Kalman Filers Gudrun Klinker May 25, 2004 Ouline Moivaion Discree Kalman Filer Modeled Process Compuing Model Parameers Algorihm Exended Kalman Filer Kalman Filer for Sensor Fusion Lieraure

More information

Maximum Likelihood Parameter Estimation in State-Space Models

Maximum Likelihood Parameter Estimation in State-Space Models Maximum Likelihood Parameer Esimaion in Sae-Space Models Arnaud Douce Deparmen of Saisics, Oxford Universiy Universiy College London 4 h Ocober 212 A. Douce (UCL Maserclass Oc. 212 4 h Ocober 212 1 / 32

More information

Application of a Stochastic-Fuzzy Approach to Modeling Optimal Discrete Time Dynamical Systems by Using Large Scale Data Processing

Application of a Stochastic-Fuzzy Approach to Modeling Optimal Discrete Time Dynamical Systems by Using Large Scale Data Processing Applicaion of a Sochasic-Fuzzy Approach o Modeling Opimal Discree Time Dynamical Sysems by Using Large Scale Daa Processing AA WALASZE-BABISZEWSA Deparmen of Compuer Engineering Opole Universiy of Technology

More information

Exponential Weighted Moving Average (EWMA) Chart Under The Assumption of Moderateness And Its 3 Control Limits

Exponential Weighted Moving Average (EWMA) Chart Under The Assumption of Moderateness And Its 3 Control Limits DOI: 0.545/mjis.07.5009 Exponenial Weighed Moving Average (EWMA) Char Under The Assumpion of Moderaeness And Is 3 Conrol Limis KALPESH S TAILOR Assisan Professor, Deparmen of Saisics, M. K. Bhavnagar Universiy,

More information

Online Convex Optimization Example And Follow-The-Leader

Online Convex Optimization Example And Follow-The-Leader CSE599s, Spring 2014, Online Learning Lecure 2-04/03/2014 Online Convex Opimizaion Example And Follow-The-Leader Lecurer: Brendan McMahan Scribe: Sephen Joe Jonany 1 Review of Online Convex Opimizaion

More information

An EM based training algorithm for recurrent neural networks

An EM based training algorithm for recurrent neural networks An EM based raining algorihm for recurren neural neworks Jan Unkelbach, Sun Yi, and Jürgen Schmidhuber IDSIA,Galleria 2, 6928 Manno, Swizerland {jan.unkelbach,yi,juergen}@idsia.ch hp://www.idsia.ch Absrac.

More information

On Measuring Pro-Poor Growth. 1. On Various Ways of Measuring Pro-Poor Growth: A Short Review of the Literature

On Measuring Pro-Poor Growth. 1. On Various Ways of Measuring Pro-Poor Growth: A Short Review of the Literature On Measuring Pro-Poor Growh 1. On Various Ways of Measuring Pro-Poor Growh: A Shor eview of he Lieraure During he pas en years or so here have been various suggesions concerning he way one should check

More information

Online Appendix to Solution Methods for Models with Rare Disasters

Online Appendix to Solution Methods for Models with Rare Disasters Online Appendix o Soluion Mehods for Models wih Rare Disasers Jesús Fernández-Villaverde and Oren Levinal In his Online Appendix, we presen he Euler condiions of he model, we develop he pricing Calvo block,

More information

EKF SLAM vs. FastSLAM A Comparison

EKF SLAM vs. FastSLAM A Comparison vs. A Comparison Michael Calonder, Compuer Vision Lab Swiss Federal Insiue of Technology, Lausanne EPFL) michael.calonder@epfl.ch The wo algorihms are described wih a planar robo applicaion in mind. Generalizaion

More information

Unit Root Time Series. Univariate random walk

Unit Root Time Series. Univariate random walk Uni Roo ime Series Univariae random walk Consider he regression y y where ~ iid N 0, he leas squares esimae of is: ˆ yy y y yy Now wha if = If y y hen le y 0 =0 so ha y j j If ~ iid N 0, hen y ~ N 0, he

More information

Testing the Random Walk Model. i.i.d. ( ) r

Testing the Random Walk Model. i.i.d. ( ) r he random walk heory saes: esing he Random Walk Model µ ε () np = + np + Momen Condiions where where ε ~ i.i.d he idea here is o es direcly he resricions imposed by momen condiions. lnp lnp µ ( lnp lnp

More information

A Specification Test for Linear Dynamic Stochastic General Equilibrium Models

A Specification Test for Linear Dynamic Stochastic General Equilibrium Models Journal of Saisical and Economeric Mehods, vol.1, no.2, 2012, 65-70 ISSN: 2241-0384 (prin), 2241-0376 (online) Scienpress Ld, 2012 A Specificaion Tes for Linear Dynamic Sochasic General Equilibrium Models

More information

18 Biological models with discrete time

18 Biological models with discrete time 8 Biological models wih discree ime The mos imporan applicaions, however, may be pedagogical. The elegan body of mahemaical heory peraining o linear sysems (Fourier analysis, orhogonal funcions, and so

More information

Some Basic Information about M-S-D Systems

Some Basic Information about M-S-D Systems Some Basic Informaion abou M-S-D Sysems 1 Inroducion We wan o give some summary of he facs concerning unforced (homogeneous) and forced (non-homogeneous) models for linear oscillaors governed by second-order,

More information

Chapter 6. Systems of First Order Linear Differential Equations

Chapter 6. Systems of First Order Linear Differential Equations Chaper 6 Sysems of Firs Order Linear Differenial Equaions We will only discuss firs order sysems However higher order sysems may be made ino firs order sysems by a rick shown below We will have a sligh

More information

5. Stochastic processes (1)

5. Stochastic processes (1) Lec05.pp S-38.45 - Inroducion o Teleraffic Theory Spring 2005 Conens Basic conceps Poisson process 2 Sochasic processes () Consider some quaniy in a eleraffic (or any) sysem I ypically evolves in ime randomly

More information

RANDOM LAGRANGE MULTIPLIERS AND TRANSVERSALITY

RANDOM LAGRANGE MULTIPLIERS AND TRANSVERSALITY ECO 504 Spring 2006 Chris Sims RANDOM LAGRANGE MULTIPLIERS AND TRANSVERSALITY 1. INTRODUCTION Lagrange muliplier mehods are sandard fare in elemenary calculus courses, and hey play a cenral role in economic

More information

Recursive Least-Squares Fixed-Interval Smoother Using Covariance Information based on Innovation Approach in Linear Continuous Stochastic Systems

Recursive Least-Squares Fixed-Interval Smoother Using Covariance Information based on Innovation Approach in Linear Continuous Stochastic Systems 8 Froniers in Signal Processing, Vol. 1, No. 1, July 217 hps://dx.doi.org/1.2266/fsp.217.112 Recursive Leas-Squares Fixed-Inerval Smooher Using Covariance Informaion based on Innovaion Approach in Linear

More information

Matrix Versions of Some Refinements of the Arithmetic-Geometric Mean Inequality

Matrix Versions of Some Refinements of the Arithmetic-Geometric Mean Inequality Marix Versions of Some Refinemens of he Arihmeic-Geomeric Mean Inequaliy Bao Qi Feng and Andrew Tonge Absrac. We esablish marix versions of refinemens due o Alzer ], Carwrigh and Field 4], and Mercer 5]

More information

3.1 More on model selection

3.1 More on model selection 3. More on Model selecion 3. Comparing models AIC, BIC, Adjused R squared. 3. Over Fiing problem. 3.3 Sample spliing. 3. More on model selecion crieria Ofen afer model fiing you are lef wih a handful of

More information

Ensamble methods: Bagging and Boosting

Ensamble methods: Bagging and Boosting Lecure 21 Ensamble mehods: Bagging and Boosing Milos Hauskrech milos@cs.pi.edu 5329 Senno Square Ensemble mehods Mixure of expers Muliple base models (classifiers, regressors), each covers a differen par

More information

d 1 = c 1 b 2 - b 1 c 2 d 2 = c 1 b 3 - b 1 c 3

d 1 = c 1 b 2 - b 1 c 2 d 2 = c 1 b 3 - b 1 c 3 and d = c b - b c c d = c b - b c c This process is coninued unil he nh row has been compleed. The complee array of coefficiens is riangular. Noe ha in developing he array an enire row may be divided or

More information

Tracking. Announcements

Tracking. Announcements Tracking Tuesday, Nov 24 Krisen Grauman UT Ausin Announcemens Pse 5 ou onigh, due 12/4 Shorer assignmen Auo exension il 12/8 I will no hold office hours omorrow 5 6 pm due o Thanksgiving 1 Las ime: Moion

More information

Class Meeting # 10: Introduction to the Wave Equation

Class Meeting # 10: Introduction to the Wave Equation MATH 8.5 COURSE NOTES - CLASS MEETING # 0 8.5 Inroducion o PDEs, Fall 0 Professor: Jared Speck Class Meeing # 0: Inroducion o he Wave Equaion. Wha is he wave equaion? The sandard wave equaion for a funcion

More information

Inventory Analysis and Management. Multi-Period Stochastic Models: Optimality of (s, S) Policy for K-Convex Objective Functions

Inventory Analysis and Management. Multi-Period Stochastic Models: Optimality of (s, S) Policy for K-Convex Objective Functions Muli-Period Sochasic Models: Opimali of (s, S) Polic for -Convex Objecive Funcions Consider a seing similar o he N-sage newsvendor problem excep ha now here is a fixed re-ordering cos (> 0) for each (re-)order.

More information

Speaker Adaptation Techniques For Continuous Speech Using Medium and Small Adaptation Data Sets. Constantinos Boulis

Speaker Adaptation Techniques For Continuous Speech Using Medium and Small Adaptation Data Sets. Constantinos Boulis Speaker Adapaion Techniques For Coninuous Speech Using Medium and Small Adapaion Daa Ses Consaninos Boulis Ouline of he Presenaion Inroducion o he speaker adapaion problem Maximum Likelihood Sochasic Transformaions

More information

Chapter 5. Heterocedastic Models. Introduction to time series (2008) 1

Chapter 5. Heterocedastic Models. Introduction to time series (2008) 1 Chaper 5 Heerocedasic Models Inroducion o ime series (2008) 1 Chaper 5. Conens. 5.1. The ARCH model. 5.2. The GARCH model. 5.3. The exponenial GARCH model. 5.4. The CHARMA model. 5.5. Random coefficien

More information

Nature Neuroscience: doi: /nn Supplementary Figure 1. Spike-count autocorrelations in time.

Nature Neuroscience: doi: /nn Supplementary Figure 1. Spike-count autocorrelations in time. Supplemenary Figure 1 Spike-coun auocorrelaions in ime. Normalized auocorrelaion marices are shown for each area in a daase. The marix shows he mean correlaion of he spike coun in each ime bin wih he spike

More information

A Dynamic Model of Economic Fluctuations

A Dynamic Model of Economic Fluctuations CHAPTER 15 A Dynamic Model of Economic Flucuaions Modified for ECON 2204 by Bob Murphy 2016 Worh Publishers, all righs reserved IN THIS CHAPTER, OU WILL LEARN: how o incorporae dynamics ino he AD-AS model

More information

Comparing Means: t-tests for One Sample & Two Related Samples

Comparing Means: t-tests for One Sample & Two Related Samples Comparing Means: -Tess for One Sample & Two Relaed Samples Using he z-tes: Assumpions -Tess for One Sample & Two Relaed Samples The z-es (of a sample mean agains a populaion mean) is based on he assumpion

More information

SOLUTIONS TO ECE 3084

SOLUTIONS TO ECE 3084 SOLUTIONS TO ECE 384 PROBLEM 2.. For each sysem below, specify wheher or no i is: (i) memoryless; (ii) causal; (iii) inverible; (iv) linear; (v) ime invarian; Explain your reasoning. If he propery is no

More information

The electromagnetic interference in case of onboard navy ships computers - a new approach

The electromagnetic interference in case of onboard navy ships computers - a new approach The elecromagneic inerference in case of onboard navy ships compuers - a new approach Prof. dr. ing. Alexandru SOTIR Naval Academy Mircea cel Bărân, Fulgerului Sree, Consanţa, soiralexandru@yahoo.com Absrac.

More information

School and Workshop on Market Microstructure: Design, Efficiency and Statistical Regularities March 2011

School and Workshop on Market Microstructure: Design, Efficiency and Statistical Regularities March 2011 2229-12 School and Workshop on Marke Microsrucure: Design, Efficiency and Saisical Regulariies 21-25 March 2011 Some mahemaical properies of order book models Frederic ABERGEL Ecole Cenrale Paris Grande

More information

EXERCISES FOR SECTION 1.5

EXERCISES FOR SECTION 1.5 1.5 Exisence and Uniqueness of Soluions 43 20. 1 v c 21. 1 v c 1 2 4 6 8 10 1 2 2 4 6 8 10 Graph of approximae soluion obained using Euler s mehod wih = 0.1. Graph of approximae soluion obained using Euler

More information