Online natural gradient as a Kalman filter


Electronic Journal of Statistics, Vol. 12 (2018) 2930–2961. ISSN: 1935-7524. https://doi.org/10.1214/18-ejs1468

Online natural gradient as a Kalman filter

Yann Ollivier
Facebook Artificial Intelligence Research, 6 rue Ménars, 75002 Paris, France. e-mail: yol@fb.com
(Work done in part while at CNRS, TAO, Université Paris-Sud.)

Abstract: We cast Amari's natural gradient in statistical learning as a specific case of Kalman filtering. Namely, applying an extended Kalman filter to estimate a fixed unknown parameter of a probabilistic model from a series of observations, is rigorously equivalent to estimating this parameter via an online stochastic natural gradient descent on the log-likelihood of the observations. In the i.i.d. case, this relation is a consequence of the information filter phrasing of the extended Kalman filter. In the recurrent (state space, non-i.i.d.) case, we prove that the joint Kalman filter over states and parameters is a natural gradient on top of real-time recurrent learning (RTRL), a classical algorithm to train recurrent models. This exact algebraic correspondence provides relevant interpretations for natural gradient hyperparameters such as learning rates or initialization and regularization of the Fisher information matrix.

MSC 2010 subject classifications: Primary 68T05, 65K10; secondary 93E35, 90C26, 93E11, 49M15.
Keywords and phrases: Statistical learning, natural gradient, Kalman filter, stochastic gradient descent.
Received June 2017.

Contents
1  Problem setting, natural gradient, Kalman filter
   1.1  Problem setting
   1.2  Natural gradient descent
   1.3  Kalman filtering for parameter estimation
2  Natural gradient as a Kalman filter: the static (i.i.d.) case
   2.1  Natural gradient as a Kalman filter: heuristics
   2.2  Statement of the correspondence, static (i.i.d.) case
   2.3  Proofs for the static case
3  Natural gradient as a Kalman filter: the state space (recurrent) case
   3.1  Recurrent models, RTRL
   3.2  Statement of the correspondence, recurrent case
   3.3  Proofs for the recurrent case
A  Reminder on exponential families
References

In statistical learning, stochastic gradient descent is a widely used tool to estimate the parameters of a model from empirical data, especially when the parameter dimension and the amount of data are large [BL03] (such as is typically the case with neural networks, for instance). The natural gradient [Ama98] is a tool from information geometry, which aims at correcting several shortcomings of the widely used ordinary stochastic gradient descent, such as its sensitivity to rescalings or simple changes of variables in parameter space [Oll15]. The natural gradient modifies the ordinary gradient by using the information geometry of the statistical model, via the Fisher information matrix (see formal definition in Section 1.2; see also [Mar14]). The natural gradient comes with a theoretical guarantee of asymptotic optimality [Ama98] that the ordinary gradient lacks, and with the theoretical knowledge and various connections from information geometry, e.g., [AN00, OAAH17]. In large dimension, its computational complexity makes approximations necessary, e.g., [LMB07, Oll15, MCO16, GS15, MG15]; this has limited its adoption despite many desirable theoretical properties.

The extended Kalman filter (see e.g., the textbooks [Sim06, Sä13, Jaz70]) is a generic and effective tool to estimate in real time the state of a nonlinear dynamical system, from noisy measurements of some part or some function of the system. (The ordinary Kalman filter deals with linear systems.) Its use in navigation systems (GPS, vehicle control, spacecraft...), time series analysis, econometrics, etc. [Sä13], is extensive to the point it has been described as one of the great discoveries of mathematical engineering [GA15].

The goal of this text is to show that the natural gradient, when applied online, is a particular case of the extended Kalman filter. Indeed, the extended Kalman filter can be used to estimate the parameters of a statistical model (probability distribution), by viewing the parameters as the hidden state of a static dynamical system, and viewing i.i.d. samples as noisy observations depending on the parameters.^1 We show that doing so is exactly equivalent to performing an online stochastic natural gradient descent (Theorem 2). This results in a rigorous dictionary between the natural gradient objects from statistical learning, and the objects appearing in Kalman filtering; for instance, a larger learning rate for the natural gradient descent exactly corresponds to a fading memory in the Kalman filter (Proposition 3).

Table 1 lists a few correspondences between objects from the Kalman filter side and from the natural gradient side, as results from the theorems and propositions below. Note that the correspondence is one-sided: the online natural gradient is exactly an extended Kalman filter, but only corresponds to a particular use of the Kalman filter for parameter estimation problems (i.e., with static dynamics on the parameter part of the system).

^1 For this we slightly extend the definition of the Kalman filter to include discrete observations, by defining (Def. 5) the measurement error as T(y) − ŷ instead of y − ŷ, where T is the sufficient statistics of an exponential family model for output noise with mean ŷ. This reduces to the standard filter for Gaussian output noise, and naturally covers categorical outputs as often used in statistical learning (with ŷ the class probabilities in a softmax classifier and T a one-hot encoding of y).

Table 1. Kalman filter objects vs natural gradient objects. The inputs are u_t, the predicted values are ŷ_t, and the model parameters are θ.

i.i.d. (static, non-recurrent) model ŷ_t = h(θ, u_t):
  Extended Kalman filter on static parameter θ   ↔   Online natural gradient on θ with learning rate η_t = 1/(t+1)
  Covariance matrix P_t                          ↔   Fisher information matrix J_t = η_t P_t^{-1}
  Bayesian prior P_0                             ↔   Fisher matrix initialization J_0 = P_0^{-1}
  Fading memory                                  ↔   Larger or constant learning rate
  Fading memory + constant prior                 ↔   Fisher matrix regularization

Recurrent (state space) model ŷ_t = Φ(ŷ_{t−1}, θ, u_t):
  Extended Kalman filter on (θ, ŷ)               ↔   RTRL + natural gradient + state correction
  Covariance of θ alone, P_t^θ                   ↔   Fisher matrix J_t = η_t (P_t^θ)^{-1}
  Correlation between θ and ŷ                    ↔   RTRL gradient estimate ∂ŷ_t/∂θ

Beyond the static case, we also consider the learning of the parameters of a general dynamical system, where subsequent observations exhibit temporal patterns instead of being i.i.d.; in statistical learning this is called a recurrent model, for instance, a recurrent neural network. We refer to [Jae02] for an introduction to recurrent models in statistical learning (recurrent neural networks) and the afferent techniques (including Kalman filters), and to [Hay01] for a clear, in-depth treatment of Kalman filtering for recurrent models. We prove (Theorem 12) that the extended Kalman filter, applied jointly to the state and parameter, amounts to a natural gradient on top of real-time recurrent learning (RTRL), a classical (and costly) online algorithm for recurrent network training [Jae02].

Thus, we provide a bridge between techniques from large-scale statistical learning (natural gradient, RTRL) and a central object from mathematical engineering, signal processing, and estimation theory.

Casting the natural gradient as a specific case of the extended Kalman filter is an instance of the provocative statement from [LS83] that there is only one recursive identification method that is optimal on quadratic functions. Indeed, the online natural gradient descent fits into the framework of [LS83, §3.4.5]. Arguably, this statement is limited to linear models, and for non-linear models one would expect different algorithms to coincide only at a certain order, or asymptotically; however, all the correspondences presented below are exact.

Related work. In the i.i.d. (static) case, the natural gradient/Kalman filter correspondence follows from the information filter phrasing of Kalman filtering [Sim06, §6.2] by relatively direct manipulations. Nevertheless, we could find no reference in the literature explicitly identifying the two. [SW88] is an early example of the use of Kalman filtering for training feedforward neural networks in statistical learning, but does not mention the natural gradient. [RRK+92] argue that for neural networks, backpropagation, i.e., ordinary gradient descent, is a degenerate form of the extended Kalman filter. [Ber96] identifies the extended Kalman filter with a Gauss–Newton gradient descent for the specific case of nonlinear regression. [dFNG00] interprets process noise in the static Kalman filter as an adaptive, per-parameter learning rate, thus akin to a preconditioning matrix. [ŠKT01] uses the Fisher information matrix to study the variance of parameter estimation in Kalman-like filters, without using a natural gradient; [BL03] comment on the similarity between Kalman filtering and a version of Amari's natural gradient for the specific case of least squares regression; [Mar14] and [Oll15] mention the relationship between natural gradient and the Gauss–Newton Hessian approximation; [Pat16] exploits the relationship between second-order gradient descent and Kalman filtering in specific cases including linear regression; [LCL+17] use a natural gradient descent over Gaussian distributions for an auxiliary problem arising in Kalman-like Bayesian filtering, a problem independent from the one treated here.

For the recurrent (non-i.i.d.) case, our result is that joint Kalman filtering is essentially a natural gradient on top of the classical RTRL algorithm for recurrent models [Jae02]. [Wil92] already observed that starting with the Kalman filter and introducing drastic simplifications (doing away with the covariance matrix) results in RTRL, while [Hay01, §5] contains statements that can be interpreted as relating Kalman filtering and preconditioned RTRL-like gradient descent for recurrent models (Section 3.2).

Perspectives. In this text our goal is to derive the precise correspondence between natural gradient and Kalman filtering for parameter estimation (Thm. 2, Prop. 3, Prop. 4, Thm. 12), and to work out an exact dictionary between the mathematical objects on both sides. This correspondence suggests several possible venues for research, which nevertheless are not explored here.

First, the correspondence with the Kalman filter brings new interpretations and suggestions for several natural gradient hyperparameters, such as Fisher matrix initialization, equality between Fisher matrix decay rate and learning rate, or amount of regularization of the Fisher matrix (Section 2.2). The natural gradient can be quite sensitive to these hyperparameters. A first step would be to test the matrix decay rate and regularization values suggested by the Bayesian interpretation (Prop. 4) and see if they help with the natural gradient, or if these suggestions are overridden by the various approximations needed to apply the natural gradient in practice. These empirical tests are beyond the scope of the present study.

Next, since statistical learning deals with either continuous or categorical data, we had to extend the usual Kalman filter to such a setting. Traditionally, non-Gaussian output models have been treated by applying a nonlinearity to a standard Gaussian noise (Section 2.3). Instead, modeling the measurement noise as an exponential family (Appendix and Def. 5) allows for a unified treatment of the standard case (Gaussian output noise with known variance), of discrete categorical observations, or other exponential noise models (e.g., Gaussian noise with unknown variance). We did not test the empirical consequences of this choice, but it certainly makes the mathematical treatment flow smoothly, in particular the view of the Kalman filter as preconditioned gradient descent (Prop. 6).

Neither the natural gradient nor the extended Kalman filter scale well to large-dimensional models as currently used in machine learning, so that approximations are required. The correspondence raises the possibility that various methods developed for Kalman filtering (e.g., particle or unscented filters) or for natural gradient approximations (e.g., matrix factorizations such as the Kronecker product [MG15] or quasi-diagonal reductions [Oll15, MCO16]) could be transferred from one viewpoint to the other.

In statistical learning, other means have been developed to attain the same asymptotic efficiency as the natural gradient, notably trajectory averaging (e.g., [PJ92], or [Mar14] for the relationship to natural gradient) at little algorithmic cost. One may wonder if these can be generalized to filtering problems.

Proof techniques could be transferred as well: for instance, Amari [Ama98] gave a strong but sometimes informal argument that the natural gradient is Fisher-efficient, i.e., the resulting parameter estimate is asymptotically optimal for the Cramér–Rao bound; alternate proofs could be obtained by transferring related statements for the extended Kalman filter, e.g., combining techniques from [ŠKT01, BRD97, LS83].

Organization of the text. In Section 1 we set the notation, recall the definition of the natural gradient (Def. 1), and explain how Kalman filtering can be used for parameter estimation in statistical learning (Section 1.3); the definition of the Kalman filter is included in Def. 5. Section 2 gives the main statements for viewing the natural gradient as an instance of an extended Kalman filter for i.i.d. observations (static systems), first intuitively via a heuristic asymptotic argument (Section 2.1), then rigorously (Thm. 2, Prop. 3, Prop. 4). The proof of these results appears in Section 2.3 and sheds some light on the geometry of Kalman filtering. Finally, the case of non-i.i.d. observations (recurrent or state space model) is treated in Section 3.

Acknowledgments. Many thanks to Silvère Bonnabel, Gaétan Marceau-Caron, and the anonymous reviewers for their careful reading of the manuscript, corrections, and suggestions for the presentation and organization of the text. I would also like to thank Shun-ichi Amari, Frédéric Barbaresco, and Nando de Freitas for additional comments and for pointing out relevant references.

1. Problem setting, natural gradient, Kalman filter

1.1. Problem setting

In statistical learning, we have a series of observation pairs (u_1, y_1), ..., (u_t, y_t), ... and want to predict y_t from u_t using a probabilistic model p_θ. Assume for now that y_t is real-valued (regression problem) and that the model for y_t is a Gaussian centered on a predicted value ŷ_t, with known covariance matrix R_t, namely

    y_t = ŷ_t + N(0, R_t),    ŷ_t = h(θ, u_t)    (1.1)

The function h may represent any computation, for instance, a feedforward neural network with input u_t, parameters θ, and output ŷ_t.

The goal is to find the parameters θ such that the prediction ŷ_t = h(θ, u_t) is as close as possible to y_t: the loss function is

    ℓ_t = (1/2) (ŷ_t − y_t)^⊤ R_t^{-1} (ŷ_t − y_t) = − ln p(y_t | ŷ_t)    (1.2)

up to an additive constant.

For non-Gaussian outputs, we assume that the noise model on y_t given ŷ_t belongs to an exponential family, namely, that ŷ_t is the mean parameter of an exponential family of distributions^2 over y_t; we again define the loss function as ℓ_t := − ln p(y_t | ŷ_t), and the output noise R_t can be defined as the covariance matrix of the sufficient statistics of y_t given this mean (Def. 5). For a Gaussian output noise this works as expected. For instance, for a classification problem, the output is categorical, y ∈ {1, ..., K}, and ŷ will be the set of probabilities ŷ = (p_1, ..., p_{K−1}) to have y = 1, ..., K−1. In that case R_t is the (K−1)×(K−1) matrix (R_t)_{kk'} = diag(p_k) − p_k p_{k'}, i.e., R_t = diag(p) − p p^⊤. (The last probability p_K is determined by the others via Σ_k p_k = 1 and has to be excluded to obtain a non-degenerate parameterization and an invertible covariance matrix R_t.) This convention allows us to extend the definition of the Kalman filter to such a setting (Def. 5) in a natural way, just by replacing the measurement error y_t − ŷ_t with T(y_t) − ŷ_t, with T the sufficient statistics for the exponential family. (For Gaussian noise this is the same, as T(y) is y.) In neural network terms, this means that the output layer of the network is fed to a loss function that is the log-loss of an exponential family, but places no restriction on the rest of the model.

General notation. In statistical learning, the external inputs or regressor variables are often denoted x. In Kalman filtering, x often denotes the state of the system, while the external inputs are often u. Thus we will avoid x altogether and denote by u the inputs and by s the state of the system. The variable to be predicted at time t will be y_t, and ŷ_t is the corresponding prediction.

In general ŷ and y may be different objects in that ŷ encodes a full probabilistic prediction for y. For Gaussians with known variance, ŷ is just the predicted mean of y, so in this case y and ŷ are the same type of object. For Gaussians with unknown variance, ŷ encodes both the mean and second moment of y. For discrete categorical data, ŷ encodes the probability of each possible outcome y.

^2 The Appendix contains a reminder on exponential families. An exponential family of probability distributions on y, with sufficient statistics T_1(y), ..., T_K(y), and with parameter β ∈ R^K, is given by

    p_β(y) := (1/Z(β)) exp(Σ_k β_k T_k(y)) λ(dy)    (1.3)

where Z(β) is a normalizing constant, and λ(dy) is any reference measure on y. For instance, if y ∈ R^K, T_k(y) = y_k and λ(dy) is a Gaussian measure centered on 0, by varying β one gets all Gaussian measures with the same covariance matrix and another mean. y may be discrete, e.g., Bernoulli distributions correspond to λ the uniform measure on y ∈ {0, 1} and a single sufficient statistic T(0) = 0, T(1) = 1. Often, the mean parameter T̄ := E_{y∼p_β} T(y) is a more convenient parameterization than β. Exponential families maximize entropy (minimize information divergence from λ) for a given mean of T.
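To make the two output models above concrete, here is a small sketch (ours, not code from the paper) computing the generalized measurement error T(y) − ŷ and the covariance R of the sufficient statistics, for Gaussian output noise with known covariance and for a categorical output with K classes; the function names are ours.

```python
# Illustrative sketch (not from the paper): the error T(y) - yhat and the
# covariance R of the sufficient statistics, for the two output models above.
import numpy as np

def gaussian_error_and_R(y, yhat, R):
    # Gaussian output noise with known covariance R: T(y) = y.
    return y - yhat, R

def categorical_error_and_R(y, p):
    # Categorical output with K classes; yhat = (p_1, ..., p_{K-1}),
    # T(y) = one-hot encoding of y restricted to the first K-1 classes.
    p = np.asarray(p[:-1])              # drop p_K, determined by the others
    K = p.size + 1
    T = np.zeros(K - 1)
    if y < K - 1:
        T[y] = 1.0
    R = np.diag(p) - np.outer(p, p)     # (R)_{kk'} = diag(p_k) - p_k p_{k'}
    return T - p, R

# Example: 3 classes with predicted probabilities (0.5, 0.3, 0.2), observed class 1.
E, R = categorical_error_and_R(1, [0.5, 0.3, 0.2])
print(E)   # [-0.5  0.7]
print(R)   # covariance of the sufficient statistics
```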

Thus, the formal setting for this text is as follows: we are given a sequence of finite-dimensional observations (y_t), with each y_t ∈ R^{dim(y)}, a sequence of inputs (u_t), with each u_t ∈ R^{dim(u)}, and a parametric model ŷ = h(θ, u) with parameter θ ∈ R^{dim(θ)} and h some fixed smooth function from R^{dim(θ)} × R^{dim(u)} to R^{dim(ŷ)}. We are given an exponential family (output noise model) p(y | ŷ) on y with mean parameter ŷ and sufficient statistics T(y) (see the Appendix), and we define the loss function ℓ_t := − ln p(y_t | ŷ_t).

The natural gradient descent on parameter θ will use the Fisher matrix J_t. The Kalman filter will have posterior covariance matrix P_t.

For multidimensional quantities x and y = f(x), we denote by ∂y/∂x the Jacobian matrix of y w.r.t. x, whose (i, j) entry is ∂f_i(x)/∂x_j. This satisfies the chain rule (∂z/∂y)(∂y/∂x) = ∂z/∂x. With this convention, gradients of real-valued functions are row vectors, so that a gradient descent takes the form x ← x − η (∂f/∂x)^⊤. For a column vector u, u^⊗2 is synonymous with uu^⊤, and with u^⊤u for a row vector.

1.2. Natural gradient descent

A standard approach to optimize the parameter θ of a probabilistic model, given a sequence of observations (y_t), is an online gradient descent

    θ_t ← θ_{t−1} − η_t (∂ℓ_t(y_t)/∂θ)^⊤    (1.4)

with learning rate η_t. This simple gradient descent is particularly suitable for large datasets and large-dimensional models [BL03], but has several practical and theoretical shortcomings. For instance, it uses the same non-adaptive learning rate for all parameter components. Moreover, simple changes in parameter encoding or in data presentation (e.g., encoding black and white in images by 0/1 or 1/0) can result in different learning performance.

This motivated the introduction of the natural gradient [Ama98]. It is built to achieve invariance with respect to parameter re-encoding; in particular, learning becomes insensitive to the characteristic scale of each parameter direction, so that different directions naturally get suitable learning rates. The natural gradient is the only general way to achieve such invariance [AN00, §2.4].

The natural gradient preconditions the gradient descent with J(θ)^{-1} where J is the Fisher information matrix [Kul97] with respect to the parameter θ. For a smooth probabilistic model p(y | θ) over a random variable y with parameter θ, the latter is defined as

    J(θ) := E_{y∼p(y|θ)} [(∂ ln p(y|θ)/∂θ)^⊗2] = − E_{y∼p(y|θ)} [∂² ln p(y|θ)/∂θ²]    (1.5)

Definition 1 below formally introduces the online natural gradient. If the model for y involves an input u, then an expectation or empirical average over the input is introduced in the definition of J [AN00, §8.2] [Mar14, §5].
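As a quick numerical illustration of the double definition (1.5) — this sketch and the Bernoulli toy model are ours, not the paper's — the outer-product and negative-Hessian expressions of the Fisher matrix can be checked to agree:

```python
# Sketch (ours): the two expressions of the Fisher matrix in (1.5) agree,
# here for a Bernoulli model p(y=1|theta) = sigmoid(theta), y in {0, 1}.
import numpy as np

def log_p(y, theta):
    p1 = 1.0 / (1.0 + np.exp(-theta))
    return np.log(p1 if y == 1 else 1.0 - p1)

def num_grad(f, x, eps=1e-5):
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def num_hess(f, x, eps=1e-4):
    return (f(x + eps) - 2 * f(x) + f(x - eps)) / eps**2

theta = 0.3
p1 = 1.0 / (1.0 + np.exp(-theta))
probs = {0: 1.0 - p1, 1: p1}

sq_grad = sum(probs[y] * num_grad(lambda t: log_p(y, t), theta) ** 2 for y in (0, 1))
neg_hess = -sum(probs[y] * num_hess(lambda t: log_p(y, t), theta) for y in (0, 1))
print(sq_grad, neg_hess)   # both close to p1 * (1 - p1)
```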

However, this comes at a large computational cost for large-dimensional models: just storing the Fisher matrix already costs O((dim θ)²). Various strategies are available to approximate the natural gradient for complex models such as neural networks, using diagonal or block-diagonal approximation schemes for the Fisher matrix, e.g., [LMB07, Oll15, MCO16, GS15, MG15].

Definition 1 (Online natural gradient). Consider a statistical model with parameter θ that predicts an output y given an input u. Suppose that the prediction takes the form y ∼ p(y | ŷ) where ŷ = h(θ, u) depends on the input via a model h with parameter θ. Given observation pairs (u_t, y_t), the goal is to minimize, online, the loss function ℓ_t(y_t),

    ℓ_t(y) := − ln p(y | ŷ_t)    (1.6)

as a function of θ.

The online natural gradient maintains a current estimate θ_t of the parameter θ, and a current approximation J_t of the Fisher matrix. The parameter is estimated by a gradient descent with preconditioning matrix J_t^{-1}, namely

    J_t ← (1 − γ_t) J_{t−1} + γ_t E_{y∼p(y|ŷ_t)} [(∂ℓ_t(y)/∂θ)^⊗2]    (1.7)
    θ_t ← θ_{t−1} − η_t J_t^{-1} (∂ℓ_t(y_t)/∂θ)^⊤    (1.8)

with learning rate η_t and Fisher matrix decay rate γ_t.

In the Fisher matrix update, the expectation over all possible values y ∼ p(y | ŷ_t) can often be computed algebraically, but this is sometimes computationally bothersome (for instance, in neural networks, it requires dim(ŷ_t) distinct backpropagation steps [Oll15]). A common solution [APF00, LMB07, Oll15, PB13] is to just use the value y = y_t (outer product approximation) instead of the expectation over y. Another is to use a Monte Carlo approximation with a single sample of y ∼ p(y | ŷ_t) [Oll15, MCO16], namely, using the gradient of a synthetic sample instead of the actual observation y_t in the Fisher matrix. These latter two solutions are often confused; only the latter provides an unbiased estimate, see discussion in [Oll15, PB13].

The online smoothed update of the Fisher matrix in (1.7) mixes past and present estimates (this or similar updates are used in [LMB07, MCO16]). The reason is at least twofold. First, the genuine Fisher matrix involves an expectation over the inputs u_t [AN00, §8.2]: this can be approximated online only via a moving average over inputs (e.g., γ_t = 1/t realizes an equal-weight average over all inputs seen so far). Second, the expectation over y ∼ p(y | ŷ_t) in (1.7) is often replaced with a Monte Carlo estimation with only one value of y, and averaging over time compensates for this Monte Carlo sampling.

As a consequence, since θ changes over time, this means that the estimate J_t mixes values obtained at different values of θ, and converges to the Fisher matrix only if θ changes slowly, i.e., if η_t → 0. The correspondence below with Kalman filtering suggests using γ_t = η_t.
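Below is a minimal sketch of Definition 1 (our own illustration, not code from the paper), using the outer-product approximation y = y_t discussed above, on a hypothetical linear regression model ŷ = θ·u with unit output variance:

```python
# Minimal sketch (ours) of the online natural gradient of Definition 1, with the
# outer-product approximation y = y_t. Toy model: yhat = theta . u with Gaussian
# output noise of variance 1, so that dl_t/dtheta = (yhat - y) u.
import numpy as np

rng = np.random.default_rng(0)
dim = 3
theta_true = rng.normal(size=dim)
theta = np.zeros(dim)
J = np.eye(dim)                          # Fisher matrix initialization J_0

for t in range(1, 1001):
    u = rng.normal(size=dim)
    y = theta_true @ u + rng.normal()
    eta = gamma = 1.0 / (t + 1)          # learning rate and Fisher decay rate
    grad = (theta @ u - y) * u           # dl_t(y_t)/dtheta at theta_{t-1}
    J = (1 - gamma) * J + gamma * np.outer(grad, grad)   # (1.7), with y = y_t
    theta = theta - eta * np.linalg.solve(J, grad)       # (1.8)

print(np.linalg.norm(theta - theta_true))   # small, roughly O(1/sqrt(t))
```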

1.3. Kalman filtering for parameter estimation

One possible definition of the extended Kalman filter is as follows [Sim06, §15.1]. We are trying to estimate the current state of a dynamical system s_t whose evolution equation is known but whose precise value is unknown; at each time step, we have access to a noisy measurement y_t of a quantity ŷ_t = h(s_t) which depends on this state.

The Kalman filter maintains an approximation of a Bayesian posterior on s_t given the observations y_1, ..., y_t. The posterior distribution after t observations is approximated by a Gaussian with mean s_t and covariance matrix P_t. (Indeed, Bayesian posteriors always tend to Gaussians asymptotically under mild conditions, by the Bernstein–von Mises theorem [vdV00].) The Kalman filter prescribes a way to update s_t and P_t when new observations become available. The Kalman filter update is summarized in Definition 5 below. It is built to provide the exact value of the Bayesian posterior in the case of linear dynamical systems with Gaussian measurements and a Gaussian prior. In that sense, it is exact at first order.

The Kalman filtering viewpoint on a statistical learning problem is that we are facing a system with hidden variable θ, with an unknown value that does not evolve in time, and that the observations y_t bring more and more information on θ. Thus, a statistical learning problem can be tackled by applying the extended Kalman filter to the unknown variable s = θ, whose underlying dynamics from time t to time t+1 is just to remain unchanged (f = Id and noise on s is 0 in Definition 5). In such a setting, the posterior covariance matrix P_t will generally tend to 0 as observations accumulate and the parameter is identified better^3 (this occurs at rate 1/t for the basic filter, which estimates the parameter from all past observations at time t, or at other rates if fading memory is included, see below). The initialization θ_0 and its covariance P_0 can be interpreted as Bayesian priors on θ [SW88, LS83]. We will refer to this as a static Kalman filter.

^3 But P_t must still be maintained even if it tends to 0, since it is used to update the parameter at the correct rate.

In the static case and without fading memory, the posterior covariance P_t after t observations will decrease like O(1/t), so that the parameter gets updated by O(1/t) after each new observation. Introducing fading memory for past observations (equivalent to adding noise on θ at each step, Q_t ∝ P_{t−1} in Def. 5) leads to a larger covariance and faster updates.

An example: Feedforward neural networks. The Kalman approach above can be applied to any parametric statistical model. For instance [SW88] treat the case of a feedforward neural network. In our setting this is described as follows. Let u be the input of the model and y the true (desired) output. A feedforward neural network can be described as a function ŷ = h(θ, u) where θ is the set of all parameters of the network, where h represents all computations performed by the network on input u, and ŷ encodes the network prediction for the value of the output y on input u. For categorical observations y, ŷ is usually a set of predicted probabilities for all possible classes; while for regression problems, ŷ is directly the predicted value. In both cases, the error function to be minimized can be defined as ℓ(y) := − ln p(y | ŷ): in the regression case, ŷ is interpreted as a mean of a Gaussian model on y, so that − ln p(y | ŷ) is the square error up to a constant.

Training the neural network amounts to estimating the network parameter θ from the observations. Applying a static Kalman filter for this problem [SW88] amounts to using Def. 5 with s = θ, f = Id and Q = 0. At first glance this looks quite different from the common gradient descent (backpropagation) approach for neural networks. The backpropagation operation is represented in the Kalman filter by the computation of H_t = ∂h(s, u_t)/∂s in (2.17), where s is the parameter. We show that the additional operations of the Kalman filter correspond to using a natural gradient instead of a vanilla gradient. Unfortunately, for models with high-dimensional parameters such as neural networks, the Kalman filter is computationally costly and requires block-diagonal approximations for P_t (which is a square matrix of size dim θ); moreover, computing H_t = ∂ŷ_t/∂θ is needed in the filter, and requires doing one separate backpropagation for each component of the output ŷ_t.

2. Natural gradient as a Kalman filter: the static (i.i.d.) case

We now write the explicit correspondence between an online natural gradient to estimate the parameter of a statistical model from i.i.d. observations, and a static extended Kalman filter.

We first give a heuristic argument that outlines the main ideas from the proof (Section 2.1). Then we state the formal correspondences. First, the static Kalman filter corresponds to an online natural gradient with learning rate 1/t (Thm. 2). The rate 1/t arises because such a filter takes into account all previous evidence without decay factors (and with process noise Q = 0 in the Kalman filter), thus the posterior covariance matrix decreases like O(1/t). Asymptotically, this is the optimal rate in statistical learning [Ama98]. (Note, however, that the online natural gradient and extended Kalman filter are identical at every time step, not only asymptotically.)

The 1/t rate is often too slow in practical applications, especially when starting far away from an optimal parameter value. The natural gradient/Kalman filter correspondence is not specific to the O(1/t) rate. Larger learning rates in the natural gradient correspond to a fading memory Kalman filter (adding process noise Q proportional to the posterior covariance at each step, corresponding to a decay factor for the weight of previous observations); this is Proposition 3. In such a setting, the posterior covariance matrix in the Kalman filter does not decrease like O(1/t); for instance, a fixed decay factor for the fading memory corresponds to a constant learning rate.

Finally, a fading memory in the Kalman filter may erase prior Bayesian information (θ_0, P_0) too fast; maintaining the weight of the prior in a fading memory Kalman filter is treated in Proposition 4 and corresponds, on the natural gradient side, to a so-called weight decay [Bis06] towards θ_0 together with a regularization of the Fisher matrix, at specific rates.

2.1. Natural gradient as a Kalman filter: heuristics

As a first ingredient in the correspondence, we interpret Kalman filters as gradient descents: the extended Kalman filter actually performs a gradient descent on the log-likelihood of each new observation, with preconditioning matrix equal to the posterior covariance matrix. This is Proposition 6 below. This relies on having an exponential family as the output noise model.

Meanwhile, the natural gradient uses the Fisher matrix as a preconditioning matrix. The Fisher matrix is the average Hessian of log-likelihood, thanks to the classical double definition of the Fisher matrix as square gradient or Hessian, J(θ) = E_{y∼p(y|θ)}[(∂ ln p(y)/∂θ)^⊗2] = − E_{y∼p(y|θ)}[∂² ln p(y)/∂θ²] for any probabilistic model p(y|θ) [Kul97]. Assume that the probability of the data given the parameter θ is approximately Gaussian, p(y_1, ..., y_t | θ) ≈ exp(−(θ − θ_t)^⊤ Σ^{-1} (θ − θ_t)) with covariance Σ. This often holds asymptotically thanks to the Bernstein–von Mises theorem; moreover, the posterior covariance Σ typically decreases like 1/t. Then the Hessian (w.r.t. θ) of the total log-likelihood of (y_1, ..., y_t) is Σ^{-1}, the inverse covariance of θ. So the average Hessian per data point, the Fisher matrix J, is approximately J ≈ Σ^{-1}/t. Since a Kalman filter to estimate θ is essentially a gradient descent preconditioned with Σ, it will be the same as using a natural gradient with learning rate 1/t. Using a fading memory Kalman filter will estimate Σ from fewer past observations and provide larger learning rates.

Another way to understand the link between natural gradient and Kalman filter is as a second-order Taylor expansion of data log-likelihood. Assume that the total data negative log-likelihood at time t, L_t(θ) := − Σ_{s=1}^t ln p(y_s | θ), is approximately quadratic as a function of θ, with a minimum at θ_t and a Hessian h_t, namely, L_t(θ) ≈ (1/2) (θ − θ_t)^⊤ h_t (θ − θ_t). Then when new data points become available, this quadratic approximation would be updated as follows (online Newton method):

    h_t ← h_{t−1} + ∂²_θ(− ln p(y_t | θ_{t−1}))    (2.1)
    θ_t ← θ_{t−1} − h_t^{-1} ∂_θ(− ln p(y_t | θ_{t−1}))    (2.2)

and indeed these are equalities for a quadratic log-likelihood. Namely, the update of θ is a gradient ascent on log-likelihood, preconditioned by the inverse Hessian (Newton method). Note that h_t grows like t (each data point adds its own contribution). Thus, h_t is t times the empirical average of the Hessian, i.e., approximately t times the Fisher matrix of the model (h_t ≈ t J). So this update is approximately a natural gradient descent with learning rate 1/t.

Meanwhile, the Bayesian posterior on θ (with uniform prior) after observations y_1, ..., y_t is proportional to e^{−L_t} by definition of L_t. If L_t ≈ (1/2)(θ − θ_t)^⊤ h_t (θ − θ_t), this is a Gaussian distribution centered at θ_t with covariance matrix h_t^{-1}. The Kalman filter is built to maintain an approximation P_t of this covariance matrix h_t^{-1}, and then performs a gradient step preconditioned on P_t similar to (2.2).

The simplest situation corresponds to an asymptotic rate O(1/t), i.e., estimating the parameter based on all past evidence; the update (2.1) of the Hessian is additive, so that h_t grows like t, and h_t^{-1} in (2.2) produces an effective learning rate O(1/t). Introducing a decay factor for older observations, multiplying the term h_{t−1} in (2.1), produces a fading memory effect and results in larger learning rates.

These heuristics justify the statement from [LS83] that there is only one recursive identification method. Close to an optimum (so that the Hessian is positive), all second-order algorithms are essentially an online Newton step (2.1)–(2.2) approximated in various ways. But even though this heuristic argument appears to be approximate or asymptotic, the correspondence between online natural gradient and Kalman filter presented below is exact at every time step.

2.2. Statement of the correspondence, static (i.i.d.) case

For the statement of the correspondence, we assume that the output noise on y_t given ŷ_t is modelled by an exponential family with mean parameter ŷ_t. This covers the traditional Gaussian case y_t = N(ŷ_t, Σ) with fixed Σ often used in Kalman filters. The Appendix contains necessary background on exponential families.

Theorem 2 (Natural gradient as a static Kalman filter). These two algorithms are identical under the correspondence (θ_t, J_t) ↔ (s_t, P_t^{-1}/(t+1)):

1. The online natural gradient (Def. 1) with learning rates η_t = γ_t = 1/(t+1), applied to learn the parameter θ of a model that predicts observations (y_t) with inputs (u_t), using a probabilistic model y ∼ p(y | ŷ) with ŷ = h(θ, u), where h is any model and p(y | ŷ) is an exponential family with mean parameter ŷ.

2. The extended Kalman filter (Def. 5) to estimate the state s from observations (y_t) and inputs (u_t), using a probabilistic model y ∼ p(y | ŷ) with ŷ = h(s, u) and p(y | ŷ) an exponential family with mean parameter ŷ, with static dynamics and no added noise on s (f(s, u) = s and Q = 0 in Def. 5).

Namely, if at startup (θ_0, J_0) = (s_0, P_0^{-1}), then (θ_t, J_t) = (s_t, P_t^{-1}/(t+1)) for all t ≥ 0.

The correspondence is exact only if the Fisher metric is updated before the parameter in the natural gradient descent (as in Definition 1).
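The correspondence of Theorem 2 can be checked numerically. The following sketch (ours; the tanh model and all names are ours) runs the online natural gradient with η_t = γ_t = 1/(t+1) and the exact Fisher term alongside a static extended Kalman filter, updating the Fisher matrix before the parameter as required, and asserts θ_t = s_t and J_t = P_t^{-1}/(t+1) at every step:

```python
# Sketch (ours): numerical check of Theorem 2 on a toy scalar model
# yhat = h(theta, u) = tanh(theta . u), Gaussian output noise with variance R = 1.
import numpy as np

rng = np.random.default_rng(1)
dim, R = 2, 1.0

def h(theta, u):
    return np.tanh(theta @ u)

def H_jac(theta, u):                      # dh/dtheta, a row vector
    return (1.0 - np.tanh(theta @ u) ** 2) * u

theta = np.array([0.3, -0.2])
s = theta.copy()
P = 2.0 * np.eye(dim)                     # Bayesian prior covariance P_0
J = np.linalg.inv(P)                      # J_0 = P_0^{-1}

for t in range(1, 51):
    u = rng.normal(size=dim)
    y = np.tanh(u.sum()) + rng.normal()   # arbitrary data stream

    # online natural gradient, eta = gamma = 1/(t+1), exact Fisher term H^T R^-1 H
    Hng = H_jac(theta, u)
    J = (1 - 1/(t+1)) * J + (1/(t+1)) * np.outer(Hng, Hng) / R       # (1.7)
    grad = Hng * (h(theta, u) - y) / R                               # dl/dtheta
    theta = theta - (1/(t+1)) * np.linalg.solve(J, grad)             # (1.8)

    # static extended Kalman filter (f = Id, Q = 0)
    Hk = H_jac(s, u)
    K = P @ Hk / (Hk @ P @ Hk + R)        # Kalman gain (scalar output)
    P = P - np.outer(K, Hk @ P)           # (Id - K H) P
    s = s + K * (y - h(s, u))

    assert np.allclose(theta, s) and np.allclose(J, np.linalg.inv(P) / (t + 1))
print("correspondence holds at every step")
```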

The correspondence with a Kalman filter provides an interpretation for various hyper-parameters of online natural gradient descent. In particular, J_0 = P_0^{-1} can be interpreted as the inverse covariance of a Bayesian prior on θ [SW88]. This relates the initialization J_0 of the Fisher matrix to the initialization of θ: for instance, in neural networks it is recommended to initialize the weights according to a Gaussian of covariance diag(1/fan-in) (the fan-in being the number of incoming weights) for each neuron; interpreting this as a Bayesian prior on weights, one may recommend to initialize the Fisher matrix to the inverse of this covariance, namely,

    J_0 ≈ diag(fan-in)    (2.3)

Indeed this seemed to perform quite well in small-scale experiments.

Learning rates, fading memory, and metric decay rate. Theorem 2 exhibits a 1/(t+1) learning rate for the online natural gradient. This is because the static Kalman filter for i.i.d. observations approximates the maximum a posteriori (MAP) of the parameter θ based on all past observations; MAP and maximum likelihood estimators change by O(1/t) when a new data point is observed. However, for nonlinear systems, optimality of the 1/t rate only occurs asymptotically, close enough to the optimum. In general, a 1/(t+1) learning rate is far from optimal if optimization does not start close to the optimum or if one is not using the exact Fisher matrix J or covariance matrix P.

Larger effective learning rates are achieved thanks to so-called fading memory variants of the Kalman filter, which put less weight on older observations. For instance, one may multiply the log-likelihood of previous points by a forgetting factor (1 − λ_t) before each new observation. This is equivalent to an additional step P_{t−1} ← P_{t−1}/(1 − λ_t) in the Kalman filter, or to the addition of an artificial process noise Q_t proportional to P_{t−1} in the model. Such strategies are reported to often improve performance, especially when the data do not truly follow the model [Sim06, §5.5, §7.4], [Hay01, §5.2.2]. See for instance [Ber96] for the relationship between Kalman fading memory and gradient descent learning rates (in a particular case).

Proposition 3 (Natural gradient rates and fading memory). Under the same model and assumptions as in Theorem 2, the following two algorithms are identical via the correspondence (θ_t, J_t) ↔ (s_t, η_t P_t^{-1}):

- An online natural gradient step with learning rate η_t and metric decay rate γ_t;
- A fading memory Kalman filter with an additional step P_{t−1} ← P_{t−1}/(1 − λ_t) before the transition step; such a filter iteratively optimizes a weighted log-likelihood function L_t of recent observations, with decay (1 − λ_t) at each step, namely: L_t(θ) = ln p_θ(y_t) + (1 − λ_t) L_{t−1}(θ),

provided the following relations are satisfied:

    L_0(θ) := − (1/2) (θ − θ_0)^⊤ P_0^{-1} (θ − θ_0)    (2.4)
    η_t = γ_t,    P_0 = η_0 J_0^{-1},    (2.5)
    1 − λ_t = η_{t−1}/η_t − η_{t−1}    for t ≥ 1    (2.6)

For example, taking η_t = 1/(t + cst) corresponds to λ_t = 0, no decay for older observations, and an initial covariance P_0 = J_0^{-1}/cst. Taking a constant learning rate η_t = η_0 corresponds to a constant decay factor λ_t = η_0.

The proposition above computes the fading memory decay factors 1 − λ_t from the natural gradient learning rates η_t via (2.6). In the other direction, one can start with the decay factors λ_t and obtain the learning rates η_t via the cumulated sum of weights S_t: S_0 := 1/η_0, then S_t := (1 − λ_t) S_{t−1} + 1, then η_t := 1/S_t. This clarifies how λ_t ≡ 0 corresponds to η_t = 1/(t + cst) where the constant is S_0.

The learning rates also control the weight given to the Bayesian prior and to the starting point θ_0. For instance, with η_t = 1/(t + t_0) and large t_0, the gradient descent will move away slowly from θ_0; in the Kalman interpretation this corresponds to λ_t = 0 and a small initial covariance P_0 = J_0^{-1}/t_0 around θ_0, so that the prior weighs as much as t_0 observations.

This result suggests setting γ_t = η_t in the online natural gradient descent of Definition 1. The intuitive explanation for this setting is as follows: Both the Kalman filter and the natural gradient build a second-order approximation of the log-likelihood of past observations as a function of the parameter θ, as explained in Section 2.1. Using a fading memory corresponds to putting smaller weights on past observations; these weights affect the first-order and the second-order parts of the approximation in the same way. In the gradient viewpoint, the learning rate η_t corresponds to the first-order term (comparing (1.8) and (2.2)) while the Fisher matrix decay rate corresponds to the rate at which the second-order information is updated. Thus, the setting η_t = γ_t in the natural gradient corresponds to using the same decay weights for the first-order and second-order expansion of the log-likelihood of past observations.

Still, one should keep in mind that the extended Kalman filter is itself only an approximation for nonlinear systems. Moreover, from a statistical point of view, the second-order object J is higher-dimensional than the first-order information, so that estimating J based on more past observations may be more stable. Finally, for large-dimensional problems the Fisher matrix is always approximated, which affects optimality of the learning rates. So in practice, considering γ_t and η_t as hyperparameters to be tuned independently may still be beneficial, though γ_t = η_t seems a good place to start.
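The following sketch (ours) implements the two directions of this conversion between learning rates η_t and decay factors λ_t, i.e., relation (2.6) and the cumulated-sum-of-weights recursion:

```python
# Sketch (ours): converting natural gradient learning rates eta_t into fading-memory
# decay factors lambda_t via (2.6), and back via the cumulated sum of weights S_t.
import numpy as np

def etas_to_lambdas(etas):
    # (2.6): 1 - lambda_t = eta_{t-1}/eta_t - eta_{t-1}, for t >= 1
    return [1 - (etas[t-1] / etas[t] - etas[t-1]) for t in range(1, len(etas))]

def lambdas_to_etas(lambdas, eta0):
    # reverse direction: S_0 = 1/eta_0, S_t = (1 - lambda_t) S_{t-1} + 1, eta_t = 1/S_t
    S, etas = 1.0 / eta0, [eta0]
    for lam in lambdas:
        S = (1 - lam) * S + 1
        etas.append(1.0 / S)
    return etas

etas = [1.0 / (t + 10) for t in range(6)]        # eta_t = 1/(t + cst), cst = 10
print(np.round(etas_to_lambdas(etas), 12))       # ~0: no decay of older observations
print(np.allclose(lambdas_to_etas(etas_to_lambdas(etas), etas[0]), etas))  # True
print(np.round(etas_to_lambdas([0.1] * 6), 12))  # constant rate -> constant lambda = 0.1
```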

Regularization of the Fisher matrix and Bayesian priors. A potential downside of fading memory in the Kalman filter is that the Bayesian interpretation is partially lost, because the Bayesian prior is forgotten too quickly. For instance, with a constant learning rate, the weight of the Bayesian prior decreases exponentially; likewise, with η_t = O(1/√t), the filter essentially works with the O(√t) most recent observations, while the weight of the prior decreases like e^{−√t} (as does the weight of the earliest observations; this is the product Π(1 − λ_s)). But precisely, when working with fewer data points one may wish the prior to play a greater role.

The Bayesian interpretation can be restored by explicitly optimizing a combination of the log-likelihood of recent points, and the log-likelihood of the prior. This is implemented in Proposition 4. From the natural gradient viewpoint, this translates both as a regularization of the Fisher matrix (often useful in practice to numerically stabilize its inversion) and of the gradient step. With a Gaussian prior N(θ_prior, Id), this manifests as an additional step towards θ_prior and adding ε·Id to the Fisher matrix, known respectively as weight decay and Tikhonov regularization [Bis06, §3.3, §5.5] in statistical learning.

Proposition 4 (Bayesian regularization of the Fisher matrix). Let π = N(θ_prior, Σ_0) be a Gaussian prior on θ. Under the same model and assumptions as in Theorem 2, the following two algorithms are equivalent:

- A modified fading memory Kalman filter that iteratively optimizes L_t(θ) + n_prior ln π(θ) where L_t is a weighted log-likelihood function of recent observations with decay (1 − λ_t):

    L_t(θ) = ln p_θ(y_t) + (1 − λ_t) L_{t−1}(θ),    L_0 := 0    (2.7)

  initialized with P_0 = (η_1/(1 + n_prior η_1)) Σ_0.

- A regularized online natural gradient step with learning rate η_t and metric decay rate γ_t, initialized with J_0 = Σ_0^{-1},

    θ_t ← θ_{t−1} − η_t (J_t + η_t n_prior Σ_0^{-1})^{-1} ((∂ℓ_t(y_t)/∂θ)^⊤ + λ_t n_prior Σ_0^{-1} (θ_{t−1} − θ_prior))    (2.8)

provided the following relations are satisfied:

    η_t = γ_t,    1 − λ_t = η_{t−1}/η_t − η_{t−1},    η_0 := η_1    (2.9)

Thus, the regularization terms are fully determined by choosing the learning rates η_t, a prior such as N(0, 1/fan-in) (for neural networks), and a value of n_prior such as n_prior = 1 (the prior weighs as much as n_prior data points). This holds both for regularization of the Fisher matrix J_t + η_t n_prior Σ_0^{-1}, and for regularization of the parameter via the extra gradient step λ_t n_prior Σ_0^{-1} (θ − θ_prior).

The relative strength of regularization in the Fisher matrix decreases like η_t. In particular, a constant learning rate results in a constant regularization. The added gradient step λ_t n_prior Σ_0^{-1} (θ − θ_prior) is modulated by λ_t which depends on η_t; this extra term pulls towards the prior θ_prior. The Bayesian viewpoint guarantees that this extra term will not ultimately prevent convergence of the gradient descent (as the influence of the prior vanishes when the number of observations increases).

It is not clear how much these recommendations for natural gradient descent coming from its Bayesian interpretation are sensitive to using only an approximation of the Fisher matrix.
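As an illustration of (2.8)–(2.9) (a sketch of ours, not the paper's code), one regularized natural gradient step with a Gaussian prior reads:

```python
# Sketch (ours): the regularized online natural gradient step (2.8), with a Gaussian
# prior N(theta_prior, Sigma_0) that weighs as much as n_prior data points.
import numpy as np

def regularized_nat_grad_step(theta, J, grad, eta, eta_prev,
                              theta_prior, Sigma0_inv, n_prior=1.0):
    """One step of (2.8); grad is dl_t(y_t)/dtheta at theta_{t-1}."""
    lam = 1.0 - (eta_prev / eta - eta_prev)                    # lambda_t from (2.9)
    precond = J + eta * n_prior * Sigma0_inv                   # Tikhonov-regularized Fisher
    direction = grad + lam * n_prior * Sigma0_inv @ (theta - theta_prior)  # weight decay
    return theta - eta * np.linalg.solve(precond, direction)

# one illustrative step: 2 parameters, prior N(0, Id), constant learning rate 0.1
theta = np.array([1.0, 1.0])
J = np.eye(2)
grad = np.array([0.5, -0.3])        # some gradient dl_t/dtheta
theta_new = regularized_nat_grad_step(theta, J, grad, eta=0.1, eta_prev=0.1,
                                      theta_prior=np.zeros(2), Sigma0_inv=np.eye(2))
print(theta_new)
```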

2.3. Proofs for the static case

The proof of Theorem 2 starts with the interpretation of the Kalman filter as a gradient descent (Proposition 6). We first recall the exact definition and the notation we use for the extended Kalman filter.

Definition 5 (Extended Kalman filter). Consider a dynamical system with state s_t, inputs u_t and outputs y_t,

    s_t = f(s_{t−1}, u_t) + N(0, Q_t),    ŷ_t = h(s_t, u_t),    y_t ∼ p(y_t | ŷ_t)    (2.10)

where p(· | ŷ) denotes an exponential family with mean parameter ŷ (e.g., y = N(ŷ, R) with fixed covariance matrix R).

The extended Kalman filter for this dynamical system estimates the current state s_t given observations y_1, ..., y_t in a Bayesian fashion. At each time t, the Bayesian posterior distribution of the state given y_1, ..., y_t is approximated by a Gaussian N(s_t, P_t) so that s_t is the approximate maximum a posteriori, and P_t is the approximate posterior covariance matrix. (The prior is N(s_0, P_0) at time 0.)

Each time a new observation y_t is available, these estimates are updated as follows. The transition step (before observing y_t) is

    s_{t|t−1} ← f(s_{t−1}, u_t)    (2.11)
    F_{t|t−1} ← ∂f/∂s (s_{t−1}, u_t)    (2.12)
    P_{t|t−1} ← F_{t|t−1} P_{t−1} F_{t|t−1}^⊤ + Q_t    (2.13)

and the observation step after observing y_t is

    ŷ_t ← h(s_{t|t−1}, u_t)    (2.14)
    E_t ← sufficient statistics(y_t) − ŷ_t    (2.15)
    R_t ← Cov(sufficient statistics(y) | ŷ_t)    (2.16)

(these are just the error E_t = y_t − ŷ_t and the covariance matrix R_t = R for a Gaussian model y = N(ŷ, R) with known R)

    H_t ← ∂h/∂s (s_{t|t−1}, u_t)    (2.17)
    K_t ← P_{t|t−1} H_t^⊤ (H_t P_{t|t−1} H_t^⊤ + R_t)^{-1}    (2.18)
    P_t ← (Id − K_t H_t) P_{t|t−1}    (2.19)
    s_t ← s_{t|t−1} + K_t E_t    (2.20)
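For concreteness, here is a sketch (ours) of one transition-plus-observation step of Definition 5, specialized to Gaussian output noise with known covariance R (so that E_t = y_t − ŷ_t and R_t = R); the toy static-parameter example at the end is also ours:

```python
# Sketch (ours): one step of the extended Kalman filter of Definition 5, for Gaussian
# output noise with known covariance R (so E_t = y_t - yhat_t and R_t = R).
import numpy as np

def ekf_step(s, P, u, y, f, h, F_jac, H_jac, Q, R):
    # transition step (2.11)-(2.13)
    s_pred = f(s, u)
    F = F_jac(s, u)
    P_pred = F @ P @ F.T + Q
    # observation step (2.14)-(2.20)
    yhat = h(s_pred, u)
    E = y - yhat
    H = H_jac(s_pred, u)
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)
    P_new = (np.eye(len(s)) - K @ H) @ P_pred
    s_new = s_pred + K @ E
    return s_new, P_new

# static parameter estimation (f = Id, Q = 0) for a toy model yhat = tanh(theta . u)
f = lambda s, u: s
F_jac = lambda s, u: np.eye(len(s))
h = lambda s, u: np.array([np.tanh(s @ u)])
H_jac = lambda s, u: ((1 - np.tanh(s @ u) ** 2) * u).reshape(1, -1)

s, P = np.zeros(2), np.eye(2)
rng = np.random.default_rng(0)
for _ in range(100):
    u = rng.normal(size=2)
    y = np.array([np.tanh(0.7 * u[0] - 0.4 * u[1])]) + 0.1 * rng.normal(size=1)
    s, P = ekf_step(s, P, u, y, f, h, F_jac, H_jac, Q=np.zeros((2, 2)), R=0.01 * np.eye(1))
print(s)   # roughly (0.7, -0.4)
```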

For non-Gaussian output noise, the definition of E_t and R_t above via the mean parameter ŷ of an exponential family differs from the practice of modelling non-Gaussian noise via a nonlinear function applied to Gaussian noise. This allows for a straightforward treatment of various output models, such as discrete outputs or Gaussians with unknown variance. In the Gaussian case with known variance our definition is fully standard.^4

^4 Non-Gaussian output noise is often modelled in Kalman filtering via a continuous nonlinear function applied to a Gaussian noise [Sim06, §13.1]; this cannot easily represent discrete random variables. Moreover, since the filter linearizes the function around the 0 value of the noise [Sim06, §13.1], the noise is still implicitly Gaussian, though with a state-dependent variance.

The proof starts with the interpretation of the Kalman filter as a gradient descent preconditioned by P_t. Compare this result and Lemma 9 to [Hay01, (5.68)–(5.73)].

Proposition 6 (Kalman filter as preconditioned gradient descent). The update of the state s in a Kalman filter can be seen as an online gradient descent on data log-likelihood, with preconditioning matrix P_t. More precisely, denoting ℓ_t(y) := − ln p(y | ŷ_t), the update (2.20) is equivalent to

    s_t = s_{t|t−1} − P_t (∂ℓ_t(y_t)/∂s_{t|t−1})^⊤    (2.21)

where in the derivative, ℓ_t depends on s_{t|t−1} via ŷ_t = h(s_{t|t−1}, u_t).

Lemma 7 (Errors and gradients). When the output model is an exponential family with mean parameter ŷ_t, the error E_t is related to the gradient of the log-likelihood of the observation y_t with respect to the prediction ŷ_t by

    E_t = R_t (∂ ln p(y_t | ŷ_t)/∂ŷ_t)^⊤

Proof of the lemma. For a Gaussian y_t = N(ŷ_t, R), this is just a direct computation. For a general exponential family, consider the natural parameter β of the exponential family which defines the law of y_t, namely, p(y | β) = exp(Σ_i β_i T_i(y))/Z(β) with sufficient statistics T_i and normalizing constant Z. An elementary computation (Appendix, (A.3)) shows that

    ∂ ln p(y | β)/∂β_i = T_i(y) − E T_i = T_i(y) − ŷ_i    (2.22)

by definition of the mean parameter ŷ. Thus,

    E_t = T(y_t) − ŷ_t = (∂ ln p(y_t | β)/∂β)^⊤    (2.23)

where the derivative is with respect to the natural parameter β. To express the derivative with respect to ŷ, we apply the chain rule ∂ ln p(y|β)/∂β = (∂ ln p(y|ŷ)/∂ŷ)(∂ŷ/∂β) and use the fact that, for exponential families, the Jacobian matrix ∂ŷ/∂β of the mean parameter ŷ is equal to the covariance matrix R_t of the sufficient statistics (Appendix, (A.11) and (A.6)).

Lemma 8. The extended Kalman filter satisfies K_t R_t = P_t H_t^⊤.

Proof of the lemma. This relation is known, e.g., [Sim06, (6.34)]. Indeed, using the definition of K_t, we have K_t R_t = K_t (R_t + H_t P_{t|t−1} H_t^⊤) − K_t H_t P_{t|t−1} H_t^⊤ = P_{t|t−1} H_t^⊤ − K_t H_t P_{t|t−1} H_t^⊤ = (Id − K_t H_t) P_{t|t−1} H_t^⊤ = P_t H_t^⊤.

Proof of Proposition 6. By definition of the Kalman filter we have s_t = s_{t|t−1} + K_t E_t. By Lemma 7, E_t = R_t (∂ ln p(y_t | ŷ_t)/∂ŷ_t)^⊤ = − R_t (∂ℓ_t/∂ŷ_t)^⊤. Thanks to Lemma 8 we find s_t = s_{t|t−1} − K_t R_t (∂ℓ_t/∂ŷ_t)^⊤ = s_{t|t−1} − P_t H_t^⊤ (∂ℓ_t/∂ŷ_t)^⊤. But by the definition of H_t, H_t is ∂ŷ_t/∂s_{t|t−1}, so that (∂ℓ_t/∂ŷ_t) H_t is ∂ℓ_t/∂s_{t|t−1}.

The first part of the next lemma is known as the information filter in the Kalman filter literature, and states that the observation step for P_t is additive when considered on P_t^{-1} [Sim06, §6.2]: after each observation, the Fisher information matrix of the latest observation is added to P^{-1}.

Lemma 9 (Information filter). The update (2.18)–(2.19) of P_t in the extended Kalman filter is equivalent to

    P_t^{-1} ← P_{t|t−1}^{-1} + H_t^⊤ R_t^{-1} H_t    (2.24)

(assuming P_{t|t−1} and R_t are invertible). In particular, for static dynamical systems (f(s, u) = s and Q = 0), the whole extended Kalman filter (2.12)–(2.20) is equivalent to

    P_t^{-1} ← P_{t−1}^{-1} + H_t^⊤ R_t^{-1} H_t    (2.25)
    s_t ← s_{t−1} − P_t (∂ℓ_t(y_t)/∂s_{t−1})^⊤    (2.26)

Proof. The first statement is well-known for Kalman filters [Sim06, (6.33)]. Indeed, expanding the definition of K_t in the update (2.19) of P_t, we have

    P_t = P_{t|t−1} − P_{t|t−1} H_t^⊤ (H_t P_{t|t−1} H_t^⊤ + R_t)^{-1} H_t P_{t|t−1}    (2.27)

but this is equal to (P_{t|t−1}^{-1} + H_t^⊤ R_t^{-1} H_t)^{-1} thanks to the Woodbury matrix identity. The second statement follows from Proposition 6 and the fact that for f(s, u) = s, the transition step of the Kalman filter is just s_{t|t−1} = s_{t−1} and P_{t|t−1} = P_{t−1}.

Lemma 10. For exponential families p(y | ŷ), the term H_t^⊤ R_t^{-1} H_t appearing in Lemma 9 is equal to the Fisher information matrix of y_t with respect to the state s,

    H_t^⊤ R_t^{-1} H_t = E_{y∼p(y|ŷ_t)} [(∂ℓ_t(y)/∂s_{t|t−1})^⊗2]

where ℓ_t(y) = − ln p(y | ŷ_t) depends on s via ŷ_t = h(s, u_t).

Proof. Let us omit time indices for brevity. We have ∂ℓ(y)/∂s = (∂ℓ(y)/∂ŷ)(∂ŷ/∂s) = (∂ℓ(y)/∂ŷ) H. Consequently, E_y [(∂ℓ(y)/∂s)^⊗2] = H^⊤ E_y [(∂ℓ(y)/∂ŷ)^⊗2] H. The middle term E_y [(∂ℓ(y)/∂ŷ)^⊗2] is the Fisher matrix of the random variable y with respect to ŷ. Now, for an exponential family y ∼ p(y | ŷ) in mean parameterization ŷ, the Fisher matrix with respect to ŷ is equal to the inverse covariance matrix of the sufficient statistics of y (Appendix, (A.16)), that is, R^{-1}.

Proof of Theorem 2. By induction on t. By the combination of Lemmas 9 and 10, the update of the Kalman filter with static dynamics (s_{t|t−1} = s_{t−1}) is

    P_t^{-1} ← P_{t−1}^{-1} + E_{y∼p(y|ŷ_t)} [(∂ℓ_t(y)/∂s_{t−1})^⊗2]    (2.28)
    s_t ← s_{t−1} − P_t (∂ℓ_t(y_t)/∂s_{t−1})^⊤    (2.29)

Defining J_t = P_t^{-1}/(t+1), this update is equivalent to

    J_t ← (t/(t+1)) J_{t−1} + (1/(t+1)) E_{y∼p(y|ŷ_t)} [(∂ℓ_t(y)/∂s_{t−1})^⊗2]
    s_t ← s_{t−1} − (1/(t+1)) J_t^{-1} (∂ℓ_t(y_t)/∂s_{t−1})^⊤

Under the identification s_{t−1} ↔ θ_{t−1}, this is the online natural gradient update with learning rate η_t = 1/(t+1) and metric update rate γ_t = 1/(t+1).
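The Woodbury step used in this proof, i.e., the equivalence between the covariance update (2.18)–(2.19) and the information-filter form (2.24), can be checked numerically; the following small sketch is ours:

```python
# Sketch (ours): numerical check that the covariance update (2.18)-(2.19) equals the
# information-filter form (2.24), P_t^{-1} = P_{t|t-1}^{-1} + H^T R^{-1} H (Lemma 9).
import numpy as np

rng = np.random.default_rng(0)
dim_s, dim_y = 4, 2
A = rng.normal(size=(dim_s, dim_s)); P_pred = A @ A.T + np.eye(dim_s)  # any SPD matrix
B = rng.normal(size=(dim_y, dim_y)); R = B @ B.T + np.eye(dim_y)
H = rng.normal(size=(dim_y, dim_s))

K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)                     # (2.18)
P_obs = (np.eye(dim_s) - K @ H) @ P_pred                                   # (2.19)
P_info = np.linalg.inv(np.linalg.inv(P_pred) + H.T @ np.linalg.inv(R) @ H) # (2.24)
print(np.allclose(P_obs, P_info))   # True
```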

The proof of Proposition 3 is similar, with additional factors (1 − λ_t). Proposition 4 is proved by applying a fading memory Kalman filter to a modified log-likelihood L_0 := n_prior ln π(θ), L_t := ln p_θ(y_t) + (1 − λ_t) L_{t−1} + λ_t n_prior ln π(θ), so that the prior is kept constant in L_t.

3. Natural gradient as a Kalman filter: the state space (recurrent) case

3.1. Recurrent models, RTRL

Let us now consider non-memoryless models, i.e., models defined by a recurrent or state space equation

    ŷ_t = Φ(ŷ_{t−1}, θ, u_t)    (3.1)

with u_t the inputs at time t. To save notation, here we dump into ŷ_t the whole state of the model, including both the part that contains the prediction about y_t and all state or internal variables (e.g., all internal and output layers of a recurrent neural network, not only the output layer). The state ŷ_t, or a part thereof, defines a loss function ℓ_t(y_t) := − ln p(y_t | ŷ_t) for each observation y_t.

The current state ŷ_t can be seen as a function which depends on θ via the whole trajectory. The derivative of the current state with respect to θ can be computed inductively just by differentiating the recurrent equation (3.1) defining ŷ_t:

    ∂ŷ_t/∂θ = ∂Φ(ŷ_{t−1}, θ, u_t)/∂θ + (∂Φ(ŷ_{t−1}, θ, u_t)/∂ŷ_{t−1}) (∂ŷ_{t−1}/∂θ)    (3.2)

Real-time recurrent learning [Jae02] uses this equation to keep an estimate G_t of ∂ŷ_t/∂θ. RTRL then uses G_t to estimate the gradient of the loss function ℓ_t with respect to θ via the chain rule, ∂ℓ_t/∂θ = (∂ℓ_t/∂ŷ_t)(∂ŷ_t/∂θ) = (∂ℓ_t/∂ŷ_t) G_t.

Definition 11 (Real-time recurrent learning). Given a recurrent model ŷ_t = Φ(ŷ_{t−1}, θ_{t−1}, u_t), real-time recurrent learning (RTRL) learns the parameter θ via

    G_t ← ∂Φ/∂θ_{t−1} + (∂Φ/∂ŷ_{t−1}) G_{t−1},    G_0 := 0    (3.3)
    g_t ← (∂ℓ_t(y_t)/∂ŷ_t) G_t    (3.4)
    θ_t ← θ_{t−1} − η_t g_t^⊤    (3.5)

Since θ changes at each step, the actual estimate G_t in RTRL is only an approximation of the gradient ∂ŷ_t/∂θ at θ = θ_t, valid in the limit of small learning rates η_t.

In practice, RTRL has a high computational cost due to the necessary storage of G_t, a matrix of size dim θ × dim ŷ. For large-dimensional models, backpropagation through time is usually preferred, truncated to a certain length in the past [Jae02]; [OTC15, TO17] introduce a low-rank, unbiased approximation of G_t.
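Here is a minimal sketch of Definition 11 (ours, not from the paper) on a hypothetical scalar recurrent model ŷ_t = tanh(a ŷ_{t−1} + b u_t) with θ = (a, b) and square loss:

```python
# Sketch (ours): RTRL (Definition 11) on a toy scalar recurrent model
# yhat_t = Phi(yhat_{t-1}, theta, u_t) = tanh(a * yhat_{t-1} + b * u_t), theta = (a, b),
# with Gaussian output noise of variance 1, so l_t(y) = 0.5 * (yhat_t - y)^2.
import numpy as np

rng = np.random.default_rng(0)
a_true, b_true = 0.5, 0.8
theta = np.zeros(2)
yhat, yhat_true = 0.0, 0.0
G = np.zeros(2)                              # G_t, estimate of d yhat_t / d theta

for t in range(1, 3001):
    u = rng.normal()
    yhat_true = np.tanh(a_true * yhat_true + b_true * u)
    y = yhat_true + 0.1 * rng.normal()

    a, b = theta
    pre = a * yhat + b * u                   # Phi evaluated at (yhat_{t-1}, theta, u_t)
    d = 1.0 - np.tanh(pre) ** 2
    dPhi_dtheta = d * np.array([yhat, u])    # dPhi/d(a, b)
    dPhi_dyprev = d * a                      # dPhi/d yhat_{t-1}
    G = dPhi_dtheta + dPhi_dyprev * G        # (3.3)
    yhat = np.tanh(pre)                      # new state yhat_t

    g = (yhat - y) * G                       # (3.4): dl_t/dyhat_t times G_t
    theta = theta - 0.05 * g                 # (3.5), constant learning rate

print(theta)   # moves roughly towards (0.5, 0.8) on this toy run
```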

3.2. Statement of the correspondence, recurrent case

There are several ways in which a Kalman filter can be used to estimate θ for such recurrent models.

1. A first possibility is to view each ŷ_t as a function of θ via the whole trajectory, and to apply a Kalman filter on θ. This would require, in principle, recomputing the whole trajectory from time 0 to time t using the new value of θ at each step, and using RTRL to compute ∂ŷ_t/∂θ, which is needed in the filter. In practice, the past trajectory is not updated, and truncated backpropagation through time is used to approximate the derivative ∂ŷ_t/∂θ [Jae02, Hay01].

2. A second possibility is the joint Kalman filter, namely, a Kalman filter on the pair (θ, ŷ_t) [Hay01, §5], [Sim06, §13.4]. This does not require going back in time, as ŷ_t is a function of ŷ_{t−1} and θ. This is the version appearing in Theorem 12 below.

3. A third possibility is the dual Kalman filter [WN96]: a Kalman filter for θ given ŷ, and another one for ŷ given θ. This requires explicitly coupling the two Kalman filters by manually adding RTRL-like terms to account for the (linearized) dependency of ŷ on θ [Hay01, §5].

Intuitively, the joint Kalman filter maintains a covariance matrix on (θ, ŷ_t), whose off-diagonal term is the covariance between ŷ and θ. This term captures how the current state would change if another value of the parameter had been used. The decomposition (3.13) in the theorem makes this intuition precise in relation to RTRL: the Kalman covariance between ŷ and θ is directly given by the RTRL gradient G_t.

Theorem 12 (Kalman filter on (θ, ŷ) as RTRL + natural gradient + state correction). Consider a recurrent model ŷ_t = Φ(ŷ_{t−1}, θ_{t−1}, u_t). Assume that the observations y_t are predicted with a probabilistic model p(y_t | ŷ_t) that is an exponential family with mean parameter a subset of ŷ_t. Given an estimate G_t of ∂ŷ_t/∂θ, and an observation y_t, denote

    g_t(y) := (∂ℓ_t(y)/∂ŷ_t) G_t    (3.6)

the corresponding estimate of ∂ℓ_t(y)/∂θ. Then these two algorithms are equivalent:

- The extended Kalman filter on the pair (θ, ŷ) with transition function (Id, Φ), initialized with covariance matrix P_0^{(θ,ŷ)} = ( P_0^θ  0 ; 0  0 ), and with no process noise (Q = 0).

- A natural gradient RTRL algorithm with learning rate η_t = 1/(t+1), defined as follows. The state, RTRL gradient and Fisher matrix have a transition step

    ŷ_t ← Φ(ŷ_{t−1}, θ_{t−1}, u_t)    (3.7)