Gaussian Process Approximations of Stochastic Differential Equations


Cédric Archambeau, Dan Cornford, Manfred Opper, John Shawe-Taylor
University of Southampton / Aston University / Technical University Berlin

May 2006

1 Introduction

Some of the most complex models routinely run are numerical weather prediction models. These models are based on a discretisation of a coupled set of partial differential equations (the dynamics) which govern the time evolution of the atmosphere, described in terms of temperature, pressure, velocity, etc., together with parameterisations of physical processes such as radiation and clouds [3]. These dynamical models typically have state vectors of dimension $O(10^6)$ or more. A key issue in numerical weather prediction is the inference of the state vector given a set of observations, which is referred to as data assimilation. Most modern data assimilation methods can be seen, in a sequential Bayesian setting, as the estimation of the posterior distribution of the state, given a prior distribution of the state derived from the numerical weather prediction model and a likelihood that relates the observations to the state [4]. In general the prior distribution is approximated as a space-only Gaussian process, the observations are often given Gaussian likelihoods, and thus inference of the posterior proceeds using well-known Gaussian process methods [6]. It is possible to view almost all data assimilation methods as approximations to the Kalman filter / smoother.

Much early work in data assimilation focussed on the static case, where the prior distribution was assumed to have a climatological covariance structure whose form was dictated in many cases by the model dynamics, and a mean given by a deterministic forecast from the previous data assimilation cycle's posterior mean [3]. At each time step the covariance of the state was thus ignored and only the mean was propagated forward in time. Recently much work has been done to address the propagation of the uncertainty at initial time through the non-linear model equations. The most popular method, the ensemble Kalman filter [5], is a Monte Carlo approach. The basic idea is very simple:

1. take a sample from the prior distribution (in practice a sample of size $O(100)$ or less is used, even for high dimensional systems);

2. propagate each sample member by integrating the model equations deterministically, to produce a forecast ensemble;

3. use the forecast ensemble to estimate the forecast mean and covariance (rank deficiency problems are avoided by localising this estimate, which also reduces the effect of sampling noise);

4. carefully update each ensemble member using a Kalman filter derived update, so that the members form a sample from the corresponding posterior distribution, and return to step 2.

The ensemble approach has many advantages, but it is suboptimal in several respects: the sampling noise can be large and uncontrolled, it is not trivial to incorporate non-Gaussian likelihoods, it requires several model integrations, and it is impractical to implement as a smoother. See Appendix A for more information about the link to the Kalman filter.
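To make the steps listed above concrete, the following is a minimal sketch of one ensemble Kalman filter forecast–update cycle for a small toy system; it is not part of the original report. The dynamics function model_step, the observation operator H, the dimensions and noise levels are all hypothetical placeholders, and covariance localisation is omitted for brevity.

```python
import numpy as np

def model_step(x, dt=0.01, n_steps=10):
    """Hypothetical nonlinear dynamics: deterministic integration of one member."""
    for _ in range(n_steps):
        x = x + dt * np.array([x[1], -np.sin(x[0])])  # pendulum-like toy model
    return x

def enkf_cycle(ensemble, y, H, R, rng):
    """One ensemble Kalman filter cycle: forecast, moment estimation, update."""
    n_members = ensemble.shape[0]
    # step 2: propagate each member deterministically to obtain the forecast ensemble
    forecast = np.array([model_step(x) for x in ensemble])
    # step 3: estimate forecast mean and covariance from the ensemble
    mean = forecast.mean(axis=0)
    anomalies = forecast - mean
    P = anomalies.T @ anomalies / (n_members - 1)
    # step 4: Kalman-derived update of each member (perturbed-observation variant)
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    analysis = np.array([x + K @ (y + rng.multivariate_normal(np.zeros(len(y)), R) - H @ x)
                         for x in forecast])
    return analysis

rng = np.random.default_rng(0)
ensemble = rng.normal(size=(50, 2))          # step 1: sample from the prior
y = np.array([0.3])                          # a single synthetic observation
H = np.array([[1.0, 0.0]])                   # observe the first state component
R = np.array([[0.1]])                        # observation noise covariance
ensemble = enkf_cycle(ensemble, y, H, R, rng)
print(ensemble.mean(axis=0))
```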

In the following we present some initial results from the VISDEM project, where we seek a variational Bayesian treatment of the dynamic data assimilation problem, building upon our variational Bayesian Gaussian process treatment of the static data assimilation problem [1]. In particular we focus on the issue of defining a Gaussian process approximation to the temporal evolution of the solution of a general stochastic differential equation with additive noise.

2 Modelling stochastic differential equations

We assume that the variables divide into two groups,
$$
x = \begin{pmatrix} \hat{x} \\ \tilde{x} \end{pmatrix},
$$
such that the noise level is $\hat\sigma$ on the $\hat{x}$ variables, assumed to be $k$ of the $n$, while for the remainder $\tilde{x}$ it is $\tilde\sigma$. Assuming the noise within each group to be of equal variance and uncorrelated, we have
$$
\Delta\hat{x}_i = \hat{x}_{i+1} - \hat{x}_i = \hat{f}(x_i)\,\Delta t + \hat\sigma\,\hat{z}\,\sqrt{\Delta t},
\qquad
\Delta\tilde{x}_i = \tilde{x}_{i+1} - \tilde{x}_i = \tilde{f}(x_i)\,\Delta t + \tilde\sigma\,\tilde{z}\,\sqrt{\Delta t},
$$
where $\hat{z}$ ($\tilde{z}$) is a vector of dimension $k$ ($n-k$) drawn from a multivariate Gaussian with identity covariance. Hence, the true probability of a sequence $\{x_i\}_{i=1}^{N}$ is given by
$$
P\big(\{x_i\}_{i=1}^{N}\big) = \prod_{i=1}^{N-1}
\frac{1}{(2\pi\hat\sigma^2\Delta t)^{k/2}\,(2\pi\tilde\sigma^2\Delta t)^{(n-k)/2}}
\exp\!\left( -\frac{\|\Delta\hat{x}_i - \hat{f}(x_i)\Delta t\|^2}{2\hat\sigma^2\Delta t}
- \frac{\|\Delta\tilde{x}_i - \tilde{f}(x_i)\Delta t\|^2}{2\tilde\sigma^2\Delta t} \right).
$$
We will approximate this distribution by a distribution $Q$, which we assume has the following form:
$$
Q\big(\{x_i\}_{i=1}^{N}\big) = \prod_{i=1}^{N-1}
\frac{1}{(2\pi\hat\sigma^2\Delta t)^{k/2}\,(2\pi\tilde\sigma^2\Delta t)^{(n-k)/2}}
\exp\!\left( -\frac{\|\Delta\hat{x}_i - (\hat{A}_i x_i + \hat{b}_i)\Delta t\|^2}{2\hat\sigma^2\Delta t}
- \frac{\|\Delta\tilde{x}_i - (\tilde{A}_i x_i + \tilde{b}_i)\Delta t\|^2}{2\tilde\sigma^2\Delta t} \right),
$$
where the matrices $\hat{A}_i$, $\tilde{A}_i$ and vectors $\hat{b}_i$, $\tilde{b}_i$ are parameters that will be adjusted to minimise the KL divergence $\mathrm{KL}(Q\|P)$ between the two distributions.

We present here an approximation of the exact optimisation of the KL divergence that extends the Kalman filter approach. The detailed derivations are given in Appendix B; we summarise the results here. The optimal parameters are
$$
A_i = \big\langle (f(x_i) - b_i)\, x_i^\top \big\rangle \big\langle x_i x_i^\top \big\rangle^{-1} = A(m_i, \Sigma_i)
\qquad \text{and} \qquad
b_i = \langle f(x_i) \rangle - A_i \langle x_i \rangle = b(m_i, \Sigma_i),
$$
where the averages are over the Gaussian approximation at stage $i$ with mean $m_i$ and covariance $\Sigma_i$. Taking the limit as the interval $\Delta t$ tends to zero, we obtain differential equations for these quantities, which are now all functions of $t$:
$$
\frac{dm}{dt} = A(m, \Sigma)\, m + b(m, \Sigma) = \langle f(x) \rangle,
$$
and
$$
\frac{d\Sigma}{dt} = \mathrm{diag}(\sigma^2) + A(m, \Sigma)\,\Sigma + \Sigma\, A(m, \Sigma)^\top
= \mathrm{diag}(\sigma^2) + \Big\langle \frac{\partial f(x)}{\partial x} \Big\rangle \Sigma
+ \Sigma \Big\langle \frac{\partial f(x)}{\partial x} \Big\rangle^{\!\top},
$$
where $\mathrm{diag}(\sigma^2)$ denotes the diagonal matrix whose first $k$ entries are $\hat\sigma^2$ and remaining entries $\tilde\sigma^2$. These equations define the evolution of our approximation, the $Q$ process.

3 Discussion

The final version of the paper will present results for a Gaussian process with the covariance function determined by the approximating process $Q$. A discussion of how the kernel can be obtained from the above computations is included in Appendix D.

References

[1] D. Cornford, L. Csató, D. J. Evans, and M. Opper. Bayesian analysis of the scatterometer wind retrieval inverse problem: some new approaches. Journal of the Royal Statistical Society B, 66:609-626, 2004.

[2] R. E. Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME, Journal of Basic Engineering, 82:34-45, 1960.

[3] E. Kalnay. Atmospheric Modelling, Data Assimilation and Predictability. Cambridge University Press, Cambridge, 2003.

[4] A. C. Lorenc. Analysis methods for numerical weather prediction. Quarterly Journal of the Royal Meteorological Society, 112:1177-1194, 1986.

[5] A. C. Lorenc. The potential of the ensemble Kalman filter for NWP: a comparison with 4D-Var. Quarterly Journal of the Royal Meteorological Society, 129:3183-3203, 2003.

[6] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, Cambridge, Massachusetts, 2006.

A Link to the Kalman Filter

A standard approach for modelling dynamical systems with additive Gaussian measurement noise is the Kalman filter (KF) [2]. When the system is nonlinear, one of its variants, for example the extended Kalman filter (EKF), can be used. KF and EKF are concerned with propagating the first two moments of the filtering distribution $p(x_{i+1} \mid y_{0:i+1})$, which are given by
$$
\bar{x}_{i+1} = \mathrm{E}\{x_{i+1} \mid y_{0:i+1}\}, \tag{1}
$$
$$
S_{i+1} = \mathrm{E}\{(x_{i+1} - \bar{x}_{i+1})(x_{i+1} - \bar{x}_{i+1})^\top \mid y_{0:i+1}\}, \tag{2}
$$
where $y_{0:i+1} \equiv \{y_0, \ldots, y_{i+1}\}$ are the observations up to time $t_{i+1}$. In order to propagate these quantities, they proceed in two steps. In the prediction step, the two moments are estimated given the observations up to time $t_i$:
$$
p(x_{i+1} \mid y_{0:i}) = \int p(x_{i+1} \mid x_i)\, p(x_i \mid y_{0:i})\, dx_i. \tag{3}
$$
This allows computing the predicted state $\mathrm{E}\{x_{i+1} \mid y_{0:i}\}$ and the predicted state covariance $\mathrm{E}\{(x_{i+1} - \mathrm{E}\{x_{i+1} \mid y_{0:i}\})(x_{i+1} - \mathrm{E}\{x_{i+1} \mid y_{0:i}\})^\top \mid y_{0:i}\}$. Next, the correction step consists in updating these estimates based on the new observation $y_{i+1}$:
$$
p(x_{i+1} \mid y_{0:i+1}) \propto p(y_{i+1} \mid x_{i+1})\, p(x_{i+1} \mid y_{0:i}), \tag{4}
$$
which leads to the desired conditional moments (1) and (2). The KF is particularly attractive for on-line learning as it is not required to keep track of the previous conditional expectations. Unfortunately, the integral in (3) is in general intractable when the system is nonlinear or when the state transition probability $p(x_{i+1} \mid x_i)$ is non-Gaussian. Therefore, approximations are required. A common approach is to linearise the system, which corresponds to the EKF; this leads to a Gaussian approximation of the transition probability. Moreover, if the likelihood $p(y_{i+1} \mid x_{i+1})$ is assumed to be Gaussian, then the filtering density is Gaussian at each $t_{i+1}$.

An alternative approach is to summarise the past information by the marginal $Q(x_{i+1})$ and make predictions as follows:
$$
p(x_{i+1} \mid y_{0:i}) = \mathcal{N}(x_{i+1} \mid m_{i+1}, \Sigma_{i+1}), \tag{5}
$$
where
$$
m_{i+1} = m_i + (A_i m_i + b_i)\,\Delta t, \tag{6}
$$
$$
\Sigma_{i+1} = \Sigma_i + \big(A_i \Sigma_i + \Sigma_i A_i^\top + \mathrm{diag}(\sigma^2)\big)\,\Delta t. \tag{7}
$$
This approximation is expected to be better than the EKF, as the parameters $A_i$ and $b_i$ of the linear approximation are adjusted at each iteration. If we further assume that the likelihood $p(y_{i+1} \mid x_{i+1})$ is of the form $\mathcal{N}(y_{i+1} \mid x_{i+1}, R)$, then the correction step (4) is given by
$$
p(x_{i+1} \mid y_{0:i+1}) = \mathcal{N}(x_{i+1} \mid \bar{x}_{i+1}, S_{i+1}), \tag{8}
$$
with
$$
\bar{x}_{i+1} = S_{i+1}\big(\Sigma_{i+1}^{-1} m_{i+1} + R^{-1} y_{i+1}\big), \tag{9}
$$
$$
S_{i+1} = \big(\Sigma_{i+1}^{-1} + R^{-1}\big)^{-1}. \tag{10}
$$
Note that in this approach only the filtering density (and its associated moments) is propagated through time. In contrast, the GP framework allows us to define a distribution over the entire function space (i.e., over time). It is expected that this will have a smoothing effect and will lead to a better tracking of the state transitions.
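The prediction equations (6)-(7) and correction equations (9)-(10) can be combined into a simple filtering loop. Below is a minimal sketch, not part of the original report, for a hypothetical two-dimensional system; the drift, observation schedule, time step and noise levels are placeholder choices, and the Gaussian expectations defining $A_i$ and $b_i$ (cf. equation (11) in Appendix B) are estimated by straightforward Monte Carlo rather than any scheme prescribed in the report.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 2-D drift and its Jacobian (placeholders, not from the report)
def f(x):
    return np.array([x[1], -np.sin(x[0]) - 0.2 * x[1]])

def jac_f(x):
    return np.array([[0.0, 1.0],
                     [-np.cos(x[0]), -0.2]])

def predict(m, S, sigma2, dt, n_mc=500):
    """Prediction step, eqs (6)-(7); Gaussian expectations estimated by Monte Carlo."""
    samples = rng.multivariate_normal(m, S, size=n_mc)
    A = np.mean([jac_f(x) for x in samples], axis=0)       # A ~ <df/dx>, cf. eq. (11)
    b = np.mean([f(x) for x in samples], axis=0) - A @ m    # b = <f(x)> - A m
    m_new = m + (A @ m + b) * dt
    S_new = S + (A @ S + S @ A.T + np.diag(sigma2)) * dt
    return m_new, S_new

def correct(m, S, y, R):
    """Correction step, eqs (9)-(10), for a Gaussian likelihood N(y | x, R)."""
    S_post = np.linalg.inv(np.linalg.inv(S) + np.linalg.inv(R))
    m_post = S_post @ (np.linalg.inv(S) @ m + np.linalg.inv(R) @ y)
    return m_post, S_post

m, S = np.zeros(2), 0.1 * np.eye(2)
sigma2 = np.array([0.05, 0.05])
for step in range(100):
    m, S = predict(m, S, sigma2, dt=0.01)
    if step % 20 == 19:                          # an observation arrives every 20 steps
        y = rng.normal([0.5, 0.0], 0.1)          # synthetic observation of the full state
        m, S = correct(m, S, y, np.diag([0.01, 0.01]))
print(m, S)
```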

B Approximation of KL divergence

If we use the variable $X$ to denote the complete sequence $\{x_i\}_{i=1}^{N}$, we can compute the divergence as
$$
\mathrm{KL}(Q\|P) = \mathrm{E}_{X \sim Q}\left[ \log \frac{Q(X)}{P(X)} \right]
= \sum_{i=1}^{N-1} \int dx_i\, Q(x_i) \int dx_{i+1}\, Q(x_{i+1} \mid x_i)\, \log \frac{Q(x_{i+1} \mid x_i)}{P(x_{i+1} \mid x_i)}
$$
$$
= \sum_{i=1}^{N-1} \int dx_i\, Q(x_i)\, \mathrm{E}_{Q(x_{i+1}\mid x_i)}\!\left[
\frac{\|\Delta\hat{x}_i - \hat{f}(x_i)\Delta t\|^2 - \|\Delta\hat{x}_i - (\hat{A}_i x_i + \hat{b}_i)\Delta t\|^2}{2\hat\sigma^2 \Delta t}
+ \frac{\|\Delta\tilde{x}_i - \tilde{f}(x_i)\Delta t\|^2 - \|\Delta\tilde{x}_i - (\tilde{A}_i x_i + \tilde{b}_i)\Delta t\|^2}{2\tilde\sigma^2 \Delta t}
\right]
$$
$$
= \sum_{i=1}^{N-1} \int dx_i\, Q(x_i) \left[
\frac{\Delta t}{2\hat\sigma^2}\, \big\|(\hat{A}_i x_i + \hat{b}_i) - \hat{f}(x_i)\big\|^2
+ \frac{\Delta t}{2\tilde\sigma^2}\, \big\|(\tilde{A}_i x_i + \tilde{b}_i) - \tilde{f}(x_i)\big\|^2
\right],
$$
where the last equality follows from the identity $\mathrm{E}_{x \sim \mathcal{N}(\mu,\sigma^2)}[(x-a)^2] = \sigma^2 + (\mu - a)^2$, applied to the $k$ components of the first vector and the $n-k$ components of the second. In order to minimise the KL divergence, we take the derivatives with respect to the parameters, set them to zero and solve. We obtain
$$
2\hat{A}_i \langle x_i x_i^\top \rangle + 2\big\langle (\hat{b}_i - \hat{f}(x_i))\, x_i^\top \big\rangle = 0
\quad \text{and} \quad
2\big(\hat{A}_i \langle x_i \rangle + \hat{b}_i - \langle \hat{f}(x_i) \rangle\big) = 0,
$$
and
$$
2\tilde{A}_i \langle x_i x_i^\top \rangle + 2\big\langle (\tilde{b}_i - \tilde{f}(x_i))\, x_i^\top \big\rangle = 0
\quad \text{and} \quad
2\big(\tilde{A}_i \langle x_i \rangle + \tilde{b}_i - \langle \tilde{f}(x_i) \rangle\big) = 0,
$$
where the angle brackets indicate expectations with respect to the marginal distribution $Q(x_i)$, which is Gaussian with mean $m_i$ and covariance $\Sigma_i$ respectively (note that these are over the full variable set). We obtain the equations for the parameters as
$$
A_i = \big\langle (f(x_i) - b_i)\, x_i^\top \big\rangle \big\langle x_i x_i^\top \big\rangle^{-1} = A(m_i, \Sigma_i)
\qquad \text{and} \qquad
b_i = \langle f(x_i) \rangle - A_i \langle x_i \rangle = b(m_i, \Sigma_i),
$$
where the matrix $A_i$ is formed by concatenating $\hat{A}_i$ and $\tilde{A}_i$, and similarly $b_i$. The expressions for $A(m, \Sigma)$ and $b(m, \Sigma)$ can be given as
$$
A(m, \Sigma)(\Sigma + m m^\top) = \langle f(x)\, x^\top \rangle - b\, m^\top,
\qquad
b(m, \Sigma) = \langle f(x) \rangle - A m.
$$
Substituting $b$ from the second equation into the first gives
$$
A(m, \Sigma)(\Sigma + m m^\top) = \langle f(x)\, x^\top \rangle - \langle f(x) \rangle m^\top + A m m^\top,
$$
$$
A(m, \Sigma)\, \Sigma = \big\langle f(x)(x - m)^\top \big\rangle = \Big\langle \frac{\partial f(x)}{\partial x} \Big\rangle \Sigma, \tag{11}
$$
with the last equality following from an integration by parts. Writing expressions for the generation of $x_{i+1}$ we have
$$
\hat{x}_{i+1} = \hat{x}_i + (\hat{A} x_i + \hat{b})\,\Delta t + \hat\sigma\, \hat{u}\, \sqrt{\Delta t},
\qquad
\tilde{x}_{i+1} = \tilde{x}_i + (\tilde{A} x_i + \tilde{b})\,\Delta t + \tilde\sigma\, \tilde{u}\, \sqrt{\Delta t},
$$
where $u = (\hat{u}^\top, \tilde{u}^\top)^\top$ is an $n$-dimensional vector of independent zero mean Gaussian variables with unit variance. We can combine these into the single equation
$$
x_{i+1} = x_i + (A x_i + b)\,\Delta t + \mathrm{diag}(\sigma)\, u\, \sqrt{\Delta t},
$$
where $\mathrm{diag}(\sigma)$ denotes the diagonal matrix whose first $k$ entries are $\hat\sigma$ and remaining entries $\tilde\sigma$. We also know that $x_i$ is generated by a Gaussian with mean $m_i$ and covariance $\Sigma_i$, so that
$$
x_i = m_i + \Sigma_i^{1/2} v,
$$
where $v$ is a vector of zero mean, unit variance Gaussian variables. Hence, we obtain the following expression for $x_{i+1}$:
$$
x_{i+1} = m_i + \Sigma_i^{1/2} v + A m_i\,\Delta t + A\, \Sigma_i^{1/2} v\,\Delta t + b\,\Delta t + \mathrm{diag}(\sigma)\, u\, \sqrt{\Delta t}.
$$
Hence, we can compute
$$
m_{i+1} = \mathrm{E}[x_{i+1}] = m_i + A m_i\,\Delta t + b\,\Delta t.
$$
Taking the difference between the means, dividing by $\Delta t$ and taking limits gives
$$
\frac{dm}{dt} = A(m, \Sigma)\, m + b(m, \Sigma) = \langle f(x) \rangle.
$$
Next consider
$$
\Sigma_{i+1} = \mathrm{E}\big[(x_{i+1} - m_{i+1})(x_{i+1} - m_{i+1})^\top\big]
= \mathrm{E}\big[\Sigma_i^{1/2} v v^\top \Sigma_i^{1/2}\big]
+ \mathrm{E}\big[\Sigma_i^{1/2} v v^\top \Sigma_i^{1/2}\big] A^\top \Delta t
+ \mathrm{E}\big[A\, \Sigma_i^{1/2} v v^\top \Sigma_i^{1/2}\big] \Delta t
+ \mathrm{diag}(\sigma^2)\,\Delta t + O\big((\Delta t)^2\big)
$$
$$
= \Sigma_i + \Sigma_i A^\top \Delta t + A \Sigma_i\,\Delta t + \mathrm{diag}(\sigma^2)\,\Delta t + O\big((\Delta t)^2\big).
$$
Taking the difference between the covariances, dividing by $\Delta t$ and taking limits gives
$$
\frac{d\Sigma}{dt} = \mathrm{diag}(\sigma^2) + A(m, \Sigma)\,\Sigma + \Sigma\, A(m, \Sigma)^\top
= \mathrm{diag}(\sigma^2) + \Big\langle \frac{\partial f(x)}{\partial x} \Big\rangle \Sigma
+ \Sigma \Big\langle \frac{\partial f(x)}{\partial x} \Big\rangle^{\!\top},
$$
where we have made use of equation (11). The results of this computation hold for all values of $\hat\sigma$ and $\tilde\sigma$, so we can consider the case where we let $\tilde\sigma$ tend to zero. This allows us to encode quite general noise models and covariances, as the examples in Appendix C illustrate.
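Equation (11) rests on the Gaussian integration-by-parts (Stein) identity $\langle f(x)(x-m)^\top\rangle = \langle \partial f/\partial x\rangle \Sigma$. The following is a small numerical check of that identity, not part of the original report, using a hypothetical two-dimensional nonlinear $f$ and Monte Carlo estimates of both sides.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical nonlinear drift and its Jacobian (placeholders for illustration)
def f(x):
    return np.array([np.tanh(x[0]) + 0.5 * x[1], np.sin(x[1]) - x[0]])

def jac_f(x):
    return np.array([[1.0 - np.tanh(x[0])**2, 0.5],
                     [-1.0, np.cos(x[1])]])

m = np.array([0.3, -0.2])
Sigma = np.array([[0.4, 0.1],
                  [0.1, 0.3]])

samples = rng.multivariate_normal(m, Sigma, size=200000)

# Left-hand side of eq. (11): <f(x) (x - m)^T>
lhs = np.mean([np.outer(f(x), x - m) for x in samples], axis=0)
# Right-hand side of eq. (11): <df/dx> Sigma
rhs = np.mean([jac_f(x) for x in samples], axis=0) @ Sigma

print(np.round(lhs, 3))
print(np.round(rhs, 3))   # the two matrices should agree up to Monte Carlo error
```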

C Examples of different noise models

General covariance. If we wish to introduce noise with a known fixed covariance $\Sigma_0$, as given for example by spatial relations between the locations of a climate grid model, we can double the number of variables to $n = 2k$ to obtain
$$
\Delta\hat{x}_i = \hat{x}_{i+1} - \hat{x}_i = \hat{z}\,\sqrt{\Delta t},
\qquad
\Delta\tilde{x}_i = \tilde{x}_{i+1} - \tilde{x}_i = f(\tilde{x}_i)\,\Delta t + \Sigma_0\, \Delta\hat{x}_i,
$$
where $\hat{z}$ is a $k$-dimensional vector of zero mean, unit variance Gaussian random variables and $f(\tilde{x})$ is the system being studied.

Ornstein-Uhlenbeck process (coloured noise). Consider the two-variable stochastic differential equation
$$
\Delta y_i = y_{i+1} - y_i = -m\, y_i\,\Delta t + \sigma\, z\,\sqrt{\Delta t},
\qquad
\Delta x_i = x_{i+1} - x_i = \big(f(x_i) + y_i\big)\,\Delta t,
$$
where $z$ is a unit variance, zero mean Gaussian variable and $f(\cdot)$ is some possibly non-linear function. We can approximate this system using the above model with $n = 2$ and $k = 1$, $\hat\sigma = \sigma$ and $\tilde\sigma \to 0$. It gives a general one-dimensional system driven by coloured noise.
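As a quick illustration of the coloured-noise construction above, the sketch below, which is not part of the original report, simulates the augmented two-variable system ($n = 2$, $k = 1$) with an Euler-Maruyama scheme. The drift $f$, the mean-reversion rate and the noise level are hypothetical choices; the point is only that the $x$ component is driven by the Ornstein-Uhlenbeck process $y$ rather than by white noise.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical choices for illustration (not from the report)
f = lambda x: -x**3 + x          # drift of the system of interest
m_ou, sigma_ou = 2.0, 0.5        # mean-reversion rate and noise level of the OU driver
dt, n_steps = 1e-3, 20000

x, y = 0.0, 0.0
xs = np.empty(n_steps)
for i in range(n_steps):
    # OU driver: dy = -m y dt + sigma dW   (Euler-Maruyama step)
    y += -m_ou * y * dt + sigma_ou * np.sqrt(dt) * rng.standard_normal()
    # system driven by coloured noise: dx = (f(x) + y) dt
    x += (f(x) + y) * dt
    xs[i] = x

print(xs[-5:])   # tail of the trajectory of the coloured-noise-driven system
```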

D On the kernel

The problem is the following. Consider
$$
dx = (A x + b)\, dt + \sigma\, dW,
$$
with $W$ a standard Wiener process (assuming uncorrelated noise with equal variance $\sigma^2$). The goal is to find the two-time covariance kernel
$$
K(t_1, t_2) = \mathrm{E}\big[(x(t_1) - m(t_1))(x(t_2) - m(t_2))^\top\big].
$$
Let $U$ be a solution to the homogeneous (non-stochastic) equation
$$
\frac{dU}{dt} = A U
$$
with $U(0) = I$. Then we can solve the inhomogeneous equation as
$$
x(t) = U(t)\, x(0) + \sigma \int_0^t U(t - s)\, dW(s)
$$
(the deterministic contribution of $b$ is omitted here, since it does not affect the covariance). The naive manipulation of the differentials can be justified in the linear case. The first part is non-random and the second part has zero mean. Hence, if we want the covariance, we just have
$$
K(t_1, t_2) = \sigma^2 \int_0^{t_1}\!\! \int_0^{t_2} U(t_1 - s_1)\, \mathrm{E}\big[dW(s_1)\, dW(s_2)^\top\big]\, U(t_2 - s_2)^\top.
$$
We also have $\mathrm{E}\big[dW(s_1)\, dW(s_2)^\top\big] = I\, \delta(s_1 - s_2)\, ds_1\, ds_2$, so we end up with
$$
K(t_1, t_2) = \sigma^2 \int_0^{\min(t_1, t_2)} U(t_1 - s)\, U(t_2 - s)^\top\, ds.
$$
So far we have not used any time-independence properties. The simplification that comes with a constant $A$ is that we can immediately write
$$
U(t) = e^{A t}.
$$
For the time-dependent case, finding explicit solutions is complicated by the fact that the matrices at different times may not commute, i.e. $A(t) A(t') \neq A(t') A(t)$. The time-independent case gives (assuming $t_1 < t_2$)
$$
K(t_1, t_2) = \sigma^2 \left( \int_0^{t_1} e^{A(t_1 - s)}\, e^{A^\top (t_1 - s)}\, ds \right) e^{A^\top (t_2 - t_1)}.
$$
Now, the integral (with $\sigma^2$) is precisely the kernel at equal times (the variance), and we have the final result
$$
K(t_1, t_2) = K(t_1, t_1)\, e^{A^\top (t_2 - t_1)} \quad \text{for } t_1 < t_2,
$$
and similarly
$$
K(t_1, t_2) = e^{A (t_1 - t_2)}\, K(t_2, t_2) \quad \text{for } t_2 < t_1.
$$
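To make the constant-$A$ case concrete, the following sketch, which is not part of the original note, evaluates the covariance integral by simple trapezoidal quadrature for a hypothetical stable $2\times 2$ matrix $A$ and a placeholder $\sigma^2$, and checks the factorised relation $K(t_1, t_2) = K(t_1, t_1)\, e^{A^\top (t_2 - t_1)}$ against a direct evaluation of the two-time integral.

```python
import numpy as np
from scipy.linalg import expm

# Hypothetical stable drift matrix and noise level (placeholders for illustration)
A = np.array([[-1.0, 0.5],
              [-0.3, -0.8]])
sigma2 = 0.4

def cov_integral(ta, tb, n=2000):
    """sigma^2 * int_0^{min(ta,tb)} expm(A (ta-s)) expm(A^T (tb-s)) ds  (trapezoidal rule)."""
    s = np.linspace(0.0, min(ta, tb), n)
    vals = np.array([expm(A * (ta - si)) @ expm(A.T * (tb - si)) for si in s])
    return sigma2 * np.trapz(vals, s, axis=0)

t1, t2 = 1.0, 1.5                                   # assume t1 < t2
K11 = cov_integral(t1, t1)                          # equal-time covariance K(t1, t1)
K12_direct = cov_integral(t1, t2)                   # direct evaluation of K(t1, t2)
K12_factored = K11 @ expm(A.T * (t2 - t1))          # K(t1, t2) = K(t1, t1) exp(A^T (t2 - t1))

print(np.round(K12_direct, 5))
print(np.round(K12_factored, 5))                    # the two matrices should agree
```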