Gaussian processes for inference in stochastic differential equations

Manfred Opper, AI group, TU Berlin

November 6, 2017
Gaussian Process models in machine learning

Gaussian processes provide prior distributions over latent functions $f(\cdot)$ to be learnt from data.

[Figure: sample paths drawn from a GP prior.]

$$f(\cdot) \sim \mathcal{GP}(m, K) \quad \text{with} \quad m(x) = \mathbb{E}[f(x)], \qquad K(x, x') = \mathrm{Cov}[f(x), f(x')].$$

Well-known examples:
1. Regression: $y_i = f(x_i) + \nu_i$
2. Classification ($y_i \in \{0, 1\}$): $P[y_i = 1 \mid f(x_i)] = \sigma(f(x_i))$
3. ...
Inference

We would like to predict $f(x)$ given observations $y = (y_1, \dots, y_n)$ at $x_1, \dots, x_n$. The GP prior is infinite dimensional!

For these simple models, inference reduces to an $n$-dimensional problem. Let $f_i = f(x_i)$; then
$$p(f(x) \mid y) = \int p(f(x) \mid f_1, \dots, f_n)\, p(f_1, \dots, f_n \mid y)\, df_1 \cdots df_n.$$
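For Gaussian observation noise this marginalization is available in closed form. Below is a minimal NumPy sketch of the resulting posterior mean and variance; the RBF kernel and the noise level sigma_n are illustrative choices, not taken from the slides.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance K(x, x')."""
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X, y, Xstar, sigma_n=0.1):
    """Posterior mean/variance of f(x*) for the model y_i = f(x_i) + noise."""
    K = rbf_kernel(X, X) + sigma_n**2 * np.eye(len(X))
    Ks = rbf_kernel(Xstar, X)
    mean = Ks @ np.linalg.solve(K, y)
    cov = rbf_kernel(Xstar, Xstar) - Ks @ np.linalg.solve(K, Ks.T)
    return mean, np.diag(cov)

# toy usage
X = np.linspace(0, 10, 20)
y = np.sin(X) + 0.1 * np.random.randn(20)
mu, var = gp_posterior(X, y, np.linspace(0, 10, 100))
```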
A more complicated likelihood

Poisson process: the likelihood of a set of $n$ points $D = (x_1, x_2, \dots, x_n) \subset \mathcal{T}$ is given by
$$L(D \mid \lambda) = \exp\left\{-\int_{\mathcal{T}} \lambda(x)\, dx\right\} \prod_{i=1}^{n} \lambda(x_i).$$

$\lambda(x)$ is the unknown intensity or rate of the process.

Gaussian Process Modulated Poisson Processes model (Lloyd et al., 2015): $\lambda(x) = f^2(x)$, where $f$ is a Gaussian process.
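As an aside, this likelihood is straightforward to evaluate numerically once $\lambda$ is fixed; here is a small sketch that approximates the integral by the trapezoidal rule on $[0, T]$ (the intensity function, events, and grid size are assumptions made for the example):

```python
import numpy as np

def poisson_process_loglik(events, intensity, T, n_grid=1000):
    """log L(D | lambda) = -int_0^T lambda(x) dx + sum_i log lambda(x_i),
    with the integral approximated by the trapezoidal rule."""
    grid = np.linspace(0.0, T, n_grid)
    integral = np.trapz(intensity(grid), grid)
    return -integral + np.sum(np.log(intensity(np.asarray(events))))

# squared-GP-style intensity: lambda(x) = f(x)^2 for a latent function f
f = lambda x: 1.0 + 0.5 * np.sin(x)
events = [0.3, 1.7, 2.2, 4.9]
print(poisson_process_loglik(events, lambda x: f(x) ** 2, T=2 * np.pi))
```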
Outline

- Stochastic differential equations
- Drift estimation for dense observations
- Drift estimation for sparse observations
- Sparse GP approximation
- Drift estimation from empirical distribution
- Outlook
Stochastic differential equations

$$\frac{dX}{dt} = f(X) + \text{white noise}, \qquad \text{e.g. } f(x) = -\frac{dV(x)}{dx}.$$
Prior process: Stochastic differential equations (SDE)

Mathematicians prefer the Itô version for $X_t \in \mathbb{R}^d$:
$$dX_t = \underbrace{f(X_t)}_{\text{Drift}}\, dt + \underbrace{D^{1/2}(X_t)}_{\text{Diffusion}}\, \underbrace{dW_t}_{\text{Wiener process}}$$

Limit of the discrete time process ($\epsilon_t$ i.i.d. Gaussian):
$$X_{t+\Delta t} - X_t = f(X_t)\, \Delta t + D^{1/2}(X_t)\, \sqrt{\Delta t}\; \epsilon_t.$$
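This discretization is exactly the Euler–Maruyama scheme, which also gives a direct way to simulate paths. A minimal sketch for a scalar process; the double-well drift and the parameter values are illustrative choices (picked to match the later example):

```python
import numpy as np

def euler_maruyama(f, D, x0, dt, n_steps, seed=0):
    """Simulate X_{t+dt} = X_t + f(X_t) dt + sqrt(D dt) * eps_t, eps_t ~ N(0,1)."""
    rng = np.random.default_rng(seed)
    x = np.empty(n_steps + 1)
    x[0] = x0
    for k in range(n_steps):
        x[k + 1] = x[k] + f(x[k]) * dt + np.sqrt(D * dt) * rng.standard_normal()
    return x

# double-well drift f(x) = 4x(1 - x^2), constant diffusion D = 1
path = euler_maruyama(lambda x: 4 * x * (1 - x**2), D=1.0, x0=0.0,
                      dt=0.002, n_steps=6000)
```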
[Figure: path X(t) with observations.]
Nonparametric drift estimation

Infer the drift function $f(\cdot)$ under smoothness assumptions from observations of the process $X$.

Idea (see e.g. Papaspiliopoulos, Pokern, Roberts & Stuart, 2012): assume a Gaussian process prior $f(\cdot) \sim \mathcal{GP}(0, K)$ with covariance kernel $K(x, x')$.
Simple for observations dense in time

Euler discretization of the SDE, for $\Delta t \to 0$:
$$X_{t+\Delta t} - X_t = f(X_t)\, \Delta t + \sqrt{\Delta t}\; \epsilon_t.$$

The likelihood (assuming a densely observed path $X_{0:T}$) is Gaussian:
$$p(X_{0:T} \mid f) \propto \exp\left[-\frac{1}{2}\sum_t \frac{\|X_{t+\Delta t} - X_t\|^2}{\Delta t}\right] \exp\left[-\frac{1}{2}\sum_t \|f(X_t)\|^2\, \Delta t + \sum_t f(X_t)^\top (X_{t+\Delta t} - X_t)\right].$$

The posterior process is also a GP!

This solves the regression problem
$$f(x) \approx \mathbb{E}\left[\frac{X_{t+\Delta t} - X_t}{\Delta t}\;\Big|\; X_t = x\right].$$
Works well for $\Delta t \to 0$.
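In other words, for dense data the drift can be estimated by ordinary GP regression of the scaled increments on the states. A sketch reusing the gp_posterior helper and the simulated path from above; the subsampling and the noise level (the increments have conditional variance $D/\Delta t$) are my own illustrative choices:

```python
import numpy as np

dt = 0.002
inputs = path[:-1][::10]              # subsample to keep the O(n^3) GP solve cheap
targets = (np.diff(path) / dt)[::10]  # regression targets y_t = (X_{t+dt} - X_t) / dt

xs = np.linspace(inputs.min(), inputs.max(), 200)
drift_mean, drift_var = gp_posterior(inputs, targets, xs,
                                     sigma_n=np.sqrt(1.0 / dt))  # noise sd = sqrt(D/dt), D = 1
```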
Example: Double well model

$n = 6000$ data points with $\Delta t = 0.002$, GP with polynomial kernel of order 4.

[Figure: observed path X(t) (left) and estimated drift f(x) (right).]
Estimation of diffusion

Euler discretized SDE:
$$X_{t+\Delta t} - X_t = f(X_t)\, \Delta t + \sqrt{\Delta t}\; \epsilon_t$$

Diffusion:
$$D(x) = \lim_{\Delta t \to 0} \frac{1}{\Delta t}\, \mathrm{Var}\left(X_{t+\Delta t} - X_t \mid X_t = x\right) = \lim_{\Delta t \to 0} \frac{1}{\Delta t}\, \mathbb{E}\left[(X_{t+\Delta t} - X_t)^2 \mid X_t = x\right].$$

Independent of the drift! Estimate $D(x)$ with GPs by regression with data $y_t = (X_{t+\Delta t} - X_t)^2 / \Delta t$.

[Figure: estimated diffusion D(x).]
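The same regression machinery applies to the squared increments; a short sketch continuing the example above (the noise level is again an illustrative choice):

```python
# regression targets for the diffusion: y_t = (X_{t+dt} - X_t)^2 / dt
sq_targets = (np.diff(path) ** 2 / dt)[::10]
diff_mean, diff_var = gp_posterior(inputs, sq_targets, xs, sigma_n=2.0)
```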
Ice core data

[Figure: δ¹⁸O record over time t [ky].]
GP inference using RBF kernels

[Figure: estimated potential U(x) (left) and diffusion D(x) (right) for the ice core data.]
For larger t... X(t) 1.4 1.2 1.0 0.8 f(x) 6 2 0 2 4 6 199.5 200.0 200.5 201.0 201.5 t 1.5 0.5 0.0 0.5 1.0 1.5 x Problem: We can compute p(x 0:T f ) but NOT P(y f )! anfred Opper, AI group, TU Berlin (TU Berlin) inference in SDE November 6, 2017 15 / 34
EM Algorithm

Treat the unobserved path $X_t$, for times $t$ between observations, as latent random variables (Batz, Ruttor & Opper, NIPS 2013).

EM algorithm:
1. E-step: compute the expected complete data log-likelihood
$$\mathcal{L}(f, f_{\text{old}}) = \mathbb{E}_{p_{\text{old}}}\left[\ln L(X_{0:T} \mid f)\right] \tag{1}$$
where $p_{\text{old}}$ is the posterior $p(X_{0:T} \mid y)$ computed with the previous estimate $f_{\text{old}}$ of the drift.
2. M-step: recompute the drift function as
$$f_{\text{new}} = \arg\max_f \left(\mathcal{L}(f, f_{\text{old}}) + \ln P_0(f)\right) \tag{2}$$
Likelihood for a complete path

$$p(X_{0:T} \mid f) \propto \exp\left[-\frac{1}{2}\sum_t \frac{\|X_{t+\Delta t} - X_t\|^2}{\Delta t}\right] \underbrace{\exp\left[-\frac{1}{2}\sum_t \|f(X_t)\|^2\, \Delta t + \sum_t f(X_t)^\top (X_{t+\Delta t} - X_t)\right]}_{L(X_{0:T} \mid f)}$$
The complete likelihood

$$-\mathbb{E}_p\left[\ln L(X_{0:T} \mid f)\right] \;\xrightarrow{\Delta t \to 0}\; \frac{1}{2}\int_0^T \left\{\mathbb{E}_p\left[\|f(X_t)\|^2\right] - 2\,\mathbb{E}_p\left[\langle f(X_t),\, g_t(X_t)\rangle\right]\right\} dt = \frac{1}{2}\int \|f(x)\|^2 A(x)\, dx - \int \langle f(x),\, b(x)\rangle\, dx, \tag{3}$$

where
$$g_t(x) = \lim_{\Delta t \to 0} \frac{1}{\Delta t}\, \mathbb{E}_p\left[X_{t+\Delta t} - X_t \mid X_t = x\right]$$
and, with $q_t$ the posterior marginal density of $X_t$,
$$A(x) = \int_0^T q_t(x)\, dt, \qquad b(x) = \int_0^T g_t(x)\, q_t(x)\, dt.$$
Problems

The E-step requires the posterior marginal densities $q_t$ for diffusion processes with arbitrary prior drift functions $f(x)$:
$$p(X_{0:T} \mid y, f) \propto p(X_{0:T} \mid f) \prod_{k=1}^{n} \delta(y_k - X_{k\tau}). \tag{4}$$

The GP has to deal with an infinite amount of densely imputed data.
Approximate solutions

Linearize the drift between consecutive observations (Ornstein–Uhlenbeck bridge). Hence, for $t$ between two observations we consider the approximate process
$$dX_t = \left[f(y_k) - \Gamma_k (X_t - y_k)\right] dt + D_k^{1/2}\, dW \tag{5}$$
with $\Gamma_k = -\nabla f(y_k)$ and $D_k = D(y_k)$. For this process, the transition density is a multivariate Gaussian!

Work with a sparse GP approximation.
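For a scalar version of (5) the Gaussian transition density is available in closed form, which is what makes the imputation tractable. A minimal sketch (all parameter values here are illustrative):

```python
import numpy as np

def ou_transition(x0, yk, f_yk, gamma, D, t):
    """Mean and variance of X_t for dX = [f(y_k) - gamma (X - y_k)] dt + sqrt(D) dW.
    The linearized process relaxes towards y_k + f(y_k) / gamma at rate gamma."""
    x_inf = yk + f_yk / gamma
    mean = x_inf + (x0 - x_inf) * np.exp(-gamma * t)
    var = D / (2 * gamma) * (1 - np.exp(-2 * gamma * t))
    return mean, var

mean, var = ou_transition(x0=0.8, yk=1.0, f_yk=-0.5, gamma=2.0, D=1.0, t=0.3)
```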
Variational sparse GP approximation

L. Csató (2002), M. Titsias (2009)

Assume a measure over functions $f$ of the form
$$dP(f) = \frac{1}{Z}\, dP_0(f)\, e^{-U(f)}.$$

Approximate it by
$$dQ(f) = \frac{1}{Z_s}\, dP_0(f)\, e^{-U_s(f_s)}.$$
$U_s$ depends only on the sparse set $f_s = \{f(x)\}_{x \in S}$ of dimension $m$.

Minimize the KL divergence
$$D(Q \,\|\, P) = \int dQ(f) \ln \frac{dQ(f)}{dP(f)}.$$

Integrating over $dQ(f \mid f_s) = dP_0(f \mid f_s)$ yields the optimal
$$U_s(f_s) = \mathbb{E}_0\left[U(f) \mid f_s\right].$$

This can be computed analytically for a Gaussian prior $P_0$ and quadratic log-likelihood $U$.
Some details

We can show that
$$\mathbb{E}_0\left[f(x) \mid f_s\right] = k_s(x)^\top K_s^{-1} f_s. \tag{6}$$

Hence, if $U(f) = \frac{1}{2}\int f^2(x)\, A(x)\, dx - \int f(x)\, b(x)\, dx$, we get
$$U_s(f_s) = \mathbb{E}_0\left[U(f) \mid f_s\right] = \frac{1}{2}\, f_s^\top K_s^{-1} \left\{\int k_s(x)\, A(x)\, k_s(x)^\top dx\right\} K_s^{-1} f_s \;-\; f_s^\top K_s^{-1} \int k_s(x)\, b(x)\, dx.$$
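A small NumPy sketch of this computation, approximating the two integrals by Riemann sums on a grid and reusing rbf_kernel from the regression sketch above; the inducing points and the functions A and b are illustrative assumptions:

```python
import numpy as np

def sparse_potential(S, kernel, A_fn, b_fn, grid):
    """Return (Ks, Lam, nu) with U_s(f_s) = 1/2 f_s' Lam f_s - nu' f_s;
    the x-integrals are approximated by Riemann sums on `grid`."""
    dx = grid[1] - grid[0]
    Ks = kernel(S, S) + 1e-8 * np.eye(len(S))  # jitter for numerical stability
    ks = kernel(grid, S)                       # cross-covariances, grid x m
    W = np.linalg.solve(Ks, ks.T)              # K_s^{-1} k_s(x), one column per grid point
    Lam = (W * (A_fn(grid) * dx)) @ W.T        # K_s^{-1} [int k_s A k_s'] K_s^{-1}
    nu = W @ (b_fn(grid) * dx)                 # K_s^{-1} int k_s b dx
    return Ks, Lam, nu

S = np.linspace(-1.5, 1.5, 10)                 # inducing inputs
Ks, Lam, nu = sparse_potential(S, rbf_kernel,
                               A_fn=lambda x: np.exp(-x**2),
                               b_fn=lambda x: 4 * x * (1 - x**2) * np.exp(-x**2),
                               grid=np.linspace(-2.0, 2.0, 400))

# Gaussian posterior over f_s: N(0, K_s) x exp(-U_s) has mean (K_s^{-1} + Lam)^{-1} nu
f_s_mean = np.linalg.solve(np.linalg.inv(Ks) + Lam, nu)
```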
GP estimation after one iteration of EM

[Figure: drift estimates f(x), two panels.]
[Figure: (color online) comparison of the MSE of the Direct, EM, and Gibbs methods over different time intervals $\Delta t$.]
Example: A simple pendulum

$$dX = V\, dt, \qquad dV = -\frac{\gamma V + mgl \sin(X)}{ml^2}\, dt + d^{1/2}\, dW_t.$$

$N = 4000$ data points $(x, v)$ with $\Delta t_{\text{obs}} = 0.3$ and known diffusion constant $d = 1$. GP with periodic kernel.

[Figure: estimated drift f(x, v = 0).]
Drift estimation for SDE using empirical distribution

For $X \in \mathbb{R}^d$ consider the SDE
$$dX_t = f(X_t)\, dt + \sigma(X_t)\, dW_t.$$

Try to estimate the drift function $f(\cdot)$ given only the empirical distribution of (noise free) observations
$$\hat{p}(x) = \frac{1}{n}\sum_{i=1}^{n} \delta(x - x_i).$$

This is possible if the drift has the form
$$f(x) = g(x) + A(x)\, \nabla\psi(x)$$
where $A$ and $g$ are known functions.
Generalised score functional

Define
$$\varepsilon[\psi] = \int \left\{\frac{1}{2}\|\nabla\psi(x)\|_A^2 + \mathcal{L}_g \psi(x)\right\} p(x)\, dx$$
with
$$\mathcal{L}_g \psi(x) = g(x) \cdot \nabla\psi(x) + \frac{1}{2}\sum_{ij} D_{ij}(x)\, \frac{\partial^2 \psi(x)}{\partial x^{(i)} \partial x^{(j)}},$$
where $\|g(x)\|_A^2 \doteq g(x)^\top A(x)\, g(x)$ and $D(x) \doteq \sigma(x)\sigma(x)^\top$.

Setting $\delta\varepsilon[\psi]/\delta\psi = 0$ yields the stationary Fokker–Planck equation
$$\mathcal{L}_g^\dagger p(x) - \nabla \cdot \left(A(x)\, \nabla\psi(x)\, p(x)\right) = 0,$$
where $\mathcal{L}_g^\dagger$ is the adjoint of $\mathcal{L}_g$:
$$\mathcal{L}_g^\dagger p(x) = -\nabla \cdot \left(g(x)\, p(x)\right) + \frac{1}{2}\sum_{ij} \frac{\partial^2}{\partial x^{(i)} \partial x^{(j)}}\left(D_{ij}(x)\, p(x)\right).$$

For $g = 0$ and constant $A$ this recovers score matching (Hyvärinen, 2005; Sriperumbudur et al., 2014) for density estimation; see the short check below.
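To make the connection explicit, take $g = 0$, $A = I$ and $D = 2I$ (this particular scaling is my choice for the illustration). Then
$$\varepsilon[\psi] = \int \left\{\frac{1}{2}\|\nabla\psi(x)\|^2 + \Delta\psi(x)\right\} p(x)\, dx,$$
and integration by parts gives
$$\int \Delta\psi\; p\, dx = -\int \nabla\psi \cdot \nabla p\, dx = -\int \nabla\psi \cdot \nabla \ln p\;\, p\, dx,$$
so that
$$\varepsilon[\psi] = \frac{1}{2}\int \left\|\nabla\psi(x) - \nabla \ln p(x)\right\|^2 p(x)\, dx + \text{const},$$
which is exactly Hyvärinen's score matching objective for the unnormalized density model $q(x) \propto e^{\psi(x)}$.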
Consider the pseudo log-likelihood
$$-\sum_{i=1}^{n}\left\{\frac{1}{2}\|\nabla\psi(x_i)\|_A^2 + \mathcal{L}_g \psi(x_i)\right\},$$
where $x_i \doteq X(t_i)$, $i = 1, \dots, n$, is a sample from the stationary density $p(\cdot)$.

Combining this with a GP prior over $\psi$ yields an estimator for the drift of the SDE, provided the drift is of the form $f(x) = g(x) + A(x)\nabla\psi(x)$ where $A$ and $g$ are given (Batz, Ruttor & Opper, 2016).
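A minimal 1D sketch of this estimator in the simplest setting $g = 0$, $A = 1$, constant $D$, with a polynomial basis for $\psi$ in place of the GP prior, so that the maximization becomes a small linear solve (all of these simplifications are my own choices for illustration):

```python
import numpy as np

def drift_from_stationary_samples(x, D=1.0, degree=5):
    """Maximize -sum_i [1/2 psi'(x_i)^2 + 1/2 D psi''(x_i)] over
    psi(x) = sum_k c_k x^k; the drift estimate is f = psi'."""
    ks = np.arange(1, degree + 1)
    dphi = ks * x[:, None] ** (ks - 1)                            # psi' basis
    ddphi = ks * (ks - 1) * x[:, None] ** np.maximum(ks - 2, 0)   # psi'' basis
    M = dphi.T @ dphi / len(x)             # quadratic term of the objective
    v = 0.5 * D * ddphi.mean(axis=0)       # linear term
    c = -np.linalg.solve(M, v)             # closed-form optimum
    return lambda xq: (ks * np.asarray(xq)[:, None] ** (ks - 1)) @ c

# the double-well path simulated earlier stands in for draws
# from the stationary density
f_hat = drift_from_stationary_samples(path[::5])
print(f_hat(np.array([-1.0, 0.0, 1.0])))   # should be close to 4x(1 - x^2)
```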
Langevin dynamics

Consider the 2nd order SDE
$$dX_t = V_t\, dt, \qquad dV_t = \left(F(X_t) - \lambda V_t\right) dt + \sigma(X_t, V_t)\, dW_t,$$
and set $A_{xx} = A_{xv} = 0$, $A_{vv} = I$, $f_x = v$, $g_v = -\lambda v$ (known).

The condition
$$f_v(x, v) = -\lambda v + \nabla_v \psi(x, v) = -\lambda v + F(x)$$
is fulfilled with $\psi(x, v) = v^\top F(x)$ and allows for an arbitrary $F(x)$. Use a GP prior over $F$.
Example 1: Periodic model

$F(x) = a \sin x$, $D_v = (\sigma \cos(x))^2$, with $n = 2000$ observations and time lag $\tau = 0.25$.

[Figure: estimated force f(x).]
Example 2: Non-conservative drift

Model:
$$F^{(1)}(x) = x^{(1)}\left(1 - (x^{(1)})^2 - (x^{(2)})^2\right) - x^{(2)}$$
$$F^{(2)}(x) = x^{(2)}\left(1 - (x^{(1)})^2 - (x^{(2)})^2\right) + x^{(1)}$$
with friction $\lambda$ known but $D$ unknown; $n = 2000$ observations.

[Figure: estimated two-dimensional drift field.]
The case A = D: Likelihood for a densely observed path

$$\ln L(X_{0:T} \mid f) = -\frac{1}{2}\int \left\{\|f(X_t)\|_{D^{-1}}^2\, dt - 2\,\langle f(X_t), dX_t\rangle\right\} + \text{const}$$
with $\langle u, v \rangle \doteq u^\top D^{-1} v$.

Assume $f = g + D\nabla\psi$ and apply the Itô formula:
$$= -\frac{1}{2}\int_0^T \left\{\nabla\psi^\top D\, \nabla\psi\, dt + 2\, g^\top \nabla\psi\, dt - 2\, \nabla\psi^\top dX_t\right\} + \text{const}$$
$$= -\int_0^T \left\{\frac{1}{2}\|\nabla\psi(X_t)\|_D^2 + \mathcal{L}_g \psi(X_t)\right\} dt + \psi(X_T) - \psi(X_0) + \text{const},$$
i.e. up to boundary terms, the dense-path log-likelihood is the time integral of the generalised score integrand along the path.
Outlook

- Beyond EM: full variational Bayesian approximation
- Estimation of the diffusion from sparse data
- Quality of the sparse GP approximation?
- Langevin dynamics with unobserved velocities?
Many thanks to my collaborators: Andreas Ruttor and Philipp Batz. Funded by an EU STREP project.