Gaussian processes for inference in stochastic differential equations
Slide 1: Gaussian processes for inference in stochastic differential equations

Manfred Opper, AI group, TU Berlin
November 6, 2017
Slides 2-5: Gaussian Process models in machine learning

Gaussian processes provide prior distributions over latent functions f(·) to be learnt from data.

[Figure: sample functions drawn from a GP prior]

f(·) ~ GP(m, K) with m(x) = E[f(x)] and K(x, x') = Cov[f(x), f(x')].

Well-known examples:
1. Regression: y_i = f(x_i) + ν_i
2. Classification (y_i ∈ {0, 1}): P[y_i = 1 | f(x_i)] = σ(f(x_i))
3. ...
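The regression example above can be sketched in a few lines of numpy. The RBF kernel, lengthscale, and noise level below are illustrative choices, not values from the talk:

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    """Squared-exponential covariance K(x, x') = exp(-(x - x')^2 / (2 l^2))."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale ** 2)

def gp_regression(X, y, Xstar, noise_std=0.1):
    """Posterior mean and variance of f(x*) for the model y_i = f(x_i) + nu_i."""
    K = rbf_kernel(X, X) + noise_std ** 2 * np.eye(len(X))
    Ks = rbf_kernel(X, Xstar)
    alpha = np.linalg.solve(K, y)                       # K^{-1} y
    mean = Ks.T @ alpha
    # diagonal of K** - Ks^T K^{-1} Ks (prior variance is 1 for this kernel)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    return mean, var

rng = np.random.default_rng(0)
X = np.linspace(0.0, 5.0, 20)
y = np.sin(X) + 0.1 * rng.standard_normal(20)           # noisy observations
mean, var = gp_regression(X, y, np.array([2.5]))        # predict f at x* = 2.5
```

Inside the data range the posterior mean tracks the underlying sine function and the posterior variance stays small, which is the finite-dimensional reduction discussed on the next slides.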
Slides 6-7: Inference

We would like to predict f(x) given observations y := (y_1, ..., y_n) at x_1, ..., x_n. The GP prior is infinite dimensional!

For these simple models, inference reduces to an n-dimensional problem. With f_i := f(x_i),

p(f(x) | y) = ∫ p(f(x) | f_1, ..., f_n) p(f_1, ..., f_n | y) df_1 ... df_n.
Slides 8-9: A more complicated likelihood

Poisson process: the likelihood for a set of n points D = (x_1, x_2, ..., x_n) ⊂ T is

L(D | λ) = exp( −∫_T λ(x) dx ) ∏_{i=1}^n λ(x_i).

λ(x) is the unknown intensity or rate of the process.

Gaussian process modulated Poisson process model (Lloyd et al., 2015): λ(x) = f²(x), where f is a Gaussian process.
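Once λ is represented on a grid, the log of the likelihood above is easy to evaluate numerically. A minimal sketch; the grid, the stand-in for the GP sample f, and the event locations are illustrative:

```python
import numpy as np

def poisson_process_loglik(events, grid, lam_grid):
    """log L(D | lambda) = -integral of lambda over T + sum_i log lambda(x_i),
    with the integral approximated by the trapezoidal rule on a grid."""
    integral = np.sum(0.5 * (lam_grid[1:] + lam_grid[:-1]) * np.diff(grid))
    lam_at_events = np.interp(events, grid, lam_grid)
    return -integral + np.sum(np.log(lam_at_events))

grid = np.linspace(0.0, 10.0, 200)
f = 1.0 + 0.5 * np.sin(grid)        # a stand-in for one GP sample of f
lam = f ** 2                        # the link lambda(x) = f(x)^2 from the slide
events = np.array([1.0, 2.5, 7.0])
ll = poisson_process_loglik(events, grid, lam)
```

As a sanity check, a constant unit rate on T = [0, 10] gives log-likelihood −10 for any three events, since the integral term is 10 and each log λ(x_i) vanishes.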
Slide 10: Outline

- Stochastic differential equations
- Drift estimation for dense observations
- Drift estimation for sparse observations
- Sparse GP approximation
- Drift estimation from the empirical distribution
- Outlook
Slide 11: Stochastic differential equations

dX/dt = f(X) + white noise,

e.g. with a gradient drift f(x) = −dV(x)/dx for a potential V.
Slide 12: Prior process: stochastic differential equations (SDE)

Mathematicians prefer the Itô version for X_t ∈ R^d:

dX_t = f(X_t) dt + D^{1/2}(X_t) dW_t,

with drift f, diffusion D^{1/2}, and Wiener process W_t. It is the limit of the discrete-time process

X_{t+Δt} − X_t = f(X_t) Δt + D^{1/2}(X_t) √Δt ε_t,   ε_t i.i.d. standard Gaussian.
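The discrete-time limit above is exactly the Euler-Maruyama simulation scheme. A minimal scalar sketch; the Ornstein-Uhlenbeck drift f(x) = −x, the constant diffusion, and all step sizes are illustrative choices, not from the talk:

```python
import numpy as np

def euler_maruyama(f, D, x0, dt, n_steps, rng):
    """Simulate X_{t+dt} = X_t + f(X_t) dt + sqrt(D(X_t) dt) eps, eps ~ N(0,1)."""
    x = np.empty(n_steps + 1)
    x[0] = x0
    for k in range(n_steps):
        x[k + 1] = x[k] + f(x[k]) * dt \
                   + np.sqrt(D(x[k]) * dt) * rng.standard_normal()
    return x

rng = np.random.default_rng(0)
# Ornstein-Uhlenbeck: f(x) = -x, D = 0.5; stationary variance is D/2 = 0.25
path = euler_maruyama(lambda x: -x, lambda x: 0.5,
                      x0=2.0, dt=0.01, n_steps=5000, rng=rng)
```

After the initial transient decays, the simulated path fluctuates around zero with variance close to the stationary value D/2.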
Slide 13: [Figure: sample path X(t) with observation points marked]
Slides 14-15: Nonparametric drift estimation

Infer the drift function f(·) under smoothness assumptions from observations of the process X.

Idea (see e.g. Papaspiliopoulos, Pokern, Roberts & Stuart, 2012): assume a Gaussian process prior f(·) ~ GP(0, K) with covariance kernel K(x, x').
Slides 16-18: Simple for observations dense in time

Euler discretization of the SDE (with unit diffusion):

X_{t+Δt} − X_t = f(X_t) Δt + √Δt ε_t,   for Δt → 0.

The likelihood (assume a densely observed path X_{0:T}) is Gaussian:

p(X_{0:T} | f) ∝ exp[ −(1/(2Δt)) Σ_t ||X_{t+Δt} − X_t||² ]
            × exp[ −(1/2) Σ_t ||f(X_t)||² Δt + Σ_t f(X_t)·(X_{t+Δt} − X_t) ].

The posterior process is also a GP! It solves the regression problem

f(x) ≈ E[ (X_{t+Δt} − X_t) / Δt | X_t = x ].

Works well for Δt → 0.
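The regression view above can be turned directly into code: treat the rescaled increments (X_{t+Δt} − X_t)/Δt as noisy observations of f(X_t) with noise variance D/Δt and run GP regression. A sketch on a simulated Ornstein-Uhlenbeck path with true drift f(x) = −x; kernel, lengthscale, and simulation settings are illustrative:

```python
import numpy as np

def gp_drift_estimate(path, dt, xstar, lengthscale=1.0, D=1.0):
    """GP regression on increments: targets y_t = (X_{t+dt} - X_t)/dt have
    conditional mean f(X_t) and noise variance D/dt."""
    X = path[:-1]
    y = np.diff(path) / dt
    K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / lengthscale ** 2)
    Ks = np.exp(-0.5 * (X[:, None] - xstar[None, :]) ** 2 / lengthscale ** 2)
    alpha = np.linalg.solve(K + (D / dt) * np.eye(len(X)), y)
    return Ks.T @ alpha

# simulate an Ornstein-Uhlenbeck path with true drift f(x) = -x and D = 1
rng = np.random.default_rng(1)
dt, n = 0.02, 2500
x = np.zeros(n + 1)
for k in range(n):
    x[k + 1] = x[k] - x[k] * dt + np.sqrt(dt) * rng.standard_normal()

fhat = gp_drift_estimate(x, dt, np.array([-1.0, 0.0, 1.0]))
```

Despite the very noisy targets (noise variance D/Δt = 50 per point), averaging over thousands of increments recovers the negative slope of the true drift.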
Slide 19: Example: double-well model

n = 6000 data points with Δt = 0.002; GP with a polynomial kernel of order 4.

[Figure: sample path X(t) and estimated drift f(x)]
Slide 20: Estimation of the diffusion

For the Euler-discretized SDE X_{t+Δt} − X_t = f(X_t) Δt + D^{1/2}(X_t) √Δt ε_t, the diffusion is

D(x) = lim_{Δt→0} (1/Δt) Var(X_{t+Δt} − X_t | X_t = x)
     = lim_{Δt→0} (1/Δt) E[ (X_{t+Δt} − X_t)² | X_t = x ].

Independent of the drift! Estimate D(x) with GPs by regression with data y_t = (X_{t+Δt} − X_t)² / Δt.
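For a constant diffusion the regression above collapses to a simple average of squared increments, which illustrates why the estimator is drift-independent: the drift contributes only at order Δt². A sketch with illustrative parameter values:

```python
import numpy as np

def estimate_diffusion(path, dt):
    """Constant-diffusion case: D ~ E[(X_{t+dt} - X_t)^2] / dt.
    The drift enters this quantity only at order dt, so it drops out as dt -> 0."""
    incr = np.diff(path)
    return np.mean(incr ** 2) / dt

# Ornstein-Uhlenbeck path with drift f(x) = -x and true diffusion D = 0.8
rng = np.random.default_rng(2)
dt, n, D_true = 0.005, 20000, 0.8
x = np.zeros(n + 1)
for k in range(n):
    x[k + 1] = x[k] - x[k] * dt + np.sqrt(D_true * dt) * rng.standard_normal()

D_hat = estimate_diffusion(x, dt)
```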
Slide 21: Ice core data

[Figure: δ¹⁸O record as a function of time t in kiloyears]
Slide 22: GP inference using RBF kernels

[Figure: estimated potential U(x) and diffusion D(x)]
Slide 23: For larger Δt ...

[Figure: sparsely observed path X(t) and the resulting drift estimate f(x)]

Problem: we can compute p(X_{0:T} | f) but NOT p(y | f)!
Slide 24: EM algorithm

Treat the unobserved path X_t, for times t between observations, as latent random variables (Batz, Ruttor & Opper, NIPS 2013).

EM algorithm:
1. E-step: compute the expected complete-data log-likelihood
   L(f, f_old) = E_{p_old}[ ln L(X_{0:T} | f) ],    (1)
   where p_old is the posterior p(X_{0:T} | y) computed with the previous estimate f_old of the drift.
2. M-step: recompute the drift function as
   f_new = arg max_f ( L(f, f_old) + ln P_0(f) ).    (2)
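The two steps can be sketched in a toy version. This is NOT the method of the talk: the E-step below imputes the path with a plain Brownian bridge that ignores the current drift estimate (a crude stand-in for the Ornstein-Uhlenbeck bridge used later), and the M-step fits a parametric linear drift f(x) = θx by least squares instead of a GP. All parameter values are illustrative:

```python
import numpy as np

def impute_bridge(x0, x1, tau, n_sub, D, rng):
    """Sample intermediate points between two observations from a Brownian
    bridge. Ignoring the drift between observations is the crude part."""
    dt = tau / n_sub
    pts = [x0]
    for j in range(n_sub - 1):
        remaining = tau - j * dt
        mean = pts[-1] + (x1 - pts[-1]) * dt / remaining
        var = D * dt * (remaining - dt) / remaining
        pts.append(mean + np.sqrt(max(var, 0.0)) * rng.standard_normal())
    return np.array(pts)

def em_linear_drift(obs, tau, n_sub=10, D=1.0, n_iter=3, seed=0):
    """EM sketch for f(x) = theta * x: E-step imputes a dense path,
    M-step refits theta by least squares on the imputed increments."""
    rng = np.random.default_rng(seed)
    dt = tau / n_sub
    theta = 0.0
    for _ in range(n_iter):
        # E-step (crude): impute a dense path between consecutive observations
        segs = [impute_bridge(a, b, tau, n_sub, D, rng)
                for a, b in zip(obs[:-1], obs[1:])]
        path = np.concatenate(segs + [obs[-1:]])
        # M-step: theta = sum(x dx) / (dt * sum(x^2))
        x, dx = path[:-1], np.diff(path)
        theta = np.sum(x * dx) / (dt * np.sum(x * x))
    return theta

# sparse observations of an OU process (exact transitions), true theta = -1
rng = np.random.default_rng(4)
tau, n_obs = 0.2, 400
obs = np.zeros(n_obs + 1)
obs[0] = np.sqrt(0.5) * rng.standard_normal()
for k in range(n_obs):
    obs[k + 1] = obs[k] * np.exp(-tau) \
                 + np.sqrt((1 - np.exp(-2 * tau)) / 2) * rng.standard_normal()

theta_hat = em_linear_drift(obs, tau)
```

Because the simplified E-step does not feed θ back into the imputation, the iterations here mostly re-average the imputation noise; the point of the sketch is only the overall impute-then-refit structure of the EM loop.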
Slide 25: Likelihood for a complete path

p(X_{0:T} | f) ∝ exp[ −(1/(2Δt)) Σ_t ||X_{t+Δt} − X_t||² ] · L(X_{0:T} | f)

with

L(X_{0:T} | f) = exp[ −(1/2) Σ_t ||f(X_t)||² Δt + Σ_t f(X_t)·(X_{t+Δt} − X_t) ].
Slide 26: The complete likelihood

In the limit Δt → 0,

E_p[ln L(X_{0:T} | f)] = −(1/2) Σ_t { E_p[||f(X_t)||²] Δt − 2 E_p[⟨f(X_t), X_{t+Δt} − X_t⟩] }
  → −(1/2) ∫_0^T { E_p[||f(X_t)||²] − 2 E_p[⟨f(X_t), g_t(X_t)⟩] } dt
  = −(1/2) ∫ ||f(x)||² A(x) dx + ∫ ⟨f(x), b(x)⟩ dx,    (3)

where

g_t(x) = lim_{Δt→0} (1/Δt) E_p[X_{t+Δt} − X_t | X_t = x],

q_t denotes the posterior marginal density of X_t, and

A(x) = ∫_0^T q_t(x) dt,   b(x) = ∫_0^T g_t(x) q_t(x) dt.
Slide 27: Problems

- The E-step requires posterior marginal densities q_t for diffusion processes with an arbitrary prior drift function f(x):
  p(X_{0:T} | y, f) ∝ p(X_{0:T} | f) ∏_{k=1}^n δ(y_k − X_{kτ}).    (4)
- The GP has to deal with an infinite amount of densely imputed data.
Slide 28: Approximate solutions

- Linearize the drift between consecutive observations (Ornstein-Uhlenbeck bridge). Hence, for t between two observations we consider the approximate process
  dX_t = [ f(y_k) − Γ_k (X_t − y_k) ] dt + D_k^{1/2} dW_t    (5)
  with Γ_k = −∇f(y_k) and D_k = D(y_k). For this process, the transition density is a multivariate Gaussian!
- Work with a sparse GP approximation.
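The Gaussian transition density of the linearized process (5) has closed-form moments. A sketch for the scalar case, assuming Γ > 0 and constant D (the function name and signature are my own):

```python
import numpy as np

def linearized_transition(x0, y, f_y, Gamma, D, t):
    """Mean and variance of X_t given X_0 = x0 under the linear SDE
    dX = [f(y) - Gamma (X - y)] dt + sqrt(D) dW, i.e. dX = (a - Gamma X) dt + ...
    with a = f(y) + Gamma * y (standard Ornstein-Uhlenbeck moments)."""
    a = f_y + Gamma * y
    decay = np.exp(-Gamma * t)
    mean = x0 * decay + (a / Gamma) * (1.0 - decay)
    var = D / (2.0 * Gamma) * (1.0 - decay ** 2)
    return mean, var

m, v = linearized_transition(x0=0.5, y=0.0, f_y=0.0, Gamma=2.0, D=1.0, t=0.3)
```

Two limits provide a quick sanity check: at t = 0 the mean is x0 with zero variance, and as t → ∞ the moments relax to the fixed point a/Γ with variance D/(2Γ).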
Slides 29-33: Variational sparse GP approximation

(L. Csató, 2002; M. Titsias, 2009)

Assume a measure over functions f of the form

dP(f) = (1/Z) dP_0(f) e^{−U(f)}.

Approximate it by

dQ(f) = (1/Z_s) dP_0(f) e^{−U_s(f_s)},

where U_s depends only on the sparse set f_s = {f(x)}_{x∈S} of dimension m.

Minimize the KL divergence

D(Q || P) = ∫ dQ(f) ln( dQ(f) / dP(f) ).

Integrating over dQ(f | f_s) = dP_0(f | f_s) yields the optimal

U_s(f_s) = E_0[ U(f) | f_s ],

which can be computed analytically for a Gaussian prior P_0 and a quadratic log-likelihood U.
Slide 34: Some details

We can show that

E_0[f(x) | f_s] = k_s(x)ᵀ K_s^{−1} f_s.    (6)

Hence, if U(f) = (1/2) ∫ f²(x) A(x) dx − ∫ f(x) b(x) dx, we get

U_s(f_s) = E_0[U(f) | f_s]
  = (1/2) f_sᵀ K_s^{−1} { ∫ k_s(x) A(x) k_s(x)ᵀ dx } K_s^{−1} f_s − f_sᵀ K_s^{−1} ∫ k_s(x) b(x) dx.
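Equation (6) is just the conditional mean of a zero-mean Gaussian prior given its values on the sparse set. A minimal sketch with an RBF kernel and a hand-picked sparse set S (all values are illustrative):

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale ** 2)

def conditional_mean(x, S, f_s):
    """E_0[f(x) | f_s] = k_s(x)^T K_s^{-1} f_s for a zero-mean GP prior."""
    Ks = rbf(S, S) + 1e-10 * np.eye(len(S))   # small jitter for stability
    ks = rbf(S, x)
    return ks.T @ np.linalg.solve(Ks, f_s)

S = np.array([-1.0, 0.0, 1.0])                # sparse set of m = 3 points
f_s = np.array([0.5, 1.0, 0.5])               # function values on S
m = conditional_mean(np.array([0.0, 2.0]), S, f_s)
```

At a point of S the formula interpolates the given value exactly, while far from S the conditional mean reverts toward the prior mean zero.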
Slide 35: GP estimation after one iteration of EM

[Figure: estimated drift f(x), two panels]
Slide 36: [Figure: comparison of the MSE of the direct, EM, and Gibbs methods over different time intervals]
Slides 37-38: Example: a simple pendulum

dX = V dt,
dV = −( γV + (mgl sin X)/(ml²) ) dt + d^{1/2} dW_t.

n = 4000 data points (x, v) with Δt_obs = 0.3 and known diffusion constant d = 1. GP with a periodic kernel.

[Figure: estimated drift f(x, v = 0) as a function of x]
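The pendulum SDE can be simulated with the same Euler-Maruyama scheme as before, now in two dimensions (position X and velocity V, with noise only on V). The damping, gravity, and noise values below are illustrative, not those of the slide:

```python
import numpy as np

def simulate_pendulum(gamma=0.5, g_over_l=1.0, d=1.0, dt=0.01, n=2000, seed=0):
    """Euler-Maruyama for dX = V dt,
    dV = -(gamma V + (g/l) sin X) dt + sqrt(d) dW (noise acts on V only)."""
    rng = np.random.default_rng(seed)
    x = np.empty(n + 1)
    v = np.empty(n + 1)
    x[0], v[0] = 0.0, 0.0
    for k in range(n):
        x[k + 1] = x[k] + v[k] * dt
        v[k + 1] = v[k] + (-gamma * v[k] - g_over_l * np.sin(x[k])) * dt \
                   + np.sqrt(d * dt) * rng.standard_normal()
    return x, v

x, v = simulate_pendulum()
```

With positive damping the velocity stays bounded, fluctuating with a stationary variance of order d/(2γ).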
Slides 39-40: Drift estimation for SDEs using the empirical distribution

For X ∈ R^d consider the SDE

dX_t = f(X_t) dt + σ(X_t) dW_t.

Try to estimate the drift function f(·) given only the empirical distribution of (noise-free) observations,

p̂(x) = (1/n) Σ_{i=1}^n δ(x − x_i).

This is possible if the drift has the form

f(x) = g(x) + A(x) ∇ψ(x),

where A and g are known functions.
Slides 41-42: Generalised score functional

Define

ε[ψ] = ∫ { (1/2) ||∇ψ(x)||²_A + L_g ψ(x) } p(x) dx

with

L_g ψ(x) = g(x)·∇ψ(x) + (1/2) Σ_ij D_ij(x) ∂²ψ(x)/(∂x^(i) ∂x^(j)),

where ||u||²_A := uᵀ A(x) u and D(x) := σ(x) σ(x)ᵀ.

Setting δε[ψ]/δψ = 0 yields the stationary Fokker-Planck equation

L†_g p(x) − ∇·( A(x) ∇ψ(x) p(x) ) = 0,

where L†_g is the adjoint of L_g,

L†_g p(x) = −∇·( g(x) p(x) ) + (1/2) Σ_ij ∂²( D_ij(x) p(x) )/(∂x^(i) ∂x^(j)).

With g = 0 and constant A this reduces to score matching (Hyvärinen, 2005; Sriperumbudur et al., 2014) for density estimation.
Slides 43-44: Consider the pseudo log-likelihood

−Σ_{i=1}^n { (1/2) ||∇ψ(x_i)||²_A + L_g ψ(x_i) },

where x_i := X(t_i), i = 1, ..., n, is a sample from the stationary density p(·).

Combining this with a GP prior over ψ yields an estimator for the drift of the SDE if the drift is of the form

f(x) = g(x) + A(x) ∇ψ(x),

where A and g are given (Batz, Ruttor & Opper, 2016).
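A toy sketch of the g = 0 score-matching special case in one dimension, with a linear parametric model ψ'(x) = θx in place of a GP prior: the empirical objective Σ_i [ (1/2) θ² x_i² + θ ] (the second term is ψ''(x_i) = θ) is minimized in closed form at θ = −n / Σ_i x_i², which for Gaussian samples recovers the true score −x/σ². All settings are illustrative:

```python
import numpy as np

def score_matching_linear(samples):
    """Fit psi'(x) = theta * x by minimizing the empirical score objective
    sum_i [0.5 * (theta * x_i)^2 + theta]; closed form theta = -n / sum(x_i^2)."""
    return -len(samples) / np.sum(samples ** 2)

rng = np.random.default_rng(3)
sigma = 2.0
x = sigma * rng.standard_normal(50000)        # samples from N(0, sigma^2)
theta = score_matching_linear(x)              # true score is -x / sigma^2
```

With 50000 samples the estimate lands very close to −1/σ² = −0.25, without ever evaluating the density or its normalization constant, which is the appeal of the score functional.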
Slides 45-46: Langevin dynamics

Consider the 2nd-order SDE

dX_t = V_t dt,
dV_t = ( F(X_t) − λV_t ) dt + σ(X_t, V_t) dW_t,

and set A_xx = A_xv = 0, A_vv = I, f_x = v, g_v = −λv (known).

The condition

f_v(x, v) = −λv + ∇_v ψ(x, v) = −λv + F(x)

is fulfilled with ψ(x, v) = vᵀ F(x) and allows for arbitrary F(x). Use a GP prior over F.
Slide 47: Example 1: periodic model

F(x) = a sin x, D_v = (σ cos x)², with n = 2000 observations and time lag τ = 0.25.

[Figure: estimated force f(x)]
Slide 48: Example 2: non-conservative drift

F^(1)(x) = x^(1) ( 1 − (x^(1))² − (x^(2))² ) − x^(2),
F^(2)(x) = x^(2) ( 1 − (x^(1))² − (x^(2))² ) + x^(1),

with friction λ known but D unknown; n = 2000 observations.
Slides 49-51: The case A = D: likelihood for a densely observed path

ln L(X_{0:T} | g) = −(1/2) ∫ { ||f(X_t)||²_{D^{−1}} dt − 2 ⟨f(X_t), dX_t⟩ } + const,

with ⟨u, v⟩ := uᵀ D^{−1} v.

Assume f = g + D ∇ψ and apply the Itô formula:

ln L(X_{0:T} | g) = −(1/2) ∫_0^T { ∇ψᵀ D ∇ψ dt + 2 gᵀ ∇ψ dt − 2 ∇ψᵀ dX_t } + const
  = −∫_0^T { (1/2) ||∇ψ(X_t)||²_D + L_g ψ(X_t) } dt + ψ(X_T) − ψ(X_0) + const.

Up to the boundary terms, the time integral is the empirical version of the generalised score functional ε[ψ].
Slide 52: Outlook

- Beyond EM: full variational Bayesian approximation
- Estimation of the diffusion from sparse data
- Quality of the sparse GP approximation?
- Langevin dynamics with unobserved velocities?
Slide 53: Many thanks to my collaborators Andreas Ruttor and Philipp Batz. Funded by an EU STREP project.
Markov Chain Monte Carlo Algorithms for Gaussian Processes Michalis K. Titsias, Neil Lawrence and Magnus Rattray School of Computer Science University of Manchester June 8 Outline Gaussian Processes Sampling
More informationGraphical Models for Collaborative Filtering
Graphical Models for Collaborative Filtering Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Sequence modeling HMM, Kalman Filter, etc.: Similarity: the same graphical model topology,
More informationDensity Modeling and Clustering Using Dirichlet Diffusion Trees
p. 1/3 Density Modeling and Clustering Using Dirichlet Diffusion Trees Radford M. Neal Bayesian Statistics 7, 2003, pp. 619-629. Presenter: Ivo D. Shterev p. 2/3 Outline Motivation. Data points generation.
More informationProbabilistic Reasoning in Deep Learning
Probabilistic Reasoning in Deep Learning Dr Konstantina Palla, PhD palla@stats.ox.ac.uk September 2017 Deep Learning Indaba, Johannesburgh Konstantina Palla 1 / 39 OVERVIEW OF THE TALK Basics of Bayesian
More informationCMU-Q Lecture 24:
CMU-Q 15-381 Lecture 24: Supervised Learning 2 Teacher: Gianni A. Di Caro SUPERVISED LEARNING Hypotheses space Hypothesis function Labeled Given Errors Performance criteria Given a collection of input
More informationIntroduction to Gaussian Processes
Introduction to Gaussian Processes Neil D. Lawrence GPSS 10th June 2013 Book Rasmussen and Williams (2006) Outline The Gaussian Density Covariance from Basis Functions Basis Function Representations Constructing
More informationMa 221 Final Exam Solutions 5/14/13
Ma 221 Final Exam Solutions 5/14/13 1. Solve (a) (8 pts) Solution: The equation is separable. dy dx exy y 1 y0 0 y 1e y dy e x dx y 1e y dy e x dx ye y e y dy e x dx ye y e y e y e x c The last step comes
More informationA Short Introduction to Diffusion Processes and Ito Calculus
A Short Introduction to Diffusion Processes and Ito Calculus Cédric Archambeau University College, London Center for Computational Statistics and Machine Learning c.archambeau@cs.ucl.ac.uk January 24,
More informationUnsupervised Learning
Unsupervised Learning Bayesian Model Comparison Zoubin Ghahramani zoubin@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc in Intelligent Systems, Dept Computer Science University College
More informationStochastic Modelling in Climate Science
Stochastic Modelling in Climate Science David Kelly Mathematics Department UNC Chapel Hill dtbkelly@gmail.com November 16, 2013 David Kelly (UNC) Stochastic Climate November 16, 2013 1 / 36 Why use stochastic
More informationPath integrals for classical Markov processes
Path integrals for classical Markov processes Hugo Touchette National Institute for Theoretical Physics (NITheP) Stellenbosch, South Africa Chris Engelbrecht Summer School on Non-Linear Phenomena in Field
More informationLinear Models for Regression
Linear Models for Regression Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr
More informationManifold Monte Carlo Methods
Manifold Monte Carlo Methods Mark Girolami Department of Statistical Science University College London Joint work with Ben Calderhead Research Section Ordinary Meeting The Royal Statistical Society October
More informationIEOR E4570: Machine Learning for OR&FE Spring 2015 c 2015 by Martin Haugh. The EM Algorithm
IEOR E4570: Machine Learning for OR&FE Spring 205 c 205 by Martin Haugh The EM Algorithm The EM algorithm is used for obtaining maximum likelihood estimates of parameters when some of the data is missing.
More informationNONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition
NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function
More informationAnnouncements. Proposals graded
Announcements Proposals graded Kevin Jamieson 2018 1 Bayesian Methods Machine Learning CSE546 Kevin Jamieson University of Washington November 1, 2018 2018 Kevin Jamieson 2 MLE Recap - coin flips Data:
More informationBrownian Motion. An Undergraduate Introduction to Financial Mathematics. J. Robert Buchanan. J. Robert Buchanan Brownian Motion
Brownian Motion An Undergraduate Introduction to Financial Mathematics J. Robert Buchanan 2010 Background We have already seen that the limiting behavior of a discrete random walk yields a derivation of
More informationMathematical Economics: Lecture 2
Mathematical Economics: Lecture 2 Yu Ren WISE, Xiamen University September 25, 2012 Outline 1 Number Line The number line, origin (Figure 2.1 Page 11) Number Line Interval (a, b) = {x R 1 : a < x < b}
More informationPattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions
Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite
More informationLecture 1: Pragmatic Introduction to Stochastic Differential Equations
Lecture 1: Pragmatic Introduction to Stochastic Differential Equations Simo Särkkä Aalto University, Finland (visiting at Oxford University, UK) November 13, 2013 Simo Särkkä (Aalto) Lecture 1: Pragmatic
More informationFully Bayesian Deep Gaussian Processes for Uncertainty Quantification
Fully Bayesian Deep Gaussian Processes for Uncertainty Quantification N. Zabaras 1 S. Atkinson 1 Center for Informatics and Computational Science Department of Aerospace and Mechanical Engineering University
More informationCS-E4830 Kernel Methods in Machine Learning
CS-E483 Kernel Methods in Machine Learning Lecture : Gaussian processes Markus Heinonen 3. November, 26 Today s topic Gaussian processes: Kernel as a covariance function C. E. Rasmussen & C. K. I. Williams,
More informationProbabilistic Models for Learning Data Representations. Andreas Damianou
Probabilistic Models for Learning Data Representations Andreas Damianou Department of Computer Science, University of Sheffield, UK IBM Research, Nairobi, Kenya, 23/06/2015 Sheffield SITraN Outline Part
More informationEfficient Bayesian Inference of Sigmoidal Gaussian Cox Processes
Journal of Machine Learning Research 19 (2018) 1-34 Submitted 12/17; Revised 10/18; Published 11/18 Efficient Bayesian Inference of Sigmoidal Gaussian Cox Processes Christian Donner Manfred Opper Artificial
More informationThe Automatic Statistician
The Automatic Statistician Zoubin Ghahramani Department of Engineering University of Cambridge, UK zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/ James Lloyd, David Duvenaud (Cambridge) and
More informationGaussian Process Regression Networks
Gaussian Process Regression Networks Andrew Gordon Wilson agw38@camacuk mlgengcamacuk/andrew University of Cambridge Joint work with David A Knowles and Zoubin Ghahramani June 27, 2012 ICML, Edinburgh
More informationLarge-scale Collaborative Prediction Using a Nonparametric Random Effects Model
Large-scale Collaborative Prediction Using a Nonparametric Random Effects Model Kai Yu Joint work with John Lafferty and Shenghuo Zhu NEC Laboratories America, Carnegie Mellon University First Prev Page
More informationBayesian inference for stochastic differential mixed effects models - initial steps
Bayesian inference for stochastic differential ixed effects odels - initial steps Gavin Whitaker 2nd May 2012 Supervisors: RJB and AG Outline Mixed Effects Stochastic Differential Equations (SDEs) Bayesian
More informationLikelihood NIPS July 30, Gaussian Process Regression with Student-t. Likelihood. Jarno Vanhatalo, Pasi Jylanki and Aki Vehtari NIPS-2009
with with July 30, 2010 with 1 2 3 Representation Representation for Distribution Inference for the Augmented Model 4 Approximate Laplacian Approximation Introduction to Laplacian Approximation Laplacian
More informationStatistical Mechanics of Active Matter
Statistical Mechanics of Active Matter Umberto Marini Bettolo Marconi University of Camerino, Italy Naples, 24 May,2017 Umberto Marini Bettolo Marconi (2017) Statistical Mechanics of Active Matter 2017
More informationSimulation of conditional diffusions via forward-reverse stochastic representations
Weierstrass Institute for Applied Analysis and Stochastics Simulation of conditional diffusions via forward-reverse stochastic representations Christian Bayer and John Schoenmakers Numerical methods for
More informationVariable sigma Gaussian processes: An expectation propagation perspective
Variable sigma Gaussian processes: An expectation propagation perspective Yuan (Alan) Qi Ahmed H. Abdel-Gawad CS & Statistics Departments, Purdue University ECE Department, Purdue University alanqi@cs.purdue.edu
More informationKolmogorov Equations and Markov Processes
Kolmogorov Equations and Markov Processes May 3, 013 1 Transition measures and functions Consider a stochastic process {X(t)} t 0 whose state space is a product of intervals contained in R n. We define
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 254 Part V
More information