Nonparametric Drift Estimation for Stochastic Differential Equations

Gareth Roberts, Department of Statistics, University of Warwick. Brazilian Bayesian meeting, March 2010. Joint work with O. Papaspiliopoulos, Y. Pokern and A. Stuart.

Centre for Research in Statistical Methodology
http://www2.warwick.ac.uk/fac/sci/statistics/crism/
Conferences and workshops.
Academic visitor programme.
Currently preparing for a pre-Valencia meeting on Model Uncertainty, to take place between 30th May and 1st June.

Fitting SDEs to Molecular Dynamics
MD data $X(m\Delta t) \in \mathbb{R}^d$.
Multiple timescales.
High-frequency data.
High dimension, with only a few dimensions of chemical interest.
A diffusion is a good description at some timescales only.

Plan
Start from the SDE $dX_t = b(X_t)\,dt + dB_t$, $X(0) = x_0$, and high-frequency discrete-time observations $x_i$.
Write down the likelihood for $b(\cdot)$.
Manipulate the likelihood to make the local time $L$ an (almost) sufficient statistic.
Specify a prior on a function space $H$ and compute the posterior.
Make the Bayesian framework rigorous.
Make the numerics robust.
Application: toy example from molecular dynamics.

SDE properties - Girsanov
$dX_t = b(X_t)\,dt + dB_t$ generates a measure $\mathbb{P}$ on the path space $C([0, T], [0, 2\pi))$.
$\mathbb{P}$ is absolutely continuous w.r.t. the measure $\mathbb{W}$ generated by Brownian motion.
The likelihood (Radon-Nikodym derivative) is
$$\frac{d\mathbb{P}}{d\mathbb{W}} = \exp\left(-I[b]\right),$$
with $I[b]$ viewed as a functional of the drift:
$$I[b] = \frac{1}{2} \int_0^T \left( b^2(X_t)\,dt - 2\,b(X_t)\,dX_t \right).$$
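
On high-frequency discrete data, the functional $I[b] = \frac{1}{2}\int_0^T (b^2(X_t)\,dt - 2b(X_t)\,dX_t)$ can be approximated by a Riemann sum for the time integral and a left-endpoint (Itô) sum for the stochastic integral. The sketch below is illustrative only: the function name `girsanov_functional`, the use of an unwrapped path on the real line, and the Euler-Maruyama test path with drift $-\sin$ are assumptions made for the example, not part of the talk.

```python
import numpy as np

def girsanov_functional(x, dt, b):
    """Approximate I[b] = 1/2 * int_0^T (b(X)^2 dt - 2 b(X) dX) from samples.

    x  : 1-D array of observations X(0), X(dt), ..., X(M*dt) (unwrapped path)
    dt : time step between observations
    b  : vectorised candidate drift function
    """
    x = np.asarray(x, dtype=float)
    bx = b(x[:-1])            # drift at left endpoints (Ito convention)
    dx = np.diff(x)           # increments X_{i+1} - X_i
    return 0.5 * np.sum(bx**2 * dt - 2.0 * bx * dx)

# Hypothetical test path: Euler-Maruyama for dX = -sin(X) dt + dB.
rng = np.random.default_rng(0)
dt, M = 1e-3, 20_000
x = np.empty(M + 1)
x[0] = 0.0
for i in range(M):
    x[i + 1] = x[i] - np.sin(x[i]) * dt + np.sqrt(dt) * rng.standard_normal()

I_sin = girsanov_functional(x, dt, lambda a: -np.sin(a))
I_zero = girsanov_functional(x, dt, lambda a: np.zeros_like(a))
```

For a constant drift $b \equiv c$ the sums telescope to $\frac{1}{2}c^2 T - c\,(X_T - X_0)$, which gives a quick sanity check of the discretisation.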

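A hint from the discrete problem
Consider an $n$-state Markov chain in discrete time with transition matrix
$$P = \begin{pmatrix} p_{11} & p_{12} & \cdots & p_{1n} \\ p_{21} & p_{22} & \cdots & p_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ p_{n1} & p_{n2} & \cdots & p_{nn} \end{pmatrix}.$$
Data: $X_0, X_1, \ldots, X_T$. The likelihood is $\prod_{t=1}^T p_{X_{t-1}, X_t}$.
Information about the $i$th row is only available from visits to $i$. So let $L_i = \sum_{t=1}^T 1_{\{X_{t-1} = i\}}$ be the local time at $i$, and factorise:
$$\prod_{i=1}^n \ \prod_{t:\, X_{t-1} = i} p_{i, X_t}.$$
Conditional on $L$, we have $n$ independent inference problems.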
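
The factorisation can be made concrete with a small sketch: count the transitions out of each state, read off the local times as row sums, and estimate each row of the transition matrix separately. The function name `row_wise_mle` and the toy path are hypothetical, chosen only for illustration.

```python
import numpy as np

def row_wise_mle(path, n_states):
    """Return local times L_i and the row-wise MLE of the transition matrix."""
    path = np.asarray(path)
    counts = np.zeros((n_states, n_states))
    # counts[i, j] = number of observed transitions i -> j
    np.add.at(counts, (path[:-1], path[1:]), 1)
    local_time = counts.sum(axis=1)      # L_i = number of t with X_{t-1} = i
    p_hat = np.full((n_states, n_states), np.nan)
    visited = local_time > 0             # rows never visited stay undetermined
    p_hat[visited] = counts[visited] / local_time[visited, None]
    return local_time, p_hat

path = [0, 1, 1, 2, 0, 1, 2, 2, 0]       # toy data X_0, ..., X_8
L, P_hat = row_wise_mle(path, n_states=3)
```

Each row of `P_hat` depends only on the transitions out of that state, which is the independence across rows noted above.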

Diffusion Local Time
Local time is defined (at least in one dimension) as the occupation density at a location:
$$L(a) = \lim_{\epsilon \to 0} \frac{1}{2\epsilon} \int_0^T 1\left( X_s \in (a - \epsilon, a + \epsilon) \right) ds.$$
This gives a natural way to replace time averages by space averages:
$$\int_0^T f(X_t)\,dt = \int_{\mathbb{R}} f(a)\,L(a)\,da.$$

Inference for SDEs - parametric
$$dX_t = b(X_t)\,dt + \sigma(X_t)\,dB_t$$
Estimate the drift function $b(\cdot)$ from observations $\{x_i\}_{i=1}^M$, $x_i = X(i\Delta t)$.
High-frequency setup ($\Delta t \to 0$), discretely observed case ($\Delta t = O(1)$), or more generally...
Drift functions parametrised by $\theta \in \mathbb{R}^N$: $dX_t = b(X_t, \theta)\,dt + \sigma(X_t)\,dB_t$.
Could just choose a basis $\{b_j\}$: $dX_t = \sum_{j=1}^m \theta_j b_j(X_t)\,dt + \sigma(X_t)\,dB_t$.
Substantial theory and methodology are available for these cases, with a huge number of successful applications.
BUT the choice of basis functions can be far from obvious.

The Likelihood Functional
Start from the negative log-likelihood:
$$I[b] = -\log\frac{d\mathbb{P}}{d\mathbb{W}} = \frac{1}{2}\int_0^T \left( b^2(X_t)\,dt - 2\,b(X_t)\,dX_t \right).$$
We write $V(x) = \int_0^x b(u)\,du$.
Apply Itô's formula to $V$ to rewrite the stochastic integral as boundary terms plus a correction, where $W$ is the winding number of the path around the circle:
$$I[b] = -W\,(V(2\pi) - V(0)) - (V(X_T) - V(X_0)) + \frac{1}{2}\int_0^T \left( b^2(X_t) + b'(X_t) \right) dt.$$
Replace integrals along the trajectory by integrals against the local time $L(a)\,da$:
$$I[b] = -W\,(V(2\pi) - V(0)) - (V(X_T) - V(X_0)) + \frac{1}{2}\int_0^{2\pi} \left( b^2(a) + b'(a) \right) L(a)\,da.$$
Finally, write $V(X_T) - V(X_0) = \int_0^{2\pi} \chi_{x_0,x_T}(a)\,b(a)\,da$, with $\chi_{x_0,x_T}$ the signed indicator of the interval between $x_0$ and $x_T$:
$$I[b] = -W\,(V(2\pi) - V(0)) + \frac{1}{2}\int_0^{2\pi} \left[ \left( b^2(a) + b'(a) \right) L(a) - 2\,\chi_{x_0,x_T}(a)\,b(a) \right] da.$$

If local time were smooth...
... the log-likelihood $\log(d\mathbb{P}/d\mathbb{W})$ would be bounded above on $b \in L^2(0, 2\pi)$.
Taking the functional derivative yields the MLE
$$\hat b = \frac{L'}{2L}.$$
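
As a consistency check on this formula, here is a short worked example added for illustration. It assumes an ergodic diffusion with unit diffusion coefficient and gradient drift $b = -U'$ for a smooth periodic potential $U$, so that the stationary density is $\rho \propto e^{-2U}$ and, for large $T$, $L(a) \approx T\rho(a)$. Under these assumptions the formula recovers the drift:

```latex
\frac{L'(a)}{2L(a)} \approx \frac{\rho'(a)}{2\rho(a)}
  = \frac{-2U'(a)\,e^{-2U(a)}}{2\,e^{-2U(a)}}
  = -U'(a) = b(a).
```

For instance, $U(a) = -\cos a$ gives $\rho(a) \propto e^{2\cos a}$ and $L'/(2L) \approx -\sin a$.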

Infinite Dimensional Trouble
Up to boundary terms,
$$I[b] = \frac{1}{2}\int_0^{2\pi} \left( b(a)^2 L(a) + b'(a) L(a) \right) da.$$
For smooth $L$ the functional is positive definite and bounded below.
BUT $L$ is not differentiable! The likelihood is easily shown to be almost surely unbounded.

Various Options
$$\text{Boundary terms} + \frac{1}{2}\int_0^{2\pi} \left( b(a)^2 + b'(a) \right) L(a)\,da.$$
1. Use a non-likelihood-based approach.
2. Assume a parametric form $b(x, \theta)$.
3. Adopt some kind of penalised likelihood approach.
4. Introduce a prior measure on drift functions $b(\cdot)$ and perform Bayesian estimation.

Gaussian Prior on drift functions
Specify a prior Gaussian measure for zero-mean drift functions by
its mean: $b_0 \in H^2_{\mathrm{per}}([0, 2\pi])$;
its precision operator: $A$ on $[0, 2\pi]$ with periodic boundary conditions.
This is thus a continuum Gaussian Markov random field.
Smoothness imposed in $A$ persists in the posterior.

Choice of A
A simple choice would take $A = -\Delta$ ($= -d^2/da^2$).
This is (approximately) assuming independent increments $db(a)$ for different $a$, and would lead to continuous but non-differentiable sample paths with probability 1.
We mostly use $A = \Delta^2$ instead, whereby sample paths have the smoothness of once-integrated diffusions.
Intuitively:
$$b \sim \exp\left( -\frac{1}{2}\int_0^{2\pi} \left| \Delta\left( b(a) - b_0(a) \right) \right|^2 da \right).$$

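Finding the Posterior
Multiply the prior density by the likelihood:
$$\exp\left( -\frac{1}{2}\int_0^{2\pi} |\Delta b(a)|^2\,da \right) \exp\left( -\frac{1}{2}\int_0^{2\pi} \left[ \left( b(a)^2 + b'(a) \right) L(a) - 2\,\chi_{X_0,X_T}(a)\,b(a) \right] da + W \text{ term} \right).$$
Complete the square to find that the posterior is Gaussian with
mean $\hat b$ given by $\left( \Delta^2 + L \right) \hat b = \frac{1}{2} L' + \chi_{X_0,X_T} + W$;
posterior covariance $\left( \Delta^2 + L \right)^{-1}$.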

Towards the Posterior rigorously
Standard PDE theory shows that the posterior mean is the weak solution of a PDE.
Theorem. Let $L \in C([0, 2\pi])$ be continuous, periodic and not identically zero. Then the PDE
$$\Delta^2 u + L u = \frac{1}{2} L' + W + \chi_{x_0,x_T} \qquad (1)$$
has a unique weak solution $u \in H^2_{\mathrm{per}}([0, 2\pi])$.
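
To illustrate the structure of this fourth-order periodic problem, the sketch below solves $(\Delta^2 + L)u = \text{rhs}$ with periodic finite differences. This is a deliberately simpler stand-in for the finite element method used in the talk, and it is checked on a manufactured solution rather than on real data: the smooth function `L = 2 + cos(a)` standing in for a local time, the exact solution `sin(a)`, and the function names are all assumptions made for the example. In an application the right-hand side would be $\frac{1}{2}L' + W + \chi_{x_0,x_T}$.

```python
import numpy as np

def periodic_second_difference(n, h):
    """Matrix of the periodic central second difference on n grid points."""
    D2 = -2.0 * np.eye(n)
    idx = np.arange(n)
    D2[idx, (idx + 1) % n] += 1.0
    D2[idx, (idx - 1) % n] += 1.0
    return D2 / h**2

def solve_posterior_mean(L, rhs, h):
    """Solve (Delta^2 + L) u = rhs on a periodic grid with spacing h."""
    n = len(L)
    D2 = periodic_second_difference(n, h)
    A = D2 @ D2 + np.diag(L)      # discretised Delta^2 + L
    return np.linalg.solve(A, rhs)

n = 256
h = 2 * np.pi / n
a = h * np.arange(n)
L = 2.0 + np.cos(a)               # assumed smooth, positive stand-in for local time
u_exact = np.sin(a)
rhs = u_exact + L * u_exact       # Delta^2 sin = sin, so rhs = (1 + L) sin
u = solve_posterior_mean(L, rhs, h)
max_error = np.max(np.abs(u - u_exact))
```

Because $L$ is non-negative and not identically zero, the discretised operator is positive definite, mirroring the uniqueness statement of the theorem.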

Robustness of the Posterior Mean
Theorem. There exists a constant $C(W, L) > 0$ such that, for all admissible perturbed local times $\tilde L$, the deviation of the perturbed posterior mean $\tilde u$ from the unperturbed posterior mean $u$ is bounded in the $H^2$-norm:
$$\exists\, C > 0 \ \ \forall\, \tilde L \in \Lambda: \quad \| \tilde u - u \|_{H^2} \le C(W, L)\, \| \tilde L - L \|_{L^2}.$$

Cleanup
Observe absolute continuity of the posterior with respect to the prior measure: $\Delta^2$ and $\Delta^2 + L$ differ only in lower-order differential parts.
Compute the Radon-Nikodym derivative and identify it with the likelihood.
Combine a simple estimate of local time from pointwise estimates with Hölder continuity of local time, so that $\| \hat L - L \|_{L^2}$ is small.

Estimating local time
Choose $N$ equal-sized bins.
Count realisations falling in each bin.
The width of the bins adapts to $M$ (i.e. $\Delta t$), rather like bandwidth selection in kernel density estimation.
Pointwise estimates converge to local time in probability.
Use Hölder continuity of local time to obtain an $L^2$ error bound, using point values to construct piecewise constant approximants of $L$.
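
A minimal sketch of the binning estimator, assuming observations on the circle $[0, 2\pi)$ with spacing `dt`: the estimate in each bin is the time spent there divided by the bin width. The function name `local_time_histogram`, the fixed number of bins, and the Brownian test path are assumptions for illustration; the sketch does not implement any rule for adapting the bin width to $M$.

```python
import numpy as np

def local_time_histogram(x, dt, n_bins, lo=0.0, hi=2 * np.pi):
    """Piecewise-constant local time estimate on [lo, hi) from samples x."""
    x = np.mod(np.asarray(x, dtype=float) - lo, hi - lo) + lo   # wrap onto the circle
    counts, edges = np.histogram(x, bins=n_bins, range=(lo, hi))
    width = (hi - lo) / n_bins
    L_hat = counts * dt / width        # time spent in bin divided by bin width
    centres = 0.5 * (edges[:-1] + edges[1:])
    return centres, L_hat, width

rng = np.random.default_rng(1)
dt, M = 1e-3, 50_000
x = np.cumsum(np.sqrt(dt) * rng.standard_normal(M))   # Brownian path (wrapped by the estimator)
centres, L_hat, width = local_time_histogram(x, dt, n_bins=50)

f = np.cos                                            # a 2*pi-periodic test function
time_average = np.sum(f(x)) * dt                      # approximates int_0^T f(X_t) dt
space_average = np.sum(f(centres) * L_hat) * width    # approximates int f(a) L(a) da
```

The estimate integrates to the total observation time $M\Delta t$, and the two averages agree up to the binning error, which is the discrete counterpart of replacing time averages by space averages.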

Numerical Analysis
Fourth-order elliptic PDE with non-regular right-hand side.
Use piecewise cubic polynomial basis functions on each finite element:
$$b(a) = \sum_{e=1}^{N} \sum_{f=1}^{4} B_{e,f}\,\phi_{e,f}(a).$$
These span the approximation space $H^2_h \subset H^2$.

Finite Elements 2
The finite element representation turns the weak PDE into a collection of linear equations for $B_{e,f}$:
$$\sum_{(i,j),(e,f)} \Psi_{(i,j)}\, M_{(i,j),(e,f)}\, B_{e,f} = \sum_{(i,j)} \Psi_{(i,j)}\, F_{(i,j)}.$$
Standard numerical analysis guarantees the accuracy of these methods.

Numerics: Samples from Posterior
$$dX_t = \left( \sin(X_t) + 3\cos^2(X_t)\sin(X_t) \right) dt + dB_t$$
Second-order prior covariance: $A = \Delta^2$.
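
For readers who want synthetic data of this kind, the following sketch simulates the test SDE with the Euler-Maruyama scheme and wraps the path onto the circle. The step size, path length, starting point, and function names are assumptions for illustration; the talk does not state how its data were generated.

```python
import numpy as np

def drift(x):
    """Drift of the test SDE as read from the slide: sin(x) + 3 cos^2(x) sin(x)."""
    return np.sin(x) + 3.0 * np.cos(x) ** 2 * np.sin(x)

def euler_maruyama_circle(x0, dt, n_steps, rng):
    """Simulate dX = drift(X) dt + dB and wrap the path onto [0, 2*pi)."""
    x = np.empty(n_steps + 1)
    x[0] = x0
    noise = np.sqrt(dt) * rng.standard_normal(n_steps)
    for i in range(n_steps):
        x[i + 1] = x[i] + drift(x[i]) * dt + noise[i]
    # The drift is 2*pi-periodic, so wrapping after the simulation is equivalent
    # to simulating directly on the circle.
    return np.mod(x, 2 * np.pi)

rng = np.random.default_rng(42)
path = euler_maruyama_circle(x0=1.0, dt=1e-3, n_steps=10_000, rng=rng)
```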

Numerics: Samples from Posterior
$$dX_t = \left( \sin(X_t) + 3\cos^2(X_t)\sin(X_t) \right) dt + dB_t$$
First-order prior covariance: $A = -\Delta$.

Convergence as $T \to \infty$
Gaussian boundary conditions with second-order covariance operator.
[Figures for $T = 0.02$, $T = 50$ and $T = 5000$.]

Rates of posterior contraction
For $Z_1 = \hat b(0.38\pi) - b(0.38\pi)$ and $Z_2 = \int_0^{2\pi} \hat b(a)\sin(a)\,da$.
Questions:
Do we have a law of large numbers, $Z_i \to 0$ as $T \to \infty$?
Do we get CLT-like convergence, $\mathrm{Var}(Z_i) = O\left( \frac{1}{T} \right)$?
(Numerical) answers:
Numerically, $\lim_{T \to \infty} Z_i = 0$ is observed.
Decay of variance: the answer depends on $i$! High-frequency components of $L$ can dominate the convergence.

Rate of Posterior Contraction Smooth Functional

Rate of Posterior Contraction Point Evaluation

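Marginal likelihood
More general prior precision operator: $A(\eta) = \eta(-\Delta)^k + \varepsilon$.
How to choose the hyper-parameter $\eta$? A fully Bayesian approach is clearly possible. We choose to maximise the marginal likelihood:
$$\int \mathbb{P}\left( \{x_t\}_{t=0}^T \mid b \right) p_0(db) = \left( \frac{|A(\eta)|}{|A(\eta) + L_T|} \right)^{1/2} \exp\left( \frac{1}{2}\int_0^{2\pi} \left[ (A b_0 + f)\,(A(\eta) + L_T)^{-1}(A b_0 + f) - b_0 A b_0 \right] da \right).$$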

Optimising Smoothness η

Molecular Dynamics
$$M\ddot X(t) = -\nabla V(X(t)) - \gamma M \dot X(t) + \sqrt{2\gamma k_B T M}\,\dot B, \qquad X \mapsto \omega(X).$$

Fitting Result Whether data looks like a diffusion depends on timescale.

Fitting Result Posterior mean and standard deviation band for k = 1000:

Future Work / In Progress
The case where the state space is $\mathbb{R}$.
Extension to higher dimensions (2, 3).
Consistency issues: the rate of posterior contraction depends on the smoothness of the prior, the truth and the functional to be estimated.
Extend to $O(1)$-spaced data using perfect simulation of Brownian local times with a Metropolis-Hastings correction.
A heterogeneous diffusion coefficient is (in principle) straightforward for high-frequency data, though there would be associated numerical issues to resolve. In the discretely observed case, this requires reparameterisation techniques (see Roberts and Stramer, 2001; Durham and Gallant, 2002).

Summary
Nonparametric drift estimation for diffusions on the circle can be performed rigorously with a Gaussian (conjugate) prior.
The finite element implementation enables error control all the way from discrete-time, high-frequency samples to numerically obtained posterior means.
Applications in molecular dynamics and many other areas...

Samples from the Posterior are usable

Nonexistence
Local time has the same regularity as Brownian motion: $C^\alpha$, $\alpha < \frac{1}{2}$.
Unboundedness of the log-likelihood functional is linked to the regularity of local time.
Substituting a Brownian bridge for local time, we get the following.
Theorem. Let $W$ be a realisation of the Brownian bridge on $[0, 1]$. Then with probability one, the functional
$$I[b] = \frac{1}{2}\int_0^1 \left( b^2(s) + b'(s) \right) W(s)\,ds$$
is not bounded above on $b \in H^1([0, 1])$.

Classic example: Gibbs sampler for drift and diffusivity
Algorithm: use a sequential Gibbs sampler to estimate the diffusivity $\sigma$ and drift parameters $\theta_j$:
1. $\theta_j \sim P(\theta_j \mid x_i, \sigma)$
2. $\sigma \sim P(\sigma \mid x_i, \theta_j)$
Observations:
The algorithm works fine provided good approximations of the conditional densities are available.
Frequently, approximations are good only for $\Delta t \to 0$.
Simple fix: augment the data $\{x_i\}$ by imputed data points $x_{i,k}$ at times $t_i < t_{i,1} < t_{i,2} < \ldots < t_{i,k} < t_{i+1}$.
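
The two-step sampler can be sketched on a toy model. Everything specific here is an assumption made for illustration and is not taken from the talk: an Ornstein-Uhlenbeck model $dX = -\theta X\,dt + \sigma\,dB$ with a single drift parameter, the Euler approximation of the transition density (reasonable only for small $\Delta t$, in line with the observation above), a flat prior on $\theta$, and a prior proportional to $1/\sigma^2$ on $\sigma^2$. Under these assumptions both conditionals are available in closed form.

```python
import numpy as np

def gibbs_ou(x, dt, n_iter, rng):
    """Gibbs sampler for (theta, sigma^2) in dX = -theta X dt + sigma dB.

    Uses the Euler approximation x[i+1] - x[i] ~ N(-theta x[i] dt, sigma^2 dt),
    a flat prior on theta and a prior proportional to 1/sigma^2 on sigma^2.
    """
    x0, dx = x[:-1], np.diff(x)
    sxx = np.sum(x0**2) * dt           # approximates int X^2 dt
    sxd = np.sum(x0 * dx)              # approximates int X dX
    m = len(dx)
    theta, sigma2 = 0.0, 1.0
    draws = np.empty((n_iter, 2))
    for k in range(n_iter):
        # Step 1: theta | x, sigma^2 is Gaussian.
        theta = rng.normal(-sxd / sxx, np.sqrt(sigma2 / sxx))
        # Step 2: sigma^2 | x, theta is inverse-gamma(m/2, rss/2).
        rss = np.sum((dx + theta * x0 * dt) ** 2) / dt
        sigma2 = 1.0 / rng.gamma(shape=m / 2, scale=2.0 / rss)
        draws[k] = theta, sigma2
    return draws

# Hypothetical synthetic data from the Euler scheme itself.
rng = np.random.default_rng(3)
dt, M, theta_true, sigma_true = 0.01, 5_000, 1.0, 0.5
x = np.zeros(M + 1)
for i in range(M):
    x[i + 1] = x[i] - theta_true * x[i] * dt + sigma_true * np.sqrt(dt) * rng.standard_normal()

draws = gibbs_ou(x, dt, n_iter=500, rng=rng)
theta_mean = draws[100:, 0].mean()             # discard the first 100 draws as burn-in
sigma_mean = np.sqrt(draws[100:, 1]).mean()
```

With high-frequency data the diffusivity is pinned down almost exactly by the quadratic variation, while the drift parameter remains comparatively uncertain.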

Classic example: Gibbs sampler for drift and diffusivity
Augmented data algorithm: use a sequential Gibbs sampler to estimate the diffusivity $\sigma$, drift parameters $\theta_j$ and imputed data points $\{x_{i,k}\}$:
1. $\theta_j \sim P(\theta_j \mid x_i, x_{i,k}, \sigma)$
2. $\sigma \sim P(\sigma \mid x_i, x_{i,k}, \theta_j)$
3. $x_{i,k} \sim P(x_{i,k} \mid x_i, \sigma, \theta_j)$
Observations:
The algorithm grinds to a halt as augmentation is increased.
Bad mixing for $\sigma$ is observed.
Reason: the imputed data points determine $\sigma$, and $\sigma$ determines the quadratic variation of the imputed data points.
Analysis via continuous-time asymptotically equivalent diffusions (time rescaling).