A variational radial basis function approximation for diffusion processes

Michail D. Vrettas, Dan Cornford and Yuan Shen
Aston University - Neural Computing Research Group
Aston Triangle, Birmingham B4 7ET - United Kingdom
Email: {vrettasm, d.cornford, y.shen2}@aston.ac.uk

(This work has been funded by EPSRC as part of the Variational Inference for Stochastic Dynamic Environmental Models (VISDEM) project, EP/C005848/1.)

Abstract. In this paper we present a radial basis function based extension to a recently proposed variational algorithm for approximate inference in diffusion processes. Inference, for the state and in particular for the (hyper-)parameters, of diffusion processes is a challenging and crucial task. We show that the new radial basis function approximation based algorithm converges to the original algorithm and has beneficial characteristics when estimating (hyper-)parameters. We validate our new approach on a nonlinear double well potential dynamical system.

1 Introduction

Inference in diffusion processes is a well studied domain in statistics [5] and, more recently, machine learning [6]. In this paper we employ a radial basis function (RBF) framework [3, 7] to extend the variational treatment proposed in [6]. The motivation for this work is inference of the state and (hyper-)parameters in models of real dynamical systems, such as weather prediction models, although at present the methods can only be applied to relatively low dimensional models, such as might be found for chemical reactions or simpler biological systems.

The rest of the paper is organised as follows. In Section 2 we review the recently developed variational Gaussian process based algorithm. Section 3 introduces the new RBF approximation, which is tested for its stability and convergence in Section 4. Conclusions are given in Section 5.

2 Approximate inference in diffusion processes

Diffusion processes are a class of continuous-time stochastic processes with continuous sample paths [1]. Since diffusion processes satisfy the Markov property, their marginals can be written as a product of their transition densities. However, these densities are unknown in most realistic systems, making exact inference challenging. The approximate Bayesian inference framework that we apply is based on a variational approach [8]. In our work we consider a diffusion process with additive system noise [6], although re-parametrisation makes it possible to map a class of multiplicative noise models into this additive class [1].

The time evolution of a diffusion process can be described by a stochastic differential equation, henceforth SDE (to be interpreted in the Itô sense):

$$ dX_t = f_\theta(t, X_t)\,dt + \Sigma^{1/2}\,dW_t, \qquad (1) $$

where $f_\theta(t, X_t) \in \mathbb{R}^D$ is the non-linear drift function, $\Sigma = \mathrm{diag}\{\sigma_1^2, \ldots, \sigma_D^2\}$ is the diffusion (system noise covariance) matrix and $dW_t$ is a $D$ dimensional Wiener process. This (latent) process is partially observed, at discrete times, subject to error. Hence $Y_k = H X_{t_k} + \epsilon_{t_k}$, where $Y_k \in \mathbb{R}^d$ denotes the $k$-th observation, $H \in \mathbb{R}^{d \times D}$ is the linear observation operator and $\epsilon_{t_k} \sim \mathcal{N}(0, R)$ is i.i.d. Gaussian white noise with covariance matrix $R \in \mathbb{R}^{d \times d}$. The Bayesian posterior measure is given by:

$$ \frac{dp_{\mathrm{post}}}{dp_{\mathrm{sde}}} = \frac{1}{Z} \prod_{k=1}^{K} p(Y_k \mid X_{t_k}), \qquad (2) $$

using Radon-Nikodym notation, where $K$ denotes the number of noisy observations and $Z$ is the normalising marginal likelihood (i.e. $Z = p(Y_{1:K})$).

The key idea is to approximate the true (unknown) posterior process by another one that belongs to a family of tractable Gaussian processes. We minimise the variational free energy, defined as follows:

$$ \mathcal{F}_\Sigma(q, \theta) = \left\langle \ln \frac{q(X \mid \theta, \Sigma)}{p(Y, X \mid \theta, \Sigma)} \right\rangle_q, \qquad (3) $$

where $p$ is the true posterior process, $q$ is the approximate Gaussian process (dropping time indices) and $\langle \cdot \rangle_q$ denotes expectation with respect to $q$. $\mathcal{F}_\Sigma(q, \theta)$ provides an upper bound on the negative log marginal likelihood (evidence). The approximating Gaussian process implies a linear SDE:

$$ dX_t = (-A_t X_t + b_t)\,dt + \Sigma^{1/2}\,dW_t, \qquad (4) $$

where $A_t \in \mathbb{R}^{D \times D}$ and $b_t \in \mathbb{R}^D$ define the linear drift in the approximating process. $A_t$ and $b_t$ are time dependent functions that need to be optimised. The time evolution of this system is given by two ordinary differential equations for the marginal mean $m_t$ and covariance $S_t$, which must be enforced to ensure consistency in the algorithm. To enforce these constraints, within a predefined time window $[t_0, t_f]$, the following Lagrangian is formulated:

$$ \mathcal{L}_{\theta,\Sigma} = \mathcal{F}_\Sigma(q, \theta) - \int_{t_0}^{t_f} \mathrm{tr}\left\{\Psi_t \left(\dot{S}_t + A_t S_t + S_t A_t^\top - \Sigma\right)\right\} dt - \int_{t_0}^{t_f} \lambda_t^\top \left(\dot{m}_t + A_t m_t - b_t\right) dt, \qquad (5) $$

where $\lambda_t \in \mathbb{R}^D$ and $\Psi_t \in \mathbb{R}^{D \times D}$ are time dependent Lagrange multipliers, with $\Psi_t$ being symmetric. Given a set of fixed parameters for the system noise $\Sigma$ and the drift $\theta$, the minimisation of this quantity (5), and hence of the free energy (3), will lead to the optimal approximate posterior process. Further details of this variational algorithm (henceforth VGPA) can be found in [6, 9, 10].
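To make the structure of the approximating process concrete, the following sketch integrates the marginal moment ODEs implied by the linear SDE (4) with a simple forward Euler scheme, for the one dimensional case used later in the paper. This is an illustrative sketch only, not the authors' implementation; the variational parameters $A_t$ and $b_t$ are assumed to be given as arrays on the time grid.

```python
import numpy as np

def integrate_moments(A, b, m0, s0, sigma2, dt):
    """Forward Euler integration of the marginal moment ODEs of the linear SDE
    dX_t = (-A_t X_t + b_t) dt + sigma dW_t in one dimension:
        dm_t/dt = -A_t m_t + b_t
        dS_t/dt = -2 A_t S_t + sigma2
    A, b   : arrays of the time varying variational parameters on the grid
    m0, s0 : initial marginal mean and variance
    Returns the marginal means and variances on the same grid."""
    n = len(A)
    m = np.empty(n + 1)
    S = np.empty(n + 1)
    m[0], S[0] = m0, s0
    for k in range(n):
        m[k + 1] = m[k] + dt * (-A[k] * m[k] + b[k])
        S[k + 1] = S[k] + dt * (-2.0 * A[k] * S[k] + sigma2)
    return m, S
```

With $dt = 0.01$ and the ten unit time window used in Section 4 the grid has 1000 steps; in the VGPA these $m_t$ and $S_t$ trajectories are the quantities whose consistency is enforced by the Lagrange multipliers in (5).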

3 Radial basis function approximation

Radial basis function networks are a class of neural networks [2] that were introduced as an alternative to multi-layer perceptrons [3]. In this work we use RBFs to approximate the time varying variational parameters ($A_t$ and $b_t$). The idea of approximating continuous (or discrete) functions by RBFs is not new [7]. In the original variational framework (VGPA) these functions are discretised with a small time step (e.g. $dt = 0.01$), resulting in a set of discrete time variables that need to be optimised while minimising the free energy. The size of that set (the number of variables) scales proportionally with the length of the time window, the dimensionality of the system and the inverse of the time discretisation step. In total we need to infer $N_{tot} = (D+1)\,D\,\frac{t_f - t_0}{dt}$ variables, where $D$ is the system dimension, $t_0$ and $t_f$ are the initial and final times and $dt$ must be small for stability. In this paper we derive expressions and present the one dimensional case ($D = 1$).

Replacing the discretisation with RBFs we get the following expressions:

$$ \tilde{A}_t = \sum_{i=1}^{M_A} a_i\,\phi_i(t), \qquad \tilde{b}_t = \sum_{i=1}^{M_b} b_i\,\pi_i(t), \qquad (6) $$

where $a_i, b_i \in \mathbb{R}$ are the weights, $\phi_i(t), \pi_i(t) : \mathbb{R}^+ \rightarrow \mathbb{R}$ are fixed basis functions and $M_A, M_b \in \mathbb{N}$ are the total numbers of RBFs considered. The number of basis functions for each term, along with their class, need not be the same. However, in the absence of particular knowledge about these functions, using the same number of Gaussian basis functions for both seems reasonable. Hence we take $M_A = M_b$ and $\phi_i(t) = \pi_i(t)$, where:

$$ \phi_i(t) = \exp\left( -0.5 \left\| \frac{t - c_i}{\lambda_i} \right\|^2 \right), \qquad (7) $$

where $c_i$ and $\lambda_i$ are the $i$-th centre and width respectively and $\|\cdot\|$ is the Euclidean norm. Having precomputed the basis function maps $\phi_i(t)$, for $i \in \{1, 2, \ldots, M_A\}$ and $t \in [t_0, t_f]$, the optimisation problem reduces to calculating the weights of the basis functions, with $M_{tot} = 2 M_A$ parameters. Typically we expect that $M_{tot} \ll N_{tot}$, making the optimisation problem smaller. The full derivation of the equations is beyond the scope of this paper, but in essence the problem still requires us to minimise (5), with the variational parameters now being the basis function weights. In practice the computation of the free energy (3) is carried out in discrete time, using precomputed matrices of the basis function maps. To improve stability and convergence a Gram-Schmidt orthogonalisation is employed. The gradients of the Lagrangian (5) w.r.t. $a$ and $b$ are used in a scaled conjugate gradient optimisation algorithm. In practice around 60 iterations are required for convergence.
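As an illustration of equations (6) and (7), the sketch below (our own illustrative code, not the authors'; all names are hypothetical) builds the matrix of Gaussian basis function evaluations on the discrete time grid and reconstructs the time courses of $\tilde{A}_t$ and $\tilde{b}_t$ from given weight vectors. A QR factorisation stands in for the Gram-Schmidt orthogonalisation mentioned above, which is assumed here to act on the basis function matrix.

```python
import numpy as np

def rbf_design_matrix(t_grid, centres, widths, orthogonalise=False):
    """Evaluate phi_i(t) = exp(-0.5 * ((t - c_i) / lambda_i)^2) on a time grid.
    Returns a matrix Phi of shape (len(t_grid), M)."""
    Phi = np.exp(-0.5 * ((t_grid[:, None] - centres[None, :]) / widths[None, :]) ** 2)
    if orthogonalise:
        # QR factorisation: a numerically stable Gram-Schmidt orthogonalisation
        Phi, _ = np.linalg.qr(Phi)
    return Phi

# The paper's configuration (M = 40 basis functions per time unit over a ten
# unit window) would give 400 basis functions and M_tot = 800 weights in 1-D;
# here we use a smaller toy configuration.
t0, tf, dt = 0.0, 10.0, 0.01
t_grid = np.arange(t0, tf, dt)
M = 50                                      # total number of basis functions
centres = np.linspace(t0, tf, M)            # equally spaced centres
widths = np.full(M, 2.0 * (tf - t0) / M)    # wide enough for neighbours to overlap

Phi = rbf_design_matrix(t_grid, centres, widths)
a = np.random.randn(M)                      # weights; optimised in the RBF algorithm
b = np.random.randn(M)
A_t = Phi @ a                               # \tilde{A}_t on the grid, eq. (6)
b_t = Phi @ b                               # \tilde{b}_t on the grid, eq. (6)
```

The resulting `A_t` and `b_t` arrays could then be passed to the moment integration sketch of Section 2, so that the free energy and its gradients with respect to the weights can be evaluated in discrete time.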

4 Numerical experiments

To test the stability and convergence properties of the new RBF approximation algorithm, we consider a one dimensional double well system, with drift function $f_\theta(t, X_t) = 4 X_t (\theta - X_t^2)$, $\theta > 0$, and constant diffusion. This is a non-linear dynamical system whose stationary distribution has two stable states, $X_t = \pm\sqrt{\theta}$. The system is driven by white noise and, according to the strength of the random fluctuations (system noise coefficient $\Sigma$), occasionally flips from one stable state to the other (Figure 1(a)).

In the simulations we consider a time window of ten units ($t_0 = 0$, $t_f = 10$). The true parameters that generated the sample path are $\Sigma_{true} = 0.8$ and $\theta_{true} = 1.0$. The observation density was fixed to two per time unit (i.e. twenty observations in total). To provide robust results, one hundred different realisations of the observation noise were used. The basis functions were Gaussian (7), with centres $c_i$ equally spaced within the time window and widths $\lambda_i$ sufficiently large to permit overlap of neighbouring basis functions.

In Figure 1(b) we compare the results obtained from the RBF approximation algorithm, with basis function density M = 40 per time unit ($M_{tot} = 800$), against the true posterior obtained from a Hybrid Monte Carlo (HMC) method. We note that although the variance of the RBF approximation is slightly underestimated, the mean path matches the true HMC results quite well, as was the case in VGPA.

Fig. 1: (a) Sample path of the double well potential system used in the experiments. (b) Comparison of the approximated marginal means and variances (of a single realisation) between the true HMC estimates (solid lines) and the RBF variational algorithm (dashed lines). The crosses indicate the noisy observations.

The new RBF approximation algorithm is extremely stable and converges to the original VGPA, given a sufficient number of basis functions. Figure 2(a) shows convergence of the free energy of the RBF approximation to the VGPA results after thirty five basis functions per time unit ($M_{tot} = 700$), and this is also apparent when comparing the correct KL(p, q) divergence [4] between the approximations q and the posterior p derived from the HMC, Figure 2(b). The computational time for the RBF method is similar or slightly higher; however, it is more robust and stable in practice.

Fig. 2: (a) Comparison of the free energy, at convergence, between the RBF algorithm (squares, dashed lines) and the original VGPA (solid line, shaded area). The plot shows the 25, 50 and 75 percentiles (from 100 realisations) of the free energy as a function of basis function density. (b) A similar plot (for one realisation) for the integral of the KL(p, q), between the true (HMC) posterior and the approximate VGPA (dashed line, shaded area) and RBF (squares, dashed lines) posteriors, over the whole time window $[t_0, t_f]$.

The VGPA approximation can also be used to compute marginal likelihood based estimates of (hyper-)parameters, including the system noise and the drift parameters [9]. In the RBF version this is also possible, and empirical results show that it is faster and more robust compared to VGPA. As shown in the profile marginal likelihood plots in Figures 3(a) and 3(b), even with a relatively small basis function density, the $\Sigma$ and $\theta$ minima are very close to the ones determined by the VGPA. For $\Sigma$ we need around thirty basis functions per time unit (M = 30) to reach the same minimum, whereas for the $\theta$ parameter the minimum is almost identical using only ten basis functions per time unit (M = 10). These conclusions are supported by further experiments on 100 realisations (not shown in the present paper), which show consistency in the estimates of the maximum marginal likelihood parameters, both in value and in variability.

Fig. 3: (a) Profile marginal likelihood for the system noise coefficient $\Sigma$, keeping the drift parameter $\theta$ fixed to its true value. (b) As (a), but for the drift parameter $\theta$, keeping $\Sigma$ fixed to its true value. Both simulations were run for M = 10, M = 20 and M = 30, and compared with the profiles from the VGPA, on a typical realisation of the observations.
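For readers wishing to reproduce the experimental setup, the sketch below (our own illustrative code, not the authors') generates a double well sample path with the Euler-Maruyama scheme and draws noisy observations at the density described above ($\Sigma_{true} = 0.8$, $\theta_{true} = 1.0$, two observations per time unit over $[0, 10]$). The observation noise variance is not stated in this section, so the value used here is an assumption for illustration only.

```python
import numpy as np

def simulate_double_well(theta=1.0, sigma2=0.8, t0=0.0, tf=10.0, dt=0.01,
                         obs_per_unit=2, obs_noise_var=0.04, seed=0):
    """Euler-Maruyama simulation of
        dX_t = 4 X_t (theta - X_t^2) dt + sqrt(sigma2) dW_t,
    with noisy observations Y_k = X_{t_k} + eps_k, eps_k ~ N(0, obs_noise_var).
    obs_noise_var is an illustrative choice, not taken from the paper."""
    rng = np.random.default_rng(seed)
    n_steps = int(round((tf - t0) / dt))
    t = np.linspace(t0, tf, n_steps + 1)
    x = np.empty(n_steps + 1)
    x[0] = np.sqrt(theta)                       # start in one of the stable states
    for k in range(n_steps):
        drift = 4.0 * x[k] * (theta - x[k] ** 2)
        x[k + 1] = x[k] + drift * dt + np.sqrt(sigma2 * dt) * rng.standard_normal()
    # observation times: obs_per_unit per time unit (20 in total for this setup)
    obs_step = int(round(1.0 / (obs_per_unit * dt)))
    obs_idx = np.arange(obs_step, n_steps + 1, obs_step)
    y = x[obs_idx] + np.sqrt(obs_noise_var) * rng.standard_normal(len(obs_idx))
    return t, x, t[obs_idx], y
```

A path generated in this way, together with the RBF machinery sketched in Section 3, corresponds to the setting shown in Figure 1(a).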

5 Conclusions

We have presented a new variational radial basis function approximation for inference in diffusion processes. Results show that the new algorithm converges to the original VGPA with a relatively small number of basis functions per time unit, needing only 40% of the number of parameters required by the VGPA. We expect that further work on the choice of the basis functions will reduce this further. Thus far only Gaussian basis functions have been considered, with fixed centres and widths. A future approach should include different basis functions that better capture the roughness of the variational (control) parameters of the original algorithm, along with an adaptive (re)estimation scheme for the widths of the basis functions. We have also shown that estimation of the (hyper-)parameters within the SDE is remarkably stable with respect to the number of basis functions used in the RBF approximation. Reducing the number of parameters makes it possible to apply the RBF approximation to larger time windows, although this is limited by the underlying discrete computational framework. We are investigating the possibility of computing entirely in continuous time using the basis function expansion.

Acknowledgements

The authors acknowledge the help of Prof. Manfred Opper, whose useful comments helped to better shape the theoretical part of the RBF approximations.

References

[1] Peter E. Kloeden and Eckhard Platen. Numerical Solution of Stochastic Differential Equations. Springer-Verlag, Berlin, 1992.
[2] Christopher M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, New York, 1995.
[3] D. S. Broomhead and D. Lowe. Multivariable functional interpolation and adaptive networks. Complex Systems, 2:321-355, 1988.
[4] S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22:79-86, 1951.
[5] A. Golightly and D. J. Wilkinson. Bayesian inference for non-linear multivariate diffusion models observed with error. Computational Statistics and Data Analysis, 2007. Accepted.
[6] C. Archambeau, D. Cornford, M. Opper and J. Shawe-Taylor. Gaussian process approximations of stochastic differential equations. Journal of Machine Learning Research, Workshop and Conference Proceedings, 1:1-16, 2007.
[7] V. Kůrková and K. Hlaváčková. Approximation of continuous functions by RBF and KBF networks. In Proceedings of ESANN, pages 167-174, 1994.
[8] T. Jaakkola. Tutorial on variational approximation methods. In M. Opper and D. Saad, editors, Advanced Mean Field Methods: Theory and Practice. The MIT Press, 2001.
[9] C. Archambeau, M. Opper, Y. Shen, D. Cornford and J. Shawe-Taylor. Variational inference for diffusion processes. In J. C. Platt, D. Koller, Y. Singer and S. Roweis, editors, Advances in Neural Information Processing Systems (NIPS) 20, pages 17-24. The MIT Press, 2008.
[10] M. D. Vrettas, Y. Shen and D. Cornford. Derivations of Variational Gaussian Process Approximation. Technical Report NCRG/2008/002, Neural Computing Research Group, Aston University, Birmingham, B4 7ET, United Kingdom, March 2008.