Sequential Monte Carlo Methods for Bayesian Computation


A. Doucet (MLSS, Kyoto, Sept. 2012)

Motivating Example 1: Generic Bayesian Model

Let X be a vector parameter of interest with an associated prior µ; i.e. X ∼ µ(·).

We observe a realization y of Y, which is assumed to satisfy Y | (X = x) ∼ g(· | x); i.e. the likelihood function is g(y | x).

Bayesian inference on X relies on the posterior of X given Y = y:

    p(x | y) = µ(x) g(y | x) / p(y),

where the marginal likelihood/evidence satisfies p(y) = ∫ µ(x) g(y | x) dx.

Machine learning examples: Latent Dirichlet Allocation, (Hierarchical) Dirichlet processes...
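The posterior and evidence formulas above can be made concrete numerically. Below is a minimal sketch, assuming a toy scalar model of my own choosing (Gaussian prior µ = N(0, 1) and Gaussian likelihood g(y | x) = N(y; x, 0.5²), neither from the slides), that approximates p(y) and p(x | y) on a grid:

```python
import numpy as np

# Grid-based sketch of Bayes' rule p(x | y) = mu(x) g(y | x) / p(y).
# Assumed toy model (not from the slides): X ~ N(0, 1), Y | (X = x) ~ N(x, 0.5^2).
x = np.linspace(-5.0, 5.0, 2001)        # discretize the parameter space
dx = x[1] - x[0]
mu = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)   # prior density mu(x)
y = 1.2                                          # observed realization of Y
g = np.exp(-0.5 * ((y - x) / 0.5) ** 2) / (0.5 * np.sqrt(2 * np.pi))  # g(y | x)

evidence = np.sum(mu * g) * dx          # p(y) = \int mu(x) g(y | x) dx (Riemann sum)
posterior = mu * g / evidence           # p(x | y), normalized on the grid
```

For a conjugate pair like this one the posterior is of course available in closed form; the grid is only there to illustrate the generic formula.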

Motivating Example 2: State-Space Models

Let {X_t}_{t≥1} be a latent/hidden Markov process with X_1 ∼ µ(·) and X_t | (X_{t-1} = x) ∼ f(· | x).

Let {Y_t}_{t≥1} be an observation process such that the observations are conditionally independent given {X_t}_{t≥1} and Y_t | (X_t = x) ∼ g(· | x).

Let z_{i:j} := (z_i, z_{i+1}, ..., z_j). Bayesian inference on X_{1:t} relies on the posterior of X_{1:t} given Y_{1:t} = y_{1:t}:

    p(x_{1:t} | y_{1:t}) = p(x_{1:t}, y_{1:t}) / p(y_{1:t}),

where the marginal likelihood/evidence satisfies p(y_{1:t}) = ∫ p(x_{1:t}, y_{1:t}) dx_{1:t}.

Machine learning examples: biochemical network models, dynamic topic models, neuroscience models, etc.

Bayesian Inference and Machine Learning

Bayesian approaches have been adopted by a large part of the ML community.

Bayesian inference offers a number of attractive advantages over conventional approaches:
- flexibility in constructing complex models from simple parts;
- the incorporation of prior knowledge is very natural;
- all modelling assumptions are made explicit;
- uncertainties over model order, model parameters, and predictions are technically straightforward to compute.

The price to pay is that approximate inference techniques are necessary to approximate the resulting posterior distributions for all but trivial models.

Approximate Inference Methods

- Gaussian/Laplace approximations, local linearization, extended Kalman filters.
- Variational methods, assumed density filters.
- Expectation Propagation.
- Markov chain Monte Carlo (MCMC) methods.
- Sequential Monte Carlo (SMC) methods.

Monte Carlo Methods

Variational and EP methods are computationally cheap but perform functional approximations of the posteriors of interest.

Both MCMC and SMC are asymptotically (as you increase the computational effort) bias-free but computationally expensive.

MCMC methods have been the tools of choice in Bayesian computation for over 20 years, whereas SMC methods have been widely used for 15 years in vision and robotics.

The development of new methodology, combined with the emergence of cheap multicore architectures, now makes SMC a powerful alternative/complementary approach to MCMC for addressing general Bayesian computational problems.

The aim of these lectures is to provide an introduction to this active research field and to discuss some open research problems.

Some References and Resources

A.D., J.F.G. De Freitas & N.J. Gordon (editors), Sequential Monte Carlo Methods in Practice, Springer-Verlag: New York, 2001.

P. Del Moral, Feynman-Kac Formulae: Genealogical and Interacting Particle Systems with Applications, Springer-Verlag: New York, 2004.

O. Cappé, E. Moulines & T. Rydén, Hidden Markov Models, Springer-Verlag: New York, 2005.

Webpage with links to papers and code: http://www.stats.ox.ac.uk/~doucet/smc_resources.html

Thousands of papers on the subject appear every year.

Organization of Lectures

State-Space Models (approx. 4 hours):
- SMC filtering and smoothing
- Maximum likelihood parameter inference
- Bayesian parameter inference

Beyond State-Space Models (approx. 2 hours):
- SMC methods for generic sequences of target distributions
- SMC samplers
- Approximate Bayesian Computation
- Optimal design, optimal control

State-Space Models

Let {X_t}_{t≥1} be a latent/hidden X-valued Markov process with X_1 ∼ µ(·) and X_t | (X_{t-1} = x) ∼ f(· | x).

Let {Y_t}_{t≥1} be a Y-valued observation process such that the observations are conditionally independent given {X_t}_{t≥1} and Y_t | (X_t = x) ∼ g(· | x).

This is a general class of time series models, aka hidden Markov models (HMMs), including

    X_t = Ψ(X_{t-1}, V_t),   Y_t = Φ(X_t, W_t),

where {V_t} and {W_t} are two sequences of i.i.d. random variables.

Aim: infer {X_t} given the observations {Y_t}, on-line or off-line.

State-Space Models

State-space models are ubiquitous in control, data mining, econometrics, geosciences, systems biology, etc. Since Jan. 2012, more than 13,500 papers have already appeared (source: Google Scholar).

Finite state-space HMM: X is a finite space, i.e. {X_t} is a finite Markov chain, and Y_t | (X_t = x) ∼ g(· | x).

Linear Gaussian state-space model:

    X_t = A X_{t-1} + B V_t,   V_t i.i.d. ∼ N(0, I),
    Y_t = C X_t + D W_t,   W_t i.i.d. ∼ N(0, I).

Switching linear Gaussian state-space model: X_t = (X_t^1, X_t^2), where {X_t^1} is a finite Markov chain and

    X_t^2 = A(X_t^1) X_{t-1}^2 + B(X_t^1) V_t,   V_t i.i.d. ∼ N(0, I),
    Y_t = C(X_t^1) X_t^2 + D(X_t^1) W_t,   W_t i.i.d. ∼ N(0, I).
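As a concrete illustration of the linear Gaussian case, here is a short forward-simulation sketch; the scalar coefficients A, B, C, D below are illustrative choices, not values from the slides:

```python
import numpy as np

# Simulate a scalar linear Gaussian state-space model
# X_t = A X_{t-1} + B V_t,  Y_t = C X_t + D W_t,  V_t, W_t i.i.d. N(0, 1).
# The coefficients are illustrative, not from the slides.
rng = np.random.default_rng(0)
A, B, C, D = 0.9, 0.5, 1.0, 0.3
T = 200
x = np.zeros(T)
y = np.zeros(T)
x[0] = rng.normal()                          # X_1 ~ mu(.) = N(0, 1)
y[0] = C * x[0] + D * rng.normal()
for t in range(1, T):
    x[t] = A * x[t - 1] + B * rng.normal()   # transition f(. | x_{t-1})
    y[t] = C * x[t] + D * rng.normal()       # observation g(. | x_t)
```

In this linear Gaussian case the filtering distributions are exactly Gaussian and are given by the Kalman filter, which is what SMC output can be checked against.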

State-Space Models

Stochastic volatility model:

    X_t = φ X_{t-1} + σ V_t,   V_t i.i.d. ∼ N(0, 1),
    Y_t = β exp(X_t / 2) W_t,   W_t i.i.d. ∼ N(0, 1).

Biochemical network model:

    Pr(X_{t+dt}^1 = x_t^1 + 1, X_{t+dt}^2 = x_t^2 | x_t^1, x_t^2) = α x_t^1 dt + o(dt),
    Pr(X_{t+dt}^1 = x_t^1 - 1, X_{t+dt}^2 = x_t^2 + 1 | x_t^1, x_t^2) = β x_t^1 x_t^2 dt + o(dt),
    Pr(X_{t+dt}^1 = x_t^1, X_{t+dt}^2 = x_t^2 - 1 | x_t^1, x_t^2) = γ x_t^2 dt + o(dt),

with Y_k = X_{kT}^1 + W_k, where W_k i.i.d. ∼ N(0, σ²).

Nonlinear diffusion model:

    dX_t = α(X_t) dt + β(X_t) dV_t,   V_t Brownian motion,
    Y_k = γ(X_{kT}) + W_k,   W_k i.i.d. ∼ N(0, σ²).
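The stochastic volatility model is easy to simulate forward, which is how synthetic data for filtering experiments is usually produced. A sketch with illustrative parameter values (φ, σ, β below are my choices, not the slides'):

```python
import numpy as np

# Simulate the stochastic volatility model from the slide:
# X_t = phi X_{t-1} + sigma V_t,  Y_t = beta exp(X_t / 2) W_t.
# Parameter values are illustrative assumptions.
rng = np.random.default_rng(1)
phi, sigma, beta = 0.95, 0.3, 0.6
T = 500
x = np.zeros(T)
x[0] = rng.normal(0.0, sigma / np.sqrt(1 - phi**2))  # start at stationarity
for t in range(1, T):
    x[t] = phi * x[t - 1] + sigma * rng.normal()     # latent log-volatility
y = beta * np.exp(x / 2) * rng.normal(size=T)        # observed returns
```

Note that Y_t is conditionally Gaussian with standard deviation β exp(X_t / 2), so the latent state modulates the scale of the observations rather than their mean.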

Inference in State-Space Models

Given observations y_{1:t} := (y_1, y_2, ..., y_t), inference about X_{1:t} := (X_1, ..., X_t) relies on the posterior

    p(x_{1:t} | y_{1:t}) = p(x_{1:t}, y_{1:t}) / p(y_{1:t}),

where

    p(x_{1:t}, y_{1:t}) = µ(x_1) ∏_{k=2}^t f(x_k | x_{k-1}) · ∏_{k=1}^t g(y_k | x_k) = p(x_{1:t}) p(y_{1:t} | x_{1:t}),
    p(y_{1:t}) = ∫ p(x_{1:t}, y_{1:t}) dx_{1:t}.

When X is finite, and for linear Gaussian models, {p(x_t | y_{1:t})}_{t≥1} can be computed exactly. For nonlinear models, approximations are required: EKF, UKF, Gaussian sum filters, etc.

Approximations of {p(x_t | y_{1:t})}_{t=1}^T provide an approximation of p(x_{1:T} | y_{1:T}).
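The factorization of p(x_{1:t}, y_{1:t}) translates directly into a sum of log-densities, which is how the joint is evaluated in practice. A sketch for a scalar linear Gaussian model (all parameter values here are illustrative assumptions, not from the slides):

```python
import numpy as np

# Evaluate log p(x_{1:t}, y_{1:t}) = log mu(x_1) + sum_k log f(x_k | x_{k-1})
#                                  + sum_k log g(y_k | x_k)
# for an assumed scalar model: mu = N(0, 1), f(x' | x) = N(0.9 x, 0.5^2),
# g(y | x) = N(x, 0.3^2).

def log_normal_pdf(z, mean, sd):
    return -0.5 * np.log(2 * np.pi * sd**2) - 0.5 * ((z - mean) / sd) ** 2

def log_joint(x, y):
    lp = log_normal_pdf(x[0], 0.0, 1.0)                     # log mu(x_1)
    lp += np.sum(log_normal_pdf(x[1:], 0.9 * x[:-1], 0.5))  # transition terms
    lp += np.sum(log_normal_pdf(y, x, 0.3))                 # observation terms
    return lp

x = np.array([0.1, 0.2, 0.05])
y = np.array([0.0, 0.3, 0.1])
lp = log_joint(x, y)
```

Working in log space avoids the numerical underflow that the raw product of t densities would cause for long series.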

Monte Carlo Methods Basics

Assume you can generate X_{1:t}^(i) ∼ p(x_{1:t} | y_{1:t}), i = 1, ..., N; then the MC approximation is

    p̂(x_{1:t} | y_{1:t}) = (1/N) ∑_{i=1}^N δ_{X_{1:t}^(i)}(x_{1:t}).

Integration is straightforward:

    ∫ φ_t(x_{1:t}) p(x_{1:t} | y_{1:t}) dx_{1:t} ≈ ∫ φ_t(x_{1:t}) p̂(x_{1:t} | y_{1:t}) dx_{1:t} = (1/N) ∑_{i=1}^N φ_t(X_{1:t}^(i)).

Marginalization is straightforward:

    p̂(x_k | y_{1:t}) = ∫ p̂(x_{1:t} | y_{1:t}) dx_{1:k-1} dx_{k+1:t} = (1/N) ∑_{i=1}^N δ_{X_k^(i)}(x_k).

Basic and key property:

    V[(1/N) ∑_{i=1}^N φ_t(X_{1:t}^(i))] = C(t, dim(X)) / N,

i.e. the rate of convergence to zero is independent of dim(X) and t.
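The dimension-free O(N^{-1/2}) rate can be checked empirically. A sketch, assuming we can sample exactly from the target (here p = N(0, I_d) with φ(x) = ‖x‖², so the true expectation is d; this toy target is my choice, not from the slides):

```python
import numpy as np

# Basic Monte Carlo identity: with N i.i.d. samples X^(i) ~ p,
# (1/N) sum_i phi(X^(i)) approximates E_p[phi(X)], with error O(N^{-1/2})
# regardless of dimension. Target: p = N(0, I_d), phi(x) = ||x||^2, E = d.
rng = np.random.default_rng(2)
d = 10
estimates = {}
for N in (100, 10_000):
    samples = rng.normal(size=(N, d))                       # X^(i) ~ p
    estimates[N] = float(np.mean(np.sum(samples**2, axis=1)))
# estimates[N] -> d = 10 as N grows, with standard error sqrt(2 d / N)
```

The variance of φ here is 2d, so the standard error is sqrt(2d / N): roughly 0.45 at N = 100 and 0.045 at N = 10,000, shrinking as N^{-1/2} independently of d.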

Monte Carlo Methods

Problem 1: we typically cannot generate exact samples from p(x_{1:t} | y_{1:t}) for nonlinear non-Gaussian models.

Problem 2: even if we could, algorithms generating samples from p(x_{1:t} | y_{1:t}) would have at least complexity O(t).

The typical solution to Problem 1 is to generate approximate samples using MCMC methods, but these methods are not recursive.

SMC methods partially solve Problems 1 and 2 by breaking the problem of sampling from p(x_{1:t} | y_{1:t}) into a collection of simpler subproblems: first approximate p(x_1 | y_1) and p(y_1) at time 1, then p(x_{1:2} | y_{1:2}) and p(y_{1:2}) at time 2, and so on.

Each target distribution is approximated by a cloud of random samples, termed particles, evolving according to importance sampling and resampling steps.

Standard Bayesian Recursion

In most textbooks, you will find the following recursion for {p(x_t | y_{1:t})}_{t≥1}.

Prediction step:

    p(x_t | y_{1:t-1}) = ∫ p(x_{t-1}, x_t | y_{1:t-1}) dx_{t-1}
                       = ∫ p(x_t | y_{1:t-1}, x_{t-1}) p(x_{t-1} | y_{1:t-1}) dx_{t-1}
                       = ∫ f(x_t | x_{t-1}) p(x_{t-1} | y_{1:t-1}) dx_{t-1}.

Bayes updating step:

    p(x_t | y_{1:t}) = g(y_t | x_t) p(x_t | y_{1:t-1}) / p(y_t | y_{1:t-1}),

where

    p(y_t | y_{1:t-1}) = ∫ g(y_t | x_t) p(x_t | y_{1:t-1}) dx_t.

This is the recursion implemented by the Wonham and Kalman filters...
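For a finite state space the prediction and update integrals reduce to matrix-vector products, giving the exact forward filter. A sketch with an illustrative 2-state chain (the transition matrix and likelihood values are my own choices, not from the slides):

```python
import numpy as np

# Exact prediction/update recursion on a finite state space.
f = np.array([[0.9, 0.1],        # f[i, j] = Pr(X_t = j | X_{t-1} = i)
              [0.2, 0.8]])
mu = np.array([0.5, 0.5])        # initial distribution mu
g = np.array([[0.8, 0.3],        # g[t, j] = g(y_t | X_t = j), illustrative values
              [0.7, 0.4],
              [0.1, 0.9]])

post = mu * g[0]                       # unnormalized p(x_1 | y_1)
log_evidence = np.log(post.sum())      # log p(y_1)
post /= post.sum()
for gt in g[1:]:
    pred = post @ f                    # prediction step: p(x_t | y_{1:t-1})
    post = gt * pred                   # Bayes update (unnormalized)
    log_evidence += np.log(post.sum()) # accumulate log p(y_t | y_{1:t-1})
    post /= post.sum()                 # p(x_t | y_{1:t})
```

The same two-line recursion with Gaussian densities instead of vectors yields the Kalman filter.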

Bayesian Recursion on Path Space

SMC approximates directly $\{p(x_{1:t} \mid y_{1:t})\}_{t \geq 1}$, not $\{p(x_t \mid y_{1:t})\}_{t \geq 1}$, and relies on
$$p(x_{1:t} \mid y_{1:t}) = \frac{p(x_{1:t}, y_{1:t})}{p(y_{1:t})} = \frac{g(y_t \mid x_t)\, f(x_t \mid x_{t-1})\, p(x_{1:t-1}, y_{1:t-1})}{p(y_t \mid y_{1:t-1})\, p(y_{1:t-1})} = \frac{g(y_t \mid x_t)\, \overbrace{f(x_t \mid x_{t-1})\, p(x_{1:t-1} \mid y_{1:t-1})}^{\text{predictive } p(x_{1:t} \mid y_{1:t-1})}}{p(y_t \mid y_{1:t-1})}$$
where
$$p(y_t \mid y_{1:t-1}) = \int g(y_t \mid x_t)\, p(x_{1:t} \mid y_{1:t-1})\, dx_{1:t}.$$

This can alternatively be written as
$$\text{Prediction: } p(x_{1:t} \mid y_{1:t-1}) = f(x_t \mid x_{t-1})\, p(x_{1:t-1} \mid y_{1:t-1}), \qquad \text{Update: } p(x_{1:t} \mid y_{1:t}) = \frac{g(y_t \mid x_t)\, p(x_{1:t} \mid y_{1:t-1})}{p(y_t \mid y_{1:t-1})}.$$

SMC is a simple and natural simulation-based implementation of this recursion.

Monte Carlo Implementation of Prediction Step

Assume that at time $t-1$ you have
$$\hat{p}(x_{1:t-1} \mid y_{1:t-1}) = \frac{1}{N} \sum_{i=1}^N \delta_{X_{1:t-1}^{(i)}}(x_{1:t-1}).$$

By sampling $X_t^{(i)} \sim f\big(x_t \mid X_{t-1}^{(i)}\big)$ and setting $X_{1:t}^{(i)} = \big(X_{1:t-1}^{(i)}, X_t^{(i)}\big)$, then
$$\hat{p}(x_{1:t} \mid y_{1:t-1}) = \frac{1}{N} \sum_{i=1}^N \delta_{X_{1:t}^{(i)}}(x_{1:t}).$$

Sampling from $f(x_t \mid x_{t-1})$ is usually straightforward and can be done even if $f(x_t \mid x_{t-1})$ does not admit any analytical expression, e.g. in biochemical network models.
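The prediction step only requires the ability to simulate from $f(x_t \mid x_{t-1})$, never to evaluate it. A minimal pure-Python sketch (the AR(1)-plus-noise transition below is an illustrative assumption, not part of the slides):

```python
import random

def propagate(particles, transition_sample):
    # Monte Carlo prediction step: extend each particle by one draw from the
    # transition kernel. Only forward simulation of f is needed, never an
    # analytical expression for its density.
    return [transition_sample(x) for x in particles]

rng = random.Random(0)
# Hypothetical transition x_t = 0.5 x_{t-1} + v_t with v_t ~ N(0, 1).
particles = [rng.gauss(0.0, 1.0) for _ in range(2000)]
predicted = propagate(particles, lambda x: 0.5 * x + rng.gauss(0.0, 1.0))
```

The same `propagate` call works unchanged if `transition_sample` is an arbitrary black-box simulator, e.g. one Gillespie step of a biochemical network.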

Importance Sampling Implementation of Updating Step

Our target at time $t$ is
$$p(x_{1:t} \mid y_{1:t}) = \frac{g(y_t \mid x_t)\, p(x_{1:t} \mid y_{1:t-1})}{p(y_t \mid y_{1:t-1})},$$
so by substituting $\hat{p}(x_{1:t} \mid y_{1:t-1})$ for $p(x_{1:t} \mid y_{1:t-1})$ we obtain
$$\hat{p}(y_t \mid y_{1:t-1}) = \int g(y_t \mid x_t)\, \hat{p}(x_{1:t} \mid y_{1:t-1})\, dx_{1:t} = \frac{1}{N} \sum_{i=1}^N g\big(y_t \mid X_t^{(i)}\big).$$

We now have
$$\bar{p}(x_{1:t} \mid y_{1:t}) = \frac{g(y_t \mid x_t)\, \hat{p}(x_{1:t} \mid y_{1:t-1})}{\hat{p}(y_t \mid y_{1:t-1})} = \sum_{i=1}^N W_t^{(i)}\, \delta_{X_{1:t}^{(i)}}(x_{1:t})$$
with $W_t^{(i)} \propto g\big(y_t \mid X_t^{(i)}\big)$, $\sum_{i=1}^N W_t^{(i)} = 1$.

Multinomial Resampling

We have a weighted approximation $\bar{p}(x_{1:t} \mid y_{1:t})$ of $p(x_{1:t} \mid y_{1:t})$:
$$\bar{p}(x_{1:t} \mid y_{1:t}) = \sum_{i=1}^N W_t^{(i)}\, \delta_{X_{1:t}^{(i)}}(x_{1:t}).$$

To obtain $N$ samples $\overline{X}_{1:t}^{(i)}$ approximately distributed according to $p(x_{1:t} \mid y_{1:t})$, resample $N$ times with replacement:
$$\overline{X}_{1:t}^{(i)} \sim \bar{p}(x_{1:t} \mid y_{1:t}) \;\Rightarrow\; \hat{p}(x_{1:t} \mid y_{1:t}) = \frac{1}{N} \sum_{i=1}^N \delta_{\overline{X}_{1:t}^{(i)}}(x_{1:t}) = \frac{1}{N} \sum_{i=1}^N N_t^{(i)}\, \delta_{X_{1:t}^{(i)}}(x_{1:t}),$$
where $\big\{N_t^{(i)}\big\}$ follow a multinomial distribution with $E\big[N_t^{(i)}\big] = N W_t^{(i)}$ and $V\big[N_t^{(i)}\big] = N W_t^{(i)}\big(1 - W_t^{(i)}\big)$.

This can be achieved in $O(N)$ operations.
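The $O(N)$ claim can be realized by generating the $N$ uniforms already sorted (via normalized cumulative sums of exponentials) and merging them against the cumulative weights in a single pass. A pure-Python sketch (an illustrative implementation, not the slides' code):

```python
import random

def multinomial_resample(weights, rng):
    # O(N) multinomial resampling. Ordered uniforms U_(1) <= ... <= U_(N) are
    # obtained from cumulative sums of N+1 exponentials divided by their total;
    # one merge pass against the cumulative weights then yields the ancestor
    # indices, with E[N_t^(i)] = N * W_t^(i).
    n = len(weights)
    spacings = [rng.expovariate(1.0) for _ in range(n + 1)]
    total = sum(spacings)
    indices, cum, acc, j = [], weights[0], 0.0, 0
    for e in spacings[:n]:
        acc += e
        u = acc / total                    # the next ordered uniform
        while u > cum and j < n - 1:       # advance through the CDF once
            j += 1
            cum += weights[j]
        indices.append(j)
    return indices
```

Each index $i$ appears $N_t^{(i)}$ times in the output; the resampled system is `[particles[j] for j in indices]`.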

Vanilla SMC: Bootstrap Filter (Gordon et al., 1993)

At time $t = 1$:
Sample $X_1^{(i)} \sim \mu(x_1)$, then
$$\bar{p}(x_1 \mid y_1) = \sum_{i=1}^N W_1^{(i)}\, \delta_{X_1^{(i)}}(x_1), \quad W_1^{(i)} \propto g\big(y_1 \mid X_1^{(i)}\big).$$
Resample $\overline{X}_1^{(i)} \sim \bar{p}(x_1 \mid y_1)$ to obtain $\hat{p}(x_1 \mid y_1) = \frac{1}{N} \sum_{i=1}^N \delta_{\overline{X}_1^{(i)}}(x_1)$.

At time $t \geq 2$:
Sample $X_t^{(i)} \sim f\big(x_t \mid \overline{X}_{t-1}^{(i)}\big)$, set $X_{1:t}^{(i)} = \big(\overline{X}_{1:t-1}^{(i)}, X_t^{(i)}\big)$ and
$$\bar{p}(x_{1:t} \mid y_{1:t}) = \sum_{i=1}^N W_t^{(i)}\, \delta_{X_{1:t}^{(i)}}(x_{1:t}), \quad W_t^{(i)} \propto g\big(y_t \mid X_t^{(i)}\big).$$
Resample $\overline{X}_{1:t}^{(i)} \sim \bar{p}(x_{1:t} \mid y_{1:t})$ to obtain $\hat{p}(x_{1:t} \mid y_{1:t}) = \frac{1}{N} \sum_{i=1}^N \delta_{\overline{X}_{1:t}^{(i)}}(x_{1:t})$.
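The bootstrap filter fits in a few lines of pure Python. The sketch below runs it on an illustrative linear-Gaussian model (the model, its parameter values and the helper names are assumptions for the example, not part of the slides); it returns the filtering means and the log marginal-likelihood estimate discussed on the next slide:

```python
import math
import random

def bootstrap_filter(ys, n_particles, rng):
    # Bootstrap filter for the illustrative model
    #   x_1 ~ N(0, 1),  x_t = 0.9 x_{t-1} + v_t,  v_t ~ N(0, 1),
    #   y_t = x_t + w_t,                          w_t ~ N(0, 1).
    # Returns the filtering means E[x_t | y_{1:t}] and log p_hat(y_{1:T}).
    log2pi = math.log(2.0 * math.pi)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]  # sample from mu
    means, loglik = [], 0.0
    for t, y in enumerate(ys):
        if t > 0:
            # sampling step: X_t ~ f(. | X_{t-1})
            xs = [0.9 * x + rng.gauss(0.0, 1.0) for x in xs]
        # weight by the likelihood g(y_t | x_t) = N(y_t; x_t, 1), in log space
        logw = [-0.5 * (y - x) ** 2 - 0.5 * log2pi for x in xs]
        m = max(logw)
        w = [math.exp(lw - m) for lw in logw]
        s = sum(w)
        loglik += m + math.log(s / n_particles)  # log p_hat(y_t | y_{1:t-1})
        w = [wi / s for wi in w]
        means.append(sum(wi * xi for wi, xi in zip(w, xs)))
        # resampling step (multinomial)
        xs = rng.choices(xs, weights=w, k=n_particles)
    return means, loglik

# Simulate data from the same model and run the filter on it.
rng = random.Random(1)
x, xs_true, ys = rng.gauss(0.0, 1.0), [], []
for _ in range(100):
    xs_true.append(x)
    ys.append(x + rng.gauss(0.0, 1.0))
    x = 0.9 * x + rng.gauss(0.0, 1.0)
means, loglik = bootstrap_filter(ys, 500, rng)
```

Working in log space before exponentiating (subtracting the maximum log-weight) is what keeps the weight normalization numerically stable.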

SMC Output

At time $t$, we get
$$\bar{p}(x_{1:t} \mid y_{1:t}) = \sum_{i=1}^N W_t^{(i)}\, \delta_{X_{1:t}^{(i)}}(x_{1:t}), \qquad \hat{p}(x_{1:t} \mid y_{1:t}) = \frac{1}{N} \sum_{i=1}^N \delta_{\overline{X}_{1:t}^{(i)}}(x_{1:t}).$$

The marginal likelihood estimate is given by
$$\hat{p}(y_{1:t}) = \prod_{k=1}^t \hat{p}(y_k \mid y_{1:k-1}) = \prod_{k=1}^t \left( \frac{1}{N} \sum_{i=1}^N g\big(y_k \mid X_k^{(i)}\big) \right).$$

Computational complexity is $O(N)$ at each time step and memory requirements are $O(tN)$.

If we are only interested in $p(x_t \mid y_{1:t})$, or in $p(s_t(x_{1:t}) \mid y_{1:t})$ where $s_t(x_{1:t}) = \Psi_t(x_t, s_{t-1}(x_{1:t-1}))$ is fixed-dimensional (e.g. $s_t(x_{1:t}) = \sum_{k=1}^t x_k^2$), then memory requirements are only $O(N)$.

SMC on Path-Space (figures by Olivier Cappé)

[Figure: $p(x_1 \mid y_1)$ and $\hat{E}[X_1 \mid y_1]$ (top), and particle approximation of $p(x_1 \mid y_1)$ (bottom); state plotted against time index.]

[Figure: $p(x_1 \mid y_1)$, $p(x_2 \mid y_{1:2})$ and $\hat{E}[X_1 \mid y_1]$, $\hat{E}[X_2 \mid y_{1:2}]$ (top), and particle approximation of $p(x_{1:2} \mid y_{1:2})$ (bottom).]

[Figure: $p(x_t \mid y_{1:t})$ and $\hat{E}[X_t \mid y_{1:t}]$ for $t = 1, 2, 3$ (top), and particle approximation of $p(x_{1:3} \mid y_{1:3})$ (bottom).]

[Figure: $p(x_t \mid y_{1:t})$ and $\hat{E}[X_t \mid y_{1:t}]$ for $t = 1, \ldots, 10$ (top), and particle approximation of $p(x_{1:10} \mid y_{1:10})$ (bottom).]

[Figure: $p(x_t \mid y_{1:t})$ and $\hat{E}[X_t \mid y_{1:t}]$ for $t = 1, \ldots, 24$ (top), and particle approximation of $p(x_{1:24} \mid y_{1:24})$ (bottom).]

Remarks

Empirically, this SMC strategy performs well in terms of estimating the marginals $\{p(x_t \mid y_{1:t})\}_{t \geq 1}$. Thankfully, this is all that is necessary in many applications.

However, the joint distribution $p(x_{1:t} \mid y_{1:t})$ is poorly estimated when $t$ is large; in the previous example we have $\hat{p}(x_{1:11} \mid y_{1:24}) = \delta_{\overline{X}_{1:11}}(x_{1:11})$.

Degeneracy problem. For any $N$ and any $k$, there exists $t(k, N)$ such that for any $t \geq t(k, N)$,
$$\hat{p}(x_{1:k} \mid y_{1:t}) = \delta_{\overline{X}_{1:k}}(x_{1:k});$$
that is, $\hat{p}(x_{1:t} \mid y_{1:t})$ becomes an unreliable approximation of $p(x_{1:t} \mid y_{1:t})$ as $t \to \infty$.

Another Illustration of the Degeneracy Phenomenon

For the linear Gaussian state-space model described before, we can compute $S_t/t$ exactly, where
$$S_t = \int \left( \sum_{k=1}^t x_k^2 \right) p(x_{1:t} \mid y_{1:t})\, dx_{1:t},$$
using Kalman techniques.

We compute the SMC estimate of this quantity, $\hat{S}_t/t$, where
$$\hat{S}_t = \int \left( \sum_{k=1}^t x_k^2 \right) \hat{p}(x_{1:t} \mid y_{1:t})\, dx_{1:t}$$
can be computed sequentially.

[Figure: $S_t/t$ obtained through the Kalman smoother (blue) and its SMC estimate $\hat{S}_t/t$ (red), for $t$ up to 5000.]

Some Convergence Results for SMC

Numerous convergence results for SMC are available; see (Del Moral, 2004).

Let $\varphi_t : \mathcal{X}^t \to \mathbb{R}$ and consider
$$\overline{\varphi}_t = \int \varphi_t(x_{1:t})\, p(x_{1:t} \mid y_{1:t})\, dx_{1:t}, \qquad \hat{\varphi}_t = \int \varphi_t(x_{1:t})\, \hat{p}(x_{1:t} \mid y_{1:t})\, dx_{1:t} = \frac{1}{N} \sum_{i=1}^N \varphi_t\big(\overline{X}_{1:t}^{(i)}\big).$$

We can prove that, for any bounded function $\varphi_t$ and any $p \geq 1$,
$$E\big[ \left| \hat{\varphi}_t - \overline{\varphi}_t \right|^p \big]^{1/p} \leq \frac{B(t)\, c(p)\, \|\varphi_t\|_\infty}{\sqrt{N}}, \qquad \lim_{N \to \infty} \sqrt{N}\, (\hat{\varphi}_t - \overline{\varphi}_t) \Rightarrow \mathcal{N}\big(0, \sigma_t^2\big).$$

These are very weak results: $B(t)$ and $\sigma_t^2$ can increase with $t$, and will do so for a path-dependent $\varphi_t(x_{1:t})$, as the degeneracy problem suggests.

Stronger Convergence Results

Assume the following exponential stability assumption: for any $x_1, x_1'$,
$$\frac{1}{2} \int \big| p(x_t \mid y_{2:t}, X_1 = x_1) - p(x_t \mid y_{2:t}, X_1 = x_1') \big|\, dx_t \leq \alpha^t \quad \text{for some } 0 \leq \alpha < 1.$$

Marginal distribution. For $\varphi_t(x_{1:t}) = \varphi(x_{t-L:t})$, there exist $B_1, B_2 < \infty$ s.t.
$$E\big[ \left| \hat{\varphi}_t - \overline{\varphi}_t \right|^p \big]^{1/p} \leq \frac{B_1\, c(p)\, \|\varphi\|_\infty}{\sqrt{N}}, \qquad \lim_{N \to \infty} \sqrt{N}\, (\hat{\varphi}_t - \overline{\varphi}_t) \Rightarrow \mathcal{N}\big(0, \sigma_t^2\big)$$
where $\sigma_t^2 \leq B_2$; i.e. there is no accumulation of numerical errors over time.

L1 distance. If $\overline{p}(x_{1:t} \mid y_{1:t}) = E\big( \hat{p}(x_{1:t} \mid y_{1:t}) \big)$, there exists $B_3 < \infty$ s.t.
$$\int \big| \overline{p}(x_{1:t} \mid y_{1:t}) - p(x_{1:t} \mid y_{1:t}) \big|\, dx_{1:t} \leq \frac{B_3\, t}{N};$$
i.e. the bias only increases linearly in $t$.

Stronger Convergence Results

Unbiasedness. The marginal likelihood estimate is unbiased: $E\big( \hat{p}(y_{1:t}) \big) = p(y_{1:t})$.

Relative Variance Bound. There exists $B_4 < \infty$ s.t.
$$E\left[ \left( \frac{\hat{p}(y_{1:t})}{p(y_{1:t})} - 1 \right)^2 \right] \leq \frac{B_4\, t}{N}.$$

Central Limit Theorem. There exists $B_5 < \infty$ s.t.
$$\lim_{N \to \infty} \sqrt{N} \big( \log \hat{p}(y_{1:t}) - \log p(y_{1:t}) \big) \Rightarrow \mathcal{N}\big(0, \sigma_t^2\big) \quad \text{with } \sigma_t^2 \leq B_5\, t.$$

Basic Idea Used to Establish Uniform Lp Bounds

We denote $\eta_k(x_k) = p(x_k \mid y_{1:k-1})$ and $\hat{\eta}_k(x_k) = \hat{p}(x_k \mid y_{1:k-1})$ its particle approximation.

Let $\Phi_{k,t}$ be the measure-valued mapping such that $\eta_t = \Phi_{k,t}(\eta_k)$, which satisfies
$$\Phi_{k,t}(\eta_k)(x_t) = \int \underbrace{\frac{\eta_k(x_k)\, p(y_{k:t-1} \mid x_k)}{\int \eta_k(x_k')\, p(y_{k:t-1} \mid x_k')\, dx_k'}}_{p(x_k \mid y_{1:t-1})}\; p(x_t \mid x_k, y_{k+1:t-1})\, dx_k.$$

Key Decomposition Formula

Compare the exact flow $\eta_1 \to \eta_2 = \Phi_{1,2}(\eta_1) \to \cdots \to \eta_t = \Phi_{1,t}(\eta_1)$ with the flows started from each intermediate particle approximation: $\hat{\eta}_1 \to \Phi_{1,2}(\hat{\eta}_1) \to \cdots \to \Phi_{1,t}(\hat{\eta}_1)$; $\hat{\eta}_2 \to \cdots \to \Phi_{2,t}(\hat{\eta}_2)$; $\ldots$; $\hat{\eta}_{t-1} \to \Phi_{t-1,t}(\hat{\eta}_{t-1})$.

Decomposition of the error:
$$\hat{\eta}_t - \eta_t = \sum_{k=1}^t \Big[ \Phi_{k,t}(\hat{\eta}_k) - \Phi_{k,t}\big( \Phi_{k-1,k}(\hat{\eta}_{k-1}) \big) \Big].$$

Stability Properties

We have
$$p(x_t \mid x_k, y_{k+1:t-1}) = \int p(x_{k+1:t} \mid x_k, y_{k+1:t-1})\, dx_{k+1:t-1}$$
where
$$p(x_{k+1:t} \mid x_k, y_{k+1:t-1}) = \prod_{m=k+1}^t p(x_m \mid x_{m-1}, y_{m:t-1}).$$

To summarize, we have
$$\Phi_{k,t}(\eta_k)(x_t) = \int \underbrace{\frac{\eta_k(x_k)\, p(y_{k:t-1} \mid x_k)}{\int \eta_k(x_k')\, p(y_{k:t-1} \mid x_k')\, dx_k'}}_{p(x_k \mid y_{1:t-1})}\; \prod_{m=k+1}^t p(x_m \mid x_{m-1}, y_{m:t-1})\, dx_{k:t-1}.$$

Stability Properties

Assume there exists $\epsilon > 0$ s.t. for any $x, x'$ and for any $y$,
$$\epsilon^{-1}\, \nu(x') \leq f(x' \mid x) \leq \epsilon\, \nu(x'), \qquad 0 < \underline{g} \leq g(y \mid x) \leq \overline{g} < \infty;$$
then there exists $0 \leq \lambda < 1$ s.t.
$$\frac{1}{2} \int \big| \Phi_{k,k+t}(\eta)(x) - \Phi_{k,k+t}(\eta')(x) \big|\, dx \leq \lambda^t.$$

Hence $\Phi_{k,t}(\eta_k)(x_t) - \Phi_{k,t}(\hat{\eta}_k)(x_t) \to 0$ as $(t - k) \to \infty$.

Putting Everything Together

Under such strong mixing assumptions,
$$\hat{\eta}_t - \eta_t = \sum_{k=1}^t \underbrace{\Big[ \Phi_{k,t}(\hat{\eta}_k) - \Phi_{k,t}\big( \Phi_{k-1,k}(\hat{\eta}_{k-1}) \big) \Big]}_{\text{of order } \lambda^{t-k+1} / \sqrt{N}} \quad \text{for some } 0 \leq \lambda < 1.$$

We can then obtain results such as: there exists $B_1 < \infty$ s.t.
$$E\big[ \left| \hat{\varphi}_t - \overline{\varphi}_t \right|^p \big]^{1/p} \leq \frac{B_1\, c(p)\, \|\varphi\|_\infty}{\sqrt{N}}.$$

Much work has been done recently on removing such strong mixing assumptions; see e.g. Whiteley (2012) for much weaker and more realistic assumptions.

Summary

SMC provides consistent estimates under weak assumptions.

Under stability assumptions, we have uniform-in-time stability of the SMC estimates of $\{p(x_t \mid y_{1:t})\}_{t \geq 1}$.

Under stability assumptions, the relative variance of the SMC estimate of $\{p(y_{1:t})\}_{t \geq 1}$ only increases linearly with $t$.

Even under stability assumptions, one cannot expect to obtain uniform-in-time stability for SMC estimates of $\{p(x_{1:t} \mid y_{1:t})\}_{t \geq 1}$; this is due to the degeneracy problem.

Is it possible to Q1: eliminate, Q2: mitigate the degeneracy problem? Answer: Q1: no, Q2: yes.

Is Resampling Really Necessary?

Resampling is the source of the degeneracy problem and might appear wasteful.

The resampling step is an unbiased operation,
$$E\big[ \hat{p}(x_{1:t} \mid y_{1:t}) \,\big|\, \bar{p}(x_{1:t} \mid y_{1:t}) \big] = \bar{p}(x_{1:t} \mid y_{1:t}),$$
but clearly it introduces some errors locally in time. That is, for any test function $\varphi$ we have
$$V\left[ \int \varphi(x_{1:t})\, \hat{p}(x_{1:t} \mid y_{1:t})\, dx_{1:t} \right] \geq V\left[ \int \varphi(x_{1:t})\, \bar{p}(x_{1:t} \mid y_{1:t})\, dx_{1:t} \right].$$

What about eliminating the resampling step?

Sequential Importance Sampling: SMC Without Resampling

In this case, the estimate of the posterior is
$$\hat{p}_{\mathrm{SIS}}(x_{1:t} \mid y_{1:t}) = \sum_{i=1}^N W_t^{(i)}\, \delta_{X_{1:t}^{(i)}}(x_{1:t}),$$
where $X_{1:t}^{(i)} \sim p(x_{1:t})$ and $W_t^{(i)} \propto p\big(y_{1:t} \mid X_{1:t}^{(i)}\big) = \prod_{k=1}^t g\big(y_k \mid X_k^{(i)}\big)$.

In this case, the marginal likelihood estimate is
$$\hat{p}_{\mathrm{SIS}}(y_{1:t}) = \frac{1}{N} \sum_{i=1}^N p\big(y_{1:t} \mid X_{1:t}^{(i)}\big).$$

The relative variance of $p\big(y_{1:t} \mid X_{1:t}^{(i)}\big) = \prod_{k=1}^t g\big(y_k \mid X_k^{(i)}\big)$ increases exponentially fast with $t$...
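The exponential blow-up is easy to reproduce: run SIS with the prior as proposal and track the largest normalized weight over time. A pure-Python sketch (the linear-Gaussian model here is an illustrative assumption, not the stochastic volatility model of the next slide):

```python
import math
import random

def sis_max_weight(ys, n_particles, rng):
    # SIS with the prior as proposal and *no* resampling; returns the largest
    # normalized weight after each step. The weights W_t^(i) are proportional
    # to prod_k g(y_k | X_k^(i)), accumulated in log space to avoid underflow.
    xs = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    logw = [0.0] * n_particles
    history = []
    for t, y in enumerate(ys):
        if t > 0:
            xs = [0.9 * x + rng.gauss(0.0, 1.0) for x in xs]
        logw = [lw - 0.5 * (y - x) ** 2 for lw, x in zip(logw, xs)]
        m = max(logw)
        w = [math.exp(lw - m) for lw in logw]
        history.append(max(w) / sum(w))
    return history

rng = random.Random(3)
ys = [rng.gauss(0.0, 1.5) for _ in range(100)]  # any observation sequence will do
max_w = sis_max_weight(ys, 100, rng)
```

After a moderate number of steps, a single particle carries essentially all of the mass, which is exactly the weight degeneracy shown in the histograms below.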

SIS For Stochastic Volatility Model

[Figure: Histograms of $\log_{10}\big(W_t^{(i)}\big)$, the importance weights in base-10 logarithm, for $t = 1$ (top), $t = 50$ (middle) and $t = 100$ (bottom).]

The algorithm's performance collapses as $t$ increases, as expected.

Central Limit Theorems

For both SIS and SMC, we have a CLT for the estimates of the marginal likelihood:
$$\sqrt{N} \left( \frac{\hat{p}_{\mathrm{SIS}}(y_{1:t})}{p(y_{1:t})} - 1 \right) \Rightarrow \mathcal{N}\big(0, \sigma^2_{t,\mathrm{SIS}}\big), \qquad \sqrt{N} \left( \frac{\hat{p}_{\mathrm{SMC}}(y_{1:t})}{p(y_{1:t})} - 1 \right) \Rightarrow \mathcal{N}\big(0, \sigma^2_{t,\mathrm{SMC}}\big).$$

The variance expressions are
$$\sigma^2_{t,\mathrm{SIS}} = \int \frac{p^2(x_{1:t} \mid y_{1:t})}{p(x_{1:t})}\, dx_{1:t} - 1 = \frac{\int p^2(y_{1:t} \mid x_{1:t})\, p(x_{1:t})\, dx_{1:t}}{p^2(y_{1:t})} - 1,$$
$$\sigma^2_{t,\mathrm{SMC}} = \int \frac{p^2(x_1 \mid y_{1:t})}{\mu(x_1)}\, dx_1 + \sum_{k=2}^t \int \frac{p^2(x_{1:k} \mid y_{1:t})}{p(x_{1:k-1} \mid y_{1:k-1})\, f(x_k \mid x_{k-1})}\, dx_{1:k} - t = \frac{\int g^2(y_1 \mid x_1)\, \mu(x_1)\, dx_1}{p^2(y_1)} + \sum_{k=2}^t \frac{\int p^2(y_{k:t} \mid x_k)\, p(x_k \mid y_{1:k-1})\, dx_k}{p^2(y_{k:t} \mid y_{1:k-1})} - t.$$

SMC breaks the integral over $\mathcal{X}^t$ into $t$ integrals over $\mathcal{X}$.

A Toy Example

Consider the case where $f(x' \mid x) = \mu(x') = \mathcal{N}(x'; 0, \sigma^2)$ and $g(y \mid x) = \mathcal{N}\big(y;\, x,\, (1 - \sigma^{-2})^{-1}\big)$, where $\sigma^2 > 1$.

Assume we observe $y_1 = \cdots = y_t = 0$; then we have
$$V\left[ \frac{\hat{p}_{\mathrm{SIS}}(y_{1:t})}{p(y_{1:t})} \right] = \frac{\sigma^2_{t,\mathrm{SIS}}}{N} = \frac{1}{N} \left[ \left( \frac{\sigma^4}{2\sigma^2 - 1} \right)^{t/2} - 1 \right],$$
$$V\left[ \frac{\hat{p}_{\mathrm{SMC}}(y_{1:t})}{p(y_{1:t})} \right] \approx \frac{\sigma^2_{t,\mathrm{SMC}}}{N} = \frac{t}{N} \left[ \left( \frac{\sigma^4}{2\sigma^2 - 1} \right)^{1/2} - 1 \right].$$

If we select $\sigma^2 = 1.2$, then it is necessary to use $N \approx 2 \times 10^{23}$ particles to obtain $\sigma^2_{t,\mathrm{SIS}}/N = 10^{-2}$ for $t = 1000$.

To obtain $\sigma^2_{t,\mathrm{SMC}}/N = 10^{-2}$, SMC requires only $N \approx 10^4$ particles: an improvement by 19 orders of magnitude!
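The exponential-versus-linear contrast can be checked by evaluating the two variance expressions directly (a quick numerical sketch of the stated formulas; the value $\sigma^2 = 2$ below is chosen arbitrarily for illustration):

```python
def rel_var_sis(sigma2, t):
    # SIS relative variance from the slide: (sigma^4 / (2 sigma^2 - 1))^(t/2) - 1.
    r = sigma2 ** 2 / (2.0 * sigma2 - 1.0)
    return r ** (t / 2.0) - 1.0

def rel_var_smc(sigma2, t):
    # SMC relative variance from the slide: t * ((sigma^4 / (2 sigma^2 - 1))^(1/2) - 1).
    r = sigma2 ** 2 / (2.0 * sigma2 - 1.0)
    return t * (r ** 0.5 - 1.0)

# Particle numbers required for a target relative variance of 10^-2:
n_sis = rel_var_sis(2.0, 1000) / 1e-2
n_smc = rel_var_smc(2.0, 1000) / 1e-2
```

For any fixed $\sigma^2 > 1$, `rel_var_sis` grows geometrically in $t$ while `rel_var_smc` grows only linearly, which is the whole point of interleaving resampling steps.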

Better Resampling Schemes

Better resampling steps can be designed such that $E\big[N_t^{(i)}\big] = N W_t^{(i)}$ but $V\big[N_t^{(i)}\big] < N W_t^{(i)}\big(1 - W_t^{(i)}\big)$: residual resampling, minimal entropy resampling, etc. (Cappé et al., 2005).

Residual Resampling. Set $\tilde{N}_t^{(i)} = \big\lfloor N W_t^{(i)} \big\rfloor$, then sample $\overline{N}_t^{1:N}$ from a multinomial of parameters $\big(N - \sum_i \tilde{N}_t^{(i)}, \overline{W}_t^{(1:N)}\big)$ where $\overline{W}_t^{(i)} \propto W_t^{(i)} - N^{-1} \tilde{N}_t^{(i)}$; finally set $N_t^{(i)} = \tilde{N}_t^{(i)} + \overline{N}_t^{(i)}$.

Systematic Resampling. Sample $U_1 \sim \mathcal{U}[0, \frac{1}{N}]$ and define $U_i = U_1 + \frac{i-1}{N}$ for $i = 2, \ldots, N$; then set
$$N_t^{(i)} = \left| \left\{ U_j : \sum_{k=1}^{i-1} W_t^{(k)} \leq U_j \leq \sum_{k=1}^{i} W_t^{(k)} \right\} \right|,$$
with the convention $\sum_{k=1}^0 := 0$.
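Systematic resampling is particularly simple to implement: one uniform draw, the stratified points $U_1 + (i-1)/N$, and a single pass over the cumulative weights. A pure-Python sketch (illustrative):

```python
import random

def systematic_resample(weights, rng):
    # Systematic resampling: a single U ~ U[0, 1/N) gives the stratified
    # points U + i/N; one pass over the cumulative weights selects N ancestor
    # indices with E[N_t^(i)] = N * W_t^(i) and very low variance.
    n = len(weights)
    u = rng.random() / n
    indices, cum, j = [], weights[0], 0
    for i in range(n):
        point = u + i / n
        while point > cum and j < n - 1:
            j += 1
            cum += weights[j]
        indices.append(j)
    return indices
```

Because the points are equally spaced, $N_t^{(i)}$ can deviate from $N W_t^{(i)}$ by less than one, which is why this scheme is the usual default in practice.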

Measuring Variability of the Weights

To measure the variation of the weights, we can use the Effective Sample Size (ESS):
$$\mathrm{ESS} = \left( \sum_{i=1}^N \big( W_t^{(i)} \big)^2 \right)^{-1}.$$

We have $\mathrm{ESS} = N$ if $W_t^{(i)} = 1/N$ for all $i$, and $\mathrm{ESS} = 1$ if $W_t^{(i)} = 1$ and $W_t^{(j)} = 0$ for $j \neq i$.

Liu (1996) showed that for simple importance sampling, for $\varphi$ regular enough,
$$V\left[ \sum_{i=1}^N W_t^{(i)}\, \varphi\big( X_t^{(i)} \big) \right] \approx V_{p(x_{1:t} \mid y_{1:t})}\left[ \frac{1}{\mathrm{ESS}} \sum_{i=1}^{\mathrm{ESS}} \varphi\big( X_t^{(i)} \big) \right];$$
i.e. the estimate is roughly as accurate as one computed from an iid sample of size ESS from $p(x_{1:t} \mid y_{1:t})$.
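The ESS diagnostic is a one-liner on normalized weights; a sketch:

```python
def ess(weights):
    # Effective sample size: 1 / sum_i (W^(i))^2. Equals N for uniform
    # weights and 1 when a single particle carries all the mass.
    return 1.0 / sum(w * w for w in weights)
```

Intermediate weight configurations give values strictly between these two extremes.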

Dynamic Resampling

Resampling at each time step can be harmful: only resample when necessary.

Dynamic Resampling: if the variation of the weights as measured by the ESS is too high, e.g. $\mathrm{ESS} < N/2$, then resample the particles.

We can also use the entropy
$$\mathrm{Ent} = -\sum_{i=1}^N W_t^{(i)} \log_2\big( W_t^{(i)} \big).$$
We have $\mathrm{Ent} = \log_2(N)$ if $W_t^{(i)} = 1/N$ for all $i$, and $\mathrm{Ent} = 0$ if $W_t^{(i)} = 1$ and $W_t^{(j)} = 0$ for $j \neq i$.
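Both the entropy criterion and the ESS-based resampling decision are a few lines each; a sketch (the threshold of $N/2$ follows the slide, everything else is illustrative):

```python
import math

def weight_entropy(weights):
    # Ent = -sum_i W^(i) log2 W^(i): equals log2(N) for uniform weights
    # and 0 when one particle carries all the mass.
    return -sum(w * math.log2(w) for w in weights if w > 0.0)

def should_resample(weights, threshold=0.5):
    # Dynamic resampling rule: resample only when the effective sample size
    # falls below threshold * N (ESS < N/2 by default).
    n = len(weights)
    ess = 1.0 / sum(w * w for w in weights)
    return ess < threshold * n
```

In a filter loop, `should_resample` is evaluated after each weight update, and the resampling step is skipped whenever it returns `False`.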

Improving the Sampling Step

Bootstrap filter. Particles are sampled blindly according to the prior, without taking the observation into account. This is very inefficient for a vague prior and a peaky likelihood.

Optimal proposal / perfect adaptation. Implement the following alternative update/propagate Bayesian recursion:
$$\text{Update: } p(x_{1:t-1} \mid y_{1:t}) = \frac{p(y_t \mid x_{t-1})\, p(x_{1:t-1} \mid y_{1:t-1})}{p(y_t \mid y_{1:t-1})},$$
$$\text{Propagate: } p(x_{1:t} \mid y_{1:t}) = p(x_{1:t-1} \mid y_{1:t})\, p(x_t \mid y_t, x_{t-1}),$$
where
$$p(x_t \mid y_t, x_{t-1}) = \frac{f(x_t \mid x_{t-1})\, g(y_t \mid x_t)}{p(y_t \mid x_{t-1})}.$$
This is much more efficient when applicable, e.g. for $f(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \varphi(x_{t-1}), \Sigma_v)$, $g(y_t \mid x_t) = \mathcal{N}(y_t; x_t, \Sigma_w)$.

A General Bayesian Recursion

Introduce an arbitrary proposal distribution $q(x_t \mid y_t, x_{t-1})$, i.e. an approximation of $p(x_t \mid y_t, x_{t-1})$.

We have seen that
$$p(x_{1:t} \mid y_{1:t}) = \frac{g(y_t \mid x_t)\, f(x_t \mid x_{t-1})\, p(x_{1:t-1} \mid y_{1:t-1})}{p(y_t \mid y_{1:t-1})},$$
so clearly
$$p(x_{1:t} \mid y_{1:t}) = \frac{w(x_{t-1}, x_t, y_t)\, q(x_t \mid y_t, x_{t-1})\, p(x_{1:t-1} \mid y_{1:t-1})}{p(y_t \mid y_{1:t-1})}$$
where
$$w(x_{t-1}, x_t, y_t) = \frac{g(y_t \mid x_t)\, f(x_t \mid x_{t-1})}{q(x_t \mid y_t, x_{t-1})}.$$

This suggests a more general SMC algorithm.


A General SMC Algorithm

Assume we have $N$ weighted particles $\{W_{t-1}^{(i)}, X_{1:t-1}^{(i)}\}$ approximating $p(x_{1:t-1} \mid y_{1:t-1})$. Then at time $t$:

Sample $X_t^{(i)} \sim q(x_t \mid y_t, X_{t-1}^{(i)})$, set $X_{1:t}^{(i)} = (X_{1:t-1}^{(i)}, X_t^{(i)})$ and
$$\widehat{p}(x_{1:t} \mid y_{1:t}) = \sum_{i=1}^{N} W_t^{(i)}\, \delta_{X_{1:t}^{(i)}}(x_{1:t}), \qquad W_t^{(i)} \propto W_{t-1}^{(i)}\, \frac{f(X_t^{(i)} \mid X_{t-1}^{(i)})\, g(y_t \mid X_t^{(i)})}{q(X_t^{(i)} \mid y_t, X_{t-1}^{(i)})}.$$

If ESS $< N/2$, resample $X_{1:t}^{(i)} \sim \widehat{p}(x_{1:t} \mid y_{1:t})$ and set $W_t^{(i)} = \frac{1}{N}$ to obtain
$$\overline{p}(x_{1:t} \mid y_{1:t}) = \frac{1}{N} \sum_{i=1}^{N} \delta_{X_{1:t}^{(i)}}(x_{1:t}).$$
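The general algorithm above can be sketched in a few lines. This is a minimal illustration (our own function names, not reference code from the lectures): one weight-propagate-resample step with an arbitrary proposal $q$, using the $N/2$ ESS threshold from the slide, demonstrated with the bootstrap choice $q = f$ on a scalar random-walk model.

```python
import math
import random

def normalize(logw):
    """Exp-normalize a list of log-weights."""
    m = max(logw)
    w = [math.exp(lw - m) for lw in logw]
    s = sum(w)
    return [wi / s for wi in w]

def smc_step(xs, ws, y, q_sample, q_logpdf, f_logpdf, g_logpdf):
    """One SMC step with an arbitrary proposal q(x_t | y_t, x_{t-1})."""
    N = len(xs)
    new_xs, logw = [], []
    for x_prev, w_prev in zip(xs, ws):
        x = q_sample(y, x_prev)
        # incremental weight: f * g / q, accumulated in log-space
        lw = (math.log(w_prev)
              + f_logpdf(x, x_prev) + g_logpdf(y, x)
              - q_logpdf(x, y, x_prev))
        new_xs.append(x)
        logw.append(lw)
    w = normalize(logw)
    ess = 1.0 / sum(wi * wi for wi in w)
    if ess < N / 2:  # resample only when the ESS criterion triggers
        new_xs = random.choices(new_xs, weights=w, k=N)
        w = [1.0 / N] * N
    return new_xs, w, ess

# Bootstrap choice q = f on X_t = X_{t-1} + V_t, Y_t = X_t + W_t, unit noise.
def f_logpdf(x, x_prev):
    return -0.5 * ((x - x_prev) ** 2 + math.log(2 * math.pi))

def g_logpdf(y, x):
    return -0.5 * ((y - x) ** 2 + math.log(2 * math.pi))

def q_sample(y, x_prev):
    return random.gauss(x_prev, 1.0)

def q_logpdf(x, y, x_prev):
    return f_logpdf(x, x_prev)
```

With $q = f$ the $f$ and $q$ terms cancel and the increment reduces to the likelihood $g$, recovering the bootstrap filter as a special case of the generic formula.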


Building Proposals

Our aim is to select $q(x_t \mid y_t, x_{t-1})$ as close as possible to $p(x_t \mid y_t, x_{t-1})$, as this minimizes the variance of
$$w(x_{t-1}, x_t, y_t) = \frac{g(y_t \mid x_t)\, f(x_t \mid x_{t-1})}{q(x_t \mid y_t, x_{t-1})}.$$

Example - EKF proposal: let $X_t = \varphi(X_{t-1}) + V_t$, $Y_t = \Psi(X_t) + W_t$, with $V_t \sim \mathcal{N}(0, \Sigma_v)$, $W_t \sim \mathcal{N}(0, \Sigma_w)$. We perform the local linearization
$$Y_t \approx \Psi(\varphi(X_{t-1})) + \left.\frac{\partial \Psi(x)}{\partial x}\right|_{\varphi(X_{t-1})} \left(X_t - \varphi(X_{t-1})\right) + W_t,$$
which yields a Gaussian approximation $\widehat{g}(y_t \mid x_t)$ of the likelihood, and use as a proposal
$$q(x_t \mid y_t, x_{t-1}) \propto \widehat{g}(y_t \mid x_t)\, f(x_t \mid x_{t-1}).$$

Any standard suboptimal filtering method can be used: unscented particle filter, Gaussian quadrature particle filter, etc.
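In the scalar case the linearized proposal reduces to a single EKF-style update of the prior $\mathcal{N}(\varphi(x_{t-1}), \sigma_v^2)$. A sketch under that assumption (our own function names; the derivative $\Psi'$ is supplied by the user):

```python
def ekf_proposal(y_t, x_prev, phi, psi, dpsi, var_v, var_w):
    """Gaussian proposal q(x_t | y_t, x_{t-1}) obtained by linearizing
    Psi around the predicted mean phi(x_{t-1}) (scalar case)."""
    mu = phi(x_prev)                 # prior mean of x_t
    H = dpsi(mu)                     # derivative of Psi at the linearization point
    S = H * H * var_v + var_w        # innovation variance
    K = var_v * H / S                # Kalman-style gain
    mean = mu + K * (y_t - psi(mu))  # corrected mean
    var = (1.0 - K * H) * var_v      # corrected variance
    return mean, var
```

When $\Psi$ is the identity, the linearization is exact and this recovers the optimal proposal of the linear-Gaussian example, so the EKF proposal can be seen as a local-Gaussian approximation of $p(x_t \mid y_t, x_{t-1})$.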


Implicit Proposals

Proposed recently by Chorin (2012). Let
$$F(x_{t-1}, x_t) = \log g(y_t \mid x_t) + \log f(x_t \mid x_{t-1})$$
and
$$x_t^* = \arg\max_{x_t} F(x_{t-1}, x_t) = \arg\max_{x_t} p(x_t \mid y_t, x_{t-1}).$$

We sample $Z \sim \mathcal{N}(0, I_{n_x})$, then we solve in $X_t$
$$F(x_{t-1}, x_t^*) - F(x_{t-1}, X_t) = \frac{1}{2} Z^{\mathsf{T}} Z.$$

If there is a unique solution,
$$q(x_t \mid y_t, x_{t-1}) = p_Z(z)\, \left|\det \frac{\partial z}{\partial x_t}\right| \propto \frac{\exp\left(-F(x_{t-1}, x_t^*)\right)\, g(y_t \mid x_t)\, f(x_t \mid x_{t-1})}{\left|\det \frac{\partial x_t}{\partial z}\right|}.$$

The incremental weight is
$$\frac{g(y_t \mid x_t)\, f(x_t \mid x_{t-1})}{q(x_t \mid y_t, x_{t-1})} \propto \left|\det \frac{\partial x_t}{\partial z}\right| \exp\left(F(x_{t-1}, x_t^*)\right).$$
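A one-dimensional sketch of the implicit sampling map, using our own bisection-based solver (not Chorin's implementation); it assumes $F$ is unimodal in $x_t$, so the level-set equation has exactly one solution on each side of the mode, and the sign of $z$ selects the side:

```python
import math

def implicit_sample(F, x_star, z, tol=1e-12):
    """Solve F(x_star) - F(x) = z^2 / 2 for x on the side given by sign(z).
    F must be unimodal with maximizer x_star (1-D case)."""
    if z == 0.0:
        return x_star
    target = F(x_star) - 0.5 * z * z
    # expand a bracket away from the mode until F drops below the target level
    step = 1.0
    outer = x_star + math.copysign(step, z)
    while F(outer) > target:
        step *= 2.0
        outer = x_star + math.copysign(step, z)
    inner = x_star
    # bisection: F >= target at inner, F <= target at outer
    for _ in range(200):
        mid = 0.5 * (inner + outer)
        if F(mid) > target:
            inner = mid
        else:
            outer = mid
        if abs(outer - inner) < tol:
            break
    return 0.5 * (inner + outer)
```

For a Gaussian $F$ (linear-Gaussian model) the map is exact: $x = x_t^* + \sigma z$, the proposal coincides with $p(x_t \mid y_t, x_{t-1})$, and the incremental weight is constant across particles.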


Auxiliary Particle Filters

Popular variation introduced by Pitt & Shephard (1999).

This corresponds to a standard SMC algorithm (Johansen & D., 2008) where we target
$$\widehat{p}(x_{1:t} \mid y_{1:t+1}) \propto p(x_{1:t} \mid y_{1:t})\, \widehat{p}(y_{t+1} \mid x_t), \quad \text{where } \widehat{p}(y_{t+1} \mid x_t) \approx p(y_{t+1} \mid x_t),$$
using a proposal $\widehat{p}(x_t \mid y_t, x_{t-1})$.

When $\widehat{p}(y_{t+1} \mid x_t) = p(y_{t+1} \mid x_t)$ and $\widehat{p}(x_{t+1} \mid y_{t+1}, x_t) = p(x_{t+1} \mid y_{t+1}, x_t)$, we are back to perfect adaptation.
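One common instantiation, sketched below with our own function names (not Pitt & Shephard's code), takes $q = f$ and the lookahead approximation $\widehat{p}(y_{t+1} \mid x_t) = g(y_{t+1} \mid \varphi(x_t))$, i.e. the likelihood evaluated at the predicted mean. Particles are reweighted by the lookahead term before resampling, propagated, then corrected:

```python
import math
import random

def apf_step(xs, ws, y_next, phi, g_pdf, f_sample):
    """One auxiliary particle filter step with q = f and
    p_hat(y_{t+1} | x_t) = g(y_{t+1} | phi(x_t))."""
    N = len(xs)
    # first-stage weights: look ahead to y_{t+1} before resampling
    lam = [w * g_pdf(y_next, phi(x)) for x, w in zip(xs, ws)]
    s = sum(lam)
    lam = [l / s for l in lam]
    idx = random.choices(range(N), weights=lam, k=N)
    # propagate the resampled ancestors, then correct the lookahead
    new_xs, w2 = [], []
    for i in idx:
        x = f_sample(xs[i])
        new_xs.append(x)
        w2.append(g_pdf(y_next, x) / g_pdf(y_next, phi(xs[i])))
    s2 = sum(w2)
    return new_xs, [w / s2 for w in w2]
```

The second-stage weights cancel the first-stage approximation, so the scheme remains consistent even when $\widehat{p}(y_{t+1} \mid x_t)$ is crude; a better lookahead simply reduces the weight variance.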


Block Sampling Proposals

Problem: we only sample $X_t$ at time $t$, so even if we use $p(x_t \mid y_t, x_{t-1})$, the SMC estimates can have high variance if $\mathbb{V}_{p(x_{t-1} \mid y_{1:t-1})}\left[p(y_t \mid x_{t-1})\right]$ is high.

Block sampling idea: allow yourself to sample $X_{t-L+1:t-1}$ again, as well as $X_t$, in light of $y_t$. Optimally, we would like at time $t$ to sample
$$X_{t-L+1:t}^{(i)} \sim p\left(x_{t-L+1:t} \mid y_{t-L+1:t}, X_{t-L}^{(i)}\right)$$
with weights
$$W_t^{(i)} \propto W_{t-1}^{(i)}\, \frac{p\left(X_{1:t}^{(i)} \mid y_{1:t}\right)}{p\left(X_{1:t-L}^{(i)} \mid y_{1:t-1}\right)\, p\left(X_{t-L+1:t}^{(i)} \mid y_{t-L+1:t}, X_{t-L}^{(i)}\right)} = W_{t-1}^{(i)}\, p\left(y_t \mid y_{t-L+1:t-1}, X_{t-L}^{(i)}\right).$$

When $p(x_{t-L+1:t} \mid y_{t-L+1:t}, x_{t-L})$ and $p(y_t \mid y_{t-L+1:t-1}, x_{t-L})$ are not available, we can use analytical approximations of them and still have consistent estimates (D., Briers & Sénécal, 2006).


Block Sampling Proposals

The computational cost is increased from $O(N)$ to $O(LN)$, so is it worth it?

Consider the ideal scenario where
$$X_t = X_{t-1} + V_t, \qquad Y_t = X_t + W_t$$
with $X_1 \sim \mathcal{N}(0, 1)$ and $V_t, W_t$ i.i.d. $\mathcal{N}(0, 1)$.

In this case, we have
$$\left| p(y_t \mid y_{t-L+1:t-1}, x_{t-L}) - p(y_t \mid y_{t-L+1:t-1}, x_{t-L}') \right| < c\, \frac{\left| x_{t-L} - x_{t-L}' \right|}{2^L},$$
where the rate of exponential convergence depends on the signal-to-noise ratio when more general Gaussian AR models are considered.

We can obtain an analytic expression for the variance of the (normalized) weight.
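The geometric forgetting of $x_{t-L}$ can be checked numerically: for this model, $p(y_t \mid y_{t-L+1:t-1}, x_{t-L})$ is Gaussian and computable with a Kalman filter initialized at a point mass at $x_{t-L}$. A small sketch (our own code, illustrating the $2^{-L}$-type decay; unit variances as on the slide):

```python
def predictive_given_start(x0, intermediate_ys):
    """Mean and variance of p(y_t | y_{t-L+1:t-1}, x_{t-L} = x0) for
    X_t = X_{t-1} + V_t, Y_t = X_t + W_t with unit noise variances,
    via a Kalman filter started from a point mass at x0."""
    m, P = x0, 0.0
    for y in intermediate_ys:
        mp, Pp = m, P + 1.0        # predict through the random-walk dynamics
        K = Pp / (Pp + 1.0)        # Kalman gain for the unit-noise observation
        m, P = mp + K * (y - mp), (1.0 - K) * Pp
    # one final prediction, then add dynamics and observation noise for y_t
    return m, P + 1.0 + 1.0

def start_sensitivity(L, dx=4.0):
    """|difference of predictive means| for two starting points dx apart,
    after conditioning on L-1 intermediate observations."""
    ys = [0.0] * (L - 1)           # the decay rate does not depend on the ys
    m0, _ = predictive_given_start(0.0, ys)
    m1, _ = predictive_given_start(dx, ys)
    return abs(m1 - m0)
```

Each update multiplies the sensitivity to the starting point by $1 - K \le 1/2$ (the first factor is exactly $1/2$, and the factors decrease toward $\approx 0.38$ at the steady state), which is the source of the $2^{-L}$ bound above.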

Block Sampling Proposals

[Figure: variance of the incremental weight with respect to $p(x_{1:t-L} \mid y_{1:t-1})$, for different block lengths $L$.]