Sequential Monte Carlo Methods for Bayesian Computation

A. Doucet (MLSS, Kyoto, Sept. 2012)

Motivating Example 1: Generic Bayesian Model

Let X be a vector parameter of interest with an associated prior μ; i.e. X ~ μ(·).

We observe a realization y of Y, which is assumed to satisfy Y | (X = x) ~ g(· | x); i.e. the likelihood function is g(y | x).

Bayesian inference on X relies on the posterior of X given Y = y:

  p(x | y) = μ(x) g(y | x) / p(y),

where the marginal likelihood/evidence satisfies

  p(y) = ∫ μ(x) g(y | x) dx.

Machine learning examples: Latent Dirichlet Allocation, (Hierarchical) Dirichlet processes, ...
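As a sanity check, the posterior and the evidence can be computed numerically in one dimension. The sketch below (all numerical settings are illustrative choices, not from the slides) uses a Gaussian prior μ = N(0, 1) and Gaussian likelihood g(y | x) = N(y; x, 1), for which conjugacy gives the exact answers p(y) = N(y; 0, 2) and posterior mean y/2:

```python
import numpy as np

# Illustrative 1-D Bayesian model: prior mu = N(0, 1), likelihood g(y|x) = N(y; x, 1).
def normal_pdf(z, mean, var):
    return np.exp(-0.5 * (z - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

y = 1.5                              # observed realization of Y
x = np.linspace(-10, 10, 20001)      # grid over the parameter space
dx = x[1] - x[0]

prior = normal_pdf(x, 0.0, 1.0)
lik = normal_pdf(y, x, 1.0)

evidence = np.sum(prior * lik) * dx  # p(y) = ∫ mu(x) g(y|x) dx  (Riemann sum)
posterior = prior * lik / evidence   # p(x|y) = mu(x) g(y|x) / p(y)
post_mean = np.sum(x * posterior) * dx

# Conjugate exact values for comparison: p(y) = N(y; 0, 2), E[X|y] = y/2.
print(evidence, post_mean)
```

Such grid approximations are only feasible in very low dimension, which is exactly why the Monte Carlo methods discussed in these lectures are needed.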

Motivating Example 2: State-Space Models

Let {X_t}_{t≥1} be a latent/hidden Markov process with X_1 ~ μ(·) and X_t | (X_{t-1} = x) ~ f(· | x).

Let {Y_t}_{t≥1} be an observation process such that the observations are conditionally independent given {X_t}_{t≥1} and Y_t | (X_t = x) ~ g(· | x).

Let z_{i:j} := (z_i, z_{i+1}, ..., z_j); then Bayesian inference on X_{1:t} relies on the posterior of X_{1:t} given Y_{1:t} = y_{1:t}:

  p(x_{1:t} | y_{1:t}) = p(x_{1:t}, y_{1:t}) / p(y_{1:t}),

where the marginal likelihood/evidence satisfies

  p(y_{1:t}) = ∫ p(x_{1:t}, y_{1:t}) dx_{1:t}.

Machine learning examples: biochemical network models, dynamic topic models, neuroscience models, etc.

Bayesian Inference and Machine Learning

Bayesian approaches have been adopted by a large part of the ML community.

Bayesian inference offers a number of attractive advantages over conventional approaches:
- flexibility in constructing complex models from simple parts;
- the incorporation of prior knowledge is very natural;
- all modelling assumptions are made explicit;
- uncertainties over model order, model parameters and predictions are technically straightforward to compute.

The price to pay is that approximate inference techniques are necessary to approximate the resulting posterior distributions for all but trivial models.

Approximate Inference Methods

- Gaussian/Laplace approximation, local linearization, extended Kalman filters.
- Variational methods, assumed density filters.
- Expectation Propagation.
- Markov chain Monte Carlo (MCMC) methods.
- Sequential Monte Carlo (SMC) methods.

Monte Carlo Methods

Variational and EP methods are computationally cheap but perform functional approximations of the posteriors of interest.

Both MCMC and SMC are asymptotically (as you increase the computational effort) bias-free but computationally expensive.

MCMC has been the tool of choice in Bayesian computation for over 20 years, whereas SMC has been widely used for 15 years in vision and robotics.

The development of new methodology, combined with the emergence of cheap multicore architectures, now makes SMC a powerful alternative/complementary approach to MCMC for addressing general Bayesian computational problems.

The aim of these lectures is to provide an introduction to this active research field and to discuss some open research problems.

Some References and Resources

A.D., J.F.G. De Freitas & N.J. Gordon (editors), Sequential Monte Carlo Methods in Practice, Springer-Verlag: New York.

P. Del Moral, Feynman-Kac Formulae: Genealogical and Interacting Particle Systems with Applications, Springer-Verlag: New York.

O. Cappé, E. Moulines & T. Ryden, Hidden Markov Models, Springer-Verlag: New York.

Webpage with links to papers and codes:

Thousands of papers on the subject appear every year.

Organization of Lectures

State-Space Models (approx. 4 hours):
- SMC filtering and smoothing
- Maximum likelihood parameter inference
- Bayesian parameter inference

Beyond State-Space Models (approx. 2 hours):
- SMC methods for generic sequences of target distributions
- SMC samplers
- Approximate Bayesian Computation
- Optimal design, optimal control

State-Space Models

Let {X_t}_{t≥1} be a latent/hidden X-valued Markov process with X_1 ~ μ(·) and X_t | (X_{t-1} = x) ~ f(· | x).

Let {Y_t}_{t≥1} be a Y-valued observation process such that the observations are conditionally independent given {X_t}_{t≥1} and Y_t | (X_t = x) ~ g(· | x).

This is a general class of time series models, aka Hidden Markov Models (HMM), including

  X_t = Ψ(X_{t-1}, V_t),   Y_t = Φ(X_t, W_t),

where {V_t} and {W_t} are two sequences of i.i.d. random variables.

Aim: infer {X_t} given the observations {Y_t}, on-line or off-line.
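The Ψ/Φ formulation maps directly onto code: simulating a state-space model only requires the two functions and i.i.d. noise. A minimal NumPy sketch, where the linear instance Ψ(x, v) = 0.9x + v, Φ(x, w) = x + w and all parameter values are illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative instance: Psi(x, v) = 0.9 x + v, Phi(x, w) = x + w, Gaussian noise.
def Psi(x, v):
    return 0.9 * x + v

def Phi(x, w):
    return x + w

def simulate(T, rng):
    """Draw (X_1:T, Y_1:T) from the state-space model, with X_1 ~ mu = N(0, 1)."""
    x = rng.normal()
    xs, ys = np.empty(T), np.empty(T)
    for t in range(T):
        xs[t] = x
        ys[t] = Phi(x, rng.normal())  # Y_t | (X_t = x) ~ g(.|x)
        x = Psi(x, rng.normal())      # X_{t+1} | (X_t = x) ~ f(.|x)
    return xs, ys

xs, ys = simulate(200, rng)
```

Any choice of Ψ, Φ and noise distributions gives another member of the class; only these two lines of the loop change.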

State-Space Models

State-space models are ubiquitous in control, data mining, econometrics, geosciences, systems biology, etc. Since Jan. 2012, more than 13,500 papers have already appeared (source: Google Scholar).

Finite state-space HMM: X is a finite space, i.e. {X_t} is a finite Markov chain, and Y_t | (X_t = x) ~ g(· | x).

Linear Gaussian state-space model:

  X_t = A X_{t-1} + B V_t,   V_t i.i.d. N(0, I),
  Y_t = C X_t + D W_t,   W_t i.i.d. N(0, I).

Switching linear Gaussian state-space model: X_t = (X_t^1, X_t^2), where {X_t^1} is a finite Markov chain and

  X_t^2 = A(X_t^1) X_{t-1}^2 + B(X_t^1) V_t,   V_t i.i.d. N(0, I),
  Y_t = C(X_t^1) X_t^2 + D(X_t^1) W_t,   W_t i.i.d. N(0, I).

State-Space Models

Stochastic volatility model:

  X_t = φ X_{t-1} + σ V_t,   V_t i.i.d. N(0, 1),
  Y_t = β exp(X_t / 2) W_t,   W_t i.i.d. N(0, 1).

Biochemical network model:

  Pr(X_{t+dt}^1 = x_t^1 + 1, X_{t+dt}^2 = x_t^2 | x_t^1, x_t^2) = α x_t^1 dt + o(dt),
  Pr(X_{t+dt}^1 = x_t^1 - 1, X_{t+dt}^2 = x_t^2 + 1 | x_t^1, x_t^2) = β x_t^1 x_t^2 dt + o(dt),
  Pr(X_{t+dt}^1 = x_t^1, X_{t+dt}^2 = x_t^2 - 1 | x_t^1, x_t^2) = γ x_t^2 dt + o(dt),

with Y_k = X_{kT}^1 + W_k, where W_k i.i.d. N(0, σ²).

Nonlinear diffusion model:

  dX_t = α(X_t) dt + β(X_t) dV_t,   V_t Brownian motion,
  Y_k = γ(X_{kT}) + W_k,   W_k i.i.d. N(0, σ²).
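The stochastic volatility model is easy to simulate forward, which is all that SMC will later require of the transition. A sketch with illustrative parameter values (φ = 0.95, σ = 0.3, β = 0.6 are not taken from the slides):

```python
import numpy as np

def simulate_sv(T, phi=0.95, sigma=0.3, beta=0.6, seed=1):
    """Simulate the stochastic volatility model
       X_t = phi X_{t-1} + sigma V_t,  Y_t = beta exp(X_t / 2) W_t."""
    rng = np.random.default_rng(seed)
    # Start X_1 from the stationary distribution N(0, sigma^2 / (1 - phi^2)).
    x = rng.normal(scale=sigma / np.sqrt(1 - phi ** 2))
    xs, ys = np.empty(T), np.empty(T)
    for t in range(T):
        xs[t] = x
        ys[t] = beta * np.exp(x / 2) * rng.normal()  # observation: scale depends on state
        x = phi * x + sigma * rng.normal()           # AR(1) log-volatility transition
    return xs, ys

xs, ys = simulate_sv(1000)
```

The observations are zero-mean but heteroscedastic: large |Y_t| values cluster when the latent log-volatility X_t is high, which is the feature of financial returns the model is built to capture.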

Inference in State-Space Models

Given observations y_{1:t} := (y_1, y_2, ..., y_t), inference about X_{1:t} := (X_1, ..., X_t) relies on the posterior

  p(x_{1:t} | y_{1:t}) = p(x_{1:t}, y_{1:t}) / p(y_{1:t}),

where

  p(x_{1:t}, y_{1:t}) = μ(x_1) ∏_{k=2}^t f(x_k | x_{k-1}) × ∏_{k=1}^t g(y_k | x_k),

the first two factors forming the prior p(x_{1:t}) and the last the likelihood p(y_{1:t} | x_{1:t}), and

  p(y_{1:t}) = ∫ p(x_{1:t}, y_{1:t}) dx_{1:t}.

When X is finite, and for linear Gaussian models, {p(x_t | y_{1:t})}_{t≥1} can be computed exactly. For non-linear models, approximations are required: EKF, UKF, Gaussian sum filters, etc.

Approximations of {p(x_t | y_{1:t})}_{t=1}^T provide an approximation of p(x_{1:T} | y_{1:T}).

Monte Carlo Methods Basics

Assume you can generate X_{1:t}^(i) ~ p(x_{1:t} | y_{1:t}), i = 1, ..., N; then the MC approximation is

  p̂(x_{1:t} | y_{1:t}) = (1/N) Σ_{i=1}^N δ_{X_{1:t}^(i)}(x_{1:t}).

Integration is straightforward:

  ∫ φ_t(x_{1:t}) p(x_{1:t} | y_{1:t}) dx_{1:t} ≈ ∫ φ_t(x_{1:t}) p̂(x_{1:t} | y_{1:t}) dx_{1:t} = (1/N) Σ_{i=1}^N φ_t(X_{1:t}^(i)).

Marginalization is straightforward:

  p̂(x_k | y_{1:t}) = ∫ p̂(x_{1:t} | y_{1:t}) dx_{1:k-1} dx_{k+1:t} = (1/N) Σ_{i=1}^N δ_{X_k^(i)}(x_k).

Basic and key property:

  V[ (1/N) Σ_{i=1}^N φ_t(X_{1:t}^(i)) ] = C(t, dim(X)) / N,

i.e. the 1/N rate of convergence to zero is independent of dim(X) and t.
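The 1/N variance decay is easy to verify numerically. In the illustrative sketch below (the target and test function are my choices, not from the slides), X ~ N(0, I_d) in d = 10 dimensions and φ(x) = Σ_k x_k², so E[φ(X)] = d exactly; quadrupling N should roughly halve the root-mean-square error:

```python
import numpy as np

rng = np.random.default_rng(42)
d = 10                                    # dimension; the O(1/N) rate does not depend on it
phi = lambda x: np.sum(x ** 2, axis=-1)   # test function with E[phi(X)] = d for X ~ N(0, I_d)

def mc_error(N, reps=200):
    """Root-mean-square error of the N-sample MC estimate of E[phi(X)]."""
    estimates = np.array([phi(rng.normal(size=(N, d))).mean() for _ in range(reps)])
    return np.sqrt(np.mean((estimates - d) ** 2))

# Variance ~ C / N, so RMSE ~ sqrt(C) / sqrt(N): the ratio below should be near 2.
e1, e2 = mc_error(250), mc_error(1000)
print(e1 / e2)
```

The constant C depends on the test function (and here on d through Var[φ(X)] = 2d), but the 1/N decay itself does not.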

Monte Carlo Methods

Problem 1: we cannot typically generate exact samples from p(x_{1:t} | y_{1:t}) for non-linear non-Gaussian models.

Problem 2: even if we could, algorithms generating samples from p(x_{1:t} | y_{1:t}) would have complexity at least O(t).

The typical solution to Problem 1 is to generate approximate samples using MCMC methods, but these methods are not recursive.

SMC methods partially solve Problems 1 and 2 by breaking the problem of sampling from p(x_{1:t} | y_{1:t}) into a collection of simpler subproblems: first approximate p(x_1 | y_1) and p(y_1) at time 1, then p(x_{1:2} | y_{1:2}) and p(y_{1:2}) at time 2, and so on.

Each target distribution is approximated by a cloud of random samples, termed particles, evolving through importance sampling and resampling steps.

Standard Bayesian Recursion

In most textbooks, you will find the following recursion for {p(x_t | y_{1:t})}_{t≥1}.

Prediction step:

  p(x_t | y_{1:t-1}) = ∫ p(x_{t-1}, x_t | y_{1:t-1}) dx_{t-1}
                     = ∫ p(x_t | y_{1:t-1}, x_{t-1}) p(x_{t-1} | y_{1:t-1}) dx_{t-1}
                     = ∫ f(x_t | x_{t-1}) p(x_{t-1} | y_{1:t-1}) dx_{t-1}.

Bayes updating step:

  p(x_t | y_{1:t}) = g(y_t | x_t) p(x_t | y_{1:t-1}) / p(y_t | y_{1:t-1}),

where

  p(y_t | y_{1:t-1}) = ∫ g(y_t | x_t) p(x_t | y_{1:t-1}) dx_t.

This is the recursion implemented by the Wonham and Kalman filters...
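When X is a finite space, the prediction and updating steps are exact matrix-vector operations. A sketch for an illustrative 2-state chain with Gaussian emissions (all numbers are my choices, not from the slides):

```python
import numpy as np

# Illustrative 2-state HMM: transition matrix F, emissions g(y | x) = N(y; means[x], 1).
F = np.array([[0.9, 0.1],
              [0.2, 0.8]])       # F[i, j] = f(x_t = j | x_{t-1} = i)
means = np.array([-1.0, 2.0])    # emission mean for each state
mu = np.array([0.5, 0.5])        # initial distribution of X_1

def g(y):
    """Emission likelihoods g(y | x) evaluated for every state x."""
    return np.exp(-0.5 * (y - means) ** 2) / np.sqrt(2 * np.pi)

def filter_step(p_prev, y):
    """One prediction + Bayes updating step.
    Returns p(x_t | y_1:t) and p(y_t | y_1:t-1)."""
    pred = p_prev @ F            # prediction: sum over x_{t-1} of f(x_t|x_{t-1}) p(x_{t-1}|...)
    lik = np.dot(g(y), pred)     # p(y_t | y_1:t-1)
    return g(y) * pred / lik, lik

ys = [1.8, 2.2, -0.9]
p = mu
for t, y in enumerate(ys):
    if t == 0:                   # at t = 1 the "predictive" is just the prior mu
        lik = np.dot(g(y), p)
        p = g(y) * p / lik
    else:
        p, lik = filter_step(p, y)
print(p)                         # p(x_3 | y_1:3)
```

The last observation, -0.9, is close to the state-0 emission mean, so the filter ends up favouring state 0 despite the earlier observations favouring state 1.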

Bayesian Recursion on Path Space

SMC directly approximates {p(x_{1:t} | y_{1:t})}_{t≥1}, not {p(x_t | y_{1:t})}_{t≥1}, and relies on

  p(x_{1:t} | y_{1:t}) = p(x_{1:t}, y_{1:t}) / p(y_{1:t})
                       = g(y_t | x_t) f(x_t | x_{t-1}) p(x_{1:t-1}, y_{1:t-1}) / [p(y_t | y_{1:t-1}) p(y_{1:t-1})]
                       = g(y_t | x_t) [f(x_t | x_{t-1}) p(x_{1:t-1} | y_{1:t-1})] / p(y_t | y_{1:t-1}),

where the bracketed term is the predictive p(x_{1:t} | y_{1:t-1}) and

  p(y_t | y_{1:t-1}) = ∫ g(y_t | x_t) p(x_{1:t} | y_{1:t-1}) dx_{1:t}.

This can alternatively be written as

  Prediction: p(x_{1:t} | y_{1:t-1}) = f(x_t | x_{t-1}) p(x_{1:t-1} | y_{1:t-1}),
  Update: p(x_{1:t} | y_{1:t}) = g(y_t | x_t) p(x_{1:t} | y_{1:t-1}) / p(y_t | y_{1:t-1}).

SMC is a simple and natural simulation-based implementation of this recursion.

Monte Carlo Implementation of Prediction Step

Assume that at time t-1 you have

  p̂(x_{1:t-1} | y_{1:t-1}) = (1/N) Σ_{i=1}^N δ_{X_{1:t-1}^(i)}(x_{1:t-1}).

By sampling X_t^(i) ~ f(x_t | X_{t-1}^(i)) and setting X_{1:t}^(i) = (X_{1:t-1}^(i), X_t^(i)), we obtain

  p̂(x_{1:t} | y_{1:t-1}) = (1/N) Σ_{i=1}^N δ_{X_{1:t}^(i)}(x_{1:t}).

Sampling from f(x_t | x_{t-1}) is usually straightforward, and it can be done even if f(x_t | x_{t-1}) does not admit any analytical expression, e.g. in biochemical network models.

Importance Sampling Implementation of Updating Step

Our target at time t is

  p(x_{1:t} | y_{1:t}) = g(y_t | x_t) p(x_{1:t} | y_{1:t-1}) / p(y_t | y_{1:t-1}),

so by substituting p̂(x_{1:t} | y_{1:t-1}) for p(x_{1:t} | y_{1:t-1}) we obtain

  p̂(y_t | y_{1:t-1}) = ∫ g(y_t | x_t) p̂(x_{1:t} | y_{1:t-1}) dx_{1:t} = (1/N) Σ_{i=1}^N g(y_t | X_t^(i)).

We now have

  p̄(x_{1:t} | y_{1:t}) = g(y_t | x_t) p̂(x_{1:t} | y_{1:t-1}) / p̂(y_t | y_{1:t-1}) = Σ_{i=1}^N W_t^(i) δ_{X_{1:t}^(i)}(x_{1:t}),

with W_t^(i) ∝ g(y_t | X_t^(i)) and Σ_{i=1}^N W_t^(i) = 1.

Multinomial Resampling

We have a weighted approximation p̄(x_{1:t} | y_{1:t}) of p(x_{1:t} | y_{1:t}):

  p̄(x_{1:t} | y_{1:t}) = Σ_{i=1}^N W_t^(i) δ_{X_{1:t}^(i)}(x_{1:t}).

To obtain N samples X̄_{1:t}^(i) approximately distributed according to p(x_{1:t} | y_{1:t}), resample N times with replacement:

  X̄_{1:t}^(i) ~ p̄(x_{1:t} | y_{1:t}),

giving

  p̂(x_{1:t} | y_{1:t}) = (1/N) Σ_{i=1}^N δ_{X̄_{1:t}^(i)}(x_{1:t}) = Σ_{i=1}^N (N_t^(i)/N) δ_{X_{1:t}^(i)}(x_{1:t}),

where {N_t^(i)} follows a multinomial distribution with

  E[N_t^(i)] = N W_t^(i),   V[N_t^(i)] = N W_t^(i) (1 - W_t^(i)).

This can be achieved in O(N) operations.
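Multinomial resampling amounts to one draw of counts (N_t^(1), ..., N_t^(N)) and a repeat; the sketch below (with illustrative particles and weights) checks the E[N_t^(i)] = N W_t^(i) property empirically rather than showing an O(N) sorted-uniforms implementation:

```python
import numpy as np

rng = np.random.default_rng(7)

def multinomial_resample(particles, weights, rng):
    """Resample N particles with replacement: particle i is duplicated N_t^(i) times,
    where (N_t^(1), ..., N_t^(N)) is multinomial with E[N_t^(i)] = N W^(i)."""
    N = len(particles)
    counts = rng.multinomial(N, weights)
    return np.repeat(particles, counts), counts

particles = np.array([-1.0, 0.0, 2.0, 5.0])
weights = np.array([0.1, 0.2, 0.3, 0.4])

# Average the counts over many independent resampling rounds.
reps = 20000
total = np.zeros(len(particles))
for _ in range(reps):
    resampled, counts = multinomial_resample(particles, weights, rng)
    total += counts
freq = total / (reps * len(particles))
print(freq)   # empirical E[N^(i)] / N, close to the weights
```

In practice the O(N) guarantee quoted on the slide is obtained with sorted uniforms or inverse-CDF tricks; `rng.multinomial` plus `np.repeat` is the shortest correct version for exposition.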

Vanilla SMC: Bootstrap Filter (Gordon et al., 1993)

At time $t = 1$:
- Sample $X_1^{(i)} \sim \mu(x_1)$, then
$$\hat p(x_1|y_1) = \sum_{i=1}^N W_1^{(i)}\, \delta_{X_1^{(i)}}(x_1), \qquad W_1^{(i)} \propto g\big(y_1|X_1^{(i)}\big).$$
- Resample $\overline X_1^{(i)} \sim \hat p(x_1|y_1)$ to obtain $\overline p(x_1|y_1) = \frac{1}{N}\sum_{i=1}^N \delta_{\overline X_1^{(i)}}(x_1)$.

At time $t \geq 2$:
- Sample $X_t^{(i)} \sim f\big(x_t|\overline X_{t-1}^{(i)}\big)$, set $X_{1:t}^{(i)} = \big(\overline X_{1:t-1}^{(i)}, X_t^{(i)}\big)$ and
$$\hat p(x_{1:t}|y_{1:t}) = \sum_{i=1}^N W_t^{(i)}\, \delta_{X_{1:t}^{(i)}}(x_{1:t}), \qquad W_t^{(i)} \propto g\big(y_t|X_t^{(i)}\big).$$
- Resample $\overline X_{1:t}^{(i)} \sim \hat p(x_{1:t}|y_{1:t})$ to obtain $\overline p(x_{1:t}|y_{1:t}) = \frac{1}{N}\sum_{i=1}^N \delta_{\overline X_{1:t}^{(i)}}(x_{1:t})$.
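The two steps above can be sketched for a scalar linear Gaussian model (an illustration of mine, not the lecture's code; the parameters `phi`, `sv`, `sw` are arbitrary choices), which lets the filtering means be checked against the exact Kalman filter:

```python
import numpy as np

def bootstrap_filter(y, N, rng, phi=0.9, sv=1.0, sw=1.0):
    """Bootstrap filter for the illustrative model
        X_1 ~ N(0, sv^2/(1-phi^2)),  X_t = phi*X_{t-1} + V_t,  Y_t = X_t + W_t,
    with V_t ~ N(0, sv^2), W_t ~ N(0, sw^2).  Returns the filtering means
    E[X_t | y_{1:t}] and the log marginal likelihood estimate."""
    T = len(y)
    x = rng.normal(0.0, sv / np.sqrt(1.0 - phi**2), size=N)   # X_1^(i) ~ mu
    means, logZ = np.empty(T), 0.0
    for t in range(T):
        if t > 0:
            x = phi * x + rng.normal(0.0, sv, size=N)          # propagate via f
        logw = -0.5 * ((y[t] - x) / sw) ** 2                   # log g up to a constant
        m = logw.max()
        w = np.exp(logw - m)
        logZ += m + np.log(w.mean()) - 0.5 * np.log(2.0 * np.pi * sw**2)
        w /= w.sum()
        means[t] = np.sum(w * x)                               # E[X_t | y_{1:t}]
        x = x[rng.choice(N, size=N, p=w)]                      # multinomial resampling
    return means, logZ

rng = np.random.default_rng(1)
y_obs = np.array([0.5, -0.2, 1.0, 0.3, -0.6])
means, logZ = bootstrap_filter(y_obs, N=5000, rng=rng)
```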

SMC Output

At time $t$, we get
$$\hat p(x_{1:t}|y_{1:t}) = \sum_{i=1}^N W_t^{(i)}\, \delta_{X_{1:t}^{(i)}}(x_{1:t}), \qquad \overline p(x_{1:t}|y_{1:t}) = \frac{1}{N}\sum_{i=1}^N \delta_{\overline X_{1:t}^{(i)}}(x_{1:t}).$$
The marginal likelihood estimate is given by
$$\hat p(y_{1:t}) = \prod_{k=1}^t \hat p(y_k|y_{1:k-1}) = \prod_{k=1}^t \left(\frac{1}{N}\sum_{i=1}^N g\big(y_k|X_k^{(i)}\big)\right).$$
Computational complexity is $O(N)$ at each time step and memory requirements are $O(tN)$.

If we are only interested in $p(x_t|y_{1:t})$, or in $p(s_t(x_{1:t})|y_{1:t})$ where $s_t(x_{1:t}) = \Psi_t\big(x_t, s_{t-1}(x_{1:t-1})\big)$ is fixed-dimensional (e.g. $s_t(x_{1:t}) = \sum_{k=1}^t x_k^2$), then memory requirements are only $O(N)$.
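Such a fixed-dimensional statistic can be carried along with the particles in $O(N)$ memory, updated and resampled together with them. A small illustrative sketch of mine (arbitrary AR(1) dynamics, observations fixed at $y_t = 0$, resampling at every step; not the lecture's experiment):

```python
import numpy as np

rng = np.random.default_rng(6)
N, T = 1000, 30
x = rng.normal(size=N)          # X_1^(i) from an illustrative N(0,1) prior
s = np.zeros(N)                 # s_t^(i) = sum_{k<=t} (x_k^(i))^2, O(N) memory
for t in range(T):
    if t > 0:
        x = 0.9 * x + rng.normal(size=N)    # propagate (illustrative AR(1))
    s = s + x ** 2                          # recursive update s_t = s_{t-1} + x_t^2
    w = np.exp(-0.5 * x ** 2)               # g(y_t | x_t) with y_t = 0 here
    w /= w.sum()
    idx = rng.choice(N, size=N, p=w)
    x, s = x[idx], s[idx]                   # resample statistic with the particles
est = s.mean() / T                          # SMC estimate of S_t / t
```

No full path is ever stored; the statistic travels with each particle.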

SMC on Path-Space (figures by Olivier Cappé)

[Figure sequence: filtering densities $\hat p(x_t|y_{1:t})$ with estimates $\hat E[X_t|y_{1:t}]$ (top panels) and the particle approximation of the path posterior $p(x_{1:t}|y_{1:t})$ (bottom panels), shown successively for $t = 1$; $t = 1, 2$; $t = 1, 2, 3$; $t = 1, \dots, 10$; and $t = 1, \dots, 24$.]

Remarks

Empirically, this SMC strategy performs well at estimating the marginals $\{p(x_t|y_{1:t})\}_{t\geq 1}$; thankfully, in many applications this is all that is needed.

However, the joint distribution $p(x_{1:t}|y_{1:t})$ is poorly estimated when $t$ is large; in the previous example we have $\overline p(x_{1:11}|y_{1:24}) = \delta_{\overline X_{1:11}}(x_{1:11})$.

Degeneracy problem: for any $N$ and any $k$, there exists $t(k,N)$ such that for any $t \geq t(k,N)$,
$$\hat p(x_{1:k}|y_{1:t}) = \delta_{\overline X_{1:k}}(x_{1:k});$$
$\hat p(x_{1:t}|y_{1:t})$ is an unreliable approximation of $p(x_{1:t}|y_{1:t})$ as $t$ increases.

Another Illustration of the Degeneracy Phenomenon

For the linear Gaussian state-space model described before, we can compute $S_t/t$ exactly, where
$$S_t = \int \left(\sum_{k=1}^t x_k^2\right) p(x_{1:t}|y_{1:t})\, dx_{1:t},$$
using Kalman techniques. We compare it to the SMC estimate $\hat S_t/t$, where
$$\hat S_t = \int \left(\sum_{k=1}^t x_k^2\right) \hat p(x_{1:t}|y_{1:t})\, dx_{1:t}$$
can be computed sequentially.

[Figure: $S_t/t$ obtained through the Kalman smoother (blue) and its SMC estimate $\hat S_t/t$ (red).]
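The degeneracy can also be made visible directly (a toy run of mine with an arbitrary random-walk model, not the lecture's exact experiment): with resampling at every step, the number of distinct time-1 values still represented in the surviving paths collapses as $t$ grows:

```python
import numpy as np

rng = np.random.default_rng(2)
N, T = 500, 60
paths = rng.normal(size=(N, 1))              # X_1^(i) ~ N(0, 1)
y = rng.normal(size=T)                       # arbitrary synthetic observations
unique_ancestors = []
for t in range(T):
    if t > 0:
        new = paths[:, -1:] + rng.normal(size=(N, 1))   # random-walk dynamics
        paths = np.hstack([paths, new])
    w = np.exp(-0.5 * (y[t] - paths[:, -1]) ** 2)       # g(y_t | x_t)
    w /= w.sum()
    paths = paths[rng.choice(N, size=N, p=w)]           # resample whole paths
    unique_ancestors.append(np.unique(paths[:, 0]).size)
# The count of surviving distinct X_1 values shrinks towards 1 as t grows,
# i.e. p_hat(x_1 | y_{1:t}) degenerates to a single atom.
```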

Some Convergence Results for SMC

Numerous convergence results for SMC are available; see (Del Moral, 2004).

Let $\varphi_t : \mathcal X^t \to \mathbb R$ and consider
$$\overline\varphi_t = \int \varphi_t(x_{1:t})\, p(x_{1:t}|y_{1:t})\, dx_{1:t}, \qquad \hat\varphi_t = \int \varphi_t(x_{1:t})\, \overline p(x_{1:t}|y_{1:t})\, dx_{1:t} = \frac{1}{N}\sum_{i=1}^N \varphi_t\big(\overline X_{1:t}^{(i)}\big).$$
We can prove that for any bounded function $\varphi_t$ and any $p \geq 1$,
$$E\big[\,|\hat\varphi_t - \overline\varphi_t|^p\,\big]^{1/p} \leq \frac{B(t)\, c(p)\, \|\varphi_t\|_\infty}{\sqrt N}, \qquad \lim_{N\to\infty} \sqrt N\,\big(\hat\varphi_t - \overline\varphi_t\big) \Rightarrow \mathcal N\big(0, \sigma_t^2\big).$$
These are very weak results: $B(t)$ and $\sigma_t^2$ can increase with $t$, and will do so for a path-dependent $\varphi_t(x_{1:t})$, as the degeneracy problem suggests.

Stronger Convergence Results

Assume the following exponential stability assumption: for any $x_1, x_1'$,
$$\frac{1}{2}\int \big| p(x_t|y_{2:t}, X_1 = x_1) - p(x_t|y_{2:t}, X_1 = x_1') \big|\, dx_t \leq \alpha^t \quad \text{for some } 0 \leq \alpha < 1.$$

Marginal distributions. For $\varphi_t(x_{1:t}) = \varphi(x_{t-L:t})$, there exist $B_1, B_2 < \infty$ such that
$$E\big[\,|\hat\varphi_t - \overline\varphi_t|^p\,\big]^{1/p} \leq \frac{B_1\, c(p)\, \|\varphi\|_\infty}{\sqrt N}, \qquad \lim_{N\to\infty} \sqrt N\,\big(\hat\varphi_t - \overline\varphi_t\big) \Rightarrow \mathcal N\big(0, \sigma_t^2\big)$$
where $\sigma_t^2 \leq B_2$; i.e. there is no accumulation of numerical errors over time.

L1 distance. If $\tilde p(x_{1:t}|y_{1:t}) = E\big(\hat p(x_{1:t}|y_{1:t})\big)$, there exists $B_3 < \infty$ such that
$$\int \big| \tilde p(x_{1:t}|y_{1:t}) - p(x_{1:t}|y_{1:t}) \big|\, dx_{1:t} \leq \frac{B_3\, t}{N};$$
i.e. the bias only increases linearly in $t$.

Stronger Convergence Results

Unbiasedness. The marginal likelihood estimate is unbiased: $E\big(\hat p(y_{1:t})\big) = p(y_{1:t})$.

Relative variance bound. There exists $B_4 < \infty$ such that
$$E\left[\left(\frac{\hat p(y_{1:t})}{p(y_{1:t})} - 1\right)^2\right] \leq \frac{B_4\, t}{N}.$$

Central limit theorem. There exists $B_5 < \infty$ such that
$$\lim_{N\to\infty} \sqrt N\,\big(\log \hat p(y_{1:t}) - \log p(y_{1:t})\big) \Rightarrow \mathcal N\big(0, \sigma_t^2\big) \quad \text{with } \sigma_t^2 \leq B_5\, t.$$
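The unbiasedness property can be checked numerically on a model where the evidence is computable exactly, e.g. a small two-state HMM (an illustrative example of mine with arbitrary transition and emission parameters, not from the lecture): averaging the bootstrap-filter evidence estimate over many runs should recover the forward-algorithm value.

```python
import numpy as np

rng = np.random.default_rng(3)
P = np.array([[0.9, 0.1], [0.2, 0.8]])   # transition probabilities (illustrative)
E = np.array([[0.8, 0.2], [0.3, 0.7]])   # emission probabilities g(y | x)
mu = np.array([0.5, 0.5])
y = np.array([0, 1, 0, 0, 1])

def exact_likelihood(y):
    """p(y_{1:t}) by the forward algorithm (exact for a discrete HMM)."""
    alpha = mu * E[:, y[0]]
    for yt in y[1:]:
        alpha = (alpha @ P) * E[:, yt]
    return alpha.sum()

def smc_likelihood(y, N, rng):
    """Bootstrap-filter evidence estimate prod_t (1/N) sum_i g(y_t | X_t^(i))."""
    x = rng.choice(2, size=N, p=mu)
    Z = 1.0
    for t, yt in enumerate(y):
        if t > 0:
            x = (rng.random(N) < P[x, 1]).astype(int)   # propagate via f
        w = E[x, yt]
        Z *= w.mean()
        x = x[rng.choice(N, size=N, p=w / w.sum())]
    return Z

Z_true = exact_likelihood(y)
Z_hat = np.mean([smc_likelihood(y, N=200, rng=rng) for _ in range(400)])
# Unbiasedness: E[Z_hat] = Z_true, so Z_hat should be close to Z_true.
```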

Basic Idea Used to Establish Uniform Lp Bounds

We denote $\eta_k(x_k) = p(x_k|y_{1:k-1})$ and by $\hat\eta_k(x_k) = \hat p(x_k|y_{1:k-1})$ its particle approximation.

Let $\Phi_{k,t}$ be the measure-valued mapping such that $\eta_t = \Phi_{k,t}(\eta_k)$, which satisfies
$$\Phi_{k,t}(\eta_k)(x_t) = \int \underbrace{\frac{\eta_k(x_k)\, p(y_{k:t-1}|x_k)}{\int \eta_k(x_k')\, p(y_{k:t-1}|x_k')\, dx_k'}}_{p(x_k|y_{1:t-1})}\, p(x_t|x_k, y_{k+1:t-1})\, dx_k.$$

Key Decomposition Formula

The exact flow satisfies $\eta_1 \to \eta_2 = \Phi_{1,2}(\eta_1) \to \cdots \to \eta_t = \Phi_{1,t}(\eta_1)$, while the particle system produces $\hat\eta_1 \to \hat\eta_2 \to \cdots \to \hat\eta_t$, where each $\hat\eta_k$ is a perturbation of $\Phi_{k-1,k}(\hat\eta_{k-1})$.

Decomposition of the error:
$$\hat\eta_t - \eta_t = \sum_{k=1}^t \Big[ \Phi_{k,t}(\hat\eta_k) - \Phi_{k,t}\big(\Phi_{k-1,k}(\hat\eta_{k-1})\big) \Big].$$

Stability Properties

We have
$$p(x_t|x_k, y_{k+1:t-1}) = \int p(x_{k+1:t}|x_k, y_{k+1:t-1})\, dx_{k+1:t-1}$$
where
$$p(x_{k+1:t}|x_k, y_{k+1:t-1}) = \prod_{m=k+1}^t p(x_m|x_{m-1}, y_{m:t-1}).$$
To summarize, we have
$$\Phi_{k,t}(\eta_k)(x_t) = \int \underbrace{\frac{\eta_k(x_k)\, p(y_{k:t-1}|x_k)}{\int \eta_k(x_k')\, p(y_{k:t-1}|x_k')\, dx_k'}}_{p(x_k|y_{1:t-1})} \prod_{m=k+1}^t p(x_m|x_{m-1}, y_{m:t-1})\, dx_{k:t-1}.$$

Stability Properties

Assume there exists $\epsilon > 0$ such that for any $x, x'$ and any $y$,
$$\epsilon^{-1}\, \nu(x') \leq f(x'|x) \leq \epsilon\, \nu(x'), \qquad 0 < \underline g \leq g(y|x) \leq \overline g < \infty.$$
Then there exists $0 \leq \lambda < 1$ such that
$$\frac{1}{2}\int \big| \Phi_{k,k+t}(\eta)(x) - \Phi_{k,k+t}(\eta')(x) \big|\, dx \leq \lambda^t.$$
Hence we have
$$\Phi_{k,t}(\eta_k)(x_t) \approx \Phi_{k,t}(\hat\eta_k)(x_t)$$
as $(t-k) \to \infty$: the mapping forgets its initial condition.

Putting Everything Together

Under such strong mixing assumptions,
$$\hat\eta_t - \eta_t = \sum_{k=1}^t \underbrace{\Big[ \Phi_{k,t}(\hat\eta_k) - \Phi_{k,t}\big(\Phi_{k-1,k}(\hat\eta_{k-1})\big) \Big]}_{\text{of order } \lambda^{t-k+1}/\sqrt N} \quad \text{for some } 0 \leq \lambda < 1.$$
We can then obtain results such as: there exists $B_1 < \infty$ such that
$$E\big[\,|\hat\varphi_t - \overline\varphi_t|^p\,\big]^{1/p} \leq \frac{B_1\, c(p)\, \|\varphi\|_\infty}{\sqrt N}.$$
Much work has been done recently on removing such strong mixing assumptions; see e.g. Whiteley (2012) for much weaker and more realistic assumptions.

Summary

- SMC provides consistent estimates under weak assumptions.
- Under stability assumptions, we have uniform-in-time stability of the SMC estimates of $\{p(x_t|y_{1:t})\}_{t\geq 1}$.
- Under stability assumptions, the relative variance of the SMC estimate of $\{p(y_{1:t})\}_{t\geq 1}$ only increases linearly with $t$.
- Even under stability assumptions, one cannot expect uniform-in-time stability for SMC estimates of $\{p(x_{1:t}|y_{1:t})\}_{t\geq 1}$; this is due to the degeneracy problem.
- Is it possible to (Q1) eliminate or (Q2) mitigate the degeneracy problem? Answer: Q1: no, Q2: yes.

Is Resampling Really Necessary?

Resampling is the source of the degeneracy problem and might appear wasteful.

The resampling step is an unbiased operation,
$$E\big[\, \overline p(x_{1:t}|y_{1:t}) \,\big|\, \hat p(x_{1:t}|y_{1:t}) \,\big] = \hat p(x_{1:t}|y_{1:t}),$$
but it clearly introduces some errors locally in time: for any test function $\varphi$,
$$V\left[\int \varphi(x_{1:t})\, \overline p(x_{1:t}|y_{1:t})\, dx_{1:t}\right] \geq V\left[\int \varphi(x_{1:t})\, \hat p(x_{1:t}|y_{1:t})\, dx_{1:t}\right].$$
What about eliminating the resampling step?

Sequential Importance Sampling: SMC Without Resampling

In this case, the estimate of the posterior is
$$\hat p_{\mathrm{SIS}}(x_{1:t}|y_{1:t}) = \sum_{i=1}^N W_t^{(i)}\, \delta_{X_{1:t}^{(i)}}(x_{1:t}),$$
where $X_{1:t}^{(i)} \sim p(x_{1:t})$ and
$$W_t^{(i)} \propto p\big(y_{1:t}|X_{1:t}^{(i)}\big) = \prod_{k=1}^t g\big(y_k|X_k^{(i)}\big).$$
The marginal likelihood estimate is
$$\hat p_{\mathrm{SIS}}(y_{1:t}) = \frac{1}{N}\sum_{i=1}^N p\big(y_{1:t}|X_{1:t}^{(i)}\big).$$
The relative variance of $p\big(y_{1:t}|X_{1:t}^{(i)}\big) = \prod_{k=1}^t g\big(y_k|X_k^{(i)}\big)$ increases exponentially fast with $t$...

SIS for a Stochastic Volatility Model

[Figure: histograms of the importance weights $\log_{10}\big(W_t^{(i)}\big)$ for $t = 1$ (top), $t = 50$ (middle) and $t = 100$ (bottom).]

As expected, the performance of the algorithm collapses as $t$ increases.

Central Limit Theorems

For both SIS and SMC, we have a CLT for the estimates of the marginal likelihood:
$$\sqrt N \left(\frac{\hat p_{\mathrm{SIS}}(y_{1:t})}{p(y_{1:t})} - 1\right) \Rightarrow \mathcal N\big(0, \sigma^2_{t,\mathrm{SIS}}\big), \qquad \sqrt N \left(\frac{\hat p_{\mathrm{SMC}}(y_{1:t})}{p(y_{1:t})} - 1\right) \Rightarrow \mathcal N\big(0, \sigma^2_{t,\mathrm{SMC}}\big).$$
The variance expressions are
$$\sigma^2_{t,\mathrm{SIS}} = \int \frac{p^2(x_{1:t}|y_{1:t})}{p(x_{1:t})}\, dx_{1:t} - 1 = \frac{\int p^2(y_{1:t}|x_{1:t})\, p(x_{1:t})\, dx_{1:t}}{p^2(y_{1:t})} - 1,$$
$$\sigma^2_{t,\mathrm{SMC}} = \int \frac{p^2(x_1|y_{1:t})}{\mu(x_1)}\, dx_1 - 1 + \sum_{k=2}^t \left(\int \frac{p^2(x_{1:k}|y_{1:t})}{p(x_{1:k-1}|y_{1:k-1})\, f(x_k|x_{k-1})}\, dx_{1:k} - 1\right) = \sum_{k=1}^t \left(\frac{\int p^2(y_{k:t}|x_k)\, p(x_k|y_{1:k-1})\, dx_k}{p^2(y_{k:t}|y_{1:k-1})} - 1\right),$$
with the convention $p(x_1|y_{1:0}) = \mu(x_1)$.

SMC breaks the single importance sampling integral over $\mathcal X^t$ into $t$ integrals over $\mathcal X$.

A Toy Example

Consider the case where $f(x'|x) = \mu(x') = \mathcal N(x'; 0, \sigma^2)$ and $g(y|x) = \mathcal N\big(y; x, (1 - \sigma^{-2})^{-1}\big)$ where $\sigma^2 > 1$.

Assume we observe $y_1 = \cdots = y_t = 0$; then we have
$$V\left[\frac{\hat p_{\mathrm{SIS}}(y_{1:t})}{p(y_{1:t})}\right] = \frac{\sigma^2_{t,\mathrm{SIS}}}{N} = \frac{1}{N}\left[\left(\frac{\sigma^4}{2\sigma^2 - 1}\right)^{t/2} - 1\right],$$
$$V\left[\frac{\hat p_{\mathrm{SMC}}(y_{1:t})}{p(y_{1:t})}\right] \approx \frac{\sigma^2_{t,\mathrm{SMC}}}{N} = \frac{t}{N}\left[\left(\frac{\sigma^4}{2\sigma^2 - 1}\right)^{1/2} - 1\right].$$
If we select $\sigma^2 = 1.2$, SIS needs an astronomically large number of particles to obtain $\sigma^2_{t,\mathrm{SIS}}/N = 10^{-2}$ once $t$ is large, whereas SMC requires only $N \approx 10^4$ particles to obtain $\sigma^2_{t,\mathrm{SMC}}/N = 10^{-2}$: an improvement of 19 orders of magnitude!
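The contrast between the exponential and linear growth of the two variances can be checked directly by plugging numbers into the formulas (a quick arithmetic sketch, using the $\sigma^2 = 1.2$ value from the slide):

```python
import numpy as np

# sigma2_SIS(t) = r^(t/2) - 1 and sigma2_SMC(t) = t (r^(1/2) - 1),
# with r = sigma^4 / (2 sigma^2 - 1) the per-step variance growth factor.
s2 = 1.2                       # sigma^2 > 1
r = s2**2 / (2.0 * s2 - 1.0)   # r > 1 whenever sigma^2 != 1

def sis_var(t):
    return r ** (t / 2.0) - 1.0

def smc_var(t):
    return t * (np.sqrt(r) - 1.0)

ratio = sis_var(1000) / smc_var(1000)   # exponential vs linear growth in t
```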

Better Resampling Schemes

Better resampling steps can be designed such that $E\big[N_t^{(i)}\big] = N W_t^{(i)}$ but $V\big[N_t^{(i)}\big] < N W_t^{(i)}\big(1 - W_t^{(i)}\big)$: residual resampling, minimal entropy resampling, etc. (Cappé et al., 2005).

Residual Resampling. Set $\tilde N_t^{(i)} = \big\lfloor N W_t^{(i)} \big\rfloor$, then sample $\overline N_t^{1:N}$ from a multinomial of parameters $\big(N - \sum_i \tilde N_t^{(i)},\, \overline W_t^{(1:N)}\big)$ where $\overline W_t^{(i)} \propto W_t^{(i)} - N^{-1} \tilde N_t^{(i)}$, and set $N_t^{(i)} = \tilde N_t^{(i)} + \overline N_t^{(i)}$.

Systematic Resampling. Sample $U_1 \sim \mathcal U\big[0, \tfrac{1}{N}\big]$ and define $U_i = U_1 + \tfrac{i-1}{N}$ for $i = 2, \dots, N$; then set
$$N_t^{(i)} = \left|\left\{ U_j : \sum_{k=1}^{i-1} W_t^{(k)} \leq U_j < \sum_{k=1}^{i} W_t^{(k)} \right\}\right|$$
with the convention $\sum_{k=1}^{0} := 0$.
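A minimal implementation of systematic resampling (my own sketch): a single uniform generates the whole comb of $N$ points, and matching it against the cumulative weights gives the counts.

```python
import numpy as np

def systematic_resample(weights, rng):
    """Systematic resampling: U_1 ~ U[0, 1/N], offsets U_i = U_1 + (i-1)/N,
    and N^(i) = #{U_j in [cumsum_{i-1}, cumsum_i)}.  O(N), with E[N^(i)] = N W^(i)
    and each count within one unit of N W^(i)."""
    N = len(weights)
    u = (rng.random() + np.arange(N)) / N
    cdf = np.cumsum(weights)
    cdf[-1] = 1.0                                   # guard against rounding
    return np.searchsorted(cdf, u, side='right')

rng = np.random.default_rng(4)
W = np.array([0.05, 0.05, 0.2, 0.3, 0.4])
idx = systematic_resample(W, rng)
counts = np.bincount(idx, minlength=len(W))
```

Compared to multinomial resampling, the counts are far less variable: each $N^{(i)}$ is either $\lfloor N W^{(i)} \rfloor$ or $\lceil N W^{(i)} \rceil$.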

Measuring Variability of the Weights

To measure the variation of the weights, we can use the Effective Sample Size (ESS):
$$\mathrm{ESS} = \left(\sum_{i=1}^N \big(W_t^{(i)}\big)^2\right)^{-1}.$$
We have $\mathrm{ESS} = N$ if $W_t^{(i)} = 1/N$ for all $i$, and $\mathrm{ESS} = 1$ if $W_t^{(i)} = 1$ and $W_t^{(j)} = 0$ for $j \neq i$.

Liu (1996) showed that, for simple importance sampling and $\varphi$ regular enough,
$$V\left[\sum_{i=1}^N W_t^{(i)}\, \varphi\big(X_t^{(i)}\big)\right] \approx V_{p(x_{1:t}|y_{1:t})}\left[\frac{1}{\mathrm{ESS}}\sum_{i=1}^{\mathrm{ESS}} \varphi\big(X_t^{(i)}\big)\right];$$
i.e. the estimate is roughly as accurate as one computed from an i.i.d. sample of size ESS from $p(x_{1:t}|y_{1:t})$.

Dynamic Resampling

Resampling at each time step can be harmful: only resample when necessary.

Dynamic Resampling: if the variation of the weights as measured by ESS is too high, e.g. $\mathrm{ESS} < N/2$, then resample the particles.

We can also use the entropy
$$\mathrm{Ent} = -\sum_{i=1}^N W_t^{(i)} \log_2\big(W_t^{(i)}\big).$$
We have $\mathrm{Ent} = \log_2(N)$ if $W_t^{(i)} = 1/N$ for all $i$, and $\mathrm{Ent} = 0$ if $W_t^{(i)} = 1$ and $W_t^{(j)} = 0$ for $j \neq i$.
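Both criteria are one-liners; a small sketch (the `ESS < N/2` threshold follows the slide, the example weight vectors are my own):

```python
import numpy as np

def ess(W):
    """Effective sample size (sum_i W_i^2)^(-1): N for uniform weights,
    1 when a single particle carries all the mass."""
    return 1.0 / np.sum(W ** 2)

def entropy(W):
    """Weight entropy -sum_i W_i log2 W_i: log2(N) for uniform weights,
    0 when a single particle carries all the mass."""
    Wp = W[W > 0]
    return -np.sum(Wp * np.log2(Wp))

N = 8
uniform = np.full(N, 1.0 / N)
degenerate = np.zeros(N); degenerate[0] = 1.0
skewed = np.array([0.6, 0.2, 0.1, 0.05, 0.02, 0.01, 0.01, 0.01])

resample_now = ess(skewed) < N / 2     # dynamic resampling trigger
```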

Improving the Sampling Step

Bootstrap filter: particles are sampled blindly according to the prior, without taking the observation into account. This is very inefficient for a vague prior and/or a peaky likelihood.

Optimal proposal / perfect adaptation: implement the following alternative update-propagate Bayesian recursion:
$$\text{Update:}\quad p(x_{1:t-1}|y_{1:t}) = \frac{p(y_t|x_{t-1})\, p(x_{1:t-1}|y_{1:t-1})}{p(y_t|y_{1:t-1})},$$
$$\text{Propagate:}\quad p(x_{1:t}|y_{1:t}) = p(x_{1:t-1}|y_{1:t})\, p(x_t|y_t, x_{t-1}),$$
where
$$p(x_t|y_t, x_{t-1}) = \frac{f(x_t|x_{t-1})\, g(y_t|x_t)}{p(y_t|x_{t-1})}.$$
This is much more efficient when applicable, e.g. when $f(x_t|x_{t-1}) = \mathcal N(x_t; \varphi(x_{t-1}), \Sigma_v)$ and $g(y_t|x_t) = \mathcal N(y_t; x_t, \Sigma_w)$.
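For the Gaussian example just mentioned, both $p(x_t|y_t, x_{t-1})$ and the incremental weight $p(y_t|x_{t-1})$ are available in closed form by conjugacy. A scalar sketch of mine (the mean function `phi` is an illustrative choice):

```python
import numpy as np

def optimal_proposal_params(x_prev, y, phi, sv2, sw2):
    """For f = N(x_t; phi(x_{t-1}), sv2) and g = N(y_t; x_t, sw2), the optimal
    proposal p(x_t | y_t, x_{t-1}) is N(m, s2) with
        s2 = (1/sv2 + 1/sw2)^(-1),  m = s2 * (phi(x_prev)/sv2 + y/sw2),
    and the incremental weight is p(y_t | x_{t-1}) = N(y_t; phi(x_prev), sv2 + sw2)."""
    pred = phi(x_prev)
    s2 = 1.0 / (1.0 / sv2 + 1.0 / sw2)
    m = s2 * (pred / sv2 + y / sw2)
    log_incr_w = (-0.5 * np.log(2.0 * np.pi * (sv2 + sw2))
                  - 0.5 * (y - pred) ** 2 / (sv2 + sw2))
    return m, s2, log_incr_w

phi = lambda x: 0.9 * x          # illustrative mean function
m, s2, lw = optimal_proposal_params(np.array([0.0, 1.0]), y=0.5, phi=phi,
                                    sv2=1.0, sw2=0.5)
```

Note the weight depends only on $x_{t-1}$, not on the sampled $x_t$, which is exactly why this proposal is optimal.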

A General Bayesian Recursion

Introduce an arbitrary proposal distribution $q(x_t|y_t, x_{t-1})$, i.e. an approximation to $p(x_t|y_t, x_{t-1})$. We have seen that
$$p(x_{1:t}|y_{1:t}) = \frac{g(y_t|x_t)\, f(x_t|x_{t-1})}{p(y_t|y_{1:t-1})}\, p(x_{1:t-1}|y_{1:t-1}),$$
so clearly
$$p(x_{1:t}|y_{1:t}) = \frac{w(x_{t-1}, x_t, y_t)\, q(x_t|y_t, x_{t-1})}{p(y_t|y_{1:t-1})}\, p(x_{1:t-1}|y_{1:t-1})$$
where
$$w(x_{t-1}, x_t, y_t) = \frac{g(y_t|x_t)\, f(x_t|x_{t-1})}{q(x_t|y_t, x_{t-1})}.$$
This suggests a more general SMC algorithm.

A General SMC Algorithm

Assume we have $N$ weighted particles $\big\{W_{t-1}^{(i)}, X_{1:t-1}^{(i)}\big\}$ approximating $p(x_{1:t-1}|y_{1:t-1})$; then at time $t$:
- Sample $X_t^{(i)} \sim q\big(x_t|y_t, X_{t-1}^{(i)}\big)$, set $X_{1:t}^{(i)} = \big(X_{1:t-1}^{(i)}, X_t^{(i)}\big)$ and
$$\hat p(x_{1:t}|y_{1:t}) = \sum_{i=1}^N W_t^{(i)}\, \delta_{X_{1:t}^{(i)}}(x_{1:t}), \qquad W_t^{(i)} \propto W_{t-1}^{(i)}\, \frac{f\big(X_t^{(i)}|X_{t-1}^{(i)}\big)\, g\big(y_t|X_t^{(i)}\big)}{q\big(X_t^{(i)}|y_t, X_{t-1}^{(i)}\big)}.$$
- If $\mathrm{ESS} < N/2$, resample $\overline X_{1:t}^{(i)} \sim \hat p(x_{1:t}|y_{1:t})$ and set $W_t^{(i)} = \frac{1}{N}$ to obtain
$$\overline p(x_{1:t}|y_{1:t}) = \frac{1}{N}\sum_{i=1}^N \delta_{\overline X_{1:t}^{(i)}}(x_{1:t}).$$
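One step of this general algorithm, with ESS-triggered resampling, can be sketched as follows (my own illustration, with a bootstrap-style proposal $q = f$ plugged in; all densities and parameters here are arbitrary choices, written up to constants):

```python
import numpy as np

def smc_step(x, logW, y, sample_q, logf, logg, logq, rng, ess_frac=0.5):
    """One step of the general SMC algorithm: propose from q, multiply the
    weights by f*g/q, and resample only if ESS < ess_frac * N."""
    N = len(x)
    x_new = sample_q(y, x, rng)
    logW = logW + logf(x_new, x) + logg(y, x_new) - logq(x_new, y, x)
    W = np.exp(logW - logW.max())
    W /= W.sum()
    if 1.0 / np.sum(W ** 2) < ess_frac * N:          # ESS < N/2 trigger
        x_new = x_new[rng.choice(N, size=N, p=W)]
        logW = np.full(N, -np.log(N))                # reset to uniform weights
    else:
        logW = np.log(W)
    return x_new, logW

# Bootstrap choice q = f, so the incremental weight reduces to g.
logf = lambda xn, xp: -0.5 * (xn - 0.9 * xp) ** 2
logg = lambda y, xn: -0.5 * (y - xn) ** 2
logq = lambda xn, y, xp: logf(xn, xp)
sample_q = lambda y, xp, rng: 0.9 * xp + rng.normal(size=len(xp))

rng = np.random.default_rng(5)
N = 1000
x, logW = rng.normal(size=N), np.full(N, -np.log(N))
for yt in [0.3, -0.1, 0.8]:
    x, logW = smc_step(x, logW, yt, sample_q, logf, logg, logq, rng)
```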

Building Proposals

Our aim is to select $q(x_t|y_t, x_{t-1})$ as close as possible to $p(x_t|y_t, x_{t-1})$, as this minimizes the variance of
$$w(x_{t-1}, x_t, y_t) = \frac{g(y_t|x_t)\, f(x_t|x_{t-1})}{q(x_t|y_t, x_{t-1})}.$$
Example (EKF proposal): let $X_t = \varphi(X_{t-1}) + V_t$ and $Y_t = \Psi(X_t) + W_t$, with $V_t \sim \mathcal N(0, \Sigma_v)$, $W_t \sim \mathcal N(0, \Sigma_w)$. We perform the local linearization
$$Y_t \approx \Psi\big(\varphi(X_{t-1})\big) + \left.\frac{\partial \Psi(x)}{\partial x}\right|_{\varphi(X_{t-1})} \big(X_t - \varphi(X_{t-1})\big) + W_t,$$
which yields a Gaussian approximation $\hat g(y_t|x_t)$ of the likelihood, and use
$$q(x_t|y_t, x_{t-1}) \propto \hat g(y_t|x_t)\, f(x_t|x_{t-1})$$
as a proposal.

Any standard suboptimal filtering method can be used: unscented particle filter, Gaussian quadrature particle filter, etc.

Implicit Proposals

Proposed recently by Chorin (2012). Let
$$F(x_{t-1}, x_t) = \log g(y_t|x_t) + \log f(x_t|x_{t-1})$$
and
$$x_t^{\ast} = \arg\max_{x_t} F(x_{t-1}, x_t) = \arg\max_{x_t} p(x_t|y_t, x_{t-1}).$$
We sample $Z \sim \mathcal N(0, I_{n_x})$, then solve in $X_t$
$$F(x_{t-1}, x_t^{\ast}) - F(x_{t-1}, X_t) = \tfrac{1}{2} Z^{\mathsf T} Z.$$
If there is a unique solution, the induced proposal is
$$q(x_t|y_t, x_{t-1}) = p_Z(z)\, \big|\det \partial z / \partial x_t\big| \propto \frac{\exp\big(-F(x_{t-1}, x_t^{\ast})\big)\, g(y_t|x_t)\, f(x_t|x_{t-1})}{\big|\det \partial x_t / \partial z\big|}.$$
The incremental weight is then
$$\frac{g(y_t|x_t)\, f(x_t|x_{t-1})}{q(x_t|y_t, x_{t-1})} \propto \big|\det \partial x_t / \partial z\big|\, \exp\big(F(x_{t-1}, x_t^{\ast})\big).$$

Auxiliary Particle Filters

A popular variation introduced by Pitt & Shephard (1999).

This corresponds to a standard SMC algorithm (Johansen & D., 2008) where we target
$$\hat p(x_{1:t}|y_{1:t+1}) \propto p(x_{1:t}|y_{1:t})\, \hat p(y_{t+1}|x_t),$$
where $\hat p(y_{t+1}|x_t) \approx p(y_{t+1}|x_t)$, using a proposal $\hat p(x_t|y_t, x_{t-1})$.

When $\hat p(y_{t+1}|x_t) = p(y_{t+1}|x_t)$ and $\hat p(x_{t+1}|y_{t+1}, x_t) = p(x_{t+1}|y_{t+1}, x_t)$, we are back to perfect adaptation.

Block Sampling Proposals

Problem: we only sample $X_t$ at time $t$, so even if we use $p(x_t|y_t, x_{t-1})$ the SMC estimates can have high variance if $V_{p(x_{t-1}|y_{1:t-1})}\big[p(y_t|x_{t-1})\big]$ is high.

Block sampling idea: allow yourself to sample again $X_{t-L+1:t-1}$, as well as $X_t$, in light of $y_t$. Optimally, at time $t$ we would like to sample
$$X_{t-L+1:t}^{(i)} \sim p\big(x_{t-L+1:t}|y_{t-L+1:t}, X_{t-L}^{(i)}\big)$$
and weight
$$W_t^{(i)} \propto W_{t-1}^{(i)}\, p\big(y_t|y_{t-L+1:t-1}, X_{t-L}^{(i)}\big).$$
When $p(x_{t-L+1:t}|y_{t-L+1:t}, x_{t-L})$ and $p(y_t|y_{t-L+1:t-1}, x_{t-L})$ are not available, we can use analytical approximations of them and still obtain consistent estimates (D., Briers & Senecal, 2006).

169 Block Sampling Proposals Computational cost is increased from O (N) to O (LN) so is it worth it? A. Doucet (MLSS Sept. 2012) Sept / 136

170 Block Sampling Proposals Computational cost is increased from O (N) to O (LN) so is it worth it? Consider the ideal scenario where X t = X t 1 + V t Y t = X t + W t where X 1 N (0, 1) and V t, W t i.i.d. N (0, 1). A. Doucet (MLSS Sept. 2012) Sept / 136

171 Block Sampling Proposals Computational cost is increased from O (N) to O (LN) so is it worth it? Consider the ideal scenario where X t = X t 1 + V t Y t = X t + W t where X 1 N (0, 1) and V t, W t i.i.d. N (0, 1). In this case, we have p(y t y t L+1:t 1, x t L ) p(y t y t L+1:t 1, x t L) < c x t L x t L /2 L where the rate of exponential convergence depends upon the signal-to-noise ratio if more general Gaussian AR are considered. A. Doucet (MLSS Sept. 2012) Sept / 136

172 Block Sampling Proposals

Computational cost is increased from O(N) to O(LN), so is it worth it?

Consider the ideal scenario where

X_t = X_{t-1} + V_t,
Y_t = X_t + W_t,

where X_1 ~ N(0, 1) and V_t, W_t are i.i.d. N(0, 1).

In this case, we have

|p(y_t | y_{t-L+1:t-1}, x_{t-L}) − p(y_t | y_{t-L+1:t-1}, x'_{t-L})| ≤ c |x_{t-L} − x'_{t-L}| / 2^L,

where the rate of exponential convergence depends upon the signal-to-noise ratio if more general Gaussian AR models are considered.

We can obtain an analytic expression for the variance of the (normalized) weight.
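The variance reduction can also be checked empirically: fix an observation record, draw x_{t-L} from a spread-out distribution (a standard normal here, as a crude stand-in for p(x_{t-L} | y_{1:t-1}); this choice is ours, not from the slides), and compare the variance of the incremental weight p(y_t | y_{t-L+1:t-1}, x_{t-L}) for L = 1 and L = 8:

```python
import numpy as np

def incr_weight(x0, ys_mid, y_last, q=1.0, r=1.0):
    """p(y_t | y_{t-L+1:t-1}, x_{t-L} = x0) for X_k = X_{k-1} + V_k,
    Y_k = X_k + W_k, via a Kalman filter started deterministically at x0."""
    m, P = x0, 0.0
    for y in ys_mid:                   # assimilate the intermediate block
        P_pred = P + q
        K = P_pred / (P_pred + r)
        m, P = m + K * (y - m), (1.0 - K) * P_pred
    S = P + q + r                      # predictive variance of y_t
    return np.exp(-0.5 * (y_last - m) ** 2 / S) / np.sqrt(2 * np.pi * S)

rng = np.random.default_rng(0)
ys = rng.normal(size=7)                # intermediate observations for L = 8
y_t = 0.3                              # a fixed final observation
x0s = rng.normal(size=5000)            # draws standing in for p(x_{t-L} | y_{1:t-1})
var_L1 = np.var([incr_weight(x, [], y_t) for x in x0s])
var_L8 = np.var([incr_weight(x, ys, y_t) for x in x0s])
print(var_L1, var_L8)                  # the weight variance collapses as L grows
```

As L grows the weight barely depends on x_{t-L} any more, so its variance over the particle cloud, and hence the benefit of block sampling, follows the forgetting rate of the model.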

173 Block Sampling Proposals

[Figure: variance of the incremental weight w.r.t. p(x_{1:t-L} | y_{1:t-1}).]


Pseudo-marginal MCMC methods for inference in latent variable models Pseudo-marginal MCMC methods for inference in latent variable models Arnaud Doucet Department of Statistics, Oxford University Joint work with George Deligiannidis (Oxford) & Mike Pitt (Kings) MCQMC, 19/08/2016

More information

Variational Scoring of Graphical Model Structures

Variational Scoring of Graphical Model Structures Variational Scoring of Graphical Model Structures Matthew J. Beal Work with Zoubin Ghahramani & Carl Rasmussen, Toronto. 15th September 2003 Overview Bayesian model selection Approximations using Variational

More information