Sequential Monte Carlo Methods for Bayesian Computation
A. Doucet (MLSS, Kyoto, Sept. 2012)
Motivating Example 1: Generic Bayesian Model

Let X be a vector parameter of interest with an associated prior µ; i.e. X ~ µ(·).

We observe a realization y of Y, which is assumed to satisfy Y | (X = x) ~ g(·|x); i.e. the likelihood function is g(y|x).

Bayesian inference on X relies on the posterior of X given Y = y:

  p(x|y) = µ(x) g(y|x) / p(y)

where the marginal likelihood/evidence satisfies p(y) = ∫ µ(x) g(y|x) dx.

Machine learning examples: Latent Dirichlet Allocation, (Hierarchical) Dirichlet processes, ...
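As a concrete (and purely illustrative) instance of these formulas, the sketch below evaluates p(x|y) ∝ µ(x) g(y|x) on a grid for an assumed Gaussian prior µ = N(0, 1) and Gaussian likelihood g(y|x) = N(y; x, 1); the evidence p(y) is just the normalizing constant of the product.

```python
import math

def normal_pdf(z, mean, var):
    # Density of N(mean, var) evaluated at z.
    return math.exp(-(z - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def grid_posterior(y, lo=-10.0, hi=10.0, n=2001):
    # Riemann-sum approximation of p(y) = ∫ µ(x) g(y|x) dx and of p(x|y).
    dx = (hi - lo) / (n - 1)
    xs = [lo + i * dx for i in range(n)]
    unnorm = [normal_pdf(x, 0.0, 1.0) * normal_pdf(y, x, 1.0) for x in xs]
    evidence = sum(unnorm) * dx
    posterior = [u / evidence for u in unnorm]
    return xs, posterior, evidence

xs, post, p_y = grid_posterior(y=1.0)
# For this conjugate pair, p(y) = N(y; 0, 2) and the posterior is N(y/2, 1/2),
# so the grid answers can be checked against the closed form.
```

This conjugate choice is only there so the numerical answer can be verified; for non-conjugate models the same grid recipe works in low dimensions, which is exactly where Monte Carlo methods take over.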
Motivating Example 2: State-Space Models

Let {X_t}_{t≥1} be a latent/hidden Markov process with X_1 ~ µ(·) and X_t | (X_{t-1} = x) ~ f(·|x).

Let {Y_t}_{t≥1} be an observation process such that the observations are conditionally independent given {X_t}_{t≥1} with Y_t | (X_t = x) ~ g(·|x).

Writing z_{i:j} := (z_i, z_{i+1}, ..., z_j), Bayesian inference on X_{1:t} relies on the posterior of X_{1:t} given Y_{1:t} = y_{1:t}:

  p(x_{1:t} | y_{1:t}) = p(x_{1:t}, y_{1:t}) / p(y_{1:t})

where the marginal likelihood/evidence satisfies p(y_{1:t}) = ∫ p(x_{1:t}, y_{1:t}) dx_{1:t}.

Machine learning examples: biochemical network models, dynamic topic models, neuroscience models, etc.
Bayesian Inference and Machine Learning

Bayesian approaches have been adopted by a large part of the ML community.

Bayesian inference offers a number of attractive advantages over conventional approaches:
- flexibility in constructing complex models from simple parts;
- the incorporation of prior knowledge is very natural;
- all modelling assumptions are made explicit;
- uncertainties over model order, model parameters and predictions are technically straightforward to compute.

The price to pay is that approximate inference techniques are necessary to approximate the resulting posterior distributions for all but trivial models.
Approximate Inference Methods

- Gaussian/Laplace approximations, local linearization, Extended Kalman filters.
- Variational methods, assumed density filters.
- Expectation Propagation.
- Markov chain Monte Carlo (MCMC) methods.
- Sequential Monte Carlo (SMC) methods.
Monte Carlo Methods

Variational and EP methods are computationally cheap but only provide functional approximations of the posteriors of interest.

Both MCMC and SMC are asymptotically (as the computational effort increases) bias-free but computationally expensive.

MCMC methods have been the tools of choice in Bayesian computation for over 20 years, whereas SMC methods have been widely used for 15 years in vision and robotics.

The development of new methodology, combined with the emergence of cheap multicore architectures, now makes SMC a powerful alternative/complementary approach to MCMC for addressing general Bayesian computational problems.

The aim of these lectures is to provide an introduction to this active research field and to discuss some open research problems.
Some References and Resources

A.D., J.F.G. De Freitas & N.J. Gordon (editors), Sequential Monte Carlo Methods in Practice, Springer-Verlag: New York, 2001.

P. Del Moral, Feynman-Kac Formulae: Genealogical and Interacting Particle Systems with Applications, Springer-Verlag: New York, 2004.

O. Cappé, E. Moulines & T. Rydén, Inference in Hidden Markov Models, Springer-Verlag: New York, 2005.

Webpage with links to papers and code: http://www.stats.ox.ac.uk/~doucet/smc_resources.html

Thousands of papers on the subject appear every year.
Organization of Lectures

State-Space Models (approx. 4 hours)
- SMC filtering and smoothing
- Maximum likelihood parameter inference
- Bayesian parameter inference

Beyond State-Space Models (approx. 2 hours)
- SMC methods for generic sequences of target distributions
- SMC samplers
- Approximate Bayesian Computation
- Optimal design, optimal control
State-Space Models

Let {X_t}_{t≥1} be a latent/hidden X-valued Markov process with X_1 ~ µ(·) and X_t | (X_{t-1} = x) ~ f(·|x).

Let {Y_t}_{t≥1} be a Y-valued observation process such that the observations are conditionally independent given {X_t}_{t≥1} with Y_t | (X_t = x) ~ g(·|x).

This is a general class of time series models, aka Hidden Markov Models (HMM), including

  X_t = Ψ(X_{t-1}, V_t),  Y_t = Φ(X_t, W_t)

where {V_t} and {W_t} are two sequences of i.i.d. random variables.

Aim: infer {X_t} given observations {Y_t}, on-line or off-line.
State-Space Models

State-space models are ubiquitous in control, data mining, econometrics, geosciences, systems biology, etc. Since Jan. 2012, more than 13,500 papers on them have already appeared (source: Google Scholar).

Finite state-space HMM: X is a finite space, i.e. {X_t} is a finite Markov chain, and Y_t | (X_t = x) ~ g(·|x).

Linear Gaussian state-space model:

  X_t = A X_{t-1} + B V_t,  V_t ~ i.i.d. N(0, I),
  Y_t = C X_t + D W_t,      W_t ~ i.i.d. N(0, I).

Switching linear Gaussian state-space model: X_t = (X_t^1, X_t^2), where {X_t^1} is a finite Markov chain and

  X_t^2 = A(X_t^1) X_{t-1}^2 + B(X_t^1) V_t,  V_t ~ i.i.d. N(0, I),
  Y_t   = C(X_t^1) X_t^2 + D(X_t^1) W_t,      W_t ~ i.i.d. N(0, I).
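To make the generative mechanism concrete, here is a minimal simulation sketch for the scalar linear Gaussian model above; the parameter values A = 0.9, B = C = D = 1 and the initial law X_1 ~ N(0, 1) are illustrative assumptions, not from the slides.

```python
import random

def simulate_lgssm(T, A=0.9, B=1.0, C=1.0, D=1.0, rng=None):
    # Simulate X_t = A X_{t-1} + B V_t,  Y_t = C X_t + D W_t,  V_t, W_t ~ N(0, 1).
    rng = rng or random.Random(0)
    xs, ys = [], []
    x = rng.gauss(0.0, 1.0)  # X_1 ~ N(0, 1) (assumed initial law)
    for _ in range(T):
        xs.append(x)
        ys.append(C * x + D * rng.gauss(0.0, 1.0))  # observation
        x = A * x + B * rng.gauss(0.0, 1.0)         # state transition
    return xs, ys

xs, ys = simulate_lgssm(100)
```

Only {Y_t} would be observed in practice; the simulated {X_t} is what the filtering recursions later in the lectures try to recover.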
State-Space Models

Stochastic volatility model:

  X_t = φ X_{t-1} + σ V_t,   V_t ~ i.i.d. N(0, 1),
  Y_t = β exp(X_t / 2) W_t,  W_t ~ i.i.d. N(0, 1).

Biochemical network model:

  Pr(X^1_{t+dt} = x^1_t + 1, X^2_{t+dt} = x^2_t     | x^1_t, x^2_t) = α x^1_t dt + o(dt),
  Pr(X^1_{t+dt} = x^1_t − 1, X^2_{t+dt} = x^2_t + 1 | x^1_t, x^2_t) = β x^1_t x^2_t dt + o(dt),
  Pr(X^1_{t+dt} = x^1_t,     X^2_{t+dt} = x^2_t − 1 | x^1_t, x^2_t) = γ x^2_t dt + o(dt),

with Y_k = X^1_{kT} + W_k, where W_k ~ i.i.d. N(0, σ²).

Nonlinear diffusion model:

  dX_t = α(X_t) dt + β(X_t) dV_t,  V_t Brownian motion,
  Y_k = γ(X_{kT}) + W_k,           W_k ~ i.i.d. N(0, σ²).
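The stochastic volatility model is the running nonlinear example in what follows, so a simulation sketch is useful; the values φ = 0.95, σ = 0.3, β = 0.7 and the stationary initial law are illustrative assumptions.

```python
import math
import random

def simulate_sv(T, phi=0.95, sigma=0.3, beta=0.7, seed=1):
    # X_t = phi X_{t-1} + sigma V_t,  Y_t = beta exp(X_t / 2) W_t.
    rng = random.Random(seed)
    # Start from the stationary law of the AR(1) log-volatility (assumed).
    x = rng.gauss(0.0, sigma / math.sqrt(1 - phi ** 2))
    xs, ys = [], []
    for _ in range(T):
        xs.append(x)
        ys.append(beta * math.exp(x / 2) * rng.gauss(0.0, 1.0))
        x = phi * x + sigma * rng.gauss(0.0, 1.0)
    return xs, ys

xs, ys = simulate_sv(200)
```

Here the state X_t (log-volatility) only enters the observation through its scale, so the likelihood is highly non-Gaussian and Kalman-type filters do not apply exactly.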
Inference in State-Space Models

Given observations y_{1:t} := (y_1, y_2, ..., y_t), inference about X_{1:t} := (X_1, ..., X_t) relies on the posterior

  p(x_{1:t} | y_{1:t}) = p(x_{1:t}, y_{1:t}) / p(y_{1:t})

where

  p(x_{1:t}, y_{1:t}) = µ(x_1) ∏_{k=2}^t f(x_k | x_{k-1}) × ∏_{k=1}^t g(y_k | x_k)
                        [prior p(x_{1:t})]                 [likelihood p(y_{1:t} | x_{1:t})]

and p(y_{1:t}) = ∫ p(x_{1:t}, y_{1:t}) dx_{1:t}.

When X is finite, and for linear Gaussian models, {p(x_t | y_{1:t})}_{t≥1} can be computed exactly. For non-linear models, approximations are required: EKF, UKF, Gaussian sum filters, etc.

Approximations of {p(x_t | y_{1:t})}_{t=1}^T provide an approximation of p(x_{1:T} | y_{1:T}).
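The factorization of p(x_{1:t}, y_{1:t}) above is easy to evaluate pointwise in log space. The sketch below does so for an assumed AR(1) state with additive Gaussian observations (illustrative model and parameter choices, not from the slides).

```python
import math

def log_normal_pdf(z, mean, var):
    # log of the N(mean, var) density at z.
    return -0.5 * math.log(2 * math.pi * var) - (z - mean) ** 2 / (2 * var)

def log_joint(xs, ys, phi=0.9, tau2=1.0, sigma2=1.0):
    # log p(x_{1:t}, y_{1:t}) = log µ(x_1) + Σ_{k≥2} log f(x_k | x_{k-1})
    #                                       + Σ_{k}  log g(y_k | x_k)
    # for X_k = phi X_{k-1} + N(0, tau2) and Y_k = X_k + N(0, sigma2).
    lp = log_normal_pdf(xs[0], 0.0, tau2 / (1 - phi ** 2))  # µ: stationary law (assumed)
    lp += sum(log_normal_pdf(xs[k], phi * xs[k - 1], tau2) for k in range(1, len(xs)))
    lp += sum(log_normal_pdf(y, x, sigma2) for x, y in zip(xs, ys))
    return lp
```

Being able to evaluate this unnormalized log-density is all that importance-sampling-based methods need; the intractable part is the normalizing constant p(y_{1:t}).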
Monte Carlo Methods: Basics

Assume you can generate X^(i)_{1:t} ~ p(x_{1:t} | y_{1:t}) for i = 1, ..., N; then the MC approximation is

  p̂(x_{1:t} | y_{1:t}) = (1/N) ∑_{i=1}^N δ_{X^(i)_{1:t}}(x_{1:t}).

Integration is straightforward:

  ∫ φ_t(x_{1:t}) p̂(x_{1:t} | y_{1:t}) dx_{1:t} = (1/N) ∑_{i=1}^N φ_t(X^(i)_{1:t})
  ≈ ∫ φ_t(x_{1:t}) p(x_{1:t} | y_{1:t}) dx_{1:t}.

Marginalization is straightforward:

  p̂(x_k | y_{1:t}) = ∫ p̂(x_{1:t} | y_{1:t}) dx_{1:k-1} dx_{k+1:t} = (1/N) ∑_{i=1}^N δ_{X^(i)_k}(x_k).

Basic and key property:

  V[ (1/N) ∑_{i=1}^N φ_t(X^(i)_{1:t}) ] = C(t, dim(X)) / N,

i.e. the rate of convergence to zero is independent of dim(X) and t.
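A minimal sketch of the basic estimator (the choice of test function φ(x) = x² and target N(0, 1) is an illustrative assumption):

```python
import random

def mc_mean(phi, sampler, N, rng):
    # (1/N) Σ_{i=1}^N φ(X^(i)) with X^(i) ~ sampler: the basic Monte Carlo estimator.
    return sum(phi(sampler(rng)) for _ in range(N)) / N

rng = random.Random(42)
# Estimate E[X^2] = 1 for X ~ N(0, 1); the standard error behaves like C / sqrt(N),
# whatever the dimension of the underlying space.
est = mc_mean(lambda x: x * x, lambda r: r.gauss(0.0, 1.0), 100_000, rng)
```

The 1/√N error rate is the "key property" of the slide: doubling accuracy costs four times the samples, but the rate itself does not degrade with dimension.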
Monte Carlo Methods

Problem 1: we typically cannot generate exact samples from p(x_{1:t} | y_{1:t}) for non-linear non-Gaussian models.

Problem 2: even if we could, algorithms generating samples from p(x_{1:t} | y_{1:t}) would have complexity at least O(t).

The typical solution to Problem 1 is to generate approximate samples using MCMC methods, but these methods are not recursive.

SMC methods partially solve Problems 1 and 2 by breaking the problem of sampling from p(x_{1:t} | y_{1:t}) into a collection of simpler subproblems: first approximate p(x_1 | y_1) and p(y_1) at time 1, then p(x_{1:2} | y_{1:2}) and p(y_{1:2}) at time 2, and so on.

Each target distribution is approximated by a cloud of random samples, termed particles, evolving according to importance sampling and resampling steps.
Standard Bayesian Recursion

In most textbooks, you will find the following recursion for {p(x_t | y_{1:t})}_{t≥1}.

Prediction step:

  p(x_t | y_{1:t-1}) = ∫ p(x_{t-1}, x_t | y_{1:t-1}) dx_{t-1}
                     = ∫ p(x_t | y_{1:t-1}, x_{t-1}) p(x_{t-1} | y_{1:t-1}) dx_{t-1}
                     = ∫ f(x_t | x_{t-1}) p(x_{t-1} | y_{1:t-1}) dx_{t-1}.

Bayes updating step:

  p(x_t | y_{1:t}) = g(y_t | x_t) p(x_t | y_{1:t-1}) / p(y_t | y_{1:t-1})

where

  p(y_t | y_{1:t-1}) = ∫ g(y_t | x_t) p(x_t | y_{1:t-1}) dx_t.

This is the recursion implemented by the Wonham and Kalman filters...
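For the scalar linear Gaussian model this recursion is exactly the Kalman filter, and both steps fit in a few lines. The sketch below is a minimal scalar version (defaults A = 1, Q = 0, C = 1, R = 1 are illustrative assumptions chosen so the answer can be checked by hand).

```python
def kalman_step(m, P, y, A=1.0, Q=0.0, C=1.0, R=1.0):
    # One prediction + Bayes update for X_t = A X_{t-1} + V_t (V_t ~ N(0, Q)),
    # Y_t = C X_t + W_t (W_t ~ N(0, R)); (m, P) parameterize p(x_{t-1} | y_{1:t-1}).
    # Prediction step: p(x_t | y_{1:t-1}) = N(m_pred, P_pred).
    m_pred = A * m
    P_pred = A * A * P + Q
    # Bayes updating step: p(x_t | y_{1:t}) = N(m_new, P_new).
    S = C * C * P_pred + R        # variance of p(y_t | y_{1:t-1})
    K = P_pred * C / S            # Kalman gain
    m_new = m_pred + K * (y - C * m_pred)
    P_new = (1.0 - K * C) * P_pred
    return m_new, P_new

# Prior N(0, 1), one observation y = 2 with unit noise: posterior is N(1, 0.5).
m, P = kalman_step(0.0, 1.0, 2.0)
```

Every quantity in the code maps one-to-one onto a term of the recursion above; SMC replaces these closed-form Gaussian updates with particle approximations when f or g is not Gaussian.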
Bayesian Recursion on Path Space

SMC methods directly approximate {p(x_{1:t} | y_{1:t})}_{t≥1}, not {p(x_t | y_{1:t})}_{t≥1}, and rely on

  p(x_{1:t} | y_{1:t}) = p(x_{1:t}, y_{1:t}) / p(y_{1:t})
                       = g(y_t | x_t) f(x_t | x_{t-1}) p(x_{1:t-1}, y_{1:t-1}) / [ p(y_t | y_{1:t-1}) p(y_{1:t-1}) ]
                       = g(y_t | x_t) [ f(x_t | x_{t-1}) p(x_{1:t-1} | y_{1:t-1}) ] / p(y_t | y_{1:t-1}),
                                      [predictive p(x_{1:t} | y_{1:t-1})]

where

  p(y_t | y_{1:t-1}) = ∫ g(y_t | x_t) p(x_{1:t} | y_{1:t-1}) dx_{1:t}.

This can alternatively be written as

  Prediction:  p(x_{1:t} | y_{1:t-1}) = f(x_t | x_{t-1}) p(x_{1:t-1} | y_{1:t-1}),
  Update:      p(x_{1:t} | y_{1:t}) = g(y_t | x_t) p(x_{1:t} | y_{1:t-1}) / p(y_t | y_{1:t-1}).

SMC is a simple and natural simulation-based implementation of this recursion.
Monte Carlo Implementation of Prediction Step

Assume that at time t − 1 you have

  p̂(x_{1:t-1} | y_{1:t-1}) = (1/N) ∑_{i=1}^N δ_{X^(i)_{1:t-1}}(x_{1:t-1}).

By sampling X̄^(i)_t ~ f(x_t | X^(i)_{t-1}) and setting X̄^(i)_{1:t} = (X^(i)_{1:t-1}, X̄^(i)_t), we obtain

  p̂(x_{1:t} | y_{1:t-1}) = (1/N) ∑_{i=1}^N δ_{X̄^(i)_{1:t}}(x_{1:t}).

Sampling from f(x_t | x_{t-1}) is usually straightforward and can be done even if f(x_t | x_{t-1}) does not admit an analytical expression, e.g. for biochemical network models.
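A minimal sketch of this propagation step, assuming an AR(1) transition f(x_t | x_{t-1}) = N(0.9 x_{t-1}, 1) (an illustrative choice):

```python
import random

def predict(particles, rng, phi=0.9, sigma=1.0):
    # Extend each particle path by one step: X̄_t^(i) ~ f(. | X_{t-1}^(i)) = N(phi x, sigma^2).
    return [path + [phi * path[-1] + sigma * rng.gauss(0.0, sigma)]
            for path in particles]

rng = random.Random(0)
paths = [[0.0] for _ in range(100)]   # N = 100 particle paths at time 1
paths = predict(paths, rng)           # now each path has length 2
```

Note that only simulation from f is needed, never pointwise evaluation of f; this is what makes the step applicable to models such as biochemical networks where f has no closed form.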
Importance Sampling Implementation of Updating Step

Our target at time t is

  p(x_{1:t} | y_{1:t}) = g(y_t | x_t) p(x_{1:t} | y_{1:t-1}) / p(y_t | y_{1:t-1}),

so by substituting p̂(x_{1:t} | y_{1:t-1}) for p(x_{1:t} | y_{1:t-1}) we obtain

  p̂(y_t | y_{1:t-1}) = ∫ g(y_t | x_t) p̂(x_{1:t} | y_{1:t-1}) dx_{1:t} = (1/N) ∑_{i=1}^N g(y_t | X̄^(i)_t).

We now have

  p̂(x_{1:t} | y_{1:t}) = g(y_t | x_t) p̂(x_{1:t} | y_{1:t-1}) / p̂(y_t | y_{1:t-1}) = ∑_{i=1}^N W^(i)_t δ_{X̄^(i)_{1:t}}(x_{1:t})

with W^(i)_t ∝ g(y_t | X̄^(i)_t) and ∑_{i=1}^N W^(i)_t = 1.
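A sketch of this weighting step, assuming the stochastic volatility observation density g(y | x) = N(y; 0, β² exp(x)) from earlier (the value β = 0.7 is illustrative):

```python
import math

def normalized_weights(particles_t, y, beta=0.7):
    # W_t^(i) ∝ g(y_t | X̄_t^(i)) with g(y | x) = N(y; 0, beta^2 exp(x)).
    g = [math.exp(-y * y / (2 * beta ** 2 * math.exp(x)))
         / math.sqrt(2 * math.pi * beta ** 2 * math.exp(x)) for x in particles_t]
    evidence_increment = sum(g) / len(g)     # p̂(y_t | y_{1:t-1}), the (1/N) Σ g term
    weights = [gi / sum(g) for gi in g]      # normalized so Σ_i W_t^(i) = 1
    return weights, evidence_increment
```

The same N density evaluations thus deliver both the normalized importance weights and the evidence increment p̂(y_t | y_{1:t-1}); multiplying the increments over t gives the marginal likelihood estimate used later.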
Multinomial Resampling

We have a weighted approximation p̂(x_{1:t} | y_{1:t}) of p(x_{1:t} | y_{1:t}):

  p̂(x_{1:t} | y_{1:t}) = ∑_{i=1}^N W^(i)_t δ_{X̄^(i)_{1:t}}(x_{1:t}).

To obtain N samples X^(i)_{1:t} approximately distributed according to p(x_{1:t} | y_{1:t}), resample N times with replacement:

  X^(i)_{1:t} ~ p̂(x_{1:t} | y_{1:t}),  giving
  p̄(x_{1:t} | y_{1:t}) = (1/N) ∑_{i=1}^N δ_{X^(i)_{1:t}}(x_{1:t}) = ∑_{i=1}^N (N^(i)_t / N) δ_{X̄^(i)_{1:t}}(x_{1:t}),

where {N^(i)_t} follow a multinomial distribution with

  E[N^(i)_t] = N W^(i)_t,  V[N^(i)_t] = N W^(i)_t (1 − W^(i)_t).

This can be achieved in O(N) operations.
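One way to achieve the O(N) cost (a sketch of a standard trick, not necessarily the one the slides have in mind) is to generate the N uniforms already sorted, via the recursive order-statistics identity u_(N) = U^{1/N}, u_(k) = u_(k+1) U^{1/k}, and then invert the weight CDF in a single pass:

```python
import random

def multinomial_resample(weights, rng):
    # O(N) multinomial resampling: draw N sorted uniforms directly, then
    # walk the cumulative weights once to invert the CDF.
    N = len(weights)
    us, u = [], 1.0
    for i in range(N, 0, -1):          # generates the order statistics, largest first
        u *= rng.random() ** (1.0 / i)
        us.append(u)
    us.reverse()                       # now increasing
    indices, cum, j = [], weights[0], 0
    for u in us:
        while u > cum and j < N - 1:   # advance through the CDF of the weights
            j += 1
            cum += weights[j]
        indices.append(j)
    return indices                     # indices[i] = ancestor of resampled particle i
```

Naively sampling each index independently by scanning the CDF would cost O(N²) or O(N log N) with bisection; the sorted-uniforms construction keeps the total at one pass over the weights.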
Vanilla SMC: Bootstrap Filter (Gordon et al., 1993) At time t = 1 Sample X (i) 1 µ (x 1 ) then p (x 1 y 1 ) = N ( ) W (i) 1 δ X (i) (x 1 ), W (i) 1 g y 1 X (i) 1. 1 i=1 Resample X (i) 1 p (x 1 y 1 ) to obtain p (x 1 y 1 ) = 1 N N i=1 δ (i) X (x 1 ). 1 At time t 2 Sample X (i) t f p (x 1:t y 1:t ) = ( ) ( ) x t X (i) t 1, set X (i) 1:t = X (i) 1:t 1, X (i) t and N i=1 ( W (i) t δ X (i) (x 1:t ), W (i) t g 1:t Resample X (i) 1:t p (x 1:t y 1:t ) to obtain p (x 1:t y 1:t ) = 1 N N i=1 δ (i) X (x 1:t ). 1:t ) y t X (i) t. A. Doucet (MLSS Sept. 2012) Sept. 2012 20 / 136
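A runnable sketch of the bootstrap filter for a toy linear Gaussian model $X_t = \rho X_{t-1} + V_t$, $Y_t = X_t + W_t$ (the model choice and all names here are illustrative assumptions, not from the slides); it also accumulates $\log \widehat{p}(y_{1:T})$ as a by-product:

```python
import numpy as np

def bootstrap_filter(y, N, rng, rho=0.9, sigma_v=1.0, sigma_w=1.0):
    """Bootstrap filter for X_t = rho X_{t-1} + V_t, Y_t = X_t + W_t.
    Returns the filtering means E[X_t | y_{1:t}] and the log marginal
    likelihood estimate log p_hat(y_{1:T})."""
    T = len(y)
    x = rng.normal(0.0, sigma_v, size=N)               # X_1 ~ mu (assumed)
    means, loglik = np.empty(T), 0.0
    for t in range(T):
        if t > 0:
            x = rho * x + rng.normal(0.0, sigma_v, N)  # propagate via f(.|x)
        logw = -0.5 * ((y[t] - x) / sigma_w) ** 2      # log g(y_t|x) - const
        shifted = np.exp(logw - logw.max())
        # p_hat(y_t | y_{1:t-1}) = (1/N) sum_i g(y_t | x^(i))
        loglik += (logw.max() + np.log(shifted.sum() / N)
                   - 0.5 * np.log(2 * np.pi * sigma_w ** 2))
        w = shifted / shifted.sum()
        means[t] = np.sum(w * x)
        x = x[rng.choice(N, size=N, p=w)]              # multinomial resampling
    return means, loglik
```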
SMC Output

At time $t$, we get
$$\widehat{p}(x_{1:t} \mid y_{1:t}) = \sum_{i=1}^{N} W_t^{(i)} \,\delta_{X_{1:t}^{(i)}}(x_{1:t}), \qquad \overline{p}(x_{1:t} \mid y_{1:t}) = \frac{1}{N}\sum_{i=1}^{N} \delta_{\overline{X}_{1:t}^{(i)}}(x_{1:t}).$$

The marginal likelihood estimate is given by
$$\widehat{p}(y_{1:t}) = \prod_{k=1}^{t} \widehat{p}(y_k \mid y_{1:k-1}) = \prod_{k=1}^{t} \left( \frac{1}{N}\sum_{i=1}^{N} g\big(y_k \mid X_k^{(i)}\big) \right).$$

Computational complexity is $O(N)$ at each time step and memory requirements are $O(tN)$.

If we are only interested in $p(x_t \mid y_{1:t})$, or in $p(s_t(x_{1:t}) \mid y_{1:t})$ where $s_t(x_{1:t}) = \Psi_t(x_t, s_{t-1}(x_{1:t-1}))$ is fixed-dimensional (e.g. $s_t(x_{1:t}) = \sum_{k=1}^{t} x_k^2$), then memory requirements are $O(N)$.
SMC on Path-Space (figures by Olivier Cappé)

Figure: $p(x_1 \mid y_1)$ and $\widehat{E}[X_1 \mid y_1]$ (top) and particle approximation of $p(x_1 \mid y_1)$ (bottom).

Figure: $p(x_1 \mid y_1)$, $p(x_2 \mid y_{1:2})$ and $\widehat{E}[X_1 \mid y_1]$, $\widehat{E}[X_2 \mid y_{1:2}]$ (top) and particle approximation of $p(x_{1:2} \mid y_{1:2})$ (bottom).

Figure: $p(x_t \mid y_{1:t})$ and $\widehat{E}[X_t \mid y_{1:t}]$ for $t = 1, 2, 3$ (top) and particle approximation of $p(x_{1:3} \mid y_{1:3})$ (bottom).

Figure: $p(x_t \mid y_{1:t})$ and $\widehat{E}[X_t \mid y_{1:t}]$ for $t = 1, \ldots, 10$ (top) and particle approximation of $p(x_{1:10} \mid y_{1:10})$ (bottom).

Figure: $p(x_t \mid y_{1:t})$ and $\widehat{E}[X_t \mid y_{1:t}]$ for $t = 1, \ldots, 24$ (top) and particle approximation of $p(x_{1:24} \mid y_{1:24})$ (bottom).
Remarks

Empirically, this SMC strategy performs well at estimating the marginals $\{p(x_t \mid y_{1:t})\}_{t \geq 1}$; thankfully, in many applications this is all that is needed.

However, the joint distribution $p(x_{1:t} \mid y_{1:t})$ is poorly estimated when $t$ is large; in the previous example we have
$$\widehat{p}(x_{1:11} \mid y_{1:24}) = \delta_{X_{1:11}}(x_{1:11}).$$

Degeneracy problem. For any $N$ and any $k$, there exists $t(k, N)$ such that for any $t \geq t(k, N)$,
$$\widehat{p}(x_{1:k} \mid y_{1:t}) = \delta_{X_{1:k}}(x_{1:k});$$
i.e. $\widehat{p}(x_{1:t} \mid y_{1:t})$ is an unreliable approximation of $p(x_{1:t} \mid y_{1:t})$ as $t$ increases.
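The collapse of the early-time marginals can be reproduced numerically. The sketch below (random weights standing in for $g(y_t \mid x_t)$, an assumption for illustration) tracks how many distinct time-1 ancestors survive repeated multinomial resampling:

```python
import numpy as np

def unique_time1_ancestors(T, N, rng):
    """Run T multinomial resampling rounds with random weights and count
    how many distinct time-1 ancestors survive. The count collapses
    towards 1, illustrating the degeneracy of p_hat(x_1 | y_{1:t})."""
    anc = np.arange(N)                      # each particle's time-1 ancestor
    for _ in range(T):
        logw = rng.normal(size=N)           # stand-in for log g(y_t | x_t)
        w = np.exp(logw - logw.max())
        w /= w.sum()
        anc = anc[rng.choice(N, size=N, p=w)]
    return len(np.unique(anc))
```

With, say, $N = 100$, the number of surviving time-1 ancestors typically drops to a handful well before $T = 100$.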
Another Illustration of the Degeneracy Phenomenon

For the linear Gaussian state-space model described before, we can compute $S_t/t$ exactly, where
$$S_t = \int \left( \sum_{k=1}^{t} x_k^2 \right) p(x_{1:t} \mid y_{1:t}) \, dx_{1:t},$$
using Kalman techniques.

We compute the SMC estimate of this quantity, $\widehat{S}_t/t$, where
$$\widehat{S}_t = \int \left( \sum_{k=1}^{t} x_k^2 \right) \widehat{p}(x_{1:t} \mid y_{1:t}) \, dx_{1:t}$$
can be computed sequentially.

Figure: $S_t/t$ obtained through the Kalman smoother (blue) and its SMC estimate $\widehat{S}_t/t$ (red).
Some Convergence Results for SMC

Numerous convergence results for SMC are available; see (Del Moral, 2004).

Let $\varphi_t : \mathcal{X}^t \to \mathbb{R}$ and consider
$$\overline{\varphi}_t = \int \varphi_t(x_{1:t}) \, p(x_{1:t} \mid y_{1:t}) \, dx_{1:t}, \qquad \widehat{\varphi}_t = \int \varphi_t(x_{1:t}) \, \overline{p}(x_{1:t} \mid y_{1:t}) \, dx_{1:t} = \frac{1}{N}\sum_{i=1}^{N} \varphi_t\big(\overline{X}_{1:t}^{(i)}\big).$$

We can prove that for any bounded function $\varphi_t$ and any $p \geq 1$,
$$\mathbb{E}\big[\, |\widehat{\varphi}_t - \overline{\varphi}_t|^p \,\big]^{1/p} \leq \frac{B(t)\, c(p)\, \|\varphi_t\|_{\infty}}{\sqrt{N}}, \qquad \lim_{N \to \infty} \sqrt{N}\,\big(\widehat{\varphi}_t - \overline{\varphi}_t\big) \Rightarrow \mathcal{N}\big(0, \sigma_t^2\big).$$

These are very weak results: $B(t)$ and $\sigma_t^2$ can increase with $t$, and will do so for a path-dependent $\varphi_t(x_{1:t})$, as the degeneracy problem suggests.
Stronger Convergence Results

Assume the following exponential stability assumption: for any $x_1, x_1'$,
$$\frac{1}{2} \int \big| p(x_t \mid y_{2:t}, X_1 = x_1) - p(x_t \mid y_{2:t}, X_1 = x_1') \big| \, dx_t \leq \alpha^t \quad \text{for some } 0 \leq \alpha < 1.$$

Marginal distributions. For $\varphi_t(x_{1:t}) = \varphi(x_{t-L:t})$, there exist $B_1, B_2 < \infty$ s.t.
$$\mathbb{E}\big[\, |\widehat{\varphi}_t - \overline{\varphi}_t|^p \,\big]^{1/p} \leq \frac{B_1\, c(p)\, \|\varphi\|_{\infty}}{\sqrt{N}}, \qquad \lim_{N \to \infty} \sqrt{N}\,\big(\widehat{\varphi}_t - \overline{\varphi}_t\big) \Rightarrow \mathcal{N}\big(0, \sigma_t^2\big) \text{ with } \sigma_t^2 \leq B_2;$$
i.e. there is no accumulation of numerical errors over time.

$L_1$ distance. If $\widetilde{p}(x_{1:t} \mid y_{1:t}) = \mathbb{E}\big( \widehat{p}(x_{1:t} \mid y_{1:t}) \big)$, there exists $B_3 < \infty$ s.t.
$$\int \big| \widetilde{p}(x_{1:t} \mid y_{1:t}) - p(x_{1:t} \mid y_{1:t}) \big| \, dx_{1:t} \leq \frac{B_3\, t}{N};$$
i.e. the bias only increases linearly in $t$.
Stronger Convergence Results

Unbiasedness. The marginal likelihood estimate is unbiased:
$$\mathbb{E}\big( \widehat{p}(y_{1:t}) \big) = p(y_{1:t}).$$

Relative variance bound. There exists $B_4 < \infty$ s.t.
$$\mathbb{E}\left[ \left( \frac{\widehat{p}(y_{1:t})}{p(y_{1:t})} - 1 \right)^{2} \right] \leq \frac{B_4\, t}{N}.$$

Central limit theorem. There exists $B_5 < \infty$ s.t.
$$\lim_{N \to \infty} \sqrt{N}\,\big( \log \widehat{p}(y_{1:t}) - \log p(y_{1:t}) \big) \Rightarrow \mathcal{N}\big(0, \sigma_t^2\big) \quad \text{with } \sigma_t^2 \leq B_5\, t.$$
Basic Idea Used to Establish Uniform Lp Bounds

We denote $\eta_k(x_k) = p(x_k \mid y_{1:k-1})$, and let $\widehat{\eta}_k(x_k) = \widehat{p}(x_k \mid y_{1:k-1})$ be its particle approximation.

Let $\Phi_{k,t}$ be the measure-valued mapping such that $\eta_t = \Phi_{k,t}(\eta_k)$, which satisfies
$$\Phi_{k,t}(\eta_k)(x_t) = \int \underbrace{\frac{\eta_k(x_k)\, p(y_{k:t-1} \mid x_k)}{\int \eta_k(x_k')\, p(y_{k:t-1} \mid x_k')\, dx_k'}}_{p(x_k \mid y_{1:t-1})} \; p(x_t \mid x_k, y_{k+1:t-1}) \, dx_k.$$
Key Decomposition Formula

Write $\eta_t = \Phi_{1,t}(\eta_1)$ and compare, for each $k$, the propagation of the particle approximation $\widehat{\eta}_k$ with that of $\Phi_{k-1,k}(\widehat{\eta}_{k-1})$:
$$\eta_1 \to \eta_2 = \Phi_{1,2}(\eta_1) \to \cdots \to \eta_t = \Phi_{1,t}(\eta_1), \qquad \widehat{\eta}_k \to \cdots \to \Phi_{k,t}(\widehat{\eta}_k), \quad k = 1, \ldots, t-1.$$

Telescoping gives the decomposition of the error:
$$\widehat{\eta}_t - \eta_t = \sum_{k=1}^{t} \Big[ \Phi_{k,t}(\widehat{\eta}_k) - \Phi_{k,t}\big( \Phi_{k-1,k}(\widehat{\eta}_{k-1}) \big) \Big]$$
(with the conventions $\Phi_{t,t} = \mathrm{Id}$ and $\Phi_{0,1}(\widehat{\eta}_0) = \eta_1$).
Stability Properties

We have
$$p(x_t \mid x_k, y_{k+1:t-1}) = \int p(x_{k+1:t} \mid x_k, y_{k+1:t-1}) \, dx_{k+1:t-1},$$
where
$$p(x_{k+1:t} \mid x_k, y_{k+1:t-1}) = \prod_{m=k+1}^{t} p(x_m \mid x_{m-1}, y_{m:t-1}).$$

To summarize, we have
$$\Phi_{k,t}(\eta_k)(x_t) = \int \underbrace{\frac{\eta_k(x_k)\, p(y_{k:t-1} \mid x_k)}{\int \eta_k(x_k')\, p(y_{k:t-1} \mid x_k')\, dx_k'}}_{p(x_k \mid y_{1:t-1})} \; \prod_{m=k+1}^{t} p(x_m \mid x_{m-1}, y_{m:t-1}) \, dx_{k:t-1}.$$
Stability Properties

Assume there exists $\epsilon > 0$ s.t. for any $x, x'$ and for any $y$,
$$\epsilon\, \nu(x') \leq f(x' \mid x) \leq \epsilon^{-1}\, \nu(x'), \qquad 0 < g^{-} \leq g(y \mid x) \leq g^{+} < \infty;$$
then there exists $0 \leq \lambda < 1$ such that
$$\frac{1}{2} \int \big| \Phi_{k,k+t}(\eta)(x) - \Phi_{k,k+t}(\eta')(x) \big| \, dx \leq \lambda^{t}.$$

Hence we have
$$\Phi_{k,t}(\eta_k)(x_t) \approx \Phi_{k,t}(\widehat{\eta}_k)(x_t)$$
as $(t - k) \to \infty$.
Putting Everything Together

Under such strong mixing assumptions,
$$\widehat{\eta}_t - \eta_t = \sum_{k=1}^{t} \underbrace{\Big[ \Phi_{k,t}(\widehat{\eta}_k) - \Phi_{k,t}\big( \Phi_{k-1,k}(\widehat{\eta}_{k-1}) \big) \Big]}_{\approx\, \lambda^{t-k+1}/\sqrt{N}} \quad \text{for } 0 \leq \lambda < 1.$$

We can then obtain results such as: there exists $B_1 < \infty$ s.t.
$$\mathbb{E}\big[\, |\widehat{\varphi}_t - \overline{\varphi}_t|^p \,\big]^{1/p} \leq \frac{B_1\, c(p)\, \|\varphi\|_{\infty}}{\sqrt{N}}.$$

Much work has been done recently on removing such strong mixing assumptions; see e.g. Whiteley (2012) for much weaker and more realistic assumptions.
Summary

SMC provides consistent estimates under weak assumptions.

Under stability assumptions, we have uniform-in-time stability of the SMC estimates of $\{p(x_t \mid y_{1:t})\}_{t \geq 1}$.

Under stability assumptions, the relative variance of the SMC estimate of $\{p(y_{1:t})\}_{t \geq 1}$ increases only linearly with $t$.

Even under stability assumptions, one cannot expect uniform-in-time stability for SMC estimates of $\{p(x_{1:t} \mid y_{1:t})\}_{t \geq 1}$; this is due to the degeneracy problem.

Is it possible to Q1: eliminate, Q2: mitigate the degeneracy problem?

Answer: Q1: no, Q2: yes.
Is Resampling Really Necessary?

Resampling is the source of the degeneracy problem and might appear wasteful.

The resampling step is an unbiased operation,
$$\mathbb{E}\big[\, \overline{p}(x_{1:t} \mid y_{1:t}) \mid \widehat{p}(x_{1:t} \mid y_{1:t}) \,\big] = \widehat{p}(x_{1:t} \mid y_{1:t}),$$
but clearly it introduces some errors locally in time. That is, for any test function $\varphi$ we have
$$\mathbb{V}\left[ \int \varphi(x_{1:t})\, \overline{p}(x_{1:t} \mid y_{1:t})\, dx_{1:t} \right] \geq \mathbb{V}\left[ \int \varphi(x_{1:t})\, \widehat{p}(x_{1:t} \mid y_{1:t})\, dx_{1:t} \right].$$

What about eliminating the resampling step?
Sequential Importance Sampling: SMC Without Resampling

In this case, the estimate of the posterior is
$$\widehat{p}_{\mathrm{SIS}}(x_{1:t} \mid y_{1:t}) = \sum_{i=1}^{N} W_t^{(i)} \,\delta_{X_{1:t}^{(i)}}(x_{1:t}),$$
where $X_{1:t}^{(i)} \sim p(x_{1:t})$ and
$$W_t^{(i)} \propto p\big(y_{1:t} \mid X_{1:t}^{(i)}\big) = \prod_{k=1}^{t} g\big(y_k \mid X_k^{(i)}\big).$$

In this case, the marginal likelihood estimate is
$$\widehat{p}_{\mathrm{SIS}}(y_{1:t}) = \frac{1}{N}\sum_{i=1}^{N} p\big(y_{1:t} \mid X_{1:t}^{(i)}\big).$$

The relative variance of $p\big(y_{1:t} \mid X_{1:t}^{(i)}\big) = \prod_{k=1}^{t} g\big(y_k \mid X_k^{(i)}\big)$ increases exponentially fast with $t$...
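The weight degeneracy of SIS is easy to observe directly. A sketch for a toy random-walk model (model and names are my illustrative assumptions) returns the largest normalized weight after each step; it tends quickly towards 1:

```python
import numpy as np

def sis_max_weight(y, N, rng, sigma_v=1.0, sigma_w=1.0):
    """Sequential importance sampling from the prior, no resampling:
    log-weights accumulate log g(y_k | x_k). Returns max_i W_t^{(i)}
    after each observation."""
    x = rng.normal(0.0, sigma_v, size=N)            # X_1 ~ mu (assumed)
    logw = np.zeros(N)
    out = []
    for t, yt in enumerate(y):
        if t > 0:
            x = x + rng.normal(0.0, sigma_v, N)     # random-walk prior f
        logw += -0.5 * ((yt - x) / sigma_w) ** 2    # accumulate log g
        w = np.exp(logw - logw.max())
        w /= w.sum()
        out.append(w.max())
    return np.array(out)
```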
SIS For Stochastic Volatility Model

Figure: Histograms of $\log_{10}\big(W_t^{(i)}\big)$ for $t = 1$ (top), $t = 50$ (middle) and $t = 100$ (bottom).

The algorithm's performance collapses as $t$ increases, as expected.
Central Limit Theorems

For both SIS and SMC, we have a CLT for the estimates of the marginal likelihood:
$$\sqrt{N}\left( \frac{\widehat{p}_{\mathrm{SIS}}(y_{1:t})}{p(y_{1:t})} - 1 \right) \Rightarrow \mathcal{N}\big(0, \sigma^2_{t,\mathrm{SIS}}\big), \qquad \sqrt{N}\left( \frac{\widehat{p}_{\mathrm{SMC}}(y_{1:t})}{p(y_{1:t})} - 1 \right) \Rightarrow \mathcal{N}\big(0, \sigma^2_{t,\mathrm{SMC}}\big).$$

The variance expressions are
$$\sigma^2_{t,\mathrm{SIS}} = \int \frac{p^2(x_{1:t} \mid y_{1:t})}{p(x_{1:t})}\, dx_{1:t} - 1 = \frac{\int p^2(y_{1:t} \mid x_{1:t})\, p(x_{1:t})\, dx_{1:t}}{p^2(y_{1:t})} - 1,$$
$$\sigma^2_{t,\mathrm{SMC}} = \int \frac{p^2(x_1 \mid y_{1:t})}{\mu(x_1)}\, dx_1 + \sum_{k=2}^{t} \int \frac{p^2(x_{1:k} \mid y_{1:t})}{p(x_{1:k-1} \mid y_{1:k-1})\, f(x_k \mid x_{k-1})}\, dx_{1:k} - t$$
$$\phantom{\sigma^2_{t,\mathrm{SMC}}} = \frac{\int p^2(y_{1:t} \mid x_1)\, \mu(x_1)\, dx_1}{p^2(y_{1:t})} + \sum_{k=2}^{t} \frac{\int p^2(y_{k:t} \mid x_k)\, p(x_k \mid y_{1:k-1})\, dx_k}{p^2(y_{k:t} \mid y_{1:k-1})} - t.$$

SMC breaks the integral over $\mathcal{X}^t$ into $t$ integrals over $\mathcal{X}$.
A Toy Example

Consider the case where $f(x' \mid x) = \mu(x') = \mathcal{N}\big(x'; 0, \sigma^2\big)$ and $g(y \mid x) = \mathcal{N}\big(y; x, \tfrac{1}{1 - \sigma^{-2}}\big)$, where $\sigma^2 > 1$.

Assume we observe $y_1 = \cdots = y_t = 0$; then we have
$$\mathbb{V}\left[ \frac{\widehat{p}_{\mathrm{SIS}}(y_{1:t})}{p(y_{1:t})} \right] = \frac{\sigma^2_{t,\mathrm{SIS}}}{N} = \frac{1}{N}\left[ \left( \frac{\sigma^4}{2\sigma^2 - 1} \right)^{t/2} - 1 \right],$$
$$\mathbb{V}\left[ \frac{\widehat{p}_{\mathrm{SMC}}(y_{1:t})}{p(y_{1:t})} \right] = \frac{\sigma^2_{t,\mathrm{SMC}}}{N} \approx \frac{t}{N}\left[ \left( \frac{\sigma^4}{2\sigma^2 - 1} \right)^{1/2} - 1 \right].$$

If we select $\sigma^2 = 1.2$, then it is necessary to use $N \approx 2 \times 10^{23}$ particles to obtain $\sigma^2_{t,\mathrm{SIS}}/N = 10^{-2}$ for $t = 1000$.

To obtain $\sigma^2_{t,\mathrm{SMC}}/N = 10^{-2}$, SMC requires only $N \approx 10^4$ particles: an improvement by 19 orders of magnitude!
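Plugging numbers into the two variance formulas makes the exponential-vs-linear gap concrete. This quick check is mine, not from the slides; the exact particle counts depend on how $\sigma^2$ is parametrized, but the ordering does not:

```python
import math

def particles_needed(sigma2, t, target_rel_var=1e-2):
    """Evaluate the two closed-form relative variances above and return
    the N required for sigma^2_{t,SIS}/N and sigma^2_{t,SMC}/N to reach
    the target."""
    r = sigma2 ** 2 / (2.0 * sigma2 - 1.0)   # sigma^4 / (2 sigma^2 - 1)
    var_sis = r ** (t / 2.0) - 1.0           # exponential in t
    var_smc = t * (math.sqrt(r) - 1.0)       # linear in t
    return var_sis / target_rel_var, var_smc / target_rel_var
```

For any $\sigma^2 > 1$, the required SIS particle count blows up geometrically in $t$ while the SMC count grows only linearly.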
Better Resampling Schemes

Better resampling steps can be designed such that $\mathbb{E}\big[N_t^{(i)}\big] = N W_t^{(i)}$ but $\mathbb{V}\big[N_t^{(i)}\big] < N W_t^{(i)}\big(1 - W_t^{(i)}\big)$: residual resampling, minimal entropy resampling, etc. (Cappé et al., 2005).

Residual Resampling. Set $\widetilde{N}_t^{(i)} = \big\lfloor N W_t^{(i)} \big\rfloor$, then sample $\overline{N}_t^{1:N}$ from a multinomial of parameters $\big(N - \sum_{j} \widetilde{N}_t^{(j)},\, \overline{W}_t^{(1:N)}\big)$ where $\overline{W}_t^{(i)} \propto W_t^{(i)} - N^{-1} \widetilde{N}_t^{(i)}$; then set $N_t^{(i)} = \widetilde{N}_t^{(i)} + \overline{N}_t^{(i)}$.

Systematic Resampling. Sample $U_1 \sim \mathcal{U}\big[0, \tfrac{1}{N}\big]$ and define $U_i = U_1 + \tfrac{i-1}{N}$ for $i = 2, \ldots, N$; then set
$$N_t^{(i)} = \left| \left\{ U_j : \sum_{k=1}^{i-1} W_t^{(k)} \leq U_j < \sum_{k=1}^{i} W_t^{(k)} \right\} \right|,$$
with the convention $\sum_{k=1}^{0} := 0$.
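A NumPy sketch of systematic resampling (names are mine); a single uniform drives all $N$ strata, and `searchsorted` inverts the cumulative weights:

```python
import numpy as np

def systematic_resample(weights, rng):
    """Systematic resampling: U_1 ~ U[0, 1/N], U_i = U_1 + (i-1)/N.
    Returns the ancestor index selected for each of the N slots
    (searchsorted is O(N log N); a single O(N) scan is also possible)."""
    N = len(weights)
    c = np.cumsum(weights)
    c[-1] = 1.0                             # guard against round-off
    u = (rng.random() + np.arange(N)) / N   # the stratified uniforms U_i
    return np.searchsorted(c, u)
```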
Measuring Variability of the Weights

To measure the variation of the weights, we can use the Effective Sample Size (ESS):
$$\mathrm{ESS} = \left( \sum_{i=1}^{N} \big( W_t^{(i)} \big)^{2} \right)^{-1}.$$

We have $\mathrm{ESS} = N$ if $W_t^{(i)} = 1/N$ for all $i$, and $\mathrm{ESS} = 1$ if $W_t^{(i)} = 1$ for some $i$ and $W_t^{(j)} = 0$ for $j \neq i$.

Liu (1996) showed that for simple importance sampling, for $\varphi$ regular enough,
$$\mathbb{V}\left[ \sum_{i=1}^{N} W_t^{(i)}\, \varphi\big(X_t^{(i)}\big) \right] \approx \mathbb{V}_{p(x_{1:t} \mid y_{1:t})}\left[ \frac{1}{\mathrm{ESS}} \sum_{i=1}^{\mathrm{ESS}} \varphi\big(X_t^{(i)}\big) \right];$$
i.e. the estimate is roughly as accurate as one computed from an iid sample of size ESS from $p(x_{1:t} \mid y_{1:t})$.
Dynamic Resampling

Resampling at each time step can be harmful: only resample when necessary.

Dynamic resampling: if the variation of the weights as measured by the ESS is too high, e.g. $\mathrm{ESS} < N/2$, then resample the particles.

We can also use the entropy
$$\mathrm{Ent} = -\sum_{i=1}^{N} W_t^{(i)} \log_2 \big( W_t^{(i)} \big).$$

We have $\mathrm{Ent} = \log_2(N)$ if $W_t^{(i)} = 1/N$ for all $i$, and $\mathrm{Ent} = 0$ if $W_t^{(i)} = 1$ for some $i$ and $W_t^{(j)} = 0$ for $j \neq i$.
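Both criteria are one-liners; a sketch (function names are mine) with the $\mathrm{ESS} < N/2$ trigger:

```python
import numpy as np

def ess(weights):
    """Effective sample size (sum_i W_i^2)^{-1}; lies in [1, N]."""
    return 1.0 / np.sum(weights ** 2)

def entropy(weights):
    """Ent = -sum_i W_i log2 W_i; lies in [0, log2 N]."""
    w = weights[weights > 0.0]   # convention: 0 log 0 = 0
    return -np.sum(w * np.log2(w))

def needs_resampling(weights, threshold=0.5):
    """Dynamic resampling rule: resample only when ESS < threshold * N."""
    return ess(weights) < threshold * len(weights)
```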
Improving the Sampling Step

Bootstrap filter: particles are sampled blindly according to the prior, without taking the observation into account. This is very inefficient for a vague prior / peaky likelihood.

Optimal proposal / perfect adaptation. Implement the following alternative update-propagate Bayesian recursion:
$$\text{Update: } p(x_{1:t-1} \mid y_{1:t}) = \frac{p(y_t \mid x_{t-1})\, p(x_{1:t-1} \mid y_{1:t-1})}{p(y_t \mid y_{1:t-1})},$$
$$\text{Propagate: } p(x_{1:t} \mid y_{1:t}) = p(x_{1:t-1} \mid y_{1:t})\, p(x_t \mid y_t, x_{t-1}),$$
where
$$p(x_t \mid y_t, x_{t-1}) = \frac{f(x_t \mid x_{t-1})\, g(y_t \mid x_t)}{p(y_t \mid x_{t-1})}.$$

This is much more efficient when applicable; e.g. for $f(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \varphi(x_{t-1}), \Sigma_v)$, $g(y_t \mid x_t) = \mathcal{N}(y_t; x_t, \Sigma_w)$.
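For the Gaussian example just given (written here in the scalar case, with my own names), both $p(x_t \mid y_t, x_{t-1})$ and the weight $p(y_t \mid x_{t-1})$ are available in closed form:

```python
import numpy as np

def optimal_proposal_step(x_prev, y_t, phi, sigma_v, sigma_w, rng):
    """One perfectly adapted step for f = N(phi(x), sigma_v^2) and
    g = N(y; x, sigma_w^2): sample p(x_t | y_t, x_{t-1}) exactly and
    weight by p(y_t | x_{t-1})."""
    mu = phi(x_prev)                                    # prior mean per particle
    s2 = 1.0 / (1.0 / sigma_v ** 2 + 1.0 / sigma_w ** 2)  # posterior variance
    m = s2 * (mu / sigma_v ** 2 + y_t / sigma_w ** 2)     # posterior mean
    x_new = m + np.sqrt(s2) * rng.normal(size=x_prev.shape)
    # log p(y_t | x_{t-1}) = log N(y_t; phi(x_{t-1}), sv^2 + sw^2), up to const
    logw = -0.5 * (y_t - mu) ** 2 / (sigma_v ** 2 + sigma_w ** 2)
    return x_new, logw
```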
A General Bayesian Recursion

Introduce an arbitrary proposal distribution $q(x_t \mid y_t, x_{t-1})$, i.e. an approximation to $p(x_t \mid y_t, x_{t-1})$.

We have seen that
$$p(x_{1:t} \mid y_{1:t}) = \frac{g(y_t \mid x_t)\, f(x_t \mid x_{t-1})\, p(x_{1:t-1} \mid y_{1:t-1})}{p(y_t \mid y_{1:t-1})},$$
so clearly
$$p(x_{1:t} \mid y_{1:t}) = \frac{w(x_{t-1}, x_t, y_t)\, q(x_t \mid y_t, x_{t-1})\, p(x_{1:t-1} \mid y_{1:t-1})}{p(y_t \mid y_{1:t-1})},$$
where
$$w(x_{t-1}, x_t, y_t) = \frac{g(y_t \mid x_t)\, f(x_t \mid x_{t-1})}{q(x_t \mid y_t, x_{t-1})}.$$

This suggests a more general SMC algorithm.
A General SMC Algorithm

Assume we have $N$ weighted particles $\big\{ W_{t-1}^{(i)}, X_{1:t-1}^{(i)} \big\}$ approximating $p(x_{1:t-1} \mid y_{1:t-1})$; then at time $t$:

- Sample $X_t^{(i)} \sim q\big(x_t \mid y_t, X_{t-1}^{(i)}\big)$, set $X_{1:t}^{(i)} = \big(X_{1:t-1}^{(i)}, X_t^{(i)}\big)$ and
$$\widehat{p}(x_{1:t} \mid y_{1:t}) = \sum_{i=1}^{N} W_t^{(i)} \,\delta_{X_{1:t}^{(i)}}(x_{1:t}), \qquad W_t^{(i)} \propto W_{t-1}^{(i)}\, \frac{f\big(X_t^{(i)} \mid X_{t-1}^{(i)}\big)\, g\big(y_t \mid X_t^{(i)}\big)}{q\big(X_t^{(i)} \mid y_t, X_{t-1}^{(i)}\big)}.$$
- If $\mathrm{ESS} < N/2$, resample $\overline{X}_{1:t}^{(i)} \sim \widehat{p}(x_{1:t} \mid y_{1:t})$ and set $W_t^{(i)} \leftarrow \tfrac{1}{N}$ to obtain $\overline{p}(x_{1:t} \mid y_{1:t}) = \frac{1}{N}\sum_{i=1}^{N} \delta_{\overline{X}_{1:t}^{(i)}}(x_{1:t})$.
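One step of this general algorithm, sketched with user-supplied callables for $q$, $f$ and $g$ (all names and the log-weight bookkeeping are my choices, not from the slides):

```python
import numpy as np

def smc_step(x_prev, logW_prev, y_t, sample_q, log_q, log_f, log_g, rng,
             ess_threshold=0.5):
    """One general SMC step: propose from q, reweight by f g / q, and
    resample (multinomially) only when ESS < ess_threshold * N.
    All model inputs are vectorized callables over the particle array."""
    N = len(x_prev)
    x_new = sample_q(x_prev, y_t, rng)                 # X_t ~ q(.|y_t, X_{t-1})
    logW = (logW_prev + log_f(x_new, x_prev)
            + log_g(y_t, x_new) - log_q(x_new, x_prev, y_t))
    W = np.exp(logW - logW.max())
    W /= W.sum()
    if 1.0 / np.sum(W ** 2) < ess_threshold * N:       # ESS < N/2 by default
        idx = rng.choice(N, size=N, p=W)
        x_new, W = x_new[idx], np.full(N, 1.0 / N)
    return x_new, np.log(W)
```

With `sample_q` drawing from $f$ and `log_q = log_f`, the weight reduces to $g(y_t \mid x_t)$ and this step recovers the bootstrap filter.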
Building Proposals

Our aim is to select $q(x_t \mid y_t, x_{t-1})$ as close as possible to $p(x_t \mid y_t, x_{t-1})$, as this minimizes the variance of
$$w(x_{t-1}, x_t, y_t) = \frac{g(y_t \mid x_t)\, f(x_t \mid x_{t-1})}{q(x_t \mid y_t, x_{t-1})}.$$

Example - EKF proposal. Let $X_t = \varphi(X_{t-1}) + V_t$ and $Y_t = \Psi(X_t) + W_t$, with $V_t \sim \mathcal{N}(0, \Sigma_v)$, $W_t \sim \mathcal{N}(0, \Sigma_w)$. We perform a local linearization,
$$Y_t \approx \Psi\big(\varphi(X_{t-1})\big) + \left. \frac{\partial \Psi(x)}{\partial x} \right|_{\varphi(X_{t-1})} \big( X_t - \varphi(X_{t-1}) \big) + W_t,$$
and use
$$q(x_t \mid y_t, x_{t-1}) \propto \widehat{g}(y_t \mid x_t)\, f(x_t \mid x_{t-1})$$
as a proposal, where $\widehat{g}$ is the linearized likelihood.

Any standard suboptimal filtering method can be used: unscented particle filter, Gaussian quadrature particle filter, etc.
Implicit Proposals

Proposed recently by Chorin (2012). Let
$$F(x_{t-1}, x_t) = \log g(y_t \mid x_t) + \log f(x_t \mid x_{t-1})$$
and
$$x_t^{*} = \arg\max_{x_t} F(x_{t-1}, x_t) = \arg\max_{x_t} p(x_t \mid y_t, x_{t-1}).$$

We sample $Z \sim \mathcal{N}(0, I_{n_x})$, then we solve in $X_t$
$$F(x_{t-1}, x_t^{*}) - F(x_{t-1}, X_t) = \frac{1}{2} Z^{\mathsf{T}} Z,$$
so if there is a unique solution,
$$q(x_t \mid y_t, x_{t-1}) = p_Z(z)\, \big| \det \partial z / \partial x_t \big| \propto \frac{\exp\big( -F(x_{t-1}, x_t^{*}) \big)\, g(y_t \mid x_t)\, f(x_t \mid x_{t-1})}{\big| \det \partial x_t / \partial z \big|}.$$

The incremental weight is
$$\frac{g(y_t \mid x_t)\, f(x_t \mid x_{t-1})}{q(x_t \mid y_t, x_{t-1})} \propto \big| \det \partial x_t / \partial z \big| \, \exp\big( F(x_{t-1}, x_t^{*}) \big).$$
Auxiliary Particle Filters

Popular variation introduced by (Pitt & Shephard, 1999).

This corresponds to a standard SMC algorithm (Johansen & D., 2008) where we target
$$\widetilde{p}(x_{1:t} \mid y_{1:t+1}) \propto p(x_{1:t} \mid y_{1:t})\, \widetilde{p}(y_{t+1} \mid x_t), \quad \text{where } \widetilde{p}(y_{t+1} \mid x_t) \approx p(y_{t+1} \mid x_t),$$
using a proposal $\widetilde{p}(x_t \mid y_t, x_{t-1})$.

When $\widetilde{p}(y_{t+1} \mid x_t) = p(y_{t+1} \mid x_t)$ and $\widetilde{p}(x_{t+1} \mid y_{t+1}, x_t) = p(x_{t+1} \mid y_{t+1}, x_t)$, we are back to perfect adaptation.
Block Sampling Proposals

Problem: we only sample $X_t$ at time $t$, so even if you use $p(x_t \mid y_t, x_{t-1})$, the SMC estimates could have high variance if $\mathbb{V}_{p(x_{t-1} \mid y_{1:t-1})}\left[ p(y_t \mid x_{t-1}) \right]$ is high.

Block sampling idea: allow yourself to sample again $X_{t-L+1:t-1}$, as well as $X_t$, in light of $y_t$. Optimally we would like at time $t$ to sample

$X^{(i)}_{t-L+1:t} \sim p\big(x_{t-L+1:t} \mid y_{t-L+1:t}, X^{(i)}_{t-L}\big)$

and set

$W^{(i)}_t \propto W^{(i)}_{t-1}\, \dfrac{p\big(X^{(i)}_{1:t-L} \mid y_{1:t}\big)}{p\big(X^{(i)}_{1:t-L} \mid y_{1:t-1}\big)} \propto W^{(i)}_{t-1}\, p\big(y_t \mid y_{t-L+1:t-1}, X^{(i)}_{t-L}\big).$

When $p(x_{t-L+1:t} \mid y_{t-L+1:t}, x_{t-L})$ and $p(y_t \mid y_{t-L+1:t-1}, x_{t-L})$ are not available, we can use analytical approximations of them and still have consistent estimates (D., Briers & Senecal, 2006).

A. Doucet (MLSS Sept. 2012) Sept. 2012 53 / 136
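For a linear-Gaussian model the two block-sampling ingredients are exactly computable. The sketch below (model and notation are our illustrative assumptions) draws $x_{t-L+1:t} \sim p(\cdot \mid y_{t-L+1:t}, x_{t-L})$ by a conditional Kalman forward pass followed by backward sampling, and returns the optimal incremental weight $p(y_t \mid y_{t-L+1:t-1}, x_{t-L})$ in log form.

```python
import math
import random

# Illustrative model (our assumption), conditional on the block start x_{t-L}:
#   X_k = X_{k-1} + V_k,  Y_k = X_k + W_k,  V_k, W_k iid N(0,1).

def block_proposal(x_start, ys_block, rng):
    """Sample x_{t-L+1:t} from p(. | y_{t-L+1:t}, x_{t-L}) and return it together
    with log p(y_t | y_{t-L+1:t-1}, x_{t-L}), the optimal incremental weight."""
    m, P = x_start, 0.0            # the block start is known exactly
    ms, Ps, log_incr = [], [], []
    for y in ys_block:
        P += 1.0                   # predict through X_k = X_{k-1} + V_k
        S = P + 1.0                # innovation variance (obs. noise variance 1)
        log_incr.append(-0.5 * (math.log(2.0 * math.pi * S) + (y - m) ** 2 / S))
        K = P / S
        m, P = m + K * (y - m), (1.0 - K) * P   # update with Y_k = X_k + W_k
        ms.append(m)
        Ps.append(P)
    # backward sampling of the whole block
    L = len(ys_block)
    x = [0.0] * L
    x[-1] = rng.gauss(ms[-1], math.sqrt(Ps[-1]))
    for k in range(L - 2, -1, -1):
        # p(x_k | x_{k+1}, y_{1:k}) from N(m_k, P_k) and x_{k+1} | x_k ~ N(x_k, 1)
        var = 1.0 / (1.0 / Ps[k] + 1.0)
        mean = var * (ms[k] / Ps[k] + x[k + 1])
        x[k] = rng.gauss(mean, math.sqrt(var))
    return x, log_incr[-1]
```

In the non-Gaussian case these exact recursions would be replaced by analytical approximations (e.g. Gaussian approximations of the smoothing distributions), with the resulting mismatch absorbed into the importance weight, as in D., Briers & Senecal (2006).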
Block Sampling Proposals

Computational cost is increased from $O(N)$ to $O(LN)$, so is it worth it?

Consider the ideal scenario where

$X_t = X_{t-1} + V_t, \quad Y_t = X_t + W_t$

where $X_1 \sim \mathcal{N}(0, 1)$ and $V_t, W_t$ i.i.d. $\mathcal{N}(0, 1)$.

In this case, we have

$\big| p(y_t \mid y_{t-L+1:t-1}, x_{t-L}) - p(y_t \mid y_{t-L+1:t-1}, x'_{t-L}) \big| < c\, \dfrac{\left| x_{t-L} - x'_{t-L} \right|}{2^L}$

where the rate of exponential convergence depends upon the signal-to-noise ratio if more general Gaussian AR models are considered.

We can obtain an analytic expression for the variance of the (normalized) weight.

A. Doucet (MLSS Sept. 2012) Sept. 2012 54 / 136
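The forgetting bound above can be checked numerically. The sketch below (same toy model; the observation sequence and conditioning states are our illustrative choices) runs the conditional Kalman recursion from two different block starts $x_{t-L}$ and $x'_{t-L}$ and records how fast the two predictive densities $p(y_t \mid y_{t-L+1:t-1}, \cdot)$ agree as $L$ grows.

```python
import math

# Same toy model: X_k = X_{k-1} + V_k, Y_k = X_k + W_k, V_k, W_k iid N(0,1).

def log_predictive(x_start, ys_block):
    """Log of the last one-step predictive density p(y_L | y_{1:L-1}, x_start),
    computed by a Kalman recursion started exactly at x_start."""
    m, P = x_start, 0.0
    for y in ys_block:
        P += 1.0                   # predict
        S = P + 1.0                # innovation variance
        last = -0.5 * (math.log(2.0 * math.pi * S) + (y - m) ** 2 / S)
        K = P / S
        m, P = m + K * (y - m), (1.0 - K) * P   # update
    return last

# illustrative observation sequence (our choice) and two conditioning states
ys = [0.3, -0.1, 0.4, 0.2, -0.5, 0.1, 0.0, 0.2]
gaps = []
for L in range(1, 8):
    block = ys[-L:]
    a = math.exp(log_predictive(1.0, block))
    b = math.exp(log_predictive(-1.0, block))
    gaps.append(abs(a - b))
# gaps shrinks (roughly geometrically) with L: the predictive density forgets x_{t-L}
```

The dependence of the conditional filtering mean on the block start decays by a factor $(1 - K_k) < 1$ at each update, which is the mechanism behind the $2^{-L}$ rate in the bound.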
Block Sampling Proposals

[Figure: variance of the incremental weight w.r.t. $p(x_{1:t-L} \mid y_{1:t-1})$ for various block lengths $L$.]

A. Doucet (MLSS Sept. 2012) Sept. 2012 55 / 136