An intro to particle methods for parameter inference in state-space models


An intro to particle methods for parameter inference in state-space models. PhD course FMS020F/NAMS002 "Statistical inference for partially observed stochastic processes", Lund University. http://goo.gl/sx8vu9. Umberto Picchini, Centre for Mathematical Sciences, Lund University.

This lecture will explore possibilities offered by particle filters, also known as Sequential Monte Carlo (SMC) methods, when applied to parameter estimation problems. We will not give a thorough introduction to SMC methods. Instead, we introduce the most basic (and most popular) SMC method, with the main goal of performing inference for θ, a vector of unknown parameters entering a state-space / hidden Markov model. Results anticipation: thanks to recent and astonishing results, we will show how it is possible to construct a practical method for exact Bayesian inference on the parameters of a state-space model.

Acknowledgements (early enough, not to forget...): part of the presented material (slides, MATLAB code, etc.) has been kindly shared by Magnus Wiktorsson (Matematikcentrum, Lund) and Fredrik Lindsten (Automatic Control, Linköping).

Working framework. We consider hereafter the (important) case where data are modelled according to a hidden Markov model (HMM), often denoted state-space model (SSM). (In recent literature the terms SSM and HMM have been used interchangeably, and we do the same here, though elsewhere it is sometimes assumed that HMMs have a discrete state space while SSMs have a continuous one.) A state-space model refers to a class of probabilistic graphical models that describe the probabilistic dependence between latent state variables and the observed measurements of a dynamical system.

Our system of interest is $\{X_t\}_{t \geq 0}$. This is a latent/unobservable stochastic process. The (unavailable) values of the system on a discrete time grid $\{0, 1, \ldots, T\}$ are $X_{0:T} = \{X_0, \ldots, X_T\}$ (we consider integer times $t$ for ease of notation). Actual observations (data) are noisy: $y_{1:T} = \{y_1, \ldots, y_T\}$. That is, at time $t$ each $y_t$ is a noisy/corrupted version of the true state $X_t$ of the system. $\{X_t\}_{t \geq 0}$ is assumed to be a Markovian stochastic process, and the observations $y_t$ are assumed conditionally independent given the corresponding latent state $X_t$. We formalize this on the next slide.

SSMs as graphical models. Graphically:

$\cdots \rightarrow X_{t-1} \rightarrow X_t \rightarrow X_{t+1} \rightarrow \cdots$ (Markov chain)

with each state emitting the corresponding observation ($X_t \rightarrow Y_t$):

$Y_t \mid X_t = x_t \sim p(y_t \mid x_t)$ (observation density)
$X_{t+1} \mid X_t = x_t \sim p(x_{t+1} \mid x_t)$ (transition density)
$X_0 \sim \pi(x_0)$ (initial distribution)

Example: Gaussian random walk. A most trivial example (a linear, Gaussian SSM):

$x_t = x_{t-1} + \xi_t, \quad \xi_t \stackrel{iid}{\sim} N(0, q^2)$
$y_t = x_t + e_t, \quad e_t \stackrel{iid}{\sim} N(0, r^2)$

Therefore

$p(x_t \mid x_{t-1}) = N(x_t;\, x_{t-1}, q^2)$
$p(y_t \mid x_t) = N(y_t;\, x_t, r^2)$

Notation: here and in the following, $N(\mu, \sigma^2)$ is the Gaussian distribution with mean $\mu$ and variance $\sigma^2$; $N(x; \mu, \sigma^2)$ is the pdf of a $N(\mu, \sigma^2)$ distribution evaluated at $x$.
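To make the example concrete, here is a minimal simulation sketch in Python (the shared course code is in MATLAB, but the logic carries over directly). The values of T, q, r and x0 below are arbitrary illustrative choices, not from the slides.

```python
import numpy as np

def simulate_gaussian_rw(T=100, q=0.5, r=1.0, x0=0.0, rng=None):
    """Simulate x_t = x_{t-1} + xi_t, y_t = x_t + e_t (xi_t ~ N(0,q^2), e_t ~ N(0,r^2))."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.empty(T + 1)
    y = np.empty(T)
    x[0] = x0                                        # deterministic initial state X_0
    for t in range(1, T + 1):
        x[t] = x[t - 1] + q * rng.standard_normal()  # transition: N(x_{t-1}, q^2)
        y[t - 1] = x[t] + r * rng.standard_normal()  # observation: N(x_t, r^2)
    return x, y

x, y = simulate_gaussian_rw()
```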

Notation reminder. Always remember: $X_0$ is NOT the state at the first observational time; that one is $X_1$. Instead, $X_0$ is the (typically unknown) initial state of the process $\{X_t\}$. $X_0$ can be set to a deterministic constant (as in the example we are going to discuss soon).

Markov property of hidden states. Briefly: $X_t$ (and indeed the whole future $X_{t+1}, X_{t+2}, \ldots$) given $X_{t-1}$ is independent of anything that happened before time $t-1$:

$p(x_t \mid x_{0:t-1}, y_{1:t-1}) = p(x_t \mid x_{t-1}), \quad t = 1, \ldots, T$

Also, the past is independent of the future given the present:

$p(x_{t-1} \mid x_{t:T}, y_{t:T}) = p(x_{t-1} \mid x_t)$

Conditional independence of measurements. The current measurement $y_t$, given $x_t$, is conditionally independent of the measurement and state histories:

$p(y_t \mid x_{0:t}, y_{1:t-1}) = p(y_t \mid x_t).$

The Markov property of $\{X_t\}$ and the conditional independence of $\{Y_t\}$ are the key ingredients for defining an HMM or SSM.

In general, $\{X_t\}$ and $\{Y_t\}$ can be either continuous- or discrete-valued stochastic processes. However, in the following we assume $\{X_t\}$ and $\{Y_t\}$ to be defined on continuous spaces.

Our goal. In all previous slides we haven't made explicit reference to the presence of unknown constants that we wish to estimate. For example, in the Gaussian random walk example the unknowns would be the variances $(q^2, r^2)$ (and perhaps the initial state $x_0$, in some situations). In that case we would like to estimate $\theta = (q^2, r^2)$ using observations $y_{1:T}$. More generally... Main goal: we introduce general methods for SSMs that produce inference for the vector of parameters $\theta$. We will be particularly interested in Bayesian inference. Of course, $\theta$ can contain all sorts of unknowns, not just variances.

A quick look into the final goal. $p(y_{1:T} \mid \theta)$ is the likelihood of the measurements conditionally on $\theta$. $\pi(\theta)$ is the prior density of $\theta$ (we always assume continuous-valued parameters); it encodes knowledge about $\theta$ before we see our current data $y_{1:T}$. Bayes' theorem gives the posterior distribution:

$\pi(\theta \mid y_{1:T}) = \dfrac{p(y_{1:T} \mid \theta)\,\pi(\theta)}{p(y_{1:T})} \propto p(y_{1:T} \mid \theta)\,\pi(\theta)$

Inference based on $\pi(\theta \mid y_{1:T})$ is called Bayesian inference. $p(y_{1:T})$ is the marginal likelihood (evidence), independent of $\theta$. Goal: calculate (sample draws from) $\pi(\theta \mid y_{1:T})$. Remark: $\theta$ is a random quantity in the Bayesian framework.

As you know... (remember we set some background requirements for taking this course) Bayesian inference can rarely be performed without some Monte Carlo sampling. Except for simple cases (e.g. data $y_1, \ldots, y_T$ independently distributed according to some member of the exponential family), it is usually impossible to write the likelihood $p(y_{1:T} \mid \theta)$ in closed form. Since the early '90s, MCMC (Markov chain Monte Carlo) has opened the possibility to perform Bayesian inference in practice. Are we obsessed with Bayesian inference? NO! However, it sometimes offers the easiest way to deal with complex, non-trivial models (some good reads are listed at the end of these slides). The Bayesian approach actually opens the possibility for a surprising result (see later on...).

The likelihood function for SSMs. First of all, notice that in the Bayesian framework, since $\theta$ is random, we do not simply write the likelihood function as $p_\theta(y_{1:T})$ nor $p(y_{1:T}; \theta)$: we must condition on $\theta$ and write $p(y_{1:T} \mid \theta)$. In a SSM the data are not independent, they are only conditionally independent. A complication!

$p(y_{1:T} \mid \theta) = p(y_1 \mid \theta) \prod_{t=2}^{T} p(y_t \mid y_{1:t-1}, \theta) = \;?$

We don't have a closed-form expression for the product above, because we do not know how to calculate $p(y_t \mid y_{1:t-1}, \theta)$. Let's see why.

In a SSM the observed process is assumed to depend on the latent Markov process $\{X_t\}$: we can write

$p(y_{1:T} \mid \theta) = \int p(y_{1:T}, x_{0:T} \mid \theta)\, dx_{0:T} = \int p(y_{1:T} \mid x_{0:T}, \theta)\, p(x_{0:T} \mid \theta)\, dx_{0:T} = \int \Big\{ \prod_{t=1}^{T} p(y_t \mid x_t, \theta) \Big\} \Big\{ p(x_0 \mid \theta) \prod_{t=1}^{T} p(x_t \mid x_{t-1}, \theta) \Big\}\, dx_{0:T}$

where conditional independence is used to factorize $p(y_{1:T} \mid x_{0:T}, \theta)$ and Markovianity to factorize $p(x_{0:T} \mid \theta)$.

Problems: the expression above is a $(T+1)$-dimensional integral, and for most (nontrivial) models the transition densities $p(x_t \mid x_{t-1})$ are unknown.

Special cases. Despite the analytic difficulties, finding approximations to the likelihood function is possible (we'll consider some approaches soon). In some simple cases, closed-form solutions do exist: for example, when the SSM is linear and Gaussian (see the Gaussian random walk example) the classic Kalman filter gives the exact likelihood. In the SSM literature, important (Gaussian) approximations are given by the extended and unscented Kalman filters. However, approximations offered by particle filters (a.k.a. sequential Monte Carlo) are presently the state of the art for general non-linear, non-Gaussian SSMs.

What if we had $p(y_{1:T} \mid \theta)$... Ideally, if we had $p(y_{1:T} \mid \theta)$ we could use Metropolis-Hastings (MH) to sample from the posterior $\pi(\theta \mid y_{1:T})$. At iteration $r+1$ we have:

1. The current value is $\theta_r$; propose a new $\theta^* \sim q(\theta^* \mid \theta_r)$, e.g. $\theta^* \sim N(\theta_r, \Sigma_\theta)$ for some covariance matrix $\Sigma_\theta$.
2. Draw $u \sim U(0, 1)$ and accept $\theta^*$ if $u < \min(1, A)$, where

$A = \dfrac{p(y_{1:T} \mid \theta^*)}{p(y_{1:T} \mid \theta_r)} \cdot \dfrac{\pi(\theta^*)}{\pi(\theta_r)} \cdot \dfrac{q(\theta_r \mid \theta^*)}{q(\theta^* \mid \theta_r)}$

If accepted, set $\theta_{r+1} := \theta^*$ and $p(y_{1:T} \mid \theta_{r+1}) := p(y_{1:T} \mid \theta^*)$. Otherwise, reject $\theta^*$ and set $\theta_{r+1} := \theta_r$. Set $r := r+1$, go to 1 and repeat.

By repeating steps 1-2 as many times as wanted, we are guaranteed that, after discarding a long enough number of initial iterations (burnin), the remaining draws form a Markov chain (hence dependent values) having $\pi(\theta \mid y_{1:T})$ as its stationary distribution. Therefore, if you have produced $R = R_1 + R_2$ iterations of Metropolis-Hastings, where $R_1$ is a sufficiently long burnin, for a scalar $\theta$ you can plot the histogram of the last $R_2$ draws $\theta_{R_1+1}, \ldots, \theta_{R}$. Such a histogram approximates the density $\pi(\theta \mid y_{1:T})$, up to a Monte Carlo error induced by using a finite $R_2$. For a vector-valued $\theta \in \mathbb{R}^p$, create $p$ separate histograms of the draws pertaining to each component of $\theta$; such histograms represent the posterior marginals $\pi(\theta_j \mid y)$, $j = 1, \ldots, p$.

Brush up MH with a simple example. A simple toy problem, not related to state-space modelling. Data: $n = 1{,}000$ observations, i.i.d. $y_j \sim N(3, 1)$. Now assume $\mu = E(y_j)$ is unknown; estimate $\mu$. Assume, for example, a wide prior: $\mu \sim N(2, 2^2)$ (do not look at the data!). Use a Gaussian random walk to propose $\mu^* = \mu_r + 0.1\,\xi$, with $\xi \sim N(0, 1)$. Notice Gaussian proposals are symmetric, hence $q(\mu^* \mid \mu_r) = q(\mu_r \mid \mu^*)$, and therefore

$A = \dfrac{p(y_{1:n} \mid \mu^*)}{p(y_{1:n} \mid \mu_r)} \cdot \dfrac{\pi(\mu^*)}{\pi(\mu_r)}$

Start far away at $\mu = 10$ and produce 10,000 iterations. [Figure: left, trace plot of $\mu$ across the MCMC iterations; right, the posterior $\pi(\mu \mid y)$ after burnin, compared with the prior. True value: $\mu = 3$.] The posterior in the right plot discards the initial 500 draws (burnin).
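For concreteness, here is a minimal Python sketch of this toy MH run (the course code is in MATLAB). The random seed and the use of log-densities are implementation choices; everything else (n = 1000, prior N(2, 2^2), step size 0.1, start at µ = 10, 10,000 iterations, burnin 500) follows the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(3.0, 1.0, size=1000)            # data: n = 1000 iid N(3, 1)

def log_post(mu):
    # log-likelihood of iid N(mu, 1) data plus log N(2, 2^2) prior, up to constants
    return -0.5 * np.sum((y - mu) ** 2) - 0.5 * ((mu - 2.0) / 2.0) ** 2

R = 10_000
chain = np.empty(R)
mu = 10.0                                      # start far from the true value 3
lp = log_post(mu)
for r in range(R):
    mu_star = mu + 0.1 * rng.standard_normal() # symmetric Gaussian random walk proposal
    lp_star = log_post(mu_star)
    if np.log(rng.random()) < lp_star - lp:    # accept with probability min(1, A)
        mu, lp = mu_star, lp_star
    chain[r] = mu

posterior_draws = chain[500:]                  # discard the burnin, then e.g. histogram
```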

Remark. Notice Metropolis-Hastings is not an optimization algorithm: unlike in maximum likelihood estimation, here we are not trying to converge towards some mode. What we want is to explore thoroughly (i.e. sample from) $\pi(\theta \mid \text{data})$, including its tails. This way we can directly assess the uncertainty about $\theta$ by looking at $\pi(\theta \mid \text{data})$, instead of having to resort to asymptotic arguments regarding the sampling distribution of $\hat\theta_n$ as $n \rightarrow \infty$, as in ML theory.

Example: a nonlinear SSM.

$x_t = 0.5\, x_{t-1} + \dfrac{25\, x_{t-1}}{1 + x_{t-1}^2} + 8 \cos(1.2(t-1)) + v_t$
$y_t = 0.05\, x_t^2 + e_t$

with a deterministic $x_0 = 0$, measurements at times $t = 1, 2, \ldots, 100$, and $v_t \sim N(0, q)$, $e_t \sim N(0, r)$, all independent. Data were generated with $q = 0.1$ and $r = 1$. [Figure: the simulated latent states X and observations Y against time.]
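A minimal Python sketch that generates data like those just described. Note that q and r denote variances here, so the noise standard deviations are their square roots.

```python
import numpy as np

def simulate_nonlinear_ssm(T=100, q=0.1, r=1.0, rng=None):
    """Simulate the nonlinear SSM above, with v_t ~ N(0, q) and e_t ~ N(0, r) (variances)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.empty(T + 1)
    y = np.empty(T)
    x[0] = 0.0                                       # deterministic x_0 = 0
    for t in range(1, T + 1):
        x[t] = (0.5 * x[t - 1]
                + 25.0 * x[t - 1] / (1.0 + x[t - 1] ** 2)
                + 8.0 * np.cos(1.2 * (t - 1))
                + np.sqrt(q) * rng.standard_normal())
        y[t - 1] = 0.05 * x[t] ** 2 + np.sqrt(r) * rng.standard_normal()
    return x, y

x_true, y = simulate_nonlinear_ssm()
```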

Priors: $q \sim \text{InvGamma}(0.01, 0.01)$, $r \sim \text{InvGamma}(0.01, 0.01)$. [Figure: the prior density InvGamma(0.01, 0.01).] Recall the true values are $q = 0.1$ and $r = 1$; the chosen priors cover both.

Results anticipation... Perform an MCMC (how about the likelihood? we'll see this later). $R = 10{,}000$ iterations (generous burnin of 3,000 iterations). Proposals: $q_{\text{new}} \sim N(q_{\text{old}}, 0.2^2)$, and similarly for $r_{\text{new}}$. Start far away: initial values $q_{\text{init}} = 1$, $r_{\text{init}} = 0.1$. [Figure: trace plots of the chains for $q$ and $r$ (acceptance rate about 11%), and the empirical posterior densities after burnin.]

How do we deal with a general SSM for which the exact likelihood $p(y_{1:T} \mid \theta)$ is unavailable? That's the case of the previous example: the model is nonlinear in $x_t$, and therefore the exact Kalman filter can't be used.

General Monte Carlo integration. Recall our likelihood function:

$p(y_{1:T} \mid \theta) = p(y_1 \mid \theta) \prod_{t=2}^{T} p(y_t \mid y_{1:t-1}, \theta)$

We can write $p(y_t \mid y_{1:t-1}, \theta)$ as

$p(y_t \mid y_{1:t-1}, \theta) = \int p(y_t \mid x_t, \theta)\, p(x_t \mid y_{1:t-1}, \theta)\, dx_t = E\big(p(y_t \mid x_t, \theta)\big)$

Use Monte Carlo integration: generate $N$ draws from $p(x_t \mid y_{1:t-1}, \theta)$, then invoke the law of large numbers. Produce $N$ independent draws $x_t^i \sim p(x_t \mid y_{1:t-1}, \theta)$, $i = 1, \ldots, N$; for each $x_t^i$ compute $p(y_t \mid x_t^i, \theta)$, and by the LLN we have

$\frac{1}{N} \sum_{i=1}^{N} p(y_t \mid x_t^i, \theta) \rightarrow E\big(p(y_t \mid x_t, \theta)\big), \quad N \rightarrow \infty$

with error term $O(N^{-1/2})$ regardless of the dimension of $x_t$.

SMC and the bootstrap filter. But how do we generate good draws (particles) $x_t^i \sim p(x_t \mid y_{1:t-1}, \theta)$? Here good means that we want particles such that the values of $p(y_t \mid x_t^i, \theta)$ are not negligible (they explain a large fraction of the integrand). For SSMs, sequential Monte Carlo (SMC) is the winning strategy. We will NOT give a thorough introduction to SMC methods; we only use a few notions to solve our parameter inference problem.

Importance sampling. Let's remove for a moment the dependence on $\theta$:

$p(y_t \mid y_{1:t-1}) = \int p(y_t \mid x_{0:t})\, p(x_{0:t} \mid y_{1:t-1})\, dx_{0:t} = \int p(y_t \mid x_t)\, p(x_t \mid x_{t-1})\, p(x_{0:t-1} \mid y_{1:t-1})\, dx_{0:t} = \int \frac{p(y_t \mid x_t)\, p(x_t \mid x_{t-1})\, p(x_{0:t-1} \mid y_{1:t-1})}{h(x_{0:t} \mid y_{1:t})}\, h(x_{0:t} \mid y_{1:t})\, dx_{0:t}$

where $h(\cdot)$ is an arbitrary (positive) density function called the importance density. Choose one that is easy to simulate from.

1. Simulate $N$ samples: $x_{0:t}^i \sim h(x_{0:t} \mid y_{1:t})$, $i = 1, \ldots, N$.
2. Construct the importance weights $w_t^i = \dfrac{p(y_t \mid x_t^i)\, p(x_t^i \mid x_{t-1}^i)\, p(x_{0:t-1}^i \mid y_{1:t-1})}{h(x_{0:t}^i \mid y_{1:t})}$.
3. $p(y_t \mid y_{1:t-1}) = E\big(p(y_t \mid x_t)\big) \approx \frac{1}{N} \sum_{i=1}^{N} w_t^i$.
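To see the mechanics in isolation, here is a tiny self-contained Python check of the importance sampling idea (not of the SSM computation itself). The target p = N(0, 1), the importance density h = N(0, 2^2) and the test function g(x) = x^2 are arbitrary choices for illustration; the true value of E_p[g(X)] is 1.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(0.0, 2.0, size=100_000)       # draws x^i ~ h = N(0, 2^2)
# log importance weights: log p(x) - log h(x); shared constants cancel
logw = -0.5 * xs ** 2 - (-0.5 * (xs / 2.0) ** 2 - np.log(2.0))
w = np.exp(logw)
estimate = np.mean(w * xs ** 2)               # close to E_p[X^2] = 1
```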

Importance Sampling is an appealing and revolutionary idea for Monte Carlo integration. However, generating a whole new cloud of particles $x_{0:t}^i$ at each time step is not really computationally appealing, and it's not clear how to do so. We need something enabling a sequential mechanism as $t$ increases.

Sequential Importance Sampling. When $h(\cdot)$ is chosen in an intelligent way, an important property allows a sequential update of the weights. After some derivation on the board (see p. 121 in Särkkä [1] and p. 5 in Cappé et al. [2]), we have

$w_t^i \propto \frac{p(y_t \mid x_t^i)\, p(x_t^i \mid x_{t-1}^i)}{h(x_t^i \mid x_{0:t-1}^i, y_{1:t})}\, w_{t-1}^i.$

However, the dependence of $w_t$ on $w_{t-1}$ is a curse (with remedy), as described on the next slide.

[1] Särkkä, available here. [2] Cappé, Godsill and Moulines, available here.

The particle degeneracy problem. Particle degeneracy occurs when, at time $t$, all but one of the importance weights $w_t^i$ are close to zero. This implies a poor approximation to $p(y_t \mid y_{1:t-1})$. Notice that when a particle gets a zero weight (or a small positive weight that your computer sets to zero: numerical underflow) that particle is doomed! Since

$w_t^i \propto \frac{p(y_t \mid x_t^i)\, p(x_t^i \mid x_{t-1}^i)}{h(x_t^i \mid x_{0:t-1}^i, y_{1:t})}\, w_{t-1}^i,$

if for a given $i$ we have $w_{t-1}^i = 0$, particle $i$ will have zero weight for all subsequent times.

The particle degeneracy problem. For example, suppose a new data point $y_t$ is encountered and it is an outlier. Then most particles will end up far away from $y_t$ and their weights will be $\approx 0$: those particles die. Eventually, for a long enough time horizon $T$, all but one particle will have zero weight. This means the methodology is very fragile, and particle degeneracy affected SMC methods for decades.

The resampling idea. A life-saving solution is to use resampling with replacement (Gordon et al. [3]):

1. Interpret $\tilde w_t^i$ as the probability to sample $x_t^i$ from the weighted set $\{x_t^i, \tilde w_t^i,\ i = 1, \ldots, N\}$, with $\tilde w_t^i := w_t^i / \sum_i w_t^i$.
2. Draw $N$ times with replacement from the weighted set. Replace the old particles with the new ones $\{\tilde x_t^1, \ldots, \tilde x_t^N\}$.
3. Reset the weights to $w_t^i = 1/N$ (the resampling has destroyed the information on how we reached time $t$).

Since resampling is done with replacement, a particle with a large weight is likely to be drawn multiple times, while particles with very small weights are not likely to be drawn at all. Nice!

[3] Gordon, Salmond and Smith. IEEE Proceedings F, 140(2), 1993.

Sequential Importance Sampling with Resampling (SISR). We are now in the position to introduce a method that samples from $h(\cdot)$, so that these samples can be used as if they were from $p(x_{0:t} \mid y_{1:t-1})$, provided they are appropriately re-weighted.

1. Sample $x_t^1, \ldots, x_t^N \sim h(\cdot)$ (same as before).
2. Compute the weights $w_t^i$ (same as before).
3. Normalize the weights: $\tilde w_t^i = w_t^i / \sum_{i=1}^{N} w_t^i$.
4. We now have a discrete probability distribution $\{x_t^i, \tilde w_t^i\}$, $i = 1, \ldots, N$.
5. Resample $N$ times with replacement from the set $\{x_t^1, \ldots, x_t^N\}$ with associated probabilities $\{\tilde w_t^1, \ldots, \tilde w_t^N\}$, to generate another sample $\{\tilde x_t^1, \ldots, \tilde x_t^N\}$.
6. $\{\tilde x_t^1, \ldots, \tilde x_t^N\}$ is an approximate sample from $p(x_{0:t} \mid y_{1:t-1})$.

Sketch of proof on page 111 in Wilkinson's book.

The next animation illustrates the concept of sequential importance sampling with resampling, using $N = 5$ particles. Light blue: observed trajectory (data); dark blue: forward simulation of the latent process $\{X_t\}$; pink balls: particles $x_t^i$; green balls: particles $\tilde x_t^i$ selected by resampling; red curve: density $p(y_t \mid x_t)$.

Bootstrap filter. This is the method we will actually use to ease Bayesian inference. The bootstrap filter is a simple example of SISR: choose

$h(x_t^i \mid x_{0:t-1}^i, y_{1:t}) := p(x_t^i \mid x_{t-1}^i).$

Recall that in SISR, after performing resampling, all weights are reset to $w_{t-1}^i = 1/N$. Thus

$w_t^i \propto \frac{p(y_t \mid x_t^i)\, p(x_t^i \mid x_{t-1}^i)}{h(x_t^i \mid x_{0:t-1}^i, y_{1:t})}\, w_{t-1}^i = \frac{1}{N}\, p(y_t \mid x_t^i) \propto p(y_t \mid x_t^i), \quad i = 1, \ldots, N.$

This is a SISR strategy, therefore we still have that $\tilde x_t^i \sim p(x_{0:t} \mid y_{1:t-1})$, approximately.

Bootstrap filter in detail.

1. $t = 0$ (initialize): sample $x_0^i \sim \pi(x_0)$ and assign $w_0^i = 1/N$, $i = 1, \ldots, N$.
2. At the current $t$, assume we have the weighted particles $\{x_t^i, \tilde w_t^i\}$.
3. From the current sample, resample with replacement $N$ times to obtain $\{\tilde x_t^i, i = 1, \ldots, N\}$.
4. From your model, propagate forward $x_{t+1}^i \sim p(x_{t+1} \mid \tilde x_t^i)$, $i = 1, \ldots, N$.
5. Compute $w_{t+1}^i = p(y_{t+1} \mid x_{t+1}^i)$ and normalize the weights: $\tilde w_{t+1}^i = w_{t+1}^i / \sum_{i=1}^{N} w_{t+1}^i$.
6. Set $t := t + 1$; if $t < T$, go to step 2.

(A code sketch follows below.)
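Here is a minimal Python sketch of the bootstrap filter for the nonlinear SSM example, returning the log-likelihood estimate $\log \hat p(y_{1:T} \mid \theta)$. It uses multinomial resampling and the log-weight precautions discussed a few slides below; this is one possible implementation, not the course's MATLAB code.

```python
import numpy as np

def bootstrap_filter_loglik(y, theta, N=500, rng=None):
    """Bootstrap-filter estimate of log p(y_{1:T} | theta) for the nonlinear SSM.
    theta = (q, r): transition and observation noise variances."""
    q, r = theta
    rng = np.random.default_rng() if rng is None else rng
    x = np.zeros(N)                           # deterministic x_0 = 0 for all particles
    loglik = 0.0
    for t in range(len(y)):
        # propagate every particle forward through the state equation
        x = (0.5 * x + 25.0 * x / (1.0 + x ** 2) + 8.0 * np.cos(1.2 * t)
             + np.sqrt(q) * rng.standard_normal(N))
        # log-weights: Gaussian observation density N(y_t; 0.05 x_t^2, r)
        logw = -0.5 * np.log(2.0 * np.pi * r) - 0.5 * (y[t] - 0.05 * x ** 2) ** 2 / r
        c = logw.max()                        # subtract the max log-weight for stability
        w = np.exp(logw - c)
        loglik += c + np.log(w.sum()) - np.log(N)   # log( (1/N) sum_i w_t^i )
        # multinomial resampling with replacement; weights implicitly reset to 1/N
        x = x[rng.choice(N, size=N, p=w / w.sum())]
    return loglik
```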

So the bootstrap filter by Gordon et al. (1993) [4] easily provides what we need:

$\hat p(y_t \mid y_{1:t-1}) = \frac{1}{N} \sum_{i=1}^{N} w_t^i$

Finally, a likelihood approximation:

$\hat p(y_{1:T}) = \hat p(y_1) \prod_{t=2}^{T} \hat p(y_t \mid y_{1:t-1})$

Put $\theta$ back in the notation to obtain either approximate maximum likelihood,

$\theta_{\text{mle}} = \arg\max_\theta \hat p(y_{1:T}; \theta),$

or exact Bayesian inference, by using $\hat p(y_{1:T} \mid \theta)$ inside Metropolis-Hastings. Why exact?? Let's check it after the example...

[4] Gordon, Salmond and Smith. IEEE Proceedings F, 140(2), 1993.

Back to the nonlinear SSM example. We can now comment on how we obtained the previously shown results. We used the bootstrap filter with $N = 500$ particles and $R = 10{,}000$ MCMC iterations. Forward propagation:

$x_{t+1} = 0.5\, x_t + \frac{25\, x_t}{1 + x_t^2} + 8 \cos(1.2 t) + v_{t+1}$

with weights $w_{t+1}^i = N\big(y_{t+1};\ 0.05 (x_{t+1}^i)^2,\ r\big)$. [Figure: the trace plots and empirical posterior densities for $q$ and $r$ shown earlier.]
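Putting the pieces together, here is a sketch of the corresponding Metropolis-Hastings run in Python, reusing bootstrap_filter_loglik and the simulated data y from the earlier sketches. The InvGamma(0.01, 0.01) priors, the N(·, 0.2^2) proposals and the starting values follow the slides; the seed is arbitrary. Note that the stored likelihood estimate of the current θ is never recomputed: as we will see, this matters for the exactness of the approach.

```python
import numpy as np

def log_prior(q, r, a=0.01, b=0.01):
    """Independent InvGamma(a, b) log-densities for the two variances, up to constants."""
    if q <= 0.0 or r <= 0.0:
        return -np.inf
    return -(a + 1) * np.log(q) - b / q - (a + 1) * np.log(r) - b / r

rng = np.random.default_rng(1)
R = 10_000
theta = np.array([1.0, 0.1])                 # start far away: q_init = 1, r_init = 0.1
ll, lp = bootstrap_filter_loglik(y, theta, rng=rng), log_prior(*theta)
chain = np.empty((R, 2))
for it in range(R):
    prop = theta + 0.2 * rng.standard_normal(2)   # q_new ~ N(q_old, 0.2^2), same for r
    lp_prop = log_prior(*prop)
    if lp_prop > -np.inf:                    # skip the filter for invalid proposals
        ll_prop = bootstrap_filter_loglik(y, prop, rng=rng)
        if np.log(rng.random()) < ll_prop + lp_prop - ll - lp:
            theta, ll, lp = prop, ll_prop, lp_prop   # keep ll stored: never refresh it
    chain[it] = theta
```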

Numerical issues with particle degeneracy. When coding your algorithms, consider the following before normalizing the weights. Compute the unnormalized weights on the log-scale: e.g. when $w_t^i$ is a Gaussian density $N(y_t;\, x_t^i, \cdot)$, the exp() in the pdf will likely produce an underflow ($w_t^i = 0$) for $x_t^i$ far from $y_t$. Solution: reason in terms of logw instead of w. However, afterwards we necessarily have to go back to w := exp(logw) before normalizing, and the above might still not be enough. Solution: subtract the maximum (log)weight from each (log)weight, i.e. set logw := logw - max(logw). This is totally ok: the relative importance of the particles is unaffected, since all weights are merely scaled by the same constant, with c = max(logw).

Safe (log)likelihood computation. As usual, it is better to compute the (approximate) loglikelihood instead of the likelihood:

$\log \hat p(y_{1:T} \mid \theta) = \sum_{t=1}^{T} \Big( -\log N + \log \sum_{i=1}^{N} w_t^i \Big)$

At time $t$, set $c_t = \max\{\log w_t^i,\ i = 1, \ldots, N\}$ and set $\bar w_t^i := \exp(\log w_t^i - c_t)$. Once all the $\bar w_t^i$ are available, compute $\log \sum_{i=1}^{N} w_t^i = c_t + \log \sum_{i=1}^{N} \bar w_t^i$ (the latter follows from $w_t^i = \bar w_t^i \exp\{c_t\}$, which you don't need to evaluate).
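In code, the steps above amount to the familiar log-sum-exp trick; a minimal Python sketch:

```python
import numpy as np

def log_mean_weights(logw):
    """Safe log( (1/N) * sum_i w_t^i ) from unnormalized log-weights."""
    c = logw.max()                                   # c_t = max_i log w_t^i
    return c + np.log(np.exp(logw - c).sum()) - np.log(logw.size)

def normalized_weights(logw):
    """Normalized weights; the common factor exp(c_t) cancels in the ratio."""
    w = np.exp(logw - logw.max())
    return w / w.sum()
```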

Incredible result. Quite astonishingly, Andrieu and Roberts [5] proved that using an unbiased estimate of the likelihood function inside the MCMC routine is sufficient to obtain exact Bayesian inference! That is, using the acceptance ratio

$A = \frac{\hat p(y_{1:T} \mid \theta^*)}{\hat p(y_{1:T} \mid \theta)} \cdot \frac{\pi(\theta^*)}{\pi(\theta)} \cdot \frac{q(\theta \mid \theta^*)}{q(\theta^* \mid \theta)}$

will return a Markov chain with stationary distribution $\pi(\theta \mid y_{1:T})$, regardless of the finite number $N$ of particles used to approximate the likelihood! The good news is that $E(\hat p(y_{1:T} \mid \theta)) = p(y_{1:T} \mid \theta)$ when $\hat p(y_{1:T} \mid \theta)$ is obtained via SMC.

[5] Andrieu and Roberts (2009), Annals of Statistics, 37(2), 697-725.

The previous result will be considered, in my opinion, one of the most important statistical results of the XXI century. In fact, it offers an "exact-approximate" approach: because of computing limitations we can only produce $N < \infty$ particles, while still being reassured to obtain exact (Bayesian) inference under minor assumptions. But let's give a rapid (technically informal) look at why it works.

Key result: unbiasedness (Del Moral 2004 [6]). We have that

$E(\hat p(y_{1:T} \mid \theta)) = \int \hat p(y_{1:T} \mid \theta, \xi)\, p(\xi)\, d\xi = p(y_{1:T} \mid \theta)$

with $\xi \sim p(\xi)$ the vector of all random variates generated during SMC (both to propagate forward the state and to perform the particle resampling).

[6] Easier to look at Pitt, Silva, Giordani, Kohn. J. Econometrics 171, 2012.

To prove the exactness of the approach we look at the (easier and less general) argument in sec. 2.2 of Pitt, Silva, Giordani, Kohn, J. Econometrics 171, 2012. To simplify the notation, take $y := y_{1:T}$. Let $\hat\pi(\theta, \xi \mid y)$ be the approximate joint posterior of $(\theta, \xi)$ obtained via SMC:

$\hat\pi(\theta, \xi \mid y) = \frac{\hat p(y \mid \theta, \xi)\, p(\xi)\, \pi(\theta)}{p(y)}$

(notice $\xi$ and $\theta$ are assumed a priori independent). Notice we put $p(y)$, not $\hat p(y)$, at the denominator: this follows from the unbiasedness assumption, as we obtain

$\int\!\!\int \hat p(y \mid \theta, \xi)\, p(\xi)\, \pi(\theta)\, d\xi\, d\theta = \int \pi(\theta) \Big\{ \int \hat p(y \mid \theta, \xi)\, p(\xi)\, d\xi \Big\}\, d\theta = \int \pi(\theta)\, p(y \mid \theta)\, d\theta = p(y).$

The exact (unavailable) posterior of $\theta$ is $\pi(\theta \mid y) = p(y \mid \theta)\pi(\theta)/p(y)$, therefore the marginal likelihood (evidence) is $p(y) = p(y \mid \theta)\pi(\theta)/\pi(\theta \mid y)$, and

$\hat\pi(\theta, \xi \mid y) = \frac{\hat p(y \mid \theta, \xi)\, p(\xi)\, \pi(\theta)}{p(y)} = \pi(\theta \mid y)\, \frac{\hat p(y \mid \theta, \xi)\, p(\xi)}{p(y \mid \theta)}$

Now, we know that applying an MCMC targeting $\hat\pi(\theta, \xi \mid y)$, and then discarding the output pertaining to $\xi$, corresponds to integrating $\xi$ out of the posterior:

$\int \hat\pi(\theta, \xi \mid y)\, d\xi = \frac{\pi(\theta \mid y)}{p(y \mid \theta)} \int \hat p(y \mid \theta, \xi)\, p(\xi)\, d\xi = \pi(\theta \mid y)$

since the last integral equals $E(\hat p(y \mid \theta)) = p(y \mid \theta)$ by unbiasedness. We are thus performing a pseudo-marginal approach: marginal because we disregard $\xi$, and pseudo because we use $\hat p(\cdot)$, not $p(\cdot)$. Therefore we have proved that using MCMC on an (artificially) augmented posterior, and then discarding from the output all the random variates created during SMC, returns exact Bayesian inference. Notice that discarding the $\xi$ is something we naturally do in Metropolis-Hastings, hence nothing strange is happening here. The $\xi$ are just instrumental, uninteresting variates, independent of $\theta$ and of $\{X_t\}$.

One more example. Let's study the stochastic Ricker model:

$y_t \sim \text{Pois}(\phi N_t)$
$N_t = r\, N_{t-1}\, e^{-N_{t-1} + e_t}, \quad e_t \sim N(0, \sigma^2)$

where $N_t$ is the size of a population at time $t$, $r$ is the growth rate, $\phi$ is a scale parameter and $e_t$ is environmental noise. So this example has an observational model for $y_t$ given by a discrete distribution: classic Kalman-based filtering can't accommodate discreteness.
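Although we will fit this model with the R package pomp (next slide), simulating it is straightforward. Here is a minimal Python sketch, with parameter defaults matching the values used later to generate the data.

```python
import numpy as np

def simulate_ricker(T=50, r=44.7, sigma=0.3, phi=10.0, N0=7.0, rng=None):
    """Simulate the stochastic Ricker model above."""
    rng = np.random.default_rng() if rng is None else rng
    N = np.empty(T + 1)
    y = np.empty(T, dtype=np.int64)
    N[0] = N0
    for t in range(1, T + 1):
        e = sigma * rng.standard_normal()             # e_t ~ N(0, sigma^2)
        N[t] = r * N[t - 1] * np.exp(-N[t - 1] + e)   # Ricker map with environmental noise
        y[t - 1] = rng.poisson(phi * N[t])            # y_t ~ Pois(phi * N_t)
    return N, y

N, y_ricker = simulate_ricker()
```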

We can use the R package pomp to fit this model. pomp supports a number of inference methods of interest to us: particle marginal methods, iterated filtering, approximate Bayesian computation (ABC), synthetic likelihoods... and more. Essentially, a user interested in fitting a SSM can benefit from a well-tested platform and experiment with a number of possibilities. I had never used it until 4 days ago (!). But now I have a working example (also as an optional exercise!).

Here are 50 observations from the Ricker model at times $(1, 2, \ldots, 50)$, generated with $(r, \sigma, \phi) = (44.7, 0.3, 10)$ and starting size $N_0 = 7$ at $t = 0$. [Figure: the observations y, the path of the (latent) process $N_t$, and the (not so interesting) realizations of $e_t$, against time.]

The pomp pmcmc function runs a pseudo-marginal MCMC. Assume we are only interested in $(r, \phi)$ and keep $\sigma = 0.3$ constant ("known"). Here we use 500 particles with 5,000 MCMC iterations, starting at $(r, \phi) = (7.4, 5)$, and we assume flat (improper) priors. [Figure: PMCMC convergence diagnostics from pomp: trace plots of loglik, log.prior, nfail, $r$ and $\phi$ against the PMCMC iteration.]

We used a non-adaptive Gaussian random walk (variances are kept fixed); you might get better results with an adaptive version. A topic set as an (optional) exercise is to think about why the method fails at estimating the parameters of a nearly deterministic (smaller $\sigma$) stochastic Ricker model. Furthermore, for $N_t$ deterministic ($\sigma = 0$), where particle filters are not applicable (nor needed), exact likelihood calculation is also challenging. A great coverage of this issue is in Fasiolo et al. (2015), arXiv:1411.4564, comparing particle marginal methods, approximate Bayesian computation, iterated filtering and more. A very much recommended read.

Conclusions. We have outlined a powerful methodology for exact Bayesian inference, regardless of the number of particles N:
- a too small N will have a negative impact on chain mixing: many rejections, a sticky chain;
- the methodology is perfectly suited for state-space modelling;
- the methodology is plug-and-play: besides knowing the observation density $p(y_t \mid x_t)$, nothing else is required;
- in fact, we only need the ability to forward-simulate $x_t$ given $x_{t-1}$, which can be performed numerically from the state model;
- the above implies that knowledge of the transition density expression $p(x_t \mid x_{t-1})$ is not required. BINGO!

Software. I'm not very much up to date on all available software for parameter inference via SMC (I tend to write my code from scratch). Some possibilities are:
- LibBi (C++ template library);
- Biips (C++ with interfaces to MATLAB/R);
- demo("pmcmc") in the R package smfsb;
- the R package pomp;
- the MATLAB code accompanying the book by S. Särkkä (code available here);
- the MATLAB code for the presented example, available at the course webpage.

Important issues. How to tune the number of particles?
- Doucet, Pitt and Kohn. Efficient implementation of Markov chain Monte Carlo when using an unbiased likelihood estimator. arXiv:1210.1871 (2012).
- Pitt, dos Santos Silva, Giordani and Kohn. On some properties of Markov chain Monte Carlo simulation methods based on the particle filter. Journal of Econometrics 171(2) (2012): 134-151.
- Sherlock, Thiery, Roberts and Rosenthal. On the efficiency of pseudo-marginal random walk Metropolis algorithms. arXiv:1309.7209 (2013).

Important issues. Comparison of resampling schemes.
- Douc, Cappé and Moulines. Comparison of resampling schemes for particle filtering. http://biblio.telecom-paristech.fr/cgi-bin/download.cgi?id=5755
- Hol, Schön and Gustafsson. On resampling algorithms for particle filters. http://people.isy.liu.se/rt/schon/publications/holsg2006.pdf

Appendix

Multinomial resampling. This is the simplest (though not the most efficient) resampling scheme to explain. At time $t$ we wish to sample from a population of weighted particles $(x_t^i, \tilde w_t^i)$, $i = 1, \ldots, N$. What we actually do is to sample $N$ particle indices with replacement from the population $(i, \tilde w_t^i)$, $i = 1, \ldots, N$: this is a sample of size $N$ from a multinomial distribution. Pick a particle from the urn (the larger its probability $\tilde w_t^i$, the more likely it is picked), record its index $i$ and put it back in the urn; repeat for a total of $N$ times. To code the sampling procedure we just need to recall the inverse transform method: for a generic random variable $X$ with invertible cdf $F_X$, we can sample an $x$ from $F_X$ using $x := F_X^{-1}(u)$, with $u \sim U(0, 1)$.

For example, let's start with the simplest multinomial distribution, the Bernoulli: $X \in \{0, 1\}$ with $p = P(X = 1)$ and $1 - p = P(X = 0)$. Then

$F_X(x) = \begin{cases} 0 & x < 0 \\ 1 - p & 0 \leq x < 1 \\ 1 & x \geq 1 \end{cases}$

Draw the "stair" represented by the plot of $F_X$. Generate a $u \sim U(0, 1)$ and hit the stair's steps: if $0 < u \leq 1 - p$ set $x := 0$, and if $u > 1 - p$ set $x := 1$. The multinomial case is a simple generalization. Drop time $t$ and set $\tilde w^i = p_i$; $F_X$ is a stair with $N$ steps. Shoot a $u \sim U(0, 1)$ and return the index

$i := \min\Big\{ i \in \{1, \ldots, N\} :\ \Big(\sum_{j=1}^{i} p_j\Big) - u \geq 0 \Big\}. \quad (1)$
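In code, formula (1) is a vectorized inverse-cdf lookup; a minimal Python sketch:

```python
import numpy as np

def multinomial_resample(particles, w, rng=None):
    """Draw N indices with replacement, P(index = i) = w[i], via the inverse transform."""
    rng = np.random.default_rng() if rng is None else rng
    cdf = np.cumsum(w) / np.sum(w)        # the 'stair' F_X built from the weights
    u = rng.random(len(w))                # one uniform per draw
    idx = np.searchsorted(cdf, u)         # smallest i with cdf[i] >= u, as in formula (1)
    idx = np.minimum(idx, len(w) - 1)     # guard against floating-point overshoot
    return particles[idx]
```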

Cool reads on Bayesian methods (titles are linked).
- You have an engineering/signal-processing background: check S. Särkkä, Bayesian Filtering and Smoothing (free PDF from the author!).
- You are a data scientist: check K. Murphy, Machine Learning: a Probabilistic Perspective.
- You are a theoretical statistician: C. Robert, The Bayesian Choice.
- You are interested in bioinformatics/systems biology: check D. Wilkinson, Stochastic Modelling for Systems Biology, 2nd ed.
- You are interested in inference for SDEs with applications to life sciences: check the book by Wilkinson above and C. Fuchs, Inference for Diffusion Processes.

Cool reads on Bayesian methods (titles are linked).
- You are a computational statistician: check the Handbook of MCMC. Older (but excellent) titles are J. Liu, Monte Carlo Strategies in Scientific Computing, and Robert and Casella, Monte Carlo Statistical Methods.
- You want a practical, hands-on and (almost) maths-free introduction: check The BUGS Book and Doing Bayesian Data Analysis.