LTCC: Advanced Computational Methods in Statistics

Advanced Particle Methods & Parameter estimation for HMMs
N. Kantas
Notes at http://wwwf.imperial.ac.uk/~nkantas/notes4ltcc.pdf
Slides at http://wwwf.imperial.ac.uk/~nkantas/slides4.pdf

Introduction
Particle methods as presented so far can be challenged by:
- weight degeneracy: low observation noise, high dimensions
- path degeneracy: a crucial issue when parameters are unknown
More elaborate/advanced methods can be effective.
We also need to address parameter estimation, using approaches that are:
- Bayesian or maximum likelihood
- on-line or off-line (batch)

Outline
Advanced methods
- adaptive resampling
- the resample move PF
- the auxiliary particle filter
- SMC for fixed state spaces
Parameter estimation
- Bayesian or maximum likelihood
- on-line or off-line

Recipes to improve performance
There are more elaborate particle filtering algorithms:
- they can work better than the vanilla version in terms of variance of estimators, ESS, accuracy etc.
- but they do not address path degeneracy due to resampling; they often just mask it or postpone it.
We will look at:
- adaptive resampling,
- the resample move PF,
- the auxiliary particle filter.
Note that one can combine all of the above.

Adaptive resampling
While resampling is a key component for a good approximation, it tends to leave early states represented by few particles.
Key idea of adaptive resampling: resample only when you need to.
- Resample only when $\mathrm{ESS}_n \le \alpha N$, e.g. $\alpha = 1/2$.
- When you do not resample, continue with SIS.

SIR filter with adaptive resampling
At time $n \ge 1$:
1. Sample $X_n^i \sim q(x_n | y_n, X_{n-1}^i)$ and set $X_{0:n}^i \leftarrow (X_{0:n-1}^i, X_n^i)$.
2. Compute the weights $w_n(X_{n-1:n}^i)$ and set $W_n^i \propto W_{n-1}^i \, w_n(X_{n-1:n}^i)$, $\sum_{i=1}^N W_n^i = 1$.
3. If $\mathrm{ESS}_n \le \alpha N$, resample $\{W_n^i, X_{0:n}^i\}$ to obtain N new equally-weighted particles $\{1/N, \bar X_{0:n}^i\}$, then set $X_{0:n}^i \leftarrow \bar X_{0:n}^i$, $W_n^i \leftarrow 1/N$.
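The resampling decision in step 3 can be sketched as follows. This is a minimal illustration, not the course's reference code: the function names `ess` and `adaptive_resample`, the threshold name `alpha`, and the use of multinomial resampling are assumptions of this sketch.

```python
import random

def ess(weights):
    """Effective sample size of normalised weights: 1 / sum_i (W^i)^2."""
    return 1.0 / sum(w * w for w in weights)

def adaptive_resample(particles, weights, alpha=0.5, rng=random):
    """Resample (multinomially) only when ESS <= alpha * N.

    Returns (particles, weights, resampled_flag). When no resampling is
    done the weighted particles are returned unchanged, i.e. the filter
    carries on with plain SIS.
    """
    n = len(particles)
    if ess(weights) > alpha * n:
        return particles, weights, False
    idx = rng.choices(range(n), weights=weights, k=n)
    return [particles[i] for i in idx], [1.0 / n] * n, True
```

With equal weights the ESS equals N and nothing happens; with one dominant weight the ESS drops towards 1 and the population is resampled and reset to weights 1/N.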

The resample move particle filter (Gilks & Berzuini 2001, JRSSB)
Fight path degeneracy by re-inserting lost diversity in the particles, using appropriate MCMC moves on the path space.
At time $n \ge 1$:
1. Sample $X_n^i \sim q(x_n | y_n, X_{n-1}^i)$ and set $X_{0:n}^i \leftarrow (X_{0:n-1}^i, X_n^i)$.
2. Compute the weights $w_n(X_{n-1:n}^i)$ and set $W_n^i \propto w_n(X_{n-1:n}^i)$, $\sum_{i=1}^N W_n^i = 1$.
3. Resample $\{W_n^i, X_{0:n}^i\}$ to obtain N new equally-weighted particles $\{1/N, \bar X_{0:n}^i\}$.
4. Move the particles by independently (for each i) sampling $X_{0:n}^i \sim K_{MCMC}(\cdot | \bar X_{0:n}^i)$.

The resample move particle filter
The target density for the MCMC move is
$p(x_{0:n} | y_{0:n}) \propto \mu(x_0) \prod_{k=1}^{n} f(x_k | x_{k-1}) \prod_{k=0}^{n} g(y_k | x_k)$
The MCMC proposal in this context:
- just provides a jitter or shake in the particle population
- does not need to move the whole trajectory; moving only $X_{n-L+1:n}$ can still lead to a correct algorithm
- a Gibbs update would be very useful if available
Note: we are not relying on ergodic properties of MCMC, just invariance
- we want to preserve the statistical properties of the sample

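The resample move particle filter using RW-MH
Random walk algorithm for $X_{0:n}^i \sim K_{MCMC}(\cdot | \bar X_{0:n}^i)$:
Set $\xi_{0:n} = \bar X_{0:n}^i$.
For $m = 1, \dots, M$:
- Sample $U \sim N(0, S)$, with S of appropriate dimension.
- Propose $Z_{n-L+1:n} = \xi_{n-L+1:n} + cU$ (with $Z_{n-L} = \xi_{n-L}$).
- Compute the acceptance ratio
  $\alpha = 1 \wedge \frac{\prod_{k=n-L+1}^{n} f(Z_k | Z_{k-1}) \, g(y_k | Z_k)}{\prod_{k=n-L+1}^{n} f(\xi_k | \xi_{k-1}) \, g(y_k | \xi_k)}$
- With probability $\alpha$ accept, $\xi_{0:n} \leftarrow (\xi_{0:n-L}, Z_{n-L+1:n})$; otherwise reject the proposal and $\xi_{0:n}$ remains the same.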
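As an illustration of the random walk move, here is a minimal sketch for L = 1 (only the last state is moved) on a scalar linear Gaussian model $X_n = \rho X_{n-1} + \tau V_n$, $Y_n = X_n + \sigma W_n$. The model choice, the parameter names `rho`, `tau`, `sigma` and the function names are assumptions of this sketch; with a scalar state the proposal covariance S is taken as 1 and absorbed into the scale `c`.

```python
import math
import random

def log_norm_pdf(x, mean, sd):
    """Log-density of N(mean, sd^2) at x."""
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - 0.5 * ((x - mean) / sd) ** 2

def rw_mh_move(x_prev, x_curr, y, rho, tau, sigma, c=1.0, M=3, rng=random):
    """M random-walk MH steps on the last state only (L = 1).

    The invariant density is proportional to f(x | x_prev) g(y | x),
    so the rest of the path is left untouched.
    """
    def log_target(x):
        return log_norm_pdf(x, rho * x_prev, tau) + log_norm_pdf(y, x, sigma)

    xi = x_curr
    for _ in range(M):
        z = xi + c * rng.gauss(0.0, 1.0)                 # propose Z = xi + c U
        log_alpha = min(0.0, log_target(z) - log_target(xi))
        if math.log(1.0 - rng.random()) < log_alpha:     # accept with prob. alpha
            xi = z
    return xi
```

In a resample move filter this function would be called once per particle, right after resampling, with `x_prev` the particle's state at time n-1.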

The resample move particle filter
Tuning:
- M can be quite small, 1-5.
- Can use the particles to design S, e.g. look at the empirical covariance of the particles after resampling.
- c can be tuned for an average acceptance ratio around 0.2-0.4.
- Other MCMC moves are possible: Gibbs, Hybrid Monte Carlo, ...
The method will increase diversity a bit, but notice that it does not affect the weights:
- it might be more effective to use likelihood-informed proposals and weights.
The last point is related to the auxiliary particle filter of Pitt & Shephard (1999, JASA).

The auxiliary particle filter
Resample move and adaptive resampling are meant to improve path degeneracy.
What if weight degeneracy due to IS is still present?
Consider the Bayesian recursion:
$p(x_{0:n} | y_{0:n}) = \frac{1}{Z_n} \, p(x_{0:n-1} | y_{0:n-1}) \, f(x_n | x_{n-1}) \, g(y_n | x_n)$
with $Z_n = p(y_n | y_{0:n-1})$.
Bootstrap filter: move with $f(x_n | x_{n-1})$ and weight with $g(y_n | x_n)$.

The auxiliary particle filter
Alternative route: weight with $p(y_n | x_{n-1})$ and then move with $p(x_n | x_{n-1}, y_n)$. Recall
$p(x_n | x_{n-1}, y_n) = \frac{f(x_n | x_{n-1}) \, g(y_n | x_n)}{p(y_n | x_{n-1})}$
(Pitt & Shephard 1999, JASA): can reverse the steps:
- move with $p(x_n | x_{n-1}, y_n)$ and weight with $p(y_{n+1} | x_n)$.
The optimal $p(x_n | x_{n-1}, y_n)$ is not available in practice! Can use approximations:
- move with $q(x_n | x_{n-1}, y_n)$ and weight with $q(y_{n+1}, x_n)$.

The auxiliary particle filter
On approximations:
- here $q(y_{n+1}, x_n)$ is not necessarily required to be a pdf, just an easy-to-evaluate non-negative function of $(x_n, y_{n+1})$.
- it is often called a score function (the name is misleading, as it is also used to denote the gradient term in parameter estimation).
- $q(x_n | x_{n-1}, y_n)$ can be a good importance distribution that takes into account the current observation.

The auxiliary particle filter
Instead of the original problem consider the target:
$\pi_n(x_{0:n} | y_{0:n}) \propto \mu(x_0) \, g(y_0 | x_0) \, q(y_1, x_0) \prod_{k=1}^{n} f(x_k | x_{k-1}) \, g(y_k | x_k) \, \frac{q(y_{k+1}, x_k)}{q(y_k, x_{k-1})}$
Note that the ratios telescope:
$q(y_1, x_0) \prod_{k=1}^{n} \frac{q(y_{k+1}, x_k)}{q(y_k, x_{k-1})} = q(y_{n+1}, x_n)$
This means we are targeting a density twisted with a lookahead:
$\pi_n(x_{0:n} | y_{0:n}) \propto p(x_{0:n} | y_{0:n}) \, q(y_{n+1}, x_n)$

The auxiliary particle filter
What is the auxiliary PF?
- It is a PF targeting $\pi_n$ using the proposal $q(x_n | y_n, x_{n-1})$.
- We will implement a PF targeting $\pi_n$ with this proposal and then reweight to get approximations for the original posterior $p(x_{0:n} | y_{0:n})$ that is actually of interest.
Why do we do this?
- The PF for $\pi_n$ is more stable numerically.
- The new likelihood $g(y_n | x_n) \frac{q(y_{n+1}, x_n)}{q(y_n, x_{n-1})}$ might be less peaky or informative.
- $\pi_n$ is closer to $\pi_{n-1}$.

The auxiliary particle filter
So in path space the target is
$\pi_n(x_{0:n} | y_{0:n}) \propto \pi_{n-1}(x_{0:n-1} | y_{0:n-1}) \, \frac{f(x_n | x_{n-1}) \, g(y_n | x_n) \, q(y_{n+1}, x_n)}{q(y_n, x_{n-1})}$
and the proposal is
$q(x_{0:n}) \propto \pi_{n-1}(x_{0:n-1} | y_{0:n-1}) \, q(x_n | y_n, x_{n-1})$
This leads to the following weights to propagate the particles:
$\tilde w_n(x_n, x_{n-1}) = \frac{f(x_n | x_{n-1}) \, g(y_n | x_n) \, q(y_{n+1}, x_n)}{q(y_n, x_{n-1}) \, q(x_n | y_n, x_{n-1})} = w_n(x_{n-1:n}) \, q(y_{n+1}, x_n)$

The auxiliary particle filter
For convenience we will split the evaluation of the weight $\tilde w_n$ in two time steps:
- evaluate the part depending on $y_{n+1}$ at time n + 1.
Here we use the notation:
$w_0(x_0) = \frac{g(y_0 | x_0) \, \mu(x_0)}{q(x_0 | y_0)}$,
$w_n(x_{n-1:n}) = \frac{g(y_n | x_n) \, f(x_n | x_{n-1})}{q(x_n, y_n | x_{n-1})}$ for $n \ge 1$,
where we denote, for $n \ge 1$,
$q(x_n, y_n | x_{n-1}) = q(x_n | y_n, x_{n-1}) \, q(y_n, x_{n-1})$
Pitt & Shephard (1999, JASA) recommend using, if available, $q(x_n | y_n, x_{n-1}) = p(x_n | y_n, x_{n-1})$ and $q(y_n, x_{n-1}) = p(y_n | x_{n-1})$, or approximations of them.

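The auxiliary particle filter
At time n = 0, for all $i \in \{1, \dots, N\}$:
1. Sample $X_0^i \sim q(x_0 | y_0)$.
2. Compute $\tilde W_1^i \propto w_0(X_0^i) \, q(y_1, X_0^i)$, $\sum_{i=1}^N \tilde W_1^i = 1$.
3. Resample $\bar X_0^i \sim \sum_{i=1}^N \tilde W_1^i \, \delta_{X_0^i}(dx_0)$.
At time $n \ge 1$, for all $i \in \{1, \dots, N\}$:
1. Sample $X_n^i \sim q(x_n | y_n, \bar X_{n-1}^i)$ and set $X_{0:n}^i \leftarrow (\bar X_{0:n-1}^i, X_n^i)$.
2. Compute $\tilde W_{n+1}^i \propto w_n(X_{n-1:n}^i) \, q(y_{n+1}, X_n^i)$, $\sum_{i=1}^N \tilde W_{n+1}^i = 1$.
3. Resample $\bar X_{0:n}^i \sim \sum_{i=1}^N \tilde W_{n+1}^i \, \delta_{X_{0:n}^i}(dx_{0:n})$.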

The auxiliary particle filter
BUT note that we want the approximations of $p(x_{0:n} | y_{0:n})$ and $p(y_n | y_{0:n-1})$. These are given by:
$\hat p(dx_{0:n} | y_{0:n}) = \sum_{i=1}^N W_n^i \, \delta_{X_{0:n}^i}(dx_{0:n})$, (1)
$\hat p(y_n | y_{0:n-1}) = \left( \frac{1}{N} \sum_{i=1}^N w_n(X_{n-1:n}^i) \right) \left( \sum_{i=1}^N W_{n-1}^i \, q(y_n, X_{n-1}^i) \right)$, (2)
where
$W_n^i \propto w_n(X_{n-1:n}^i)$, $\sum_{i=1}^N W_n^i = 1$,
and
$\hat p(y_0) = \frac{1}{N} \sum_{i=1}^N w_0(X_0^i)$.
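The weighting, resampling and likelihood estimate above can be put together in a short sketch. It assumes a scalar linear Gaussian model $X_n = \rho X_{n-1} + \tau V_n$, $Y_n = X_n + \sigma W_n$ with $X_0 \sim N(0,1)$, takes the proposal $q(x_n | y_n, x_{n-1}) = f(x_n | x_{n-1})$, and uses the exact predictive $p(y_{n+1} | x_n) = N(y_{n+1}; \rho x_n, \tau^2 + \sigma^2)$ as the look-ahead function $q(y_{n+1}, x_n)$. The model, the parameter names and the function names are assumptions of this sketch, and weights are handled on the natural scale, so it may underflow for very small `sigma`.

```python
import math
import random

def norm_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def apf_loglik(ys, rho, tau, sigma, N, rng):
    """APF estimate of log p(y_{0:T}); proposal f, look-ahead p(y_{n+1} | x_n)."""
    s_pred = math.sqrt(tau ** 2 + sigma ** 2)        # sd of Y_{n+1} given X_n
    xs = [rng.gauss(0.0, 1.0) for _ in range(N)]     # X_0^i ~ N(0, 1)
    w = [norm_pdf(ys[0], x, sigma) for x in xs]      # w_0 = g(y_0 | x_0)
    tot = sum(w)
    loglik = math.log(tot / N)                       # log p_hat(y_0)
    W = [wi / tot for wi in w]                       # W_0^i
    for y in ys[1:]:
        look = [norm_pdf(y, rho * x, s_pred) for x in xs]   # q(y_n, X_{n-1}^i)
        v = [Wi * li for Wi, li in zip(W, look)]            # resampling weights
        idx = rng.choices(range(N), weights=v, k=N)
        xs_new = [rho * xs[i] + tau * rng.gauss(0.0, 1.0) for i in idx]
        # w_n = g f / (q(x_n | ...) q(y_n, x_{n-1})) = g / look, since q(x_n | ...) = f
        w = [norm_pdf(y, x, sigma) / look[i] for x, i in zip(xs_new, idx)]
        tot = sum(w)
        loglik += math.log(sum(v)) + math.log(tot / N)      # both factors of (2)
        W = [wi / tot for wi in w]
        xs = xs_new
    return loglik
```

For this model the exact log-likelihood is available from the Kalman filter, which gives a convenient check of the estimate.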

Discussion
The choice of $w_n$ is convenient for reweighting:
- $w_n(\cdot)$ is used to approximate $\hat p(dx_{0:n} | y_{0:n})$
- $\tilde w_n(\cdot)$ is used to weight the particles
- the connection between the two is simply IS
What are we doing?
- We are carefully changing the weight so that the algorithm is well behaved,
- by multiplying with something and dividing by it at the next step.
This can be effective when $X_n$ is high dimensional or g is too informative.

Discussion
Neat extension: let $X_k$ be obtained from a discretisation of a continuous process, e.g. via an Euler scheme. Set
$\frac{q(y_{k+1}, x_k)}{q(y_k, x_{k-1})} = \prod_{m=1}^{M} \frac{r_{k,m}(y_{k+1}, y_k, x_{k,m})}{r_{k,m-1}(y_{k+1}, y_k, x_{k,m-1})}$
with $X_{k,0} = X_{k-1}$ and $X_{k,M} = X_k$.
Doing the same thing as above means that you do M intermediate weight-resample steps to process the observation $Y_{k+1}$.
Detailed exposition in (Del Moral & Murray 2015, SIAM/ASA JUQ).

Tempering based approach
Another example that fits this framework is tempering. Consider
$r_{k,m} = g(y_{k+1} | x_{k,m})^{\phi_m}$, $r_{k,0} = g(y_k | x_{k,0})$
with $\phi_M = 1$ and $0 < \phi_1 < \phi_2 < \dots < \phi_M$.
In the presence of dynamics for $x_{k,m}$ (e.g. discretisation of an SDE) the implementation is as above.

Tempering based approach
Some notes:
- can tune $\phi_m$ according to the ESS (adaptive tempering)
- in the absence of dynamics for $x_{k,m}$, use MCMC steps that are invariant to
  $\frac{1}{Z_{k,m}} \, p(x_{0:k-1} | y_{0:k-1}) \, f(x_k | x_{k-1}) \, g(y_k | x_k)^{\phi_m}$
  otherwise the method is prone to resampling degeneracy
- the method can be very effective in high dimensions
Some references:
- original PF with tempering in Godsill & Clapp 01, based on Neal 01, Jarzynski 97
- more recent papers: Jasra, Stephens, Doucet & Tsagaris 01; K., Beskos & Jasra 14
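One common way to tune the temperatures by ESS is a bisection search for the next exponent. The sketch below is an illustration under stated assumptions: the particles are equally weighted before the step (e.g. just after resampling), `loglik[i]` holds $\log g(y | x^i)$ for particle i, `target_ess` is below N, and the ESS of the incremental weights $g^{\phi - \phi_{old}}$ is assumed to decrease as $\phi$ grows. The function names are hypothetical.

```python
import math

def ess_from_logw(logw):
    """ESS of unnormalised log-weights: (sum w)^2 / sum w^2, computed stably."""
    m = max(logw)
    w = [math.exp(l - m) for l in logw]
    s = sum(w)
    return s * s / sum(x * x for x in w)

def next_temperature(loglik, phi_old, target_ess, tol=1e-6):
    """Largest phi in (phi_old, 1] (up to tol) whose incremental weights
    g^(phi - phi_old) keep the ESS at or above target_ess."""
    def ess_at(phi):
        return ess_from_logw([(phi - phi_old) * l for l in loglik])

    if ess_at(1.0) >= target_ess:
        return 1.0                      # can jump straight to the final target
    lo, hi = phi_old, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if ess_at(mid) >= target_ess:
            lo = mid
        else:
            hi = mid
    return lo
```

Calling `next_temperature` repeatedly, with a weight, resample and MCMC move step in between, produces the increasing sequence of exponents up to 1.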

Discussion: summary
Path degeneracy can be addressed partially by:
- adaptive resampling: applying resampling only when necessary
- using MCMC moves to jitter the particles and reintroduce lost diversity in the particle approximations
- note that path degeneracy will still be present!
Weight degeneracy can be addressed by:
- good selection of importance proposals
- changing the target sequence to an easier problem, as in the APF
- introducing an intermediate artificial weighting-resampling sequence, e.g. tempering.
Can use all the ideas above together to get a very powerful algorithm
- but also a somewhat complicated one.

Homework 4
For the following scalar model
$X_n = \rho X_{n-1} + \tau V_n$, $Y_n = X_n + \sigma W_n$, (3)
where $W_n, V_n \overset{iid}{\sim} N(0, 1)$, $X_0 \sim N(0, 1)$:
1. Synthesise data-sets $y_{0:T}$ for T = 5000, $\rho = 0.8$, $\tau = 1$ with varying $\sigma$ = 0.001, 0.01, 0.1, 1, 10. Store the real state trajectory $x_{0:T}$ for future comparisons in each case.
  1.1 Implement the auxiliary PF (APF) for bootstrap or optimal importance proposals.
  1.2 Compare with the bootstrap PF and with SIR with the optimal proposal in terms of accuracy for the filter mean and variance, as well as Monte Carlo variance of the marginal likelihood.
  1.3 How small does $\sigma$ need to get so that the APF shows superior performance?
2. For some cases, e.g. $\sigma = 0.1$:
  2.1 Implement the resample move PF for L = 1, M = 3. Plot the ESS for the resample move and compare with the APF, bootstrap PF, and optimal proposal PF.
  2.2 Repeat the above using the adaptive resampling PF.

SMC for static state spaces
Tempering in the absence of dynamics can be used to introduce the question of how SMC can be used when the state space is fixed,
- in contrast to dynamically increasing as in HMMs, e.g. simply X instead of $X_n$.
Example from Bayesian inference:
$p(x | y_{0:n}) \propto \prod_{k=0}^{n} p(y_k | x) \, p(x)$
or a simpler example, $p(x | y) \propto p(y | x) \, p(x)$, written with tempering, $\phi_0 = 0$, $\phi_n = 1$:
$p(x | y) \propto \prod_{k=1}^{n} p(y | x)^{\phi_k - \phi_{k-1}} \, p(x)$

SMC for static state spaces
The method is often referred to as SMC samplers or simply SMC.
The answer is: at each time k replace the dynamics (in the earlier algorithms from q or f) with MCMC steps invariant to $\prod_{p=0}^{k} p(y_p | x) \, p(x)$
- can use the particles to tune the MCMC steps, i.e. use an independence sampler or random walks with covariances from the particle approximation
- there is an interpretation: construct a time-varying target on an artificial state space model with marginal at time n being $p(x | y_{0:n})$
Some references: Chopin 01 Biometrika; Del Moral, Doucet & Jasra 06 JRSSB

Introduction to parameter estimation
So far:
- we have managed to get a very good approximation of $p_\theta(x_n | y_{0:n})$
- in this case path degeneracy does not matter
Is this useful?
- yes, we can track the unknown ship in the sea
- but only when $\theta$ is known
So how do we estimate $\theta$?
- this problem is known as parameter inference for HMMs, model calibration, system identification
- very crucial in practice: you cannot do filtering/prediction/smoothing without $\theta$
- often ad-hoc calibration methods are used

Introduction to parameter estimation
We are interested in principled inferential methods or procedures:
- Bayesian
- Maximum likelihood
Inference can be performed either
- on-line
- batch (or offline)
We need to use PFs within algorithms that are meant to perform inference for $\theta$.

Introduction to parameter estimation
Some algorithms:
Likelihood methods
- optimisation based
- gradient based
- expectation maximisation
Bayesian methods
- naive approach: augment the state $x_{0:n}$ with $\theta$ and do filtering
- pseudo-marginal MCMC methods: Particle MCMC, Particle Gibbs
- nested SMC approach: SMC$^2$

Reading List
Read the introductory Particle MCMC book chapter by Andrieu, Doucet and Holenstein:
http://www.stats.ox.ac.uk/~doucet/andrieu_doucet_holenstein_pmcmc_mcqmc.pdf
Have a look at a review on parameter estimation:
http://www.stats.ox.ac.uk/~doucet/kantas_doucet_singh_maciejowski_tutorialparameterestimation.pdf

Bayesian Inference
The parameter $\theta$ is a random variable and Y is some dataset.
Bayes rule: posterior $\propto$ likelihood $\times$ prior,
$p(\theta | Y) \propto p(Y | \theta) \, p(\theta)$
Markov chain Monte Carlo (MCMC):
- Obtain samples of $\theta$ using an appropriate ergodic Markov chain $\{\theta(k)\}_{k \ge 0}$ with stationary distribution $p(\theta | Y)$.

Bayesian inference for HMMs
Choose a suitable prior density $p(\theta)$ for $\theta$.
Approximate $p(\theta | y_{0:n})$, which is given by
$p(\theta | y_{0:n}) \propto p_\theta(y_{0:n}) \, p(\theta)$. (4)
Off-line case:
- Compute the joint posterior density $p(x_{0:T}, \theta | y_{0:T})$.
On-line or sequential case:
- Compute the sequence of posterior densities $\{p(x_{0:n}, \theta | y_{0:n})\}_{n \ge 0}$.
- On-line also means the same quality at every time with fixed computational/memory cost.

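Generic Metropolis Hastings for sampling $p(\theta | Y)$
Sample $\theta(0) \sim p(\theta)$.
At iteration $k \ge 1$:
- Sample a proposal $\theta' \sim q(\cdot | \theta(k-1))$.
- Compute the acceptance ratio
  $\alpha(\theta, \theta') = 1 \wedge \frac{p(Y | \theta') \, p(\theta') \, q(\theta(k-1) | \theta')}{p(Y | \theta(k-1)) \, p(\theta(k-1)) \, q(\theta' | \theta(k-1))}$
- With probability $\alpha(\theta, \theta')$ accept the proposal, setting $\theta(k) = \theta'$; otherwise reject the sample and set $\theta(k) = \theta(k-1)$.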

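Metropolis Hastings for HMMs
Sample $\theta(0) \sim p(\theta)$.
At iteration $k \ge 1$:
- Sample a proposal $\theta' \sim q(\cdot | \theta)$, where $\theta = \theta(k-1)$.
- Compute the acceptance ratio
  $\alpha(\theta, \theta') = 1 \wedge \frac{p_{\theta'}(y_{0:T}) \, p(\theta') \, q(\theta | \theta')}{p_{\theta}(y_{0:T}) \, p(\theta) \, q(\theta' | \theta)}$
- With probability $\alpha(\theta, \theta')$ accept the proposal, setting $\theta(k) = \theta'$; otherwise reject the sample and set $\theta(k) = \theta(k-1)$.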

Metropolis Hastings for HMMs
Hard to implement directly, as $p_{\theta'}(y_{0:T})$ is intractable.
Could use
$p_\theta(x_{0:T}, y_{0:T}) = \mu_\theta(x_0) \prod_{k=1}^{T} f_\theta(x_k | x_{k-1}) \prod_{k=0}^{T} g_\theta(y_k | x_k)$
to design a sampler targeting $p(x_{0:T}, \theta | y_{0:T})$.
This approach is usually inefficient:
- mixing could deteriorate rapidly with T
- the path $x_{0:T}$ is strongly correlated
- it is difficult to find useful hierarchical structure or conditional independencies.

Metropolis Hastings for HMMs
Take a pseudo-marginal approach (Andrieu & Roberts 2009):
- choose appropriate auxiliary variables.
Consider instead sampling from $p(x_{0:T}, \theta | y_{0:T})$ and then integrating out $x_{0:T}$:
- the ideal marginal Metropolis sampler
- marginalising $x_{0:T}$ means running an MCMC chain targeting $p(x_{0:T}, \theta | y_{0:T})$ and using only the generated $\theta$-s for Monte Carlo approximations.

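Ideal Marginal Metropolis-Hastings sampler
The ideal MMH sampler would utilize the following proposal density:
$q\left( (x'_{0:T}, \theta') \,|\, (x_{0:T}, \theta) \right) = q(\theta' | \theta) \, p_{\theta'}(x'_{0:T} | y_{0:T})$ (5)
The acceptance probability is
$1 \wedge \frac{p(x'_{0:T}, \theta' | y_{0:T}) \, q\left( (x_{0:T}, \theta) | (x'_{0:T}, \theta') \right)}{p(x_{0:T}, \theta | y_{0:T}) \, q\left( (x'_{0:T}, \theta') | (x_{0:T}, \theta) \right)}$
$= 1 \wedge \frac{p_{\theta'}(y_{0:T}) \, p(\theta') \, q(\theta | \theta')}{p_{\theta}(y_{0:T}) \, p(\theta) \, q(\theta' | \theta)}$
$= 1 \wedge \frac{Z_T^{\theta'} \, p(\theta') \, q(\theta | \theta')}{Z_T^{\theta} \, p(\theta) \, q(\theta' | \theta)}$.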

Marginal Metropolis-Hastings sampler
We cannot sample exactly from $p(x_{0:T}, \theta | y_{0:T})$ and we cannot compute the terms $Z_T^{\theta}$ and $Z_T^{\theta'}$.
A sampler with particle approximations for $Z_T^{\theta}$ and $Z_T^{\theta'}$ has the same marginal as the ideal MMH sampler.
- It is a pseudo-marginal sampler targeting
  $p\left( x_0^1, \dots, x_0^N, x_1^1, \dots, x_T^N, O_1(1), \dots, O_1(N), \dots, O_T(N), x_{0:T}, \theta \,|\, y_{0:T} \right)$
- All the variables used to construct the SMC algorithm, $\{ \{X_n^i, O_n(i)\}_{i=1}^N \}_{n=1}^T$, can be included together with $X_{0:T}$ as auxiliary variables and then integrated out.
- The validity of the algorithm is based on the unbiasedness of the likelihood estimate, $E^N[\hat p_{\theta'}(y_{0:T})] = p_{\theta'}(y_{0:T})$.
- See the particle MCMC paper of Andrieu, Doucet and Holenstein (2010).

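Particle Marginal Metropolis-Hastings (PMMH) sampler
At iteration k = 0:
- Set $\theta(0) \sim p(\theta)$.
- Run an SMC algorithm targeting $p_{\theta(0)}(x_{0:T} | y_{0:T})$, sample $X_{0:T}(0) \sim \hat p(dx_{0:T} | y_{0:T}, \theta(0))$, and compute the estimate $\hat Z_T(\theta(0))$.
At iteration $k \ge 1$:
- Sample a proposal $\theta' \sim q(\cdot | \theta(k-1))$.
- Run an SMC algorithm targeting $p_{\theta'}(x_{0:T} | y_{0:T})$, sample $X'_{0:T} \sim \hat p(dx_{0:T} | y_{0:T}, \theta')$, and compute the estimate $\hat Z_T(\theta')$.
- Set $\theta(k) = \theta'$, $X_{0:T}(k) = X'_{0:T}$ with probability
  $1 \wedge \frac{\hat Z_T(\theta') \, p(\theta') \, q(\theta(k-1) | \theta')}{\hat Z_T(\theta(k-1)) \, p(\theta(k-1)) \, q(\theta' | \theta(k-1))}$,
  otherwise set $\theta(k) = \theta(k-1)$, $X_{0:T}(k) = X_{0:T}(k-1)$.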
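A minimal sketch of the PMMH loop follows, under several assumptions that are not part of the slides: the model is the scalar linear Gaussian model $X_n = \rho X_{n-1} + \tau V_n$, $Y_n = X_n + \sigma W_n$ with $X_0 \sim N(0,1)$; only $\rho$ is unknown, with a uniform prior on (-1, 1); the proposal is a symmetric Gaussian random walk, so the q-terms cancel; and the likelihood estimate comes from a bootstrap filter. For brevity the sketch keeps only the $\theta$-chain and does not store a sampled path $X_{0:T}(k)$. Function names are hypothetical.

```python
import math
import random

def bootstrap_loglik(ys, rho, tau, sigma, N, rng):
    """Bootstrap PF estimate of log p_theta(y_{0:T}), resampling at every step."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(N)]
    loglik = 0.0
    for n, y in enumerate(ys):
        if n > 0:
            xs = [rho * x + tau * rng.gauss(0.0, 1.0) for x in xs]
        logw = [-0.5 * math.log(2.0 * math.pi * sigma ** 2)
                - 0.5 * ((y - x) / sigma) ** 2 for x in xs]
        m = max(logw)
        w = [math.exp(l - m) for l in logw]
        loglik += m + math.log(sum(w) / N)
        xs = rng.choices(xs, weights=w, k=N)
    return loglik

def pmmh(ys, n_iter, N, step, tau=1.0, sigma=1.0, seed=0):
    """PMMH chain for rho with a Uniform(-1, 1) prior and random-walk proposal."""
    rng = random.Random(seed)
    theta = rng.uniform(-1.0, 1.0)
    ll = bootstrap_loglik(ys, theta, tau, sigma, N, rng)
    chain = [theta]
    for _ in range(n_iter):
        prop = theta + step * rng.gauss(0.0, 1.0)
        if -1.0 < prop < 1.0:                      # zero prior density outside
            ll_prop = bootstrap_loglik(ys, prop, tau, sigma, N, rng)
            # The stored estimate ll of the current state is reused, never
            # recomputed: this is what keeps the sampler pseudo-marginal.
            if math.log(1.0 - rng.random()) < ll_prop - ll:
                theta, ll = prop, ll_prop
        chain.append(theta)
    return chain
```

The important design point is that the likelihood estimate is stored alongside the current parameter and only refreshed on acceptance.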

Particle Marginal Metropolis-Hastings (PMMH) sampler
- The remarkable feature of this algorithm is that the invariant distribution of the Markov chain $\{X_{0:T}(k), \theta(k)\}$ is $p(x_{0:T}, \theta | y_{0:T})$, whatever N is.
  - SMC approximations do not introduce any bias.
  - Minimal tuning is required compared to usual MCMC.
- The higher N, the better the mixing properties of the algorithm.
  - The tradeoff with the added computational cost could be balanced.
- Under favorable mixing assumptions the variance of the acceptance rate of the PMMH sampler is proportional to T/N.
  - N should roughly increase linearly with T, so the computational cost is $O(T^2)$.
  - This can potentially be relaxed.

Online Bayesian estimation
Introduce the extended state $\bar X_n = (X_n, \theta_n)$ with initial density $p(\theta_0) \, \mu_{\theta_0}(x_0)$.
The transition density is
$f_{\theta_n}(x_n | x_{n-1}) \, \delta_{\theta_{n-1}}(\theta_n)$,
i.e. $\theta_n = \theta_{n-1}$.
Applying a standard SMC algorithm to the Markov process $\{\bar X_n\}_{n \ge 0}$:
- the parameter space would only be explored at the initialization of the algorithm.
- because of the successive resampling steps, after a certain time n the approximation $\hat p(d\theta | y_{0:n})$ will only contain a single unique value for $\theta$.
- it implicitly requires having to approximate $p_{\theta^{(i)}}(y_{0:n})$ for all the particles $\theta^{(i)}$ approximating $p(\theta | y_{0:n})$, hence we expect estimates whose variance will increase at least linearly with n.

Online Bayesian estimation
Pragmatic solutions:
- use artificial dynamics (Liu and West 2001, Hürzeler and Künsch 2001); a simple example is
  $\theta_n = \theta_{n-1} + \epsilon_n$
  with $\epsilon_n$ being zero-mean noise with small variance
  - can tune the variance from the particles
- can also use fixed-lag approximations (Polson et al 2008)
  - stop resampling before $n - L$
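The simple random-walk example above can be sketched in a few lines. This is only the plain jitter $\theta_n = \theta_{n-1} + \epsilon_n$ with the noise scale taken from the particle spread; it is not the shrinkage kernel of Liu and West, and the function name and the `shrink` factor are assumptions of this sketch.

```python
import random
import statistics

def jitter_parameters(thetas, shrink=0.1, rng=random):
    """Artificial dynamics: add N(0, (shrink * sd)^2) noise to each parameter
    particle, where sd is the empirical standard deviation of the particles."""
    sd = statistics.pstdev(thetas)
    return [t + shrink * sd * rng.gauss(0.0, 1.0) for t in thetas]
```

Note that once all parameter particles have collapsed to a single value the empirical spread is zero and this jitter no longer moves them, so in practice a lower bound on the noise scale may be needed.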

Online Bayesian estimation
Resample Move (Gilks and Berzuini 2001): use an MCMC kernel with invariant density $p(x_{0:n}, \theta | y_{0:n})$, i.e.
$(X_{0:n}^{(i)}, \theta_n^{(i)}) \sim K_n(\cdot, \cdot \,|\, \bar X_{0:n}^i, \bar\theta_n^i)$
where by construction $K_n$ satisfies
$p(x'_{0:n}, \theta' | y_{0:n}) = \int p(x_{0:n}, \theta | y_{0:n}) \, K_n(x'_{0:n}, \theta' | x_{0:n}, \theta) \, d(x_{0:n}, \theta)$.
In practice set $X_{0:n-L}^{(i)} = \bar X_{0:n-L}^i$ for some integer $L \ge 1$ and only sample $\theta_n^{(i)}$ and possibly $X_{n-L+1:n}^{(i)}$.

Resample Move
In some cases we can use a Gibbs step to update the parameter values:
$K_n(x'_{0:n}, \theta' | x_{0:n}, \theta) = \delta_{x_{0:n}}(x'_{0:n}) \, p(\theta' | x_{0:n}, y_{0:n})$,
where $p(\theta | y_{0:n}, x_{0:n}) = p(\theta | s_n(x_{0:n}, y_{0:n}))$ with $s_n(x_{0:n}, y_{0:n})$ a fixed-dimension sufficient statistic.
With some variation this has appeared many times: Andrieu et al 1999, Fearnhead 2002, Storvik 2002, Johannes and Polson 2007.
Elegant, but still not robust, since it relies on SMC approximations of $p(s_n(x_{0:n}, y_{0:n}) | y_{0:n})$, and for fixed N the error increases with n.
- The issue is path degeneracy.
Unsuitable for high dimensions (> 5-10).

Numerical example
We will use again
$X_n = \rho X_{n-1} + \tau W_n$, $Y_n = X_n + \sigma V_n$ (6)
where $W_n, V_n \overset{iid}{\sim} N(0, 1)$.

Numerical example: on-line inference
[Figure: Particle method with MCMC, $\theta = (\rho, \sigma^2)$; estimated posterior pdfs of $\sigma_y^2$ and $\rho$ at n = 1000, 2000, 3000, 4000, 5000.]

Numerical example: on-line inference
[Figure: Estimated marginal posterior densities for $\theta = (\rho, \sigma^2)$ with $T = 10^3$ over 50 runs (black-dotted) versus ground truth (green). Top: Particle method with MCMC, $N = 7.5 \times 10^4$. Bottom: Particle Gibbs with 3000 iterations and N = 50.]

Likelihood estimation methods with particle filtering
Some algorithms:
Likelihood methods
- optimisation based
- gradient based
- expectation maximisation
Offline or online:
- we will focus on offline methods
- and only sketch on-line ones to give a very basic idea

Maximum Likelihood based methods
Off-line case:
- Estimate $\theta$ as the maximizing argument of the marginal likelihood of the observed data:
  $\hat\theta = \arg\max_{\theta \in \Theta} \ell_T(\theta)$ (7)
  where
  $\ell_T(\theta) = \log p_\theta(y_{0:T})$. (8)
Online case:
- use a recursive method
- let $\theta_n$ be the estimate of the model parameter after n - 1 observations
- update the estimate to $\theta_{n+1}$ after receiving the new data $y_n$.

Offline Maximum Likelihood based methods
Off-line case:
- Estimate $\theta$ as:
  $\hat\theta = \arg\max_{\theta \in \Theta} \hat\ell_T(\theta)$ (9)
  where $\hat\ell_T(\theta) = \widehat{\log p_\theta}(y_{0:T})$.
Can use direct optimisation:
- a grid on $\theta$, BFGS, or other popular optimisation methods
- this is difficult due to the variance of $\hat p_\theta(y_{0:T})$

On the Monte Carlo variance of $\hat p_\theta(y_{0:T})$
- Recall that SMC results in unbiased estimation of the marginal likelihood:
  $E^N[\hat p_\theta(y_{0:T})] = p_\theta(y_{0:T})$
- Loosely speaking, $\hat p_\theta(y_{0:T}) = p_\theta(y_{0:T}) + V$, with V some non-trivial zero-mean noise depending on T, N and the model.
  - Recall that $\hat p_\theta(y_{0:n})$ has a relative (non-asymptotic) variance that increases linearly with n.
- The Monte Carlo variability is quite an issue for finding the maximum over $\theta$.

Approximating $\log p_\theta(y_{0:T})$
Note that
$E^N[\hat p_\theta(y_{0:T})] = p_\theta(y_{0:T})$
implies that
$E^N[\log \hat p_\theta(y_{0:T})] \ne \log p_\theta(y_{0:T})$
So $\log \hat p_\theta(y_{0:T})$ is a biased estimator.
Can we correct for the bias?

Approximating $\log p_\theta(y_{0:T})$
Can use a bias correction based on a Taylor series:
$\log(Z) = \log Z_0 + \frac{1}{Z_0}(Z - Z_0) - \frac{1}{2 Z_0^2}(Z - Z_0)^2 + O((Z - Z_0)^3)$
Let $Z_0 = E[Z]$; then, ignoring higher order terms,
$E[\log(Z)] \approx \log E[Z] - \frac{1}{2 E[Z]^2} \mathrm{Var}[Z]$
What we have is $Z = \hat Z = \hat p_\theta(y_{0:T})$ and $Z_0 = p_\theta(y_{0:T})$:
$E[\log \hat p_\theta(y_{0:T})] \approx \log p_\theta(y_{0:T}) - \frac{\mathrm{Var}[\hat p_\theta(y_{0:T})]}{2 \, p_\theta(y_{0:T})^2}$
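The second-order approximation of the bias can be checked numerically on a toy estimator. In this sketch (an illustration only, unrelated to any particular HMM) $\hat Z$ is the average of N independent Exp(1) draws, so $E[\hat Z] = 1$, $\mathrm{Var}[\hat Z] = 1/N$ and the predicted bias of $\log \hat Z$ is about $-1/(2N)$. The function name is hypothetical.

```python
import math
import random

def mean_log_estimate(N, reps, rng):
    """Monte Carlo mean of log(Z_hat), where Z_hat is the average of N Exp(1)
    draws. That average has a Gamma(shape=N, scale=1/N) law, sampled directly."""
    total = 0.0
    for _ in range(reps):
        z_hat = rng.gammavariate(N, 1.0 / N)
        total += math.log(z_hat)
    return total / reps

rng = random.Random(1)
N = 10
bias = mean_log_estimate(N, 200_000, rng)   # log E[Z_hat] = 0, so this is the bias
second_order = -(1.0 / N) / 2.0             # -Var[Z_hat] / (2 E[Z_hat]^2)
```

The simulated bias is negative, as Jensen's inequality requires, and close to the second-order value; the remaining gap comes from the higher-order terms that the expansion ignores.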

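Approximating $\log p_\theta(y_{0:T})$
Note from slides 1 or 3:
$\mathrm{Var}[\hat p_\theta(y_{0:T})] \approx \frac{p_\theta(y_{0:T})^2}{N} \left( \int \frac{p_\theta(x_{0:T} | y_{0:T})}{q(x_{0:T})} \, p_\theta(x_{0:T} | y_{0:T}) \, dx_{0:T} - 1 \right)$
$\approx \frac{p_\theta(y_{0:T})^2}{N} \left( \int w(x_{0:T}) \, p_\theta(x_{0:T} | y_{0:T}) \, dx_{0:T} - 1 \right)$
$\approx \frac{p_\theta(y_{0:T})^2}{N} \left( \int \left( \prod_{n=0}^{T} w_n(x_{n-1:n}) \right) p_\theta(x_{0:T} | y_{0:T}) \, dx_{0:T} - 1 \right)$
Let $\hat W$ be the particle approximation of $\int \prod_{n=0}^{T} w_n(x_{n-1:n}) \, p_\theta(x_{0:T} | y_{0:T}) \, dx_{0:T}$.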

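Approximating $\log p_\theta(y_{0:T})$
We then get
$E[\log \hat p_\theta(y_{0:T})] \approx \log p_\theta(y_{0:T}) - \frac{\hat W - 1}{2N}$
So we can use
$\widehat{\log p_\theta}(y_{0:T}) = \log \hat p_\theta(y_{0:T}) + \frac{\hat W - 1}{2N}$
as a bias-reduced estimator for $\ell_T(\theta)$.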

Optimising $\log p_\theta(y_{0:T})$ w.r.t. $\theta$
Still, $\hat\ell_T(\theta) = \widehat{\log p_\theta}(y_{0:T})$ will exhibit quite a bit of variance.
This can make finding the maximum difficult.
Potential remedies:
- smooth the approximation as a function of $\theta$
- use a different resampling scheme (Pitt 02, Lee 10)
- try to reduce the variance with multiple runs

Expectation Maximisation
The Expectation Maximization (EM) algorithm is a very popular alternative procedure for maximizing $\ell_T(\theta)$.
At iteration k + 1, we set
$\theta_{k+1} = \arg\max_{\theta} Q(\theta_k, \theta)$ (10)
where
$Q(\theta_k, \theta) = \int \log p_\theta(x_{0:T}, y_{0:T}) \, p_{\theta_k}(x_{0:T} | y_{0:T}) \, dx_{0:T}$. (11)
The sequence $\{\ell_T(\theta_k)\}_{k \ge 0}$ generated by this algorithm is non-decreasing.

Expectation Maximisation
In particular, if $p_\theta(x_{0:T}, y_{0:T})$ belongs to the exponential family, then the EM consists of computing an $n_s$-dimensional summary statistic like $S_n^\theta$:
- the maximizing argument of $Q(\theta_k, \theta)$ can be characterized explicitly through a suitable function $\Lambda : \mathbb{R}^{n_s} \to \Theta$, i.e.
  $\theta_{k+1} = \Lambda\left( S_T^{\theta_k} \right)$. (12)
The particle implementation consists of computing $S_n^{\theta_k}$.

Additive functionals $S_n^\theta$
$S_n^\theta$ is an additive functional:
$S_n^\theta = \int \left[ \sum_{k=0}^{n} s_k(x_k, x_{k-1}) \right] p_\theta(x_{0:n} | y_{0:n}) \, dx_{0:n}$, (13)
Theory tells us that the asymptotic variance of the SMC estimate
$\hat S_n^\theta = \int \left[ \sum_{k=0}^{n} s_k(x_k, x_{k-1}) \right] \hat p_\theta(dx_{0:n} | y_{0:n})$, (14)
satisfies
$V\left[ \hat S_n^\theta \right] \ge D_\theta \frac{n^2}{N}$ (15)
even with exponential filter stability.
This motivates the use of dedicated smoothing algorithms.

Gradient ascent
The log-likelihood may be maximized with the following steepest ascent algorithm: at iteration k + 1,
$\theta_{k+1} = \theta_k + \gamma_{k+1} \nabla_\theta \ell_T(\theta) |_{\theta = \theta_k}$, (16)
- $\{\gamma_k\}_{k \ge 1}$ needs to satisfy $\sum_k \gamma_k = \infty$ and $\sum_k \gamma_k^2 < \infty$.
- could also use the Hessian, but this is omitted for simplicity.
To obtain the score vector $\nabla_\theta \ell_T(\theta)$ we can use Fisher's identity:
$\nabla_\theta \log p_\theta(y_{0:n}) = \int \nabla_\theta \log p_\theta(x_{0:n}, y_{0:n}) \, p_\theta(x_{0:n} | y_{0:n}) \, dx_{0:n}$
The latter is of the form of $S_n^\theta$ again.
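The ascent recursion itself is short; the sketch below uses the step sizes $\gamma_k = a / k^{\kappa}$, which satisfy both summability conditions for $0.5 < \kappa \le 1$. The `score` argument stands for any (possibly noisy, e.g. particle-based) estimate of the score; the function name, `a` and `kappa` are assumptions of this sketch.

```python
def gradient_ascent(score, theta0, n_iter, a=0.5, kappa=0.6):
    """Steepest ascent theta_{k+1} = theta_k + gamma_{k+1} * score(theta_k)
    with gamma_k = a / k**kappa (sum gamma_k = inf, sum gamma_k^2 < inf
    whenever 0.5 < kappa <= 1)."""
    theta = theta0
    for k in range(1, n_iter + 1):
        gamma = a / k ** kappa
        theta = theta + gamma * score(theta)
    return theta

# Toy usage: log-likelihood -(theta - 2)^2 / 2 has score -(theta - 2),
# so the iterates should approach the maximiser theta = 2.
theta_hat = gradient_ascent(lambda t: -(t - 2.0), theta0=0.0, n_iter=200)
```

In the HMM setting `score` would be replaced by a particle approximation of Fisher's identity, which is noisy; the decaying step sizes are what average that noise out.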

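Gradient ascent
We have
$\nabla_\theta \log p_\theta(x_{0:n}, y_{0:n}) = \nabla_\theta \log \prod_{p=0}^{n} f_\theta(x_p | x_{p-1}) \, g_\theta(y_p | x_p) = \sum_{p=0}^{n} \left( \nabla_\theta \log f_\theta(x_p | x_{p-1}) + \nabla_\theta \log g_\theta(y_p | x_p) \right)$
Define:
$s_p(x_{p-1:p}) = \nabla_\theta \log f_\theta(x_p | x_{p-1}) + \nabla_\theta \log g_\theta(y_p | x_p)$.
$\nabla_\theta \log p_\theta(y_{0:n})$ is of the form of $S_n^\theta$ again.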

Smoothing algorithms
We are essentially interested in designing better particle approximations for $\{p_\theta(x_n | y_{0:T})\}_{n=0}^{T}$.
Some popular approaches:
- fixed lag smoothing
- forward filtering backward sampling
- forward filtering backward smoothing

Fixed lag smoothing
For state-space models with good forgetting properties, if L is large enough then
$p_\theta(x_{0:n} | y_{0:T}) \approx p_\theta\left( x_{0:n} | y_{0:(n+L) \wedge T} \right)$
- observations collected at times k > n + L do not bring any significant additional information about $X_{0:n}$.
Fixed lag approximation (Kitagawa & Sato 2001):
- do not resample the components $X_{0:n}^i$ of the particles $X_{0:k}^i$ obtained by particle filtering at times k > n + L.
This could work in practice, but the method is asymptotically biased and it might be hard to tune L.

Forward-Backward Smoothing using sampling
Backward interpretation: the joint smoothing distribution $p_\theta(x_{0:T} | y_{0:T})$ can be expressed as a function of the filtering distributions $\{p_\theta(x_n | y_{0:n})\}_{n=0}^{T}$ as follows:
$p_\theta(x_{0:T} | y_{0:T}) = p_\theta(x_T | y_{0:T}) \prod_{n=0}^{T-1} p_\theta(x_n | y_{0:n}, x_{n+1})$ (17)
where
$p_\theta(x_n | y_{0:n}, x_{n+1}) = \frac{f_\theta(x_{n+1} | x_n) \, p_\theta(x_n | y_{0:n})}{p_\theta(x_{n+1} | y_{0:n})}$. (18)

Particle Implementation
Forward Filtering Backward Sampling (FFBSa):
- run a particle filter from time n = 0 to T, storing the approximate filtering distributions $\{\hat p_\theta(dx_n | y_{0:n})\}_{n=0}^{T}$.
- Sample $X_T \sim \hat p_\theta(dx_T | y_{0:T})$ and for n = T - 1, T - 2, ..., 0 sample $X_n \sim \hat p_\theta(dx_n | y_{0:n}, X_{n+1})$, where this distribution is obtained by substituting $\hat p_\theta(dx_n | y_{0:n})$ for $p_\theta(dx_n | y_{0:n})$ in (18):
  $\hat p_\theta(dx_n | y_{0:n}, X_{n+1}) = \frac{\sum_{i=1}^{N} W_n^i \, f_\theta(X_{n+1} | X_n^i) \, \delta_{X_n^i}(dx_n)}{\sum_{i=1}^{N} W_n^i \, f_\theta(X_{n+1} | X_n^i)}$. (19)
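The backward pass can be sketched as follows, assuming the forward filter has stored its particles and normalised weights as lists `particles[n][i]` and `weights[n][i]`, and that `log_f(x_next, x)` evaluates $\log f_\theta(x_{next} | x)$. These names and the storage layout are assumptions of this sketch.

```python
import math
import random

def backward_sample(particles, weights, log_f, rng=random):
    """Draw one trajectory X_{0:T} by forward filtering backward sampling.

    particles[n][i], weights[n][i]: stored filtering approximation at time n.
    log_f(x_next, x): log transition density log f(x_next | x).
    """
    T = len(particles) - 1
    path = [None] * (T + 1)
    path[T] = rng.choices(particles[T], weights=weights[T], k=1)[0]
    for n in range(T - 1, -1, -1):
        lf = [log_f(path[n + 1], x) for x in particles[n]]
        m = max(lf)                                   # stabilise the exponentials
        bw = [W * math.exp(l - m) for W, l in zip(weights[n], lf)]
        # rng.choices normalises bw, which gives exactly the weights in (19)
        path[n] = rng.choices(particles[n], weights=bw, k=1)[0]
    return path
```

Each sampled trajectory costs O(NT), so drawing N trajectories costs $O(N^2 T)$ in total.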

Forward-Backward Smoothing
A backward-in-time recursion for $\{p_\theta(x_n | y_{0:T})\}_{n=0}^{T}$ follows by integrating out $x_{0:n-1}$ and $x_{n+1:T}$ in (17) while applying (18):
$p_\theta(x_n | y_{0:T}) = \int p_\theta(x_n, x_{n+1} | y_{0:T}) \, dx_{n+1}$
$= \int p_\theta(x_n | y_{0:n}, x_{n+1}) \, p_\theta(x_{n+1} | y_{0:T}) \, dx_{n+1}$
$= \int \frac{f_\theta(x_{n+1} | x_n) \, p_\theta(x_n | y_{0:n})}{p_\theta(x_{n+1} | y_{0:n})} \, p_\theta(x_{n+1} | y_{0:T}) \, dx_{n+1}$.

Forward-Backward Smoothing
So the backward-in-time recursion for $\{p_\theta(x_n | y_{0:T})\}_{n=0}^{T}$ is:
$p_\theta(x_n | y_{0:T}) = p_\theta(x_n | y_{0:n}) \int \frac{f_\theta(x_{n+1} | x_n) \, p_\theta(x_{n+1} | y_{0:T})}{p_\theta(x_{n+1} | y_{0:n})} \, dx_{n+1}$. (20)
So $\{p_\theta(x_n | y_{0:n})\}_{n=0}^{T}$ can be used in a backward pass to obtain $\{p_\theta(x_n | y_{0:T})\}_{n=0}^{T}$ and $\{p_\theta(x_n | y_{0:n}, x_{n+1})\}_{n=0}^{T-1}$.

Particle Implementation
Forward Filtering Backward Smoothing (FFBSm):
Assume we have an approximation
$\hat p_\theta(dx_{n+1} | y_{0:T}) = \sum_{i=1}^{N} W_{n+1|T}^i \, \delta_{X_{n+1}^i}(dx_{n+1})$
where $W_{T|T}^i = W_T^i$. Then by using (20) and (19), we obtain the approximation
$\hat p_\theta(dx_n | y_{0:T}) = \sum_{i=1}^{N} W_{n|T}^i \, \delta_{X_n^i}(dx_n)$
with
$W_{n|T}^i = W_n^i \sum_{j=1}^{N} W_{n+1|T}^j \, \frac{f_\theta(X_{n+1}^j | X_n^i)}{\sum_{l=1}^{N} W_n^l \, f_\theta(X_{n+1}^j | X_n^l)}$. (21)

Particle Implementation
Forward Filtering Backward Smoothing (FFBSm):
- Run a particle filter from time n = 0 to T, storing the approximate filtering distributions $\{\hat p_\theta(dx_n | y_{0:n})\}_{n=0}^{T}$.
- Initialise the backward pass: $W_{T|T}^i = W_T^i$.
- For n = T - 1, T - 2, ..., 0 compute the weights
  $W_{n|T}^i = W_n^i \sum_{j=1}^{N} W_{n+1|T}^j \, \frac{f_\theta(X_{n+1}^j | X_n^i)}{\sum_{l=1}^{N} W_n^l \, f_\theta(X_{n+1}^j | X_n^l)}$ (22)
  and obtain the approximation
  $\hat p_\theta(dx_n | y_{0:T}) = \sum_{i=1}^{N} W_{n|T}^i \, \delta_{X_n^i}(dx_n)$
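The backward weight recursion can be sketched directly from the formula, again assuming the filter output is stored as `particles[n][i]` and normalised `weights[n][i]`, with `f(x_next, x)` evaluating the transition density $f_\theta(x_{next} | x)$. The names and storage layout are assumptions of this sketch; the double loop makes the $O(N^2)$ cost per time step explicit.

```python
def ffbsm_weights(particles, weights, f):
    """Marginal smoothing weights W_{n|T}^i for n = 0..T (FFBSm backward pass)."""
    T = len(particles) - 1
    N = len(particles[0])
    smooth = [None] * (T + 1)
    smooth[T] = list(weights[T])                     # W_{T|T}^i = W_T^i
    for n in range(T - 1, -1, -1):
        # denom[j] = sum_l W_n^l f(X_{n+1}^j | X_n^l)
        denom = [sum(weights[n][l] * f(particles[n + 1][j], particles[n][l])
                     for l in range(N)) for j in range(N)]
        smooth[n] = [
            weights[n][i] * sum(
                smooth[n + 1][j] * f(particles[n + 1][j], particles[n][i]) / denom[j]
                for j in range(N) if denom[j] > 0.0)
            for i in range(N)
        ]
    return smooth
```

The smoothed mean at time n is then `sum(w * x for w, x in zip(smooth[n], particles[n]))`; the particle locations are unchanged, only their weights are revised.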

Particle Implementation
Let us say we have performed Forward Filtering Backward Smoothing (FFBSm).
Assume we have an approximation
$\hat p_\theta(dx_{n+1} | y_{0:T}) = \sum_{i=1}^{N} W_{n+1|T}^i \, \delta_{X_{n+1}^i}(dx_{n+1})$
and are interested in obtaining the approximation
$\hat p_\theta(dx_n, dx_{n+1} | y_{0:T}) = \sum_{i=1}^{N} W_{n,n+1|T}^i \, \delta_{X_n^{a(i)}, X_{n+1}^i}(dx_n, dx_{n+1})$
with $X_n^{a(i)}$ being the ancestor of $X_{n+1}^i$. Then we can weight the pair $(X_n^{a(i)}, X_{n+1}^i)$ by
$W_{n,n+1|T}^i = W_n^{a(i)} \, W_{n+1|T}^i \, \frac{f_\theta(X_{n+1}^i | X_n^{a(i)})}{\sum_{l=1}^{N} W_n^l \, f_\theta(X_{n+1}^i | X_n^l)}$. (23)

Discussion
In both previous slides the computational cost is proportional to $N^2 T$ operations in total.
Assuming exponential forgetting:
- $\hat S_n^\theta$ based on the fixed-lag approximation has an asymptotic variance with rate n/N, with a non-vanishing (as $N \to \infty$) bias proportional to n and a constant decreasing exponentially fast with L.
- The asymptotic bias and variance of the particle estimate of $S_n^\theta$ computed using the forward-backward procedures satisfy:
  $\left| E\left[ \hat S_n^\theta \right] - S_n^\theta \right| \le F_\theta \frac{n}{N}$, $V\left[ \hat S_n^\theta \right] \le H_\theta \frac{n}{N}$. (24)
  - but note this is using algorithms at a cost of $N^2 T$ operations

Discussion
To compute $\hat S_n^\theta$ at a cost of $N^2 T$ one can implement either:
1. a simple particle filter with $N^2$ particles
2. an FFBS particle filter with N particles
Then:
Case 1:
- suffers from path degeneracy
- bias of order $T / N^2$
- variance at least of order $T^2 / N^2$
Case 2:
- more expensive
- bias of order $T / N$
- variance of order $T / N$

On-line methods
On-line / forward-only extensions for EM and gradient methods do exist:
- Poyiadjis, Doucet, Singh 11
- Cappe 09
- Del Moral, Doucet, Singh 09
Understanding them is beyond this course.
The next couple of slides are for general information & interest.

On-line methods
On-line extensions for EM and gradient methods do exist.
For the gradient method:
$\theta_{n+1} = \theta_n + \gamma_{n+1} \nabla \log p_{\theta_{0:n}}(y_n | y_{0:n-1})$ (25)
where $\nabla \log p_{\theta_{0:n}}(y_n | y_{0:n-1})$ is defined as
$\nabla \log p_{\theta_{0:n}}(y_n | y_{0:n-1}) = \nabla \log p_{\theta_{0:n}}(y_{0:n}) - \nabla \log p_{\theta_{0:n-1}}(y_{0:n-1})$, (26)

On-line methods
The notation $\nabla \log p_{\theta_{0:n}}(y_{0:n})$ corresponds to a time-varying score which is computed with a filter using the parameter $\theta_p$ at time p.
Using Fisher's identity to compute this time-varying score, we have for $1 \le p \le n$
$s_p(x_{p-1:p}) = \nabla \log f_\theta(x_p | x_{p-1}) |_{\theta = \theta_p} + \nabla \log g_\theta(y_p | x_p) |_{\theta = \theta_p}$. (27)

On-line methods
In offline EM the maximisation can be rewritten as
$\theta_{k+1} = \Lambda\left( T^{-1} S_T^{\theta_k} \right)$. (28)
So for on-line EM one can use Robbins-Monro averaging:
$S_{\theta_{0:n}} = \gamma_{n+1} \int s_n(x_{n-1:n}) \, p_{\theta_{0:n}}(x_{n-1}, x_n | y_{0:n}) \, dx_{n-1:n} + (1 - \gamma_{n+1}) \sum_{k=0}^{n} \left( \prod_{i=k+2}^{n} (1 - \gamma_i) \right) \gamma_{k+1} \int s_k(x_{k-1:k}) \, p_{\theta_{0:k}}(x_{k-1:k} | y_{0:k}) \, dx_{k-1:k}$, (29)
Then a standard maximization step is used as in the batch version:
$\theta_{n+1} = \Lambda(S_{\theta_{0:n}})$.
There is also a forward-only implementation of FFBSm (Del Moral et al. 2009).

Discussion
- On-line and offline parameter estimation boils down to computing smoothed integrals of additive functionals.
- Can either use the standard algorithm (with O(N) cost) or dedicated smoothing algorithms (with $O(N^2)$ cost).
- With the exception of on-line gradient methods, when the same computational cost is used:
  - the first choice suffers from the variance,
  - the second suffers from the bias,
  - both give similar MSE.

Numerical example
[Figure: Estimating smoothed additive functionals: empirical bias of the estimate of $S_n^\theta$ (top panel), empirical variance (middle panel) and mean squared error (bottom panel) for the estimate of $S_n^\theta / \sqrt{n}$. Columns: O(N) method and $O(N^2)$ method; horizontal axis: time n.]

Numerical example
[Figure: EM: boxplots of $\hat\theta_n$ for $n \ge 5 \times 10^4$ using 100 realizations of the algorithms. Panels: $\rho$ and $\tau^2$, for the O(N) method and the $O(N^2)$ method.]

Homework 5
For the following scalar model
$X_n = \rho X_{n-1} + \tau V_n$, $Y_n = X_n + \sigma W_n$, (30)
where $W_n, V_n \overset{iid}{\sim} N(0, 1)$, $X_0 \sim N(0, 1)$:
- Synthesise data-sets $y_{0:T}$ for T = 1000, $\rho = 0.8$, $\tau = 1$ with varying $\sigma$ = 0.01, 0.1, 1. Store the real state trajectory $x_{0:T}$ for future comparisons in each case.
- Implement a particle filter of your choice.
- Using appropriate plots, compare the approximations of the mean and variance of $p(x_n | y_{0:T})$ using
  - a standard particle filter
  - a particle filter with fixed lag smoothing
  - a particle filter with backward smoothing
  Comment on the computational cost in each case.
- (*) In each case compare the approximation of $p(x_{0:n} | y_{0:n})$ using
  - plots of the number of unique particles at certain lags, or illustrations of sampled paths at different times n
  - results showing Monte Carlo bias and variance for smoothed additive functionals.

Coursework instructions
Coursework option 1: particle methods
- Pick an HMM of your choice such that the state and observation can be multidimensional, with dimensions $d_x$ and $d_y$ respectively.
- Using some known values for the static parameters:
  - implement a bootstrap particle filter and a more advanced PF of your choice
  - generate plots and tables to compare the two methods for varying N, $d_x$ and $d_y$
  - assess the methods based on accuracy and variance of the normalising constant and of integrals like the posterior (filter) mean, variance, etc.
- Consider a parameter estimation method of your choice (particle MCMC, gradients, EM):
  - implement it and describe the results for varying N, $d_x$ and $d_y$ using plots and tables.
- In your answers also provide short comments.

Coursework instructions
Coursework option 2: if your research is related to computational statistics, or uses MCMC:
1. present your model of interest and the problem at hand,
2. the inferential method for the problem (e.g. Bayesian inference, optimisation etc.) and the challenges involved,
3. the simulation method (e.g. MCMC, IS, SMC),
4. numerical results,
5. a discussion on how material in this course can be used for extensions.
Page limit: 10-12 pages, recommended length around 8 pages; use appendices if you need to go beyond the page limits.
Submit by email to n.kantas at imperial.ac.uk using the subject: LTCC coursework submission
Deadline: 5 Dec 18 (a month)