An Introduction to Sequential Monte Carlo for Filtering and Smoothing

An Introduction to Sequential Monte Carlo for Filtering and Smoothing Olivier Cappé LTCI, TELECOM ParisTech & CNRS http://perso.telecom-paristech.fr/ cappe/ Acknowlegdment: Eric Moulines (TELECOM ParisTech) & Tobias Rydén (Lund)

Bayesian Dynamic Models Bayesian Dynamic Models Hidden Markov Models and State-Space Models Extensions 2 The Filtering and Smoothing Recursions 3 Sequential Importance Sampling 4 Sequential Importance Sampling with Resampling 5 The Auxiliary Particle Filter 6 Smoothing 7 Mixture Kalman Filter

Bayesian Dynamic Models Hidden Markov Models and State-Space Models Hidden Markov Model (HMM) The Hidden State Process {X k } k 0 is a Markov chain with initial probability density function (pdf) t 0 (x) and transition density function t(x, x ) such that * k p(x 0:k ) = t 0 (x 0 ) t(x l, x l+ ). The Observed Process {Y k } k 0 is such that the conditional joint density of y 0:k given x 0:k has the conditional independence (product) form p(y 0:k x 0:k ) = l=0 k l(x l, y l ). l=0 * x 0:k denotes the collection x 0,..., x k.

Bayesian Dynamic Models Hidden Markov Models and State-Space Models Graphical Representation of the Dependence Structure The HMM can be represented pictorially by a Bayesian network which depicts the conditional independence relations: X k X k+ Y k Y k+

Bayesian Dynamic Models Hidden Markov Models and State-Space Models State-Space Form Here the model is described in a functional form: X k+ = a(x k, U k ), Y k = b(x k, V k ), where {U k } k 0 and {V k } k 0 are mutually independent i.i.d. sequences of random variables (also independent of X 0 ). Remark The term state-space model often refers to the case where a and b are linear functions of their arguments (and {U k }, {V k }, X 0 are jointly Gaussian). Likewise, the term HMM is sometimes used (not in this talk!) more restrictively for the case where X is a finite set.

Bayesian Dynamic Models Hidden Markov Models and State-Space Models HMM Examples source coding ion channel modelling ergodic tracking, vision quantitative finance finite state space continuous state space speech recognition handwritting recognition left-right trajectory models

Bayesian Dynamic Models Extensions Beyond HMMs For sequential Monte Carlo methods, the key point is the structure of the conditional p(x 0:k y 0:k ): the methods described in this talk directly apply in cases where the conditional may be factored as k p(x 0:k y 0:k ) = p(x 0 y 0 ) p(x l+ x l, y 0:l+ ) l=0 Example: Switching Autoregressive Model If the observation equation is replaced by Y k = b(x k )Y k + V k then p(x 0:k y 0:k ) = p(x 0 y 0 ) k l=0 p(x l+ x l, y l+, y l )

Bayesian Dynamic Models Extensions Hierarchical HMMs C k C k+ Z k Z k+ Y k Y k+ Bayesian network for a hierarchical HMM: The state sequence {X k } is composed of two chains {C k } and {Z k } that evolve in parallel and the observation Y k depends on both component of the chain.

Bayesian Dynamic Models Extensions In hierarchical HMMs, sequential Monte Carlo is applicable to the marginalized posterior p(c 0:k Y 0:k ) (despite the fact that it doesn t factorizes as previously requested due to the marginalization of Z 0:k ) when p(z k c 0:k, y 0:k ) is computable Conditionally Gaussian Linear State-Space Model Dynamic equation Observation equation Z k+ = A(C k+ )Z k + R(C k+ )U k Y k = B(C k )Z k + S(C k )V k where {C k } is a finite-valued Markov Chain.

The Filtering and Smoothing Recursions Bayesian Dynamic Models 2 The Filtering and Smoothing Recursions Basic Recursions Computational Filtering and Smoothing Approaches 3 Sequential Importance Sampling 4 Sequential Importance Sampling with Resampling 5 The Auxiliary Particle Filter 6 Smoothing 7 Mixture Kalman Filter

The Filtering and Smoothing Recursions Basic Recursions Tasks of interest for HMMs State Inference How to make probabilistic statements on the state sequence given the model and the observations? Filtering π k k (x k ) = p(x k Y 0:k ) Prediction π k+ k (x k+ ) = p(x k+ Y 0:k ) Smoothing π 0:k k (x 0:k ) = p(x 0:k Y 0:k ) (fixed-interval: π l k for l = 0,..., k; fixed-lag: π k k+ for k = 0,... ) Parameter Inference How to tune the model parameters based on the observations?

The Filtering and Smoothing Recursions Basic Recursions Recursive Structure of the Joint Smoothing Density By Bayes rule π 0:k+ k+ (x 0:k+ ) = (L k+ (Y 0:k+ )) t 0 (x 0 ) k l=0 k+ t(x l, x l+ ) l(x l, Y l ) ( ) Lk+ (Y 0:k+ ) = π L k (Y 0:k ) 0:k k(x 0:k) t(x k, x k+)l(x k+, Y k+), where the normalization constants L k, i.e., the likelihood of the observations, is usually not computable. l=0

The Filtering and Smoothing Recursions Basic Recursions The Joint Smoothing Recursion π 0:k+ k+ (x 0:k+ ) = ( ) Lk+ L k π 0:k k (x 0:k ) t(x k, x k+ )l(x k+, Y k+ ) The marginal recursion may be decomposed in two steps: Prediction π k+ k (x k+ ) = π k k (x k )t(x k, x k+ )dx k Filtering π k+ k+ (x k+ ) = ( Lk+ L k ) π k+ k(x k+)l(x k+, Y k+)

The Filtering and Smoothing Recursions Computational Filtering and Smoothing Approaches Exact Implementation of the Filtering and Smoothing Recursions When X is finite (Baum et al., 970) The associated computational cost is X 2 per time index (for the filtering part). In linear Gaussian state-space models (Kalman & Bucy, 96) The filtering and prediction recursion is implemented by the Kalman filter (L k+ /L k is interpreted as the likelihood of the (k + )-th innovation). Such finite dimensional filters exist only in very specific models (see, e.g., Runggaldier & Spizzichino, 200)

The Filtering and Smoothing Recursions Computational Filtering and Smoothing Approaches Approximate Implementations of the Filtering and Smoothing Recursions EKF (Extended Kalman Filter) Linearization-based approach (for non-linear Gaussian state space models) UKF (Unscented Kalman Filter, Julier & Uhlmann, 997) Point-based approach Variational Methods (e.g., Valpola & Karhunen, 2002) Based on parametric density approximation arguments. Exact Suboptimal Filters In particular, Kalman filter viewed as minimum mean square error linear filtering.

The Filtering and Smoothing Recursions Computational Filtering and Smoothing Approaches Sequential Monte Carlo (SMC) Sequential Monte Carlo (sometimes called particle filtering) is a method which uses pseudo-random simulations to produce successive populations of weighted particles Xk :n and associated weights Wk :n such that n Wk i f(xi k ) i= for all functions f of interest. f(x)π k k (x)dx, The SMC process is sequential in the sense that given Xk :n, Wk :n and the observations Y 0:k+, Xk+ :n and W :n k+ are conditionally independent of previous populations of particles. SMC is based on importance sampling and resampling.

Sequential Importance Sampling Bayesian Dynamic Models 2 The Filtering and Smoothing Recursions 3 Sequential Importance Sampling Self-Normalized Importance Sampling Sequential Importance Sampling (SIS) Weight Degeneracy 4 Sequential Importance Sampling with Resampling 5 The Auxiliary Particle Filter 6 Smoothing 7 Mixture Kalman Filter

Sequential Importance Sampling Self-Normalized Importance Sampling Self-Normalized Importance Sampling, or IS (Hammersley & Handscomb, 964) IS is a weighted form of Monte Carlo approximation, in which expectations under a target pdf π π(f) = E π [f(x)] are estimated as where X i iid q, ω i = π q (Xi ). ˆπ n q (f) = n ω i n f(x i ), j= }{{ ωj } W i i= This form of IS (sometimes also called Bayesian IS) does not necessitate that π be properly normalized.

Sequential Importance Sampling Self-Normalized Importance Sampling Performance of IS Assuming that E π [ π q (X)( + f 2 (X))] <, ˆπ q n (f) is consistent and asymptotically normal, with asymptotic variance given by [ ] π υ q (f) = E π q (X) (f(x) π(f))2. The asymptotic variance can be estimated from the IS sample by ˆυ n q (f) = n n (W i ) 2 {f(x i ) ˆπ q n (f)} 2, i= where W i = ω i / n j= ωj are the normalized weights.

Sequential Importance Sampling Self-Normalized Importance Sampling Elements of proof Consistency If π = π/c, where c = π(x)dx, is a pdf, E q [ω i f(x i )] = E q [ π q (X)f(X) ] = c E π [f(x)]. CLT n(ˆπ n q (f) π(f)) = n n i= ωi (f(x i ) π(f)) n n j= ωj and Var[ω i (f(x i ) π(f))] = c 2 E π [ π q (X)(f(X) π(f))2 ]. Empirical estimate ˆυ n q (f) = n i= W i n π q (Xi ) n j= π q (Xj ) {f(xi ) ˆπ q n (f)} 2

Sequential Importance Sampling Self-Normalized Importance Sampling Empirical diagnostic tools for IS Effective sample size ESS n q = [ n i= ( W i ) ] 2 ESS n q n 2 n/ ESS n q is an estimate of E π [ π q (X)] which is the maximal IS asymptotic variance for functions f such that f(x) π(f) (note: these functions have maximal Monte Carlo variance of under π ). 3 n/ ESS n q is an estimator of the χ 2 divergence (π(x) q(x)) 2 dx q(x) (n/ ESS n q is also the square of the empirical coefficient of variation associated with the normalized weights).

Sequential Importance Sampling Self-Normalized Importance Sampling Another Empirical diagnostic tools for IS (Shannon) Entropy of the Normalized Weights N ENT n q = W i log(w i ) i= exp(ent n q ) n 2 exp(ent n q )/n is an estimate of exp[ K(π q)], where K(π q) is the Kullback-Leibler divergence between π and q.

Sequential Importance Sampling Self-Normalized Importance Sampling Optimal IS Instrumental Density When considering large classes of functions f, IS generally appears to perform worse than Monte Carlo under π (e.g., for functions such that f(x) π(f), the maximal Monte Carlo asymptotic variance is whereas, the maximal IS asymptotic variance is E π [ π q (X)] which is strictly larger than by Jensen s inequality, unless q = π). However, for a fixed function f, the optimal IS density is not π but f(x) π(f) π(x) f(x ) π(f) π(x )dx as Cauchy-Schwarz inequality implies that E π [ π q (X) (f(x) π(f))2 ] E π [ q π (X) ] } {{ } = E 2 π [ f(x) π(f) ].

Sequential Importance Sampling Self-Normalized Importance Sampling Example (Bayesian Posterior) Consider a simple model in which X 0 N(, 2 2 ), Y 0 X 0 N(X0, 2 σ 2 ). And apply IS to compute expectations under the posterior for Y 0 = 2, using the prior as instrumental pdf.

Sequential Importance Sampling Self-Normalized Importance Sampling Instrumental Target.4 Instrumental Target 0.35.2 0.3 0.25 0.8 0.2 0.6 0.5 0. 0.05 0.2 0 6 4 2 0 2 4 6 8 0 6 4 2 0 2 4 6 8 σ = 2 ESS n q /n 0.66 υ q (x x).69 υ π (x x).25 σ = 0.5 ESS n q /n 0.25 υ q (x x) 5.35 υ π (x x).25

Sequential Importance Sampling Sequential Importance Sampling (SIS) Back to the Filtering and Smoothing Problem How to estimate expectations under the posterior π 0:k k (x 0:k ) = p(x 0:k Y 0:k ) in the model k p(x 0:k ) = t 0 (x 0 ) t(x l, x l+ ), p(y 0:k x 0:k ) = using a sequential algorithm? l=0 k l(x l, y l ), l=0

Sequential Importance Sampling Sequential Importance Sampling (SIS) Sequential Smoothing through IS, or SIS (Handschin & Mayne, 969-970) Propose n independent particle trajectories {X i 0:k+ } i n under a Markovian scheme such that p(x 0:k+ ) = ρ 0:k+ (x 0:k+ ) = q 0 (x 0 ) 2 Compute importance weights sequentially: k q l (x l, x l+ ). l= ω i k+ = π 0:k+ k+(x i 0:k+ ) ρ 0:k+ (X i 0:k+ ) = ω i k t(xi k, Xi k+ )l(xi k+, Y k+) q k (X i k, Xi k+ ). Then, n i= ω i k+ n j= ωj k+ f(x i 0:k+ ) is an estimate of E [f(x 0:k+ ) Y 0:k+ ].

Sequential Importance Sampling Sequential Importance Sampling (SIS) FILT. INSTR. FILT. + One step of the SIS algorithm with just seven particles.

Sequential Importance Sampling Sequential Importance Sampling (SIS) Choice of the Instrumental Kernel The so-called optimal choice of q k, consists in setting q k (x, x ) = q opt k (x, x ) = t(x, x )l(x, Y k+ ) t(x, x )l(x, Y k+ )dx for which the normalized weights are predictable, i.e. Wk+ i depend on Xk i but not Xi k+ (hence, for this instrumental kernel the conditional variance cancels). This is however usually not feasible and common choices include the prior q k = t (and then ωk i l(xi k, Y k)) 2 approximations (sometimes heuristic) to q opt k matching, use of EKF or UKF,... ), (moment 3 tuning parameters of q k so as to maximize the effective sample size or entropy criterions

Sequential Importance Sampling Weight Degeneracy Weight Degeneracy Empirically, the SIS approach always fail when the time-horizon k is more than a few tens; the IS weights ωk :n usually become very unbalanced with a few weights dominating all the other To understand why it is the case, consider the (silly) model where { t(x, x ) = t(x ) = t 0 (x ), (Independent states) l(x, y) = l(y), (Non-informative observations) and the instrumental kernel is such that q l (x, x ) = q 0 (x ) = q(x )

Sequential Importance Sampling Weight Degeneracy Weight Degeneracy (Contd.) For a function of interest f that only depends on the last coordinate x k of the trajectory x 0:k, the asymptotic variance of the SIS approximation to π k k (f) = E πk k [f(x)] is given by υ k (f) = ( k = l=0 ) 2 t (f(xk q (x l) ) π ) 2 k k k(f) l=0 q(x l ) dx 0... dx k ( t ) k t q (x)t(x)dx q (x) ( f(x) π k k (f) ) 2 t(x)dx. }{{} > In practise, this situation can usually be detected by monitoring the effective sample size or entropy criterions, which become abnormally small.

Sequential Importance Sampling Weight Degeneracy Application to the Stochastic Volatility Model This is a non-linearly observed state-space model used to represent log-returns in quantitative finance: where X k+ = φx k + σu k φ <, Y k = β exp(x k /2)V k, {U k } k 0 and {V k } k 0 are independent standard Gaussian white noise processes. X 0 N(0, σ 2 /( φ 2 ). We consider trajectories of length 50 simulated under the model with φ = 0.98, σ = 0.7 and β = 0.64 and apply SIS with q k = q and n =, 000 particles.

Sequential Importance Sampling Weight Degeneracy Typical Evolution of ESS for a Single Trajectory 000 ESS (500 indep. replications of the particles) 900 800 700 600 500 400 300 200 00 0 5 0 5 20 25 30 35 40 45 50 time index

Sequential Importance Sampling Weight Degeneracy Evolution of ESS When Averaging Over Trajectories 000 900 800 ESS (500 indep. simulations) 700 600 500 400 300 200 00 0 5 0 5 20 25 30 35 40 45 50 time index

Sequential Importance Sampling Weight Degeneracy Typical Evolution of the Weight Distribution 000 500 000 0 25 20 5 0 5 0 500 0 25 20 5 0 5 0 00 50 0 25 20 5 0 5 0 Importance Weights (base 0 logarithm) Histograms of the base 0 logarithm of the normalized importance weights after (from top to bottom), 0 and 00 iterations for the stochastic volatility model (improved instrumental kernel based on moment matching).

Sequential Importance Sampling Weight Degeneracy Filtering Results 0.08 Density 0.06 0.04 0.02 0 0 5 0 Time Index 5 20 2.5 0.5 State 0 0.5.5 2 Waterfall representation of the sequence of estimated filtering pdfs (weighted kernel smooth) together with the actual state for,000 particles (improved instrumental kernel based on moment matching).

Sequential Importance Sampling with Resampling Bayesian Dynamic Models 2 The Filtering and Smoothing Recursions 3 Sequential Importance Sampling 4 Sequential Importance Sampling with Resampling Sampling Importance Resampling Sequential Importance Sampling with Resampling (SISR) Case Study 5 The Auxiliary Particle Filter 6 Smoothing 7 Mixture Kalman Filter

Sequential Importance Sampling with Resampling Sampling Importance Resampling In IS, it is indeed possible to reset the weights to a constant value at the price of a, usually moderate, increase in variance. Sampling Importance Resampling (Rubin, 987) Replace {X :n, W :n } by { X :Ñ, W :Ñ} such that the discrepancy between the resampled weights { W :Ñ} is reduced and Ñ W i= k iδ is a good approximation to n Xi i= W i δ X i. In general the resampling is random and subject to the constraints Ñ = n, W [ i = /Ñ, Ñ E i= { X ] i = X j } X :n, W :n = ÑW j ( j n). The last condition is often referred to as unbiasedness or proper weighting.

Sequential Importance Sampling with Resampling Sampling Importance Resampling Multinomial Resampling Draw, conditionally independently given {X :n, W :n }, n discrete random variables (J,..., J n ) taking their values in the set {,..., n} with probabilities (W,..., W n ). 2 Set, for i =,..., n, Xi = X J i and W i = /n. Let C i = n j= { X j = X i } (i =,..., n) denote the number of times each particle is duplicated in the resampling process. The counts (C,..., C n ) follow a multinomial distribution with parameters n, (W,..., W n ), conditionally to {X :n, W :n }. INSTRUMENTAL Resampled particles TARGET

Sequential Importance Sampling with Resampling Sampling Importance Resampling Some Results on SIR D Xi π are n (some extensions of this result) n 2 n i= f( X i ) is an asymptotically normal estimator of π(f) (assuming E π [ π q (X)( + f 2 (X)) + f 2 (X)] < ) with asymptotic variance given by [ ] π [ υ q (f) = E π q (X) (f(x) π(f))2 + E π (f(x) π(f)) 2] }{{}}{{} υ Var q(f) π[f(x)] If n is sufficiently large, the cost of resampling is very moderate in situation that are challenging for IS, i.e., when υ q (f) Var π [f(x)].

Sequential Importance Sampling with Resampling Sampling Importance Resampling Elements of Proof For a bounded function f, [ E f( X ] [ ( i ) = E E f( X i ) X :n, W :n)] n = E W j f(x j ) j= and n j= W j f(x j ) a.s. π(f) ( n j= W j f(x j ) f ). For the CLT, an ingredient of the proof is that [ ] [ n n ] n Var f( n X i ) = n Var W i f(x i ) i= i= }{{} υ q(f) [ n ( n + E W i f 2 (X i ) W i f(x i ) i= i= ) 2 } {{ } a.s. Var π[f(x)] ]

Sequential Importance Sampling with Resampling Sampling Importance Resampling There Exists Resampling Schemes with Reduced Variance Residual Resampling For i =,..., n set C i = nw i + C i, where C,..., C n are distributed, conditionally to W :n, according to the multinomial distribution Mult(n R, W,..., W n ) with R = n i= nw i and W i = nw i nw i, i =,..., n. n R This scheme is obviously unbiased and induces an additional variance which is always lower than Var π [f] (the same is true for the bootstrap filter based on residual resampling, Douc & Moulines, 2005).

Sequential Importance Sampling with Resampling Sequential Importance Sampling with Resampling (SISR) The Simplest Functional Algorithm (Gordon et al., 993) Regular resampling is added to avoid weight degeneracy and to guarantee the long-term (k ) stability of the particle filter. The Bootstrap filter X :n k Given, propose new positions Xi k+ independently under the prior dynamic t( X k i, ), for i =,..., n; 2 Compute the weights ωk+ i = l(xi k+, Y k+), for i =,..., n and normalize them (Wk+ i = ωi k+ / n 3 Resample to obtain X :n k+ indices J i k+ such that P ( J i k+ = j W :n k+ setting X i k+ = XJ i k+ k+ j= ωj k+ );, e.g., by drawing independent ) = W j k+ and (Multinomial Resampling). To avoid unnecessary resampling, 3 can be used only when the ESS statistics of the weights Wk+ :n falls below a threshold.

Sequential Importance Sampling with Resampling Sequential Importance Sampling with Resampling (SISR) FILT. FILT. FILT. + FILT. + FILT. +2 FILT. +2 SIS (left) and SISR (right).

Sequential Importance Sampling with Resampling Case Study AR() Model Observed in Pulsated Noise We consider a simple Gaussian linear state-space model to allow for comparison with analytical computations: X k+ = φx k + σu k, Y k = X k + η k V k, where φ = 0.99, σ = 0.2, and η k is varied (periodically) between 0. and 3. The observation sequence is simulated from the model but we also consider the influence of an out-of-model outlying observation at time 460.

Sequential Importance Sampling with Resampling Case Study 0 5 state obs. filt. smooth. 0 5 0 50 00 50 200 250 300 350 400 450 500 time 6 state obs. filt. smooth. 0 8 state obs. filt. smooth. 4 6 4 2 2 0 0 2 2 4 4 6 6 0 20 30 40 50 60 70 80 90 00 time 8 400 40 420 430 440 450 460 470 480 490 500 time

Sequential Importance Sampling with Resampling Case Study Sequential Importance Sampling (prior kernel, n = 00) 00 ESS 50 0 5 0 5 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 00 filt. mean 2 0 2 5 0 5 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 00 filt. std..5 0.5 0 5 0 5 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 00 time

Sequential Importance Sampling with Resampling Case Study Bootstrap Filter (prior kernel, n = 00, resamp. ESS < 0) ESS 00 80 60 40 20 2 5 0 5 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 00 filt. mean 0 2 5 0 5 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 00.5 filt. std. 0.5 0 5 0 5 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 00 time

Sequential Importance Sampling with Resampling Case Study Bootstrap Filter (prior kernel, n = 00, resamp. always) 00 80 60 40 20 0 20 30 40 50 60 70 80 90 00 2 filt. mean 0 2 5 0 5 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 00 filt. std..5 0.5 0 5 0 5 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 00 time

Sequential Importance Sampling with Resampling Case Study Bootstrap Filter (prior kernel, n = 00, resamp. ESS < 0) ESS 00 80 60 40 20 filt. mean 4 2 0 400 40 420 430 440 450 460 470 480 490 500 filt. std. 0.5 0 400 40 420 430 440 450 460 470 480 490 500 time

Sequential Importance Sampling with Resampling Case Study Bootstrap Filter (n = 5, 000, resamp. ESS < 500) ESS 5000 4000 3000 2000 000 filt. mean 4 2 0 400 40 420 430 440 450 460 470 480 490 500 filt. std. 0.5 0 400 40 420 430 440 450 460 470 480 490 500 time

Sequential Importance Sampling with Resampling Case Study Conclusions With resampling, SISR achieves long-term stability. The increase in variance due to resampling is very moderate, especially when resampling is applied only when needed. The method is still sensitive to outliers, model misspecification, etc., which may necessitate the use of more elaborate instrumental kernels.

The Auxiliary Particle Filter Bayesian Dynamic Models 2 The Filtering and Smoothing Recursions 3 Sequential Importance Sampling 4 Sequential Importance Sampling with Resampling 5 The Auxiliary Particle Filter A New Interpretation of SISR The Auxiliary Trick 6 Smoothing 7 Mixture Kalman Filter

The Auxiliary Particle Filter A New Interpretation of SISR Alternatives to SISR The resampling step in the SISR algorithm can be seen as a method to sample approximately under the distribution obtained when plugging the current particle approximation into the filtering update. This alternative way of thinking about resampling suggests several sequential Monte Carlo variants.

The Auxiliary Particle Filter A New Interpretation of SISR The Filtering Recursion Revisited Recall that (with more general notations) π k+ k+ (f) = f(x )π k k (dx)t(x, dx )l(x, Y k+ ) πk k (dx)t(x, dx )l(x, Y k+ ). Now, consider what happens when considering the empirical filtering distribution ˆπ n k k = n Wk i δ Xk i i= as an approximation to π k k and plugging it into the previous relation.

The Auxiliary Particle Filter A New Interpretation of SISR Filtering Target One obtains a mixture target pdf π n k+ k+ (x) = = n i= W k it(xi k, x)l(x, Y k+) W i k t(xk i, x)l(x, Y k+)dx n i= n i= W i k γ k(x i k ) n j= W j k γ k(x j k ) qopt k (Xi k, x), where γ k (x) = t(x, dx )l(x, Y k+ ), q opt k (x, x ) = t(x, x )l(x, Y k+ ). γ k (x)

The Auxiliary Particle Filter A New Interpretation of SISR Filtering Step 0.2 Approximation 0.3 0.2 0. 0 0 8 6 4 2 0 2 4 6 8 0 0.5 0. 0.05 0 0 8 6 4 2 0 2 4 6 8 0 0.3 Prediction likelihood 0.3 Prediction likelihood 0.2 predictive distribution 0.2 predictive distribution 0. 0. 0 0 8 6 4 2 0 2 4 6 8 0 0 0 8 6 4 2 0 2 4 6 8 0 0.3 Filtering 0.8 0.6 Filtering 0.2 0. 0.2 0 0 8 6 4 2 0 2 4 6 8 0 0 0 8 6 4 2 0 2 4 6 8 0

The Auxiliary Particle Filter A New Interpretation of SISR The previous reinterpretation of SISR shows that resampling and trajectory update can be integrated into a single global IS step that targets π n k+ k+. But, how to chose the global instrumental kernel q glob k ((Xk :n, W k :n ), x)? as πk+ k+ n is an n-mixture density, evaluation of the IS weights may necessitate of the order of n 2 computations. Remark The global instrumental kernel that corresponds to the bootstrap filter (with resampling) is q glob k ((X :n k, W :n k ), x) = n Wk i t(xi k, x). i=

The Auxiliary Particle Filter The Auxiliary Trick The Auxiliary Trick Assume that the global instrumental kernel is a mixture of the form q glob k ((X :n k, W :n k ), x) = n τk i q k(xk i, x). To avoid the n 2 update, one can use data augmentation, introducing the mixture component as an auxiliary variable: i= q aux k ((X :n k, W :n k ), (i, x)) = τ i k q k(x i k, x) defines a pdf on {,..., n} X, whose marginal is q k ((Xk :n, W k :n ), x). Likewise, π n,aux k+ k+ (i, x) = c W i k t(xi k, x)l(x, Y k+), where c = n i= W i k t(x i k, x)l(x, Y k+)dx, is the auxiliary target with marginal π n k+ k+ (x).

The Auxiliary Particle Filter The Auxiliary Trick Auxiliary Sampling Algorithm (Pitt & Shephard, 999) Auxiliary Particle Filter Repeat n times independently, for i =,..., n, Sample an index J i in {,..., n} with probabilities (τ k,..., τ n k ). 2 Sample a position X i k+ from q k(x J i k, x). 3 Compute the (unnormalized) auxiliary IS weight ωk+ i = W J i k t(xj i k, Xi k+ )l(xi k+, Y k+) τ J i k q k(x J i k, Xi k+ ).

The Auxiliary Particle Filter The Auxiliary Trick The Auxiliary View Provides New Degrees of Freedom In the original auxiliary particle filter, Pitt & Shephard used qk i = t and proposed an heuristic rule for setting the weights τk i based on Xk i and Y k+. The auxiliary IS weight may also be rewritten in normalized form as ωk+ i = W J i k γ k(x J i k )qopt k (XJ i k, Xi k+ ) τ J i k t(xj i k, Xi k+ ). showing that the optimal choice of τk i, in the sense of minimizing the conditional (to Xk :n, W k :n ) variance of n i= W k+ i f(xi k+ ) for a function f, is an instance of the Neyman allocation problem. And thus the optimal choice is τ i k W i k γ k(x i k ) υ i (f) where υ i (f) is the asymptotic variance of IS for f, target q opt k (Xi k, x), and, instrumental pdf t(xi k, x) (Olsson et al., 2007).

Smoothing Bayesian Dynamic Models 2 The Filtering and Smoothing Recursions 3 Sequential Importance Sampling 4 Sequential Importance Sampling with Resampling 5 The Auxiliary Particle Filter 6 Smoothing Using the Ancestry Tree Backward Reweighting Smoothing for Sum Functionals 7 Mixture Kalman Filter

Smoothing Using the Ancestry Tree.6.4.2 state 0.8 0.6 5 0 5 20 25 time index.6.4.2 state 0.8 0.6 5 0 5 20 25 time index Predictive densities and evolution of the particle ancestry tree

Smoothing Backward Reweighting Backward Smoothing Recursion There are several options for computing the marginal smoothing pdfs π l k such that π l k (x) = p(x l Y 0:k ) for l = 0,..., k. The backward smoothing recursion is based on the observation that the conditional time-reversed state sequence has a non-homogeneous Markovian structure (for conditionally Gaussian linear state-space models, this is known as RTS, or Rauch-Tung-Striebel, 965, smoothing).

Smoothing Backward Reweighting Backward Smoothing Recursion Define the backward smoothing kernels: b l (x l+, x l ) = π l l(x l )t(x l, x l+ ) πl l (x)t(x, x l+ )dx, for l = k, k 2,..., 0. Then π l k (x l ) = π l+ k (x l+ )b l (x l+, x l )dx l+. As p(x l, x l+ Y 0:k ) = p(x l x l+, Y 0:k ) p(x }{{} l+ Y 0:k ) }{{} p(x l x l+, Y 0:l ) π l+ k (x l+ ) }{{} b l (x l+,x l )

Smoothing Backward Reweighting Particle Reweighting Scheme The natural approximation to the backward smoothing recursion consists in computing smoothing weights Wl k i backwards according to the recursion Wk k i = W k i, for i =,..., n. For l = k, k 2,..., 0, W i l k = for i =,..., n. n W j l+ k j= W i l t(xi l, Xj l+ ) n i = W i l t(x i l, Xj l+ ), The approximation to E[f(X l ) Y 0:k ] is given by n i= W i l k f(xi l ).

Smoothing Backward Reweighting Back to Our Case Study AR() Model Observed in Pulsated Noise 0 8 state obs. filt. smooth. 6 4 2 0 2 4 6 8 400 40 420 430 440 450 460 470 480 490 500 time

Smoothing Backward Reweighting Using the Ancestry Tree (n = 00) 4 smooth. mean 3 2 0 2 400 40 420 430 440 450 460 470 480 490 500 smooth. std..2 0.8 0.6 0.2 0 400 40 420 430 440 450 460 470 480 490 500 time

Smoothing Backward Reweighting With Backward Reweighting (n = 00) 4 smooth. mean 3 2 0 2 400 40 420 430 440 450 460 470 480 490 500 smooth. std..2 0.8 0.6 0.2 0 400 40 420 430 440 450 460 470 480 490 500 time

Smoothing Backward Reweighting Backward Particle Reweighting Appears to be efficient and stable in the long term (although this hasn t been proved yet). Yet, it is not sequential (in particular, one needs to store all particle positions and weights); it has a potential numerical complexity proportional to the number n of particles squared (but not further likelihood evaluation is needed).

Smoothing Smoothing for Sum Functionals Sum Functionals and Parameter Estimation For parameter estimation, one requires smoothing of particular sum functionals of the hidden states. In the example of the stochastic volatility model, one needs to evaluate E [s i (X 0:n ) Y 0:n ], for 0 i 4 with n s 0 (x 0:n ) = x 2 0, s (x 0:n ) = x 2 k, s 2(x 0:n ) = s 3 (x 0:n ) = k=0 n x k x k, s 4 (x 0:n ) = k= n k=0 n x 2 k, k= Y 2 k exp( x k). We consider the simple case of s 0 (X 0 ) using N i= W n i { X i 0:n (0) } 2 as the sequential Monte Carlo estimate of E (s 0 (X 0 ) Y 0:n ) (for sum functionals, the same applies for each term in the sum).

Smoothing Smoothing for Sum Functionals Smoothing for s 0 (X 0 ) 0 2 Particles 0 3 Particles 0 4 Particles 0.8 0.6 0.2 5 20 00 500 time 5 20 00 500 time 5 20 00 500 time Box and whisker plots of particle estimates of R x 2 π 0 k (dx) for k =, 5, 20, 30 and 500, and particle population sizes n = 0 2, 0 3 and 0 4.

Smoothing Smoothing for Sum Functionals Smoothing for s 0 (X 0 ), Contd. 2.5 State Value 0.5 0 0.5 0 0 20 30 40 50 60 70 Time Index Particle trajectories at time n = 70 for the stochastic volatility model with n = 00 particles and systematic resampling. Using k < n is beneficial! Properly setting k corresponds to a bias-variance tradeoff: k bias decreases, k variance decreases. Hence, the general principle for sequential smoothing of sum functionals: use fixed-lag smoothing (Olsson et al., 2008).

Mixture Kalman Filter Bayesian Dynamic Models 2 The Filtering and Smoothing Recursions 3 Sequential Importance Sampling 4 Sequential Importance Sampling with Resampling 5 The Auxiliary Particle Filter 6 Smoothing 7 Mixture Kalman Filter The Mixture Kalman Filter (MKF) Algorithm An Application to Change Point Detection

Mixture Kalman Filter The Mixture Kalman Filter (MKF) Algorithm Filtering in Conditionally Gaussian Linear State-Space Models There are several cases of interest where the particles are more complicated than just elements of the state-space X. Conditionally Gaussian Linear State-Space Model Dynamic equation Observation equation Z k = A(C k )Z k + R(C k )U k Y k = B(C k )Z k + S(C k )V k where {C k } is itself a finite-valued Markov Chain. Many applications: Non-Gaussian noises modelled as mixtures, switching models, applications in digital communications, etc.

Mixture Kalman Filter The Mixture Kalman Filter (MKF) Algorithm The Exact Filtering Recursions Given C 0:k, p(z k Y 0:k, C 0:k ) is a Gaussian distribution with mean vector Ẑk k(y 0:k, C 0:k ) and covariance matrix Σ k k (Y 0:k, C 0:k ), which can be determined recursively using the Kalman filter. Hence p(x k Y 0:k ) is the mixture of C k+ Gaussian densities p(x k Y 0:k ) L k (Y 0:k c 0:k )p(c 0:k ) c 0:k C ( k+ ) N x k ; Ẑk k(y 0:k, c 0:k ), Σ k k (Y 0:k, c 0:k ). This is obviously of no use when k is not very small.

Mixture Kalman Filter The Mixture Kalman Filter (MKF) Algorithm Mixture Kalman Filtering Using (Z k, C k ) as state variable, one an use the SMC approaches described so far with (Z C)-valued particles. One can also use a related algorithm, where each particle represents a trajectory C 0:k summarized by the Kalman statistics Ẑk k(y 0:k, C 0:k ) and Σ k k (Y 0:k, C 0:k ). The resulting algorithm mixes systematic exploration of the trajectory continuations (in C), Kalman update to compute the likelihood factor and resampling.

Mixture Kalman Filter The Mixture Kalman Filter (MKF) Algorithm MKF: computation of the weights For i =,..., n and j =,..., r, compute Ẑ k+ k (C i 0:k, j) = A(j)Ẑk k(c i 0:k ), Σ k+ k (C i 0:k, j) = A(j)Σ k k(c i 0:k )At (j) + R(j)R t (j), Ŷ k+ k (C i 0:k, j) = B(j)Ẑk+ k(c i 0:k, j), Γ k+ (C i 0:k, j) = B(j)Σ k+ k(c i 0:k, j)bt (j) + S(j)S t (j), ω i,j k+ = W i k N(Y k+ ; Ŷk+ k(c i 0:k, j), Γ k+(c i 0:k, j)) Q C(C i k, j).

Mixture Kalman Filter The Mixture Kalman Filter (MKF) Algorithm MKF: Importance Sampling Step For i =,..., n, draw Jk+ i with probabilities proportional to ω i, k+,..., ωi,r k+, conditionally independently of the particle history, and set C0:k+ i = (Ci 0:k, J k+ i / ), r n Wk+ i = j= ω i,j k+ r i= j= ω i,j k+, K k+ (C i 0:k+ ) = Σ k+ k(c i 0:k, J i k+ )Bt (J i k+ )Γ k+ (Ci 0:k+, J i k+ ), Ẑ k+ k+ (C i 0:k+ ) = Ẑk+ k(c i 0:k, J i k+ ) + K k+ (C i 0:k+ ){Y k+ Ŷk+ k(c i 0:k, J i k+ )}, Σ k+ k+ (C i 0:k+ ) = {I K k+(c i 0:k+ )B(J k+)}σ k+ k (C i 0:k, J i k+ ).

Mixture Kalman Filter An Application to Change Point Detection Application to a Change Point Detection Task.5 x 05 x 04.4 0.3 Nuclear response.2. 0.9 Nuclear response 2 3 4 0.8 0.7 5 0.6 0 500 000 500 2000 2500 3000 3500 4000 Time 6 0 500 000 500 2000 2500 3000 3500 4000 Time Left: well-log data waveform with a median smoothing estimate of the state. Right: median smoothing residual.

Mixture Kalman Filter An Application to Change Point Detection Change Point Modelling To model this situation we put C = {0, }, where C k = 0 means that there is no change point at time index k whereas C k = means that a change point has occurred. The state space model is Z k+ = A(C k+ )Z k + R(C k+ )U k, Y k = Z k + V k, where A(0) = I, R(0) = 0 and A() = 0 and R() = R. The simplest model consists in taking for {C k } k 0 an i.i.d. sequence of Bernoulli random variables with probability of success p. The time between two change points (period of time during which the state variable is constant) is then distributed as a geometric random variable with mean /p.

Mixture Kalman Filter An Application to Change Point Detection Outliers Modelling 6 x 04 4 Quantiles of Input Sample 2 0 2 4 6 4 3 2 0 2 3 4 Standard Normal Quantiles Quantile-quantile regression of empirical quantiles of the well-log data residuals with respect to quantiles of the standard normal distribution.

Mixture Kalman Filter An Application to Change Point Detection Outliers Modelling, Contd. The normal distribution does not fit the measurement noise well in the tails. We model the measurement noise as a mixture of two Gaussian distributions: Z k+ = A(C k+, )Z k + R(C k+, )U k, U k N(0, ), Y k = µ(c k,2 ) + B(C k,2 )Z k + S(C k,2 )V k, V k N(0, ), where C k, {0, } and C k,2 {0, } are indicators of the presence of a change point and of an outlier, respectively. {C k,2 } is modelled as a two-state Markov chain which represents the clustering behavior of outliers.

Mixture Kalman Filter An Application to Change Point Detection 5 x 04 Data 0 5 0 500 000 500 2000 2500 3000 3500 4000 Posterior Probability of Jump 0.5 0 500 000 500 2000 2500 3000 3500 4000 Posterior Probability of Outlier 0.5 0 500 000 500 2000 2500 3000 3500 4000 Time On-line analysis of the well-log data, using 00 particles with detection delay d = 0. Top: data; middle: posterior probability of a jump; bottom: posterior probability of an outlier.

Conclusions There are many more important issues in SMC such as Analysis of Performance Convergence of more elaborate algorithms, less restrictive conditions for long-term stability, analysis of smoothing algorithms Resampling Variants Conditional variance reduction schemes, triggered resampling, varying number of particles... Choice of the Instrumental Moves Lookahead moves, combination with deterministic approximation, adaptive moves

Conclusions Some General References on SMC Doucet, A., De Freitas, N. and Gordon, N. (eds.) (200) Sequential Monte Carlo Methods in Practice. Springer. Ristic, B., Arulampalam, M. and Gordon, A. (2004) Beyond Kalman Filters: Particle Filters for Target Tracking. Artech House. Cappé, 0., Moulines, E. and Rydén, T. (2005) Inference in Hidden Markov Models. Springer. Doucet, A., Godsill, S. and Andrieu, C. (2000) On sequential Monte-Carlo sampling methods for Bayesian filtering. Stat. Comput., 0, 97-208. Arulampalam, M., Maskell, S., Gordon, N. and Clapp, T. (2002) A tutorial on particle filters for on line non-linear/non-gaussian Bayesian tracking. IEEE Trans. Signal Process., 50, 24 254. Cappé, O., Godsill, S. J. and Moulines, E. (2007) An overview of existing methods and recent advances in sequential Monte Carlo, IEEE Proc., 95, 899 924.