An Introduction to Sequential Monte Carlo for Filtering and Smoothing

Size: px

Start display at page:

Download "An Introduction to Sequential Monte Carlo for Filtering and Smoothing"

Sabrina Haynes
6 years ago
Views:

1 An Introduction to Sequential Monte Carlo for Filtering and Smoothing Olivier Cappé LTCI, TELECOM ParisTech & CNRS cappe/ Acknowlegdment: Eric Moulines (TELECOM ParisTech) & Tobias Rydén (Lund)

2 Bayesian Dynamic Models Bayesian Dynamic Models Hidden Markov Models and State-Space Models Extensions 2 The Filtering and Smoothing Recursions 3 Sequential Importance Sampling 4 Sequential Importance Sampling with Resampling 5 The Auxiliary Particle Filter 6 Smoothing 7 Mixture Kalman Filter

3 Bayesian Dynamic Models Hidden Markov Models and State-Space Models Hidden Markov Model (HMM) The Hidden State Process {X k } k 0 is a Markov chain with initial probability density function (pdf) t 0 (x) and transition density function t(x, x ) such that * k p(x 0:k ) = t 0 (x 0 ) t(x l, x l+ ). The Observed Process {Y k } k 0 is such that the conditional joint density of y 0:k given x 0:k has the conditional independence (product) form p(y 0:k x 0:k ) = l=0 k l(x l, y l ). l=0 * x 0:k denotes the collection x 0,..., x k.

4 Bayesian Dynamic Models Hidden Markov Models and State-Space Models Graphical Representation of the Dependence Structure The HMM can be represented pictorially by a Bayesian network which depicts the conditional independence relations: X k X k+ Y k Y k+

5 Bayesian Dynamic Models Hidden Markov Models and State-Space Models State-Space Form Here the model is described in a functional form: X k+ = a(x k, U k ), Y k = b(x k, V k ), where {U k } k 0 and {V k } k 0 are mutually independent i.i.d. sequences of random variables (also independent of X 0 ). Remark The term state-space model often refers to the case where a and b are linear functions of their arguments (and {U k }, {V k }, X 0 are jointly Gaussian). Likewise, the term HMM is sometimes used (not in this talk!) more restrictively for the case where X is a finite set.

6 Bayesian Dynamic Models Hidden Markov Models and State-Space Models HMM Examples source coding ion channel modelling ergodic tracking, vision quantitative finance finite state space continuous state space speech recognition handwritting recognition left-right trajectory models

7 Bayesian Dynamic Models Extensions Beyond HMMs For sequential Monte Carlo methods, the key point is the structure of the conditional p(x 0:k y 0:k ): the methods described in this talk directly apply in cases where the conditional may be factored as k p(x 0:k y 0:k ) = p(x 0 y 0 ) p(x l+ x l, y 0:l+ ) l=0 Example: Switching Autoregressive Model If the observation equation is replaced by Y k = b(x k )Y k + V k then p(x 0:k y 0:k ) = p(x 0 y 0 ) k l=0 p(x l+ x l, y l+, y l )

8 Bayesian Dynamic Models Extensions Hierarchical HMMs C k C k+ Z k Z k+ Y k Y k+ Bayesian network for a hierarchical HMM: The state sequence {X k } is composed of two chains {C k } and {Z k } that evolve in parallel and the observation Y k depends on both component of the chain.

9 Bayesian Dynamic Models Extensions In hierarchical HMMs, sequential Monte Carlo is applicable to the marginalized posterior p(c 0:k Y 0:k ) (despite the fact that it doesn t factorizes as previously requested due to the marginalization of Z 0:k ) when p(z k c 0:k, y 0:k ) is computable Conditionally Gaussian Linear State-Space Model Dynamic equation Observation equation Z k+ = A(C k+ )Z k + R(C k+ )U k Y k = B(C k )Z k + S(C k )V k where {C k } is a finite-valued Markov Chain.

10 The Filtering and Smoothing Recursions Bayesian Dynamic Models 2 The Filtering and Smoothing Recursions Basic Recursions Computational Filtering and Smoothing Approaches 3 Sequential Importance Sampling 4 Sequential Importance Sampling with Resampling 5 The Auxiliary Particle Filter 6 Smoothing 7 Mixture Kalman Filter

11 The Filtering and Smoothing Recursions Basic Recursions Tasks of interest for HMMs State Inference How to make probabilistic statements on the state sequence given the model and the observations? Filtering π k k (x k ) = p(x k Y 0:k ) Prediction π k+ k (x k+ ) = p(x k+ Y 0:k ) Smoothing π 0:k k (x 0:k ) = p(x 0:k Y 0:k ) (fixed-interval: π l k for l = 0,..., k; fixed-lag: π k k+ for k = 0,... ) Parameter Inference How to tune the model parameters based on the observations?

12 The Filtering and Smoothing Recursions Basic Recursions Recursive Structure of the Joint Smoothing Density By Bayes rule π 0:k+ k+ (x 0:k+ ) = (L k+ (Y 0:k+ )) t 0 (x 0 ) k l=0 k+ t(x l, x l+ ) l(x l, Y l ) ( ) Lk+ (Y 0:k+ ) = π L k (Y 0:k ) 0:k k(x 0:k) t(x k, x k+)l(x k+, Y k+), where the normalization constants L k, i.e., the likelihood of the observations, is usually not computable. l=0

13 The Filtering and Smoothing Recursions Basic Recursions The Joint Smoothing Recursion π 0:k+ k+ (x 0:k+ ) = ( ) Lk+ L k π 0:k k (x 0:k ) t(x k, x k+ )l(x k+, Y k+ ) The marginal recursion may be decomposed in two steps: Prediction π k+ k (x k+ ) = π k k (x k )t(x k, x k+ )dx k Filtering π k+ k+ (x k+ ) = ( Lk+ L k ) π k+ k(x k+)l(x k+, Y k+)

14 The Filtering and Smoothing Recursions Computational Filtering and Smoothing Approaches Exact Implementation of the Filtering and Smoothing Recursions When X is finite (Baum et al., 970) The associated computational cost is X 2 per time index (for the filtering part). In linear Gaussian state-space models (Kalman & Bucy, 96) The filtering and prediction recursion is implemented by the Kalman filter (L k+ /L k is interpreted as the likelihood of the (k + )-th innovation). Such finite dimensional filters exist only in very specific models (see, e.g., Runggaldier & Spizzichino, 200)

15 The Filtering and Smoothing Recursions Computational Filtering and Smoothing Approaches Approximate Implementations of the Filtering and Smoothing Recursions EKF (Extended Kalman Filter) Linearization-based approach (for non-linear Gaussian state space models) UKF (Unscented Kalman Filter, Julier & Uhlmann, 997) Point-based approach Variational Methods (e.g., Valpola & Karhunen, 2002) Based on parametric density approximation arguments. Exact Suboptimal Filters In particular, Kalman filter viewed as minimum mean square error linear filtering.

16 The Filtering and Smoothing Recursions Computational Filtering and Smoothing Approaches Sequential Monte Carlo (SMC) Sequential Monte Carlo (sometimes called particle filtering) is a method which uses pseudo-random simulations to produce successive populations of weighted particles Xk :n and associated weights Wk :n such that n Wk i f(xi k ) i= for all functions f of interest. f(x)π k k (x)dx, The SMC process is sequential in the sense that given Xk :n, Wk :n and the observations Y 0:k+, Xk+ :n and W :n k+ are conditionally independent of previous populations of particles. SMC is based on importance sampling and resampling.

17 Sequential Importance Sampling Bayesian Dynamic Models 2 The Filtering and Smoothing Recursions 3 Sequential Importance Sampling Self-Normalized Importance Sampling Sequential Importance Sampling (SIS) Weight Degeneracy 4 Sequential Importance Sampling with Resampling 5 The Auxiliary Particle Filter 6 Smoothing 7 Mixture Kalman Filter

18 Sequential Importance Sampling Self-Normalized Importance Sampling Self-Normalized Importance Sampling, or IS (Hammersley & Handscomb, 964) IS is a weighted form of Monte Carlo approximation, in which expectations under a target pdf π π(f) = E π [f(x)] are estimated as where X i iid q, ω i = π q (Xi ). ˆπ n q (f) = n ω i n f(x i ), j= }{{ ωj } W i i= This form of IS (sometimes also called Bayesian IS) does not necessitate that π be properly normalized.

19 Sequential Importance Sampling Self-Normalized Importance Sampling Performance of IS Assuming that E π [ π q (X)( + f 2 (X))] <, ˆπ q n (f) is consistent and asymptotically normal, with asymptotic variance given by [ ] π υ q (f) = E π q (X) (f(x) π(f))2. The asymptotic variance can be estimated from the IS sample by ˆυ n q (f) = n n (W i ) 2 {f(x i ) ˆπ q n (f)} 2, i= where W i = ω i / n j= ωj are the normalized weights.

20 Sequential Importance Sampling Self-Normalized Importance Sampling Elements of proof Consistency If π = π/c, where c = π(x)dx, is a pdf, E q [ω i f(x i )] = E q [ π q (X)f(X) ] = c E π [f(x)]. CLT n(ˆπ n q (f) π(f)) = n n i= ωi (f(x i ) π(f)) n n j= ωj and Var[ω i (f(x i ) π(f))] = c 2 E π [ π q (X)(f(X) π(f))2 ]. Empirical estimate ˆυ n q (f) = n i= W i n π q (Xi ) n j= π q (Xj ) {f(xi ) ˆπ q n (f)} 2

21 Sequential Importance Sampling Self-Normalized Importance Sampling Empirical diagnostic tools for IS Effective sample size ESS n q = [ n i= ( W i ) ] 2 ESS n q n 2 n/ ESS n q is an estimate of E π [ π q (X)] which is the maximal IS asymptotic variance for functions f such that f(x) π(f) (note: these functions have maximal Monte Carlo variance of under π ). 3 n/ ESS n q is an estimator of the χ 2 divergence (π(x) q(x)) 2 dx q(x) (n/ ESS n q is also the square of the empirical coefficient of variation associated with the normalized weights).

22 Sequential Importance Sampling Self-Normalized Importance Sampling Another Empirical diagnostic tools for IS (Shannon) Entropy of the Normalized Weights N ENT n q = W i log(w i ) i= exp(ent n q ) n 2 exp(ent n q )/n is an estimate of exp[ K(π q)], where K(π q) is the Kullback-Leibler divergence between π and q.

23 Sequential Importance Sampling Self-Normalized Importance Sampling Optimal IS Instrumental Density When considering large classes of functions f, IS generally appears to perform worse than Monte Carlo under π (e.g., for functions such that f(x) π(f), the maximal Monte Carlo asymptotic variance is whereas, the maximal IS asymptotic variance is E π [ π q (X)] which is strictly larger than by Jensen s inequality, unless q = π). However, for a fixed function f, the optimal IS density is not π but f(x) π(f) π(x) f(x ) π(f) π(x )dx as Cauchy-Schwarz inequality implies that E π [ π q (X) (f(x) π(f))2 ] E π [ q π (X) ] } {{ } = E 2 π [ f(x) π(f) ].

24 Sequential Importance Sampling Self-Normalized Importance Sampling Example (Bayesian Posterior) Consider a simple model in which X 0 N(, 2 2 ), Y 0 X 0 N(X0, 2 σ 2 ). And apply IS to compute expectations under the posterior for Y 0 = 2, using the prior as instrumental pdf.

25 Sequential Importance Sampling Self-Normalized Importance Sampling Instrumental Target.4 Instrumental Target σ = 2 ESS n q /n 0.66 υ q (x x).69 υ π (x x).25 σ = 0.5 ESS n q /n 0.25 υ q (x x) 5.35 υ π (x x).25

26 Sequential Importance Sampling Sequential Importance Sampling (SIS) Back to the Filtering and Smoothing Problem How to estimate expectations under the posterior π 0:k k (x 0:k ) = p(x 0:k Y 0:k ) in the model k p(x 0:k ) = t 0 (x 0 ) t(x l, x l+ ), p(y 0:k x 0:k ) = using a sequential algorithm? l=0 k l(x l, y l ), l=0

27 Sequential Importance Sampling Sequential Importance Sampling (SIS) Sequential Smoothing through IS, or SIS (Handschin & Mayne, ) Propose n independent particle trajectories {X i 0:k+ } i n under a Markovian scheme such that p(x 0:k+ ) = ρ 0:k+ (x 0:k+ ) = q 0 (x 0 ) 2 Compute importance weights sequentially: k q l (x l, x l+ ). l= ω i k+ = π 0:k+ k+(x i 0:k+ ) ρ 0:k+ (X i 0:k+ ) = ω i k t(xi k, Xi k+ )l(xi k+, Y k+) q k (X i k, Xi k+ ). Then, n i= ω i k+ n j= ωj k+ f(x i 0:k+ ) is an estimate of E [f(x 0:k+ ) Y 0:k+ ].

28 Sequential Importance Sampling Sequential Importance Sampling (SIS) FILT. INSTR. FILT. + One step of the SIS algorithm with just seven particles.

29 Sequential Importance Sampling Sequential Importance Sampling (SIS) Choice of the Instrumental Kernel The so-called optimal choice of q k, consists in setting q k (x, x ) = q opt k (x, x ) = t(x, x )l(x, Y k+ ) t(x, x )l(x, Y k+ )dx for which the normalized weights are predictable, i.e. Wk+ i depend on Xk i but not Xi k+ (hence, for this instrumental kernel the conditional variance cancels). This is however usually not feasible and common choices include the prior q k = t (and then ωk i l(xi k, Y k)) 2 approximations (sometimes heuristic) to q opt k matching, use of EKF or UKF,... ), (moment 3 tuning parameters of q k so as to maximize the effective sample size or entropy criterions

30 Sequential Importance Sampling Weight Degeneracy Weight Degeneracy Empirically, the SIS approach always fail when the time-horizon k is more than a few tens; the IS weights ωk :n usually become very unbalanced with a few weights dominating all the other To understand why it is the case, consider the (silly) model where { t(x, x ) = t(x ) = t 0 (x ), (Independent states) l(x, y) = l(y), (Non-informative observations) and the instrumental kernel is such that q l (x, x ) = q 0 (x ) = q(x )

31 Sequential Importance Sampling Weight Degeneracy Weight Degeneracy (Contd.) For a function of interest f that only depends on the last coordinate x k of the trajectory x 0:k, the asymptotic variance of the SIS approximation to π k k (f) = E πk k [f(x)] is given by υ k (f) = ( k = l=0 ) 2 t (f(xk q (x l) ) π ) 2 k k k(f) l=0 q(x l ) dx 0... dx k ( t ) k t q (x)t(x)dx q (x) ( f(x) π k k (f) ) 2 t(x)dx. }{{} > In practise, this situation can usually be detected by monitoring the effective sample size or entropy criterions, which become abnormally small.

32 Sequential Importance Sampling Weight Degeneracy Application to the Stochastic Volatility Model This is a non-linearly observed state-space model used to represent log-returns in quantitative finance: where X k+ = φx k + σu k φ <, Y k = β exp(x k /2)V k, {U k } k 0 and {V k } k 0 are independent standard Gaussian white noise processes. X 0 N(0, σ 2 /( φ 2 ). We consider trajectories of length 50 simulated under the model with φ = 0.98, σ = 0.7 and β = 0.64 and apply SIS with q k = q and n =, 000 particles.

33 Sequential Importance Sampling Weight Degeneracy Typical Evolution of ESS for a Single Trajectory 000 ESS (500 indep. replications of the particles) time index

34 Sequential Importance Sampling Weight Degeneracy Evolution of ESS When Averaging Over Trajectories ESS (500 indep. simulations) time index

35 Sequential Importance Sampling Weight Degeneracy Typical Evolution of the Weight Distribution Importance Weights (base 0 logarithm) Histograms of the base 0 logarithm of the normalized importance weights after (from top to bottom), 0 and 00 iterations for the stochastic volatility model (improved instrumental kernel based on moment matching).

36 Sequential Importance Sampling Weight Degeneracy Filtering Results 0.08 Density Time Index State Waterfall representation of the sequence of estimated filtering pdfs (weighted kernel smooth) together with the actual state for,000 particles (improved instrumental kernel based on moment matching).

37 Sequential Importance Sampling with Resampling Bayesian Dynamic Models 2 The Filtering and Smoothing Recursions 3 Sequential Importance Sampling 4 Sequential Importance Sampling with Resampling Sampling Importance Resampling Sequential Importance Sampling with Resampling (SISR) Case Study 5 The Auxiliary Particle Filter 6 Smoothing 7 Mixture Kalman Filter

38 Sequential Importance Sampling with Resampling Sampling Importance Resampling In IS, it is indeed possible to reset the weights to a constant value at the price of a, usually moderate, increase in variance. Sampling Importance Resampling (Rubin, 987) Replace {X :n, W :n } by { X :Ñ, W :Ñ} such that the discrepancy between the resampled weights { W :Ñ} is reduced and Ñ W i= k iδ is a good approximation to n Xi i= W i δ X i. In general the resampling is random and subject to the constraints Ñ = n, W [ i = /Ñ, Ñ E i= { X ] i = X j } X :n, W :n = ÑW j ( j n). The last condition is often referred to as unbiasedness or proper weighting.

39 Sequential Importance Sampling with Resampling Sampling Importance Resampling Multinomial Resampling Draw, conditionally independently given {X :n, W :n }, n discrete random variables (J,..., J n ) taking their values in the set {,..., n} with probabilities (W,..., W n ). 2 Set, for i =,..., n, Xi = X J i and W i = /n. Let C i = n j= { X j = X i } (i =,..., n) denote the number of times each particle is duplicated in the resampling process. The counts (C,..., C n ) follow a multinomial distribution with parameters n, (W,..., W n ), conditionally to {X :n, W :n }. INSTRUMENTAL Resampled particles TARGET

40 Sequential Importance Sampling with Resampling Sampling Importance Resampling Some Results on SIR D Xi π are n (some extensions of this result) n 2 n i= f( X i ) is an asymptotically normal estimator of π(f) (assuming E π [ π q (X)( + f 2 (X)) + f 2 (X)] < ) with asymptotic variance given by [ ] π [ υ q (f) = E π q (X) (f(x) π(f))2 + E π (f(x) π(f)) 2] }{{}}{{} υ Var q(f) π[f(x)] If n is sufficiently large, the cost of resampling is very moderate in situation that are challenging for IS, i.e., when υ q (f) Var π [f(x)].

41 Sequential Importance Sampling with Resampling Sampling Importance Resampling Elements of Proof For a bounded function f, [ E f( X ] [ ( i ) = E E f( X i ) X :n, W :n)] n = E W j f(x j ) j= and n j= W j f(x j ) a.s. π(f) ( n j= W j f(x j ) f ). For the CLT, an ingredient of the proof is that [ ] [ n n ] n Var f( n X i ) = n Var W i f(x i ) i= i= }{{} υ q(f) [ n ( n + E W i f 2 (X i ) W i f(x i ) i= i= ) 2 } {{ } a.s. Var π[f(x)] ]

42 Sequential Importance Sampling with Resampling Sampling Importance Resampling There Exists Resampling Schemes with Reduced Variance Residual Resampling For i =,..., n set C i = nw i + C i, where C,..., C n are distributed, conditionally to W :n, according to the multinomial distribution Mult(n R, W,..., W n ) with R = n i= nw i and W i = nw i nw i, i =,..., n. n R This scheme is obviously unbiased and induces an additional variance which is always lower than Var π [f] (the same is true for the bootstrap filter based on residual resampling, Douc & Moulines, 2005).

43 Sequential Importance Sampling with Resampling Sequential Importance Sampling with Resampling (SISR) The Simplest Functional Algorithm (Gordon et al., 993) Regular resampling is added to avoid weight degeneracy and to guarantee the long-term (k ) stability of the particle filter. The Bootstrap filter X :n k Given, propose new positions Xi k+ independently under the prior dynamic t( X k i, ), for i =,..., n; 2 Compute the weights ωk+ i = l(xi k+, Y k+), for i =,..., n and normalize them (Wk+ i = ωi k+ / n 3 Resample to obtain X :n k+ indices J i k+ such that P ( J i k+ = j W :n k+ setting X i k+ = XJ i k+ k+ j= ωj k+ );, e.g., by drawing independent ) = W j k+ and (Multinomial Resampling). To avoid unnecessary resampling, 3 can be used only when the ESS statistics of the weights Wk+ :n falls below a threshold.

44 Sequential Importance Sampling with Resampling Sequential Importance Sampling with Resampling (SISR) FILT. FILT. FILT. + FILT. + FILT. +2 FILT. +2 SIS (left) and SISR (right).

45 Sequential Importance Sampling with Resampling Case Study AR() Model Observed in Pulsated Noise We consider a simple Gaussian linear state-space model to allow for comparison with analytical computations: X k+ = φx k + σu k, Y k = X k + η k V k, where φ = 0.99, σ = 0.2, and η k is varied (periodically) between 0. and 3. The observation sequence is simulated from the model but we also consider the influence of an out-of-model outlying observation at time 460.

46 Sequential Importance Sampling with Resampling Case Study 0 5 state obs. filt. smooth time 6 state obs. filt. smooth. 0 8 state obs. filt. smooth time time

47 Sequential Importance Sampling with Resampling Case Study Sequential Importance Sampling (prior kernel, n = 00) 00 ESS filt. mean filt. std time

48 Sequential Importance Sampling with Resampling Case Study Bootstrap Filter (prior kernel, n = 00, resamp. ESS < 0) ESS filt. mean filt. std time

49 Sequential Importance Sampling with Resampling Case Study Bootstrap Filter (prior kernel, n = 00, resamp. always) filt. mean filt. std time

50 Sequential Importance Sampling with Resampling Case Study Bootstrap Filter (prior kernel, n = 00, resamp. ESS < 0) ESS filt. mean filt. std time

51 Sequential Importance Sampling with Resampling Case Study Bootstrap Filter (n = 5, 000, resamp. ESS < 500) ESS filt. mean filt. std time

52 Sequential Importance Sampling with Resampling Case Study Conclusions With resampling, SISR achieves long-term stability. The increase in variance due to resampling is very moderate, especially when resampling is applied only when needed. The method is still sensitive to outliers, model misspecification, etc., which may necessitate the use of more elaborate instrumental kernels.

53 The Auxiliary Particle Filter Bayesian Dynamic Models 2 The Filtering and Smoothing Recursions 3 Sequential Importance Sampling 4 Sequential Importance Sampling with Resampling 5 The Auxiliary Particle Filter A New Interpretation of SISR The Auxiliary Trick 6 Smoothing 7 Mixture Kalman Filter

54 The Auxiliary Particle Filter A New Interpretation of SISR Alternatives to SISR The resampling step in the SISR algorithm can be seen as a method to sample approximately under the distribution obtained when plugging the current particle approximation into the filtering update. This alternative way of thinking about resampling suggests several sequential Monte Carlo variants.

55 The Auxiliary Particle Filter A New Interpretation of SISR The Filtering Recursion Revisited Recall that (with more general notations) π k+ k+ (f) = f(x )π k k (dx)t(x, dx )l(x, Y k+ ) πk k (dx)t(x, dx )l(x, Y k+ ). Now, consider what happens when considering the empirical filtering distribution ˆπ n k k = n Wk i δ Xk i i= as an approximation to π k k and plugging it into the previous relation.

56 The Auxiliary Particle Filter A New Interpretation of SISR Filtering Target One obtains a mixture target pdf π n k+ k+ (x) = = n i= W k it(xi k, x)l(x, Y k+) W i k t(xk i, x)l(x, Y k+)dx n i= n i= W i k γ k(x i k ) n j= W j k γ k(x j k ) qopt k (Xi k, x), where γ k (x) = t(x, dx )l(x, Y k+ ), q opt k (x, x ) = t(x, x )l(x, Y k+ ). γ k (x)

57 The Auxiliary Particle Filter A New Interpretation of SISR Filtering Step 0.2 Approximation Prediction likelihood 0.3 Prediction likelihood 0.2 predictive distribution 0.2 predictive distribution Filtering Filtering

58 The Auxiliary Particle Filter A New Interpretation of SISR The previous reinterpretation of SISR shows that resampling and trajectory update can be integrated into a single global IS step that targets π n k+ k+. But, how to chose the global instrumental kernel q glob k ((Xk :n, W k :n ), x)? as πk+ k+ n is an n-mixture density, evaluation of the IS weights may necessitate of the order of n 2 computations. Remark The global instrumental kernel that corresponds to the bootstrap filter (with resampling) is q glob k ((X :n k, W :n k ), x) = n Wk i t(xi k, x). i=

59 The Auxiliary Particle Filter The Auxiliary Trick The Auxiliary Trick Assume that the global instrumental kernel is a mixture of the form q glob k ((X :n k, W :n k ), x) = n τk i q k(xk i, x). To avoid the n 2 update, one can use data augmentation, introducing the mixture component as an auxiliary variable: i= q aux k ((X :n k, W :n k ), (i, x)) = τ i k q k(x i k, x) defines a pdf on {,..., n} X, whose marginal is q k ((Xk :n, W k :n ), x). Likewise, π n,aux k+ k+ (i, x) = c W i k t(xi k, x)l(x, Y k+), where c = n i= W i k t(x i k, x)l(x, Y k+)dx, is the auxiliary target with marginal π n k+ k+ (x).

60 The Auxiliary Particle Filter The Auxiliary Trick Auxiliary Sampling Algorithm (Pitt & Shephard, 999) Auxiliary Particle Filter Repeat n times independently, for i =,..., n, Sample an index J i in {,..., n} with probabilities (τ k,..., τ n k ). 2 Sample a position X i k+ from q k(x J i k, x). 3 Compute the (unnormalized) auxiliary IS weight ωk+ i = W J i k t(xj i k, Xi k+ )l(xi k+, Y k+) τ J i k q k(x J i k, Xi k+ ).

61 The Auxiliary Particle Filter The Auxiliary Trick The Auxiliary View Provides New Degrees of Freedom In the original auxiliary particle filter, Pitt & Shephard used qk i = t and proposed an heuristic rule for setting the weights τk i based on Xk i and Y k+. The auxiliary IS weight may also be rewritten in normalized form as ωk+ i = W J i k γ k(x J i k )qopt k (XJ i k, Xi k+ ) τ J i k t(xj i k, Xi k+ ). showing that the optimal choice of τk i, in the sense of minimizing the conditional (to Xk :n, W k :n ) variance of n i= W k+ i f(xi k+ ) for a function f, is an instance of the Neyman allocation problem. And thus the optimal choice is τ i k W i k γ k(x i k ) υ i (f) where υ i (f) is the asymptotic variance of IS for f, target q opt k (Xi k, x), and, instrumental pdf t(xi k, x) (Olsson et al., 2007).

62 Smoothing Bayesian Dynamic Models 2 The Filtering and Smoothing Recursions 3 Sequential Importance Sampling 4 Sequential Importance Sampling with Resampling 5 The Auxiliary Particle Filter 6 Smoothing Using the Ancestry Tree Backward Reweighting Smoothing for Sum Functionals 7 Mixture Kalman Filter

63 Smoothing Using the Ancestry Tree state time index state time index Predictive densities and evolution of the particle ancestry tree

64 Smoothing Using the Ancestry Tree state time index state time index Predictive densities and evolution of the particle ancestry tree

65 Smoothing Using the Ancestry Tree state time index state time index Predictive densities and evolution of the particle ancestry tree

66 Smoothing Using the Ancestry Tree state time index state time index Predictive densities and evolution of the particle ancestry tree

67 Smoothing Using the Ancestry Tree state time index state time index Predictive densities and evolution of the particle ancestry tree

68 Smoothing Using the Ancestry Tree state time index state time index Predictive densities and evolution of the particle ancestry tree

69 Smoothing Using the Ancestry Tree state time index state time index Predictive densities and evolution of the particle ancestry tree

70 Smoothing Using the Ancestry Tree state time index state time index Predictive densities and evolution of the particle ancestry tree

71 Smoothing Using the Ancestry Tree state time index state time index Predictive densities and evolution of the particle ancestry tree

72 Smoothing Using the Ancestry Tree state time index state time index Predictive densities and evolution of the particle ancestry tree

73 Smoothing Using the Ancestry Tree state time index state time index Predictive densities and evolution of the particle ancestry tree

74 Smoothing Using the Ancestry Tree state time index state time index Predictive densities and evolution of the particle ancestry tree

75 Smoothing Using the Ancestry Tree state time index state time index Predictive densities and evolution of the particle ancestry tree

76 Smoothing Using the Ancestry Tree state time index state time index Predictive densities and evolution of the particle ancestry tree

77 Smoothing Using the Ancestry Tree state time index state time index Predictive densities and evolution of the particle ancestry tree

78 Smoothing Using the Ancestry Tree state time index state time index Predictive densities and evolution of the particle ancestry tree

79 Smoothing Using the Ancestry Tree state time index state time index Predictive densities and evolution of the particle ancestry tree

80 Smoothing Using the Ancestry Tree state time index state time index Predictive densities and evolution of the particle ancestry tree

81 Smoothing Using the Ancestry Tree state time index state time index Predictive densities and evolution of the particle ancestry tree

82 Smoothing Using the Ancestry Tree state time index state time index Predictive densities and evolution of the particle ancestry tree

83 Smoothing Using the Ancestry Tree state time index state time index Predictive densities and evolution of the particle ancestry tree

84 Smoothing Using the Ancestry Tree state time index state time index Predictive densities and evolution of the particle ancestry tree

85 Smoothing Using the Ancestry Tree state time index state time index Predictive densities and evolution of the particle ancestry tree

86 Smoothing Using the Ancestry Tree state time index state time index Predictive densities and evolution of the particle ancestry tree

87 Smoothing Backward Reweighting Backward Smoothing Recursion There are several options for computing the marginal smoothing pdfs π l k such that π l k (x) = p(x l Y 0:k ) for l = 0,..., k. The backward smoothing recursion is based on the observation that the conditional time-reversed state sequence has a non-homogeneous Markovian structure (for conditionally Gaussian linear state-space models, this is known as RTS, or Rauch-Tung-Striebel, 965, smoothing).

88 Smoothing Backward Reweighting Backward Smoothing Recursion Define the backward smoothing kernels: b l (x l+, x l ) = π l l(x l )t(x l, x l+ ) πl l (x)t(x, x l+ )dx, for l = k, k 2,..., 0. Then π l k (x l ) = π l+ k (x l+ )b l (x l+, x l )dx l+. As p(x l, x l+ Y 0:k ) = p(x l x l+, Y 0:k ) p(x }{{} l+ Y 0:k ) }{{} p(x l x l+, Y 0:l ) π l+ k (x l+ ) }{{} b l (x l+,x l )

89 Smoothing Backward Reweighting Particle Reweighting Scheme The natural approximation to the backward smoothing recursion consists in computing smoothing weights Wl k i backwards according to the recursion Wk k i = W k i, for i =,..., n. For l = k, k 2,..., 0, W i l k = for i =,..., n. n W j l+ k j= W i l t(xi l, Xj l+ ) n i = W i l t(x i l, Xj l+ ), The approximation to E[f(X l ) Y 0:k ] is given by n i= W i l k f(xi l ).

90 Smoothing Backward Reweighting Back to Our Case Study AR() Model Observed in Pulsated Noise 0 8 state obs. filt. smooth time

91 Smoothing Backward Reweighting Using the Ancestry Tree (n = 00) 4 smooth. mean smooth. std time

92 Smoothing Backward Reweighting With Backward Reweighting (n = 00) 4 smooth. mean smooth. std time

93 Smoothing Backward Reweighting Backward Particle Reweighting Appears to be efficient and stable in the long term (although this hasn t been proved yet). Yet, it is not sequential (in particular, one needs to store all particle positions and weights); it has a potential numerical complexity proportional to the number n of particles squared (but not further likelihood evaluation is needed).

94 Smoothing Smoothing for Sum Functionals Sum Functionals and Parameter Estimation For parameter estimation, one requires smoothing of particular sum functionals of the hidden states. In the example of the stochastic volatility model, one needs to evaluate E [s i (X 0:n ) Y 0:n ], for 0 i 4 with n s 0 (x 0:n ) = x 2 0, s (x 0:n ) = x 2 k, s 2(x 0:n ) = s 3 (x 0:n ) = k=0 n x k x k, s 4 (x 0:n ) = k= n k=0 n x 2 k, k= Y 2 k exp( x k). We consider the simple case of s 0 (X 0 ) using N i= W n i { X i 0:n (0) } 2 as the sequential Monte Carlo estimate of E (s 0 (X 0 ) Y 0:n ) (for sum functionals, the same applies for each term in the sum).

95 Smoothing Smoothing for Sum Functionals Smoothing for s 0 (X 0 ) 0 2 Particles 0 3 Particles 0 4 Particles time time time Box and whisker plots of particle estimates of R x 2 π 0 k (dx) for k =, 5, 20, 30 and 500, and particle population sizes n = 0 2, 0 3 and 0 4.

96 Smoothing Smoothing for Sum Functionals Smoothing for s 0 (X 0 ), Contd. 2.5 State Value Time Index Particle trajectories at time n = 70 for the stochastic volatility model with n = 00 particles and systematic resampling. Using k < n is beneficial! Properly setting k corresponds to a bias-variance tradeoff: k bias decreases, k variance decreases. Hence, the general principle for sequential smoothing of sum functionals: use fixed-lag smoothing (Olsson et al., 2008).

97 Mixture Kalman Filter Bayesian Dynamic Models 2 The Filtering and Smoothing Recursions 3 Sequential Importance Sampling 4 Sequential Importance Sampling with Resampling 5 The Auxiliary Particle Filter 6 Smoothing 7 Mixture Kalman Filter The Mixture Kalman Filter (MKF) Algorithm An Application to Change Point Detection

98 Mixture Kalman Filter The Mixture Kalman Filter (MKF) Algorithm Filtering in Conditionally Gaussian Linear State-Space Models There are several cases of interest where the particles are more complicated than just elements of the state-space X. Conditionally Gaussian Linear State-Space Model Dynamic equation Observation equation Z k = A(C k )Z k + R(C k )U k Y k = B(C k )Z k + S(C k )V k where {C k } is itself a finite-valued Markov Chain. Many applications: Non-Gaussian noises modelled as mixtures, switching models, applications in digital communications, etc.

99 Mixture Kalman Filter The Mixture Kalman Filter (MKF) Algorithm The Exact Filtering Recursions Given C 0:k, p(z k Y 0:k, C 0:k ) is a Gaussian distribution with mean vector Ẑk k(y 0:k, C 0:k ) and covariance matrix Σ k k (Y 0:k, C 0:k ), which can be determined recursively using the Kalman filter. Hence p(x k Y 0:k ) is the mixture of C k+ Gaussian densities p(x k Y 0:k ) L k (Y 0:k c 0:k )p(c 0:k ) c 0:k C ( k+ ) N x k ; Ẑk k(y 0:k, c 0:k ), Σ k k (Y 0:k, c 0:k ). This is obviously of no use when k is not very small.

100 Mixture Kalman Filter The Mixture Kalman Filter (MKF) Algorithm Mixture Kalman Filtering Using (Z k, C k ) as state variable, one an use the SMC approaches described so far with (Z C)-valued particles. One can also use a related algorithm, where each particle represents a trajectory C 0:k summarized by the Kalman statistics Ẑk k(y 0:k, C 0:k ) and Σ k k (Y 0:k, C 0:k ). The resulting algorithm mixes systematic exploration of the trajectory continuations (in C), Kalman update to compute the likelihood factor and resampling.

101 Mixture Kalman Filter The Mixture Kalman Filter (MKF) Algorithm MKF: computation of the weights For i =,..., n and j =,..., r, compute Ẑ k+ k (C i 0:k, j) = A(j)Ẑk k(c i 0:k ), Σ k+ k (C i 0:k, j) = A(j)Σ k k(c i 0:k )At (j) + R(j)R t (j), Ŷ k+ k (C i 0:k, j) = B(j)Ẑk+ k(c i 0:k, j), Γ k+ (C i 0:k, j) = B(j)Σ k+ k(c i 0:k, j)bt (j) + S(j)S t (j), ω i,j k+ = W i k N(Y k+ ; Ŷk+ k(c i 0:k, j), Γ k+(c i 0:k, j)) Q C(C i k, j).

102 Mixture Kalman Filter The Mixture Kalman Filter (MKF) Algorithm MKF: Importance Sampling Step For i =,..., n, draw Jk+ i with probabilities proportional to ω i, k+,..., ωi,r k+, conditionally independently of the particle history, and set C0:k+ i = (Ci 0:k, J k+ i / ), r n Wk+ i = j= ω i,j k+ r i= j= ω i,j k+, K k+ (C i 0:k+ ) = Σ k+ k(c i 0:k, J i k+ )Bt (J i k+ )Γ k+ (Ci 0:k+, J i k+ ), Ẑ k+ k+ (C i 0:k+ ) = Ẑk+ k(c i 0:k, J i k+ ) + K k+ (C i 0:k+ ){Y k+ Ŷk+ k(c i 0:k, J i k+ )}, Σ k+ k+ (C i 0:k+ ) = {I K k+(c i 0:k+ )B(J k+)}σ k+ k (C i 0:k, J i k+ ).

103 Mixture Kalman Filter An Application to Change Point Detection Application to a Change Point Detection Task.5 x 05 x Nuclear response Nuclear response Time Time Left: well-log data waveform with a median smoothing estimate of the state. Right: median smoothing residual.

104 Mixture Kalman Filter An Application to Change Point Detection Change Point Modelling To model this situation we put C = {0, }, where C k = 0 means that there is no change point at time index k whereas C k = means that a change point has occurred. The state space model is Z k+ = A(C k+ )Z k + R(C k+ )U k, Y k = Z k + V k, where A(0) = I, R(0) = 0 and A() = 0 and R() = R. The simplest model consists in taking for {C k } k 0 an i.i.d. sequence of Bernoulli random variables with probability of success p. The time between two change points (period of time during which the state variable is constant) is then distributed as a geometric random variable with mean /p.

105 Mixture Kalman Filter An Application to Change Point Detection Outliers Modelling 6 x 04 4 Quantiles of Input Sample Standard Normal Quantiles Quantile-quantile regression of empirical quantiles of the well-log data residuals with respect to quantiles of the standard normal distribution.

106 Mixture Kalman Filter An Application to Change Point Detection Outliers Modelling, Contd. The normal distribution does not fit the measurement noise well in the tails. We model the measurement noise as a mixture of two Gaussian distributions: Z k+ = A(C k+, )Z k + R(C k+, )U k, U k N(0, ), Y k = µ(c k,2 ) + B(C k,2 )Z k + S(C k,2 )V k, V k N(0, ), where C k, {0, } and C k,2 {0, } are indicators of the presence of a change point and of an outlier, respectively. {C k,2 } is modelled as a two-state Markov chain which represents the clustering behavior of outliers.

107 Mixture Kalman Filter An Application to Change Point Detection 5 x 04 Data Posterior Probability of Jump Posterior Probability of Outlier Time On-line analysis of the well-log data, using 00 particles with detection delay d = 0. Top: data; middle: posterior probability of a jump; bottom: posterior probability of an outlier.

108 Mixture Kalman Filter An Application to Change Point Detection 5 x 04 Data Posterior Probability of Jump Posterior Probability of Outlier Time On-line analysis of the well-log data, using 00 particles with detection delay d = 5 (same display as above).

109 Conclusions There are many more important issues in SMC such as Analysis of Performance Convergence of more elaborate algorithms, less restrictive conditions for long-term stability, analysis of smoothing algorithms Resampling Variants Conditional variance reduction schemes, triggered resampling, varying number of particles... Choice of the Instrumental Moves Lookahead moves, combination with deterministic approximation, adaptive moves

110 Conclusions Some General References on SMC Doucet, A., De Freitas, N. and Gordon, N. (eds.) (200) Sequential Monte Carlo Methods in Practice. Springer. Ristic, B., Arulampalam, M. and Gordon, A. (2004) Beyond Kalman Filters: Particle Filters for Target Tracking. Artech House. Cappé, 0., Moulines, E. and Rydén, T. (2005) Inference in Hidden Markov Models. Springer. Doucet, A., Godsill, S. and Andrieu, C. (2000) On sequential Monte-Carlo sampling methods for Bayesian filtering. Stat. Comput., 0, Arulampalam, M., Maskell, S., Gordon, N. and Clapp, T. (2002) A tutorial on particle filters for on line non-linear/non-gaussian Bayesian tracking. IEEE Trans. Signal Process., 50, Cappé, O., Godsill, S. J. and Moulines, E. (2007) An overview of existing methods and recent advances in sequential Monte Carlo, IEEE Proc., 95,

cappe/

cappe/ Particle Methods for Hidden Markov Models - EPFL, 7 Dec 2004 Particle Methods for Hidden Markov Models Olivier Cappé CNRS Lab. Trait. Commun. Inform. & ENST département Trait. Signal Image 46 rue Barrault,