
1. Particle Methods for Hidden Markov Models - EPFL, 7 Dec. 2004

Olivier Cappé, CNRS Lab. Trait. Commun. Inform. & ENST département Trait. Signal Image, 46 rue Barrault, Paris cedex 13, France. cappe@tsi.enst.fr

These lectures are based on the book Inference in Hidden Markov Models, written with E. Moulines and T. Rydén (Springer-Verlag, to appear in 2005).

2. Roadmap

1. What is a Hidden Markov Model?
2. Filtering and Smoothing Recursions
3. Monte Carlo, Importance Sampling and Sampling Importance Resampling
4. Sequential Importance Sampling
5. Sequential Importance Sampling with Resampling
6. More Sequential Monte Carlo Algorithms
7. Approximation of Sum Functionals and Parameter Estimation

3. What is a Hidden Markov Model?

A hidden Markov model (abbreviated HMM) is a bivariate discrete-time process $\{X_k, Y_k\}_{k\ge 0}$, where $\{X_k\}_{k\ge 0}$ is a homogeneous Markov chain and, conditional on $\{X_k\}_{k\ge 0}$, $\{Y_k\}_{k\ge 0}$ is a sequence of independent random variables such that the conditional distribution of $Y_k$ only depends on $X_k$.

The underlying Markov chain $\{X_k\}_{k\ge 0}$ is called the regime, or state. We denote the state space of the Markov chain $\{X_k\}_{k\ge 0}$ by $\mathsf{X}$ and the set in which $\{Y_k\}_{k\ge 0}$ takes its values by $\mathsf{Y}$.
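The generative definition above can be sketched in code. A minimal sketch, assuming for illustration a two-state chain with Gaussian observations (the function name and parameters are hypothetical, not from the slides):

```python
import numpy as np

def simulate_hmm(n, P, means, sd, nu, rng):
    """Simulate n steps of a finite-state HMM with Gaussian observations:
    {X_k} is a Markov chain with transition matrix P and initial law nu,
    and, given the states, Y_k ~ N(means[X_k], sd^2) independently."""
    x = np.empty(n, dtype=int)
    x[0] = rng.choice(len(nu), p=nu)
    for k in range(1, n):
        x[k] = rng.choice(P.shape[0], p=P[x[k - 1]])  # X_k depends on X_{k-1} only
    y = rng.normal(means[x], sd)                      # Y_k depends on X_k only
    return x, y
```

Given the states, the observations are drawn independently, which is exactly the conditional-independence structure of the definition.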

4. What is a Hidden Markov Model?

The dependence structure of an HMM can be represented by a graphical model.

[Figure: graphical representation of the dependence structure of a hidden Markov model, with arrows $X_k \to X_{k+1}$, $X_k \to Y_k$ and $X_{k+1} \to Y_{k+1}$, where $\{Y_k\}_{k\ge 0}$ is the observable process and $\{X_k\}_{k\ge 0}$ is the hidden chain.]

5. What is a Hidden Markov Model?

Of the two processes $\{X_k\}_{k\ge 0}$ and $\{Y_k\}_{k\ge 0}$, only $\{Y_k\}_{k\ge 0}$ is actually observed; the Markov chain $\{X_k\}_{k\ge 0}$ is unobserved, or hidden. Hence, inference on the parameters of the model must be achieved using $\{Y_k\}_{k\ge 0}$ only. The other topic of interest is of course inference on the unobserved $\{X_k\}_{k\ge 0}$: given a model and some observations, can we estimate the value of the unobservable sequence of states? These two major statistical objectives are indeed strongly connected!

6. What is a Hidden Markov Model?

The $Y$-variables are conditionally independent given $\{X_k\}_{k\ge 0}$, but $\{Y_k\}_{k\ge 0}$ is not an independent sequence, because of the dependence in $\{X_k\}_{k\ge 0}$.

$\{Y_k\}_{k\ge 0}$ is not a Markov chain either: the joint process $\{X_k, Y_k\}_{k\ge 0}$ is a Markov chain, but $\{Y_k\}_{k\ge 0}$ does not have the loss of memory property: the conditional distribution of $Y_k$ given $Y_0, \dots, Y_{k-1}$ does depend on all the conditioning variables.

7. What is a Hidden Markov Model?

There are numerous examples:
- where both $\mathsf{X}$ and $\mathsf{Y}$ are finite: coding, digital communications, bioinformatics;
- where $\mathsf{X}$ is finite but $\mathsf{Y}$ is not: speech recognition, ion channel modelling (Gaussian HMMs);
- where both $\mathsf{X}$ and $\mathsf{Y}$ are continuous: linear state-space models, non-linear state-space models (e.g. stochastic volatility model, bearings-only tracking);
- where $\mathsf{Y}$ is continuous and $\mathsf{X} = \mathsf{C} \times \mathsf{W}$ with $\mathsf{C}$ finite and $\mathsf{W}$ continuous: conditionally Gaussian linear state-space models (AKA jump Markov models);
- non-HMMs that behave similarly: switching autoregressions, Markov switching models.

Except for stability properties and the theory of the MLE, which we don't consider today...

8. Roadmap

1. What is a Hidden Markov Model?
2. Filtering and Smoothing Recursions
3. Monte Carlo, Importance Sampling and Sampling Importance Resampling
4. Sequential Importance Sampling
5. Sequential Importance Sampling with Resampling
6. More Sequential Monte Carlo Algorithms
7. Approximation of Sum Functionals and Parameter Estimation

9. Hidden Markov Model

Notations for HMMs:

1. $\{X_k\}_{k\ge 0}$ is a Markov chain on $\mathsf{X}$ with initial distribution $\nu$ and transition kernel $Q$.

2. $\{Y_k\}_{k\ge 0}$ is such that for $f_0, \dots, f_n \in \mathcal{F}_b(\mathsf{Y})$,
$$ \mathrm{E}\left[ \prod_{k=0}^{n} f_k(Y_k) \,\Big|\, X_{0:n} \right] = \prod_{k=0}^{n} \int_{\mathsf{Y}} f_k(y)\, g(X_k, y)\, \mu(dy), $$
where $X_{0:n}$ denotes the collection $X_0, \dots, X_n$ and $g$ is a transition density function (with respect to $\mu$), sometimes referred to as the conditional likelihood function.

We will also use the simplified notation $g_k(x) \stackrel{\mathrm{def}}{=} g(x, Y_k)$.

10. Some More Notations: Usual Kernel Operations

$Q(x, A) = \int_A Q(x, dx')$, so that $\mathrm{P}[X_{k+1} \in A \mid X_k] = Q(X_k, A)$.

$Q(x, f) = \int Q(x, dx') f(x')$ (also denoted $(Qf)(x)$), so that $\mathrm{E}[f(X_{k+1}) \mid X_k] = Q(X_k, f)$.

$\nu Q(f) = \iint \nu(dx)\, Q(x, dx')\, f(x')$: expectation after one step, starting under $\nu$.

$Q^{n+1}(x_0, f) = Q^n(x_0, Qf)$: expectation after $n+1$ steps, starting under $\delta_{x_0}$.

Markov transition kernels are such that $Q(x, \mathsf{X}) = 1$. Sometimes unnormalized transition kernels, such that $Q(x, A) \ge 0$ for all $A \in \mathcal{X}$ and $0 < Q(x, \mathsf{X}) < \infty$, are also used.

11. Filtering and Smoothing Recursions

To be answered: given an HMM, how to evaluate the conditional distribution of the states $X_k$, given the observations $Y_0, \dots, Y_n$?

We introduce the generic notation $\phi_{\nu,k:l|n}$ to denote the conditional distribution of $X_{k:l}$ given $Y_{0:n}$, where $\nu$ recalls the dependence with respect to the initial distribution (which will sometimes be omitted).

The joint probability of the unobservable states and observations up to index $n$ is such that, for any function $f \in \mathcal{F}_b\!\left( (\mathsf{X} \times \mathsf{Y})^{n+1} \right)$,
$$ \mathrm{E}_\nu[f(X_0, Y_0, \dots, X_n, Y_n)] = \int \cdots \int f(x_0, y_0, \dots, x_n, y_n)\, \nu(dx_0)\, g(x_0, y_0) \prod_{k=1}^{n} \{ Q(x_{k-1}, dx_k)\, g(x_k, y_k) \}\ \mu^{\otimes(n+1)}(dy_0, \dots, dy_n). $$

12. The Likelihood

Marginalizing with respect to the unobservable variables $X_0, \dots, X_n$ yields
$$ \mathrm{E}_\nu[f(Y_0, \dots, Y_n)] = \int \cdots \int f(y_0, \dots, y_n)\, \mathrm{L}_{\nu,n}(y_0, \dots, y_n)\ \mu^{\otimes(n+1)}(dy_0, \dots, dy_n) $$
for $f \in \mathcal{F}_b(\mathsf{Y}^{n+1})$, where
$$ \mathrm{L}_{\nu,n}(y_0, \dots, y_n) = \int \cdots \int \nu(dx_0)\, g(x_0, y_0)\, Q(x_0, dx_1)\, g(x_1, y_1) \cdots Q(x_{n-1}, dx_n)\, g(x_n, y_n) $$
is the likelihood of the observations.

13. Joint Smoothing Distribution

By Bayes' rule,
$$ \phi_{\nu,0:n|n}(y_{0:n}, f) = \mathrm{L}_{\nu,n}^{-1}(y_{0:n}) \int \cdots \int f(x_{0:n})\, \nu(dx_0)\, g(x_0, y_0) \prod_{k=1}^{n} Q(x_{k-1}, dx_k)\, g(x_k, y_k) $$
for all functions $f \in \mathcal{F}_b(\mathsf{X}^{n+1})$.

In the following, we always use the implicit conditioning convention, writing
$$ \phi_{\nu,0:n|n}(f) = \mathrm{L}_{\nu,n}^{-1} \int \cdots \int f(x_{0:n})\, \nu(dx_0)\, g_0(x_0) \prod_{k=1}^{n} Q(x_{k-1}, dx_k)\, g_k(x_k), $$
where
$$ \mathrm{L}_{\nu,n} = \int \cdots \int \nu(dx_0)\, g_0(x_0) \prod_{k=1}^{n} Q(x_{k-1}, dx_k)\, g_k(x_k). $$

14. Recursive Smoothing Formula

Comparing the expressions corresponding to $n$ and $n+1$ gives the following update equation for the joint smoothing distribution:
$$ \phi_{\nu,0:n+1|n+1}(f_{n+1}) = \left( \frac{\mathrm{L}_{n+1}}{\mathrm{L}_n} \right)^{-1} \int \cdots \int f_{n+1}(x_{0:n+1})\ \phi_{\nu,0:n|n}(dx_0, \dots, dx_{n-1}, dx_n)\, Q(x_n, dx_{n+1})\, g_{n+1}(x_{n+1}) $$
for functions $f_{n+1} \in \mathcal{F}_b(\mathsf{X}^{n+2})$.

⇒ Very simple structure, but it involves the normalization factor $c_{n+1} \stackrel{\mathrm{def}}{=} \mathrm{L}_{n+1}/\mathrm{L}_n$, which is not computable except in simple cases such as when $\mathsf{X}$ is finite. This claim is not obvious (see next slides...).

15. Filtering Recursion

Marginalizing with respect to all variables but $x_n$ and $x_{n+1}$ gives the (marginal) filtering recursion:
$$ c_{\nu,n+1} = \iint \phi_{\nu,n|n}(dx)\, Q(x, dx')\, g_{n+1}(x'), $$
$$ \phi_{\nu,n+1|n+1}(f) = c_{\nu,n+1}^{-1} \iint f(x')\ \phi_{\nu,n|n}(dx)\, Q(x, dx')\, g_{n+1}(x'), $$
with initial condition
$$ c_{\nu,0} = \nu(g_0), \qquad \phi_{\nu,0|0}(f) = c_{\nu,0}^{-1} \int f(x)\, g_0(x)\, \nu(dx). $$

Remark: when $\mathsf{X}$ is finite (speech recognition, bioinformatics) the above is known as the normalized forward recursion (of forward-backward); the specialization of this relation to Gaussian linear state-space models is known as Kalman filtering.

16. Prediction and Filtering Updates

It is sometimes convenient to break the previous recursion in two steps:
$$ \phi_{\nu,n+1|n} = \phi_{\nu,n|n} Q \quad \text{(prediction)}, $$
$$ c_{\nu,n+1} = \phi_{\nu,n+1|n}(g_{n+1}), \qquad \phi_{\nu,n+1|n+1}(f) = c_{\nu,n+1}^{-1} \int f(x)\, g_{n+1}(x)\ \phi_{\nu,n+1|n}(dx) \quad \text{(filtering)}. $$

Computation of the log-likelihood:
$$ \ell_{\nu,n} \stackrel{\mathrm{def}}{=} \log \mathrm{L}_{\nu,n} = \sum_{k=0}^{n} \log \phi_{\nu,k|k-1}(g_k). $$
This is non-trivial: we have replaced an $(n+1)$-dimensional integral by a product of $n+1$ integrals on $\mathsf{X}$! In finite state space HMMs, the filtering recursion makes it possible to evaluate the (log-)likelihood in $O\{(n+1)\,\mathrm{Card}^2(\mathsf{X})\}$ operations.
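For a finite state space, the prediction/filtering updates and the log-likelihood accumulation can be sketched as follows (a toy implementation; the array layout, with g[k, i] standing for $g_k(i)$, is an assumption for illustration):

```python
import numpy as np

def forward_filter(nu, Q, g):
    """Normalized forward recursion for a finite-state HMM.
    nu: (m,) initial distribution; Q: (m, m) transition matrix;
    g: (n+1, m) conditional likelihoods g[k, i] = g(i, Y_k).
    Returns the filtering probabilities phi[k] = phi_{k|k} and
    the log-likelihood sum_k log c_k."""
    n1, m = g.shape
    phi = np.empty((n1, m))
    loglik = 0.0
    pred = nu                        # predictive distribution, phi_{0|-1} = nu
    for k in range(n1):
        c = pred @ g[k]              # c_k = phi_{k|k-1}(g_k)
        phi[k] = pred * g[k] / c     # filtering update
        loglik += np.log(c)
        pred = phi[k] @ Q            # prediction: phi_{k+1|k} = phi_{k|k} Q
    return phi, loglik
```

Each sweep costs $O(\mathrm{Card}^2(\mathsf{X}))$ per observation because of the matrix-vector product, matching the complexity claim above.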

17. Recap: Filtering and Smoothing

The recursion
$$ \phi_{\nu,n+1|n} = \phi_{\nu,n|n} Q, \qquad c_{\nu,n+1} = \phi_{\nu,n+1|n}(g_{n+1}), \qquad \phi_{\nu,n+1|n+1}(f) = c_{\nu,n+1}^{-1} \int f(x)\, g_{n+1}(x)\ \phi_{\nu,n+1|n}(dx), $$
with $\phi_{\nu,0|-1} \stackrel{\mathrm{def}}{=} \nu$, computes the filtering and predictive distributions recursively, making it possible (i) to compute the likelihood $\mathrm{L}_{\nu,n+1}$ and, potentially, (ii) the joint smoothing distribution, since
$$ \phi_{0:n+1|n+1}(f_{n+1}) = c_{\nu,n+1}^{-1} \int \cdots \int f_{n+1}(x_{0:n+1})\ \phi_{0:n|n}(dx_0, \dots, dx_{n-1}, dx_n)\, Q(x_n, dx_{n+1})\, g_{n+1}(x_{n+1}). $$

18. Appendix: Finite-Dimensional Recursive Smoothing for a Sum

In particular, if $f_n(x_{0:n}) = \sum_{k=0}^{n} s(x_k)$, define the signed measure $\tau_{\nu,n}$ by
$$ \tau_{\nu,n}(f) = \int f(x_n) \left( \sum_{k=0}^{n} s(x_k) \right) \phi_{\nu,0:n|n}(dx_0, \dots, dx_n), $$
such that $\tau_{\nu,n}(\mathsf{X}) = \mathrm{E}_\nu\!\left[ \sum_{k=0}^{n} s(X_k) \,\big|\, Y_{0:n} \right]$. Then
$$ \tau_{\nu,n+1}(f) = c_{\nu,n+1}^{-1} \int \cdots \int f(x_{n+1}) \left( \sum_{k=0}^{n+1} s(x_k) \right) \phi_{\nu,0:n|n}(dx_0, \dots, dx_{n-1}, dx_n)\, Q(x_n, dx_{n+1})\, g_{n+1}(x_{n+1}) $$
$$ = \int f(x_{n+1}) \left[ s(x_{n+1})\ \phi_{\nu,n+1|n+1}(dx_{n+1}) + c_{\nu,n+1}^{-1} \left( \int \tau_{\nu,n}(dx_n)\, Q(x_n, dx_{n+1})\, g_{n+1}(x_{n+1}) \right) \right]. $$

19. Roadmap

1. What is a Hidden Markov Model?
2. Filtering and Smoothing Recursions
3. Monte Carlo, Importance Sampling and Sampling Importance Resampling
4. Sequential Importance Sampling
5. Sequential Importance Sampling with Resampling
6. More Sequential Monte Carlo Algorithms
7. Approximation of Sum Functionals and Parameter Estimation

20. Monte Carlo Integration

Objective: given a probability measure $\mu$, how to evaluate numerically $\mu(f) = \int_{\mathsf{X}} \mu(dx) f(x)$ for arbitrary $\mu$-integrable functions $f$?

The Monte Carlo answer:
1. Draw an independent sample $\xi^1, \dots, \xi^N$ from the probability measure $\mu$.
2. Compute the sample average $N^{-1} \sum_{i=1}^{N} f(\xi^i)$.

This technique is applicable only when direct sampling from the distribution $\mu$ is feasible.
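A concrete sketch of the two steps, using for illustration $\mu = \mathrm{N}(0,1)$ and $f(x) = x^2$, for which $\mu(f) = 1$:

```python
import numpy as np

rng = np.random.default_rng(0)
# Step 1: draw an independent sample xi^1, ..., xi^N from mu = N(0, 1).
xi = rng.standard_normal(100_000)
# Step 2: the sample average of f(xi^i) approximates mu(f) = E[X^2] = 1.
mc_estimate = np.mean(xi ** 2)
```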

21. (Unnormalized) Importance Sampling: General Principle

It is also possible to sample from an instrumental (or importance) distribution $\nu$, applying a change-of-measure formula to account for the fact that the instrumental distribution differs from the target distribution $\mu$.

More formally, assume that the target probability measure $\mu$ is absolutely continuous with respect to the instrumental probability measure $\nu$, $\mu \ll \nu$. For any $\mu$-integrable function $f$,
$$ \mu(f) = \int f(x)\, \mu(dx) = \int f(x)\, \frac{d\mu}{d\nu}(x)\, \nu(dx), $$
where $\frac{d\mu}{d\nu}$ is the Radon-Nikodym derivative of $\mu$ with respect to $\nu$, called the importance function (or importance ratio) in the context of importance sampling.

22. (Unnormalized) Importance Sampling: the Algorithm

Sampling: draw an independent sample $\xi^1, \dots, \xi^N$ from the distribution $\nu$.

Weighting: compute the importance weights $\omega^i = \frac{d\mu}{d\nu}(\xi^i)$ for $i = 1, \dots, N$.

Weighted Monte Carlo approximation:
$$ \hat{\mu}^{\mathrm{IS}}_{\nu,N}(f) = N^{-1} \sum_{i=1}^{N} \omega^i f(\xi^i). $$
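A minimal sketch of the algorithm, taking for illustration the target $\mu = \mathrm{N}(2, 1)$, the instrumental $\nu = \mathrm{N}(0, 3^2)$ (wide enough to cover the target), and $f(x) = x$, so that $\mu(f) = 2$:

```python
import numpy as np

def norm_pdf(x, m, s):
    """Density of N(m, s^2)."""
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(0)
N = 200_000
xi = rng.normal(0.0, 3.0, N)                          # sampling: xi^i ~ nu
w = norm_pdf(xi, 2.0, 1.0) / norm_pdf(xi, 0.0, 3.0)   # weighting: dmu/dnu(xi^i)
is_estimate = np.mean(w * xi)                         # weighted MC approximation
```

Since $\nu(d\mu/d\nu) = 1$, the average of the raw weights should itself be close to one, which is a useful sanity check.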

23. (Unnormalized) Importance Sampling: Large Sample Performance

Strong law of large numbers: the sequence $\hat{\mu}^{\mathrm{IS}}_{\nu,N}(f)$ converges to $\mu(f)$, almost surely, as $N \to \infty$.

Central limit theorem: if $f$ is a real-valued measurable function satisfying
$$ \nu\!\left( (1 + f^2) \left( \frac{d\mu}{d\nu} \right)^2 \right) = \mu\!\left( (1 + f^2)\, \frac{d\mu}{d\nu} \right) < \infty, $$
then $\hat{\mu}^{\mathrm{IS}}_{\nu,N}(f)$ is asymptotically normal:
$$ \sqrt{N} \left( \hat{\mu}^{\mathrm{IS}}_{\nu,N}(f) - \mu(f) \right) \xrightarrow{\mathcal{D}} \mathrm{N}\!\left( 0,\ \mathrm{Var}_\nu\!\left( \frac{d\mu}{d\nu} f \right) \right), \quad \text{where} \quad \mathrm{Var}_\nu\!\left( \frac{d\mu}{d\nu} f \right) = \nu\!\left( \left\{ f\, \frac{d\mu}{d\nu} \right\}^2 \right) - \mu^2(f). $$

Deviations inequalities (exponential, $L^p$) or more sophisticated empirical process results are also available.

⇒ Choosing $\nu$ such that $d\mu/d\nu$ stays as small as possible is very important in practice.

24. Importance Sampling

In situations where $\frac{d\mu}{d\nu}$ is known only up to a scaling factor, we can still use the importance sampling estimator, just changing the normalization factor:
$$ \tilde{\mu}^{\mathrm{IS}}_{\nu,N}(f) = \frac{ \sum_{i=1}^{N} f(\xi^i)\, \frac{d\mu}{d\nu}(\xi^i) }{ \sum_{i=1}^{N} \frac{d\mu}{d\nu}(\xi^i) }. $$
The (self-normalized) importance sampling estimator (sometimes also called Bayesian sampling estimator) is defined as a ratio of unnormalized importance sampling estimators:
$$ \tilde{\mu}^{\mathrm{IS}}_{\nu,N}(f) = \frac{ \hat{\mu}^{\mathrm{IS}}_{\nu,N}(f) }{ \hat{\mu}^{\mathrm{IS}}_{\nu,N}(1) }. $$
By the strong law of large numbers,
$$ \hat{\mu}^{\mathrm{IS}}_{\nu,N}(f) \xrightarrow{\text{a.s.}} \mu(f), \qquad \hat{\mu}^{\mathrm{IS}}_{\nu,N}(1) \xrightarrow{\text{a.s.}} 1, $$
showing that $\tilde{\mu}^{\mathrm{IS}}_{\nu,N}(f)$ is a strongly consistent estimator of $\mu(f)$.

25. Importance Sampling (contd.)

Assuming in addition that $f$ is real-valued and satisfies
$$ \nu\!\left( (1 + f^2) \left( \frac{d\mu}{d\nu} \right)^2 \right) = \mu\!\left( (1 + f^2)\, \frac{d\mu}{d\nu} \right) < \infty, $$
$$ \sqrt{N} \left( \tilde{\mu}^{\mathrm{IS}}_{\nu,N}(f) - \mu(f) \right) \xrightarrow{\mathcal{D}} \mathrm{N}\!\left( 0, \sigma^2(\nu, f) \right), $$
$$ \sigma^2(\nu, f) = \mathrm{Var}_\nu\!\left( \frac{d\mu}{d\nu} \{ f - \mu(f) \} \right) = \nu\!\left( (f - \mu(f))^2 \left( \frac{d\mu}{d\nu} \right)^2 \right). $$

The estimator is errorless for constant functions, and its performance is clearly dependent on the fact that $d\mu/d\nu$ stays small.

26. Sampling Importance Resampling (SIR)

While importance sampling was originally designed to overcome difficulties with direct sampling from $\mu$ when approximating integrals like $\mu(f)$, it can also be used for approximate sampling from the distribution $\mu$. The sampling importance resampling (SIR) method is a two-stage method:

Sampling: draw an i.i.d. sample $\tilde{\xi}^1, \dots, \tilde{\xi}^M$ from the instrumental distribution $\nu$.

Weighting: compute the (normalized) importance weights
$$ \omega^i = \frac{d\mu}{d\nu}(\tilde{\xi}^i) \Big/ \sum_{j=1}^{M} \frac{d\mu}{d\nu}(\tilde{\xi}^j) \quad \text{for } i = 1, \dots, M. $$

Resampling: draw, conditionally independently given $(\tilde{\xi}^1, \dots, \tilde{\xi}^M)$, $N$ discrete random variables $(I^1, \dots, I^N)$ taking values in the set $\{1, \dots, M\}$ with probabilities $(\omega^1, \dots, \omega^M)$. Set $\xi^i = \tilde{\xi}^{I^i}$ for $i = 1, \dots, N$.

The set $(I^1, \dots, I^N)$ is thus a multinomial trial process. This resampling method is known as multinomial resampling.
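A sketch of the procedure with multinomial resampling, reusing the illustrative choice $\mu = \mathrm{N}(2, 1)$ and $\nu = \mathrm{N}(0, 3^2)$ as target and instrumental distributions:

```python
import numpy as np

def norm_pdf(x, m, s):
    """Density of N(m, s^2)."""
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(0)
M, N = 100_000, 50_000
xi_tilde = rng.normal(0.0, 3.0, M)        # sampling stage: first-stage sample ~ nu
w = norm_pdf(xi_tilde, 2.0, 1.0) / norm_pdf(xi_tilde, 0.0, 3.0)
w /= w.sum()                              # weighting stage: normalized weights
idx = rng.choice(M, size=N, p=w)          # multinomial resampling of the indices
xi = xi_tilde[idx]                        # approximate sample from mu = N(2, 1)
```

The resampled points should then have approximately the target's mean and standard deviation.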

27. Sampling Importance Resampling (contd.)

The first-stage sample $\tilde{\xi}^1, \dots, \tilde{\xi}^M$ is really distributed under $\nu$. In the resampling operation, the bad points, as measured by $d\mu/d\nu$, are discarded, whereas the good points are selected (and perhaps duplicated) with high probability.

[Figure: two panels labelled TARGET, illustrating the first-stage sample under $\nu$ and the resampled points concentrating on the target.]

28. SIR: Large Sample Behavior

It is not obvious in which sense $(\xi^1, \dots, \xi^N)$ is (approximately) a sample from the target distribution $\mu$. Rewriting
$$ \hat{\mu}^{\mathrm{SIR}}_{\nu,M,N}(f) = N^{-1} \sum_{i=1}^{N} f(\xi^i) = \sum_{i=1}^{M} \frac{N^i}{N}\, f(\tilde{\xi}^i), $$
it is easily seen that the sample mean $\hat{\mu}^{\mathrm{SIR}}_{\nu,M,N}(f)$ of the SIR sample has, conditionally on the first-stage sample $(\tilde{\xi}^1, \dots, \tilde{\xi}^M)$, the same expectation as the importance sampling estimator $\tilde{\mu}^{\mathrm{IS}}_{\nu,M}(f)$:
$$ \mathrm{E}\!\left[ \hat{\mu}^{\mathrm{SIR}}_{\nu,M,N}(f) \,\big|\, \tilde{\xi}^1, \dots, \tilde{\xi}^M \right] = \tilde{\mu}^{\mathrm{IS}}_{\nu,M}(f). $$
As a consequence, the SIR estimator $\hat{\mu}^{\mathrm{SIR}}_{\nu,M,N}(f)$ has the same expectation as the importance sampling estimator, but its mean squared error is always larger, due to the well-known variance decomposition
$$ \mathrm{E}\!\left[ \left( \hat{\mu}^{\mathrm{SIR}}_{\nu,M,N}(f) - \mu(f) \right)^2 \right] = \mathrm{E}\!\left[ \left( \hat{\mu}^{\mathrm{SIR}}_{\nu,M,N}(f) - \tilde{\mu}^{\mathrm{IS}}_{\nu,M}(f) \right)^2 \right] + \mathrm{E}\!\left[ \left( \tilde{\mu}^{\mathrm{IS}}_{\nu,M}(f) - \mu(f) \right)^2 \right]. $$

29. SIR: Large Sample Behavior (contd.)

Going beyond this elementary result is not trivial, because the second-stage sample $\xi^1, \dots, \xi^N$ is no longer i.i.d. after resampling, due to the normalization of the importance weights.

Theorem. Assume that $\mu \ll \nu$. Let $\{\tilde{\xi}^i\}_{1 \le i \le M}$ be i.i.d. random variables with distribution $\nu$. Then $\hat{\mu}^{\mathrm{SIR}}_{\nu,M,N}(f)$ is a (weakly) consistent estimate of $\mu(f)$ for $\mu$-integrable functions $f$ as $M, N \to \infty$.

Assume in addition that $\lim_{M,N\to\infty} M/N = \alpha$ for some $\alpha \ge 1$ and that $d\mu/d\nu$ and $f\, d\mu/d\nu$ are in $L^2(\mathsf{X}, \nu)$. Then $\hat{\mu}^{\mathrm{SIR}}_{\nu,M,N}(f)$ is asymptotically normal:
$$ \sqrt{N} \left( \hat{\mu}^{\mathrm{SIR}}_{\nu,M,N}(f) - \mu(f) \right) \xrightarrow{\mathcal{D}} \mathrm{N}\!\left( 0, \sigma^2(f) \right) $$
with
$$ \sigma^2(f) = \underbrace{\mathrm{Var}_\mu(f)}_{\text{variance of resampling}} + \underbrace{\alpha^{-1}\, \mathrm{Var}_\nu\!\left( \frac{d\mu}{d\nu} \{ f - \mu(f) \} \right)}_{\text{variance of IS}}. $$

Analysis of the opposite case ($\alpha < 1$) is possible but less interesting in practice.

30. Alternative Resampling Schemes

There are other resampling schemes that guarantee that $\mathrm{E}[N^i \mid \tilde{\xi}^1, \dots, \tilde{\xi}^M] = N \omega^i$ for $i = 1, \dots, M$ and that have lower conditional variance.

[Figure: principle of stratified sampling (left) and systematic sampling (right), illustrated on the cumulated weights $\omega^1$, $\omega^1 + \omega^2$, $\omega^1 + \omega^2 + \omega^3$, ...]

Note: the latter does not always reduce the conditional variance. Studying their large sample behavior is harder, however.
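Both schemes can be sketched via an inverse-CDF lookup on the cumulative weights (a toy implementation, with hypothetical function names):

```python
import numpy as np

def stratified_resample(w, N, rng):
    """Stratified resampling: one uniform draw in each stratum
    [i/N, (i+1)/N), mapped through the cumulative weight function."""
    u = (np.arange(N) + rng.uniform(size=N)) / N
    return np.searchsorted(np.cumsum(w), u)

def systematic_resample(w, N, rng):
    """Systematic resampling: a single uniform offset shared by all
    N evenly spaced points."""
    u = (np.arange(N) + rng.uniform()) / N
    return np.searchsorted(np.cumsum(w), u)
```

Both satisfy $\mathrm{E}[N^i \mid \text{weights}] = N\omega^i$; systematic resampling additionally forces each count $N^i$ to lie within one unit of $N\omega^i$.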

31. Roadmap

1. What is a Hidden Markov Model?
2. Filtering and Smoothing Recursions
3. Monte Carlo, Importance Sampling and Sampling Importance Resampling
4. Sequential Importance Sampling
5. Sequential Importance Sampling with Resampling
6. More Sequential Monte Carlo Algorithms
7. Approximation of Sum Functionals and Parameter Estimation

32. Sequential Importance Sampling

The principle of sequential Monte Carlo methods is to use Monte Carlo integration to approximate the filtering recursion in general HMMs (not finite HMMs or GLSSMs). The key remark, which can be traced back to (Handschin & Mayne, 1969) and (Handschin, 1970), is that the importance sampling method targeting the joint smoothing distribution $\phi_{0:n|n}$ can be implemented sequentially, due to the particular structure of $\phi_{0:n|n}$. The corresponding algorithm is known as sequential importance sampling (SIS). The SIS algorithm does reasonably well but is bound to become unreliable for larger values of $n$ (this limitation will be taken care of later...).

33. HMM Notations (Repeated)

Recall that a hidden Markov model is such that
$$ X_{k+1} \sim Q(X_k, \cdot) \quad \text{(state equation)}, \qquad Y_k \sim G(X_k, \cdot) \quad \text{(measurement equation)}, $$
where $\{X_k\}_{k\ge 0}$ is a Markov chain with transition kernel $Q$ and initial distribution $\nu$, and $G$ is a transition kernel from $(\mathsf{X}, \mathcal{X})$ to $(\mathsf{Y}, \mathcal{Y})$ such that there exists a measure $\mu$ with, for all $x \in \mathsf{X}$ and $A \in \mathcal{Y}$, $G(x, A) = \int_A g(x, y)\, \mu(dy)$.

To simplify the mathematical expressions, we use the notation $g_k$ to denote the function $g(\cdot, Y_k)$, considered as a function of its first argument.

34. Smoothing (Repeated)

The posterior distribution $\phi_{0:n|n}$ of the states $X_{0:n}$ given the observations $Y_{0:n}$ may be computed recursively (in $n$) according to
$$ \phi_{0|0}(f) = \frac{ \int g_0(x_0)\, \nu(dx_0)\, f(x_0) }{ \int g_0(x_0)\, \nu(dx_0) }, \qquad \phi_{0:n+1|n+1}(f_{n+1}) = \int f_{n+1}(x_{0:n+1})\ \phi_{0:n|n}(dx_{0:n})\, T^u_n(x_n, dx_{n+1}), $$
where, for $k \ge 0$, $T^u_k$ is the unnormalized transition kernel on $(\mathsf{X}, \mathcal{X})$ given by
$$ T^u_k(x, A) = \left( \frac{\mathrm{L}_{k+1}}{\mathrm{L}_k} \right)^{-1} \int_A Q(x, dx')\, g_{k+1}(x'), \quad x \in \mathsf{X},\ A \in \mathcal{X}. $$

In this part we omit to indicate the dependence with respect to $\nu$, which is not essential.

35. Choice of the Instrumental Distribution

Key remark: both the simulation from the instrumental distribution and the computation of the importance weights can be carried out sequentially if a, possibly non-homogeneous, Markov chain is used as instrumental distribution.

More precisely, let $\{R_k\}_{k\ge 0}$ denote a family of Markov transition kernels on $(\mathsf{X}, \mathcal{X})$ and $\rho_0$ a probability measure on $(\mathsf{X}, \mathcal{X})$. Assume that $\phi_{0|0} \ll \rho_0$ and, for all $k \ge 0$ and all $x \in \mathsf{X}$, $T^u_k(x, \cdot) \ll R_k(x, \cdot)$. The inhomogeneous Markov chain with initial distribution $\rho_0$ and transition kernels $\{R_k\}_{k\ge 0}$ defines the following distributions:
$$ \rho_{0:k}(f_k) = \int f_k(x_{0:k})\ \rho_0(dx_0) \prod_{l=0}^{k-1} R_l(x_l, dx_{l+1}). $$

36. Sequential Computation of the Importance Function

The importance function is then defined as
$$ \frac{d\phi_{0:n|n}}{d\rho_{0:n}}(x_{0:n}) = \frac{d\phi_{0|0}}{d\rho_0}(x_0) \prod_{k=0}^{n-1} \frac{dT^u_k(x_k, \cdot)}{dR_k(x_k, \cdot)}(x_{k+1}), $$
which can be computed sequentially, in the sense that
$$ \frac{d\phi_{0:k+1|k+1}}{d\rho_{0:k+1}}(x_{0:k+1}) = \frac{d\phi_{0:k|k}}{d\rho_{0:k}}(x_{0:k})\ \frac{dT^u_k(x_k, \cdot)}{dR_k(x_k, \cdot)}(x_{k+1}) $$
for $k \ge 0$.

37. Sequential Importance Sampling Algorithm

Initialization: draw $\xi^1_0, \dots, \xi^N_0$ independently from $\rho_0$ and compute the weights
$$ \omega^i_0 = \frac{d\phi_{0|0}}{d\rho_0}(\xi^i_0), \quad i = 1, \dots, N. $$

Recursion: for $k = 0, 1, \dots$ and for $i = 1, \dots, N$:
- Draw $\xi^i_{k+1}$, conditionally independently of $\{\xi^j_l, \xi^m_{k+1}\}_{l \le k,\ 1 \le j \le N,\ m < i}$, under the distribution $R_k(\xi^i_k, \cdot)$.
- Update the importance weight according to
$$ \omega^i_{k+1} = \omega^i_k\, \frac{dT^u_k(\xi^i_k, \cdot)}{dR_k(\xi^i_k, \cdot)}(\xi^i_{k+1}). $$

The ratio $\omega^i_{k+1}/\omega^i_k$ is often referred to as the incremental weight; the points $\xi^i_k$ are called particles; the trajectories $\xi^i_{0:k}$, path particles.

38. [Figure: one step of the SIS algorithm with just seven particles, showing the current filtering approximation (FILT.), the instrumental distribution (INSTR.), and the updated filtering approximation (FILT. +1).]

39. Sequential Importance Sampling Approximation

At any time index $n$, the sequential importance sampling estimator of $\phi_{0:n|n}(f_n)$ is available as
$$ \hat{\phi}^{\mathrm{IS}}_{0:n|n}(f_n) = \frac{ \sum_{i=1}^{N} f_n(\xi^i_{0:n})\, \omega^i_n }{ \sum_{i=1}^{N} \omega^i_n }. $$

Remark: if we are just interested in functions $f_n(x_{0:n}) = f(x_n)$, storing the full trajectories of the particles is not required; each step of the algorithm involves $O(N)$ operations and requires just that $N + N \dim(\mathsf{X})$ real numbers be stored. Likewise, for functions of the form $f_n(x_{0:n}) = f_k(x_{n-k:n})$, only the last $k+1$ elements of each path particle $\xi^i_{0:n}$ need to be stored. We will see later that one may indeed consider more general functions $f_n$, as long as they have a specific structure...

40. Choosing the Importance Kernel: (1) the Prior Kernel

As for non-sequential importance sampling, the performance of SIS depends crucially on the choice of the importance kernel $R_k$ (and, to a lesser extent, on that of $\rho_0$). The most obvious solution is to use the prior kernel $R_k = Q$:

- The instrumental kernel at each iteration mimics the state dynamic, which is usually simple to sample from.
- The incremental weight
$$ \frac{dT^u_k(x, \cdot)}{dR_k(x, \cdot)}(x') = \frac{\mathrm{L}_k}{\mathrm{L}_{k+1}}\, g_{k+1}(x'), \quad (x, x') \in \mathsf{X} \times \mathsf{X}, $$
does not depend on $x \in \mathsf{X}$; hence computing the incremental weight simply amounts to evaluating the conditional likelihood function at the new particle positions.
- Recall that the importance weights need to be evaluated up to a constant only; hence the non-computable factor $\mathrm{L}_k/\mathrm{L}_{k+1}$ may be omitted.
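SIS with the prior kernel can be sketched on an illustrative linear Gaussian model $X_{k+1} = a X_k + \sigma U_k$, $Y_k = X_k + \tau V_k$ (the model, function name, and parameter values are assumptions for illustration, not from the slides):

```python
import numpy as np

def sis_prior(y, a, sigma, tau, N, rng):
    """SIS with R_k = Q: propagate particles under the state dynamics and
    multiply the weights by the conditional likelihood g_{k+1} (up to a
    constant, since the factor L_k / L_{k+1} may be omitted)."""
    xi = rng.normal(0.0, sigma / np.sqrt(1.0 - a ** 2), N)  # xi_0 ~ stationary nu
    w = np.exp(-0.5 * ((y[0] - xi) / tau) ** 2)             # omega_0 propto g_0(xi_0)
    for yk in y[1:]:
        xi = a * xi + sigma * rng.standard_normal(N)        # xi_{k+1} ~ Q(xi_k, .)
        w *= np.exp(-0.5 * ((yk - xi) / tau) ** 2)          # incremental weight g_{k+1}
    return xi, w / w.sum()                                  # self-normalized weights
```

The filtering estimate of $\mathrm{E}[X_n \mid Y_{0:n}]$ is then the weighted particle mean `np.sum(w * xi)`.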

41. Lack of Robustness of the Prior Kernel

The prior kernel is a reasonable option, which is computationally very simple and thus often hard to beat, especially in models where the state is not precisely identified by the observations. It is, however, very sensitive to the presence of outliers:

[Figure: panels FILT. k, FILT. k+1, FILT. k+2, illustrating a conflict between the prior and the posterior: at time k+1, the observation does not agree with the particle approximation of the predictive distribution; after the reweighting step, the mass becomes concentrated on a single particle.]

Due to the multiplicative structure of the importance weights, recovering from this situation is almost always impossible.

42. Choosing the Importance Kernel: (2) the Optimal Kernel

To circumvent the problem, one needs to incorporate information both on the state dynamic and on the new observation. Among all possible options, there is only one kernel such that the new weight $\omega^i_{k+1}$ is a deterministic function of the current particle $\xi^i_k$; this is the only choice for which the conditional variance of the new weights is equal to zero. Let $R_k = T_k$, where
$$ T_k(x, f) \stackrel{\mathrm{def}}{=} \gamma_k^{-1}(x) \int f(x')\, Q(x, dx')\, g_{k+1}(x'), \qquad \gamma_k(x) \stackrel{\mathrm{def}}{=} \int_{\mathsf{X}} Q(x, dx')\, g_{k+1}(x'). $$
Then
$$ \frac{dT^u_k(x, \cdot)}{dT_k(x, \cdot)}(x') = \frac{\mathrm{L}_k}{\mathrm{L}_{k+1}}\, \gamma_k(x), \quad (x, x') \in \mathsf{X} \times \mathsf{X}. $$

Unfortunately, computing $\gamma_k$ is usually not feasible in models where implementing the filtering recursion is problematic!

43. The Optimal Kernel is More Robust to Outliers

[Figure: panels FILT. k, FILT. k+1, FILT. k+2 for the prior kernel (top) and for the optimal kernel (bottom).]

The optimal kernel proposes particles in the regions where the filtering density has most of its mass.

44. Local Approximation of the Optimal Importance Kernel

The aim is to find a distribution which resembles the optimal kernel but for which the incremental weight is computable. Ideally, this distribution should be overdispersed (recall the $d\mu/d\nu$ factor!) but not wildly inaccurate. We can find such a distribution in two steps:

1. locate the high-density region of the (multivariate) optimal distribution, to ensure that our proposal does not entirely miss important regions;
2. create an overdispersed approximation, so that the instrumental distribution dominates the optimal importance distribution.

Of course, because we have to repeat the process for each particle, the overall procedure should be reasonably simple.

45. Application to the Stochastic Volatility Model

Consider the (discrete-time) stochastic volatility model
$$ X_{k+1} = \phi X_k + \sigma U_k, \quad |\phi| < 1, \qquad Y_k = \beta \exp(X_k/2)\, V_k, $$
where
1. $\{U_k\}_{k\ge 0}$ and $\{V_k\}_{k\ge 0}$ are independent standard Gaussian white noise processes;
2. $X_0 \sim \mathrm{N}(0, \sigma^2/(1 - \phi^2))$.

In this model,
$$ q(x, x') = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(x' - \phi x)^2}{2\sigma^2} \right), \qquad g_{k+1}(x') = \frac{1}{\sqrt{2\pi\beta^2}} \exp\!\left( -\frac{Y_{k+1}^2}{2\beta^2} \exp(-x') - \frac{x'}{2} \right), $$
and the incremental weight $\gamma_k(x)$ is not available in closed form.

46. Application to the Stochastic Volatility Model (contd.)

The function $x' \mapsto q(x, x')\, g_{k+1}(x')$ is (strictly) log-concave and thus unimodal. The mode $m_k(x)$ of the optimal transition density is the unique solution of the non-linear equation
$$ -\sigma^{-2}(x' - \phi x) + \frac{Y_{k+1}^2}{2\beta^2} \exp(-x') - \frac{1}{2} = 0. $$
The solution of this equation can be computed numerically. We use, for instance, as instrumental kernel a t-distribution with $\eta = 5$ degrees of freedom, the scale of which is set as the inverse of the negated second-order derivative of $x' \mapsto \log q(x, x')\, g_{k+1}(x')$ evaluated at the mode $m_k(x)$, which is given by
$$ \sigma^2_k(x) = \left( \sigma^{-2} + \frac{Y_{k+1}^2}{2\beta^2} \exp[-m_k(x)] \right)^{-1}. $$
The incremental weight may easily be evaluated once $m_k(x)$ and $\sigma^2_k(x)$ have been computed (note that it now depends both on $x$ and $x'$). Recall also that we need to repeat these steps independently for each current particle position $x = \xi^i_k$.
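The mode-finding step can be sketched with a few Newton iterations on the strictly concave log-density (illustrative parameter values; the function name is hypothetical):

```python
import numpy as np

def optimal_mode(x, y_next, phi, sigma, beta, iters=25):
    """Newton iterations for the mode m_k(x) of x' -> log(q(x, x') g_{k+1}(x'))
    in the stochastic volatility model, i.e. the root of
    -(x' - phi*x)/sigma^2 + (y^2 / (2 beta^2)) exp(-x') - 1/2 = 0.
    Also returns sigma_k^2(x), the inverse negated curvature at the mode."""
    c = y_next ** 2 / (2.0 * beta ** 2)
    m = phi * x                                       # start from the prior mean
    for _ in range(iters):
        grad = -(m - phi * x) / sigma ** 2 + c * np.exp(-m) - 0.5
        hess = -1.0 / sigma ** 2 - c * np.exp(-m)     # always < 0: strict concavity
        m -= grad / hess                              # Newton step
    return m, -1.0 / hess
```

Because the gradient is convex and strictly decreasing in $x'$, the Newton iterates converge rapidly to the unique root from the prior mean.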

47. Application to the Stochastic Volatility Model (contd.)

[Figure: waterfall representation of the sequence of estimated filtering distributions together with the actual state (1,000 particles); axes: time index, state, density.]

48. Weight Degeneracy

The normalized importance weights measure the pertinence of each particle: a relatively small importance weight implies that the associated particle is far from the main body of the posterior distribution and contributes poorly to the sequential importance sampling approximation. If there are too many such ineffective particles, the Monte Carlo approximation becomes highly unreliable.

49. Weight Degeneracy (contd.)

Empirically, this phenomenon always happens when $n$ gets larger ($N$ being fixed). In simplistic models, it is possible to show that the asymptotic variance of the approximation $\hat{\phi}^{\mathrm{IS}}_n(f)$ increases exponentially as $n$ increases (see text).

50. Application to the Stochastic Volatility Model (contd.)

[Figure: histograms of the base 10 logarithm of the normalized importance weights after 1, 10, and 100 iterations (from top to bottom) for the stochastic volatility model.]

51. Numerical Indicator: (1) Coefficient of Variation

A simple criterion is the coefficient of variation of the normalized weights,
$$ \mathrm{CV}_N(\omega) = \left[ \frac{1}{N} \sum_{i=1}^{N} \left( \frac{N \omega^i}{\sum_{j=1}^{N} \omega^j} - 1 \right)^2 \right]^{1/2}, \quad \omega = (\omega^1, \dots, \omega^N) \in (\mathbb{R}^+)^N. $$
When the weights are all equal to $1/N$, the coefficient of variation is equal to 0. At the other extreme, when one normalized weight is equal to 1 and all the others to 0, the coefficient of variation equals $\sqrt{N-1}$. Therefore, a large $\mathrm{CV}_N(\omega_k)$ indicates that there are many ineffective particles and that memory and computation are being wasted.

52. Numerical Indicator: (2) Entropy

Another possible measure of the weight imbalance is the Shannon entropy of the importance weights, defined as
$$ \mathrm{Ent}(\omega) = - \sum_{i=1}^{N} \frac{\omega^i}{\sum_{j=1}^{N} \omega^j}\ \log_2\!\left( \frac{\omega^i}{\sum_{j=1}^{N} \omega^j} \right), \quad \omega = (\omega^1, \dots, \omega^N) \in (\mathbb{R}^+)^N. $$
When all the importance weights are 0 except one, the entropy is null. On the contrary, if all the weights are equal to $1/N$, the entropy is maximal and equal to $\log_2(N)$.
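Both indicators can be sketched as follows (hypothetical function names):

```python
import numpy as np

def coeff_variation(w):
    """Coefficient of variation of the normalized weights: 0 for uniform
    weights, sqrt(N - 1) when a single weight carries all the mass."""
    wn = np.asarray(w, dtype=float) / np.sum(w)
    return np.sqrt(np.mean((wn.size * wn - 1.0) ** 2))

def weight_entropy(w):
    """Shannon entropy (base 2) of the normalized weights: log2(N) for
    uniform weights, 0 for a degenerate weight vector."""
    wn = np.asarray(w, dtype=float) / np.sum(w)
    wn = wn[wn > 0.0]                  # convention: 0 * log 0 = 0
    return -np.sum(wn * np.log2(wn))
```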

53. Application to the Stochastic Volatility Model (contd.)

[Figure: left, coefficient of variation of the weights; right, weight entropy, both as a function of $n$.]

54. Roadmap

1. What is a Hidden Markov Model?
2. Filtering and Smoothing Recursions
3. Monte Carlo, Importance Sampling and Sampling Importance Resampling
4. Sequential Importance Sampling
5. Sequential Importance Sampling with Resampling
6. More Sequential Monte Carlo Algorithms
7. Approximation of Sum Functionals and Parameter Estimation

55. Resampling

The solution, proposed by (Gordon, Salmond & Smith, 1993), to avoid the degeneracy of the importance weights is to regularly resample the particles according to their importance weights (thus equating all importance weights).

56-57. Resampling (contd.)

The basic idea of resampling is to (i) eliminate particles which have small importance weights and (ii) replicate particles which have large importance weights, in proportion to their relevance. Resampling concentrates the particles in regions of the state space which are pertinent and avoids the exploration of highly improbable areas.

58 Particle Methods for Hidden Markov Models - EPFL, 7 Dec Resampling This idea is clearly rooted in the sampling importance resampling (SIR) technique.

59 Particle Methods for Hidden Markov Models - EPFL, 7 Dec Resampling This idea is clearly rooted in the sampling importance resampling (SIR) technique. However, contrary to standard (non-sequential) SIR, the main aim of the resampling step is not to draw (asymptotically correctly) an i.i.d. sample from a distribution but rather to avoid weight degeneracy.

60 Particle Methods for Hidden Markov Models - EPFL, 7 Dec Resampling This idea is clearly rooted in the sampling importance resampling (SIR) technique. However, contrary to standard (non-sequential) SIR, the main aim of the resampling step is not to draw (asymptotically correctly) an i.i.d. sample from a distribution but rather to avoid weight degeneracy. The resampling step, while useful in fighting degeneracy, has a drawback: resampling introduces unnecessary noise into the algorithm, and this extra noise might be far from negligible.

61 Particle Methods for Hidden Markov Models - EPFL, 7 Dec Resampling This idea is clearly rooted in the sampling importance resampling (SIR) technique. However, contrary to standard (non-sequential) SIR, the main aim of the resampling step is not to draw (asymptotically correctly) an i.i.d. sample from a distribution but rather to avoid weight degeneracy. The resampling step, while useful in fighting degeneracy, has a drawback: resampling introduces unnecessary noise into the algorithm, and this extra noise might be far from negligible. Intuitively, when the importance weights are nearly constant, resampling only reduce the number of distinct particles thus introducing an extra noise without much benefit on the weight degeneracy.

62 Particle Methods for Hidden Markov Models - EPFL, 7 Dec Resampling This idea is clearly rooted in the sampling importance resampling (SIR) technique. However, contrary to standard (non-sequential) SIR, the main aim of the resampling step is not to draw (asymptotically correctly) an i.i.d. sample from a distribution but rather to avoid weight degeneracy. The resampling step, while useful in fighting degeneracy, has a drawback: resampling introduces unnecessary noise into the algorithm, and this extra noise might be far from negligible. Intuitively, when the importance weights are nearly constant, resampling only reduce the number of distinct particles thus introducing an extra noise without much benefit on the weight degeneracy. The one-step effect of resampling is thus negative but, on the long-term, resampling is required to guarantee a correct behavior of the algorithm.
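As an illustration (not part of the lectures), a minimal multinomial resampling step can be sketched as follows; the function name and the toy setup are ours.

```python
import numpy as np

def multinomial_resample(particles, weights, rng):
    """Eliminate low-weight particles and replicate high-weight ones by
    drawing N indices with probabilities proportional to the weights."""
    weights = np.asarray(weights, dtype=float)
    probs = weights / weights.sum()            # normalized importance weights
    n = len(particles)
    idx = rng.choice(n, size=n, p=probs)       # multinomial trial
    # After resampling, all importance weights are reset to a constant.
    return np.asarray(particles)[idx], np.full(n, 1.0 / n)

rng = np.random.default_rng(0)
parts = np.array([-1.0, 0.0, 2.0, 5.0])
w = np.array([0.01, 0.01, 0.01, 0.97])         # one particle dominates
new_parts, new_w = multinomial_resample(parts, w, rng)
```

With weights as uneven as these, the dominant particle typically gets replicated several times while the others tend to disappear, which is exactly the eliminate/replicate behavior described above.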

63 Particle Methods for Hidden Markov Models - EPFL, 7 Dec 2004 Sequential Importance Sampling with Resampling (SISR)

For time indices $k \ge 0$, do the following.

Sampling: Draw $(\tilde\xi_{k+1}^1, \ldots, \tilde\xi_{k+1}^N)$ conditionally independently given $\{\xi_{0:k}^j,\ j = 1, \ldots, N\}$ from the instrumental kernel: $\tilde\xi_{k+1}^i \sim R_k(\xi_k^i, \cdot)$, $i = 1, \ldots, N$. Compute the updated importance weights
$$\tilde\omega_{k+1}^i = \omega_k^i \, g_{k+1}(\tilde\xi_{k+1}^i) \, \frac{dQ(\xi_k^i, \cdot)}{dR_k(\xi_k^i, \cdot)}(\tilde\xi_{k+1}^i), \quad i = 1, \ldots, N.$$

Resampling (optional): Draw, conditionally independently given $\{(\xi_{0:k}^i, \tilde\xi_{k+1}^j),\ i, j = 1, \ldots, N\}$, the multinomial trial $(I_{k+1}^1, \ldots, I_{k+1}^N)$ with probabilities of success $\tilde\omega_{k+1}^1 / \sum_j \tilde\omega_{k+1}^j, \ldots, \tilde\omega_{k+1}^N / \sum_j \tilde\omega_{k+1}^j$. Reset the importance weights $\omega_{k+1}^i$ to a constant value for $i = 1, \ldots, N$.

64 Particle Methods for Hidden Markov Models - EPFL, 7 Dec 2004 SISR (contd.)

If resampling is not applied, set $I_{k+1}^i = i$ for $i = 1, \ldots, N$.

Trajectory update: for $i = 1, \ldots, N$, $\xi_{0:k+1}^i = (\xi_{0:k}^{I_{k+1}^i}, \tilde\xi_{k+1}^{I_{k+1}^i})$. Recall that storing the full particle path is usually not needed.

The SISR algorithm with systematic resampling and $R_k = Q$ (the prior kernel) is known as the bootstrap filter.
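One SISR update can be sketched in a few lines (our own code, not the book's): the caller supplies the instrumental sampler and the incremental-weight function $g_{k+1}\,dQ/dR_k$, so the same function covers both the bootstrap choice $R_k = Q$ and more general kernels.

```python
import numpy as np

def sisr_step(xi, w, sample_R, incr_weight, rng, resample=True):
    """One SISR update: mutate through the instrumental kernel R_k,
    multiply the weights by g_{k+1} * dQ/dR_k, then optionally resample."""
    n = len(xi)
    xi_new = sample_R(xi, rng)                 # draw from R_k(xi_k^i, .)
    w_new = w * incr_weight(xi, xi_new)        # updated importance weights
    if resample:
        probs = w_new / w_new.sum()
        idx = rng.choice(n, size=n, p=probs)   # multinomial selection
        xi_new, w_new = xi_new[idx], np.full(n, 1.0 / n)
    return xi_new, w_new

# Toy usage: random-walk prior kernel, Gaussian likelihood around y = 0.5,
# and R_k = Q so that dQ/dR_k = 1 (the bootstrap choice).
rng = np.random.default_rng(1)
xi0 = rng.normal(size=100)
w0 = np.full(100, 1.0 / 100)
xi1, w1 = sisr_step(
    xi0, w0,
    sample_R=lambda x, r: x + 0.1 * r.normal(size=x.shape),
    incr_weight=lambda x_old, x_new: np.exp(-0.5 * (0.5 - x_new) ** 2),
    rng=rng,
)
```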

65 Particle Methods for Hidden Markov Models - EPFL, 7 Dec 2004 Illustration of the Bootstrap Filter on a Toy Example

Noisy AR(1) model:
$$X_{k+1} - \mu = \phi(X_k - \mu) + \sigma U_k, \qquad Y_k = X_k + \eta V_k,$$
with $\mu = 0.9$, $\phi = 0.95$, $\sigma^2 = 0.01$, $\eta^2 = 0.02 = (\sigma^2/(1-\phi^2))/5$.

To approximate the predictive distribution $\phi_{k+1|k}$, we use the bootstrap filter with $N = 50$ particles, plotting the full particle paths $\{\xi_{0:k}^i, \tilde\xi_{k+1}^i\}_{1 \le i \le N}$ for each time index. This example is used since we may also compute the actual filtering densities using Kalman filtering.
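This toy model can be filtered in a few lines; what follows is our own sketch of the bootstrap filter (prior kernel, multinomial resampling at every step), not the code used to produce the slides, with the parameter values quoted above.

```python
import numpy as np

rng = np.random.default_rng(2)

# Noisy AR(1) model parameters from the slide
mu, phi, sigma2, eta2 = 0.9, 0.95, 0.01, 0.02
sigma, eta = np.sqrt(sigma2), np.sqrt(eta2)

def simulate(n):
    """Simulate n steps of the noisy AR(1) model, started at stationarity."""
    x = np.empty(n)
    y = np.empty(n)
    x[0] = mu + rng.normal(0, sigma / np.sqrt(1 - phi ** 2))
    for k in range(n):
        if k > 0:
            x[k] = mu + phi * (x[k - 1] - mu) + sigma * rng.normal()
        y[k] = x[k] + eta * rng.normal()
    return x, y

def bootstrap_filter(y, n_particles=50):
    """SISR with R_k = Q (prior kernel) and resampling at every step;
    returns the sequence of estimated filter means."""
    xi = mu + (sigma / np.sqrt(1 - phi ** 2)) * rng.normal(size=n_particles)
    means = []
    for obs in y:
        xi = mu + phi * (xi - mu) + sigma * rng.normal(size=n_particles)
        logw = -0.5 * (obs - xi) ** 2 / eta2        # Gaussian likelihood g_{k+1}
        w = np.exp(logw - logw.max())
        w /= w.sum()
        means.append(np.sum(w * xi))                # weighted filter-mean estimate
        idx = rng.choice(n_particles, size=n_particles, p=w)
        xi = xi[idx]                                # multinomial resampling
    return np.array(means)

x, y = simulate(100)
est = bootstrap_filter(y)
```

For this linear-Gaussian model the exact filter means are given by the Kalman filter, so the particle estimates can be checked against the truth, which is precisely why this example is used on the slides.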

66 Particle Methods for Hidden Markov Models - EPFL, 7 Dec 2004

[Figure sequence, slides 66 to 89: predictive densities and evolution of the particle paths, plotted as state versus time index over successive time indices.]

90 Particle Methods for Hidden Markov Models - EPFL, 7 Dec 2004 Application to the Stochastic Volatility Model (contd.)

[Figure: coefficient of variation (left) and entropy (right) of the normalized importance weights as a function of the number of iterations, when using resampling triggered by $\mathrm{CV}_N(\omega) > 1$.]
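The two diagnostics plotted on this slide are simple to compute; here is a sketch with our own normalization conventions (the coefficient of variation of the normalized weights around the uniform value $1/N$, and their Shannon entropy).

```python
import numpy as np

def cv_and_entropy(w):
    """Coefficient of variation and Shannon entropy of normalized weights.
    Uniform weights give CV = 0 and entropy log N; a fully degenerate
    weight vector gives CV close to sqrt(N - 1) and entropy close to 0."""
    wbar = np.asarray(w, dtype=float)
    wbar = wbar / wbar.sum()
    n = len(wbar)
    cv = np.sqrt(np.mean((n * wbar - 1.0) ** 2))
    ent = -np.sum(wbar * np.log(wbar))
    return cv, ent

cv_u, ent_u = cv_and_entropy(np.ones(100))        # uniform weights
w_bad = np.full(100, 1e-12)
w_bad[0] = 1.0                                    # nearly degenerate weights
cv_d, ent_d = cv_and_entropy(w_bad)
# Resampling is triggered whenever cv exceeds the chosen threshold.
```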

91 Particle Methods for Hidden Markov Models - EPFL, 7 Dec 2004 Application to the Stochastic Volatility Model (contd.)

[Figure: histograms of the base 10 logarithm of the normalized importance weights after (from top to bottom) 1, 10 and 100 iterations, when using resampling triggered by $\mathrm{CV}_N(\omega) > 1$.]

92 Particle Methods for Hidden Markov Models - EPFL, 7 Dec 2004 Roadmap

1. What is a Hidden Markov Model?
2. Filtering and Smoothing Recursions
3. Monte Carlo, Importance Sampling and Sampling Importance Resampling
4. Sequential Importance Sampling
5. Sequential Importance Sampling with Resampling
6. More Sequential Monte Carlo Algorithms
7. Approximation of Sum Functionals and Parameter Estimation

93 Particle Methods for Hidden Markov Models - EPFL, 7 Dec 2004 Alternatives to SISR

The resampling step in the SISR algorithm can be seen as a method to sample approximately from $\phi_{0:k+1|k+1}$ given the current particle approximation $\hat\phi_{0:k|k}$. This alternative way of thinking about resampling suggests several sequential Monte Carlo variants.

94 Particle Methods for Hidden Markov Models - EPFL, 7 Dec 2004 Sequential Monte Carlo Reinterpreted

Recall that each update consists of two steps.

Prediction step: compute the one-step-ahead predictive distribution from the filtering distribution:
$$\phi_{0:k+1|k} = \phi_{0:k|k} \otimes Q.$$

Correction step (Bayes): compute the filtering distribution from the predictive distribution by taking into account the new observation $Y_{k+1}$:
$$\phi_{0:k+1|k+1}(f_{k+1}) = \frac{\int f_{k+1}(x_{0:k+1}) \, g_{k+1}(x_{k+1}) \, \phi_{0:k+1|k}(dx_{0:k+1})}{\int g_{k+1}(x_{k+1}) \, \phi_{0:k+1|k}(dx_{0:k+1})}.$$

$$\phi_{0:k|k} \xrightarrow{\text{prediction}} \phi_{0:k+1|k} \xrightarrow{\text{correction}} \phi_{0:k+1|k+1}$$

95 Particle Methods for Hidden Markov Models - EPFL, 7 Dec 2004 Sequential Monte Carlo Reinterpreted

Replace $\phi_{0:k|k}$ by the empirical filtering distribution
$$\hat\phi_{0:k|k} = \sum_{i=1}^N \frac{\omega_k^i}{\sum_{j=1}^N \omega_k^j} \, \delta_{\xi_{0:k}^i}.$$
Applying the prediction and then the correction step to this approximation yields
$$\hat\phi_{0:k|k} \xrightarrow{\text{prediction}} \bar\phi_{0:k+1|k} = \sum_{i=1}^N \frac{\omega_k^i}{\sum_{j=1}^N \omega_k^j} \, \delta_{\xi_{0:k}^i} \otimes Q(\xi_k^i, \cdot)$$
$$\xrightarrow{\text{correction}} \bar\phi_{0:k+1|k+1}(f_{k+1}) = \frac{\sum_{i=1}^N \omega_k^i \int f_{k+1}(\xi_{0:k}^i, x) \, g_{k+1}(x) \, Q(\xi_k^i, dx)}{\sum_{i=1}^N \omega_k^i \int g_{k+1}(x) \, Q(\xi_k^i, dx)}.$$
The distribution $\bar\phi_{0:k+1|k+1}$ is sometimes called the empirical filtering distribution. It is in some sense the best approximation to $\phi_{0:k+1|k+1}$ based on the knowledge of $\hat\phi_{0:k|k}$. It is obviously not, in general, a distribution supported by a finite set of points!

96 Particle Methods for Hidden Markov Models - EPFL, 7 Dec 2004 Sequential Monte Carlo Reinterpreted

The empirical filtering distribution is a mixture distribution:
$$\bar\phi_{0:k+1|k+1} = \sum_{i=1}^N \frac{\omega_k^i \, \gamma_k(\xi_k^i)}{\sum_{j=1}^N \omega_k^j \, \gamma_k(\xi_k^j)} \, \delta_{\xi_{0:k}^i} \otimes T_k(\xi_k^i, \cdot),$$
where
$$\gamma_k(x) = \int Q(x, dx') \, g_{k+1}(x'), \qquad T_k(x, A) = \frac{\int_A Q(x, dx') \, g_{k+1}(x')}{\gamma_k(x)}.$$
Direct sampling from this distribution is usually not possible (because sampling from $T_k$ and evaluating $\gamma_k$ aren't either).

97 Particle Methods for Hidden Markov Models - EPFL, 7 Dec 2004 Auxiliary Sampling

But we may in general use importance sampling or SIR, proposing new points $\tilde\xi_{k+1}^1, \ldots, \tilde\xi_{k+1}^N$ under the mixture
$$\rho_{0:k+1}(f_{k+1}) = \sum_{i=1}^N \frac{\omega_k^i \, \tau_k^i}{\sum_{j=1}^N \omega_k^j \, \tau_k^j} \int f(\xi_{0:k}^i, x) \, R_k(\xi_k^i, dx),$$
where $\tau_k^1, \ldots, \tau_k^N$ are user-selected adjustment weights and $R_k$ is a kernel which is easy to sample from. In doing so, we first need to draw mixture component indicators $I_k^1, \ldots, I_k^N$.

It is easily checked that the importance weights are then given by
$$\omega_{k+1}^i = \frac{g_{k+1}(\tilde\xi_{k+1}^i)}{\tau_k^{I_k^i}} \, \frac{dQ(\xi_k^{I_k^i}, \cdot)}{dR_k(\xi_k^{I_k^i}, \cdot)}(\tilde\xi_{k+1}^i).$$

This strategy, named auxiliary sampling and proposed by Pitt & Shephard (1999), is often useful in practice when combined with clever ways of setting $\{\tau_k^i\}_{i=1,\ldots,N}$ and $R_k$.
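One auxiliary-sampling step can be sketched as follows (our own code; `tau`, `sample_R` and `g` are hypothetical user-supplied callables, and we take $R_k = Q$ so that the Radon-Nikodym term equals 1).

```python
import numpy as np

def auxiliary_step(xi, w, y_next, tau, sample_R, g, rng):
    """First-stage selection with adjustment weights tau_k^i, then
    propagation and second-stage reweighting g_{k+1} / tau (R_k = Q)."""
    n = len(xi)
    t = tau(xi, y_next)                     # adjustment weights tau_k^i
    probs = w * t
    probs = probs / probs.sum()
    idx = rng.choice(n, size=n, p=probs)    # draw mixture indicators I_k^i
    xi_new = sample_R(xi[idx], rng)         # propagate selected particles
    w_new = g(xi_new, y_next) / t[idx]      # second-stage importance weights
    return xi_new, w_new / w_new.sum()

# Toy usage: Gaussian random-walk dynamics and Gaussian observation; the
# adjustment weight is the likelihood at the predicted mean, which for a
# random walk is simply the current particle position.
rng = np.random.default_rng(3)
xi = rng.normal(size=200)
w = np.full(200, 1.0 / 200)
g = lambda x, y: np.exp(-0.5 * (y - x) ** 2 / 0.1)
xi1, w1 = auxiliary_step(
    xi, w, y_next=0.3,
    tau=lambda x, y: g(x, y),
    sample_R=lambda x, r: x + 0.2 * r.normal(size=x.shape),
    g=g, rng=rng,
)
```

This choice of adjustment weights pre-selects particles likely to produce offspring compatible with the next observation, which is the "clever setting" alluded to above.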

98 Particle Methods for Hidden Markov Models - EPFL, 7 Dec 2004 IID Sampling

It is interesting to consider what happens in cases where sampling from $T_k$ and evaluating $\gamma_k$ is feasible (i.e., when $\tau_k^i = \gamma_k(\xi_k^i)$ and $R_k = T_k$):

Weight computation: For $i = 1, \ldots, N$, compute the (unnormalized) importance weights $\alpha_k^i = \gamma_k(\xi_k^i)$.

Selection: Draw $I_{k+1}^1, \ldots, I_{k+1}^N$ conditionally i.i.d. given $\{\xi_{0:k}^i\}_{1 \le i \le N}$, with probabilities $\mathrm{P}(I_{k+1}^i = j)$ proportional to $\alpha_k^j$, $j = 1, \ldots, N$.

Sampling: Draw $\xi_{k+1}^1, \ldots, \xi_{k+1}^N$ conditionally independently given $\{\xi_{0:k}^i\}_{1 \le i \le N}$ and $\{I_{k+1}^i\}_{1 \le i \le N}$, with distribution $\xi_{k+1}^i \sim T_k(\xi_k^{I_{k+1}^i}, \cdot)$. Set $\xi_{0:k+1}^i = (\xi_{0:k}^{I_{k+1}^i}, \xi_{k+1}^i)$ and $\omega_{k+1}^i = 1$ for $i = 1, \ldots, N$.
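For a linear-Gaussian model such as the noisy AR(1) example used earlier, $\gamma_k$ and $T_k$ are available in closed form, so the selection-then-sampling step can be written out directly. This is our own sketch (parameter values borrowed from the toy example), not code from the book.

```python
import numpy as np

mu, phi, sigma2, eta2 = 0.9, 0.95, 0.01, 0.02   # toy-model parameters

def iid_sampling_step(xi, y_next, rng):
    """Selection first (weights alpha_k^i = gamma_k(xi_k^i)), then
    conditionally independent sampling from the optimal kernel T_k."""
    n = len(xi)
    m = mu + phi * (xi - mu)                     # one-step predictive means
    s2 = sigma2 + eta2
    # gamma_k(x) = integral Q(x, dx') g_{k+1}(x') = N(y; m(x), sigma2 + eta2)
    alpha = np.exp(-0.5 * (y_next - m) ** 2 / s2)
    idx = rng.choice(n, size=n, p=alpha / alpha.sum())   # selection
    # T_k(x, .) is the Gaussian posterior of X_{k+1} given X_k = x and y
    post_mean = m[idx] + (sigma2 / s2) * (y_next - m[idx])
    post_std = np.sqrt(sigma2 * eta2 / s2)
    return post_mean + post_std * rng.normal(size=n)     # sampling

rng = np.random.default_rng(4)
xi = mu + 0.3 * rng.normal(size=500)
xi_next = iid_sampling_step(xi, y_next=1.0, rng=rng)
```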

99 Particle Methods for Hidden Markov Models - EPFL, 7 Dec 2004 IID Sampling (contd.)

Compared with the SISR algorithm for the particular choice $R_k = T_k$, the IID sampling algorithm differs only by the order in which the sampling (or mutation) and selection operations are performed.

The SISR algorithm prescribes that each trajectory be first extended by setting $\xi_{0:k+1}^i = (\xi_{0:k}^i, \tilde\xi_{k+1}^i)$, where $\tilde\xi_{k+1}^i$ is drawn from $T_k(\xi_k^i, \cdot)$. Then resampling is performed in the population of extended trajectories according to their importance weights.

In contrast, the IID sampling algorithm first selects the trajectories based on the weights $\alpha_k^i$ and then simulates an independent extension for each selected trajectory. The new particles $\xi_{k+1}^1, \ldots, \xi_{k+1}^N$ are conditionally independent given the current generation of particles $\{\xi_k^i\}_{i=1,\ldots,N}$.

This is of course only possible because the optimal importance kernel $T_k$ is used as instrumental kernel, which renders the incremental weights independent of the position of the particle at index $k+1$ and thus allows for early selection. This way of proceeding is provably better than SISR with $R_k = T_k$ (see text).

103 Particle Methods for Hidden Markov Models - EPFL, 7 Dec 2004 Roadmap

1. What is a Hidden Markov Model?
2. Filtering and Smoothing Recursions
3. Monte Carlo, Importance Sampling and Sampling Importance Resampling
4. Sequential Importance Sampling
5. Sequential Importance Sampling with Resampling
6. More Sequential Monte Carlo Algorithms
7. Approximation of Sum Functionals and Parameter Estimation

104 Particle Methods for Hidden Markov Models - EPFL, 7 Dec 2004 EM and Friends in HMMs: A Long Story Made Short

If the HMM has some unknown parameters $\theta$, likelihood-based parameter inference, be it through Expectation-Maximization (EM) or gradient-based approaches, (only) requires the ability to compute quantities of the form
$$\mathrm{E}\left[\left.\sum_{k=0}^{n-1} s_i(X_k, X_{k+1}) \,\right|\, Y_{0:n};\ \theta\right]$$
for some model-dependent functions $s_i$.

If exact computation is not feasible, we may use approximate Monte Carlo evaluation in combination with variants of the former methods (MCEM, SAME, SAEM, stochastic gradient, etc.).

Are sequential Monte Carlo methods appropriate for this task?
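A naive particle approximation of such a quantity sums the functional along each stored particle path and averages with the normalized weights. This is our own sketch; the path-degeneracy issues that motivate the more refined approximations in the book apply to it.

```python
import numpy as np

def smoothed_sum_estimate(paths, weights, s):
    """Estimate E[ sum_{k=0}^{n-1} s(X_k, X_{k+1}) | Y_{0:n} ] from N
    weighted particle paths; `paths` has shape (N, n+1) and `s` must be
    vectorized over its two arguments."""
    paths = np.asarray(paths, dtype=float)
    wbar = np.asarray(weights, dtype=float)
    wbar = wbar / wbar.sum()                          # normalize the weights
    path_sums = np.sum(s(paths[:, :-1], paths[:, 1:]), axis=1)
    return float(np.sum(wbar * path_sums))

# Deterministic check: s(x, x') = x' - x telescopes to x_n - x_0.
paths = np.array([[0.0, 1.0, 2.0], [1.0, 2.0, 4.0]])
est = smoothed_sum_estimate(paths, [0.5, 0.5], lambda a, b: b - a)
# path sums are 2.0 and 3.0, so the weighted estimate is 2.5
```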


More information

2D Image Processing (Extended) Kalman and particle filter

2D Image Processing (Extended) Kalman and particle filter 2D Image Processing (Extended) Kalman and particle filter Prof. Didier Stricker Dr. Gabriele Bleser Kaiserlautern University http://ags.cs.uni-kl.de/ DFKI Deutsches Forschungszentrum für Künstliche Intelligenz

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

Answers and expectations

Answers and expectations Answers and expectations For a function f(x) and distribution P(x), the expectation of f with respect to P is The expectation is the average of f, when x is drawn from the probability distribution P E

More information

Basic Sampling Methods

Basic Sampling Methods Basic Sampling Methods Sargur Srihari srihari@cedar.buffalo.edu 1 1. Motivation Topics Intractability in ML How sampling can help 2. Ancestral Sampling Using BNs 3. Transforming a Uniform Distribution

More information

Sequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them

Sequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them HMM, MEMM and CRF 40-957 Special opics in Artificial Intelligence: Probabilistic Graphical Models Sharif University of echnology Soleymani Spring 2014 Sequence labeling aking collective a set of interrelated

More information

Monte Carlo Approximation of Monte Carlo Filters

Monte Carlo Approximation of Monte Carlo Filters Monte Carlo Approximation of Monte Carlo Filters Adam M. Johansen et al. Collaborators Include: Arnaud Doucet, Axel Finke, Anthony Lee, Nick Whiteley 7th January 2014 Context & Outline Filtering in State-Space

More information

Particle Filtering a brief introductory tutorial. Frank Wood Gatsby, August 2007

Particle Filtering a brief introductory tutorial. Frank Wood Gatsby, August 2007 Particle Filtering a brief introductory tutorial Frank Wood Gatsby, August 2007 Problem: Target Tracking A ballistic projectile has been launched in our direction and may or may not land near enough to

More information

Note Set 5: Hidden Markov Models

Note Set 5: Hidden Markov Models Note Set 5: Hidden Markov Models Probabilistic Learning: Theory and Algorithms, CS 274A, Winter 2016 1 Hidden Markov Models (HMMs) 1.1 Introduction Consider observed data vectors x t that are d-dimensional

More information

LIMIT THEOREMS FOR WEIGHTED SAMPLES WITH APPLICATIONS TO SEQUENTIAL MONTE CARLO METHODS

LIMIT THEOREMS FOR WEIGHTED SAMPLES WITH APPLICATIONS TO SEQUENTIAL MONTE CARLO METHODS ESAIM: ROCEEDIGS, September 2007, Vol.19, 101-107 Christophe Andrieu & Dan Crisan, Editors DOI: 10.1051/proc:071913. LIMIT THEOREMS FOR WEIGHTED SAMLES WITH ALICATIOS TO SEQUETIAL MOTE CARLO METHODS R.

More information

Lecture 13 : Variational Inference: Mean Field Approximation

Lecture 13 : Variational Inference: Mean Field Approximation 10-708: Probabilistic Graphical Models 10-708, Spring 2017 Lecture 13 : Variational Inference: Mean Field Approximation Lecturer: Willie Neiswanger Scribes: Xupeng Tong, Minxing Liu 1 Problem Setup 1.1

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Lecture 12 Dynamical Models CS/CNS/EE 155 Andreas Krause Homework 3 out tonight Start early!! Announcements Project milestones due today Please email to TAs 2 Parameter learning

More information

Appendices: Stochastic Backpropagation and Approximate Inference in Deep Generative Models

Appendices: Stochastic Backpropagation and Approximate Inference in Deep Generative Models Appendices: Stochastic Backpropagation and Approximate Inference in Deep Generative Models Danilo Jimenez Rezende Shakir Mohamed Daan Wierstra Google DeepMind, London, United Kingdom DANILOR@GOOGLE.COM

More information

Variational Inference (11/04/13)

Variational Inference (11/04/13) STA561: Probabilistic machine learning Variational Inference (11/04/13) Lecturer: Barbara Engelhardt Scribes: Matt Dickenson, Alireza Samany, Tracy Schifeling 1 Introduction In this lecture we will further

More information

Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D.

Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D. Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D. Ruppert A. EMPIRICAL ESTIMATE OF THE KERNEL MIXTURE Here we

More information

Lecture 3. G. Cowan. Lecture 3 page 1. Lectures on Statistical Data Analysis

Lecture 3. G. Cowan. Lecture 3 page 1. Lectures on Statistical Data Analysis Lecture 3 1 Probability (90 min.) Definition, Bayes theorem, probability densities and their properties, catalogue of pdfs, Monte Carlo 2 Statistical tests (90 min.) general concepts, test statistics,

More information

Computational statistics

Computational statistics Computational statistics Markov Chain Monte Carlo methods Thierry Denœux March 2017 Thierry Denœux Computational statistics March 2017 1 / 71 Contents of this chapter When a target density f can be evaluated

More information

Robert Collins CSE586, PSU Intro to Sampling Methods

Robert Collins CSE586, PSU Intro to Sampling Methods Intro to Sampling Methods CSE586 Computer Vision II Penn State Univ Topics to be Covered Monte Carlo Integration Sampling and Expected Values Inverse Transform Sampling (CDF) Ancestral Sampling Rejection

More information

Basic math for biology

Basic math for biology Basic math for biology Lei Li Florida State University, Feb 6, 2002 The EM algorithm: setup Parametric models: {P θ }. Data: full data (Y, X); partial data Y. Missing data: X. Likelihood and maximum likelihood

More information

Kalman filtering and friends: Inference in time series models. Herke van Hoof slides mostly by Michael Rubinstein

Kalman filtering and friends: Inference in time series models. Herke van Hoof slides mostly by Michael Rubinstein Kalman filtering and friends: Inference in time series models Herke van Hoof slides mostly by Michael Rubinstein Problem overview Goal Estimate most probable state at time k using measurement up to time

More information

The Expectation-Maximization Algorithm

The Expectation-Maximization Algorithm 1/29 EM & Latent Variable Models Gaussian Mixture Models EM Theory The Expectation-Maximization Algorithm Mihaela van der Schaar Department of Engineering Science University of Oxford MLE for Latent Variable

More information

Approximate Bayesian Computation

Approximate Bayesian Computation Approximate Bayesian Computation Michael Gutmann https://sites.google.com/site/michaelgutmann University of Helsinki and Aalto University 1st December 2015 Content Two parts: 1. The basics of approximate

More information

Dimension Reduction. David M. Blei. April 23, 2012

Dimension Reduction. David M. Blei. April 23, 2012 Dimension Reduction David M. Blei April 23, 2012 1 Basic idea Goal: Compute a reduced representation of data from p -dimensional to q-dimensional, where q < p. x 1,...,x p z 1,...,z q (1) We want to do

More information

Bayesian Estimation of DSGE Models 1 Chapter 3: A Crash Course in Bayesian Inference

Bayesian Estimation of DSGE Models 1 Chapter 3: A Crash Course in Bayesian Inference 1 The views expressed in this paper are those of the authors and do not necessarily reflect the views of the Federal Reserve Board of Governors or the Federal Reserve System. Bayesian Estimation of DSGE

More information

Bayesian Machine Learning - Lecture 7

Bayesian Machine Learning - Lecture 7 Bayesian Machine Learning - Lecture 7 Guido Sanguinetti Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh gsanguin@inf.ed.ac.uk March 4, 2015 Today s lecture 1

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

April 20th, Advanced Topics in Machine Learning California Institute of Technology. Markov Chain Monte Carlo for Machine Learning

April 20th, Advanced Topics in Machine Learning California Institute of Technology. Markov Chain Monte Carlo for Machine Learning for for Advanced Topics in California Institute of Technology April 20th, 2017 1 / 50 Table of Contents for 1 2 3 4 2 / 50 History of methods for Enrico Fermi used to calculate incredibly accurate predictions

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Brown University CSCI 2950-P, Spring 2013 Prof. Erik Sudderth Lecture 12: Gaussian Belief Propagation, State Space Models and Kalman Filters Guest Kalman Filter Lecture by

More information

If we want to analyze experimental or simulated data we might encounter the following tasks:

If we want to analyze experimental or simulated data we might encounter the following tasks: Chapter 1 Introduction If we want to analyze experimental or simulated data we might encounter the following tasks: Characterization of the source of the signal and diagnosis Studying dependencies Prediction

More information

Hidden Markov Models. Aarti Singh Slides courtesy: Eric Xing. Machine Learning / Nov 8, 2010

Hidden Markov Models. Aarti Singh Slides courtesy: Eric Xing. Machine Learning / Nov 8, 2010 Hidden Markov Models Aarti Singh Slides courtesy: Eric Xing Machine Learning 10-701/15-781 Nov 8, 2010 i.i.d to sequential data So far we assumed independent, identically distributed data Sequential data

More information

13 Notes on Markov Chain Monte Carlo

13 Notes on Markov Chain Monte Carlo 13 Notes on Markov Chain Monte Carlo Markov Chain Monte Carlo is a big, and currently very rapidly developing, subject in statistical computation. Many complex and multivariate types of random data, useful

More information

Introduction to Particle Filters for Data Assimilation

Introduction to Particle Filters for Data Assimilation Introduction to Particle Filters for Data Assimilation Mike Dowd Dept of Mathematics & Statistics (and Dept of Oceanography Dalhousie University, Halifax, Canada STATMOS Summer School in Data Assimila5on,

More information

ECE521 Lecture 19 HMM cont. Inference in HMM

ECE521 Lecture 19 HMM cont. Inference in HMM ECE521 Lecture 19 HMM cont. Inference in HMM Outline Hidden Markov models Model definitions and notations Inference in HMMs Learning in HMMs 2 Formally, a hidden Markov model defines a generative process

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin

More information

TSRT14: Sensor Fusion Lecture 8

TSRT14: Sensor Fusion Lecture 8 TSRT14: Sensor Fusion Lecture 8 Particle filter theory Marginalized particle filter Gustaf Hendeby gustaf.hendeby@liu.se TSRT14 Lecture 8 Gustaf Hendeby Spring 2018 1 / 25 Le 8: particle filter theory,

More information

AUTOMOTIVE ENVIRONMENT SENSORS

AUTOMOTIVE ENVIRONMENT SENSORS AUTOMOTIVE ENVIRONMENT SENSORS Lecture 5. Localization BME KÖZLEKEDÉSMÉRNÖKI ÉS JÁRMŰMÉRNÖKI KAR 32708-2/2017/INTFIN SZÁMÚ EMMI ÁLTAL TÁMOGATOTT TANANYAG Related concepts Concepts related to vehicles moving

More information

The Unscented Particle Filter

The Unscented Particle Filter The Unscented Particle Filter Rudolph van der Merwe (OGI) Nando de Freitas (UC Bereley) Arnaud Doucet (Cambridge University) Eric Wan (OGI) Outline Optimal Estimation & Filtering Optimal Recursive Bayesian

More information

Lecture 7 and 8: Markov Chain Monte Carlo

Lecture 7 and 8: Markov Chain Monte Carlo Lecture 7 and 8: Markov Chain Monte Carlo 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering University of Cambridge http://mlg.eng.cam.ac.uk/teaching/4f13/ Ghahramani

More information

The Hierarchical Particle Filter

The Hierarchical Particle Filter and Arnaud Doucet http://go.warwick.ac.uk/amjohansen/talks MCMSki V Lenzerheide 7th January 2016 Context & Outline Filtering in State-Space Models: SIR Particle Filters [GSS93] Block-Sampling Particle

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

More information

A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models

A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models Jeff A. Bilmes (bilmes@cs.berkeley.edu) International Computer Science Institute

More information

MCMC and Gibbs Sampling. Kayhan Batmanghelich

MCMC and Gibbs Sampling. Kayhan Batmanghelich MCMC and Gibbs Sampling Kayhan Batmanghelich 1 Approaches to inference l Exact inference algorithms l l l The elimination algorithm Message-passing algorithm (sum-product, belief propagation) The junction

More information

Spring 2012 Math 541B Exam 1

Spring 2012 Math 541B Exam 1 Spring 2012 Math 541B Exam 1 1. A sample of size n is drawn without replacement from an urn containing N balls, m of which are red and N m are black; the balls are otherwise indistinguishable. Let X denote

More information

Sampling Methods (11/30/04)

Sampling Methods (11/30/04) CS281A/Stat241A: Statistical Learning Theory Sampling Methods (11/30/04) Lecturer: Michael I. Jordan Scribe: Jaspal S. Sandhu 1 Gibbs Sampling Figure 1: Undirected and directed graphs, respectively, with

More information

Hidden Markov Models. Vibhav Gogate The University of Texas at Dallas

Hidden Markov Models. Vibhav Gogate The University of Texas at Dallas Hidden Markov Models Vibhav Gogate The University of Texas at Dallas Intro to AI (CS 4365) Many slides over the course adapted from either Dan Klein, Luke Zettlemoyer, Stuart Russell or Andrew Moore 1

More information

STA 414/2104: Machine Learning

STA 414/2104: Machine Learning STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 9 Sequential Data So far

More information

CSC2535: Computation in Neural Networks Lecture 7: Variational Bayesian Learning & Model Selection

CSC2535: Computation in Neural Networks Lecture 7: Variational Bayesian Learning & Model Selection CSC2535: Computation in Neural Networks Lecture 7: Variational Bayesian Learning & Model Selection (non-examinable material) Matthew J. Beal February 27, 2004 www.variational-bayes.org Bayesian Model Selection

More information

STA205 Probability: Week 8 R. Wolpert

STA205 Probability: Week 8 R. Wolpert INFINITE COIN-TOSS AND THE LAWS OF LARGE NUMBERS The traditional interpretation of the probability of an event E is its asymptotic frequency: the limit as n of the fraction of n repeated, similar, and

More information

Lecture 6: Markov Chain Monte Carlo

Lecture 6: Markov Chain Monte Carlo Lecture 6: Markov Chain Monte Carlo D. Jason Koskinen koskinen@nbi.ku.dk Photo by Howard Jackman University of Copenhagen Advanced Methods in Applied Statistics Feb - Apr 2016 Niels Bohr Institute 2 Outline

More information