Scientific complement, École Doctorale MATISSE
IRISA and INRIA, Markov room, Thursday 26 January 2012
Algorithmic Variants and Theoretical Justifications
François Le Gland, INRIA Rennes and IRMAR
http://www.irisa.fr/aspi/legland/ed-matisse/
programme of the day
#1 more general models, from non linear and non Gaussian systems to hidden Markov models and partially observed Markov chains, so as to handle e.g. regime / mode switching, or correlation between state noise and observation noise
#2 for each of these models (or just for the most general model), representation of $P[X_{0:n} \in dx_{0:n} \mid Y_{0:n}]$ as a Gibbs Boltzmann distribution, with recursive formulation, and likewise for $P[X_n \in dx_n \mid Y_{0:n}]$
#3 particle approximation (SIS and SIR algorithms) from either representation
#4 asymptotic behaviour as sample size goes to infinity
#5 numerous algorithmic variants
1 (some notations : 1) some notations (continued)
if $X$ is a random variable taking values in $E$, then the mapping $\phi \mapsto E[\phi(X)]$, or equivalently $A \mapsto P[X \in A]$, defines a probability distribution $\mu$ on $E$, denoted as $\mu(dx) = P[X \in dx]$, and such that
$E[\phi(X)] = \int_E \phi(x)\, \mu(dx) = \langle \mu, \phi \rangle$ or $P[X \in A] = \mu(A)$
this distribution characterizes uncertainty about $X$
2 (some notations : 2) transition probability kernel $M(x,dx')$ on $E$
collection of probability distributions on $E$ indexed by $x \in E$
acts on functions according to
$M\phi(x) = \int_E M(x,dx')\, \phi(x')$
and acts on probability distributions according to
$\mu M(dx') = \int_E \mu(dx)\, M(x,dx')$
seen as a mixture distribution, characterized by
$\langle \mu M, \phi \rangle = \int_E \big[ \int_E \mu(dx)\, M(x,dx') \big]\, \phi(x') = \int_E \mu(dx)\, \big[ \int_E M(x,dx')\, \phi(x') \big] = \langle \mu, M\phi \rangle$
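On a finite state space the two actions of a kernel reduce to matrix-vector products, which makes the duality $\langle \mu M, \phi \rangle = \langle \mu, M\phi \rangle$ easy to check numerically. The sketch below is only an illustration: the three-state kernel `M`, the distribution `mu` and the function `phi` are made-up values, not taken from the slides.

```python
# Finite state space E = {0, 1, 2}: a transition kernel is a row-stochastic matrix.
# All numerical values below are hypothetical, chosen only for illustration.
M = [[0.9, 0.1, 0.0],
     [0.2, 0.7, 0.1],
     [0.0, 0.3, 0.7]]
mu = [0.5, 0.3, 0.2]      # probability distribution on E
phi = [1.0, -2.0, 4.0]    # test function on E


def kernel_on_function(kernel, func):
    """(M phi)(x) = sum over x' of M(x, x') phi(x')."""
    return [sum(row[y] * func[y] for y in range(len(func))) for row in kernel]


def distribution_on_kernel(dist, kernel):
    """(mu M)(x') = sum over x of mu(x) M(x, x'), a mixture of the rows of M."""
    n_out = len(kernel[0])
    return [sum(dist[x] * kernel[x][y] for x in range(len(dist))) for y in range(n_out)]


def pairing(dist, func):
    """<mu, phi> = sum over x of mu(x) phi(x)."""
    return sum(m * f for m, f in zip(dist, func))


lhs = pairing(distribution_on_kernel(mu, M), phi)   # <mu M, phi>
rhs = pairing(mu, kernel_on_function(M, phi))       # <mu, M phi>
```

Both `lhs` and `rhs` evaluate the same double sum in a different order, which is the finite-state counterpart of exchanging the two integrals.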
Non linear and non Gaussian systems, and beyond
non linear and non Gaussian systems
hidden Markov models
partially observed Markov chains
likelihood free models
3 (non linear and non Gaussian systems : 1) non linear and non Gaussian systems
prior model for hidden state taking values in $E$
$X_k = f_k(X_{k-1}, W_k)$ with $W_k \sim p_k^W(dw)$
initial condition $X_0 \sim \eta_0(dx)$
observation taking values in $R^d$, with additive noise admitting a density
$Y_k = h_k(X_k) + V_k$ with $V_k \sim q_k^V(v)\, dv$
random variables $X_0$, $W_1, \ldots, W_k, \ldots$ and $V_0, V_1, \ldots, V_k, \ldots$ are mutually independent, but not necessarily Gaussian
only requirement (to be used later on) : easy to
simulate a r.v. according to $\eta_0(dx)$ or to $p_k^W(dw)$
evaluate function $q_k^V(v)$ for any $v \in R^d$
4 (non linear and non Gaussian systems : 2) Proposition
hidden states $\{X_k\}$ form a Markov chain taking values in $E$, i.e.
$P[X_k \in dx' \mid X_{0:k-1}] = P[X_k \in dx' \mid X_{k-1}]$
characterization in terms of transition kernel
$P[X_k \in dx' \mid X_{k-1} = x] = Q_k(x,dx')$
defined (implicitly) by its action on functions
$Q_k\phi(x) = E[\phi(X_k) \mid X_{k-1} = x] = E[\phi(f_k(X_{k-1}, W_k)) \mid X_{k-1} = x] = \int \phi(f_k(x,w))\, p_k^W(dw)$
where the integral is taken over the space of the state noise
5 (non linear and non Gaussian systems : 3) Remark
easy to simulate next state $X_k$ given $X_{k-1} = x$, i.e. to simulate a r.v. according to $Q_k(x,dx')$ for a given $x \in E$ : indeed, set $X_k = f_k(x, W_k)$ where $W_k$ is simulated according to $p_k^W(dw)$
Remark
in general, transition kernel $Q_k(x,dx')$ does not admit a density : indeed, conditionally on $X_{k-1} = x$, r.v. $X_k$ necessarily belongs to subset
$M(x) = \{x' \in R^m : \text{there exists } w \in R^p \text{ such that } x' = f_k(x,w)\}$
if $p < m$ and under some mild regularity assumptions, this subset of $R^m$ has zero Lebesgue measure
therefore, conditionally on $X_{k-1} = x$, probability distribution $Q_k(x,dx')$ of r.v. $X_k$ cannot have a density w.r.t. Lebesgue measure on $R^m$
6 (non linear and non Gaussian systems : 4) Remark
if $f_k(x,w) = b_k(x) + w$ and if probability distribution $p_k^W(dw)$ of r.v. $W_k$ admits a density, still denoted $p_k^W(w)$, i.e. if
$X_k = b_k(X_{k-1}) + W_k$ with $W_k \sim p_k^W(w)\, dw$
then a more explicit expression is available
$Q_k(x,dx') = p_k^W(x' - b_k(x))\, dx'$
i.e. transition kernel $Q_k(x,dx')$ admits an (easy to evaluate) density
indeed, change of variable $x' = b_k(x) + w$ yields
$Q_k\phi(x) = \int_{R^m} \phi(b_k(x) + w)\, p_k^W(w)\, dw = \int_{R^m} \phi(x')\, p_k^W(x' - b_k(x))\, dx'$
7 (non linear and non Gaussian systems : 5) Proposition
observations $\{Y_k\}$ satisfy memoryless channel assumption, i.e.
$P[Y_{0:n} \in dy_{0:n} \mid X_{0:n}] = \prod_{k=0}^n P[Y_k \in dy_k \mid X_k]$
characterization in terms of emission density
$P[Y_k \in dy \mid X_k = x] = q_k^V(y - h_k(x))\, dy$
define likelihood function
$g_k(x) = q_k^V(Y_k - h_k(x))$
a quantitative measure of consistency between possible hidden state $x \in E$ and actual observation $Y_k$
Remark
easy to evaluate $g_k(x)$ for any $x \in E$
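The two requirements above (simulate from $Q_k(x,dx')$, evaluate $g_k(x)$) can be sketched for a scalar toy model. Everything specific here is an assumption made for illustration: the maps `f` and `h`, the Gaussian noises and the standard deviations `SIGMA_W` and `SIGMA_V` are hypothetical and do not come from the slides, which deliberately allow non Gaussian noise.

```python
import math
import random

SIGMA_W = 1.0   # hypothetical state noise standard deviation
SIGMA_V = 1.0   # hypothetical observation noise standard deviation


def f(k, x, w):
    """Hypothetical state map f_k(x, w)."""
    return 0.5 * x + 25.0 * x / (1.0 + x * x) + w


def h(k, x):
    """Hypothetical observation map h_k(x)."""
    return x * x / 20.0


def sample_next_state(k, x, rng):
    """Simulate X_k given X_{k-1} = x: draw W_k, then set X_k = f_k(x, W_k)."""
    w = rng.gauss(0.0, SIGMA_W)
    return f(k, x, w)


def q_v(v):
    """Density q_k^V of the observation noise (Gaussian here, by assumption)."""
    return math.exp(-0.5 * (v / SIGMA_V) ** 2) / (SIGMA_V * math.sqrt(2.0 * math.pi))


def likelihood(k, x, y):
    """g_k(x) = q_k^V(y - h_k(x)) for the actual observation y."""
    return q_v(y - h(k, x))
```

Note that simulating the next state never requires the density of $Q_k(x,dx')$, only the ability to draw the noise and apply `f`.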
8 (hidden Markov models : 1) hidden Markov models
motivating example : hybrid continuous / discrete systems
$X_k = f_k(s_{k-1}, X_{k-1}, W_k)$
$Y_k = h_k(X_k) + V_k$
where regime / mode sequence $\{s_k\}$ forms a Markov chain with finite state space
does not fit into non linear and non Gaussian systems, however
hidden states and modes $\{(X_k, s_k)\}$ jointly form a Markov chain
observations $\{Y_k\}$ satisfy memoryless channel assumption
i.e. fits into hidden Markov models
Remark
easy to simulate next state $(X_k, s_k)$ given $(X_{k-1}, s_{k-1}) = (x,s)$
9 (hidden Markov models : 2) more generally, hidden states $\{X_k\}$ could form a Markov chain taking values in a quite general space $E$, e.g.
hybrid continuous / discrete
differentiable manifold
constrained
graphical (collection of connected edges)
characterization in terms of transition kernel and initial distribution
$P[X_k \in dx' \mid X_{k-1} = x] = Q_k(x,dx')$ and $P[X_0 \in dx] = \eta_0(dx)$
joint probability distribution of hidden states $X_{0:n}$ satisfies
$P[X_{0:n} \in dx_{0:n}] = \eta_0(dx_0)\, \prod_{k=1}^n Q_k(x_{k-1}, dx_k)$
10 (hidden Markov models : 3) user should respect displacement constraints due to obstacles, as read on map
11 (hidden Markov models : 4) simplified model : user walks on a Voronoi graph, displacement constraints due to obstacles are taken automatically into account
12 (hidden Markov models : 5) observations $\{Y_k\}$ could satisfy memoryless channel assumption, i.e.
$P[Y_{0:n} \in dy_{0:n} \mid X_{0:n}] = \prod_{k=0}^n P[Y_k \in dy_k \mid X_k]$
characterization in terms of emission density
$P[Y_k \in dy \mid X_k = x] = g_k(x,y)\, \lambda_k^F(dy)$
where nonnegative measure $\lambda_k^F(dy)$ defined on $F$ does not depend on $x \in E$
define (abuse of notation) likelihood function as
$g_k(x) = g_k(x, Y_k)$
a quantitative measure of consistency between $x \in E$ and observation $Y_k$
joint conditional distribution of observations $Y_{0:n}$ given hidden states $X_{0:n}$ satisfies
$P[Y_{0:n} \in dy_{0:n} \mid X_{0:n} = x_{0:n}] = \prod_{k=0}^n g_k(x_k, y_k)\; \lambda_0^F(dy_0) \cdots \lambda_n^F(dy_n)$
13 (hidden Markov models : 6) representation as a graph
hidden chain : $X_{k-1} \to X_k \to X_{k+1}$
observations : $X_{k-1} \to Y_{k-1}$, $X_k \to Y_k$, $X_{k+1} \to Y_{k+1}$
arrows represent dependency between random variables
only requirement (to be used later on) : easy to
simulate for any $x \in E$, a r.v. according to transition kernel $Q_k(x,dx')$
evaluate for any $x' \in E$, likelihood function $g_k(x')$
14 (hidden Markov models : 7) hidden Markov models : importance decomposition
motivation : from simulations seen last week, basic paradigm
particles move according to prior model, described by its transition kernel
new particles are weighted by evaluating likelihood function
hopefully, resulting weighted empirical distribution provides a reasonable approximation to the intractable Bayesian filter
concerns / questions :
is this safe?
could more information be used in mutation step?
15 (hidden Markov models : 8) recall indoor navigation example : if user is detected by a beacon with known location $a$ and with finite range $R$, then necessarily user position is within detection disk centered at $a$ and with radius $R$
in other words, generating particles according to prior model alone could result in (a few, some, many, all) particles outside detection disk, i.e. useless, wasted particles
why not generate explicitly all new particles within disk, and accommodate for wrong model by changing weights?
more generally, why not (and how) use next observation to move particles?
ideal situation would be : particles move according to posterior model
warning
16 to 25 (hidden Markov models : 9 to 18) figures (plots not reproduced here; horizontal axis ranges from -5 to 5)
Figure 1: Prior density and generated sample (sample view)
Figure 2: Prior density and histogram associated with generated sample (histogram view)
Figure 3a: Prior density, likelihood function, posterior density and weighted sample
Figure 4a: Prior density, likelihood function, posterior density and histogram associated with weighted sample
Figure 3b: Prior density, likelihood function, posterior density and weighted sample (more difficult)
Figure 4b: Prior density, likelihood function, posterior density and histogram associated with weighted sample (more difficult)
Figure 3c: Prior density, likelihood function, posterior density and weighted sample (just impossible)
26 (hidden Markov models : 19) possible (non unique) decomposition
$\gamma_0(dx) = g_0(x)\, \eta_0(dx) = g_0^{imp}(x)\, \eta_0^{imp}(dx)$
and
$R_k(x,dx') = Q_k(x,dx')\, g_k(x') = g_k^{imp}(x,x')\, Q_k^{imp}(x,dx')$
as product of
a nonnegative weight function $g_0^{imp}(x)$ or $g_k^{imp}(x,x')$
a probability distribution $\eta_0^{imp}(dx)$ or a transition kernel $Q_k^{imp}(x,dx')$
respectively
only requirement about proposed decomposition : easy to
simulate a r.v. according to $\eta_0^{imp}(dx)$
simulate for any $x \in E$, a r.v. according to $Q_k^{imp}(x,dx')$
evaluate for any $x, x' \in E$, weighting function $g_k^{imp}(x,x')$
caution : evaluating weighting function $g_k^{imp}(x,x')$ requires some knowledge about transition kernels $Q_k^{imp}(x,dx')$ and $Q_k(x,dx')$ (was not required originally)
27 (hidden Markov models : 20) popular (optimal) importance decomposition : blind vs. guided mutation
$P[X_k \in dx', Y_k \in dy' \mid X_{k-1} = x] = P[Y_k \in dy' \mid X_k = x', X_{k-1} = x]\; P[X_k \in dx' \mid X_{k-1} = x] = g_k(x',y')\, \lambda_k(dy')\; Q_k(x,dx')$
and alternatively
$P[X_k \in dx', Y_k \in dy' \mid X_{k-1} = x] = P[X_k \in dx' \mid Y_k = y', X_{k-1} = x]\; P[Y_k \in dy' \mid X_{k-1} = x] = \widehat{Q}_k(x,y',dx')\; \widehat{g}_k(x,y')\, \lambda_k(dy')$
i.e.
$R_k(x,dx') = g_k(x')\, Q_k(x,dx') = \widehat{g}_k(x)\, \widehat{Q}_k(x,dx')$
with (abuse of notation)
$\widehat{g}_k(x) = \widehat{g}_k(x, Y_k)$ and $\widehat{Q}_k(x,dx') = \widehat{Q}_k(x, Y_k, dx')$
28 (hidden Markov models : 21) remaining question : how easy is it to
simulate for any $x \in E$, a r.v. according to $\widehat{Q}_k(x,dx')$?
evaluate for any $x \in E$, weighting function $\widehat{g}_k(x)$?
positive answer in special case : linear observations and additive Gaussian noise
$X_k = f_k(X_{k-1}) + \sigma_k(X_{k-1})\, W_k$
$Y_k = H_k X_k + V_k$
indeed (for simplicity, assume $\sigma_k(x) = I$)
$Y_k = H_k [f_k(X_{k-1}) + W_k] + V_k = H_k f_k(X_{k-1}) + (H_k W_k + V_k)$
conditionally on $X_{k-1} = x$, r.v. $(X_k, Y_k)$ is jointly Gaussian, with mean vector and covariance matrix
$\begin{pmatrix} f_k(x) \\ H_k f_k(x) \end{pmatrix}$ and $\begin{pmatrix} Q_k^W & Q_k^W H_k^\top \\ H_k Q_k^W & H_k Q_k^W H_k^\top + Q_k^V \end{pmatrix}$
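In the scalar case, the jointly Gaussian structure above gives the guided kernel and its weight in closed form by standard Gaussian conditioning. The following is a minimal sketch under that scalar assumption with $\sigma_k(x) = 1$; the function names and argument names (`fx` for $f_k(x)$, `qw` and `qv` for the noise variances) are hypothetical.

```python
import math
import random


def optimal_proposal(fx, H, qw, qv, y):
    """Scalar model X' = fx + W, Y = H X' + V, with W ~ N(0, qw), V ~ N(0, qv).

    Returns (mean, var, g_hat) where N(mean, var) is the conditional law of X'
    given Y = y and X_{k-1} = x, and g_hat is the density of Y at y given
    X_{k-1} = x, i.e. the weight attached to the guided mutation.
    """
    s = H * qw * H + qv                  # variance of Y given X_{k-1} = x
    gain = qw * H / s                    # regression coefficient of X' on Y
    mean = fx + gain * (y - H * fx)
    var = qw - gain * H * qw
    g_hat = math.exp(-0.5 * (y - H * fx) ** 2 / s) / math.sqrt(2.0 * math.pi * s)
    return mean, var, g_hat


def sample_guided(fx, H, qw, qv, y, rng):
    """Draw X' from the guided kernel and return it with its weight g_hat."""
    mean, var, g_hat = optimal_proposal(fx, H, qw, qv, y)
    return rng.gauss(mean, math.sqrt(var)), g_hat
```

For instance, with `fx = 0`, `H = 1`, `qw = qv = 1` and `y = 2`, the guided kernel is centered halfway between the prediction and the observation, with variance 1/2.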
29 (partially observed Markov chains : 1) partially observed Markov chains
motivating example : assume (unsynchronized) sensors take noisy observations of different components of hidden state at different time instants, e.g. $X_k = (X_k^1, X_k^2)$ and for simplicity
$X_k = f(X_{k-1}) + W_k$
$Y_k = h^1(X_k^1) + V_k^1$ at odd time instants
$Y_k = H^2 X_k^2 + V_k^2$ at even time instants
observing all components of hidden state is fine, but processing partial observations at each time instant can be risky, since likelihood functions will be flat along some directions : ideally, try to collect and process simultaneously two successive observations, so that likelihood functions are more peaked
30 (partially observed Markov chains : 2) down sampling : set
$\bar{X}_k = X_{2k+1}$ and $\bar{Y}_k = (Y_{2k+1}, Y_{2k+2})$
state equation
$\bar{X}_k = X_{2k+1} = f(X_{2k}) + W_{2k+1} = f(f(X_{2k-1}) + W_{2k}) + W_{2k+1}$
i.e.
$\bar{X}_k = \bar{f}(\bar{X}_{k-1}, \bar{W}_k)$ with $\bar{W}_k = (W_{2k}, W_{2k+1})$
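31 (partially observed Markov chains : 3) observation equation : introducing projections $\pi_1$ and $\pi_2$ on 1st and 2nd components of state vector, yields
$\bar{Y}_k = \begin{pmatrix} Y_{2k+1} \\ Y_{2k+2} \end{pmatrix} = \begin{pmatrix} h^1(X_{2k+1}^1) + V_{2k+1}^1 \\ H^2 X_{2k+2}^2 + V_{2k+2}^2 \end{pmatrix} = \begin{pmatrix} h^1(\pi_1(X_{2k+1})) + V_{2k+1}^1 \\ H^2 \pi_2(f(X_{2k+1}) + W_{2k+2}) + V_{2k+2}^2 \end{pmatrix}$
i.e.
$\bar{Y}_k = \bar{h}(\bar{X}_k) + \bar{V}_k$ with $\bar{V}_k = \begin{pmatrix} V_{2k+1}^1 \\ H^2 \pi_2(W_{2k+2}) + V_{2k+2}^2 \end{pmatrix}$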
32 (partially observed Markov chains : 4) resulting system
$\bar{X}_k = \bar{f}(\bar{X}_{k-1}, \bar{W}_k)$
$\bar{Y}_k = \bar{h}(\bar{X}_k) + \bar{V}_k$
with
$\bar{W}_k = (W_{2k}, W_{2k+1})$ and $\bar{V}_k = (V_{2k+1}^1,\; H^2 \pi_2(W_{2k+2}) + V_{2k+2}^2)$
clearly $\bar{W}_k$ and $\bar{V}_{k-1}$ share $W_{2k}$ in common and are correlated, hence dependent, and memoryless channel assumption cannot hold
33 (partially observed Markov chains : 5) trick : decompose
$\bar{W}_k = M\, \bar{V}_{k-1} + B_k$
where $B_k$ and $\bar{V}_{k-1}$ are now independent, substitute in state equation and import $\bar{V}_{k-1} = \bar{Y}_{k-1} - \bar{h}(\bar{X}_{k-1})$ from observation equation, yielding
$\bar{X}_k = \bar{f}(\bar{X}_{k-1},\; M\, (\bar{Y}_{k-1} - \bar{h}(\bar{X}_{k-1})) + B_k)$
$\bar{Y}_k = \bar{h}(\bar{X}_k) + \bar{V}_k$
does not fit into hidden Markov model : hidden state alone does not form a Markov chain
however, hidden states and observations $\{(\bar{X}_k, \bar{Y}_k)\}$ jointly form a Markov chain, only the second component of which is observed
34 (partially observed Markov chains : 6) even more generally, with previous motivating example in mind, hidden states and observations $\{(X_k, Y_k)\}$ could jointly form a Markov chain taking values in product space $E \times F$
characterization in terms of transition kernel
$P[X_k \in dx', Y_k \in dy' \mid X_{k-1} = x, Y_{k-1} = y] = R_k(x,y,y',dx')\, \lambda_k^F(y,dy')$
and initial distribution
$P[X_0 \in dx, Y_0 \in dy] = \gamma_0(y,dx)\, \lambda_0^F(dy)$
caution : hidden states $\{X_k\}$ alone need not form a Markov chain
joint probability distribution of hidden states and observations $(X_{0:n}, Y_{0:n})$
$P[X_{0:n} \in dx_{0:n}, Y_{0:n} \in dy_{0:n}] = \gamma_0(y_0,dx_0)\, \prod_{k=1}^n R_k(x_{k-1}, y_{k-1}, y_k, dx_k)\; \lambda_0^F(dy_0)\, \prod_{k=1}^n \lambda_k^F(y_{k-1}, dy_k)$
35 (partially observed Markov chains : 7) partially observed Markov chains : importance decomposition
required (non unique) decomposition
$\gamma_0(dx) = g_0^{imp}(x)\, \eta_0^{imp}(dx)$
and
$R_k(x,dx') = g_k^{imp}(x,x')\, Q_k^{imp}(x,dx')$
as product of
a nonnegative weight function $g_0^{imp}(x)$ or $g_k^{imp}(x,x')$
a probability distribution $\eta_0^{imp}(dx)$ or a transition kernel $Q_k^{imp}(x,dx')$
respectively
only requirement about proposed decomposition : easy to
simulate a r.v. according to $\eta_0^{imp}(dx)$
simulate for any $x \in E$, a r.v. according to $Q_k^{imp}(x,dx')$
evaluate for any $x, x' \in E$, weighting function $g_k^{imp}(x,x')$
36 (likelihood free models : 1) likelihood free models
so far, at least implicitly, additive observation noise has been assumed
$Y_k = h(X_k) + V_k$ with $V_k \sim q_k^V(v)\, dv$
with known and explicit form for probability density $q_k^V(v)$
this was the key assumption in deriving the expression of the emission density
$P[Y_k \in dy \mid X_k = x] = g_k(x,y)\, \lambda_k(dy)$
hence an explicit expression of the likelihood function
questions : could anything be said in more general cases where
no explicit expression is available for a density, or it does not even exist
non additive observation noise, with dimension smaller than observation, i.e. $Y_k = h(X_k, V_k)$
perfect observations, i.e. observation noise is simply not present, $Y_k = h(X_k)$
37 (likelihood free models : 2) trick, a form of ABC (approximate Bayesian computation) : pretend that observations are produced by a slightly perturbed but regular model, i.e.
$Y_k = h(X_k) + V_k + \varepsilon\, U_k$
or
$Y_k = h(X_k, V_k) + \varepsilon\, U_k$
or
$Y_k = h(X_k) + \varepsilon\, U_k$
depending on the case under consideration, with $U_k \sim q_k^U(u)\, du$, and set $(X_k, V_k)$ as new hidden state
new requirement : easy to
simulate $(X_k, V_k)$ jointly
evaluate density $q_k^U(u)$
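With the perturbed model, the likelihood becomes the density of $\varepsilon\, U_k$ evaluated at the gap between the actual observation and the noise-free prediction. A minimal scalar sketch follows, assuming (hypothetically) a standard Gaussian density for $U_k$; `predicted` stands for $h(x)$, $h(x) + v$ or $h(x,v)$ depending on the case.

```python
import math


def q_u(u):
    """Density of the artificial noise U_k (standard Gaussian, by assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def abc_likelihood(y, predicted, eps):
    """Density at y of predicted + eps * U, i.e. q_U((y - predicted) / eps) / eps.

    This is the scalar change of variable u = (y - predicted) / eps; in
    dimension d the normalizing factor would be eps ** d instead of eps.
    """
    return q_u((y - predicted) / eps) / eps
```

Smaller values of `eps` give a likelihood that is more sharply concentrated around states whose prediction matches the observation, at the price of more particles receiving negligible weight.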
Bayesian filter
hidden Markov models
representation as Gibbs Boltzmann distribution
recursive formulation
partially observed Markov chains + given importance decomposition
representation as Gibbs Boltzmann distribution
38 (Bayesian filter : hidden Markov models : representation : 1) Bayesian filter : hidden Markov models : representation
Theorem
joint conditional distribution of hidden state sequence $X_{0:n}$ given observations $Y_{0:n}$, as a Gibbs Boltzmann distribution
$P[X_{0:n} \in dx_{0:n} \mid Y_{0:n}] \propto \underbrace{\prod_{k=0}^n g_k(x_k)}_{g_{0:n}(x_{0:n})}\; \underbrace{\eta_0(dx_0)\, \prod_{k=1}^n Q_k(x_{k-1}, dx_k)}_{\eta_{0:n}(dx_{0:n})}$
with likelihood functions defined (abuse of notation) as $g_k(x) = g_k(x, Y_k)$, and with joint probability distribution of hidden state sequence $X_{0:n}$
$\eta_{0:n}(dx_{0:n}) = P[X_{0:n} \in dx_{0:n}] = \eta_0(dx_0)\, \prod_{k=1}^n Q_k(x_{k-1}, dx_k)$
39 (Bayesian filter : hidden Markov models : representation : 2) general principle :
$p_{X \mid Y=y}(x) = \dfrac{p_{X,Y}(x,y)}{p_Y(y)}$ i.e. $p_{X \mid Y=y}(x) \propto p_{X,Y}(x,y)$
Proof
Bayes rule + Markov property + memoryless channel assumption yield joint probability distribution of hidden states and observations $(X_{0:n}, Y_{0:n})$
$P[X_{0:n} \in dx_{0:n}, Y_{0:n} \in dy_{0:n}] = P[Y_{0:n} \in dy_{0:n} \mid X_{0:n} = x_{0:n}]\; P[X_{0:n} \in dx_{0:n}] = \eta_0(dx_0)\, \prod_{k=1}^n Q_k(x_{k-1}, dx_k)\; \prod_{k=0}^n g_k(x_k, y_k)\; \lambda_0^F(dy_0) \cdots \lambda_n^F(dy_n)$
hence
$P[X_{0:n} \in dx_{0:n} \mid Y_{0:n}] \propto \eta_0(dx_0)\, \prod_{k=1}^n Q_k(x_{k-1}, dx_k)\; \prod_{k=0}^n g_k(x_k)$
40 (Bayesian filter : hidden Markov models : representation : 3) Remark
for any function $f$ depending on whole trajectory
$E[f(X_{0:n}) \mid Y_{0:n}] \propto \int_E \cdots \int_E f(x_{0:n})\, g_{0:n}(x_{0:n})\, \eta_{0:n}(dx_{0:n}) = E[f(X_{0:n})\, \prod_{k=0}^n g_k(X_k)]$
where the last expectation is w.r.t. hidden state sequence $X_{0:n}$, while observations $Y_{0:n}$ are fixed implicit parameters in likelihood functions : recall (abuse of notation) $g_k(x) = g_k(x, Y_k)$
if $f = \phi \circ \pi$ depends only upon last state, then
$\langle \mu_n, \phi \rangle = E[\phi(X_n) \mid Y_{0:n}] \propto E[\phi(X_n)\, \prod_{k=0}^n g_k(X_k)] = \langle \gamma_n, \phi \rangle$
which defines unnormalized distribution $\gamma_n(dx)$ implicitly, through its action on arbitrary functions
41 (Bayesian filter : hidden Markov models : representation : 4) for a given importance decomposition
$P[X_{0:n} \in dx_{0:n} \mid Y_{0:n}] \propto \eta_0(dx_0)\, \prod_{k=1}^n Q_k(x_{k-1}, dx_k)\; \prod_{k=0}^n g_k(x_k) = \underbrace{\eta_0^{imp}(dx_0)\, \prod_{k=1}^n Q_k^{imp}(x_{k-1}, dx_k)}_{\eta_{0:n}^{imp}(dx_{0:n})}\; \underbrace{g_0^{imp}(x_0)\, \prod_{k=1}^n g_k^{imp}(x_{k-1}, x_k)}_{g_{0:n}^{imp}(x_{0:n})}$
42 (Bayesian filter : hidden Markov models : recursive formulation : 1) Bayesian filter : hidden Markov models : recursive formulation
Theorem
Bayesian filter $\mu_k(dx) = P[X_k \in dx \mid Y_{0:k}]$ satisfies
$\mu_{k-1} \longrightarrow$ (prediction) $\eta_k = \mu_{k-1} Q_k \longrightarrow$ (correction) $\mu_k = g_k \cdot \eta_k$
with initial condition $\eta_0(dx) = P[X_0 \in dx]$
Remark
in Theorem statement
$\mu_{k-1} Q_k(dx') = \int_E \mu_{k-1}(dx)\, Q_k(x,dx')$
denotes mixture distribution resulting from transition kernel $Q_k(x,dx')$ acting on probability distribution $\mu_{k-1}(dx)$, and
$g_k \cdot \eta_k = \dfrac{g_k\, \eta_k}{\langle \eta_k, g_k \rangle}$
denotes (projective) product of prior probability distribution $\eta_k(dx')$ with likelihood function $g_k(x')$
43 (Bayesian filter : hidden Markov models : recursive formulation : 2) Proof
recall representation for joint conditional probability distribution of hidden state sequence $X_{0:k}$ given observations $Y_{0:k}$
$P[X_{0:k} \in dx_{0:k} \mid Y_{0:k}] \propto \eta_0(dx_0)\, \prod_{p=1}^k Q_p(x_{p-1}, dx_p)\; \prod_{p=0}^k g_p(x_p) \propto g_k(x_k)\, Q_k(x_{k-1}, dx_k)\; P[X_{0:k-1} \in dx_{0:k-1} \mid Y_{0:k-1}]$
integration w.r.t. variables $x_{0:k-1}$ (and in RHS, first w.r.t. variables $x_{0:k-2}$ and next w.r.t. variable $x_{k-1}$) provides conditional distribution of current hidden state $X_k$ given observations $Y_{0:k}$, i.e. Bayesian filter, as
$\mu_k(dx_k) = P[X_k \in dx_k \mid Y_{0:k}] \propto g_k(x_k)\, \int_E Q_k(x_{k-1}, dx_k)\; P[X_{k-1} \in dx_{k-1} \mid Y_{0:k-1}] = g_k(x_k)\, \underbrace{\int_E \mu_{k-1}(dx_{k-1})\, Q_k(x_{k-1}, dx_k)}_{\eta_k(dx_k)}$
44 (Bayesian filter : hidden Markov models : recursive formulation : 3) Remark
unnormalized version satisfies recurrent relation
$\gamma_k(dx') = g_k(x')\, \int_E \gamma_{k-1}(dx)\, Q_k(x,dx')$ and $\mu_k = \dfrac{\gamma_k}{\langle \gamma_k, 1 \rangle}$
or equivalently
$\gamma_k(dx') = \int_E \gamma_{k-1}(dx)\, R_k(x,dx')$
introducing nonnegative kernel
$R_k(x,dx') = Q_k(x,dx')\, g_k(x')$
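On a finite state space the prediction / correction recursion and its unnormalized version can be computed exactly, which gives a convenient reference against which particle approximations can be compared. The sketch below assumes a time-homogeneous transition matrix `Q` and a list `likelihoods` whose k-th entry holds the values $g_k(x)$ for each state; these names and the numerical values used to exercise it are hypothetical.

```python
def predict(mu_prev, Q):
    """Prediction: eta_k(x') = sum over x of mu_{k-1}(x) Q(x, x')."""
    n_out = len(Q[0])
    return [sum(mu_prev[x] * Q[x][y] for x in range(len(mu_prev))) for y in range(n_out)]


def correct(eta, g):
    """Correction: mu_k = g_k . eta_k, i.e. pointwise product then normalization."""
    unnormalized = [e * gx for e, gx in zip(eta, g)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def bayes_filter(eta0, Q, likelihoods):
    """Normalized recursion, starting from mu_0 = g_0 . eta_0."""
    mu = correct(eta0, likelihoods[0])
    for g in likelihoods[1:]:
        mu = correct(predict(mu, Q), g)
    return mu


def unnormalized_filter(eta0, Q, likelihoods):
    """Unnormalized recursion gamma_k(x') = g_k(x') sum over x of gamma_{k-1}(x) Q(x, x')."""
    gamma = [e * gx for e, gx in zip(eta0, likelihoods[0])]
    for g in likelihoods[1:]:
        gamma = [gx * s for gx, s in zip(g, predict(gamma, Q))]
    return gamma
```

Normalizing the output of `unnormalized_filter` by its total mass should reproduce the output of `bayes_filter`, in line with $\mu_k = \gamma_k / \langle \gamma_k, 1 \rangle$.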
45 (Bayesian filter : partially observed Markov chains : representation : 1) Bayesian filter : partially observed Markov chains : representation
Theorem
joint conditional distribution of hidden state sequence $X_{0:n}$ given observations $Y_{0:n}$
$P[X_{0:n} \in dx_{0:n} \mid Y_{0:n}] \propto \gamma_0(dx_0)\, \prod_{k=1}^n R_k(x_{k-1}, dx_k)$
with nonnegative distribution defined (abuse of notation) as
$\gamma_0(dx) = \gamma_0(Y_0, dx)$
and with nonnegative kernel defined (abuse of notation) as
$R_k(x_{k-1}, dx_k) = R_k(x_{k-1}, Y_{k-1}, Y_k, dx_k)$
46 (Bayesian filter : partially observed Markov chains : representation : 2) general principle :
$p_{X \mid Y=y}(x) = \dfrac{p_{X,Y}(x,y)}{p_Y(y)}$ i.e. $p_{X \mid Y=y}(x) \propto p_{X,Y}(x,y)$
Proof
by definition, joint probability distribution of hidden states and observations $(X_{0:n}, Y_{0:n})$
$P[X_{0:n} \in dx_{0:n}, Y_{0:n} \in dy_{0:n}] = \gamma_0(y_0, dx_0)\, \prod_{k=1}^n R_k(x_{k-1}, y_{k-1}, y_k, dx_k)\; \lambda_0^F(dy_0)\, \prod_{k=1}^n \lambda_k^F(y_{k-1}, dy_k)$
hence
$P[X_{0:n} \in dx_{0:n} \mid Y_{0:n}] \propto \gamma_0(dx_0)\, \prod_{k=1}^n R_k(x_{k-1}, dx_k)$
47 (Bayesian filter : partially observed Markov chains : representation : 3) for a given importance decomposition
$P[X_{0:n} \in dx_{0:n} \mid Y_{0:n}] \propto \gamma_0(dx_0)\, \prod_{k=1}^n R_k(x_{k-1}, dx_k) = \underbrace{\eta_0^{imp}(dx_0)\, \prod_{k=1}^n Q_k^{imp}(x_{k-1}, dx_k)}_{\eta_{0:n}^{imp}(dx_{0:n})}\; \underbrace{g_0^{imp}(x_0)\, \prod_{k=1}^n g_k^{imp}(x_{k-1}, x_k)}_{g_{0:n}^{imp}(x_{0:n})}$
48 (Bayesian filter : partially observed Markov chains : recursive formulation : 1) Bayesian filter : partially observed Markov chains : recursive formulation
Theorem
Bayesian filter $\mu_k(dx) = P[X_k \in dx \mid Y_{0:k}]$ satisfies
$\mu_k(dx') \propto \int_E \mu_{k-1}(dx)\, R_k(x,dx')$
with initial condition $\mu_0(dx) \propto \gamma_0(dx)$
Remark
unnormalized version satisfies recurrent relation
$\gamma_k(dx') = \int_E \gamma_{k-1}(dx)\, R_k(x,dx')$ and $\mu_k = \dfrac{\gamma_k}{\langle \gamma_k, 1 \rangle}$
49 (Bayesian filter : partially observed Markov chains : recursive formulation : 2) Proof
recall representation for joint conditional probability distribution of hidden state sequence $X_{0:k}$ given observations $Y_{0:k}$
$P[X_{0:k} \in dx_{0:k} \mid Y_{0:k}] \propto \gamma_0(dx_0)\, \prod_{p=1}^k R_p(x_{p-1}, dx_p) \propto R_k(x_{k-1}, dx_k)\; P[X_{0:k-1} \in dx_{0:k-1} \mid Y_{0:k-1}]$
integration w.r.t. variables $x_{0:k-1}$ (and in RHS, first w.r.t. variables $x_{0:k-2}$ and next w.r.t. variable $x_{k-1}$) provides conditional distribution of current hidden state $X_k$ given observations $Y_{0:k}$, i.e. Bayesian filter, as
$\mu_k(dx_k) = P[X_k \in dx_k \mid Y_{0:k}] \propto \int_E R_k(x_{k-1}, dx_k)\; P[X_{k-1} \in dx_{k-1} \mid Y_{0:k-1}] = \int_E \mu_{k-1}(dx_{k-1})\, R_k(x_{k-1}, dx_k)$
Monte Carlo approximation : particle filters
Monte Carlo methods : importance sampling
importance sampling : SIS algorithm
derived from Bayesian filter representation
recursive formulation
redistribution : SIR algorithm, adaptive redistribution
derived directly from Bayesian filter recursive formulation
estimation error, CLT
50 (Monte Carlo methods : importance sampling : 1) Monte Carlo methods
if computing an integral (or a mathematical expectation)
$\langle \mu, \phi \rangle = \int_E \phi(x)\, \mu(dx) = E[\phi(X)]$ with $X \sim \mu(dx)$
is difficult, but simulating a r.v. according to distribution $\mu$ is easy, then introduce empirical probability distribution
$S^N(\mu) = \dfrac{1}{N} \sum_{i=1}^N \delta_{\xi^i}$
where $(\xi^1, \ldots, \xi^N)$ is an N-sample distributed according to $\mu$, and approximation
$\langle \mu, \phi \rangle \approx \langle S^N(\mu), \phi \rangle = \dfrac{1}{N} \sum_{i=1}^N \phi(\xi^i)$
by law of large numbers
$\langle S^N(\mu), \phi \rangle \to \langle \mu, \phi \rangle$
in probability as $N \to \infty$, with speed $1/\sqrt{N}$
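51 (Monte Carlo methods : importance sampling : 2) indeed
$\langle S^N(\mu) - \mu, \phi \rangle = \dfrac{1}{N} \sum_{i=1}^N (\phi(\xi^i) - \langle \mu, \phi \rangle)$
hence (non asymptotic) mean square error
$E|\langle S^N(\mu) - \mu, \phi \rangle|^2 = \dfrac{1}{N}\, \mathrm{var}(\phi, \mu)$
since, by independence, cross terms vanish in
$\dfrac{1}{N^2} \sum_{i,j=1}^N E[(\phi(\xi^i) - \langle \mu, \phi \rangle)\, (\phi(\xi^j) - \langle \mu, \phi \rangle)] = \dfrac{1}{N^2} \sum_{i=1}^N \underbrace{E|\phi(\xi^i) - \langle \mu, \phi \rangle|^2}_{\mathrm{var}(\phi, \mu)}$
and central limit theorem holds
$\sqrt{N}\, \langle S^N(\mu) - \mu, \phi \rangle = \dfrac{1}{\sqrt{N}} \sum_{i=1}^N (\phi(\xi^i) - \langle \mu, \phi \rangle) \Longrightarrow N(0, \mathrm{var}(\phi, \mu))$
in distribution as $N \to \infty$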
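The identity "mean square error equals var / N" can be checked empirically by repeating the Monte Carlo estimate many times. The sketch below is only a numerical illustration; the helper names and the uniform example used to exercise it are hypothetical choices.

```python
import random


def monte_carlo_mean(phi, sampler, n, rng):
    """<S^N(mu), phi> = average of phi over an n-sample drawn by sampler(rng)."""
    return sum(phi(sampler(rng)) for _ in range(n)) / n


def empirical_mse(phi, sampler, exact, n, repeats, rng):
    """Average squared error of the estimator over independent repetitions."""
    errors = (monte_carlo_mean(phi, sampler, n, rng) - exact for _ in range(repeats))
    return sum(e * e for e in errors) / repeats


# Example: X uniform on [0, 1], phi(x) = x, so <mu, phi> = 1/2 and var(phi, mu) = 1/12.
rng = random.Random(0)
mse = empirical_mse(lambda x: x, lambda r: r.random(), 0.5, 100, 2000, rng)
theoretical = (1.0 / 12.0) / 100
```

With these settings `mse` should be close to `theoretical`, up to the fluctuation that comes from using a finite number of repetitions.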
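52 (Monte Carlo methods : importance sampling : 3) important special case : Gibbs Boltzmann distribution
$\mu = g \cdot \eta = \dfrac{g\, \eta}{\langle \eta, g \rangle}$ i.e. $\langle \mu, \phi \rangle = \dfrac{\langle \eta, g\phi \rangle}{\langle \eta, g \rangle}$
with (non unique) decomposition in terms of
a probability distribution $\eta$
a nonnegative function $g$
introduce unnormalized distribution defined by
$\langle \gamma, \phi \rangle = \langle \eta, g\phi \rangle = E[g(\Xi)\, \phi(\Xi)]$
where r.v. $\Xi$ is distributed according to $\eta$, hence
$\langle \mu, \phi \rangle = \dfrac{\langle \eta, g\phi \rangle}{\langle \eta, g \rangle} = \dfrac{\langle \gamma, \phi \rangle}{\langle \gamma, 1 \rangle}$ $(\star)$
motivation : Bayes rule
posterior distribution $\propto$ likelihood function $\times$ prior distribution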
53 (Monte Carlo methods : importance sampling : 4) if simulating a r.v. according to $\mu$ is difficult, but simulating a r.v. according to $\eta$ and evaluating nonnegative function $g(x)$ for any $x$ is easy, then it is possible to approximate $\mu$ by a weighted empirical probability distribution associated with a sample distributed according to $\eta$ and weighted with nonnegative function $g(x)$, even though normalizing constant $\langle \eta, g \rangle$ might be unknown
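54 (Monte Carlo methods : importance sampling : 5) importance sampling
idea : approximate numerator and denominator in $(\star)$ with a unique sample distributed according to $\eta$ : introduce approximation
$\langle \gamma, \phi \rangle = \langle \eta, g\phi \rangle \approx \langle S^N(\eta), g\phi \rangle = \dfrac{1}{N} \sum_{i=1}^N g(\xi^i)\, \phi(\xi^i)$
hence
$\langle \mu, \phi \rangle = \langle g \cdot \eta, \phi \rangle \approx \langle g \cdot S^N(\eta), \phi \rangle = \dfrac{\sum_{i=1}^N g(\xi^i)\, \phi(\xi^i)}{\sum_{i=1}^N g(\xi^i)}$
where $(\xi^1, \ldots, \xi^N)$ is an N-sample with common probability distribution $\eta$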
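55 (Monte Carlo methods : importance sampling : 6) in other words
$\gamma \approx \gamma^N = g\, S^N(\eta) = \dfrac{1}{N} \sum_{i=1}^N g(\xi^i)\, \delta_{\xi^i}$
and
$\mu \approx \mu^N = g \cdot S^N(\eta) = \sum_{i=1}^N \dfrac{g(\xi^i)}{\sum_{j=1}^N g(\xi^j)}\, \delta_{\xi^i} = \sum_{i=1}^N w^i\, \delta_{\xi^i}$
where nonnegative normalized weights $(w^1, \ldots, w^N)$ are defined for any $i = 1, \ldots, N$ by
$w^i = \dfrac{g(\xi^i)}{\sum_{j=1}^N g(\xi^j)}$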
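The weighted empirical distribution $\mu^N$ can be sketched in a few lines. The example used to exercise it is a hypothetical scalar Bayesian problem: prior $\eta = N(0,1)$, observation $y = 1$ with unit-variance Gaussian noise, for which the posterior is $N(1/2, 1/2)$, so the estimate of the posterior mean can be compared with the exact value 1/2.

```python
import math
import random


def importance_sampling(phi, g, sampler, n, rng):
    """Self-normalized importance sampling.

    Draws an n-sample from eta via sampler(rng), weights it with g, and returns
    (estimate of <mu, phi>, normalized weights, sample). The normalizing
    constant <eta, g> is never needed, so g may be known up to a constant only.
    """
    sample = [sampler(rng) for _ in range(n)]
    raw = [g(x) for x in sample]
    total = sum(raw)
    weights = [r / total for r in raw]
    estimate = sum(w * phi(x) for w, x in zip(weights, sample))
    return estimate, weights, sample


# Hypothetical example: prior N(0, 1), likelihood g(x) = exp(-(y - x)^2 / 2) with y = 1.
y_obs = 1.0
estimate, weights, sample = importance_sampling(
    phi=lambda x: x,
    g=lambda x: math.exp(-0.5 * (y_obs - x) ** 2),
    sampler=lambda r: r.gauss(0.0, 1.0),
    n=20000,
    rng=random.Random(0),
)
```

Here `estimate` approximates the posterior mean, and `weights` are the normalized weights $w^i$ attached to the sample points.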
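56 (importance sampling SIS algorithm : 1) importance sampling : SIS algorithm
recall Bayesian filter representation as a Gibbs Boltzmann distribution
$\mu_{0:n} = g_{0:n} \cdot \eta_{0:n} = \dfrac{g_{0:n}\, \eta_{0:n}}{\langle \eta_{0:n}, g_{0:n} \rangle}$ i.e. $\langle \mu_{0:n}, f \rangle = \dfrac{\langle \eta_{0:n}, g_{0:n} f \rangle}{\langle \eta_{0:n}, g_{0:n} \rangle}$
with
$g_{0:n}(x_{0:n}) = \prod_{k=0}^n g_k(x_{k-1}, x_k)$
(with the convention that the factor for $k = 0$ depends on $x_0$ only) and with joint probability distribution of hidden states $X_{0:n}$
$\eta_{0:n}(dx_{0:n}) = P[X_{0:n} \in dx_{0:n}] = \eta_0(dx_0)\, \prod_{k=1}^n Q_k(x_{k-1}, dx_k)$
unnormalized version defined as
$\langle \gamma_{0:n}, f \rangle = \langle \eta_{0:n}, g_{0:n} f \rangle = E[g_{0:n}(X_{0:n})\, f(X_{0:n})]$ so that $\langle \mu_{0:n}, f \rangle = \dfrac{\langle \gamma_{0:n}, f \rangle}{\langle \gamma_{0:n}, 1 \rangle}$
and if $f = \phi \circ \pi$ depends only upon last state and not on whole trajectory, then
$\langle \gamma_{0:n}, \phi \circ \pi \rangle = E[g_{0:n}(X_{0:n})\, \phi \circ \pi(X_{0:n})] = E[\phi(X_n)\, \prod_{k=0}^n g_k(X_{k-1}, X_k)] = \langle \gamma_n, \phi \rangle$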
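57 (importance sampling SIS algorithm : 2) importance sampling : approximation
$\langle \gamma_{0:n}, f \rangle = \langle \eta_{0:n}, g_{0:n} f \rangle \approx \langle S^N(\eta_{0:n}), g_{0:n} f \rangle = \dfrac{1}{N} \sum_{i=1}^N g_{0:n}(\xi_{0:n}^i)\, f(\xi_{0:n}^i)$
and
$\langle \mu_{0:n}, f \rangle = \langle g_{0:n} \cdot \eta_{0:n}, f \rangle \approx \langle g_{0:n} \cdot S^N(\eta_{0:n}), f \rangle = \dfrac{\sum_{i=1}^N g_{0:n}(\xi_{0:n}^i)\, f(\xi_{0:n}^i)}{\sum_{i=1}^N g_{0:n}(\xi_{0:n}^i)}$
for any function $f$ depending on whole trajectory, where $(\xi_{0:n}^1, \ldots, \xi_{0:n}^N)$ is an N-sample with common probability distribution $\eta_{0:n}$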
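58 (importance sampling SIS algorithm : 3) in particular if $f = \phi \circ \pi$ depends only upon last state and not on whole sequence, then
$\langle \gamma_n, \phi \rangle = \langle \gamma_{0:n}, \phi \circ \pi \rangle \approx \dfrac{1}{N} \sum_{i=1}^N g_{0:n}(\xi_{0:n}^i)\, \phi(\xi_n^i)$
and
$\langle \mu_n, \phi \rangle = \langle \mu_{0:n}, \phi \circ \pi \rangle \approx \dfrac{\sum_{i=1}^N g_{0:n}(\xi_{0:n}^i)\, \phi(\xi_n^i)}{\sum_{i=1}^N g_{0:n}(\xi_{0:n}^i)}$
for any function $\phi$, where $(\xi_{0:n}^1, \ldots, \xi_{0:n}^N)$ is an N-sample with common probability distribution $\eta_{0:n}$, and for $i = 1, \ldots, N$
$\xi_n^i = \pi(\xi_{0:n}^i)$
denotes last state of sequence $\xi_{0:n}^i = (\xi_0^i, \ldots, \xi_n^i)$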
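59 (importance sampling SIS algorithm : 4) in other words
$\gamma_n \approx \gamma_n^N = \dfrac{1}{N} \sum_{i=1}^N g_{0:n}(\xi_{0:n}^i)\, \delta_{\xi_n^i}$
and
$\mu_n \approx \mu_n^N = \sum_{i=1}^N \dfrac{g_{0:n}(\xi_{0:n}^i)}{\sum_{j=1}^N g_{0:n}(\xi_{0:n}^j)}\, \delta_{\xi_n^i} = \sum_{i=1}^N w_n^i\, \delta_{\xi_n^i}$
where nonnegative normalized weights $(w_n^1, \ldots, w_n^N)$ are defined for any $i = 1, \ldots, N$ by
$w_n^i = \dfrac{g_{0:n}(\xi_{0:n}^i)}{\sum_{j=1}^N g_{0:n}(\xi_{0:n}^j)}$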
60 (importance sampling SIS algorithm : 5) SIS algorithm
importance sampling approximation : non recursive, depth first implementation
simulate an N-sample of hidden state sequences $(\xi_{0:n}^1, \ldots, \xi_{0:n}^N)$ : independently for any $i = 1, \ldots, N$, simulate a sequence $\xi_{0:n}^i = (\xi_0^i, \ldots, \xi_n^i)$, i.e.
simulate a r.v. $\xi_0^i$ according to $\eta_0(dx)$
for any $k = 1, \ldots, n$, simulate a r.v. $\xi_k^i$ according to $Q_k(\xi_{k-1}^i, dx')$
and define for any $i = 1, \ldots, N$
$g_{0:n}(\xi_{0:n}^i) = \prod_{k=0}^n g_k(\xi_{k-1}^i, \xi_k^i)$ and $w_n^i = \dfrac{g_{0:n}(\xi_{0:n}^i)}{\sum_{j=1}^N g_{0:n}(\xi_{0:n}^j)}$
61 (importance sampling SIS algorithm : 6) importance sampling approximation : non recursive implementation for nonlinear and non Gaussian systems
simulate an N-sample of hidden state sequences $(\xi_{0:n}^1, \ldots, \xi_{0:n}^N)$ : independently for any $i = 1, \ldots, N$, simulate a sequence $\xi_{0:n}^i = (\xi_0^i, \ldots, \xi_n^i)$, i.e.
simulate a r.v. $\xi_0^i$ according to $\eta_0(dx)$
for any $k = 1, \ldots, n$, simulate a r.v. $W_k^i$ according to $p_k^W(dw)$ and set $\xi_k^i = f_k(\xi_{k-1}^i, W_k^i)$
and define for any $i = 1, \ldots, N$
$g_{0:n}(\xi_{0:n}^i) = \prod_{k=0}^n q_k^V(Y_k - h_k(\xi_k^i))$ and $w_n^i = \dfrac{g_{0:n}(\xi_{0:n}^i)}{\sum_{j=1}^N g_{0:n}(\xi_{0:n}^j)}$
62 (importance sampling SIS algorithm : 7) recursive formulation of weights updating
for any $k = 1, \ldots, n$ and for any $i = 1, \ldots, N$
$w_k^i = \dfrac{g_{0:k}(\xi_{0:k}^i)}{\sum_{j=1}^N g_{0:k}(\xi_{0:k}^j)} = \dfrac{g_{0:k-1}(\xi_{0:k-1}^i)\, g_k(\xi_{k-1}^i, \xi_k^i)}{\sum_{j=1}^N g_{0:k-1}(\xi_{0:k-1}^j)\, g_k(\xi_{k-1}^j, \xi_k^j)} = \dfrac{w_{k-1}^i\, g_k(\xi_{k-1}^i, \xi_k^i)}{\sum_{j=1}^N w_{k-1}^j\, g_k(\xi_{k-1}^j, \xi_k^j)}$
benefit : allows breadth first implementation
63 (importance sampling SIS algorithm : 8) SIS algorithm (sequential importance sampling) : recursive implementation
for $k = 0$, independently for any $i = 1, \ldots, N$
simulate a r.v. $\xi_0^i$ according to $\eta_0(dx)$, and define
$w_0^i = \dfrac{g_0(\xi_0^i)}{\sum_{j=1}^N g_0(\xi_0^j)}$
for any $k = 1, \ldots, n$, independently for any $i = 1, \ldots, N$
simulate a r.v. $\xi_k^i$ according to $Q_k(\xi_{k-1}^i, dx')$, and update weight as
$w_k^i = \dfrac{w_{k-1}^i\, g_k(\xi_{k-1}^i, \xi_k^i)}{\sum_{j=1}^N w_{k-1}^j\, g_k(\xi_{k-1}^j, \xi_k^j)}$
64 (importance sampling SIS algorithm : 9) SIS algorithm (sequential importance sampling) : recursive implementation for nonlinear and non Gaussian systems
for $k = 0$, independently for any $i = 1, \ldots, N$
simulate a r.v. $\xi_0^i$ according to $\eta_0(dx)$, and define
$w_0^i = \dfrac{q_0^V(Y_0 - h_0(\xi_0^i))}{\sum_{j=1}^N q_0^V(Y_0 - h_0(\xi_0^j))}$
for any $k = 1, \ldots, n$, independently for any $i = 1, \ldots, N$
simulate a r.v. $W_k^i$ according to $p_k^W(dw)$ and set $\xi_k^i = f_k(\xi_{k-1}^i, W_k^i)$, and update weight as
$w_k^i = \dfrac{w_{k-1}^i\, q_k^V(Y_k - h_k(\xi_k^i))}{\sum_{j=1}^N w_{k-1}^j\, q_k^V(Y_k - h_k(\xi_k^j))}$
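The recursive SIS implementation can be sketched for a scalar toy model. The model itself is an assumption made for illustration and does not come from the slides: $X_0 \sim N(0,1)$, $X_k = a X_{k-1} + W_k$, $Y_k = X_k + V_k$ with Gaussian noises; the parameter names `a`, `sigma_w`, `sigma_v` are hypothetical.

```python
import math
import random


def gaussian_density(v, sigma):
    """Density of N(0, sigma^2) at v, used here as q_k^V."""
    return math.exp(-0.5 * (v / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def sis_filter(observations, n_particles, rng, a=0.9, sigma_w=1.0, sigma_v=0.5):
    """Sequential importance sampling for the toy model described above.

    Particles always move according to the prior model (blind mutation);
    weights are multiplied by the likelihood and renormalized at each step.
    Returns the particle positions and normalized weights at the final time.
    """
    # k = 0: sample from eta_0 and weight with g_0
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    raw = [gaussian_density(observations[0] - x, sigma_v) for x in particles]
    total = sum(raw)
    weights = [r / total for r in raw]
    # k >= 1: mutation according to Q_k, then multiplicative weight update
    for y in observations[1:]:
        particles = [a * x + rng.gauss(0.0, sigma_w) for x in particles]
        raw = [w * gaussian_density(y - x, sigma_v) for w, x in zip(weights, particles)]
        total = sum(raw)
        weights = [r / total for r in raw]
    return particles, weights
```

The filter mean is then estimated as `sum(w * x for w, x in zip(weights, particles))`. Since weights accumulate multiplicatively, the largest weight tends to dominate as the number of observations grows.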
65 (importance sampling SIS algorithm : 10) pros : higher weights are allocated to simulated sequences that are often consistent with observations
cons : weights are evaluated afterwards, and do not have any impact on how sequences are simulated (blind simulation strategy) + along a given sequence, weights are accumulated in a multiplicative way
weights degeneracy : in practice, one single sequence receives a much larger weight than all other sequences, whose contributions are therefore negligible
memory effect : a sequence cannot be consistent with all observations : a sequence that is consistent (resp. inconsistent) with current observation, but inconsistent (resp. consistent) with earlier observations, will receive a small (resp. a large) weight
proposed solutions
use observations to guide how sequences are simulated
from time to time, replicate / terminate sequences according to their respective weights
66 (SIR algorithm : 1) SIR algorithm
approximate Bayesian filter $\mu_n(dx) = P[X_n \in dx \mid Y_{0:n}]$ using recursive formulation
$\mu_{k-1} \longrightarrow$ (prediction) $\eta_k = \mu_{k-1} Q_k \longrightarrow$ (correction) $\mu_k = g_k \cdot \eta_k$
with initial condition $\mu_0 = g_0 \cdot \eta_0$
idea : look for approximations in the form of (possibly weighted) empirical probability distributions
$\eta_k \approx \eta_k^N = \sum_{i=1}^N v_k^i\, \delta_{\xi_k^i}$ and $\mu_k \approx \mu_k^N = \sum_{i=1}^N w_k^i\, \delta_{\xi_k^i}$
associated with population of N particles characterized by
positions $(\xi_k^1, \ldots, \xi_k^N)$ in $E$
nonnegative normalized weights $(v_k^1, \ldots, v_k^N)$ and $(w_k^1, \ldots, w_k^N)$
67 (SIR algorithm : 2) initial approximation : using importance sampling
$\mu_0 = g_0 \cdot \eta_0 \approx g_0 \cdot S^N(\eta_0) = \sum_{i=1}^N \dfrac{g_0(\xi_0^i)}{\sum_{j=1}^N g_0(\xi_0^j)}\, \delta_{\xi_0^i} = \sum_{i=1}^N w_0^i\, \delta_{\xi_0^i}$
where variables $(\xi_0^1, \ldots, \xi_0^N)$ are i.i.d. with common probability distribution $\eta_0$
correction step : clearly, from definition
$\mu_k^N = g_k \cdot \eta_k^N = \sum_{i=1}^N \dfrac{v_k^i\, g_k(\xi_k^i)}{\sum_{j=1}^N v_k^j\, g_k(\xi_k^j)}\, \delta_{\xi_k^i} = \sum_{i=1}^N w_k^i\, \delta_{\xi_k^i}$
which automatically has desired form
68 (SIR algorithm : 3) prediction step : from definition
$\langle \mu_{k-1}^N Q_k, \phi \rangle = \int_E \int_E \mu_{k-1}^N(dx)\, Q_k(x,dx')\, \phi(x') = \sum_{i=1}^N w_{k-1}^i \int_E Q_k(\xi_{k-1}^i, dx')\, \phi(x') = \int_E \Big[ \sum_{i=1}^N w_{k-1}^i\, Q_k(\xi_{k-1}^i, dx') \Big]\, \phi(x')$
for any function $\phi$, hence
$\mu_{k-1}^N Q_k = \sum_{i=1}^N w_{k-1}^i\, m_k^i$
in form of a finite mixture, with
$m_k^i(dx') = Q_k(\xi_{k-1}^i, dx')$ for any $i = 1, \ldots, N$
requires further approximation (several sampling schemes available)
69 (SIR algorithm : 4) multinomial resampling
simulate an N-sample $(\xi_k^1, \ldots, \xi_k^N)$ according to $\mu_{k-1}^N Q_k$, and set
$\mu_{k-1}^N Q_k \approx \eta_k^N = S^N(\mu_{k-1}^N Q_k) = \dfrac{1}{N} \sum_{i=1}^N \delta_{\xi_k^i} = \sum_{i=1}^N v_k^i\, \delta_{\xi_k^i}$
with $v_k^i = 1/N$ for any $i = 1, \ldots, N$
weights are used to select mixture components (with replacement), with expected consequence that components with higher weights are selected several times
conversely, components with lower weights are possibly discarded and will not further contribute to approximation
if $R^i$ denotes how many times the i-th mixture component has been selected, or equivalently how many samples in new approximation originate from the i-th mixture component, for any $i = 1, \ldots, N$, then r.v. $(R^1, \ldots, R^N)$ has a multinomial distribution
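The selection of mixture components can be sketched with the standard library, where `random.Random.choices` draws indices with replacement according to given weights. The function name `multinomial_resample` and the returned offspring counts (the $R^i$ of the text) are illustrative choices.

```python
import random
from collections import Counter


def multinomial_resample(particles, weights, rng):
    """Select N mixture components with replacement, according to their weights.

    Returns the list of selected particles and the offspring counts
    (R^1, ..., R^N), which sum to N and follow a multinomial distribution.
    """
    n = len(particles)
    indices = rng.choices(range(n), weights=weights, k=n)
    counts = Counter(indices)
    offspring = [counts.get(i, 0) for i in range(n)]
    selected = [particles[i] for i in indices]
    return selected, offspring
```

Sampling from the mixture $\mu_{k-1}^N Q_k$ is then completed by moving each selected particle according to $Q_k$, and the new particles all carry the same weight $1/N$.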
70 (SIR algorithm : 5)
intuitively, if all mixture weights are equal (or close) to 1/N, i.e. if distribution of mixture weights is close to equidistribution, then selecting mixture components could be counterproductive
weights preservation
simulate exactly one individual from each mixture component and preserve its weight, i.e. independently for any $i = 1,\dots,N$ simulate $\xi_k^i$ according to $m_k^i(dx') = Q_k(\xi_{k-1}^i,dx')$ and set
$$\mu_{k-1}^N Q_k \approx \eta_k^N = \sum_{i=1}^N w_{k-1}^i\,\delta_{\xi_k^i} = \sum_{i=1}^N v_k^i\,\delta_{\xi_k^i}$$
with $v_k^i = w_{k-1}^i$ for any $i = 1,\dots,N$
intuitively, this approach is appropriate if distribution of mixture weights is close to equidistribution, and less appropriate in extreme case where most weights are zero, except a few components with positive weights
71 (SIR algorithm : 6)
SIR algorithm (sampling with importance resampling): recursive implementation
for k = 0, independently for any $i = 1,\dots,N$
- simulate a r.v. $\xi_0^i$ according to $\eta_0(dx)$, and define
$$w_0^i = \frac{g_0(\xi_0^i)}{\sum_{j=1}^N g_0(\xi_0^j)}$$
for any $k = 1,\dots,n$, independently for any $i = 1,\dots,N$
- select an individual $\widehat{\xi}_{k-1}^i$ among population $(\xi_{k-1}^1,\dots,\xi_{k-1}^N)$ and according to weights $(w_{k-1}^1,\dots,w_{k-1}^N)$
- simulate a r.v. $\xi_k^i$ according to $Q_k(\widehat{\xi}_{k-1}^i,dx')$ and define
$$w_k^i = \frac{g_k(\widehat{\xi}_{k-1}^i,\xi_k^i)}{\sum_{j=1}^N g_k(\widehat{\xi}_{k-1}^j,\xi_k^j)}$$
72 (SIR algorithm : 7)
SIR algorithm (sampling with importance resampling): recursive formulation for nonlinear and non Gaussian systems
for k = 0, independently for any $i = 1,\dots,N$
- simulate a r.v. $\xi_0^i$ according to $\eta_0(dx)$, and define
$$w_0^i = \frac{q_0^V(Y_0 - h_0(\xi_0^i))}{\sum_{j=1}^N q_0^V(Y_0 - h_0(\xi_0^j))}$$
for any $k = 1,\dots,n$, independently for any $i = 1,\dots,N$
- select an individual $\widehat{\xi}_{k-1}^i$ among population $(\xi_{k-1}^1,\dots,\xi_{k-1}^N)$ and according to weights $(w_{k-1}^1,\dots,w_{k-1}^N)$
- simulate a r.v. $W_k^i$ according to $p_k^W(dw)$ and set $\xi_k^i = f_k(\widehat{\xi}_{k-1}^i,W_k^i)$
- and define
$$w_k^i = \frac{q_k^V(Y_k - h_k(\xi_k^i))}{\sum_{j=1}^N q_k^V(Y_k - h_k(\xi_k^j))}$$
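The recursion above can be sketched on a toy scalar model. Everything specific here is an assumption made for illustration and does not come from the slides: the model $X_k = a\,X_{k-1} + W_k$, $Y_k = X_k + V_k$ with Gaussian noises and $\eta_0 = N(0,1)$, the parameter values, the function names, and the uniform fallback used when all likelihood values underflow to zero.

```python
import math
import random


def gaussian_density(v, sigma):
    """Centered Gaussian density with standard deviation sigma (plays the role of q^V)."""
    return math.exp(-0.5 * (v / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def normalize(values):
    """Normalize nonnegative values to sum to one (uniform fallback if all are zero)."""
    total = sum(values)
    if total == 0.0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]


def sir_filter(observations, n_particles, rng, a=0.9, sigma_w=1.0, sigma_v=0.5):
    """Bootstrap SIR filter for the assumed model X_k = a X_{k-1} + W_k, Y_k = X_k + V_k."""
    # k = 0: simulate from eta_0 (assumed standard Gaussian), weight by likelihood of Y_0
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    weights = normalize([gaussian_density(observations[0] - x, sigma_v) for x in particles])
    estimates = [sum(w * x for w, x in zip(weights, particles))]
    for y in observations[1:]:
        # selection step: draw N individuals according to the previous weights
        selected = rng.choices(particles, weights=weights, k=n_particles)
        # mutation step: xi_k = f_k(selected, W_k) with W_k ~ N(0, sigma_w^2)
        particles = [a * x + rng.gauss(0.0, sigma_w) for x in selected]
        # weighting step: evaluate q^V(Y_k - h_k(xi_k)) and normalize
        weights = normalize([gaussian_density(y - x, sigma_v) for x in particles])
        estimates.append(sum(w * x for w, x in zip(weights, particles)))
    return estimates, particles, weights


rng = random.Random(1)  # seeded only for reproducibility of the illustration
estimates, particles, weights = sir_filter([0.3, 0.5, 0.1, -0.2, 0.4], 200, rng)
```

The list `estimates` holds the weighted particle means, i.e. particle approximations of the conditional mean of $X_k$ given $Y_{0:k}$ under the assumed model.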
73 (SIR algorithm : 8)
to summarize, particles $(\xi_{k-1}^1,\dots,\xi_{k-1}^N)$
- are selected according to their respective weights $(w_{k-1}^1,\dots,w_{k-1}^N)$ [selection step]
- evolve according to transition probabilities $Q_k(x,dx')$ [mutation step]
- and are weighted by evaluating likelihood function $g_k$ [weighting step]
pros: weights do not accumulate along each sequence, but are used to select (or resample) particles
- particles with larger (resp. smaller) weights are replicated (resp. are terminated)
- by keeping only most probable particles at each time instant, expected benefit is to concentrate available computing power within regions of interest
74 (SIR algorithm : 9)
cons: introduces additional randomness, in resampling (selection) step
proposed solutions
- alternate resampling strategies, that allocate an (almost) deterministic number of offspring to each selected particle
- adaptive resampling, only when weights $(w_k^1,\dots,w_k^N)$ are too unbalanced (far from equidistribution)
cons: because of replication, fewer truly distinct positions are available (sample impoverishment, or positions degeneracy): in practice, implicitly rely on mutation step to bring diversity again
proposed solution
- after resampling (selection) step, add some random move to each selected particle, or apply some artificial Markovian dynamics (Metropolis-Hastings, Gibbs sampling, etc.)
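One resampling strategy with an almost deterministic number of offspring is systematic resampling. The slides do not name a specific scheme, so this choice is an assumption made for illustration, as is the function name. A single uniform draw positions a regular grid of N points on [0, 1), and each component receives as many offspring as grid points falling in its slice of the cumulative weights, so that, up to floating-point rounding, $R_i$ is either $\lfloor N w_i \rfloor$ or $\lfloor N w_i \rfloor + 1$.

```python
import random


def systematic_resample(weights, rng):
    """Return offspring counts using a single uniform draw and a regular grid."""
    n = len(weights)
    u = rng.random() / n          # one random offset in [0, 1/N)
    counts = [0] * n
    i = 0
    cumulative = weights[0]
    for j in range(n):
        point = u + j / n         # j-th grid point in [0, 1)
        # advance to the component whose cumulative slice contains the grid point
        while point >= cumulative and i < n - 1:
            i += 1
            cumulative += weights[i]
        counts[i] += 1
    return counts


rng = random.Random(0)  # seeded only for reproducibility of the illustration
counts = systematic_resample([0.5, 0.25, 0.25], rng)
```

Compared with multinomial resampling, the only randomness left is the common offset `u`, which is the sense in which the number of offspring is almost deterministic.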
75 (particle filtering : adaptive sampling / resampling : 1)
given a finite mixture
$$m = \sum_{i=1}^N w_i\,m_i$$
adaptive SIR algorithm: selecting mixture components is interesting only if weights $(w_1,\dots,w_N)$ are far from equidistribution
several heuristic criteria have been proposed to quantify departure from equidistribution, and to decide whether particles should be resampled or not, e.g.
- effective sample size
- entropy
76 (particle filtering : adaptive sampling / resampling : 2)
$\chi^2$ distance and effective sample size
$\chi^2$ distance between two probability vectors $p = (p_1,\dots,p_N)$ and $q = (q_1,\dots,q_N)$ is defined as
$$\chi^2(p,q) = \sum_{i=1}^N q_i \Big(\frac{p_i}{q_i} - 1\Big)^2$$
in particular for $p = w = (w_1,\dots,w_N)$ and $q = (1/N,\dots,1/N)$, it holds
$$0 \le \frac{1}{N} \sum_{i=1}^N (N w_i - 1)^2 = \frac{1}{N} \sum_{i=1}^N (N w_i)^2 - 1 = N \sum_{i=1}^N w_i^2 - 1$$
hence
$$N_{\mathrm{eff}} = 1 \Big/ \Big[\sum_{i=1}^N w_i^2\Big] \le N$$
where equality is attained at equidistribution, which suggests to resample if
$$H(w_1,\dots,w_N) = N \sum_{i=1}^N w_i^2 - 1 = \frac{N}{N_{\mathrm{eff}}} - 1 \ge H_{\mathrm{red}}$$
for some threshold $H_{\mathrm{red}} > 0$ still to be fixed
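The criterion above translates directly into code. In this small sketch the function names and the threshold value used in the example call are illustrative assumptions; the slides leave $H_{\mathrm{red}}$ to be fixed.

```python
def effective_sample_size(weights):
    """N_eff = 1 / sum_i w_i^2, for normalized weights."""
    return 1.0 / sum(w * w for w in weights)


def chi2_to_equidistribution(weights):
    """H(w) = N sum_i w_i^2 - 1 = N / N_eff - 1, chi-square distance to uniform weights."""
    n = len(weights)
    return n * sum(w * w for w in weights) - 1.0


def should_resample(weights, h_red):
    """Resample only if weights are far enough from equidistribution."""
    return chi2_to_equidistribution(weights) >= h_red


# equal weights: N_eff = N and H = 0, so no resampling for any positive threshold
balanced = should_resample([0.25, 0.25, 0.25, 0.25], h_red=1.0)
# fully degenerate weights: N_eff = 1 and H = N - 1
degenerate = should_resample([1.0, 0.0, 0.0, 0.0], h_red=1.0)
```

In an adaptive SIR algorithm, the selection step would be applied only when `should_resample` returns true, and weights would be preserved otherwise.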
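77 (estimation error, CLT : 1)
on the way to asymptotic results (in 3 slides)
recall linear evolution for unnormalized version of Bayesian filter
$$\gamma_k = \gamma_{k-1} R_k = g_k\,(\gamma_{k-1} Q_k) = g_k\,(\mu_{k-1} Q_k)\,\langle\gamma_{k-1},1\rangle = g_k\,\eta_k\,\langle\gamma_{k-1},1\rangle$$
with initial condition $\gamma_0 = g_0\,\eta_0$
proposed particle approximation for unnormalized distribution
$$\gamma_k^N = g_k\,\eta_k^N\,\langle\gamma_{k-1}^N,1\rangle$$
with initial condition $\gamma_0^N = g_0\,\eta_0^N$ and $\eta_0^N = S^N(\eta_0)$: clearly
$$\langle\gamma_k^N,1\rangle = \langle\eta_k^N,g_k\rangle\,\langle\gamma_{k-1}^N,1\rangle \qquad\text{and}\qquad \langle\gamma_0^N,1\rangle = \langle\eta_0^N,g_0\rangle$$
and it follows
$$\frac{\gamma_k^N}{\langle\gamma_k^N,1\rangle} = \frac{g_k\,\eta_k^N}{\langle\eta_k^N,g_k\rangle} = g_k \cdot \eta_k^N = \mu_k^N \qquad\text{and}\qquad \frac{\gamma_0^N}{\langle\gamma_0^N,1\rangle} = \frac{g_0\,\eta_0^N}{\langle\eta_0^N,g_0\rangle} = g_0 \cdot \eta_0^N = \mu_0^N$$
normalized version of proposed particle approximation $\gamma_k^N$ coincides with SIR (bootstrap) approximation $\mu_k^N$ for Bayesian filter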
78 (estimation error, CLT : 2)
Remark: key for induction
for any $k = 1,\dots,n$, by difference
$$\gamma_k^N - \gamma_k = g_k\,\eta_k^N\,\langle\gamma_{k-1}^N,1\rangle - g_k\,(\gamma_{k-1} Q_k) = g_k\,(\gamma_{k-1}^N Q_k - \gamma_{k-1} Q_k) + g_k\,(\eta_k^N - \mu_{k-1}^N Q_k)\,\langle\gamma_{k-1}^N,1\rangle$$
hence
$$\langle\gamma_k^N - \gamma_k,\phi\rangle = \langle\gamma_{k-1}^N - \gamma_{k-1},Q_k(g_k\,\phi)\rangle + \langle\eta_k^N - \mu_{k-1}^N Q_k,g_k\,\phi\rangle\,\langle\gamma_{k-1}^N,1\rangle$$
error at current generation, evaluated on function $\phi$, is decomposed into
- error at previous generation, evaluated on function $R_k\,\phi = Q_k(g_k\,\phi)$
- local error resulting from Monte Carlo approximation
even though samples are actually dependent, because of resampling at each generation, conditionally on previous generations, new samples are generated independently
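79 (estimation error, CLT : 3)
with this conditioning argument, error estimates
$$\sup_{\phi\,:\,\|\phi\| = 1} E\,\Big|\frac{\langle\gamma_k^N - \gamma_k,\phi\rangle}{\langle\gamma_k,1\rangle}\Big| \le \frac{c_k}{\sqrt{N}} \qquad\text{and}\qquad \sup_{\phi\,:\,\|\phi\| = 1} E\,|\langle\mu_k^N - \mu_k,\phi\rangle| \le \frac{2\,c_k}{\sqrt{N}}$$
of order $1/\sqrt{N}$, and CLT
$$\sqrt{N}\,\frac{\langle\gamma_k^N - \gamma_k,\phi\rangle}{\langle\gamma_k,1\rangle} \Longrightarrow N(0,V_k(\phi)) \qquad\text{and}\qquad \sqrt{N}\,\langle\mu_k^N - \mu_k,\phi\rangle \Longrightarrow N(0,v_k(\phi))$$
with $v_k(\phi) = V_k(\phi - \langle\mu_k,\phi\rangle)$, can be obtained by induction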
Some algorithmic variants
- regularization
- progressive weighting, MCMC iterations
- sample size adaptation
- marginalization aka Rao-Blackwellization
- interacting Kalman filters
- interacting finite state (Baum) filters
80 (marginalization aka Rao-Blackwellization : 1)
conditioning as a variance reduction technique
$$E[f(X^1,X^2)] = E[\,E[f(X^1,X^2) \mid X^2]\,] = E[F(X^2)]$$
if
$$F(x^2) = E[f(X^1,X^2) \mid X^2 = x^2] = \int_{E^1} f(x^1,x^2)\,P[X^1 \in dx^1 \mid X^2 = x^2]$$
has an explicit expression, then Monte Carlo estimator
$$\frac{1}{N} \sum_{i=1}^N F(X_i^2) \approx E[F(X^2)] = E[f(X^1,X^2)]$$
where $(X_1^2,\dots,X_N^2)$ is an N-sample with same common distribution as $X^2$, has smaller variance than Monte Carlo estimator
$$\frac{1}{N} \sum_{i=1}^N f(X_i^1,X_i^2) \approx E[f(X^1,X^2)]$$
where $((X_1^1,X_1^2),\dots,(X_N^1,X_N^2))$ is an N-sample with same common distribution as $(X^1,X^2)$
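The variance reduction can be checked numerically on a toy example. The example is an assumption chosen for illustration, not taken from the slides: $X^2 \sim N(0,1)$, $X^1 = X^2 + Z$ with $Z \sim N(0,1)$ independent, and $f(x^1,x^2) = x^1 x^2$, so that $F(x^2) = (x^2)^2$ is explicit and $E[f(X^1,X^2)] = 1$. The per-sample variances are 3 for $f(X^1,X^2)$ and 2 for $F(X^2)$.

```python
import random


def sample_terms(n_samples, rng):
    """Return per-sample terms of the crude and of the conditioned estimator."""
    crude_terms, conditioned_terms = [], []
    for _ in range(n_samples):
        x2 = rng.gauss(0.0, 1.0)
        x1 = x2 + rng.gauss(0.0, 1.0)       # X^1 given X^2 = x2 is N(x2, 1)
        crude_terms.append(x1 * x2)         # f(X^1, X^2)
        conditioned_terms.append(x2 * x2)   # F(X^2) = E[f(X^1, X^2) | X^2]
    return crude_terms, conditioned_terms


def mean(values):
    return sum(values) / len(values)


def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


rng = random.Random(0)  # seeded only for reproducibility of the illustration
crude_terms, conditioned_terms = sample_terms(20000, rng)
crude_estimate, crude_var = mean(crude_terms), variance(crude_terms)
conditioned_estimate, conditioned_var = mean(conditioned_terms), variance(conditioned_terms)
```

Both estimates approximate the same expectation, while the empirical variance of the conditioned terms is expected to be noticeably smaller than that of the crude terms.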
81 (marginalization aka Rao-Blackwellization : 2)
1st example: conditionally linear Gaussian systems
$$X_k^L = F_k^L(X_{k-1}^{NL})\,X_{k-1}^L + f_k^L(X_{k-1}^{NL}) + W_k^L$$
$$X_k^{NL} = F_k^{NL}(X_{k-1}^{NL})\,X_{k-1}^L + f_k^{NL}(X_{k-1}^{NL}) + W_k^{NL}$$
$$Y_k = h_k(X_k^{NL}) + V_k$$
clearly
$$E\Big[\phi(X_n^L,X_n^{NL}) \prod_{k=0}^n g_k(X_k^{NL})\Big] = E\Big[\,E[\phi(X_n^L,X_n^{NL}) \mid X_{0:n}^{NL}] \prod_{k=0}^n g_k(X_k^{NL})\Big]$$
and conditional distribution of the linear component $X_n^L$ given the nonlinear component sequence $X_{0:n}^{NL}$ is Gaussian, with mean $\widehat{X}_n^{L|NL}$ and covariance matrix $P_n^{L|NL}$ given explicitly, in recursive form, by Kalman filter equations
introduce new hidden state $\{(X_k^{NL},\widehat{X}_k^{L|NL},P_k^{L|NL})\}$ instead of $\{(X_k^L,X_k^{NL})\}$
benefit: explore with particles the subspace associated with nonlinear components, and associated with each particle, a Kalman filter estimates linear components
82 (marginalization aka Rao-Blackwellization : 3)
2nd example: nonlinear systems with Markovian switching regimes / modes
$$X_k = f_k(s_{k-1},X_{k-1},W_k)$$
$$Y_k = h_k(X_k) + V_k$$
where regime / mode sequence $\{s_k\}$ forms a Markov chain with finite state space
clearly
$$E\Big[\phi(s_n,X_n) \prod_{k=0}^n g_k(X_k)\Big] = E\Big[\,E[\phi(s_n,X_n) \mid X_{0:n}] \prod_{k=0}^n g_k(X_k)\Big]$$
and conditional distribution of regime / mode $s_n$ given continuous components sequence $X_{0:n}$ is a finite dimensional probability vector defined by
$$p_n^i = P[s_n = i \mid X_{0:n}] \quad\text{for any } i \in I$$
given explicitly, in recursive form, by solving Baum forward equation
introduce new hidden state $\{(X_k,p_k)\}$ instead of $\{(s_k,X_k)\}$
benefit: avoid sampling finite state space
Conclusion
particle filtering provides an implementation of Bayesian approach that is
- intuitive, easy to understand and implement
- flexible, adapts to many models, many algorithmic variants available
- numerically efficient, through some selection mechanism
- amenable to mathematical analysis