# Variantes Algorithmiques et Justifications Théoriques

Save this PDF as:

Size: px
Start display at page:

## Transcription

1 complément scientifique École Doctorale MATISSE IRISA et INRIA, salle Markov jeudi 26 janvier 2012 Variantes Algorithmiques et Justifications Théoriques François Le Gland INRIA Rennes et IRMAR

2 ...programme of the day #1 more general models, from non linear and non Gaussian systems to hidden Markov models and partially observed Markov chains, so as to handle e.g. regime / mode switching correlation between state noise and observation noise #2 for each of these models (or just for most general model), representation of P[X 0:n dx 0:n Y 0:n ] as a Gibbs Boltzmann distribution, with recursive formulation, and idem for P[X n dx n Y 0:n ] #3 particle approximation (SIS and SIR algorithms) from either representations #4 asymptotic behaviour as sample size goes to infinity #5 numerous algorithmic variants

3 1 (some notations : 1) some notations (continued) if X is a random variable taking values in E, then mapping φ E[φ(X)] or equivalently A P[X A] defines a probability distribution µ on E, denoted as µ(dx) = P[X dx] and such that E[φ(X)] = caracterizes uncertainty about X E φ(x) µ(dx) = µ,φ or P[X A] = µ(a)

4 2 (some notations : 2) transition probability kernel M(x,dx ) on E collection of probability distributions on E indexed by x E acts on functions according to M φ(x) = E M(x,dx ) φ(x ) and acts on probability distributions according to µm(dx ) = µ(dx) M(x,dx ) seen as a mixture distribution caracterized by µm,φ = [ µ(dx) M(x,dx )] φ(x ) = E E E E µ(dx) [ = µ,m φ E M(x,dx ) φ(x )]

5 Non linear and non Gaussian systems, and beyond non linear and non Gaussian systems hidden Markov models partially observed Markov chains likelihood free models

6 3 (non linear and non Gaussian systems : 1) non linear and non Gaussian systems prior model for hidden state taking values in E X k = f k (X k 1,W k ) with W k p W k (dw) initial condition X 0 η 0 (dx) observation taking values in R d with additive noise admitting a density Y k = h k (X k )+V k with V k qk V (v)dv random variables X 0, W 1,,W k, and V 0,V 1,,V k, are mutually independent but non necessarily Gaussian only requirement (to be used later on) : easy to simulate a r.v. according to η 0 (dx) or to p W k (dw) evaluate function q V k (v) for any v Rd

7 4 (non linear and non Gaussian systems : 2) Proposition hidden states {X k } form a Markov chain taking values in E, i.e. P[X k dx X 0:k 1 ] = P[X k dx X k 1 ] characterization in terms of transition kernel P[X k dx X k 1 = x] = Q k (x,dx ) defined (implicitly) by its action on functions Q k φ(x) = E[φ(X k ) X k 1 = x] = E[φ(f k (X k 1,W k )) X k 1 = x] = φ(f k (x,w)) p W k (w)dw R m

8 5 (non linear and non Gaussian systems : 3) Remark easy to simulate next state X k given X k 1 = x, i.e. to simulate a r.v. according to Q k (x,dx ) for a given x E indeed, set X k = f k (x,w k ) where W k is simulated according to p W k (dw) Remark in general, transition kernel Q k (x,dx ) does not admit a density indeed, conditionnally to X k 1 = x, r.v. X k necessarily belongs to subset M(x) = {x R m : there exist w R p such that x = f k (x,w)} if p < m and under some mild regularity assumptions, this subset of R m has zero Lebesgue measure therefore, conditionnally to X k 1 = x, probability distribution Q k (x,dx ) of r.v. X k cannot have a density w.r.t. Lebesgue measure on R m

9 6 (non linear and non Gaussian systems : 4) Remark if f k (x,w) = b k (x)+w and if probability distribution p W k W k admits a density, still denoted p W k (w), i.e. if (dw) of r.v. X k = b k (X k 1 )+W k with W k p W k (w)dw then a more explicit expression is available Q k (x,dx ) = p W k (x b k (x))dx i.e. transition kernel Q k (x,dx ) admits an (easy to evaluate) density indeed, change of variable x = b k (x)+w yields Q k φ(x) = φ(b k (x)+w) p W k (w)dw R m = φ(x ) p W k (x b k (x)))dx R m

10 7 (non linear and non Gaussian systems : 5) Proposition observations {Y k } satisfy memoryless channel assumption, i.e. P[Y 0:n dy 0:n X 0:n ] = n k=0 P[Y k dy k X k ] characterization in terms of emission density define likelihood function P[Y k dy X k = x] = q V k (y h k (x))dy g k (x) = q V k (Y k h k (x)) a quantitative measure of consistency between possible hidden state x E and actual observation Y k Remark easy to evaluate g k (x) for any x E

11 8 (hidden Markov models : 1) hidden Markov models motivating example hybrid continuous / discrete systems X k = f k (s k 1,X k 1,W k ) Y k = h k (X k )+V k where regime / mode sequence {s k } forms a Markov chain with finite state space does not fit into non linear and non Gaussian systems, however hidden states and modes {(X k,s k )} jointly form a Markov chain observations {Y k } satisfy memoryless channel assumption i.e. fits into hidden Markov models Remark easy to simulate next state (X k,s k ) given (X k 1,s k 1 ) = (x,s)

12 9 (hidden Markov models : 2) more generally, hidden states {X k } could form a Markov chain taking values in a quite general space E, e.g. hybrid continuous / discrete differentiable manifold constrained graphical (collection de connected edges) characterization in terms of transition kernel and initial distribution P[X k dx X k 1 = x] = Q k (x,dx ) P[X 0 dx] = η 0 (dx) joint probability distribution of hidden states X 0:n verifies n P[X 0:n dx 0:n ] = η 0 (dx 0 ) Q k (x k 1,dx k ) k=1

13 10 (hidden Markov models : 3) user should respect displacement constraints due to obstacles, as read on map

14 11 (hidden Markov models : 4) simplified model : user walks on a Voronoi graph, displacement constraints due to obstacles are taken automatically into account

15 12 (hidden Markov models : 5) observations {Y k } could verify memoryless channel assumption, i.e. P[Y 0:n dy 0:n X 0:n ] = n k=0 P[Y k dy k X k ] characterization in terms of emission density P[Y k dy X k = x] = g k (x,y)λ F k(dy) where nonnegative measure λ F k (dy) defined on F does not depend on x E define (abuse of notation) likelihood function as g k (x) = g k (x,y k ) a quantitative measure of consistency between x E and observation Y k joint conditional distribution of observations Y 0:n given hidden states X 0:n verifies P[Y 0:n dy 0:n X 0:n = x 0:n ] = n g k (x k,y k ) λ F 0 (dy 0 ) λ F n(dy n ) k=0

16 13 (hidden Markov models : 6) representation as X k 1 X k X k+1 Y k 1 Y k Y k+1 arrows represent dependency between random variables only requirement (to be used later on) : easy to simulate for any x E, a r.v. according to transition kernel Q k (x,dx ) evaluate for any x E, likelihood function g k (x )

17 14 (hidden Markov models : 7) hidden Markov models : importance decomposition motivation : from simulations seen last week, basic paradigm particles move according to prior model, described by its transition kernel new particles are weighted by evaluating likelihood function hopefully, resulting weighted empirical distribution provides reasonable approximation to non tractable Bayesian filter concern / questions : is this safe? could more information be used in mutation step?

18 15 (hidden Markov models : 8) recall indoor navigation example : if user is detected by a beacon with known location a and with finite range R, then necessarily user position is within detection disk centered at a and with radius R in other words, generating particles according to prior model alone could result in (a few, some, many, all) particles outside detection disk, i.e. useless particles, waste why not generate explicitly all new particles within disk, and accomodate for wrong model by changing weights? more generally, why not (and how) use next observation to move particles? ideal situation would be particles move according to posterior model warning

19 16 (hidden Markov models : 9) 1.4 prior distribution (sample view) prior Figure 1: Prior density and generated sample

20 17 (hidden Markov models : 10) 1.4 prior distribution (histogram view) prior Figure 2: Prior density and histogramme associated with generated sample

21 18 (hidden Markov models : 11) 1.4 prior distribution (sample view) prior Figure 1: Prior density and generated sample

22 19 (hidden Markov models : 12) prior distribution, likelihood function and posterior distribution (weighted sample view) prior likelihood posterior Figure 3a: Prior density, likelihood function, posterior density and weighted sample

23 20 (hidden Markov models : 13) prior distribution, likelihood function and posterior distribution (histogram view) prior likelihood posterior Figure 4a: Prior density, likelihood function, posterior density and histogramme associated with weighted sample

24 21 (hidden Markov models : 14) 1.4 prior distribution (sample view) prior Figure 1: Prior density and generated sample

25 22 (hidden Markov models : 15) prior distribution, likelihood function and posterior distribution (weighted sample view) prior likelihood posterior Figure 3b: Prior density, likelihood function, posterior density and weighted sample (more difficult)

26 23 (hidden Markov models : 16) prior distribution, likelihood function and posterior distribution (histogram view) prior likelihood posterior Figure 4b: Prior density, likelihood function, posterior density and histogramme associated with weighted sample (more difficult)

27 24 (hidden Markov models : 17) 1.4 prior distribution (sample view) prior Figure 1: Prior density and generated sample

28 25 (hidden Markov models : 18) prior distribution, likelihood function and posterior distribution (weighted sample view) prior likelihood posterior Figure 3c: Prior density, likelihood function, posterior density and weighted sample (just impossible)

29 26 (hidden Markov models : 19) possible (non unique) decomposition and as product of γ 0 (dx) = g 0 (x) η 0 (dx) = g imp 0 (x)η imp 0 (dx) R k (x,dx ) = Q k (x,dx ) g k (x ) = g imp k (x,x ) Q imp k (x,dx ) a nonnegative weight function g imp 0 (x) or g imp k (x,x ) a probability distribution η imp 0 (dx) or a transition kernel Q imp k (x,dx ) respectively, only requirement about proposed decomposition : easy to simulate a r.v. according to η imp 0 (dx) simulate for any x E, a r.v. according to Q imp k (x,dx ) evaluate for any x,x E, weighting function g imp k (x,x ) attention : evaluating weighting function g imp k (x,x ) requires some knowledge about transition kernels Q imp k (x,dx ) and Q k (x,dx ) (was not required originally)

30 27 (hidden Markov models : 20) popular (optimal) importance decomposition : blind vs. guided mutation and alternatively i.e. P[X k dx,y k dy X k 1 = x] = P[Y k dy X k = x,x k 1 = x] }{{} g k (x,y ) λ k (dy ) P[X k dx,y k dy X k 1 = x] with (abuse of notation) = P[X k dx Y k = y,x k 1 = x] }{{} Q k (x,y,dx ) P[X k dx X k 1 = x] }{{} Q k (x,dx ) P[Y k dy X k 1 = x] }{{} ĝ k (x,y ) λ k (dy ) R k (x,dx ) = g k (x ) Q k (x,dx ) = ĝ k (x) Q k (x,dx ) ĝ k (x) = ĝ k (x,y k ) and Qk (x,dx ) = Q k (x,y k,dx )

31 28 (hidden Markov models : 21) remaining question : how easy is it to simulate for any x E, a r.v. according to Q k (x,dx )? evaluate for any x E, weighting function ĝ k (x)? positive answer in special case : linear observations and additive Gaussian noise X k = f k (X k 1 )+σ k (X k 1 ) W k Y k = H k X k +V k indeed (for simplicity, assume σ k (x) = I) Y k = H k [f k (X k 1 )+W k ]+V k = H k f k (X k 1 )+(H k W k +V k ) conditionally on X k 1 = x, r.v. (X k,y k ) is jointly Gaussian, with mean and covariance matrix f k (x) Q W k Q W k H k and H k f k (x) H k Q W k H k Q W k H k +QV k

32 29 (partially observed Markov chains : 1) partially observed Markov chains motivating example : assume (unsynchronized) sensors take noisy observations of different components of hidden state at different time instants, e.g. X k = (Xk 1,X2 k ) and for simplicity X k = f(x k 1 )+W k h 1 (Xk 1)+V1 k Y k = H 2 X 2 k +V2 k at odd time instants at even time instants observing all components of hidden state is fine, but processing partial observations at each time instant can be risky, since likelihood functions will be flat along some directions : ideally, try to collect and process simultaneously two successive observations, so that likelihood functions are more peaky

33 30 (partially observed Markov chains : 2) down sampling : set X k = X 2k+1 and Ȳ k = Y 2k+1 Y 2k+2 state equation X k = X 2k+1 = f(x 2k )+W 2k+1 = f(f(x 2k 1 )+W 2k )+W 2k+1 i.e. X k = f( X k 1, W k ) with W k = W 2k W 2k+1

34 31 (partially observed Markov chains : 3) observation equation : introducing projections π 1 and π 2 on 1st and 2nd components of state vector, yields i.e. Ȳ k = Y 2k+1 Y 2k+2 = = h1 (X 1 2k+1 )+V1 2k+1 H 2 X 2 2k+2 +V2 2k+2 h 1 (π 1 (X 2k+1 ))+V 1 2k+1 H 2 π 2 (f(x 2k+1 )+W 2k+2 )+V 2 2k+2 Ȳ k = h( X k )+ V k with V k = V 1 2k+1 H 2 π 2 (W 2k+2 )+V 2 2k+2

35 32 (partially observed Markov chains : 4) resulting system X k = f( X k 1, W k ) Ȳ k = h( X k )+ V k with W k = W 2k W 2k+1 and Vk = V 1 2k+1 H 2 π 2 (W 2k+2 )+V 2 2k+2 clearly W k and V k 1 share W 2k in common and are correlated, hence dependent, and memoryless channel assumption cannot hold

36 33 (partially observed Markov chains : 5) trick : decompose W k = M V k 1 + B k where B k and V k 1 are now independent, substitute in state equation and import V k 1 = Ȳk 1 h( X k 1 ) from observation equation, yielding X k = f( X k 1,M (Ȳk 1 h( X k 1 ))+ B k ) Ȳ k = h( X k )+ V k does not fit into hidden Markov model, hidden state alone does not form a Markov chain however, hidden states and observations {( X k,ȳk)} jointly form a Markov chain, the second component of which only is observed

37 34 (partially observed Markov chains : 6) even more generally, with previous motivating example in mind, hidden states and observations {(X k,y k )} could jointly form a Markov chain taking values in product space E F characterization in terms of transition kernel P[X k dx,y k dy X k 1 = x,y k 1 = y] = R k (x,y,y,dx ) λ F k(y,dy ) and initial distribution P[X 0 dx,y 0 dy] = γ 0 (y,dx) λ F 0 (dy) attention : hidden states {X k } alone need not form a Markov chain joint probability distribution of hidden states and observations (X 0:n,Y 0:n ) P[X 0:n dx 0:n,Y 0:n dy 0:n ] = γ 0 (y 0,dx 0 ) n R k (x k 1,y k 1,y k,dx k ) λ F 0 (dy 0 ) λ F k(y k 1,dy k ) k=1

38 35 (partially observed Markov chains : 7) required (non unique) decomposition partially observed Markov chains : importance decomposition γ 0 (dx) = g imp 0 (x)η imp 0 (dx) and as product of R k (x,dx ) = g imp k (x,x ) Q imp k (x,dx ) a nonnegative weight function g imp 0 (x) or g imp k (x,x ) a probability distribution η imp 0 (dx) or a transition kernel Q imp k (x,dx ) respectively, only requirement about proposed decomposition : easy to simulate a r.v. according to η imp 0 (dx) simulate for any x E, a r.v. according to Q imp k (x,dx ) evaluate for any x,x E, weighting function g imp k (x,x )

39 36 (likelihood free models : 1) likelihood free models so far, at least implicitly, additive observation noise has been assumed Y k = h(x k )+V k with V k qk V (v) dv with known and explicit form for probability density qk V(v) this was key assumption in deriving expression of density emission P[Y k dy X k = x] = g k (x,y) λ k (dy) hence explicit expression of likelihood function questions : could anything be said in more general cases where no explicit expression is available for a density, or it does not even exist non additive observation noise, with dimension smaller than observation, i.e. Y k = h(x k,v k ) perfect observations, i.e. observation noise is simply not present Y k = h(x k )

40 37 (likelihood free models : 2) trick, a form of ABC (approximate Bayesian computation) : pretend that observations are produced by slightly perturbed but regular model, i.e. or or Y k = h(x k )+V k +εu k Y k = h(x k,v k )+εu k Y k = h(x k )+εu k depending on the case under consideration, with U k q U k (u)du and set (X k,v k ) as new hidden state new requirement : easy to simulate (X k,v k ) jointly evaluate density q U k (u)

41 Bayesian filter hidden Markov models representation as Gibbs Boltzmann distribution recursive formulation partially observed Markov chains + given importance decomposition representation as Gibbs Boltzmann distribution

42 38 (Bayesian filter : hidden Markov models : representation : 1) Bayesian filter : hidden Markov models : representation Theorem joint conditional distribution of hidden state sequence X 0:n given observations Y 0:n as a Gibbs Boltzmann distribution P[X 0:n dx 0:n Y 0:n ] n k=0 g k (x k ) }{{} g 0:n (x 0:n ) η 0 (dx 0 ) n k=1 Q k (x k 1,dx k ) } {{ } η 0:n (dx 0:n ) with likelihood functions defined (abuse of notation) as g k (x) = g k (x,y k ) and with joint probability distribution of hidden state sequence X 0:n η 0:n (dx 0:n ) = P[X 0:n dx 0:n ] = η 0 (dx 0 ) n Q k (x k 1,dx k ) k=1

43 39 (Bayesian filter : hidden Markov models : representation : 2) general principle : p X Y=y (x) = p X,Y(x,y) p Y (y) = p X Y (x) p X,Y (x,y) Proof Bayes rule + Markov property + memoryless channel assumption, yield joint probability distribution of hidden states and observations (X 0:n,Y 0:n ) hence P[X 0:n dx 0:n,Y 0:n dy 0:n ] = P[Y 0:n dy 0:n X 0:n = x 0:n ] P[X 0:n dx 0:n ] = η 0 (dx 0 ) n k=1 Q k (x k 1,dx k ) P[X 0:n dx 0:n Y 0:n ] η 0 (dx 0 ) n k=1 n k=0 g k (x k,y k ) λ F 0 (dy 0 ) λ F n(dy n ) Q k (x k 1,dx k ) n k=0 g k (x k )

44 40 (Bayesian filter : hidden Markov models : representation : 3) Remark for any function f depending on whole trajectory E[f(X 0:n ) Y 0:n ] f(x 0:n ) g 0:n (x 0:n ) η 0:n (dx 0:n ) E E E[f(X 0:n ) n k=0 g k (X k )] expectation w.r.t. hidden state sequence X 0:n, while observations Y 0:n are fixed implicit parameters in likelihood functions : recall (abuse of notation) g k (x) = g k (x,y k ) if f = φ π depends only upon last state, then µ n,φ = E[φ(X n ) Y 0:n ] E[φ(X n ) n k=0 g k (X k )] = γ n,φ which defines unnormalized distribution γ n (dx) implicitly, through its action on arbitrary functions

45 41 (Bayesian filter : hidden Markov models : representation : 4) for a given importance decomposition P[X 0:n dx 0:n Y 0:n ] η 0 (dx 0 ) n k=1 Q k (x k 1,dx k ) n k=0 g k (x k ) η imp 0 (dx 0 ) n k=1 Q imp k (x k 1,dx k ) } {{ } η imp 0:n (dx 0:n) n k=0 g imp k (x k ) } {{ } g imp 0:n (x 0:n)

46 42 (Bayesian filter : hidden Markov models : recursive formulation : 1) Bayesian filter : hidden Markov models : recursive formulation Theorem Bayesian filter µ k (dx) = P[X k dx Y 0:k ] satisfies µ k 1 prediction η k = µ k 1 Q k with initial condition η 0 (dx) = P[X 0 dx] correction µ k = g k η k Remark in Theorem statement µ k 1 Q k (dx ) = E µ k 1 (dx)q k (x,dx ) denotes mixture distribution resulting from transition kernel Q k (x,dx ) acting on probability distribution µ k 1 (dx), and g k η k = g k η k η k,g k denotes (projective) product of prior probability distribution η k (dx ) with likelihood function g k (x )

47 43 (Bayesian filter : hidden Markov models : recursive formulation : 2) Proof recall representation for joint conditional probability distribution of hidden state sequence X 0:k given observations Y 0:k P[X 0:k dx 0:k Y 0:k ] η 0 (dx 0 ) k p=1 Q p (x p 1,dx p ) k p=0 g p (x p ) g k (x k ) Q k (x k 1,dx k ) P[X 0:k 1 dx 0:k 1 Y 0:k 1 ] integration w.r.t. variables x 0:k 1 (and in RHS, first w.r.t. variables x 0:k 2 and next w.r.t. variable x k 1 ), provides conditional distribution of current hidden state X k given observations Y 0:k, i.e. Bayesian filter, as µ k (dx k ) = P[X k dx k Y 0:k ] g k (x k ) Q k (x k 1,dx k ) P[X k 1 dx k 1 Y 0:k 1 ] g k (x k ) E µ k 1 (dx k 1 ) Q k (x k 1,dx k ) E } {{ } η k (dx k )

48 44 (Bayesian filter : hidden Markov models : recursive formulation : 3) Remark unnormalized version satisfies recurrent relation γ k (dx ) = g k (x ) γ k 1 (dx) Q k (x,dx ) and µ k = γ k γ k,1 or equivalently E γ k (dx ) = E γ k 1 (dx) R k (x,dx ) introducing nonnegative kernel R k (x,dx ) = Q k (x,dx ) g k (x )

49 45 (Bayesian filter : partially observed Markov chains : representation : 1) Bayesian filter : partially observed Markov chains : representation Theorem joint conditional distribution of hidden state sequence X 0:n given observations Y 0:n P[X 0:n dx 0:n Y 0:n ] γ 0 (dx 0 ) n k=1 R k (x k 1,dx k ) with nonnegative distribution defined (abuse of notation) as γ 0 (dx) = γ 0 (Y 0,dx) and with nonnegative kernel defined (abuse of notation) as R k (x k 1,dx k ) = R k (x k 1,Y k 1,Y k,dx k )

50 46 (Bayesian filter : partially observed Markov chains : representation : 2) general principle : p X Y=y (x) = p X,Y(x,y) p Y (y) = p X Y (x) p X,Y (x,y) Proof by definition joint probability distribution of hidden states and observations (X 0:n,Y 0:n ) P[X 0:n dx 0:n,Y 0:n dy 0:n ] = γ 0 (y 0,dx 0 ) n k=1 R k (x k 1,y k 1,y k,dx k ) λ F 0 (dy 0 ) λ F k(y k 1,dy k ) hence P[X 0:n dx 0:n Y 0:n ] γ 0 (dx 0 ) n R k (x k 1,dx k ) k=1

51 47 (Bayesian filter : partially observed Markov chains : representation : 3) for a given importance decomposition P[X 0:n dx 0:n Y 0:n ] γ 0 (dx 0 ) n k=1 R k (x k 1,dx k ) η imp 0 (dx 0 ) n k=1 Q imp k (x k 1,dx k ) } {{ } η imp 0:n (dx 0:n) n k=0 g imp k (x k ) } {{ } g imp 0:n (x 0:n)

52 48 (Bayesian filter : partially observed Markov chains : recursive formulation : 1) Bayesian filter : partially observed Markov chains : recursive formulation Theorem Bayesian filter µ k (dx) = P[X k dx Y 0:k ] satisfies µ k (dx ) µ k 1 (dx) R k (x,dx ) with initial condition µ 0 (dx) γ 0 (dx) E Remark unnormalized version satisfies recurrent relation γ k (dx ) = γ k 1 (dx) R k (x,dx ) and µ k = γ k γ k,1 E

53 49 (Bayesian filter : partially observed Markov chains : recursive formulation : 2) Proof recall representation for joint conditional probability distribution of hidden state sequence X 0:k given observations Y 0:k P[X 0:k dx 0:k Y 0:k ] γ 0 (dx 0 ) k p=1 R p (x p 1,dx p ) R k (x k 1,dx k ) P[X 0:k 1 dx 0:k 1 Y 0:k 1 ] integration w.r.t. variables x 0:k 1 (and in RHS, first w.r.t. variables x 0:k 2 and next w.r.t. variable x k 1 ), provides conditional distribution of current hidden state X k given observations Y 0:k, i.e. Bayesian filter, as µ k (dx k ) = P[X k dx k Y 0:k ] R k (x k 1,dx k ) P[X k 1 dx k 1 Y 0:k 1 ] E E µ k 1 (dx k 1 ) R k (x k 1,dx k )

54 Monte Carlo approximation : particle filters Monte Carlo methods : importance sampling importance sampling SIS algorithm derived from Bayesian filter representation recursive formulation redistribution SIR algorithm, adaptive redistribution derived directly from Bayesian filter recursive formulation estimation error, CLT

55 50 (Monte Carlo methods : importance sampling : 1) Monte Carlo methods if computing an integral (or a mathematical expectation) µ,φ = φ(x) µ(dx) = E[φ(X)] with X µ(dx) E is difficult, but simulating a r.v. according to distribution µ is easy, then introduce empirical probability distribution S N (µ) = 1 N where (ξ 1,,ξ N ) is an N sample distributed according to µ, and approximation by law of large numbers i=1 µ,φ S N (µ),φ = 1 N δ ξi φ(ξ i ) i=1 S N (µ),φ µ,φ in probability as N, with speed 1/ N

56 51 (Monte Carlo methods : importance sampling : 2) indeed S N (µ) µ,φ = 1 N hence (non asymptotical) mean square error (φ(ξ i ) µ,φ ) i=1 since E S N (µ) µ,φ 2 = 1 N var(φ,µ) 1 N 2 i,j=1 E[(φ(ξ i ) µ,φ ) (φ(ξ j ) µ,φ )] = 1 N 2 i=1 E φ(ξ i ) µ,φ 2 }{{} var(φ, µ) and central limit theorem holds N S N (µ) µ,φ = 1 N (φ(ξ i ) µ,φ ) = N(0,var(φ,µ)) N in distribution as N i=1

57 52 (Monte Carlo methods : importance sampling : 3) important special case : Gibbs Boltzmann distribution µ = g η = gη η,g i.e. µ,φ = η,gφ η,g with (non unique) decomposition in terms of a probability distribution η a nonnegative function g introduce unnormalized distribution defined by γ,φ = η,gφ = E[g(Ξ)φ(Ξ)] hence µ,φ = η,gφ η,g where r.v. Ξ is distributed according to η motivation : Bayes rule = γ,φ γ,1 ( ) posterior distribution likelihood function prior distribution

58 53 (Monte Carlo methods : importance sampling : 4) if simulating a r.v. according to µ is difficult, but simulating a r.v. according to η and evaluating nonnegative function g(x) for any x is easy, then it is possible to approximate µ by a weighted empirical probability distribution associated with a sample distributed according to η and weighted with nonnegative function g(x) even though normalizing constant η, g might be unknown

59 54 (Monte Carlo methods : importance sampling : 5) importance sampling idea : approximate numerator and denominator in ( ) with a unique sample distributed according to η : introduce approximation γ,φ = η,gφ S N (η),gφ = 1 N g(ξ i )φ(ξ i ) i=1 hence µ,φ = g η,φ g S N (η),φ = g(ξ i )φ(ξ i ) i=1 g(ξ i ) where (ξ 1,,ξ N ) is an N sample with common probability distribution η i=1

60 55 (Monte Carlo methods : importance sampling : 6) in other words and γ γ N = gs N (η) = 1 N µ µ N = g S N (η) = i=1 g(ξ i )δ ξ i i=1 g(ξ i ) g(ξ j ) j=1 δ ξ i = w i δ ξ i where nonnegative normalized weights (w 1,,w N ) are defined for any i = 1 N by w i = g(ξi ) g(ξ j ) j=1 i=1

61 56 (importance sampling SIS algorithm : 1) importance sampling SIS algorithm recall Bayesian filter representation as a Gibbs Boltzmann distribution µ 0:n = g 0:n η 0:n = g 0:nη 0:n η 0:n,g 0:n with g 0:n (x 0:n ) = i.e. µ 0:n,f = η 0:n,g 0:n f η 0:n,g 0:n n k=0 g k (x k 1,x k ) and with joint probability distribution of hidden states X 0:n n η 0:n (dx 0:n ) = P[X 0:n dx 0:n ] = η 0 (dx 0 ) Q k (x k 1,dx k ) unnormalized version defined as k=1 γ 0:n,f = η 0:n,g 0:n f = E[g 0:n (X 0:n )f(x 0:n )] = γ 0:n,f γ 0:n,1 and if f = φ π depends only upon last state and not on whole trajectory, then n γ 0:n,φ π = E[g 0:n (X 0:n ) φ π(x 0:n )] = E[φ(X n ) g k (X k 1,X k )] = γ n,φ k=0

62 57 (importance sampling SIS algorithm : 2) importance sampling : approximation γ 0:n,f = η 0:n,g 0:n f S N (η 0:n ),g 0:n f = 1 N g 0:n (ξ0:n)f(ξ i 0:n) i i=1 and µ 0:n,f = g 0:n η 0:n,f g 0:n S N (η 0:n ),f = g 0:n (ξ0:n)f(ξ i 0:n) i i=1 g 0:n (ξ0:n) i i=1 for any function f depending on whole trajectory, where (ξ 1 0:n,,ξ N 0:n) is an N sample with common probability distribution η 0:n

63 58 (importance sampling SIS algorithm : 3) in particular if f = φ π depends only upon last state and not on whole sequence, then γ n,φ = γ 0:n,φ π 1 g 0:n (ξ i N 0:n)φ(ξn) i and i=1 g 0:n (ξ0:n)φ(ξ i n) i µ n,φ = µ 0:n,φ π i=1 g 0:n (ξ0:n) i for any function φ, where (ξ 1 0:n,,ξ N 0:n) is an N sample with common probability distribution η 0:n, and for i = 1 N ξ i n = π(ξ i 0:n) denotes last state of sequence ξ i 0:n = (ξ i 0,,ξ i n) i=1

64 59 (importance sampling SIS algorithm : 4) in other words and µ n µ N n = γ n γ N n = 1 N i=1 g 0:n (ξ0:n) i δ ξ i n i=1 g 0:n (ξ0:n) i δ ξ i = n g 0:n (ξ j 0:n ) j=1 wn i δ ξ i n where nonnegative normalized weights (wn, 1,wn N ) are defined for any i = 1 N by wn i = g 0:n(ξ0:n) i g 0:n (ξ j 0:n ) j=1 i=1

65 60 (importance sampling SIS algorithm : 5) SIS algorithm importance sampling approximation : non recursive depth first implementation simulate an N sample of hidden state sequences (ξ 1 0:n,,ξ N 0:n) : independently for any i = 1 N, simulate a sequence ξ i 0:n = (ξ i 0,,ξ i n), i.e. simulate a r.v. ξ i 0 according to η 0 (dx) for any k = 1 n simulate a r.v. ξ i k according to Q k(ξ i k 1,dx ) and define for any i = 1 N g 0:n (ξ i 0:n) = n k=0 g k (ξk 1,ξ i k) i and wn i = g 0:n(ξ0:n) i g 0:n (ξ0:n) i j=1

66 61 (importance sampling SIS algorithm : 6) importance sampling approximation : non recursive implementation for nonlinear and non Gaussian systems simulate an N sample of hidden state sequences (ξ 1 0:n,,ξ N 0:n) : independently for any i = 1 N, simulate a sequence ξ i 0:n = (ξ i 0,,ξ i n), i.e. simulate a r.v. ξ i 0 according to η 0 (dx) for any k = 1 n simulate a r.v. W i k according to pw k (dw) and set ξi k = f k(ξ i k 1,Wi k ) and define for any i = 1 N g 0:n (ξ i 0:n) = n k=0 qk V (Y k h k (ξk)) i and wn i = g 0:n(ξ0:n) i g 0:n (ξ0:n) i j=1

67 62 (importance sampling SIS algorithm : 7) recursive formulation of weights updating for any k = 1 n and for any i = 1 N wk i = g 0:k(ξ0:k) i = g 0:k (ξ j 0:k ) j=1 g 0:k 1 (ξ0:k 1) i g k (ξk 1,ξ i k) i = g 0:k 1 (ξ j 0:k 1 ) g k(ξ j k 1,ξj k ) j=1 wk 1 i g k (ξk 1,ξ i k) i w j k 1 g k(ξ j k 1,ξj k ) j=1 benefit : allows breadth first implementation

68 63 (importance sampling SIS algorithm : 8) SIS algorithm (sequential importance sampling) : recursive implementation for k = 0, independently for any i = 1 N simulate a r.v. ξ 0 i according to η 0(dx), and define w0 i = g 0(ξ0) i g 0 (ξ j 0 ) j=1 for any k = 1 n, independently for any i = 1 N simulate a r.v. ξ i k according to Q k(ξ i k 1,dx ), and update weight as wk i = wi k 1 g k(ξk 1 i,ξi k ) w j k 1 g k(ξ j k 1,ξj k ) j=1

69 64 (importance sampling SIS algorithm : 9) SIS algorithm (sequential importance sampling) : recursive implementation for nonlinear and non Gaussian systems for k = 0, independently for any i = 1 N simulate a r.v. ξ 0 i according to η 0(dx), and define w0 i = qv 0 (Y 0 h 0 (ξ0)) i q0 V (Y 0 h 0 (ξ j 0 )) j=1 for any k = 1 n, independently for any i = 1 N simulate a r.v. Wk i according to pw k (dw) and set ξi k = f k(ξk 1 i,wi k ), and update weight as wk i = wi k 1 qv k (Y k h k (ξk i)) w j k 1 qv k (Y k h k (ξ j k )) j=1

70 65 (importance sampling SIS algorithm : 10) pros : higher weights are allocated to simulated sequences that are often consistent with observations cons : weights are evaluated afterwards, and do not have impact on how sequences are simulated (blind simulation strategy) + along a given sequence, weights are accumulated in a multiplicative way weights degeneracy : in practice, one single sequence receives a much larger weight than all other sequences, whose contributions are therefore negligible memory effect : a sequence cannot be consistent with all observations a sequence that is consistent (resp. inconsistent) with current observation, but inconsistent (resp. consistent) with earlier observations, will receive a small (resp. a large) weight proposed solutions use observations to guide how sequences are simulated from time to time, replicate / terminate sequences according to their respective weights

71 66 (SIR algorithm : 1) approximate Bayesian filter using recursive formulation µ n (dx) = P[X n dx Y 0:n ] µ k 1 prediction η k = µ k 1 Q k vith initial condition µ 0 = g 0 η 0 correction µ k = g k η k idea : look for approximations in the form of (possibly weighted) empirical probability distributions η k η N k = vk i δ ξ i et µ k µ N k = k i=1 associated with population of N particles characterized by positions (ξ 1 k,,ξn k ) in E wk i δ ξ i k nonnegative normalized weights (v 1 k,,vn k ) and (w1 k,,wn k ) i=1 SIR algorithm

72 67 (SIR algorithm : 2) initial approximation : using importance sampling µ 0 = g 0 η 0 g 0 S N (η 0 ) = i=1 g 0 (ξ0) i δ ξ i = 0 g 0 (ξ j 0 ) w0 i δ ξ i 0 i=1 j=1 where variables (ξ 1 0,,ξ N 0 ) are i.i.d. with common probability distribution η 0 correction step : clearly, from definition µ N k = g k η N k = i=1 vk i g k(ξk i) δ ξ i = k v j k g k(ξ j k ) wk i δ ξ i k i=1 j=1 which automatically has desired form

73 68 (SIR algorithm : 3) prediction step : from definition µ N k 1Q k,φ = µ N k 1(dx) Q k (x,dx )φ(x ) for any function φ, hence = = i=1 w i k 1 Q k (ξk 1,dx i )φ(x ) [ wk 1 i Q k (ξk 1,dx i )]φ(x ) i=1 in form of a finite mixture, with µ N k 1Q k = wk 1 i m i k i=1 m i k(dx ) = Q k (ξ i k 1,dx ) for any i = 1 N requires further approximation (several sampling schemes available)

74 69 (SIR algorithm : 4) multinomial resampling simulate an N sample (ξ 1 k,,ξn k ) according to µn k 1 Q k, and set µ N k 1Q k η N k = S N (µ N k 1Q k ) = 1 N δ ξ i = k i=1 vk i δ ξ i k i=1 with v i k = 1/N for any i = 1 N weights are used to select (without replacement) mixture components with higher weights, with expected consequence that components with higher weights are selected several times conversely, components with lower weights are possibly discarded and will not further contribute to approximation if R i denotes how many times i th mixture component has been selected, or equivalently how many samples in new approximation originate from i th mixture component, for any i = 1 N, then r.v. (R 1,,R N ) has a multinomial distribution

75 70 (SIR algorithm : 5) intuitively, if all mixture weights are equal (or close) to 1/N, i.e. if distribution of mixture weights is close to equidistribution, then selecting mixture components could be counter productive weigths preservation simulate one individual exactly from each mixture component and preserve its weight, i.e. independently for any i = 1 N simulate ξk i according to m i k (dx ) = Q k (ξk 1 i,dx ) and set µ N k 1Q k η N k = wk 1 i δ ξ i = k i=1 vk i δ ξ i k i=1 with v i k = wi k 1 for any i = 1 N intuitively, this approach is appropriate if distribution of mixture weights is close to equidistribution, and less appropriate in extreme case where most weights are zero, except a few components with positive weights

76 71 (SIR algorithm : 6) SIR algorithm (sampling with importance resampling) : recursive implementation for k = 0, independently for any i = 1 N simulate a r.v. ξ i 0 according to η 0 (dx), and define w0 i = g 0(ξ0) i g 0 (ξ j 0 ) j=1 for any k = 1 n, independently for any i = 1 N select an individual ξ i k 1 among population (ξ1 k 1,,ξN k 1 ) and according to weights (w 1 k 1,,wN k 1 ) simulate a r.v. ξ i k according to Q k( ξ i k 1,dx ) and define wk i = g k(ξk 1 i,ξi k ) g k (ξ j k 1,ξj k ) j=1

77 72 (SIR algorithm : 7) SIR algorithm (sampling with importance resampling) : recursive formulation for nonlinear and non Gaussian systems for k = 0, independently for any i = 1 N simulate a r.v. ξ i 0 according to η 0 (dx), and define w0 i = qv 0 (Y 0 h 0 (ξ0)) i q0 V (Y 0 h 0 (ξ j 0 )) j=1 for any k = 1 n, independently for any i = 1 N select and individual ξ i k 1 among population (ξ1 k 1,,ξN k 1 ) and according to weights (w 1 k 1,,wN k 1 ) simulate a r.v. W i k according to pw k (dw) and set ξi k = f k( ξ i k 1,Wi k ) and define wk i = qv k (Y k h k (ξk i)) qk V (Y k h k (ξ j k )) j=1

78 73 (SIR algorithm : 8) to summarize, particles (ξ 1 k 1,,ξN k 1 ) are selected according to their respective weights (w 1 k 1,,wN k 1 ) [selection step] evolve according to transition probabilities Q k (x,dx ) [mutation step] and are weighted by evaluating likelihood function g k [weighting step] pros : weights do not accumulate along each sequence, but are used to select (or resample) particles particles with larger (resp. smaller) weights are replicated (resp. are terminated) by keeping only most probable particles at each time instant, expected benefit is to concentrate available computing power within regions of interest

79 74 (SIR algorithm : 9) cons : introduces additional randomness, in resampling (selection) step proposed solutions alternate resampling strategies, that allocate an (almost) deterministic number of offsprings to each selected particle adaptive resampling, only when weights (wk 1,,wN k unbalanced (far from equidistribution) ) are too much cons : because of replication, fewer truly distinct positions are available (sample impoverishment) positions degeneracy : in practice, implicitly rely on mutation step to bring diversity again proposed solution after resampling (selection) step, add some random move to each selected particle, or apply some artificial Markovian dynamics (Metropolis Hastings, Gibbs sampling, etc.)

80 75 (particle filtering : adaptive sampling / resampling : 1) given a finite mixture m = w i m i i=1 adaptive SIR algorithm selecting mixture components is interesting only if weights (w 1,,w N ) are far from equidistribution several heuristic criteria have been proposed to quantify departure from equidistribution, and to decide wether particles should be resampled or not, e.g. effective sample size entropy

81 76 (particle filtering : adaptive sampling / resampling : 2) χ 2 distance and effective sample size χ 2 distance between two probability vectors p = (p 1,,p N ) and q = (q 1,,q N ) is defined as χ 2 (p,q) = i=1 q i ( p i q i 1) 2 in particular for p = w = (w 1,,w N ) and q = (1/N,,1/N), it holds hence 0 1 N (N w i 1) 2 = 1 N i=1 (N w i ) 2 1 = N i=1 1 N eff = 1 / [ wi 2 ] N i=1 wi 2 1 where equality is attained at equidistribution, which suggests to resample if H(w 1,,w N ) = N i=1 for some threshold H red > 0 still to be fixed i=1 w 2 i 1 = N N eff 1 H red

82 77 (estimation error, CLT : 1) on the way to asymptotic results (in 3 slides) recall linear evolution for unnormalized version of Bayesian filter γ k = γ k 1 R k = g k (γ k 1 Q k ) = g k (µ k 1 Q k ) γ k 1,1 = g k η k γ k 1,1 with initial condition γ 0 = g 0 η 0 proposed particle approximation for unnormalized distribution γk N = g k ηk N γk 1,1 N with initial condition γ0 N = g 0 η0 N and η0 N = S N (η 0 ) : clearly γk N,1 = ηk N,g k γk 1,1 N and γ0 N,1 = η0 N,g 0 and it follows γ N k γ N k,1 = g k η N k = µ N k and γ N 0 γ N 0,1 = g 0 η N 0 = µ N 0 normalized version of proposed particle approximation for γk N SIR bootstrap approximation µ N k for Bayesian filter coincides with

83 78 (estimation error, CLT : 2) Remark key for induction : for any k = 1 n and by difference hence γ N k γ k = g k η N k γ N k 1,1 g k (γ k 1 Q k ) = g k (γ N k 1Q k γ k 1 Q k )+g k (η N k µ N k 1Q k ) γ N k 1,1 γ N k γ k,φ = γ N k 1 γ k 1,Q k (g k φ) + η N k µ N k 1Q k,g k φ γ N k 1,1 error at current generation, evaluated on function φ, is decomposed into error at previous generation, evaluated on function R k φ = Q k (g k φ) local error resulting from Monte Carlo approximation even though samples are actually dependent, because of resampling at each generation, conditionally on previous generations, new samples are generated independently

84 79 (estimation error, CLT : 3) with this conditioning argument, error estimates sup E γn k γ k,φ φ: φ =1 γ k,1 c k N and sup E µ N k µ k,φ 2 c k φ: φ =1 N of order 1/ N, and CLT N γ N k γ k,φ γ k,1 = N(0,V k (φ)) and N µ N k µ k,φ = N(0,v k (φ)) with v k (φ) = V k (φ µ k,φ ) can be obtained by induction

85 Some algorithmic variants regularization progressive weighting, MCMC iterations sample size adaptation marginalization aka Rao Blackwellization interacting Kalman filters interacting finite state (Baum) filters

86 80 (marginalization aka Rao Blackwellization : 1) conditionning as a variance reduction technique if E[f(X 1,X 2 )] = E[E[f(X 1,X 2 ) X 2 ]] = E[F(X 2 )] F(x 2 ) = E[f(X 1,X 2 ) X 2 = x 2 ] = has an explicit expression, then Monte Carlo estimator 1 N E 1 f(x 1,x 2 ) P[X 1 dx 1 X 2 = x 2 ] F(Xi) 2 E[F(X 2 )] = E[f(X 1,X 2 )] i=1 where (X1, 2,XN 2 ) is an N sample with same common distribution as X2, has smaller variance than Monte Carlo estimator 1 f(x 1 N i,xi) 2 E[f(X 1,X 2 )] i=1 where ((X1,X 1 1), 2,(XN 1,X2 N ) is an N sample with same common distribution as (X 1,X 2 )

87 81 (marginalization aka Rao Blackwellization : 2) 1st example : conditionnally linear Gaussian systems X L k = F L k (X NL k 1) X L k 1 +f L k(x NL k 1)+W L k X NL k = Fk NL (Xk 1) NL Xk 1 L +fk NL (Xk 1)+W NL k NL Y k = h k (X NL k )+V k clearly E[φ(X L n,x NL n ) n k=0 g k (X NL k )] = E[E[φ(X L n,x NL n ) X NL 0:n] n k=0 g k (X NL k )] and conditional distribution of said linear component Xn L given said nonlinear component sequence X0:n, NL L NL is Gaussian, with mean X k and covariance matrix given explicitly, in recursive form, by Kalman filter equation P L NL k introduce new hidden state {(X NL k, X L NL k,p L NL k )} instead of {(Xk L,XNL k )} benefit : explore with particles subspace associated with nonlinear components, and associated with each particle, a Kalman filter estimates linear components

88 82 (marginalization aka Rao Blackwellization : 3) 2nd example : non linear systems with Markovian switching regimes / modes X k = f k (s k 1,X k 1,W k ) Y k = h k (X k )+V k where regime / mode sequence {s k } forms a Markov chain with finite state space clearly E[φ(s n,x n ) n k=0 g k (X k )] = E[E[φ(s n,x n ) X 0:n ] n k=0 g k (X k )] and conditional distribution of regime / mode s n given continuous components sequence X 0:n, is a finite dimensional probability vector defined by p i n = P[s n = i X 0:n ] for any i I given explicitly, in recursive form, by solving Baum forward equation introduce new hidden state {(X k,p k )} instead of {(s k,x k )} benefit : avoid sampling finite state space

89 Conclusion particle filtering provides an implementation of Bayesian approach that is intuitive, easy to understand and implement flexible, adapts to many models, many algorithmic variants available numerically efficient, through some selection mechanism amenable to mathematical analysis

### CPSC 540: Machine Learning

CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

### Sequential Monte Carlo and Particle Filtering. Frank Wood Gatsby, November 2007

Sequential Monte Carlo and Particle Filtering Frank Wood Gatsby, November 2007 Importance Sampling Recall: Let s say that we want to compute some expectation (integral) E p [f] = p(x)f(x)dx and we remember

### Particle Filters. Pieter Abbeel UC Berkeley EECS. Many slides adapted from Thrun, Burgard and Fox, Probabilistic Robotics

Particle Filters Pieter Abbeel UC Berkeley EECS Many slides adapted from Thrun, Burgard and Fox, Probabilistic Robotics Motivation For continuous spaces: often no analytical formulas for Bayes filter updates

### Lecture 2: From Linear Regression to Kalman Filter and Beyond

Lecture 2: From Linear Regression to Kalman Filter and Beyond Department of Biomedical Engineering and Computational Science Aalto University January 26, 2012 Contents 1 Batch and Recursive Estimation

### Lecture 2: From Linear Regression to Kalman Filter and Beyond

Lecture 2: From Linear Regression to Kalman Filter and Beyond January 18, 2017 Contents 1 Batch and Recursive Estimation 2 Towards Bayesian Filtering 3 Kalman Filter and Bayesian Filtering and Smoothing

### PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 13: SEQUENTIAL DATA

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 13: SEQUENTIAL DATA Contents in latter part Linear Dynamical Systems What is different from HMM? Kalman filter Its strength and limitation Particle Filter

### Linear Dynamical Systems

Linear Dynamical Systems Sargur N. srihari@cedar.buffalo.edu Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/cse574/index.html Two Models Described by Same Graph Latent variables Observations

### Chapter 7. Markov chain background. 7.1 Finite state space

Chapter 7 Markov chain background A stochastic process is a family of random variables {X t } indexed by a varaible t which we will think of as time. Time can be discrete or continuous. We will only consider

### Sensor Fusion: Particle Filter

Sensor Fusion: Particle Filter By: Gordana Stojceska stojcesk@in.tum.de Outline Motivation Applications Fundamentals Tracking People Advantages and disadvantages Summary June 05 JASS '05, St.Petersburg,

### Computer Intensive Methods in Mathematical Statistics

Computer Intensive Methods in Mathematical Statistics Department of mathematics johawes@kth.se Lecture 16 Advanced topics in computational statistics 18 May 2017 Computer Intensive Methods (1) Plan of

Adaptive Monte Carlo methods Jean-Michel Marin Projet Select, INRIA Futurs, Université Paris-Sud joint with Randal Douc (École Polytechnique), Arnaud Guillin (Université de Marseille) and Christian Robert

### CS 630 Basic Probability and Information Theory. Tim Campbell

CS 630 Basic Probability and Information Theory Tim Campbell 21 January 2003 Probability Theory Probability Theory is the study of how best to predict outcomes of events. An experiment (or trial or event)

### Lecture 6: Bayesian Inference in SDE Models

Lecture 6: Bayesian Inference in SDE Models Bayesian Filtering and Smoothing Point of View Simo Särkkä Aalto University Simo Särkkä (Aalto) Lecture 6: Bayesian Inference in SDEs 1 / 45 Contents 1 SDEs

### Monte-Carlo MMD-MA, Université Paris-Dauphine. Xiaolu Tan

Monte-Carlo MMD-MA, Université Paris-Dauphine Xiaolu Tan tan@ceremade.dauphine.fr Septembre 2015 Contents 1 Introduction 1 1.1 The principle.................................. 1 1.2 The error analysis

### Advanced Computational Methods in Statistics: Lecture 5 Sequential Monte Carlo/Particle Filtering

Advanced Computational Methods in Statistics: Lecture 5 Sequential Monte Carlo/Particle Filtering Axel Gandy Department of Mathematics Imperial College London http://www2.imperial.ac.uk/~agandy London

### Lecture 6: Multiple Model Filtering, Particle Filtering and Other Approximations

Lecture 6: Multiple Model Filtering, Particle Filtering and Other Approximations Department of Biomedical Engineering and Computational Science Aalto University April 28, 2010 Contents 1 Multiple Model

### A new class of interacting Markov Chain Monte Carlo methods

A new class of interacting Marov Chain Monte Carlo methods P Del Moral, A Doucet INRIA Bordeaux & UBC Vancouver Worshop on Numerics and Stochastics, Helsini, August 2008 Outline 1 Introduction Stochastic

### An Brief Overview of Particle Filtering

1 An Brief Overview of Particle Filtering Adam M. Johansen a.m.johansen@warwick.ac.uk www2.warwick.ac.uk/fac/sci/statistics/staff/academic/johansen/talks/ May 11th, 2010 Warwick University Centre for Systems

### 4 Derivations of the Discrete-Time Kalman Filter

Technion Israel Institute of Technology, Department of Electrical Engineering Estimation and Identification in Dynamical Systems (048825) Lecture Notes, Fall 2009, Prof N Shimkin 4 Derivations of the Discrete-Time

### An introduction to Sequential Monte Carlo

An introduction to Sequential Monte Carlo Thang Bui Jes Frellsen Department of Engineering University of Cambridge Research and Communication Club 6 February 2014 1 Sequential Monte Carlo (SMC) methods

### Introduction to Bayesian methods in inverse problems

Introduction to Bayesian methods in inverse problems Ville Kolehmainen 1 1 Department of Applied Physics, University of Eastern Finland, Kuopio, Finland March 4 2013 Manchester, UK. Contents Introduction

### Bagging During Markov Chain Monte Carlo for Smoother Predictions

Bagging During Markov Chain Monte Carlo for Smoother Predictions Herbert K. H. Lee University of California, Santa Cruz Abstract: Making good predictions from noisy data is a challenging problem. Methods

### Kalman filtering and friends: Inference in time series models. Herke van Hoof slides mostly by Michael Rubinstein

Kalman filtering and friends: Inference in time series models Herke van Hoof slides mostly by Michael Rubinstein Problem overview Goal Estimate most probable state at time k using measurement up to time

### Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions

Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite

### x log x, which is strictly convex, and use Jensen s Inequality:

2. Information measures: mutual information 2.1 Divergence: main inequality Theorem 2.1 (Information Inequality). D(P Q) 0 ; D(P Q) = 0 iff P = Q Proof. Let ϕ(x) x log x, which is strictly convex, and

### Basic math for biology

Basic math for biology Lei Li Florida State University, Feb 6, 2002 The EM algorithm: setup Parametric models: {P θ }. Data: full data (Y, X); partial data Y. Missing data: X. Likelihood and maximum likelihood

### Particle Filters. Outline

Particle Filters M. Sami Fadali Professor of EE University of Nevada Outline Monte Carlo integration. Particle filter. Importance sampling. Degeneracy Resampling Example. 1 2 Monte Carlo Integration Numerical

### Lecture 4 October 18th

Directed and undirected graphical models Fall 2017 Lecture 4 October 18th Lecturer: Guillaume Obozinski Scribe: In this lecture, we will assume that all random variables are discrete, to keep notations

### Markov Chain Monte Carlo Methods for Stochastic

Markov Chain Monte Carlo Methods for Stochastic Optimization i John R. Birge The University of Chicago Booth School of Business Joint work with Nicholas Polson, Chicago Booth. JRBirge U Florida, Nov 2013

### Let X and Y denote two random variables. The joint distribution of these random

EE385 Class Notes 9/7/0 John Stensby Chapter 3: Multiple Random Variables Let X and Y denote two random variables. The joint distribution of these random variables is defined as F XY(x,y) = [X x,y y] P.

### Auxiliary Particle Methods

Auxiliary Particle Methods Perspectives & Applications Adam M. Johansen 1 adam.johansen@bristol.ac.uk Oxford University Man Institute 29th May 2008 1 Collaborators include: Arnaud Doucet, Nick Whiteley

### Probabilistic Graphical Models

2016 Robert Nowak Probabilistic Graphical Models 1 Introduction We have focused mainly on linear models for signals, in particular the subspace model x = Uθ, where U is a n k matrix and θ R k is a vector

### DAG models and Markov Chain Monte Carlo methods a short overview

DAG models and Markov Chain Monte Carlo methods a short overview Søren Højsgaard Institute of Genetics and Biotechnology University of Aarhus August 18, 2008 Printed: August 18, 2008 File: DAGMC-Lecture.tex

### Monte Carlo Methods. Geoff Gordon February 9, 2006

Monte Carlo Methods Geoff Gordon ggordon@cs.cmu.edu February 9, 2006 Numerical integration problem 5 4 3 f(x,y) 2 1 1 0 0.5 0 X 0.5 1 1 0.8 0.6 0.4 Y 0.2 0 0.2 0.4 0.6 0.8 1 x X f(x)dx Used for: function

### STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

### Particle filters, the optimal proposal and high-dimensional systems

Particle filters, the optimal proposal and high-dimensional systems Chris Snyder National Center for Atmospheric Research Boulder, Colorado 837, United States chriss@ucar.edu 1 Introduction Particle filters

### State Estimation using Moving Horizon Estimation and Particle Filtering

State Estimation using Moving Horizon Estimation and Particle Filtering James B. Rawlings Department of Chemical and Biological Engineering UW Math Probability Seminar Spring 2009 Rawlings MHE & PF 1 /

### Markov Chain Monte Carlo (MCMC)

School of Computer Science 10-708 Probabilistic Graphical Models Markov Chain Monte Carlo (MCMC) Readings: MacKay Ch. 29 Jordan Ch. 21 Matt Gormley Lecture 16 March 14, 2016 1 Homework 2 Housekeeping Due

### 1 Using standard errors when comparing estimated values

MLPR Assignment Part : General comments Below are comments on some recurring issues I came across when marking the second part of the assignment, which I thought it would help to explain in more detail

### Based on slides by Richard Zemel

CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we

### Factor Graphs and Message Passing Algorithms Part 1: Introduction

Factor Graphs and Message Passing Algorithms Part 1: Introduction Hans-Andrea Loeliger December 2007 1 The Two Basic Problems 1. Marginalization: Compute f k (x k ) f(x 1,..., x n ) x 1,..., x n except

### Pattern Recognition and Machine Learning. Bishop Chapter 6: Kernel Methods

Pattern Recognition and Machine Learning Chapter 6: Kernel Methods Vasil Khalidov Alex Kläser December 13, 2007 Training Data: Keep or Discard? Parametric methods (linear/nonlinear) so far: learn parameter

### An Introduction to Sequential Monte Carlo for Filtering and Smoothing

An Introduction to Sequential Monte Carlo for Filtering and Smoothing Olivier Cappé LTCI, TELECOM ParisTech & CNRS http://perso.telecom-paristech.fr/ cappe/ Acknowlegdment: Eric Moulines (TELECOM ParisTech)

### Introduction to Particle Filters for Data Assimilation

Introduction to Particle Filters for Data Assimilation Mike Dowd Dept of Mathematics & Statistics (and Dept of Oceanography Dalhousie University, Halifax, Canada STATMOS Summer School in Data Assimila5on,

Self Adaptive Particle Filter Alvaro Soto Pontificia Universidad Catolica de Chile Department of Computer Science Vicuna Mackenna 4860 (143), Santiago 22, Chile asoto@ing.puc.cl Abstract The particle filter

### GAUSSIAN PROCESS REGRESSION

GAUSSIAN PROCESS REGRESSION CSE 515T Spring 2015 1. BACKGROUND The kernel trick again... The Kernel Trick Consider again the linear regression model: y(x) = φ(x) w + ε, with prior p(w) = N (w; 0, Σ). The

### Sampling Rejection Sampling Importance Sampling Markov Chain Monte Carlo. Sampling Methods. Oliver Schulte - CMPT 419/726. Bishop PRML Ch.

Sampling Methods Oliver Schulte - CMP 419/726 Bishop PRML Ch. 11 Recall Inference or General Graphs Junction tree algorithm is an exact inference method for arbitrary graphs A particular tree structure

### Bayesian Estimation of DSGE Models

Bayesian Estimation of DSGE Models Stéphane Adjemian Université du Maine, GAINS & CEPREMAP stephane.adjemian@univ-lemans.fr http://www.dynare.org/stepan June 28, 2011 June 28, 2011 Université du Maine,

### Ergodicity in data assimilation methods

Ergodicity in data assimilation methods David Kelly Andy Majda Xin Tong Courant Institute New York University New York NY www.dtbkelly.com April 15, 2016 ETH Zurich David Kelly (CIMS) Data assimilation

### An introduction to particle filters

An introduction to particle filters Andreas Svensson Department of Information Technology Uppsala University June 10, 2014 June 10, 2014, 1 / 16 Andreas Svensson - An introduction to particle filters Outline

### ECONOMETRIC METHODS II: TIME SERIES LECTURE NOTES ON THE KALMAN FILTER. The Kalman Filter. We will be concerned with state space systems of the form

ECONOMETRIC METHODS II: TIME SERIES LECTURE NOTES ON THE KALMAN FILTER KRISTOFFER P. NIMARK The Kalman Filter We will be concerned with state space systems of the form X t = A t X t 1 + C t u t 0.1 Z t

### Machine Learning. Probability Basics. Marc Toussaint University of Stuttgart Summer 2014

Machine Learning Probability Basics Basic definitions: Random variables, joint, conditional, marginal distribution, Bayes theorem & examples; Probability distributions: Binomial, Beta, Multinomial, Dirichlet,

### 4.1 Notation and probability review

Directed and undirected graphical models Fall 2015 Lecture 4 October 21st Lecturer: Simon Lacoste-Julien Scribe: Jaime Roquero, JieYing Wu 4.1 Notation and probability review 4.1.1 Notations Let us recall

### MCMC algorithms for fitting Bayesian models

MCMC algorithms for fitting Bayesian models p. 1/1 MCMC algorithms for fitting Bayesian models Sudipto Banerjee sudiptob@biostat.umn.edu University of Minnesota MCMC algorithms for fitting Bayesian models

### Monte Carlo Simulations

Monte Carlo Simulations What are Monte Carlo Simulations and why ones them? Pseudo Random Number generators Creating a realization of a general PDF The Bootstrap approach A real life example: LOFAR simulations

### Sampling Algorithms for Probabilistic Graphical models

Sampling Algorithms for Probabilistic Graphical models Vibhav Gogate University of Washington References: Chapter 12 of Probabilistic Graphical models: Principles and Techniques by Daphne Koller and Nir

### Unsupervised Learning

Unsupervised Learning Bayesian Model Comparison Zoubin Ghahramani zoubin@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc in Intelligent Systems, Dept Computer Science University College

### Lecture 13 and 14: Bayesian estimation theory

1 Lecture 13 and 14: Bayesian estimation theory Spring 2012 - EE 194 Networked estimation and control (Prof. Khan) March 26 2012 I. BAYESIAN ESTIMATORS Mother Nature conducts a random experiment that generates

### Quantitative Methods in Economics Conditional Expectations

Quantitative Methods in Economics Conditional Expectations Maximilian Kasy Harvard University, fall 2016 1 / 19 Roadmap, Part I 1. Linear predictors and least squares regression 2. Conditional expectations

### Three examples of a Practical Exact Markov Chain Sampling

Three examples of a Practical Exact Markov Chain Sampling Zdravko Botev November 2007 Abstract We present three examples of exact sampling from complex multidimensional densities using Markov Chain theory

### Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

### Probabilistic Graphical Models

Probabilistic Graphical Models Brown University CSCI 295-P, Spring 213 Prof. Erik Sudderth Lecture 11: Inference & Learning Overview, Gaussian Graphical Models Some figures courtesy Michael Jordan s draft

### NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function

### Densities for the Navier Stokes equations with noise

Densities for the Navier Stokes equations with noise Marco Romito Università di Pisa Universitat de Barcelona March 25, 2015 Summary 1 Introduction & motivations 2 Malliavin calculus 3 Besov bounds 4 Other

### Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas

Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric

### The Hierarchical Particle Filter

and Arnaud Doucet http://go.warwick.ac.uk/amjohansen/talks MCMSki V Lenzerheide 7th January 2016 Context & Outline Filtering in State-Space Models: SIR Particle Filters [GSS93] Block-Sampling Particle

### Bayesian Monte Carlo Filtering for Stochastic Volatility Models

Bayesian Monte Carlo Filtering for Stochastic Volatility Models Roberto Casarin CEREMADE University Paris IX (Dauphine) and Dept. of Economics University Ca Foscari, Venice Abstract Modelling of the financial

### Stat 535 C - Statistical Computing & Monte Carlo Methods. Arnaud Doucet.

Stat 535 C - Statistical Computing & Monte Carlo Methods Arnaud Doucet Email: arnaud@cs.ubc.ca 1 1.1 Outline Introduction to Markov chain Monte Carlo The Gibbs Sampler Examples Overview of the Lecture

### EE4601 Communication Systems

EE4601 Communication Systems Week 2 Review of Probability, Important Distributions 0 c 2011, Georgia Institute of Technology (lect2 1) Conditional Probability Consider a sample space that consists of two

### Sequential Monte Carlo Methods

University of Pennsylvania Bradley Visitor Lectures October 23, 2017 Introduction Unfortunately, standard MCMC can be inaccurate, especially in medium and large-scale DSGE models: disentangling importance

### Gaussian Processes (10/16/13)

STA561: Probabilistic machine learning Gaussian Processes (10/16/13) Lecturer: Barbara Engelhardt Scribes: Changwei Hu, Di Jin, Mengdi Wang 1 Introduction In supervised learning, we observe some inputs

### Machine Learning Techniques for Computer Vision

Machine Learning Techniques for Computer Vision Part 2: Unsupervised Learning Microsoft Research Cambridge x 3 1 0.5 0.2 0 0.5 0.3 0 0.5 1 ECCV 2004, Prague x 2 x 1 Overview of Part 2 Mixture models EM

### A nested sampling particle filter for nonlinear data assimilation

Quarterly Journal of the Royal Meteorological Society Q. J. R. Meteorol. Soc. : 14, July 2 A DOI:.2/qj.224 A nested sampling particle filter for nonlinear data assimilation Ahmed H. Elsheikh a,b *, Ibrahim

### SAMPLING ALGORITHMS. In general. Inference in Bayesian models

SAMPLING ALGORITHMS SAMPLING ALGORITHMS In general A sampling algorithm is an algorithm that outputs samples x 1, x 2,... from a given distribution P or density p. Sampling algorithms can for example be

### Learning Static Parameters in Stochastic Processes

Learning Static Parameters in Stochastic Processes Bharath Ramsundar December 14, 2012 1 Introduction Consider a Markovian stochastic process X T evolving (perhaps nonlinearly) over time variable T. We

### Review of Probability Theory

Review of Probability Theory Arian Maleki and Tom Do Stanford University Probability theory is the study of uncertainty Through this class, we will be relying on concepts from probability theory for deriving

### Mark Gales October y (x) x 1. x 2 y (x) Inputs. Outputs. x d. y (x) Second Output layer layer. layer.

University of Cambridge Engineering Part IIB & EIST Part II Paper I0: Advanced Pattern Processing Handouts 4 & 5: Multi-Layer Perceptron: Introduction and Training x y (x) Inputs x 2 y (x) 2 Outputs x

### Bayesian Inference: Principles and Practice 3. Sparse Bayesian Models and the Relevance Vector Machine

Bayesian Inference: Principles and Practice 3. Sparse Bayesian Models and the Relevance Vector Machine Mike Tipping Gaussian prior Marginal prior: single α Independent α Cambridge, UK Lecture 3: Overview

### First-Order ODE: Separable Equations, Exact Equations and Integrating Factor

First-Order ODE: Separable Equations, Exact Equations and Integrating Factor Department of Mathematics IIT Guwahati REMARK: In the last theorem of the previous lecture, you can change the open interval

### STA 294: Stochastic Processes & Bayesian Nonparametrics

MARKOV CHAINS AND CONVERGENCE CONCEPTS Markov chains are among the simplest stochastic processes, just one step beyond iid sequences of random variables. Traditionally they ve been used in modelling a

### EnKF-based particle filters

EnKF-based particle filters Jana de Wiljes, Sebastian Reich, Wilhelm Stannat, Walter Acevedo June 20, 2017 Filtering Problem Signal dx t = f (X t )dt + 2CdW t Observations dy t = h(x t )dt + R 1/2 dv t.

### LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES. Supervised Learning

LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES Supervised Learning Linear vs non linear classifiers In K-NN we saw an example of a non-linear classifier: the decision boundary

### Bayesian Learning in Undirected Graphical Models

Bayesian Learning in Undirected Graphical Models Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London, UK http://www.gatsby.ucl.ac.uk/ and Center for Automated Learning and

### Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

### Markov Networks.

Markov Networks www.biostat.wisc.edu/~dpage/cs760/ Goals for the lecture you should understand the following concepts Markov network syntax Markov network semantics Potential functions Partition function

### AUTOMATIC CONTROL REGLERTEKNIK LINKÖPINGS UNIVERSITET. Machine Learning T. Schön. (Chapter 11) AUTOMATIC CONTROL REGLERTEKNIK LINKÖPINGS UNIVERSITET

About the Eam I/II) ), Lecture 7 MCMC and Sampling Methods Thomas Schön Division of Automatic Control Linköping University Linköping, Sweden. Email: schon@isy.liu.se, Phone: 3-373, www.control.isy.liu.se/~schon/

### Logistic Regression: Online, Lazy, Kernelized, Sequential, etc.

Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Harsha Veeramachaneni Thomson Reuter Research and Development April 1, 2010 Harsha Veeramachaneni (TR R&D) Logistic Regression April 1, 2010

### Bayesian Computations for DSGE Models

Bayesian Computations for DSGE Models Frank Schorfheide University of Pennsylvania, PIER, CEPR, and NBER October 23, 2017 This Lecture is Based on Bayesian Estimation of DSGE Models Edward P. Herbst &

Sensor Tasking and Control Sensing Networking Leonidas Guibas Stanford University Computation CS428 Sensor systems are about sensing, after all... System State Continuous and Discrete Variables The quantities

### MATH 415, WEEKS 14 & 15: 1 Recurrence Relations / Difference Equations

MATH 415, WEEKS 14 & 15: Recurrence Relations / Difference Equations 1 Recurrence Relations / Difference Equations In many applications, the systems are updated in discrete jumps rather than continuous

### Markov Chain Monte Carlo, Numerical Integration

Markov Chain Monte Carlo, Numerical Integration (See Statistics) Trevor Gallen Fall 2015 1 / 1 Agenda Numerical Integration: MCMC methods Estimating Markov Chains Estimating latent variables 2 / 1 Numerical

### Some Probability and Statistics

Some Probability and Statistics David M. Blei COS424 Princeton University February 13, 2012 Card problem There are three cards Red/Red Red/Black Black/Black I go through the following process. Close my

### STA 256: Statistics and Probability I

Al Nosedal. University of Toronto. Fall 2017 My momma always said: Life was like a box of chocolates. You never know what you re gonna get. Forrest Gump. There are situations where one might be interested

### Notes on pseudo-marginal methods, variational Bayes and ABC

Notes on pseudo-marginal methods, variational Bayes and ABC Christian Andersson Naesseth October 3, 2016 The Pseudo-Marginal Framework Assume we are interested in sampling from the posterior distribution

### Nonparametric Drift Estimation for Stochastic Differential Equations

Nonparametric Drift Estimation for Stochastic Differential Equations Gareth Roberts 1 Department of Statistics University of Warwick Brazilian Bayesian meeting, March 2010 Joint work with O. Papaspiliopoulos,

### Machine Learning for OR & FE

Machine Learning for OR & FE Hidden Markov Models Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com Additional References: David

### Bayesian Machine Learning

Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 4 Occam s Razor, Model Construction, and Directed Graphical Models https://people.orie.cornell.edu/andrew/orie6741 Cornell University September

### Inference in Bayesian Networks

Andrea Passerini passerini@disi.unitn.it Machine Learning Inference in graphical models Description Assume we have evidence e on the state of a subset of variables E in the model (i.e. Bayesian Network)