Algorithmic Variants and Theoretical Justifications (Variantes Algorithmiques et Justifications Théoriques)

1 Scientific complement (complément scientifique), École Doctorale MATISSE. IRISA and INRIA, salle Markov, Thursday 26 January 2012. Algorithmic Variants and Theoretical Justifications. François Le Gland, INRIA Rennes and IRMAR
2 ... programme of the day
#1 more general models, from non linear and non Gaussian systems to hidden Markov models and partially observed Markov chains, so as to handle e.g. regime / mode switching, or correlation between state noise and observation noise
#2 for each of these models (or just for the most general model), representation of $P[X_{0:n} \in dx_{0:n} \mid Y_{0:n}]$ as a Gibbs Boltzmann distribution, with recursive formulation, and likewise for $P[X_n \in dx_n \mid Y_{0:n}]$
#3 particle approximation (SIS and SIR algorithms) from either representation
#4 asymptotic behaviour as sample size goes to infinity
#5 numerous algorithmic variants
3 1 (some notations : 1) some notations (continued)
if $X$ is a random variable taking values in $E$, then the mapping $\phi \mapsto E[\phi(X)]$, or equivalently $A \mapsto P[X \in A]$, defines a probability distribution $\mu$ on $E$, denoted $\mu(dx) = P[X \in dx]$, which characterizes uncertainty about $X$ and is such that
$$E[\phi(X)] = \int_E \phi(x)\,\mu(dx) = \langle \mu, \phi \rangle \quad\text{or}\quad P[X \in A] = \mu(A)$$
4 2 (some notations : 2) transition probability kernel $M(x,dx')$ on $E$: a collection of probability distributions on $E$ indexed by $x \in E$
it acts on functions according to
$$M\phi(x) = \int_E M(x,dx')\,\phi(x')$$
and on probability distributions according to
$$\mu M(dx') = \int_E \mu(dx)\,M(x,dx')$$
seen as a mixture distribution, characterized by
$$\langle \mu M, \phi \rangle = \int_E \Big[\int_E \mu(dx)\,M(x,dx')\Big]\,\phi(x') = \int_E \mu(dx)\,\Big[\int_E M(x,dx')\,\phi(x')\Big] = \langle \mu, M\phi \rangle$$
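The two actions of a kernel can be checked on a finite state space, where a kernel is just a row-stochastic matrix. The sketch below assumes $E = \{0, 1, 2\}$; the matrix `M`, the distribution `mu` and the function `phi` are arbitrary illustrative values, not taken from the slides.

```python
# Finite-state illustration: E = {0, 1, 2}, kernel M stored as a row-stochastic matrix.
# All numerical values are hypothetical and chosen only for illustration.
M = [[0.5, 0.5, 0.0],
     [0.1, 0.6, 0.3],
     [0.0, 0.2, 0.8]]
mu = [0.2, 0.3, 0.5]      # probability distribution on E
phi = [1.0, -2.0, 4.0]    # test function on E


def kernel_on_function(M, phi):
    """(M phi)(x) = sum over x' of M(x, x') phi(x')."""
    return [sum(M[x][y] * phi[y] for y in range(len(phi))) for x in range(len(M))]


def measure_on_kernel(mu, M):
    """(mu M)(x') = sum over x of mu(x) M(x, x'): a mixture of the rows of M."""
    return [sum(mu[x] * M[x][y] for x in range(len(mu))) for y in range(len(M[0]))]


def pairing(mu, phi):
    """<mu, phi> = sum over x of mu(x) phi(x)."""
    return sum(m * p for m, p in zip(mu, phi))


lhs = pairing(measure_on_kernel(mu, M), phi)   # <mu M, phi>
rhs = pairing(mu, kernel_on_function(M, phi))  # <mu, M phi>
```

Both pairings agree up to rounding, which is the finite-state version of the identity $\langle \mu M, \phi \rangle = \langle \mu, M\phi \rangle$.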
5 Non linear and non Gaussian systems, and beyond: non linear and non Gaussian systems; hidden Markov models; partially observed Markov chains; likelihood free models
6 3 (non linear and non Gaussian systems : 1) non linear and non Gaussian systems
prior model for hidden state taking values in $E$
$$X_k = f_k(X_{k-1}, W_k) \quad\text{with}\quad W_k \sim p_k^W(dw)$$
with initial condition $X_0 \sim \eta_0(dx)$
observation taking values in $\mathbb{R}^d$, with additive noise admitting a density
$$Y_k = h_k(X_k) + V_k \quad\text{with}\quad V_k \sim q_k^V(v)\,dv$$
random variables $X_0$, $W_1, \dots, W_k, \dots$ and $V_0, V_1, \dots, V_k, \dots$ are mutually independent, but not necessarily Gaussian
only requirement (to be used later on): it should be easy to
simulate a r.v. according to $\eta_0(dx)$ or to $p_k^W(dw)$
evaluate function $q_k^V(v)$ for any $v \in \mathbb{R}^d$
7 4 (non linear and non Gaussian systems : 2) Proposition: hidden states $\{X_k\}$ form a Markov chain taking values in $E$, i.e.
$$P[X_k \in dx' \mid X_{0:k-1}] = P[X_k \in dx' \mid X_{k-1}]$$
characterization in terms of transition kernel
$$P[X_k \in dx' \mid X_{k-1} = x] = Q_k(x,dx')$$
defined (implicitly) by its action on functions
$$Q_k\phi(x) = E[\phi(X_k) \mid X_{k-1} = x] = E[\phi(f_k(X_{k-1}, W_k)) \mid X_{k-1} = x] = \int_{\mathbb{R}^p} \phi(f_k(x,w))\,p_k^W(dw)$$
8 5 (non linear and non Gaussian systems : 3) Remark: it is easy to simulate next state $X_k$ given $X_{k-1} = x$, i.e. to simulate a r.v. according to $Q_k(x,dx')$ for a given $x \in E$: indeed, set $X_k = f_k(x, W_k)$ where $W_k$ is simulated according to $p_k^W(dw)$
Remark: in general, transition kernel $Q_k(x,dx')$ does not admit a density: indeed, conditionally on $X_{k-1} = x$, r.v. $X_k$ necessarily belongs to subset
$$M(x) = \{x' \in \mathbb{R}^m : \text{there exists } w \in \mathbb{R}^p \text{ such that } x' = f_k(x,w)\}$$
if $p < m$ and under some mild regularity assumptions, this subset of $\mathbb{R}^m$ has zero Lebesgue measure; therefore, conditionally on $X_{k-1} = x$, probability distribution $Q_k(x,dx')$ of r.v. $X_k$ cannot have a density w.r.t. Lebesgue measure on $\mathbb{R}^m$
9 6 (non linear and non Gaussian systems : 4) Remark: if $f_k(x,w) = b_k(x) + w$ and if probability distribution $p_k^W(dw)$ of r.v. $W_k$ admits a density, still denoted $p_k^W(w)$, i.e. if
$$X_k = b_k(X_{k-1}) + W_k \quad\text{with}\quad W_k \sim p_k^W(w)\,dw$$
then a more explicit expression is available
$$Q_k(x,dx') = p_k^W(x' - b_k(x))\,dx'$$
i.e. transition kernel $Q_k(x,dx')$ admits an (easy to evaluate) density: indeed, change of variable $x' = b_k(x) + w$ yields
$$Q_k\phi(x) = \int_{\mathbb{R}^m} \phi(b_k(x) + w)\,p_k^W(w)\,dw = \int_{\mathbb{R}^m} \phi(x')\,p_k^W(x' - b_k(x))\,dx'$$
10 7 (non linear and non Gaussian systems : 5) Proposition: observations $\{Y_k\}$ satisfy memoryless channel assumption, i.e.
$$P[Y_{0:n} \in dy_{0:n} \mid X_{0:n}] = \prod_{k=0}^n P[Y_k \in dy_k \mid X_k]$$
characterization in terms of emission density
$$P[Y_k \in dy \mid X_k = x] = q_k^V(y - h_k(x))\,dy$$
define likelihood function
$$g_k(x) = q_k^V(Y_k - h_k(x))$$
a quantitative measure of consistency between possible hidden state $x \in E$ and actual observation $Y_k$
Remark: easy to evaluate $g_k(x)$ for any $x \in E$
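The two operations required so far (simulating from $Q_k(x,dx')$ and evaluating $g_k(x)$) can be sketched for a scalar model. Everything specific below is an assumption made for illustration: the maps `f` and `h`, the Gaussian noise laws and their standard deviations are hypothetical and do not come from the slides.

```python
import math
import random


def f(x, w):
    """Hypothetical state map f_k(x, w), chosen only for illustration."""
    return 0.9 * x + w


def h(x):
    """Hypothetical observation map h_k(x)."""
    return x * x / 20.0


def q_v(v, sigma_v=1.0):
    """Assumed observation noise density q_k^V: centred Gaussian."""
    return math.exp(-0.5 * (v / sigma_v) ** 2) / (sigma_v * math.sqrt(2.0 * math.pi))


def sample_next_state(x, rng, sigma_w=0.5):
    """Draw from Q_k(x, dx') by simulating W_k and setting X_k = f_k(x, W_k)."""
    return f(x, rng.gauss(0.0, sigma_w))


def likelihood(x, y):
    """g_k(x) = q_k^V(y - h_k(x)), where y is the observed value of Y_k."""
    return q_v(y - h(x))


rng = random.Random(0)
x_next = sample_next_state(1.0, rng)
g_good = likelihood(2.0, h(2.0))   # state exactly consistent with the observation
g_bad = likelihood(6.0, h(2.0))    # state far from what was observed
```

Note that the density of $Q_k$ is never evaluated here: only simulation of the noise and evaluation of $q_k^V$ are used, which matches the stated requirement.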
11 8 (hidden Markov models : 1) hidden Markov models
motivating example: hybrid continuous / discrete systems
$$X_k = f_k(s_{k-1}, X_{k-1}, W_k), \qquad Y_k = h_k(X_k) + V_k$$
where regime / mode sequence $\{s_k\}$ forms a Markov chain with finite state space
this does not fit into non linear and non Gaussian systems; however
hidden states and modes $\{(X_k, s_k)\}$ jointly form a Markov chain
observations $\{Y_k\}$ satisfy memoryless channel assumption
i.e. the model fits into hidden Markov models
Remark: easy to simulate next state $(X_k, s_k)$ given $(X_{k-1}, s_{k-1}) = (x,s)$
12 9 (hidden Markov models : 2) more generally, hidden states $\{X_k\}$ could form a Markov chain taking values in a quite general space $E$, e.g. hybrid continuous / discrete, differentiable manifold, constrained, graphical (collection of connected edges)
characterization in terms of transition kernel and initial distribution
$$P[X_k \in dx' \mid X_{k-1} = x] = Q_k(x,dx') \qquad P[X_0 \in dx] = \eta_0(dx)$$
joint probability distribution of hidden states $X_{0:n}$ satisfies
$$P[X_{0:n} \in dx_{0:n}] = \eta_0(dx_0) \prod_{k=1}^n Q_k(x_{k-1}, dx_k)$$
13 10 (hidden Markov models : 3) user should respect displacement constraints due to obstacles, as read on map
14 11 (hidden Markov models : 4) simplified model : user walks on a Voronoi graph, displacement constraints due to obstacles are taken automatically into account
15 12 (hidden Markov models : 5) observations $\{Y_k\}$ could satisfy memoryless channel assumption, i.e.
$$P[Y_{0:n} \in dy_{0:n} \mid X_{0:n}] = \prod_{k=0}^n P[Y_k \in dy_k \mid X_k]$$
characterization in terms of emission density
$$P[Y_k \in dy \mid X_k = x] = g_k(x,y)\,\lambda_k^F(dy)$$
where nonnegative measure $\lambda_k^F(dy)$ defined on $F$ does not depend on $x \in E$
define (abuse of notation) likelihood function as $g_k(x) = g_k(x, Y_k)$, a quantitative measure of consistency between $x \in E$ and observation $Y_k$
joint conditional distribution of observations $Y_{0:n}$ given hidden states $X_{0:n}$ satisfies
$$P[Y_{0:n} \in dy_{0:n} \mid X_{0:n} = x_{0:n}] = \prod_{k=0}^n g_k(x_k, y_k)\;\lambda_0^F(dy_0) \cdots \lambda_n^F(dy_n)$$
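16 13 (hidden Markov models : 6) representation as a graph: a chain $X_{k-1} \to X_k \to X_{k+1}$ of hidden states, with an arrow from each hidden state down to the corresponding observation, $X_{k-1} \to Y_{k-1}$, $X_k \to Y_k$, $X_{k+1} \to Y_{k+1}$
arrows represent dependency between random variables
only requirement (to be used later on): it should be easy to
simulate for any $x \in E$, a r.v. according to transition kernel $Q_k(x,dx')$
evaluate for any $x' \in E$, likelihood function $g_k(x')$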
17 14 (hidden Markov models : 7) hidden Markov models: importance decomposition
motivation: from simulations seen last week, basic paradigm
particles move according to prior model, described by its transition kernel
new particles are weighted by evaluating likelihood function
hopefully, the resulting weighted empirical distribution provides a reasonable approximation to the intractable Bayesian filter
concerns / questions: is this safe? could more information be used in mutation step?
18 15 (hidden Markov models : 8) recall indoor navigation example: if user is detected by a beacon with known location $a$ and with finite range $R$, then user position is necessarily within detection disk centered at $a$ with radius $R$
in other words, generating particles according to prior model alone could result in (a few, some, many, all) particles outside detection disk, i.e. useless, wasted particles
why not explicitly generate all new particles within disk, and accommodate for the wrong model by changing weights?
more generally, why not (and how to) use next observation to move particles? ideal situation would be: particles move according to posterior model
warning
19 16 (hidden Markov models : 9) [Figure 1: Prior density and generated sample. Panel: prior distribution (sample view)]
20 17 (hidden Markov models : 10) [Figure 2: Prior density and histogram associated with generated sample. Panel: prior distribution (histogram view)]
21 18 (hidden Markov models : 11) [Figure 1, repeated: Prior density and generated sample. Panel: prior distribution (sample view)]
22 19 (hidden Markov models : 12) [Figure 3a: Prior density, likelihood function, posterior density and weighted sample. Panel: prior distribution, likelihood function and posterior distribution (weighted sample view)]
23 20 (hidden Markov models : 13) [Figure 4a: Prior density, likelihood function, posterior density and histogram associated with weighted sample. Panel: prior distribution, likelihood function and posterior distribution (histogram view)]
24 21 (hidden Markov models : 14) [Figure 1, repeated: Prior density and generated sample. Panel: prior distribution (sample view)]
25 22 (hidden Markov models : 15) [Figure 3b: Prior density, likelihood function, posterior density and weighted sample (more difficult). Panel: weighted sample view]
26 23 (hidden Markov models : 16) [Figure 4b: Prior density, likelihood function, posterior density and histogram associated with weighted sample (more difficult). Panel: histogram view]
27 24 (hidden Markov models : 17) [Figure 1, repeated: Prior density and generated sample. Panel: prior distribution (sample view)]
28 25 (hidden Markov models : 18) [Figure 3c: Prior density, likelihood function, posterior density and weighted sample (just impossible). Panel: weighted sample view]
29 26 (hidden Markov models : 19) possible (non unique) decomposition
$$\gamma_0(dx) = g_0(x)\,\eta_0(dx) = g_0^{imp}(x)\,\eta_0^{imp}(dx)$$
and
$$R_k(x,dx') = Q_k(x,dx')\,g_k(x') = g_k^{imp}(x,x')\,Q_k^{imp}(x,dx')$$
as product of
a nonnegative weight function $g_0^{imp}(x)$ or $g_k^{imp}(x,x')$
a probability distribution $\eta_0^{imp}(dx)$ or a transition kernel $Q_k^{imp}(x,dx')$, respectively
only requirement about proposed decomposition: it should be easy to
simulate a r.v. according to $\eta_0^{imp}(dx)$
simulate for any $x \in E$, a r.v. according to $Q_k^{imp}(x,dx')$
evaluate for any $x, x' \in E$, weighting function $g_k^{imp}(x,x')$
attention: evaluating weighting function $g_k^{imp}(x,x')$ requires some knowledge about transition kernels $Q_k^{imp}(x,dx')$ and $Q_k(x,dx')$ (this was not required originally)
30 27 (hidden Markov models : 20) popular (optimal) importance decomposition: blind vs. guided mutation
$$P[X_k \in dx', Y_k \in dy' \mid X_{k-1} = x] = \underbrace{P[Y_k \in dy' \mid X_k = x', X_{k-1} = x]}_{g_k(x',y')\,\lambda_k(dy')}\;\underbrace{P[X_k \in dx' \mid X_{k-1} = x]}_{Q_k(x,dx')}$$
and alternatively
$$P[X_k \in dx', Y_k \in dy' \mid X_{k-1} = x] = \underbrace{P[X_k \in dx' \mid Y_k = y', X_{k-1} = x]}_{\hat Q_k(x,y',dx')}\;\underbrace{P[Y_k \in dy' \mid X_{k-1} = x]}_{\hat g_k(x,y')\,\lambda_k(dy')}$$
i.e.
$$R_k(x,dx') = g_k(x')\,Q_k(x,dx') = \hat g_k(x)\,\hat Q_k(x,dx')$$
with (abuse of notation) $\hat g_k(x) = \hat g_k(x, Y_k)$ and $\hat Q_k(x,dx') = \hat Q_k(x, Y_k, dx')$
31 28 (hidden Markov models : 21) remaining question: how easy is it to
simulate for any $x \in E$, a r.v. according to $\hat Q_k(x,dx')$?
evaluate for any $x \in E$, weighting function $\hat g_k(x)$?
positive answer in special case: linear observations and additive Gaussian noise
$$X_k = f_k(X_{k-1}) + \sigma_k(X_{k-1})\,W_k, \qquad Y_k = H_k X_k + V_k$$
indeed (for simplicity, assume $\sigma_k(x) = I$)
$$Y_k = H_k\,[f_k(X_{k-1}) + W_k] + V_k = H_k f_k(X_{k-1}) + (H_k W_k + V_k)$$
conditionally on $X_{k-1} = x$, r.v. $(X_k, Y_k)$ is jointly Gaussian, with mean and covariance matrix
$$\begin{pmatrix} f_k(x) \\ H_k f_k(x) \end{pmatrix} \quad\text{and}\quad \begin{pmatrix} Q_k^W & Q_k^W H_k^\top \\ H_k Q_k^W & H_k Q_k^W H_k^\top + Q_k^V \end{pmatrix}$$
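In the scalar case, the joint Gaussian distribution above can be conditioned explicitly with the usual Gaussian conditioning formulas. The sketch below is an illustration under that assumption, not the author's implementation; the function name `optimal_decomposition` and all numerical values are hypothetical.

```python
import math


def gaussian_pdf(z, mean, var):
    """Density of N(mean, var) evaluated at z."""
    return math.exp(-0.5 * (z - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def optimal_decomposition(fx, y, H, q_w, q_v):
    """Scalar case of X_k = f_k(x) + W_k, Y_k = H X_k + V_k.

    fx is f_k(x), y the observed value of Y_k, q_w and q_v the noise variances.
    Returns the mean and variance of X_k given (Y_k = y, X_{k-1} = x),
    i.e. the guided kernel, and the weight ghat_k(x).
    """
    s = H * q_w * H + q_v              # variance of Y_k given X_{k-1} = x
    gain = q_w * H / s                 # Gaussian conditioning gain
    mean = fx + gain * (y - H * fx)
    var = q_w - gain * H * q_w
    ghat = gaussian_pdf(y, H * fx, s)  # density of Y_k given X_{k-1} = x
    return mean, var, ghat


# Hypothetical numbers: f_k(x) = 1.0, observation 2.5, H = 2, Q^W = 0.5, Q^V = 1.
fx, y, H, q_w, q_v = 1.0, 2.5, 2.0, 0.5, 1.0
mean, var, ghat = optimal_decomposition(fx, y, H, q_w, q_v)

# Check of the decomposition at an arbitrary point x':
x_new = 1.3
blind = gaussian_pdf(y, H * x_new, q_v) * gaussian_pdf(x_new, fx, q_w)
guided = ghat * gaussian_pdf(x_new, mean, var)
```

The two products `blind` and `guided` coincide, which is the density version of $g_k(x')\,Q_k(x,dx') = \hat g_k(x)\,\hat Q_k(x,dx')$ in this special case.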
32 29 (partially observed Markov chains : 1) partially observed Markov chains
motivating example: assume (unsynchronized) sensors take noisy observations of different components of hidden state at different time instants, e.g. $X_k = (X_k^1, X_k^2)$ and for simplicity
$$X_k = f(X_{k-1}) + W_k$$
$$Y_k = \begin{cases} h^1(X_k^1) + V_k^1 & \text{at odd time instants} \\ H^2 X_k^2 + V_k^2 & \text{at even time instants} \end{cases}$$
observing all components of hidden state is fine, but processing partial observations at each time instant can be risky, since likelihood functions will be flat along some directions: ideally, try to collect and process simultaneously two successive observations, so that likelihood functions are more peaked
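33 30 (partially observed Markov chains : 2) down sampling: set
$$\bar X_k = X_{2k+1} \quad\text{and}\quad \bar Y_k = \begin{pmatrix} Y_{2k+1} \\ Y_{2k+2} \end{pmatrix}$$
state equation
$$\bar X_k = X_{2k+1} = f(X_{2k}) + W_{2k+1} = f(f(X_{2k-1}) + W_{2k}) + W_{2k+1}$$
i.e.
$$\bar X_k = \bar f(\bar X_{k-1}, \bar W_k) \quad\text{with}\quad \bar W_k = \begin{pmatrix} W_{2k} \\ W_{2k+1} \end{pmatrix}$$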
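34 31 (partially observed Markov chains : 3) observation equation: introducing projections $\pi_1$ and $\pi_2$ on 1st and 2nd components of state vector, yields
$$\bar Y_k = \begin{pmatrix} Y_{2k+1} \\ Y_{2k+2} \end{pmatrix} = \begin{pmatrix} h^1(X_{2k+1}^1) + V_{2k+1}^1 \\ H^2 X_{2k+2}^2 + V_{2k+2}^2 \end{pmatrix} = \begin{pmatrix} h^1(\pi_1(X_{2k+1})) + V_{2k+1}^1 \\ H^2\,\pi_2(f(X_{2k+1}) + W_{2k+2}) + V_{2k+2}^2 \end{pmatrix}$$
i.e.
$$\bar Y_k = \bar h(\bar X_k) + \bar V_k \quad\text{with}\quad \bar V_k = \begin{pmatrix} V_{2k+1}^1 \\ H^2\,\pi_2(W_{2k+2}) + V_{2k+2}^2 \end{pmatrix}$$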
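35 32 (partially observed Markov chains : 4) resulting system
$$\bar X_k = \bar f(\bar X_{k-1}, \bar W_k), \qquad \bar Y_k = \bar h(\bar X_k) + \bar V_k$$
with
$$\bar W_k = \begin{pmatrix} W_{2k} \\ W_{2k+1} \end{pmatrix} \quad\text{and}\quad \bar V_k = \begin{pmatrix} V_{2k+1}^1 \\ H^2\,\pi_2(W_{2k+2}) + V_{2k+2}^2 \end{pmatrix}$$
clearly $\bar W_k$ and $\bar V_{k-1}$ share $W_{2k}$ in common and are correlated, hence dependent, and memoryless channel assumption cannot hold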
36 33 (partially observed Markov chains : 5) trick: decompose $\bar W_k = M\,\bar V_{k-1} + B_k$ where $B_k$ and $\bar V_{k-1}$ are now independent, substitute in state equation and import $\bar V_{k-1} = \bar Y_{k-1} - \bar h(\bar X_{k-1})$ from observation equation, yielding
$$\bar X_k = \bar f(\bar X_{k-1}, M\,(\bar Y_{k-1} - \bar h(\bar X_{k-1})) + B_k), \qquad \bar Y_k = \bar h(\bar X_k) + \bar V_k$$
this does not fit into hidden Markov models, since hidden state alone does not form a Markov chain
however, hidden states and observations $\{(\bar X_k, \bar Y_k)\}$ jointly form a Markov chain, of which only the second component is observed
37 34 (partially observed Markov chains : 6) even more generally, with previous motivating example in mind, hidden states and observations $\{(X_k, Y_k)\}$ could jointly form a Markov chain taking values in product space $E \times F$
characterization in terms of transition kernel
$$P[X_k \in dx', Y_k \in dy' \mid X_{k-1} = x, Y_{k-1} = y] = R_k(x,y,y',dx')\,\lambda_k^F(y,dy')$$
and initial distribution
$$P[X_0 \in dx, Y_0 \in dy] = \gamma_0(y,dx)\,\lambda_0^F(dy)$$
attention: hidden states $\{X_k\}$ alone need not form a Markov chain
joint probability distribution of hidden states and observations $(X_{0:n}, Y_{0:n})$
$$P[X_{0:n} \in dx_{0:n}, Y_{0:n} \in dy_{0:n}] = \gamma_0(y_0,dx_0) \prod_{k=1}^n R_k(x_{k-1},y_{k-1},y_k,dx_k)\;\lambda_0^F(dy_0) \prod_{k=1}^n \lambda_k^F(y_{k-1},dy_k)$$
38 35 (partially observed Markov chains : 7) partially observed Markov chains: importance decomposition
required (non unique) decomposition
$$\gamma_0(dx) = g_0^{imp}(x)\,\eta_0^{imp}(dx) \quad\text{and}\quad R_k(x,dx') = g_k^{imp}(x,x')\,Q_k^{imp}(x,dx')$$
as product of
a nonnegative weight function $g_0^{imp}(x)$ or $g_k^{imp}(x,x')$
a probability distribution $\eta_0^{imp}(dx)$ or a transition kernel $Q_k^{imp}(x,dx')$, respectively
only requirement about proposed decomposition: it should be easy to
simulate a r.v. according to $\eta_0^{imp}(dx)$
simulate for any $x \in E$, a r.v. according to $Q_k^{imp}(x,dx')$
evaluate for any $x, x' \in E$, weighting function $g_k^{imp}(x,x')$
39 36 (likelihood free models : 1) likelihood free models
so far, at least implicitly, additive observation noise has been assumed
$$Y_k = h(X_k) + V_k \quad\text{with}\quad V_k \sim q_k^V(v)\,dv$$
with known and explicit form for probability density $q_k^V(v)$
this was the key assumption in deriving the expression of emission density
$$P[Y_k \in dy \mid X_k = x] = g_k(x,y)\,\lambda_k(dy)$$
hence an explicit expression of likelihood function
questions: could anything be said in more general cases where no explicit expression is available for a density, or where a density does not even exist?
non additive observation noise, with dimension smaller than observation, i.e. $Y_k = h(X_k, V_k)$
perfect observations, i.e. observation noise is simply not present, $Y_k = h(X_k)$
40 37 (likelihood free models : 2) trick, a form of ABC (approximate Bayesian computation): pretend that observations are produced by a slightly perturbed but regular model, i.e.
$$Y_k = h(X_k) + V_k + \varepsilon\,U_k \quad\text{or}\quad Y_k = h(X_k, V_k) + \varepsilon\,U_k \quad\text{or}\quad Y_k = h(X_k) + \varepsilon\,U_k$$
depending on the case under consideration, with $U_k \sim q_k^U(u)\,du$, and set $(X_k, V_k)$ as new hidden state
new requirement: it should be easy to
simulate $(X_k, V_k)$ jointly
evaluate density $q_k^U(u)$
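For perfect scalar observations, the perturbed model $Y_k = h(X_k) + \varepsilon U_k$ has an explicit likelihood obtained by a change of variable: the density of $\varepsilon U_k$ evaluated at $y - h(x)$. A minimal sketch, assuming a standard Gaussian density for $U_k$ and a hypothetical observation map `h` (neither is specified in the slides):

```python
import math


def q_u(u):
    """Assumed density of the artificial noise U_k: standard Gaussian."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def h(x):
    """Hypothetical noise-free observation map."""
    return x ** 3


def abc_likelihood(x, y, eps):
    """Likelihood of state x under Y = h(X) + eps * U (scalar case):
    density of eps * U evaluated at y - h(x)."""
    return q_u((y - h(x)) / eps) / eps


y_obs = 1.0
near = abc_likelihood(1.0, y_obs, eps=0.1)   # h(x) matches the observation
far = abc_likelihood(1.2, y_obs, eps=0.1)    # h(x) = 1.728, poorly matched
```

A smaller `eps` makes the perturbed likelihood more peaked around states with $h(x) = y$, at the price of fewer particles receiving a non negligible weight.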
41 Bayesian filter: hidden Markov models (representation as Gibbs Boltzmann distribution; recursive formulation); partially observed Markov chains + given importance decomposition (representation as Gibbs Boltzmann distribution)
42 38 (Bayesian filter : hidden Markov models : representation : 1) Bayesian filter: hidden Markov models: representation
Theorem: joint conditional distribution of hidden state sequence $X_{0:n}$ given observations $Y_{0:n}$ as a Gibbs Boltzmann distribution
$$P[X_{0:n} \in dx_{0:n} \mid Y_{0:n}] \propto \underbrace{\prod_{k=0}^n g_k(x_k)}_{g_{0:n}(x_{0:n})}\;\underbrace{\eta_0(dx_0) \prod_{k=1}^n Q_k(x_{k-1},dx_k)}_{\eta_{0:n}(dx_{0:n})}$$
with likelihood functions defined (abuse of notation) as $g_k(x) = g_k(x, Y_k)$, and with joint probability distribution of hidden state sequence $X_{0:n}$
$$\eta_{0:n}(dx_{0:n}) = P[X_{0:n} \in dx_{0:n}] = \eta_0(dx_0) \prod_{k=1}^n Q_k(x_{k-1},dx_k)$$
43 39 (Bayesian filter : hidden Markov models : representation : 2) general principle:
$$p_{X \mid Y=y}(x) = \frac{p_{X,Y}(x,y)}{p_Y(y)} \quad\Longrightarrow\quad p_{X \mid Y=y}(x) \propto p_{X,Y}(x,y)$$
Proof: Bayes rule + Markov property + memoryless channel assumption yield joint probability distribution of hidden states and observations $(X_{0:n}, Y_{0:n})$
$$P[X_{0:n} \in dx_{0:n}, Y_{0:n} \in dy_{0:n}] = P[Y_{0:n} \in dy_{0:n} \mid X_{0:n} = x_{0:n}]\;P[X_{0:n} \in dx_{0:n}] = \eta_0(dx_0) \prod_{k=1}^n Q_k(x_{k-1},dx_k) \prod_{k=0}^n g_k(x_k,y_k)\;\lambda_0^F(dy_0) \cdots \lambda_n^F(dy_n)$$
hence
$$P[X_{0:n} \in dx_{0:n} \mid Y_{0:n}] \propto \eta_0(dx_0) \prod_{k=1}^n Q_k(x_{k-1},dx_k) \prod_{k=0}^n g_k(x_k)$$
44 40 (Bayesian filter : hidden Markov models : representation : 3) Remark: for any function $f$ depending on whole trajectory
$$E[f(X_{0:n}) \mid Y_{0:n}] \propto \int_E \cdots \int_E f(x_{0:n})\,g_{0:n}(x_{0:n})\,\eta_{0:n}(dx_{0:n}) = E\Big[f(X_{0:n}) \prod_{k=0}^n g_k(X_k)\Big]$$
where the expectation on the right is w.r.t. hidden state sequence $X_{0:n}$, while observations $Y_{0:n}$ are fixed implicit parameters in likelihood functions: recall (abuse of notation) $g_k(x) = g_k(x, Y_k)$
if $f = \phi \circ \pi$ depends only upon last state, then
$$\langle \mu_n, \phi \rangle = E[\phi(X_n) \mid Y_{0:n}] \propto E\Big[\phi(X_n) \prod_{k=0}^n g_k(X_k)\Big] = \langle \gamma_n, \phi \rangle$$
which defines unnormalized distribution $\gamma_n(dx)$ implicitly, through its action on arbitrary functions
45 41 (Bayesian filter : hidden Markov models : representation : 4) for a given importance decomposition
$$P[X_{0:n} \in dx_{0:n} \mid Y_{0:n}] \propto \eta_0(dx_0) \prod_{k=1}^n Q_k(x_{k-1},dx_k) \prod_{k=0}^n g_k(x_k) = \underbrace{\eta_0^{imp}(dx_0) \prod_{k=1}^n Q_k^{imp}(x_{k-1},dx_k)}_{\eta_{0:n}^{imp}(dx_{0:n})}\;\underbrace{g_0^{imp}(x_0) \prod_{k=1}^n g_k^{imp}(x_{k-1},x_k)}_{g_{0:n}^{imp}(x_{0:n})}$$
46 42 (Bayesian filter : hidden Markov models : recursive formulation : 1) Bayesian filter: hidden Markov models: recursive formulation
Theorem: Bayesian filter $\mu_k(dx) = P[X_k \in dx \mid Y_{0:k}]$ satisfies
$$\mu_{k-1} \xrightarrow{\ \text{prediction}\ } \eta_k = \mu_{k-1} Q_k \xrightarrow{\ \text{correction}\ } \mu_k = g_k \cdot \eta_k$$
with initial condition $\eta_0(dx) = P[X_0 \in dx]$
Remark: in Theorem statement
$$\mu_{k-1} Q_k(dx') = \int_E \mu_{k-1}(dx)\,Q_k(x,dx')$$
denotes mixture distribution resulting from transition kernel $Q_k(x,dx')$ acting on probability distribution $\mu_{k-1}(dx)$, and
$$g_k \cdot \eta_k = \frac{g_k\,\eta_k}{\langle \eta_k, g_k \rangle}$$
denotes (projective) product of prior probability distribution $\eta_k(dx')$ with likelihood function $g_k(x')$
47 43 (Bayesian filter : hidden Markov models : recursive formulation : 2) Proof: recall representation for joint conditional probability distribution of hidden state sequence $X_{0:k}$ given observations $Y_{0:k}$
$$P[X_{0:k} \in dx_{0:k} \mid Y_{0:k}] \propto \eta_0(dx_0) \prod_{p=1}^k Q_p(x_{p-1},dx_p) \prod_{p=0}^k g_p(x_p) \propto g_k(x_k)\,Q_k(x_{k-1},dx_k)\,P[X_{0:k-1} \in dx_{0:k-1} \mid Y_{0:k-1}]$$
integration w.r.t. variables $x_{0:k-1}$ (and in RHS, first w.r.t. variables $x_{0:k-2}$ and next w.r.t. variable $x_{k-1}$) provides conditional distribution of current hidden state $X_k$ given observations $Y_{0:k}$, i.e. Bayesian filter, as
$$\mu_k(dx_k) = P[X_k \in dx_k \mid Y_{0:k}] \propto g_k(x_k) \int_E Q_k(x_{k-1},dx_k)\,P[X_{k-1} \in dx_{k-1} \mid Y_{0:k-1}] = g_k(x_k)\,\underbrace{\int_E \mu_{k-1}(dx_{k-1})\,Q_k(x_{k-1},dx_k)}_{\eta_k(dx_k)}$$
48 44 (Bayesian filter : hidden Markov models : recursive formulation : 3) Remark: unnormalized version satisfies recurrent relation
$$\gamma_k(dx') = g_k(x') \int_E \gamma_{k-1}(dx)\,Q_k(x,dx') \quad\text{and}\quad \mu_k = \frac{\gamma_k}{\langle \gamma_k, 1 \rangle}$$
or equivalently
$$\gamma_k(dx') = \int_E \gamma_{k-1}(dx)\,R_k(x,dx')$$
introducing nonnegative kernel $R_k(x,dx') = Q_k(x,dx')\,g_k(x')$
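On a finite state space the prediction and correction steps reduce to a matrix-vector product followed by a normalized pointwise product. The sketch below assumes a two-state chain with a time-homogeneous kernel; the values of `eta0`, `Q` and `likelihoods` are hypothetical.

```python
def predict(mu_prev, Q):
    """eta_k = mu_{k-1} Q_k: mixture of the rows of Q weighted by mu_{k-1}."""
    return [sum(mu_prev[x] * Q[x][y] for x in range(len(mu_prev)))
            for y in range(len(Q[0]))]


def correct(eta, g):
    """mu_k = g_k . eta_k: pointwise product, normalized by <eta_k, g_k>."""
    unnormalized = [gi * ei for gi, ei in zip(g, eta)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def bayes_filter(eta0, Q, likelihoods):
    """likelihoods[k][x] holds g_k(x); returns the list mu_0, ..., mu_n."""
    mu = correct(eta0, likelihoods[0])
    filters = [mu]
    for g in likelihoods[1:]:
        mu = correct(predict(mu, Q), g)
        filters.append(mu)
    return filters


# Hypothetical two-state example.
eta0 = [0.5, 0.5]
Q = [[0.9, 0.1],
     [0.2, 0.8]]
likelihoods = [[0.8, 0.2], [0.8, 0.2]]
filters = bayes_filter(eta0, Q, likelihoods)
```

In a general state space these sums become integrals that are not available in closed form, which is what motivates the particle approximations of the following slides.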
49 45 (Bayesian filter : partially observed Markov chains : representation : 1) Bayesian filter: partially observed Markov chains: representation
Theorem: joint conditional distribution of hidden state sequence $X_{0:n}$ given observations $Y_{0:n}$
$$P[X_{0:n} \in dx_{0:n} \mid Y_{0:n}] \propto \gamma_0(dx_0) \prod_{k=1}^n R_k(x_{k-1},dx_k)$$
with nonnegative distribution defined (abuse of notation) as $\gamma_0(dx) = \gamma_0(Y_0,dx)$, and with nonnegative kernel defined (abuse of notation) as $R_k(x_{k-1},dx_k) = R_k(x_{k-1},Y_{k-1},Y_k,dx_k)$
50 46 (Bayesian filter : partially observed Markov chains : representation : 2) general principle:
$$p_{X \mid Y=y}(x) = \frac{p_{X,Y}(x,y)}{p_Y(y)} \quad\Longrightarrow\quad p_{X \mid Y=y}(x) \propto p_{X,Y}(x,y)$$
Proof: by definition, joint probability distribution of hidden states and observations $(X_{0:n}, Y_{0:n})$ is
$$P[X_{0:n} \in dx_{0:n}, Y_{0:n} \in dy_{0:n}] = \gamma_0(y_0,dx_0) \prod_{k=1}^n R_k(x_{k-1},y_{k-1},y_k,dx_k)\;\lambda_0^F(dy_0) \prod_{k=1}^n \lambda_k^F(y_{k-1},dy_k)$$
hence
$$P[X_{0:n} \in dx_{0:n} \mid Y_{0:n}] \propto \gamma_0(dx_0) \prod_{k=1}^n R_k(x_{k-1},dx_k)$$
51 47 (Bayesian filter : partially observed Markov chains : representation : 3) for a given importance decomposition
$$P[X_{0:n} \in dx_{0:n} \mid Y_{0:n}] \propto \gamma_0(dx_0) \prod_{k=1}^n R_k(x_{k-1},dx_k) = \underbrace{\eta_0^{imp}(dx_0) \prod_{k=1}^n Q_k^{imp}(x_{k-1},dx_k)}_{\eta_{0:n}^{imp}(dx_{0:n})}\;\underbrace{g_0^{imp}(x_0) \prod_{k=1}^n g_k^{imp}(x_{k-1},x_k)}_{g_{0:n}^{imp}(x_{0:n})}$$
52 48 (Bayesian filter : partially observed Markov chains : recursive formulation : 1) Bayesian filter: partially observed Markov chains: recursive formulation
Theorem: Bayesian filter $\mu_k(dx) = P[X_k \in dx \mid Y_{0:k}]$ satisfies
$$\mu_k(dx') \propto \int_E \mu_{k-1}(dx)\,R_k(x,dx')$$
with initial condition $\mu_0(dx) \propto \gamma_0(dx)$
Remark: unnormalized version satisfies recurrent relation
$$\gamma_k(dx') = \int_E \gamma_{k-1}(dx)\,R_k(x,dx') \quad\text{and}\quad \mu_k = \frac{\gamma_k}{\langle \gamma_k, 1 \rangle}$$
53 49 (Bayesian filter : partially observed Markov chains : recursive formulation : 2) Proof: recall representation for joint conditional probability distribution of hidden state sequence $X_{0:k}$ given observations $Y_{0:k}$
$$P[X_{0:k} \in dx_{0:k} \mid Y_{0:k}] \propto \gamma_0(dx_0) \prod_{p=1}^k R_p(x_{p-1},dx_p) \propto R_k(x_{k-1},dx_k)\,P[X_{0:k-1} \in dx_{0:k-1} \mid Y_{0:k-1}]$$
integration w.r.t. variables $x_{0:k-1}$ (and in RHS, first w.r.t. variables $x_{0:k-2}$ and next w.r.t. variable $x_{k-1}$) provides conditional distribution of current hidden state $X_k$ given observations $Y_{0:k}$, i.e. Bayesian filter, as
$$\mu_k(dx_k) = P[X_k \in dx_k \mid Y_{0:k}] \propto \int_E R_k(x_{k-1},dx_k)\,P[X_{k-1} \in dx_{k-1} \mid Y_{0:k-1}] = \int_E \mu_{k-1}(dx_{k-1})\,R_k(x_{k-1},dx_k)$$
54 Monte Carlo approximation: particle filters. Monte Carlo methods: importance sampling; importance sampling: SIS algorithm (derived from Bayesian filter representation; recursive formulation); redistribution: SIR algorithm, adaptive redistribution (derived directly from Bayesian filter recursive formulation); estimation error, CLT
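55 50 (Monte Carlo methods : importance sampling : 1) Monte Carlo methods
if computing an integral (or a mathematical expectation)
$$\langle \mu, \phi \rangle = \int_E \phi(x)\,\mu(dx) = E[\phi(X)] \quad\text{with}\quad X \sim \mu(dx)$$
is difficult, but simulating a r.v. according to distribution $\mu$ is easy, then introduce empirical probability distribution
$$S^N(\mu) = \frac{1}{N} \sum_{i=1}^N \delta_{\xi^i}$$
where $(\xi^1, \dots, \xi^N)$ is an $N$-sample distributed according to $\mu$, and approximation
$$\langle \mu, \phi \rangle \approx \langle S^N(\mu), \phi \rangle = \frac{1}{N} \sum_{i=1}^N \phi(\xi^i)$$
by law of large numbers, $\langle S^N(\mu), \phi \rangle \to \langle \mu, \phi \rangle$ in probability as $N \uparrow \infty$, with speed $1/\sqrt{N}$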
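56 51 (Monte Carlo methods : importance sampling : 2) indeed
$$\langle S^N(\mu) - \mu, \phi \rangle = \frac{1}{N} \sum_{i=1}^N (\phi(\xi^i) - \langle \mu, \phi \rangle)$$
hence (non asymptotic) mean square error
$$E\,|\langle S^N(\mu) - \mu, \phi \rangle|^2 = \frac{1}{N}\,\mathrm{var}(\phi,\mu)$$
since, by independence,
$$\frac{1}{N^2} \sum_{i,j=1}^N E[(\phi(\xi^i) - \langle \mu, \phi \rangle)\,(\phi(\xi^j) - \langle \mu, \phi \rangle)] = \frac{1}{N^2} \sum_{i=1}^N \underbrace{E\,|\phi(\xi^i) - \langle \mu, \phi \rangle|^2}_{\mathrm{var}(\phi,\mu)}$$
and central limit theorem holds
$$\sqrt{N}\,\langle S^N(\mu) - \mu, \phi \rangle = \frac{1}{\sqrt{N}} \sum_{i=1}^N (\phi(\xi^i) - \langle \mu, \phi \rangle) \Longrightarrow N(0, \mathrm{var}(\phi,\mu))$$
in distribution as $N \uparrow \infty$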
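A short numerical sketch of the plain Monte Carlo estimate and of the error level predicted by $\mathrm{var}(\phi,\mu)/N$. The target $\mu = N(0,1)$ and the function $\phi(x) = x^2$ are illustrative assumptions (for which $\langle \mu, \phi \rangle = 1$ and $\mathrm{var}(\phi,\mu) = 2$); the helper name `monte_carlo` is hypothetical.

```python
import math
import random


def monte_carlo(phi, sampler, n, rng):
    """Return the empirical mean <S^N(mu), phi> and an estimate of var(phi, mu)."""
    values = [phi(sampler(rng)) for _ in range(n)]
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, var


rng = random.Random(2012)
n = 20000
# Illustrative choice: mu = N(0, 1) and phi(x) = x^2.
estimate, var_hat = monte_carlo(lambda x: x * x, lambda r: r.gauss(0.0, 1.0), n, rng)
# Root mean square error suggested by the formula var(phi, mu) / N.
rmse = math.sqrt(var_hat / n)
```

Quadrupling `n` should roughly halve `rmse`, in line with the $1/\sqrt{N}$ speed.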
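57 52 (Monte Carlo methods : importance sampling : 3) important special case: Gibbs Boltzmann distribution
$$\mu = g \cdot \eta = \frac{g\,\eta}{\langle \eta, g \rangle} \quad\text{i.e.}\quad \langle \mu, \phi \rangle = \frac{\langle \eta, g\phi \rangle}{\langle \eta, g \rangle}$$
with (non unique) decomposition in terms of
a probability distribution $\eta$
a nonnegative function $g$
introduce unnormalized distribution defined by
$$\langle \gamma, \phi \rangle = \langle \eta, g\phi \rangle = E[g(\Xi)\,\phi(\Xi)]$$
where r.v. $\Xi$ is distributed according to $\eta$, hence
$$\langle \mu, \phi \rangle = \frac{\langle \eta, g\phi \rangle}{\langle \eta, g \rangle} = \frac{\langle \gamma, \phi \rangle}{\langle \gamma, 1 \rangle} \qquad (\star)$$
motivation: Bayes rule, posterior distribution $\propto$ likelihood function $\times$ prior distribution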
58 53 (Monte Carlo methods : importance sampling : 4) if simulating a r.v. according to $\mu$ is difficult, but
simulating a r.v. according to $\eta$
and evaluating nonnegative function $g(x)$ for any $x$
is easy, then it is possible to approximate $\mu$ by a weighted empirical probability distribution associated with a sample
distributed according to $\eta$
and weighted with nonnegative function $g(x)$
even though normalizing constant $\langle \eta, g \rangle$ might be unknown
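59 54 (Monte Carlo methods : importance sampling : 5) importance sampling
idea: approximate numerator and denominator in $(\star)$ with a single sample distributed according to $\eta$: introduce approximation
$$\langle \gamma, \phi \rangle = \langle \eta, g\phi \rangle \approx \langle S^N(\eta), g\phi \rangle = \frac{1}{N} \sum_{i=1}^N g(\xi^i)\,\phi(\xi^i)$$
hence
$$\langle \mu, \phi \rangle = \langle g \cdot \eta, \phi \rangle \approx \langle g \cdot S^N(\eta), \phi \rangle = \frac{\sum_{i=1}^N g(\xi^i)\,\phi(\xi^i)}{\sum_{i=1}^N g(\xi^i)}$$
where $(\xi^1, \dots, \xi^N)$ is an $N$-sample with common probability distribution $\eta$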
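60 55 (Monte Carlo methods : importance sampling : 6) in other words
$$\gamma \approx \gamma^N = g\,S^N(\eta) = \frac{1}{N} \sum_{i=1}^N g(\xi^i)\,\delta_{\xi^i}$$
and
$$\mu \approx \mu^N = g \cdot S^N(\eta) = \sum_{i=1}^N \frac{g(\xi^i)}{\sum_{j=1}^N g(\xi^j)}\,\delta_{\xi^i} = \sum_{i=1}^N w^i\,\delta_{\xi^i}$$
where nonnegative normalized weights $(w^1, \dots, w^N)$ are defined for any $i = 1, \dots, N$ by
$$w^i = \frac{g(\xi^i)}{\sum_{j=1}^N g(\xi^j)}$$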
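The weighted empirical distribution above can be sketched on a toy Bayesian problem. The setting is an assumption made for illustration: prior $\eta = N(0,1)$, one observation $y = 1$ with unit variance Gaussian noise, so that the posterior mean is $0.5$; the helper name `importance_sampling` is hypothetical.

```python
import math
import random


def importance_sampling(phi, g, sampler, n, rng):
    """Approximate <mu, phi> for mu = g . eta using a single sample from eta."""
    xs = [sampler(rng) for _ in range(n)]
    gs = [g(x) for x in xs]
    total = sum(gs)                        # N times the estimate of <eta, g>
    weights = [gi / total for gi in gs]    # normalized weights w^i
    estimate = sum(w * phi(x) for w, x in zip(weights, xs))
    return estimate, weights


rng = random.Random(0)
y = 1.0
estimate, weights = importance_sampling(
    phi=lambda x: x,
    g=lambda x: math.exp(-0.5 * (y - x) ** 2),  # likelihood, up to a constant factor
    sampler=lambda r: r.gauss(0.0, 1.0),         # prior eta = N(0, 1)
    n=20000,
    rng=rng,
)
```

The likelihood is only needed up to a multiplicative constant, since the constant cancels in the normalized weights.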
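61 56 (importance sampling SIS algorithm : 1) importance sampling: SIS algorithm
recall Bayesian filter representation as a Gibbs Boltzmann distribution
$$\mu_{0:n} = g_{0:n} \cdot \eta_{0:n} = \frac{g_{0:n}\,\eta_{0:n}}{\langle \eta_{0:n}, g_{0:n} \rangle} \quad\text{i.e.}\quad \langle \mu_{0:n}, f \rangle = \frac{\langle \eta_{0:n}, g_{0:n} f \rangle}{\langle \eta_{0:n}, g_{0:n} \rangle}$$
with
$$g_{0:n}(x_{0:n}) = \prod_{k=0}^n g_k(x_{k-1},x_k)$$
(the factor for $k = 0$ depends on $x_0$ only), and with joint probability distribution of hidden states $X_{0:n}$
$$\eta_{0:n}(dx_{0:n}) = P[X_{0:n} \in dx_{0:n}] = \eta_0(dx_0) \prod_{k=1}^n Q_k(x_{k-1},dx_k)$$
unnormalized version defined as
$$\langle \gamma_{0:n}, f \rangle = \langle \eta_{0:n}, g_{0:n} f \rangle = E[g_{0:n}(X_{0:n})\,f(X_{0:n})] \quad\text{so that}\quad \langle \mu_{0:n}, f \rangle = \frac{\langle \gamma_{0:n}, f \rangle}{\langle \gamma_{0:n}, 1 \rangle}$$
and if $f = \phi \circ \pi$ depends only upon last state and not on whole trajectory, then
$$\langle \gamma_{0:n}, \phi \circ \pi \rangle = E[g_{0:n}(X_{0:n})\,\phi \circ \pi(X_{0:n})] = E\Big[\phi(X_n) \prod_{k=0}^n g_k(X_{k-1},X_k)\Big] = \langle \gamma_n, \phi \rangle$$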
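62 57 (importance sampling SIS algorithm : 2) importance sampling: approximation
$$\langle \gamma_{0:n}, f \rangle = \langle \eta_{0:n}, g_{0:n} f \rangle \approx \langle S^N(\eta_{0:n}), g_{0:n} f \rangle = \frac{1}{N} \sum_{i=1}^N g_{0:n}(\xi_{0:n}^i)\,f(\xi_{0:n}^i)$$
and
$$\langle \mu_{0:n}, f \rangle = \langle g_{0:n} \cdot \eta_{0:n}, f \rangle \approx \langle g_{0:n} \cdot S^N(\eta_{0:n}), f \rangle = \frac{\sum_{i=1}^N g_{0:n}(\xi_{0:n}^i)\,f(\xi_{0:n}^i)}{\sum_{i=1}^N g_{0:n}(\xi_{0:n}^i)}$$
for any function $f$ depending on whole trajectory, where $(\xi_{0:n}^1, \dots, \xi_{0:n}^N)$ is an $N$-sample with common probability distribution $\eta_{0:n}$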
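63 58 (importance sampling SIS algorithm : 3) in particular, if $f = \phi \circ \pi$ depends only upon last state and not on whole sequence, then
$$\langle \gamma_n, \phi \rangle = \langle \gamma_{0:n}, \phi \circ \pi \rangle \approx \frac{1}{N} \sum_{i=1}^N g_{0:n}(\xi_{0:n}^i)\,\phi(\xi_n^i)$$
and
$$\langle \mu_n, \phi \rangle = \langle \mu_{0:n}, \phi \circ \pi \rangle \approx \frac{\sum_{i=1}^N g_{0:n}(\xi_{0:n}^i)\,\phi(\xi_n^i)}{\sum_{i=1}^N g_{0:n}(\xi_{0:n}^i)}$$
for any function $\phi$, where $(\xi_{0:n}^1, \dots, \xi_{0:n}^N)$ is an $N$-sample with common probability distribution $\eta_{0:n}$, and for $i = 1, \dots, N$, $\xi_n^i = \pi(\xi_{0:n}^i)$ denotes last state of sequence $\xi_{0:n}^i = (\xi_0^i, \dots, \xi_n^i)$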
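64 59 (importance sampling SIS algorithm : 4) in other words
$$\gamma_n \approx \gamma_n^N = \frac{1}{N} \sum_{i=1}^N g_{0:n}(\xi_{0:n}^i)\,\delta_{\xi_n^i}$$
and
$$\mu_n \approx \mu_n^N = \sum_{i=1}^N \frac{g_{0:n}(\xi_{0:n}^i)}{\sum_{j=1}^N g_{0:n}(\xi_{0:n}^j)}\,\delta_{\xi_n^i} = \sum_{i=1}^N w_n^i\,\delta_{\xi_n^i}$$
where nonnegative normalized weights $(w_n^1, \dots, w_n^N)$ are defined for any $i = 1, \dots, N$ by
$$w_n^i = \frac{g_{0:n}(\xi_{0:n}^i)}{\sum_{j=1}^N g_{0:n}(\xi_{0:n}^j)}$$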
65 60 (importance sampling SIS algorithm : 5) SIS algorithm, importance sampling approximation: non recursive depth first implementation
simulate an $N$-sample of hidden state sequences $(\xi_{0:n}^1, \dots, \xi_{0:n}^N)$: independently for any $i = 1, \dots, N$, simulate a sequence $\xi_{0:n}^i = (\xi_0^i, \dots, \xi_n^i)$, i.e.
simulate a r.v. $\xi_0^i$ according to $\eta_0(dx)$
for any $k = 1, \dots, n$, simulate a r.v. $\xi_k^i$ according to $Q_k(\xi_{k-1}^i, dx')$
and define for any $i = 1, \dots, N$
$$g_{0:n}(\xi_{0:n}^i) = \prod_{k=0}^n g_k(\xi_{k-1}^i, \xi_k^i) \quad\text{and}\quad w_n^i = \frac{g_{0:n}(\xi_{0:n}^i)}{\sum_{j=1}^N g_{0:n}(\xi_{0:n}^j)}$$
66 61 (importance sampling SIS algorithm : 6) importance sampling approximation: non recursive implementation for nonlinear and non Gaussian systems
simulate an $N$-sample of hidden state sequences $(\xi_{0:n}^1, \dots, \xi_{0:n}^N)$: independently for any $i = 1, \dots, N$, simulate a sequence $\xi_{0:n}^i = (\xi_0^i, \dots, \xi_n^i)$, i.e.
simulate a r.v. $\xi_0^i$ according to $\eta_0(dx)$
for any $k = 1, \dots, n$, simulate a r.v. $W_k^i$ according to $p_k^W(dw)$ and set $\xi_k^i = f_k(\xi_{k-1}^i, W_k^i)$
and define for any $i = 1, \dots, N$
$$g_{0:n}(\xi_{0:n}^i) = \prod_{k=0}^n q_k^V(Y_k - h_k(\xi_k^i)) \quad\text{and}\quad w_n^i = \frac{g_{0:n}(\xi_{0:n}^i)}{\sum_{j=1}^N g_{0:n}(\xi_{0:n}^j)}$$
67 62 (importance sampling SIS algorithm : 7) recursive formulation of weights updating: for any $k = 1, \dots, n$ and for any $i = 1, \dots, N$
$$w_k^i = \frac{g_{0:k}(\xi_{0:k}^i)}{\sum_{j=1}^N g_{0:k}(\xi_{0:k}^j)} = \frac{g_{0:k-1}(\xi_{0:k-1}^i)\,g_k(\xi_{k-1}^i, \xi_k^i)}{\sum_{j=1}^N g_{0:k-1}(\xi_{0:k-1}^j)\,g_k(\xi_{k-1}^j, \xi_k^j)} = \frac{w_{k-1}^i\,g_k(\xi_{k-1}^i, \xi_k^i)}{\sum_{j=1}^N w_{k-1}^j\,g_k(\xi_{k-1}^j, \xi_k^j)}$$
benefit: allows breadth first implementation
68 63 (importance sampling SIS algorithm : 8) SIS algorithm (sequential importance sampling): recursive implementation
for $k = 0$, independently for any $i = 1, \dots, N$, simulate a r.v. $\xi_0^i$ according to $\eta_0(dx)$, and define
$$w_0^i = \frac{g_0(\xi_0^i)}{\sum_{j=1}^N g_0(\xi_0^j)}$$
for any $k = 1, \dots, n$, independently for any $i = 1, \dots, N$, simulate a r.v. $\xi_k^i$ according to $Q_k(\xi_{k-1}^i, dx')$, and update weight as
$$w_k^i = \frac{w_{k-1}^i\,g_k(\xi_{k-1}^i, \xi_k^i)}{\sum_{j=1}^N w_{k-1}^j\,g_k(\xi_{k-1}^j, \xi_k^j)}$$
69 64 (importance sampling SIS algorithm : 9) SIS algorithm (sequential importance sampling): recursive implementation for nonlinear and non Gaussian systems
for $k = 0$, independently for any $i = 1, \dots, N$, simulate a r.v. $\xi_0^i$ according to $\eta_0(dx)$, and define
$$w_0^i = \frac{q_0^V(Y_0 - h_0(\xi_0^i))}{\sum_{j=1}^N q_0^V(Y_0 - h_0(\xi_0^j))}$$
for any $k = 1, \dots, n$, independently for any $i = 1, \dots, N$, simulate a r.v. $W_k^i$ according to $p_k^W(dw)$ and set $\xi_k^i = f_k(\xi_{k-1}^i, W_k^i)$, and update weight as
$$w_k^i = \frac{w_{k-1}^i\,q_k^V(Y_k - h_k(\xi_k^i))}{\sum_{j=1}^N w_{k-1}^j\,q_k^V(Y_k - h_k(\xi_k^j))}$$
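The recursive SIS implementation for nonlinear and non Gaussian systems can be sketched for a scalar model as follows. The Gaussian noise laws, the default standard deviations, the maps passed as `f` and `h` and the observation values are all assumptions made for illustration.

```python
import math
import random


def q_v(v, sigma_v):
    """Assumed observation noise density q_k^V: centred Gaussian."""
    return math.exp(-0.5 * (v / sigma_v) ** 2) / (sigma_v * math.sqrt(2.0 * math.pi))


def normalize(ws):
    total = sum(ws)
    return [w / total for w in ws]


def sis(ys, n, rng, f, h, sigma_0=1.0, sigma_w=1.0, sigma_v=0.5):
    """Sequential importance sampling: blind mutation, multiplicative weight update."""
    # k = 0: sample from eta_0 (assumed N(0, sigma_0^2)) and weight with g_0.
    xs = [rng.gauss(0.0, sigma_0) for _ in range(n)]
    ws = normalize([q_v(ys[0] - h(x), sigma_v) for x in xs])
    for y in ys[1:]:
        # Mutation: xi_k = f_k(xi_{k-1}, W_k), with W_k assumed N(0, sigma_w^2).
        xs = [f(x, rng.gauss(0.0, sigma_w)) for x in xs]
        # Weight update: w_k proportional to w_{k-1} * q_k^V(Y_k - h_k(xi_k)).
        ws = normalize([w * q_v(y - h(x), sigma_v) for w, x in zip(ws, xs)])
    return xs, ws


rng = random.Random(1)
ys = [0.0, 0.5, 1.0, 1.2]   # hypothetical observation record
xs, ws = sis(ys, 2000, rng, f=lambda x, w: 0.9 * x + w, h=lambda x: x)
filter_mean = sum(w * x for w, x in zip(ws, xs))
largest_weight = max(ws)    # grows towards 1 as weights degenerate
```

For this linear Gaussian choice the exact filter mean at the last time instant is about 1.13 (Kalman filter), which gives a reference value for `filter_mean`. Running the same code on a longer record makes `largest_weight` approach 1.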
70 65 (importance sampling SIS algorithm : 10) pros: higher weights are allocated to simulated sequences that are often consistent with observations
cons: weights are evaluated afterwards, and have no impact on how sequences are simulated (blind simulation strategy); in addition, along a given sequence, weights are accumulated in a multiplicative way
weights degeneracy: in practice, one single sequence receives a much larger weight than all other sequences, whose contributions are therefore negligible
memory effect: a sequence cannot be consistent with all observations; a sequence that is consistent (resp. inconsistent) with current observation, but inconsistent (resp. consistent) with earlier observations, will receive a small (resp. a large) weight
proposed solutions:
use observations to guide how sequences are simulated
from time to time, replicate / terminate sequences according to their respective weights
71 66 (SIR algorithm : 1) SIR algorithm
approximate Bayesian filter $\mu_n(dx) = P[X_n \in dx \mid Y_{0:n}]$ using recursive formulation
$$\mu_{k-1} \xrightarrow{\ \text{prediction}\ } \eta_k = \mu_{k-1} Q_k \xrightarrow{\ \text{correction}\ } \mu_k = g_k \cdot \eta_k$$
with initial condition $\mu_0 = g_0 \cdot \eta_0$
idea: look for approximations in the form of (possibly weighted) empirical probability distributions
$$\eta_k \approx \eta_k^N = \sum_{i=1}^N v_k^i\,\delta_{\xi_k^i} \quad\text{and}\quad \mu_k \approx \mu_k^N = \sum_{i=1}^N w_k^i\,\delta_{\xi_k^i}$$
associated with a population of $N$ particles characterized by
positions $(\xi_k^1, \dots, \xi_k^N)$ in $E$
nonnegative normalized weights $(v_k^1, \dots, v_k^N)$ and $(w_k^1, \dots, w_k^N)$
72 67 (SIR algorithm : 2) initial approximation: using importance sampling
$$\mu_0 = g_0 \cdot \eta_0 \approx g_0 \cdot S^N(\eta_0) = \sum_{i=1}^N \frac{g_0(\xi_0^i)}{\sum_{j=1}^N g_0(\xi_0^j)}\,\delta_{\xi_0^i} = \sum_{i=1}^N w_0^i\,\delta_{\xi_0^i}$$
where variables $(\xi_0^1, \dots, \xi_0^N)$ are i.i.d. with common probability distribution $\eta_0$
correction step: clearly, from definition
$$\mu_k^N = g_k \cdot \eta_k^N = \sum_{i=1}^N \frac{v_k^i\,g_k(\xi_k^i)}{\sum_{j=1}^N v_k^j\,g_k(\xi_k^j)}\,\delta_{\xi_k^i} = \sum_{i=1}^N w_k^i\,\delta_{\xi_k^i}$$
which automatically has desired form
73 68 (SIR algorithm : 3) prediction step: from definition
$$\langle \mu_{k-1}^N Q_k, \phi \rangle = \int_E \int_E \mu_{k-1}^N(dx)\,Q_k(x,dx')\,\phi(x') = \sum_{i=1}^N w_{k-1}^i \int_E Q_k(\xi_{k-1}^i,dx')\,\phi(x') = \int_E \Big[\sum_{i=1}^N w_{k-1}^i\,Q_k(\xi_{k-1}^i,dx')\Big]\,\phi(x')$$
for any function $\phi$, hence
$$\mu_{k-1}^N Q_k = \sum_{i=1}^N w_{k-1}^i\,m_k^i$$
in form of a finite mixture, with $m_k^i(dx') = Q_k(\xi_{k-1}^i, dx')$ for any $i = 1, \dots, N$
this requires further approximation (several sampling schemes available)
74 69 (SIR algorithm : 4) multinomial resampling
simulate an $N$-sample $(\xi_k^1, \dots, \xi_k^N)$ according to $\mu_{k-1}^N Q_k$, and set
$$\mu_{k-1}^N Q_k \approx \eta_k^N = S^N(\mu_{k-1}^N Q_k) = \frac{1}{N} \sum_{i=1}^N \delta_{\xi_k^i} = \sum_{i=1}^N v_k^i\,\delta_{\xi_k^i}$$
with $v_k^i = 1/N$ for any $i = 1, \dots, N$
weights are used to select (with replacement) mixture components, with the expected consequence that components with higher weights are selected several times
conversely, components with lower weights are possibly discarded and will not further contribute to approximation
if $R^i$ denotes how many times $i$th mixture component has been selected, or equivalently how many samples in new approximation originate from $i$th mixture component, for any $i = 1, \dots, N$, then r.v. $(R^1, \dots, R^N)$ has a multinomial distribution
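The selection of mixture components can be sketched with the standard library: `random.choices` draws i.i.d. indices from the given weights, i.e. with replacement. The weights below are hypothetical, and `multinomial_resample` is an illustrative helper name.

```python
import random
from collections import Counter


def multinomial_resample(weights, rng):
    """Draw N ancestor indices i.i.d. from the weights, i.e. with replacement."""
    n = len(weights)
    return rng.choices(range(n), weights=weights, k=n)


rng = random.Random(0)
weights = [0.7, 0.2, 0.1, 0.0]        # hypothetical normalized weights
ancestors = multinomial_resample(weights, rng)
counts = Counter(ancestors)            # counts[i] plays the role of R^i
```

The counts always add up to $N$, and a component with zero weight is never selected. Once the ancestors are selected, each new particle is drawn from the corresponding mixture component $m_k^i(dx')$.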
75 70 (SIR algorithm : 5) intuitively, if all mixture weights are equal (or close) to 1/N, i.e. if distribution of mixture weights is close to equidistribution, then selecting mixture components could be counterproductive
weights preservation
simulate exactly one individual from each mixture component and preserve its weight, i.e. independently for any $i = 1,\dots,N$ simulate $\xi_k^i$ according to $m_k^i(dx') = Q_k(\xi_{k-1}^i,dx')$ and set
$$\mu_{k-1}^N Q_k \approx \eta_k^N = \sum_{i=1}^N w_{k-1}^i\,\delta_{\xi_k^i} = \sum_{i=1}^N v_k^i\,\delta_{\xi_k^i}$$
with $v_k^i = w_{k-1}^i$ for any $i = 1,\dots,N$
intuitively, this approach is appropriate if distribution of mixture weights is close to equidistribution, and less appropriate in the extreme case where most weights are zero, except a few components with positive weights
76 71 (SIR algorithm : 6) SIR algorithm (sampling with importance resampling) : recursive implementation
for k = 0, independently for any $i = 1,\dots,N$
simulate a r.v. $\xi_0^i$ according to $\eta_0(dx)$, and define
$$w_0^i = \frac{g_0(\xi_0^i)}{\sum_{j=1}^N g_0(\xi_0^j)}$$
for any $k = 1,\dots,n$, independently for any $i = 1,\dots,N$
select an individual $\widehat{\xi}_{k-1}^i$ among population $(\xi_{k-1}^1,\dots,\xi_{k-1}^N)$ and according to weights $(w_{k-1}^1,\dots,w_{k-1}^N)$
simulate a r.v. $\xi_k^i$ according to $Q_k(\widehat{\xi}_{k-1}^i,dx')$
and define
$$w_k^i = \frac{g_k(\widehat{\xi}_{k-1}^i,\xi_k^i)}{\sum_{j=1}^N g_k(\widehat{\xi}_{k-1}^j,\xi_k^j)}$$
77 72 (SIR algorithm : 7) SIR algorithm (sampling with importance resampling) : recursive formulation for non linear and non Gaussian systems
for k = 0, independently for any $i = 1,\dots,N$
simulate a r.v. $\xi_0^i$ according to $\eta_0(dx)$, and define
$$w_0^i = \frac{q_0^V(Y_0 - h_0(\xi_0^i))}{\sum_{j=1}^N q_0^V(Y_0 - h_0(\xi_0^j))}$$
for any $k = 1,\dots,n$, independently for any $i = 1,\dots,N$
select an individual $\widehat{\xi}_{k-1}^i$ among population $(\xi_{k-1}^1,\dots,\xi_{k-1}^N)$ and according to weights $(w_{k-1}^1,\dots,w_{k-1}^N)$
simulate a r.v. $W_k^i$ according to $p_k^W(dw)$ and set $\xi_k^i = f_k(\widehat{\xi}_{k-1}^i,W_k^i)$
and define
$$w_k^i = \frac{q_k^V(Y_k - h_k(\xi_k^i))}{\sum_{j=1}^N q_k^V(Y_k - h_k(\xi_k^j))}$$
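The recursion above can be sketched in a few lines of Python. Everything model-specific here is an assumption made for illustration only: a scalar, time-homogeneous model $X_k = 0.9\,X_{k-1} + W_k$, $Y_k = X_k + V_k$ with Gaussian noises, and the hypothetical function name `sir_filter`. The observation density is evaluated up to a constant, which cancels when weights are normalized.

```python
import numpy as np

def sir_filter(y, n_particles, sample_x0, f, sample_w, h, q_v, rng):
    """Bootstrap SIR filter (sketch); returns particle estimates of E[X_k | Y_0:k]."""
    xi = sample_x0(n_particles, rng)              # xi_0^i ~ eta_0
    w = q_v(y[0] - h(xi))                         # weighting at k = 0
    w = w / w.sum()
    means = [np.sum(w * xi)]
    for k in range(1, len(y)):
        # selection: draw ancestors according to previous weights
        idx = rng.choice(n_particles, size=n_particles, p=w)
        # mutation: propagate selected particles through the state equation
        xi = f(xi[idx], sample_w(n_particles, rng))
        # weighting: evaluate the observation noise density at the residuals
        w = q_v(y[k] - h(xi))
        w = w / w.sum()
        means.append(np.sum(w * xi))
    return np.array(means)

# Simulate the assumed model, then filter the simulated observations.
rng = np.random.default_rng(0)
n_steps = 50
x = np.empty(n_steps)
x[0] = rng.standard_normal()
for k in range(1, n_steps):
    x[k] = 0.9 * x[k - 1] + 0.5 * rng.standard_normal()
y = x + 0.5 * rng.standard_normal(n_steps)

filtered_means = sir_filter(
    y, 500,
    sample_x0=lambda n, r: r.standard_normal(n),
    f=lambda x_prev, w: 0.9 * x_prev + w,
    sample_w=lambda n, r: 0.5 * r.standard_normal(n),
    h=lambda xs: xs,
    q_v=lambda v: np.exp(-0.5 * (v / 0.5) ** 2),  # Gaussian density up to a constant
    rng=rng,
)
```

The sketch only needs what the slides require of the model: the ability to simulate from $\eta_0$ and from the state noise, and to evaluate the observation noise density.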
78 73 (SIR algorithm : 8) to summarize, particles $(\xi_{k-1}^1,\dots,\xi_{k-1}^N)$
are selected according to their respective weights $(w_{k-1}^1,\dots,w_{k-1}^N)$ [selection step]
evolve according to transition probabilities $Q_k(x,dx')$ [mutation step]
and are weighted by evaluating likelihood function $g_k$ [weighting step]
pros :
weights do not accumulate along each sequence, but are used to select (or resample) particles
particles with larger (resp. smaller) weights are replicated (resp. are terminated)
by keeping only most probable particles at each time instant, expected benefit is to concentrate available computing power within regions of interest
79 74 (SIR algorithm : 9) cons : introduces additional randomness, in resampling (selection) step
proposed solutions
alternate resampling strategies, that allocate an (almost) deterministic number of offspring to each selected particle
adaptive resampling, only when weights $(w_k^1,\dots,w_k^N)$ are too unbalanced (far from equidistribution)
cons : because of replication, fewer truly distinct positions are available (sample impoverishment)
positions degeneracy : in practice, implicitly rely on mutation step to bring diversity again
proposed solution
after resampling (selection) step, add some random move to each selected particle, or apply some artificial Markovian dynamics (Metropolis Hastings, Gibbs sampling, etc.)
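One widely used scheme with an almost deterministic number of offspring is systematic resampling. The slides do not name a specific strategy, so the sketch below is an illustrative choice rather than the author's; the function name `systematic_resample` and the example weights are assumptions. Up to floating-point round-off, each component i receives either $\lfloor N w_i \rfloor$ or $\lceil N w_i \rceil$ offspring.

```python
import numpy as np

def systematic_resample(weights, rng):
    """Return N ancestor indices using a single uniform draw (hypothetical helper)."""
    weights = np.asarray(weights, dtype=float)
    n = weights.size
    # N evenly spaced points in [0, 1), shifted by one common uniform offset.
    positions = (rng.uniform() + np.arange(n)) / n
    cumulative = np.cumsum(weights)
    cumulative[-1] = 1.0                      # guard against round-off in the last bin
    # Each position falls into exactly one bin of the cumulative weights.
    return np.searchsorted(cumulative, positions, side="right")

rng = np.random.default_rng(0)
w = np.array([0.5, 0.3, 0.15, 0.05])          # illustrative normalized weights
ancestors = systematic_resample(w, rng)
offspring = np.bincount(ancestors, minlength=w.size)
```

Only the common offset is random, which is why the offspring numbers vary so little compared with multinomial resampling.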
80 75 (particle filtering : adaptive sampling / resampling : 1) given a finite mixture
$$m = \sum_{i=1}^N w_i\,m_i$$
adaptive SIR algorithm
selecting mixture components is interesting only if weights $(w_1,\dots,w_N)$ are far from equidistribution
several heuristic criteria have been proposed to quantify departure from equidistribution, and to decide whether particles should be resampled or not, e.g.
effective sample size
entropy
81 76 (particle filtering : adaptive sampling / resampling : 2) $\chi^2$ distance and effective sample size
$\chi^2$ distance between two probability vectors $p = (p_1,\dots,p_N)$ and $q = (q_1,\dots,q_N)$ is defined as
$$\chi^2(p,q) = \sum_{i=1}^N q_i\,\Big(\frac{p_i}{q_i} - 1\Big)^2$$
in particular for $p = w = (w_1,\dots,w_N)$ and $q = (1/N,\dots,1/N)$, it holds
$$0 \le \frac{1}{N}\sum_{i=1}^N (N\,w_i - 1)^2 = \frac{1}{N}\sum_{i=1}^N (N\,w_i)^2 - 1 = N\sum_{i=1}^N w_i^2 - 1$$
hence
$$1 \le N_{\text{eff}} = 1 \Big/ \Big[\sum_{i=1}^N w_i^2\Big] \le N$$
where equality $N_{\text{eff}} = N$ is attained at equidistribution, which suggests to resample if
$$H(w_1,\dots,w_N) = N\sum_{i=1}^N w_i^2 - 1 = \frac{N}{N_{\text{eff}}} - 1 \ge H_{\text{red}}$$
for some threshold $H_{\text{red}} > 0$ still to be fixed
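The effective sample size and the $\chi^2$ resampling criterion translate directly into code. The function names and the threshold value used below are assumptions for illustration; the slides leave $H_{\text{red}}$ unspecified.

```python
import numpy as np

def effective_sample_size(weights):
    """N_eff = 1 / sum_i w_i^2 for normalized weights."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                      # normalize defensively
    return 1.0 / np.sum(w ** 2)

def should_resample(weights, h_red):
    """Resample when chi^2 distance to equidistribution, N / N_eff - 1, reaches h_red."""
    n = len(weights)
    chi2 = n / effective_sample_size(weights) - 1.0
    return chi2 >= h_red

uniform = np.full(10, 0.1)               # equidistribution: N_eff = N
degenerate = np.array([1.0] + [0.0] * 9) # one particle carries all the mass: N_eff = 1
ess_uniform = effective_sample_size(uniform)
ess_degenerate = effective_sample_size(degenerate)
```

In an adaptive SIR filter, `should_resample` would gate the selection step; when it returns False, the weights-preservation scheme is used instead.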
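82 77 (estimation error, CLT : 1) on the way to asymptotic results (in 3 slides)
recall linear evolution for unnormalized version of Bayesian filter
$$\gamma_k = \gamma_{k-1}\,R_k = g_k\,(\gamma_{k-1}\,Q_k) = g_k\,(\mu_{k-1}\,Q_k)\,\langle\gamma_{k-1},1\rangle = g_k\,\eta_k\,\langle\gamma_{k-1},1\rangle$$
with initial condition $\gamma_0 = g_0\,\eta_0$
proposed particle approximation for unnormalized distribution
$$\gamma_k^N = g_k\,\eta_k^N\,\langle\gamma_{k-1}^N,1\rangle$$
with initial condition $\gamma_0^N = g_0\,\eta_0^N$ and $\eta_0^N = S^N(\eta_0)$ : clearly
$$\langle\gamma_k^N,1\rangle = \langle\eta_k^N,g_k\rangle\,\langle\gamma_{k-1}^N,1\rangle \qquad\text{and}\qquad \langle\gamma_0^N,1\rangle = \langle\eta_0^N,g_0\rangle$$
and it follows
$$\frac{\gamma_k^N}{\langle\gamma_k^N,1\rangle} = g_k \cdot \eta_k^N = \mu_k^N \qquad\text{and}\qquad \frac{\gamma_0^N}{\langle\gamma_0^N,1\rangle} = g_0 \cdot \eta_0^N = \mu_0^N$$
normalized version of proposed particle approximation for $\gamma_k^N$ coincides with SIR bootstrap approximation $\mu_k^N$ for Bayesian filter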
83 78 (estimation error, CLT : 2) Remark key for induction : for any $k = 1,\dots,n$
$$\gamma_k^N = g_k\,\eta_k^N\,\langle\gamma_{k-1}^N,1\rangle \qquad\text{and}\qquad \gamma_k = g_k\,(\gamma_{k-1}\,Q_k)$$
and by difference
$$\gamma_k^N - \gamma_k = g_k\,(\gamma_{k-1}^N Q_k - \gamma_{k-1} Q_k) + g_k\,(\eta_k^N - \mu_{k-1}^N Q_k)\,\langle\gamma_{k-1}^N,1\rangle$$
hence
$$\langle\gamma_k^N - \gamma_k,\phi\rangle = \langle\gamma_{k-1}^N - \gamma_{k-1},Q_k(g_k\,\phi)\rangle + \langle\eta_k^N - \mu_{k-1}^N Q_k,g_k\,\phi\rangle\,\langle\gamma_{k-1}^N,1\rangle$$
error at current generation, evaluated on function $\phi$, is decomposed into
error at previous generation, evaluated on function $R_k\,\phi = Q_k(g_k\,\phi)$
local error resulting from Monte Carlo approximation
even though samples are actually dependent, because of resampling at each generation, conditionally on previous generations, new samples are generated independently
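84 79 (estimation error, CLT : 3) with this conditioning argument, error estimates
$$\sup_{\phi\,:\,\|\phi\| = 1} E\,\Big|\Big\langle\frac{\gamma_k^N - \gamma_k}{\langle\gamma_k,1\rangle},\phi\Big\rangle\Big| \le \frac{c_k}{\sqrt{N}} \qquad\text{and}\qquad \sup_{\phi\,:\,\|\phi\| = 1} E\,|\langle\mu_k^N - \mu_k,\phi\rangle| \le \frac{2\,c_k}{\sqrt{N}}$$
of order $1/\sqrt{N}$, and CLT
$$\sqrt{N}\,\Big\langle\frac{\gamma_k^N - \gamma_k}{\langle\gamma_k,1\rangle},\phi\Big\rangle \Longrightarrow \mathcal{N}(0,V_k(\phi)) \qquad\text{and}\qquad \sqrt{N}\,\langle\mu_k^N - \mu_k,\phi\rangle \Longrightarrow \mathcal{N}(0,v_k(\phi))$$
with $v_k(\phi) = V_k(\phi - \langle\mu_k,\phi\rangle)$
can be obtained by induction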
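A short worked step, not taken from the slides, shows how a bound for the normalized filter can follow from the bound for the unnormalized one with an extra factor 2. It assumes $\|\phi\|$ is the supremum norm, so that $|\langle\mu_k^N,\phi\rangle| \le 1$ whenever $\|\phi\| = 1$, and uses $\mu_k^N = \gamma_k^N / \langle\gamma_k^N,1\rangle$ and $\mu_k = \gamma_k / \langle\gamma_k,1\rangle$.

```latex
\begin{align*}
\langle\mu_k^N - \mu_k,\phi\rangle
  &= \frac{\langle\gamma_k^N - \gamma_k,\phi\rangle}{\langle\gamma_k,1\rangle}
   - \langle\mu_k^N,\phi\rangle\,
     \frac{\langle\gamma_k^N - \gamma_k,1\rangle}{\langle\gamma_k,1\rangle},
\\[4pt]
E\,|\langle\mu_k^N - \mu_k,\phi\rangle|
  &\le E\,\Big|\Big\langle\frac{\gamma_k^N - \gamma_k}{\langle\gamma_k,1\rangle},\phi\Big\rangle\Big|
   + E\,\Big|\Big\langle\frac{\gamma_k^N - \gamma_k}{\langle\gamma_k,1\rangle},1\Big\rangle\Big|
   \le \frac{2\,c_k}{\sqrt{N}} .
\end{align*}
```

The first identity is checked by expanding the right-hand side: the terms $\langle\gamma_k^N,\phi\rangle / \langle\gamma_k,1\rangle$ cancel, leaving $\langle\mu_k^N,\phi\rangle - \langle\mu_k,\phi\rangle$.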
85 Some algorithmic variants
regularization
progressive weighting, MCMC iterations
sample size adaptation
marginalization aka Rao Blackwellization
interacting Kalman filters
interacting finite state (Baum) filters
86 80 (marginalization aka Rao Blackwellization : 1) conditioning as a variance reduction technique
if
$$E[f(X^1,X^2)] = E[\,E[f(X^1,X^2) \mid X^2]\,] = E[F(X^2)]$$
where
$$F(x^2) = E[f(X^1,X^2) \mid X^2 = x^2] = \int_{E_1} f(x^1,x^2)\,P[X^1 \in dx^1 \mid X^2 = x^2]$$
has an explicit expression, then Monte Carlo estimator
$$\frac{1}{N}\sum_{i=1}^N F(X_i^2) \approx E[F(X^2)] = E[f(X^1,X^2)]$$
where $(X_1^2,\dots,X_N^2)$ is an N sample with same common distribution as $X^2$, has smaller variance than Monte Carlo estimator
$$\frac{1}{N}\sum_{i=1}^N f(X_i^1,X_i^2) \approx E[f(X^1,X^2)]$$
where $((X_1^1,X_1^2),\dots,(X_N^1,X_N^2))$ is an N sample with same common distribution as $(X^1,X^2)$
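The variance reduction can be checked numerically on a toy case. The model below is an assumption chosen so that $F$ is explicit: $X^2 \sim \mathcal{N}(0,1)$, $X^1 \mid X^2 \sim \mathcal{N}(X^2,1)$ and $f(x^1,x^2) = (x^1)^2$, which gives $F(x^2) = (x^2)^2 + 1$ and $E[f(X^1,X^2)] = 2$.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_repeats = 100, 2000       # N per estimator, number of independent replications

# Each row is one independent N-sample of (X^1, X^2).
x2 = rng.standard_normal((n_repeats, n_samples))
x1 = x2 + rng.standard_normal((n_repeats, n_samples))

# Crude estimator: average f(X^1, X^2) = (X^1)^2 over the joint sample.
crude = np.mean(x1 ** 2, axis=1)
# Rao-Blackwellized estimator: average F(X^2) = (X^2)^2 + 1 over the X^2 sample only.
rao_blackwell = np.mean(x2 ** 2 + 1.0, axis=1)

var_crude = crude.var()
var_rao_blackwell = rao_blackwell.var()
```

Both estimators are unbiased for the same quantity; comparing `var_crude` with `var_rao_blackwell` across replications shows the effect of integrating out $X^1$ analytically.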
87 81 (marginalization aka Rao Blackwellization : 2) 1st example : conditionally linear Gaussian systems
$$X_k^L = F_k^L(X_{k-1}^{NL})\,X_{k-1}^L + f_k^L(X_{k-1}^{NL}) + W_k^L$$
$$X_k^{NL} = F_k^{NL}(X_{k-1}^{NL})\,X_{k-1}^L + f_k^{NL}(X_{k-1}^{NL}) + W_k^{NL}$$
$$Y_k = h_k(X_k^{NL}) + V_k$$
clearly
$$E\Big[\phi(X_n^L,X_n^{NL})\,\prod_{k=0}^n g_k(X_k^{NL})\Big] = E\Big[E[\phi(X_n^L,X_n^{NL}) \mid X_{0:n}^{NL}]\,\prod_{k=0}^n g_k(X_k^{NL})\Big]$$
and conditional distribution of linear component $X_n^L$ given nonlinear component sequence $X_{0:n}^{NL}$ is Gaussian, with mean $\widehat{X}_n^{L|NL}$ and covariance matrix $P_n^{L|NL}$ given explicitly, in recursive form, by Kalman filter equations
introduce new hidden state $\{(X_k^{NL},\widehat{X}_k^{L|NL},P_k^{L|NL})\}$ instead of $\{(X_k^L,X_k^{NL})\}$
benefit : explore with particles the subspace associated with nonlinear components, and associated with each particle, a Kalman filter estimates linear components
88 82 (marginalization aka Rao Blackwellization : 3) 2nd example : non linear systems with Markovian switching regimes / modes
$$X_k = f_k(s_{k-1},X_{k-1},W_k)$$
$$Y_k = h_k(X_k) + V_k$$
where regime / mode sequence $\{s_k\}$ forms a Markov chain with finite state space $I$
clearly
$$E\Big[\phi(s_n,X_n)\,\prod_{k=0}^n g_k(X_k)\Big] = E\Big[E[\phi(s_n,X_n) \mid X_{0:n}]\,\prod_{k=0}^n g_k(X_k)\Big]$$
and conditional distribution of regime / mode $s_n$ given continuous components sequence $X_{0:n}$ is a finite dimensional probability vector defined by
$$p_n^i = P[s_n = i \mid X_{0:n}] \qquad\text{for any } i \in I$$
given explicitly, in recursive form, by solving Baum forward equation
introduce new hidden state $\{(X_k,p_k)\}$ instead of $\{(s_k,X_k)\}$
benefit : avoid sampling finite state space
89 Conclusion
particle filtering provides an implementation of Bayesian approach that is
intuitive, easy to understand and implement
flexible, adapts to many models, many algorithmic variants available
numerically efficient, through some selection mechanism
amenable to mathematical analysis
Al Nosedal. University of Toronto. Fall 2017 My momma always said: Life was like a box of chocolates. You never know what you re gonna get. Forrest Gump. There are situations where one might be interested
More informationNotes on pseudomarginal methods, variational Bayes and ABC
Notes on pseudomarginal methods, variational Bayes and ABC Christian Andersson Naesseth October 3, 2016 The PseudoMarginal Framework Assume we are interested in sampling from the posterior distribution
More informationNonparametric Drift Estimation for Stochastic Differential Equations
Nonparametric Drift Estimation for Stochastic Differential Equations Gareth Roberts 1 Department of Statistics University of Warwick Brazilian Bayesian meeting, March 2010 Joint work with O. Papaspiliopoulos,
More informationMachine Learning for OR & FE
Machine Learning for OR & FE Hidden Markov Models Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com Additional References: David
More informationBayesian Machine Learning
Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 4 Occam s Razor, Model Construction, and Directed Graphical Models https://people.orie.cornell.edu/andrew/orie6741 Cornell University September
More informationInference in Bayesian Networks
Andrea Passerini passerini@disi.unitn.it Machine Learning Inference in graphical models Description Assume we have evidence e on the state of a subset of variables E in the model (i.e. Bayesian Network)
More informationCSC 446 Notes: Lecture 13
CSC 446 Notes: Lecture 3 The Problem We have already studied how to calculate the probability of a variable or variables using the message passing method. However, there are some times when the structure
More information2D Image Processing. Bayes filter implementation: Kalman filter
2D Image Processing Bayes filter implementation: Kalman filter Prof. Didier Stricker Dr. Gabriele Bleser Kaiserlautern University http://ags.cs.unikl.de/ DFKI Deutsches Forschungszentrum für Künstliche
More information