
Adaptive Monte Carlo methods

Jean-Michel Marin, Projet Select, INRIA Futurs, Université Paris-Sud

Joint work with Randal Douc (École Polytechnique), Arnaud Guillin (Université de Marseille) and Christian Robert (Université Paris Dauphine)

Seminar, Montpellier, 17 December 2006

Introduction

Let $(\mathsf{X}, \mathcal{B}(\mathsf{X}), \Pi)$ be a probability space.

(A1) $\Pi \ll \mu$ and $\Pi(dx) = \pi(x)\,\mu(dx)$.

(A2) $\pi$ is known up to a normalizing constant:
$$\pi(x) = \tilde\pi(x) \Big/ \int \tilde\pi(x)\,\mu(dx);$$
$\tilde\pi$ is known; the calculation of $\int \tilde\pi(x)\,\mu(dx) < \infty$ is intractable.

Problem: for some $\Pi$-measurable functions $h$, approximate
$$\Pi(h) = \int h(x)\,\pi(x)\,\mu(dx) = \frac{\int h(x)\,\tilde\pi(x)\,\mu(dx)}{\int \tilde\pi(x)\,\mu(dx)}.$$

(A3) the calculation of $\int h(x)\,\tilde\pi(x)\,\mu(dx)$ is intractable.

More concisely, with $X \sim \Pi$ and $\mu(dx) = dx$, we would like to approximate
$$\Pi(h) = \mathbb{E}_\Pi(h(X)) = \frac{\int h(x)\,\tilde\pi(x)\,dx}{\int \tilde\pi(x)\,dx}.$$

Applications in Bayesian inference, where the target distribution is the posterior distribution of the parameter of interest:
$$\pi(\theta \mid x) \propto f(x \mid \theta)\,\pi_1(\theta),$$
where $f(x \mid \theta)$ ($\theta \in \Theta$) is the likelihood and $\pi_1(\theta)$ the prior distribution of $\theta$. A Bayesian estimator of $\theta$ is the posterior mean of $\theta$, that is,
$$\mathbb{E}^\pi(\theta \mid x) = \frac{\int \theta\, f(x \mid \theta)\,\pi_1(\theta)\,d\theta}{\int f(x \mid \theta)\,\pi_1(\theta)\,d\theta}.$$

Monte Carlo methods (MC)

$\Rightarrow$ Generate an iid sample $x_1, \dots, x_N$ from $\Pi$ and estimate $\mathbb{E}_\Pi(h(X))$ by
$$\hat\Pi^{MC}_N(h) = N^{-1} \sum_{i=1}^N h(x_i).$$

(i) $\hat\Pi^{MC}_N(h) \xrightarrow{\text{a.s.}} \mathbb{E}_\Pi(h(X))$;

(ii) if $\mathbb{E}_\Pi(h^2(X)) < \infty$,
$$\sqrt{N}\left(\hat\Pi^{MC}_N(h) - \mathbb{E}_\Pi(h(X))\right) \xrightarrow{\mathcal{L}} \mathcal{N}\big(0,\ \mathbb{V}_\Pi(h(X))\big).$$

Often impossible to simulate directly from $\Pi$!
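As an illustration of the plain Monte Carlo estimator and its CLT error bar, here is a minimal Python sketch (not part of the slides; the target, test function and all names are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_estimate(h, sampler, N=10_000):
    """Plain Monte Carlo: average h over an iid sample from the target."""
    hx = h(sampler(N))
    # CLT-based standard error, valid when E[h^2(X)] < infinity
    return hx.mean(), hx.std(ddof=1) / np.sqrt(N)

# Example: E[X^2] = 1 under the standard normal target
est, se = mc_estimate(lambda x: x**2, lambda n: rng.standard_normal(n))
print(f"estimate = {est:.4f} +/- {1.96 * se:.4f}")
```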

Markov Chain Monte Carlo methods (MCMC)

$\Rightarrow$ Generate $x^{(1)}, \dots, x^{(T)}$ from a Markov chain $(x^{(t)})_{t \in \mathbb{N}}$ with stationary distribution $\Pi$ and estimate $\mathbb{E}_\Pi(h(X))$ by the average over the last $N$ iterations,
$$\hat\Pi^{MCMC}_N(h) = N^{-1} \sum_{i=T-N+1}^{T} h\big(x^{(i)}\big).$$

Convergence to the stationary distribution can be very slow!

Metropolis-Hastings algorithms

Metropolis-Hastings algorithms are generic (or off-the-shelf) MCMC algorithms, compared with the Gibbs sampler, in the sense that they can be tuned with a much wider range of possibilities.

If the target distribution has density $\pi$, the generic Metropolis-Hastings algorithm is:

Initialization: choose an arbitrary $x^{(0)}$.

Iteration $t$:
1. Given $x^{(t-1)}$, generate $\tilde{x} \sim q(x^{(t-1)}, \cdot)$.
2. Calculate
$$\rho(x^{(t-1)}, \tilde{x}) = \min\left(\frac{\pi(\tilde{x}) / q(x^{(t-1)}, \tilde{x})}{\pi(x^{(t-1)}) / q(\tilde{x}, x^{(t-1)})},\ 1\right).$$
3. With probability $\rho(x^{(t-1)}, \tilde{x})$, accept $\tilde{x}$ and set $x^{(t)} = \tilde{x}$; otherwise reject $\tilde{x}$ and set $x^{(t)} = x^{(t-1)}$.
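A minimal Python sketch of this generic loop (illustrative, not from the slides; it assumes the target density pi can be evaluated up to a constant and the proposal density q can be evaluated pointwise):

```python
import numpy as np

rng = np.random.default_rng(1)

def metropolis_hastings(pi, q_sample, q_density, x0, T):
    """Generic MH: pi may be unnormalized; q_sample(x) draws from q(x, .),
    q_density(x, y) evaluates the proposal density q(x, y)."""
    chain = np.empty(T + 1)
    chain[0] = x0
    for t in range(1, T + 1):
        x = chain[t - 1]
        y = q_sample(x)
        # acceptance probability: min(1, [pi(y)/q(x,y)] / [pi(x)/q(y,x)])
        rho = min(1.0, (pi(y) * q_density(y, x)) / (pi(x) * q_density(x, y)))
        chain[t] = y if rng.uniform() < rho else x
    return chain
```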

This algorithm only needs to simulate from $q$, which we can choose almost arbitrarily, as long as $q$ is capable of reaching all areas of positive probability under $\pi$. While the theoretical guarantees of convergence are strong, the choice of $q$ remains paramount in practice.

The random walk sampler

A random walk proposal has a symmetric transition density $q(x, y) = q_{RW}(y - x)$ where $q_{RW}(x) = q_{RW}(-x)$. In this case the acceptance probability $\rho(x, y)$ reduces to the simpler form
$$\rho(x, y) = \min\left(1,\ \frac{\pi(y)}{\pi(x)}\right).$$

Example

Consider the standard normal distribution $\mathcal{N}(0, 1)$ as a target. If we use a random walk Metropolis-Hastings algorithm with a normal random walk, i.e.
$$\tilde{x} \mid x^{(t-1)} \sim \mathcal{N}(x^{(t-1)}, \sigma^2), \qquad q_{RW}(\tilde{x} - x^{(t-1)}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}\big(\tilde{x} - x^{(t-1)}\big)^2\right),$$
the performance of the sampler depends on the value of $\sigma^2$.
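Reusing the sketch above with a Gaussian random walk proposal makes the dependence on $\sigma^2$ easy to reproduce (illustrative; the values are our own choices meant to mimic the too-small / adequate / too-large regimes of the figures below):

```python
from scipy.stats import norm

pi = lambda x: np.exp(-0.5 * x**2)   # N(0,1) target, unnormalized

for sigma2 in (1e-4, 2.0, 1e3):      # too small / about right / too large
    s = np.sqrt(sigma2)
    chain = metropolis_hastings(
        pi,
        q_sample=lambda x, s=s: x + s * rng.standard_normal(),
        q_density=lambda x, y, s=s: norm.pdf(y, loc=x, scale=s),
        x0=0.0, T=10_000)
    acc = np.mean(chain[1:] != chain[:-1])   # rejections repeat the state
    print(f"sigma^2 = {sigma2:g}: acceptance rate = {acc:.2f}")
```

A tiny $\sigma^2$ gives near-certain acceptance but almost no movement; a huge $\sigma^2$ gives frequent rejections: both show up as slowly decaying autocorrelations.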

Figure 1: (left) $\sigma^2 = 10^{-4}$ and (right) $\sigma^2 = 10^{3}$. Top: sequence of 10,000 iterations subsampled at every 10th iteration; middle: histogram of the last 2,000 iterations compared with the target density; bottom: empirical autocorrelations.

Figure 2: $\sigma^2 = 2$. Top: sequence of 10,000 iterations subsampled at every 10th iteration; middle: histogram of the last 2,000 iterations compared with the target density; bottom: empirical autocorrelations.

Importance sampling

Let $Q$ be a probability distribution on $(\mathsf{X}, \mathcal{B}(\mathsf{X}))$. Suppose that $Q(dx) = q(x)\,dx$ and that $[q(x) = 0] \Rightarrow [\pi(x) = 0]$. Then
$$\Pi(h) = \mathbb{E}_\Pi(h(X)) = \int h(x)\,\frac{\pi(x)}{q(x)}\,q(x)\,dx = \mathbb{E}_Q\left(\frac{\pi(X)}{q(X)}\,h(X)\right) = Q\left(\frac{\pi}{q}\,h\right).$$

$\Rightarrow$ Generate an iid sample $x_1, \dots, x_N$ from $Q$, called the proposal distribution, and estimate $\Pi(h)$ by
$$\hat\Pi^{IS}_{Q,N}(h) = N^{-1} \sum_{i=1}^N \frac{\pi(x_i)}{q(x_i)}\,h(x_i).$$
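A minimal sketch of the unnormalised IS estimator (illustrative; it assumes $\pi$ is fully normalized, and the heavier-tailed Student-t proposal is our own choice):

```python
import numpy as np
from scipy.stats import norm, t as student_t

rng = np.random.default_rng(2)

def importance_sampling(h, pi, q_pdf, q_sample, N=100_000):
    """Unnormalised IS: requires pi fully known, normalizing constant included."""
    x = q_sample(N)
    w = pi(x) / q_pdf(x)            # importance weights pi/q
    return np.mean(w * h(x))

# E[X^2] = 1 under N(0,1), with a Student-t(3) proposal (thicker tails than pi)
est = importance_sampling(lambda x: x**2, norm.pdf,
                          lambda x: student_t.pdf(x, df=3),
                          lambda n: student_t.rvs(df=3, size=n, random_state=rng))
print(est)
```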

(i) $\hat\Pi^{IS}_{Q,N}(h) \xrightarrow{\text{a.s.}} \mathbb{E}_\Pi(h(X))$;

(ii) if $\mathbb{E}_Q\left(\frac{\pi^2(X)}{q^2(X)}\,h^2(X)\right) < \infty$,
$$\sqrt{N}\left(\hat\Pi^{IS}_{Q,N}(h) - \mathbb{E}_\Pi(h(X))\right) \xrightarrow{\mathcal{L}} \mathcal{N}\left(0,\ \mathbb{V}_Q\left(\frac{\pi(X)}{q(X)}\,h(X)\right)\right).$$

For many $h$, a sufficient condition for $\mathbb{E}_Q\left(\frac{\pi^2(X)}{q^2(X)}\,h^2(X)\right) < \infty$ is that $\pi/q$ is bounded.

When the normalizing constant of $\Pi$ is unknown, it is not possible to use $\hat\Pi^{IS}_{Q,N}$. It is then natural to use the self-normalized version of the IS estimator,
$$\hat\Pi^{SNIS}_{Q,N}(h) = \left(\sum_{i=1}^N \frac{\tilde\pi(x_i)}{q(x_i)}\right)^{-1} \sum_{i=1}^N \frac{\tilde\pi(x_i)}{q(x_i)}\,h(x_i).$$
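The self-normalized variant changes only the averaging; a short sketch continuing the one above, where pi_tilde may omit the normalizing constant:

```python
def snis(h, pi_tilde, q_pdf, q_sample, N=100_000):
    """Self-normalized IS: pi_tilde known only up to a constant."""
    x = q_sample(N)
    w = pi_tilde(x) / q_pdf(x)
    return np.sum((w / w.sum()) * h(x))   # normalized weights
```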

(i) $\hat\Pi^{SNIS}_{Q,N}(h) \xrightarrow{\text{a.s.}} \mathbb{E}_\Pi(h(X))$;

(ii) if $\mathbb{E}_Q\left(\frac{\pi^2(X)}{q^2(X)}\,\big(1 + h^2(X)\big)\right) < \infty$,
$$\sqrt{N}\left(\hat\Pi^{SNIS}_{Q,N}(h) - \mathbb{E}_\Pi(h(X))\right) \xrightarrow{\mathcal{L}} \mathcal{N}\left(0,\ \mathbb{V}_Q\left(\frac{\pi(X)}{q(X)}\,\big(h(X) - \Pi(h)\big)\right)\right).$$

The quality of the SNIS approximation depends on the choice of the proposal distribution $Q$.

It is well known that the importance distribution
$$q^\star(x) = |h(x)|\,\pi(x) \Big/ \int |h(y)|\,\pi(y)\,dy$$
minimizes the variance of $\hat\Pi^{IS}_{Q,N}(h)$. It produces a zero-variance estimator when $h$ is either positive or negative (indeed, in both cases, $\hat\Pi^{IS}_{Q,N}(h) = \mathbb{E}_\Pi(h(X))$). $q^\star$ cannot be used in practice because it depends on the integral $\int |h(y)|\,\pi(y)\,dy$. This result is thus rather understood as providing a goal when choosing an importance function $q$ tailored to the approximation of $\mathbb{E}_\Pi(h(X))$.
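To see why $q^\star$ gives a zero-variance estimator when, say, $h > 0$, note that every term of the IS average is then constant:
$$\frac{\pi(x_i)}{q^\star(x_i)}\, h(x_i) = \frac{\pi(x_i)\,\int h(y)\,\pi(y)\,dy}{h(x_i)\,\pi(x_i)}\; h(x_i) = \int h(y)\,\pi(y)\,dy = \mathbb{E}_\Pi\big(h(X)\big),$$
so $\hat\Pi^{IS}_{Q^\star,N}(h)$ equals $\mathbb{E}_\Pi(h(X))$ for every sample, hence has zero variance.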

Similarly,
$$q^{\star\star}(x) = |h(x) - \Pi(h)|\,\pi(x) \Big/ \int |h(y) - \mathbb{E}_\Pi(h(X))|\,\pi(y)\,dy$$
minimizes the asymptotic variance of $\hat\Pi^{SNIS}_{Q,N}(h)$. This second optimum is not available either, because it still depends on $\mathbb{E}_\Pi(h(X))$. There is little in the literature besides general recommendations that the support of $q$ should be the support of $|h(x)|\,\pi(x)$ or of $|h(y) - \Pi(h)|\,\pi(y)$, or that the tails of $q$ should be at least as thick as those of $|h(x)|\,\pi(x)$.

PMC algorithms

The notion of importance sampling can actually be greatly generalized to encompass much more adaptive and local schemes than previously thought. The idea of this extension is to learn from experience, that is, to build an importance sampling function based on the performance of earlier importance sampling proposals. By introducing a temporal dimension to the selection of the importance function, an adaptive perspective can be achieved at little cost, for a potentially large gain in efficiency.

D-kernel PMC algorithm

Let $Q_{i,t}$ be the proposal distribution used at iteration $t$ of the algorithm for particle $x_{i,t}$. Obviously, the quasi-total freedom in the construction of the $Q_{i,t}$'s has drawbacks, namely that some proposals do not necessarily lead to improvements in terms of variance reduction. We now restrict the family of proposals from which to select the new $Q_{i,t}$'s to mixtures of fixed proposals.

We assume from now on that we use in parallel $D$ fixed kernels $Q_d(\cdot, \cdot)$ with densities $q_d$, and that the proposal is a mixture of those kernels,
$$q_{i,t}(x) = \sum_{d=1}^D \alpha_d^{t,N}\, q_d(x_{i,t-1}, x), \qquad \sum_d \alpha_d^{t,N} = 1,$$
where the weights $\alpha_d^{t,N} > 0$ can be modified at each iteration. The amount of adaptivity we allow in this version of PMC is thus restricted to a possible modification of the weights $\alpha_d^{t,N}$.

The importance weight associated with this mixture proposal is
$$\pi(x_{i,t}) \Big/ \sum_{d=1}^D \alpha_d^{t,N}\, q_d(x_{i,t-1}, x_{i,t}),$$
while simulation from $q_{i,t}$ can be decomposed into the two usual mixture steps: first pick the component $d$, then simulate from the corresponding kernel $Q_d$.

Generic D-kernel PMC algorithm

At time 0, produce the sample $(\tilde{x}_{i,0})_{1 \le i \le N}$ and set $\alpha_d^{1,N} = 1/D$ ($1 \le d \le D$).

At time $1 \le t \le T$:

a) Conditionally on the $\alpha_d^{t,N}$'s, generate
$$(K_{i,t})_{1 \le i \le N} \overset{\text{iid}}{\sim} \mathcal{M}\big(1, (\alpha_d^{t,N})_{1 \le d \le D}\big);$$

b) Conditionally on $(x_{i,t-1}, K_{i,t})_{1 \le i \le N}$, generate independently $(\tilde{x}_{i,t})_{1 \le i \le N}$, $\tilde{x}_{i,t} \sim Q_{K_{i,t}}(x_{i,t-1}, \cdot)$, and set
$$\omega_{i,t} = \pi(\tilde{x}_{i,t}) \Big/ \sum_{d=1}^D \alpha_d^{t,N}\, q_d(x_{i,t-1}, \tilde{x}_{i,t});$$

c) Conditionally on $(x_{i,t-1}, K_{i,t}, \tilde{x}_{i,t})_{1 \le i \le N}$, generate
$$(J_{i,t})_{1 \le i \le N} \overset{\text{iid}}{\sim} \mathcal{M}\big(1, (\bar\omega_{i,t})_{1 \le i \le N}\big)$$
(with $\bar\omega_{i,t} = \omega_{i,t} / \sum_{j=1}^N \omega_{j,t}$ the normalized weights), set $x_{i,t} = \tilde{x}_{J_{i,t},t}$ and
$$\alpha_d^{t+1,N} = \Psi_d\big((x_{i,t-1}, \tilde{x}_{i,t}, K_{i,t})_{1 \le i \le N}\big) \quad \text{such that} \quad \sum_{d=1}^D \alpha_d^{t+1,N} = 1.$$
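For concreteness, here is a compact Python sketch of one sweep of this algorithm (our own illustrative code, not the authors' implementation; the weight-update function Psi is left abstract and is supplied by the criteria below):

```python
import numpy as np

rng = np.random.default_rng(3)

def pmc_sweep(x_prev, alpha, pi, kernels_sample, kernels_pdf, Psi):
    """One iteration of the D-kernel PMC algorithm.
    x_prev: (N,) particles; alpha: (D,) mixture weights;
    kernels_sample[d](x) draws from Q_d(x, .); kernels_pdf[d](x, y) = q_d(x, y)."""
    N, D = len(x_prev), len(alpha)
    # a) pick a kernel index K_i for each particle
    K = rng.choice(D, size=N, p=alpha)
    # b) propose from the selected kernels; mixture importance weights
    x_tilde = np.array([kernels_sample[K[i]](x_prev[i]) for i in range(N)])
    mix = sum(alpha[d] * kernels_pdf[d](x_prev, x_tilde) for d in range(D))
    w = pi(x_tilde) / mix
    w_bar = w / w.sum()
    # c) multinomial resampling, then update of the mixture weights
    J = rng.choice(N, size=N, p=w_bar)
    return x_tilde[J], Psi(w_bar, K, D), w, x_tilde
```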

$\Psi_d$ ($1 \le d \le D$) denotes an update function that depends upon the past iteration.

(A1) For all $d \in \{1, \dots, D\}$, $\Pi \otimes \Pi\,\{q_d(x, x') = 0\} = 0$ (the individual kernel importance weights are almost surely finite).

Theorem 1. Under (A1) and (A2), for any function $h$ in $L^1_\pi$ and for all $t \ge 0$, both the unnormalised and the self-normalised PMC estimators are convergent:
$$\hat\Pi^{PMC}_{t,N}(h) = \frac{1}{N} \sum_{i=1}^N \omega_{i,t}\, h(\tilde{x}_{i,t}) \xrightarrow[N \to \infty]{P} \mathbb{E}_\Pi(h(X))$$
and
$$\hat\Pi^{SNPMC}_{t,N}(h) = \left(\sum_{i=1}^N \omega_{i,t}\right)^{-1} \sum_{i=1}^N \omega_{i,t}\, h(\tilde{x}_{i,t}) \xrightarrow[N \to \infty]{P} \mathbb{E}_\Pi(h(X)).$$

As noted earlier, the unnormalised PMC estimator can only be used when $\pi$ is completely known.

(A2) For all $d \in \{1, \dots, D\}$,
$$\Pi \otimes \Pi\left\{\big(1 + h^2(x')\big)\,\frac{\pi(x')}{q_d(x, x')}\right\} < \infty \quad \text{(integrability condition)}.$$

Theorem 2. Under (A1) and (A2), if for all $t \ge 1$ and $1 \le d \le D$, $\alpha_d^{t,N} \xrightarrow[N \to \infty]{P} \alpha_d^t > 0$, then both
$$\sqrt{N}\left(\hat\Pi^{SNPMC}_{t,N}(h) - \mathbb{E}_\Pi(h(X))\right) \quad \text{and} \quad \sqrt{N}\left(\hat\Pi^{PMC}_{t,N}(h) - \mathbb{E}_\Pi(h(X))\right)$$
converge in distribution as $N$ goes to infinity to centered normal distributions with variances

$$\sigma^2_{1,t} = \Pi \otimes \Pi\left(\big(h(x') - \mathbb{E}_\Pi(h(X))\big)^2\, \frac{\pi(x')}{\sum_{d=1}^D \alpha_d^t\, q_d(x, x')}\right)$$
and
$$\sigma^2_{2,t} = \Pi \otimes \Pi\left(\frac{\pi(x')}{\sum_{d=1}^D \alpha_d^t\, q_d(x, x')}\left(h(x') - \mathbb{E}_\Pi(h(X))\,\frac{\sum_{d=1}^D \alpha_d^t\, q_d(x, x')}{\pi(x')}\right)^2\right).$$

A first Kullback-Leibler criterion

Let
$$\mathcal{S} = \left\{\alpha = (\alpha_1, \dots, \alpha_D);\ \forall d \in \{1, \dots, D\},\ \alpha_d \ge 0 \ \text{and} \ \sum_{d=1}^D \alpha_d = 1\right\}.$$

For $\alpha \in \mathcal{S}$, let us denote by $KL_1(\alpha)$ the Kullback-Leibler divergence between the target distribution $\Pi$ and the mixture:
$$KL_1(\alpha) = \int \log\left[\frac{\pi(x)\,\pi(x')}{\pi(x)\sum_{d=1}^D \alpha_d\, q_d(x, x')}\right] \Pi \otimes \Pi(dx, dx').$$

First Kullback-Leibler divergence criterion: the best mixture of transition kernels is the one that minimizes $KL_1(\alpha)$.

Theorem 3. Under (A1) and (A2), for both the unnormalised and the self-normalised cases, the updates $\Psi_d$ of the mixture weights given by
$$\alpha_d^{t+1,N} = \sum_{i=1}^N \bar\omega_{i,t}\, \mathbb{I}_d(K_{i,t})$$
guarantee a systematic decrease of $KL_1$, a long-term run of the algorithm providing the mixture that is $KL_1$-closest to the target.
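In code, this update is one line per kernel: each $\alpha_d$ collects the normalized weights of the particles that kernel $d$ proposed (a sketch that plugs into the Psi slot of the sweep above):

```python
def Psi_kl1(w_bar, K, D):
    """Theorem 3 update: alpha_d <- sum of normalized weights w_bar_i
    over the particles i whose proposal came from kernel d."""
    alpha = np.array([w_bar[K == d].sum() for d in range(D)])
    return alpha / alpha.sum()   # guard against rounding drift
```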

A first toy example

Target: $\pi(x) = \frac{1}{3} f_{\mathcal{N}(-1, 0.1)}(x) + \frac{1}{3} f_{\mathcal{N}(0, 1)}(x) + \frac{1}{3} f_{\mathcal{N}(3, 10)}(x)$.

3 proposal distributions: $\mathcal{N}(-1, 0.1)$, $\mathcal{N}(0, 1)$ and $\mathcal{N}(3, 10)$ (simpler than transition kernels).

Use of the Rao-Blackwellized 3-kernel algorithm with $N = 100{,}000$; a code sketch follows below.
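Wiring the sweep and the $KL_1$ update together for this example might look as follows (illustrative only: position-independent proposals are encoded as kernels that ignore their first argument, the basic Theorem 3 update replaces the Rao-Blackwellized variant of the slides, and a smaller $N$ is used):

```python
from scipy.stats import norm

means, sds = [-1.0, 0.0, 3.0], [np.sqrt(0.1), 1.0, np.sqrt(10.0)]
pi = lambda x: sum(norm.pdf(x, m, s) for m, s in zip(means, sds)) / 3.0

kernels_sample = [lambda x, m=m, s=s: rng.normal(m, s) for m, s in zip(means, sds)]
kernels_pdf = [lambda x, y, m=m, s=s: norm.pdf(y, m, s) for m, s in zip(means, sds)]

x = rng.normal(0.0, 3.0, size=10_000)       # N = 10^4 here vs 10^5 in the slides
alpha = np.full(3, 1.0 / 3.0)
for t in range(10):
    x, alpha, w, x_tilde = pmc_sweep(x, alpha, pi, kernels_sample,
                                     kernels_pdf, Psi_kl1)
    print(t + 1, np.round(alpha, 4))
```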

t    $\alpha_1^t$   $\alpha_2^t$   $\alpha_3^t$
1    0.0500000   0.0500000   0.9000000
2    0.1815457   0.1002131   0.7182413
3    0.3142235   0.1490543   0.5367222
4    0.3632511   0.1910549   0.4456940
5    0.3674450   0.2301243   0.4024307
6    0.3529385   0.2858416   0.3612199
7    0.3405808   0.3119567   0.3474625
8    0.3414635   0.3208192   0.3377173
9    0.3356331   0.3295758   0.3347911
10   0.3357052   0.3312788   0.3330160

Table 1: Evolution of the proposal mixture weights over the PMC iterations.

A second toy example

Target: $\Pi = \mathcal{N}(0, 1)$.

3 Gaussian random walk proposals: $q_1(x, x') = f_{\mathcal{N}(x, 0.1)}(x')$, $q_2(x, x') = f_{\mathcal{N}(x, 2)}(x')$ and $q_3(x, x') = f_{\mathcal{N}(x, 10)}(x')$.

Use of the Rao-Blackwellized 3-kernel algorithm with $N = 100{,}000$.

t    $\alpha_1^t$   $\alpha_2^t$   $\alpha_3^t$
1    0.33333   0.33333   0.33333
2    0.24415   0.43145   0.32443
3    0.19525   0.52445   0.28031
4    0.10725   0.72955   0.16324
5    0.08223   0.83092   0.08691
6    0.06155   0.88355   0.05490
7    0.04255   0.92950   0.02795
8    0.03790   0.93760   0.02450
9    0.03130   0.94505   0.02365
10   0.03460   0.94875   0.01665

Table 2: Evolution of the proposal mixture weights over the PMC iterations.

A second Kullback-Leibler criterion in the unnormalised case

For $\alpha \in \mathcal{S}$, let us denote by $KL_2(\alpha)$ the Kullback-Leibler divergence between $|h(x)|\,\pi(x)$ and the mixture:
$$KL_2(\alpha) = \int \log\left[\frac{\pi(x)\,|h(x')|\,\pi(x')}{\pi(x)\sum_{d=1}^D \alpha_d\, q_d(x, x')}\right] \Pi \otimes \Pi(dx, dx').$$

Second Kullback-Leibler divergence criterion: the best mixture of transition kernels is the one that minimizes $KL_2(\alpha)$.

Theorem 4. Under (A1), for the unnormalised case, the updates $\Psi_d$ of the mixture weights given by
$$\alpha_d^{t+1,N} = \sum_{i=1}^N \omega_{i,t}\,|h(\tilde{x}_{i,t})|\, \mathbb{I}_d(K_{i,t}) \Big/ \sum_{i=1}^N \omega_{i,t}\,|h(\tilde{x}_{i,t})|$$
guarantee a systematic decrease of $KL_2$, a long-term run of the algorithm providing the mixture that is $KL_2$-closest to the target.

Asymptotic variance criterion

For $\alpha \in \mathcal{S}$, let us define
$$\sigma^2_1(\alpha) = \Pi \otimes \Pi\left(\big(h(x') - \Pi(h)\big)^2\, \frac{\pi(x')}{\sum_{d=1}^D \alpha_d\, q_d(x, x')}\right) \quad \text{(self-normalised case)}$$
and
$$\sigma^2_2(\alpha) = \Pi \otimes \Pi\left(\frac{\pi(x')}{\sum_{d=1}^D \alpha_d\, q_d(x, x')}\left(h(x') - \Pi(h)\,\frac{\sum_{d=1}^D \alpha_d\, q_d(x, x')}{\pi(x')}\right)^2\right) \quad \text{(unnormalised case)}.$$

Asymptotic variance criterion: the best mixture of transition kernels is the one that minimizes $\sigma^2_1(\alpha)$ or $\sigma^2_2(\alpha)$.

Theorem 5. Under (A1) and (A2), for the unnormalised case, the updates $\Psi_d$ of the mixture weights given by
$$\alpha_d^{t+1,N} = \sum_{i=1}^N \omega_{i,t}^2\, h^2(\tilde{x}_{i,t})\, \mathbb{I}_d(K_{i,t}) \Big/ \sum_{i=1}^N \omega_{i,t}^2\, h^2(\tilde{x}_{i,t})$$
guarantee a systematic decrease of $\sigma^2_2$, a long-term run of the algorithm providing the mixture that is $\sigma^2_2$-closest to the target.

Theorem 6. Under (A1) and (A2), for the self-normalised case, the updates $\Psi_d$ of the mixture weights given by
$$\alpha_d^{t+1,N} = \frac{\displaystyle\sum_{i=1}^N \omega_{i,t}^2 \left(h(\tilde{x}_{i,t}) - \sum_{j=1}^N \bar\omega_{j,t}\, h(\tilde{x}_{j,t})\right)^2 \mathbb{I}_d(K_{i,t})}{\displaystyle\sum_{i=1}^N \omega_{i,t}^2 \left(h(\tilde{x}_{i,t}) - \sum_{j=1}^N \bar\omega_{j,t}\, h(\tilde{x}_{j,t})\right)^2}$$
guarantee a systematic decrease of $\sigma^2_1$, a long-term run of the algorithm providing the mixture that is $\sigma^2_1$-closest to the target.

A final example

Target $\mathcal{N}(0, 1)$ and $h(x) = x$.

In this case, it is well known that the optimal importance distribution minimizing the variance of the unnormalised importance sampling estimator is
$$q^\star(x) \propto |x| \exp(-x^2/2).$$

We choose $q^\star$ as one of $D = 3$ independent kernels, the other kernels being the $\mathcal{N}(0, 1)$ and the $\mathcal{C}(0, 1)$ (Cauchy) distributions. $N = 100{,}000$ and $T = 20$.

t    $\hat\pi^{PMC}_{t,N}(x)$   $\alpha_1^{t+1,N}$   $\alpha_2^{t+1,N}$   $\alpha_3^{t+1,N}$   $\sigma_{2,t}$
1     0.0000   0.1000   0.8000   0.1000   0.9524
2    -0.0030   0.1144   0.7116   0.1740   0.9192
3    -0.0017   0.1191   0.6033   0.2776   0.8912
4    -0.0006   0.1189   0.4733   0.4078   0.8608
5    -0.0035   0.1084   0.3545   0.5371   0.8394
10    0.0065   0.0519   0.0622   0.8859   0.8016
15    0.0033   0.0305   0.0136   0.9559   0.7987
20   -0.0042   0.0204   0.0041   0.9755   0.7984

Figure 3: Estimation of $\mathbb{E}[X] = 0$ for a normal variate: decrease of the standard deviation $\sigma_{2,t}$ to its optimal value over the PMC iterations.