Adaptive Monte Carlo methods
Jean-Michel Marin, Projet Select, INRIA Futurs, Université Paris-Sud

Joint work with Randal Douc (École Polytechnique), Arnaud Guillin (Université de Marseille) and Christian Robert (Université Paris Dauphine)

Séminaire Montpellier (17 décembre 2006)
Introduction

Let $(\mathsf{X}, \mathcal{B}(\mathsf{X}), \Pi)$ be a probability space.

(A1) $\Pi \ll \mu$ and $\Pi(dx) = \pi(x)\,\mu(dx)$.

(A2) $\pi$ is known only up to a normalizing constant:
$$\pi(x) = \frac{\tilde\pi(x)}{\int \tilde\pi(x)\,\mu(dx)}\,;$$
$\tilde\pi$ is known, but the calculation of $\int \tilde\pi(x)\,\mu(dx) < \infty$ is intractable.
Problem: for some $\Pi$-measurable functions $h$, approximate
$$\Pi(h) = \int h(x)\,\pi(x)\,\mu(dx) = \frac{\int h(x)\,\tilde\pi(x)\,\mu(dx)}{\int \tilde\pi(x)\,\mu(dx)}.$$

(A3) The calculation of $\int h(x)\,\tilde\pi(x)\,\mu(dx)$ is intractable.

More concisely, $X \sim \Pi$ and we would like to approximate (taking $\mu(dx) = dx$)
$$\Pi(h) = \mathbb{E}_\Pi(h(X)) = \frac{\int h(x)\,\tilde\pi(x)\,dx}{\int \tilde\pi(x)\,dx}.$$
Applications in Bayesian inference, where the target distribution is the posterior distribution of the parameter of interest:
$$\pi(\theta \mid x) \propto f(x \mid \theta)\,\pi_1(\theta),$$
where $f(x \mid \theta)$ ($\theta \in \Theta$) is the likelihood and $\pi_1(\theta)$ the prior distribution of $\theta$. A Bayesian estimator of $\theta$ is the posterior mean of $\theta$, that is
$$\mathbb{E}^\pi(\theta \mid x) = \frac{\int \theta\, f(x \mid \theta)\,\pi_1(\theta)\,d\theta}{\int f(x \mid \theta)\,\pi_1(\theta)\,d\theta}.$$
Monte Carlo methods (MC) = generate an iid sample $x_1,\ldots,x_N$ from $\Pi$ and estimate $\mathbb{E}_\Pi(h(X))$ by
$$\hat\Pi_N^{MC}(h) = N^{-1} \sum_{i=1}^N h(x_i).$$

(i) $\hat\Pi_N^{MC}(h) \xrightarrow{a.s.} \mathbb{E}_\Pi(h(X))$;

(ii) if $\mathbb{E}_\Pi(h^2(X)) < \infty$,
$$\sqrt{N}\left(\hat\Pi_N^{MC}(h) - \mathbb{E}_\Pi(h(X))\right) \xrightarrow{\mathcal{L}} \mathcal{N}\!\left(0,\, \mathbb{V}_\Pi(h(X))\right).$$

Often impossible to simulate directly from $\Pi$!
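The plain Monte Carlo estimator above can be sketched in a few lines. As an illustration not taken from the talk, the target is $\mathcal{N}(0,1)$ and $h(x) = x^2$, so the true value $\mathbb{E}_\Pi(h(X)) = 1$ is known:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_estimate(h, n):
    """Plain MC: average h over an iid N(0,1) sample of size n."""
    x = rng.standard_normal(n)
    return h(x).mean()

est = mc_estimate(lambda x: x**2, 100_000)
print(est)  # close to E[X^2] = 1, with O(1/sqrt(N)) error
```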
Markov Chain Monte Carlo methods (MCMC) = generate $x^{(1)},\ldots,x^{(T)}$ from a Markov chain $(x^{(t)})_{t \in \mathbb{N}}$ with stationary distribution $\Pi$ and estimate $\mathbb{E}_\Pi(h(X))$ by
$$\hat\Pi_N^{MCMC}(h) = N^{-1} \sum_{t=T-N+1}^{T} h\!\left(x^{(t)}\right).$$

Convergence to the stationary distribution could be very slow!
Metropolis-Hastings algorithms

Metropolis-Hastings algorithms are generic (or off-the-shelf) MCMC algorithms, compared with the Gibbs sampler, in the sense that they can be tuned with a much wider range of possibilities.
If the target distribution has density $\pi$, the generic Metropolis-Hastings algorithm is:

Initialization: choose an arbitrary $x^{(0)}$.

Iteration $t$:
1. Given $x^{(t-1)}$, generate $\tilde x \sim q(x^{(t-1)}, \tilde x)$.
2. Calculate
$$\rho(x^{(t-1)}, \tilde x) = \min\left( \frac{\pi(\tilde x)\,/\,q(x^{(t-1)}, \tilde x)}{\pi(x^{(t-1)})\,/\,q(\tilde x, x^{(t-1)})},\ 1 \right).$$
3. With probability $\rho(x^{(t-1)}, \tilde x)$, accept $\tilde x$ and set $x^{(t)} = \tilde x$; otherwise reject $\tilde x$ and set $x^{(t)} = x^{(t-1)}$.
This algorithm only needs to simulate from $q$, which we can choose almost arbitrarily, as long as $q$ is capable of reaching all areas of positive probability under $\pi$. While the theoretical convergence guarantees are quite general, the choice of $q$ remains paramount in practice.
The random walk sampler

A random walk proposal has a symmetric transition density $q(x, y) = q_{RW}(y - x)$, where $q_{RW}(x) = q_{RW}(-x)$. In this case the acceptance probability $\rho(x, y)$ reduces to the simpler form
$$\rho(x, y) = \min\left( 1,\ \frac{\pi(y)}{\pi(x)} \right).$$
Example: consider the standard normal distribution $\mathcal{N}(0, 1)$ as a target. If we use a random walk Metropolis-Hastings algorithm with a normal random walk, i.e.
$$\tilde x \mid x^{(t-1)} \sim \mathcal{N}(x^{(t-1)}, \sigma^2), \qquad q_{RW}(\tilde x - x^{(t-1)}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2}\,(\tilde x - x^{(t-1)})^2 \right),$$
the performance of the sampler depends on the value of $\sigma^2$.
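A minimal random-walk Metropolis-Hastings sketch of this example (target $\mathcal{N}(0,1)$, Gaussian random walk); the function name and the choice $\sigma = 2$ are illustrative, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)

def rw_metropolis(log_target, x0, sigma, n_iter):
    """Random-walk Metropolis: the proposal is symmetric, so the
    acceptance probability reduces to min(1, pi(y)/pi(x))."""
    x = x0
    chain = np.empty(n_iter)
    accepted = 0
    for t in range(n_iter):
        y = x + sigma * rng.standard_normal()
        # accept with probability min(1, pi(y)/pi(x)), computed in log scale
        if np.log(rng.random()) < log_target(y) - log_target(x):
            x, accepted = y, accepted + 1
        chain[t] = x
    return chain, accepted / n_iter

log_pi = lambda x: -0.5 * x**2            # standard normal, up to a constant
chain, acc = rw_metropolis(log_pi, 0.0, sigma=2.0, n_iter=20_000)
print(acc)                                # moderate acceptance for this scale
print(chain[5000:].mean(), chain[5000:].var())  # near 0 and 1
```

A too-small $\sigma$ drives the acceptance rate toward 1 but the chain barely moves; a too-large $\sigma$ drives it toward 0 and the chain sticks, which is what Figures 1 and 2 illustrate.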
Figure 1: (left) $\sigma^2 = 10^{4}$ and (right) $\sigma^2 = 10^{3}$; top: sequence of 10,000 iterations subsampled at every 10th iteration; middle: histogram of the last 2,000 iterations compared with the target density; bottom: empirical autocorrelations.
Figure 2: $\sigma^2 = 2$; top: sequence of 10,000 iterations subsampled at every 10th iteration; middle: histogram of the last 2,000 iterations compared with the target density; bottom: empirical autocorrelations.
Importance sampling

Let $Q$ be a probability distribution on $(\mathsf{X}, \mathcal{B}(\mathsf{X}))$. Suppose that $Q(dx) = q(x)\,dx$ and that $q(x) = 0 \Rightarrow \pi(x) = 0$. Then
$$\Pi(h) = \mathbb{E}_\Pi(h(X)) = \int h(x)\,\frac{\pi(x)}{q(x)}\,q(x)\,dx = \mathbb{E}_Q\!\left( \frac{\pi(X)}{q(X)}\, h(X) \right) = Q\!\left( \frac{\pi}{q}\, h \right).$$

= Generate an iid sample $x_1,\ldots,x_N$ from $Q$, called the proposal distribution, and estimate $\Pi(h)$ by
$$\hat\Pi_{Q,N}^{IS}(h) = N^{-1} \sum_{i=1}^N \frac{\pi(x_i)}{q(x_i)}\, h(x_i).$$
(i) $\hat\Pi_{Q,N}^{IS}(h) \xrightarrow{a.s.} \mathbb{E}_\Pi(h(X))$;

(ii) if $\mathbb{E}_Q\!\left( \frac{\pi^2(X)}{q^2(X)}\, h^2(X) \right) < \infty$,
$$\sqrt{N}\left(\hat\Pi_{Q,N}^{IS}(h) - \mathbb{E}_\Pi(h(X))\right) \xrightarrow{\mathcal{L}} \mathcal{N}\!\left( 0,\ \mathbb{V}_Q\!\left( \frac{\pi(X)}{q(X)}\, h(X) \right) \right).$$

For many $h$, a sufficient condition for $\mathbb{E}_Q\!\left( \frac{\pi^2(X)}{q^2(X)}\, h^2(X) \right) < \infty$ is that $\pi/q$ is bounded.

When the normalizing constant of $\Pi$ is unknown, it is not possible to use $\hat\Pi_{Q,N}^{IS}$. It is then natural to use the self-normalized version of the IS estimator,
$$\hat\Pi_{Q,N}^{SNIS}(h) = \left( \sum_{i=1}^N \frac{\tilde\pi(x_i)}{q(x_i)} \right)^{-1} \sum_{i=1}^N \frac{\tilde\pi(x_i)}{q(x_i)}\, h(x_i).$$
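The IS and SNIS estimators can be sketched as follows; the target $\mathcal{N}(0,1)$, the proposal $Q = \mathcal{N}(0, 2^2)$ (heavier tails, so $\pi/q$ is bounded) and the test function $h(x) = x^2$ are illustrative choices, not taken from the talk:

```python
import numpy as np

rng = np.random.default_rng(2)

pi = lambda x: np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)  # normalised target
pi_tilde = lambda x: np.exp(-0.5 * x**2)                 # unnormalised version
q_sd = 2.0
q = lambda x: np.exp(-0.5 * (x / q_sd)**2) / (q_sd * np.sqrt(2 * np.pi))

n = 200_000
x = q_sd * rng.standard_normal(n)        # iid sample from the proposal Q
h = x**2                                 # E_Pi[h(X)] = 1

is_est = np.mean(pi(x) / q(x) * h)       # plain IS: needs the normalised pi
w = pi_tilde(x) / q(x)                   # only pi_tilde available in practice
snis_est = np.sum(w * h) / np.sum(w)     # self-normalised IS estimator
print(is_est, snis_est)                  # both close to 1
```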
(i) $\hat\Pi_{Q,N}^{SNIS}(h) \xrightarrow{a.s.} \mathbb{E}_\Pi(h(X))$;

(ii) if $\mathbb{E}_Q\!\left( \frac{\pi^2(X)}{q^2(X)}\left(1 + h^2(X)\right) \right) < \infty$,
$$\sqrt{N}\left(\hat\Pi_{Q,N}^{SNIS}(h) - \mathbb{E}_\Pi(h(X))\right) \xrightarrow{\mathcal{L}} \mathcal{N}\!\left( 0,\ \mathbb{V}_Q\!\left( \frac{\pi(X)}{q(X)}\left(h(X) - \Pi(h)\right) \right) \right).$$

The quality of the SNIS approximation depends on the choice of the proposal distribution $Q$.
It is well known that the importance distribution
$$q^\star(x) = |h(x)|\,\pi(x) \Big/ \int |h(y)|\,\pi(y)\,dy$$
minimizes the variance of $\hat\Pi_{Q,N}^{IS}(h)$. It produces a zero-variance estimator when $h$ is either positive or negative (indeed, in both cases, $\hat\Pi_{Q^\star,N}^{IS}(h) = \mathbb{E}_\Pi(h(X))$ exactly). $q^\star$ cannot be used in practice because it depends on the integral $\int |h(y)|\,\pi(y)\,dy$. This result is thus rather understood as providing a goal for choosing an importance function $q$ tailored to the approximation of $\mathbb{E}_\Pi(h(X))$.
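A quick numerical check of the zero-variance property (an illustration, not from the slides): take the target $\mathcal{N}(0,1)$ and $h(x) = |x| \ge 0$; then $q^\star(x) = |x|\,\varphi(x)/\mathbb{E}|X|$, which can be simulated as a Rayleigh(1) magnitude with a uniform random sign, and every weighted term $h(x)\,\pi(x)/q^\star(x)$ equals $\mathbb{E}|X| = \sqrt{2/\pi}$:

```python
import numpy as np

rng = np.random.default_rng(3)

n = 10_000
r = rng.rayleigh(scale=1.0, size=n)        # |x| ~ Rayleigh(1), density x*exp(-x^2/2)
s = rng.choice([-1.0, 1.0], size=n)        # uniform random sign
x = s * r                                  # x ~ q*(x) = |x| phi(x) / E|X|

phi = lambda x: np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
e_abs = np.sqrt(2 / np.pi)                 # E|X| under N(0,1)
q_star = lambda x: np.abs(x) * phi(x) / e_abs

terms = np.abs(x) * phi(x) / q_star(x)     # h(x) * pi(x) / q*(x)
print(terms.mean(), terms.var())           # mean = sqrt(2/pi), variance ~ 0
```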
Similarly,
$$q^\star(x) = |h(x) - \Pi(h)|\,\pi(x) \Big/ \int |h(y) - \mathbb{E}_\Pi(h(X))|\,\pi(y)\,dy$$
minimizes the asymptotic variance of $\hat\Pi_{Q,N}^{SNIS}(h)$. This second optimum is not available either, because it still depends on $\mathbb{E}_\Pi(h(X))$. There is little in the literature besides general recommendations that the support of $q$ should be the support of $|h(x)|\,\pi(x)$ or of $|h(y) - \Pi(h)|\,\pi(y)$, or yet that the tails of $q$ should be at least as thick as those of $|h(x)|\,\pi(x)$.
PMC algorithms

The notion of importance sampling can actually be greatly generalized to encompass much more adaptive and local schemes than previously thought. The extension is to learn from experience, that is, to build an importance sampling function based on the performance of earlier importance sampling proposals. By introducing a temporal dimension to the selection of the importance function, an adaptive perspective can be achieved at little cost, for a potentially large gain in efficiency.
D-kernel PMC algorithm

Let $Q_{i,t}$ be the proposal distribution at iteration $t$ of the algorithm for particle $x_{i,t}$. Obviously, the quasi-total freedom in the construction of the $Q_{i,t}$'s has drawbacks, namely that some proposals do not necessarily lead to improvements in terms of variance reduction. We now restrict the family of proposals from which to select the new $Q_{i,t}$'s to mixtures of fixed proposals.
We assume from now on that we use in parallel $D$ fixed kernels $Q_d(\cdot, \cdot)$ with densities $q_d$, and that the proposal is a mixture of those kernels,
$$q_{i,t}(x) = \sum_{d=1}^D \alpha_d^{t,N}\, q_d(x_{i,t-1}, x), \qquad \sum_d \alpha_d^{t,N} = 1,$$
where the weights $\alpha_d^{t,N} > 0$ can be modified at each iteration. The amount of adaptivity we allow in this version of PMC is thus restricted to a possible modification of the weights $\alpha_d^{t,N}$.
The importance weight associated with this mixture proposal is
$$\pi(\tilde x_{i,t}) \Big/ \sum_{d=1}^D \alpha_d^{t,N}\, q_d(x_{i,t-1}, \tilde x_{i,t}),$$
while simulation from $q_{i,t}$ can be decomposed into the two usual mixture steps: first pick the component $d$, then simulate from the corresponding kernel $Q_d$.
Generic D-kernel PMC algorithm

At time 0, produce the sample $(\tilde x_{i,0})_{1 \le i \le N}$ and set $\alpha_d^{1,N} = 1/D$ ($1 \le d \le D$).

At time $1 \le t \le T$:

a) Conditionally on the $\alpha_d^{t,N}$'s, generate $(K_{i,t})_{1 \le i \le N} \overset{iid}{\sim} \mathcal{M}\!\left(1, (\alpha_d^{t,N})_{1 \le d \le D}\right)$;

b) Conditionally on $(x_{i,t-1}, K_{i,t})_{1 \le i \le N}$, generate independently $(\tilde x_{i,t})_{1 \le i \le N} \sim Q_{K_{i,t}}(x_{i,t-1}, \cdot)$ and set
$$\omega_{i,t} = \pi(\tilde x_{i,t}) \Big/ \sum_{d=1}^D \alpha_d^{t,N}\, q_d(x_{i,t-1}, \tilde x_{i,t});$$

c) Conditionally on $(x_{i,t-1}, K_{i,t}, \tilde x_{i,t})_{1 \le i \le N}$, generate $(J_{i,t})_{1 \le i \le N} \overset{iid}{\sim} \mathcal{M}\!\left(1, (\bar\omega_{i,t})_{1 \le i \le N}\right)$ (with $\bar\omega_{i,t}$ the normalised weights), set $x_{i,t} = \tilde x_{J_{i,t},t}$ and
$$\alpha_d^{t+1,N} = \Psi_d\!\left( (x_{i,t-1}, \tilde x_{i,t}, K_{i,t})_{1 \le i \le N} \right) \quad \text{such that} \quad \sum_{d=1}^D \alpha_d^{t+1,N} = 1.$$
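A compact sketch of one run of this algorithm, assuming $D$ Gaussian random-walk kernels, a standard normal target, and the Rao-Blackwellised weight update $\alpha_d^{t+1,N} = \sum_i \bar\omega_{i,t}\,\mathbb{I}_d(K_{i,t})$ of Theorem 3 below; all function names, scales and sizes are illustrative, not from the talk's implementation:

```python
import numpy as np

rng = np.random.default_rng(4)

def pmc_dkernel(log_pi_tilde, scales, n_particles, n_iter):
    """Generic D-kernel PMC with Gaussian random-walk kernels (a sketch)."""
    D = len(scales)
    alpha = np.full(D, 1.0 / D)                  # alpha_d^{1,N} = 1/D
    x = rng.standard_normal(n_particles)         # initial particle cloud
    for _ in range(n_iter):
        # a) pick a kernel index K_i for each particle
        k = rng.choice(D, size=n_particles, p=alpha)
        # b) propose through the selected kernel ...
        x_new = x + scales[k] * rng.standard_normal(n_particles)
        # ... and weight by pi / (mixture density), then normalise
        q_mix = sum(alpha[d] * np.exp(-0.5 * ((x_new - x) / s) ** 2)
                    / (s * np.sqrt(2 * np.pi)) for d, s in enumerate(scales))
        w = np.exp(log_pi_tilde(x_new)) / q_mix
        w /= w.sum()
        # update the mixture weights: alpha_d <- sum_i w_i 1{K_i = d}
        alpha = np.array([w[k == d].sum() for d in range(D)])
        alpha = np.clip(alpha, 1e-3, None)       # keep every kernel alive
        alpha /= alpha.sum()
        # c) multinomial resampling of the particles
        x = x_new[rng.choice(n_particles, size=n_particles, p=w)]
    return x, alpha

log_pi = lambda x: -0.5 * x**2                   # N(0,1) up to a constant
x, alpha = pmc_dkernel(log_pi, scales=np.array([0.1, 2.0, 10.0]),
                       n_particles=5_000, n_iter=10)
print(alpha)               # adapted mixture weights
print(x.mean(), x.var())   # the resampled cloud approximates N(0, 1)
```

With these scales the weights typically concentrate on the better-calibrated kernel, mirroring the second toy example later in the talk.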
$\Psi_d$ ($1 \le d \le D$) denotes an update function that depends upon the past iteration.

(A1) $\forall d \in \{1,\ldots,D\}$, $\Pi \otimes \Pi\{(x, x') : q_d(x, x') = 0\} = 0$ (the individual kernel importance weights are almost surely finite).
Theorem 1. Under (A1) and (A2), for any function $h \in L^1_\pi$ and for all $t \ge 0$, both the unnormalised and the self-normalized PMC estimators are convergent:
$$\hat\Pi_{t,N}^{PMC}(h) = \frac{1}{N} \sum_{i=1}^N \omega_{i,t}\, h(\tilde x_{i,t}) \xrightarrow[N \to \infty]{P} \mathbb{E}_\Pi(h(X)),$$
$$\hat\Pi_{t,N}^{SNPMC}(h) = \left( \sum_{i=1}^N \omega_{i,t} \right)^{-1} \sum_{i=1}^N \omega_{i,t}\, h(\tilde x_{i,t}) \xrightarrow[N \to \infty]{P} \mathbb{E}_\Pi(h(X)).$$

As noted earlier, the unnormalised PMC estimator can only be used when $\pi$ is completely known.
(A2) $\Pi \otimes \Pi\!\left( (1 + h^2(x'))\, \dfrac{\pi(x')}{q_d(x, x')} \right) < \infty$ for all $d \in \{1,\ldots,D\}$ (integrability condition).

Theorem 2. Under (A1) and (A2), if for all $t \ge 1$ and $1 \le d \le D$, $\alpha_d^{t,N} \xrightarrow[N \to \infty]{P} \alpha_d^t > 0$, then both
$$\sqrt{N}\left( \left(\sum_{i=1}^N \omega_{i,t}\right)^{-1} \sum_{i=1}^N \omega_{i,t}\, h(\tilde x_{i,t}) - \mathbb{E}_\Pi(h(X)) \right) \quad \text{and} \quad \sqrt{N}\left( \frac{1}{N} \sum_{i=1}^N \omega_{i,t}\, h(\tilde x_{i,t}) - \mathbb{E}_\Pi(h(X)) \right)$$
converge in distribution, as $N$ goes to infinity, to centered normal distributions with the variances given below.
$$\sigma_{1,t}^2 = \Pi \otimes \Pi\left( \left( h(x') - \mathbb{E}_\Pi(h(X)) \right)^2 \frac{\pi(x')}{\sum_{d=1}^D \alpha_d^t\, q_d(x, x')} \right)$$
and
$$\sigma_{2,t}^2 = \Pi \otimes \Pi\left( \left( \frac{\pi(x')}{\sum_{d=1}^D \alpha_d^t\, q_d(x, x')}\, h(x') - \mathbb{E}_\Pi(h(X)) \right)^2 \frac{\sum_{d=1}^D \alpha_d^t\, q_d(x, x')}{\pi(x')} \right).$$
A first Kullback-Leibler criterion

Let
$$\mathcal{S} = \left\{ \alpha = (\alpha_1,\ldots,\alpha_D)\ ;\ \forall d \in \{1,\ldots,D\},\ \alpha_d \ge 0 \ \text{and} \ \sum_{d=1}^D \alpha_d = 1 \right\}.$$
For $\alpha \in \mathcal{S}$, let us denote by $KL_1(\alpha)$ the Kullback-Leibler divergence between the mixture and the target distribution $\Pi$:
$$KL_1(\alpha) = \int \log\left( \frac{\pi(x)\,\pi(x')}{\pi(x)\, \sum_{d=1}^D \alpha_d\, q_d(x, x')} \right) \Pi \otimes \Pi(dx, dx').$$

First Kullback-Leibler divergence criterion: the best mixture of transition kernels is the one that minimizes $KL_1(\alpha)$.
Theorem 3. Under (A1) and (A2), for the unnormalised and the self-normalised cases, the updates $\Psi_d$ of the mixture weights given by
$$\alpha_d^{t+1,N} = \sum_{i=1}^N \bar\omega_{i,t}\, \mathbb{I}_d(K_{i,t})$$
(with $\bar\omega_{i,t}$ the normalised importance weights) guarantee a systematic decrease of $KL_1$, a long-term run of the algorithm providing the mixture that is $KL_1$-closest to the target.
A first toy example

Target: $\pi(x) = \tfrac{1}{3} f_{\mathcal{N}(-1,\, 0.1)}(x) + \tfrac{1}{3} f_{\mathcal{N}(0,\, 1)}(x) + \tfrac{1}{3} f_{\mathcal{N}(3,\, 10)}(x)$.

Three proposal distributions: $\mathcal{N}(-1, 0.1)$, $\mathcal{N}(0, 1)$ and $\mathcal{N}(3, 10)$ (simpler than transition kernels).

Use of the Rao-Blackwellised 3-kernel algorithm with $N = 100{,}000$.
Table 1: Evolution of the proposal mixture weights over the PMC iterations.
A second toy example

Target: $\Pi = \mathcal{N}(0, 1)$.

Three Gaussian random walk proposals: $q_1(x, x') = f_{\mathcal{N}(x,\, 0.1)}(x')$, $q_2(x, x') = f_{\mathcal{N}(x,\, 2)}(x')$ and $q_3(x, x') = f_{\mathcal{N}(x,\, 10)}(x')$.

Use of the Rao-Blackwellised 3-kernel algorithm with $N = 100{,}000$.
Table 2: Evolution of the proposal mixture weights over the PMC iterations.
A second Kullback-Leibler criterion in the unnormalised case

For $\alpha \in \mathcal{S}$, let us denote by $KL_2(\alpha)$ the Kullback-Leibler divergence between the mixture and $|h(x)|\,\pi(x)$:
$$KL_2(\alpha) = \int \log\left( \frac{\pi(x)\, |h(x')|\, \pi(x')}{\pi(x)\, \sum_{d=1}^D \alpha_d\, q_d(x, x')} \right) \Pi \otimes \Pi(dx, dx').$$

Second Kullback-Leibler divergence criterion: the best mixture of transition kernels is the one that minimizes $KL_2(\alpha)$.
Theorem 4. Under (A1), for the unnormalised case, the updates $\Psi_d$ of the mixture weights given by
$$\alpha_d^{t+1,N} = \sum_{i=1}^N \omega_{i,t}\, |h(\tilde x_{i,t})|\, \mathbb{I}_d(K_{i,t}) \Big/ \sum_{i=1}^N \omega_{i,t}\, |h(\tilde x_{i,t})|$$
guarantee a systematic decrease of $KL_2$, a long-term run of the algorithm providing the mixture that is $KL_2$-closest to the target.
Asymptotic variance criterion

For $\alpha \in \mathcal{S}$, let us define
$$\sigma_1^2(\alpha) = \Pi \otimes \Pi\left( \left( h(x') - \Pi(h) \right)^2 \frac{\pi(x')}{\sum_{d=1}^D \alpha_d\, q_d(x, x')} \right) \quad \text{(self-normalised case)}$$
and
$$\sigma_2^2(\alpha) = \Pi \otimes \Pi\left( \left( \frac{\pi(x')}{\sum_{d=1}^D \alpha_d\, q_d(x, x')}\, h(x') - \Pi(h) \right)^2 \frac{\sum_{d=1}^D \alpha_d\, q_d(x, x')}{\pi(x')} \right) \quad \text{(unnormalised case)}.$$

Asymptotic variance criterion: the best mixture of transition kernels is the one that minimizes $\sigma_1^2(\alpha)$ or $\sigma_2^2(\alpha)$.
Theorem 5. Under (A1) and (A2), for the unnormalised case, the updates $\Psi_d$ of the mixture weights given by
$$\alpha_d^{t+1,N} = \sum_{i=1}^N \omega_{i,t}^2\, h^2(\tilde x_{i,t})\, \mathbb{I}_d(K_{i,t}) \Big/ \sum_{i=1}^N \omega_{i,t}^2\, h^2(\tilde x_{i,t})$$
guarantee a systematic decrease of $\sigma_2^2$, a long-term run of the algorithm providing the mixture that minimizes $\sigma_2^2$.
Theorem 6. Under (A1) and (A2), for the self-normalised case, the updates $\Psi_d$ of the mixture weights given by
$$\alpha_d^{t+1,N} = \frac{\sum_{i=1}^N \omega_{i,t}^2 \left( h(\tilde x_{i,t}) - \sum_{j=1}^N \bar\omega_{j,t}\, h(\tilde x_{j,t}) \right)^2 \mathbb{I}_d(K_{i,t})}{\sum_{i=1}^N \omega_{i,t}^2 \left( h(\tilde x_{i,t}) - \sum_{j=1}^N \bar\omega_{j,t}\, h(\tilde x_{j,t}) \right)^2}$$
guarantee a systematic decrease of $\sigma_1^2$, a long-term run of the algorithm providing the mixture that minimizes $\sigma_1^2$.
A final example

Target: $\mathcal{N}(0, 1)$ and $h(x) = x$. In this case, it is well known that the optimal importance distribution minimising the variance of the unnormalised importance sampling estimator is
$$g^\star(x) \propto |x| \exp(-x^2/2).$$
We choose $g^\star$ as one of $D = 3$ independent kernels, the other kernels being the $\mathcal{N}(0, 1)$ and the $\mathcal{C}(0, 1)$ (Cauchy) distributions. $N = 100{,}000$ and $T = 20$.
Table: evolution over the iterations $t$ of the PMC estimate $\hat\pi_{t,N}^{PMC}(x)$, the mixture weights $\alpha_1^{t+1,N}$, $\alpha_2^{t+1,N}$, $\alpha_3^{t+1,N}$ and the standard deviation $\sigma_{2,t}$.
Figure 3: Estimation of $\mathbb{E}[X] = 0$ for a normal variate: decrease of the standard deviation to its optimal value.
6 Dependent data The AR(p) model The MA(q) model Hidden Markov models Dependent data Dependent data Huge portion of real-life data involving dependent datapoints Example (Capture-recapture) capture histories
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is
More informationGradient-based Monte Carlo sampling methods
Gradient-based Monte Carlo sampling methods Johannes von Lindheim 31. May 016 Abstract Notes for a 90-minute presentation on gradient-based Monte Carlo sampling methods for the Uncertainty Quantification
More informationStat 535 C - Statistical Computing & Monte Carlo Methods. Arnaud Doucet.
Stat 535 C - Statistical Computing & Monte Carlo Methods Arnaud Doucet Email: arnaud@cs.ubc.ca 1 1.1 Outline Introduction to Markov chain Monte Carlo The Gibbs Sampler Examples Overview of the Lecture
More informationStatistical Machine Learning Lecture 8: Markov Chain Monte Carlo Sampling
1 / 27 Statistical Machine Learning Lecture 8: Markov Chain Monte Carlo Sampling Melih Kandemir Özyeğin University, İstanbul, Turkey 2 / 27 Monte Carlo Integration The big question : Evaluate E p(z) [f(z)]
More informationExpectation Propagation for Approximate Bayesian Inference
Expectation Propagation for Approximate Bayesian Inference José Miguel Hernández Lobato Universidad Autónoma de Madrid, Computer Science Department February 5, 2007 1/ 24 Bayesian Inference Inference Given
More informationST 740: Markov Chain Monte Carlo
ST 740: Markov Chain Monte Carlo Alyson Wilson Department of Statistics North Carolina State University October 14, 2012 A. Wilson (NCSU Stsatistics) MCMC October 14, 2012 1 / 20 Convergence Diagnostics:
More informationMarkov Chain Monte Carlo Methods
Markov Chain Monte Carlo Methods John Geweke University of Iowa, USA 2005 Institute on Computational Economics University of Chicago - Argonne National Laboaratories July 22, 2005 The problem p (θ, ω I)
More informationBayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence
Bayesian Inference in GLMs Frequentists typically base inferences on MLEs, asymptotic confidence limits, and log-likelihood ratio tests Bayesians base inferences on the posterior distribution of the unknowns
More informationMCMC Methods: Gibbs and Metropolis
MCMC Methods: Gibbs and Metropolis Patrick Breheny February 28 Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 1/30 Introduction As we have seen, the ability to sample from the posterior distribution
More informationThe Metropolis-Hastings Algorithm. June 8, 2012
The Metropolis-Hastings Algorithm June 8, 22 The Plan. Understand what a simulated distribution is 2. Understand why the Metropolis-Hastings algorithm works 3. Learn how to apply the Metropolis-Hastings
More informationComputer intensive statistical methods
Lecture 11 Markov Chain Monte Carlo cont. October 6, 2015 Jonas Wallin jonwal@chalmers.se Chalmers, Gothenburg university The two stage Gibbs sampler If the conditional distributions are easy to sample
More informationMachine Learning. Probabilistic KNN.
Machine Learning. Mark Girolami girolami@dcs.gla.ac.uk Department of Computing Science University of Glasgow June 21, 2007 p. 1/3 KNN is a remarkably simple algorithm with proven error-rates June 21, 2007
More informationDAG models and Markov Chain Monte Carlo methods a short overview
DAG models and Markov Chain Monte Carlo methods a short overview Søren Højsgaard Institute of Genetics and Biotechnology University of Aarhus August 18, 2008 Printed: August 18, 2008 File: DAGMC-Lecture.tex
More informationMarkov chain Monte Carlo
Markov chain Monte Carlo Feng Li feng.li@cufe.edu.cn School of Statistics and Mathematics Central University of Finance and Economics Revised on April 24, 2017 Today we are going to learn... 1 Markov Chains
More informationSpatial Statistics Chapter 4 Basics of Bayesian Inference and Computation
Spatial Statistics Chapter 4 Basics of Bayesian Inference and Computation So far we have discussed types of spatial data, some basic modeling frameworks and exploratory techniques. We have not discussed
More informationMarkov Chain Monte Carlo
1 Motivation 1.1 Bayesian Learning Markov Chain Monte Carlo Yale Chang In Bayesian learning, given data X, we make assumptions on the generative process of X by introducing hidden variables Z: p(z): prior
More informationeqr094: Hierarchical MCMC for Bayesian System Reliability
eqr094: Hierarchical MCMC for Bayesian System Reliability Alyson G. Wilson Statistical Sciences Group, Los Alamos National Laboratory P.O. Box 1663, MS F600 Los Alamos, NM 87545 USA Phone: 505-667-9167
More informationSC7/SM6 Bayes Methods HT18 Lecturer: Geoff Nicholls Lecture 2: Monte Carlo Methods Notes and Problem sheets are available at http://www.stats.ox.ac.uk/~nicholls/bayesmethods/ and via the MSc weblearn pages.
More informationCSC 2541: Bayesian Methods for Machine Learning
CSC 2541: Bayesian Methods for Machine Learning Radford M. Neal, University of Toronto, 2011 Lecture 3 More Markov Chain Monte Carlo Methods The Metropolis algorithm isn t the only way to do MCMC. We ll
More informationInference in state-space models with multiple paths from conditional SMC
Inference in state-space models with multiple paths from conditional SMC Sinan Yıldırım (Sabancı) joint work with Christophe Andrieu (Bristol), Arnaud Doucet (Oxford) and Nicolas Chopin (ENSAE) September
More informationAdvances and Applications in Perfect Sampling
and Applications in Perfect Sampling Ph.D. Dissertation Defense Ulrike Schneider advisor: Jem Corcoran May 8, 2003 Department of Applied Mathematics University of Colorado Outline Introduction (1) MCMC
More informationSampling Algorithms for Probabilistic Graphical models
Sampling Algorithms for Probabilistic Graphical models Vibhav Gogate University of Washington References: Chapter 12 of Probabilistic Graphical models: Principles and Techniques by Daphne Koller and Nir
More informationMarkov chain Monte Carlo
Markov chain Monte Carlo Markov chain Monte Carlo (MCMC) Gibbs and Metropolis Hastings Slice sampling Practical details Iain Murray http://iainmurray.net/ Reminder Need to sample large, non-standard distributions:
More informationQuantifying Uncertainty
Sai Ravela M. I. T Last Updated: Spring 2013 1 Markov Chain Monte Carlo Monte Carlo sampling made for large scale problems via Markov Chains Monte Carlo Sampling Rejection Sampling Importance Sampling
More informationPrinciples of Bayesian Inference
Principles of Bayesian Inference Sudipto Banerjee University of Minnesota July 20th, 2008 1 Bayesian Principles Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters
More informationMarkov Chain Monte Carlo
Chapter 5 Markov Chain Monte Carlo MCMC is a kind of improvement of the Monte Carlo method By sampling from a Markov chain whose stationary distribution is the desired sampling distributuion, it is possible
More informationDensity Estimation. Seungjin Choi
Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/
More informationComputer Practical: Metropolis-Hastings-based MCMC
Computer Practical: Metropolis-Hastings-based MCMC Andrea Arnold and Franz Hamilton North Carolina State University July 30, 2016 A. Arnold / F. Hamilton (NCSU) MH-based MCMC July 30, 2016 1 / 19 Markov
More informationOverlapping block proposals for latent Gaussian Markov random fields
NORGES TEKNISK-NATURVITENSKAPELIGE UNIVERSITET Overlapping block proposals for latent Gaussian Markov random fields by Ingelin Steinsland and Håvard Rue PREPRINT STATISTICS NO. 8/3 NORWEGIAN UNIVERSITY
More information