Adaptive Population Monte Carlo
1 Adaptive Population Monte Carlo
Olivier Cappé, Centre National de la Recherche Scientifique & Télécom Paris, 46 rue Barrault, Paris cedex 13, France
Recent Advances in Monte Carlo Based Inference Workshop, 30 October to 3 November 2006, Isaac Newton Institute
Joint work with Randal Douc, Arnaud Guillin, Jean-Michel Marin & Christian P. Robert
2 Outline Monte Carlo Basics Population Monte Carlo PMC for ECOSSTAT References
3 Monte Carlo Basics Monte Carlo Basics Monte Carlo, MCMC... Importance Sampling, SIR... Population Monte Carlo PMC for ECOSSTAT References
4 Monte Carlo Basics
General Purpose: given a density π, known up to a normalising constant, compute
π(h) = ∫ h(x) π(x) dx
for some functions of interest h. In the following, π denotes the normalised density, but we consider only algorithms that do not necessitate knowledge of the normalising constant.
5–6 Monte Carlo Basics: Monte Carlo, MCMC...
Monte Carlo: generate an iid sample X_1, ..., X_N from π and estimate π(h) by
π̂_N^MC(h) = 1/N ∑_{i=1}^N h(X_i)
Under π(h²) = ∫ h²(x) π(x) dx < ∞, we have
Consistency: π̂_N^MC(h) →_P π(h)
Asymptotic normality: √N (π̂_N^MC(h) − π(h)) →_D N(0, π{[h − π(h)]²})
with practical variance estimate 1/N ∑_{i=1}^N [h(X_i) − π̂_N^MC(h)]²
Caveat: it is often impossible to simulate directly from π. Possible answers include Markov chain Monte Carlo.
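The basic estimator and its practical variance estimate can be sketched as follows (the target, test function and sample size are my own illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (an assumption, not from the slides): pi = N(0, 1), h(x) = x^2,
# so the true value is pi(h) = 1 and pi{[h - pi(h)]^2} = Var(x^2) = 2.
N = 100_000
x = rng.standard_normal(N)                  # iid sample X_1, ..., X_N from pi
h = x**2
mc_estimate = h.mean()                      # hat pi_N^MC(h) = (1/N) sum h(X_i)
asym_var = np.mean((h - mc_estimate)**2)    # practical estimate of pi{[h - pi(h)]^2}
```

Dividing `asym_var` by N gives an estimate of the variance of the estimator itself.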
7–8 Monte Carlo Basics: Importance Sampling, SIR...
Importance Sampling: generate an iid sample X_1, ..., X_N from q and estimate π(h) by
π̂_N^IS(h) = ∑_{i=1}^N ω̄_i h(X_i), where ω_i = π(X_i)/q(X_i) and ω̄_i = ω_i / ∑_{j=1}^N ω_j
Under π[(1 + h²) π/q] < ∞, we have consistency and asymptotic normality with asymptotic variance π{[h − π(h)]² π/q} and practical variance estimate N ∑_{i=1}^N ω̄_i² [h(X_i) − π̂_N^IS(h)]²
No Free Lunch: finding a suitable proposal density q is a hard task (particularly in high dimension).
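A minimal self-normalised importance sampling sketch, under assumptions of my own (target N(0, 1), proposal N(0, 2²), h(x) = x², so π(h) = 1):

```python
import numpy as np

rng = np.random.default_rng(1)

# Self-normalised IS: sample from q, reweight by pi/q (normalising constants
# cancel when the weights are normalised).
N = 100_000
x = rng.normal(0.0, 2.0, size=N)                 # iid sample X_1, ..., X_N from q
log_w = -0.5 * x**2 - (-0.5 * (x / 2.0)**2 - np.log(2.0))  # log pi - log q, up to constants
w = np.exp(log_w - log_w.max())                  # unnormalised weights, overflow-safe
w_bar = w / w.sum()                              # normalised weights omega_bar_i
is_estimate = np.sum(w_bar * x**2)               # hat pi_N^IS(h)
# Practical variance estimate of the estimator: sum omega_bar_i^2 (h - est)^2
var_estimate = np.sum(w_bar**2 * (x**2 - is_estimate)**2)
```

Here the weights π/q are bounded, which is the favourable case the asymptotic-variance condition points to.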
9 Population Monte Carlo Monte Carlo Basics Population Monte Carlo The Algorithm Properties PMC for ECOSSTAT References
10–11 Population Monte Carlo: The Algorithm
Population Monte Carlo (PMC)
At time t = 0:
Generate {X_{i,0}}_{1≤i≤N} iid from q_0
Set ω_{i,0} = π(X_{i,0}) / q_0(X_{i,0}) and ω̄_{i,0} = ω_{i,0} / ∑_{j=1}^N ω_{j,0}
Generate {J_{i,0}}_{1≤i≤N} iid from M(1, (ω̄_{i,0})_{1≤i≤N}) and set X̃_{i,0} = X_{J_{i,0},0}
At time t (t = 1, ..., T):
Generate X_{i,t} independently from q_{i,t}(X̃_{i,t−1}, ·)
Set ω_{i,t} = π(X_{i,t}) / q_{i,t}(X̃_{i,t−1}, X_{i,t}) and ω̄_{i,t} = ω_{i,t} / ∑_{j=1}^N ω_{j,t}
Generate {J_{i,t}}_{1≤i≤N} iid from M(1, (ω̄_{i,t})_{1≤i≤N}) and set X̃_{i,t} = X_{J_{i,t},t}
Note that other forms of resampling could be used.
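The recursion above can be sketched in a one-dimensional toy setting (all concrete choices below, target N(0, 1), a fixed random-walk Gaussian kernel and multinomial resampling, are my own assumptions, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(2)

def log_pi(x):
    return -0.5 * x**2            # unnormalised log target, pi = N(0, 1)

N, T, sigma = 5_000, 10, 1.0
x_tilde = rng.uniform(-10.0, 10.0, size=N)        # crude initial population
for t in range(T):
    x = x_tilde + sigma * rng.standard_normal(N)  # X_{i,t} ~ q_t(x_tilde_{i,t-1}, .)
    log_q = -0.5 * ((x - x_tilde) / sigma)**2     # log q_t, up to constants
    log_w = log_pi(x) - log_q                     # importance weights pi/q_t
    w_bar = np.exp(log_w - log_w.max())
    w_bar /= w_bar.sum()                          # normalised weights omega_bar_{i,t}
    idx = rng.choice(N, size=N, p=w_bar)          # multinomial resampling J_{i,t}
    x_tilde = x[idx]                              # resampled population X_tilde_{i,t}
estimate = np.sum(w_bar * x**2)   # weighted estimate of pi(x^2) = 1 at the last iteration
```

The estimate uses the weighted (pre-resampling) sample, as in the importance-sampling identity; resampling only prepares the next population.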
12 Population Monte Carlo: The Algorithm
PMC has many connections with (among others):
West's (1992) mixture approximation
Hürzeler & Künsch's (1998) and Stavropoulos & Titterington's (1999) smooth bootstrap
Wong & Liang's (1997) and Liu, Liang & Wong's (2001) dynamic weighting
Chopin's (2001) progressive posteriors for large datasets
Gilks & Berzuini's (2001) resample-move
Rubinstein & Kroese's (2004) cross-entropy method
Del Moral, Doucet & Jasra's (2006) sequential Monte Carlo samplers
It may also be adequately described as an iterated Sampling Importance Resampling (SIR) approach with (possibly) Markov proposals (note that Iba, 2000, uses the term to refer to a more general class of methods).
13 Population Monte Carlo: Properties
Basic Importance Sampling Equality (preservation of unbiasedness):
E[ω_{·,t} h(X_{·,t})] = E[ E( π(X_{·,t}) / q_{·,t}(X̃_{·,t−1}, X_{·,t}) h(X_{·,t}) | {X̃_{i,t−1}}_{1≤i≤N} ) ] = π(h)
We may freely choose the way in which the proposal kernels q_{i,t} are built from {ω̄_{i,t−1}, X̃_{i,t−1}}_{1≤i≤N}. In the following, we consider only global proposals of the form q_{i,t} := q_t.
14 Population Monte Carlo: Properties
Asymptotic Analysis, Key Properties:
{X_{i,t}, ω_{i,t}}_{1≤i≤N} are conditionally iid given {X̃_{j,t−1}, ω_{j,t−1}}_{1≤j≤N}
E[ω_{·,t} h(X_{·,t}) | {X̃_{i,t−1}, ω_{i,t−1}}_{1≤i≤N}] = π(h)
However, the successive populations are not independent, due to the use of (possibly) Markovian proposals which are adaptively tuned.
In the following, we consider the behaviour of PMC when T is kept fixed and N tends to infinity (using results of Douc & Moulines, 2005).
15 Population Monte Carlo: Properties
Fundamental Asymptotic Results: assume h is such that π(|h|) < ∞
Consistency: under π⊗π{q_t(x, x′) = 0} = 0,
∑_{i=1}^N ω̄_{i,t} h(X_{i,t}) →_P π(h)
Asymptotic Normality: under π⊗π( [1 + h²(x′)] π(x′) / q_t(x, x′) ) < ∞,
√N ( ∑_{i=1}^N ω̄_{i,t} h(X_{i,t}) − π(h) ) →_D N(0, σ_t²)
16 Population Monte Carlo: Properties
Variance Estimation: the asymptotic variance is given (for t ≥ 1) by
σ_t² := lim_{N→∞} Var[ ω_{·,t} h(X_{·,t}) | {X̃_{i,t−1}, ω_{i,t−1}}_{1≤i≤N} ]
= lim_{N→∞} ∑_{i=1}^N ω̄_{i,t−1} ∫ [h(x′) − π(h)]² π(x′)/q_t(X̃_{i,t−1}, x′) π(x′) dx′
= ∫∫ [h(x′) − π(h)]² π(x′)/q_t(x, x′) π(x)π(x′) dx dx′
which may be estimated in practice by
N ∑_{i=1}^N ω̄_{i,t}² [ h(X_{i,t}) − ∑_{j=1}^N ω̄_{j,t} h(X_{j,t}) ]²
17–18 Population Monte Carlo: Properties
The Final Estimator: after T iterations of the PMC algorithm, the estimator of π(h) is given by
π̂_{N,T}^PMC(h) = ∑_{i=1}^N ω̄_{i,T} h(X_{i,T})
or, more efficiently, by the inverse-variance weighted combination
∑_{t=1}^T ( σ̂_t^{−2} / ∑_{s=1}^T σ̂_s^{−2} ) ∑_{i=1}^N ω̄_{i,t} h(X_{i,t})
How should q_t be updated from the simulations (up to time t − 1)?
19 Monte Carlo Basics Population Monte Carlo Kullback Divergence Adaptive PMC for Mixture Proposals (Toy) Examples Variance Minimisation (Toy) Example Again PMC for ECOSSTAT References
20–21 Kullback Divergence
We first need a performance criterion.
Kullback Divergence:
arg min_θ K[π⊗π ‖ π⊗q_θ]
= arg min_θ ∫∫ log( π(x)π(x′) / [π(x) q_θ(x, x′)] ) π(x)π(x′) dx dx′
= arg max_θ ∫∫ log q_θ(x, x′) π(x)π(x′) dx dx′ =: l(θ)
K[π⊗π ‖ π⊗q_θ] = 0 implies that the weights are constant
Sequential Monte Carlo interpretation (see Arnaud's talk)
Usually gives explicit solutions for θ (see Nicolas's talk)
22–23 Kullback Divergence
The estimation of θ is straightforward when q_θ belongs to an exponential family.
Example (independent Gaussian proposal):
q_{μ,Σ}(x, x′) ∝ |Σ|^{−1/2} exp[ −1/2 (x′ − μ)^T Σ^{−1} (x′ − μ) ]
gives (μ*, Σ*) = (E_π[X], Cov_π[X]), which can be estimated by
μ̂ = ∑_{i=1}^N ω̄_{i,0} X_{i,0},  Σ̂ = ∑_{i=1}^N ω̄_{i,0} (X_{i,0} − μ̂)(X_{i,0} − μ̂)^T
(the estimate can obviously be improved along the iterations)
Example (random-walk Gaussian proposal):
q_Σ(x, x′) ∝ |Σ|^{−1/2} exp[ −1/2 (x′ − x)^T Σ^{−1} (x′ − x) ]
gives Σ* = Cov_{π⊗π}[X′ − X] = 2 Cov_π[X]
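For the independent Gaussian proposal, the Kullback update is just moment matching with importance weights. A hedged two-dimensional sketch (the target, sample size and number of adaptation rounds are all assumptions of mine):

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed 2-d target: N(m, S), known only up to a constant through log_pi.
m = np.array([1.0, -1.0])
S = np.array([[2.0, 0.5], [0.5, 1.0]])
S_inv = np.linalg.inv(S)

def log_pi(x):                      # unnormalised log target, x of shape (N, 2)
    d = x - m
    return -0.5 * np.einsum('ni,ij,nj->n', d, S_inv, d)

N = 50_000
mu, Sigma = np.zeros(2), 9.0 * np.eye(2)      # deliberately poor initial proposal
for _ in range(5):
    L = np.linalg.cholesky(Sigma)
    x = mu + rng.standard_normal((N, 2)) @ L.T           # X_i ~ N(mu, Sigma)
    d = x - mu
    log_q = (-0.5 * np.einsum('ni,ij,nj->n', d, np.linalg.inv(Sigma), d)
             - 0.5 * np.log(np.linalg.det(Sigma)))
    lw = log_pi(x) - log_q
    w_bar = np.exp(lw - lw.max())
    w_bar /= w_bar.sum()                                 # normalised IS weights
    mu = w_bar @ x                                       # weighted mean -> new mu
    c = x - mu
    Sigma = (w_bar[:, None] * c).T @ c                   # weighted covariance -> new Sigma
```

After a few rounds, (mu, Sigma) should approach (E_π[X], Cov_π[X]), i.e. the Kullback-optimal independent Gaussian proposal.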
24 Kullback Divergence
Integrated-EM Formulas for More General Choices of q_θ: if q_θ is a missing-data type of proposal, i.e.
q_θ(x, x′) = ∫ f_θ(x, x′, y) dy,
we may define an intermediate quantity
L_{θ′}(θ) = E_{π⊗π}[ E_{f_{θ′}}( log f_θ(X, X′, Y) | X, X′ ) ]
which (using Jensen's inequality) satisfies
L_{θ′}(θ) ≥ L_{θ′}(θ′) ⟹ l(θ) ≥ l(θ′)
We thus obtain ascent integrated-EM updates, which can be approximated since {(X̃_{i,t−1}, X_{i,t}), ω̄_{i,t}}_{1≤i≤N} can be used to estimate expectations under π⊗π. This is used in particular in the information bottleneck algorithm.
25 Adaptive PMC for Mixture Proposals
Mixture Proposals: in the following we consider
q_α(x, x′) = ∑_{d=1}^D α_d q_d(x, x′)
where the q_d are fixed transition kernels. Note that the criterion l(α) is then concave.
26 Adaptive PMC for Mixture Proposals
Integrated EM Recursion on the Proportions: the mapping
Ψ(α) = ( ∫∫ α_d q_d(x, x′) / ∑_{j=1}^D α_j q_j(x, x′) π(x)π(x′) dx dx′ )_{1≤d≤D}
defined on the probability simplex
S = { α = (α_1, ..., α_D) : α_d ≥ 0 for 1 ≤ d ≤ D and ∑_{d=1}^D α_d = 1 }
is such that l(Ψ(α)) ≥ l(α)
27 Adaptive PMC for Mixture Proposals
Adaptive Mixture PMC: at time t (t = 1, ..., T)
(Mixture sampling) Generate iid {K_{i,t}}_{1≤i≤N} from M(1, (α_{d,t})_{1≤d≤D}) and X_{i,t} independently from q_{K_{i,t}}(X̃_{i,t−1}, ·)
(IS weights) Set ω_{i,t} = π(X_{i,t}) / ∑_{d=1}^D α_{d,t} q_d(X̃_{i,t−1}, X_{i,t})
(Proportions update) α_{d,t+1} = ∑_{i=1}^N ω̄_{i,t} 1_d(K_{i,t})
(Resampling) Generate {J_{i,t}}_{1≤i≤N} iid from M(1, (ω̄_{i,t})_{1≤i≤N}) and set X̃_{i,t} = X_{J_{i,t},t}
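The four steps above can be sketched directly; everything concrete below (target N(0, 1), three random-walk Gaussian kernels with scales 0.1, 1 and 10, the sample size) is an illustrative assumption, and refinements such as Rao-Blackwellised updates are omitted:

```python
import numpy as np

rng = np.random.default_rng(4)

def norm_pdf(u, s):
    return np.exp(-0.5 * (u / s)**2) / (s * np.sqrt(2 * np.pi))

scales = np.array([0.1, 1.0, 10.0])          # D = 3 fixed random-walk kernels
D, N, T = 3, 20_000, 15
alpha = np.full(D, 1.0 / D)                  # uniform initial proportions
x_tilde = rng.standard_normal(N)             # initial population
for t in range(T):
    k = rng.choice(D, size=N, p=alpha)                        # mixture sampling K_{i,t}
    x = x_tilde + scales[k] * rng.standard_normal(N)          # X_{i,t} ~ q_{K_{i,t}}
    mix = sum(alpha[d] * norm_pdf(x - x_tilde, scales[d]) for d in range(D))
    w = norm_pdf(x, 1.0) / mix                                # pi(x) / q_alpha(x_tilde, x)
    w_bar = w / w.sum()
    alpha = np.array([w_bar[k == d].sum() for d in range(D)]) # proportions update
    idx = rng.choice(N, size=N, p=w_bar)                      # multinomial resampling
    x_tilde = x[idx]
```

Each kernel's new proportion is the total normalised weight of the particles it generated, so kernels whose offspring carry little weight (here, the very wide random walk) are progressively downweighted.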
28 Adaptive PMC for Mixture Proposals
Why is this correct?
E[ ω_{i,t} 1_d(K_{i,t}) | X̃_{i,t−1} ] = α_{d,t} ∫ q_d(X̃_{i,t−1}, x′) π(x′) / ∑_{j=1}^D α_{j,t} q_j(X̃_{i,t−1}, x′) dx′
so that, as N → ∞,
E[ 1/N ∑_{i=1}^N ω_{i,t} 1_d(K_{i,t}) | {X̃_{j,t−1}}_{1≤j≤N} ] → α_{d,t} ∫∫ q_d(x, x′) π(x′) / ∑_{j=1}^D α_{j,t} q_j(x, x′) dx′ π(x) dx
which is the dth coordinate of the integrated EM mapping Ψ
29 (Toy) Examples
Example (independent proposals with mixture target):
Target: 1/4 N(−1, 0.3)(x) + 1/4 N(0, 1)(x) + 1/2 N(3, 2)(x)
Proposals: N(−1, 0.3), N(0, 1) and N(3, 2)
Table: Weight evolution (N = 100,000)
30 (Toy) Examples Figure: Target and mixture evolution
31 (Toy) Examples
Example (random-walk proposals):
Target: N(0, 1)
Gaussian random-walk proposals: q_1(x, x′) = f_{N(x,0.1)}(x′), q_2(x, x′) = f_{N(x,2)}(x′) and q_3(x, x′) = f_{N(x,10)}(x′)
Table: Evolution of the weights (N = 100,000)
32–33 Variance Minimisation
Can we do better? Recall that, when h is known beforehand, the optimal importance function is given by
q*(x) = |h(x) − π(h)| π(x) / ∫ |h(x′) − π(h)| π(x′) dx′
which may yield a smaller variance than π̂_N^MC(h).
For PMC, it is natural to consider the following objective:
Minimum Variance Criterion:
arg min_θ E_{π⊗π}( [h(X′) − π(h)]² π(X′) / q_θ(X, X′) )
Note that this criterion is again convex in α for mixture proposals.
34 Variance Minimisation
There is an ascent update rule for the variance criterion:
Ψ(α) = ( ν_h( α_d q_d(x, x′) / ∑_{l=1}^D α_l q_l(x, x′) ) / σ_h²(α) )_{1≤d≤D}
where the (unnormalised) measure ν_h is defined as
ν_h(dx, dx′) = [h(x′) − π(h)]² π(x′) / ∑_{d=1}^D α_d q_d(x, x′) π(dx)π(dx′)
and σ_h²(α) = ν_h(1) is the value of the variance criterion corresponding to α. This mapping is such that σ_h²(Ψ(α)) ≤ σ_h²(α).
35–36 Variance Minimisation
Interlude: Convex Puzzles
Kullback Divergence Criterion: arg max_α ∫ log[ ∑_d α_d f_d(x) ] ν(dx)
Ascent Mapping: α_i′ = ∫ α_i f_i(x) / ∑_d α_d f_d(x) ν(dx)
Proof: concavity of log(x) & positivity of the Kullback divergence
Minimum Variance Criterion: arg min_α ∫ [ ∑_d α_d f_d(x) ]^{−1} ν(dx)
Ascent Mapping: α_i′ = ∫ α_i f_i(x) [ ∑_d α_d f_d(x) ]^{−2} ν(dx) / ∫ [ ∑_d α_d f_d(x) ]^{−1} ν(dx)
Proof: convexity of x^{−1} & ∑_d α_d = 1
Is there any more principled way of finding the second update?
37 Variance Minimisation
Updating Rule: the empirical version of the previous update is
α_{d,t+1} = ∑_{i=1}^N ω̄_{i,t}² [ h(X_{i,t}) − ∑_{j=1}^N ω̄_{j,t} h(X_{j,t}) ]² 1_d(K_{i,t}) / ∑_{i=1}^N ω̄_{i,t}² [ h(X_{i,t}) − ∑_{j=1}^N ω̄_{j,t} h(X_{j,t}) ]²
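One step of this empirical update can be sketched on a single importance-sampling population (the target, the three independent Gaussian proposals and h(x) = x are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)

def norm_pdf(u, s):
    return np.exp(-0.5 * (u / s)**2) / (s * np.sqrt(2 * np.pi))

alpha = np.array([1 / 3, 1 / 3, 1 / 3])   # current mixture proportions
scales = np.array([1.0, 3.0, 0.5])        # three independent Gaussian proposals
N = 100_000
k = rng.choice(3, size=N, p=alpha)                         # kernel indicators K_i
x = scales[k] * rng.standard_normal(N)                     # X_i ~ q_{K_i}
mix = sum(alpha[d] * norm_pdf(x, scales[d]) for d in range(3))
w_bar = norm_pdf(x, 1.0) / mix                             # pi / q_alpha
w_bar /= w_bar.sum()
h = x                                                      # h(x) = x, pi(h) = 0
est = np.sum(w_bar * h)
contrib = w_bar**2 * (h - est)**2          # per-particle variance contribution
alpha_new = np.array([contrib[k == d].sum() for d in range(3)]) / contrib.sum()
```

By construction the new proportions are nonnegative and sum to one; each kernel is reweighted by how much its particles contribute to the estimated variance.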
38 (Toy) Example Again
Example: N(0, 1) target, h(x) = x, and D = 3 independent proposals:
N(0, 1)
Cauchy distribution
Symmetrised Ga(0.5, 0.5) (this is the optimal choice, q*)
Table: PMC estimates for N = 100,000 and T = 20 (columns: t, estimate, α_{1,t}, α_{2,t}, α_{3,t}, variance)
39 PMC for ECOSSTAT
How does this work in real life? The ECOSSTAT project (measuring cosmological parameters from large heterogeneous surveys) is an interdisciplinary study in which we use Monte Carlo methods to infer cosmological parameters from several sets of measurements. PMC is well suited in this context because:
Evaluation of π(x) is prohibitively long, but PMC can be (mostly) parallelised (in contrast to MCMC)
Variance estimation is feasible with PMC, which is important to cosmologists
Because of the physical nature of the parameters, their values are known to some extent, which is more or less required for techniques based on importance sampling
40 PMC for ECOSSTAT
A (Somewhat) Idealised Example: Gaussian ellipsoid target
Figure: Target and mixture evolution
We use the following proposals: random-walk Gaussian proposals with covariance Σ, 2Σ and 4Σ; independent Gaussian proposals with covariance Σ/2, Σ and 2Σ; and the uniform distribution.
41 PMC for ECOSSTAT: Results
Remark: we need at least one q_d, with non-zero α_d, for which the IS variance is finite.
Table: mixture proportions, normalised variance and ESS of h for the uniform, RW 2Σ and independent Σ proposals, under the Kullback and variance criteria
Normalised ESS (effective sample size): (N ∑_{i=1}^N ω̄_{i,t}²)^{−1}
The function h is h(x) = (1 1)^T x
42 PMC for ECOSSTAT
Typical Results with N = 10,000 Particles and T = 50 Iterations
Figure: Values of α_{d,t} as a function of t (Kullback criterion), for the proposals RW Σ, RW 2Σ, RW 4Σ, Indep. Σ/2, Indep. Σ, Indep. 2Σ and Uniform
43 PMC for ECOSSTAT
Typical Results (contd.)
Figure: Normalised ESS as a function of t
Figure: Estimated variance for the function h as a function of t
Due to the stability of the asymptotic updates, the algorithm performs well even with moderate sample sizes.
44 References
Cappé, Guillin, Marin & Robert. Population Monte Carlo. J. Comput. Graph. Statist., 13(4), 2004.
Douc, Guillin, Marin & Robert. Convergence of adaptive mixtures of importance sampling schemes. Ann. Statist., 35(1), 2007 (to appear).
Douc, Guillin, Marin & Robert. Minimum variance importance sampling via population Monte Carlo. Technical report.
Advertising below this line...
Postdocs interested in the ECOSSTAT project, please contact Christian P. Robert and/or myself
People interested in adaptive Monte Carlo, please check the workshop (June 2007)
More informationWinter 2019 Math 106 Topics in Applied Mathematics. Lecture 9: Markov Chain Monte Carlo
Winter 2019 Math 106 Topics in Applied Mathematics Data-driven Uncertainty Quantification Yoonsang Lee (yoonsang.lee@dartmouth.edu) Lecture 9: Markov Chain Monte Carlo 9.1 Markov Chain A Markov Chain Monte
More informationLECTURE 3. Last time:
LECTURE 3 Last time: Mutual Information. Convexity and concavity Jensen s inequality Information Inequality Data processing theorem Fano s Inequality Lecture outline Stochastic processes, Entropy rate
More informationGradient-based Monte Carlo sampling methods
Gradient-based Monte Carlo sampling methods Johannes von Lindheim 31. May 016 Abstract Notes for a 90-minute presentation on gradient-based Monte Carlo sampling methods for the Uncertainty Quantification
More informationA Review of Pseudo-Marginal Markov Chain Monte Carlo
A Review of Pseudo-Marginal Markov Chain Monte Carlo Discussed by: Yizhe Zhang October 21, 2016 Outline 1 Overview 2 Paper review 3 experiment 4 conclusion Motivation & overview Notation: θ denotes the
More informationSequential Monte Carlo Methods
University of Pennsylvania Bradley Visitor Lectures October 23, 2017 Introduction Unfortunately, standard MCMC can be inaccurate, especially in medium and large-scale DSGE models: disentangling importance
More informationLecture 8: The Metropolis-Hastings Algorithm
30.10.2008 What we have seen last time: Gibbs sampler Key idea: Generate a Markov chain by updating the component of (X 1,..., X p ) in turn by drawing from the full conditionals: X (t) j Two drawbacks:
More informationThe Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision
The Particle Filter Non-parametric implementation of Bayes filter Represents the belief (posterior) random state samples. by a set of This representation is approximate. Can represent distributions that
More informationComputer intensive statistical methods
Lecture 13 MCMC, Hybrid chains October 13, 2015 Jonas Wallin jonwal@chalmers.se Chalmers, Gothenburg university MH algorithm, Chap:6.3 The metropolis hastings requires three objects, the distribution of
More informationMarkov Chain Monte Carlo Lecture 1
What are Monte Carlo Methods? The subject of Monte Carlo methods can be viewed as a branch of experimental mathematics in which one uses random numbers to conduct experiments. Typically the experiments
More informationarxiv: v1 [stat.co] 1 Jun 2015
arxiv:1506.00570v1 [stat.co] 1 Jun 2015 Towards automatic calibration of the number of state particles within the SMC 2 algorithm N. Chopin J. Ridgway M. Gerber O. Papaspiliopoulos CREST-ENSAE, Malakoff,
More informationDivide-and-Conquer Sequential Monte Carlo
Divide-and-Conquer Joint work with: John Aston, Alexandre Bouchard-Côté, Brent Kirkpatrick, Fredrik Lindsten, Christian Næsseth, Thomas Schön University of Warwick a.m.johansen@warwick.ac.uk http://go.warwick.ac.uk/amjohansen/talks/
More informationPerturbed Proximal Gradient Algorithm
Perturbed Proximal Gradient Algorithm Gersende FORT LTCI, CNRS, Telecom ParisTech Université Paris-Saclay, 75013, Paris, France Large-scale inverse problems and optimization Applications to image processing
More informationParticle Filtering Approaches for Dynamic Stochastic Optimization
Particle Filtering Approaches for Dynamic Stochastic Optimization John R. Birge The University of Chicago Booth School of Business Joint work with Nicholas Polson, Chicago Booth. JRBirge I-Sim Workshop,
More informationData assimilation as an optimal control problem and applications to UQ
Data assimilation as an optimal control problem and applications to UQ Walter Acevedo, Angwenyi David, Jana de Wiljes & Sebastian Reich Universität Potsdam/ University of Reading IPAM, November 13th 2017
More informationZig-Zag Monte Carlo. Delft University of Technology. Joris Bierkens February 7, 2017
Zig-Zag Monte Carlo Delft University of Technology Joris Bierkens February 7, 2017 Joris Bierkens (TU Delft) Zig-Zag Monte Carlo February 7, 2017 1 / 33 Acknowledgements Collaborators Andrew Duncan Paul
More informationIterative Markov Chain Monte Carlo Computation of Reference Priors and Minimax Risk
Iterative Markov Chain Monte Carlo Computation of Reference Priors and Minimax Risk John Lafferty School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 lafferty@cs.cmu.edu Abstract
More informationComputer Practical: Metropolis-Hastings-based MCMC
Computer Practical: Metropolis-Hastings-based MCMC Andrea Arnold and Franz Hamilton North Carolina State University July 30, 2016 A. Arnold / F. Hamilton (NCSU) MH-based MCMC July 30, 2016 1 / 19 Markov
More informationChapter 12 PAWL-Forced Simulated Tempering
Chapter 12 PAWL-Forced Simulated Tempering Luke Bornn Abstract In this short note, we show how the parallel adaptive Wang Landau (PAWL) algorithm of Bornn et al. (J Comput Graph Stat, to appear) can be
More informationBrief Review on Estimation Theory
Brief Review on Estimation Theory K. Abed-Meraim ENST PARIS, Signal and Image Processing Dept. abed@tsi.enst.fr This presentation is essentially based on the course BASTA by E. Moulines Brief review on
More informationSequential Monte Carlo Methods
National University of Singapore KAUST, October 14th 2014 Monte Carlo Importance Sampling Markov chain Monte Carlo Sequential Importance Sampling Resampling + Weight Degeneracy Path Degeneracy Algorithm
More informationStat 535 C - Statistical Computing & Monte Carlo Methods. Lecture February Arnaud Doucet
Stat 535 C - Statistical Computing & Monte Carlo Methods Lecture 13-28 February 2006 Arnaud Doucet Email: arnaud@cs.ubc.ca 1 1.1 Outline Limitations of Gibbs sampling. Metropolis-Hastings algorithm. Proof
More informationMonte Carlo Approximation of Monte Carlo Filters
Monte Carlo Approximation of Monte Carlo Filters Adam M. Johansen et al. Collaborators Include: Arnaud Doucet, Axel Finke, Anthony Lee, Nick Whiteley 7th January 2014 Context & Outline Filtering in State-Space
More informationAn introduction to adaptive MCMC
An introduction to adaptive MCMC Gareth Roberts MIRAW Day on Monte Carlo methods March 2011 Mainly joint work with Jeff Rosenthal. http://www2.warwick.ac.uk/fac/sci/statistics/crism/ Conferences and workshops
More informationOverlapping block proposals for latent Gaussian Markov random fields
NORGES TEKNISK-NATURVITENSKAPELIGE UNIVERSITET Overlapping block proposals for latent Gaussian Markov random fields by Ingelin Steinsland and Håvard Rue PREPRINT STATISTICS NO. 8/3 NORWEGIAN UNIVERSITY
More informationBayesian inference for multivariate skew-normal and skew-t distributions
Bayesian inference for multivariate skew-normal and skew-t distributions Brunero Liseo Sapienza Università di Roma Banff, May 2013 Outline Joint research with Antonio Parisi (Roma Tor Vergata) 1. Inferential
More informationBayesian estimation of the discrepancy with misspecified parametric models
Bayesian estimation of the discrepancy with misspecified parametric models Pierpaolo De Blasi University of Torino & Collegio Carlo Alberto Bayesian Nonparametrics workshop ICERM, 17-21 September 2012
More informationLTCC: Advanced Computational Methods in Statistics
LTCC: Advanced Computational Methods in Statistics Advanced Particle Methods & Parameter estimation for HMMs N. Kantas Notes at http://wwwf.imperial.ac.uk/~nkantas/notes4ltcc.pdf Slides at http://wwwf.imperial.ac.uk/~nkantas/slides4.pdf
More informationMinimum Message Length Analysis of the Behrens Fisher Problem
Analysis of the Behrens Fisher Problem Enes Makalic and Daniel F Schmidt Centre for MEGA Epidemiology The University of Melbourne Solomonoff 85th Memorial Conference, 2011 Outline Introduction 1 Introduction
More information13 Notes on Markov Chain Monte Carlo
13 Notes on Markov Chain Monte Carlo Markov Chain Monte Carlo is a big, and currently very rapidly developing, subject in statistical computation. Many complex and multivariate types of random data, useful
More informationPattern Recognition and Machine Learning. Bishop Chapter 11: Sampling Methods
Pattern Recognition and Machine Learning Chapter 11: Sampling Methods Elise Arnaud Jakob Verbeek May 22, 2008 Outline of the chapter 11.1 Basic Sampling Algorithms 11.2 Markov Chain Monte Carlo 11.3 Gibbs
More information