Adaptive Population Monte Carlo

Olivier Cappé
Centre National de la Recherche Scientifique & Télécom Paris
46 rue Barrault, 75634 Paris cedex 13, France
http://www.tsi.enst.fr/~cappe/

Joint work with Randal Douc, Arnaud Guillin, Jean-Michel Marin & Christian P. Robert

Recent Advances in Monte Carlo Based Inference Workshop
30 October - 3 November 2006, Isaac Newton Institute
Outline
  Monte Carlo Basics
  Population Monte Carlo
  PMC for ECOSSTAT
  References
Outline: Monte Carlo Basics
  Monte Carlo, MCMC...
  Importance Sampling, SIR...
General Purpose

Given a density $\pi$, known up to a normalising constant, compute
$$\pi(h) = \int h(x)\,\pi(x)\,dx$$
for some functions of interest $h$.

In the following, $\pi$ denotes the normalised density, but we consider only algorithms that do not require knowledge of the normalising constant.
Monte Carlo

Generate an iid sample $X_1, \ldots, X_N$ from $\pi$ and estimate $\pi(h)$ by
$$\hat\pi^{\mathrm{MC}}_N(h) = \frac{1}{N}\sum_{i=1}^N h(X_i)$$

Under $\pi(h^2) = \int h^2(x)\,\pi(x)\,dx < \infty$, we have

Consistency: $\hat\pi^{\mathrm{MC}}_N(h) \xrightarrow{P} \pi(h)$, and

Asymptotic Normality: $\sqrt{N}\left(\hat\pi^{\mathrm{MC}}_N(h) - \pi(h)\right) \xrightarrow{D} \mathcal{N}\left(0, \pi\left\{[h - \pi(h)]^2\right\}\right)$

with practical variance estimate $\frac{1}{N}\sum_{i=1}^N \left[h(X_i) - \hat\pi^{\mathrm{MC}}_N(h)\right]^2$.

Caveat: It is often impossible to simulate directly from $\pi$. Possible answers include Markov chain Monte Carlo.
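As a concrete illustration (not from the talk), a minimal numpy sketch of the plain Monte Carlo estimator with its CLT-based error assessment, assuming the toy target $\pi = \mathcal{N}(0,1)$ and $h(x) = x^2$:

    import numpy as np

    rng = np.random.default_rng(0)
    N = 100_000

    # iid sample from pi (here pi = N(0, 1), so direct simulation is possible)
    x = rng.standard_normal(N)
    h = x**2                               # function of interest, h(x) = x^2

    pi_h_mc = h.mean()                     # \hat{pi}^{MC}_N(h); true value is 1
    var_hat = np.mean((h - pi_h_mc)**2)    # practical variance estimate
    print(pi_h_mc, 1.96 * np.sqrt(var_hat / N))   # estimate and CLT half-width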
Importance Sampling

Generate an iid sample $X_1, \ldots, X_N$ from $q$ and estimate $\pi(h)$ by
$$\hat\pi^{\mathrm{IS}}_N(h) = \sum_{i=1}^N \bar\omega_i\, h(X_i)$$
where $\omega_i = \pi(X_i)/q(X_i)$ and $\bar\omega_i = \omega_i \big/ \sum_{j=1}^N \omega_j$.

Under $\pi\left[(1 + h^2)\,\pi/q\right] < \infty$, we have consistency and asymptotic normality, with asymptotic variance $\pi\left\{[h - \pi(h)]^2\,\pi/q\right\}$ and practical variance estimate $N \sum_{i=1}^N \bar\omega_i^2\,\left[h(X_i) - \hat\pi^{\mathrm{IS}}_N(h)\right]^2$.

No Free Lunch: Finding a suitable proposal density $q$ is a hard task (particularly in high dimension).
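A minimal sketch of self-normalised importance sampling with the variance estimate above, again my own illustration: the target is $\mathcal{N}(0,1)$ known only up to a constant, and the proposal is a Student-t with 3 degrees of freedom (an assumption chosen for its heavy tails):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    N = 100_000

    # proposal q: Student-t with 3 df (heavier tails than the target)
    x = stats.t.rvs(df=3, size=N, random_state=rng)

    log_pi = -0.5 * x**2                   # unnormalised log target
    log_w = log_pi - stats.t.logpdf(x, df=3)
    w = np.exp(log_w - log_w.max())        # stabilised unnormalised weights
    w_bar = w / w.sum()                    # self-normalised weights

    h = x**2
    pi_h_is = np.sum(w_bar * h)                          # \hat{pi}^{IS}_N(h)
    var_hat = N * np.sum(w_bar**2 * (h - pi_h_is)**2)    # asymptotic variance estimate
    print(pi_h_is, var_hat)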
Outline: Population Monte Carlo
  The Algorithm
  Properties
Population Monte Carlo (PMC)

At time $t = 0$:
- Generate $\{X_{i,0}\}_{1 \le i \le N}$ iid from $q_0$
- Set $\omega_{i,0} = \pi(X_{i,0})/q_0(X_{i,0})$ and $\bar\omega_{i,0} = \omega_{i,0} \big/ \sum_{j=1}^N \omega_{j,0}$
- Generate $\{J_{i,0}\}_{1 \le i \le N}$ iid from $\mathcal{M}\left(1, (\bar\omega_{i,0})_{1 \le i \le N}\right)$ and set $\tilde X_{i,0} = X_{J_{i,0},0}$

At time $t$ ($t = 1, \ldots, T$):
- Generate $X_{i,t} \sim q_{i,t}(\tilde X_{i,t-1}, \cdot)$ independently
- Set $\omega_{i,t} = \pi(X_{i,t})/q_{i,t}(\tilde X_{i,t-1}, X_{i,t})$ and $\bar\omega_{i,t} = \omega_{i,t} \big/ \sum_{j=1}^N \omega_{j,t}$
- Generate $\{J_{i,t}\}_{1 \le i \le N}$ iid from $\mathcal{M}\left(1, (\bar\omega_{i,t})_{1 \le i \le N}\right)$ and set $\tilde X_{i,t} = X_{J_{i,t},t}$

Note that other forms of resampling could be used...
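To make the recursion concrete, here is a minimal one-dimensional sketch (my own, not from the talk) with a fixed Gaussian random-walk kernel $q_t(\tilde x, \cdot) = \mathcal{N}(\tilde x, \sigma^2)$ and multinomial resampling; the target and all tuning constants are assumptions:

    import numpy as np

    rng = np.random.default_rng(0)

    def log_pi(x):
        # unnormalised log target; N(0, 1) as a stand-in
        return -0.5 * x**2

    def pmc(N=10_000, T=10, sigma=0.5):
        # t = 0: independent proposal q0 = N(0, 2^2), weight and resample
        x = 2.0 * rng.standard_normal(N)
        log_w = log_pi(x) + 0.5 * (x / 2.0)**2   # log pi - log q0, constants dropped
        w_bar = np.exp(log_w - log_w.max()); w_bar /= w_bar.sum()
        x_tilde = rng.choice(x, size=N, p=w_bar)  # multinomial resampling
        for t in range(1, T + 1):
            # Markov proposal q_t(x_tilde, .) = N(x_tilde, sigma^2)
            x = x_tilde + sigma * rng.standard_normal(N)
            log_w = log_pi(x) + 0.5 * ((x - x_tilde) / sigma)**2  # constants dropped
            w_bar = np.exp(log_w - log_w.max()); w_bar /= w_bar.sum()
            x_tilde = rng.choice(x, size=N, p=w_bar)
        return x, w_bar                            # weighted sample at time T

    x, w_bar = pmc()
    print(np.sum(w_bar * x**2))                    # estimate of pi(h) for h(x) = x^2

Additive constants in the log weights can be dropped since the weights are renormalised at each iteration.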
PMC has many connections with (among others):
- West's (1992) mixture approximation
- Hürzeler & Künsch's (1998) and Stavropoulos & Titterington's (1999) smooth bootstrap
- Wong & Liang's (1997) and Liu, Liang & Wong's (2001) dynamic weighting
- Chopin's (2001) progressive posteriors for large datasets
- Gilks & Berzuini's (2001) resample-move
- Rubinstein & Kroese's (2004) cross-entropy method
- Del Moral, Doucet & Jasra's (2006) sequential Monte Carlo samplers

It may also be adequately described as an iterated Sampling Importance Resampling (SIR) approach with (possibly) Markov proposals (note that Iba, 2000, uses the term "Population Monte Carlo" to refer to a more general class of methods).
Basic Importance Sampling Equality

Preservation of unbiasedness:
$$E\left[\omega_{\cdot,t}\, h(X_{\cdot,t})\right] = E\Bigg[\underbrace{E\left[h(X_{\cdot,t})\,\frac{\pi(X_{\cdot,t})}{q_{\cdot,t}(\tilde X_{\cdot,t-1}, X_{\cdot,t})} \,\Big|\, \{\tilde X_{i,t-1}\}_{1 \le i \le N}\right]}_{\pi(h)}\Bigg]$$

We may freely choose the way in which the proposal kernels $q_{i,t}$ are built from $\{\bar\omega_{i,t-1}, \tilde X_{i,t-1}\}_{1 \le i \le N}$. In the following, we consider only global proposals of the form $q_{i,t} \overset{\mathrm{def}}{=} q_t$.
Asymptotic Analysis

Key properties:
- $\{X_{i,t}, \omega_{i,t}\}_{1 \le i \le N}$ are conditionally i.i.d. given $\{\tilde X_{j,t-1}, \bar\omega_{j,t-1}\}_{1 \le j \le N}$
- $E\left[\omega_{\cdot,t}\, h(X_{\cdot,t}) \mid \{\tilde X_{i,t-1}, \bar\omega_{i,t-1}\}_{1 \le i \le N}\right] = \pi(h)$

However, the successive populations are not independent, due to the use of (possibly) Markovian proposals which are adaptively tuned.

In the following, we consider the behaviour of PMC when $T$ is kept fixed and $N$ tends to infinity (using results of Douc & Moulines, 2005).
Fundamental Asymptotic Results

Assume $h$ is such that $\pi(|h|) < \infty$.

Consistency: Under $\pi \otimes \pi\left\{q_t(x, x') = 0\right\} = 0$,
$$\sum_{i=1}^N \bar\omega_{i,t}\, h(X_{i,t}) \xrightarrow{P} \pi(h)$$

Asymptotic Normality: Under $\pi \otimes \pi\left([1 + h^2(x')]\,\pi(x')/q_t(x, x')\right) < \infty$,
$$\sqrt{N}\left(\sum_{i=1}^N \bar\omega_{i,t}\, h(X_{i,t}) - \pi(h)\right) \xrightarrow{D} \mathcal{N}(0, \sigma_t^2)$$
Variance Estimation

The asymptotic variance is given (for $t \ge 1$) by
$$\sigma_t^2 \overset{\mathrm{def}}{=} \lim_N \operatorname{Var}\left[\omega_{\cdot,t}\, h(X_{\cdot,t}) \mid \{\tilde X_{i,t-1}, \bar\omega_{i,t-1}\}_{1 \le i \le N}\right]$$
$$= \lim_N \sum_{i=1}^N \bar\omega_{i,t-1} \int \left[h(x') - \pi(h)\right]^2\, \frac{\pi(x')}{q_t(\tilde X_{i,t-1}, x')}\, \pi(x')\,dx'$$
$$= \iint \left[h(x') - \pi(h)\right]^2\, \frac{\pi(x')}{q_t(x, x')}\, \pi(x)\,\pi(x')\,dx\,dx'$$

which may be estimated in practice by
$$N \sum_{i=1}^N \bar\omega_{i,t}^2 \left[h(X_{i,t}) - \sum_{j=1}^N \bar\omega_{j,t}\, h(X_{j,t})\right]^2$$
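The practical estimator is a two-liner given the normalised weights and the evaluations of $h$ from iteration $t$ (a sketch, assuming numpy arrays):

    import numpy as np

    def sigma2_hat(w_bar, h_vals):
        # N * sum_i w_bar_i^2 * [h(X_i) - sum_j w_bar_j h(X_j)]^2
        pi_h = np.sum(w_bar * h_vals)
        return len(w_bar) * np.sum(w_bar**2 * (h_vals - pi_h)**2)

The resulting $\hat\sigma_t^2$ values are exactly what the combined estimator on the next slide reweights by.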
The Final Estimator

After $T$ iterations of the PMC algorithm, the estimator of $\pi(h)$ is given by
$$\hat\pi^{\mathrm{PMC}}_{N,T}(h) = \sum_{i=1}^N \bar\omega_{i,T}\, h(X_{i,T})$$
or, more efficiently,
$$\sum_{t=1}^T \frac{\hat\sigma_t^{-2}}{\sum_{s=1}^T \hat\sigma_s^{-2}} \sum_{i=1}^N \bar\omega_{i,t}\, h(X_{i,t})$$

How to update $q_t$ from the simulations (up to time $t-1$)?
Outline: Adaptive Population Monte Carlo
  Kullback Divergence
  Adaptive PMC for Mixture Proposals
  (Toy) Examples
  Variance Minimisation
  (Toy) Example Again
Kullback Divergence

We first need a performance criterion.

Kullback Divergence:
$$\arg\min_\theta K\left[\pi \otimes \pi \,\Big\|\, \pi \otimes q_\theta\right] = \arg\min_\theta \iint \log\frac{\pi(x)\,\pi(x')}{\pi(x)\,q_\theta(x, x')}\, \pi(x)\,\pi(x')\,dx\,dx' = \arg\max_\theta \underbrace{\iint \log q_\theta(x, x')\, \pi(x)\,\pi(x')\,dx\,dx'}_{l(\theta)}$$

- $K\left[\pi \otimes \pi \,\|\, \pi \otimes q_\theta\right] = 0$ implies that the weights are constant
- Sequential Monte Carlo interpretation (see Arnaud's talk)
- Usually gives explicit solutions for $\theta$ (see Nicolas' talk)
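Since the weighted pairs $\{(\tilde X_{i,t-1}, X_{i,t}), \bar\omega_{i,t}\}$ can be used to estimate expectations under $\pi \otimes \pi$ (as noted on the integrated-EM slide below), the criterion $l(\theta)$ itself can be approximated at negligible extra cost. A sketch, assuming a vectorised `log_q_theta` supplied by the user:

    import numpy as np

    def l_hat(w_bar, x_prev, x_new, log_q_theta):
        # Monte Carlo estimate of l(theta) = E_{pi x pi}[log q_theta(X, X')]
        # from the weighted pairs {(x_tilde_{i,t-1}, x_{i,t}), w_bar_i}
        return np.sum(w_bar * log_q_theta(x_prev, x_new))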
The estimation of $\theta$ is straightforward when $q_\theta$ belongs to an exponential family.

Example (Independent Gaussian proposal):
$$q_{\mu,\Sigma}(x, x') \propto |\Sigma|^{-1/2} \exp\left[-\tfrac{1}{2}(x' - \mu)^T \Sigma^{-1} (x' - \mu)\right]$$
gives $(\hat\mu, \hat\Sigma) = (E_\pi[X], \operatorname{Cov}_\pi[X])$, which can be estimated by
$$\hat\mu = \sum_{i=1}^N \bar\omega_{i,0}\, X_{i,0}, \qquad \hat\Sigma = \sum_{i=1}^N \bar\omega_{i,0}\, (X_{i,0} - \hat\mu)(X_{i,0} - \hat\mu)^T$$
(the estimate can obviously be improved along the iterations).

Example (Random-Walk Gaussian proposal):
$$q_\Sigma(x, x') \propto |\Sigma|^{-1/2} \exp\left[-\tfrac{1}{2}(x' - x)^T \Sigma^{-1} (x' - x)\right]$$
gives $\hat\Sigma = \operatorname{Cov}_{\pi \otimes \pi}[X' - X] = 2 \operatorname{Cov}_\pi[X]$.
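Both examples reduce to weighted moment matching on the current particle system; a sketch, assuming the particles are stored as an (N, d) array:

    import numpy as np

    def fit_independent_gaussian(x, w_bar):
        # weighted moment matching: mu = sum_i w_i x_i,
        # Sigma = sum_i w_i (x_i - mu)(x_i - mu)^T   (x has shape (N, d))
        mu = np.sum(w_bar[:, None] * x, axis=0)
        xc = x - mu
        Sigma = (w_bar[:, None] * xc).T @ xc
        return mu, Sigma

    def fit_random_walk_gaussian(x, w_bar):
        # for the random-walk proposal, the matched scale is 2 Cov_pi[X]
        _, Sigma = fit_independent_gaussian(x, w_bar)
        return 2.0 * Sigma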
Integrated-EM Formulas for More General Choices of $q_\theta$

If $q_\theta$ is a missing-data type of proposal, i.e. $q_\theta(x, x') = \int f_\theta(x, x', y)\,dy$, we may define an intermediate quantity
$$L_{\theta'}(\theta) = E_{\pi \otimes \pi}\left(E_{f_{\theta'}}\left[\log f_\theta(X, X', Y) \mid X, X'\right]\right)$$
which (using Jensen's inequality) satisfies
$$L_{\theta'}(\theta) \ge L_{\theta'}(\theta') \implies l(\theta) \ge l(\theta')$$

We thus obtain ascent integrated-EM updates, which can be approximated since $\{(\tilde X_{i,t-1}, X_{i,t}), \bar\omega_{i,t}\}_{1 \le i \le N}$ can be used to estimate expectations under $\pi \otimes \pi$. This is used in particular in the information bottleneck algorithm.
Mixture Proposals

In the following we consider
$$q_\alpha(x, x') = \sum_{d=1}^D \alpha_d\, q_d(x, x')$$
where the $q_d$ are fixed transition kernels. Note that the criterion $l(\alpha)$ is then concave.
Integrated EM Recursion on the Proportions

The mapping
$$\Psi(\alpha) = \left(\iint \frac{\alpha_d\, q_d(x, x')}{\sum_{j=1}^D \alpha_j\, q_j(x, x')}\, \pi(x)\,\pi(x')\,dx\,dx'\right)_{1 \le d \le D}$$
defined on the probability simplex
$$S = \left\{\alpha = (\alpha_1, \ldots, \alpha_D);\ \alpha_d \ge 0,\ 1 \le d \le D,\ \text{and}\ \sum_{d=1}^D \alpha_d = 1\right\}$$
is such that $l(\Psi(\alpha)) \ge l(\alpha)$.
Adaptive Mixture PMC

At time $t$ ($t = 1, \ldots, T$):
- (Mixture Sampling) Generate iid $\{K_{i,t}\}_{1 \le i \le N} \sim \mathcal{M}\left(1, (\alpha_{d,t})_{1 \le d \le D}\right)$ and, independently, $X_{i,t} \sim q_{K_{i,t}}(\tilde X_{i,t-1}, \cdot)$
- (IS Weights) Set $\omega_{i,t} = \pi(X_{i,t}) \big/ \sum_{d=1}^D \alpha_{d,t}\, q_d(\tilde X_{i,t-1}, X_{i,t})$
- (Proportions Update) $\alpha_{d,t+1} = \sum_{i=1}^N \bar\omega_{i,t}\, \mathbb{1}_d(K_{i,t})$
- (Resampling) Generate $\{J_{i,t}\}_{1 \le i \le N}$ iid $\mathcal{M}\left(1, (\bar\omega_{i,t})_{1 \le i \le N}\right)$ and set $\tilde X_{i,t} = X_{J_{i,t},t}$
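A minimal sketch of one such iteration (my own illustration, with one-dimensional Gaussian random-walk kernels whose standard deviations sit in `scales`; `log_pi` is the unnormalised log target):

    import numpy as np

    rng = np.random.default_rng(0)

    def mixture_pmc_step(x_tilde, alpha, scales, log_pi):
        # one adaptive mixture PMC iteration with D Gaussian random-walk
        # kernels q_d(x, .) = N(x, scales[d]^2)
        N, D = len(x_tilde), len(alpha)
        k = rng.choice(D, size=N, p=alpha)              # mixture sampling
        x = x_tilde + scales[k] * rng.standard_normal(N)
        # mixture density sum_d alpha_d q_d(x_tilde_i, x_i), shape (N,)
        z = (x[:, None] - x_tilde[:, None]) / scales[None, :]
        q_mix = np.sum(alpha * np.exp(-0.5 * z**2) / (scales * np.sqrt(2 * np.pi)),
                       axis=1)
        log_w = log_pi(x) - np.log(q_mix)               # IS weights
        w_bar = np.exp(log_w - log_w.max()); w_bar /= w_bar.sum()
        alpha_new = np.bincount(k, weights=w_bar, minlength=D)  # proportions update
        x_res = rng.choice(x, size=N, p=w_bar)          # multinomial resampling
        return x_res, alpha_new

Iterating `x, alpha = mixture_pmc_step(x, alpha, scales, log_pi)` from uniform proportions produces weight trajectories of the kind shown in the toy examples below.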
Why is this correct?

$$E\left[\omega_{i,t}\, \mathbb{1}_d(K_{i,t}) \mid \tilde X_{i,t-1}\right] = \alpha_{d,t} \int q_d(\tilde X_{i,t-1}, x')\, \frac{\pi(x')}{\sum_{j=1}^D \alpha_{j,t}\, q_j(\tilde X_{i,t-1}, x')}\,dx'$$

so that
$$E\left[\frac{1}{N}\sum_{i=1}^N \omega_{i,t}\, \mathbb{1}_d(K_{i,t}) \,\Big|\, \{\tilde X_{j,t-1}\}_{1 \le j \le N}\right] \approx \alpha_{d,t} \iint q_d(x, x')\, \frac{\pi(x')}{\sum_{j=1}^D \alpha_{j,t}\, q_j(x, x')}\,dx'\, \pi(x)\,dx$$

that is, the proportions update is a noisy version of the integrated EM mapping $\Psi$ applied to $\alpha_t$.
Example (Independent Proposals with Mixture Target)

Target: $\tfrac{1}{4}\,\mathcal{N}(-1, 0.3)(x) + \tfrac{1}{4}\,\mathcal{N}(0, 1)(x) + \tfrac{1}{2}\,\mathcal{N}(3, 2)(x)$
Proposals: $\mathcal{N}(-1, 0.3)$, $\mathcal{N}(0, 1)$ and $\mathcal{N}(3, 2)$

  t    alpha_{1,t}   alpha_{2,t}   alpha_{3,t}
  1    0.0500000     0.05000000    0.9000000
  2    0.2605712     0.09970292    0.6397259
  6    0.2740816     0.19160178    0.5343166
  10   0.2989651     0.19200904    0.5090259
  16   0.2651511     0.24129039    0.4935585

Table: Weight evolution ($N = 100{,}000$)
Figure: Target and mixture evolution
Example (Random-Walk Proposals)

Target: $\mathcal{N}(0, 1)$
Gaussian random-walk proposals: $q_1(x, x') = f_{\mathcal{N}(x, 0.1)}(x')$, $q_2(x, x') = f_{\mathcal{N}(x, 2)}(x')$ and $q_3(x, x') = f_{\mathcal{N}(x, 10)}(x')$

  t    alpha_{1,t}   alpha_{2,t}   alpha_{3,t}
  1    0.33333       0.33333       0.33333
  2    0.24415       0.43145       0.32443
  3    0.19525       0.52445       0.28031
  4    0.10725       0.72955       0.16324
  5    0.08223       0.83092       0.08691
  6    0.06155       0.88355       0.05490
  7    0.04255       0.92950       0.02795
  8    0.03790       0.93760       0.02450
  9    0.03130       0.94505       0.02365
  10   0.03460       0.94875       0.01665

Table: Evolution of the weights ($N = 100{,}000$)
Variance Minimisation: Can we do better?

Recall that when $h$ is known beforehand, the optimal importance function is given by
$$q^*(x) = \frac{|h(x) - \pi(h)|\, \pi(x)}{\int |h(x') - \pi(h)|\, \pi(x')\,dx'}$$
which may yield a smaller variance than $\hat\pi^{\mathrm{MC}}_N(h)$.

For PMC, it is natural to consider the following objective:

Minimum Variance Criterion:
$$\arg\min_\theta E_{\pi \otimes \pi}\left(\left[h(X') - \pi(h)\right]^2\, \frac{\pi(X')}{q_\theta(X, X')}\right)$$

Note that this criterion is again convex in $\alpha$ for mixture proposals.
There is an ascent update rule for the variance criterion:
$$\Psi(\alpha) = \left(\frac{\nu_h\!\left(\dfrac{\alpha_d\, q_d(x, x')}{\sum_{l=1}^D \alpha_l\, q_l(x, x')}\right)}{\sigma_h^2(\alpha)}\right)_{1 \le d \le D}$$
where the (unnormalised) measure $\nu_h$ is defined as
$$\nu_h(dx, dx') = \frac{\left[h(x') - \pi(h)\right]^2\, \pi(x')}{\sum_{d=1}^D \alpha_d\, q_d(x, x')}\, \pi(dx)\,\pi(dx')$$
and $\sigma_h^2(\alpha) = \nu_h(1)$ is the value of the variance criterion corresponding to $\alpha$. This mapping is such that $\sigma_h^2(\Psi(\alpha)) \le \sigma_h^2(\alpha)$.
Interlude: Convex Puzzles

Kullback Divergence Criterion: $\arg\max_\alpha \int \log\left[\sum_d \alpha_d f_d(x)\right] \nu(dx)$
Ascent Mapping: $\tilde\alpha_i = \int \frac{\alpha_i f_i(x)}{\sum_d \alpha_d f_d(x)}\, \nu(dx)$
Proof: concavity of $\log(x)$ & positivity of the Kullback divergence

Minimum Variance Criterion: $\arg\min_\alpha \int \left[\sum_d \alpha_d f_d(x)\right]^{-1} \nu(dx)$
Ascent Mapping:
$$\tilde\alpha_i = \frac{\int \frac{\alpha_i f_i(x)}{\sum_d \alpha_d f_d(x)} \left[\sum_d \alpha_d f_d(x)\right]^{-1} \nu(dx)}{\int \left[\sum_d \alpha_d f_d(x)\right]^{-1} \nu(dx)}$$
Proof: convexity of $x^{-1}$ & $\sum_d \alpha_d = 1$

Is there any more principled way of finding the second update?
Updating Rule

The empirical version of the previous update is
$$\alpha_{d,t+1} = \frac{\sum_{i=1}^N \bar\omega_{i,t}^2 \left[h(X_{i,t}) - \sum_{j=1}^N \bar\omega_{j,t}\, h(X_{j,t})\right]^2 \mathbb{1}_d(K_{i,t})}{\sum_{i=1}^N \bar\omega_{i,t}^2 \left[h(X_{i,t}) - \sum_{j=1}^N \bar\omega_{j,t}\, h(X_{j,t})\right]^2}$$
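In code, this update differs from the Kullback one only in the quantity attached to each particle: each kernel receives the share of the estimated variance carried by its own particles. A sketch, with the same conventions as the mixture PMC step above:

    import numpy as np

    def variance_alpha_update(w_bar, h_vals, k, D):
        # variance-criterion proportions update: kernel d receives the share
        # of the estimated variance sum_i w_i^2 [h(X_i) - pi_hat(h)]^2
        # carried by its particles {i : K_i = d}, then normalise
        pi_h = np.sum(w_bar * h_vals)
        contrib = w_bar**2 * (h_vals - pi_h)**2
        alpha = np.bincount(k, weights=contrib, minlength=D)
        return alpha / alpha.sum()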
Example: $\mathcal{N}(0, 1)$ target, $h(x) = x$, and $D = 3$ independent proposals:
- $\mathcal{N}(0, 1)$
- Cauchy distribution
- Symmetrised $\mathcal{G}a(0.5, 0.5)$ (this is the optimal choice, $q^*$)

  t    Estimation   alpha_{1,t}   alpha_{2,t}   alpha_{3,t}   Variance
  1    .00126       .1            .8            .1            .982
  2    .00061       .112          .715          .173          .926
  3    -.00124      .116          .607          .276          .863
  5    .00248       .108          .357          .534          .742
  10   .00332       .049          .062          .888          .650
  15   .00284       .026          .015          .958          .640
  20   .00062       .019          .004          .976          .638

Table: PMC estimates for $N = 100{,}000$ and $T = 20$
PMC for ECOSSTAT

How does this work in real life? The ECOSSTAT (Measuring cosmological parameters from large heterogeneous surveys) project is an interdisciplinary study in which we use Monte Carlo methods to infer cosmological parameters from several sets of measurements.

PMC is well suited in this context because:
- Evaluation of $\pi(x)$ is prohibitively long, but PMC can be (mostly) parallelised (in contrast to MCMC)
- Variance estimation is feasible with PMC, which is important to cosmologists
- Because of the physical nature of the parameters, their values are known to some extent, which is more or less required for techniques based on importance sampling
A (Somewhat) Idealised Example

Figure: Target (Gaussian ellipsoid) and mixture evolution

We use the following proposals: random-walk Gaussian proposals with covariance $\Sigma$, $2\Sigma$ and $4\Sigma$; independent Gaussian proposals with covariance $\Sigma/2$, $\Sigma$ and $2\Sigma$; and the uniform distribution.
Remark: we need at least one $q_d$, with non-zero $\alpha_d$, for which the IS variance is finite.

Results

  Scheme      Mixture proportions (x100)   Norm. ESS   Variance of h
  Uniform     (0 0 0 0 0 0 100)            0.06        3.0
  RW 2Sigma   (0 99 0 0 0 0 1)             0.09        3.1
  Indep. Sigma (0 0 0 0 99 0 1)            0.53        1.2
  Kullback    (0 0 2 40 45 12 1)           0.63        0.9
  Variance    (3 0 0 0 15 74 8)            0.57        0.5

Normalised ESS (Effective Sample Size): $\left(N \sum_{i=1}^N \bar\omega_{i,t}^2\right)^{-1}$
Function $h$ is $h(x) = (1\ 1)^T x$
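The normalised ESS diagnostic used in the table is immediate from the weights (a sketch):

    import numpy as np

    def normalised_ess(w_bar):
        # (N * sum_i w_bar_i^2)^{-1}: equals 1 for uniform weights and
        # 1/N when a single particle carries all the weight
        return 1.0 / (len(w_bar) * np.sum(w_bar**2))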
Typical Results with $N = 10{,}000$ Particles and $T = 50$ Iterations

Figure: Values of $\alpha_{d,t}$ as a function of $t$ (Kullback criterion), for the seven proposals: RW $\Sigma$, RW $2\Sigma$, RW $4\Sigma$, Indep. $\Sigma/2$, Indep. $\Sigma$, Indep. $2\Sigma$, and Uniform
Typical Results (contd.)

Figure: Normalised ESS as a function of $t$
Figure: Estimated variance for function $h$ as a function of $t$

Due to the stability of the asymptotic updates, the algorithm performs well even with moderate sample sizes.
References

Cappé, Guillin, Marin & Robert. Population Monte Carlo. J. Comput. Graph. Statist., 13(4):907-929, 2004.
Douc, Guillin, Marin & Robert. Convergence of adaptive mixtures of importance sampling schemes. Ann. Statist., 35(1), 2007 (to appear).
Douc, Guillin, Marin & Robert. Minimum variance importance sampling via population Monte Carlo. Technical report, 2005.

...Advertising below this line...

Postdocs interested in the ECOSSTAT project, please contact Christian P. Robert and/or myself.
People interested in adaptive Monte Carlo, please check the http://www.adapmc07.enst.fr/ workshop (June 2007).