arXiv: v1 [stat.CO] 23 Oct 2007


Adaptive Importance Sampling in General Mixture Classes

Olivier Cappé, LTCI, ENST & CNRS, Paris
Randal Douc, INT, Evry
Arnaud Guillin, École Centrale & LATP, CNRS, Marseille
Jean-Michel Marin, INRIA Futurs, Project select, Laboratoire de Mathématiques, Université Paris-Sud
Christian P. Robert, CEREMADE, Université Paris Dauphine & CREST, INSEE

October 23, 2018

Abstract

In this paper, we propose an adaptive algorithm that iteratively updates both the weights and component parameters of a mixture importance sampling density so as to optimise the importance sampling performances, as measured by an entropy criterion. The method is shown to be applicable to a wide class of importance sampling densities, which includes in particular mixtures of multivariate Student t distributions. The performances of the proposed scheme are studied on both artificial and real examples, highlighting in particular the benefit of a novel Rao-Blackwellisation device which can be easily incorporated in the updating scheme.

Keywords: Importance sampling mixtures, adaptive Monte Carlo, Population Monte Carlo, entropy.

1 Introduction

In recent years, there has been a renewed interest in using Monte Carlo procedures based on Importance Sampling (abbreviated to IS in the following) for inference tasks. Compared to alternatives such as Markov Chain Monte Carlo methods, the main appeal of IS procedures lies in the possibility of developing parallel implementations, which becomes more and more important with the generalisation of multiple core machines and computer clusters. Importance sampling procedures are also attractive in that they allow for an easy assessment of the Monte Carlo error and, correlatively, for the development of learning mechanisms. In many applications, the fact that IS procedures may be tuned, by choosing an appropriate IS density, to minimise the approximation error for a specific function of interest is also crucial.
This work has been supported by the Agence Nationale de la Recherche (ANR, 212, rue de Bercy, Paris) through the project Adap'MC. Both last authors are grateful to the participants to the BIRS 07w5079 meeting on Bioinformatics, Genetics and Stochastic Computation: Bridging the Gap, Banff, for their helpful comments. The last author also acknowledges a helpful discussion with Geoff McLachlan.

On the other hand, the shortcomings of IS approaches are also well-known, including a poor scaling to highly multidimensional problems and an acute sensitivity to the choice of the IS density, combined with the fact that it is impossible to come up with a universally efficient IS density. Adaptive Monte Carlo is a natural remedy for the latter class of difficulties: it gradually improves the IS density based on some form of Monte Carlo approximation. While there exists a wide variety of solutions in the literature (see, e.g., Robert and Casella, 2004, Chapter 14), this paper concentrates on the construction of iterated importance sampling schemes, or population Monte Carlo.

Population Monte Carlo (or PMC) was introduced by Cappé et al. (2004) as a repeated Sampling Importance Resampling (SIR) procedure: once a sample $(X_1,\ldots,X_N)$ is produced by SIR, it provides an approximation to the target distribution $\pi$ and can be used as a stepping stone towards a better approximation to $\pi$. More precisely, if $(X_1,\ldots,X_N)$ is a sample approximately distributed from $\pi$, it may be perturbed stochastically using an arbitrary Markov transition kernel $q(x,x')$ so as to produce a new sample $(X'_1,\ldots,X'_N)$. Conducting a resampling step based on the IS weights $\omega_i = \pi(X'_i)/q(X_i,X'_i)$, we then produce a new sample $(\tilde X_1,\ldots,\tilde X_N)$ that also constitutes an approximation to the target distribution $\pi$. Repeating this scheme in an iterative manner is however only of interest if samples that have been previously simulated are used to update (or adapt) the kernel $q(x,x')$: keeping the same kernel $q$ over iterations does not modify the statistical properties of the sample produced at each iteration and, therefore, reduces the efficiency of the approximation by introducing extra Monte Carlo variance. Failing to improve upon the choice of the kernel $q$ thus cancels the appeal of using several iterations, when compared with one single IS draw with the same total sample size (see Douc et al., 2007a).
Population Monte Carlo therefore requires an updating scheme that takes advantage of previously generated samples so that it improves the choice of the IS transition kernel against a given measure of efficiency. In the approach advocated by Douc et al. (2007a), one considers a transition kernel $q_\alpha$ consisting of a mixture of fixed transition kernels

$$q_\alpha(x,x') = \sum_{d=1}^{D} \alpha_d\, q_d(x,x'), \qquad \sum_{d=1}^{D} \alpha_d = 1, \tag{1}$$

whose weights $\alpha_1,\ldots,\alpha_D$ are tuned adaptively. The proposed adaptation procedure aims at minimising the deviance or entropy criterion between the kernel $q_\alpha$ and the target $\pi$,

$$E(\pi,q_\alpha) = E^X_\pi\left[ D\big(\pi \,\|\, q_\alpha(X,\cdot)\big) \right], \tag{2}$$

where $D(p\,\|\,q) = \int \log\{p(x)/q(x)\}\, p(x)\,dx$ denotes the Kullback-Leibler divergence (also called relative entropy), and where the expectation is taken under the target distribution $X \sim \pi$ since the kernels $q_d(x,x')$ depend on the starting value $x$. In the sequel, we refer to the criterion in (2) as the entropy criterion as it is obviously related to the performance measure used in the cross-entropy method of Rubinstein and Kroese (2004). In Douc et al. (2007b), a version of this algorithm was developed to minimise the asymptotic variance of the IS procedure, for a specific function of interest, in lieu of the entropy criterion.

A major limitation in the approaches of both Douc et al. (2007a,b) is that the proposal kernels $q_d$ remain fixed over the iterative process while only the mixture weights $\alpha_d$ get improved. In the present contribution, we remove this limitation by extending the framework of Douc et al. (2007a) to allow for the adaptation of IS densities of the form

$$q_{(\alpha,\theta)}(x) = \sum_{d=1}^{D} \alpha_d\, q_d(x;\theta_d), \tag{3}$$

with respect to both the weights $\alpha_d$ and the internal parameters $\theta_d$ of the component densities. In theory, as explained through the example considered in Section 4, the proposed adaptive scheme, which relies on an integrated EM update mechanism, is applicable to more general families of latent-data IS densities. This proposed extension and, in particular, the introduction of (multidimensional) scaling parameters raises challenging robustness issues, for which we propose a Rao-Blackwellisation scheme that empirically appears to be very efficient while inducing a modest additional algorithmic complexity.

Note that we consider here the generic entropy criterion of Douc et al. (2007a) rather than the function-specific variance minimisation objective of Douc et al. (2007b). This choice is motivated by the recognition that in most applications, the IS density is expected to perform well for a range of typical functions of interest rather than for a specific target function $h$. In addition, the generalisation of the approach of Douc et al. (2007b) to a class of mixture IS densities that are parameterised by more than the weights remains an open question (see also Section 5).

A second remark is that, in contrast to the previously cited works, we consider in this paper only global independent IS densities of the form given in (3). Thus the proposed scheme really is an iterated importance sampling scheme, contrary to what happens when using more general IS transition kernels as in (1). Obviously, resorting to moves that depend on the current sample is initially attractive because it allows for some local moves as opposed to the global exploration provided by independent IS densities. However, the fact that the entropy criterion in (2) is a global measure of fit tends to modify the parameters of each transition kernel depending on its average performance over the whole sample, rather than locally. In addition, structurally imposing a dependence on the points sampled at the previous iteration induces some extra variability which can be detrimental when more parameters are to be estimated.
The paper is organised as follows. In Section 2, we develop a generic updating scheme for independent IS mixtures (3), establishing that the integrated EM argument of Douc et al. (2007a) remains valid in our setting. Note once again that the integrated EM update mechanism we uncover in this paper is applicable to all missing data representations of the proposal kernel, and not only to finite mixtures. In Section 3, we consider the case of Gaussian mixtures, which naturally extend the case of mixtures of Gaussian random walks with fixed covariance structure considered in Douc et al. (2007a,b). In Section 4, we show that the algorithm also applies to mixtures of multivariate t distributions with the continuous scale mixing representation used in Peel and McLachlan (2000). Section 5 provides some conclusive remarks about the performances of this approach as well as possible extensions.

2 Updating the IS density

2.1 Entropies and perplexity

When considering independent mixture IS densities of the form (3), the entropy criterion $E$ defined in (2) reduces to the Kullback-Leibler divergence between the target density $\pi$ and the mixture $q_{(\alpha,\theta)}$:

$$E(\pi,q_{(\alpha,\theta)}) = D(\pi\,\|\,q_{(\alpha,\theta)}) = \int \log\left( \frac{\pi(x)}{\sum_{d=1}^{D} \alpha_d\, q_d(x;\theta_d)} \right) \pi(x)\,dx. \tag{4}$$

As usual in applications of the IS approach to Bayesian inference, the target density $\pi$ is known up to a normalisation constant only, and we will focus on the self-normalised version of IS, which only requires the knowledge of an unnormalised version $\pi_{\mathrm{unn}}$ of $\pi$ (Geweke, 1989). As a side comment, note that while $E(\pi,q_{(\alpha,\theta)})$ is a convex function of the weights $\alpha_1,\ldots,\alpha_D$ (Douc et al., 2007a), it is generally not so when also optimising with respect to the component parameters $\theta_1,\ldots,\theta_D$.

It is well known that if one considers a function $h$ of interest, the self-normalised IS estimation of its expectation $\pi(h)$,

$$\hat\pi(h) = \sum_{i=1}^{N} \bar\omega_i\, h(X_i), \qquad \bar\omega_i = \frac{\pi(X_i)/q_{(\alpha,\theta)}(X_i)}{\sum_{j=1}^{N} \pi(X_j)/q_{(\alpha,\theta)}(X_j)}, \qquad X_i \sim q_{(\alpha,\theta)},$$

has an asymptotic variance of

$$\upsilon(h) = \int \{h(x)-\pi(h)\}^2\, \frac{\pi^2(x)}{q_{(\alpha,\theta)}(x)}\,dx,$$

assuming that $\int (1+h^2(x))\,\pi^2(x)/q_{(\alpha,\theta)}(x)\,dx < \infty$. In addition, this asymptotic variance may be consistently estimated using the IS sample itself as $N \sum_{i=1}^{N} \bar\omega_i^2\, \{h(X_i)-\hat\pi(h)\}^2$ (Geweke, 1989). Obviously, for a given function $h$, there is no direct link between $\upsilon(h)$ and the entropy criterion in (4), a fact that motivated the work of Douc et al. (2007b). However, it is easily shown that

$$\sup_{\{h:\ |h-\pi(h)|\le M\}} \upsilon(h) = M^2 \int \frac{\pi^2(x)}{q_{(\alpha,\theta)}(x)}\,dx,$$

where the latter integral term is lower and upper bounded by $1$ and $\exp[E(\pi,q_{(\alpha,\theta)})]$ respectively, by direct applications of Jensen's inequality. Hence minimising $E(\pi,q_{(\alpha,\theta)})$ indeed reduces the worst case performance of the IS approach, at least for bounded functions. In addition, rewriting

$$\exp\left[-E(\pi,q_{(\alpha,\theta)})\right] = \exp\left( -\int \log\frac{\pi_{\mathrm{unn}}(x)}{q_{(\alpha,\theta)}(x)}\, \pi(x)\,dx \right) \left( \int \pi_{\mathrm{unn}}(x)\,dx \right)$$

and estimating the first integral by self-normalised IS, as $\sum_{i=1}^{N} \bar\omega_i \log\{\pi_{\mathrm{unn}}(X_i)/q_{(\alpha,\theta)}(X_i)\}$, and the second one by classical IS, as $N^{-1}\sum_{i=1}^{N} \pi_{\mathrm{unn}}(X_i)/q_{(\alpha,\theta)}(X_i)$, shows that $\exp(H_N)/N$, where $H_N = -\sum_{i=1}^{N} \bar\omega_i \log \bar\omega_i$ is the Shannon entropy of the normalised IS weights, is an estimator of $\exp[-E(\pi,q_{(\alpha,\theta)})]$, the inverse of the term $\exp[E(\pi,q_{(\alpha,\theta)})]$. Thus, minimisation of the entropy criterion is connected with the maximisation of $\exp(H_N)/N$, where $H_N$ is the entropy of the IS weights, a frequently used criterion for assessing the quality of an IS sample together with the so-called Effective Sample Size (ESS) criterion (Chen and Liu, 1996, Doucet et al., 2001, Cappé et al., 2005). In the following, we refer to $\exp(H_N)/N$ as the normalised perplexity of the IS weights, following the terminology in use in the field of natural language processing.
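The two weight diagnostics mentioned above can be computed directly from unnormalised log-weights $\log\{\pi_{\mathrm{unn}}(X_i)/q_{(\alpha,\theta)}(X_i)\}$. The following is a minimal sketch, assuming NumPy; the function names are illustrative and not from the paper, and the ESS is taken in its usual form $1/\sum_i \bar\omega_i^2$, which is an assumption since the text does not spell out the formula.

```python
import numpy as np

def normalised_weights(log_w):
    """Self-normalise importance weights given unnormalised log-weights."""
    log_w = np.asarray(log_w, dtype=float)
    w = np.exp(log_w - np.max(log_w))   # subtract the max for numerical stability
    return w / w.sum()

def normalised_perplexity(log_w):
    """exp(H_N)/N, with H_N the Shannon entropy of the normalised weights."""
    w = normalised_weights(log_w)
    nz = w > 0                          # convention: 0 * log(0) = 0
    entropy = -np.sum(w[nz] * np.log(w[nz]))
    return np.exp(entropy) / w.size

def normalised_ess(log_w):
    """Effective sample size 1 / sum(w_i^2), divided by N (assumed definition)."""
    w = normalised_weights(log_w)
    return 1.0 / (w.size * np.sum(w ** 2))
```

Both quantities lie between $1/N$ (a single point carries all the weight) and $1$ (uniform weights), which is why they are convenient to compare across sample sizes.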
2.2 Integrated updates

Let $\alpha^t = (\alpha^t_1,\ldots,\alpha^t_D)$ and $\theta^t = (\theta^t_1,\ldots,\theta^t_D)$ denote, respectively, the mixture weights and the component parameters at the $t$-th iteration of the algorithm (where $t = 1,\ldots,T$). In order to update the parameters $(\alpha^t,\theta^t)$ of the independent IS density (3), we will take advantage of the latent variable structure that underlies the objective function (4). The resulting algorithm (still theoretical at this stage, as it involves integration with respect to $\pi$) may be interpreted as an integrated EM (Expectation-Maximisation) scheme that we now describe.

Given that minimising (4) in $(\alpha,\theta)$ is equivalent to maximising

$$\int \log\left( \sum_{d=1}^{D} \alpha_d\, q_d(x;\theta_d) \right) \pi(x)\,dx,$$

we are facing a task that formally resembles standard mixture maximum likelihood estimation, but with an integration with respect to $\pi$ replacing the empirical sum over observations. As usual in mixtures, the latent variable $Z$ is the component indicator, with values in $\{1,\ldots,D\}$, such that the joint density $f$ of $x$ and $z$ satisfies $f(z) = \alpha_z$ and $f(x\,|\,z) = q_z(x;\theta_z)$, which produces (3) as the marginal in $x$. At iteration $t$ of our algorithm, we can therefore take advantage of this latent variable representation by considering the expected complete log-likelihood

$$E^X_\pi\left[ E^Z_{(\alpha^t,\theta^t)}\{\log(\alpha_Z\, q_Z(X;\theta_Z)) \,|\, X\} \right],$$

where the inner expectation is computed under the conditional distribution of $Z$ for the current value of the parameters, $(\alpha^t,\theta^t)$, i.e.

$$f(z\,|\,x) = \alpha^t_z\, q_z(x;\theta^t_z) \Big/ \sum_{d=1}^{D} \alpha^t_d\, q_d(x;\theta^t_d).$$

The updating mechanism in our algorithm then corresponds to setting the new parameters $(\alpha^{t+1},\theta^{t+1})$ equal to

$$(\alpha^{t+1},\theta^{t+1}) = \arg\max_{(\alpha,\theta)} E^X_\pi\left[ E^Z_{(\alpha^t,\theta^t)}\{\log(\alpha_Z\, q_Z(X;\theta_Z)) \,|\, X\} \right],$$

as in a regular EM estimation of the parameters of a mixture, except for the extra expectation over $X$. The convexity argument that shows that EM increases the objective function at each step also applies in this setup. Solving the maximisation program, we have

$$(\alpha^{t+1},\theta^{t+1}) = \arg\max_{(\alpha,\theta)} \left( E^X_\pi\left[ E^Z_{(\alpha^t,\theta^t)}\{\log(\alpha_Z) \,|\, X\} \right] + E^X_\pi\left[ E^Z_{(\alpha^t,\theta^t)}\{\log(q_Z(X;\theta_Z)) \,|\, X\} \right] \right).$$

If we define $g_1(\alpha) = E^X_\pi\left[ E^Z_{(\alpha^t,\theta^t)}\{\log(\alpha_Z) \,|\, X\} \right]$ and $g_2(\theta) = E^X_\pi\left[ E^Z_{(\alpha^t,\theta^t)}\{\log(q_Z(X;\theta_Z)) \,|\, X\} \right]$, we get

$$(\alpha^{t+1},\theta^{t+1}) = \arg\max_{(\alpha,\theta)} \big(g_1(\alpha)+g_2(\theta)\big) = \left( \arg\max_{\alpha} g_1(\alpha),\ \arg\max_{\theta} g_2(\theta) \right).$$

Therefore, setting

$$\rho_d(X;\alpha,\theta) = \alpha_d\, q_d(X;\theta_d) \Big/ \sum_{l=1}^{D} \alpha_l\, q_l(X;\theta_l), \tag{5}$$

we need to solve

$$\alpha^{t+1} = \arg\max_{\alpha} E^X_\pi\left[ \sum_{d=1}^{D} \rho_d(X;\alpha^t,\theta^t)\log(\alpha_d) \right],$$

and, for $d \in \{1,\ldots,D\}$, we obtain

$$\alpha^{t+1}_d = E^X_\pi\left[ \rho_d(X;\alpha^t,\theta^t) \right]. \tag{6}$$

Similarly,

$$\theta^{t+1} = \arg\max_{\theta} E^X_\pi\left[ \sum_{d=1}^{D} \rho_d(X;\alpha^t,\theta^t)\log(q_d(X;\theta_d)) \right],$$

and, for $d \in \{1,\ldots,D\}$,

$$\theta^{t+1}_d = \arg\max_{\theta_d} E^X_\pi\left[ \rho_d(X;\alpha^t,\theta^t)\log(q_d(X;\theta_d)) \right]. \tag{7}$$
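The step leading to the closed form (6) is the usual constrained maximisation over the simplex. A short worked derivation is sketched below, where the shorthand $c_d$ and the multiplier $\lambda$ are notation introduced here for illustration and do not appear in the original text.

```latex
% c_d: shorthand (introduced here) for the expected posterior probability in (5)
c_d = E^X_\pi\!\left[\rho_d(X;\alpha^t,\theta^t)\right],
\qquad
\sum_{d=1}^{D} c_d = E^X_\pi\Big[\sum_{d=1}^{D}\rho_d(X;\alpha^t,\theta^t)\Big] = 1 .

% Lagrangian of  max_alpha  sum_d c_d log(alpha_d)  subject to  sum_d alpha_d = 1
\mathcal{L}(\alpha,\lambda) = \sum_{d=1}^{D} c_d \log\alpha_d
  - \lambda\Big(\sum_{d=1}^{D}\alpha_d - 1\Big),
\qquad
\frac{\partial\mathcal{L}}{\partial\alpha_d} = \frac{c_d}{\alpha_d} - \lambda = 0
\;\Longrightarrow\; \alpha_d = \frac{c_d}{\lambda} .

% The constraint fixes the multiplier and yields (6)
\sum_{d=1}^{D}\alpha_d = 1
\;\Longrightarrow\; \lambda = \sum_{d=1}^{D} c_d = 1
\;\Longrightarrow\; \alpha^{t+1}_d = c_d = E^X_\pi\!\left[\rho_d(X;\alpha^t,\theta^t)\right].
```

The objective is concave in $\alpha$ on the simplex, so this stationary point is the maximiser.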

As in the regular mixture estimation problem, the resolution of this maximisation program ultimately depends on the shape of the density $q_d$. If $q_d$ belongs to an exponential family, it is easy to derive a closed-form solution, which however involves expectations under $\pi$. Section 3 provides an illustration of this fact in the Gaussian case, while the non-exponential Student's t case is considered in Section 4.

2.3 Approximate updates

As argued before, the adaptivity of the proposed procedure is achieved by updating the parameters based on the previously simulated sample. We thus start the PMC algorithm by arbitrarily fixing the mixture parameters $(\alpha^0,\theta^0)$, and we then sample from the resulting proposal $\sum_{d=1}^{D} \alpha^0_d\, q_d(x;\theta^0_d)$ to obtain our initial sample $(X_{i,0})_{1\le i\le N}$, associated with the latent variables $(Z_{i,0})_{1\le i\le N}$ that indicate from which component of the mixture the corresponding $(X_{i,0})_{1\le i\le N}$ have been generated. From this stage, we proceed recursively. Starting at time $t$ from a sample $(X_{i,t})_{1\le i\le N}$, associated with $(Z_{i,t})_{1\le i\le N}$ and with $(\alpha^{t,N},\theta^{t,N})$, we denote by $\bar\omega_{i,t}$ the normalised importance weights of the sample point $X_{i,t}$:

$$\bar\omega_{i,t} = \frac{\pi(X_{i,t})}{\sum_{d=1}^{D} \alpha^{t,N}_d\, q_d(X_{i,t};\theta^{t,N}_d)} \Bigg/ \sum_{j=1}^{N} \frac{\pi(X_{j,t})}{\sum_{d=1}^{D} \alpha^{t,N}_d\, q_d(X_{j,t};\theta^{t,N}_d)}. \tag{8}$$

To approximate (6) and (7), Douc et al. (2007a) proposed the following update rule:

$$\alpha^{t+1,N}_d = \sum_{i=1}^{N} \bar\omega_{i,t}\, \mathbb{1}\{Z_{i,t}=d\}, \qquad \theta^{t+1,N}_d = \arg\max_{\theta_d} \sum_{i=1}^{N} \bar\omega_{i,t}\, \mathbb{1}\{Z_{i,t}=d\} \log\{q_d(X_{i,t};\theta_d)\}. \tag{9}$$

The computational cost of this update is of order $N$ whatever the number $D$ of components is, since the weight and the parameter of each component are updated based only on the points that were actually generated from this component. However, this observation also suggests that (9) may be highly variable when $N$ is small and/or $D$ becomes larger.
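The weight part of the update rule (9) amounts to summing the normalised importance weights within each component label. The following is a minimal sketch, assuming NumPy and zero-based component labels; the function name and the toy numbers are hypothetical and chosen only to show what happens when a component generates no point.

```python
import numpy as np

def basic_alpha_update(w_bar, z, n_components):
    """Weight update in (9): sum the normalised importance weights of the
    points that were drawn from each component (labels z in 0..D-1)."""
    return np.bincount(z, weights=w_bar, minlength=n_components)

# Hypothetical toy sample: four points, three components.
w_bar = np.array([0.1, 0.2, 0.3, 0.4])   # normalised weights, as in (8)
z = np.array([0, 2, 2, 0])               # component that generated each point
alpha_new = basic_alpha_update(w_bar, z, 3)
```

In this toy case the second component generated no point and its updated weight is exactly zero, which illustrates the variability of (9) for small $N$ or large $D$ noted above.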
To make the update more robust, we propose a simple Rao-Blackwellisation step that consists in replacing $\mathbb{1}\{Z_{i,t}=d\}$ with its conditional expectation given $X_{i,t}$, that is, $\rho_d(X_{i,t};\alpha^{t,N},\theta^{t,N})$:

$$\alpha^{t+1,N}_d = \sum_{i=1}^{N} \bar\omega_{i,t}\, \rho_d(X_{i,t};\alpha^{t,N},\theta^{t,N}), \qquad \theta^{t+1,N}_d = \arg\max_{\theta_d} \sum_{i=1}^{N} \bar\omega_{i,t}\, \rho_d(X_{i,t};\alpha^{t,N},\theta^{t,N}) \log\{q_d(X_{i,t};\theta_d)\}. \tag{10}$$

Examining (5) indicates why the evaluation of the posterior probabilities $\rho_d(X_{i,t};\alpha^{t,N},\theta^{t,N})$ does not represent a significant additional computation cost in the PMC scheme, given that the denominator of this expression has already been computed when evaluating the IS weights according to (8). The most significant difference between (9) and (10) is that, with the latter, all points contribute to the updating of the $d$-th component, for an overall cost proportional to $D\,N$. Note however that in many applications of interest, the most significant computational cost is associated with the evaluation of $\pi$, so that the cost of the update is mostly negligible, even with the Rao-Blackwellised version.

Convergence of the estimated updated parameters as $N$ increases can be established using the same approach as in Douc et al. (2007a,b), relying mainly on the convergence property of triangular arrays of random variables (see Theorem A.1 in Douc et al., 2007a). For the Rao-Blackwellised version, assuming that for all $\theta$'s, $\pi(q_d(\cdot;\theta_d) = 0) = 0$, that for all $\alpha$'s and $\theta$'s, $\rho_d(\cdot;\alpha,\theta)\log q_d(\cdot;\theta_d) \in L^1(\pi)$, and some (uniform in $x$) regularity conditions on $q_d(x;\theta)$ viewed as a function of $\theta$, yields

$$\alpha^{t+1,N}_d \xrightarrow{P} \alpha^{t+1}_d, \qquad \theta^{t+1,N}_d \xrightarrow{P} \theta^{t+1}_d$$

when $N$ goes to infinity. Note that we do not expand on the regularity conditions imposed on $q_d$ since, for the algorithm to be efficient, we definitely need a closed-form expression for the parameter updates. It is then easier to deal with the convergence of the approximation of these update formulas on a case-by-case basis, as will be seen for instance in the following Gaussian example.

3 The Gaussian mixture case

As a first example, we consider the case of $p$-dimensional Gaussian mixture IS densities of the form

$$q_d(X;\theta_d) = \{(2\pi)^p |\Sigma_d|\}^{-1/2} \exp\left\{ -\frac{1}{2} (X-\mu_d)^T \Sigma_d^{-1} (X-\mu_d) \right\},$$

where $\theta_d = (\mu_d,\Sigma_d)$ denotes the parameters of the $d$-th Gaussian component density. This parametrisation of the IS density provides a general framework for approximating multivariate targets $\pi$, and the corresponding adaptive algorithm is a straightforward instance of the general framework discussed in the previous section.

3.1 Update formulas

The integrated update formulas are obtained as the solution of

$$\theta^{t+1}_d = \arg\min_{\theta_d} E^X_\pi\left[ \rho_d(X;\alpha^t,\theta^t) \left( \log|\Sigma_d| + (X-\mu_d)^T \Sigma_d^{-1} (X-\mu_d) \right) \right].$$

It is straightforward to check that the infimum is reached when, for $d \in \{1,\ldots,D\}$,

$$\mu^{t+1}_d = \frac{E^X_\pi\left[ \rho_d(X;\alpha^t,\theta^t)\, X \right]}{E^X_\pi\left[ \rho_d(X;\alpha^t,\theta^t) \right]}
\qquad\text{and}\qquad
\Sigma^{t+1}_d = \frac{E^X_\pi\left[ \rho_d(X;\alpha^t,\theta^t)\,(X-\mu^{t+1}_d)(X-\mu^{t+1}_d)^T \right]}{E^X_\pi\left[ \rho_d(X;\alpha^t,\theta^t) \right]}.$$

At iteration $t+1$ of the PMC algorithm, both the numerator and the denominator of each of the above expressions are approximated using self-normalised importance sampling, yielding the following empirical update equations for the basic updating strategy,

$$\begin{aligned}
\alpha^{t+1,N}_d &= \sum_{i=1}^{N} \bar\omega_{i,t}\, \mathbb{1}\{Z_{i,t}=d\}, \\
\mu^{t+1,N}_d &= \frac{\sum_{i=1}^{N} \bar\omega_{i,t}\, X_{i,t}\, \mathbb{1}\{Z_{i,t}=d\}}{\sum_{i=1}^{N} \bar\omega_{i,t}\, \mathbb{1}\{Z_{i,t}=d\}}, \\
\Sigma^{t+1,N}_d &= \frac{\sum_{i=1}^{N} \bar\omega_{i,t}\, (X_{i,t}-\mu^{t+1,N}_d)(X_{i,t}-\mu^{t+1,N}_d)^T\, \mathbb{1}\{Z_{i,t}=d\}}{\sum_{i=1}^{N} \bar\omega_{i,t}\, \mathbb{1}\{Z_{i,t}=d\}},
\end{aligned} \tag{11}$$

and

$$\begin{aligned}
\alpha^{t+1,N}_d &= \sum_{i=1}^{N} \bar\omega_{i,t}\, \rho_d(X_{i,t};\alpha^{t,N},\theta^{t,N}), \\
\mu^{t+1,N}_d &= \frac{\sum_{i=1}^{N} \bar\omega_{i,t}\, X_{i,t}\, \rho_d(X_{i,t};\alpha^{t,N},\theta^{t,N})}{\sum_{i=1}^{N} \bar\omega_{i,t}\, \rho_d(X_{i,t};\alpha^{t,N},\theta^{t,N})}, \\
\Sigma^{t+1,N}_d &= \frac{\sum_{i=1}^{N} \bar\omega_{i,t}\, (X_{i,t}-\mu^{t+1,N}_d)(X_{i,t}-\mu^{t+1,N}_d)^T\, \rho_d(X_{i,t};\alpha^{t,N},\theta^{t,N})}{\sum_{i=1}^{N} \bar\omega_{i,t}\, \rho_d(X_{i,t};\alpha^{t,N},\theta^{t,N})},
\end{aligned} \tag{12}$$

for the Rao-Blackwellised scheme. Note that, as discussed at the end of Section 2, in the Gaussian case the convergence of the parameter update can be established by assuming only that $\rho_d(x;\alpha,\theta)\,\|x\|^2$ is integrable with respect to $\pi$.

3.2 A simulation experiment

To illustrate the results of the algorithm presented above, we consider a toy example in which the target density consists of a mixture of two multivariate Gaussian densities. The appeal of this example is that it is sufficiently simple to allow for an explicit characterisation of the attractive points for the adaptive procedure, while still illustrating the variety of situations found in more realistic applications. In particular, the model contains an attractive point that does not correspond to the global minimum of the entropy criterion, as well as some regions of attraction that can eventually lead to a failure of the algorithm. The results obtained on this example also illustrate the improvement brought by the Rao-Blackwellised update formulas in (12).

The target $\pi$ is a mixture of two $p$-dimensional Gaussian densities such that

$$\pi(x) = 0.5\,N(x; -s\,u_p, I_p) + 0.5\,N(x; s\,u_p, I_p),$$

where $u_p$ is the $p$-dimensional vector whose coordinates are equal to 1 and $I_p$ stands for the identity matrix. In the sequel, we focus on the case where $p = 10$ and $s = 2$. Note that one should not be misled by the image given by the marginal densities of $\pi$: in the ten-dimensional space, the two components of $\pi$ are indeed very far from one another. It is for instance straightforward to check that the Kullback-Leibler divergence between the two components of $\pi$, $D\{N(-s\,u_p,I_p)\,\|\,N(s\,u_p,I_p)\}$, is equal to $\frac{1}{2}\|2s\,u_p\|^2 = 2s^2p$, that is, 80 in the case under consideration.
In particular, if we were to use one of the components of the mixture as an IS density for the other, we know from the arguments exposed at the beginning of Section 2 that the normalised perplexity of the weights will eventually tend to $\exp(-80)$. This number is so small that, for any feasible sample size, using one of the component densities of $\pi$ as an IS instrumental density for the other component, or even for $\pi$ itself, can only provide useless biased estimates.

The initial IS density $q_0$ is chosen here as the isotropic ten-dimensional Gaussian density with a covariance matrix of $5I_p$. The performances of $q_0$ as an importance sampling density, when compared to various other alternatives, are fully detailed in Table 1 below, but the general comment is that it corresponds to a poor initial guess which would provide highly variable results when used with any sample size under 50,000. In addition to figures related to the initial IS density $q_0$, Table 1 also reports performances obtained with the best fitting Gaussian IS density (with respect to the entropy criterion), which is straightforwardly obtained as the centred Gaussian density whose covariance matrix matches the one of $\pi$, that is, $I_p + s^2 u_p u_p^T$. Of course, the best possible performance achievable with a mixture of two Gaussian densities, always with the entropy criterion, is obtained when using $\pi$ as an IS density (third line of Table 1). Finally, both final lines of Table 1 report the best fit obtained with IS densities of the form $0.9\sum_{d=1}^{D} \alpha_d\, N(\mu_d,\Sigma_d) + 0.1\,q_0(\cdot)$ when, respectively, $D = 1$ and $D = 2$ (further comments on the use of these are given below).

Proposal                                                N-PERP    N-ESS     σ²(x₁)
q_0                                                     6.5E-4    1.5E-4    37E3
Best fitting Gaussian
Target mixture
Best fitting Gaussian (defensive option)
Best fitting two Gaussian mixture (defensive option)

Table 1: Performance of various importance sampling densities in terms of N-PERP: normalised perplexity; N-ESS: normalised Effective Sample Size; σ²(x₁): asymptotic variance of the self-normalised IS estimator for the coordinate projection function $h(x) = x_1$. Quantities marked with a dagger sign are straightforward to determine; all others have been obtained using IS with a sample size of one million.

As a general comment on Table 1, note that the variations of the perplexity of the IS weights, of the ESS and of the asymptotic variance of the IS estimate for the coordinate projection function are highly correlated. This is a phenomenon that we have observed on many examples and which justifies our postulate that minimising the entropy criterion does provide very significant variance reductions for the IS estimate of typical functions of interest.

In this example, one may categorise the possible outcomes of adaptive IS algorithms based on mixtures of Gaussian IS densities into mostly four situations:

Disastrous (D.) After $T$ iterations of the PMC scheme, $q_{(\alpha^T,\theta^T)}$ is not a valid IS density and may lead to inconsistent estimates. Typically, this may happen if $q_{(\alpha^T,\theta^T)}$ becomes much too peaky, with light tails. As discussed above, it will also practically be the case if the algorithm only succeeds in fitting $q_{(\alpha^T,\theta^T)}$ to one of the two Gaussian modes of $\pi$. Another disastrous outcome is when the direct application of the adaptation rules described above leads to numerical problems, usually due to the poor conditioning of some of the covariance matrices $\Sigma_d$.
Rather than fixing these issues by ad-hoc solutions (e.g. diagonal loading), which could nonetheless be useful in practical applications, we consider below more principled ways of making the algorithm more resistant to such failures.

Mediocre (M.) After adaptation, $q_{(\alpha^T,\theta^T)}$ is not significantly better than $q_0$ in terms of the performance criteria displayed in Table 1 and, in this case, the adaptation is useless.

Good (G.) After $T$ iterations, $q_{(\alpha^T,\theta^T)}$ selects the best fitting Gaussian approximation (second line of Table 1), which already provides a very substantial improvement, as it results in variance reductions by about four orders of magnitude for typical functions of interest.

Excellent (E.) After $T$ iterations, $q_{(\alpha^T,\theta^T)}$ selects the best fitting mixture of two Gaussian densities, which in this somewhat artificial example corresponds to a perfect fit of $\pi$. Note, however, that the actual gain over the previous outcome is rather moderate, with a reduction of variance by a factor less than four.

Of course, a very important parameter here is the IS sample size $N$: for a given initial IS density $q_0$, if $N$ is too small, any method based on IS is bound to fail; conversely, when $N$ gets large, all reasonable algorithms are expected to reach either the G. or E. result. Note that with local adaptive rules such as the ones proposed in this paper, it is not possible to guarantee that only the E. outcome will be achieved, as the best fitting Gaussian IS density is indeed a stationary point (and in fact a local minimum) of the entropy criterion. So, depending on the initialisation, there always is a non-zero probability that the algorithm converges to the G. situation only.

To focus on situations where algorithmic robustness is an issue, we purposely chose to select a rather small IS sample size of $N = 5{,}000$ points. As discussed above, direct IS estimates using $q_0$ as IS density would be mostly useless with such a modest sample size. We evaluated four algorithmic versions of the PMC algorithm. The first, Plain PMC, uses the parameter update formulas in (11), and $q_0$ is only used as an initialisation value, which is common to all $D$ components of the mixture (which also initially have equal weights). Only the means of the components are slightly perturbed to make it possible for the adaptation procedure to actually provide distinct mixture components.

One drawback of the plain PMC approach is that we do not ensure during the course of the algorithm that the adapted mixture IS density remains valid, in particular that it provides reliable estimates of the parameter update formulas. To guarantee that the IS weights stay well behaved, we consider a version of the PMC algorithm in which the IS density is of the form

$$(1-\alpha_0)\sum_{d=1}^{D} \alpha_d\, N(\mu_d,\Sigma_d) + \alpha_0\, q_0,$$

with the difference that $\alpha_0$ is a fixed parameter which is not adapted. The aim of this version, which we call Defensive PMC in reference to the work of Hesterberg (1995), is to guarantee that the importance function remains bounded by $\alpha_0^{-1}\pi(x)/q_0(x)$, whatever happens during the adaptation, thus guaranteeing a finite variance. Since $q_0$ is a poor IS density, it is preferable to keep $\alpha_0$ as low as possible and we used $\alpha_0 = 0.1$ in all the following simulations. As detailed in both last lines of Table 1, this modification will typically slightly limit the performances achievable by the adaptation procedure, although this drawback could probably be avoided by allowing for a decrease of $\alpha_0$ during the iterations of the PMC.
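The bound behind the defensive mixture can be checked numerically. The following minimal sketch, assuming NumPy, evaluates the defensive log-density from the log-densities of the adapted mixture and of $q_0$; the function name and the numbers in the example are hypothetical.

```python
import numpy as np

def defensive_log_density(log_q_adapted, log_q0, alpha0=0.1):
    """Pointwise log of (1 - alpha0) * q_adapted + alpha0 * q0."""
    return np.logaddexp(np.log1p(-alpha0) + log_q_adapted,
                        np.log(alpha0) + log_q0)

# Hypothetical log-density values at two points: at the first one the adapted
# mixture is essentially zero, at the second one it dominates q0.
log_pi = np.array([-3.0, -3.0])
log_q_adapted = np.array([-1000.0, -1.0])
log_q0 = np.array([-5.0, -5.0])

log_weight = log_pi - defensive_log_density(log_q_adapted, log_q0)
log_bound = log_pi - np.log(0.1) - log_q0   # log of pi(x) / (alpha0 * q0(x))
```

Even where the adapted mixture vanishes, the log-weight cannot exceed `log_bound`, which is the guarantee that motivates keeping the fixed component $\alpha_0 q_0$.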
The parameter update formulas for this modified mixture model are very easily deduced from (11) and are omitted here for the sake of conciseness. The third version we considered is termed Rao-Blackwellised PMC and consists in replacing the update equations (11) by their Rao-Blackwellised version (12). Finally, we consider a fourth option in which both the defensive mixture density and the Rao-Blackwellised update formulas are used.

All simulations were carried out using a sample size of $N = 5{,}000$, 20 iterations of the PMC algorithm and Gaussian mixtures with $D = 3$ components. Note that we purposely avoided choosing $D = 2$ to avoid the very artificial perfect fit phenomenon. This also means that for most runs of the algorithm, at least one component will disappear (by convergence of its weight to zero) or will be duplicated, with several components sharing very similar parameters.

                     Disastrous   Mediocre   Good   Excellent
Plain
Defensive
R.-B.
Defensive + R.-B.

Table 2: Number of outcomes of each category for the four algorithmic versions, as recorded from 100 independent runs.

Table 2 displays the performances of the four algorithms in repeated independent adaptation runs. The most significant observation about Table 2 is the large gap in robustness between the non Rao-Blackwellised versions of the algorithm, which returned disastrous or mediocre results in about 60% of the cases, and the versions using the Rao-Blackwellised update formulas, for which this fraction falls below 20%. Obviously, the fact that the Rao-Blackwellised updates are based on all simulated values, and not just on those actually simulated from a particular mixture component, is a major source of robustness of the method when the sample size $N$ is small, given the misfit of the initial IS density $q_0$. The same remark also applies when the PMC algorithm is to be implemented with a large number $D$ of components.

The role of the defensive mixture component is more modest, although it does improve the performances of both versions of the algorithm (non Rao-Blackwellised and Rao-Blackwellised alike), at the price of a slight reduction of the frequency of the Excellent outcome. Also notice that the results obtained when the defensive mixture component is used are slightly beyond those of the unconstrained adaptation (see Table 1). The frequency of the perfect or Excellent match is about 10% for all methods, but this is a consequence of the local nature of the adaptation rule as well as of the choice of the initialisation of the algorithm. It should be stressed however that, as we are not interested in modelling $\pi$ by a mixture but are rather seeking good IS densities, the solutions obtained in the G. or E. situations are only mildly different in this respect (see Table 1).

As a final comment, recall that the results presented above have been obtained with a fairly small sample size of $N = 5{,}000$. Increasing $N$ quickly reduces the failure rate of all algorithms: for $N = 20{,}000$ for instance, the failure rate of the plain PMC algorithm drops to 7/100 while the Rao-Blackwellised versions achieve either the G. or E. result (and mostly the G. one, given the chosen initialisation) for all runs.

4 Robustification via mixtures of multivariate t's

We now consider the setting of a proposal composed of a mixture of $p$-dimensional t distributions,

$$\sum_{d=1}^{D} \alpha_d\, T(\nu_d,\mu_d,\Sigma_d). \tag{13}$$

We here follow the recommendations of West (1992) and Oh and Berger (1993), who proposed using mixtures of t distributions in importance sampling. The t mixture is preferable to a normal mixture because of its heavier tails, which can capture a wider range of non-Gaussian targets with a smaller number of components.
This alternative setting is more challenging, however, and one must take advantage of the missing variable representation of the t distribution itself to achieve a closed-form updating of the parameters $(\mu_d, \Sigma_d)_d$ approximating (7), since a true closed form cannot be derived.

4.1 The latent-data framework

Using the classical normal/chi-squared decomposition of the t distribution, a joint distribution associated with the t mixture proposal (13) is

$$f(x,y,z) \propto \alpha_z\, |\Sigma_z|^{-1/2} \exp\left\{ -(x-\mu_z)^{\mathrm T} \Sigma_z^{-1} (x-\mu_z)\, y / 2\nu_z \right\} y^{(\nu_z+p)/2-1} e^{-y/2} \propto \alpha_z\, \varphi(x; \mu_z, \nu_z\Sigma_z/y)\, \varsigma(y; \nu_z/2, 1/2),$$

where, as above, $x$ corresponds to the observable in (13), $z$ corresponds to the mixture indicator, and $y$ corresponds to the $\chi^2_\nu$ completion. The normal density is denoted by $\varphi$ and the gamma density by $\varsigma$. Both $y$ and $z$ correspond to latent variables in that the integral of the above in $(y,z)$ returns (13).

In the associated PMC algorithm, we only update the expectations and the covariance structures of the t distributions and not the number of degrees of freedom, given that there is no closed-form solution for the latter. In that case, $\theta_d = (\mu_d, \Sigma_d)$ and, for each $d = 1, \ldots, D$, the number of degrees of freedom $\nu_d$ is fixed. At iteration $t$, the integrated EM update of the parameter will involve the following "E function"

$$Q\{(\alpha^t,\theta^t),(\alpha,\theta)\} = \mathbb{E}^X_\pi\left[ \mathbb{E}^{Y,Z}_{(\alpha^t,\theta^t)} \left\{ \log(\alpha_Z) + \log\left(\varphi(X; \mu_Z, \nu_Z\Sigma_Z/Y)\right) \,\middle|\, X \right\} \right],$$

since the $\chi^2$ part does not involve the parameter $\theta = (\mu, \Sigma)$. Given that

$$Y, Z \mid X, \theta \sim f(y,z \mid x) \propto \alpha_z\, \varphi(x; \mu_z, \nu_z\Sigma_z/y)\, \varsigma(y; \nu_z/2, 1/2),$$

we have that

$$Y \mid X, Z = d, \theta \sim \mathcal{G}a\left( (\nu_d+p)/2,\ \tfrac{1}{2}\left\{ 1 + (X-\mu_d)^{\mathrm T} \Sigma_d^{-1} (X-\mu_d)/\nu_d \right\} \right)$$

and therefore

$$Q\{(\alpha^t,\theta^t),(\alpha,\theta)\} = \mathbb{E}^X_\pi\left[ \mathbb{E}^{Z}_{(\alpha^t,\theta^t)} \{ \log(\alpha_Z) \mid X \} \right] - \frac{1}{2}\, \mathbb{E}^X_\pi\left[ \mathbb{E}^{Y,Z}_{(\alpha^t,\theta^t)} \left\{ \log|\Sigma_Z| + (X-\mu_Z)^{\mathrm T} \Sigma_Z^{-1} (X-\mu_Z)\, \frac{Y}{\nu_Z} \,\middle|\, X \right\} \right]$$

$$= \mathbb{E}^X_\pi\left[ \sum_{d=1}^{D} \rho_d(X;\alpha^t,\theta^t) \log(\alpha_d) \right] - \frac{1}{2}\, \mathbb{E}^X_\pi\left[ \sum_{d=1}^{D} \rho_d(X;\alpha^t,\theta^t) \left\{ \log|\Sigma_d| + (X-\mu_d)^{\mathrm T} \Sigma_d^{-1} (X-\mu_d)\, \frac{\nu_d+p}{\nu_d + (X-\mu_d^t)^{\mathrm T} (\Sigma_d^t)^{-1} (X-\mu_d^t)} \right\} \right],$$

where we have used both the definition in (5),

$$\rho_d(X;\alpha^t,\theta^t) = \mathbb{P}_{\alpha^t,\theta^t}(Z = d \mid X) = \frac{\alpha_d^t\, t(X; \nu_d, \mu_d^t, \Sigma_d^t)}{\sum_{l=1}^{D} \alpha_l^t\, t(X; \nu_l, \mu_l^t, \Sigma_l^t)},$$

with $t(x; \nu, \mu, \Sigma)$ denoting the $\mathcal{T}(\nu, \mu, \Sigma)$ density, and the fact that

$$\gamma_d(X;\theta^t) = \mathbb{E}^{Y}_{\theta^t}\{ Y/\nu_d \mid X, Z = d \} = \frac{\nu_d+p}{\nu_d + (X-\mu_d^t)^{\mathrm T} (\Sigma_d^t)^{-1} (X-\mu_d^t)}.$$

Therefore, the M step of the integrated EM update is

$$\alpha_d^{t+1} = \mathbb{E}^X_\pi\left[ \rho_d(X;\alpha^t,\theta^t) \right],$$

$$\mu_d^{t+1} = \frac{\mathbb{E}^X_\pi\left[ \rho_d(X;\alpha^t,\theta^t)\, \gamma_d(X;\theta^t)\, X \right]}{\mathbb{E}^X_\pi\left[ \rho_d(X;\alpha^t,\theta^t)\, \gamma_d(X;\theta^t) \right]},$$

$$\Sigma_d^{t+1} = \frac{\mathbb{E}^X_\pi\left[ \rho_d(X;\alpha^t,\theta^t)\, \gamma_d(X;\theta^t)\, (X-\mu_d^{t+1})(X-\mu_d^{t+1})^{\mathrm T} \right]}{\mathbb{E}^X_\pi\left[ \rho_d(X;\alpha^t,\theta^t) \right]}.$$

While the first update is the generic weight modification (6), the latter formulae are (up to the integration with respect to $X$) essentially those found in Peel and McLachlan (2000) for a mixture of t distributions.

4.2 Parameter update

As in Section 3.1, the empirical update equations are obtained by using self-normalised IS with weights $\omega_{i,t}$ given by (8) for both the numerator and the denominator of each of the above expressions. The Rao-Blackwellised approximation based on (10) yields

$$\alpha_d^{t+1,N} = \sum_{i=1}^{N} \omega_{i,t}\, \rho_d(X_{i,t};\alpha^{t,N},\theta^{t,N}),$$

$$\mu_d^{t+1,N} = \frac{\sum_{i=1}^{N} \omega_{i,t}\, \rho_d(X_{i,t};\alpha^{t,N},\theta^{t,N})\, \gamma_d(X_{i,t};\theta^{t,N})\, X_{i,t}}{\sum_{i=1}^{N} \omega_{i,t}\, \rho_d(X_{i,t};\alpha^{t,N},\theta^{t,N})\, \gamma_d(X_{i,t};\theta^{t,N})},$$

$$\Sigma_d^{t+1,N} = \frac{\sum_{i=1}^{N} \omega_{i,t}\, \rho_d(X_{i,t};\alpha^{t,N},\theta^{t,N})\, \gamma_d(X_{i,t};\theta^{t,N})\, (X_{i,t}-\mu_d^{t+1,N})(X_{i,t}-\mu_d^{t+1,N})^{\mathrm T}}{\sum_{i=1}^{N} \omega_{i,t}\, \rho_d(X_{i,t};\alpha^{t,N},\theta^{t,N})},$$

while the standard update equations, based on (9), are obtained by replacing $\rho_d(X_{i,t};\alpha^{t,N},\theta^{t,N})$ by $\mathbb{1}\{Z_{i,t} = d\}$, the indicator that $X_{i,t}$ was simulated from component $d$, in the above equations.

4.3 Pima Indian example

As a realistic if artificial illustration of the performance of the t mixture (13), we study the posterior distribution of the parameters of a probit model. The corresponding dataset is borrowed from the MASS library of R (R Development Core Team, 2006). It consists of the records of 532 Pima Indian women who were tested by the U.S. National Institute of Diabetes and Digestive and Kidney Diseases for diabetes. Four quantitative covariates were recorded, along with the presence or absence of diabetes. The corresponding probit model analyses the presence of diabetes, i.e.

$$\mathbb{P}_\beta(y = 1 \mid x) = 1 - \mathbb{P}_\beta(y = 0 \mid x) = \Phi\left(\beta_0 + x^{\mathrm T}(\beta_1,\beta_2,\beta_3,\beta_4)\right)$$

with $\beta = (\beta_0, \ldots, \beta_4)$, $x$ made of four covariates (the number of pregnancies, the plasma glucose concentration, the body mass index, i.e. weight in kg/(height in m)$^2$, and the age), and $\Phi$ the cumulative distribution function of the standard normal. We use the flat prior distribution $\pi(\beta \mid X) \propto 1$; in that case, the 5-dimensional target posterior distribution is such that

$$\pi(\beta \mid y, X) \propto \prod_{i=1}^{532} \left[ \Phi\{\beta_0 + (x^i)^{\mathrm T}(\beta_1,\beta_2,\beta_3,\beta_4)\} \right]^{y_i} \left[ 1 - \Phi\{\beta_0 + (x^i)^{\mathrm T}(\beta_1,\beta_2,\beta_3,\beta_4)\} \right]^{1-y_i}$$

where $x^i$ is the value of the covariates for the $i$-th individual and $y_i$ is the response of the $i$-th individual. We first present some results for $N = 10{,}000$ sample points and $T = 500$ iterations on Figures 1-3, based on a mixture with 4 components and with the degrees of freedom chosen as $\nu = (3, 6, 9, 18)$, respectively, when using the non Rao-Blackwellised version (9).
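The probit posterior described above can be evaluated on the log scale as in the following sketch. The function names and the synthetic covariates are assumptions made for illustration only; this is not the Pima Indian dataset, and the simple `erfc`-based evaluation of $\log\Phi$ would underflow for extreme values of the linear predictor.

```python
import math
import numpy as np


def log_std_normal_cdf(z):
    """log Phi(z) for each entry of z, via the complementary error function."""
    return np.array([math.log(0.5 * math.erfc(-v / math.sqrt(2.0))) for v in z])


def probit_log_posterior(beta, X, y):
    """Unnormalised log-posterior of the probit model under a flat prior.

    beta: (k + 1,) intercept followed by slopes; X: (n, k) covariates;
    y: (n,) binary responses coded as 0/1.
    """
    eta = beta[0] + X @ beta[1:]
    # Phi(-eta) = 1 - Phi(eta), which avoids forming log(1 - Phi(eta)) directly.
    return float(np.sum(y * log_std_normal_cdf(eta)
                        + (1.0 - y) * log_std_normal_cdf(-eta)))


# Synthetic stand-in data with four covariates (hypothetical values).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(20, 4))
y_toy = (rng.random(20) < 0.5).astype(float)
value_at_zero = probit_log_posterior(np.zeros(5), X_toy, y_toy)
```

At $\beta = 0$ every observation contributes $\log(1/2)$, which gives a quick sanity check of the implementation.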
The unrealistic value of $T$ is chosen purposely to illustrate the lack of stability of the update strategy when not using the Rao-Blackwellised version. Indeed, as can be seen from Figure 1, which describes the evolution of the $\mu_d$'s, some components vary quite widely over iterations, but they also correspond to a rather stable overall estimate of $\beta$, $\sum_{i=1}^{N} \omega_{i,t}\, \beta_{i,t}$, equal to $(-5.54, 0.051, 0.019, 0.055, 0.022)$ over most iterations. When looking at Figure 3, the quasi-constant entropy estimate after iteration 100 or so shows that, even in this situation, there is little need to continue the iterations up to the 500th. Using a Rao-Blackwellised version of the updates shows a strong stabilisation for the updates of the parameters $\alpha_d$ and $(\mu_d, \Sigma_d)$, both in the number of iterations and in the range of the parameters. The approximation to the Bayes estimate, $(-5.63, 0.052, 0.019, 0.056, 0.022)$, is obviously very close to the above estimate. Figures 4 and 5 show the immediate stabilisation provided by the Rao-Blackwellisation step.
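The two update rules of Section 4.2 compared here can be sketched as a single PMC iteration with a switch between them. This is only an illustration under stated assumptions: the function names, the Gaussian toy target and the starting values are hypothetical, and the guard that leaves an empty component unchanged is a choice of this sketch rather than something specified in the paper.

```python
import math
import numpy as np


def t_logpdf_and_maha(x, nu, mu, sigma):
    """Log-density of T(nu, mu, sigma) at the rows of x, plus Mahalanobis distances."""
    p = mu.shape[0]
    chol = np.linalg.cholesky(sigma)
    u = np.linalg.solve(chol, (x - mu).T)
    maha = np.sum(u ** 2, axis=0)
    log_norm = (math.lgamma((nu + p) / 2) - math.lgamma(nu / 2)
                - 0.5 * p * math.log(nu * math.pi) - np.sum(np.log(np.diag(chol))))
    return log_norm - 0.5 * (nu + p) * np.log1p(maha / nu), maha


def pmc_t_iteration(log_target, alpha, nu, mu, sigma, n, rng, rao_blackwell=True):
    """One PMC iteration for a t-mixture proposal; returns updated (alpha, mu, sigma)."""
    D, p = mu.shape
    # Simulate through the latent variables: Z ~ alpha, Y ~ chi2(nu_Z), X | Y, Z normal.
    z = rng.choice(D, size=n, p=alpha)
    y = rng.chisquare(nu[z])
    chol = np.linalg.cholesky(sigma)
    eps = rng.standard_normal((n, p))
    x = mu[z] + np.einsum('nij,nj->ni', chol[z], eps) * np.sqrt(nu[z] / y)[:, None]

    logq = np.empty((D, n))
    gamma = np.empty((D, n))
    for d in range(D):
        lp, maha = t_logpdf_and_maha(x, nu[d], mu[d], sigma[d])
        logq[d] = math.log(alpha[d]) + lp
        gamma[d] = (nu[d] + p) / (nu[d] + maha)
    top = logq.max(axis=0)
    log_mix = top + np.log(np.exp(logq - top).sum(axis=0))

    # Self-normalised importance weights, omega_i proportional to pi(x_i) / q(x_i).
    logw = log_target(x) - log_mix
    omega = np.exp(logw - logw.max())
    omega /= omega.sum()

    if rao_blackwell:
        rho = np.exp(logq - log_mix)              # posterior component probabilities
    else:
        rho = (z[None, :] == np.arange(D)[:, None]).astype(float)  # indicators

    new_alpha = rho @ omega
    new_mu, new_sigma = mu.copy(), sigma.copy()
    for d in range(D):
        w = omega * rho[d]
        wg = w * gamma[d]
        if w.sum() <= 0.0:                        # no mass on this component: keep it
            continue
        new_mu[d] = wg @ x / wg.sum()
        c = x - new_mu[d]
        new_sigma[d] = (c * wg[:, None]).T @ c / w.sum()
    return new_alpha, new_mu, new_sigma


def log_target(x):
    """Toy target: unnormalised standard normal in dimension 2."""
    return -0.5 * np.sum(x ** 2, axis=1)


rng = np.random.default_rng(0)
alpha = np.array([0.5, 0.5])
nu = np.array([3.0, 6.0])
mu = np.array([[-2.0, -2.0], [2.0, 2.0]])
sigma = np.array([np.eye(2), np.eye(2)])
for _ in range(5):
    alpha, mu, sigma = pmc_t_iteration(log_target, alpha, nu, mu, sigma, 2000, rng)
```

With `rao_blackwell=False` a component that receives no simulated point gets weight zero, after which `math.log(alpha[d])` fails at the next iteration; a practical implementation would need to handle that case.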

Figure 1: Pima Indians: Evolution of the components of the five $\mu_d$'s over 500 iterations plotted by pairs: (clockwise from upper left side) (1,2), (3,4), (4,1) and (2,3). The colour code is blue for $\mu_1$, yellow for $\mu_2$, brown for $\mu_3$ and red for $\mu_4$. The additional dark path corresponds to the estimate of $\beta$. All $\mu_d$'s were started in the vicinity of the MLE $\hat\beta$.

Figure 2: Pima Indians: Evolution of the five $\Sigma_d$'s over 500 iterations plotted by pairs for the diagonal elements: (clockwise from upper left side) (1,2), (3,4), (4,1) and (2,3). The colour code is blue for $\Sigma_1$, yellow for $\Sigma_2$, brown for $\Sigma_3$ and red for $\Sigma_4$. All $\Sigma_d$'s were started at the covariance matrix of $\hat\beta$ produced by the R glm() procedure.

Figure 3: Pima Indians: Evolution of the cumulated weights (top) and of the estimated entropy divergence $\mathbb{E}_\pi[\log(q_{\alpha,\theta}(\beta))]$ (bottom).

Figure 4: Pima Indians: Evolution of the components of the five $\mu_d$'s over 50 Rao-Blackwellised iterations plotted by pairs: (clockwise from upper left side) (1,2), (3,4), (4,1) and (2,3). The colour code is blue for $\mu_1$, yellow for $\mu_2$, brown for $\mu_3$ and red for $\mu_4$. The additional dark path corresponds to the estimate of $\beta$. All $\mu_d$'s were started in the vicinity of the MLE $\hat\beta$.

Figure 5: Pima Indians: Evolution of the cumulated weights (top) and of the estimated entropy divergence $\mathbb{E}_\pi[\log(q_{\alpha,\theta}(\beta))]$ (bottom) for the Rao-Blackwellised version.

5 Conclusions

The proposed algorithm provides a flexible and robust framework for adapting general importance sampling densities represented as mixtures. The extension to mixtures of t distributions broadens the scope of the method by allowing the approximation of heavier-tailed targets. Moreover, we can extend here the remarks made in Douc et al. (2007a,b), namely that the update mechanism provides an early stabilisation of the parameters of the mixture. It is therefore unnecessary to rely on a large value of $T$: with large enough sample sizes $N$ at each iteration, especially on the initial iteration that requires many points to counter-weight a potentially poor initial proposal, it is quite uncommon to fail to spot a stabilisation of both the estimates and the entropy criterion within a few iterations.

While this paper relies on the generic entropy criterion to update the mixture density, we want to stress that it is also possible to use a more focussed deviance criterion, namely the h-entropy

$$\mathcal{E}_h(\pi, q_{(\alpha,\theta)}) = D(\pi_h \,\|\, q_{(\alpha,\theta)}), \tag{14}$$

with

$$\pi_h(x) \propto |h(x) - \pi(h)|\, \pi(x),$$

that is tuned to the estimation of a particular function $h$, as it is well known that the optimal choice of the importance density for the self-normalised importance sampling estimator is exactly $\pi_h$. Since the normalising constant in $\pi_h$ does not need to be known, one can derive an adaptive algorithm that resembles the method presented in this paper. It is expected that this modification will be helpful in reaching IS densities that provide a low approximation error for a specific function $h$, which is also an important feature of importance sampling in several applications.

References

Cappé, O., Guillin, A., Marin, J., and Robert, C. (2004). Population Monte Carlo. J. Comput. Graph. Statist., 13(4).

Cappé, O., Moulines, E., and Rydén, T. (2005). Inference in Hidden Markov Models. Springer-Verlag, New York.

Chen, R. and Liu, J. S. (1996). Predictive updating method and Bayesian classification. J. Royal Statist. Soc. Series B, 58(2).

Douc, R., Guillin, A., Marin, J.-M., and Robert, C. (2007a). Convergence of adaptive mixtures of importance sampling schemes. Ann. Statist., 35(1).

Douc, R., Guillin, A., Marin, J.-M., and Robert, C. (2007b). Minimum variance importance sampling via population Monte Carlo. ESAIM: Probability and Statistics, 11.

Doucet, A., de Freitas, N., and Gordon, N. (2001). Sequential Monte Carlo Methods in Practice. Springer-Verlag, New York.

Geweke, J. (1989). Bayesian inference in econometric models using Monte Carlo integration. Econometrica, 57.

Hesterberg, T. (1995). Weighted average importance sampling and defensive mixture distributions. Technometrics, 37(2).

Oh, M. and Berger, J. (1993). Integration of multimodal functions by Monte Carlo importance sampling. J. American Statist. Assoc., 88.

Peel, D. and McLachlan, G. (2000). Robust mixture modelling using the t distribution. Statistics and Computing, 10.

R Development Core Team (2006). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Robert, C. and Casella, G. (2004). Monte Carlo Statistical Methods. Springer-Verlag, New York, second edition.

Rubinstein, R. Y. and Kroese, D. P. (2004). The Cross-Entropy Method. Springer-Verlag, New York.

West, M. (1992). Modelling with mixtures. In Berger, J., Bernardo, J., Dawid, A., and Smith, A., editors, Bayesian Statistics 4. Oxford University Press, Oxford.
