CONVERGENCE OF ADAPTIVE MIXTURES OF IMPORTANCE SAMPLING SCHEMES


The Annals of Statistics 2007, Vol. 35, No. 1. Institute of Mathematical Statistics, 2007.

CONVERGENCE OF ADAPTIVE MIXTURES OF IMPORTANCE SAMPLING SCHEMES

BY R. DOUC, A. GUILLIN, J.-M. MARIN AND C. P. ROBERT

École Polytechnique, École Centrale Marseille and LATP, CNRS, INRIA Futurs, Projet Select, Université Orsay and Université Paris-Dauphine and CREST-INSEE

In the design of efficient simulation algorithms, one is often beset with a poor choice of proposal distributions. Although the performance of a given simulation kernel can clarify a posteriori how adequate this kernel is for the problem at hand, a permanent on-line modification of kernels causes concerns about the validity of the resulting algorithm. While the issue is most often intractable for MCMC algorithms, the equivalent version for importance sampling algorithms can be validated quite precisely. We derive sufficient convergence conditions for adaptive mixtures of population Monte Carlo algorithms and show that Rao-Blackwellized versions asymptotically achieve an optimum in terms of a Kullback divergence criterion, while more rudimentary versions do not benefit from repeated updating.

Received January 2005; revised December. Supported in part by an ACI Nouvelles Interfaces des Mathématiques grant from the Ministère de la Recherche.

AMS 2000 subject classifications. 60F05, 62L12, 65-04, 65C05, 65C40, 65C60.

Key words and phrases. Bayesian statistics, Kullback divergence, LLN, MCMC algorithm, population Monte Carlo, proposal distribution, Rao-Blackwellization.

1. Introduction.

1.1. Monte Carlo calibration. In the simulation settings found in optimization and Bayesian integration, it is well documented [20] that the choice of the instrumental distributions is paramount for the efficiency of the resulting algorithms. Indeed, in an importance sampling algorithm with importance function g(x), we are relying on a distribution g that is customarily difficult to calibrate outside a limited range of well-known cases. For instance, a standard result is that the optimal importance density for approximating an integral

$$I = \int f(x)\,\pi(x)\,\mathrm{d}x$$

is $g^\star(x) \propto |f(x)|\,\pi(x)$ ([20], Theorem 3.12), but this formal result is not very informative about the practical choice of g, while a poor choice of g may result in an infinite variance estimator. MCMC algorithms somehow attenuate this difficulty by using local proposals like random walk kernels, but two drawbacks of these proposals are that they may take a long time to converge [18] and that their efficiency ultimately depends on the scale of the local exploration.

The goals of Monte Carlo experiments are multifaceted and therefore the efficiency of an algorithm can be evaluated from many different perspectives. In particular, in Bayesian statistics the Monte Carlo sample can be used to approximate a variety of posterior quantities. Nonetheless, if we try to assess the generic efficiency of an algorithm and thus develop a portmanteau device, a natural approach is to use a measure of agreement between the target and the proposal distribution, similar to the intrinsic loss function proposed in [19] for an invariant estimation of parameters. A robust measure of similarity ubiquitous in statistical approximation theory [7] is the Kullback divergence

$$\mathcal{E}(\pi, \hat\pi) = \int \log\!\left(\frac{\pi(x)}{\hat\pi(x)}\right) \pi(x)\,\mathrm{d}x,$$

and this paper aims to minimize E(π, π̂) within a class of proposals π̂.
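Since this divergence is the yardstick used throughout the paper, a plain Monte Carlo approximation of it is worth spelling out. The sketch below is ours, not the authors' R code; it assumes we can draw from π and evaluate both log-densities (the function names `log_pi`, `log_pi_hat` and `sample_pi` are hypothetical).

```python
import numpy as np

def kullback_divergence(log_pi, log_pi_hat, sample_pi, n=100_000, seed=0):
    """Monte Carlo estimate of E(pi, pi_hat) = E_pi[log pi(X) - log pi_hat(X)]."""
    rng = np.random.default_rng(seed)
    x = sample_pi(n, rng)                      # i.i.d. draws from the target pi
    return float(np.mean(log_pi(x) - log_pi_hat(x)))

# Sanity check against the closed form KL(N(0,1) || N(m,1)) = m^2 / 2:
m = 1.5
est = kullback_divergence(
    log_pi=lambda x: -0.5 * x**2 - 0.5 * np.log(2 * np.pi),
    log_pi_hat=lambda x: -0.5 * (x - m) ** 2 - 0.5 * np.log(2 * np.pi),
    sample_pi=lambda n, rng: rng.standard_normal(n),
)
# est should be close to m**2 / 2 = 1.125
```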

1.2. Adaptivity in Monte Carlo settings. Given the complexity of the original optimization or integration problem (which does itself require Monte Carlo approximations), it is rarely the case that the optimization of the proposal distribution against an efficiency measure can be achieved in closed form. Even the computation of the efficiency measure for a given proposal is impossible in the majority of cases. For this reason, a number of adaptive schemes have appeared in the recent literature [20] in order to design better proposals against a given measure of efficiency without resorting to a standard optimization algorithm. For instance, in the MCMC community, sequential changes in the variance of Markov kernels have been proposed in [13, 14], while adaptive changes taking advantage of regeneration properties of the kernels have been constructed by Gilks, Roberts and Sahu [12] and Sahu and Zhigljavsky [23, 24]. From a more general perspective, Andrieu and Robert [2] have developed a two-level stochastic optimization scheme to update the parameters of a proposal towards a given integrated efficiency criterion, such as the acceptance rate or its difference with a value known to be optimal (see Roberts, Gelman and Gilks [21]). However, as reflected in the general technical report of Andrieu and Robert [2], the complexity of devising valid adaptive MCMC schemes is a genuine drawback to their extension, given that the constraints on the inhomogeneous Markov chain that results from this adaptive construction are either difficult to satisfy or result in a fixed proposal after a certain number of iterations.

Cappé, Guillin, Marin and Robert [3] (see also [20], Chapter 14) developed a methodology called Population Monte Carlo (PMC) [16], motivated by the observation that the importance sampling perspective is much more amenable to adaptivity than MCMC, due to its unbiased nature: using sampling importance resampling, any given sample from an importance distribution g can be transformed into a sample of points marginally distributed from the target distribution π, and Cappé et al. [3] (see also [8]) showed that this property is also preserved by repeated and adaptive sampling. In this setting, adaptive is to be understood as a modification of the importance distribution based on the results of previous iterations. The asymptotics of adaptive importance sampling are therefore much more manageable than those of adaptive MCMC algorithms, at least at a primary level, if only because the algorithm can be stopped at any time. Indeed, since every iteration is a valid importance sampling scheme, the algorithm does not require a burn-in time. Borrowing from the sequential sampling literature [11], the methodology of Cappé et al. [3] thus aimed at producing an adaptive importance sampling scheme via a learning mechanism on a population of points, themselves marginally distributed from the target distribution. However, as shown by the current paper, the original implementation of Cappé et al. [3] may suffer from an asymptotic lack of adaptivity that can be overcome by Rao-Blackwellization.

1.3. Plan and objectives. This paper focuses on a specific family of importance functions that are represented as mixtures of an arbitrary number of fixed proposals. These proposals can be educated guesses of the target distribution, random walk proposals for local exploration of the target, nonparametric kernel approximations to the target or any combination of these. Using these fixed proposals as a basis, we then devise an updating mechanism for the weights of the mixture and prove that this mechanism converges to the optimal mixture, the optimality being defined here in terms of Kullback divergence. From a probabilistic point of view, the techniques used in this paper are related to techniques and results found in [4, 6, 17]. In particular, the triangular array technique that is central to the CLT proofs below can be found in [4, 10].

The paper is organized as follows. We first present the algorithmic and mathematical details in Section 2 and establish a generic central limit theorem. We evaluate the convergence properties of the basic version of PMC in Section 3, exhibiting its limitations, and show in Section 5 that its Rao-Blackwellized version overcomes these limitations and achieves optimality for the Kullback criterion developed in Section 4. Section 6 illustrates the practical convergence of the method on a few benchmark examples.

2. Population Monte Carlo. The Population Monte Carlo (PMC) algorithm introduced in [3] is a form of iterated sampling importance resampling (SIR). The appeal of using a repeated form of SIR is that previous samples are informative about the connections between the proposal (importance) and the target distributions. We stress from the outset that this scheme has very few connections with MCMC algorithms since (a) PMC is not Markovian, being based on the whole sequence of simulations, and (b) PMC can be stopped at any time, being validated by the basic importance sampling identity ([20], equation (3.9)) rather than by a probabilistic convergence result like the ergodic theorem. These features motivate the use of the method in setups where off-the-shelf MCMC algorithms cannot be of use. We first recall basic Monte Carlo principles, mostly to define notation and to make precise our goals.

2.1. The Monte Carlo framework. On a measurable space (X, A), we are given a target, that is, a probability distribution π on (X, A). We assume that π is dominated by a reference measure µ (π ≪ µ) and also denote by π(x) its density, π(dx) = π(x)µ(dx). In most settings, including Bayesian statistics, the density π is known up to a normalizing constant, π(x) ∝ π̃(x). The purpose of running a simulation experiment with the target π is to approximate quantities related to π, such as intractable integrals π(f) = ∫ f(x)π(dx), but we do not focus here on a specific quantity π(f).

In this setting, a standard stochastic approximation method is the Monte Carlo method, based on an i.i.d. sample x_1, ..., x_N simulated from π, that approximates π(f) by

$$\hat\pi^{MC}_N(f) = \frac{1}{N} \sum_{i=1}^{N} f(x_i),$$

which almost surely converges to π(f) as N goes to infinity by the law of large numbers (LLN). The central limit theorem (CLT) implies, in addition, that if π(f²) = ∫ f²(x)π(dx) < ∞, then

$$\sqrt{N}\,\{\hat\pi^{MC}_N(f) - \pi(f)\} \xrightarrow{\mathcal{L}} \mathcal{N}\big(0,\, V_\pi(f)\big),$$

where V_π(f) = π([f − π(f)]²). Obviously, this approach requires a direct i.i.d. simulation from π or π̃, which often is impossible. An alternative ([20], Chapter 3) is to use importance sampling, that is, to choose a probability distribution ν ≪ µ on (X, A), called the proposal or importance distribution, the density of which is also denoted by ν, and to estimate π(f) by

$$\hat\pi^{IS}_{\nu,N}(f) = \frac{1}{N} \sum_{i=1}^{N} \frac{\pi}{\nu}(x_i)\, f(x_i).$$

If π is also dominated by ν (π ≪ ν), then π̂^IS_{ν,N}(f) almost surely converges to π(f) and, if ν(f²(π/ν)²) < ∞, the CLT also applies, that is,

$$\sqrt{N}\,\{\hat\pi^{IS}_{\nu,N}(f) - \pi(f)\} \xrightarrow{\mathcal{L}} \mathcal{N}\big(0,\, V_\nu(f\,\pi/\nu)\big).$$

As the normalizing constant of the target distribution π is unknown, it is not possible to directly use the IS estimator π̂^IS_{ν,N}(f). A convenient substitute is the self-normalized IS estimator,

$$\hat\pi^{SIS}_{\nu,N}(f) = \sum_{i=1}^{N} \frac{\pi}{\nu}(x_i)\, f(x_i) \Big/ \sum_{i=1}^{N} \frac{\pi}{\nu}(x_i),$$

which also converges almost surely to π(f).
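To fix ideas, here is a minimal sketch (ours, not the paper's code) of the two importance sampling estimators just defined; the plain estimator needs the normalized density π, while the self-normalized one works with an unnormalized π̃. All argument names are ours.

```python
import numpy as np

def is_estimate(f, log_pi, log_nu, sample_nu, n, seed=0):
    """Plain IS estimator of pi(f); requires the *normalized* density pi."""
    rng = np.random.default_rng(seed)
    x = sample_nu(n, rng)                     # i.i.d. draws from the proposal nu
    w = np.exp(log_pi(x) - log_nu(x))         # importance weights pi/nu
    return float(np.mean(w * f(x)))

def sis_estimate(f, log_pi_tilde, log_nu, sample_nu, n, seed=0):
    """Self-normalized IS estimator; pi may be known only up to a constant."""
    rng = np.random.default_rng(seed)
    x = sample_nu(n, rng)
    logw = log_pi_tilde(x) - log_nu(x)        # log unnormalized weights
    w = np.exp(logw - logw.max())             # rescaling avoids overflow
    return float(np.sum(w * f(x)) / np.sum(w))
```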

If ν((1 + f²)(π/ν)²) < ∞, then the CLT applies:

$$\sqrt{N}\,\{\hat\pi^{SIS}_{\nu,N}(f) - \pi(f)\} \xrightarrow{\mathcal{L}} \mathcal{N}\big(0,\, V_\nu\big([f - \pi(f)]\,\pi/\nu\big)\big).$$

Obviously, the quality of the IS and SIS approximations strongly depends on the choice of the proposal distribution ν, which is delicate for complex distributions like those that occur in high-dimensional problems. While multiplication of the number of proposals may offer some reprieve in this regard, we must stress at this point that our PMC methodology also suffers from the curse of dimensionality to which all importance sampling methods are subject, in the sense that high-dimensional problems require a considerable increase in computational power.

2.2. Sampling importance resampling. The sampling importance resampling (SIR) method of Rubin [22] is an extension of the IS method that achieves simulation from π by adding resampling to simple reweighting. More precisely, the SIR algorithm operates in two stages. The first stage is identical to IS and consists in generating an i.i.d. sample x_1, ..., x_N from ν. The second stage builds a sample from π, x̃_1, ..., x̃_M, based on the instrumental sample x_1, ..., x_N, by resampling. While there are many resampling methods [20], the most standard (if least efficient) approach is multinomial sampling from {x_1, ..., x_N} with probabilities proportional to the importance weights

$$\left[\frac{\pi}{\nu}(x_1), \ldots, \frac{\pi}{\nu}(x_N)\right],$$

that is, the replacement of the weighted sample x_1, ..., x_N by an unweighted sample x̃_1, ..., x̃_M, where x̃_i = x_{J_i} (1 ≤ i ≤ M) and where (J_1, ..., J_M) ∼ M(M, (ϱ_1, ..., ϱ_N)), the multinomial distribution with probabilities

$$\mathbb{P}[J_l = i \mid x_1, \ldots, x_N] = \varrho_i \propto \frac{\pi}{\nu}(x_i), \qquad 1 \le i \le N,\ 1 \le l \le M.$$

The SIR estimator of π(f) is then the standard average

$$\hat\pi^{SIR}_{\nu,N,M}(f) = M^{-1} \sum_{i=1}^{M} f(\tilde{x}_i),$$

which also converges to π(f) since each x̃_i is marginally distributed from π. By construction, the variance of the SIR estimator is greater than the variance of the SIS estimator. Indeed, the expectation of π̂^SIR_{ν,N,M}(f) conditional on the sample x_1, ..., x_N is equal to π̂^SIS_{ν,N}(f). Note, however, that an asymptotic analysis of π̂^SIR_{ν,N,M}(f) is quite delicate because of the dependencies in the SIR sample (which, again, is not an i.i.d. sample from π).
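A hedged sketch of the second (resampling) stage, using multinomial sampling as in the text; `x` holds the instrumental sample and `w` the unnormalized importance weights.

```python
import numpy as np

def multinomial_resample(x, w, m, seed=0):
    """Draw an unweighted sample of size m from the weighted sample (x_i, w_i):
    J_1, ..., J_M are i.i.d. with P[J = i] proportional to w_i."""
    rng = np.random.default_rng(seed)
    p = np.asarray(w, dtype=float)
    p = p / p.sum()                      # normalize the importance weights
    idx = rng.choice(len(p), size=m, p=p)
    return np.asarray(x)[idx]
```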

2.3. The population Monte Carlo algorithm. In their generalization of importance sampling, Cappé et al. [3] introduced an iterative dimension in the production of importance samples, aimed at adapting the importance distribution ν to the target distribution π. Iterations are then used to learn about π from the poor or good performance of earlier proposals, and that performance can be evaluated using different criteria, as, for example, the entropy of the distribution of the importance weights. More precisely, at iteration t (t = 0, 1, ..., T) of the PMC algorithm, a sample of values from the target distribution π is produced by a SIR algorithm whose importance function ν_t depends on t, in the sense that ν_t can be derived from the t − 1 previous realizations of the algorithm, except for the first iteration t = 0, where the proposal distribution ν_0 is chosen as in a regular importance sampling experiment. A generic rendering of the algorithm is thus as follows:

PMC ALGORITHM. At time t = 0, 1, ..., T,

1. generate (x_{i,t})_{1≤i≤N} by i.i.d. sampling from ν_t and compute the normalized importance weights ω_{i,t};
2. resample (x̃_{i,t})_{1≤i≤N} from (x_{i,t})_{1≤i≤N} by multinomial sampling with weights (ω_{i,t});
3. construct ν_{t+1} based on (x_{i,τ}, ω_{i,τ})_{1≤i≤N, 0≤τ≤t}.

At this stage of the description of the PMC algorithm, we do not give in detail the construction of ν_t, which can thus arbitrarily depend on the past simulations; Section 3 studies a particular choice of the ν_t's in detail. A major finding of Cappé et al. [3] is, however, that the dependence of ν_t on earlier proposals and realizations does not jeopardize the fundamental importance sampling identity. Local adaptive importance sampling schemes can thus be chosen in much wider generality than was previously thought, and this shows that, thanks to the introduction of a temporal dimension in the selection of the importance function, an adaptive perspective can be adopted at little cost, with a potentially large gain in efficiency.

Obviously, when the construction of the proposal distribution ν_t is completely open, there is no guarantee of permanent improvement of the simulation scheme, whatever the criterion adopted to measure this improvement. For instance, as illustrated by Cappé et al. [3], a constant dependence of ν_t on the past sample quickly leads to stable behavior. A more extreme illustration is the case where the sequence (ν_t) degenerates into a quasi-Dirac mass at a value based on earlier simulations and where the performance of the resulting PMC algorithm worsens. In order to study the positive and negative effects of the update of the importance function ν_t, we hereafter consider a special class of proposals based on mixtures and study two particular updating schemes, one in which improvement does not occur and one in which it does.
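To make the generic scheme concrete, here is a minimal, hedged Python skeleton (the paper's own programs are in R and are not reproduced here). The construction of ν_{t+1} in step 3 is deliberately delegated to a user-supplied callback `build_next`, mirroring the fact that the algorithm statement leaves it open; all names are ours.

```python
import numpy as np

def pmc(sample_nu0, log_nu0, log_pi_tilde, build_next, n, T, seed=0):
    """Generic PMC skeleton: each iteration is importance sampling from nu_t
    followed by multinomial resampling; build_next maps the past output to
    the next proposal as a pair (sample_nu, log_nu)."""
    rng = np.random.default_rng(seed)
    sample_nu, log_nu = sample_nu0, log_nu0
    history = []
    for t in range(T + 1):
        x = sample_nu(n, rng)                      # step 1: draw from nu_t
        logw = log_pi_tilde(x) - log_nu(x)
        w = np.exp(logw - logw.max())
        w /= w.sum()                               # normalized weights
        x_tilde = x[rng.choice(n, size=n, p=w)]    # step 2: resample
        history.append((x, w))
        if t < T:                                  # step 3: adapt the proposal
            sample_nu, log_nu = build_next(history, x_tilde)
    return history
```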

3. The D-kernel PMC algorithm. In this section and the following ones, we study adaptivity for a particular type of parameterized PMC scheme, in the case where ν_t is a mixture of measures ζ_d (1 ≤ d ≤ D) that are chosen prior to the simulation experiment, based either on an educated guess about π or on local approximations as in nonparametric kernel estimation. The dependence of the ν_t's on the past simulations is of the form

$$\nu_t(x) = \sum_{d=1}^{D} \alpha_{t,d}\, \zeta_d\big(\{(\tilde{x}_{i,t-1}, \omega_{i,t-1})\}_{1\le i\le N},\, x\big) = \sum_{d=1}^{D} \alpha_{t,d} \sum_{j=1}^{N} \omega_{j,t-1}\, Q_d(\tilde{x}_{j,t-1}, x),$$

where the ω_{i,t}'s denote the importance weights and the transition kernels Q_d (1 ≤ d ≤ D) are given. This situation is rather common in MCMC settings where several competing transition kernels are simultaneously available, but difficult to compare. For instance, the cycle and mixture MCMC schemes discussed by Tierney [25] are of this nature. Hereafter, we thus focus on adapting the weights α_{t,d} (1 ≤ d ≤ D) toward a better fit with the target distribution.

A natural approach to updating the weights α_{t,d} (1 ≤ d ≤ D) is to favor those kernels Q_d that lead to a high acceptance probability in the resampling step of the PMC algorithm. We thus choose the α_{t,d}'s to be proportional to the survival rates of the corresponding Q_d's in the previous resampling step [since using the double mixture of the measures Q_d(x̃_{j,t-1}, x) means that we first select a point x̃_{j,t-1} in the previous sample with probability ω_{j,t-1}, then select a component d with probability α_{t,d} and, finally, simulate from Q_d(x̃_{j,t-1}, x)].

3.1. Algorithmic details. The family (Q_d)_{1≤d≤D} of transition kernels on X × A is such that Q_d(x, ·) (1 ≤ d ≤ D, x ∈ X) is dominated by the reference measure µ introduced earlier. As above, we denote the corresponding target density and transition kernel densities by π and q_d (1 ≤ d ≤ D), respectively. The associated PMC algorithm then updates the proposal weights and generates the corresponding samples from π as follows:

D-KERNEL PMC ALGORITHM. At time 0,

(a) generate (x_{i,0})_{1≤i≤N} i.i.d. ∼ ν_0 and compute the normalized importance weights ω_{i,0} ∝ {π/ν_0}(x_{i,0});
(b) resample (x_{i,0})_{1≤i≤N} into (x̃_{i,0})_{1≤i≤N} by multinomial sampling, (J_{i,0})_{1≤i≤N} ∼ M(N, (ω_{i,0})_{1≤i≤N}), that is, x̃_{i,0} = x_{J_{i,0},0};
(c) set α_{1,d} = 1/D (1 ≤ d ≤ D).

At time t = 1, ..., T,

(a) select the mixture components (K_{i,t})_{1≤i≤N} i.i.d. ∼ M(1, (α_{t,d})_{1≤d≤D});
(b) generate independent x_{i,t} ∼ q_{K_{i,t}}(x̃_{i,t-1}, x) (1 ≤ i ≤ N) and compute the normalized importance weights ω_{i,t} ∝ π(x_{i,t})/q_{K_{i,t}}(x̃_{i,t-1}, x_{i,t});
(c) resample (x_{i,t})_{1≤i≤N} into (x̃_{i,t})_{1≤i≤N} by multinomial sampling with weights (ω_{i,t})_{1≤i≤N};
(d) set α_{t+1,d} = Σ_{i=1}^N ω_{i,t} I(K_{i,t} = d) (1 ≤ d ≤ D).

In this implementation, at time t ≥ 1, step (a) chooses a kernel index in the mixture for each point of the sample, while step (d) updates the weight α_d as the relative importance of kernel Q_d in the current round or, in other words, as the relative survival rate of the points simulated from kernel Q_d. Indeed, since the survival of a simulated value x_{i,t} is driven by its importance weight ω_{i,t}, reweighting is related to the respective magnitudes of the importance weights for the different kernels. Also, note that the resampling step (c) is used to avoid the propagation of very small importance weights along iterations and the subsequent degeneracy phenomenon that plagues iterated IS schemes like particle filters [11]. At time t = T, the algorithm should thus stop at step (b) for integral approximations to be based on the weighted sample (x_{i,T}, ω_{i,T})_{1≤i≤N}.
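A hedged sketch of the scheme above (ours, not the authors' R code). Each entry of `kernels` is assumed to be a pair `(sample, logpdf)`, where `sample(x_prev, rng)` draws one point from Q_d(x_prev, ·) per row of `x_prev` and `logpdf(x_prev, x)` evaluates log q_d(x_prev, x).

```python
import numpy as np

def d_kernel_pmc(sample_nu0, log_nu0, log_pi_tilde, kernels, n, T, seed=0):
    """Sketch of the D-kernel PMC algorithm of Section 3.1."""
    rng = np.random.default_rng(seed)
    D = len(kernels)
    x = sample_nu0(n, rng)                         # time 0: plain SIR step
    logw = log_pi_tilde(x) - log_nu0(x)
    w = np.exp(logw - logw.max()); w /= w.sum()
    x_tilde = x[rng.choice(n, size=n, p=w)]
    alpha = np.full(D, 1.0 / D)
    for _ in range(T):
        K = rng.choice(D, size=n, p=alpha)         # (a) pick kernel indices
        x = np.empty_like(x_tilde)
        logq = np.empty(n)
        for d in range(D):                         # (b) propagate and weight
            sel = (K == d)
            if sel.any():
                x[sel] = kernels[d][0](x_tilde[sel], rng)
                logq[sel] = kernels[d][1](x_tilde[sel], x[sel])
        logw = log_pi_tilde(x) - logq              # conditional-on-K weights
        w = np.exp(logw - logw.max()); w /= w.sum()
        x_tilde = x[rng.choice(n, size=n, p=w)]    # (c) resample
        alpha = np.array([w[K == d].sum() for d in range(D)])  # (d) survival rates
    return x, w, alpha
```

As Proposition 3.1 in the next subsection shows, the survival-rate update in step (d) converges to the uniform weights 1/D as N grows, so this sketch reproduces exactly the non-adaptive behavior analyzed below.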

3.2. Convergence properties. In order to assess the impact of this update mechanism on the performance of the PMC scheme, we now consider the convergence properties of the above algorithm when the sample size N grows to infinity. Indeed, as already pointed out in Cappé et al. [3], it does not make sense to consider the alternative asymptotics of the PMC scheme, namely, when T grows to infinity, given that this algorithm is intended to be run with a small number T of iterations.

A basic assumption on the kernels Q_d is that the corresponding importance weights are almost surely finite, that is,

(A1) for every d ∈ {1, ..., D}, π ⊗ π{q_d(x, x′) = 0} = 0,

where ξ ⊗ ζ denotes the product measure, that is, ξ ⊗ ζ(A × B) = ∫_{A×B} ξ(dx) ζ(dy). We denote by γ_u the uniform distribution on {1, ..., D}, that is, γ_u(k) = 1/D for all k ∈ {1, ..., D}. The following result (whose proof is given in Appendix B) is a general LLN on the pairs (x_{i,t}, K_{i,t}) produced by the above algorithm.

PROPOSITION 3.1. Under (A1), for any π ⊗ γ_u-integrable function h and every t,

$$\sum_{i=1}^{N} \omega_{i,t}\, h(x_{i,t}, K_{i,t}) \xrightarrow{P} \pi \otimes \gamma_u(h).$$

Note that this convergence result is more than we need for Monte Carlo purposes, since the K_{i,t}'s are auxiliary variables that are not relevant to the original problem. However, the consequences of this general result are far from negligible. First, it implies that the approximation provided by the algorithm is convergent, in the sense that Σ_i ω_{i,t} f(x_{i,t}) is a convergent estimator of π(f). But, more importantly, it also shows that, for t ≥ 1,

$$\alpha_{t+1,d} = \sum_{i=1}^{N} \omega_{i,t}\, \mathbb{I}(K_{i,t} = d) \xrightarrow{P} 1/D.$$

Therefore, at each iteration, the weights of all kernels converge to 1/D when the sample size N grows to infinity. This translates into a lack of learning properties for the D-kernel PMC algorithm: its properties at iteration 1 and at iteration 10 are the same. In other words, this algorithm is not adaptive and only requires one iteration for a large value of N. We can also relate this to the fast stabilization of the approximation in [3]. Note that a CLT can be established for this algorithm but, given its unappealing features, we leave the exercise to the interested reader.

4. The Kullback divergence. In order to obtain a correct and adaptive version of the D-kernel algorithm, we must first choose an effective criterion to evaluate both the adaptivity and the approximation of the target distribution by the proposal distribution. We then propose a modification of the original D-kernel algorithm that achieves efficiency in this sense.

As argued in many papers, using a wide range of arguments, a natural choice of approximation metric is the Kullback divergence. We thus aim to derive the D-kernel mixture that minimizes the Kullback divergence between this mixture and the target measure π,

(1) $$\int \log\!\left(\frac{\pi(x)\,\pi(x')}{\pi(x) \sum_{d=1}^{D} \alpha_d\, q_d(x, x')}\right) \pi(\mathrm{d}x)\, \pi(\mathrm{d}x').$$

This section is devoted to the problem of finding an iterative choice of the mixing coefficients α_d that converge to this minimum. A detailed description of the optimal PMC scheme is given in Section 5.

4.1. The criterion. Using the same notation as above, in conjunction with the choice of weights α_d in the D-kernel mixture, we introduce the simplex of R^D,

$$\mathcal{S} = \Big\{ \alpha = (\alpha_1, \ldots, \alpha_D);\ \forall d \in \{1, \ldots, D\},\ \alpha_d \ge 0 \ \text{and}\ \sum_{d=1}^{D} \alpha_d = 1 \Big\},$$

and denote π̄ = π ⊗ π. We now assume that the D kernels also satisfy the condition

(A2) for every d ∈ {1, ..., D}, E_π̄[|log q_d(X, X′)|] = ∫ |log q_d(x, x′)| π̄(dx, dx′) < ∞,

which is automatically satisfied when all the ratios π/q_d are bounded. From the Kullback divergence, we derive a function on S such that, for α ∈ S,

$$E_{\bar\pi}(\alpha) = \int \log\Big[\sum_{d=1}^{D} \alpha_d\, q_d(x, x')\Big]\, \bar\pi(\mathrm{d}x, \mathrm{d}x') = \mathbb{E}_{\bar\pi}\Big[\log\Big(\sum_{d=1}^{D} \alpha_d\, q_d(X, X')\Big)\Big].$$

By virtue of Jensen's inequality, E_π̄(α) ≤ ∫ π(x) log π(x) dx for all α ∈ S. Note that, due to the strict concavity of the log function, E_π̄ is a strictly concave function on a connected compact set and thus has no local maximum besides the global maximum, denoted α^max. Since the Kullback divergence (1) can be written as

$$\int \log \pi(x)\, \pi(\mathrm{d}x) - E_{\bar\pi}(\alpha),$$

α^max is the optimal vector of weights for a mixture of transition kernels, in the sense that the product of π by this mixture is the closest to the product distribution π̄.

4.2. A maximization algorithm. We now study an iterative procedure, akin to the EM algorithm, that updates the weights so that the function E_π̄(α) increases at each step. Defining the function F on S as

$$F(\alpha) = \left( \alpha_d\, \mathbb{E}_{\bar\pi}\left[ \frac{q_d(X, X')}{\sum_{j=1}^{D} \alpha_j\, q_j(X, X')} \right] \right)_{1 \le d \le D},$$

we construct the sequence (α^t)_{t≥1} on S such that α^1 = (1/D, ..., 1/D) and α^{t+1} = F(α^t) for t ≥ 1. Note that, under assumption (A1), for all t ≥ 0,

$$\mathbb{E}_{\bar\pi}\left[ q_d(X, X') \Big/ \sum_{j=1}^{D} \alpha_j^t\, q_j(X, X') \right] > 0$$

and thus, for all t ≥ 0 and 1 ≤ d ≤ D, we have α_d^t > 0.
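In practice, the expectation defining F is approximated by simulation. The following minimal sketch (ours) assumes a matrix `q_vals` of kernel density evaluations at M pairs approximately distributed from π̄; in the PMC implementation of Section 5, such pairs are produced by the algorithm itself.

```python
import numpy as np

def F_update(alpha, q_vals):
    """One step of alpha -> F(alpha).  q_vals[d, i] = q_d(x_i, x'_i) for M
    pairs (x_i, x'_i) drawn (approximately) from pi-bar = pi (x) pi."""
    denom = alpha @ q_vals                  # sum_j alpha_j q_j(x_i, x'_i)
    alpha = alpha * np.mean(q_vals / denom, axis=1)
    return alpha / alpha.sum()              # guard against Monte Carlo drift

def iterate_to_optimum(q_vals, T=100):
    """Iterate alpha^{t+1} = F(alpha^t) from the uniform start alpha^1."""
    alpha = np.full(q_vals.shape[0], 1.0 / q_vals.shape[0])
    for _ in range(T):
        alpha = F_update(alpha, q_vals)
    return alpha
```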

If we define the extremal set

$$\mathcal{I}_D = \left\{ \alpha \in \mathcal{S};\ \forall d \in \{1, \ldots, D\},\ \text{either } \alpha_d = 0 \text{ or } \mathbb{E}_{\bar\pi}\left[\frac{q_d(X, X')}{\sum_{j=1}^{D} \alpha_j\, q_j(X, X')}\right] = 1 \right\},$$

we then have the following fixed-point result:

PROPOSITION 4.1. Under (A1) and (A2),

(i) E_π̄ ∘ F − E_π̄ is continuous;
(ii) for all α ∈ S, E_π̄(F(α)) ≥ E_π̄(α);
(iii) I_D = {α ∈ S; F(α) = α} = {α ∈ S; E_π̄(F(α)) = E_π̄(α)} and I_D is finite.

PROOF. E_π̄ is clearly continuous. Moreover, by Lebesgue's dominated convergence theorem, the function

$$\alpha \mapsto \mathbb{E}_{\bar\pi}\left[\frac{\alpha_d\, q_d(X, X')}{\sum_{j=1}^{D} \alpha_j\, q_j(X, X')}\right]$$

is also continuous, which implies that F is continuous. This completes the proof of (i).

Due to the concavity of the log function,

$$E_{\bar\pi}(F(\alpha)) - E_{\bar\pi}(\alpha) = \mathbb{E}_{\bar\pi}\left[ \log \sum_{d=1}^{D} \frac{\alpha_d\, q_d(X, X')}{\sum_{j=1}^{D} \alpha_j q_j(X, X')}\, \mathbb{E}_{\bar\pi}\!\left\{\frac{q_d(X, X')}{\sum_{j=1}^{D} \alpha_j q_j(X, X')}\right\} \right]$$
$$\ge \mathbb{E}_{\bar\pi}\left[ \sum_{d=1}^{D} \frac{\alpha_d\, q_d(X, X')}{\sum_{j=1}^{D} \alpha_j q_j(X, X')}\, \log \mathbb{E}_{\bar\pi}\!\left\{\frac{q_d(X, X')}{\sum_{j=1}^{D} \alpha_j q_j(X, X')}\right\} \right]$$
$$= \sum_{d=1}^{D} \alpha_d\, \mathbb{E}_{\bar\pi}\!\left[\frac{q_d(X, X')}{\sum_{j=1}^{D} \alpha_j q_j(X, X')}\right] \log \mathbb{E}_{\bar\pi}\!\left[\frac{q_d(X, X')}{\sum_{j=1}^{D} \alpha_j q_j(X, X')}\right].$$

Applying the inequality u log u ≥ u − 1 yields (ii). Moreover, the equality in u log u ≥ u − 1 holds if and only if u = 1. Therefore, equality above is equivalent to

$$\forall d, \qquad \alpha_d \ne 0 \implies \mathbb{E}_{\bar\pi}\left[ q_d(X, X') \Big/ \sum_{j=1}^{D} \alpha_j q_j(X, X') \right] = 1.$$

Thus, I_D = {α ∈ S; E_π̄(F(α)) = E_π̄(α)}. The second equality, I_D = {α ∈ S; F(α) = α}, is straightforward.

We now prove by recursion on D that I_D is finite, which is equivalent to proving that {α ∈ I_D; α_d ≠ 0 for all d ∈ {1, ..., D}} is empty or finite. If this set is nonempty, then any element α in this set satisfies

$$\forall d \in \{1, \ldots, D\}, \qquad \mathbb{E}_{\bar\pi}\left[ q_d(X, X') \Big/ \sum_{j=1}^{D} \alpha_j q_j(X, X') \right] = 1,$$

which implies that

$$0 = \sum_{d=1}^{D} \alpha_d^{\max}\, \mathbb{E}_{\bar\pi}\left[\frac{q_d(X, X')}{\sum_{j=1}^{D} \alpha_j q_j(X, X')}\right] - 1 = \mathbb{E}_{\bar\pi}\left[\frac{\sum_{d=1}^{D} \alpha_d^{\max} q_d(X, X')}{\sum_{j=1}^{D} \alpha_j q_j(X, X')} - 1\right] \ge \mathbb{E}_{\bar\pi}\left[\log \frac{\sum_{d=1}^{D} \alpha_d^{\max} q_d(X, X')}{\sum_{j=1}^{D} \alpha_j q_j(X, X')}\right] \ge 0.$$

Since the global maximum of E_π̄ is unique, we conclude that α = α^max and hence (iii) follows.

4.3. Average EM. Proposition 4.1 implies that our recursive procedure satisfies E_π̄(α^{t+1}) ≥ E_π̄(α^t). Therefore, the Kullback divergence criterion (1) decreases at each step. This property is closely linked to the EM algorithm ([20], Section 5.3). More precisely, consider the mixture model

$$V \sim \mathcal{M}(1, \{\alpha_1, \ldots, \alpha_D\}) \quad\text{and}\quad W = (X, X') \mid V \sim \pi(x)\, q_V(x, x'),$$

with parameter α. We denote by Ē_α the corresponding expectation, by p_α(v, w) the joint density of (V, W) with respect to µ ⊗ µ and by p_α(w) the density of W with respect to µ ⊗ µ. It is then easy to check that

$$E_{\bar\pi}(\alpha) = \int \log(p_\alpha(w))\, \bar\pi(\mathrm{d}w),$$

which is an average version of the criterion to be maximized in the EM algorithm when only W is observed. In this case, a natural idea, adapted from the EM algorithm, is to update α according to the iterative scheme

$$\alpha^{t+1} = \arg\max_{\alpha \in \mathcal{S}} \int \bar{E}_{\alpha^t}\big[\log p_\alpha(V, w) \mid w\big]\, \bar\pi(\mathrm{d}w).$$

Straightforward algebra can be used to show that this definition of α^{t+1} is equivalent to the update formula α^{t+1} = F(α^t) that we used above. Our algorithm then appears as an average EM, and it shares with EM the property that the criterion increases deterministically at each step.

The following result ensures that any α different from α^max is repulsive.

PROPOSITION 4.2. Under (A1) and (A2), for every α ≠ α^max in S, there exists a neighborhood V_α of α such that, if α^{t_0} ∈ V_α, then (α^t)_{t≥t_0} leaves V_α within a finite time.

PROOF. Let α ≠ α^max. Then, using the inequality u − 1 ≥ log u, we have

$$\sum_{d=1}^{D} \alpha_d^{\max}\, \mathbb{E}_{\bar\pi}\left[\frac{q_d(X, X')}{\sum_{j=1}^{D} \alpha_j q_j(X, X')}\right] - 1 \ \ge\ \mathbb{E}_{\bar\pi}\left[\log \frac{\sum_{d=1}^{D} \alpha_d^{\max} q_d(X, X')}{\sum_{j=1}^{D} \alpha_j q_j(X, X')}\right] > 0,$$

which implies that there exists 1 ≤ d ≤ D such that

$$\mathbb{E}_{\bar\pi}\left[ q_d(X, X') \Big/ \sum_{j=1}^{D} \alpha_j q_j(X, X') \right] > 1.$$

Using a nonincreasing sequence (W_n)_{n≥0} of neighborhoods of α in S, the monotone convergence theorem implies that

$$1 < \mathbb{E}_{\bar\pi}\left[\frac{q_d(X, X')}{\sum_{j=1}^{D} \alpha_j q_j(X, X')}\right] = \mathbb{E}_{\bar\pi}\left[\liminf_{n} \inf_{\beta \in W_n} \frac{q_d(X, X')}{\sum_{j=1}^{D} \beta_j q_j(X, X')}\right] \le \liminf_{n}\, \mathbb{E}_{\bar\pi}\left[\inf_{\beta \in W_n} \frac{q_d(X, X')}{\sum_{j=1}^{D} \beta_j q_j(X, X')}\right].$$

Thus, there exist W_{n_0} = V_α, a neighborhood of α, and η > 1 such that

(2) $$\forall \beta \in V_\alpha, \qquad \mathbb{E}_{\bar\pi}\left[ q_d(X, X') \Big/ \sum_{j=1}^{D} \beta_j q_j(X, X') \right] > \eta.$$

Now, for all t ≥ 0 and 1 ≤ d ≤ D, we have 1 ≥ α_d^t > 0. Combining (2) with the update formula α^t = F(α^{t-1}) shows that (α^t)_{t≥0} leaves V_α within a finite time.

We thus conclude that the maximization algorithm is convergent, as asserted by the following proposition:

PROPOSITION 4.3. Under (A1) and (A2), lim_{t→∞} α^t = α^max.

PROOF. First, recall that I_D is a finite set and that α^max ∈ I_D, that is, I_D = {β_0, β_1, ..., β_I} with β_0 = α^max. We introduce a sequence (W_i)_{0≤i≤I} of disjoint neighborhoods of the β_i's such that, for all 0 ≤ i ≤ I, F(W_i) is disjoint from ∪_{j≠i} W_j [this is possible since F(β_i) = β_i and F is continuous] and, for all i ∈ {1, ..., I}, W_i ⊆ V_{β_i}, where the V_{β_i}'s are defined as in the proof of Proposition 4.2. The sequence (E_π̄(α^t))_{t≥0} is upper-bounded and nondecreasing; therefore it converges. This implies that lim_t E_π̄(F(α^t)) − E_π̄(α^t) = 0. By continuity of E_π̄ ∘ F − E_π̄, there exists T > 0 such that, for all t ≥ T, α^t ∈ ∪_j W_j. Since F(W_i) is disjoint from ∪_{j≠i} W_j, this implies that there exists i ∈ {0, ..., I} such that, for all t ≥ T, α^t ∈ W_i. By Proposition 4.2, i cannot be in {1, ..., I}. Thus, for all t ≥ T, α^t ∈ W_0, which is a neighborhood of β_0 = α^max.

5. The Rao-Blackwellized D-kernel PMC. Updating the weights α_d through the transform F thus improves the Kullback divergence criterion. We now discuss how to implement this mechanism within a PMC algorithm that resembles the previous D-kernel algorithm. The only difference from the algorithm of Section 3.1 is that we make use of the kernel structure in the computation of the importance weight. In MCMC terminology, this is called Rao-Blackwellization ([20], Section 4.2) and it is known to provide variance reduction in data augmentation settings ([20], Section 9.2). In the current context, the improvement brought about by Rao-Blackwellization is dramatic, in that the modified algorithm does converge to the proposal mixture that is closest to the target distribution in the sense of the Kullback divergence. More precisely, a Monte Carlo version of the update via F can be implemented in the iterative definition of the mixture weights, in the same way that MCEM approximates EM [20].

5.1. The algorithm. In importance sampling, as well as in MCMC settings, the deconditioning improvement brought about by Rao-Blackwellization may be significant [5]. In the case of the D-kernel PMC scheme, the Rao-Blackwellization argument is that it is not necessary to condition on the value of the mixture component in the computation of the importance weight and that the improvement is brought about by using the whole mixture. The importance weight should therefore be

$$\pi(x_{i,t}) \Big/ \sum_{d=1}^{D} \alpha_{t,d}\, q_d(\tilde{x}_{i,t-1}, x_{i,t}) \qquad\text{rather than}\qquad \pi(x_{i,t}) \big/ q_{K_{i,t}}(\tilde{x}_{i,t-1}, x_{i,t}),$$

as in the algorithm of Section 3.1. As already noted by Hesterberg [15], the use of the whole mixture in the importance weight is a robust tool that prevents infinite variance importance sampling estimators. In the next section, we show that this choice of weight guarantees that the following modification of the D-kernel algorithm converges to the optimal mixture in terms of Kullback divergence.

RAO-BLACKWELLIZED D-KERNEL PMC ALGORITHM. At time 0, use the same steps as in the D-kernel PMC algorithm to obtain (x̃_{i,0})_{1≤i≤N} and set α_{1,d} = 1/D (1 ≤ d ≤ D). At time t = 1, ..., T,

(a) generate (K_{i,t})_{1≤i≤N} i.i.d. ∼ M(1, (α_{t,d})_{1≤d≤D});
(b) generate independent x_{i,t} ∼ q_{K_{i,t}}(x̃_{i,t-1}, x) (1 ≤ i ≤ N) and compute the normalized importance weights ω_{i,t} ∝ π(x_{i,t}) / Σ_{d=1}^D α_{t,d} q_d(x̃_{i,t-1}, x_{i,t});
(c) resample (x_{i,t})_{1≤i≤N} into (x̃_{i,t})_{1≤i≤N} by multinomial sampling with weights (ω_{i,t})_{1≤i≤N};
(d) set α_{t+1,d} = Σ_{i=1}^N ω_{i,t} I(K_{i,t} = d) (1 ≤ d ≤ D).
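A hedged sketch of this Rao-Blackwellized scheme (again ours, not the authors' R code), using the same `kernels[d] = (sample, logpdf)` convention as the Section 3.1 sketch; the only change is in step (b), where the weight denominator is the full mixture Σ_d α_{t,d} q_d.

```python
import numpy as np

def rb_d_kernel_pmc(sample_nu0, log_nu0, log_pi_tilde, kernels, n, T, seed=0):
    """Sketch of the Rao-Blackwellized D-kernel PMC algorithm."""
    rng = np.random.default_rng(seed)
    D = len(kernels)
    x = sample_nu0(n, rng)                          # time 0: plain SIR step
    logw = log_pi_tilde(x) - log_nu0(x)
    w = np.exp(logw - logw.max()); w /= w.sum()
    x_tilde = x[rng.choice(n, size=n, p=w)]
    alpha = np.full(D, 1.0 / D)
    for _ in range(T):
        K = rng.choice(D, size=n, p=alpha)          # (a) pick kernel indices
        x = np.empty_like(x_tilde)
        for d in range(D):
            sel = (K == d)
            if sel.any():
                x[sel] = kernels[d][0](x_tilde[sel], rng)   # (b) propagate
        # Rao-Blackwellized weights: full-mixture denominator, computed in
        # log scale for numerical stability
        logq = np.stack([kernels[d][1](x_tilde, x) for d in range(D)])
        m = logq.max(axis=0)
        log_mix = m + np.log(alpha @ np.exp(logq - m))
        logw = log_pi_tilde(x) - log_mix
        w = np.exp(logw - logw.max()); w /= w.sum()
        x_tilde = x[rng.choice(n, size=n, p=w)]     # (c) resample
        alpha = np.array([w[K == d].sum() for d in range(D)])  # (d)
    return x, w, alpha
```

Only the weight computation differs from the Section 3.1 sketch; by Proposition 5.1 below, the α iterates now track the deterministic sequence α^{t+1} = F(α^t) rather than collapsing to the uniform weights.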

5.2. The corresponding law of large numbers. Not very surprisingly, the sample obtained at each iteration of the above Rao-Blackwellized algorithm approximates the target distribution in the sense of the weak law of large numbers (LLN). Note that the convergence holds under the very weak assumption (A1) and for any test function h that is absolutely integrable with respect to the target distribution π. The function h may thus be unbounded.

THEOREM 5.1. Under (A1), for any function h in L¹(π) and for all t ≥ 0,

$$\sum_{i=1}^{N} \omega_{i,t}\, h(x_{i,t}) \xrightarrow{P} \pi(h) \qquad\text{and}\qquad \frac{1}{N} \sum_{i=1}^{N} h(\tilde{x}_{i,t}) \xrightarrow{P} \pi(h).$$

PROOF. First, convergence of the second average follows from the convergence of the first average by Theorem A.1, since the latter is simply a multinomial sampling perturbation of the former. We thus focus on the weighted average and proceed by induction on t. The case t = 0 is the basic importance sampling convergence result. For t ≥ 1, if the convergence holds for t − 1, then to prove convergence at iteration t, we need only check, as in Proposition 3.1, that

$$\frac{1}{N} \sum_{i=1}^{N} \bar\omega_{i,t}\, h(x_{i,t}) \xrightarrow{P} \pi(h),$$

where ω̄_{i,t} denotes the unnormalized importance weight π(x_{i,t}) / Σ_{d=1}^D α_{t,d} q_d(x̃_{i,t-1}, x_{i,t}). The special case h ≡ 1 ensures that the renormalizing sum converges to 1. We apply Theorem A.1 with

$$\mathcal{G}_N = \sigma\big( (\tilde{x}_{i,t-1})_{1\le i\le N},\ (\alpha_{t,d})_{1\le d\le D} \big) \qquad\text{and}\qquad U_{N,i} = N^{-1}\, \bar\omega_{i,t}\, h(x_{i,t}).$$

Then, conditionally on G_N, the x_{i,t}'s (1 ≤ i ≤ N) are independent and

$$x_{i,t} \mid \mathcal{G}_N \sim \sum_{d=1}^{D} \alpha_{t,d}\, Q_d(\tilde{x}_{i,t-1}, \cdot).$$

Noting that

$$\mathbb{E}\big[\bar\omega_{i,t}\, h(x_{i,t}) \mid \mathcal{G}_N\big] = \mathbb{E}\left[\frac{\pi(x_{i,t})\, h(x_{i,t})}{\sum_{d=1}^{D} \alpha_{t,d}\, q_d(\tilde{x}_{i,t-1}, x_{i,t})} \,\Big|\, \mathcal{G}_N\right] = \pi(h),$$

we only need to check condition (iii). The end of the proof is then quite similar to the proof of Proposition 3.1.

5.3. Convergence of the weights. The next proposition ensures that, at each iteration, the update of the mixture weights in the Rao-Blackwellized algorithm approximates the theoretical update obtained in Section 4.2 for minimizing the Kullback divergence criterion.

PROPOSITION 5.1. Under (A1), for all t ≥ 1 and 1 ≤ d ≤ D,

(3) $$\alpha_{t,d} \xrightarrow{P} \alpha_d^t, \qquad\text{where } \alpha^t = F(\alpha^{t-1}).$$

Combining Proposition 5.1 with Proposition 4.3, we obtain that, under assumptions (A1)-(A2), the Rao-Blackwellized version of the PMC algorithm adapts the weights of the proposed mixture of kernels, in the sense that it converges to the optimal combination of mixtures with respect to the Kullback divergence criterion obtained in Section 4.2.

PROOF OF PROPOSITION 5.1. The case t = 1 is obvious. Now, assume that (3) holds for some t ≥ 1. As in the proof of Proposition 3.1, we now establish that

$$\frac{1}{N} \sum_{i=1}^{N} \bar\omega_{i,t}\, \mathbb{I}(K_{i,t} = d) = \frac{1}{N} \sum_{i=1}^{N} \frac{\pi(x_{i,t})}{\sum_{l=1}^{D} \alpha_{t,l}\, q_l(\tilde{x}_{i,t-1}, x_{i,t})}\, \mathbb{I}(K_{i,t} = d) \xrightarrow{P} \alpha_d^{t+1},$$

the convergence of the renormalizing sum to 1 being a consequence of this convergence. We apply Theorem A.1 with

$$\mathcal{G}_N = \sigma\big( (\tilde{x}_{i,t-1})_{1\le i\le N},\ (\alpha_{t,d})_{1\le d\le D} \big) \qquad\text{and}\qquad U_{N,i} = N^{-1}\, \bar\omega_{i,t}\, \mathbb{I}(K_{i,t} = d).$$

Conditionally on G_N, the (K_{i,t}, x_{i,t})'s (1 ≤ i ≤ N) are independent and, for all (d, A) in {1, ..., D} × A, we have

$$\mathbb{P}(K_{i,t} = d,\ x_{i,t} \in A \mid \mathcal{G}_N) = \alpha_{t,d}\, Q_d(\tilde{x}_{i,t-1}, A).$$

To apply Theorem A.1, we need only check condition (iii). We have, for C > 0,

$$\frac{1}{N} \sum_{i=1}^{N} \mathbb{E}\big[\bar\omega_{i,t}\, \mathbb{I}(K_{i,t} = d)\, \mathbb{I}\{\bar\omega_{i,t}\, \mathbb{I}(K_{i,t} = d) > C\} \mid \mathcal{G}_N\big] \le \sum_{j=1}^{D} \frac{1}{N} \sum_{i=1}^{N} \int \mathbb{I}\left\{\frac{\pi(x)}{D^{-1}\, q_j(\tilde{x}_{i,t-1}, x)} > C\right\} \pi(\mathrm{d}x) \xrightarrow{P} \sum_{j=1}^{D} \bar\pi\left\{\frac{\pi(x')}{D^{-1}\, q_j(x, x')} > C\right\},$$

by the LLN of Theorem 5.1. The right-hand side converges to 0 as C tends to infinity since, by assumption (A1), π̄{q_j(x, x′) = 0} = 0. Thus, Theorem A.1 applies and

$$\frac{1}{N} \sum_{i=1}^{N} \Big( \bar\omega_{i,t}\, \mathbb{I}(K_{i,t} = d) - \mathbb{E}\big[\bar\omega_{i,t}\, \mathbb{I}(K_{i,t} = d) \mid \mathcal{G}_N\big] \Big) \xrightarrow{P} 0.$$

To complete the proof, it simply remains to show that

(4) $$\frac{1}{N} \sum_{i=1}^{N} \mathbb{E}\big[\bar\omega_{i,t}\, \mathbb{I}(K_{i,t} = d) \mid \mathcal{G}_N\big] = \frac{1}{N} \sum_{i=1}^{N} \int \frac{\alpha_{t,d}\, q_d(\tilde{x}_{i,t-1}, x)}{\sum_{l=1}^{D} \alpha_{t,l}\, q_l(\tilde{x}_{i,t-1}, x)}\, \pi(\mathrm{d}x) \xrightarrow{P} \alpha_d^{t+1}.$$

It follows from the LLN stated in Theorem 5.1 that

(5) $$\frac{1}{N} \sum_{i=1}^{N} \int \frac{\alpha_d^t\, q_d(\tilde{x}_{i,t-1}, x)}{\sum_{l=1}^{D} \alpha_l^t\, q_l(\tilde{x}_{i,t-1}, x)}\, \pi(\mathrm{d}x) \xrightarrow{P} \alpha_d^t\, \mathbb{E}_{\bar\pi}\left[\frac{q_d(X, X')}{\sum_{l=1}^{D} \alpha_l^t\, q_l(X, X')}\right] = \alpha_d^{t+1},$$

and it thus suffices to check that the difference between (4) and (5) converges to 0 in probability. To show this, first note that, for all t ≥ 1 and all 1 ≤ d ≤ D, α_d^t > 0 and thus, by the induction assumption, for all 1 ≤ d ≤ D, (α_{t,d} − α_d^t)/α_d^t →_P 0. Straightforward algebra then yields the bound

$$\left| \frac{\alpha_{t,d}\, q_d(\tilde{x}_{i,t-1}, x)}{\sum_{l=1}^{D} \alpha_{t,l}\, q_l(\tilde{x}_{i,t-1}, x)} - \frac{\alpha_d^t\, q_d(\tilde{x}_{i,t-1}, x)}{\sum_{l=1}^{D} \alpha_l^t\, q_l(\tilde{x}_{i,t-1}, x)} \right| \le 2 \sup_{l \in \{1, \ldots, D\}} \frac{|\alpha_{t,l} - \alpha_l^t|}{\alpha_l^t}.$$

The proposition then follows from the convergence (α_{t,l} − α_l^t)/α_l^t →_P 0.

5.4. A corresponding central limit theorem. We now establish a CLT for the weighted and the unweighted samples when the sample size N goes to infinity. As noted in Section 2.2 for the SIR algorithm, the asymptotic variance associated with the unweighted sample is larger than the variance of the weighted sample because of the additional multinomial step.

THEOREM 5.2. Under (A1),

(i) for a function h such that π̄{h²(x′)π(x′)/q_d(x, x′)} < ∞ for at least one 1 ≤ d ≤ D, we have

(6) $$\sqrt{N} \sum_{i=1}^{N} \omega_{i,t}\, \{h(x_{i,t}) - \pi(h)\} \xrightarrow{\mathcal{L}} \mathcal{N}(0, \sigma_t^2),$$

where

$$\sigma_t^2 = \int \{h(x') - \pi(h)\}^2\, \frac{\pi(x')}{\sum_{d=1}^{D} \alpha_d^t\, q_d(x, x')}\, \bar\pi(\mathrm{d}x, \mathrm{d}x');$$

(ii) if, moreover, π(h²) < ∞, then

(7) $$\sqrt{N}\, \Big\{ N^{-1} \sum_{i=1}^{N} h(\tilde{x}_{i,t}) - \pi(h) \Big\} \xrightarrow{\mathcal{L}} \mathcal{N}\big(0, \sigma_t^2 + V_\pi(h)\big).$$

Note that, amongst the conditions under which this theorem applies, the integrability condition

(8) $$\bar\pi\left[\frac{h^2(x')\, \pi(x')}{q_d(x, x')}\right] < \infty$$

is required for some d in {1, ..., D} and not for all d's. Thus, settings where some transition kernels q_d(·, ·) do not satisfy (8) can still be covered by this theorem, provided (8) holds for at least one particular kernel.

An equivalent expression for the asymptotic variance σ_t² is

$$\sigma_t^2 = V_\nu\big(\{h - \pi(h)\}\, \bar\pi/\nu\big), \qquad\text{where}\quad \nu(x, x') = \pi(x) \sum_{d=1}^{D} \alpha_d^t\, q_d(x, x').$$

Written in this form, σ_t² turns out to be the expression of the asymptotic variance that appears in the CLT associated with a self-normalized IS algorithm (SIS; see Section 2.1 for a description of the algorithm) where the proposal distribution would be ν and the target distribution π̄. Obviously, this SIS algorithm cannot be implemented since, given the above definition of ν, the proposal distribution depends on both π and the weights α_d^t, which are unknown.

PROOF OF THEOREM 5.2. Without loss of generality, we assume that π(h) = 0. Let d_0 ∈ {1, ..., D} be such that π̄(h²(x′)π(x′)/q_{d_0}(x, x′)) < ∞. A consequence of the proof of Theorem 5.1 is that Σ_{i=1}^N ω̄_{i,t}/N →_P 1, so we only need to prove that

(9) $$N^{-1/2} \sum_{i=1}^{N} \bar\omega_{i,t}\, h(x_{i,t}) \xrightarrow{\mathcal{L}} \mathcal{N}(0, \sigma_t^2).$$

We will apply Theorem A.2 with

$$U_{N,i} = N^{-1/2}\, \bar\omega_{i,t}\, h(x_{i,t}) = N^{-1/2}\, \frac{\pi(x_{i,t})\, h(x_{i,t})}{\sum_{d=1}^{D} \alpha_{t,d}\, q_d(\tilde{x}_{i,t-1}, x_{i,t})}, \qquad \mathcal{G}_N = \sigma\big\{ (\tilde{x}_{i,t-1})_{1\le i\le N},\ (\alpha_{t,d})_{1\le d\le D} \big\}.$$

Conditionally on G_N, the x_{i,t}'s (1 ≤ i ≤ N) are independent and x_{i,t} | G_N ∼ Σ_{d=1}^D α_{t,d} Q_d(x̃_{i,t-1}, ·). Conditions (i) and (ii) of Theorem A.2 are clearly satisfied. To check condition (iii), first note that E(U_{N,i} | G_N) = N^{-1/2} π(h) = 0. Moreover,

$$A_N = \sum_{i=1}^{N} \mathbb{E}(U_{N,i}^2 \mid \mathcal{G}_N) = \frac{1}{N} \sum_{i=1}^{N} \int h^2(x)\, \frac{\pi(x)}{\sum_{d=1}^{D} \alpha_{t,d}\, q_d(\tilde{x}_{i,t-1}, x)}\, \pi(\mathrm{d}x).$$

By the LLN for (x̃_{i,t}) stated in Theorem 5.1, we have

$$B_N = \frac{1}{N} \sum_{i=1}^{N} \int h^2(x)\, \frac{\pi(x)}{\sum_{d=1}^{D} \alpha_d^t\, q_d(\tilde{x}_{i,t-1}, x)}\, \pi(\mathrm{d}x) \xrightarrow{P} \sigma_t^2.$$

To prove that (iii) holds, it is thus sufficient to show that B_N − A_N →_P 0. Since α_{t,d_0} →_P α_{d_0}^t > 0, we need only consider the upper bound

$$\mathbb{I}\{\alpha_{t,d_0} > 2^{-1} \alpha_{d_0}^t\}\, |B_N - A_N| \le \mathbb{I}\{\alpha_{t,d_0} > 2^{-1} \alpha_{d_0}^t\} \sup_{1\le d\le D} \frac{|\alpha_{t,d} - \alpha_d^t|}{\alpha_d^t}\ \frac{1}{N} \sum_{j=1}^{N} \int \frac{h^2(x)\, \pi(x)}{2^{-1} \alpha_{d_0}^t\, q_{d_0}(\tilde{x}_{j,t-1}, x)}\, \pi(\mathrm{d}x) \xrightarrow{P} 0.$$

Thus, condition (iii) is satisfied. Finally, we consider condition (iv). Using the same argument as was used for condition (iii), we have that

$$\mathbb{I}\{\alpha_{t,d_0} > 2^{-1} \alpha_{d_0}^t\}\, \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}\left[ \bar\omega_{i,t}^2\, h^2(x_{i,t})\, \mathbb{I}\left\{ \frac{\pi(x_{i,t})\, |h(x_{i,t})|}{\sum_{d=1}^{D} \alpha_{t,d}\, q_d(\tilde{x}_{i,t-1}, x_{i,t})} > C \right\} \,\Big|\, \mathcal{G}_N \right]$$
$$\le \frac{1}{N} \sum_{i=1}^{N} \int \frac{h^2(x)\, \pi(x)}{2^{-1} \alpha_{d_0}^t\, q_{d_0}(\tilde{x}_{i,t-1}, x)}\, \mathbb{I}\left\{ \frac{\pi(x)\, |h(x)|}{2^{-1} \alpha_{d_0}^t\, q_{d_0}(\tilde{x}_{i,t-1}, x)} > C \right\} \pi(\mathrm{d}x) \xrightarrow{P} \bar\pi\left[ \frac{h^2(x')\, \pi(x')}{2^{-1} \alpha_{d_0}^t\, q_{d_0}(x, x')}\, \mathbb{I}\left\{ \frac{\pi(x')\, |h(x')|}{2^{-1} \alpha_{d_0}^t\, q_{d_0}(x, x')} > C \right\} \right],$$

which converges to 0 as C tends to infinity. Thus, Theorem A.2 can be applied and the proof of (6) is complete. The proof of (7) follows from a direct application of Theorem A.2, as in the SIR result, by setting U_{N,i} = N^{-1/2} h(x̃_{i,t}) and G_N = σ{(x_{i,t})_{1≤i≤N}, (ω_{i,t})_{1≤i≤N}}.
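For practical use of the CLT (6), σ_t² must itself be estimated from the simulation output. A standard self-normalized importance sampling plug-in can be used (our suggestion; the paper does not prescribe a particular estimator):

```python
import numpy as np

def sis_variance_estimate(h_vals, w):
    """Plug-in estimate of the asymptotic variance in (6) from a weighted
    sample: N * sum_i wbar_i^2 (h_i - mu_hat)^2, with normalized weights
    wbar and self-normalized mean mu_hat."""
    w = np.asarray(w, dtype=float)
    wbar = w / w.sum()
    mu_hat = np.sum(wbar * h_vals)
    return float(len(w) * np.sum(wbar**2 * (h_vals - mu_hat) ** 2))
```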

6. Illustrations. In this section, we briefly show how the iterations of the PMC algorithm quickly implement adaptivity toward the most efficient mixture of kernels, through three examples of moderate difficulty. The R programs are available from the last author's website via an Snw file.

EXAMPLE 1. As a first toy example, we take the target π to be the density of a five-dimensional normal mixture (10), with three equally weighted components and covariance matrices Σ_i, and an independent normal mixture proposal with the same means and variances as (10), but started with different weights. Note that this is a very special case of a D-kernel PMC scheme in that the transition kernels of Section 5.1 are then independent proposals. In this case, the optimal choice of weights is obviously α_d = 1/3. In our experiment, we used three Wishart-simulated variances Σ_i, with 10, 15 and 7 degrees of freedom, respectively. The starting values α_{1,d} are indicated on the left of Figure 1, which clearly shows the convergence to the optimal values 1/3 and 2/3 for the first two accumulated weights in less than ten iterations. Generating more simulated points at each iteration stabilizes the convergence graph, but the primary aim of this example is to exhibit the fast convergence to the true optimal values of the weights.

FIG. 1. Convergence of the accumulated weights α_{t,1} and α_{t,1} + α_{t,2} for the three-component normal mixture to the optimal values 1/3 and 2/3, represented by dotted lines. At each iteration, N = 1,000 points were simulated from the D-kernel proposal.
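The independent-proposal special case is easy to reproduce in miniature. The self-contained sketch below uses a hypothetical one-dimensional analogue of Example 1 (the paper's five-dimensional means and Wishart variances are not reproduced; the means and standard deviations here are our own stand-ins): the target is an equally weighted mixture of three normals, the D = 3 kernels are the corresponding components used as independent proposals, and the Rao-Blackwellized update drives α to (1/3, 1/3, 1/3).

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([-3.0, 0.0, 4.0])     # hypothetical component means
sd = np.array([1.0, 0.5, 1.5])      # hypothetical component std deviations

def log_norm(x, m, s):
    return -0.5 * ((x - m) / s) ** 2 - np.log(s) - 0.5 * np.log(2 * np.pi)

def log_pi(x):
    """Equally weighted three-component normal mixture target."""
    comps = np.stack([log_norm(x, m, s) for m, s in zip(mu, sd)])
    mx = comps.max(axis=0)
    return mx + np.log(np.exp(comps - mx).sum(axis=0) / 3.0)

n, T = 10_000, 10
alpha = np.array([0.8, 0.15, 0.05])  # deliberately far from the optimum
for t in range(T):
    K = rng.choice(3, size=n, p=alpha)       # component indicators
    x = rng.normal(mu[K], sd[K])             # independent mixture proposal
    logq = np.stack([log_norm(x, m, s) for m, s in zip(mu, sd)])
    mx = logq.max(axis=0)
    log_mix = mx + np.log(alpha @ np.exp(logq - mx))
    w = np.exp(log_pi(x) - log_mix)          # Rao-Blackwellized weights
    w /= w.sum()
    # independent proposals: the resampling step (c) is not needed here
    alpha = np.array([w[K == d].sum() for d in range(3)])
print(alpha)  # approaches (1/3, 1/3, 1/3), as in Figure 1
```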

EXAMPLE 2. As a second toy example, consider the case of a three-dimensional normal N_3(0, Σ) target with covariance matrix Σ and the mixture of three kernels given by

(11) $$\alpha_{t,1}\, T_2(\tilde{x}_{i,t-1}, \Sigma_1) + \alpha_{t,2}\, \mathcal{N}(\tilde{x}_{i,t-1}, \Sigma_2) + \alpha_{t,3}\, \mathcal{N}(\tilde{x}_{i,t-1}, \Sigma_3),$$

where the Σ_i's are random Wishart matrices. [The first proposal in the mixture is a product of unidimensional Student t-distributions with two degrees of freedom, centered at the current values of the components of x̃_{i,t-1} and rotated by τ_1 diag(Σ_1)^{1/2}.] A compelling feature of this example is that we can visualize the Kullback divergence on the R³ simplex, in the sense that the divergence

$$\mathcal{E}\Big(\bar\pi,\ \pi \otimes \sum_d \alpha_d Q_d\Big) = \mathbb{E}_{\bar\pi}\left[\log\Big\{ \pi(X') \Big/ \sum_{d=1}^{3} \alpha_d\, q_d(X, X') \Big\}\right] = e(\alpha_1, \alpha_2, \alpha_3)$$

can be approximated by a Monte Carlo experiment on a grid of values of (α_1, α_2). Figure 2 shows the result of this Monte Carlo experiment and exhibits a minimum divergence inside the R³ simplex. Running the Rao-Blackwellized D-kernel PMC algorithm from a random starting weight (α_1, α_2, α_3) always leads to a neighborhood of the minimum, even though a strict decrease in the divergence requires a large value for N and a precise convergence to α^max necessitates a large number of iterations T.

EXAMPLE 3. Our third example is a contingency table inspired by [1], given here as Table 1. We model this dataset by a Poisson regression,

$$x_{ij} \sim \mathcal{P}\big(\exp(\alpha_i + \beta_j)\big), \qquad i, j = 0, 1,$$

with α_0 = 0 for identifiability reasons. We use a flat prior on the parameter θ = (α_1, β_0, β_1) and run the PMC D-kernel algorithm with a mixture of ten normal random walk proposals,

$$\mathcal{N}\big(\theta_{i,t-1},\ \varrho_d\, I(\hat\theta)\big), \qquad d = 1, \ldots, 10,$$

where I(θ̂) is the Fisher information matrix evaluated at the MLE, θ̂ = (0.43, 4.06, 5.9), and where the scales ϱ_d vary from 1.35e−19 to 1.54e+07 (the ϱ_d's are equidistributed on a logarithmic scale).

FIG. 2. Grey level and contour representation of the Kullback divergence e(α_1, α_2, α_3) between the N_3(0, Σ) distribution and the three-component mixture proposal (11). The darker pixels correspond to lower values of the divergence. We also represent in white the path of one run of the D-kernel PMC algorithm when started from a random value of (α_1, α_2, α_3). The number of iterations T is equal to 500, while the sample size N is 50,000.

The results of five successive iterations of the Rao-Blackwell D-kernel algorithm are as follows: unsurprisingly, the largest variance kernels are hardly ever sampled, but fulfill their main role of variance stabilizers in the importance sampling weights, while the mixture concentrates on the medium variances, with a quick convergence of the mixture weights to the limiting weights (the accumulated weights of the 5th, 6th, 7th and 8th components of the mixture converge to 0, 0.003, and 0.738, respectively). The fit of the simulated sample to the target distribution is shown in Figure 3, since the points of the sample do coincide with the unique modal region of the posterior distribution. This experiment also shows that there is no degeneracy in the samples produced: most points in the last sample have very similar posterior values. For instance, 20% of the sample corresponds to 95% of the weights, while 1% of the sample corresponds to 31% of the weights.

TABLE 1. Two-by-two contingency table.

FIG. 3. Distribution of 5,000 resampled points after five iterations of the Rao-Blackwellized D-kernel PMC sampler for the contingency table example. Top: histograms of the components α_1, β_0 and β_1; bottom: scatterplots of the points (α_1, β_0), (α_1, β_1) and (β_0, β_1) on the profile slices of the log-likelihood.

A closer look at convergence is provided by Figure 4, where the histograms of the resampled samples are represented, along with the distribution of the log-likelihood and the empirical cumulative distribution function (cdf) of the importance weights. They do not signal any degeneracy phenomenon but, rather, the opposite: a clear stabilization around the values of interest.

7. Conclusion and perspectives. This paper shows that it is possible to build an adaptive mixture of proposals aimed at a minimization of the Kullback divergence with the distribution of interest. We can therefore set different goals for a simulation experiment and expect to arrive at the most accurate proposal. For instance, in a companion paper [9], we also derive an adaptive update of the weights targeted at the minimal variance proposal for a given integral I of interest. Rather naturally, these results are achieved under strong restrictions on the family of proposals, in the sense that the parameterization of those families is restricted to the weights of the mixture. It is, however, possible to extend the above results to general mixtures of parameterized proposals, as shown by work currently under development.

A more practical direction of research is the implementation of such adaptive algorithms in large-dimensional problems. While our algorithms are in fine importance sampling algorithms, it is conceivable that mixtures of Gibbs-like proposals can take better advantage of the intuition gained from MCMC methodology, while keeping the finite-horizon validation of importance sampling methods. The major difficulty in this direction, however, is that the curse of dimensionality still holds, in the sense that (a) we need to simultaneously consider more and more proposals as the dimension increases (as, e.g., the set of all full conditionals) and (b) the number of parameters to tune in the proposals exponentially increases with the dimension.

APPENDIX A: CONVERGENCE THEOREMS FOR TRIANGULAR ARRAYS

In this section, we recall convergence results for triangular arrays of random variables (see [4] or [10] for more details, including the proofs). We will use these results to study the asymptotic behavior of the PMC algorithm. In what follows, let {U_{N,i}}_{N≥1, 1≤i≤N} be a triangular array of random variables defined on the same measurable space (Ω, A) and let {G_N}_{N≥1} be a sequence of σ-algebras included in A. The symbol X_N →_P a means that X_N converges in probability to a as N goes to infinity.

DEFINITION A.1. The sequence {U_{N,i}}_{N≥1, 1≤i≤N} is independent given {G_N}_{N≥1} if, for all N ≥ 1, the random variables U_{N,1}, ..., U_{N,N} are independent given G_N.

FIG. 4. Evolution of the samples over four iterations of the Rao-Blackwellized D-kernel PMC sampler for the contingency table example (the output from each iteration is a block of four graphs, to be read from left to right and from top to bottom): histograms of the resampled samples of α_1, β_0 and β_1 of size 50,000 and (lower right of each block) log-likelihood and the empirical cumulative distribution function of the importance weights.

DEFINITION A.2. The sequence of variables {Z_N}_{N≥1} is bounded in probability if

$$\lim_{C \to \infty} \sup_{N \ge 1} \mathbb{P}[|Z_N| \ge C] = 0.$$

THEOREM A.1. If

(i) {U_{N,i}}_{N≥1, 1≤i≤N} is independent given {G_N}_{N≥1};
(ii) the sequence {Σ_{i=1}^N E[|U_{N,i}| | G_N]}_{N≥1} is bounded in probability;
(iii) for all η > 0, Σ_{i=1}^N E[|U_{N,i}| I(|U_{N,i}| > η) | G_N] →_P 0,

then

$$\sum_{i=1}^{N} \big( U_{N,i} - \mathbb{E}[U_{N,i} \mid \mathcal{G}_N] \big) \xrightarrow{P} 0.$$

THEOREM A.2. If

(i) {U_{N,i}}_{N≥1, 1≤i≤N} is independent given {G_N}_{N≥1};
(ii) for all N ≥ 1 and i ∈ {1, ..., N}, E[U_{N,i}² | G_N] < ∞;
(iii) there exists σ² > 0 such that Σ_{i=1}^N (E[U_{N,i}² | G_N] − E[U_{N,i} | G_N]²) →_P σ²;
(iv) for all η > 0, Σ_{i=1}^N E[U_{N,i}² I(|U_{N,i}| > η) | G_N] →_P 0,

then, for all u ∈ R,

$$\mathbb{E}\left[ \exp\Big( \mathrm{i} u \sum_{i=1}^{N} \big( U_{N,i} - \mathbb{E}[U_{N,i} \mid \mathcal{G}_N] \big) \Big) \,\Big|\, \mathcal{G}_N \right] \xrightarrow{P} \exp\left( -\frac{u^2 \sigma^2}{2} \right).$$

APPENDIX B: PROOF OF PROPOSITION 3.1

We proceed by induction with respect to t. Using Theorem A.1, the case t = 0 is straightforward, as a direct consequence of the convergence of the importance sampling algorithm. For t ≥ 1, let us assume that the LLN holds for t − 1. Then, to prove that Σ_{i=1}^N ω_{i,t} h(x_{i,t}, K_{i,t}) converges in probability to π ⊗ γ_u(h), we need only check that

$$\frac{1}{N} \sum_{i=1}^{N} \frac{\pi(x_{i,t})}{q_{K_{i,t}}(\tilde{x}_{i,t-1}, x_{i,t})}\, h(x_{i,t}, K_{i,t}) \xrightarrow{P} \pi \otimes \gamma_u(h),$$

the special case h ≡ 1 providing the convergence of the normalizing constant for the importance weights.

We apply Theorem A.1 with

$$U_{N,i} = \frac{1}{N}\, \frac{\pi(x_{i,t})}{q_{K_{i,t}}(\tilde{x}_{i,t-1}, x_{i,t})}\, h(x_{i,t}, K_{i,t}) \qquad\text{and}\qquad \mathcal{G}_N = \sigma\big\{ (\tilde{x}_{i,t-1})_{1\le i\le N},\ (\alpha_{t,d})_{1\le d\le D} \big\},$$

where σ{X_i, i ∈ I} denotes the σ-algebra induced by the X_i's; we need only check condition (iii). For any C > 0, we have

(12) $$\frac{1}{N} \sum_{i=1}^{N} \mathbb{E}\left[ \frac{\pi(x_{i,t})\, h(x_{i,t}, K_{i,t})}{q_{K_{i,t}}(\tilde{x}_{i,t-1}, x_{i,t})}\, \mathbb{I}\left\{ \frac{\pi(x_{i,t})\, h(x_{i,t}, K_{i,t})}{q_{K_{i,t}}(\tilde{x}_{i,t-1}, x_{i,t})} > C \right\} \,\Big|\, \mathcal{G}_N \right] = \frac{1}{N} \sum_{i=1}^{N} \sum_{d=1}^{D} \alpha_{t,d}\, F_C(\tilde{x}_{i,t-1}, d),$$

where

$$F_C(x, k) = \int \pi(\mathrm{d}u)\, h(u, k)\, \mathbb{I}\left\{ \frac{\pi(u)}{q_k(x, u)}\, h(u, k) \ge C \right\}.$$

By induction, we have

$$\alpha_{t,d} = \sum_{i=1}^{N} \omega_{i,t-1}\, \mathbb{I}(K_{i,t-1} = d) \xrightarrow{P} 1/D \qquad\text{and}\qquad \frac{1}{N} \sum_{i=1}^{N} F_C(\tilde{x}_{i,t-1}, k) \xrightarrow{P} \pi\big(F_C(\cdot, k)\big).$$

Using these limits in (12) yields

$$\frac{1}{N} \sum_{i=1}^{N} \mathbb{E}\left[ \frac{\pi(x_{i,t})\, h(x_{i,t}, K_{i,t})}{q_{K_{i,t}}(\tilde{x}_{i,t-1}, x_{i,t})}\, \mathbb{I}\left\{ \frac{\pi(x_{i,t})\, h(x_{i,t}, K_{i,t})}{q_{K_{i,t}}(\tilde{x}_{i,t-1}, x_{i,t})} > C \right\} \,\Big|\, \mathcal{G}_N \right] \xrightarrow{P} \pi \otimes \gamma_u(F_C).$$

Since π ⊗ γ_u(F_C) converges to 0 as C goes to infinity, this proves that, for any η > 0,

$$\sum_{i=1}^{N} \mathbb{E}\big[ |U_{N,i}|\, \mathbb{I}\{|U_{N,i}| > \eta\} \mid \mathcal{G}_N \big] \xrightarrow{P} 0.$$

Condition (iii) is satisfied and Theorem A.1 applies. The proof follows.

Acknowledgments. The authors are grateful to Olivier Cappé, Paul Fearnhead and Eric Moulines for helpful comments and discussions. Comments from two referees helped considerably in improving the focus and presentation of our results.

REFERENCES

[1] AGRESTI, A. Categorical Data Analysis, 2nd ed. Wiley, New York.
[2] ANDRIEU, C. and ROBERT, C. Controlled Markov chain Monte Carlo methods for optimal sampling. Technical Report 0125, Univ. Paris Dauphine.
[3] CAPPÉ, O., GUILLIN, A., MARIN, J. and ROBERT, C. Population Monte Carlo. J. Comput. Graph. Statist.
[4] CAPPÉ, O., MOULINES, E. and RYDÉN, T. Inference in Hidden Markov Models. Springer, New York.
[5] CELEUX, G., MARIN, J. and ROBERT, C. Iterated importance sampling in missing data problems. Comput. Statist. Data Anal.
[6] CHOPIN, N. Central limit theorem for sequential Monte Carlo methods and its application to Bayesian inference. Ann. Statist.
[7] CSISZÁR, I. and TUSNÁDY, G. Information geometry and alternating minimization procedures. Recent results in estimation theory and related topics. Statist. Decisions 1984, suppl.
[8] DEL MORAL, P., DOUCET, A. and JASRA, A. Sequential Monte Carlo samplers. J. R. Stat. Soc. Ser. B Stat. Methodol.
[9] DOUC, R., GUILLIN, A., MARIN, J. and ROBERT, C. Minimum variance importance sampling via population Monte Carlo. Technical report, Cahiers du CEREMADE, Univ. Paris Dauphine.
[10] DOUC, R. and MOULINES, E. Limit theorems for properly weighted samples with applications to sequential Monte Carlo. Technical report, TSI, Télécom Paris.
[11] DOUCET, A., DE FREITAS, N. and GORDON, N., eds. Sequential Monte Carlo Methods in Practice. Springer, New York.
[12] GILKS, W., ROBERTS, G. and SAHU, S. Adaptive Markov chain Monte Carlo through regeneration. J. Amer. Statist. Assoc.
[13] HAARIO, H., SAKSMAN, E. and TAMMINEN, J. Adaptive proposal distribution for random walk Metropolis algorithm. Comput. Statist.
[14] HAARIO, H., SAKSMAN, E. and TAMMINEN, J. An adaptive Metropolis algorithm. Bernoulli.
[15] HESTERBERG, T. Weighted average importance sampling and defensive mixture distributions. Technometrics.
[16] IBA, Y. Population-based Monte Carlo algorithms. Trans. Japanese Society for Artificial Intelligence.
[17] KÜNSCH, H. Recursive Monte Carlo filters: Algorithms and theoretical analysis. Ann. Statist.
[18] MENGERSEN, K. L. and TWEEDIE, R. L. Rates of convergence of the Hastings and Metropolis algorithms. Ann. Statist.
[19] ROBERT, C. Intrinsic losses. Theory and Decision.
[20] ROBERT, C. and CASELLA, G. Monte Carlo Statistical Methods, 2nd ed. Springer, New York.
[21] ROBERTS, G. O., GELMAN, A. and GILKS, W. R. Weak convergence and optimal scaling of random walk Metropolis algorithms. Ann. Appl. Probab.
[22] RUBIN, D. Using the SIR algorithm to simulate posterior distributions. In Bayesian Statistics 3 (J. M. Bernardo, M. H. DeGroot, D. V. Lindley and A. F. M. Smith, eds.). Oxford Univ. Press.
[23] SAHU, S. and ZHIGLJAVSKY, A. Adaptation for self regenerative MCMC. Technical report, Univ. of Wales, Cardiff.


More information

NOTES ON EULER-BOOLE SUMMATION (1) f (l 1) (n) f (l 1) (m) + ( 1)k 1 k! B k (y) f (k) (y) dy,

NOTES ON EULER-BOOLE SUMMATION (1) f (l 1) (n) f (l 1) (m) + ( 1)k 1 k! B k (y) f (k) (y) dy, NOTES ON EULER-BOOLE SUMMATION JONATHAN M BORWEIN, NEIL J CALKIN, AND DANTE MANNA Abstract We stuy a connection between Euler-MacLaurin Summation an Boole Summation suggeste in an AMM note from 196, which

More information

6 General properties of an autonomous system of two first order ODE

6 General properties of an autonomous system of two first order ODE 6 General properties of an autonomous system of two first orer ODE Here we embark on stuying the autonomous system of two first orer ifferential equations of the form ẋ 1 = f 1 (, x 2 ), ẋ 2 = f 2 (, x

More information

Tutorial on Maximum Likelyhood Estimation: Parametric Density Estimation

Tutorial on Maximum Likelyhood Estimation: Parametric Density Estimation Tutorial on Maximum Likelyhoo Estimation: Parametric Density Estimation Suhir B Kylasa 03/13/2014 1 Motivation Suppose one wishes to etermine just how biase an unfair coin is. Call the probability of tossing

More information

Math 342 Partial Differential Equations «Viktor Grigoryan

Math 342 Partial Differential Equations «Viktor Grigoryan Math 342 Partial Differential Equations «Viktor Grigoryan 6 Wave equation: solution In this lecture we will solve the wave equation on the entire real line x R. This correspons to a string of infinite

More information

Introduction to the Vlasov-Poisson system

Introduction to the Vlasov-Poisson system Introuction to the Vlasov-Poisson system Simone Calogero 1 The Vlasov equation Consier a particle with mass m > 0. Let x(t) R 3 enote the position of the particle at time t R an v(t) = ẋ(t) = x(t)/t its

More information

WEIGHTING A RESAMPLED PARTICLES IN SEQUENTIAL MONTE CARLO (EXTENDED PREPRINT) L. Martino, V. Elvira, F. Louzada

WEIGHTING A RESAMPLED PARTICLES IN SEQUENTIAL MONTE CARLO (EXTENDED PREPRINT) L. Martino, V. Elvira, F. Louzada WEIGHTIG A RESAMLED ARTICLES I SEQUETIAL MOTE CARLO (ETEDED RERIT) L. Martino, V. Elvira, F. Louzaa Dep. of Signal Theory an Communic., Universia Carlos III e Mari, Leganés (Spain). Institute of Mathematical

More information

A Modification of the Jarque-Bera Test. for Normality

A Modification of the Jarque-Bera Test. for Normality Int. J. Contemp. Math. Sciences, Vol. 8, 01, no. 17, 84-85 HIKARI Lt, www.m-hikari.com http://x.oi.org/10.1988/ijcms.01.9106 A Moification of the Jarque-Bera Test for Normality Moawa El-Fallah Ab El-Salam

More information

Least-Squares Regression on Sparse Spaces

Least-Squares Regression on Sparse Spaces Least-Squares Regression on Sparse Spaces Yuri Grinberg, Mahi Milani Far, Joelle Pineau School of Computer Science McGill University Montreal, Canaa {ygrinb,mmilan1,jpineau}@cs.mcgill.ca 1 Introuction

More information

A new proof of the sharpness of the phase transition for Bernoulli percolation on Z d

A new proof of the sharpness of the phase transition for Bernoulli percolation on Z d A new proof of the sharpness of the phase transition for Bernoulli percolation on Z Hugo Duminil-Copin an Vincent Tassion October 8, 205 Abstract We provie a new proof of the sharpness of the phase transition

More information

Chapter 6: Energy-Momentum Tensors

Chapter 6: Energy-Momentum Tensors 49 Chapter 6: Energy-Momentum Tensors This chapter outlines the general theory of energy an momentum conservation in terms of energy-momentum tensors, then applies these ieas to the case of Bohm's moel.

More information

The derivative of a function f(x) is another function, defined in terms of a limiting expression: f(x + δx) f(x)

The derivative of a function f(x) is another function, defined in terms of a limiting expression: f(x + δx) f(x) Y. D. Chong (2016) MH2801: Complex Methos for the Sciences 1. Derivatives The erivative of a function f(x) is another function, efine in terms of a limiting expression: f (x) f (x) lim x δx 0 f(x + δx)

More information

LOGISTIC regression model is often used as the way

LOGISTIC regression model is often used as the way Vol:6, No:9, 22 Secon Orer Amissibilities in Multi-parameter Logistic Regression Moel Chie Obayashi, Hiekazu Tanaka, Yoshiji Takagi International Science Inex, Mathematical an Computational Sciences Vol:6,

More information

Flexible High-Dimensional Classification Machines and Their Asymptotic Properties

Flexible High-Dimensional Classification Machines and Their Asymptotic Properties Journal of Machine Learning Research 16 (2015) 1547-1572 Submitte 1/14; Revise 9/14; Publishe 8/15 Flexible High-Dimensional Classification Machines an Their Asymptotic Properties Xingye Qiao Department

More information

An Optimal Algorithm for Bandit and Zero-Order Convex Optimization with Two-Point Feedback

An Optimal Algorithm for Bandit and Zero-Order Convex Optimization with Two-Point Feedback Journal of Machine Learning Research 8 07) - Submitte /6; Publishe 5/7 An Optimal Algorithm for Banit an Zero-Orer Convex Optimization with wo-point Feeback Oha Shamir Department of Computer Science an

More information

Influence of weight initialization on multilayer perceptron performance

Influence of weight initialization on multilayer perceptron performance Influence of weight initialization on multilayer perceptron performance M. Karouia (1,2) T. Denœux (1) R. Lengellé (1) (1) Université e Compiègne U.R.A. CNRS 817 Heuiasyc BP 649 - F-66 Compiègne ceex -

More information

ensembles When working with density operators, we can use this connection to define a generalized Bloch vector: v x Tr x, v y Tr y

ensembles When working with density operators, we can use this connection to define a generalized Bloch vector: v x Tr x, v y Tr y Ph195a lecture notes, 1/3/01 Density operators for spin- 1 ensembles So far in our iscussion of spin- 1 systems, we have restricte our attention to the case of pure states an Hamiltonian evolution. Toay

More information

Introduction to Markov Processes

Introduction to Markov Processes Introuction to Markov Processes Connexions moule m44014 Zzis law Gustav) Meglicki, Jr Office of the VP for Information Technology Iniana University RCS: Section-2.tex,v 1.24 2012/12/21 18:03:08 gustav

More information

Math 1B, lecture 8: Integration by parts

Math 1B, lecture 8: Integration by parts Math B, lecture 8: Integration by parts Nathan Pflueger 23 September 2 Introuction Integration by parts, similarly to integration by substitution, reverses a well-known technique of ifferentiation an explores

More information

Time-of-Arrival Estimation in Non-Line-Of-Sight Environments

Time-of-Arrival Estimation in Non-Line-Of-Sight Environments 2 Conference on Information Sciences an Systems, The Johns Hopkins University, March 2, 2 Time-of-Arrival Estimation in Non-Line-Of-Sight Environments Sinan Gezici, Hisashi Kobayashi an H. Vincent Poor

More information

Generalized Tractability for Multivariate Problems

Generalized Tractability for Multivariate Problems Generalize Tractability for Multivariate Problems Part II: Linear Tensor Prouct Problems, Linear Information, an Unrestricte Tractability Michael Gnewuch Department of Computer Science, University of Kiel,

More information

Linear First-Order Equations

Linear First-Order Equations 5 Linear First-Orer Equations Linear first-orer ifferential equations make up another important class of ifferential equations that commonly arise in applications an are relatively easy to solve (in theory)

More information

26.1 Metropolis method

26.1 Metropolis method CS880: Approximations Algorithms Scribe: Dave Anrzejewski Lecturer: Shuchi Chawla Topic: Metropolis metho, volume estimation Date: 4/26/07 The previous lecture iscusse they some of the key concepts of

More information

Discrete Mathematics

Discrete Mathematics Discrete Mathematics 309 (009) 86 869 Contents lists available at ScienceDirect Discrete Mathematics journal homepage: wwwelseviercom/locate/isc Profile vectors in the lattice of subspaces Dániel Gerbner

More information

1 dx. where is a large constant, i.e., 1, (7.6) and Px is of the order of unity. Indeed, if px is given by (7.5), the inequality (7.

1 dx. where is a large constant, i.e., 1, (7.6) and Px is of the order of unity. Indeed, if px is given by (7.5), the inequality (7. Lectures Nine an Ten The WKB Approximation The WKB metho is a powerful tool to obtain solutions for many physical problems It is generally applicable to problems of wave propagation in which the frequency

More information

Relative Entropy and Score Function: New Information Estimation Relationships through Arbitrary Additive Perturbation

Relative Entropy and Score Function: New Information Estimation Relationships through Arbitrary Additive Perturbation Relative Entropy an Score Function: New Information Estimation Relationships through Arbitrary Aitive Perturbation Dongning Guo Department of Electrical Engineering & Computer Science Northwestern University

More information

Lecture 5. Symmetric Shearer s Lemma

Lecture 5. Symmetric Shearer s Lemma Stanfor University Spring 208 Math 233: Non-constructive methos in combinatorics Instructor: Jan Vonrák Lecture ate: January 23, 208 Original scribe: Erik Bates Lecture 5 Symmetric Shearer s Lemma Here

More information

All s Well That Ends Well: Supplementary Proofs

All s Well That Ends Well: Supplementary Proofs All s Well That Ens Well: Guarantee Resolution of Simultaneous Rigi Boy Impact 1:1 All s Well That Ens Well: Supplementary Proofs This ocument complements the paper All s Well That Ens Well: Guarantee

More information

QF101: Quantitative Finance September 5, Week 3: Derivatives. Facilitator: Christopher Ting AY 2017/2018. f ( x + ) f(x) f(x) = lim

QF101: Quantitative Finance September 5, Week 3: Derivatives. Facilitator: Christopher Ting AY 2017/2018. f ( x + ) f(x) f(x) = lim QF101: Quantitative Finance September 5, 2017 Week 3: Derivatives Facilitator: Christopher Ting AY 2017/2018 I recoil with ismay an horror at this lamentable plague of functions which o not have erivatives.

More information

arxiv: v2 [math.pr] 27 Nov 2018

arxiv: v2 [math.pr] 27 Nov 2018 Range an spee of rotor wals on trees arxiv:15.57v [math.pr] 7 Nov 1 Wilfrie Huss an Ecaterina Sava-Huss November, 1 Abstract We prove a law of large numbers for the range of rotor wals with ranom initial

More information

'HVLJQ &RQVLGHUDWLRQ LQ 0DWHULDO 6HOHFWLRQ 'HVLJQ 6HQVLWLYLW\,1752'8&7,21

'HVLJQ &RQVLGHUDWLRQ LQ 0DWHULDO 6HOHFWLRQ 'HVLJQ 6HQVLWLYLW\,1752'8&7,21 Large amping in a structural material may be either esirable or unesirable, epening on the engineering application at han. For example, amping is a esirable property to the esigner concerne with limiting

More information

arxiv: v4 [math.pr] 27 Jul 2016

arxiv: v4 [math.pr] 27 Jul 2016 The Asymptotic Distribution of the Determinant of a Ranom Correlation Matrix arxiv:309768v4 mathpr] 7 Jul 06 AM Hanea a, & GF Nane b a Centre of xcellence for Biosecurity Risk Analysis, University of Melbourne,

More information

Group Importance Sampling for particle filtering and MCMC

Group Importance Sampling for particle filtering and MCMC Group Importance Sampling for particle filtering an MCMC Luca Martino, Víctor Elvira, Gustau Camps-Valls Image Processing Laboratory, Universitat e València (Spain). IMT Lille Douai CRISTAL (UMR 989),

More information

A Sketch of Menshikov s Theorem

A Sketch of Menshikov s Theorem A Sketch of Menshikov s Theorem Thomas Bao March 14, 2010 Abstract Let Λ be an infinite, locally finite oriente multi-graph with C Λ finite an strongly connecte, an let p

More information

A simple model for the small-strain behaviour of soils

A simple model for the small-strain behaviour of soils A simple moel for the small-strain behaviour of soils José Jorge Naer Department of Structural an Geotechnical ngineering, Polytechnic School, University of São Paulo 05508-900, São Paulo, Brazil, e-mail:

More information

Level Construction of Decision Trees in a Partition-based Framework for Classification

Level Construction of Decision Trees in a Partition-based Framework for Classification Level Construction of Decision Trees in a Partition-base Framework for Classification Y.Y. Yao, Y. Zhao an J.T. Yao Department of Computer Science, University of Regina Regina, Saskatchewan, Canaa S4S

More information

Euler equations for multiple integrals

Euler equations for multiple integrals Euler equations for multiple integrals January 22, 2013 Contents 1 Reminer of multivariable calculus 2 1.1 Vector ifferentiation......................... 2 1.2 Matrix ifferentiation........................

More information

On a limit theorem for non-stationary branching processes.

On a limit theorem for non-stationary branching processes. On a limit theorem for non-stationary branching processes. TETSUYA HATTORI an HIROSHI WATANABE 0. Introuction. The purpose of this paper is to give a limit theorem for a certain class of iscrete-time multi-type

More information

The total derivative. Chapter Lagrangian and Eulerian approaches

The total derivative. Chapter Lagrangian and Eulerian approaches Chapter 5 The total erivative 51 Lagrangian an Eulerian approaches The representation of a flui through scalar or vector fiels means that each physical quantity uner consieration is escribe as a function

More information

The Exact Form and General Integrating Factors

The Exact Form and General Integrating Factors 7 The Exact Form an General Integrating Factors In the previous chapters, we ve seen how separable an linear ifferential equations can be solve using methos for converting them to forms that can be easily

More information

Range and speed of rotor walks on trees

Range and speed of rotor walks on trees Range an spee of rotor wals on trees Wilfrie Huss an Ecaterina Sava-Huss May 15, 1 Abstract We prove a law of large numbers for the range of rotor wals with ranom initial configuration on regular trees

More information

ON MULTIPLE TRY SCHEMES AND THE PARTICLE METROPOLIS-HASTINGS ALGORITHM

ON MULTIPLE TRY SCHEMES AND THE PARTICLE METROPOLIS-HASTINGS ALGORITHM ON MULTIPLE TRY SCHEMES AN THE PARTICLE METROPOLIS-HASTINGS ALGORITHM L. Martino, F. Leisen, J. Coraner University of Helsinki, Helsinki (Finlan). University of Kent, Canterbury (UK). ABSTRACT Markov Chain

More information

INDEPENDENT COMPONENT ANALYSIS VIA

INDEPENDENT COMPONENT ANALYSIS VIA INDEPENDENT COMPONENT ANALYSIS VIA NONPARAMETRIC MAXIMUM LIKELIHOOD ESTIMATION Truth Rotate S 2 2 1 0 1 2 3 4 X 2 2 1 0 1 2 3 4 4 2 0 2 4 6 4 2 0 2 4 6 S 1 X 1 Reconstructe S^ 2 2 1 0 1 2 3 4 Marginal

More information

Construction of the Electronic Radial Wave Functions and Probability Distributions of Hydrogen-like Systems

Construction of the Electronic Radial Wave Functions and Probability Distributions of Hydrogen-like Systems Construction of the Electronic Raial Wave Functions an Probability Distributions of Hyrogen-like Systems Thomas S. Kuntzleman, Department of Chemistry Spring Arbor University, Spring Arbor MI 498 tkuntzle@arbor.eu

More information

Self-normalized Martingale Tail Inequality

Self-normalized Martingale Tail Inequality Online-to-Confience-Set Conversions an Application to Sparse Stochastic Banits A Self-normalize Martingale Tail Inequality The self-normalize martingale tail inequality that we present here is the scalar-value

More information

Calculus and optimization

Calculus and optimization Calculus an optimization These notes essentially correspon to mathematical appenix 2 in the text. 1 Functions of a single variable Now that we have e ne functions we turn our attention to calculus. A function

More information

. Using a multinomial model gives us the following equation for P d. , with respect to same length term sequences.

. Using a multinomial model gives us the following equation for P d. , with respect to same length term sequences. S 63 Lecture 8 2/2/26 Lecturer Lillian Lee Scribes Peter Babinski, Davi Lin Basic Language Moeling Approach I. Special ase of LM-base Approach a. Recap of Formulas an Terms b. Fixing θ? c. About that Multinomial

More information

Math Notes on differentials, the Chain Rule, gradients, directional derivative, and normal vectors

Math Notes on differentials, the Chain Rule, gradients, directional derivative, and normal vectors Math 18.02 Notes on ifferentials, the Chain Rule, graients, irectional erivative, an normal vectors Tangent plane an linear approximation We efine the partial erivatives of f( xy, ) as follows: f f( x+

More information

A. Exclusive KL View of the MLE

A. Exclusive KL View of the MLE A. Exclusive KL View of the MLE Lets assume a change-of-variable moel p Z z on the ranom variable Z R m, such as the one use in Dinh et al. 2017: z 0 p 0 z 0 an z = ψz 0, where ψ is an invertible function

More information

A Note on Exact Solutions to Linear Differential Equations by the Matrix Exponential

A Note on Exact Solutions to Linear Differential Equations by the Matrix Exponential Avances in Applie Mathematics an Mechanics Av. Appl. Math. Mech. Vol. 1 No. 4 pp. 573-580 DOI: 10.4208/aamm.09-m0946 August 2009 A Note on Exact Solutions to Linear Differential Equations by the Matrix

More information

under the null hypothesis, the sign test (with continuity correction) rejects H 0 when α n + n 2 2.

under the null hypothesis, the sign test (with continuity correction) rejects H 0 when α n + n 2 2. Assignment 13 Exercise 8.4 For the hypotheses consiere in Examples 8.12 an 8.13, the sign test is base on the statistic N + = #{i : Z i > 0}. Since 2 n(n + /n 1) N(0, 1) 2 uner the null hypothesis, the

More information

Perfect Matchings in Õ(n1.5 ) Time in Regular Bipartite Graphs

Perfect Matchings in Õ(n1.5 ) Time in Regular Bipartite Graphs Perfect Matchings in Õ(n1.5 ) Time in Regular Bipartite Graphs Ashish Goel Michael Kapralov Sanjeev Khanna Abstract We consier the well-stuie problem of fining a perfect matching in -regular bipartite

More information

Sharp Thresholds. Zachary Hamaker. March 15, 2010

Sharp Thresholds. Zachary Hamaker. March 15, 2010 Sharp Threshols Zachary Hamaker March 15, 2010 Abstract The Kolmogorov Zero-One law states that for tail events on infinite-imensional probability spaces, the probability must be either zero or one. Behavior

More information

A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks

A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks A PAC-Bayesian Approach to Spectrally-Normalize Margin Bouns for Neural Networks Behnam Neyshabur, Srinah Bhojanapalli, Davi McAllester, Nathan Srebro Toyota Technological Institute at Chicago {bneyshabur,

More information

Lecture 6: Calculus. In Song Kim. September 7, 2011

Lecture 6: Calculus. In Song Kim. September 7, 2011 Lecture 6: Calculus In Song Kim September 7, 20 Introuction to Differential Calculus In our previous lecture we came up with several ways to analyze functions. We saw previously that the slope of a linear

More information

Make graph of g by adding c to the y-values. on the graph of f by c. multiplying the y-values. even-degree polynomial. graph goes up on both sides

Make graph of g by adding c to the y-values. on the graph of f by c. multiplying the y-values. even-degree polynomial. graph goes up on both sides Reference 1: Transformations of Graphs an En Behavior of Polynomial Graphs Transformations of graphs aitive constant constant on the outsie g(x) = + c Make graph of g by aing c to the y-values on the graph

More information

TEMPORAL AND TIME-FREQUENCY CORRELATION-BASED BLIND SOURCE SEPARATION METHODS. Yannick DEVILLE

TEMPORAL AND TIME-FREQUENCY CORRELATION-BASED BLIND SOURCE SEPARATION METHODS. Yannick DEVILLE TEMPORAL AND TIME-FREQUENCY CORRELATION-BASED BLIND SOURCE SEPARATION METHODS Yannick DEVILLE Université Paul Sabatier Laboratoire Acoustique, Métrologie, Instrumentation Bât. 3RB2, 8 Route e Narbonne,

More information

How to Minimize Maximum Regret in Repeated Decision-Making

How to Minimize Maximum Regret in Repeated Decision-Making How to Minimize Maximum Regret in Repeate Decision-Making Karl H. Schlag July 3 2003 Economics Department, European University Institute, Via ella Piazzuola 43, 033 Florence, Italy, Tel: 0039-0-4689, email:

More information

Monte Carlo Methods with Reduced Error

Monte Carlo Methods with Reduced Error Monte Carlo Methos with Reuce Error As has been shown, the probable error in Monte Carlo algorithms when no information about the smoothness of the function is use is Dξ r N = c N. It is important for

More information

New Simple Controller Tuning Rules for Integrating and Stable or Unstable First Order plus Dead-Time Processes

New Simple Controller Tuning Rules for Integrating and Stable or Unstable First Order plus Dead-Time Processes Proceeings of the 3th WSEAS nternational Conference on SYSTEMS New Simple Controller Tuning Rules for ntegrating an Stable or Unstable First Orer plus Dea-Time Processes.G.ARVANTS Department of Natural

More information

Extreme Values by Resnick

Extreme Values by Resnick 1 Extreme Values by Resnick 1 Preliminaries 1.1 Uniform Convergence We will evelop the iea of something calle continuous convergence which will be useful to us later on. Denition 1. Let X an Y be metric

More information

Expected Value of Partial Perfect Information

Expected Value of Partial Perfect Information Expecte Value of Partial Perfect Information Mike Giles 1, Takashi Goa 2, Howar Thom 3 Wei Fang 1, Zhenru Wang 1 1 Mathematical Institute, University of Oxfor 2 School of Engineering, University of Tokyo

More information

d dx But have you ever seen a derivation of these results? We ll prove the first result below. cos h 1

d dx But have you ever seen a derivation of these results? We ll prove the first result below. cos h 1 Lecture 5 Some ifferentiation rules Trigonometric functions (Relevant section from Stewart, Seventh Eition: Section 3.3) You all know that sin = cos cos = sin. () But have you ever seen a erivation of

More information

05 The Continuum Limit and the Wave Equation

05 The Continuum Limit and the Wave Equation Utah State University DigitalCommons@USU Founations of Wave Phenomena Physics, Department of 1-1-2004 05 The Continuum Limit an the Wave Equation Charles G. Torre Department of Physics, Utah State University,

More information

Quantum mechanical approaches to the virial

Quantum mechanical approaches to the virial Quantum mechanical approaches to the virial S.LeBohec Department of Physics an Astronomy, University of Utah, Salt Lae City, UT 84112, USA Date: June 30 th 2015 In this note, we approach the virial from

More information

Hyperbolic Moment Equations Using Quadrature-Based Projection Methods

Hyperbolic Moment Equations Using Quadrature-Based Projection Methods Hyperbolic Moment Equations Using Quarature-Base Projection Methos J. Koellermeier an M. Torrilhon Department of Mathematics, RWTH Aachen University, Aachen, Germany Abstract. Kinetic equations like the

More information

THE GENUINE OMEGA-REGULAR UNITARY DUAL OF THE METAPLECTIC GROUP

THE GENUINE OMEGA-REGULAR UNITARY DUAL OF THE METAPLECTIC GROUP THE GENUINE OMEGA-REGULAR UNITARY DUAL OF THE METAPLECTIC GROUP ALESSANDRA PANTANO, ANNEGRET PAUL, AND SUSANA A. SALAMANCA-RIBA Abstract. We classify all genuine unitary representations of the metaplectic

More information

Lecture 2 Lagrangian formulation of classical mechanics Mechanics

Lecture 2 Lagrangian formulation of classical mechanics Mechanics Lecture Lagrangian formulation of classical mechanics 70.00 Mechanics Principle of stationary action MATH-GA To specify a motion uniquely in classical mechanics, it suffices to give, at some time t 0,

More information

Lower Bounds for Local Monotonicity Reconstruction from Transitive-Closure Spanners

Lower Bounds for Local Monotonicity Reconstruction from Transitive-Closure Spanners Lower Bouns for Local Monotonicity Reconstruction from Transitive-Closure Spanners Arnab Bhattacharyya Elena Grigorescu Mahav Jha Kyomin Jung Sofya Raskhonikova Davi P. Wooruff Abstract Given a irecte

More information

Parameter estimation: A new approach to weighting a priori information

Parameter estimation: A new approach to weighting a priori information Parameter estimation: A new approach to weighting a priori information J.L. Mea Department of Mathematics, Boise State University, Boise, ID 83725-555 E-mail: jmea@boisestate.eu Abstract. We propose a

More information

Non-Linear Bayesian CBRN Source Term Estimation

Non-Linear Bayesian CBRN Source Term Estimation Non-Linear Bayesian CBRN Source Term Estimation Peter Robins Hazar Assessment, Simulation an Preiction Group Dstl Porton Down, UK. probins@stl.gov.uk Paul Thomas Hazar Assessment, Simulation an Preiction

More information

Final Exam Study Guide and Practice Problems Solutions

Final Exam Study Guide and Practice Problems Solutions Final Exam Stuy Guie an Practice Problems Solutions Note: These problems are just some of the types of problems that might appear on the exam. However, to fully prepare for the exam, in aition to making

More information