The Wang-Landau algorithm in general state spaces: Applications and convergence analysis


Yves F. Atchadé (Department of Statistics, University of Michigan, yvesa@umich.edu) and Jun S. Liu (Department of Statistics, Harvard University, jliu@stat.harvard.edu)

(First version Nov. 2004; revised Feb. 2007 and Aug. 2008)

Abstract: The Wang-Landau algorithm ([21]) is a recent Monte Carlo method that has generated much interest in the physics literature due to some spectacular simulation performances. The objective of this paper is two-fold. First, we show that the algorithm can be naturally extended to more general state spaces and used to improve on Markov chain Monte Carlo schemes of more interest in statistics. Second, we study the asymptotic behavior of the algorithm. We show that, with an appropriate choice of the step-size, the algorithm is consistent and a strong law of large numbers holds under fairly mild conditions. We also demonstrate by simulation the potential advantage of the WL algorithm for problems in Bayesian inference.

AMS 2000 subject classifications: Primary 60C05, 60J27, 60J35, 65C40.

Keywords and phrases: Monte Carlo methods, Wang-Landau algorithm, multicanonical sampling, trans-dimensional MCMC, adaptive MCMC, geometric ergodicity, stochastic approximation.

1. Introduction

Although the idea of Monte Carlo computation has been around for more than a century, its first real scientific use occurred during World War II, when the first generation of computers became available. Nick Metropolis coined the name "Monte Carlo" for the method while at Los Alamos National Laboratory, and it quickly evolved into an active research area owing to the involvement of leading physicists at the Laboratory. Ever since, physicists have been at the forefront of methodological research in the field. One of their latest additions is an algorithm proposed by F. Wang and D. P. Landau ([21]). The Wang-Landau (WL) algorithm has been successfully applied to some complex sampling problems in physics. The algorithm is closely related to multicanonical sampling, a method due to B. A. Berg and T. Neuhaus ([9]). Briefly,

if $\pi$ is the probability measure of interest, the idea behind multicanonical sampling is to obtain an importance sampling distribution by partitioning the state space along the energy function ($-\log \pi(x)$) and re-weighting each component of the partition appropriately, so that the modified distribution $\pi_\star$ spends an equal amount of time in each component, i.e., is uniform in the energy space. The method is often criticized for the difficulty involved in computing the weights. The main contribution of the WL algorithm is an efficient scheme that simultaneously computes the balancing weights and samples from the re-weighted distribution.

The objective of this paper is to take a more probabilistic look at the WL algorithm and to explore its potential for Monte Carlo simulation problems of more direct interest to statisticians. We achieve this goal by proposing a general state space version of the algorithm. We then show that the WL algorithm offers an effective strategy to improve on simulated tempering and trans-dimensional MCMC.

From a probabilistic standpoint, the WL algorithm is an interesting example of adaptive Markov chain Monte Carlo (MCMC). Adaptive MCMC is an approach to Monte Carlo simulation where the transition kernel of the algorithm is sequentially adjusted over time in order to achieve some prescribed optimality. Some early work on the subject includes [13], [4], [15]; see also [10], [19]. Early theoretical analysis includes [6], [2], [1], [20]. We take a similar path-wise approach to analyze the WL algorithm. The analysis is not a straightforward application of the theory in the aforementioned papers, because of the specific adaptive control involved. The key point is the stability of the algorithm. We say that the WL algorithm is stable if no component of the partition receives infinitely more visits than any other component as $n \to \infty$. On a stable sample path and under appropriate conditions, we show that the WL algorithm learns the optimal weights and satisfies a strong law of large numbers (Theorem 4.1). In the specific cases of multicanonical sampling and simulated tempering, which include the original WL algorithm, we show that the algorithm is stable and that the aforementioned limit results hold.

It came to our attention after the first draft of this paper that a similar extension of the WL algorithm has been proposed independently by F. Liang and coworkers (see e.g. [16]). Their approach differs from ours in that these authors took a more classical approach based on stochastic approximation with step-sizes set deterministically.

Our proposed generalization of the WL algorithm is presented in Section 2; some particular

cases are discussed in Section 3. The theoretical analysis is presented in Section 4, with the proofs postponed to Section 6 to facilitate the flow of ideas.

2. The Wang-Landau algorithm

In multicanonical sampling, we are given a state space $\mathcal{X}$ and a probability measure $\pi$. $\mathcal{X}$ is then partitioned as $\mathcal{X} = \cup_i \mathcal{X}_i$, where $\mathcal{X}_i \cap \mathcal{X}_j = \emptyset$ for $i \neq j$, and $\pi$ is re-weighted on each component $\mathcal{X}_i$. An abstract way to do the same and much more is the following. We start with $(\mathcal{X}_i, \mathcal{B}_i, \lambda_i)$, $i = 1, \ldots, d$, a finite family of measure spaces, where $\lambda_i$ is a $\sigma$-finite measure. We introduce the union space $\mathcal{X} = \cup_{i=1}^d \mathcal{X}_i \times \{i\}$. We equip $\mathcal{X}$ with the $\sigma$-algebra $\mathcal{B}$ generated by $\{(A_i, i),\ i \in \{1, \ldots, d\},\ A_i \in \mathcal{B}_i\}$ and the measure $\lambda$ satisfying $\lambda(A, i) = \lambda_i(A) \mathbf{1}_{\mathcal{B}_i}(A)$. Let $h_i : \mathcal{X}_i \to \mathbb{R}$ be a non-negative measurable function and define $\theta_\star(i) = \int_{\mathcal{X}_i} h_i(x)\, \lambda_i(dx) / Z$, where $Z = \sum_{i=1}^d \int_{\mathcal{X}_i} h_i(x)\, \lambda_i(dx)$. We assume that $\theta_\star(i) > 0$ for all $i = 1, \ldots, d$ and consider the following probability measure on $(\mathcal{X}, \mathcal{B})$:

$$\pi_\star(dx, i) \propto \frac{h_i(x)}{\theta_\star(i)}\, \mathbf{1}_{\mathcal{X}_i}(x)\, \lambda_i(dx). \qquad (1)$$

Our objective is to sample from $\pi_\star$. The problem of sampling from such a distribution arises in a number of different Monte Carlo strategies. For example, and as explained above, if $\pi$ is a probability measure of interest on some space $(\mathcal{X}, \mathcal{B}, \lambda)$, we can partition $\mathcal{X}$ along the energy function $-\log(\pi)$ and re-weight $\pi$ by $\pi(\mathcal{X}_i)$ on each component $\mathcal{X}_i$. The sampling problem then takes the form (1). This powerful strategy appeared first in the physics literature as multicanonical sampling ([9]). It is discussed in more detail in Section 3.1.

Sampling from (1) also arises naturally when optimizing the simulated tempering algorithm ([18], [12]). In simulated tempering, the state space $\mathcal{X}$ is not partitioned; instead, some auxiliary distributions $\pi_2, \ldots, \pi_d$ are introduced (take $\pi_1 = \pi$). These distributions are chosen close to $\pi$ but easier to sample from. For good performance, one typically imposes that all the distributions have the same weight. Taking each probability space $(\mathcal{X}, \mathcal{B}, \pi_i)$ as a component in the formalism above leads to a sampling problem of the form (1). Multicanonical sampling and simulated tempering have been combined in [5], giving an algorithm which can also be framed as (1). Sampling from (1) can also be an efficient strategy to improve on trans-dimensional MCMC samplers for Bayesian inference with model uncertainty. This is detailed in Section 3.3.
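To fix ideas, the following minimal Python sketch evaluates the unnormalized log-density of a measure of the form (1) for a generic weight vector $\theta$ (with $\theta = \theta_\star$ this is $\pi_\star$; with a general $\theta$ it is the family $\pi_\theta$ introduced below). The interface, a list `log_h` of log-density functions, is our own assumption and is not part of the paper.

```python
import numpy as np

def log_pi_theta(x, i, log_h, theta):
    """Unnormalized log-density at the union-space point (x, i):
    log h_i(x) - log theta(i).  `log_h[i]` returns log h_i(x) on X_i and
    `theta` is a vector of positive weights (hypothetical interface)."""
    return log_h[i](x) - np.log(theta[i])
```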

The main obstacle to sampling from $\pi_\star$ is that the normalizing constants $\theta_\star$ are not known. The contribution of the Wang-Landau algorithm ([21]) is an efficient scheme that simultaneously estimates $\theta_\star$ and samples from $\pi_\star$. The algorithm was introduced in a discrete setting with $\pi_\star$ uniform in $i$. In this work we extend the algorithm to general state spaces and to arbitrary probability measures. To carry on the discussion in our general framework, we introduce the family of probability measures $\{\pi_\theta,\ \theta \in (0, \infty)^d\}$ on $(\mathcal{X}, \mathcal{B}, \lambda)$ defined by:

$$\pi_\theta(dx, i) \propto \frac{h_i(x)}{\theta(i)}\, \mathbf{1}_{\mathcal{X}_i}(x)\, \lambda_i(dx). \qquad (2)$$

We assume that for all $\theta \in (0, \infty)^d$, we have at our disposal a transition kernel $P_\theta$ on $(\mathcal{X}, \mathcal{B})$ with invariant distribution $\pi_\theta$. Note that $\pi_\theta$ and $P_\theta$ remain unchanged if we multiply the vector $\theta$ by a positive constant. How to build such a Markov kernel $P_\theta$ typically depends on the particular instance of the algorithm; we give some examples later.

The structure of the WL algorithm is as follows. We start with some initial value $(X_0, I_0) \in \mathcal{X}$, $\phi_0 \in (0, \infty)^d$, and set $\theta_0(i) = \phi_0(i) / \sum_{j=1}^d \phi_0(j)$, $i = 1, \ldots, d$. Here $\theta_0$ serves as an initial guess of $\theta_\star$. At iteration $n+1$, we generate $(X_{n+1}, I_{n+1})$ by sampling from $P_{\theta_n}(X_n, I_n; \cdot)$ and update $\phi_n$ to $\phi_{n+1}$, which is used to form $\theta_{n+1}(i) = \phi_{n+1}(i) / \sum_j \phi_{n+1}(j)$. The updating rule for $\phi_n$ is fairly simple. For $i \in \{1, \ldots, d\}$, if $X_{n+1} \in \mathcal{X}_i$ (equivalently, if $I_{n+1} = i$), then $\phi_{n+1}(i) = \phi_n(i)(1 + \rho)$ for some $\rho > 0$; otherwise $\phi_{n+1}(i) = \phi_n(i)$. This leads to a first version of the WL algorithm.

Algorithm 2.1 (The Wang-Landau algorithm I). Let $\{\rho_n\}$ be a decreasing sequence of positive numbers. Let $(X_0, I_0) \in \mathcal{X}$ be given. Let $\phi_0 \in \mathbb{R}^d$ be such that $\phi_0(i) > 0$ and set $\theta_0(i) = \phi_0(i) / \sum_j \phi_0(j)$, $i = 1, \ldots, d$. At time $n \geq 0$, given $(X_n, I_n) \in \mathcal{X}$, $\phi_n \in \mathbb{R}^d$, $\theta_n \in \mathbb{R}^d$:

i) Sample $(X_{n+1}, I_{n+1}) \sim P_{\theta_n}(X_n, I_n; \cdot)$.

ii) For $i = 1, \ldots, d$, set $\phi_{n+1}(i) = \phi_n(i) \left(1 + \rho_n \mathbf{1}_{\{I_{n+1} = i\}}\right)$ and $\theta_{n+1}(i) = \phi_{n+1}(i) / \sum_j \phi_{n+1}(j)$.

It remains to choose the sequence $\{\rho_n\}$. As we show below, $\{\theta_n\}$ as defined by Algorithm 2.1 is a stochastic approximation process driven by $\{(X_n, I_n)\}$. The general guidelines in the literature for choosing $\{\rho_n\}$ are: $\rho_n > 0$, $\sum \rho_n = \infty$ and $\sum \rho_n^{1+\varepsilon} < \infty$ for some $\varepsilon > 0$, often $\varepsilon = 1$. The typical choice is $\rho_n \propto n^{-1}$. In practice, more careful choices are often necessary for good performance. To the best of our knowledge, there is no general, satisfactory way of choosing the step-size in stochastic approximation. Interestingly, Wang and Landau came up with a clever, adaptive way of choosing $\{\rho_n\}$ which works very well in practice. We describe their approach next, again in more probabilistic terms.
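For concreteness, here is a minimal Python sketch of Algorithm 2.1. The routine `sample_P`, which draws $(X_{n+1}, I_{n+1})$ from the problem-specific kernel $P_{\theta_n}$, is an assumed ingredient and not specified by the algorithm itself; the recursion is carried out on the log scale for numerical stability (see Remark 2.1 below).

```python
import numpy as np

def wang_landau_I(sample_P, d, x0, i0, rho, n_iter):
    """Minimal sketch of Algorithm 2.1.  `sample_P(x, i, theta)` draws
    (X_{n+1}, I_{n+1}) from P_theta(x, i; .) and is user-supplied;
    `rho` is the deterministic step-size sequence, e.g. rho[n] = 1/(n+1)."""
    log_phi = np.zeros(d)          # phi_0 = (1, ..., 1), stored on the log scale
    x, i = x0, i0
    for n in range(n_iter):
        theta = np.exp(log_phi - np.logaddexp.reduce(log_phi))  # theta_n
        x, i = sample_P(x, i, theta)                            # step i)
        log_phi[i] += np.log1p(rho[n])                          # step ii)
    return np.exp(log_phi - np.logaddexp.reduce(log_phi)), (x, i)
```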

Let $v_{n,k}(i)$ denote the proportion of visits to $\mathcal{X}_i \times \{i\}$ between times $n+1$ and $k$. That is, $v_{n,k}(i) = 0$ for $k \leq n$ and, for $k \geq n+1$, $v_{n,k}(i) = \frac{1}{k-n} \sum_{j=n+1}^{k} \mathbf{1}\{I_j = i\}$. Let $c \in (0, 1)$ be a parameter to be specified by the user. We introduce two additional random sequences $\{\kappa_n\}$ and $\{a_n\}$. Initially, $\kappa_0 = 0$. For $n \geq 1$, define

$$\kappa_n = \inf\left\{k > \kappa_{n-1} :\ \max_{1 \leq i \leq d} \left| v_{\kappa_{n-1},k}(i) - \frac{1}{d} \right| \leq \frac{c}{d} \right\}, \qquad (3)$$

with the usual convention that $\inf \emptyset = \infty$. We need another sequence $\{\gamma_n\}$ of positive decreasing numbers, representing step-sizes. Then $\{a_n\}$ represents the index of the element of the sequence $\{\gamma_n\}$ used at time $n$: $a_0 = 0$; if $k = \kappa_j$ for some $j \geq 1$, then $a_k = a_{k-1} + 1$; otherwise $a_k = a_{k-1}$. In other words, we start Algorithm 2.1 with a step-size equal to $\gamma_0$ and keep using it until time $\kappa_1$, when all the components have been visited equally well. Only then do we change the step-size to $\gamma_1$ and keep it constant until time $\kappa_2$, and so on. Combining this with Algorithm 2.1, we get the following.

Algorithm 2.2 (The Wang-Landau algorithm II). Let $\{\gamma_n\}$ be a decreasing sequence of positive numbers. Let $(X_0, I_0) \in \mathcal{X}$ be given. Set $a_0 = 0$, $\kappa = 0$, $c \in (0, 1)$, $\phi_0 \in \mathbb{R}^d$ such that $\phi_0(i) > 0$, and $\theta_0(i) = \phi_0(i) / \sum_j \phi_0(j)$, $i = 1, \ldots, d$. At time $n \geq 0$, given $(X_n, I_n) \in \mathcal{X}$, $\phi_n \in \mathbb{R}^d$, $\theta_n \in \mathbb{R}^d$, $a_n$ and $\kappa$:

i) Sample $(X_{n+1}, I_{n+1}) \sim P_{\theta_n}(X_n, I_n; \cdot)$.

ii) For $i = 1, \ldots, d$, set $\phi_{n+1}(i) = \phi_n(i) \left(1 + \gamma_{a_n} \mathbf{1}_{\{I_{n+1} = i\}}\right)$ and $\theta_{n+1}(i) = \phi_{n+1}(i) / \sum_j \phi_{n+1}(j)$.

iii) If $\max_i \left| v_{\kappa, n+1}(i) - \frac{1}{d} \right| \leq c/d$, then set $\kappa = n+1$ and $a_{n+1} = a_n + 1$; otherwise $a_{n+1} = a_n$.
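The following sketch extends the previous one with the adaptive step-size schedule of Algorithm 2.2; `sample_P` is again an assumed, problem-specific ingredient, and `gamma` is the step-size sequence supplied as a function of the index $a_n$.

```python
import numpy as np

def wang_landau_II(sample_P, d, x0, i0, gamma, c, n_iter):
    """Minimal sketch of Algorithm 2.2.  `gamma(a)` returns the step-size
    gamma_a (e.g. lambda a: 1.0 / (a + 1)); c in (0, 1) is the flatness
    threshold; `sample_P(x, i, theta)` draws from P_theta (assumed given)."""
    log_phi = np.zeros(d)
    visits = np.zeros(d)      # visit counts since the last reset time kappa
    a = 0                     # index a_n into the step-size sequence
    x, i = x0, i0
    for n in range(n_iter):
        theta = np.exp(log_phi - np.logaddexp.reduce(log_phi))
        x, i = sample_P(x, i, theta)                  # step i)
        log_phi[i] += np.log1p(gamma(a))              # step ii), log scale
        visits[i] += 1
        v = visits / visits.sum()                     # occupation frequencies
        if np.max(np.abs(v - 1.0 / d)) <= c / d:      # step iii): flat histogram
            a += 1                                    # move to the next step-size
            visits[:] = 0                             # reset the visit counts
    return np.exp(log_phi - np.logaddexp.reduce(log_phi)), (x, i)
```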

Remark 2.1.

1. The performance of Algorithm 2.2 depends very much on the choice of $\{\gamma_n\}$ and $c$. Theoretically, we show that the choice $\gamma_n \propto n^{-1}$ guarantees the convergence of the algorithm, but in practice this type of step-size can be overly slow. The user might then consider instead $\gamma_n \propto a^{-n}$ ($a > 1$), as originally proposed by Wang and Landau; however, we were not able to obtain the convergence of the algorithm for summable step-sizes. A good compromise is to start the sampler with $\gamma_n \propto a^{-n}$ until $\gamma_n < \varepsilon$ (e.g. $\varepsilon = 10^{-5}$) and then switch to $\gamma_n = n^{-1}$. There is also a bias-variance trade-off involved in the choice of $c$. For $c$ close to 0, Algorithm 2.2 will have a low bias in estimating $\theta_\star(\cdot)$ but a high variance; such values of $c$ are more suitable for $\gamma_n \propto a^{-n}$. For larger values of $c$, the bias will be higher and the variance lower. For $c = d - 1$, the criterion in step iii) is always satisfied and we get a standard stochastic approximation algorithm with deterministic step-size, for which a step-size that sums to $\infty$ is necessary for convergence. We found empirically from our simulations that moderate values of $c$ (the examples below use $c = 0.3$ and $c = 0.4$) yield reasonably good samplers.

2. In an actual implementation of the algorithm, it is not necessary to re-normalize $\phi_n$ into $\theta_n$ as in step ii). In fact, for numerical stability, we recommend carrying out the recursion on a logarithmic scale: $\log \phi_n(i) = \log \phi_{n-1}(i) + \log\left(1 + \gamma_{a_{n-1}}\right) \mathbf{1}_{\{I_n = i\}}$.

3. Another interesting feature of Algorithm 2.2 is that step iii) can serve as a stopping rule: we stop the simulation when $\gamma_{a_n}$ gets smaller than some pre-specified value.

4. Under some regularity conditions, if $f : \mathcal{X} \to \mathbb{R}$ is a $\pi_\star$-integrable function of interest and $\{(X_n, I_n, \theta_n)\}$ is as described in Algorithm 2.2, we will show below that

$$\frac{1}{n} \sum_{k=1}^n f(X_k, I_k) \to \pi_\star(f), \quad \text{a.s. as } n \to \infty.$$

If we denote by $\pi_i$ the distribution on $\mathcal{X}_i$ with density with respect to $\lambda_i$ proportional to $h_i(x) \mathbf{1}_{\mathcal{X}_i}(x)$, we can estimate integrals with respect to $\pi_i$ as well:

$$\frac{\sum_{k=1}^n f(X_k, I_k) \mathbf{1}_{\mathcal{X}_i}(X_k)}{\sum_{k=1}^n \mathbf{1}_{\mathcal{X}_i}(X_k)} \to \pi_i\left(f(\cdot, i)\right), \quad \text{a.s. as } n \to \infty.$$

Now, if we denote by $\pi$ the distribution on $\mathcal{X}$ whose density with respect to $\lambda$ is proportional to $h_i(x)$ on $\mathcal{X}_i$, the ratio $\pi(x, i)/\pi_\star(x, i)$ is $d\,\theta_\star(i)$, and integrals with respect to $\pi$ can also be computed by importance sampling:

$$\frac{d}{n} \sum_{k=1}^n f(X_k, I_k)\, \theta_k(I_k) \to \pi(f), \quad \text{a.s. as } n \to \infty.$$

Various methods for recycling Monte Carlo samples can be implemented as well. All these results follow from the strong law of large numbers of Theorem 4.1.
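As an illustration of the estimators in item 4, the following hypothetical post-processing routine computes the three estimates from the stored output of a WL run; it assumes every component was visited at least once.

```python
import numpy as np

def wl_estimates(f_vals, I, theta, d):
    """Sketch of the estimators of Remark 2.1(4).
    f_vals : array of f(X_k, I_k), k = 1..n
    I      : integer array of component labels I_k
    theta  : (n, d) array whose k-th row is the weight vector theta_k"""
    n = len(f_vals)
    pi_star_f = f_vals.mean()                              # estimate of pi_star(f)
    # within-component estimates pi_i(f(., i)); assumes each i was visited
    pi_i_f = [f_vals[I == i].mean() for i in range(d)]
    # importance-sampling estimate of pi(f): (d/n) * sum f(X_k,I_k) theta_k(I_k)
    pi_f = d * np.mean(f_vals * theta[np.arange(n), I])
    return pi_star_f, pi_i_f, pi_f
```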

3. Some Applications

In this section, we briefly detail some applications of the general algorithm to multicanonical sampling, simulated tempering and trans-dimensional MCMC.

3.1. Multicanonical sampling

Multicanonical sampling is a powerful algorithm proposed by [9]. It holds the potential of improving on the mixing times of classical MCMC algorithms and fits naturally in the framework above, but its implementation can be tedious. Assume that we want to sample from a probability measure $\pi(dx) \propto h(x)\, \lambda(dx)$ on some probability space $(\Sigma, \mathcal{A}, \lambda)$. We use the energy function $E(x) = -\log(h(x))$ to build a $d$-component partition $(\mathcal{X}_i)_i$ of $\Sigma$:

$$\mathcal{X}_i = \{x \in \Sigma :\ E_{i-1} < E(x) \leq E_i\},$$

where $E_0 < E_1 < \ldots < E_d$ are predefined values. Denote $\theta_\star(i) = \pi(\mathcal{X}_i)$ and assume $\theta_\star(i) > 0$. As above, we introduce the union space $\mathcal{X} = \cup\, \mathcal{X}_i \times \{i\}$. The idea of multicanonical sampling is to sample from $\pi_\star$ given by

$$\pi_\star(dx, i) \propto \frac{h(x)}{\theta_\star(i)}\, \mathbf{1}_{\mathcal{X}_i}(x)\, \lambda(dx),$$

which is of the form (1). There is a simpler formulation of the algorithm. Since the component of the partition to which a point $x$ belongs can be obtained from $x$ itself, multicanonical sampling is equivalent to sampling from $\pi_\star$ on $(\Sigma, \mathcal{A})$ given by

$$\pi_\star(dx) \propto \sum_{i=1}^d \frac{h(x)}{\theta_\star(i)}\, \mathbf{1}_{\mathcal{X}_i}(x)\, \lambda(dx), \qquad (4)$$

and the union space formalism is not needed. After sampling from $\pi_\star$ in (4), a straightforward importance sampling estimate allows us to recover $\pi$. The algorithm tries to break the barriers in the energy landscape of the distribution by re-weighting each component $\mathcal{X}_i$. Clearly, its success depends heavily on a good choice of the energy rings $E_0, \ldots, E_d$; this typically requires some prior information on $\pi$ or some pilot simulations. We point out that although the energy function $E$ is a natural candidate for partitioning the space, the idea can be extended to other functions. In the description of multicanonical sampling given above, taking $\Sigma$ discrete, $\pi$ the uniform distribution on $\Sigma$, and $\mathcal{X}_e = \{x \in \Sigma :\ E(x) = e\}$ for $e \in \{e \in \mathbb{R} :\ E(x) = e \text{ for some } x \in \Sigma\}$ yields the Wang-Landau algorithm of [21].
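A small sketch of the bookkeeping involved: given the energy rings $E_0 < \ldots < E_d$, the label of the component containing a point $x$ is read off from its energy $E(x) = -\log h(x)$. The function names below are our own.

```python
import numpy as np

def make_partition(log_h, energy_levels):
    """Return a labelling function x -> i (0-based) for the multicanonical
    partition X_i = {x : E_{i-1} < E(x) <= E_i}, with E(x) = -log h(x).
    `energy_levels` holds the user-supplied rings E_0 < E_1 < ... < E_d."""
    E = np.asarray(energy_levels)

    def label(x):
        e = -log_h(x)
        # index of the first ring boundary >= e gives the component;
        # clip guards against energies falling outside [E_0, E_d]
        return int(np.clip(np.searchsorted(E, e, side='left') - 1,
                           0, len(E) - 2))

    return label
```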

3.2. Simulated Tempering

The method can be applied to the simulated tempering algorithm of [18] and [12] by taking $(\mathcal{X}_i, \mathcal{B}_i, \lambda_i) \equiv (\mathcal{X}_1, \mathcal{B}_1, \lambda_1)$ and $h_i = h^{1/t_i}$, $1 = t_1 < t_2 < \ldots < t_d$. Simulated tempering is a well-known Monte Carlo strategy for sampling from difficult target distributions. Assume that the distribution of interest is $\pi_1(dx) \propto h(x)\, \lambda(dx)$. Typically, for a large temperature $t$, $h^{1/t}$ is a better-behaved distribution for which faster mixing Markov chains can be built. In simulated tempering, we try to take advantage of these faster mixing chains by targeting the distribution

$$\pi_\theta(dx, i) \propto \frac{h^{1/t_i}(x)}{\theta(i)}\, \mathbf{1}_{\mathcal{X}_i}(x)\, \lambda_1(dx), \qquad (5)$$

on the union space $(\mathcal{X}, \mathcal{B}, \lambda)$. An MCMC sampler $P_\theta$ with invariant distribution $\pi_\theta$ is readily designed. Typically, $P_\theta$ takes the form

$$P_\theta\left((x, i);\ A \times \{j\}\right) = B_{x,\theta}(i, j)\, P^{[j]}(x, A), \qquad (6)$$

where $B_{x,\theta}$ is a transition kernel on $\{1, \ldots, d\}$ with invariant distribution $\left(h_j(x)/\theta(j)\right) \left(\sum_l h_l(x)/\theta(l)\right)^{-1}$, and $P^{[j]}$ is a transition kernel on $(\mathcal{X}_1, \mathcal{B}_1)$ with invariant distribution proportional to $h_j(x)\, \lambda(dx)$. Typically, one takes $B_{x,\theta}(i, j) = \left(h_j(x)/\theta(j)\right) \left(\sum_l h_l(x)/\theta(l)\right)^{-1}$. Another common choice is to take $B_{x,\theta}$ as a Metropolis kernel on $\{1, \ldots, d\}$ with proposal $q(i, j)$. By standard importance sampling techniques, we can convert samples from the higher temperature distributions into estimates for $\pi_1$ as well.

The method holds for any $\theta \in \mathbb{R}^d$ with $\theta(i) > 0$, but the choice of $\theta$ can significantly impact the efficiency. Heuristically, to improve mixing we need a $\theta$ that lets the sampler visit the fast-converging (but close to $\pi_1$) distributions; yet since $\pi_1$ is the distribution of interest, statistical efficiency calls for a $\theta$ that favors $\pi_1$. One easy way to resolve this trade-off is to choose $\theta$ such that all the distributions are equally visited. For this we need to choose $\theta(i) = \theta_\star(i) \propto \int h_i(x)\, \lambda_1(dx)$ and sample from

$$\pi_\star(dx, i) \propto \frac{h^{1/t_i}(x)}{\theta_\star(i)}\, \mathbf{1}_{\mathcal{X}_i}(x)\, \lambda_1(dx),$$

which can be done with Algorithm 2.2.
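A minimal sketch of a kernel of the form (6): a Gibbs update of the temperature index drawn from $B_{x,\theta}$, followed by a within-temperature move. The fixed-temperature routine `sample_P_fixed` is an assumed ingredient (e.g. a random-walk Metropolis step targeting $h^{1/t_j}$), and the interface is our own.

```python
import numpy as np

def tempering_kernel(x, i, theta, log_h, temps, sample_P_fixed, rng):
    """One step of the kernel (6).
    log_h(x)       : log h(x) for the target pi_1(dx) ∝ h(x) lambda(dx)
    temps          : temperature ladder 1 = t_1 < ... < t_d
    sample_P_fixed : (x, j) -> x', a fixed-temperature MCMC move with
                     invariant distribution ∝ h(x)^{1/t_j} (assumed given)"""
    # B_{x,theta}(i, j) ∝ h(x)^{1/t_j} / theta(j): sample the new level j
    log_w = log_h(x) / np.asarray(temps) - np.log(theta)
    w = np.exp(log_w - np.logaddexp.reduce(log_w))
    j = rng.choice(len(temps), p=w)
    # then move x at temperature t_j
    return sample_P_fixed(x, j), j
```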

Example 1. We compare a plain simulated tempering algorithm with weights $\theta(i) \equiv 1$ and the Wang-Landau simulated tempering described above for sampling from a multimodal bivariate Gaussian mixture distribution. The target distribution, taken from [17], is

$$\pi(x) = \frac{1}{2\pi\sigma^2} \sum_{i=1}^{20} \omega_i \exp\left\{ -\frac{1}{2\sigma^2}\, (x - \mu_i)'(x - \mu_i) \right\}, \qquad (7)$$

where $\sigma = 0.1$ and the $\omega_i$ are the mixture weights; the $\mu_i$'s are listed in Table 1. The distribution is highly multimodal, and it is clear that a plain Random Walk Metropolis algorithm for this distribution does not mix in a reasonable time. Simulated tempering can be particularly efficient in such situations. We compare two strategies: a plain simulated tempering where $\theta(i) \equiv 1$ in (5), and the WL adaptation of simulated tempering as described above. We use the temperature scale $1 < 7.7 < 31.6 < 100$. Table 2 presents the mean squared errors (MSEs) of the two methods in estimating the first two moments of the two components of $\pi$. The WL version is about three to four times more efficient than the plain version in terms of MSE. The estimates are based on 30 independent replications of the samplers, each run for 100,000 iterations. In applying Algorithm 2.2, we use $\gamma_n = 1/n$ and $c = 0.3$.

Table 1: The 20 means $(\mu_{i1}, \mu_{i2})$, $i = 1, \ldots, 20$, of the two-dimensional Gaussian mixture. (Numeric entries not recovered.)

Table 2: Mean squared errors of the plain and the WL simulated tempering algorithms in estimating $E(X_1)$, $E(X_2)$, $E(X_1^2)$ and $E(X_2^2)$, together with their ratios. Based on 30 independent replications with 100,000 iterations of each sampler. (Numeric entries not recovered.)

3.3. Application to trans-dimensional MCMC

It is often the case in statistics that many alternative models are considered for the same data. One is then interested in issues such as model comparison, model selection and model averaging. Let $f(\text{data} \mid k, x_k)$ be the likelihood of model $k$ with parameter $x_k$. Assume that we have a finite number $d$ of models and that $x_k \in (\mathcal{X}_k, \mathcal{B}_k, \lambda_k)$. Let $\mathcal{X} = \cup_{i=1}^d \mathcal{X}_i \times \{i\}$ be the union space equipped, as above, with the $\sigma$-algebra $\mathcal{B}$ and the $\sigma$-finite measure $\lambda$. $(\mathcal{X}, \mathcal{B}, \lambda)$ is the natural space to consider when dealing with both model uncertainty and parameter estimation. In the Bayesian framework, a prior density (with respect to $\lambda$) $p(x_k, k)$ on $(\mathcal{X}, \mathcal{B})$ is specified for $(x_k, k)$. The posterior distribution of $(x_k, k)$ is therefore $\pi(x_k, k) \propto h_k(x_k) = f(\text{data} \mid k, x_k)\, p(x_k, k)$. In this framework, one is often interested in the Bayes factor of model $i$ to model $j$, defined as $B_{ij} := \left(\theta_\star(i)\, p(j)\right) / \left(\theta_\star(j)\, p(i)\right)$, where $\theta_\star(i) \propto \int_{\mathcal{X}_i} \pi(x_i, i)\, \lambda_i(dx_i)$ and $p(i) = \int p(x_i, i)\, \lambda_i(dx_i)$.

Trans-dimensional MCMC is a set of specialized MCMC algorithms for sampling from distributions like $\pi$ defined on spaces of variable dimension. The reversible-jump algorithm of Green ([14]) is the most popular such sampler. In the spirit of the WL algorithm, an alternative to sampling directly from $\pi$ is to sample from the distribution

$$\pi_\star(dx_i, i) \propto \frac{h_i(x_i)}{\theta_\star(i)}\, \mathbf{1}_{\mathcal{X}_i}(x_i)\, \lambda_i(dx_i). \qquad (8)$$

By such re-weighting, we give the same posterior weight to all the models. The WL algorithm then offers an effective strategy to sample from $\pi_\star$, and we recover $\pi$ by importance sampling. This strategy can improve on the mixing of the sampler.

Example 2. We set $\mathcal{X}_i = \mathbb{R}^i$ for $i = 1, \ldots, 20$ and consider the following rather trivial trans-dimensional target distribution: $\pi(x_i, i) \propto a_i^{-1} e^{-\frac{1}{2}\|x_i\|^2}$, where we let $a_i = 1$ for $i \neq 4$ and $a_4 = (2\pi)^{16/2}$. In this distribution, $x_i \in \mathbb{R}^i$, the $i$-dimensional Euclidean space, and $\pi(x_i, i)$ restricted to $\mathbb{R}^i$ is proportional to the standard normal density. We are interested in the marginal distribution $p(i)$ of $i$. This distribution, as shown in Figure 1, is bimodal with modes at 5 and 20.

We pretend that this distribution is intractable and sample from it using a birth-and-death reversible-jump MCMC sampler. For the fixed-dimensional move, we use a Random Walk Metropolis kernel with a Gaussian proposal with covariance matrix $\sigma_p^2 I_i$, $\sigma_p = 0.1$. We implement a birth-and-death move for the trans-dimensional jump. Given $(x, i)$, we randomly select $j \in \{i-1, i+1\}$ with respective probabilities $\omega_{i,i-1}, \omega_{i,i+1}$. We choose $\omega_{i,i+1} = 1/2$, with the usual correction at the boundaries. If $j = i + 1$, we propose $y = (x_i, u)$, where $u \sim N(0, \sigma^2)$ with $\sigma = 0.1$. We accept $(y, j)$ with probability $\min(1, A)$, where

$$A = \frac{\pi(y, j)}{\pi(x, i)}\, \frac{\omega_{ji}}{\omega_{ij}}\, \sqrt{2\pi}\sigma\, e^{\frac{1}{2\sigma^2} u^2}.$$

Similarly, if $j = i - 1$, we write $x = (y, u)$ with $u \in \mathbb{R}$ and propose $(y, j)$. This value is then

accepted with probability $\min(1, A)$, with

$$A = \frac{\pi(y, j)}{\pi(x, i)}\, \frac{\omega_{ji}}{\omega_{ij}}\, \frac{1}{\sqrt{2\pi}\sigma}\, e^{-\frac{1}{2\sigma^2} u^2}.$$

This vanilla RJMCMC sampler fails to sample from $\pi$. Depending on its starting point, the sampler typically found one of the two modes and got stuck around that mode even after 10 million iterations (see Figure 1a). In contrast, the WL algorithm provided a reasonable estimate of the distribution in only 2 million iterations. In this example, each WL iteration costs roughly as much as 1.2 iterations of the vanilla sampler. For the WL approach we use $c = 0.4$ and $\gamma_n = 2^{-n}$ until $\gamma_n < 10^{-4}$, before switching to $\gamma_n = n^{-1}$.

Figure 1: Marginal posterior distribution of models. (a) Estimate from the plain RJMCMC; (b) estimate from the WL-RJMCMC; (c) true posterior distribution. Estimates are based on 10 million iterations for the plain RJMCMC and 2 million iterations for the WL-RJMCMC.
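For readers wishing to reproduce this kind of experiment, here is a hedged sketch of the birth-and-death move targeting the re-weighted distribution $\pi_\theta(x, i) \propto h_i(x)/\theta(i)$ as in (8). The interface (0-based model indices, a list `log_h` of log-targets) is our own, and the fixed-dimensional move is omitted.

```python
import numpy as np

def omega_up(i, d):
    """Probability of proposing a birth (i -> i+1), with the usual
    boundary corrections: only birth at the lowest model, only death
    at the highest."""
    if i == 0:
        return 1.0
    if i == d - 1:
        return 0.0
    return 0.5

def wl_birth_death(x, i, theta, log_h, rng, sigma=0.1):
    """One birth-and-death move of Example 2 on the WL-re-weighted target.
    Models are 0-based here, so log_h[i] is the log-target on R^{i+1};
    log_h and the calling convention are hypothetical."""
    d = len(theta)
    up = omega_up(i, d)
    if rng.random() < up:                       # birth: j = i + 1
        j = i + 1
        u = sigma * rng.standard_normal()
        y = np.append(x, u)
        log_omega = np.log(1.0 - omega_up(j, d)) - np.log(up)
        # divide by the proposal density q(u) = N(0, sigma^2)
        log_prop = 0.5 * (u / sigma) ** 2 + np.log(np.sqrt(2 * np.pi) * sigma)
    else:                                       # death: j = i - 1
        j = i - 1
        u, y = x[-1], x[:-1]
        log_omega = np.log(omega_up(j, d)) - np.log(1.0 - up)
        # multiply by the density of the discarded coordinate
        log_prop = -0.5 * (u / sigma) ** 2 - np.log(np.sqrt(2 * np.pi) * sigma)
    # WL re-weighting multiplies the target ratio by theta(i)/theta(j)
    log_A = (log_h[j](y) - log_h[i](x)
             + np.log(theta[i]) - np.log(theta[j])
             + log_omega + log_prop)
    if np.log(rng.random()) < log_A:
        return y, j
    return x, i
```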

4. Some theoretical results

We now look at some theoretical aspects of the algorithm. We investigate the convergence of $\theta_n$ to $\theta_\star$ and a strong law of large numbers for $\{(X_n, I_n)\}$. The difficulty lies in proving that the algorithm is stable, in the following sense.

Definition 4.1. Let $v_n(i)$ be the occupation measure of $\mathcal{X}_i$ by time $n$: $v_n(i) = \frac{1}{n} \sum_{k=1}^n \mathbf{1}_{\{I_k = i\}}$. The Wang-Landau algorithm is said to be stable if

$$\max_{i,j}\ \limsup_{n \to \infty}\ n\left(v_n(i) - v_n(j)\right) < \infty, \quad \text{a.s.} \qquad (9)$$

This is an essential property of the algorithm. When the algorithm is stable, we will show that all the stopping times $\kappa_l$ defined in Algorithm 2.2 are finite and the step-size $\gamma_{a_n}$ gradually converges to 0. Moreover (Theorem 4.1 below), on a path where the algorithm is stable, $\theta_n \to \theta_\star$ and a strong law of large numbers holds for $\frac{1}{n} \sum_{k=1}^n f(X_k, I_k)$. We then derive some verifiable conditions under which the algorithm is shown to be stable. To maintain the flow of ideas, some of the proofs are postponed to Section 6.

4.1. Ergodicity

Let $\Theta = \{\theta \in \mathbb{R}^d :\ \sum_{i=1}^d \theta(i) = 1,\ \theta(i) \in (0, 1),\ i = 1, \ldots, d\}$ and let $((X_0, I_0), \theta_0) \in \mathcal{X} \times \Theta$ be the initial state of the algorithm. This initial state will be considered fixed but arbitrary. Let $\{\gamma_n\}$ be the step-size sequence. Let $\Pr$ be the distribution of the process $\{(X_n, I_n, \theta_n)\}$ started at $(X_0, I_0, \theta_0)$ with step-size sequence $\{\gamma_n\}$, and denote by $E$ the expectation with respect to $\Pr$. To simplify the notation, we omit the dependence of $\Pr$ on $(X_0, I_0, \theta_0)$ and $\{\gamma_n\}$. All statements made almost surely will be with respect to $\Pr$. For $\theta \in \mathbb{R}^d$, $|\theta|$ denotes the Euclidean norm of $\theta$. For any $\varepsilon \in (0, \varepsilon_\star)$, where $\varepsilon_\star = \min_i \theta_\star(i)$, define $\Theta_\varepsilon = \{\theta \in \Theta :\ \theta(i) \geq \varepsilon,\ i = 1, \ldots, d\}$.

Our main assumption asserts that the family $\{P_\theta,\ \theta \in \Theta_\varepsilon\}$ is Lipschitz and uniformly $V$-ergodic. See e.g. [2] for some examples of MCMC samplers where these assumptions hold. Before stating these assumptions, we need some notation. For any function $f : \mathcal{X} \to \mathbb{R}$ and $W : \mathcal{X} \to [1, \infty)$, we denote $|f|_W := \sup_{x \in \mathcal{X}} \frac{|f(x)|}{W(x)}$ and introduce the set of $W$-bounded functions $\mathcal{L}_W := \{f \text{ measurable},\ f : \mathcal{X} \to \mathbb{R},\ |f|_W < \infty\}$. A transition kernel on $(\mathcal{X}, \mathcal{B})$ operates on measurable real-valued functions $f$ as $Pf(x) = \int P(x, dy) f(y)$, and the product of two transition kernels $P_1$

and $P_2$ is the transition kernel defined as $P_1 P_2(x, A) = \int P_1(x, dy)\, P_2(y, A)$. For two transition kernels $P_1$ and $P_2$, we define the $W$-distance between $P_1$ and $P_2$ as

$$\|P_1 - P_2\|_W := \sup_{|f|_W \leq 1} \left| P_1 f - P_2 f \right|_W.$$

(A1) We assume the existence of a measurable function $V : \mathcal{X} \to [1, \infty)$, a set $\mathcal{C} \subseteq \mathcal{X}$ and a probability measure $\nu$ on $(\mathcal{X}, \mathcal{B})$ with $\nu(\mathcal{C}) > 0$, with the following property. For all $\varepsilon \in (0, \varepsilon_\star)$ we can find constants $\lambda_\varepsilon \in (0, 1)$, $b_\varepsilon \in [0, \infty)$, $\beta_\varepsilon \in (0, 1]$ and an integer $n_{0,\varepsilon}$ such that:

$$\inf_{\theta \in \Theta_\varepsilon} P_\theta^{n_{0,\varepsilon}}(x, A) \geq \beta_\varepsilon\, \nu(A)\, \mathbf{1}_{\mathcal{C}}(x), \quad x \in \mathcal{X},\ A \in \mathcal{B}, \qquad (10)$$

and

$$\sup_{\theta \in \Theta_\varepsilon} P_\theta V(x) \leq \lambda_\varepsilon V(x) + b_\varepsilon \mathbf{1}_{\mathcal{C}}(x), \quad x \in \mathcal{X}. \qquad (11)$$

Inequality (11) of (A1) is the so-called drift condition, and (10) is the so-called minorization condition. Next, we assume that $P_\theta$ is Lipschitz as a function of $\theta$.

(A2) For all $\alpha \in [0, 1]$ and all $\varepsilon \in (0, \varepsilon_\star)$, there exists $K = K(\alpha, \varepsilon) < \infty$ such that for all $\theta, \theta' \in \Theta_\varepsilon$,

$$\|P_\theta - P_{\theta'}\|_{V^\alpha} \leq K\, |\theta - \theta'|, \qquad (12)$$

where $V$ is defined in (A1). Finally, we assume that the step-size sequence is well-behaved:

(A3) $\{\gamma_n\}$ is non-increasing, $\gamma_n > 0$, $\sum \gamma_n = \infty$ and $\gamma_n = O(n^{-1})$ as $n \to \infty$.

For $B > 0$, we introduce the following stopping time:

$$\tau(B) := \inf\left\{k \geq 0 :\ \max_{i,j}\ k\left(v_k(i) - v_k(j)\right) > B\right\}, \qquad (13)$$

with the usual convention that $\inf \emptyset = \infty$. $\tau(B)$ is the first time at which one component has accumulated $B$ or more visits than some other component; thus Definition 4.1 is precisely equivalent to $\tau(B) = \infty$ for some $B$. With this in mind, the next result says essentially that if the algorithm is stable and (A1-3) hold, then it is ergodic.

Theorem 4.1. Assume (A1-3). Let $B > 0$ be given. Then:

i)

$$\left|\theta_n - \theta_\star\right| \mathbf{1}_{\{\tau(B) > n\}} \to 0, \quad \text{a.s. as } n \to \infty. \qquad (14)$$

ii) For any function $f \in \mathcal{L}_{V^{1/2}}$, writing $\bar{f} = f - \pi_\star(f)$, we have:

$$\frac{1}{n} \sum_{k=1}^n \bar{f}(X_k, I_k)\, \mathbf{1}_{\{\tau(B) > k-1\}} \to 0, \quad \text{a.s. as } n \to \infty. \qquad (15)$$

Proof. See Section 6.2.

4.2. Checking (A1-2)

We impose the drift and minorization conditions (A1) uniformly for $\theta \in \Theta_\varepsilon$, not uniformly for $\theta \in \Theta$. This is an important point. Indeed, as we will now see, a drift and minorization condition uniformly in $\theta$ for $\theta \in \Theta_\varepsilon$ is almost always true as soon as one $P_\theta$ satisfies these conditions (Proposition 4.1), whereas a minorization and drift condition uniformly in $\theta$ for $\theta \in \Theta$ is almost never true. Suppose that each $P_\theta$ is a Metropolis-Hastings kernel with invariant distribution $\pi_\theta$ and proposal kernel $Q$. That is, $P_\theta f(x, i) = M_\theta f(x, i) + r_\theta(x, i) f(x, i)$, where

$$M_\theta f(x, i) = \sum_{j=1}^d \int_{\mathcal{X}_j} \min\left(1,\ \frac{\theta(i)}{\theta(j)}\, R(y, j; x, i)\right) f(y, j)\, Q(x, i; dy, j)$$

and

$$r_\theta(x, i) = 1 - \sum_{j=1}^d \int_{\mathcal{X}_j} \min\left(1,\ \frac{\theta(i)}{\theta(j)}\, R(y, j; x, i)\right) Q(x, i; dy, j),$$

with $R(x, i; y, j)$ the Radon-Nikodym density of $\pi(dx, i)\, Q(x, i; dy, j)$ with respect to $\pi(dy, j)\, Q(y, j; dx, i)$. With $\theta = (1, \ldots, 1)$, we use $\pi$ (resp. $P$ and $M$) to denote $\pi_\theta$ (resp. $P_\theta$ and $M_\theta$). The next result states that we only need to check that $P$ is geometrically ergodic in order to obtain (A1-2).

Proposition 4.1. Suppose that there exist a measurable function $V : \mathcal{X} \to [1, \infty)$, a set $\mathcal{C} \subseteq \mathcal{X}$, a probability measure $\nu$ on $(\mathcal{X}, \mathcal{B})$ with $\nu(\mathcal{C}) > 0$, constants $\lambda \in (0, 1)$, $b \in [0, \infty)$, $\beta \in (0, 1]$ and a finite integer $n_0$ such that:

$$M^{n_0}(x, A) \geq \beta\, \nu(A)\, \mathbf{1}_{\mathcal{C}}(x), \quad x \in \mathcal{X},\ A \in \mathcal{B};$$

and

$$PV(x) \leq \lambda V(x) + b\, \mathbf{1}_{\mathcal{C}}(x), \quad x \in \mathcal{X}.$$

Then (A1-2) hold.

Proof. See Section 6.1.

4.3. Stability

Theorem 4.1 asserts that under (A1-3), the WL algorithm converges to the right limit on stable paths. This raises the question of checking the stability of the algorithm. The stability condition is difficult to check in general. The next theorem gives some easily checked conditions under which the algorithm is stable.

Theorem 4.2. The WL algorithm is stable under either of the following two conditions.

a) There exist $\varepsilon \in (0, 1)$, $K \in (0, \infty)$ and an integer $n_0 \geq 0$ such that for any $i, j \in \{1, \ldots, d\}$ and $\theta \in \mathbb{R}^d$, $\theta(i)/\theta(j) > K$ implies that $P_\theta^{n_0}\left((x, i),\ \mathcal{X}_j \times \{j\}\right) \geq \varepsilon$ for all $x \in \mathcal{X}_i$.

b) There exists $\varepsilon \in (0, 1)$ such that for any $j \in \{1, \ldots, d\}$ and $\theta \in \mathbb{R}^d$, $\theta(j) \leq \min_{1 \leq i \leq d} \theta(i)$ implies $P_\theta\left((x, i),\ \mathcal{X}_j \times \{j\}\right) \geq \varepsilon$ for all $(x, i)$ with $x \notin \mathcal{X}_j$.

Proof. See Section 6.3.

4.4. Application to multicanonical sampling

Consider the multicanonical sampling of Section 3.1. Suppose that $P_\theta$ is the independence sampler with proposal distribution $Q(dx) = q(x)\, \lambda(dx)$ and invariant distribution $\pi_\theta(dx) \propto \sum_{i=1}^d \left(h(x)/\theta(i)\right) \mathbf{1}_{\mathcal{X}_i}(x)\, \lambda(dx)$. Assume that the function $\omega(x) \equiv h(x)/q(x)$ is bounded, with supremum $\omega_0$. Then clearly, for all $x \in \mathcal{X}_i$, as soon as $\theta(i) \geq \theta(j)$,

$$P_\theta(x, \mathcal{X}_j) \geq \int_{\mathcal{X}_j} \min\left(1,\ \frac{\theta(i)}{\theta(j)}\, \frac{\omega(y)}{\omega_0}\right) Q(dy) \geq \int_{\mathcal{X}_j} \min\left(1,\ \frac{\omega(y)}{\omega_0}\right) Q(dy) =: \varepsilon_j > 0$$

(since $\pi(\mathcal{X}_j) > 0$). Thus, for any $i, j \in \{1, \ldots, d\}$, $i \neq j$, condition a) of Theorem 4.2 holds and the WL algorithm is stable.
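Concretely, the kernel considered here is the standard independence Metropolis sampler applied to the re-weighted target; a minimal sketch (with an assumed interface) follows. Stability rests on $\omega = h/q$ being bounded.

```python
import numpy as np

def independence_sampler_step(x, theta, log_h, log_q, sample_q, label, rng):
    """One step of the independence sampler P_theta of Corollary 4.1(i):
    proposal density q, target pi_theta(dx) ∝ h(x)/theta(label(x)).
    `sample_q`, `log_q`, `log_h` and `label` are problem-specific and
    assumed given (hypothetical interface)."""
    y = sample_q(rng)
    i, j = label(x), label(y)
    # acceptance ratio (theta(i)/theta(j)) * (h(y)/q(y)) / (h(x)/q(x))
    log_A = (np.log(theta[i]) - np.log(theta[j])
             + (log_h(y) - log_q(y)) - (log_h(x) - log_q(x)))
    if np.log(rng.random()) < log_A:
        return y
    return x
```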

Now, since $\omega$ is bounded, each $P_\theta$ satisfies a drift condition and a minorization condition, which implies (A1-2) by Proposition 4.1. Similarly, if $P_\theta$ is a Random Walk Metropolis sampler with proposal kernel $q(y - x)$, we have:

$$P_\theta(x, \mathcal{X}_j) \geq \min\left(1,\ \frac{\theta(i)}{\theta(j)}\right) \int_{\mathcal{X}_j} \min\left(1,\ \frac{\pi(y)}{\pi(x)}\right) q(y - x)\, dy.$$

It follows that if $\mathcal{X}$ is compact and $\pi$, $q$ are positive and continuous, then condition a) of Theorem 4.2 holds for all $i, j$. Under the same assumptions, (A1-2) also hold. We can conclude that:

Corollary 4.1. Consider the multicanonical sampling of Section 3.1 and assume either i) or ii):

i) $P_\theta$ is an independence sampler with proposal distribution $Q(dx) = q(x)\, \lambda(dx)$ and $\omega \equiv h/q$ is bounded.

ii) $P_\theta$ is a RWM sampler with proposal $q(y - x)$; $\mathcal{X}$ is compact and $\pi$ and $q$ are positive and continuous.

Then the algorithm is stable and, under (A3), we have:

$$\theta_n \to \theta_\star, \quad \text{and} \quad \frac{1}{n} \sum_{k=1}^n f(X_k) \to \pi_\star(f) \quad \text{a.s. as } n \to \infty,$$

for any bounded measurable function $f$.

4.5. Application to simulated tempering

Theorems 4.1 and 4.2 can also be applied to the WL version of simulated tempering described in Section 3.2. We consider the simulated tempering algorithm with kernel $P_\theta\left((x, i);\ A \times \{j\}\right) = B_{x,\theta}(i, j)\, P^{[j]}(x, A)$, where $B_{x,\theta}(i, j) \propto h_j(x)/\theta(j)$ and $P^{[j]}$ is a transition kernel (not necessarily Metropolis-Hastings) with invariant distribution proportional to $h_j$. The following corollary is easily proved and is left to the reader.

Corollary 4.2. Suppose that $\mathcal{X}_1$ is compact, $h$ is positive and continuous, and $\{\gamma_n\}$ satisfies (A3). Suppose that there exist $\varepsilon > 0$, a probability measure $\nu$ and an integer $n_0$ such that $\nu(\mathcal{X}_i) > 0$ and $\left(P^{[i]}\right)^{n_0}(x, A) \geq \varepsilon\, \nu(A)$, $i \in \{1, \ldots, d\}$. Then

$$\theta_n \to \theta_\star, \quad \text{and} \quad \frac{1}{n} \sum_{k=1}^n f(X_k, I_k) \to \pi_\star(f) \quad \text{a.s. as } n \to \infty,$$

for any bounded measurable function $f$.

5. Discussion and open problems

In this paper, we propose an extension of the WL algorithm to general state spaces. The WL algorithm differs from other adaptive Markov chain Monte Carlo algorithms based on stochastic approximation by the adaptive nature of its step-size. We have shown through examples that the algorithm can be used effectively to improve on simulated tempering and trans-dimensional MCMC algorithms. We have also studied the asymptotic behavior of the WL algorithm: on stable sample paths and with an appropriate step-size, $\theta_n$ converges to $\theta_\star$ and a strong law of large numbers holds. Finally, when the state space is compact, we have shown that in most cases the algorithm is stable and the aforementioned limit results apply.

Two main questions remain unanswered. The first concerns the stability of the algorithm in unbounded state spaces. Secondly, in order to exploit the full potential of the algorithm, a more precise understanding of its efficiency is needed; in particular, we need to understand how the rate of convergence and the asymptotic variance of the algorithm are related to the parameter $c$ and the step-size $\gamma_n$. We are currently investigating some of these questions.

6. Proofs

The techniques used here can be found in various forms in [8], [11], [3]. Throughout, $C_\varepsilon$ denotes a generic constant whose value can differ from one equation to another. The key result in the proof of Theorem 4.1 is Lemma 6.6, which states that the weighted sum of the noise process in the stochastic approximation followed by $\{\theta_n\}$ is summable.

6.1. Proof of Proposition 4.1

Lemma 6.1. Under the conditions of Proposition 4.1, (A2) holds.

Proof. Let $\varepsilon \in (0, \varepsilon_\star)$ and $\alpha \in (0, 1]$. For $\theta \in \Theta_\varepsilon$, $|f| \leq V^\alpha$ and $(x, i) \in \mathcal{X}$, we have:

$$P_\theta f(x, i) = M_\theta f(x, i) + f(x, i)\left(1 - M_\theta e(x, i)\right),$$

where $e(x, i) \equiv 1$, and

$$M_\theta f(x, i) = \sum_{j=1}^d \int_{\mathcal{X}_j} \min\left(1,\ \frac{\theta(i)}{\theta(j)}\, R(y, j; x, i)\right) f(y, j)\, Q(x, i; dy, j).$$

As a consequence, for $\theta_2, \theta_1 \in \Theta_\varepsilon$:

$$\|P_{\theta_2} - P_{\theta_1}\|_{V^\alpha} \leq \|M_{\theta_2} - M_{\theta_1}\|_{V^\alpha} + \|M_{\theta_2} - M_{\theta_1}\|_{TV},$$

where $\|\cdot\|_{TV} = \|\cdot\|_V$ with $V \equiv 1$. For $(x, i) \in \mathcal{X}$, the function $\theta \mapsto M_\theta f(x, i)$ is differentiable and:

$$\sum_{j=1}^d \left| \frac{\partial}{\partial \theta(j)}\, M_\theta f(x, i) \right| \leq C_\varepsilon\, M|f|(x, i).$$

(A2) then follows by the mean-value theorem and the drift condition on $P$.

Next we show that for any $\varepsilon \in (0, \varepsilon_\star)$, the family $(P_\theta)_{\theta \in \Theta_\varepsilon}$ satisfies a (uniform in $\theta$) drift condition.

Lemma 6.2. Under the conditions of Proposition 4.1, (A1) holds.

Proof. The minorization is immediate. Since $\min(1, ab) \geq \min(1, a)\, \min(1, b)$ for $a, b > 0$,

$$P_\theta^{n_0}(x, i; A) \geq M_\theta^{n_0}(x, i; A) \geq \varepsilon^{n_0} M^{n_0}(x, i; A) \geq \varepsilon^{n_0} \beta\, \nu(A)\, \mathbf{1}_{\mathcal{C}}(x, i).$$

The argument for the uniform drift is fairly simple and we only sketch it. Let $B_{\theta', r} = \Theta_\varepsilon \cap B(\theta', r)$ be the open ball of $\Theta_\varepsilon$ with center $\theta' \in \Theta_\varepsilon$ and radius $r > 0$. It follows from Lemma 6.1 that, by choosing $r > 0$ small enough, if $P_{\theta'}$ satisfies a drift condition toward a small set $\mathcal{C}$, then the family $\{P_\theta,\ \theta \in B_{\theta', r}\}$ satisfies a (uniform in $\theta$) drift condition toward $\mathcal{C}$. Therefore, starting from $P$, we can find an open cover of $\Theta_\varepsilon$ by the balls $B_{\theta', r}$ such that on each such ball a uniform drift toward $\mathcal{C}$ holds. Since $\Theta_\varepsilon$ is compact, it admits a finite cover by the $B_{\theta', r}$ and, using the maximum of the constants of the drift condition toward $\mathcal{C}$ over these balls, we get a uniform drift toward $\mathcal{C}$ for $\{P_\theta,\ \theta \in \Theta_\varepsilon\}$.

6.2. Proof of Theorem 4.1

We start with some preliminary remarks. For any $\varepsilon \in (0, \varepsilon_\star)$ and $\alpha \in (0, 1]$, it is known from Markov chain theory that (A1) implies the existence of $C_{\varepsilon,\alpha} < \infty$, $\rho_{\varepsilon,\alpha} \in (0, 1)$ and $b_\varepsilon$ as in (A1) such that:

$$\sup_{\theta \in \Theta_\varepsilon} \left\| P_\theta^n - \pi_\theta \right\|_{V^\alpha} \leq C_{\varepsilon,\alpha}\, \rho_{\varepsilon,\alpha}^n, \qquad (16)$$

and

$$\sup_{\theta \in \Theta_\varepsilon} \pi_\theta(V) \leq b_\varepsilon. \qquad (17)$$

For a proof, see e.g. [7] and the references therein. Define $\xi(\varepsilon) = \inf\{k \geq 0 :\ \theta_k \notin \Theta_\varepsilon\}$. An easy calculation using (A1) gives

$$E\left[V(X_n, I_n)\, \mathbf{1}_{\{\xi(\varepsilon) > n\}}\right] \leq \lambda_\varepsilon^n\, V(X_0, I_0) + b_\varepsilon / (1 - \lambda_\varepsilon),$$

from which we deduce:

$$\sup_n\ E\left(V(X_n, I_n)\, \mathbf{1}_{\{\xi(\varepsilon) > n\}}\right) < \infty. \qquad (18)$$

The proof of the theorem is based on some nice properties of the solutions of the so-called Poisson equation. These solutions allow us to obtain a martingale approximation to the process $\sum_{k=1}^n f(X_k, I_k)$. For $f \in \mathcal{L}_{V^\alpha}$, $\alpha \in (0, 1]$, and $\theta \in \Theta_\varepsilon$, define the function

$$h_\theta = \sum_{k \geq 0} P_\theta^k\left(f - \pi_\theta(f)\right). \qquad (19)$$

$h_\theta$ solves the Poisson equation $f - \pi_\theta(f) = h_\theta - P_\theta h_\theta$. By (A1), $h_\theta$ exists and $h_\theta \in \mathcal{L}_{V^\alpha}$. For $f \in \mathcal{L}_{V^\alpha}$ and $\theta, \theta' \in \Theta_\varepsilon$, writing $f_\theta = f - \pi_\theta(f)$, we have

$$\pi_{\theta'}(f) - \pi_\theta(f) = \pi_{\theta'}\left[P_\theta^k f_\theta\right] + \sum_{j=1}^k \pi_{\theta'}\left[P_{\theta'}^{k-j}\left(P_{\theta'} - P_\theta\right) P_\theta^{j-1} f_\theta\right] \leq C_\varepsilon\, \rho_\varepsilon^k + C_\varepsilon\, |\theta' - \theta| \sum_{j=1}^k \rho_\varepsilon^{j-1},$$

from which, letting $k \to \infty$, we deduce that there exists a finite constant $C_\varepsilon$ such that:

$$\sup_{f:\ |f|_{V^\alpha} \leq 1} \left| \pi_{\theta'}(f) - \pi_\theta(f) \right| \leq C_\varepsilon\, |\theta' - \theta|. \qquad (20)$$

The constant $C_\varepsilon$ is not necessarily the same from one equation to the other. Similarly, denoting by $\bar{P}_\theta$ the operator $P_\theta - \pi_\theta$, we have

$$\bar{P}_{\theta'}^k f_{\theta'} - \bar{P}_\theta^k f_\theta = \bar{P}_{\theta'}^k f - \bar{P}_\theta^k f = \sum_{j=1}^k \bar{P}_{\theta'}^{k-j}\left(\bar{P}_{\theta'} - \bar{P}_\theta\right) \bar{P}_\theta^{j-1} f \leq C_\varepsilon\, |\theta' - \theta|\, k\, \rho_\varepsilon^{k-1}\, V^\alpha,$$

using (A1-2). This, together with (20), implies the existence of a finite constant $C_\varepsilon$ such that for all $\alpha \in (0, 1]$ and $\theta, \theta' \in \Theta_\varepsilon$:

$$\left| h_{\theta'} - h_\theta \right|_{V^\alpha} + \left| P_{\theta'} h_{\theta'} - P_\theta h_\theta \right|_{V^\alpha} \leq C_\varepsilon\, |\theta' - \theta|. \qquad (21)$$

In our analysis, we mainly view $\{\theta_n\}$ as a stochastic approximation sequence. The recursion for $\{\theta_n\}$ writes:

$$\theta_{n+1}(i) = \frac{\phi_{n+1}(i)}{\sum_{e=1}^d \phi_{n+1}(e)} = \frac{\phi_n(i) + \gamma_{a_n}\, \phi_n(i)\, \mathbf{1}_{\{I_{n+1}=i\}}}{\sum_{e=1}^d \phi_n(e) + \gamma_{a_n}\, \phi_n(I_{n+1})} = \theta_n(i)\, \frac{1 + \gamma_{a_n} \mathbf{1}_{\{I_{n+1}=i\}}}{1 + \gamma_{a_n}\, \theta_n(I_{n+1})} = \theta_n(i) + \gamma_{a_n} H_i(\theta_n, I_{n+1}) + \gamma_{a_n}^2\, r_{i,n}(\theta_n, I_{n+1}), \qquad (22)$$

where $H_i(\theta, I) = \theta(i)\left(\mathbf{1}_{\{I=i\}} - \theta(I)\right)$ and $r_{i,n}(\theta, I) = -\theta(i)\,\theta(I)\, \frac{\mathbf{1}_{\{I=i\}} - \theta(I)}{1 + \gamma_{a_n}\theta(I)}$. The mean field function $h_i(\theta) = \pi_\theta(H_i)$ is

$$h_i(\theta) = \frac{\theta_\star(i) - \theta(i)}{\sum_{j=1}^d \theta_\star(j)/\theta(j)}. \qquad (23)$$

Lemma 6.3. Assume (A3). Let $B > 0$ be given and set $c_0 = 2Bd/c$. Then on $\{\tau(B) > n\}$, $a_n \geq n/c_0$. Moreover $\sum \gamma_{a_n} = \infty$, and $\sum \gamma_{a_n}^2\, \mathbf{1}_{\{\tau(B) > n\}} < \infty$ a.s.

Proof. On $\{\tau(B) > n\}$, for any $k < k' \leq n$ and any $i \in \{1, \ldots, d\}$, $\left| v_{k,k'}(i) - 1/d \right| \leq 2B/(k' - k)$. Therefore, if $l c_0 \leq n < \tau(B)$, then $\kappa_l \leq n$; that is, $a_n \geq n/c_0$ on $\{\tau(B) > n\}$. Since $\{\gamma_n\}$ is non-increasing and $a_n \leq n$, $\sum \gamma_{a_n} \geq \sum \gamma_n = \infty$. On the other hand, $\sum \gamma_{a_n}^2\, \mathbf{1}_{\{\tau(B) > n\}} \leq C \sum_n (c_0/n)^2 < \infty$.

For any $\varepsilon > 0$, recall the stopping time $\xi(\varepsilon) := \inf\{k \geq 0 :\ \theta_k \notin \Theta_\varepsilon\}$. We will need the following lemma, whose proof is left to the reader.

Lemma 6.4. Let $\{\gamma_n,\ n \geq 0\}$ be a non-increasing sequence of positive numbers and $\{v_n,\ n \geq 0\}$ a sequence of numbers such that $\left|\sum_{k=0}^N v_k\right| \leq B$ for all $N \geq 0$. Then $\left|\sum_{k=0}^N \gamma_k v_k\right| \leq \gamma_0 B$ for all $N \geq 0$.

The following lemma relates $\tau(B)$ and $\xi(\varepsilon)$.

Lemma 6.5. Assume (A3). For any $B > 0$, we can find $\varepsilon \in (0, \varepsilon_\star)$ such that $\tau(B) \leq \xi(\varepsilon)$.

Proof. Take $\varepsilon = \left(1 + (d-1)\, e^{\gamma_0 B}\right)^{-1} > 0$. Without loss of generality, we assume that $\varepsilon < \varepsilon_\star$ and $\varepsilon \leq \min_i \theta_0(i)$. We need to show that $\min_i \theta_n(i) > \varepsilon$ for all $n < \tau(B)$. But

$$\theta_n(i) = \left(1 + \sum_{j \neq i} \frac{\phi_n(j)}{\phi_n(i)}\right)^{-1}.$$

It is thus enough to show that $\phi_n(j)/\phi_n(i) \leq e^{B\gamma_0}$ for all $i \neq j$ and all $n < \tau(B)$. But we have:

$$\frac{\phi_n(j)}{\phi_n(i)} = \exp\left(\sum_{p \geq 0} \log(1 + \gamma_p)\left(N_{\kappa_p,\, n \wedge \kappa_{p+1}}(j) - N_{\kappa_p,\, n \wedge \kappa_{p+1}}(i)\right)\right),$$

where $N_{l,m}(i) = 0$ if $m \leq l$ and $N_{l,m}(i) = \sum_{q=l+1}^m \mathbf{1}_{\mathcal{X}_i}(X_q)$ otherwise ($N_{l,m}(i)$ is the number of visits to $\mathcal{X}_i$ from time $l+1$ to $m$). For any $n < \tau(B)$, $\left|\sum_{p=0}^P \left(N_{\kappa_p,\, n \wedge \kappa_{p+1}}(i) - N_{\kappa_p,\, n \wedge \kappa_{p+1}}(j)\right)\right| \leq B$ for all $P \geq 0$. Since the sequence $\log(1 + \gamma_p) \leq \gamma_p \leq \gamma_0$ is non-increasing, Lemma 6.4 thus implies that $\phi_n(j)/\phi_n(i) \leq e^{\gamma_0 B}$.

Lemma 6.6. Assume (A1-3). Let $B > 0$ be given. Let $\{\gamma_n'\}$ be a sequence that satisfies (A3) and such that $\sum_n \gamma'_{\lfloor n/a \rfloor}\, \gamma_{\lfloor n/a \rfloor} < \infty$ for all $a > 0$. For $\theta \in \Theta$, let $H_\theta : \mathcal{X} \to \mathbb{R}$ be a measurable function such that $H_\theta \in \mathcal{L}_{V^{1/2}}$. Then:

$$\sum_k \gamma'_{a_k}\, \mathbf{1}_{\{\tau(B) > k\}} \left[H_{\theta_k}(X_{k+1}, I_{k+1}) - \pi_{\theta_k}(H_{\theta_k})\right] < \infty, \quad \text{a.s.} \qquad (24)$$

Proof. Let $\varepsilon > 0$ be as in Lemma 6.5. From (A1), there exists $h_\theta \in \mathcal{L}_{V^{1/2}}$ that solves the Poisson equation $h_\theta - P_\theta h_\theta = H_\theta - \pi_\theta(H_\theta)$ for all $\theta \in \Theta_\varepsilon$. Using this, we can write:

$$\sum_{k=0}^n \gamma'_{a_k}\, \mathbf{1}_{\{\tau(B) > k\}} \left[H_{\theta_k}(X_{k+1}, I_{k+1}) - \pi_{\theta_k}(H_{\theta_k})\right] = \sum_{k=0}^n \gamma'_{a_k}\, \mathbf{1}_{\{\tau(B) > k\}} \left(U^{(1)}_{k+1} + U^{(2)}_{k+1} + U^{(3)}_{k+1}\right),$$

where $U^{(1)}_{k+1} = h_{\theta_k}(X_{k+1}, I_{k+1}) - P_{\theta_k} h_{\theta_k}(X_k, I_k)$, $U^{(2)}_{k+1} = P_{\theta_k} h_{\theta_k}(X_k, I_k) - P_{\theta_{k+1}} h_{\theta_{k+1}}(X_{k+1}, I_{k+1})$, and $U^{(3)}_{k+1} = P_{\theta_{k+1}} h_{\theta_{k+1}}(X_{k+1}, I_{k+1}) - P_{\theta_k} h_{\theta_k}(X_{k+1}, I_{k+1})$. Clearly, $M_n = \sum_{k=0}^n \gamma'_{a_k}\, \mathbf{1}_{\{\tau(B) > k\}} U^{(1)}_{k+1}$ is a martingale and, using Lemma 6.3, (18), and since $h_\theta$

and $P_\theta h_\theta$ belong to $\mathcal{L}_{V^{1/2}}$, we have:

$$E\left(M_n^2\right) \leq C_\varepsilon \sum_{k=0}^n E\left[\left(\gamma'_{a_k}\right)^2 \mathbf{1}_{\{\tau(B) > k\}}\, V(X_{k+1}, I_{k+1})\right] \leq C_\varepsilon \sum_k \left(\gamma'_{\lfloor k/c_0 \rfloor}\right)^2 E\left[\mathbf{1}_{\{\tau(B) > k\}}\, V(X_{k+1}, I_{k+1})\right] \leq C_\varepsilon \sum_k \left(\gamma'_{\lfloor k/c_0 \rfloor}\right)^2 < \infty.$$

By Doob's convergence theorem for martingales, $\sum \gamma'_{a_k}\, \mathbf{1}_{\{\tau(B) > k\}} U^{(1)}_{k+1}$ is finite a.s.

On $\{\tau(B) = l\}$, $l < \infty$, $\sum_k \gamma'_{a_k}\, \mathbf{1}_{\{\tau(B) > k\}} U^{(2)}_{k+1} = \sum_{k=0}^{l-1} \gamma'_{a_k} U^{(2)}_{k+1}$, which is finite almost surely. On $\{\tau(B) = \infty\}$, we can write, by summation by parts:

$$\mathbf{1}_{\{\tau(B) = \infty\}} \sum_{k=0}^n \gamma'_{a_k}\, \mathbf{1}_{\{\tau(B) > k\}} U^{(2)}_{k+1} = \gamma'_{a_0} P_{\theta_0} h_{\theta_0}(X_0, I_0) - \gamma'_{a_n}\, \mathbf{1}_{\{\tau(B) = \infty\}} P_{\theta_{n+1}} h_{\theta_{n+1}}(X_{n+1}, I_{n+1}) + \sum_{k=0}^{n-1} \left(\gamma'_{a_{k+1}} - \gamma'_{a_k}\right) \mathbf{1}_{\{\tau(B) = \infty\}} P_{\theta_{k+1}} h_{\theta_{k+1}}(X_{k+1}, I_{k+1}).$$

Since $E\left[\left(\gamma'_{a_n}\, \mathbf{1}_{\{\tau(B) = \infty\}} P_{\theta_{n+1}} h_{\theta_{n+1}}(X_{n+1}, I_{n+1})\right)^2\right] \leq C_\varepsilon \left(\gamma'_{\lfloor n/c_0 \rfloor}\right)^2$ and $\sum \left(\gamma'_{\lfloor n/c_0 \rfloor}\right)^2 < \infty$, the term $\gamma'_{a_n}\, \mathbf{1}_{\{\tau(B) = \infty\}} P_{\theta_{n+1}} h_{\theta_{n+1}}(X_{n+1}, I_{n+1})$ converges a.s. to 0. Moreover,

$$\left|\sum_{k=0}^{n-1} \left(\gamma'_{a_{k+1}} - \gamma'_{a_k}\right) \mathbf{1}_{\{\tau(B) = \infty\}} P_{\theta_{k+1}} h_{\theta_{k+1}}(X_{k+1}, I_{k+1})\right| \leq C_\varepsilon \sum_{k} \left(\gamma'_{a_k} - \gamma'_{a_{k+1}}\right) \mathbf{1}_{\{\tau(B) = \infty\}}\, V^{1/2}(X_{k+1}, I_{k+1}).$$

Since $a_k$ only changes at the stopping times $\kappa_i$, we have

$$E\left[\sum_k \left(\gamma'_{a_k} - \gamma'_{a_{k+1}}\right) \mathbf{1}_{\{\tau(B) = \infty\}}\, V^{1/2}(X_{k+1}, I_{k+1})\right] = \sum_{k \geq 1} \left(\gamma'_{k-1} - \gamma'_k\right) E\left[\mathbf{1}_{\{\tau(B) = \infty\}}\, V^{1/2}(X_{\kappa_k + 1}, I_{\kappa_k + 1})\right] \leq C_\varepsilon \sum_{k \geq 1} \left(\gamma'_{k-1} - \gamma'_k\right) \leq C_\varepsilon\, \gamma'_0.$$

By Lebesgue's dominated convergence theorem, we conclude that

$$E\left[\left|\sum_k \left(\gamma'_{a_{k+1}} - \gamma'_{a_k}\right) \mathbf{1}_{\{\tau(B) = \infty\}} P_{\theta_{k+1}} h_{\theta_{k+1}}(X_{k+1}, I_{k+1})\right|\right] < \infty.$$

This is sufficient to conclude that $\sum \gamma'_{a_k}\, \mathbf{1}_{\{\tau(B) > k\}} U^{(2)}_{k+1}$ is finite almost surely.

Using (21):

$$\left|\sum_{k=0}^n \gamma'_{a_k}\, \mathbf{1}_{\{\tau(B) > k\}} U^{(3)}_{k+1}\right| \leq C_\varepsilon \sum_k \gamma'_{a_k}\, \gamma_{a_k}\, \mathbf{1}_{\{\tau(B) > k\}}\, V^{1/2}(X_{k+1}, I_{k+1}) \leq C_\varepsilon \sum_k \gamma'_{\lfloor k/c_0 \rfloor}\, \gamma_{\lfloor k/c_0 \rfloor}\, \mathbf{1}_{\{\tau(B) > k\}}\, V^{1/2}(X_{k+1}, I_{k+1}),$$

and, by (18), $E\left[\gamma'_{\lfloor k/c_0 \rfloor}\, \gamma_{\lfloor k/c_0 \rfloor}\, \mathbf{1}_{\{\tau(B) > k\}}\, V^{1/2}(X_{k+1}, I_{k+1})\right] \leq C_\varepsilon\, \gamma'_{\lfloor k/c_0 \rfloor}\, \gamma_{\lfloor k/c_0 \rfloor}$, with $\sum_k \gamma'_{\lfloor k/c_0 \rfloor}\, \gamma_{\lfloor k/c_0 \rfloor} < \infty$. With Lebesgue's dominated convergence theorem, we deduce that

$$E\left[\left|\sum_k \gamma'_{a_k}\, \mathbf{1}_{\{\tau(B) > k\}} U^{(3)}_{k+1}\right|\right] \leq C_\varepsilon \sum_k \gamma'_{\lfloor k/c_0 \rfloor}\, \gamma_{\lfloor k/c_0 \rfloor} < \infty,$$

which implies that $\sum_{k=0}^n \gamma'_{a_k}\, \mathbf{1}_{\{\tau(B) > k\}} U^{(3)}_{k+1}$ converges almost surely to a finite limit. This completes the proof of the lemma.

We are now in a position to prove Theorem 4.1. We start with i).

Proposition 6.1. Assume (A1-3) and let $B > 0$ be given. Then $\left|\theta_n - \theta_\star\right| \mathbf{1}_{\{\tau(B) > n\}} \to 0$ with probability one as $n \to \infty$.

Proof. The idea of the proof is borrowed from [11]. We saw in (22) that

$$\theta_{n+1} = \theta_n + \gamma_{a_n} H(\theta_n, I_{n+1}) + \gamma_{a_n}^2 r_n,$$

where $H = (H_1, \ldots, H_d)$, $r_n = (r_{1,n}, \ldots, r_{d,n})$, $H_i(\theta, I) = \theta(i)\left(\mathbf{1}_{\{I=i\}} - \theta(I)\right)$ and $r_{i,n} = -\theta(i)\,\theta(I)\, \frac{\mathbf{1}_{\{I=i\}} - \theta(I)}{1 + \gamma_{a_n}\theta(I)}$. We note that $|H| \leq 1$ and $|r_n| \leq 1$. Let $\varepsilon \in (0, \varepsilon_\star)$ be such that $\tau(B) \leq \xi(\varepsilon)$ (Lemma 6.5). We recall that the mean field function of the recursion is $h_i(\theta) = \left(\theta_\star(i) - \theta(i)\right) / \sum_{j=1}^d \left(\theta_\star(j)/\theta(j)\right)$. We introduce

$$\theta_n' = \theta_n + \sum_{j \geq n} \gamma_{a_j}\, \mathbf{1}_{\{\tau(B) > j\}} \left[H(\theta_j, I_{j+1}) - h(\theta_j)\right].$$

From Lemma 6.6, $\theta_n' - \theta_n \to 0$ almost surely. $\{\theta_n'\}$ satisfies the recursion

$$\theta_{n+1}' = \theta_n' + \gamma_{a_n} h(\theta_n) + \gamma_{a_n}\, \mathbf{1}_{\{\tau(B) \leq n\}} \left[H(\theta_n, I_{n+1}) - h(\theta_n)\right] + \gamma_{a_n}^2 r_n.$$

We can then deduce:

$$\left|\theta_{n+1}' - \theta_\star\right|^2 \mathbf{1}_{\{\tau(B) > n\}} = \left|\theta_n' - \theta_\star\right|^2 \mathbf{1}_{\{\tau(B) > n\}} + 2\gamma_{a_n} \left\langle \theta_n' - \theta_\star,\ h(\theta_n) \right\rangle \mathbf{1}_{\{\tau(B) > n\}} + \gamma_{a_n}\, \mathbf{1}_{\{\tau(B) > n\}}\, \tilde{r}_n \leq \left(1 - 2\varepsilon\gamma_{a_n}\right) \left|\theta_n' - \theta_\star\right|^2 \mathbf{1}_{\{\tau(B) > n\}} + \gamma_{a_n}\, \mathbf{1}_{\{\tau(B) > n\}} \left(\tilde{r}_n + \left\langle \theta_n' - \theta_\star,\ h(\theta_n) - h(\theta_n') \right\rangle\right),$$

where $\tilde{r}_n \to 0$ almost surely as $n \to \infty$. Since $\theta_n$ remains in the compact set $\Theta_\varepsilon$, $h$ is continuous and $\theta_n' - \theta_n \to 0$, it follows that $\left\langle \theta_n' - \theta_\star,\ h(\theta_n) - h(\theta_n') \right\rangle \to 0$. We can summarize the situation as follows. Writing $U_n = \left|\theta_n' - \theta_\star\right|^2 \mathbf{1}_{\{\tau(B) > n\}}$, we have:

$$U_{n+1} \leq \left(1 - 2\varepsilon\gamma_{a_n}\right) U_n + \gamma_{a_n} r_n', \qquad (25)$$

where $r_n' \to 0$ as $n \to \infty$. This implies that $U_n \to 0$, which, given Lemma 6.6, proves the proposition. To see why $U_n \to 0$, let $\delta > 0$ be given. Take $n_0 > 0$ such that for $n \geq n_0$, $r_n' \leq 2\varepsilon\delta$ and $\left(1 - 2\varepsilon\gamma_{a_n}\right) > 0$. Then, for $n \geq n_0$,

$$U_{n+1} - \delta \leq \left(1 - 2\varepsilon\gamma_{a_n}\right)(U_n - \delta) + 2\varepsilon\gamma_{a_n}\left(r_n'/(2\varepsilon) - \delta\right) \leq \left(1 - 2\varepsilon\gamma_{a_n}\right)(U_n - \delta),$$

which implies that $\limsup (U_n - \delta) \leq 0$; since $\delta > 0$ is arbitrary, we conclude that $\lim U_n = 0$.

Proposition 6.2. Assume (A1-3) and let $B > 0$ be given. For any function $f \in \mathcal{L}_{V^{1/2}}$, writing $\bar{f} = f - \pi_\star(f)$, we have:

$$\frac{1}{n} \sum_{k=1}^n \bar{f}(X_k, I_k)\, \mathbf{1}_{\{\tau(B) > k-1\}} \to 0, \quad \text{a.s. as } n \to \infty. \qquad (26)$$

Proof. In view of Proposition 6.1 and (20), we only need to show that

$$\frac{1}{n} \sum_{k=1}^n \left(f(X_k, I_k) - \pi_{\theta_{k-1}}(f)\right) \mathbf{1}_{\{\tau(B) > k-1\}} \to 0 \quad \text{a.s.} \qquad (27)$$

Kronecker's lemma applied to (24) of Lemma 6.6, with $\gamma_n' = 1/n$ and $H_\theta = f$, yields (27).

6.3. Proof of Theorem 4.2

Proof. Assume that a) holds. Define $\alpha_k = (1 + \gamma_0)^{(1+n_0)k}$ and suppose that $\limsup_{n} n\left(v_n(i) - v_n(j)\right) = \infty$. This implies the existence of an increasing sequence of integers $\{n_k,\ k \geq 1\}$ such that $n_k\left(v_{n_k}(i) - v_{n_k}(j)\right) > \alpha_k$ and $(X_{n_k}, I_{n_k}) \in \mathcal{X}_i \times \{i\}$ but $(X_{n_k + n_0}, I_{n_k + n_0}) \notin \mathcal{X}_j \times \{j\}$ for all $k \geq 1$. Clearly, $n_k\left(v_{n_k}(i) - v_{n_k}(j)\right) > \alpha_k$ implies that $\theta_{n_k}(i)/\theta_{n_k}(j)$ converges to $+\infty$. But, since $i$ leads to $j$, we can then find $\varepsilon > 0$ and $k_0$ such that for $k \geq k_0$:

$$\Pr\left[(X_{n_k + n_0}, I_{n_k + n_0}) \notin \mathcal{X}_j \times \{j\} \ \middle|\ \mathcal{F}_{n_k},\ (X_{n_k}, I_{n_k}) \in \mathcal{X}_i \times \{i\},\ n_k\left(v_{n_k}(i) - v_{n_k}(j)\right) > \alpha_k\right] \leq 1 - \varepsilon.$$

Thus $\Pr\left(\limsup_n n\left(v_n(i) - v_n(j)\right) = \infty\right) \leq \lim_k (1 - \varepsilon)^k = 0$.

Assume that b) holds. Define $\alpha_k = (1 + \gamma_0)^{2k}$ and suppose that $\limsup_n n\left(\max_i v_n(i) - \min_j v_n(j)\right) = \infty$. Then we can find $i_0 \in \{1, \ldots, d\}$ and an increasing sequence of integers $\{n_k,\ k \geq 1\}$ such that $\min_j v_{n_k}(j) = v_{n_k}(i_0)$, $n_k\left(\max_j v_{n_k}(j) - v_{n_k}(i_0)\right) > \alpha_k$, $(X_{n_k}, I_{n_k}) \notin \mathcal{X}_{i_0} \times \{i_0\}$ and $(X_{n_k + 1}, I_{n_k + 1}) \notin \mathcal{X}_{i_0} \times \{i_0\}$ for all $k \geq 1$. We can then proceed as above and conclude.

Acknowledgments: The authors are grateful to Christophe Andrieu, David P. Landau and Eric Moulines for very helpful discussions. We also thank David P. Sanders and the referees for comments that have, we hope, made this paper more readable. This work is partly supported by the National Science Foundation (grant DMS) and by a postdoctoral fellowship from the Natural Sciences and Engineering Research Council of Canada.

References

[1] Andrieu, C. and Atchadé, Y. F. (2007). On the efficiency of adaptive MCMC algorithms. Electronic Communications in Probability.

[2] Andrieu, C. and Moulines, É. (2006). On the ergodicity properties of some adaptive MCMC algorithms. Ann. Appl. Probab.

[3] Andrieu, C., Moulines, É. and Priouret, P. (2005). Stability of stochastic approximation under verifiable conditions. SIAM J. Control Optim. (electronic).

[4] Andrieu, C. and Robert, C. P. (2001). Controlled MCMC for optimal sampling. Technical report, Université Paris Dauphine, Ceremade.

[5] Atchadé, Y. F. and Liu, J. S. (2006). Discussion of the paper by Kou, Zhou and Wong. Annals of Statistics. To appear.

[6] Atchadé, Y. F. and Rosenthal, J. S. (2005). On adaptive Markov chain Monte Carlo algorithms. Bernoulli.

[7] Baxendale, P. H. (2005). Renewal theory and computable convergence rates for geometrically ergodic Markov chains. Annals of Applied Probability.

[8] Benveniste, A., Métivier, M. and Priouret, P. (1990). Adaptive Algorithms and Stochastic Approximations. Applications of Mathematics, Springer, Paris-New York.

[9] Berg, B. A. and Neuhaus, T. (1992). Multicanonical ensemble: A new approach to simulate first-order phase transitions. Phys. Rev. Lett. 68.

[10] Brockwell, A. E. and Kadane, J. B. (2005). Identification of regeneration times in MCMC simulation, with application to adaptive schemes. J. Comput. Graph. Statist.

[11] Delyon, B. (1996). General results on the convergence of stochastic algorithms. IEEE Trans. Automat. Control.

[12] Geyer, C. J. and Thompson, E. (1995). Annealing Markov chain Monte Carlo with applications to pedigree analysis. Journal of the American Statistical Association.

[13] Gilks, W. R., Roberts, G. O. and Sahu, S. K. (1998). Adaptive Markov chain Monte Carlo through regeneration. J. Amer. Statist. Assoc.

[14] Green, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika.

[15] Haario, H., Saksman, E. and Tamminen, J. (2001). An adaptive Metropolis algorithm. Bernoulli.

[16] Liang, F., Liu, C. and Carroll, R. J. (2007). Stochastic approximation in Monte Carlo computation. JASA.

[17] Liang, F. and Wong, W. H. (2001). Real-parameter evolutionary Monte Carlo with applications to Bayesian mixture models. Journal of the American Statistical Association.

[18] Marinari, E. and Parisi, G. (1992). Simulated tempering: A new Monte Carlo scheme. Europhysics Letters.

[19] Mira, A. and Sargent, D. J. (2003). A new strategy for speeding Markov chain Monte Carlo algorithms. Stat. Methods Appl.

[20] Rosenthal, J. S. and Roberts, G. O. (2007). Coupling and ergodicity of adaptive MCMC. Journal of Applied Probability.

[21] Wang, F. and Landau, D. P. (2001). Efficient, multiple-range random walk algorithm to calculate the density of states. Physical Review Letters.


More information

Probability and Measure

Probability and Measure Probability and Measure Robert L. Wolpert Institute of Statistics and Decision Sciences Duke University, Durham, NC, USA Convergence of Random Variables 1. Convergence Concepts 1.1. Convergence of Real

More information

Stochastic Approximation in Monte Carlo Computation

Stochastic Approximation in Monte Carlo Computation Stochastic Approximation in Monte Carlo Computation Faming Liang, Chuanhai Liu and Raymond J. Carroll 1 June 26, 2006 Abstract The Wang-Landau algorithm is an adaptive Markov chain Monte Carlo algorithm

More information

is a Borel subset of S Θ for each c R (Bertsekas and Shreve, 1978, Proposition 7.36) This always holds in practical applications.

is a Borel subset of S Θ for each c R (Bertsekas and Shreve, 1978, Proposition 7.36) This always holds in practical applications. Stat 811 Lecture Notes The Wald Consistency Theorem Charles J. Geyer April 9, 01 1 Analyticity Assumptions Let { f θ : θ Θ } be a family of subprobability densities 1 with respect to a measure µ on a measurable

More information

Winter 2019 Math 106 Topics in Applied Mathematics. Lecture 9: Markov Chain Monte Carlo

Winter 2019 Math 106 Topics in Applied Mathematics. Lecture 9: Markov Chain Monte Carlo Winter 2019 Math 106 Topics in Applied Mathematics Data-driven Uncertainty Quantification Yoonsang Lee (yoonsang.lee@dartmouth.edu) Lecture 9: Markov Chain Monte Carlo 9.1 Markov Chain A Markov Chain Monte

More information

April 20th, Advanced Topics in Machine Learning California Institute of Technology. Markov Chain Monte Carlo for Machine Learning

April 20th, Advanced Topics in Machine Learning California Institute of Technology. Markov Chain Monte Carlo for Machine Learning for for Advanced Topics in California Institute of Technology April 20th, 2017 1 / 50 Table of Contents for 1 2 3 4 2 / 50 History of methods for Enrico Fermi used to calculate incredibly accurate predictions

More information

Stochastic Proximal Gradient Algorithm

Stochastic Proximal Gradient Algorithm Stochastic Institut Mines-Télécom / Telecom ParisTech / Laboratoire Traitement et Communication de l Information Joint work with: Y. Atchade, Ann Arbor, USA, G. Fort LTCI/Télécom Paristech and the kind

More information

Adaptive Monte Carlo methods

Adaptive Monte Carlo methods Adaptive Monte Carlo methods Jean-Michel Marin Projet Select, INRIA Futurs, Université Paris-Sud joint with Randal Douc (École Polytechnique), Arnaud Guillin (Université de Marseille) and Christian Robert

More information

A quick introduction to Markov chains and Markov chain Monte Carlo (revised version)

A quick introduction to Markov chains and Markov chain Monte Carlo (revised version) A quick introduction to Markov chains and Markov chain Monte Carlo (revised version) Rasmus Waagepetersen Institute of Mathematical Sciences Aalborg University 1 Introduction These notes are intended to

More information

SPECTRAL GAP FOR ZERO-RANGE DYNAMICS. By C. Landim, S. Sethuraman and S. Varadhan 1 IMPA and CNRS, Courant Institute and Courant Institute

SPECTRAL GAP FOR ZERO-RANGE DYNAMICS. By C. Landim, S. Sethuraman and S. Varadhan 1 IMPA and CNRS, Courant Institute and Courant Institute The Annals of Probability 996, Vol. 24, No. 4, 87 902 SPECTRAL GAP FOR ZERO-RANGE DYNAMICS By C. Landim, S. Sethuraman and S. Varadhan IMPA and CNRS, Courant Institute and Courant Institute We give a lower

More information

Introduction to Markov Chain Monte Carlo & Gibbs Sampling

Introduction to Markov Chain Monte Carlo & Gibbs Sampling Introduction to Markov Chain Monte Carlo & Gibbs Sampling Prof. Nicholas Zabaras Sibley School of Mechanical and Aerospace Engineering 101 Frank H. T. Rhodes Hall Ithaca, NY 14853-3801 Email: zabaras@cornell.edu

More information

Stat 516, Homework 1

Stat 516, Homework 1 Stat 516, Homework 1 Due date: October 7 1. Consider an urn with n distinct balls numbered 1,..., n. We sample balls from the urn with replacement. Let N be the number of draws until we encounter a ball

More information

Perturbed Proximal Gradient Algorithm

Perturbed Proximal Gradient Algorithm Perturbed Proximal Gradient Algorithm Gersende FORT LTCI, CNRS, Telecom ParisTech Université Paris-Saclay, 75013, Paris, France Large-scale inverse problems and optimization Applications to image processing

More information

Sampling from complex probability distributions

Sampling from complex probability distributions Sampling from complex probability distributions Louis J. M. Aslett (louis.aslett@durham.ac.uk) Department of Mathematical Sciences Durham University UTOPIAE Training School II 4 July 2017 1/37 Motivation

More information

Markov chain Monte Carlo methods in atmospheric remote sensing

Markov chain Monte Carlo methods in atmospheric remote sensing 1 / 45 Markov chain Monte Carlo methods in atmospheric remote sensing Johanna Tamminen johanna.tamminen@fmi.fi ESA Summer School on Earth System Monitoring and Modeling July 3 Aug 11, 212, Frascati July,

More information

Computer Practical: Metropolis-Hastings-based MCMC

Computer Practical: Metropolis-Hastings-based MCMC Computer Practical: Metropolis-Hastings-based MCMC Andrea Arnold and Franz Hamilton North Carolina State University July 30, 2016 A. Arnold / F. Hamilton (NCSU) MH-based MCMC July 30, 2016 1 / 19 Markov

More information

ON CONVERGENCE RATES OF GIBBS SAMPLERS FOR UNIFORM DISTRIBUTIONS

ON CONVERGENCE RATES OF GIBBS SAMPLERS FOR UNIFORM DISTRIBUTIONS The Annals of Applied Probability 1998, Vol. 8, No. 4, 1291 1302 ON CONVERGENCE RATES OF GIBBS SAMPLERS FOR UNIFORM DISTRIBUTIONS By Gareth O. Roberts 1 and Jeffrey S. Rosenthal 2 University of Cambridge

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods By Oleg Makhnin 1 Introduction a b c M = d e f g h i 0 f(x)dx 1.1 Motivation 1.1.1 Just here Supresses numbering 1.1.2 After this 1.2 Literature 2 Method 2.1 New math As

More information

Minicourse on: Markov Chain Monte Carlo: Simulation Techniques in Statistics

Minicourse on: Markov Chain Monte Carlo: Simulation Techniques in Statistics Minicourse on: Markov Chain Monte Carlo: Simulation Techniques in Statistics Eric Slud, Statistics Program Lecture 1: Metropolis-Hastings Algorithm, plus background in Simulation and Markov Chains. Lecture

More information

Resampling Markov chain Monte Carlo algorithms: Basic analysis and empirical comparisons. Zhiqiang Tan 1. February 2014

Resampling Markov chain Monte Carlo algorithms: Basic analysis and empirical comparisons. Zhiqiang Tan 1. February 2014 Resampling Markov chain Monte Carlo s: Basic analysis and empirical comparisons Zhiqiang Tan 1 February 2014 Abstract. Sampling from complex distributions is an important but challenging topic in scientific

More information

arxiv: v1 [stat.co] 28 Jan 2018

arxiv: v1 [stat.co] 28 Jan 2018 AIR MARKOV CHAIN MONTE CARLO arxiv:1801.09309v1 [stat.co] 28 Jan 2018 Cyril Chimisov, Krzysztof Latuszyński and Gareth O. Roberts Contents Department of Statistics University of Warwick Coventry CV4 7AL

More information

The simple slice sampler is a specialised type of MCMC auxiliary variable method (Swendsen and Wang, 1987; Edwards and Sokal, 1988; Besag and Green, 1

The simple slice sampler is a specialised type of MCMC auxiliary variable method (Swendsen and Wang, 1987; Edwards and Sokal, 1988; Besag and Green, 1 Recent progress on computable bounds and the simple slice sampler by Gareth O. Roberts* and Jerey S. Rosenthal** (May, 1999.) This paper discusses general quantitative bounds on the convergence rates of

More information

Stochastic Approximation Monte Carlo and Its Applications

Stochastic Approximation Monte Carlo and Its Applications Stochastic Approximation Monte Carlo and Its Applications Faming Liang Department of Statistics Texas A&M University 1. Liang, F., Liu, C. and Carroll, R.J. (2007) Stochastic approximation in Monte Carlo

More information

Lecture 8: The Metropolis-Hastings Algorithm

Lecture 8: The Metropolis-Hastings Algorithm 30.10.2008 What we have seen last time: Gibbs sampler Key idea: Generate a Markov chain by updating the component of (X 1,..., X p ) in turn by drawing from the full conditionals: X (t) j Two drawbacks:

More information

Stat 535 C - Statistical Computing & Monte Carlo Methods. Arnaud Doucet.

Stat 535 C - Statistical Computing & Monte Carlo Methods. Arnaud Doucet. Stat 535 C - Statistical Computing & Monte Carlo Methods Arnaud Doucet Email: arnaud@cs.ubc.ca 1 1.1 Outline Introduction to Markov chain Monte Carlo The Gibbs Sampler Examples Overview of the Lecture

More information

eqr094: Hierarchical MCMC for Bayesian System Reliability

eqr094: Hierarchical MCMC for Bayesian System Reliability eqr094: Hierarchical MCMC for Bayesian System Reliability Alyson G. Wilson Statistical Sciences Group, Los Alamos National Laboratory P.O. Box 1663, MS F600 Los Alamos, NM 87545 USA Phone: 505-667-9167

More information

Inference in state-space models with multiple paths from conditional SMC

Inference in state-space models with multiple paths from conditional SMC Inference in state-space models with multiple paths from conditional SMC Sinan Yıldırım (Sabancı) joint work with Christophe Andrieu (Bristol), Arnaud Doucet (Oxford) and Nicolas Chopin (ENSAE) September

More information

Brief introduction to Markov Chain Monte Carlo

Brief introduction to Markov Chain Monte Carlo Brief introduction to Department of Probability and Mathematical Statistics seminar Stochastic modeling in economics and finance November 7, 2011 Brief introduction to Content 1 and motivation Classical

More information

Lecture 2: From Linear Regression to Kalman Filter and Beyond

Lecture 2: From Linear Regression to Kalman Filter and Beyond Lecture 2: From Linear Regression to Kalman Filter and Beyond Department of Biomedical Engineering and Computational Science Aalto University January 26, 2012 Contents 1 Batch and Recursive Estimation

More information

TUNING OF MARKOV CHAIN MONTE CARLO ALGORITHMS USING COPULAS

TUNING OF MARKOV CHAIN MONTE CARLO ALGORITHMS USING COPULAS U.P.B. Sci. Bull., Series A, Vol. 73, Iss. 1, 2011 ISSN 1223-7027 TUNING OF MARKOV CHAIN MONTE CARLO ALGORITHMS USING COPULAS Radu V. Craiu 1 Algoritmii de tipul Metropolis-Hastings se constituie într-una

More information

Brownian Motion. 1 Definition Brownian Motion Wiener measure... 3

Brownian Motion. 1 Definition Brownian Motion Wiener measure... 3 Brownian Motion Contents 1 Definition 2 1.1 Brownian Motion................................. 2 1.2 Wiener measure.................................. 3 2 Construction 4 2.1 Gaussian process.................................

More information

Consistency of the maximum likelihood estimator for general hidden Markov models

Consistency of the maximum likelihood estimator for general hidden Markov models Consistency of the maximum likelihood estimator for general hidden Markov models Jimmy Olsson Centre for Mathematical Sciences Lund University Nordstat 2012 Umeå, Sweden Collaborators Hidden Markov models

More information

Bayesian inference: what it means and why we care

Bayesian inference: what it means and why we care Bayesian inference: what it means and why we care Robin J. Ryder Centre de Recherche en Mathématiques de la Décision Université Paris-Dauphine 6 November 2017 Mathematical Coffees Robin Ryder (Dauphine)

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D.

Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D. Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D. Ruppert A. EMPIRICAL ESTIMATE OF THE KERNEL MIXTURE Here we

More information

STA205 Probability: Week 8 R. Wolpert

STA205 Probability: Week 8 R. Wolpert INFINITE COIN-TOSS AND THE LAWS OF LARGE NUMBERS The traditional interpretation of the probability of an event E is its asymptotic frequency: the limit as n of the fraction of n repeated, similar, and

More information

Iterative Markov Chain Monte Carlo Computation of Reference Priors and Minimax Risk

Iterative Markov Chain Monte Carlo Computation of Reference Priors and Minimax Risk Iterative Markov Chain Monte Carlo Computation of Reference Priors and Minimax Risk John Lafferty School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 lafferty@cs.cmu.edu Abstract

More information

A Dirichlet Form approach to MCMC Optimal Scaling

A Dirichlet Form approach to MCMC Optimal Scaling A Dirichlet Form approach to MCMC Optimal Scaling Giacomo Zanella, Wilfrid S. Kendall, and Mylène Bédard. g.zanella@warwick.ac.uk, w.s.kendall@warwick.ac.uk, mylene.bedard@umontreal.ca Supported by EPSRC

More information

Bayesian computation for statistical models with intractable normalizing constants

Bayesian computation for statistical models with intractable normalizing constants Bayesian computation for statistical models with intractable normalizing constants Yves F. Atchadé, Nicolas Lartillot and Christian Robert (First version March 2008; this version Jan. 2010) Abstract: This

More information

Bayesian Inference and MCMC

Bayesian Inference and MCMC Bayesian Inference and MCMC Aryan Arbabi Partly based on MCMC slides from CSC412 Fall 2018 1 / 18 Bayesian Inference - Motivation Consider we have a data set D = {x 1,..., x n }. E.g each x i can be the

More information

The Jackknife-Like Method for Assessing Uncertainty of Point Estimates for Bayesian Estimation in a Finite Gaussian Mixture Model

The Jackknife-Like Method for Assessing Uncertainty of Point Estimates for Bayesian Estimation in a Finite Gaussian Mixture Model Thai Journal of Mathematics : 45 58 Special Issue: Annual Meeting in Mathematics 207 http://thaijmath.in.cmu.ac.th ISSN 686-0209 The Jackknife-Like Method for Assessing Uncertainty of Point Estimates for

More information

A Geometric Interpretation of the Metropolis Hastings Algorithm

A Geometric Interpretation of the Metropolis Hastings Algorithm Statistical Science 2, Vol. 6, No., 5 9 A Geometric Interpretation of the Metropolis Hastings Algorithm Louis J. Billera and Persi Diaconis Abstract. The Metropolis Hastings algorithm transforms a given

More information

On the Optimal Scaling of the Modified Metropolis-Hastings algorithm

On the Optimal Scaling of the Modified Metropolis-Hastings algorithm On the Optimal Scaling of the Modified Metropolis-Hastings algorithm K. M. Zuev & J. L. Beck Division of Engineering and Applied Science California Institute of Technology, MC 4-44, Pasadena, CA 925, USA

More information

MCMC and Gibbs Sampling. Kayhan Batmanghelich

MCMC and Gibbs Sampling. Kayhan Batmanghelich MCMC and Gibbs Sampling Kayhan Batmanghelich 1 Approaches to inference l Exact inference algorithms l l l The elimination algorithm Message-passing algorithm (sum-product, belief propagation) The junction

More information

Kernel Sequential Monte Carlo

Kernel Sequential Monte Carlo Kernel Sequential Monte Carlo Ingmar Schuster (Paris Dauphine) Heiko Strathmann (University College London) Brooks Paige (Oxford) Dino Sejdinovic (Oxford) * equal contribution April 25, 2016 1 / 37 Section

More information

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Introduction to Machine Learning CMU-10701 Markov Chain Monte Carlo Methods Barnabás Póczos Contents Markov Chain Monte Carlo Methods Sampling Rejection Importance Hastings-Metropolis Gibbs Markov Chains

More information

An ABC interpretation of the multiple auxiliary variable method

An ABC interpretation of the multiple auxiliary variable method School of Mathematical and Physical Sciences Department of Mathematics and Statistics Preprint MPS-2016-07 27 April 2016 An ABC interpretation of the multiple auxiliary variable method by Dennis Prangle

More information

A new iterated filtering algorithm

A new iterated filtering algorithm A new iterated filtering algorithm Edward Ionides University of Michigan, Ann Arbor ionides@umich.edu Statistics and Nonlinear Dynamics in Biology and Medicine Thursday July 31, 2014 Overview 1 Introduction

More information

SAMPLING ALGORITHMS. In general. Inference in Bayesian models

SAMPLING ALGORITHMS. In general. Inference in Bayesian models SAMPLING ALGORITHMS SAMPLING ALGORITHMS In general A sampling algorithm is an algorithm that outputs samples x 1, x 2,... from a given distribution P or density p. Sampling algorithms can for example be

More information

Expectation Propagation Algorithm

Expectation Propagation Algorithm Expectation Propagation Algorithm 1 Shuang Wang School of Electrical and Computer Engineering University of Oklahoma, Tulsa, OK, 74135 Email: {shuangwang}@ou.edu This note contains three parts. First,

More information

Infinite-State Markov-switching for Dynamic. Volatility Models : Web Appendix

Infinite-State Markov-switching for Dynamic. Volatility Models : Web Appendix Infinite-State Markov-switching for Dynamic Volatility Models : Web Appendix Arnaud Dufays 1 Centre de Recherche en Economie et Statistique March 19, 2014 1 Comparison of the two MS-GARCH approximations

More information

Introduction to Real Analysis Alternative Chapter 1

Introduction to Real Analysis Alternative Chapter 1 Christopher Heil Introduction to Real Analysis Alternative Chapter 1 A Primer on Norms and Banach Spaces Last Updated: March 10, 2018 c 2018 by Christopher Heil Chapter 1 A Primer on Norms and Banach Spaces

More information

Advances and Applications in Perfect Sampling

Advances and Applications in Perfect Sampling and Applications in Perfect Sampling Ph.D. Dissertation Defense Ulrike Schneider advisor: Jem Corcoran May 8, 2003 Department of Applied Mathematics University of Colorado Outline Introduction (1) MCMC

More information

Chapter 7. Markov chain background. 7.1 Finite state space

Chapter 7. Markov chain background. 7.1 Finite state space Chapter 7 Markov chain background A stochastic process is a family of random variables {X t } indexed by a varaible t which we will think of as time. Time can be discrete or continuous. We will only consider

More information

Almost Sure Convergence of Two Time-Scale Stochastic Approximation Algorithms

Almost Sure Convergence of Two Time-Scale Stochastic Approximation Algorithms Almost Sure Convergence of Two Time-Scale Stochastic Approximation Algorithms Vladislav B. Tadić Abstract The almost sure convergence of two time-scale stochastic approximation algorithms is analyzed under

More information

Auxiliary Particle Methods

Auxiliary Particle Methods Auxiliary Particle Methods Perspectives & Applications Adam M. Johansen 1 adam.johansen@bristol.ac.uk Oxford University Man Institute 29th May 2008 1 Collaborators include: Arnaud Doucet, Nick Whiteley

More information

Lecture 12. F o s, (1.1) F t := s>t

Lecture 12. F o s, (1.1) F t := s>t Lecture 12 1 Brownian motion: the Markov property Let C := C(0, ), R) be the space of continuous functions mapping from 0, ) to R, in which a Brownian motion (B t ) t 0 almost surely takes its value. Let

More information

Monte Carlo Methods. Leon Gu CSD, CMU

Monte Carlo Methods. Leon Gu CSD, CMU Monte Carlo Methods Leon Gu CSD, CMU Approximate Inference EM: y-observed variables; x-hidden variables; θ-parameters; E-step: q(x) = p(x y, θ t 1 ) M-step: θ t = arg max E q(x) [log p(y, x θ)] θ Monte

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods Tomas McKelvey and Lennart Svensson Signal Processing Group Department of Signals and Systems Chalmers University of Technology, Sweden November 26, 2012 Today s learning

More information

Chapter 1. Measure Spaces. 1.1 Algebras and σ algebras of sets Notation and preliminaries

Chapter 1. Measure Spaces. 1.1 Algebras and σ algebras of sets Notation and preliminaries Chapter 1 Measure Spaces 1.1 Algebras and σ algebras of sets 1.1.1 Notation and preliminaries We shall denote by X a nonempty set, by P(X) the set of all parts (i.e., subsets) of X, and by the empty set.

More information

STA 4273H: Sta-s-cal Machine Learning

STA 4273H: Sta-s-cal Machine Learning STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our

More information

Bayesian Inference for DSGE Models. Lawrence J. Christiano

Bayesian Inference for DSGE Models. Lawrence J. Christiano Bayesian Inference for DSGE Models Lawrence J. Christiano Outline State space-observer form. convenient for model estimation and many other things. Bayesian inference Bayes rule. Monte Carlo integation.

More information

MCMC 2: Lecture 3 SIR models - more topics. Phil O Neill Theo Kypraios School of Mathematical Sciences University of Nottingham

MCMC 2: Lecture 3 SIR models - more topics. Phil O Neill Theo Kypraios School of Mathematical Sciences University of Nottingham MCMC 2: Lecture 3 SIR models - more topics Phil O Neill Theo Kypraios School of Mathematical Sciences University of Nottingham Contents 1. What can be estimated? 2. Reparameterisation 3. Marginalisation

More information

Asymptotics and Simulation of Heavy-Tailed Processes

Asymptotics and Simulation of Heavy-Tailed Processes Asymptotics and Simulation of Heavy-Tailed Processes Department of Mathematics Stockholm, Sweden Workshop on Heavy-tailed Distributions and Extreme Value Theory ISI Kolkata January 14-17, 2013 Outline

More information