
TOWARDS OPTIMAL SCALING OF METROPOLIS-COUPLED MARKOV CHAIN MONTE CARLO

YVES F. ATCHADÉ, GARETH O. ROBERTS, AND JEFFREY S. ROSENTHAL

Abstract. We consider optimal temperature spacings for Metropolis-coupled Markov chain Monte Carlo (MCMCMC) and Simulated Tempering algorithms. We prove that, under certain conditions, it is optimal (in terms of maximising the expected squared jumping distance) to space the temperatures so that the proportion of temperature swaps which are accepted is approximately 0.234. This generalises related work by physicists, and is consistent with previous work about optimal scaling of random-walk Metropolis algorithms.

1. Introduction

The Metropolis-coupled Markov chain Monte Carlo (MCMCMC) algorithm [7], also known as parallel tempering or the replica exchange method, is a version of the Metropolis-Hastings algorithm [17, 9] which is very effective in dealing with problems of multi-modality. The algorithm works by simulating multiple copies of the target (stationary) distribution, each at a different temperature. Through a swap move, the algorithm allows copies at lower temperatures to borrow information from high-temperature copies and thus escape from local modes. The performance of MCMCMC depends crucially on the temperatures used for the different copies. There is an ongoing discussion (especially in the physics literature) on how to best select these temperatures. In this note, we show that this question can be partially addressed using the optimal scaling framework initiated in [20].

MCMCMC arises as follows. We are interested in sampling from a target probability distribution $\pi(\cdot)$ having (complicated, probably multimodal, and possibly un-normalised) target density $f_d(x)$ on some state space $\mathcal{X}$, which is an open subset of $\mathbb{R}^d$ for some (large) dimension $d$. We define a sequence of associated tempered distributions $f_d^{(\beta_j)}(x)$, where $0 \le \beta_n < \beta_{n-1} < \dots < \beta_1 < \beta_0 = 1$ are the selected inverse temperature values, subject to the restriction that $f_d^{(\beta_0)}(x) = f_d^{(1)}(x) = f_d(x)$. (Usually the densities $f_d^{(\beta_j)}$ are simply powers of the original density, i.e. $f_d^{(\beta_j)}(x) = (f_d(x))^{\beta_j}$; in this case, if $\mathcal{X}$ has infinite volume, then we require $\beta_n > 0$.)

Date: July 30, 2009; last revised April 2010.
2000 Mathematics Subject Classification. 60J10, 65C05.
Key words and phrases. Metropolis-Coupled MCMC, Simulated tempering, Optimal scaling.
Y. Atchadé: Department of Statistics, University of Michigan, 1085 South University, Ann Arbor, 48109, MI, United States. E-mail address: yvesa@umich.edu.
G. O. Roberts: Department of Statistics, University of Warwick, Coventry, CV4 7AL, UK. E-mail address: gareth.o.roberts@warwick.ac.uk.
J. S. Rosenthal: Department of Statistics, University of Toronto, Toronto, Ontario, Canada, M5S 3G3. E-mail address: jeff@math.toronto.edu.
Partially supported by NSERC of Canada.

MCMCMC proceeds by running one chain at each of the $n+1$ values of $\beta$. It has state space $\mathcal{X}^{n+1}$, with unnormalised stationary measure $\prod_{j=0}^{n} f_d^{(\beta_j)}(x_j)$, where each $x_j$ corresponds to a chain at the fixed inverse temperature $\beta_j$ having stationary density $f_d^{(\beta_j)}(x_j)$. In particular, the $\beta_0$ chain has stationary density $f_d^{(\beta_0)}(x_0) = f_d(x_0)$, so that after a long run the samples $x_0$ should approximately correspond to the density $f_d(x)$, which is the density of actual interest. The hope is that the chains corresponding to smaller $\beta$ (i.e., to higher temperatures) can mix more easily, and then can lend this mixing information to the $\beta_0$ chain, thus speeding up convergence of the $x_0$ values to the density $f_d(x)$.

To achieve this, MCMCMC alternates between two different types of dynamics. On some iterations, it attempts a within-temperature move, by updating each $x_j$ according to some type of MCMC update (e.g., a usual random-walk Metropolis algorithm) for which $f_d^{(\beta_j)}$ is a stationary density. On other iterations, it attempts a temperature swap, which consists of choosing two different inverse temperatures, say $\beta_j$ and $\beta_k$, and then proposing to swap their respective chain values, i.e. to interchange the current values of $x_j$ and $x_k$. This proposed swap is then accepted according to the usual (symmetric) Metropolis algorithm probabilities, i.e. it is accepted with probability

$$\min\left(1,\ \frac{f_d^{(\beta_j)}(x_k)\, f_d^{(\beta_k)}(x_j)}{f_d^{(\beta_j)}(x_j)\, f_d^{(\beta_k)}(x_k)}\right), \qquad (1)$$

otherwise it is rejected and the values of $x$ are left unchanged. Such algorithms lead to many interesting questions and have been widely studied, see e.g. [7, 11, 12, 18, 6, 13, 15, 5, 26, 27].

This note will concentrate on the specific question of optimising the choice of the $\beta$ values to achieve maximal efficiency in the temperature swaps. Specifically, suppose we wish to design the algorithm such that the chain at inverse temperature $\beta$ will propose to swap with another chain at inverse temperature $\beta + \epsilon$. What choice of $\epsilon$ is best? Obviously, if $\epsilon$ is very large, then such swaps will usually be rejected. Similarly, if $\epsilon$ is very small, then such swaps will usually be accepted, but will not greatly improve mixing. Hence, the optimal $\epsilon$ is somewhere between the two extremes (this is sometimes called the "Goldilocks Principle"), and our task is to identify it.

Our main result (Theorem 1) will show that, in a certain limiting sense under certain assumptions, it is optimal (in terms of maximising the expected squared jumping distance of the influence of hotter chains on colder ones) to choose the spacing $\epsilon$ such that the probability of such a swap being accepted is equal to 0.234. This is the same optimal acceptance probability derived previously for certain random-walk Metropolis algorithms [20, 22], and generalises some results in the physics literature [11, 18]. It also has connections (Section 5.1) to the Dirichlet form of an associated temperature process. We shall also consider (Section 4) the related Simulated Tempering algorithm, and shall prove that under certain conditions the 0.234 optimal acceptance rate applies there as well. We shall also compare (Corollary 1) Simulated Tempering to MCMCMC, and see that in a certain sense the former is exactly twice as efficient as the latter, but there are various mitigating factors so the comparison is far from clear-cut.
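For illustration, here is a minimal Python sketch of one full MCMCMC iteration as just described, assuming Gaussian random-walk proposals for the within-temperature moves and a user-supplied log-density `log_f`; the swap step applies the acceptance probability (1) on the log scale. (All names are illustrative, not from the original implementation, and a single proposal scale is used for simplicity.)

```python
import numpy as np

def mcmcmc_sweep(x, betas, log_f, rw_scale, rng):
    """One full MCMCMC iteration: a random-walk Metropolis update within each
    temperature, followed by one attempted swap of a random adjacent pair.

    x        : array of shape (n+1, d), one chain per inverse temperature
    betas    : decreasing array of inverse temperatures, betas[0] = 1
    log_f    : function returning log f_d(x) for a single chain value x
    rw_scale : proposal standard deviation for the within-temperature moves
    """
    n_temps, _ = x.shape
    # Within-temperature moves: the target density for chain j is f_d(x)**betas[j].
    for j in range(n_temps):
        prop = x[j] + rw_scale * rng.standard_normal(x[j].shape)
        log_alpha = betas[j] * (log_f(prop) - log_f(x[j]))
        if np.log(rng.uniform()) < log_alpha:
            x[j] = prop
    # Temperature swap between a random adjacent pair (j, j+1), accepted with
    # the symmetric Metropolis probability of equation (1), on the log scale.
    j = rng.integers(n_temps - 1)
    log_alpha = (betas[j] - betas[j + 1]) * (log_f(x[j + 1]) - log_f(x[j]))
    if np.log(rng.uniform()) < log_alpha:
        x[[j, j + 1]] = x[[j + 1, j]]
    return x
```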

1.1. Toy Example. To illustrate the importance of temperature spacings, and of measuring the influence of hotter chains on colder ones, we consider a very simple toy example. Suppose the state space consists of just 101 discrete points, $\mathcal{X} = \{0, 1, 2, 3, \dots, 100\}$, with (un-normalised) target density (with respect to counting measure on $\mathcal{X}$) given by $\pi\{x\} = 2^{-x} + 2^{-(100-x)}$ for $x \in \mathcal{X}$. That is, $\pi$ has two modes, at 0 and 100, with a virtually insurmountable barrier of low-probability states between them (Figure 1, top-left). Let the tempered densities be given (as usual) by powers of $\pi$, i.e. $f^{(\beta)}(x) = \left(\pi\{x\}\right)^{\beta}$. Thus, for very small $\beta$, the insurmountable barrier is nicely removed (Figure 1, top-middle). Can we make use of such tempered densities to improve convergence of the original chain to $\pi$?

To be specific, suppose each fixed-$\beta$ chain is a random-walk Metropolis algorithm with proposal kernel $q(x, x+1) = q(x, x-1) = 1/2$. Thus, the cold (large $\beta$) chains are virtually incapable of moving from state 0 to state 100 or back (Figure 1, top-left), but the hot (small $\beta$) chains have no such obstacle (Figure 1, top-middle). The question is, what values of the inverse temperatures $\beta$ will best allow an MCMCMC algorithm to benefit from the rapid mixing of the hot chains, to provide good mixing for the cold chain?

Figure 1: The toy example's target density (top-left), tempered density at inverse-temperature $\beta = 0.001$ (top-middle), random-walk Metropolis run (top-right), and trace plots of the cold chain over 10,000 full iterations of MCMCMC runs with two temperatures (bottom-left), ten temperatures (bottom-middle), and fifty temperatures (bottom-right).
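The toy example is straightforward to simulate; the following sketch (illustrative code, not the authors' implementation) sets up the bimodal target, its tempered powers, and the ±1 random-walk Metropolis kernel, and runs the MCMCMC scheme used in the runs reported below, with one within-temperature move per chain plus one attempted adjacent swap per full iteration.

```python
import numpy as np

# Un-normalised bimodal target on {0, ..., 100}, as in the toy example.
def log_pi(x):
    return np.logaddexp(-x * np.log(2.0), -(100 - x) * np.log(2.0))

def rwm_step(x, beta, rng):
    """One +-1 random-walk Metropolis step targeting pi(x)**beta."""
    prop = x + rng.choice([-1, 1])
    if prop < 0 or prop > 100:
        return x  # proposals off the state space are rejected
    log_alpha = beta * (log_pi(prop) - log_pi(x))
    return prop if np.log(rng.uniform()) < log_alpha else x

def mcmcmc_toy(betas, n_iter=10_000, seed=0):
    """Run the toy MCMCMC: one within-temperature move per chain, then one
    attempted swap of a random adjacent pair, per full iteration."""
    rng = np.random.default_rng(seed)
    x = np.zeros(len(betas), dtype=int)   # all chains started at state 0
    cold_trace = np.empty(n_iter, dtype=int)
    for t in range(n_iter):
        for j, beta in enumerate(betas):
            x[j] = rwm_step(x[j], beta, rng)
        j = rng.integers(len(betas) - 1)
        log_alpha = (betas[j] - betas[j + 1]) * (log_pi(x[j + 1]) - log_pi(x[j]))
        if np.log(rng.uniform()) < log_alpha:
            x[j], x[j + 1] = x[j + 1], x[j]
        cold_trace[t] = x[0]
    return cold_trace

# e.g. ten geometrically spaced inverse temperatures between 1 and 0.001:
betas = 0.001 ** (np.arange(10) / 9)
trace = mcmcmc_toy(betas)
```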

To test this, we ran an MCMCMC algorithm for 10,000 full iterations (each consisting of one update at each inverse-temperature, plus one attempted swap of adjacent inverse-temperature values), in each of four settings: with just one inverse-temperature, i.e. an ordinary MCMC algorithm with no temperature swaps (Figure 1, top-right); with two inverse-temperatures, 1 and 0.001 (Figure 1, bottom-left); with ten inverse-temperatures, $0.001^{j/9}$ for $j = 0, 1, 2, \dots, 9$ (Figure 1, bottom-middle); and with fifty inverse-temperatures, $0.001^{j/49}$ for $j = 0, 1, 2, \dots, 49$ (Figure 1, bottom-right). We started each run with all chains at the state 0, to investigate the extent to which they were able to mix well and effectively sample from the equally-large mode at state 100.

The results of our runs were that the ordinary MCMC algorithm with no temperature swaps did not traverse the barrier at all, and just stayed near the state 0 (Figure 1, top-right). Adding one additional temperature improved this somewhat (Figure 1, bottom-left), and the cold chain was now able to sometimes sample from states near 100, but only occasionally. Using ten temperatures improved this greatly and led to quite good mixing of the cold chain (Figure 1, bottom-middle). Perhaps most interestingly, using lots more temperatures (fifty) did not improve mixing further, but actually made it much worse (Figure 1, bottom-right), despite the greatly increased computation time (since each temperature requires its own chain values to be updated at each iteration).

So, we see that in this example, ten geometrically-spaced temperatures performed much better than two temperatures (too few) or fifty temperatures (too many). Intuitively, with only two temperatures, it is too difficult for the algorithm to effectively swap values between adjacent temperatures (indeed, only about 5% of proposed swaps were accepted). By contrast, with fifty temperatures, virtually all swaps are accepted, but too many swaps are required before the rapidly-mixing hot chains can have much influence over the values of the cold chain, so this again does not lead to an efficient algorithm. Best is a compromise choice, ten temperatures, which makes the temperature differences small enough so swaps are easily accepted, but still large enough that they lead to efficient interchange between hot and cold chains, and thus to efficient convergence of the cold chain to $\pi$. That is, we would like the hot and cold chains' values to be able to influence each other effectively, which requires lots of successful swaps of significant distance in the inverse-temperature domain.

This leads to the question of how to select the number and spacing of the inverse-temperature values for a given problem, which is the topic of this paper. To maximise the influence of the hot chain on the cold chain, we want to maximise the effective speed with which the chain values move along in the inverse-temperature domain. We shall do this by means of the expected squared jumping distance (ESJD) of this influence, defined below. Our main result (Theorem 1 below) says that to maximise ESJD, it is optimal (under certain assumptions) to select the inverse-temperatures so that the probability of accepting a proposed swap between adjacent inverse-temperature values is approximately 23%.

2. Optimal Temperature Spacings for MCMCMC

In this section, we consider the question of optimal temperature choice for MCMCMC algorithms.
To fix ideas, we consider the specific situation in which the algorithm is proposing to swap the chain values at two specific inverse temperatures, namely $\beta$ and $\beta + \epsilon$, where $\beta, \epsilon > 0$ and $\beta + \epsilon \le 1$.

We shall consider the question of what value of $\epsilon$ is optimal in terms of maximising the influence of the hot chain on the cold chain. We shall focus on the asymptotic situation in which the chain has already converged to its target stationary distribution, i.e. that $x \sim \prod_{j=0}^{n} f_d^{(\beta_j)}$.

To measure this precisely, let us define $\gamma = \beta + \epsilon$ if the swap is accepted, or $\gamma = \beta$ if the swap is rejected. Then $\gamma$ is an indication of where the $\beta$ chain's value has travelled to, in the inverse temperature space. That is, if the proposed swap is accepted, then the value moves from $\beta$ to $\gamma = \beta + \epsilon$, for a total inverse-temperature distance of $\gamma - \beta = \epsilon$, while if the swap is rejected then it does not move at all, i.e. the distance it moves is $\gamma - \beta = 0$. Hence, $\gamma - \beta$ indicates the extent to which the swap proposal succeeded in moving different $x$ values around the $\beta$ space, which is essential for getting benefit from the MCMCMC algorithm. So, the larger the magnitude of $\gamma - \beta$, the more efficient are the swap moves at providing mixing in the temperature domain. We therefore define the optimal choice of $\epsilon$ to be the one which maximises the stationary (i.e. asymptotic) expected squared jumping distance (ESJD) of this movement, i.e. which maximises

$$\mathrm{ESJD} = E_\pi\!\left[(\gamma - \beta)^2\right] = \epsilon^2\, E_\pi\!\left[P(\text{swap accepted})\right] \equiv \epsilon^2\, \mathrm{ACC}, \qquad (2)$$

where from (1),

$$\mathrm{ACC} = E_\pi\!\left[P(\text{swap accepted})\right] = E_\pi\!\left[\min\left(1,\ \frac{f_d^{(\beta_j)}(x_k)\, f_d^{(\beta_k)}(x_j)}{f_d^{(\beta_j)}(x_j)\, f_d^{(\beta_k)}(x_k)}\right)\right], \qquad (3)$$

and where the expectations are with respect to the stationary (asymptotic) distribution $x \sim \prod_{j=0}^{n} f_d^{(\beta_j)}$. We shall also consider the generalisation of (2) to

$$\mathrm{ESJD}(h) = E_\pi\!\left[(h(\gamma) - h(\beta))^2\right] \qquad (4)$$

for some $h : [0,1] \to \mathbb{R}$; then (2) corresponds to the case where $h$ is the identity function. Of course, (2) and (4) represent just some possible measures of optimality, and other measures might not necessarily lead to equivalent optimisations; for some discussion related to this issue see Section 5.

To make progress on computing the optimal $\epsilon$, we restrict (following [20, 22]) to the special case where

$$f_d(x) = \prod_{i=1}^{d} f(x_i), \qquad (5)$$

i.e. the target density takes on a special product form. (Although (5) is a very restrictive assumption, it is known [20, 22] that conclusions drawn from this special case are often approximately applicable in much broader contexts.) We also assume that the tempered distributions are simply powers of the original density (which is the usual case), i.e. that

$$f_d^{(\beta)}(x) = \prod_{i=1}^{d} f^{(\beta)}(x_i) = \prod_{i=1}^{d} \left(f(x_i)\right)^{\beta}. \qquad (6)$$

Intuitively, as the dimension $d$ gets large, so that small changes in $\beta$ lead to larger and larger changes in $f_d^{(\beta)}$, the inverse-temperature spread $\epsilon$ must decrease to preserve a non-vanishing probability of accepting a proposed swap. Hence, similar to [20, 22], we shall consider the limit as $d \to \infty$ and correspondingly $\epsilon \to 0$.

To get a non-trivial limit, we take

$$\epsilon = d^{-1/2}\,\ell \qquad (7)$$

for some positive constant $\ell$ to be determined. (Choosing a smaller scaling would correspond to taking $\ell \to 0$, while choosing a larger scaling would correspond to letting $\ell \to \infty$; either choice is sub-optimal, since the optimal $\ell$ will be strictly between 0 and $\infty$, as we shall see.)

2.1. Main Result. Under these assumptions, we shall prove the following (where $\Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-s^2/2}\, ds$ is the cdf of a standard normal):

Theorem 1. Consider the MCMCMC algorithm as described above, assuming (5), (6), and (7). Then as $d \to \infty$, the ESJD of (2) is maximised when $\ell$ is chosen to maximise $\ell^2\, 2\,\Phi\!\left(-\ell\sqrt{I(\beta)/2}\right)$, where $I(\beta) > 0$ is defined in (10) below. Furthermore, for this optimal choice of $\ell$, the corresponding probability of accepting a proposed swap is given (to three decimal points) by 0.234. In fact, the relationship between ESJD of (2) and ACC of (3) is given by

$$\mathrm{ESJD} = (2/I(\beta))\, \mathrm{ACC}\, \left(\Phi^{-1}(\mathrm{ACC}/2)\right)^2 \qquad (8)$$

(see Figure 2). Finally, the optimal choice of $\ell$ also maximises ESJD(h) of (4), for any differentiable function $h : [0,1] \to \mathbb{R}$.

Figure 2: A graph of the relationship between the expected squared jumping distance (ESJD, vertical axis) and asymptotic acceptance rate (ACC, horizontal axis), as described in equation (8), in units where $I(\beta) = 1$.

Proof. For this MCMCMC algorithm, the acceptance probability of a temperature swap is given by $\alpha = 1 \wedge e^{B}$, where

$$e^{B} = \frac{f_d^{(\beta)}(y)\, f_d^{(\beta+\epsilon)}(x)}{f_d^{(\beta)}(x)\, f_d^{(\beta+\epsilon)}(y)},$$

and where $x \sim f_d^{(\beta)}$ and $y \sim f_d^{(\beta+\epsilon)}$ are independent and in stationarity. We write $g = \log f$, so that $g_d^{(\beta)}(x) = \log f_d^{(\beta)}(x)$, etc. We then have that

$$B = \left[\left(g_d^{(\beta)}(y) - g_d^{(\beta+\epsilon)}(y)\right) - \left(g_d^{(\beta)}(x) - g_d^{(\beta+\epsilon)}(x)\right)\right] \equiv T(y) - T(x),$$

where

$$T(x) = g_d^{(\beta)}(x) - g_d^{(\beta+\epsilon)}(x) = \beta\, g_d(x) - (\beta+\epsilon)\, g_d(x) = -\epsilon\, g_d(x) = -\epsilon \sum_{i=1}^{d} g(x_i).$$

Now, write $E_\beta$ for expectation with respect to the distribution having density proportional to $f^\beta$, and similarly for $\mathrm{Var}_\beta$. Then in this distribution, $g$ has mean

$$E_\beta(g) = \frac{\int \log f(x)\, f^\beta(x)\, dx}{\int f^\beta(x)\, dx} \equiv M(\beta), \qquad (9)$$

and variance

$$\mathrm{Var}_\beta(g) = \frac{\int (\log f(x))^2\, f^\beta(x)\, dx}{\int f^\beta(x)\, dx} - M(\beta)^2 \equiv I(\beta). \qquad (10)$$

Hence, in the distribution proportional to $f^\beta$, $T(x)$ has mean $E_\beta\!\left(T(x)\right) = -\epsilon\, d\, M(\beta) \equiv \mu(\beta)$, and variance $\mathrm{Var}_\beta\!\left(T(x)\right) = \epsilon^2\, d\, I(\beta) \equiv \sigma(\beta)^2$. Now, taking derivatives with respect to $\beta$ (using the quotient rule),

$$M'(\beta) = \frac{\int (\log f(x))^2\, f^\beta(x)\, dx}{\int f^\beta(x)\, dx} - \left(\frac{\int \log f(x)\, f^\beta(x)\, dx}{\int f^\beta(x)\, dx}\right)^2 = I(\beta), \qquad (11)$$

from which it follows that $\mu'(\beta) = -\epsilon\, d\, M'(\beta) = -\epsilon\, d\, I(\beta) = -\sigma(\beta)^2/\epsilon$.

Now, recall that $\epsilon = d^{-1/2}\,\ell$. Hence, $T(y) - T(x) = -\ell\, d^{-1/2} \sum_{i=1}^{d} \left(g(y_i) - g(x_i)\right)$. We claim that, as $d \to \infty$, $T(y) - T(x)$ converges weakly to a random variable $A \sim N\!\left(-\sigma(\beta)^2,\ 2\,\sigma(\beta)^2\right)$. To see this, let $\varphi_d(t)$ be the characteristic function of $T(y) - T(x)$. Then we have:

$$\varphi_d(t) = E\!\left[e^{it(T(y) - T(x))}\right] = e^{-it\epsilon d\left(M(\beta+\epsilon) - M(\beta)\right)} \left\{E\!\left(e^{-it\epsilon\, \bar g(y_1)}\right)\, E\!\left(e^{it\epsilon\, \bar g(x_1)}\right)\right\}^{d},$$

where $\bar g(x) = g(x) - E_\beta(g(x))$. Since $M(\beta+\epsilon) - M(\beta) = \epsilon\, M'(\beta) + o(\epsilon)$ and $\epsilon = d^{-1/2}\ell$, it follows that $\lim_{d\to\infty} -it\epsilon d\left(M(\beta+\epsilon) - M(\beta)\right) = -it\,\ell^2 I(\beta)$. Hence, using a Taylor series expansion,

$$E\!\left(e^{-it\epsilon\, \bar g(y_1)}\right) E\!\left(e^{it\epsilon\, \bar g(x_1)}\right) = \left(1 - \frac{\ell^2 t^2}{2d}\, I(\beta+\epsilon) + o(1/d)\right)\left(1 - \frac{\ell^2 t^2}{2d}\, I(\beta) + o(1/d)\right) = 1 - \frac{\ell^2 t^2}{2d}\, 2 I(\beta) + o(1/d).$$

Hence, $\varphi_d(t)$ converges to $e^{-it\ell^2 I(\beta)}\, e^{-\ell^2 t^2 I(\beta)}$ as $d \to \infty$. This limiting function is the characteristic function of the distribution $N\!\left(-\ell^2 I(\beta),\ 2\ell^2 I(\beta)\right)$, thus proving the claim.

To continue, recall (e.g. [20], Proposition 2.4) that if $A \sim N(m, s^2)$, then

$$E(1 \wedge e^{A}) = \Phi\!\left(\frac{m}{s}\right) + e^{m + s^2/2}\, \Phi\!\left(-s - \frac{m}{s}\right).$$

In particular, if $A \sim N(-c, 2c)$ (so that $E(e^{A}) = 1$), then

$$E(1 \wedge e^{A}) = 2\,\Phi\!\left(-\sqrt{c/2}\right).$$

Since the function $1 \wedge e^{B}$ is a bounded function, it follows that as $d \to \infty$, the acceptance rate of MCMCMC converges to

$$E(\alpha) = E(1 \wedge e^{B}) \to E\!\left(1 \wedge \exp\!\left(N(-\sigma(\beta)^2,\ 2\sigma(\beta)^2)\right)\right) = 2\,\Phi\!\left(-\sigma(\beta)/\sqrt{2}\right).$$

Now, with $\epsilon = d^{-1/2}\ell$, we have $\epsilon^2 = \ell^2/d$ and $\sigma(\beta)^2/2 = \epsilon^2 d\, I(\beta)/2 = \ell^2 I(\beta)/2$, whence $\sigma(\beta)/\sqrt{2} = \ell\,[I(\beta)/2]^{1/2}$. Then as $d \to \infty$, $P(\text{accept swap}) = E(1 \wedge e^{B}) \to 2\,\Phi\!\left(-\ell\,[I(\beta)/2]^{1/2}\right)$, and so

$$\mathrm{ESJD} = \epsilon^2\, P(\text{accept swap}) \approx (\ell^2/d)\; 2\,\Phi\!\left(-\ell\,[I(\beta)/2]^{1/2}\right) \equiv e_{MC}(\ell). \qquad (12)$$

Hence, maximising ESJD is equivalent to choosing $\ell = \ell_{\mathrm{opt}}$ to maximise $\ell^2\, 2\,\Phi\!\left(-\ell\,[I(\beta)/2]^{1/2}\right)$, with the second factor being the acceptance probability. It then follows as in [20, 22] that when $\ell = \ell_{\mathrm{opt}}$, the acceptance probability becomes 0.234 (to three decimal places). Indeed, making the substitution $u = \ell\,[I(\beta)/2]^{1/2}$ shows that finding $\ell_{\mathrm{opt}}$ is equivalent to finding the value $\hat u$ of $u$ which maximizes $\frac{4}{I(\beta)}\, u^2\, \Phi(-u)$, and then evaluating $\hat a = 2\,\Phi(-\hat u)$. It follows that the value of $\hat u$, and hence also the value of $\hat a$, does not depend on the value of $I(\beta)$ (provided $I(\beta) > 0$). So, it suffices to assume $I(\beta) = 1$, in which case we compute numerically that $\hat u \approx 1.190609$, so that $\hat a = 2\,\Phi(-1.190609) \approx 0.2338071 \approx 0.234$.

Finally, if instead of ESJD we consider $\mathrm{ESJD}(h) \equiv E\!\left[\left(h(\gamma) - h(\beta)\right)^2\right]$ for some differentiable function $h$, then as $\epsilon \to 0$ we have $\left(h(\gamma) - h(\beta)\right)^2 \approx [h'(\beta)]^2\, (\gamma - \beta)^2$, so $\mathrm{ESJD}(h) \approx [h'(\beta)]^2\, \mathrm{ESJD}$. Hence, as a function of $\epsilon$, maximising $\mathrm{ESJD}(h)$ is equivalent to maximising ESJD, and is thus maximised at the same value $\epsilon_{\mathrm{opt}}$. □
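As a numerical check on the constants at the end of the proof, the following short snippet (illustrative only) maximises $u^2\,\Phi(-u)$ and recovers $\hat u \approx 1.1906$ and the corresponding acceptance rate $\hat a = 2\,\Phi(-\hat u) \approx 0.234$:

```python
from scipy.stats import norm
from scipy.optimize import minimize_scalar

# Maximise u^2 * Phi(-u), i.e. minimise its negative, over u > 0.
res = minimize_scalar(lambda u: -(u ** 2) * norm.cdf(-u),
                      bounds=(0.01, 10.0), method="bounded")

u_hat = res.x                     # optimal rescaled spacing, approximately 1.1906
a_hat = 2 * norm.cdf(-u_hat)      # corresponding swap acceptance rate, approximately 0.2338
print(u_hat, a_hat)
```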

2.2. Iteratively Selecting the Inverse Temperatures. The above results show that, in high dimensions under certain assumptions, it is most efficient to choose the inverse temperatures for MCMCMC such that the average acceptance probability between any two adjacent temperatures is about 23%. This leads to the question of how to select such temperatures. Although this may not be a practical approach in general, we shall adopt an intensive iterative approach to this in order to assess the effects of our theory in applications.

We assume we know (perhaps by running some preliminary simulations) some sufficiently small inverse temperature value $\beta_{\min}$ such that the mixing of the chain with corresponding density $f^{\beta_{\min}}$ is sufficiently fast. This value $\beta_{\min}$ shall be our minimal inverse temperature. (In the examples below, we use $\beta_{\min} = 0.01$.)

In terms of $\beta_{\min}$, we construct the remaining $\beta_j$ values iteratively, as follows. First, starting with $\beta_0 = 1$, we find $\beta_1$ such that the acceptance probability of a swap between $\beta_0 = 1$ and $\beta_1$ is approximately 0.23. Then, given $\beta_1$, we find $\beta_2$ such that the acceptance probability of the swap between $\beta_1$ and $\beta_2$ is again approximately 0.23. We continue in this way to construct $\beta_3, \beta_4, \dots$, continuing until we have $\beta_j \le \beta_{\min}$ for some $j$. At that point, we replace $\beta_j$ by $\beta_{\min}$ and stop.

To implement this iterative algorithm, we need to find for each inverse-temperature $\beta$ a corresponding inverse-temperature $\beta' < \beta$ such that the average acceptance probability of the swaps between $\beta$ and $\beta'$ is approximately 0.23. We use a simulation-based approach for this, via random variables $\{(X_n, X'_n, \rho_n)\}_{n \ge 0}$. After initialisations for $n = 0$, at each time $n \ge 1$ we let $\beta'_n = \beta\,(1 + e^{\rho_n})^{-1}$, and draw $X_{n+1} \sim f^{\beta}$ and $X'_{n+1} \sim f^{\beta'_n}$ (or update them from some Markov chain dynamics preserving those distributions). The probability of accepting a swap between $\beta$ and $\beta'_n$ is then given by $\alpha_{n+1} = 1 \wedge e^{B_{n+1}}$, where $B_{n+1} = (\beta - \beta'_n)\left(g_d(X'_{n+1}) - g_d(X_{n+1})\right)$. We then attempt to converge towards 0.23 by replacing $\rho_n$ by $\rho_{n+1} = \rho_n + n^{-1}(\alpha_{n+1} - 0.23)$, i.e. using a stochastic approximation algorithm [19, 2]. This ensures that if $\alpha_{n+1} > 0.23$ then $\beta'$ will decrease, while if $\alpha_{n+1} < 0.23$ then $\beta'$ will increase, as it should. We shall use this iterative simulation approach to determine the inverse temperatures $\{\beta_j\}$ in all of our simulation examples in Section 3 below.

2.3. Comparison with Geometric Temperature Spacing. In some cases, it is believed that the optimal choice of temperatures is geometric, i.e. that we want $\beta_{j+1} = c\,\beta_j$ for an appropriate constant $c$. Now, the notation of Theorem 1 corresponds to setting $\beta_j = \beta + \epsilon$ and $\beta_{j+1} = \beta$. So, it is optimal to have $\beta_{j+1} = c\,\beta_j$ precisely when $\epsilon_{\mathrm{opt}} \propto \beta$ in Theorem 1. Recall now that, from the end of the proof of Theorem 1, $\hat u \equiv \ell_{\mathrm{opt}}\,[I(\beta)/2]^{1/2} \approx 1.190609$, and in particular $\hat u$ does not depend on $\beta$. Then $\ell_{\mathrm{opt}} = \hat u\,[2/I(\beta)]^{1/2}$. So, $\epsilon_{\mathrm{opt}} = \ell_{\mathrm{opt}}\, d^{-1/2} = \hat u\, d^{-1/2}\, [2/I(\beta)]^{1/2} \propto I(\beta)^{-1/2}$. It follows that the condition $\epsilon_{\mathrm{opt}} \propto \beta$ is equivalent to the condition $I(\beta)^{-1/2} \propto \beta$, i.e. $I(\beta) \propto \beta^{-2}$. While this does hold for some examples, e.g. the case when $f(x) = e^{-|x|^r}$ (see Section 2.4 below), it does not hold for most other examples (e.g. for the Gamma distribution, various $\log x$ terms appear which do not lead to such a simple relationship). Thus, the optimal spacing of inverse temperatures is not necessarily geometric. In our simulations below, we consider both geometric spacings, and spacings found using our 23% rule. We shall see that the 23% rule leads to superior performance.

2.4. A Simple Analytical Example. Consider the specific example where the target density is given by (5) with $f(x) = e^{-|x|^r}$ for some fixed $r > 0$. (This includes the Gaussian ($r = 2$) and Double-Exponential ($r = 1$) cases.) For this example, $f^{\beta}(x) \propto e^{-\beta|x|^r}$, and $g(x) = -|x|^r$. We then compute from (9) that

$$M(\beta) = E_\beta(g) = \frac{\int g(x)\, f^{\beta}(x)\, dx}{\int f^{\beta}(x)\, dx} = \frac{-\int |x|^r\, e^{-\beta|x|^r}\, dx}{\int e^{-\beta|x|^r}\, dx} = \frac{-\frac{2}{r^2}\,\Gamma(1/r)\,\beta^{-(1+r)/r}}{2\,\Gamma\!\left(1 + \frac{1}{r}\right)\beta^{-1/r}} = -\frac{1}{r\beta},$$

where $\Gamma(z) = \int_0^{\infty} t^{z-1} e^{-t}\, dt$ is the gamma function. Then by (11), $I(\beta) = \mathrm{Var}_\beta(g) = M'(\beta) = \frac{1}{r\beta^2} \propto \beta^{-2}$. Hence, since $I(\beta) \propto \beta^{-2}$, it follows as above that the spread of inverse-temperatures should be geometric, with $\beta_{j+1} = c\,\beta_j$ for some constant $c$.

2.5. Relation to Previous Literature. The issue of temperature spacings and acceptance rates for MCMCMC has been widely studied in the physics literature [11, 12, 18, 6, 13, 5, 26, 27]. For example, the conclusion of Theorem 1, that the inverse temperatures should be chosen to make the swap acceptance probabilities all equal to the same value (here 0.234), is related to the uniform exchange rates criterion described in [10], p. 629. In addition, simulation studies [6] have suggested tuning the temperatures so the swap acceptance rate is about 20%, and analytic studies [13] have suggested an optimal rate of 0.23 in some specific circumstances, as we now discuss.

In physics, the canonical ensemble corresponds to writing $f_d^{(\beta)}(x) = e^{-\beta V_d(x)}$, where $V_d$ is the energy function and $1/\beta$ is the temperature. Thus, in our notation, $V_d(x) = -g_d(x)$. Furthermore, the average energy at inverse-temperature $\beta$ is given by

$$U(\beta) = \frac{\int V_d(x)\, e^{-\beta V_d(x)}\, dx}{\int e^{-\beta V_d(x)}\, dx}.$$

Physicists have approached the problem using the concepts of density of states and heat capacity. If the state space $\mathcal{X}$ is discrete, the density of states is defined as the function $G(u) := \#\{x \in \mathcal{X} : V_d(x) = u\}$. If $\mathcal{X}$ is continuous, then $G(u)$ is defined to be the $(d-1)$-dimensional Lebesgue measure of the level set $\{x \in \mathcal{X} : V_d(x) = u\}$. The heat capacity of the system at inverse-temperature $\beta$ is defined as

$$C(\beta) = -\beta^2\, \frac{\partial U(\beta)}{\partial \beta} = \beta^2\, \mathrm{Var}_\beta\!\left(V_d(X)\right).$$

In our notation, $\mathrm{Var}_\beta(V_d(X)) = \mathrm{Var}_\beta(-g_d(X)) = d\, I(\beta)$, so $C(\beta) = d\,\beta^2 I(\beta)$. Hence, $C(\beta)$ is constant if and only if $I(\beta) \propto \beta^{-2}$, as discussed in Section 2.3.

Previous literature has considered, as have we, the idealised case in which exact sampling is done from each distribution $f^{\beta}$. In this case, a swap in MCMCMC between $\beta$ and $\beta'$ ($\beta < \beta'$) has acceptance probability

$$A(\beta, \beta') = E\!\left[\min\left(1,\ e^{(\beta - \beta')\left(V_d(X) - V_d(X')\right)}\right)\right],$$

with the expectation taken with respect to $X \sim f_d^{\beta}$ and $X' \sim f_d^{\beta'}$. Assuming that the heat capacity is constant (i.e., $C(\beta) \equiv C$), and that the density of states has the form

$$G(u) = \left(1 + \frac{\beta_0}{C}(u - u_0)\right)^{C} G(u_0),$$

[11] shows that the average acceptance probability of a swap between $\beta$ and $\beta' > \beta$ is given by:

$$A(\beta, \beta') = \frac{2\,\Gamma(2C+2)}{\Gamma(C+1)^2} \int_0^{\beta/\beta'} \frac{u^{C}}{(1+u)^{2(C+1)}}\, du = \frac{2}{B(C+1,\, C+1)} \int_0^{1/(1+R)} \theta^{C} (1-\theta)^{C}\, d\theta, \qquad (13)$$

where $R = \beta'/\beta > 1$, $\Gamma(x)$ is the Gamma function and $B(a,b)$ the Beta function. Assuming a constant heat capacity and a more specific form for the density of states,

$$G(u) = \frac{(2\pi)^{C+1}}{\Gamma(C+1)}\, u^{C},$$

[18] derives the same formula as in (13) for $A(\beta, \beta')$, which they call the "incomplete beta function law" for parallel tempering. Letting $C$ (and thus the dimension of the system) go to infinity, [12, 18] use (13) to derive a limiting expression for the acceptance probability of MCMCMC similar to that obtained in (12) above. [13] uses these limits to argue that 0.23 is approximately the optimal asymptotic swap acceptance rate, using arguments somewhat similar to ours (indeed, Figure 1 of [13] contains the same acceptance-versus-efficiency curve implied by (12)).

Now, it seems that the heat capacity $C(\beta)$ being constant is quite a restrictive assumption; it is satisfied for the example $f(x) = e^{-|x|^r}$ considered in Section 2.4, but is usually not satisfied for more complicated functions. By contrast, our assumption (6) does not impose any special conditions on the nature of the density function $f(x)$.

3. Simulation examples

3.1. A Simple Gaussian Example. We illustrate Theorem 1 with the following simulation example. We implement the MCMCMC described above for Gaussian target and tempered distributions. Specifically, we consider just two inverse-temperatures, $\beta_0 \equiv 1$ and $\beta_1$ with $0 < \beta_1 < 1$. We define $f_d(x)$ and $f_d^{(\beta)}(x)$ as in (5) and (6), with $f$ the density of a standard normal distribution $N(0,1)$, so that $f^{(\beta)} = N(0, \beta^{-1})$. We use a version of the algorithm where the within-temperature moves are given by a Random Walk Metropolis (RWM) algorithm with Gaussian proposal distribution $N\!\left(x,\ (2.38^2/(\beta d))\, I_d\right)$, which is asymptotically optimal [22]. Our algorithm attempts 20 within-temperature moves for every one time it attempts a temperature swap.

We ran this algorithm for each of 50 different possible choices of $\beta_1$, each with $0 < \beta_1 < 1$. For each such choice of $\beta_1$, we ran the algorithm for a total of 500,000 iterations to ensure good averaging, and then estimated the acceptance probability as the fraction of proposed swaps accepted, and the expected squared jump distance (ESJD) as the average of the squared temperature jump distances $(\gamma - \beta_0)^2$ (or, equivalently, as $(\beta_0 - \beta_1)^2$ times the estimated acceptance probability). Figure 3 plots the estimated acceptance probabilities versus squared jump distances (SJD), for each of four different dimensions (10, 20, 50, and 100). We can see from these results that the swap acceptance probability 0.234 is indeed close to optimal, and it gets closer to optimal as the dimension increases, and furthermore the relationship between the two quantities is given approximately by (8) and Figure 2, just as Theorem 1 predicts.

Figure 3: Squared jump distance (SJD) versus estimated acceptance probability for the Gaussian case $f = N(0,1)$ and $f^{\beta} = N(0, \beta^{-1})$, in dimensions (a) $d = 10$, (b) $d = 20$, (c) $d = 50$, and (d) $d = 100$.

3.2. Inhomogeneous Gaussian Example. We now modify the previous example to a non-i.i.d. case where $f_d(x) = \prod_{i=1}^{d} f_i(x_i)$ with $f_i = N(0, i^2)$, so that $f_i^{\beta} = N(0, i^2\beta^{-1})$. (The inhomogeneity implies, in particular, that assumption (5) is no longer satisfied.) The rest of the details remain exactly the same as for the previous example, except that the proposal distribution for the RWM within-temperature moves is now taken to be $N\!\left(x,\ 2.38^2\, i^2/(\beta d)\right)$ for the $i$th coordinate, which is optimal in this case. The resulting plots (for dimensions $d = 10$ and $d = 100$) are given in Figure 4. Here again, an optimum value of $\beta$ for the ESJD function emerges at a swap acceptance probability of approximately 23%. Again, the agreement with Theorem 1 and the relationship (8) and Figure 2 is striking. This is unsurprising, given that this target distribution can be written as a collection of scalar linear transformations of Example 3.1, and Theorem 1 remains invariant through such transformations.

Figure 4: SJD versus acceptance probability for the inhomogeneous Gaussian case $f_i = N(0, i^2)$ and $f_i^{\beta} = N(0, i^2\beta^{-1})$, in dimensions (a) $d = 10$ and (b) $d = 100$.

3.3. A Mixture Distribution Example. In this example and the next, we compare the two temperature scheduling strategies discussed in Sections 2.2 and 2.3. One strategy consists of selecting the inverse-temperatures as a geometric sequence; we saw that this strategy is optimal when $I(\beta) \propto \beta^{-2}$, or equivalently the heat capacity is constant. The other strategy consists of choosing the temperatures such that the acceptance probability between any two adjacent temperatures is about 23%. We report here a simulation example comparing the two strategies.

Consider the density $f_d(x_1, \dots, x_d) = \prod_{i=1}^{d} f(x_i; \omega, \mu, \sigma^2)$ corresponding to a mixture of normal distributions, where $d = 20$, and

$$f(x; \omega, \mu, \sigma^2) = \sum_{j=1}^{3} \omega_j\, \frac{1}{\sqrt{2\pi}\,\sigma_j}\, e^{-\frac{1}{2\sigma_j^2}(x - \mu_j)^2},$$

with $\omega = (1/3, 1/3, 1/3)$, $\mu = (-5, 0, 5)$, and $\sigma = (0.2, 0.2, 0.2)$. For the 23% rule, we build the temperatures sequentially as described in Section 2.2, with $\beta_{\min} = 0.01$. The number of chains is thus itself unknown initially, and turns out to be 9, as shown in the top line of Table 1. The geometric schedule uses that same number (9) of chains, geometrically spaced between $\beta_0 \equiv 1$ and $\beta_{\min} = 0.01$. (Thus, the geometric schedule is allowed to borrow the total number of chains from the 23% rule, which is in some sense overly generous to the geometric strategy.) Each chain was run for 200,000 iterations. We report the squared jump distances both in the temperature space and in the $\mathcal{X}$-space. We observe that the 23% rule performs significantly better than the geometric scheduling, in terms of having larger average squared jumping distances in both the $\beta$ space and the $\mathcal{X}$ space.

0.23 rule:          1.00   0.675   0.395   0.206   0.106   0.048   0.022   0.0105   0.01
Geometric spacing:  1.00   0.562   0.316   0.178   0.100   0.056   0.032   0.018    0.01

Table 1: Inverse-temperature schedules for the two different temperature scheduling strategies for the mixture distribution example.

                    $E[|\beta_n - \beta_{n-1}|^2]$    $E[\|X_n - X_{n-1}\|^2]$
0.23 rule:          0.209                             0.428
Geometric spacing:  0.084                             0.315

Table 2: Squared jumping distances in $\beta$ space and $\mathcal{X}$ space for the two different temperature scheduling strategies for the mixture distribution example.
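For concreteness, here is a rough sketch of the stochastic-approximation construction of Section 2.2 that produces a ladder like the 0.23-rule row of Table 1. It assumes, as in the idealised analysis, that exact draws from each tempered density are available through a hypothetical `sample_from` function, with `log_f` the log of the un-tempered target; both names are illustrative.

```python
import numpy as np

def next_beta(beta, log_f, sample_from, n_steps=5000, target=0.23, rng=None):
    """Find beta' < beta whose swap acceptance rate with beta is roughly `target`,
    via the stochastic approximation rho_{n+1} = rho_n + n^{-1}(alpha_{n+1} - target).

    log_f(x)          : log of the un-tempered target density f_d
    sample_from(beta) : (idealised) exact draw from the tempered density f_d**beta
    """
    rng = rng or np.random.default_rng()
    rho = 0.0
    for n in range(1, n_steps + 1):
        beta_prime = beta / (1.0 + np.exp(rho))
        x, x_prime = sample_from(beta), sample_from(beta_prime)
        # Swap acceptance probability between beta and beta_prime (cf. Section 2.2).
        alpha = min(1.0, np.exp((beta - beta_prime) * (log_f(x_prime) - log_f(x))))
        rho += (alpha - target) / n
    return beta / (1.0 + np.exp(rho))

def build_ladder(log_f, sample_from, beta_min=0.01):
    """Construct inverse temperatures 1 = beta_0 > beta_1 > ... down to beta_min."""
    betas = [1.0]
    while betas[-1] > beta_min:
        betas.append(next_beta(betas[-1], log_f, sample_from))
    betas[-1] = beta_min
    return betas
```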

3.4. Ising Distribution. We now compare the two scheduling strategies using the Ising distribution on the $N \times N$ two-dimensional lattice, given by $\pi(x) = \exp\!\left(E(x)\right)/Z$, where $Z$ is the normalizing constant and

$$E(x) = J\left(\sum_{i=1}^{N}\sum_{j=1}^{N-1} x_{ij}\, x_{i,j+1} + \sum_{i=1}^{N-1}\sum_{j=1}^{N} x_{ij}\, x_{i+1,j}\right), \qquad (14)$$

with $x_{ij} \in \{-1, 1\}$. Obviously, this distribution does not satisfy the assumption (5). For definiteness, we choose $N = 50$ and $J = 0.45$. The Ising distribution admits a phase transition at the critical temperature $T_c = 2J/\log(1 + \sqrt{2}) \approx 1.021$, i.e. critical inverse-temperature $\beta_c \approx 0.979$, around which the heat capacity undergoes stiff variation. Tempering techniques like MCMCMC and Simulated Tempering can perform poorly near critical temperatures, because small changes in the temperature around $T_c$ result in drastic variations of the properties of the distribution. We expect the 23% rule to outperform the geometric schedule in this case.

For our algorithm, for the within-temperature (i.e., $\mathcal{X}$-space) moves, we use a Metropolis-Hastings algorithm with a proposal which consists of randomly selecting a position $(i,j)$ on the lattice and flipping its spin from $x_{ij}$ to $-x_{ij}$. We first determine iteratively the inverse-temperature points using the algorithm described in Section 2.2. Then, given the lowest and highest temperatures and the number of temperature points determined by this algorithm, we compute the geometric spacing. Figure 5 gives the selected temperatures (not inverse-temperatures, i.e. it shows $T_j \equiv 1/\beta_j$) from the two methods (the circles represent the 23% rule and the triangles represent the geometric spacing).

Figure 5: The temperatures $\{T_j\}$ selected for the Ising model (triangle: geometric spacing; circle: 23% rule).

We note here that the 23% rule and the geometric spacing produce very different temperature points. Interestingly, in order to maintain the right acceptance rate, the 23% rule puts more temperature points near the critical temperature (1.021); this is further illustrated in Table 3.

23% rule:         1.00   1.02   1.06   1.10   1.14   1.18   1.24   1.30   1.37   1.44   1.53
Geometric rule:   1.00   1.20   1.46   1.77   2.14   2.59   3.13   3.79   4.59   5.55   6.70

Table 3: Selected temperature values near $T_c = 1.021$ for the Ising model.

Each version of MCMCMC was run for 5 million iterations. The average squared jump distance in inverse-temperature space was 0.0095 for the 23% rule and 0.0049 for the geometric spacing, i.e. the 23% rule was almost twice as efficient. We also present the trace plots (subsampled every 1,000 iterations) of the function $\{E(X_n),\ n \ge 0\}$ during the simulation for each of the two algorithms in Figure 6, together with their autocorrelation functions. While the trace plots themselves appear similar, the autocorrelations indicate that the 23% rule has resulted in a faster mixing (less correlated) sampler in the $\mathcal{X}$-space as well. We confirm this by calculating the empirical average squared jump distance in the $\mathcal{X}$ space, $\mathrm{SJD}(\mathcal{X}) = n^{-1}\sum_{j=1}^{n} |E(X_j) - E(X_{j-1})|^2$, which works out to 22.11 for the 23% rule, and 13.76 for the geometric sequence, i.e. the 23% rule is 1.6 times as efficient by this measure.

Figure 6: (a),(c) trace plots and (b),(d) autocorrelation functions of $E(X_n)$ for the Ising model, subsampled every 1,000 iterations, with (a),(b) geometric temperature spacing, and (c),(d) the 23% rule.
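For reference, here is a minimal sketch (illustrative, not the authors' code) of the single-spin-flip within-temperature move for the tempered Ising density $\pi(x)^{\beta} \propto \exp(\beta E(x))$; it uses the standard observation, not spelled out in the text, that flipping one spin changes $E(x)$ only through the neighbours of the flipped site.

```python
import numpy as np

def ising_flip_step(x, beta, J=0.45, rng=None):
    """One single-spin-flip Metropolis update targeting pi(x)**beta, where
    pi(x) is proportional to exp(E(x)) with E(x) the Ising energy of (14).

    x : (N, N) array of spins in {-1, +1} (free boundary conditions).
    """
    rng = rng or np.random.default_rng()
    N = x.shape[0]
    i, j = rng.integers(N), rng.integers(N)
    # Sum of the nearest neighbours of site (i, j).
    nbr = 0
    if i > 0:
        nbr += x[i - 1, j]
    if i < N - 1:
        nbr += x[i + 1, j]
    if j > 0:
        nbr += x[i, j - 1]
    if j < N - 1:
        nbr += x[i, j + 1]
    # Flipping x[i, j] changes E(x) by -2 * J * x[i, j] * nbr.
    delta_E = -2.0 * J * x[i, j] * nbr
    if np.log(rng.uniform()) < beta * delta_E:
        x[i, j] = -x[i, j]
    return x
```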

4. Simulated Tempering

We now consider the Simulated Tempering algorithm [16, 7]. This is somewhat similar to MCMCMC, except now there is just one particle which can jump either within-temperature or between-temperatures. Again we focus on the between-temperatures move, and keep similar notation to before. The state space is now given by $\{\beta_0, \beta_1, \dots, \beta_n\} \times \mathcal{X}$ and the target density is proportional to

$$f(\beta, x) = \prod_{i=1}^{d} e^{K(\beta)}\, f^{\beta}(x_i) = e^{d\,K(\beta)}\, f_d^{(\beta)}(x)$$

for some choice of normalizing constants $K(\beta)$. A proposed temperature move from $(\beta, x)$ to $(\beta + \epsilon, x)$ is accepted with probability

$$\alpha = 1 \wedge \frac{e^{d\,K(\beta+\epsilon)}\, f_d^{\beta+\epsilon}(x)}{e^{d\,K(\beta)}\, f_d^{\beta}(x)}.$$

We again make the simplifying assumptions (5) and (6), and again let $\epsilon \to 0$ according to (7). We further assume that we have chosen the true normalising constants,

$$K(\beta) = -\log\left(\int f^{\beta}(x)\, dx\right), \qquad (15)$$

so that $f(\beta, \cdot)$ is indeed a normalised density function. This choice makes the stationary distribution for $\beta$ be uniform on $\{\beta_0, \beta_1, \dots, \beta_n\}$; this appears to be optimal, since it allows the $\beta$ values to most easily and quickly migrate from the cold temperature $\beta_0$ to the hot temperature $\beta_n$ and then back to $\beta_0$ again, thus maximising the influence of the hot temperature on the mixing of the cold chain.
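The between-temperature move just described can be sketched as follows (illustrative code; it assumes the normalising constants of (15), expressed for the full $d$-dimensional tempered densities, are available or pre-estimated in an array `K`, and `log_f` is a hypothetical user-supplied log-density):

```python
import numpy as np

def st_temperature_move(j, x, betas, K, log_f, rng):
    """One Simulated Tempering temperature move: from the current inverse
    temperature betas[j], propose moving to a random adjacent rung.

    j     : index of the current inverse temperature
    x     : current chain value (a single particle, shared across rungs)
    betas : array of inverse temperatures
    K     : array with K[j] = -log(integral of f_d(x)**betas[j] dx), i.e. the
            ideal normalising constant of (15) for the full d-dimensional
            tempered density (assumed known or estimated here)
    log_f : function returning log f_d(x)
    """
    j_prop = j + rng.choice([-1, 1])
    if j_prop < 0 or j_prop >= len(betas):
        return j  # proposals off the ladder are rejected
    # log acceptance ratio of the temperature move (cf. Section 4):
    # (K[j'] - K[j]) + (beta' - beta) * log f_d(x)
    log_alpha = (K[j_prop] - K[j]) + (betas[j_prop] - betas[j]) * log_f(x)
    return j_prop if np.log(rng.uniform()) < log_alpha else j
```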

4.1. Main Result. Under the above assumptions, we prove the following analog of Theorem 1:

Theorem 2. Consider the Simulated Tempering algorithm as described above, assuming (5), (6), (7), and (15). Then as $d \to \infty$, the ESJD of (2) is maximised when $\ell$ is chosen to maximise $\ell^2\, 2\,\Phi\!\left(-\ell\, I(\beta)^{1/2}/2\right)$. Furthermore, for this optimal choice of $\ell$, the corresponding probability of accepting a proposed swap is given (to three decimal points) by 0.234. In fact, the relationship between ESJD of (2) and ACC of (3) is given by

$$\mathrm{ESJD} = (4/I(\beta))\, \mathrm{ACC}\, \left(\Phi^{-1}(\mathrm{ACC}/2)\right)^2 \qquad (16)$$

(see Figure 7). Finally, this choice of $\ell$ also maximises the ESJD(h) of (4), for any differentiable function $h : [0,1] \to \mathbb{R}$.

Figure 7: A graph of the relationship between the expected squared jumping distance (ESJD) and asymptotic acceptance rate (ACC), as described in equations (16) (top, green) and (8) (bottom, blue), in units where $I(\beta) = 1$.

Proof. A proposed temperature move from $(\beta, x)$ to $(\beta + \epsilon, x)$ is accepted with probability

$$\alpha = 1 \wedge \frac{e^{d\,K(\beta+\epsilon)}\, f_d^{\beta+\epsilon}(x)}{e^{d\,K(\beta)}\, f_d^{\beta}(x)} = 1 \wedge e^{d\left(K(\beta+\epsilon) - K(\beta)\right)}\, e^{\epsilon \sum_{i=1}^{d} g(x_i)} = 1 \wedge e^{d\left(K(\beta+\epsilon) - K(\beta)\right) + \epsilon d M(\beta)}\, e^{\epsilon \sum_{i=1}^{d} \bar g(x_i)},$$

where $\bar g(x) = g(x) - M(\beta)$. If we choose $K(\beta) = -\log\left(\int f^{\beta}(x)\, dx\right)$, then we compute (again using the quotient rule), by comparison with (9), that $K'(\beta) = -M(\beta)$. Hence, from (11), $K''(\beta) = -M'(\beta) = -I(\beta)$. Therefore, by a Taylor series expansion,

$$K(\beta + \epsilon) - K(\beta) = \epsilon\, K'(\beta) + \tfrac{1}{2}\epsilon^2 K''(\beta_\epsilon) = -\epsilon M(\beta) - \tfrac{1}{2}\epsilon^2 I(\beta_\epsilon)$$

for some $\beta_\epsilon \in [\beta, \beta + \epsilon]$. It follows that $d\left(K(\beta+\epsilon) - K(\beta)\right) + \epsilon\, d\, M(\beta) = \epsilon^2 d\, K''(\beta_\epsilon)/2 = (\ell^2/2)\, K''(\beta_\epsilon) = -(\ell^2/2)\, I(\beta_\epsilon)$. Setting $\epsilon = d^{-1/2}\ell$ for some $\ell > 0$ and letting $d \to \infty$, it follows from the above and the central limit theorem that

$$1 \wedge \frac{e^{d\,K(\beta+\epsilon)}\, f_d^{\beta+\epsilon}(x)}{e^{d\,K(\beta)}\, f_d^{\beta}(x)} = 1 \wedge \left(e^{-(\ell^2/2)\, I(\beta_\epsilon)}\; e^{\ell\, d^{-1/2} \sum_{i=1}^{d} \bar g(x_i)}\right)$$

converges in distribution, as $d \to \infty$, to $1 \wedge e^{A}$, where $A \sim N\!\left(-(\ell^2/2)\, I(\beta),\ \ell^2\, \mathrm{Var}_\beta(g)\right) = N\!\left(-(\ell^2/2)\, I(\beta),\ \ell^2 I(\beta)\right)$. The rest of the argument is standard, just as in Theorem 1, and shows that for large $d$, the squared jump distance of Simulated Tempering is approximately

$$\mathrm{ESJD} = (\ell^2/d)\; 2\,\Phi\!\left(-\ell\, I(\beta)^{1/2}/2\right) \equiv e_{ST}(\ell), \qquad (17)$$

which is again maximised by choosing $\ell$ such that the acceptance probability is (to three decimal places) equal to 0.234. The argument for (16) is identical to that for Theorem 1. □

4.2. Comparison of Simulated Tempering and MCMCMC. Now that we have optimality results for both Simulated Tempering (Theorem 2) and MCMCMC (Theorem 1), it is natural to compare them. We have the following.

Corollary 1. Under assumptions (5), (6), (7), and (15), asymptotically as $d \to \infty$, the maximal value of ESJD for Simulated Tempering is precisely twice that for MCMCMC (cf. Figure 7). (More generally, the maximal ESJD(h) for Simulated Tempering is precisely twice that for MCMCMC, for any differentiable $h$.) Furthermore, the optimal choice of $\ell$ for Simulated Tempering is $\sqrt{2}$ times that for MCMCMC.

Proof. Comparing $e_{MC}(\ell)$ from (12) with $e_{ST}(\ell)$ from (17), we see that $e_{MC}(\ell) = \frac{1}{2}\, e_{ST}(\ell\sqrt{2})$. In particular, $\sup_\ell e_{MC}(\ell) = \frac{1}{2}\sup_\ell e_{ST}(\ell)$, and also $\operatorname{argsup}_\ell e_{MC}(\ell) = (1/\sqrt{2})\operatorname{argsup}_\ell e_{ST}(\ell)$, which gives the result. □

Corollary 1 says that, under appropriate conditions, the ESJD of MCMCMC is precisely half that of Simulated Tempering. That is, if Simulated Tempering has ideally-chosen normalisation constants $K(\beta)$ as in (15), then in some sense it is twice as efficient as MCMCMC. However, choosing $K(\beta)$ ideally may be difficult or impossible (though it may be possible in certain cases to learn these weights adaptively during the simulation, see e.g. [4]), and for non-optimal $K(\beta)$ this comparison no longer applies. By contrast, MCMCMC does not require $K(\beta)$ at all. Furthermore, a single temperature swap of MCMCMC updates two different chain values, and thus may be twice as valuable as a single temperature move under Simulated Tempering. If so, then the two different factors of two precisely cancel each other out. In addition, it may take more computation to do one within-temperature step of MCMCMC (involving $n+1$ particles) than one step of Simulated Tempering (involving just one particle). In summary, this comparison of the two algorithms involves various subtleties and is not entirely clear-cut, though Corollary 1 still provides certain insights into the relationship between them.

4.3. Langevin Dynamics for Simulated Tempering? One limitation of Simulated Tempering is that it performs a Random Walk Metropolis (RWM) algorithm in the $\beta$-space, where each proposal bears a high chance of being rejected if the value of the energy $\sum_{i=1}^{d} g(x_i)$ is not consistent with the proposed value of the temperature. Now, it is known [21] that Langevin dynamics (which use derivatives to improve the proposal distribution) are significantly more efficient than RWM when they can be applied. In the context of the inverse temperatures, a Langevin-style proposal distribution could take the form:

$$N\!\left[\beta + \frac{\sigma^2}{2}\left(\sum_{i=1}^{d} g(x_i) + d\, K'(\beta)\right),\ \sigma^2\right].$$

Furthermore, in this case $K'(\beta) = -E_\beta(g(X))$. Therefore, such a Langevin proposal will compare the current value of $\sum_{i=1}^{d} g(x_i)$ to the average value $d\, E_\beta(g(X))$. If $\sum_{i=1}^{d} g(x_i) \ge d\, E_\beta(g(X))$ then smaller temperatures are more compatible and are more likely to be proposed (and vice versa). The main limitation of this idea is that in practice, we do not know the gradient $K'(\beta)$. This can perhaps be estimated during the simulation as the average of the energy $\sum_{i=1}^{d} g(x_{n,i})$ at times $n$ where the inverse-temperature level $\beta$ is visited, but this needs more investigation and we do not pursue it here.

5. Discussion and Future Work

This paper has presented certain results about optimal inverse-temperature spacings. In particular, we have proved that for MCMCMC (Theorem 1) and Simulated Tempering (Theorem 2), it is optimal (under certain conditions) to space the inverse-temperatures so that the probability of accepting a temperature swap or move is approximately 23%. Our theorems were proved under the restrictive assumption (5), but we have seen in simulations (Section 3) that they continue to approximately apply even if this assumption is violated.

Theorems 1 and 2 were stated in terms of the expected squared jumping distances (ESJD) of the inverse-temperatures, assuming the chain begins in stationarity. While this provides useful information about optimising the algorithm, it provides less information about the algorithm's long-term behaviour. In this section, we consider various related issues and possible future research directions.

5.1. Inverse-Temperature Process. It is possible to define an entire inverse temperature process, $\{y_n\}_{n=0}^{\infty}$, living on the state space $\mathcal{Y} = \{\beta_0, \beta_1, \dots, \beta_n\}$. For Simulated Tempering (ST), this process is simply the inverse-temperature $\beta$ at each iteration. For MCMCMC, this process involves tracking the influence of a particular inverse-temperature's chain through the temperature swaps, i.e. if $y_n = \beta_j$, and then the $\beta_j$ and $\beta_{j+1}$ chains successfully swap, then $y_{n+1} = \beta_{j+1}$ (otherwise $y_{n+1} = \beta_j$).

In general, this $\{y_n\}$ process will not be Markovian. However, if we assume that the ST/MCMCMC algorithm does an automatic i.i.d. reset into the distribution $f^{(\beta)}$ immediately upon entering the inverse-temperature $\beta$ (i.e., where the chain values associated with each inverse-temperature $\beta$ are taken as i.i.d. draws from $f^{(\beta)}$ at each iteration), then $\{y_n\}$ becomes a Markov chain. Indeed, under this assumption, it corresponds to a simple random walk (with holding probabilities) on $\mathcal{Y}$. In that case, it seems likely that as $d \to \infty$, if we appropriately shrink space by a factor of $\sqrt{d}$ and speed up time by a factor of $d$, then as in the RWM case [20, 22], the $\{y_n\}$ will converge to a limiting Langevin diffusion process. In that case, maximising the speed of the limiting diffusion is equivalent to optimising the original algorithm (according to any optimisation measure, see [22]), and this would provide another justification of the values of $\ell_{\mathrm{opt}}$ in Theorems 1 and 2.

In terms of this $\{y_n\}$ Markov chain, ESJD from (2) is simply the expected squared jumping distance of this process. Furthermore, ESJD(h) from (4) is then the Dirichlet form corresponding to the Markov operator for $\{y_n\}$.

Thus, Theorems 1 and 2 would also be saying that the given values of $\ell_{\mathrm{opt}}$ maximise the associated Dirichlet form.

Of course, this reset assumption, that the moves within each inverse temperature $\beta_j$ result in an immediate i.i.d. reset to the density $f^{\beta_j}$, is not realistic, since in true applications the convergence within each fixed-temperature chain would occur only gradually (especially for larger $\beta_j$). However, it is not entirely unrealistic either, for a number of reasons. For example, some algorithms might run many within-temperature moves for each single attempted temperature swap, thus making the within-temperature chains effectively mix much faster. Also, some within-temperature algorithms (e.g. Langevin diffusions, see [21]) have convergence time of smaller order ($O(d^{1/3})$) than that of the temperature-swapping random-walk behaviour ($O(d)$), so effectively the within-temperature chains converge immediately on a relative scale, which is equivalent to the reset assumption. Proving convergence to limiting diffusions can be rather technical (see e.g. [20, 21]), so we do not pursue it here, but rather leave it for future work. In any case, assuming the limiting diffusion does hold (as it probably does), this provides new insight and context for the optimal behaviour of MCMCMC and ST algorithms.

5.2. Joint Diffusion Limits? If we do not assume the reset assumption as above, then the process $\{y_n\}$ is not Markovian, so a diffusion limit seems far less certain. However, it is still possible to jointly consider the chain $\{x_n\}_{n=0}^{\infty}$ living on $\mathcal{X}$ (for ST) or $\mathcal{X}^{n+1}$ (for MCMCMC), together with the inverse temperature process $\{y_n\}$. The joint process $(x_n, y_n)$ must be Markovian, and it is quite possible that it has its own joint diffusion limit and joint optimal scalings. This would become quite technical, and we do not pursue it here, but we consider it an interesting topic for future research.

5.3. Models For How the Algorithms Converge. Related to the above is the question of a deeper understanding of how MCMCMC and ST make use of the tempered distributions to improve convergence. Significant progress was made in this direction in previous research (e.g. [26]), but more remains to be done. One possible simplified model is to assume the process does not move at all in the within-temperature direction, except at the single inverse-temperature $\beta = \bar\beta$, when it does an immediate i.i.d. reset to $f^{(\bar\beta)}$. At first we thought that such a model is insufficient, since it would then be exponentially difficult for the algorithm to climb from $\beta = \bar\beta$ to $\beta = \beta_0$, but the work of [26] shows that this is in fact not the case. So, we may explore this model further in future work.

A compromise model is where the state space $\mathcal{X}$ is partitioned into a finite number of modes, and when entering any $\beta > 0$ the process does a reset into $f^{(\beta)}$ conditional on remaining in the current mode. Such a process assumes fast within-mode mixing at all temperatures, but between-mode mixing only at the hot temperature when $\beta = 0$. We believe that in this case, there is a joint diffusion limit of the joint ($\beta$, mode) process. If so, this would be interesting and provide a good model for how the hot temperature and intermediate temperature mixing all works together.

5.4. State space invariance. Our results do not depend in any way on the update mechanism within the state space $\mathcal{X}$.

Moreover, although our results are proved for the case of IID components, this observation immediately generalises our findings to the very large class of distributions which we can write as functions of IID components. In fact, the only thing which stops this being a completely general $d$-dimensional state space result is the fact that ST on the transformed variables is a different method from the transformed ST algorithm. From a practical point of view, this might even suggest improved tempering schemes (rather than simply considering powered densities).

5.5. Implementation. Finding practical ways to implement MCMCMC and ST in light of our results has not been a focus of this paper, though this is an important issue for future consideration. The approach adopted in Subsection 2.2 is designed to be a thorough methodology to investigate the effects of Theorem 1, but cannot be considered a practical approach in general. Intriguing is the prospect of integrating our theory within an adaptive MCMC framework, as in for example [3, 1, 24, 23].

5.6. Parallelisation. As computational power becomes more and more widespread, there is increasing interest in running Monte Carlo algorithms in parallel on many machines, perhaps even using Graphics Processing Units (GPUs), see e.g. [8, 25, 14]. It would be interesting to consider, even in simple situations, how to optimise parallel computations asymptotically as the number of processors goes to infinity.

Acknowledgements: We thank the editors and anonymous referees for very helpful reports that significantly improved the presentation of our results.

References

[1] C. Andrieu and E. Moulines (2007), On the ergodicity properties of some Markov chain Monte Carlo algorithms. Ann. Appl. Prob. 44, 2, 458–475.
[2] C. Andrieu and C.P. Robert (2001), Controlled MCMC for optimal sampling. Preprint.
[3] Y.F. Atchadé (2006), An adaptive version of the Metropolis adjusted Langevin algorithm with a truncated drift. Meth. and Comp. in Appl. Prob. 8, 2, 235–254.
[4] Y.F. Atchadé and J.S. Liu (2004), The Wang-Landau algorithm in general state spaces: applications and convergence analysis. Statistica Sinica, to appear.
[5] B. Cooke and S.C. Schmidler (2008), Preserving the Boltzmann ensemble in replica-exchange molecular dynamics. J. Chem. Phys. 129, 164112.
[6] D.J. Earl and M.W. Deem (2005), Parallel tempering: theory, applications, and new perspectives. J. Phys. Chem. B 108, 6844.
[7] C.J. Geyer (1991), Markov chain Monte Carlo maximum likelihood. In Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, 156–163.
[8] P.W. Glynn and P. Heidelberger (1991), Analysis of parallel, replicated simulations under a completion time constraint. ACM Trans. on Modeling and Simulation 1, 3–23.
[9] W.K. Hastings (1970), Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 97–109.
[10] Y. Iba (2001), Extended ensemble Monte Carlo. Int. J. Mod. Phys. C 12(5), 623–656.
[11] D.A. Kofke (2002), On the acceptance probability of replica-exchange Monte Carlo trials. J. Chem. Phys. 117, 6911. Erratum: J. Chem. Phys. 120, 10852.
[12] D.A. Kofke (2004), Comment on reference [18]. J. Chem. Phys. 121, 1167.