arxiv:math/ v1 [math.pr] 3 Jul 2006

Similar documents
Optimal scaling for partially updating MCMC algorithms. Neal, Peter and Roberts, Gareth. MIMS EPrint:

Topic 7: Convergence of Random Variables

Robust Forward Algorithms via PAC-Bayes and Laplace Distributions. ω Q. Pr (y(ω x) < 0) = Pr A k

Introduction. A Dirichlet Form approach to MCMC Optimal Scaling. MCMC idea

Convergence of Random Walks

The derivative of a function f(x) is another function, defined in terms of a limiting expression: f(x + δx) f(x)

Lecture Introduction. 2 Examples of Measure Concentration. 3 The Johnson-Lindenstrauss Lemma. CS-621 Theory Gems November 28, 2012

Survey Sampling. 1 Design-based Inference. Kosuke Imai Department of Politics, Princeton University. February 19, 2013

NOTES ON EULER-BOOLE SUMMATION (1) f (l 1) (n) f (l 1) (m) + ( 1)k 1 k! B k (y) f (k) (y) dy,

Separation of Variables

A note on asymptotic formulae for one-dimensional network flow problems Carlos F. Daganzo and Karen R. Smilowitz

Math Notes on differentials, the Chain Rule, gradients, directional derivative, and normal vectors

Linear First-Order Equations

Euler equations for multiple integrals

Differentiation ( , 9.5)

Lower Bounds for the Smoothed Number of Pareto optimal Solutions

Introduction to Markov Processes

Math 1B, lecture 8: Integration by parts

Schrödinger s equation.

Logarithmic spurious regressions

7.1 Support Vector Machine

Monotonicity for excited random walk in high dimensions

Time-of-Arrival Estimation in Non-Line-Of-Sight Environments

Markov Chains in Continuous Time

Expected Value of Partial Perfect Information

Implicit Differentiation

The Exact Form and General Integrating Factors

Table of Common Derivatives By David Abraham

Math 342 Partial Differential Equations «Viktor Grigoryan

Final Exam Study Guide and Practice Problems Solutions

Monte Carlo Methods with Reduced Error

Introduction to the Vlasov-Poisson system

A Sketch of Menshikov s Theorem

Computing Exact Confidence Coefficients of Simultaneous Confidence Intervals for Multinomial Proportions and their Functions

6 General properties of an autonomous system of two first order ODE

Generalization of the persistent random walk to dimensions greater than 1

Quantum Mechanics in Three Dimensions

1 Math 285 Homework Problem List for S2016

θ x = f ( x,t) could be written as

An Optimal Algorithm for Bandit and Zero-Order Convex Optimization with Two-Point Feedback

Some vector algebra and the generalized chain rule Ross Bannister Data Assimilation Research Centre, University of Reading, UK Last updated 10/06/10

YVES F. ATCHADÉ, GARETH O. ROBERTS, AND JEFFREY S. ROSENTHAL

arxiv: v1 [math.pr] 18 Oct 2012

Pure Further Mathematics 1. Revision Notes

A Note on Exact Solutions to Linear Differential Equations by the Matrix Exponential

12.11 Laplace s Equation in Cylindrical and

CHAPTER 1 : DIFFERENTIABLE MANIFOLDS. 1.1 The definition of a differentiable manifold

PDE Notes, Lecture #11

ELEC3114 Control Systems 1

Calculus in the AP Physics C Course The Derivative

Least-Squares Regression on Sparse Spaces

A Review of Multiple Try MCMC algorithms for Signal Processing

Mark J. Machina CARDINAL PROPERTIES OF "LOCAL UTILITY FUNCTIONS"

Lower bounds on Locality Sensitive Hashing

Calculus and optimization

Calculus of Variations

Relative Entropy and Score Function: New Information Estimation Relationships through Arbitrary Additive Perturbation

On a limit theorem for non-stationary branching processes.

APPROXIMATE SOLUTION FOR TRANSIENT HEAT TRANSFER IN STATIC TURBULENT HE II. B. Baudouy. CEA/Saclay, DSM/DAPNIA/STCM Gif-sur-Yvette Cedex, France

Some Examples. Uniform motion. Poisson processes on the real line

Discrete Mathematics

Conservation Laws. Chapter Conservation of Energy

Jointly continuous distributions and the multivariate Normal

Lecture 7: Interchange of integration and limit

LECTURE NOTES ON DVORETZKY S THEOREM

Applications of the Wronskian to ordinary linear differential equations

MA 2232 Lecture 08 - Review of Log and Exponential Functions and Exponential Growth

Partial Differential Equations

II. First variation of functionals

Diophantine Approximations: Examining the Farey Process and its Method on Producing Best Approximations

Chapter 6: Energy-Momentum Tensors

Function Spaces. 1 Hilbert Spaces

Lecture 10 Notes, Electromagnetic Theory II Dr. Christopher S. Baird, faculty.uml.edu/cbaird University of Massachusetts Lowell

Modelling and simulation of dependence structures in nonlife insurance with Bernstein copulas

Notes on Lie Groups, Lie algebras, and the Exponentiation Map Mitchell Faulk

Qubit channels that achieve capacity with two states

SYNCHRONOUS SEQUENTIAL CIRCUITS

Tractability results for weighted Banach spaces of smooth functions

Generalized Tractability for Multivariate Problems

COUPLING REQUIREMENTS FOR WELL POSED AND STABLE MULTI-PHYSICS PROBLEMS

Lectures - Week 10 Introduction to Ordinary Differential Equations (ODES) First Order Linear ODEs

Witten s Proof of Morse Inequalities

Iterated Point-Line Configurations Grow Doubly-Exponentially

Section 2.7 Derivatives of powers of functions

Proof of SPNs as Mixture of Trees

0.1 Differentiation Rules

Sturm-Liouville Theory

Lecture XII. where Φ is called the potential function. Let us introduce spherical coordinates defined through the relations

Calculus of Variations

LATTICE-BASED D-OPTIMUM DESIGN FOR FOURIER REGRESSION

Switching Time Optimization in Discretized Hybrid Dynamical Systems

Convergence rates of moment-sum-of-squares hierarchies for optimal control problems

QF101: Quantitative Finance September 5, Week 3: Derivatives. Facilitator: Christopher Ting AY 2017/2018. f ( x + ) f(x) f(x) = lim

REVERSIBILITY FOR DIFFUSIONS VIA QUASI-INVARIANCE. 1. Introduction We look at the problem of reversibility for operators of the form

Introduction to variational calculus: Lecture notes 1

TMA 4195 Matematisk modellering Exam Tuesday December 16, :00 13:00 Problems and solution with additional comments

THE VAN KAMPEN EXPANSION FOR LINKED DUFFING LINEAR OSCILLATORS EXCITED BY COLORED NOISE

Lagrangian and Hamiltonian Mechanics

1 dx. where is a large constant, i.e., 1, (7.6) and Px is of the order of unity. Indeed, if px is given by (7.5), the inequality (7.

Transcription:

The Annals of Applie Probability 006, Vol. 6, No., 475 55 DOI: 0.4/050560500000079 c Institute of Mathematical Statistics, 006 arxiv:math/0607054v [math.pr] 3 Jul 006 OPTIMAL SCALING FOR PARTIALLY UPDATING MCMC ALGORITHMS By Peter Neal an Gareth Roberts University of Manchester an Lancaster University In this paper we shall consier optimal scaling problems for highimensional Metropolis Hastings algorithms where upates can be chosen to be lower imensional than the target ensity itself. We fin that the optimal scaling rule for the Metropolis algorithm, which tunes the overall algorithm acceptance rate to be 0.34, hols for the so-calle Metropolis-within-Gibbs algorithm as well. Furthermore, the optimal efficiency obtainable is inepenent of the imensionality of the upate rule. This has important implications for the MCMC practitioner since high-imensional upates are generally computationally more emaning, so that lower-imensional upates are therefore to be preferre. Similar results with rather ifferent conclusions are given for so-calle Langevin upates. In this case, it is foun that high-imensional upates are frequently most efficient, even taking into account computing costs.. Introuction. There exist large classes of Markov chain Monte Carlo MCMC) algorithms for exploring high-imensional target) istributions. All methos construct Markov chains with invariant istribution given by the target istribution of interest. However, for the purposes of maximizing the efficiency of the algorithm for Monte Carlo use, it is imperative to esign algorithms which give rise to Markov chains which mix sufficiently rapily. Since all Metropolis Hastings algorithms require the specification of a proposal istribution, these implementational questions can all be phrase in terms of proposal choice. This paper is about two of these choices: the scaling an imensionality of the proposal. We shall work throughout with continuous istributions, although it is envisage that more general istributions might be amenable to similar stuy. Receive June 003; revise April 005. AMS 000 subject classifications. Primary 60F05; seconary 65C05. Key wors an phrases. Metropolis algorithm, Langevin algorithm, Markov chain Monte Carlo, weak convergence, optimal scaling. This is an electronic reprint of the original article publishe by the Institute of Mathematical Statistics in The Annals of Applie Probability, 006, Vol. 6, No., 475 55. This reprint iffers from the original in pagination an typographic etail.

P. NEAL AND G. ROBERTS One important ecision the MCMC user has to make in a -imensional problem concerns the imensionality of the propose jump. For instance, two extreme types of algorithm are the following: propose a fully -imensional upate of the current state accoring to a ensity with a ensity with respect to -imensional Lebesgue measure) an accept or reject accoring to the Metropolis Hastings acceptance probabilities; or, for each of the components in turn, upate that component conitional on all the others accoring to some Markov chain which preserves the appropriate conitional istribution. The most wiely use example is the -imensional Metropolis algorithm, in one extreme, an the Gibbs sampler or some kin of Metropolis-within-Gibbs scheme in the other. In between these two options, there lie many intermeiate strategies. An important question is whether any general statements can be mae about algorithm choice in this context, leaing to practical avice for MCMC practitioners. In this paper we concentrate on two types of algorithm: Metropolis an Metropolis ajuste Langevin algorithms MALA). We consier strategies which upate a fixe proportion, c, of components at each iteration, an consier the efficiency of the algorithms constructe asymptotically as. In orer to o this, we shall exten the methoology evelope in [6, 7] to our context. The analysis prouces clear cut results which suggest that, while full-imensional Langevin upates are worthwhile, full-imensional Metropolis ones are asymptotically no better than smaller imensional upating schemes, so that the possible extra computational overhea associate with their implementation always leas to their being suboptimal in practice. All this is initially one in the context of target ensities consisting of inepenent components, an this leas naturally to the question of whether this simple picture is altere in any way in the presence of epenence. Although this is ifficult to explore in full generality, we o later consier this problem in the context of a class of Gaussian epenent target istributions where explicit results can be shown, an where the conclusions from the inepenent component case remain vali. It is now well recognize that highly correlate target istributions lea to slow mixing for upating schemes where c < see, e.g., [5, 9]). However, it is also known that spherically symmetric proposal istributions in -imensions on highly correlate target ensities can lea to slow mixing since the proposal istribution is inappropriately shape to explore the target see [8]). So for highly correlate target istributions, both high an small imensional upating strategies perform poorly. We shall explore these two competing algorithms in a Gaussian context where explicit calculations are possible. Our work shows that, for c > 0, for the Metropolis algorithm, these two slowing own effects are the same. In particular, this implies that the commonly use strategy of getting roun high correlation problems by

OPTIMAL SCALING FOR MCMC 3 block upating using Metropolis has no justification. In contrast, for MALA full imensional upating, c =, is shown to be optimal. The paper is structure as follows. In Section we outline the MCMC setup. In Sections 3 an 4 we tackle the problem of scaling the variance of the proposal istribution for RWM-within-Gibbs Ranom walk Metropoliswithin-Gibbs) an MALA-within-Gibbs Metropolis ajuste Langevin-within- Gibbs), respectively. The approach taken is similar to that use for the full RWM/MALA algorithms, by obtaining weak convergence to an appropriate Langevin iffusion as the imension of the state space, converges to infinity. The results of Sections 3 an 4 are prove for a sequence of -imensional prouct ensities of the form.) π x ) = fx i ) i= for some suitably smooth probability ensity f ). In both Sections 3 an 4, for each fixe, one-imensional component of X ; }, the one-imensional process converges weakly to an appropriate Langevin iffusion. The aim therefore is to scale the proposal variances so as to maximize the spee of the limiting Langevin iffusion. Since each of the components of X ; } are inepenent an ientically istribute, we shall prove the results for X ; }. However, it is at least plausible that the picture will be very ifferent when consiering epenent ensities. However, theoretical analysis in the limiting case where results can be obtaine an in simulations for more general cases, we fin that the general conclusions which can be erive for ensities of the form.) exten some way towar epenent ensities. To this en, in Section 5, we consier RWM/MALA-within-Gibbs for the exchangeable normal X N0,Σ ρ), where σii =, i, an σ ij = σ ji = ρ, i < j. Throughout the paper, we aopt the notation that Σ will be use for variance matrices, while elements of matrices will be enote by σ, both conventions using appropriate sub- an er-scripts.) All the proofs of the theorems in Sections 3 5 are given in the Appenix. Then in Section 6 with the ai of a simulation stuy we emonstrate that the asymptotic results are practically useful for finite, namely, 0.. Algorithms an preliminaries. For RWM/MALA, we are intereste in,σ ), the imension of the state space,, an the proposal variance σ, where the proposal for the ith component is given by Y i = x i + σ Z i, i, RWM, Yi = x i + σ Z i + σ logπ x ), i, MALA x i

4 P. NEAL AND G. ROBERTS an the Z i } s are inepenent an ientically istribute accoring to Z N0,). For both RWM an MALA, the maximum spee of the iffusion can be obtaine by taking the proposal variance to be of the form σ = l s for some l > 0 an s > 0. For RWM, s = an for MALA, s = 3.) Now for RWM/MALA-within-Gibbs, the basic iea is to choose c components at ranom at each iteration, attempting to upate them jointly accoring to the RWM/MALA mechanism, respectively. We sometimes write σ = σ,c, where c represents the proportion of components upate at each iteration. Thus, the two algorithms propose new values as follows:.) Y i = x i + χ i σ,c Z i, i, RWM-within-Gibbs, Yi = x i + χ i σ,c Z i + σ },c logπ x ), x i i, MALA-within-Gibbs, where the Z i } s are inepenent an ientically istribute accoring to Z N0,) an the χ i } are chosen as follows. Inepenently of the Z i s, we select at ranom a subset A, say, of size c from,,...,}, setting χ i = if i A, an χ i = 0 otherwise. The proposal Y is then accepte accoring to the usual Metropolis Hastings acceptance probability:.) α c x,y ) = π Y )qy,x ) π x )qx,y ), where q, ) is the proposal ensity. Otherwise, we set X m = X m. In both cases, the algorithms simulate Markov chains which are reversible with respect to π, an can be easily shown to be π -irreucible an aperioic. Therefore, both algorithms will converge in total variation istance to π. However, here we shall investigate optimization of the algorithms for rapi convergence. To fin a manageable framework for assessing optimality, Roberts, Gelman an Gilks [6] introuce the notion of the average acceptance rate which measures the steay state proportion of accepte proposals for the algorithm, an which can be shown to be closely connecte with the notion of algorithm efficiency an optimality. Specifically, we efine.3) a c l) = E π [α c X,Y )] = E π [ π Y )qy,x ) π X )qx,y ) where σ,c = l s, X π an Y represents the subsequent proposal ranom variable. Thus, a c l) is the π -average acceptance rate of the above algorithms where we upate a proportion c of the components in each iteration. We aopt the general notational convention that, for any -imensional stochastic process W ḍ,., we shall write W t,i for the value of its ith component at time t. ],

OPTIMAL SCALING FOR MCMC 5 Our aim in this paper is to consier the optimization [in c,σ,c )] of the algorithms spee of convergence. For convenience although to some extent this assumption can be relaxe), we shall assume that c c as for some 0 < c. It turns out to be both convenient an practical to express many of the optimality solutions in terms of acceptance rate criteria. 3. RWM-within-Gibbs for IID prouct ensities. We shall first consier the RWM algorithm applie initially to a simple IID form target ensity. This allows us to obtain explicit asymptotic results for optimal high-imensional algorithms. The results of this section can be seen as an extension of the results of Theorems. an. of [6] which consiers the full-imensional upate case. Let 3.) π x ) = fx i ) = expgx i )} i= be a -imensional prouct ensity with respect to Lebesgue measure. Let the proposal stanar eviation σ = l for some l > 0. For, let U t = X [t],,x [t],,...,x [t], ), an so, U t,i = X [t],i, i. Let U t = U t,. Theorem 3.. Suppose that f is positive, C 3 a three-times ifferentiable function with continuous thir erivative) an that logf) = g is Lipschitz. Suppose also that, c c, as, for some 0 < c, [ f ) X) 8 ] 3.) E f < fx) an 3.3) i= [ f ) X) 4 ] E f <. fx) Let X 0 = X 0,,X 0,,...) be such that all of its components are istribute accoring to f an assume that X j 0,i = Xi 0,i for all i j. Then, as, 3.4) U U, where U 0 is istribute accoring to f an U satisfies the Langevin SDE 3.5) an U t = h c l)) / B t + h cl)g U t )t h c l) = cl Φ l ) ci,

6 P. NEAL AND G. ROBERTS with Φ being the stanar normal cumulative c..f an [ f ) X) ] I E f E g [g X) ]. fx) The following corollary hols. Corollary 3.. Let c c, as, for some 0 < c. Then: i) lim a c l) = a c l) ef = Φ l ci ). ii) Let ˆl be the unique value of l which maximizes h l) = l Φ l I ) on [0, ), an let ˆl c be the unique value of l which maximizes h c l) on [0, ). Then ˆl c = c /ˆl an hc ˆl c ) = h ˆl). iii) For all 0 < c, the optimal acceptance rate a c ˆl c ) = 0.34 to three ecimal places). Though these results involve fairly technical mathematical statements, they yiel a very simple practical conclusion. Optimal efficiency obtainable for a given c oes not epen on c at all. Now, in practice, computational overheas associate with one iteration of the algorithm are nonecreasing as a function of c, so that, in practice, smaller values of c shoul be preferre. Therefore, for RWM, using high-imensional upate steps oes not make any sense. It is, of course, important to see how these conclusions exten to more general target ensities an, in particular, ones which exhibit epenence structure. Some theory an relate simulation stuies in Sections 5 an 6, respectively, will emonstrate that these finings exten consierably beyon the rigorous but restrictive set up of Theorem 3.. 4. MALA-within-Gibbs for IID prouct ensities. We now turn our attentions to MALA-within-Gibbs. We again consier a sequence of probability ensities π of the form given in 3.). We follow [7] in making the following assumptions. We assume that X 0 is istribute accoring to the stationary measure π, g is an eight times continuously ifferentiable function with erivatives g i) satisfying 4.) gx), g i) x) C + x K ), i 8, for some C,K > 0, an that 4.) x k fx)x <, k =,,... R Finally, we assume that g is Lipschitz. This ensures that X t } is nonexplosive see, e.g., [], Chapter V, Theorem 5.).

OPTIMAL SCALING FOR MCMC 7 Let J t } be a Poisson process with rate /3 an let Γ = Γ t } t 0 be the -imensional jump process efine by Γ t = X J t, where we take σ = l /3 with l an arbitrary constant. We then have the following two theorems which are extensions of [7], Theorems an. Theorem 4.. Suppose that c c, as, for some 0 < c. We have that lim ac ckl l)} = a c 3) l) = Φ, with K = E[ 5g X) 3g X) 3 48 ] > 0. Theorem 4.. Suppose that c c as for some 0 < c. Let U } t 0 be the process corresponing to the first component of Γ. Then, as, the process U converges weakly in the Skorokho topology) to the Langevin iffusion U efine by U t = h c l) / B t + h cl)g U t )t, where h c l) = cl Φ cl 3 K ) is the spee of the limiting iffusion. The most important consequence of Theorems 4. an 4. is the following corollary. Corollary 4.3. Let c c, as, for some 0 < c. Then: i) Let ˆl be the unique value of l which maximizes h l) = l Φ l3 K ) on [0, ), an let ˆl c be the unique value of l which maximizes h c l) on [0, ). Then ˆl c = c /6ˆl an hc ˆl c ) = c /3 h ˆl). ii) For all 0 < c, the optimal acceptance rate a c ˆl c ) = 0.574 to three ecimal places). Thus, in stark contrast to the RWM case, it is optimal to upate all components at once for MALA. The story is somewhat more complicate in the case where computational overheas are taken into account. For instance, it is common for the computational costs of implementing MALA-within- Gibbs to be approximately a + bc) for constants a an b. To see this, note that the algorithm s computational cost is often ominate by two operations: the calculation of the various erivatives neee to propose a new value, an the evaluation of π at the propose new value. The first of these operations involves a c-imensional upate an typically takes a time which is orer c, while the secon involves evaluating a -imensional function

8 P. NEAL AND G. ROBERTS which we woul expect to be at least of orer. Although, in some important special cases, target ensity ratios might be compute more efficiently than this.) In this case the overall efficiency is obtaine by maximizing c /3 a + bc. This expression is maximize at a/b. Therefore, it is conceivable for full imensional upates to be optimal even when computational costs are taken into account. In any case, the optimal proportion will be some value x 0,]. 5. RWM/MALA-within-Gibbs on epenent target istributions. We are now intereste in the extent to which the results of the last two sections can be extene to the case where the components are epenent. It is ifficult to get general results, but certain important special cases can be examine explicitly, yieling interesting results which imply essentially) that the extent by which the epenence structure affects the mixing properties of the chain RWM-within-Gibbs or MALA-within-Gibbs) is inepenent of c. The most tractable special case is the Gaussian target istribution. However, in Section 6, we shall also inclue some simulations in other cases to show that the above statement hols well beyon the cases for which rigorous mathematical results can be prove. We begin with RWM-within-Gibbs an consier the optimal scaling problem of the variance of the proposal istribution for a target istribution consisting of exchangeable normal components. Specifically, X N 0,Σ ρ), where σii =, i, an σ ij = ρ, i j, for some 0 < ρ <. Therefore, we have that 5.) where an π x ) = π) et Σ ρ / exp )) x i ) + θ x i x j ρ i= i= j= = π) et Σ ρ / exp ) j x ), say, θ = j x ) = ρ ρ + )ρ )ρ x i ) + θ x i x j. i= i= j=

OPTIMAL SCALING FOR MCMC 9 For, Û t = X [t],,x [t],,...,x [t], ). Let U t = U t,,u t,,u t,3 ) be such that Ut, = Û t,, U t, = Û t, an U t,3 = i=3 Û t,i. Now the proposal Y is given by Y i = x i + σ χ i Z i, i, where the Z i an χ i i ) are efine as before an σ = l for some constant l. [We use ) rather than or ) for simplicity in presentation of the results.] In the epenent case, more care nees to be taken in constructing the sequence X 0 ; }. Let X 0 N0,) [i.e., X 0 is istribute accoring to π )]. For an i, set X0,i = Xi 0,i. Then iteratively efine X 0, N ρ X0,i, i= + )ρ )ρ ) Therefore, X 0 is istribute accoring to π ) an we can continue this process inefinitely to obtain X 0 = X 0,,X 0,,...). Theorem 5.. Suppose that 0 < ρ < an that c c, as, for some 0 < c. Let X 0 = X 0,,X 0,,...) be constructe as above. Let ρ ρ D = ρ ρ, ρ ρ ρ 0 ρ ρ D = 0 ρ ρ, + ρ ρ ρ ρ ρ) 0 0 D 3 = 0 0. 0 0 0 Let fu) enote the probability ensity function of N0,D ). Then, as, U U, where U 0 is istribute accoring to f an U satisfies the Langevin SDE where U t = h c,ρ l)) / D 3 B t + h c,ρ l)d 3 UT t D U t )}t, h c,ρ l) = cl Φ l c ρ ). ).

0 P. NEAL AND G. ROBERTS Note that if we efine Ĩ = E[ x j X )) ] an Ĩ = ρ. Then Ĩ Ĩ as an h c,ρ l) = cl Φ l cĩ ). Therefore, the spee of the limiting iffusion for exchangeable normal has the same form as that obtaine for the IID prouct ensities consiere in Section 3. As in.3), let a c,ρ l) be the π -average acceptance rate of the above algorithm where X N0,Σ ρ), σ = l an we upate a proportion c of the components in each iteration. Then we have the following corollary. Corollary 5.. Let c c, as, for some 0 < c. Then, for 0 < ρ < : i) lim a c,ρ l) = a c,ρ l) ef = Φ l c ρ ). ii) Let ˆl be the unique value of l which maximizes h,0 l) = l Φ l ) on [0, ), an let ˆl c,ρ be the unique value of l which maximizes h c,ρ l) on [0, ). Then ˆl c,ρ = ρ c l an h c,ρˆl c,ρ ) = ρ)h,0 ˆl). iii) For all 0 < c an 0 < ρ <, the optimal acceptance rate a c,ρ ˆl c,ρ ) = 0.34 to three ecimal places). Note that Corollary 5.ii) states that the cost incurre by having σij = ρ, i j, rather than σij = 0, i j, is to slow own the spee of the limiting iffusion by a factor of ρ, for all 0 < c. In other wors, the cost incurre by the epenence between the components of X is inepenent of c. Furthermore, the optimal acceptance rate a c,ρ ˆl c,ρ ) is unaffecte by the introuction of epenence. We shall stuy this further in the simulation stuy conucte in Section 6. Note that in Theorem 5. the last row of the matrix D 3 is a row of zeros. This implies that the mixing time of T X grows more rapily than O) as. In [8], heuristic arguments an extensive simulations show that the mixing time of T X is in fact O ). Theorem 5.3 below gives a formal statement of this result. The proof of Theorem 5.3 is similar to the proof of Theorem 5. an is, hence, omitte.) For, let Ũ t = i=3 X[ t],i. Theorem 5.3. Suppose that 0 < ρ < an that c c, as, for some 0 < c. Let X 0 = X 0,,X 0,,...) be constructe as in the prelue to Theorem 5.. Then, as, Ũ Ũ, where Ũ0 N0,ρ) an Ũ satisfies the Langevin SDE Ũt = h c,ρ l)) / B t + h c,ρ l) } t, ρũt

where h c,ρ l) = cl Φ l OPTIMAL SCALING FOR MCMC c ρ ), as before. We now turn our attention to MALA-within-Gibbs for the exchangeable normal. So that now the proposal Y is given by Yi = x i + χ i σ Z i + σ )} ρ x i θ x j, where we take σ = l /3 with l an arbitrary constant. Let X 0 be constructe as outline above for the RWM-within-Gibbs. Let J t } be a Poisson process with rate /3 an let Γ = Γ t } t 0 be the -imensional jump process efine by Γ t = X J t. Let U t = U t,,u t,,u t,3 ) be such that U t, = Γ t,, Ut, = Γ t, an U t,3 = i=3 Γ t,i. Theorem 5.4. Suppose that 0 < ρ < an that c c, as, for some 0 < c. Let X 0 = X 0,,X 0,,...) be constructe as in the prelue to Theorem 5.. Let D, D, D 3 an f be as efine in Theorem 5.. Then, as, U U, where U 0 is istribute accoring to f an U satisfies the Langevin SDE j= U t = h c,ρ l)) / D 3 B t + h c,ρ l)d 3 UT t D U t )}t, where ) h c,ρ l) = cl Φ l3 c 8 ρ) 3 is the spee of the limiting iffusion. Note that if we efine K = E [ 48 3 5 x 3 jx )) 3 ) 3 }] 3 x jx ) an K = 6 ρ )3, then K K as an h c,ρ l) = cl Φ l3 c K). Therefore, the spee of the limiting iffusion for exchangeable normal has the same form as that obtaine for the IID prouct ensities consiere in Section 4. As in.3), let a c,ρ l) be the π -average acceptance rate of the above algorithm where X N0,Σ ρ), σ = l /6 an we upate a proportion c of the components in each iteration. Then we have the following corollary.

P. NEAL AND G. ROBERTS Corollary 5.5. Let c c, as, for some 0 < c. Then, for 0 < ρ < : i) lim a c,ρ l) = a c,ρ l) ef = Φ l3 c 8 ρ) 3 ). ii) Let ˆl be the unique value of l which maximizes h,0 l) = l Φ l3 8 ) on [0, ), an let ˆl c,ρ be the unique value of l which maximizes h c,ρ ˆl) on [0, ). Then ˆl c,ρ = ρc /6 l an h c,ρ ˆl c,ρ ) = c /3 ρ)h,0 ˆl). iii) For all 0 < c an 0 < ρ <, the optimal acceptance rate a c,ρ ˆl c,ρ ) = 0.574 to three ecimal places). Note that Corollary 5.5ii) states that the cost incurre by having σij = ρ, i j, rather than σij = 0, i j, is to slow own the spee of the limiting iffusion by a factor of ρ, for all 0 < c. Therefore, the epenence in the target istribution π ) affects convergence of the MALA-within-Gibbs in the same way that it affects the RWM-within-Gibbs. The cost associate with upating only a proportion c rather than all of the components is the same as that observe in Section 4. Furthermore, the optimal acceptance rate a c,ρ ˆl c,ρ ) is unaffecte by the introuction of epenence. From Theorem 5.4, we see that the mixing time of T X is greater than O /3 ) as. In fact, the mixing time of T X is in fact O 4/3 ). Let J t } be a Poisson process with rate 4/3 an for, let Ũ t = i=3 XJ. t,i Theorem 5.6. Suppose that 0 < ρ < an that c c, as, for some 0 < c. Let X 0 = X 0,,X 0,,...) be constructe as in the prelue to Theorem 5.. Then, as, Ũ Ũ, where Ũ0 N0,ρ) an Ũ satisfies the Langevin SDE Ũt = h c,ρ l)) / B t + h c,ρ l) } t, ρũt where h c,ρ l) = cl Φ l3 c 8 ), as before. ρ) 3 The proofs of Theorems 5.4 an 5.6 are hybris of those for the results of Section 4, an for Theorems 5. an 5.3 above, an are, hence, omitte. 6. A simulation stuy. The rotational symmetry of the Gaussian istribution effectively allows the epenence problem to be formulate as one of heterogeneity of scale. Other istributional forms exist for which this may be possible e.g., the multivariate t-istribution), but it seems ifficult to erive results for very general istributional families of target istribution

OPTIMAL SCALING FOR MCMC 3 without resorting to ieas such as this. Therefore, to port the conjecture that the conclusions of Sections 3 5 hol beyon the rigorous, theoretical results, we present the following simulation stuy. Furthermore, we emonstrate that the asymptotic results are achieve in relatively low imensional 0) situations. Throughout the simulation stuy we measure spee/efficiency of the algorithm by consiering first-orer efficiency. That is, for a multiimensional Markov chain X with first component X, say, the first-orer efficiency is efine to be E[Xt+ X t ) ] for RWM an /3 E[Xt+ X t ) ] for MALA, where X t is assume to be stationary. For each of the target istributions an ifferent choices of c an, we consier 50 ifferent proposal variances, σ,c. For each choice of proposal variance σ,c, we starte with X 0 rawn from the target istribution. We then ran the algorithm for 00000 iterations. We estimate E[Xt+ X t ) ] by 00000 00000 i= Xi X i ) an the acceptance rate is estimate by 00000 00000 i= Xi X i }. We then plot acceptance rate against E[Xt+ X t ) ] first-orer efficiency). We begin by consiering RWM-within-Gibbs. We shall consier three ifferent target istributions π N0,Σ ρ), π t 50 0,Σ ρ) an π x ) = i= exp x i ) ouble-sie exponential). Note that the istributions t 50 0,Σ ρ ) ρ > 0) an the ouble-sie exponential are not covere by the asymptotic results of Sections 3 an 5. For the N0,Σ ρ ) an t 500,Σ ρ ), we plot acceptance rate against the normalize first-orer efficiency, ρ E[X t+ Xt ) ]. The normalization is introuce to take account of epenence see Corollary 5.). Figures an give a representative sample of the simulation stuy we conucte for a whole range of ifferent values of c, an ρ. The results are as one woul expect. In all cases the estimate optimal acceptance rate is approximately 0.34. As can be seen from Figures an, the normalize first-orer efficiency curves are virtually inistinguishable from one another for each choice of c, an ρ. Therefore, we have mae no attempt to ifferentiate between the ifferent efficiency curves. Note that the results in Figure 3 are a representative sample from a much larger simulation stuy.) Figures 3 an 4 prouce results in line with those expecte from Sections 3 an 5. This emonstrates that the conclusions of Sections 3 an 5 o exten beyon those target istributions for which rigorous statements have been mae. We now turn our attention to MALA-within-Gibbs. We shall consier in our simulation stuy only target ensities of the form π N0,Σ ρ). Simulations in Figures 5 an 6 show excellent agreement with Corollaries 4.3 an 5.5. Again, the results emonstrate the usefulness/relevance of the asymptotic results for even fairly small.

4 P. NEAL AND G. ROBERTS 7. Discussion. A rather surprising property of high-imensional Metropolis an Langevin algorithms is the robustness of relative efficiency as a function of acceptance rate. In particular, the optimal acceptance rates 0.34 an 0.574 for Metropolis an Langevin, respectively, appear to be robust to many kins of perturbation of the target ensity. A remarkable conclusion of this paper is this apparent robustness of relative efficiency, as a function of acceptance rate, seems to exten quite reaily to upating schemes where only a fixe proportion of components are upate at once. A further unexpecte conclusion concerns the issue of optimization in c. Here, very clear cut statements appear to be available, with smallerimensional upates seeming to be optimal for the Metropolis algorithm as seen from Theorem 3. an Corollary 3.), whereas higher-imensional up- Fig.. Normalize first-orer efficiency of RWM-within-Gibbs, ρ E[X t+ Xt ) ], as a function of overall acceptance rates for each combination of = 0; c = 0.5,0.5,0.75,; ρ = 0,0.5), with π N0,Σ ρ).

OPTIMAL SCALING FOR MCMC 5 Fig.. Normalize first-orer efficiency of RWM-within-Gibbs, ρ E[X t+ Xt ) ], as a function of overall acceptance rates for each combination c = 0.5; = 0,0,50; ρ = 0,0.5), with π N0,σρ). ates are to be preferre at least before computing time has been taken into consieration) for MALA schemes see Theorem 4. an Corollary 4.3). The robustness of these conclusions to epenence in the target ensity is seen in the results of Section 5 an, porte by the simulation stuy in Section 6, seems contrary to the general intuition that block upating improves MCMC mixing at least for the Metropolis results). However, our results show that this intuition is only correct for schemes where the multivariate upate step utilies the structure of the target ensity as, e.g., in the Gibbs sampler, or, to a lesser extent, MALA). We believe that these results shoul have quite funamental implications for practical MCMC use, although, of course, they shoul be treate with care since they are only asymptotic. Our results have been shown in the

6 P. NEAL AND G. ROBERTS Fig. 3. Normalize first-orer efficiency of RWM-within-Gibbs, ρ E[X t+ Xt ) ], as a function of overall acceptance rates for each combination: i) = 0; c = 0.5,0.5,0.75,; ρ = 0,0.5) an ii) c = 0.5; = 0,0,50; ρ = 0,0.5), with π t 500,Σ ρ). simulation stuy to hol approximately in very low-imensional problems although the spee at which the infinite-imensional limit is reache oes vary in a complicate way, in particular, in c an measures of epenence in the target ensity such as ρ in the exchangeable normal examples). The results for the exchangeable normal example show that certain functions can converge at ifferent rates to others X converging at rate, while X i X converges at rate ), an this can cause serious practical problems for the MCMC practitioner. In particular, any one co-orinate X i might converge rapily, in a given time scale, to the wrong target ensity. Certainly, it woul be extremely ifficult to etect such problems empirically.

OPTIMAL SCALING FOR MCMC 7 Fig. 4. Normalize first-orer efficiency of RWM-within-Gibbs, ρ E[X t+ Xt ) ], as a function of overall acceptance rates for each combination = 40;c = 0.5,0.5,0.75,), with π x ) = i= exp x i ). The results in this paper are given for Metropolis an MALA algorithms. However, the use of these two methos is, in some sense, illustrative, an other algorithms such as, e.g., higher-orer Langevin algorithms using, e.g., the Ozaki iscretization [0]) are expecte to yiel similar conclusions. APPENDIX A.. Proofs of Section 3. Theorem 3. implies that the first component acts inepenently of all others as. Intuitively, this occurs because all other ) terms contribute expressions to the accept/reject ratio which turn out to obey SLLN an, thus, can be replace by their eterministic limits. To make this iea rigorous, we nee to efine a set in R on which the first component is well approximate by the appropriate LLN limit.

8 P. NEAL AND G. ROBERTS Fig. 5. Normalize first-orer efficiency of RWM-within-Gibbs, c /3 ρ /3 E[X t+ X t ) ], as a function of overall acceptance rates for each combination of = 0; c = 0.5,0.5,0.75,; ρ = 0,0.5), with π N0,Σ ρ). Motivate by this iea, we construct sets of tolerances aroun average values for quantities which will appear in the accept/reject ratio. Thus, we efine the sequence of sets F R, > } by } F = x ; g x i ) I < /8 i= } x ; g x i ) + I < /8 i= x ; ) g x i ) 4 < }, /8 i=

OPTIMAL SCALING FOR MCMC 9 Fig. 6. Normalize first-orer efficiency of MALA-within-Gibbs, c /3 ρ /3 E[X t+ X t ) ], as a function of overall acceptance rates for each combination c = 0.5; = 0,0,50; ρ = 0,0.5), with π N0,Σ ρ). = F, F, F,3, say, where I is efine in Theorem 3.. Let x = x,x,...) an for, let x = x,x,...,x ), where, for i, x i = x i. Thus, we shall use x an x interchangeably, as appropriate. Lemma A.. For k =,,3 an t > 0, A.) an, hence, PU s F,k,0 s t) as PU s F,0 s t) as.

0 P. NEAL AND G. ROBERTS Proof. The cases k = an k = are prove in [6], Lemma.. The case k = 3 is prove similarly using Markov s inequality an 3.). The lemma then follows. For any ranom variable X an for any subset A R, let E [X] = E[X χ = ] an P X A) = PX A χ = ). Let G be the iscrete-time) generator of X, an let V Cc the space of infinitely ifferentiable functions on compact port) be an arbitrary test function of the first component only. Thus, A.) [ G V x ) = E V Y ) V x )) π Y }] ) π x ) = Pχ [V = )E Y ) V x )) π Y ) π x ) since Y = x if χ = 0. The generator G of the one-imensional iffusion escribe in 3.4), for an arbitrary test function V C c, is given by A.3) GV x ) = cl Φ l ci ) g x )V x ) + V x )}. Note that, uner the conitions impose in Theorem 3., Cc forms a core for the full generator.) By Lemma A., we can restrict attention to x F. The aim will therefore be to show that, for all x F, G V x ) GV x ) as. The proof of Theorem 3. will then be fairly straightforwar. Thus, we begin by giving a Taylor series approximation for G V x ) in Lemma A.3, for which we will require the following lemma. }], the space of infinitely ifferentiable func- Lemma A.. For any V Cc tions on compact port), A.4) an A.5) E [V Y ) V x ))] l V x ) 0 as x F σ E [Z V Y ) V x ))] l V x ) 0 as, x F with x = x.

Proof. For χ =, Thus, by Taylor s theorem, A.6) OPTIMAL SCALING FOR MCMC Y x = σ l Z = Z. V Y ) V x ) = V x )σ Z ) + V x )σ Z ) + 6 V W )σ Z ) 3 for some W lying between x an Y. The lemma then follows by substituting A.6) into the left-han sies of A.4) an A.5). Lemma A.3. Let G V x ) = cl V x )E [ e B ] + cl V x )g x )E [e B ;B < 0], where B = B x )) = i= gyi ) gx i )). Then, we have that A.7) G V x ) G V x ) 0 as. x F Proof. Decomposing Y into Y,Y ) an using inepenence gives [ G V x ) = c E Y V Y ) V x ))E fyi Y [ ) ]] fx i ). We shall begin by concentrating on the inner expectation, by recalling the following fact note in []. Let h be a twice ifferentiable function on R, then the function z e hz) is also twice ifferentiable, except at a countable number of points, with first erivative given Lebesgue almost everywhere by the function h z ehz) = z)e hz), if hz) < 0, 0, if hz) 0. Now take h z)= h z;x )) = gx + σ z) gx )) + B an let γ z) = E fyi Y [ ) ] fx i ) Z = z. i= Thus, γ z) = E [ e hz) ], an so, for almost every x Y R, there exists W lying between 0 an z such that γ z) = E Y [ e h 0) ] A.8) + ze Y [σ g x )eh 0) ;h 0) < 0] + z E Y [σ g x + σ W) + g x + σ W) )e h W) ;h W) < 0]. i=

P. NEAL AND G. ROBERTS The key results to note are that h 0) = B an that, conitional upon χ =, Y an Y are inepenent. Therefore, A.9) G V x ) = c E Y [ V Y ) V x )) E Y [ e h0) ] + Z E Y [σ g x )e h 0) ;h 0) < 0] + Z E Y [σ g x + σ W) }] + g x + σ W) )e hw) ;h W) < 0] = c E [V Y ) V x ))]E [ e B ] + g x )c σ E [V Y ) V x ))Z ]E [e B ;B < 0] [ + c E Y V Y ) V x )) Z E Y [σg x + σ W) ] + g x + σ W) )e hw) ;h W) < 0] = ĜV x ) + D x ;Z ;W), say. Since E [ e B ],E [e B ;B < 0] an x = x, it follows from Lemma A. that x F ĜV x ) G V x ) 0 as. Thus, to prove the lemma, it is sufficient to show that, for all x F, D x ;Z ;W) converges to 0, as. By Taylor s theorem, we have that an V Y ) V x )) Z a R V a ) σ Z3 g x + σ W) g x ) + σ W g a ) a R g x ) + σ Z g a ). a R

OPTIMAL SCALING FOR MCMC 3 Since V an g are boune functions, it follows that, for all x F, D x ;Z ;W) c 3 Kσ3 K + g x ) + σ K)} 0 as, for some K > 0, an the lemma is prove. Lemma A.3 states that, for all x F, the generator G can be approximate by the generator G which resembles the limiting generator G. Thus, we now nee to consier for all x F, E [ e B ] an E [e B ;B < 0]. The aim is to approximate B by a more convenient quantity A to be efine in Lemma A.6) an, hence, show that E [ e B ] Φ l ) ci an E [e B ;B < 0] Φ l ) ci This will be one in the following lemmas. as. Lemma A.4. Let λ = λ x )) = i= χ i g x i ). For any ε > 0, x F P λ ci > ε) 0 as. Proof. Let R = R x )) = i= g x i ). Then, for x F, λ ci λ E [λ ] + E [λ ] cr + cr ci. Note that E [λ ] = c R, an so, by Lemma A., we have that E [λ ] cr + cr ci 0 as. Therefore, to prove the lemma, it suffices to show that, for any ε > 0, P λ E [λ ] > ε) 0 as. Note that an so, λ = ) χ i χ j g x i ) g x j ), i= j= E [λ ] = c ) g x i ) 4 i= + c ) c ) ) ) } g x i ) g x j) i= j i

4 P. NEAL AND G. ROBERTS = c )c ) R ) ) + c ) c ) ) ) ) g x i ) }. 4 i= Then since x F ) i= g x i )4 0 an c c as, it follows that, for all x F, E [λ E [λ ]) ] 0 as an, hence, by Chebyshev s inequality, as require. Lemma A.5. Let x F P λ E [λ ] > ε) 0 as, W = W x )) = i= g x i )Y i x i ) + an c c as. Then, recalling that σ = l/, x F E [ W ] 0 as. Proof. First, note that E [ W ] E [W ]. Then, by irect calculations, E [W ] = = i= j= + i= E [ g x i )Y i x i ) + } cl ) g x i ), } cl ) g x i ) }] g x j)yj x j) + cl ) g x j) 4 g x i ) 3 c i= j= = W, + W,, say. σ4 c )c ) ) ) 4 g x i )g x j )c )c ) ) ) σ 4 σ 4 + cl ) g x i )g x j )c σ + c l 4 } ) 4 ) g x i ) g x j) } )

OPTIMAL SCALING FOR MCMC 5 Let W,3 = W,3 x )) = cl i= ) g x i ) + g x i ) )}, an since c c as, we have that x F W, W,3 0 as. However, by efinition, x F W,3 0 an since g is boune, x F W, 0 as. The lemma follows immeiately. Lemma A.6. Let A = A x )) = i= g x i )Y i x i ) cl ) g x i ) }. Then, A.0) an A.) x F E [ e A ] E [ e B ] 0 as x F E [e A ;A < 0] E [e B ;B < 0] 0 as. Proof. Note that B = gyi ) gx i )) i= = g x i )Yi x i ) + g x i )Yi x i ) + 6 g α i )Yi x i ) 3 }, i= for some α i lying between x i an Y i. Therefore, by [6], Proposition., A.) E [ e A ] E [ e B ] E [ W ] + g a) a R 6 E [ Y x 3 ], = E [ W ] + g a) c a R 6 σ3 E[ Z 3 ]. Now let ϕ = x F E [ W ]+ a R g a) c 6 σ 3 E[ Z 3 ]}, where W is efine in Lemma A.5. Then, since g is a boune function, it follows from Lemma A.5 that ϕ 0 as an so A.0) is prove. Let J = J x )) = e A ;A < 0) e B ;B < 0) an let δ = ϕ. Then we procee by showing that Note that, if A,B > 0, then x F P J > δ ) 0 as. J = 0 A B

6 P. NEAL AND G. ROBERTS an if A,B < 0, then J = expa ) expb ) A B. Therefore, it follows that A.3) P J > δ ) P δ < A < δ ) + P A B δ ). By Markov s inequality, P A B δ ) A.4) δ E [ A B ] E [ W ] + g a) } δ a R 6 E [ Y x 3 ] ϕ, an so, P A B > δ ) 0 as, uniformly for x F. Fix x F, then for any ε > 0, by Lemma A.4, ) P ±δ l + cl λ R l ) ci A.5) > ε 0 as. Hence, Thus, A.6) ))] E [Φ ±δ l + cl l ci λ R Φ x F P δ < A < δ ) 0 as. ) as. Therefore, by A.3) A.6), x F P J > δ ) 0 as. Then since J, it follows that x F E [J ] 0 as an so A.) is prove. Lemma A.7. E [ e A ] Φ l ) ci A.7) 0 as an A.8) x F E [e A ;A < 0] Φ l ) ci 0 as. x F

that OPTIMAL SCALING FOR MCMC 7 Proof. Since A N cl R,l λ ), it follows by [6], Proposition.4, A.9) E [ e A ] = E [Φ clr ) λ ) + exp l cr λ ) Φ l λ + clr λ Since for any x F an ε > 0, P R I > ε) 0 an P λ ci > ε) 0 as, A.7) follows from A.9). A.8) is prove similarly. We are now in a position to show that, for all x F, the generator G converges to the generator G as. Theorem A.8. For V C c, x F G V x ) GV x ) 0 as. Proof. By Lemma A.3, x F G V x ) GV x ) 0 as, an by Lemmas A.6 an A.7, x F G V x ) GV x ) 0 as. Thus, the theorem is prove. Proof of Theorem 3.. The proof is similar to that of [6]. From Lemmas A., A.4 an Theorem A.8, we have uniform convergence of G V to GV for vectors containe in a set of π measure arbitrarily close to. Since Cc separates points see [4], page 3), the result will follow by [4], Chapter 4, Corollary 8.7 if we can emonstrate the compact containment conition, which in our case follows from the following statement. For all ε > 0, an all real value U0 = X 0,, we can fin K > 0 sufficiently large with PU t / K,K), 0 t ) ε, for all. We appeal irectly to the explicit form of the Metropolis transitions an assume that the Lipshitz constant for g is terme b. Thus, the following estimates are easy to erive by just noting that square jumping istances are boune above by that attaine by ignoring rejections. Moreover, these estimates are uniform over all X n : bσ e b σ / E[X n+, X n, X n] bσ e b σ / )].

8 P. NEAL AND G. ROBERTS an E[X n+, X n,) X n] E[Y n+, X n,) X n] = σ. Thus, setting V n = X n, + nbσ eb σ /, V n,0 n []} is submartingale with A.0) E[V [] ] σ + bσ e b σ / ). Since σ = l /, the right-han sie of A.0) is uniformly boune in so that the upper boun result follows by Doob s inequality. The lower boun follows similarly by consiering the ermartingale X n, nbσ eb σ /. A.. Proofs of Section 4. The proofs of Theorems 4. an 4. are similar to the proofs of Theorems an in [7], respectively. The only complication in the proofs is that we are upating a ranom set of components at each iteration in the MALA algorithm. Let x = x,x,...) an for, let x = x,x,...,x ), where, for i, x i = x i. Thus, we shall again use x an x interchangeably as appropriate. Let G be the iscrete-time) generator of X an let V Cc be an arbitrary test function of the first component only. Thus, [ G V x ) = /3 E V Y ) V x )) π Y }] ) π x ) A.) = /3 Pχ = )E [V Y ) V x )) π Y ) π x ) where E is efine after Lemma A. cf. Section A. after Lemma A.). The generator G of the one-imensional iffusion escribe in Theorem 4., for an arbitrary test function, V, is given by cl GV x ) = cl 3 ) K Φ g x )V x ) + V x )} A.) = h c l) g x )V x ) + } V x ), where K an h c l) are efine in Section 5. The aim thus, as in Section A., is to fin a sequence of sets F R } such that, for all t > 0, an, for V C c, PΓ s F, for all 0 s t) as, x F G V x ) GV x ) 0 as. }],

OPTIMAL SCALING FOR MCMC 9 The proofs of Theorem 4. an 4. are then straightforwar. The first step is therefore to construct the sets F R }. However, this is much more involve than for the RWM-within-Gibbs in Section A.. Thus, it will be more convenient to construct the sets F through the preliminary lemmas which lea to the proof of Theorems 4. an 4.. The next step will involve a Taylor series expansion of G V x ) to show that, for large, GV x ) is a goo approximation for G V x ). Thus, we begin by stuying log π Y ) π x ) ). Lemma A.9. There exists a sequence of sets F, R, with lim /3 π F C, )} = 0, such that, for χ i =, where fy log i )qyi,x i ) } fx i )qx i,y i ) = C 3 x i,z i) / + C 4 x i,z i) /3 + C 5 x i,z i) 5/6 + C 6 x i,z i ) + C 7 x i,z i,σ ), C 3 x i,z i ) = l 3 4 Z ig x i )g x i ) Z3 i g x i )}, an where C 4 x i,z i), C 5 x i,z i) an C 6 x i,z i) are polynomials in Z i an the erivatives of g. Furthermore, if E Z an E X enote expectation with Z N0,) an X having ensity f ), respectively, then A.3) E X E Z [C 3 X,Z)] = E X E Z [C 4 X,Z)] = E X E Z [C 5 X,Z)] = 0, whereas A.4) E X E Z [C 3 X,Z) ] = l 6 K = E X E Z [C 6 X,Z)]. In aition, A.5) [ E x F fy log i )qy i,x i ) } fx i )qx i,y i ) i= } ] / χ i C 3 x i,z i ) cl6 K 0 as. i= Proof. With the exception of A.5) an the exact form of the sets F,, the lemma is prove in [7], Lemma.

30 P. NEAL AND G. ROBERTS For j = 4,5,6 an x R, set c j x) = E Z [C j x,z)] an v j x) = var Z C j x, Z)). The set F,,j = 3 k= F,,j,k, where } F,,j, = x ; C j x i ) E X [C j X)]} < 5/8, F,,j, = F,,j,3 = x ; x ; i= i= i= } V j x i ) E X [V j X)]} < 6/5, C j x i ) E X [C j X)]} < 6/5 }. Then for j = 4,5,6 an k =,,3, it is straightforwar, using Markov s inequality an conitions 4.) an 4.), to show that /3 π F C,,j,k) 0 as. Cf. [7], Lemma, where only the cases k =, are require.) Finally, let F,,7 R } correspon to the sets F n,7 } constructe in [7], Lemma, an so, /3 π F C, ) 0 as, where F, = 7 j=4 F,,j. The proof of A.5) is then essentially the same as the proof of the final expression in [7], Lemma, an, hence, the etails are omitte. The next step is to fin a convenient approximation for G V x ) which effectively allows us to consier separately the first component an the remaining ) components. Lemma A.0. Let G V x ) = c /3 E [V Y ) V x )]E [e B ], where B = B x )) = i= gyi ) gx i )) x σ i Y i σ g Yi )) Yi x i σ g x i )) }. There exists sets F, R with lim /3 π F, C ) = 0 such that, for any V Cc, x F, G V x ) G V x ) 0 as. Moreover, A.6) [ E π Y )qy,x ) ) x F, π x )qx,y ) ] e B ) 0 as.

OPTIMAL SCALING FOR MCMC 3 Proof. Since, conitional upon χ, Y an Y are inepenent, it follows that G V x ) = c /3 E [V Y ) V x ))e B )]. The lemma then follows by ientical arguments to those use in [7], Theorem 3, with the sets F, } chosen to correspon to the sets S n } in [7], Theorem 3. Lemma A.. Let F,3 = x ;g x ) / } then /3 π F,3 C ) 0 as an for any V Cc, /3 c E [V Y ) V x )] c l g x )V x ) + V x )} 0 x F,3 with x = x. as, Proof. The proof is ientical to [7], Lemma an is, hence, omitte. We now focus on the remaining ) components. First we introuce the following notation. Let ax) = 4 g x)g x) an bx) = g x). Therefore, we have that Set C 3 x,z) = l 3 ax)z + bx)z 3 }. } Q x, ) = L χ i C 3 x i,z i ) χ =. Let φ x,t) = R expitw)q w) an let φt) = exp t cl6 K ). i= Lemma A.. There exists a sequence of sets F,4 R such that: a) lim /3 π F,4 C )} = 0, b) for all t R, x F,4 φ x ;t) φt) 0 as, c) for all boune continuous functions r, ) Q x,y)ry) ry) exp y x F,4 R πcl 3 K R cl 6 K y 0 as,

3 P. NEAL AND G. ROBERTS ) [ }] x F E exp / χ i C 3x i,z i) cl6 K,4 i= Φ cl 3 K ) 0 as. Proof. The sets F,4 are constructe as in the proof of [7], Lemma 3, an so, statement a) follows. Specifically, we let F,4 be the set of x R such that hx A.7) i ) hx)fx)x /4 A.8) i= hx i ) 3/4, i, for each of the functionals hx) = ax),bx),ax)bx),ax) 4,bx) 4, ax) 3 bx),ax) bx),ax)bx) 3. Since statements c) an ) follow from statement b) as outline in [7], Lemma 3, all that is require is to prove b). Let L = j;χ j =, j } an let )] it θj x j [exp ;t) = E C 3 x j,z j). Let [ } ] φ Λ it x ;t) = E exp C 3 x j,z j) L = Λ. j Λ Then since C 3 x j,z j)} j= are inepenent ranom variables, it follows that φ Λ x ;t) = θj x j ;t). j Λ Therefore, }] [ ] it φ x ;t) = E [exp χ jc 3 x j,z j ) = E θjx j;t) j= j L an so, A.9) [ } x F φ x ;t) E ] t,4 vx j) j L [ E θ x F j x j ;t) } ] t,4 vx j ), j L j L

OPTIMAL SCALING FOR MCMC 33 where vx j ) = var ZC 3 x j,z)) = l6 ax j ) + 6ax j )bx j ) + 5bx j ) }. The right-han sie of A.9) converges to 0 as by arguments similar to those use in [7], Lemma 3. Hence, the etails are omitte. Now by using a Taylor series expansion for exp t j= χ j vx j )), it is trivial to show that [ } ] E t vx j) A.30) x F,4 j L E [exp j= }] t χ jvx j) 0 as, since for all x F,4, j= vx j ) 0 as cf. [7], Lemma 3). The final step to complete the proof of statement b) is to show that [ x F E exp χ j,4 j= t vx j) )] ) exp t cl6 K 0 as. This follows immeiately, since using Chebyshev s inequality, we can show that, for all ε > 0, P χ t ) x F j,4 vx j ) t cl6 K > ε 0 as. j= Thus, statement b) is prove an the lemma follows. We are now in position to prove Theorems 4. an 4.. Proof of Theorem 4.. The theorem follows from A.5), A.6) an part ) of Lemma A.. Proof of Theorem 4.. We take F = F, F, F,3 F,4. Then an so, for fixe T, /3 π F C ) 0 as, PΓ t F,0 t T) as. Also, from Lemmas A.9 A., it follows that x F G V x ) GV x ) 0 as for all V Cc, which epen only on the first coorinate. Therefore, the weak convergence follows by [4], Chapter 4, Corollary 8.7, since Cc separates points an an ientical argument to that of Theorem 3. can be use to emonstrate compact containment. The maximizing of h c l) is straightforwar using the proof of [7], Theorem.

34 P. NEAL AND G. ROBERTS A.3. Proofs of Section 5. The proof of Theorem 5. is very similar to the proof of Theorem 3. given in Section A.. First, for x R, let x = x,x,...), x = i=3 x i an let x = lim x, shoul the limit exist. For x R, let x R be such that x = x,x,...,x ) [= x,x,...,x ), say], that is, x comprises the first components of x. Then let G be the iscrete-time) generator of X, an let V Cc be an arbitrary test function of x,x an x only. Thus, [ G V x ) = E V Y ) V x )) π Y ) π x ) The generator G of the three-imensional iffusion escribe in Theorem 5., for an arbitrary test function V of x,x an x, is given by )) GV x ) = cl cl Φ ρ i= ρ x i x) x i V x ) + x i }]. } V x ). We shall efine sets F R ; } such that for PX F C ) 0 as. This is one in Lemma A.3 an, thus, we can restrict attention to x F. Furthermore, Lemma A.3 ensures that, for all x F, lim x exists. Therefore, since we can restrict attention to x F, we aim to show that A.3) G V x ) GV x ) 0 as, x F which is prove in Theorem A.7 an then Theorem 5. follows trivially. Then efine sets F R ; } such that for PX F C ) 0 as. This is one in Lemma A.3 an, thus, we can restrict attention to x F. Lemma A.3. For k 5, efine the sequence of sets F,k R ; } by F, = x ; R x ) ρ) < /8 }, F, = x ; x x < /8 }, } F,3 = x ; max i x i < /8, F,4 = x ; x i ) < }, /8 i=

OPTIMAL SCALING FOR MCMC 35 ) F,5 = x 4 } ; ρ x i + θ x j < /8, i= j= where R x ) = i= x i j= x j ) an θ = Let F = 5 k= F,k, then A.3) PX F C ) 0 as. Proof. It is sufficient to show that, for k 5, ρ + )ρ )ρ. PX F,k C ) 0 as. For the cases k =,3,4 an 5, it is straightforwar but teious using Markov s inequality to prove the result. Therefore, the etails are omitte. For the case k =, let X = i=3 Xi 3) an let X = lim X. Therefore, by construction see Section 5), for all 3, ) 0 ) X )) N, + 3)ρ) X 0 ρ. ρ ρ Thus, A.33) ) X X )ρ = x} N + 3)ρ x, ρ ρ). + 3)ρ Therefore, by Markov s inequality, A.34) PX F C, ) = P X X /8 ) E[ X X 4 ], an the result follows trivially from A.33) an A.34). The proceure now iffers slightly from that given in Section A.. We postpone the fining of a suitable Taylor series expansion for G V x ) an first give Lemmas A.4, A.5 an A.6, which mirror Lemmas A.4, A.6 an A.7, respectively. The proofs of the aforementione lemmas are similar to the proofs of the corresponing results in Section A. an, hence, the etails are omitte. Lemma A.4. For k, let λ k = λk x )) = χ i ρ x i + θ xj). i k j= Then for any ε > 0, P k λ k c ) x F ρ > ε 0 as, where, for any ranom variable X an any subset A R, P k X A) = PX A χ k = ) an E k [X] = E[X χ k = ].

36 P. NEAL AND G. ROBERTS For z R, let h k z)= h k z;x )) = log π Y ) π x ) } Zk = z}. The role of h k z) is similar to that playe by h z) in Section A., with h k 0) equivalent to B cf. Lemma A.3). Lemma A.5. For an k, let A k = Ak x )) = σ i k χ i ρ x i + θ j= x j )Z i. Then, A.35) an x F E k [ eak ] E k [ e hk 0) ] 0 as, E k ;A k [eak < 0] E k 0) [ehk ;h k 0) < 0] 0 as. x F Lemma A.6. For k, E k [ eak ] Φ l ) c A.36) 0 as ρ an A.37) x F E k [eak ;A k < 0] Φ l ) c 0 as. ρ x F We are now in position to prove A.3). Theorem A.7. G V x ) GV x ) 0 as, x F cl ρ) Proof. Note that, for all 3, we have the following Taylor series expansion, for V : V Y ) V x ) = σ χ Z V x ) + χ x Z V x ) x + σ + χ Z x i=3 ) } χ i Z i V x ) x V x ) + χ Z x V x ) + χ χ Z Z x x V x )