Simultaneous drift conditions for Adaptive Markov Chain Monte Carlo algorithms


Yan Bai

Feb 2009; Revised Nov 2009

Abstract

In this paper we mainly study ergodicity of adaptive MCMC algorithms. Assume that, under some regularity conditions on the target distribution, all the MCMC samplers in {P_γ : γ ∈ Y} simultaneously satisfy a group of drift conditions and share a uniform small set C in the sense of the m-step transition, so that each MCMC sampler converges to the target at a polynomial rate. We then say that the family {P_γ : γ ∈ Y} is simultaneously polynomially ergodic. Suppose that Diminishing Adaptation and simultaneous polynomial ergodicity hold. We find that either when the number of drift conditions is greater than or equal to two, or when there is a single drift condition of some specific form, the adaptive MCMC algorithm is ergodic. We also discuss some recent results related to this topic, and show that under an additional condition, Containment is necessary for ergodicity of adaptive MCMC.

Keywords: adaptive MCMC, ergodicity, simultaneous drift condition, simultaneous polynomial ergodicity

1 Introduction

Markov chain Monte Carlo (MCMC) algorithms are widely used for approximately sampling from complicated probability distributions. However, the method has limitations, because it is often necessary to tune the scaling and other parameters before the algorithm converges efficiently. Adaptive MCMC methods using regeneration times and other elaborate constructions were proposed in [8, 6, and elsewhere]. More recently, adaptive MCMC methods have been used to improve convergence via a self-tuning procedure. In this direction, a key step was made by Haario et al. [9], who proposed an adaptive Metropolis algorithm attempting to optimize the proposal distribution, and proved that a particular version of this algorithm converges weakly to the target distribution.
Andrieu and Robert [2] observed that the algorithm of [9] can be viewed as a version of the Robbins–Monro stochastic control algorithm [see 12]. The results of [9] were then generalized by [4, 1], proving convergence of more general adaptive MCMC algorithms. Roberts and Rosenthal [13] used a coupling method to show that ergodicity of an adaptive MCMC algorithm is implied by Containment and Diminishing Adaptation, based on the basic

(Author's address: Department of Statistics, University of Toronto, Toronto, ON M5S 3G3, CA. yanbai@utstat.toronto.edu)

assumptions: each transition kernel P_γ, on the state space X with adaptive parameter space Y (γ ∈ Y), is φ-irreducible and aperiodic, and admits the nontrivial unique finite invariant probability measure π. Diminishing Adaptation is relatively easy to check, since the adaptive strategy is commonly designed by the user. However, Containment is rather abstract and hard to check. In their paper, they prove that simultaneous strongly aperiodic geometric ergodicity ensures Containment. Yang [14] and Atchadé and Fort [3] respectively tackle Open Problem 21 in [13], and both results are very interesting. [14] assumed that all the transition kernels simultaneously satisfy the drift condition P_γ V ≤ V − 1 + b I_C and that the adaptive parameter space is compact under some metric, and connected these assumptions with the regeneration decomposition to find a uniform bound on the distance ‖P^n_γ(x, ·) − π(·)‖_TV over all γ. Given this condition, the distance ‖P^n_γ(x, ·) − π(·)‖_TV can be uniformly bounded in terms of the test function, and the boundedness in probability of the test-function sequence {V(X_n) : n ≥ 0} then ensures Containment. In some situations, checking Containment directly is quite hard. [3] used a coupling method similar to that of [13] to prove an attractive and stronger result when the adaptive MCMC algorithm is restricted to Markovian Adaptation. [3] also assumed uniform strong aperiodicity, the simultaneous drift condition in the weakest form P_γ V ≤ V − 1 + b I_C, and uniform convergence on every sublevel set of the test function V(x). The idea is that once the chain enters some large sublevel set of the test function V(x), the coupling method for ergodicity applies. In Section 2 we introduce notation and conditions. In Section 3 we discuss the results of [14, 3]. In Section 4 we provide a necessary condition for ergodicity under an additional assumption. In Section 5 we present our conditions and main result.
2 Terminology

Consider a target distribution π(·) defined on the state space X with respect to some σ-field B(X) (π(x) is also used for the density function). Let {P_γ : γ ∈ Y} be the family of transition kernels of time-homogeneous Markov chains with common stationary distribution π, i.e. πP_γ = π for all γ ∈ Y. An adaptive MCMC algorithm Z := {(X_n, Γ_n) : n ≥ 0} is regarded as lying in the sample path space Ω := (X × Y)^∞ equipped with a σ-field F. For each initial state x ∈ X and initial parameter γ ∈ Y, there is a probability measure P_(x,γ) such that the probability of the event [Z ∈ A] is well-defined for any set A ∈ F. There is a filtration G := {G_n : n ≥ 0} such that Z is adapted to G. Denote the corresponding expectation by E_(x,γ)[f(X_n)] for measurable functions f : X → R. A framework for adaptive MCMC is defined as follows:

1. Given an initial state X_0 := x_0 ∈ X and a kernel P_{Γ_0} with Γ_0 := γ_0 ∈ Y, at each iteration n + 1, X_{n+1} is generated from P_{Γ_n}(X_n, ·);

2. Γ_{n+1} is obtained from some function of X_0, …, X_{n+1} and Γ_0, …, Γ_n.

Thus, for A ∈ B(X),

    P_(x_0,γ_0)(X_{n+1} ∈ A | G_n) = P_(x_0,γ_0)(X_{n+1} ∈ A | X_n, Γ_n) = P_{Γ_n}(X_n, A).   (1)

We say that the adaptive MCMC Z is ergodic if for any initial state x_0 ∈ X and any kernel index γ_0 ∈ Y, ‖P_(x_0,γ_0)(X_n ∈ ·) − π(·)‖_TV converges to zero, where ‖µ‖_TV = sup_{A ∈ B(X)} |µ(A)|. Obviously the joint distribution of X_{n+1} and Γ_{n+1} conditional on G_n is

    P_(x_0,γ_0)(X_{n+1} ∈ dx, Γ_{n+1} ∈ dγ | G_n) = P_{Γ_n}(X_n, dx) P_(x_0,γ_0)(Γ_{n+1} ∈ dγ | X_{n+1} = x, G_n).
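The two-step framework above (draw X_{n+1} from the current kernel, then update the parameter from the history) can be sketched concretely. The following toy example is not from the paper: the two-state kernel family and the particular adaptation rule (which pulls θ toward 1/2 at rate O(1/n), so the adaptation diminishes) are our own illustrative choices.

```python
import random

def kernel_step(x, theta, rng):
    """One step of the two-state kernel P_theta on X = {0, 1}:
    flip the state with probability theta, stay otherwise."""
    return 1 - x if rng.random() < theta else x

def adapt(theta, n):
    """Toy adaptation rule: move theta toward 1/2 at rate O(1/n),
    so successive kernels differ by O(1/n) (Diminishing Adaptation)."""
    return theta + (0.5 - theta) / (n + 2)

def adaptive_mcmc(x0, theta0, n_steps, seed=0):
    """Generic adaptive MCMC loop: X_{n+1} ~ P_{Gamma_n}(X_n, .),
    then Gamma_{n+1} is a function of the history (here, of n and Gamma_n)."""
    rng = random.Random(seed)
    x, theta = x0, theta0
    xs, thetas = [x], [theta]
    for n in range(n_steps):
        x = kernel_step(x, theta, rng)   # draw X_{n+1} from P_{Gamma_n}(X_n, .)
        theta = adapt(theta, n)          # update Gamma_{n+1}
        xs.append(x)
        thetas.append(theta)
    return xs, thetas

xs, thetas = adaptive_mcmc(x0=0, theta0=0.1, n_steps=10000)
# Every P_theta has the uniform distribution on {0,1} as stationary
# distribution, so the empirical frequency of state 1 should be near 1/2.
freq = sum(xs) / len(xs)
```

Here Γ_n is one-dimensional (θ) and the update uses only n and Γ_n, so this is in fact a Markovian Adaptation in the sense defined below.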

Furthermore, if Γ_{n+1} is simply a function of X_n and Γ_n, then the algorithm is said to use Markovian Adaptation, i.e. the joint process {(X_n, Γ_n) : n ≥ 0} is a time-inhomogeneous Markov chain.

Containment is defined as follows: for any X_0 = x_0 and Γ_0 = γ_0 and any ε > 0, the stochastic process {M_ε(X_n, Γ_n) : n ≥ 0} is bounded in probability under P_(x_0,γ_0), i.e. for all δ > 0 there is N ∈ N such that P_(x_0,γ_0)(M_ε(X_n, Γ_n) ≤ N) ≥ 1 − δ for all n, where

    M_ε(x, γ) = inf{n ≥ 1 : ‖P^n_γ(x, ·) − π(·)‖_TV ≤ ε}

is the ε-convergence function.

Diminishing Adaptation is defined as follows: for any X_0 = x_0 and Γ_0 = γ_0, lim_{n→∞} D_n = 0 in probability P_(x_0,γ_0), where D_n = sup_{x∈X} ‖P_{Γ_{n+1}}(x, ·) − P_{Γ_n}(x, ·)‖_TV represents the amount of adaptation performed between iterations n and n + 1.

For a non-negative function V, the process {V(X_n) : n ≥ 0} is bounded in probability if, given any (x_0, γ_0) ∈ X × Y and δ > 0, there exist N > 0 and M > 0 such that for n > N, P_(x_0,γ_0)(V(X_n) > M) < δ. Suppose that the parameter space Y is a metric space. Say the adaptive parameter process {Γ_n : n ≥ 0} is bounded in probability if for all x_0 ∈ X, γ_0 ∈ Y, and ε > 0, there exist N > 0 and some compact set B ⊂ Y such that for n > N, P_(x_0,γ_0)(Γ_n ∈ B^c) < ε.

3 Simultaneous Drift Conditions

Before studying the simultaneous drift conditions, we recall the result of [13]:

Theorem 1 ([13]). Ergodicity of an adaptive MCMC algorithm is implied by Containment and Diminishing Adaptation.

Remark 1. [13] gave one condition, simultaneous strongly aperiodic geometric ergodicity, which ensures Containment. In its definition, the drift conditions all have the same form: P_γ V(x) ≤ λV(x) + b I_C(x) for all γ ∈ Y. However, if {Γ_n : n ≥ 0} is bounded in probability, Containment is implied by requiring that, on each compact subset B of Y, the drift conditions have the same form: P_γ V_B(x) ≤ λ_B V_B(x) + b_B I_C(x). More generally, see the following corollary.

Corollary 1.
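The ε-convergence function M_ε can be computed exactly for simple finite-state kernel families. The sketch below (an illustration of the definition only; the two-state kernel family is our own choice, not from the paper) computes M_ε(x, θ) for the symmetric two-state kernel, for which ‖P^n_θ(x, ·) − π(·)‖_TV = 0.5·|1 − 2θ|^n with π uniform, so M_ε(x, θ) is finite for every fixed θ ∈ (0, 1):

```python
import numpy as np

def tv_to_uniform(row):
    """Total variation distance between a distribution on {0,1} and uniform."""
    return 0.5 * np.abs(row - 0.5).sum()

def convergence_time(theta, x, eps):
    """M_eps(x, theta) = inf{n >= 1 : ||P_theta^n(x,.) - pi||_TV <= eps}
    for the two-state kernel P_theta (flip with probability theta), whose
    stationary distribution pi is uniform on {0, 1}."""
    P = np.array([[1 - theta, theta], [theta, 1 - theta]])
    Pn = np.eye(2)
    for n in range(1, 10000):
        Pn = Pn @ P                      # n-step kernel P_theta^n
        if tv_to_uniform(Pn[x]) <= eps:
            return n
    raise RuntimeError("did not converge within the iteration budget")

m_half = convergence_time(0.5, 0, 0.01)   # theta = 1/2 mixes in one step
m_small = convergence_time(0.1, 0, 0.01)  # a sticky kernel takes longer
```

Containment asks that M_ε(X_n, Γ_n), evaluated along the adaptive run, stays bounded in probability; if the adaptation pushes θ toward 0 or 1, M_ε(·, θ) blows up and Containment can fail even though each fixed kernel is ergodic.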
Suppose that (i) the parameter space Y is a metric space; (ii) the adaptive parameter process {Γ_n : n ≥ 0} is bounded in probability; (iii) for any compact set K ⊂ Y and any ε > 0, the local ε-convergence time process

    { M̃_ε(X_n) := inf{m ∈ N⁺ : sup_{γ∈K} ‖P^m_γ(X_n, ·) − π(·)‖_TV < ε} : n ≥ 0 }

is bounded in probability; and (iv) Diminishing Adaptation holds. Then the adaptive MCMC algorithm {X_n : n ≥ 0} with adaptive parameter process {Γ_n : n ≥ 0} is ergodic.

The proof is straightforward and omitted.

[14] gave the following conditions to tackle Open Problem 21 in [13]:

Y1: there exist a constant δ > 0, a set C ∈ B(X), and probability measures ν_γ(·) for γ ∈ Y, such that P_γ(x, ·) ≥ δ I_C(x) ν_γ(·) for all γ ∈ Y;

Y2: all kernels uniformly satisfy the weakest drift condition P_γ V ≤ V − 1 + b I_C, where V : X → [1, ∞) and π(V) < ∞;

Y3: Y is compact under the metric d(γ_1, γ_2) = sup_{x∈X} ‖P_{γ_1}(x, ·) − P_{γ_2}(x, ·)‖_TV;

Y4: the stochastic process {V(X_n) : n ≥ 0} is bounded in probability.

Theorem 2 ([14]). Suppose Diminishing Adaptation holds. Then conditions (Y1)–(Y4) ensure the ergodicity of adaptive MCMC algorithms.

Remark 2. 1. In the proof of Theorem 2, (Y1) and (Y2) together ensure that each transition kernel is ergodic with stationary distribution π. (Y3) and (Y4) imply that the total variation distance between P^n_γ(x, ·) and π converges to zero uniformly over Y.

2. The condition π(V) < ∞ is a relatively strong condition. For each P_γ, suppose that {X^(γ)_n : n ≥ 0} is a time-homogeneous Markov chain with transition kernel P_γ. For any recurrent set A ⊂ X with π(A) > 0, by Meyn and Tweedie [11] (MT) Proposition 10.4.9,

    π(V) = ∫_A π(dy) E_γ[ Σ_{i=1}^{τ_A} V(X^(γ)_i) | X^(γ)_0 = y ].

Assume that there exists a small set C_1 ⊂ X with sup_{x∈C_1} E_γ[ Σ_{i=0}^{τ_{C_1}−1} V(X^(γ)_i) | X^(γ)_0 = x ] < ∞, and denote U_γ(x) = E_γ[ Σ_{i=0}^{σ_{C_1}} V(X^(γ)_i) | X^(γ)_0 = x ]. Hence, by MT Theorem 11.3.5, P_γ U_γ ≤ U_γ − V(x) + b_1 I_{C_1}, where b_1 = sup_{x∈C_1} E_γ[ Σ_{i=0}^{τ_{C_1}−1} V(X^(γ)_i) | X^(γ)_0 = x ]. Suppose that there is a test function V_1 satisfying P_γ V_1 ≤ V_1 − V + b I_{C_1} and V_1(x) I_{C_1}(x) ≤ V(x). By MT Proposition 11.3.2, U_γ(x) ≤ V_1(x). So P_γ U_γ ≤ U_γ − 1 + b I_{C_1}. We will instead study the simultaneous drift condition of the form P_γ V_1 ≤ V_1 − V_0 + b I_C, where the test functions V_0(x) and V_1(x) are the same for every P_γ. In this situation, condition (Y3) is unnecessary, and condition (Y4) is implied (see Theorem 6, Remark 7).

[3] gave the following conditions to study the ergodicity of adaptive MCMC with Markovian Adaptation:

M1: there exist a probability measure ν(·), a constant δ > 0, and a set C ∈ B(X) such that P_γ(x, ·) ≥ δ I_C(x) ν(·) for all γ ∈ Y;

M2: there exist a measurable function V : X → [1, ∞) and a constant b > 0 such that for any γ ∈ Y, (P_γ V)(x) ≤ V(x) − 1 + b I_C(x);

M3: for any sublevel set D_l = {x ∈ X : V(x) ≤ l} of V, lim_{n→∞} sup_{D_l×Y} ‖P^n_γ(x, ·) − π(·)‖_TV = 0.
Theorem 3 ([3]). Suppose Diminishing Adaptation holds. Then conditions (M1)–(M3) imply the ergodicity of adaptive MCMC algorithms with Markovian Adaptation.

Remark 3. 1. Since

    P_(x_0,γ_0)(V(X_n) > M) ≤ π(D^c_M) + ‖P_(x_0,γ_0)(X_n ∈ ·) − π(·)‖_TV,

M can be taken large enough that π(D^c_M) < ε. (M1)–(M3) and Diminishing Adaptation imply that the right-hand side of the above inequality converges to zero. So {V(X_n) : n ≥ 0} is bounded in probability.

2. In Section 4 we show that, under a certain condition, Containment is a necessary condition for ergodicity of adaptive MCMC provided that (M3) holds. From another point of view, the proof in [3] does apply the coupling method to check Containment by using Diminishing Adaptation and the simultaneous drift conditions.

4 The necessary condition for ergodicity

Before showing our result, let us discuss one simple example, which was studied in [5].

Example 1. Let the state space be X = {1, 2} and the transition kernel

    P_θ = [ 1−θ   θ  ]
          [  θ   1−θ ].

Obviously, for each θ ∈ (0, 1), the stationary distribution is uniform on X. Consider a state-independent adaptation: for k = 1, 2, …, at each time n = 2k − 1 choose the transition kernel index θ_{n−1} = 1/2, and at each time n = 2k choose the transition kernel index θ_{n−1} = 1/n.

Remark 4. The chain converges to the target distribution, but neither Diminishing Adaptation nor Containment holds.

Remark 5. Obviously, the adaptive parameter process {θ_n : n ≥ 0} is not bounded in probability.

Example 1 shows that Containment is also not necessary. In the following theorem, we prove that under a certain additional condition similar to (M3), Containment is necessary for ergodicity of adaptive algorithms.

Theorem 4 (The necessity of Containment). Suppose that there exists an increasing sequence of sets D_k ⊂ X on the state space X such that for any k > 0,

    lim_{n→∞} sup_{D_k×Y} ‖P^n_γ(x, ·) − π(·)‖_TV = 0.   (2)

If the adaptive MCMC algorithm is ergodic, then Containment holds.

Proof: Fix ε > 0. For any δ > 0, take K > 0 such that π(D^c_K) < δ/2. For the set D_K, there exists M such that

    sup_{D_K×Y} ‖P^M_γ(x, ·) − π(·)‖_TV < ε.

Hence, for any (x_0, γ_0) ∈ X × Y, by the ergodicity of the adaptive MCMC {X_n}, there exists some N > 0 such that for n > N, |P_(x_0,γ_0)(X_n ∈ D^c_K) − π(D^c_K)| < δ/2. For (X_n, Γ_n) ∈ D_K × Y,

    [X_n ∈ D_K] = [(X_n, Γ_n) ∈ D_K × Y] ⊂ [M_ε(X_n, Γ_n) ≤ M].

Hence,

    P_(x_0,γ_0)(M_ε(X_n, Γ_n) > M) ≤ P_(x_0,γ_0)((X_n, Γ_n) ∈ (D_K × Y)^c)
        = P_(x_0,γ_0)(X_n ∈ D^c_K)
        ≤ |P_(x_0,γ_0)(X_n ∈ D^c_K) − π(D^c_K)| + π(D^c_K) < δ.

Therefore, Containment holds.
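Example 1 can be checked numerically. The sketch below (the kernel and adaptation schedule are exactly those of Example 1; the simulation parameters are our own) verifies that the amount of adaptation D_n = sup_x ‖P_{θ_{n+1}}(x, ·) − P_{θ_n}(x, ·)‖_TV does not vanish (it keeps returning near 1/2, so Diminishing Adaptation fails), while the chain itself still has empirical distribution near uniform:

```python
import random

def theta_at(n):
    """Kernel index used at time n in Example 1:
    theta_{n-1} = 1/2 at odd times n, and 1/n at even times n."""
    return 0.5 if n % 2 == 1 else 1.0 / n

# Both rows of P_theta and P_theta' differ by |theta' - theta| in TV,
# so D_n = |theta_at(n+2) - theta_at(n+1)|, which keeps returning near 1/2.
D = [abs(theta_at(n + 2) - theta_at(n + 1)) for n in range(1, 2000)]

# Simulate the adaptive chain: every odd step uses theta = 1/2, which fully
# randomizes the state, so the empirical distribution is still near uniform.
rng = random.Random(1)
x, counts = 0, [0, 0]          # x = 0, 1 encodes states 1, 2
for n in range(1, 20001):
    if rng.random() < theta_at(n):
        x = 1 - x
    counts[x] += 1
freq = counts[1] / sum(counts)
```

Since θ_{n−1} = 1/n along even times, the ε-convergence time M_ε(x, θ_{n−1}) of the current kernel grows without bound along that subsequence, which is how Containment fails despite the convergence of the chain.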

Corollary 2. Suppose that the parameter space Y is a metric space, and that the adaptive scheme {Γ_n : n ≥ 0} is bounded in probability. Suppose that there exists an increasing sequence of sets (D_k, Y_k) ⊂ X × Y such that for any k > 0,

    lim_{n→∞} sup_{D_k×Y_k} ‖P^n_γ(x, ·) − π(·)‖_TV = 0.

If the adaptive MCMC algorithm is ergodic, then Containment holds.

Proof: Using the same technique as in Theorem 4, for large enough M > 0,

    P_(x_0,γ_0)(M_ε(X_n, Γ_n) > M) ≤ P_(x_0,γ_0)((X_n, Γ_n) ∈ (D_k × Y_k)^c)
        ≤ P_(x_0,γ_0)(X_n ∈ D^c_k) + P_(x_0,γ_0)(Γ_n ∈ Y^c_k)
        ≤ |P_(x_0,γ_0)(X_n ∈ D^c_k) − π(D^c_k)| + π(D^c_k) + P_(x_0,γ_0)(Γ_n ∈ Y^c_k).

Since {Γ_n : n ≥ 0} is bounded in probability, the result holds.

5 Simultaneous Polynomial Ergodicity

Although the ergodicity of adaptive MCMC algorithms is, to some degree, solved in [14] and [3], some properties related to simultaneous polynomial ergodicity remain unknown. In this section, we show that conditions (Y4) and (M3) are implied for adaptive MCMC with simultaneous polynomial ergodicity. Before studying it, let us recall the quantitative bound for a time-homogeneous Markov chain with polynomial convergence rate, due to Fort and Moulines [7].

Theorem 5 ([7]). Suppose that the time-homogeneous transition kernel P satisfies the following conditions: P is π-irreducible for an invariant probability measure π; there exist sets C ∈ B(X) and D ∈ B(X) with C ⊂ D and π(C) > 0, and an integer m ≥ 1, such that for any (x, x′) ∈ ∆ := (C × D) ∪ (D × C) and A ∈ B(X),

    P^m(x, A) ∧ P^m(x′, A) ≥ ρ_{x,x′}(A),   (3)

where ρ_{x,x′} is some measure on X for (x, x′) ∈ ∆, and ε_− := inf_{(x,x′)∈∆} ρ_{x,x′}(X) > 0. Let q ≥ 1. Suppose there exist measurable functions V_k : X → R_+ \ {0} for k ∈ {0, 1, …, q}, and, for k ∈ {0, 1, …, q − 1}, constants 0 < a_k < 1, b_k < ∞, and c_k > 0, such that π(V^β_q) < ∞ for some β ∈ (0, 1], and

    P V_{k+1}(x) ≤ V_{k+1}(x) − V_k(x) + b_k I_C(x),
    inf_{x∈X} V_k(x) ≥ c_k > 0,
    V_k(x) − b_k ≥ a_k V_k(x) for x ∈ D^c,   (4)
    sup_D V_q < ∞.

Then, for any x ∈ X and n ≥ m,

    ‖P^n(x, ·) − π(·)‖_TV ≤ min_{1≤l≤q} B^(β)_l(x, n),   (5)

with

    B^(β)_l(x, n) = ε_+ ⟨(I − A^(β)_m)^{−1} δ_x ⊗ π(W^β), e_l⟩ / [ S(l, n + 1 − m)^β + Σ_{j ≥ n+1−m} (1 − ε_+)^{⌊j/(n−m)⌋} ( S(l, j + 1)^β − S(l, j)^β ) ],

where ⟨·, ·⟩ denotes the inner product in R^{q+1}, {e_l : 0 ≤ l ≤ q} is the canonical basis of R^{q+1}, and I is the identity matrix; δ_x ⊗ π(W^β) := ∫∫ δ_x(dy) π(dy′) W^β(y, y′), where W^β(x, x′) := (W^β_0(x, x′), …, W^β_q(x, x′))^T with W_0(x, x′) := 1 and, for 1 ≤ l ≤ q,

    W_l(x, x′) = I_∆(x, x′) + I_{∆^c}(x, x′) ( ∏_{k=0}^{l−1} a_k )^{−1} (m(V_0))^{−1} ( V_l(x) + V_l(x′) ),

where m(V_0) := inf_{(x,x′)∈∆^c} {V_0(x) + V_0(x′)};

    S(0, k) := 1, and S(i, k) := Σ_{j=1}^k S(i − 1, j) for i ≥ 1;

    A^(β)_m := [ A^(β)_m(0)      0            ⋯        0           0
                 A^(β)_m(1)   A^(β)_m(0)     ⋯        0           0
                   ⋮             ⋮                     ⋮           ⋮
                 A^(β)_m(q−1) A^(β)_m(q−2)   ⋯    A^(β)_m(0)      0
                 A^(β)_m(q)   A^(β)_m(q−1)   ⋯    A^(β)_m(1)  A^(β)_m(0) ],

with entries

    A^(β)_m(l) := sup_{(x,x′)∈∆} S(l, m)^β ( 1 − ρ_{x,x′}(X) ) ∫∫ R_{x,x′}(x, dy) R_{x,x′}(x′, dy′) W^β_l(y, y′),

where the residual kernel is

    R_{x,x′}(u, dy) := ( 1 − ρ_{x,x′}(X) )^{−1} ( P^m(u, dy) − ρ_{x,x′}(dy) ),

and ε_+ := sup_{(x,x′)∈∆} ρ_{x,x′}(X).

Remark 6. In B^(β)_l(x, n): ε_+ depends on the set ∆ and the measure ρ_{x,x′}; the matrix (I − A^(β)_m)^{−1} depends on ∆, the transition kernel P, ρ_{x,x′}, and the test functions V_k; and δ_x ⊗ π(W^β) depends on ∆ and the test functions V_k. Consider the special case of the theorem in which ρ_{x,x′}(dy) = δν(dy), where ν is a probability measure with ν(C) > 0, and ∆ := C × C. Then:

1. ε_+ = ε_− = δ.

2. I − A^(β)_m is a lower triangular matrix, so (I − A^(β)_m)^{−1} =: (b^(β)_{ij})_{i,j=1,…,q+1} is also lower triangular, with b^(β)_{ii} = 1/(1 − A^(β)_m(0)), and, for fixed k ≥ 0, all entries b^(β)_{i,i−k} are equal. For i > j, b^(β)_{ij} is a polynomial combination of A^(β)_m(0), …, A^(β)_m(i − j) divided by (1 − A^(β)_m(0))^i. By some algebra, we can obtain, for instance, b^(β)_{21} = A^(β)_m(1)/(1 − A^(β)_m(0))^2.

So, by calculating B^(β)_1(x, n), we can get a quantitative bound of simple form: B^(β)_1(x, n) involves only the two test functions V_0(x) and V_1(x).

Remark 7. From Equation (4), V_0(x) ≥ b_0/(1 − a_0) > b_0 on D^c because 0 < a_0 < 1. Consider the drift condition P V_1 ≤ V_1 − V_0 + b_0 I_C. Since πP = π, π(V_0) ≤ b_0 π(C) ≤ b_0. Hence the V_0 in the theorem cannot be constant.

Remark 8. Without the condition π(V^β_q) < ∞, the bound in Equation (5) can still be obtained; however, it may be infinite. The subscript l of B^(β)_l(x, n), together with β, explains the polynomial rate: the relevant rate is S(l, n + 1 − m)^β = O((n + 1 − m)^{lβ}). Observe that B^(β)_l(x, n) involves the test functions V_0(x), …, V_l(x), and that lim sup_{n→∞} n^{βl} B^(β)_l(x, n) < ∞. The maximal rate of convergence is equal to qβ.

5.1 Conditions

The following conditions are derived from Theorem 5, with some changes so that they apply to adaptive MCMC algorithms. Say that the family {P_γ : γ ∈ Y} is simultaneously polynomially ergodic (S.P.E.) if conditions (A1)–(A4) are satisfied.

A1: each P_γ is φ_γ-irreducible with stationary distribution π(·);

Remark 9. By MT Proposition 10.1.2, if P_γ is φ_γ-irreducible, then P_γ is π-irreducible and the invariant measure π is a maximal irreducibility measure.

A2: there are a set C ⊂ X, an integer m ∈ N, a constant δ > 0, and probability measures ν_γ(·) on X such that π(C) > 0 and

    P^m_γ(x, ·) ≥ δ I_C(x) ν_γ(·) for γ ∈ Y;   (6)

Remark 10. In Theorem 5, condition Eq. (3) enables the splitting technique. Here we consider the special case of that condition with ρ_{x,x′}(dy) = δ ν_γ(dy) and ∆ = C × C. Thus, by Remark 6, the bound on ‖P^n_γ(x, ·) − π(·)‖_TV depends on C, m, the minorization constant δ, π(·), ν_γ, and the test functions V_l(x), so we assume that these are uniform over all the transition kernels.
A3: there are q ∈ N and measurable functions V_0, V_1, …, V_q : X → (0, ∞) with V_0 ≤ V_1 ≤ ⋯ ≤ V_q, such that for k = 0, 1, …, q − 1 there exist 0 < α_k < 1, b_k < ∞, and c_k > 0 with:

    P_γ V_{k+1}(x) ≤ V_{k+1}(x) − V_k(x) + b_k I_C(x), V_k(x) ≥ c_k for x ∈ X and γ ∈ Y;   (7)

    V_k(x) − b_k ≥ α_k V_k(x) for x ∈ X \ C;   (8)

    sup_{x∈C} V_q(x) < ∞.   (9)

Remark 11. For x ∈ C, ν_γ(V_l) ≤ δ^{−1} P^m_γ V_l(x) ≤ δ^{−1} ( sup_{x∈C} V_l(x) + m b_{l−1} ).

A4: π(V^β_q) < ∞ for some β ∈ (0, 1].

5.2 Main Result

Theorem 6. Suppose an adaptive MCMC algorithm satisfies Diminishing Adaptation. Then the algorithm is ergodic in any of the following cases:

(i) S.P.E. holds, and the number q of simultaneous drift conditions is strictly greater than two;

(ii) S.P.E. holds, the number q of simultaneous drift conditions is greater than or equal to two, and there exists an increasing function f : R_+ → R_+ such that V_1(x) ≤ f(V_0(x));

(iii) conditions (A1) and (A2) hold, and there exist positive constants c > 0, b > b′ > 0, α ∈ (0, 1), and a measurable function V : X → R_+ with V(x) ≥ 1 and sup_{x∈C} V(x) < ∞, such that

    P_γ V(x) ≤ V(x) − c V^α(x) + b I_C(x) for γ ∈ Y,   (10)

and c V^α(x) ≥ b′ for x ∈ C^c;

(iv) conditions (A1), (A2), and (A4) hold, and there exist constants b > b′ > 0 and two measurable functions V_0 : X → R_+ and V_1 : X → R_+ with 1 ≤ V_0(x) ≤ V_1(x) and sup_{x∈C} V_1(x) < ∞, such that

    P_γ V_1(x) ≤ V_1(x) − V_0(x) + b I_C(x) for γ ∈ Y,   (11)

V_0(x) ≥ b′ for x ∈ C^c, and the process {V_1(X_n) : n ≥ 0} is bounded in probability.

The theorem combines Theorem 7, Theorem 8, Theorem 9, and Lemma 1. Theorem 9 shows that {V(X_n) : n ≥ 0} in case (iii) is bounded in probability. Case (iii) is a special case of S.P.E. with q = 1, so the uniform quantitative bound of ‖P^n_γ(x, ·) − π(·)‖_TV over γ ∈ Y exists.

Remark 12. For part (iii), (A4) is implied by MT Theorem 14.3.7 with β = α.

5.3 Proof of Theorem 6

Lemma 1. Suppose that the family {P_γ : γ ∈ Y} is S.P.E. If the stochastic process {V_l(X_n) : n ≥ 0} is bounded in probability for some l ∈ {1, …, q}, then Containment is satisfied.

Proof: We use the notation of Theorem 5. From S.P.E., for γ ∈ Y, let ρ_{x,x′}(dy) = δ ν_γ(dy) (so ρ_{x,x′}(X) = δ) and ∆ := C × C. Then ε_+ = ε_− = δ. Note that the matrix I − A^(β)_m is lower triangular. Denote (I − A^(β)_m)^{−1} := (b^(β)_{ij})_{i,j=0,…,q}.
By the definition of B^(β)_l(x, n),

    B^(β)_l(x, n) = ε_+ [ Σ_{k=0}^l b^(β)_{lk} ∫ π(dy) W^β_k(x, y) ] / [ S(l, n + 1 − m)^β + Σ_{j ≥ n+1−m} (1 − ε_+)^{⌊j/(n−m)⌋} ( S(l, j + 1)^β − S(l, j)^β ) ]
        ≤ ( ε_+ / S(l, n + 1 − m)^β ) Σ_{k=0}^l b^(β)_{lk} ∫ π(dy) W^β_k(x, y).

By some algebra, for k = 1, …, q,

    ∫ π(dy) W^β_k(x, y) ≤ 1 + ( m(V_0) ∏_{i=0}^{k−1} a_i )^{−β} [ V^β_k(x) + π(V^β_k) ]   (12)

because β ∈ (0, 1]. In addition, m(V_0) ≥ c_0, so the coefficient of the second term on the right-hand side is finite. By induction, we obtain that b^(β)_{10} = A^(β)_m(1)/(1 − A^(β)_m(0))^2 and b^(β)_{11} = 1/(1 − A^(β)_m(0)). It is easy to check that 0 < b^(β)_{11} ≤ 1/δ. By some algebra,

    A^(β)_m(1) = m^β sup_{(x,x′)∈C×C} (1 − δ) ∫∫ R_{x,x′}(x, dy) R_{x,x′}(x′, dy′) W^β_1(y, y′)
        ≤ m^β sup_{(x,x′)∈C×C} [ 1 + (a_0 m(V_0))^{−β} ( P^m_γ V^β_1(x) + P^m_γ V^β_1(x′) ) ]
        ≤ m^β [ 1 + 2 (a_0 m(V_0))^{−β} ( sup_{x∈C} V_1(x) + m b_0 ) ],

because P^m_γ V^β_1(x) ≤ P^m_γ V_1(x) ≤ V_1(x) + m b_0. Therefore, b^(β)_{10} is bounded from above by some value independent of γ. Thus,

    B^(β)_1(x, n) ≤ ( δ / S(1, n + 1 − m)^β ) ( b^(β)_{10} ∫ π(dy) W^β_0(x, y) + b^(β)_{11} ∫ π(dy) W^β_1(x, y) )
        ≤ ( δ / (n + 1 − m)^β ) ( b^(β)_{10} + b^(β)_{11} [ 1 + (a_0 m(V_0))^{−β} ( V^β_1(x) + π(V^β_1) ) ] ).

Therefore, the boundedness in probability of the process {V_1(X_k) : k ≥ 0} implies that the random sequence B^(β)_1(X_n, n) converges to zero in probability, uniformly over γ ∈ Y. Containment holds.

Let {Z_j : j ≥ 0} be an adaptive sequence of positive random variables; for each j, Z_j is a fixed Borel-measurable function of X_j. Let τ_n denote any stopping time of the process {X_i : i ≥ 0} starting from time n, i.e. [τ_n = i] ∈ σ(X_k : k = 1, …, n + i) and P(τ_n < ∞) = 1.

Lemma 2 (Dynkin's formula for adaptive MCMC). For m > 0 and n > 0,

    E[ Z_{τ̄_{m,n}} | X_m, Γ_m ] = Z_m + E[ Σ_{i=1}^{τ̄_{m,n}} ( E[Z_{m+i} | F_{m+i−1}] − Z_{m+i−1} ) | X_m, Γ_m ],

where τ̄_{m,n} := min( n, τ_m, inf{k ≥ 0 : Z_{m+k} ≥ n} ).

Proof:

    Z_{τ̄_{m,n}} = Z_m + Σ_{i=1}^{τ̄_{m,n}} ( Z_{m+i} − Z_{m+i−1} ) = Z_m + Σ_{i=1}^n I(τ̄_{m,n} ≥ i) ( Z_{m+i} − Z_{m+i−1} ).

Since [τ̄_{m,n} ≥ i] is measurable with respect to F_{m+i−1},

    E[ Z_{τ̄_{m,n}} | X_m, Γ_m ] = Z_m + E[ Σ_{i=1}^n E[ Z_{m+i} − Z_{m+i−1} | F_{m+i−1} ] I(τ̄_{m,n} ≥ i) | X_m, Γ_m ]
        = Z_m + E[ Σ_{i=1}^{τ̄_{m,n}} ( E[Z_{m+i} | F_{m+i−1}] − Z_{m+i−1} ) | X_m, Γ_m ].

Lemma 3 (Comparison Lemma for adaptive MCMC). Suppose that there exist two sequences of positive functions {s_j, f_j : j ≥ 0} on X such that

    E[ Z_{j+1} | F_j ] ≤ Z_j − f_j(X_j) + s_j(X_j).   (13)

Then, for a stopping time τ_n of the adaptive MCMC {X_i : i ≥ 0} starting from time n,

    E[ Σ_{j=0}^{τ_n−1} f_{n+j}(X_{n+j}) | X_n, Γ_n ] ≤ Z_n(X_n) + E[ Σ_{j=0}^{τ_n−1} s_{n+j}(X_{n+j}) | X_n, Γ_n ].

Proof: The result follows from Lemma 2 and Eq. (13).

The following proposition relates the moments of the hitting time to the test-function-modulated moments for adaptive MCMC algorithms with S.P.E.; it is derived from the corresponding result for Markov chains in [10, Theorem 3.2]. Define the first return time and the i-th return time to the set C from time n, respectively, by:

    τ_{n,C} := τ_{n,C}(1) := min{ k ≥ 1 : X_{n+k} ∈ C }   (14)

and

    τ_{n,C}(i) := min{ k > τ_{n,C}(i − 1) : X_{n+k} ∈ C } for n ≥ 0 and i > 1.   (15)

Proposition 1. Consider an adaptive MCMC {X_i : i ≥ 0} with adaptive parameter process {Γ_i : i ≥ 0}. If the family {P_γ : γ ∈ Y} is S.P.E., then there exist constants {d_i : i = 0, …, q − 1} such that at time n, for k = 1, …, q,

    (c_{q−k}/k) E[ τ^k_{n,C} | X_n, Γ_n ] ≤ E[ Σ_{i=0}^{τ_{n,C}−1} (i + 1)^{k−1} V_{q−k}(X_{n+i}) | X_n, Γ_n ] ≤ d_{q−k} ( V_q(X_n) + Σ_{i=1}^k b_{q−i} I_C(X_n) ),

where the test functions {V_i(·) : i = 0, …, q}, the set C, {c_i : i = 0, …, q − 1}, and {b_i : i = 0, …, q − 1} are as in the definition of S.P.E.

Proof: Since V_{q−k}(x) ≥ c_{q−k} on X, and

    Σ_{i=0}^{τ_{n,C}−1} (i + 1)^{k−1} ≥ ∫_0^{τ_{n,C}} x^{k−1} dx = τ^k_{n,C}/k,

we have

    E[ Σ_{i=0}^{τ_{n,C}−1} (i + 1)^{k−1} V_{q−k}(X_{n+i}) | X_n, Γ_n ] ≥ (c_{q−k}/k) E[ τ^k_{n,C} | X_n, Γ_n ].   (16)

So the first inequality holds. Consider k = 1. By S.P.E. and Lemma 3,

    E[ Σ_{i=0}^{τ_{n,C}−1} V_{q−1}(X_{n+i}) | X_n, Γ_n ] ≤ V_q(X_n) + b_{q−1} I_C(X_n).   (17)

So the case k = 1 of the second inequality of the result holds. For i ≥ 0, by S.P.E.,

    E[ (i + 1)^{k−1} V_{q−k+1}(X_{n+i+1}) | X_{n+i}, Γ_{n+i} ] − i^{k−1} V_{q−k+1}(X_{n+i})
        ≤ (i + 1)^{k−1} ( V_{q−k+1}(X_{n+i}) − V_{q−k}(X_{n+i}) + b_{q−k} I_C(X_{n+i}) ) − i^{k−1} V_{q−k+1}(X_{n+i})
        ≤ −(i + 1)^{k−1} V_{q−k}(X_{n+i}) + d ( i^{k−2} V_{q−k+1}(X_{n+i}) + (i + 1)^{k−1} b_{q−k} I_C(X_{n+i}) )

for some positive d independent of i. By Lemma 3,

    E[ Σ_{i=0}^{τ_{n,C}−1} (i + 1)^{k−1} V_{q−k}(X_{n+i}) | X_n, Γ_n ] ≤ d E[ Σ_{i=0}^{τ_{n,C}−1} i^{(k−1)−1} V_{q−(k−1)}(X_{n+i}) | X_n, Γ_n ] + b_{q−k} I_C(X_n).   (18)

From the above inequality, by induction, the second inequality of the result holds.

Theorem 7. Suppose that the family {P_γ : γ ∈ Y} is S.P.E. with q > 2. Then Containment holds.

Proof: For k = 1, …, q, take M > 0 large enough that C ⊂ {x : V_{q−k}(x) ≤ M}. Then

    P_(x_0,γ_0)( V_{q−k}(X_n) > M ) = Σ_{i=0}^n P_(x_0,γ_0)( V_{q−k}(X_n) > M, τ_{i,C} > n − i, X_i ∈ C ) + P_(x_0,γ_0)( V_{q−k}(X_n) > M, τ_{0,C} > n, X_0 ∉ C ).

By Proposition 1, for i = 0, …, n,

    P_(x_0,γ_0)( V_{q−k}(X_n) > M, τ_{i,C} > n − i | X_i ∈ C )
        ≤ P_(x_0,γ_0)( Σ_{j=0}^{τ_{i,C}−1} (j + 1)^{k−1} V_{q−k}(X_{i+j}) > (n − i)^{k−1} M + c_{q−k} Σ_{j=0}^{n−i−1} (j + 1)^{k−1}, τ_{i,C} > n − i | X_i ∈ C )
        ≤ sup_{x∈C} E_(x_0,γ_0)[ E_(x_0,γ_0)[ Σ_{j=0}^{τ_{i,C}−1} (j + 1)^{k−1} V_{q−k}(X_{i+j}) | X_i, Γ_i ] | X_i = x ] / ( (n − i)^{k−1} M + c_{q−k} Σ_{j=0}^{n−i−1} (j + 1)^{k−1} )
        ≤ d_{q−k} ( sup_{x∈C} V_q(x) + Σ_{j=1}^k b_{q−j} I_C(x) ) / ( (n − i)^{k−1} M + c_{q−k} Σ_{j=0}^{n−i−1} (j + 1)^{k−1} ),

and
\[
P_{(x_0,\gamma_0)}\big(V_{q-k}(X_n) > M,\ \tau_{0,C} > n \,\big|\, X_0 \notin C\big) \le \frac{d_{q-k}\big(V_q(x_0) + \sum_{j=1}^{k} b_{q-j}\,\mathbb{I}_C(x_0)\big)}{n^{k-1} M + c_{q-k} \sum_{j=0}^{n-1} (j+1)^{k-1}}.
\]
By simple algebra,
\[
(n-i)^{k-1} M + c_{q-k} \sum_{j=0}^{n-i-1} (j+1)^{k-1} = O\big((n-i)^{k-1}(M + c_{q-k}(n-i))\big).
\]
Therefore,
\[
P_{(x_0,\gamma_0)}(V_{q-k}(X_n) > M) \le d_{q-k}\Big(\sup_{x \in C \cup \{x_0\}} V_q(x) + \sum_{j=1}^{k} b_{q-j}\Big) \Big( \sum_{i=0}^{n} \frac{P_{(x_0,\gamma_0)}(X_i \in C)}{(n-i)^{k-1}\big(M + c_{q-k}(n-i)\big)} + \frac{\delta_{C^c}(x_0)}{n^{k-1}(M + c_{q-k}\, n)} \Big). \tag{19}
\]
Whenever $q \ge 2$, $k$ can be chosen to be $2$, and for $k \ge 2$ the summation on the right-hand side of Eq. (19) is finite for fixed $M$. But if $q = 2$, this only shows that the process $\{V_0(X_n) : n \ge 0\}$ is bounded in probability, which is not sufficient; this is why $q > 2$ is required for the result. Hence, taking $M > 0$ large enough makes the probability arbitrarily small, so the sequence $\{V_{q-2}(X_n) : n \ge 0\}$ is bounded in probability. By Lemma 1, Containment holds.

Remark 13. In the proof, only (A3) is used.

Remark 14. If $V_0$ is a nice (non-decreasing) function of $V_1$, then the sequence $\{V_1(X_n) : n \ge 0\}$ is bounded in probability. In Theorem 9 we discuss this situation for a certain simultaneous single polynomial drift condition.

Theorem 8. Suppose that $\{P_\gamma : \gamma \in \mathcal{Y}\}$ is S.P.E. for $q = 2$, and that there exists a strictly increasing function $f : \mathbb{R}^+ \to \mathbb{R}^+$ such that $V_1(x) \le f(V_0(x))$ for all $x \in \mathcal{X}$. Then Containment holds.

Proof: From Eq. (19), $\{V_0(X_n) : n \ge 0\}$ is bounded in probability. Since $V_1(x) \le f(V_0(x))$ and $f(\cdot)$ is strictly increasing,
\[
P_{(x_0,\gamma_0)}(V_1(X_n) > f(M)) \le P_{(x_0,\gamma_0)}(f(V_0(X_n)) > f(M)) = P_{(x_0,\gamma_0)}(V_0(X_n) > M).
\]
By the boundedness in probability of $\{V_0(X_n)\}$, for any $\epsilon > 0$ there exist $N > 0$ and $M > 0$ such that $P_{(x_0,\gamma_0)}(V_1(X_n) > f(M)) \le \epsilon$ for all $n > N$. Therefore $\{V_1(X_n) : n \ge 0\}$ is bounded in probability, and by Lemma 1, Containment is satisfied.

Consider the single polynomial drift condition (see [10]):
\[
P_\gamma V(x) \le V(x) - c V^{\alpha}(x) + b\, \mathbb{I}_C(x), \quad \text{where } 0 \le \alpha < 1.
\]
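To make the drift inequality concrete, here is a small self-contained check (our illustration, not from the paper): a hypothetical deterministic kernel on $[1,\infty)$ for which a single polynomial drift condition can be verified exactly on a grid. The kernel, the test function $V$, and the constants $c$, $b$, $\alpha$, and the small set $C$ are all illustrative choices.

```python
# Hedged illustration (not from the paper): verify the polynomial drift
#   P V(x) <= V(x) - c * V(x)**alpha + b * I_C(x)
# for a toy DETERMINISTIC kernel on [1, inf):  x -> max(1, x - x**alpha).
# With V(x) = x, we have P V(x) = max(1, x - x**alpha) exactly, so the
# condition holds with c = 1, b = 1, and small set C = {x : x - x**alpha < 1}.
alpha, c, b = 0.5, 1.0, 1.0

def kernel(x: float) -> float:
    """One deterministic transition of the toy chain (hypothetical)."""
    return max(1.0, x - x**alpha)

def V(x: float) -> float:
    """Illustrative test function V(x) = x, with V >= 1 on [1, inf)."""
    return x

# Check the drift inequality pointwise on a grid of states in [1, 100].
for j in range(1001):
    x = 1.0 + 99.0 * j / 1000.0
    PV = V(kernel(x))                 # deterministic kernel: P V(x) = V(kernel(x))
    in_C = 1.0 if x - x**alpha < 1.0 else 0.0  # indicator of the small set C
    assert PV <= V(x) - c * V(x)**alpha + b * in_C + 1e-9
print("polynomial drift condition verified on the grid")
```

For a stochastic kernel, the same check would require estimating $P_\gamma V(x)$ by Monte Carlo; the deterministic toy keeps the verification exact.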
The moments of the hitting time to the set $C$ then satisfy (see [10] for details): for any $1 \le \xi \le 1/(1-\alpha)$,
\[
E_x\Big[\sum_{i=0}^{\tau_C - 1} (i+1)^{\xi-1}\, V^{1-\xi(1-\alpha)}(X_i)\Big] \lesssim V(x) + b\, \mathbb{I}_C(x).
\]
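The weights $(i+1)^{\xi-1}$ here, as in Eq. (16), control hitting-time moments through the elementary comparison $\sum_{i=0}^{t-1}(i+1)^{\xi-1} \ge \int_0^{t} x^{\xi-1}\,dx = t^{\xi}/\xi$. A quick numerical sanity check of this inequality for integer exponents (our illustration, not from the paper):

```python
# Sanity check of the integral comparison behind the hitting-time bounds:
# for integers t >= 1 and k >= 1,
#   sum_{i=0}^{t-1} (i+1)**(k-1)  >=  integral_0^t x**(k-1) dx  =  t**k / k,
# because (i+1)**(k-1) >= x**(k-1) for all x in [i, i+1].
def weighted_sum(t: int, k: int) -> float:
    """Sum of the polynomial weights (i+1)^(k-1) for i = 0, ..., t-1."""
    return sum((i + 1) ** (k - 1) for i in range(t))

for k in (1, 2, 3, 4):
    for t in range(1, 200):
        assert weighted_sum(t, k) >= t**k / k
print("integral comparison holds for all tested t and k")
```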

The polynomial rate function is $r(n) = n^{\xi-1}$. If $\alpha = 0$ then $r(n)$ is constant, and in that case it is difficult to use the technique of Theorem 7 to prove that $\{V(X_n) : n \ge 0\}$ is bounded in probability. We therefore assume $\alpha \in (0, 1)$.

Proposition 2. Consider an adaptive MCMC $\{X_n : n \ge 0\}$ with adaptive scheme $\{\Gamma_n : n \ge 0\}$. Suppose that (A1) holds, and that there exist constants $c > 0$, $\infty > b > 0$, $\alpha \in (0,1)$, and a measurable function $V : \mathcal{X} \to \mathbb{R}^+$ with $V(x) \ge 1$ and $\sup_{x \in C} V(x) < \infty$ such that
\[
P_\gamma V(x) \le V(x) - c V^{\alpha}(x) + b\, \mathbb{I}_C(x) \quad \text{for all } \gamma \in \mathcal{Y}. \tag{20}
\]
Then for $1 \le \xi \le 1/(1-\alpha)$,
\[
E_{(x_0,\gamma_0)}\Big[\sum_{i=0}^{\tau_{n,C}-1} (i+1)^{\xi-1}\, V^{1-\xi(1-\alpha)}(X_{n+i}) \,\Big|\, X_n, \Gamma_n\Big] \le c_{\xi}(C)\big(V(X_n) + 1\big). \tag{21}
\]
Proof: The proof applies the techniques of Lemma 3.5 and Theorem 3.6 of [10].

Theorem 9. Suppose that (A2) and the conditions of Proposition 2 hold, and that there exists a constant $\tilde{b} > b$ such that $c V^{\alpha}(x) > \tilde{b}$ for $x \in C^c$. Then Containment holds.

Proof: Using the same techniques as in the proof of Theorem 7, we obtain
\[
P_{(x_0,\gamma_0)}\big(V^{1-\xi(1-\alpha)}(X_n) > M\big) \le c_{\xi}\Big(\sup_{x \in C \cup \{x_0\}} V(x) + 1\Big)\Big(\sum_{i=0}^{n} \frac{P_{(x_0,\gamma_0)}(X_i \in C)}{(n-i)^{\xi-1}(M + n - i)} + \frac{\delta_{C^c}(x_0)}{n^{\xi-1}(M + n)}\Big). \tag{22}
\]
Therefore, for $\xi \in (1, 1/(1-\alpha))$, the sequence $\{V^{1-\xi(1-\alpha)}(X_n) : n \ge 0\}$ is bounded in probability. Since $1 - \xi(1-\alpha) > 0$, the process $\{V(X_n) : n \ge 0\}$ is bounded in probability. By Lemma 1, Containment holds.

Acknowledgements

Special thanks are due to Professor Jeffrey S. Rosenthal; without his assistance this work would not have been completed. The author is also very grateful to G. Fort for her helpful suggestions.

References

[1] C. Andrieu and E. Moulines. On the ergodicity properties of some adaptive Markov chain Monte Carlo algorithms. Ann. Appl. Probab., 16(3):1462–1505, 2006.
[2] C. Andrieu and C.P. Robert. Controlled MCMC for optimal sampling. Preprint, 2001.
[3] Y.F. Atchadé and G. Fort. Limit theorems for some adaptive MCMC algorithms with subgeometric kernels. Preprint, 2008.
[4] Y.F. Atchadé and J.S. Rosenthal. On adaptive Markov chain Monte Carlo algorithms.
Bernoulli, 11(5):815–828, 2005.

[5] Y. Bai, G.O. Roberts, and J.S. Rosenthal. On the containment condition of adaptive Markov chain Monte Carlo algorithms. http://www.probability.ca/jeff/ftpdir/yanbai1.pdf, 2009.
[6] A.E. Brockwell and J.B. Kadane. Identification of regeneration times in MCMC simulation, with application to adaptive schemes. J. Comp. Graph. Stat., 14:436–458, 2005.
[7] G. Fort and E. Moulines. Computable bounds for subgeometrical and geometrical ergodicity. 2000.
[8] W.R. Gilks, G.O. Roberts, and S.K. Sahu. Adaptive Markov chain Monte Carlo through regeneration. J. Amer. Statist. Assoc., 93:1045–1054, 1998.
[9] H. Haario, E. Saksman, and J. Tamminen. An adaptive Metropolis algorithm. Bernoulli, 7:223–242, 2001.
[10] S.F. Jarner and G.O. Roberts. Polynomial convergence rates of Markov chains. Ann. Appl. Probab., 12(1):224–247, 2002.
[11] S.P. Meyn and R.L. Tweedie. Markov Chains and Stochastic Stability. Springer-Verlag, London, 1993.
[12] H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Stat., 22:400–407, 1951.
[13] G.O. Roberts and J.S. Rosenthal. Coupling and ergodicity of adaptive Markov chain Monte Carlo algorithms. J. Appl. Prob., 44:458–475, 2007.
[14] C. Yang. Recurrent and ergodic properties of adaptive MCMC. Preprint, 2008.