Simultaneous drift conditions for Adaptive Markov Chain Monte Carlo algorithms

Yan Bai

Feb 2009; Revised Nov 2009

Abstract

In this paper we mainly study ergodicity of adaptive MCMC algorithms. Assume that, under some regularity conditions on the target distribution, all the MCMC samplers in {P_γ : γ ∈ Y} simultaneously satisfy a group of drift conditions and admit a uniform small set C in the sense of the m-step transition, so that each MCMC sampler converges to the target at a polynomial rate. We then say that the family {P_γ : γ ∈ Y} is simultaneously polynomially ergodic. Suppose that Diminishing Adaptation and simultaneous polynomial ergodicity hold. We find that either when the number of drift conditions is greater than or equal to two, or when the number of drift conditions having some specific form is one, the adaptive MCMC algorithm is ergodic. We also discuss some recent results related to this topic, and show that under an additional condition, Containment is necessary for ergodicity of adaptive MCMC.

Keywords: adaptive MCMC, ergodicity, simultaneous drift condition, simultaneous polynomial ergodicity

1 Introduction

Markov chain Monte Carlo (MCMC) algorithms are widely used for approximately sampling from complicated probability distributions. However, the method has its limitations, because it is often necessary to tune the scaling and other parameters before the algorithm will converge efficiently. Adaptive MCMC methods using regeneration times and other complicated constructions have been proposed [8, 6, and elsewhere]. More recently, adaptive MCMC methods have been used to improve convergence via a self-tuning procedure. In this direction, a key step was made by Haario et al. [9], who proposed an adaptive Metropolis algorithm attempting to optimize the proposal distribution, and proved that a particular version of this algorithm correctly converges weakly to the target distribution.
Andrieu and Robert [2] observed that the algorithm of [9] can be viewed as a version of the Robbins-Monro stochastic control algorithm [see 12]. The results of [9] were then generalized by [4, 1], proving convergence of more general adaptive MCMC algorithms. Roberts and Rosenthal [13] use a coupling method for adaptive MCMC algorithms to show that ergodicity is implied by Containment and Diminishing Adaptation, based on the basic

Department of Statistics, University of Toronto, Toronto, ON M5S 3G3, CA. yanbai@utstat.toronto.edu
assumptions: each transition kernel P_γ, indexed by γ in the adaptive parameter space Y, is φ-irreducible and aperiodic on the state space X and admits π as its unique nontrivial invariant probability measure. Diminishing Adaptation is relatively easy to check, since the adaptive strategy is commonly designed by the user. Containment, however, is rather abstract and hard to verify. In their paper, they prove that simultaneous strongly aperiodic geometric ergodicity ensures Containment. Yang [14] and Atchadé and Fort [3] respectively tackle Open Problem 21 in [13]. Both results are very interesting. [14] assumed that all the transition kernels simultaneously satisfy the drift condition P_γ V ≤ V - 1 + b I_C, and that the adaptive parameter space is compact under some metric, and connected this with a regeneration decomposition to find a uniform bound on the distance ∥P_γ^n(x, ·) - π(·)∥_TV over all γ. Given these conditions, the distance ∥P_γ^n(x, ·) - π(·)∥_TV can be uniformly bounded in terms of the test function, and the boundedness in probability of the sequence {V(X_n) : n ≥ 0} then ensures Containment, which in some situations is quite hard to check directly. [3] use a coupling method similar to that in [13] to prove an attractive and stronger result when the adaptive MCMC algorithm is restricted to Markovian Adaptation. [3] also assumed uniform strong aperiodicity, a simultaneous drift condition in the weakest form P_γ V ≤ V - 1 + b I_C, and uniform convergence on any sublevel set of the test function V(x). The idea is that after the chain enters some large sublevel set of the test function V(x), the coupling method for ergodicity can be applied. In Section 2 we introduce the notation and some conditions. In Section 3 we discuss the results in [14, 3]. In Section 4 we provide a necessary condition for ergodicity under an additional condition. In Section 5 we present our conditions and results.
2 Terminology

Consider a target distribution π(·) defined on the state space X with respect to some σ-field B(X) (π(x) is also used for the density function). Let {P_γ : γ ∈ Y} be a family of transition kernels of time-homogeneous Markov chains, each with stationary distribution π, i.e. πP_γ = π for all γ ∈ Y. An adaptive MCMC algorithm Z := {(X_n, Γ_n) : n ≥ 0} is regarded as lying in the sample path space Ω := (X × Y)^∞ equipped with a σ-field F. For each initial state x ∈ X and initial parameter γ ∈ Y, there is a probability measure P_(x,γ) such that the probability of the event [Z ∈ A] is well-defined for any set A ∈ F. There is a filtration G := {G_n : n ≥ 0} such that Z is adapted to G. Denote the corresponding expectation by E_(x,γ)[f(X_n)] for measurable f : X → R. A framework of adaptive MCMC is defined as:

1. Given an initial state X_0 := x_0 ∈ X and a kernel P_{Γ_0} with Γ_0 := γ_0 ∈ Y, at each iteration n + 1, X_{n+1} is generated from P_{Γ_n}(X_n, ·);

2. Γ_{n+1} is obtained from some function of X_0, ..., X_{n+1} and Γ_0, ..., Γ_n.

For A ∈ B(X),

P_(x_0,γ_0)(X_{n+1} ∈ A | G_n) = P_(x_0,γ_0)(X_{n+1} ∈ A | X_n, Γ_n) = P_{Γ_n}(X_n, A).   (1)

We say that the adaptive MCMC Z is ergodic if, for any initial state x_0 ∈ X and any kernel index γ_0 ∈ Y, ∥P_(x_0,γ_0)(X_n ∈ ·) - π(·)∥_TV converges to zero, where ∥µ∥_TV = sup_{A ∈ B(X)} |µ(A)|. Obviously, the joint distribution of X_{n+1} and Γ_{n+1} conditional on G_n is

P_(x_0,γ_0)(X_{n+1} ∈ dx, Γ_{n+1} ∈ dγ | G_n) = P_{Γ_n}(X_n, dx) P_(x_0,γ_0)(Γ_{n+1} ∈ dγ | X_{n+1} = x, G_n).
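The two-step scheme above can be sketched as a generic simulation loop. The following is a minimal illustration of ours (the kernel family, the adaptation rule, and all names are made up for the example, not taken from the paper); the toy instance uses a two-point flip kernel and an adaptation rule with diminishing step sizes, so Diminishing Adaptation holds by construction:

```python
import random

def adaptive_mcmc(x0, gamma0, sample_kernel, adapt, n_steps):
    """Generic adaptive MCMC loop: draw X_{n+1} ~ P_{Gamma_n}(X_n, .),
    then update Gamma_{n+1} from the history, as in steps 1-2 above."""
    xs, gammas = [x0], [gamma0]
    for _ in range(n_steps):
        x_new = sample_kernel(gammas[-1], xs[-1])  # X_{n+1} ~ P_{Gamma_n}(X_n, .)
        xs.append(x_new)
        gammas.append(adapt(xs, gammas))           # Gamma_{n+1} = f(history)
    return xs, gammas

# Toy instance on the two-point space {0, 1}: P_gamma flips the state with
# probability gamma, and the adaptation nudges gamma toward 0.5 with
# diminishing step sizes (hypothetical choices, for illustration only).
def flip_kernel(gamma, x):
    return 1 - x if random.random() < gamma else x

def shrink_adapt(xs, gammas):
    n = len(gammas)
    return gammas[-1] + (0.5 - gammas[-1]) / (n + 1)

random.seed(0)
xs, gammas = adaptive_mcmc(0, 0.9, flip_kernel, shrink_adapt, 1000)
```

Here D_n = sup_x ∥P_{Γ_{n+1}}(x, ·) - P_{Γ_n}(x, ·)∥_TV = |Γ_{n+1} - Γ_n|, which tends to zero under this rule, matching the Diminishing Adaptation condition.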
Furthermore, if Γ_{n+1} is simply a function of X_n and Γ_n, then the adaptive MCMC algorithm is said to use Markovian Adaptation, i.e. the joint process {(X_n, Γ_n) : n ≥ 0} is a time-inhomogeneous Markov chain.

Containment is defined as follows: for any X_0 = x_0 and Γ_0 = γ_0, and for any ɛ > 0, the stochastic process {M_ɛ(X_n, Γ_n) : n ≥ 0} is bounded in probability P_(x_0,γ_0), i.e. for all δ > 0, there is N ∈ N such that P_(x_0,γ_0)(M_ɛ(X_n, Γ_n) ≤ N) ≥ 1 - δ for all n ∈ N, where M_ɛ(x, γ) = inf{n ≥ 1 : ∥P_γ^n(x, ·) - π(·)∥_TV ≤ ɛ} is the ɛ-convergence function.

Diminishing Adaptation is defined as follows: for any X_0 = x_0 and Γ_0 = γ_0, lim_n D_n = 0 in probability P_(x_0,γ_0), where D_n = sup_{x ∈ X} ∥P_{Γ_{n+1}}(x, ·) - P_{Γ_n}(x, ·)∥_TV represents the amount of adaptation performed between iterations n and n + 1.

For a non-negative function V, the process {V(X_n) : n ≥ 0} is bounded in probability if, given any (x_0, γ_0) ∈ X × Y and δ > 0, there exist N > 0 and M > 0 such that for all n > N, P_(x_0,γ_0)(V(X_n) > M) < δ. Suppose that the parameter space Y is a metric space. Say the adaptive parameter process {Γ_n : n ≥ 0} is bounded in probability if for all x_0 ∈ X, γ_0 ∈ Y, and ɛ > 0, there exist N > 0 and some compact set B ⊆ Y such that for all n > N, P_(x_0,γ_0)(Γ_n ∈ B^c) < ɛ.

3 Simultaneous Drift Conditions

Before studying the simultaneous drift conditions, we recall the result in [13]:

Theorem 1 ([13]). Ergodicity of an adaptive MCMC algorithm is implied by Containment and Diminishing Adaptation.

Remark 1. [13] gave one condition, simultaneous strongly aperiodic geometric ergodicity, which ensures Containment. In that definition, the drift conditions all have the same form: P_γ V(x) ≤ λV(x) + b I_C(x) for all γ ∈ Y. However, if {Γ_n : n ≥ 0} is bounded in probability, Containment is implied by drift conditions of the same form on each compact subset B of Y: P_γ V_B(x) ≤ λ_B V_B(x) + b_B I_C(x). More generally, see the following corollary.

Corollary 1.
Suppose that (i) the parameter space Y is a metric space; (ii) the adaptive parameter process {Γ_n : n ≥ 0} is bounded in probability; (iii) for any compact set K ⊆ Y and any ɛ > 0, the local ɛ-convergence time process

{M_ɛ(X_n) := inf{m ∈ N_+ : sup_{γ ∈ K} ∥P_γ^m(X_n, ·) - π(·)∥_TV < ɛ} : n ≥ 0}

is bounded in probability; (iv) Diminishing Adaptation holds. Then the adaptive MCMC algorithm {X_n : n ≥ 0} with adaptive parameter process {Γ_n : n ≥ 0} is ergodic.

The proof is trivial and omitted. [14] gave the following conditions to tackle Open Problem 21 in [13]:

Y1: there exist a constant δ > 0, a set C ∈ B(X), and a probability measure ν_γ(·) for each γ ∈ Y, such that P_γ(x, ·) ≥ δ I_C(x) ν_γ(·) for γ ∈ Y;

Y2: all kernels uniformly satisfy the weakest drift condition: P_γ V ≤ V - 1 + b I_C, where V : X → [1, ∞) and π(V) < ∞;
Y3: Y is compact under the metric d(γ_1, γ_2) = sup_{x ∈ X} ∥P_{γ_1}(x, ·) - P_{γ_2}(x, ·)∥_TV;

Y4: the stochastic process {V(X_n) : n ≥ 0} is bounded in probability.

Theorem 2 ([14]). Suppose Diminishing Adaptation holds. Then the conditions (Y1)-(Y4) ensure the ergodicity of adaptive MCMC algorithms.

Remark 2. 1. In the proof of Theorem 2, (Y1) and (Y2) together ensure that each transition kernel is ergodic to π, while (Y3) and (Y4) imply that the total variation distance between P_γ^n and π converges to zero uniformly on Y.

2. The condition π(V) < ∞ is a relatively strong condition. For each P_γ, suppose that {X_n^(γ) : n ≥ 0} is a time-homogeneous Markov chain with transition kernel P_γ. For any recurrent set A ⊆ X with π(A) > 0, by Meyn and Tweedie [11] (MT) Proposition 10.4.9,

π(V) = ∫_A π(dy) E_γ[Σ_{i=0}^{τ_A - 1} V(X_i^(γ)) | X_0^(γ) = y].

Assume that there exists a small set C_1 ⊆ X with sup_{x ∈ C_1} E_γ[Σ_{i=0}^{τ_{C_1} - 1} V(X_i^(γ)) | X_0^(γ) = x] < ∞, and denote U_γ(x) = E_γ[Σ_{i=0}^{σ_{C_1}} V(X_i^(γ)) | X_0^(γ) = x]. Hence, by MT Theorem 11.3.5, P_γ U_γ ≤ U_γ - V(x) + b_1 I_{C_1}, where b_1 = sup_{x ∈ C_1} E_γ[Σ_{i=0}^{τ_{C_1} - 1} V(X_i^(γ)) | X_0^(γ) = x]. Suppose that there is a test function V_1 satisfying P_γ V_1 ≤ V_1 - V + b I_{C_1} and V_1(x) I_{C_1}(x) ≤ V(x). By MT Proposition 11.3.2, U_γ(x) ≤ V_1(x). So P_γ U_γ ≤ U_γ - 1 + b_1 I_{C_1}. We will instead study the simultaneous drift condition of the form P_γ V_1 ≤ V_1 - V_0 + b I_C, where the test functions V_0(x) and V_1(x) are the same for every P_γ. In this situation, the condition (Y3) is unnecessary, and the condition (Y4) is implied (see Theorem 6, Remark 7).

[3] also gave the following conditions to study the ergodicity of adaptive MCMC with Markovian Adaptation:

M1: there exist a probability measure ν(·), a constant δ > 0, and a set C ∈ B(X) such that P_γ(x, ·) ≥ δ I_C(x) ν(·) for γ ∈ Y;

M2: there exist a measurable function V : X → [1, ∞) and a constant b > 0 such that for any γ ∈ Y, (P_γ V)(x) ≤ V(x) - 1 + b I_C(x);

M3: for any sublevel set D_l = {x ∈ X : V(x) ≤ l} of V, lim_n sup_{(x,γ) ∈ D_l × Y} ∥P_γ^n(x, ·) - π(·)∥_TV = 0.
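To make the drift conditions of the form in (Y2) and (M2) concrete, here is a small numeric sanity check of ours (the chain, V, C, and b are illustrative choices, not taken from [14] or [3]): a family of reflected random walks on the non-negative integers, each kernel drifting toward 0, verified against P_γ V ≤ V - 1 + b I_C on a finite range of states:

```python
def drift_check(gammas, b=3.0, c=5.0, xmax=200):
    """Numerically check P_gamma V(x) <= V(x) - 1 + b*I_C(x) for each gamma,
    where P_gamma(x, x+1) = gamma, P_gamma(x, x-1) = 1 - gamma for x >= 1,
    P_gamma(0, 1) = gamma, P_gamma(0, 0) = 1 - gamma, V(x) = c*x + 1,
    and the small set is C = {0}. (All constants are illustrative.)"""
    def V(x):
        return c * x + 1.0
    for g in gammas:
        for x in range(xmax + 1):
            if x == 0:
                pv = g * V(1) + (1 - g) * V(0)        # reflecting at zero
            else:
                pv = g * V(x + 1) + (1 - g) * V(x - 1)
            # drift inequality, with a small tolerance for float rounding
            if pv > V(x) - 1.0 + (b if x == 0 else 0.0) + 1e-9:
                return False
    return True
```

With c = 5 the one-step drift off C is c(2γ - 1) ≤ -1 whenever γ ≤ 0.4, so the condition holds simultaneously for the whole family γ ∈ [0.1, 0.4], while it visibly fails for an outward-drifting kernel such as γ = 0.6.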
Theorem 3 ([3]). Suppose Diminishing Adaptation holds. Then the conditions (M1)-(M3) imply the ergodicity of adaptive MCMC algorithms with Markovian Adaptation.

Remark 3. 1. Since

P_(x_0,γ_0)(V(X_n) > M) ≤ π(D_M^c) + ∥P_(x_0,γ_0)(X_n ∈ ·) - π(·)∥_TV,

M can be taken large enough that π(D_M^c) < ɛ. (M1)-(M3) and Diminishing Adaptation imply that the right-hand side of the above inequality converges to zero. So {V(X_n) : n ≥ 0} is bounded in probability.

2. In Section 4 we show that, under a certain condition, Containment is a necessary condition for ergodicity of adaptive MCMC provided that (M3) holds. From another point of view, the proof in [3] does apply the coupling method to check Containment by using Diminishing Adaptation and the simultaneous drift conditions.
4 The necessary condition for ergodicity

Before showing our result, let us discuss one simple example, which was studied in [5].

Example 1. Let the state space be X = {1, 2} and the transition kernel

P_θ = [ 1-θ   θ
         θ   1-θ ].

Obviously, for each θ ∈ (0, 1), the stationary distribution is uniform on X. Consider a state-independent adaptation: for k = 1, 2, ..., at each time n = 2k - 1 choose the transition kernel index θ_{n-1} = 1/2, and at each time n = 2k choose the transition kernel index θ_{n-1} = 1/n.

Remark 4. The chain converges to the target distribution, but neither Diminishing Adaptation nor Containment holds.

Remark 5. Obviously, the adaptive parameter process {θ_n : n ≥ 0} is not bounded in probability.

Example 1 shows that Containment is also not necessary. In the following theorem, we prove that under a certain additional condition similar to (M3), Containment is necessary for ergodicity of adaptive algorithms.

Theorem 4 (The necessity of Containment). Suppose that there exists an increasing sequence of sets D_k ⊆ X on the state space X such that for any k > 0,

lim_n sup_{(x,γ) ∈ D_k × Y} ∥P_γ^n(x, ·) - π(·)∥_TV = 0.   (2)

If the adaptive MCMC algorithm is ergodic, then Containment holds.

Proof: Fix ɛ > 0. For any δ > 0, take K > 0 such that π(D_K^c) < δ/2. For the set D_K, there exists M such that

sup_{(x,γ) ∈ D_K × Y} ∥P_γ^M(x, ·) - π(·)∥_TV < ɛ.

Hence, for any (x_0, γ_0) ∈ X × Y, by the ergodicity of the adaptive MCMC {X_n}_n, there exists some N > 0 such that for all n > N,

P_(x_0,γ_0)(X_n ∈ D_K^c) - π(D_K^c) < δ/2.

For (X_n, Γ_n) ∈ D_K × Y,

[X_n ∈ D_K] = [(X_n, Γ_n) ∈ D_K × Y] ⊆ [M_ɛ(X_n, Γ_n) ≤ M].

Hence,

P_(x_0,γ_0)(M_ɛ(X_n, Γ_n) > M) ≤ P_(x_0,γ_0)((X_n, Γ_n) ∈ (D_K × Y)^c) = P_(x_0,γ_0)(X_n ∈ D_K^c) ≤ P_(x_0,γ_0)(X_n ∈ D_K^c) - π(D_K^c) + π(D_K^c) < δ.

Therefore, Containment holds.
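Returning to Example 1, the claims in Remarks 4 and 5 can be checked by exact computation. In the sketch below (ours, for illustration), the schedule amounts to θ_m = 1/2 for even m and θ_m = 1/(m+1) for odd m; we propagate the marginal law µ_{n+1} = µ_n P_{θ_n} with rational arithmetic, and track both the adaptation size D_n = |θ_{n+1} - θ_n| and the ɛ-convergence time of the frozen kernels:

```python
from fractions import Fraction

def theta(m):
    # Schedule of Example 1: theta_m = 1/2 for even m, 1/(m+1) for odd m.
    return Fraction(1, 2) if m % 2 == 0 else Fraction(1, m + 1)

def step(mu, th):
    # One step of the marginal law under P_theta = [[1-th, th], [th, 1-th]].
    return ((1 - th) * mu[0] + th * mu[1], th * mu[0] + (1 - th) * mu[1])

def m_eps(th, eps):
    # Epsilon-convergence time of the frozen kernel P_theta, using
    # ||P_theta^n(x, .) - pi||_TV = |1 - 2*theta|^n / 2 for pi uniform.
    n, gap = 1, abs(1 - 2 * th) / 2
    while gap > eps:
        n += 1
        gap *= abs(1 - 2 * th)
    return n

mu = (Fraction(1), Fraction(0))            # X_0 = 1: all mass on state 1
tv = []                                    # ||law(X_n) - Unif||_TV, n >= 1
for m in range(20):
    mu = step(mu, theta(m))
    tv.append(abs(mu[0] - Fraction(1, 2)))

D = [abs(theta(m + 1) - theta(m)) for m in range(20)]
```

The marginal is exactly uniform after the first P_{1/2}-step (and stays uniform, since uniform is stationary for every P_θ), so the chain is ergodic; yet D_n does not vanish along even times, and M_ɛ(x, 1/(m+1)) grows without bound as m grows, so neither Diminishing Adaptation nor Containment holds, as stated in Remarks 4 and 5.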
Corollary 2. Suppose that the parameter space Y is a metric space, and the adaptive scheme {Γ_n : n ≥ 0} is bounded in probability. Suppose that there exists an increasing sequence of sets (D_k × Y_k) ⊆ X × Y such that for any k > 0,

lim_n sup_{(x,γ) ∈ D_k × Y_k} ∥P_γ^n(x, ·) - π(·)∥_TV = 0.

If the adaptive MCMC algorithm is ergodic, then Containment holds.

Proof: Using the same technique as in Theorem 4, for large enough M > 0,

P_(x_0,γ_0)(M_ɛ(X_n, Γ_n) > M) ≤ P_(x_0,γ_0)((X_n, Γ_n) ∈ (D_k × Y_k)^c) ≤ P_(x_0,γ_0)(X_n ∈ D_k^c) + P_(x_0,γ_0)(Γ_n ∈ Y_k^c) ≤ P_(x_0,γ_0)(X_n ∈ D_k^c) - π(D_k^c) + π(D_k^c) + P_(x_0,γ_0)(Γ_n ∈ Y_k^c).

Since {Γ_n : n ≥ 0} is bounded in probability, the result holds.

5 Simultaneous Polynomial Ergodicity

Although the ergodicity of adaptive MCMC algorithms is, to some degree, settled in [14] and [3], some properties related to simultaneous polynomial ergodicity remain unknown. In this section, we find that the conditions (Y4) and (M3) are implied for adaptive MCMC under simultaneous polynomial ergodicity. Before studying it, let us recall the quantitative bound for a time-homogeneous Markov chain with polynomial convergence rate by Fort and Moulines [7].

Theorem 5 ([7]). Suppose that the time-homogeneous transition kernel P satisfies the following conditions:

P is π-irreducible for an invariant probability measure π.

There exist sets C ∈ B(X) and D ∈ B(X) with C ⊆ D and π(C) > 0, and an integer m ≥ 1, such that for any (x, x′) ∈ Δ := (C × D) ∪ (D × C) and A ∈ B(X),

P^m(x, A) ∧ P^m(x′, A) ≥ ρ_{x,x′}(A)   (3)

where ρ_{x,x′} is some measure on X for (x, x′) ∈ Δ, and ɛ_- := inf_{(x,x′) ∈ Δ} ρ_{x,x′}(X) > 0.

Let q ≥ 1. There exist measurable functions V_k : X → R_+ \ {0} for k ∈ {0, 1, ..., q}, and for k ∈ {0, 1, ..., q - 1} some constants 0 < a_k < 1, b_k < ∞, and c_k > 0, such that π(V_q^β) < ∞ for some β ∈ (0, 1], and

P V_{k+1}(x) ≤ V_{k+1}(x) - V_k(x) + b_k I_C(x),  inf_{x ∈ X} V_k(x) ≥ c_k > 0,  V_k(x) - b_k ≥ a_k V_k(x) for x ∈ D^c,   (4)

sup_D V_q < ∞.
Then, for any x ∈ X and n ≥ m,

∥P^n(x, ·) - π(·)∥_TV ≤ min_{1 ≤ l ≤ q} B_l^(β)(x, n),   (5)

with

B_l^(β)(x, n) = [ɛ_+ <(I - A_m^(β))^{-1} δ_x ⊗ π(W^β), e_l>] / [S(l, n + 1 - m)^β + Σ_{j ≤ n+1-m} (1 - ɛ_+)^{j-(n-m)} (S(l, j + 1)^β - S(l, j)^β)],

where <·,·> denotes the inner product in R^{q+1}, {e_l : 0 ≤ l ≤ q} is the canonical basis of R^{q+1}, and I is the identity matrix; δ_x ⊗ π(W^β) := ∫∫ δ_x(dy) π(dy′) W^β(y, y′), where W^β(x, x′) := (W_0^β(x, x′), ..., W_q^β(x, x′))^T with W_0(x, x′) := 1 and

W_l(x, x′) := I_Δ(x, x′) + I_{Δ^c}(x, x′) (Π_{k=0}^{l-1} a_k)^{-1} (m(V_0))^{-1} (V_l(x) + V_l(x′)) for 1 ≤ l ≤ q,

where m(V_0) := inf_{(x,x′) ∈ Δ^c} {V_0(x) + V_0(x′)}; S(0, k) := 1 and S(i, k) := Σ_{j=1}^k S(i - 1, j) for i ≥ 1;

A_m^(β) :=
[ A_m^(β)(0)       0              ...  0           0
  A_m^(β)(1)       A_m^(β)(0)     ...  0           0
  ...              ...                 ...         ...
  A_m^(β)(q - 1)   A_m^(β)(q - 2) ...  A_m^(β)(0)  0
  A_m^(β)(q)       A_m^(β)(q - 1) ...  A_m^(β)(1)  A_m^(β)(0) ],

where

A_m^(β)(l) := sup_{(x,x′) ∈ Δ} S(l, m)^β (1 - ρ_{x,x′}(X)) ∫∫ R_{x,x′}(x, dy) R_{x,x′}(x′, dy′) W_l^β(y, y′),

with the residual kernel

R_{x,x′}(u, dy) := (1 - ρ_{x,x′}(X))^{-1} (P^m(u, dy) - ρ_{x,x′}(dy)),

and ɛ_+ := sup_{(x,x′) ∈ Δ} ρ_{x,x′}(X).

Remark 6. In B_l^(β)(x, n), ɛ_+ depends on the set Δ and the measures ρ_{x,x′}; the matrix (I - A_m^(β))^{-1} depends on Δ, the transition kernel P, ρ_{x,x′}, and the test functions V_k; δ_x ⊗ π(W^β) depends on Δ and the test functions V_k. Consider the special case of the theorem in which ρ_{x,x′}(dy) = δν(dy), where ν is a probability measure with ν(C) > 0, and Δ := C × C.

1. ɛ_+ = ɛ_- = δ.

2. I - A_m^(β) is a lower triangular matrix, so (I - A_m^(β))^{-1} =: (b_{ij}^(β))_{i,j=0,...,q} is also lower triangular, with b_{ii}^(β) = 1/(1 - A_m^(β)(0)), and, for fixed k ≥ 0, all the b_{i,i-k}^(β) are equal. For i > j, b_{ij}^(β) is a polynomial combination of A_m^(β)(0), ..., A_m^(β)(i - j) divided by a power of (1 - A_m^(β)(0)). By some algebra, one obtains, e.g., b_{21}^(β) = A_m^(β)(1) / (1 - A_m^(β)(0))^2.

So, by calculating B_1^(β)(x, n), we can get a quantitative bound of a simple form; B_1^(β)(x, n) only involves the two test functions V_0(x) and V_1(x).

Remark 7. From Equation (4), V_0(x) ≥ b_0/(1 - a_0) > b_0 for x ∈ D^c, because 0 < a_0 < 1. Consider the drift condition P V_1 ≤ V_1 - V_0 + b_0 I_C. Since πP = π, π(V_0) ≤ b_0 π(C) ≤ b_0. Hence, the V_0 in the theorem cannot be constant.

Remark 8. Even without the condition π(V_q^β) < ∞, the bound in Equation (5) can still be obtained; however, the bound is then possibly infinite. The subscript l of B_l^(β)(x, n) together with β determines the polynomial rate: the relevant rate is S(l, n + 1 - m)^β = O((n + 1 - m)^{lβ}). It can be observed that B_l^(β)(x, n) involves the test functions V_0(x), ..., V_l(x), and lim sup_n n^{βl} B_l^(β)(x, n) < ∞. The maximal rate of convergence is equal to qβ.

5.1 Conditions

The following conditions are derived from Theorem 5, with some changes so that they apply to adaptive MCMC algorithms. Say that the family {P_γ : γ ∈ Y} is simultaneously polynomially ergodic (S.P.E.) if the conditions (A1)-(A4) are satisfied.

A1: each P_γ is φ_γ-irreducible with stationary distribution π(·);

Remark 9. From MT Proposition 10.1.2, if P_γ is φ_γ-irreducible, then P_γ is π-irreducible and the invariant measure π is a maximal irreducibility measure.

A2: there are a set C ⊆ X, an integer m ∈ N, a constant δ > 0, and probability measures ν_γ(·) on X such that π(C) > 0 and

P_γ^m(x, ·) ≥ δ I_C(x) ν_γ(·) for γ ∈ Y;   (6)

Remark 10. In Theorem 5, the condition Eq. (3) enables the splitting technique. Here we consider the special case of that condition in which ρ_{x,x′}(dy) = δν_γ(dy) and Δ = C × C. Thus, by Remark 6, the bound on ∥P_γ^n(x, ·) - π(·)∥_TV depends on C, m, the minorization constant δ, π(·), ν_γ, and the test functions V_l(x), so we assume that these are uniform over all the transition kernels.
A3: there are q ∈ N and measurable functions V_0, V_1, ..., V_q : X → (0, ∞) with V_0 ≤ V_1 ≤ ... ≤ V_q, such that for each k = 0, 1, ..., q - 1 there are constants 0 < α_k < 1, b_k < ∞, and c_k > 0 with

P_γ V_{k+1}(x) ≤ V_{k+1}(x) - V_k(x) + b_k I_C(x),  V_k(x) ≥ c_k for x ∈ X and γ ∈ Y;   (7)

V_k(x) - b_k ≥ α_k V_k(x) for x ∈ X \ C;   (8)

sup_{x ∈ C} V_q(x) < ∞.   (9)

Remark 11. For x ∈ C, ν_γ(V_l) ≤ δ^{-1} P_γ^m V_l(x) ≤ δ^{-1} (sup_{x ∈ C} V_l(x) + m b_{l-1}).

A4: π(V_q^β) < ∞ for some β ∈ (0, 1].
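As a concrete toy instance of (A3), the following sketch of ours numerically verifies a nested family of drift conditions with q = 2 for reflected random walks P_γ(x, x+1) = γ, P_γ(x, x-1) = 1 - γ on the non-negative integers, γ ∈ [0.05, 0.1]. All test functions, constants, and the small set C = {0, ..., 9} are our illustrative choices, and the check runs over a finite range of states, so it is a sanity check rather than a proof for all x:

```python
def check_a3(gammas, xmax=500, tol=1e-9):
    """Check (7)-(9) with q = 2 for the reflected walk family, using the
    illustrative choices V_0(x) = 0.8*(1 - 1/(x+2)), V_1(x) = x + 1,
    V_2(x) = (x+1)**2, (b_0, b_1) = (0.5, 1.3), (alpha_0, alpha_1) =
    (0.3, 0.8), and C = {0, ..., 9}."""
    V = [lambda x: 0.8 * (1 - 1 / (x + 2)),
         lambda x: x + 1.0,
         lambda x: (x + 1.0) ** 2]
    b, alpha, C_max = [0.5, 1.3], [0.3, 0.8], 9

    def pv(g, f, x):
        # (P_gamma f)(x) for the walk, reflecting at zero.
        up = f(x + 1)
        down = f(x - 1) if x > 0 else f(0)
        return g * up + (1 - g) * down

    for g in gammas:
        for x in range(xmax + 1):
            in_C = x <= C_max
            for k in range(2):
                # (7): P_gamma V_{k+1} <= V_{k+1} - V_k + b_k I_C
                bound = V[k + 1](x) - V[k](x) + (b[k] if in_C else 0.0)
                if pv(g, V[k + 1], x) > bound + tol:
                    return False
                # (8): V_k - b_k >= alpha_k V_k off C
                if not in_C and V[k](x) - b[k] < alpha[k] * V[k](x) - tol:
                    return False
    # (9): sup_C V_2 < infinity holds trivially since C is finite.
    return True
```

The check passes for the inward-drifting family and fails for a kernel that drifts outward too strongly, illustrating that (A3) genuinely constrains the whole family at once.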
5.2 Main Result

Theorem 6. Suppose an adaptive MCMC algorithm satisfies Diminishing Adaptation. Then the algorithm is ergodic under any of the following cases:

(i) S.P.E. holds, and the number q of simultaneous drift conditions is strictly greater than two;

(ii) S.P.E. holds, the number q of simultaneous drift conditions is greater than or equal to two, and there exists an increasing function f : R_+ → R_+ such that V_1(x) ≤ f(V_0(x));

(iii) the conditions (A1) and (A2) hold, and there exist constants c > 0, b′ > b > 0, α ∈ (0, 1), and a measurable function V : X → R_+ with V(x) ≥ 1 and sup_{x ∈ C} V(x) < ∞ such that

P_γ V(x) ≤ V(x) - cV^α(x) + b I_C(x) for γ ∈ Y,   (10)

and cV^α(x) ≥ b′ for x ∈ C^c;

(iv) the conditions (A1), (A2), and (A4) hold, and there exist a constant b′ > b > 0 and two measurable functions V_0 : X → R_+ and V_1 : X → R_+ with 1 ≤ V_0(x) ≤ V_1(x) and sup_{x ∈ C} V_1(x) < ∞ such that

P_γ V_1(x) ≤ V_1(x) - V_0(x) + b I_C(x) for γ ∈ Y,   (11)

V_0(x) ≥ b′ for x ∈ C^c, and the process {V_1(X_n) : n ≥ 0} is bounded in probability.

The theorem consists of Theorem 7, Theorem 8, Theorem 9, and Lemma 1. Theorem 9 shows that {V(X_n) : n ≥ 0} in case (iii) is bounded in probability. Case (iii) is a special case of S.P.E. with q = 1, so that a uniform quantitative bound on ∥P_γ^n(x, ·) - π(·)∥_TV over γ ∈ Y exists.

Remark 12. For part (iii), (A4) is implied by MT Theorem 14.3.7 with β = α.

5.3 Proof of Theorem 6

Lemma 1. Suppose that the family {P_γ : γ ∈ Y} is S.P.E. If the stochastic process {V_l(X_n) : n ≥ 0} is bounded in probability for some l ∈ {1, ..., q}, then Containment is satisfied.

Proof: We use the notation of Theorem 5. From S.P.E., for γ ∈ Y, let ρ_{x,x′}(dy) = δν_γ(dy) (so ρ_{x,x′}(X) = δ) and Δ := C × C. Then ɛ_+ = ɛ_- = δ. Note that the matrix I - A_m^(β) is lower triangular. Denote (I - A_m^(β))^{-1} := (b_{ij}^(β))_{i,j=0,...,q}.
By the definition of B_l^(β)(x, n),

B_l^(β)(x, n) = [ɛ_+ Σ_{k=0}^l b_{lk}^(β) ∫ π(dy) W_k^β(x, y)] / [S(l, n + 1 - m)^β + Σ_{j ≤ n+1-m} (1 - ɛ_+)^{j-(n-m)} (S(l, j + 1)^β - S(l, j)^β)]
≤ ɛ_+ S(l, n + 1 - m)^{-β} Σ_{k=0}^l b_{lk}^(β) ∫ π(dy) W_k^β(x, y).

By some algebra, for k = 1, ..., q,

∫ π(dy) W_k^β(x, y) ≤ 1 + (m(V_0) Π_{i=0}^{k-1} a_i)^{-β} [V_k^β(x) + π(V_k^β)]   (12)
because β ∈ (0, 1]. In addition, m(V_0) ≥ c_0 > 0, so the coefficient of the second term on the right-hand side is finite. By induction, we obtain b_{10}^(β) = A_m^(β)(1)/(1 - A_m^(β)(0))^2 and b_{11}^(β) = 1/(1 - A_m^(β)(0)). It is easy to check that 0 < b_{11}^(β) ≤ 1/δ. By some algebra,

A_m^(β)(1) ≤ m^β sup_{(x,x′) ∈ C×C} [1 + (a_0 m(V_0))^{-β} (P_γ^m V_1^β(x) + P_γ^m V_1^β(x′))] ≤ m^β [1 + 2 (a_0 m(V_0))^{-β} (sup_{x ∈ C} V_1(x) + m b_0)^β],

because P_γ^m V_1^β(x) ≤ (P_γ^m V_1(x))^β ≤ (V_1(x) + m b_0)^β. Therefore, b_{10}^(β) is bounded from above by some value independent of γ. Thus,

B_1^(β)(x, n) ≤ δ S(1, n + 1 - m)^{-β} [b_{10}^(β) ∫ π(dy) W_0^β(x, y) + b_{11}^(β) ∫ π(dy) W_1^β(x, y)]
≤ δ (n + 1 - m)^{-β} (b_{10}^(β) + b_{11}^(β) [1 + (a_0 m(V_0))^{-β} (V_1^β(x) + π(V_1^β))]).

Therefore, the boundedness in probability of the process {V_1(X_k) : k ≥ 0} implies that the random sequence B_1^(β)(X_n, n) converges to zero in probability, uniformly over γ ∈ Y. Containment holds.

Let {Z_j : j ≥ 0} be an adapted sequence of positive random variables; for each j, Z_j will denote a fixed Borel-measurable function of X_j. Let τ_n denote any stopping time starting from time n of the process {X_i : i ≥ 0}, i.e. [τ_n = i] ∈ σ(X_k : k = 1, ..., n + i) and P(τ_n < ∞) = 1.

Lemma 2 (Dynkin's Formula for adaptive MCMC). For m > 0 and n > 0,

E[Z_{τ_{m,n}} | X_m, Γ_m] = Z_m + E[Σ_{i=1}^{τ_{m,n}} (E[Z_{m+i} | F_{m+i-1}] - Z_{m+i-1}) | X_m, Γ_m],

where τ_{m,n} := min(n, τ_m, inf{k ≥ 0 : Z_{m+k} ≥ n}).

Proof:

Z_{τ_{m,n}} = Z_m + Σ_{i=1}^{τ_{m,n}} (Z_{m+i} - Z_{m+i-1}) = Z_m + Σ_{i=1}^n I(τ_{m,n} ≥ i) (Z_{m+i} - Z_{m+i-1}).

Since [τ_{m,n} ≥ i] is measurable with respect to F_{m+i-1},

E[Z_{τ_{m,n}} | X_m, Γ_m] = Z_m + E[Σ_{i=1}^n E[Z_{m+i} - Z_{m+i-1} | F_{m+i-1}] I(τ_{m,n} ≥ i) | X_m, Γ_m] = Z_m + E[Σ_{i=1}^{τ_{m,n}} (E[Z_{m+i} | F_{m+i-1}] - Z_{m+i-1}) | X_m, Γ_m].
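Lemma 2 is an exact identity, and it can be checked by brute-force enumeration on a tiny example. The sketch below is our illustration (the kernels, V, and the stopping time are arbitrary choices, not from the paper): a two-state chain whose kernel alternates deterministically with time (a degenerate adaptive scheme), Z_j := V(X_j), and τ_{0,n} = min(n, first hitting time of state 0); n is chosen larger than max V, so the cap Z ≥ n never binds:

```python
from itertools import product

# Two kernels on {0, 1}; the "adaptation" is a deterministic time-dependent
# choice of kernel (illustrative only).
P = {0: [[0.7, 0.3], [0.4, 0.6]], 1: [[0.5, 0.5], [0.2, 0.8]]}
V = [1.0, 2.0]                                # Z_j := V(X_j)

def kernel(i):
    # Kernel used for the step from X_i to X_{i+1}.
    return P[i % 2]

def pv(i, x):
    # E[Z_{i+1} | X_i = x] = sum_y P_i(x, y) V(y)
    K = kernel(i)
    return K[x][0] * V[0] + K[x][1] * V[1]

n, x0 = 4, 1
lhs, correction = 0.0, 0.0
for path in product([0, 1], repeat=n):        # enumerate all sample paths
    prob, x = 1.0, x0
    for i, y in enumerate(path):
        prob *= kernel(i)[x][y]
        x = y
    hits = [k + 1 for k, y in enumerate(path) if y == 0]
    tau = min(n, hits[0] if hits else n)      # tau_{0,n}; Z < n always here
    seq = (x0,) + path                        # (X_0, X_1, ..., X_n)
    lhs += prob * V[seq[tau]]                 # E[Z_tau]
    correction += prob * sum(pv(i - 1, seq[i - 1]) - V[seq[i - 1]]
                             for i in range(1, tau + 1))
rhs = V[x0] + correction                      # Dynkin's formula RHS
```

Both sides agree to floating-point precision, since the enumeration computes the two expectations in Lemma 2 exactly over the full path distribution.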
Lemma 3 (Comparison Lemma for adaptive MCMC). Suppose that there exist two sequences of positive functions {s_j, f_j : j ≥ 0} on X such that

E[Z_{j+1} | F_j] ≤ Z_j - f_j(X_j) + s_j(X_j).   (13)

Then for a stopping time τ_n starting from time n of the adaptive MCMC {X_i : i ≥ 0},

E[Σ_{j=0}^{τ_n - 1} f_{n+j}(X_{n+j}) | X_n, Γ_n] ≤ Z_n + E[Σ_{j=0}^{τ_n - 1} s_{n+j}(X_{n+j}) | X_n, Γ_n].

Proof: The result follows from Lemma 2 and Eq. (13).

The following proposition relates the moments of the hitting time to the test-function-modulated moments for adaptive MCMC algorithms under S.P.E.; it is derived from the corresponding result for Markov chains in [10, Theorem 3.2]. Define the first return time and the i-th return time to the set C from time n respectively by

τ_{n,C} := τ_{n,C}(1) := min{k ≥ 1 : X_{n+k} ∈ C}   (14)

and

τ_{n,C}(i) := min{k > τ_{n,C}(i - 1) : X_{n+k} ∈ C} for n ≥ 0 and i > 1.   (15)

Proposition 1. Consider an adaptive MCMC {X_i : i ≥ 0} with adaptive parameter process {Γ_i : i ≥ 0}. If the family {P_γ : γ ∈ Y} is S.P.E., then there exist constants {d_i : i = 0, ..., q - 1} such that at time n, for k = 1, ..., q,

c_{q-k} E[τ_{n,C}^k | X_n, Γ_n] ≤ k E[Σ_{i=0}^{τ_{n,C} - 1} (i + 1)^{k-1} V_{q-k}(X_{n+i}) | X_n, Γ_n] ≤ k d_{q-k} (V_q(X_n) + Σ_{i=1}^k b_{q-i} I_C(X_n)),

where the test functions {V_i(·) : i = 0, ..., q}, the set C, {c_i : i = 0, ..., q - 1}, and {b_i : i = 0, ..., q - 1} are as in the definition of S.P.E.

Proof: Since V_{q-k}(x) ≥ c_{q-k} on X and Σ_{i=0}^{τ_{n,C} - 1} (i + 1)^{k-1} ≥ ∫_0^{τ_{n,C}} x^{k-1} dx = k^{-1} τ_{n,C}^k,

E[Σ_{i=0}^{τ_{n,C} - 1} (i + 1)^{k-1} V_{q-k}(X_{n+i}) | X_n, Γ_n] ≥ c_{q-k} k^{-1} E[τ_{n,C}^k | X_n, Γ_n].   (16)

So the first inequality holds. Consider k = 1. By S.P.E. and Lemma 3,

E[Σ_{i=0}^{τ_{n,C} - 1} V_{q-1}(X_{n+i}) | X_n, Γ_n] ≤ V_q(X_n) + b_{q-1} I_C(X_n).   (17)
So the case k = 1 of the second inequality of the result holds. For i ≥ 0, by S.P.E.,

E[(i + 1)^{k-1} V_{q-k+1}(X_{n+i+1}) | X_{n+i}, Γ_{n+i}] - i^{k-1} V_{q-k+1}(X_{n+i})
≤ (i + 1)^{k-1} (V_{q-k+1}(X_{n+i}) - V_{q-k}(X_{n+i}) + b_{q-k} I_C(X_{n+i})) - i^{k-1} V_{q-k+1}(X_{n+i})
≤ -(i + 1)^{k-1} V_{q-k}(X_{n+i}) + d (i^{k-2} V_{q-k+1}(X_{n+i}) + (i + 1)^{k-1} b_{q-k} I_C(X_{n+i}))

for some positive d independent of i, using (i + 1)^{k-1} - i^{k-1} ≤ d i^{k-2}. By Lemma 3,

E[Σ_{i=0}^{τ_{n,C} - 1} (i + 1)^{k-1} V_{q-k}(X_{n+i}) | X_n, Γ_n] ≤ d E[Σ_{i=0}^{τ_{n,C} - 1} (i + 1)^{(k-1)-1} V_{q-(k-1)}(X_{n+i}) | X_n, Γ_n] + b_{q-k} I_C(X_n).   (18)

From the above equation, by induction, the second inequality of the result holds.

Theorem 7. Suppose that the family {P_γ : γ ∈ Y} is S.P.E. with q > 2. Then Containment holds.

Proof: For k = 1, ..., q, take M > 0 large enough that C ⊆ {x : V_{q-k}(x) ≤ M}. Then

P_(x_0,γ_0)(V_{q-k}(X_n) > M) = Σ_{i=0}^n P_(x_0,γ_0)(V_{q-k}(X_n) > M, τ_{i,C} > n - i, X_i ∈ C) + P_(x_0,γ_0)(V_{q-k}(X_n) > M, τ_{0,C} > n, X_0 ∉ C).

By Proposition 1, for i = 0, ..., n,

P_(x_0,γ_0)(V_{q-k}(X_n) > M, τ_{i,C} > n - i | X_i ∈ C)
≤ P_(x_0,γ_0)(Σ_{j=0}^{τ_{i,C} - 1} (j + 1)^{k-1} V_{q-k}(X_{i+j}) > (n - i)^{k-1} M + c_{q-k} Σ_{j=0}^{n-i-1} (j + 1)^{k-1}, τ_{i,C} > n - i | X_i ∈ C)
≤ sup_{x ∈ C} E_(x_0,γ_0)[Σ_{j=0}^{τ_{i,C} - 1} (j + 1)^{k-1} V_{q-k}(X_{i+j}) | X_i = x, Γ_i] / [(n - i)^{k-1} M + c_{q-k} Σ_{j=0}^{n-i-1} (j + 1)^{k-1}]
≤ d_{q-k} (sup_{x ∈ C} V_q(x) + Σ_{j=1}^k b_{q-j}) / [(n - i)^{k-1} M + c_{q-k} Σ_{j=0}^{n-i-1} (j + 1)^{k-1}],
and

P_(x_0,γ_0)(V_{q-k}(X_n) > M, τ_{0,C} > n, X_0 ∉ C) ≤ d_{q-k} (V_q(x_0) + Σ_{j=1}^k b_{q-j} I_C(x_0)) / [n^{k-1} M + c_{q-k} Σ_{j=0}^{n-1} (j + 1)^{k-1}].

By simple algebra,

(n - i)^{k-1} M + c_{q-k} Σ_{j=0}^{n-i-1} (j + 1)^{k-1} = O((n - i)^{k-1} (M + c_{q-k} (n - i))).

Therefore, for some constant d′_{q-k},

P_(x_0,γ_0)(V_{q-k}(X_n) > M) ≤ d′_{q-k} (sup_{x ∈ C ∪ {x_0}} V_q(x) + Σ_{j=1}^k b_{q-j}) [Σ_{i=0}^n P_(x_0,γ_0)(X_i ∈ C) / ((n - i)^{k-1} (M + c_{q-k} (n - i))) + δ_{C^c}(x_0) / (n^{k-1} (M + c_{q-k} n))].   (19)

Whenever q ≥ 2, k can be chosen equal to 2, and for k ≥ 2 the summation on the right-hand side of Eq. (19) is finite for given M. But if q = 2, this only shows that the process {V_0(X_n) : n ≥ 0} is bounded in probability, which is why q > 2 is required for the result. Hence, taking M large enough, the probability can be made arbitrarily small, so the sequence {V_{q-2}(X_n) : n ≥ 0} is bounded in probability. By Lemma 1, Containment holds.

Remark 13. In the proof, only (A3) is used.

Remark 14. If V_1 is dominated by a nice (non-decreasing) function of V_0, then the sequence {V_1(X_n) : n ≥ 0} is bounded in probability. In Theorem 9, we discuss this situation for a certain simultaneous single polynomial drift condition.

Theorem 8. Suppose that {P_γ : γ ∈ Y} is S.P.E. with q = 2, and that there exists a strictly increasing function f : R_+ → R_+ such that V_1(x) ≤ f(V_0(x)) for all x ∈ X. Then Containment holds.

Proof: From Eq. (19), the process {V_0(X_n) : n ≥ 0} is bounded in probability. Since V_1(x) ≤ f(V_0(x)),

P_(x_0,γ_0)(V_1(X_n) > f(M)) ≤ P_(x_0,γ_0)(f(V_0(X_n)) > f(M)) = P_(x_0,γ_0)(V_0(X_n) > M),

because f(·) is strictly increasing. By the boundedness of V_0(X_n), for any ɛ > 0 there exist N > 0 and some M > 0 such that for n > N, P_(x_0,γ_0)(V_1(X_n) > f(M)) ≤ ɛ. Therefore {V_1(X_n) : n ≥ 0} is bounded in probability. By Lemma 1, Containment is satisfied.

Consider the single polynomial drift condition (see [10]):

P_γ V(x) ≤ V(x) - cV^α(x) + b I_C(x), where 0 ≤ α < 1.
The moments of the hitting time to the set C then satisfy (see [10] for details): for any 1 ≤ ξ ≤ 1/(1 - α),

E_x[Σ_{i=0}^{τ_C - 1} (i + 1)^{ξ-1} V^{1-ξ(1-α)}(X_i)] < ∞, bounded above by a constant multiple of V(x) + b I_C(x).
The polynomial rate function is r(n) = n^{ξ-1}. If α = 0, then r(n) is a constant; in that situation it is difficult to use the technique of Theorem 7 to prove that {V(X_n) : n ≥ 0} is bounded in probability. Thus, we assume α ∈ (0, 1).

Proposition 2. Consider an adaptive MCMC {X_n : n ≥ 0} with an adaptive scheme {Γ_n : n ≥ 0}. Suppose that (A1) holds, and that there exist constants c > 0, b > 0, α ∈ (0, 1), and a measurable function V : X → R_+ with V(x) ≥ 1 and sup_{x ∈ C} V(x) < ∞ such that

P_γ V(x) ≤ V(x) - cV^α(x) + b I_C(x) for γ ∈ Y.   (20)

Then for 1 ≤ ξ ≤ 1/(1 - α),

E_(x_0,γ_0)[Σ_{i=0}^{τ_{n,C} - 1} (i + 1)^{ξ-1} V^{1-ξ(1-α)}(X_{n+i}) | X_n, Γ_n] ≤ c_ξ(C) (V(X_n) + 1).   (21)

Proof: The proof applies the techniques of Lemma 3.5 and Theorem 3.6 of [10].

Theorem 9. Suppose that (A2) and the conditions in Proposition 2 are satisfied, and that there exists a constant b′ > b such that cV^α(x) ≥ b′ for x ∈ C^c. Then Containment holds.

Proof: Using the same techniques as in Theorem 7, we have

P_(x_0,γ_0)(V^{1-ξ(1-α)}(X_n) > M) ≤ c_ξ (sup_{x ∈ C ∪ {x_0}} V(x) + 1) [Σ_{i=0}^n P_(x_0,γ_0)(X_i ∈ C) / ((n - i)^{ξ-1} (M + n - i)) + δ_{C^c}(x_0) / (n^{ξ-1} (M + n))].   (22)

Therefore, for ξ ∈ [1, 1/(1 - α)), the sequence {V^{1-ξ(1-α)}(X_n) : n ≥ 0} is bounded in probability. Since 1 - ξ(1 - α) > 0, the process {V(X_n) : n ≥ 0} is bounded in probability. By Lemma 1, Containment holds.

Acknowledgements

Special thanks are due to Professor Jeffrey S. Rosenthal; without his assistance this work would not have been completed. The author is also very grateful to G. Fort for her helpful suggestions.

References

[1] C. Andrieu and E. Moulines. On the ergodicity properties of some adaptive Markov chain Monte Carlo algorithms. Ann. Appl. Probab., 16(3):1462-1505, 2006.

[2] C. Andrieu and C.P. Robert. Controlled MCMC for optimal sampling. Preprint, 2001.

[3] Y.F. Atchadé and G. Fort. Limit theorems for some adaptive MCMC algorithms with subgeometric kernels. Preprint, 2008.

[4] Y.F. Atchadé and J.S. Rosenthal. On adaptive Markov chain Monte Carlo algorithms.
Bernoulli, 11(5):815-828, 2005.
[5] Y. Bai, G.O. Roberts, and J.S. Rosenthal. On the Containment condition of adaptive Markov chain Monte Carlo algorithms. http://www.probability.ca/jeff/ftpdir/yanbai1.pdf, 2009.

[6] A.E. Brockwell and J.B. Kadane. Identification of regeneration times in MCMC simulation, with application to adaptive schemes. J. Comp. Graph. Statist., 14:436-458, 2005.

[7] G. Fort and E. Moulines. Computable bounds for subgeometrical and geometrical ergodicity. 2000.

[8] W.R. Gilks, G.O. Roberts, and S.K. Sahu. Adaptive Markov chain Monte Carlo. J. Amer. Statist. Assoc., 93:1045-1054, 1998.

[9] H. Haario, E. Saksman, and J. Tamminen. An adaptive Metropolis algorithm. Bernoulli, 7:223-242, 2001.

[10] S.F. Jarner and G.O. Roberts. Polynomial convergence rates of Markov chains. Ann. Appl. Probab., 12(1):224-247, 2002.

[11] S.P. Meyn and R.L. Tweedie. Markov Chains and Stochastic Stability. Springer-Verlag, London, 1993.

[12] H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Stat., 22:400-407, 1951.

[13] G.O. Roberts and J.S. Rosenthal. Coupling and ergodicity of adaptive Markov chain Monte Carlo algorithms. J. Appl. Prob., 44:458-475, 2007.

[14] C. Yang. Recurrent and ergodic properties of adaptive MCMC. Preprint, 2008.