Reversible Markov chains


Variational representations and ordering

Chris Sherlock

Abstract

This pedagogical document explains three variational representations that are useful when comparing the efficiencies of reversible Markov chains: (i) the Dirichlet form and the associated variational representations of the spectral gaps; (ii) a variational representation of the asymptotic variance of an ergodic average; and (iii) the conductance, and the equivalence of a non-zero conductance to a non-zero right spectral gap.

Introduction

This document relates to variational representations of aspects of a reversible Markov kernel, $P$, which has a limiting (hence, stationary) distribution, $\pi$. It is a record of my own learning, but I hope it might be useful to others.

The central idea is the Dirichlet form, which can be thought of as a generalisation of the expected squared jumping distance to cover all functions that are square-integrable with respect to $\pi$. The minimum (over all such functions with an expectation of $0$ and a variance of $1$) of the Dirichlet forms is the right spectral gap.

A key quantity of interest to researchers in MCMC is the asymptotic variance, which relates to the variance of an ergodic average, $\mathrm{Var}\left[\frac{1}{n}\sum_{i=1}^{n} f(X_i)\right]$ as $n \to \infty$, where $X_1, \ldots, X_n$ arise from successive applications of $P$, and $f$ is some square-integrable function. A second variational representation, of $\langle f, (I-P)^{-1} f \rangle$, allows us to relate a limit of the variance of an ergodic average to the Dirichlet form. Finally, a third variational quantity, the conductance of a Markov kernel, is introduced, and is also related to the Dirichlet form and hence to the variance of an ergodic average.

I found these variational representations particularly useful in two pieces of research that I performed, mostly in 2016. Sherlock et al. (2017) uses the first two representations, while Sherlock and Lee (2017) uses conductance. In both pieces of work the variational representations allowed us to compare pairs of Markov kernels: if one Markov kernel has a particular property, such as the variance of an ergodic average being a particular value, and if we can relate aspects of this Markov kernel, such as its Dirichlet form or its conductance, to a second Markov kernel, then we can often obtain a bound on the property of interest for the second kernel.

This is not the only use of variational representations; e.g. in Lawler and Sokal (1988) conductance is used directly to obtain bounds on the spectral gap of several discrete-statespace Markov chains.

The most natural framework for representing the kernel is that of a bounded, self-adjoint operator on a Hilbert space. I was almost entirely unfamiliar with this before I embarked on the above pieces of research, and so I will start by setting down the key aspects of this.

Preliminaries

Hilbert space

Let $L^2(\pi)$ be the Hilbert space of real functions that are square integrable with respect to some probability measure, $\pi$:

$$\int \pi(dx)\, f(x)^2 < \infty \quad \forall f \in L^2(\pi),$$

equipped with the inner product (finite, by Cauchy-Schwarz) and associated norm:

$$\langle f, g \rangle = \int \pi(dx)\, f(x) g(x), \qquad \|f\| = \sqrt{\langle f, f \rangle}.$$

Let $L_0^2(\pi) \subseteq L^2(\pi)$ be the Hilbert space that uses the same inner product but includes only functions with $\mathbb{E}_\pi[f] = 0$:

$$\int \pi(dx)\, f(x) = \langle f, 1 \rangle = 0.$$

For these functions, $\langle f, g \rangle = \mathrm{Cov}_\pi[f, g]$ and $\langle f, f \rangle = \mathrm{Var}_\pi[f]$.

Markov kernel and detailed balance

Let $\{X_t\}_{t=0}^{\infty}$ be a Markov chain on a statespace $\mathcal{X}$ with a kernel of $P(x, dy)$ which satisfies detailed balance with respect to $\pi$:

$$\pi(dx) P(x, dy) = \pi(dy) P(y, dx).$$
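As a concrete illustration (my own, not from the original document), the sketch below checks detailed balance for a small hypothetical 3-state kernel: on a finite statespace, detailed balance says exactly that the "flux" matrix $\mathrm{diag}(\pi)\, P$ is symmetric.

```python
import numpy as np

# A small hypothetical reversible kernel on the statespace {0, 1, 2}.
P = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])

# Stationary distribution pi: the left eigenvector of P for eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
pi = pi / pi.sum()                      # here pi = (1/4, 1/2, 1/4)

# Detailed balance: pi(dx) P(x, dy) = pi(dy) P(y, dx),
# i.e. the flux matrix diag(pi) @ P is symmetric.
flux = np.diag(pi) @ P
assert np.allclose(flux, flux.T)

# Summing the balance equation over x gives stationarity, pi P = pi,
# exactly as derived in the next paragraph.
assert np.allclose(pi @ P, pi)
```

The same kernel and stationary distribution are reused in the later sketches.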

For any measure, $\nu$, we define

$$\nu P := \int \nu(dx) P(x, dy).$$

Then

$$\pi P = \int_x \pi(dx) P(x, dy) = \int_x \pi(dy) P(y, dx) = \pi(dy),$$

so $\pi$ is stationary for $P$.

P is bounded and self-adjoint

Given a kernel (or "operator") $P$ we use the shorthand

$$Pf(x) = (Pf)(x) := \int P(x, dy) f(y).$$

Jensen's inequality gives

$$[(Pf)(x)]^2 = \left[\int P(x, dy) f(y)\right]^2 \le \int P(x, dy) f(y)^2 = (Pf^2)(x).$$

Hence, because $\pi$ is stationary for $P$,

$$\|Pf\|^2 = \int \pi(dx)\, [(Pf)(x)]^2 \le \int \pi(dx)\, (Pf^2)(x) = \int \pi(dx) P(x, dy) f(y)^2 = \int \pi(dy) f(y)^2 = \|f\|^2.$$

Thus $\|Pf\|/\|f\| \le 1$, and $P$ is a bounded linear operator. Further, if $P$ satisfies detailed balance with respect to $\pi$, then

$$\langle Pf, g \rangle = \int \pi(dx) P(x, dy)\, f(y) g(x) = \int \pi(dy) P(y, dx)\, f(y) g(x) = \int \pi(dx) P(x, dy)\, f(x) g(y) = \langle f, Pg \rangle;$$

$P$ is self-adjoint.

The spectrum of a bounded, self-adjoint operator

The spectrum of $P$ in $H$ is $\{\rho : P - \rho I \text{ is not invertible in } H\}$; i.e., there is at least one $g \in H$ such that there is no unique $f \in H$ with $(P - \rho I)f = g$. The spectrum of a bounded, self-adjoint operator is (accept it, but see below) a closed, bounded set on the real line; let the upper and lower bounds be $\lambda_{\max}$ and $\lambda_{\min}$.
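On a finite statespace this can be made concrete (a sketch, using the same hypothetical kernel as before): reversibility makes $S = D^{1/2} P D^{-1/2}$ symmetric, where $D = \mathrm{diag}(\pi)$, and $S$ has the same eigenvalues as $P$, so the spectrum is real; boundedness then confines it to $[-1, 1]$.

```python
import numpy as np

# Same hypothetical reversible kernel and stationary distribution as before.
P = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])
pi = np.array([0.25, 0.5, 0.25])

# Reversibility makes S = D^{1/2} P D^{-1/2} symmetric (D = diag(pi));
# S is similar to P, so it has the same eigenvalues.
S = np.diag(np.sqrt(pi)) @ P @ np.diag(1.0 / np.sqrt(pi))
assert np.allclose(S, S.T)

# The spectrum is therefore real and, since ||Pf|| <= ||f||, lies in [-1, 1].
lam = np.linalg.eigvalsh(S)
print(lam)                              # [0.0, 0.5, 1.0] for this kernel
assert np.all(lam >= -1 - 1e-12) and np.all(lam <= 1 + 1e-12)
```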

When $P$ is a self-adjoint Markov kernel and $H = L^2(\pi)$, $\lambda_{\max} = 1$ and $\lambda_{\min} \ge -1$. The spectral decomposition theorem for bounded, self-adjoint operators states that there is a finite positive measure, $a_f(d\lambda)$, with support contained in the real interval $[\lambda_{\min}, \lambda_{\max}]$, such that

$$\langle f, P^n f \rangle = \int_{\lambda_{\min}}^{\lambda_{\max}} \lambda^n\, a_f(d\lambda).$$

This decomposition is used twice hereafter; however, it may be unfamiliar, so the remainder of this section provides an intuition in terms of the eigenfunctions and eigenvalues of $P$.

Let $P$ be a bounded, self-adjoint operator. A right eigenfunction $e$ of $P$ is a function that satisfies $Pe = \lambda e$ for some scalar, $\lambda$, the corresponding eigenvalue. Let $e_0(x), e_1(x), \ldots$ be a set of eigenfunctions of $P$, scaled so that $\|e_i\|^2 = \langle e_i, e_i \rangle = 1$, and with corresponding eigenvalues $\lambda_0, \lambda_1, \ldots$. Since, by definition, $(P - \lambda_i I) e_i = 0$, the spectrum is a superset of the set of eigenvalues. The intuition below comes from the case where the spectrum is precisely the set of eigenvalues, and the eigenfunctions span $H$.

1. Just as for the eigenvectors of a finite self-adjoint matrix, it is possible to choose the eigenfunctions such that $\langle e_i, e_j \rangle = 0$ ($i \ne j$);
2. moreover, as with self-adjoint matrices, all of the eigenvalues are real;
3. furthermore, since $P$ is bounded, all of the eigenvalues satisfy $|\lambda| \le 1$.

If the eigenfunctions of an operator, $P$, span the Hilbert space, $H$, then for any $f \in H$,

$$f = \sum_{i=0}^{\infty} a_i e_i,$$

where $a_i = \langle f, e_i \rangle$, and from which

$$\langle f, P^n f \rangle = \sum_{i=0}^{\infty} \sum_{j=0}^{\infty} a_i a_j \langle e_i, P^n e_j \rangle = \sum_{i=0}^{\infty} a_i^2 \lambda_i^n.$$

With $n = 0$ we obtain $\sum_{i=0}^{\infty} a_i^2 = \|f\|^2 < \infty$. Thus, if the eigenfunctions span the space, then $a_f$ is discrete with mass $a_i^2$ at $\lambda_i$.

For a Markov kernel, $P1 = 1$, where $1$ is the constant function, which is, thus, a right eigenfunction with an eigenvalue of $1$.
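For a finite chain the decomposition can be verified directly. The sketch below (same hypothetical kernel) builds $\pi$-orthonormal eigenfunctions $e_i$ and checks $\langle f, P^n f \rangle = \sum_i a_i^2 \lambda_i^n$, so the spectral measure $a_f$ places mass $a_i^2$ at $\lambda_i$.

```python
import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])
pi = np.array([0.25, 0.5, 0.25])

# Eigen-decompose the symmetrised matrix; map back to eigenfunctions of P
# that are orthonormal in L2(pi).
S = np.diag(np.sqrt(pi)) @ P @ np.diag(1.0 / np.sqrt(pi))
lam, V = np.linalg.eigh(S)
E = np.diag(1.0 / np.sqrt(pi)) @ V      # columns e_i satisfy P e_i = lam_i e_i

f = np.array([1.0, -2.0, 3.0])          # an arbitrary test function
a = E.T @ (pi * f)                      # a_i = <f, e_i> = sum_x pi(x) f(x) e_i(x)
assert np.isclose(np.sum(a**2), pi @ f**2)   # n = 0: sum_i a_i^2 = ||f||^2

for n in range(5):
    Pn_f = np.linalg.matrix_power(P, n) @ f
    lhs = pi @ (f * Pn_f)               # <f, P^n f>
    rhs = np.sum(a**2 * lam**n)         # spectral measure: mass a_i^2 at lam_i
    assert np.isclose(lhs, rhs)
```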

Spectral gaps and the Dirichlet form

Spectrum of $P$ in $L_0^2(\pi)$; spectral gaps and geometric ergodicity

For any kernel $P$, we define $P^k f = P^{k-1}(Pf)$, recursively. Any function $f(x) \in L^2(\pi)$ can be written as $a\,1 + f_0(x)$ where $f_0 \in L_0^2(\pi)$. Here $a = \langle f, 1 \rangle = \mathbb{E}_\pi[f(X)]$. Thus

$$P^n f = \mathbb{E}_\pi[f(X)] + P^n f_0(x) \to \mathbb{E}_\pi[f(X)]$$

if $P$ is ergodic. To bound the size of the remainder term, consider the spectrum of $P$ restricted to functions in $L_0^2(\pi)$. This must be confined to $[\lambda_{\min}^0, \lambda_{\max}^0]$ with $-1 \le \lambda_{\min}^0 \le \lambda_{\max}^0 \le 1$, where (by definition)

$$\lambda_{\max}^0 := \sup_{f \in L_0^2(\pi)} \frac{\langle f, Pf \rangle}{\langle f, f \rangle} = \sup_{f \in L_0^2(\pi):\, \langle f, f \rangle = 1} \langle f, Pf \rangle \quad \text{and} \quad \lambda_{\min}^0 := \inf_{f \in L_0^2(\pi):\, \langle f, f \rangle = 1} \langle f, Pf \rangle.$$

Let $\lambda_* = \max(|\lambda_{\min}^0|, \lambda_{\max}^0)$. The (squared) size of the remainder is then

$$\|P^n f_0\|^2 = \langle P^n f_0, P^n f_0 \rangle = \langle f_0, P^{2n} f_0 \rangle = \int_{\lambda_{\min}^0}^{\lambda_{\max}^0} \lambda^{2n}\, a_{f_0}(d\lambda) \le \lambda_*^{2n} \int_{\lambda_{\min}^0}^{\lambda_{\max}^0} a_{f_0}(d\lambda) = \lambda_*^{2n} \|f_0\|^2.$$

Thus $\|P^n f_0\| \to 0$ geometrically quickly provided $\lambda_* < 1$; i.e., provided the inequality $\lambda_* \le 1$ is strict. $P$ is then called geometrically ergodic. The right spectral gap is $\rho_{\mathrm{right}} := 1 - \lambda_{\max}^0$ and the left spectral gap is $\rho_{\mathrm{left}} := 1 + \lambda_{\min}^0$. Both must be non-zero for geometric ergodicity. Henceforth, for notational simplicity, we drop the superscript in $\lambda_{\max/\min}^0$.

Aside: when one or both of the spectral gaps is zero (e.g. the spectrum is $[-1, 1]$), then for any fixed $n$ we can always find functions, $f$, with $\|f\| = 1$ (albeit fewer and fewer as $n \to \infty$) such that $\|P^n f\| > 0.1$, say.

Dirichlet form, $\mathcal{E}_P(f)$

The concept of a spectral gap motivates the Dirichlet form for $P$ and $f$,

$$\mathcal{E}_P(f) := \langle f, (I - P)f \rangle = \langle f, f \rangle - \langle f, Pf \rangle,$$

since

$$\rho_{\mathrm{right}} = \inf_{f \in L_0^2(\pi):\, \langle f, f \rangle = 1} \mathcal{E}_P(f) \quad \text{and} \quad \rho_{\mathrm{left}} = 2 - \sup_{f \in L_0^2(\pi):\, \langle f, f \rangle = 1} \mathcal{E}_P(f).$$
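Continuing the finite hypothetical example, the gaps, $\lambda_*$, the geometric decay of $\|P^n f_0\|$, and the Dirichlet-form characterisation of $\rho_{\mathrm{right}}$ can all be checked directly:

```python
import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])
pi = np.array([0.25, 0.5, 0.25])

S = np.diag(np.sqrt(pi)) @ P @ np.diag(1.0 / np.sqrt(pi))
lam = np.sort(np.linalg.eigvalsh(S))    # [0.0, 0.5, 1.0] here
lam_min, lam_max = lam[0], lam[-2]      # spectrum restricted to L2_0(pi)
rho_right, rho_left = 1.0 - lam_max, 1.0 + lam_min
lam_star = max(abs(lam_min), lam_max)
print(rho_right, rho_left, lam_star)    # 0.5 1.0 0.5

# Geometric ergodicity: ||P^n f0|| <= lam_star^n ||f0|| for f0 in L2_0(pi).
norm = lambda g: np.sqrt(pi @ g**2)
f0 = np.array([1.0, -2.0, 3.0])
f0 = f0 - pi @ f0                       # centre, so that <f0, 1> = 0
g = f0.copy()
for n in range(1, 10):
    g = P @ g                           # g = P^n f0
    assert norm(g) <= lam_star**n * norm(f0) + 1e-12

# rho_right = inf E_P(f) over unit-norm f in L2_0(pi), attained at the
# (pi-normalised) eigenfunction for lam_max.
lam_all, V = np.linalg.eigh(S)
e = (np.diag(1.0 / np.sqrt(pi)) @ V)[:, -2]
dirichlet = lambda f: pi @ (f * f) - pi @ (f * (P @ f))
assert np.isclose(dirichlet(e), rho_right)
```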

Directly from the definition we have, for two kernels $P_1$ and $P_2$ and some $\gamma > 0$:

$$\mathcal{E}_{P_1}(f) \ge \gamma\, \mathcal{E}_{P_2}(f)\ \ \forall f \in L_0^2(\pi) \implies \rho_{\mathrm{right}}^{P_1} \ge \gamma\, \rho_{\mathrm{right}}^{P_2}. \tag{1}$$

An alternative expression for the Dirichlet form provides a very natural intuition:

$$\mathcal{E}_P(f) = \langle f, f \rangle - \langle f, Pf \rangle = \int \pi(dx) P(x, dy)\, f(x)[f(x) - f(y)]$$
$$= \int \pi(dx) P(x, dy)\, f(y)[f(y) - f(x)] = \frac{1}{2} \int \pi(dx) P(x, dy)\, [f(y) - f(x)]^2, \tag{2}$$

where the penultimate expression follows because $P$ satisfies detailed balance with respect to $\pi$, and the final expression arises from the average of the two preceding ones. The Dirichlet form can, therefore, be thought of as a generalisation of the expected squared jumping distance of the $i$th component of $x$, $\int \pi(dx) P(x, dy)\, (y_i - x_i)^2$, to consider the expected squared changes for any $f \in L_0^2(\pi)$.
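The identity (2) is easy to verify numerically; the sketch below (same hypothetical kernel as before) computes the Dirichlet form both as $\langle f, f \rangle - \langle f, Pf \rangle$ and as the expected squared jump.

```python
import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])
pi = np.array([0.25, 0.5, 0.25])

inner = lambda g, h: pi @ (g * h)
f = np.array([0.3, -1.0, 2.0])
f = f - pi @ f                          # work in L2_0(pi)

# E_P(f) = <f, f> - <f, P f> ...
dirichlet_inner = inner(f, f) - inner(f, P @ f)

# ... equals the expected squared jumping "distance" of f, as in (2).
diffs = f[None, :] - f[:, None]         # diffs[x, y] = f(y) - f(x)
dirichlet_jump = 0.5 * np.sum(pi[:, None] * P * diffs**2)

assert np.isclose(dirichlet_inner, dirichlet_jump)
```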

Variance of an ergodic average

Suppose that we are interested in $\mathbb{E}_\pi[h(X)]$ for some $h \in L^2(\pi)$, and we estimate it by an average of the values in the Markov chain: $\hat{h}_n := \frac{1}{n} \sum_{i=1}^{n} h(X_i)$. Typically $\mathrm{Var}[\hat{h}_n] \to 0$ as $n \to \infty$, but scaling by $n$ should keep it $O(1)$. We are, therefore, interested in

$$\mathrm{Var}(P, h) := \lim_{n \to \infty} n\, \mathrm{Var}[\hat{h}_n].$$

So as to just consider mixing, we assume $X_1 \sim \pi$. Without loss of generality we may assume $h \in L_0^2(\pi)$ (else just subtract its expectation). Since $P$ is time-homogeneous,

$$\mathrm{Var}\left[\sum_{i=1}^{n} h(X_i)\right] = \sum_{i=1}^{n} \mathbb{E}\left[h^2(X_i)\right] + 2 \sum_{i=1}^{n-1} \mathbb{E}[h(X_i) h(X_{i+1})] + 2 \sum_{i=1}^{n-2} \mathbb{E}[h(X_i) h(X_{i+2})] + \cdots + 2\, \mathbb{E}[h(X_1) h(X_n)]$$
$$= n\, \mathbb{E}\left[h^2(X)\right] + 2(n-1)\, \mathbb{E}[h(X_1) h(X_2)] + \cdots + 2\, \mathbb{E}[h(X_1) h(X_n)]$$
$$= n \langle h, h \rangle + 2(n-1) \langle h, Ph \rangle + 2(n-2) \langle h, P^2 h \rangle + \cdots + 2 \langle h, P^{n-1} h \rangle.$$

Using $(I - P)^{-1}$ to denote $I + P + P^2 + P^3 + \cdots$, we obtain

$$\mathrm{Var}(P, h) = \langle h, h \rangle + 2 \sum_{i=1}^{\infty} \langle h, P^i h \rangle = 2 \sum_{i=0}^{\infty} \langle h, P^i h \rangle - \langle h, h \rangle = 2 \langle h, (I - P)^{-1} h \rangle - \langle h, h \rangle.$$

Writing $\langle h, h \rangle = \langle h, (I - P)(I - P)^{-1} h \rangle$ gives an equivalent form:

$$\mathrm{Var}(P, h) = \langle h, (I + P)(I - P)^{-1} h \rangle.$$

The spectral decomposition of $P$ in $L_0^2(\pi)$ gives

$$\mathrm{Var}(P, h) = \int_{\lambda_{\min}}^{\lambda_{\max}} \frac{1 + \lambda}{1 - \lambda}\, a_h(d\lambda) \le \frac{1 + \lambda_{\max}}{1 - \lambda_{\max}} \int_{\lambda_{\min}}^{\lambda_{\max}} a_h(d\lambda) = \frac{1 + \lambda_{\max}}{1 - \lambda_{\max}}\, \mathrm{Var}_\pi[h], \tag{3}$$

where the inequality follows since $(1 + \lambda)/(1 - \lambda)$ is an increasing function of $\lambda$. The supremum of the spectrum on $L_0^2(\pi)$ therefore provides an efficiency bound over all $h \in L_0^2(\pi)$.

Variance bounding kernels

The bound on the variance in (3) is in terms of $\lambda_{\max}$, not $\lambda_{\min}$. Even if there exists an eigenvalue $\lambda_{\min} = -1$, so the chain never converges (as opposed to the more usual case where the spectrum of $P$ is $[-1, a]$ with $a < 1$, but there is no eigenvalue at $-1$), the asymptotic variance can be finite. As an extreme example, consider the Markov chain with a transition matrix of

$$\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.$$

After $2n$ iterations this has been in state $1$ exactly $1/2 = \pi(\{1\})$ of the time.

Roberts and Rosenthal (2008) realised that in almost all applications of MCMC it was the variance of the ergodic average that was important, and not (directly) convergence to the target. Specifically, it did not matter if $\rho_{\mathrm{left}} = 0$. A kernel where $\rho_{\mathrm{right}} > 0$ was termed variance bounding, and this was shown to be equivalent to $\mathrm{Var}(P, f) < \infty$ for all $f \in L_0^2(\pi)$; i.e. if [hypothetical!] simple Monte Carlo would lead to the standard $\sqrt{n}$ rate of convergence of the ergodic average, a $\sqrt{n}$ rate would also be observed from the MCMC kernel. This is easy to see if the left and right spectral gaps of $P$ are non-zero. When at least one gap is zero, for any $\beta < 1$ we have

$$\mathrm{Var}(\beta P, h) = \lim_{n \to \infty} \left\{ \langle h, h \rangle + 2(1 - 1/n)\beta \langle h, Ph \rangle + 2(1 - 2/n)\beta^2 \langle h, P^2 h \rangle + \cdots + 2(1 - (n-1)/n)\beta^{n-1} \langle h, P^{n-1} h \rangle \right\} \le \langle h, h \rangle + 2 \sum_{i=1}^{\infty} \beta^i \left|\langle h, P^i h \rangle\right| < \infty.$$

Letting $\beta \uparrow 1$ then gives the result, even though $\mathrm{Var}(P, h)$ may be infinite.
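On a finite statespace all of these quantities are exactly computable. In the sketch below (same hypothetical kernel), $I - P$ is singular on the full space, so it is inverted on $L_0^2(\pi)$ via the standard fundamental-matrix device $(I - P + 1\pi^\top)^{-1}$ (my choice, not from the document); the result is checked against the spectral form and the bound (3).

```python
import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])
pi = np.array([0.25, 0.5, 0.25])

inner = lambda g, k: pi @ (g * k)
h = np.array([1.0, 0.0, -1.0])          # already has pi-mean zero

# Var(P, h) = 2 <h, (I - P)^{-1} h> - <h, h>; the matrix I - P + 1 pi^T is
# invertible and agrees with I - P on mean-zero functions.
g = np.linalg.solve(np.eye(3) - P + np.outer(np.ones(3), pi), h)
var_P_h = 2 * inner(h, g) - inner(h, h)

# Spectral check: Var(P, h) = sum_i a_i^2 (1 + lam_i)/(1 - lam_i) on L2_0(pi).
S = np.diag(np.sqrt(pi)) @ P @ np.diag(1.0 / np.sqrt(pi))
lam, V = np.linalg.eigh(S)
a = (np.diag(1.0 / np.sqrt(pi)) @ V).T @ (pi * h)
keep = lam < 1 - 1e-10                  # h places no mass at lambda = 1
assert np.isclose(var_P_h,
                  np.sum(a[keep]**2 * (1 + lam[keep]) / (1 - lam[keep])))

# Bound (3): Var(P, h) <= (1 + lam_max)/(1 - lam_max) * Var_pi[h].
lam_max = np.max(lam[keep])
assert var_P_h <= (1 + lam_max) / (1 - lam_max) * inner(h, h) + 1e-12
```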

Variational representation of $\langle f, Q^{-1} f \rangle$, and a key ordering

Firstly, notice that for real numbers $f$, $g$ and $q > 0$, $\sup_g \{2fg - g^2 q\}$ is $f^2/q$, which is achieved when $g = f/q$. The same holds for the operator $Q = I - P$ when $f$ and $g$ are functions (see Appendix A for a proof):

$$\langle f, (I - P)^{-1} f \rangle = \sup_{g \in L_0^2(\pi)} \left\{ 2\langle f, g \rangle - \langle g, (I - P)g \rangle \right\} = \sup_{g \in L_0^2(\pi)} \left\{ 2\langle f, g \rangle - \mathcal{E}_P(g) \right\}.$$

To see this, think of a reversible matrix, $P$, and diagonalise it; the result for such matrices is just equivalent to the scalar result applied to each eigenvalue. This leads to the alternative representation:

$$\mathrm{Var}(P, h) = \sup_{g \in L_0^2(\pi)} \left\{ 4\langle h, g \rangle - 2\mathcal{E}_P(g) \right\} - \langle h, h \rangle.$$

Clearly,

$$\mathcal{E}_{P_1}(g) \ge \mathcal{E}_{P_2}(g)\ \ \forall g \in L_0^2(\pi) \implies \mathrm{Var}(P_1, h) \le \mathrm{Var}(P_2, h),$$

giving a much simpler proof of a result which was originally proved for finite-statespace Markov chains in Peskun (1973) and then generalised to general statespaces by Tierney (1998). But we can go further: suppose that $\mathcal{E}_{P_1}(g) \ge \gamma\, \mathcal{E}_{P_2}(g)$ for all $g \in L_0^2(\pi)$ and some $\gamma > 0$; then, for any $g \in L_0^2(\pi)$,

$$4\langle h, g \rangle - 2\mathcal{E}_{P_1}(g) \le 4\langle h, g \rangle - 2\gamma\, \mathcal{E}_{P_2}(g) = \frac{1}{\gamma} \left\{ 4\langle h, g' \rangle - 2\mathcal{E}_{P_2}(g') \right\},$$

where $g' := \gamma g \in L_0^2(\pi)$. So

$$\sup_{g \in L_0^2(\pi)} \left\{ 4\langle h, g \rangle - 2\mathcal{E}_{P_1}(g) \right\} \le \frac{1}{\gamma} \sup_{g \in L_0^2(\pi)} \left\{ 4\langle h, g \rangle - 2\mathcal{E}_{P_2}(g) \right\},$$

and hence

$$\mathcal{E}_{P_1}(f) \ge \gamma\, \mathcal{E}_{P_2}(f)\ \ \forall f \in L_0^2(\pi) \implies \mathrm{Var}(P_1, h) + \langle h, h \rangle \le \frac{1}{\gamma} \left\{ \mathrm{Var}(P_2, h) + \langle h, h \rangle \right\}. \tag{4}$$

This result appears as Lemma 3 in Andrieu et al. (2016) (see also Caracciolo et al., 1990).
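Both the variational representation of $\mathrm{Var}(P, h)$ and the ordering (4) can be checked on the finite example. In the sketch below the comparison kernel is the "lazy" version $P_2 = (I + P_1)/2$, for which $\mathcal{E}_{P_1}(g) = 2\mathcal{E}_{P_2}(g)$ exactly, so (4) applies with $\gamma = 2$; this pairing is hypothetical, chosen purely for convenience.

```python
import numpy as np

P1 = np.array([[0.5, 0.5, 0.0],
               [0.25, 0.5, 0.25],
               [0.0, 0.5, 0.5]])
pi = np.array([0.25, 0.5, 0.25])
P2 = 0.5 * (np.eye(3) + P1)             # lazy kernel: E_{P1}(g) = 2 E_{P2}(g)

inner = lambda g, k: pi @ (g * k)
dirichlet = lambda P, g: inner(g, g) - inner(g, P @ g)

def asymptotic_variance(P, h):
    # Var(P, h) = 2 <h, (I - P)^{-1} h> - <h, h> on L2_0(pi).
    g = np.linalg.solve(np.eye(len(h)) - P + np.outer(np.ones(len(h)), pi), h)
    return 2 * inner(h, g) - inner(h, h)

h = np.array([1.0, 0.0, -1.0])          # pi-mean zero

# Variational representation: 4 <h, g> - 2 E_P(g) - <h, h> is maximised at
# g* = (I - P)^{-1} h, where it equals Var(P, h); random g never beat it.
g_star = np.linalg.solve(np.eye(3) - P1 + np.outer(np.ones(3), pi), h)
best = 4 * inner(h, g_star) - 2 * dirichlet(P1, g_star) - inner(h, h)
assert np.isclose(best, asymptotic_variance(P1, h))
rng = np.random.default_rng(3)
for _ in range(200):
    g = rng.normal(size=3)
    g = g - pi @ g
    assert 4 * inner(h, g) - 2 * dirichlet(P1, g) - inner(h, h) <= best + 1e-10

# Ordering (4) with gamma = 2 (for this lazy pair it holds with equality).
gamma = 2.0
lhs = asymptotic_variance(P1, h) + inner(h, h)
rhs = (asymptotic_variance(P2, h) + inner(h, h)) / gamma
assert lhs <= rhs + 1e-12
```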

Propose-accept-reject kernels

Many kernels consist of making a proposal, which is then either accepted or rejected:

$$P(x, dy) := q(x, dy)\, \alpha(x, y) + (1 - \bar{\alpha}(x))\, \delta_x(dy),$$

where $\bar{\alpha}(x) := \int q(x, dy)\, \alpha(x, y)$ is the average acceptance probability from $x$. Then, from (2),

$$\mathcal{E}_P(f) = \frac{1}{2} \int \pi(dx)\, q(x, dy)\, \alpha(x, y)\, \{f(y) - f(x)\}^2.$$

The right spectral gap is, therefore,

$$\rho_{\mathrm{right}} = \frac{1}{2} \inf_{f \in L_0^2(\pi):\, \langle f, f \rangle = 1} \int \pi(dx)\, q(x, dy)\, \alpha(x, y)\, \{f(y) - f(x)\}^2.$$

There is a similar formula for the left spectral gap. Results (1) and (4) then lead directly to the following: if two propose-accept-reject kernels, both reversible with respect to $\pi$, satisfy

$$q_1(x, dy)\, \alpha_1(x, y) \ge \gamma\, q_2(x, dy)\, \alpha_2(x, y) \quad \forall x, y,$$

then

$$\rho_{\mathrm{right}}^{P_1} \ge \gamma\, \rho_{\mathrm{right}}^{P_2} \quad \text{and} \quad \mathrm{Var}(P_1, h) + \mathrm{Var}_\pi[h] \le \frac{1}{\gamma} \left\{ \mathrm{Var}(P_2, h) + \mathrm{Var}_\pi[h] \right\}.$$

From (4), if for some $\gamma > 0$, $\mathcal{E}_{P_1}(f) \ge \gamma\, \mathcal{E}_{P_2}(f)$ for all $f \in L_0^2(\pi)$ and $P_2$ is variance bounding, then so is $P_1$. Unfortunately it is rare that we can be sure of the ordering of the Dirichlet forms for all $f$; indeed we might be sure that there is no fixed $\gamma$ for which the ordering does hold. When we cannot simply resort to a uniform ordering of Dirichlet forms, the elegant concept of conductance can come to our aid.

Conductance

Consider any measurable set, $A \subseteq \mathcal{X}$. The conductance of $A$ is defined as

$$\kappa_P(A) := \frac{1}{\pi(A)} \int_{x \in A,\, y \in A^c} \pi(dx) P(x, dy),$$

where $\pi(A) := \int_A \pi(dx)$. Loosely speaking, $\kappa_P(A)$ is the probability of moving to $A^c$ conditional on the chain currently following the stationary distribution truncated to $A$: $\pi(dx) 1_{x \in A}/\pi(A)$. Our analysis of the properties of $\kappa_P(A)$ will be via the symmetric quantity

$$\tilde{\kappa}_P(A) = \tilde{\kappa}_P(A^c) := \frac{1}{\pi(A)\pi(A^c)} \int_{x \in A,\, y \in A^c} \pi(dx) P(x, dy) = \frac{\kappa_P(A)}{\pi(A^c)}.$$

The first equality follows because the kernel is reversible with respect to $\pi$, and this also implies that $\pi(A)\, \kappa_P(A) = \pi(A^c)\, \kappa_P(A^c)$. Since $\kappa_P(A^c) \le 1$, $\kappa_P(A) \le \pi(A^c)/\pi(A)$, which can be arbitrarily small.
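These definitions are easy to explore numerically. The sketch below (same hypothetical kernel) evaluates $\kappa_P(A)$ and $\tilde{\kappa}_P(A)$ for every non-trivial subset $A$, and confirms the symmetry $\tilde{\kappa}_P(A) = \tilde{\kappa}_P(A^c)$.

```python
import numpy as np
from itertools import combinations

P = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])
pi = np.array([0.25, 0.5, 0.25])
n = len(pi)

def kappa(A):
    # kappa_P(A) = (1/pi(A)) sum_{x in A, y in A^c} pi(x) P(x, y).
    mask = np.zeros(n, dtype=bool)
    mask[list(A)] = True
    flow_out = np.sum(pi[mask, None] * P[mask][:, ~mask])
    return flow_out / pi[mask].sum()

for r in range(1, n):
    for A in combinations(range(n), r):
        piA = pi[list(A)].sum()
        kappa_tilde = kappa(A) / (1 - piA)   # = kappa_P(A) / pi(A^c)
        Ac = tuple(x for x in range(n) if x not in A)
        assert np.isclose(kappa_tilde, kappa(Ac) / piA)   # symmetry in A, A^c
        print(A, kappa(A), kappa_tilde)
```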

To define the conductance of the kernel $P$ we therefore only consider sets $A$ with $\pi(A) \le 1/2$:

$$\kappa_P := \inf_{A:\, \pi(A) \le 1/2} \kappa_P(A).$$

No such restriction is required for

$$\tilde{\kappa}_P := \inf_A \tilde{\kappa}_P(A).$$

Further, since $\pi(A) \le 1/2 \implies 1/2 \le \pi(A^c) \le 1$, we have $\kappa_P \le \tilde{\kappa}_P \le 2\kappa_P$.

Setting $f_A(x) = [1_{x \in A} - \pi(A)]/\sqrt{\pi(A)\pi(A^c)}$ (so that $f_A \in L_0^2(\pi)$ and $\langle f_A, f_A \rangle = 1$) in the expression for the Dirichlet form (2) gives

$$\mathcal{E}_P(f_A) = \frac{1}{2\pi(A)\pi(A^c)} \left\{ \int_{x \in A,\, y \in A^c} \pi(dx) P(x, dy) + \int_{x \in A^c,\, y \in A} \pi(dx) P(x, dy) \right\} = \tilde{\kappa}_P(A) = \frac{\kappa_P(A)}{\pi(A^c)}.$$

So

$$\rho_{\mathrm{right}} = \inf_{f \in L_0^2(\pi):\, \langle f, f \rangle = 1} \mathcal{E}_P(f) \le \inf_A \mathcal{E}_P(f_A) = \tilde{\kappa}_P \le 2\kappa_P. \tag{5}$$

Hence, if $P$ has a right spectral gap (so it is variance bounding) then its conductance is non-zero. Amazingly, the converse is also true: if the conductance of $P$ is non-zero then $P$ has a right spectral gap:

$$\rho_{\mathrm{right}} \ge \frac{\kappa_P^2}{2} \ge \frac{\tilde{\kappa}_P^2}{8}. \tag{6}$$

Thus, non-zero conductance is equivalent to a non-zero right spectral gap, which is equivalent to variance bounding. Equations (5) and (6) together are sometimes called Cheeger bounds. A proof of (6) for finite Markov chains is given in Diaconis and Stroock (1991). A more accessible and, as far as I can see, more general proof of the looser inequality, that $\rho_{\mathrm{right}} \ge \tilde{\kappa}_P^2/8$, is given in Lawler and Sokal (1988). This has a "standard bit", which uses the Cauchy-Schwarz inequality to obtain an inequality for $\rho_{\mathrm{right}}$; a "beautiful bit", which relates the expression from the standard bit to conductance; and then an "ugly bit", which proves that the final expression is always greater than $0$ if $\tilde{\kappa}_P > 0$. I will follow Lawler and Sokal (1988) for the first two parts and then provide a neater solution to the final part.
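Before the proof, the bounds (5) and (6) can be checked exhaustively on the small hypothetical example by minimising over all subsets:

```python
import numpy as np
from itertools import combinations

P = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])
pi = np.array([0.25, 0.5, 0.25])
n = len(pi)

S = np.diag(np.sqrt(pi)) @ P @ np.diag(1.0 / np.sqrt(pi))
rho_right = 1.0 - np.sort(np.linalg.eigvalsh(S))[-2]

def kappa(A):
    mask = np.zeros(n, dtype=bool)
    mask[list(A)] = True
    return np.sum(pi[mask, None] * P[mask][:, ~mask]) / pi[mask].sum()

subsets = [A for r in range(1, n) for A in combinations(range(n), r)]
kappa_P = min(kappa(A) for A in subsets if pi[list(A)].sum() <= 0.5)
kappa_tilde_P = min(kappa(A) / (1 - pi[list(A)].sum()) for A in subsets)

assert rho_right <= kappa_tilde_P <= 2 * kappa_P + 1e-12    # (5)
assert rho_right >= kappa_P**2 / 2 >= kappa_tilde_P**2 / 8  # (6)
print(rho_right, kappa_P, kappa_tilde_P)                    # 0.5, 0.5, 2/3
```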

Denote the symmetric measure $\pi(dx) P(x, dy)$ by $\mu(dx, dy)$. We also set $g(x) = f(x) + c$ for some (currently) arbitrary real constant, $c$.

The standard bit. Cauchy-Schwarz (twice) and the symmetry of $\mu$ give:

$$\mathbb{E}_\mu\left[\left|g(X)^2 - g(Y)^2\right|\right] = \mathbb{E}_\mu\left[|g(X) - g(Y)|\, |g(X) + g(Y)|\right] \le \sqrt{\mathbb{E}_\mu\left[\{g(X) - g(Y)\}^2\right]} \sqrt{\mathbb{E}_\mu\left[\{g(X) + g(Y)\}^2\right]}$$
$$\le \sqrt{\mathbb{E}_\mu\left[\{f(X) - f(Y)\}^2\right]} \sqrt{4\, \mathbb{E}_\pi\left[g(X)^2\right]},$$

since $g(X) - g(Y) = f(X) - f(Y)$ and both marginals of $\mu$ are $\pi$. Squaring, and recalling that $\mathbb{E}_\mu[\{f(X) - f(Y)\}^2] = 2\mathcal{E}_P(f)$, gives

$$\mathcal{E}_P(f) \ge \frac{1}{8\, \mathbb{E}_\pi[g(X)^2]} \left( \mathbb{E}_\mu\left[\left|g(X)^2 - g(Y)^2\right|\right] \right)^2.$$

The beautiful bit. We now relate the numerator of the above expression to the conductance. The proof is symmetrical: the first half manipulates the expectation so that conductance may be used, with the second half reversing the route of the first. Set $A_t := \{x : g(x)^2 \le t\}$. Then, by the symmetry of $\mu$,

$$\int_{\mathcal{X}} \int_{\mathcal{X}} \mu(dx, dy)\, \left|g(y)^2 - g(x)^2\right| = 2 \int_{\mathcal{X}} \int_{\mathcal{X}} \mu(dx, dy)\, 1_{g(x)^2 < g(y)^2} \left\{ g(y)^2 - g(x)^2 \right\}$$
$$= 2 \int_{\mathcal{X}} \int_{\mathcal{X}} \mu(dx, dy) \int_{t=0}^{\infty} dt\, 1_{g(x)^2 \le t < g(y)^2} = 2 \int_{t=0}^{\infty} dt \int_{\mathcal{X}} \int_{\mathcal{X}} \mu(dx, dy)\, 1_{g(x)^2 \le t < g(y)^2} = 2 \int_{t=0}^{\infty} dt \int_{A_t \times A_t^c} \mu(dx, dy).$$

By the definition of $\tilde{\kappa}_P$, $\int_{A_t \times A_t^c} \mu(dx, dy) \ge \tilde{\kappa}_P\, \pi(A_t) \pi(A_t^c) = \tilde{\kappa}_P \int_{A_t \times A_t^c} \pi(dx)\pi(dy)$, so, reversing the route of the first half,

$$\int_{\mathcal{X}} \int_{\mathcal{X}} \mu(dx, dy)\, \left|g(y)^2 - g(x)^2\right| \ge 2\tilde{\kappa}_P \int_{t=0}^{\infty} dt \int_{A_t \times A_t^c} \pi(dx)\pi(dy) = \tilde{\kappa}_P\, \mathbb{E}_{\pi \otimes \pi}\left[\left|g(Y)^2 - g(X)^2\right|\right].$$

So

$$\mathcal{E}_P(f) \ge \frac{\tilde{\kappa}_P^2}{8\, \mathbb{E}_\pi[g(X)^2]} \left( \mathbb{E}_{\pi \otimes \pi}\left[\left|g(X)^2 - g(Y)^2\right|\right] \right)^2.$$

But since this is true for all $c$, we have

$$\rho_{\mathrm{right}} \ge \frac{\tilde{\kappa}_P^2}{8} \inf_{f \in L_0^2(\pi):\, \langle f, f \rangle = 1}\ \sup_c\ \frac{\left( \mathbb{E}_{\pi \otimes \pi}\left[\left|\{f(Y) + c\}^2 - \{f(X) + c\}^2\right|\right] \right)^2}{\mathbb{E}_\pi\left[\{f(X) + c\}^2\right]}.$$

A less ugly bit. Since $\mathbb{E}_\pi[\{f(X) + c\}^2] = 1 + c^2$, we need to show that

$$\inf_{f \in L_0^2(\pi):\, \langle f, f \rangle = 1}\ \sup_c\ \frac{\mathbb{E}_{\pi \otimes \pi}\left[\left|\{f(Y) + c\}^2 - \{f(X) + c\}^2\right|\right]}{\sqrt{1 + c^2}} \ge 1.$$

Let $A = f(X)$ and $B = f(Y)$ be independent and identically distributed with an expectation of $0$ and a variance of $1$. We are interested in

$$\frac{\mathbb{E}\left[\left|\{A + c\}^2 - \{B + c\}^2\right|\right]}{\sqrt{1 + c^2}}. \tag{7}$$

Below, I will show that

$$2 \le \mathbb{E}\left[\left|A^2 - B^2\right|\right] + 4\, \mathbb{E}\left[|A - B|\right]^2. \tag{8}$$

Then, as in Lawler and Sokal (1988), consider letting $c \to \infty$ or setting $c = 0$. With the former,

$$\lim_{c \to \infty} \frac{\mathbb{E}\left[\left|\{A + c\}^2 - \{B + c\}^2\right|\right]}{\sqrt{1 + c^2}} = 2\, \mathbb{E}\left[|A - B|\right],$$

so if $\mathbb{E}[|A - B|] > 1/2$ we are done. If not, setting $c = 0$ in (7) and using (8) leaves us with

$$\mathbb{E}\left[\left|A^2 - B^2\right|\right] \ge 2 - 4\, \mathbb{E}\left[|A - B|\right]^2 \ge 1.$$

To prove (8):

$$\mathbb{E}\left[\left|A^2 - B^2\right|\right] = \int_{0}^{\infty} \mathbb{P}\left(\left|A^2 - B^2\right| > t\right) dt = \int_{0}^{\infty} \left\{ \mathbb{P}\left(A^2 > t + B^2\right) + \mathbb{P}\left(B^2 > t + A^2\right) \right\} dt,$$

and

$$\int_{0}^{\infty} \mathbb{P}\left(B^2 > t + A^2\right) dt = \mathbb{E}_A\left[\int_{0}^{\infty} \mathbb{P}\left(B^2 > t + A^2 \mid A\right) dt\right].$$

But, for fixed $a$, substituting first $v = t + a^2$ and then $v = u^2$,

$$\int_{0}^{\infty} \mathbb{P}\left(B^2 > t + a^2\right) dt = \int_{a^2}^{\infty} \mathbb{P}\left(B^2 > v\right) dv = \mathbb{E}\left[B^2\right] - \int_{0}^{a^2} \mathbb{P}\left(B^2 > v\right) dv = \mathbb{E}\left[B^2\right] - \int_{0}^{|a|} 2u\, \mathbb{P}\left(|B| > u\right) du \ge \mathbb{E}\left[B^2\right] - 2|a|\, \mathbb{E}\left[|B|\right].$$

Combining the two end results (and using the corresponding bound with $A$ and $B$ interchanged) gives

$$\mathbb{E}\left[\left|A^2 - B^2\right|\right] \ge 2\, \mathbb{E}\left[B^2\right] - 4\, \mathbb{E}\left[|A|\right] \mathbb{E}\left[|B|\right] = 2 - 4\, \mathbb{E}\left[|A|\right] \mathbb{E}\left[|B|\right];$$

i.e.

$$2 \le \mathbb{E}\left[\left|A^2 - B^2\right|\right] + 4\, \mathbb{E}\left[|A|\right] \mathbb{E}\left[|B|\right].$$

However, Jensen's inequality provides

$$\mathbb{E}\left[|A|\right] = \mathbb{E}\left[|A - \mathbb{E}[B]|\right] \le \mathbb{E}\left[\mathbb{E}\left[|A - B| \mid A\right]\right] = \mathbb{E}\left[|A - B|\right],$$

and (8) follows.
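Inequality (8) can also be sanity-checked by Monte Carlo for a few mean-zero, variance-one distributions (a quick illustration of my own, not part of the proof):

```python
import numpy as np

# Monte Carlo check of (8): 2 <= E[|A^2 - B^2|] + 4 E[|A - B|]^2
# for A, B iid with mean 0 and variance 1.
rng = np.random.default_rng(1)
N = 10**6

draws = {
    "normal": lambda: rng.standard_normal(N),
    "uniform": lambda: rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), N),
    "rademacher": lambda: rng.choice([-1.0, 1.0], N),
}
for name, draw in draws.items():
    A, B = draw(), draw()
    rhs = np.mean(np.abs(A**2 - B**2)) + 4 * np.mean(np.abs(A - B))**2
    print(name, rhs)
    assert rhs >= 2 - 0.01              # allow a little Monte Carlo error
```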

Acknowledgements

I am grateful to Dr. Daniel Elton for providing the proof in Appendix A and to Dr. Dootika Vats for spotting an error in an earlier version of this document.

A Variational representation of $\langle f, Q^{-1} f \rangle$

Let $H$ be a Hilbert space and let $Q$ be a positive operator on $H$ (i.e., an operator that has a square root). Then for $f \in H$,

$$\langle f, Q^{-1} f \rangle = \sup_{g \in H} \left\{ 2\, \mathrm{Re}(\langle f, g \rangle) - \langle g, Qg \rangle \right\}.$$

(Our Hilbert space is real, so we do not need the $\mathrm{Re}(\cdot)$ function.)

Proof. Let $\phi = Q^{-1} f$, so $f = Q\phi$. The left-hand side is then

$$\langle Q\phi, \phi \rangle = \langle Q^{1/2}\phi, Q^{1/2}\phi \rangle = \left\|Q^{1/2}\phi\right\|^2 \ge \left\|Q^{1/2}\phi\right\|^2 - \left\|Q^{1/2}(\phi - g)\right\|^2 = 2\, \mathrm{Re}\left(\langle Q^{1/2}\phi, Q^{1/2}g \rangle\right) - \left\|Q^{1/2}g\right\|^2 = 2\, \mathrm{Re}(\langle Q\phi, g \rangle) - \langle g, Qg \rangle = 2\, \mathrm{Re}(\langle f, g \rangle) - \langle g, Qg \rangle.$$

The construction of the inequality shows that the supremum is achieved at $g = Q^{-1} f$.
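The appendix identity is easy to check in finite dimensions, where any positive definite matrix can play the role of $Q$ (a sketch of my own):

```python
import numpy as np

# Check <f, Q^{-1} f> = sup_g {2 <f, g> - <g, Q g>}, attained at g = Q^{-1} f.
rng = np.random.default_rng(2)
M = rng.normal(size=(4, 4))
Q = M @ M.T + 0.1 * np.eye(4)           # positive definite, so Q^{1/2} exists
f = rng.normal(size=4)

g_star = np.linalg.solve(Q, f)          # the maximiser g = Q^{-1} f
lhs = f @ g_star                        # <f, Q^{-1} f>
assert np.isclose(lhs, 2 * (f @ g_star) - g_star @ Q @ g_star)

# No other g does better: the objective drops by <(g - g*), Q (g - g*)> >= 0.
for _ in range(100):
    g = g_star + rng.normal(size=4)
    assert 2 * (f @ g) - g @ Q @ g <= lhs + 1e-12
```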

References

Andrieu, C., Lee, A., and Vihola, M. (2016). Uniform ergodicity of the iterated conditional SMC and geometric ergodicity of particle Gibbs samplers. Bernoulli. To appear.

Caracciolo, S., Pelissetto, A., and Sokal, A. D. (1990). Nonlocal Monte Carlo algorithm for self-avoiding walks with fixed endpoints. Journal of Statistical Physics, 60(1):1-53.

Diaconis, P. and Stroock, D. (1991). Geometric bounds for eigenvalues of Markov chains. Ann. Appl. Probab., 1(1):36-61.

Lawler, G. F. and Sokal, A. D. (1988). Bounds on the L² spectrum for Markov chains and Markov processes: a generalization of Cheeger's inequality. Trans. Amer. Math. Soc., 309(2):557-580.

Peskun, P. H. (1973). Optimum Monte-Carlo sampling using Markov chains. Biometrika, 60(3):607-612.

Roberts, G. O. and Rosenthal, J. S. (2008). Variance bounding Markov chains. Ann. Appl. Probab., 18(3):1201-1214.

Sherlock, C. and Lee, A. (2017). Variance bounding of delayed-acceptance kernels. ArXiv e-prints.

Sherlock, C., Thiery, A., and Lee, A. (2017). Pseudo-marginal Metropolis-Hastings using averages of unbiased estimators. Biometrika. Accepted subject to minor revisions.

Tierney, L. (1998). A note on Metropolis-Hastings kernels for general state spaces. Ann. Appl. Probab., 8(1):1-9.
