On Reparametrization and the Gibbs Sampler


Jorge Carlos Román
Department of Mathematics, Vanderbilt University

James P. Hobert
Department of Statistics, University of Florida

Brett Presnell
Department of Statistics, University of Florida

March 2014

(Román's research supported by NSF Grant DMS; Hobert's research supported by NSF Grant DMS. Corresponding author's e-mail: jc.roman@vanderbilt.edu.)

Abstract

Gibbs samplers derived under different parametrizations of the target density can have radically different rates of convergence. In this article, we specify conditions under which reparametrization leaves the convergence rate of a Gibbs chain unchanged. An example illustrates how these results can be exploited in convergence rate analyses.

1 Introduction

It is well known that Gibbs samplers derived under different parametrizations of a Bayesian hierarchical model can have dramatically different rates of convergence (Gelfand et al., 1995; Papaspiliopoulos et al., 2007; Roberts and Sahu, 1997; Yu and Meng, 2011). In this article, we consider the reverse situation, in which reparametrization has no effect. To motivate our study, we begin with a fresh look at a well-known toy example involving a simple random effects model with known variance components.

Consider the one-way random effects model given by

$$Y_{ij} = \theta_i + \epsilon_{ij}, \quad (1)$$

for $i = 1, \ldots, c$ and $j = 1, \ldots, m_i$, where the $\theta_i$ are independent and identically distributed (iid) $N(\mu, \sigma^2)$, and the $\epsilon_{ij}$ are independent of the $\theta_i$ and iid $N(0, \sigma^2_e)$. (For now, we restrict attention to the balanced case where $m_i \equiv m$.)

Suppose that the variance components, $\sigma^2$ and $\sigma^2_e$, are known, and that the prior on $\mu$ is flat. Let $\theta = (\theta_1, \ldots, \theta_c)$ and let $y$ denote the observed data. A simple calculation shows that the posterior density of $\mu$ given $y$ is normal, but consider nevertheless the two-component Gibbs chain $\{(\mu_n, \theta_n)\}_{n=0}^{\infty}$ that alternately samples from the conditional distributions $\theta \mid \mu, y$ and $\mu \mid \theta, y$, which are $c$-variate normal and univariate normal, respectively. The marginal sequence $\{\mu_n\}_{n=0}^{\infty}$ is itself a Markov chain whose invariant density is the posterior density (of $\mu$ given $y$), and it is easy to show that the exact rate of convergence of this chain is $\sigma^2_e/(\sigma^2_e + m\sigma^2)$ (see, e.g., Liu et al., 1994). The rate of convergence will be formally defined in Section 2, but for now it suffices to note that the rate is between 0 and 1, and smaller is better.

Now consider a reparametrized version of model (1) given by $Y_{ij} = \mu + u_i + \epsilon_{ij}$, where the $u_i$ are iid $N(0, \sigma^2)$, and the $\epsilon_{ij}$ are independent of the $u_i$ and still iid $N(0, \sigma^2_e)$. Let $u = (u_1, \ldots, u_c)$. This is called the non-centered parametrization (NCP), whereas model (1) is called the centered parametrization (CP). If we put the same flat prior on $\mu$, then the posterior density of $\mu$ given $y$ remains the same as in the CP model. However, the two-component Gibbs sampler derived from the NCP model, which alternates between draws from $u \mid \mu, y$ and $\mu \mid u, y$, is not the same as the one based on the CP. Furthermore, the two Gibbs samplers have completely different convergence behavior. Indeed, the convergence rate of the NCP Gibbs sampler is $1 - \sigma^2_e/(\sigma^2_e + m\sigma^2)$. So when one of the two Gibbs samplers is very slow to converge, the other converges extremely rapidly. This simple example illustrates that reparametrization can significantly affect the convergence rate of the Gibbs sampler.

In a practical version of the one-way model, the variance components are unknown. In this case, the standard default prior density for $(\mu, \sigma^2, \sigma^2_e)$ is $1/\big(\sigma^2_e \sqrt{\sigma^2}\big)$. We assume that the posterior is proper; see Román (2012) for conditions. The posterior density of $(\mu, \sigma^2, \sigma^2_e)$ given $y$, which is the same under the CP and the NCP, is intractable, so this is no longer a toy example. As in the known-variance case, there are two different versions of the standard two-component Gibbs sampler for this problem: the CP Gibbs sampler, which alternates between $\theta, \mu \mid \sigma^2, \sigma^2_e, y$ and $\sigma^2, \sigma^2_e \mid \mu, \theta, y$, and the NCP Gibbs sampler, which alternates between $u, \mu \mid \sigma^2, \sigma^2_e, y$ and $\sigma^2, \sigma^2_e \mid u, \mu, y$. The results of Section 3 imply that, in contrast with the known-variance case, these two Gibbs samplers converge at exactly the same rate. Consequently, convergence rate results for either of these Gibbs samplers apply directly to the other. In Section 3 we compare the results of Román (2012), who analyzed the NCP Gibbs sampler, with those of Tan and Hobert (2009), who studied the CP version.

The CP and NCP Gibbs Markov chains described above share the same rate of convergence because the transformation that takes the CP model to the NCP model involves variables ($\theta$ and $\mu$) that reside in the same component (or block) of the two-component Gibbs sampler. (Note that this is not the case in the toy example where the variance components are known.) The main result in this paper is a formalization of this idea.
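To make the toy example concrete, here is a minimal simulation sketch (Python with NumPy; the data are simulated and all variable names are ours, not the paper's). With the variance components held known, the $\{\mu_n\}$ sequence of each sampler is a Gaussian AR(1) process, so its lag-one autocorrelation equals its convergence rate, and the two estimates should land near $\sigma^2_e/(\sigma^2_e + m\sigma^2)$ and its complement, respectively.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated balanced data: c groups of size m, variance components known.
c, m, sig2, sig2e = 10, 5, 4.0, 1.0
theta_true = rng.normal(0.0, np.sqrt(sig2), c)
y = theta_true[:, None] + rng.normal(0.0, np.sqrt(sig2e), (c, m))
ybar = y.mean(axis=1)                      # group means

def cp_gibbs(n, mu=0.0):
    """CP sampler: alternate theta | mu, y and mu | theta, y."""
    prec = m / sig2e + 1.0 / sig2          # conditional precision of theta_i
    out = np.empty(n)
    for t in range(n):
        theta = rng.normal((m * ybar / sig2e + mu / sig2) / prec,
                           1.0 / np.sqrt(prec))
        mu = rng.normal(theta.mean(), np.sqrt(sig2 / c))
        out[t] = mu
    return out

def ncp_gibbs(n, mu=0.0):
    """NCP sampler: alternate u | mu, y and mu | u, y."""
    prec = m / sig2e + 1.0 / sig2          # conditional precision of u_i
    out = np.empty(n)
    for t in range(n):
        u = rng.normal(m * (ybar - mu) / sig2e / prec, 1.0 / np.sqrt(prec))
        mu = rng.normal(y.mean() - u.mean(), np.sqrt(sig2e / (c * m)))
        out[t] = mu
    return out

def lag1(x):
    """Lag-one autocorrelation, which here equals the convergence rate."""
    x = x - x.mean()
    return (x[:-1] * x[1:]).mean() / (x * x).mean()

rate = sig2e / (sig2e + m * sig2)          # theoretical CP rate
print("CP : theory %.3f  sampled %.3f" % (rate, lag1(cp_gibbs(200_000))))
print("NCP: theory %.3f  sampled %.3f" % (1 - rate, lag1(ncp_gibbs(200_000))))
```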

We now provide an overview of our results in the special case where the target distribution has a density with respect to Lebesgue measure. Suppose $f: \mathbb{R}^{d_1} \times \mathbb{R}^{d_2} \times \cdots \times \mathbb{R}^{d_k} \to [0, \infty)$ is a probability density function, and let $\Phi_1 = \{(X_n^{(1)}, X_n^{(2)}, \ldots, X_n^{(k)})\}_{n=0}^{\infty}$ denote the Markov chain simulated by the $k$-component Gibbs sampler based on $f(x_1, x_2, \ldots, x_k)$ that updates the components in the natural order. It is well known and easy to see that the marginal sequence $\tilde{\Phi}_1 := \{(X_n^{(2)}, \ldots, X_n^{(k)})\}_{n=0}^{\infty}$ is also a Markov chain. Now, for $i \in \{2, 3, \ldots, k\}$, let $\Phi_i$ denote the $k$-component Gibbs sampler whose update order is $(i, i+1, \ldots, k, 1, 2, \ldots, i-1)$, and let $\tilde{\Phi}_i$ denote the corresponding marginal Markov chain (that leaves out $X^{(i)}$). We show that all $2k$ of these chains converge at exactly the same rate. Not only is this fact the key to the proof of our main result concerning reparametrization, it is also useful from a practical standpoint. Indeed, if one wishes to know the rate of convergence of $\Phi_1$, then it suffices to study the lower-dimensional chain $\tilde{\Phi}_i$ (for any $i = 1, 2, \ldots, k$), which may be easier to analyze than $\Phi_1$. This idea has been used to establish qualitative convergence results (such as geometric and uniform ergodicity) for two-component Gibbs samplers (see, e.g., Diebolt and Robert (1994) and Román and Hobert (2012)). A simulation-oriented sketch of these constructions is given at the end of this section.

Now let $(X_1, X_2, \ldots, X_k)$ denote a random vector with density $f$, and consider the $k$-component Gibbs sampler $\hat{\Phi}_1$ based on the distribution of $(\hat{X}_1, \hat{X}_2, \ldots, \hat{X}_k) = (t_1(X_1), t_2(X_2), \ldots, t_k(X_k))$. Suppose $f(x_1, x_2, \ldots, x_k)$ can be written as a function of $(t_1(x_1), t_2(x_2), \ldots, t_k(x_k))$, an assumption that obviously holds if each $t_i: \mathbb{R}^{d_i} \to \mathbb{R}^{d_i}$ is invertible. Then, by exploiting the fact that the $2k$ chains described above share the same rate, we show that $\Phi_1$ and $\hat{\Phi}_1$ converge at the same rate. An important implication of this result is that, when analyzing the convergence rate of a Gibbs sampler, one is free to choose a convenient parametrization, as long as the corresponding transformation respects the within-component restriction.

The remainder of this article is organized as follows. Section 2 contains some background on general state space Markov chain theory as well as preliminary results. Our main result, showing that a within-component reparametrization does not affect the convergence rate of the Gibbs Markov chain, can be found in Section 3. That section also contains the application of our main result to the Gibbs samplers for the one-way model with improper priors.
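As a concrete rendering of the objects just introduced, the following sketch (Python; the helper gibbs_scan and the bivariate normal illustration are ours, purely illustrative) simulates a deterministic-scan Gibbs sampler with an arbitrary update order and also returns the marginal sequence obtained by discarding the first-updated component, i.e., the analogues of $\Phi_i$ and $\tilde{\Phi}_i$.

```python
import numpy as np

def gibbs_scan(conditionals, x0, order, n_iter, rng):
    """Deterministic-scan Gibbs sampler updating components in `order`
    (0-based).  `conditionals[i](x, rng)` draws component i from its
    conditional distribution given the current values of the others.
    Returns the full chain (Phi_i in the text) and the marginal chain
    obtained by discarding the first-updated component (Phi-tilde_i)."""
    x = list(x0)
    draws = np.empty((n_iter, len(x0)))
    for n in range(n_iter):
        for i in order:
            x[i] = conditionals[i](x, rng)
        draws[n] = x
    keep = [j for j in range(len(x0)) if j != order[0]]
    return draws, draws[:, keep]

# Illustration: two-component Gibbs for a bivariate normal with
# correlation r, so each full conditional is N(r * other, 1 - r^2).
r = 0.9
conds = [lambda x, g: g.normal(r * x[1], np.sqrt(1 - r**2)),
         lambda x, g: g.normal(r * x[0], np.sqrt(1 - r**2))]
phi, phi_tilde = gibbs_scan(conds, (0.0, 0.0), (0, 1), 10_000,
                            np.random.default_rng(3))
```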

2 Markov Chain Background and Preliminary Results

As in Meyn and Tweedie (1993, Chapter 3), let $P(x, dy)$ be a generic Markov transition function (MTF) on a set $\mathsf{X}$ equipped with a countably generated $\sigma$-algebra. Let $P^n(x, dy)$ denote the $n$-step MTF. We assume throughout that the chain determined by $P$ is $\psi$-irreducible, aperiodic and positive recurrent with invariant probability measure $\pi$. We do not assume reversibility. For a measure $\nu$ on $\mathsf{X}$, let $\nu P^n(dy) = \int_{\mathsf{X}} P^n(x, dy)\, \nu(dx)$. Following Roberts and Tweedie (2001) and Rosenthal (2003), define the $L^1$-rate of convergence of the Markov chain as

$$\rho = \exp\left\{ \sup_{\nu \in p(\pi)} \lim_{n \to \infty} \frac{1}{n} \log \|\nu P^n - \pi\|_{TV} \right\},$$

where $\|\cdot\|_{TV}$ denotes the total variation norm for signed measures and $p(\pi)$ is the set of all probability measures $\nu$ that are absolutely continuous with respect to $\pi$ with $\int_{\mathsf{X}} (d\nu/d\pi)^2\, d\pi < \infty$. For reversible chains, $\rho$ equals the usual rate of convergence, i.e., the spectral radius (and norm) of the self-adjoint Markov operator defined by $P$ (Rosenthal, 2003, Proposition 2). As in Roberts and Rosenthal (1997), we say that the chain (or the corresponding MTF) is $\pi$-a.e. geometrically ergodic if there exist $M: \mathsf{X} \to (0, \infty)$ and $\kappa < 1$ such that, for $\pi$-a.e. $x \in \mathsf{X}$,

$$\|P^n(x, \cdot) - \pi(\cdot)\|_{TV} \le M(x)\,\kappa^n \quad \text{for all } n \in \mathbb{N}.$$

We often omit the "$\pi$-a.e." and simply write "geometrically ergodic." The next proposition follows easily from results in Roberts and Rosenthal (1997) and Roberts and Tweedie (2001).

Proposition 1. The Markov chain based on $P$ is geometrically ergodic if and only if $\rho < 1$.

Now, for $i = 1, 2, \ldots, k$, let $(\mathsf{X}_i, \mathcal{F}_i, \mu_i)$ denote $\sigma$-finite measure spaces, and let $(\mathsf{X}, \mathcal{F}, \mu)$ denote the corresponding product space. Suppose that $\pi$ is a probability distribution on $(\mathsf{X}, \mathcal{F})$ having density $f(x_1, x_2, \ldots, x_k)$ with respect to $\mu$. Let $P_i$ denote the MTF of the $k$-component Gibbs sampler whose update order is $(i, i+1, \ldots, k, 1, 2, \ldots, i-1)$, and let $Q_i$ denote the MTF of the corresponding marginal Markov chain (that leaves out the $i$th component). A proof of the following result can be found in the Appendix.

Proposition 2. The Markov chains defined by the MTFs $\{P_i\}_{i=1}^{k}$ and $\{Q_i\}_{i=1}^{k}$ all share the same $L^1$ convergence rate.

In conjunction with Proposition 1, Proposition 2 shows that geometric ergodicity is a solidarity property for the $2k$ chains defined by $\{P_i\}_{i=1}^{k}$ and $\{Q_i\}_{i=1}^{k}$. That is, either all $2k$ chains are geometrically ergodic, or none of them is. This result is actually well known when $k = 2$. Indeed, in that case, Lemma 2.4 of Diaconis et al. (2008) shows that geometric ergodicity is a solidarity property for $P_1$ and $Q_1$, and symmetry implies that the same holds for $P_2$ and $Q_2$. (These facts can also be established using results in Roberts and Rosenthal (2001).) Furthermore, when $k = 2$, the marginal Markov chains defined by $Q_1$ and $Q_2$ are reversible, and the norms of the corresponding self-adjoint Markov operators are identical (Liu et al., 1994). Then, because a reversible Markov chain is geometrically ergodic if and only if the norm of its Markov operator is strictly less than one (Roberts and Rosenthal, 1997), it follows that $Q_1$ is geometrically ergodic if and only if $Q_2$ is, which completes the cycle, and hence the argument, for $k = 2$.
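Proposition 2 can be checked numerically in a small discrete setting. The sketch below (Python; the construction is ours, not from the paper) builds the exact transition matrices of the three systematic-scan Gibbs samplers $P_1$, $P_2$, $P_3$ for a random everywhere-positive pmf on $\{0,1\}^3$, forms the three marginal chains $Q_1$, $Q_2$, $Q_3$, and prints the second-largest eigenvalue modulus of each matrix, which for a finite, irreducible, aperiodic chain governs the rate of convergence; all six values coincide, as the proposition predicts.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

# A random, everywhere-positive joint pmf f on {0,1}^3.
f = rng.random((2, 2, 2))
f /= f.sum()
states = list(product(range(2), repeat=3))
idx = {s: i for i, s in enumerate(states)}

def update_kernel(coord):
    """8 x 8 kernel for the conditional update of one coordinate."""
    U = np.zeros((8, 8))
    for s in states:
        rest = [s[j] for j in range(3) if j != coord]
        num = np.array([f[tuple(np.insert(rest, coord, v))] for v in (0, 1)])
        cond = num / num.sum()            # conditional pmf of the coordinate
        for v in (0, 1):
            t = list(s)
            t[coord] = v
            U[idx[s], idx[tuple(t)]] = cond[v]
    return U

U1, U2, U3 = (update_kernel(i) for i in range(3))
P = [U1 @ U2 @ U3, U2 @ U3 @ U1, U3 @ U1 @ U2]   # scan orders of P1, P2, P3

def marginal(Pmat, coord):
    """4 x 4 kernel of the marginal chain that leaves out `coord`.
    Rows of Pmat do not depend on the starting value of `coord`,
    because that coordinate is the first one updated."""
    keep = [s for s in states if s[coord] == 0]   # representatives
    Q = np.zeros((4, 4))
    for a, s in enumerate(keep):
        for t, p in zip(states, Pmat[idx[s]]):
            u = list(t)
            u[coord] = 0
            Q[a, keep.index(tuple(u))] += p
    return Q

Q = [marginal(P[i], i) for i in range(3)]

def slem(M):
    """Second-largest eigenvalue modulus: the chain's convergence rate."""
    return np.sort(np.abs(np.linalg.eigvals(M)))[-2]

print([round(slem(M), 8) for M in P + Q])   # six identical numbers
```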

3 Reparametrization

Suppose that $(X_1, X_2, \ldots, X_k)$ has (joint) distribution $\pi$, and let $\tilde{\pi}$ represent the distribution of $(t_1(X_1), t_2(X_2), \ldots, t_k(X_k))$. Under what conditions does the Gibbs sampler based on $\tilde{\pi}$ have the same rate of convergence as the sampler based on $\pi$; i.e., when is the convergence rate of the Gibbs sampler unchanged by within-block transformations? To formalize this question, let $(\mathsf{X}_i, \mathcal{F}_i, \mu_i)$, $i = 1, 2, \ldots, k$, $(\mathsf{X}, \mathcal{F}, \mu)$, $\pi$, and $f$ be as in the previous section. Let $(\mathsf{Y}_i, \mathcal{G}_i)$, $i = 1, 2, \ldots, k$, be measurable spaces, let $(\mathsf{Y}, \mathcal{G})$ be their product, and assume that $t_i: \mathsf{X}_i \to \mathsf{Y}_i$, $i = 1, 2, \ldots, k$, are measurable transformations. Finally, let $T(x_1, x_2, \ldots, x_k) = (t_1(x_1), t_2(x_2), \ldots, t_k(x_k))$ and let $\tilde{\pi} = \pi \circ T^{-1}$ be the probability distribution induced on $(\mathsf{Y}, \mathcal{G})$ by the transformation $T$; i.e., $\tilde{\pi}(B) = \pi(T^{-1}(B))$, $B \in \mathcal{G}$, where $T^{-1}(B)$ is the pre-image of $B$ under $T$. The following result is proved in the Appendix.

Proposition 3. Suppose that there exists a measurable function $\tilde{f}: \mathsf{Y} \to \mathbb{R}$ such that

$$f(x_1, x_2, \ldots, x_k) = \tilde{f}\big(t_1(x_1), t_2(x_2), \ldots, t_k(x_k)\big) \quad (2)$$

for all $(x_1, x_2, \ldots, x_k) \in \mathsf{X}$. Then the $k$-component Gibbs samplers based on $\pi$ and $\tilde{\pi}$ (both updating the components in the natural order) have the same $L^1$-rate of convergence.

Remark. The main hypothesis of Proposition 3 clearly holds when each $t_i$ is an invertible function (with measurable inverse), since then (2) holds with $\tilde{f}(y_1, y_2, \ldots, y_k) = f\big(t_1^{-1}(y_1), t_2^{-1}(y_2), \ldots, t_k^{-1}(y_k)\big)$.

We now return to the CP and NCP Gibbs samplers for the one-way model. In the Introduction, we considered only the balanced case, in which all the $m_i$ are the same, and we considered only one prior density.

Here we allow the $m_i$ to differ, and we consider a family of prior densities for $(\mu, \sigma^2, \sigma^2_e)$ given by

$$\big(\sigma^2\big)^{-(a+1)} \big(\sigma^2_e\big)^{-(b+1)} I_{(0,\infty)}(\sigma^2_e)\, I_{(0,\infty)}(\sigma^2),$$

where $a$ and $b$ are hyperparameters. Note that by taking $(a, b) = (-1/2, 0)$, we recover the default prior from the Introduction. Tan and Hobert (2009) analyzed the Gibbs sampler based on the CP version of the one-way model and proved that the CP Gibbs Markov chain is geometrically ergodic if $a < 0$,

$$M + 2b \ge c + 3, \qquad \text{and} \qquad \min\left\{ \left( \frac{1}{c}\sum_{i=1}^{c} \frac{m_i}{m_i + 1} \right)^{-1},\; \frac{M}{m} \right\} < 2\exp\left\{ \Psi\left( \frac{c}{2} + a \right) \right\},$$

where $M = \sum_{i=1}^{c} m_i$, $m = \max\{m_1, m_2, \ldots, m_c\}$ and $\Psi(x) = \frac{d}{dx}\log\Gamma(x)$ is the digamma function. Román (2012) (see also Román and Hobert (2012)) subsequently proved that the NCP Gibbs Markov chain is geometrically ergodic if $a < 0$,

$$M + 2b \ge c + 2, \qquad \text{and} \qquad 1 < 2\exp\left\{ \Psi\left( \frac{c}{2} + a \right) \right\}.$$

It is easy to see that Román's conditions are weaker (i.e., easier to satisfy) than those of Tan and Hobert. However, the two sets of conditions are directly comparable only if geometric ergodicity is a solidarity property for the two different Gibbs chains. Let $\pi(\theta, \mu, \sigma^2, \sigma^2_e \mid y)$ denote the complete data posterior density under the CP model, which is the invariant density of the CP Gibbs Markov chain. Consider a one-to-one transformation of $\big((\theta, \mu), (\sigma^2, \sigma^2_e)\big)$ to $\big(t(\theta, \mu), (\sigma^2, \sigma^2_e)\big)$, where $t: \mathbb{R}^{c+1} \to \mathbb{R}^{c+1}$ is defined as follows:

$$t(\theta, \mu) = \big(\theta_1 - \mu,\, \theta_2 - \mu,\, \ldots,\, \theta_c - \mu,\, \mu\big).$$

The density of the transformed variable is exactly the complete data posterior density under the NCP model, so Proposition 3 implies that the CP and NCP Gibbs chains share the same $L^1$-rate. Thus, Román's (2012) result is indeed an improvement upon that of Tan and Hobert (2009).

We now present an example involving a transformation that is not one-to-one. Consider a pair of random variables $(X_1, X_2)$ such that

$$X_1 \mid X_2 = x_2 \sim N(0, 1/x_2) \quad (3)$$

and $X_2 \sim \mathrm{Gamma}\big(\frac{\nu}{2}, \frac{\nu}{2}\big)$, where $\nu > 0$ is a known constant.

Then the density of $(X_1, X_2)$ is

$$f(x_1, x_2) = \frac{(\nu/2)^{\nu/2}}{\Gamma(\nu/2)\sqrt{2\pi}}\; x_2^{\frac{\nu - 1}{2}} \exp\left\{ -x_2 \left( \frac{x_1^2 + \nu}{2} \right) \right\} I_{(0,\infty)}(x_2),$$

and it can be shown that

$$X_2 \mid X_1 = x_1 \sim \mathrm{Gamma}\left( \frac{\nu + 1}{2},\; \frac{1}{2}\big(x_1^2 + \nu\big) \right). \quad (4)$$

Although direct simulation of $(X_1, X_2)$ is clearly possible, consider the Gibbs sampler which uses the conditionals in (3) and (4). Suppose we use the transformation $U_1 = t_1(X_1) = X_1^2$ (which is not one-to-one) together with $U_2 = t_2(X_2) = X_2$. Since $X_1 \mid X_2 = x_2 \sim N(0, 1/x_2)$, it follows immediately, using a $\chi^2$-type calculation, that $X_1^2 \mid X_2 = x_2 \sim \mathrm{Gamma}(1/2,\, x_2/2)$. In other words,

$$U_1 \mid U_2 = u_2 \sim \mathrm{Gamma}(1/2,\, u_2/2). \quad (5)$$

Obviously, $U_2 \sim \mathrm{Gamma}\big(\frac{\nu}{2}, \frac{\nu}{2}\big)$, and the density of $(U_1, U_2)$ is

$$f_U(u_1, u_2) = \frac{(\nu/2)^{\nu/2}}{\Gamma(\nu/2)\sqrt{2\pi}}\; \frac{1}{\sqrt{u_1}}\; u_2^{\frac{\nu - 1}{2}} \exp\left\{ -\frac{u_2}{2}\big(u_1 + \nu\big) \right\} I_{(0,\infty)}(u_1)\, I_{(0,\infty)}(u_2).$$

Moreover, a simple calculation shows that

$$U_2 \mid U_1 = u_1 \sim \mathrm{Gamma}\left( \frac{\nu + 1}{2},\; \frac{1}{2}\big(u_1 + \nu\big) \right). \quad (6)$$

The associated Gibbs sampler can be simulated using the conditionals given in (5) and (6). Finally, because the joint density of $(X_1, X_2)$ depends on $x_1$ only through $t_1(x_1) = x_1^2$, the condition in Proposition 3 is satisfied, and we conclude that the Gibbs samplers associated with $f$ and $f_U$ converge at the same $L^1$-rate.
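As a quick empirical check (ours, not part of the paper), the two Gibbs samplers can be run side by side. NumPy's gamma generator is parametrized by shape and scale, so the rate parameters in (4), (5) and (6) enter as reciprocals. Because $x_1$ enters the $X_2$-update only through $x_1^2$, the $X_2$- and $U_2$-sequences share the same transition law, and their estimated lag-one autocorrelations agree up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(2)
nu = 3.0   # known constant nu > 0

def gibbs_x(n):
    """Gibbs sampler for (X1, X2) using the conditionals (3) and (4)."""
    x2 = 1.0
    out = np.empty(n)
    for t in range(n):
        x1 = rng.normal(0.0, 1.0 / np.sqrt(x2))               # (3)
        x2 = rng.gamma((nu + 1.0) / 2.0, 2.0 / (x1**2 + nu))  # (4)
        out[t] = x2
    return out

def gibbs_u(n):
    """Gibbs sampler for (U1, U2) = (X1^2, X2) using (5) and (6)."""
    u2 = 1.0
    out = np.empty(n)
    for t in range(n):
        u1 = rng.gamma(0.5, 2.0 / u2)                         # (5)
        u2 = rng.gamma((nu + 1.0) / 2.0, 2.0 / (u1 + nu))     # (6)
        out[t] = u2
    return out

def lag1(x):
    x = x - x.mean()
    return (x[:-1] * x[1:]).mean() / (x * x).mean()

# The X2- and U2-sequences have the same transition law, so the two
# autocorrelation estimates agree up to Monte Carlo error.
print(lag1(gibbs_x(100_000)), lag1(gibbs_u(100_000)))
```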

Appendix

Proof of Proposition 2. We will prove the result for $k = 3$; the extension to general $k$ is obvious and only involves more complicated notation. The proof has two parts: first we show that $P_1$, $P_2$ and $P_3$ share the same $L^1$ rate, and then we show that $P_i$ and $Q_i$ have the same $L^1$ rate for $i = 1, 2, 3$.

We prove the first part by showing that $\rho_1 \le \rho_2 \le \rho_3 \le \rho_1$, where $\rho_i$ denotes the $L^1$ rate of $P_i$. For this, we need only show that $\rho_1 \le \rho_2$, with the remaining inequalities following by symmetry. To prove $\rho_1 \le \rho_2$, we show that for each fixed $\nu \in p(\pi)$, there exists a $\nu' \in p(\pi)$ such that, for all $n \in \mathbb{N}$,

$$\|\nu P_1^{n+1} - \pi\|_{TV} \le \|\nu' P_2^{n} - \pi\|_{TV}. \quad (7)$$

From this it follows that

$$\lim_{n \to \infty} \frac{1}{n} \log \|\nu P_1^{n} - \pi\|_{TV} \le \lim_{n \to \infty} \frac{1}{n} \log \|\nu' P_2^{n} - \pi\|_{TV} \le \log(\rho_2),$$

which implies $\rho_1 \le \rho_2$.

To prove (7), let $(X_1, X_2, X_3)$ have distribution $\pi$, and let $f_{1|23}(x_1 \mid x_2, x_3)$, $f_{2|13}(x_2 \mid x_1, x_3)$ and $f_{3|12}(x_3 \mid x_1, x_2)$ represent the conditional densities of $X_1$ (given $X_2$ and $X_3$), of $X_2$ (given $X_1$ and $X_3$), and of $X_3$ (given $X_1$ and $X_2$), respectively. For $i = 1, 2$ and $A \in \mathcal{F}$, we have

$$P_i\big((x_1, x_2, x_3), A\big) = \int_A k_i(x_1', x_2', x_3' \mid x_1, x_2, x_3)\; \mu\big(d(x_1', x_2', x_3')\big),$$

where $k_1$ and $k_2$ are the Markov transition densities associated with $P_1$ and $P_2$, respectively. Of course,

$$k_1(x_1', x_2', x_3' \mid x_1, x_2, x_3) = f_{1|23}(x_1' \mid x_2, x_3)\, f_{2|13}(x_2' \mid x_1', x_3)\, f_{3|12}(x_3' \mid x_1', x_2'),$$

and $k_2$ is defined analogously. It is convenient to express each of $P_1$ and $P_2$ as the composition of three simple transition kernels. To this end, let $\delta_x(\cdot)$ denote a point mass measure at $x$ and let

$$P_{1|23}\big((x_1, x_2, x_3), A\big) = \int_A f_{1|23}(x_1' \mid x_2, x_3)\; (\mu_1 \times \delta_{x_2} \times \delta_{x_3})\big(d(x_1', x_2', x_3')\big)$$

be the kernel associated with the single update of $X_1$ (given $X_2$ and $X_3$). Define the kernels associated with the (conditional) updates of $X_2$ and $X_3$ analogously and call them $P_{2|13}$ and $P_{3|12}$, respectively. A routine calculation shows that $P_1 = P_{1|23} P_{2|13} P_{3|12}$ and $P_2 = P_{2|13} P_{3|12} P_{1|23}$.

Given $\nu \in p(\pi)$ having density $q$ with respect to $\pi$, let $\nu' = \nu P_{1|23}$. A straightforward calculation shows that $\nu'$ has density $q' = P_{1|23}\, q$ with respect to $\pi$. Moreover, a simple application of Jensen's inequality shows that $\int_{\mathsf{X}} (q')^2\, d\pi < \infty$, so $\nu' \in p(\pi)$. Also, given a function $g: \mathsf{X} \to [-1, 1]$, let $\hat{g} = P_{2|13} P_{3|12}\, g$ and note that $\|\hat{g}\|_\infty \le 1$, where $\|\cdot\|_\infty$ is the supremum norm. Writing $P_1$ and $P_2$ in terms of the kernels $P_{1|23}$, $P_{2|13}$ and $P_{3|12}$, and using a simple induction argument, we obtain, for any $n \ge 1$, that $\nu P_1^{n+1} g = \nu' P_2^{n} \hat{g}$ for all $\nu \in p(\pi)$ and all $g: \mathsf{X} \to [-1, 1]$. Finally, since $\pi = \pi P_{2|13} = \pi P_{3|12}$, we have $\pi = \pi P_{2|13} P_{3|12}$ and thus $\pi g = \pi \hat{g}$. Hence,

$$\big|\nu P_1^{n+1}(g) - \pi(g)\big| = \big|\nu' P_2^{n}(\hat{g}) - \pi(\hat{g})\big| \le \sup_{\{h:\, \|h\|_\infty \le 1\}} \big|\nu' P_2^{n}(h) - \pi(h)\big| = 2\, \|\nu' P_2^{n} - \pi\|_{TV},$$

and because $g$ was arbitrary, (7) follows.

For the second part of the proof, let $\eta_i$ denote the $L^1$ rate of $Q_i$, $i = 1, 2, 3$. We will show that $\rho_1 = \eta_1$; the other two equalities then follow by symmetry. For a measurable set $B$ in $\mathsf{X}_2 \times \mathsf{X}_3$,

$$Q_1\big((x_2, x_3), B\big) = \int_B \left[ \int_{\mathsf{X}_1} k_1(x_1', x_2', x_3' \mid x_1, x_2, x_3)\; \mu_1(dx_1') \right] (\mu_2 \times \mu_3)\big(d(x_2', x_3')\big),$$

and the corresponding invariant distribution is given by

$$\pi_{2,3}(B) = \int_B \left[ \int_{\mathsf{X}_1} f(x_1, x_2, x_3)\; \mu_1(dx_1) \right] (\mu_2 \times \mu_3)\big(d(x_2, x_3)\big).$$

Given $\alpha \in p(\pi_{2,3})$ and $g: \mathsf{X}_2 \times \mathsf{X}_3 \to [-1, 1]$, define $\check{\alpha} \in p(\pi)$ by

$$\check{\alpha}(A) = \int_A \frac{d\alpha}{d\pi_{2,3}}(x_2, x_3)\; \pi\big(d(x_1, x_2, x_3)\big)$$

and $\check{g}: \mathsf{X} \to [-1, 1]$ by $\check{g}(x_1, x_2, x_3) = g(x_2, x_3)$, respectively. Then

$$\begin{aligned} (P_1 \check{g})(x_1, x_2, x_3) &= \int_{\mathsf{X}} k_1(x_1', x_2', x_3' \mid x_1, x_2, x_3)\, g(x_2', x_3')\; \mu\big(d(x_1', x_2', x_3')\big) \\ &= \int_{\mathsf{X}_3} \int_{\mathsf{X}_2} \left[ \int_{\mathsf{X}_1} k_1(x_1', x_2', x_3' \mid x_1, x_2, x_3)\; \mu_1(dx_1') \right] g(x_2', x_3')\; \mu_2(dx_2')\; \mu_3(dx_3') \\ &= (Q_1 g)(x_2, x_3), \end{aligned}$$

and it follows by induction that $(P_1^{n} \check{g})(x_1, x_2, x_3) = (Q_1^{n} g)(x_2, x_3)$ for all $n \ge 1$. Thus,

$$\begin{aligned} \check{\alpha}(P_1^{n} \check{g}) &= \int_{\mathsf{X}} (P_1^{n} \check{g})(x_1, x_2, x_3)\, \frac{d\alpha}{d\pi_{2,3}}(x_2, x_3)\; \pi\big(d(x_1, x_2, x_3)\big) \\ &= \int_{\mathsf{X}} (Q_1^{n} g)(x_2, x_3)\, \frac{d\alpha}{d\pi_{2,3}}(x_2, x_3)\; \pi\big(d(x_1, x_2, x_3)\big) \\ &= \int_{\mathsf{X}_3} \int_{\mathsf{X}_2} (Q_1^{n} g)(x_2, x_3)\, \frac{d\alpha}{d\pi_{2,3}}(x_2, x_3) \left[ \int_{\mathsf{X}_1} f(x_1, x_2, x_3)\; \mu_1(dx_1) \right] \mu_2(dx_2)\; \mu_3(dx_3) \\ &= \int_{\mathsf{X}_3} \int_{\mathsf{X}_2} (Q_1^{n} g)(x_2, x_3)\, \frac{d\alpha}{d\pi_{2,3}}(x_2, x_3)\; \pi_{2,3}\big(d(x_2, x_3)\big) = \alpha(Q_1^{n} g). \end{aligned}$$

Finally, since $\pi(\check{g}) = \pi_{2,3}(g)$, we have

$$\big|\alpha Q_1^{n}(g) - \pi_{2,3}(g)\big| = \big|\check{\alpha} P_1^{n}(\check{g}) - \pi(\check{g})\big| \le \sup_{\{h:\, \|h\|_\infty \le 1\}} \big|\check{\alpha} P_1^{n}(h) - \pi(h)\big| = 2\, \|\check{\alpha} P_1^{n} - \pi\|_{TV}$$

for all $n \ge 1$, and since $g: \mathsf{X}_2 \times \mathsf{X}_3 \to [-1, 1]$ was arbitrary, $\|\alpha Q_1^{n} - \pi_{2,3}\|_{TV} \le \|\check{\alpha} P_1^{n} - \pi\|_{TV}$ for all $n \ge 1$. This proves that $\eta_1 \le \rho_1$.

To prove the reverse inequality, let $\nu \in p(\pi)$ and $g: \mathsf{X} \to [-1, 1]$, and define $\check{\nu} \in p(\pi_{2,3})$ by

$$\check{\nu}(B) = \int_B \left[ \int_{\mathsf{X}_1} \frac{d\nu}{d\pi}(x_1, x_2, x_3)\, f_{1|23}(x_1 \mid x_2, x_3)\; \mu_1(dx_1) \right] \pi_{2,3}\big(d(x_2, x_3)\big)$$

and, noting that $(P_1 g)(x_1, x_2, x_3)$ does not depend on $x_1$, let $\check{g}(x_2, x_3) = (P_1 g)(x_1, x_2, x_3)$. An induction argument similar to the one above shows that $(P_1^{n+1} g)(x_1, x_2, x_3) = (Q_1^{n} \check{g})(x_2, x_3)$ for all $n \ge 1$,

and thus,

$$\begin{aligned} \nu(P_1^{n+1} g) &= \int_{\mathsf{X}_3} \int_{\mathsf{X}_2} \int_{\mathsf{X}_1} (Q_1^{n} \check{g})(x_2, x_3)\, \frac{d\nu}{d\pi}(x_1, x_2, x_3)\, f(x_1, x_2, x_3)\; \mu_1(dx_1)\; \mu_2(dx_2)\; \mu_3(dx_3) \\ &= \int_{\mathsf{X}_3} \int_{\mathsf{X}_2} (Q_1^{n} \check{g})(x_2, x_3)\, \frac{d\check{\nu}}{d\pi_{2,3}}(x_2, x_3) \left[ \int_{\mathsf{X}_1} f(x_1, x_2, x_3)\; \mu_1(dx_1) \right] \mu_2(dx_2)\; \mu_3(dx_3) \\ &= \int_{\mathsf{X}_3} \int_{\mathsf{X}_2} (Q_1^{n} \check{g})(x_2, x_3)\, \frac{d\check{\nu}}{d\pi_{2,3}}(x_2, x_3)\; \pi_{2,3}\big(d(x_2, x_3)\big) = \check{\nu}(Q_1^{n} \check{g}). \end{aligned}$$

Finally, since $\pi(g) = (\pi P_1)(g) = \pi(P_1 g) = \pi_{2,3}(\check{g})$, we have

$$\big|\nu P_1^{n+1}(g) - \pi(g)\big| = \big|\check{\nu} Q_1^{n}(\check{g}) - \pi_{2,3}(\check{g})\big| \le \sup_{\{h:\, \|h\|_\infty \le 1\}} \big|\check{\nu} Q_1^{n}(h) - \pi_{2,3}(h)\big| = 2\, \|\check{\nu} Q_1^{n} - \pi_{2,3}\|_{TV}$$

for all $n \ge 1$. Since $g: \mathsf{X} \to [-1, 1]$ was arbitrary, it follows that $\|\nu P_1^{n+1} - \pi\|_{TV} \le \|\check{\nu} Q_1^{n} - \pi_{2,3}\|_{TV}$. This implies that $\rho_1 \le \eta_1$, completing the proof of the proposition.

A few technical remarks will be helpful before beginning the proof of Proposition 3. We will employ the following lemma.

Lemma 1. Let $(\mathsf{X}, \mathcal{F}, \mu)$ be a measure space, let $(\mathsf{Y}, \mathcal{G})$ be a measurable space, and let $\pi$ be a probability measure on $(\mathsf{X}, \mathcal{F})$ having density $f$ with respect to $\mu$. Suppose that $T: \mathsf{X} \to \mathsf{Y}$ is measurable and that $f(x) = \tilde{f}(T(x))$ for some measurable function $\tilde{f}: \mathsf{Y} \to \mathbb{R}$. Let $\nu = \mu \circ T^{-1}$ be the measure induced on $(\mathsf{Y}, \mathcal{G})$ by $\mu$ and $T$. Similarly, let $\tilde{\pi} = \pi \circ T^{-1}$ be the probability measure induced on $(\mathsf{Y}, \mathcal{G})$ by $\pi$ and $T$. Then $\tilde{\pi}$ has density $\tilde{f}$ with respect to $\nu$.

Proof. By change of variables (Billingsley, 1995, Theorem 16.13), for any $B \in \mathcal{G}$, we have

$$\tilde{\pi}(B) = \pi\big(T^{-1}(B)\big) = \int_{T^{-1}(B)} \tilde{f}(T(x))\; \mu(dx) = \int_B \tilde{f}(y)\; (\mu \circ T^{-1})(dy) = \int_B \tilde{f}(y)\; \nu(dy).$$

Returning to the specific context of Proposition 3, consider the product spaces $(\mathsf{X}, \mathcal{F}, \mu)$ and $(\mathsf{Y}, \mathcal{G})$, and the transformation $T(x_1, x_2, \ldots, x_k) = (t_1(x_1), t_2(x_2), \ldots, t_k(x_k))$. By Lemma 1, $\tilde{\pi} = \pi \circ T^{-1}$ has density $\tilde{f}$ with respect to the measure $\nu = \mu \circ T^{-1}$. Let $\nu_i = \mu_i \circ t_i^{-1}$, $i = 1, 2, \ldots, k$. If the measure spaces $(\mathsf{Y}_i, \mathcal{G}_i, \nu_i)$, $i = 1, 2, \ldots, k$, are $\sigma$-finite, then it is easy to check that $\nu$ is equal to the product measure $\nu_1 \times \nu_2 \times \cdots \times \nu_k$. However, there is nothing in our hypotheses to guarantee that the $\nu_i$ are $\sigma$-finite, and if any of them fail to be $\sigma$-finite, then technical difficulties arise which invalidate our proof.

Fortunately, we may assume without loss of generality that the $\nu_i$ are $\sigma$-finite, and even finite. To see this, let $\pi_i$ denote the $i$th marginal distribution of $\pi$ and let $f_i$ denote the density of $\pi_i$ with respect to $\mu_i$, which may be computed in the usual way by integrating $f$ over all but its $i$th coordinate. Then it is easy to check that $\pi$ has density $\bar{f}$ with respect to the product measure $\pi_1 \times \pi_2 \times \cdots \times \pi_k$, where

$$\bar{f}(x_1, x_2, \ldots, x_k) = \begin{cases} \dfrac{f(x_1, x_2, \ldots, x_k)}{f_1(x_1) f_2(x_2) \cdots f_k(x_k)}, & \text{if } f_1(x_1) f_2(x_2) \cdots f_k(x_k) > 0, \\[4pt] 0, & \text{otherwise.} \end{cases}$$

Now let

$$\tilde{f}_1(y_1) = \int_{\mathsf{X}_k} \cdots \int_{\mathsf{X}_2} \tilde{f}\big(y_1, t_2(x_2), \ldots, t_k(x_k)\big)\; \mu_2(dx_2) \cdots \mu_k(dx_k),$$

and define $\tilde{f}_2, \ldots, \tilde{f}_k$ similarly. From (2), it is obvious that $f_i(x_i) = \tilde{f}_i(t_i(x_i))$, $i = 1, 2, \ldots, k$, and it follows that $\bar{f}(x_1, x_2, \ldots, x_k)$ is a function of $(t_1(x_1), t_2(x_2), \ldots, t_k(x_k))$. Thus, the hypotheses of Proposition 3 also hold upon replacement of $f$ by $\bar{f}$ and $\mu_i$ by $\pi_i$, $i = 1, 2, \ldots, k$. But in this case $\nu_i = \pi_i \circ t_i^{-1}$, which is a probability measure, and hence finite.

Proof of Proposition 3. Again, we will prove the result for $k = 3$. Assume, without loss of generality, that $\nu_i = \mu_i \circ t_i^{-1}$ is $\sigma$-finite for $i = 1, 2, 3$. Let $(X_1, X_2, X_3)$ have distribution $\pi$. We first prove that the Gibbs sampler based on (the distribution of) $(t_1(X_1), X_2, X_3)$ has the same $L^1$-rate of convergence as the one based on $(X_1, X_2, X_3)$. A similar argument then implies that this rate of convergence is shared by the Gibbs sampler based on $(t_1(X_1), t_2(X_2), X_3)$, and then by the Gibbs sampler based on $(t_1(X_1), t_2(X_2), t_3(X_3))$, thus proving the result.

By Lemma 1, $(t_1(X_1), X_2, X_3)$ has density $g(y_1, x_2, x_3) = \tilde{f}(y_1, t_2(x_2), t_3(x_3))$ with respect to $\nu_1 \times \mu_2 \times \mu_3$. Letting $g_{1|23}(y_1 \mid x_2, x_3)$, $g_{2|13}(x_2 \mid y_1, x_3)$, and $g_{3|12}(x_3 \mid y_1, x_2)$ represent the corresponding conditional densities, the Gibbs sampler based on $(t_1(X_1), X_2, X_3)$ has transition density

$$\tilde{k}_1(y_1', x_2', x_3' \mid y_1, x_2, x_3) = g_{1|23}(y_1' \mid x_2, x_3)\, g_{2|13}(x_2' \mid y_1', x_3)\, g_{3|12}(x_3' \mid y_1', x_2')$$

with respect to $\nu_1 \times \mu_2 \times \mu_3$. But $g(t_1(x_1), x_2, x_3) = \tilde{f}(t_1(x_1), t_2(x_2), t_3(x_3)) = f(x_1, x_2, x_3)$, and from this it is easily checked that $g_{1|23}(t_1(x_1) \mid x_2, x_3) = f_{1|23}(x_1 \mid x_2, x_3)$, $g_{2|13}(x_2 \mid t_1(x_1), x_3) = f_{2|13}(x_2 \mid x_1, x_3)$, and $g_{3|12}(x_3 \mid t_1(x_1), x_2) = f_{3|12}(x_3 \mid x_1, x_2)$.

By change of variables,

$$\begin{aligned} \int_{\mathsf{Y}_1} \tilde{k}_1(y_1', x_2', x_3' \mid y_1, x_2, x_3)\; \nu_1(dy_1') &= \int_{\mathsf{Y}_1} g_{1|23}(y_1' \mid x_2, x_3)\, g_{2|13}(x_2' \mid y_1', x_3)\, g_{3|12}(x_3' \mid y_1', x_2')\; (\mu_1 \circ t_1^{-1})(dy_1') \\ &= \int_{\mathsf{X}_1} g_{1|23}(t_1(x_1') \mid x_2, x_3)\, g_{2|13}(x_2' \mid t_1(x_1'), x_3)\, g_{3|12}(x_3' \mid t_1(x_1'), x_2')\; \mu_1(dx_1') \\ &= \int_{\mathsf{X}_1} f_{1|23}(x_1' \mid x_2, x_3)\, f_{2|13}(x_2' \mid x_1', x_3)\, f_{3|12}(x_3' \mid x_1', x_2')\; \mu_1(dx_1') \\ &= \int_{\mathsf{X}_1} k_1(x_1', x_2', x_3' \mid x_1, x_2, x_3)\; \mu_1(dx_1'), \end{aligned}$$

and thus the marginal $(X_2, X_3)$ chain of the Gibbs sampler based on $(t_1(X_1), X_2, X_3)$ has the same transition density (with respect to $\mu_2 \times \mu_3$) as the marginal $(X_2, X_3)$ chain of the Gibbs sampler based on $(X_1, X_2, X_3)$. This implies that the two marginal chains have the same $L^1$-convergence rate, and it follows from Proposition 2 that the two parent chains also share this rate.

Acknowledgments

The authors thank Aixin Tan and an anonymous referee for helpful comments and suggestions.

References

Billingsley, P. (1995). Probability and Measure. 3rd ed. John Wiley and Sons, New York.

Diaconis, P., Khare, K. and Saloff-Coste, L. (2008). Gibbs sampling, exponential families and orthogonal polynomials (with discussion). Statistical Science.

Diebolt, J. and Robert, C. P. (1994). Estimation of finite mixture distributions by Bayesian sampling. Journal of the Royal Statistical Society, Series B.

Gelfand, A. E., Sahu, S. K. and Carlin, B. P. (1995). Efficient parametrisations for normal linear mixed models. Biometrika.

Liu, J. S., Wong, W. H. and Kong, A. (1994). Covariance structure of the Gibbs sampler with applications to comparisons of estimators and augmentation schemes. Biometrika.

Meyn, S. P. and Tweedie, R. L. (1993). Markov Chains and Stochastic Stability. Springer-Verlag, London.

Papaspiliopoulos, O., Roberts, G. O. and Sköld, M. (2007). A general framework for the parametrization of hierarchical models. Statistical Science.

Roberts, G. and Sahu, S. K. (1997). Updating schemes, correlation structure, blocking and parameterisation for the Gibbs sampler. Journal of the Royal Statistical Society, Series B.

Roberts, G. O. and Rosenthal, J. S. (1997). Geometric ergodicity and hybrid Markov chains. Electronic Communications in Probability.

Roberts, G. O. and Rosenthal, J. S. (2001). Markov chains and de-initializing processes. Scandinavian Journal of Statistics.

Roberts, G. O. and Tweedie, R. L. (2001). Geometric $L^2$ and $L^1$ convergence are equivalent for reversible Markov chains. Journal of Applied Probability, 38A.

Román, J. C. (2012). Convergence Analysis of Block Gibbs Samplers for Bayesian General Linear Mixed Models. Ph.D. thesis, Department of Statistics, University of Florida.

Román, J. C. and Hobert, J. P. (2012). Convergence analysis of the Gibbs sampler for Bayesian general linear mixed models with improper priors. Annals of Statistics.

Rosenthal, J. S. (2003). Asymptotic variance and convergence rates of nearly-periodic MCMC algorithms. Journal of the American Statistical Association.

Tan, A. and Hobert, J. P. (2009). Block Gibbs sampling for Bayesian random effects models with improper priors: Convergence and regeneration. Journal of Computational and Graphical Statistics.

Yu, Y. and Meng, X.-L. (2011). To center or not to center: That is not the question - an ancillarity-sufficiency interweaving strategy (ASIS) for boosting MCMC efficiency (with discussion). Journal of Computational and Graphical Statistics.


More information

6 Markov Chain Monte Carlo (MCMC)

6 Markov Chain Monte Carlo (MCMC) 6 Markov Chain Monte Carlo (MCMC) The underlying idea in MCMC is to replace the iid samples of basic MC methods, with dependent samples from an ergodic Markov chain, whose limiting (stationary) distribution

More information

Supplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements

Supplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements Supplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements Jeffrey N. Rouder Francis Tuerlinckx Paul L. Speckman Jun Lu & Pablo Gomez May 4 008 1 The Weibull regression model

More information

08a. Operators on Hilbert spaces. 1. Boundedness, continuity, operator norms

08a. Operators on Hilbert spaces. 1. Boundedness, continuity, operator norms (February 24, 2017) 08a. Operators on Hilbert spaces Paul Garrett garrett@math.umn.edu http://www.math.umn.edu/ garrett/ [This document is http://www.math.umn.edu/ garrett/m/real/notes 2016-17/08a-ops

More information

Lecture 10. Theorem 1.1 [Ergodicity and extremality] A probability measure µ on (Ω, F) is ergodic for T if and only if it is an extremal point in M.

Lecture 10. Theorem 1.1 [Ergodicity and extremality] A probability measure µ on (Ω, F) is ergodic for T if and only if it is an extremal point in M. Lecture 10 1 Ergodic decomposition of invariant measures Let T : (Ω, F) (Ω, F) be measurable, and let M denote the space of T -invariant probability measures on (Ω, F). Then M is a convex set, although

More information

Improved Robust MCMC Algorithm for Hierarchical Models

Improved Robust MCMC Algorithm for Hierarchical Models UNIVERSITY OF TEXAS AT SAN ANTONIO Improved Robust MCMC Algorithm for Hierarchical Models Liang Jing July 2010 1 1 ABSTRACT In this paper, three important techniques are discussed with details: 1) group

More information

A quick introduction to Markov chains and Markov chain Monte Carlo (revised version)

A quick introduction to Markov chains and Markov chain Monte Carlo (revised version) A quick introduction to Markov chains and Markov chain Monte Carlo (revised version) Rasmus Waagepetersen Institute of Mathematical Sciences Aalborg University 1 Introduction These notes are intended to

More information

Control Variates for Markov Chain Monte Carlo

Control Variates for Markov Chain Monte Carlo Control Variates for Markov Chain Monte Carlo Dellaportas, P., Kontoyiannis, I., and Tsourti, Z. Dept of Statistics, AUEB Dept of Informatics, AUEB 1st Greek Stochastics Meeting Monte Carlo: Probability

More information

Markov Chains and De-initialising Processes

Markov Chains and De-initialising Processes Markov Chains and De-initialising Processes by Gareth O. Roberts* and Jeffrey S. Rosenthal** (November 1998; last revised July 2000.) Abstract. We define a notion of de-initialising Markov chains. We prove

More information

Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units

Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units Sahar Z Zangeneh Robert W. Keener Roderick J.A. Little Abstract In Probability proportional

More information

Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model

Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model UNIVERSITY OF TEXAS AT SAN ANTONIO Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model Liang Jing April 2010 1 1 ABSTRACT In this paper, common MCMC algorithms are introduced

More information

Markov Chain Monte Carlo Methods

Markov Chain Monte Carlo Methods Markov Chain Monte Carlo Methods John Geweke University of Iowa, USA 2005 Institute on Computational Economics University of Chicago - Argonne National Laboaratories July 22, 2005 The problem p (θ, ω I)

More information