Analysis of Polya-Gamma Gibbs sampler for Bayesian logistic analysis of variance


Electronic Journal of Statistics Vol. 11 (2017) ISSN: 1935-7524 DOI: 10.1214/17-EJS1227

Analysis of Polya-Gamma Gibbs sampler for Bayesian logistic analysis of variance

Hee Min Choi
Department of Statistics, University of California, Davis

and

Jorge Carlos Román
Department of Mathematics and Statistics, San Diego State University

Abstract: We consider the intractable posterior density that results when the one-way logistic analysis of variance model is combined with a flat prior. We analyze Polson, Scott and Windle's (2013) data augmentation (DA) algorithm for exploring the posterior. The Markov operator associated with the DA algorithm is shown to be trace-class.

AMS 2000 subject classifications: Primary 60J27; secondary 60K35.
Keywords and phrases: Polya-Gamma distribution, data augmentation algorithm, geometric convergence rate, Markov chain, Markov operator, Monte Carlo, trace-class operator.

Received September 2015.

Contents
1 Introduction
2 Polson, Scott and Windle's algorithm
3 Main result
4 Discussion
A Appendix
Acknowledgements
References

1. Introduction

Consider the logistic regression set-up in which Y_1, ..., Y_N are independent Bernoulli random variables such that Pr(Y_i = 1) = F(x_i^T β), where x_i is a p × 1 vector of known covariates that are associated with Y_i, β is a p × 1 vector

of unknown regression coefficients, and F is the standard logistic distribution function, that is, F(s) = e^s/(1 + e^s). An important special case is the one-way logistic analysis of variance (ANOVA) model, where each x_i is a unit vector. (See Section 3 for a detailed explanation of how the logistic regression model reduces to the one-way model.) In general, the joint mass function of Y_1, ..., Y_N is given by

∏_{i=1}^N Pr(Y_i = y_i | β) = ∏_{i=1}^N [F(x_i^T β)]^{y_i} [1 - F(x_i^T β)]^{1-y_i} I_{{0,1}}(y_i).   (1)

A Bayesian version of the logistic regression model (1) can be assembled by specifying a prior for the unknown regression parameter β. We consider a flat improper prior. Of course, whenever an improper prior is used, one must check that the resulting posterior is well-defined (i.e., proper). Chen and Shao (2000) provide necessary and sufficient conditions for propriety, and these are stated explicitly in Section 2. We assume that these conditions are satisfied, and denote the posterior by π(β | y). This posterior density is intractable in the sense that its expectations, which are required for Bayesian inference, cannot be computed in closed form. However, we may resort to Markov chain Monte Carlo (MCMC) methods to approximate the intractable posterior expectations. The most recent MCMC method for this problem is a data augmentation (DA) algorithm that is based on the Polya-Gamma latent data strategy developed in Polson, Scott and Windle (2013) (hereafter, PS&W). In this article, we show that in the one-way logistic ANOVA model, which is an important special case of logistic regression models, the Markov operator associated with PS&W's DA algorithm is trace-class (see Section 3 for the definition). The fact that this Markov operator is trace-class implies that it is also compact, which in turn implies that the corresponding Markov chain is geometrically ergodic.
This is very important from a practical standpoint because geometric ergodicity guarantees the existence of central limit theorems for ergodic averages, which allows for the calculation of asymptotically valid standard errors for the MCMC estimates of posterior expectations (see, e.g., Flegal, Haran and Jones, 2008; Jones, Haran, Caffo and Neath, 2006). Aside from our work, the only existing convergence rate result for a DA algorithm for the Bayesian logistic model is the one in Choi and Hobert (2013). However, the model considered in Choi and Hobert (2013) has proper normal priors on the regression parameter. While our result applies to a relatively small class of logistic regression models, it is the first of its kind for logistic regression models with an improper prior. It turns out that using a flat improper prior complicates the analysis that is required to study the corresponding Markov chain; indeed, our analysis is substantially different from theirs. The remainder of this paper is organized as follows. Section 2 contains a formal description of PS&W's algorithm for exploring the posterior π(β | y). In Section 3, we study the operator associated with PS&W's DA Markov chain for the one-way logistic ANOVA model and show that the associated Markov

operator is trace-class. Finally, in Section 4 we discuss possible difficulties in extending our result to the general logistic regression model.

2. Polson, Scott and Windle's algorithm

We begin with a description of the posterior density and conditions for propriety. We consider the posterior density that results when the logistic regression likelihood (1) is combined with a flat prior on the regression parameter β. Let y = (y_1, ..., y_N)^T denote the vector of observed data and define the marginal density as

c(y) = ∫_{R^p} ∏_{i=1}^N [F(x_i^T β)]^{y_i} [1 - F(x_i^T β)]^{1-y_i} dβ.

By definition, the posterior is proper if and only if c(y) < ∞. Of course, when c(y) < ∞, the posterior density is given by

π(β | y) = c(y)^{-1} ∏_{i=1}^N [F(x_i^T β)]^{y_i} [1 - F(x_i^T β)]^{1-y_i}.

Recall from the Introduction that Chen and Shao (2000) provide necessary and sufficient conditions for propriety. In order to state Chen and Shao's (2000) result, we need a bit more notation. As usual, let X denote the N × p design matrix whose ith row is x_i^T, and let Z be an N × p matrix whose ith row is z_i^T := I_{{0}}(y_i) x_i^T - I_{{1}}(y_i) x_i^T. Finally, let 0_p be the p × 1 vector of zeros. Here is the result.

Proposition 1 (Chen and Shao, 2000). The function c(y) is finite if and only if (A) the design matrix X has full column rank and (B) there is a vector b = (b_1, ..., b_N)^T with strictly positive components such that Z^T b = 0_p.

In particular, in the one-way logistic ANOVA model, the design matrix X has full column rank, so the posterior is proper if and only if (B) holds. (The precise form of X in the one-way model and an easily checkable equivalent condition for propriety are stated in the next section.) Throughout the remainder of this section, we assume that the conditions of Proposition 1 are satisfied so that the posterior is well-defined. We now describe PS&W's DA algorithm for exploring the posterior π(β | y). Let R_+ := (0, ∞).
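The quantities above are straightforward to compute. As a quick illustration (the function and data names are ours, not the paper's), the log of the unnormalized posterior, i.e. the log of the integrand that defines c(y), can be evaluated stably in numpy via log F(s) = -log(1 + e^{-s}):

```python
import numpy as np

def log_posterior_kernel(beta, X, y):
    """Log of the unnormalized posterior under a flat prior: the Bernoulli
    log likelihood with logistic link F(s) = e^s / (1 + e^s)."""
    s = X @ beta
    # log F(s) = -log(1 + e^{-s});  log(1 - F(s)) = -log(1 + e^{s}).
    return np.sum(-y * np.logaddexp(0.0, -s) - (1 - y) * np.logaddexp(0.0, s))

# Toy one-way ANOVA design: p = 2 groups with n_1 = n_2 = 3 observations.
X = np.repeat(np.eye(2), 3, axis=0)
y = np.array([1, 0, 1, 0, 0, 1])
val = log_posterior_kernel(np.zeros(2), X, y)  # each observation contributes log(1/2)
```

At β = 0_p every observation contributes log(1/2), so val equals -6 log 2; propriety (Proposition 1) is exactly the statement that the exponential of this kernel integrates over R^p.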
For fixed w ∈ R^N_+, define Σ(w) = (Z^T Ω(w) Z)^{-1} and μ(w) = Σ(w) Z^T(-1_N/2), where Ω(w) is the N × N diagonal matrix whose ith diagonal entry is w_i, and 1_N is the N × 1 vector of 1s. When we write W ~ PG(1, c), we mean W has a Polya-Gamma density (PS&W) given by

f(x; c) = cosh(c/2) exp{-c^2 x / 2} g(x),

where c ≥ 0 and

g(x) = Σ_{k=0}^∞ (-1)^k [(2k+1)/√(2πx^3)] exp{-(2k+1)^2/(8x)} I_{(0,∞)}(x).   (2)

The dynamics of PS&W's Markov chain, Φ = {β^(m)}_{m=0}^∞, are implicitly defined through the following two-step procedure for moving from the current state, β^(m) = β, to the new state, β^(m+1).

Iteration m+1 of PS&W's DA algorithm:
1. Draw W_1, ..., W_N independently with W_i ~ PG(1, |z_i^T β|), and call the observed value w = (w_1, ..., w_N)^T.
2. Draw β^(m+1) ~ N_p(μ(w), Σ(w)).

A highly efficient rejection sampler for simulating the Polya-Gamma distribution is provided in PS&W. Also, a formal derivation of the above algorithm is similar to the one for the normal prior case provided in Choi and Hobert (2013). The Markov transition density (Mtd) of the Markov chain Φ is given by

k(β | β′) = ∫_{R^N_+} π(β | w, y) π(w | β′, y) dw,

where π(β | w, y) and π(w | β′, y) are conditional densities that can be gleaned from the two-step algorithm described above. Indeed, π(w | β′, y) is a product of Polya-Gamma densities, and π(β | w, y) is a multivariate normal density. Note that k maps R^p × R^p into R_+ and that π(β | y) is an invariant density for this Mtd. It follows that the corresponding Markov chain Φ is Harris ergodic; that is, irreducible, aperiodic, and positive Harris recurrent (see, e.g., Hobert, 2011).

3. Main result

In this section, we restrict attention to the one-way logistic ANOVA model, an important special case of logistic regression models, and present a spectral analysis result concerning PS&W's Markov chain Φ for this problem. In particular, we prove that the Markov operator associated with Φ is trace-class. We begin by describing the one-way logistic ANOVA model. Let {Y_ij} be independent Bernoulli random variables such that

Pr(Y_ij = 1 | β) = F(β_i),  i = 1, 2, ..., p,  j = 1, 2, ..., n_i,

where β_i is the main effect of the ith treatment group, and β = (β_1, ..., β_p)^T. Note that there are p treatment groups and the number of observations in different groups may differ. Of course, this model can be written in the logistic regression form (1). Indeed, there are a total of N := n_1 + ... + n_p observations and we arrange them using the usual lexicographical ordering:

y = [y_11 ... y_1n_1  y_21 ... y_2n_2  ...  y_p1 ... y_pn_p]^T.

The random version of y, call it Y, is defined similarly using the same ordering. We denote the lth components of y and Y by y_l and Y_l, respectively. Letting ñ_0 = 0 and ñ_j = Σ_{k=1}^j n_k for j ∈ N_p := {1, 2, ..., p}, we can write

x_i = e_j  if ñ_{j-1} + 1 ≤ i ≤ ñ_j,

where e_j denotes the jth unit vector in R^p (i.e., the jth column of the p × p identity matrix). Note that the design matrix X has full column rank since it equals ⊕_{i=1}^p 1_{n_i}, which has orthogonal columns. It follows from Proposition 1 that the posterior π(β | y) that results when the one-way logistic ANOVA likelihood is combined with a flat prior on β is proper if and only if (B) holds, but this condition is not easy to interpret. We present an equivalent condition that is easy to check and understand. As usual, let p̂_i be the proportion of 1s in the ith treatment group, that is, p̂_i = n_i^{-1} Σ_{l=ñ_{i-1}+1}^{ñ_i} y_l. The following result, which is proven in the Appendix, implies that the posterior is proper if and only if there is at least one 1 and at least one 0 in each treatment group.

Corollary 1. Assume that X = ⊕_{i=1}^p 1_{n_i}. The posterior is proper if and only if

0 < p̂_i < 1 for all i ∈ N_p.   (B′)

Assume now that the posterior π(β | y) is proper. We study PS&W's Markov chain Φ for exploring the posterior. In order to formally describe our main result, we must introduce the operator associated with Φ. Recall that k(β | β′) denotes the Mtd of Φ, that is, the conditional density of β^(m+1) given that β^(m) = β′.
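The two-step DA update of Section 2 specializes nicely to this model and is easy to prototype. In the sketch below (our own, not the paper's code) we stand in for PS&W's exact rejection sampler with a truncated sum-of-gammas representation of PG(1, c), so the Polya-Gamma draws are only approximate; the truncation level K and all names are our choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def rpg_approx(c, K=200, rng=rng):
    """Approximate draw from PG(1, c) by truncating the infinite
    sum-of-gammas representation
        W = (1 / (2 pi^2)) * sum_{k>=1} g_k / ((k - 1/2)^2 + c^2 / (4 pi^2)),
    where the g_k are iid Exponential(1).  Exact only as K -> infinity."""
    k = np.arange(1, K + 1)
    g = rng.exponential(size=K)
    return np.sum(g / ((k - 0.5) ** 2 + c ** 2 / (4 * np.pi ** 2))) / (2 * np.pi ** 2)

def da_step(beta, Z, rng=rng):
    """One iteration of the two-step DA update under the flat prior."""
    # Step 1: W_i ~ PG(1, |z_i' beta|), independently.
    w = np.array([rpg_approx(abs(z @ beta), rng=rng) for z in Z])
    # Step 2: beta | w ~ N_p(mu(w), Sigma(w)), with
    #   Sigma(w) = (Z' Omega(w) Z)^{-1} and mu(w) = -Sigma(w) Z' 1_N / 2.
    Sigma = np.linalg.inv(Z.T @ (w[:, None] * Z))
    mu = -0.5 * Sigma @ (Z.T @ np.ones(Z.shape[0]))
    return rng.multivariate_normal(mu, Sigma)

# Toy one-way ANOVA data: p = 2 groups of size 3 each.
X = np.repeat(np.eye(2), 3, axis=0)
y = np.array([1, 0, 1, 0, 0, 1])
Z = (1 - 2 * y)[:, None] * X        # rows z_i = (I{y_i=0} - I{y_i=1}) x_i
beta = np.zeros(2)
for _ in range(10):
    beta = da_step(beta, Z)
```

Because E[PG(1, 0)] = 1/4, averaging many rpg_approx(0.0) draws gives roughly 0.25, a convenient sanity check on the latent-variable step.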
Let L^2_0 be the space of real-valued functions with domain R^p that are square integrable and have mean zero with respect to the posterior density π(β | y). This is a Hilbert space in which the inner product of g, h ∈ L^2_0 is defined as

⟨g, h⟩ = ∫_{R^p} g(β) h(β) π(β | y) dβ,

and the corresponding norm is, of course, given by ‖g‖ = ⟨g, g⟩^{1/2}. The Mtd k defines an operator on L^2_0, and the spectrum of the operator contains a great deal of information about the convergence behavior of the corresponding Markov chain Φ (see, e.g., Hobert, Roy and Robert, 2011). Let K : L^2_0 → L^2_0 denote the operator that maps g ∈ L^2_0 to

(Kg)(β′) = ∫_{R^p} g(β) k(β | β′) dβ.

Because the operator K is self-adjoint and positive, the spectrum of K is a subset of the interval [0, 1] (see, e.g., Hobert and Marchev, 2008; Liu, Wong and Kong, 1994; Hobert et al., 2011). Moreover, if the self-adjoint, positive operator K is compact, then its spectrum consists solely of eigenvalues (which are all strictly less than one) and the point {0} (see, e.g., Retherford, 1993, pp. 61-62). If the sum of the eigenvalues is finite, then the operator is called trace-class (see Khare and Hobert (2011), and references therein). Here is our main result.

Theorem 1. Assume that X = ⊕_{i=1}^p 1_{n_i} and the posterior is proper. Then the Markov operator K is trace-class.

The following lemma, which is a simple consequence of results in Devroye (2009), will be used in the proof of Theorem 1.

Lemma 1. Let g denote the density of a PG(1, 0) random variable. If w ∈ (0, 1/log 3], then

g(w) ≤ (2πw^3)^{-1/2} exp{-1/(8w)}.

Remark 1. It is clear that (2) is the density of a PG(1, 0) random variable. As described in Devroye (2009), Lemma 1 follows from the fact that g is of the form g(w) = Σ_{k=0}^∞ (-1)^k a_k(w), where the nonnegative sequence {a_k(w)}_{k=0}^∞ is decreasing in k for w ∈ (0, 1/log 3], so the alternating series is bounded above by its leading term a_0(w).

Proof of Theorem 1. To prove that the Markov operator K is trace-class, we follow a technique used in Khare and Hobert (2011); that is, we will establish the following condition:

∫_{R^p} k(β | β) dβ = ∫_{R^N_+} [∫_{R^p} π(β | w, y) π(w | β, y) dβ] dw < ∞.   (3)

The key is to bound the inner integral in (3) by

c_1 ∏_{i=1}^N exp{a/(8w_i)} g(w_i),

where c_1 and a < 1 are constants. We then use Lemma 1 to complete the proof. We begin by evaluating the inner integral in (3). Recall that Σ = Σ(w) = (Z^T Ω(w) Z)^{-1} and μ = μ(w) = Σ(w) Z^T(-1_N/2). First, note that the product of densities π(β | w, y) π(w | β, y) can be written as follows:

(2π)^{-p/2} |Σ|^{-1/2} exp{-(1/2)(β^T (Z^T Ω Z) β + β^T Z^T 1_N)} exp{-(1/8) 1_N^T Z (Z^T Ω Z)^{-1} Z^T 1_N} ∏_{i=1}^N [cosh(z_i^T β / 2) exp{-(z_i^T β)^2 w_i / 2} g(w_i)]

= (2π)^{-p/2} |Σ|^{-1/2} exp{-(1/2) β^T (2Σ^{-1}) β} ∏_{i=1}^N [cosh(z_i^T β / 2) exp{-z_i^T β / 2} g(w_i)] exp{-(1/8) 1_N^T Z (Z^T Ω Z)^{-1} Z^T 1_N}

= (2π)^{-p/2} |Σ|^{-1/2} exp{-(1/2) β^T (2Σ^{-1}) β} [∏_{i=1}^N (1 + exp{-z_i^T β})/2] exp{-(1/8) 1_N^T Z (Z^T Ω Z)^{-1} Z^T 1_N} ∏_{i=1}^N g(w_i),   (4)

where the last equality follows from cosh(a) e^{-a} = (1 + e^{-2a})/2. We now evaluate the integral of (4) with respect to β. Recall N_N = {1, 2, ..., N}. For each A ⊆ N_N, define an N × p matrix Z_A whose ith row is z_i^T if i ∈ A, and 0_p^T if i ∈ N_N \ A. Then it is easy to see that

exp{-1_N^T Z_A β} = ∏_{i∈A} exp{-z_i^T β} if A is nonempty, and exp{-1_N^T Z_A β} = 1 if A is empty.

Therefore, we have

∏_{i=1}^N (1 + exp{-z_i^T β}) = Σ_{A⊆N_N} exp{-1_N^T Z_A β},

so the integral of (4) with respect to β can be written as

c_0 exp{-(1/8) 1_N^T Z (Z^T Ω Z)^{-1} Z^T 1_N} [∏_{i=1}^N g(w_i)] Σ_{A⊆N_N} ∫_{R^p} exp{-1_N^T Z_A β} φ(β; 0_p, Σ/2) dβ,   (5)

where c_0 is a constant, and φ(β; 0_p, Σ/2) is the multivariate normal density with mean 0_p and covariance Σ/2. Note that the integral in (5) is just the moment generating function of a N_p(0_p, Σ/2) random vector evaluated at the point -Z_A^T 1_N. Hence, (5) is equal to

c_0 exp{-(1/8) 1_N^T Z (Z^T Ω Z)^{-1} Z^T 1_N} [∏_{i=1}^N g(w_i)]

× Σ_{A⊆N_N} exp{(1/4) 1_N^T Z_A (Z^T Ω Z)^{-1} Z_A^T 1_N}.   (6)

We now express the exponential terms of (6) in a more compact way. Recall that X = ⊕_{i=1}^p 1_{n_i} and that the ith row of Z is z_i^T = I_{{0}}(y_i) x_i^T - I_{{1}}(y_i) x_i^T. Clearly, the N × p matrix Z can be written as UX, where U is the N × N diagonal matrix whose ith diagonal entry is equal to u_i := I_{{0}}(y_i) - I_{{1}}(y_i). It is easy to see that Z^T Ω Z = X^T Ω X and, using the orthogonality of the columns of X, that Z^T Ω Z is a p × p diagonal matrix whose jth diagonal entry is equal to

Σ_{k=ñ_{j-1}+1}^{ñ_j} w_k,

where ñ_0 = 0 and ñ_i = Σ_{k=1}^i n_k for i ∈ N_p. Let z_ij denote the jth component of the row vector z_i^T. A straightforward calculation yields that, for A ⊆ N_N,

1_N^T Z_A (Z^T Ω Z)^{-1} Z_A^T 1_N = Σ_{j=1}^p (Σ_{i∈A} z_ij)^2 / (Σ_{k=ñ_{j-1}+1}^{ñ_j} w_k).

Hence, the exponential terms in (6) can be written as

Σ_{A⊆N_N} exp{-(1/8) Σ_{j=1}^p [(Σ_{i=1}^N z_ij)^2 - 2 (Σ_{i∈A} z_ij)^2] / (Σ_{k=ñ_{j-1}+1}^{ñ_j} w_k)}.   (7)

We now claim that, for all A ⊆ N_N and all j ∈ N_p,

(1/n_j^2) [2 (Σ_{i∈A} z_ij)^2 - (Σ_{i=1}^N z_ij)^2] < 1.

Recall that p̂_j is the proportion of 1s in the jth treatment group, that is, p̂_j = n_j^{-1} Σ_{i=ñ_{j-1}+1}^{ñ_j} y_i. Also, recall from the definition of Z that its jth column is

[0_{ñ_{j-1}}^T  u_{ñ_{j-1}+1} ... u_{ñ_j}  0_{N-ñ_j}^T]^T,

where u_i = I_{{0}}(y_i) - I_{{1}}(y_i). Of course, when j = 1 or j = p, we omit the vector 0_0; for example, the first column of Z is [u_1 ... u_{n_1}  0_{N-n_1}^T]^T. Since there are n_j p̂_j 1s and n_j(1 - p̂_j) 0s in the jth treatment group, in the jth column of Z, n_j p̂_j of the u_i's are -1s and n_j(1 - p̂_j) of the u_i's are 1s. Assume without loss of generality that, for each j ∈ N_p, the first n_j p̂_j of these u_i's are -1s and the remaining n_j(1 - p̂_j) are 1s. That is, we assume that the jth column of Z is equal to

[-1_{n_1 p̂_1}^T  1_{n_1(1-p̂_1)}^T  0_{N-n_1}^T]^T  if j = 1,
[0_{ñ_{j-1}}^T  -1_{n_j p̂_j}^T  1_{n_j(1-p̂_j)}^T  0_{N-ñ_j}^T]^T  if 1 < j < p,   (8)
[0_{ñ_{p-1}}^T  -1_{n_p p̂_p}^T  1_{n_p(1-p̂_p)}^T]^T  if j = p.
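The structural fact that powers this calculation, namely that Z^T Ω Z = X^T Ω X is diagonal with jth entry equal to the sum of the w_k in the jth treatment group, can be confirmed numerically. The check below is ours, not the paper's; the group sizes and weights are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
n = [3, 4, 2]                        # group sizes n_1, ..., n_p
p, N = len(n), sum(n)
X = np.repeat(np.eye(p), n, axis=0)  # X = direct sum of the vectors 1_{n_i}
y = rng.integers(0, 2, size=N)
u = 1 - 2 * y                        # u_i = I{y_i = 0} - I{y_i = 1}
Z = u[:, None] * X                   # Z = U X with U = diag(u)
w = rng.gamma(2.0, size=N)           # arbitrary positive weights
Omega = np.diag(w)

M = Z.T @ Omega @ Z
starts = np.cumsum([0] + n[:-1])
group_sums = [w[s:s + nj].sum() for s, nj in zip(starts, n)]
```

Here M agrees with X^T Ω X and with the diagonal matrix of per-group sums, which is exactly what allows (Z^T Ω Z)^{-1} to be written entrywise in (7).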

Letting q̂_j = 1 - p̂_j for j ∈ N_p, the sum of the jth column of Z is

Σ_{i=1}^N z_ij = n_j(1 - p̂_j) - n_j p̂_j = n_j(q̂_j - p̂_j),

and, for all A ⊆ N_N and j ∈ N_p,

|Σ_{i∈A} z_ij| ≤ max{n_j p̂_j, n_j q̂_j}.

It follows that, for all A ⊆ N_N and all j ∈ N_p,

(1/n_j^2)[2 (Σ_{i∈A} z_ij)^2 - (Σ_{i=1}^N z_ij)^2] ≤ 2 max{p̂_j, q̂_j}^2 - (q̂_j - p̂_j)^2
= 2(p̂_j^2 + q̂_j^2) - 2 min{p̂_j, q̂_j}^2 - (q̂_j - p̂_j)^2
= 1 - 2 min{p̂_j, q̂_j}^2,

where the final equality uses 2(p̂_j^2 + q̂_j^2) - (q̂_j - p̂_j)^2 = (p̂_j + q̂_j)^2 = 1. Since condition (B′) of Corollary 1 is in force, for all j ∈ N_p, we have

1 - 2 min{p̂_j, q̂_j}^2 < 1.

Hence, (7) is bounded above by

2^N exp{(a/8) Σ_{j=1}^p n_j^2 / (Σ_{k=ñ_{j-1}+1}^{ñ_j} w_k)},   (9)

where

a := max_{j∈N_p} [1 - 2 min{p̂_j, q̂_j}^2] < 1.

Combining (5), (6), (7) and (9), we have

∫_{R^p} π(β | w, y) π(w | β, y) dβ ≤ c_1 exp{(a/8) Σ_{j=1}^p n_j^2 / (Σ_{k=ñ_{j-1}+1}^{ñ_j} w_k)} ∏_{i=1}^N g(w_i)
≤ c_1 exp{(a/8) Σ_{j=1}^p Σ_{k=ñ_{j-1}+1}^{ñ_j} 1/w_k} ∏_{i=1}^N g(w_i)
= c_1 ∏_{i=1}^N exp{a/(8w_i)} g(w_i),

where c_1 is a constant, and the last inequality is due to the arithmetic-harmonic mean inequality (n_j^2 / Σ_k w_k ≤ Σ_k 1/w_k within each group). It follows that

∫_{R^N_+} [∫_{R^p} π(β | w, y) π(w | β, y) dβ] dw

≤ c_1 ∏_{i=1}^N ∫_{R_+} exp{a/(8w_i)} g(w_i) dw_i
= c_1 ∏_{i=1}^N [∫_0^t exp{a/(8w_i)} g(w_i) dw_i + ∫_t^∞ exp{a/(8w_i)} g(w_i) dw_i]
≤ c_1 ∏_{i=1}^N [∫_0^t (2πw_i^3)^{-1/2} exp{-(1 - a)/(8w_i)} dw_i + c_2],   (10)

where t = 1/log 3, c_2 := exp{a/(8t)} is a constant, and the last inequality is due to Lemma 1 on (0, t] and the bound exp{a/(8w_i)} ≤ exp{a/(8t)} on [t, ∞). The integrand in (10) is the kernel of an inverse gamma density, and hence (3) is satisfied.

4. Discussion

We have shown in Section 3 that, in the one-way logistic ANOVA model, the Markov operator associated with PS&W's DA chain is trace-class (and thus the chain is geometrically ergodic). Given this result, it is natural to ask whether PS&W's DA operator is also trace-class in the general logistic regression model. We note that our trace-class proof for the one-way model relies heavily on the orthogonality of the columns of the design matrix X, which is not guaranteed in general. In particular, the idea is to use orthogonality to express (Z^T Ω Z)^{-1} as a diagonal matrix whose diagonal entries are simple functions of w = (w_1, ..., w_N). In the general case, however, (Z^T Ω Z)^{-1} does not necessarily take a simple form, which in turn complicates the analysis. Thus, we believe that establishing (3) for the general case would be much more than a straightforward extension of the proof of our Theorem 1. It is our hope that our result for the one-way logistic ANOVA model will promote the study of the DA Markov chain for the general logistic regression model.

Appendix A: Appendix

In this section, we provide a proof of Corollary 1. Before presenting the proof, we recall the structure of Z given in (8), which will be used a couple of times within the proof. The jth column of the N × p matrix Z is

[-1_{n_1 p̂_1}^T  1_{n_1(1-p̂_1)}^T  0_{N-n_1}^T]^T  if j = 1,
[0_{ñ_{j-1}}^T  -1_{n_j p̂_j}^T  1_{n_j(1-p̂_j)}^T  0_{N-ñ_j}^T]^T  if 1 < j < p,
[0_{ñ_{p-1}}^T  -1_{n_p p̂_p}^T  1_{n_p(1-p̂_p)}^T]^T  if j = p.

Proof of Corollary 1. Recall that c(y) < ∞ if and only if (B) holds. Thus, it suffices to show that (B′) is equivalent to (B). First, we shall write b = (b_1^T, ..., b_p^T)^T, where each b_j is an n_j × 1 vector.
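As a numerical companion to the proof (our own sketch: the particular weights are one valid choice of b, not necessarily the one used in the paper), condition (B′) lets us build a strictly positive b with Z^T b = 0_p by weighting each observation by the within-group frequency of the opposite outcome:

```python
import numpy as np

rng = np.random.default_rng(2)
n = [4, 3, 5]                        # group sizes n_1, ..., n_p
p, N = len(n), sum(n)
X = np.repeat(np.eye(p), n, axis=0)
# Force at least one 0 and one 1 per group, so that (B') holds.
y = np.concatenate([np.concatenate(([0, 1], rng.integers(0, 2, nj - 2))) for nj in n])
Z = (1 - 2 * y)[:, None] * X

starts = np.cumsum([0] + n[:-1])
b = np.empty(N)
for s, nj in zip(starts, n):
    phat = y[s:s + nj].mean()        # proportion of 1s in the group
    # y = 1 rows of the jth column of Z carry -1, y = 0 rows carry +1;
    # these weights make the group contributions cancel exactly.
    b[s:s + nj] = np.where(y[s:s + nj] == 1, 1 - phat, phat)
```

Each group contributes n_j p̂_j copies of -(1 - p̂_j) and n_j(1 - p̂_j) copies of +p̂_j to the jth component of Z^T b, which sum to zero, and positivity of b is exactly the condition 0 < p̂_j < 1.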
We start by demonstrating that (B) implies (B′). We will establish this by contradiction. Assume (without

loss of generality) that p̂_1 = 0. Then there are no 1s in the first treatment group, so the first column of Z is equal to [1_{n_1}^T 0_{N-n_1}^T]^T. It follows that the first component of the p × 1 vector Z^T b is 1_{n_1}^T b_1 > 0 for all b ∈ R^N_+, which contradicts (B). We can similarly prove that the assumption p̂_1 = 1 yields a contradiction to (B). Hence, it must be that 0 < p̂_j < 1 for all j ∈ N_p.

We now show that (B′) implies (B). We will establish this by explicitly constructing a vector b ∈ R^N_+ such that Z^T b = 0_p. For each j ∈ N_p, define

b_j = [(q̂_j/p̂_j) 1_{n_j p̂_j}^T  1_{n_j q̂_j}^T]^T  if 0 < p̂_j < 1/2,
b_j = [1_{n_j p̂_j}^T  (p̂_j/q̂_j) 1_{n_j q̂_j}^T]^T  if 1/2 ≤ p̂_j < 1,

where, as before, q̂_j = 1 - p̂_j. Clearly, each b_j is in R^{n_j}_+, so b ∈ R^N_+. Therefore, we need only show that Z^T b = 0_p. If 1/2 ≤ p̂_j < 1, then the jth component of the p × 1 vector Z^T b is equal to

[-n_j p̂_j  n_j q̂_j] [1  p̂_j/q̂_j]^T = -n_j p̂_j + n_j p̂_j = 0.

We can similarly show that, if 0 < p̂_j < 1/2, then the jth component of Z^T b is also zero. Hence, Z^T b = 0_p, which completes the proof.

Acknowledgements

Román's research was supported by NSF Grant DMS.

References

Chen, M.-H. and Shao, Q.-M. (2000). Propriety of posterior distribution for dichotomous quantal response models. Proceedings of the American Mathematical Society.
Choi, H. M. and Hobert, J. P. (2013). The Polya-Gamma Gibbs sampler for Bayesian logistic regression is uniformly ergodic. Electronic Journal of Statistics.
Devroye, L. (2009). On exact simulation algorithms for some distributions related to Jacobi theta functions. Statistics and Probability Letters 79.
Flegal, J. M., Haran, M. and Jones, G. L. (2008). Markov chain Monte Carlo: Can we trust the third significant figure? Statistical Science.
Hobert, J. P. (2011). The data augmentation algorithm: Theory and methodology. In Handbook of Markov Chain Monte Carlo (S. Brooks, A. Gelman, G. Jones and X.-L. Meng, eds.). Chapman & Hall/CRC Press.
Hobert, J. P. and Marchev, D. (2008).
A theoretical comparison of the data augmentation, marginal augmentation and PX-DA algorithms. The Annals of Statistics.

Hobert, J. P., Roy, V. and Robert, C. P. (2011). Improving the convergence properties of the data augmentation algorithm with an application to Bayesian mixture modelling. Statistical Science.
Jones, G. L., Haran, M., Caffo, B. S. and Neath, R. (2006). Fixed-width output analysis for Markov chain Monte Carlo. Journal of the American Statistical Association.
Khare, K. and Hobert, J. P. (2011). A spectral analytic comparison of trace-class data augmentation algorithms and their sandwich variants. The Annals of Statistics.
Liu, J. S., Wong, W. H. and Kong, A. (1994). Covariance structure of the Gibbs sampler with applications to comparisons of estimators and augmentation schemes. Biometrika.
Polson, N. G., Scott, J. G. and Windle, J. (2013). Bayesian inference for logistic models using Polya-Gamma latent variables. Journal of the American Statistical Association.
Retherford, J. R. (1993). Hilbert Space: Compact Operators and the Trace Theorem. Cambridge University Press.


More information

Default Priors and Effcient Posterior Computation in Bayesian

Default Priors and Effcient Posterior Computation in Bayesian Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature

More information

Partially Collapsed Gibbs Samplers: Theory and Methods. Ever increasing computational power along with ever more sophisticated statistical computing

Partially Collapsed Gibbs Samplers: Theory and Methods. Ever increasing computational power along with ever more sophisticated statistical computing Partially Collapsed Gibbs Samplers: Theory and Methods David A. van Dyk 1 and Taeyoung Park Ever increasing computational power along with ever more sophisticated statistical computing techniques is making

More information

The Bayesian Approach to Multi-equation Econometric Model Estimation

The Bayesian Approach to Multi-equation Econometric Model Estimation Journal of Statistical and Econometric Methods, vol.3, no.1, 2014, 85-96 ISSN: 2241-0384 (print), 2241-0376 (online) Scienpress Ltd, 2014 The Bayesian Approach to Multi-equation Econometric Model Estimation

More information

An ABC interpretation of the multiple auxiliary variable method

An ABC interpretation of the multiple auxiliary variable method School of Mathematical and Physical Sciences Department of Mathematics and Statistics Preprint MPS-2016-07 27 April 2016 An ABC interpretation of the multiple auxiliary variable method by Dennis Prangle

More information

HIERARCHICAL BAYESIAN ANALYSIS OF BINARY MATCHED PAIRS DATA

HIERARCHICAL BAYESIAN ANALYSIS OF BINARY MATCHED PAIRS DATA Statistica Sinica 1(2), 647-657 HIERARCHICAL BAYESIAN ANALYSIS OF BINARY MATCHED PAIRS DATA Malay Ghosh, Ming-Hui Chen, Atalanta Ghosh and Alan Agresti University of Florida, Worcester Polytechnic Institute

More information

Kernel adaptive Sequential Monte Carlo

Kernel adaptive Sequential Monte Carlo Kernel adaptive Sequential Monte Carlo Ingmar Schuster (Paris Dauphine) Heiko Strathmann (University College London) Brooks Paige (Oxford) Dino Sejdinovic (Oxford) December 7, 2015 1 / 36 Section 1 Outline

More information

Bayesian Inference. Chapter 9. Linear models and regression

Bayesian Inference. Chapter 9. Linear models and regression Bayesian Inference Chapter 9. Linear models and regression M. Concepcion Ausin Universidad Carlos III de Madrid Master in Business Administration and Quantitative Methods Master in Mathematical Engineering

More information

SAMPLING ALGORITHMS. In general. Inference in Bayesian models

SAMPLING ALGORITHMS. In general. Inference in Bayesian models SAMPLING ALGORITHMS SAMPLING ALGORITHMS In general A sampling algorithm is an algorithm that outputs samples x 1, x 2,... from a given distribution P or density p. Sampling algorithms can for example be

More information

Bayesian Multivariate Logistic Regression

Bayesian Multivariate Logistic Regression Bayesian Multivariate Logistic Regression Sean M. O Brien and David B. Dunson Biostatistics Branch National Institute of Environmental Health Sciences Research Triangle Park, NC 1 Goals Brief review of

More information

Hamiltonian Monte Carlo

Hamiltonian Monte Carlo Chapter 7 Hamiltonian Monte Carlo As with the Metropolis Hastings algorithm, Hamiltonian (or hybrid) Monte Carlo (HMC) is an idea that has been knocking around in the physics literature since the 1980s

More information

Partially Collapsed Gibbs Samplers: Theory and Methods

Partially Collapsed Gibbs Samplers: Theory and Methods David A. VAN DYK and Taeyoung PARK Partially Collapsed Gibbs Samplers: Theory and Methods Ever-increasing computational power, along with ever more sophisticated statistical computing techniques, is making

More information

Likelihood-free MCMC

Likelihood-free MCMC Bayesian inference for stable distributions with applications in finance Department of Mathematics University of Leicester September 2, 2011 MSc project final presentation Outline 1 2 3 4 Classical Monte

More information

Latent Variable Models for Binary Data. Suppose that for a given vector of explanatory variables x, the latent

Latent Variable Models for Binary Data. Suppose that for a given vector of explanatory variables x, the latent Latent Variable Models for Binary Data Suppose that for a given vector of explanatory variables x, the latent variable, U, has a continuous cumulative distribution function F (u; x) and that the binary

More information

Slice Sampling with Adaptive Multivariate Steps: The Shrinking-Rank Method

Slice Sampling with Adaptive Multivariate Steps: The Shrinking-Rank Method Slice Sampling with Adaptive Multivariate Steps: The Shrinking-Rank Method Madeleine B. Thompson Radford M. Neal Abstract The shrinking rank method is a variation of slice sampling that is efficient at

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

More information

Computational statistics

Computational statistics Computational statistics Markov Chain Monte Carlo methods Thierry Denœux March 2017 Thierry Denœux Computational statistics March 2017 1 / 71 Contents of this chapter When a target density f can be evaluated

More information

Partial factor modeling: predictor-dependent shrinkage for linear regression

Partial factor modeling: predictor-dependent shrinkage for linear regression modeling: predictor-dependent shrinkage for linear Richard Hahn, Carlos Carvalho and Sayan Mukherjee JASA 2013 Review by Esther Salazar Duke University December, 2013 Factor framework The factor framework

More information

Resampling Markov chain Monte Carlo algorithms: Basic analysis and empirical comparisons. Zhiqiang Tan 1. February 2014

Resampling Markov chain Monte Carlo algorithms: Basic analysis and empirical comparisons. Zhiqiang Tan 1. February 2014 Resampling Markov chain Monte Carlo s: Basic analysis and empirical comparisons Zhiqiang Tan 1 February 2014 Abstract. Sampling from complex distributions is an important but challenging topic in scientific

More information

Markov Chain Monte Carlo

Markov Chain Monte Carlo 1 Motivation 1.1 Bayesian Learning Markov Chain Monte Carlo Yale Chang In Bayesian learning, given data X, we make assumptions on the generative process of X by introducing hidden variables Z: p(z): prior

More information

Inferences on a Normal Covariance Matrix and Generalized Variance with Monotone Missing Data

Inferences on a Normal Covariance Matrix and Generalized Variance with Monotone Missing Data Journal of Multivariate Analysis 78, 6282 (2001) doi:10.1006jmva.2000.1939, available online at http:www.idealibrary.com on Inferences on a Normal Covariance Matrix and Generalized Variance with Monotone

More information

Review. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda

Review. DS GA 1002 Statistical and Mathematical Models.   Carlos Fernandez-Granda Review DS GA 1002 Statistical and Mathematical Models http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall16 Carlos Fernandez-Granda Probability and statistics Probability: Framework for dealing with

More information

Computer intensive statistical methods

Computer intensive statistical methods Lecture 11 Markov Chain Monte Carlo cont. October 6, 2015 Jonas Wallin jonwal@chalmers.se Chalmers, Gothenburg university The two stage Gibbs sampler If the conditional distributions are easy to sample

More information

Stat 535 C - Statistical Computing & Monte Carlo Methods. Lecture 15-7th March Arnaud Doucet

Stat 535 C - Statistical Computing & Monte Carlo Methods. Lecture 15-7th March Arnaud Doucet Stat 535 C - Statistical Computing & Monte Carlo Methods Lecture 15-7th March 2006 Arnaud Doucet Email: arnaud@cs.ubc.ca 1 1.1 Outline Mixture and composition of kernels. Hybrid algorithms. Examples Overview

More information

Markov Chains CK eqns Classes Hitting times Rec./trans. Strong Markov Stat. distr. Reversibility * Markov Chains

Markov Chains CK eqns Classes Hitting times Rec./trans. Strong Markov Stat. distr. Reversibility * Markov Chains Markov Chains A random process X is a family {X t : t T } of random variables indexed by some set T. When T = {0, 1, 2,... } one speaks about a discrete-time process, for T = R or T = [0, ) one has a continuous-time

More information

Local consistency of Markov chain Monte Carlo methods

Local consistency of Markov chain Monte Carlo methods Ann Inst Stat Math (2014) 66:63 74 DOI 10.1007/s10463-013-0403-3 Local consistency of Markov chain Monte Carlo methods Kengo Kamatani Received: 12 January 2012 / Revised: 8 March 2013 / Published online:

More information

Markov chain Monte Carlo

Markov chain Monte Carlo 1 / 26 Markov chain Monte Carlo Timothy Hanson 1 and Alejandro Jara 2 1 Division of Biostatistics, University of Minnesota, USA 2 Department of Statistics, Universidad de Concepción, Chile IAP-Workshop

More information

Markov Chains, Stochastic Processes, and Matrix Decompositions

Markov Chains, Stochastic Processes, and Matrix Decompositions Markov Chains, Stochastic Processes, and Matrix Decompositions 5 May 2014 Outline 1 Markov Chains Outline 1 Markov Chains 2 Introduction Perron-Frobenius Matrix Decompositions and Markov Chains Spectral

More information

A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling. Christopher Jennison. Adriana Ibrahim. Seminar at University of Kuwait

A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling. Christopher Jennison. Adriana Ibrahim. Seminar at University of Kuwait A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling Christopher Jennison Department of Mathematical Sciences, University of Bath, UK http://people.bath.ac.uk/mascj Adriana Ibrahim Institute

More information

Kobe University Repository : Kernel

Kobe University Repository : Kernel Kobe University Repository : Kernel タイトル Title 著者 Author(s) 掲載誌 巻号 ページ Citation 刊行日 Issue date 資源タイプ Resource Type 版区分 Resource Version 権利 Rights DOI URL Note on the Sampling Distribution for the Metropolis-

More information

Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model

Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model UNIVERSITY OF TEXAS AT SAN ANTONIO Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model Liang Jing April 2010 1 1 ABSTRACT In this paper, common MCMC algorithms are introduced

More information

Stat 5421 Lecture Notes Proper Conjugate Priors for Exponential Families Charles J. Geyer March 28, 2016

Stat 5421 Lecture Notes Proper Conjugate Priors for Exponential Families Charles J. Geyer March 28, 2016 Stat 5421 Lecture Notes Proper Conjugate Priors for Exponential Families Charles J. Geyer March 28, 2016 1 Theory This section explains the theory of conjugate priors for exponential families of distributions,

More information

On the Use of Cauchy Prior Distributions for Bayesian Logistic Regression

On the Use of Cauchy Prior Distributions for Bayesian Logistic Regression On the Use of Cauchy Prior Distributions for Bayesian Logistic Regression Joyee Ghosh, Yingbo Li, Robin Mitra arxiv:157.717v2 [stat.me] 9 Feb 217 Abstract In logistic regression, separation occurs when

More information

Eventually reducible matrix, eventually nonnegative matrix, eventually r-cyclic

Eventually reducible matrix, eventually nonnegative matrix, eventually r-cyclic December 15, 2012 EVENUAL PROPERIES OF MARICES LESLIE HOGBEN AND ULRICA WILSON Abstract. An eventual property of a matrix M C n n is a property that holds for all powers M k, k k 0, for some positive integer

More information

Chapter 7. Markov chain background. 7.1 Finite state space

Chapter 7. Markov chain background. 7.1 Finite state space Chapter 7 Markov chain background A stochastic process is a family of random variables {X t } indexed by a varaible t which we will think of as time. Time can be discrete or continuous. We will only consider

More information

Bayesian GLMs and Metropolis-Hastings Algorithm

Bayesian GLMs and Metropolis-Hastings Algorithm Bayesian GLMs and Metropolis-Hastings Algorithm We have seen that with conjugate or semi-conjugate prior distributions the Gibbs sampler can be used to sample from the posterior distribution. In situations,

More information

Spring 2017 Econ 574 Roger Koenker. Lecture 14 GEE-GMM

Spring 2017 Econ 574 Roger Koenker. Lecture 14 GEE-GMM University of Illinois Department of Economics Spring 2017 Econ 574 Roger Koenker Lecture 14 GEE-GMM Throughout the course we have emphasized methods of estimation and inference based on the principle

More information

Zero Variance Markov Chain Monte Carlo for Bayesian Estimators

Zero Variance Markov Chain Monte Carlo for Bayesian Estimators Noname manuscript No. will be inserted by the editor Zero Variance Markov Chain Monte Carlo for Bayesian Estimators Supplementary material Antonietta Mira Reza Solgi Daniele Imparato Received: date / Accepted:

More information

Efficient Sampling Methods for Truncated Multivariate Normal and Student-t Distributions Subject to Linear Inequality Constraints

Efficient Sampling Methods for Truncated Multivariate Normal and Student-t Distributions Subject to Linear Inequality Constraints Efficient Sampling Methods for Truncated Multivariate Normal and Student-t Distributions Subject to Linear Inequality Constraints Yifang Li Department of Statistics, North Carolina State University 2311

More information

1 Data Arrays and Decompositions

1 Data Arrays and Decompositions 1 Data Arrays and Decompositions 1.1 Variance Matrices and Eigenstructure Consider a p p positive definite and symmetric matrix V - a model parameter or a sample variance matrix. The eigenstructure is

More information

Minicourse on: Markov Chain Monte Carlo: Simulation Techniques in Statistics

Minicourse on: Markov Chain Monte Carlo: Simulation Techniques in Statistics Minicourse on: Markov Chain Monte Carlo: Simulation Techniques in Statistics Eric Slud, Statistics Program Lecture 1: Metropolis-Hastings Algorithm, plus background in Simulation and Markov Chains. Lecture

More information

Probabilistic Graphical Models

Probabilistic Graphical Models 2016 Robert Nowak Probabilistic Graphical Models 1 Introduction We have focused mainly on linear models for signals, in particular the subspace model x = Uθ, where U is a n k matrix and θ R k is a vector

More information

Convex Optimization CMU-10725

Convex Optimization CMU-10725 Convex Optimization CMU-10725 Simulated Annealing Barnabás Póczos & Ryan Tibshirani Andrey Markov Markov Chains 2 Markov Chains Markov chain: Homogen Markov chain: 3 Markov Chains Assume that the state

More information

Xia Wang and Dipak K. Dey

Xia Wang and Dipak K. Dey A Flexible Skewed Link Function for Binary Response Data Xia Wang and Dipak K. Dey Technical Report #2008-5 June 18, 2008 This material was based upon work supported by the National Science Foundation

More information

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Gaussian Processes Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 01 Pictorial view of embedding distribution Transform the entire distribution to expected features Feature space Feature

More information

Simulation of truncated normal variables. Christian P. Robert LSTA, Université Pierre et Marie Curie, Paris

Simulation of truncated normal variables. Christian P. Robert LSTA, Université Pierre et Marie Curie, Paris Simulation of truncated normal variables Christian P. Robert LSTA, Université Pierre et Marie Curie, Paris Abstract arxiv:0907.4010v1 [stat.co] 23 Jul 2009 We provide in this paper simulation algorithms

More information

MCMC: Markov Chain Monte Carlo

MCMC: Markov Chain Monte Carlo I529: Machine Learning in Bioinformatics (Spring 2013) MCMC: Markov Chain Monte Carlo Yuzhen Ye School of Informatics and Computing Indiana University, Bloomington Spring 2013 Contents Review of Markov

More information

The Multivariate Gaussian Distribution

The Multivariate Gaussian Distribution The Multivariate Gaussian Distribution Chuong B. Do October, 8 A vector-valued random variable X = T X X n is said to have a multivariate normal or Gaussian) distribution with mean µ R n and covariance

More information

Quantitative Non-Geometric Convergence Bounds for Independence Samplers

Quantitative Non-Geometric Convergence Bounds for Independence Samplers Quantitative Non-Geometric Convergence Bounds for Independence Samplers by Gareth O. Roberts * and Jeffrey S. Rosenthal ** (September 28; revised July 29.) 1. Introduction. Markov chain Monte Carlo (MCMC)

More information

Scaling up Bayesian Inference

Scaling up Bayesian Inference Scaling up Bayesian Inference David Dunson Departments of Statistical Science, Mathematics & ECE, Duke University May 1, 2017 Outline Motivation & background EP-MCMC amcmc Discussion Motivation & background

More information

Course 311: Michaelmas Term 2005 Part III: Topics in Commutative Algebra

Course 311: Michaelmas Term 2005 Part III: Topics in Commutative Algebra Course 311: Michaelmas Term 2005 Part III: Topics in Commutative Algebra D. R. Wilkins Contents 3 Topics in Commutative Algebra 2 3.1 Rings and Fields......................... 2 3.2 Ideals...............................

More information

ST 740: Markov Chain Monte Carlo

ST 740: Markov Chain Monte Carlo ST 740: Markov Chain Monte Carlo Alyson Wilson Department of Statistics North Carolina State University October 14, 2012 A. Wilson (NCSU Stsatistics) MCMC October 14, 2012 1 / 20 Convergence Diagnostics:

More information

Markov Chain Monte Carlo Lecture 6

Markov Chain Monte Carlo Lecture 6 Sequential parallel tempering With the development of science and technology, we more and more need to deal with high dimensional systems. For example, we need to align a group of protein or DNA sequences

More information

ON CONVERGENCE RATES OF GIBBS SAMPLERS FOR UNIFORM DISTRIBUTIONS

ON CONVERGENCE RATES OF GIBBS SAMPLERS FOR UNIFORM DISTRIBUTIONS The Annals of Applied Probability 1998, Vol. 8, No. 4, 1291 1302 ON CONVERGENCE RATES OF GIBBS SAMPLERS FOR UNIFORM DISTRIBUTIONS By Gareth O. Roberts 1 and Jeffrey S. Rosenthal 2 University of Cambridge

More information

Markov Chain Monte Carlo Lecture 4

Markov Chain Monte Carlo Lecture 4 The local-trap problem refers to that in simulations of a complex system whose energy landscape is rugged, the sampler gets trapped in a local energy minimum indefinitely, rendering the simulation ineffective.

More information

16 : Approximate Inference: Markov Chain Monte Carlo

16 : Approximate Inference: Markov Chain Monte Carlo 10-708: Probabilistic Graphical Models 10-708, Spring 2017 16 : Approximate Inference: Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Yuan Yang, Chao-Ming Yen 1 Introduction As the target distribution

More information

CONVERGENCE ANALYSIS OF THE GIBBS SAMPLER FOR BAYESIAN GENERAL LINEAR MIXED MODELS WITH IMPROPER PRIORS

CONVERGENCE ANALYSIS OF THE GIBBS SAMPLER FOR BAYESIAN GENERAL LINEAR MIXED MODELS WITH IMPROPER PRIORS The Annals of Statistics 01, Vol. 40, No. 6, 83 849 DOI: 10.114/1-AOS105 Institute of Mathematical Statistics, 01 CONVERGENCE ANALYSIS OF THE GIBBS SAMPLER FOR BAYESIAN GENERAL LINEAR MIXED MODELS WITH

More information

General Glivenko-Cantelli theorems

General Glivenko-Cantelli theorems The ISI s Journal for the Rapid Dissemination of Statistics Research (wileyonlinelibrary.com) DOI: 10.100X/sta.0000......................................................................................................

More information

Inference in state-space models with multiple paths from conditional SMC

Inference in state-space models with multiple paths from conditional SMC Inference in state-space models with multiple paths from conditional SMC Sinan Yıldırım (Sabancı) joint work with Christophe Andrieu (Bristol), Arnaud Doucet (Oxford) and Nicolas Chopin (ENSAE) September

More information

Reversible Markov chains

Reversible Markov chains Reversible Markov chains Variational representations and ordering Chris Sherlock Abstract This pedagogical document explains three variational representations that are useful when comparing the efficiencies

More information

Bayesian Methods with Monte Carlo Markov Chains II

Bayesian Methods with Monte Carlo Markov Chains II Bayesian Methods with Monte Carlo Markov Chains II Henry Horng-Shing Lu Institute of Statistics National Chiao Tung University hslu@stat.nctu.edu.tw http://tigpbp.iis.sinica.edu.tw/courses.htm 1 Part 3

More information

Bayesian Learning and Inference in Recurrent Switching Linear Dynamical Systems

Bayesian Learning and Inference in Recurrent Switching Linear Dynamical Systems Bayesian Learning and Inference in Recurrent Switching Linear Dynamical Systems Scott W. Linderman Matthew J. Johnson Andrew C. Miller Columbia University Harvard and Google Brain Harvard University Ryan

More information

Nonlinear Inequality Constrained Ridge Regression Estimator

Nonlinear Inequality Constrained Ridge Regression Estimator The International Conference on Trends and Perspectives in Linear Statistical Inference (LinStat2014) 24 28 August 2014 Linköping, Sweden Nonlinear Inequality Constrained Ridge Regression Estimator Dr.

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods By Oleg Makhnin 1 Introduction a b c M = d e f g h i 0 f(x)dx 1.1 Motivation 1.1.1 Just here Supresses numbering 1.1.2 After this 1.2 Literature 2 Method 2.1 New math As

More information

arxiv: v1 [stat.co] 18 Feb 2012

arxiv: v1 [stat.co] 18 Feb 2012 A LEVEL-SET HIT-AND-RUN SAMPLER FOR QUASI-CONCAVE DISTRIBUTIONS Dean Foster and Shane T. Jensen arxiv:1202.4094v1 [stat.co] 18 Feb 2012 Department of Statistics The Wharton School University of Pennsylvania

More information

HOW TO CHOOSE THE STATE RELEVANCE WEIGHT OF THE APPROXIMATE LINEAR PROGRAM?

HOW TO CHOOSE THE STATE RELEVANCE WEIGHT OF THE APPROXIMATE LINEAR PROGRAM? HOW O CHOOSE HE SAE RELEVANCE WEIGH OF HE APPROXIMAE LINEAR PROGRAM? YANN LE ALLEC AND HEOPHANE WEBER Abstract. he linear programming approach to approximate dynamic programming was introduced in [1].

More information

Research Article Power Prior Elicitation in Bayesian Quantile Regression

Research Article Power Prior Elicitation in Bayesian Quantile Regression Hindawi Publishing Corporation Journal of Probability and Statistics Volume 11, Article ID 87497, 16 pages doi:1.1155/11/87497 Research Article Power Prior Elicitation in Bayesian Quantile Regression Rahim

More information