Geometric ergodicity of the Bayesian lasso


Kshitij Khare and James P. Hobert
Department of Statistics, University of Florida

June 2013

Abstract

Consider the standard linear model $y = X\beta + \sigma\epsilon$, where the components of $\epsilon$ are iid standard normal errors. Park and Casella [14] consider a Bayesian treatment of this model with a Laplace/Inverse-Gamma prior on $(\beta, \sigma^2)$. They introduce a Data Augmentation approach that can be used to explore the resulting intractable posterior density, and call it the Bayesian lasso algorithm. In this paper, the Markov chain underlying the Bayesian lasso algorithm is shown to be geometrically ergodic, for arbitrary values of the sample size $n$ and the number of variables $p$. This is important, as geometric ergodicity provides theoretical justification for the use of the Markov chain CLT, which can then be used to obtain asymptotic standard errors for Markov chain based estimates of posterior quantities. Kyung et al. [12] provide a proof of geometric ergodicity for the restricted case $n \ge p$, but as we explain in this paper, their proof is incorrect. Our approach is different and more direct, and enables us to establish geometric ergodicity for arbitrary $n$ and $p$.

Keywords and Phrases: Convergence rate, Geometric drift condition, Markov chain, Monte Carlo.

AMS Subject Classification: Primary 60J27; Secondary 62F15.

1 Introduction

Consider the standard linear model $y = X\beta + \sigma\epsilon$, where $y = (y_i)_{i=1}^n \in \mathbb{R}^n$ is the vector of observations, $X$ is the (known) $n \times p$ design matrix, $\beta \in \mathbb{R}^p$ is the (unknown) vector of regression coefficients, the components of $\epsilon$ are iid standard normal errors, and $\sigma^2$ is the (unknown) variance parameter. The objective is to estimate $(\beta, \sigma^2)$. In a variety of modern datasets from genetics, finance, and environmental sciences, the dimension $p$ of $\beta$ is much larger than the sample size $n$. An extremely popular approach to handle these kinds of datasets is the lasso, which was introduced in Tibshirani [19]. In this approach, the estimate of $\beta$ is obtained as

$$\hat\beta_{\mathrm{lasso}} = \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \; (y - X\beta)^T (y - X\beta) + \lambda \sum_{j=1}^p |\beta_j|, \qquad (1)$$

where $\lambda$ is a user-specified tuning parameter. Note that the estimate in (1) can be regarded as the posterior mode of $\beta$ (conditional on $\sigma^2$) if one puts independent Laplace priors on the entries of $\beta$. Based on this observation, several authors proposed a Bayesian analysis using a Laplace-like prior for $\beta$ (see for example [3, 7, 20]). Park and Casella [14] construct the following hierarchical Bayesian model as a Bayesian parallel/alternative to the lasso:

$$y \mid \beta, \sigma^2, \tau \sim N_n(X\beta, \sigma^2 I_n) \qquad (2)$$
$$\beta \mid \sigma^2, \tau \sim N_p(0_p, \sigma^2 D_\tau), \text{ where } D_\tau = \mathrm{diag}(\tau_1^2, \tau_2^2, \dots, \tau_p^2) \qquad (3)$$
$$\sigma^2 \sim \text{Inverse-Gamma}(\alpha, \xi) \text{ (allow for impropriety via } \alpha = 0 \text{ or } \xi = 0) \qquad (4)$$
$$\tau_j^2 \overset{\text{i.i.d.}}{\sim} \text{Exponential}(\lambda^2/2) \text{ for } j = 1, \dots, p \qquad (5)$$

To see the connection with the lasso, note that the prior density of $\beta$ given $\sigma^2$ is given by

$$f(\beta \mid \sigma^2) = \prod_{j=1}^p \frac{\lambda}{2\sqrt{\sigma^2}} \, e^{-\lambda |\beta_j| / \sqrt{\sigma^2}}.$$
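The scale-mixture representation in (3) and (5) is exactly what makes a Gibbs sampler possible: integrating $\tau_j^2$ out of the normal prior recovers the conditional Laplace prior displayed above. The short simulation below is our own sketch of this fact (the variable names and the particular values of $\lambda$ and $\sigma$ are illustrative, not taken from the paper); it compares moments of the exponential-mixture-of-normals draws with the corresponding Laplace moments.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, sigma = 2.0, 1.5          # illustrative values, not from the paper
n_draws = 200_000

# tau_j^2 ~ Exponential with rate lam^2/2 (mean 2/lam^2), then
# beta_j | tau_j^2, sigma^2 ~ N(0, sigma^2 * tau_j^2).
tau2 = rng.exponential(scale=2.0 / lam**2, size=n_draws)
beta = rng.normal(loc=0.0, scale=sigma * np.sqrt(tau2))

# Marginally, beta_j | sigma^2 should be Laplace with density
# (lam / (2 sigma)) exp(-lam |beta| / sigma), i.e. scale sigma/lam.
print("E|beta|:   mixture %.4f  vs  Laplace %.4f" % (np.abs(beta).mean(), sigma / lam))
print("Var(beta): mixture %.4f  vs  Laplace %.4f" % (beta.var(), 2 * sigma**2 / lam**2))
```

With this many draws the two columns should agree to roughly two decimal places.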

Indeed, it is shown in [14] that the full conditional distributions of $\beta$, $\tau$ and $\sigma^2$ are given by

$$\beta \mid \tau, \sigma^2, y \sim N_p\big( (X^T X + D_\tau^{-1})^{-1} X^T y, \; \sigma^2 (X^T X + D_\tau^{-1})^{-1} \big) \qquad (6)$$
$$\frac{1}{\tau_j^2} \,\Big|\, \beta, \sigma^2, y \sim \text{Inverse-Gaussian}\left( \sqrt{\frac{\lambda^2 \sigma^2}{\beta_j^2}}, \; \lambda^2 \right), \text{ independently for } j = 1, 2, \dots, p \qquad (7)$$
$$\sigma^2 \mid \beta, \tau, y \sim \text{Inverse-Gamma}\left( \frac{n+p}{2} + \alpha, \; \frac{(y - X\beta)^T (y - X\beta)}{2} + \frac{\beta^T D_\tau^{-1} \beta}{2} + \xi \right). \qquad (8)$$

Note that the Inverse-Gaussian$(\mu, \lambda)$ density is given by

$$f(x) = \sqrt{\frac{\lambda}{2\pi x^3}} \, \exp\left( -\frac{\lambda (x - \mu)^2}{2 \mu^2 x} \right), \qquad x > 0.$$

Using the full conditional distributions specified above, one can construct a systematic scan Gibbs sampling algorithm to sample from the joint posterior density of $(\beta, \tau, \sigma^2)$ (and hence to obtain the desired sample from the posterior density of $(\beta, \sigma^2)$). Let $\{(\beta_m, \tau_m, \sigma^2_m)\}_{m=0}^\infty$ be a Markov chain (with state space $\mathbb{R}^p \times \mathbb{R}^p_+ \times \mathbb{R}_+$) whose dynamics are defined (implicitly) through the following three-step procedure for moving from the current state, $(\beta_m, \tau_m, \sigma^2_m)$, to $(\beta_{m+1}, \tau_{m+1}, \sigma^2_{m+1})$.

Iteration $m+1$ of Park and Casella's Bayesian lasso Gibbs sampler:

1. Draw $\sigma^2_{m+1}$ from the conditional distribution in (8) given $(\beta_m, \tau_m, y)$.
2. Draw $\tau_{m+1}$ from the conditional distribution in (7) given $(\beta_m, \sigma^2_{m+1}, y)$.
3. Draw $\beta_{m+1}$ from the conditional distribution in (6) given $(\tau_{m+1}, \sigma^2_{m+1}, y)$.
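For concreteness, here is a minimal sketch of one sweep of the three-step update just described, written in NumPy under the notation above. The function name, the default improper-prior choice $\alpha = \xi = 0$, and the use of `rng.wald` for the Inverse-Gaussian draws are our own implementation choices, not code from the paper.

```python
import numpy as np

def bayesian_lasso_sweep(beta, tau2, X, y, lam, alpha=0.0, xi=0.0, rng=None):
    """One systematic-scan sweep: draw sigma^2, then tau, then beta (Eqs. (8), (7), (6))."""
    rng = rng or np.random.default_rng()
    n, p = X.shape
    resid = y - X @ beta

    # Step 1: sigma^2 | beta, tau, y ~ Inverse-Gamma((n+p)/2 + alpha, rate).
    shape = (n + p) / 2.0 + alpha
    rate = 0.5 * resid @ resid + 0.5 * np.sum(beta**2 / tau2) + xi
    sigma2 = 1.0 / rng.gamma(shape, 1.0 / rate)

    # Step 2: 1/tau_j^2 | beta, sigma^2, y ~ Inverse-Gaussian(sqrt(lam^2 sigma^2 / beta_j^2), lam^2).
    # (Assumes beta_j != 0 for all j, which holds almost surely under the model.)
    mu = np.sqrt(lam**2 * sigma2 / beta**2)
    tau2 = 1.0 / rng.wald(mu, lam**2)

    # Step 3: beta | tau, sigma^2, y ~ N_p(A^{-1} X^T y, sigma^2 A^{-1}), A = X^T X + D_tau^{-1}.
    A = X.T @ X + np.diag(1.0 / tau2)
    L = np.linalg.cholesky(np.linalg.inv(A))      # small-p sketch; not optimized
    beta = np.linalg.solve(A, X.T @ y) + np.sqrt(sigma2) * (L @ rng.standard_normal(p))

    return beta, tau2, sigma2
```

Iterating this function (and discarding a burn-in) produces the draws whose ergodic averages are studied below.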

In Section 2, the Markov transition density (Mtd) of the resulting Gibbs Markov chain, $\{(\beta_m, \tau_m, \sigma^2_m)\}_{m=0}^\infty$, is defined and then used to establish that the chain is well behaved (i.e., Harris ergodic) and converges to the target posterior distribution. Thus, we can use this chain to construct strongly consistent estimators of intractable posterior expectations. To be specific, for $q > 0$, let $L_q(\pi)$ denote the set of functions $g : \mathbb{R}^p \times \mathbb{R}^p_+ \times \mathbb{R}_+ \to \mathbb{R}$ such that

$$E_\pi |g|^q := \int_{\mathbb{R}_+} \int_{\mathbb{R}^p_+} \int_{\mathbb{R}^p} |g(\beta, \tau, \sigma^2)|^q \, \pi(\beta, \tau, \sigma^2 \mid y) \, d\beta \, d\tau \, d\sigma^2 < \infty,$$

where $\pi(\beta, \tau, \sigma^2 \mid y)$ represents the joint posterior density evaluated at $(\beta, \tau, \sigma^2)$. Harris ergodicity implies that, if $g \in L_1(\pi)$, then the estimator $\bar g_m := m^{-1} \sum_{i=0}^{m-1} g(\beta_i, \tau_i, \sigma^2_i)$ is strongly consistent for $E_\pi g$, no matter how the chain is started. Of course, in practice, an estimator is only useful if it is possible to compute an associated standard error. All available methods of computing a valid asymptotic standard error for $\bar g_m$ are based on the existence of a central limit theorem (CLT) for $\bar g_m$; that is, we require that

$$\sqrt{m}\,\big( \bar g_m - E_\pi g \big) \overset{d}{\to} N(0, \sigma^2_g),$$

for some positive, finite $\sigma^2_g$. Unfortunately, even if $g \in L_q(\pi)$ for all $q > 0$, Harris ergodicity is not enough to guarantee the existence of such a CLT (see for example [16, 17]). The standard method of establishing the existence of CLTs is to prove that the underlying Markov chain converges at a geometric rate. Let $\mathcal{B}(\mathsf{X})$ denote the Borel sets in $\mathsf{X} := \mathbb{R}^p \times \mathbb{R}^p_+ \times \mathbb{R}_+$, and let $K^m : \mathsf{X} \times \mathcal{B}(\mathsf{X}) \to [0,1]$ denote the $m$-step Markov transition function of the Gibbs Markov chain. That is, $K^m\big((\beta, \tau, \sigma^2), A\big)$ is the probability that $(\beta_m, \tau_m, \sigma^2_m) \in A$, given that the chain is started at $(\beta, \tau, \sigma^2)$. Also, let $\Pi(\cdot)$ denote the joint posterior distribution. The chain is called geometrically ergodic if there exist a function $M : \mathsf{X} \to [0, \infty)$ and a constant $\rho \in [0, 1)$ such that, for all $(\beta, \tau, \sigma^2) \in \mathsf{X}$ and all $m = 1, 2, \dots$, we have

$$\big\| K^m\big((\beta, \tau, \sigma^2), \cdot\big) - \Pi(\cdot) \big\|_{TV} \le M(\beta, \tau, \sigma^2)\, \rho^m,$$

where $\|\cdot\|_{TV}$ denotes the total variation norm. The relationship between geometric convergence and CLTs is simple: if the chain is geometrically ergodic and $E_\pi |g|^{2+\delta} < \infty$ for some $\delta > 0$, then $\bar g_m$ satisfies a CLT. Moreover, because the Mtd is strictly positive on $\mathsf{X}$ (see Section 2), the same $2+\delta$ moment condition implies that the usual estimators of the asymptotic variance, $\sigma^2_g$, are consistent [2, 8, 9, 11]. Our main result, which is proven in Section 3 using a geometric drift condition and an associated minorization condition, is the following.

Proposition 1. The Bayesian lasso Gibbs Markov chain is geometrically ergodic for $n \ge 3$ and arbitrary $p$, $X$ and $\lambda$.

Hence, the Markov chain CLT holds and can be used to obtain asymptotic standard errors of posterior estimates.
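Once the CLT is in force, $\sigma^2_g$ can be estimated from the simulation output itself, for instance by the batch-means methods studied in [8, 9, 11]. The following sketch is our own illustration (not code from the paper or from those references) of the standard non-overlapping batch-means estimate of the Monte Carlo standard error of $\bar g_m$.

```python
import numpy as np

def batch_means_se(g_values, n_batches=30):
    """Batch-means estimate of the MCMC standard error of the sample mean of g_values.

    g_values: 1-D array of g evaluated along the Markov chain (after burn-in).
    The estimate is asymptotically valid when a Markov chain CLT holds,
    e.g. under geometric ergodicity plus a 2+delta moment condition.
    """
    g = np.asarray(g_values, dtype=float)
    m = (len(g) // n_batches) * n_batches           # drop remainder so batches are equal
    batch_means = g[:m].reshape(n_batches, -1).mean(axis=1)
    batch_size = m // n_batches
    # sigma^2_g estimate: batch_size times the sample variance of the batch means.
    sigma2_hat = batch_size * np.sum((batch_means - g[:m].mean())**2) / (n_batches - 1)
    return np.sqrt(sigma2_hat / m)
```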

In the restricted case when $n \ge p$, Kyung et al. [12] contains, among other results, a proof of geometric ergodicity of the Bayesian lasso Gibbs Markov chain. However, there are serious errors in this proof, which essentially arise from their assertion that the Bayesian lasso model in (2)-(5) can be obtained as a marginal model of a hierarchical random effects model. A detailed explanation of the problems with the proof is provided in the Appendix.

2 The Bayesian lasso Markov chain

We first provide expressions for the full posterior conditional densities of $\beta$, $\tau$ and $\sigma^2$, respectively. Let $f(\beta \mid \tau, \sigma^2, y)$ denote the full posterior conditional density of $\beta$ (on $\mathbb{R}^p$) given $\tau, \sigma^2, y$. By (6),

$$f(\beta \mid \tau, \sigma^2, y) = \frac{|X^T X + D_\tau^{-1}|^{1/2}}{(2\pi\sigma^2)^{p/2}} \exp\left( -\frac{1}{2\sigma^2} \big(\beta - (X^T X + D_\tau^{-1})^{-1} X^T y\big)^T (X^T X + D_\tau^{-1}) \big(\beta - (X^T X + D_\tau^{-1})^{-1} X^T y\big) \right).$$

Let $f(\tau \mid \beta, \sigma^2, y)$ denote the full posterior conditional density of $\tau$ (on $\mathbb{R}^p_+$) given $\beta, \sigma^2, y$. By (7),

$$f(\tau \mid \beta, \sigma^2, y) = \prod_{j=1}^p \sqrt{\frac{\lambda^2}{2\pi \tau_j^2}} \, \exp\left( \frac{\lambda |\beta_j|}{\sqrt{\sigma^2}} - \frac{\beta_j^2}{2\sigma^2 \tau_j^2} - \frac{\lambda^2 \tau_j^2}{2} \right).$$

Note that the reciprocals of the entries of $\tau$ (and not the entries themselves) have an Inverse-Gaussian distribution. Let $f(\sigma^2 \mid \beta, \tau, y)$ denote the full posterior conditional density of $\sigma^2$ (on $\mathbb{R}_+$) given $\beta, \tau, y$. By (8),

$$f(\sigma^2 \mid \beta, \tau, y) = \frac{\left( \frac{(y - X\beta)^T (y - X\beta)}{2} + \frac{\beta^T D_\tau^{-1} \beta}{2} + \xi \right)^{\frac{n+p}{2}+\alpha}}{\Gamma\!\left(\frac{n+p}{2}+\alpha\right)} \, (\sigma^2)^{-\left(\frac{n+p}{2}+\alpha+1\right)} \exp\left( -\frac{1}{\sigma^2} \left( \frac{(y - X\beta)^T (y - X\beta)}{2} + \frac{\beta^T D_\tau^{-1} \beta}{2} + \xi \right) \right).$$

Let $\mu$ denote Lebesgue measure on $\mathbb{R}^p \times \mathbb{R}^p_+ \times \mathbb{R}_+$. The Bayesian lasso Gibbs Markov chain has a Mtd (with respect to $\mu$) given by

$$k\big( (\tilde\beta, \tilde\tau, \tilde\sigma^2) \mid (\beta, \tau, \sigma^2) \big) = f(\tilde\beta \mid \tilde\tau, \tilde\sigma^2, y)\, f(\tilde\tau \mid \beta, \tilde\sigma^2, y)\, f(\tilde\sigma^2 \mid \beta, \tau, y). \qquad (9)$$

It is well known, and can be verified by a straightforward calculation, that the joint posterior density of $(\beta, \tau, \sigma^2)$ is invariant for the Gibbs transition density $k$ defined above. The Mtd is strictly positive, which implies that the chain is aperiodic and $\psi$-irreducible [13, Page 87]. Moreover, the existence of an invariant probability density together with $\psi$-irreducibility implies that the chain is positive Harris recurrent (see for example [17]). Note also that $\Pi$ is equivalent to the maximal irreducibility measure. We now prove geometric ergodicity by establishing a geometric drift condition and an associated minorization condition for the Gibbs transition density $k$.

3 Proof of geometric ergodicity

3.1 Drift condition

Consider the function

$$V(\beta, \tau, \sigma^2) = (y - X\beta)^T (y - X\beta) + \beta^T D_\tau^{-1} \beta + \sum_{j=1}^p \tau_j^2.$$

Let $E_k[\,\cdot \mid \beta, \tau, \sigma^2]$ represent the expectation with respect to one step of the Markov chain with transition density $k$, starting at $(\beta, \tau, \sigma^2)$. The following proposition establishes a geometric drift condition for the transition density $k$.

Proposition 2. If $n \ge 3$, there exist constants $0 \le \rho < 1$ and $b > 0$ such that

$$E_k\big[ V(\tilde\beta, \tilde\tau, \tilde\sigma^2) \mid \beta, \tau, \sigma^2 \big] \le \rho\, V(\beta, \tau, \sigma^2) + b, \qquad (10)$$

for every $(\beta, \tau, \sigma^2) \in \mathbb{R}^p \times \mathbb{R}^p_+ \times \mathbb{R}_+$.
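Before turning to the proof, it can be instructive to check (10) numerically at a particular starting state. The sketch below is our own sanity check, not part of the paper's argument: it reuses the hypothetical `bayesian_lasso_sweep` helper sketched in Section 1 and made-up data, and simply compares a Monte Carlo estimate of $E_k[V \mid \text{start}]$ with $V$ at the start (it does not, of course, identify $\rho$ or $b$).

```python
import numpy as np

def V(beta, tau2, X, y):
    """Drift function from Section 3.1; note it does not depend on sigma^2."""
    r = y - X @ beta
    return r @ r + np.sum(beta**2 / tau2) + np.sum(tau2)

# Illustrative check of E_k[V | state] at one (made-up) starting state.
rng = np.random.default_rng(1)
n, p, lam = 10, 4, 1.0
X, y = rng.standard_normal((n, p)), rng.standard_normal(n)
beta0, tau20 = rng.standard_normal(p), rng.exponential(1.0, p)

draws = [V(*bayesian_lasso_sweep(beta0, tau20, X, y, lam, rng=rng)[:2], X, y)
         for _ in range(20_000)]
print("estimated E_k[V | start]:", np.mean(draws), "   V at start:", V(beta0, tau20, X, y))
```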

Proof: It follows from the definition of the Markov transition density $k$ that

$$E_k\big[ V(\tilde\beta, \tilde\tau, \tilde\sigma^2) \mid \beta, \tau, \sigma^2 \big] = E\Big[ E\big[ E\big[ V(\tilde\beta, \tilde\tau, \tilde\sigma^2) \mid \tilde\tau, \tilde\sigma^2 \big] \,\big|\, \beta, \tilde\sigma^2 \big] \,\Big|\, \beta, \tau \Big]. \qquad (11)$$

We evaluate the three conditional expectations in (11) one step at a time. We start with the innermost conditional expectation. By (6),

$$E\big[ V(\tilde\beta, \tilde\tau, \tilde\sigma^2) \mid \tilde\tau, \tilde\sigma^2 \big] = y^T y - 2 y^T X E[\tilde\beta \mid \tilde\tau, \tilde\sigma^2] + E\big[ \tilde\beta^T (X^T X + D_{\tilde\tau}^{-1}) \tilde\beta \mid \tilde\tau, \tilde\sigma^2 \big] + \sum_{j=1}^p \tilde\tau_j^2$$
$$= y^T y - 2 y^T X (X^T X + D_{\tilde\tau}^{-1})^{-1} X^T y + y^T X (X^T X + D_{\tilde\tau}^{-1})^{-1} X^T y + \mathrm{tr}\Big( (X^T X + D_{\tilde\tau}^{-1}) \, \tilde\sigma^2 (X^T X + D_{\tilde\tau}^{-1})^{-1} \Big) + \sum_{j=1}^p \tilde\tau_j^2$$
$$= y^T y - y^T X (X^T X + D_{\tilde\tau}^{-1})^{-1} X^T y + p \tilde\sigma^2 + \sum_{j=1}^p \tilde\tau_j^2 \;\le\; y^T y + p \tilde\sigma^2 + \sum_{j=1}^p \tilde\tau_j^2. \qquad (12)$$

For the next step, we analyze the terms related to the middle conditional expectation in (11). Note that if $Z \sim \text{Inverse-Gaussian}(\mu, \lambda)$, then $E[1/Z] = 1/\mu + 1/\lambda$. Hence, it follows from (7) and the arithmetic-geometric mean inequality (applied with the constant $1/(n+p+2\alpha)$) that

$$E[\tilde\tau_j^2 \mid \beta, \tilde\sigma^2] = \sqrt{\frac{\beta_j^2}{\lambda^2 \tilde\sigma^2}} + \frac{1}{\lambda^2} \;\le\; \frac{\beta_j^2}{2(n + p + 2\alpha)\, \tilde\sigma^2} + \frac{n + p + 2\alpha}{2\lambda^2} + \frac{1}{\lambda^2}. \qquad (13)$$

Finally, we analyze the terms related to the outermost conditional expectation in (11). By (8),

$$E[\tilde\sigma^2 \mid \beta, \tau, y] = \frac{(y - X\beta)^T (y - X\beta) + \beta^T D_\tau^{-1} \beta + 2\xi}{n + p + 2\alpha - 2}, \qquad (14)$$

and

$$E\!\left[ \frac{1}{\tilde\sigma^2} \,\Big|\, \beta, \tau, y \right] = \frac{n + p + 2\alpha}{(y - X\beta)^T (y - X\beta) + \beta^T D_\tau^{-1} \beta + 2\xi}. \qquad (15)$$

Combining (12)-(15), we get

$$E_k\big[ V(\tilde\beta, \tilde\tau, \tilde\sigma^2) \mid \beta, \tau, \sigma^2 \big] \le y^T y + \frac{p\big[ (y - X\beta)^T (y - X\beta) + \beta^T D_\tau^{-1} \beta + 2\xi \big]}{n + p + 2\alpha - 2} + \frac{\sum_{j=1}^p \beta_j^2}{2\big[ (y - X\beta)^T (y - X\beta) + \beta^T D_\tau^{-1} \beta + 2\xi \big]} + \frac{p(n + p + 2\alpha)}{2\lambda^2} + \frac{p}{\lambda^2}$$
$$\le \frac{p}{n + p + 2\alpha - 2} \Big[ (y - X\beta)^T (y - X\beta) + \beta^T D_\tau^{-1} \beta \Big] + \frac{1}{2} \sum_{j=1}^p \tau_j^2 + y^T y + \frac{2p\xi}{n + p + 2\alpha - 2} + \frac{p(n + p + 2\alpha)}{2\lambda^2} + \frac{p}{\lambda^2}.$$

The last inequality follows from the fact that

$$\sum_{j=1}^p \beta_j^2 = \sum_{j=1}^p \frac{\beta_j^2}{\tau_j^2} \, \tau_j^2 \le \Big( \sum_{j=1}^p \frac{\beta_j^2}{\tau_j^2} \Big) \Big( \sum_{j=1}^p \tau_j^2 \Big) = \big( \beta^T D_\tau^{-1} \beta \big) \sum_{j=1}^p \tau_j^2 \le \big[ (y - X\beta)^T (y - X\beta) + \beta^T D_\tau^{-1} \beta + 2\xi \big] \sum_{j=1}^p \tau_j^2. \qquad (16)$$

It follows from (16) that

$$E_k\big[ V(\tilde\beta, \tilde\tau, \tilde\sigma^2) \mid \beta, \tau, \sigma^2 \big] \le \rho\, V(\beta, \tau, \sigma^2) + b, \qquad (17)$$

where

$$\rho = \max\left\{ \frac{p}{n + p + 2\alpha - 2}, \; \frac{1}{2} \right\} \qquad (18)$$

and

$$b = y^T y + \frac{2p\xi}{n + p + 2\alpha - 2} + \frac{p(n + p + 2\alpha)}{2\lambda^2} + \frac{p}{\lambda^2}. \qquad (19)$$

Since $n \ge 3$ and $\alpha \ge 0$, we have $n + 2\alpha - 2 > 0$, so that $p < n + p + 2\alpha - 2$ and hence $\rho < 1$. Thus, the required geometric drift condition is established.

3.2 Minorization condition

Let us recall that $V(\beta, \tau, \sigma^2) = (y - X\beta)^T (y - X\beta) + \beta^T D_\tau^{-1} \beta + \sum_{j=1}^p \tau_j^2$. For every $d > 0$, let $B_{V,d} = \{ (\beta, \tau, \sigma^2) : V(\beta, \tau, \sigma^2) \le d \}$. The following proposition establishes a minorization condition associated with the geometric drift condition established in Proposition 2.

Proposition 3. There exist a constant $0 < \epsilon = \epsilon(V, d) \le 1$ and a probability density function $\tilde f$ on $\mathbb{R}^p \times \mathbb{R}^p_+ \times \mathbb{R}_+$ such that

$$k\big( (\tilde\beta, \tilde\tau, \tilde\sigma^2) \mid (\beta, \tau, \sigma^2) \big) \ge \epsilon \, \tilde f(\tilde\beta, \tilde\tau, \tilde\sigma^2), \qquad (20)$$

for every $(\beta, \tau, \sigma^2) \in B_{V,d}$.

Proof: Writing out the density of $\tilde\tau_j^2$ implied by (7) (recall that it is the reciprocal of $\tilde\tau_j^2$ that is Inverse-Gaussian), we have

$$f(\tilde\tau_j^2 \mid \beta, \tilde\sigma^2, y) = \sqrt{\frac{\lambda^2}{2\pi \tilde\tau_j^2}} \, \exp\left( \frac{\lambda |\beta_j|}{\sqrt{\tilde\sigma^2}} \right) \exp\left( -\frac{\beta_j^2}{2 \tilde\sigma^2 \tilde\tau_j^2} - \frac{\lambda^2 \tilde\tau_j^2}{2} \right),

for each $j = 1, \dots, p$. If $(\beta, \tau, \sigma^2) \in B_{V,d}$, then $\beta^T D_\tau^{-1} \beta \le d$ and $\sum_{j=1}^p \tau_j^2 \le d$, so that for each $j$,

$$\beta_j^2 = \frac{\beta_j^2}{\tau_j^2} \, \tau_j^2 \le d \cdot d = d^2.$$

Hence, if $(\beta, \tau, \sigma^2) \in B_{V,d}$, then $|\beta_j| \le d$ and, since $\exp(\lambda |\beta_j| / \sqrt{\tilde\sigma^2}) \ge 1$,

$$f(\tilde\tau_j^2 \mid \beta, \tilde\sigma^2, y) \ge \sqrt{\frac{\lambda^2}{2\pi \tilde\tau_j^2}} \, \exp\left( -\frac{d^2}{2 \tilde\sigma^2 \tilde\tau_j^2} - \frac{\lambda^2 \tilde\tau_j^2}{2} \right) = e^{-\lambda d / \sqrt{\tilde\sigma^2}} \, g_1(\tilde\tau_j^2 \mid \tilde\sigma^2), \qquad (21)$$

where $g_1(\cdot \mid \tilde\sigma^2)$ is the density of the reciprocal of an Inverse-Gaussian random variable with parameters $\sqrt{\lambda^2 \tilde\sigma^2 / d^2}$ and $\lambda^2$. Taking the product over $j$ gives $f(\tilde\tau \mid \beta, \tilde\sigma^2, y) \ge e^{-p\lambda d / \sqrt{\tilde\sigma^2}} \prod_{j=1}^p g_1(\tilde\tau_j^2 \mid \tilde\sigma^2)$.

Let us now consider the full conditional density of $\tilde\sigma^2$ given $\beta, \tau, y$. Note that for $(\beta, \tau, \sigma^2) \in B_{V,d}$,

$$\frac{(y - X\beta)^T (y - X\beta)}{2} + \frac{\beta^T D_\tau^{-1} \beta}{2} + \xi \le \frac{d}{2} + \xi, \qquad (22)$$

and

$$\frac{(y - X\beta)^T (y - X\beta)}{2} + \frac{\beta^T D_\tau^{-1} \beta}{2} + \xi = \frac{y^T y - 2 y^T X \beta + \beta^T (X^T X + D_\tau^{-1}) \beta}{2} + \xi$$
$$= \frac{y^T y - y^T X (X^T X + D_\tau^{-1})^{-1} X^T y}{2} + \frac{\big( \beta - (X^T X + D_\tau^{-1})^{-1} X^T y \big)^T (X^T X + D_\tau^{-1}) \big( \beta - (X^T X + D_\tau^{-1})^{-1} X^T y \big)}{2} + \xi$$
$$\ge \frac{y^T y - y^T X (X^T X + D_\tau^{-1})^{-1} X^T y}{2} + \xi \;\ge\; \frac{y^T y - y^T X \big( X^T X + \tfrac{1}{d} I_p \big)^{-1} X^T y}{2} + \xi. \qquad (23)$$

The last inequality follows since $\tau_j^2 \le d$ for every $1 \le j \le p$. Note that $I_n - X ( X^T X + \tfrac{1}{d} I_p )^{-1} X^T$ is a positive definite matrix. Hence, for $(\beta, \tau, \sigma^2) \in B_{V,d}$,

$$y^T y - y^T X \big( X^T X + \tfrac{1}{d} I_p \big)^{-1} X^T y > 0.$$

It follows from (22) and (23), together with the inequality $1/\sqrt{\tilde\sigma^2} \le \tfrac{1}{2}(1/\tilde\sigma^2 + 1)$ used to absorb the factor $e^{-p\lambda d / \sqrt{\tilde\sigma^2}}$ from (21), that for $(\beta, \tau, \sigma^2) \in B_{V,d}$,

$$e^{-p\lambda d / \sqrt{\tilde\sigma^2}} \, f(\tilde\sigma^2 \mid \beta, \tau, y) \ge e^{-\frac{p\lambda d}{2}} \left( \frac{y^T y - y^T X \big( X^T X + \tfrac{1}{d} I_p \big)^{-1} X^T y + 2\xi}{d + 2\xi + p\lambda d} \right)^{\frac{n+p}{2}+\alpha} g_2(\tilde\sigma^2), \qquad (24)$$

where $g_2(\cdot)$ is the density of the Inverse-Gamma distribution with parameters $\frac{n+p}{2}+\alpha$ and $\frac{d}{2} + \xi + \frac{p\lambda d}{2}$. It follows from (9), (21) and (24) that, for all $(\beta, \tau, \sigma^2) \in B_{V,d}$,

$$k\big( (\tilde\beta, \tilde\tau, \tilde\sigma^2) \mid (\beta, \tau, \sigma^2) \big) = f(\tilde\beta \mid \tilde\tau, \tilde\sigma^2, y)\, f(\tilde\tau \mid \beta, \tilde\sigma^2, y)\, f(\tilde\sigma^2 \mid \beta, \tau, y) \ge \epsilon \, \tilde f(\tilde\beta, \tilde\tau, \tilde\sigma^2),$$

where

$$\epsilon = e^{-\frac{p\lambda d}{2}} \left( \frac{y^T y - y^T X \big( X^T X + \tfrac{1}{d} I_p \big)^{-1} X^T y + 2\xi}{d + 2\xi + p\lambda d} \right)^{\frac{n+p}{2}+\alpha}, \qquad (25)$$

and $\tilde f$ is a probability density on $\mathbb{R}^p \times \mathbb{R}^p_+ \times \mathbb{R}_+$ given by

$$\tilde f(\tilde\beta, \tilde\tau, \tilde\sigma^2) = f(\tilde\beta \mid \tilde\tau, \tilde\sigma^2, y) \left( \prod_{j=1}^p g_1(\tilde\tau_j^2 \mid \tilde\sigma^2) \right) g_2(\tilde\sigma^2).$$

Hence, the required minorization condition for $k$ has been established.

The drift and minorization conditions in Proposition 2 and Proposition 3 can be combined with Theorem 12 of Rosenthal [18] to establish geometric ergodicity of the Bayesian lasso Gibbs Markov chain.

Proposition 4. Let $n \ge 3$, and let $\rho$, $b$ and $\epsilon$ be as defined in (18), (19) and (25) respectively. Let $d > 2b/(1-\rho)$, and set

$$A = \frac{1 + d}{1 + 2b + \rho d} \quad \text{and} \quad U = 1 + 2(\rho d + b).$$

Then for any $(\beta, \tau, \sigma^2) \in \mathbb{R}^p \times \mathbb{R}^p_+ \times \mathbb{R}_+$, $m \in \mathbb{N}$, and $0 < r < 1$,

$$\big\| K^m\big((\beta, \tau, \sigma^2), \cdot\big) - \Pi(\cdot) \big\|_{TV} \le (1 - \epsilon)^{rm} + \big( U^r A^{-(1-r)} \big)^m \left( 1 + \frac{b}{1 - \rho} + V(\beta, \tau, \sigma^2) \right).$$

Proposition 1 is an immediate corollary of the above result.
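To see how a bound of the form in Proposition 4 would be used in practice, the sketch below evaluates the total variation bound as a function of $m$ for user-supplied values of $\rho$, $b$, $\epsilon$, $d$ and $V$ at the starting state. The numerical values in the example call are purely illustrative (not taken from the paper); in an application they would come from (18), (19) and (25). Such bounds are typically conservative, and small values of $r$ often work best.

```python
import numpy as np

def rosenthal_tv_bound(m, rho, b, eps, d, V0, r=0.01):
    """Evaluate the Proposition 4 style bound on ||K^m(x, .) - Pi||_TV."""
    assert 0 < rho < 1 and d > 2 * b / (1 - rho), "need rho < 1 and d > 2b/(1-rho)"
    A = (1 + d) / (1 + 2 * b + rho * d)
    U = 1 + 2 * (rho * d + b)
    geometric_factor = U**r * A**(-(1 - r))
    return (1 - eps)**(r * m) + geometric_factor**m * (1 + b / (1 - rho) + V0)

# Illustrative (made-up) constants.
for m in (1_000, 10_000, 50_000):
    print(m, rosenthal_tv_bound(m, rho=0.6, b=5.0, eps=0.05, d=30.0, V0=50.0))
```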

4 Discussion

The Bayesian lasso is a popular and widely used algorithm for sparse Bayesian estimation in linear regression. In this paper, we have established geometric ergodicity of the Markov chain corresponding to the Bayesian lasso algorithm. As discussed previously, this proves the existence of the Markov chain CLT in this context, thereby allowing the user to obtain valid asymptotic standard errors for Markov chain based estimates of posterior quantities. The Laplace prior for $\beta$ used in the Bayesian lasso algorithm is an example of the so-called continuous shrinkage priors for sparse Bayesian estimation. In recent years, many other continuous shrinkage priors have been introduced and studied in the literature (see [4, 5, 6, 10, 15] and the references therein), and sampling from the resulting posterior distribution is often achieved by using MCMC. To the best of our knowledge, geometric ergodicity of the corresponding Markov chains has not been investigated. Using the insights obtained from the analysis in this paper, we are currently investigating geometric ergodicity for Markov chains corresponding to some of the other continuous shrinkage priors proposed in the literature.

Appendix

We point out the errors in the proof of geometric ergodicity of the Bayesian lasso Gibbs Markov chain in Kyung et al. [12] (referred to henceforth as KGGC). The authors in KGGC only consider the case $n \ge p$. The authors assert that the Bayesian lasso model can be obtained by appropriately marginalizing a hierarchical random effects model. They then prove the geometric ergodicity of the Markov chain corresponding to this hierarchical random effects model, and argue that this implies geometric ergodicity of the Bayesian lasso model. We start by reproducing the hierarchical random effects model from KGGC (Page 46, appendix) below:

$$y \mid \eta, \sigma_e^2 \sim N_p(\eta, \sigma_e^2 I) \qquad (26)$$
$$\eta \mid \mu, \tau \sim N(\mu 1, \Sigma(\tau)) \qquad (27)$$
$$\mu \sim N(\mu_0, \sigma_\mu^2) \qquad (28)$$
$$\sigma_e^{-2} \sim \text{Gamma}(a, b) \qquad (29)$$
$$\tau_j^{-2} \overset{\text{i.i.d.}}{\sim} \text{Gamma}(a, b), \qquad (30)$$

where $\Sigma(\tau)$ is of the form

$$\Sigma(\tau) = \mathrm{diag}(\tau_1^2, \tau_2^2, \dots, \tau_p^2), \qquad \pi(\tau_1^2, \dots, \tau_p^2) = \prod_{i=1}^p \frac{\lambda^2}{2} \, e^{-\lambda^2 \tau_i^2 / 2}. \qquad (31)$$

We refer to this hierarchical random effects model as the RE model.

There is a contradiction between (30) and (31) regarding the prior distribution of $\tau^2$. In (30), $\tau_j^2$ is assumed to be Inverse-Gamma, whereas in (31), $\tau_j^2$ is assumed to be Exponential. The authors proceed to prove geometric ergodicity of the RE model under the Inverse-Gamma assumption in (30). Since the Bayesian lasso model assumes the prior for $\tau_j^2$ to be Exponential, it clearly cannot be obtained by marginalizing the RE model with the Inverse-Gamma assumption in (30). Hence, the proof of geometric ergodicity for the RE model with the Inverse-Gamma assumption has no bearing on the geometric ergodicity of the Bayesian lasso model.

Further, $y$ is assumed to be $p$-dimensional in (26). In the Bayesian lasso model, the data vector $y$ is $n$-dimensional. Since the authors are considering the case $n > p$, this clearly rules out the possibility of marginalizing the RE model to get the Bayesian lasso model. If this is a typo, and $y$ is $n$-dimensional, then $\eta$ is $n$-dimensional. Hence, $\Sigma(\tau)$ is an $n \times n$ matrix. However, $\Sigma(\tau)$ is assumed to be a $p \times p$ matrix in (31).

There is another error in the marginalizing argument provided in KGGC. On Page 38 (Section 4) the authors consider the model

$$y \mid \eta, \sigma^2 \sim N_n(\eta, \sigma^2 I_n) \quad \text{and} \quad \eta \sim N_n(\mu 1_n, \Sigma), \qquad (32)$$

where

$$\Sigma = \sigma_\mu^2 J_n + X D_\tau X^T + \hat X \hat X^T,$$

and $\eta = \mu 1_n + X\beta + \hat X \beta^*$ (with $1_n$, $X$, $\hat X$ mutually orthogonal). Here $1_n$ is the $n$-dimensional vector of all ones, and $J_n$ is the $n \times n$ matrix of all ones. They claim that marginalizing over $\eta$ in (32) gives the marginal model

$$y \mid \beta, \sigma^2 \sim N_n(\mu 1_n + X\beta, \sigma^2 I_n) \quad \text{and} \quad \beta \sim N_p(0_p, D_\tau). \qquad (33)$$

Firstly, there is a small issue in terms of $\mu$. Note that, by the mutual orthogonality of $1_n$, $X$, $\hat X$,

$$\frac{1_n^T \eta}{n} = \frac{1_n^T (\mu 1_n + X\beta + \hat X \beta^*)}{n} = \mu.$$

Since $\eta \sim N_n(\mu 1_n, \Sigma)$, this implies

$$\mu \sim N\!\left( \mu, \; \frac{1_n^T \Sigma 1_n}{n^2} \right),$$

which is paradoxical, as it implies that $\mu$ is a random variable with non-zero variance ($\Sigma$ is a full rank matrix), and also implies that $\mu$ is a constant. Let us get rid of this discrepancy by assuming $\mu = 0$ and

$$\Sigma = X D_\tau X^T + \hat X \hat X^T,$$

and choosing the transformation $\eta \to X\beta + \hat X \beta^*$, where $\beta^*$ is now $(n-p)$-dimensional and $X^T \hat X = 0$. Let us assume without loss of generality that $\hat X^T \hat X = I_{n-p}$ (the choice of $\hat X$ is completely in our hands anyway). Still, the marginal model after integrating out $\beta^*$ is not the same as (33). An immediate way to check this is that $V(y) = \sigma^2 I_n + \Sigma$ under (32), while $V(y) = \sigma^2 I_n + X D_\tau X^T$ under (33). We now derive the correct marginal model. Note that the joint density of $(y, \eta)$ is given by

$$f(y, \eta) = \frac{1}{(2\pi\sigma^2)^{n/2}} \, e^{-\frac{(y - \eta)^T (y - \eta)}{2\sigma^2}} \cdot \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \, e^{-\frac{\eta^T \Sigma^{-1} \eta}{2}}.$$

By making the linear transformation $(y, \eta) \to (y, \beta, \beta^*)$, we get that

$$f(y, \beta, \beta^*) = \text{const} \cdot e^{-\frac{1}{2} (X\beta + \hat X \beta^*)^T \left( X (X^T X)^{-1} D_\tau^{-1} (X^T X)^{-1} X^T + \hat X \hat X^T \right) (X\beta + \hat X \beta^*)} \, e^{-\frac{(y - X\beta - \hat X \beta^*)^T (y - X\beta - \hat X \beta^*)}{2\sigma^2}}$$
$$= \text{const} \cdot e^{-\frac{1}{2} \left( \beta^T D_\tau^{-1} \beta + \beta^{*T} \beta^* \right)} \, e^{-\frac{(y - X\beta - \hat X \beta^*)^T (y - X\beta - \hat X \beta^*)}{2\sigma^2}}.$$

After completing the square, and integrating out $\beta^*$, the marginal density of $(y, \beta)$ is obtained as

$$f(y, \beta) = \text{const} \cdot e^{\frac{y^T \hat X \hat X^T y}{2\sigma^2(\sigma^2 + 1)}} \, e^{-\frac{1}{2} \left( \frac{(y - X\beta)^T (y - X\beta)}{\sigma^2} + \beta^T D_\tau^{-1} \beta \right)}.$$

By completing the square again, and using $\hat X^T X = 0$, it follows that the marginal model is given by

$$y \mid \beta, \sigma^2 \sim N_n\big( X\beta, \; \sigma^2 I_n + \hat X \hat X^T \big) \quad \text{and} \quad \beta \sim N_p(0_p, D_\tau).$$

This is different from the Bayesian lasso model, as the conditional variance of $y$ given $\beta, \sigma^2$ is $\sigma^2 I_n + \hat X \hat X^T$ (as opposed to $\sigma^2 I_n$ in the Bayesian lasso model).

Acknowledgment. The first author was supported by NSF Grant DMS--684, and the second author by an NSF grant.

References

[1] Asmussen, S. and Glynn, P. W. (2011). A new proof of convergence of MCMC via the ergodic theorem. Statistics & Probability Letters 81.

[2] Bednorz, W. and Latuszynski, K. (2007). A few remarks on "Fixed-width output analysis for Markov chain Monte Carlo" by Jones et al. Journal of the American Statistical Association 102.

[3] Bae, K. and Mallick, B. K. (2004). Gene selection using a two-level hierarchical Bayesian model. Bioinformatics 20.

[4] Bhattacharya, A., Pati, D., Pillai, N. S. and Dunson, D. B. (2012). Bayesian shrinkage. arXiv preprint.

[5] Carvalho, C. M., Polson, N. G. and Scott, J. G. (2009). Handling sparsity via the horseshoe. Journal of Machine Learning Research W&CP 5.

[6] Carvalho, C. M., Polson, N. G. and Scott, J. G. (2010). The horseshoe estimator for sparse signals. Biometrika 97, 465-480.

[7] Figueiredo, M. A. T. (2003). Adaptive sparseness for supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 25.

[8] Flegal, J. M., Haran, M. and Jones, G. L. (2008). Markov chain Monte Carlo: Can we trust the third significant figure? Statistical Science 23, 250-260.

[9] Flegal, J. M. and Jones, G. L. (2010). Batch means and spectral variance estimators in Markov chain Monte Carlo. Annals of Statistics 38, 1034-1070.

[10] Griffin, J. E. and Brown, P. J. (2010). Inference with normal-gamma prior distributions in regression problems. Bayesian Analysis 5, 171-188.

[11] Jones, G. L., Haran, M., Caffo, B. S. and Neath, R. (2006). Fixed-width output analysis for Markov chain Monte Carlo. Journal of the American Statistical Association 101.

[12] Kyung, M., Gill, J., Ghosh, M. and Casella, G. (2010). Penalized regression, standard errors and Bayesian lassos. Bayesian Analysis 5.

[13] Meyn, S. P. and Tweedie, R. L. (1993). Markov Chains and Stochastic Stability. Springer-Verlag, London.

[14] Park, T. and Casella, G. (2008). The Bayesian lasso. Journal of the American Statistical Association 103, 681-686.

[15] Polson, N. G. and Scott, J. G. (2010). Shrink globally, act locally: Sparse Bayesian regularization and prediction. In Bayesian Statistics 9 (J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith and M. West, eds.). Oxford University Press, New York.

[16] Roberts, G. O. and Rosenthal, J. S. (1998). Markov chain Monte Carlo: Some practical implications of theoretical results (with discussion). Canadian Journal of Statistics 26, 5-31.

[17] Roberts, G. O. and Rosenthal, J. S. (2004). General state space Markov chains and MCMC algorithms. Probability Surveys 1, 20-71.

[18] Rosenthal, J. S. (1995). Minorization conditions and convergence rates for Markov chain Monte Carlo. Journal of the American Statistical Association 90, 558-566.

[19] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267-288.

[20] Yuan, M. and Lin, Y. (2005). Efficient empirical Bayes variable selection and estimation in linear models. Journal of the American Statistical Association 100, 1215-1225.
