Prior Distributions for the Variable Selection Problem


Sujit K. Ghosh
Department of Statistics, North Carolina State University
ghosh@stat.ncsu.edu

Bayesian Statistics Working Group, NCSU
Disclaimer: This talk is not entirely based on my own research work.

Overview

The Variable Selection Problem (VSP)
A Bayesian Framework
Choice of Prior Distributions
Illustrative Examples
Conclusions

The Variable Selection Problem

Consider the following canonical linear model:

y = Xβ + ε,  (1)

where ε ~ N_n(0, σ²I) and β = (β_1, ..., β_p)^T (X is an n × p matrix).

Under the above model, suppose also that only an unknown subset of the coefficients β_j is nonzero. The problem of variable selection is to identify this unknown subset. Notice that the above canonical framework can be used to address many other problems of interest, including multivariate polynomial regression and nonparametric function estimation.

Suppose the true data generating process (DGP) is given by

y = X_0 β_0 + ε,  (2)

where β_0 = (β_1^0, ..., β_{p_0}^0)^T, X_0 is n × p_0 and, WLOG, X = (X_0, X_1) with p ≥ p_0 ≥ 1 (i.e., X_1 is n × (p − p_0)).

The LSE of β and σ² are given by

β̂ = (X^T X)^− X^T y,  (3)
σ̂² = y^T (I − P_X) y / (n − r),

where r = rank(X) ≤ min(n, p), P_X = X(X^T X)^− X^T is the projection matrix and (X^T X)^− is a g-inverse of X^T X. Then

Lemma: E[β̂] = ((X_0^T X_0)^− X_0^T X_0 β_0, 0)^T and E[X β̂] = X_0 β_0. Further, E[σ̂²] = σ² for any g-inverse of X^T X. In particular, if rank(X_0) = p_0, then E[β̂] = (β_0, 0)^T.
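
To see the lemma in action, here is a small NumPy sketch (the simulated design and true coefficients are made up for illustration, not taken from the talk) computing the LSE (3) with a Moore-Penrose g-inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p0, p = 50, 2, 4
X0 = rng.normal(size=(n, p0))          # columns of the true DGP
X1 = rng.normal(size=(n, p - p0))      # superfluous columns
X = np.hstack([X0, X1])
beta0 = np.array([1.5, -2.0])
sigma = 1.0
y = X0 @ beta0 + sigma * rng.normal(size=n)

XtX_ginv = np.linalg.pinv(X.T @ X)     # a g-inverse of X'X (Moore-Penrose)
beta_hat = XtX_ginv @ X.T @ y          # LSE (3)
P_X = X @ XtX_ginv @ X.T               # projection onto the column space of X
r = np.linalg.matrix_rank(X)
sigma2_hat = y @ (np.eye(n) - P_X) @ y / (n - r)

print(beta_hat)       # first p0 entries estimate beta0, the rest are near 0
print(sigma2_hat)     # unbiased for sigma^2 = 1
```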

VSP, contd.

The variable selection problem is a special case of the model selection problem. Each model under consideration corresponds to a distinct subset of x_1, ..., x_p (Geweke, 1996).

The model (1) can be generalized to include discrete responses in terms of the first two moments:

E[y | X] = g(Xβ),  V[y | X] = Σ(X, β, φ),  (4)

where g(·) is a suitable link function and Σ(·) is an n × n covariance matrix which may depend on additional parameters φ.

Typically, a single model class is simply applied to all possible subsets, so that all reduced models are nested under the full model.

VSP, contd.

A common strategy for the VSP has been to select a model that minimizes a penalized sum of squares (PSS) criterion by a constrained optimization method (but why?).

More specifically, if δ = (δ_1, ..., δ_p)^T denotes the indicator of inclusion (δ_j = 1) or exclusion (δ_j = 0) of the variable x_j for j = 1, ..., p, then a PSS criterion picks a δ ∈ {0, 1}^p (and also β ∈ R^p) that minimizes

PSS(β, δ) = ||y − XD(δ)β||² / (nσ²) + J(β, δ),  (5)

where D(δ) is the diagonal matrix with diagonal δ and J(·) denotes a suitable penalty function. The choice of the penalty J(·) is crucial and can be shown to be equivalent to the choice of a prior distribution.

VSP, contd.

A number of popular criteria correspond to (5) with different choices of J(·):

J(β, δ) = λ(p, n) Σ_{j=1}^p δ_j (notice Σ_{j=1}^p δ_j = # of nonzero β_j's):
λ(p, n) = 2 yields C_p (Mallows, 1973) and AIC (Akaike, 1973);
λ(p, n) = log n yields BIC (Schwarz, 1978);
λ(p, n) = 2 log p yields RIC (Foster and George, 1994).
The C_MML criterion (George and Foster, 2000) estimates λ(p, n) by marginal maximum likelihood (MML) using an empirical Bayes framework.
J(β, δ) = 2 Σ_{j=1}^p δ_j log(p/j): Benjamini and Hochberg (1995).

Notice that none of the above penalties involves β; they are generally functions of δ (and n) only. Recent attempts have been to define penalties in terms of β:

J(β, δ) = λ Σ_{j=1}^p |β_j|^q, q ≥ 1, yields bridge regression (Frank and Friedman, 1993); only q = 1 yields a sparse solution among all q ≥ 1 (Fan and Li, 2001);
q = 2 yields ridge regression;
q = 1 yields the LASSO (Tibshirani, 1996).
J(β, δ) = λ_1 Σ_{j=1}^p |β_j| + λ_2 Σ_{j=1}^p β_j² yields the Elastic Net (Zou and Hastie, 2005).
J(β, δ) = λ_1 (Σ_{j=1}^p |β_j| + λ_2 Σ_{j<k} max{|β_j|, |β_k|}) yields OSCAR (Bondell and Reich, 2006).

Thus, a general strategy would be to define a penalty function that involves both δ and β. We will consider this as a prior: π(β, δ) ∝ exp{−J(β, δ)}.
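
To make the criteria concrete, the following sketch (Python/NumPy, with simulated data not from the talk) enumerates all 2^p subsets and scores each by RSS/σ² + λ(p, n)·Σ_j δ_j, which has the same minimizer as (5) with this J up to the 1/n scaling; σ² is treated as known for simplicity:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
n, p = 40, 4
X = rng.normal(size=(n, p))
y = 1.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)   # only the first two variables are active
sigma2 = 1.0                                              # treat sigma^2 as known for the sketch

# lambda(p, n) for the three criteria mentioned on the slide
penalties = {"AIC/Cp": 2.0, "BIC": np.log(n), "RIC": 2.0 * np.log(p)}
best = {name: (np.inf, None) for name in penalties}

for delta in product([0, 1], repeat=p):                   # all 2^p inclusion patterns
    idx = [j for j in range(p) if delta[j]]
    if idx:
        bhat, *_ = np.linalg.lstsq(X[:, idx], y, rcond=None)
        resid = y - X[:, idx] @ bhat
    else:
        resid = y
    rss = float(resid @ resid)
    for name, lam in penalties.items():
        score = rss / sigma2 + lam * len(idx)             # RSS/sigma^2 + lambda * (# nonzero beta_j)
        if score < best[name][0]:
            best[name] = (score, delta)

for name, (score, delta) in best.items():
    print(f"{name}: selected delta = {delta}")
```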

A Bayesian Framework

The full hierarchical Bayes model:

y | β, δ, σ² ~ N_n(XD(δ)β, σ²I_n),
(β, δ) | σ² ~ π(β, δ | σ²) ∝ exp{−J(β, δ)/σ²},  (6)
σ² ~ π_0(σ²) (e.g., IG(a_0, b_0)).

Given a loss function L(θ, a), we can obtain (in theory) the Bayes estimator by minimizing the posterior expected loss

E[L(θ, a) | y, X] = ∫ L(θ, a) π(θ | y, X) dθ  (7)

with respect to a = a(y, X), where θ = (β^T, δ^T, σ²)^T.

Which prior distributions? Which loss functions? Can we even carry out the optimization for a given prior distribution and loss function?

Bayesian Framework, contd.

A purely subjective point of view on prior selection is problematic for the VSP. It is rather unrealistic to assume that uncertainty can be meaningfully described given the huge number (2^p − 1) and complexity of unknown model parameters.

A common and practical approach has been to construct a noninformative, semi-automatic formulation in this context. Roughly speaking, the goal is to specify priors that allow the posterior model probabilities to accumulate near the true model (via some form of sparseness and smoothing).

Unfortunately, there is no universally preferred method to construct such semi-automatic priors! (isn't that nice?)

Bayesian Framework, contd.

The choice of loss function, although mostly overlooked, is also crucial (different loss functions lead to different estimates).

In general, suppose the true DGP is (y, x) ~ m_0(y | x) g_0(x), and consider a model (y, x) ~ m(y | x) g(x), where m(y | x) = ∫ f(y | x, θ) π(θ) dθ with sampling density f and prior π.

The Kullback-Leibler discrepancy between the DGP and the model can be written as

K(m_0 g_0, m g) = ∫ K(m_0, m | x) g_0(x) dx + K(g_0, g),

where

K(m_0, m | x) = ∫ m_0(y | x) log [m_0(y | x) / m(y | x)] dy,  (8)

and the first term can be approximated by const. − (1/n) Σ_{i=1}^n log ( ∫ f(y_i | x_i, θ) π(θ) dθ ).

Bayesian Framework, contd.

Notice that if θ̂ denotes a MAP estimator, then

(1/n) Σ_{i=1}^n ( − log ∫ f(y_i | x_i, θ) π(θ) dθ ) ≈ (1/n) [ Σ_{i=1}^n ( − log f(y_i | x_i, θ̂) ) + ( − log π(θ̂) ) ].

When y | x follows a canonical normal linear model, the above criterion is equivalent to the PSS criterion (5). Thus J(θ) = − log π(θ) emerges as a choice of the penalty function, up to a multiplicative constant (see slide 8). Hence the choice of a penalty function is equivalent to the choice of a prior distribution (including improper distributions in some cases).
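
As a small numerical illustration of the penalty-prior correspondence (a sketch with simulated data; the ridge/normal-prior pairing below is just the simplest instance, not the talk's recommendation): under the normal linear model with a N(0, (σ²/λ)I) prior on β, the posterior mode is exactly the penalized least squares solution with J(β) ∝ λ‖β‖².

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 3
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, 0.0, -1.0])
sigma2 = 1.0
y = X @ beta_true + rng.normal(size=n) * np.sqrt(sigma2)

lam = 5.0   # penalty weight; corresponds to the prior beta ~ N(0, (sigma^2/lam) I)

# (a) penalized least squares: argmin ||y - X b||^2 + lam * ||b||^2
beta_pls = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# (b) the negative log posterior under the normal prior; -log prior plays the role of J(beta)
def neg_log_posterior(b):
    return ((y - X @ b) @ (y - X @ b) + lam * b @ b) / (2.0 * sigma2)

print(beta_pls)
print(neg_log_posterior(beta_pls) <= neg_log_posterior(beta_pls + 0.05))  # MAP coincides with the PSS minimizer
```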

Choice of Prior Distributions

We are not generally confident about any given set of predictors, and hence little prior information on D(δ)β and σ² can be expected for each δ. For each δ it is desirable to have some default priors for D(δ)β and σ². Unfortunately, default priors for normal linear models are generally improper.

Non-objective (conjugate) priors for β are typically centered at 0, making the model with no predictors the null model within a hypothesis-testing setup.

The goal is to select a prior (and hence a penalty function) that is criterion-based and fully automatic. More generally, we can think of constructing priors (and hence penalties) that may also depend on the design matrix X.

Zellner's prior: β | δ, σ², X ~ N_p(β_0, (σ²/g)(X^T X)^−), σ² ~ IG(a_0, b_0) and δ = 1 with probability 1. Here β_0, g, a_0 and b_0 need to be specified by the user (or estimated using either an EB or an HB procedure).

Extensions of Zellner priors: β | δ, σ², X ~ N_p(β_0, (σ²/g)(X(δ)^T X(δ))^−), σ² ~ IG(a_0, b_0) and δ ~ Unif({0, 1}^p), where X(δ) = XD(δ). Almost the same as above, but with δ ~ q^{Σ_j δ_j} (1 − q)^{p − Σ_j δ_j}.

The advantage of Zellner-type priors is the closed form, suitable for rapid computations over the large parameter space for δ (see the sketch below for an illustration).

In general we may consider the following independence prior:

π_I(δ) = Π_{j=1}^p q_j^{δ_j} (1 − q_j)^{1 − δ_j},  (9)

where q_j = Pr[δ_j = 1] is the inclusion probability of the j-th variable. A small q_j can be used to downweight the j-th variable. Notice that when q_j ≡ 0.5, models of size near p/2 get more weight.

Alternatively, assuming q_j = q for all j, one may place a prior q ~ Beta(a, b) to obtain the exchangeable prior

π_E(δ) = B(a + Σ_j δ_j, b + p − Σ_j δ_j) / B(a, b),

where B(a, b) is the Beta function. Notice that the components of δ are exchangeable but not independent under this prior.

Independence and exchangeable priors on δ may be less satisfactory when the models contain dependent components (e.g., interactions, polynomials, lagged or indicator variables). Consider 3 variables with main effects x_1, x_2, x_3 and three two-factor interactions x_1x_2, x_2x_3, x_1x_3. The importance of an interaction such as x_1x_2 will often depend on whether the main effects x_1 and x_2 are included in the model.

This can be expressed by a prior for δ = (δ_1, δ_2, δ_3, δ_12, δ_13, δ_23) of the form

π(δ) = Π_{j=1}^{3} π(δ_j) Π_{j<k} π(δ_{jk} | δ_j, δ_k),

where π(δ_{jk} | δ_j, δ_k) requires specifying four probabilities, one for each pair (δ_j, δ_k). E.g.,

π(δ_12 | 0, 0) < π(δ_12 | 0, 1), π(δ_12 | 1, 0) < π(δ_12 | 1, 1).
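
A sketch of the "closed form" remark above: combining the independence prior (9) on δ with the marginal likelihood of each submodel under a Zellner-type g-prior yields posterior model probabilities by direct enumeration. The marginal used below is the standard closed form for the variant with an intercept, p(α, σ²) ∝ 1/σ² and β_δ | σ² ~ N(0, gσ²(X_δ^T X_δ)^{-1}) (as in, e.g., Liang et al., 2008); the parametrization on the slide may differ, and the data and the choices of g and q are illustrative only.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
n, p, g, q = 50, 4, 50.0, 0.5          # g and the inclusion probability q are illustrative choices
X = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=n)

yc = y - y.mean()                      # center so the intercept is handled implicitly
Xc = X - X.mean(axis=0)
tss = float(yc @ yc)

def log_marginal(idx):
    """Log marginal likelihood of model delta, up to a constant common to all models."""
    k = len(idx)
    if k == 0:
        return 0.0                     # intercept-only (null) model is the reference
    b, *_ = np.linalg.lstsq(Xc[:, idx], yc, rcond=None)
    rss = float((yc - Xc[:, idx] @ b) @ (yc - Xc[:, idx] @ b))
    r2 = 1.0 - rss / tss
    return 0.5 * (n - 1 - k) * np.log(1 + g) - 0.5 * (n - 1) * np.log(1 + g * (1 - r2))

models, logpost = [], []
for delta in product([0, 1], repeat=p):
    idx = [j for j in range(p) if delta[j]]
    log_prior = sum(delta) * np.log(q) + (p - sum(delta)) * np.log(1 - q)   # independence prior (9)
    models.append(delta)
    logpost.append(log_marginal(idx) + log_prior)

logpost = np.array(logpost)
post = np.exp(logpost - logpost.max())
post /= post.sum()
for delta, pr in sorted(zip(models, post), key=lambda t: -t[1])[:5]:
    print(delta, round(float(pr), 4))
```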

The number of possible models grows exponentially as the number of interactions, polynomials and lagged variables increases. In contrast to independence priors of the form (9), priors for models with dependent components concentrate mass on plausible models, which matters when the number of possible models is huge. This can be crucial in applications such as screening designs, where p >> n (see Chipman, Hamada and Wu, 1997).

Another limitation of independence priors on δ is their failure to account for covariate collinearity. This problem can be resolved by using the so-called dilution priors (George, 1999). A general form of dilution prior can be written as

π_D(δ) = h(det(X(δ)^T X(δ))) π_I(δ)

(a small numerical sketch is given below).

Having little prior information on the variables, objective model selection methods are necessary.

Spiegelhalter and Smith (1982): improper priors — used conventional improper priors for β and a pseudo-Bayes factor for inference.

Mitchell and Beauchamp (1988): spike-and-slab priors — β_j | δ_j ~ (1 − δ_j) δ_0 + δ_j Unif(−a_j, a_j); the variable selection problem is solved as an estimation problem.

Berger and Pericchi (1996): intrinsic priors — developed a fully automatic prior and used the intrinsic Bayes factor for inference, based on a model-encompassing approach.

Yuan and Lin (2006) have recently proposed the use of the following dilution prior:

π(δ) ∝ q^{Σ_j δ_j} (1 − q)^{p − Σ_j δ_j} det(X(δ)^T X(δ)).

The main idea behind this prior is to replace a set of highly correlated variables by one variable from that set.

Suppose β_j | δ_j ~ (1 − δ_j) δ_0 + δ_j DE(0, τ), where DE(0, τ) denotes the double-exponential distribution with density (τ/2) exp{−τ|β|} and δ_0 is the distribution with point mass at 0. Yuan and Lin (2005) have shown that if one sets q = (1 + τσ √(π/2))^{−1}, then the model with the highest posterior probability is approximately equivalent to the LASSO with λ = 2σ²τ. A Gibbs sampling scheme is also presented (seminar on Oct 31st!).

Another recent attempt to construct automatic priors has been made by Casella and Moreno (2006). Their proposed methodology is:

Criterion based: provides a clear understanding of the properties of the selected models.
Automatic: no tuning parameter (hyperparameter) selection is required.
Formally carries out the hypothesis tests H_0: δ = δ* vs. H_a: δ = 1_p, where δ* ∈ {0, 1}^p but δ* ≠ 1_p, i.e., it tests the null hypothesis of a reduced model versus the full model (this is in sharp contrast to other conjugate-prior approaches).
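
The numerical sketch promised above for dilution priors (simulated, deliberately collinear design; taking h to be the identity, as in the Yuan-Lin form): the determinant factor pulls prior mass away from models that contain a set of nearly redundant columns.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(4)
n, p, q = 50, 3, 0.5
z = rng.normal(size=n)
X = np.column_stack([z, z + 0.05 * rng.normal(size=n), rng.normal(size=n)])  # x1 and x2 nearly collinear
X = X / np.linalg.norm(X, axis=0)        # unit-norm columns so determinants are comparable

def prior_mass(dilute):
    mass = {}
    for delta in product([0, 1], repeat=p):
        idx = [j for j in range(p) if delta[j]]
        w = q ** sum(delta) * (1 - q) ** (p - sum(delta))   # independence prior pi_I
        if dilute and idx:
            Xd = X[:, idx]
            w *= np.linalg.det(Xd.T @ Xd)                   # dilution factor with h = identity
        mass[delta] = w
    tot = sum(mass.values())
    return {d: w / tot for d, w in mass.items()}

pi_I = prior_mass(dilute=False)
pi_D = prior_mass(dilute=True)
both = [d for d in pi_I if d[0] == 1 and d[1] == 1]         # models containing both collinear columns
print("pi_I mass on models with x1 and x2:", round(sum(pi_I[d] for d in both), 3))
print("pi_D mass on models with x1 and x2:", round(sum(pi_D[d] for d in both), 3))
```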

The test of hypothesis is carried out using posterior model probabilities:

Pr[δ | y, X] = m(y | δ, X) / [ m(y | 1_p, X) + Σ_{δ' ≠ 1_p} m(y | δ', X) ]
            = BF_{δ,1_p}(y, X) / [ 1 + Σ_{δ' ≠ 1_p} BF_{δ',1_p}(y, X) ],

where BF_{δ,1_p}(y, X) denotes the Bayes factor for testing model δ against the full model 1_p. The fact that every posterior model probability has the same denominator facilitates rapid computation. The use of intrinsic priors overcomes the problem of using improper priors when computing the Bayes factors.

Illustrative Examples

A simulation study adapted from Casella and Moreno (2006):

Full model: y = β_0 + β_1 x_1 + β_2 x_2 + β_3 x_1² + β_4 x_2² + ε, where ε ~ N(0, σ²), and

True DGP1: y = 1 + x_1 + ε, with x_1, x_2 ~ Unif(0, 10), σ = 2.
True DGP2: y = 1 + x_1 + 2x_2 + ε, with x_1, x_2 ~ Unif(0, 10), σ = 2.

A sample of n = 10 is generated and posterior model probabilities are computed for all 2^4 = 16 models. The procedure was repeated 1000 times and compared with Mallows' C_p.

Examples, contd.

Consider the ancient Hald data (Casella and Moreno, 2006), which measure the effect of heat on the composition of cement. n = 13 observations on heat (y) and four cement compositions (x_j, j = 1, ..., 4) are available (2^4 = 16 models). Historically, it is known that the subset {x_1, x_2} is most preferred by earlier analyses.

[Table: posterior model probabilities Pr[δ | y, X] for the eight subsets with the largest posterior mass, including {x_1, x_2}, {x_1, x_4}, {x_1, x_2, x_3}, {x_1, x_2, x_4}, {x_1, x_3, x_4}, {x_2, x_3, x_4}, {x_1, x_2, x_3, x_4} and {x_3, x_4}; the numerical values are not recoverable from the transcription.]

All other models have posterior probabilities < 10^{-5}.
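
For reference, a sketch of the frequentist benchmark mentioned in the simulation study: data are generated from DGP1 and all 2^4 = 16 submodels of the full model are ranked by Mallows' C_p (the intrinsic-prior posterior probabilities themselves require the Casella-Moreno machinery and are not reproduced here).

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(5)
n = 10
x1, x2 = rng.uniform(0, 10, n), rng.uniform(0, 10, n)
y = 1.0 + x1 + 2.0 * rng.normal(size=n)            # DGP1: y = 1 + x1 + eps, sigma = 2

cols = {"x1": x1, "x2": x2, "x1^2": x1 ** 2, "x2^2": x2 ** 2}
names = list(cols)
Z = np.column_stack([cols[c] for c in names])

# MSE of the full model (intercept + all four predictors)
Xfull = np.column_stack([np.ones(n), Z])
rss_full = float(np.sum((y - Xfull @ np.linalg.lstsq(Xfull, y, rcond=None)[0]) ** 2))
s2 = rss_full / (n - Xfull.shape[1])

results = []
for delta in product([0, 1], repeat=4):
    idx = [j for j in range(4) if delta[j]]
    Xd = np.column_stack([np.ones(n)] + [Z[:, j] for j in idx])
    rss = float(np.sum((y - Xd @ np.linalg.lstsq(Xd, y, rcond=None)[0]) ** 2))
    cp = rss / s2 - n + 2 * (len(idx) + 1)         # Mallows' Cp, counting the intercept
    results.append((cp, [names[j] for j in idx]))

for cp, model in sorted(results)[:5]:
    print(round(cp, 2), model or ["intercept only"])
```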

Examples, contd.

Based on R², Draper and Smith (1981, Sec. 6.1) also concluded in favour of the top two models, with a preference for {x_1, x_4} since x_4 is the single best predictor. Although {x_1, x_2, x_4} had a high R², the variable x_4 was excluded because x_2 and x_4 are highly correlated.

Interestingly, George and McCulloch (1993) analyzed these data and favored the model with no predictors (δ = 0_4), followed by the model with one predictor. The stochastic search algorithm of George and McCulloch (1992) visited the model {x_1, x_2} less than 7% of the time! This could be because the model with no predictors is taken as the null model in all comparisons. Their methods were sensitive to the choice of priors for β and δ.

Examples, contd.

C & M (2006) also considered the 10-predictor model

y = β_0 + Σ_{j=1}^{3} β_j x_j + Σ_{j=1}^{3} η_j x_j² + Σ_{j<k} η_{jk} x_j x_k + η_{123} x_1 x_2 x_3 + ε.

The true DGP2 was used to simulate the data, and a stochastic search with the intrinsic prior was used to estimate posterior model probabilities. A total of 10^4 MCMC samples were generated. Exact posterior model probabilities for all 2^10 = 1,024 models were also computed. The entire procedure was repeated 1,000 times with n = 10. Two values σ = 2, 5 were used.

Examples, contd.

[Table: exact posterior model probabilities Pr[δ | y] and numbers of stochastic-search (SSVS) visits for the leading models under σ = 2 and σ = 5; the numerical entries are not recoverable from the transcription.]

Conclusions

Variable selection can be considered as a multiple testing problem in which we test whether any reduction in complexity of the full model is plausible.

Default priors typically used for model parameters are improper, and thus they are not suitable for computing posterior model probabilities.

The commonly used vague priors (as limits of conjugate priors) are typically an ill-defined notion.

Intrinsic priors are well defined, depend on the sampling density and do not require the choice of tuning parameters. The intrinsic prior for the full-model parameters is centered at the reduced model and has heavy tails.

Conclusions, contd.

The role of SSVS is different from that of estimating a posterior distribution: the goal is to find good models rather than to estimate the modes accurately. However, determining how many MCMC runs to carry out is a complex issue. Rigorous evaluation of SSVS in terms of convergence and mixing is very difficult and might be worth more exploration.

Open problems:
Given two priors (or, equivalently, penalty functions), how would one rigorously choose a model/method for the VSP?
Can the computational cost be factored into the loss function?

THANKS!

All references mentioned in this talk, and many more, are available online.
