Self-adaptive Lasso and its Bayesian Estimation

Jian Kang¹ and Jian Guo²
1. Department of Biostatistics, University of Michigan
2. Department of Statistics, University of Michigan

Abstract

In this paper, we propose a self-adaptive lasso method for variable selection in regression problems. Unlike the popular lasso method, the proposed method introduces a specific tuning parameter for each regression coefficient. We model the self-adaptive lasso in a Bayesian framework and develop an efficient Gibbs sampling algorithm that automatically selects these tuning parameters and estimates the regression coefficients. The algorithm also makes it convenient to conduct statistical inference for the selected variables. Several synthetic and real examples demonstrate that the flexibility of the tuning parameters enhances the performance of the self-adaptive lasso in terms of both prediction and variable selection. Finally, we extend the self-adaptive lasso to the elastic net and the fused lasso.

Key Words: Bayesian modeling, Gibbs sampling, Lasso, Variable selection.

1 Introduction

In this paper, we consider a least squares regression problem with n observations (x_1, y_1), ..., (x_i, y_i), ..., (x_n, y_n), where x_i = (x_{i,1}, ..., x_{i,p}) is a p-dimensional vector of predictors and y_i is the associated response. To simplify notation, we write Y = (y_1, ..., y_n)^T and X = (x_1, ..., x_n)^T. Without loss of generality, we assume Y is centered (Σ_{i=1}^n y_i = 0) and X is standardized along its columns (Σ_{i=1}^n x_{i,j} = 0 and Σ_{i=1}^n x_{i,j}² = n for 1 ≤ j ≤ p). The response Y and the predictors X are related by the linear model

    Y = Xβ + ε,    (1)

where ε = (ε_1, ..., ε_n)^T are error terms and β = (β_1, ..., β_p)^T are regression coefficients. Since the response and the predictors are assumed to have zero means, the intercept term can be excluded from model (1).

The linear regression problem draws attention to two aspects: prediction accuracy and model interpretability. The former focuses on reducing prediction errors, whereas the latter concerns selecting important variables. The two goals can be achieved simultaneously by a family of approaches known as regularization methods (Breiman, 1995; Fan and Li, 2001; Meinshausen, 2007; Radchenko and James, 2008; Wang et al., 2008; Tibshirani, 1996; Tibshirani et al., 2005; Zou and Hastie, 2005; Zou, 2006). In particular, the lasso method proposed by Tibshirani (1996) has gained much attention in recent years. It penalizes the least squares loss function by the l_1 norm of the regression coefficients:

    min_β Σ_{i=1}^n (y_i − Σ_{j=1}^p β_j x_{i,j})² + λ Σ_{j=1}^p |β_j|,    (2)

where λ is a tuning parameter. Due to the singularity of the l_1 norm at zero, lasso continuously shrinks the estimated coefficients towards zero and pushes some of them exactly to zero when λ is large enough. The prediction accuracy of lasso can generally dominate that of ordinary least squares (OLS) regression in terms of mean squared error (MSE), because its penalty introduces some bias in exchange for a reduction in estimation variance (the bias-variance tradeoff).
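For reference, criterion (2) can be minimized with off-the-shelf software. A minimal sketch using scikit-learn, whose Lasso objective is (1/(2n))||Y − Xβ||² + α||β||₁, so that α corresponds to λ/(2n) in the notation of (2); the toy data below are our own illustration, not part of the paper:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Toy data: n = 50 observations, p = 8 standardized predictors.
rng = np.random.default_rng(0)
n, p = 50, 8
X = rng.standard_normal((n, p))
X = (X - X.mean(0)) / X.std(0)            # columns centered and scaled
beta_true = np.array([3.0, 1.5, 0, 0, 2.0, 0, 0, 0])
y = X @ beta_true + rng.normal(scale=3.0, size=n)
y = y - y.mean()                          # centered response, no intercept

lam = 10.0                                # tuning parameter in the notation of (2)
fit = Lasso(alpha=lam / (2 * n), fit_intercept=False).fit(X, y)
print("estimated coefficients:", np.round(fit.coef_, 3))
```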

We can see that lasso reduces to OLS regression when λ = 0. Variables with zero estimated coefficients are considered unimportant and are removed from the model.

Although lasso has been highly successful in many situations, it has a few limitations in practice. In this paper, we focus on two of them:

Limitation 1. Lasso uses a single tuning parameter λ to penalize all p regression coefficients β_j equally. This λ controls both the number of selected variables and the shrinkage level of the fitted regression coefficients. In practice, because of this single tuning parameter, lasso usually either includes a number of irrelevant variables in order to reduce the estimation bias, or over-shrinks the coefficients of the correct variables in order to produce a model of the correct size (Radchenko and James, 2008).

Limitation 2. The lasso criterion is an optimization problem with respect to the β_j's and thus provides only a point estimate of β. Nevertheless, one usually also needs to know the level of confidence in the estimates, such as a confidence interval (or credible interval) and a p-value.

The first limitation is partially addressed by the adaptive lasso (Zou, 2006), which penalizes a weighted l_1 norm of the regression coefficients:

    min_β Σ_{i=1}^n (y_i − Σ_{j=1}^p β_j x_{i,j})² + λ Σ_{j=1}^p w_j |β_j|,    (3)

where the w_j's are adaptive weights defined as w_j = |β̂_j^{ols}|^{−r} for some positive constant r, and β̂_j^{ols} is the OLS estimate of β_j. The intuition of the adaptive lasso is to give large weights to unimportant variables, so that their coefficients are heavily shrunk, and small weights to important variables, so that their coefficients are only slightly shrunk.
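The adaptive lasso estimate in (3) can be computed with the same lasso solver by folding the weights into the design matrix: scale column j by 1/w_j, fit an ordinary lasso, and rescale the fitted coefficients. A sketch under the assumption n > p so that the OLS fit exists (the function name is ours; it reuses the toy X, y from the previous snippet):

```python
import numpy as np
from sklearn.linear_model import Lasso

def adaptive_lasso(X, y, lam, r=1.0):
    """Adaptive lasso via column rescaling; assumes X has full column rank."""
    n, p = X.shape
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    w = 1.0 / np.abs(beta_ols) ** r            # weights w_j = |beta_ols_j|^{-r}
    X_scaled = X / w                           # column j multiplied by 1/w_j
    fit = Lasso(alpha=lam / (2 * n), fit_intercept=False).fit(X_scaled, y)
    return fit.coef_ / w                       # undo the rescaling

# Example call with the toy data generated above:
# print(np.round(adaptive_lasso(X, y, lam=10.0), 3))
```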

Theoretically, the adaptive lasso enjoys oracle properties (Fan and Li, 2001) that lasso does not have. Specifically, when p is fixed and n goes to infinity, with an appropriately selected λ the adaptive lasso selects the true underlying model with probability tending to one, and the corresponding estimated coefficients have the same asymptotic normal distribution as they would have if the true underlying model were provided in advance (Zou, 2006). Although the adaptive lasso has nice asymptotic properties, its finite-sample performance depends heavily on the quality of the OLS estimates in the weights w_j. In practice, the adaptive lasso may suffer when highly correlated variables cause collinearity, because the OLS estimates are ill-conditioned in this scenario.

In addition to the adaptive lasso, a few other methods in the literature attempt to address this limitation, for example the relaxed lasso (Meinshausen, 2007) and VISA (Radchenko and James, 2008). Both introduce two tuning parameters into the regression model: the first parameter distinguishes potentially important variables from unimportant ones, while the second controls the level of shrinkage on the selected variables. Unlike the relaxed lasso, which permanently excludes the variables removed by the first tuning parameter, VISA allows for the potential inclusion of all variables and thus has a chance to recover variables incorrectly removed by the first parameter.

The second limitation of lasso can be addressed by applying bootstrap sampling to lasso (Tibshirani, 1996; Wang et al., 2008). An alternative way to conduct statistical inference is to implement the idea of lasso in a Bayesian framework. Indeed, Tibshirani (1996) proposed a Bayesian interpretation of lasso. Suppose y_i | x_i ~ N(x_i β, σ²) and the regression coefficients follow a Laplace (double-exponential) distribution β_j ~ L(0, σ²/λ), where N(µ, σ²) denotes the normal distribution with mean µ and variance σ², and L(µ, γ) denotes the Laplace distribution with density f(x | µ, γ) = exp{−|x − µ|/γ}/(2γ). The log-posterior of β is then

    log π(β | X, Y, σ², λ) = −(1/(2σ²)) [ Σ_{i=1}^n (y_i − Σ_{j=1}^p β_j x_{i,j})² + 2λ Σ_{j=1}^p |β_j| ] + C,    (4)

where C = p log(λ) − (n/2) log(2π) − (n/2 + p) log(σ²) is a constant irrelevant to β. We can see that the mode of the log-posterior (4) is exactly the solution of the lasso criterion (2). For the parameters β_j, suppose a number of simulated samples have been drawn from the posterior (4); these samples can then be used to recover not only the posterior modes, but also Bayesian credible intervals and p-values. Unfortunately, directly sampling β from the posterior (4) is nontrivial. To address this problem, Park and Casella (2008) proposed the Bayesian lasso, in which the Laplace prior is represented as a scale mixture of normals with an exponential mixing density, i.e.,

    (a/2) exp(−a|z|) = ∫_0^∞ (1/√(2πs)) exp(−z²/(2s)) · (a²/2) exp(−a²s/2) ds = E_S[ φ(z/√S)/√S ].    (5)

In (5), S is a random variable with density (a²/2) exp(−a²s/2), and it can be regarded as missing data from the point of view of data augmentation (van Dyk and Meng, 2001). Conditioned on the missing data, the conditional posterior of β is a normal distribution (since both the conditional prior and the likelihood are normal). Making use of this property, Park and Casella (2008) developed a Gibbs sampling algorithm to draw samples in the augmented parameter space. The theoretical properties and some extensions of the Bayesian lasso were further discussed in Kyung et al. (2009).
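The representation (5) is easy to verify by simulation: drawing S from the exponential mixing density with rate a²/2 and then z | S ~ N(0, S) reproduces a Laplace draw with rate a. A small Monte Carlo check (the constants are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
a = 2.0                                    # Laplace rate parameter in (5)
m = 200_000

# Hierarchical draw: S ~ Exp(rate = a^2/2), then z | S ~ N(0, S).
S = rng.exponential(scale=2.0 / a**2, size=m)
z_mix = rng.normal(loc=0.0, scale=np.sqrt(S))

# Direct Laplace draw with density (a/2) exp(-a|z|), i.e. scale 1/a.
z_lap = rng.laplace(loc=0.0, scale=1.0 / a, size=m)

# The two samples should have matching moments (and matching histograms).
print(np.var(z_mix), np.var(z_lap))                    # both close to 2/a^2 = 0.5
print(np.mean(np.abs(z_mix)), np.mean(np.abs(z_lap)))  # both close to 1/a = 0.5
```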

In this paper, we propose a new variable selection method, the self-adaptive lasso, which attempts to address limitations 1 and 2 of lasso simultaneously. On the one hand, the self-adaptive lasso allows each coefficient β_j in the penalty to have its own tuning parameter λ_j, and numerical results demonstrate that this flexibility can help enhance prediction and variable selection performance (see Table 1, for example). On the other hand, the self-adaptive lasso criterion is estimated in an equivalent Bayesian framework and thus naturally provides credible intervals and p-values for the estimates. To implement the self-adaptive lasso, we develop a Gibbs sampling algorithm that simultaneously estimates the regression coefficients β_j and selects the optimal tuning parameters λ_j. Unlike the Bayesian lasso, which introduces extra latent variables, the Gibbs sampling algorithm proposed in this paper draws each β_j directly from its full conditional distribution, which is a mixture of two truncated normal distributions.

The remainder of the paper is organized as follows. Section 2 introduces the self-adaptive lasso and related algorithmic issues. Sections 3 and 4 use synthetic and real data, respectively, to evaluate the performance of the self-adaptive lasso. Section 5 extends the idea of the self-adaptive lasso to the elastic net and the fused lasso. Finally, some concluding remarks are drawn in Section 6.

2 Methodology

2.1 Self-adaptive Lasso

The self-adaptive lasso method proposed in this paper optimizes the following criterion:

    min_β Σ_{i=1}^n (y_i − Σ_{j=1}^p β_j x_{i,j})² + Σ_{j=1}^p λ_j |β_j|,    (6)

where λ_j is the tuning parameter for the regression coefficient β_j, 1 ≤ j ≤ p. The self-adaptive lasso defined in (6) generalizes the lasso problem in (2) by allowing each regression coefficient β_j in the penalty to have its own tuning parameter λ_j. The relaxation of the tuning parameters introduces more flexibility. For example, if we specify λ_j = 0 for all j ∈ A and λ_j = +∞ for all j ∉ A, where A = {1 ≤ j ≤ p : β_j^0 ≠ 0} is the index set of all truly nonzero coefficients, then criterion (6) reduces to an oracle regression problem (Tibshirani, 1996), i.e., the OLS regression on the subset of predictors with truly important coefficients. The oracle regression is an ideal variable selection method, but it is not realistic because we do not know the subset of truly nonzero coefficients; it is, however, often used as a benchmark in simulation studies. On the other hand, the self-adaptive lasso becomes the adaptive lasso (Zou, 2006) if we specify the tuning parameters as λ_j = λ w_j = λ/|β̂_j^{ols}|, 1 ≤ j ≤ p.

The adaptive lasso can be regarded as an attempt to approach the oracle regression, because it tends to heavily (slightly) penalize coefficients with small (large) magnitudes. As discussed in Section 1, however, the performance of the adaptive lasso depends heavily on the extra OLS estimates. Since the self-adaptive lasso has more flexible tuning parameters than the adaptive lasso, it has the chance to approach the oracle regression more closely if these tuning parameters are appropriately specified.

A direct way to do this is to search for the optimal tuning parameters λ_j over a grid in R^p such that the prediction error on an independent validation set is minimized. However, the search over such a p-dimensional grid is computationally infeasible for high-dimensional problems, because the number of grid nodes grows exponentially with p. To avoid this difficulty, we transform the self-adaptive lasso problem in (6) into the following Bayesian model and then jointly select the tuning parameters λ_j and estimate the linear coefficients β_j using Gibbs sampling. There are several advantages to doing so. First, the Gibbs sampling strategy makes the search for optimal tuning parameters computationally feasible. Second, unlike the adaptive lasso, the selected tuning parameters in the self-adaptive lasso do NOT depend on any extra estimator such as the OLS estimator, so its performance is not affected by a poor estimate from such an estimator. Finally, as discussed in Section 1, the simulated samples drawn from the posterior distribution provide a natural way to conduct statistical inference.

2.2 Bayesian Modeling

The Bayesian model of the self-adaptive lasso in (6) is

    y_i | x_i, β, σ² ~ N(x_i^T β, σ²), i = 1, ..., n,    (7)
    β_j | λ_j, σ² ~ L(0, 2σ²/λ_j), j = 1, ..., p.    (8)

This implies that the conditional posterior of β given σ² and λ = (λ_1, ..., λ_p) is

    π(β | X, Y, σ², λ) ∝ exp{ −(1/(2σ²)) [ Σ_{i=1}^n (y_i − Σ_{j=1}^p β_j x_{i,j})² + Σ_{j=1}^p λ_j |β_j| ] }.    (9)

Note that the mode of (9) is identical to the solution of (6). Given σ², the λ_j's describe the rate at which the β_j's approach 0. To jointly select the λ_j's and estimate the β_j's, we regard the λ_j's as parameters of (6) and estimate them together with the β_j's. In this sense, the λ_j's can be regarded as random effects, and we assign a prior to each λ_j,

    λ_j ~ G(a_λ, b_λ), j = 1, ..., p,    (10)

where G(a_λ, b_λ) denotes the Gamma distribution with shape a_λ and rate b_λ. In addition, σ² specifies the variance of the random errors in the model and we assign it an inverse Gamma prior,

    σ² ~ G^{−1}(a_σ, b_σ).    (11)

Jointly modeling the λ_j's and β_j's as in (7)–(11) allows us to sample these parameters from the joint posterior π(β, λ, σ² | Y, X) ∝ π(Y | X, β, λ, σ²) π(β | λ, σ²) π(λ) π(σ²).

2.3 Sampling from the Posterior

We develop a Gibbs sampling algorithm that iteratively draws samples of the β_j's, the λ_j's and σ² from the joint posterior π(β, λ, σ² | Y, X).

Sampling β

Since the β_j's are the parameters of primary interest, the core of the proposed Gibbs sampling algorithm is to draw the β_j's given the other parameters. Denote by φ(·) and Φ(·) the probability density function and the cumulative distribution function of the standard normal distribution, respectively. Let ε_{i,j} = y_i − Σ_{k≠j} β_k x_{i,k} and p_{0±} = A_{0±}/(A_{0+} + A_{0−}), where

    A_{0±} = exp{ (Σ_{i=1}^n ε_{i,j} x_{i,j} ∓ λ_j/2)² / (2σ²n) } Φ( ±(Σ_{i=1}^n ε_{i,j} x_{i,j} ∓ λ_j/2) / (√n σ) ).    (12)
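A practical note: the exponential factors in (12) can overflow for moderate n, so it is safer to form p_{0+} = A_{0+}/(A_{0+} + A_{0−}) on the log scale. A sketch of such a computation (the helper name and its arguments are ours; eps_j stands for the partial residual vector (ε_{1,j}, ..., ε_{n,j}) and x_j for the j-th column of X):

```python
import numpy as np
from scipy.stats import norm
from scipy.special import expit

def prob_positive(eps_j, x_j, lam_j, sigma2):
    """Log-scale evaluation of p_{0+} = A_{0+} / (A_{0+} + A_{0-}) from (12)."""
    n = len(x_j)
    s = eps_j @ x_j                       # sum_i eps_{i,j} x_{i,j}
    sd = np.sqrt(n * sigma2)              # sqrt(n) * sigma
    log_A_plus = (s - lam_j / 2) ** 2 / (2 * sigma2 * n) + norm.logcdf((s - lam_j / 2) / sd)
    log_A_minus = (s + lam_j / 2) ** 2 / (2 * sigma2 * n) + norm.logcdf(-(s + lam_j / 2) / sd)
    # p_{0+} = 1 / (1 + exp(log_A_minus - log_A_plus))
    return expit(log_A_plus - log_A_minus)
```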

In addition, we denote by N_{δ+}(µ, σ²) the normal distribution positively truncated at δ, with density function φ((x − µ)/σ) I(x > δ) / [σ Φ((µ − δ)/σ)]; similarly, we denote by N_{δ−}(µ, σ²) the normal distribution negatively truncated at δ, with density function φ((x − µ)/σ) I(x < −δ) / [σ Φ(−(µ + δ)/σ)]. The following proposition provides the full conditional posterior of β_j given β_{−j} and the other parameters.

Proposition 1 Given β_{−j} and the other parameters, the full conditional distribution of β_j is a mixture of two truncated normals:

    β_j | β_{−j}, Y, X, λ, σ² ~ p_{0+} N_{0+}( (Σ_{i=1}^n ε_{i,j} x_{i,j} − λ_j/2)/n, σ²/n ) + p_{0−} N_{0−}( (Σ_{i=1}^n ε_{i,j} x_{i,j} + λ_j/2)/n, σ²/n ).    (13)

The proof of Proposition 1 is in the Appendix. In practice, we generate a sample from (13) in two steps. In the first step, we randomly generate an indicator which is "+" with probability p_{0+} and "−" with probability p_{0−}. In the second step, we draw a sample β_j^# from N_{0+}((Σ_{i=1}^n ε_{i,j} x_{i,j} − λ_j/2)/n, σ²/n) if the indicator is "+", and from N_{0−}((Σ_{i=1}^n ε_{i,j} x_{i,j} + λ_j/2)/n, σ²/n) if the indicator is "−". Then β_j^# follows the mixture of two truncated normal distributions defined in (13).

Sampling λ and σ²

The full conditional distributions of λ and σ² can be calculated as follows.

Proposition 2 In the Gibbs sampling algorithm, the λ_j's and σ² can be sampled as

    λ_j | β, σ², Y, X ~ G( a_λ + 1, |β_j|/(2σ²) + b_λ ),    (14)
    σ² | β, λ, Y, X ~ G^{−1}( p + n/2 + a_σ + 1, (1/2) [ Σ_{i=1}^n (y_i − Σ_{j=1}^p β_j x_{i,j})² + Σ_{j=1}^p λ_j |β_j| ] + b_σ ).    (15)
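Putting Propositions 1 and 2 together, one sweep of the Gibbs sampler might look as follows. This is a sketch, not the authors' implementation: it uses scipy's truncnorm rather than the accept-reject scheme of Remark 1, and it calls the hypothetical prob_positive helper sketched after (12).

```python
import numpy as np
from scipy.stats import truncnorm

def gibbs_sweep(beta, lam, sigma2, X, y, a_lam, b_lam, a_sig, b_sig, rng):
    """One pass over beta_j (Proposition 1), then lambda_j and sigma^2 (Proposition 2)."""
    n, p = X.shape

    for j in range(p):
        eps_j = y - X @ beta + X[:, j] * beta[j]      # partial residuals epsilon_{i,j}
        s = eps_j @ X[:, j]
        p_plus = prob_positive(eps_j, X[:, j], lam[j], sigma2)
        sd = np.sqrt(sigma2 / n)
        if rng.random() < p_plus:                     # positive component of (13)
            mu = (s - lam[j] / 2) / n
            beta[j] = truncnorm.rvs(a=(0 - mu) / sd, b=np.inf, loc=mu, scale=sd,
                                    random_state=rng)
        else:                                         # negative component of (13)
            mu = (s + lam[j] / 2) / n
            beta[j] = truncnorm.rvs(a=-np.inf, b=(0 - mu) / sd, loc=mu, scale=sd,
                                    random_state=rng)

    # lambda_j | rest ~ Gamma(a_lam + 1, rate = |beta_j|/(2 sigma^2) + b_lam)    (14)
    lam = rng.gamma(shape=a_lam + 1, scale=1.0 / (np.abs(beta) / (2 * sigma2) + b_lam))

    # sigma^2 | rest ~ Inv-Gamma(p + n/2 + a_sig + 1, (RSS + sum lam_j|beta_j|)/2 + b_sig)  (15)
    rss = np.sum((y - X @ beta) ** 2)
    rate = 0.5 * (rss + lam @ np.abs(beta)) + b_sig
    sigma2 = 1.0 / rng.gamma(shape=p + n / 2 + a_sig + 1, scale=1.0 / rate)

    return beta, lam, sigma2
```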

Remark 1 To efficiently draw samples from N_{0+}(µ, σ²) when µ < 0 and from N_{0−}(µ, σ²) when µ > 0, we use the optimal exponential accept-reject algorithm proposed by Robert (1995). In our simulation studies, the average acceptance rate of this algorithm is around 90%, which indicates that sampling the β_j's from the mixture of two truncated normals is efficient.

Remark 2 The posterior mode of the β_j's can be exactly zero due to the nature of the Laplace prior in (8). However, the posterior draws of the β_j's cannot be exactly zero, since the β_j's are continuous random variables, and it is therefore very challenging to estimate the posterior mode directly from simulated samples of the β_j's. Instead of estimating the posterior mode for variable selection, we introduce a zero threshold δ, a pre-specified small positive number (δ is set to 10^{−5} in this paper), and regard all sampled β_j's that fall in the extremely small interval [−δ, δ] as zeros. The full conditional (13) implies that when λ_j is large enough, the absolute value of a posterior draw of β_j can be smaller than δ; it is then estimated as exactly zero and its associated predictor is removed from the model. This conducts variable selection automatically.

Remark 3 In practice, the hyper-parameters in (10) and (11) can be estimated from prior knowledge. For example, if we approximately know q_0, the prior expected proportion of true zeros in β, then by the property of the Laplace distribution we have

    P(−δ ≤ β_j ≤ δ | λ_j, σ²) = 1 − exp{ −λ_j δ/σ² } ≈ q_0.    (16)

After some algebra, we can set the hyper-parameters to satisfy the following equality, which is equivalent to (16):

    E(λ) E(1/σ²) = (a_λ/b_λ)(a_σ/b_σ) = −log(1 − q_0)/δ.    (17)

A feasible choice is a_σ = b_σ = 10^{−5}, a_λ = 1 and b_λ = −δ/log(1 − q_0).
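As a worked instance of Remark 3, suppose the prior guess is that about 60% of the coefficients are zero and δ = 10⁻⁵; then (17) pins down one feasible combination of hyper-parameters (the numbers are purely illustrative):

```python
import numpy as np

delta = 1e-5                 # zero threshold from Remark 2
q0 = 0.6                     # prior guess for the proportion of true zeros

# Satisfy E(lambda) E(1/sigma^2) = (a_lam/b_lam)(a_sig/b_sig) = -log(1 - q0)/delta   (17)
a_sig = b_sig = 1e-5         # so that a_sig/b_sig = 1
a_lam = 1.0
b_lam = -delta / np.log(1.0 - q0)

print(b_lam)                                       # approx 1.09e-05
print((a_lam / b_lam) * (a_sig / b_sig))           # equals -log(1 - q0)/delta
print(-np.log(1.0 - q0) / delta)
```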

On the other hand, if we do not want to use any prior knowledge, the priors of λ and σ² can be made less informative, i.e., we let a_λ, b_λ, a_σ and b_σ be very small numbers.

2.4 Model Selection

In the self-adaptive lasso, the tuning parameters λ_j and the variance of the error terms σ² are estimated jointly with the regression coefficients β_j. We could specify the optimal λ_j's and σ² as the joint posterior mode, i.e.,

    (β*, λ*, σ²*) = arg max_{β,λ,σ²} π(β, λ, σ² | Y, X).    (18)

However, there are at least two limitations to selecting the λ_j's and σ² by criterion (18). On the one hand, the posterior mode rests on some strong assumptions about the distributions of the data and the parameters, so criterion (18) may be misleading if the model is mis-specified. On the other hand, criterion (18) depends entirely on the same data used to estimate the β_j's, which may result in over-fitting (Hastie et al., 2001). To avoid these two limitations, we instead use a separate validation set to select the λ_j's and σ². Suppose we have obtained a series of samples {(β^{(m)}, λ^{(m)}, σ^{2(m)})}_{m=1}^M drawn from the joint posterior π(β, λ, σ² | Y, X); then the optimal λ_j's and σ² are selected such that

    (β*, λ*, σ²*) = arg min_{1 ≤ m ≤ M} ||Y^{(v)} − X^{(v)} β^{(m)}||²,    (19)

where (Y^{(v)}, X^{(v)}) is a separate validation set independent of the training set (Y, X).
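Criterion (19) is a simple argmin over the stored posterior draws; a sketch (betas, lams and sigma2s are assumed to hold the M saved draws, and X_val, y_val the validation set):

```python
import numpy as np

def select_by_validation(betas, lams, sigma2s, X_val, y_val):
    """Pick the posterior draw minimizing the validation error, as in (19)."""
    errors = np.sum((y_val[None, :] - betas @ X_val.T) ** 2, axis=1)
    m_star = int(np.argmin(errors))
    return betas[m_star], lams[m_star], sigma2s[m_star]
```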

2.5 Estimation and Inference

Given the selected λ* and σ²*, we can draw samples from the conditional posterior π(β | λ*, σ²*, Y, X), and the posterior median can be computed as a point estimate for each β_j. Following Remark 2, extremely small posterior draws of the β_j's are regarded as zeros; therefore the posterior median can be exactly zero, which conducts variable selection automatically. With the same samples, we can also construct Bayesian credible intervals and compute p-values for each β_j. The 95% credible interval for β_j is estimated by [Q_{0.025}(β_j), Q_{0.975}(β_j)], where Q_α(β_j) is the empirical α-quantile of β_j estimated from the simulated samples. We also estimate the probability p_j^b = Pr(|β_j| > δ | λ, σ², Y, X), called the Bayesian p-value, which is another measure of the uncertainty of β_j.
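All quantities in Section 2.5 are empirical summaries of the retained draws of β; a sketch (betas is again an assumed M × p array of posterior draws):

```python
import numpy as np

def summarize_draws(betas, delta=1e-5):
    """Posterior medians, 95% credible intervals and Bayesian p-values for each beta_j."""
    betas = np.where(np.abs(betas) <= delta, 0.0, betas)   # Remark 2: threshold tiny draws
    median = np.median(betas, axis=0)                      # point estimate (can be exactly 0)
    lower = np.quantile(betas, 0.025, axis=0)
    upper = np.quantile(betas, 0.975, axis=0)
    p_bayes = np.mean(np.abs(betas) > delta, axis=0)       # Pr(|beta_j| > delta | data)
    return median, (lower, upper), p_bayes
```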

3 Simulation Study

In this section, we use several simulated examples to evaluate the performance of the proposed self-adaptive lasso method. In each example, we generate 50 data sets, each consisting of a training set, an independent validation set and an independent test set. Models are fitted on the training data only, and the validation data are used to select the optimal hyper-parameters that minimize the prediction error. To evaluate the prediction performance of each fitted model, we calculate the relative mean-square error (RMSE) on the test set, defined as

    RMSE = ||X_test β̂ − X_test β_0||² / σ²,    (20)

where β_0 is the true value of β, σ² is the variance of the error term and X_test is the test data matrix. Following the notation of Zou and Hastie (2005), we write ·/·/· for the numbers of observations in the training, validation and test sets, respectively. In each example, the data are simulated from the true model

    x_i ~ N(0, Σ), ε_i ~ N(0, σ²), y_i = x_i β + ε_i, i = 1, ..., n,

where the variance of the error term, σ², is set such that the signal-to-noise ratio of the model equals 3. The details of the simulated examples are as follows; a data-generation sketch for Example 1 is given after the list.

Example 1 Each data set consists of 20/20/200 observations and eight variables. The covariance matrix Σ is an AR(1) matrix, i.e., cor(X_j, X_{j'}) = ρ^{|j−j'|}, with ρ = 0.5. The regression coefficients are set as β = (3, 1.5, 0, 0, 2, 0, 0, 0)^T.

Example 2 Each data set consists of 100/100/1000 observations and 40 variables. As in Example 1, the covariance matrix Σ is an AR(1) matrix with auto-correlation ρ = 0.5. The regression coefficients are set as

    β_j = (j + 1)/2, if j = 1, 3, 5, ..., 19;
    β_j = −j/2, if j = 2, 4, 6, ..., 20;
    β_j = 0, if j = 21, 22, ..., 40.

Example 3 All settings are the same as in Example 2 except that the regression coefficients are set as

    β_j = 1, if j = 1, 3, 5, ..., 19;
    β_j = −1, if j = 2, 4, 6, ..., 20;
    β_j = 0, if j = 21, 22, ..., 40.

Example 4 Each data set consists of 50/50/500 observations and 40 variables. The predictors X are generated as

    X_j = Z_1 + ξ_j, Z_1 ~ N(0, 1), for 1 ≤ j ≤ 5;
    X_j = Z_2 + ξ_j, Z_2 ~ N(0, 1), for 6 ≤ j ≤ 10;
    X_j = Z_3 + ξ_j, Z_3 ~ N(0, 1), for 11 ≤ j ≤ 15;
    X_j = ξ_j, for 16 ≤ j ≤ 40,

where the ξ_j are i.i.d. N(0, 1). The regression coefficients are set as

    β_j = 1, if 1 ≤ j ≤ 5;
    β_j = 5, if 6 ≤ j ≤ 10;
    β_j = 2, if 11 ≤ j ≤ 15;
    β_j = 0, if 16 ≤ j ≤ 40.
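For concreteness, the sketch below generates one data set in the spirit of Example 1 and evaluates the RMSE in (20) for a given estimate; the definition of the signal-to-noise ratio as β₀ᵀΣβ₀/σ² is our assumption.

```python
import numpy as np

rng = np.random.default_rng(2)

def make_example1(n, rho=0.5, snr=3.0):
    """One synthetic data set in the spirit of Example 1 (SNR definition assumed)."""
    beta0 = np.array([3.0, 1.5, 0, 0, 2.0, 0, 0, 0])
    p = len(beta0)
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # AR(1) correlation
    sigma2 = beta0 @ Sigma @ beta0 / snr
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    y = X @ beta0 + rng.normal(scale=np.sqrt(sigma2), size=n)
    return X, y, beta0, sigma2

def rmse(X_test, beta_hat, beta0, sigma2):
    """Relative mean-square error as in (20)."""
    return np.sum((X_test @ (beta_hat - beta0)) ** 2) / sigma2

X_tr, y_tr, beta0, sigma2 = make_example1(20)    # training set
X_te, _, _, _ = make_example1(200)               # test set
```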

We repeat each simulation 50 times and report the average RMSEs and their standard errors. We also record the average false positive rate (FPR, the proportion of unimportant variables incorrectly selected) and false negative rate (FNR, the proportion of important variables incorrectly removed). In all examples, the self-adaptive lasso is compared with the lasso, the Bayesian lasso, the adaptive lasso and the elastic net. The results are summarized in Table 1. To demonstrate the effectiveness of the self-adaptive lasso for statistical inference, we randomly pick the Gibbs samples from one of the 50 repetitions and calculate the Bayesian credible intervals and p-values of the β_j's. Due to limited space, we only exhibit the inference results for Examples 1 and 2, shown in Figures 1 and 2, respectively.

As Table 1 shows, the self-adaptive lasso produces the smallest RMSEs in all examples. Compared with the lasso, the self-adaptive lasso produces a better false positive rate in Examples 1–3 and a comparable or better false negative rate in Examples 1, 2 and 4. This demonstrates that the flexibility of the tuning parameters usually helps enhance performance. Generally speaking, the self-adaptive lasso exhibits promising performance in terms of both variable selection (measured by FPR and FNR) and prediction accuracy (measured by RMSE).

Figure 1 shows the Bayesian credible intervals and p-values estimated by the self-adaptive lasso for Example 1. The Bayesian p-values of β_1, β_2 and β_5 are extremely close to zero, indicating the significance of these estimates. The statistical inference results for Example 2 are shown in Figure 2. In this example, variables 21–40 are designed to be unimportant, with zero coefficients. From Figure 2 we can see that although β_22, β_26, β_31 and β_37 are selected by the self-adaptive lasso, these variables are not significantly important, because their coefficients have relatively high p-values (larger than 0.01). On the other hand, most coefficients of the truly important variables 1–20 (except β_1 and β_6) exhibit significance in Figure 2. Therefore, the statistical inference of the self-adaptive lasso can help post-screen the selected variables.

Table 1: Results for Examples 1–4. All results are averaged over 50 replications, with the associated standard deviations recorded in parentheses. RMSE is the relative mean-square error, FPR is the false positive rate and FNR is the false negative rate.

    Example   Method               RMSE      FPR       FNR
    1         Lasso                (0.210)   (0.305)   (0.269)
              Adaptive Lasso       (0.206)   (0.229)   (0.238)
              Elastic Net          (0.203)   (0.284)   (0.252)
              Bayesian Lasso       (0.183)   (0.193)   (0.245)
              Self-adaptive Lasso  (0.099)   (0.232)   (0.243)
    2         Lasso                (0.175)   (0.167)   (0.051)
              Adaptive Lasso       (0.164)   (0.190)   (0.082)
              Elastic Net          (0.176)   (0.171)   (0.052)
              Bayesian Lasso       (0.192)   (0.095)   (0.085)
              Self-adaptive Lasso  (0.117)   (0.165)   (0.069)
    3         Lasso                (0.193)   (0.142)   (0.000)
              Adaptive Lasso       (0.207)   (0.203)   (0.010)
              Elastic Net          (0.188)   (0.150)   (0.000)
              Bayesian Lasso       (0.749)   (0.129)   (0.150)
              Self-adaptive Lasso  (0.293)   (0.143)   (0.041)
    4         Lasso                (0.152)   (0.179)   (0.060)
              Adaptive Lasso       (0.120)   (0.125)   (0.090)
              Elastic Net          (0.139)   (0.211)   (0.281)
              Bayesian Lasso       (0.149)   (0.107)   (0.122)
              Self-adaptive Lasso  (0.093)   (0.125)   (0.114)

Figure 1: The statistical inference results for Example 1. Panel (A): boxplots of the 95% Bayesian credible intervals for the selected variables, where the ×'s mark the true values of the β_j's. Panel (B): the Bayesian p-values for the selected variables. Note that the vertical coordinate is −log_10(p-value).

4 Real Data Analysis

In this section, we apply the self-adaptive lasso to two real data sets. We compare its prediction accuracy with that of other methods and conduct statistical inference for the selected variables.

Prostate Cancer Data. The prostate cancer data set (Stamey et al., 1989) was used by Tibshirani (1996) to evaluate the performance of lasso. It examines the relationship between the level of prostate-specific antigen and eight clinical measures from patients waiting for radical prostatectomy. These factors are log(cancer volume) (lcavol), log(prostate weight) (lweight), age, log(benign prostatic hyperplasia amount) (lbph), seminal vesicle invasion (svi), log(capsular penetration) (lcp), Gleason score (gleason) and percentage of Gleason scores 4 or 5 (pgg45). The original data set consists of 97 observations; we randomly split them into a training set (37 observations), a validation set (30 observations) and a test set (30 observations).

Figure 2: The statistical inference results for Example 2.

Diabetes Data. The diabetes data set was used by Efron et al. (2004) to illustrate the LARS algorithm. It consists of a response measuring disease progression one year after baseline and ten baseline predictors: age, sex, body mass index (bmi), blood pressure (map), and six blood serum measurements (denoted tc, ldl, hdl, tch, ltg, glu). We randomly split the 442 diabetes patients into three subsets: a training set with 50 patients, a validation set with 150 patients and a test set with 242 patients.

For both the prostate cancer data and the diabetes data, the prediction errors on the test set are listed in Table 2. In the prostate example, the self-adaptive lasso produces the lowest prediction error, and the prediction error of the Bayesian lasso is lower than those of the lasso, the adaptive lasso and the elastic net. In the diabetes example, the differences among the prediction errors of these methods are not as pronounced as in the prostate example: the prediction error of the self-adaptive lasso is slightly lower than those of the other methods, and the Bayesian lasso and the elastic net give competitive results.

In the prostate example, the variables lcavol, lweight and pgg45 are selected. We plot the 95% Bayesian credible intervals and the p-values for the coefficients of these variables in Figure 3. The credible interval of pgg45 is close to zero and its p-value is relatively large, which suggests treating pgg45 as unimportant to the regression model. On the other hand, Figure 4 illustrates the statistical inference results for the diabetes example, where the variables bmi, map and ltg are selected. The p-values show that map is not as significant as the other two selected variables.

5 Extensions to Elastic Net and Fused Lasso

The original lasso (Tibshirani, 1996) has several important extensions, such as the elastic net (Zou and Hastie, 2005) and the fused lasso (Tibshirani et al., 2005). The elastic net uses a mixture of the l_1 norm and the l_2 norm to penalize the regression coefficients. Unlike the lasso, the number of variables selected by the elastic net is no longer limited by the sample size, thanks to the l_2-norm penalty. The l_2-norm penalty also helps produce a grouping effect, i.e., it tends to select or remove highly correlated variables simultaneously.

Table 2: The prediction errors on the test set for the prostate data and the diabetes data.

    Data      Method               Test Error
    Prostate  Lasso                3.11
              Adaptive Lasso       2.82
              Elastic Net          3.07
              Bayesian Lasso       2.38
              Self-adaptive Lasso  1.58
    Diabetes  Lasso                0.56
              Adaptive Lasso       0.55
              Elastic Net          0.52
              Bayesian Lasso       0.52
              Self-adaptive Lasso  0.51

As another development of the lasso, the fused lasso is usually applied to data with ordered variables, such as spectral data. It consists of a mixture of an l_1 penalty and a fusion penalty. The fusion penalty shrinks the l_1 norms of the differences between the coefficients of consecutive variables, and thus encourages neighboring variables to have similar coefficients.

The proposed self-adaptive lasso method and its Bayesian estimation can also be extended to the elastic net and the fused lasso. The modification of the Bayesian estimation mainly comes from a redefinition of the full conditional distribution of the β_j's. We give the details in Sections 5.1 and 5.2.

5.1 Self-adaptive Elastic Net

The self-adaptive elastic net is defined by the following criterion:

    min_β Σ_{i=1}^n (y_i − Σ_{j=1}^p β_j x_{i,j})² + Σ_{j=1}^p (λ_{1,j} |β_j| + λ_{2,j} β_j²).    (21)

In criterion (21) there are two sets of tuning parameters, the λ_{1,j}'s and the λ_{2,j}'s, associated with the l_1-norm and l_2-norm penalties, respectively.

Figure 3: The statistical inference results for the prostate example. Panel (A): the 95% Bayesian credible intervals for the selected variables. Panel (B): the p-values for the selected variables. Note that the vertical coordinate is −log_10(p-value).

Analogous to the model defined in (7)–(11), the corresponding Bayesian model for the self-adaptive elastic net is

    y_i | x_i, β, σ² ~ N(x_i^T β, σ²), i = 1, ..., n,    (22)
    β_j | λ_{1,j}, λ_{2,j}, σ² ~ ElasticNet(σ²/λ_{1,j}, 2σ²/λ_{2,j}), j = 1, ..., p,    (23)
    λ_{i,j} ~ G(a_i, b_i), i = 1, 2; j = 1, ..., p,    (24)
    σ² ~ G^{−1}(a_σ, b_σ),    (25)

where ElasticNet(γ_1, γ_2) denotes the elastic net distribution with density f(z) ∝ exp{−(|z|/γ_1 + z²/γ_2)}, which is the kernel of a mixture of truncated normal distributions. Denote λ_i = (λ_{i,1}, ..., λ_{i,p})^T, i = 1, 2.

Figure 4: The statistical inference results for the diabetes example.

Similar to the case of the self-adaptive lasso, we can use a Gibbs sampling algorithm to draw samples from the joint posterior π(β, λ_1, λ_2, σ² | Y, X) and select (λ_1, λ_2, σ²) using a separate validation set. Unlike the self-adaptive lasso, however, the full conditional distribution of β_j given β_{−j} is modified as follows.

Proposition 3 The full conditional distribution of β_j under the elastic net prior is

    β_j | β_{−j}, λ_1, λ_2, σ², Y, X ~ p_{0+} N_{0+}( (Σ_{i=1}^n ε_{i,j} x_{i,j} − λ_{1,j})/(n + λ_{2,j}), σ²/(n + λ_{2,j}) ) + p_{0−} N_{0−}( (Σ_{i=1}^n ε_{i,j} x_{i,j} + λ_{1,j})/(n + λ_{2,j}), σ²/(n + λ_{2,j}) ),

where p_{0±} = A_{0±}/(A_{0+} + A_{0−}) and

    A_{0±} = exp{ (Σ_{i=1}^n ε_{i,j} x_{i,j} ∓ λ_{1,j})² / (2σ²(n + λ_{2,j})) } Φ( ±(Σ_{i=1}^n ε_{i,j} x_{i,j} ∓ λ_{1,j}) / (σ √(n + λ_{2,j})) ).

We can see that the full conditional distribution of β_j in the self-adaptive elastic net is also a mixture of two truncated normals, but with modified means and variances.
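Relative to the self-adaptive lasso sampler, the only change is that the means and variances of the two truncated normal components are shrunk by n + λ_{2,j}. A sketch of the component parameters in Proposition 3, computed on the log scale as before (the exact form of A_{0±} used here is our reading of Proposition 3 and should be treated as an assumption):

```python
import numpy as np
from scipy.stats import norm
from scipy.special import expit

def enet_component_params(eps_j, x_j, lam1_j, lam2_j, sigma2):
    """Means, common variance and mixing weight of the two truncated normals in Prop. 3."""
    n = len(x_j)
    s = eps_j @ x_j                              # sum_i eps_{i,j} x_{i,j}
    denom = n + lam2_j
    mu_plus = (s - lam1_j) / denom               # mean of the positive component
    mu_minus = (s + lam1_j) / denom              # mean of the negative component
    var = sigma2 / denom                         # common variance
    sd = np.sqrt(sigma2 * denom)                 # sigma * sqrt(n + lam2_j)
    log_A_plus = (s - lam1_j) ** 2 / (2 * sigma2 * denom) + norm.logcdf((s - lam1_j) / sd)
    log_A_minus = (s + lam1_j) ** 2 / (2 * sigma2 * denom) + norm.logcdf(-(s + lam1_j) / sd)
    p_plus = expit(log_A_plus - log_A_minus)     # p_{0+} = A_{0+}/(A_{0+} + A_{0-})
    return p_plus, mu_plus, mu_minus, var
```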

5.2 Self-adaptive Fused Lasso

We define the self-adaptive fused lasso as

    min_β Σ_{i=1}^n (y_i − Σ_{j=1}^p β_j x_{i,j})² + Σ_{j=1}^p λ_{1,j} |β_j| + Σ_{j=2}^p λ_{2,j} |β_j − β_{j−1}|.    (26)

The Bayesian model for the self-adaptive fused lasso is the same as that for the self-adaptive elastic net, except that formulas (23) and (24) are replaced by

    β | λ_1, λ_2, σ² ~ FusedLasso(σ²/λ_1, σ²/λ_2),    (27)
    λ_{1,j} ~ G(a_1, b_1), j = 1, ..., p;  λ_{2,j} ~ G(a_2, b_2), j = 2, ..., p.    (28)

In (27), FusedLasso(σ²/λ_1, σ²/λ_2) denotes the fused lasso distribution, with density

    f(z) ∝ exp{ −( Σ_{j=1}^p λ_{1,j} |z_j| + Σ_{j=2}^p λ_{2,j} |z_j − z_{j−1}| ) / σ² },    (29)

where z = (z_1, ..., z_p)^T. Let N_{[l,u]}(µ, σ²) be a truncated normal distribution with mean µ and variance σ², lower bound l and upper bound u; we allow l to be −∞ and u to be +∞. Let t_0 = β_0 = −∞, t_4 = β_{p+1} = +∞, and let t_1, t_2, t_3 ∈ {0, β_{j−1}, β_{j+1}} satisfy t_1 ≤ t_2 ≤ t_3. Denote Δ_{1,s} = 2I(t_s ≤ 0) − 1, Δ_{2,s} = 2I(t_s ≤ β_{j−1}) − 1 and Δ_{3,s} = 2I(t_{s−1} ≥ β_{j+1}) − 1. We set p_s = 0 if t_{s−1} = t_s. In addition, we define

    µ_s = Σ_{i=1}^n ε_{i,j} x_{i,j} + Δ_{1,s} λ_{1,j} + Δ_{2,s} λ_{2,j} + Δ_{3,s} λ_{2,j+1},

with Σ_{s=1}^4 p_s = 1 and

    p_s ∝ exp{ µ_s² / (2σ²n) } [ Φ( √n (t_s − µ_s/n)/σ ) − Φ( √n (t_{s−1} − µ_s/n)/σ ) ].    (30)

The following proposition gives a way to draw the β_j's in the Gibbs sampler.

Proposition 4 The full conditional distribution of β_j under the fused lasso prior is

    β_j | β_{−j}, λ_1, λ_2, σ², Y, X ~ Σ_{s=1}^4 p_s N_{[t_{s−1}, t_s]}( µ_s/n, σ²/n ).    (31)

From (31), we can see that the full conditional distribution of β_j given β_{−j} follows a mixture of truncated normals with up to four components.

6 Conclusion

In this paper, we have proposed a self-adaptive lasso method for variable selection in regression problems. This method assigns a specific tuning parameter to each regression coefficient. We have also developed a Gibbs sampling algorithm that estimates the regression coefficients and selects the tuning parameters simultaneously, and that provides a way to conduct statistical inference. We illustrate the advantages of the new method on both simulated and real examples. Finally, we extend the idea of the self-adaptive lasso and its Bayesian model to the elastic net and the fused lasso.

When this paper was in final preparation, we found that Hans (2009) proposed a Gibbs sampling algorithm to estimate lasso. That algorithm shares a similar flavor with the Gibbs sampling algorithm proposed in our paper, but it concerns the Bayesian estimation of the original lasso with a single tuning parameter and the related inference based on the posterior mean. Our independent work on the proposed Gibbs sampling algorithm originated from a completely different motivation, namely providing a feasible way to select the p tuning parameters in the self-adaptive lasso.

References

Breiman, L. (1995), Better subset regression using the nonnegative garrote, Technometrics, 37.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004), Least angle regression, Annals of Statistics, 32.

Fan, J. and Li, R. (2001), Variable selection via nonconcave penalized likelihood and its oracle properties, Journal of the American Statistical Association, 96.

Hans, C. (2009), Bayesian lasso regression, Biometrika, to appear.

Hastie, T., Tibshirani, R., and Friedman, J. (2001), The Elements of Statistical Learning, Springer, New York.

Kyung, M., Gill, J., Ghosh, M., and Casella, G. (2009), Penalized regression, standard errors and Bayesian lassos, Technical report, Department of Statistics, University of Florida.

Meinshausen, N. (2007), Relaxed lasso, Computational Statistics and Data Analysis, 52.

Park, T. and Casella, G. (2008), The Bayesian lasso, Journal of the American Statistical Association, 103.

Radchenko, P. and James, G. (2008), Variable inclusion and shrinkage algorithms, Journal of the American Statistical Association, 103.

Robert, C. P. (1995), Simulation of truncated normal variables, Statistics and Computing, 5.

Stamey, T., Kabalin, J., McNeal, J., Johnstone, I., Freiha, F., Redwine, E., and Yang, N. (1989), Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate. II. Radical prostatectomy treated patients, The Journal of Urology, 141.

Tibshirani, R. (1996), Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society, Series B, 58.

Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., and Knight, K. (2005), Sparsity and smoothness via the fused lasso, Journal of the Royal Statistical Society, Series B, 67.

van Dyk, D. and Meng, X. (2001), The art of data augmentation, Journal of Computational and Graphical Statistics, 10.

Wang, S., Nan, B., Rosset, S., and Zhu, J. (2008), Random lasso, Technical report, Department of Biostatistics, University of Michigan.

Zou, H. (2006), The adaptive LASSO and its oracle properties, Journal of the American Statistical Association, 101.

Zou, H. and Hastie, T. (2005), Regularization and variable selection via the elastic net, Journal of the Royal Statistical Society, Series B, 67.

7 Appendix

1. Proof of Proposition 1. Note that X is standardized, i.e., for any j = 1, ..., p we have Σ_{i=1}^n x_{i,j}² = n. Then

    π(β_j | β_{−j}, Y, X, λ, σ²)
    ∝ exp{ −Σ_{i=1}^n (y_i − Σ_{k≠j} β_k x_{i,k} − β_j x_{i,j})² / (2σ²) } exp{ −λ_j |β_j| / (2σ²) }
    = exp{ −Σ_{i=1}^n (ε_{i,j} − β_j x_{i,j})² / (2σ²) } exp{ −λ_j |β_j| / (2σ²) }
    ∝ exp{ −( Σ_{i=1}^n x_{i,j}² β_j² − 2 Σ_{i=1}^n ε_{i,j} x_{i,j} β_j + λ_j |β_j| ) / (2σ²) }
    = exp{ −( n β_j² − 2 Σ_{i=1}^n ε_{i,j} x_{i,j} β_j + λ_j |β_j| ) / (2σ²) }
    ∝ exp{ −( n β_j² − 2 (Σ_{i=1}^n ε_{i,j} x_{i,j} − λ_j/2) β_j ) / (2σ²) } I(β_j > 0)
      + exp{ −( n β_j² − 2 (Σ_{i=1}^n ε_{i,j} x_{i,j} + λ_j/2) β_j ) / (2σ²) } I(β_j ≤ 0)
    ∝ A_{0+} · φ( √n (β_j − µ_+)/σ ) I(β_j > 0) / [ (σ/√n) Φ( √n µ_+/σ ) ]
      + A_{0−} · φ( √n (β_j − µ_−)/σ ) I(β_j ≤ 0) / [ (σ/√n) Φ( −√n µ_−/σ ) ],

where µ_± = (Σ_{i=1}^n ε_{i,j} x_{i,j} ∓ λ_j/2)/n and A_{0±} is defined in equation (12). This is exactly the mixture (13).

2. Proof of Proposition 4. We first show the case 0 ≤ β_{j−1} ≤ β_{j+1}, where t_1 = 0, t_2 = β_{j−1} and t_3 = β_{j+1}. The full conditional distribution of β_j given β_{−j}

is

    π(β_j | β_{−j}, Y, X, λ_1, λ_2, σ²)
    ∝ exp{ −Σ_{i=1}^n (ε_{i,j} − β_j x_{i,j})² / (2σ²) } exp{ −( λ_{1,j} |β_j| + λ_{2,j} |β_j − β_{j−1}| + λ_{2,j+1} |β_{j+1} − β_j| ) / σ² }
    ∝ exp{ −( n β_j² − 2 Σ_{i=1}^n ε_{i,j} x_{i,j} β_j + 2λ_{1,j} |β_j| + 2λ_{2,j} |β_j − β_{j−1}| + 2λ_{2,j+1} |β_{j+1} − β_j| ) / (2σ²) }
    ∝ exp{ −( n β_j² − 2 (Σ_{i=1}^n ε_{i,j} x_{i,j} + λ_{1,j} + λ_{2,j} − λ_{2,j+1}) β_j ) / (2σ²) } I(β_j ≤ 0)
      + exp{ −( n β_j² − 2 (Σ_{i=1}^n ε_{i,j} x_{i,j} − λ_{1,j} + λ_{2,j} − λ_{2,j+1}) β_j ) / (2σ²) } I(0 < β_j ≤ β_{j−1})
      + exp{ −( n β_j² − 2 (Σ_{i=1}^n ε_{i,j} x_{i,j} − λ_{1,j} − λ_{2,j} − λ_{2,j+1}) β_j ) / (2σ²) } I(β_{j−1} < β_j ≤ β_{j+1})
      + exp{ −( n β_j² − 2 (Σ_{i=1}^n ε_{i,j} x_{i,j} − λ_{1,j} − λ_{2,j} + λ_{2,j+1}) β_j ) / (2σ²) } I(β_j > β_{j+1})
    ∝ Σ_{s=1}^4 p_s · φ( √n (β_j − µ_s/n)/σ ) I(t_{s−1} < β_j ≤ t_s) / [ (σ/√n) ( Φ( √n (t_s − µ_s/n)/σ ) − Φ( √n (t_{s−1} − µ_s/n)/σ ) ) ].

Similarly, the claim holds when the order of 0, β_{j−1} and β_{j+1} changes.


Regression Shrinkage and Selection via the Lasso

Regression Shrinkage and Selection via the Lasso Regression Shrinkage and Selection via the Lasso ROBERT TIBSHIRANI, 1996 Presenter: Guiyun Feng April 27 () 1 / 20 Motivation Estimation in Linear Models: y = β T x + ɛ. data (x i, y i ), i = 1, 2,...,

More information

Bayesian linear regression

Bayesian linear regression Bayesian linear regression Linear regression is the basis of most statistical modeling. The model is Y i = X T i β + ε i, where Y i is the continuous response X i = (X i1,..., X ip ) T is the corresponding

More information

TECHNICAL REPORT NO. 1091r. A Note on the Lasso and Related Procedures in Model Selection

TECHNICAL REPORT NO. 1091r. A Note on the Lasso and Related Procedures in Model Selection DEPARTMENT OF STATISTICS University of Wisconsin 1210 West Dayton St. Madison, WI 53706 TECHNICAL REPORT NO. 1091r April 2004, Revised December 2004 A Note on the Lasso and Related Procedures in Model

More information

Linear regression methods

Linear regression methods Linear regression methods Most of our intuition about statistical methods stem from linear regression. For observations i = 1,..., n, the model is Y i = p X ij β j + ε i, j=1 where Y i is the response

More information

Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables

Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables LIB-MA, FSSM Cadi Ayyad University (Morocco) COMPSTAT 2010 Paris, August 22-27, 2010 Motivations Fan and Li (2001), Zou and Li (2008)

More information

Least Angle Regression, Forward Stagewise and the Lasso

Least Angle Regression, Forward Stagewise and the Lasso January 2005 Rob Tibshirani, Stanford 1 Least Angle Regression, Forward Stagewise and the Lasso Brad Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani Stanford University Annals of Statistics,

More information

Variable Selection for Nonparametric Quantile. Regression via Smoothing Spline ANOVA

Variable Selection for Nonparametric Quantile. Regression via Smoothing Spline ANOVA Variable Selection for Nonparametric Quantile Regression via Smoothing Spline ANOVA Chen-Yen Lin, Hao Helen Zhang, Howard D. Bondell and Hui Zou February 15, 2012 Author s Footnote: Chen-Yen Lin (E-mail:

More information

Least Absolute Shrinkage is Equivalent to Quadratic Penalization

Least Absolute Shrinkage is Equivalent to Quadratic Penalization Least Absolute Shrinkage is Equivalent to Quadratic Penalization Yves Grandvalet Heudiasyc, UMR CNRS 6599, Université de Technologie de Compiègne, BP 20.529, 60205 Compiègne Cedex, France Yves.Grandvalet@hds.utc.fr

More information

Tutz, Binder: Boosting Ridge Regression

Tutz, Binder: Boosting Ridge Regression Tutz, Binder: Boosting Ridge Regression Sonderforschungsbereich 386, Paper 418 (2005) Online unter: http://epub.ub.uni-muenchen.de/ Projektpartner Boosting Ridge Regression Gerhard Tutz 1 & Harald Binder

More information

SCIENCE CHINA Information Sciences. Received December 22, 2008; accepted February 26, 2009; published online May 8, 2010

SCIENCE CHINA Information Sciences. Received December 22, 2008; accepted February 26, 2009; published online May 8, 2010 . RESEARCH PAPERS. SCIENCE CHINA Information Sciences June 2010 Vol. 53 No. 6: 1159 1169 doi: 10.1007/s11432-010-0090-0 L 1/2 regularization XU ZongBen 1, ZHANG Hai 1,2, WANG Yao 1, CHANG XiangYu 1 & LIANG

More information

Variable Selection in Restricted Linear Regression Models. Y. Tuaç 1 and O. Arslan 1

Variable Selection in Restricted Linear Regression Models. Y. Tuaç 1 and O. Arslan 1 Variable Selection in Restricted Linear Regression Models Y. Tuaç 1 and O. Arslan 1 Ankara University, Faculty of Science, Department of Statistics, 06100 Ankara/Turkey ytuac@ankara.edu.tr, oarslan@ankara.edu.tr

More information

A Significance Test for the Lasso

A Significance Test for the Lasso A Significance Test for the Lasso Lockhart R, Taylor J, Tibshirani R, and Tibshirani R Ashley Petersen June 6, 2013 1 Motivation Problem: Many clinical covariates which are important to a certain medical

More information

Proteomics and Variable Selection

Proteomics and Variable Selection Proteomics and Variable Selection p. 1/55 Proteomics and Variable Selection Alex Lewin With thanks to Paul Kirk for some graphs Department of Epidemiology and Biostatistics, School of Public Health, Imperial

More information

Two Tales of Variable Selection for High Dimensional Regression: Screening and Model Building

Two Tales of Variable Selection for High Dimensional Regression: Screening and Model Building Two Tales of Variable Selection for High Dimensional Regression: Screening and Model Building Cong Liu, Tao Shi and Yoonkyung Lee Department of Statistics, The Ohio State University Abstract Variable selection

More information

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting

More information

Exploratory quantile regression with many covariates: An application to adverse birth outcomes

Exploratory quantile regression with many covariates: An application to adverse birth outcomes Exploratory quantile regression with many covariates: An application to adverse birth outcomes June 3, 2011 eappendix 30 Percent of Total 20 10 0 0 1000 2000 3000 4000 5000 Birth weights efigure 1: Histogram

More information

High-dimensional covariance estimation based on Gaussian graphical models

High-dimensional covariance estimation based on Gaussian graphical models High-dimensional covariance estimation based on Gaussian graphical models Shuheng Zhou Department of Statistics, The University of Michigan, Ann Arbor IMA workshop on High Dimensional Phenomena Sept. 26,

More information

Cross-Validation with Confidence

Cross-Validation with Confidence Cross-Validation with Confidence Jing Lei Department of Statistics, Carnegie Mellon University UMN Statistics Seminar, Mar 30, 2017 Overview Parameter est. Model selection Point est. MLE, M-est.,... Cross-validation

More information

In Search of Desirable Compounds

In Search of Desirable Compounds In Search of Desirable Compounds Adrijo Chakraborty University of Georgia Email: adrijoc@uga.edu Abhyuday Mandal University of Georgia Email: amandal@stat.uga.edu Kjell Johnson Arbor Analytics, LLC Email:

More information

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider the problem of

More information

Consistent high-dimensional Bayesian variable selection via penalized credible regions

Consistent high-dimensional Bayesian variable selection via penalized credible regions Consistent high-dimensional Bayesian variable selection via penalized credible regions Howard Bondell bondell@stat.ncsu.edu Joint work with Brian Reich Howard Bondell p. 1 Outline High-Dimensional Variable

More information

Spatial Lasso with Application to GIS Model Selection. F. Jay Breidt Colorado State University

Spatial Lasso with Application to GIS Model Selection. F. Jay Breidt Colorado State University Spatial Lasso with Application to GIS Model Selection F. Jay Breidt Colorado State University with Hsin-Cheng Huang, Nan-Jung Hsu, and Dave Theobald September 25 The work reported here was developed under

More information

On High-Dimensional Cross-Validation

On High-Dimensional Cross-Validation On High-Dimensional Cross-Validation BY WEI-CHENG HSIAO Institute of Statistical Science, Academia Sinica, 128 Academia Road, Section 2, Nankang, Taipei 11529, Taiwan hsiaowc@stat.sinica.edu.tw 5 WEI-YING

More information

Post-selection inference with an application to internal inference

Post-selection inference with an application to internal inference Post-selection inference with an application to internal inference Robert Tibshirani, Stanford University November 23, 2015 Seattle Symposium in Biostatistics, 2015 Joint work with Sam Gross, Will Fithian,

More information

High dimensional thresholded regression and shrinkage effect

High dimensional thresholded regression and shrinkage effect J. R. Statist. Soc. B (014) 76, Part 3, pp. 67 649 High dimensional thresholded regression and shrinkage effect Zemin Zheng, Yingying Fan and Jinchi Lv University of Southern California, Los Angeles, USA

More information

CS242: Probabilistic Graphical Models Lecture 4A: MAP Estimation & Graph Structure Learning

CS242: Probabilistic Graphical Models Lecture 4A: MAP Estimation & Graph Structure Learning CS242: Probabilistic Graphical Models Lecture 4A: MAP Estimation & Graph Structure Learning Professor Erik Sudderth Brown University Computer Science October 4, 2016 Some figures and materials courtesy

More information

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 10

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 10 COS53: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 0 MELISSA CARROLL, LINJIE LUO. BIAS-VARIANCE TRADE-OFF (CONTINUED FROM LAST LECTURE) If V = (X n, Y n )} are observed data, the linear regression problem

More information

25 : Graphical induced structured input/output models

25 : Graphical induced structured input/output models 10-708: Probabilistic Graphical Models 10-708, Spring 2016 25 : Graphical induced structured input/output models Lecturer: Eric P. Xing Scribes: Raied Aljadaany, Shi Zong, Chenchen Zhu Disclaimer: A large

More information

An Algorithm for Bayesian Variable Selection in High-dimensional Generalized Linear Models

An Algorithm for Bayesian Variable Selection in High-dimensional Generalized Linear Models Proceedings 59th ISI World Statistics Congress, 25-30 August 2013, Hong Kong (Session CPS023) p.3938 An Algorithm for Bayesian Variable Selection in High-dimensional Generalized Linear Models Vitara Pungpapong

More information

spikeslab: Prediction and Variable Selection Using Spike and Slab Regression

spikeslab: Prediction and Variable Selection Using Spike and Slab Regression 68 CONTRIBUTED RESEARCH ARTICLES spikeslab: Prediction and Variable Selection Using Spike and Slab Regression by Hemant Ishwaran, Udaya B. Kogalur and J. Sunil Rao Abstract Weighted generalized ridge regression

More information

Post-selection Inference for Forward Stepwise and Least Angle Regression

Post-selection Inference for Forward Stepwise and Least Angle Regression Post-selection Inference for Forward Stepwise and Least Angle Regression Ryan & Rob Tibshirani Carnegie Mellon University & Stanford University Joint work with Jonathon Taylor, Richard Lockhart September

More information

Shrinkage Methods: Ridge and Lasso

Shrinkage Methods: Ridge and Lasso Shrinkage Methods: Ridge and Lasso Jonathan Hersh 1 Chapman University, Argyros School of Business hersh@chapman.edu February 27, 2019 J.Hersh (Chapman) Ridge & Lasso February 27, 2019 1 / 43 1 Intro and

More information

The lasso, persistence, and cross-validation

The lasso, persistence, and cross-validation The lasso, persistence, and cross-validation Daniel J. McDonald Department of Statistics Indiana University http://www.stat.cmu.edu/ danielmc Joint work with: Darren Homrighausen Colorado State University

More information

Exploiting Covariate Similarity in Sparse Regression via the Pairwise Elastic Net

Exploiting Covariate Similarity in Sparse Regression via the Pairwise Elastic Net Exploiting Covariate Similarity in Sparse Regression via the Pairwise Elastic Net Alexander Lorbert, David Eis, Victoria Kostina, David M. Blei, Peter J. Ramadge Dept. of Electrical Engineering, Dept.

More information

Prediction & Feature Selection in GLM

Prediction & Feature Selection in GLM Tarigan Statistical Consulting & Coaching statistical-coaching.ch Doctoral Program in Computer Science of the Universities of Fribourg, Geneva, Lausanne, Neuchâtel, Bern and the EPFL Hands-on Data Analysis

More information

Bayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence

Bayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence Bayesian Inference in GLMs Frequentists typically base inferences on MLEs, asymptotic confidence limits, and log-likelihood ratio tests Bayesians base inferences on the posterior distribution of the unknowns

More information

The Generalized Ridge Trace Plot: Visualizing Bias and Precision

The Generalized Ridge Trace Plot: Visualizing Bias and Precision The Generalized Ridge Trace Plot: Visualizing Bias and Precision Michael Friendly Psychology Department and Statistical Consulting Service York University 4700 Keele Street, Toronto, ON, Canada M3J 1P3

More information

Robust Variable Selection Methods for Grouped Data. Kristin Lee Seamon Lilly

Robust Variable Selection Methods for Grouped Data. Kristin Lee Seamon Lilly Robust Variable Selection Methods for Grouped Data by Kristin Lee Seamon Lilly A dissertation submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree

More information

Robust variable selection through MAVE

Robust variable selection through MAVE This is the author s final, peer-reviewed manuscript as accepted for publication. The publisher-formatted version may be available through the publisher s web site or your institution s library. Robust

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information