Self-adaptive Lasso and its Bayesian Estimation

Jian Kang¹ and Jian Guo²
1. Department of Biostatistics, University of Michigan
2. Department of Statistics, University of Michigan

Abstract

In this paper, we propose a self-adaptive lasso method for variable selection in regression problems. Unlike the popular lasso method, the proposed method introduces a specific tuning parameter for each regression coefficient. We model the self-adaptive lasso in a Bayesian framework and develop an efficient Gibbs sampling algorithm that automatically selects these tuning parameters and estimates the regression coefficients. The algorithm also makes it convenient to conduct statistical inference for the selected variables. Several synthetic and real examples demonstrate that the flexibility of the tuning parameters enhances the performance of the self-adaptive lasso in terms of both prediction and variable selection. Finally, we extend the self-adaptive lasso to the elastic net and the fused lasso.

Key Words: Bayesian modeling, Gibbs sampling, Lasso, Variable selection.

1 Introduction

In this paper, we consider a least squares regression problem with n observations (x_1, y_1), ..., (x_i, y_i), ..., (x_n, y_n), where x_i = (x_{i,1}, ..., x_{i,p}) is a p-dimensional vector of predictors and y_i is the associated response. To simplify notation, we write Y = (y_1, ..., y_n)^T and X = (x_1, ..., x_n)^T. Without loss of generality, we assume Y is centered (Σ_{i=1}^n y_i = 0) and X is standardized along its columns (Σ_{i=1}^n x_{i,j} = 0 and Σ_{i=1}^n x_{i,j}² = n for 1 ≤ j ≤ p). The response Y and the predictors X are related by the linear model

    Y = Xβ + ε,    (1)

where ε = (ε_1, ..., ε_n)^T are error terms and β = (β_1, ..., β_p)^T are regression coefficients. Since the response and the predictors are assumed to have zero means, the intercept term can be excluded from model (1).

The linear regression problem draws attention to two aspects: prediction accuracy and model interpretability. The former focuses on reducing prediction errors, whereas the latter concerns selecting important variables. The two goals can be achieved simultaneously by a family of approaches known as regularization methods (Breiman, 1995; Fan and Li, 2001; Meinshausen, 2007; Radchenko and James, 2008; Wang et al., 2008; Tibshirani, 1996; Tibshirani et al., 2005; Zou and Hastie, 2005; Zou, 2006). In particular, the lasso method proposed by Tibshirani (1996) has gained much attention in recent years. It penalizes the least squares loss function by the l_1 norm of the regression coefficients:

    min_β Σ_{i=1}^n (y_i − Σ_{j=1}^p β_j x_{i,j})² + λ Σ_{j=1}^p |β_j|,    (2)

where λ is a tuning parameter. Due to the singularity of the l_1 norm at zero, lasso continuously shrinks the estimated coefficients towards zero and pushes some of them exactly to zero when λ is large enough. The prediction accuracy of lasso can generally dominate that of ordinary least squares (OLS) regression in terms of mean squared error (MSE), because its penalty introduces some bias in exchange for a reduction in estimation variance (the bias-variance tradeoff).
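For reference, criterion (2) can be minimized with off-the-shelf software. A minimal sketch using scikit-learn, whose Lasso objective is (1/(2n))||Y − Xβ||² + α||β||₁, so that α corresponds to λ/(2n) in the notation of (2); the toy data below are our own illustration, not part of the paper:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Toy data: n = 50 observations, p = 8 standardized predictors.
rng = np.random.default_rng(0)
n, p = 50, 8
X = rng.standard_normal((n, p))
X = (X - X.mean(0)) / X.std(0)            # columns centered and scaled
beta_true = np.array([3.0, 1.5, 0, 0, 2.0, 0, 0, 0])
y = X @ beta_true + rng.normal(scale=3.0, size=n)
y = y - y.mean()                          # centered response, no intercept

lam = 10.0                                # tuning parameter in the notation of (2)
fit = Lasso(alpha=lam / (2 * n), fit_intercept=False).fit(X, y)
print("estimated coefficients:", np.round(fit.coef_, 3))
```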

We can see that lasso reduces to OLS regression when λ = 0. Variables with zero estimated coefficients are considered unimportant and are removed from the model.

Although lasso has been highly successful in many situations, it has a few limitations in practice. In this paper, we focus on two of them:

Limitation 1. Lasso uses a single tuning parameter λ to penalize all p regression coefficients β_j equally. This λ controls both the number of selected variables and the shrinkage level of the fitted regression coefficients. In practice, because of this single tuning parameter, lasso usually either includes a number of irrelevant variables in order to reduce the estimation bias, or over-shrinks the coefficients of the correct variables in order to produce a model of the correct size (Radchenko and James, 2008).

Limitation 2. The lasso criterion is an optimization problem with respect to the β_j's and thus provides only a point estimate of β. Nevertheless, one usually also needs to know the level of confidence in the estimates, such as a confidence interval (or credible interval) and a p-value.

The first limitation is partially addressed by the adaptive lasso (Zou, 2006), which penalizes a weighted l_1 norm of the regression coefficients:

    min_β Σ_{i=1}^n (y_i − Σ_{j=1}^p β_j x_{i,j})² + λ Σ_{j=1}^p w_j |β_j|,    (3)

where the w_j's are adaptive weights defined as w_j = |β̂_j^{ols}|^{−r} for some positive constant r, and β̂_j^{ols} is the OLS estimate of β_j. The intuition of the adaptive lasso is to give large weights to unimportant variables, so that their coefficients are heavily shrunk, and small weights to important variables, so that their coefficients are only slightly shrunk.
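The adaptive lasso estimate in (3) can be computed with the same lasso solver by folding the weights into the design matrix: scale column j by 1/w_j, fit an ordinary lasso, and rescale the fitted coefficients. A sketch under the assumption n > p so that the OLS fit exists (the function name is ours; it reuses the toy X, y from the previous snippet):

```python
import numpy as np
from sklearn.linear_model import Lasso

def adaptive_lasso(X, y, lam, r=1.0):
    """Adaptive lasso via column rescaling; assumes X has full column rank."""
    n, p = X.shape
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    w = 1.0 / np.abs(beta_ols) ** r            # weights w_j = |beta_ols_j|^{-r}
    X_scaled = X / w                           # column j multiplied by 1/w_j
    fit = Lasso(alpha=lam / (2 * n), fit_intercept=False).fit(X_scaled, y)
    return fit.coef_ / w                       # undo the rescaling

# Example call with the toy data generated above:
# print(np.round(adaptive_lasso(X, y, lam=10.0), 3))
```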

Theoretically, the adaptive lasso enjoys oracle properties (Fan and Li, 2001) that lasso does not have. Specifically, when p is fixed and n goes to infinity, with an appropriately selected λ the adaptive lasso selects the true underlying model with probability tending to one, and the corresponding estimated coefficients have the same asymptotic normal distribution as they would have if the true underlying model were provided in advance (Zou, 2006). Although the adaptive lasso has nice asymptotic properties, its finite-sample performance depends heavily on the quality of the OLS estimates in the weights w_j. In practice, the adaptive lasso may suffer when highly correlated variables cause collinearity, because the OLS estimates are ill-conditioned in this scenario.

In addition to the adaptive lasso, a few other methods in the literature attempt to address this limitation, for example the relaxed lasso (Meinshausen, 2007) and VISA (Radchenko and James, 2008). Both introduce two tuning parameters into the regression model: the first parameter distinguishes potentially important variables from unimportant ones, while the second controls the level of shrinkage on the selected variables. Unlike the relaxed lasso, which permanently excludes the variables removed by the first tuning parameter, VISA allows for the potential inclusion of all variables and thus has a chance to recover variables incorrectly removed by the first parameter.

The second limitation of lasso can be addressed by applying bootstrap sampling to lasso (Tibshirani, 1996; Wang et al., 2008). An alternative way to conduct statistical inference is to implement the idea of lasso in a Bayesian framework. Indeed, Tibshirani (1996) proposed a Bayesian interpretation of lasso. Suppose y_i | x_i ~ N(x_i β, σ²) and the regression coefficients follow a Laplace (double-exponential) distribution β_j ~ L(0, σ²/λ), where N(µ, σ²) denotes the normal distribution with mean µ and variance σ², and L(µ, γ) denotes the Laplace distribution with density f(x | µ, γ) = exp{−|x − µ|/γ}/(2γ). The log-posterior of β is then

    log π(β | X, Y, σ², λ) = −(1/(2σ²)) [ Σ_{i=1}^n (y_i − Σ_{j=1}^p β_j x_{i,j})² + 2λ Σ_{j=1}^p |β_j| ] + C,    (4)

where C = p log(λ) − (n/2) log(2π) − (n/2 + p) log(σ²) is a constant irrelevant to β. We can see that the mode of the log-posterior (4) is exactly the solution of the lasso criterion (2). For the parameters β_j, suppose a number of simulated samples have been drawn from the posterior (4); these samples can then be used to recover not only the posterior modes, but also Bayesian credible intervals and p-values. Unfortunately, directly sampling β from the posterior (4) is nontrivial. To address this problem, Park and Casella (2008) proposed the Bayesian lasso, in which the Laplace prior is represented as a scale mixture of normals with an exponential mixing density, i.e.,

    (a/2) exp(−a|z|) = ∫_0^∞ (1/√(2πs)) exp(−z²/(2s)) · (a²/2) exp(−a²s/2) ds = E_S[ φ(z/√S)/√S ].    (5)

In (5), S is a random variable with density (a²/2) exp(−a²s/2), and it can be regarded as missing data from the point of view of data augmentation (van Dyk and Meng, 2001). Conditioned on the missing data, the conditional posterior of β is a normal distribution (since both the conditional prior and the likelihood are normal). Making use of this property, Park and Casella (2008) developed a Gibbs sampling algorithm to draw samples in the augmented parameter space. The theoretical properties and some extensions of the Bayesian lasso were further discussed in Kyung et al. (2009).
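The representation (5) is easy to verify by simulation: drawing S from the exponential mixing density with rate a²/2 and then z | S ~ N(0, S) reproduces a Laplace draw with rate a. A small Monte Carlo check (the constants are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
a = 2.0                                    # Laplace rate parameter in (5)
m = 200_000

# Hierarchical draw: S ~ Exp(rate = a^2/2), then z | S ~ N(0, S).
S = rng.exponential(scale=2.0 / a**2, size=m)
z_mix = rng.normal(loc=0.0, scale=np.sqrt(S))

# Direct Laplace draw with density (a/2) exp(-a|z|), i.e. scale 1/a.
z_lap = rng.laplace(loc=0.0, scale=1.0 / a, size=m)

# The two samples should have matching moments (and matching histograms).
print(np.var(z_mix), np.var(z_lap))                    # both close to 2/a^2 = 0.5
print(np.mean(np.abs(z_mix)), np.mean(np.abs(z_lap)))  # both close to 1/a = 0.5
```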

In this paper, we propose a new variable selection method, the self-adaptive lasso, which attempts to address limitations 1 and 2 of lasso simultaneously. On the one hand, the self-adaptive lasso allows each coefficient β_j in the penalty to have its own tuning parameter λ_j, and numerical results demonstrate that this flexibility can help enhance prediction and variable selection performance (see Table 1, for example). On the other hand, the self-adaptive lasso criterion is estimated in an equivalent Bayesian framework and thus naturally provides credible intervals and p-values for the estimates. To implement the self-adaptive lasso, we develop a Gibbs sampling algorithm that simultaneously estimates the regression coefficients β_j and selects the optimal tuning parameters λ_j. Unlike the Bayesian lasso, which introduces extra latent variables, the Gibbs sampling algorithm proposed in this paper draws each β_j directly from its full conditional distribution, which is a mixture of two truncated normal distributions.

The remainder of the paper is organized as follows. Section 2 introduces the self-adaptive lasso and related algorithmic issues. Sections 3 and 4 use synthetic and real data, respectively, to evaluate the performance of the self-adaptive lasso. Section 5 extends the idea of the self-adaptive lasso to the elastic net and the fused lasso. Finally, some concluding remarks are drawn in Section 6.

2 Methodology

2.1 Self-adaptive Lasso

The self-adaptive lasso method proposed in this paper optimizes the following criterion:

    min_β Σ_{i=1}^n (y_i − Σ_{j=1}^p β_j x_{i,j})² + Σ_{j=1}^p λ_j |β_j|,    (6)

where λ_j is the tuning parameter for the regression coefficient β_j, 1 ≤ j ≤ p. The self-adaptive lasso defined in (6) generalizes the lasso problem in (2) by allowing each regression coefficient β_j in the penalty to have its own tuning parameter λ_j. The relaxation of the tuning parameters introduces more flexibility. For example, if we specify λ_j = 0 for all j ∈ A and λ_j = +∞ for all j ∉ A, where A = {1 ≤ j ≤ p : β_j^0 ≠ 0} is the index set of all truly nonzero coefficients, then criterion (6) reduces to an oracle regression problem (Tibshirani, 1996), i.e., the OLS regression on the subset of predictors with truly important coefficients. The oracle regression is an ideal variable selection method, but it is not realistic because we do not know the subset of truly nonzero coefficients; it is, however, often used as a benchmark in simulation studies. On the other hand, the self-adaptive lasso becomes the adaptive lasso (Zou, 2006) if we specify the tuning parameters as λ_j = λ w_j = λ/|β̂_j^{ols}|, 1 ≤ j ≤ p.

The adaptive lasso can be regarded as an attempt to approach the oracle regression, because it tends to heavily (slightly) penalize coefficients with small (large) magnitudes. As discussed in Section 1, however, the performance of the adaptive lasso depends heavily on the extra OLS estimates. Since the self-adaptive lasso has more flexible tuning parameters than the adaptive lasso, it has the chance to approach the oracle regression more closely if these tuning parameters are appropriately specified.

A direct way to do this is to search for the optimal tuning parameters λ_j over a grid in R^p such that the prediction error on an independent validation set is minimized. However, the search over such a p-dimensional grid is computationally infeasible for high-dimensional problems, because the number of grid nodes grows exponentially with p. To avoid this difficulty, we transform the self-adaptive lasso problem in (6) into the following Bayesian model and then jointly select the tuning parameters λ_j and estimate the linear coefficients β_j using Gibbs sampling. There are several advantages to doing so. First, the Gibbs sampling strategy makes the search for optimal tuning parameters computationally feasible. Second, unlike the adaptive lasso, the selected tuning parameters in the self-adaptive lasso do NOT depend on any extra estimator such as the OLS estimator, so its performance is not affected by a poor estimate from such an estimator. Finally, as discussed in Section 1, the simulated samples drawn from the posterior distribution provide a natural way to conduct statistical inference.

2.2 Bayesian Modeling

The Bayesian model of the self-adaptive lasso in (6) is

    y_i | x_i, β, σ² ~ N(x_i^T β, σ²), i = 1, ..., n,    (7)
    β_j | λ_j, σ² ~ L(0, 2σ²/λ_j), j = 1, ..., p.    (8)

This implies that the conditional posterior of β given σ² and λ = (λ_1, ..., λ_p) is

    π(β | X, Y, σ², λ) ∝ exp{ −(1/(2σ²)) [ Σ_{i=1}^n (y_i − Σ_{j=1}^p β_j x_{i,j})² + Σ_{j=1}^p λ_j |β_j| ] }.    (9)

Note that the mode of (9) is identical to the solution of (6). Given σ², the λ_j's describe the rate at which the β_j's approach 0. To jointly select the λ_j's and estimate the β_j's, we regard the λ_j's as parameters of (6) and estimate them together with the β_j's. In this sense, the λ_j's can be regarded as random effects, and we assign a prior to each λ_j,

    λ_j ~ G(a_λ, b_λ), j = 1, ..., p,    (10)

where G(a_λ, b_λ) denotes the Gamma distribution with shape a_λ and rate b_λ. In addition, σ² specifies the variance of the random errors in the model and we assign it an inverse Gamma prior,

    σ² ~ G^{−1}(a_σ, b_σ).    (11)

Jointly modeling the λ_j's and β_j's as in (7)–(11) allows us to sample these parameters from the joint posterior π(β, λ, σ² | Y, X) ∝ π(Y | X, β, λ, σ²) π(β | λ, σ²) π(λ) π(σ²).

2.3 Sampling from the Posterior

We develop a Gibbs sampling algorithm that iteratively draws samples of the β_j's, the λ_j's and σ² from the joint posterior π(β, λ, σ² | Y, X).

Sampling β

Since the β_j's are the parameters of primary interest, the core of the proposed Gibbs sampling algorithm is to draw the β_j's given the other parameters. Denote by φ(·) and Φ(·) the probability density function and the cumulative distribution function of the standard normal distribution, respectively. Let ε_{i,j} = y_i − Σ_{k≠j} β_k x_{i,k} and p_{0±} = A_{0±}/(A_{0+} + A_{0−}), where

    A_{0±} = exp{ (Σ_{i=1}^n ε_{i,j} x_{i,j} ∓ λ_j/2)² / (2σ²n) } Φ( ±(Σ_{i=1}^n ε_{i,j} x_{i,j} ∓ λ_j/2) / (√n σ) ).    (12)
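A practical note: the exponential factors in (12) can overflow for moderate n, so it is safer to form p_{0+} = A_{0+}/(A_{0+} + A_{0−}) on the log scale. A sketch of such a computation (the helper name and its arguments are ours; eps_j stands for the partial residual vector (ε_{1,j}, ..., ε_{n,j}) and x_j for the j-th column of X):

```python
import numpy as np
from scipy.stats import norm
from scipy.special import expit

def prob_positive(eps_j, x_j, lam_j, sigma2):
    """Log-scale evaluation of p_{0+} = A_{0+} / (A_{0+} + A_{0-}) from (12)."""
    n = len(x_j)
    s = eps_j @ x_j                       # sum_i eps_{i,j} x_{i,j}
    sd = np.sqrt(n * sigma2)              # sqrt(n) * sigma
    log_A_plus = (s - lam_j / 2) ** 2 / (2 * sigma2 * n) + norm.logcdf((s - lam_j / 2) / sd)
    log_A_minus = (s + lam_j / 2) ** 2 / (2 * sigma2 * n) + norm.logcdf(-(s + lam_j / 2) / sd)
    # p_{0+} = 1 / (1 + exp(log_A_minus - log_A_plus))
    return expit(log_A_plus - log_A_minus)
```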

In addition, we denote by N_{δ+}(µ, σ²) the normal distribution positively truncated at δ, with density function φ((x − µ)/σ) I(x > δ) / [σ Φ((µ − δ)/σ)]; similarly, we denote by N_{δ−}(µ, σ²) the normal distribution negatively truncated at δ, with density function φ((x − µ)/σ) I(x < −δ) / [σ Φ(−(µ + δ)/σ)]. The following proposition provides the full conditional posterior of β_j given β_{−j} and the other parameters.

Proposition 1 Given β_{−j} and the other parameters, the full conditional distribution of β_j is a mixture of two truncated normals:

    β_j | β_{−j}, Y, X, λ, σ² ~ p_{0+} N_{0+}( (Σ_{i=1}^n ε_{i,j} x_{i,j} − λ_j/2)/n, σ²/n ) + p_{0−} N_{0−}( (Σ_{i=1}^n ε_{i,j} x_{i,j} + λ_j/2)/n, σ²/n ).    (13)

The proof of Proposition 1 is in the Appendix. In practice, we generate a sample from (13) in two steps. In the first step, we randomly generate an indicator which is "+" with probability p_{0+} and "−" with probability p_{0−}. In the second step, we draw a sample β_j^# from N_{0+}((Σ_{i=1}^n ε_{i,j} x_{i,j} − λ_j/2)/n, σ²/n) if the indicator is "+", and from N_{0−}((Σ_{i=1}^n ε_{i,j} x_{i,j} + λ_j/2)/n, σ²/n) if the indicator is "−". Then β_j^# follows the mixture of two truncated normal distributions defined in (13).

Sampling λ and σ²

The full conditional distributions of λ and σ² can be calculated as follows.

Proposition 2 In the Gibbs sampling algorithm, the λ_j's and σ² can be sampled as

    λ_j | β, σ², Y, X ~ G( a_λ + 1, |β_j|/(2σ²) + b_λ ),    (14)
    σ² | β, λ, Y, X ~ G^{−1}( p + n/2 + a_σ + 1, (1/2) [ Σ_{i=1}^n (y_i − Σ_{j=1}^p β_j x_{i,j})² + Σ_{j=1}^p λ_j |β_j| ] + b_σ ).    (15)
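Putting Propositions 1 and 2 together, one sweep of the Gibbs sampler might look as follows. This is a sketch, not the authors' implementation: it uses scipy's truncnorm rather than the accept-reject scheme of Remark 1, and it calls the hypothetical prob_positive helper sketched after (12).

```python
import numpy as np
from scipy.stats import truncnorm

def gibbs_sweep(beta, lam, sigma2, X, y, a_lam, b_lam, a_sig, b_sig, rng):
    """One pass over beta_j (Proposition 1), then lambda_j and sigma^2 (Proposition 2)."""
    n, p = X.shape

    for j in range(p):
        eps_j = y - X @ beta + X[:, j] * beta[j]      # partial residuals epsilon_{i,j}
        s = eps_j @ X[:, j]
        p_plus = prob_positive(eps_j, X[:, j], lam[j], sigma2)
        sd = np.sqrt(sigma2 / n)
        if rng.random() < p_plus:                     # positive component of (13)
            mu = (s - lam[j] / 2) / n
            beta[j] = truncnorm.rvs(a=(0 - mu) / sd, b=np.inf, loc=mu, scale=sd,
                                    random_state=rng)
        else:                                         # negative component of (13)
            mu = (s + lam[j] / 2) / n
            beta[j] = truncnorm.rvs(a=-np.inf, b=(0 - mu) / sd, loc=mu, scale=sd,
                                    random_state=rng)

    # lambda_j | rest ~ Gamma(a_lam + 1, rate = |beta_j|/(2 sigma^2) + b_lam)    (14)
    lam = rng.gamma(shape=a_lam + 1, scale=1.0 / (np.abs(beta) / (2 * sigma2) + b_lam))

    # sigma^2 | rest ~ Inv-Gamma(p + n/2 + a_sig + 1, (RSS + sum lam_j|beta_j|)/2 + b_sig)  (15)
    rss = np.sum((y - X @ beta) ** 2)
    rate = 0.5 * (rss + lam @ np.abs(beta)) + b_sig
    sigma2 = 1.0 / rng.gamma(shape=p + n / 2 + a_sig + 1, scale=1.0 / rate)

    return beta, lam, sigma2
```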

Remark 1 To efficiently draw samples from N_{0+}(µ, σ²) when µ < 0 and from N_{0−}(µ, σ²) when µ > 0, we use the optimal exponential accept-reject algorithm proposed by Robert (1995). In our simulation studies, the average acceptance rate of this algorithm is around 90%, which indicates that sampling the β_j's from the mixture of two truncated normals is efficient.

Remark 2 The posterior mode of the β_j's can be exactly zero due to the nature of the Laplace prior in (8). However, the posterior draws of the β_j's cannot be exactly zero, since the β_j's are continuous random variables, and it is therefore very challenging to estimate the posterior mode directly from simulated samples of the β_j's. Instead of estimating the posterior mode for variable selection, we introduce a zero threshold δ, a pre-specified small positive number (δ is set to 10^{−5} in this paper), and regard all sampled β_j's that fall in the extremely small interval [−δ, δ] as zeros. The full conditional (13) implies that when λ_j is large enough, the absolute value of a posterior draw of β_j can be smaller than δ; it is then estimated as exactly zero and its associated predictor is removed from the model. This conducts variable selection automatically.

Remark 3 In practice, the hyper-parameters in (10) and (11) can be estimated from prior knowledge. For example, if we approximately know q_0, the prior expected proportion of true zeros in β, then by the property of the Laplace distribution we have

    P(−δ ≤ β_j ≤ δ | λ_j, σ²) = 1 − exp{ −λ_j δ/σ² } ≈ q_0.    (16)

After some algebra, we can set the hyper-parameters to satisfy the following equality, which is equivalent to (16):

    E(λ) E(1/σ²) = (a_λ/b_λ)(a_σ/b_σ) = −log(1 − q_0)/δ.    (17)

A feasible choice is a_σ = b_σ = 10^{−5}, a_λ = 1 and b_λ = −δ/log(1 − q_0).
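As a worked instance of Remark 3, suppose the prior guess is that about 60% of the coefficients are zero and δ = 10⁻⁵; then (17) pins down one feasible combination of hyper-parameters (the numbers are purely illustrative):

```python
import numpy as np

delta = 1e-5                 # zero threshold from Remark 2
q0 = 0.6                     # prior guess for the proportion of true zeros

# Satisfy E(lambda) E(1/sigma^2) = (a_lam/b_lam)(a_sig/b_sig) = -log(1 - q0)/delta   (17)
a_sig = b_sig = 1e-5         # so that a_sig/b_sig = 1
a_lam = 1.0
b_lam = -delta / np.log(1.0 - q0)

print(b_lam)                                       # approx 1.09e-05
print((a_lam / b_lam) * (a_sig / b_sig))           # equals -log(1 - q0)/delta
print(-np.log(1.0 - q0) / delta)
```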

On the other hand, if we do not want to use any prior knowledge, the priors of λ and σ² can be made less informative, i.e., we let a_λ, b_λ, a_σ and b_σ be very small numbers.

2.4 Model Selection

In the self-adaptive lasso, the tuning parameters λ_j and the variance of the error terms σ² are estimated jointly with the regression coefficients β_j. We could specify the optimal λ_j's and σ² as the joint posterior mode, i.e.,

    (β*, λ*, σ²*) = arg max_{β,λ,σ²} π(β, λ, σ² | Y, X).    (18)

However, there are at least two limitations to selecting the λ_j's and σ² by criterion (18). On the one hand, the posterior mode rests on some strong assumptions about the distributions of the data and the parameters, so criterion (18) may be misleading if the model is mis-specified. On the other hand, criterion (18) depends entirely on the same data used to estimate the β_j's, which may result in over-fitting (Hastie et al., 2001). To avoid these two limitations, we instead use a separate validation set to select the λ_j's and σ². Suppose we have obtained a series of samples {(β^{(m)}, λ^{(m)}, σ^{2(m)})}_{m=1}^M drawn from the joint posterior π(β, λ, σ² | Y, X); then the optimal λ_j's and σ² are selected such that

    (β*, λ*, σ²*) = arg min_{1 ≤ m ≤ M} ||Y^{(v)} − X^{(v)} β^{(m)}||²,    (19)

where (Y^{(v)}, X^{(v)}) is a separate validation set independent of the training set (Y, X).
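Criterion (19) is a simple argmin over the stored posterior draws; a sketch (betas, lams and sigma2s are assumed to hold the M saved draws, and X_val, y_val the validation set):

```python
import numpy as np

def select_by_validation(betas, lams, sigma2s, X_val, y_val):
    """Pick the posterior draw minimizing the validation error, as in (19)."""
    errors = np.sum((y_val[None, :] - betas @ X_val.T) ** 2, axis=1)
    m_star = int(np.argmin(errors))
    return betas[m_star], lams[m_star], sigma2s[m_star]
```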

2.5 Estimation and Inference

Given the selected λ* and σ²*, we can draw samples from the conditional posterior π(β | λ*, σ²*, Y, X), and the posterior median can be computed as a point estimate for each β_j. Following Remark 2, extremely small posterior draws of the β_j's are regarded as zeros; therefore the posterior median can be exactly zero, which conducts variable selection automatically. With the same samples, we can also construct Bayesian credible intervals and compute p-values for each β_j. The 95% credible interval for β_j is estimated by [Q_{0.025}(β_j), Q_{0.975}(β_j)], where Q_α(β_j) is the empirical α-quantile of β_j estimated from the simulated samples. We also estimate the probability p_j^b = Pr(|β_j| > δ | λ, σ², Y, X), called the Bayesian p-value, which is another measure of the uncertainty of β_j.
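All quantities in Section 2.5 are empirical summaries of the retained draws of β; a sketch (betas is again an assumed M × p array of posterior draws):

```python
import numpy as np

def summarize_draws(betas, delta=1e-5):
    """Posterior medians, 95% credible intervals and Bayesian p-values for each beta_j."""
    betas = np.where(np.abs(betas) <= delta, 0.0, betas)   # Remark 2: threshold tiny draws
    median = np.median(betas, axis=0)                      # point estimate (can be exactly 0)
    lower = np.quantile(betas, 0.025, axis=0)
    upper = np.quantile(betas, 0.975, axis=0)
    p_bayes = np.mean(np.abs(betas) > delta, axis=0)       # Pr(|beta_j| > delta | data)
    return median, (lower, upper), p_bayes
```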

3 Simulation Study

In this section, we use several simulated examples to evaluate the performance of the proposed self-adaptive lasso method. In each example, we generate 50 data sets, each consisting of a training set, an independent validation set and an independent test set. Models are fitted on the training data only, and the validation data are used to select the optimal hyper-parameters that minimize the prediction error. To evaluate the prediction performance of each fitted model, we calculate the relative mean-square error (RMSE) on the test set, defined as

    RMSE = ||X_test β̂ − X_test β_0||² / σ²,    (20)

where β_0 is the true value of β, σ² is the variance of the error term and X_test is the test data matrix. Following the notation of Zou and Hastie (2005), we write ·/·/· for the numbers of observations in the training, validation and test sets, respectively. In each example, the data are simulated from the true model

    x_i ~ N(0, Σ), ε_i ~ N(0, σ²), y_i = x_i β + ε_i, i = 1, ..., n,

where the variance of the error term, σ², is set such that the signal-to-noise ratio of the model equals 3. The details of the simulated examples are as follows; a data-generation sketch for Example 1 is given after the list.

Example 1 Each data set consists of 20/20/200 observations and eight variables. The covariance matrix Σ is an AR(1) matrix, i.e., cor(X_j, X_{j'}) = ρ^{|j−j'|}, with ρ = 0.5. The regression coefficients are set as β = (3, 1.5, 0, 0, 2, 0, 0, 0)^T.

Example 2 Each data set consists of 100/100/1000 observations and 40 variables. As in Example 1, the covariance matrix Σ is an AR(1) matrix with auto-correlation ρ = 0.5. The regression coefficients are set as

    β_j = (j + 1)/2, if j = 1, 3, 5, ..., 19;
    β_j = −j/2, if j = 2, 4, 6, ..., 20;
    β_j = 0, if j = 21, 22, ..., 40.

Example 3 All settings are the same as in Example 2 except that the regression coefficients are set as

    β_j = 1, if j = 1, 3, 5, ..., 19;
    β_j = −1, if j = 2, 4, 6, ..., 20;
    β_j = 0, if j = 21, 22, ..., 40.

Example 4 Each data set consists of 50/50/500 observations and 40 variables. The predictors X are generated as

    X_j = Z_1 + ξ_j, Z_1 ~ N(0, 1), for 1 ≤ j ≤ 5;
    X_j = Z_2 + ξ_j, Z_2 ~ N(0, 1), for 6 ≤ j ≤ 10;
    X_j = Z_3 + ξ_j, Z_3 ~ N(0, 1), for 11 ≤ j ≤ 15;
    X_j = ξ_j, for 16 ≤ j ≤ 40,

where the ξ_j are i.i.d. N(0, 1). The regression coefficients are set as

    β_j = 1, if 1 ≤ j ≤ 5;
    β_j = 5, if 6 ≤ j ≤ 10;
    β_j = 2, if 11 ≤ j ≤ 15;
    β_j = 0, if 16 ≤ j ≤ 40.
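For concreteness, the sketch below generates one data set in the spirit of Example 1 and evaluates the RMSE in (20) for a given estimate; the definition of the signal-to-noise ratio as β₀ᵀΣβ₀/σ² is our assumption.

```python
import numpy as np

rng = np.random.default_rng(2)

def make_example1(n, rho=0.5, snr=3.0):
    """One synthetic data set in the spirit of Example 1 (SNR definition assumed)."""
    beta0 = np.array([3.0, 1.5, 0, 0, 2.0, 0, 0, 0])
    p = len(beta0)
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # AR(1) correlation
    sigma2 = beta0 @ Sigma @ beta0 / snr
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    y = X @ beta0 + rng.normal(scale=np.sqrt(sigma2), size=n)
    return X, y, beta0, sigma2

def rmse(X_test, beta_hat, beta0, sigma2):
    """Relative mean-square error as in (20)."""
    return np.sum((X_test @ (beta_hat - beta0)) ** 2) / sigma2

X_tr, y_tr, beta0, sigma2 = make_example1(20)    # training set
X_te, _, _, _ = make_example1(200)               # test set
```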

We repeat each simulation 50 times and report the average RMSEs and their standard errors. We also record the average false positive rate (FPR, the proportion of unimportant variables incorrectly selected) and false negative rate (FNR, the proportion of important variables incorrectly removed). In all examples, the self-adaptive lasso is compared with the lasso, the Bayesian lasso, the adaptive lasso and the elastic net. The results are summarized in Table 1. To demonstrate the effectiveness of the self-adaptive lasso for statistical inference, we randomly pick the Gibbs samples from one of the 50 repetitions and calculate the Bayesian credible intervals and p-values of the β_j's. Due to limited space, we only exhibit the inference results for Examples 1 and 2, shown in Figures 1 and 2, respectively.

As Table 1 shows, the self-adaptive lasso produces the smallest RMSEs in all examples. Compared with the lasso, the self-adaptive lasso produces a better false positive rate in Examples 1–3 and a comparable or better false negative rate in Examples 1, 2 and 4. This demonstrates that the flexibility of the tuning parameters usually helps enhance performance. Generally speaking, the self-adaptive lasso exhibits promising performance in terms of both variable selection (measured by FPR and FNR) and prediction accuracy (measured by RMSE).

Figure 1 shows the Bayesian credible intervals and p-values estimated by the self-adaptive lasso for Example 1. The Bayesian p-values of β_1, β_2 and β_5 are extremely close to zero, indicating the significance of these estimates. The statistical inference results for Example 2 are shown in Figure 2. In this example, variables 21–40 are designed to be unimportant, with zero coefficients. From Figure 2 we can see that although β_22, β_26, β_31 and β_37 are selected by the self-adaptive lasso, these variables are not significantly important, because their coefficients have relatively high p-values (larger than 0.01). On the other hand, most coefficients of the truly important variables 1–20 (except β_1 and β_6) exhibit significance in Figure 2. Therefore, the statistical inference of the self-adaptive lasso can help post-screen the selected variables.

Table 1: Results for Examples 1–4. All results are averaged over 50 replications, with the associated standard deviations recorded in parentheses. RMSE is the relative mean-square error, FPR is the false positive rate and FNR is the false negative rate.

    Example   Method               RMSE      FPR       FNR
    1         Lasso                (0.210)   (0.305)   (0.269)
              Adaptive Lasso       (0.206)   (0.229)   (0.238)
              Elastic Net          (0.203)   (0.284)   (0.252)
              Bayesian Lasso       (0.183)   (0.193)   (0.245)
              Self-adaptive Lasso  (0.099)   (0.232)   (0.243)
    2         Lasso                (0.175)   (0.167)   (0.051)
              Adaptive Lasso       (0.164)   (0.190)   (0.082)
              Elastic Net          (0.176)   (0.171)   (0.052)
              Bayesian Lasso       (0.192)   (0.095)   (0.085)
              Self-adaptive Lasso  (0.117)   (0.165)   (0.069)
    3         Lasso                (0.193)   (0.142)   (0.000)
              Adaptive Lasso       (0.207)   (0.203)   (0.010)
              Elastic Net          (0.188)   (0.150)   (0.000)
              Bayesian Lasso       (0.749)   (0.129)   (0.150)
              Self-adaptive Lasso  (0.293)   (0.143)   (0.041)
    4         Lasso                (0.152)   (0.179)   (0.060)
              Adaptive Lasso       (0.120)   (0.125)   (0.090)
              Elastic Net          (0.139)   (0.211)   (0.281)
              Bayesian Lasso       (0.149)   (0.107)   (0.122)
              Self-adaptive Lasso  (0.093)   (0.125)   (0.114)

Figure 1: The statistical inference results for Example 1. Panel (A): boxplots of the 95% Bayesian credible intervals for the selected variables, where the ×'s mark the true values of the β_j's. Panel (B): the Bayesian p-values for the selected variables. Note that the vertical coordinate is −log_10(p-value).

4 Real Data Analysis

In this section, we apply the self-adaptive lasso to two real data sets. We compare its prediction accuracy with that of other methods and conduct statistical inference for the selected variables.

Prostate Cancer Data. The prostate cancer data set (Stamey et al., 1989) was used by Tibshirani (1996) to evaluate the performance of lasso. It examines the relationship between the level of prostate-specific antigen and eight clinical measures from patients waiting for radical prostatectomy. These factors are log(cancer volume) (lcavol), log(prostate weight) (lweight), age, log(benign prostatic hyperplasia amount) (lbph), seminal vesicle invasion (svi), log(capsular penetration) (lcp), Gleason score (gleason) and percentage of Gleason scores 4 or 5 (pgg45). The original data set consists of 97 observations; we randomly split them into a training set (37 observations), a validation set (30 observations) and a test set (30 observations).

Figure 2: The statistical inference results for Example 2.

Diabetes Data. The diabetes data set was used by Efron et al. (2004) to illustrate the LARS algorithm. It consists of a response measuring disease progression one year after baseline and ten baseline predictors: age, sex, body mass index (bmi), blood pressure (map), and six blood serum measurements (denoted tc, ldl, hdl, tch, ltg, glu). We randomly split the 442 diabetes patients into three subsets: a training set with 50 patients, a validation set with 150 patients and a test set with 242 patients.

For both the prostate cancer data and the diabetes data, the prediction errors on the test set are listed in Table 2. In the prostate example, the self-adaptive lasso produces the lowest prediction error, and the prediction error of the Bayesian lasso is lower than those of the lasso, the adaptive lasso and the elastic net. In the diabetes example, the differences among the prediction errors of these methods are not as pronounced as in the prostate example: the prediction error of the self-adaptive lasso is slightly lower than those of the other methods, and the Bayesian lasso and the elastic net give competitive results.

In the prostate example, the variables lcavol, lweight and pgg45 are selected. We plot the 95% Bayesian credible intervals and the p-values for the coefficients of these variables in Figure 3. The credible interval of pgg45 is close to zero and its p-value is relatively large, which suggests treating pgg45 as unimportant to the regression model. On the other hand, Figure 4 illustrates the statistical inference results for the diabetes example, where the variables bmi, map and ltg are selected. The p-values show that map is not as significant as the other two selected variables.

5 Extensions to Elastic Net and Fused Lasso

The original lasso (Tibshirani, 1996) has several important extensions, such as the elastic net (Zou and Hastie, 2005) and the fused lasso (Tibshirani et al., 2005). The elastic net uses a mixture of the l_1 norm and the l_2 norm to penalize the regression coefficients. Unlike the lasso, the number of variables selected by the elastic net is no longer limited by the sample size, thanks to the l_2-norm penalty. The l_2-norm penalty also helps produce a grouping effect, i.e., it tends to select or remove highly correlated variables simultaneously.

Table 2: The prediction errors on the test set for the prostate data and the diabetes data.

    Data      Method               Test Error
    Prostate  Lasso                3.11
              Adaptive Lasso       2.82
              Elastic Net          3.07
              Bayesian Lasso       2.38
              Self-adaptive Lasso  1.58
    Diabetes  Lasso                0.56
              Adaptive Lasso       0.55
              Elastic Net          0.52
              Bayesian Lasso       0.52
              Self-adaptive Lasso  0.51

As another development of the lasso, the fused lasso is usually applied to data with ordered variables, such as spectral data. It consists of a mixture of an l_1 penalty and a fusion penalty. The fusion penalty shrinks the l_1 norms of the differences between the coefficients of consecutive variables, and thus encourages neighboring variables to have similar coefficients.

The proposed self-adaptive lasso method and its Bayesian estimation can also be extended to the elastic net and the fused lasso. The modification of the Bayesian estimation mainly comes from a redefinition of the full conditional distribution of the β_j's. We give the details in Sections 5.1 and 5.2.

5.1 Self-adaptive Elastic Net

The self-adaptive elastic net is defined by the following criterion:

    min_β Σ_{i=1}^n (y_i − Σ_{j=1}^p β_j x_{i,j})² + Σ_{j=1}^p (λ_{1,j} |β_j| + λ_{2,j} β_j²).    (21)

In criterion (21) there are two sets of tuning parameters, the λ_{1,j}'s and the λ_{2,j}'s, associated with the l_1-norm and l_2-norm penalties, respectively.

Figure 3: The statistical inference results for the prostate example. Panel (A): the 95% Bayesian credible intervals for the selected variables. Panel (B): the p-values for the selected variables. Note that the vertical coordinate is −log_10(p-value).

Analogous to the model defined in (7)–(11), the corresponding Bayesian model for the self-adaptive elastic net is

    y_i | x_i, β, σ² ~ N(x_i^T β, σ²), i = 1, ..., n,    (22)
    β_j | λ_{1,j}, λ_{2,j}, σ² ~ ElasticNet(σ²/λ_{1,j}, 2σ²/λ_{2,j}), j = 1, ..., p,    (23)
    λ_{i,j} ~ G(a_i, b_i), i = 1, 2; j = 1, ..., p,    (24)
    σ² ~ G^{−1}(a_σ, b_σ),    (25)

where ElasticNet(γ_1, γ_2) denotes the elastic net distribution with density f(z) ∝ exp{−(|z|/γ_1 + z²/γ_2)}, which is the kernel of a mixture of truncated normal distributions. Denote λ_i = (λ_{i,1}, ..., λ_{i,p})^T, i = 1, 2.

Figure 4: The statistical inference results for the diabetes example.

Similar to the case of the self-adaptive lasso, we can use a Gibbs sampling algorithm to draw samples from the joint posterior π(β, λ_1, λ_2, σ² | Y, X) and select (λ_1, λ_2, σ²) using a separate validation set. Unlike the self-adaptive lasso, however, the full conditional distribution of β_j given β_{−j} is modified as follows.

Proposition 3 The full conditional distribution of β_j under the elastic net prior is

    β_j | β_{−j}, λ_1, λ_2, σ², Y, X ~ p_{0+} N_{0+}( (Σ_{i=1}^n ε_{i,j} x_{i,j} − λ_{1,j})/(n + λ_{2,j}), σ²/(n + λ_{2,j}) ) + p_{0−} N_{0−}( (Σ_{i=1}^n ε_{i,j} x_{i,j} + λ_{1,j})/(n + λ_{2,j}), σ²/(n + λ_{2,j}) ),

where p_{0±} = A_{0±}/(A_{0+} + A_{0−}) and

    A_{0±} = exp{ (Σ_{i=1}^n ε_{i,j} x_{i,j} ∓ λ_{1,j})² / (2σ²(n + λ_{2,j})) } Φ( ±(Σ_{i=1}^n ε_{i,j} x_{i,j} ∓ λ_{1,j}) / (σ √(n + λ_{2,j})) ).

We can see that the full conditional distribution of β_j in the self-adaptive elastic net is also a mixture of two truncated normals, but with modified means and variances.
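Relative to the self-adaptive lasso sampler, the only change is that the means and variances of the two truncated normal components are shrunk by n + λ_{2,j}. A sketch of the component parameters in Proposition 3, computed on the log scale as before (the exact form of A_{0±} used here is our reading of Proposition 3 and should be treated as an assumption):

```python
import numpy as np
from scipy.stats import norm
from scipy.special import expit

def enet_component_params(eps_j, x_j, lam1_j, lam2_j, sigma2):
    """Means, common variance and mixing weight of the two truncated normals in Prop. 3."""
    n = len(x_j)
    s = eps_j @ x_j                              # sum_i eps_{i,j} x_{i,j}
    denom = n + lam2_j
    mu_plus = (s - lam1_j) / denom               # mean of the positive component
    mu_minus = (s + lam1_j) / denom              # mean of the negative component
    var = sigma2 / denom                         # common variance
    sd = np.sqrt(sigma2 * denom)                 # sigma * sqrt(n + lam2_j)
    log_A_plus = (s - lam1_j) ** 2 / (2 * sigma2 * denom) + norm.logcdf((s - lam1_j) / sd)
    log_A_minus = (s + lam1_j) ** 2 / (2 * sigma2 * denom) + norm.logcdf(-(s + lam1_j) / sd)
    p_plus = expit(log_A_plus - log_A_minus)     # p_{0+} = A_{0+}/(A_{0+} + A_{0-})
    return p_plus, mu_plus, mu_minus, var
```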

5.2 Self-adaptive Fused Lasso

We define the self-adaptive fused lasso as

    min_β Σ_{i=1}^n (y_i − Σ_{j=1}^p β_j x_{i,j})² + Σ_{j=1}^p λ_{1,j} |β_j| + Σ_{j=2}^p λ_{2,j} |β_j − β_{j−1}|.    (26)

The Bayesian model for the self-adaptive fused lasso is the same as that for the self-adaptive elastic net, except that formulas (23) and (24) are replaced by

    β | λ_1, λ_2, σ² ~ FusedLasso(σ²/λ_1, σ²/λ_2),    (27)
    λ_{1,j} ~ G(a_1, b_1), j = 1, ..., p;  λ_{2,j} ~ G(a_2, b_2), j = 2, ..., p.    (28)

In (27), FusedLasso(σ²/λ_1, σ²/λ_2) denotes the fused lasso distribution, with density

    f(z) ∝ exp{ −( Σ_{j=1}^p λ_{1,j} |z_j| + Σ_{j=2}^p λ_{2,j} |z_j − z_{j−1}| ) / σ² },    (29)

where z = (z_1, ..., z_p)^T. Let N_{[l,u]}(µ, σ²) be a truncated normal distribution with mean µ and variance σ², lower bound l and upper bound u; we allow l to be −∞ and u to be +∞. Let t_0 = β_0 = −∞, t_4 = β_{p+1} = +∞, and let t_1, t_2, t_3 ∈ {0, β_{j−1}, β_{j+1}} satisfy t_1 ≤ t_2 ≤ t_3. Denote Δ_{1,s} = 2I(t_s ≤ 0) − 1, Δ_{2,s} = 2I(t_s ≤ β_{j−1}) − 1 and Δ_{3,s} = 2I(t_{s−1} ≥ β_{j+1}) − 1. We set p_s = 0 if t_{s−1} = t_s. In addition, we define

    µ_s = Σ_{i=1}^n ε_{i,j} x_{i,j} + Δ_{1,s} λ_{1,j} + Δ_{2,s} λ_{2,j} + Δ_{3,s} λ_{2,j+1},

with Σ_{s=1}^4 p_s = 1 and

    p_s ∝ exp{ µ_s² / (2σ²n) } [ Φ( √n (t_s − µ_s/n)/σ ) − Φ( √n (t_{s−1} − µ_s/n)/σ ) ].    (30)

The following proposition gives a way to draw the β_j's in the Gibbs sampler.

Proposition 4 The full conditional distribution of β_j under the fused lasso prior is

    β_j | β_{−j}, λ_1, λ_2, σ², Y, X ~ Σ_{s=1}^4 p_s N_{[t_{s−1}, t_s]}( µ_s/n, σ²/n ).    (31)

From (31), we can see that the full conditional distribution of β_j given β_{−j} follows a mixture of truncated normals with up to four components.

6 Conclusion

In this paper, we have proposed a self-adaptive lasso method for variable selection in regression problems. This method assigns a specific tuning parameter to each regression coefficient. We have also developed a Gibbs sampling algorithm that estimates the regression coefficients and selects the tuning parameters simultaneously, and that provides a way to conduct statistical inference. We illustrate the advantages of the new method on both simulated and real examples. Finally, we extend the idea of the self-adaptive lasso and its Bayesian model to the elastic net and the fused lasso.

When this paper was in final preparation, we found that Hans (2009) proposed a Gibbs sampling algorithm to estimate lasso. That algorithm shares a similar flavor with the Gibbs sampling algorithm proposed in our paper, but it concerns the Bayesian estimation of the original lasso with a single tuning parameter and the related inference based on the posterior mean. Our independent work on the proposed Gibbs sampling algorithm originated from a completely different motivation, namely providing a feasible way to select the p tuning parameters in the self-adaptive lasso.

References

Breiman, L. (1995), Better subset regression using the nonnegative garrote, Technometrics, 37.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004), Least angle regression, Annals of Statistics, 32.

Fan, J. and Li, R. (2001), Variable selection via nonconcave penalized likelihood and its oracle properties, Journal of the American Statistical Association, 96.

Hans, C. (2009), Bayesian lasso regression, Biometrika, to appear.

Hastie, T., Tibshirani, R., and Friedman, J. (2001), The Elements of Statistical Learning, Springer, New York.

Kyung, M., Gill, J., Ghosh, M., and Casella, G. (2009), Penalized regression, standard errors and Bayesian lassos, Technical report, Department of Statistics, University of Florida.

Meinshausen, N. (2007), Relaxed lasso, Computational Statistics and Data Analysis, 52.

Park, T. and Casella, G. (2008), The Bayesian lasso, Journal of the American Statistical Association, 103.

Radchenko, P. and James, G. (2008), Variable inclusion and shrinkage algorithms, Journal of the American Statistical Association, 103.

Robert, C. P. (1995), Simulation of truncated normal variables, Statistics and Computing, 5.

Stamey, T., Kabalin, J., McNeal, J., Johnstone, I., Freiha, F., Redwine, E., and Yang, N. (1989), Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate. II. Radical prostatectomy treated patients, The Journal of Urology, 141.

Tibshirani, R. (1996), Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society, Series B, 58.

Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., and Knight, K. (2005), Sparsity and smoothness via the fused lasso, Journal of the Royal Statistical Society, Series B, 67.

van Dyk, D. and Meng, X. (2001), The art of data augmentation, Journal of Computational and Graphical Statistics, 10.

Wang, S., Nan, B., Rosset, S., and Zhu, J. (2008), Random lasso, Technical report, Department of Biostatistics, University of Michigan.

Zou, H. (2006), The adaptive LASSO and its oracle properties, Journal of the American Statistical Association, 101.

Zou, H. and Hastie, T. (2005), Regularization and variable selection via the elastic net, Journal of the Royal Statistical Society, Series B, 67.

7 Appendix

1. Proof of Proposition 1. Note that X is standardized, i.e., for any j = 1, ..., p we have Σ_{i=1}^n x_{i,j}² = n. Then

    π(β_j | β_{−j}, Y, X, λ, σ²)
    ∝ exp{ −Σ_{i=1}^n (y_i − Σ_{k≠j} β_k x_{i,k} − β_j x_{i,j})² / (2σ²) } exp{ −λ_j |β_j| / (2σ²) }
    = exp{ −Σ_{i=1}^n (ε_{i,j} − β_j x_{i,j})² / (2σ²) } exp{ −λ_j |β_j| / (2σ²) }
    ∝ exp{ −( Σ_{i=1}^n x_{i,j}² β_j² − 2 Σ_{i=1}^n ε_{i,j} x_{i,j} β_j + λ_j |β_j| ) / (2σ²) }
    = exp{ −( n β_j² − 2 Σ_{i=1}^n ε_{i,j} x_{i,j} β_j + λ_j |β_j| ) / (2σ²) }
    ∝ exp{ −( n β_j² − 2 (Σ_{i=1}^n ε_{i,j} x_{i,j} − λ_j/2) β_j ) / (2σ²) } I(β_j > 0)
      + exp{ −( n β_j² − 2 (Σ_{i=1}^n ε_{i,j} x_{i,j} + λ_j/2) β_j ) / (2σ²) } I(β_j ≤ 0)
    ∝ A_{0+} · φ( √n (β_j − µ_+)/σ ) I(β_j > 0) / [ (σ/√n) Φ( √n µ_+/σ ) ]
      + A_{0−} · φ( √n (β_j − µ_−)/σ ) I(β_j ≤ 0) / [ (σ/√n) Φ( −√n µ_−/σ ) ],

where µ_± = (Σ_{i=1}^n ε_{i,j} x_{i,j} ∓ λ_j/2)/n and A_{0±} is defined in equation (12). This is exactly the mixture (13).

2. Proof of Proposition 4. We first show the case 0 ≤ β_{j−1} ≤ β_{j+1}, where t_1 = 0, t_2 = β_{j−1} and t_3 = β_{j+1}. The full conditional distribution of β_j given β_{−j}

is

    π(β_j | β_{−j}, Y, X, λ_1, λ_2, σ²)
    ∝ exp{ −Σ_{i=1}^n (ε_{i,j} − β_j x_{i,j})² / (2σ²) } exp{ −( λ_{1,j} |β_j| + λ_{2,j} |β_j − β_{j−1}| + λ_{2,j+1} |β_{j+1} − β_j| ) / σ² }
    ∝ exp{ −( n β_j² − 2 Σ_{i=1}^n ε_{i,j} x_{i,j} β_j + 2λ_{1,j} |β_j| + 2λ_{2,j} |β_j − β_{j−1}| + 2λ_{2,j+1} |β_{j+1} − β_j| ) / (2σ²) }
    ∝ exp{ −( n β_j² − 2 (Σ_{i=1}^n ε_{i,j} x_{i,j} + λ_{1,j} + λ_{2,j} − λ_{2,j+1}) β_j ) / (2σ²) } I(β_j ≤ 0)
      + exp{ −( n β_j² − 2 (Σ_{i=1}^n ε_{i,j} x_{i,j} − λ_{1,j} + λ_{2,j} − λ_{2,j+1}) β_j ) / (2σ²) } I(0 < β_j ≤ β_{j−1})
      + exp{ −( n β_j² − 2 (Σ_{i=1}^n ε_{i,j} x_{i,j} − λ_{1,j} − λ_{2,j} − λ_{2,j+1}) β_j ) / (2σ²) } I(β_{j−1} < β_j ≤ β_{j+1})
      + exp{ −( n β_j² − 2 (Σ_{i=1}^n ε_{i,j} x_{i,j} − λ_{1,j} − λ_{2,j} + λ_{2,j+1}) β_j ) / (2σ²) } I(β_j > β_{j+1})
    ∝ Σ_{s=1}^4 p_s · φ( √n (β_j − µ_s/n)/σ ) I(t_{s−1} < β_j ≤ t_s) / [ (σ/√n) ( Φ( √n (t_s − µ_s/n)/σ ) − Φ( √n (t_{s−1} − µ_s/n)/σ ) ) ].

Similarly, the claim holds when the order of 0, β_{j−1} and β_{j+1} changes.


Regression Shrinkage and Selection via the Lasso

Regression Shrinkage and Selection via the Lasso Regression Shrinkage and Selection via the Lasso ROBERT TIBSHIRANI, 1996 Presenter: Guiyun Feng April 27 () 1 / 20 Motivation Estimation in Linear Models: y = β T x + ɛ. data (x i, y i ), i = 1, 2,...,

More information

Bayesian linear regression

Bayesian linear regression Bayesian linear regression Linear regression is the basis of most statistical modeling. The model is Y i = X T i β + ε i, where Y i is the continuous response X i = (X i1,..., X ip ) T is the corresponding

More information

TECHNICAL REPORT NO. 1091r. A Note on the Lasso and Related Procedures in Model Selection

TECHNICAL REPORT NO. 1091r. A Note on the Lasso and Related Procedures in Model Selection DEPARTMENT OF STATISTICS University of Wisconsin 1210 West Dayton St. Madison, WI 53706 TECHNICAL REPORT NO. 1091r April 2004, Revised December 2004 A Note on the Lasso and Related Procedures in Model

More information

Linear regression methods

Linear regression methods Linear regression methods Most of our intuition about statistical methods stem from linear regression. For observations i = 1,..., n, the model is Y i = p X ij β j + ε i, j=1 where Y i is the response

More information

Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables

Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables LIB-MA, FSSM Cadi Ayyad University (Morocco) COMPSTAT 2010 Paris, August 22-27, 2010 Motivations Fan and Li (2001), Zou and Li (2008)

More information

Least Angle Regression, Forward Stagewise and the Lasso

Least Angle Regression, Forward Stagewise and the Lasso January 2005 Rob Tibshirani, Stanford 1 Least Angle Regression, Forward Stagewise and the Lasso Brad Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani Stanford University Annals of Statistics,

More information

Variable Selection for Nonparametric Quantile. Regression via Smoothing Spline ANOVA

Variable Selection for Nonparametric Quantile. Regression via Smoothing Spline ANOVA Variable Selection for Nonparametric Quantile Regression via Smoothing Spline ANOVA Chen-Yen Lin, Hao Helen Zhang, Howard D. Bondell and Hui Zou February 15, 2012 Author s Footnote: Chen-Yen Lin (E-mail:

More information

Least Absolute Shrinkage is Equivalent to Quadratic Penalization

Least Absolute Shrinkage is Equivalent to Quadratic Penalization Least Absolute Shrinkage is Equivalent to Quadratic Penalization Yves Grandvalet Heudiasyc, UMR CNRS 6599, Université de Technologie de Compiègne, BP 20.529, 60205 Compiègne Cedex, France Yves.Grandvalet@hds.utc.fr

More information

Tutz, Binder: Boosting Ridge Regression

Tutz, Binder: Boosting Ridge Regression Tutz, Binder: Boosting Ridge Regression Sonderforschungsbereich 386, Paper 418 (2005) Online unter: http://epub.ub.uni-muenchen.de/ Projektpartner Boosting Ridge Regression Gerhard Tutz 1 & Harald Binder

More information

SCIENCE CHINA Information Sciences. Received December 22, 2008; accepted February 26, 2009; published online May 8, 2010

SCIENCE CHINA Information Sciences. Received December 22, 2008; accepted February 26, 2009; published online May 8, 2010 . RESEARCH PAPERS. SCIENCE CHINA Information Sciences June 2010 Vol. 53 No. 6: 1159 1169 doi: 10.1007/s11432-010-0090-0 L 1/2 regularization XU ZongBen 1, ZHANG Hai 1,2, WANG Yao 1, CHANG XiangYu 1 & LIANG

More information

Variable Selection in Restricted Linear Regression Models. Y. Tuaç 1 and O. Arslan 1

Variable Selection in Restricted Linear Regression Models. Y. Tuaç 1 and O. Arslan 1 Variable Selection in Restricted Linear Regression Models Y. Tuaç 1 and O. Arslan 1 Ankara University, Faculty of Science, Department of Statistics, 06100 Ankara/Turkey ytuac@ankara.edu.tr, oarslan@ankara.edu.tr

More information

A Significance Test for the Lasso

A Significance Test for the Lasso A Significance Test for the Lasso Lockhart R, Taylor J, Tibshirani R, and Tibshirani R Ashley Petersen June 6, 2013 1 Motivation Problem: Many clinical covariates which are important to a certain medical

More information

Proteomics and Variable Selection

Proteomics and Variable Selection Proteomics and Variable Selection p. 1/55 Proteomics and Variable Selection Alex Lewin With thanks to Paul Kirk for some graphs Department of Epidemiology and Biostatistics, School of Public Health, Imperial

More information

Two Tales of Variable Selection for High Dimensional Regression: Screening and Model Building

Two Tales of Variable Selection for High Dimensional Regression: Screening and Model Building Two Tales of Variable Selection for High Dimensional Regression: Screening and Model Building Cong Liu, Tao Shi and Yoonkyung Lee Department of Statistics, The Ohio State University Abstract Variable selection

More information

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting

More information

Exploratory quantile regression with many covariates: An application to adverse birth outcomes

Exploratory quantile regression with many covariates: An application to adverse birth outcomes Exploratory quantile regression with many covariates: An application to adverse birth outcomes June 3, 2011 eappendix 30 Percent of Total 20 10 0 0 1000 2000 3000 4000 5000 Birth weights efigure 1: Histogram

More information

High-dimensional covariance estimation based on Gaussian graphical models

High-dimensional covariance estimation based on Gaussian graphical models High-dimensional covariance estimation based on Gaussian graphical models Shuheng Zhou Department of Statistics, The University of Michigan, Ann Arbor IMA workshop on High Dimensional Phenomena Sept. 26,

More information

Cross-Validation with Confidence

Cross-Validation with Confidence Cross-Validation with Confidence Jing Lei Department of Statistics, Carnegie Mellon University UMN Statistics Seminar, Mar 30, 2017 Overview Parameter est. Model selection Point est. MLE, M-est.,... Cross-validation

More information

In Search of Desirable Compounds

In Search of Desirable Compounds In Search of Desirable Compounds Adrijo Chakraborty University of Georgia Email: adrijoc@uga.edu Abhyuday Mandal University of Georgia Email: amandal@stat.uga.edu Kjell Johnson Arbor Analytics, LLC Email:

More information

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider the problem of

More information

Consistent high-dimensional Bayesian variable selection via penalized credible regions

Consistent high-dimensional Bayesian variable selection via penalized credible regions Consistent high-dimensional Bayesian variable selection via penalized credible regions Howard Bondell bondell@stat.ncsu.edu Joint work with Brian Reich Howard Bondell p. 1 Outline High-Dimensional Variable

More information

Spatial Lasso with Application to GIS Model Selection. F. Jay Breidt Colorado State University

Spatial Lasso with Application to GIS Model Selection. F. Jay Breidt Colorado State University Spatial Lasso with Application to GIS Model Selection F. Jay Breidt Colorado State University with Hsin-Cheng Huang, Nan-Jung Hsu, and Dave Theobald September 25 The work reported here was developed under

More information

On High-Dimensional Cross-Validation

On High-Dimensional Cross-Validation On High-Dimensional Cross-Validation BY WEI-CHENG HSIAO Institute of Statistical Science, Academia Sinica, 128 Academia Road, Section 2, Nankang, Taipei 11529, Taiwan hsiaowc@stat.sinica.edu.tw 5 WEI-YING

More information

Post-selection inference with an application to internal inference

Post-selection inference with an application to internal inference Post-selection inference with an application to internal inference Robert Tibshirani, Stanford University November 23, 2015 Seattle Symposium in Biostatistics, 2015 Joint work with Sam Gross, Will Fithian,

More information

High dimensional thresholded regression and shrinkage effect

High dimensional thresholded regression and shrinkage effect J. R. Statist. Soc. B (014) 76, Part 3, pp. 67 649 High dimensional thresholded regression and shrinkage effect Zemin Zheng, Yingying Fan and Jinchi Lv University of Southern California, Los Angeles, USA

More information

CS242: Probabilistic Graphical Models Lecture 4A: MAP Estimation & Graph Structure Learning

CS242: Probabilistic Graphical Models Lecture 4A: MAP Estimation & Graph Structure Learning CS242: Probabilistic Graphical Models Lecture 4A: MAP Estimation & Graph Structure Learning Professor Erik Sudderth Brown University Computer Science October 4, 2016 Some figures and materials courtesy

More information

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 10

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 10 COS53: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 0 MELISSA CARROLL, LINJIE LUO. BIAS-VARIANCE TRADE-OFF (CONTINUED FROM LAST LECTURE) If V = (X n, Y n )} are observed data, the linear regression problem

More information

25 : Graphical induced structured input/output models

25 : Graphical induced structured input/output models 10-708: Probabilistic Graphical Models 10-708, Spring 2016 25 : Graphical induced structured input/output models Lecturer: Eric P. Xing Scribes: Raied Aljadaany, Shi Zong, Chenchen Zhu Disclaimer: A large

More information

An Algorithm for Bayesian Variable Selection in High-dimensional Generalized Linear Models

An Algorithm for Bayesian Variable Selection in High-dimensional Generalized Linear Models Proceedings 59th ISI World Statistics Congress, 25-30 August 2013, Hong Kong (Session CPS023) p.3938 An Algorithm for Bayesian Variable Selection in High-dimensional Generalized Linear Models Vitara Pungpapong

More information

spikeslab: Prediction and Variable Selection Using Spike and Slab Regression

spikeslab: Prediction and Variable Selection Using Spike and Slab Regression 68 CONTRIBUTED RESEARCH ARTICLES spikeslab: Prediction and Variable Selection Using Spike and Slab Regression by Hemant Ishwaran, Udaya B. Kogalur and J. Sunil Rao Abstract Weighted generalized ridge regression

More information

Post-selection Inference for Forward Stepwise and Least Angle Regression

Post-selection Inference for Forward Stepwise and Least Angle Regression Post-selection Inference for Forward Stepwise and Least Angle Regression Ryan & Rob Tibshirani Carnegie Mellon University & Stanford University Joint work with Jonathon Taylor, Richard Lockhart September

More information

Shrinkage Methods: Ridge and Lasso

Shrinkage Methods: Ridge and Lasso Shrinkage Methods: Ridge and Lasso Jonathan Hersh 1 Chapman University, Argyros School of Business hersh@chapman.edu February 27, 2019 J.Hersh (Chapman) Ridge & Lasso February 27, 2019 1 / 43 1 Intro and

More information

The lasso, persistence, and cross-validation

The lasso, persistence, and cross-validation The lasso, persistence, and cross-validation Daniel J. McDonald Department of Statistics Indiana University http://www.stat.cmu.edu/ danielmc Joint work with: Darren Homrighausen Colorado State University

More information

Exploiting Covariate Similarity in Sparse Regression via the Pairwise Elastic Net

Exploiting Covariate Similarity in Sparse Regression via the Pairwise Elastic Net Exploiting Covariate Similarity in Sparse Regression via the Pairwise Elastic Net Alexander Lorbert, David Eis, Victoria Kostina, David M. Blei, Peter J. Ramadge Dept. of Electrical Engineering, Dept.

More information

Prediction & Feature Selection in GLM

Prediction & Feature Selection in GLM Tarigan Statistical Consulting & Coaching statistical-coaching.ch Doctoral Program in Computer Science of the Universities of Fribourg, Geneva, Lausanne, Neuchâtel, Bern and the EPFL Hands-on Data Analysis

More information

Bayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence

Bayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence Bayesian Inference in GLMs Frequentists typically base inferences on MLEs, asymptotic confidence limits, and log-likelihood ratio tests Bayesians base inferences on the posterior distribution of the unknowns

More information

The Generalized Ridge Trace Plot: Visualizing Bias and Precision

The Generalized Ridge Trace Plot: Visualizing Bias and Precision The Generalized Ridge Trace Plot: Visualizing Bias and Precision Michael Friendly Psychology Department and Statistical Consulting Service York University 4700 Keele Street, Toronto, ON, Canada M3J 1P3

More information

Robust Variable Selection Methods for Grouped Data. Kristin Lee Seamon Lilly

Robust Variable Selection Methods for Grouped Data. Kristin Lee Seamon Lilly Robust Variable Selection Methods for Grouped Data by Kristin Lee Seamon Lilly A dissertation submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree

More information

Robust variable selection through MAVE

Robust variable selection through MAVE This is the author s final, peer-reviewed manuscript as accepted for publication. The publisher-formatted version may be available through the publisher s web site or your institution s library. Robust

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information