Iterative Selection Using Orthogonal Regression Techniques


Bradley Turnbull 1, Subhashis Ghosal 1, and Hao Helen Zhang 2
1 Department of Statistics, North Carolina State University, Raleigh, NC, USA
2 Department of Mathematics, University of Arizona, Tucson, AZ, USA

Received 15 October 2013; revised 1 November 2013; accepted 2 November 2013
DOI: /sam Published online in Wiley Online Library (wileyonlinelibrary.com).

Abstract: High dimensional data are nowadays encountered in various branches of science, and variable selection techniques play a key role in their analysis. Two approaches to variable selection in the high dimensional setting are generally considered: forward selection methods and penalization methods. In the former, variables are introduced into the model one at a time according to their ability to explain variation, and the procedure is terminated by some stopping rule. In penalization techniques such as the least absolute shrinkage and selection operator (LASSO), an optimization is carried out with a carefully chosen penalty added to the sum of squares, so that the solution has a sparse structure. Recently, the idea of penalized forward selection has been introduced. The motivation comes from the fact that penalization techniques like the LASSO admit closed form solutions in one dimension, just like the least squares estimator, and hence such an update can be repeated in a forward selection scheme until convergence. The resulting procedure selects sparser models than comparable methods without compromising predictive power. However, when the regressor is high dimensional, it is typical that many predictors are highly correlated. We show that in such situations it is possible to further improve the stability and computational efficiency of the procedure by introducing an orthogonalization step. At each selection step, the variables still available for selection are screened on the basis of their correlation with the variables already in the model, thus preventing unnecessary duplication. The new strategy, called the Selection Technique in Orthogonalized Regression Models (STORM), turns out to be extremely successful in reducing the model dimension further and also leads to improved predictive power. We also consider an aggressive version of the STORM, in which a potential predictor is permanently removed from further consideration if its regression coefficient is estimated as zero at any stage. We carry out a detailed simulation study to compare the newly proposed method with existing ones, and analyze a real dataset. 2013 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 6, 2013

Keywords: forward selection; orthogonalization; high dimensional regression; LASSO

1. INTRODUCTION

In modern applications of statistics and data mining, linear regression models with extremely high dimensional regressors are commonly encountered. Typically the dimension of the regressor far exceeds the available sample size, posing serious challenges for the analysis of such data. In particular, the data matrix becomes singular and the least squares estimator is not uniquely defined. Usually the majority of the regressor variables are not relevant, leading to a sparse structure in the model; however, it is not known beforehand which variables are actually relevant for the response. This problem is addressed through a variable selection step, which screens the variables before they can enter the model.
The variable selection step actually allows a fairly accurate estimation of the regression function in such high dimensional, low sample size (HDLSS) situations. (Correspondence to: Subhashis Ghosal, sghosal@stat.ncsu.edu.) Variable selection has many other benefits, such as the ability to work with a sparse model, which is much more interpretable than a regression model containing a large number of predictors.

Variable selection methods mainly fall into two categories. The first consists of recursive selection methods such as forward selection, backward selection, and stepwise selection. In a forward selection procedure, variables are added one by one to build up the model, and a stopping rule is applied, based on some criterion such as the mean squared error (MSE), adjusted R^2, Mallows' C_p statistic, prediction sum of squares (PRESS), Akaike information criterion (AIC), or Bayesian information criterion (BIC); see ref [1] for their definitions. This strategy often runs into problems in HDLSS situations. For high dimensional data, Bühlmann [2] and Wang [3] studied forward regression for variable screening.

Backward selection is a variable removal procedure, which starts with the full model and sequentially deletes redundant variables. Since least squares estimation in the full model is not feasible in HDLSS situations, the backward selection method is typically not favored. Stepwise selection allows both addition and removal of variables at each step, depending on predictive performance.

The second category of variable selection methods uses the idea of penalization to regularize a basic estimator such as the least squares estimator. Ridge regression [4] is among the first in the linear regression setting; there, a quadratic penalty term shrinks the coefficient estimates towards zero, reducing the variability of the estimated coefficients at the expense of introducing a small bias. Ridge regression is unable to do any variable selection unless aided by a hard thresholding step that filters out coefficients that are too small. More recently, it has been found that other penalty terms with edged contours can do both shrinkage and selection at the same time. Examples of such procedures include the nonnegative garrote [5], the least absolute shrinkage and selection operator (LASSO) [6], the SCAD penalized estimator [7], the adaptive LASSO [8], and the elastic net [9]. A common feature of the associated penalty functions is that they are not differentiable at zero, which allows the minimizer of the penalized sum of squares to have many components exactly zero, inducing sparsity in the estimator. In the literature, there are also numerous Bayesian variable selection methods for high dimensional linear regression, such as refs [10-12].

Recently, Hwang et al. [13] introduced a method that combines the ideas of forward selection and penalization. Although typical penalized regression estimators do not have closed forms in general, many of them do when the regressor is one-dimensional. Hence such an expression can be used in the recursion of a forward selection procedure. Hwang et al. [13] proposed the Forward Iterative Regression and Selection Technique (FIRST). The essential difference from a traditional forward selection procedure is that the additional shrinkage penalty, applied at each selection step, filters out many redundant variables before they can enter the model, keeping the effective model size low and the resulting model easily interpretable. This approach can be used with any standard penalization method, such as the LASSO, the adaptive LASSO, the nonnegative garrote, the naive elastic net, and so on. Moreover, a post-selection ordinary least squares step often gives better prediction by reducing the finite sample estimation bias caused by the shrinkage. Through an extensive simulation study, they showed that the FIRST leads to much sparser models than existing variable selection methods, without compromising its predictive power.

Let Y = Xβ + ε be a standard linear regression model, where Y = (Y_1,...,Y_n)^T ∈ R^n is a vector of responses, X is an n × p matrix of predictors, β = (β_1,...,β_p)^T ∈ R^p is a vector of parameters, and ε = (ε_1,...,ε_n)^T is an n-dimensional random error vector with uncorrelated, mean zero, equal variance components. The intercept may be excluded from the regression model since the data can be centered.
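As a concrete illustration of this centering and scaling convention, the following short Python/NumPy sketch (our own, not part of the paper) centers the response and centers and rescales each predictor column to unit Euclidean length, which is what the closed form expressions below assume.

```python
import numpy as np

def center_and_scale(X, y):
    """Center y, and center each column of X and scale it to unit length,
    so that sum_i Y_i = 0, sum_i X_ij = 0 and sum_i X_ij^2 = 1 for every j."""
    y_c = y - y.mean()
    Xc = X - X.mean(axis=0)
    norms = np.sqrt((Xc ** 2).sum(axis=0))
    norms[norms == 0] = 1.0          # guard against constant columns
    return Xc / norms, y_c
```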
In what follows, we standardize the columns of X to have length 1. Note that the ith observation may be written as Y_i = X_i^T β + ε_i, i = 1,...,n, where X_1^T,...,X_n^T are the rows of X. The most common variable selection method, the LASSO, minimizes ‖Y − Xβ‖² + λ‖β‖_1, where ‖·‖_1 is the L_1-norm. The resulting estimator does not have a closed form for p > 1 and is usually obtained from the least angle regression (LARS) algorithm [14]. However, for p = 1 the LASSO is explicit; namely, with β̂_j = X_j^T Y standing for the ordinary least squares estimator in the linear regression of Y on X_j,

\hat\beta_j^L = \begin{cases} \hat\beta_j - \lambda/2, & \text{if } \hat\beta_j \ge \lambda/2,\\ 0, & \text{if } |\hat\beta_j| < \lambda/2,\\ \hat\beta_j + \lambda/2, & \text{if } \hat\beta_j \le -\lambda/2. \end{cases}   (1)

Although p = 1 is only a hypothetical situation, the explicit formula of the LASSO in this case allows us to repeat it sequentially as in a forward selection procedure, resulting in the FIRST estimator. Indeed, the same idea can be applied to all of the modifications of the LASSO penalty mentioned above.

When there is a relatively high degree of correlation, a forward selection procedure that keeps considering all variables is somewhat redundant, since a variable which is nearly a linear combination of variables already selected has very little to add to the model. It therefore seems sensible to orthogonalize the variables before each selection step and to eliminate those which are nearly linear combinations of the variables already selected. Apart from making the chosen model sparser, such an orthogonalization step may, in principle, enhance prediction performance by reducing the complexity and variance of the final estimator. The resulting procedure will be called the Selection Technique in Orthogonalized Regression Models (STORM). An aggressive version of the procedure is obtained by permanently removing any potential predictor from further consideration if its regression coefficient is estimated as zero at any stage. Since it is hard to keep track of the linear relations resulting from the orthogonalization step, we only keep track of the selected variables and finally estimate the coefficients by a post-selection least squares method. Since the selection step makes the selected model low dimensional even in HDLSS situations, the least squares estimator is unique.
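The closed form (1) is the usual soft-thresholding operator. A minimal Python sketch (ours, not from the paper):

```python
import numpy as np

def soft_threshold(beta_hat, lam):
    """One-dimensional LASSO solution of Eq. (1): shrink the OLS coefficient
    beta_hat = X_j^T Y toward zero by lam/2 and truncate at zero."""
    return np.sign(beta_hat) * max(abs(beta_hat) - lam / 2.0, 0.0)
```

For example, soft_threshold(0.8, 1.0) returns 0.3, while any coefficient with |β̂_j| < 0.5 is set exactly to zero.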

We conduct a comprehensive simulation study to assess the comparative performances of the STORM, the FIRST, and other variable selection techniques in linear regression models. We find that, in all the simulation settings, the new proposal, the STORM, outperforms the FIRST and the other variable selection techniques available in the literature, both in terms of finding the sparsest model and in terms of estimation error.

The paper is organized as follows. In Section 2, we describe the STORM procedure and give a step-by-step algorithm. In Section 3, we conduct a comprehensive simulation experiment to assess the relative performance of the STORM and other methods. In Section 4, we analyze a high dimensional real dataset.

2. METHODOLOGY

In this section, we define the proposed Selection Technique in Orthogonalized Regression Models (STORM). Consider the linear model Y_i = X_i^T β + ε_i, i = 1,...,n, with p-dimensional regressor variables X_1,...,X_n. Without loss of generality, we center the responses Y_i, i = 1,...,n, so that Σ_{i=1}^n Y_i = 0, and standardize all predictor variables {X_1,...,X_p}, i.e., Σ_{i=1}^n X_ij = 0 and Σ_{i=1}^n X_ij² = 1.

2.1. Algorithm

In classical forward selection, predictors are added to the model sequentially, one at a time: at each step, the variable whose inclusion improves the current prediction performance the most enters the model. The STORM also builds the model by including one variable at a time, but in a different fashion. At each step, the one-dimensional penalization problem is solved for each candidate variable, and the variable producing the smallest residual sum of squares (RSS) is selected. To handle possible high correlation among the predictors, the STORM orthogonalizes the candidate variables at each step before performing the selection.

Let J_0 = {1,...,p} denote the initial set of variables available for selection and I_0 = ∅ the initial set of variables which have already been selected (i.e., none). The proposed STORM algorithm is described as follows.

1. Initialization Stage

(1.1) Initial setup: Set m = 1, and let Y_{i,1} = Y_i and X_{ij,1} = X_{ij} for i = 1,...,n.

(1.2) Initial regression: Calculate the ordinary least squares estimates β̂_{1,1},...,β̂_{p,1} by β̂_{j,1} = Σ_{i=1}^n Y_{i,1} X_{ij,1}, for j = 1,...,p.

(1.3) Initial shrinkage: Calculate the LASSO estimates

\hat\beta_{j,1}^L = \begin{cases} \hat\beta_{j,1} - \lambda/2, & \text{if } \hat\beta_{j,1} > \lambda/2,\\ 0, & \text{if } |\hat\beta_{j,1}| \le \lambda/2,\\ \hat\beta_{j,1} + \lambda/2, & \text{if } \hat\beta_{j,1} < -\lambda/2. \end{cases}   (2)

(1.4) Initializing selection set: Let K_1 = J_0, and calculate c_{jk,1} = Σ_{i=1}^n X_{ij,1} X_{ik,1} for all j, k ∈ K_1.

(1.5) Initial selection: Select X_{j*_1} as the most effective variable in the first step, where

j*_1 = argmin_j Σ_{i=1}^n (Y_{i,1} − β̂^L_{j,1} X_{ij,1})² = argmax_j { 2 β̂_{j,1} β̂^L_{j,1} − (β̂^L_{j,1})² }.

(1.6) First updating: Let

R_1 = Σ_{i=1}^n Y_{i,1}² − Σ_{i=1}^n (Y_{i,1} − β̂^L_{j*_1,1} X_{ij*_1,1})² = 2 β̂_{j*_1,1} β̂^L_{j*_1,1} − (β̂^L_{j*_1,1})²,

J_1 = K_1 \ {j*_1}, I_1 = {j*_1}.

2. Recursion Stage

(2.1) Recursion step: Set m to m + 1, and let Y_{i,m+1} = Y_{i,m} − β̂^L_{j*_m,m} X_{i j*_m,m}, i = 1,...,n.

(2.2) Recursion correlation: Let Z_{ij,m+1} = X_{ij,m} − c_{j j*_m,m} X_{i j*_m,m}, and define

K_m = { j ∈ J_m : Σ_{i=1}^n Z_{ij,m+1}² ≥ η }.

(2.3) Recursion data update: For all j ∈ K_m, compute

X_{ij,m+1} = Z_{ij,m+1} / [ Σ_{i=1}^n Z_{ij,m+1}² ]^{1/2},   (3)

and for all j, k ∈ K_m, compute

c_{jk,m+1} = (c_{jk,m} − c_{j j*_m,m} c_{k j*_m,m}) / ( (1 − c²_{j j*_m,m})^{1/2} (1 − c²_{k j*_m,m})^{1/2} ).   (4)

(2.4) Recursion regression: For j ∈ K_m, calculate the regression coefficients

β̂_{j,m+1} = (β̂_{j,m} − β̂_{j*_m,m} c_{j j*_m,m}) / (1 − c²_{j j*_m,m})^{1/2}.   (5)

(2.5) Recursion shrinkage: Calculate the LASSO estimates

\hat\beta_{j,m+1}^L = \begin{cases} \hat\beta_{j,m+1} - \lambda/2, & \text{if } \hat\beta_{j,m+1} > \lambda/2,\\ 0, & \text{if } |\hat\beta_{j,m+1}| \le \lambda/2,\\ \hat\beta_{j,m+1} + \lambda/2, & \text{if } \hat\beta_{j,m+1} < -\lambda/2. \end{cases}   (6)

(2.6) Recursion selection: Select X_{j*_{m+1}} as the most effective variable in the (m + 1)th step, where

j*_{m+1} = argmin_{j ∈ K_m} Σ_{i=1}^n (Y_{i,m+1} − β̂^L_{j,m+1} X_{ij,m+1})² = argmax_{j ∈ K_m} { 2 β̂_{j,m+1} β̂^L_{j,m+1} − (β̂^L_{j,m+1})² }.

(2.7) Recursion updating: Let

R_{m+1} = Σ_{i=1}^n Y_{i,m+1}² − Σ_{i=1}^n (Y_{i,m+1} − β̂^L_{j*_{m+1},m+1} X_{i j*_{m+1},m+1})² = 2 β̂_{j*_{m+1},m+1} β̂^L_{j*_{m+1},m+1} − (β̂^L_{j*_{m+1},m+1})²,

J_{m+1} = K_m \ {j*_{m+1}}, I_{m+1} = I_m ∪ {j*_{m+1}}.

3. Stopping Rule: Repeat the Recursion Stage until R_{m+1} < δ.

4. Final Step: Once stopped, save the current set of selected variables, I_m, and compute the ordinary least squares (OLS) regression of Y on {X_j : j ∈ I_m}. This gives the final model and prediction formula.
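To make the steps above concrete, here is a compact Python/NumPy sketch of the algorithm (our own illustration, not the authors' code). It assumes y is centered and the columns of X are centered and scaled to unit length, and for clarity it recomputes the correlations c_{j j*_m,m} directly from the working orthogonalized design instead of propagating them through the recursive update (4); the two are algebraically equivalent. The aggressive flag is our rendering of the aggressive variant described earlier, which permanently drops candidates whose shrunken coefficient is zero.

```python
import numpy as np

def soft_threshold_vec(b, lam):
    """Vectorized one-dimensional LASSO update, Eqs. (2)/(6)."""
    return np.sign(b) * np.maximum(np.abs(b) - lam / 2.0, 0.0)

def storm(X, y, lam, eta=1e-4, delta=1e-6, aggressive=False, max_steps=None):
    """Sketch of the STORM algorithm of Section 2.1 (illustrative only).
    X must have centered, unit-length columns and y must be centered."""
    n, p = X.shape
    if max_steps is None:
        max_steps = min(n, p)              # property (ii): at most min(n, p) steps
    active = np.arange(p)                  # candidate set J_m / K_m (original indices)
    Xw = X.astype(float).copy()            # working orthogonalized predictors X_{., m}
    yw = y.astype(float).copy()            # working residual Y_{., m}
    selected = []                          # selected set I_m

    for _ in range(max_steps):
        if active.size == 0:
            break
        beta = Xw.T @ yw                               # one-dimensional OLS coefficients
        beta_L = soft_threshold_vec(beta, lam)         # shrinkage step (2)/(6)
        gain = 2.0 * beta * beta_L - beta_L ** 2       # RSS reduction of each candidate
        best = int(np.argmax(gain))                    # selection step (1.5)/(2.6)
        if gain[best] < delta:                         # stopping rule (step 3)
            break
        selected.append(int(active[best]))
        x_sel = Xw[:, best].copy()
        yw = yw - beta_L[best] * x_sel                 # residual update (2.1)

        keep = np.ones(active.size, dtype=bool)
        keep[best] = False
        if aggressive:                                 # drop zeroed candidates permanently
            keep &= beta_L != 0.0
        c = Xw[:, keep].T @ x_sel                      # correlations c_{j j*_m, m}
        Z = Xw[:, keep] - np.outer(x_sel, c)           # orthogonalization (2.2)
        norm2 = (Z ** 2).sum(axis=0)
        ok = norm2 >= eta                              # eta-screening of near-collinear columns
        active = active[keep][ok]
        Xw = Z[:, ok] / np.sqrt(norm2[ok])             # renormalization (2.3)

    coef = np.zeros(p)                                 # final step: post-selection OLS
    if selected:
        sel = np.array(selected)
        coef[sel] = np.linalg.lstsq(X[:, sel], y, rcond=None)[0]
    return selected, coef
```

A call such as selected, coef = storm(X, y, lam=0.5, eta=1e-3) returns the indices in I_m and the post-selection OLS coefficient vector; with an orthogonal design the loop reduces to soft-thresholding followed by OLS on the retained variables, consistent with property (iv) below.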

2.2. Properties

The STORM algorithm outlined above has the following properties:

(i) The residual sum of squares decreases at every iteration.

(ii) The maximum number of steps is the number of linearly independent variables, which is bounded by min(p, n) ≤ n. In particular, the proposed forward selection algorithm automatically terminates.

(iii) Because of the orthogonalization step, no variable can be selected more than once, whether or not the original regressors are orthogonal.

(iv) In the orthogonal design case, the procedure reduces to the OLS procedure based on the variables selected by the LASSO.

With the exception of property (iii), similar properties are also shared by the FIRST procedure. Below, we briefly give arguments why these properties hold.

Proof of (i): At iteration m, the most effective variable X_{j*_m} is selected by minimizing the objective criterion Σ_{i=1}^n (Y_{i,m} − X_{ij,m} β̂^L_{j,m})² over all j ∈ K_m, with notation as introduced in Subsection 2.1. Thus, by the definition of β̂^L_{j,m},

Σ_{i=1}^n Y_{i,m+1}² = Σ_{i=1}^n (Y_{i,m} − β̂^L_{j*_m,m} X_{ij*_m,m})²
 ≤ Σ_{i=1}^n (Y_{i,m} − β̂^L_{j*_m,m} X_{ij*_m,m})² + λ |β̂^L_{j*_m,m}|
 ≤ Σ_{i=1}^n (Y_{i,m} − 0 · X_{ij*_m,m})² + λ · 0 = Σ_{i=1}^n Y_{i,m}².

Therefore the residual sum of squares decreases at each step.

Proof of (ii) and (iii): At each step, owing to the orthogonalization, the residual variable is orthogonal to the variables which have already been selected in the previous steps. In particular, if the jth variable at stage (m + 1) is exactly a linear combination of already selected variables, then the residual variable obtained in step (2.2) is identically zero, and hence j ∉ K_m, ruling out the selection of variable j as a predictor. Thus, in particular, no variable can be selected twice, proving (iii) as well as justifying the second operation in step (2.7). Further, as there can be at most n linearly independent predictors when observing a sample of size n, property (ii) follows.

Proof of (iv): If the variables are already orthogonal, then steps (2.2) and (2.3) are redundant, and the procedure selects variables in decreasing order of the nonzero |β̂^L_j|, or equivalently, in decreasing order of the nonzero |β̂_j|. It is well known that the LASSO in this special case also coincides with this procedure. Thus, in view of step 4 of the algorithm, the final procedure is the same as the ordinary least squares estimate applied to the variables retained by the LASSO.

An analog of Theorem 2.1 in ref [13] is also valid for the STORM; namely, in the orthogonal design case, the STORM is consistent for model selection. This follows from two observations: the LASSO has the model selection consistency property, and the STORM coincides with the LASSO followed by OLS.

2.3. Tuning

Three parameters determine the performance of the STORM, namely δ, η, and λ. The role of δ is to determine when the recursion stops. We find that, as long as δ is small enough, its choice affects the predictive performance very little. The main motivation for using η is that it effectively discourages considering for selection those predictors which are highly correlated with variables already in the model, since they offer little extra gain in reducing the residual sum of squares. The use of η also ensures the numerical stability of the orthogonalization step. The most important parameter is λ, which needs to be tuned carefully to ensure proper selection and estimation results; this is in line with the FIRST procedure introduced by Hwang et al. [13]. Compared with the FIRST, the one additional parameter we need to tune is η, which controls the thresholding of the orthogonalized variables and therefore may significantly affect the chosen model and the estimated coefficients. Since the ideal value of η is not known, a tuning criterion such as AIC, BIC, or cross-validation may be used. In our numerical experiments, we use joint fivefold cross-validation: a set of possible values of λ and η is chosen over a grid, the prediction error is estimated by cross-validation, and the value minimizing the estimated error is used in the final estimation stage with all the data. In practice, we recommend tuning both λ and η in the STORM, but computing time may be saved significantly by simply selecting a sensible value for η without cross-validation.
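A minimal sketch of this joint fivefold cross-validation over a (λ, η) grid, reusing the storm() sketch given after the algorithm (the grid values, fold construction, and helper names are our own assumptions):

```python
import numpy as np
from itertools import product

def cv_tune_storm(X, y, lam_grid, eta_grid, n_folds=5, delta=1e-6, seed=0):
    """Joint fivefold cross-validation over (lambda, eta) as in Section 2.3.
    Assumes X and y are already centered/scaled; a more careful version
    would re-standardize within each training fold."""
    n = X.shape[0]
    folds = np.random.default_rng(seed).permutation(n) % n_folds
    best_pair, best_err = None, np.inf
    for lam, eta in product(lam_grid, eta_grid):
        fold_errs = []
        for k in range(n_folds):
            tr, te = folds != k, folds == k
            _, coef = storm(X[tr], y[tr], lam=lam, eta=eta, delta=delta)
            fold_errs.append(np.mean((y[te] - X[te] @ coef) ** 2))
        err = float(np.mean(fold_errs))
        if err < best_err:
            best_err, best_pair = err, (lam, eta)
    return best_pair, best_err
```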
2.4. Extensions

Although the STORM procedure described above is based on the one-dimensional LASSO, the idea extends to other shrinkage estimators, such as the nonnegative garrote or the naive elastic net, as the basis. This only requires a simple change in the update equation (6). For the nonnegative garrote, the updating formula, starting from the least squares estimator, is

\hat\beta_{j,m+1}^{NG} = \begin{cases} \hat\beta_{j,m+1} - \dfrac{\lambda}{2\hat\beta_{j,m+1}}, & \text{if } \hat\beta_{j,m+1} > \sqrt{\lambda/2},\\ 0, & \text{if } |\hat\beta_{j,m+1}| \le \sqrt{\lambda/2},\\ \hat\beta_{j,m+1} + \dfrac{\lambda}{2|\hat\beta_{j,m+1}|}, & \text{if } \hat\beta_{j,m+1} < -\sqrt{\lambda/2}, \end{cases}   (7)

that is, β̂_{j,m+1} is scaled by the nonnegative factor (1 − λ/(2β̂²_{j,m+1}))_+. The naive elastic net procedure [9] employs a combination of the L_1- and L_2-penalties and obtains

β = argmin_β Σ_{i=1}^n (Y_i − Σ_{j=1}^p X_{ij} β_j)² + λ_1 Σ_{j=1}^p |β_j| + λ_2 Σ_{j=1}^p β_j²,

where λ_1 > 0 and λ_2 > 0 are two tuning parameters. To implement the STORM-based naive elastic net procedure, we replace Eq. (6) by

\hat\beta_{j,m+1}^{EN} = \begin{cases} \dfrac{\hat\beta_{j,m+1} - \lambda_1/2}{1+\lambda_2}, & \text{if } \hat\beta_{j,m+1} > \lambda_1/2,\\ 0, & \text{if } |\hat\beta_{j,m+1}| \le \lambda_1/2,\\ \dfrac{\hat\beta_{j,m+1} + \lambda_1/2}{1+\lambda_2}, & \text{if } \hat\beta_{j,m+1} < -\lambda_1/2. \end{cases}   (8)

Note that the corrected elastic net estimate [9] for a one-dimensional predictor coincides with the LASSO and hence does not lead to a different estimator.
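For reference, short implementations of these two alternative one-dimensional updates (ours, mirroring Eqs. (7) and (8)); either can be substituted for the soft-thresholding step in the storm() sketch above:

```python
import numpy as np

def garrote_update(beta_hat, lam):
    """One-dimensional nonnegative garrote update, Eq. (7): scale beta_hat by
    (1 - lam / (2 beta_hat^2))_+ , which is zero when beta_hat^2 <= lam/2."""
    if beta_hat == 0.0:
        return 0.0
    return beta_hat * max(1.0 - lam / (2.0 * beta_hat ** 2), 0.0)

def elastic_net_update(beta_hat, lam1, lam2):
    """One-dimensional naive elastic net update, Eq. (8): soft-threshold at
    lam1/2, then shrink by the factor 1 / (1 + lam2)."""
    return np.sign(beta_hat) * max(abs(beta_hat) - lam1 / 2.0, 0.0) / (1.0 + lam2)
```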

3. SIMULATION

We now demonstrate the performance of the STORM under various settings. We concentrate on large p, small n situations and generate the data from a high-dimensional sparse linear model, Y_i = X_i^T β + σ ε_i, i = 1,...,n, where X_i ∈ R^p, p > n, σ > 0 is a scale parameter, and the ε_i are independent and identically distributed (i.i.d.) N(0, 1) errors. For each method, we generate a training data set to fit the model, a validation data set to select the tuning parameter values, and finally an independent test data set to assess the prediction accuracy of the resulting estimators. The training and validation data sets are each of size n, while the test data set is always of the same fixed size. For each experiment, we run 100 Monte-Carlo simulations and report the average results along with the Monte-Carlo errors.

We compare the STORM and its aggressive version (aggrSTORM) with the FIRST followed by OLS (FIRST+OLS), the adaptive FIRST followed by OLS (aFIRST+OLS), and the LASSO. Recall that the aggrSTORM excludes from all subsequent steps any predictor j for which β̂^L_j = 0 at some step, and therefore can save computing time significantly. The LARS algorithm, available in R, is used to implement the LASSO. The optimal tuning parameters for each method are chosen by a grid search using the validation data set. Both the STORM and the aggrSTORM require tuning of both λ and η, which is done using a two-dimensional, two-stage grid search.

3.1. Simulation Settings

We examine several experiment settings, varying the sample size, the error variance (or signal strength), and the correlation among the covariates. The following is a detailed description of the examples; a small data-generation sketch follows the list.

(a) In Example 1, we set p = 1000 and n = 100 or 500. The covariates X_1,...,X_p are i.i.d. N(0, 1) variables. The error variance is σ² = 1 in both cases. The true coefficient vector β = (β_1,...,β_p)^T has only 10 nonzero coefficients, (3, 3, 3, 3, 1.5, 1.5, 1.5, 2, 2, 2), and the remaining coefficients are zero. The locations of the nonzero values are equally spaced along the coefficient vector; more specifically, β_1, β_101, β_201, and β_301 equal 3, β_401, β_501, and β_601 equal 1.5, and β_701, β_801, and β_901 equal 2.

(b) Example 2 has the same settings as Example 1, but we introduce moderate correlation among the covariates: the correlation between two covariates X_i and X_j is corr(X_i, X_j) = ρ^|i−j|, with ρ = 0.5. For this example, we consider n = 100 and two error variances, σ² = 1 and σ² = 4.

(c) Example 3 is the same as Example 2, except that we increase the correlation between covariates to ρ = 0.75.

(d) In Example 4, we increase the number of covariates to p = 2500. The coefficient vector still has only 10 nonzero coefficients, identical to the nonzero coefficient values in Example 1, and their locations are again equally spaced along the coefficient vector. For this example, we consider n = 100 and σ² = 1.

(e) Example 5 introduces moderate correlation between the covariates of Example 4. We set ρ = 0.5 and n = 100, and again consider the two error variances σ² = 1 and σ² = 4.

(f) Example 6 is identical to Example 5, but we increase the correlation to ρ = 0.75.
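A minimal sketch (ours) of the data-generating mechanism just described, using an AR(1)-type construction so that corr(X_i, X_j) = ρ^|i−j|; setting rho = 0 gives the independent covariates of Examples 1 and 4:

```python
import numpy as np

def simulate_example(n=100, p=1000, rho=0.5, sigma2=1.0, seed=0):
    """Generate one data set following the design of Section 3.1: AR(1)
    correlation rho^|i-j| among covariates and a sparse coefficient vector
    with 10 nonzero entries spread evenly along the vector."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, p))
    X = np.empty((n, p))
    X[:, 0] = Z[:, 0]
    for j in range(1, p):                   # stationary AR(1) with unit variance
        X[:, j] = rho * X[:, j - 1] + np.sqrt(1.0 - rho ** 2) * Z[:, j]
    beta = np.zeros(p)
    values = [3, 3, 3, 3, 1.5, 1.5, 1.5, 2, 2, 2]
    positions = np.arange(10) * (p // 10)   # equally spaced nonzero locations
    beta[positions] = values
    y = X @ beta + np.sqrt(sigma2) * rng.standard_normal(n)
    return X, y, beta
```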

Table 1. Simulation results for Example 1 (p = 1000, σ² = 1, independent covariates): test error, false negatives, and false positives (with Monte-Carlo standard errors) for the LASSO (LARS), FIRST+OLS, aFIRST+OLS, STORM, and aggrSTORM at n = 100 and n = 500. (Numerical entries omitted.)

Table 2. Simulation results for Example 2 (p = 1000, n = 100, correlated covariates with ρ = 0.5): the same quantities for σ² = 1 and σ² = 4. (Numerical entries omitted.)

Table 3. Simulation results for Example 3 (p = 1000, n = 100, correlated covariates with ρ = 0.75): the same quantities for σ² = 1 and σ² = 4. (Numerical entries omitted.)

Table 4. Simulation results for Example 4 (p = 2500, σ² = 1, n = 100, independent covariates). (Numerical entries omitted.)

3.2. Experiments and Results

We compare the different methods with regard to their prediction error and variable selection performance. For prediction performance, we report the mean squared error evaluated on the test set. For variable selection, we report two types of selection errors: false negatives, defined as the (average) number of nonzero coefficients estimated as zero, and false positives, defined as the (average) number of zero coefficients not estimated as zero. From Tables 1-6, we observe that the STORM consistently has the best prediction performance in terms of the test error. In terms of selection, the STORM also shows advantages, with small false negative and false positive counts. The aggrSTORM generally performs second best among all the methods under comparison. Figure 1 compares the relative performance of the STORM against the FIRST as the covariate correlation varies, showing the worth of the orthogonalization step.

Fig. 1. Comparison of the STORM and the FIRST procedures for varying degrees of covariate correlation ρ. Each data point corresponds to the average MSE over N = 75 Monte-Carlo trials with n = 100, p = 1000, σ² = 1, and true coefficient vector β identical to that of the previous examples. Lines are LOESS smoothers of the average MSE values of FIRST+OLS and the STORM at each value of ρ; dotted lines are the corresponding LOESS 95% confidence intervals. (Figure not reproduced.)
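The two selection-error counts and the test error are straightforward to compute; a small helper sketch (ours):

```python
import numpy as np

def selection_errors(beta_true, beta_hat):
    """False negatives: truly nonzero coefficients estimated as zero.
    False positives: truly zero coefficients estimated as nonzero."""
    fn = int(np.sum((beta_true != 0) & (beta_hat == 0)))
    fp = int(np.sum((beta_true == 0) & (beta_hat != 0)))
    return fn, fp

def test_error(X_test, y_test, beta_hat):
    """Mean squared prediction error on an independent test set."""
    return float(np.mean((y_test - X_test @ beta_hat) ** 2))
```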
4. REAL DATA ANALYSIS

We explore how the STORM performs on real data by considering the gene expression data used in ref [15], which contain a large number of probe sets measured on 120 observations. The computation is made more manageable by two stages of prescreening. First, we remove the 3815 probe sets whose maximum values are not greater than the 25th percentile of the entire probe set. In the second stage, we select the 3000 probe sets with the largest variance among the remaining probe sets. In our analysis, these 3000 probe sets are used as predictors, with the response being the RMA expression value of the probe set recently found to cause Bardet-Biedl syndrome [15]. We randomly divide the data into a training set of size 100 and a test set of size 20, and implement the STORM with a small fixed value of δ. Fivefold cross-validation is conducted on the training data set to tune the λ and η parameters, and the whole procedure is repeated 10 times. We consider the FIRST+OLS, the LASSO+OLS, the STORM, and the aggrSTORM for comparison. Table 7 shows that all methods give competitive prediction performance in terms of test error, while the STORM and the aggrSTORM select fewer genes than the others. This suggests that the STORM and the aggrSTORM are able to find a fairly simple regression model which can still predict the response quite accurately.
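A hypothetical sketch of this two-stage prescreening (the array name expr, shaped observations by probes, and the interpretation of the 25th percentile as taken over all expression values are our assumptions, not details given in the paper):

```python
import numpy as np

def prescreen_probes(expr, n_keep=3000):
    """Two-stage prescreening sketch mirroring Section 4:
    (1) drop probes whose maximum expression does not exceed the 25th
        percentile of all expression values, then
    (2) keep the n_keep remaining probes with the largest variance."""
    cutoff = np.percentile(expr, 25)
    keep = expr.max(axis=0) > cutoff
    expr1 = expr[:, keep]
    idx = np.argsort(expr1.var(axis=0))[::-1][:n_keep]
    return expr1[:, idx]
```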

Table 5. Simulation results for Example 5 (p = 2500, n = 100, correlated covariates with ρ = 0.50): test error, false negatives, and false positives for σ² = 1 and σ² = 4. (Numerical entries omitted.)

Table 6. Simulation results for Example 6 (p = 2500, n = 100, correlated covariates with ρ = 0.75): the same quantities for σ² = 1 and σ² = 4. (Numerical entries omitted.)

Table 7. Real data analysis results: average test error and number of genes selected over the 10 random splits (standard errors in parentheses). The average numbers of genes selected are 50.6 (4.3) for FIRST+OLS, 36.4 (4.1) for LASSO+OLS, 30.5 (3.4) for the STORM, and 25.1 (3.4) for the aggrSTORM. (Test-error entries omitted.)

ACKNOWLEDGMENTS

This work is supported by a National Security Agency grant and an NSF DMS grant.

REFERENCES

[1] J. Rawlings, S. Pantula, and D. Dickey, Applied Regression Analysis, Springer, New York.
[2] P. Bühlmann, Boosting for high-dimensional linear models, Ann Stat 34 (2006).
[3] H. Wang, Forward regression for ultra-high dimensional variable screening, J Am Stat Assoc 104 (2009).
[4] A. E. Hoerl and R. W. Kennard, Ridge regression: biased estimation for nonorthogonal problems, Technometrics 12 (1970).
[5] L. Breiman, Better subset regression using the nonnegative garrote, Technometrics 37 (1995).
[6] R. Tibshirani, Regression shrinkage and selection via the lasso, J R Stat Soc B 58 (1996).
[7] J. Fan and R. Li, Variable selection via nonconcave penalized likelihood and its oracle properties, J Am Stat Assoc 96 (2001).
[8] H. Zou, The adaptive lasso and its oracle properties, J Am Stat Assoc 101 (2006).
[9] H. Zou and T. Hastie, Regularization and variable selection via the elastic net, J R Stat Soc B 67 (2005).
[10] E. George and R. McCulloch, Variable selection via Gibbs sampling, J Am Stat Assoc 88 (1993).
[11] H. Ishwaran and J. Rao, Spike and slab variable selection: frequentist and Bayesian strategies, Ann Stat 33 (2005).
[12] C. Hans, A. Dobra, and M. West, Shotgun stochastic search for large p regression, J Am Stat Assoc 102 (2007).
[13] W. Hwang, H. H. Zhang, and S. Ghosal, FIRST: combining forward selection and shrinkage in high dimensional linear regression, Stat Interf 2 (2009).
[14] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, Least angle regression, Ann Stat 32 (2004).
[15] J. Huang, S. Ma, and C.-H. Zhang, Adaptive Lasso for sparse high-dimensional regression models, Stat Sin 18 (2008), 1603-1618.


Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models

Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider variable

More information

The Risk of James Stein and Lasso Shrinkage

The Risk of James Stein and Lasso Shrinkage Econometric Reviews ISSN: 0747-4938 (Print) 1532-4168 (Online) Journal homepage: http://tandfonline.com/loi/lecr20 The Risk of James Stein and Lasso Shrinkage Bruce E. Hansen To cite this article: Bruce

More information

Regularized Discriminant Analysis and Its Application in Microarrays

Regularized Discriminant Analysis and Its Application in Microarrays Biostatistics (2005), 1, 1, pp. 1 18 Printed in Great Britain Regularized Discriminant Analysis and Its Application in Microarrays By YAQIAN GUO Department of Statistics, Stanford University Stanford,

More information

PENALIZING YOUR MODELS

PENALIZING YOUR MODELS PENALIZING YOUR MODELS AN OVERVIEW OF THE GENERALIZED REGRESSION PLATFORM Michael Crotty & Clay Barker Research Statisticians JMP Division, SAS Institute Copyr i g ht 2012, SAS Ins titut e Inc. All rights

More information

Least Absolute Shrinkage is Equivalent to Quadratic Penalization

Least Absolute Shrinkage is Equivalent to Quadratic Penalization Least Absolute Shrinkage is Equivalent to Quadratic Penalization Yves Grandvalet Heudiasyc, UMR CNRS 6599, Université de Technologie de Compiègne, BP 20.529, 60205 Compiègne Cedex, France Yves.Grandvalet@hds.utc.fr

More information

application in microarrays

application in microarrays Biostatistics Advance Access published April 7, 2006 Regularized linear discriminant analysis and its application in microarrays Yaqian Guo, Trevor Hastie and Robert Tibshirani Abstract In this paper,

More information

arxiv: v1 [stat.me] 30 Dec 2017

arxiv: v1 [stat.me] 30 Dec 2017 arxiv:1801.00105v1 [stat.me] 30 Dec 2017 An ISIS screening approach involving threshold/partition for variable selection in linear regression 1. Introduction Yu-Hsiang Cheng e-mail: 96354501@nccu.edu.tw

More information

Linear Regression Models. Based on Chapter 3 of Hastie, Tibshirani and Friedman

Linear Regression Models. Based on Chapter 3 of Hastie, Tibshirani and Friedman Linear Regression Models Based on Chapter 3 of Hastie, ibshirani and Friedman Linear Regression Models Here the X s might be: p f ( X = " + " 0 j= 1 X j Raw predictor variables (continuous or coded-categorical

More information

A New Bayesian Variable Selection Method: The Bayesian Lasso with Pseudo Variables

A New Bayesian Variable Selection Method: The Bayesian Lasso with Pseudo Variables A New Bayesian Variable Selection Method: The Bayesian Lasso with Pseudo Variables Qi Tang (Joint work with Kam-Wah Tsui and Sijian Wang) Department of Statistics University of Wisconsin-Madison Feb. 8,

More information

SCMA292 Mathematical Modeling : Machine Learning. Krikamol Muandet. Department of Mathematics Faculty of Science, Mahidol University.

SCMA292 Mathematical Modeling : Machine Learning. Krikamol Muandet. Department of Mathematics Faculty of Science, Mahidol University. SCMA292 Mathematical Modeling : Machine Learning Krikamol Muandet Department of Mathematics Faculty of Science, Mahidol University February 9, 2016 Outline Quick Recap of Least Square Ridge Regression

More information

PENALIZED METHOD BASED ON REPRESENTATIVES & NONPARAMETRIC ANALYSIS OF GAP DATA

PENALIZED METHOD BASED ON REPRESENTATIVES & NONPARAMETRIC ANALYSIS OF GAP DATA PENALIZED METHOD BASED ON REPRESENTATIVES & NONPARAMETRIC ANALYSIS OF GAP DATA A Thesis Presented to The Academic Faculty by Soyoun Park In Partial Fulfillment of the Requirements for the Degree Doctor

More information

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Article Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Fei Jin 1,2 and Lung-fei Lee 3, * 1 School of Economics, Shanghai University of Finance and Economics,

More information

MS&E 226. In-Class Midterm Examination Solutions Small Data October 20, 2015

MS&E 226. In-Class Midterm Examination Solutions Small Data October 20, 2015 MS&E 226 In-Class Midterm Examination Solutions Small Data October 20, 2015 PROBLEM 1. Alice uses ordinary least squares to fit a linear regression model on a dataset containing outcome data Y and covariates

More information

On Mixture Regression Shrinkage and Selection via the MR-LASSO

On Mixture Regression Shrinkage and Selection via the MR-LASSO On Mixture Regression Shrinage and Selection via the MR-LASSO Ronghua Luo, Hansheng Wang, and Chih-Ling Tsai Guanghua School of Management, Peing University & Graduate School of Management, University

More information