Lecture 5: Soft-Thresholding and Lasso


High Dimensional Data and Statistical Learning
Lecture 5: Soft-Thresholding and Lasso
Weixing Song, Department of Statistics, Kansas State University
STAT 905, October 23, 2014

Outline: Penalized Least Squares; Introduction to Lasso; Lasso: Orthogonal Design and Geometry; Coordinate Descent Algorithm; Penalty Parameter Selection; Least Angle Regression; References.


High Dimensional Linear Regression. We consider the following linear regression model:
$y_i = x_{i1}\beta_1 + \cdots + x_{ip}\beta_p + \varepsilon_i$, $1 \le i \le n$.
Some notation:
Response: $y = (y_1, \ldots, y_n)^T$;
Predictors: $x_j = (x_{1j}, \ldots, x_{nj})^T$, $j = 1, 2, \ldots, p$;
Design matrix: $X_{n \times p} = (x_1, \ldots, x_p)$;
Regression coefficients: $\beta = (\beta_1, \ldots, \beta_p)$;
True regression coefficients: $\beta^o = (\beta_1^o, \ldots, \beta_p^o)$;
Oracle set: $O = \{j : \beta_j^o \ne 0\}$;
Underlying model dimension: $d_0 = |O| = \#\{j : \beta_j^o \ne 0\}$.

Centering and Standardization. WLOG, we assume that the response and the predictors are centered and the predictors are standardized as follows:
$\sum_{i=1}^n y_i = 0$, $\sum_{i=1}^n x_{ij} = 0$, $\sum_{i=1}^n x_{ij}^2 = n$, for $j = 1, 2, \ldots, p$.
After the centering and standardization, there is no intercept in the model. Each predictor is standardized to have the same magnitude in $L_2$, so the corresponding regression coefficients are comparable. After model fitting, the results can be readily transformed back to the original scale.

We consider the penalized least squares (PLS) method:
$\min_{\beta \in R^p} \left\{ \frac{1}{2n}\|y - X\beta\|^2 + \sum_{j=1}^p p_\lambda(|\beta_j|) \right\},$
where $\|\cdot\|$ denotes the $L_2$-norm and $p_\lambda(\cdot)$ is a penalty function indexed by the regularization parameter $\lambda \ge 0$. Some commonly used penalty functions:
$L_0$-penalty (subset selection) and $L_2$-penalty (ridge regression);
Bridge or $L_\gamma$ penalty, $\gamma > 0$ (Frank and Friedman, 1993);
$L_1$ penalty or Lasso (Tibshirani, 1996);
SCAD penalty (Fan and Li, 2001);
MCP penalty (Zhang, 2010);
Group penalties, bi-level penalties, ...

Plots of Bridge penalties $\lambda|\theta|^\gamma$ for several values of $\gamma$ (figure not reproduced; a sketch that redraws such curves is given below).
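Since the penalty plots did not survive the transcription, here is a minimal R sketch that redraws bridge-penalty curves $\lambda|\theta|^\gamma$; the particular $\lambda$ and $\gamma$ values are illustrative choices, not taken from the original slide.

# Bridge penalty p_lambda(theta) = lambda * |theta|^gamma for several gamma values
# (lambda and the gamma grid are illustrative)
lambda <- 1
gammas <- c(0.5, 1, 2, 4)
theta  <- seq(-3, 3, length.out = 401)
plot(theta, lambda * abs(theta)^gammas[1], type = "l", ylim = c(0, 3),
     xlab = "theta", ylab = "penalty")
for (k in 2:length(gammas)) lines(theta, lambda * abs(theta)^gammas[k], lty = k)
legend("top", legend = paste("gamma =", gammas), lty = seq_along(gammas))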


Lasso. Lasso stands for "least absolute shrinkage and selection operator." There are two equivalent definitions.
Minimizing the residual sum of squares subject to the sum of the absolute values of the coefficients being less than a constant:
$\hat\beta = \operatorname{argmin}\{\|y - X\beta\|^2\}$ subject to $\sum_{j=1}^p |\beta_j| \le t$.
Minimizing the penalized sum of squares:
$\hat\beta = \operatorname{argmin}\left\{\|y - X\beta\|^2 + \lambda\sum_{j=1}^p |\beta_j|\right\}.$
Because of the nature of the $L_1$ penalty or constraint, Lasso is able to estimate some coefficients as exactly 0 and hence performs variable selection. The Lasso enjoys some of the favorable properties of both subset selection and ridge regression: it produces interpretable models and exhibits the stability of ridge regression.

The motivation for the Lasso came from an interesting proposal of Breiman (1995). Breiman's non-negative garrote minimizes
$\frac{1}{2n}\sum_{i=1}^n \Big(y_i - \sum_{j=1}^p c_j \hat\beta_j^{LS} x_{ij}\Big)^2$, subject to $c_j \ge 0$, $\sum_{j=1}^p c_j \le t$,
or, equivalently,
$\frac{1}{2n}\sum_{i=1}^n \Big(y_i - \sum_{j=1}^p c_j \hat\beta_j^{LS} x_{ij}\Big)^2 + \lambda\sum_{j=1}^p c_j$, subject to $c_j \ge 0$.
The garrote starts with the OLS estimates and shrinks them by non-negative factors whose sum is constrained. The garrote estimate depends on both the sign and the magnitude of the OLS estimates. In contrast, the Lasso avoids the explicit use of the OLS estimates. The Lasso is also closely related to the wavelet soft-thresholding method of Donoho and Johnstone (1994), forward stagewise regression, and boosting methods.

Solution Path: For each given $\lambda$, we solve the PLS problem. Therefore, for $\lambda \in [\lambda_{\min}, \lambda_{\max}]$, we have a solution path $\{\hat\beta_n(\lambda) : \lambda \in [\lambda_{\min}, \lambda_{\max}]\}$. To examine the solution path, we can plot each component of $\hat\beta_n(\lambda)$ versus $\lambda$. In practice, we usually need to determine a value of $\lambda$, say $\lambda^*$, and use $\hat\beta_n(\lambda^*)$ as the final estimator. This model selection step is usually done using some information criterion or cross-validation technique. Thus it is important to have fast algorithms for computing the whole solution path, or the solutions on a grid of $\lambda$ values. There are multiple packages in R for computing the Lasso path: ncvreg, glmnet, lars, .... Note that the solution path can also be indexed by the constraint value $t$.
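As an illustration of computing and plotting a solution path with the packages just mentioned, here is a minimal sketch on simulated data (the data-generating choices are mine, not from the lecture); both glmnet and ncvreg standardize the predictors internally by default.

# Compute and plot a lasso solution path (simulated data for illustration only).
library(glmnet)
library(ncvreg)
set.seed(1)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1:3] %*% c(2, -1, 1)) + rnorm(n)
fit1 <- glmnet(X, y)                     # lasso path over an automatic lambda grid
plot(fit1, xvar = "lambda")              # coefficients versus log(lambda)
fit2 <- ncvreg(X, y, penalty = "lasso")  # the same model fitted by coordinate descent
plot(fit2)                               # coefficient paths indexed by lambda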

Lasso: Examples. Example 1: Consider the linear regression model
$y_i = \sum_{j=1}^p x_{ij}\beta_j + \varepsilon_i$, $1 \le i \le n$,
where $\varepsilon \sim N(0, 1.5^2)$ and $p = 100$. Generate the $z_{ij}$'s and $w$ independently from $N(0,1)$; let $x_{ij} = z_{ij} + w$ for $1 \le j \le 4$, $x_{i5} = z_{i5} + 2w$, $x_{i6} = z_{i6} + w$, and $x_{ij} = z_{ij}$ for $j \ge 7$. Let $(\beta_1, \beta_2, \beta_3) = (2, 1, 1)$ and $\beta_j = 0$ for $j \ge 4$.

To implement the lasso estimation for the above example, we use the R package ncvreg; a sample R-code is given below. Remark: The ncvreg package is built upon the work of Breheny and Huang (2011).
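The sample code on the original slide is an image that was not captured in the transcription. The following is a hedged reconstruction of Example 1 with ncvreg; the sample size n = 100, the seed, and the use of one latent w per observation are my assumptions, since the slide does not state them.

# Reconstruction of the missing sample code for Example 1 (n, seed and the
# per-observation w are assumptions; the slide's exact code is not shown).
library(ncvreg)
set.seed(905)
n <- 100; p <- 100
Z <- matrix(rnorm(n * p), n, p)
w <- rnorm(n)
X <- Z
X[, 1:4] <- Z[, 1:4] + w       # x_ij = z_ij + w, 1 <= j <= 4
X[, 5]   <- Z[, 5] + 2 * w     # x_i5 = z_i5 + 2w
X[, 6]   <- Z[, 6] + w         # x_i6 = z_i6 + w
beta <- c(2, 1, 1, rep(0, p - 3))
y <- drop(X %*% beta) + rnorm(n, sd = 1.5)
fit <- ncvreg(X, y, penalty = "lasso")   # lasso solution path
plot(fit)
coef(fit, lambda = 0.1)                  # coefficients at one (arbitrary) lambda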


Example 2 (Prostate Data): The data come from a study by Stamey et al. (1989) that examined the association between prostate specific antigen (PSA) and several clinical measures that are potentially associated with PSA in men who were about to receive a radical prostatectomy. The variables are as follows:
lcavol: log cancer volume;
lweight: log prostate weight;
age: the man's age;
lbph: log of the amount of benign hyperplasia;
svi: seminal vesicle invasion (1 = yes, 0 = no);
lcp: log of capsular penetration;
gleason: Gleason score;
pgg45: percent of Gleason scores 4 or 5;
lpsa: log PSA (the response).

To implement the lasso estimation for the above example, we use the R package ncvreg; a sample R-code is given below.
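The slide's code is again an image; a minimal sketch is given below, assuming the Prostate data bundled with the ncvreg package (with X the clinical predictors and y = lpsa), which may differ from the exact data object used in the lecture.

# Minimal sketch for the prostate example (assumes ncvreg's bundled Prostate data).
library(ncvreg)
data(Prostate)
X <- Prostate$X
y <- Prostate$y
fit <- ncvreg(X, y, penalty = "lasso")
plot(fit)                                # lasso coefficient paths
cvfit <- cv.ncvreg(X, y, penalty = "lasso")
coef(cvfit)                              # coefficients at the CV-selected lambda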

To compare Lasso and Ridge, we also add some artificial noise variables to the model.


Orthogonal Design in PLS. Insight about the nature of the penalization methods can be gleaned from the orthogonal design case. When the design matrix multiplied by $n^{-1/2}$ is orthonormal, i.e., $X^T X = nI_{p \times p}$, the penalized least squares problem reduces to the minimization of
$\frac{1}{2n}\|y - X\hat\beta^{LSE}\|^2 + \frac{1}{2}\|\hat\beta^{LSE} - \beta\|^2 + \sum_{j=1}^p p_\lambda(|\beta_j|),$
where $\hat\beta^{LSE} = n^{-1}X^T y$ is the OLS estimate. Now the optimization problem is separable in the $\beta_j$'s. It suffices to consider the univariate PLS problem
$\hat\theta(z) = \operatorname{argmin}_{\theta \in R}\left\{\frac{1}{2}(z - \theta)^2 + p_\lambda(|\theta|)\right\}. \quad (1)$

Characterization of PLS Estimators (Antoniadis and Fan, 2001). Let $p_\lambda(\cdot)$ be a nonnegative, nondecreasing, and differentiable function on $(0, \infty)$. Further, assume that the function $-\theta - p_\lambda'(\theta)$ is strictly unimodal on $(0, \infty)$. Then we have the following results.
(a) The solution to the minimization problem (1) exists and is unique. It is antisymmetric: $\hat\theta(-z) = -\hat\theta(z)$.
(b) The solution satisfies
$\hat\theta(z) = 0$ if $|z| \le p_0$, and $\hat\theta(z) = z - \operatorname{sgn}(z)\,p_\lambda'(|\hat\theta(z)|)$ if $|z| > p_0$,
where $p_0 = \min_{\theta \ge 0}\{\theta + p_\lambda'(\theta)\}$. Moreover, $|\hat\theta(z)| \le |z|$.
(c) If $p_\lambda'(\cdot)$ is nonincreasing, then for $|z| > p_0$ we have
$|z| - p_0 \le |\hat\theta(z)| \le |z| - p_\lambda'(|z|).$
(d) When $p_\lambda'(\theta)$ is continuous on $(0, \infty)$, the solution $\hat\theta(z)$ is continuous if and only if the minimum of $|\theta| + p_\lambda'(|\theta|)$ is attained at zero.
(e) If $p_\lambda'(|z|) \to 0$ as $|z| \to \infty$, then
$\hat\theta(z) = z - p_\lambda'(|z|) + o(p_\lambda'(|z|)).$

Proof: Denote by $l(\theta)$ the objective function in (1).
(a)-(b). Note that $l(\theta)$ tends to infinity as $|\theta| \to \infty$. Thus, minimizers do exist. When $z = 0$, it is clear that $\hat\theta(z) = 0$ is the unique minimizer. WLOG, assume that $z > 0$. Then for all $\theta > 0$, $l(-\theta) > l(\theta)$. Hence $\hat\theta(z) \ge 0$. Note that for $\theta > 0$,
$l'(\theta) = \theta - z + p_\lambda'(\theta).$
When $z < p_0$, the function $l$ is strictly increasing on $(0, \infty)$ because the derivative function is positive. Hence $\hat\theta(z) = 0$. Now assume that $z > p_0$. When the function $l'(\theta)$ is strictly increasing, there is at most one zero-crossing, and hence the solution is unique. If $l'(\theta)$ has a valley on $(0, \infty)$ (why?), there are two possible zero-crossings of $l'$ on $(0, \infty)$. The larger one is the minimizer because the derivative function is increasing at that point. Hence, the solution is unique and satisfies
$\hat\theta(z) = z - p_\lambda'(\hat\theta(z)) \le z. \quad (2)$

Proof (continued):
(c). From the above, it is easy to see that $\hat\theta(z) \le z - p_\lambda'(z)$ when $p_\lambda'(\cdot)$ is nonincreasing. Let $\theta_0$ be the minimizer of $\theta + p_\lambda'(\theta)$ over $[0, \infty)$. Then, from the preceding argument, $\hat\theta(z) > \theta_0$ for $z > p_0$. If $p_\lambda'(\cdot)$ is nonincreasing, then
$p_\lambda'(\hat\theta(z)) \le p_\lambda'(\theta_0) \le \theta_0 + p_\lambda'(\theta_0) = p_0.$
This and (2) prove result (c).
(d). It is clear that the solution $\hat\theta(z)$ is continuous at the point $z = p_0$ if and only if the minimum of the function $|\theta| + p_\lambda'(|\theta|)$ is attained at 0. The continuity at other locations follows directly from the monotonicity and continuity of the function $\theta + p_\lambda'(\theta)$ on the interval $(0, \infty)$.
(e). This follows directly from (2). If $p_\lambda(z)$ is twice differentiable for large $z$ values, and $p_\lambda''(|z|) \to 0$ as $|z| \to \infty$, then from (2),
$\hat\theta(z) = z - p_\lambda'(\hat\theta(z)) = z - p_\lambda'(z) - (\hat\theta(z) - z)\,p_\lambda''(z^*),$
where $z^*$ is between $z$ and $\hat\theta(z)$. Since $z \to \infty$ implies $\hat\theta(z) \to \infty$, and hence $p_\lambda''(z^*) \to 0$ as $z \to \infty$, this in turn leads to $|\hat\theta(z) - z| \le c\,p_\lambda'(z)$ for some constant $c$. Therefore,
$|\hat\theta(z) - z + p_\lambda'(z)| = |\hat\theta(z) - z|\,|p_\lambda''(z^*)| \le c\,p_\lambda'(z)\,|p_\lambda''(z^*)| = o(p_\lambda'(z)).$

The above theorem implies that the PLS estimator $\hat\theta(z)$ possesses the following properties:
Sparsity if $\min_{t \ge 0}\{t + p_\lambda'(t)\} > 0$;
Approximate unbiasedness if $p_\lambda'(t) \approx 0$ for large $t$;
Continuity if and only if $\operatorname{argmin}_{t \ge 0}\{t + p_\lambda'(t)\} = 0$.
In general, for penalty functions, the singularity at the origin (i.e., $p_\lambda'(0+) > 0$) is needed for generating sparsity in variable selection, and concavity is needed to reduce the bias. These conditions are applicable for general PLS problems and more.
Homework: Prove Antoniadis and Fan (2001)'s theorem for $z < 0$.

Lasso: Orthogonal Design. Under the orthogonal design, i.e., $X^T X = nI$, Lasso estimation can be greatly simplified as discussed above. The problem becomes solving
$\hat\beta_j = \operatorname{argmin}_{\beta_j}\left\{\frac{1}{2}(\hat\beta_j^{LSE} - \beta_j)^2 + \lambda|\beta_j|\right\}.$
The Lasso estimator is given by $\hat\beta_j = S(\hat\beta_j^{LSE}; \lambda)$, where $S(\cdot; \lambda)$ is the soft-thresholding operator
$S(z; \lambda) = \operatorname{sgn}(z)(|z| - \lambda)_+ = \begin{cases} z - \lambda, & \text{if } z > \lambda, \\ 0, & \text{if } |z| \le \lambda, \\ z + \lambda, & \text{if } z < -\lambda. \end{cases}$
Homework: Verify the expression of $\hat\beta_j$ based on Antoniadis and Fan (2001)'s theorem.
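For checking the formula numerically, a one-line R version of the soft-thresholding operator (not from the slides):

# Soft-thresholding operator S(z; lambda) = sgn(z) * (|z| - lambda)_+
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)
soft_threshold(c(-3, -0.5, 0.2, 2), lambda = 1)   # returns -2, 0, 0, 1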

Hard Thresholding. The solution to
$\operatorname{argmin}_\theta\left\{\frac{1}{2}(z - \theta)^2 + \lambda^2 - (|\theta| - \lambda)^2 I(|\theta| < \lambda)\right\}$
is given by $\hat\theta = H(z; \lambda)$, where
$H(z; \lambda) = z\,I(|z| > \lambda) = \begin{cases} z, & \text{if } |z| > \lambda, \\ 0, & \text{if } |z| \le \lambda, \end{cases}$
is called the hard-thresholding operator. The proof of the above minimizer is based upon Antoniadis and Fan (2001)'s theorem. This corresponds to best subset selection. Note that best subset selection of size $k$ reduces to choosing the $k$ largest coefficients in absolute value and setting the rest to 0.

Ridge/Non-negative Garrote Estimators. Under the orthogonal design, the ridge regression estimator is given by
$\frac{1}{1 + \lambda}\,\hat\beta_j^{LSE}.$
Under the orthogonal design, the non-negative garrote solution is given by
$\hat\beta_j^{NG} = \left(1 - \frac{\lambda}{(\hat\beta_j^{LSE})^2}\right)_+ \hat\beta_j^{LSE}.$
For: note that the target function in the non-negative garrote procedure can be written as
$\frac{1}{2n}\|y - X\,\mathrm{diag}(\hat\beta^{LSE})\,c\|_2^2 + \lambda\sum_{j=1}^p c_j, \quad c_j \ge 0.$
Let $\tilde c_j = c_j\hat\beta_j^{LSE}$. Then the target function becomes
$\frac{1}{2n}\|y - X\tilde c\|_2^2 + \lambda\sum_{j=1}^p |\hat\beta_j^{LSE}|^{-1}|\tilde c_j|,$
which, under the orthogonal design, is equivalent to
$\frac{1}{2}\|\hat\beta^{LSE} - \tilde c\|_2^2 + \lambda\sum_{j=1}^p |\hat\beta_j^{LSE}|^{-1}|\tilde c_j|.$

Therefore, minimizing the above target function amounts to finding, for each $j$, the minimizer of
$\frac{1}{2}(\hat\beta_j^{LSE} - \tilde c_j)^2 + \lambda|\hat\beta_j^{LSE}|^{-1}|\tilde c_j|.$
Hence the solution is
$\tilde c_j = S(\hat\beta_j^{LSE};\ \lambda|\hat\beta_j^{LSE}|^{-1}).$
Transforming back to $c_j$ gives
$\hat\beta_j^{NG} = S(\hat\beta_j^{LSE};\ \lambda|\hat\beta_j^{LSE}|^{-1}).$

Comparison of Four Estimators (figure not reproduced; a sketch that redraws the four estimators is given below).
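Since the comparison figure is not reproduced here, the following R sketch redraws the four orthogonal-design estimators as functions of the OLS estimate; lambda = 1 throughout, a value chosen only for illustration.

# Soft (lasso), hard (best subset), ridge and non-negative garrote estimators
# plotted against the OLS estimate z (lambda = 1 for illustration).
lambda <- 1
z <- seq(-4, 4, length.out = 801)
soft    <- sign(z) * pmax(abs(z) - lambda, 0)
hard    <- z * (abs(z) > lambda)
ridge   <- z / (1 + lambda)
garrote <- pmax(1 - lambda / z^2, 0) * z
plot(z, z, type = "l", lty = 3, xlab = "OLS estimate", ylab = "estimate")
lines(z, soft, col = 1); lines(z, hard, col = 2)
lines(z, ridge, col = 3); lines(z, garrote, col = 4)
legend("topleft", c("soft", "hard", "ridge", "garrote"), col = 1:4, lty = 1)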

The Geometry of Lasso. The RSS term $\|y - X\beta\|^2$ equals the following quadratic function plus a constant:
$(\beta - \hat\beta^{LSE})^T X^T X(\beta - \hat\beta^{LSE}).$
The contours of this function are ellipsoids centered at the LSE. For Lasso, the $L_1$-constraint region is a diamond; for ridge, the $L_2$-constraint region is a disk.

Another way of comparing Lasso and ridge is from a Bayesian perspective. In ridge regression, the prior on $\beta$ is a normal distribution; in Lasso, the prior on $\beta$ is a Laplace distribution.


How do we obtain the Lasso solution in the general case? Lasso is a convex programming problem, and several algorithms have been proposed for computing $L_1$-penalized estimates:
Coordinate descent; see Fu (1998) and Friedman et al. (2007).
Convex optimization algorithms; see Osborne et al. (2000a, b).
Least angle regression (LARs); see Efron et al. (2004).
Others ...
Here we first focus on the coordinate descent algorithm, which is simple, stable, and efficient for a variety of high-dimensional models. Coordinate descent algorithms optimize a target function with respect to a single parameter at a time, iteratively cycling through all parameters until convergence is reached. They are ideal for problems that have a simple closed-form solution in a single dimension but lack one in higher dimensions.

Coordinate Descent Algorithm: Derivation. Given current values for the regression coefficients $\beta_k = \tilde\beta_k$, $k \ne j$, define
$L_j(\beta_j; \lambda) = \frac{1}{2n}\sum_{i=1}^n\Big(y_i - \sum_{k \ne j} x_{ik}\tilde\beta_k - x_{ij}\beta_j\Big)^2 + \lambda|\beta_j|.$
Denote
$\tilde y_{ij} = \sum_{k \ne j} x_{ik}\tilde\beta_k, \quad \tilde r_{ij} = y_i - \tilde y_{ij}, \quad \tilde z_j = \frac{1}{n}\sum_{i=1}^n x_{ij}\tilde r_{ij}.$
The $\tilde r_{ij}$ are called the partial residuals with respect to the $j$-th covariate. Some algebra shows that
$L_j(\beta_j; \lambda) = \frac{1}{2}(\beta_j - \tilde z_j)^2 + \lambda|\beta_j| + \frac{1}{2n}\sum_{i=1}^n \tilde r_{ij}^2 - \frac{1}{2}\tilde z_j^2.$
Let $\tilde\beta_j$ denote the minimizer of $L_j(\beta_j; \lambda)$. We have
$\tilde\beta_j = S(\tilde z_j; \lambda) = \operatorname{sgn}(\tilde z_j)(|\tilde z_j| - \lambda)_+,$
where $S(\cdot; \lambda)$ is the soft-thresholding operator.

Coordinate Descent Algorithm. For any fixed $\lambda$:
1. Start with an initial value $\beta = \beta^{(0)}$.
2. In the $(s+1)$-th iteration:
(1) Let $j = 1$.
(2) Calculate $\tilde z_j = n^{-1}\sum_{i=1}^n x_{ij} r_i + \beta_j^{(s)}$, where $r_i = y_i - \tilde y_i = y_i - \sum_{j=1}^p x_{ij}\beta_j^{(s)}$ is the current residual.
(3) Update $\beta_j^{(s+1)} = S(\tilde z_j; \lambda)$.
(4) Update $r_i \leftarrow r_i - (\beta_j^{(s+1)} - \beta_j^{(s)})x_{ij}$ for all $i$.
(5) If $j < p$, let $j \leftarrow j + 1$ and repeat (2)-(4); if $j = p$, exit step 2.
3. Repeat step 2 until convergence.
NOTE: The above algorithm is designed for the case in which the predictors are standardized to have $L_2$-norm $\sqrt{n}$, i.e., $\sum_{i=1}^n x_{ij}^2 = n$.
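A compact R translation of the algorithm above for a single lambda, as a teaching sketch only (it assumes the standardization in the NOTE and is not the ncvreg implementation; the function name lasso_cd is mine):

# Coordinate descent for the lasso at a fixed lambda.
# Assumes centered y and predictors standardized so that colSums(X^2) = n.
lasso_cd <- function(X, y, lambda, beta = rep(0, ncol(X)),
                     maxit = 1000, tol = 1e-8) {
  n <- nrow(X)
  r <- drop(y - X %*% beta)                      # current residuals
  for (s in seq_len(maxit)) {
    beta_old <- beta
    for (j in seq_len(ncol(X))) {
      zj <- sum(X[, j] * r) / n + beta[j]        # z_j from the partial residuals
      bj <- sign(zj) * max(abs(zj) - lambda, 0)  # soft-thresholding update
      r <- r - (bj - beta[j]) * X[, j]           # update residuals
      beta[j] <- bj
    }
    if (max(abs(beta - beta_old)) < tol) break   # convergence check
  }
  beta
}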

The coordinate descent algorithm can be used to repeatedly compute $\hat\beta(\lambda)$ on a grid of $\lambda$ values. Let $\lambda_{\max}$ be the smallest value for which all coefficients are 0, and let $\lambda_{\min}$ be the minimum of the grid. We can use $\lambda_{\max} = \max_j |(x_j^T x_j)^{-1} x_j^T y|$ (consider the orthogonal design case). If the design matrix is of full rank, $\lambda_{\min}$ can be 0; otherwise, we use $\lambda_{\min} = \varepsilon\lambda_{\max}$ for some small $\varepsilon$, e.g., $\varepsilon = 10^{-4}$. Let $\lambda_0 > \lambda_1 > \cdots > \lambda_K$ be a grid of decreasing $\lambda$ values, where $\lambda_0 = \lambda_{\max}$ and $\lambda_K = \lambda_{\min}$. Start at $\lambda_0$, for which $\hat\beta$ is 0 or close to 0, and proceed along the grid, using the value of $\hat\beta$ at the previous grid point as the initial value for the current point. This is called a warm start (see the sketch below).
Homework: (a) Verify the updating steps in the coordinate descent algorithm for Lasso. (b) Write an R program to implement the coordinate descent algorithm for computing the Lasso solution path.
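A short pathwise wrapper with warm starts, reusing the lasso_cd sketch above; the grid construction follows the slide, while the function name and the grid size are mine.

# Pathwise computation on a decreasing lambda grid with warm starts.
lasso_path <- function(X, y, nlambda = 100, eps = 1e-4) {
  n <- nrow(X)
  lambda_max <- max(abs(crossprod(X, y)) / n)        # max_j |x_j' y| / n
  lambdas <- exp(seq(log(lambda_max), log(eps * lambda_max), length.out = nlambda))
  B <- matrix(0, ncol(X), nlambda)
  beta <- rep(0, ncol(X))
  for (k in seq_along(lambdas)) {
    beta <- lasso_cd(X, y, lambdas[k], beta = beta)  # warm start from previous fit
    B[, k] <- beta
  }
  list(lambda = lambdas, beta = B)
}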


Penalty Parameter Selection: CV.
Divide the data into $V$ roughly equal parts ($V$ = 5 to 10).
For each $v = 1, 2, \ldots, V$, fit the model with parameter $\lambda$ to the other $V - 1$ parts; denote the resulting estimate by $\hat\beta^{(-v)}(\lambda)$.
Compute the prediction error (PE) in predicting the $v$-th part:
$PE_v(\lambda) = \sum_{i \in v\text{-th part}} (y_i - x_i^T\hat\beta^{(-v)}(\lambda))^2.$
Compute the overall cross-validation error
$CV(\lambda) = \frac{1}{V}\sum_{v=1}^V PE_v(\lambda).$
Carry out the above steps for many values of $\lambda$ and choose the value of $\lambda$ that minimizes $CV(\lambda)$.
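A hand-rolled sketch of the V-fold procedure above (in practice one would simply call cv.glmnet or cv.ncvreg, which implement it); the fold assignment and the lambda grid are supplied by the user and are illustrative.

# V-fold cross-validation for lambda, following the steps above.
library(glmnet)
cv_lasso <- function(X, y, lambdas, V = 10) {
  lambdas <- sort(lambdas, decreasing = TRUE)
  folds <- sample(rep(seq_len(V), length.out = nrow(X)))
  pe <- matrix(0, V, length(lambdas))
  for (v in seq_len(V)) {
    train <- folds != v
    fit <- glmnet(X[train, ], y[train], lambda = lambdas)
    pred <- predict(fit, newx = X[!train, ], s = lambdas)
    pe[v, ] <- colSums((y[!train] - pred)^2)         # PE_v(lambda)
  }
  cv <- colMeans(pe)                                 # CV(lambda)
  list(lambda = lambdas, cv = cv, lambda_min = lambdas[which.min(cv)])
}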

Penalty Parameter Selection: AIC and GCV. An AIC-type criterion for choosing $\lambda$ is
$AIC(\lambda) = \log\{\|y - X\hat\beta(\lambda)\|^2/n\} + 2\,df(\lambda)/n,$
where, for the Lasso estimator, $df(\lambda)$ is the number of nonzero coefficients in the model fitted with $\lambda$. A generalized cross-validation (GCV) criterion is defined as
$GCV(\lambda) = \frac{n\|y - X\hat\beta(\lambda)\|^2}{(n - df(\lambda))^2}.$
It can be seen that these two criteria are close to each other when $df(\lambda)$ is relatively small compared to $n$.

Penalty Parameter Selection: BIC. The GCV and AIC are reasonable criteria for tuning. However, they tend to select more variables than the true model contains. Another criterion that is more aggressive in seeking a sparse model is the Bayesian information criterion (BIC):
$BIC(\lambda) = \log\{\|y - X\hat\beta(\lambda)\|^2/n\} + 2\log(n)\,df(\lambda)/n.$
The tuning parameter $\lambda$ is selected as the minimizer of $AIC(\lambda)$, $GCV(\lambda)$, or $BIC(\lambda)$.
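A sketch that evaluates the three criteria along an ncvreg lasso path, with df(lambda) taken as the number of nonzero coefficients as above; the BIC factor follows the slide's formula, and the helper name ic_lasso is mine.

# AIC, GCV and BIC along a fitted lasso path (intercept excluded from df).
library(ncvreg)
ic_lasso <- function(X, y, fit) {
  n <- length(y)
  yhat <- predict(fit, X)                        # n x nlambda fitted values
  rss  <- colSums((y - yhat)^2)
  df   <- colSums(coef(fit)[-1, , drop = FALSE] != 0)
  aic  <- log(rss / n) + 2 * df / n
  gcv  <- n * rss / (n - df)^2
  bic  <- log(rss / n) + 2 * log(n) * df / n     # as defined on the slide
  cbind(lambda = fit$lambda, aic = aic, gcv = gcv, bic = bic)
}
# Example: fit <- ncvreg(X, y, penalty = "lasso"); ic_lasso(X, y, fit)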


Automatic model-building algorithms are familiar, and sometimes notorious, in the linear model literature: Forward Selection, Backward Elimination, Best Subset Selection, and many others are used to produce good linear models for predicting a response $y$ on the basis of some measured covariates $x_1, x_2, \ldots, x_p$. The Lasso and Forward Stagewise regression will be discussed in this section. They are both motivated by a unified approach called Least Angle Regression (LARs). LARs provides a unified explanation, a fast implementation, and a fast way to choose the tuning parameter for the Lasso and Forward Stagewise regression.

Forward Stepwise Selection. Given a collection of possible predictors, we select the one having the largest absolute correlation with the response $y$, say $x_{j_1}$, and perform a linear regression of $y$ on $x_{j_1}$. This leaves a residual vector orthogonal to $x_{j_1}$, now considered to be the response. Regressing the other predictors on $x_{j_1}$ leaves $p - 1$ residuals, now considered as new predictors. Repeat the selection process by selecting, from the new predictors, the one having the largest absolute correlation with the new response. After $k$ steps this results in a set of predictors $x_{j_1}, \ldots, x_{j_k}$ that are then used in the usual way to construct a $k$-parameter linear model. Forward Selection is an aggressive fitting technique that can be overly greedy, perhaps eliminating at the second step useful predictors that happen to be correlated with $x_{j_1}$.

Forward Stagewise Selection. Forward Stagewise is a much more cautious version of Forward Selection, which may take many tiny steps as it moves toward a final model. We assume that the covariates have been standardized to have mean 0 and unit length, and that the response has mean 0.
Forward Stagewise Selection:
(1) Begin with $\hat\mu = X\hat\beta = 0$ and build up the regression function in successive small steps.
(2) If $\hat\mu$ is the current Stagewise estimate, let $c(\hat\mu)$ be the vector of current correlations,
$\hat c = c(\hat\mu) = X^T(y - \hat\mu),$
so that $\hat c_j$ is proportional to the correlation between covariate $x_j$ and the current residual vector.
(3) The next step is taken in the direction of the greatest current correlation, $\hat j = \operatorname{argmax}_j |\hat c_j|$, and $\hat\mu$ is updated by
$\hat\mu \leftarrow \hat\mu + \varepsilon\,\operatorname{sgn}(\hat c_{\hat j})\,x_{\hat j},$
with $\varepsilon$ a small constant.
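A direct R translation of steps (1)-(3), assuming standardized predictors and a centered response; the step size eps and the number of steps are illustrative, and the function name is mine.

# Incremental forward stagewise: repeatedly take a small step in the direction
# of the predictor most correlated with the current residual.
forward_stagewise <- function(X, y, eps = 0.01, nsteps = 2000) {
  beta <- rep(0, ncol(X))
  mu <- rep(0, nrow(X))
  for (s in seq_len(nsteps)) {
    chat <- drop(crossprod(X, y - mu))   # current correlations c(mu)
    j <- which.max(abs(chat))            # most correlated predictor
    delta <- eps * sign(chat[j])
    beta[j] <- beta[j] + delta
    mu <- mu + delta * X[, j]            # mu <- mu + eps * sgn(c_j) * x_j
  }
  beta
}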

A small $\varepsilon$ is important here: the big choice $\varepsilon = |\hat c_{\hat j}|$ leads to the classic Forward Selection technique, which can be overly greedy, impulsively eliminating covariates that are correlated with $x_{\hat j}$. How does this relate to the Lasso?
Example (Diabetes Study): 442 diabetes patients were measured on 10 baseline variables (age, sex, body mass index, average blood pressure, and six blood serum measurements), as well as the response of interest, a quantitative measure of disease progression one year after baseline. A prediction model was desired for the response variable. The $442 \times 10$ design matrix $X$ has been standardized to have unit $L_2$-norm and zero mean in each column. We use Lasso and Forward Stagewise regression to fit the model.
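Plots like the two compared on the next slide can be reproduced with the lars package, which ships the diabetes data; a minimal sketch (plotting choices are illustrative) is:

# Lasso and forward stagewise coefficient paths for the diabetes data.
library(lars)
data(diabetes)                      # bundled with the lars package
fit_lasso <- lars(diabetes$x, diabetes$y, type = "lasso")
fit_stage <- lars(diabetes$x, diabetes$y, type = "forward.stagewise")
par(mfrow = c(1, 2))
plot(fit_lasso)                     # lasso paths
plot(fit_stage)                     # forward stagewise paths
# type = "lar" gives the plain least angle regression path discussed below.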

The two plots are nearly identical, but differ slightly for larger $t$, as can be seen in the track of covariate 8. Coincidence, or a general fact?

Question: Are Lasso and infinitesimal forward stagewise identical? Answer: With an orthogonal design, yes; otherwise, they are similar. Question: Why? Answer: LARs provides answers to these questions, and an efficient way to compute the complete Lasso sequence of solutions.

LARs Algorithm: General Idea. Least Angle Regression is a stylized version of the Stagewise procedure; only $p$ steps are required for the full set of solutions. The LARs procedure works roughly as follows:
(1) Standardize the predictors to have mean 0 and unit $L_2$-norm, and the response to have mean 0.
(2) As with classic Forward Selection, start with all coefficients equal to zero and find the predictor most correlated with $y$, say $x_{j_1}$.
(3) Take the largest step possible in the direction of this predictor until some other predictor, say $x_{j_2}$, has as much correlation with the current residual.
(4) At this point, LARs parts company with Forward Selection. Instead of continuing along $x_{j_1}$, LARs proceeds in a direction equiangular between the two predictors until a third variable $x_{j_3}$ earns its way into the most correlated set.
(5) LARs then proceeds equiangularly between $x_{j_1}$, $x_{j_2}$, $x_{j_3}$, i.e., along the "least angle direction," until a fourth variable enters.
(6) Continue in this way until all $p$ predictors have been entered.

LARs Algorithm: Two-Predictor Case. In a two-predictor case, the current correlations depend only on the projection of $y$ into the linear space spanned by $x_1$ and $x_2$. The algorithm begins at $\hat\mu_0 = 0$, then augments $\hat\mu_0$ in the direction of $x_1$ to $\hat\mu_1 = \hat\mu_0 + \hat\gamma_1 x_1$. Stagewise would choose $\hat\gamma_1$ equal to some small value $\varepsilon$ and then repeat the process many times. Classic Forward Selection would take $\hat\gamma_1$ large enough to make $\hat\mu_1$ equal the projection of $y$ onto $L(x_1)$. LARs uses an intermediate value of $\hat\gamma_1$: the value that makes the residual equally correlated with $x_1$ and $x_2$. The next LARs estimate is $\hat\mu_2 = \hat\mu_1 + \hat\gamma_2 u_2$, with $\hat\gamma_2$ chosen to make $\hat\mu_2 = X\hat\beta^{LSE}$, where $u_2$ is the unit bisector of $x_1$ and $x_2$. With $p > 2$ covariates, $\hat\gamma_2$ would be smaller, leading to another change of direction.


LARs in Action: Least Angle Regression.
(1) Begin at $\hat\mu_0 = 0$.
(2) Compute $\hat c = X^T(y - \hat\mu_0)$ and $\hat C = \max_j |\hat c_j|$. Find
$A = \{j : |\hat c_j| = \hat C\}, \quad s_j = \operatorname{sgn}(\hat c_j), \ j = 1, 2, \ldots, p.$
(3) Compute
$X_A = (\cdots, s_j x_j, \cdots)_{j \in A}, \quad G_A = X_A^T X_A, \quad A_A = (1_A^T G_A^{-1} 1_A)^{-1/2},$
and
$w_A = A_A G_A^{-1} 1_A, \quad u_A = X_A w_A, \quad a = X^T u_A.$
(4) Update $\hat\mu_0$ with $\hat\mu = \hat\mu_0 + \hat\gamma u_A$, where
$\hat\gamma = \min_{j \in A^c}{}^{+}\left\{\frac{\hat C - \hat c_j}{A_A - a_j},\ \frac{\hat C + \hat c_j}{A_A + a_j}\right\},$
and $\min^+$ indicates that the minimum is taken over only positive components.

Relationship between LARs, Lasso and Forward Stagewise Regression. Lasso and forward stagewise can be thought of as restricted versions of LARs.
For Lasso: start with LARs. If a coefficient crosses zero, stop, drop that predictor from the active set, recompute the best direction, and continue. This gives the Lasso path.
For forward stagewise: start with LARs and compute the best (equiangular) direction at each stage. If the direction for any predictor $j$ does not agree in sign with $\mathrm{corr}(r, x_j)$, project the direction onto the positive cone and use the projected direction instead. In other words, forward stagewise always moves each predictor in the direction of $\mathrm{corr}(r, x_j)$. The incremental forward stagewise procedure approximates these steps, one predictor at a time. As the step size $\varepsilon \to 0$, it can be shown to coincide with this modified version of LARs.


References
Antoniadis, A. and J. Fan (2001). Regularization of wavelet approximations. Journal of the American Statistical Association 96, 939-967.
Breiman, L. (1995). Better subset regression using the nonnegative garrote. Technometrics 37(4), 373-384.
Efron, B., T. Hastie, I. Johnstone, and R. Tibshirani (2004). Least angle regression. The Annals of Statistics 32(2), 407-499.
Fan, J. and R. Li (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96(456), 1348-1360.
Frank, I. E. and J. H. Friedman (1993). A statistical view of some chemometrics regression tools. Technometrics 35(2), 109-135.
Friedman, J., T. Hastie, H. Höfling, and R. Tibshirani (2007). Pathwise coordinate optimization. Annals of Applied Statistics 1(2), 302-332.
Huang, J., P. Breheny, and S. Ma (2012). A selective review of group selection in high-dimensional models. Statistical Science 27(4), 481-499.
Huang, J., S. Ma, H. Xie, and C.-H. Zhang (2009). A group bridge approach for variable selection. Biometrika 96(2), 339-355.

Osborne, M. R., B. Presnell, and B. A. Turlach (2000). On the LASSO and its dual. Journal of Computational and Graphical Statistics 9(2), 319-337.
Park, T. and G. Casella (2008). The Bayesian lasso. Journal of the American Statistical Association 103(482), 681-686.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267-288.
Yuan, M. and Y. Lin (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B 68(1), 49-67.
Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics 38, 894-942.