Regression Shrinkage and Selection via the Lasso

Robert Tibshirani, 1996
Presenter: Guiyun Feng, April 27

Motivation
Estimation in linear models: $y = \beta^T x + \varepsilon$.
Data: $(x_i, y_i)$, $i = 1, 2, \dots, N$.
The $x_{ij}$ are standardized so that $\sum_i x_{ij} = 0$ and $\sum_i x_{ij}^2 / N = 1$.
$x_i = (x_{i1}, \dots, x_{ip})^T$ are the regressors and $y_i$ is the response for the $i$-th observation.
Question: what is the criterion for a good $\hat\beta$?

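Criterion for a good $\hat\beta$
1. Is $\hat\beta$ close to $\beta$? $\mathrm{MSE}(\hat\beta) = E[\|\hat\beta - \beta\|^2]$.
2. Will $\hat\eta(X) = \hat\beta^T X$ predict future data well? The prediction error of $\hat\eta(X)$ at $X = x_0$ is
$$PE(x_0) = E\{(Y - \hat\eta(X))^2 \mid X = x_0\} = \sigma^2 + \mathrm{Bias}^2(\hat\eta(x_0)) + \mathrm{Var}(\hat\eta(x_0)).$$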

How to Get Coefficient Estimates?
Ordinary least squares (OLS): $\min_\beta \sum_{i=1}^N \bigl(y_i - \sum_j \beta_j x_{ij}\bigr)^2$.
Solution: $\hat\beta = (X^T X)^{-1} X^T y$.
Prediction accuracy: low bias but large variance.
Interpretation: what are the most important effects?
To improve the OLS estimator:
- Subset selection
- Ridge regression
- Breiman's non-negative garotte
- Lasso

Subset Selection
Goal: find a small subset of the available independent variables to predict the dependent variable.
One algorithm, forward selection:
1. Begin with no terms in the model.
2. Find the term that, when added to the model, achieves the largest value of $R^2$. Enter this term into the model.
3. Continue adding terms until a target value for $R^2$ is achieved or until a preset limit on the maximum number of terms in the model is reached.
Drawback: regressors are either retained or dropped, so the selected model is sensitive to small changes in the data.
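
The three forward-selection steps can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function names `r_squared` and `forward_selection`, the default `target_r2`, and the toy data are assumptions made for the example, not part of the slides.

```python
import numpy as np

def r_squared(X, y, cols):
    """R^2 of an OLS fit (with intercept) on the chosen columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    total = y - y.mean()
    return 1.0 - (resid @ resid) / (total @ total)

def forward_selection(X, y, target_r2=0.95, max_terms=None):
    """Greedy forward selection following the three steps on the slide."""
    p = X.shape[1]
    max_terms = p if max_terms is None else max_terms
    selected = []                      # step 1: begin with no terms
    while len(selected) < max_terms:
        remaining = [j for j in range(p) if j not in selected]
        # step 2: enter the term that gives the largest R^2
        scores = {j: r_squared(X, y, selected + [j]) for j in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        # step 3: stop once the target R^2 is reached
        if scores[best] >= target_r2:
            break
    return selected

# Toy data (hypothetical): only columns 2 and 4 carry signal.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 6))
y = 4.0 * X[:, 2] - 3.0 * X[:, 4] + 0.1 * rng.standard_normal(200)
chosen = forward_selection(X, y, target_r2=0.99)
```

Because each variable is either fully in or fully out, a small perturbation of `y` can flip which term wins step 2, which is the instability the slide points to.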

Ridge Regression
Ridge regression: $\min_\beta \sum_{i=1}^N \bigl(y_i - \sum_j \beta_j x_{ij}\bigr)^2$ subject to $\sum_j \beta_j^2 \le t$.
Equivalent to $\min_\beta \sum_{i=1}^N \bigl(y_i - \sum_j \beta_j x_{ij}\bigr)^2 + \lambda \sum_j \beta_j^2$.
Solution: $\hat\beta = (X^T X + \lambda I)^{-1} X^T y$.
Property: more stable (shrinks coefficients continuously), but not easily interpretable (sets no coefficients to zero).
Question: how to retain the good features of both subset selection and ridge regression?
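
As a small illustration of the closed-form ridge solution, the sketch below compares it with OLS on toy data. The function name `ridge`, the value `lam = 25.0`, and the simulated data are assumptions chosen for the example.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimate (X^T X + lam * I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Toy data (hypothetical), standardized column-wise and centered.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 4))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.standard_normal(50)
y = y - y.mean()

beta_ols = ridge(X, y, 0.0)       # lam = 0 recovers the OLS solution
beta_ridge = ridge(X, y, 25.0)    # shrunk toward zero, but not exactly zero
```

Comparing `beta_ols` with `beta_ridge` shows the coefficient vector shrinking while every entry stays nonzero, which is the interpretability drawback noted above.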

Breiman's Non-negative Garotte (1993)
Starts with the OLS estimates $\hat\beta^o$ and shrinks them by non-negative factors whose sum is constrained:
$$\hat c = \arg\min_c \sum_{i=1}^N \Bigl(y_i - \sum_j c_j \hat\beta_j^o x_{ij}\Bigr)^2 \quad \text{subject to } c_j \ge 0,\ \sum_j c_j \le t, \qquad (1)$$
with $\hat\beta_j = \hat c_j \hat\beta_j^o$.
Advantage: consistently lower prediction error than subset selection; competitive with ridge regression except when the true model has many small nonzero coefficients.
Disadvantage: the solution depends on both the sign and the magnitude of the OLS estimate.

Least Absolute Shrinkage and Selection Operator (LASSO)
$$\hat\beta = \arg\min_\beta \sum_{i=1}^N \Bigl(y_i - \sum_j \beta_j x_{ij}\Bigr)^2 \quad \text{subject to } \sum_j |\beta_j| \le t. \qquad (2)$$
Questions: How to solve the optimization problem? How to find a good $t$?

Algorithms for Finding the Lasso Solution
Let $g(\beta) = \sum_{i=1}^N \bigl(y_i - \sum_j \beta_j x_{ij}\bigr)^2$.
Let $\delta_i$ be the $p$-tuples of the form $(\pm 1, \pm 1, \dots, \pm 1)$.
The condition $\sum_j |\beta_j| \le t$ is equivalent to $\delta_i^T \beta \le t$ for all $i$.
Denote by $G_E$ the matrix whose rows are $\delta_i$ for $i \in E$.
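
The equivalence between the $\ell_1$ constraint and the family of linear constraints $\delta_i^T \beta \le t$ can be checked numerically for small $p$: the largest $\delta_i^T \beta$ over all sign vectors equals $\sum_j |\beta_j|$. The helper name `l1_via_sign_vectors` and the test vector are illustrative assumptions.

```python
import itertools
import numpy as np

def l1_via_sign_vectors(beta):
    """Largest delta^T beta over all sign vectors delta in {-1, +1}^p."""
    p = len(beta)
    # Enumerate all 2^p sign vectors; only practical for small p.
    deltas = np.array(list(itertools.product([-1.0, 1.0], repeat=p)))
    return (deltas @ beta).max()

beta = np.array([1.5, -2.0, 0.0, 0.25])
value = l1_via_sign_vectors(beta)   # equals the l1 norm of beta
```

Since the number of sign vectors grows as $2^p$, an algorithm built on this formulation would only work with a subset of them at a time, which is the role the index set $E$ plays on the slide.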

Algorithms for Finding the Lasso Solution
1. Interior point method
2. Subgradient descent
3. LARS (least angle regression, Efron et al. 2004): lars package in R
4. Coordinate descent (Friedman et al. 2007): glmnet package in R
5. ISTA (Iterative Shrinkage-Thresholding Algorithm)
The lasso minimizes $f(\beta) + \lambda h(\beta)$, where $f(\beta) = \sum_{i=1}^N \bigl(y_i - \sum_j \beta_j x_{ij}\bigr)^2$ and $h(\beta) = \sum_j |\beta_j|$.
The gradient descent update for the smooth function $f$ is $x_{t+1} = x_t - \eta \nabla f(x_t)$, which can be written in proximal form as
$$x_{t+1} = \arg\min_{x \in \mathbb{R}^n} \; f(x_t) + \nabla f(x_t)^T (x - x_t) + \frac{1}{2\eta} \|x - x_t\|_2^2.$$
To minimize $f + \lambda h$, iterate
$$x_{t+1} = \arg\min_{x \in \mathbb{R}^n} \; f(x_t) + \nabla f(x_t)^T (x - x_t) + \frac{1}{2\eta} \|x - x_t\|_2^2 + \lambda h(x).$$
FISTA (Fast ISTA) convergence rate:
$$f(y_t) + g(y_t) - \bigl(f(x^*) + g(x^*)\bigr) \le \frac{2\beta \|x_1 - x^*\|^2}{t^2},$$
where $\beta$ in this bound appears to denote the smoothness constant of $f$ (not the coefficient vector), $g$ the nonsmooth term, and $x^*$ a minimizer.
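
For $h(x) = \sum_j |x_j|$ the ISTA update has a closed form: a gradient step followed by elementwise soft thresholding at level $\eta\lambda$. The sketch below is a minimal NumPy version under that reading; the function names, the step size $\eta = 1/(2\sigma_{\max}(X)^2)$, the iteration count, and the toy data are assumptions for illustration, not taken from the slides.

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1, applied elementwise."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def ista(X, y, lam, n_iter=500):
    """Minimize ||y - X b||^2 + lam * ||b||_1 by proximal gradient steps."""
    # The gradient of f is 2 X^T (X b - y); its Lipschitz constant
    # is 2 * sigma_max(X)^2, so eta below is a safe step size.
    eta = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ b - y)
        b = soft_threshold(b - eta * grad, eta * lam)
    return b

# Toy problem (hypothetical) with orthonormal columns, Q^T Q = I.
rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.standard_normal((40, 5)))
beta_true = np.array([3.0, -1.5, 0.0, 0.0, 2.0])
y = Q @ beta_true + 0.1 * rng.standard_normal(40)
beta_hat = ista(Q, y, lam=1.0)
```

With orthonormal columns the iteration lands on the soft-thresholded OLS estimate (threshold $\lambda/2$ for this unnormalized squared loss), so the small noisy coefficients are set exactly to zero.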

Estimation of $t$
1. Cross-validation: suppose $Y = \eta(X) + \varepsilon$; the prediction error of $\hat\eta(X)$ is $PE = E\{Y - \hat\eta(X)\}^2$. The PE is estimated over a grid of values of $s = t / \sum_j |\hat\beta_j^o|$ from 0 to 1, and the $\hat s$ yielding the lowest PE is selected.
2. Based on Stein's unbiased estimate of risk.
3. Based on a linear approximation to the lasso estimate.
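
A rough sketch of the cross-validation idea follows. Note the substitution: instead of the normalized bound $s$ used on the slide, it searches a grid of penalty values $\lambda$ in the penalized form, which traces out the same family of solutions. The solver, the fold count, the grid, and the toy data are all assumptions made for the example.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=300):
    """Penalized lasso, min ||y - X b||^2 + lam * ||b||_1, solved by ISTA."""
    eta = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = b - eta * 2.0 * X.T @ (X @ b - y)
        b = np.sign(z) * np.maximum(np.abs(z) - eta * lam, 0.0)
    return b

def cv_prediction_error(X, y, lams, n_folds=5, seed=0):
    """Average held-out squared error for each penalty value in lams."""
    order = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(order, n_folds)
    pe = np.zeros(len(lams))
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(y)), held_out)
        for k, lam in enumerate(lams):
            b = lasso_ista(X[train], y[train], lam)
            pe[k] += np.mean((y[held_out] - X[held_out] @ b) ** 2) / n_folds
    return pe

# Toy data (hypothetical): sparse coefficients, unit noise.
rng = np.random.default_rng(3)
X = rng.standard_normal((100, 8))
coef = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
y = X @ coef + rng.standard_normal(100)
lams = np.array([0.0, 1.0, 5.0, 20.0, 100.0, 1000.0])
pe = cv_prediction_error(X, y, lams)
best_lam = lams[np.argmin(pe)]   # penalty with the lowest estimated PE
```

A very large penalty zeroes every coefficient, so its estimated PE is close to the variance of `y`; the selected penalty sits where the estimated PE is lowest.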

Questions to Answer
1. Why does the lasso give sparser solutions than ridge regression?
2. How does the lasso perform compared with other regression methods?

Geometry of Lasso: Orthonormal Design Case
When the design matrix $X$ is an $n \times p$ matrix satisfying $X^T X = I$:
Best subset selection: keep the $k$ largest coefficients in absolute value.
Ridge regression: $\frac{1}{1+\gamma} \hat\beta_j^o$.
Garotte: $\Bigl(1 - \frac{\gamma}{(\hat\beta_j^o)^2}\Bigr)_+ \hat\beta_j^o$.
Lasso: $\hat\beta_j = \mathrm{sign}(\hat\beta_j^o)\,(|\hat\beta_j^o| - \gamma)_+$.
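
The four orthonormal-design rules are simple elementwise maps of the OLS estimates, so they can be compared directly. The function names and the example vector `b_ols` are illustrative assumptions; the garotte rule divides by $(\hat\beta_j^o)^2$, so the sketch assumes no OLS estimate is exactly zero.

```python
import numpy as np

def best_subset(b, k):
    """Keep the k largest entries in absolute value, zero the rest."""
    out = np.zeros_like(b)
    keep = np.argsort(-np.abs(b))[:k]
    out[keep] = b[keep]
    return out

def ridge_shrink(b, gamma):
    """Proportional shrinkage: every coefficient is scaled by 1 / (1 + gamma)."""
    return b / (1.0 + gamma)

def garotte_shrink(b, gamma):
    """Garotte: shrink by (1 - gamma / b^2)_+; assumes no entry of b is zero."""
    return np.maximum(1.0 - gamma / b**2, 0.0) * b

def lasso_shrink(b, gamma):
    """Lasso: soft thresholding, translate toward zero by gamma."""
    return np.sign(b) * np.maximum(np.abs(b) - gamma, 0.0)

b_ols = np.array([3.0, -0.5, 1.0, -2.0])   # hypothetical OLS estimates
subset_est = best_subset(b_ols, 2)
ridge_est = ridge_shrink(b_ols, 1.0)
garotte_est = garotte_shrink(b_ols, 1.0)
lasso_est = lasso_shrink(b_ols, 1.0)
```

Ridge rescales every coefficient and never produces an exact zero, while the lasso and the garotte both zero out small coefficients; the garotte shrinks large coefficients less than the lasso does.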

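Geometry of Lasso: General Case
Question: why does the constraint $\sum_j |\beta_j| \le t$ produce more zero coefficients than $\sum_j \beta_j^2 \le t$?
Explanation: $\sum_{i=1}^N \bigl(y_i - \sum_j \beta_j x_{ij}\bigr)^2 = (\beta - \hat\beta^o)^T X^T X (\beta - \hat\beta^o) + \text{constant}$.
The contours of the criterion are therefore ellipsoids centered at the OLS estimate. The lasso constraint region has corners on the coordinate axes, so the first contour to touch it can do so at a corner, where some coefficients are exactly zero; the ridge constraint region is a ball with no corners.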
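
The quadratic-form identity used in the explanation follows from the normal equations. As a short worked derivation, assuming $X^T X$ is invertible so that $\hat\beta^o$ satisfies $X^T (y - X\hat\beta^o) = 0$:

```latex
\begin{aligned}
\sum_{i=1}^N \Bigl(y_i - \sum_j \beta_j x_{ij}\Bigr)^2
  &= \|y - X\beta\|^2
   = \|(y - X\hat\beta^o) - X(\beta - \hat\beta^o)\|^2 \\
  &= \|y - X\hat\beta^o\|^2
     - 2(\beta - \hat\beta^o)^T X^T (y - X\hat\beta^o)
     + (\beta - \hat\beta^o)^T X^T X (\beta - \hat\beta^o) \\
  &= (\beta - \hat\beta^o)^T X^T X (\beta - \hat\beta^o) + \text{constant}.
\end{aligned}
```

The cross term vanishes by the normal equations, and the constant is the OLS residual sum of squares, which does not depend on $\beta$.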

Geometry of Lasso: General Case
Question: can the signs of the lasso estimates differ from those of the $\hat\beta_j^o$?
Answer: yes. The lasso can change the sign of an OLS estimate $\hat\beta_j^o$, whereas the garotte always retains it.

Lasso vs. Ridge Regression: Two-Predictor Case
Generate 100 data points from the model $y = 6x_1 + 3x_2$ with no noise, where $x_1$ and $x_2$ are standard normal variates with correlation $\rho$.
For the lasso,
$$\hat\beta_1 = \Bigl(\frac{t}{2} + \frac{\hat\beta_1^o - \hat\beta_2^o}{2}\Bigr)_+, \qquad \hat\beta_2 = \Bigl(\frac{t}{2} - \frac{\hat\beta_1^o - \hat\beta_2^o}{2}\Bigr)_+.$$
For ridge regression, the shrinkage depends on the correlation of the predictors.

Experiment 1
$y = \beta^T x + \sigma\varepsilon$, where $\beta = (3, 1.5, 0, 0, 2, 0, 0, 0)^T$ and $\mathrm{corr}(x_i, x_j) = 0.5^{|i-j|}$.

Experiment 2
$y = \beta^T x + \sigma\varepsilon$, where $\beta_j = 0.85$ for all $j$ and $\mathrm{corr}(x_i, x_j) = 0.5^{|i-j|}$.

Experiment 3
$y = \beta^T x + \sigma\varepsilon$, where $\beta = (5, 0, 0, 0, 0, 0, 0, 0, 0)^T$ and $\mathrm{corr}(x_i, x_j) = 0.5^{|i-j|}$.
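
The three experiments share one design and differ only in $\beta$, so a single data generator covers them. The sketch below is an assumption-laden illustration: the function name `simulate`, the Gaussian predictors, the sample size `n = 20`, the noise level `sigma = 3.0`, and the dimension `p = 8` are choices made for the example, since the slides do not state them.

```python
import numpy as np

def simulate(beta, sigma, n, rho=0.5, seed=0):
    """Draw (X, y) with corr(x_i, x_j) = rho^|i-j| and y = beta^T x + sigma * eps."""
    rng = np.random.default_rng(seed)
    p = len(beta)
    idx = np.arange(p)
    # Correlation matrix with entries rho^|i-j| (unit variances).
    cov = rho ** np.abs(idx[:, None] - idx[None, :])
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    y = X @ beta + sigma * rng.standard_normal(n)
    return X, y, cov

# Experiment 2 style coefficients: every beta_j equals 0.85 (p = 8 is assumed).
beta = np.full(8, 0.85)
X, y, cov = simulate(beta, sigma=3.0, n=20)
```

Passing a different `beta` reproduces the sparse designs of the other two experiments, which makes it easy to compare the methods across many small, dense, or single large effects.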

Conclusion
How to select from different models?
