A Short Introduction to the Lasso Methodology


1 A Short Introduction to the Lasso Methodology
Michael Gutmann
sites.google.com/site/michaelgutmann
University of Helsinki, Aalto University, Helsinki Institute for Information Technology
March 9, 2016
Michael Gutmann Short Introduction to the Lasso 1 / 24

2 Goals of the lecture
Lasso: Least Absolute Shrinkage and Selection Operator
Goal: after the lecture, to understand what these words mean.
Shrinkage: The lasso shrinks / regularizes the least squares regression coefficients (like ridge regression).
Selection: The lasso also performs variable selection (unlike ridge regression).
Least absolute: Shrinkage and selection are achieved by penalizing the absolute values of the regression coefficients.

3 Linear regression
Data: $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, n observations of pairs $(x_i, y_i)$
  $x_i \in \mathbb{R}^p$: covariates
  $y_i \in \mathbb{R}$: response
[Figure: response plotted against covariate; legend: data, regression lines]
Assumption: linear relation between covariates and response
$$\begin{aligned}
y_i &= x_{i1}\beta_1 + \ldots + x_{ip}\beta_p + e_i && (1)\\
    &= x_i^\top \beta + e_i && (2)
\end{aligned}$$
where $e_i$ is the residual.
Goal: Determine the coefficients $\beta = (\beta_1, \ldots, \beta_p)$.

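4 Least squares
Minimize the residual sum of squares (RSS)
$$\mathrm{RSS}(\beta) = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n \left(y_i - x_i^\top \beta\right)^2 \qquad (3)$$
In vector notation, with
$$y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \qquad
X = \begin{pmatrix} x_1^\top \\ \vdots \\ x_n^\top \end{pmatrix}
  = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix} \qquad (4)$$
we have
$$\mathrm{RSS}(\beta) = \|y - X\beta\|_2^2 \qquad (5)$$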

5 Least squares
Closed form solution if the $p \times p$ matrix $X^\top X$ is invertible:
$$\begin{aligned}
\hat\beta^o &= \operatorname*{argmin}_{\beta} \mathrm{RSS}(\beta) && (6)\\
            &= (X^\top X)^{-1} X^\top y && (7)
\end{aligned}$$
Prediction given a test covariate vector $x$:
$$\hat y = x^\top \hat\beta^o \qquad (8)$$
[Figure: response plotted against covariate; legend: data, regression lines, least squares solution]
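As a minimal sketch of Eqs. (3), (7) and (8), the following solves the normal equations for the special case p = 2, where the inverse of $X^\top X$ can be written out by hand. The data and the function names are made up for illustration and are not taken from the lecture.

```python
def least_squares_2d(X, y):
    """Return (X^T X)^{-1} X^T y for an n x 2 design matrix X, as in Eq. (7)."""
    # Entries of the symmetric 2 x 2 matrix X^T X
    a = sum(row[0] * row[0] for row in X)
    b = sum(row[0] * row[1] for row in X)
    d = sum(row[1] * row[1] for row in X)
    # Entries of the vector X^T y
    u = sum(row[0] * yi for row, yi in zip(X, y))
    v = sum(row[1] * yi for row, yi in zip(X, y))
    det = a * d - b * b
    if det == 0:
        raise ValueError("X^T X is not invertible")
    # Explicit inverse of a 2 x 2 matrix applied to X^T y
    return [(d * u - b * v) / det, (a * v - b * u) / det]


def rss(X, y, beta):
    """Residual sum of squares, Eq. (3)."""
    return sum((yi - (row[0] * beta[0] + row[1] * beta[1])) ** 2
               for row, yi in zip(X, y))


# Illustrative data generated without noise from beta = (2, -1)
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [2.0, -1.0, 1.0, 3.0]

beta_hat = least_squares_2d(X, y)
x_test = [1.0, 2.0]
y_pred = x_test[0] * beta_hat[0] + x_test[1] * beta_hat[1]  # Eq. (8)
print(beta_hat, rss(X, y, beta_hat), y_pred)
```

For general p one would normally call a linear algebra routine instead of inverting by hand; the explicit 2 x 2 case is used here only to keep the sketch dependency-free.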

6 Ridge regression
If $X^\top X$ is not invertible, a regularized inverse can be used:
$$\hat\beta^r = (X^\top X + \lambda I_p)^{-1} X^\top y \qquad (9)$$
where $I_p$ is the $p \times p$ identity matrix and $\lambda \ge 0$ is the regularization parameter.
This is ridge regression; $\hat\beta^r$ minimizes $J_r(\beta)$,
$$J_r(\beta) = \|y - X\beta\|_2^2 + \lambda \sum_{j=1}^p \beta_j^2 \qquad (10)$$
As $\lambda$ increases, $\hat\beta^r$ shrinks to zero ("shrinkage").
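The effect of the regularized inverse in Eq. (9) can be sketched on a made-up example with two identical covariate columns, so that $X^\top X$ is singular. The function name and data are assumptions for illustration; the 2 x 2 system is again solved by hand.

```python
def ridge_2d(X, y, lam):
    """Return (X^T X + lam * I)^{-1} X^T y for an n x 2 design matrix X, Eq. (9)."""
    # Entries of X^T X + lam * I (lam is added on the diagonal only)
    a = sum(row[0] * row[0] for row in X) + lam
    b = sum(row[0] * row[1] for row in X)
    d = sum(row[1] * row[1] for row in X) + lam
    # Entries of X^T y
    u = sum(row[0] * yi for row, yi in zip(X, y))
    v = sum(row[1] * yi for row, yi in zip(X, y))
    det = a * d - b * b
    if det == 0:
        raise ValueError("X^T X + lam * I is not invertible")
    return [(d * u - b * v) / det, (a * v - b * u) / det]


# Two identical columns: X^T X is singular, so lam = 0 has no unique solution
X = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
y = [2.0, 4.0, 6.0]

for lam in (1.0, 10.0, 100.0):
    print(lam, ridge_2d(X, y, lam))  # coefficients shrink as lam grows
```

Calling `ridge_2d(X, y, 0.0)` on this data raises `ValueError`, which corresponds to the non-invertible least squares case.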

7 Benefits of ridge regression
$$\hat\beta^r = (X^\top X + \lambda I_p)^{-1} X^\top y$$
Regularization / shrinkage is useful even if $X^\top X$ is invertible.
Reason: it can improve prediction accuracy.
Example: n = 50 observations, p = 10 covariates
Orthonormal matrix $X$: $X^\top X = I_p$, so that
$$\hat\beta^r = \frac{1}{1+\lambda} X^\top y = \frac{1}{1+\lambda} \hat\beta^o$$
[Figure: mean squared prediction error plotted against the penalty $\lambda$ (log 10); curves: least squares, ridge regression]

8 Limits of ridge regression
$$\hat\beta^r = \frac{1}{1+\lambda} \hat\beta^o$$
Data were artificially generated with $\beta = (3, 2, 1, 0, \ldots, 0)$.
The vector is sparse: only 3/10 nonzero terms.
Ridge regression cannot recover sparse $\beta$.
Ridge regression performs shrinkage but not variable selection.
Variable selection: Some $\hat\beta_j$ are set to zero; covariates are omitted from the fitted model.
[Figure: regression coefficients plotted against the penalty $\lambda$ (log 10)]
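A small sketch of why ridge regression does not select variables in the orthonormal case: every least squares coefficient is divided by the same factor $1 + \lambda$, so a coefficient only becomes exactly zero if it already was zero. The least squares estimates below are hypothetical and are not the values behind the lecture's figures.

```python
def ridge_from_least_squares(beta_ols, lam):
    """Ridge coefficients when X^T X = I_p: each entry is scaled by 1 / (1 + lam)."""
    return [b / (1.0 + lam) for b in beta_ols]


# Hypothetical least squares estimates for a sparse true beta = (3, 2, 1, 0, 0)
beta_ols = [3.1, 1.9, 1.2, 0.05, -0.1]

for lam in (0.0, 1.0, 10.0, 100.0):
    beta_ridge = ridge_from_least_squares(beta_ols, lam)
    n_zero = sum(1 for b in beta_ridge if b == 0.0)
    print(lam, [round(b, 4) for b in beta_ridge], "exact zeros:", n_zero)
```

All coefficients move towards zero together, but the small ones are never removed from the model.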

9 Some practical aspects
Choice of $\lambda$: via cross-validation
The ridge solution $\hat\beta^r$ depends on the scale of the covariates.
  Center so that $\sum_{i=1}^n y_i = \sum_{i=1}^n x_{ij} = 0$
  Re-scale so that $\sum_{i=1}^n x_{ij}^2 = 1$
Assume that the data were preprocessed in this manner.
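The centering and re-scaling step can be sketched for a single covariate column as follows. The function names and the sample column are illustrative assumptions; the response would be centered only, not re-scaled.

```python
import math


def center(values):
    """Subtract the mean so that the entries sum to zero."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def center_and_scale(column):
    """Center a covariate column and re-scale it so that sum_i x_ij^2 = 1."""
    centered = center(column)
    norm = math.sqrt(sum(v * v for v in centered))
    if norm == 0:
        raise ValueError("a constant covariate cannot be re-scaled")
    return [v / norm for v in centered]


column = [1.0, 2.0, 3.0, 4.0]        # made-up covariate values
scaled = center_and_scale(column)
print(sum(scaled), sum(v * v for v in scaled))  # approximately 0 and 1
```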

10 Importance of variable selection
It reduces the complexity of the models.
The models become easier to interpret.
It makes prediction cheaper: only covariates with nonzero $\hat\beta_j$ need to be measured.
For example, a model using all of 1000 covariates,
$$\hat y = x_1 \hat\beta_1 + \ldots + x_{1000} \hat\beta_{1000}$$
compared with a model in which only three coefficients are nonzero,
$$\hat y = x_1 \hat\beta_1 + x_2 \hat\beta_2 + x_3 \hat\beta_3$$

11 Lasso regression
Lasso regression consists of minimizing $J_L(\beta)$,
$$J_L(\beta) = \|y - X\beta\|_2^2 + \lambda \sum_{j=1}^p |\beta_j| \qquad (11)$$
This is similar to the cost function $J_r(\beta)$ for ridge regression,
$$J_r(\beta) = \|y - X\beta\|_2^2 + \lambda \sum_{j=1}^p \beta_j^2 \qquad (12)$$
$\lambda \ge 0$ is the regularization (shrinkage) parameter.
Penalty: sum of absolute values instead of sum of squares.
The difference seems minor, but it results in very different behavior: it enables shrinkage and selection of covariates.
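To make the two cost functions concrete, here is a small sketch that evaluates Eqs. (11) and (12) for two candidate coefficient vectors. The data, the candidates and the function names are made up for illustration; neither candidate is claimed to be the minimizer of either cost.

```python
def residual_sum_of_squares(X, y, beta):
    """||y - X beta||_2^2 for a design matrix X given as a list of rows."""
    total = 0.0
    for row, yi in zip(X, y):
        prediction = sum(xij * bj for xij, bj in zip(row, beta))
        total += (yi - prediction) ** 2
    return total


def lasso_cost(X, y, beta, lam):
    """J_L(beta), Eq. (11): RSS plus lam times the sum of absolute values."""
    return residual_sum_of_squares(X, y, beta) + lam * sum(abs(b) for b in beta)


def ridge_cost(X, y, beta, lam):
    """J_r(beta), Eq. (12): RSS plus lam times the sum of squares."""
    return residual_sum_of_squares(X, y, beta) + lam * sum(b * b for b in beta)


# Orthonormal toy design (identity matrix), so the least squares solution equals y
X = [[1.0, 0.0], [0.0, 1.0]]
y = [3.0, 0.2]
lam = 0.5

for beta in ([3.0, 0.2], [3.0, 0.0]):
    print(beta, lasso_cost(X, y, beta, lam), ridge_cost(X, y, beta, lam))
```

With these numbers, setting the small second coefficient to zero lowers the lasso cost but raises the ridge cost, which hints at why only the absolute-value penalty favors exact zeros.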

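12 Shrinkage and variable selection with the lasso
The lasso generally lacks an analytical solution.
Closed form solution when $X^\top X = I_p$:
$$\hat\beta_j^L = \begin{cases}
\hat\beta_j^o - \frac{\lambda}{2} & \text{if } \hat\beta_j^o \ge \frac{\lambda}{2} \\
0 & \text{if } \hat\beta_j^o \in \left(-\frac{\lambda}{2}, \frac{\lambda}{2}\right) \\
\hat\beta_j^o + \frac{\lambda}{2} & \text{if } \hat\beta_j^o \le -\frac{\lambda}{2}
\end{cases} \qquad (13)$$
[Figure: two panels plotting the lasso coefficient and the ridge coefficient against the least squares coefficient; annotations: shrinkage, variable selection]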
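The closed form in Eq. (13) is commonly called soft thresholding. A minimal sketch, applied coefficient by coefficient to hypothetical least squares estimates (the values and the function name are assumptions for illustration):

```python
def soft_threshold(beta_ols, lam):
    """Lasso coefficient for X^T X = I_p, Eq. (13), given one least squares coefficient."""
    half = lam / 2.0
    if beta_ols >= half:
        return beta_ols - half   # shrink positive coefficients towards zero
    if beta_ols <= -half:
        return beta_ols + half   # shrink negative coefficients towards zero
    return 0.0                   # small coefficients are set exactly to zero


# Hypothetical least squares estimates for a sparse true beta = (3, 2, 1, 0, 0)
beta_ols = [3.1, 1.9, 1.2, 0.05, -0.1]
lam = 0.5

beta_lasso = [soft_threshold(b, lam) for b in beta_ols]
print(beta_lasso)  # the two small coefficients become exactly 0.0
```

The large coefficients are moved towards zero by $\lambda/2$ (shrinkage), while coefficients inside $(-\lambda/2, \lambda/2)$ are removed (selection).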

13 Back to the example
Data were artificially generated with $\beta = (3, 2, 1, 0, \ldots, 0)$.
The vector is sparse: only 3/10 nonzero terms.
Lasso regression combines shrinkage and variable selection.
[Figure, two panels: mean squared prediction error against the penalty $\lambda$ (log 10), with curves for lasso regression and least squares; regression coefficients against the penalty $\lambda$ (log 10)]

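14 Proof
Assume $X^\top X = I_p$.
We want to show that the shrinkage and selection operator
$$\hat\beta_j^L = \begin{cases}
\hat\beta_j^o - \frac{\lambda}{2} & \text{if } \hat\beta_j^o \ge \frac{\lambda}{2} \\
0 & \text{if } \hat\beta_j^o \in \left(-\frac{\lambda}{2}, \frac{\lambda}{2}\right) \\
\hat\beta_j^o + \frac{\lambda}{2} & \text{if } \hat\beta_j^o \le -\frac{\lambda}{2}
\end{cases} \qquad (14)$$
minimizes $J_L(\beta)$,
$$J_L(\beta) = \|y - X\beta\|_2^2 + \lambda \sum_{j=1}^p |\beta_j| \qquad (15)$$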

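15 Proof
$$\begin{aligned}
J_L(\beta) &= \|y - X\beta\|_2^2 + \lambda \sum_{j=1}^p |\beta_j| && (16)\\
&= (y - X\beta)^\top (y - X\beta) + \lambda \sum_{j=1}^p |\beta_j| && (17)\\
&= y^\top y - y^\top X\beta - \beta^\top X^\top y + \beta^\top X^\top X \beta + \lambda \sum_{j=1}^p |\beta_j| && (18)\\
&= y^\top y - 2\beta^\top X^\top y + \beta^\top \underbrace{X^\top X}_{I_p} \beta + \lambda \sum_{j=1}^p |\beta_j| && (19)\\
&= y^\top y - 2\beta^\top \underbrace{X^\top y}_{\hat\beta^o = r} + \beta^\top \beta + \lambda \sum_{j=1}^p |\beta_j| && (20)
\end{aligned}$$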

16 Proof
$$\begin{aligned}
J_L(\beta) &= y^\top y - 2\beta^\top r + \beta^\top \beta + \lambda \sum_{j=1}^p |\beta_j| && (21)\\
&= y^\top y - 2\sum_{j=1}^p \beta_j r_j + \sum_{j=1}^p \beta_j^2 + \lambda \sum_{j=1}^p |\beta_j| && (22)\\
&= y^\top y + \sum_{j=1}^p \underbrace{\left(-2\beta_j r_j + \beta_j^2 + \lambda |\beta_j|\right)}_{f_j(\beta_j)} && (23)\\
&= \text{constant} + \sum_{j=1}^p f_j(\beta_j) && (24)
\end{aligned}$$
For $X^\top X = I_p$, the optimization problem decomposes into p independent problems.
Minimizing each $f_j(\beta_j)$ separately will minimize $J_L(\beta)$.

17 Proof
Drop the subscripts for a moment and consider a single f only:
$$f(\beta) = \beta^2 - 2r\beta + \lambda |\beta| \qquad (25)$$
Problem: the derivative at zero is not defined.
[Figure: f (for r = 1) plotted against $\beta$ for $\lambda = 1$, $\lambda = 2$, $\lambda = 3$]

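18 Proof
Approach: Make a smooth approximation $|\beta| \approx h_\epsilon(\beta)$,
$$h_\epsilon(\beta) = \begin{cases}
\frac{\epsilon}{2} + \frac{1}{2\epsilon} \beta^2 & \text{if } \beta \in (-\epsilon, \epsilon) \\
|\beta| & \text{otherwise}
\end{cases} \qquad (26)$$
Do all the work with $\epsilon > 0$ and, at the end, take the limit $\epsilon \to 0$.
[Figure: absolute value and its approximation plotted against $\beta$]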

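19 Proof
Using $h_\epsilon(\beta)$ instead of $|\beta|$ gives
$$f(\beta) = \beta^2 - 2r\beta + \lambda h_\epsilon(\beta) \qquad (27)$$
The derivative of $f(\beta)$ is
$$f'(\beta) = 2\beta - 2r + \lambda h_\epsilon'(\beta) \qquad (28)$$
with
$$h_\epsilon'(\beta) = \begin{cases}
-1 & \text{if } \beta \le -\epsilon \\
\frac{\beta}{\epsilon} & \text{if } \beta \in (-\epsilon, \epsilon) \\
1 & \text{if } \beta \ge \epsilon
\end{cases}$$
[Figure: plot against $\beta$]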

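20 Proof
Setting the derivative of $f(\beta)$ to zero gives the condition
$$\begin{aligned}
2\beta - 2r + \lambda h_\epsilon'(\beta) &= 0 && (29)\\
\beta + \frac{\lambda}{2} h_\epsilon'(\beta) &= r && (30)
\end{aligned}$$
The left-hand side is a piecewise linear, monotonically increasing function $g_\epsilon(\beta)$, so $\beta$ is uniquely determined by $r$:
$$g_\epsilon(\beta) = \begin{cases}
\beta + \frac{\lambda}{2} & \text{if } \beta \ge \epsilon \\
\beta \left(1 + \frac{\lambda}{2\epsilon}\right) & \text{if } \beta \in (-\epsilon, \epsilon) \\
\beta - \frac{\lambda}{2} & \text{if } \beta \le -\epsilon
\end{cases}$$
[Figure: plot against $\beta$]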

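21 Proof
There are three cases:
1. $r \ge \epsilon + \frac{\lambda}{2}$: $\quad \beta + \frac{\lambda}{2} \overset{!}{=} r \;\Rightarrow\; \beta = r - \frac{\lambda}{2}$
2. $r \in \left(-\epsilon - \frac{\lambda}{2}, \epsilon + \frac{\lambda}{2}\right)$: $\quad \beta \, \frac{2\epsilon + \lambda}{2\epsilon} \overset{!}{=} r \;\Rightarrow\; \beta = \frac{2\epsilon r}{2\epsilon + \lambda}$
3. $r \le -\epsilon - \frac{\lambda}{2}$: $\quad \beta - \frac{\lambda}{2} \overset{!}{=} r \;\Rightarrow\; \beta = r + \frac{\lambda}{2}$
[Figure: Case 1, Case 2 and Case 3 marked on a plot against $\beta$]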

22 Proof
Collecting the three cases for $r$ gives
$$\beta = \begin{cases}
r - \frac{\lambda}{2} & \text{if } r \ge \epsilon + \frac{\lambda}{2} \\
\frac{2\epsilon}{2\epsilon + \lambda} r & \text{if } r \in \left(-\epsilon - \frac{\lambda}{2}, \epsilon + \frac{\lambda}{2}\right) \\
r + \frac{\lambda}{2} & \text{if } r \le -\epsilon - \frac{\lambda}{2}
\end{cases}$$
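The three-case solution for fixed $\epsilon > 0$ can be checked numerically. The sketch below (function names are assumptions) implements $g_\epsilon$ and the case-wise minimizer, confirms that $g_\epsilon(\beta) = r$ holds, and shows the middle case collapsing to zero as $\epsilon$ becomes small.

```python
def g_eps(beta, lam, eps):
    """Left-hand side beta + (lam / 2) * h_eps'(beta) of the optimality condition."""
    if beta >= eps:
        return beta + lam / 2.0
    if beta <= -eps:
        return beta - lam / 2.0
    return beta * (1.0 + lam / (2.0 * eps))


def smoothed_minimizer(r, lam, eps):
    """Solution of g_eps(beta) = r, obtained case by case."""
    if r >= eps + lam / 2.0:
        return r - lam / 2.0                    # case 1
    if r <= -eps - lam / 2.0:
        return r + lam / 2.0                    # case 3
    return 2.0 * eps * r / (2.0 * eps + lam)    # case 2


lam = 1.0
for r in (2.0, 0.3, -2.0):
    for eps in (1.0, 0.1, 0.001):
        beta = smoothed_minimizer(r, lam, eps)
        # The second printed value should reproduce r up to rounding error
        print(r, eps, round(beta, 6), round(g_eps(beta, lam, eps), 6))
```

For r = 0.3, which lies inside $(-\lambda/2, \lambda/2)$, the solution tends to zero as $\epsilon$ decreases, while for r = 2 and r = -2 it is simply shifted by $\lambda/2$ towards zero for all three values of $\epsilon$.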

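23 Proof
Taking the limit $\epsilon \to 0$ gives
$$\hat\beta = \begin{cases}
r - \frac{\lambda}{2} & \text{if } r \ge \frac{\lambda}{2} \\
0 & \text{if } r \in \left(-\frac{\lambda}{2}, \frac{\lambda}{2}\right) \\
r + \frac{\lambda}{2} & \text{if } r \le -\frac{\lambda}{2}
\end{cases} \qquad (31)$$
With the subscripts, and $r_j = \hat\beta_j^o$, we have
$$\hat\beta_j = \begin{cases}
\hat\beta_j^o - \frac{\lambda}{2} & \text{if } \hat\beta_j^o \ge \frac{\lambda}{2} \\
0 & \text{if } \hat\beta_j^o \in \left(-\frac{\lambda}{2}, \frac{\lambda}{2}\right) \\
\hat\beta_j^o + \frac{\lambda}{2} & \text{if } \hat\beta_j^o \le -\frac{\lambda}{2}
\end{cases} \qquad (32)$$
which is the lasso solution $\hat\beta_j^L$.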
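As a sanity check of the result, and not a substitute for the proof, one can minimize $f(\beta) = \beta^2 - 2r\beta + \lambda|\beta|$ by brute force on a fine grid and compare with the closed form. The grid range, resolution and function names below are arbitrary choices for this sketch.

```python
def f(beta, r, lam):
    """Per-coordinate lasso objective f(beta) = beta^2 - 2 r beta + lam |beta|."""
    return beta * beta - 2.0 * r * beta + lam * abs(beta)


def soft_threshold(r, lam):
    """Closed-form minimizer of f from the proof."""
    half = lam / 2.0
    if r >= half:
        return r - half
    if r <= -half:
        return r + half
    return 0.0


def grid_minimizer(r, lam, lo=-5.0, hi=5.0, steps=100001):
    """Brute-force minimizer of f over an evenly spaced grid on [lo, hi]."""
    step = (hi - lo) / (steps - 1)
    grid = (lo + k * step for k in range(steps))
    return min(grid, key=lambda b: f(b, r, lam))


for r, lam in [(1.0, 1.0), (0.3, 1.0), (-2.0, 3.0), (0.0, 2.0)]:
    print(r, lam, soft_threshold(r, lam), round(grid_minimizer(r, lam), 4))
```

The two columns should agree up to the grid spacing of 0.0001.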

24 Summary
Lasso: Least Absolute Shrinkage and Selection Operator
Method to regularize linear regression (like ridge regression)
  Regularization / shrinkage can improve prediction accuracy.
Method to perform covariate selection (unlike ridge regression)
  Covariate selection reduces the complexity of fitted models and makes them easier to interpret.
The combination of shrinkage and selection is achieved by penalizing the absolute values of the regression coefficients.

25 Appendix

26 Constrained optimization point of view
Ridge regression:
$$\min_{\beta} \|y - X\beta\|_2^2 \quad \text{subject to} \quad \sum_{j=1}^p \beta_j^2 \le t$$
Lasso regression:
$$\min_{\beta} \|y - X\beta\|_2^2 \quad \text{subject to} \quad \sum_{j=1}^p |\beta_j| \le t$$
[Figure: iso-contours of the RSS, with the direction in which the RSS increases, and the constraint set]
(Based on figures from chapter 6 of Introduction to Statistical Learning)
