A Short Introduction to the Lasso Methodology

Michael Gutmann
sites.google.com/site/michaelgutmann

University of Helsinki, Aalto University,
Helsinki Institute for Information Technology

March 9, 2016
Goals of the lecture

Lasso: Least Absolute Shrinkage and Selection Operator.

Goal: after the lecture, to understand what these words mean.

- Shrinkage: the lasso shrinks / regularizes the least squares regression coefficients (like ridge regression).
- Selection: the lasso also performs variable selection (unlike ridge regression).
- Least absolute: shrinkage and selection are achieved by penalizing the absolute values of the regression coefficients.
Linear regression

Data: $\{(x_1, y_1), \dots, (x_n, y_n)\}$, i.e. $n$ observations of pairs $(x_i, y_i)$, with covariates $x_i \in \mathbb{R}^p$ and response $y_i \in \mathbb{R}$.

[Figure: data points and candidate regression lines; covariate on the horizontal axis, response on the vertical axis.]

Assumption: a linear relation between covariates and response,
\[
y_i = x_{i1}\beta_1 + \dots + x_{ip}\beta_p + e_i = x_i^\top \beta + e_i,
\]
where $e_i$ is the residual.

Goal: determine the coefficients $\beta = (\beta_1, \dots, \beta_p)$.
Least squares

Minimize the residual sum of squares (RSS),
\[
\mathrm{RSS}(\beta) = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n \bigl(y_i - x_i^\top \beta\bigr)^2.
\]
In vector notation, with
\[
y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix},
\qquad
X = \begin{pmatrix} x_1^\top \\ \vdots \\ x_n^\top \end{pmatrix}
  = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix},
\]
we have
\[
\mathrm{RSS}(\beta) = \lVert y - X\beta \rVert_2^2.
\]
Least squares

There is a closed-form solution if the $p \times p$ matrix $X^\top X$ is invertible:
\[
\hat\beta^o = \operatorname*{argmin}_\beta \, \mathrm{RSS}(\beta) = (X^\top X)^{-1} X^\top y.
\]
Prediction given a test covariate vector $x^*$:
\[
\hat y = {x^*}^\top \hat\beta^o.
\]

[Figure: data, candidate regression lines, and the least squares solution.]
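For concreteness, here is a minimal numerical sketch of the closed-form solution; the data, dimensions, and random seed are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.standard_normal((n, p))             # covariate matrix (n x p)
beta_true = np.array([3.0, 2.0, 1.0] + [0.0] * (p - 3))
y = X @ beta_true + rng.standard_normal(n)  # response with noise

# beta_o = (X'X)^{-1} X'y; solving the normal equations is preferred
# numerically over explicitly forming the inverse.
beta_o = np.linalg.solve(X.T @ X, X.T @ y)

# Prediction for a new covariate vector x*.
x_star = rng.standard_normal(p)
y_hat = x_star @ beta_o
```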
Ridge regression

If $X^\top X$ is not invertible, a regularized inverse can be taken,
\[
\hat\beta^r = (X^\top X + \lambda I_p)^{-1} X^\top y,
\]
where $I_p$ is the $p \times p$ identity matrix and $\lambda \ge 0$ is the regularization parameter.

This is ridge regression: $\hat\beta^r$ minimizes $J_r(\beta)$,
\[
J_r(\beta) = \lVert y - X\beta \rVert_2^2 + \lambda \sum_{j=1}^p \beta_j^2.
\]
As $\lambda$ increases, $\hat\beta^r$ shrinks towards zero ("shrinkage").
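A direct translation into code might look as follows (a sketch; `ridge` and `lam` are my names for the estimator and the parameter $\lambda$):

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimator (X'X + lam * I_p)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```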
Benefits of ridge regression

Regularization / shrinkage is useful even if $X^\top X$ is invertible, because it can improve prediction accuracy.

Example: $n = 50$ observations, $p = 10$ covariates, and an orthonormal matrix $X$ with $X^\top X = I_p$, so that
\[
\hat\beta^r = \frac{1}{1+\lambda} X^\top y = \frac{1}{1+\lambda} \hat\beta^o.
\]

[Figure: mean squared prediction error as a function of the penalty $\lambda$ ($\log_{10}$ scale) for ridge regression, compared with least squares.]
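For an orthonormal design, the ridge solution is thus an explicit rescaling of the least squares solution. The following sketch, with made-up data, checks this numerically; the QR factorization is just one convenient way to construct an $X$ with $X^\top X = I_p$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 10
# The QR decomposition yields an orthonormal X with X'X = I_p.
X, _ = np.linalg.qr(rng.standard_normal((n, p)))
y = rng.standard_normal(n)

lam = 2.0
beta_o = X.T @ y  # least squares solution when X'X = I_p
beta_r = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

assert np.allclose(beta_r, beta_o / (1 + lam))
```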
Limits of ridge regression

In the example, $\hat\beta^r = \frac{1}{1+\lambda} \hat\beta^o$, and the data were artificially generated with $\beta = (3, 2, 1, 0, \dots, 0)$. The vector is sparse: only 3 of the 10 terms are nonzero.

Ridge regression cannot recover a sparse $\beta$: it performs shrinkage but not variable selection.

Variable selection: some $\hat\beta_j$ are set exactly to zero, so that the corresponding covariates are omitted from the fitted model.

[Figure: regression coefficients as a function of the penalty $\lambda$ ($\log_{10}$ scale); the ridge coefficients shrink smoothly but none become exactly zero.]
Some practical aspects

Choice of $\lambda$: via cross-validation.

The ridge solution $\hat\beta^r$ depends on the scale of the covariates, so standardize the data first:
- Center so that $\sum_{i=1}^n y_i = 0$ and $\sum_{i=1}^n x_{ij} = 0$ for all $j$.
- Re-scale so that $\sum_{i=1}^n x_{ij}^2 = 1$ for all $j$.

We assume in the following that the data were preprocessed in this manner.
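One possible implementation of this preprocessing (a sketch; the function name is mine):

```python
import numpy as np

def standardize(X, y):
    """Center y and the columns of X, then rescale each column of X
    so that sum_i x_ij^2 = 1."""
    y = y - y.mean()
    X = X - X.mean(axis=0)
    X = X / np.sqrt((X ** 2).sum(axis=0))
    return X, y
```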
Importance of variable selection

- It reduces the complexity of the models.
- The models become easier to interpret.
- It makes prediction cheaper: only the covariates with nonzero $\hat\beta_j$ need to be measured.

Compare, for example,
\[
\hat y = x_1 \hat\beta_1 + \dots + x_{1000} \hat\beta_{1000}
\qquad \text{with} \qquad
\hat y = x_1 \hat\beta_1 + x_2 \hat\beta_2 + x_3 \hat\beta_3.
\]
Lasso regression

Lasso regression consists in minimizing $J_L(\beta)$,
\[
J_L(\beta) = \lVert y - X\beta \rVert_2^2 + \lambda \sum_{j=1}^p |\beta_j|.
\]
This is similar to the cost function $J_r(\beta)$ for ridge regression,
\[
J_r(\beta) = \lVert y - X\beta \rVert_2^2 + \lambda \sum_{j=1}^p \beta_j^2,
\]
where $\lambda \ge 0$ is the regularization (shrinkage) parameter.

The penalty is a sum of absolute values instead of a sum of squares. The difference seems minor, but it results in very different behavior: it enables both shrinkage and selection of covariates.
Shrinkage and variable selection with the lasso

The lasso generally lacks an analytical solution, but there is a closed-form solution when $X^\top X = I_p$:
\[
\hat\beta_j^L =
\begin{cases}
\hat\beta_j^o - \frac{\lambda}{2} & \text{if } \hat\beta_j^o \ge \frac{\lambda}{2}, \\[2pt]
0 & \text{if } \hat\beta_j^o \in \left(-\frac{\lambda}{2}, \frac{\lambda}{2}\right), \\[2pt]
\hat\beta_j^o + \frac{\lambda}{2} & \text{if } \hat\beta_j^o \le -\frac{\lambda}{2}.
\end{cases}
\]

[Figure, left: lasso coefficient as a function of the least squares coefficient, showing both shrinkage and variable selection (a flat region of exact zeros around the origin). Right: ridge coefficient as a function of the least squares coefficient, showing shrinkage only.]
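In code, this closed-form solution is coordinate-wise soft thresholding; the following is a sketch, with `soft_threshold` being my name for the operator:

```python
import numpy as np

def soft_threshold(beta_o, lam):
    """Lasso solution for X'X = I_p: shrink each least squares
    coefficient by lam/2 towards zero, setting it to exactly zero
    if it would cross zero."""
    return np.sign(beta_o) * np.maximum(np.abs(beta_o) - lam / 2, 0.0)
```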
Back to the example

The data were artificially generated with $\beta = (3, 2, 1, 0, \dots, 0)$; the vector is sparse, with only 3 of the 10 terms nonzero.

Lasso regression combines shrinkage and variable selection.

[Figure, left: mean squared prediction error as a function of the penalty $\lambda$ ($\log_{10}$ scale) for lasso regression, compared with least squares. Right: regression coefficients as a function of the penalty $\lambda$ ($\log_{10}$ scale).]
Proof

Assume $X^\top X = I_p$. We want to show that the shrinkage and selection operator
\[
\hat\beta_j^L =
\begin{cases}
\hat\beta_j^o - \frac{\lambda}{2} & \text{if } \hat\beta_j^o \ge \frac{\lambda}{2}, \\[2pt]
0 & \text{if } \hat\beta_j^o \in \left(-\frac{\lambda}{2}, \frac{\lambda}{2}\right), \\[2pt]
\hat\beta_j^o + \frac{\lambda}{2} & \text{if } \hat\beta_j^o \le -\frac{\lambda}{2}
\end{cases}
\]
minimizes $J_L(\beta)$,
\[
J_L(\beta) = \lVert y - X\beta \rVert_2^2 + \lambda \sum_{j=1}^p |\beta_j|.
\]
Expanding the squared norm and using $X^\top X = I_p$,
\begin{align*}
J_L(\beta) &= \lVert y - X\beta \rVert_2^2 + \lambda \sum_{j=1}^p |\beta_j| \\
&= (y - X\beta)^\top (y - X\beta) + \lambda \sum_{j=1}^p |\beta_j| \\
&= y^\top y - y^\top X\beta - \beta^\top X^\top y + \beta^\top X^\top X \beta + \lambda \sum_{j=1}^p |\beta_j| \\
&= y^\top y - 2\beta^\top X^\top y + \beta^\top \underbrace{X^\top X}_{I_p} \beta + \lambda \sum_{j=1}^p |\beta_j| \\
&= y^\top y - 2\beta^\top \underbrace{X^\top y}_{=\, r \,=\, \hat\beta^o} + \beta^\top \beta + \lambda \sum_{j=1}^p |\beta_j|,
\end{align*}
where $r = X^\top y$ equals the least squares solution $\hat\beta^o$ because $X^\top X = I_p$.
Writing the terms coordinate-wise,
\begin{align*}
J_L(\beta) &= y^\top y - 2\beta^\top r + \beta^\top \beta + \lambda \sum_{j=1}^p |\beta_j| \\
&= y^\top y - 2\sum_{j=1}^p \beta_j r_j + \sum_{j=1}^p \beta_j^2 + \lambda \sum_{j=1}^p |\beta_j| \\
&= y^\top y + \sum_{j=1}^p \underbrace{\left( \beta_j^2 - 2 r_j \beta_j + \lambda |\beta_j| \right)}_{f_j(\beta_j)} \\
&= \text{constant} + \sum_{j=1}^p f_j(\beta_j).
\end{align*}
For $X^\top X = I_p$, the optimization problem thus decomposes into $p$ independent one-dimensional problems: minimizing each $f_j(\beta_j)$ separately minimizes $J_L(\beta)$, as the sketch below illustrates.
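As a quick numerical illustration of the decomposition (with made-up orthonormal data), the difference between $J_L(\beta)$ and $\sum_j f_j(\beta_j)$ should equal the constant $y^\top y$ for every $\beta$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 10
X, _ = np.linalg.qr(rng.standard_normal((n, p)))  # X'X = I_p
y = rng.standard_normal(n)
r = X.T @ y
lam = 1.0

def J_L(beta):
    resid = y - X @ beta
    return resid @ resid + lam * np.abs(beta).sum()

def f_sum(beta):
    return (beta ** 2 - 2 * r * beta + lam * np.abs(beta)).sum()

# J_L(beta) - sum_j f_j(beta_j) should equal y'y for arbitrary beta.
for _ in range(5):
    beta = rng.standard_normal(p)
    assert np.isclose(J_L(beta) - f_sum(beta), y @ y)
```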
Drop the subscripts for a moment and consider a single $f$ only,
\[
f(\beta) = \beta^2 - 2r\beta + \lambda |\beta|.
\]
Problem: the derivative at zero is not defined.

[Figure: $f(\beta)$ for $r = 1$ and $\lambda \in \{1, 2, 3\}$; the function has a kink at $\beta = 0$.]
Approach: replace $|\beta|$ by a smooth approximation $h_\epsilon(\beta)$,
\[
h_\epsilon(\beta) =
\begin{cases}
\frac{\epsilon}{2} + \frac{\beta^2}{2\epsilon} & \text{if } \beta \in (-\epsilon, \epsilon), \\[2pt]
|\beta| & \text{otherwise}.
\end{cases}
\]
Do all the work with $\epsilon > 0$ and, at the end, take the limit $\epsilon \to 0$.

[Figure: the absolute value and its smooth approximation $h_\epsilon$ near $\beta = 0$.]
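A sketch of the approximation in code (`h_eps` is my name for it):

```python
import numpy as np

def h_eps(beta, eps):
    """Smooth approximation of |beta|: quadratic on (-eps, eps),
    equal to |beta| outside; continuous at beta = +/- eps."""
    beta = np.asarray(beta, dtype=float)
    return np.where(np.abs(beta) < eps,
                    eps / 2 + beta ** 2 / (2 * eps),
                    np.abs(beta))
```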
Using $h_\epsilon(\beta)$ instead of $|\beta|$ gives
\[
f(\beta) = \beta^2 - 2r\beta + \lambda h_\epsilon(\beta).
\]
The derivative of $f(\beta)$ is
\[
f'(\beta) = 2\beta - 2r + \lambda h'_\epsilon(\beta),
\qquad
h'_\epsilon(\beta) =
\begin{cases}
-1 & \text{if } \beta \le -\epsilon, \\[2pt]
\frac{\beta}{\epsilon} & \text{if } \beta \in (-\epsilon, \epsilon), \\[2pt]
1 & \text{if } \beta \ge \epsilon.
\end{cases}
\]

[Figure: $h'_\epsilon(\beta)$, a piecewise linear ramp from $-1$ to $1$.]
Setting the derivative of $f(\beta)$ to zero gives the condition
\[
2\beta - 2r + \lambda h'_\epsilon(\beta) = 0
\quad \Longleftrightarrow \quad
\beta + \frac{\lambda}{2} h'_\epsilon(\beta) = r.
\]
The left-hand side is a piecewise linear, monotonically increasing function $g_\epsilon(\beta)$,
\[
g_\epsilon(\beta) =
\begin{cases}
\beta + \frac{\lambda}{2} & \text{if } \beta \ge \epsilon, \\[2pt]
\beta \left(1 + \frac{\lambda}{2\epsilon}\right) & \text{if } \beta \in (-\epsilon, \epsilon), \\[2pt]
\beta - \frac{\lambda}{2} & \text{if } \beta \le -\epsilon,
\end{cases}
\]
so $\beta$ is uniquely determined by $r$.

[Figure: $g_\epsilon(\beta)$.]
There are three cases:
\[
\begin{aligned}
&\text{1. } r \ge \epsilon + \tfrac{\lambda}{2}: & \beta + \tfrac{\lambda}{2} &= r \;\Rightarrow\; \beta = r - \tfrac{\lambda}{2}, \\
&\text{2. } r \in \left(-\epsilon - \tfrac{\lambda}{2},\, \epsilon + \tfrac{\lambda}{2}\right): & \beta \, \tfrac{2\epsilon + \lambda}{2\epsilon} &= r \;\Rightarrow\; \beta = \tfrac{2\epsilon r}{2\epsilon + \lambda}, \\
&\text{3. } r \le -\epsilon - \tfrac{\lambda}{2}: & \beta - \tfrac{\lambda}{2} &= r \;\Rightarrow\; \beta = r + \tfrac{\lambda}{2}.
\end{aligned}
\]

[Figure: the three cases on the graph of $g_\epsilon$.]
Hence,
\[
\beta =
\begin{cases}
r - \frac{\lambda}{2} & \text{if } r \ge \epsilon + \frac{\lambda}{2}, \\[2pt]
\frac{2\epsilon}{2\epsilon + \lambda}\, r & \text{if } r \in \left(-\epsilon - \frac{\lambda}{2},\, \epsilon + \frac{\lambda}{2}\right), \\[2pt]
r + \frac{\lambda}{2} & \text{if } r \le -\epsilon - \frac{\lambda}{2}.
\end{cases}
\]
Taking the limit $\epsilon \to 0$ gives
\[
\hat\beta =
\begin{cases}
r - \frac{\lambda}{2} & \text{if } r \ge \frac{\lambda}{2}, \\[2pt]
0 & \text{if } r \in \left(-\frac{\lambda}{2}, \frac{\lambda}{2}\right), \\[2pt]
r + \frac{\lambda}{2} & \text{if } r \le -\frac{\lambda}{2}.
\end{cases}
\]
Restoring the subscripts, with $r_j = \hat\beta_j^o$, we have
\[
\hat\beta_j =
\begin{cases}
\hat\beta_j^o - \frac{\lambda}{2} & \text{if } \hat\beta_j^o \ge \frac{\lambda}{2}, \\[2pt]
0 & \text{if } \hat\beta_j^o \in \left(-\frac{\lambda}{2}, \frac{\lambda}{2}\right), \\[2pt]
\hat\beta_j^o + \frac{\lambda}{2} & \text{if } \hat\beta_j^o \le -\frac{\lambda}{2},
\end{cases}
\]
which is the lasso solution $\hat\beta_j^L$.
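As a sanity check, the following sketch minimizes the scalar objective $f(\beta) = \beta^2 - 2r\beta + \lambda|\beta|$ on a fine grid and compares the result with the soft-thresholding formula; the values of $r$ and $\lambda$ are arbitrary.

```python
import numpy as np

lam = 1.0
grid = np.linspace(-3, 3, 200001)  # fine grid over beta, includes 0

for r in (-1.5, -0.3, 0.0, 0.4, 2.0):
    f = grid ** 2 - 2 * r * grid + lam * np.abs(grid)
    beta_grid = grid[np.argmin(f)]                       # grid minimizer
    beta_closed = np.sign(r) * max(abs(r) - lam / 2, 0)  # soft threshold
    assert abs(beta_grid - beta_closed) < 1e-3
```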
Summary

Lasso: Least Absolute Shrinkage and Selection Operator.

- It is a method to regularize linear regression (like ridge regression); regularization / shrinkage can improve prediction accuracy.
- It is a method to perform covariate selection (unlike ridge regression); covariate selection reduces the complexity of the fitted models and makes them easier to interpret.
- The combination of shrinkage and selection is achieved by penalizing the absolute values of the regression coefficients.
Appendix
Constrained optimization point of view

Ridge regression:
\[
\min_\beta \; \lVert y - X\beta \rVert_2^2
\quad \text{subject to} \quad
\sum_{j=1}^p \beta_j^2 \le t.
\]
Lasso regression:
\[
\min_\beta \; \lVert y - X\beta \rVert_2^2
\quad \text{subject to} \quad
\sum_{j=1}^p |\beta_j| \le t.
\]

[Figure: iso-contours of the RSS, which increases away from the least squares solution, together with the constraint sets. Based on figures from chapter 6 of Introduction to Statistical Learning.]
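The lasso's constrained form can be explored numerically by splitting $\beta$ into nonnegative parts, $\beta = u - v$ with $u, v \ge 0$, so that the $\ell_1$ constraint becomes linear and the problem smooth. This is only an illustrative sketch with made-up data and constraint level `t`, not how one would solve the lasso in practice:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n, p = 50, 10
X = rng.standard_normal((n, p))
y = X @ np.array([3.0, 2.0, 1.0] + [0.0] * (p - 3)) + rng.standard_normal(n)
t = 4.0  # budget on sum_j |beta_j|

def rss_split(z):
    beta = z[:p] - z[p:]   # beta = u - v
    resid = y - X @ beta
    return resid @ resid

# With u, v >= 0, the constraint sum_j |beta_j| <= t is implied by the
# linear constraint sum_j (u_j + v_j) <= t.
cons = {"type": "ineq", "fun": lambda z: t - z.sum()}
bounds = [(0.0, None)] * (2 * p)

res = minimize(rss_split, np.full(2 * p, 0.1), method="SLSQP",
               bounds=bounds, constraints=[cons])
beta_lasso = res.x[:p] - res.x[p:]
```

For any $\beta$ with $\sum_j |\beta_j| \le t$, taking $u = \max(\beta, 0)$ and $v = \max(-\beta, 0)$ is feasible for the split problem, so the two formulations have the same optimal value.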