Lecture 14: Shrinkage
Reading: Section 6.2
STATS 202: Data mining and analysis
October 27, 2017
Shrinkage methods

The idea is to perform a linear regression, while regularizing or shrinking the coefficients β̂ toward 0.

Why would shrunk coefficients be better? This introduces bias, but may significantly decrease the variance of the estimates. If the latter effect is larger, this decreases the test error.

There are also Bayesian motivations to do this: the prior tends to shrink the parameters.
Ridge regression

Ridge regression solves the following optimization:

\[
\hat\beta^R_\lambda = \arg\min_{\beta} \; \sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{i,j} \Big)^2 + \lambda \sum_{j=1}^p \beta_j^2.
\]

The first term is the RSS of the model (the loss); the second is the squared ℓ₂ norm of β, or ‖β‖₂² (the penalty or regularization). Note that the intercept is not penalized, just the slopes!

The parameter λ ≥ 0 is a tuning parameter: it modulates the importance of fit vs. shrinkage. We find an estimate β̂_λ^R for many values of λ and then choose λ by cross-validation.
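As a concrete illustration (not from the slides), here is a minimal numpy sketch of this optimization via the closed form β̂ = (XᵀX + λI)⁻¹Xᵀy on centered data, which leaves the intercept unpenalized; the data and names below are synthetic and illustrative.

```python
# Minimal sketch of ridge regression via its closed-form solution.
import numpy as np

def ridge_fit(X, y, lam):
    """Return (intercept, slopes) minimizing RSS + lam * ||beta||_2^2."""
    # Center X and y: the intercept then falls out of the penalty.
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    p = X.shape[1]
    # Closed form on centered data: beta = (X'X + lam * I)^{-1} X'y.
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    beta0 = y_mean - x_mean @ beta
    return beta0, beta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(size=100)
for lam in [0.0, 1.0, 100.0]:
    b0, b = ridge_fit(X, y, lam)
    print(lam, np.round(b, 3))  # slopes shrink toward 0 as lam grows
```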
Bias-variance tradeoff

In a simulation study, we compute bias, variance, and test error as a function of λ.

[Figure: squared bias, variance, and test mean squared error, plotted against λ on a log scale (left) and against the ratio ‖β̂_λ^R‖₂ / ‖β̂‖₂ (right).]
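One way such a simulation can be run, sketched with synthetic data and illustrative settings (no intercept in the true model, so none is fit): redraw the training noise many times and estimate the squared bias and variance of the ridge prediction at a fixed test point.

```python
# Sketch: estimate bias^2, variance, and MSE of ridge as lambda varies.
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 50, 20, 2.0
beta_true = rng.normal(size=p)
X = rng.normal(size=(n, p))      # fixed design across repetitions
x0 = rng.normal(size=p)          # a fixed test point
f0 = x0 @ beta_true              # true mean response at x0

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

for lam in [0.01, 1.0, 10.0, 100.0]:
    preds = []
    for _ in range(500):         # repeat over fresh training noise
        y = X @ beta_true + sigma * rng.normal(size=n)
        preds.append(x0 @ ridge(X, y, lam))
    preds = np.array(preds)
    bias2 = (preds.mean() - f0) ** 2
    var = preds.var()
    print(f"lam={lam:7.2f}  bias^2={bias2:.3f}  var={var:.3f}  "
          f"mse={bias2 + var:.3f}")  # bias grows, variance falls with lam
```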
Ridge regression

In least-squares linear regression, scaling the variables has no effect on the fit of the model:

\[
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p.
\]

Multiplying X₁ by c can be compensated by dividing β̂₁ by c; i.e., after doing this we have the same RSS.

In ridge regression, this is not true: the penalty term discourages large coefficients, so the fit depends on the scale of each variable.

In practice, what do we do? Scale each variable so that it has sample variance 1 before running the regression. This prevents penalizing some coefficients more than others.
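A sketch of this practice with scikit-learn on synthetic data: the pipeline standardizes each predictor to unit variance before fitting ridge.

```python
# Standardize predictors before ridge so all slopes are penalized alike.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 0.01])  # very different scales
y = X @ np.array([2.0, 0.02, 50.0]) + rng.normal(size=200)

# Without scaling, the penalty hits the small-scale variable's coefficient
# hardest; with scaling, all slopes are penalized comparably.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)
print(model.named_steps["ridge"].coef_)  # coefficients on the standardized scale
```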
Example. Ridge regression

Ridge regression of balance in the Credit dataset.

[Figure: standardized coefficients for Income, Limit, Rating, and Student, plotted against λ on a log scale (left) and against ‖β̂_λ^R‖₂ / ‖β̂‖₂ (right).]
Selecting λ by cross-validation

[Figure: cross-validation error as a function of λ (left) and the corresponding standardized coefficient paths (right).]
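A sketch of selecting λ over a grid by cross-validation, using scikit-learn's RidgeCV (its `alphas` parameter plays the role of λ; the data are synthetic).

```python
# Choose lambda for ridge by cross-validation over a log-spaced grid.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=200)

alphas = np.logspace(-3, 3, 50)  # grid of candidate lambdas
model = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas, cv=5))
model.fit(X, y)
print("chosen lambda:", model.named_steps["ridgecv"].alpha_)
```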
The Lasso

Lasso regression solves the following optimization:

\[
\hat\beta^L_\lambda = \arg\min_{\beta} \; \sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{i,j} \Big)^2 + \lambda \sum_{j=1}^p |\beta_j|.
\]

The first term is the RSS of the model (the loss); the second is the ℓ₁ norm of β, or ‖β‖₁ (the penalty or regularization).

Why would we use the Lasso instead of Ridge regression? Ridge regression shrinks all the coefficients to a non-zero value. The Lasso shrinks some of the coefficients all the way to zero. It is thus an alternative to best subset selection or stepwise selection!
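A sketch of this sparsity with scikit-learn's Lasso on synthetic data. Note that scikit-learn's objective scales the RSS by 1/(2n), so its `alpha` is a rescaled version of the λ above.

```python
# The lasso's defining behavior: as the penalty grows, coefficients
# hit exactly zero.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 8))
y = X @ np.array([3.0, -2.0, 0, 0, 0, 0, 0, 1.5]) + rng.normal(size=100)

for alpha in [0.01, 0.1, 1.0]:
    coef = Lasso(alpha=alpha).fit(X, y).coef_
    print(f"alpha={alpha:5.2f}  nonzero={np.sum(coef != 0)} ",
          np.round(coef, 2))
```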
Example. Ridge regression

Ridge regression of balance in the Credit dataset.

[Figure: standardized coefficients for Income, Limit, Rating, and Student, plotted against λ on a log scale (left) and against ‖β̂_λ^R‖₂ / ‖β̂‖₂ (right).]

There are a lot of pesky small coefficients throughout the regularization path.
Example. The Lasso

Lasso regression of balance in the Credit dataset.

[Figure: standardized coefficients for Income, Limit, Rating, and Student, plotted against λ on a log scale (left) and against ‖β̂_λ^L‖₁ / ‖β̂‖₁ (right).]

Those small coefficients are shrunk exactly to zero.
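A sketch of tracing the regularization path with scikit-learn's `lasso_path` on synthetic data, showing how coefficients enter the model as the penalty decreases.

```python
# Trace the lasso path: count nonzero coefficients at each penalty level.
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 8))
y = X @ np.array([3.0, -2.0, 0, 0, 0, 0, 0, 1.5]) + rng.normal(size=100)

alphas, coefs, _ = lasso_path(X, y, n_alphas=10)
for a, c in zip(alphas, coefs.T):  # coefs has shape (n_features, n_alphas)
    print(f"alpha={a:7.3f}  nonzero={np.sum(c != 0)}")
```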
An alternative formulation for regularization

Ridge: for every λ, there is an s such that β̂_λ^R solves:

\[
\min_{\beta} \sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{i,j} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^p \beta_j^2 \le s.
\]

Lasso: for every λ, there is an s such that β̂_λ^L solves:

\[
\min_{\beta} \sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{i,j} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^p |\beta_j| \le s.
\]

Best subset:

\[
\min_{\beta} \sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{i,j} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^p \mathbf{1}(\beta_j \neq 0) \le s.
\]
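A sketch of why the penalized and constrained forms agree, via the Lagrangian; this argument is rigorous for the convex penalties (ridge and lasso), while for best subset the correspondence is only formal.

```latex
% Sketch: for a convex penalty P(beta) (sum beta_j^2 for ridge,
% sum |beta_j| for the lasso), attach multiplier lambda >= 0 to the
% constraint P(beta) <= s. Minimizing the Lagrangian over beta is the
% penalized problem, and its minimizer solves the constrained problem
% with budget s = P(beta-hat_lambda):
\mathcal{L}(\beta, \lambda) = \mathrm{RSS}(\beta) + \lambda \bigl( P(\beta) - s \bigr),
\qquad
\hat\beta_\lambda = \arg\min_{\beta} \ \mathrm{RSS}(\beta) + \lambda P(\beta)
\;\Longrightarrow\;
\hat\beta_\lambda = \arg\min_{P(\beta) \le s} \mathrm{RSS}(\beta)
\ \text{ for } s = P(\hat\beta_\lambda).
```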
Visualizing Ridge and the Lasso with 2 predictors

[Figure: the constraint regions, the diamond |β₁| + |β₂| ≤ s for the Lasso (left) and the disk β₁² + β₂² ≤ s for Ridge (right), together with red ellipses of equal RSS. The solution sits where the smallest ellipse touches the constraint region; the Lasso's corners lie on the axes, which is why some coefficients land exactly at zero.]
When is the Lasso better than Ridge?

Example 1. All coefficients β_j are non-zero.

[Figure: squared bias, variance, and MSE for the Lasso and for Ridge, plotted against λ (left) and against the R² on the training data (right).]

The bias is about the same for both methods. The variance of Ridge regression is smaller, and so is its MSE.

Rule of thumb: the Lasso is typically no better than Ridge for prediction when most variables are useful for prediction. (A simulation sketch contrasting the two regimes follows Example 2 below.)
When is the Lasso better than Ridge?

Example 2. Only 2 of 45 coefficients β_j are non-zero.

[Figure: squared bias, variance, and MSE for the Lasso and for Ridge, plotted against λ (left) and against the R² on the training data (right).]

The bias, variance, and MSE are all lower for the Lasso.

Rule of thumb: the Lasso is especially effective when most variables are not useful for prediction.
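A simulation sketch contrasting the two regimes above with cross-validated ridge and lasso; the data and sizes are synthetic and illustrative, not the book's simulation.

```python
# Dense truth (all coefficients matter, Example 1) vs sparse truth
# (2 of 45 nonzero, Example 2), compared on held-out test MSE.
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(6)
n, p = 100, 45
scenarios = {
    "dense (all 45 nonzero)": rng.normal(size=p),
    "sparse (2 of 45 nonzero)": np.r_[3.0, -3.0, np.zeros(p - 2)],
}
# Typically ridge wins (or ties) under the dense truth, the lasso
# under the sparse one.
for name, beta in scenarios.items():
    X, Xtest = rng.normal(size=(n, p)), rng.normal(size=(1000, p))
    y = X @ beta + rng.normal(size=n)
    ytest = Xtest @ beta + rng.normal(size=1000)
    for label, model in [("ridge", RidgeCV(alphas=np.logspace(-3, 3, 30))),
                         ("lasso", LassoCV(cv=5))]:
        model.fit(X, y)
        mse = mean_squared_error(ytest, model.predict(Xtest))
        print(f"{name:26s} {label}: test MSE = {mse:.2f}")
```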
Choosing λ by cross-validation

[Figure: cross-validation error (left) and standardized coefficient paths (right), each plotted against ‖β̂_λ^L‖₁ / ‖β̂‖₁.]
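A sketch of inspecting the cross-validation error curve with scikit-learn's LassoCV, whose `mse_path_` attribute stores the per-fold CV error at each candidate λ (synthetic data; `alpha` again plays the role of λ).

```python
# Reproduce the left panel's idea: the CV error curve over the lambda grid.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 20))
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(size=200)

model = LassoCV(cv=10).fit(X, y)
cv_err = model.mse_path_.mean(axis=1)  # mean CV error per alpha
for a, e in list(zip(model.alphas_, cv_err))[::5]:
    print(f"alpha={a:8.4f}  cv error={e:6.3f}")
print("chosen lambda:", model.alpha_)
print("nonzero coefficients:", np.sum(model.coef_ != 0))
```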
A very special case

Suppose n = p and our matrix of predictors is X = I.

Then, the objective function in Ridge regression simplifies to

\[
\sum_{j=1}^p (y_j - \beta_j)^2 + \lambda \sum_{j=1}^p \beta_j^2,
\]

and we can minimize the terms that involve each β_j separately:

\[
(y_j - \beta_j)^2 + \lambda \beta_j^2.
\]

Setting the derivative with respect to β_j to zero, −2(y_j − β_j) + 2λβ_j = 0, gives

\[
\hat\beta_j^R = \frac{y_j}{1 + \lambda}.
\]
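A quick numeric check of this closed form on a grid, with illustrative values.

```python
# Minimize (y - b)^2 + lam * b^2 on a grid; compare to y / (1 + lam).
import numpy as np

y_j, lam = 2.5, 3.0
grid = np.linspace(-5, 5, 100001)
objective = (y_j - grid) ** 2 + lam * grid ** 2
print("grid minimizer:", grid[objective.argmin()])  # ~ 0.625
print("closed form:   ", y_j / (1 + lam))           # 0.625
```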
A very special case

Similar story for the Lasso; the objective function is

\[
\sum_{j=1}^p (y_j - \beta_j)^2 + \lambda \sum_{j=1}^p |\beta_j|,
\]

and we can minimize the terms that involve each β_j separately:

\[
(y_j - \beta_j)^2 + \lambda |\beta_j|.
\]

It is easy to show that

\[
\hat\beta_j^L =
\begin{cases}
y_j - \lambda/2 & \text{if } y_j > \lambda/2; \\
y_j + \lambda/2 & \text{if } y_j < -\lambda/2; \\
0 & \text{if } |y_j| \le \lambda/2.
\end{cases}
\]

This is soft-thresholding: small responses are set exactly to zero, while large ones are shifted toward zero by λ/2.
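A minimal sketch of this soft-thresholding rule, checked against a grid minimization of the one-dimensional objective (illustrative values).

```python
# Verify the soft-threshold formula against a direct grid search.
import numpy as np

def soft_threshold(y, lam):
    """Lasso solution for a single coordinate when X = I."""
    return np.sign(y) * np.maximum(np.abs(y) - lam / 2.0, 0.0)

lam = 2.0
grid = np.linspace(-5, 5, 100001)
for y_j in [3.0, 0.4, -2.2]:
    objective = (y_j - grid) ** 2 + lam * np.abs(grid)
    print(f"y={y_j:5.2f}  grid: {grid[objective.argmin()]:6.3f}  "
          f"formula: {soft_threshold(y_j, lam):6.3f}")
```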
Lasso and Ridge coefficients as a function of λ

[Figure: the coefficient estimate plotted against y_j for a fixed λ, with the least-squares line for reference. Ridge (left) shrinks the least-squares estimate proportionally; the Lasso (right) soft-thresholds it, with a flat region at zero around the origin.]
Bayesian interpretations

Ridge: β̂^R is the posterior mean, with a Normal prior on β.

Lasso: β̂^L is the posterior mode, with a Laplace prior on β.

[Figure: the Normal prior density g(β_j) (left) and the Laplace prior density (right), each as a function of β_j; the Laplace density has a sharp peak at zero.]
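A small numeric sketch of the ridge statement in the X = I special case, with illustrative values: if y_j ~ N(β_j, σ²) with prior β_j ~ N(0, τ²), the posterior mode (which equals the posterior mean here) coincides with the ridge estimate at λ = σ²/τ².

```python
# MAP under a Normal prior vs the ridge closed form, in the X = I case.
import numpy as np

y_j, sigma2, tau2 = 2.5, 1.0, 0.5
lam = sigma2 / tau2  # lambda = sigma^2 / tau^2

grid = np.linspace(-5, 5, 100001)
neg_log_post = (y_j - grid) ** 2 / (2 * sigma2) + grid ** 2 / (2 * tau2)
print("posterior mode:", grid[neg_log_post.argmin()])  # ~ 0.833
print("ridge estimate:", y_j / (1 + lam))              # 2.5 / 3
```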