Lecture 14: Shrinkage


Reading: Section 6.2. STATS 202: Data mining and analysis, October 27, 2017.

Shrinkage methods

The idea is to perform a linear regression while regularizing, or shrinking, the coefficients $\hat\beta$ toward 0.

Why would shrunk coefficients be better? Shrinking introduces bias, but it may significantly decrease the variance of the estimates. If the reduction in variance outweighs the increase in bias, the test error decreases.

There is also a Bayesian motivation: the prior tends to shrink the parameters toward 0.

Ridge regression

Ridge regression solves the following optimization:

$$\min_\beta \; \sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{i,j} \Big)^2 + \lambda \sum_{j=1}^p \beta_j^2$$

The first term is the RSS of the model (the loss); the second is the squared $\ell_2$ norm of $\beta$, written $\|\beta\|_2^2$ (the penalty or regularization). Note that the intercept is not penalized, just the slopes.

The parameter $\lambda$ is a tuning parameter: it modulates the importance of fit vs. shrinkage. We compute the estimate $\hat\beta^R_\lambda$ for many values of $\lambda$ and then choose $\lambda$ by cross-validation.
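As an aside not on the slide, here is a minimal numpy sketch of this optimization. For centered data (so that the intercept drops out and is not penalized), the ridge criterion has the closed-form minimizer $(X^\top X + \lambda I)^{-1} X^\top y$; the toy data and variable names below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y = X @ beta_true + rng.normal(size=n)

# Center X and y so the intercept is not penalized;
# the intercept is then recovered as the mean of y.
Xc = X - X.mean(axis=0)
yc = y - y.mean()

lam = 10.0
# Closed-form ridge solution: (X'X + lambda * I)^{-1} X'y
beta_ridge = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
beta_ols = np.linalg.solve(Xc.T @ Xc, Xc.T @ yc)

print("least squares:", np.round(beta_ols, 3))
print("ridge        :", np.round(beta_ridge, 3))  # shrunk toward 0
```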

Bias-variance tradeoff

In a simulation study, we compute bias, variance, and test error as a function of $\lambda$.

[Figure: squared bias, variance, and test mean squared error plotted against $\lambda$ (left) and against $\|\hat\beta^R_\lambda\|_2 / \|\hat\beta\|_2$ (right).]
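A rough sketch of such a simulation (the data-generating model, replicate count, and $\lambda$ grid below are hypothetical, not the ones behind the figure):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 50, 45
X = rng.normal(size=(n, p))        # design held fixed across replicates
beta_true = rng.normal(size=p)

lambdas = [0.1, 1.0, 10.0, 100.0]
reps = 500
for lam in lambdas:
    estimates = []
    for _ in range(reps):
        y = X @ beta_true + rng.normal(size=n)
        # ridge fit (no intercept in this toy model)
        estimates.append(np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y))
    estimates = np.array(estimates)
    bias2 = np.sum((estimates.mean(axis=0) - beta_true) ** 2)
    var = np.sum(estimates.var(axis=0))
    print(f"lambda={lam:7.1f}  bias^2={bias2:10.2f}  variance={var:10.2f}")
```

As $\lambda$ grows, the squared bias increases while the variance falls, which is the tradeoff the figure illustrates.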

Ridge regression

In least-squares linear regression, scaling the variables has no effect on the fit of the model

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p.$$

Multiplying $X_1$ by $c$ can be compensated by dividing $\hat\beta_1$ by $c$; after doing this we have the same RSS.

In ridge regression this is no longer true, because the penalty term discourages large coefficients.

In practice, what do we do? Scale each variable so that it has sample variance 1 before running the regression. This prevents penalizing some coefficients more than others.
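A minimal scikit-learn sketch of this convention (the toy data, column scales, and pipeline below are illustrative assumptions, not part of the lecture):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 0.01])  # very different scales
y = X @ np.array([2.0, 0.02, 50.0]) + rng.normal(size=200)

# Standardize each column to unit sample variance before applying the penalty,
# so no coefficient is penalized more heavily just because of its units.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)
print(model.named_steps["ridge"].coef_)   # coefficients on the standardized scale
```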

Example: Ridge regression

Ridge regression of default in the Credit dataset.

[Figure: standardized coefficients for Income, Limit, Rating, and Student plotted against $\lambda$ (left) and against $\|\hat\beta^R_\lambda\|_2 / \|\hat\beta\|_2$ (right).]

Selecting λ by cross-validation

[Figure: cross-validation error (left) and standardized coefficients (right) as functions of $\lambda$.]
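A sketch of this selection step with scikit-learn's RidgeCV (the $\lambda$ grid and toy data are illustrative assumptions):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=150)

# Search a log-spaced grid of lambda values; RidgeCV keeps the one
# with the smallest cross-validation error.
lambdas = np.logspace(-3, 3, 50)
model = make_pipeline(StandardScaler(), RidgeCV(alphas=lambdas, cv=10))
model.fit(X, y)
print("chosen lambda:", model.named_steps["ridgecv"].alpha_)
```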

The Lasso

Lasso regression solves the following optimization:

$$\min_\beta \; \sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{i,j} \Big)^2 + \lambda \sum_{j=1}^p |\beta_j|$$

The first term is the RSS of the model (the loss); the second is the $\ell_1$ norm of $\beta$, written $\|\beta\|_1$ (the penalty or regularization).

Why would we use the Lasso instead of Ridge regression? Ridge regression shrinks all the coefficients to non-zero values. The Lasso shrinks some of the coefficients all the way to zero, which makes it an alternative to best subset selection or stepwise selection.
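A minimal sketch contrasting the two penalties on a sparse toy problem (the data and penalty levels are illustrative assumptions; scikit-learn's `alpha` plays the role of $\lambda$ up to scaling):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(3)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:2] = [5.0, -3.0]          # only 2 of 10 true coefficients are non-zero
y = X @ beta_true + rng.normal(size=n)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

# Ridge shrinks every coefficient but leaves them all non-zero;
# the Lasso typically sets the irrelevant ones exactly to zero.
print("ridge non-zero coefficients:", int(np.sum(ridge.coef_ != 0)))
print("lasso non-zero coefficients:", int(np.sum(lasso.coef_ != 0)))
```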

Example: Ridge regression

Ridge regression of default in the Credit dataset.

[Figure: standardized coefficients for Income, Limit, Rating, and Student plotted against $\lambda$ (left) and against $\|\hat\beta^R_\lambda\|_2 / \|\hat\beta\|_2$ (right).]

There are a lot of pesky small coefficients throughout the regularization path.

Example: The Lasso

Lasso regression of default in the Credit dataset.

[Figure: standardized coefficients for Income, Limit, Rating, and Student plotted against $\lambda$ (left) and against $\|\hat\beta^L_\lambda\|_1 / \|\hat\beta\|_1$ (right).]

Those coefficients are shrunk all the way to zero.

An alternative formulation for regularization

Ridge: for every $\lambda$, there is an $s$ such that $\hat\beta^R_\lambda$ solves

$$\min_\beta \; \sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{i,j} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^p \beta_j^2 \le s.$$

Lasso: for every $\lambda$, there is an $s$ such that $\hat\beta^L_\lambda$ solves

$$\min_\beta \; \sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{i,j} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^p |\beta_j| \le s.$$

Best subset selection:

$$\min_\beta \; \sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{i,j} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^p \mathbf{1}(\beta_j \neq 0) \le s.$$

Visualizing Ridge and the Lasso with 2 predictors

[Figure: the constraint regions $\sum_{j=1}^p |\beta_j| \le s$ for the Lasso and $\sum_{j=1}^p \beta_j^2 \le s$ for Ridge regression, together with red ellipses marking contours of equal RSS.]

When is the Lasso better than Ridge?

Example 1: all coefficients $\beta_j$ are non-zero.

[Figure: squared bias, variance, and MSE for the Lasso and for Ridge, plotted against $\lambda$ (left) and against the $R^2$ on the training data (right).]

The bias is about the same for both methods, but the variance of Ridge regression is smaller, and so is its MSE.

Rule of thumb: the Lasso is typically no better than Ridge for prediction when most variables are useful for prediction.

When is the Lasso better than Ridge?

Example 2: only 2 of 45 coefficients $\beta_j$ are non-zero.

[Figure: squared bias, variance, and MSE for the Lasso and for Ridge, plotted against $\lambda$ (left) and against the $R^2$ on the training data (right).]

The bias, variance, and MSE are all lower for the Lasso.

Rule of thumb: the Lasso is especially effective when most variables are not useful for prediction.
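A small simulation in the spirit of Example 2, with cross-validated penalties for both methods (the sample sizes and data-generating model are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(4)
n, p = 200, 45
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:2] = [4.0, -4.0]          # only 2 of 45 coefficients are non-zero
y = X @ beta_true + rng.normal(size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

ridge = RidgeCV(alphas=np.logspace(-3, 3, 50)).fit(X_tr, y_tr)
lasso = LassoCV(cv=10).fit(X_tr, y_tr)

# In sparse settings like this one, the lasso usually attains a lower test MSE.
print("ridge test MSE:", mean_squared_error(y_te, ridge.predict(X_te)))
print("lasso test MSE:", mean_squared_error(y_te, lasso.predict(X_te)))
```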

Choosing λ by cross-validation

[Figure: cross-validation error (left) and standardized coefficients (right), both plotted against $\|\hat\beta^L_\lambda\|_1 / \|\hat\beta\|_1$.]

A very special case

Suppose $n = p$, the matrix of predictors is $X = I$, and the model has no intercept. Then the objective function in Ridge regression simplifies to

$$\sum_{j=1}^p (y_j - \beta_j)^2 + \lambda \sum_{j=1}^p \beta_j^2,$$

and we can minimize the terms that involve each $\beta_j$ separately:

$$(y_j - \beta_j)^2 + \lambda \beta_j^2.$$

It is easy to show that

$$\hat\beta^R_j = \frac{y_j}{1 + \lambda}.$$
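To make the "easy to show" step explicit, differentiate the per-coordinate objective and set the derivative to zero:

$$\frac{d}{d\beta_j}\Big[(y_j - \beta_j)^2 + \lambda \beta_j^2\Big] = -2(y_j - \beta_j) + 2\lambda\beta_j = 0 \;\Longrightarrow\; \hat\beta^R_j = \frac{y_j}{1 + \lambda}.$$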

A very special case

The story is similar for the Lasso; the objective function is

$$\sum_{j=1}^p (y_j - \beta_j)^2 + \lambda \sum_{j=1}^p |\beta_j|,$$

and we can minimize the terms that involve each $\beta_j$ separately:

$$(y_j - \beta_j)^2 + \lambda |\beta_j|.$$

It is easy to show that

$$\hat\beta^L_j = \begin{cases} y_j - \lambda/2 & \text{if } y_j > \lambda/2; \\ y_j + \lambda/2 & \text{if } y_j < -\lambda/2; \\ 0 & \text{if } |y_j| \le \lambda/2. \end{cases}$$
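A minimal numpy sketch of these two closed forms, often called soft-thresholding in the Lasso case (the function names are hypothetical):

```python
import numpy as np

def ridge_special_case(y, lam):
    """Ridge solution when X = I and there is no intercept: proportional shrinkage."""
    return y / (1.0 + lam)

def lasso_special_case(y, lam):
    """Lasso solution when X = I and there is no intercept: soft-thresholding at lam/2."""
    return np.sign(y) * np.maximum(np.abs(y) - lam / 2.0, 0.0)

y = np.array([-3.0, -0.4, 0.1, 0.8, 2.5])
lam = 1.0
print(ridge_special_case(y, lam))  # every entry shrunk toward 0, none exactly 0
print(lasso_special_case(y, lam))  # entries with |y_j| <= 0.5 become exactly 0
```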

Lasso and Ridge coefficients as a function of λ

[Figure: in this special case, the Ridge (left) and Lasso (right) coefficient estimates plotted against $y_j$, with the least-squares estimate shown for comparison.]

Bayesian interpretations

Ridge: $\hat\beta^R$ is the posterior mean, with a Normal prior on $\beta$.

Lasso: $\hat\beta^L$ is the posterior mode, with a Laplace prior on $\beta$.

[Figure: the Normal (left) and Laplace (right) prior densities $g(\beta_j)$.]
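A brief sketch of why this holds, assuming i.i.d. Gaussian errors with variance $\sigma^2$ and independent priors on the slopes (assumptions not stated on the slide): the log-posterior is

$$\log p(\beta \mid y) = -\frac{1}{2\sigma^2}\sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{i,j}\Big)^2 + \log p(\beta) + \text{const},$$

so an $N(0, \tau^2)$ prior contributes $-\sum_j \beta_j^2 / (2\tau^2)$ and the posterior mode matches ridge with $\lambda = \sigma^2/\tau^2$ (for this Gaussian case the posterior mode and posterior mean coincide), while a Laplace prior with scale $b$ contributes $-\sum_j |\beta_j| / b$ and the posterior mode matches the lasso with $\lambda = 2\sigma^2/b$.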