Shrinkage Methods: Ridge and Lasso

Shrinkage Methods: Ridge and Lasso
Jonathan Hersh, Chapman University, Argyros School of Business
hersh@chapman.edu
February 27, 2019

Plan
1 Intro and Background (Introduction)
2 Ridge Regression (Example: Ridge & Multicollinearity)
3 Lasso
4 Applications & Extensions
5 Conclusion

Source material: Introduction to Statistical Learning, Chapter 6

Shrinkage Methods
Consider the case where we have many more predictors than observations.
Shrinkage methods fit a model with all p predictors, but the estimated coefficients are shrunken towards zero.
In the extreme case N < p, \hat{\beta} = (X'X)^{-1} X'Y cannot be computed: X'X is not full rank and cannot be inverted.
Example: bioinformatics, where we predict cancer cells (Y) by gene type (X) and the number of genes far exceeds the number of patients.

Recall: bias-variance tradeoff
Prediction error decomposes as
E\big[(y_i - \hat{f}(x_i))^2\big] = \underbrace{\sigma_\varepsilon^2}_{\text{irreducible error}} + \underbrace{\mathrm{Var}(\hat{f})}_{\text{variance}} + \underbrace{\big(f - E[\hat{f}]\big)^2}_{\text{bias}^2}
Bias is minimized when f = E[\hat{f}], but total error (variance + bias) may be minimized by some other \hat{f}.
An \hat{f}(x) with smaller variance: fewer variables, smaller magnitude coefficients.
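
To see the tradeoff numerically, here is a minimal R simulation (the coefficient values and variable names are illustrative assumptions, not from the slides): dropping a weak predictor introduces a little bias at a test point x0, but lowers the variance of the prediction enough to reduce total error.

set.seed(1)
x0 <- c(1, 1)                                   # test point (x1, x2)
f0 <- 1 * x0[1] + 0.05 * x0[2]                  # true f(x0); x2 has a very weak effect
preds <- replicate(2000, {
  x1 <- rnorm(30); x2 <- rnorm(30)
  y  <- 1 * x1 + 0.05 * x2 + rnorm(30)
  full  <- lm(y ~ x1 + x2)                      # keeps the weak predictor
  small <- lm(y ~ x1)                           # omits it: some bias, less variance
  c(full  = sum(coef(full)  * c(1, x0)),
    small = sum(coef(small) * c(1, x0[1])))
})
rowMeans((preds - f0)^2)   # total error (bias^2 + variance) at x0; typically lower for the small model here
apply(preds, 1, var)       # prediction variance: lower for the smaller model

Shrinkage methods achieve the same effect by pulling coefficients towards zero rather than dropping variables outright.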

Ridge Regression
OLS minimizes the sum of squared residuals:
RSS = \sum_{i=1}^{N} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2
To reduce prediction error, minimize \mathrm{Var}(\hat{f}(x)) = \mathrm{Var}(X\hat{\beta}). One way: decrease the \beta_j in absolute value.

Ridge Regression Definitions
The ridge estimator \hat{\beta}^{ridge} is defined as
\hat{\beta}^{ridge} = \underset{\beta}{\operatorname{argmin}} \; \underbrace{\sum_{i=1}^{N} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2}_{\text{RSS}} + \underbrace{\lambda \sum_{j=1}^{p} \beta_j^2}_{\text{Shrinkage Factor}}
where \lambda \geq 0 is a tuning parameter (or hyper-parameter).

Ridge Continued
The ridge estimator still seeks coefficients that fit the data well, i.e. that reduce the RSS.
The second term, \lambda \sum_{j=1}^{p} \beta_j^2, ensures that it does so in a balanced way, so that bias isn't minimized at the expense of variance.
The tuning parameter \lambda controls the relative impact of bias and variance: larger \lambda means more bias.
Note that as \lambda \to \infty, \hat{\beta}^{ridge} \to 0, and as \lambda \to 0, \hat{\beta}^{ridge} \to \hat{\beta}^{OLS}.

Ridge Continued
In matrix form: \hat{\beta}^{ridge} = (X'X + \lambda I_K)^{-1} X'Y
Note that (X'X + \lambda I_K) is positive definite even when K > N.
Coefficients are shrunk smoothly towards zero.
Bayesian interpretation: with a Gaussian prior \beta \sim N(0, \tau^2 I_K), \hat{\beta}^{ridge} is the posterior mean/mode/median.
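
A minimal R sketch of the matrix form on simulated data (names are illustrative): the ordinary normal equations break down when p > N, while the ridge system remains solvable.

set.seed(1)
N <- 50; p <- 100                            # more predictors than observations
X <- matrix(rnorm(N * p), N, p)
colnames(X) <- paste0("x", 1:p)
beta_true <- c(rep(2, 5), rep(0, p - 5))     # only 5 predictors truly matter
y <- X %*% beta_true + rnorm(N)

# solve(crossprod(X)) fails here: X'X is rank-deficient when p > N
lambda <- 10
beta_ridge <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
head(beta_ridge)                             # shrunk, but none exactly zero

In practice one standardizes X first (glmnet does this by default) and leaves the intercept unpenalized.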

How to Choose λ?
In practice we estimate the model over a range of λ values and choose the value that minimizes cross-validated error.
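
A hedged sketch with glmnet, reusing the simulated X and y from above (alpha = 0 gives the ridge penalty):

library(glmnet)
cv_ridge <- cv.glmnet(x = X, y = y, alpha = 0, nfolds = 10)   # cross-validate over a grid of lambdas
cv_ridge$lambda.min    # lambda with the lowest cross-validated error
cv_ridge$lambda.1se    # largest lambda within one standard error of the minimum
plot(cv_ridge)         # error curve across lambda values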

Root Mean Squared Error Across λ Values (figure)

Ridge Notes
A small amount of shrinkage usually improves prediction performance, particularly when the number of variables is large and variance is likely high.
Variables are never completely shrunk to zero, only made very small in absolute value, so ridge works poorly for variable selection.
Useful when you have reason to suspect the underlying DGP is non-sparse.

How can ridge help with multicollinearity? Quick example in R

library(MASS)    # provides lm.ridge()

# Generate x1 and x2 that are highly collinear
x1 <- rnorm(20)
x2 <- rnorm(20, mean = x1, sd = .01)
y  <- rnorm(20, mean = 3 + x1 + x2)

# OLS regression
OLSmod <- lm(y ~ x1 + x2)

# Ridge regression
RIDGEmod <- lm.ridge(y ~ x1 + x2, lambda = 1)

Multicollinearity Example (figure): red line = OLS, blue = ridge

LASSO (Least Absolute Shrinkage and Selection Operator) (Tibshirani, 1996)
Lasso regression looks very similar to ridge. The lasso estimator \hat{\beta}^{lasso} minimizes the penalized objective
\underbrace{\sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2}_{\text{RSS}} + \underbrace{\lambda \sum_{j=1}^{p} |\beta_j|}_{\text{Shrinkage Factor}}
where \lambda \geq 0 is a tuning parameter.
The magnitude of the coefficients is penalized with an absolute value loss. Because of the absolute value, it is more efficient to "spend" coefficient magnitude only on useful variables, so the lasso acts as variable selection, though the retained estimates are also shrunk towards zero.
Again, as \lambda \to 0, \hat{\beta}^{lasso} \to \hat{\beta}^{OLS}, and as \lambda \to \infty, \hat{\beta}^{lasso} \to 0.
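
A minimal glmnet sketch, reusing the simulated X and y from the ridge example above, showing that the lasso sets many coefficients exactly to zero while ridge only shrinks them:

library(glmnet)
cv_lasso <- cv.glmnet(x = X, y = y, alpha = 1, nfolds = 10)   # alpha = 1: lasso
b_lasso  <- coef(cv_lasso, s = "lambda.min")
sum(b_lasso != 0)                                             # number of nonzero coefficients (incl. intercept)

cv_ridge <- cv.glmnet(x = X, y = y, alpha = 0, nfolds = 10)   # alpha = 0: ridge
sum(coef(cv_ridge, s = "lambda.min") != 0)                    # all coefficients remain nonzero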

Visualization: Lasso, Ridge, and OLS Coefficients (figure)

Comparing Ridge and Lasso
Unlike ridge, the lasso has no analytic solution, but it is computationally very feasible even with large datasets because the optimization problem is convex.
With large datasets, inverting (X'X + \lambda I_K) for ridge is expensive.
The lasso has favorable properties if the true model is sparse.
If the distribution of coefficients is very thick tailed (a few variables matter a lot), the lasso will do much better than ridge. If there are many moderate sized effects, ridge may do better.

Social Scientists are Coming Around to Lasso (figure)

Bayesian Interpretation of Lasso
Lasso coefficients are the mode of the posterior distribution in a normal linear model with Laplace priors, p(\beta) \propto \exp\big(-\lambda \sum_{k} |\beta_k|\big).
It is slightly odd that we pick the mode rather than the mean of the posterior distribution.
Related: the spike-and-slab prior.

Related Estimators
Least Angle Regression (LARS): the "S" alludes to the method's close ties to the lasso and to stagewise regression. A stagewise procedure that iteratively selects regressors to include in the regression function.
Dantzig Selector (Candès & Tao, 2007): lasso-type regularization, but minimizing the maximum correlation between residuals and covariates. Doesn't work particularly well in practice.
Elastic Net:
\min_{\beta} \sum_{i=1}^{N} (Y_i - X_i'\beta)^2 + \lambda \Big[ \alpha \underbrace{\|\beta\|_1}_{\text{Lasso penalty}} + (1 - \alpha) \underbrace{\|\beta\|_2^2}_{\text{Ridge penalty}} \Big]
\alpha controls the weighting between ridge and lasso and is obtained through cross-validation.
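
A hedged sketch of cross-validating \alpha with a simple manual grid (the glmnetUtils package discussed later automates this), reusing the simulated X and y from above; a common foldid keeps the folds fixed so the \alpha values are comparable:

library(glmnet)
foldid <- sample(rep(1:10, length.out = nrow(X)))   # fix the folds across alpha values
alphas <- seq(0, 1, by = 0.25)
cv_err <- sapply(alphas, function(a) {
  min(cv.glmnet(x = X, y = y, alpha = a, foldid = foldid)$cvm)   # best CV error at this alpha
})
alphas[which.min(cv_err)]                            # chosen alpha: 0 = ridge, 1 = lasso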

Oracle Property (Fan and Li, 2001)
If the true model is sparse, so that there are few (say k) true non-zero coefficients and many (K - k) true zero coefficients, an estimator has the oracle property if inference is as if you knew the true model, i.e. knew a priori exactly which coefficients were truly zero.
Limitation: the sample size needs to be large relative to k.
What this means: you can ignore the selection of covariates in the calculation of the standard errors, and can just use regular OLS standard errors.

Estimating Lasso/Ridge Models in R
Many packages exist; glmnet is written and maintained by Friedman, Hastie, and Tibshirani.
cv.glmnet() estimates a series of lasso models for various levels of λ. build.x() and build.y() are helper functions that build glmnet-compatible X and Y matrices.

library(glmnet)
library(useful)                            # provides build.x() and build.y()
Xvars <- build.x(formula, data = df)       # glmnet-compatible X matrix
Yvar  <- build.y(formula, data = df)       # glmnet-compatible Y vector
lasso.mod <- cv.glmnet(x = Xvars, y = Yvar, alpha = 1, nfolds = 10)
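
Coefficients and predictions at the cross-validated λ can then be pulled from the fitted object (using the objects from the snippet above):

coef(lasso.mod, s = "lambda.min")                    # coefficients at the CV-chosen lambda
predict(lasso.mod, newx = Xvars, s = "lambda.1se")   # predictions at a more conservative lambda
plot(lasso.mod)                                      # cross-validated error curve across lambda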

Stata Implementation of Lasso: elasticregress

Automatic α Selection: Package glmnetUtils

Automatic α Selection: Function cva.glmnet

Caveats of Lasso for Poverty Modeling
1. Because |\hat{\beta}^{lasso}| < |\hat{\beta}^{OLS}|, lasso predictions \hat{y} are shrunk relative to OLS predictions.
1.1 Don't predict directly from the lasso; use the lasso for model selection, then do ELL.
2. Lasso may drop variables with hierarchical relationships, e.g. age and age^2.
2.1 Use the Sparse Group Lasso (SGL package), which respects the hierarchical structure (e.g. if age^2 is selected, age must be selected); a rough sketch follows below.
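
A rough sketch of grouping age and age^2 so they are penalized together. The variable names are hypothetical, and the exact SGL call signature is an assumption to be checked against the package documentation:

library(SGL)
# Hypothetical covariates age, educ, hhsize and outcome y (illustrative names only)
X_grp <- cbind(age = age, age2 = age^2, educ = educ, hhsize = hhsize)
index <- c(1, 1, 2, 3)    # age and age^2 share group 1, so they enter the penalty together
fit <- cvSGL(data = list(x = X_grp, y = y), index = index, type = "linear")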

Afzal, Hersh and Newhouse (2015): Lasso for model selection for poverty mapping in Sri Lanka and Pakistan

Afzal, Hersh and Newhouse (2015) Continued
Lasso works well for model selection when the number of candidate variables is large (100+).
It is no worse than stepwise selection when the set of variables is small.

Post-Model Selection Estimator: Belloni & Chernozhukov (2013)
Belloni & Chernozhukov (2013) define the two-step Post-Lasso estimator as:
1. Estimate a lasso model using the full candidate set of variables (X_candidate).
2. Use the selected variables (X_selected) to estimate the final model using the modeling strategy of your choice.
Because of the oracle property of the lasso (Fan and Li, 2001), inference in the second stage using the reduced set of variables is consistent with inference from a single-stage estimation strategy using only the selected variables present in the true data-generating process.
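
A hedged two-step sketch with glmnet and OLS (assuming X_candidate is a numeric matrix with named columns and y is the outcome; an illustration, not the authors' code):

library(glmnet)
# Step 1: lasso on the full candidate set
cv_fit   <- cv.glmnet(x = X_candidate, y = y, alpha = 1)
b        <- as.matrix(coef(cv_fit, s = "lambda.min"))
selected <- setdiff(rownames(b)[b != 0], "(Intercept)")

# Step 2: refit the selected variables with the final modeling strategy (here, OLS)
post_df  <- data.frame(y = y, X_candidate[, selected, drop = FALSE])
post_ols <- lm(y ~ ., data = post_df)
summary(post_ols)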

Heteroscedastic Robust Lasso
See: Belloni, Chen, Chernozhukov, Hansen (Econometrica, 2012)

Double Selection Procedure for Estimating Treatment Effects
Consider the problem of estimating the effect of a treatment d_i on some outcome y_i in the presence of possibly confounding controls x_i.
Double Selection Method:
1. Select via lasso the controls x_{ij} that predict y_i.
2. Select via lasso the controls x_{ij} that predict d_i.
3. Run OLS of y_i on d_i and the union of the controls selected in steps 1 and 2.
The authors claim that the additional selection step controls the omitted variable bias.
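
A simplified glmnet sketch of the three steps (x is a matrix of candidate controls with named columns, d the treatment, y the outcome; all assumed objects). The hdm R package by Chernozhukov, Hansen and Spindler offers a fuller implementation; for a binary treatment, step 2 could use family = "binomial".

library(glmnet)
select_nonzero <- function(x, target) {
  b <- as.matrix(coef(cv.glmnet(x, target, alpha = 1), s = "lambda.min"))
  setdiff(rownames(b)[b != 0], "(Intercept)")
}
controls_y <- select_nonzero(x, y)     # step 1: controls that predict the outcome
controls_d <- select_nonzero(x, d)     # step 2: controls that predict the treatment
keep <- union(controls_y, controls_d)  # union of selected controls
dat  <- data.frame(y = y, d = d, x[, keep, drop = FALSE])
summary(lm(y ~ ., data = dat))         # step 3: the coefficient on d is the treatment effect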

Double Selection vs Naive Approach

Replicating Donohue and Levitt

Replicating AJR (2001)

Conclusion
Use lasso regression if you have reason to believe the true model is sparse; use ridge otherwise.
The lasso's sparsity offers a disciplined method of variable selection.
But be careful predicting from the lasso: do Lasso + ELL (Elbers, Lanjouw, Lanjouw).
Select λ using k-fold cross-validation, and use a test sample to approximate out-of-sample error.