A Short Introduction to the Lasso Methodology

Michael Gutmann (sites.google.com/site/michaelgutmann)
University of Helsinki, Aalto University, Helsinki Institute for Information Technology
March 9, 2016

Goals of the lecture

Lasso: Least Absolute Shrinkage and Selection Operator.
Goal: after the lecture, to understand what these words mean.
- Shrinkage: the lasso shrinks / regularizes the least squares regression coefficients (like ridge regression).
- Selection: the lasso also performs variable selection (unlike ridge regression).
- Least absolute: shrinkage and selection are achieved by penalizing the absolute values of the regression coefficients.

Linear regression

Data: {(x_1, y_1), ..., (x_n, y_n)}, n observations of pairs (x_i, y_i), with covariates x_i ∈ R^p and response y_i ∈ R.

[Figure: scatter plot of covariate versus response with candidate regression lines.]

Assumption: linear relation between covariates and response,

    y_i = x_{i1} \beta_1 + \ldots + x_{ip} \beta_p + e_i    (1)
        = x_i^\top \beta + e_i    (2)

where e_i is the residual.
Goal: determine the coefficients β = (β_1, ..., β_p).

Least squares

Minimize the residual sum of squares (RSS):

    RSS(\beta) = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n \left( y_i - x_i^\top \beta \right)^2    (3)

In vector notation, with

    y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \qquad
    X = \begin{pmatrix} x_1^\top \\ \vdots \\ x_n^\top \end{pmatrix}
      = \begin{pmatrix} x_{11} & \dots & x_{1p} \\ \vdots & & \vdots \\ x_{n1} & \dots & x_{np} \end{pmatrix},    (4)

we have

    RSS(\beta) = \| y - X\beta \|_2^2    (5)

Least squares

Closed-form solution if the p × p matrix X^\top X is invertible:

    \hat{\beta}^o = \arg\min_\beta RSS(\beta)    (6)
                  = (X^\top X)^{-1} X^\top y    (7)

Prediction given a test covariate vector x:

    \hat{y} = x^\top \hat{\beta}^o    (8)

[Figure: data, candidate regression lines, and the least squares solution.]
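As a small illustration (not part of the original slides), here is a minimal NumPy sketch of the closed-form least squares solution (7), the RSS (5), and the prediction (8); the simulated data and all variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative simulated data: n observations, p covariates
n, p = 50, 10
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, 2.0, 1.0] + [0.0] * (p - 3))
y = X @ beta_true + rng.normal(size=n)

# Closed-form least squares solution (assumes X'X is invertible),
# computed with a linear solve rather than an explicit inverse
beta_hat_o = np.linalg.solve(X.T @ X, X.T @ y)

# Residual sum of squares RSS(beta) = ||y - X beta||^2
rss = np.sum((y - X @ beta_hat_o) ** 2)

# Prediction for a new covariate vector x
x_new = rng.normal(size=p)
y_hat = x_new @ beta_hat_o
print(rss, y_hat)
```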

Ridge regression

If X^\top X is not invertible, a regularized inverse can be taken:

    \hat{\beta}^r = (X^\top X + \lambda I_p)^{-1} X^\top y    (9)

where I_p is the p × p identity matrix and λ ≥ 0 the regularization parameter. This is ridge regression: \hat{\beta}^r minimizes J_r(β),

    J_r(\beta) = \| y - X\beta \|_2^2 + \lambda \sum_{j=1}^p \beta_j^2    (10)

As λ increases, \hat{\beta}^r shrinks towards zero ("shrinkage").
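A hedged sketch of the ridge estimator (9) on simulated data, showing the shrinkage as λ grows; the data generation and the grid of λ values are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.normal(size=(n, p))
y = X @ np.array([3.0, 2.0, 1.0] + [0.0] * (p - 3)) + rng.normal(size=n)

# Ridge estimator: beta_r = (X'X + lam * I_p)^{-1} X'y
for lam in [0.0, 1.0, 10.0, 100.0]:
    beta_r = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    # Coefficients shrink towards zero as lam increases
    print(lam, np.round(beta_r, 2))
```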

Benefits of ridge regression

    \hat{\beta}^r = (X^\top X + \lambda I_p)^{-1} X^\top y

Regularization / shrinkage is useful even if X^\top X is invertible. Reason: it can improve prediction accuracy.
Example: n = 50 observations, p = 10 covariates, and an orthonormal matrix X with X^\top X = I_p, so that

    \hat{\beta}^r = \frac{1}{1+\lambda} X^\top y = \frac{1}{1+\lambda} \hat{\beta}^o

[Figure: mean squared prediction error of ridge regression as a function of the penalty λ (log10 scale), compared with least squares; intermediate values of λ give a smaller error than λ = 0.]
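To see the 1/(1+λ) factor numerically, the sketch below constructs a design with orthonormal columns via a QR decomposition (my own choice for illustration; the slide does not specify how its X was generated) and checks that the ridge estimate equals the least squares estimate divided by 1+λ.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 10, 2.0

# Design with orthonormal columns: Q'Q = I_p
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))
beta_true = np.array([3.0, 2.0, 1.0] + [0.0] * (p - 3))
y = Q @ beta_true + rng.normal(size=n)

beta_ols = Q.T @ y  # least squares, since (Q'Q)^{-1} Q'y = Q'y
beta_ridge = np.linalg.solve(Q.T @ Q + lam * np.eye(p), Q.T @ y)

# Ridge should equal least squares scaled by 1/(1+lam)
print(np.allclose(beta_ridge, beta_ols / (1 + lam)))
```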

Limits of ridge regression

    \hat{\beta}^r = \frac{1}{1+\lambda} \hat{\beta}^o

Data were artificially generated with β = (3, 2, 1, 0, ..., 0). The vector is sparse: only 3/10 terms are nonzero.
Ridge regression cannot recover a sparse β: it performs shrinkage but not variable selection.
Variable selection: some \hat{\beta}_j are set to zero; the corresponding covariates are omitted from the fitted model.

[Figure: ridge regression coefficient paths as a function of the penalty λ (log10 scale); the coefficients shrink gradually but are not set exactly to zero.]

Some practical aspects

Choice of λ: via cross-validation.
The ridge solution \hat{\beta}^r depends on the scale of the covariates:
- Center so that \sum_{i=1}^n y_i = \sum_{i=1}^n x_{ij} = 0.
- Re-scale so that \sum_{i=1}^n x_{ij}^2 = 1.
In the following, assume that the data were preprocessed in this manner.
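A minimal sketch of this centering and re-scaling step; the array names and simulated data are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = rng.normal(size=50)

# Center the response and each covariate, then re-scale each covariate
# so that its squared entries sum to one
y_centered = y - y.mean()
X_centered = X - X.mean(axis=0)
X_scaled = X_centered / np.sqrt(np.sum(X_centered ** 2, axis=0))

# Sanity checks: column sums are ~0 and squared column sums are ~1
print(np.allclose(X_scaled.sum(axis=0), 0.0),
      np.allclose(np.sum(X_scaled ** 2, axis=0), 1.0))
```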

Importance of variable selection

It reduces the complexity of the models; the models become easier to interpret.
It makes prediction cheaper: only the covariates with nonzero \hat{\beta}_j need to be measured. Compare

    \hat{y} = x_1 \hat{\beta}_1 + \ldots + x_{1000} \hat{\beta}_{1000}
    \quad \text{versus} \quad
    \hat{y} = x_1 \hat{\beta}_1 + x_2 \hat{\beta}_2 + x_3 \hat{\beta}_3

Lasso regression

Lasso regression consists in minimizing J_L(β),

    J_L(\beta) = \| y - X\beta \|_2^2 + \lambda \sum_{j=1}^p |\beta_j|    (11)

This is similar to the cost function J_r(β) for ridge regression,

    J_r(\beta) = \| y - X\beta \|_2^2 + \lambda \sum_{j=1}^p \beta_j^2    (12)

λ ≥ 0 is the regularization (shrinkage) parameter.
Penalty: sum of absolute values instead of sum of squares. The difference seems minor, but it results in very different behavior: it enables both shrinkage and selection of covariates.
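The two cost functions are easy to write down side by side; the sketch below only evaluates J_L and J_r at a given coefficient vector (minimizing J_L in general requires a numerical solver), with illustrative data and names of my own choosing.

```python
import numpy as np

def J_lasso(beta, X, y, lam):
    """Lasso cost (11): squared error plus lambda times the sum of absolute values."""
    return np.sum((y - X @ beta) ** 2) + lam * np.sum(np.abs(beta))

def J_ridge(beta, X, y, lam):
    """Ridge cost (12): squared error plus lambda times the sum of squares."""
    return np.sum((y - X @ beta) ** 2) + lam * np.sum(beta ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X @ np.array([3.0, 2.0, 1.0] + [0.0] * 7) + rng.normal(size=50)
beta = np.linalg.solve(X.T @ X, X.T @ y)  # least squares coefficients

print(J_lasso(beta, X, y, lam=2.0), J_ridge(beta, X, y, lam=2.0))
```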

Shrinkage and variable selection with the lasso

The lasso generally lacks an analytical solution. There is a closed-form solution when X^\top X = I_p:

    \hat{\beta}_j^L = \begin{cases}
      \hat{\beta}_j^o - \lambda/2 & \text{if } \hat{\beta}_j^o \ge \lambda/2 \\
      0 & \text{if } \hat{\beta}_j^o \in (-\lambda/2, \lambda/2) \\
      \hat{\beta}_j^o + \lambda/2 & \text{if } \hat{\beta}_j^o \le -\lambda/2
    \end{cases}    (13)

[Figure: lasso coefficient (left) and ridge coefficient (right) as functions of the least squares coefficient; the lasso performs both shrinkage and variable selection, the ridge only shrinkage.]
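A sketch of the soft-thresholding formula (13): applied componentwise to the least squares coefficients of an orthonormal design, it gives the lasso solution. This illustrates only the special case X^\top X = I_p, not a general lasso solver; the simulated design is my own choice.

```python
import numpy as np

def soft_threshold(beta_ols, lam):
    """Lasso solution (13) when X'X = I_p: shrink by lam/2, set small coefficients to zero."""
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam / 2, 0.0)

rng = np.random.default_rng(0)
n, p = 50, 10
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))        # orthonormal columns
y = Q @ np.array([3.0, 2.0, 1.0] + [0.0] * (p - 3)) + rng.normal(size=n)

beta_ols = Q.T @ y                                   # least squares for Q'Q = I_p
print(np.round(soft_threshold(beta_ols, lam=2.0), 2))
```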

Back to the example

Data were artificially generated with β = (3, 2, 1, 0, ..., 0). The vector is sparse: only 3/10 terms are nonzero.
Lasso regression combines shrinkage and variable selection.

[Figure: left, mean squared prediction error of lasso regression versus the penalty λ (log10 scale), compared with least squares; right, lasso coefficient paths versus λ.]

Proof

Assume X^\top X = I_p. We want to show that the shrinkage and selection operator

    \hat{\beta}_j^L = \begin{cases}
      \hat{\beta}_j^o - \lambda/2 & \text{if } \hat{\beta}_j^o \ge \lambda/2 \\
      0 & \text{if } \hat{\beta}_j^o \in (-\lambda/2, \lambda/2) \\
      \hat{\beta}_j^o + \lambda/2 & \text{if } \hat{\beta}_j^o \le -\lambda/2
    \end{cases}    (14)

minimizes J_L(β),

    J_L(\beta) = \| y - X\beta \|_2^2 + \lambda \sum_{j=1}^p |\beta_j|    (15)

Proof

    J_L(\beta) = \| y - X\beta \|_2^2 + \lambda \sum_{j=1}^p |\beta_j|    (16)
               = (y - X\beta)^\top (y - X\beta) + \lambda \sum_{j=1}^p |\beta_j|    (17)
               = y^\top y - y^\top X\beta - \beta^\top X^\top y + \beta^\top X^\top X \beta + \lambda \sum_{j=1}^p |\beta_j|    (18)
               = y^\top y - 2\beta^\top X^\top y + \beta^\top \underbrace{X^\top X}_{= I_p} \beta + \lambda \sum_{j=1}^p |\beta_j|    (19)
               = y^\top y - 2\beta^\top \underbrace{X^\top y}_{= \hat{\beta}^o =: r} + \beta^\top \beta + \lambda \sum_{j=1}^p |\beta_j|    (20)

Proof

    J_L(\beta) = y^\top y - 2\beta^\top r + \beta^\top \beta + \lambda \sum_{j=1}^p |\beta_j|    (21)
               = y^\top y - 2 \sum_{j=1}^p \beta_j r_j + \sum_{j=1}^p \beta_j^2 + \lambda \sum_{j=1}^p |\beta_j|    (22)
               = y^\top y + \sum_{j=1}^p \underbrace{\left( -2 \beta_j r_j + \beta_j^2 + \lambda |\beta_j| \right)}_{f_j(\beta_j)}    (23)
               = \text{constant} + \sum_{j=1}^p f_j(\beta_j)    (24)

For X^\top X = I_p, the optimization problem decomposes into p independent problems. Minimizing each f_j(β_j) separately will minimize J_L(β).

Proof

Drop the subscripts for a moment and consider a single f only:

    f(\beta) = \beta^2 - 2 r \beta + \lambda |\beta|    (25)

Problem: the derivative at zero is not defined.

[Figure: f(β) for r = 1 and λ = 1, 2, 3; each curve has a kink at β = 0.]

Proof

Approach: make a smooth approximation |β| ≈ h_ε(β),

    h_\varepsilon(\beta) = \begin{cases}
      \dfrac{\varepsilon}{2} + \dfrac{1}{2\varepsilon} \beta^2 & \text{if } \beta \in (-\varepsilon, \varepsilon) \\
      |\beta| & \text{otherwise}
    \end{cases}    (26)

Do all the work with ε > 0 and, at the end, take the limit ε → 0.

[Figure: the absolute value and its smooth approximation h_ε near zero.]
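A short sketch of h_ε and its derivative, useful for checking the formulas on a grid; the value of ε and the grid are arbitrary illustrative choices.

```python
import numpy as np

def h_eps(beta, eps):
    """Smooth approximation (26) of |beta|: quadratic on (-eps, eps), |beta| outside."""
    beta = np.asarray(beta, dtype=float)
    return np.where(np.abs(beta) < eps,
                    eps / 2 + beta ** 2 / (2 * eps),
                    np.abs(beta))

def h_eps_prime(beta, eps):
    """Derivative of h_eps: beta/eps on (-eps, eps), sign(beta) outside."""
    beta = np.asarray(beta, dtype=float)
    return np.where(np.abs(beta) < eps, beta / eps, np.sign(beta))

grid = np.linspace(-0.2, 0.2, 9)
print(np.round(h_eps(grid, eps=0.05), 3))
print(h_eps_prime(grid, eps=0.05))
```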

Proof

Using h_ε(β) instead of |β| gives

    f(\beta) = \beta^2 - 2 r \beta + \lambda h_\varepsilon(\beta)    (27)

The derivative of f(β) is

    f'(\beta) = 2\beta - 2r + \lambda h_\varepsilon'(\beta),    (28)

where

    h_\varepsilon'(\beta) = \begin{cases}
      1 & \text{if } \beta \ge \varepsilon \\
      \beta / \varepsilon & \text{if } \beta \in (-\varepsilon, \varepsilon) \\
      -1 & \text{if } \beta \le -\varepsilon
    \end{cases}

[Figure: the derivative h_ε'(β), a ramp from -1 to 1 on (-ε, ε).]

Proof

Setting the derivative of f(β) to zero gives the condition

    2\beta - 2r + \lambda h_\varepsilon'(\beta) = 0    (29)
    \beta + \frac{\lambda}{2} h_\varepsilon'(\beta) = r    (30)

The left-hand side is a piecewise linear, monotonically increasing function g_ε(β),

    g_\varepsilon(\beta) = \begin{cases}
      \beta + \lambda/2 & \text{if } \beta \ge \varepsilon \\
      \beta \left( 1 + \dfrac{\lambda}{2\varepsilon} \right) & \text{if } \beta \in (-\varepsilon, \varepsilon) \\
      \beta - \lambda/2 & \text{if } \beta \le -\varepsilon
    \end{cases}

so β is uniquely determined by r.

[Figure: g_ε(β) near zero.]

Proof

There are three cases:

1. If r ≥ ε + λ/2:  β + λ/2 = r, so β = r - λ/2.
2. If r ∈ (-ε - λ/2, ε + λ/2):  β (2ε + λ)/(2ε) = r, so β = 2εr / (2ε + λ).
3. If r ≤ -ε - λ/2:  β - λ/2 = r, so β = r + λ/2.

[Figure: the solution β as a function of r, with the three cases marked.]

Hence

    \beta = \begin{cases}
      r - \lambda/2 & \text{if } r \ge \varepsilon + \lambda/2 \\
      \dfrac{2\varepsilon}{2\varepsilon + \lambda}\, r & \text{if } r \in (-\varepsilon - \lambda/2,\; \varepsilon + \lambda/2) \\
      r + \lambda/2 & \text{if } r \le -\varepsilon - \lambda/2
    \end{cases}

Proof

Taking the limit ε → 0 gives

    \hat{\beta} = \begin{cases}
      r - \lambda/2 & \text{if } r \ge \lambda/2 \\
      0 & \text{if } r \in (-\lambda/2, \lambda/2) \\
      r + \lambda/2 & \text{if } r \le -\lambda/2
    \end{cases}    (31)

With the subscripts, and r_j = \hat{\beta}_j^o, we have

    \hat{\beta}_j = \begin{cases}
      \hat{\beta}_j^o - \lambda/2 & \text{if } \hat{\beta}_j^o \ge \lambda/2 \\
      0 & \text{if } \hat{\beta}_j^o \in (-\lambda/2, \lambda/2) \\
      \hat{\beta}_j^o + \lambda/2 & \text{if } \hat{\beta}_j^o \le -\lambda/2
    \end{cases}    (32)

which is the lasso solution \hat{\beta}_j^L.
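As a numerical sanity check (my addition, not on the slides), the minimizer of f(β) = β² − 2rβ + λ|β| found on a fine grid can be compared with the soft-threshold value (31) for a few values of r.

```python
import numpy as np

def f(beta, r, lam):
    """Per-coordinate lasso cost f(beta) = beta^2 - 2*r*beta + lam*|beta|."""
    return beta ** 2 - 2 * r * beta + lam * np.abs(beta)

lam = 2.0
grid = np.linspace(-5.0, 5.0, 100001)   # fine grid over a range containing the minimizer
for r in [-3.0, -0.5, 0.0, 0.7, 2.5]:
    closed_form = np.sign(r) * max(abs(r) - lam / 2, 0.0)   # soft threshold (31)
    grid_minimizer = grid[np.argmin(f(grid, r, lam))]
    print(r, closed_form, round(float(grid_minimizer), 3))
```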

Summary

Lasso: Least Absolute Shrinkage and Selection Operator.
- A method to regularize linear regression (like ridge regression); regularization / shrinkage can improve prediction accuracy.
- A method to perform covariate selection (unlike ridge regression); covariate selection reduces the complexity of the fitted models and makes them easier to interpret.
- The combination of shrinkage and selection is achieved by penalizing the absolute values of the regression coefficients.

Appendix

Constrained optimization point of view

Ridge regression:

    \min_\beta \; \| y - X\beta \|_2^2 \quad \text{subject to} \quad \sum_{j=1}^p \beta_j^2 \le t

Lasso regression:

    \min_\beta \; \| y - X\beta \|_2^2 \quad \text{subject to} \quad \sum_{j=1}^p |\beta_j| \le t

[Figure: iso-contours of the RSS, which increases away from the least squares solution, together with the constraint set; based on figures from chapter 6 of Introduction to Statistical Learning.]
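A brief note, not spelled out on the slide, on why the penalized and constrained formulations correspond: if β̂ minimizes the penalized lasso objective for some λ ≥ 0, then it also solves the constrained problem with budget t = Σ_j |β̂_j| (and analogously for ridge).

```latex
% Sketch of the correspondence between the penalized and constrained lasso:
\hat{\beta} \in \arg\min_{\beta} \ \|y - X\beta\|_2^2 + \lambda \sum_{j=1}^p |\beta_j|
\quad \Longrightarrow \quad
\hat{\beta} \in \arg\min_{\beta \,:\, \sum_j |\beta_j| \le t} \ \|y - X\beta\|_2^2,
\qquad t = \sum_{j=1}^p |\hat{\beta}_j|.
```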