Lasso, Ridge, and Elastic Net


David Rosenberg (New York University)
DS-GA 1003, February 7, 2017

Linearly Dependent Features

Suppose We Have Two Equal Features

Input features x1, x2 ∈ R; outcome y ∈ R. Linear prediction functions f(x) = w1 x1 + w2 x2. Suppose x1 = x2. Then all functions with w1 + w2 = k are the same: they give the same predictions and have the same empirical risk. Which function will we select if we do ERM with an ℓ1 or ℓ2 constraint?

Equal Features, ℓ2 Constraint

[Figure: level sets w1 + w2 = 2√2 + c for c = 0, 1.75, 3.5, 5.25, 7, 8.75, together with the ℓ2 ball ‖w‖2 ≤ 2.]

Suppose the line w1 + w2 = 2√2 + 3.5 corresponds to the empirical risk minimizers; empirical risk increases as we move away from these parameter settings. The intersection of the level set w1 + w2 = 2√2 with the norm ball ‖w‖2 ≤ 2 is the ridge solution. Note that w1 = w2 at the solution.

Equal Features, ℓ1 Constraint

[Figure: level sets w1 + w2 = 2, 3.75, 5.5, 7.25, 9, 10.75, together with the ℓ1 ball ‖w‖1 ≤ 2.]

Suppose the line w1 + w2 = 5.5 corresponds to the empirical risk minimizers. The intersection of the level set w1 + w2 = 2 with the norm ball ‖w‖1 ≤ 2 is the lasso solution. Note that the solution set is {(w1, w2) : w1 + w2 = 2, w1, w2 ≥ 0}.

Linearly Related Features

Same setup, but now suppose x2 = 2x1. Then all functions with w1 + 2w2 = k are the same: they give the same predictions and have the same empirical risk. Which function will we select if we do ERM with an ℓ1 or ℓ2 constraint?

Linearly Related Features, ℓ2 Constraint

[Figure: level sets w1 + 2w2 = 10/√5 + c for c = 0, 3.5, 7, 10.5, together with the ℓ2 ball ‖w‖2 ≤ 2.]

Suppose the line w1 + 2w2 = 10/√5 + 7 corresponds to the empirical risk minimizers. The intersection of the level set w1 + 2w2 = 10/√5 with the norm ball ‖w‖2 ≤ 2 is the ridge solution. At the solution, w2 = 2w1.

Linearly Related Features, ℓ1 Constraint

[Figure: level sets w1 + 2w2 = 4, 8, 12, 16, together with the ℓ1 ball ‖w‖1 ≤ 2.]

The intersection of the level set w1 + 2w2 = 4 with the norm ball ‖w‖1 ≤ 2 is the lasso solution. The solution is now a corner of the ℓ1 ball (w1 = 0, w2 = 2), corresponding to a sparse solution.
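These two constrained problems can also be solved numerically. The following is a minimal sketch (not from the original slides), assuming cvxpy as a convex solver and a made-up data set with x2 = 2x1 and y ≈ 5x1, so that the ERM line lies outside both constraint sets:

```python
# Solve the constrained ERM problems from the pictures directly:
#   minimize ||y - X w||^2  subject to  ||w||_2 <= 2   (ridge-style constraint)
#   minimize ||y - X w||^2  subject to  ||w||_1 <= 2   (lasso-style constraint)
# with linearly related features x2 = 2 * x1.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
X = np.column_stack([x1, 2 * x1])          # x2 = 2 * x1
y = 5 * x1 + 0.1 * rng.normal(size=n)      # ERM line is roughly w1 + 2*w2 = 5

for name, make_constraint in [("l2 ball", lambda w: cp.norm(w, 2) <= 2),
                              ("l1 ball", lambda w: cp.norm1(w) <= 2)]:
    w = cp.Variable(2)
    prob = cp.Problem(cp.Minimize(cp.sum_squares(y - X @ w)), [make_constraint(w)])
    prob.solve()
    print(name, np.round(w.value, 3))
# Expected qualitatively: for the l2 ball, w2 is about 2*w1 (a point on the ball
# boundary); for the l1 ball, the solution sits at the corner near (0, 2).
```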

Linearly Dependent Features: Take-Away

For identical features:
  - ℓ1 regularization spreads the weight arbitrarily (all weights have the same sign).
  - ℓ2 regularization spreads the weight evenly.
For linearly related features:
  - ℓ1 regularization puts all the weight on the variable with the larger scale, and 0 weight on the others.
  - ℓ2 regularization also prefers variables with larger scale, spreading the weight proportionally to scale.
(A small numerical illustration of these take-aways follows below.)
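The sketch below illustrates these take-aways with scikit-learn's Ridge and Lasso; it is not from the original slides, and the data-generating setup and regularization strengths are assumptions chosen only for illustration:

```python
# Hypothetical illustration of the take-aways above: ridge splits weight evenly
# across identical features, while lasso's split among identical columns is
# solver-dependent; with x2 = 2*x1, lasso favors the larger-scale feature.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
y = 4 * x1 + 0.1 * rng.normal(size=n)      # "true" function is roughly 4*x1

# Case 1: identical features x1 and x2 = x1.
X_dup = np.column_stack([x1, x1])
print(Ridge(alpha=1.0).fit(X_dup, y).coef_)   # roughly equal weights, e.g. about [2, 2]
print(Lasso(alpha=0.1).fit(X_dup, y).coef_)   # split of the weight depends on the solver

# Case 2: linearly related features x2 = 2 * x1.
X_rel = np.column_stack([x1, 2 * x1])
print(Ridge(alpha=1.0).fit(X_rel, y).coef_)   # weight roughly proportional to scale (w2 ~ 2*w1)
print(Lasso(alpha=0.1).fit(X_rel, y).coef_)   # essentially all weight on the larger-scale feature
```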

Empirical Risk for Square Loss and Linear Predictors

Recall our discussion of linear predictors f(x) = wᵀx with square loss: the sets of w giving the same empirical risk (i.e. the level sets) formed ellipsoids around the ERM. With x1 and x2 linearly related, we get a degenerate ellipse. That is why the level sets above were lines (actually pairs of lines, one on each side of the ERM). [Figure: KPM Fig. 13.3.]

Correlated Features, Same Scale

Suppose x1 and x2 are highly correlated and on the same scale. This is quite typical in real data after normalization. Nothing is degenerate here, so the level sets are ellipsoids. But the higher the correlation, the closer to degenerate we get: the ellipsoids keep stretching out, getting closer to a pair of parallel lines.
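This stretching is easy to visualize. The following sketch (not from the slides; the synthetic data and noise levels are arbitrary choices) plots the empirical risk level sets as x1 and x2 go from moderately correlated to exactly linearly dependent:

```python
# Visualize level sets of the empirical risk R(w) = mean((y - w1*x1 - w2*x2)^2):
# as x1 and x2 become more correlated the elliptical level sets stretch out, and
# with exact linear dependence they degenerate into (pairs of) parallel lines.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
grid = np.linspace(-4, 6, 200)
W1, W2 = np.meshgrid(grid, grid)

fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharey=True)
for ax, noise in zip(axes, [0.5, 0.1, 0.0]):   # noise = 0 gives x2 = x1 exactly
    x2 = x1 + noise * rng.normal(size=n)
    y = x1 + x2 + 0.1 * rng.normal(size=n)     # "true" weights are (1, 1)
    X = np.column_stack([x1, x2])
    A, b, c = X.T @ X / n, X.T @ y / n, (y ** 2).mean()
    # Quadratic form for the empirical risk, evaluated on the (w1, w2) grid.
    R = c - 2 * (b[0] * W1 + b[1] * W2) \
        + A[0, 0] * W1**2 + 2 * A[0, 1] * W1 * W2 + A[1, 1] * W2**2
    ax.contour(W1, W2, R, levels=20)
    ax.set_title(f"corr(x1, x2) = {np.corrcoef(x1, x2)[0, 1]:.2f}")
    ax.set_xlabel("w1")
axes[0].set_ylabel("w2")
plt.show()
```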

Correlated Features, ℓ1 Regularization

[Figure: two panels with the ℓ1 ball ‖w‖1 ≤ 2 in the (w1, w2) plane, showing the loss contours touching the ball at different points.]

The intersection could be anywhere on the top-right edge. Minor perturbations of the data can drastically change the intersection point, so the solution is very unstable. This makes the division of weight among highly correlated features (of the same scale) seem arbitrary. If x1 ≈ 2x2, the ellipse changes orientation and we probably hit a corner.
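As a rough illustration (not from the original slides), the sketch below fits the lasso to two slightly perturbed versions of the same data with two highly correlated, equal-scale features; all data-generating details are assumptions chosen for illustration, and the point is that the split of weight between the two features may change noticeably between the two fits:

```python
# Hypothetical sketch: lasso's split of weight between two highly correlated,
# equal-scale features can change a lot under a small perturbation of the data.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 200
z = rng.normal(size=n)
x1 = z + 0.05 * rng.normal(size=n)   # two noisy copies of the same signal,
x2 = z + 0.05 * rng.normal(size=n)   # so corr(x1, x2) is close to 1
X = np.column_stack([x1, x2])
y = 3 * z + 0.5 * rng.normal(size=n)

lasso = Lasso(alpha=0.05)
print(lasso.fit(X, y).coef_)                         # one split of the weight

X_perturbed = X + 0.01 * rng.normal(size=X.shape)    # tiny perturbation of the inputs
print(lasso.fit(X_perturbed, y).coef_)               # possibly a very different split
```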

A Very Simple Model

Suppose we have one feature x1 ∈ R and response variable y ∈ R. We got some data and ran least squares linear regression; the ERM is f̂(x1) = 4x1. What happens if we get a new feature x2, but we always have x2 = x1?

Duplicate Features

The new feature x2 gives no new information. f̂(x1, x2) = 4x1 is still an ERM, but now there are more ERMs as well:
  f̂(x1, x2) = 2x1 + 2x2
  f̂(x1, x2) = x1 + 3x2
  f̂(x1, x2) = 4x2
What happens if we introduce ℓ1 or ℓ2 regularization?

Duplicate Features: ℓ1 and ℓ2 Norms

f̂(x1, x2) = w1x1 + w2x2 is an ERM iff w1 + w2 = 4. Consider the ℓ1 and ℓ2 norms of various solutions:

  w1   w2   ‖w‖1   ‖w‖2²
   4    0     4      16
   2    2     4       8
   1    3     4      10
  -1    5     6      26

‖w‖1 does not discriminate among the solutions, as long as all weights have the same sign. ‖w‖2² is minimized when the weight is spread equally. Picture proof: the level sets of the loss are lines of the form w1 + w2 = c...
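A quick numerical check of the table above (a minimal sketch; the candidate weight vectors are taken directly from the table):

```python
# Recompute the l1 norm and squared l2 norm for each candidate ERM in the table.
import numpy as np

candidates = [(4, 0), (2, 2), (1, 3), (-1, 5)]   # each satisfies w1 + w2 = 4
for w1, w2 in candidates:
    w = np.array([w1, w2], dtype=float)
    print(w1, w2, np.abs(w).sum(), (w ** 2).sum())
```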

Correlated Features and the Grouping Issue

Example with Highly Correlated Features

The model in words: y is a linear combination of z1 and z2, but we don't observe z1 and z2 directly. Instead, we get 3 noisy observations of z1 and 3 noisy observations of z2, and we want to predict y from these noisy observations. (Example from Section 4.2 of Hastie et al.'s Statistical Learning with Sparsity.)

Example with Highly Correlated Features

Suppose (x, y) is generated as follows:
  z1, z2 ~ N(0, 1) (independent)
  ε0, ε1, ..., ε6 ~ N(0, 1) (independent)
  y = 3z1 − 1.5z2 + 2ε0
  xj = z1 + εj/5 for j = 1, 2, 3
  xj = z2 + εj/5 for j = 4, 5, 6
We generated a sample of (x, y) pairs of size 100. The correlations within each group of x's were around 0.97. (Example from Section 4.2 of Hastie et al.'s Statistical Learning with Sparsity.)
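A minimal simulation of this setup (a sketch, not the authors' code; the random seed and implementation details are assumptions):

```python
# Simulate the correlated-features example: two latent variables, each observed
# through three noisy copies, and a response depending on the latent variables.
import numpy as np

rng = np.random.default_rng(0)
n = 100
z1 = rng.normal(size=n)
z2 = rng.normal(size=n)
eps = rng.normal(size=(n, 7))          # eps[:, 0] is the response noise

y = 3 * z1 - 1.5 * z2 + 2 * eps[:, 0]
X = np.column_stack(
    [z1 + eps[:, j] / 5 for j in (1, 2, 3)] +   # three noisy copies of z1
    [z2 + eps[:, j] / 5 for j in (4, 5, 6)]     # three noisy copies of z2
)

# Within-group correlations should come out at roughly 0.96-0.97.
print(np.corrcoef(X[:, :3], rowvar=False).round(2))
print(np.corrcoef(X[:, 3:], rowvar=False).round(2))
```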

Example with Highly Correlated Features

Lasso regularization paths:

[Figure: coefficient paths (coefficients plotted against λ1), one colored line per feature; panels labeled "Lasso Paths" and "Elastic-Net Paths".]

Lines with the same color correspond to features carrying essentially the same information. The distribution of weight among them seems almost arbitrary.

Hedge Bets When Variables Are Highly Correlated

When variables are highly correlated (and on the same scale, after normalization), we want to give them roughly the same weight. Why? So that their errors can cancel out. How can we get the weight spread more evenly?

Elastic Net

The elastic net combines the lasso and ridge penalties:

  ŵ = argmin_{w ∈ R^d} (1/n) Σ_{i=1}^n (wᵀxᵢ − yᵢ)² + λ1 ‖w‖1 + λ2 ‖w‖2²

We expect correlated random variables to have similar coefficients.
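For concreteness, here is a sketch of fitting this kind of objective with scikit-learn's ElasticNet on the simulated data from above (regenerated so the block runs on its own). The mapping between the slide's (λ1, λ2) and scikit-learn's (alpha, l1_ratio) in the comments follows scikit-learn's documented objective, which differs from the slide's by constant factors; the particular alpha values are arbitrary choices:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Regenerate the simulated data: three noisy copies of z1, then three of z2.
rng = np.random.default_rng(0)
n = 100
z1, z2 = rng.normal(size=n), rng.normal(size=n)
eps = rng.normal(size=(n, 7))
y = 3 * z1 - 1.5 * z2 + 2 * eps[:, 0]
X = np.column_stack([z1 + eps[:, j] / 5 for j in (1, 2, 3)]
                    + [z2 + eps[:, j] / 5 for j in (4, 5, 6)])

# scikit-learn's ElasticNet minimizes
#   (1/(2n)) * ||y - Xw||^2 + alpha * l1_ratio * ||w||_1
#       + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2,
# which matches the slide's objective (up to an overall factor of 2) with
#   lambda_1 = 2 * alpha * l1_ratio   and   lambda_2 = alpha * (1 - l1_ratio).
enet = ElasticNet(alpha=0.3, l1_ratio=0.5).fit(X, y)
lasso = Lasso(alpha=0.15).fit(X, y)
print("lasso coefficients:      ", lasso.coef_.round(2))
print("elastic net coefficients:", enet.coef_.round(2))
# The elastic net tends to spread weight more evenly within each correlated group,
# while the lasso's split within a group is more arbitrary.
```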

Highly Correlated Features, Elastic Net Constraint

[Figure: loss contours together with the elastic net constraint set 0.8‖w‖1 + 0.2‖w‖2² ≤ 2 in the (w1, w2) plane.]

The elastic net solution is closer to the w2 = w1 line, despite the high correlation.

Elastic Net: Sparse Regions

Suppose the design matrix X is orthogonal, so XᵀX = I and the features are uncorrelated; then the loss contours are circles. If the OLS solution lies in the green or red regions of the figure, the elastic-net-constrained solution will be at a corner. [Figure from Mairal et al.'s Sparse Modeling for Image and Vision Processing, Fig. 1.9.]

Elastic Net: A Theorem for Correlated Variables

Theorem. Let ρij = ĉorr(xi, xj). Suppose ŵi and ŵj are selected by the elastic net, with y centered and the predictors x1, ..., xd standardized. If ŵi ŵj > 0, then

  |ŵi − ŵj| ≤ (‖y‖2 / λ2) √(2(1 − ρij)).

So the more correlated two standardized predictors are, and the larger λ2 is, the closer their elastic net coefficients must be.

(Theorem 1 in "Regularization and variable selection via the elastic net": https://web.stanford.edu/~hastie/papers/b67.2%20(2005)%20301-320%20zou%20&%20hastie.pdf)

Elastic Net Results on the Model

[Figure: coefficient paths for the simulated example, plotted against λ1 — lasso paths in the left panel, elastic net paths in the right panel.]

Lasso on the left; elastic net on the right. The ratio of ℓ2 to ℓ1 regularization is roughly 2:1.
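Coefficient paths like these can be computed along the following lines (a sketch, reusing the simulated data from above; l1_ratio = 1/3 is only a rough stand-in for the slide's 2:1 ratio of ℓ2 to ℓ1 regularization, since scikit-learn's convention differs by constant factors):

```python
# Compute lasso and elastic net coefficient paths over a grid of regularization
# strengths, in the spirit of the path plots on the slide.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import lasso_path, enet_path

# Simulated data: two groups of three highly correlated features.
rng = np.random.default_rng(0)
n = 100
z1, z2 = rng.normal(size=n), rng.normal(size=n)
eps = rng.normal(size=(n, 7))
y = 3 * z1 - 1.5 * z2 + 2 * eps[:, 0]
X = np.column_stack([z1 + eps[:, j] / 5 for j in (1, 2, 3)]
                    + [z2 + eps[:, j] / 5 for j in (4, 5, 6)])

alphas_lasso, coefs_lasso, _ = lasso_path(X, y)
alphas_enet, coefs_enet, _ = enet_path(X, y, l1_ratio=1/3)   # more l2 than l1 penalty

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for coef in coefs_lasso:                  # one row of coefficients per feature
    axes[0].plot(np.log10(alphas_lasso), coef)
for coef in coefs_enet:
    axes[1].plot(np.log10(alphas_enet), coef)
axes[0].set_title("Lasso paths")
axes[1].set_title("Elastic net paths")
axes[0].set_xlabel("log10(alpha)"); axes[1].set_xlabel("log10(alpha)")
axes[0].set_ylabel("coefficient")
plt.show()
```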

Extra Pictures

Elastic Net vs. Lasso Norm Ball

[Figure: the elastic net ball compared with the lasso ball, from Figure 4.2 of Hastie et al.'s Statistical Learning with Sparsity.]

The (ℓq)^q Constraint

Generalize to ℓq: (‖w‖q)^q = |w1|^q + |w2|^q. Note that ‖w‖q is a norm for q ≥ 1, but not for q ∈ (0, 1). Hypothesis space: F = {f(x) = w1x1 + w2x2}.

[Figure: contours of ‖w‖q^q = |w1|^q + |w2|^q for several values of q.]
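These contours are easy to draw directly. The following sketch (the particular values of q are arbitrary choices for illustration) plots the level set |w1|^q + |w2|^q = 1 for a few q:

```python
# Plot the level set |w1|^q + |w2|^q = 1 for several values of q, illustrating
# how the "ball" goes from diamond-like (q = 1) toward square-like (large q)
# and becomes non-convex for q < 1.
import numpy as np
import matplotlib.pyplot as plt

w1, w2 = np.meshgrid(np.linspace(-1.5, 1.5, 400), np.linspace(-1.5, 1.5, 400))
for q in (0.5, 1, 1.2, 2, 4):
    plt.contour(w1, w2, np.abs(w1) ** q + np.abs(w2) ** q, levels=[1.0])
plt.gca().set_aspect("equal")
plt.xlabel("w1"); plt.ylabel("w2")
plt.title("Level sets |w1|^q + |w2|^q = 1")
plt.show()
```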

ℓ1.2 vs. Elastic Net

[Figure comparing the ℓ1.2 ball with an elastic net ball, from Hastie et al.'s Elements of Statistical Learning.]