Lasso, Ridge, and Elastic Net


1 Lasso, Ridge, and Elastic Net
David Rosenberg (New York University), DS-GA 1003, October 29, 2016

2 A Very Simple Model
Suppose we have one feature $x_1 \in \mathbb{R}$ and a response variable $y \in \mathbb{R}$. We got some data and ran least squares linear regression. The empirical risk minimizer (ERM) is $\hat{f}(x_1) = 4x_1$. What happens if we get a new feature $x_2$, but we always have $x_2 = x_1$?

3 Duplicate Features
The new feature $x_2$ gives no new information. $\hat{f}(x_1, x_2) = 4x_1$ is still an ERM. Now there are some more ERMs:
$\hat{f}(x_1, x_2) = 2x_1 + 2x_2$
$\hat{f}(x_1, x_2) = x_1 + 3x_2$
$\hat{f}(x_1, x_2) = 4x_2$
What if we introduce $\ell_1$ or $\ell_2$ regularization?

4 Duplicate Features: $\ell_1$ and $\ell_2$ norms
$\hat{f}(x_1, x_2) = w_1 x_1 + w_2 x_2$ is an ERM iff $w_1 + w_2 = 4$. Consider the $\ell_1$ and $\ell_2$ norms of various solutions:

$w_1$   $w_2$   $\|w\|_1$   $\|w\|_2^2$
4       0       4           16
2       2       4           8
1       3       4           10
0       4       4           16

$\|w\|_1$ doesn't discriminate, as long as all weights have the same sign. $\|w\|_2^2$ is minimized when the weight is spread equally. Picture proof: level sets of the loss are lines of the form $w_1 + w_2 = c$...
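The comparison of norms can be checked with a few lines of arithmetic. This is a minimal sketch: the candidate list reuses the ERMs named on the previous slide, and the mixed-sign solution (-1, 5) is an illustrative assumption added here to show what happens when the signs differ.

```python
# Candidate weight vectors (w1, w2); each satisfies w1 + w2 = 4, so each is an ERM.
# The last entry, (-1, 5), is a hypothetical mixed-sign example, not from the slides.
candidates = [(4, 0), (2, 2), (1, 3), (0, 4), (-1, 5)]


def l1_norm(w):
    """Sum of absolute values of the weights."""
    return sum(abs(wi) for wi in w)


def l2_norm_squared(w):
    """Sum of squared weights."""
    return sum(wi * wi for wi in w)


rows = [(w, l1_norm(w), l2_norm_squared(w)) for w in candidates]
for w, l1, l2sq in rows:
    print(f"w = {w}: l1 norm = {l1}, squared l2 norm = {l2sq}")
```

Every same-sign solution has the same $\ell_1$ norm, while the squared $\ell_2$ norm is smallest at the evenly split solution (2, 2); the mixed-sign solution is larger in both norms.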

5 Duplicate Features: Take Away
For identical features:
$\ell_1$ regularization spreads weight arbitrarily (all weights same sign).
$\ell_2$ regularization spreads weight evenly.
Extrapolation to correlated variables:
$\ell_1$ regularization may choose just one variable from a group and ignore the rest.
$\ell_2$ regularization tends to spread weight roughly equally among correlated variables.
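The take-away for identical features can be illustrated numerically. The sketch below is an assumption-laden toy: the four data points, the step size, and the penalty strengths are made up for illustration, and plain gradient descent stands in for whatever solver one would use in practice. It uses the duplicated-feature setup with $x_2 = x_1$ and $y = 4x_1$.

```python
# Toy data (hypothetical values): one feature duplicated, y = 4 * x1 exactly.
xs = [0.5, 1.0, 1.5, 2.0]
ys = [4.0 * x for x in xs]
n = len(xs)


def ridge_gd(lam, w_init, steps=5000, lr=0.05):
    """Gradient descent on (1/n) * sum((w1*x + w2*x - y)^2) + lam * (w1^2 + w2^2)."""
    w1, w2 = w_init
    for _ in range(steps):
        residuals = [(w1 + w2) * x - y for x, y in zip(xs, ys)]
        # Both features are identical, so the loss gradient is the same for w1 and w2.
        g_loss = (2.0 / n) * sum(r * x for r, x in zip(residuals, xs))
        w1, w2 = w1 - lr * (g_loss + 2 * lam * w1), w2 - lr * (g_loss + 2 * lam * w2)
    return w1, w2


def lasso_objective(w, lam):
    """(1/n) * sum of squared errors plus lam times the l1 norm of w."""
    w1, w2 = w
    mse = sum(((w1 + w2) * x - y) ** 2 for x, y in zip(xs, ys)) / n
    return mse + lam * (abs(w1) + abs(w2))


# With no penalty, the starting split (4, 0) is already an ERM and is left unchanged.
w_ols = ridge_gd(0.0, (4.0, 0.0))
# With an l2 penalty, the same start converges to equal weights.
w_ridge = ridge_gd(0.1, (4.0, 0.0))
print("no penalty:", w_ols)
print("l2 penalty:", w_ridge)

# The l1-penalized objective cannot tell same-sign splits with equal sum apart,
# but a mixed-sign split with the same sum is penalized more.
same_sign = [lasso_objective(w, 0.5) for w in [(3.0, 0.0), (1.5, 1.5), (1.0, 2.0)]]
mixed_sign = lasso_objective((4.0, -1.0), 0.5)
print("same-sign splits:", same_sign, "mixed-sign split:", mixed_sign)
```

The design choice of starting ridge from the lopsided point (4, 0) is deliberate: any difference between the two weights is shrunk only by the $\ell_2$ penalty, so the equal split in the output comes from the regularizer and not from the initialization.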

6 Example with highly correlated features
Model in words: $y$ is a linear combination of $z_1$ and $z_2$. But we don't observe $z_1$ and $z_2$ directly. We get 3 noisy observations of $z_1$. We get 3 noisy observations of $z_2$. We want to predict $y$ from our noisy observations.
Example based on Section 4.2 in Hastie et al.'s Statistical Learning with Sparsity.

7 Example with highly correlated features
Suppose $(x, y)$ is generated as follows:
$z_1, z_2 \sim \mathcal{N}(0,1)$ (independent)
$\varepsilon_0, \varepsilon_1, \ldots, \varepsilon_6 \sim \mathcal{N}(0,1)$ (independent)
$y = 3z_1 - 1.5z_2 + 2\varepsilon_0$
$x_j = z_1 + \varepsilon_j/5$ for $j = 1,2,3$, and $x_j = z_2 + \varepsilon_j/5$ for $j = 4,5,6$
We generated a sample of $(x, y)$ pairs of size 100. Correlations within the groups of $x$'s were high: under this model the population correlation within a group is $1/(1 + 1/25) \approx 0.96$.
Example based on Section 4.2 in Hastie et al.'s Statistical Learning with Sparsity.
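The generative model can be simulated directly. This is a minimal sketch using only the standard library; the function names, the seed, and the coefficients 3, -1.5, and 2 in the formula for $y$ are assumptions here, the coefficients following the Section 4.2 setup in Hastie et al. as best understood.

```python
import math
import random


def generate_sample(n=100, seed=0):
    """Draw n (x, y) pairs: two latent variables, three noisy copies of each."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        eps = [rng.gauss(0, 1) for _ in range(7)]  # eps[0] .. eps[6]
        # Assumed response model: y = 3*z1 - 1.5*z2 + 2*eps_0.
        y.append(3 * z1 - 1.5 * z2 + 2 * eps[0])
        # x_1..x_3 are noisy views of z1; x_4..x_6 are noisy views of z2.
        X.append([z1 + eps[j] / 5 for j in (1, 2, 3)]
                 + [z2 + eps[j] / 5 for j in (4, 5, 6)])
    return X, y


def column(X, j):
    """Extract feature j (0-based) as a list."""
    return [row[j] for row in X]


def correlation(a, b):
    """Sample Pearson correlation of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


X, y = generate_sample()
within_group = correlation(column(X, 0), column(X, 1))
across_groups = correlation(column(X, 0), column(X, 3))
print(f"within-group correlation:  {within_group:.3f}")
print(f"across-group correlation: {across_groups:.3f}")
```

Features in the same group share a latent variable and differ only by small noise, so their sample correlation should be close to 1, while features from different groups are built from independent latent variables.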

8 Example with highly correlated features
Lasso regularization paths: [figure]
This is not a good outcome. Why?
From Figure 4.1 of Hastie et al.'s Statistical Learning with Sparsity.

9 Hedge Bets When Variables Highly Correlated
When variables are highly correlated, we want to give them roughly the same weight. Why? Robustness: what if one of the input variables has a large error? How can we get the weight spread more evenly?
From Figure 4.1 of Hastie et al.'s Statistical Learning with Sparsity.

10 Elastic Net
The elastic net combines lasso and ridge penalties:
$\hat{w} = \arg\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} (w^T x_i - y_i)^2 + \lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2$
We expect correlated random variables to have similar coefficients.
Theorem: Let $\rho_{ij} = \widehat{\mathrm{corr}}(x_i, x_j)$. Suppose the features are standardized and $\hat{w}_i$ and $\hat{w}_j$ are selected by elastic net. If $\hat{w}_i \hat{w}_j > 0$, then
$|\hat{w}_i - \hat{w}_j| \le \frac{\|y\|_2 \sqrt{2}}{\sqrt{n}\, \lambda_2} \sqrt{1 - \rho_{ij}}$.
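The objective and the bound can be tried out on simulated data. The following is a sketch under several assumptions: coordinate descent is one common way to minimize this objective (the slides do not say which solver was used), features are centered and scaled so that $\frac{1}{n}\sum_i x_{ij}^2 = 1$ and $y$ is centered, the data follow the two-group model with assumed coefficients 3, -1.5, and 2, and $\lambda_1 = 0.5$, $\lambda_2 = 1.0$ are illustrative values. The bound is checked in the form $|\hat{w}_i - \hat{w}_j| \le \frac{\|y\|_2\sqrt{2}}{\sqrt{n}\,\lambda_2}\sqrt{1-\rho_{ij}}$.

```python
import math
import random


def soft_threshold(z, t):
    """Shrink z toward zero by t, returning 0 when |z| <= t."""
    return math.copysign(max(abs(z) - t, 0.0), z)


def standardize(col):
    """Center a feature and scale it so that its mean square is 1."""
    n = len(col)
    mean = sum(col) / n
    centered = [v - mean for v in col]
    scale = math.sqrt(sum(v * v for v in centered) / n)
    return [v / scale for v in centered]


def elastic_net_cd(cols, y, lam1, lam2, sweeps=500):
    """Coordinate descent on (1/n)*sum((w.x_i - y_i)^2) + lam1*|w|_1 + lam2*|w|_2^2."""
    n, d = len(y), len(cols)
    w = [0.0] * d
    resid = list(y)  # resid = y - Xw, and w starts at zero
    for _ in range(sweeps):
        for j in range(d):
            xj = cols[j]
            # Correlation of feature j with the residual that ignores feature j.
            c = sum(x * (r + w[j] * x) for x, r in zip(xj, resid)) / n
            a = sum(x * x for x in xj) / n
            new_wj = soft_threshold(c, lam1 / 2) / (a + lam2)
            delta = new_wj - w[j]
            if delta != 0.0:
                resid = [r - delta * x for r, x in zip(resid, xj)]
                w[j] = new_wj
    return w


# Simulated data: three noisy copies each of two independent latent variables.
rng = random.Random(0)
n = 100
z1 = [rng.gauss(0, 1) for _ in range(n)]
z2 = [rng.gauss(0, 1) for _ in range(n)]
raw_cols = [[z + rng.gauss(0, 1) / 5 for z in (z1 if j < 3 else z2)] for j in range(6)]
y_raw = [3 * a - 1.5 * b + 2 * rng.gauss(0, 1) for a, b in zip(z1, z2)]
cols = [standardize(c) for c in raw_cols]
y_mean = sum(y_raw) / n
y = [v - y_mean for v in y_raw]

lam1, lam2 = 0.5, 1.0  # illustrative penalty strengths
w = elastic_net_cd(cols, y, lam1, lam2)
y_norm = math.sqrt(sum(v * v for v in y))


def bound(i, j):
    """Right-hand side of the theorem for features i and j."""
    rho = sum(a * b for a, b in zip(cols[i], cols[j])) / n
    return y_norm * math.sqrt(2) / (math.sqrt(n) * lam2) * math.sqrt(1 - rho)


same_sign_pairs = [(i, j) for i in range(6) for j in range(i + 1, 6) if w[i] * w[j] > 0]
for i, j in same_sign_pairs:
    print(f"pair ({i}, {j}): |w_i - w_j| = {abs(w[i] - w[j]):.4f}, bound = {bound(i, j):.4f}")
```

The bound tightens as $\rho_{ij}$ approaches 1 or as $\lambda_2$ grows, which is the sense in which the ridge part of the penalty pulls the coefficients of correlated features together.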

11 Elastic Net Results on Model
[Figure: lasso on left; elastic net on right.] Ratio of $\ell_2$ to $\ell_1$ regularization roughly 2 : 1.
From Figure 4.1 of Hastie et al.'s Statistical Learning with Sparsity.

12 Elastic Net vs Lasso Norm Ball
[Figure: elastic net and lasso norm balls.]
From Figure 4.2 of Hastie et al.'s Statistical Learning with Sparsity.

13 The $(\ell_q)^q$ Norm Constraint
Generalize to the $\ell_q$ norm: $(\|w\|_q)^q = |w_1|^q + |w_2|^q$.
$\mathcal{F} = \{f(x) = w_1 x_1 + w_2 x_2\}$.
Contours of $\|w\|_q^q = |w_1|^q + |w_2|^q$: [figure]
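To get a feel for how the contours change with $q$, one can evaluate $\|w\|_q^q$ at a fixed point for several values of $q$. This is a small sketch; the function name, the test point (0.5, 0.5), and the chosen values of $q$ are illustrative assumptions.

```python
import math


def lq_to_the_q(w, q):
    """Return |w_1|^q + |w_2|^q + ..., the q-th power of the l_q norm of w."""
    return sum(abs(wi) ** q for wi in w)


point = (0.5, 0.5)
values = {q: lq_to_the_q(point, q) for q in (0.5, 1, 1.2, 2, 4)}
for q, value in values.items():
    # A value of 1 means the point lies exactly on the unit contour for that q;
    # below 1 it is inside the contour, above 1 it is outside.
    print(f"q = {q}: ||w||_q^q = {value:.4f}")
```

The point (0.5, 0.5) sits exactly on the unit contour for $q = 1$, inside it for $q > 1$, and outside it for $q < 1$, which matches the way the contours bulge outward along the diagonal as $q$ increases.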

14 $\ell_{1.2}$ vs Elastic Net
[Figure: $\ell_{1.2}$ and elastic net contours.]
From Hastie et al.'s Elements of Statistical Learning.
