Lasso Regression: Regularization for feature selection

Size: px

Start display at page:

Download "Lasso Regression: Regularization for feature selection"

Denis Phelps
5 years ago
Views:

1 Lasso Regression: Regularization for feature selection Emily Fox University of Washington January 18, Feature selection task 2 1

2 Why might you want to perform feature selection? Efficiency: - If size(w) = 100B, each prediction is expensive - If sparse, computation only depends on # of non-zeros many zeros = i = j h j (x i ) Interpretability: - Which features are relevant for prediction? 3 3 Sparsity: Housing application 4 $? Lot size Single Family Year built Last sold price Last sale price/sqft Finished sqft Unfinished sqft Finished basement sqft # floors Flooring types Parking type Parking amount Cooling Heating Exterior materials Roof type Structure style Dishwasher Garbage disposal Microwave Range / Oven Refrigerator Washer Dryer Laundry location Heating type Jetted Tub Deck Fenced Yard Lawn Garden Sprinkler System 2

3 Option 1: All subsets or greedy variants 5 Exhaustive approach: all subsets Consider all possible models, each using a subset of features How many models were evaluated?each indexed by features included 6 y i = ε i [ ] y i = w 0 h 0 (x i ) + ε i [ ] y i = w 1 h 1 (x i ) + ε i [ ] y i = w 0 h 0 (x i ) + w 1 h 1 (x i ) + ε i y i = w 0 h 0 (x i ) + w 1 h 1 (x i ) + + w D h D (x i )+ ε i [ ] [ ] 2 D 2 8 = = 1,073,741, = x B = HUGE!!!!!! Typically, computationally infeasible 3

4 Choosing model complexity? Option 1: Assess on validation set Option 2: Cross validation Option 3+: Other metrics for penalizing model complexity like BIC 7 Greedy algorithms Forward stepwise: Starting from simple model and iteratively add features most useful to fit Backward stepwise: Start with full model and iteratively remove features least useful to fit Combining forward and backward steps: In forward algorithm, insert steps to remove features no longer as important Lots of other variants, too

5 Option 2: Regularize 9 Ridge regression: L 2 regularized regression Total cost = measure of fit + λ measure of magnitude of coefficients RSS(w) 2 w 2 =w w 2 D Encourages small weights but not exactly

6 Coefficient path ridge coefficients j 11 λ Using regularization for feature selection Instead of searching over a discrete set of solutions, can we use regularization? - Start with full model (all possible features) - Shrink some coefficients exactly to 0 i.e., knock out certain features - Non-zero coefficients indicate selected features 12 6

7 Thresholding ridge coefficients? Why don t we just set small ridge coefficients to 0? 0 13 Thresholding ridge coefficients? Selected features for a given threshold value

8 Thresholding ridge coefficients? Let s look at two related features 0 Nothing measuring bathrooms was included! 15 Thresholding ridge coefficients? If only one of the features had been included

9 Thresholding ridge coefficients? Would have included bathrooms in selected model 0 Can regularization lead directly to sparsity? 17 Try this cost instead of ridge Total cost = measure of fit + λ measure of magnitude of coefficients RSS(w) w 1 = w w D Lasso regression (a.k.a. L 1 regularized regression) Leads to sparse solutions! 18 9

10 Lasso regression: L 1 regularized regression Just like ridge regression, solution is governed by a continuous parameter λ If λ=0: If λ= : RSS(w) + λ w 1 tuning parameter = balance of fit and sparsity If λ in between: 19 Coefficient path ridge coefficients j 20 λ 10

11 Coefficient path lasso coefficients j 21 λ Fitting the lasso regression model (for given λ value) 22 11

12 How we optimized past objectives To solve for, previously took gradient of total cost objective and either: 1) Derived closed-form solution 2) Used in gradient descent algorithm 23 Optimizing the lasso objective Lasso total cost: RSS(w) + λ w 1 Issues: 1) What s the derivative of w j? 2) Even if we could compute derivative, no closed-form solution can use subgradient descent gradients subgradients 24 12

13 Aside 1: Coordinate descent 25 Coordinate descent Goal: Minimize some function g Often, hard to find minimum for all coordinates, but easy for each coordinate Coordinate descent: Initialize = 0 (or smartly ) while not converged pick a coordinate j j 26 13

14 Comments on coordinate descent How do we pick next coordinate? - At random ( random or stochastic coordinate descent), round robin, No stepsize to choose! Super useful approach for many problems - Converges to optimum in some cases (e.g., strongly convex ) - Converges for lasso objective 27 Aside 2: Normalizing features 28 14

15 Normalizing features Scale training columns (not rows!) as: h j (x k ) = h j (x k ) h j (x i ) 2 Normalizer: z j Training features Apply same training scale factors to test data: apply to test point h j (x k ) = h j (x k ) h j (x i ) 2 Normalizer: z j summing over training points Test features 29 Aside 3: Coordinate descent for unregularized regression (for normalized features) 30 15

16 Optimizing least squares objective one coordinate at a time RSS(w) = (y i - w j h j (x i )) 2 Fix all coordinates w -j and take partial w.r.t. w j w j RSS(w) = -2 h j (x i )(y i w j h j (x i )) 31 Optimizing least squares objective one coordinate at a time RSS(w) = (y i - w j h j (x i )) 2 Set partial = 0 and solve w j RSS(w) = -2ρ j + 2w j =

17 Coordinate descent for least squares regression Initialize = 0 (or smartly ) while not converged for j=0,1,,d residual without feature j compute: ρ j = h j (x i )(y i i ( -j)) set: j = ρ j prediction without feature j 33 Coordinate descent for lasso (for normalized features) 34 17

18 Coordinate descent for least squares regression Initialize = 0 (or smartly ) while not converged for j=0,1,,d residual without feature j compute: ρ j = h j (x i )(y i i ( -j)) set: j = ρ j prediction without feature j 35 Coordinate descent for lasso Initialize = 0 (or smartly ) while not converged for j=0,1,,d compute: ρ j = h j (x i )(y i i ( -j)) set: j = ρ j + λ/2 if ρ j < -λ/2 0 if ρ j in [-λ/2, λ/2] ρ j λ/2 if ρ j > λ/

19 Soft thresholding ρ j + λ/2 if ρ j < -λ/2 j = 0 if ρ j in [-λ/2, λ/2] ρ j λ/2 if ρ j > λ/2 j ρ j 37 How to assess convergence? Initialize = 0 (or smartly ) while not converged for j=0,1,,d compute: ρ j = h j (x i )(y i i ( -j)) set: j = ρ j + λ/2 if ρ j < -λ/2 0 if ρ j in [-λ/2, λ/2] ρ j λ/2 if ρ j > λ/

20 Convergence criteria When to stop? For convex problems, will start to take smaller and smaller steps Measure size of steps taken in a full loop over all features - stop when max step < ε 39 Other lasso solvers Classically: Least angle regression (LARS) [Efron et al. 04] Then: Coordinate descent algorithm [Fu 98, Friedman, Hastie, & Tibshirani 08] Now: Parallel CD (e.g., Shotgun, [Bradley et al. 11]) Other parallel learning approaches for linear models - Parallel stochastic gradient descent (SGD) (e.g., Hogwild! [Niu et al. 11]) - Parallel independent solutions then averaging [Zhang et al. 12] Alternating directions method of multipliers (ADMM) [Boyd et al. 11] 40 20

21 Coordinate descent for lasso (for unnormalized features) 41 Coordinate descent for lasso with normalized features Initialize = 0 (or smartly ) while not converged for j=0,1,,d compute: ρ j = h j (x i )(y i i ( -j)) set: j = ρ j + λ/2 if ρ j < -λ/2 0 if ρ j in [-λ/2, λ/2] ρ j λ/2 if ρ j > λ/

22 Coordinate descent for lasso with unnormalized features Precompute: z j = h j (x i ) 2 Initialize = 0 (or smartly ) while not converged for j=0,1,,d compute: ρ j = h j (x i )(y i i ( -j)) set: j = (ρ j + λ/2)/z j if ρ j < -λ/2 0 if ρ j in [-λ/2, λ/2] (ρ j λ/2)/z j if ρ j > λ/2 43 How to choose λ 44 22

23 If sufficient amount of data Training set Validation set Test set fit λ test performance of λ to select λ * assess generalization error of λ* 45 Summary for feature selection and lasso regression 46 23

24 Impact of feature selection and lasso Lasso has changed machine learning, statistics, & electrical engineering But, for feature selection in general, be careful about interpreting selected features - selection only considers features included - sensitive to correlations between features - result depends on algorithm used - there are theoretical guarantees for lasso under certain conditions 47 What you can do now Describe all subsets and greedy variants for feature selection Analyze computational costs of these algorithms Formulate lasso objective Describe what happens to estimated lasso coefficients as tuning parameter λ is varied Interpret lasso coefficient path plot Contrast ridge and lasso regression Estimate lasso regression parameters using an iterative coordinate descent algorithm 48 24

25 Deriving the lasso coordinate descent update 49 Optimizing lasso objective one coordinate at a time RSS(w) + λ w 1 = (y i - w j h j (x i )) 2 + λ w j Fix all coordinates w -j and take partial w.r.t. w j derive without normalizing features 50 25

26 Part 1: Partial of RSS term RSS(w) + λ w 1 = (y i - w j h j (x i )) 2 + λ w j w j RSS(w) = -2 h j (x i )(y i w j h j (x i )) 51 Part 2: Partial of L 1 penalty term RSS(w) + λ w 1 = (y i - w j h j (x i )) 2 + λ w j λ w j =??? w j x x 52 26

27 Subgradients of convex functions Gradients lower bound convex functions: g(x) a b x unique at x if function differentiable at x Subgradients: Generalize gradients to non-differentiable points: - Any plane that lower bounds function x x 53 Part 2: Subgradient of L 1 term RSS(w) + λ w 1 = (y i - w j h j (x i )) 2 + λ w j λ w j = w j -λ when w j < 0 [-λ,λ] when w j = 0 λ when w j > 0 w j w j 54 27

28 Putting it all together RSS(w) + λ w 1 = (y i - w j h j (x i )) 2 + λ w j -λ when w j < 0 w j [lasso cost] = 2z j w j 2ρ j + [-λ, λ] when w j = 0 λ when w j > 0 = 2z j w j 2ρ j λ when w j < 0 [-2ρ j -λ, -2ρ j +λ] when w j = 0 2z j w j 2ρ j + λ when w j > 0 55 Optimal solution: Set subgradient = 0 [lasso cost] = [-2ρ j -λ, -2ρ j +λ] when w j = 0 = 0 w j 2z j w j 2ρ j λ when w j < 0 2z j w j 2ρ j + λ when w j > 0 Case 1 (w j < 0): 2z j j 2ρ j λ = 0 For j < 0, need Case 2 (w j = 0): j = 0 For j = 0, need [-2ρ j -λ, -2ρ j +λ] to contain 0: Case 3 (w j > 0): 2z j j 2ρ j + λ = 0 For j > 0, need 56 28

29 Optimal solution: Set subgradient = 0 [lasso cost] = w j = 0 2z j w j 2ρ j λ when w j < 0 [-2ρ j -λ, -2ρ j +λ] when w j = 0 2z j w j 2ρ j + λ when w j > 0 j = (ρ j + λ/2)/z j if ρ j < -λ/2 0 if ρ j in [-λ/2, λ/2] (ρ j λ/2)/z j if ρ j > λ/2 57 Soft thresholding j = (ρ j + λ/2)/z j if ρ j < -λ/2 0 if ρ j in [-λ/2, λ/2] (ρ j λ/2)/z j if ρ j > λ/2 j ρ j 58 29

30 Coordinate descent for lasso Precompute: z j = h j (x i ) 2 Initialize = 0 (or smartly ) while not converged for j=0,1,,d compute: ρ j = h j (x i )(y i i ( -j)) set: j = (ρ j + λ/2)/z j if ρ j < -λ/2 0 if ρ j in [-λ/2, λ/2] (ρ j λ/2)/z j if ρ j > λ/

Lasso Regression: Regularization for feature selection

Lasso Regression: Regularization for feature selection Emily Fox University of Washington January 18, 2017 Feature selection task 1 Why might you want to perform feature selection? Efficiency: - If size(w)