Is the test error unbiased for these programs? 2017 Kevin Jamieson

1 Is the test error unbiased for these programs?

2 Is the test error unbiased for this program?

3 Simple Variable Selection
LASSO: Sparse Regression
Machine Learning CSE546, Kevin Jamieson, University of Washington, October 9,

6 Sparsity
$\hat{w}_{LS} = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2$
Vector $w$ is sparse if many entries are zero. Very useful for many tasks, e.g.:
- Efficiency: if size($w$) = 100 billion, each prediction is expensive. If part of an online system, too slow. If $w$ is sparse, prediction computation only depends on the number of non-zeros.
- Interpretability: what are the relevant dimensions to make a prediction? E.g., what are the parts of the brain associated with particular words? (Figure from Tom Mitchell)
How do we find the best subset among all possible?
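The efficiency point can be illustrated with a minimal sketch, assuming the non-zero weights are stored in a dictionary keyed by feature index. The storage format and the names `to_sparse` and `sparse_predict` are illustrative choices, not something the slides prescribe.

```python
def to_sparse(w, tol=0.0):
    """Keep only the entries of w whose magnitude exceeds tol (hypothetical helper)."""
    return {j: wj for j, wj in enumerate(w) if abs(wj) > tol}


def sparse_predict(w_nonzero, x):
    """Compute x^T w while touching only the non-zero weights."""
    return sum(weight * x[j] for j, weight in w_nonzero.items())


w = [0.0, 2.0, 0.0, 0.0, -1.5, 0.0]
x = [3.0, 1.0, 4.0, 1.0, 2.0, 9.0]

w_sparse = to_sparse(w)                          # two stored entries instead of six
dense_prediction = sum(wj * xj for wj, xj in zip(w, x))
sparse_prediction = sparse_predict(w_sparse, x)  # same value, two multiplications
```

The work per prediction scales with `len(w_sparse)` rather than with the full dimension of `w`.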

7 Greedy model selection algorithm
Pick a dictionary of features, e.g., cosines of random inner products.
Greedy heuristic:
- Start from an empty (or simple) set of features, $F_0 = \emptyset$
- Run the learning algorithm for the current set of features $F_t$; obtain weights for these features
- Select the next best feature $h_i(x)^*$, e.g., the $h_j(x)$ that results in the lowest training error of the learner when using $F_t + \{h_j(x)\}$
- $F_{t+1} \leftarrow F_t + \{h_i(x)^*\}$
- Recurse
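The greedy heuristic above can be sketched as follows, assuming the learner is ordinary least squares and the dictionary is simply the set of columns of a small data matrix. The helper names (`solve_linear`, `fit_least_squares`, `greedy_select`) and the toy data are illustrative assumptions, and the stopping rule here is just a fixed number of features.

```python
def solve_linear(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][c] * z[c] for c in range(r + 1, n))) / M[r][r]
    return z


def fit_least_squares(X, y, features):
    """Least-squares weights using only the listed feature columns (normal equations)."""
    gram = [[sum(row[a] * row[b] for row in X) for b in features] for a in features]
    rhs = [sum(row[a] * yi for row, yi in zip(X, y)) for a in features]
    return solve_linear(gram, rhs)


def training_error(X, y, features, weights):
    """Sum of squared residuals on the training set."""
    return sum(
        (yi - sum(w * row[f] for f, w in zip(features, weights))) ** 2
        for row, yi in zip(X, y)
    )


def greedy_select(X, y, num_features):
    """Add, one at a time, the feature that gives the lowest training error."""
    selected = []                                  # F_0 = empty set
    candidates = set(range(len(X[0])))
    for _ in range(num_features):
        best = None
        for j in sorted(candidates):
            trial = selected + [j]
            err = training_error(X, y, trial, fit_least_squares(X, y, trial))
            if best is None or err < best[0]:
                best = (err, j)
        selected.append(best[1])                   # F_{t+1} <- F_t + {h_j}
        candidates.remove(best[1])
    return selected


# Toy data in which y depends only on feature 1.
X = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 1.0, 0.0], [2.0, 0.0, 1.0], [0.0, 2.0, 3.0]]
y = [3.0 * row[1] for row in X]
first_pick = greedy_select(X, y, 1)
```

Each round refits the learner once per remaining candidate, which is why this is a heuristic for a problem whose exact solution (searching all subsets) is intractable.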

8 Greedy model selection
Applicable in many other settings (considered later in the course):
- Logistic regression: selecting features (basis functions)
- Naïve Bayes: selecting (independent) features $P(X_i \mid Y)$
- Decision trees: selecting leaves to expand
Only a heuristic! Finding the best set of $k$ features is computationally intractable!
Sometimes you can prove something strong about it.

9 When do we stop???
Greedy heuristic:
- Select the next best feature $X_i^*$, e.g., the $h_j(x)$ that results in the lowest training error of the learner when using $F_t + \{h_j(x)\}$
- Recurse
When do you stop???
- When training error is low enough?
- When test set error is low enough?
- Using cross validation?
Is there a more principled approach?

10 Recall Ridge Regression
Ridge Regression objective: $\hat{w}_{ridge} = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2 + \lambda \|w\|_2^2$

11 Ridge vs. Lasso Regression
Ridge Regression objective: $\hat{w}_{ridge} = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2 + \lambda \|w\|_2^2$
Lasso objective: $\hat{w}_{lasso} = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2 + \lambda \|w\|_1$

13 Penalized Least Squares
Ridge: $r(w) = \|w\|_2^2$
Lasso: $r(w) = \|w\|_1$
$\hat{w}_r = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2 + \lambda r(w)$
For any $\lambda \ge 0$ for which $\hat{w}_r$ achieves the minimum, there exists a $\nu \ge 0$ such that
$\hat{w}_r = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2$ subject to $r(w) \le \nu$
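The equivalence between the penalized and the constrained form can be checked numerically on a one-dimensional toy problem. This is only a sketch under stated assumptions: a single feature, a brute-force grid search instead of a real solver, and the symbol `nu` for the constraint level, chosen here as $r(\hat{w}_r)$.

```python
def squared_loss(w, xs, ys):
    """Sum of squared residuals for a single-feature linear model."""
    return sum((y - x * w) ** 2 for x, y in zip(xs, ys))


def ridge_penalty(w):
    return w ** 2


def lasso_penalty(w):
    return abs(w)


def penalized_minimizer(xs, ys, lam, penalty, grid):
    """Grid point minimizing loss + lam * r(w)."""
    return min(grid, key=lambda w: squared_loss(w, xs, ys) + lam * penalty(w))


def constrained_minimizer(xs, ys, nu, penalty, grid):
    """Grid point minimizing the loss among points with r(w) <= nu."""
    feasible = [w for w in grid if penalty(w) <= nu]
    return min(feasible, key=lambda w: squared_loss(w, xs, ys))


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.1, 5.9]
grid = [i / 1000 for i in range(-3000, 3001)]
lam = 4.0

results = {}
for name, penalty in (("ridge", ridge_penalty), ("lasso", lasso_penalty)):
    w_pen = penalized_minimizer(xs, ys, lam, penalty, grid)
    nu = penalty(w_pen)                      # constraint level matched to the penalized solution
    w_con = constrained_minimizer(xs, ys, nu, penalty, grid)
    results[name] = (w_pen, w_con)
```

For both penalties the two searches land on the same grid point, which is the statement of the slide restricted to this toy case.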

17 Optimizing the LASSO Objective
LASSO solution:
$(\hat{w}_{lasso}, \hat{b}_{lasso}) = \arg\min_{w,b} \sum_{i=1}^n (y_i - (x_i^T w + b))^2 + \lambda \|w\|_1$
$\hat{b}_{lasso} = \frac{1}{n} \sum_{i=1}^n (y_i - x_i^T \hat{w}_{lasso})$
So as usual, preprocess to make sure that $\frac{1}{n} \sum_{i=1}^n y_i = 0$ and $\frac{1}{n} \sum_{i=1}^n x_i = 0$, so we don't have to worry about an offset.
$\hat{w}_{lasso} = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2 + \lambda \|w\|_1$
How do we solve this?
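The preprocessing step can be sketched as below, assuming the data are stored as plain lists of rows. The function names `center` and `recover_offset` and the example weights are illustrative; the offset formula is the one stated above for $\hat{b}_{lasso}$.

```python
def center(X, y):
    """Subtract the per-feature means from X and the mean from y."""
    n, d = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(d)]
    y_mean = sum(y) / n
    X_centered = [[row[j] - x_mean[j] for j in range(d)] for row in X]
    y_centered = [yi - y_mean for yi in y]
    return X_centered, y_centered, x_mean, y_mean


def recover_offset(w, x_mean, y_mean):
    """b = mean(y) - mean(x)^T w, i.e. (1/n) * sum_i (y_i - x_i^T w)."""
    return y_mean - sum(wj * mj for wj, mj in zip(w, x_mean))


X = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]
y = [1.0, 2.0, 6.0]
X_centered, y_centered, x_mean, y_mean = center(X, y)

w_example = [0.5, 0.25]          # stand-in weights, as if fitted on the centered data
b_example = recover_offset(w_example, x_mean, y_mean)
```

After centering, the weights can be fitted without an offset term and the offset recovered afterwards from the stored means.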

18 Coordinate Descent
Given a function, we want to find its minimum.
Often, it is easy to find the minimum along a single coordinate.
How do we pick the next coordinate?
Super useful approach for *many* problems.
Converges to the optimum in some cases, such as LASSO.

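20 Optimizing LASSO Objective One Coordinate at a Time
Fix any $j \in \{1,\dots,d\}$:
$\sum_{i=1}^n (y_i - x_i^T w)^2 + \lambda \|w\|_1 = \sum_{i=1}^n \left(y_i - \sum_{k=1}^d x_{i,k} w_k\right)^2 + \lambda \sum_{k=1}^d |w_k|$
$= \sum_{i=1}^n \left(\left(y_i - \sum_{k \ne j} x_{i,k} w_k\right) - x_{i,j} w_j\right)^2 + \lambda \sum_{k \ne j} |w_k| + \lambda |w_j|$
Initialize $\hat{w}_k = 0$ for all $k \in \{1,\dots,d\}$.
Loop over $j \in \{1,\dots,d\}$:
$r_i^{(j)} = y_i - \sum_{k \ne j} x_{i,k} \hat{w}_k$
$\hat{w}_j = \arg\min_{w_j} \sum_{i=1}^n (r_i^{(j)} - x_{i,j} w_j)^2 + \lambda |w_j|$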

21 Convex Functions
Equivalent definitions of convexity. $f$ is convex if:
- $f(\lambda x + (1-\lambda) y) \le \lambda f(x) + (1-\lambda) f(y)$ for all $x, y$ and $\lambda \in [0, 1]$
- $f(y) \ge f(x) + \nabla f(x)^T (y - x)$ for all $x, y$
Gradients lower bound convex functions and are unique at $x$ iff the function is differentiable at $x$.
Subgradients generalize gradients to non-differentiable points: any supporting hyperplane at $x$ that lower bounds the entire function.
$g$ is a subgradient at $x$ if $f(y) \ge f(x) + g^T (y - x)$ for all $y$.
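As a small illustration of the subgradient definition, the sketch below tests the inequality $f(y) \ge f(x) + g (y - x)$ for the scalar function $f(w) = |w|$ on a finite set of points. Checking finitely many points can only refute a candidate subgradient, not prove it, so treat this as a sanity check; the helper name `is_subgradient` is an illustrative choice.

```python
def is_subgradient(f, g, x, test_points, tol=1e-12):
    """Check f(y) >= f(x) + g * (y - x) on every test point (scalar case)."""
    return all(f(y) >= f(x) + g * (y - x) - tol for y in test_points)


points = [i / 10 for i in range(-50, 51)]

# At the kink x = 0, every slope in [-1, 1] supports |w| from below.
inside_interval = is_subgradient(abs, 0.3, 0.0, points)
endpoint = is_subgradient(abs, -1.0, 0.0, points)
outside_interval = is_subgradient(abs, 1.5, 0.0, points)

# Away from the kink the function is differentiable and only the gradient works.
gradient_at_two = is_subgradient(abs, 1.0, 2.0, points)
wrong_slope_at_two = is_subgradient(abs, 0.5, 2.0, points)
```

The first, second, and fourth checks pass while the other two fail, matching the interval $[-1, 1]$ at zero and the unique gradient elsewhere.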

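22 Taking the Subgradient
$\hat{w}_j = \arg\min_{w_j} \sum_{i=1}^n (r_i^{(j)} - x_{i,j} w_j)^2 + \lambda |w_j|$
$g$ is a subgradient at $x$ if $f(y) \ge f(x) + g^T (y - x)$.
A convex function is minimized at $w$ if 0 is a sub-gradient at $w$.
$\partial_{w_j} |w_j| = \begin{cases} -1 & \text{if } w_j < 0 \\ [-1, 1] & \text{if } w_j = 0 \\ 1 & \text{if } w_j > 0 \end{cases}$
$\partial_{w_j} \sum_{i=1}^n (r_i^{(j)} - x_{i,j} w_j)^2 = -2 \sum_{i=1}^n x_{i,j} (r_i^{(j)} - x_{i,j} w_j)$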

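24 Setting Subgradient to 0
$\partial_{w_j} \left( \sum_{i=1}^n (r_i^{(j)} - x_{i,j} w_j)^2 + \lambda |w_j| \right) = \begin{cases} a_j w_j - c_j - \lambda & \text{if } w_j < 0 \\ [-c_j - \lambda, -c_j + \lambda] & \text{if } w_j = 0 \\ a_j w_j - c_j + \lambda & \text{if } w_j > 0 \end{cases}$
where $a_j = 2 \sum_{i=1}^n x_{i,j}^2$ and $c_j = 2 \sum_{i=1}^n r_i^{(j)} x_{i,j}$.
$\hat{w}_j = \arg\min_{w_j} \sum_{i=1}^n (r_i^{(j)} - x_{i,j} w_j)^2 + \lambda |w_j|$
$w$ is a minimum if 0 is a sub-gradient at $w$:
$\hat{w}_j = \begin{cases} (c_j + \lambda)/a_j & \text{if } c_j < -\lambda \\ 0 & \text{if } |c_j| \le \lambda \\ (c_j - \lambda)/a_j & \text{if } c_j > \lambda \end{cases}$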

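25 Soft Thresholding
$\hat{w}_j = \begin{cases} (c_j + \lambda)/a_j & \text{if } c_j < -\lambda \\ 0 & \text{if } |c_j| \le \lambda \\ (c_j - \lambda)/a_j & \text{if } c_j > \lambda \end{cases}$
$a_j = 2 \sum_{i=1}^n x_{i,j}^2$
$c_j = 2 \sum_{i=1}^n \left( y_i - \sum_{k \ne j} x_{i,k} w_k \right) x_{i,j}$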
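The soft-thresholding rule translates directly into a three-branch function. This is a minimal sketch in which `c`, `a`, and `lam` stand for $c_j$, $a_j$, and $\lambda$, with $a_j$ taken as $2 \sum_i x_{i,j}^2$ so that the derivative of the squared term is $a_j w_j - c_j$; `a` is assumed to be positive.

```python
def soft_threshold(c, a, lam):
    """Minimizer over w of (a / 2) * w**2 - c * w + lam * abs(w), for a > 0."""
    if c < -lam:
        return (c + lam) / a
    if c > lam:
        return (c - lam) / a
    return 0.0          # |c| <= lam: the coordinate is set exactly to zero


shrunk_positive = soft_threshold(5.0, 2.0, 1.0)    # (5 - 1) / 2
shrunk_negative = soft_threshold(-5.0, 2.0, 1.0)   # (-5 + 1) / 2
zeroed = soft_threshold(0.5, 2.0, 1.0)             # inside the dead zone [-lam, lam]
```

The middle branch is what produces exact zeros, and therefore sparsity; with `lam = 0` the rule reduces to the plain least-squares coordinate update `c / a`.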

26 Coordinate Descent for LASSO (aka Shooting Algorithm)
Repeat until convergence (initialize $w = 0$):
- Pick a coordinate $j$ (at random or sequentially)
- Set: $\hat{w}_j = \begin{cases} (c_j + \lambda)/a_j & \text{if } c_j < -\lambda \\ 0 & \text{if } |c_j| \le \lambda \\ (c_j - \lambda)/a_j & \text{if } c_j > \lambda \end{cases}$
- Where: $a_j = 2 \sum_{i=1}^n x_{i,j}^2$ and $c_j = 2 \sum_{i=1}^n \left( y_i - \sum_{k \ne j} x_{i,k} \hat{w}_k \right) x_{i,j}$
For convergence rates, see Shalev-Shwartz and Tewari 2009.
Other common technique = LARS: least angle regression and shrinkage, Efron et al.
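The shooting algorithm can be sketched in a few lines of plain Python. Assumptions in this sketch: coordinates are visited sequentially, $a_j = 2 \sum_i x_{i,j}^2$ (so the smooth part has derivative $a_j w_j - c_j$), convergence is declared when no coordinate moves by more than `tol` in a full sweep, and the toy data are chosen for easy arithmetic rather than centered. The function names and the `max_sweeps` and `tol` parameters are illustrative.

```python
def soft_threshold(c, a, lam):
    """Closed-form minimizer of the one-coordinate LASSO subproblem."""
    if c < -lam:
        return (c + lam) / a
    if c > lam:
        return (c - lam) / a
    return 0.0


def lasso_coordinate_descent(X, y, lam, max_sweeps=1000, tol=1e-10):
    """Minimize sum_i (y_i - x_i^T w)^2 + lam * ||w||_1 by cyclic coordinate descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d                                            # initialize w = 0
    a = [2.0 * sum(X[i][j] ** 2 for i in range(n)) for j in range(d)]
    for _ in range(max_sweeps):
        largest_change = 0.0
        for j in range(d):                                   # sequential coordinate choice
            if a[j] == 0.0:
                continue                                     # all-zero column: leave w_j at 0
            # c_j = 2 * sum_i r_i^(j) * x_ij, with the residual excluding coordinate j
            c_j = 2.0 * sum(
                (y[i] - sum(X[i][k] * w[k] for k in range(d) if k != j)) * X[i][j]
                for i in range(n)
            )
            new_wj = soft_threshold(c_j, a[j], lam)
            largest_change = max(largest_change, abs(new_wj - w[j]))
            w[j] = new_wj
        if largest_change < tol:
            break
    return w


# Toy problem with orthogonal columns, so each coordinate can be checked by hand.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
y = [3.0, 0.1, 3.0, 0.1]
w_lasso = lasso_coordinate_descent(X, y, lam=1.0)
w_least_squares = lasso_coordinate_descent(X, y, lam=0.0)
```

Recomputing the residual from scratch for every coordinate keeps the code close to the formulas; a faster variant would maintain the residual vector and update it incrementally after each coordinate change.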

27 Recall: Ridge Coefficient Path
[Figure: ridge coefficient path, from Kevin Murphy textbook]
Typical approach: select $\lambda$ using cross validation
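Selecting $\lambda$ by cross validation can be sketched on a one-feature ridge problem, where the fit has the closed form $w = \sum_i x_i y_i / (\sum_i x_i^2 + \lambda)$ when there is no offset. The fold assignment (every `k`-th point), the candidate grid, and the function names are assumptions made for this illustration.

```python
def ridge_1d(xs, ys, lam):
    """Closed-form ridge weight for a single feature and no offset."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def cv_error(xs, ys, lam, k):
    """Average squared validation error over k folds (fold f holds indices f, f+k, ...)."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        val_idx = set(range(fold, n, k))
        train_x = [xs[i] for i in range(n) if i not in val_idx]
        train_y = [ys[i] for i in range(n) if i not in val_idx]
        w = ridge_1d(train_x, train_y, lam)              # fit on the training folds only
        total += sum((ys[i] - xs[i] * w) ** 2 for i in val_idx)
    return total / n


def select_lambda(xs, ys, lambdas, k=3):
    """Return the candidate with the lowest cross-validation error."""
    return min(lambdas, key=lambda lam: cv_error(xs, ys, lam, k))


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0 * x for x in xs]            # noiseless line, so no shrinkage should be preferred
best_lambda = select_lambda(xs, ys, [0.0, 1.0, 10.0])
```

On real, noisy data the selected value is typically strictly positive; the noiseless toy data here just make the outcome easy to verify.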

28 Now: LASSO Coefficient Path
[Figure: LASSO coefficient path, from Kevin Murphy textbook]
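A coefficient path like the one in the figure can be traced by solving the LASSO for a decreasing sequence of $\lambda$ values. The sketch below assumes the unnormalized objective $\sum_i (y_i - x_i^T w)^2 + \lambda \|w\|_1$, a fixed number of coordinate-descent sweeps, and warm starts (reusing the previous solution as the starting point), which is a common practice rather than something the slides specify. `lambda_max` follows from the soft-thresholding rule: at $w = 0$ every coordinate stays at zero once $\lambda \ge \max_j |c_j|$.

```python
def lasso_cd(X, y, lam, w_init=None, sweeps=200):
    """Cyclic coordinate descent with soft thresholding, for a fixed number of sweeps."""
    n, d = len(X), len(X[0])
    w = list(w_init) if w_init is not None else [0.0] * d
    for _ in range(sweeps):
        for j in range(d):
            a_j = 2.0 * sum(X[i][j] ** 2 for i in range(n))
            c_j = 2.0 * sum(
                X[i][j] * (y[i] - sum(X[i][k] * w[k] for k in range(d) if k != j))
                for i in range(n)
            )
            if c_j < -lam:
                w[j] = (c_j + lam) / a_j
            elif c_j > lam:
                w[j] = (c_j - lam) / a_j
            else:
                w[j] = 0.0
    return w


def lambda_max(X, y):
    """Smallest lam for which w = 0 is a fixed point: max_j |2 * sum_i x_ij * y_i|."""
    n, d = len(X), len(X[0])
    return max(abs(2.0 * sum(X[i][j] * y[i] for i in range(n))) for j in range(d))


def lasso_path(X, y, lambdas):
    """Solve from the largest to the smallest lam, warm-starting each solve."""
    path = []
    w = None
    for lam in sorted(lambdas, reverse=True):
        w = lasso_cd(X, y, lam, w_init=w)
        path.append((lam, list(w)))
    return path


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
y = [3.0, 0.1, 3.0, 0.1]
path = lasso_path(X, y, [lambda_max(X, y), 5.0, 0.2])
```

Plotting each coordinate of `w` against `lam` gives the path: on this toy data the coefficients leave zero one at a time as `lam` decreases.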

29 What you need to know
- Variable selection: find a sparse solution to the learning problem
- $L_1$ regularization is one way to do variable selection; it applies beyond regression, and there are hundreds of other approaches out there
- LASSO objective is non-differentiable, but convex: use the subgradient
- No closed-form solution for the minimization: use coordinate descent
- Shooting algorithm is a simple approach for solving LASSO

Is the test error unbiased for these programs? [Handwritten slide annotation, partly illegible: preprocessing by de-meaning using the whole TEST set.]