Is the test error unbiased for these programs?

Size: px

Start display at page:

Download "Is the test error unbiased for these programs?"

Geoffrey Riley
5 years ago
Views:

1 Is the test error unbiased for these programs? Xtrain avg N o Preprocessing by de meaning using whole TEST set 2017 Kevin Jamieson 1

2 Is the test error unbiased for this program? e Stott see non for f x x µ c xtw peter 1C annotated slides correct example b c III yi 2017 Kevin Jamieson 2

3 Simple Variable Selection LASSO: Sparse Regression Machine Learning CSE546 Kevin Jamieson University of Washington October 9, Kevin Jamieson 3

4 Sparsity bw LS = arg min w y i x T i w 2 Vector w is sparse, if many entries are zero Very useful for many tasks, e.g., Efficiency: If size(w) = 100 Billion, each prediction is expensive: If part of an online system, too slow If w is sparse, prediction computation only depends on number of non-zeros 2018 Kevin Jamieson 4

5 Sparsity bw LS = arg min w y i x T i w 2 Vector w is sparse, if many entries are zero Very useful for many tasks, e.g., Efficiency: If size(w) = 100 Billion, each prediction is expensive: If part of an online system, too slow If w is sparse, prediction computation only depends on number of non-zeros Interpretability: What are the relevant dimension to make a prediction? E.g., what are the parts of the brain associated with particular words? 2018 Kevin Jamieson 5Figure from Tom Mitchell

6 Sparsity bw LS = arg min w y i x T i w 2 Vector w is sparse, if many entries are zero Very useful for many tasks, e.g., Efficiency: If size(w) = 100 Billion, each prediction is expensive: If part of an online system, too slow If w is sparse, prediction computation only depends on number of non-zeros Interpretability: What are the relevant dimension to make a prediction? E.g., what are the parts of the brain associated with particular words? How do we find best subset among all possible? 2018 Kevin Jamieson 6Figure from Tom Mitchell

7 Greedy model selection algorithm Pick a dictionary of features e.g., cosines of random inner products Greedy heuristic: Start from empty (or simple) set of features F 0 = Run learning algorithm for current set of features F t Obtain weights for these features Select next best feature h i (x) * e.g., h j (x) that results in lowest training error learner when using F t + {h j (x) * } F t+1! F t + {h i (x) * } Recurse 2018 Kevin Jamieson 7

8 Greedy model selection Applicable in many other settings: Considered later in the course: Logistic regression: Selecting features (basis functions) Naïve Bayes: Selecting (independent) features P(X i Y) Decision trees: Selecting leaves to expand Only a heuristic! Finding the best set of k features is computationally intractable! Sometimes you can prove something strong about it 2018 Kevin Jamieson 8

9 When do we stop??? Greedy heuristic: Select next best feature X i * E.g. h j (x) that results in lowest training error learner when using F t + {h j (x) * } Recurse When do you stop??? When training error is low enough? When test set error is low enough? Using cross validation? Is there a more principled approach? 2018 Kevin Jamieson 9

10 Recall Ridge Regression Ridge Regression objective: bw ridge = arg min w = y i x T i w 2 + w Kevin Jamieson 10

11 Ridge vs. Lasso Regression Ridge Regression objective: bw ridge = arg min w = y i x T i w 2 + w 2 2 Lasso objective: bw lasso = arg min w y i x T i w 2 + w x 2018 Kevin Jamieson 11

12 Penalized Least Squares Ridge : r(w) = w 2 2 Lasso : r(w) = w 1 bw r = arg min w y i x T i w 2 + r(w) 2018 Kevin Jamieson 12

13 Penalized Least Squares Ridge : r(w) = w 2 2 Lasso : r(w) = w 1 bw r = arg min w y i x T i w 2 + r(w) For any 0 for which bw r achieves the minimum, there exists a 0 such that bw r = arg min w y i x T i w 2 subject to r(w) apple 2018 Kevin Jamieson 13

14 Penalized Least Squares Ridge : r(w) = w 2 2 Lasso : r(w) = w 1 bw r = arg min w y i x T i w 2 + r(w) For any 0 for which bw r achieves the minimum, there exists a 0 such that bw r = arg min w y i x T i w 2 subject to r(w) apple 2018 Kevin Jamieson H H EV goof T Ev 14

15 Optimizing the LASSO Objective LASSO solution: bw lasso, b b lasso = arg min w,b b blasso = arg min w,b 1 n y i (x T i w + b) 2 + w 1 y i x T i bw lasso ) 2018 Kevin Jamieson 15

16 Optimizing the LASSO Objective LASSO solution: bw lasso, b b lasso = arg min w,b b blasso = arg min w,b 1 n y i (x T i w + b) 2 + w 1 y i x T i bw lasso ) So as usual, preprocess to make sure that 1 n so we don t have to worry about an o set. y i =0, 1 n x i = Kevin Jamieson 16

17 Optimizing the LASSO Objective LASSO solution: bw lasso, b b lasso = arg min w,b b blasso = arg min w,b 1 n y i (x T i w + b) 2 + w 1 y i x T i bw lasso ) V So as usual, preprocess to make sure that 1 n so we don t have to worry about an o set. y i =0, 1 n x i = Kevin Jamieson bw lasso = arg min w How do we solve this? y i x T i w 2 + w 1 17

18 Coordinate Descent Given a function, we want to find minimum Often, it is easy to find minimum along a single coordinate: t How do we pick next coordinate? Super useful approach for *many* problems 1,515 bin Converges to optimum in some cases, such as LASSO 2018 Kevin Jamieson 18

19 Optimizing LASSO Objective One Coordinate at a Time Fix any j 2 {1,...,d} y i x T i w 2 + w 1 = k=1 k=1 y i 0 X y i x i,k w k k6=j Tin! 2 dx dx x i,k w k + w k 1 2 x i,j w j A + X w k + w j k6=j u 2018 Kevin Jamieson 19

20 Optimizing LASSO Objective One Coordinate at a Time Fix any j 2 {1,...,d} y i x T i w 2 + w 1 = y i 0 X y i x i,k w k k6=j! 2 dx dx x i,k w k + w k k=1 k=1 1 2 x i,j w j A + X w k + w j k6=j Initialize bw k = 0 for all k 2 {1,...,d} Loop over j 2 {1,...,n}: r (j) i = y i X k6=j bw j = arg min w j x i,j bw k r (j) i x i,j w j 2 + wj 2018 Kevin Jamieson 20

21 Convex Functions f is convex card zza Gcs Z is a convex set Equivalent definitions of convexity: y Y x x f convex: f ( x +(1 )y) apple f(x)+(1 )f(y) 8x, y, 2 [0, 1] f(y) f(x)+rf(x) T (y x) 8x, y Gradients lower bound convex functions and are unique at x iff function differentiable at x Subgradients generalize gradients to non-differentiable points: Any supporting hyperplane at x that lower bounds entire function g is a subgradient at x if f(y) f(x)+g T (y x) 2018 Kevin Jamieson 21

22 Taking the Subgradient bw j = arg min w j r (j) i x i,j w j 2 + wj g is a subgradient at x if f(y) f(x)+g T (y x) Convex function is minimized at w if 0 is a sub-gradient at wj n wj w j = II r (j) i x i,j w j 2 = 2cm xi His Yw o lytzlottgly o l if w co gy 2018 Kevin Jamieson 22

23 T O I 2x re xi w t a

24 Setting Subgradient to wj a j =( n X! 2 r (j) i x i,j w j + wj = x 2 i,j) c j = 2( r (j) i x i,j ) 8 >< a j w j c j 0 if w j < 0 2 [ c j, c j + ] if w j =0 >: 2 a j w j c j + if w j > 0 W G so Xe Cj 1 XIE Kevin Jamieson 23

25 Setting Subgradient to wj a j =( n X! 2 r (j) i x i,j w j + wj = x 2 i,j) c j = 2( bw j = arg min w j r (j) i x i,j ) 8 >< a j w j c j if w j < 0 [ c j, c j + ] if w j =0 >: a j w j c j + if w j > 0 r (j) i x i,j w j 2 + wj w is a minimum if 0 is a sub-gradient at w bw j = 8 >< (c j + )/a j if c j < 0 if c j apple >: (c j )/a j if c j > 2018 Kevin Jamieson 24

26 Soft Thresholding a j = bw j = x 2 i,j 8 >< (c j + )/a j if c j < 0 if c j apple >: (c j )/a j if c j > X c j =2 y i x i,k w k x i,j k6=j 1 0 X O 2018 Kevin Jamieson 25

27 Coordinate Descent for LASSO (aka Shooting Algorithm) Repeat until convergence (initialize w=0) Pick a coordinate l at (random or sequentially) Set: 8 >< (c j + )/a j if c j < bw j = 0 if c j apple >: (c j )/a j if c j > Where: a j = x 2 i,j c j =2 X y i x i,k bw k x i,j k6=j For convergence rates, see Shalev-Shwartz and Tewari 2009 Other common technique = LARS Least angle regression and shrinkage, Efron et al Kevin Jamieson 26

28 Recall: Ridge Coefficient Path From Kevin Murphy textbook Ya Typical approach: select λ using cross validation 2018 Kevin Jamieson 27

29 Now: LASSO Coefficient Path W wz W From Kevin Murphy Wdtextbook VX 2018 Kevin Jamieson 28

30 What you need to know Variable Selection: find a sparse solution to learning problem L 1 regularization is one way to do variable selection Applies beyond regression Hundreds of other approaches out there LASSO objective non-differentiable, but convex Use subgradient No closed-form solution for minimization Use coordinate descent Shooting algorithm is simple approach for solving LASSO 2018 Kevin Jamieson 29

31 Classification Logistic Regression Machine Learning CSE546 Kevin Jamieson University of Washington October 11 9, 2016 Kevin Jamieson

32 THUS FAR, REGRESSION: PREDICT A CONTINUOUS VALUE GIVEN SOME INPUTS Kevin Jamieson

33 Weather prediction revisted Temperature Kevin Jamieson

34 Reading Your Brain, Simple Example Pairwise classification accuracy: 85% Person Animal [Mitchell et al.] Kevin Jamieson

35 Binary Classification Learn: f:x >Y X features Y target classes Y 2 {0, 1} Loss function: Expected loss of f: f x Y 011 Loss Ex u HEHN 3 E E.nl EfGdtt3 X x x 1 fix o p Y 1 X x Suppose you know P(Y X) exactly, how should you classify? Bayes optimal classifier: floc a9yma lp Y y X x Kevin Jamieson

36 Binary Classification Learn: f:x >Y X features Y target classes Y 2 {0, 1} Loss function: `(f(x),y)=1{f(x) 6= y} Expected loss of f: Suppose you know P(Y X) exactly, how should you classify? Bayes optimal classifier: E XY [1{f(X) 6= Y }] =E X [E Y X [1{f(x) 6= Y } X = x]] E Y X [1{f(x) 6= Y } X = x] =1{f(x) =1}P(Y =0 X = x)+1{f(x) =0}P(Y =1 X = x) f(x) = arg max y P(Y = y X = x) Kevin Jamieson

37 Link Functions Estimating P(Y X): Why not use standard linear regression? Combining regression and probability? Need a mapping from real values to [0,1] A link function! Kevin Jamieson

38 Logistic Regression Logistic function (or Sigmoid): Learn P(Y X) directly Assume a particular functional form for link function Sigmoid applied to a linear function of the input features: Z Features can be discrete or continuous! Kevin Jamieson

39 Understanding the sigmoid w 0 =-2, w 1 =-1 w 0 =0, w 1 =-1 w 0 =0, w 1 = Kevin Jamieson

40 Sigmoid for binary classes P(Y =0 w, X) = exp(w 0 + P k w kx k ) P(Y =1 w, X) =1 P(Y =0 w, X) = exp(w 0 + P k w kx k ) 1 + exp(w 0 + P k w kx k ) P(Y =1 w, X) P(Y =0 w, X) = Kevin Jamieson

41 Sigmoid for binary classes P(Y =0 w, X) = exp(w 0 + P k w kx k ) P(Y =1 w, X) =1 P(Y =0 w, X) = exp(w 0 + P k w kx k ) 1 + exp(w 0 + P k w kx k ) P(Y =1 w, X) P(Y =0 w, X) =exp(w 0 + X k w k X k ) log P(Y =1 w, X) P(Y =0 w, X) = w 0 + X k w k X k Linear Decision Rule! Kevin Jamieson

42 Logistic Regression a Linear classifier Kevin Jamieson

43 Loss function: Conditional Likelihood Have a bunch of iid data of the form: P (Y = 1 x, w) = P (Y =1 x, w) = exp(wt x) 1 + exp(w T x) {(x i,y i )} n x i 2 R d, y i 2 { 1, 1} exp(w T x) This is equivalent to: P (Y = y x, w) = exp( yw T x) So we can compute the maximum likelihood estimator: bw MLE = arg max w ny P (y i x i,w) Kevin Jamieson

44 Loss function: Conditional Likelihood Have a bunch of iid data of the form: bw MLE = arg max w = arg min w {(x i,y i )} n x i 2 R d, y i 2 { 1, 1} ny P (y P (Y = y x, w) = i x i,w) log(1 + exp( y i x T i w)) exp( yw T x) Kevin Jamieson

45 Loss function: Conditional Likelihood Have a bunch of iid data of the form: bw MLE = arg max w = arg min w {(x i,y i )} n x i 2 R d, y i 2 { 1, 1} ny P (y P (Y = y x, w) = i x i,w) log(1 + exp( Logistic Loss: `i(w) = log(1 + exp( y i x T i w)) y i x T i w)) exp( yw T x) Squared error Loss: `i(w) =(y i x T i w)2 (MLE for Gaussian noise) Kevin Jamieson

46 Loss function: Conditional Likelihood Have a bunch of iid data of the form: bw MLE = arg max w = arg min w {(x i,y i )} n x i 2 R d, y i 2 { 1, 1} ny P (y P (Y = y x, w) = i x i,w) log(1 + exp( What does J(w) look like? Is it convex? y i x T i w)) = J(w) exp( yw T x) Kevin Jamieson

47 Loss function: Conditional Likelihood Have a bunch of iid data of the form: bw MLE = arg max w = arg min w {(x i,y i )} n x i 2 R d, y i 2 { 1, 1} ny P (y P (Y = y x, w) = i x i,w) log(1 + exp( y i x T i w)) = J(w) exp( yw T x) Good news: J(w) is convex function of w, no local optima problems Bad news: no closed-form solution to maximize J(w) Good news: convex functions easy to optimize Kevin Jamieson

48 Linear Separability arg min w log(1 + exp( y i x T i w)) When is this loss small? Kevin Jamieson

49 Large parameters Overfitting If data is linearly separable, weights go to infinity In general, leads to overfitting: Penalizing high weights can prevent overfitting Kevin Jamieson

50 Regularized Conditional Log Likelihood Add regularization penalty, e.g., L 2 : arg min log 1 + exp( y i (x T i w + b)) + w 2 2 w,b Be sure to not regularize the o set b! Kevin Jamieson

51 Gradient Descent Machine Learning CSE546 Kevin Jamieson University of Washington October 11, 2016 Kevin Jamieson

52 Machine Learning Problems Have a bunch of iid data of the form: {(x i,y i )} n x i 2 R d y i 2 R Learning a model s parameters: Each `i(w) is convex. `i(w) Kevin Jamieson

53 Machine Learning Problems Have a bunch of iid data of the form: {(x i,y i )} n x i 2 R d y i 2 R Learning a model s parameters: Each `i(w) is convex. x x y `i(w) g is a subgradient at x if f(y) f(x)+g T (y x) f convex: f ( x +(1 )y) apple f(x)+(1 )f(y) 8x, y, 2 [0, 1] f(y) f(x)+rf(x) T (y x) 8x, y Kevin Jamieson

54 Machine Learning Problems Have a bunch of iid data of the form: {(x i,y i )} n x i 2 R d y i 2 R Learning a model s parameters: Each `i(w) is convex. `i(w) Logistic Loss: `i(w) = log(1 + exp( y i x T i w)) Squared error Loss: `i(w) =(y i x T i w)2 Kevin Jamieson

55 Least squares Have a bunch of iid data of the form: {(x i,y i )} n x i 2 R d y i 2 R Learning a model s parameters: Each `i(w) is convex. `i(w) Squared error Loss: `i(w) =(y i x T i w)2 How does software solve: 1 2 Xw y 2 2 Kevin Jamieson

56 Least squares Have a bunch of iid data of the form: {(x i,y i )} n x i 2 R d y i 2 R Learning a model s parameters: Each `i(w) is convex. `i(w) Squared error Loss: `i(w) =(y i x T i w)2 How does software solve: its complicated: (LAPACK, BLAS, MKL ) 1 2 Xw y 2 2 Do you need high precision? Is X column/row sparse? Is bw LS sparse? Is X T X well-conditioned? Can X T X fit in cache/memory? Kevin Jamieson

57 Taylor Series Approximation Taylor series in one dimension: f(x + )=f(x)+f 0 (x) f 00 (x) Gradient descent: Kevin Jamieson

58 Taylor Series Approximation Taylor series in d dimensions: f(x + v) =f(x)+rf(x) T v vt r 2 f(x)v +... Gradient descent: Kevin Jamieson

59 Gradient Descent f(w) = 1 2 Xw y 2 2 w t+1 = w t rf(w t ) rf(w) = Kevin Jamieson

60 Gradient Descent f(w) = 1 2 Xw y 2 2 w t+1 = w t rf(w t ) (w t+1 w )=(I X T X)(w t w ) Example: X= =(I X T X) t+1 (w 0 w ) apple y= apple w 0 = apple 0 0 w = Kevin Jamieson

61 Taylor Series Approximation Taylor series in one dimension: f(x + )=f(x)+f 0 (x) f 00 (x) Newton s method: Kevin Jamieson

62 Taylor Series Approximation Taylor series in d dimensions: f(x + v) =f(x)+rf(x) T v vt r 2 f(x)v +... Newton s method: Kevin Jamieson

63 Newton s Method f(w) = 1 2 Xw y 2 2 rf(w) = r 2 f(w) = v t is solution to : r 2 f(w t )v t = rf(w t ) w t+1 = w t + v t Kevin Jamieson

64 Newton s Method f(w) = 1 2 Xw y 2 2 rf(w) = X T (Xw y) r 2 f(w) = X T X v t is solution to : r 2 f(w t )v t = rf(w t ) w t+1 = w t + v t For quadratics, Newton s method converges in one step! (Not a surprise, why?) w 1 = w 0 (X T X) 1 X T (Xw 0 y)=w Kevin Jamieson

65 General case In general for Newton s method to achieve f(w t ) f(w ) apple : So why are ML problems overwhelmingly solved by gradient methods? Hint: v t is solution to : r 2 f(w t )v t = rf(w t ) Kevin Jamieson

66 General Convex case f(w t ) f(w ) apple Clean converge nice proofs: Bubeck Newton s method: t log(log(1/ )) Gradient descent: f is smooth and strongly convex: ai r 2 f(w) : bi f is smooth: r 2 f(w) bi f is potentially non-differentiable: rf(w) 2 apple c Nocedal +Wright, Bubeck Other: BFGS, Heavy-ball, BCD, SVRG, ADAM, Adagrad, Kevin Jamieson

67 Revisiting Logistic Regression Machine Learning CSE546 Kevin Jamieson University of Washington October 16, 2016 Kevin Jamieson

68 Loss function: Conditional Likelihood Have a bunch of iid data of the form: bw MLE = arg max w f(w) rf(w) = = arg min w {(x i,y i )} n x i 2 R d, y i 2 { 1, 1} ny P (y P (Y = y x, w) = i x i,w) log(1 + exp( y i x T i w)) exp( yw T x) Kevin Jamieson

Classification Logistic Regression

Classification Logistic Regression Machine Learning CSE546 Kevin Jamieson University of Washington October 16, 2016 1 THUS FAR, REGRESSION: PREDICT A CONTINUOUS VALUE GIVEN SOME INPUTS 2 Weather prediction