Warm up. Regrade requests are submitted directly in Gradescope; do not email instructors.
1 Warm up. Regrade requests are submitted directly in Gradescope; do not email instructors. 1 float in NumPy = 8 bytes; $10^6$ bytes = 1 MB; $10^9$ bytes = 1 GB. For each block, compute the memory required in terms of n, p, d. If d << p << n, which is the most memory-efficient program (blue, green, red)? If you have unlimited memory, what do you think is the fastest program?
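To make the byte arithmetic concrete, here is a minimal sketch of checking array memory in NumPy; the sizes n, p, d are hypothetical, chosen only for illustration:

```python
import numpy as np

# Hypothetical sizes with d << p << n, chosen only for illustration.
n, p, d = 100_000, 1_000, 50

X = np.zeros((n, p))   # n * p floats at 8 bytes each
W = np.zeros((p, d))   # p * d floats

# .nbytes reports the exact buffer size in bytes.
print(X.nbytes / 1e6, "MB")  # 8 * n * p / 1e6 = 800.0 MB
print(W.nbytes / 1e6, "MB")  # 8 * p * d / 1e6 = 0.4 MB
```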
2 Gradient Descent. Machine Learning CSE546, Kevin Jamieson, University of Washington. October 18, 2016.
3 Machine Learning Problems. Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$. Learning a model's parameters: minimize $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.
4 Machine Learning Problems. Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$. Learning a model's parameters: minimize $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex. $f$ convex: $f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y)$ for all $x, y$ and $\lambda \in [0, 1]$; equivalently, $f(y) \ge f(x) + \nabla f(x)^T (y - x)$ for all $x, y$. $g$ is a subgradient at $x$ if $f(y) \ge f(x) + g^T (y - x)$ for all $y$.
5 Machine Learning Problems. Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$. Learning a model's parameters: minimize $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex. Logistic loss: $\ell_i(w) = \log(1 + \exp(-y_i x_i^T w))$. Squared error loss: $\ell_i(w) = (y_i - x_i^T w)^2$.
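As a quick illustration (synthetic data, not from the slides), both losses can be written directly in NumPy; `np.logaddexp` is used to evaluate the logistic loss stably:

```python
import numpy as np

def logistic_loss(w, X, y):
    # sum_i log(1 + exp(-y_i x_i^T w)); logaddexp(0, z) = log(1 + e^z) avoids overflow
    return np.sum(np.logaddexp(0.0, -y * (X @ w)))

def squared_loss(w, X, y):
    # sum_i (y_i - x_i^T w)^2
    return np.sum((y - X @ w) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.where(X @ rng.normal(size=5) > 0, 1.0, -1.0)

print(logistic_loss(np.zeros(5), X, y))  # = 100 * log(2) at w = 0
print(squared_loss(np.zeros(5), X, y))   # = sum_i y_i^2 = 100 at w = 0
```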
6 Least Squares. Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$. Learning a model's parameters: minimize $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex. Squared error loss: $\ell_i(w) = (y_i - x_i^T w)^2$. How does software solve $\min_w \frac{1}{2} \|Xw - y\|_2^2$?
7 Least Squares. How does software solve $\min_w \frac{1}{2} \|Xw - y\|_2^2$? It's complicated (LAPACK, BLAS, MKL, ...). Do you need high precision? Is $X$ column/row sparse? Is $\hat{w}_{LS}$ sparse? Is $X^T X$ well-conditioned? Can $X^T X$ fit in cache/memory?
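A small sketch (synthetic, well-conditioned data) contrasting two of the standard routes, a LAPACK-backed least-squares solver versus the normal equations:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1_000, 20
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# SVD-based solver (LAPACK under the hood): robust even if X is ill-conditioned.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Normal equations: cheap when d is small, but squares the condition number of X.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

print(np.allclose(w_lstsq, w_normal))  # True here, where X is well-conditioned
```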
8 Taylor Series Approximation. Taylor series in one dimension: $f(x + \delta) = f(x) + f'(x)\delta + \frac{1}{2} f''(x) \delta^2 + \cdots$. Gradient descent: keep only the first-order term and take a small step against the derivative, $x_{t+1} = x_t - \eta f'(x_t)$.
9 Taylor Series Approximation. Taylor series in $d$ dimensions: $f(x + v) = f(x) + \nabla f(x)^T v + \frac{1}{2} v^T \nabla^2 f(x) v + \cdots$. Gradient descent: $x_{t+1} = x_t - \eta \nabla f(x_t)$.
10 Gradient Descent. $f(w) = \frac{1}{2} \|Xw - y\|_2^2$, $w_{t+1} = w_t - \eta \nabla f(w_t)$. $\nabla f(w) = ?$
11 Gradient Descent. $f(w) = \frac{1}{2} \|Xw - y\|_2^2$, $w_{t+1} = w_t - \eta \nabla f(w_t)$, $\nabla f(w) = X^T (Xw - y)$. $w^* = \arg\min_w f(w) \implies \nabla f(w^*) = 0$. Then $w_{t+1} - w^* = w_t - w^* - \eta \nabla f(w_t) = w_t - w^* - \eta (\nabla f(w_t) - \nabla f(w^*)) = w_t - w^* - \eta X^T X (w_t - w^*) = (I - \eta X^T X)(w_t - w^*) = (I - \eta X^T X)^{t+1} (w_0 - w^*)$.
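A minimal sketch of this recursion for least squares (synthetic data; the step size is chosen from the largest eigenvalue of $X^T X$ so every contraction factor is below one):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)

w_star = np.linalg.solve(X.T @ X, X.T @ y)
# Any eta with max_j |1 - eta * lambda_j| < 1 works; 1 / lambda_max is a safe choice.
eta = 1.0 / np.linalg.eigvalsh(X.T @ X).max()

w = np.zeros(d)
for t in range(500):
    w = w - eta * (X.T @ (X @ w - y))  # w_{t+1} = w_t - eta * grad f(w_t)

print(np.linalg.norm(w - w_star))  # error contracts like max_j |1 - eta*lambda_j|^t
```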
12 Gradient Descent. $f(w) = \frac{1}{2} \|Xw - y\|_2^2$, $w_{t+1} = w_t - \eta \nabla f(w_t)$, $(w_{t+1} - w^*) = (I - \eta X^T X)(w_t - w^*) = (I - \eta X^T X)^{t+1} (w_0 - w^*)$. Example: a two-dimensional $X$ and $y$ with $w_0 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$; $w^* = ?$
13 Gradient Descent. $f(w) = \frac{1}{2} \|Xw - y\|_2^2$, $w_{t+1} = w_t - \eta \nabla f(w_t)$, $(w_{t+1} - w^*) = (I - \eta X^T X)(w_t - w^*) = (I - \eta X^T X)^{t+1} (w_0 - w^*)$. Example: with $w_0 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$, $w^* = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$, and $X^T X = \mathrm{diag}(\lambda_1, \lambda_2)$, pick $\eta$ such that $\max\{|1 - \eta \lambda_1|, |1 - \eta \lambda_2|\} < 1$. Then $|w_{t+1,1} - w^*_1| = |1 - \eta \lambda_1|^{t+1} |w_{0,1} - w^*_1|$ and $|w_{t+1,2} - w^*_2| = |1 - \eta \lambda_2|^{t+1} |w_{0,2} - w^*_2|$: each coordinate converges at a rate set by the corresponding eigenvalue.
14 Taylor Series Approximation. Taylor series in one dimension: $f(x + \delta) = f(x) + f'(x)\delta + \frac{1}{2} f''(x) \delta^2 + \cdots$. Newton's method: minimize the quadratic approximation exactly, $x_{t+1} = x_t - f'(x_t) / f''(x_t)$.
15 Taylor Series Approximation. Taylor series in $d$ dimensions: $f(x + v) = f(x) + \nabla f(x)^T v + \frac{1}{2} v^T \nabla^2 f(x) v + \cdots$. Newton's method: $x_{t+1} = x_t - [\nabla^2 f(x_t)]^{-1} \nabla f(x_t)$.
16 Newton's Method. $f(w) = \frac{1}{2} \|Xw - y\|_2^2$. $\nabla f(w) = ?$, $\nabla^2 f(w) = ?$. $v_t$ is the solution to $\nabla^2 f(w_t) v_t = -\nabla f(w_t)$, and $w_{t+1} = w_t + \eta v_t$.
17 Newton's Method. $f(w) = \frac{1}{2} \|Xw - y\|_2^2$, $\nabla f(w) = X^T (Xw - y)$, $\nabla^2 f(w) = X^T X$. $v_t$ is the solution to $\nabla^2 f(w_t) v_t = -\nabla f(w_t)$, and $w_{t+1} = w_t + \eta v_t$. For quadratics, Newton's method can converge in one step! (No surprise; why?) $w_1 = w_0 - \eta (X^T X)^{-1} X^T (X w_0 - y) = (1 - \eta) w_0 + \eta (X^T X)^{-1} X^T y = (1 - \eta) w_0 + \eta w^*$. In general, for $w_t$ close enough to $w^*$ one should use $\eta = 1$.
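A sketch of the one-step behavior on a quadratic (synthetic data, $\eta = 1$): one Newton step from any starting point lands on the least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = X @ rng.normal(size=8)

w0 = np.zeros(8)
grad = X.T @ (X @ w0 - y)             # gradient of (1/2)||Xw - y||_2^2 at w0
v = np.linalg.solve(X.T @ X, -grad)   # Newton direction: Hessian * v = -gradient
w1 = w0 + v                           # eta = 1

print(np.linalg.norm(X.T @ (X @ w1 - y)))  # ~0: one Newton step solves the quadratic
```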
18 General Case. In general, for Newton's method to achieve $f(w_t) - f(w^*) \le \epsilon$, roughly $t \approx \log(\log(1/\epsilon))$ iterations suffice. So why are ML problems overwhelmingly solved by gradient methods? Hint: $v_t$ is the solution to $\nabla^2 f(w_t) v_t = -\nabla f(w_t)$, a $d \times d$ linear system that must be solved at every iteration.
19 General Convex Case. Number of iterations $t$ needed to achieve $f(w_t) - f(w^*) \le \epsilon$ (clean convergence proofs: Bubeck). Newton's method: $t \approx \log(\log(1/\epsilon))$. Gradient descent: if $f$ is smooth and strongly convex, $aI \preceq \nabla^2 f(w) \preceq bI$, then $t \approx \frac{b}{a} \log(1/\epsilon)$; if $f$ is smooth, $\nabla^2 f(w) \preceq bI$, then $t \approx b/\epsilon$; if $f$ is potentially non-differentiable with $\|\nabla f(w)\|_2 \le c$, then $t \approx c^2/\epsilon^2$ (see Nocedal & Wright; Bubeck). Other methods: BFGS, heavy-ball, BCD, SVRG, ADAM, Adagrad, ...
20 Revisiting Logistic Regression. Machine Learning CSE546, Kevin Jamieson, University of Washington. October 18, 2016.
21 Loss Function: Conditional Likelihood. Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$, $y_i \in \{-1, 1\}$, modeled as $P(Y = y \mid x, w) = \frac{1}{1 + \exp(-y \, w^T x)}$. Then $\hat{w}_{MLE} = \arg\max_w \prod_{i=1}^n P(y_i \mid x_i, w) = \arg\min_w \sum_{i=1}^n \log(1 + \exp(-y_i x_i^T w)) =: \arg\min_w f(w)$, with $\nabla f(w) = \sum_{i=1}^n \frac{-y_i x_i}{1 + \exp(y_i x_i^T w)}$.
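A small sketch (synthetic data; the fixed step size is an illustrative choice) of plain gradient descent on this MLE objective, using the gradient formula above:

```python
import numpy as np

def logistic_grad(w, X, y):
    # grad_w sum_i log(1 + exp(-y_i x_i^T w)) = sum_i -y_i x_i / (1 + exp(y_i x_i^T w))
    s = 1.0 / (1.0 + np.exp(y * (X @ w)))
    return -(X.T @ (y * s))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.where(X @ rng.normal(size=5) > 0, 1.0, -1.0)

w = np.zeros(5)
for t in range(500):
    w -= 0.01 * logistic_grad(w, X, y)  # plain gradient descent on the MLE objective

print(np.sum(np.logaddexp(0.0, -y * (X @ w))))  # objective drops from 200*log(2)
```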
22 Stochastic Gradient Descent. Machine Learning CSE546, Kevin Jamieson, University of Washington. October 18, 2016.
23 Stochastic Gradient Descent. Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$. Learning a model's parameters: minimize $\frac{1}{n} \sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.
24 Stochastic Gradient Descent. Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$. Learning a model's parameters: minimize $\frac{1}{n} \sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex. Gradient descent: $w_{t+1} = w_t - \eta \, \nabla_w \left( \frac{1}{n} \sum_{i=1}^n \ell_i(w) \right) \Big|_{w = w_t}$.
25 Stochastic Gradient Descent. Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$. Learning a model's parameters: minimize $\frac{1}{n} \sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex. Gradient descent: $w_{t+1} = w_t - \eta \, \nabla_w \left( \frac{1}{n} \sum_{i=1}^n \ell_i(w) \right) \Big|_{w = w_t}$. Stochastic gradient descent: $w_{t+1} = w_t - \eta \, \nabla_w \ell_{I_t}(w) \big|_{w = w_t}$ with $I_t$ drawn uniformly at random from $\{1, \ldots, n\}$, so that $\mathbb{E}[\nabla \ell_{I_t}(w)] = \frac{1}{n} \sum_{i=1}^n \nabla \ell_i(w)$.
26 Stochastic Gradient Descent: Theorem. Let $w_{t+1} = w_t - \eta \, \nabla_w \ell_{I_t}(w) \big|_{w = w_t}$ with $I_t$ drawn uniformly at random from $\{1, \ldots, n\}$, so that $\mathbb{E}[\nabla \ell_{I_t}(w)] = \frac{1}{n} \sum_{i=1}^n \nabla \ell_i(w) =: \nabla \ell(w)$. If $\sup_w \max_i \|\nabla \ell_i(w)\|_2^2 \le G$ and $\|w_1 - w^*\|_2^2 \le R$, and $\eta = \sqrt{\frac{R}{GT}}$, then $\mathbb{E}[\ell(\bar{w}) - \ell(w^*)] \le \frac{R}{2 \eta T} + \frac{\eta G}{2} \le \sqrt{\frac{RG}{T}}$, where $\bar{w} = \frac{1}{T} \sum_{t=1}^T w_t$. (In practice, use the last iterate.)
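A runnable sketch of the theorem's setup on least squares (synthetic data; the constants R and G below are assumed placeholder bounds, not computed from the data):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, T = 500, 5, 20_000
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

R, G = 25.0, 400.0           # assumed bounds: ||w_1 - w*||^2 <= R, ||grad l_i||^2 <= G
eta = np.sqrt(R / (G * T))   # the step size from the theorem

w = np.zeros(d)
w_bar = np.zeros(d)
for t in range(T):
    i = rng.integers(n)                       # I_t uniform over {1, ..., n}
    grad_i = 2.0 * (X[i] @ w - y[i]) * X[i]   # grad of l_i(w) = (y_i - x_i^T w)^2
    w = w - eta * grad_i
    w_bar += w / T                            # averaged iterate (1/T) sum_t w_t

print(np.mean((X @ w_bar - y) ** 2))  # average squared error of the averaged iterate
```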
27 Stochastic Gradient Descent: Proof. $\mathbb{E}[\|w_{t+1} - w^*\|_2^2] = \mathbb{E}[\|w_t - \eta \nabla \ell_{I_t}(w_t) - w^*\|_2^2]$
28 Stochastic Gradient Descent: Proof. $\mathbb{E}[\|w_{t+1} - w^*\|_2^2] = \mathbb{E}[\|w_t - \eta \nabla \ell_{I_t}(w_t) - w^*\|_2^2] = \mathbb{E}[\|w_t - w^*\|_2^2] - 2\eta \, \mathbb{E}[\nabla \ell_{I_t}(w_t)^T (w_t - w^*)] + \eta^2 \, \mathbb{E}[\|\nabla \ell_{I_t}(w_t)\|_2^2] \le \mathbb{E}[\|w_t - w^*\|_2^2] - 2\eta \, \mathbb{E}[\ell(w_t) - \ell(w^*)] + \eta^2 G$, using $\mathbb{E}[\nabla \ell_{I_t}(w_t)^T (w_t - w^*)] = \mathbb{E}\big[\mathbb{E}[\nabla \ell_{I_t}(w_t)^T (w_t - w^*) \mid I_1, w_1, \ldots, I_{t-1}, w_{t-1}]\big] = \mathbb{E}[\nabla \ell(w_t)^T (w_t - w^*)] \ge \mathbb{E}[\ell(w_t) - \ell(w^*)]$ (by convexity). Summing over $t$ and rearranging: $\sum_{t=1}^T \mathbb{E}[\ell(w_t) - \ell(w^*)] \le \frac{1}{2\eta} \mathbb{E}[\|w_1 - w^*\|_2^2] - \frac{1}{2\eta} \mathbb{E}[\|w_{T+1} - w^*\|_2^2] + \frac{T \eta G}{2} \le \frac{R}{2\eta} + \frac{T \eta G}{2}$.
29 Stochastic Gradient Descent: Proof. Jensen's inequality: for any random $Z \in \mathbb{R}^d$ and convex function $\phi: \mathbb{R}^d \to \mathbb{R}$, $\phi(\mathbb{E}[Z]) \le \mathbb{E}[\phi(Z)]$. Hence $\mathbb{E}[\ell(\bar{w}) - \ell(w^*)] \le \frac{1}{T} \sum_{t=1}^T \mathbb{E}[\ell(w_t) - \ell(w^*)]$, where $\bar{w} = \frac{1}{T} \sum_{t=1}^T w_t$.
30 Stochastic Gradient Descent: Proof. Jensen's inequality: for any random $Z \in \mathbb{R}^d$ and convex function $\phi: \mathbb{R}^d \to \mathbb{R}$, $\phi(\mathbb{E}[Z]) \le \mathbb{E}[\phi(Z)]$. Hence $\mathbb{E}[\ell(\bar{w}) - \ell(w^*)] \le \frac{1}{T} \sum_{t=1}^T \mathbb{E}[\ell(w_t) - \ell(w^*)]$, where $\bar{w} = \frac{1}{T} \sum_{t=1}^T w_t$. Combining with the previous bound and $\eta = \sqrt{\frac{R}{GT}}$: $\mathbb{E}[\ell(\bar{w}) - \ell(w^*)] \le \frac{R}{2 \eta T} + \frac{\eta G}{2} \le \sqrt{\frac{RG}{T}}$.
31 Stochastic Gradient Descent: A Learning Perspective. Machine Learning CSE546, Kevin Jamieson, University of Washington. October 18, 2016.
32 Learning Problems as Expectations. Minimizing loss on training data: given a dataset sampled iid from some distribution $p(x)$ on features, and a loss function (e.g., hinge loss, logistic loss, ...), we often minimize the loss on the training data, $\frac{1}{n} \sum_{i=1}^n \ell(w; x_i)$. However, we should really minimize the expected loss on all data, $\mathbb{E}_{x \sim p}[\ell(w; x)] = \int \ell(w; x) \, p(x) \, dx$. So we are approximating the integral by the average on the training data.
33 Gradient Descent in Terms of Expectations. True objective function: $\mathbb{E}_{x \sim p}[\ell(w; x)]$. Taking the gradient: $\nabla_w \, \mathbb{E}_x[\ell(w; x)] = \mathbb{E}_x[\nabla_w \ell(w; x)]$. True gradient descent rule: $w_{t+1} = w_t - \eta \, \mathbb{E}_x[\nabla_w \ell(w_t; x)]$. How do we estimate the expected gradient?
34 SGD: Stochastic Gradient Descent. True gradient: $\mathbb{E}_x[\nabla_w \ell(w; x)]$. Sample-based approximation: the average of $\nabla_w \ell(w; x_i)$ over the training samples. What if we estimate the gradient with just one sample? It is an unbiased estimate of the gradient, but very noisy! This is called stochastic gradient descent (among many other names), and it is VERY useful in practice!
35 Perceptron. Machine Learning CSE546, Kevin Jamieson, University of Washington. October 18, 2018.
36 Online Learning. Click prediction for ads is a streaming data task: the user enters a query and an ad must be selected (observe $x_j$ and must predict $y_j$); the user either clicks or doesn't click on the ad (the label $y_j$ is revealed afterwards); Google gets a reward if the user clicks on the ad; then update the model for next time.
37 Online Classification. A new point arrives at time $k$.
38 The Perceptron Algorithm [Rosenblatt '58, '62]. Classification setting: $y \in \{-1, +1\}$. Linear model. Prediction: $\mathrm{sign}(w^T x + b)$. Training: initialize the weight vector; at each time step, observe features, make a prediction, observe the true class, and update the model if the prediction is not equal to the truth.
39 The Perceptron Algorithm [Rosenblatt '58, '62]. Classification setting: $y \in \{-1, +1\}$. Linear model. Prediction: $\mathrm{sign}(w^T x + b)$. Training: initialize $w_0 = 0$, $b_0 = 0$; at each time step, observe features $x_k$, make the prediction $\mathrm{sign}(x_k^T w_k + b_k)$, and observe the true class $y_k$; if the prediction is not equal to the truth, update $\begin{bmatrix} w_{k+1} \\ b_{k+1} \end{bmatrix} = \begin{bmatrix} w_k \\ b_k \end{bmatrix} + y_k \begin{bmatrix} x_k \\ 1 \end{bmatrix}$.
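A runnable sketch of this update rule (synthetic linearly separable data; running multiple passes over the stream is an illustrative choice, not part of the slide's one-pass setting):

```python
import numpy as np

def perceptron(X, y, passes=10):
    """Online Perceptron: y in {-1, +1}; returns (w, b)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(passes):
        for x_k, y_k in zip(X, y):
            if np.sign(x_k @ w + b) != y_k:  # mistake (sign(0)=0 also triggers an update)
                w += y_k * x_k               # [w; b] <- [w; b] + y_k [x_k; 1]
                b += y_k
    return w, b

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X @ np.array([1.0, -1.0]) + 0.5 > 0, 1.0, -1.0)  # separable labels

w, b = perceptron(X, y)
print(np.mean(np.sign(X @ w + b) == y))  # fraction correct on the training stream
```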
40 Rosenblatt, 1957: "the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence." (The New York Times, 1958)
41 Linear Separability. The Perceptron is guaranteed to converge if the data is linearly separable: there exist $w$ and $\gamma > 0$ such that $y_i (x_i^T w) \ge \gamma$ for all $i$.
42 Perceptron Analysis: Linearly Separable Case. Theorem [Block, Novikoff]: given a sequence of labeled examples in which each feature vector has bounded norm, $\|x_k\|_2 \le R$, and the dataset is linearly separable with margin $\gamma$ (there exists $w^*$ with $\|w^*\|_2 = 1$ and $y_k (x_k^T w^*) \ge \gamma$ for all $k$), the number of mistakes made by the online Perceptron on any such sequence is bounded by $R^2 / \gamma^2$.
43 Beyond Linearly Separable Case. The Perceptron algorithm is super cool! No assumption about the data distribution: it could be generated by an oblivious adversary, no need to be iid. It makes a fixed number of mistakes, and it's done forever! Even if you see infinite data.
44 Beyond Linearly Separable Case. The Perceptron algorithm is super cool! No assumption about the data distribution: it could be generated by an oblivious adversary, no need to be iid. It makes a fixed number of mistakes, and it's done forever! Even if you see infinite data. But the Perceptron is useless in practice! The real world is not linearly separable. If the data is not separable, it cycles forever, and this is hard to detect. Even if the data is separable, it may not give good generalization accuracy (small margin).
45 What is the Perceptron Doing? When we discussed logistic regression, we started from maximizing the conditional log-likelihood. When we discussed the Perceptron, we started from a description of an algorithm. What is the Perceptron optimizing?
46 Support Vector Machines. Machine Learning CSE546, Kevin Jamieson, University of Washington. October 18, 2018.
47 Linear Classifiers. Which line is better?
48 Pick the one with the largest margin! Hyperplane: $x^T w + b = 0$; margin $2\gamma$.
49 Pick the one with the largest margin! What is the distance from $x_0$ to the hyperplane defined by $x^T w + b = 0$?
50 Pick the one with the largest margin! Distance from $x_0$ to the hyperplane defined by $x^T w + b = 0$: if $\tilde{x}_0$ is the projection of $x_0$ onto the hyperplane, then $x_0 - \tilde{x}_0$ is parallel to $w$, so $\|x_0 - \tilde{x}_0\|_2 = \left| \frac{(x_0 - \tilde{x}_0)^T w}{\|w\|_2} \right| = \frac{1}{\|w\|_2} |x_0^T w - \tilde{x}_0^T w| = \frac{1}{\|w\|_2} |x_0^T w + b|$, since $\tilde{x}_0^T w = -b$ on the hyperplane.
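A quick numeric check of this argument (random hyperplane and point; the explicit projection formula $\tilde{x}_0 = x_0 - \frac{x_0^T w + b}{\|w\|_2^2} w$ is a standard identity, not from the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=3)
b = rng.normal()
x0 = rng.normal(size=3)

# Closed form: |x0^T w + b| / ||w||_2
dist = abs(x0 @ w + b) / np.linalg.norm(w)

# Explicit projection onto the hyperplane {x : x^T w + b = 0}
x0_tilde = x0 - ((x0 @ w + b) / (w @ w)) * w
print(np.isclose(x0_tilde @ w + b, 0.0))                # x0_tilde lies on the hyperplane
print(np.isclose(dist, np.linalg.norm(x0 - x0_tilde)))  # the two distances agree
```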
51 Pick the one with the largest margin! Distance of $x_0$ from the hyperplane $x^T w + b = 0$: $\frac{1}{\|w\|_2} (x_0^T w + b)$. Optimal hyperplane: $\max_{w,b} \gamma$ subject to $\frac{1}{\|w\|_2} y_i (x_i^T w + b) \ge \gamma$ for all $i$ (margin $2\gamma$).
52 Pick the one with the largest margin! Optimal hyperplane: $\max_{w,b} \gamma$ subject to $\frac{1}{\|w\|_2} y_i (x_i^T w + b) \ge \gamma$ for all $i$. Optimal hyperplane (reparameterized): $\min_{w,b} \|w\|_2^2$ subject to $y_i (x_i^T w + b) \ge 1$ for all $i$.
53 Pick the one with the largest margin! Solve efficiently by many well-studied methods: quadratic programming (QP), stochastic gradient descent, or coordinate descent (in the dual). Optimal hyperplane (reparameterized): $\min_{w,b} \|w\|_2^2$ subject to $y_i (x_i^T w + b) \ge 1$ for all $i$.
54 What if the data is still not linearly separable? If the data is linearly separable: $\min_{w,b} \|w\|_2^2$ subject to $y_i (x_i^T w + b) \ge 1$ for all $i$; the margin on each side of the hyperplane $x^T w + b = 0$ has width $\frac{1}{\|w\|_2}$.
55 What if the data is still not linearly separable? If the data is linearly separable: $\min_{w,b} \|w\|_2^2$ subject to $y_i (x_i^T w + b) \ge 1$ for all $i$. If the data is not linearly separable, some points don't satisfy the margin constraint, so allow slack variables: $\min_{w,b,\xi} \|w\|_2^2$ subject to $y_i (x_i^T w + b) \ge 1 - \xi_i$ for all $i$, with $\xi_i \ge 0$ and $\sum_{j=1}^n \xi_j \le \nu$.
56 What if the data is still not linearly separable? If the data is linearly separable: $\min_{w,b} \|w\|_2^2$ subject to $y_i (x_i^T w + b) \ge 1$ for all $i$. If the data is not linearly separable, some points don't satisfy the margin constraint, so allow slack variables: $\min_{w,b,\xi} \|w\|_2^2$ subject to $y_i (x_i^T w + b) \ge 1 - \xi_i$ for all $i$, with $\xi_i \ge 0$ and $\sum_{j=1}^n \xi_j \le \nu$. What are the support vectors?
57 SVM as a Penalization Method. Original quadratic program with linear constraints: $\min_{w,b,\xi} \|w\|_2^2$ subject to $y_i (x_i^T w + b) \ge 1 - \xi_i$ for all $i$, with $\xi_i \ge 0$ and $\sum_{j=1}^n \xi_j \le \nu$.
58 SVM as a Penalization Method. Original quadratic program with linear constraints: $\min_{w,b,\xi} \|w\|_2^2$ subject to $y_i (x_i^T w + b) \ge 1 - \xi_i$ for all $i$, with $\xi_i \ge 0$ and $\sum_{j=1}^n \xi_j \le \nu$. Using the same constrained convex optimization trick as for the lasso: for any $\nu \ge 0$ there exists a $\lambda \ge 0$ such that the solution of the following problem is equivalent: $\min_{w,b} \sum_{i=1}^n \max\{0, 1 - y_i (b + x_i^T w)\} + \lambda \|w\|_2^2$.
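A sketch of minimizing this penalized objective by subgradient descent (synthetic data; the value of lambda and the decaying step size are illustrative choices, not from the slides):

```python
import numpy as np

def svm_subgrad(w, b, X, y, lam):
    # Subgradient of sum_i max{0, 1 - y_i(b + x_i^T w)} + lam * ||w||_2^2
    active = y * (X @ w + b) < 1                     # points violating the margin
    gw = -(X[active].T @ y[active]) + 2.0 * lam * w
    gb = -np.sum(y[active])
    return gw, gb

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X @ np.array([1.0, -1.0]) > 0, 1.0, -1.0)

w, b, lam = np.zeros(2), 0.0, 0.1
for t in range(1, 1001):
    gw, gb = svm_subgrad(w, b, X, y, lam)
    w -= (0.5 / t) * gw                              # decaying step size
    b -= (0.5 / t) * gb

print(np.mean(np.sign(X @ w + b) == y))             # training accuracy
```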
59 Machine Learning Problems. Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$. Learning a model's parameters: minimize $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex. Hinge loss: $\ell_i(w) = \max\{0, 1 - y_i x_i^T w\}$. Logistic loss: $\ell_i(w) = \log(1 + \exp(-y_i x_i^T w))$. Squared error loss: $\ell_i(w) = (y_i - x_i^T w)^2$. How do we solve for $w$? The last two lectures!
60 Perceptron is Optimizing What? Perceptron update rule: $\begin{bmatrix} w_{k+1} \\ b_{k+1} \end{bmatrix} = \begin{bmatrix} w_k \\ b_k \end{bmatrix} + y_k \begin{bmatrix} x_k \\ 1 \end{bmatrix} \mathbf{1}\{y_k (b_k + x_k^T w_k) < 0\}$. SVM objective: $\sum_{i=1}^n \max\{0, 1 - y_i (b + x_i^T w)\} + \lambda \|w\|_2^2 = \sum_{i=1}^n \ell_i(w, b)$, with subgradients $\nabla_w \ell_i(w, b) = -x_i y_i + \frac{2\lambda}{n} w$ if $y_i (b + x_i^T w) < 1$ and $0$ otherwise; $\nabla_b \ell_i(w, b) = -y_i$ if $y_i (b + x_i^T w) < 1$ and $0$ otherwise. The Perceptron is just SGD on the SVM objective with $\lambda = 0$, $\eta = 1$!
61 SVMs vs Logistic Regression. We often want probabilities/confidences; logistic wins here?
62 SVMs vs Logistic Regression. We often want probabilities/confidences; logistic wins here? No! Perform isotonic regression or a non-parametric bootstrap for probability calibration: the predictor gives some score; how do we transform that score into a probability?
63 SVMs vs Logistic Regression. We often want probabilities/confidences; logistic wins here? No! Perform isotonic regression or a non-parametric bootstrap for probability calibration: the predictor gives some score; how do we transform that score into a probability? For classification loss, logistic and SVM are comparable. Multiclass setting: softmax naturally generalizes logistic regression; SVMs have ... What about good old least squares?
64 What about multiple classes?
More information