Warm up

Regrade requests are submitted directly in Gradescope; do not email the instructors.

1 float in NumPy = 8 bytes; $10^6 \approx 2^{20}$ bytes = 1 MB; $10^9 \approx 2^{30}$ bytes = 1 GB.

For each block, compute the memory required in terms of n, p, d.
If d << p << n, which is the most memory-efficient program (blue, green, red)?
If you have unlimited memory, which do you think is the fastest program?
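The blue/green/red programs are not reproduced here, but the unit arithmetic can be checked directly in NumPy. A minimal sketch follows; the sizes n, p, d are hypothetical, chosen only to satisfy d << p << n.

```python
import numpy as np

# Back-of-the-envelope memory arithmetic for float64 arrays:
# an (n x d) matrix of 8-byte floats takes about 8*n*d bytes.
n, p, d = 10**6, 10**4, 10**2   # hypothetical sizes with d << p << n

print(np.zeros((5, 3)).nbytes)                       # 5*3*8 = 120 bytes
print(f"n x d matrix: {8 * n * d / 2**30:.2f} GB")   # ~0.75 GB
print(f"n x p matrix: {8 * n * p / 2**30:.2f} GB")   # ~74.5 GB
print(f"p x p matrix: {8 * p * p / 2**30:.2f} GB")   # ~0.75 GB
```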
Gradient Descent
Machine Learning CSE546
Kevin Jamieson
University of Washington
October 18, 2016
Machine Learning Problems

Have a bunch of iid data of the form: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$

Learning a model's parameters: $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.
Machine Learning Problems

Have a bunch of iid data of the form: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$

Learning a model's parameters: $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.

f convex: $f(\lambda x + (1-\lambda) y) \le \lambda f(x) + (1-\lambda) f(y) \quad \forall x, y,\ \lambda \in [0, 1]$

Equivalently, for differentiable f: $f(y) \ge f(x) + \nabla f(x)^T (y - x) \quad \forall x, y$

g is a subgradient at x if $f(y) \ge f(x) + g^T (y - x) \quad \forall y$
Machine Learning Problems

Have a bunch of iid data of the form: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$

Learning a model's parameters: $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.

Logistic loss: $\ell_i(w) = \log(1 + \exp(-y_i x_i^T w))$
Squared error loss: $\ell_i(w) = (y_i - x_i^T w)^2$
Least squares

Have a bunch of iid data of the form: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$

Learning a model's parameters: $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.

Squared error loss: $\ell_i(w) = (y_i - x_i^T w)^2$

How does software solve $\frac{1}{2}\|Xw - y\|_2^2$?
Least squares

Have a bunch of iid data of the form: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$

Learning a model's parameters: $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.

Squared error loss: $\ell_i(w) = (y_i - x_i^T w)^2$

How does software solve $\frac{1}{2}\|Xw - y\|_2^2$?
It's complicated (LAPACK, BLAS, MKL, ...). It depends on:
Do you need high precision?
Is X column/row sparse?
Is $\hat{w}_{LS}$ sparse?
Is $X^T X$ well-conditioned?
Can $X^T X$ fit in cache/memory?
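As a rough illustration (not the course's prescribed recipe), here is a minimal NumPy sketch of two of those routes on a small, dense, well-conditioned problem; the data is synthetic.

```python
import numpy as np

# Two ways software might solve (1/2)||Xw - y||_2^2 on a small dense problem.
rng = np.random.default_rng(0)
n, d = 10_000, 50
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# 1) QR/SVD-based solver (LAPACK under the hood): robust to ill-conditioning.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# 2) Normal equations: form the d x d matrix X^T X and solve; cheap when d is
#    small, but squares the condition number of X.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

print(np.allclose(w_lstsq, w_normal, atol=1e-6))   # agree on this easy problem
```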
Taylor Series Approximation

Taylor series in one dimension:
$f(x + \delta) = f(x) + f'(x)\,\delta + \tfrac{1}{2} f''(x)\,\delta^2 + \ldots$

Gradient descent:
Taylor Series Approximation

Taylor series in d dimensions:
$f(x + v) = f(x) + \nabla f(x)^T v + \tfrac{1}{2} v^T \nabla^2 f(x)\, v + \ldots$

Gradient descent:
Gradient Descent

$f(w) = \tfrac{1}{2}\|Xw - y\|_2^2$

$w_{t+1} = w_t - \eta\, \nabla f(w_t)$

$\nabla f(w) =$
Gradient Descent

$f(w) = \tfrac{1}{2}\|Xw - y\|_2^2$

$w_{t+1} = w_t - \eta\, \nabla f(w_t)$

$\nabla f(w) = X^T(Xw - y)$

$w^* = \arg\min_w f(w) \implies \nabla f(w^*) = 0$

$w_{t+1} - w^* = w_t - w^* - \eta\, \nabla f(w_t)$
$\qquad = w_t - w^* - \eta\, (\nabla f(w_t) - \nabla f(w^*))$
$\qquad = w_t - w^* - \eta\, X^T X (w_t - w^*)$
$\qquad = (I - \eta X^T X)(w_t - w^*)$
$\qquad = (I - \eta X^T X)^{t+1}(w_0 - w^*)$
Gradient Descent

$f(w) = \tfrac{1}{2}\|Xw - y\|_2^2$, $\quad w_{t+1} = w_t - \eta\, \nabla f(w_t)$

$(w_{t+1} - w^*) = (I - \eta X^T X)(w_t - w^*) = (I - \eta X^T X)^{t+1}(w_0 - w^*)$

Example: $X = \begin{bmatrix} 10^3 & 0 \\ 0 & 1 \end{bmatrix}$, $\quad y = \begin{bmatrix} 10^3 \\ 1 \end{bmatrix}$, $\quad w_0 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$, $\quad w^* =$
Gradient Descent

$f(w) = \tfrac{1}{2}\|Xw - y\|_2^2$, $\quad w_{t+1} = w_t - \eta\, \nabla f(w_t)$

$(w_{t+1} - w^*) = (I - \eta X^T X)(w_t - w^*) = (I - \eta X^T X)^{t+1}(w_0 - w^*)$

Example: $X = \begin{bmatrix} 10^3 & 0 \\ 0 & 1 \end{bmatrix}$, $\quad y = \begin{bmatrix} 10^3 \\ 1 \end{bmatrix}$, $\quad w_0 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$, $\quad w^* = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$, $\quad X^T X = \begin{bmatrix} 10^6 & 0 \\ 0 & 1 \end{bmatrix}$

Pick $\eta$ such that $\max\{|1 - \eta\, 10^6|,\ |1 - \eta|\} < 1$:

$|w_{t+1,1} - w_{*,1}| = |1 - \eta\, 10^6|^{t+1}\, |w_{0,1} - w_{*,1}| = |1 - \eta\, 10^6|^{t+1}$
$|w_{t+1,2} - w_{*,2}| = |1 - \eta|^{t+1}\, |w_{0,2} - w_{*,2}| = |1 - \eta|^{t+1}$
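A small numerical sketch of this example: the step size below is chosen so that the first coordinate's error vanishes immediately, which makes the slow second coordinate easy to see.

```python
import numpy as np

# Gradient descent on the example above: X = diag(10^3, 1), y = (10^3, 1),
# so w* = (1, 1) and X^T X = diag(10^6, 1).  Stability requires
# max{|1 - eta*1e6|, |1 - eta|} < 1, i.e. eta < 2e-6.
X = np.diag([1e3, 1.0])
y = np.array([1e3, 1.0])

eta = 1e-6                      # 1 - eta*1e6 = 0: coordinate 1 converges at once
w = np.zeros(2)
for t in range(1000):
    w = w - eta * X.T @ (X @ w - y)

print(w)                        # roughly [1.0, 0.001]: coordinate 2 barely moves
```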
Taylor Series Approximation

Taylor series in one dimension:
$f(x + \delta) = f(x) + f'(x)\,\delta + \tfrac{1}{2} f''(x)\,\delta^2 + \ldots$

Newton's method:
Taylor Series Approximation

Taylor series in d dimensions:
$f(x + v) = f(x) + \nabla f(x)^T v + \tfrac{1}{2} v^T \nabla^2 f(x)\, v + \ldots$

Newton's method:
Newton's Method

$f(w) = \tfrac{1}{2}\|Xw - y\|_2^2$

$\nabla f(w) =$
$\nabla^2 f(w) =$

$v_t$ is the solution to: $\nabla^2 f(w_t)\, v_t = -\nabla f(w_t)$

$w_{t+1} = w_t + \eta\, v_t$
Newton's Method

$f(w) = \tfrac{1}{2}\|Xw - y\|_2^2$

$\nabla f(w) = X^T(Xw - y)$
$\nabla^2 f(w) = X^T X$

$v_t$ is the solution to: $\nabla^2 f(w_t)\, v_t = -\nabla f(w_t)$

$w_{t+1} = w_t + \eta\, v_t$

For quadratics, Newton's method can converge in one step! (No surprise, why?)

$w_1 = w_0 - \eta\, (X^T X)^{-1} X^T (X w_0 - y) = (1 - \eta)\, w_0 + \eta\, (X^T X)^{-1} X^T y = (1 - \eta)\, w_0 + \eta\, w^*$

In general, for $w_t$ close enough to $w^*$ one should use $\eta = 1$.
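A quick numerical confirmation of the one-step claim, sketched on synthetic data:

```python
import numpy as np

# For the quadratic f(w) = (1/2)||Xw - y||_2^2 the Hessian is X^T X, so one
# Newton step with eta = 1 lands exactly on the least-squares solution w*.
rng = np.random.default_rng(1)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w0 = np.zeros(d)
grad = X.T @ (X @ w0 - y)
hess = X.T @ X
v = np.linalg.solve(hess, -grad)     # Newton direction: solve  H v = -grad
w1 = w0 + 1.0 * v                    # eta = 1

w_star = np.linalg.solve(hess, X.T @ y)
print(np.allclose(w1, w_star))       # True
```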
General case

In general, for Newton's method to achieve $f(w_t) - f(w^*) \le \epsilon$:

So why are ML problems overwhelmingly solved by gradient methods?

Hint: $v_t$ is the solution to $\nabla^2 f(w_t)\, v_t = -\nabla f(w_t)$
General Convex case

Target: $f(w_t) - f(w^*) \le \epsilon$

Clean convergence rates with nice proofs: Bubeck

Newton's method: $t \approx \log(\log(1/\epsilon))$

Gradient descent:
f is smooth and strongly convex ($aI \preceq \nabla^2 f(w) \preceq bI$):
f is smooth ($\nabla^2 f(w) \preceq bI$):
f is potentially non-differentiable ($\|\nabla f(w)\|_2 \le c$):

Nocedal + Wright, Bubeck

Other: BFGS, heavy-ball, BCD, SVRG, ADAM, Adagrad, ...
Revisiting Logistic Regression
Machine Learning CSE546
Kevin Jamieson
University of Washington
October 18, 2016
Loss function: Conditional Likelihood

Have a bunch of iid data of the form: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \{-1, 1\}$

$P(Y = y \mid x, w) = \dfrac{1}{1 + \exp(-y\, w^T x)}$

$\hat{w}_{MLE} = \arg\max_w \prod_{i=1}^n P(y_i \mid x_i, w) = \arg\min_w \sum_{i=1}^n \log(1 + \exp(-y_i x_i^T w)) =: \arg\min_w f(w)$

$\nabla f(w) =$
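The gradient is left blank above for the lecture; as a reference sketch on synthetic data, the per-example gradient of $\log(1 + \exp(-y_i x_i^T w))$ works out to $-y_i x_i / (1 + \exp(y_i x_i^T w))$, which the finite-difference check below confirms numerically.

```python
import numpy as np

# Sketch: f(w) = sum_i log(1 + exp(-y_i x_i^T w)) and its gradient, checked
# against central finite differences.  Data below is synthetic.
rng = np.random.default_rng(2)
n, d = 100, 4
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)

def f(w):
    return np.sum(np.log1p(np.exp(-y * (X @ w))))

def grad_f(w):
    # d/dw log(1 + exp(-y_i x_i^T w)) = -y_i x_i / (1 + exp(y_i x_i^T w))
    s = 1.0 / (1.0 + np.exp(y * (X @ w)))
    return -X.T @ (y * s)

w = rng.standard_normal(d)
eps = 1e-6
g_fd = np.array([(f(w + eps * e) - f(w - eps * e)) / (2 * eps) for e in np.eye(d)])
print(np.allclose(grad_f(w), g_fd, atol=1e-4))   # True
```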
Stochastic Gradient Descent
Machine Learning CSE546
Kevin Jamieson
University of Washington
October 18, 2016
Stochastic Gradient Descent

Have a bunch of iid data of the form: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$

Learning a model's parameters: $\frac{1}{n}\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.
Stochastic Gradient Descent

Have a bunch of iid data of the form: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$

Learning a model's parameters: $\frac{1}{n}\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.

Gradient Descent:
$w_{t+1} = w_t - \eta\, \nabla_w \left( \frac{1}{n} \sum_{i=1}^n \ell_i(w) \right) \Big|_{w = w_t}$
Stochastic Gradient Descent

Have a bunch of iid data of the form: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$

Learning a model's parameters: $\frac{1}{n}\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.

Gradient Descent:
$w_{t+1} = w_t - \eta\, \nabla_w \left( \frac{1}{n} \sum_{i=1}^n \ell_i(w) \right) \Big|_{w = w_t}$

Stochastic Gradient Descent:
$w_{t+1} = w_t - \eta\, \nabla_w \ell_{I_t}(w) \big|_{w = w_t}$, with $I_t$ drawn uniformly at random from $\{1, \ldots, n\}$

$\mathbb{E}[\nabla \ell_{I_t}(w)] =$
Stochastic Gradient Descent

Theorem: Let $w_{t+1} = w_t - \eta\, \nabla_w \ell_{I_t}(w)\big|_{w = w_t}$ with $I_t$ drawn uniformly at random from $\{1, \ldots, n\}$, so that
$\mathbb{E}[\nabla \ell_{I_t}(w)] = \frac{1}{n}\sum_{i=1}^n \nabla \ell_i(w) =: \nabla \ell(w)$.

If $\sup_w \max_i \|\nabla \ell_i(w)\|_2^2 \le G$, $\ \|w_1 - w^*\|_2^2 \le R$, and $\eta = \sqrt{\frac{R}{G T}}$, then

$\mathbb{E}[\ell(\bar{w}) - \ell(w^*)] \le \dfrac{R}{2 T \eta} + \dfrac{\eta G}{2} \le \sqrt{\dfrac{R G}{T}}$, where $\bar{w} = \dfrac{1}{T}\sum_{t=1}^T w_t$.

(In practice use the last iterate.)
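A minimal sketch of the setup in the theorem (a single uniformly sampled index per step and an averaged iterate); the least-squares losses and constants below are illustrative stand-ins for a generic convex sum.

```python
import numpy as np

# SGD with I_t uniform on {1,...,n} and averaged iterate w_bar = (1/T) sum_t w_t.
rng = np.random.default_rng(3)
n, d = 500, 10
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def grad_i(w, i):
    # gradient of ell_i(w) = (y_i - x_i^T w)^2
    return -2.0 * (y[i] - X[i] @ w) * X[i]

T, eta = 20_000, 1e-3            # fixed step size, as in the theorem
w = np.zeros(d)
w_bar = np.zeros(d)
for t in range(T):
    i = rng.integers(n)          # I_t drawn uniformly at random
    w = w - eta * grad_i(w, i)
    w_bar += w / T               # running average of the iterates

w_star = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.linalg.norm(w_bar - w_star))   # small, but only O(1/sqrt(T))-small
```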
Stochastic Gradient Descent

Proof:
$\mathbb{E}[\|w_{t+1} - w^*\|_2^2] = \mathbb{E}[\|w_t - \eta\, \nabla \ell_{I_t}(w_t) - w^*\|_2^2]$
Stochastic Gradient Descent

Proof:
$\mathbb{E}[\|w_{t+1} - w^*\|_2^2] = \mathbb{E}[\|w_t - \eta\, \nabla \ell_{I_t}(w_t) - w^*\|_2^2]$
$= \mathbb{E}[\|w_t - w^*\|_2^2] - 2\eta\, \mathbb{E}[\nabla \ell_{I_t}(w_t)^T (w_t - w^*)] + \eta^2\, \mathbb{E}[\|\nabla \ell_{I_t}(w_t)\|_2^2]$
$\le \mathbb{E}[\|w_t - w^*\|_2^2] - 2\eta\, \mathbb{E}[\ell(w_t) - \ell(w^*)] + \eta^2 G$

since
$\mathbb{E}[\nabla \ell_{I_t}(w_t)^T (w_t - w^*)] = \mathbb{E}\big[\mathbb{E}[\nabla \ell_{I_t}(w_t)^T (w_t - w^*) \mid I_1, w_1, \ldots, I_{t-1}, w_{t-1}]\big] = \mathbb{E}[\nabla \ell(w_t)^T (w_t - w^*)] \ge \mathbb{E}[\ell(w_t) - \ell(w^*)]$

Summing over t and rearranging:
$\sum_{t=1}^T \mathbb{E}[\ell(w_t) - \ell(w^*)] \le \dfrac{1}{2\eta}\big(\mathbb{E}[\|w_1 - w^*\|_2^2] - \mathbb{E}[\|w_{T+1} - w^*\|_2^2]\big) + \dfrac{T \eta G}{2} \le \dfrac{R}{2\eta} + \dfrac{T \eta G}{2}$
Stochastic Gradient Descent

Proof (continued):
Jensen's inequality: for any random $Z \in \mathbb{R}^d$ and convex function $\phi: \mathbb{R}^d \to \mathbb{R}$, $\ \phi(\mathbb{E}[Z]) \le \mathbb{E}[\phi(Z)]$.

$\mathbb{E}[\ell(\bar{w}) - \ell(w^*)] \le \dfrac{1}{T} \sum_{t=1}^T \mathbb{E}[\ell(w_t) - \ell(w^*)]$, where $\bar{w} = \dfrac{1}{T} \sum_{t=1}^T w_t$
Stochastic Gradient Descent

Proof (continued):
Jensen's inequality: for any random $Z \in \mathbb{R}^d$ and convex function $\phi: \mathbb{R}^d \to \mathbb{R}$, $\ \phi(\mathbb{E}[Z]) \le \mathbb{E}[\phi(Z)]$.

$\mathbb{E}[\ell(\bar{w}) - \ell(w^*)] \le \dfrac{1}{T} \sum_{t=1}^T \mathbb{E}[\ell(w_t) - \ell(w^*)]$, where $\bar{w} = \dfrac{1}{T} \sum_{t=1}^T w_t$

Combining with the previous slide and choosing $\eta = \sqrt{\frac{R}{GT}}$:
$\mathbb{E}[\ell(\bar{w}) - \ell(w^*)] \le \dfrac{R}{2 T \eta} + \dfrac{\eta G}{2} \le \sqrt{\dfrac{R G}{T}}$
Stochastic Gradient Descent: A Learning Perspective
Machine Learning CSE546
Kevin Jamieson
University of Washington
October 18, 2016
Learning Problems as Expectations

Minimizing loss on training data:
Given dataset: sampled iid from some distribution p(x) on features
Loss function, e.g., hinge loss, logistic loss, ...
We often minimize loss on the training data:
However, we should really minimize the expected loss on all data:
So, we are approximating the integral by the average over the training data.

Sham Kakade 2016
Gradient Descent in Terms of Expectations

True objective function:
Taking the gradient:
True gradient descent rule:
How do we estimate the expected gradient?

Sham Kakade 2016
SGD: Stochastic Gradient Descent

True gradient:
Sample-based approximation:
What if we estimate the gradient with just one sample???
Unbiased estimate of the gradient
Very noisy!
Also called stochastic gradient descent (among many other names)
VERY useful in practice!!!

Sham Kakade 2016
Perceptron
Machine Learning CSE546
Kevin Jamieson
University of Washington
October 18, 2018
Online learning

Click prediction for ads is a streaming data task:
User enters a query, and an ad must be selected: observe x_j and predict y_j
User either clicks or doesn't click on the ad: label y_j is revealed afterwards
Google gets a reward if the user clicks on the ad
Update the model for next time
Online classification

New point arrives at time k
The Perceptron Algorithm [Rosenblatt '58, '62]

Classification setting: y in {-1, +1}
Linear model
  Prediction:
Training:
  Initialize weight vector:
  At each time step:
    Observe features:
    Make prediction:
    Observe true class:
    Update model: if prediction is not equal to truth
The Perceptron Algorithm [Rosenblatt '58, '62]

Classification setting: y in {-1, +1}
Linear model
  Prediction: $\mathrm{sign}(w^T x_i + b)$
Training:
  Initialize weight vector: $w_0 = 0,\ b_0 = 0$
  At each time step:
    Observe features: $x_k$
    Make prediction: $\mathrm{sign}(x_k^T w_k + b_k)$
    Observe true class: $y_k$
    Update model: if prediction is not equal to truth,
    $\begin{bmatrix} w_{k+1} \\ b_{k+1} \end{bmatrix} = \begin{bmatrix} w_k \\ b_k \end{bmatrix} + y_k \begin{bmatrix} x_k \\ 1 \end{bmatrix}$
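A direct transcription of this loop in NumPy; the synthetic, separable data-generating choices below are illustrative.

```python
import numpy as np

# Perceptron on synthetic, linearly separable data.
rng = np.random.default_rng(4)
X = rng.standard_normal((300, 2))
scores = X @ np.array([1.0, -2.0]) + 0.5
keep = np.abs(scores) > 0.2               # enforce a positive margin
X, y = X[keep], np.sign(scores[keep])

w, b = np.zeros(2), 0.0
mistakes = 0
converged = False
while not converged:                      # separable => finitely many mistakes
    converged = True
    for k in range(len(y)):
        if y[k] * (X[k] @ w + b) <= 0:    # prediction disagrees with truth
            w, b = w + y[k] * X[k], b + y[k]
            mistakes += 1
            converged = False

print(mistakes, np.all(np.sign(X @ w + b) == y))
```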
Rosenblatt 1957

"the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence."
The New York Times, 1958
Linear Separability

Perceptron guaranteed to converge if the data is linearly separable:
Perceptron Analysis: Linearly Separable Case

Theorem [Block, Novikoff]:
Given a sequence of labeled examples:
Each feature vector has bounded norm:
If the dataset is linearly separable:
Then the number of mistakes made by the online perceptron on any such sequence is bounded by
Beyond the Linearly Separable Case

The perceptron algorithm is super cool!
No assumption about the data distribution!
  Could be generated by an oblivious adversary, no need to be iid
Makes a fixed number of mistakes, and it's done forever!
  Even if you see infinite data
Beyond the Linearly Separable Case

The perceptron algorithm is super cool!
No assumption about the data distribution!
  Could be generated by an oblivious adversary, no need to be iid
Makes a fixed number of mistakes, and it's done forever!
  Even if you see infinite data

The perceptron is useless in practice!
The real world is not linearly separable
  If the data is not separable, it cycles forever, and this is hard to detect
  Even if separable, it may not give good generalization accuracy (small margin)
What is the Perceptron Doing???

When we discussed logistic regression:
  Started from maximizing the conditional log-likelihood

When we discussed the perceptron:
  Started from a description of an algorithm

What is the perceptron optimizing????
Support Vector Machines
Machine Learning CSE546
Kevin Jamieson
University of Washington
October 18, 2018
Linear classifiers

Which line is better?
Pick the one with the largest margin!

Hyperplane: $x^T w + b = 0$, with margin $2\gamma$
Pick the one with the largest margin!

Hyperplane: $x^T w + b = 0$

Distance from $x_0$ to the hyperplane defined by $x^T w + b = 0$?
Pick the one with the largest margin!

Hyperplane: $x^T w + b = 0$

Distance from $x_0$ to the hyperplane defined by $x^T w + b = 0$?

If $\tilde{x}_0$ is the projection of $x_0$ onto the hyperplane, then
$\|x_0 - \tilde{x}_0\|_2 = \left| \dfrac{(x_0 - \tilde{x}_0)^T w}{\|w\|_2} \right| = \dfrac{1}{\|w\|_2} \left| x_0^T w - \tilde{x}_0^T w \right| = \dfrac{1}{\|w\|_2} \left| x_0^T w + b \right|$
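A small numerical check of this identity; the particular w, b, and x_0 below are arbitrary.

```python
import numpy as np

# Project x0 onto the hyperplane {x : x^T w + b = 0} and compare the residual
# norm with |x0^T w + b| / ||w||_2.
w = np.array([3.0, -1.0, 2.0])
b = 0.7
x0 = np.array([1.0, 2.0, -0.5])

x0_proj = x0 - ((x0 @ w + b) / (w @ w)) * w       # projection onto the hyperplane
print(np.isclose(x0_proj @ w + b, 0.0))           # True: it lies on the hyperplane
print(np.linalg.norm(x0 - x0_proj), abs(x0 @ w + b) / np.linalg.norm(w))  # equal
```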
Pick the one with the largest margin!

Hyperplane: $x^T w + b = 0$, with margin $2\gamma$

Distance of $x_0$ from the hyperplane $x^T w + b = 0$: $\ \dfrac{1}{\|w\|_2} |x_0^T w + b|$

Optimal hyperplane:
$\max_{w, b}\ \gamma$ subject to $\dfrac{1}{\|w\|_2}\, y_i (x_i^T w + b) \ge \gamma \quad \forall i$
Pick the one with the largest margin!

Hyperplane: $x^T w + b = 0$, with margin $2\gamma$

Distance of $x_0$ from the hyperplane $x^T w + b = 0$: $\ \dfrac{1}{\|w\|_2} |x_0^T w + b|$

Optimal hyperplane:
$\max_{w, b}\ \gamma$ subject to $\dfrac{1}{\|w\|_2}\, y_i (x_i^T w + b) \ge \gamma \quad \forall i$

Optimal hyperplane (reparameterized):
$\min_{w, b}\ \|w\|_2^2$ subject to $y_i (x_i^T w + b) \ge 1 \quad \forall i$
Pick the one with the largest margin!

Hyperplane: $x^T w + b = 0$, with margin $2\gamma$

Optimal hyperplane (reparameterized):
$\min_{w, b}\ \|w\|_2^2$ subject to $y_i (x_i^T w + b) \ge 1 \quad \forall i$

Solve efficiently by many methods, e.g.:
Quadratic programming (QP): well-studied solution algorithms
Stochastic gradient descent
Coordinate descent (in the dual)
What if the data is still not linearly separable?

Hyperplane: $x^T w + b = 0$ (margin boundaries at distance $1/\|w\|_2$ on each side)

If the data is linearly separable:
$\min_{w, b}\ \|w\|_2^2$ subject to $y_i (x_i^T w + b) \ge 1 \quad \forall i$
What if the data is still not linearly separable?

Hyperplane: $x^T w + b = 0$ (margin boundaries at distance $1/\|w\|_2$ on each side)

If the data is linearly separable:
$\min_{w, b}\ \|w\|_2^2$ subject to $y_i (x_i^T w + b) \ge 1 \quad \forall i$

If the data is not linearly separable, some points don't satisfy the margin constraint; introduce slack:
$\min_{w, b, \xi}\ \|w\|_2^2$ subject to $y_i (x_i^T w + b) \ge 1 - \xi_i \ \ \forall i$, $\quad \xi_i \ge 0$, $\quad \sum_{j=1}^n \xi_j \le \nu$
What if the data is still not linearly separable?

Hyperplane: $x^T w + b = 0$ (margin boundaries at distance $1/\|w\|_2$ on each side)

If the data is linearly separable:
$\min_{w, b}\ \|w\|_2^2$ subject to $y_i (x_i^T w + b) \ge 1 \quad \forall i$

If the data is not linearly separable, some points don't satisfy the margin constraint; introduce slack:
$\min_{w, b, \xi}\ \|w\|_2^2$ subject to $y_i (x_i^T w + b) \ge 1 - \xi_i \ \ \forall i$, $\quad \xi_i \ge 0$, $\quad \sum_{j=1}^n \xi_j \le \nu$

What are support vectors?
SVM as penalization method

Original quadratic program with linear constraints:
$\min_{w, b, \xi}\ \|w\|_2^2$ subject to $y_i (x_i^T w + b) \ge 1 - \xi_i \ \ \forall i$, $\quad \xi_i \ge 0$, $\quad \sum_{j=1}^n \xi_j \le \nu$
SVM as penalization method

Original quadratic program with linear constraints:
$\min_{w, b, \xi}\ \|w\|_2^2$ subject to $y_i (x_i^T w + b) \ge 1 - \xi_i \ \ \forall i$, $\quad \xi_i \ge 0$, $\quad \sum_{j=1}^n \xi_j \le \nu$

Using the same constrained convex optimization trick as for the lasso: for any $\nu \ge 0$ there exists a $\lambda \ge 0$ such that the solution of the following problem is equivalent:
$\sum_{i=1}^n \max\{0,\ 1 - y_i (b + x_i^T w)\} + \lambda \|w\|_2^2$
Machine Learning Problems

Have a bunch of iid data of the form: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$

Learning a model's parameters: $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.

Hinge loss: $\ell_i(w) = \max\{0,\ 1 - y_i x_i^T w\}$
Logistic loss: $\ell_i(w) = \log(1 + \exp(-y_i x_i^T w))$
Squared error loss: $\ell_i(w) = (y_i - x_i^T w)^2$

How do we solve for w? The last two lectures!
Perceptron is optimizing what?

Perceptron update rule:
$\begin{bmatrix} w_{k+1} \\ b_{k+1} \end{bmatrix} = \begin{bmatrix} w_k \\ b_k \end{bmatrix} + y_k \begin{bmatrix} x_k \\ 1 \end{bmatrix} \mathbf{1}\{y_k (b_k + x_k^T w_k) < 0\}$

SVM objective:
$\sum_{i=1}^n \max\{0,\ 1 - y_i (b + x_i^T w)\} + \lambda \|w\|_2^2 = \sum_{i=1}^n \ell_i(w, b)$

$\nabla_w \ell_i(w, b) = \begin{cases} -x_i y_i + \frac{2\lambda}{n} w & \text{if } y_i (b + x_i^T w) < 1 \\ 0 & \text{otherwise} \end{cases}$

$\nabla_b \ell_i(w, b) = \begin{cases} -y_i & \text{if } y_i (b + x_i^T w) < 1 \\ 0 & \text{otherwise} \end{cases}$

Perceptron is just SGD on the SVM objective with $\lambda = 0$, $\eta = 1$!
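A sketch of SGD on this penalized objective, using exactly the per-example subgradients above; the data and hyperparameters are illustrative.

```python
import numpy as np

# SGD on  sum_i max{0, 1 - y_i(b + x_i^T w)} + lambda * ||w||_2^2,
# using the per-example subgradients given above.
rng = np.random.default_rng(5)
n, d = 400, 2
X = rng.standard_normal((n, d))
y = np.sign(X @ np.array([2.0, -1.0]) + 0.3)

lam, eta, T = 0.1, 0.01, 50_000
w, b = np.zeros(d), 0.0
for t in range(T):
    i = rng.integers(n)
    if y[i] * (X[i] @ w + b) < 1:         # margin violated: subgradient is active
        w = w - eta * (-y[i] * X[i] + 2 * lam / n * w)
        b = b - eta * (-y[i])
    # otherwise the subgradient above is zero, so no update

print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```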
SVMs vs logistic regression

We often want probabilities/confidences. Logistic wins here?
SVMs vs logistic regression

We often want probabilities/confidences. Logistic wins here?

No! Perform isotonic regression or use a non-parametric bootstrap for probability calibration. The predictor gives some score; how do we transform that score into a probability?
SVMs vs logistic regression

We often want probabilities/confidences. Logistic wins here?

No! Perform isotonic regression or use a non-parametric bootstrap for probability calibration. The predictor gives some score; how do we transform that score into a probability?

For classification loss, logistic and SVM are comparable.

Multiclass setting:
Softmax naturally generalizes logistic regression
SVMs have:

What about good old least squares?
What about multiple classes?