Warm up

Regrade requests are submitted directly in Gradescope; do not email the instructors.

1 float in NumPy = 8 bytes; $10^6 \approx 2^{20}$ bytes = 1 MB; $10^9 \approx 2^{30}$ bytes = 1 GB.

For each block, compute the memory required in terms of n, p, d.
If d << p << n, which is the most memory-efficient program (blue, green, red)?
If you have unlimited memory, which do you think is the fastest program?
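The blue/green/red programs are not reproduced here, but the unit arithmetic can be checked directly in NumPy. A minimal sketch follows; the sizes n, p, d are hypothetical, chosen only to satisfy d << p << n.

```python
import numpy as np

# Back-of-the-envelope memory arithmetic for float64 arrays:
# an (n x d) matrix of 8-byte floats takes about 8*n*d bytes.
n, p, d = 10**6, 10**4, 10**2   # hypothetical sizes with d << p << n

print(np.zeros((5, 3)).nbytes)                       # 5*3*8 = 120 bytes
print(f"n x d matrix: {8 * n * d / 2**30:.2f} GB")   # ~0.75 GB
print(f"n x p matrix: {8 * n * p / 2**30:.2f} GB")   # ~74.5 GB
print(f"p x p matrix: {8 * p * p / 2**30:.2f} GB")   # ~0.75 GB
```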
Gradient Descent
Machine Learning CSE546
Kevin Jamieson
University of Washington
October 18, 2016
Machine Learning Problems

Have a bunch of iid data of the form: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$

Learning a model's parameters: $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.
Machine Learning Problems

Have a bunch of iid data of the form: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$

Learning a model's parameters: $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.

f convex: $f(\lambda x + (1-\lambda) y) \le \lambda f(x) + (1-\lambda) f(y) \quad \forall x, y,\ \lambda \in [0, 1]$

Equivalently, for differentiable f: $f(y) \ge f(x) + \nabla f(x)^T (y - x) \quad \forall x, y$

g is a subgradient at x if $f(y) \ge f(x) + g^T (y - x) \quad \forall y$
Machine Learning Problems

Have a bunch of iid data of the form: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$

Learning a model's parameters: $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.

Logistic loss: $\ell_i(w) = \log(1 + \exp(-y_i x_i^T w))$
Squared error loss: $\ell_i(w) = (y_i - x_i^T w)^2$
Least squares

Have a bunch of iid data of the form: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$

Learning a model's parameters: $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.

Squared error loss: $\ell_i(w) = (y_i - x_i^T w)^2$

How does software solve $\frac{1}{2}\|Xw - y\|_2^2$?
Least squares

Have a bunch of iid data of the form: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$

Learning a model's parameters: $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.

Squared error loss: $\ell_i(w) = (y_i - x_i^T w)^2$

How does software solve $\frac{1}{2}\|Xw - y\|_2^2$?
It's complicated (LAPACK, BLAS, MKL, ...). It depends on:
Do you need high precision?
Is X column/row sparse?
Is $\hat{w}_{LS}$ sparse?
Is $X^T X$ well-conditioned?
Can $X^T X$ fit in cache/memory?
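As a rough illustration (not the course's prescribed recipe), here is a minimal NumPy sketch of two of those routes on a small, dense, well-conditioned problem; the data is synthetic.

```python
import numpy as np

# Two ways software might solve (1/2)||Xw - y||_2^2 on a small dense problem.
rng = np.random.default_rng(0)
n, d = 10_000, 50
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# 1) QR/SVD-based solver (LAPACK under the hood): robust to ill-conditioning.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# 2) Normal equations: form the d x d matrix X^T X and solve; cheap when d is
#    small, but squares the condition number of X.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

print(np.allclose(w_lstsq, w_normal, atol=1e-6))   # agree on this easy problem
```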
Taylor Series Approximation

Taylor series in one dimension:
$f(x + \delta) = f(x) + f'(x)\,\delta + \tfrac{1}{2} f''(x)\,\delta^2 + \ldots$

Gradient descent:
Taylor Series Approximation

Taylor series in d dimensions:
$f(x + v) = f(x) + \nabla f(x)^T v + \tfrac{1}{2} v^T \nabla^2 f(x)\, v + \ldots$

Gradient descent:
Gradient Descent

$f(w) = \tfrac{1}{2}\|Xw - y\|_2^2$

$w_{t+1} = w_t - \eta\, \nabla f(w_t)$

$\nabla f(w) =$
Gradient Descent

$f(w) = \tfrac{1}{2}\|Xw - y\|_2^2$

$w_{t+1} = w_t - \eta\, \nabla f(w_t)$

$\nabla f(w) = X^T(Xw - y)$

$w^* = \arg\min_w f(w) \implies \nabla f(w^*) = 0$

$w_{t+1} - w^* = w_t - w^* - \eta\, \nabla f(w_t)$
$\qquad = w_t - w^* - \eta\, (\nabla f(w_t) - \nabla f(w^*))$
$\qquad = w_t - w^* - \eta\, X^T X (w_t - w^*)$
$\qquad = (I - \eta X^T X)(w_t - w^*)$
$\qquad = (I - \eta X^T X)^{t+1}(w_0 - w^*)$
Gradient Descent

$f(w) = \tfrac{1}{2}\|Xw - y\|_2^2$, $\quad w_{t+1} = w_t - \eta\, \nabla f(w_t)$

$(w_{t+1} - w^*) = (I - \eta X^T X)(w_t - w^*) = (I - \eta X^T X)^{t+1}(w_0 - w^*)$

Example: $X = \begin{bmatrix} 10^3 & 0 \\ 0 & 1 \end{bmatrix}$, $\quad y = \begin{bmatrix} 10^3 \\ 1 \end{bmatrix}$, $\quad w_0 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$, $\quad w^* =$
Gradient Descent

$f(w) = \tfrac{1}{2}\|Xw - y\|_2^2$, $\quad w_{t+1} = w_t - \eta\, \nabla f(w_t)$

$(w_{t+1} - w^*) = (I - \eta X^T X)(w_t - w^*) = (I - \eta X^T X)^{t+1}(w_0 - w^*)$

Example: $X = \begin{bmatrix} 10^3 & 0 \\ 0 & 1 \end{bmatrix}$, $\quad y = \begin{bmatrix} 10^3 \\ 1 \end{bmatrix}$, $\quad w_0 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$, $\quad w^* = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$, $\quad X^T X = \begin{bmatrix} 10^6 & 0 \\ 0 & 1 \end{bmatrix}$

Pick $\eta$ such that $\max\{|1 - \eta\, 10^6|,\ |1 - \eta|\} < 1$:

$|w_{t+1,1} - w_{*,1}| = |1 - \eta\, 10^6|^{t+1}\, |w_{0,1} - w_{*,1}| = |1 - \eta\, 10^6|^{t+1}$
$|w_{t+1,2} - w_{*,2}| = |1 - \eta|^{t+1}\, |w_{0,2} - w_{*,2}| = |1 - \eta|^{t+1}$
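A small numerical sketch of this example: the step size below is chosen so that the first coordinate's error vanishes immediately, which makes the slow second coordinate easy to see.

```python
import numpy as np

# Gradient descent on the example above: X = diag(10^3, 1), y = (10^3, 1),
# so w* = (1, 1) and X^T X = diag(10^6, 1).  Stability requires
# max{|1 - eta*1e6|, |1 - eta|} < 1, i.e. eta < 2e-6.
X = np.diag([1e3, 1.0])
y = np.array([1e3, 1.0])

eta = 1e-6                      # 1 - eta*1e6 = 0: coordinate 1 converges at once
w = np.zeros(2)
for t in range(1000):
    w = w - eta * X.T @ (X @ w - y)

print(w)                        # roughly [1.0, 0.001]: coordinate 2 barely moves
```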
Taylor Series Approximation

Taylor series in one dimension:
$f(x + \delta) = f(x) + f'(x)\,\delta + \tfrac{1}{2} f''(x)\,\delta^2 + \ldots$

Newton's method:
Taylor Series Approximation

Taylor series in d dimensions:
$f(x + v) = f(x) + \nabla f(x)^T v + \tfrac{1}{2} v^T \nabla^2 f(x)\, v + \ldots$

Newton's method:
Newton's Method

$f(w) = \tfrac{1}{2}\|Xw - y\|_2^2$

$\nabla f(w) =$
$\nabla^2 f(w) =$

$v_t$ is the solution to: $\nabla^2 f(w_t)\, v_t = -\nabla f(w_t)$

$w_{t+1} = w_t + \eta\, v_t$
Newton's Method

$f(w) = \tfrac{1}{2}\|Xw - y\|_2^2$

$\nabla f(w) = X^T(Xw - y)$
$\nabla^2 f(w) = X^T X$

$v_t$ is the solution to: $\nabla^2 f(w_t)\, v_t = -\nabla f(w_t)$

$w_{t+1} = w_t + \eta\, v_t$

For quadratics, Newton's method can converge in one step! (No surprise, why?)

$w_1 = w_0 - \eta\, (X^T X)^{-1} X^T (X w_0 - y) = (1 - \eta)\, w_0 + \eta\, (X^T X)^{-1} X^T y = (1 - \eta)\, w_0 + \eta\, w^*$

In general, for $w_t$ close enough to $w^*$ one should use $\eta = 1$.
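A quick numerical confirmation of the one-step claim, sketched on synthetic data:

```python
import numpy as np

# For the quadratic f(w) = (1/2)||Xw - y||_2^2 the Hessian is X^T X, so one
# Newton step with eta = 1 lands exactly on the least-squares solution w*.
rng = np.random.default_rng(1)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w0 = np.zeros(d)
grad = X.T @ (X @ w0 - y)
hess = X.T @ X
v = np.linalg.solve(hess, -grad)     # Newton direction: solve  H v = -grad
w1 = w0 + 1.0 * v                    # eta = 1

w_star = np.linalg.solve(hess, X.T @ y)
print(np.allclose(w1, w_star))       # True
```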
General case

In general, for Newton's method to achieve $f(w_t) - f(w^*) \le \epsilon$:

So why are ML problems overwhelmingly solved by gradient methods?

Hint: $v_t$ is the solution to $\nabla^2 f(w_t)\, v_t = -\nabla f(w_t)$
General Convex case

Target: $f(w_t) - f(w^*) \le \epsilon$

Clean convergence rates with nice proofs: Bubeck

Newton's method: $t \approx \log(\log(1/\epsilon))$

Gradient descent:
f is smooth and strongly convex ($aI \preceq \nabla^2 f(w) \preceq bI$):
f is smooth ($\nabla^2 f(w) \preceq bI$):
f is potentially non-differentiable ($\|\nabla f(w)\|_2 \le c$):

Nocedal + Wright, Bubeck

Other: BFGS, heavy-ball, BCD, SVRG, ADAM, Adagrad, ...
Revisiting Logistic Regression
Machine Learning CSE546
Kevin Jamieson
University of Washington
October 18, 2016
Loss function: Conditional Likelihood

Have a bunch of iid data of the form: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \{-1, 1\}$

$P(Y = y \mid x, w) = \dfrac{1}{1 + \exp(-y\, w^T x)}$

$\hat{w}_{MLE} = \arg\max_w \prod_{i=1}^n P(y_i \mid x_i, w) = \arg\min_w \sum_{i=1}^n \log(1 + \exp(-y_i x_i^T w)) =: \arg\min_w f(w)$

$\nabla f(w) =$
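The gradient is left blank above for the lecture; as a reference sketch on synthetic data, the per-example gradient of $\log(1 + \exp(-y_i x_i^T w))$ works out to $-y_i x_i / (1 + \exp(y_i x_i^T w))$, which the finite-difference check below confirms numerically.

```python
import numpy as np

# Sketch: f(w) = sum_i log(1 + exp(-y_i x_i^T w)) and its gradient, checked
# against central finite differences.  Data below is synthetic.
rng = np.random.default_rng(2)
n, d = 100, 4
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)

def f(w):
    return np.sum(np.log1p(np.exp(-y * (X @ w))))

def grad_f(w):
    # d/dw log(1 + exp(-y_i x_i^T w)) = -y_i x_i / (1 + exp(y_i x_i^T w))
    s = 1.0 / (1.0 + np.exp(y * (X @ w)))
    return -X.T @ (y * s)

w = rng.standard_normal(d)
eps = 1e-6
g_fd = np.array([(f(w + eps * e) - f(w - eps * e)) / (2 * eps) for e in np.eye(d)])
print(np.allclose(grad_f(w), g_fd, atol=1e-4))   # True
```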
Stochastic Gradient Descent
Machine Learning CSE546
Kevin Jamieson
University of Washington
October 18, 2016
Stochastic Gradient Descent

Have a bunch of iid data of the form: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$

Learning a model's parameters: $\frac{1}{n}\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.
Stochastic Gradient Descent

Have a bunch of iid data of the form: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$

Learning a model's parameters: $\frac{1}{n}\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.

Gradient Descent:
$w_{t+1} = w_t - \eta\, \nabla_w \left( \frac{1}{n} \sum_{i=1}^n \ell_i(w) \right) \Big|_{w = w_t}$
Stochastic Gradient Descent

Have a bunch of iid data of the form: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$

Learning a model's parameters: $\frac{1}{n}\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.

Gradient Descent:
$w_{t+1} = w_t - \eta\, \nabla_w \left( \frac{1}{n} \sum_{i=1}^n \ell_i(w) \right) \Big|_{w = w_t}$

Stochastic Gradient Descent:
$w_{t+1} = w_t - \eta\, \nabla_w \ell_{I_t}(w) \big|_{w = w_t}$, with $I_t$ drawn uniformly at random from $\{1, \ldots, n\}$

$\mathbb{E}[\nabla \ell_{I_t}(w)] =$
Stochastic Gradient Descent

Theorem: Let $w_{t+1} = w_t - \eta\, \nabla_w \ell_{I_t}(w)\big|_{w = w_t}$ with $I_t$ drawn uniformly at random from $\{1, \ldots, n\}$, so that
$\mathbb{E}[\nabla \ell_{I_t}(w)] = \frac{1}{n}\sum_{i=1}^n \nabla \ell_i(w) =: \nabla \ell(w)$.

If $\sup_w \max_i \|\nabla \ell_i(w)\|_2^2 \le G$, $\ \|w_1 - w^*\|_2^2 \le R$, and $\eta = \sqrt{\frac{R}{G T}}$, then

$\mathbb{E}[\ell(\bar{w}) - \ell(w^*)] \le \dfrac{R}{2 T \eta} + \dfrac{\eta G}{2} \le \sqrt{\dfrac{R G}{T}}$, where $\bar{w} = \dfrac{1}{T}\sum_{t=1}^T w_t$.

(In practice use the last iterate.)
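A minimal sketch of the setup in the theorem (a single uniformly sampled index per step and an averaged iterate); the least-squares losses and constants below are illustrative stand-ins for a generic convex sum.

```python
import numpy as np

# SGD with I_t uniform on {1,...,n} and averaged iterate w_bar = (1/T) sum_t w_t.
rng = np.random.default_rng(3)
n, d = 500, 10
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def grad_i(w, i):
    # gradient of ell_i(w) = (y_i - x_i^T w)^2
    return -2.0 * (y[i] - X[i] @ w) * X[i]

T, eta = 20_000, 1e-3            # fixed step size, as in the theorem
w = np.zeros(d)
w_bar = np.zeros(d)
for t in range(T):
    i = rng.integers(n)          # I_t drawn uniformly at random
    w = w - eta * grad_i(w, i)
    w_bar += w / T               # running average of the iterates

w_star = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.linalg.norm(w_bar - w_star))   # small, but only O(1/sqrt(T))-small
```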
Stochastic Gradient Descent

Proof:
$\mathbb{E}[\|w_{t+1} - w^*\|_2^2] = \mathbb{E}[\|w_t - \eta\, \nabla \ell_{I_t}(w_t) - w^*\|_2^2]$
Stochastic Gradient Descent

Proof:
$\mathbb{E}[\|w_{t+1} - w^*\|_2^2] = \mathbb{E}[\|w_t - \eta\, \nabla \ell_{I_t}(w_t) - w^*\|_2^2]$
$= \mathbb{E}[\|w_t - w^*\|_2^2] - 2\eta\, \mathbb{E}[\nabla \ell_{I_t}(w_t)^T (w_t - w^*)] + \eta^2\, \mathbb{E}[\|\nabla \ell_{I_t}(w_t)\|_2^2]$
$\le \mathbb{E}[\|w_t - w^*\|_2^2] - 2\eta\, \mathbb{E}[\ell(w_t) - \ell(w^*)] + \eta^2 G$

since
$\mathbb{E}[\nabla \ell_{I_t}(w_t)^T (w_t - w^*)] = \mathbb{E}\big[\mathbb{E}[\nabla \ell_{I_t}(w_t)^T (w_t - w^*) \mid I_1, w_1, \ldots, I_{t-1}, w_{t-1}]\big] = \mathbb{E}[\nabla \ell(w_t)^T (w_t - w^*)] \ge \mathbb{E}[\ell(w_t) - \ell(w^*)]$

Summing over t and rearranging:
$\sum_{t=1}^T \mathbb{E}[\ell(w_t) - \ell(w^*)] \le \dfrac{1}{2\eta}\big(\mathbb{E}[\|w_1 - w^*\|_2^2] - \mathbb{E}[\|w_{T+1} - w^*\|_2^2]\big) + \dfrac{T \eta G}{2} \le \dfrac{R}{2\eta} + \dfrac{T \eta G}{2}$
Stochastic Gradient Descent

Proof (continued):
Jensen's inequality: for any random $Z \in \mathbb{R}^d$ and convex function $\phi: \mathbb{R}^d \to \mathbb{R}$, $\ \phi(\mathbb{E}[Z]) \le \mathbb{E}[\phi(Z)]$.

$\mathbb{E}[\ell(\bar{w}) - \ell(w^*)] \le \dfrac{1}{T} \sum_{t=1}^T \mathbb{E}[\ell(w_t) - \ell(w^*)]$, where $\bar{w} = \dfrac{1}{T} \sum_{t=1}^T w_t$
Stochastic Gradient Descent

Proof (continued):
Jensen's inequality: for any random $Z \in \mathbb{R}^d$ and convex function $\phi: \mathbb{R}^d \to \mathbb{R}$, $\ \phi(\mathbb{E}[Z]) \le \mathbb{E}[\phi(Z)]$.

$\mathbb{E}[\ell(\bar{w}) - \ell(w^*)] \le \dfrac{1}{T} \sum_{t=1}^T \mathbb{E}[\ell(w_t) - \ell(w^*)]$, where $\bar{w} = \dfrac{1}{T} \sum_{t=1}^T w_t$

Combining with the previous slide and choosing $\eta = \sqrt{\frac{R}{GT}}$:
$\mathbb{E}[\ell(\bar{w}) - \ell(w^*)] \le \dfrac{R}{2 T \eta} + \dfrac{\eta G}{2} \le \sqrt{\dfrac{R G}{T}}$
Stochastic Gradient Descent: A Learning Perspective
Machine Learning CSE546
Kevin Jamieson
University of Washington
October 18, 2016
Learning Problems as Expectations

Minimizing loss on training data:
Given dataset: sampled iid from some distribution p(x) on features
Loss function, e.g., hinge loss, logistic loss, ...
We often minimize loss on the training data:
However, we should really minimize the expected loss on all data:
So, we are approximating the integral by the average over the training data.

Sham Kakade 2016
Gradient Descent in Terms of Expectations

True objective function:
Taking the gradient:
True gradient descent rule:
How do we estimate the expected gradient?

Sham Kakade 2016
SGD: Stochastic Gradient Descent

True gradient:
Sample-based approximation:
What if we estimate the gradient with just one sample???
Unbiased estimate of the gradient
Very noisy!
Also called stochastic gradient descent (among many other names)
VERY useful in practice!!!

Sham Kakade 2016
Perceptron
Machine Learning CSE546
Kevin Jamieson
University of Washington
October 18, 2018
Online learning

Click prediction for ads is a streaming data task:
User enters a query, and an ad must be selected: observe x_j and predict y_j
User either clicks or doesn't click on the ad: label y_j is revealed afterwards
Google gets a reward if the user clicks on the ad
Update the model for next time
Online classification

New point arrives at time k
The Perceptron Algorithm [Rosenblatt '58, '62]

Classification setting: y in {-1, +1}
Linear model
  Prediction:
Training:
  Initialize weight vector:
  At each time step:
    Observe features:
    Make prediction:
    Observe true class:
    Update model: if prediction is not equal to truth
The Perceptron Algorithm [Rosenblatt '58, '62]

Classification setting: y in {-1, +1}
Linear model
  Prediction: $\mathrm{sign}(w^T x_i + b)$
Training:
  Initialize weight vector: $w_0 = 0,\ b_0 = 0$
  At each time step:
    Observe features: $x_k$
    Make prediction: $\mathrm{sign}(x_k^T w_k + b_k)$
    Observe true class: $y_k$
    Update model: if prediction is not equal to truth,
    $\begin{bmatrix} w_{k+1} \\ b_{k+1} \end{bmatrix} = \begin{bmatrix} w_k \\ b_k \end{bmatrix} + y_k \begin{bmatrix} x_k \\ 1 \end{bmatrix}$
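A direct transcription of this loop in NumPy; the synthetic, separable data-generating choices below are illustrative.

```python
import numpy as np

# Perceptron on synthetic, linearly separable data.
rng = np.random.default_rng(4)
X = rng.standard_normal((300, 2))
scores = X @ np.array([1.0, -2.0]) + 0.5
keep = np.abs(scores) > 0.2               # enforce a positive margin
X, y = X[keep], np.sign(scores[keep])

w, b = np.zeros(2), 0.0
mistakes = 0
converged = False
while not converged:                      # separable => finitely many mistakes
    converged = True
    for k in range(len(y)):
        if y[k] * (X[k] @ w + b) <= 0:    # prediction disagrees with truth
            w, b = w + y[k] * X[k], b + y[k]
            mistakes += 1
            converged = False

print(mistakes, np.all(np.sign(X @ w + b) == y))
```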
Rosenblatt 1957

"the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence."
The New York Times, 1958
Linear Separability

Perceptron guaranteed to converge if the data is linearly separable:
Perceptron Analysis: Linearly Separable Case

Theorem [Block, Novikoff]:
Given a sequence of labeled examples:
Each feature vector has bounded norm:
If the dataset is linearly separable:
Then the number of mistakes made by the online perceptron on any such sequence is bounded by
Beyond the Linearly Separable Case

The perceptron algorithm is super cool!
No assumption about the data distribution!
  Could be generated by an oblivious adversary, no need to be iid
Makes a fixed number of mistakes, and it's done forever!
  Even if you see infinite data
Beyond the Linearly Separable Case

The perceptron algorithm is super cool!
No assumption about the data distribution!
  Could be generated by an oblivious adversary, no need to be iid
Makes a fixed number of mistakes, and it's done forever!
  Even if you see infinite data

The perceptron is useless in practice!
The real world is not linearly separable
  If the data is not separable, it cycles forever, and this is hard to detect
  Even if separable, it may not give good generalization accuracy (small margin)
What is the Perceptron Doing???

When we discussed logistic regression:
  Started from maximizing the conditional log-likelihood

When we discussed the perceptron:
  Started from a description of an algorithm

What is the perceptron optimizing????
Support Vector Machines
Machine Learning CSE546
Kevin Jamieson
University of Washington
October 18, 2018
Linear classifiers

Which line is better?
Pick the one with the largest margin!

Hyperplane: $x^T w + b = 0$, with margin $2\gamma$
Pick the one with the largest margin!

Hyperplane: $x^T w + b = 0$

Distance from $x_0$ to the hyperplane defined by $x^T w + b = 0$?
Pick the one with the largest margin!

Hyperplane: $x^T w + b = 0$

Distance from $x_0$ to the hyperplane defined by $x^T w + b = 0$?

If $\tilde{x}_0$ is the projection of $x_0$ onto the hyperplane, then
$\|x_0 - \tilde{x}_0\|_2 = \left| \dfrac{(x_0 - \tilde{x}_0)^T w}{\|w\|_2} \right| = \dfrac{1}{\|w\|_2} \left| x_0^T w - \tilde{x}_0^T w \right| = \dfrac{1}{\|w\|_2} \left| x_0^T w + b \right|$
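A small numerical check of this identity; the particular w, b, and x_0 below are arbitrary.

```python
import numpy as np

# Project x0 onto the hyperplane {x : x^T w + b = 0} and compare the residual
# norm with |x0^T w + b| / ||w||_2.
w = np.array([3.0, -1.0, 2.0])
b = 0.7
x0 = np.array([1.0, 2.0, -0.5])

x0_proj = x0 - ((x0 @ w + b) / (w @ w)) * w       # projection onto the hyperplane
print(np.isclose(x0_proj @ w + b, 0.0))           # True: it lies on the hyperplane
print(np.linalg.norm(x0 - x0_proj), abs(x0 @ w + b) / np.linalg.norm(w))  # equal
```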
Pick the one with the largest margin!

Hyperplane: $x^T w + b = 0$, with margin $2\gamma$

Distance of $x_0$ from the hyperplane $x^T w + b = 0$: $\ \dfrac{1}{\|w\|_2} |x_0^T w + b|$

Optimal hyperplane:
$\max_{w, b}\ \gamma$ subject to $\dfrac{1}{\|w\|_2}\, y_i (x_i^T w + b) \ge \gamma \quad \forall i$
Pick the one with the largest margin!

Hyperplane: $x^T w + b = 0$, with margin $2\gamma$

Distance of $x_0$ from the hyperplane $x^T w + b = 0$: $\ \dfrac{1}{\|w\|_2} |x_0^T w + b|$

Optimal hyperplane:
$\max_{w, b}\ \gamma$ subject to $\dfrac{1}{\|w\|_2}\, y_i (x_i^T w + b) \ge \gamma \quad \forall i$

Optimal hyperplane (reparameterized):
$\min_{w, b}\ \|w\|_2^2$ subject to $y_i (x_i^T w + b) \ge 1 \quad \forall i$
Pick the one with the largest margin!

Hyperplane: $x^T w + b = 0$, with margin $2\gamma$

Optimal hyperplane (reparameterized):
$\min_{w, b}\ \|w\|_2^2$ subject to $y_i (x_i^T w + b) \ge 1 \quad \forall i$

Solve efficiently by many methods, e.g.:
Quadratic programming (QP): well-studied solution algorithms
Stochastic gradient descent
Coordinate descent (in the dual)
What if the data is still not linearly separable?

Hyperplane: $x^T w + b = 0$ (margin boundaries at distance $1/\|w\|_2$ on each side)

If the data is linearly separable:
$\min_{w, b}\ \|w\|_2^2$ subject to $y_i (x_i^T w + b) \ge 1 \quad \forall i$
What if the data is still not linearly separable?

Hyperplane: $x^T w + b = 0$ (margin boundaries at distance $1/\|w\|_2$ on each side)

If the data is linearly separable:
$\min_{w, b}\ \|w\|_2^2$ subject to $y_i (x_i^T w + b) \ge 1 \quad \forall i$

If the data is not linearly separable, some points don't satisfy the margin constraint; introduce slack:
$\min_{w, b, \xi}\ \|w\|_2^2$ subject to $y_i (x_i^T w + b) \ge 1 - \xi_i \ \ \forall i$, $\quad \xi_i \ge 0$, $\quad \sum_{j=1}^n \xi_j \le \nu$
What if the data is still not linearly separable?

Hyperplane: $x^T w + b = 0$ (margin boundaries at distance $1/\|w\|_2$ on each side)

If the data is linearly separable:
$\min_{w, b}\ \|w\|_2^2$ subject to $y_i (x_i^T w + b) \ge 1 \quad \forall i$

If the data is not linearly separable, some points don't satisfy the margin constraint; introduce slack:
$\min_{w, b, \xi}\ \|w\|_2^2$ subject to $y_i (x_i^T w + b) \ge 1 - \xi_i \ \ \forall i$, $\quad \xi_i \ge 0$, $\quad \sum_{j=1}^n \xi_j \le \nu$

What are support vectors?
SVM as penalization method

Original quadratic program with linear constraints:
$\min_{w, b, \xi}\ \|w\|_2^2$ subject to $y_i (x_i^T w + b) \ge 1 - \xi_i \ \ \forall i$, $\quad \xi_i \ge 0$, $\quad \sum_{j=1}^n \xi_j \le \nu$
SVM as penalization method

Original quadratic program with linear constraints:
$\min_{w, b, \xi}\ \|w\|_2^2$ subject to $y_i (x_i^T w + b) \ge 1 - \xi_i \ \ \forall i$, $\quad \xi_i \ge 0$, $\quad \sum_{j=1}^n \xi_j \le \nu$

Using the same constrained convex optimization trick as for the lasso: for any $\nu \ge 0$ there exists a $\lambda \ge 0$ such that the solution of the following problem is equivalent:
$\sum_{i=1}^n \max\{0,\ 1 - y_i (b + x_i^T w)\} + \lambda \|w\|_2^2$
Machine Learning Problems

Have a bunch of iid data of the form: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$

Learning a model's parameters: $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.

Hinge loss: $\ell_i(w) = \max\{0,\ 1 - y_i x_i^T w\}$
Logistic loss: $\ell_i(w) = \log(1 + \exp(-y_i x_i^T w))$
Squared error loss: $\ell_i(w) = (y_i - x_i^T w)^2$

How do we solve for w? The last two lectures!
Perceptron is optimizing what?

Perceptron update rule:
$\begin{bmatrix} w_{k+1} \\ b_{k+1} \end{bmatrix} = \begin{bmatrix} w_k \\ b_k \end{bmatrix} + y_k \begin{bmatrix} x_k \\ 1 \end{bmatrix} \mathbf{1}\{y_k (b_k + x_k^T w_k) < 0\}$

SVM objective:
$\sum_{i=1}^n \max\{0,\ 1 - y_i (b + x_i^T w)\} + \lambda \|w\|_2^2 = \sum_{i=1}^n \ell_i(w, b)$

$\nabla_w \ell_i(w, b) = \begin{cases} -x_i y_i + \frac{2\lambda}{n} w & \text{if } y_i (b + x_i^T w) < 1 \\ 0 & \text{otherwise} \end{cases}$

$\nabla_b \ell_i(w, b) = \begin{cases} -y_i & \text{if } y_i (b + x_i^T w) < 1 \\ 0 & \text{otherwise} \end{cases}$

Perceptron is just SGD on the SVM objective with $\lambda = 0$, $\eta = 1$!
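A sketch of SGD on this penalized objective, using exactly the per-example subgradients above; the data and hyperparameters are illustrative.

```python
import numpy as np

# SGD on  sum_i max{0, 1 - y_i(b + x_i^T w)} + lambda * ||w||_2^2,
# using the per-example subgradients given above.
rng = np.random.default_rng(5)
n, d = 400, 2
X = rng.standard_normal((n, d))
y = np.sign(X @ np.array([2.0, -1.0]) + 0.3)

lam, eta, T = 0.1, 0.01, 50_000
w, b = np.zeros(d), 0.0
for t in range(T):
    i = rng.integers(n)
    if y[i] * (X[i] @ w + b) < 1:         # margin violated: subgradient is active
        w = w - eta * (-y[i] * X[i] + 2 * lam / n * w)
        b = b - eta * (-y[i])
    # otherwise the subgradient above is zero, so no update

print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```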
SVMs vs logistic regression

We often want probabilities/confidences. Logistic wins here?
SVMs vs logistic regression

We often want probabilities/confidences. Logistic wins here?

No! Perform isotonic regression or use a non-parametric bootstrap for probability calibration. The predictor gives some score; how do we transform that score into a probability?
SVMs vs logistic regression

We often want probabilities/confidences. Logistic wins here?

No! Perform isotonic regression or use a non-parametric bootstrap for probability calibration. The predictor gives some score; how do we transform that score into a probability?

For classification loss, logistic and SVM are comparable.

Multiclass setting:
Softmax naturally generalizes logistic regression
SVMs have:

What about good old least squares?
What about multiple classes?