Announcements
- Project proposal due next week: Tuesday 10/24
- Still looking for people to work on the deep learning phytolith project; join the #phytolith Slack channel
Gradient Descent
Machine Learning CSE546
Kevin Jamieson
University of Washington
October 16, 2017
Machine Learning Problems

Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$.

Learning a model's parameters: minimize $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.

$f$ convex: $f(\lambda x + (1-\lambda) y) \le \lambda f(x) + (1-\lambda) f(y)$ for all $x, y$ and $\lambda \in [0,1]$

Equivalently, for differentiable $f$: $f(y) \ge f(x) + \nabla f(x)^T (y - x)$ for all $x, y$

$g$ is a subgradient at $x$ if $f(y) \ge f(x) + g^T (y - x)$ for all $y$
Machine Learning Problems

Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$.

Learning a model's parameters: minimize $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.

Logistic loss: $\ell_i(w) = \log(1 + \exp(-y_i x_i^T w))$

Squared error loss: $\ell_i(w) = (y_i - x_i^T w)^2$
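To make the two losses concrete, here is a minimal NumPy sketch (my own illustration, not from the slides) that evaluates each $\ell_i(w)$ and its gradient for a single example $(x, y)$; the function names are hypothetical.

```python
import numpy as np

def logistic_loss(w, x, y):
    """Logistic loss l_i(w) = log(1 + exp(-y * x^T w)) and its gradient, y in {-1, +1}."""
    margin = y * x.dot(w)
    loss = np.log1p(np.exp(-margin))
    grad = -y * x / (1.0 + np.exp(margin))   # d/dw log(1 + exp(-margin))
    return loss, grad

def squared_loss(w, x, y):
    """Squared error loss l_i(w) = (y - x^T w)^2 and its gradient."""
    residual = y - x.dot(w)
    return residual ** 2, -2.0 * residual * x
```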
Taylor Series Approximation

Taylor series in one dimension: $f(x + \delta) = f(x) + f'(x)\,\delta + \tfrac{1}{2} f''(x)\,\delta^2 + \dots$

Gradient descent:
Taylor Series Approximation

Taylor series in $d$ dimensions: $f(x + v) = f(x) + \nabla f(x)^T v + \tfrac{1}{2} v^T \nabla^2 f(x)\, v + \dots$

Gradient descent:
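The first-order Taylor approximation motivates the gradient descent update $w_{t+1} = w_t - \eta \nabla f(w_t)$: step in the direction that locally decreases $f$ fastest. A minimal sketch with a fixed step size (my own illustration; the function names are hypothetical):

```python
import numpy as np

def gradient_descent(grad_f, w0, step_size=0.1, num_iters=100):
    """Plain gradient descent: w_{t+1} = w_t - eta * grad f(w_t).

    grad_f: function returning the gradient of f at a point w.
    """
    w = np.asarray(w0, dtype=float)
    for _ in range(num_iters):
        w = w - step_size * grad_f(w)
    return w

# Example: minimize f(w) = ||w - 1||^2, whose gradient is 2(w - 1).
w_hat = gradient_descent(lambda w: 2 * (w - np.ones_like(w)), w0=np.zeros(3))
```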
General case

In general, for Newton's method to achieve $f(w_t) - f(w^*) \le \epsilon$:

So why are ML problems overwhelmingly solved by gradient methods?

Hint: $v_t$ is the solution to $\nabla^2 f(w_t)\, v_t = -\nabla f(w_t)$
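The hint is the crux: each Newton step must form the $d \times d$ Hessian and solve a linear system, roughly $O(d^2)$ memory and $O(d^3)$ time, whereas a gradient step costs $O(d)$ once the gradient is available. A sketch of a single Newton step (illustrative only):

```python
import numpy as np

def newton_step(grad, hess):
    """One Newton direction v_t: solve  hess(w_t) v_t = -grad(w_t).

    Solving this d x d system is O(d^3) time and O(d^2) memory for the Hessian,
    which is the usual reason large-scale ML sticks to gradient methods.
    """
    return np.linalg.solve(hess, -grad)
```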
General Convex case: iterations $t$ needed to reach $f(w_t) - f(w^*) \le \epsilon$

Clean convergence proofs: Bubeck

Newton's method: $t \approx \log(\log(1/\epsilon))$

Gradient descent:
- $f$ is smooth and strongly convex, $aI \preceq \nabla^2 f(w) \preceq bI$: $t \approx \tfrac{b}{a}\log(1/\epsilon)$
- $f$ is smooth, $\nabla^2 f(w) \preceq bI$: $t \approx \tfrac{b}{\epsilon}$
- $f$ is potentially non-differentiable, $\|\nabla f(w)\|_2 \le c$: $t \approx \tfrac{c^2}{\epsilon^2}$

References: Nocedal and Wright; Bubeck

Other methods: BFGS, Heavy-ball, BCD, SVRG, ADAM, Adagrad, ...
Revisiting Logistic Regression
Machine Learning CSE546
Kevin Jamieson
University of Washington
October 16, 2017
Loss function: Conditional Likelihood

Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \{-1, 1\}$.

Model: $P(Y = y \mid x, w) = \dfrac{1}{1 + \exp(-y\, w^T x)}$

$\hat{w}_{\mathrm{MLE}} = \arg\max_w \prod_{i=1}^n P(y_i \mid x_i, w) = \arg\min_w \sum_{i=1}^n \log(1 + \exp(-y_i x_i^T w)) =: \arg\min_w f(w)$

$\nabla f(w) = $
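As a concrete reference, here is a short NumPy sketch of this negative log-likelihood and its gradient (the vectorized form and names are my own; the gradient $\nabla f(w) = -\sum_i y_i x_i / (1 + \exp(y_i x_i^T w))$ follows by differentiating the loss above):

```python
import numpy as np

def logistic_objective(w, X, y):
    """f(w) = sum_i log(1 + exp(-y_i x_i^T w)) and its gradient.

    X: (n, d) array of features; y: (n,) array of labels in {-1, +1}.
    """
    margins = y * (X @ w)                            # y_i * x_i^T w
    f = np.sum(np.log1p(np.exp(-margins)))
    grad = -(X.T @ (y / (1.0 + np.exp(margins))))    # -sum_i y_i x_i / (1 + exp(y_i x_i^T w))
    return f, grad
```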
Online Learning
Machine Learning CSE546
Kevin Jamieson
University of Washington
October 18, 2017
Going to the moon

Guidance computer predicts trajectories around the moon and back with:
- Noisy sensors
- Imperfect models
- Little computational power
- Big risk of failure
Going to the moon

Guidance computer predicts trajectories around the moon and back with:
- Noisy sensors
- Imperfect models
- Little computational power
- Big risk of failure

Apollo 13: Why is Tom Hanks flying erratically? Because they didn't have the power to turn on the Kalman filter!
State Estimation

- Predict the current state given the past state and current control input $u_n$: $\tilde{w}_n = f(w_{n-1}) + g(u_n)$
- Given the current context $x_n$, compare your prediction to the noisy measurement $y_n$: $\ell_n(\tilde{w}_n) = (y_n - h(x_n, \tilde{w}_n))^2$
- Update the current state to include the measurement: $w_n = \tilde{w}_n - K_n \nabla_w \ell_n(w)\big|_{w = \tilde{w}_n}$

The Kalman filter does optimal least squares state estimation if $f, g, h$ are linear!
Recursive Least Squares (RLS)

Least squares = special case of the Kalman filter: no dynamics, no control.

$\tilde{w}_n = f(w_{n-1}) + g(u_n)$

$\ell_n(\tilde{w}_n) = (y_n - h(x_n, \tilde{w}_n))^2$

$w_n = \tilde{w}_n - K_n \nabla_w \ell_n(w)\big|_{w = \tilde{w}_n}$
Recursive Least Squares (RLS)

Least squares = special case of the Kalman filter: no dynamics, no control.

$\tilde{w}_n = f(w_{n-1}) + g(u_n) = w_{n-1}$

$\ell_n(\tilde{w}_n) = (y_n - h(x_n, \tilde{w}_n))^2 = (y_n - x_n^T \tilde{w}_n)^2 = (y_n - x_n^T w_{n-1})^2$

$w_n = \tilde{w}_n - K_n \nabla_w \ell_n(w)\big|_{w = \tilde{w}_n} = w_{n-1} + 2(y_n - x_n^T w_{n-1})\, K_n x_n$

Ideally: $w_n = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2$
Recursive Least Squares (RLS)

Ideally: $w_n = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2 = \left(\sum_{i=1}^n x_i x_i^T\right)^{-1} \sum_{i=1}^n x_i y_i$

Sherman-Morrison: $(A + x x^T)^{-1} = A^{-1} - \dfrac{A^{-1} x\, x^T A^{-1}}{1 + x^T A^{-1} x}$
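Putting the pieces together, here is a sketch of recursive least squares that maintains the inverse Gram matrix with the Sherman-Morrison rank-one update, so each new sample costs $O(d^2)$ instead of re-solving the full problem. This is my own illustration: the small ridge term `delta` (to keep the inverse well defined before $d$ samples arrive) is an assumption, and the factor of 2 from the slide's gradient is folded into the gain $K_n = P_n x_n$.

```python
import numpy as np

class RecursiveLeastSquares:
    """RLS via the Sherman-Morrison rank-one update of P_n = (sum_i x_i x_i^T + delta I)^{-1}."""

    def __init__(self, d, delta=1e-3):
        self.P = np.eye(d) / delta   # inverse of the (lightly regularized) Gram matrix
        self.w = np.zeros(d)

    def update(self, x, y):
        Px = self.P @ x
        # Sherman-Morrison: (A + x x^T)^{-1} = A^{-1} - A^{-1} x x^T A^{-1} / (1 + x^T A^{-1} x)
        self.P -= np.outer(Px, Px) / (1.0 + x @ Px)
        # Gain K_n = P_n x_n; correct the prediction by the residual
        self.w += (self.P @ x) * (y - x @ self.w)
        return self.w
```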
Recursive Least Squares (RLS)

$w_n = \left(\sum_{i=1}^n x_i x_i^T\right)^{-1} \sum_{i=1}^n x_i y_i$

Great, what's the time complexity of this?

It is 2017, not the '60s. Is limited computation still really a problem?
Digital Signal Processing: the original Big Data

Wifi and cell phones are constantly solving least squares to invert out multipath.

Low-power devices, high data rates.
Digital Signal Processing: the original Big Data

Wifi and cell phones are constantly solving least squares to invert out multipath.

Low-power devices, high data rates. Gigabytes of data per second.
Incremental Gradient Descent

$(x_t, y_t)$ arrives: $w_{t+1} = w_t - \eta \left[\nabla_w (y_t - x_t^T w)^2\right]_{w = w_t}$

Note: no matrix multiply.

We know RLS is exact. How much worse is this?

In general, convex $\ell_t(w)$ arrives: $\ell(\cdot)$ is convex $\iff$ $\ell(y) \ge \ell(x) + \nabla \ell(x)^T (y - x)$ for all $x, y$
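For the squared loss, this streaming update works out to $w_{t+1} = w_t + 2\eta (y_t - x_t^T w_t)\, x_t$, which costs only $O(d)$ per sample. A one-line sketch (my own illustration):

```python
import numpy as np

def incremental_update(w, x, y, step_size):
    """One streaming squared-loss update when (x_t, y_t) arrives.

    w_{t+1} = w_t - eta * grad_w (y_t - x_t^T w)^2 |_{w = w_t}
            = w_t + 2 * eta * (y_t - x_t^T w_t) * x_t
    Only vector operations: O(d) per sample, no d x d matrix is formed or inverted.
    """
    return w + 2.0 * step_size * (y - x @ w) * x
```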
Incremental Gradient Descent

$\|w_{t+1} - w^*\|_2^2 = \|w_t - \eta \nabla \ell_t(w_t) - w^*\|_2^2$
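Expanding the square gives the standard first step of the online-gradient analysis (a routine algebraic identity, spelled out here for completeness):

```latex
\|w_{t+1} - w^*\|_2^2
  = \|w_t - \eta \nabla \ell_t(w_t) - w^*\|_2^2
  = \|w_t - w^*\|_2^2
    \; - \; 2\eta\, \nabla \ell_t(w_t)^T (w_t - w^*)
    \; + \; \eta^2 \, \|\nabla \ell_t(w_t)\|_2^2
```

Convexity then gives $\ell_t(w_t) - \ell_t(w^*) \le \nabla \ell_t(w_t)^T (w_t - w^*)$, so summing over $t$ and telescoping the distances bounds the total regret.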
Incremental Gradient Descent
Stochastic Gradient Descent

Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$.

Learning a model's parameters: minimize $\frac{1}{n}\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.
Stochastic Gradient Descent

Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$.

Learning a model's parameters: minimize $\frac{1}{n}\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.

Gradient Descent: $w_{t+1} = w_t - \eta\, \nabla_w \left(\frac{1}{n}\sum_{i=1}^n \ell_i(w)\right)\Big|_{w = w_t}$
Stochastic Gradient Descent

Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$.

Learning a model's parameters: minimize $\frac{1}{n}\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.

Gradient Descent: $w_{t+1} = w_t - \eta\, \nabla_w \left(\frac{1}{n}\sum_{i=1}^n \ell_i(w)\right)\Big|_{w = w_t}$

Stochastic Gradient Descent: $w_{t+1} = w_t - \eta\, \nabla_w \ell_{I_t}(w)\big|_{w = w_t}$, with $I_t$ drawn uniformly at random from $\{1, \dots, n\}$

$\mathbb{E}\big[\nabla \ell_{I_t}(w)\big] = \frac{1}{n}\sum_{i=1}^n \nabla \ell_i(w)$
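A minimal SGD loop matching this update (my own sketch; `grad_i` is a hypothetical user-supplied function returning the gradient of the $i$-th loss term):

```python
import numpy as np

def sgd(grad_i, n, w0, step_size=0.01, num_iters=1000, seed=0):
    """Stochastic gradient descent: w_{t+1} = w_t - eta * grad l_{I_t}(w_t),
    with I_t drawn uniformly at random from {0, ..., n-1} at every step.

    grad_i(w, i) should return the gradient of the i-th loss term at w.
    """
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    for _ in range(num_iters):
        i = rng.integers(n)                  # I_t ~ Uniform{0, ..., n-1}
        w = w - step_size * grad_i(w, i)
    return w
```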
Stochastic Gradient Descent

Gradient Descent: $w_{t+1} = w_t - \eta\, \nabla_w \left(\frac{1}{n}\sum_{i=1}^n \ell_i(w)\right)\Big|_{w = w_t}$

Stochastic Gradient Descent: $w_{t+1} = w_t - \eta\, \nabla_w \ell_{I_t}(w)\big|_{w = w_t}$, with $I_t$ drawn uniformly at random from $\{1, \dots, n\}$
Stochastic Gradient Ascent for Logistic Regression

Logistic loss as a stochastic function:

Batch gradient ascent updates:

Stochastic gradient ascent updates:

Online setting:

(Sham Kakade 2016)
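The slide leaves the updates to be filled in; as one concrete version (my own sketch, using the $y \in \{-1, +1\}$ encoding from earlier), the single-example stochastic gradient ascent step on the log-likelihood $\log P(y \mid x, w) = -\log(1 + \exp(-y\, x^T w))$ is:

```python
import numpy as np

def sga_logistic_update(w, x, y, step_size):
    """One stochastic gradient ascent step on log P(y | x, w) for a single (x, y), y in {-1, +1}.

    grad_w log P(y | x, w) = y * x / (1 + exp(y * x^T w))
    """
    return w + step_size * y * x / (1.0 + np.exp(y * (x @ w)))
```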
Stochastic Gradient Descent: A Learning Perspective
Machine Learning CSE546
Kevin Jamieson
University of Washington
October 16, 2017
Learning Problems as Expectations

Minimizing loss in training data:
- Given dataset: sampled iid from some distribution $p(x)$ on features
- Loss function, e.g., hinge loss, logistic loss, ...
- We often minimize loss in training data:

However, we should really minimize expected loss on all data:

So, we are approximating the integral by the average on the training data.

(Sham Kakade 2016)
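In symbols (the notation below is my own, not from the slide), the training loss is a sample average that approximates the expected loss:

```latex
% Empirical risk (average over the training data) approximating the expected risk (an integral)
\frac{1}{n}\sum_{i=1}^{n} \ell\big(w;\, x_i, y_i\big)
  \;\approx\;
\mathbb{E}_{(x,y)\sim p}\!\left[\ell\big(w;\, x, y\big)\right]
  \;=\;
\int \ell\big(w;\, x, y\big)\, p(x, y)\, \mathrm{d}x\, \mathrm{d}y
```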
Gradient Descent in Terms of Expectations

True objective function:

Taking the gradient:

True gradient descent rule:

How do we estimate the expected gradient?

(Sham Kakade 2016)
SGD: Stochastic Gradient Descent

True gradient:

Sample-based approximation:

What if we estimate the gradient with just one sample???
- Unbiased estimate of the gradient
- Very noisy!
- Called stochastic gradient descent (among many other names)
- VERY useful in practice!!!

(Sham Kakade 2016)
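A quick numerical check of the "unbiased but noisy" claim (entirely my own illustration, reusing the logistic loss from earlier): averaging many one-sample gradients recovers the full gradient, while any single draw can be far from it.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)
w = rng.normal(size=d)

def grad_i(w, i):
    """Gradient of the i-th logistic loss term log(1 + exp(-y_i x_i^T w))."""
    return -y[i] * X[i] / (1.0 + np.exp(y[i] * (X[i] @ w)))

full_grad = np.mean([grad_i(w, i) for i in range(n)], axis=0)
# Averaging many single-sample gradients (I_t uniform) approaches the full gradient:
est = np.mean([grad_i(w, rng.integers(n)) for _ in range(20000)], axis=0)
print(np.linalg.norm(est - full_grad))   # small: the one-sample gradient is unbiased but noisy
```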