ECS171: Machine Learning Lecture 4: Optimization (LFD 3.3, SGD) Cho-Jui Hsieh UC Davis Jan 22, 2018
Gradient descent
Optimization. Goal: find the minimizer of a function, $\min_{w} f(w)$. For now we assume $f$ is twice differentiable. Machine learning algorithm: find the hypothesis that minimizes $E_{\text{in}}$.
Convex vs Nonconvex. Convex function: $\nabla f(w) = 0 \Rightarrow w$ is a global minimum. A function is convex if $\nabla^{2} f(w)$ is positive definite. Examples: linear regression, logistic regression. Non-convex function: $\nabla f(w) = 0 \Rightarrow w$ is a global minimum, a local minimum, or a saddle point; most algorithms only converge to a point with gradient $= 0$. Example: neural networks.
Gradient Descent. Repeatedly do $w^{t+1} \leftarrow w^{t} - \alpha \nabla f(w^{t})$, where $\alpha > 0$ is the step size. This generates a sequence $w^{1}, w^{2}, \dots$ that converges to a solution with $\nabla f(w) = 0$. Step size too large $\Rightarrow$ divergence; too small $\Rightarrow$ slow convergence.
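A minimal sketch of the update rule above. The quadratic objective, the step size, and the iteration count are illustrative choices, not from the lecture:

```python
import numpy as np

def gradient_descent(grad_f, w0, alpha=0.1, num_iters=100):
    """Repeatedly apply w <- w - alpha * grad_f(w)."""
    w = w0
    for _ in range(num_iters):
        w = w - alpha * grad_f(w)
    return w

# Example: f(w) = ||w - 1||^2 / 2, so grad f(w) = w - 1 and the minimizer is all-ones.
grad_f = lambda w: w - np.ones_like(w)
w_star = gradient_descent(grad_f, w0=np.zeros(3), alpha=0.5, num_iters=50)
print(w_star)  # close to [1, 1, 1]
```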
Why gradient descent? Reason I: the negative gradient is locally the steepest direction for decreasing the objective function. Reason II: successive approximation view. At each iteration, form an approximation of $f(\cdot)$ around $w^{t}$: $f(w^{t} + d) \approx g(d) := f(w^{t}) + \nabla f(w^{t})^{T} d + \frac{1}{2\alpha} \|d\|^{2}$. Update the solution by $w^{t+1} \leftarrow w^{t} + d^{*}$ with $d^{*} = \arg\min_{d} g(d)$: setting $\nabla g(d^{*}) = 0$ gives $\nabla f(w^{t}) + \frac{1}{\alpha} d^{*} = 0$, so $d^{*} = -\alpha \nabla f(w^{t})$. The step $d^{*}$ will decrease $f(\cdot)$ if $\alpha$ (the step size) is sufficiently small.
Illustration of gradient descent (sequence of figures):
Form a quadratic approximation $f(w^{t} + d) \approx g(d) = f(w^{t}) + \nabla f(w^{t})^{T} d + \frac{1}{2\alpha} \|d\|^{2}$.
Minimize $g(d)$: $\nabla g(d^{*}) = 0 \Rightarrow \nabla f(w^{t}) + \frac{1}{\alpha} d^{*} = 0 \Rightarrow d^{*} = -\alpha \nabla f(w^{t})$.
Update $w^{t+1} = w^{t} + d^{*} = w^{t} - \alpha \nabla f(w^{t})$.
Form another quadratic approximation $f(w^{t+1} + d) \approx g(d) = f(w^{t+1}) + \nabla f(w^{t+1})^{T} d + \frac{1}{2\alpha} \|d\|^{2}$, so $d^{*} = -\alpha \nabla f(w^{t+1})$.
Update $w^{t+2} = w^{t+1} + d^{*} = w^{t+1} - \alpha \nabla f(w^{t+1})$.
When will it diverge? Gradient descent can diverge ($f(w^{t}) < f(w^{t+1})$) if $g$ is not an upper bound of $f$.
When will it converge? It always converges ($f(w^{t}) > f(w^{t+1})$) when $g$ is an upper bound of $f$.
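A standard one-line bound (a textbook fact, not stated on the slide) makes the "upper bound" picture precise: if the Hessian satisfies $\nabla^{2} f \preceq LI$ and the step size satisfies $\alpha \le 1/L$, then $g$ upper-bounds $f$ and each step decreases the objective:

```latex
% Descent lemma: with an L-Lipschitz gradient and step size alpha <= 1/L,
% f(w^{t+1}) <= f(w^t) - alpha*||grad f(w^t)||^2 + (L*alpha^2/2)*||grad f(w^t)||^2,
% and the last term is at most (alpha/2)*||grad f(w^t)||^2, so
\[
  f(w^{t+1}) \;\le\; f(w^{t}) - \frac{\alpha}{2}\,\big\|\nabla f(w^{t})\big\|^{2}
  \qquad \text{whenever } \alpha \le \tfrac{1}{L}.
\]
```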
Convergence. Let $L$ be the Lipschitz constant of the gradient ($\nabla^{2} f(x) \preceq LI$ for all $x$). Theorem: gradient descent converges if $\alpha < \frac{1}{L}$. In practice we do not know $L$, so the step size needs to be tuned when running gradient descent.
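A small numerical sanity check of the step-size condition, using the 1-D quadratic $f(w) = \frac{L}{2} w^{2}$ (so $\nabla^{2} f = L$ exactly); the specific values of $L$ and $\alpha$ are illustrative:

```python
import numpy as np

L = 10.0                       # curvature: f(w) = (L/2) w^2, so grad f(w) = L * w
grad_f = lambda w: L * w

def run_gd(alpha, w0=1.0, iters=100):
    w = w0
    for _ in range(iters):
        w = w - alpha * grad_f(w)
    return w

print(run_gd(alpha=0.05))   # alpha < 1/L = 0.1: iterates shrink toward 0
print(run_gd(alpha=0.25))   # alpha > 2/L = 0.2: |w| blows up (divergence)
```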
Applying to logistic regression. Gradient descent for logistic regression:
Initialize the weights $w^{0}$.
For $t = 1, 2, \dots$: compute the gradient $\nabla E_{\text{in}} = -\frac{1}{N} \sum_{n=1}^{N} \frac{y_{n} x_{n}}{1 + e^{y_{n} w^{T} x_{n}}}$ and update the weights $w \leftarrow w - \eta \nabla E_{\text{in}}$.
Return the final weights $w$.
When to stop? After a fixed number of iterations, or when $\|\nabla E_{\text{in}}\| < \epsilon$.
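A runnable sketch of this full-batch loop. The synthetic data, step size $\eta$, and iteration count are placeholders, not values from the lecture:

```python
import numpy as np

def logistic_gd(X, y, eta=0.1, num_iters=1000):
    """Gradient descent for logistic regression with labels y in {-1, +1}.
    grad E_in(w) = -(1/N) * sum_n y_n x_n / (1 + exp(y_n w^T x_n))."""
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(num_iters):
        margins = y * (X @ w)                          # y_n w^T x_n for all n
        grad = -(X.T @ (y / (1.0 + np.exp(margins)))) / N
        w = w - eta * grad
    return w

# Tiny synthetic example (assumed data, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.sign(X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200))
w = logistic_gd(X, y)
print(np.mean(np.sign(X @ w) == y))                    # training accuracy
```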
Stochastic gradient descent
Large-scale Problems. Machine learning usually means minimizing the in-sample (training) loss: $\min_{w} \{\frac{1}{N} \sum_{n=1}^{N} \ell(w^{T} x_{n}, y_{n})\} := E_{\text{in}}(w)$ (linear model), or $\min_{w} \{\frac{1}{N} \sum_{n=1}^{N} \ell(h_{w}(x_{n}), y_{n})\} := E_{\text{in}}(w)$ (general hypothesis). Here $\ell$ is the loss function (e.g., $\ell(a, b) = (a - b)^{2}$). Gradient descent: $w \leftarrow w - \eta \nabla E_{\text{in}}(w)$, where computing $\nabla E_{\text{in}}(w)$ is the main computation. In general, $E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} f_{n}(w)$, where each $f_{n}(w)$ depends only on $(x_{n}, y_{n})$.
Stochastic gradient. Gradient: $\nabla E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \nabla f_{n}(w)$. Each gradient computation needs to go through all training samples: slow when there are millions of samples. Faster way to compute an approximate gradient? Use stochastic sampling: sample a small subset $B \subseteq \{1, \dots, N\}$ and estimate the gradient by $\nabla E_{\text{in}}(w) \approx \frac{1}{|B|} \sum_{n \in B} \nabla f_{n}(w)$, where $|B|$ is the batch size.
Stochastic Gradient Descent (SGD). Input: training data $\{x_{n}, y_{n}\}_{n=1}^{N}$. Initialize $w$ (zero or random). For $t = 1, 2, \dots$: sample a small batch $B \subseteq \{1, \dots, N\}$ and update the parameters $w \leftarrow w - \eta_{t} \frac{1}{|B|} \sum_{n \in B} \nabla f_{n}(w)$. Extreme case: $|B| = 1$, i.e., sample one training example at a time.
Logistic Regression by SGD. Logistic regression: $\min_{w} \frac{1}{N} \sum_{n=1}^{N} f_{n}(w)$ with $f_{n}(w) = \log(1 + e^{-y_{n} w^{T} x_{n}})$. SGD for logistic regression: Input: training data $\{x_{n}, y_{n}\}_{n=1}^{N}$. Initialize $w$ (zero or random). For $t = 1, 2, \dots$: sample a batch $B \subseteq \{1, \dots, N\}$ and update the parameters $w \leftarrow w - \eta_{t} \frac{1}{|B|} \sum_{n \in B} \nabla f_{n}(w)$, where $\nabla f_{n}(w) = \frac{-y_{n} x_{n}}{1 + e^{y_{n} w^{T} x_{n}}}$.
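The same model trained with mini-batch SGD, following the update on this slide; the batch size, decaying step-size schedule, and random seed below are assumptions for illustration:

```python
import numpy as np

def logistic_sgd(X, y, batch_size=10, eta0=0.5, num_iters=2000, seed=0):
    """Mini-batch SGD for logistic regression with labels y in {-1, +1}."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    for t in range(1, num_iters + 1):
        B = rng.choice(N, size=batch_size, replace=False)    # sample a batch
        margins = y[B] * (X[B] @ w)
        grad = -(X[B].T @ (y[B] / (1.0 + np.exp(margins)))) / batch_size
        w = w - (eta0 / np.sqrt(t)) * grad                    # eta_t = eta0 / sqrt(t)
    return w
```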
Why does SGD work? The stochastic gradient is an unbiased estimator of the full gradient: $\mathbb{E}\big[\frac{1}{|B|} \sum_{n \in B} \nabla f_{n}(w)\big] = \frac{1}{N} \sum_{n=1}^{N} \nabla f_{n}(w) = \nabla E_{\text{in}}(w)$. Each iteration is updated by the gradient plus zero-mean noise.
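A quick numerical check of the unbiasedness claim: averaging the mini-batch gradient over many random batches approaches the full gradient. The data, weight vector, batch size, and number of trials are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = np.sign(X @ np.array([1.0, -1.0, 2.0, 0.0]) + 0.1 * rng.normal(size=500))
w = rng.normal(size=4)

def grad_on(idx):
    """Average logistic-loss gradient over the examples indexed by idx."""
    m = y[idx] * (X[idx] @ w)
    return -(X[idx].T @ (y[idx] / (1.0 + np.exp(m)))) / len(idx)

full_grad = grad_on(np.arange(500))
batch_grads = [grad_on(rng.choice(500, size=10, replace=False)) for _ in range(20000)]
print(np.linalg.norm(np.mean(batch_grads, axis=0) - full_grad))  # close to 0
```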
Stochastic gradient descent. In gradient descent, $\eta$ (the step size) is a fixed constant. Can we use a fixed step size for SGD? No: SGD with a fixed step size cannot converge to global/local minimizers. If $w^{*}$ is the minimizer, then $\nabla f(w^{*}) = \frac{1}{N} \sum_{n=1}^{N} \nabla f_{n}(w^{*}) = 0$, but $\frac{1}{|B|} \sum_{n \in B} \nabla f_{n}(w^{*}) \neq 0$ in general when $B$ is a strict subset. (Even if we reach the minimizer, SGD will move away from it.)
Stochastic gradient descent: step size. To make SGD converge, the step size should decrease to 0: $\eta_{t} \rightarrow 0$. Usually with a polynomial rate: $\eta_{t} \propto t^{-a}$ for a constant $a$.
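One common way to implement this decaying schedule; the base rate $\eta_{0}$ and exponent $a$ below are example values, with $a$ typically chosen in $(0.5, 1]$:

```python
def step_size(t, eta0=0.5, a=0.5):
    """Polynomially decaying SGD step size: eta_t = eta0 / t**a, so eta_t -> 0."""
    return eta0 / (t ** a)

print([round(step_size(t), 4) for t in (1, 10, 100, 1000)])
```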
Stochastic gradient descent vs gradient descent. Stochastic gradient descent, pros: cheaper computation per iteration; faster convergence in the beginning. Cons: less stable; slower final convergence; hard to tune the step size. (Figure from https://medium.com/@imadphd/gradient-descent-algorithm-and-its-variants-10f652806a3)
Revisit the Perceptron Learning Algorithm. Given classification data $\{x_{n}, y_{n}\}_{n=1}^{N}$, learn a linear model by minimizing $\min_{w} \frac{1}{N} \sum_{n=1}^{N} \ell(w^{T} x_{n}, y_{n})$. Consider the loss $\ell(w^{T} x_{n}, y_{n}) = \max(0, -y_{n} w^{T} x_{n})$. What's the gradient?
Revisit the Perceptron Learning Algorithm. $\ell(w^{T} x_{n}, y_{n}) = \max(0, -y_{n} w^{T} x_{n})$. Consider two cases. Case I: $y_{n} w^{T} x_{n} > 0$ (prediction correct): $\ell(w^{T} x_{n}, y_{n}) = 0$ and $\nabla_{w} \ell(w^{T} x_{n}, y_{n}) = 0$. Case II: $y_{n} w^{T} x_{n} < 0$ (prediction wrong): $\ell(w^{T} x_{n}, y_{n}) = -y_{n} w^{T} x_{n}$ and $\nabla_{w} \ell(w^{T} x_{n}, y_{n}) = -y_{n} x_{n}$. SGD update rule: sample an index $n$; set $w^{t+1} \leftarrow w^{t}$ if $y_{n} w^{T} x_{n} \geq 0$ (prediction correct), and $w^{t+1} \leftarrow w^{t} + \eta_{t} y_{n} x_{n}$ if $y_{n} w^{T} x_{n} < 0$ (prediction wrong). This is equivalent to the Perceptron Learning Algorithm when $\eta_{t} = 1$.
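A sketch of this SGD update with $\eta_{t} = 1$, which recovers the perceptron update; the number of passes and the shuffling are assumptions, and the boundary case $y_{n} w^{T} x_{n} = 0$ is treated as a mistake (a common PLA convention) so the all-zero start makes progress:

```python
import numpy as np

def perceptron_sgd(X, y, num_epochs=10, seed=0):
    """SGD with eta_t = 1 on l(w^T x, y) = max(0, -y w^T x): the PLA update."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(num_epochs):
        for n in rng.permutation(N):           # sample one example at a time
            if y[n] * (X[n] @ w) <= 0:         # prediction wrong (boundary counted as wrong)
                w = w + y[n] * X[n]            # w <- w + eta_t * y_n x_n with eta_t = 1
    return w
```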
Conclusions: gradient descent and stochastic gradient descent. Next class: LFD 2. Questions?