ECS171: Machine Learning

1 ECS171: Machine Learning. Lecture 4: Optimization (LFD 3.3, SGD). Cho-Jui Hsieh, UC Davis, Jan 22, 2018

2 Gradient descent

3 Optimization
Goal: find the minimizer of a function, $\min_w f(w)$.
For now we assume $f$ is twice differentiable.
Machine learning algorithm: find the hypothesis that minimizes $E_{in}$.

5 Convex vs Nonconvex
Convex function: $\nabla f(w^*) = 0 \Leftrightarrow w^*$ is a global minimum. A twice-differentiable function is convex if $\nabla^2 f(w)$ is positive semidefinite for all $w$. Examples: linear regression, logistic regression, ...
Non-convex function: $\nabla f(w^*) = 0 \Leftrightarrow w^*$ is a global min, local min, or saddle point; most algorithms only converge to a point where the gradient is 0. Example: neural network, ...

8 Gradient Descent
Gradient descent: repeatedly do $w_{t+1} \leftarrow w_t - \alpha \nabla f(w_t)$, where $\alpha > 0$ is the step size.
Generate the sequence $w_1, w_2, \ldots$, which converges to a minimum solution ($\nabla f(w) = 0$).
Step size too large $\Rightarrow$ diverges; too small $\Rightarrow$ slow convergence.
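
The effect of the step size can be seen on a toy problem. The sketch below is only an illustration: the objective $f(w) = w^2$ (so $\nabla f(w) = 2w$), the starting point, and the three step sizes are assumptions chosen for the example, not values from the lecture.

```python
def gradient_descent(grad, w0, alpha, num_steps):
    """Repeat w <- w - alpha * grad(w) and return the final iterate."""
    w = w0
    for _ in range(num_steps):
        w = w - alpha * grad(w)
    return w


def grad_f(w):
    """Gradient of the toy objective f(w) = w^2 (assumed for illustration)."""
    return 2.0 * w


# Each update multiplies w by (1 - 2 * alpha), so the behaviour is easy to predict.
w_good = gradient_descent(grad_f, w0=1.0, alpha=0.1, num_steps=50)     # shrinks toward 0
w_large = gradient_descent(grad_f, w0=1.0, alpha=1.1, num_steps=50)    # |w| grows: diverges
w_small = gradient_descent(grad_f, w0=1.0, alpha=0.001, num_steps=50)  # barely moves

print(w_good, w_large, w_small)
```

With `alpha=0.1` the iterate is multiplied by 0.8 at every step, with `alpha=1.1` by -1.2 (so its magnitude grows), and with `alpha=0.001` by 0.998.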

11 Why gradient descent?
Reason I: the negative gradient is the steepest direction to decrease the objective function locally.
Reason II: successive approximation view. At each iteration, form an approximation function of $f(\cdot)$:
$f(w_t + d) \approx g(d) := f(w_t) + \nabla f(w_t)^T d + \frac{1}{2\alpha}\|d\|^2$
Update the solution by $w_{t+1} \leftarrow w_t + d^*$, where $d^* = \arg\min_d g(d)$:
$\nabla g(d^*) = 0 \Rightarrow \nabla f(w_t) + \frac{1}{\alpha} d^* = 0 \Rightarrow d^* = -\alpha \nabla f(w_t)$
$d^*$ will decrease $f(\cdot)$ if $\alpha$ (step size) is sufficiently small.

12 Illustration of gradient descent

13 Illustration of gradient descent
Form a quadratic approximation: $f(w_t + d) \approx g(d) = f(w_t) + \nabla f(w_t)^T d + \frac{1}{2\alpha}\|d\|^2$

14 Illustration of gradient descent
Minimize $g(d)$: $\nabla g(d^*) = 0 \Rightarrow \nabla f(w_t) + \frac{1}{\alpha} d^* = 0 \Rightarrow d^* = -\alpha \nabla f(w_t)$

15 Illustration of gradient descent
Update: $w_{t+1} = w_t + d^* = w_t - \alpha \nabla f(w_t)$

16 Illustration of gradient descent
Form another quadratic approximation: $f(w_{t+1} + d) \approx g(d) = f(w_{t+1}) + \nabla f(w_{t+1})^T d + \frac{1}{2\alpha}\|d\|^2$, so $d^* = -\alpha \nabla f(w_{t+1})$

17 Illustration of gradient descent
Update: $w_{t+2} = w_{t+1} + d^* = w_{t+1} - \alpha \nabla f(w_{t+1})$

18 When will it diverge?
It can diverge ($f(w_t) < f(w_{t+1})$) if $g$ is not an upper bound of $f$.

19 When will it converge?
It always converges ($f(w_t) > f(w_{t+1})$) when $g$ is an upper bound of $f$.

21 Convergence
Let $L$ be the Lipschitz constant of the gradient ($\nabla^2 f(x) \preceq L I$ for all $x$).
Theorem: gradient descent converges if $\alpha < \frac{1}{L}$.
In practice, we do not know $L$ $\Rightarrow$ need to tune the step size when running gradient descent.
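
The link between the Hessian bound and the "upper bound" picture can be worked out in a few lines. This is a standard argument written in the slides' notation ($g$, $d^*$, $\alpha$, $L$), added as a sketch and not taken from the lecture itself. It assumes $\nabla^2 f(x) \preceq L I$ for all $x$ and $\alpha \le 1/L$.

```latex
% Second-order Taylor bound, using \nabla^2 f(x) \preceq L I:
f(w_t + d) \;\le\; f(w_t) + \nabla f(w_t)^T d + \frac{L}{2}\|d\|^2
           \;\le\; f(w_t) + \nabla f(w_t)^T d + \frac{1}{2\alpha}\|d\|^2 \;=\; g(d)
\qquad \text{since } \alpha \le \tfrac{1}{L} \text{ gives } \tfrac{L}{2} \le \tfrac{1}{2\alpha}.

% Plug in the minimizer d^* = -\alpha \nabla f(w_t):
f(w_{t+1}) \;\le\; g(d^*)
  \;=\; f(w_t) - \alpha\|\nabla f(w_t)\|^2 + \frac{\alpha}{2}\|\nabla f(w_t)\|^2
  \;=\; f(w_t) - \frac{\alpha}{2}\|\nabla f(w_t)\|^2 .
```

So whenever $g$ upper-bounds $f$, each step decreases $f$ by at least $\frac{\alpha}{2}\|\nabla f(w_t)\|^2$, which is strictly positive unless the gradient is already zero.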

23 Applying to Logistic regression
Gradient descent for logistic regression:
Initialize the weights $w_0$.
For $t = 1, 2, \ldots$
  Compute the gradient $\nabla E_{in} = -\frac{1}{N}\sum_{n=1}^{N} \frac{y_n x_n}{1 + e^{y_n w^T x_n}}$
  Update the weights: $w \leftarrow w - \eta \nabla E_{in}$
Return the final weights $w$.
When to stop? Fixed number of iterations, or stop when $\|\nabla E_{in}\| < \epsilon$.
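
The loop above can be sketched in plain Python. Everything beyond the update rule is an assumption made for illustration: the helper names (`dot`, `logistic_gradient`, `logistic_gd`), the default values of `eta`, `max_iters` and `eps`, and the four-point toy data set whose first feature is a constant 1 acting as a bias term.

```python
import math


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(u_j * v_j for u_j, v_j in zip(u, v))


def logistic_gradient(w, X, y):
    """Gradient of E_in(w) = (1/N) sum_n log(1 + exp(-y_n w^T x_n))."""
    n_samples = len(X)
    grad = [0.0] * len(w)
    for x_n, y_n in zip(X, y):
        coeff = -y_n / (1.0 + math.exp(y_n * dot(w, x_n)))
        for j, x_nj in enumerate(x_n):
            grad[j] += coeff * x_nj / n_samples
    return grad


def logistic_gd(X, y, eta=0.5, max_iters=1000, eps=1e-3):
    """Gradient descent with both stopping rules: iteration cap and small gradient norm."""
    w = [0.0] * len(X[0])
    for _ in range(max_iters):
        grad = logistic_gradient(w, X, y)
        if math.sqrt(dot(grad, grad)) < eps:
            break
        w = [w_j - eta * g_j for w_j, g_j in zip(w, grad)]
    return w


# Hypothetical toy data: labels are +1 for positive second feature, -1 otherwise.
X_toy = [[1.0, 2.0], [1.0, 1.0], [1.0, -1.0], [1.0, -2.0]]
y_toy = [1, 1, -1, -1]
w_gd = logistic_gd(X_toy, y_toy)
print(w_gd)
```

Because this toy set is linearly separable, the logistic loss has no finite minimizer and the weights keep growing slowly, so the iteration cap is the stopping rule that usually triggers here.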

24 Stochastic Gradient descent

26 Large-scale Problems
Machine learning: usually minimizing the in-sample loss (training loss):
$\min_w \{\frac{1}{N}\sum_{n=1}^{N} \ell(w^T x_n, y_n)\} := E_{in}(w)$ (linear model)
$\min_w \{\frac{1}{N}\sum_{n=1}^{N} \ell(h_w(x_n), y_n)\} := E_{in}(w)$ (general hypothesis)
$\ell$: loss function (e.g., $\ell(a, b) = (a - b)^2$)
Gradient descent: $w \leftarrow w - \eta \nabla E_{in}(w)$, where computing $\nabla E_{in}(w)$ is the main computation.
In general, $E_{in}(w) = \frac{1}{N}\sum_{n=1}^{N} f_n(w)$, and each $f_n(w)$ only depends on $(x_n, y_n)$.

28 Stochastic gradient
Gradient: $\nabla E_{in}(w) = \frac{1}{N}\sum_{n=1}^{N} \nabla f_n(w)$
Each gradient computation needs to go through all training samples $\Rightarrow$ slow when there are millions of samples.
Faster way to compute an "approximate gradient"? Use stochastic sampling:
Sample a small subset $B \subseteq \{1, \ldots, N\}$.
Estimate the gradient: $\nabla E_{in}(w) \approx \frac{1}{|B|}\sum_{n \in B} \nabla f_n(w)$, where $|B|$ is the batch size.

30 Stochastic gradient descent
Stochastic Gradient Descent (SGD):
Input: training data $\{x_n, y_n\}_{n=1}^{N}$
Initialize $w$ (zero or random).
For $t = 1, 2, \ldots$
  Sample a small batch $B \subseteq \{1, \ldots, N\}$
  Update parameter: $w \leftarrow w - \eta_t \frac{1}{|B|}\sum_{n \in B} \nabla f_n(w)$
Extreme case: $|B| = 1$ $\Rightarrow$ sample one training example at a time.

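31 Logistic Regression by SGD
Logistic regression: $\min_w \frac{1}{N}\sum_{n=1}^{N} f_n(w)$, with $f_n(w) = \log(1 + e^{-y_n w^T x_n})$
SGD for Logistic Regression:
Input: training data $\{x_n, y_n\}_{n=1}^{N}$
Initialize $w$ (zero or random).
For $t = 1, 2, \ldots$
  Sample a batch $B \subseteq \{1, \ldots, N\}$
  Update parameter: $w \leftarrow w - \eta_t \frac{1}{|B|}\sum_{n \in B} \nabla f_n(w)$, with $\nabla f_n(w) = \frac{-y_n x_n}{1 + e^{y_n w^T x_n}}$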
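
A minimal mini-batch version of this procedure, in plain Python, might look as follows. The helper names, the toy data, the batch size of 2, the fixed random seed, and the particular decreasing schedule `eta0 / sqrt(t)` are all assumptions for the sake of a runnable sketch; the lecture only specifies the update rule.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(u_j * v_j for u_j, v_j in zip(u, v))


def sample_gradient(w, x_n, y_n):
    """Gradient of f_n(w) = log(1 + exp(-y_n w^T x_n))."""
    coeff = -y_n / (1.0 + math.exp(y_n * dot(w, x_n)))
    return [coeff * x_nj for x_nj in x_n]


def in_sample_error(w, X, y):
    """E_in(w): average logistic loss over the whole training set."""
    losses = [math.log(1.0 + math.exp(-y_n * dot(w, x_n))) for x_n, y_n in zip(X, y)]
    return sum(losses) / len(losses)


def logistic_sgd(X, y, batch_size=2, eta0=0.5, num_iters=2000, seed=0):
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    for t in range(1, num_iters + 1):
        batch = rng.sample(range(len(X)), batch_size)  # B, drawn without replacement
        eta_t = eta0 / math.sqrt(t)                    # assumed decreasing step size
        grad = [0.0] * len(w)
        for n in batch:
            g_n = sample_gradient(w, X[n], y[n])
            grad = [g_j + g_nj / batch_size for g_j, g_nj in zip(grad, g_n)]
        w = [w_j - eta_t * g_j for w_j, g_j in zip(w, grad)]
    return w


# Hypothetical toy data; the first feature is a constant bias term.
X_toy = [[1.0, 2.0], [1.0, 1.0], [1.0, -1.0], [1.0, -2.0]]
y_toy = [1, 1, -1, -1]
w_sgd = logistic_sgd(X_toy, y_toy)
print(w_sgd, in_sample_error(w_sgd, X_toy, y_toy))
```

Each iteration touches only `batch_size` samples, which is the whole point: the per-iteration cost no longer depends on $N$.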

33 Why does SGD work?
The stochastic gradient is an unbiased estimator of the full gradient:
$E\left[\frac{1}{|B|}\sum_{n \in B} \nabla f_n(w)\right] = \frac{1}{N}\sum_{n=1}^{N} \nabla f_n(w) = \nabla E_{in}(w)$
Each iteration is updated by gradient + zero-mean noise.
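
Unbiasedness can be checked exactly on a tiny problem by enumerating every possible batch. The sketch assumes batches of a fixed size drawn uniformly without replacement, and uses a made-up scalar objective $f_n(w) = \frac{1}{2}(w - a_n)^2$ with hypothetical data `a`; neither comes from the lecture.

```python
from itertools import combinations

# Hypothetical scalar problem: f_n(w) = 0.5 * (w - a_n)^2, so grad f_n(w) = w - a_n.
a = [1.0, 2.0, 4.0, 7.0]
w = 3.0
N = len(a)


def grad_f_n(w, n):
    return w - a[n]


# Full gradient: average over all N samples.
full_gradient = sum(grad_f_n(w, n) for n in range(N)) / N

# Expectation of the mini-batch gradient: average over every equally likely batch.
batch_size = 2
batches = list(combinations(range(N), batch_size))
batch_gradients = [sum(grad_f_n(w, n) for n in B) / batch_size for B in batches]
expected_stochastic_gradient = sum(batch_gradients) / len(batches)

print(full_gradient, expected_stochastic_gradient)
```

Individual entries of `batch_gradients` differ from `full_gradient` (that is the zero-mean noise), but their average over all batches matches it.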

38 Stochastic gradient descent
In gradient descent, $\eta$ (step size) is a fixed constant. Can we use a fixed step size for SGD?
SGD with a fixed step size cannot converge to global/local minimizers:
If $w^*$ is the minimizer, $\nabla f(w^*) = \frac{1}{N}\sum_{n=1}^{N} \nabla f_n(w^*) = 0$, but $\frac{1}{|B|}\sum_{n \in B} \nabla f_n(w^*) \neq 0$ if $B$ is a subset.
(Even if we reach the minimizer, SGD will move away from it.)
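
A small numeric illustration of this point, under assumed toy values: take $f_n(w) = \frac{1}{2}(w - a_n)^2$ with hypothetical data `a`, whose average is minimized at the mean of `a`, and apply one SGD step with batch size 1 and a fixed step size `eta = 0.1` starting exactly at the minimizer.

```python
# Hypothetical scalar problem: f_n(w) = 0.5 * (w - a_n)^2, minimized on average at mean(a).
a = [1.0, 2.0, 4.0, 7.0]
w_star = sum(a) / len(a)

# The full gradient vanishes at the minimizer ...
full_gradient_at_min = sum(w_star - a_n for a_n in a) / len(a)

# ... but every single-sample gradient (batch size 1) is nonzero there,
# so one fixed-step SGD update moves away from w_star whichever sample is drawn.
eta = 0.1
after_one_step = [w_star - eta * (w_star - a_n) for a_n in a]

print(full_gradient_at_min, after_one_step)
```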

39 Stochastic gradient descent, step size
To make SGD converge, the step size should decrease to 0: $\eta_t \rightarrow 0$.
Usually with polynomial rate: $\eta_t \approx t^{-a}$ with constant $a$.
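
A polynomially decaying schedule is a one-liner. In the sketch below the scale `eta0 = 0.5`, the exponent `a = 0.5`, the toy objective $f_n(w) = \frac{1}{2}(w - a_n)^2$, the data, and the seed are all assumptions picked for illustration.

```python
import random


def step_size(t, eta0=0.5, a=0.5):
    """Polynomially decaying step size eta_t = eta0 * t^(-a); eta0 and a are assumed values."""
    return eta0 * t ** (-a)


def sgd_mean(a_values, num_iters=5000, seed=0):
    """SGD with batch size 1 on f_n(w) = 0.5 * (w - a_n)^2, using the decaying schedule."""
    rng = random.Random(seed)
    w = 0.0
    for t in range(1, num_iters + 1):
        a_n = rng.choice(a_values)          # sample one training example
        w -= step_size(t) * (w - a_n)       # grad f_n(w) = w - a_n
    return w


w_final = sgd_mean([1.0, 2.0, 4.0, 7.0])   # the minimizer is the mean, 3.5
print(step_size(1), step_size(4), step_size(100), w_final)
```

As `t` grows the noise injected by each update shrinks with `step_size(t)`, so the iterate settles near the minimizer instead of bouncing around it at a fixed scale.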

40 Stochastic gradient descent vs Gradient descent
Stochastic gradient descent:
pros: cheaper computation per iteration; faster convergence in the beginning
cons: less stable, slower final convergence; hard to tune step size
(Figure from gradient-descent-algorithm-and-its-variants-10f652806a3)

41 Revisit Perceptron Learning Algorithm
Given classification data $\{x_n, y_n\}_{n=1}^{N}$, learn a linear model: $\min_w \frac{1}{N}\sum_{n=1}^{N} \ell(w^T x_n, y_n)$
Consider the loss: $\ell(w^T x_n, y_n) = \max(0, -y_n w^T x_n)$
What's the gradient?

44 Revisit Perceptron Learning Algorithm
$\ell(w^T x_n, y_n) = \max(0, -y_n w^T x_n)$
Consider two cases:
Case I: $y_n w^T x_n > 0$ (prediction correct): $\ell(w^T x_n, y_n) = 0$, so $\frac{\partial}{\partial w}\ell(w^T x_n, y_n) = 0$
Case II: $y_n w^T x_n < 0$ (prediction wrong): $\ell(w^T x_n, y_n) = -y_n w^T x_n$, so $\frac{\partial}{\partial w}\ell(w^T x_n, y_n) = -y_n x_n$
SGD update rule: sample an index $n$, then
$w_{t+1} \leftarrow \begin{cases} w_t & \text{if } y_n w^T x_n \ge 0 \text{ (predict correct)} \\ w_t + \eta_t y_n x_n & \text{if } y_n w^T x_n < 0 \text{ (predict wrong)} \end{cases}$
Equivalent to the Perceptron Learning Algorithm when $\eta_t = 1$.
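
The update rule translates almost directly into code. One deliberate deviation, flagged as an assumption: the sketch treats a margin of exactly 0 as a mistake (a valid subgradient choice at the kink of the loss), because otherwise training started from $w = 0$ would never make an update. The helper names, toy data, iteration count, and seed are also hypothetical.

```python
import random


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(u_j * v_j for u_j, v_j in zip(u, v))


def perceptron_sgd(X, y, num_iters=200, eta=1.0, seed=0):
    """SGD on the loss max(0, -y_n w^T x_n) with batch size 1 and step size eta."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    for _ in range(num_iters):
        n = rng.randrange(len(X))            # sample an index n
        if y[n] * dot(w, X[n]) <= 0:         # wrong (or on the boundary: assumed tie rule)
            w = [w_j + eta * y[n] * x_nj for w_j, x_nj in zip(w, X[n])]
        # otherwise the gradient is 0 and w is left unchanged
    return w


# Hypothetical linearly separable toy data; the first feature is a constant bias term.
X_toy = [[1.0, 2.0], [1.0, 1.0], [1.0, -1.0], [1.0, -2.0]]
y_toy = [1, 1, -1, -1]
w_pla = perceptron_sgd(X_toy, y_toy)
print(w_pla)
```

With `eta=1.0` each update is exactly the perceptron step $w \leftarrow w + y_n x_n$ applied to a misclassified sample.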

45 Conclusions
Gradient descent
Stochastic gradient descent
Next class: LFD 2
Questions?
