CS260: Machine Learning Algorithms

1 CS260: Machine Learning Algorithms Lecture 4: Stochastic Gradient Descent Cho-Jui Hsieh UCLA Jan 16, 2019

3 Large-scale Problems
Machine learning: usually minimizing the training loss
$$\min_w \Big\{ \frac{1}{N} \sum_{n=1}^{N} l(w^T x_n, y_n) \Big\} := f(w)$$ (linear model)
$$\min_w \Big\{ \frac{1}{N} \sum_{n=1}^{N} l(h_w(x_n), y_n) \Big\} := f(w)$$ (general hypothesis)
$l$: loss function (e.g., $l(a, b) = (a - b)^2$)
Gradient descent: $w \leftarrow w - \eta \nabla f(w)$, where evaluating $\nabla f(w)$ is the main computation
In general, $f(w) = \frac{1}{N} \sum_{n=1}^{N} f_n(w)$, where each $f_n(w)$ depends only on $(x_n, y_n)$

5 Stochastic gradient
Gradient: $\nabla f(w) = \frac{1}{N} \sum_{n=1}^{N} \nabla f_n(w)$
Each gradient computation needs to go through all training samples: slow when there are millions of samples
Faster way to compute an approximate gradient?
Use stochastic sampling:
Sample a small subset $B \subseteq \{1, \dots, N\}$
Estimated gradient: $\nabla f(w) \approx \frac{1}{|B|} \sum_{n \in B} \nabla f_n(w)$
$|B|$: batch size

7 Stochastic gradient descent
Stochastic Gradient Descent (SGD)
Input: training data $\{x_n, y_n\}_{n=1}^{N}$
Initialize $w$ (zero or random)
For $t = 1, 2, \dots$
    Sample a small batch $B \subseteq \{1, \dots, N\}$
    Update parameter: $w \leftarrow w - \eta_t \frac{1}{|B|} \sum_{n \in B} \nabla f_n(w)$
Extreme case: $|B| = 1$, i.e., sample one training example at a time
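The loop above can be sketched in plain Python. This is a minimal illustration, not the lecture's reference code: the helper names (`sgd`, `grad_sq`), the toy data, and the constant step size are all assumptions made for the example.

```python
import random

def sgd(grad_fn, n_samples, w0, step_size, batch_size, n_iters, seed=0):
    """Mini-batch SGD: w <- w - eta_t * (1/|B|) * sum_{n in B} grad f_n(w)."""
    rng = random.Random(seed)
    w = list(w0)
    for t in range(1, n_iters + 1):
        # Sample the batch B uniformly, without replacement
        batch = rng.sample(range(n_samples), batch_size)
        g = [0.0] * len(w)
        for n in batch:
            gn = grad_fn(w, n)
            for j in range(len(w)):
                g[j] += gn[j] / batch_size
        eta = step_size(t)
        w = [wj - eta * gj for wj, gj in zip(w, g)]
    return w

# Hypothetical noiseless toy data: y_n = 2 * x_n, with f_n(w) = (w * x_n - y_n)^2
xs = [0.5, 1.0, 1.5, 2.0]
ys = [2.0 * x for x in xs]

def grad_sq(w, n):
    """Gradient of f_n(w) = (w * x_n - y_n)^2 for a one-dimensional w."""
    return [2.0 * (w[0] * xs[n] - ys[n]) * xs[n]]

w_hat = sgd(grad_sq, len(xs), [0.0], lambda t: 0.1, batch_size=2, n_iters=200)
```

A constant step size is enough here only because every $f_n$ in this toy problem is minimized at the same point ($w = 2$), so the gradient noise vanishes at the optimum; this is not the typical case.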

8 Logistic Regression by SGD
Logistic regression: $$\min_w \frac{1}{N} \sum_{n=1}^{N} \underbrace{\log(1 + e^{-y_n w^T x_n})}_{f_n(w)}$$
SGD for Logistic Regression
Input: training data $\{x_n, y_n\}_{n=1}^{N}$
Initialize $w$ (zero or random)
For $t = 1, 2, \dots$
    Sample a batch $B \subseteq \{1, \dots, N\}$
    Update parameter: $$w \leftarrow w - \eta_t \frac{1}{|B|} \sum_{n \in B} \underbrace{\frac{-y_n x_n}{1 + e^{y_n w^T x_n}}}_{\nabla f_n(w)}$$
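A minimal sketch of this update in plain Python follows. The function names, the four-point toy dataset, and the step-size schedule $\eta_t = \eta_0 / \sqrt{t}$ are assumptions chosen for illustration; the slide itself does not fix a schedule.

```python
import math
import random

def logistic_grad(w, x, y):
    """Gradient of log(1 + exp(-y * w.x)) w.r.t. w, i.e. -y * x / (1 + exp(y * w.x))."""
    margin = y * sum(wj * xj for wj, xj in zip(w, x))
    coeff = -y / (1.0 + math.exp(margin))
    return [coeff * xj for xj in x]

def sgd_logistic(X, Y, n_iters=2000, batch_size=2, eta0=1.0, seed=0):
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    for t in range(1, n_iters + 1):
        batch = rng.sample(range(len(X)), batch_size)
        g = [0.0] * len(w)
        for n in batch:
            gn = logistic_grad(w, X[n], Y[n])
            g = [gj + gnj / batch_size for gj, gnj in zip(g, gn)]
        eta = eta0 / math.sqrt(t)  # assumed decaying schedule
        w = [wj - eta * gj for wj, gj in zip(w, g)]
    return w

# Hypothetical linearly separable data with labels in {-1, +1}
X = [(2.0, 1.0), (1.0, 3.0), (-1.0, -2.0), (-2.0, -1.0)]
Y = [1, 1, -1, -1]
w = sgd_logistic(X, Y)
margins = [y * sum(wj * xj for wj, xj in zip(w, x)) for x, y in zip(X, Y)]
```

A positive entry in `margins` means the corresponding point is classified correctly by the sign of $w^T x_n$.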

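10 Why does SGD work?
The stochastic gradient is an unbiased estimator of the full gradient (for $B$ sampled uniformly at random):
$$E\Big[\frac{1}{|B|} \sum_{n \in B} \nabla f_n(w)\Big] = \frac{1}{N} \sum_{n=1}^{N} \nabla f_n(w) = \nabla f(w)$$
Each iteration is therefore updated by the gradient plus zero-mean noise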
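The unbiasedness claim can be checked exactly on a small example by averaging the mini-batch estimate over every possible batch. The gradient values below are made up, and the check assumes batches of a fixed size drawn uniformly without replacement.

```python
from itertools import combinations

# Hypothetical per-sample gradients at some fixed w (scalars for simplicity)
grads = [3.0, -1.0, 4.0, 2.0]
N, batch_size = len(grads), 2

full_gradient = sum(grads) / N

# Average the mini-batch estimate over every equally likely batch of size 2
batches = list(combinations(range(N), batch_size))
expected_estimate = sum(
    sum(grads[n] for n in B) / batch_size for B in batches
) / len(batches)
```

Individual batch estimates differ from `full_gradient`, but their average over all batches matches it.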

15 Stochastic gradient descent
In gradient descent, $\eta$ (step size) is a fixed constant
Can we use a fixed step size for SGD?
In general, SGD with a fixed step size cannot converge to global/local minimizers
If $w^*$ is the minimizer, $\nabla f(w^*) = \frac{1}{N} \sum_{n=1}^{N} \nabla f_n(w^*) = 0$,
but in general $\frac{1}{|B|} \sum_{n \in B} \nabla f_n(w^*) \neq 0$ if $B$ is a proper subset
(Even if we reach the minimizer, SGD will move away from it)
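A small worked example makes the last point concrete. The two one-dimensional functions below are hypothetical and chosen only so the arithmetic is easy to follow.

```latex
% Two hypothetical samples (N = 2)
f_1(w) = \tfrac{1}{2}(w - 1)^2, \qquad f_2(w) = \tfrac{1}{2}(w + 1)^2
% Full objective and its minimizer
f(w) = \tfrac{1}{2}\bigl(f_1(w) + f_2(w)\bigr) = \tfrac{1}{2} w^2 + \tfrac{1}{2},
\qquad w^* = 0, \qquad \nabla f(w^*) = 0
% Per-sample gradients at the minimizer do not vanish
\nabla f_1(w^*) = -1, \qquad \nabla f_2(w^*) = +1
% One SGD step from w^* with B = \{1\} and fixed step size \eta
w \leftarrow w^* - \eta \, \nabla f_1(w^*) = \eta \neq 0
```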

16 Stochastic gradient descent, step size
To make SGD converge:
The step size should decrease to 0: $\eta_t \to 0$
Usually with a polynomial rate: $\eta_t \approx t^{-a}$ with constant $a$
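The effect of the schedule can be seen on a one-dimensional toy problem. This is a sketch under assumptions: the data, the helper names, and the choices $\eta = 0.5$ (fixed) and $\eta_t = t^{-1}$ (decaying) are illustrative only.

```python
import random

def sgd_1d(data, step_size, n_iters, seed=0):
    """SGD with |B| = 1 on f(w) = (1/N) * sum_n 0.5 * (w - a_n)^2; returns all iterates."""
    rng = random.Random(seed)
    w, path = 0.0, []
    for t in range(1, n_iters + 1):
        a = rng.choice(data)
        w -= step_size(t) * (w - a)  # grad f_n(w) = w - a_n
        path.append(w)
    return path

def tail_mse(path, k=2000):
    """Mean squared distance to the minimizer w* = 0 over the last k iterates."""
    return sum(w * w for w in path[-k:]) / k

data = [-1.0, 1.0]  # hypothetical samples; the minimizer of f is their mean, w* = 0
fixed = sgd_1d(data, lambda t: 0.5, 20000)      # fixed step size
decay = sgd_1d(data, lambda t: 1.0 / t, 20000)  # eta_t = t^(-a) with a = 1

fixed_error = tail_mse(fixed)
decay_error = tail_mse(decay)
```

With the fixed step the iterates keep bouncing around the minimizer, so `fixed_error` stays bounded away from zero, while the decaying schedule lets `decay_error` shrink as the run gets longer.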

17 Stochastic gradient descent vs Gradient descent
Stochastic gradient descent:
pros:
    cheaper computation per iteration
    faster convergence in the beginning
cons:
    less stable, slower final convergence
    hard to tune step size
(Figure from gradient-descent-algorithm-and-its-variants-10f652806a3)

18 Revisit Perceptron Learning Algorithm
Given a classification dataset $\{x_n, y_n\}_{n=1}^{N}$
Learning a linear model: $$\min_w \frac{1}{N} \sum_{n=1}^{N} l(w^T x_n, y_n)$$
Consider the loss: $l(w^T x_n, y_n) = \max(0, -y_n w^T x_n)$
What's the gradient?

21 Revisit Perceptron Learning Algorithm
$l(w^T x_n, y_n) = \max(0, -y_n w^T x_n)$
Consider two cases:
Case I: $y_n w^T x_n > 0$ (prediction correct)
    $l(w^T x_n, y_n) = 0$
    $\nabla_w l(w^T x_n, y_n) = 0$
Case II: $y_n w^T x_n < 0$ (prediction wrong)
    $l(w^T x_n, y_n) = -y_n w^T x_n$
    $\nabla_w l(w^T x_n, y_n) = -y_n x_n$
SGD update rule: sample an index $n$
$$w_{t+1} \leftarrow \begin{cases} w_t & \text{if } y_n w_t^T x_n \geq 0 \text{ (predict correct)} \\ w_t + \eta_t y_n x_n & \text{if } y_n w_t^T x_n < 0 \text{ (predict wrong)} \end{cases}$$
Equivalent to the Perceptron Learning Algorithm when $\eta_t = 1$
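The update rule can be sketched as follows. One deliberate deviation from the slide, stated as an assumption: the sketch treats a tie ($y_n w^T x_n = 0$) as a mistake, the usual perceptron convention, because under the slide's rule a zero initial $w$ would never be updated. The dataset and function name are hypothetical.

```python
import random

def perceptron_sgd(X, Y, n_iters=1000, eta=1.0, seed=0):
    """SGD with |B| = 1 on the loss max(0, -y * w.x); eta = 1 gives the perceptron update."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    for _ in range(n_iters):
        n = rng.randrange(len(X))
        margin = Y[n] * sum(wj * xj for wj, xj in zip(w, X[n]))
        # Assumption: ties (margin == 0) count as mistakes so that w = 0 can make progress
        if margin <= 0:
            w = [wj + eta * Y[n] * xj for wj, xj in zip(w, X[n])]
    return w

# Hypothetical linearly separable data with labels in {-1, +1}
X = [(2.0, 1.0), (1.0, 3.0), (-1.0, -2.0), (-2.0, -1.0)]
Y = [1, 1, -1, -1]
w = perceptron_sgd(X, Y)
margins = [y * sum(wj * xj for wj, xj in zip(w, x)) for x, y in zip(X, Y)]
```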

25 Momentum
Gradient descent: only uses the current gradient (local information)
Momentum: also uses previous gradient information
The momentum update rule:
$$v_t = \beta v_{t-1} + (1 - \beta) \nabla f(w_t)$$
$$w_{t+1} = w_t - \alpha v_t$$
$\beta \in [0, 1)$: discount factor, $\alpha$: step size
Equivalent to using a moving average of gradients:
$$v_t = (1 - \beta) \nabla f(w_t) + \beta (1 - \beta) \nabla f(w_{t-1}) + \beta^2 (1 - \beta) \nabla f(w_{t-2}) + \cdots$$
Another equivalent form (with the step size $\alpha$ rescaled by $1 - \beta$):
$$v_t = \beta v_{t-1} + \alpha \nabla f(w_t)$$
$$w_{t+1} = w_t - v_t$$
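The two forms can be compared numerically. In this sketch the objective $f(w) = \frac{1}{2} w^2$, the function names, and the parameter values are assumptions; the second form is run with step size $\alpha (1 - \beta)$, which is the rescaling that makes the iterates coincide.

```python
def grad(w):
    return w  # gradient of the hypothetical objective f(w) = 0.5 * w^2

def momentum_form_a(w, alpha, beta, n_iters):
    """v_t = beta * v_{t-1} + (1 - beta) * grad;  w_{t+1} = w_t - alpha * v_t."""
    v = 0.0
    for _ in range(n_iters):
        v = beta * v + (1.0 - beta) * grad(w)
        w = w - alpha * v
    return w

def momentum_form_b(w, alpha, beta, n_iters):
    """v_t = beta * v_{t-1} + alpha * grad;  w_{t+1} = w_t - v_t."""
    v = 0.0
    for _ in range(n_iters):
        v = beta * v + alpha * grad(w)
        w = w - v
    return w

alpha, beta = 0.5, 0.9
w_a = momentum_form_a(1.0, alpha, beta, 50)
w_b = momentum_form_b(1.0, alpha * (1.0 - beta), beta, 50)  # rescaled step size
```

Up to floating-point rounding, `w_a` and `w_b` are the same iterate.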

26 Momentum gradient descent
Momentum gradient descent
Initialize $w_0$, $v_0 = 0$
For $t = 1, 2, \dots$
    Compute $v_t \leftarrow \beta v_{t-1} + (1 - \beta) \nabla f(w_t)$
    Update $w_{t+1} \leftarrow w_t - \alpha v_t$
$\alpha$: learning rate
$\beta$: discount factor ($\beta = 0$ means no momentum)

27 Momentum stochastic gradient descent
Optimizing $f(w) = \frac{1}{N} \sum_{i=1}^{N} f_i(w)$
Momentum stochastic gradient descent
Initialize $w_0$, $v_0 = 0$
For $t = 1, 2, \dots$
    Sample an $i \in \{1, \dots, N\}$
    Compute $v_t \leftarrow \beta v_{t-1} + (1 - \beta) \nabla f_i(w_t)$
    Update $w_{t+1} \leftarrow w_t - \alpha v_t$
$\alpha$: learning rate
$\beta$: discount factor ($\beta = 0$ means no momentum)
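A minimal sketch of this loop in plain Python is shown below. The function name, the noiseless toy data, and the values of $\alpha$ and $\beta$ are assumptions for illustration.

```python
import random

def momentum_sgd(grad_i, n_samples, w0, alpha, beta, n_iters, seed=0):
    """v <- beta * v + (1 - beta) * grad f_i(w);  w <- w - alpha * v."""
    rng = random.Random(seed)
    w = list(w0)
    v = [0.0] * len(w)
    for _ in range(n_iters):
        i = rng.randrange(n_samples)  # sample one index i
        g = grad_i(w, i)
        v = [beta * vj + (1.0 - beta) * gj for vj, gj in zip(v, g)]
        w = [wj - alpha * vj for wj, vj in zip(w, v)]
    return w

# Hypothetical noiseless toy data: y_i = 2 * x_i, with f_i(w) = (w * x_i - y_i)^2
xs = [0.5, 1.0, 1.5, 2.0]
ys = [2.0 * x for x in xs]

def grad_i(w, i):
    return [2.0 * (w[0] * xs[i] - ys[i]) * xs[i]]

w_hat = momentum_sgd(grad_i, len(xs), [0.0], alpha=0.05, beta=0.9, n_iters=2000)
w_plain = momentum_sgd(grad_i, len(xs), [0.0], alpha=0.1, beta=0.0, n_iters=500)
```

Calling the function with `beta=0.0`, as for `w_plain`, reduces it to plain SGD with $|B| = 1$. On this toy problem both runs should end up near $w = 2$.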

28 Nesterov accelerated gradient
Using the "look-ahead" gradient:
$$v_t = \beta v_{t-1} + \alpha \nabla f(w_t - \beta v_{t-1})$$
$$w_{t+1} = w_t - v_t$$
(Figure from
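The look-ahead update can be sketched in a few lines. The objective $f(w) = \frac{1}{2}(w - 3)^2$, the function name, and the parameter values are assumptions for illustration.

```python
def nesterov(grad, w0, alpha, beta, n_iters):
    """v_t = beta * v_{t-1} + alpha * grad(w_t - beta * v_{t-1});  w_{t+1} = w_t - v_t."""
    w, v = w0, 0.0
    for _ in range(n_iters):
        lookahead = w - beta * v  # point where the gradient is evaluated
        v = beta * v + alpha * grad(lookahead)
        w = w - v
    return w

# Hypothetical objective f(w) = 0.5 * (w - 3)^2, whose gradient is w - 3
w_final = nesterov(lambda w: w - 3.0, w0=0.0, alpha=0.1, beta=0.9, n_iters=300)
```

The only difference from the second momentum form is that the gradient is evaluated at `lookahead` rather than at the current `w`.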

29 Why does momentum work?
Reduces the variance of the gradient estimator for SGD
Even for gradient descent, it's able to speed up convergence in some cases.

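31 Adagrad: Adaptive updates (2010)
SGD update: same step size for all variables
Adaptive algorithms: each dimension can have a different step size
Adagrad
Initialize $w^0$
For $t = 1, 2, \dots$
    Sample an $i \in \{1, \dots, N\}$
    Compute $g^t \leftarrow \nabla f_i(w^t)$
    $G_i^t \leftarrow G_i^{t-1} + (g_i^t)^2$
    Update $w_i^{t+1} \leftarrow w_i^t - \frac{\eta}{\sqrt{G_i^t + \epsilon}} g_i^t$
$\eta$: step size (constant)
$\epsilon$: small constant to avoid division by 0
(Note: the subscript $i$ on $g$, $G$, and $w$ indexes coordinates, whereas the sampled $i$ in $f_i$ indexes training examples.)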
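A minimal sketch of the per-coordinate update follows. To avoid the index clash, the code uses `i` for the sampled example and `j` for the coordinate. The function names and the badly scaled toy objective are assumptions; a single hypothetical sample ($N = 1$) keeps the run deterministic.

```python
import math
import random

def adagrad(grad_i, n_samples, w0, eta, n_iters, eps=1e-8, seed=0):
    rng = random.Random(seed)
    w = list(w0)
    G = [0.0] * len(w)  # running sum of squared gradients, one entry per coordinate
    for _ in range(n_iters):
        i = rng.randrange(n_samples)  # sampled training example
        g = grad_i(w, i)
        for j in range(len(w)):       # j indexes coordinates
            G[j] += g[j] ** 2
            w[j] -= eta * g[j] / math.sqrt(G[j] + eps)
    return w

# Hypothetical objective f(w) = 0.5 * (100 * (w_0 - 1)^2 + (w_1 + 2)^2):
# the two coordinates have curvatures that differ by a factor of 100.
def grad_scaled(w, i):
    return [100.0 * (w[0] - 1.0), w[1] + 2.0]

w = adagrad(grad_scaled, n_samples=1, w0=[0.0, 0.0], eta=1.0, n_iters=200)
```

Because each coordinate is divided by its own accumulated gradient magnitude, the same `eta` works for both coordinates despite their very different scales.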

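32 Adagrad
For each dimension $i$, we have observed $t$ samples $g_i^1, \dots, g_i^t$
Standard deviation of $g_i$ (uncentered): $$\sqrt{\frac{\sum_{t'=1}^{t} (g_i^{t'})^2}{t}} = \sqrt{\frac{G_i^t}{t}}$$
Assume the step size is $\eta / \sqrt{t}$; then the update becomes
$$w_i^{t+1} \leftarrow w_i^t - \frac{\eta}{\sqrt{t}} \cdot \frac{\sqrt{t}}{\sqrt{G_i^t}} \, g_i^t$$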

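33 Adam: Momentum + Adaptive updates (2015)
Adam
Initialize $w_0$, $m_0 = 0$, $v_0 = 0$
For $t = 1, 2, \dots$
    Sample an $i \in \{1, \dots, N\}$
    Compute $g_t \leftarrow \nabla f_i(w_{t-1})$
    $m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t$
    $v_t \leftarrow \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$
    $\hat{m}_t \leftarrow m_t / (1 - \beta_1^t)$
    $\hat{v}_t \leftarrow v_t / (1 - \beta_2^t)$
    Update $w_t \leftarrow w_{t-1} - \alpha \, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$
(All operations on $g_t$, $m_t$, and $v_t$ are elementwise.)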
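The steps above can be sketched in plain Python. The default values for `beta1`, `beta2`, and `eps` are commonly used settings and are an assumption here, since the slide does not give values; the function name and the constant test gradient are also hypothetical.

```python
import math
import random

def adam(grad_i, n_samples, w0, alpha=0.001, beta1=0.9, beta2=0.999,
         eps=1e-8, n_iters=1000, seed=0):
    rng = random.Random(seed)
    w = list(w0)
    m = [0.0] * len(w)  # moving average of gradients (momentum part)
    v = [0.0] * len(w)  # moving average of squared gradients (adaptive part)
    for t in range(1, n_iters + 1):
        i = rng.randrange(n_samples)
        g = grad_i(w, i)
        for j in range(len(w)):
            m[j] = beta1 * m[j] + (1 - beta1) * g[j]
            v[j] = beta2 * v[j] + (1 - beta2) * g[j] ** 2
            m_hat = m[j] / (1 - beta1 ** t)  # bias correction for m
            v_hat = v[j] / (1 - beta2 ** t)  # bias correction for v
            w[j] -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return w

# One step with a hypothetical gradient whose coordinates differ by a factor of 10^4
w_one_step = adam(lambda w, i: [100.0, -0.01], 1, [0.0, 0.0], alpha=0.1, n_iters=1)
```

After the bias correction, the first step has magnitude close to `alpha` in every coordinate regardless of the gradient's scale, so `w_one_step` is approximately `[-0.1, 0.1]`.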

34 Conclusions Stochastic gradient descent Momentum & adaptive updates Questions?
