Stochastic Gradient Descent

Weihang Chen, Xingchen Chen, Jinxiu Liang, Cheng Xu, Zehao Chen and Donglin He
March 26, 2017

Outline
- What is Stochastic Gradient Descent
- Comparison between BGD and SGD
- Analysis on SGD
- Extensions and Variants

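What is Stochastic Gradient Descent? (I)
Structural risk minimization in machine learning:
- Given samples $\{(x_i, y_i)\}_{i=1}^{n}$ and a loss function $\ell(h, y)$,
- find a prediction function $h(x; \theta)$ by minimizing a risk measure
$$R_n(\theta) = \sum_{i=1}^{n} \ell(h(x_i; \theta), y_i) = \sum_{i=1}^{n} f_i(\theta)$$
Update via batch gradient descent (BGD), with step size $t_k$:
$$\theta^{(k+1)} = \theta^{(k)} - t_k \nabla R_n(\theta^{(k)}) = \theta^{(k)} - t_k \sum_{i=1}^{n} \nabla f_i(\theta^{(k)})$$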

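What is Stochastic Gradient Descent? (II)
Update via stochastic gradient descent (SGD): replace the full gradient $\nabla R_n(\theta^{(k)})$ with the gradient of a single term $f_{i_k}$:
$$\theta^{(k+1)} = \theta^{(k)} - t_k \nabla f_{i_k}(\theta^{(k)})$$
More generally, suppose we want to minimize a sum of functions $f_i$, $i = 1, 2, \ldots, m$:
$$\min_{\theta} \sum_{i=1}^{m} f_i(\theta)$$
BGD would sum all the gradients:
$$\theta^{(k+1)} = \theta^{(k)} - t_k \sum_{i=1}^{m} \nabla f_i(\theta^{(k)}), \quad k = 1, 2, \ldots$$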

What is Stochastic Gradient Descent? (III)
SGD instead looks at each gradient individually:
$$\theta^{(k+1)} = \theta^{(k)} - t_k \nabla f_{i_k}(\theta^{(k)}), \quad k = 1, 2, \ldots$$
where $i_k \in \{1, \ldots, m\}$ is some chosen index at iteration $k$:
- Random rule: choose $i_k \in \{1, \ldots, m\}$ uniformly at random (more common)
- Cyclic rule: choose $i_k = 1, 2, \ldots, m, 1, 2, \ldots, m, \ldots$
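
The two index rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sgd`, the toy terms $f_i(\theta) = \frac{1}{2}(\theta - a_i)^2$, and the step size $t_k = 1/(k+1)$ are hypothetical choices, not taken from the slides. Indices are 0-based here, while the slides count from 1.

```python
import random

def sgd(grads, theta0, step, n_iters, rule="random", seed=0):
    """Minimize sum_i f_i(theta), given one gradient function per term."""
    rng = random.Random(seed)
    m = len(grads)
    theta = theta0
    for k in range(n_iters):
        if rule == "random":
            i_k = rng.randrange(m)   # random rule: uniform over {0, ..., m-1}
        else:
            i_k = k % m              # cyclic rule: 0, 1, ..., m-1, 0, 1, ...
        theta = theta - step(k) * grads[i_k](theta)
    return theta

# Hypothetical toy problem: f_i(theta) = 0.5 * (theta - a_i)^2,
# whose sum is minimized at the mean of the a_i (here 3.0).
a = [1.0, 2.0, 3.0, 6.0]
grads = [lambda th, ai=ai: th - ai for ai in a]
step = lambda k: 1.0 / (k + 1)       # diminishing step size t_k (an assumption)

theta_random = sgd(grads, 0.0, step, 4000, rule="random")
theta_cyclic = sgd(grads, 0.0, step, 4000, rule="cyclic")
```

With this particular step size, each iterate is the running mean of the $a_{i_k}$ visited so far, so both rules approach the minimizer; the random rule does so only up to sampling noise.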

Comparison between BGD and SGD
Gradient computation, for $n$ samples and $p$ parameters:
- Batch steps: $O(np)$. Doable when $n$ is moderate, but not when $n$ is very large.
- Stochastic steps: $O(p)$. So clearly, e.g., 10K stochastic steps are much more affordable.
Rule of thumb: SGD thrives far from the optimum and struggles close to the optimum.
Figure: The "classic picture"

Comparison between BGD and SGD
Update via BGD:
- More expensive steps
- Opportunities for parallelism
Update via SGD:
- Very cheap iterations
- Descent in expectation
Intuition:
- Using all the sample data in every iteration is inefficient
- Data involves a good deal of redundancy in many applications
- Suppose the data is 10 copies of a set S. An iteration of BGD is then 10 times more expensive, while SGD performs the same computations
- Sometimes working with half of the training set is sufficient
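
The "10 copies of S" intuition can be checked numerically. The sketch below assumes a scalar least-squares term and a made-up set `S`; both are hypothetical. It shows that the batch gradient over the duplicated data points in the same direction as the gradient over S, at 10 times the cost.

```python
def grad_term(theta, x, y):
    """Gradient of 0.5 * (x * theta - y)^2 with respect to the scalar theta."""
    return (x * theta - y) * x

def batch_gradient(theta, data):
    """Sum of per-sample gradients; also returns how many terms were evaluated."""
    return sum(grad_term(theta, x, y) for x, y in data), len(data)

S = [(1.0, 2.0), (2.0, 3.9), (3.0, 6.1)]   # hypothetical base set S
data = S * 10                               # the full data set: 10 copies of S

g_S, cost_S = batch_gradient(0.0, S)
g_all, cost_all = batch_gradient(0.0, data)
# g_all equals 10 * g_S: same direction, 10 times the gradient evaluations.
```

A stochastic step, by contrast, evaluates one term whether it samples from `S` or from `data`.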

Learning Rate Analysis
Figure: Effects of learning rate on loss
Figure: An example of a typical loss function

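Convergence Analysis
Computationally, $m$ stochastic steps $\approx$ one batch step. But what about progress?
BGD (one step):
$$\theta^{(k+1)} = \theta^{(k)} - t \sum_{i=1}^{m} \nabla f_i(\theta^{(k)})$$
SGD (cyclic rule, $i_k = i$, $m$ steps):
$$\theta^{(k+m)} = \theta^{(k)} - t \sum_{i=1}^{m} \nabla f_i(\theta^{(k+i-1)})$$
Difference in direction is
$$\sum_{i=1}^{m} \left[ \nabla f_i(\theta^{(k+i-1)}) - \nabla f_i(\theta^{(k)}) \right]$$
So SGD should converge if each $\nabla f_i(\theta)$ doesn't vary wildly with $\theta$.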
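
As a rough numerical check of this comparison, the sketch below takes one batch step and one cyclic pass of $m$ stochastic steps from the same point and measures how far apart they land. The toy gradients $\nabla f_i(\theta) = \theta - a_i$ and the two step sizes are assumptions made for illustration.

```python
def batch_step(theta, grads, t):
    """One BGD step: subtract t times the sum of all m gradients at theta."""
    return theta - t * sum(g(theta) for g in grads)

def cyclic_sgd_pass(theta, grads, t):
    """m SGD steps with the cyclic rule: each gradient is taken at the latest iterate."""
    for g in grads:
        theta = theta - t * g(theta)
    return theta

# Hypothetical terms f_i(theta) = 0.5 * (theta - a_i)^2.
a = [1.0, 2.0, 3.0, 6.0]
grads = [lambda th, ai=ai: th - ai for ai in a]

gaps = {}
for t in (0.1, 0.01):
    gaps[t] = abs(batch_step(0.0, grads, t) - cyclic_sgd_pass(0.0, grads, t))
# The gap shrinks quickly as t decreases, because the iterates move less
# within a pass and each gradient changes less.
```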

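Example
Problem: the linear regression loss
$$\min_{\theta} \sum_{i=1}^{m} \frac{1}{2} (x_i^T \theta - y_i)^2$$
Solution:
Update via BGD:
$$\theta^{(k+1)} = \theta^{(k)} - t_k \sum_{i=1}^{m} (x_i^T \theta^{(k)} - y_i) x_i$$
Update via SGD:
$$\theta^{(k+1)} = \theta^{(k)} - t_k (x_{i_k}^T \theta^{(k)} - y_{i_k}) x_{i_k}$$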
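
A minimal pure-Python sketch of both updates for the least-squares loss follows. The data (noiseless points on $y = 2x + 1$), the constant step size, and the function names are hypothetical; each input is written as $(x, 1)$ so that $\theta$ holds slope and intercept.

```python
import random

def predict(theta, x):
    """Inner product x^T theta."""
    return sum(tj * xj for tj, xj in zip(theta, x))

def grad_i(theta, x, y):
    """Gradient of 0.5 * (x^T theta - y)^2: the residual times x."""
    r = predict(theta, x) - y
    return [r * xj for xj in x]

def bgd(X, Y, theta, t, n_iters):
    """Batch gradient descent: every step sums the gradients of all samples."""
    for _ in range(n_iters):
        g = [0.0] * len(theta)
        for x, y in zip(X, Y):
            g = [gj + dj for gj, dj in zip(g, grad_i(theta, x, y))]
        theta = [tj - t * gj for tj, gj in zip(theta, g)]
    return theta

def sgd(X, Y, theta, t, n_iters, seed=0):
    """Stochastic gradient descent with the random rule: one sample per step."""
    rng = random.Random(seed)
    for _ in range(n_iters):
        i = rng.randrange(len(X))
        g = grad_i(theta, X[i], Y[i])
        theta = [tj - t * gj for tj, gj in zip(theta, g)]
    return theta

# Hypothetical noiseless data from y = 2*x + 1.
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
X = [[x, 1.0] for x in xs]
Y = [2.0 * x + 1.0 for x in xs]

theta_bgd = bgd(X, Y, [0.0, 0.0], t=0.05, n_iters=2000)
theta_sgd = sgd(X, Y, [0.0, 0.0], t=0.05, n_iters=20000)
```

Because the data here is noiseless, a constant step size is enough for SGD to reach the exact fit; with noisy data a diminishing step size would usually be needed.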

Example
Figure: SGD loss versus number of iterations
Figure: Result of linear regression via SGD

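Mini-Batch Gradient Descent
Batch gradient descent:
$$\theta^{(k+1)} = \theta^{(k)} - t_k \sum_{i=1}^{m} \nabla f_i(\theta^{(k)})$$
Stochastic gradient descent:
$$\theta^{(k+1)} = \theta^{(k)} - t_k \nabla f_{i_k}(\theta^{(k)})$$
Mini-batch gradient descent, where $B_k \subset \{1, \ldots, m\}$ is the subset of indices chosen at iteration $k$:
$$\theta^{(k+1)} = \theta^{(k)} - t_k \sum_{i \in B_k} \nabla f_i(\theta^{(k)})$$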
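
The three updates differ only in how many indices are drawn per step, which a single function can illustrate. In this sketch the name `minibatch_gd`, the sampling without replacement, and the toy gradients are assumptions for illustration.

```python
import random

def minibatch_gd(grads, theta, t, batch_size, n_iters, seed=0):
    """Each step sums the gradients of a randomly drawn subset of the m terms."""
    rng = random.Random(seed)
    m = len(grads)
    for _ in range(n_iters):
        batch = rng.sample(range(m), batch_size)   # indices drawn without replacement
        g = sum(grads[i](theta) for i in batch)
        theta = theta - t * g
    return theta

# Hypothetical terms f_i(theta) = 0.5 * (theta - a_i)^2; the sum is minimized at 3.0.
a = [1.0, 2.0, 3.0, 6.0]
grads = [lambda th, ai=ai: th - ai for ai in a]

theta_full = minibatch_gd(grads, 0.0, t=0.1, batch_size=4, n_iters=200)  # all m terms: BGD
theta_mini = minibatch_gd(grads, 0.0, t=0.1, batch_size=2, n_iters=200)  # mini-batch
theta_one = minibatch_gd(grads, 0.0, t=0.1, batch_size=1, n_iters=200)   # one term: SGD
```

With a constant step size only the full-batch run settles exactly at the minimizer; the smaller batches keep fluctuating around it, with less noise as the batch grows.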

Challenges
Figure: Problems with the learning rate
Figure: Local minima

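SGD with momentum
- Accelerates SGD in the relevant direction and dampens oscillations
- Computes the gradient at the current location, then takes a big jump in the direction of the updated accumulated gradient
With momentum term $\gamma$ and accumulated gradient $\nu$:
$$\nu^{(k+1)} = \gamma \nu^{(k)} + t_k \nabla f_{i_k}(\theta^{(k)})$$
$$\theta^{(k+1)} = \theta^{(k)} - \nu^{(k+1)}$$
For comparison, plain SGD:
$$\theta^{(k+1)} = \theta^{(k)} - t_k \nabla f_{i_k}(\theta^{(k)})$$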
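
The momentum update can be written as a short loop. As a simplifying assumption, a single deterministic gradient function stands in for $\nabla f_{i_k}$ here, and the objective $\frac{1}{2}(\theta - 3)^2$ with a constant step size is hypothetical.

```python
def sgd_momentum(grad, theta, t, gamma, n_iters):
    """Momentum update: v accumulates past gradients, theta moves by v."""
    v = 0.0
    for _ in range(n_iters):
        v = gamma * v + t * grad(theta)   # gradient taken at the current location
        theta = theta - v                 # jump along the updated accumulated gradient
    return theta

# Hypothetical objective 0.5 * (theta - 3)^2, so the gradient is theta - 3.
grad = lambda th: th - 3.0
theta = sgd_momentum(grad, theta=0.0, t=0.1, gamma=0.9, n_iters=500)
```

Setting `gamma=0.0` recovers the plain gradient step, which makes the role of the accumulated term easy to isolate.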

Nesterov Accelerated Gradient
- Accelerates SGD in the relevant direction and dampens oscillations
- Takes a big jump in the direction of the previous accumulated gradient
- Measures the gradient where it ends up and makes a correction
$$\hat{\theta}^{(k)} = \theta^{(k)} - \gamma \nu^{(k)}$$
$$\nu^{(k+1)} = \gamma \nu^{(k)} + t_k \nabla f_{i_k}(\hat{\theta}^{(k)})$$
$$\theta^{(k+1)} = \theta^{(k)} - \nu^{(k+1)}$$
For comparison, SGD with momentum:
$$\nu^{(k+1)} = \gamma \nu^{(k)} + t_k \nabla f_{i_k}(\theta^{(k)})$$
$$\theta^{(k+1)} = \theta^{(k)} - \nu^{(k+1)}$$
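
The only change from momentum is where the gradient is evaluated, which the sketch below makes explicit. It assumes the look-ahead point $\theta - \gamma \nu$ (consistent with the update $\theta \leftarrow \theta - \nu$), a deterministic gradient in place of $\nabla f_{i_k}$, and a hypothetical quadratic objective.

```python
def nesterov(grad, theta, t, gamma, n_iters):
    """Nesterov update: the gradient is measured at the look-ahead point."""
    v = 0.0
    for _ in range(n_iters):
        lookahead = theta - gamma * v         # where the previous accumulated gradient leads
        v = gamma * v + t * grad(lookahead)   # correction measured at that point
        theta = theta - v
    return theta

# Hypothetical objective 0.5 * (theta - 3)^2, so the gradient is theta - 3.
grad = lambda th: th - 3.0
theta = nesterov(grad, theta=0.0, t=0.1, gamma=0.9, n_iters=500)
```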

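Adaptive Gradient Algorithm (AdaGrad)
- Adapts the learning rate to the parameters
- Performs larger updates for infrequent parameters and smaller updates for frequent parameters
- Well suited for dealing with sparse data
For comparison, plain SGD:
$$\theta^{(k+1)} = \theta^{(k)} - t_k \nabla f_{i_k}(\theta^{(k)})$$
AdaGrad (squares, square roots, and divisions are elementwise):
$$\nu^{(k+1)} = \nu^{(k)} + (\nabla f_{i_k}(\theta^{(k)}))^2$$
$$\theta^{(k+1)} = \theta^{(k)} - \frac{\alpha}{\sqrt{\nu^{(k+1)} + \varepsilon}} \nabla f_{i_k}(\theta^{(k)})$$
$\varepsilon$ is a smoothing term that avoids division by zero (usually on the order of 1e-8).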
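
A per-coordinate sketch of this update is shown below. It assumes $\varepsilon$ sits inside the square root, uses a deterministic gradient in place of $\nabla f_{i_k}$, and picks a hypothetical badly scaled quadratic to show that each coordinate gets its own effective step.

```python
import math

def adagrad(grad, theta, alpha, n_iters, eps=1e-8):
    """AdaGrad: divide each coordinate's step by the root of its summed squared gradients."""
    v = [0.0] * len(theta)                      # running sum of squared gradients
    for _ in range(n_iters):
        g = grad(theta)
        v = [vj + gj * gj for vj, gj in zip(v, g)]
        theta = [tj - alpha * gj / math.sqrt(vj + eps)
                 for tj, gj, vj in zip(theta, g, v)]
    return theta

# Hypothetical objective 0.5 * (100 * (th0 - 1)^2 + (th1 + 2)^2):
# the two coordinates have gradients of very different scale.
grad = lambda th: [100.0 * (th[0] - 1.0), th[1] + 2.0]
theta = adagrad(grad, [0.0, 0.0], alpha=0.5, n_iters=2000)
```

Because `v` only grows, the effective step size of every coordinate can only shrink over time, which is the behaviour the next method tries to soften.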

Adadelta
- Restricts the window of accumulated past gradients to a fixed size
- Reduces AdaGrad's aggressive, monotonically decreasing learning rate
- The running average $\nu^{(k+1)}$ depends only on the previous average $\nu^{(k)}$ and the current gradient (weighted by a fraction $\gamma$, similarly to the momentum term)
$$\nu^{(k+1)} = \gamma \nu^{(k)} + (1 - \gamma) (\nabla f_{i_k}(\theta^{(k)}))^2$$
$$\theta^{(k+1)} = \theta^{(k)} - \frac{\alpha}{\sqrt{\nu^{(k+1)} + \varepsilon}} \nabla f_{i_k}(\theta^{(k)})$$
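
The sketch below implements the running-average update exactly as written on this slide. Note that this form still carries a global rate $\alpha$ and coincides with what is commonly called RMSprop; the full Adadelta method additionally replaces $\alpha$ with a running RMS of past parameter updates, which is not shown here. The function name, the deterministic gradient, and the quadratic objective are hypothetical.

```python
import math

def running_average_update(grad, theta, alpha, gamma, n_iters, eps=1e-8):
    """Scale each coordinate's step by a decaying average of its squared gradients."""
    v = [0.0] * len(theta)
    for _ in range(n_iters):
        g = grad(theta)
        # Old information decays by gamma, so the window of past gradients is effectively finite.
        v = [gamma * vj + (1.0 - gamma) * gj * gj for vj, gj in zip(v, g)]
        theta = [tj - alpha * gj / math.sqrt(vj + eps)
                 for tj, gj, vj in zip(theta, g, v)]
    return theta

# Hypothetical objective 0.5 * (100 * (th0 - 1)^2 + (th1 + 2)^2).
grad = lambda th: [100.0 * (th[0] - 1.0), th[1] + 2.0]
theta = running_average_update(grad, [0.0, 0.0], alpha=0.01, gamma=0.9, n_iters=3000)
```

With a constant `alpha` the iterates do not settle exactly; they end up hovering within a distance on the order of `alpha` of the minimizer, since the normalized step does not vanish the way AdaGrad's does.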

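Adaptive Moment Estimation (Adam)
- Additionally keeps an average of past gradients, similar to momentum
- $m_t$ and $\nu_t$ are biased towards zero in the initial time steps, as they are initialized as vectors of 0's, and especially when the decay rates are small (i.e. $\beta_1$ and $\beta_2$ are close to 1)
$$m^{(k+1)} = \beta_1 m^{(k)} + (1 - \beta_1) \nabla f_{i_k}(\theta^{(k)})$$
$$\nu^{(k+1)} = \beta_2 \nu^{(k)} + (1 - \beta_2) (\nabla f_{i_k}(\theta^{(k)}))^2$$
Bias-corrected first and second moment estimates, where $m_t$ and $\nu_t$ are the averages after $t$ updates:
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{\nu}_t = \frac{\nu_t}{1 - \beta_2^t}$$
$$\theta^{(k+1)} = \theta^{(k)} - \frac{t_k}{\sqrt{\hat{\nu}_t} + \varepsilon} \hat{m}_t$$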
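
Put together, the Adam update might look like the sketch below. The defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\varepsilon = 10^{-8}$ are the commonly quoted ones rather than values stated on the slide; the constant step size, the deterministic gradient, and the quadratic objective are hypothetical.

```python
import math

def adam(grad, theta, step, n_iters, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: momentum-like first moment, per-coordinate second moment, both bias-corrected."""
    m = [0.0] * len(theta)   # first moment estimate (average of gradients)
    v = [0.0] * len(theta)   # second moment estimate (average of squared gradients)
    for t in range(1, n_iters + 1):
        g = grad(theta)
        m = [beta1 * mj + (1.0 - beta1) * gj for mj, gj in zip(m, g)]
        v = [beta2 * vj + (1.0 - beta2) * gj * gj for vj, gj in zip(v, g)]
        # Both averages start at zero, so divide out the resulting bias.
        m_hat = [mj / (1.0 - beta1 ** t) for mj in m]
        v_hat = [vj / (1.0 - beta2 ** t) for vj in v]
        theta = [tj - step * mh / (math.sqrt(vh) + eps)
                 for tj, mh, vh in zip(theta, m_hat, v_hat)]
    return theta

# Hypothetical objective 0.5 * (100 * (th0 - 1)^2 + (th1 + 2)^2).
grad = lambda th: [100.0 * (th[0] - 1.0), th[1] + 2.0]
theta = adam(grad, [0.0, 0.0], step=0.01, n_iters=5000)
```

On the first iteration the corrected estimates reduce to the gradient and its square, so the first move has magnitude close to `step` in every coordinate regardless of gradient scale.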

Which optimizer to use?
You should use one of the adaptive learning-rate methods:
- if the input data is sparse
- for faster convergence, and for training deep or complex neural networks
Adadelta and Adam are very similar algorithms that do well in similar circumstances.
Adam slightly outperforms Adadelta towards the end of optimization, as gradients become sparser.
So far, Adam might be the best overall choice.

Thank you for your time!
