Optimization for Training I. First-Order Methods Training algorithm

Emily Rose
6 years ago


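2 OPTIMIZATION METHODS
Topics: Types of optimization methods.
Practical optimization methods break down into two categories:
1. First-order methods
2. Second-order methods
Second-order Taylor expansion of the objective around a point $a$, where $H$ is the Hessian of $J$ at $a$:
$\hat{J}(\theta) = J(a) + \nabla_\theta J(a)^\top (\theta - a) + \frac{1}{2} (\theta - a)^\top H (\theta - a)$
Today we will focus on first-order methods.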

3 STOCHASTIC GRADIENT DESCENT
Vanilla SGD is still probably the most popular method of training deep learning models.
(+) Works on a single example or a mini-batch.
(-) Can converge slowly.

Algorithm 1 Stochastic gradient descent (SGD) update at time t
Require: Learning rate $\eta$.
Require: Initial parameter $\theta$.
while stopping criterion not met do
    Sample a minibatch of $m$ examples from the training set $\{x^{(1)}, \ldots, x^{(m)}\}$.
    Set $g = 0$
    for $t = 1$ to $m$ do
        Compute gradient estimate: $g \leftarrow g + \nabla_\theta L(f(x^{(t)}; \theta), y^{(t)})$
    end for
    Apply update: $\theta \leftarrow \theta - \eta g$
end while
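The SGD loop above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the lecture's own code: the squared-error loss, the toy data set, and the names `grad_squared_error`, `sgd_step`, `train`, `lr`, and `batch_size` are all assumptions made for the example.

```python
import random

def grad_squared_error(theta, x, y):
    """Gradient of 0.5 * (theta . x - y)^2 with respect to theta (assumed loss)."""
    err = sum(t * xi for t, xi in zip(theta, x)) - y
    return [err * xi for xi in x]

def sgd_step(theta, batch, lr):
    """One SGD update: sum the per-example gradients, then step against them."""
    g = [0.0] * len(theta)
    for x, y in batch:
        gi = grad_squared_error(theta, x, y)
        g = [a + b for a, b in zip(g, gi)]
    return [t - lr * gj for t, gj in zip(theta, g)]

def train(data, theta, lr=0.05, batch_size=4, steps=500, seed=0):
    """Repeat: sample a minibatch, apply one update (fixed step budget as stopping rule)."""
    rng = random.Random(seed)
    for _ in range(steps):
        batch = rng.sample(data, batch_size)
        theta = sgd_step(theta, batch, lr)
    return theta

# Hypothetical noise-free toy data generated from y = 2*x0 - 3*x1.
data_rng = random.Random(1)
data = []
for _ in range(50):
    x = [data_rng.uniform(-1.0, 1.0), data_rng.uniform(-1.0, 1.0)]
    data.append((x, 2.0 * x[0] - 3.0 * x[1]))

theta = train(data, [0.0, 0.0])
print(theta)  # should end up close to the generating weights
```

As in the algorithm, the gradient is summed over the minibatch rather than averaged, so the effective step grows with the batch size.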

4 STOCHASTIC GRADIENT DESCENT
[Figure: error E plotted against weight w.]

5 MOMENTUM METHOD
Designed to accelerate learning, especially when gradients are small but consistent.
Inspired by a physical interpretation of the optimization process: imagine a small ball rolling on a surface defined by the loss function.

6 MOMENTUM METHOD
Algorithm 1 Stochastic gradient descent (SGD) with momentum
Require: Learning rate $\eta$, momentum parameter $\alpha$.
Require: Initial parameter $\theta$, initial velocity $v$.
while stopping criterion not met do
    Sample a minibatch of $m$ examples from the training set $\{x^{(1)}, \ldots, x^{(m)}\}$.
    Set $g = 0$
    for $t = 1$ to $m$ do
        Compute gradient estimate: $g \leftarrow g + \nabla_\theta L(f(x^{(t)}; \theta), y^{(t)})$
    end for
    Compute velocity update: $v \leftarrow \alpha v - \eta g$
    Apply update: $\theta \leftarrow \theta + v$
end while
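To see the effect of the velocity term, here is a small sketch that runs the momentum update on a toy quadratic objective. The objective, the starting point, and the names `grad`, `momentum_step`, `run`, `lr`, and `alpha` are assumptions chosen for illustration; the full-batch gradient stands in for the minibatch gradient.

```python
def grad(theta):
    """Gradient of the assumed toy objective J = 0.5 * (theta0^2 + 25 * theta1^2)."""
    return [theta[0], 25.0 * theta[1]]

def momentum_step(theta, v, lr, alpha):
    """Velocity update followed by the parameter update."""
    g = grad(theta)
    v = [alpha * vi - lr * gi for vi, gi in zip(v, g)]
    theta = [t + vi for t, vi in zip(theta, v)]
    return theta, v

def run(alpha, steps=100, lr=0.02):
    theta, v = [5.0, 1.0], [0.0, 0.0]
    for _ in range(steps):
        theta, v = momentum_step(theta, v, lr, alpha)
    return theta

plain = run(alpha=0.0)          # alpha = 0 reduces to plain gradient descent
with_momentum = run(alpha=0.9)
print(plain, with_momentum)
```

The first coordinate has a gentle but consistent gradient, so plain gradient descent crawls along it, while the accumulated velocity covers the same distance in far fewer steps.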

7 MOMENTUM METHOD
[Figure; surviving caption fragment: "a gentle but consistent gradient."]

8 NESTEROV MOMENTUM
Sutskever et al. (ICML 2013) presented a modified version of momentum they called Nesterov momentum.
Basic idea: apply the gradient correction after the velocity term is applied, so the gradient is evaluated at the point the velocity would carry the parameters to.

9 NESTEROV MOMENTUM
Algorithm 1 Stochastic gradient descent (SGD) with Nesterov momentum
Require: Learning rate $\eta$, momentum parameter $\alpha$.
Require: Initial parameter $\theta$, initial velocity $v$.
while stopping criterion not met do
    Sample a minibatch of $m$ examples from the training set $\{x^{(1)}, \ldots, x^{(m)}\}$.
    Apply interim update: $\tilde{\theta} \leftarrow \theta + \alpha v$
    Set $g = 0$
    for $t = 1$ to $m$ do
        Compute gradient (at interim point): $g \leftarrow g + \nabla_{\tilde{\theta}} L(f(x^{(t)}; \tilde{\theta}), y^{(t)})$
    end for
    Compute velocity update: $v \leftarrow \alpha v - \eta g$
    Apply update: $\theta \leftarrow \theta + v$
end while
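The only change from standard momentum is where the gradient is evaluated. A minimal sketch, assuming the same kind of toy quadratic objective as a stand-in for the minibatch loss (the names `grad`, `nesterov_step`, `interim`, `lr`, and `alpha` are illustrative, not from the slides):

```python
def grad(theta):
    """Gradient of the assumed toy objective J = 0.5 * (theta0^2 + 25 * theta1^2)."""
    return [theta[0], 25.0 * theta[1]]

def nesterov_step(theta, v, lr, alpha):
    # Look ahead to where the current velocity would take the parameters.
    interim = [t + alpha * vi for t, vi in zip(theta, v)]
    # Evaluate the gradient at the interim point, not at theta.
    g = grad(interim)
    v = [alpha * vi - lr * gi for vi, gi in zip(v, g)]
    # The update is applied to the original parameters.
    theta = [t + vi for t, vi in zip(theta, v)]
    return theta, v

theta, v = [5.0, 1.0], [0.0, 0.0]
for _ in range(100):
    theta, v = nesterov_step(theta, v, lr=0.02, alpha=0.9)
print(theta)
```

Keeping the interim point in a separate variable avoids applying the velocity to the parameters twice.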

10 NESTEROV MOMENTUM
[Figure; surviving legend fragment: blue vectors = standard momentum.]

11 ADAGRAD
Adagrad (Duchi et al., COLT 2010) is a method of adapting the learning rate.
(+) Adapts an independent learning rate for each parameter.
(-) Accumulating squared gradients from the start of training makes later learning very slow.

12 ADAGRAD
Algorithm 1 The Adagrad algorithm
Require: Global learning rate $\eta$.
Require: Initial parameter $\theta$.
Initialize gradient accumulation variable $r = 0$.
while stopping criterion not met do
    Sample a minibatch of $m$ examples from the training set $\{x^{(1)}, \ldots, x^{(m)}\}$.
    Set $g = 0$
    for $t = 1$ to $m$ do
        Compute gradient: $g \leftarrow g + \nabla_\theta L(f(x^{(t)}; \theta), y^{(t)})$
    end for
    Accumulate gradient: $r \leftarrow r + g^2$ (square applied element-wise)
    Compute update: $\Delta\theta \leftarrow -\frac{\eta}{\sqrt{r}} g$ ($\frac{1}{\sqrt{r}}$ applied element-wise)
    Apply update: $\theta \leftarrow \theta + \Delta\theta$
end while
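A short sketch of the Adagrad update on a toy quadratic shows both properties listed earlier: each coordinate gets its own effective step, and the steps shrink as squared gradients accumulate. The objective and the names `grad`, `adagrad_step`, `lr`, and `delta` are assumptions; in particular, the small constant `delta` added to the denominator is not part of the algorithm as stated above and is included only to avoid dividing by zero.

```python
import math

def grad(theta):
    """Gradient of the assumed toy objective J = 0.5 * (theta0^2 + 25 * theta1^2)."""
    return [theta[0], 25.0 * theta[1]]

def adagrad_step(theta, r, lr, delta=1e-8):
    g = grad(theta)
    # Accumulate squared gradients element-wise; r never decreases.
    r = [ri + gi * gi for ri, gi in zip(r, g)]
    # Per-parameter step: learning rate divided by sqrt of the accumulated sum.
    step = [-lr * gi / (math.sqrt(ri) + delta) for gi, ri in zip(g, r)]
    theta = [t + s for t, s in zip(theta, step)]
    return theta, r, step

theta, r = [5.0, 1.0], [0.0, 0.0]
step_sizes = []
for _ in range(200):
    theta, r, step = adagrad_step(theta, r, lr=1.0)
    step_sizes.append(abs(step[0]))
print(step_sizes[0], step_sizes[-1])
```

On the first update both coordinates move by about `lr` despite a 5x difference in gradient magnitude, and later updates along the first coordinate are much smaller than the first one.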

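13 RMSPROP
RMSprop (Tieleman, unpublished) is a simple moving-average version of Adagrad: the running sum of squared gradients is replaced by an exponentially decaying average.

Algorithm 1 The RMSprop algorithm
Require: Global learning rate $\eta$, decay rate $\rho$.
Require: Initial parameter $\theta$.
Initialize accumulation variable $r = 0$.
while stopping criterion not met do
    Sample a minibatch of $m$ examples from the training set $\{x^{(1)}, \ldots, x^{(m)}\}$.
    Set $g = 0$
    for $t = 1$ to $m$ do
        Compute gradient: $g \leftarrow g + \nabla_\theta L(f(x^{(t)}; \theta), y^{(t)})$
    end for
    Accumulate gradient: $r \leftarrow \rho r + (1 - \rho) g^2$
    Compute parameter update: $\Delta\theta = -\frac{\eta}{\sqrt{r}} g$ ($\frac{1}{\sqrt{r}}$ applied element-wise)
    Apply update: $\theta \leftarrow \theta + \Delta\theta$
end while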
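The difference from Adagrad is confined to the accumulator, which the following sketch makes visible by recording it over time. The toy objective and the names `grad`, `rmsprop_step`, `lr`, `rho`, and `delta` are assumptions for illustration; `delta` is a small constant added here only to guard against division by zero.

```python
import math

def grad(theta):
    """Gradient of the assumed toy objective J = 0.5 * (theta0^2 + 25 * theta1^2)."""
    return [theta[0], 25.0 * theta[1]]

def rmsprop_step(theta, r, lr, rho, delta=1e-8):
    g = grad(theta)
    # Exponentially decaying average of squared gradients (can shrink again).
    r = [rho * ri + (1.0 - rho) * gi * gi for ri, gi in zip(r, g)]
    theta = [t - lr * gi / (math.sqrt(ri) + delta)
             for t, gi, ri in zip(theta, g, r)]
    return theta, r

theta, r = [5.0, 1.0], [0.0, 0.0]
r0_history = []
for _ in range(300):
    theta, r = rmsprop_step(theta, r, lr=0.01, rho=0.9)
    r0_history.append(r[0])
print(theta, max(r0_history), r0_history[-1])
```

Because old gradients are forgotten, the accumulator falls as the gradient shrinks, so the effective learning rate does not decay toward zero the way it does with a running sum.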

14 RMSPROP+MOMENTUM
Algorithm 1 RMSprop algorithm with Nesterov momentum
Require: Global learning rate $\eta$, decay rate $\rho$, momentum parameter $\alpha$.
Require: Initial parameter $\theta$, initial velocity $v$.
Initialize accumulation variable $r = 0$.
while stopping criterion not met do
    Sample a minibatch of $m$ examples from the training set $\{x^{(1)}, \ldots, x^{(m)}\}$.
    Compute interim update: $\tilde{\theta} \leftarrow \theta + \alpha v$
    Set $g = 0$
    for $t = 1$ to $m$ do
        Compute gradient: $g \leftarrow g + \nabla_{\tilde{\theta}} L(f(x^{(t)}; \tilde{\theta}), y^{(t)})$
    end for
    Accumulate gradient: $r \leftarrow \rho r + (1 - \rho) g^2$
    Compute velocity update: $v \leftarrow \alpha v - \frac{\eta}{\sqrt{r}} g$ ($\frac{1}{\sqrt{r}}$ applied element-wise)
    Apply update: $\theta \leftarrow \theta + v$
end while
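Combining the two ideas amounts to feeding the RMSprop-scaled gradient into the Nesterov velocity update. A minimal sketch under the same assumptions as before (toy quadratic objective; the names `grad`, `loss`, `rmsprop_nesterov_step`, `lr`, `rho`, `alpha`, and the zero-division guard `delta` are all illustrative):

```python
import math

def loss(theta):
    """Assumed toy objective J = 0.5 * (theta0^2 + 25 * theta1^2)."""
    return 0.5 * (theta[0] ** 2 + 25.0 * theta[1] ** 2)

def grad(theta):
    return [theta[0], 25.0 * theta[1]]

def rmsprop_nesterov_step(theta, v, r, lr, rho, alpha, delta=1e-8):
    # Gradient is taken at the look-ahead (interim) point.
    interim = [t + alpha * vi for t, vi in zip(theta, v)]
    g = grad(interim)
    # RMSprop accumulator: decaying average of squared gradients.
    r = [rho * ri + (1.0 - rho) * gi * gi for ri, gi in zip(r, g)]
    # Velocity uses the element-wise rescaled gradient.
    v = [alpha * vi - lr * gi / (math.sqrt(ri) + delta)
         for vi, gi, ri in zip(v, g, r)]
    theta = [t + vi for t, vi in zip(theta, v)]
    return theta, v, r

theta, v, r = [5.0, 1.0], [0.0, 0.0], [0.0, 0.0]
initial_loss = loss(theta)
for _ in range(200):
    theta, v, r = rmsprop_nesterov_step(theta, v, r, lr=0.01, rho=0.9, alpha=0.9)
final_loss = loss(theta)
print(initial_loss, final_loss)
```

Since the rescaled gradient has roughly unit magnitude, the velocity can build up to about `lr / (1 - alpha)` per step, which is worth keeping in mind when choosing `lr`.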

15 ADADELTA
Adadelta (Zeiler, 2012) combines the learning rate adaptation of Adagrad (with the moving average of RMSprop) with a diagonal approximation of the Hessian.
Similar in spirit to the algorithm proposed by Schaul and LeCun (ICLR, 2013).
No more pesky learning rate.

16 ADADELTA
Algorithm 1 The Adadelta algorithm
Require: Decay rate $\rho$, constant $\epsilon$.
Require: Initial parameter $\theta$.
Initialize accumulation variables $r = 0$, $s = 0$.
while stopping criterion not met do
    Sample a minibatch of $m$ examples from the training set $\{x^{(1)}, \ldots, x^{(m)}\}$.
    Set $g = 0$
    for $t = 1$ to $m$ do
        Compute gradient: $g \leftarrow g + \nabla_\theta L(f(x^{(t)}; \theta), y^{(t)})$
    end for
    Accumulate gradient: $r \leftarrow \rho r + (1 - \rho) g^2$
    Compute update: $\Delta\theta = -\frac{\sqrt{s + \epsilon}}{\sqrt{r + \epsilon}} g$ (operations applied element-wise)
    Accumulate update: $s \leftarrow \rho s + (1 - \rho) [\Delta\theta]^2$
    Apply update: $\theta \leftarrow \theta + \Delta\theta$
end while
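The Adadelta step can be sketched as follows; note that the function takes no learning rate, only the decay rate and the constant. The toy objective, the hyperparameter values, and the names `grad`, `loss`, `adadelta_step`, `rho`, and `eps` are assumptions made for this example.

```python
import math

def loss(theta):
    """Assumed toy objective J = 0.5 * (theta0^2 + 25 * theta1^2)."""
    return 0.5 * (theta[0] ** 2 + 25.0 * theta[1] ** 2)

def grad(theta):
    return [theta[0], 25.0 * theta[1]]

def adadelta_step(theta, r, s, rho, eps):
    g = grad(theta)
    # Decaying average of squared gradients.
    r = [rho * ri + (1.0 - rho) * gi * gi for ri, gi in zip(r, g)]
    # Update scaled by RMS of past updates over RMS of gradients.
    delta = [-math.sqrt(si + eps) / math.sqrt(ri + eps) * gi
             for si, ri, gi in zip(s, r, g)]
    # Decaying average of squared updates.
    s = [rho * si + (1.0 - rho) * di * di for si, di in zip(s, delta)]
    theta = [t + di for t, di in zip(theta, delta)]
    return theta, r, s

theta, r, s = [5.0, 1.0], [0.0, 0.0], [0.0, 0.0]
initial_loss = loss(theta)
for _ in range(300):
    theta, r, s = adadelta_step(theta, r, s, rho=0.95, eps=1e-6)
final_loss = loss(theta)
print(initial_loss, final_loss)
```

With `s` initialized to zero, the constant `eps` sets the scale of the very first updates, so early progress is slow until the accumulated updates grow.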

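17 ADADELTA
What motivates the use of the accumulated updates in the learning rate adaptation? It's an approximation to the Hessian.
Newton step: $\Delta\theta = -H^{-1} g$
If we consider only the diagonal terms:
$\Delta\theta_i = -\frac{\partial J / \partial \theta_i}{\partial^2 J / \partial \theta_i^2} \quad\Rightarrow\quad \frac{1}{\partial^2 J / \partial \theta_i^2} = -\frac{\Delta\theta_i}{\partial J / \partial \theta_i}$
Idea: use the (time-averaged or accumulated) updates to estimate the Hessian instead of the other way around.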
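The diagonal relation can be checked numerically on an objective whose Hessian is known. In this sketch the objective is an assumed separable quadratic with diagonal Hessian entries `curvatures`; the names `grad`, `newton_update`, and `inv_hessian_estimate` are illustrative.

```python
# Assumed toy objective J = 0.5 * sum(c_i * theta_i^2), whose Hessian is diag(c).
curvatures = [1.0, 25.0]

def grad(theta):
    return [c * t for c, t in zip(curvatures, theta)]

theta = [5.0, 1.0]
g = grad(theta)

# Diagonal Newton step: minus gradient divided by the second derivative.
newton_update = [-gi / ci for gi, ci in zip(g, curvatures)]

# Reading the relation the other way: recover 1 / (second derivative)
# from the ratio of the update to the gradient.
inv_hessian_estimate = [-di / gi for di, gi in zip(newton_update, g)]

print(newton_update)         # step that lands on the minimum of this quadratic
print(inv_hessian_estimate)  # matches 1 / curvature for each coordinate
```

Adadelta does not have the true Newton update available, so it substitutes running averages of its own past updates and gradients for the two quantities in this ratio.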
