CSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent



CSCI 1951-G Optimization Methods in Finance, Part 12: Variants of Gradient Descent (April 27)

Outline
1) Momentum and Nesterov's accelerated gradient descent
2) AdaGrad and RMSProp
3) Adam
4) Stochastic gradient descent and optimization in machine learning

Materials
Some of this material is covered in Chapter 14 of the book Understanding Machine Learning by S. Shalev-Shwartz and S. Ben-David. Some of the figures and examples in these slides are from this book.
Other sources include the original Adam paper: D. Kingma and J. Lei Ba, Adam: A Method for Stochastic Optimization, arXiv, v9.
We also look at a demo.


Gradient descent method
input: function $f$, starting point $x_0$
$i \leftarrow 0$;
repeat
  1. $i \leftarrow i + 1$;
  2. Gradient: $g_i \leftarrow \nabla f(x_{i-1})$;
  3. Line search: choose a step size $t_i > 0$ via line search;
  4. Update: $x_i \leftarrow x_{i-1} - t_i g_i$;
until a stopping criterion is satisfied (e.g., $\|\nabla f(x_i)\|_2 \le \eta$)
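The loop above can be sketched in Python as follows. The slide leaves the line-search rule open, so this sketch assumes a backtracking (Armijo) search; the names `gradient_descent`, `c`, `beta`, and `max_iter` are illustrative choices, not part of the slides. The test function is the quadratic $1.25(x+6)^2 + (y-8)^2$ used in the figure later in the deck.

```python
def gradient_descent(f, grad, x0, eta=1e-6, c=0.3, beta=0.5, max_iter=10_000):
    """Gradient descent with a backtracking line search (an assumed rule)."""
    x = list(x0)
    for _ in range(max_iter):
        g = grad(x)
        sq_norm = sum(gj * gj for gj in g)
        if sq_norm ** 0.5 <= eta:  # stop when ||grad f(x)||_2 <= eta
            break
        t, fx = 1.0, f(x)
        # Shrink t until f decreases by at least c * t * ||g||^2.
        while f([xj - t * gj for xj, gj in zip(x, g)]) > fx - c * t * sq_norm:
            t *= beta
        x = [xj - t * gj for xj, gj in zip(x, g)]
    return x


def f(x):
    return 1.25 * (x[0] + 6) ** 2 + (x[1] - 8) ** 2


def grad_f(x):
    return [2.5 * (x[0] + 6), 2.0 * (x[1] - 8)]


x_min = gradient_descent(f, grad_f, [0.0, 0.0])
print(x_min)  # should be close to the minimizer (-6, 8)
```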

Exact line search
Choose $t_i$ to minimize $f$ along the ray $\{x_{i-1} - t g_i : t \ge 0\}$:
$t_i = \arg\min_{s \ge 0} f(x_{i-1} - s g_i)$
Useful when this minimization is cheap relative to computing $g_i$ (e.g., when it has an analytical solution).
This is almost never the case.
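One case where the minimization does have an analytical solution is a strictly convex quadratic. As a worked example, stepping from $x_{i-1}$ against the gradient $g_i$ (the quadratic form, $A$, and $b$ are assumptions introduced here for illustration):

```latex
f(x) = \tfrac{1}{2} x^\top A x - b^\top x, \quad A = A^\top \succ 0,
\qquad g_i = \nabla f(x_{i-1}) = A x_{i-1} - b
\\[4pt]
\frac{d}{ds} f(x_{i-1} - s g_i)
  = -g_i^\top \bigl(A (x_{i-1} - s g_i) - b\bigr)
  = -g_i^\top g_i + s \, g_i^\top A g_i = 0
\quad\Longrightarrow\quad
t_i = \frac{g_i^\top g_i}{g_i^\top A g_i}
```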

Fixed-step gradient descent
input: function $f$, starting point $x_0$, step size / learning rate $\alpha$.
$i \leftarrow 0$;
repeat
  1. $i \leftarrow i + 1$;
  2. Gradient: $g_i \leftarrow \nabla f(x_{i-1})$;
  3. Update: $x_i \leftarrow x_{i-1} - \alpha g_i$;
until a stopping criterion is satisfied
(Demo)
On a non-convex function, GD may get stuck in a local minimum.
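A minimal sketch of how fixed-step GD can end up in different minima depending on the starting point. The one-dimensional function $x^4 - 3x^2 + x$, the starting points, and the learning rate are assumptions chosen for illustration; they are not from the slides or the demo.

```python
def fixed_step_gd(grad, x0, alpha, steps):
    """Fixed-step gradient descent on a function of one variable."""
    x = x0
    for _ in range(steps):
        x = x - alpha * grad(x)
    return x


def f(x):
    return x ** 4 - 3 * x ** 2 + x  # non-convex: two separate minima


def grad_f(x):
    return 4 * x ** 3 - 6 * x + 1


from_right = fixed_step_gd(grad_f, x0=2.0, alpha=0.01, steps=2000)
from_left = fixed_step_gd(grad_f, x0=-2.0, alpha=0.01, steps=2000)
print(from_right, f(from_right))  # stationary point with x > 0
print(from_left, f(from_left))    # stationary point with x < 0, lower f
```

Both runs stop at a point with zero gradient, but only the run started on the left reaches the lower of the two minima.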

Going over local minima
GD: a walking person who stops completely after each step.
How do you jump over a gap? Use momentum!
Let's take into account the previous gradients.

Classical Momentum
Let's take into account the previous gradients.
input: function $f$, starting point $x_0$, learning rate $\alpha$, decay constant $\mu$
$i \leftarrow 0$; $m_0 \leftarrow 0$;
repeat
  1. $i \leftarrow i + 1$;
  2. Gradient: $g_i \leftarrow \nabla f(x_{i-1})$;
  3. Momentum: $m_i \leftarrow \mu m_{i-1} + g_i$;
  4. Update: $x_i \leftarrow x_{i-1} - \alpha m_i$;
until a stopping criterion is satisfied

Classical Momentum
Momentum achieves more than just making it easier to jump over local minima:
it accelerates the descent along directions where the gradient is relatively stable;
it decelerates the descent along directions where the gradient oscillates.
(Demo)
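The classical momentum update, $m_i \leftarrow \mu m_{i-1} + g_i$ followed by $x_i \leftarrow x_{i-1} - \alpha m_i$, can be sketched as below. The function name, the fixed iteration count used in place of a stopping criterion, and the values $\alpha = 0.05$, $\mu = 0.9$ are assumptions for illustration. Whether momentum carries the iterate past a particular local minimum depends on $\alpha$, $\mu$, and the starting point, so the sketch only checks convergence on a convex quadratic.

```python
def momentum_gd(grad, x0, alpha=0.05, mu=0.9, steps=1000):
    """Classical momentum with a decaying sum of past gradients."""
    x = list(x0)
    m = [0.0] * len(x)  # m_0 = 0
    for _ in range(steps):
        g = grad(x)
        m = [mu * mj + gj for mj, gj in zip(m, g)]     # m_i = mu*m_{i-1} + g_i
        x = [xj - alpha * mj for xj, mj in zip(x, m)]  # x_i = x_{i-1} - alpha*m_i
    return x


def grad_f(x):
    # Gradient of 1.25 * (x + 6)^2 + (y - 8)^2.
    return [2.5 * (x[0] + 6), 2.0 * (x[1] - 8)]


x_final = momentum_gd(grad_f, [0.0, 0.0])
print(x_final)  # should be close to (-6, 8)
```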

Nesterov's Accelerated Gradient Descent
Idea: don't compute the gradient at the current solution, but at the point where we would end up if we kept going in the same direction.
Update for momentum-based GD: $x_i \leftarrow x_{i-1} - \alpha m_i$
Expand $m_i$ with its definition ($m_i \leftarrow \mu m_{i-1} + g_i$): $x_i \leftarrow x_{i-1} - \alpha\mu m_{i-1} - \alpha g_i$
$x_{i-1} - \alpha\mu m_{i-1}$ is a valid solution, but $g_i$ is not the gradient there!

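Nesterov's Accelerated Gradient Descent
$x_i \leftarrow x_{i-1} - \alpha\mu m_{i-1} - \alpha g_i$
Let's compute the gradient at $x_{i-1} - \alpha\mu m_{i-1}$ instead:
input: function $f$, starting point $x_0$, learning rate $\alpha$, decay constant $\mu$
$i \leftarrow 0$; $m_0 \leftarrow 0$;
repeat
  1. $i \leftarrow i + 1$;
  2. Gradient: $g_i \leftarrow \nabla f(x_{i-1} - \alpha\mu m_{i-1})$;
  3. Momentum: $m_i \leftarrow \mu m_{i-1} + g_i$;
  4. Update: $x_i \leftarrow x_{i-1} - \alpha m_i$;
until a stopping criterion is satisfied
For smooth convex functions and suitably chosen parameters, NAGD provably converges faster than GD: the error after $i$ iterations is $O(1/i^2)$ instead of $O(1/i)$.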
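A sketch of the look-ahead step: the only change relative to classical momentum is the point at which the gradient is evaluated. The function name, the fixed iteration count, and the parameter values are illustrative assumptions.

```python
def nesterov_gd(grad, x0, alpha=0.05, mu=0.9, steps=1000):
    """Momentum with the gradient taken at the look-ahead point."""
    x = list(x0)
    m = [0.0] * len(x)  # m_0 = 0
    for _ in range(steps):
        # Look-ahead point x_{i-1} - alpha * mu * m_{i-1}.
        look = [xj - alpha * mu * mj for xj, mj in zip(x, m)]
        g = grad(look)
        m = [mu * mj + gj for mj, gj in zip(m, g)]
        x = [xj - alpha * mj for xj, mj in zip(x, m)]
    return x


def grad_f(x):
    # Gradient of 1.25 * (x + 6)^2 + (y - 8)^2.
    return [2.5 * (x[0] + 6), 2.0 * (x[1] - 8)]


x_final = nesterov_gd(grad_f, [0.0, 0.0])
print(x_final)  # should be close to (-6, 8)
```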


Adaptive Subgradient Descent (AdaGrad)
Goal: make GD adapt to the amount of observed change in different dimensions:
slow down along dimensions that have already changed significantly;
speed up along those that haven't changed much.
How?
  1. Keep track of the sum of the squares of the gradients;
  2. Use it to dampen the learning rate $\alpha$.

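Adaptive Subgradient Descent (AdaGrad)
input: function $f$, starting point $x_0$, learning rate $\alpha$.
$i \leftarrow 0$; $G_0 \leftarrow 0$; $n_0 \leftarrow 0$;
repeat
  1. $i \leftarrow i + 1$;
  2. Gradient: $g_i \leftarrow \nabla f(x_{i-1})$;
  3. Matrix of squares of gradients: $G_i \leftarrow G_{i-1} + g_i g_i^T$;
  4. Diagonal: $n_i \leftarrow \mathrm{Diag}(G_i)$;
  5. Update: $x_i \leftarrow x_{i-1} - \frac{\alpha}{\sqrt{n_i} + \varepsilon} g_i$ (component-wise; $\varepsilon$ is a small positive constant that prevents division by zero);
until a stopping criterion is satisfied
Each component of $n_i$ is the sum of the squares of the previous partial derivatives along that dimension, so $\sqrt{n_i}$ holds their $\ell_2$-norms.
The learning rate is now adapted for each of the dimensions.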
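A minimal sketch of AdaGrad that stores only the diagonal $n_i$ rather than the full matrix $G_i$, since only the diagonal enters the update. It assumes the damping term has the form $\sqrt{n_i} + \varepsilon$; the function name, $\alpha = 1$, $\varepsilon = 10^{-8}$, and the fixed iteration count are illustrative assumptions.

```python
import math


def adagrad(grad, x0, alpha=1.0, eps=1e-8, steps=3000):
    """AdaGrad keeping only the diagonal of the sum of g g^T."""
    x = list(x0)
    n = [0.0] * len(x)  # running sum of squared partial derivatives
    for _ in range(steps):
        g = grad(x)
        n = [nj + gj * gj for nj, gj in zip(n, g)]
        # Per-dimension step: alpha / (sqrt(n_j) + eps).
        x = [xj - alpha / (math.sqrt(nj) + eps) * gj
             for xj, nj, gj in zip(x, n, g)]
    return x


def grad_f(x):
    # Gradient of 1.25 * (x + 6)^2 + (y - 8)^2.
    return [2.5 * (x[0] + 6), 2.0 * (x[1] - 8)]


x_final = adagrad(grad_f, [0.0, 0.0])
print(x_final)  # should be close to (-6, 8)
```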

RMSProp
What happens to $n_i$ in the long term? It keeps growing, and the descent slows down along all dimensions.
Idea: let's keep a weighted running average of the squared gradients instead.

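RMSProp
input: function $f$, starting point $x_0$, learning rate $\alpha$, decay parameter $\gamma$
$i \leftarrow 0$; $G_0 \leftarrow 0$; $n_0 \leftarrow 0$;
repeat
  1. $i \leftarrow i + 1$;
  2. Gradient: $g_i \leftarrow \nabla f(x_{i-1})$;
  3. Matrix of squares of gradients: $G_i \leftarrow (1-\gamma) G_{i-1} + \gamma g_i g_i^T$;
  4. Diagonal: $n_i \leftarrow \mathrm{Diag}(G_i)$;
  5. Update: $x_i \leftarrow x_{i-1} - \frac{\alpha}{\sqrt{n_i} + \varepsilon} g_i$ (component-wise);
until a stopping criterion is satisfied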
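A sketch of RMSProp with the running average written as $(1-\gamma) n_{i-1} + \gamma g_i^2$ per dimension, so that $\gamma$ is the weight on the newest squared gradient. The function name and the values $\alpha = 0.01$, $\gamma = 0.1$, $\varepsilon = 10^{-8}$ are assumptions. With a constant learning rate the iterate is not expected to settle exactly on the minimizer, so the final print is only a rough check.

```python
import math


def rmsprop(grad, x0, alpha=0.01, gamma=0.1, eps=1e-8, steps=5000):
    """RMSProp keeping only the diagonal running average."""
    x = list(x0)
    n = [0.0] * len(x)  # running average of squared partial derivatives
    for _ in range(steps):
        g = grad(x)
        n = [(1 - gamma) * nj + gamma * gj * gj for nj, gj in zip(n, g)]
        x = [xj - alpha / (math.sqrt(nj) + eps) * gj
             for xj, nj, gj in zip(x, n, g)]
    return x


def grad_f(x):
    # Gradient of 1.25 * (x + 6)^2 + (y - 8)^2.
    return [2.5 * (x[0] + 6), 2.0 * (x[1] - 8)]


x_final = rmsprop(grad_f, [0.0, 0.0])
print(x_final)  # should be in a small neighborhood of (-6, 8)
```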


Adaptive Moment Estimation (Adam)
Idea: combine momentum-based and norm-based methods
  1. Use momentum, but with a decaying mean instead of a decaying sum;
  2. Combine it with RMSProp.
Result: the update is proportional to $\frac{\text{average gradient}}{\sqrt{\text{average squared gradient}}}$
(Add initialization bias correction to offset instability)

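Adaptive Moment Estimation (Adam)
input: function $f$, starting point $x_0$, learning rate $\alpha$, norm decay parameter $\gamma$, momentum decay parameter $\mu$
$i \leftarrow 0$; $G_0 \leftarrow 0$; $n_0 \leftarrow 0$; $\hat{n}_0 \leftarrow 0$; $m_0 \leftarrow 0$; $\hat{m}_0 \leftarrow 0$;
repeat
  1. $i \leftarrow i + 1$;
  2. Gradient: $g_i \leftarrow \nabla f(x_{i-1})$;
  3. Momentum: $m_i \leftarrow \mu m_{i-1} + (1-\mu) g_i$;
  4. Bias correction: $\hat{m}_i \leftarrow m_i / (1 - \mu^i)$;
  5. Matrix of squares of gradients: $G_i \leftarrow (1-\gamma) G_{i-1} + \gamma g_i g_i^T$;
  6. Diagonal: $n_i \leftarrow \mathrm{Diag}(G_i)$;
  7. Bias correction: $\hat{n}_i \leftarrow n_i / (1 - (1-\gamma)^i)$, since the weight on the old average in step 5 is $1-\gamma$;
  8. Update: $x_i \leftarrow x_{i-1} - \frac{\alpha}{\sqrt{\hat{n}_i} + \varepsilon} \hat{m}_i$ (component-wise);
until a stopping criterion is satisfied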
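A sketch of the Adam loop, keeping only the diagonal of the squared-gradient average. Because the running average is written as $(1-\gamma) n_{i-1} + \gamma g_i^2$, its zero-initialization bias factor is $1 - (1-\gamma)^i$, which is what the code divides by. The function name and the values $\alpha = 0.1$, $\mu = 0.9$, $\gamma = 0.001$, $\varepsilon = 10^{-8}$ are assumptions chosen for illustration.

```python
import math


def adam(grad, x0, alpha=0.1, mu=0.9, gamma=0.001, eps=1e-8, steps=3000):
    """Adam with bias-corrected mean and squared-gradient averages."""
    x = list(x0)
    m = [0.0] * len(x)  # decaying mean of gradients
    n = [0.0] * len(x)  # decaying mean of squared gradients
    for i in range(1, steps + 1):
        g = grad(x)
        m = [mu * mj + (1 - mu) * gj for mj, gj in zip(m, g)]
        n = [(1 - gamma) * nj + gamma * gj * gj for nj, gj in zip(n, g)]
        # Undo the bias introduced by starting both averages at zero.
        m_hat = [mj / (1 - mu ** i) for mj in m]
        n_hat = [nj / (1 - (1 - gamma) ** i) for nj in n]
        x = [xj - alpha * mh / (math.sqrt(nh) + eps)
             for xj, mh, nh in zip(x, m_hat, n_hat)]
    return x


def grad_f(x):
    # Gradient of 1.25 * (x + 6)^2 + (y - 8)^2.
    return [2.5 * (x[0] + 6), 2.0 * (x[1] - 8)]


x_final = adam(grad_f, [0.0, 0.0])
print(x_final)  # should be close to (-6, 8)
```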

Adaptive Moment Estimation (Adam)
Initializing $m_0$ with 0 introduces some bias (similarly for $n_i$).
Definition of $m_i$: $m_i = \mu m_{i-1} + (1-\mu) g_i$
Expanding: $m_i = (1-\mu) \sum_{j=1}^{i} \mu^{i-j} g_j$
Taking the expectation, assuming all $g_j$ have the same expectation and using $\sum_{j=1}^{i} \mu^{i-j} = (1-\mu^i)/(1-\mu)$: $E[m_i] = (1-\mu^i) E[g_i]$
Bias correction: divide $m_i$ by $(1-\mu^i)$.
(Demo)
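A quick numerical check of the bias factor, under the simplifying assumption that every gradient equals the same constant (here 3.0, an arbitrary value): the raw average $m_i$ starts far below the true value, while $m_i / (1-\mu^i)$ recovers it at every step.

```python
def momentum_averages(gradients, mu=0.9):
    """Return (m_i, bias-corrected m_i) for each step, starting from m_0 = 0."""
    m = 0.0
    rows = []
    for i, g in enumerate(gradients, start=1):
        m = mu * m + (1 - mu) * g
        rows.append((m, m / (1 - mu ** i)))
    return rows


rows = momentum_averages([3.0] * 10)
for m, m_hat in rows:
    print(round(m, 4), round(m_hat, 4))
```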

Timeline + Recap
1964: Classical Momentum
1983: Nesterov's accelerated gradient descent
2011: AdaGrad
2012: RMSProp
2015: Adam
Why the recent speed-up in innovation?


The learning setting
$D$: arbitrary set of objects we wish to classify
$\{-1, 1\}$: possible labels, i.e., classifications
$\pi$: an unknown probability distribution on $D \times \{-1, 1\}$
$S = \{(x_1, y_1), \dots, (x_m, y_m)\}$: training set of labeled points, sampled according to $\pi$
$H$: family of classifiers, i.e., functions from $D$ to $\{-1, 1\}$
$\ell$: a loss function from $H \times (D \times \{-1, 1\})$ to $\mathbb{R}$
Goal: use $S$ to find $h \in H$ that minimizes the risk $L_\pi(h) = E_\pi[\ell(h, z)]$

Gradient descent in learning
We want to minimize the function $E_\pi[\ell(h, z)]$.
$H$ is often parametrized through a weight vector $w$, so $L_\pi(h) = L_\pi(w)$.
We assume that the set of weight vectors is convex and that the loss function is convex.
$\pi$ is unknown, so we cannot compute the gradient of $L_\pi(w)$.

Stochastic gradient descent
Idea: take a step against a random direction $g_i$!
Guaranteed to converge (in expectation) to the minimum as long as the expectation of $g_i$ is the gradient, so that the expected step $-\alpha g_i$ points along the negative gradient, i.e.
$E_\pi[g_i \mid w_{i-1}] = \nabla L_\pi(w_{i-1})$
How to choose the direction $g_i$?

Stochastic gradient descent
How to choose the direction $g_i$? Use the points in the training set!
Let $g_i$ be the gradient of $\ell(w, (x_i, y_i))$ at the iterate $w_{i-1}$: $g_i = \nabla \ell(w_{i-1}, (x_i, y_i))$
It holds that
$E_\pi[g_i \mid w_{i-1}] = E_\pi[\nabla \ell(w_{i-1}, z)] = \nabla E_\pi[\ell(w_{i-1}, z)] = \nabla L_\pi(w_{i-1})$
where we used the linearity of the gradient operator. I.e., $g_i$ is an unbiased estimate of the gradient.
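The interchange of gradient and expectation can be checked numerically. Since $\pi$ is unknown, this sketch substitutes the uniform distribution over a synthetic sample for $\pi$; the squared loss, the scalar weight, and the data are all assumptions made for illustration. It compares the average of the per-point gradients with a finite-difference gradient of the average loss.

```python
import random


def loss(w, x, y):
    return 0.5 * (w * x - y) ** 2  # illustrative convex loss


def loss_grad(w, x, y):
    return (w * x - y) * x  # derivative of loss with respect to w


rng = random.Random(0)
points = [(rng.uniform(-1, 1), rng.choice([-1, 1])) for _ in range(1000)]
w = 0.7


def risk(v):
    """Average loss over the sample (stand-in for the expectation)."""
    return sum(loss(v, x, y) for x, y in points) / len(points)


mean_of_grads = sum(loss_grad(w, x, y) for x, y in points) / len(points)
h = 1e-5
grad_of_mean = (risk(w + h) - risk(w - h)) / (2 * h)  # central difference
print(mean_of_grads, grad_of_mean)
```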

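Stochastic gradient descent
input: loss function $\ell$, training set $S = \{(x_1, y_1), \dots, (x_m, y_m)\}$, starting point $w_0$, learning rate $\alpha$.
for $i \leftarrow 1, \dots, m$
  1. Gradient: $g_i \leftarrow \nabla \ell(w_{i-1}, (x_i, y_i))$;
  2. Update: $w_i \leftarrow w_{i-1} - \alpha g_i$;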
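A sketch of one SGD pass for a linear classifier $h(x) = \mathrm{sign}(\langle w, x \rangle)$. The slides do not fix a loss, so this assumes the logistic loss $\log(1 + e^{-y \langle w, x \rangle})$, which is convex in $w$; the data distribution in `draw`, the learning rate, and all function names are hypothetical.

```python
import math
import random


def logistic_grad(w, x, y):
    """Gradient in w of the logistic loss log(1 + exp(-y * <w, x>))."""
    margin = y * sum(wj * xj for wj, xj in zip(w, x))
    scale = -y / (1.0 + math.exp(margin))
    return [scale * xj for xj in x]


def sgd(samples, w0, alpha):
    """One pass over the training set, one gradient step per labeled point."""
    w = list(w0)
    for x, y in samples:
        g = logistic_grad(w, x, y)
        w = [wj - alpha * gj for wj, gj in zip(w, g)]
    return w


def draw(rng, m):
    """Hypothetical distribution: x uniform on a square, y = sign(x1 + x2)."""
    data = []
    for _ in range(m):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        data.append((x, 1 if x[0] + x[1] >= 0 else -1))
    return data


rng = random.Random(0)
w = sgd(draw(rng, 2000), w0=[0.0, 0.0], alpha=0.5)
fresh = draw(rng, 1000)  # new points from the same distribution
accuracy = sum(
    1 for x, y in fresh if (1 if w[0] * x[0] + w[1] * x[1] >= 0 else -1) == y
) / len(fresh)
print(w, accuracy)
```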

Stochastic gradient descent
Figure: An illustration of the gradient descent algorithm (left) and the stochastic gradient descent algorithm (right). The function to be minimized is $1.25(x + 6)^2 + (y - 8)^2$. For the stochastic case, the solid line depicts the averaged value of $w$.

Stochastic gradient descent
It is possible to iterate over the training set multiple times, but the order must be randomized.
Rather than considering the gradient at a single training point, we can use mini-batches, where we take the average of the gradients at multiple points.
Mini-batches improve the estimate $g_i$ of the gradient (lower variance).
Combined with back-propagation, SGD is the standard way to train neural networks.
It is also used to train many other ML models.
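A sketch of multi-pass mini-batch SGD with a fresh random order on every pass. To keep it short it assumes a one-parameter model with squared loss $\frac{1}{2}(wx - y)^2$ and synthetic noise-free data with true slope 3; the function name, batch size, number of passes, and learning rate are illustrative assumptions.

```python
import random


def minibatch_sgd(points, w0, alpha, batch_size, epochs, seed=0):
    """Mini-batch SGD for the 1-D squared loss 0.5 * (w * x - y)^2."""
    rng = random.Random(seed)
    data = list(points)
    w = w0
    for _ in range(epochs):
        rng.shuffle(data)  # new random order on every pass
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Average the per-point gradients over the mini-batch.
            g = sum((w * x - y) * x for x, y in batch) / len(batch)
            w -= alpha * g
    return w


rng = random.Random(1)
xs = [rng.uniform(-1, 1) for _ in range(200)]
points = [(x, 3.0 * x) for x in xs]  # synthetic data, true slope 3
w = minibatch_sgd(points, w0=0.0, alpha=0.5, batch_size=10, epochs=30)
print(w)  # should be close to 3
```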


Conclusions
Optimization is a deep subject, with a mature and fast-moving research component.
Used everywhere. Everywhere. Everywhere.
We looked at LP, ILP, Convex Programming, Stochastic Optimization, and others.
There is so much more. Keep reading!
