Adam: A Method for Stochastic Optimization
1 Adam: A Method for Stochastic Optimization Diederik P. Kingma, Jimmy Ba Presented by
2 Content Background Supervised ML theory and the importance of optimum finding Gradient descent and its variants Limitations of SGD Earlier approaches/building blocks Momentum Nesterov accelerated gradient (NAG) AdaGrad (Adaptive Gradient) AdaDelta and RMSProp Adam Update rule Bias correction AdaMax Post Adam innovations Improving Adam Additional approaches Shampoo 2
3 Content Background Supervised ML theory and the importance of optimum finding Gradient descent and its variants Limitations of SGD Earlier approaches/building blocks Momentum Nesterov accelerated gradient (NAG) AdaGrad (Adaptive Gradient) AdaDelta and RMSProp Adam Update rule Bias correction AdaMax Post Adam innovations Improving Adam Additional approaches Shampoo 3
6 Basic Supervised Machine Learning Terminology. Notation and explanation: $\mathcal{X}$: instance space; $\mathcal{Y}$: label space; $\mathcal{D}$: unknown probability distribution over $\mathcal{X}$; $f: \mathcal{X} \to \mathcal{Y}$: true mapping between instance and label spaces, unknown. 7
7 Examples of instance and label spaces ($\mathcal{X} \to \mathcal{Y}$): the space of all RGB images of some dimension → is there a cat in the image ({0,1}); the space of a stock's historical price sequences → the stock's next-day closing price ([0, ∞)); the space of all finite Chinese sentences → the set of all finite English sentences; the space of all information regarding two companies → the probability that a merger will be successful; the space of all MRI images → the probability, location and type of a tumor; the space of all finite-length voice sequences → the corresponding Amazon product being referred to.
8 Basic Supervised Machine Learning Terminology. Notation and explanation: $\mathcal{X}$: instance space; $\mathcal{Y}$: label space; $\mathcal{D}$: unknown probability distribution over $\mathcal{X}$; $f: \mathcal{X} \to \mathcal{Y}$: true mapping between instance and label spaces, unknown. 9
9 Basic Supervised Machine Learning Terminology. Notation and explanation: $\mathcal{X}$: instance space; $\mathcal{Y}$: label space; $\mathcal{D}$: unknown probability distribution over $\mathcal{X}$; $f: \mathcal{X} \to \mathcal{Y}$: true mapping, unknown; $\{(x_i, y_i)\}_{i=1}^{n}$, $x_i \in \mathcal{X}$, $y_i \in \mathcal{Y}$: training set; $h: \mathcal{X} \to \mathcal{Y}$, $h \in \mathcal{H}$: hypothesis, the object we wish to learn. 10
10 Basic Supervised Machine Learning theory. The goal is to find a hypothesis $h$ that approximates $f$ as best as possible, but approximates in what sense? We need a way of evaluating a hypothesis' quality. 11
11 Basic Supervised Machine Learning theory. Let us define a loss function $\ell: \mathcal{Y} \times \mathcal{Y} \to [0, \infty)$. For example: zero-one loss $\mathbb{1}(h(x_i) \ne y_i)$; quadratic loss $(h(x_i) - y_i)^2$. Here we use $y_i = f(x_i)$. A measure of how bad the hypothesis did on a single sample. 12
12 Basic Supervised Machine Learning theory. Let us also define the Generalization error (aka Risk) of a hypothesis $h$: $L_{\mathcal{D}}(h) = \mathbb{E}_{x \sim \mathcal{D}}[\ell(h(x), f(x))]$. A measure of how bad the hypothesis did on the entire instance space. A good hypothesis is one with a low Risk value. 13
13 Basic Supervised Machine Learning theory. Let us also define the Generalization error (aka Risk) of a hypothesis $h$: $L_{\mathcal{D}}(h) = \mathbb{E}_{x \sim \mathcal{D}}[\ell(h(x), f(x))]$. A measure of how bad the hypothesis did on the entire instance space. A good hypothesis is one with a low Risk value. Unfortunately... 14
14 Basic Supervised Machine Learning theory. Unfortunately, the true Risk $L_{\mathcal{D}}(h)$ cannot be computed because the distribution $\mathcal{D}$ is unknown to the learning algorithm. 15
15 Basic Supervised Machine Learning theory. Unfortunately, the true Risk $L_{\mathcal{D}}(h)$ cannot be computed because the distribution $\mathcal{D}$ is unknown to the learning algorithm. We can, however, compute a proxy of the true Risk, called the Empirical Risk. 16
16 Basic Supervised Machine Learning theory. Let us define the Empirical Risk of a hypothesis $h$: $L_S(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i)$. Recall: $\{(x_i, y_i)\}_{i=1}^{n}$, $x_i \in \mathcal{X}$, $y_i \in \mathcal{Y}$ is the training set, and $L_{\mathcal{D}}(h) = \mathbb{E}_{x \sim \mathcal{D}}[\ell(h(x), f(x))]$. 17
17 Empirical Risk Minimization (ERM) strategy. After mountains of theory (PAC Learning, VC Theory, etc.), the following theorem is proven (this is very informal): the best strategy for a learning algorithm is to minimize the Empirical Risk. 18
18 Empirical Risk Minimization strategy (ERM). This means that the learning algorithm defined by the ERM principle boils down to solving the following optimization problem: find $\hat{h}$ such that $\hat{h} = \arg\min_{h \in \mathcal{H}} L_S(h) = \arg\min_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i)$. 19
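To make the ERM objective concrete, here is a minimal numpy sketch (not from the slides) that evaluates the empirical risk of a few candidate hypotheses under the quadratic loss and picks the best one; the synthetic data, slopes, and function names are illustrative assumptions.

```python
import numpy as np

def empirical_risk(h, xs, ys):
    """Average quadratic loss of hypothesis h over the training set."""
    return np.mean((h(xs) - ys) ** 2)

# Illustrative training set: y roughly equals 2x plus noise.
rng = np.random.default_rng(0)
xs = rng.uniform(-1, 1, size=100)
ys = 2.0 * xs + 0.1 * rng.normal(size=100)

# A tiny finite hypothesis class: h(x) = m * x for a few slopes m.
candidates = {m: (lambda x, m=m: m * x) for m in [0.0, 1.0, 2.0, 3.0]}

# ERM: pick the hypothesis with the smallest empirical risk.
best_m = min(candidates, key=lambda m: empirical_risk(candidates[m], xs, ys))
print("ERM choice: m =", best_m)
```

With a hypothesis class this small the minimization is a lookup; the rest of the presentation is about what to do when the class is parameterized by millions of real numbers.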
19 The hypothesis class can be very simple: when $\mathcal{X} \subseteq \mathbb{R}^3$ and $\mathcal{Y} = \{-1, 1\}$, $\mathcal{H}$ can be the class of all three-dimensional hyperplanes, $h(x) = \mathrm{sign}(w^\top x + w_0)$, $\theta = \{w_0, w_1, w_2, w_3\} \in \mathbb{R}^4$. 21
20 The hypothesis class can be very complex: recent years' Deep Learning architectures result in models with tens of millions of parameters, $\theta \in \mathbb{R}^{O(10^7)}$. 22
21 The bottom line. Machine Learning applications require us to find good local minima of extremely high dimensional, noisy, non-convex functions. $\hat{h} = \arg\min_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i)$
22 The bottom line. Machine Learning applications require us to find good local minima of extremely high dimensional, noisy, non-convex functions. We need a principled method for achieving this goal. $\hat{h} = \arg\min_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i)$
23 Introducing The Gradient method 25
24 Questions?
25 Content Background Supervised ML theory and the importance of optimum finding Gradient descent and its variants Limitations of SGD Earlier approaches/building blocks Momentum Nesterov accelerated gradient (NAG) AdaGrad (Adaptive Gradient) AdaDelta and RMSProp Adam Update rule Bias correction AdaMax Post Adam innovations Improving Adam Additional approaches Shampoo 27
26 A couple of notes before we head on: I stick to the papers' notations. From here on we'll use $J(\theta)$, which is the same as $L_S(h)$. 28
27 Introducing The Gradient method. Input: learning rate $\eta$, tolerance parameter $\varepsilon > 0$. Initialization: pick $\theta_0 \in \mathbb{R}^d$ arbitrarily. General step: set $\theta_{t+1} = \theta_t - \eta \nabla J(\theta_t)$; if $\|\nabla J(\theta_{t+1})\| \le \varepsilon$, then STOP and $\theta_{t+1}$ is the output. 29
28 Gradient descent example: simple Linear regression. $\mathcal{X} = \mathbb{R}$, $\mathcal{Y} = \mathbb{R}$, training set $\{(x_i, y_i)\}_{i=1}^{n}$, $x_i \in \mathcal{X}$, $y_i \in \mathcal{Y}$. $\mathcal{H} = \{mx + b : m \in \mathbb{R}, b \in \mathbb{R}\}$, $\theta = (m, b)$, $h(x) = mx$
29 Gradient descent example: simple Linear regression. $\mathcal{X} = \mathbb{R}$, $\mathcal{Y} = \mathbb{R}$, training set $\{(x_i, y_i)\}_{i=1}^{n}$, $x_i \in \mathcal{X}$, $y_i \in \mathcal{Y}$. $\mathcal{H} = \{mx + b : m \in \mathbb{R}, b \in \mathbb{R}\}$, $\theta = (m, b)$, $h(x) = mx + b$. The goal is to find good $m$ and $b$ values. 31
30 Gradient descent example: simple Linear regression. $h(x) = mx + b$, $\ell(h(x), y) = (y - h(x))^2$. $L_S(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i) = \frac{1}{n} \sum_{i=1}^{n} (y_i - h(x_i))^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - (m x_i + b))^2$. 32
31 GD example: computing the gradient. For $m$: $\frac{\partial}{\partial m} L_S(h) = \frac{\partial}{\partial m} \frac{1}{n} \sum_{i=1}^{n} (y_i - (m x_i + b))^2 = -\frac{2}{n} \sum_{i=1}^{n} x_i (y_i - (m x_i + b))$. Similarly for $b$: $\frac{\partial}{\partial b} L_S(h) = -\frac{2}{n} \sum_{i=1}^{n} (y_i - (m x_i + b))$. 33
32 GD example: computing the gradient. So the gradient vector is: $\nabla L_S(\theta) = \left( \frac{\partial L_S}{\partial m}, \frac{\partial L_S}{\partial b} \right) = \left( -\frac{2}{n} \sum_{i=1}^{n} x_i (y_i - (m x_i + b)), \; -\frac{2}{n} \sum_{i=1}^{n} (y_i - (m x_i + b)) \right)$. 34
33 GD example: the complete algorithm. Input: learning rate $\eta$, tolerance parameter $\varepsilon > 0$. Initialization: pick $\theta_0 = (m_0, b_0) \in \mathbb{R}^2$ arbitrarily. General step: set $\theta_{t+1} = \theta_t - \eta \nabla L_S(\theta_t)$, i.e. $m_{t+1} = m_t + \eta \frac{2}{n} \sum_{i=1}^{n} x_i (y_i - (m_t x_i + b_t))$ and $b_{t+1} = b_t + \eta \frac{2}{n} \sum_{i=1}^{n} (y_i - (m_t x_i + b_t))$; if $\|\nabla L_S(\theta_{t+1})\| \le \varepsilon$, then STOP and $\theta_{t+1}$ is the output. 35
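A minimal numpy sketch of this algorithm follows; the synthetic data, learning rate, and tolerance are illustrative assumptions rather than values taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 3.0 * x + 0.5 + 0.1 * rng.normal(size=200)   # data generated with m = 3, b = 0.5

eta, eps = 0.1, 1e-6          # learning rate and tolerance (illustrative)
m, b = 0.0, 0.0               # theta_0 = (m, b), arbitrary initialization
n = len(x)

for t in range(10_000):
    residual = y - (m * x + b)
    grad_m = -(2.0 / n) * np.sum(x * residual)   # dL/dm
    grad_b = -(2.0 / n) * np.sum(residual)       # dL/db
    m -= eta * grad_m                            # theta_{t+1} = theta_t - eta * grad
    b -= eta * grad_b
    if np.hypot(grad_m, grad_b) <= eps:          # stop once the gradient norm is small
        break

print(f"m = {m:.3f}, b = {b:.3f}")
```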
34 GD example visualized 36
35 Variants of Gradient descent Differ in how much data we use to compute the gradients of the objective function. A trade-off between the accuracy of the update and the computation time per update. 38
36 Batch Gradient descent. Computes the gradients of the function w.r.t. the parameters $\theta$ for the entire training dataset: $\theta = \theta - \eta \nabla_\theta J(\theta; x^{(1:n)}, y^{(1:n)})$. 39
37 Batch Gradient descent Pros and Cons. Pros: guaranteed to converge to the global minimum for convex surfaces and to a local minimum for non-convex ones; an unbiased estimate of gradients. Cons: possibly slow or impossible to compute; some examples may be redundant; converges to the minimum of the basin the parameters are placed in. 40
38 Stochastic Gradient descent (SGD). Computes the gradients of the function w.r.t. the parameters $\theta$ for a single training sample: $\theta = \theta - \eta \nabla_\theta J(\theta; x^{(i)}, y^{(i)})$. 41
39 Stochastic Gradient descent Pros and Cons Pros Much faster to compute. Potential to jump to better basins (and better local minima). Cons High variance that causes the objective to fluctuate heavily. 42
40 Mini-batch Gradient descent. Computes the gradients of the function w.r.t. the parameters $\theta$ for a mini-batch of $k$ training samples: $\theta = \theta - \eta \nabla_\theta J(\theta; x^{(i:i+k)}, y^{(i:i+k)})$. $k$ is usually 32 to
41 Mini-batch Gradient descent Pros and Cons: the best of both worlds; fast, exploratory, and allows for stable convergence. Makes use of highly optimized matrix libraries and hardware. The method of choice for most Supervised Machine Learning scenarios. 44
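As a sketch of the mini-batch loop (the linear model, synthetic data, and batch size of 32 are assumptions made here for illustration, not values given on the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
eta, batch_size = 0.05, 32

for epoch in range(20):
    perm = rng.permutation(len(X))               # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        grad = 2.0 / len(xb) * xb.T @ (xb @ w - yb)   # gradient on this mini-batch only
        w -= eta * grad

print(np.round(w, 2))
```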
42 Variants of Gradient descent - visualizations 45
43 It's all Mini-batch from here. The remainder of the presentation will focus on variants of the Mini-batch version. From here on, Gradient descent, SGD, and Gradient step all refer to the Mini-batch variant. We'll leave out the parameters $x^{(i:i+k)}, y^{(i:i+k)}$ for simplicity. 48
44 Challenges and limitations of the plain SGD Choosing a proper learning rate. Sharing the learning rate for all parameters. Optimization in the face of highly non-convex functions. 49
45 Questions?
46 Content Background Supervised ML theory and the importance of optimum finding Gradient descent and its variants Limitations of SGD Earlier approaches/building blocks Momentum Nesterov accelerated gradient (NAG) AdaGrad (Adaptive Gradient) AdaDelta and RMSProp Adam Update rule Bias correction AdaMax Post Adam innovations Improving Adam Additional approaches Shampoo 54
47 Novelties over the plain SGD: Momentum, Nesterov accelerated gradient (NAG), AdaGrad (Adaptive Gradient), AdaDelta and RMSProp. We will focus only on algorithms that are feasible to compute in practice for high dimensional data sets (and will ignore second-order methods such as Newton's method). 55
49 Momentum (Qian, N. 1999) Plain SGD can make erratic updates on non-smooth loss functions Consider an outlier example which throws off the learning process Need to maintain some history of updates. Physics example: A moving ball acquires momentum, at which point it becomes less sensitive to the direct force (gradient). 57
50 Momentum (Qian, N. 1999). Add a fraction $\gamma$ (usually about 0.9) of the update vector of the past time step to the current update vector. Faster convergence and reduced oscillations. $v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta)$, $\theta = \theta - v_t$. Here $J(\theta)$ is the objective function, $\theta \in \mathbb{R}^d$ the parameters, $\nabla_\theta J(\theta)$ the gradient vector, and $\eta$ the learning rate. Compare with plain SGD: $\theta = \theta - \eta \nabla_\theta J(\theta)$. 58
51 Momentum (Qian, N. 1999) (a) SGD without momentum (b) SGD with momentum 59
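A minimal sketch of the momentum update on a toy least-squares objective; $\gamma = 0.9$ matches the slide, while the data, learning rate, and step count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=500)

theta = np.zeros(3)
v = np.zeros_like(theta)      # update vector of the past time step
gamma, eta = 0.9, 0.01

for step in range(2000):
    grad = 2.0 / len(X) * X.T @ (X @ theta - y)   # gradient of J(theta)
    v = gamma * v + eta * grad                    # v_t = gamma * v_{t-1} + eta * grad
    theta = theta - v                             # theta = theta - v_t

print(np.round(theta, 2))
```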
52 Nesterov accelerated gradient (Nesterov, Y. 1983). Momentum is usually pretty high once we get near our goal point. The algorithm has no idea when to slow down and therefore might miss the goal point. We would like our momentum to have a kind of foresight. 60
53 Momentum has no idea when to slow down 61
54 Nesterov accelerated gradient (Nesterov, Y. 1983). First make a jump based on our previous momentum, calculate the gradients and then make a correction. Look ahead by calculating the gradient not w.r.t. our current parameters but w.r.t. the approximate future position of our parameters: $v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta - \gamma v_{t-1})$, $\theta = \theta - v_t$. Here $J(\theta)$ is the objective function, $\theta \in \mathbb{R}^d$ the parameters, $\nabla_\theta J$ the gradient vector, and $\eta$ the learning rate. 62
55 Nesterov accelerated gradient (Nesterov, Y. 1983) 63
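A sketch of the look-ahead step, using the same kind of illustrative objective as before (this is a minimal sketch under assumed data and hyperparameters, not the original code):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=500)

def grad(theta):
    """Gradient of the mean-squared-error objective at theta."""
    return 2.0 / len(X) * X.T @ (X @ theta - y)

theta = np.zeros(3)
v = np.zeros_like(theta)
gamma, eta = 0.9, 0.01

for step in range(2000):
    lookahead = theta - gamma * v          # approximate future position of the parameters
    v = gamma * v + eta * grad(lookahead)  # gradient evaluated at the look-ahead point
    theta = theta - v

print(np.round(theta, 2))
```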
56 Adagrad (Duchi, J., Hazan, E., & Singer, Y. 2011) Now we are able to adapt our updates to the slope. But updates are the same for all the parameters being updated. We would like to adapt our updates to each individual parameter depending on their importance. 64
57 Adagrad (Duchi, J., Hazan, E., & Singer, Y. 2011). Perform larger updates for infrequent parameters and smaller updates for frequent ones. Use a different learning rate for every parameter $\theta_i$, at every time step $t$. Well-suited for dealing with sparse data. Eliminates the need to manually tune the learning rate (most just use 0.01). 65
58 Adagrad (Duchi, J., Hazan, E., & Singer, Y. 2011). $g_{t,i} = \nabla_\theta J(\theta_{t,i})$, $\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} g_{t,i}$. Here $J(\theta)$ is the objective function, $\theta \in \mathbb{R}^d$ the parameters, $\nabla_\theta J$ the gradient vector, and $\eta$ the learning rate. Compare with plain SGD: $\theta = \theta - \eta \nabla_\theta J(\theta)$. 66
59 Adagrad (Duchi, J., Hazan, E., & Singer, Y. 2011). $g_{t,i} = \nabla_\theta J(\theta_{t,i})$, $\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} g_{t,i}$. $g_{t,i}$: the gradient w.r.t. the parameter $\theta_i$ at time step $t$. $G_t$: a diagonal matrix where $G_{t,ii}$ is the sum of squared gradients w.r.t. $\theta_i$ up to time step $t$. $\epsilon$: prevents division by zero (on the order of 1e-8). 67
60 Adagrad (Duchi, J., Hazan, E., & Singer, Y. 2011). $g_{t,i} = \nabla_\theta J(\theta_{t,i})$, $\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} g_{t,i}$; in vectorized form, $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \odot g_t$. $g_{t,i}$: the gradient w.r.t. the parameter $\theta_i$ at time step $t$. $G_t$: a diagonal matrix where $G_{t,ii}$ is the sum of squared gradients w.r.t. $\theta_i$ up to time step $t$. $\epsilon$: prevents division by zero (on the order of 1e-8). 68
61 Adagrad vs. Plain SGD. Adagrad: $\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} g_{t,i}$. Plain SGD: $\theta_{t+1,i} = \theta_{t,i} - \eta\, g_{t,i}$. 69
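A per-parameter sketch of the accumulator and the scaled step; the learning rate of 0.01 follows the slide's remark, while the toy objective and step count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=500)

theta = np.zeros(3)
G = np.zeros(3)               # running sum of squared gradients, one entry per parameter
eta, eps = 0.01, 1e-8

for step in range(5000):
    g = 2.0 / len(X) * X.T @ (X @ theta - y)
    G += g ** 2                               # accumulate squared gradients
    theta -= eta / np.sqrt(G + eps) * g       # per-parameter scaled step

print(np.round(theta, 2))
```

Note how G only grows over the run; the next slides show why that eventually stalls the updates.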
62 Adadelta (Zeiler, M. D. 2012) and RMSProp (Hinton). In Adagrad, the accumulated sum keeps growing. This causes the learning rate to shrink and become infinitesimally small, impeding convergence. We need an efficient way of reducing this aggressive, monotonically decreasing learning rate. 70
63 Adadelta (Zeiler, M. D. 2012) and RMSProp (Hinton). Recursively define a decaying average of past squared gradients. The running average at time step $t$ depends only on the previous time step and the current gradient. 71
64 Adadelta (Zeiler, M. D. 2012) and RMSProp (Hinton). $E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2$, $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t$. Here $J(\theta)$ is the objective function, $\theta \in \mathbb{R}^d$ the parameters, $g_t$ the gradient vector, $\eta$ the learning rate, and $\gamma$ the momentum (decay) term. Compare with plain SGD: $\theta = \theta - \eta \nabla_\theta J(\theta)$. 72
65 Adadelta (Zeiler, M. D. 2012) and RMSProp (Hinton). $E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2$, $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t$. $E[g^2]_t$: the running average of squared gradients at time step $t$. $\epsilon$: prevents division by zero (on the order of 1e-8). 73
66 Adadelta (Zeiler, M. D. 2012) 74
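A sketch of the decaying-average (RMSProp-style) version; $\gamma = 0.9$ is a common default assumed here, and the rest of the setup is the same illustrative toy objective as above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=500)

theta = np.zeros(3)
Eg2 = np.zeros(3)             # decaying average of squared gradients, E[g^2]_t
gamma, eta, eps = 0.9, 0.01, 1e-8

for step in range(5000):
    g = 2.0 / len(X) * X.T @ (X @ theta - y)
    Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2      # forget old gradients instead of summing them
    theta -= eta / np.sqrt(Eg2 + eps) * g         # step scaled by the running average

print(np.round(theta, 2))
```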
67 Visualizations of the discussed algorithms 75
68 Questions?
69 Content Background Supervised ML theory and the importance of optimum finding Gradient descent and its variants Limitations of SGD Earlier approaches/building blocks Momentum Nesterov accelerated gradient (NAG) AdaGrad (Adaptive Gradient) AdaDelta and RMSProp Adam Update rule Bias correction AdaMax Post Adam innovations Improving Adam Additional approaches Shampoo 77
70 Adam (Kingma, D. P., & Ba, J. 2014). RMSProp allows for an adaptive per-parameter update, but the update itself is still done using the current, noisy gradient. We would like the gradient itself to be replaced by a similar exponentially decaying average of past gradients. 78
71 Adam Update rule. $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$, $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$, $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t} + \epsilon} m_t$. Here $J(\theta)$ is the objective function, $\theta \in \mathbb{R}^d$ the parameters, $g_t$ the gradient vector, and $\eta$ the learning rate. $m_t$ and $v_t$ are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients, respectively. Recommended values in the paper are $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.
72 Adam Bias towards zero. As $m_t$ and $v_t$ are initialized as 0's, they are biased towards zero. Most significant in the initial steps. Most significant when $\beta_1$ and $\beta_2$ are close to 1. A correction is required. 80
73 Adam Bias correction. $\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$, $\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$, $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$. Correct both moments to get the final update rule. 81
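Putting the moment estimates and the bias correction together, here is a minimal sketch of the full update; $\beta_1$, $\beta_2$, and $\epsilon$ follow the paper's recommended values, while the learning rate and the toy objective are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=500)

theta = np.zeros(3)
m = np.zeros(3)               # first-moment estimate (mean of gradients)
v = np.zeros(3)               # second-moment estimate (uncentered variance)
beta1, beta2, eta, eps = 0.9, 0.999, 0.01, 1e-8

for t in range(1, 5001):
    g = 2.0 / len(X) * X.T @ (X @ theta - y)
    m = beta1 * m + (1 - beta1) * g               # decaying mean of gradients
    v = beta2 * v + (1 - beta2) * g ** 2          # decaying mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                  # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    theta -= eta / (np.sqrt(v_hat) + eps) * m_hat

print(np.round(theta, 2))
```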
74 Adam vs. Adadelta. Adam: $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$, with $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$ and $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$: the step uses a decaying average of past gradients. Adadelta: $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t$: the step still uses the raw current gradient $g_t$. 82
75 Adam - Performance 85
76 Adam - Performance 86
77 Adam - Performance 87
78 Questions?
79 Content Background Supervised ML theory and the importance of optimum finding Gradient descent and its variants Limitations of SGD Earlier approaches/building blocks Momentum Nesterov accelerated gradient (NAG) AdaGrad (Adaptive Gradient) AdaDelta and RMSProp Adam Update rule Bias correction AdaMax Post Adam innovations Improving Adam Additional approaches Shampoo 89
80 Improving Adam. NAdam: incorporating Nesterov into Adam. AdamW: decoupling weight decay, $\theta_{t+1} = \theta_t - \eta \left( \frac{1}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t + \lambda \theta_t \right)$. AMSGrad: fixing the exponential moving average, $\hat{v}_t = \max(\hat{v}_{t-1}, v_t)$, $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} m_t$; the maximum of past squared gradients is used instead of their exponential moving average. Adam with warm restarts. 90
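The AMSGrad change relative to Adam is essentially one line: keep the running maximum of the second-moment estimate so the effective step size can never grow back. A minimal sketch under the same illustrative setup as before (bias correction omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=500)

theta = np.zeros(3)
m = np.zeros(3)
v = np.zeros(3)
v_max = np.zeros(3)           # running maximum of the second-moment estimate
beta1, beta2, eta, eps = 0.9, 0.999, 0.01, 1e-8

for t in range(1, 5001):
    g = 2.0 / len(X) * X.T @ (X @ theta - y)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    v_max = np.maximum(v_max, v)                  # never let the denominator shrink
    theta -= eta / (np.sqrt(v_max) + eps) * m

print(np.round(theta, 2))
```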
81 Additional approaches Snapshot ensembles Learning to optimize 91
82 Shampoo (Gupta, V., Koren, T., & Singer, Y. 2018) 92
83 Summary. A brief walkthrough of Supervised Machine Learning. A conviction of the importance and relevance of Gradient methods. An overview of modern Gradient descent optimization algorithms. The contribution of Adam. Innovations subsequent to Adam. 93
85 Questions?
86 The End
87 links
88 References
Ruder, S. (2016). An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747.
Yann N. Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. arXiv, pages 1-14.
Timothy Dozat. Incorporating Nesterov Momentum into Adam. ICLR Workshop.
Diederik P. Kingma and Jimmy Lei Ba. Adam: A Method for Stochastic Optimization. International Conference on Learning Representations, pages 1-13.
Yurii Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). Doklady AN SSSR (translated as Soviet Math. Dokl.), 269.
Ning Qian. On the momentum term in gradient descent learning algorithms. Neural Networks: the official journal of the International Neural Network Society, 12(1).
Matthew D. Zeiler. ADADELTA: An Adaptive Learning Rate Method. arXiv preprint.
Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul).