Don't Decay the Learning Rate, Increase the Batch Size. Samuel L. Smith, Pieter-Jan Kindermans, Quoc V. Le. December 9th 2017


1 Don't Decay the Learning Rate, Increase the Batch Size. Samuel L. Smith, Pieter-Jan Kindermans, Quoc V. Le. December 9th 2017. slsmith@ Google Brain

2 Three related questions: What properties control generalization? How should we tune SGD hyper-parameters? Can we train efficiently with large batches? (> 50,000 examples)

3 Small batches out-generalize large batches (at constant learning rate). As observed in On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima, Keskar et al. (2017)

4 Which minimum is best?

5 Bayesian model comparison

6 Bayesian model comparison

7 Bayesian model comparison Probability ratio of two competing models

8 Bayesian model comparison Prior probability ratio of the models. Usually 1. Probability ratio of two competing models

9 Bayesian model comparison Prior probability ratio of the models. Usually 1. Probability ratio of two competing models The evidence ratio!
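
The comparison built up above is an equation on the slide images that the transcript does not reproduce. A minimal sketch in standard notation, for two competing models M_1, M_2 and training data D:

\[
\underbrace{\frac{P(M_1 \mid D)}{P(M_2 \mid D)}}_{\text{posterior ratio}}
\;=\;
\underbrace{\frac{P(D \mid M_1)}{P(D \mid M_2)}}_{\text{evidence ratio}}
\;\times\;
\underbrace{\frac{P(M_1)}{P(M_2)}}_{\text{prior ratio, usually 1}}
\]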

10 The Bayesian evidence (Gaussian approximation). λ_i is the i-th Hessian eigenvalue; λ is the L2 regularization parameter.

11 The Bayesian evidence (Gaussian approximation). λ_i is the i-th Hessian eigenvalue; λ is the L2 regularization parameter. Evidence for a minimum.

12 The Bayesian evidence (Gaussian approximation). λ_i is the i-th Hessian eigenvalue; λ is the L2 regularization parameter. Evidence for a minimum. Depth of the minimum.

13 The Bayesian evidence (Gaussian approximation). λ_i is the i-th Hessian eigenvalue; λ is the L2 regularization parameter. Evidence for a minimum. Depth of the minimum. Width of the minimum.

14 The Bayesian evidence (Gaussian approximation). λ_i is the i-th Hessian eigenvalue; λ is the L2 regularization parameter. Invariant to changes in model parameterization (sharp minima can't generalize!). Width of the minimum.
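
The evidence expression itself is likewise on the slide image. A hedged sketch following the companion paper (A Bayesian Perspective on Generalization and Stochastic Gradient Descent), for a minimum at ω_0 of a model with p parameters, up to additive constants:

\[
\ln P(D \mid M) \;\approx\; -\,C(\omega_0) \;-\; \frac{1}{2}\sum_{i=1}^{p}\ln\!\left(\frac{\lambda_i}{\lambda}\right)
\]

Here C(ω_0) is the regularized cost at the minimum (its depth), the λ_i are the Hessian eigenvalues (its width), and λ is the L2 regularization parameter; the first term rewards deep minima while the second penalizes sharp ones.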

15 Which minimum is best?

16 Which minimum is best? Generalization is a weighted combination of: 1) Depth 2) Width

17 Which minimum is best? SGD should not minimize the cost function; it should maximize the evidence.

18 The SGD gradient update: True gradient + Noise

19 The SGD gradient update

20 The SGD gradient update

21 The SGD gradient update Batch size
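
The update annotated above is an equation on the slides; a minimal sketch of the standard decomposition, writing Ĉ for the cost estimated on a batch of B examples and C for the full-training-set cost:

\[
\Delta\omega \;=\; -\,\varepsilon\,\frac{d\hat{C}}{d\omega}
\;=\; -\,\varepsilon\underbrace{\frac{dC}{d\omega}}_{\text{true gradient}}
\;-\;\varepsilon\underbrace{\left(\frac{d\hat{C}}{d\omega}-\frac{dC}{d\omega}\right)}_{\text{noise}}
\]

The noise term has zero mean and a variance that falls roughly as 1/B, which is what makes the batch size a control knob for the size of the fluctuations.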

22 How to choose the batch size? (at constant learning rate)

23 How to choose the batch size? (at constant learning rate) Too much noise (small batches) Just right! Too little noise (big batches)

24 How to choose the batch size? (at constant learning rate) Too much noise (small batches) Just right! Too little noise (big batches) There should be an optimum batch size

25 How to choose the batch size? (at constant learning rate)

26 How to choose the batch size? (at constant learning rate) As predicted!

27 Defining the SGD noise scale

28 Defining the SGD noise scale. SGD integrates an underlying stochastic differential equation.

29 Defining the SGD noise scale SGD integrates an underlying stochastic differential equation Noise scale

30 Defining the SGD noise scale SGD integrates an underlying stochastic differential equation After a little math:

31 Defining the SGD noise scale SGD integrates an underlying stochastic differential equation After a little math: Prediction:
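
A sketch of the result the build leads to, following the papers cited on the final slide: interpreting SGD as integrating a stochastic differential equation gives a single noise scale g, and the prediction is that the optimal batch size grows linearly with both the learning rate and the training-set size:

\[
\frac{d\omega}{dt} \;=\; -\,\frac{dC}{d\omega} \;+\; \eta(t),
\qquad
\big\langle \eta(t)\,\eta(t')\big\rangle \;=\; g\,F(\omega)\,\delta(t-t')
\]
\[
g \;=\; \varepsilon\left(\frac{N}{B}-1\right) \;\approx\; \frac{\varepsilon N}{B}\ \ (B \ll N)
\qquad\Longrightarrow\qquad
B_{\mathrm{opt}} \;\propto\; \varepsilon N
\]

Here ε is the learning rate, N the training-set size, B the batch size, and F(ω) the gradient covariance.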

32

33

34 Consequences: 1) We can linearly scale batch size and learning rate (Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, Goyal et al., 2017). 2) We expect training sets to grow over time, which suggests batch sizes will rise.
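
A minimal sketch of consequence 1), with a hypothetical helper name (not the authors' code): holding g ≈ εN/B fixed means the learning rate scales linearly with the batch size.

# Keep the noise scale g ~ eps * N / B fixed when the batch size changes.
def linearly_scaled_lr(base_lr, base_batch, new_batch):
    """Learning rate that keeps eps * N / B constant (hypothetical helper)."""
    return base_lr * (new_batch / base_batch)

# Example: Goyal et al. (2017) train ResNet-50 with eps = 0.1 at B = 256 and
# eps = 3.2 at B = 8192; the rule only holds while eps stays reasonably small.
print(linearly_scaled_lr(0.1, 256, 8192))  # ~3.2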

35 What about momentum?

36 What about momentum?

37 What about momentum?
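
The momentum slides here are image-only in the transcript. As a hedged sketch of the result stated in the paper, momentum with coefficient m rescales the effective noise scale, which suggests scaling the batch size by 1/(1 − m) as well:

\[
g_{\mathrm{eff}} \;=\; \frac{\varepsilon}{1-m}\left(\frac{N}{B}-1\right) \;\approx\; \frac{\varepsilon N}{B\,(1-m)}
\qquad\Longrightarrow\qquad
B_{\mathrm{opt}} \;\propto\; \frac{\varepsilon N}{1-m}
\]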

38

39 Decaying learning rate and increasing batch size are equivalent

40 Decaying learning rate and increasing batch size are equivalent

41 Decaying learning rate and increasing batch size are equivalent. We can choose any combination of ε and B with the same g (so long as ε isn't too large).
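
A minimal sketch of the equivalence, with hypothetical schedule values rather than the experiments' exact settings: a factor-of-5 learning-rate decay can be traded for a factor-of-5 batch-size increase, since both leave g ≈ εN/B unchanged.

# Two schedules with the same noise scale g ~ eps * N / B in every phase:
# the first decays the learning rate, the second grows the batch size.
def equivalent_schedules(eps0=0.1, batch0=128, factor=5, phases=3):
    decay_lr = [(eps0 / factor**i, batch0) for i in range(phases)]
    grow_batch = [(eps0, batch0 * factor**i) for i in range(phases)]
    return decay_lr, grow_batch

decay_lr, grow_batch = equivalent_schedules()
# decay_lr   -> [(0.1, 128), (0.02, 128), (0.004, 128)]  (up to float rounding)
# grow_batch -> [(0.1, 128), (0.1, 640), (0.1, 3200)]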

42 Three equivalent schedules: Wide ResNet on CIFAR-10

43 Training curves: Ghost batch norm, Hoffer et al., 2017

44 Training curves: Computational cost constant

45 Training curves: Computational cost constant But parallelizable

46 Test curves: Momentum; Nesterov momentum

47 Test curves: Vanilla SGD; Adam

48 Towards large batch training:

49 Towards large batch training:

50 Towards large batch training:

51 Towards large batch training:

52 Towards large batch training: Typical speed-up X

53 Why does momentum scaling reduce test accuracy?

54 Why does momentum scaling reduce test accuracy? Accumulation stores moving average of gradients

55 Why does momentum scaling reduce test accuracy? Larger momentum equals longer memory

56 Why does momentum scaling reduce test accuracy? Larger momentum equals longer memory The gradient changes too slowly as we explore the parameter space
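
A sketch of the accumulation referred to above, in standard heavy-ball notation:

\[
v_{t+1} \;=\; m\,v_t \;+\; \frac{d\hat{C}}{d\omega}\bigg|_{\omega_t},
\qquad
\omega_{t+1} \;=\; \omega_t \;-\; \varepsilon\,v_{t+1}
\]

The accumulation v is an exponential moving average with a memory of roughly 1/(1 − m) updates, so pushing m towards 1 at a fixed epoch budget leaves too few updates for v, and hence the parameters, to adapt early in training.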

57 Training ImageNet in under 2500 updates! Inception-ResNet-V2. Original implementation: ~400,000 updates. ImageNet in one hour, Goyal et al., 2017 (learning rate scaling): ~14,000 updates.

58 Training ImageNet in under 2500 updates! 79% accuracy in under 6000 updates; 77% accuracy in under 2500 updates. Batches of 65,536 images.

59 Thank You! A Bayesian Perspective on Generalization and Stochastic Gradient Descent (arXiv), Samuel L. Smith and Quoc V. Le. Don't Decay the Learning Rate, Increase the Batch Size (arXiv), Samuel L. Smith*, Pieter-Jan Kindermans* and Quoc V. Le (*equal contribution). Stochastic Gradient Descent as Approximate Bayesian Inference (arXiv), Stephan Mandt, Matthew D. Hoffman and David M. Blei. slsmith@ pikinder@ qvl@
