Unraveling the mysteries of stochastic gradient descent on deep neural networks


1 Unraveling the mysteries of stochastic gradient descent on deep neural networks. Pratik Chaudhari, UCLA VISION LAB.

2 The question. The loss $f(x)$ measures the disagreement of the predictions with the ground truth (Cat vs. Dog, ...), and we seek the weights, aka parameters, $x^* = \operatorname{argmin}_x f(x)$. Stochastic gradient descent: $x_{k+1} = x_k - \eta\,\nabla f_b(x_k)$, where $\nabla f_b$ is the gradient on a mini-batch of size $b$ and $\eta$ is the learning rate. Many, many variants: AdaGrad, RMSProp, Adam, SAG, SVRG, Catalyst, APPA, Natasha, Katyusha. Why is SGD so special?
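For concreteness, a minimal sketch of this update on a made-up least-squares loss; the data, step size, and batch size below are illustrative, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: f(x) = (1/N) sum_k f_k(x), f_k(x) = 0.5*(a_k.x - y_k)^2.
N, d = 1000, 10
A = rng.normal(size=(N, d))
y = A @ rng.normal(size=d) + 0.1 * rng.normal(size=N)

def grad_f_b(x, idx):
    """Mini-batch gradient of f over the examples in idx."""
    r = A[idx] @ x - y[idx]
    return A[idx].T @ r / len(idx)

eta, b = 0.1, 32                                  # learning rate and batch size
x = np.zeros(d)
for k in range(2000):
    idx = rng.choice(N, size=b, replace=True)     # sample a mini-batch
    x = x - eta * grad_f_b(x, idx)                # x_{k+1} = x_k - eta * grad f_b(x_k)
```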

3 Empirical evidence: wide minima. [Two histograms of the eigenvalues of the Hessian at a minimum found by SGD: frequency vs. eigenvalue, with a short negative tail.]

4 A bit of statistical physics. The energy landscape of a binary perceptron has many sharp minima and few wide minima, but the wide ones generalize better! [Baldassi et al., '15]. Wide minima are a large-deviations phenomenon.

5 Tilting the Gibbs measure: Local Entropy [Chaudhari et al., ICLR '17]. Replace the original problem $x^* = \operatorname{argmin}_x f(x)$ by $x^* = \operatorname{argmax}_x \big(G_\gamma * e^{-f}\big)(x) = \operatorname{argmin}_x\, -\log\big(G_\gamma * e^{-f}\big)(x)$, where $G_\gamma$ is a Gaussian kernel of variance $\gamma$.
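The effect of the convolution can be probed with a small Monte Carlo sketch; the loss f and the variance gamma below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Illustrative 1D loss with a sharp minimum at x = +1 and a wide one at x = -1.
    return np.minimum(50 * (x - 1.0)**2, (x + 1.0)**2)

def local_entropy(x, gamma=0.25, n_samples=100_000):
    """Monte Carlo estimate of -log (G_gamma * exp(-f))(x), up to an additive constant."""
    z = rng.normal(scale=np.sqrt(gamma), size=n_samples)
    return -np.log(np.mean(np.exp(-f(x + z))))

# The smoothing penalizes narrow minima: the tilted loss is lower at the wide
# minimum (x = -1) than at the sharp one (x = +1), even though f(-1) = f(+1) = 0.
print(local_entropy(-1.0), local_entropy(1.0))
```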

6 Parle: parallelization of SGD with state-of-the-art performance [Chaudhari et al., SysML '18]. [Plots: Wide-ResNet on CIFAR-10; All-CNN on CIFAR-10 with 25% of the data.]

7 The question. Why is SGD so special?

8 A continuous-time view of SGD. Diffusion matrix: the variance of the mini-batch gradients is $\operatorname{var}\big(\nabla f_b(x)\big) = \frac{D(x)}{b}$, where $D(x) = \frac{1}{N}\sum_{k=1}^{N} \nabla f_k(x)\,\nabla f_k(x)^\top - \nabla f(x)\,\nabla f(x)^\top$. Temperature: the ratio of the learning rate and the batch size, $\beta^{-1} = \frac{\eta}{2b}$.
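A sketch of how $D(x)$ and the temperature could be computed from per-example gradients, reusing the toy least-squares problem from above (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 10
A = rng.normal(size=(N, d))
y = A @ rng.normal(size=d) + 0.1 * rng.normal(size=N)
x = rng.normal(size=d)

# Per-example gradients of f_k(x) = 0.5*(a_k.x - y_k)^2, stacked as rows of G.
G = A * (A @ x - y)[:, None]                      # shape (N, d)
g = G.mean(axis=0)                                # full gradient, grad f(x)

D = G.T @ G / N - np.outer(g, g)                  # D(x) = (1/N) sum_k g_k g_k^T - g g^T
eta, b = 0.1, 32
beta_inv = eta / (2 * b)                          # temperature beta^{-1} = eta / (2b)

# The mini-batch gradient (with replacement) has covariance approximately D(x)/b.
print(np.trace(D) / b, beta_inv)
```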

9 A continuous-time view of SGD. The continuous-time limit of the discrete-time updates is the stochastic differential equation $dx = -\nabla f(x)\,dt + \sqrt{2\beta^{-1}\,D(x)}\;dW(t)$; we will assume $x \in \mathbb{R}^d$. The Fokker-Planck (FP) equation gives the distribution $\rho(t)$ on the weight space induced by SGD, $x(t) \sim \rho(t)$: $\rho_t = \operatorname{div}\big(\nabla f\,\rho + \beta^{-1}\,\nabla\cdot(D\,\rho)\big)$, with the first term acting as drift and the second as diffusion.
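A minimal Euler-Maruyama discretization of this SDE in one dimension with a constant diffusion coefficient; the loss and constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_f(x):
    # Illustrative 1D double-well loss f(x) = (x^2 - 1)^2, gradient 4x(x^2 - 1).
    return 4 * x * (x**2 - 1)

eta, b, D = 0.05, 32, 1.0                          # learning rate, batch size, constant diffusion
beta_inv = eta / (2 * b)                           # temperature beta^{-1} = eta/(2b)
dt = 1e-3

x = 1.5
for _ in range(100_000):
    dW = np.sqrt(dt) * rng.normal()
    x += -grad_f(x) * dt + np.sqrt(2 * beta_inv * D) * dW   # Euler-Maruyama step
print(x)                                           # hovers near the minimum at x = 1
```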

10 Wasserstein gradient flow. The heat equation $\rho_t = \operatorname{div}(\nabla\rho)$ performs steepest descent on the Dirichlet energy $\int \|\nabla\rho(x)\|^2\,dx$. It is also the steepest descent, in the Wasserstein metric $W_2$, of the negative entropy $-H(\rho) = \int \rho\log\rho\;dx$: the iterates $\rho_{k+1} \in \operatorname{argmin}_\rho \big\{ -H(\rho) + \tfrac{1}{2\tau}\,W_2^2(\rho, \rho_k) \big\}$ converge to trajectories of the heat equation. The negative entropy is a Lyapunov functional for Brownian motion: $\rho_{ss}^{\text{heat}} = \operatorname{argmin}_\rho\, -H(\rho)$.
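As a sanity check, a small finite-difference sketch of the 1D heat equation on a periodic grid, verifying that both functionals decrease along the flow (grid, step sizes, and initial density are made up):

```python
import numpy as np

n = 256
dx, dt = 1.0 / n, 1e-6                             # dt/dx^2 ≈ 0.066 keeps explicit Euler stable
x = np.linspace(0.0, 1.0, n, endpoint=False)
rho = 1.0 + 0.9 * np.sin(2 * np.pi * x)            # a positive density on the circle
rho /= rho.sum() * dx

def dirichlet(r):   return np.sum(((np.roll(r, -1) - r) / dx) ** 2) * dx
def neg_entropy(r): return np.sum(r * np.log(r)) * dx

e0, h0 = dirichlet(rho), neg_entropy(rho)
for _ in range(5000):
    lap = (np.roll(rho, -1) - 2 * rho + np.roll(rho, 1)) / dx**2
    rho = rho + dt * lap                           # one explicit heat-equation step
print(dirichlet(rho) < e0, neg_entropy(rho) < h0)  # both True: both functionals decrease
```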

11 Wasserstein gradient flow: with drift. If $D = I$, the Fokker-Planck equation $\rho_t = \operatorname{div}\big(\nabla f\,\rho + \beta^{-1}\,\nabla\rho\big)$ has the Jordan-Kinderlehrer-Otto (JKO) functional [Jordan et al., '97] $F(\rho) = \mathbb{E}_{x\sim\rho}\big[f(x)\big] - \beta^{-1}\,H(\rho)$ (an energetic term minus an entropic term, with $H(\rho) = -\int\rho\log\rho\,dx$ the entropy) as its Lyapunov functional, and $\rho_{ss} = \operatorname{argmin}_\rho F(\rho)$. FP is the steepest descent on the JKO functional in the Wasserstein metric.

12 What happens for non-isotropic noise? $\rho_t = \operatorname{div}\big(\nabla f\,\rho + \beta^{-1}\,\nabla\cdot(D\,\rho)\big)$ (drift + diffusion). FP still monotonically minimizes a free energy, with $\rho_{ss} = \operatorname{argmin}_\rho \big\{ \mathbb{E}_{x\sim\rho}[\Phi(x)] - \beta^{-1}\,H(\rho) \big\}$ for a potential $\Phi$ defined below. Rewrite this free energy as $F(\rho) = \beta^{-1}\,\operatorname{KL}(\rho\,\|\,\rho_{ss})$; compare with the decrease of $\|x - x^*\|^2$ in deterministic optimization.

13 SGD performs variational inference. Theorem [Chaudhari & Soatto, ICLR '18]: the functional $F(\rho) = \beta^{-1}\,\operatorname{KL}(\rho\,\|\,\rho_{ss})$ is minimized monotonically by trajectories of the Fokker-Planck equation $\rho_t = \operatorname{div}\big(\nabla f\,\rho + \beta^{-1}\,\nabla\cdot(D\,\rho)\big)$, with $\rho_{ss}$ as the steady-state distribution. Moreover, $\Phi = -\beta^{-1}\log\rho_{ss}$ up to a constant.
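A numerical sanity check of the steady state in the simplest setting, isotropic noise ($D = 1$) and a quadratic loss, where the Gibbs density $\rho_{ss} \propto e^{-f(x)/\beta^{-1}}$ is a Gaussian with variance $\beta^{-1}$ (all constants illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# f(x) = x^2/2, D = 1: dx = -x dt + sqrt(2 beta_inv) dW has steady state
# rho_ss(x) ∝ exp(-f(x)/beta_inv), a Gaussian with mean 0 and variance beta_inv.
beta_inv, dt = 0.05, 1e-3
x, samples = 0.0, []
for k in range(400_000):
    x += -x * dt + np.sqrt(2 * beta_inv * dt) * rng.normal()
    if k > 100_000:                                # discard burn-in
        samples.append(x)

print(np.var(samples), beta_inv)                   # the two numbers should roughly agree
```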

14 Some implications. The learning rate should scale linearly with the batch size, because the temperature $\beta^{-1} = \frac{\eta}{2b}$ should not be small. Sampling with replacement regularizes better than without: $\beta^{-1}_{\text{w/o replacement}} = \frac{\eta}{2b}\big(1 - \frac{b}{N}\big)$ is a lower temperature, and sampling with replacement also generalizes better in practice.
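The finite-population correction can be checked empirically by comparing the variance of mini-batch gradients sampled with and without replacement (toy data as before; the ratio should come out close to $1 - b/N$):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, b = 1000, 10, 200
A = rng.normal(size=(N, d))
y = A @ rng.normal(size=d)
x = rng.normal(size=d)
G = A * (A @ x - y)[:, None]                       # per-example gradients, rows of G

def minibatch_grad(replace):
    idx = rng.choice(N, size=b, replace=replace)
    return G[idx].mean(axis=0)

var_with = np.var([minibatch_grad(True) for _ in range(5000)], axis=0).sum()
var_wo = np.var([minibatch_grad(False) for _ in range(5000)], axis=0).sum()
print(var_wo / var_with)                           # ≈ 1 - b/N = 0.8: lower noise, lower temperature
```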

15 Information Bottleneck Principle. Minimize the mutual information of the representation with the training data [Tishby '99, Achille & Soatto '17]: $\operatorname{IB}(\rho) = \mathbb{E}_{x\sim\rho}\big[f(x)\big] + \beta^{-1}\,\operatorname{KL}\big(\rho\,\|\,\rho_{\text{prior}}\big)$. Minimizing such functionals directly is hard; SGD does it naturally.

16 Potential $\Phi$ vs. the original loss $f$. The solution of the variational problem is $\rho_{ss}(x) = \frac{1}{Z}\,e^{-\beta\,\Phi(x)}$. Key point: in general $\rho_{ss}(x) \neq \frac{1}{Z'}\,e^{-\beta f(x)}$, so the most likely locations of SGD are not the critical points of the original loss. The two losses are equal if and only if the noise is isotropic: $D(x) = I \iff \Phi(x) = f(x)$.

17 Deep networks have highly non-isotropic noise. CIFAR-10: $\lambda(D) = 0.27 \pm 0.84$, $\operatorname{rank}(D) = 0.34\%$. CIFAR-100: $\lambda(D) = 0.98 \pm 2.16$, $\operatorname{rank}(D) = 0.47\%$. Evaluate neural architectures using the diffusion matrix.
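A sketch of how such statistics (mean and standard deviation of the eigenvalues of $D$, and its rank as a fraction of the number of parameters) could be computed from a matrix of per-example gradients; the random matrix below stands in for real gradients and is purely illustrative:

```python
import numpy as np

def diffusion_stats(G):
    """G: per-example gradients stacked as rows, shape (N, d)."""
    g = G.mean(axis=0)
    D = G.T @ G / len(G) - np.outer(g, g)
    lam = np.linalg.eigvalsh(D)
    rank_frac = np.linalg.matrix_rank(D) / D.shape[0]
    return lam.mean(), lam.std(), rank_frac

rng = np.random.default_rng(0)
G = rng.normal(size=(100, 1000))                   # N = 100 examples, d = 1000 parameters
mean_lam, std_lam, rank_frac = diffusion_stats(G)
print(mean_lam, std_lam, rank_frac)                # rank(D)/d ≤ N/d here, i.e. D is very low-rank
```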

18 How different are cats and dogs, really?

19 SGD converges to limit cycles. Theorem [Chaudhari & Soatto, ICLR '18]: the most likely trajectories of SGD are $\dot{x} = j(x)$, where the "leftover" vector field $j(x) = -\nabla f(x) + D(x)\,\nabla\Phi(x) - \beta^{-1}\,\nabla\cdot D(x)$ is such that $\operatorname{div} j(x) = 0$.
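To see why a divergence-free drift produces cycles rather than convergence, here is a tiny 2D integration of $\dot{x} = j(x)$ for a hand-picked rotational field (this example is mine, not from the talk):

```python
import numpy as np

# j(x) = A x with A antisymmetric: div j = trace(A) = 0, so the flow preserves volume
# and the trajectories of dx/dt = j(x) are closed circles, not descent to a critical point.
A = np.array([[0.0, -1.0],
              [1.0,  0.0]])

x, dt = np.array([1.0, 0.0]), 1e-3
radii = []
for _ in range(int(2 * np.pi / dt)):               # integrate roughly one full period
    x = x + dt * (A @ x)                           # forward Euler on dx/dt = j(x)
    radii.append(np.linalg.norm(x))

print(radii[0], radii[-1])                         # the radius stays (almost) constant: a closed orbit
```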

20 Trajectories of SGD. Run SGD for $\sim 10^5$ epochs and compute the FFT of the weight increments $x^i_{k+1} - x^i_k$. [Plot of the resulting spectrum.]
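A sketch of such a diagnostic: record one weight over training, take the FFT of its step-to-step increments, and look for spectral peaks. The synthetic trajectory below just injects a slow oscillation into noise to show what a peak would look like:

```python
import numpy as np

def increment_spectrum(traj):
    """FFT magnitude of the increments x_{k+1} - x_k of one recorded weight."""
    inc = np.diff(np.asarray(traj))
    spec = np.abs(np.fft.rfft(inc - inc.mean()))
    freqs = np.fft.rfftfreq(inc.size, d=1.0)       # frequency in cycles per SGD step
    return freqs, spec

rng = np.random.default_rng(0)
k = np.arange(50_000)
# Synthetic weight trajectory: increments = slow oscillation (period 500 steps) + noise.
traj = np.cumsum(0.01 * np.sin(2 * np.pi * k / 500) + 0.05 * rng.normal(size=k.size))

freqs, spec = increment_spectrum(traj)
print(freqs[spec.argmax()])                        # ≈ 1/500, the injected oscillation frequency
```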

21 An example. [Force-field plot of a two-dimensional system: a saddle point where $\nabla\Phi(x) = 0$, a region where $j(x) = 0$, a region where $j(x)$ is small, and a region where $j(x)$ is very large.]

22 Most likely locations are not the critical points of the original loss. Theorem [Chaudhari & Soatto, ICLR '18]: the Itô SDE $dx = -\nabla f\,dt + \sqrt{2\beta^{-1} D}\;dW(t)$ is equivalent to an A-type SDE $dx = -\big(D + Q\big)\nabla\Phi\;dt + \sqrt{2\beta^{-1} D}\;dW(t)$ with the same steady state $\rho_{ss} \propto e^{-\beta\,\Phi(x)}$, if $\nabla f = (D + Q)\,\nabla\Phi - \beta^{-1}\,\nabla\cdot(D + Q)$.

23 Knots in our understanding: ARCHITECTURE, OPTIMIZATION, GENERALIZATION.

24 Punchline. Is SGD special?

25 arXiv / ICLR '18: "Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks," Pratik Chaudhari and Stefano Soatto. Thank you, questions?
