A picture of the energy landscape of deep neural networks


1 A picture of the energy landscape of deep neural networks. Pratik Chaudhari, December 15, 2017, UCLA VISION LAB

2 A deep network computes $\hat{y}(x; w) = \sigma(w^p\, \sigma(w^{p-1}(\ldots \sigma(w^1 x)) \ldots))$ and is trained by solving $w^* = \arg\min_w f(w)$, where $f(w) = -\sum_{(x,y) \in D} \sum_{i=1}^{k} 1_{\{y = i\}} \log \hat{y}_i(x; w)$. What is the shape of $f(w)$? How should it inform the optimization? (S)GD: $w^{k+1} = w^k - \eta\, \nabla f_b(w^k)$. Many, many variants: AdaGrad, SVRG, rmsprop, Adam, Eve, APPA, Catalyst, Natasha, Katyusha; variance reduction.
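The (S)GD update $w^{k+1} = w^k - \eta\, \nabla f_b(w^k)$ can be illustrated with a minimal sketch. It assumes a toy one-dimensional logistic model in place of a deep network; the data, the step size `eta`, the batch size, and every function name below are hypothetical choices made for illustration, not taken from the talk.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(w, data):
    # Average negative log-likelihood of the model y_hat = sigmoid(w[0] + w[1] * x).
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = sigmoid(w[0] + w[1] * x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(data)

def minibatch_grad(w, batch):
    # Gradient of the cross-entropy loss averaged over one mini-batch.
    g0 = g1 = 0.0
    for x, y in batch:
        err = sigmoid(w[0] + w[1] * x) - y
        g0 += err
        g1 += err * x
    n = len(batch)
    return [g0 / n, g1 / n]

def sgd(data, eta=0.5, batch_size=8, steps=500, seed=0):
    # Plain mini-batch SGD: w <- w - eta * grad f_b(w).
    rng = random.Random(seed)
    w = [0.0, 0.0]
    for _ in range(steps):
        batch = rng.sample(data, batch_size)
        g = minibatch_grad(w, batch)
        w = [w[0] - eta * g[0], w[1] - eta * g[1]]
    return w

# Hypothetical toy data: label 1 when x > 0, else 0.
rng = random.Random(1)
data = [(x, 1 if x > 0 else 0) for x in (rng.uniform(-2, 2) for _ in range(200))]

w_final = sgd(data)
loss_before = cross_entropy([0.0, 0.0], data)
loss_after = cross_entropy(w_final, data)
```

Comparing `loss_before` with `loss_after` shows the effect of the updates on this toy problem; the variants listed on the slide change how the step is computed, not this overall loop.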

3 What do we know about the energy landscape? Saxe et al., 14: orthogonal initializations; Ge et al., 15, Sun et al., 15, 16: can escape strict saddle points; Lee et al., 16: mere GD always finds local minima; SGD; Schmidt et al., 13, Defazio et al., 14; Haeffele & Vidal, 15; linear networks; variance reduction; matrix/tensor factorization; Duvenaud et al., 14: deep GPs; Soudry & Carmon, 16: piecewise linear; Baldi & Hornik, 89: PCA; Duchi et al., 10: AdaGrad; multiple, equivalent minima; empirical results; Dauphin et al., 14; all local minima are close to global minimum; Bray & Dean, 07, Fyodorov & Williams, 07, Choromanska et al., 15, Chaudhari & Soatto, 15: Gaussian random fields, spin glasses; statistical physics; binary perceptron: Baldassi et al., 15; saddle points slow down SGD; many descending directions at high energy; our work; Hardt et al., 15, Goodfellow & Vinyals, 15; generalization; can easily get zero training error, yet generalize poorly. Paradox between a benign energy landscape and delicate training algorithms.


4 Motivation from the Hessian. [Figure: two histograms of Hessian eigenvalues, frequency versus eigenvalue; annotation: short negative tail]

5 How do we exploit this? Magnify the energy landscape and smooth with a kernel: $w^* = \arg\min_w f(w) = \arg\max_w e^{-f(w)} \approx \arg\min_w -\log\left(G_\gamma * e^{-f(w)}\right)$. The Gaussian kernel $G_\gamma$ of variance $\gamma$ focuses on the neighborhood of $w$.

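6 A physics interpretation (Baldassi et al., 15, 16). Our new loss is local entropy: $f_\gamma(w) = -\log \int_{w'} \exp\left(-f(w') - \frac{1}{2\gamma} \|w - w'\|^2\right) dw'$. [Figure: a one-dimensional loss $f(x)$ and two local-entropy curves $F(x, \cdot)$; the candidate $\hat{x}_F$ is the new global minimum, $\hat{x}_f$ is the original global minimum]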
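The effect of local entropy on a one-dimensional landscape can be sketched numerically. The example below assumes the form $f_\gamma(x) = -\log \int \exp(-f(x') - \frac{1}{2\gamma}(x - x')^2)\, dx'$ and a hypothetical loss with a sharp, deep well at $x = -1$ and a wide, slightly shallower well at $x = 1$; the loss, the value of `gamma`, and the integration grid are illustrative assumptions, not the figure from the slide.

```python
import math

def f(x):
    # Hypothetical loss: a sharp deep well at x = -1 and a wide shallower well at x = 1.
    sharp = 200.0 * (x + 1.0) ** 2 - 1.0
    wide = (x - 1.0) ** 2 - 0.8
    return min(sharp, wide)

def local_entropy(x, gamma, grid, dx):
    # Riemann-sum approximation of -log of the Gaussian-weighted integral of exp(-f).
    total = sum(math.exp(-f(xp) - (x - xp) ** 2 / (2.0 * gamma)) for xp in grid) * dx
    return -math.log(total)

dx = 0.005
grid = [-4.0 + i * dx for i in range(1601)]   # integration grid on [-4, 4]
coarse = grid[::10]                           # evaluation points for the smoothed loss
gamma = 0.5

x_min_f = min(grid, key=f)
x_min_smooth = min(coarse, key=lambda x: local_entropy(x, gamma, grid, dx))
```

In this toy setting `x_min_f` sits in the sharp well while `x_min_smooth` sits in the wide one, which is the behaviour the slide describes as a new global minimum.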

7 Minimizing local entropy. Solve $w^* = \arg\min_w f_\gamma(w)$. Gradient of local entropy: $\nabla f_\gamma(w) = \frac{1}{\gamma}\left(w - \langle w' \rangle\right)$, where $\langle w' \rangle = Z(w, \gamma)^{-1} \int_{w'} w' \exp\left(-f(w') - \frac{1}{2\gamma} \|w - w'\|^2\right) dw'$ denotes an expectation over a local Gibbs distribution. Estimate the gradient using MCMC; can be applied to general deep networks.
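The MCMC estimate of the local-entropy gradient $\frac{1}{\gamma}(w - \langle w' \rangle)$ can be sketched as follows. This is a minimal illustration under stated assumptions: it uses unadjusted Langevin dynamics with a plain running average (not necessarily the sampler used in Entropy-SGD), a toy loss $f(w) = \frac{c}{2} w^2$ for which the exact answer $\frac{c\, w}{1 + c \gamma}$ is known, and hypothetical step sizes and function names.

```python
import math
import random

def grad_f(w, c=1.0):
    # Toy loss f(w) = (c / 2) * w^2, chosen so the exact gradient is known.
    return c * w

def local_entropy_grad(w, gamma, rng, steps=2000, burn_in=200, step=0.05):
    # Langevin sampling of w' from the local Gibbs distribution
    # proportional to exp(-f(w') - (w - w')^2 / (2 * gamma)).
    wp = w
    total, count = 0.0, 0
    for k in range(steps):
        drift = grad_f(wp) + (wp - w) / gamma
        wp = wp - step * drift + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
        if k >= burn_in:
            total += wp
            count += 1
    mean_wp = total / count
    return (w - mean_wp) / gamma

rng = random.Random(0)
gamma = 1.0

# Compare the MCMC estimate with the closed form c * w / (1 + c * gamma) at w = 2.
g_est = local_entropy_grad(2.0, gamma, rng, steps=20000, burn_in=1000)
g_exact = 1.0 * 2.0 / (1.0 + 1.0 * gamma)

# Outer loop: gradient descent on the local entropy using the estimated gradient.
w = 2.0
for _ in range(20):
    w -= 0.5 * local_entropy_grad(w, gamma, rng)
```

The outer loop only needs gradients of the original loss inside the sampler, which is why the approach carries over to general deep networks.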


8 Medium-scale CNN: All-CNN-BN on CIFAR-10. Do not see much plateauing of training or validation loss. [Figure: cross-entropy loss and % error versus epochs × L for SGD and Entropy-SGD; x-axis ∝ wall-clock time]

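9 A PDE interpretation. Local entropy is the solution of a Hamilton-Jacobi equation, $u_t = -\frac{1}{2} |\nabla u|^2 + \frac{1}{2} \Delta u$ with $u(w, 0) = f(w)$ (original loss as the initial condition), so that $f_\gamma(w) = u(w, \gamma)$. Stochastic control interpretation: $dw = -\alpha(s)\, ds + dB(s)$ for $t \le s \le T$, with $w(t) = w$. The cost is $C(w(\cdot), \alpha(\cdot)) = \mathbb{E}\left[ f(w(T)) + \frac{1}{2} \int_t^T \|\alpha(s)\|^2\, ds \right]$, where the integral is a quadratic penalty for greedy gradient descent. The value function is $u(w, t) = \min_{\alpha(\cdot)} C(w(\cdot), \alpha(\cdot))$ and the optimal control is $\alpha(w, t) = \nabla u(w, t)$.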
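The link between local entropy and the Hamilton-Jacobi equation can be checked with a short calculation. The sketch below assumes unit temperature, writes $G_t$ for the Gaussian kernel of variance $t$, and introduces an auxiliary function $v$; that symbol is notation chosen here for illustration and does not appear on the slide.

```latex
\begin{align*}
v(w, t) &= \left(G_t * e^{-f}\right)(w), & v_t &= \tfrac{1}{2} \Delta v, \quad v(w, 0) = e^{-f(w)}, \\
u &= -\log v, & \nabla u &= -\frac{\nabla v}{v}, \quad \Delta u = -\frac{\Delta v}{v} + |\nabla u|^2, \\
u_t &= -\frac{v_t}{v} = -\frac{1}{2} \frac{\Delta v}{v} & &= -\tfrac{1}{2} |\nabla u|^2 + \tfrac{1}{2} \Delta u .
\end{align*}
```

Evaluating at $t = \gamma$ gives $u(w, \gamma) = -\log\left(G_\gamma * e^{-f}\right)(w)$, which matches the local entropy up to an additive constant that does not depend on $w$.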

10 New PDEs. Use the non-viscous HJ equation $u_t = -\frac{1}{2} |\nabla u|^2$ (the viscosity term $\Delta u$ is sent to 0). The Hopf-Lax formula gives the solution $u(w, t) = \inf_{w'} \left\{ f(w') + \frac{1}{2t} \|w - w'\|^2 \right\}$ (inf-convolution or Moreau envelope). Simple formula for the gradient (proximal point iteration): $\nabla u(w, t) = p^*$, where $p^* = \nabla f(w - t p^*)$. This has a few magical properties.
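The implicit gradient formula $p^* = \nabla f(w - t p^*)$ can be solved by fixed-point iteration when $t$ times the Lipschitz constant of $\nabla f$ is below one. The sketch below assumes a hypothetical toy loss $f(w) = \log\cosh(w)$ and compares the fixed point with a brute-force evaluation of the Moreau envelope on a grid; all names and parameter values are illustrative.

```python
import math

def grad_f(w):
    # Toy loss f(w) = log(cosh(w)), whose gradient tanh(w) is 1-Lipschitz.
    return math.tanh(w)

def envelope_grad_fixed_point(w, t, iters=200):
    # Iterate p <- grad_f(w - t * p); a contraction when t * Lipschitz(grad_f) < 1.
    p = grad_f(w)
    for _ in range(iters):
        p = grad_f(w - t * p)
    return p

def envelope_grad_brute_force(w, t, half_width=3.0, dx=1e-3):
    # Minimize f(w') + (w - w')^2 / (2 * t) on a grid to find the proximal point,
    # then use the identity grad u(w, t) = (w - w_prox) / t.
    n = int(2 * half_width / dx) + 1
    candidates = (w - half_width + i * dx for i in range(n))
    w_prox = min(
        candidates,
        key=lambda wp: math.log(math.cosh(wp)) + (w - wp) ** 2 / (2 * t),
    )
    return (w - w_prox) / t

w, t = 1.5, 0.5
p_fixed = envelope_grad_fixed_point(w, t)
p_brute = envelope_grad_brute_force(w, t)
```

The two values agree up to the grid resolution, which illustrates why the gradient of the smoothed loss is available from gradients of the original loss alone.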

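11 Smoothing using PDEs. [Figure: the loss $f(x)$ with the smoothed solutions $u_{\text{viscous HJ}}(x, T)$ and $u_{\text{non-viscous HJ}}(x, T)$; also shown are the initial density (SGD gets stuck), the final density of viscous HJ, and the final density of non-viscous HJ]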

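12 Distributed training algorithms. Elastic-SGD (Zhang et al., 15): $\arg\min_{w, w'^1, \ldots, w'^p} \sum_{k=1}^{p} f(w'^k) + \frac{1}{2\gamma} \|w'^k - w\|^2$. A continuous-time view of local entropy: $dw = -\frac{1}{\gamma}(w - w')\, ds$ and, for the fast variable, $dw' = -\frac{1}{\epsilon}\left[ \nabla f(w') + \frac{1}{\gamma}(w' - w) \right] ds + \frac{1}{\sqrt{\epsilon}}\, dB(s)$. If $w'(s)$ is very fast, $w(s)$ only sees its average. Homogenized dynamics as $\epsilon \to 0$: $dw = -\frac{1}{\gamma}\left(w - \langle w' \rangle\right) ds$, where the average is over $\rho^\infty(w'; w) \propto \exp\left(-f(w') - \frac{1}{2\gamma} \|w' - w\|^2\right)$, so that $\nabla f_\gamma(w) = \frac{1}{\gamma}\left(w - \langle w' \rangle\right)$.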
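The Elastic-SGD objective couples several replicas to a shared reference variable through a quadratic penalty. The sketch below is a minimal single-process illustration, assuming a toy loss $f(w) = \frac{1}{2}(w - 1)^2$ with additive gradient noise and a coupling of $\frac{1}{2\gamma}(w'^k - w)^2$ per replica; the number of replicas, step size, and function names are hypothetical, and no actual distribution across workers is modelled.

```python
import random

def noisy_grad(w, rng, noise=0.5):
    # Stochastic gradient of the toy loss f(w) = 0.5 * (w - 1)^2.
    return (w - 1.0) + noise * rng.gauss(0.0, 1.0)

def elastic_sgd(p=4, gamma=1.0, eta=0.1, steps=500, seed=0):
    rng = random.Random(seed)
    w = -2.0
    replicas = [rng.uniform(-3.0, 3.0) for _ in range(p)]
    for _ in range(steps):
        # Each replica descends its own loss plus the elastic coupling to w.
        new_replicas = [
            wk - eta * (noisy_grad(wk, rng) + (wk - w) / gamma) for wk in replicas
        ]
        # The reference variable descends the sum of the coupling terms.
        w = w - eta * sum((w - wk) / gamma for wk in replicas)
        replicas = new_replicas
    return w, replicas

w_ref, reps = elastic_sgd()
```

Because the reference variable only sees the replicas through their average pull, its noise is smaller than that of any single replica, which mirrors the averaging argument in the continuous-time view.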

13 Wide-ResNet on CIFAR-10 /

14 WTH is implicit regularization? Many, many variants: AdaGrad, SVRG, SAG, rmsprop, Adam, Eve, APPA, Catalyst, Natasha, Katyusha. Why is SGD so special? Stochastic differential equation: $dw = -\nabla f(w)\, dt + \sqrt{2 \beta^{-1} D(w)}\, dB(t)$, with $\beta^{-1} = \frac{\eta}{2b}$ (learning rate $\eta$, batch size $b$). Fokker-Planck equation and optimal transportation: $\rho^{ss} = \arg\min_\rho \int \Phi(w)\, \rho(w)\, dw + \beta^{-1} \int \rho \log \rho\, dw$. Information bottleneck, Bayesian inference, large batch-sizes, sampling techniques, hyper-parameter choices, neural architecture search, ...
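The role of the temperature $\beta^{-1} = \frac{\eta}{2b}$ in the SDE can be illustrated by simulation. The sketch below uses an Euler-Maruyama discretization with a toy loss $f(w) = \frac{1}{2} w^2$ and, as a simplifying assumption, isotropic noise $D(w) = 1$; in that equilibrium case the stationary density is proportional to $e^{-\beta f(w)}$, so the stationary variance should be close to $\beta^{-1}$. The values of `eta`, `batch`, and `dt` are hypothetical.

```python
import math
import random

def simulate_sgd_sde(eta=0.1, batch=2, dt=0.01, steps=200000, burn_in=20000, seed=0):
    # Euler-Maruyama for dw = -grad f(w) dt + sqrt(2 * beta_inv * D) dB(t),
    # with the toy loss f(w) = w^2 / 2 and the ASSUMPTION D = 1 (isotropic noise).
    rng = random.Random(seed)
    beta_inv = eta / (2.0 * batch)
    w, samples = 1.0, []
    for k in range(steps):
        w += -w * dt + math.sqrt(2.0 * beta_inv * dt) * rng.gauss(0.0, 1.0)
        if k >= burn_in:
            samples.append(w)
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    return beta_inv, var

beta_inv, var = simulate_sgd_sde()
```

Raising the learning rate or shrinking the batch raises `beta_inv` and widens the stationary distribution; with non-isotropic $D(w)$ this simple picture no longer holds.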

15 SGD does not minimize $f(w)$, what is $\Phi(w)$? Deep networks induce highly non-isotropic noise: on CIFAR-10, $\lambda(D) = 0.27 \pm 0.84$ and $\mathrm{rank}(D) = 0.34\%$. This leads to deterministic, Hamiltonian dynamics in SGD, $\dot{x} = j(x)$ with $\mathrm{div}\, j(x) = 0$: most likely trajectories of SGD are closed loops. Deep networks have an out-of-equilibrium distribution: $\rho^{ss}(w) \not\propto e^{-\beta f(w)}$, but $\rho^{ss}(w) \propto e^{-\beta \Phi(w)}$. [Figure label: saddle-point]

16 Summary. Techniques from control and physics are interpretable, also lead to state-of-the-art algorithms: PDEs, stochastic control, stability of limit cycles, Fokker-Planck equations, continuous-time analysis. Control has powerful tools to make inroads into understanding and improving deep networks: input-output stability, reinforcement learning & optimal control. Deep learning is powerful, and quite easy to get into; even the fundamentals are unknown and debated upon.

17 Joint work with Stefano Soatto; Guillaume Carlier, Adam Oberman, Stanley Osher; Yann LeCun, Anna Choromanska, Ameet Talwalkar; Carlo Baldassi, Christian Borgs, Jennifer Chayes, Riccardo Zecchina. Thank you!
