Energy Landscapes of Deep Networks

Size: px

Start display at page:

Download "Energy Landscapes of Deep Networks"

Moses Berry
6 years ago
Views:

Energy Landscapes of Deep Networks Pratik Chaudhari February 29, 2016 References Auffinger, A., Arous, G. B., & Černý, J. (2013). Random matrices and complexity of spin glasses. arxiv:1003.1129.

1 Energy Landscapes of Deep Networks Pratik Chaudhari February 29, 2016 References Auffinger, A., Arous, G. B., & Černý, J. (2013). Random matrices and complexity of spin glasses. arxiv: # Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., & LeCun, Y. (2014). The Loss Surface of Multilayer Networks. arxiv: # Fyodorov, Yan V. "High-dimensional random fields and random matrix theory." arxiv: (2013).# Chaudhari Pratik & Soatto Stefano. The Effect of Gradient Noise on the Energy Landscape of Deep Networks. arxiv: (2015). 1

2 Today Introduction to spin glasses! Write a deep network as a spin glass! Analyze local minima of such networks! Modify the energy landscape of deep networks 2

3 An introduction to spin glasses Ising model Model from statistical mechanics to explain magnetism Hamiltonian H(s) =Â (i,j) J s i s j s i 2 { 1, 1} upward / downward spins sum over all neighbors coupling strength, also called disorder p n p n dipoles on a lattice Hamiltonian gives the energy of a configuration# Thermodynamically, the model likes to find the ground state P(configuration s) = 1 Z e bh(s) Boltzmann distribution Boltzmann s constant inverse temperature b = 1 k B T temperature normalizing constant, also called partition function # very hard to compute Z = Â e bh(s) s2{ 1,1} n 3

# Beyond a temperature threshold, spins no# longer aligned together# Great model for analysis of graphical models Would also be

4 Ising model Hamiltonian H(s) =Â (i,j) J s i s j s i 2 { 1, 1} upward / downward spins Why interesting? sum over all neighbors coupling strength, also called disorder Behaves like a magnet!# Beyond a temperature threshold, spins no# longer aligned together# Great model for analysis of graphical models Would also be a good model for deep nets! But very hard to solve for large correlations at # low temperature complete disorder at# high temperature Z = Â s2{ 1,1} {z } n 2 n terms e bh(s) exponentially many terms 4

5 Enter spin glasses 2-spin glass H(s) = Â (i 1,i 2 applen) J i1,i 2 s i1 s i2 Hamiltonian is also a random variable We will need a p-spin glass note the sum over all other spins is now a random variable with# exponential tails# usually Gaussian# couples all possible pairs of spins helps use a lot of results from# probability theory# much more known about spin-glasses# than Ising / Potts models# In particular, their energy landscape H p (s) = Â 1applei 1,...,i p applen J i1,...,i p s i1 s i2...s ip Why call them glasses? Long range coupling, not just neighbors# Glass is really just a liquid that flows very slowly Crystals: cooling a liquid very quickly, short range coupling# Glasses have long range coupling 5

6 A model for deep networks sparse, random connections, threshold nonlinearities scalar output h: hidden units n neurons on each layer p layers X: input Loss function (zero-one loss) is a spin glass Hamiltonian: H(s) law J n (p 1)/2 n Â i 1,...,i p =1 J i1,...,i p s i1... s ip Active paths in the network# Unique weights in the network zero-mean standard Gaussian 6

7 Random sparse deep network p total layers p = log d n = 1 a Y = W > p s(w p 1 s(w p 2...s(W 1 X))...)) output X 2 R n class-label Y 2 R thresholds weights W i 2 [ 1, 1] n n decision mechanism W p 2 R n SVMs, soft-max etc.! d sparse as well Choromanska et. al. use a non-sparse# model with ReLU nonlinearities feature vector Some assumptions on weights on an average, d non-zeros d = n a ; a < 1/5 weights are not all close to zero E W k,ij var W k,ij = 0 n 2/5 7

8 Random sparse deep network Can write it as class-label Y = n Â i=1 sum over the output neurons# dimensions of X but introduces weird couplings / disorder z } { Â X i g2g i i th p k=1 w k g! element dimension of X sum over all active paths in the network these are our spins! product of all the# weights along pathg G i set of active paths from# output neuron to Y i th 8

9 Deep network as a spin glass Two ways to remove correlations Assume independence of paths Choromanska et. al. X i1! X i1,...,i p We shall explicitly account for them Y = n Â i,i 1,...,i p =1 first edge, i.e., between# layer 1 and layer 2 g i,i1 X i W i,i1 {z } :=Z i,i1 net coupling of a path due to# the sum over all outputs is small g i1,...,i p W i1,...,i p path from layer 2 to Y Â i Z i apple n 1/5 9

10 Model for deep networks For a 0-1 / logistic loss function, can show: H(s) law number of layers J n (p 1)/2 n Â i 1,...,i p =1 number of neurons on each layer J i1,...,i p s i1... s ip Because of sparsity, can# estimate the probability# that a neuron of any layer# is active or not estimate the number of active# paths in the sparse network# zero-mean standard Gaussian rewrite in terms of unique weights Q(n) (for analysis) assume spherical constraint s 2 S n 1 ( p n) Key Idea - Can now analyze a generic neural network 10

11 Number of critical points All critical points: crt(b) = {s : rh(s) =0, H(s) 2 nb} any interval of real line gradient is zero energy lies in interval B Critical points of index k: crt k (B) = n s : rh(s) =0, H(s) 2 nb, i r 2 H(s) o = k counts saddle points in interval B k eigenvalues of Hessian# have negative real part 11

12 Number of critical points crt(b) =all critical points crt k (B) =saddle points of index k lim n! 1 n log E crt k(u) =Q k (u), for all u 2 R, k 2 N lim n! 1 n log E crt(u) =Q(u), for all u 2 R, k 2 N exponentially many local minima and saddle points! u (, u) rate functions, look like this Energy barriers for p = Q(u) 0.00 k = 0 E 0 = E inf = k = 1 k = 4 k = H/n 12

13 Number of critical points crt(b) =all critical points crt k (B) =saddle points of index k Corollary lim n! 1 n log E crt k(r) = 1 2 log(p 2) p 3 p 1 lim n! 1 n log E crt(r) =1 2 log(p 2) exponentially many local minima as well as saddle points in total! 13

14 Location of critical points Theorem i.e., decreases exponentially lim sup 1 n! n 2 log P (9 a saddle point of index k above energy n(e e)) < 0 lim sup 1 n! n log P (9 a saddle point of index k above energy n(e k + e)) < 0 Local minima exist throughout# SGD might think low-index saddle points# are minima# SGD finds one of these jumping between local minima is hard # normalized width of energy bands# is small, can t check numerically [ ne 0, ne ] layered structure index apple k saddle points between [ ne 0, ne k ] energy bands have to climb the energy barrier to reach a saddle point, then climb down again fortunately, width of bands decreases with k 14

15 Experimental evaluation: 2-spin glass, simulated annealing same as number of neurons n Choromanska et. al. 15

16 Experimental evaluation: 1-layer MNIST network, SGD number of neurons n from Choromanska et. al. Choromanska et. al. 16

17 Experimental evaluation: 1-layer MNIST network, SGD variance increases with n? should become worse with more layers due to correlations Choromanska et. al. SGD finds critical points of low normalized index,# i.e., does not get stuck at saddle points# Of course, there are those exponentially many local minima 17

18 Trivialization of critical points Simple example H = 1 2 s> Js + h > s; s = n spherical constraint random matrix, each entry is Gaussian var(j ij )= 1 + d ij n random magnetic field, Gaussian var(h i )=n 2 Three distinct regimes O(n) asymptotically critical points for # n = O(n 1/2 ) Number of critical points decreases with magnetic field# Total trivialization, 2 critical points for n = O(n 1/6 ) 18

19 Trivialization of critical points n = O(n 1/2 ) n = O(n 1/6 ) Ecrt(R) 2n Ecrt(R) g = nn2 2 k = n1/3 n

20 Trivialization: Changing the number of critical points eh(s) = H(s)+Â i h i s i original Hamiltonian random magnetic field, zero-mean Gaussian Variance of magnetic field: n c = q p(p 2) Three distinct regimes Exponentially many critical points with small noise# Polynomially-many critical points near threshold# Total trivialization, 2 critical points n < n c n > n c + W(n 1 ) n n c O(n 1 ) Important regime Want to avoid this 20

21 Monte-Carlo simulation on a spin glass eh(s) = H(s)+Â i h i s i n = 100, p = 3 Local-minima separated by high energy barriers# Saddle points are the canyons above Almost constant Hamiltonian# All points land at the same local minimum 21

22 Monte-Carlo simulation on a spin glass eh(s) = H(s)+Â i h i s i Very few local minima# Much larger proportion of saddle points (for a slightly different model) can prove that in this regime, only low order saddle points and local minima persist 22

23 Distribution of local minima concentrates lim P min H(s) m n t n! s = exp 1 e c pt c p Subag, Zeitouni (2015) Probability density of Hamiltonian at local minima Gumbel density µ = 1.58 b = 0.03 Gumbel distribution H/n 23

24 15 10 Probability density of Hamiltonian at local minima Distribution of local minima concentrates lim P min H(s) m n t n! s Gumbel density µ = 1.58 b = 0.03 and all saddle points = exp 1 e c pt c p Subag, Zeitouni (2015) Gumbel distribution H/n k = 0 E 0 = E inf = k = 1 k = 4 k = 10 24

25 Perturbations of local minima Theorem DH n (0), 1 n H quad(y s ) H quad (0) quadratic approximation at a critical point quadratic approximation at a critical point n = O(1) lim n! DH n(0) =n ky s kapplen 1/4 Roughly, magnetic field does not perturb the local minima, only changes the Hamiltonian 25

26 Annealing the magnetic field eh(s) = H(s)+ p a 2 p n Â i s 2 i + Â i h i s i L2 regularization (same as before) random magnetic field Set a parameter: n = Jp t n Magnetic field for the polynomial regime is: 1 + t a2 n + 2t n 1/2 Helps transition between# Polynomial regime:# Exponential regime: t 1 t! But fix its direction, don t resample! Anneal magnetic field to zero as training progresses: t(k) = n a k N epoch number total number of epochs 26

27 Fully-connected deep network : MNIST mnistfc : input! linear 64 {z } 20 relu nonlinearities! linear 10! softmax 100 mnistfc: Training error 90 mnistfc: Validation error `2 `2, triv, anneal %error %error 40 5 `2 `2, triv, anneal Iteration Epoch Does not exactly match the theoretical model, so set p = 20, n = s #weights p 27

28 Fully-connected deep network : CIFAR-10 cifarfc : input! linear 128 {z } 20! linear 10! softmax relu nonlinearities error = k c Trivialization is a great way to avoid bad initialization of weights 100 cifarfc: Training error 90 cifarfc: Validation error `2 `2, triv, anneal %error 40 %error `2 `2, triv, anneal Iteration Epoch 28

Convolution neural network: CIFAR-10 cifarconv : input! conv 5 5 8 {z } 8! linear 128 {z } 8! linear 10!

29 Convolution neural network: CIFAR-10 cifarconv : input! conv {z } 8! linear 128 {z } 8! linear 10! softmax relu nonlinearities, max-pooling 80 cifarconv: Training error `2 `2, triv, anneal `2, anneal, resample cifarconv: Validation error `2 `2, triv, anneal `2, anneal, resample %error %error Iteration Epoch Exact same training parameters as previous experiment 29

Weights keep sloshing with resampling Much

triv, anneal `2, anneal, resample min g k 10 10

30 Convolution neural network: CIFAR-10 cifarconv : input! conv {z } 8! linear 128 {z } 8! linear 10! softmax relu nonlinearities, max-pooling Weights keep sloshing with resampling Much larger minimum gradient 10 0 cifarconv: Alignment of weights cifarconv: Minimum gradient Âi h i s i `2, triv, anneal `2, anneal, resample min g k `2 `2, triv, anneal `2, anneal, resample Iteration Iteration 30

31 Network-in-network architecture: CIFAR-10 cifarnin : input! conv {z } 3! max-pool 3 3! dropout {z } 2! conv ! conv ! conv ! mean-pool 8 8! softmax relu nonlinearities, batch-normalization, global average pooling cifarnin: Validation error `2 `2, triv, anneal `2, anneal, resample %error Epoch Exact same parameters! 31

32 Conclusions Spin glasses Good models for deep networks,! energy landscape has similar properties Gradient noise of right magnitude dramatically helps very-deep networks, fully-connected layers Theoretical goals Effect of regularization on the energy landscape! Devise specialized, non-local optimization algorithms 32

A picture of the energy landscape of! deep neural networks

A picture of the energy landscape of! deep neural networks Pratik Chaudhari December 15, 2017 UCLA VISION LAB 1 Dy (x; w) = (w p (w p 1 (... (w 1 x))...)) w = argmin w (x,y ) 2 D kx 1 {y =i } log Dy i