Implicit Optimization Bias as a Key to Understanding Deep Learning
Nati Srebro (TTIC)
Based on joint work with Behnam Neyshabur (TTIC/IAS), Ryota Tomioka (TTIC/MSR), Srinadh Bhojanapalli, Suriya Gunasekar, Blake Woodworth, Pedro Savarese (TTIC), Russ Salakhutdinov (CMU), Ashia Wilson, Becca Roelofs, Mitchell Stern, Ben Recht (Berkeley), Daniel Soudry, Elad Hoffer, Mor Shpigel (Technion), Jason Lee (USC)
Increasing the Network Size [Neyshabur Tomioka S ICLR 15]
[figure: test error (0 to 1) as the network size increases, with a proposed complexity measure (path norm?) overlaid]
What is the relevant complexity measure (e.g., a norm)? How is it minimized (or controlled) by the optimization algorithm? How does it change if we change the optimization algorithm?
Path-SGD vs. SGD
[figure: cross-entropy training loss, 0/1 training error, and 0/1 test error vs. epoch, on MNIST, CIFAR-10, CIFAR-100 (with dropout), and SVHN]
[Neyshabur Salakhutdinov S NIPS 15]
SGD vs. ADAM
[figure: training error (perplexity) and test error (perplexity) on Penn Treebank using a 3-layer LSTM]
[Wilson Roelofs Stern S Recht, The Marginal Value of Adaptive Gradient Methods in Machine Learning, NIPS 17]
The Deep Recurrent Residual Boosting Machine
Joe Flow, DeepFace Labs
Section 1: Introduction. We suggest a new amazing architecture and loss function that is great for learning. All you have to do to learn is fit the model on your training data.
Section 2: Learning. Contribution: our model class h_w is amazing. Our learning method is:
    arg min_w (1/m) Σ_{i=1}^m loss(h_w(x_i); y_i)    (*)
Section 3: Optimization. This is how we solve the optimization problem (*): [...]
Section 4: Experiments. It works!
Different optimization algorithm → different bias in optimum reached → different inductive bias → different learning properties
Goal: understand optimization algorithms not just as reaching some (global) optimum, but as reaching a specific optimum
Today: precisely understand the implicit bias in:
• Matrix Factorization
• Linear Classification (Logistic Regression)
• Linear Convolutional Networks
Matrix Reconstruction
    min_W F(W) = ||A(W) − y||²,   A(W)_i = ⟨A_i, W⟩,   A_1, …, A_m ∈ R^{n×n},   y ∈ R^m
• Matrix completion (A_i is an indicator matrix)
• Matrix reconstruction from linear measurements
• Multi-task learning (A_i = e_{task of example i} φ(example i)ᵀ)
[figure: a partially observed matrix and its completion]
We are interested in the regime m ≪ n².
• Many global optima for which A(W) = y
• Easy to have A(W) = y without reconstruction/generalization: e.g., for matrix completion, set all unobserved entries to 0
• Gradient descent on W will generally yield a trivial, non-generalizing solution
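The last point can be seen in a tiny sketch (a toy example of my own, not from the talk): for matrix completion, the gradient of F(W) is supported on the observed entries, so gradient descent on W initialized at 0 never moves the unobserved entries.

```python
import numpy as np

n = 5
W_true = np.ones((n, n))            # rank-1 ground truth (all ones)
mask = np.ones((n, n), dtype=bool)  # observe everything except two entries
mask[0, 1] = mask[1, 0] = False

# Gradient descent directly on W, initialized at 0.
W = np.zeros((n, n))
for _ in range(500):
    grad = 2 * mask * (W - W_true)  # gradient of the squared error on observed entries only
    W -= 0.1 * grad

# Observed entries are fit perfectly, but unobserved entries never move from 0:
print(W[0, 0], W[0, 1])  # ~1.0 and exactly 0.0
```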
Factorized Matrix Reconstruction
    W = UV,   min_{U,V ∈ R^{n×n}} f(U, V) = ||A(UV) − y||²
Since U, V are full-dimensional there is no constraint on W = UV, so this is equivalent to min_W F(W).
Underdetermined: all the same global minima, trivial to minimize without generalizing.
What happens when we optimize by gradient descent on (U, V)?
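A minimal sketch of the same toy completion problem, now optimized through the factorization W = UVᵀ from a small random initialization (problem size, stepsize, and iteration count are illustrative choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
W_true = np.ones((n, n))            # rank-1 ground truth (all ones)
mask = np.ones((n, n), dtype=bool)
mask[0, 1] = mask[1, 0] = False     # two unobserved entries

# Gradient descent on the factorization W = U V^T, small random init.
U = 1e-3 * rng.standard_normal((n, n))
V = 1e-3 * rng.standard_normal((n, n))
lr = 0.05
for _ in range(10000):
    R = mask * (U @ V.T - W_true)   # residual on observed entries
    # df/dU = 2 R V,  df/dV = 2 R^T U
    U, V = U - lr * 2 * (R @ V), V - lr * 2 * (R.T @ U)

W = U @ V.T
print(W[0, 1])  # close to 1: the unobserved entry is "filled in", unlike GD directly on W
```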
Gradient descent on f(U, V) gets to good global minima
Gradient descent on f(U, V) generalizes better with smaller step size
Question: Which global minima does gradient descent reach? Why does it generalize well?
Gradient descent on f(U, V) converges to a minimum nuclear norm solution
Conjecture: With stepsize → 0 (i.e., gradient flow) and initialization → 0, gradient descent on U (with W = UUᵀ) converges to the minimum nuclear norm solution:
    UUᵀ → arg min_{W ⪰ 0} ||W||_* s.t. A(W) = y
[Gunasekar Woodworth Bhojanapalli Neyshabur S 2017]
• Rigorous proof when the A_i's commute
• General A_i: empirical validation + hand waving
• Yuanzhi Li, Hongyang Zhang and Tengyu Ma: proved when y = A(W*), W* low rank, A satisfies RIP
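A hedged illustration of the commuting case: when the A_i are diagonal, W = diag(w) with w ≥ 0, the nuclear norm is the ℓ₁ norm of w, and the factorization reduces to the entrywise parametrization w = u². The toy linear system below is my own; it contrasts the minimum-ℓ₁ solution found through the factorization with the minimum-ℓ₂ solution found by plain gradient descent.

```python
import numpy as np

# Underdetermined system A w = y; the sparsest nonnegative solution is (0, 1, 0).
A = np.array([[1., 1., 0.],
              [0., 1., 1.]])
y = np.array([1., 1.])
lr = 0.05

# GD on u with the entrywise factorization w = u^2, from a small init:
u = np.full(3, 1e-3)
for _ in range(20000):
    r = A @ (u * u) - y           # residual of the linear constraints
    u -= lr * 4 * u * (A.T @ r)   # chain rule through w = u^2
w_factored = u * u

# Plain GD on w from 0, for contrast:
w = np.zeros(3)
for _ in range(20000):
    w -= lr * A.T @ (A @ w - y)

print(np.round(w_factored, 2))  # ~ [0, 1, 0]: minimum L1 ("nuclear norm") solution
print(np.round(w, 2))           # ~ [0.33, 0.67, 0.33]: minimum L2 solution
```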
Implicit Bias in Least Squares: min ||Aw − b||²
• Gradient descent (+ momentum) on w → min_{Aw=b} ||w||₂
• Gradient descent on the factorization W = UV → probably min_{A(W)=b} ||W||_tr, with stepsize → 0 and init → 0, but only in the limit; depends on stepsize and init; proved only in special cases
• AdaGrad on w → in some special cases min_{Aw=b} ||w||, but not always, and it depends on stepsize, adaptation parameters, momentum
• Steepest descent w.r.t. a norm ||w|| → ??? Not min_{Aw=b} ||w||, even as stepsize → 0! And it depends on stepsize, init, momentum
• Coordinate descent (steepest descent w.r.t. ||w||₁) → related to, but not quite, the Lasso (with stepsize → 0 and particular tie-breaking: LARS)
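To illustrate the last row, here is a greedy coordinate-descent sketch with exact line search, on a small toy system of my own: it reaches a sparse solution of Aw = b, unlike gradient descent from 0, which reaches the minimum Euclidean norm solution.

```python
import numpy as np

A = np.array([[1., 1., 0.],
              [0., 1., 1.]])
y = np.array([1., 1.])

# Greedy coordinate descent with exact line search on ||Aw - y||^2.
w = np.zeros(3)
for _ in range(100):
    r = A @ w - y
    g = A.T @ r
    j = int(np.argmax(np.abs(g)))        # steepest coordinate
    w[j] -= g[j] / (A[:, j] @ A[:, j])   # exact minimization along coordinate j

print(w)  # [0. 1. 0.]: a sparse (L1-flavored) solution, unlike GD's [1/3, 2/3, 1/3]
```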
Training a Single Unit on Separable Data
    arg min_{w ∈ R^n} L(w) = Σ_{i=1}^m ℓ(y_i ⟨w, x_i⟩),   ℓ(z) = log(1 + e^{−z})
Data {(x_i, y_i)}_{i=1}^m linearly separable (∃w ∀i: y_i ⟨w, x_i⟩ > 0)
Where does gradient descent converge?   w(t+1) = w(t) − η ∇L(w(t))
inf_w L(w) = 0, but the minima are unattainable. GD diverges to infinity: ||w(t)|| → ∞, L(w(t)) → 0. In what direction? What does w(t)/||w(t)||₂ converge to?
Theorem: w(t)/||w(t)||₂ → ŵ/||ŵ||₂ where ŵ = arg min ||w||₂ s.t. ∀i: y_i ⟨w, x_i⟩ ≥ 1
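A sketch of the theorem on a toy data set of my own: the symmetric pair (2,1), (−2,−1) makes the max-margin direction (2,1)/√5, and (5,5) is a non-support point whose influence vanishes asymptotically.

```python
import numpy as np

X = np.array([[2., 1.], [-2., -1.], [5., 5.]])
y = np.array([1., -1., 1.])          # separable; (5,5) is not a support vector

w = np.zeros(2)
lr = 0.1
for _ in range(100000):
    margins = y * (X @ w)
    # gradient of sum_i log(1 + exp(-y_i <w, x_i>)) w.r.t. w
    coeff = -y / (1.0 + np.exp(margins))
    w -= lr * X.T @ coeff

direction = w / np.linalg.norm(w)
print(direction)  # -> approaches the max-margin direction (2,1)/sqrt(5) ~ [0.894, 0.447]
```

Note that ||w|| itself diverges (logarithmically); only the direction converges, which is why the code normalizes at the end.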
Other Objectives and Optimization Methods
• Single linear unit, logistic loss → hard-margin SVM solution (regardless of init, stepsize)
• Multi-class problems with softmax loss → multiclass SVM solution (regardless of init, stepsize)
• Steepest descent w.r.t. a norm ||w|| → arg min ||w|| s.t. ∀i: y_i ⟨w, x_i⟩ ≥ 1 (regardless of init, stepsize)
• Coordinate descent → arg min ||w||₁ s.t. ∀i: y_i ⟨w, x_i⟩ ≥ 1 (regardless of init, stepsize)
• Matrix factorization problems, L(U, V) = Σ_i ℓ(⟨A_i, UV⟩), including 1-bit matrix completion → arg min ||W||_tr s.t. ∀i: ⟨A_i, W⟩ ≥ 1 (regardless of init)
Linear Neural Networks
Graph G(V, E), with h_v = Σ_{u→v} w_{u→v} h_u
Input units h_in = x_i ∈ R^n, single output h_out(x_i), binary labels y_i ∈ {±1}
Training: min_w Σ_{i=1}^m ℓ(y_i h_out(x_i))
Implements a linear predictor: h_out(x_i) = ⟨P(w), x_i⟩
Training: min_w L(P(w)) = Σ_{i=1}^m ℓ(y_i ⟨P(w), x_i⟩)
Just a different parametrization of linear classification: min_{β ∈ Im P} L(β)
GD on w: a different optimization procedure for the same argmin problem
Limit of GD: β∞ = lim_t P(w(t)) / ||P(w(t))||
(Im P = R^n in all our examples)
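A minimal sketch of the parametrization point: a linear network, whatever its depth, computes ⟨P(w), x⟩, where P(w) collapses the layers into a single vector (the dimensions below are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
# 3-layer fully connected linear network, h_l = W_l^T h_{l-1}, scalar output.
W1 = rng.standard_normal((4, 6))   # input dim 4 -> 6 hidden units
W2 = rng.standard_normal((6, 3))   # 6 -> 3 hidden units
w3 = rng.standard_normal(3)        # 3 -> scalar output

def h_out(x):
    return w3 @ (W2.T @ (W1.T @ x))

# The whole network collapses to one linear predictor beta = P(w):
beta = W1 @ W2 @ w3

x = rng.standard_normal(4)
print(np.isclose(h_out(x), beta @ x))  # True: same function, different parametrization
```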
Fully Connected Linear Networks
L fully connected layers, D_l units in layer l: h_l ∈ R^{D_l}, h_0 = h_in,
    h_l = W_lᵀ h_{l−1},   h_out = h_L
Parameters: w = (W_l ∈ R^{D_{l−1} × D_l}, l = 1..L)
Theorem: β∞ ∝ arg min ||β||₂ s.t. ∀i: y_i ⟨β, x_i⟩ ≥ 1
for ℓ(z) = exp(−z), almost all linearly separable data sets and initializations w(0), and any bounded stepsizes such that L(w(t)) → 0 and Δw(t) = w(t) − w(t−1) converges in direction
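A hedged sketch of the theorem for depth 2, on the same toy data as before. The theorem is stated for ℓ(z) = exp(−z); this sketch uses the logistic loss, which behaves the same way here. Init scale, stepsize, and iteration count are my choices.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.array([[2., 1.], [-2., -1.], [5., 5.]])
y = np.array([1., -1., 1.])

# Depth-2 linear network: beta = W1 @ w2, with W1 in R^{2x3}, w2 in R^3.
W1 = 0.1 * rng.standard_normal((2, 3))
w2 = 0.1 * rng.standard_normal(3)
lr = 0.05
for _ in range(200000):
    beta = W1 @ w2
    coeff = -y / (1.0 + np.exp(y * (X @ beta)))   # logistic-loss derivative per example
    g = X.T @ coeff                               # gradient w.r.t. beta
    # chain rule: dL/dW1 = g w2^T,  dL/dw2 = W1^T g
    W1, w2 = W1 - lr * np.outer(g, w2), w2 - lr * W1.T @ g

beta = W1 @ w2
print(beta / np.linalg.norm(beta))   # approaches the same max-margin direction (2,1)/sqrt(5)
```

Despite the very different parametrization, the direction of β = P(w) approaches the same ℓ₂ max-margin direction as a single unit.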
Linear Convolutional Networks
L−1 hidden layers h_l ∈ R^D, each a full-width cyclic convolution:
    h_l[d] = Σ_{k=0}^{D−1} w_l[k] h_{l−1}[(d + k) mod D]
Parameters: w = (w_l ∈ R^D, l = 1..L),   h_out = ⟨w_L, h_{L−1}⟩
Theorem: With a single convolutional layer (L = 2):
    β∞ ∝ arg min ||Fβ||₁ s.t. ∀i: y_i ⟨β, x_i⟩ ≥ 1      (F = discrete Fourier transform)
Theorem: β∞ ∝ a first-order critical point of min ||Fβ||_{2/L} s.t. ∀i: y_i ⟨β, x_i⟩ ≥ 1
for ℓ(z) = exp(−z), almost all linearly separable data sets and initializations w(0), and any bounded stepsizes such that L(w(t)) → 0 and Δw(t) converges in direction
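The role of the Fourier transform F comes from the fact that full-width cyclic convolution diagonalizes under the DFT. A small numerical check of that identity (the circular cross-correlation theorem), with arbitrary vectors of my choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8
w = rng.standard_normal(D)
h = rng.standard_normal(D)

# Full-width cyclic convolution as on the slide: out[d] = sum_k w[k] h[(d + k) mod D]
out = np.array([sum(w[k] * h[(d + k) % D] for k in range(D)) for d in range(D)])

# Under the DFT this becomes pointwise multiplication (conjugate on w because the
# slide's operation is a cross-correlation), which is why the implicit bias is
# naturally expressed in the Fourier domain (norms of F beta).
out_fft = np.real(np.fft.ifft(np.conj(np.fft.fft(w)) * np.fft.fft(h)))
print(np.allclose(out, out_fft))  # True
```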
Fully connected, any depth L:   min ||β||₂ s.t. ∀i: y_i ⟨β, x_i⟩ ≥ 1
Convolutional, depth L = 2:   min ||Fβ||_{2/L} = min ||Fβ||₁ s.t. ∀i: y_i ⟨β, x_i⟩ ≥ 1
Convolutional, depth L = 5:   min ||Fβ||_{2/L} = min ||Fβ||_{2/5} s.t. ∀i: y_i ⟨β, x_i⟩ ≥ 1
Goal: understand optimization algorithms not just as reaching some (global) optimum, but as reaching a specific optimum
Different optimization algorithm → different bias in optimum reached → different inductive bias → different learning properties