On stochastic gradient descent, flatness and generalization - Yoshua Bengio

1 On stochastic gradient descent, flatness and generalization. Yoshua Bengio. July 14, 2018. ICML 2018 Workshop on nonconvex optimization.

2 Disentangling optimization and generalization. The traditional ML picture is that optimization and generalization are neatly separated aspects. That makes the theory easier to handle, separately. Unfortunately this is not the case: SGD variants influence optimization AND generalization.

3 Memorization in Deep Networks. Mostly from an arXiv preprint: Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, Simon Lacoste-Julien.

4 Memorization in Deep Networks. Deep networks trained with SGD generalize well due to SGD's implicit regularization effect (Zhang et al., 2016). Deep networks achieve ~100% train accuracy on random data (Zhang et al., 2016). Do deep networks also memorize real data?
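To make the random-data claim concrete, here is a minimal, hedged sketch (not the paper's experiment, which uses deep networks trained with SGD): an over-parameterized random-feature model driven to ~100% training accuracy on randomly labeled data. All sizes and the least-squares readout are illustrative assumptions.

```python
# Minimal sketch: an over-parameterized model can reach ~100% training accuracy
# even when the labels are pure noise, i.e. it memorizes.
# Random ReLU features + a least-squares readout stand in for a deep net here.
import numpy as np

rng = np.random.default_rng(0)
N, D, H = 200, 20, 2000                     # samples, input dim, random features (assumed)
X = rng.normal(size=(N, D))
y = rng.choice([-1.0, 1.0], size=N)         # random labels: nothing real to learn

W = rng.normal(size=(D, H)) / np.sqrt(D)    # fixed random first layer
Phi = np.maximum(X @ W, 0.0)                # ReLU features, shape (N, H)

# Fit the readout weights by least squares; with H >> N this interpolates the labels.
w_out = np.linalg.lstsq(Phi, y, rcond=None)[0]

train_acc = (np.sign(Phi @ w_out) == y).mean()
print(f"train accuracy on random labels: {train_acc:.3f}")   # ~1.0: pure memorization
```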

5 Real data has Dominant Patterns. Real data: some samples are learned first. Random data: samples are learned in an arbitrary order. (Measured as the fraction of times each of 1000 samples is classified correctly after 1 epoch, across 100 runs.)

6 Larger Margin on Real Data. Real data: the distance from the decision boundary is large. Random data: the distance from the decision boundary is small. Critical sample ratio = fraction of samples which have adversarial examples in their vicinity.
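As an illustration of the critical sample ratio, the toy sketch below computes, for an assumed linear classifier, the fraction of samples that have an adversarial example within a small ball; for a linear model this is simply the fraction of points whose distance to the decision boundary is below the ball radius. The paper measures this on deep networks with a search procedure; everything here is an assumption for illustration.

```python
# Minimal sketch of the "critical sample ratio" idea on a toy linear classifier:
# a sample is "critical" if an adversarial example exists within an L2 ball of
# radius r around it, i.e. its distance to the decision boundary is below r.
import numpy as np

rng = np.random.default_rng(0)
w, b = np.array([2.0, -1.0]), 0.5          # assumed toy decision boundary
X = rng.normal(size=(1000, 2))             # assumed toy samples

dist_to_boundary = np.abs(X @ w + b) / np.linalg.norm(w)

for r in (0.05, 0.2, 0.5):
    csr = (dist_to_boundary < r).mean()    # fraction with an adversarial example within radius r
    print(f"radius {r:.2f}: critical sample ratio = {csr:.3f}")
```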

7 Patterns Come First. Validation accuracy peaks before falling: patterns in real data are learned before overfitting to the noise. (Train (solid) and validation (dotted) accuracy on MNIST during training with noisy labels.)

8 Regularization Hinders Memorization. Dropout is best at hindering memorization: it maintains performance on real data while reducing memorization of random data. (Best validation performance on real data, picked across a hyperparameter grid, vs. training performance on noisy labels for the same model, for different regularizers.)

9 Take Home Message. DNNs learn patterns before memorizing noise. Why? Does it have to do with SGD?

10 On the relevance of loss function geometry for generalization. Laurent Dinh, Razvan Pascanu, Samy Bengio, Yoshua Bengio.

11 Flatness

12 Reparametrization. Differentiating at a critical point shows that a reparametrization can map flat minima to sharp minima, and sharp minima to flat minima.

13 Reparametrization. Sharp minima can generalize well; flat minima can generalize poorly.

14 " Eppur, si muove!" And yet, it generalizes!

15 Factors Influencing Minima in SGD. Mostly from the preprint arXiv:1711.04623: Stanisław Jastrzębski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, Amos Storkey.

16 Behavior of SGD. Small mini-batches find wider minima (Keskar et al., 2016). What dynamics/factors govern the quality of the minima found by SGD?

17 SGD as a Stochastic Differential Equation. By the CLT, the mini-batch gradient with batch size S is approximately Gaussian: g^(S)(θ) ≈ g(θ) + (1/√S) ε, with ε ~ N(0, C(θ)). SGD with learning rate η is described by the update θ_{t+1} = θ_t − η g^(S)(θ_t). In continuous stochastic differential equation (SDE) form (Li et al.), valid if the learning rate is small enough (i.e., small steps): dθ = −g(θ) dt + √(η/S) B(θ) dW(t). Note: C(θ) = B(θ)^T B(θ).
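A small numerical sketch of the CLT step above, on an assumed toy loss: the size-S mini-batch gradient fluctuates around the full gradient with covariance approximately C(θ)/S, where C(θ) is the per-example gradient covariance. All constants are illustrative.

```python
# Minimal sketch: for the toy loss L(θ) = mean_i (θ - x_i)² / 2, the per-example
# gradient is (θ - x_i), and the size-S mini-batch gradient has variance ≈ C(θ)/S.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=100_000)   # assumed toy "dataset"
theta = 0.0
S = 32

per_example_grad = theta - x
C = per_example_grad.var()                         # per-example gradient variance C(θ)

# Empirical variance of the size-S mini-batch gradient g^(S)(θ)
batch_grads = np.array([
    (theta - rng.choice(x, size=S)).mean() for _ in range(5000)
])
print("empirical Var[g^(S)]:", batch_grads.var())
print("CLT prediction C/S  :", C / S)
```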

18 Equilibrium Distribution of SGD. The equilibrium distribution of this SDE is given by P(θ) ∝ exp(−2 L(θ) / (n σ²)): an (approximately) inverse relation between loss and density. The noise n = η/S controls the granularity of the equilibrium distribution. Note: η = learning rate, S = batch size, σ² = fixed isotropic gradient variance.
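A hedged toy check of this equilibrium claim: for a 1-D quadratic loss L(θ) = hθ²/2 with isotropic gradient noise, the stationary variance of the SGD iterates should be close to nσ²/(2h), the variance of the Gaussian density ∝ exp(−2L(θ)/(nσ²)). The curvature, noise scale, learning rate, and batch size below are assumed for illustration.

```python
# Minimal sketch: SGD on a 1-D quadratic L(θ) = h θ²/2 with Gaussian gradient
# noise of variance σ²/S. The empirical stationary variance of the iterates
# approximately matches the SDE prediction n σ² / (2h), with n = η/S.
import numpy as np

rng = np.random.default_rng(0)
h, sigma = 2.0, 1.0          # curvature and per-example gradient std (assumed)
eta, S = 0.05, 8             # learning rate and batch size (assumed)
n = eta / S                  # SGD noise level

theta, thetas = 0.0, []
for t in range(100_000):
    grad = h * theta + sigma * rng.normal() / np.sqrt(S)   # g^(S)(θ)
    theta -= eta * grad
    if t > 10_000:                                         # discard burn-in
        thetas.append(theta)

print("empirical Var[θ]        :", np.var(thetas))
print("SDE prediction n σ²/(2h):", n * sigma**2 / (2 * h))
```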

19 SGD Moves a Cloud of Points. Consider the last k values of θ: they form a cloud of points. The cloud gradually moves with SGD updates, and its width grows with the noise level (learning rate / batch size). It cannot enter valleys that are sharper (narrower) than that width.

20 Implications of the Theory. By a Laplace approximation, the probability of ending in a minimum A with loss L_A and Hessian H_A scales as exp(−2 L_A / (n σ²)) / √(det H_A) (up to a factor common to all minima). In general, minima with larger volume are favored (simply because they carry more probability mass), and higher noise n prioritizes width (volume) over depth. The final equilibrium distribution is unchanged when learning rate and batch size are scaled proportionally: η → βη, S → βS. Note: n = η/S, η = learning rate, S = batch size, σ² = fixed isotropic gradient variance.
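A toy illustration of the depth-versus-volume trade-off, using a Laplace-style approximation of the equilibrium mass around two assumed 1-D minima. The numbers are made up; the point is only how larger noise n shifts the preference toward the wider minimum.

```python
# Minimal sketch: for two 1-D minima A (deep, sharp) and B (shallower, wide),
# the equilibrium mass near each scales roughly as exp(-2 L_min / (n σ²)) / sqrt(H_min),
# up to a factor common to all minima. Larger noise n favors the wide minimum.
import numpy as np

def relative_mass(L_min, H_min, n, sigma2=1.0):
    return np.exp(-2.0 * L_min / (n * sigma2)) / np.sqrt(H_min)

L_A, H_A = 0.00, 100.0    # deep but sharp minimum (assumed)
L_B, H_B = 0.05, 1.0      # slightly shallower but much wider minimum (assumed)

for n in (0.01, 0.1, 1.0):
    mA, mB = relative_mass(L_A, H_A, n), relative_mass(L_B, H_B, n)
    print(f"noise n={n:>4}: P(A)/P(B) ≈ {mA / mB:.3f}")   # >1 prefers deep A, <1 prefers wide B
```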

21 Experimental Results. (Panels: smaller noise, sharper bowl; equal noise, equal width.)

22 Same Noise, Same Learning Dynamics. The theory is about the final equilibrium distribution, but it seems to apply along the trajectory as well: even the learning dynamics are similar when learning rate and batch size are scaled proportionally (η → βη, S → βS). (Panels: cyclic learning rate and cyclic batch size; constant learning rate and constant batch size.)
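A minimal numerical sketch of the proportional-scaling claim on the same kind of toy quadratic as above: two runs with the same ratio n = η/S end up with (approximately) the same stationary spread. This is only an illustration under assumed constants; the slide's claim is about real networks.

```python
# Minimal sketch: scaling learning rate and batch size proportionally
# (η → βη, S → βS) keeps the noise n = η/S, and with it the width of the
# stationary cloud of iterates, approximately unchanged.
import numpy as np

def stationary_var(eta, S, h=2.0, sigma=1.0, steps=100_000, seed=0):
    """Run noisy 'SGD' on L(θ) = h θ²/2 and return the variance of the late iterates."""
    rng = np.random.default_rng(seed)
    theta, tail = 0.0, []
    for t in range(steps):
        theta -= eta * (h * theta + sigma * rng.normal() / np.sqrt(S))
        if t > steps // 2:
            tail.append(theta)
    return np.var(tail)

beta = 4
print("Var[θ] at (η, S) = (0.01, 16):", stationary_var(0.01, 16))
print("Var[θ] at (βη, βS), β = 4    :", stationary_var(0.01 * beta, 16 * beta))
```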

23 Take Home Messages. DNNs learn patterns before memorizing noise. Regularization hinders memorization. The quality of the final minima and the learning dynamics are similar when learning rate and batch size are scaled proportionally. Larger noise favors large-volume minima over deep ones. Larger noise (e.g., from a smaller batch size or a larger learning rate) hinders memorization.

24 A Walk with SGD. Xing, Arpit, Tsirigotis & Bengio (arXiv preprint). Interpolate in parameter space between mini-batch SGD updates and a convex shape appears. After the initial phase, updates bounce off the valley floor, which improves monotonically, traversing larger distances with smaller batch sizes (BS). Learning rate: height above the floor. BS: exploration noise. Pure GD gets stuck on the floor, while SGD finds flatter regions, which generalize better.
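A sketch of the interpolation probe on an assumed toy 2-D loss (not a deep network): take two consecutive noisy-gradient iterates and evaluate the loss along the straight line between them. The loss, learning rate, and noise scale are all assumptions for illustration.

```python
# Minimal sketch of the "walk with SGD" probe: linearly interpolate the loss
# between consecutive noisy-gradient iterates θ_t and θ_{t+1} and inspect its shape.
import numpy as np

rng = np.random.default_rng(0)

def loss(th):
    x, y = th
    return (x**2 - 1.0) ** 2 + 0.5 * y**2          # assumed toy double-well loss

def grad(th):
    x, y = th
    return np.array([4.0 * x * (x**2 - 1.0), y])

eta, sigma = 0.05, 0.3
theta = np.array([0.1, 2.0])
iterates = [theta.copy()]
for _ in range(50):
    theta = theta - eta * (grad(theta) + sigma * rng.normal(size=2))  # noisy "SGD" step
    iterates.append(theta.copy())

# Interpolate between two consecutive iterates and print the 1-D loss profile.
a, b = iterates[30], iterates[31]
alphas = np.linspace(0.0, 1.0, 11)
profile = [loss((1 - alpha) * a + alpha * b) for alpha in alphas]
print(np.round(profile, 4))
```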

25 Sharpest Directions Along the SGD Trajectory (Jastrzębski, Kenton, Ballas, Fischer, Bengio, Storkey). Even at the beginning of training, a high learning rate or a small batch size steers SGD toward flatter loss regions. The largest eigenvalues appear to always follow a similar pattern: a fast increase in the early phase and a decrease thereafter, with the peak value determined by the learning rate and batch size. By altering the learning rate just along the eigenvectors associated with the largest eigenvalues, SGD can be steered towards regions which are an order of magnitude sharper yet correspond to models with similar generalization, confirming that the curvature of the endpoint found by SGD is not predictive of its generalization properties.
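A toy sketch of the "sharpest direction" measurement: track the largest Hessian eigenvalue (analytic here) along a noisy-gradient trajectory on an assumed 2-D loss. The paper does this for deep networks using Hessian-vector products; the loss and constants below are illustrative assumptions.

```python
# Minimal sketch: monitor the largest Hessian eigenvalue along a noisy-gradient
# trajectory on a toy 2-D loss (x² - 1)² + y²/2, whose Hessian is diag(12x² - 4, 1).
import numpy as np

rng = np.random.default_rng(0)

def grad(th):
    x, y = th
    return np.array([4.0 * x * (x**2 - 1.0), y])

def top_hessian_eig(th):
    x, _ = th
    return max(12.0 * x**2 - 4.0, 1.0)     # largest eigenvalue of diag(12x² - 4, 1)

eta, sigma = 0.02, 0.5
theta = np.array([0.05, 1.5])
for t in range(200):
    theta = theta - eta * (grad(theta) + sigma * rng.normal(size=2))
    if t % 40 == 0:
        print(f"step {t:3d}: largest Hessian eigenvalue ≈ {top_hessian_eig(theta):.2f}")
```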

26 Montreal Institute for Learning Algorithms
