Deep Neural Networks: From Flat Minima to Numerically Nonvacuous Generalization Bounds via PAC-Bayes

1 Deep Neural Networks: From Flat Minima to Numerically Nonvacuous Generalization Bounds via PAC-Bayes. Daniel M. Roy (University of Toronto; Vector Institute). Joint work with Gintarė K. Džiugaitė (University of Cambridge). (Almost) 50 Shades of Bayesian Learning: PAC-Bayesian trends and insights, Long Beach, CA, December 2017.

2-3 How does SGD work? There is a growing body of work arguing that SGD performs implicit regularization. Problem: no matching generalization bounds that are nonvacuous when applied to real data and networks. We focus on flat minima: weights w such that the training error is insensitive to large perturbations. We show that the size/flatness/location of the minima found by SGD on MNIST imply generalization, using PAC-Bayes bounds. Focusing on MNIST, we show how to compute generalization bounds that are nonvacuous for stochastic networks with millions of weights. We obtain our (data-dependent, PAC-Bayesian) generalization bounds via a fair bit of computation with SGD. Our approach is a modern take on Langford and Caruana (2002).

4 Nonvacuous generalization bounds. Risk: L_D(h) := E_{(x,y)~D}[l(h(x), y)], with D unknown. Empirical risk: L_S(h) = (1/m) sum_{i=1}^m l(h(x_i), y_i), where S = {(x_1, y_1), ..., (x_m, y_m)}. Generalization error: L_D(h) − L_S(h). A generalization error bound is an ε such that P_{S~D^m}( L_D(ĥ) − L_S(ĥ) < ε(H, m, δ, S, ĥ) ) > 1 − δ. [Figure: the generalization error bound as a function of the complexity of H and the amount of data.]
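To make the notation concrete, here is a minimal Python sketch (not from the talk; the toy data and classifier are invented for illustration) of the empirical 0-1 risk L_S(h); the generalization error is the gap between this quantity and the unknown risk L_D(h).

```python
import numpy as np

def empirical_risk(h, X, y):
    """L_S(h) under the 0-1 loss: the fraction of sample points that h misclassifies."""
    return float(np.mean(h(X) != y))

# Toy illustration: data labeled by a ground-truth linear rule; the classifier uses a
# perturbed weight vector, so its empirical risk is small but nonzero.
rng = np.random.default_rng(0)
w_true = np.ones(5)
X = rng.normal(size=(1000, 5))
y = np.sign(X @ w_true)
w_hat = w_true + 0.3 * rng.normal(size=5)
h = lambda X_: np.sign(X_ @ w_hat)
print("empirical risk L_S(h):", empirical_risk(h, X, y))
```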

5 "SGD is X" and "X implies generalization". Candidate claims: SGD is empirical risk minimization (for large enough networks); SGD is (implicit) regularized loss minimization; SGD is approximate Bayesian inference; ... No statement of the form "SGD is X" explains generalization in deep learning until we know that X implies generalization under real-world conditions.

6 SGD is (not simply) empirical risk minimization. [Figure: training and test error curves.] Training error of SGD at convergence; test error at convergence and with early stopping are identical. SGD ≠ empirical risk minimization, arg min_{w in H} L_S(w). MNIST has 60,000 training examples; a two-layer fully connected ReLU network has >1m parameters ⇒ PAC bounds are vacuous ⇒ PAC bounds can't explain this curve.

7 Our focus: the statistical learning aspect, "X implies generalization". On MNIST, with realistic networks: VC bounds (the classic route) don't imply generalization; margin + norm-bounded Rademacher complexity bounds don't imply generalization; being Bayesian does not necessarily imply generalization (sorry!). Using PAC-Bayes bounds, we show that the size/flatness/location of the minima found by SGD on MNIST imply generalization for MNIST. Our bounds require a fair bit of computation/optimization to evaluate. Strictly speaking, they bound the error of a random perturbation of the SGD solution.

8-9 Flat minima, where the training error is insensitive to large perturbations (Hochreiter and Schmidhuber, 1997), meet the PAC-Bayes theorem (McAllester): for all D, P_{S~D^m}[ for all Q: kl( L_S(Q) || L_D(Q) ) <= (KL(Q||P) + log(m/δ)) / m ] >= 1 − δ. This holds for any data distribution D (i.e., no assumptions), for any prior randomized classifier P (even nonsense), with high probability over m i.i.d. samples S ~ D^m, and for any posterior randomized classifier Q (not necessarily given by Bayes' rule). The generalization error of Q is bounded approximately by KL(Q||P)/m.
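A hedged sketch of how a bound of this form is evaluated numerically: given the empirical risk L_S(Q) and the right-hand side of the kl inequality, the bound on L_D(Q) is obtained by inverting the binary KL divergence with a bisection search. The numbers below are illustrative placeholders, not the values from the talk.

```python
import math

def kl_bernoulli(q, p):
    """Binary KL divergence kl(q || p) between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inverse_upper(q_hat, budget, tol=1e-9):
    """Largest p >= q_hat with kl(q_hat || p) <= budget, found by bisection."""
    lo, hi = q_hat, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(q_hat, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return hi

# Illustrative placeholder numbers (assumed, not from the talk): empirical risk 0.03,
# KL(Q||P) = 5000 nats, m = 60000 training points, delta = 0.05.
m, delta, kl_qp, emp_risk = 60000, 0.05, 5000.0, 0.03
budget = (kl_qp + math.log(m / delta)) / m
print("PAC-Bayes bound on L_D(Q):", kl_inverse_upper(emp_risk, budget))
```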

10 Controlling the generalization error of randomized classifiers. Let H be a hypothesis class of binary classifiers R^k → {−1, 1}. A randomized classifier is a distribution Q on H; its risk is L_D(Q) = E_{w~Q}[L_D(h_w)]. Among the sharpest generalization bounds for randomized classifiers are PAC-Bayes bounds (McAllester, 1999). Theorem (PAC-Bayes; Catoni, 2007). Let δ > 0 and m ∈ N, and assume the loss is bounded. Then for all P and D, P_{S~D^m}( for all Q: L_D(Q) <= 2 L_S(Q) + 2 (KL(Q||P) + log(1/δ)) / m ) >= 1 − δ.
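For comparison, the linear bound above is cheap to evaluate directly; a minimal sketch with the same assumed placeholder numbers:

```python
import math

def linear_pac_bayes_bound(emp_risk, kl_qp, m, delta):
    """Catoni-style linear bound from the slide: L_D(Q) <= 2*L_S(Q) + 2*(KL(Q||P) + log(1/delta))/m."""
    return 2.0 * emp_risk + 2.0 * (kl_qp + math.log(1.0 / delta)) / m

# Illustrative placeholder values (not the talk's numbers).
print(linear_pac_bayes_bound(emp_risk=0.03, kl_qp=5000.0, m=60000, delta=0.05))
```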

11-17 Our approach. Given m i.i.d. data S ~ D^m: consider the empirical error surface w ↦ L_S(h_w); take w_SGD ∈ R^d, the weights learned by SGD on MNIST; form the stochastic neural net Q̂ = N(w_SGD + w', Σ'); and obtain a generalization/error bound: P_{S~D^m}( L_D(Q̂) < 0.17 ) > 0.95.
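Evaluating the training (or test) error of the stochastic network Q̂ means averaging the network's error over weight samples. A minimal Monte Carlo sketch, assuming a function `zero_one_error(w, X, y)` (hypothetical name) that returns the 0-1 error of the network with flattened weight vector w:

```python
import numpy as np

def stochastic_net_error(zero_one_error, mean_w, var_w, X, y, n_samples=100, seed=0):
    """Monte Carlo estimate of the error of Q = N(mean_w, diag(var_w)) by sampling weight vectors."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_samples):
        w = mean_w + np.sqrt(var_w) * rng.normal(size=mean_w.shape)  # draw weights from Q
        errors.append(zero_one_error(w, X, y))
    return float(np.mean(errors))
```

Applied to held-out data, the same estimator gives the kind of SNN test error reported on the results slide.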

18 Optimizing PAC-Bayes bounds. Given data S, we can find a provably good classifier Q by optimizing the PAC-Bayes bound w.r.t. Q. For Catoni's PAC-Bayes bound, the optimization problem is of the form sup_Q { −τ L_S(Q) − KL(Q||P) }. Lemma. The optimal Q satisfies dQ/dP(w) = exp(−τ L_S(w)) / ∫ exp(−τ L_S(w')) P(dw') (a generalized Bayes rule). Observation. Under log loss and τ = m, the term −τ L_S(Q) is the expected log likelihood under Q, and the objective is the ELBO. Lemma. log ∫ exp(−τ L_S(w)) P(dw) = sup_Q { −τ L_S(Q) − KL(Q||P) }. Observation. Under log loss and τ = m, the l.h.s. is the log marginal likelihood. Cf. Zhang (2004, 2006), Alquier et al. (2015), Germain et al.
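A toy numerical check of the second lemma (the variational identity) on a finite hypothesis set; everything here is invented for illustration:

```python
# log E_P[exp(-tau * L_S(w))] equals sup_Q { -tau * L_S(Q) - KL(Q || P) }, attained by the
# Gibbs posterior dQ/dP(w) proportional to exp(-tau * L_S(w)).
import numpy as np

rng = np.random.default_rng(1)
K, tau = 8, 20.0
prior = np.full(K, 1.0 / K)            # uniform prior P on K hypotheses
losses = rng.uniform(0, 1, size=K)     # empirical risks L_S(w) of each hypothesis

lhs = np.log(np.sum(prior * np.exp(-tau * losses)))   # log partition function

gibbs = prior * np.exp(-tau * losses)
gibbs /= gibbs.sum()                                   # Gibbs posterior Q*
kl = np.sum(gibbs * np.log(gibbs / prior))
rhs = -tau * np.sum(gibbs * losses) - kl               # objective evaluated at Q*

print(lhs, rhs)  # the two values agree up to floating-point error
```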

19-27 PAC-Bayes bound optimization.
inf_Q L_S(Q) + (KL(Q||P) + log(1/δ)) / m.
Let L̃_S(Q) >= L_S(Q) with L̃_S differentiable: inf_Q m L̃_S(Q) + KL(Q||P).
Let Q_{w,s} = N(w, diag(s)): min_{w in R^d, s in R^d_+} m L̃_S(Q_{w,s}) + KL(Q_{w,s}||P).
Take P = N(w_0, λ I_d) with λ = c exp{−j/b} (a discrete grid of prior variances, covered by a union bound):
min_{w in R^d, s in R^d_+, λ in (0,c)} m L̃_S(Q_{w,s}) + KL(Q_{w,s} || N(w_0, λI)) + 2 log(b log(c/λ)),
where KL(Q_{w,s} || N(w_0, λI)) = (1/2) [ (1/λ) sum_i s_i + (1/λ) ||w − w_0||² + d log λ − sum_i log s_i − d ].
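The KL term in the final objective has the closed form shown above; a minimal sketch, with s and λ interpreted as variances per the parameterization on the slide:

```python
import numpy as np

def kl_diag_vs_isotropic(w, s, w0, lam):
    """KL( N(w, diag(s)) || N(w0, lam*I_d) ), with s (vector) and lam (scalar) as variances."""
    d = w.shape[0]
    return 0.5 * (np.sum(s) / lam
                  + np.sum((w - w0) ** 2) / lam
                  + d * np.log(lam)
                  - np.sum(np.log(s))
                  - d)

# Quick sanity check: the divergence is zero when the two Gaussians coincide.
w0 = np.zeros(3)
print(kl_diag_vs_isotropic(w0, np.full(3, 0.1), w0, 0.1))  # ~0.0
```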

28 Numerical generalization bounds on MNIST. [Table: columns indexed by # hidden layers (R); rows: train error, test error, SNN train error, SNN test error, PAC-Bayes bound, KL divergence, # parameters (472k, 832k, 1193k, 472k), and VC dimension (26m, 66m, 121m, 26m); the remaining numerical entries were not preserved in the transcription.] We have shown that the type of flat minima found in practice can be turned into a generalization guarantee. The bounds are loose, but they are the only nonvacuous bounds in this setting.

29-30 Actually, SGD is pretty dangerous. SGD achieves zero training error reliably. Despite no explicit regularization, training and test error are very close. Explicit regularization has a minor effect. SGD can also reliably obtain zero training error on randomized labels; hence the Rademacher complexity of the model class is near maximal, with high probability.

31 Entropy-SGD (Chaudhari et al., 2017). Entropy-SGD replaces stochastic gradient descent on L_S by stochastic gradient ascent applied to the optimization problem arg max_{w in R^d} F_{γ,τ}(w; S), where F_{γ,τ}(w; S) = log ∫_{R^p} exp{ −τ L_S(w') − (τγ/2) ||w − w'||²_2 } dw'. The local entropy F_{γ,τ}(·; S) emphasizes flat minima of L_S.
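A rough sketch of one Entropy-SGD step under the formulation above: an inner SGLD loop samples w' from the Gibbs measure ρ(w') ∝ exp{−τ L_S(w') − (τγ/2)||w − w'||²} and tracks a running mean μ ≈ E_ρ[w']; the outer step then moves w toward μ, which is gradient ascent on F up to constants. The function `grad_loss(w, batch)` and all step sizes are assumptions (τ is absorbed into them), not the talk's settings.

```python
import numpy as np

def entropy_sgd_step(w, batches, grad_loss, gamma, eta_outer=0.1, eta_inner=1e-3,
                     n_inner=20, seed=0):
    """One outer step of Entropy-SGD (sketch): SGLD inner loop, then move w toward the inner mean."""
    rng = np.random.default_rng(seed)
    w_prime, mu = w.copy(), w.copy()
    for _, batch in zip(range(n_inner), batches):
        # SGLD update targeting rho(w') ∝ exp{-L_S(w') - (gamma/2)||w - w'||^2} (tau absorbed).
        g = grad_loss(w_prime, batch) + gamma * (w_prime - w)
        w_prime = w_prime - eta_inner * g + np.sqrt(2 * eta_inner) * rng.normal(size=w.shape)
        mu = 0.9 * mu + 0.1 * w_prime  # exponential moving average of the inner samples
    # Ascent on the local entropy: grad_w F is proportional to -(w - mu).
    return w - eta_outer * gamma * (w - mu)
```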

32-34 Entropy-SGD optimizes a PAC-Bayes bound w.r.t. the prior. Entropy-SGD optimizes the local entropy F_{γ,τ}(w; S) = log ∫_{R^p} exp{ −τ L_S(w') − (τγ/2) ||w − w'||²_2 } dw'.
Theorem. Maximizing F_{γ,τ}(w; S) w.r.t. w corresponds to minimizing a PAC-Bayes risk bound w.r.t. the prior's mean w.
Theorem. Let P(S) be an ε-differentially private distribution. Then for all D, P_{S~D^m}( for all Q: kl( L_S(Q) || L_D(Q) ) <= (KL(Q||P(S)) + ln(2m) + 2 max{ln(3/δ), mε²}) / (m − 1) ) >= 1 − δ.
We optimize F_{γ,τ}(w; S) using SGLD, obtaining (ε, δ)-differential privacy. SGLD is known to converge weakly to the ε-differentially private exponential mechanism. Our analysis makes a coarse approximation: the privacy of SGLD is taken to be that of the exponential mechanism.
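A minimal sketch of the right-hand side of the differentially private bound, with illustrative placeholder numbers; inverting the binary kl as in the earlier sketch (or using Pinsker's inequality, as below) converts this budget into a bound on L_D(Q).

```python
import math

def dp_pac_bayes_budget(kl_q_ps, m, delta, eps):
    """RHS of the differentially private PAC-Bayes bound on the slide."""
    return (kl_q_ps + math.log(2 * m) + 2 * max(math.log(3 / delta), m * eps ** 2)) / (m - 1)

# Illustrative placeholder values (not the talk's numbers).
budget = dp_pac_bayes_budget(kl_q_ps=5000.0, m=60000, delta=0.05, eps=0.01)
print("Pinsker relaxation of the bound:", 0.03 + math.sqrt(budget / 2))
```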

38 Conclusion. We show that the size/flatness/location of the minima found by SGD on MNIST imply generalization, using PAC-Bayes bounds. We show that Entropy-SGD optimizes the prior in a PAC-Bayes bound, which is not valid (the prior may not depend on the data). We give a differentially private version of the PAC-Bayes theorem and modify Entropy-SGD so that the prior is privately optimized.

39 References.
Achille, A. and S. Soatto (2017). On the Emergence of Invariance and Disentangling in Deep Representations. arXiv preprint.
Catoni, O. (2007). PAC-Bayesian supervised classification: the thermodynamics of statistical learning. arXiv [stat.ML].
Chaudhari, P., A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. Chayes, L. Sagun, and R. Zecchina (2017). Entropy-SGD: Biasing Gradient Descent Into Wide Valleys. In: International Conference on Learning Representations (ICLR). arXiv [cs.LG].
Hinton, G. E. and D. van Camp (1993). Keeping the Neural Networks Simple by Minimizing the Description Length of the Weights. In: Proceedings of the Sixth Annual Conference on Computational Learning Theory (COLT '93). Santa Cruz, California, USA: ACM.
Hochreiter, S. and J. Schmidhuber (1997). Flat Minima. Neural Computation 9(1).
Keskar, N. S., D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang (2017). On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. In: International Conference on Learning Representations (ICLR). arXiv [cs.LG].
Langford, J. and R. Caruana (2002). (Not) Bounding the True Error. In: Advances in Neural Information Processing Systems 14. Ed. by T. G. Dietterich, S. Becker, and Z. Ghahramani. MIT Press.
Langford, J. and M. Seeger (2001). Bounds for Averaging Classifiers. Tech. rep. CMU-CS. Carnegie Mellon University.
McAllester, D. A. (1999). PAC-Bayesian Model Averaging. In: Proceedings of the Twelfth Annual Conference on Computational Learning Theory (COLT '99). Santa Cruz, California, USA: ACM.
