Deep Neural Networks: From Flat Minima to Numerically Nonvacuous Generalization Bounds via PAC-Bayes

1 Deep Neural Networks: From Flat Minima to Numerically Nonvacuous Generalization Bounds via PAC-Bayes. Daniel M. Roy (University of Toronto; Vector Institute). Joint work with Gintarė K. Džiugaitė (University of Cambridge). (Almost) 50 Shades of Bayesian Learning: PAC-Bayesian trends and insights, Long Beach, CA, December 2017.

2-3 How does SGD work? There is a growing body of work arguing that SGD performs implicit regularization. Problem: no matching generalization bounds that are nonvacuous when applied to real data and networks. We focus on flat minima: weights w such that the training error is insensitive to large perturbations. We show that the size/flatness/location of the minima found by SGD on MNIST imply generalization, using PAC-Bayes bounds. Focusing on MNIST, we show how to compute generalization bounds that are nonvacuous for stochastic networks with millions of weights. We obtain our (data-dependent, PAC-Bayesian) generalization bounds via a fair bit of computation with SGD. Our approach is a modern take on Langford and Caruana (2002).

4 Nonvacuous generalization bounds. Risk: L_D(h) := E_{(x,y)~D}[l(h(x), y)], with D unknown. Empirical risk: L_S(h) = (1/m) sum_{i=1}^m l(h(x_i), y_i), where S = {(x_1, y_1), ..., (x_m, y_m)}. Generalization error: L_D(h) − L_S(h). A generalization error bound is an ε such that P_{S~D^m}( L_D(ĥ) − L_S(ĥ) < ε(H, m, δ, S, ĥ) ) > 1 − δ. [Figure: the generalization error bound as a function of the complexity of H and the amount of data.]
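To make the notation concrete, here is a minimal Python sketch (not from the talk; the toy data and classifier are invented for illustration) of the empirical 0-1 risk L_S(h); the generalization error is the gap between this quantity and the unknown risk L_D(h).

```python
import numpy as np

def empirical_risk(h, X, y):
    """L_S(h) under the 0-1 loss: the fraction of sample points that h misclassifies."""
    return float(np.mean(h(X) != y))

# Toy illustration: data labeled by a ground-truth linear rule; the classifier uses a
# perturbed weight vector, so its empirical risk is small but nonzero.
rng = np.random.default_rng(0)
w_true = np.ones(5)
X = rng.normal(size=(1000, 5))
y = np.sign(X @ w_true)
w_hat = w_true + 0.3 * rng.normal(size=5)
h = lambda X_: np.sign(X_ @ w_hat)
print("empirical risk L_S(h):", empirical_risk(h, X, y))
```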

5 "SGD is X" and "X implies generalization". Candidate claims: SGD is empirical risk minimization (for large enough networks); SGD is (implicit) regularized loss minimization; SGD is approximate Bayesian inference; ... No statement of the form "SGD is X" explains generalization in deep learning until we know that X implies generalization under real-world conditions.

6 SGD is (not simply) empirical risk minimization. [Figure: training and test error curves.] Training error of SGD at convergence; test error at convergence and with early stopping are identical. SGD ≠ empirical risk minimization, arg min_{w in H} L_S(w). MNIST has 60,000 training examples; a two-layer fully connected ReLU network has >1m parameters ⇒ PAC bounds are vacuous ⇒ PAC bounds can't explain this curve.

7 Our focus: the statistical learning aspect, "X implies generalization". On MNIST, with realistic networks: VC bounds (the classic route) don't imply generalization; margin + norm-bounded Rademacher complexity bounds don't imply generalization; being Bayesian does not necessarily imply generalization (sorry!). Using PAC-Bayes bounds, we show that the size/flatness/location of the minima found by SGD on MNIST imply generalization for MNIST. Our bounds require a fair bit of computation/optimization to evaluate. Strictly speaking, they bound the error of a random perturbation of the SGD solution.

8-9 Flat minima, where the training error is insensitive to large perturbations (Hochreiter and Schmidhuber, 1997), meet the PAC-Bayes theorem (McAllester): for all D, P_{S~D^m}[ for all Q: kl( L_S(Q) || L_D(Q) ) <= (KL(Q||P) + log(m/δ)) / m ] >= 1 − δ. This holds for any data distribution D (i.e., no assumptions), for any prior randomized classifier P (even nonsense), with high probability over m i.i.d. samples S ~ D^m, and for any posterior randomized classifier Q (not necessarily given by Bayes' rule). The generalization error of Q is bounded approximately by KL(Q||P)/m.
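A hedged sketch of how a bound of this form is evaluated numerically: given the empirical risk L_S(Q) and the right-hand side of the kl inequality, the bound on L_D(Q) is obtained by inverting the binary KL divergence with a bisection search. The numbers below are illustrative placeholders, not the values from the talk.

```python
import math

def kl_bernoulli(q, p):
    """Binary KL divergence kl(q || p) between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inverse_upper(q_hat, budget, tol=1e-9):
    """Largest p >= q_hat with kl(q_hat || p) <= budget, found by bisection."""
    lo, hi = q_hat, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(q_hat, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return hi

# Illustrative placeholder numbers (assumed, not from the talk): empirical risk 0.03,
# KL(Q||P) = 5000 nats, m = 60000 training points, delta = 0.05.
m, delta, kl_qp, emp_risk = 60000, 0.05, 5000.0, 0.03
budget = (kl_qp + math.log(m / delta)) / m
print("PAC-Bayes bound on L_D(Q):", kl_inverse_upper(emp_risk, budget))
```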

10 Controlling the generalization error of randomized classifiers. Let H be a hypothesis class of binary classifiers R^k → {−1, 1}. A randomized classifier is a distribution Q on H; its risk is L_D(Q) = E_{w~Q}[L_D(h_w)]. Among the sharpest generalization bounds for randomized classifiers are PAC-Bayes bounds (McAllester, 1999). Theorem (PAC-Bayes; Catoni, 2007). Let δ > 0 and m ∈ N, and assume the loss is bounded. Then for all P and D, P_{S~D^m}( for all Q: L_D(Q) <= 2 L_S(Q) + 2 (KL(Q||P) + log(1/δ)) / m ) >= 1 − δ.
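For comparison, the linear bound above is cheap to evaluate directly; a minimal sketch with the same assumed placeholder numbers:

```python
import math

def linear_pac_bayes_bound(emp_risk, kl_qp, m, delta):
    """Catoni-style linear bound from the slide: L_D(Q) <= 2*L_S(Q) + 2*(KL(Q||P) + log(1/delta))/m."""
    return 2.0 * emp_risk + 2.0 * (kl_qp + math.log(1.0 / delta)) / m

# Illustrative placeholder values (not the talk's numbers).
print(linear_pac_bayes_bound(emp_risk=0.03, kl_qp=5000.0, m=60000, delta=0.05))
```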

11-17 Our approach. Given m i.i.d. data S ~ D^m: consider the empirical error surface w ↦ L_S(h_w); take w_SGD ∈ R^d, the weights learned by SGD on MNIST; form the stochastic neural net Q̂ = N(w_SGD + w', Σ'); and obtain a generalization/error bound: P_{S~D^m}( L_D(Q̂) < 0.17 ) > 0.95.
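Evaluating the training (or test) error of the stochastic network Q̂ means averaging the network's error over weight samples. A minimal Monte Carlo sketch, assuming a function `zero_one_error(w, X, y)` (hypothetical name) that returns the 0-1 error of the network with flattened weight vector w:

```python
import numpy as np

def stochastic_net_error(zero_one_error, mean_w, var_w, X, y, n_samples=100, seed=0):
    """Monte Carlo estimate of the error of Q = N(mean_w, diag(var_w)) by sampling weight vectors."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_samples):
        w = mean_w + np.sqrt(var_w) * rng.normal(size=mean_w.shape)  # draw weights from Q
        errors.append(zero_one_error(w, X, y))
    return float(np.mean(errors))
```

Applied to held-out data, the same estimator gives the kind of SNN test error reported on the results slide.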

18 Optimizing PAC-Bayes bounds. Given data S, we can find a provably good classifier Q by optimizing the PAC-Bayes bound w.r.t. Q. For Catoni's PAC-Bayes bound, the optimization problem is of the form sup_Q { −τ L_S(Q) − KL(Q||P) }. Lemma. The optimal Q satisfies dQ/dP(w) = exp(−τ L_S(w)) / ∫ exp(−τ L_S(w')) P(dw') (a generalized Bayes rule). Observation. Under log loss and τ = m, the term −τ L_S(Q) is the expected log likelihood under Q, and the objective is the ELBO. Lemma. log ∫ exp(−τ L_S(w)) P(dw) = sup_Q { −τ L_S(Q) − KL(Q||P) }. Observation. Under log loss and τ = m, the l.h.s. is the log marginal likelihood. Cf. Zhang (2004, 2006), Alquier et al. (2015), Germain et al.
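A toy numerical check of the second lemma (the variational identity) on a finite hypothesis set; everything here is invented for illustration:

```python
# log E_P[exp(-tau * L_S(w))] equals sup_Q { -tau * L_S(Q) - KL(Q || P) }, attained by the
# Gibbs posterior dQ/dP(w) proportional to exp(-tau * L_S(w)).
import numpy as np

rng = np.random.default_rng(1)
K, tau = 8, 20.0
prior = np.full(K, 1.0 / K)            # uniform prior P on K hypotheses
losses = rng.uniform(0, 1, size=K)     # empirical risks L_S(w) of each hypothesis

lhs = np.log(np.sum(prior * np.exp(-tau * losses)))   # log partition function

gibbs = prior * np.exp(-tau * losses)
gibbs /= gibbs.sum()                                   # Gibbs posterior Q*
kl = np.sum(gibbs * np.log(gibbs / prior))
rhs = -tau * np.sum(gibbs * losses) - kl               # objective evaluated at Q*

print(lhs, rhs)  # the two values agree up to floating-point error
```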

19-27 PAC-Bayes bound optimization.
inf_Q L_S(Q) + (KL(Q||P) + log(1/δ)) / m.
Let L̃_S(Q) >= L_S(Q) with L̃_S differentiable: inf_Q m L̃_S(Q) + KL(Q||P).
Let Q_{w,s} = N(w, diag(s)): min_{w in R^d, s in R^d_+} m L̃_S(Q_{w,s}) + KL(Q_{w,s}||P).
Take P = N(w_0, λ I_d) with λ = c exp{−j/b} (a discrete grid of prior variances, covered by a union bound):
min_{w in R^d, s in R^d_+, λ in (0,c)} m L̃_S(Q_{w,s}) + KL(Q_{w,s} || N(w_0, λI)) + 2 log(b log(c/λ)),
where KL(Q_{w,s} || N(w_0, λI)) = (1/2) [ (1/λ) sum_i s_i + (1/λ) ||w − w_0||² + d log λ − sum_i log s_i − d ].
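The KL term in the final objective has the closed form shown above; a minimal sketch, with s and λ interpreted as variances per the parameterization on the slide:

```python
import numpy as np

def kl_diag_vs_isotropic(w, s, w0, lam):
    """KL( N(w, diag(s)) || N(w0, lam*I_d) ), with s (vector) and lam (scalar) as variances."""
    d = w.shape[0]
    return 0.5 * (np.sum(s) / lam
                  + np.sum((w - w0) ** 2) / lam
                  + d * np.log(lam)
                  - np.sum(np.log(s))
                  - d)

# Quick sanity check: the divergence is zero when the two Gaussians coincide.
w0 = np.zeros(3)
print(kl_diag_vs_isotropic(w0, np.full(3, 0.1), w0, 0.1))  # ~0.0
```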

28 Numerical generalization bounds on MNIST. [Table: columns indexed by # hidden layers (R); rows: train error, test error, SNN train error, SNN test error, PAC-Bayes bound, KL divergence, # parameters (472k, 832k, 1193k, 472k), and VC dimension (26m, 66m, 121m, 26m); the remaining numerical entries were not preserved in the transcription.] We have shown that the type of flat minima found in practice can be turned into a generalization guarantee. The bounds are loose, but they are the only nonvacuous bounds in this setting.

29-30 Actually, SGD is pretty dangerous. SGD achieves zero training error reliably. Despite no explicit regularization, training and test error are very close. Explicit regularization has a minor effect. SGD can also reliably obtain zero training error on randomized labels; hence the Rademacher complexity of the model class is near maximal, with high probability.

31 Entropy-SGD (Chaudhari et al., 2017). Entropy-SGD replaces stochastic gradient descent on L_S by stochastic gradient ascent applied to the optimization problem arg max_{w in R^d} F_{γ,τ}(w; S), where F_{γ,τ}(w; S) = log ∫_{R^p} exp{ −τ L_S(w') − (τγ/2) ||w − w'||²_2 } dw'. The local entropy F_{γ,τ}(·; S) emphasizes flat minima of L_S.
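A rough sketch of one Entropy-SGD step under the formulation above: an inner SGLD loop samples w' from the Gibbs measure ρ(w') ∝ exp{−τ L_S(w') − (τγ/2)||w − w'||²} and tracks a running mean μ ≈ E_ρ[w']; the outer step then moves w toward μ, which is gradient ascent on F up to constants. The function `grad_loss(w, batch)` and all step sizes are assumptions (τ is absorbed into them), not the talk's settings.

```python
import numpy as np

def entropy_sgd_step(w, batches, grad_loss, gamma, eta_outer=0.1, eta_inner=1e-3,
                     n_inner=20, seed=0):
    """One outer step of Entropy-SGD (sketch): SGLD inner loop, then move w toward the inner mean."""
    rng = np.random.default_rng(seed)
    w_prime, mu = w.copy(), w.copy()
    for _, batch in zip(range(n_inner), batches):
        # SGLD update targeting rho(w') ∝ exp{-L_S(w') - (gamma/2)||w - w'||^2} (tau absorbed).
        g = grad_loss(w_prime, batch) + gamma * (w_prime - w)
        w_prime = w_prime - eta_inner * g + np.sqrt(2 * eta_inner) * rng.normal(size=w.shape)
        mu = 0.9 * mu + 0.1 * w_prime  # exponential moving average of the inner samples
    # Ascent on the local entropy: grad_w F is proportional to -(w - mu).
    return w - eta_outer * gamma * (w - mu)
```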

32-34 Entropy-SGD optimizes a PAC-Bayes bound w.r.t. the prior. Entropy-SGD optimizes the local entropy F_{γ,τ}(w; S) = log ∫_{R^p} exp{ −τ L_S(w') − (τγ/2) ||w − w'||²_2 } dw'.
Theorem. Maximizing F_{γ,τ}(w; S) w.r.t. w corresponds to minimizing a PAC-Bayes risk bound w.r.t. the prior's mean w.
Theorem. Let P(S) be an ε-differentially private distribution. Then for all D, P_{S~D^m}( for all Q: kl( L_S(Q) || L_D(Q) ) <= (KL(Q||P(S)) + ln(2m) + 2 max{ln(3/δ), mε²}) / (m − 1) ) >= 1 − δ.
We optimize F_{γ,τ}(w; S) using SGLD, obtaining (ε, δ)-differential privacy. SGLD is known to converge weakly to the ε-differentially private exponential mechanism. Our analysis makes a coarse approximation: the privacy of SGLD is taken to be that of the exponential mechanism.
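A minimal sketch of the right-hand side of the differentially private bound, with illustrative placeholder numbers; inverting the binary kl as in the earlier sketch (or using Pinsker's inequality, as below) converts this budget into a bound on L_D(Q).

```python
import math

def dp_pac_bayes_budget(kl_q_ps, m, delta, eps):
    """RHS of the differentially private PAC-Bayes bound on the slide."""
    return (kl_q_ps + math.log(2 * m) + 2 * max(math.log(3 / delta), m * eps ** 2)) / (m - 1)

# Illustrative placeholder values (not the talk's numbers).
budget = dp_pac_bayes_budget(kl_q_ps=5000.0, m=60000, delta=0.05, eps=0.01)
print("Pinsker relaxation of the bound:", 0.03 + math.sqrt(budget / 2))
```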

38 Conclusion. We show that the size/flatness/location of the minima found by SGD on MNIST imply generalization, using PAC-Bayes bounds. We show that Entropy-SGD optimizes the prior in a PAC-Bayes bound, which is not valid (the prior may not depend on the data). We give a differentially private version of the PAC-Bayes theorem and modify Entropy-SGD so that the prior is privately optimized.

39 References.
Achille, A. and S. Soatto (2017). On the Emergence of Invariance and Disentangling in Deep Representations. arXiv preprint.
Catoni, O. (2007). PAC-Bayesian supervised classification: the thermodynamics of statistical learning. arXiv [stat.ML].
Chaudhari, P., A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. Chayes, L. Sagun, and R. Zecchina (2017). Entropy-SGD: Biasing Gradient Descent Into Wide Valleys. In: International Conference on Learning Representations (ICLR). arXiv [cs.LG].
Hinton, G. E. and D. van Camp (1993). Keeping the Neural Networks Simple by Minimizing the Description Length of the Weights. In: Proceedings of the Sixth Annual Conference on Computational Learning Theory (COLT '93). Santa Cruz, California, USA: ACM.
Hochreiter, S. and J. Schmidhuber (1997). Flat Minima. Neural Computation 9(1).
Keskar, N. S., D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang (2017). On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. In: International Conference on Learning Representations (ICLR). arXiv [cs.LG].
Langford, J. and R. Caruana (2002). (Not) Bounding the True Error. In: Advances in Neural Information Processing Systems 14. Ed. by T. G. Dietterich, S. Becker, and Z. Ghahramani. MIT Press.
Langford, J. and M. Seeger (2001). Bounds for Averaging Classifiers. Tech. rep. CMU-CS. Carnegie Mellon University.
McAllester, D. A. (1999). PAC-Bayesian Model Averaging. In: Proceedings of the Twelfth Annual Conference on Computational Learning Theory (COLT '99). Santa Cruz, California, USA: ACM.
