MythBusters: A Deep Learning Edition

1 / 8 MythBusters: A Deep Learning Edition
Sasha Rakhlin, MIT
Jan 18-19, 2018

2 / 8 Outline
A Few Remarks on Generalization Myths

3 / 8 Myth #1: Current theory is lacking because deep neural networks have too many parameters.

P. Bartlett, The Sample Complexity of Pattern Classification with Neural Networks: The Size of the Weights is More Important than the Size of the Network, 1998.

Margin theory was developed to address this very problem for Boosting and NN (e.g. Koltchinskii & Panchenko '02 and references therein).

Example: linear classifiers $\{x \mapsto \mathrm{sign}(\langle w, x \rangle) : \|w\|_2 \le 1\}$, and assume a margin. Then the dimension of $w$ (the number of parameters in a 1-layer NN) never appears in generalization bounds (and can be infinite). This observation already appears in the 1960s.

In Statistics, one often deals with infinite-dimensional models. The number of parameters is rarely the right notion of complexity (though, in classical statistics, it is still the right one for linear regression and other simple models). VC dimension is known to be a loose quantity (distribution-free, only an upper bound).
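For concreteness, here is the standard margin bound this refers to (a textbook statement added for context, stated only up to small constant factors): if $\|x\|_2 \le R$ and $\hat{L}_\gamma(w)$ denotes the empirical margin loss at margin $\gamma > 0$, then with probability at least $1-\delta$, for all $\|w\|_2 \le 1$,

$$\mathbb{P}\big(y \ne \mathrm{sign}(\langle w, x \rangle)\big) \;\le\; \hat{L}_\gamma(w) \;+\; \frac{2R}{\gamma \sqrt{n}} \;+\; \sqrt{\frac{\log(1/\delta)}{2n}}.$$

Only the norm bound, the margin, and the data radius enter; the ambient dimension of $w$ does not.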

4 / 8 A study of complexity notions

Our own (arguably incomplete) take on this problem: T. Liang, T. Poggio, J. Stokes, A. Rakhlin, Fisher-Rao Metric, Geometry, and Complexity of Neural Networks, 2017.

The Fisher local norm is a common starting point for many measures of complexity currently studied in the literature (see the work of Srebro's group and Bartlett et al.).

Information Geometry suggests Natural Gradient as the optimization method. It appears to resolve the ill-conditioned problems of Shalev-Shwartz et al. '17.
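As a reference point (standard definitions added here for context, not text from the slides): with model $p_\theta$ and loss $L$, the Fisher information matrix and the natural-gradient update are

$$F(\theta) = \mathbb{E}_{x \sim p_\theta}\!\left[\nabla_\theta \log p_\theta(x)\, \nabla_\theta \log p_\theta(x)^\top\right], \qquad \theta_{t+1} = \theta_t - \eta\, F(\theta_t)^{-1} \nabla_\theta L(\theta_t),$$

and the Fisher local norm referred to above is of the form $\|\theta\|_{\mathrm{fr}}^2 = \theta^\top F(\theta)\, \theta$.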

5 / 8 Myth #2: To prove good out-of-sample performance, we need to show uniform convergence (à la Vapnik) over some class.

The oldest counter-example: Cover and Hart, Nearest neighbor pattern classification, 1967.

Second (related) issue: uniform vs. universal consistency.

Uniform Consistency: there exists a sequence $\{\hat{y}_t\}_{t \ge 1}$ of estimators such that for any $\epsilon > 0$ there exists $n_\epsilon$ such that, for any distribution $P \in \mathcal{P}$ and any $n \ge n_\epsilon$, $\;\mathbb{E}\, L(\hat{y}_n) - \inf_f L(f) \le \epsilon$.

Universal Consistency: there exists a sequence $\{\hat{y}_t\}_{t \ge 1}$ of estimators such that for any distribution $P \in \mathcal{P}$ and any $\epsilon > 0$ there exists $n_\epsilon(P)$ such that, for any $n \ge n_\epsilon(P)$, $\;\mathbb{E}\, L(\hat{y}_n) - \inf_f L(f) \le \epsilon$.

Importantly, one can interpolate between the two notions using penalization. There are a few more approaches (e.g. using bracketing entropy); ask me after the talk.
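As a small numerical illustration of the Cover-Hart point (my own sketch, not part of the talk): a k-nearest-neighbor rule with $k \approx \sqrt{n}$ approaches the Bayes risk on a toy one-dimensional distribution, with no uniform-convergence argument over a fixed class anywhere in sight. The distribution, the $k$-schedule, and all names below are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # Toy 1-D distribution (an assumption for illustration):
    # X ~ Uniform[0, 1],  P(Y = 1 | X = x) = 0.5 + 0.4 sin(2 pi x).
    x = rng.uniform(0.0, 1.0, size=n)
    eta = 0.5 + 0.4 * np.sin(2 * np.pi * x)
    y = (rng.uniform(size=n) < eta).astype(int)
    return x, y

def knn_predict(x_train, y_train, x_test, k):
    # Brute-force k-nearest-neighbor majority vote in 1-D.
    dist = np.abs(x_test[:, None] - x_train[None, :])
    idx = np.argsort(dist, axis=1)[:, :k]
    return (y_train[idx].mean(axis=1) >= 0.5).astype(int)

# Bayes risk = E[min(eta(X), 1 - eta(X))], approximated on a grid.
grid = np.linspace(0.0, 1.0, 100_000)
eta_grid = 0.5 + 0.4 * np.sin(2 * np.pi * grid)
bayes_risk = np.mean(np.minimum(eta_grid, 1.0 - eta_grid))

x_test, y_test = sample(2_000)
for n in [100, 1_000, 5_000]:
    x_tr, y_tr = sample(n)
    k = max(1, int(np.sqrt(n)))   # k -> infinity, k/n -> 0: the universal-consistency regime
    err = np.mean(knn_predict(x_tr, y_tr, x_test, k) != y_test)
    print(f"n={n:5d}  k={k:3d}  test error={err:.3f}  (Bayes risk ~ {bayes_risk:.3f})")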

6 / 8 Myth #3: Sample complexity of neural nets scales exponentially with depth.

A common pitfall: drawing conclusions from (possibly loose) upper bounds.

Mostly resolved: N. Golowich, A. Rakhlin, O. Shamir, Size-Independent Sample Complexity of Neural Networks, 2017. Going from a $2^d$ to a $\sqrt{d}$ dependence was simply a technical issue; going from $\sqrt{d}$ to $O(1)$ requires more work.
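Schematically (a rough sketch of the flavor of such bounds, added for context; not the exact statements of the paper): for a depth-$d$ network with layer-wise norm bounds $\|W_j\| \le M_j$ and inputs bounded by $B$, norm-based Rademacher-complexity bounds take the form

$$\mathcal{R}_n \;\lesssim\; \frac{B \cdot c(d) \prod_{j=1}^{d} M_j}{\sqrt{n}},$$

where earlier analyses had $c(d) = 2^d$, the technical fix brings this down to $c(d) = \sqrt{d}$, and making $c(d) = O(1)$ (fully size-independent) requires additional assumptions and a different argument.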

7 / 8 Myth #4: If we can fit any set of labels, then Rademacher complexity is too large and, hence, nothing useful can be concluded.

Related to Myth #2, but let's illustrate with a slightly different technique. Bottom line: we can have a very large overall model, yet performance depends on the a posteriori complexity of the obtained solution.

Most trivial example: take a large $\mathcal{F} = \bigcup_k \mathcal{F}_k$, where $\mathcal{F}_k = \{f : \mathrm{compl}_n(f) \le k\}$, and for simplicity assume $\mathrm{compl}_n(f)$ is positive homogeneous. Suppose (this is standard) that with high probability

$$\forall f \in \mathcal{F}_1, \quad \mathbb{E} f - \hat{\mathbb{E}} f \;\le\; \mathcal{R}(\mathcal{F}_1) + \ldots$$

where $\mathcal{R}(\mathcal{F}_1)$ is the empirical Rademacher complexity. Then, with the same probability,

$$\forall f \in \mathcal{F}, \quad \mathbb{E} f - \hat{\mathbb{E}} f \;\le\; \mathrm{compl}_n(f) \cdot \mathcal{R}(\mathcal{F}_1) + \ldots$$

Conclusion: an a posteriori, data-dependent guarantee for every $f$, based on the complexity of $f$, yet $\mathcal{R}(\mathcal{F})$ never appears (it may be huge or infinite). If the complexity is not positive homogeneous, use a union bound instead.
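The step between the two displays is a one-line scaling argument (spelled out here for completeness; this is a gloss, not text from the slides): by positive homogeneity, $g := f / \mathrm{compl}_n(f)$ satisfies $\mathrm{compl}_n(g) = 1$, so $g \in \mathcal{F}_1$ and

$$\mathbb{E} g - \hat{\mathbb{E}} g \;\le\; \mathcal{R}(\mathcal{F}_1) + \ldots$$

Multiplying both sides by $\mathrm{compl}_n(f)$ and using linearity of $\mathbb{E}$ and $\hat{\mathbb{E}}$ gives the bound for $f$.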

8 / 8 So, is there anything left to do? Yes, tons. Perhaps we need to ask different questions.

What are the properties of solutions that optimization methods find in a nonconvex landscape? Is there implicit regularization that we can isolate? (A nice line of work by Srebro and co-authors.)

What are the salient features of the random landscape? Uniform deviations for gradients and Hessians? (Nice work by Montanari and co-authors.)

How can one exploit randomness to draw conclusions about optimization solutions? (E.g. the SGLD work of Raginsky et al., as well as papers on escaping saddles.)

What geometric notions can be associated with multi-layer neural nets? How can this geometry be exploited in optimization methods and be reflected in sample complexity?

Theoretical understanding of adversarial examples.

Etc.