Some Statistical Properties of Deep Networks

Transcription:

Some Statistical Properties of Deep Networks. Peter Bartlett, UC Berkeley. August 2, 2018.

Deep Networks

Deep compositions of nonlinear functions: $h = h_m \circ h_{m-1} \circ \cdots \circ h_1$, where each layer is, for example, a sigmoid layer $h_i : x \mapsto \sigma(W_i x)$ with $\sigma(v)_i = \frac{1}{1 + \exp(-v_i)}$, or a ReLU layer $h_i : x \mapsto r(W_i x)$ with $r(v)_i = \max\{0, v_i\}$.
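As a concrete illustration (not from the talk), here is a minimal NumPy sketch of such a composition; the layer widths and the names `deep_net`, `relu`, and `sigmoid` are choices made for this example.

```python
# A minimal sketch of a depth-m composition h = h_m o ... o h_1 with
# layers h_i(x) = nonlinearity(W_i x). Widths below are illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def relu(v):
    # r(v)_i = max{0, v_i}, applied coordinate-wise
    return np.maximum(0.0, v)

def sigmoid(v):
    # sigma(v)_i = 1 / (1 + exp(-v_i))
    return 1.0 / (1.0 + np.exp(-v))

# Weight matrices W_1, ..., W_m for assumed widths 10 -> 32 -> 32 -> 1.
widths = [10, 32, 32, 1]
weights = [rng.standard_normal((d_out, d_in))
           for d_in, d_out in zip(widths[:-1], widths[1:])]

def deep_net(x, weights, nonlinearity=relu):
    # Apply h_1 first, then h_2, ..., h_m.
    h = x
    for W in weights:
        h = nonlinearity(W @ h)
    return h

x = rng.standard_normal(widths[0])
print(deep_net(x, weights))           # ReLU network output
print(deep_net(x, weights, sigmoid))  # sigmoid network output
```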

Deep Networks

Representation learning: depth provides an effective way of representing useful features.

Rich non-parametric family: depth provides parsimonious representations, and nonlinear parameterizations provide better rates of approximation (Birman & Solomjak, 1967; DeVore et al., 1991). Some functions require much more complexity for a shallow representation (Telgarsky, 2015; Eldan & Shamir, 2015).

Statistical complexity?

Outline: VC theory (number of parameters); margins analysis (size of parameters); understanding generalization failures.

VC Theory

Assume the network maps to $\{-1, 1\}$ (threshold its output). Data are generated by a probability distribution $P$ on $X \times \{-1, 1\}$. We want to choose a function $f$ such that $P(f(x) \neq y)$ is small (near optimal).

VC Theory

Theorem (Vapnik and Chervonenkis). Suppose $F \subseteq \{-1,1\}^X$. For every probability distribution $P$ on $X \times \{-1,1\}$, with probability $1-\delta$ over $n$ iid examples $(x_1, y_1), \ldots, (x_n, y_n)$, every $f \in F$ satisfies
$$P(f(x) \neq y) \le \frac{1}{n}\bigl|\{i : f(x_i) \neq y_i\}\bigr| + \left(\frac{c}{n}\bigl(\mathrm{VCdim}(F) + \log(1/\delta)\bigr)\right)^{1/2}.$$

For uniform bounds (that is, for all distributions and all $f \in F$, proportions are close to probabilities), this inequality is tight to within a constant factor. For neural networks, the VC-dimension increases with the number of parameters and depends on the nonlinearity and depth.
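To get a feel for what the bound says at the scale of modern networks, the short illustrative calculation below evaluates the complexity term for a few sample sizes, taking $c = 1$ and $\delta = 0.01$ as assumed values (the theorem leaves the constant $c$ unspecified). It is only a plug-in sketch of the formula above, not a statement about any particular network.

```python
# Illustrative only: the VC bound's complexity term
#   sqrt((c / n) * (VCdim(F) + log(1/delta)))
# with assumed c = 1 and delta = 0.01.
import math

def vc_gap(vcdim, n, c=1.0, delta=0.01):
    return math.sqrt((c / n) * (vcdim + math.log(1.0 / delta)))

for vcdim in (10_000, 1_000_000):
    for n in (50_000, 10_000_000):
        print(f"VCdim={vcdim:>9,}  n={n:>12,}  gap term ~ {vc_gap(vcdim, n):.3f}")
```

The term is only small once $n$ is well beyond the VC-dimension, which is the sense in which "number of parameters" controls this style of bound.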

VC-Dimension of Neural Networks

Theorem. Consider the class $F$ of $\{-1,1\}$-valued functions computed by a network with $L$ layers, $p$ parameters, and $k$ computation units with the following nonlinearities:
1. Piecewise constant (linear threshold units): $\mathrm{VCdim}(F) = \tilde\Theta(p)$. (Baum and Haussler, 1989)
2. Piecewise linear (ReLUs): $\mathrm{VCdim}(F) = \tilde\Theta(pL)$. (B., Harvey, Liaw, Mehrabian, 2017)
3. Piecewise polynomial: $\mathrm{VCdim}(F) = \tilde O(pL^2)$. (B., Maiorov, Meir, 1998)
4. Sigmoid: $\mathrm{VCdim}(F) = \tilde O(p^2 k^2)$. (Karpinski and Macintyre, 1994)
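For a rough sense of scale, the sketch below counts the parameters $p$ and computation units $k$ of a fully-connected ReLU net with assumed layer widths and reports the $pL$ quantity that item 2 of the theorem uses; the architecture is illustrative, not one discussed in the talk.

```python
# A rough sketch: count parameters p and units k of a fully-connected ReLU net
# with assumed widths, and report the pL scaling from item 2 of the theorem.
def relu_net_stats(widths):
    # widths = [input_dim, hidden_1, ..., output_dim]; L layers of weights.
    L = len(widths) - 1
    p = sum(d_in * d_out + d_out                      # weights + biases per layer
            for d_in, d_out in zip(widths[:-1], widths[1:]))
    k = sum(widths[1:])                               # computation units
    return L, p, k

L, p, k = relu_net_stats([3072, 512, 512, 512, 10])   # assumed widths
print(f"L={L} layers, p={p:,} parameters, k={k:,} units; pL ~ {p * L:,}")
```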

Outline: VC theory (number of parameters); margins analysis (size of parameters); understanding generalization failures.

Generalization in Deep Networks

Spectrally-normalized margin bounds for neural networks. B., Dylan J. Foster, Matus Telgarsky, NIPS 2017. arXiv:1706.08498. (Dylan Foster, Cornell; Matus Telgarsky, UIUC.)

Large-Margin Classifiers

Consider a vector-valued function $f : X \to \mathbb{R}^m$ used for classification, with labels $y \in \{1, \ldots, m\}$. The prediction on $x \in X$ is $\arg\max_y f(x)_y$. For a pattern-label pair $(x, y) \in X \times \{1, \ldots, m\}$, define the margin
$$M(f(x), y) = f(x)_y - \max_{i \neq y} f(x)_i.$$
If $M(f(x), y) > 0$ then $f$ classifies $x$ correctly, and we can view a larger margin as a more confident correct classification. Minimizing a continuous loss, such as $\sum_{i=1}^n \|f(X_i) - Y_i\|^2$, encourages large margins. For large-margin classifiers, we should expect the fine-grained details of $f$ to be less important.
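A minimal sketch of this margin computation for a batch of score vectors (the array names and shapes are assumptions made for the example):

```python
# Multiclass margin M(f(x), y) = f(x)_y - max_{i != y} f(x)_i for a batch.
import numpy as np

def margins(scores, labels):
    # scores: (n, m) array of f(X_j); labels: (n,) class indices in {0, ..., m-1}.
    n = scores.shape[0]
    true_scores = scores[np.arange(n), labels]
    masked = scores.copy()
    masked[np.arange(n), labels] = -np.inf    # exclude the true class
    runner_up = masked.max(axis=1)
    return true_scores - runner_up            # > 0 iff classified correctly

scores = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.3,  0.2]])
labels = np.array([0, 2])
print(margins(scores, labels))  # [1.5, -0.1]: first correct by 1.5, second misclassified
```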

Generalization in Deep Networks

New results for generalization in deep ReLU networks: measure the size of the functions computed by a network of ReLUs via operator norms, and handle large multiclass (versus binary) classification.

Definitions. Operator norm: for a matrix $A_i$, $\|A_i\| := \sup_{\|x\| \le 1} \|A_i x\|$. Recall the multiclass margin function for $f : X \to \mathbb{R}^m$, $y \in \{1, \ldots, m\}$: $M(f(x), y) = f(x)_y - \max_{i \neq y} f(x)_i$.

Generalization in Deep Networks

Theorem. With high probability, every $f_A$ with $R_A \le r$ satisfies
$$\Pr(M(f_A(X), Y) \le 0) \le \frac{1}{n}\sum_{i=1}^n \mathbf{1}[M(f_A(X_i), Y_i) \le \gamma] + \tilde O\!\left(\frac{rL}{\gamma\sqrt{n}}\right).$$

Definitions. Network with $L$ layers, parameters $A_1, \ldots, A_L$:
$$f_A(x) := \sigma_L(A_L \, \sigma_{L-1}(A_{L-1} \cdots \sigma_1(A_1 x) \cdots)).$$
Scale of $f_A$:
$$R_A := \prod_{i=1}^L \|A_i\| \cdot \left(\sum_{i=1}^L \frac{\|A_i\|_{2,1}^{2/3}}{\|A_i\|^{2/3}}\right)^{3/2}.$$
(Assume each $\sigma_i$ is 1-Lipschitz and the inputs are normalized.)
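As a sketch of how one might compute the scale $R_A$ from a list of weight matrices: the operator norm is the largest singular value, and for the $(2,1)$-norm the code below sums the 2-norms of the columns, which is one common convention (the paper applies it to transposed matrices, so verify the convention before relying on this).

```python
# Sketch of R_A = (prod_i ||A_i||) * (sum_i ||A_i||_{2,1}^{2/3} / ||A_i||^{2/3})^{3/2}.
# ||.|| is the spectral norm; the (2,1)-norm here is the sum of column 2-norms (assumed convention).
import numpy as np

def spectral_complexity(weight_matrices):
    spec = [np.linalg.norm(A, ord=2) for A in weight_matrices]             # operator norms ||A_i||
    two_one = [np.linalg.norm(A, axis=0).sum() for A in weight_matrices]   # column-wise (2,1)-norms
    prod_spec = np.prod(spec)
    ratio_sum = sum((t ** (2 / 3)) / (s ** (2 / 3)) for t, s in zip(two_one, spec))
    return prod_spec * ratio_sum ** 1.5

rng = np.random.default_rng(0)
# Assumed shapes, chained so the layers compose: 32 -> 64 -> 32 -> 10.
weights = [rng.standard_normal((64, 32)),
           rng.standard_normal((32, 64)),
           rng.standard_normal((10, 32))]
print(spectral_complexity(weights))
```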

Outline: VC theory (number of parameters); margins analysis (size of parameters); understanding generalization failures.

Understanding Generalization Failures

CIFAR10 example images (image source: http://corochann.com/).

Understanding Generalization Failures

Stochastic gradient training error on CIFAR10 (Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals, 2017).

Understanding Generalization Failures

Training margins on CIFAR10 with true and random labels. How does this match the large-margin explanation?

Understanding Generalization Failures

If we rescale the margins by $R_A$ (the scale parameter): rescaled margins on CIFAR10.

Understanding Generalization Failures

If we rescale the margins by $R_A$ (the scale parameter): rescaled cumulative margins on MNIST.
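The rescaling step itself is simple; a sketch is below, with `margins_true`, `margins_rand`, `R_true`, and `R_rand` as stand-ins for quantities one would compute with the margin and spectral-complexity helpers sketched earlier (the numbers here are illustrative placeholders, not results from the talk).

```python
# Sketch: normalize margins by the network's scale R_A so that margin
# distributions of different networks (true vs. random labels) are comparable.
import numpy as np

def rescaled_margin_cdf(margins, scale, grid):
    normalized = np.asarray(margins) / scale
    return np.array([(normalized <= t).mean() for t in grid])

# Illustrative stand-in values only:
margins_true = np.random.default_rng(1).normal(8.0, 2.0, size=1000)
margins_rand = np.random.default_rng(2).normal(2.0, 2.0, size=1000)
R_true, R_rand = 50.0, 400.0
grid = np.linspace(-0.05, 0.2, 6)
print(rescaled_margin_cdf(margins_true, R_true, grid))
print(rescaled_margin_cdf(margins_rand, R_rand, grid))
```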

Generalization in Deep Networks

Theorem (restated). With high probability, every $f_A$ with $R_A \le r$ satisfies
$$\Pr(M(f_A(X), Y) \le 0) \le \frac{1}{n}\sum_{i=1}^n \mathbf{1}[M(f_A(X_i), Y_i) \le \gamma] + \tilde O\!\left(\frac{rL}{\gamma\sqrt{n}}\right),$$
where the network with $L$ layers and parameters $A_1, \ldots, A_L$ is $f_A(x) := \sigma_L(A_L \, \sigma_{L-1}(A_{L-1} \cdots \sigma_1(A_1 x) \cdots))$ and the scale is $R_A := \prod_{i=1}^L \|A_i\| \cdot \left(\sum_{i=1}^L \frac{\|A_i\|_{2,1}^{2/3}}{\|A_i\|^{2/3}}\right)^{3/2}$.

Understanding Generalization Failures

Plots: Lipschitz constant (the scale $R_A$) versus training epoch (epochs 10 to 100) on CIFAR10, for true labels ("cifar") and random labels ("cifar [random]").

Understanding Generalization Failures

Plots: excess risk and Lipschitz constant versus training epoch (epochs 10 to 100) on CIFAR10, for true labels ("cifar") and random labels ("cifar [random]").

Generalization in Neural Networks

With appropriate normalization, the margins analysis is qualitatively consistent with the generalization performance. Recent work by Golowich, Rakhlin, and Shamir gives bounds with improved dependence on depth. Regularization and optimization: explicit control of operator norms?