Some Statistical Properties of Deep Networks

Peter Bartlett, UC Berkeley. August 2.

Deep Networks
Deep compositions of nonlinear functions: $h = h_m \circ h_{m-1} \circ \cdots \circ h_1$, e.g.,
$h_i : x \mapsto \sigma(W_i x)$, where $\sigma(v)_i = \dfrac{1}{1 + \exp(-v_i)}$ (sigmoid), or
$h_i : x \mapsto r(W_i x)$, where $r(v)_i = \max\{0, v_i\}$ (ReLU).
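
As a concrete illustration of this composition, here is a minimal sketch (not from the slides; the layer widths and random weights are made up for the example):

```python
import numpy as np

def sigmoid(v):
    # sigma(v)_i = 1 / (1 + exp(-v_i)), applied coordinate-wise
    return 1.0 / (1.0 + np.exp(-v))

def relu(v):
    # r(v)_i = max{0, v_i}, applied coordinate-wise
    return np.maximum(0.0, v)

def deep_net(x, weights, nonlinearity=relu):
    # h = h_m o h_{m-1} o ... o h_1, with h_i : x -> nonlinearity(W_i x)
    h = x
    for W in weights:
        h = nonlinearity(W @ h)
    return h

rng = np.random.default_rng(0)
weights = [rng.standard_normal((20, 10)),
           rng.standard_normal((20, 20)),
           rng.standard_normal((5, 20))]
x = rng.standard_normal(10)
print(deep_net(x, weights).shape)            # (5,)
print(deep_net(x, weights, sigmoid).shape)   # (5,)
```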

Deep Networks
Representation learning: depth provides an effective way of representing useful features.
Rich non-parametric family: depth provides parsimonious representations; nonlinear parameterizations provide better rates of approximation (Birman & Solomjak, 1967; DeVore et al., 1991), and some functions require much more complexity for a shallow representation (Telgarsky, 2015; Eldan & Shamir, 2015).
Statistical complexity?

Outline
- VC theory: number of parameters
- Margins analysis: size of parameters
- Understanding generalization failures

VC Theory
Assume the network maps to $\{-1, 1\}$ (threshold its output).
Data are generated by a probability distribution $P$ on $X \times \{-1, 1\}$.
We want to choose a function $f$ such that $P(f(x) \neq y)$ is small (near optimal).

VC Theory
Theorem (Vapnik and Chervonenkis). Suppose $F \subseteq \{-1, 1\}^X$. For every probability distribution $P$ on $X \times \{-1, 1\}$, with probability $1 - \delta$ over $n$ i.i.d. examples $(x_1, y_1), \ldots, (x_n, y_n)$, every $f$ in $F$ satisfies
$$P(f(x) \neq y) \leq \frac{1}{n} \left|\{ i : f(x_i) \neq y_i \}\right| + \left( \frac{c}{n} \bigl( \mathrm{VCdim}(F) + \log(1/\delta) \bigr) \right)^{1/2}.$$
For uniform bounds (that is, for all distributions and all $f \in F$, proportions are close to probabilities), this inequality is tight to within a constant factor.
For neural networks, the VC-dimension:
- increases with the number of parameters
- depends on the nonlinearity and the depth
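
To get a feel for the second term of the bound, here is a small illustrative calculation (not from the slides; the constant $c$ is an unspecified universal constant and is set to 1 purely for the example):

```python
import math

def vc_bound_gap(vcdim, n, delta, c=1.0):
    # Second term of the VC bound: sqrt((c / n) * (VCdim(F) + log(1 / delta))).
    # c is an unspecified universal constant; c = 1 here only for illustration.
    return math.sqrt((c / n) * (vcdim + math.log(1.0 / delta)))

# The gap shrinks like sqrt(VCdim(F) / n): roughly, n must be large
# relative to the VC-dimension before the bound becomes informative.
for n in (10_000, 100_000, 1_000_000):
    print(n, round(vc_bound_gap(vcdim=50_000, n=n, delta=0.01), 3))
```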

VC-Dimension of Neural Networks
Theorem. Consider the class $F$ of $\{-1, 1\}$-valued functions computed by a network with $L$ layers, $p$ parameters, and $k$ computation units with the following nonlinearities:
1. Piecewise constant (linear threshold units): $\mathrm{VCdim}(F) = \Theta(p)$. (Baum and Haussler, 1989)
2. Piecewise linear (ReLUs): $\mathrm{VCdim}(F) = \Theta(pL)$. (B., Harvey, Liaw, Mehrabian, 2017)
3. Piecewise polynomial: $\mathrm{VCdim}(F) = \tilde{O}(pL^2)$. (B., Maiorov, Meir, 1998)
4. Sigmoid: $\mathrm{VCdim}(F) = \tilde{O}(p^2 k^2)$. (Karpinsky and MacIntyre, 1994)
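
For intuition about how $p$ and $L$ enter these rates, the following sketch counts the parameters of a fully connected architecture (illustrative only: the layer widths are made up, and the constants hidden in the $\Theta(\cdot)$ and $\tilde{O}(\cdot)$ are ignored):

```python
def num_parameters(widths):
    # widths = [d_0, d_1, ..., d_L]: input dimension d_0, then L layers of units.
    # A fully connected layer has d_out * d_in weights plus d_out biases.
    return sum(d_out * d_in + d_out
               for d_in, d_out in zip(widths[:-1], widths[1:]))

widths = [784, 512, 512, 512, 1]   # hypothetical 4-layer ReLU network
p = num_parameters(widths)
L = len(widths) - 1
print(p, L, p * L)   # p * L sets the scale of VCdim for piecewise-linear units
```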

Outline
- VC theory: number of parameters
- Margins analysis: size of parameters
- Understanding generalization failures

Generalization in Deep Networks
Spectrally-normalized margin bounds for neural networks. B., Dylan J. Foster, Matus Telgarsky. NIPS, arXiv.
(Dylan Foster: Cornell. Matus Telgarsky: UIUC.)

Large-Margin Classifiers
Consider a vector-valued function $f : X \to \mathbb{R}^m$ used for classification, with labels $y \in \{1, \ldots, m\}$.
The prediction on $x \in X$ is $\arg\max_y f(x)_y$.
For a pattern-label pair $(x, y) \in X \times \{1, \ldots, m\}$, define the margin $M(f(x), y) = f(x)_y - \max_{i \neq y} f(x)_i$.
If $M(f(x), y) > 0$ then $f$ classifies $x$ correctly. We can view a larger margin as a more confident correct classification.
Minimizing a continuous loss, such as $\sum_{i=1}^n \| f(X_i) - Y_i \|^2$, encourages large margins.
For large-margin classifiers, we should expect the fine-grained details of $f$ to be less important.
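
A direct translation of the margin definition into code (a minimal sketch; the scores below are made-up numbers):

```python
import numpy as np

def multiclass_margin(scores, y):
    # M(f(x), y) = f(x)_y - max_{i != y} f(x)_i
    others = np.delete(scores, y)
    return scores[y] - np.max(others)

scores = np.array([1.2, -0.3, 0.8, 0.1])   # f(x) for m = 4 classes
print(multiclass_margin(scores, y=0))      # 0.4 > 0: correctly classified
print(multiclass_margin(scores, y=1))      # negative: misclassified
```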

Generalization in Deep Networks
New results for generalization in deep ReLU networks:
- Measure the size of functions computed by a network of ReLUs via operator norms.
- Large multiclass versus binary classification.
Definitions. Consider operator norms: for a matrix $A_i$, $\|A_i\| := \sup_{\|x\| \leq 1} \|A_i x\|$.
Recall the multiclass margin function for $f : X \to \mathbb{R}^m$, $y \in \{1, \ldots, m\}$: $M(f(x), y) = f(x)_y - \max_{i \neq y} f(x)_i$.
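
With Euclidean norms this operator norm is the spectral norm, i.e. the largest singular value, which is easy to compute (a small sketch with an arbitrary random matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))

# ||A|| = sup_{||x|| <= 1} ||A x|| = largest singular value of A
spectral_norm = np.linalg.norm(A, ord=2)
top_singular_value = np.linalg.svd(A, compute_uv=False)[0]
print(spectral_norm, top_singular_value)   # the two agree
```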

Generalization in Deep Networks
Theorem. With high probability, every $f_A$ with $R_A \leq r$ satisfies
$$\Pr\bigl(M(f_A(X), Y) \leq 0\bigr) \leq \frac{1}{n} \sum_{i=1}^n \mathbf{1}\bigl[ M(f_A(X_i), Y_i) \leq \gamma \bigr] + \tilde{O}\!\left( \frac{r L}{\gamma \sqrt{n}} \right).$$
Definitions. Network with $L$ layers, parameters $A_1, \ldots, A_L$:
$$f_A(x) := \sigma_L\bigl(A_L \, \sigma_{L-1}(A_{L-1} \cdots \sigma_1(A_1 x) \cdots)\bigr).$$
Scale of $f_A$:
$$R_A := \prod_{i=1}^L \|A_i\| \cdot \left( \sum_{i=1}^L \frac{\|A_i\|_{2,1}^{2/3}}{\|A_i\|^{2/3}} \right)^{3/2}.$$
(Assume each $\sigma_i$ is 1-Lipschitz and inputs are normalized.)
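
The scale $R_A$ combines the spectral norms and the $(2,1)$ group norms of the layer matrices. Here is a hedged sketch of how one might compute it from a list of weight matrices (it follows the formula on this slide, taking $\|A\|_{2,1}$ to be the sum of the Euclidean norms of the columns of $A$; the random matrices stand in for trained weights):

```python
import numpy as np

def norm_2_1(A):
    # ||A||_{2,1}: sum of the Euclidean norms of the columns of A (assumed convention).
    return np.linalg.norm(A, axis=0).sum()

def spectral_complexity(weights):
    # R_A = prod_i ||A_i|| * ( sum_i ||A_i||_{2,1}^{2/3} / ||A_i||^{2/3} )^{3/2}
    spectral = [np.linalg.norm(A, ord=2) for A in weights]
    ratio_sum = sum((norm_2_1(A) / s) ** (2.0 / 3.0)
                    for A, s in zip(weights, spectral))
    return float(np.prod(spectral)) * ratio_sum ** 1.5

rng = np.random.default_rng(0)
weights = [rng.standard_normal((64, 32)),
           rng.standard_normal((64, 64)),
           rng.standard_normal((10, 64))]
print(spectral_complexity(weights))
```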

Outline
- VC theory: number of parameters
- Margins analysis: size of parameters
- Understanding generalization failures

Understanding Generalization Failures
[Figure: CIFAR10]

Understanding Generalization Failures
[Figure: stochastic gradient training error on CIFAR10 (Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals, 2017)]

Understanding Generalization Failures
[Figure: training margins on CIFAR10 with true and random labels]
How does this match the large-margin explanation?

Understanding Generalization Failures
If we rescale the margins by $R_A$ (the scale parameter):
[Figure: rescaled margins on CIFAR10]
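
The rescaling in these plots amounts to dividing each training margin by the network's scale. A hypothetical sketch of that normalization (the scores, labels, and scale below are placeholders; in practice $R_A$ would come from a computation like the spectral_complexity sketch above):

```python
import numpy as np

def normalized_margins(score_matrix, labels, R_A):
    # M(f(x_i), y_i) / R_A for each training example.
    margins = []
    for scores, y in zip(score_matrix, labels):
        others = np.delete(scores, y)
        margins.append(scores[y] - np.max(others))
    return np.array(margins) / R_A

scores = np.array([[1.2, -0.3,  0.8, 0.1],
                   [0.2,  2.0, -1.0, 0.5],
                   [0.0,  0.1,  0.2, 0.3]])
labels = np.array([0, 1, 3])
print(normalized_margins(scores, labels, R_A=50.0))
# Comparing these normalized margin distributions for networks trained on
# true versus random labels is what the plots on these slides show.
```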

Understanding Generalization Failures
If we rescale the margins by $R_A$ (the scale parameter):
[Figure: rescaled cumulative margins on MNIST]

Generalization in Deep Networks
(Recall the theorem: with high probability, every $f_A$ with scale $R_A \leq r$ satisfies the spectrally-normalized margin bound above.)

Understanding Generalization Failures
[Figure: Lipschitz constant versus training epoch, for CIFAR10 with true labels ("cifar") and random labels ("cifar [random]")]

Understanding Generalization Failures
[Figure: excess risk versus training epoch, with and without Lipschitz rescaling, for CIFAR10 with true labels ("cifar") and random labels ("cifar [random]")]

Generalization in Neural Networks
With appropriate normalization, the margins analysis is qualitatively consistent with the generalization performance.
Recent work by Golowich, Rakhlin, and Shamir gives bounds with improved dependence on depth.
Regularization and optimization: explicit control of operator norms?
