Generalization in Deep Networks

Generalization in Deep Networks
Peter Bartlett, BAIR, UC Berkeley
November 28

Deep neural networks
Game playing (Jung Yeon-Je/AFP/Getty Images)
Image recognition (Krizhevsky et al., 2012)
Speech recognition (Graves et al., 2013)

Outline
What determines the statistical complexity of a deep network?
VC theory: number of parameters
Margins analysis: size of parameters
Understanding generalization failures

VC Theory
Assume the network maps to {-1, 1} (threshold its output).
Data are generated by a probability distribution P on X × {-1, 1}.
We want to choose a function f such that P(f(x) ≠ y) is small (near optimal).

VC Theory
Theorem (Vapnik and Chervonenkis). Suppose F ⊆ {-1, 1}^X. For every probability distribution P on X × {-1, 1}, with probability at least 1 - δ over n i.i.d. examples (x_1, y_1), ..., (x_n, y_n), every f in F satisfies

$$ P(f(x) \neq y) \;\le\; \frac{1}{n}\,\bigl|\{i : f(x_i) \neq y_i\}\bigr| \;+\; \left( \frac{c}{n}\bigl(\mathrm{VCdim}(F) + \log(1/\delta)\bigr) \right)^{1/2}. $$

For uniform bounds (that is, for all distributions and all f ∈ F, proportions are close to probabilities), this inequality is tight within a constant factor.
For neural networks, the VC-dimension:
increases with the number of parameters,
depends on the nonlinearity and the depth.

VC-Dimension of Neural Networks
Theorem. Consider the class F of {-1, 1}-valued functions computed by a network with L layers, p parameters, and k computation units with the following nonlinearities:
1. Piecewise constant (linear threshold units): VCdim(F) = Õ(p). (Baum and Haussler, 1989)
2. Piecewise linear (ReLUs): VCdim(F) = Õ(pL). (B., Harvey, Liaw, Mehrabian, 2017)
3. Piecewise polynomial: VCdim(F) = Õ(pL²). (B., Maiorov, Meir, 1998)
4. Sigmoid: VCdim(F) = Õ(p²k²). (Karpinski and Macintyre, 1994)
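To make the parameter-counting bound concrete, here is a minimal Python sketch (not from the talk) that evaluates the VC error term for a ReLU network using the Õ(pL) scaling above. The constant c and the log factors hidden by Õ(·) are unspecified in the theorem, so the value c = 1 and the parameter counts below are purely illustrative.

```python
import math

def vc_error_term(vcdim, n, delta=0.01, c=1.0):
    """Error term of the VC bound: sqrt((c / n) * (VCdim(F) + log(1 / delta))).
    The constant c is unspecified in the theorem; c = 1 is illustrative only."""
    return math.sqrt((c / n) * (vcdim + math.log(1.0 / delta)))

# Hypothetical modern network: p parameters, L layers; VCdim = Õ(pL) for ReLUs,
# with the constants and log factors hidden by Õ(·) dropped.
p, L = 60_000_000, 20
vcdim_relu = p * L

for n in (50_000, 10**7, 10**9):
    print(f"n = {n:>12,d}   error term = {vc_error_term(vcdim_relu, n):.3g}")
# At CIFAR10 scale (n = 50,000) the term is far above 1: the parameter-counting
# bound is vacuous for heavily overparameterized networks.
```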

Generalization in Neural Networks: Number of Parameters
[Figure: NIPS]

Outline
What determines the statistical complexity of a deep network?
VC theory: number of parameters
Margins analysis: size of parameters
Understanding generalization failures

Large-Margin Classifiers
Consider a real-valued function f : X → R used for classification.
The prediction on x ∈ X is sign(f(x)) ∈ {-1, 1}.
For a pattern-label pair (x, y) ∈ X × {-1, 1}, if y f(x) > 0 then f classifies x correctly.
We call y f(x) the margin of f on x.
We can view a larger margin as a more confident correct classification.
Minimizing a continuous loss, such as

$$ \sum_{i=1}^{n} (f(X_i) - Y_i)^2, $$

encourages large margins.
For large-margin classifiers, we should expect the fine-grained details of f to be less important.
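As a quick illustration of these definitions, the following sketch (not from the talk) computes the margins y_i f(x_i) and the empirical margin loss at a threshold γ; the scores and labels are hypothetical.

```python
import numpy as np

def margins(scores, labels):
    """Margins y_i * f(x_i) for real-valued scores f(x_i) and labels y_i in {-1, +1}."""
    return labels * scores

def empirical_margin_loss(scores, labels, gamma):
    """Fraction of examples with margin at most gamma: (1/n) * sum_i 1[y_i f(x_i) <= gamma]."""
    return float(np.mean(margins(scores, labels) <= gamma))

# Hypothetical scores and labels:
scores = np.array([2.3, -0.4, 1.1, 0.2, -3.0])
labels = np.array([1, 1, 1, -1, -1])
print(empirical_margin_loss(scores, labels, gamma=0.0))  # training error: margin <= 0
print(empirical_margin_loss(scores, labels, gamma=0.5))  # stricter margin requirement
```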

Generalization: Margins and Size of Parameters
Theorem (B., 1996)
1. With high probability over n training examples (X_1, Y_1), ..., (X_n, Y_n) ∈ X × {±1}, every f ∈ F ⊆ R^X has

$$ \Pr(\mathrm{sign}(f(X)) \neq Y) \;\le\; \frac{1}{n} \sum_{i=1}^{n} 1[Y_i f(X_i) \le \gamma] \;+\; \tilde O\!\left( \sqrt{\frac{\mathrm{fat}_F(\gamma)}{n}} \right). $$

2. If the functions in F are computed by two-layer sigmoid networks with each unit's weights bounded in 1-norm, that is, ‖w‖_1 ≤ B, then fat_F(γ) = Õ((B/γ)²).

The bound depends on the margin loss plus an error term.
Minimizing quadratic loss or cross-entropy loss leads to large margins.
fat_F(γ) is a scale-sensitive version of the VC-dimension. Unlike the VC-dimension, it need not grow with the number of parameters.
The same ideas were used to give rigorous dimension-independent generalization bounds for SVMs (B. and Shawe-Taylor, 1999)...
... and for the margins analysis of AdaBoost (Schapire, Freund, B., Lee, 1998).
The scale of functions f ∈ F is important: bigger f's give bigger margins, so fat_F(γ) should be bigger.
The output y of a sigmoid layer has ‖y‖_∞ ≤ 1, so ‖w‖_1 ≤ B controls the scale of f.
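To see how the scale-sensitive bound behaves, here is a small sketch (not part of the talk) that evaluates the error term using the two-layer sigmoid estimate fat_F(γ) = Õ((B/γ)²). The constants and log factors hidden by Õ(·) are dropped, and the values of B, γ, and n are hypothetical.

```python
import math

def margin_bound_error_term(B, gamma, n):
    """Error term sqrt(fat_F(gamma) / n) with fat_F(gamma) ~ (B / gamma)^2 for a
    two-layer sigmoid network whose unit weights satisfy ||w||_1 <= B.
    Constants and log factors hidden by Õ(·) are dropped; illustrative only."""
    fat = (B / gamma) ** 2
    return math.sqrt(fat / n)

# The number of parameters never appears: only B, gamma, and n matter.
for n in (1_000, 50_000, 1_000_000):
    print(n, round(margin_bound_error_term(B=5.0, gamma=1.0, n=n), 4))
```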

Generalization: Margins and Size of Parameters
1996: sigmoid networks. Qualitative behavior explained by the small-weights theorem.
2017: deep ReLU networks (simons.berkeley.edu). How should we measure the complexity of a ReLU network?

Outline
What determines the statistical complexity of a deep network?
VC theory: number of parameters
Margins analysis: size of parameters
Understanding generalization failures

Explaining Generalization Failures
[Figure: CIFAR10]

Explaining Generalization Failures
Stochastic gradient training error on CIFAR10 (Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals, 2017).

Explaining Generalization Failures
Training margins on CIFAR10 with true and random labels.
How does this match the large-margin explanation?
Need to account for the scale of the neural network functions.
What is the appropriate notion of the size of these functions?

Generalization in Deep Networks
Spectrally-normalized margin bounds for neural networks. B., Dylan J. Foster, Matus Telgarsky (arXiv preprint).
Matus Telgarsky (UIUC), Dylan Foster (Cornell)

Generalization in Deep Networks
New results for generalization in deep ReLU networks:
Measuring the size of functions computed by a network of ReLUs.
(Cf. sigmoid networks: the output y of a layer has ‖y‖_∞ ≤ 1, so ‖w‖_1 ≤ B keeps the scale under control.)
Large multiclass versus binary classification.

Definitions
Operator norms: for a matrix A_i, ‖A_i‖ := sup_{‖x‖ ≤ 1} ‖A_i x‖.
Multiclass margin function for f : X → R^m and y ∈ {1, ..., m}: M(f(x), y) = f(x)_y - max_{i ≠ y} f(x)_i.
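The two definitions above translate directly into code. The sketch below (not from the talk; the 3-class score vector and the matrix are hypothetical) computes the operator norm as the largest singular value and the multiclass margin M(f(x), y).

```python
import numpy as np

def operator_norm(A):
    """Operator norm sup_{||x|| <= 1} ||A x||, i.e. the largest singular value of A."""
    return float(np.linalg.norm(A, ord=2))

def multiclass_margin(scores, y):
    """M(f(x), y) = f(x)_y - max_{i != y} f(x)_i for a score vector f(x) in R^m."""
    others = np.delete(scores, y)
    return float(scores[y] - np.max(others))

# Hypothetical 3-class score vector; class 1 is the true label:
scores = np.array([1.2, 3.4, 2.9])
print(multiclass_margin(scores, y=1))                     # 0.5: correct, but a small margin
print(operator_norm(np.array([[2.0, 0.0], [0.0, 0.5]])))  # 2.0
```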

Generalization in Deep Networks
Theorem. With high probability, every f_A with R_A ≤ r satisfies

$$ \Pr(M(f_A(X), Y) \le 0) \;\le\; \frac{1}{n} \sum_{i=1}^{n} 1[M(f_A(X_i), Y_i) \le \gamma] \;+\; \tilde O\!\left( \frac{r L}{\gamma \sqrt{n}} \right). $$

Definitions
Network with L layers and parameters A_1, ..., A_L:
$$ f_A(x) := \sigma_L(A_L\, \sigma_{L-1}(A_{L-1} \cdots \sigma_1(A_1 x) \cdots)). $$
Scale of f_A:
$$ R_A := \prod_{i=1}^{L} \|A_i\| \; \sum_{i=1}^{L} \frac{\|A_i\|_F}{\|A_i\|}. $$
(Assume each σ_i is 1-Lipschitz and inputs are normalized.)
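A minimal sketch of the scale R_A as stated on the slide: the product of operator norms times the sum of Frobenius-to-operator-norm ratios. The layer shapes are hypothetical, and this follows the slide's simplified Frobenius-based definition rather than any refinement used in the paper.

```python
import numpy as np

def spectral_complexity(weights):
    """Scale R_A as on the slide: (prod_i ||A_i||) * (sum_i ||A_i||_F / ||A_i||),
    where ||A_i|| is the operator (spectral) norm and ||A_i||_F the Frobenius norm."""
    op = [np.linalg.norm(A, ord=2) for A in weights]
    fro = [np.linalg.norm(A, ord='fro') for A in weights]
    return float(np.prod(op) * sum(f / o for f, o in zip(fro, op)))

# Hypothetical 3-layer ReLU network (input dim 32, two hidden layers of 100, 10 outputs):
rng = np.random.default_rng(0)
weights = [rng.normal(size=(100, 32)) / np.sqrt(32),
           rng.normal(size=(100, 100)) / np.sqrt(100),
           rng.normal(size=(10, 100)) / np.sqrt(100)]
print(spectral_complexity(weights))
```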

Explaining Generalization Failures
Stochastic gradient training error on CIFAR10 (Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals, 2017).

Explaining Generalization Failures
Training margins on CIFAR10 with true and random labels. How does this match the large-margin explanation?

Explaining Generalization Failures
If we rescale the margins by R_A (the scale parameter):
Rescaled margins on CIFAR10.
Rescaled cumulative margins on MNIST.
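The rescaling itself is just a division by R_A. The sketch below uses hypothetical margin values and R_A values (not the talk's data) to show how normalizing makes margin distributions from differently scaled networks comparable.

```python
import numpy as np

def rescaled_margins(margin_values, r_a):
    """Divide raw multiclass margins M(f_A(x_i), y_i) by the scale R_A of the network."""
    return np.asarray(margin_values, dtype=float) / r_a

# Hypothetical raw margins from two training runs (true labels vs. random labels),
# where the random-label run needed a much larger scale R_A to fit the data:
true_raw,   true_ra   = np.array([4.0, 6.5, 5.2, 7.1]), 10.0
random_raw, random_ra = np.array([3.8, 6.0, 5.0, 6.9]), 80.0
print(np.sort(rescaled_margins(true_raw, true_ra)))      # comparatively large normalized margins
print(np.sort(rescaled_margins(random_raw, random_ra)))  # much smaller after normalization
```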

Generalization in Deep Networks
Theorem (restated). With high probability, every f_A with R_A ≤ r satisfies

$$ \Pr(M(f_A(X), Y) \le 0) \;\le\; \frac{1}{n} \sum_{i=1}^{n} 1[M(f_A(X_i), Y_i) \le \gamma] \;+\; \tilde O\!\left( \frac{r L}{\gamma \sqrt{n}} \right), $$

with the network f_A and the scale R_A defined as above.

Generalization in Neural Networks
With appropriate normalization, the margins analysis is qualitatively consistent with the generalization performance.
Margin bounds extend to residual networks.
Lower bounds?
Regularization: explicit control of operator norms?
Role of depth?
Interplay with optimization?

Outline
What determines the statistical complexity of a deep network?
VC theory: number of parameters
Margins analysis: size of parameters
Understanding generalization failures
