Generalization in Deep Networks
Generalization in Deep Networks. Peter Bartlett, BAIR, UC Berkeley. November 28.
Deep neural networks: game playing. (Image: Jung Yeon-Je/AFP/Getty Images)
Deep neural networks: image recognition. (Krizhevsky et al., 2012)
Deep neural networks: speech recognition. (Graves et al., 2013)
Outline
What determines the statistical complexity of a deep network?
- VC theory: number of parameters
- Margins analysis: size of parameters
- Understanding generalization failures
VC Theory
Assume the network maps to {-1, 1} (threshold its output).
Data are generated by a probability distribution P on X × {-1, 1}.
We want to choose a function f such that P(f(x) ≠ y) is small (near optimal).

Theorem (Vapnik and Chervonenkis). Suppose F ⊆ {-1, 1}^X. For every probability distribution P on X × {-1, 1}, with probability 1 - δ over n i.i.d. examples (x_1, y_1), ..., (x_n, y_n), every f in F satisfies
$$P(f(x) \neq y) \;\le\; \frac{1}{n}\bigl|\{i : f(x_i) \neq y_i\}\bigr| + \left(\frac{c}{n}\bigl(\mathrm{VCdim}(F) + \log(1/\delta)\bigr)\right)^{1/2}.$$
For uniform bounds (that is, for all distributions and all f ∈ F, proportions are close to probabilities), this inequality is tight to within a constant factor.
For neural networks, the VC-dimension:
- increases with the number of parameters;
- depends on the nonlinearity and the depth.
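To see why a parameter-counting bound says little about modern over-parameterized networks, here is a minimal numeric sketch of the bound's right-hand side. The constant c is unspecified in the theorem, so c = 1 below is an illustrative placeholder, and the dataset sizes are hypothetical:

```python
import numpy as np

def vc_bound(train_error, n, vc_dim, delta, c=1.0):
    """Right-hand side of the VC bound:
    train_error + sqrt(c/n * (VCdim + log(1/delta))).
    c is unspecified in the theorem; c=1 is a placeholder."""
    return train_error + np.sqrt(c / n * (vc_dim + np.log(1.0 / delta)))

# A network with ~1M parameters needs far more than 1M examples
# before this bound says anything non-trivial:
print(vc_bound(train_error=0.05, n=50_000, vc_dim=1_000_000, delta=0.01))      # > 1, vacuous
print(vc_bound(train_error=0.05, n=100_000_000, vc_dim=1_000_000, delta=0.01))  # ~0.15
```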
VC-Dimension of Neural Networks
Theorem. Consider the class F of {-1, 1}-valued functions computed by a network with L layers, p parameters, and k computation units with the following nonlinearities:
1. Piecewise constant (linear threshold units): VCdim(F) = Õ(p). (Baum and Haussler, 1989)
2. Piecewise linear (ReLUs): VCdim(F) = Õ(pL). (B., Harvey, Liaw, Mehrabian, 2017)
3. Piecewise polynomial: VCdim(F) = Õ(pL²). (B., Maiorov, Meir, 1998)
4. Sigmoid: VCdim(F) = Õ(p²k²). (Karpinski and Macintyre, 1994)
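For a concrete sense of scale, the sketch below counts the parameters p of a hypothetical fully connected architecture and reports the pL quantity that governs case 2; constants and log factors hidden in Õ are ignored:

```python
def mlp_param_count(layer_widths):
    """Weights plus biases of a fully connected network with the
    given layer widths, e.g. [784, 512, 512, 10]."""
    return sum(m * k + k for m, k in zip(layer_widths[:-1], layer_widths[1:]))

widths = [784, 512, 512, 10]   # hypothetical architecture
p = mlp_param_count(widths)
L = len(widths) - 1            # number of layers
print(p, "parameters; ReLU-net VCdim scale ~ p*L =", p * L)
```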
Generalization in Neural Networks: Number of Parameters. [Figure: NIPS.]
Outline recap; next: margins analysis (size of parameters).
Large-Margin Classifiers
Consider a real-valued function f : X → R used for classification.
The prediction on x ∈ X is sign(f(x)) ∈ {-1, 1}.
For a pattern-label pair (x, y) ∈ X × {-1, 1}, if y f(x) > 0 then f classifies x correctly.
We call y f(x) the margin of f on x.
We can view a larger margin as a more confident correct classification.
Minimizing a continuous loss, such as $\sum_{i=1}^{n} (f(X_i) - Y_i)^2$, encourages large margins.
For large-margin classifiers, we should expect the fine-grained details of f to be less important.
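A minimal numpy sketch of these definitions for a synthetic linear classifier (all data and names here are illustrative): it computes the margins Y_i f(X_i) and the squared loss from the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = rng.normal(size=5)
y = np.sign(X @ w_true)                # synthetic labels in {-1, +1}
w = w_true + 0.5 * rng.normal(size=5)  # a perturbed classifier

f = X @ w                              # real-valued scores f(X_i)
margins = y * f                        # margin of f on X_i: Y_i * f(X_i)
sq_loss = np.sum((f - y) ** 2)         # the continuous loss from the slide

print(f"accuracy {(margins > 0).mean():.2f}, squared loss {sq_loss:.1f}")
```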
Generalization: Margins and Size of Parameters
Theorem (B., 1996).
1. With high probability over n training examples (X_1, Y_1), ..., (X_n, Y_n) ∈ X × {±1}, every f ∈ F ⊆ R^X has
$$\Pr(\mathrm{sign}(f(X)) \neq Y) \;\le\; \frac{1}{n}\sum_{i=1}^{n} 1[Y_i f(X_i) \le \gamma] + \tilde{O}\!\left(\sqrt{\frac{\mathrm{fat}_F(\gamma)}{n}}\right).$$
2. If the functions in F are computed by two-layer sigmoid networks with each unit's weights bounded in 1-norm, that is, ‖w‖_1 ≤ B, then fat_F(γ) = Õ((B/γ)²).

Remarks:
- The bound depends on the margin loss plus an error term.
- Minimizing quadratic loss or cross-entropy loss leads to large margins.
- fat_F(γ) is a scale-sensitive version of the VC-dimension. Unlike the VC-dimension, it need not grow with the number of parameters.
- The same ideas were used to give rigorous dimension-independent generalization bounds for SVMs (B. and Shawe-Taylor, 1999) and a margins analysis of AdaBoost (Schapire, Freund, B., Lee, 1998).
- The scale of the functions f ∈ F is important: bigger f's give bigger margins, so fat_F(γ) should be bigger. The output y of a sigmoid layer satisfies ‖y‖_∞ ≤ 1, so ‖w‖_1 ≤ B controls the scale of f.
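A rough numeric sketch of the bound's right-hand side, dropping the constants and log factors hidden in Õ and using stand-in margin values (all numbers illustrative):

```python
import numpy as np

def margin_bound_rhs(margins, gamma, fat_dim, n):
    """Empirical margin loss at scale gamma plus sqrt(fat_F(gamma)/n);
    log factors in the Õ term are dropped."""
    margin_loss = np.mean(margins <= gamma)   # fraction with Y_i f(X_i) <= gamma
    return margin_loss + np.sqrt(fat_dim / n)

# Two-layer sigmoid net with ||w||_1 <= B: fat_F(gamma) ~ (B/gamma)^2
B, gamma, n = 5.0, 1.0, 60_000
margins = np.random.default_rng(1).normal(loc=2.0, scale=1.0, size=n)  # stand-in margins
print(margin_bound_rhs(margins, gamma, fat_dim=(B / gamma) ** 2, n=n))
```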
Generalization: Margins and Size of Parameters
1996: sigmoid networks. Qualitative behavior explained by the small-weights theorem.
2017: deep ReLU networks. (Image: simons.berkeley.edu) How should we measure the complexity of a ReLU network?
Outline recap; next: understanding generalization failures.
Explaining Generalization Failures
[Figure: CIFAR10 example images.]
[Figure: stochastic gradient training error on CIFAR10 (Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals, 2017).]
[Figure: training margins on CIFAR10 with true and random labels.]
How does this match the large-margin explanation? We need to account for the scale of the neural network functions. What is the appropriate notion of the size of these functions?
Generalization in Deep Networks
"Spectrally-normalized margin bounds for neural networks." B., Dylan J. Foster (Cornell), and Matus Telgarsky (UIUC). arXiv preprint.
Generalization in Deep Networks
New results for generalization in deep ReLU networks:
- Measuring the size of functions computed by a network of ReLUs. (Cf. sigmoid networks: the output y of a layer satisfies ‖y‖_∞ ≤ 1, so ‖w‖_1 ≤ B keeps the scale under control.)
- Large multiclass versus binary classification.
Definitions. Operator norm: for a matrix A_i, $\|A_i\| := \sup_{\|x\| \le 1} \|A_i x\|$. Multiclass margin function for f : X → R^m and y ∈ {1, ..., m}:
$$M(f(x), y) = f(x)_y - \max_{i \neq y} f(x)_i.$$
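The multiclass margin is straightforward to compute for a batch of score vectors; a small numpy sketch (function and variable names are illustrative):

```python
import numpy as np

def multiclass_margin(scores, y):
    """M(f(x), y) = f(x)_y - max_{i != y} f(x)_i for a batch.
    scores: (n, m) array of class scores; y: (n,) integer labels."""
    n = scores.shape[0]
    true = scores[np.arange(n), y]
    masked = scores.copy()
    masked[np.arange(n), y] = -np.inf   # exclude the true class from the max
    return true - masked.max(axis=1)

scores = np.array([[2.0, 0.5, -1.0], [0.1, 0.3, 0.2]])
print(multiclass_margin(scores, np.array([0, 2])))   # [ 1.5 -0.1]
```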
Generalization in Deep Networks
Theorem. With high probability, every f_A with R_A ≤ r satisfies
$$\Pr(M(f_A(X), Y) \le 0) \;\le\; \frac{1}{n}\sum_{i=1}^{n} 1[M(f_A(X_i), Y_i) \le \gamma] + \tilde{O}\!\left(\frac{r L}{\gamma \sqrt{n}}\right).$$
Definitions. Network with L layers and parameters A_1, ..., A_L:
$$f_A(x) := \sigma_L(A_L\, \sigma_{L-1}(A_{L-1} \cdots \sigma_1(A_1 x) \cdots)).$$
Scale of f_A:
$$R_A := \prod_{i=1}^{L} \|A_i\| \cdot \sum_{i=1}^{L} \frac{\|A_i\|_F}{\|A_i\|}.$$
(Assume each σ_i is 1-Lipschitz and the inputs are normalized.)
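A sketch of the scale measure R_A for a list of weight matrices, using the simplified form shown on the slide rather than the tighter variant in the paper; the architecture below is hypothetical:

```python
import numpy as np

def spectral_complexity(weights):
    """R_A = (prod_i ||A_i||) * sum_i ||A_i||_F / ||A_i||, where ||A||
    is the operator (spectral) norm, i.e. the largest singular value."""
    spec = [np.linalg.norm(A, ord=2) for A in weights]
    frob = [np.linalg.norm(A, ord='fro') for A in weights]
    return np.prod(spec) * sum(f / s for f, s in zip(frob, spec))

rng = np.random.default_rng(2)
weights = [rng.normal(size=(64, 32)),   # layer 1: 32 -> 64
           rng.normal(size=(32, 64)),   # layer 2: 64 -> 32
           rng.normal(size=(10, 32))]   # layer 3: 32 -> 10
print(spectral_complexity(weights))
```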
Explaining Generalization Failures (recap)
[Figure: stochastic gradient training error on CIFAR10 (Zhang, Bengio, Hardt, Recht, Vinyals, 2017).]
[Figure: training margins on CIFAR10 with true and random labels.] How does this match the large-margin explanation?
If we rescale the margins by R_A (the scale parameter):
[Figure: rescaled margins on CIFAR10.]
[Figure: rescaled cumulative margins on MNIST.]
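Combining the two sketches above, margin distributions can be normalized by R_A before plotting; a minimal self-contained sketch (all names and data are hypothetical):

```python
import numpy as np

def rescaled_margins(scores, y, weights):
    """Multiclass margins divided by the scale parameter R_A, so that
    margin distributions are comparable across differently scaled
    networks (e.g. trained on true vs. random labels)."""
    n, m = scores.shape
    true = scores[np.arange(n), y]
    others = np.where(np.eye(m, dtype=bool)[y], -np.inf, scores).max(axis=1)
    spec = [np.linalg.norm(A, 2) for A in weights]
    r_a = np.prod(spec) * sum(np.linalg.norm(A, 'fro') / s
                              for A, s in zip(weights, spec))
    return (true - others) / r_a

rng = np.random.default_rng(3)
scores = rng.normal(size=(5, 10))
labels = rng.integers(0, 10, size=5)
weights = [rng.normal(size=(64, 32)), rng.normal(size=(10, 64))]
print(rescaled_margins(scores, labels, weights))
```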
Generalization in Deep Networks (the theorem, the network definition, and the scale R_A are restated from above).
Explaining Generalization Failures. [Figure.]
Generalization in Neural Networks
With appropriate normalization, the margins analysis is qualitatively consistent with the generalization performance. Margin bounds extend to residual networks.
Open questions:
- Lower bounds?
- Regularization: explicit control of operator norms?
- Role of depth?
- Interplay with optimization?
Outline (recap): What determines the statistical complexity of a deep network? VC theory: number of parameters. Margins analysis: size of parameters. Understanding generalization failures.