Generalization in Deep Networks

Generalization in Deep Networks
Peter Bartlett, BAIR, UC Berkeley
November 28

Deep neural networks
Game playing (Jung Yeon-Je/AFP/Getty Images)
Image recognition (Krizhevsky et al., 2012)
Speech recognition (Graves et al., 2013)

Outline
What determines the statistical complexity of a deep network?
VC theory: number of parameters
Margins analysis: size of parameters
Understanding generalization failures

VC Theory
Assume the network maps to {-1, 1} (threshold its output).
Data are generated by a probability distribution P on X × {-1, 1}.
We want to choose a function f such that P(f(x) ≠ y) is small (near optimal).

VC Theory
Theorem (Vapnik and Chervonenkis). Suppose F ⊆ {-1, 1}^X. For every probability distribution P on X × {-1, 1}, with probability at least 1 - δ over n i.i.d. examples (x_1, y_1), ..., (x_n, y_n), every f in F satisfies

$$ P(f(x) \neq y) \;\le\; \frac{1}{n}\,\bigl|\{i : f(x_i) \neq y_i\}\bigr| \;+\; \left( \frac{c}{n}\bigl(\mathrm{VCdim}(F) + \log(1/\delta)\bigr) \right)^{1/2}. $$

For uniform bounds (that is, for all distributions and all f ∈ F, proportions are close to probabilities), this inequality is tight within a constant factor.
For neural networks, the VC-dimension:
increases with the number of parameters,
depends on the nonlinearity and the depth.

VC-Dimension of Neural Networks
Theorem. Consider the class F of {-1, 1}-valued functions computed by a network with L layers, p parameters, and k computation units with the following nonlinearities:
1. Piecewise constant (linear threshold units): VCdim(F) = Õ(p). (Baum and Haussler, 1989)
2. Piecewise linear (ReLUs): VCdim(F) = Õ(pL). (B., Harvey, Liaw, Mehrabian, 2017)
3. Piecewise polynomial: VCdim(F) = Õ(pL²). (B., Maiorov, Meir, 1998)
4. Sigmoid: VCdim(F) = Õ(p²k²). (Karpinski and Macintyre, 1994)
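To make the parameter-counting bound concrete, here is a minimal Python sketch (not from the talk) that evaluates the VC error term for a ReLU network using the Õ(pL) scaling above. The constant c and the log factors hidden by Õ(·) are unspecified in the theorem, so the value c = 1 and the parameter counts below are purely illustrative.

```python
import math

def vc_error_term(vcdim, n, delta=0.01, c=1.0):
    """Error term of the VC bound: sqrt((c / n) * (VCdim(F) + log(1 / delta))).
    The constant c is unspecified in the theorem; c = 1 is illustrative only."""
    return math.sqrt((c / n) * (vcdim + math.log(1.0 / delta)))

# Hypothetical modern network: p parameters, L layers; VCdim = Õ(pL) for ReLUs,
# with the constants and log factors hidden by Õ(·) dropped.
p, L = 60_000_000, 20
vcdim_relu = p * L

for n in (50_000, 10**7, 10**9):
    print(f"n = {n:>12,d}   error term = {vc_error_term(vcdim_relu, n):.3g}")
# At CIFAR10 scale (n = 50,000) the term is far above 1: the parameter-counting
# bound is vacuous for heavily overparameterized networks.
```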

Generalization in Neural Networks: Number of Parameters
[Figure: NIPS]

Outline
What determines the statistical complexity of a deep network?
VC theory: number of parameters
Margins analysis: size of parameters
Understanding generalization failures

Large-Margin Classifiers
Consider a real-valued function f : X → R used for classification.
The prediction on x ∈ X is sign(f(x)) ∈ {-1, 1}.
For a pattern-label pair (x, y) ∈ X × {-1, 1}, if y f(x) > 0 then f classifies x correctly.
We call y f(x) the margin of f on x.
We can view a larger margin as a more confident correct classification.
Minimizing a continuous loss, such as

$$ \sum_{i=1}^{n} (f(X_i) - Y_i)^2, $$

encourages large margins.
For large-margin classifiers, we should expect the fine-grained details of f to be less important.
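As a quick illustration of these definitions, the following sketch (not from the talk) computes the margins y_i f(x_i) and the empirical margin loss at a threshold γ; the scores and labels are hypothetical.

```python
import numpy as np

def margins(scores, labels):
    """Margins y_i * f(x_i) for real-valued scores f(x_i) and labels y_i in {-1, +1}."""
    return labels * scores

def empirical_margin_loss(scores, labels, gamma):
    """Fraction of examples with margin at most gamma: (1/n) * sum_i 1[y_i f(x_i) <= gamma]."""
    return float(np.mean(margins(scores, labels) <= gamma))

# Hypothetical scores and labels:
scores = np.array([2.3, -0.4, 1.1, 0.2, -3.0])
labels = np.array([1, 1, 1, -1, -1])
print(empirical_margin_loss(scores, labels, gamma=0.0))  # training error: margin <= 0
print(empirical_margin_loss(scores, labels, gamma=0.5))  # stricter margin requirement
```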

Generalization: Margins and Size of Parameters
Theorem (B., 1996)
1. With high probability over n training examples (X_1, Y_1), ..., (X_n, Y_n) ∈ X × {±1}, every f ∈ F ⊆ R^X has

$$ \Pr(\mathrm{sign}(f(X)) \neq Y) \;\le\; \frac{1}{n} \sum_{i=1}^{n} 1[Y_i f(X_i) \le \gamma] \;+\; \tilde O\!\left( \sqrt{\frac{\mathrm{fat}_F(\gamma)}{n}} \right). $$

2. If the functions in F are computed by two-layer sigmoid networks with each unit's weights bounded in 1-norm, that is, ‖w‖_1 ≤ B, then fat_F(γ) = Õ((B/γ)²).

The bound depends on the margin loss plus an error term.
Minimizing quadratic loss or cross-entropy loss leads to large margins.
fat_F(γ) is a scale-sensitive version of the VC-dimension. Unlike the VC-dimension, it need not grow with the number of parameters.
The same ideas were used to give rigorous dimension-independent generalization bounds for SVMs (B. and Shawe-Taylor, 1999)...
... and for the margins analysis of AdaBoost (Schapire, Freund, B., Lee, 1998).
The scale of functions f ∈ F is important: bigger f's give bigger margins, so fat_F(γ) should be bigger.
The output y of a sigmoid layer has ‖y‖_∞ ≤ 1, so ‖w‖_1 ≤ B controls the scale of f.
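To see how the scale-sensitive bound behaves, here is a small sketch (not part of the talk) that evaluates the error term using the two-layer sigmoid estimate fat_F(γ) = Õ((B/γ)²). The constants and log factors hidden by Õ(·) are dropped, and the values of B, γ, and n are hypothetical.

```python
import math

def margin_bound_error_term(B, gamma, n):
    """Error term sqrt(fat_F(gamma) / n) with fat_F(gamma) ~ (B / gamma)^2 for a
    two-layer sigmoid network whose unit weights satisfy ||w||_1 <= B.
    Constants and log factors hidden by Õ(·) are dropped; illustrative only."""
    fat = (B / gamma) ** 2
    return math.sqrt(fat / n)

# The number of parameters never appears: only B, gamma, and n matter.
for n in (1_000, 50_000, 1_000_000):
    print(n, round(margin_bound_error_term(B=5.0, gamma=1.0, n=n), 4))
```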

Generalization: Margins and Size of Parameters
1996: sigmoid networks. Qualitative behavior explained by the small-weights theorem.
2017: deep ReLU networks (simons.berkeley.edu). How should we measure the complexity of a ReLU network?

Outline
What determines the statistical complexity of a deep network?
VC theory: number of parameters
Margins analysis: size of parameters
Understanding generalization failures

Explaining Generalization Failures
[Figure: CIFAR10]

Explaining Generalization Failures
Stochastic gradient training error on CIFAR10 (Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals, 2017).

Explaining Generalization Failures
Training margins on CIFAR10 with true and random labels.
How does this match the large-margin explanation?
Need to account for the scale of the neural network functions.
What is the appropriate notion of the size of these functions?

Generalization in Deep Networks
Spectrally-normalized margin bounds for neural networks. B., Dylan J. Foster, Matus Telgarsky (arXiv preprint).
Matus Telgarsky (UIUC), Dylan Foster (Cornell)

Generalization in Deep Networks
New results for generalization in deep ReLU networks:
Measuring the size of functions computed by a network of ReLUs.
(Cf. sigmoid networks: the output y of a layer has ‖y‖_∞ ≤ 1, so ‖w‖_1 ≤ B keeps the scale under control.)
Large multiclass versus binary classification.

Definitions
Operator norms: for a matrix A_i, ‖A_i‖ := sup_{‖x‖ ≤ 1} ‖A_i x‖.
Multiclass margin function for f : X → R^m and y ∈ {1, ..., m}: M(f(x), y) = f(x)_y - max_{i ≠ y} f(x)_i.
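The two definitions above translate directly into code. The sketch below (not from the talk; the 3-class score vector and the matrix are hypothetical) computes the operator norm as the largest singular value and the multiclass margin M(f(x), y).

```python
import numpy as np

def operator_norm(A):
    """Operator norm sup_{||x|| <= 1} ||A x||, i.e. the largest singular value of A."""
    return float(np.linalg.norm(A, ord=2))

def multiclass_margin(scores, y):
    """M(f(x), y) = f(x)_y - max_{i != y} f(x)_i for a score vector f(x) in R^m."""
    others = np.delete(scores, y)
    return float(scores[y] - np.max(others))

# Hypothetical 3-class score vector; class 1 is the true label:
scores = np.array([1.2, 3.4, 2.9])
print(multiclass_margin(scores, y=1))                     # 0.5: correct, but a small margin
print(operator_norm(np.array([[2.0, 0.0], [0.0, 0.5]])))  # 2.0
```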

Generalization in Deep Networks
Theorem. With high probability, every f_A with R_A ≤ r satisfies

$$ \Pr(M(f_A(X), Y) \le 0) \;\le\; \frac{1}{n} \sum_{i=1}^{n} 1[M(f_A(X_i), Y_i) \le \gamma] \;+\; \tilde O\!\left( \frac{r L}{\gamma \sqrt{n}} \right). $$

Definitions
Network with L layers and parameters A_1, ..., A_L:
$$ f_A(x) := \sigma_L(A_L\, \sigma_{L-1}(A_{L-1} \cdots \sigma_1(A_1 x) \cdots)). $$
Scale of f_A:
$$ R_A := \prod_{i=1}^{L} \|A_i\| \; \sum_{i=1}^{L} \frac{\|A_i\|_F}{\|A_i\|}. $$
(Assume each σ_i is 1-Lipschitz and inputs are normalized.)
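A minimal sketch of the scale R_A as stated on the slide: the product of operator norms times the sum of Frobenius-to-operator-norm ratios. The layer shapes are hypothetical, and this follows the slide's simplified Frobenius-based definition rather than any refinement used in the paper.

```python
import numpy as np

def spectral_complexity(weights):
    """Scale R_A as on the slide: (prod_i ||A_i||) * (sum_i ||A_i||_F / ||A_i||),
    where ||A_i|| is the operator (spectral) norm and ||A_i||_F the Frobenius norm."""
    op = [np.linalg.norm(A, ord=2) for A in weights]
    fro = [np.linalg.norm(A, ord='fro') for A in weights]
    return float(np.prod(op) * sum(f / o for f, o in zip(fro, op)))

# Hypothetical 3-layer ReLU network (input dim 32, two hidden layers of 100, 10 outputs):
rng = np.random.default_rng(0)
weights = [rng.normal(size=(100, 32)) / np.sqrt(32),
           rng.normal(size=(100, 100)) / np.sqrt(100),
           rng.normal(size=(10, 100)) / np.sqrt(100)]
print(spectral_complexity(weights))
```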

Explaining Generalization Failures
Stochastic gradient training error on CIFAR10 (Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals, 2017).

Explaining Generalization Failures
Training margins on CIFAR10 with true and random labels. How does this match the large-margin explanation?

Explaining Generalization Failures
If we rescale the margins by R_A (the scale parameter):
Rescaled margins on CIFAR10.
Rescaled cumulative margins on MNIST.
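The rescaling itself is just a division by R_A. The sketch below uses hypothetical margin values and R_A values (not the talk's data) to show how normalizing makes margin distributions from differently scaled networks comparable.

```python
import numpy as np

def rescaled_margins(margin_values, r_a):
    """Divide raw multiclass margins M(f_A(x_i), y_i) by the scale R_A of the network."""
    return np.asarray(margin_values, dtype=float) / r_a

# Hypothetical raw margins from two training runs (true labels vs. random labels),
# where the random-label run needed a much larger scale R_A to fit the data:
true_raw,   true_ra   = np.array([4.0, 6.5, 5.2, 7.1]), 10.0
random_raw, random_ra = np.array([3.8, 6.0, 5.0, 6.9]), 80.0
print(np.sort(rescaled_margins(true_raw, true_ra)))      # comparatively large normalized margins
print(np.sort(rescaled_margins(random_raw, random_ra)))  # much smaller after normalization
```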

Generalization in Deep Networks
Theorem (restated). With high probability, every f_A with R_A ≤ r satisfies

$$ \Pr(M(f_A(X), Y) \le 0) \;\le\; \frac{1}{n} \sum_{i=1}^{n} 1[M(f_A(X_i), Y_i) \le \gamma] \;+\; \tilde O\!\left( \frac{r L}{\gamma \sqrt{n}} \right), $$

with the network f_A and the scale R_A defined as above.

Generalization in Neural Networks
With appropriate normalization, the margins analysis is qualitatively consistent with the generalization performance.
Margin bounds extend to residual networks.
Lower bounds?
Regularization: explicit control of operator norms?
Role of depth?
Interplay with optimization?

Outline
What determines the statistical complexity of a deep network?
VC theory: number of parameters
Margins analysis: size of parameters
Understanding generalization failures
