A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks
Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, Nathan Srebro
Toyota Technological Institute at Chicago
{bneyshabur, srinadh, mcallester,

Abstract

We present a generalization bound for feedforward neural networks in terms of the product of the spectral norm of the layers and the Frobenius norm of the weights. The generalization bound is derived using a PAC-Bayes analysis.

1 Introduction

In this note we present and prove a margin-based generalization bound for feedforward neural networks that depends on the product of the spectral norms of the weights in each layer, as well as the Frobenius norm of the weights. Our generalization bound shares much similarity with a margin-based generalization bound recently presented by Bartlett et al. [1]. Both bounds depend similarly on the product of the spectral norms of each layer, multiplied by a factor that is additive across layers. In addition, Bartlett et al.'s [1] bound depends on the elementwise $\ell_1$-norm of the weights in each layer, while our bound depends on the Frobenius (elementwise $\ell_2$) norm of the weights in each layer, with an additional multiplicative dependence on the width. The two bounds are thus not directly comparable, and each one dominates in a different regime, roughly depending on the sparsity of the weights.

More importantly, our proof technique is entirely different, and arguably simpler, than that of Bartlett et al. [1]. We derive our bound using PAC-Bayes analysis, and more specifically a generic PAC-Bayes margin bound (Lemma 1). The main ingredient is a perturbation bound (Lemma 2), bounding the changes in the output of a network when the weights are perturbed, in terms of the product of the spectral norms of the layers. This is an entirely different analysis approach from Bartlett et al.'s [1] covering number analysis. We hope our analysis can give more direct intuition into the different ingredients in the bound and will allow modifying the analysis, e.g. by using different prior and perturbation
distributions in the PAC-Bayes bound, to obtain tighter bounds, perhaps with dependence on different layer-wise norms.

We note that prior bounds in terms of elementwise or unit-wise norms (such as the Frobenius norm and elementwise $\ell_1$ norms of layers), without a spectral norm dependence, all have a multiplicative dependence across layers or exponential dependence on depth (Bartlett and Mendelson [3], Neyshabur et al. [11]), or are for constant-depth networks (Bartlett [2]). Here only the spectral norm is multiplied across layers, and thus if the spectral norms are close to one, the exponential dependence on depth can be avoided.

1.1 Preliminaries

Consider the classification task with input domain $\mathcal{X}_{B,n} = \{x \in \mathbb{R}^n : \sum_{i=1}^n x_i^2 \le B^2\}$ and output domain $\mathbb{R}^k$, where the output of the model is a score for each class and the class with the maximum score will be selected as the predicted label. Let $f_w(x): \mathcal{X}_{B,n} \to \mathbb{R}^k$ be the function computed
by a $d$-layer feed-forward network for the classification task with parameters $w = \mathrm{vec}(\{W_i\}_{i=1}^d)$, $f_w(x) = W_d\,\phi(W_{d-1}\,\phi(\cdots\phi(W_1 x)))$, where $\phi$ is the ReLU activation function. Let $f_w^i(x)$ denote the output of layer $i$ before activation and $h$ be an upper bound on the number of output units in each layer. We can then define fully connected feedforward networks recursively: $f_w^1(x) = W_1 x$ and $f_w^i(x) = W_i\,\phi(f_w^{i-1}(x))$. Let $\|\cdot\|_F$, $\|\cdot\|_1$ and $\|\cdot\|_2$ denote the Frobenius norm, the element-wise $\ell_1$ norm and the spectral norm respectively. We further denote the $\ell_p$ norm of a vector by $|\cdot|_p$. For any distribution $\mathcal{D}$ and margin $\gamma > 0$, we define the expected margin loss as follows:

$$L_\gamma(f_w) = \Pr_{(x,y)\sim\mathcal{D}}\left[f_w(x)[y] \le \gamma + \max_{j\ne y} f_w(x)[j]\right].$$

Let $\hat{L}_\gamma(f_w)$ be the empirical estimate of the above expected margin loss. Since setting $\gamma = 0$ corresponds to the classification loss, we will use $L_0(f_w)$ and $\hat{L}_0(f_w)$ to refer to the expected risk and the training error. The loss $L_\gamma$ defined this way is bounded between 0 and 1.

1.2 PAC-Bayesian framework

The PAC-Bayesian framework [9, 10] provides generalization guarantees for randomized predictors, drawn from a learned distribution $Q$ (as opposed to a single learned predictor) that depends on the training data. In particular, let $f_w$ be any predictor (not necessarily a neural network) learned from the training data and parametrized by $w$. We consider the distribution $Q$ over predictors of the form $f_{w+u}$, where $u$ is a random variable whose distribution may also depend on the training data. Given a prior distribution $P$ over the set of predictors that is independent of the training data, the PAC-Bayes theorem states that with probability at least $1-\delta$ over the draw of the training data, the expected error of $f_{w+u}$ can be bounded as follows [8]:

$$\mathbb{E}_u[L_0(f_{w+u})] \le \mathbb{E}_u[\hat{L}_0(f_{w+u})] + \sqrt{\frac{KL(w+u\,\|\,P) + \ln\frac{2m}{\delta}}{2(m-1)}}. \quad (1)$$

To get a bound on the expected risk $L_0(f_w)$ for a single predictor $f_w$, we need to relate the expected perturbed loss $\mathbb{E}_u[L_0(f_{w+u})]$ in the above equation with $L_0(f_w)$. Toward this we use Lemma 1 below, which gives a margin-based generalization bound derived from the PAC-Bayesian bound (1).
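As a concrete illustration of the empirical margin loss $\hat{L}_\gamma$ defined above, the following numpy sketch (ours, not from the paper; the network, data, and function names are made up for illustration) computes $\hat{L}_\gamma$ for a small ReLU network:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(weights, x):
    # f_w(x) = W_d relu(W_{d-1} ... relu(W_1 x)); returns a score per class.
    out = x
    for W in weights[:-1]:
        out = relu(W @ out)
    return weights[-1] @ out

def empirical_margin_loss(weights, X, y, gamma):
    # Fraction of examples whose correct-class score fails to exceed every
    # other class score by more than gamma (gamma = 0 gives the 0-1 error).
    losses = []
    for x, yi in zip(X, y):
        scores = forward(weights, x)
        runner_up = np.max(np.delete(scores, yi))
        losses.append(float(scores[yi] <= gamma + runner_up))
    return float(np.mean(losses))

rng = np.random.default_rng(0)
weights = [rng.standard_normal((8, 5)), rng.standard_normal((3, 8))]
X = rng.standard_normal((100, 5))
y = rng.integers(0, 3, size=100)
l0 = empirical_margin_loss(weights, X, y, 0.0)
lg = empirical_margin_loss(weights, X, y, 1.0)
assert 0.0 <= l0 <= lg <= 1.0  # enlarging the margin can only increase the loss
```

Note that $\hat{L}_\gamma$ is monotone in $\gamma$, which is why $\hat{L}_0 \le \hat{L}_\gamma$ above; this is consistent with the loss being bounded between 0 and 1.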
Lemma 1. Let $f_w(x): \mathcal{X} \to \mathbb{R}^k$ be any predictor (not necessarily a neural network) with parameters $w$, and $P$ be any distribution on the parameters that is independent of the training data. Then, for any $\gamma, \delta > 0$, with probability $\ge 1-\delta$ over the training set of size $m$, for any $w$, and any random perturbation $u$ s.t. $\Pr_u\left[\max_{x\in\mathcal{X}} |f_{w+u}(x) - f_w(x)|_\infty < \frac{\gamma}{4}\right] \ge \frac{1}{2}$, we have:

$$L_0(f_w) \le \hat{L}_\gamma(f_w) + 4\sqrt{\frac{KL(w+u\,\|\,P) + \ln\frac{6m}{\delta}}{m-1}}.$$

In the above expression the KL is evaluated for a fixed $w$ and only $u$ is random, i.e. the distribution of $w+u$ is the distribution of $u$ shifted by $w$. The lemma is analogous to the similar analysis of Langford and Shawe-Taylor [7] and McAllester [8] obtaining PAC-Bayes margin bounds for linear predictors. As we state the lemma, it is not specific to linear separators, nor to neural networks, and holds generally for any real-valued predictor. We next show how to utilize the above general PAC-Bayes bound to prove generalization guarantees for feedforward networks based on the spectral norms of their layers.

2 Generalization Bound

In this section we present our generalization bound for feedforward networks with ReLU activations, derived using the PAC-Bayesian framework. Langford and Caruana [6], and more recently Dziugaite and Roy [4] and Neyshabur et al. [12], used PAC-Bayes bounds to analyze generalization behavior in neural networks, evaluating the KL-divergence, the perturbation error $L[f_{w+u}] - L[f_w]$, or the entire bound numerically. Here, we use the PAC-Bayes framework as a tool to analytically derive a margin-based bound in terms of norms of the weights. As we saw in Lemma 1, the key to doing so is bounding the change in the output of the network when the weights are perturbed. In Lemma 2 below, we bound this change in terms of the spectral norms of the layers.
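The proof below instantiates Lemma 1 with a Gaussian prior and a Gaussian perturbation, for which the KL term has the standard closed form $KL(\mathcal{N}(w,\sigma^2 I)\,\|\,\mathcal{N}(0,\sigma^2 I)) = |w|^2/2\sigma^2$. The following sketch (ours; the Monte-Carlo estimator and all names are our own) sanity-checks that closed form numerically:

```python
import numpy as np

def kl_shifted_isotropic_gaussian(w, sigma):
    # KL( N(w, sigma^2 I) || N(0, sigma^2 I) ) = |w|^2 / (2 sigma^2)
    return float(w @ w) / (2.0 * sigma ** 2)

# Monte-Carlo estimate of the same KL:
#   KL = E_{z ~ N(w, s^2 I)}[ log p_w(z) - log p_0(z) ]
#      = E[ (|z|^2 - |z - w|^2) / (2 s^2) ]
rng = np.random.default_rng(1)
w = rng.standard_normal(10)
sigma = 0.5
z = w + sigma * rng.standard_normal((200_000, 10))
mc = np.mean(np.sum(z ** 2, axis=1) - np.sum((z - w) ** 2, axis=1)) / (2 * sigma ** 2)
exact = kl_shifted_isotropic_gaussian(w, sigma)
assert abs(mc - exact) / exact < 0.05  # estimator agrees with the closed form
```

The $|w|^2/2\sigma^2$ form is what makes the Frobenius norms of the layers appear in the KL term of the final bound.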
Lemma 2 (Perturbation Bound). For any $B, d > 0$, let $f_w: \mathcal{X}_{B,n} \to \mathbb{R}^k$ be a $d$-layer network. Then for any $w$, any $x \in \mathcal{X}_{B,n}$, and any perturbation $u = \mathrm{vec}(\{U_i\}_{i=1}^d)$ such that $\|U_i\|_2 \le \frac{1}{d}\|W_i\|_2$, the change in the output of the network can be bounded as follows:

$$|f_{w+u}(x) - f_w(x)|_2 \le eB\left(\prod_{i=1}^d \|W_i\|_2\right)\sum_{i=1}^d \frac{\|U_i\|_2}{\|W_i\|_2}.$$

Next we use the above perturbation bound and the PAC-Bayes result (Lemma 1) to derive the following generalization guarantee.

Theorem 1 (Generalization Bound). For any $B, d, h > 0$, let $f_w: \mathcal{X}_{B,n} \to \mathbb{R}^k$ be a $d$-layer feedforward network with ReLU activations. Then, for any $\delta, \gamma > 0$, with probability $\ge 1-\delta$ over a training set of size $m$, for any $w$, we have:

$$L_0(f_w) \le \hat{L}_\gamma(f_w) + O\left(\sqrt{\frac{B^2 d^2 h \ln(dh)\,\prod_{i=1}^d \|W_i\|_2^2\,\sum_{i=1}^d \frac{\|W_i\|_F^2}{\|W_i\|_2^2} + \ln\frac{dm}{\delta}}{\gamma^2 m}}\right). \quad (2)$$

Comparing the above result to Bartlett et al.'s [1] boils down to comparing $\sqrt{h}\,\|W_i\|_F$ with $\|W_i\|_1$. Recalling that $W_i$ is an $h\times h$ matrix, we have $\|W_i\|_F \le \|W_i\|_1 \le h\|W_i\|_F$. When the weights are fairly dense and of uniform magnitude, the second inequality will be tight, and we will have $\sqrt{h}\,\|W_i\|_F \ll \|W_i\|_1$, and Theorem 1 will dominate. When the weights are sparse with roughly a constant number of significant weights per unit (i.e. a weight matrix with sparsity $\Theta(h)$), the bounds will be similar. Bartlett et al.'s [1] bound will dominate when the weights are extremely sparse, with many fewer significant weights than units, i.e. when most units do not have any incoming or outgoing weights of significant magnitude.

Proof of Theorem 1. The proof involves mainly two steps. In the first step we calculate the maximum allowed perturbation of the parameters satisfying a given margin condition $\gamma$, using Lemma 2. In the second step we calculate the KL term in the PAC-Bayes bound in Lemma 1 for this value of the perturbation.

Let $\beta = \left(\prod_{i=1}^d \|W_i\|_2\right)^{1/d}$ and consider a network with the normalized weights $\tilde{W}_i = \frac{\beta}{\|W_i\|_2} W_i$. Due to the homogeneity of the ReLU, we have that for feedforward networks with ReLU activations $f_{\tilde{w}} = f_w$, and so the empirical and expected loss (including margin loss) is the same for $w$ and $\tilde{w}$. We can also verify that $\prod_{i=1}^d \|W_i\|_2 = \prod_{i=1}^d \|\tilde{W}_i\|_2$ and $\frac{\|W_i\|_F}{\|W_i\|_2} = \frac{\|\tilde{W}_i\|_F}{\|\tilde{W}_i\|_2}$, and so the excess error in the Theorem statement
is also invariant to this transformation. It is therefore sufficient to prove the theorem only for the normalized weights $\tilde{w}$, and hence we assume w.l.o.g. that the spectral norm is equal across layers, i.e. for any layer $i$, $\|W_i\|_2 = \beta$.

Choose the distribution of the prior $P$ to be $\mathcal{N}(0, \sigma^2 I)$, and consider the random perturbation $u \sim \mathcal{N}(0, \sigma^2 I)$, with the same $\sigma$, which we will set later according to $\beta$. More precisely, since the prior cannot depend on the learned predictor $w$ or its norm, we will set $\sigma$ based on an approximation $\tilde{\beta}$. For each value of $\tilde{\beta}$ on a pre-determined grid, we will compute the PAC-Bayes bound, establishing the generalization guarantee for all $w$ for which $|\beta - \tilde{\beta}| \le \frac{1}{d}\beta$, and ensuring that each relevant value of $\beta$ is covered by some $\tilde{\beta}$ on the grid. We will then take a union bound over all $\tilde{\beta}$ on the grid. For now, we consider a fixed $\tilde{\beta}$ and the $w$ for which $|\beta - \tilde{\beta}| \le \frac{1}{d}\beta$, and hence $\frac{1}{e}\beta^{d-1} \le \tilde{\beta}^{d-1} \le e\beta^{d-1}$.

Since $u \sim \mathcal{N}(0, \sigma^2 I)$, we get the following bound for the spectral norm of $U_i$ [?]:

$$\Pr_{U_i \sim \mathcal{N}(0,\sigma^2 I)}\left[\|U_i\|_2 > t\right] \le 2h e^{-t^2/2h\sigma^2}.$$

Taking a union bound over the layers, we get that, with probability $\ge \frac{1}{2}$, the spectral norm of the perturbation $U_i$ in each layer is bounded by $\sigma\sqrt{2h\ln(4dh)}$. Plugging this spectral norm bound into Lemma 2, we have that with probability at least $\frac{1}{2}$,

$$\max_{x\in\mathcal{X}_{B,n}} |f_{w+u}(x) - f_w(x)|_2 \le eB\beta^d \sum_{i=1}^d \frac{\|U_i\|_2}{\beta} = eB\beta^{d-1}\sum_{i=1}^d \|U_i\|_2 \le e^2 d B \tilde{\beta}^{d-1}\sigma\sqrt{2h\ln(4dh)} \le \frac{\gamma}{4},$$
where we choose $\sigma = \frac{\gamma}{42\, d B \tilde{\beta}^{d-1}\sqrt{h\ln(4dh)}}$ to get the last inequality. Hence, the perturbation $u$ with the above value of $\sigma$ satisfies the assumptions of Lemma 1.

We now calculate the KL term in Lemma 1 with the chosen distributions for $P$ and $u$, for the above value of $\sigma$:

$$KL(w+u\,\|\,P) \le \frac{|w|^2}{2\sigma^2} \le O\left(\frac{B^2 d^2 h\ln(dh)}{\gamma^2}\prod_{i=1}^d \|W_i\|_2^2 \sum_{i=1}^d \frac{\|W_i\|_F^2}{\|W_i\|_2^2}\right).$$

Hence, for any $\tilde{\beta}$, with probability $\ge 1-\delta$ and for all $w$ such that $|\beta - \tilde{\beta}| \le \frac{1}{d}\beta$, we have:

$$L_0(f_w) \le \hat{L}_\gamma(f_w) + O\left(\sqrt{\frac{B^2 d^2 h\ln(dh)\prod_{i=1}^d \|W_i\|_2^2 \sum_{i=1}^d \frac{\|W_i\|_F^2}{\|W_i\|_2^2} + \ln\frac{m}{\delta}}{\gamma^2 m}}\right). \quad (3)$$

Finally we need to take a union bound over the different choices of $\tilde{\beta}$. Let us see how many choices of $\tilde{\beta}$ we need to ensure that every relevant $\beta$ has a $\tilde{\beta}$ in the grid s.t. $|\beta - \tilde{\beta}| \le \frac{1}{d}\beta$. We only need to consider values of $\beta$ in the range $\left(\frac{\gamma}{2B}\right)^{1/d} \le \beta \le \left(\frac{\gamma\sqrt{m}}{2B}\right)^{1/d}$. For $\beta$ outside this range the theorem statement holds trivially: recall that the LHS of the theorem statement, $L_0(f_w)$, is always bounded by 1. If $\beta^d < \frac{\gamma}{2B}$, then for any $x$, $|f_w(x)| \le \beta^d B \le \frac{\gamma}{2}$ and therefore $\hat{L}_\gamma = 1$. Alternately, if $\beta^d > \frac{\gamma\sqrt{m}}{2B}$, then the second term in equation (2) is greater than one. Hence, we only need to consider values of $\beta$ in the range discussed above. Since we need $\tilde{\beta}$ to satisfy $|\beta - \tilde{\beta}| \le \frac{1}{d}\beta$, and $\beta \ge \left(\frac{\gamma}{2B}\right)^{1/d}$, a grid of spacing $\frac{1}{d}\left(\frac{\gamma}{2B}\right)^{1/d}$ suffices, so the size of the cover we need to consider is bounded by $d m^{1/2d}$. Taking a union bound over the choices of $\tilde{\beta}$ in this cover, and using the bound in equation (3), gives us the theorem statement. ∎

Proof of Lemma 2. Let $\Delta_i = |f^i_{w+u}(x) - f^i_w(x)|_2$. We will prove by induction that for any $i \ge 0$:

$$\Delta_i \le \left(1 + \frac{1}{d}\right)^i \left(\prod_{j=1}^i \|W_j\|_2\right)|x|_2 \sum_{j=1}^i \frac{\|U_j\|_2}{\|W_j\|_2}.$$

The above inequality, together with $\left(1+\frac{1}{d}\right)^d \le e$, proves the lemma statement. The induction base clearly holds since $\Delta_0 = |x - x|_2 = 0$. For any $i \ge 0$, we have the following:

$$\begin{aligned}
\Delta_{i+1} &= \left|\left(W_{i+1} + U_{i+1}\right)\phi\!\left(f^i_{w+u}(x)\right) - W_{i+1}\,\phi\!\left(f^i_w(x)\right)\right|_2\\
&= \left|\left(W_{i+1} + U_{i+1}\right)\left(\phi\!\left(f^i_{w+u}(x)\right) - \phi\!\left(f^i_w(x)\right)\right) + U_{i+1}\,\phi\!\left(f^i_w(x)\right)\right|_2\\
&\le \left(\|W_{i+1}\|_2 + \|U_{i+1}\|_2\right)\left|\phi\!\left(f^i_{w+u}(x)\right) - \phi\!\left(f^i_w(x)\right)\right|_2 + \|U_{i+1}\|_2\left|\phi\!\left(f^i_w(x)\right)\right|_2\\
&\le \left(\|W_{i+1}\|_2 + \|U_{i+1}\|_2\right)\left|f^i_{w+u}(x) - f^i_w(x)\right|_2 + \|U_{i+1}\|_2\left|f^i_w(x)\right|_2\\
&= \Delta_i\left(\|W_{i+1}\|_2 + \|U_{i+1}\|_2\right) + \|U_{i+1}\|_2\left|f^i_w(x)\right|_2,
\end{aligned}$$

where the last inequality is by the Lipschitz property of the activation function and using $\phi(0) = 0$. The $\ell_2$ norm of the output of layer $i$ is bounded by $|x|_2\prod_{j=1}^i \|W_j\|_2$, and by the lemma assumption we have $\|U_{i+1}\|_2 \le \frac{1}{d}\|W_{i+1}\|_2$. Therefore, using the induction hypothesis, we get the following bound:

$$\begin{aligned}
\Delta_{i+1} &\le \Delta_i\left(1 + \frac{1}{d}\right)\|W_{i+1}\|_2 + \|U_{i+1}\|_2\,|x|_2\prod_{j=1}^i \|W_j\|_2\\
&\le \left(1+\frac{1}{d}\right)^{i+1}\left(\prod_{j=1}^{i+1}\|W_j\|_2\right)|x|_2\sum_{j=1}^i \frac{\|U_j\|_2}{\|W_j\|_2} + \frac{\|U_{i+1}\|_2}{\|W_{i+1}\|_2}\,|x|_2\prod_{j=1}^{i+1}\|W_j\|_2\\
&\le \left(1+\frac{1}{d}\right)^{i+1}\left(\prod_{j=1}^{i+1}\|W_j\|_2\right)|x|_2\sum_{j=1}^{i+1} \frac{\|U_j\|_2}{\|W_j\|_2}. \qquad\blacksquare
\end{aligned}$$
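The perturbation bound of Lemma 2 can be checked numerically on a random ReLU network. The following sketch (ours; the dimensions and helper names are made up for illustration) draws perturbations satisfying the assumption $\|U_i\|_2 \le \frac{1}{d}\|W_i\|_2$ and verifies the inequality:

```python
import numpy as np

rng = np.random.default_rng(2)
d, h, n, k, B = 4, 20, 10, 5, 1.0
dims = [n] + [h] * (d - 1) + [k]
Ws = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(d)]

def forward(ws, x):
    # f_w(x) = W_d relu(W_{d-1} ... relu(W_1 x))
    out = x
    for W in ws[:-1]:
        out = np.maximum(W @ out, 0.0)
    return ws[-1] @ out

spec = lambda M: np.linalg.norm(M, 2)  # spectral norm (largest singular value)

# Perturbations scaled so that ||U_i||_2 <= (1/d) ||W_i||_2, as Lemma 2 requires.
Us = []
for W in Ws:
    U = rng.standard_normal(W.shape)
    Us.append(U * (spec(W) / (d * spec(U))) * rng.uniform(0.1, 1.0))

x = rng.standard_normal(n)
x *= B / np.linalg.norm(x)  # put x in the domain X_{B,n}
lhs = np.linalg.norm(forward([W + U for W, U in zip(Ws, Us)], x) - forward(Ws, x))
rhs = (np.e * B * np.prod([spec(W) for W in Ws])
       * sum(spec(U) / spec(W) for W, U in zip(Ws, Us)))
assert lhs <= rhs  # Lemma 2: change in output is within the spectral bound
```

In practice the bound is quite loose on random networks; its value lies in how the product of spectral norms, rather than any elementwise norm, controls the sensitivity to perturbation.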
References

[1] P. Bartlett, D. J. Foster, and M. Telgarsky. Spectrally-normalized margin bounds for neural networks. arXiv preprint, 2017.
[2] P. L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525–536, 1998.
[3] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
[4] G. K. Dziugaite and D. M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint, 2017.
[5] N. Harvey, C. Liaw, and A. Mehrabian. Nearly-tight VC-dimension bounds for piecewise linear neural networks. arXiv preprint, 2017.
[6] J. Langford and R. Caruana. (Not) bounding the true error. In Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic. MIT Press, 2001.
[7] J. Langford and J. Shawe-Taylor. PAC-Bayes & margins. In Advances in Neural Information Processing Systems, 2003.
[8] D. McAllester. Simplified PAC-Bayesian margin bounds. Lecture Notes in Computer Science, pages 203–215, 2003.
[9] D. A. McAllester. Some PAC-Bayesian theorems. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory. ACM, 1998.
[10] D. A. McAllester. PAC-Bayesian model averaging. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory. ACM, 1999.
[11] B. Neyshabur, R. Tomioka, and N. Srebro. Norm-based capacity control in neural networks. In Proceedings of the 28th Conference on Learning Theory (COLT), 2015.
[12] B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro. Exploring generalization in deep learning. arXiv preprint, 2017.
Published as a conference paper at ICLR 2018.
More informationLecture 2: Correlated Topic Model
Probabilistic Moels for Unsupervise Learning Spring 203 Lecture 2: Correlate Topic Moel Inference for Correlate Topic Moel Yuan Yuan First of all, let us make some claims about the parameters an variables
More informationImage Denoising Using Spatial Adaptive Thresholding
International Journal of Engineering Technology, Management an Applie Sciences Image Denoising Using Spatial Aaptive Thresholing Raneesh Mishra M. Tech Stuent, Department of Electronics & Communication,
More informationDatabase-friendly Random Projections
Database-frienly Ranom Projections Dimitris Achlioptas Microsoft ABSTRACT A classic result of Johnson an Linenstrauss asserts that any set of n points in -imensional Eucliean space can be embee into k-imensional
More informationLinear First-Order Equations
5 Linear First-Orer Equations Linear first-orer ifferential equations make up another important class of ifferential equations that commonly arise in applications an are relatively easy to solve (in theory)
More informationFunction Spaces. 1 Hilbert Spaces
Function Spaces A function space is a set of functions F that has some structure. Often a nonparametric regression function or classifier is chosen to lie in some function space, where the assume structure
More informationA New Minimum Description Length
A New Minimum Description Length Soosan Beheshti, Munther A. Dahleh Laboratory for Information an Decision Systems Massachusetts Institute of Technology soosan@mit.eu,ahleh@lis.mit.eu Abstract The minimum
More informationPETER L. BARTLETT AND MARTEN H. WEGKAMP
CLASSIFICATION WITH A REJECT OPTION USING A HINGE LOSS PETER L. BARTLETT AND MARTEN H. WEGKAMP Abstract. We consier the problem of binary classification where the classifier can, for a particular cost,
More informationLinear and quadratic approximation
Linear an quaratic approximation November 11, 2013 Definition: Suppose f is a function that is ifferentiable on an interval I containing the point a. The linear approximation to f at a is the linear function
More informationThe Press-Schechter mass function
The Press-Schechter mass function To state the obvious: It is important to relate our theories to what we can observe. We have looke at linear perturbation theory, an we have consiere a simple moel for
More informationAgmon Kolmogorov Inequalities on l 2 (Z d )
Journal of Mathematics Research; Vol. 6, No. ; 04 ISSN 96-9795 E-ISSN 96-9809 Publishe by Canaian Center of Science an Eucation Agmon Kolmogorov Inequalities on l (Z ) Arman Sahovic Mathematics Department,
More informationRamsey numbers of some bipartite graphs versus complete graphs
Ramsey numbers of some bipartite graphs versus complete graphs Tao Jiang, Michael Salerno Miami University, Oxfor, OH 45056, USA Abstract. The Ramsey number r(h, K n ) is the smallest positive integer
More informationAll s Well That Ends Well: Supplementary Proofs
All s Well That Ens Well: Guarantee Resolution of Simultaneous Rigi Boy Impact 1:1 All s Well That Ens Well: Supplementary Proofs This ocument complements the paper All s Well That Ens Well: Guarantee
More informationTractability results for weighted Banach spaces of smooth functions
Tractability results for weighte Banach spaces of smooth functions Markus Weimar Mathematisches Institut, Universität Jena Ernst-Abbe-Platz 2, 07740 Jena, Germany email: markus.weimar@uni-jena.e March
More informationSelf-normalized Martingale Tail Inequality
Online-to-Confience-Set Conversions an Application to Sparse Stochastic Banits A Self-normalize Martingale Tail Inequality The self-normalize martingale tail inequality that we present here is the scalar-value
More informationSOME RESULTS ON THE GEOMETRY OF MINKOWSKI PLANE. Bing Ye Wu
ARCHIVUM MATHEMATICUM (BRNO Tomus 46 (21, 177 184 SOME RESULTS ON THE GEOMETRY OF MINKOWSKI PLANE Bing Ye Wu Abstract. In this paper we stuy the geometry of Minkowski plane an obtain some results. We focus
More informationRobustness and Perturbations of Minimal Bases
Robustness an Perturbations of Minimal Bases Paul Van Dooren an Froilán M Dopico December 9, 2016 Abstract Polynomial minimal bases of rational vector subspaces are a classical concept that plays an important
More informationIntroduction to Machine Learning
How o you estimate p(y x)? Outline Contents Introuction to Machine Learning Logistic Regression Varun Chanola April 9, 207 Generative vs. Discriminative Classifiers 2 Logistic Regression 2 3 Logistic Regression
More information12.11 Laplace s Equation in Cylindrical and
SEC. 2. Laplace s Equation in Cylinrical an Spherical Coorinates. Potential 593 2. Laplace s Equation in Cylinrical an Spherical Coorinates. Potential One of the most important PDEs in physics an engineering
More informationGeneralization in Deep Networks
Generalization in Deep Networks Peter Bartlett BAIR UC Berkeley November 28, 2017 1 / 29 Deep neural networks Game playing (Jung Yeon-Je/AFP/Getty Images) 2 / 29 Deep neural networks Image recognition
More informationMulti-View Clustering via Canonical Correlation Analysis
Keywors: multi-view learning, clustering, canonical correlation analysis Abstract Clustering ata in high-imensions is believe to be a har problem in general. A number of efficient clustering algorithms
More informationLevel Construction of Decision Trees in a Partition-based Framework for Classification
Level Construction of Decision Trees in a Partition-base Framework for Classification Y.Y. Yao, Y. Zhao an J.T. Yao Department of Computer Science, University of Regina Regina, Saskatchewan, Canaa S4S
More informationDEGREE DISTRIBUTION OF SHORTEST PATH TREES AND BIAS OF NETWORK SAMPLING ALGORITHMS
DEGREE DISTRIBUTION OF SHORTEST PATH TREES AND BIAS OF NETWORK SAMPLING ALGORITHMS SHANKAR BHAMIDI 1, JESSE GOODMAN 2, REMCO VAN DER HOFSTAD 3, AND JÚLIA KOMJÁTHY3 Abstract. In this article, we explicitly
More informationTMA 4195 Matematisk modellering Exam Tuesday December 16, :00 13:00 Problems and solution with additional comments
Problem F U L W D g m 3 2 s 2 0 0 0 0 2 kg 0 0 0 0 0 0 Table : Dimension matrix TMA 495 Matematisk moellering Exam Tuesay December 6, 2008 09:00 3:00 Problems an solution with aitional comments The necessary
More informationA Weak First Digit Law for a Class of Sequences
International Mathematical Forum, Vol. 11, 2016, no. 15, 67-702 HIKARI Lt, www.m-hikari.com http://x.oi.org/10.1288/imf.2016.6562 A Weak First Digit Law for a Class of Sequences M. A. Nyblom School of
More informationWESD - Weighted Spectral Distance for Measuring Shape Dissimilarity
1 WESD - Weighte Spectral Distance for Measuring Shape Dissimilarity Ener Konukoglu, Ben Glocker, Antonio Criminisi an Kilian M. Pohl Abstract This article presents a new istance for measuring shape issimilarity
More informationHyperbolic Systems of Equations Posed on Erroneous Curved Domains
Hyperbolic Systems of Equations Pose on Erroneous Curve Domains Jan Norström a, Samira Nikkar b a Department of Mathematics, Computational Mathematics, Linköping University, SE-58 83 Linköping, Sween (
More informationDesigning of Acceptance Double Sampling Plan for Life Test Based on Percentiles of Exponentiated Rayleigh Distribution
International Journal of Statistics an Systems ISSN 973-675 Volume, Number 3 (7), pp. 475-484 Research Inia Publications http://www.ripublication.com Designing of Acceptance Double Sampling Plan for Life
More informationarxiv: v4 [cs.ds] 7 Mar 2014
Analysis of Agglomerative Clustering Marcel R. Ackermann Johannes Blömer Daniel Kuntze Christian Sohler arxiv:101.697v [cs.ds] 7 Mar 01 Abstract The iameter k-clustering problem is the problem of partitioning
More information