A Simple Weight Decay Can Improve Generalization
In Advances in Neural Information Processing Systems 4, J.E. Moody, S.J. Hanson and R.P. Lippmann, eds. Morgan Kaufmann Publishers, San Mateo CA, 1992, pp. 950-957.

A Simple Weight Decay Can Improve Generalization

Anders Krogh
CONNECT, The Niels Bohr Institute
Blegdamsvej 17, DK-2100 Copenhagen, Denmark
krogh@cse.ucsc.edu

John A. Hertz
Nordita
Blegdamsvej 17, DK-2100 Copenhagen, Denmark
hertz@nordita.dk

Abstract

It has been observed in numerical simulations that a weight decay can improve generalization in a feed-forward neural network. This paper explains why. It is proven that a weight decay has two effects in a linear network. First, it suppresses any irrelevant components of the weight vector by choosing the smallest vector that solves the learning problem. Second, if the size is chosen right, a weight decay can suppress some of the effects of static noise on the targets, which improves generalization quite a lot. It is then shown how to extend these results to networks with hidden layers and non-linear units. Finally the theory is confirmed by some numerical simulations using the data from NetTalk.

1 INTRODUCTION

Many recent studies have shown that the generalization ability of a neural network (or any other `learning machine') depends on a balance between the information in the training examples and the complexity of the network; see for instance [1, 2, 3]. Bad generalization occurs if the information does not match the complexity, e.g. if the network is very complex and there is little information in the training set. In this last instance the network will be over-fitting the data, and the opposite situation corresponds to under-fitting.

Present address: Computer and Information Sciences, Univ. of California Santa Cruz, Santa Cruz, CA
Often the number of free parameters, i.e. the number of weights and thresholds, is used as a measure of the network complexity, and algorithms have been developed which minimize the number of weights while still keeping the error on the training examples small [4, 5, 6]. This minimization of the number of free parameters is not always what is needed.

A different way to constrain a network, and thus decrease its complexity, is to limit the growth of the weights through some kind of weight decay. It should prevent the weights from growing too large unless it is really necessary. It can be realized by adding a term to the cost function that penalizes large weights,

\[ E(w) = E_0(w) + \frac{\lambda}{2} \sum_i w_i^2, \qquad (1) \]

where $E_0$ is one's favorite error measure (usually the sum of squared errors), and $\lambda$ is a parameter governing how strongly large weights are penalized. $w$ is a vector containing all free parameters of the network; it will be called the weight vector. If gradient descent is used for learning, the last term in the cost function leads to a new term $-\lambda w_i$ in the weight update:

\[ \dot{w}_i \propto -\frac{\partial E_0}{\partial w_i} - \lambda w_i. \qquad (2) \]

Here it is formulated in continuous time. If the gradient of $E_0$ (the `force term') were not present, this equation would lead to an exponential decay of the weights.

Obviously there are infinitely many possibilities for choosing other forms of the additional term in (1), but here we will concentrate on this simple form. It has been known for a long time that a weight decay of this form can improve generalization [7], but until now it has not been very widely recognized. The aim of this paper is to analyze this effect both theoretically and experimentally. Weight decay as a special kind of regularization is also discussed in [8, 9].

2 FEED-FORWARD NETWORKS

A feed-forward neural network implements a function of the inputs that depends on the weight vector $w$; it is called $f_w$. For simplicity it is assumed that there is only one output unit. When the input is $\xi$, the output is $f_w(\xi)$.
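As an illustration of (1) and (2), gradient descent with the extra decay term can be sketched in a few lines of NumPy. This is a hypothetical discrete-time sketch, not the paper's code; the random linear "teacher" data, learning rate and $\lambda$ are arbitrary illustrative choices:

```python
import numpy as np

# Sketch of gradient descent on the penalized cost
#   E(w) = E0(w) + (lam/2) * sum_i w_i^2,
# whose update rule gains the weight-decay term -lam * w_i, as in (2).
# Data, model and hyperparameters are illustrative assumptions.
rng = np.random.default_rng(0)
N, P = 20, 10                       # input dimension, number of patterns
X = rng.standard_normal((P, N))     # training inputs
u = rng.standard_normal(N)          # "teacher" weight vector
y = X @ u                           # targets produced by the teacher

def train(lam, lr=0.01, steps=5000):
    w = np.zeros(N)
    for _ in range(steps):
        grad_E0 = X.T @ (X @ w - y) / P   # gradient of the squared error E0
        w -= lr * (grad_E0 + lam * w)     # extra term: exponential weight decay
    return w

w_plain = train(lam=0.0)
w_decay = train(lam=0.01)
# Both solutions fit the training data well, but the decayed
# weight vector has the smaller norm.
print(np.linalg.norm(w_decay) < np.linalg.norm(w_plain))  # -> True
```

With the error gradient removed, the update reduces to $w \leftarrow (1 - \mathrm{lr}\,\lambda)\,w$, the exponential decay mentioned in the text.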
Note that the input vector $\xi$ is a vector in the $N$-dimensional input space, whereas the weight vector $w$ is a vector in the weight space, which has a different dimension $W$.

The aim of the learning is not only to learn the examples, but to learn the underlying function that produces the targets for the learning process. First, we assume that this target function can actually be implemented by the network. This means there exists a weight vector $u$ such that the target function is equal to $f_u$. The network with parameters $u$ is often called the teacher, because from input vectors it can produce the right targets. The sum of squared errors is

\[ E_0(w) = \frac{1}{2} \sum_{\mu=1}^{p} \left[ f_u(\xi^\mu) - f_w(\xi^\mu) \right]^2, \qquad (3) \]
where $p$ is the number of training patterns. The learning equation (2) can then be written

\[ \dot{w}_i \propto \sum_\mu \left[ f_u(\xi^\mu) - f_w(\xi^\mu) \right] \frac{\partial f_w(\xi^\mu)}{\partial w_i} - \lambda w_i. \qquad (4) \]

Now the idea is to expand this around the solution $u$, but first the linear case will be analyzed in some detail.

3 THE LINEAR PERCEPTRON

The simplest kind of `network' is the linear perceptron, characterized by

\[ f_w(\xi) = N^{-1/2} \sum_i w_i \xi_i, \qquad (5) \]

where the $N^{-1/2}$ is just a convenient normalization factor. Here the dimension of the weight space ($W$) is the same as the dimension of the input space ($N$). The learning equation then takes the simple form

\[ \dot{w}_i \propto N^{-1} \sum_\mu \sum_j [u_j - w_j] \xi_j^\mu \xi_i^\mu - \lambda w_i. \qquad (6) \]

Defining

\[ v_i \equiv u_i - w_i \qquad (7) \]

and

\[ A_{ij} \equiv N^{-1} \sum_\mu \xi_i^\mu \xi_j^\mu, \qquad (8) \]

it becomes

\[ \dot{v}_i \propto - \sum_j A_{ij} v_j + \lambda (u_i - v_i). \qquad (9) \]

Transforming this equation to the basis where $A$ is diagonal yields

\[ \dot{v}_r \propto -(A_r + \lambda) v_r + \lambda u_r, \qquad (10) \]

where $A_r$ are the eigenvalues of $A$, and a subscript $r$ indicates transformation to this basis.

The generalization error is defined as the error averaged over the distribution of input vectors,

\[ F = \langle [f_u(\xi) - f_w(\xi)]^2 \rangle = \langle N^{-1} (\textstyle\sum_i v_i \xi_i)^2 \rangle = N^{-1} \sum_{ij} v_i v_j \langle \xi_i \xi_j \rangle = N^{-1} \sum_i v_i^2. \qquad (11) \]

Here it is assumed that $\langle \xi_i \xi_j \rangle = \delta_{ij}$. The generalization error $F$ is thus proportional to $|v|^2$, which is also quite natural.

The eigenvalues of the covariance matrix $A$ are non-negative, and its rank can easily be shown to be less than or equal to $p$. It is also easily seen that all eigenvectors belonging to eigenvalues larger than 0 lie in the subspace of weight space spanned
by the input patterns $\xi^1, \ldots, \xi^p$. This subspace, called the pattern subspace, will be denoted $V_p$, and the orthogonal subspace is denoted by $V_\perp$. When there are sufficiently many examples they span the whole space, and there will be no zero eigenvalues. This can only happen for $p \ge N$.

When $\lambda = 0$ the solution to (10) inside $V_p$ is just a simple exponential decay to $v_r = 0$. Outside the pattern subspace $A_r = 0$, and the corresponding part of $v_r$ will be constant. Any weight vector which has the same projection onto the pattern subspace as $u$ gives a learning error of 0. One can think of this as a `valley' in the error surface given by $u + V_\perp$. The training set contains no information that can help us choose between all these solutions to the learning problem.

When learning with a weight decay $\lambda > 0$, the constant part in $V_\perp$ will decay to zero asymptotically (as $e^{-\lambda t}$, where $t$ is the time). An infinitesimal weight decay will therefore choose the solution with the smallest norm out of all the solutions in the valley described above. This solution can be shown to be the optimal one on average.

4 LEARNING WITH AN UNRELIABLE TEACHER

Random errors made by the teacher can be modeled by adding a random term $\eta^\mu$ to the targets:

\[ f_u(\xi^\mu) \longrightarrow f_u(\xi^\mu) + \eta^\mu. \qquad (12) \]

The variance of $\eta^\mu$ is called $\sigma^2$, and it is assumed to have zero mean. Note that these targets are not exactly realizable by the network (for $\sigma > 0$), and therefore this is a simple model for studying learning of an unrealizable function. With this noise the learning equation (2) becomes

\[ \dot{w}_i \propto \sum_\mu \left( N^{-1} \sum_j v_j \xi_j^\mu + N^{-1/2} \eta^\mu \right) \xi_i^\mu - \lambda w_i. \qquad (13) \]

Transforming it to the basis where $A$ is diagonal as before,

\[ \dot{v}_r \propto -(A_r + \lambda) v_r + \lambda u_r - N^{-1/2} \sum_\mu \eta^\mu \xi_r^\mu. \qquad (14) \]

The asymptotic solution to this equation is

\[ v_r = \frac{\lambda u_r - N^{-1/2} \sum_\mu \eta^\mu \xi_r^\mu}{\lambda + A_r}. \qquad (15) \]

The contribution to the generalization error is the square of this summed over all $r$.
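The claim that an infinitesimal weight decay picks the smallest-norm solution in the valley can be verified directly in the linear case: as $\lambda \to 0^+$ the asymptotic weight-decay (ridge) solution converges to the minimum-norm interpolating solution given by the pseudoinverse. A small NumPy sketch with hypothetical random data ($p < N$, so the valley is non-trivial); the closed-form ridge solution used here is the standard minimizer of $|y - Xw|^2/2 + (\lambda/2)|w|^2$, an assumption of this sketch rather than code from the paper:

```python
import numpy as np

# For p < N the zero-error solutions form the valley u + V_perp.
# The ridge solution w(lam) = X^T (X X^T + lam I)^{-1} y solves the
# weight-decayed least-squares problem; as lam -> 0+ it approaches the
# minimum-norm interpolating solution pinv(X) @ y.
rng = np.random.default_rng(1)
N, P = 12, 5                        # weight-space dimension, patterns (P < N)
X = rng.standard_normal((P, N))
u = rng.standard_normal(N)
y = X @ u

lam = 1e-10                         # "infinitesimal" weight decay
w_ridge = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(P), y)
w_minnorm = np.linalg.pinv(X) @ y   # smallest-norm solution in the valley

print(np.allclose(w_ridge, w_minnorm, atol=1e-6))   # -> True
print(np.allclose(X @ w_ridge, y, atol=1e-6))       # zero learning error -> True
```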
If averaged over the noise (shown by the bar) it becomes, for each $r$,

\[ \bar{F}_r = \overline{v_r^2} = \frac{\lambda^2 u_r^2 + N^{-1} \overline{(\sum_\mu \eta^\mu \xi_r^\mu)^2}}{(\lambda + A_r)^2} = \frac{\lambda^2 u_r^2 + \sigma^2 A_r}{(\lambda + A_r)^2}. \qquad (16) \]

The last expression has a minimum in $\lambda$, which can be found by putting the derivative with respect to $\lambda$ equal to zero: $\lambda_r^{\mathrm{optimal}} = \sigma^2 / u_r^2$. Remarkably it depends only
on $u$ and the variance of the noise, and not on $A$. If it is assumed that $u$ is random, (16) can be averaged over $u$. This yields an optimal $\lambda$ independent of $r$,

\[ \lambda^{\mathrm{optimal}} = \sigma^2 / \bar{u}^2, \qquad (17) \]

where $\bar{u}^2$ is the average of $N^{-1} |u|^2$. In this case the weight decay to some extent prevents the network from fitting the noise.

Figure 1: Generalization error $F$ as a function of $\alpha = p/N$. The full line is for $\lambda = \sigma^2 = 0.2$, and the dashed line for $\lambda = 0$. The dotted line is the generalization error with no noise and $\lambda = 0$.

From equation (14) one can see that the noise is projected onto the pattern subspace. Therefore the contribution to the generalization error from $V_\perp$ is the same as before, and this contribution is on average minimized by a weight decay of any size.

Equation (17) was derived in [10] in the context of a particular eigenvalue spectrum. Figure 1 shows the dramatic improvement in generalization error when the optimal weight decay is used in this case. The present treatment shows that (17) is independent of the spectrum of $A$.

We conclude that a weight decay has two positive effects on generalization in a linear network: 1) It suppresses any irrelevant components of the weight vector by choosing the smallest vector that solves the learning problem. 2) If the size is chosen right, it can suppress some of the effect of static noise on the targets.

5 NON-LINEAR NETWORKS

It is not possible to analyze a general non-linear network exactly, as done above for the linear case. By a local linearization it is, however, possible to draw some interesting conclusions from the results in the previous section.

Assume the function is realizable, $f = f_u$. Then learning corresponds to solving the $p$ equations

\[ f_w(\xi^\mu) = f_u(\xi^\mu) \qquad (18) \]
in the $W$ variables $w_i$, where $W$ is the number of weights. For $p < W$ these equations define a manifold in weight space of dimension at least $W - p$. Any point $\tilde{w}$ on this manifold gives a learning error of zero, and therefore (4) can be expanded around $\tilde{w}$. Putting $v = \tilde{w} - w$, expanding $f_w$ in $v$, and using it in (4) gives

\[ \dot{v}_i \propto - \sum_j \left[ \sum_\mu \frac{\partial f_w(\xi^\mu)}{\partial w_i} \frac{\partial f_w(\xi^\mu)}{\partial w_j} \right] v_j + \lambda (\tilde{w}_i - v_i) = - \sum_j A_{ij}(\tilde{w}) v_j - \lambda v_i + \lambda \tilde{w}_i. \qquad (19) \]

(The derivatives in this equation should be taken at $\tilde{w}$.) The analogue of $A$ is defined as

\[ A_{ij}(\tilde{w}) \equiv \sum_\mu \frac{\partial f_w(\xi^\mu)}{\partial w_i} \frac{\partial f_w(\xi^\mu)}{\partial w_j}. \qquad (20) \]

Since it is of outer product form (like $A$) its rank $R(\tilde{w}) \le \min\{p, W\}$. Thus when $p < W$, $A$ is never of full rank. The rank of $A$ is of course equal to $W$ minus the dimension of the manifold mentioned above.

From these simple observations one can argue that good generalization should not be expected for $p < W$. This is in accordance with other results (cf. [3]), and with current `folk-lore'. The difference from the linear case is that the `rain gutter' need not be (and most probably is not) linear, but curved in this case. There may in fact be other valleys or rain gutters disconnected from the one containing $u$. One can also see that if $A$ has full rank, all points in the immediate neighborhood of $\tilde{w} = u$ give a learning error larger than 0, i.e. there is a simple minimum at $u$.

Assume that the learning finds one of these valleys. A small weight decay will pick out the point in the valley with the smallest norm among all the points in the valley. In general it can not be proven that picking that solution is the best strategy. But, at least from a philosophical point of view, it seems sensible, because it is (in a loose sense) the solution with the smallest complexity, the one that Ockham would probably have chosen.

The value of a weight decay is more evident if there are small errors in the targets. In that case one can go through exactly the same line of arguments as for the linear case to show that a weight decay can improve generalization, and even with the same optimal choice (17) of $\lambda$.
This is strictly true only for small errors (where the linear approximation is valid).

6 NUMERICAL EXPERIMENTS

A weight decay has been tested on the NetTalk problem [11]. In the simulations back-propagation derived from the `entropic error measure' [12], with a momentum term fixed at 0.8, was used. The network had 726 input units, 40 hidden units and 26 output units, in all about 8400 weights. It was trained on 400 to 5000 random words from the data base, and tested on a different set of 1000 random words. The training set and test set were independent from run to run.
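Returning for a moment to the linear theory, the optimal weight decay (17) is also easy to verify numerically: minimizing the per-mode error $\bar F_r(\lambda) = (\lambda^2 u_r^2 + \sigma^2 A_r)/(\lambda + A_r)^2$ of equation (16) over a grid should give $\lambda \approx \sigma^2/u_r^2$ for any eigenvalue $A_r$. The concrete numbers in this sketch are arbitrary illustrative choices, not values from the paper:

```python
import numpy as np

# Grid search confirming that the minimizer of the noise-averaged error
#   F_r(lam) = (lam^2 u_r^2 + sigma^2 A_r) / (lam + A_r)^2      (eq. 16)
# is lam = sigma^2 / u_r^2 (eq. 17), independent of the eigenvalue A_r.
# u_r, sigma^2 and the A_r values are arbitrary test numbers.

def F_r(lam, u_r, A_r, sigma2):
    return (lam**2 * u_r**2 + sigma2 * A_r) / (lam + A_r) ** 2

u_r, sigma2 = 1.5, 0.2
lam_star = sigma2 / u_r**2          # predicted optimum: 0.2 / 2.25 ~= 0.089
lams = np.linspace(1e-4, 1.0, 100001)
for A_r in (0.1, 0.5, 2.0):
    best = lams[np.argmin(F_r(lams, u_r, A_r, sigma2))]
    print(abs(best - lam_star) < 1e-3)   # -> True for every A_r
```

The minimizer does not move as $A_r$ changes, which is exactly the "remarkable" independence noted in Section 4.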
Figure 2: The top full line corresponds to the generalization error after 300 epochs (300 cycles through the training set) without a weight decay. The lower full line is with a weight decay. The top dotted line is the lowest error seen during learning without a weight decay, and the lower dotted line with a weight decay. The size of the weight decay was $\lambda = 0.\ldots$ Insert: Same figure except that the error rate is shown instead of the squared error. The error rate is the fraction of wrong phonemes when the phoneme vector with the smallest angle to the actual output is chosen, see [11].

Results are shown in fig. 2. There is a clear improvement in generalization error when weight decay is used. There is also an improvement in error rate (insert of fig. 2), but it is less pronounced in terms of relative improvement. Two other values of the weight decay were also tried and gave basically the same curves.

7 CONCLUSION

It was shown how a weight decay can improve generalization in two ways: 1) It suppresses any irrelevant components of the weight vector by choosing the smallest vector that solves the learning problem. 2) If the size is chosen right, a weight decay can suppress some of the effect of static noise on the targets. Static noise on the targets can be viewed as a model of learning an unrealizable function.

The analysis assumed that the network could be expanded around an optimal weight vector, and
therefore it is strictly only valid in a small neighborhood around that vector.

The improvement from a weight decay was also tested by simulations. For the NetTalk data it was shown that a weight decay can decrease the generalization error (squared error) and also, although less significantly, the actual mistake rate of the network when the phoneme closest to the output is chosen.

Acknowledgements

AK acknowledges support from the Danish Natural Science Council and the Danish Technical Research Council through the Computational Neural Network Center (CONNECT).

References

[1] D.B. Schwartz, V.K. Samalam, S.A. Solla, and J.S. Denker. Exhaustive learning. Neural Computation, 2:371-382.

[2] N. Tishby, E. Levin, and S.A. Solla. Consistent inference of probabilities in layered networks: predictions and generalization. In International Joint Conference on Neural Networks, pages 403-410, (Washington 1989), IEEE, New York.

[3] E.B. Baum and D. Haussler. What size net gives valid generalization? Neural Computation, 1:151-160.

[4] Y. Le Cun, J.S. Denker, and S.A. Solla. Optimal brain damage. In D.S. Touretzky, editor, Advances in Neural Information Processing Systems, pages 598-605, (Denver 1989), Morgan Kaufmann, San Mateo.

[5] H.H. Thodberg. Improving generalization of neural networks through pruning. International Journal of Neural Systems, 1:317-326.

[6] A.S. Weigend, D.E. Rumelhart, and B.A. Huberman. Generalization by weight-elimination with application to forecasting. In R.P. Lippmann et al., editors, Advances in Neural Information Processing Systems, pages 875-882, (Denver 1989), Morgan Kaufmann, San Mateo.

[7] G.E. Hinton. Learning translation invariant recognition in a massively parallel network. In G. Goos and J. Hartmanis, editors, PARLE: Parallel Architectures and Languages Europe. Lecture Notes in Computer Science, pages 1-13, Springer-Verlag, Berlin.

[8] J. Moody. Generalization, weight decay, and architecture selection for nonlinear learning systems. These proceedings.

[9] D. MacKay.
A practical Bayesian framework for backprop networks. These proceedings.

[10] A. Krogh and J.A. Hertz. Generalization in a linear perceptron in the presence of noise. To appear in Journal of Physics A.

[11] T.J. Sejnowski and C.R. Rosenberg. Parallel networks that learn to pronounce English text. Complex Systems, 1:145-168.

[12] J.A. Hertz, A. Krogh, and R.G. Palmer. Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City.
More informationPositive decomposition of transfer functions with multiple poles
Positive decomosition of transfer functions with multile oles Béla Nagy 1, Máté Matolcsi 2, and Márta Szilvási 1 Deartment of Analysis, Technical University of Budaest (BME), H-1111, Budaest, Egry J. u.
More informationAn Introduction to Information Theory: Notes
An Introduction to Information Theory: Notes Jon Shlens jonshlens@ucsd.edu 03 February 003 Preliminaries. Goals. Define basic set-u of information theory. Derive why entroy is the measure of information
More informationEstimating Time-Series Models
Estimating ime-series Models he Box-Jenkins methodology for tting a model to a scalar time series fx t g consists of ve stes:. Decide on the order of di erencing d that is needed to roduce a stationary
More informationCSC165H, Mathematical expression and reasoning for computer science week 12
CSC165H, Mathematical exression and reasoning for comuter science week 1 nd December 005 Gary Baumgartner and Danny Hea hea@cs.toronto.edu SF4306A 416-978-5899 htt//www.cs.toronto.edu/~hea/165/s005/index.shtml
More information2 K. ENTACHER 2 Generalized Haar function systems In the following we x an arbitrary integer base b 2. For the notations and denitions of generalized
BIT 38 :2 (998), 283{292. QUASI-MONTE CARLO METHODS FOR NUMERICAL INTEGRATION OF MULTIVARIATE HAAR SERIES II KARL ENTACHER y Deartment of Mathematics, University of Salzburg, Hellbrunnerstr. 34 A-52 Salzburg,
More informationbelow, kernel PCA Eigenvectors, and linear combinations thereof. For the cases where the pre-image does exist, we can provide a means of constructing
Kernel PCA Pattern Reconstruction via Approximate Pre-Images Bernhard Scholkopf, Sebastian Mika, Alex Smola, Gunnar Ratsch, & Klaus-Robert Muller GMD FIRST, Rudower Chaussee 5, 12489 Berlin, Germany fbs,
More informationWolfgang POESSNECKER and Ulrich GROSS*
Proceedings of the Asian Thermohysical Proerties onference -4 August, 007, Fukuoka, Jaan Paer No. 0 A QUASI-STEADY YLINDER METHOD FOR THE SIMULTANEOUS DETERMINATION OF HEAT APAITY, THERMAL ONDUTIVITY AND
More informationDeriving Indicator Direct and Cross Variograms from a Normal Scores Variogram Model (bigaus-full) David F. Machuca Mory and Clayton V.
Deriving ndicator Direct and Cross Variograms from a Normal Scores Variogram Model (bigaus-full) David F. Machuca Mory and Clayton V. Deutsch Centre for Comutational Geostatistics Deartment of Civil &
More informationInformation collection on a graph
Information collection on a grah Ilya O. Ryzhov Warren Powell February 10, 2010 Abstract We derive a knowledge gradient olicy for an otimal learning roblem on a grah, in which we use sequential measurements
More informationComputer arithmetic. Intensive Computation. Annalisa Massini 2017/2018
Comuter arithmetic Intensive Comutation Annalisa Massini 7/8 Intensive Comutation - 7/8 References Comuter Architecture - A Quantitative Aroach Hennessy Patterson Aendix J Intensive Comutation - 7/8 3
More informationON POLYNOMIAL SELECTION FOR THE GENERAL NUMBER FIELD SIEVE
MATHEMATICS OF COMPUTATIO Volume 75, umber 256, October 26, Pages 237 247 S 25-5718(6)187-9 Article electronically ublished on June 28, 26 O POLYOMIAL SELECTIO FOR THE GEERAL UMBER FIELD SIEVE THORSTE
More informationDesign of NARMA L-2 Control of Nonlinear Inverted Pendulum
International Research Journal of Alied and Basic Sciences 016 Available online at www.irjabs.com ISSN 51-838X / Vol, 10 (6): 679-684 Science Exlorer Publications Design of NARMA L- Control of Nonlinear
More informationOn 2-Armed Gaussian Bandits and Optimization
On -Armed Gaussian Bandits and Otimization William G. Macready David H. Wolert SFI WORKING PAPER: 996--9 SFI Working Paers contain accounts of scientific work of the author(s) and do not necessarily reresent
More informationMATHEMATICAL MODELLING OF THE WIRELESS COMMUNICATION NETWORK
Comuter Modelling and ew Technologies, 5, Vol.9, o., 3-39 Transort and Telecommunication Institute, Lomonosov, LV-9, Riga, Latvia MATHEMATICAL MODELLIG OF THE WIRELESS COMMUICATIO ETWORK M. KOPEETSK Deartment
More informationarxiv: v1 [physics.data-an] 26 Oct 2012
Constraints on Yield Parameters in Extended Maximum Likelihood Fits Till Moritz Karbach a, Maximilian Schlu b a TU Dortmund, Germany, moritz.karbach@cern.ch b TU Dortmund, Germany, maximilian.schlu@cern.ch
More informationOrdered and disordered dynamics in random networks
EUROPHYSICS LETTERS 5 March 998 Eurohys. Lett., 4 (6),. 599-604 (998) Ordered and disordered dynamics in random networks L. Glass ( )andc. Hill 2,3 Deartment of Physiology, McGill University 3655 Drummond
More informationObserver/Kalman Filter Time Varying System Identification
Observer/Kalman Filter Time Varying System Identification Manoranjan Majji Texas A&M University, College Station, Texas, USA Jer-Nan Juang 2 National Cheng Kung University, Tainan, Taiwan and John L. Junins
More informationLEIBNIZ SEMINORMS IN PROBABILITY SPACES
LEIBNIZ SEMINORMS IN PROBABILITY SPACES ÁDÁM BESENYEI AND ZOLTÁN LÉKA Abstract. In this aer we study the (strong) Leibniz roerty of centered moments of bounded random variables. We shall answer a question
More informationConvex Optimization methods for Computing Channel Capacity
Convex Otimization methods for Comuting Channel Caacity Abhishek Sinha Laboratory for Information and Decision Systems (LIDS), MIT sinhaa@mit.edu May 15, 2014 We consider a classical comutational roblem
More informationRecursive Estimation of the Preisach Density function for a Smart Actuator
Recursive Estimation of the Preisach Density function for a Smart Actuator Ram V. Iyer Deartment of Mathematics and Statistics, Texas Tech University, Lubbock, TX 7949-142. ABSTRACT The Preisach oerator
More informationIntrinsic Approximation on Cantor-like Sets, a Problem of Mahler
Intrinsic Aroximation on Cantor-like Sets, a Problem of Mahler Ryan Broderick, Lior Fishman, Asaf Reich and Barak Weiss July 200 Abstract In 984, Kurt Mahler osed the following fundamental question: How
More informationBayesian Model Averaging Kriging Jize Zhang and Alexandros Taflanidis
HIPAD LAB: HIGH PERFORMANCE SYSTEMS LABORATORY DEPARTMENT OF CIVIL AND ENVIRONMENTAL ENGINEERING AND EARTH SCIENCES Bayesian Model Averaging Kriging Jize Zhang and Alexandros Taflanidis Why use metamodeling
More informationAn Ant Colony Optimization Approach to the Probabilistic Traveling Salesman Problem
An Ant Colony Otimization Aroach to the Probabilistic Traveling Salesman Problem Leonora Bianchi 1, Luca Maria Gambardella 1, and Marco Dorigo 2 1 IDSIA, Strada Cantonale Galleria 2, CH-6928 Manno, Switzerland
More informationShadow Computing: An Energy-Aware Fault Tolerant Computing Model
Shadow Comuting: An Energy-Aware Fault Tolerant Comuting Model Bryan Mills, Taieb Znati, Rami Melhem Deartment of Comuter Science University of Pittsburgh (bmills, znati, melhem)@cs.itt.edu Index Terms
More informationGeneration of Linear Models using Simulation Results
4. IMACS-Symosium MATHMOD, Wien, 5..003,. 436-443 Generation of Linear Models using Simulation Results Georg Otte, Sven Reitz, Joachim Haase Fraunhofer Institute for Integrated Circuits, Branch Lab Design
More informationInformation collection on a graph
Information collection on a grah Ilya O. Ryzhov Warren Powell October 25, 2009 Abstract We derive a knowledge gradient olicy for an otimal learning roblem on a grah, in which we use sequential measurements
More informationUniformly best wavenumber approximations by spatial central difference operators: An initial investigation
Uniformly best wavenumber aroximations by satial central difference oerators: An initial investigation Vitor Linders and Jan Nordström Abstract A characterisation theorem for best uniform wavenumber aroximations
More informationA Qualitative Event-based Approach to Multiple Fault Diagnosis in Continuous Systems using Structural Model Decomposition
A Qualitative Event-based Aroach to Multile Fault Diagnosis in Continuous Systems using Structural Model Decomosition Matthew J. Daigle a,,, Anibal Bregon b,, Xenofon Koutsoukos c, Gautam Biswas c, Belarmino
More informationŽ. Ž. Ž. 2 QUADRATIC AND INVERSE REGRESSIONS FOR WISHART DISTRIBUTIONS 1
The Annals of Statistics 1998, Vol. 6, No., 573595 QUADRATIC AND INVERSE REGRESSIONS FOR WISHART DISTRIBUTIONS 1 BY GERARD LETAC AND HELENE ` MASSAM Universite Paul Sabatier and York University If U and
More informationFinding recurrent sources in sequences
Finding recurrent sources in sequences Aristides Gionis Deartment of Comuter Science Stanford University Stanford, CA, 94305, USA gionis@cs.stanford.edu Heikki Mannila HIIT Basic Research Unit Deartment
More informationHomework Solution 4 for APPM4/5560 Markov Processes
Homework Solution 4 for APPM4/556 Markov Processes 9.Reflecting random walk on the line. Consider the oints,,, 4 to be marked on a straight line. Let X n be a Markov chain that moves to the right with
More informationPreconditioning techniques for Newton s method for the incompressible Navier Stokes equations
Preconditioning techniques for Newton s method for the incomressible Navier Stokes equations H. C. ELMAN 1, D. LOGHIN 2 and A. J. WATHEN 3 1 Deartment of Comuter Science, University of Maryland, College
More informationarxiv: v2 [stat.ml] 7 May 2015
Semi-Orthogonal Multilinear PCA with Relaxed Start Qiuan Shi and Haiing Lu Deartment of Comuter Science Hong Kong Batist University, Hong Kong, China csshi@com.hkbu.edu.hk, haiing@hkbu.edu.hk arxiv:1504.08142v2
More informationA Closed-Form Solution to the Minimum V 2
Celestial Mechanics and Dynamical Astronomy manuscrit No. (will be inserted by the editor) Martín Avendaño Daniele Mortari A Closed-Form Solution to the Minimum V tot Lambert s Problem Received: Month
More informationMetrics Performance Evaluation: Application to Face Recognition
Metrics Performance Evaluation: Alication to Face Recognition Naser Zaeri, Abeer AlSadeq, and Abdallah Cherri Electrical Engineering Det., Kuwait University, P.O. Box 5969, Safat 6, Kuwait {zaery, abeer,
More informationLower bound solutions for bearing capacity of jointed rock
Comuters and Geotechnics 31 (2004) 23 36 www.elsevier.com/locate/comgeo Lower bound solutions for bearing caacity of jointed rock D.J. Sutcliffe a, H.S. Yu b, *, S.W. Sloan c a Deartment of Civil, Surveying
More informationAn Algorithm of Two-Phase Learning for Eleman Neural Network to Avoid the Local Minima Problem
IJCSNS International Journal of Comuter Science and Networ Security, VOL.7 No.4, Aril 2007 1 An Algorithm of Two-Phase Learning for Eleman Neural Networ to Avoid the Local Minima Problem Zhiqiang Zhang,
More informationBayesian System for Differential Cryptanalysis of DES
Available online at www.sciencedirect.com ScienceDirect IERI Procedia 7 (014 ) 15 0 013 International Conference on Alied Comuting, Comuter Science, and Comuter Engineering Bayesian System for Differential
More informationA Game Theoretic Investigation of Selection Methods in Two Population Coevolution
A Game Theoretic Investigation of Selection Methods in Two Poulation Coevolution Sevan G. Ficici Division of Engineering and Alied Sciences Harvard University Cambridge, Massachusetts 238 USA sevan@eecs.harvard.edu
More informationThe non-stochastic multi-armed bandit problem
Submitted for journal ublication. The non-stochastic multi-armed bandit roblem Peter Auer Institute for Theoretical Comuter Science Graz University of Technology A-8010 Graz (Austria) auer@igi.tu-graz.ac.at
More information