Why feed-forward networks are in a bad shape

Patrick van der Smagt, Gerd Hirzinger
Institute of Robotics and System Dynamics
German Aerospace Center (DLR Oberpfaffenhofen)
82230 Wessling, GERMANY
email: smagt@dlr.de

Abstract

It has often been noted that the learning problem in feed-forward neural networks is very badly conditioned. Although the special form of the transfer function is usually taken to be the cause of this condition, we show that it is caused by the manner in which the neurons are connected. By analyzing the expected values of the Hessian in a feed-forward network it is shown that, even in a network where all the learning samples are well chosen and the transfer function is not in its saturated state, the system has a non-optimal condition. We subsequently propose a change to the feed-forward network structure which alleviates this problem. We finally demonstrate the positive influence of this approach.

1 Introduction

It has long been known [1, 3, 4, 6] that learning in feed-forward networks is a difficult problem, and that this is intrinsic to the structure of such networks. The cause of the learning difficulties is reflected in the Hessian matrix of the learning problem, which consists of the second derivatives of the error function. When the Hessian is very badly conditioned, the error function has a very strongly elongated shape; indeed, condition numbers of $10^{20}$ are no exception in feed-forward network learning, and in fact mean that the problem exceeds the representational accuracy of the computer. We will show that this problem is caused by the structure of the feed-forward network. An adaptation to the learning rule is shown to improve this condition.

2 The learning problem

We define a feed-forward neural network with a single layer of hidden units (where $i$ indicates the $i$th input, $h$ the $h$th hidden unit, $o$ the $o$th output, and $\vec{x}$ is an input vector $(x_1, x_2, \ldots, x_N)$):

    $\mathcal{N}(\vec{x}; W)_o = \sum_h w_{ho}\, s\Big(\sum_i w_{ih} x_i + \theta_h\Big).$    (1)

In: L. Niklasson, M. Bodén, and T. Ziemke, editors, Proceedings of the 8th International Conference on Artificial Neural Networks, pages 159–164. Springer Verlag, 1998.
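The single-hidden-layer network of Eq. (1) can be sketched numerically. A minimal NumPy version, assuming $s = \tanh$; the weight shapes and numbers are illustrative, not from the paper:

```python
import numpy as np

def forward(x, W_ih, theta, W_ho):
    """Feed-forward network of Eq. (1): one hidden layer, transfer s = tanh.

    x     : input vector, shape (N,)
    W_ih  : input-to-hidden weights, shape (H, N)
    theta : hidden biases, shape (H,)
    W_ho  : hidden-to-output weights, shape (No, H)
    """
    a_h = np.tanh(W_ih @ x + theta)  # hidden activations s(sum_i w_ih x_i + theta_h)
    return W_ho @ a_h                # outputs sum_h w_ho a_h

# illustrative example with N = 2 inputs, H = 2 hidden units, No = 1 output
x = np.array([0.5, -1.0])
W_ih = np.array([[0.3, -0.2], [0.1, 0.4]])
theta = np.zeros(2)
W_ho = np.array([[1.0, -1.0]])
y = forward(x, W_ih, theta, W_ho)
```

The hidden net inputs here are $0.35$ and $-0.35$, so the single output is $2\tanh(0.35)$.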
The total number of weights $w_{ij}$ is $n$. W.l.o.g. we will assume $N_o = 1$ in the sequel. The learning task consists of minimizing an approximation error

    $E_{\mathcal{N}}(W) = \frac{1}{2P} \sum_p \big\| \mathcal{N}(\vec{x}^{(p)}; W) - \vec{y}^{(p)} \big\|^2$    (2)

where $P$ is the number of learning samples and $\vec{y}$ is an output vector $(y_1, y_2, \ldots, y_{N_o})$. $\|\cdot\|$ is usually taken to be the $L_2$ norm. The error $E$ can be written as a Taylor expansion around $W_0$:

    $E_{\mathcal{N}}(W_0 + \Delta W) = E(W_0) - J^T \Delta W + \tfrac{1}{2} \Delta W^T H \Delta W + r(W_0).$    (3)

Here, $H$ is the Hessian and $J$ is the Jacobian of $E$ around $W_0$. In the case that $E$ is quadratic, the rest term is 0, and a Newton-based second-order search technique can then be used to find the extremum of the ellipsoid $E$. A problem arises, however, when the axes of this ellipsoid differ strongly in length. When the ratio of the lengths of the largest and smallest axes is very large (close to the computational precision of the computer used in the optimization), the computation of the exact local gradient will be imprecise, such that the system is difficult to minimize. Now, since $H$ is a real symmetric matrix, its eigenvectors span an orthogonal basis, and the directions of the axes of $E$ are equal to the eigenvectors of $H$. Furthermore, the lengths of the axes are inversely proportional to the square roots of the corresponding eigenvalues. Therefore the condition number of the Hessian matrix, which is defined as the ratio of the largest and smallest singular values (and therefore, for a positive definite matrix, of the largest and smallest eigenvalues), determines how well the error surface $E$ can be minimized.

2.1 The linear network

In the case that $s()$ is the identity, $\mathcal{N}(W)$ is equivalent to a linear feed-forward network $\mathcal{N}(W')$ without hidden units, and we can write

    $E_{\mathcal{N}}(W'_0 + \Delta W') = E(W'_0) - J^T \Delta W' + \tfrac{1}{2} \Delta W'^T H \Delta W'.$    (4)

In the linear case the Hessian reduces to (leaving the index $(p)$ out)

    $H_{jk} = P^{-1} \sum_p x_j x_k, \qquad 1 \le j, k \le n = N+1$    (5)

where, for notational simplicity, we set $x_{N+1} \equiv 1$. In this case $H$ is the covariance matrix of the input patterns. It is instantly clear that $H$ is a positive definite symmetric matrix. Le Cun et al. [1] show that, when the input patterns are uncorrelated, $H$ has a continuous spectrum of eigenvalues $\lambda_- < \lambda < \lambda_+$. Furthermore, there is one eigenvalue of order $n$ which is present only in the case that $\langle x_k \rangle \ne 0$. Therefore, the Hessian for a linear feed-forward network is optimally conditioned when $\langle x_k \rangle = 0$.

The reason for this behaviour is very understandable. As $P \to \infty$, the summation of uncorrelated elements $x_i x_j$ will cancel out when $\langle x \rangle = 0$, except where $i = j$, i.e., on the diagonal of the covariance matrix. In the limit these diagonal elements go towards the variance of the input data, $\sigma^2(x_i) = P^{-1} \sum_p \big(x_i^{(p)}\big)^2$. From Gerschgorin's theorem we know that the eigenvalues of a diagonal matrix equal the elements on the diagonal.

2.2 Multi-layer feed-forward networks

In the case that a nonlinear feed-forward network is used, i.e., $s()$ is a nonlinear transfer function, the rest term in Eq. (3) cannot be neglected in general. However, it is a well-known fact [2] that the rest term $r()$ is negligible close enough to a minimum. From the definition of $E_{\mathcal{N}}$ we can compute that

    $H_{jk} = P^{-1} \sum_p \Big( [\mathcal{N}(\vec{x}) - y]\, \frac{\partial^2 \mathcal{N}(\vec{x})}{\partial w_j \partial w_k} + \frac{\partial \mathcal{N}(\vec{x})}{\partial w_j} \frac{\partial \mathcal{N}(\vec{x})}{\partial w_k} \Big).$    (6)

We investigate the properties of $H_{jk}$ of Eq. (6). The first term of the Hessian has a factor $[\mathcal{N}(\vec{x}) - y]$. Close to a minimum, this factor is close to zero such that it can be neglected. Also, when summed over many learning samples, this factor equals the random measurement error and cancels out in the summation. Therefore we can write

    $H_{jk} \approx P^{-1} \sum_p \frac{\partial \mathcal{N}(\vec{x})}{\partial w_j} \frac{\partial \mathcal{N}(\vec{x})}{\partial w_k}.$    (7)

2.3 Properties of the Hessian

Every Hessian of a feed-forward network with one layer of hidden units can be partitioned into four parts, depending on whether the derivative is taken with respect to a weight from input to hidden unit or from hidden to output unit. We take the simplification of Eq. (7) as a starting point. The partial derivatives of $\mathcal{N}$ can be computed to be

    $\frac{\partial \mathcal{N}(\vec{x})}{\partial w_{ho}} = s\Big(\sum_i w_{ih} x_i + \theta_h\Big) \equiv a_h, \qquad \frac{\partial \mathcal{N}(\vec{x})}{\partial w_{ih}} = x_i\, w_{ho}\, s'\Big(\sum_i w_{ih} x_i + \theta_h\Big) \equiv x_i\, w_{ho}\, a'_h$

where $s'()$ is the derivative of $s()$. We can write the Hessian as a block matrix

    $H = \begin{pmatrix} {}^{00}H & {}^{01}H \\ {}^{10}H & {}^{11}H \end{pmatrix}$

with

    ${}^{00}H = P^{-1} \sum_p (x_{i_1} w_{h_1 o}\, a'_{h_1})(x_{i_2} w_{h_2 o}\, a'_{h_2}),$
    ${}^{10}H = P^{-1} \sum_p (x_i\, w_{h_1 o}\, a'_{h_1})\, a_{h_2},$
    ${}^{01}H = P^{-1} \sum_p (x_i\, w_{h_2 o}\, a'_{h_2})\, a_{h_1},$
    ${}^{11}H = P^{-1} \sum_p a_{h_1}\, a_{h_2}$
(note that ${}^{10}H^T = {}^{01}H$). Assuming that the input samples have a normal $(0,1)$ distribution, we can analytically compute the expectations and variances of the elements of $H$ by determining the distribution functions of these elements. Figure 1 (left) depicts the expectations and standard deviations of the elements of $H$ for $P = 100$. From the figure it can be seen that, even though the network is in an optimal state, the elements of ${}^{11}H$ are much larger than those of ${}^{00}H$. Naturally, this effect is much stronger when the weights from input to hidden units are large, such that the hidden units are close to their saturated state. In that case, $a'_h \approx 0$ and the elements of ${}^{00}H$ (as well as those of ${}^{01}H$) tend to 0.

The centering method proposed in [1], when also applied to the activation values of the hidden units, ensures an optimal condition of ${}^{11}H$. Schraudolph and Sejnowski [4] have shown that centering in backpropagation further improves the learning problem. Using argumentation similar to Le Cun et al. [1], they furthermore suggest that centering the $\delta_o = y_o - a_o$ as well as the $\delta_h = \sum_o \delta_o w_{ho} s'(a_h)$ improves the condition of $H$. Although this will improve the condition of ${}^{00}H$ and ${}^{01}H$, the problem remains that the elements of ${}^{00}H$ and ${}^{11}H$ are very different in size. We suggest that this approach alone is not sufficient to improve the learning problem.

3 An adapted learning rule

To understand why the elements of $H$ are so different, we have to consider the back-propagation learning rule. First, Saarinen et al. [3] list a few cases in which the Hessian may become (nearly) singular. The listed reasons are associated with the ill character of the transfer functions which are customarily used: the sigmoidal function $s(x)$ saturates (i.e., its derivative becomes zero) for large input. However, another problem exists: when a network has a small weight leaving from a hidden unit, the influence of the weights that feed into this hidden unit is significantly reduced.
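This reduction can be checked numerically. The sketch below, with illustrative numbers not taken from the paper, evaluates the backpropagated gradient factor $\delta_o\, w_{ho}\, s'(\mathrm{net}_h)\, x_i$ for a shrinking outgoing weight $w_{ho}$, assuming $s = \tanh$:

```python
import numpy as np

def hidden_weight_gradient(x_i, w_ho, net_h, delta_o):
    """Gradient contribution for one input-to-hidden weight under backprop.

    With s = tanh, s'(net) = 1 - tanh(net)**2. The contribution scales
    linearly with the outgoing weight w_ho of the hidden unit.
    """
    s_prime = 1.0 - np.tanh(net_h) ** 2
    return delta_o * w_ho * s_prime * x_i

# shrink the outgoing weight by 10x each step; the gradient shrinks in
# exact proportion, which is the paralysis effect described in the text
grads = [abs(hidden_weight_gradient(1.0, w, 0.5, 0.2))
         for w in (1.0, 0.1, 0.01)]
```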
This problem touches a characteristic problem in feed-forward network learning: the gradients in the lower-layer weights are influenced by the higher-layer weights. Why this is so can be seen from the back-propagation learning method, which works as follows. For each learning sample:

1. compute $\delta_o = y_o - a_o$, where $a_o$ is the activation value for output unit $o$;
2. compute $\Delta w_{ho} = \delta_o a_h$, where $a_h$ is the activation for hidden unit $h$;
3. compute $\delta_h = \sum_o \delta_o w_{ho} s'(a_h)$;
4. compute $\Delta w_{ih} = \delta_h x_i = \sum_o \delta_o w_{ho} s'(a_h)\, x_i$.

The gradient is then computed as the summation of the $\Delta w$'s. The gradient for a weight from an input to a hidden unit becomes negligible when $\delta_o$ is small (i.e., the network correctly represents this sample), $x_i$ is small (i.e., the network input is close to 0), $w_{ho}$ is small, or $s'(a_h)$ is small (because $w_{ih}$ is large). The latter two of these cases are undesirable, and lead to paralysis of the weights from input to hidden units.
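The four steps above can be sketched for a single sample. A minimal NumPy version with linear output units and $s = \tanh$; the function name and shapes are chosen for illustration:

```python
import numpy as np

def backprop_deltas(x, y, W_ih, theta, W_ho):
    """One backprop pass for a single sample, following steps 1-4.

    Hidden transfer s = tanh, so s'(net) = 1 - a_h**2; output units are
    linear. Returns the weight updates (negative error gradients).
    """
    net_h = W_ih @ x + theta
    a_h = np.tanh(net_h)
    a_o = W_ho @ a_h                              # forward pass
    delta_o = y - a_o                             # step 1
    dW_ho = np.outer(delta_o, a_h)                # step 2
    delta_h = (W_ho.T @ delta_o) * (1 - a_h**2)   # step 3
    dW_ih = np.outer(delta_h, x)                  # step 4
    return dW_ho, dW_ih
```

With one input, one hidden unit at zero net input, and $w_{ho} = 1$, step 2 vanishes (since $a_h = 0$) while step 4 passes the full $\delta_o$ through, matching the hand derivation.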
Figure 1: (left) The distribution of the elements of $H$ for $P = 100$. (right) An exemplar linearly augmented feed-forward neural network.

In order to alleviate these problems, we propose a change to the learning rule as follows:

    $\Delta w_{ih} = \sum_o \delta_o \big( w_{ho}\, s'(a_h) + a_h \big) x_i = \delta_h x_i + a_h x_i \sum_o \delta_o.$    (8)

By adding $a_h$ to the middle term, we can solve both paralysis problems. In effect, an extra connection from each input unit to each output unit is created, with a weight value coupled to the weight from the input to the hidden unit. The $o$th output of the neural network is now computed as

    $\mathcal{M}(\vec{x}; W)_o = \sum_h w_{ho}\, s\Big(\sum_i w_{ih} x_i + \theta_h\Big) + \sum_h S\Big(\sum_i w_{ih} x_i\Big)$    (9)

where $dS(x)/dx \equiv s(x)$. In the case that $s(x)$ is the tanh function we find that $S(x) = \log\cosh x$; note that this function asymptotically goes to $|x| - \log 2$ for large $|x|$. In effect, we add the absolute values of the hidden unit activations to each output unit. We call the new network the linearly augmented feed-forward network. The structure of this network is depicted in Figure 1 (right) for one input and output unit and two hidden units.

Analysis of the Hessian's condition. We can compute the approximation error $E'$ for $\mathcal{M}$ similar to (2) and construct the Hessian $H'$ for $\mathcal{M}$. We can relate the quadrants of $H'$ to those of $H$ as follows. First, ${}^{11}H' = {}^{11}H$, and

    ${}^{00}H'_{i_1 + h_1 N,\; i_2 + h_2 N} = {}^{00}H_{\ldots} + P^{-1} \sum_p x_{i_1} a_{h_1} x_{i_2} a_{h_2} \big( 1 + w_{h_1 o} a'_{h_1} + w_{h_2 o} a'_{h_2} \big),$
    ${}^{01}H'_{i + h_1 N,\; h_2} = {}^{01}H_{\ldots} + P^{-1} \sum_p x_i\, a_{h_1} a_{h_2}.$

Using the same line of argument as in section 2.1, we note that it is important that the $a$'s be centered, i.e., (a) the input values should be centered, and (b) it is advantageous to use a centered hidden-unit activation function (e.g., $\tanh(x)$ rather than $1/(1+e^{-x})$).
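Under one reading of Eq. (9), with $s = \tanh$ and hence $S(x) = \log\cosh x$ summed over the hidden units, the augmented forward pass can be sketched as follows (shapes and the omission of the bias inside $S$ are assumptions of this sketch):

```python
import numpy as np

def augmented_forward(x, W_ih, theta, W_ho):
    """Linearly augmented network: s = tanh, S(x) = log cosh x.

    The ordinary hidden-to-output sum is augmented with sum_h S(net_h),
    where S is the antiderivative of s and net_h excludes the bias.
    """
    net = W_ih @ x                   # hidden net inputs without bias
    a_h = np.tanh(net + theta)       # ordinary hidden activations
    S = np.log(np.cosh(net))         # ~ |net| - log 2 for large |net|
    return W_ho @ a_h + np.sum(S)
```

For zero input-to-hidden weights the augmentation term vanishes and the output reduces to the ordinary network; for large net inputs $\log\cosh$ approaches $|x| - \log 2$, i.e., the absolute value mentioned in the text.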
$\mathcal{M}$ and the Universal Approximation Theorems. It has been shown in various publications that the ordinary feed-forward neural network $\mathcal{N}$ can represent any Borel-measurable function with a single layer of hidden units which have sigmoidal or Gaussian activation functions. It can be easily shown [6] that $\mathcal{N}$ and $\mathcal{M}$ are equivalent, such that all representation theorems that hold for $\mathcal{N}$ also hold for $\mathcal{M}$.

4 Examples

The new method has been tested on a few problems. First, OR classification with two hidden units. Secondly, the approximation of $\sin(x)$ with 3 hidden units and 11 learning samples randomly chosen between 0 and $2\pi$. Third, the approximation of the inverse kinematics plus inverse perspective transform for a 3 DoF robot arm with a camera fixed in the end-effector. The network consisted of 5 inputs, 8 hidden units, and 3 outputs; the 1107 learning samples were gathered using a Manutec R3 robot. All networks were trained with Polak-Ribière conjugate gradient with Powell restarts [5]. All experiments were run 1000 times with different initial weights. The results are illustrated below. Note that, for the chosen problems, $\mathcal{M}$ effectively smoothes away local minima and saddle points (% stuck goes to 0.0). The reported $E$ was measured only over those runs which did not get stuck.

                                                $\mathcal{N}$        $\mathcal{M}$
    OR      % stuck                             22.4                 0.0
            # steps to reach $E = 0.0$          189.1                65.3
    sin     % stuck                             42.3                 0.0
            $E$ after 1000 iterations           $2.9 \cdot 10^{-5}$  $7.3 \cdot 10^{-5}$
    robot   $E$ after 1000 iterations           $5.3 \cdot 10^{-3}$  $1.8 \cdot 10^{-3}$

References

[1] Y. Le Cun, I. Kanter, and S. A. Solla. Eigenvalues of covariance matrices: Application to neural network learning. Physical Review Letters, 66(18):2396–2399, 1991.
[2] E. Polak. Computational Methods in Optimization. Academic Press, New York, 1971.
[3] S. Saarinen, R. Bramley, and G. Cybenko. Ill-conditioning in neural network training problems. SIAM Journal of Scientific Computing, 14(3):693–714, May 1993.
[4] N. N. Schraudolph and T. J. Sejnowski. Tempering backpropagation networks: Not all weights are created equal. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems, pages 563–569, 1996.
[5] P. van der Smagt. Minimisation methods for training feed-forward networks. Neural Networks, 7(1):1–11, 1994.
[6] P. van der Smagt and G. Hirzinger. Solving the ill-conditioning in neural network learning. In J. Orr, K. Müller, and R. Caruana, editors, Tricks of the Trade: How to Make Neural Networks Really Work. Lecture Notes in Computer Science, Springer Verlag, 1998. In print.