Local Minima and Plateaus in Multilayer Neural Networks

Kenji Fukumizu and Shun-ichi Amari
Brain Science Institute, RIKEN
Hirosawa 2-1, Wako, Saitama 351-0198, Japan
E-mail: {fuku, amari}@brain.riken.go.jp

Abstract

Local minima and plateaus pose a serious problem in learning of neural networks. We investigate the geometric structure of the parameter space of three-layer perceptrons in order to show the existence of local minima and plateaus. It is proved that a critical point of the model with H−1 hidden units always gives a critical point of the model with H hidden units. Based on this result, we prove that the critical point corresponding to the global minimum of a smaller model can be a local minimum or a saddle point of the larger model. We give a necessary and sufficient condition for this. The results are universal in the sense that they do not use special properties of the target, the loss function, or the activation function, but only the hierarchical structure of the model.

1 Introduction

It has been believed that the error surface of multilayer perceptrons (MLP) has in general many local minima. This has been regarded as one of the disadvantages of neural networks, and a great deal of effort has been devoted to finding good methods of avoiding them. There have been no rigorous results, however, to prove the existence of local minima. Even in the XOR problem, the existence of local minima had been controversial. Lisboa and Perantonis ([1]) elucidated all the critical points of the XOR problem and asserted, with the help of numerical simulations, that some of them are local minima. Recently, Hamey ([2]) and Sprinkhuizen-Kuyper & Boers ([3]) rigorously proved that what had been believed to be local minima in [1] correspond to local minima with infinite parameter values, and that there are no local minima in the finite weight region for the XOR problem. The existence of local minima in general cases is still an open problem.
It is also difficult to derive meaningful results on local minima from numerical experiments. We often see extremely slow dynamics around a point in simulations. However, it is not easy to tell rigorously whether it is a local minimum. It is known ([4],[5]) that a typical learning curve shows a plateau in the middle of training, during which the training error almost stops decreasing. Such a plateau can easily be mistaken for a local minimum.

We mathematically investigate critical points of MLP which are caused by the hierarchical structure of the model. We discuss only networks with one output unit in this paper. The function space of networks with H−1 hidden units is included in the function space of networks with H hidden units. However, the relation between their parameter spaces is not so simple ([6],[7]). We investigate their geometric structure and elucidate how a parameter of a smaller network is embedded in the parameter space of larger networks. We show that a critical point of the error surface for the smaller model gives a set of critical points for the larger model. The main purpose of this paper is to show that a subset of the critical points corresponding to the global minimum of the smaller model can consist of local minima of the larger model. The set of critical points is divided into two parts: local minima and saddles. We give an explicit condition for when this occurs. This gives a formal proof of the existence of local minima for the first time. Moreover, the coexistence of local minima and saddles explains a serious mechanism of plateaus: when such is the case, the parameters are attracted to the part consisting of local minima, walk randomly along it for a long time, and eventually go out from the part consisting of saddles.
2 Geometric structure of the parameter space

2.1 Basic definitions

We consider a three-layer perceptron with one linear output unit and L input units. The function of a network with H hidden units is defined by

    f^(H)(x; θ^(H)) = Σ_{j=1}^{H} v_j φ(w_j^T x),    (1)

where x ∈ R^L is an input vector and θ^(H) = (v_1, ..., v_H, w_1^T, ..., w_H^T)^T is the parameter vector. We do not use bias terms for simplicity. The function φ(t) is called an activation function. In this paper, we use tanh for φ. However, our results can be easily extended to a wider class of functions with the necessary modifications.

Given N training data {(x^(ν), y^(ν))}_{ν=1}^{N}, the objective of training is to find the parameter that minimizes the error function

    E_H(θ) = Σ_{ν=1}^{N} ℓ(y^(ν), f(x^(ν); θ)),    (2)

where ℓ(y, z) is a loss function. If ℓ(y, z) = (1/2)‖y − z‖², the objective function is the mean square error. Another popular choice is the cross-entropy. The results in this paper are independent of the choice of the loss function.

2.2 Hierarchical structure of MLP

The parameter θ^(H) ranges over an (L+1)H-dimensional Euclidean space Θ_H. All the functions of eq.(1) realized by Θ_H form a function space

    S_H = {f^(H)(·; θ^(H)) : R^L → R | θ^(H) ∈ Θ_H}.    (3)

We denote the map from Θ_H onto S_H by

    π_H : Θ_H → S_H,  θ^(H) ↦ f(·; θ^(H)).    (4)

We sometimes write f_θ^(H) for π_H(θ). A very important point is that π_H is not one-to-one; that is, different θ^(H) may give the same function. It is easy to see that the interchange of (v_{j1}, w_{j1}) and (v_{j2}, w_{j2}) does not alter the image of π_H. For the tanh activation, Chen et al. ([6]) showed that any analytic transform T of Θ_H such that f^(H)(x; T(θ)) = f^(H)(x; θ) is a composition of such interchanges and the sign flips (v_j, w_j) ↦ (−v_j, −w_j). These transforms form an algebraic group G_H.

The function spaces S_H (H = 0, 1, 2, ...) have a trivial hierarchical structure

    S_0 ⊂ S_1 ⊂ ... ⊂ S_{H−1} ⊂ S_H.    (5)

On the other hand, given a function f^(H−1) realized by a network with H−1 hidden units, there is a family of parameters θ^(H) ∈ Θ_H that realize f^(H−1).
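As a concrete illustration of eqs.(1)-(2), the model and its training error can be sketched in a few lines of NumPy. This is a minimal sketch in our own notation (the names `f`, `E`, `v`, `W`, and `data` are not from the paper), using the tanh activation and the squared loss.

```python
import numpy as np

# Minimal sketch of eqs.(1)-(2): a three-layer perceptron with one linear
# output unit and H hidden tanh units, and its training error.
# All names here are ours, not the paper's.

def f(x, v, W):
    """f(x; theta) = sum_j v_j * tanh(w_j^T x); row j of W is w_j."""
    return v @ np.tanh(W @ x)

def E(data, v, W):
    """Training error (2) with the squared loss l(y, z) = 0.5*(y - z)^2."""
    return sum(0.5 * (y - f(x, v, W)) ** 2 for x, y in data)
```

Here θ^(H) is split into the output weights v (length H) and the input weight matrix W (H × L), matching eq.(1) with no bias terms.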
Mathematically speaking, a map from Θ_{H−1} to Θ_H that makes the following diagram commute is not uniquely determined:

    Θ_{H−1}  ──?──→  Θ_H
       │π_{H−1}        │π_H
       ↓               ↓
    S_{H−1}  ──⊂──→  S_H        (6)

The set of all the parameters θ^(H) that realize the functions of smaller networks is denoted by Θ̃_H = π_H^{−1}(S_{H−1}). Sussmann ([7]) showed that Θ̃_H is the union of the following three kinds of submanifolds of Θ_H:

    A_j = {v_j = 0}  (1 ≤ j ≤ H),
    B_j = {w_j = 0}  (1 ≤ j ≤ H),
    C_{j1 j2} = {w_{j1} = w_{j2}}  (j1 < j2).

Fig. 1 illustrates these parameters. In A_j and B_j, the j-th hidden unit plays no role in the value of the input-output function. In C_{j1 j2}, the j1-th and j2-th hidden units can be integrated into one, where v_{j1} + v_{j2} is the weight of the new unit to the output unit. From the viewpoint of mathematical statistics, it is also known ([8]) that Θ̃_H is the set of all the points at which the Fisher information matrix is singular.

Next, we will see how a specific function in the smaller model is embedded in the parameter space of the larger model. Let f^(H−1) be a function in S_{H−1} − S_{H−2}. To distinguish Θ_{H−1} from Θ_H, we use different parameter variables and indexing:

    f^(H−1)(x; θ^(H−1)) = Σ_{j=2}^{H} ζ_j φ(u_j^T x).    (7)

Then, given θ^(H−1), the parameter set in Θ_H realizing f^(H−1) is the union of submanifolds contained in the A_j, B_j, and C_{j1 j2}.
Figure 1: Networks given by A_j, B_j, and C^+_{j1 j2}.

For simplicity, we show only an example of the submanifolds of A_1, B_1, and C_{12}:

    Λ = {v_1 = 0, v_j = ζ_j, w_j = u_j (j ≥ 2), w_1: free},
    Ξ = {w_1 = 0, v_j = ζ_j, w_j = u_j (j ≥ 2), v_1: free},
    Γ = {w_1 = w_2 = u_2, v_1 + v_2 = ζ_2, v_j = ζ_j, w_j = u_j (j ≥ 3)}.    (8)

All the other sets of parameters realizing f^(H−1) are obtained as transforms of Λ, Ξ, and Γ by T ∈ G_H. The submanifold Λ is an L-dimensional affine space parallel to the w_1-plane, Ξ is a line with an arbitrary v_1, and Γ is a line defined by v_1 + v_2 = ζ_2 in the v_1v_2-plane. Thus, each function of a smaller network is realized by high-dimensional submanifolds in Θ_H.

For further analysis, we define canonical embeddings of Θ_{H−1} into Θ_H, which make the diagram (6) commute:

    α_w : θ^(H−1) ↦ (0, ζ_2, ..., ζ_H, w, u_2, ..., u_H),
    β_v : θ^(H−1) ↦ (v, ζ_2, ..., ζ_H, 0, u_2, ..., u_H),
    γ_λ : θ^(H−1) ↦ (λζ_2, (1−λ)ζ_2, ζ_3, ..., ζ_H, u_2, u_2, u_3, ..., u_H),    (9)

where w ∈ R^L, v ∈ R, and λ ∈ R are their parameters. The images of these embeddings, as their parameters vary, span the submanifolds Λ, Ξ, and Γ respectively; that is,

    Λ = {α_w(θ^(H−1)) | w ∈ R^L},  Ξ = {β_v(θ^(H−1)) | v ∈ R},  Γ = {γ_λ(θ^(H−1)) | λ ∈ R}.

3 Critical points of MLP

Generally, the optimum parameter cannot be calculated analytically, and some numerical optimization method is needed to obtain an approximation of the global minimum of E_H. One widely-used method is steepest descent, which leads to the learning rule

    θ(t+1) = θ(t) − ε ∂E_H(θ(t))/∂θ.    (10)

However, this learning rule stops at any critical point, which satisfies ∂E_H(θ)/∂θ = 0, even if it is not the global minimum. There are three types of critical points: local minima, local maxima, and saddle points.
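To make the embedding γ_λ of eq.(9) concrete, here is a small numerical check (our own toy construction, not the paper's experiment): for a 1-hidden-unit network, every γ_λ image computes the same function, and when the data are realized exactly by the smaller network (so that θ^(1) is a zero-error critical point of E_1), the gradient of E_2 vanishes at γ_λ(θ^(1)) for every λ.

```python
import numpy as np

# Toy numerical check of the gamma_lambda embedding in eq.(9)
# (our own construction: H = 2, L = 2, tanh activation, squared loss, no bias).
rng = np.random.default_rng(0)
u2, zeta2 = np.array([0.8, -0.5]), 1.5        # smaller network theta^(1)
X = rng.normal(size=(50, 2))
Y = zeta2 * np.tanh(X @ u2)                   # data realized exactly by it

def E2(theta):
    """E_2 for the 2-hidden-unit model; theta = (v1, v2, w1, w2) flattened."""
    v, W = theta[:2], theta[2:].reshape(2, 2)
    return 0.5 * np.sum((Y - np.tanh(X @ W.T) @ v) ** 2)

def num_grad(E, theta, h=1e-6):
    """Central-difference gradient, used here in place of back-propagation."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta); e[i] = h
        g[i] = (E(theta + e) - E(theta - e)) / (2 * h)
    return g

def gamma(lam):
    """gamma_lambda(theta^(1)): split the single unit into two, as in eq.(9)."""
    return np.concatenate([[lam * zeta2, (1 - lam) * zeta2], u2, u2])

for lam in (-0.5, 0.3, 1.7):
    assert E2(gamma(lam)) < 1e-12                            # same function
    assert np.max(np.abs(num_grad(E2, gamma(lam)))) < 1e-5   # critical point
```

In this realizable case the residuals vanish, so every point of the line Γ is a critical point of E_2 regardless of λ, in line with the critical-point analysis of this section.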
A critical point θ_0 is called a local minimum (maximum) if there exists a neighborhood of θ_0 such that E_H(θ) ≥ E_H(θ_0) (resp. E_H(θ) ≤ E_H(θ_0)) holds for any θ in the neighborhood; it is called a saddle if it is neither a local minimum nor a local maximum, that is, if any neighborhood of θ_0 contains a point at which E_H is smaller than E_H(θ_0) and a point at which E_H is larger than E_H(θ_0). It is well known that if the Hessian matrix at a critical point is positive (negative) definite, the critical point is a local minimum (maximum), and that if the Hessian has both positive and negative eigenvalues, it is a saddle.

We look for critical points of E_H in Θ̃_H. Let θ^(H−1) = (ζ_2, ..., ζ_H, u_2, ..., u_H) ∈ Θ_{H−1} − Θ̃_{H−1} be a critical point of E_{H−1}. Such a point really exists if we assume that the global minimum of E_{H−1} is not attained in Θ̃_{H−1}. Then we have

    Σ_{ν=1}^{N} ∂_z ℓ^(ν) φ(u_j^T x^(ν)) = 0,
    ζ_j Σ_{ν=1}^{N} ∂_z ℓ^(ν) φ'(u_j^T x^(ν)) x^(ν) = 0,    (11)
for 2 ≤ j ≤ H, where for brevity we write

    ∂_z ℓ^(ν) = (∂ℓ/∂z)(y^(ν), f^(H−1)(x^(ν); θ^(H−1))).    (12)

We have two kinds of critical points, as follows.

Theorem 1. Let α_w, β_v, and γ_λ be as in eq.(9). Then, γ_λ(θ^(H−1)) for all λ and β_0(θ^(H−1)) are critical points of E_H.

The proof is easy. Noting that f^(H−1)(x; θ^(H−1)) = f^(H)(x; θ) for θ = γ_λ(θ^(H−1)) or β_0(θ^(H−1)), the condition that θ is a critical point of E_H can be reduced to eq.(11). Because α_0 = β_0, these two embeddings give the same critical point. The critical points γ_λ(θ^(H−1)) form a line in Θ_H as λ moves over R. Moreover, if θ is a critical point of E_H, so is T(θ) for all T ∈ G_H. We therefore have many critical lines in Θ_H.

4 Local minima of MLP

4.1 A condition for local minima

In this section, we focus on the critical points γ_λ(θ^(H−1)) and give a condition under which such a point is a local minimum or a saddle point. The usual sufficient condition using the Hessian matrix cannot be applied in this case: the Hessian is singular, because all the points of the line through the critical point give the same value of E_H. Let θ^(H−1) be a point in Θ_{H−1}. We define the following L × L symmetric matrix:

    A_2 = ζ_2 Σ_{ν=1}^{N} ∂_z ℓ^(ν) φ''(u_2^T x^(ν)) x^(ν) x^(ν)T.    (13)

Theorem 2. Let θ^(H−1) be a local minimum of E_{H−1} such that the Hessian matrix at θ^(H−1) is positive definite. Let γ_λ be as in eq.(9) (or its sign-flipped variant γ_λ^−, with (v_2, w_2) replaced by (−v_2, −w_2)), and let Γ = {θ ∈ Θ_H | θ = γ_λ(θ^(H−1)), λ ∈ R}. If A_2 is positive (negative) definite, any point in the set Γ_0 = {θ ∈ Γ | λ(1−λ) > 0 (resp. < 0)} is a local minimum of E_H, and any point in Γ − Γ_0 is a saddle. If A_2 has both positive and negative eigenvalues, all the points in Γ are saddle points.

Figure 2: Error surface around the local minima on the line Γ.

For the proof, see the Appendix. The local minima given by Theorem 2, if any, appear as line segments. Such a local minimum can be changed into a saddle by moving along the line, without altering the function f^(H). Fig. 2 illustrates the error surface in this case.

4.2 Plateaus

We consider the case where A_2 is positive (negative) definite. If we map Γ to the function space, π_H(Γ) consists of a single function f^(H) ∈ S_H.
Therefore, if we regard the cost function E_H as a function on S_H, π_H(Γ) is a saddle, because E_H takes both larger and smaller values than E_H(θ^(H)) in any neighborhood of f^(H) in S_H. It is interesting to see that Γ_0 is attractive in its neighborhood. Hence, any point in a small neighborhood of Γ_0 is attracted to it. However, Γ_0 is neutrally stable in the direction along Γ_0, so that a point attracted to Γ_0 fluctuates randomly along Γ_0. It eventually escapes from Γ when it reaches Γ − Γ_0. This takes a long time because of the nature of a random walk, which explains why this type of critical point causes serious plateaus. This is a new type of saddle which has so far not been remarked in nonlinear dynamics. This type of "intrinsic saddle" arises from the singular structure of the topology of S_H.

4.3 Numerical simulation

We have performed a numerical simulation to exemplify the local minima given by Theorem 2, using an MLP with 1 input, 1 output, and 2 hidden units.
We use the logistic function φ(t) = 1/(1 + e^{−t}) for the activation function and ℓ(y, z) = (1/2)‖y − z‖² for the loss function. For training data, 100 random input data are generated, and the output data are obtained as y = f(x) + Z, where f(x) = 2φ(x) − φ(4x) and Z is Gaussian noise with variance 10^{−4}. Using back-propagation, we train the parameter of the MLP with 1 hidden unit, and use it as the global minimum θ^(1). In this case, we have (ζ_2, u_2) = (0.98, 0.47) and A_2 = 1.9 > 0. Then, any point in Γ_0 = {γ_λ(θ^(1)) | 0 < λ < 1} is a local minimum. We set v_1 = v_2 = ζ_2/2 as θ* (λ = 1/2), and evaluate E_2(θ) at 10^6 random points around θ*, generated by a normal distribution with variance 10^{−6}. As a result, all these values are larger than E_2(θ*). This experimentally verifies that θ* is a local minimum. The graphs of the target function f(x) and the function given by the local minimum are shown in Fig. 3.

Figure 3: A local minimum in MLP (the target function f(x) and the function given by the local minimum).

5 Conclusion

We investigated the geometric structure of the parameter space of multilayer perceptrons with H−1 hidden units embedded in the parameter space of H hidden units. Based on this structure, we found a finite family of critical-point sets of the error surface. We showed that a critical point of a smaller network can be embedded into the parameter space of the larger network as a set of critical points. We further elucidated a condition under which a point in the image of one embedding is a local minimum. We saw that under this condition there exist local minima as line segments in the parameter space, which cause serious plateaus, because all points around the set of local minima once converge to it and then have to escape from it by random fluctuation. It is important to see whether the critical sets in Theorem 2 are the only cause of plateaus. If this is the case, we can avoid them by the method of natural gradient ([5],[9]). However, this is still left as an open problem.

References

[1] P.J.G. Lisboa & S.J. Perantonis.
Complete solution of the local minima in the XOR problem. Network, 2:119-124, 1991.
[2] L.G.C. Hamey. XOR has no local minima: a case study in neural network error surface analysis. Neural Networks, 11(4):669-682, 1998.
[3] I.G. Sprinkhuizen-Kuyper & E.J.W. Boers. The error surface of the 2-2-1 XOR network: the finite stationary points. Neural Networks, 11(4):683-690, 1998.
[4] D. Saad & S. Solla. On-line learning in soft committee machines. Physical Review E, 52:4225-4243, 1995.
[5] S. Amari. Natural gradient works efficiently in learning. Neural Computation, 10:251-276, 1998.
[6] A.M. Chen, H. Lu, & R. Hecht-Nielsen. On the geometry of feedforward neural network error surfaces. Neural Computation, 5:910-927, 1993.
[7] H.J. Sussmann. Uniqueness of the weights for minimal feedforward nets with a given input-output map. Neural Networks, 5:589-593, 1992.
[8] K. Fukumizu. A regularity condition of the information matrix of a multilayer perceptron network. Neural Networks, 9(5):871-879, 1996.
[9] S. Amari, H. Park, & K. Fukumizu. Adaptive method of realizing natural gradient learning for multilayer perceptrons. Neural Computation, 1999. To appear.
Appendix A: Proof of Theorem 2

We only need to prove the theorem for γ_λ^+ = γ_λ, because the sign flip (v_2, w_2) ↦ (−v_2, −w_2) maps γ_λ^−(θ^(H−1)) to γ_λ^+(θ^(H−1)), preserving the local properties of E_H. For simplicity, we reorder the components of θ^(H) and θ^(H−1) as (v_1, v_2, w_1^T, w_2^T, v_3, ..., v_H, w_3^T, ..., w_H^T) and (ζ_2, u_2^T, ζ_3, ..., ζ_H, u_3^T, ..., u_H^T), respectively.

We introduce a new coordinate system of Θ_H. Let (ξ^T, η, ζ_2, b^T, v_3, ..., v_H, w_3^T, ..., w_H^T) be a coordinate system of Θ_H − {v_1 + v_2 = 0}, where

    ξ = w_1 − w_2,  η = (v_1 − v_2)/(v_1 + v_2),  ζ_2 = v_1 + v_2,
    b = (v_1 w_1 + v_2 w_2)/(v_1 + v_2).    (14)

This is well-defined as a coordinate system, since its inverse is given by

    v_1 = ((1+η)/2) ζ_2,  w_1 = b + ((1−η)/2) ξ,
    v_2 = ((1−η)/2) ζ_2,  w_2 = b − ((1+η)/2) ξ.    (15)

In this coordinate system, the embedding γ_λ is expressed as

    γ_λ : (ζ_2, u_2, ζ_3, ..., ζ_H, u_3, ..., u_H) ↦ (0, 2λ−1, ζ_2, u_2, ζ_3, ..., ζ_H, u_3, ..., u_H).    (16)

The critical set Γ is thus a line parallel to the η-axis, with ξ = 0, b = u_2, v_j = ζ_j, and w_j = u_j (3 ≤ j ≤ H). Let η̄ be the η-component of a point θ_λ = γ_λ(θ^(H−1)) ∈ Γ, and let V = {θ ∈ Θ_H | η = η̄} be a complement space of Γ; we have Γ ∩ V = {θ_λ}. If θ_λ is a local minimum in V for every θ_λ ∈ Γ_0, it is a local minimum also in Θ_H, and if it is a saddle in V, it is a saddle also in Θ_H. Thus, we can reduce the problem to the Hessian of E_H restricted to V, which we write G.

From eq.(15), we have

Lemma 1. For any θ ∈ {ξ = 0},

    ∂f/∂ξ (x; θ) = 0  and  ∂f/∂η (x; θ) = 0.    (17)

From Lemma 1, for any θ ∈ {ξ = 0}, the mixed derivatives ∂²f/∂ξ∂ω (θ) = 0 and ∂²f/∂η∂ω (θ) = 0 hold unless ω = ξ. Moreover, since ∂f/∂ξ (θ_λ) = 0, the first-derivative terms of the loss drop out of the ξ-block of the Hessian, and

    ∂²E_H/∂ξ∂ξ (θ_λ) = Σ_{ν=1}^{N} ∂_z ℓ^(ν) ∂²f/∂ξ∂ξ (x^(ν)).    (18)

Noting also from eq.(16) that the remaining coordinates (ζ_2, b, v_3, ..., w_H) on V can be identified with θ^(H−1), the Hessian G takes the block-diagonal form

    G = [ ∂²E_H/∂ξ∂ξ (θ_λ)        O
          O        ∂²E_{H−1}/∂θ∂θ (θ^(H−1)) ].    (19)

By a simple calculation, we can prove

Lemma 2. For any θ ∈ {ξ = 0}, we have

    ∂²f/∂ξ∂ξ (x; θ) = (v_1 v_2 / ζ_2) φ''(b^T x) x x^T.    (20)

From eq.(18) and Lemma 2, noting that v_1 v_2 = λ(1−λ) ζ_2² and b = u_2 at θ_λ, we have

    ∂²E_H/∂ξ∂ξ (θ_λ) = λ(1−λ) A_2.    (21)

Noting that ∂²E_{H−1}/∂θ∂θ (θ^(H−1)) is positive definite and ζ_2 ≠ 0, if λ(1−λ)A_2 is positive definite, so is G, and if λ(1−λ)A_2 has negative eigenvalues, G has both positive and negative eigenvalues. This completes the proof.
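The random-perturbation check used in Sec. 4.3 to support local minimality can be sketched as follows. This is a generic version of that test in our own helper names, demonstrated on a simple quadratic rather than the paper's trained network; it is only an empirical necessary-condition check, not a proof.

```python
import numpy as np

# Sketch of the perturbation test of Sec. 4.3: sample many small random
# perturbations around a candidate point and check that the error never
# decreases. Names and defaults are ours; the paper used 10^6 samples.
def looks_like_local_min(E, theta, n=10_000, scale=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    E0 = E(theta)
    return all(E(theta + d) >= E0
               for d in rng.normal(scale=scale, size=(n, theta.size)))

quadratic = lambda t: float(np.sum(t ** 2))
assert looks_like_local_min(quadratic, np.zeros(2))               # true minimum
assert not looks_like_local_min(quadratic, np.array([0.1, 0.0]))  # not critical
```

As the paper notes, such a test can only verify minimality up to the sampling scale; a point of Γ − Γ_0 would fail it only along the few escape directions, which is why plateaus are easily mistaken for local minima in practice.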