Slow Dynamics Due to Singularities of Hierarchical Learning Machines


Progress of Theoretical Physics Supplement No. 157

Hyeyoung Park (1), Masato Inoue (2), and Masato Okada (3)

(1) Computer Science Dept., Kyungpook National Univ., Daegu, Korea
(2) Department of Computational Intelligence and Systems Science, Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology, Yokohama, Japan
(3) Department of Complexity Science and Engineering, Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Japan

Slow dynamics in the learning of neural networks is known to be closely related to singularities, which exist in the parameter spaces of hierarchical learning models. To show the influence of singular structure on learning dynamics, we take a statistical mechanical approach and investigate on-line learning dynamics under several learning scenarios that differ in the relationship between the optimum and the singularities. We find a quasi-plateau phenomenon that differs from the well-known plateau. The quasi-plateau and the plateau become extremely serious when the optimal point lies in the neighborhood of a singularity. Both disappear in natural gradient learning, which takes the singular structure into account and uses a Riemannian measure for the parameter space.

1. Introduction

Parameter spaces of hierarchical learning models such as multilayer perceptrons have complex singular structures, which are responsible for various nontrivial properties in estimation performance and learning dynamics [1, 3]. Although many statistical mechanical analyses of slow learning dynamics have been carried out [2, 5, 6], the singular structure and its influence on learning dynamics have not been treated. Therefore, some interesting phenomena caused by singularities, which we discuss in this paper, were not observed in previous works.
To see the influence of singularities in various situations, it is important to choose the learning model and the learning scenarios carefully, considering the geometrical structure of the parameter space. We use a multilayer perceptron (MLP) with one hidden layer. This is the simplest model that has the typical singular structure in its parameter space. Note that soft-committee machines do not have it. We investigate three types of learning task, classified by the positional relationship between the optimal point and the singularity. We analyze the dynamics of natural gradient learning as well as that of standard gradient learning to compare the properties of the two methods.

Author notes:
E-mail: hypark@knu.ac.kr
Also at RIKEN BSI, Wako, Japan. E-mail: inoue@sp.dis.titech.ac.jp
Also at RIKEN BSI, Wako, Japan, and at Intelligent Cooperation and Control, PRESTO, JST, c/o RIKEN BSI, Wako, Japan. E-mail: okada@brain.riken.jp

2. Model with singularity and its learning

We use a simple MLP $M_K$ with $K$ hidden units, defined as

$$\zeta = f_{J,w}(\xi) = \sum_{i=1}^{K} w_i\, g(J_i \cdot \xi).$$

Here, $\xi \in \mathbb{R}^N$ is the input vector; $J_i \in \mathbb{R}^N$ and $w_i \in \mathbb{R}$ are the weight parameters connected to the $i$-th hidden unit; and $N$ is the number of input nodes. We also assume a teacher network $\zeta = f_{B,v}(\xi)$ of the same architecture with $M$ hidden units and parameters $B_n$ and $v_n$.

The space of $M_K$ has a typical hierarchical structure with singularities. Consider the space $S_2$ of MLPs with two hidden units. Each point in $S_2$ is specified by the parameters $J_i$ and $w_i$ ($i = 1, 2$). In this space, all points satisfying $J_1 = J_2 = J_0$ and $w_1 + w_2 = w_0$ represent the same MLP with one hidden unit, specified by $J_0$ and $w_0$. Since all the points on the line $w_1 + w_2 = w_0$ have the same entropy, the Fisher information matrix becomes singular at these points. In addition, if we consider the function space of $M_K$, all those points shrink into one point, making an intrinsic singularity in the space. Such singular subspaces, in which all points have the same energy level, are ubiquitous in the space of MLPs and may cause unpleasant phenomena in learning dynamics. We can also expect the influence of a singularity to intensify when the optimum is located at or around it.

We investigate two on-line gradient descent learning algorithms: standard gradient learning and natural gradient learning. At each learning step, a new training example $(\xi, \zeta)$ is generated from the teacher network. The student network is trained to decrease the squared error between its output and the teacher output $\zeta$. The update of standard gradient learning is given by

$$\Delta J_i = -\frac{\eta}{N}\,\delta_i\,\xi, \qquad \Delta w_i = -\frac{\eta}{N}\, g(x_i)\, e, \qquad e = \sum_{j=1}^{K} w_j\, g(x_j) - \sum_{n=1}^{M} v_n\, g(y_n), \tag{2.1}$$

where $\delta_i = w_i\, g'(x_i)\, e$, $x_i = J_i \cdot \xi$, $y_n = B_n \cdot \xi$, and $\eta$ is the learning rate.
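As a minimal sketch of the standard gradient update (2.1), the following Python code performs one on-line step for a small student and teacher. It takes $e$ as student output minus teacher output, so each parameter moves against the gradient of the squared error. The activation $g(u) = \mathrm{erf}(u/\sqrt{2})$ is the choice made later in Section 3 and is assumed here; the function names and the numerical values are illustrative assumptions, not taken from the paper.

```python
import math


def g(u):
    """Hidden-unit activation, assumed to be erf(u / sqrt(2))."""
    return math.erf(u / math.sqrt(2.0))


def g_prime(u):
    """Derivative of g: sqrt(2 / pi) * exp(-u^2 / 2)."""
    return math.sqrt(2.0 / math.pi) * math.exp(-0.5 * u * u)


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(p * q for p, q in zip(a, b))


def mlp_output(weights, coeffs, xi):
    """f(xi) = sum_i coeffs[i] * g(weights[i] . xi)."""
    return sum(c * g(dot(row, xi)) for row, c in zip(weights, coeffs))


def standard_gradient_step(J, w, B, v, xi, eta):
    """One on-line step of (2.1); returns updated copies (J_new, w_new)."""
    N = len(xi)
    x = [dot(J_i, xi) for J_i in J]
    # e = student output minus teacher output for this example
    e = mlp_output(J, w, xi) - mlp_output(B, v, xi)
    J_new, w_new = [], []
    for J_i, w_i, x_i in zip(J, w, x):
        delta_i = w_i * g_prime(x_i) * e
        J_new.append([J_ia - (eta / N) * delta_i * xi_a
                      for J_ia, xi_a in zip(J_i, xi)])
        w_new.append(w_i - (eta / N) * g(x_i) * e)
    return J_new, w_new


# Illustrative values (hypothetical): K = M = 2, N = 2.
J = [[0.5, 0.0], [0.0, 0.5]]
w = [0.3, 0.2]
B = [[1.0, 0.0], [0.0, 1.0]]
v = [0.5, 0.5]
xi = [1.0, -1.0]

err_before = mlp_output(J, w, xi) - mlp_output(B, v, xi)
J_new, w_new = standard_gradient_step(J, w, B, v, xi, eta=0.1)
err_after = mlp_output(J_new, w_new, xi) - mlp_output(B, v, xi)
print(abs(err_before), abs(err_after))
```

For a sufficiently small learning rate, the error on the presented example shrinks after the step, which is a quick sanity check on the sign convention.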
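For natural gradient learning, we need the Fisher information matrix $G$ of a stochastic model of the student network and its inverse, which we denote as

$$G^{-1} = \begin{bmatrix} G^{ww} & G^{wJ} \\ (G^{wJ})^T & G^{JJ} \end{bmatrix} = \begin{bmatrix} \left[G^{w_i w_j}\right]_{i,j=1..K} & \left[G^{w_i J_j}\right]_{i,j=1..K} \\ \left[G^{w_i J_j}\right]^T_{i,j=1..K} & \left[G^{J_i J_j}\right]_{i,j=1..K} \end{bmatrix}. \tag{2.2}$$

Each block of the matrix can be written as

$$G^{w_i J_j} = \phi_{ij}\, U^T, \qquad G^{J_i J_j} = \theta_{ij}\, I + U\, \Theta_{ij}\, U^T, \qquad U = [J_1, \ldots, J_K],$$

where the scalars $G^{w_i w_j}$ and $\theta_{ij}$, the $1 \times K$ vector $\phi_{ij}$, and the $K \times K$ matrix $\Theta_{ij}$ can be calculated deterministically for given $J$ and $w$ [2]. Using the obtained $G^{-1}$, the update of natural gradient learning is given by

$$\tilde{\Delta} J_i = \sum_{k=1}^{K} \left[ \left(G^{w_k J_i}\right)^T \Delta w_k + G^{J_i J_k}\, \Delta J_k \right], \qquad \tilde{\Delta} w_i = \sum_{k=1}^{K} \left[ G^{w_i w_k}\, \Delta w_k + G^{w_i J_k}\, \Delta J_k \right], \tag{2.3}$$

where $\Delta J_k$ and $\Delta w_k$ on the right-hand side are the standard gradient updates of (2.1).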
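The singularity of the Fisher information matrix on the set $J_1 = J_2$ can be illustrated numerically. The sketch below is not the paper's analytical construction of $G^{-1}$. It assumes unit-variance Gaussian output noise, so that $G = E_\xi[\nabla f\, \nabla f^T]$, standard normal inputs, and $g(u) = \mathrm{erf}(u/\sqrt{2})$. It estimates the quadratic form $d^T G d$ by Monte Carlo for the direction $d$ that moves along the line $w_1 + w_2 = \text{const}$. All function names and parameter values are hypothetical.

```python
import math
import random


def g(u):
    """Hidden-unit activation, assumed to be erf(u / sqrt(2))."""
    return math.erf(u / math.sqrt(2.0))


def g_prime(u):
    """Derivative of g: sqrt(2 / pi) * exp(-u^2 / 2)."""
    return math.sqrt(2.0 / math.pi) * math.exp(-0.5 * u * u)


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(p * q for p, q in zip(a, b))


def output_gradient(J, w, xi):
    """Gradient of f_{J,w}(xi) w.r.t. (w_1..w_K, J_1..J_K), flattened."""
    x = [dot(J_i, xi) for J_i in J]
    grad_w = [g(x_i) for x_i in x]
    grad_J = [w_i * g_prime(x_i) * xi_a
              for w_i, x_i in zip(w, x) for xi_a in xi]
    return grad_w + grad_J


def fisher_quadratic_form(J, w, direction, n_samples=2000, seed=0):
    """Monte Carlo estimate of d^T G d with G = E[grad f grad f^T]."""
    rng = random.Random(seed)
    N = len(J[0])
    total = 0.0
    for _ in range(n_samples):
        xi = [rng.gauss(0.0, 1.0) for _ in range(N)]
        total += dot(direction, output_gradient(J, w, xi)) ** 2
    return total / n_samples


N = 3
# Direction that increases w_1 and decreases w_2 by the same amount.
d = [1.0, -1.0] + [0.0] * (2 * N)

# On the singular set J_1 = J_2 the two hidden units are identical.
singular = fisher_quadratic_form(
    [[0.6, 0.0, 0.0], [0.6, 0.0, 0.0]], [0.3, 0.7], d)
# Away from it the same direction changes the network function.
regular = fisher_quadratic_form(
    [[0.6, 0.0, 0.0], [0.0, 0.6, 0.0]], [0.3, 0.7], d)
print(singular, regular)
```

On the singular set the quadratic form vanishes, so $G$ has a null direction there and cannot be inverted directly. At the regular point the same direction has a strictly positive value.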

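3. Statistical mechanical method for analyzing dynamics

Using the statistical mechanical approach [5, 6], we investigate the average dynamics in the thermodynamic limit, i.e., the limit $N \to \infty$. The estimation accuracy of learning is evaluated by the generalization error, defined as

$$E_{gen} = \frac{1}{2} \left\langle \left\{ f_{B,v}(\xi) - f_{J,w}(\xi) \right\}^2 \right\rangle,$$

where $\langle \cdot \rangle$ denotes the average over the input $\xi$. In the thermodynamic limit, the generalization error can be described using order parameters, which are defined as $R_{in} \equiv J_i^T B_n$, $Q_{ij} \equiv J_i^T J_j$, and $T_{nm} \equiv B_n^T B_m$. In particular, when $g(u) = \mathrm{erf}(u/\sqrt{2})$, the explicit form of $E_{gen}$ is given by

$$E_{gen} = \frac{1}{\pi} \left[ \sum_{i,j}^{K} w_i w_j \arcsin \frac{Q_{ij}}{\sqrt{1 + Q_{ii}}\,\sqrt{1 + Q_{jj}}} + \sum_{m,n}^{M} v_m v_n \arcsin \frac{T_{mn}}{\sqrt{1 + T_{mm}}\,\sqrt{1 + T_{nn}}} - 2 \sum_{i}^{K} \sum_{n}^{M} w_i v_n \arcsin \frac{R_{in}}{\sqrt{1 + Q_{ii}}\,\sqrt{1 + T_{nn}}} \right]. \tag{3.1}$$

From this, we know that the equations of motion for $R_{in}$, $Q_{ij}$, and $w_i$ are sufficient to trace the dynamics of the learning model. In the thermodynamic limit $N \to \infty$, the equations of motion are given by

$$\frac{dR_{in}}{d\alpha} = -\eta \langle \delta_i y_n \rangle, \qquad \frac{dQ_{ij}}{d\alpha} = -\eta \langle \delta_i x_j + \delta_j x_i \rangle + \eta^2 \langle \delta_i \delta_j \rangle, \qquad \frac{dw_i}{d\alpha} = -\eta \langle g(x_i)\, e \rangle, \tag{3.2}$$

where $\alpha$ is a continuous time variable. In the case of $g(u) = \mathrm{erf}(u/\sqrt{2})$, the equations of motion can be written in compact forms with $Q_{ij}$, $R_{in}$, $T_{nm}$, $w_i$ and $v_n$ [3, 5].

For natural gradient learning, we can apply the same method and obtain

$$\frac{dR_{in}}{d\alpha} = -\eta \sum_{k}^{K} \left[ \theta_{ik} \langle \delta_k y_n \rangle + R_n \left( \phi_{ki}^T \langle g_k e \rangle + \Theta_{ik} \langle \delta_k x \rangle \right) \right], \tag{3.3}$$

$$\frac{dQ_{ij}}{d\alpha} = -\eta \left[ Q_i \sum_{k}^{K} \left( \phi_{kj}^T \langle g_k e \rangle + \Theta_{jk} \langle \delta_k x \rangle \right) + Q_j \sum_{k}^{K} \left( \phi_{ki}^T \langle g_k e \rangle + \Theta_{ik} \langle \delta_k x \rangle \right) + \sum_{k}^{K} \left( \theta_{ik} \langle \delta_k x_j \rangle + \theta_{jk} \langle \delta_k x_i \rangle \right) \right] + \eta^2 \sum_{k,l}^{K} \theta_{ik} \theta_{jl} \langle \delta_k \delta_l \rangle, \tag{3.4}$$

$$\frac{dw_i}{d\alpha} = -\eta \sum_{k}^{K} \left[ G^{w_i w_k} \langle g_k e \rangle + \phi_{ik} \langle \delta_k x \rangle \right]. \tag{3.5}$$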
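The closed form (3.1) is straightforward to evaluate from the order parameters. The following Python sketch does so for $g(u) = \mathrm{erf}(u/\sqrt{2})$; the function name and the test configurations are illustrative assumptions. The last check uses a teacher with $B_1 = B_2$, $|B_i| = 1$ and $v_1 = v_2 = 0.5$, and shows that every student with $J_1 = J_2 = B_1$ and $w_1 + w_2 = 1$ has zero generalization error, which is the singular line discussed in Section 2.

```python
import math


def generalization_error(Q, R, T, w, v):
    """Evaluate (3.1) from order parameters Q (KxK), R (KxM), T (MxM)."""
    K, M = len(w), len(v)

    def term(c, a, b):
        # arcsin of the normalized overlap c / sqrt((1 + a)(1 + b))
        return math.asin(c / math.sqrt((1.0 + a) * (1.0 + b)))

    student = sum(w[i] * w[j] * term(Q[i][j], Q[i][i], Q[j][j])
                  for i in range(K) for j in range(K))
    teacher = sum(v[n] * v[m] * term(T[n][m], T[n][n], T[m][m])
                  for n in range(M) for m in range(M))
    cross = sum(w[i] * v[n] * term(R[i][n], Q[i][i], T[n][n])
                for i in range(K) for n in range(M))
    return (student + teacher - 2.0 * cross) / math.pi


identity = [[1.0, 0.0], [0.0, 1.0]]
ones = [[1.0, 1.0], [1.0, 1.0]]
v = [0.5, 0.5]

# Student identical to an orthogonal teacher: the error vanishes.
perfect = generalization_error(identity, identity, identity, [0.5, 0.5], v)

# Student with zero output weights: only the teacher term remains.
silent = generalization_error(identity, identity, identity, [0.0, 0.0], v)

# Singular teacher (B_1 = B_2): all students on w_1 + w_2 = 1 are optimal.
on_line = [generalization_error(ones, ones, ones, [w1, 1.0 - w1], v)
           for w1 in (0.1, 0.5, 0.9)]
print(perfect, silent, on_line)
```

Because the error is flat along the whole line $w_1 + w_2 = 1$, the gradient carries no information about how to split the output weight between the two hidden units there.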

In (3.3)-(3.5), $R_n = [R_{1n}, \ldots, R_{Kn}]$, $Q_i = [Q_{1i}, \ldots, Q_{Ki}]$, $x = [x_1, \ldots, x_K]^T$, and $g_k = g(x_k)$. These equations are a generalization of the equations of motion for soft-committee machines [2]. A detailed description will be given in Ref. 4.

4. Results and conclusions

We analyzed the dynamics of standard gradient and natural gradient learning for the case $K = M = 2$. We used three conditions on the teacher parameters to represent three types of learning task: $B_1 \perp B_2$ for the regular case, $B_1 = B_2$ for the singular case, and $B_1 \cdot B_2 = 0.9$ (so that $B_1$ is close to but not equal to $B_2$) for the near-singular case. In all cases, we set $|B_i| = 1$ and $v_i = 0.5$ ($i = 1, 2$). For the initial condition of learning, we used

$$Q = [Q_{ij}]_{i,j=1,2} = [\,\cdots\,], \qquad R = [R_{in}]_{i,n=1,2} = [\,\cdots\,], \qquad w = [\,\cdots\,], \tag{4.1}$$

where the initial $w$ involves the value 0.1 and a small parameter $\varepsilon$. For $\varepsilon$, we tried three different values, including 0.02 and 0.04.

The time evolution of the generalization error in standard gradient learning for the three types of learning task is shown in Fig. 1. For the regular case (a), we can see the well-known plateau caused by the permutation symmetry. Note that the permutation symmetry satisfies the singularity condition discussed in Section 2. For the singular case (b), in which symmetry breaking is not necessary, we can still see a different type of slow dynamics. We call it a quasi-plateau [3]. The quasi-plateau is caused by the singular subspace $w_1 + w_2 = 1$ in this experiment, which does not exist in soft-committee machines. This is why conventional studies using soft-committee machines did not observe the quasi-plateau.

Fig. 1. Evolution of generalization error in standard gradient learning (annotated regions: plateau, quasi-plateau).

Another important and interesting phenomenon appears in the near-singular case (c). Here we can see both a plateau and a quasi-plateau, which together make the learning extremely slow. Since the near-singular case occurs frequently in practical applications, this phenomenon is of considerable practical importance. Moreover, the slow convergence cannot be avoided by changing the initial value $\varepsilon$ in the near-singular case. In contrast, neither the plateau nor the quasi-plateau appears in natural gradient learning, and natural gradient learning hardly depends on the initial condition (Fig. 2).

Fig. 2. Evolution of generalization error in natural gradient learning.

By taking a geometrical viewpoint and a statistical mechanical approach to learning dynamics, we found the existence of the quasi-plateau and the severity of slow dynamics in the near-singular case, which is interesting from both theoretical and practical points of view. The mechanism of the slow dynamics in standard gradient learning and its resolution by natural gradient learning will be discussed, with a detailed explanation of the properties of the singular structure, in Ref. 4.

References

[1] S. Amari, T. Ozeki and H. Park, Sys. and Comm. in Jpn.
[2] M. Inoue, H. Park and M. Okada, J. Phys. Soc. Jpn.
[3] H. Park, M. Inoue and M. Okada, J. of Phys. A.
[4] H. Park, M. Inoue and M. Okada, in preparation.
[5] P. Riegler and M. Biehl, J. of Phys. A.
[6] D. Saad and S. A. Solla, Phys. Rev. E, 4225.
