Influence of weight initialization on multilayer perceptron performance

M. Karouia (1,2), T. Denœux (1), R. Lengellé (1)
(1) Université de Compiègne, U.R.A. CNRS 817 Heudiasyc, BP 649 - F-66 Compiègne cedex - France, mkarouia@hds.univ-compiegne.fr
(2) Lyonnaise des Eaux (LIAC)

Abstract

This paper presents a new algorithm for initializing the weights in multilayer perceptrons. The method is based on the use of feature vectors extracted by discriminant analysis. Simulations carried out with real-world and synthetic data sets show that the proposed algorithm yields a better initial state than random initialization. As a result, training time is reduced and lower generalization error can be achieved. Additionally, it is shown through numerical simulations that the generalization performance of networks initialized with the proposed method becomes less sensitive to network size and input dimension.

1 Introduction

Many researchers have emphasized the importance of initial weights in multilayer perceptron (MLP) training. Several initialization algorithms have been proposed, such as the use of prototypes [2]. The most obvious potential benefits of starting optimization from a good initial state are faster training and
higher probability of reaching a deep minimum of the error function. Additionally, it has been found that introducing prior knowledge in the initial weights may in some cases improve generalization performance [2, 8]. In this paper, a new approach to weight initialization is proposed, and its effect on generalization is demonstrated experimentally. The starting point of this work is the relationship between MLPs and discriminant analysis (DA) pointed out by Gallinari [4]: it can be shown that training networks with one hidden layer using the quadratic error function is equivalent to maximizing a measure of class separability in the space spanned by the hidden units. DA techniques aim at extracting features that are effective in preserving class separability. The algorithm presented in this paper (WIDA: Weight Initialization by Discriminant Analysis) uses such features to initialize the weights in multilayer networks before training by standard back-propagation (BP) or any other learning procedure. The performance of the WIDA method is then analyzed using several synthetic and real-world data sets. We examine the effect of weight initialization on the following aspects: convergence speed (training time), generalization error, and sensitivity of generalization error to data dimensionality and number of hidden units.

2 The initialization method

2.1 Discriminant analysis

We consider a set X of N samples in a d-dimensional space. The samples are assumed to be partitioned into M disjoint subsets. Subset X_i of size N_i includes the samples associated with class Ω_i. Let x_{ij} be the j-th d-dimensional sample vector from class Ω_i. The mean vector of class Ω_i is m_i = (1/N_i) \sum_{j=1}^{N_i} x_{ij}. The overall mean vector is m = (1/N) \sum_{i=1}^{M} N_i m_i. We define the parametric within-class scatter matrix W and the parametric between-class scatter matrix B respectively as:

W = \frac{1}{N} \sum_{i=1}^{M} \sum_{j=1}^{N_i} (x_{ij} - m_i)(x_{ij} - m_i)^T   (1)

B = \frac{1}{N} \sum_{i=1}^{M} N_i (m_i - m)(m_i - m)^T   (2)
where (·)^T denotes transposition. Matrix W is assumed to be positive definite, so that W^{-1} exists. Matrix B is positive semidefinite with rank at most M − 1 (we assume that M − 1 ≤ d). The sum of W and B gives the parametric global covariance matrix G. In parametric discriminant analysis (PDA), we seek d-dimensional feature vectors τ maximizing Fisher's criterion J(τ):

J(τ) = \frac{τ^T B τ}{τ^T W τ}   (3)

Such features are obtained as the eigenvectors of W^{-1}B, each eigenvalue λ_i being equal to the Fisher criterion of its corresponding eigenvector τ_i (J(τ_i) = λ_i).

PDA has two serious shortcomings. First, the maximum number of discriminant vectors is limited to M − 1; when M = 2, PDA can extract only one discriminant vector. The second and more fundamental problem is the intrinsically parametric nature of PDA: when the class distributions are significantly non-normal, PDA cannot be expected to determine good features preserving the complex structure needed for classification. Non-parametric discriminant analysis (NPDA) was introduced to overcome both of these problems [3]. It is based on a non-parametric between-class scatter matrix that measures between-class scatter on a local basis, using a k-nearest-neighbor (k-NN) approach. Let us first consider the case where M = 2. Let n_{il}(x) ∈ X_i (l = 1, ..., k) be the k nearest neighbors in class Ω_i of an arbitrary sample x ∈ X. The local mean of class Ω_i (the sample mean of the k NNs from Ω_i to x) is m_{ki}(x) = (1/k) \sum_{l=1}^{k} n_{il}(x). The non-parametric between-class scatter matrix is then defined as

B_{12,k} = \frac{1}{N} \left( \sum_{x \in X_1} p_{12}(x) (x - m_{k2}(x))(x - m_{k2}(x))^T + \sum_{x \in X_2} p_{12}(x) (x - m_{k1}(x))(x - m_{k1}(x))^T \right)   (4)

The term p_{12}(x) is defined as a function of the distances between x and its k-th nearest neighbor from each class [3]. Its role is to deemphasize the samples located far away from the class boundary.
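The parametric criterion above can be sketched numerically as follows. This is a minimal NumPy illustration of Equations (1)-(3), not the authors' code: the function name `pda_directions`, the synthetic two-class data set, and the random seed are all ours.

```python
import numpy as np

def pda_directions(X, y):
    """Parametric DA: eigenvectors of W^{-1} B (Eqs. 1-3), sorted by
    decreasing eigenvalue; each eigenvalue equals the Fisher criterion J."""
    classes = np.unique(y)
    N, d = X.shape
    m = X.mean(axis=0)                      # overall mean vector
    W = np.zeros((d, d))                    # within-class scatter
    B = np.zeros((d, d))                    # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        W += (Xc - mc).T @ (Xc - mc)
        B += len(Xc) * np.outer(mc - m, mc - m)
    W /= N
    B /= N
    evals, evecs = np.linalg.eig(np.linalg.solve(W, B))  # W^{-1} B
    order = np.argsort(-evals.real)
    return evals.real[order], evecs.real[:, order]

# two well-separated Gaussian classes in 2-D (illustrative data)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1, (100, 2)),
               rng.normal([4, 0], 1, (100, 2))])
y = np.repeat([0, 1], 100)
J, T = pda_directions(X, y)
print(f"leading Fisher criterion J = {J[0]:.2f}")
```

With M = 2 classes, B has rank 1, so only the leading eigenvalue is non-zero: this is exactly the first shortcoming of PDA noted above.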
By substituting B_{12,k} for B in Equation 3, we obtain a non-parametric Fisher criterion J'(τ). The features maximizing J'(τ) can be obtained as the eigenvectors of W^{-1} B_{12,k}. Since B_{12,k} is generally of full rank, the number of discriminant vectors is not limited to M − 1. To extend NPDA to general M-class problems, two alternatives have been studied. The first consists in considering M two-class problems, or dichotomies: for each dichotomy, we take one class as Ω_1 and the other M − 1 classes as Ω_2, and discriminant vectors are extracted by the above procedure; afterwards, the best discriminant vectors can be chosen according to some selection procedure. The second alternative consists in defining a generalized non-parametric between-class scatter matrix as B_k = (1/N^2) \sum_{i<j} N_i N_j B_{ij,k}.

2.2 Application to weight initialization

The WIDA method consists in initializing the hidden-unit weights as discriminant vectors extracted by non-parametric DA, and adding bias terms. Learning is then carried out in 3 steps:

1. the biases of the hidden neurons are determined so as to maximize class separability in the space H spanned by the hidden units. As shown in [5, 6], a suitable measure of class separability is tr(G_h^{-1} B_h), where G_h and B_h are respectively the total and between-class scatter matrices in H;

2. the hidden-to-output weights are initialized randomly and trained separately to minimize the mean square output error;

3. finally, further training of the whole network is performed using the standard back-propagation algorithm.

3 Comparison to random initialization

The above initialization procedure was tested and compared to other methods using the following data sets:

Waveform data: a three-class synthetic problem in a 21-dimensional feature space. Training and test sets both contain 100 samples of each class [1].
Figure 1: Mean test misclassification rate as a function of training cycles (averages over 10 trials). — : random; - - : WIDA; -.- : prototype method. [Panels: vowel data (11 hidden units), sonar data (5 hidden units), waveform data (4 hidden units); y-axis: misclassification rate (%), x-axis: epoch.]

Vowel data: training and test data have 10 features and are partitioned into 11 classes. We used 528 randomly chosen samples for training and the 462 remaining samples for the test. A complete description of this data set is given in [7].

Sonar data: a real-world classification task [7] with 60 features and 2 classes. Training and test data are both of size 104.

The network weights were initialized with the WIDA algorithm, with the prototype method, and randomly. For each classification task, the number n of hidden units was varied from 2 to n_max. Training and test misclassification error rates were computed after each learning cycle. The algorithm was run 10 times for each value of n and each initialization method. Figure 1 shows the evolution of mean error rates as a function of time for the three tasks. The means of the best error rates obtained at each trial by the three methods are represented in Figure 2 as a function of n. As expected, these results show that the WIDA method provides good initial solutions in terms of misclassification error. This results in faster training, although the gain is not very large because we used an accelerated version of back-propagation. The main advantage of our method turns out to be better generalization performance on all three classification tasks: the test error rates obtained with the WIDA method were always significantly lower than those obtained with random initialization (and, to a lesser extent, than those obtained with the prototype method).
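The initialization scheme compared above can be sketched as follows. This is a schematic illustration only, not the authors' code: it seeds the input-to-hidden weight matrix with parametric discriminant directions, whereas WIDA uses non-parametric DA and then optimizes the biases and output weights as described in Section 2.2. All data and variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(1)

# toy two-class data in 5-D; only the first axis is discriminative
d, n_hidden = 5, 3
X = np.vstack([rng.normal(0, 1, (200, d)), rng.normal(0, 1, (200, d))])
X[200:, 0] += 3.0
y = np.repeat([0, 1], 200)

# within- and between-class scatter (Eqs. 1-2), then eigvecs of W^{-1} B
m = X.mean(axis=0)
W = np.zeros((d, d))
B = np.zeros((d, d))
for c in (0, 1):
    Xc = X[y == c]
    mc = Xc.mean(axis=0)
    W += (Xc - mc).T @ (Xc - mc)
    B += len(Xc) * np.outer(mc - m, mc - m)
evals, evecs = np.linalg.eig(np.linalg.solve(W, B))
tau = evecs.real[:, np.argsort(-evals.real)]   # discriminant vectors

# hidden weights: one discriminant vector per hidden unit (with M = 2 only
# the first is truly discriminant); small noise breaks symmetry for BP
W_hidden = np.stack([tau[:, i % d] for i in range(n_hidden)])
W_hidden += 0.01 * rng.normal(size=W_hidden.shape)
print(W_hidden.shape)  # → (3, 5)

# the first hidden unit already separates the class means before training
proj = X @ W_hidden[0]
```

Biases and hidden-to-output weights would then be fitted separately before fine-tuning the whole network with back-propagation, as in steps 1-3 above.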
Figure 2: Mean test misclassification rate as a function of n (averages over 10 trials). — : random; - - : WIDA; -.- : prototype method. [Panels: vowel data, sonar data, waveform data; y-axis: misclassification rate (%), x-axis: number of hidden units.]

4 Influence of dimensionality and network size

The influence of dimensionality and number of weights on generalization performance was studied experimentally using a set of discrimination tasks similar to those used in [8]. Each task consists in discriminating between two multivariate Gaussian classes. Both classes have identity covariance matrix. The class mean vectors are m_1 = (2, 0, ..., 0) and m_2 = −m_1. This parameterization keeps the Mahalanobis distance, and hence the theoretical Bayes error rate, constant. Training sets of 120 samples (60 in each class) and test sets of 400 samples (200 in each class) were randomly generated. The two initialization procedures tested were the WIDA method and random initialization. The number n of hidden units was varied from 2 to 10, and the data dimension d from 10 to 100 with a step of 10. For each of the 2 × 9 × 10 configurations, the learning algorithm was run 10 times, and the mean misclassification error rates were computed over the 10 trials. Figure 3 shows the resulting mean misclassification rates with 95% confidence intervals as a function of d and n. As shown in Figure 3, the generalization performance of randomly initialized networks degrades for large values of d and n. This dependency of the test error rate on the number of parameters to be estimated is well known in the pattern recognition and neural network literature as the peaking phenomenon [8]. This phenomenon turns out to be less severe, in this case, when the initial weights are determined by discriminant analysis. The rate of increase of the test error rate as a function of d is smaller, and practically
Figure 3: Mean test misclassification rate and 95% confidence interval as a function of data dimension and number of hidden units (averages over 10 trials). -*- : WIDA initialization, -o- : random initialization; d = data dimension, Egen = generalization error, NHU = number of hidden units (n). [Nine panels, NHU = 2 through 10.]
independent of n for 2 ≤ n ≤ 10. This finding can be interpreted by remarking that the WIDA method provides the learning algorithm with prior information about the data structure, in the form of discriminant axes. This restricts the search to a certain region of weight space, in which weight vectors lead to relatively simple discrimination boundaries. In that sense, careful initialization can be seen as performing a kind of regularization. This is consistent with the theoretical and experimental analysis performed by Raudys [8] in the case of linear classifiers, showing that suitable selection of initial weights may cancel the influence of dimensionality on the expected probability of misclassification.

5 Conclusion

A new weight initialization procedure for multilayer perceptrons has been presented. This procedure consists in using class-separability-preserving feature vectors as the initial hidden-layer weights. Biases and output weights are then optimized separately, before fine-tuning of all network parameters is performed by a standard back-propagation algorithm. This scheme has been applied to several real-world and artificial discrimination tasks, and has been shown to yield lower generalization error as compared to random initialization and (to a lesser extent) to the procedure proposed in [2]. Experimental results also suggest that introducing prior knowledge about the data structure in the form of discriminant vectors reduces the harmful effect of excessive parameters on the expected probability of misclassification. Our current work aims at combining this initialization procedure with a constructive training algorithm.

References

[1] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984.

[2] T. Denœux and R. Lengellé. Initializing back-propagation networks with prototypes. Neural Networks, 6(3):351–363, 1993.
[3] K. Fukunaga. Introduction to Statistical Pattern Recognition. 2nd edition, Academic Press, 1990.

[4] P. Gallinari, S. Thiria, F. Badran, and F. Fogelman-Soulié. On the relations between discriminant analysis and multilayer perceptrons. Neural Networks, 4:349–360, 1991.

[5] R. Lengellé and T. Denœux. Optimizing multilayer networks layer per layer without back-propagation. In I. Aleksander and J. Taylor, editors, Artificial Neural Networks II, pages 995–998. North-Holland, Amsterdam, 1992.

[6] R. Lengellé and T. Denœux. Training MLPs layer by layer using an objective function for internal representations. Neural Networks (to appear), 1995.

[7] P. M. Murphy and D. W. Aha. UCI Repository of Machine Learning Databases [machine-readable data repository]. University of California, Department of Information and Computer Science, Irvine, CA, 1994.

[8] Š. Raudys. Why do multilayer perceptrons have favorable small sample properties? In E. S. Gelsema and L. N. Kanal, editors, Pattern Recognition in Practice IV, pages 287–298. Elsevier, Amsterdam, 1994.