Influence of weight initialization on multilayer perceptron performance

M. Karouia (1,2), T. Denœux (1), R. Lengellé (1)
(1) Université de Compiègne, U.R.A. CNRS 817 Heudiasyc, BP 649, F-60206 Compiègne cedex, France. mkarouia@hds.univ-compiegne.fr
(2) Lyonnaise des Eaux (LIAC)

Abstract

This paper presents a new algorithm for initializing the weights in multilayer perceptrons. The method is based on the use of feature vectors extracted by discriminant analysis. Simulations carried out with real-world and synthetic data sets show that the proposed algorithm yields a better initial state than random initialization. As a result, training time is reduced and lower generalization error can be achieved. Additionally, numerical simulations show that the generalization performance of networks initialized with the proposed method becomes less sensitive to network size and input dimension.

1 Introduction

Many researchers have emphasized the importance of initial weights in multilayer perceptron (MLP) training. Several initialization algorithms have been proposed, such as the use of prototypes [2]. The most obvious potential benefits of starting optimization from a good initial state are faster training and a higher probability of reaching a deep minimum of the error function. Additionally, it has been found that introducing prior knowledge in the initial weights may in some cases improve generalization performance [2, 8].

In this paper, a new approach to weight initialization is proposed, and its effect on generalization is demonstrated experimentally. The starting point of this work is the relationship between MLPs and discriminant analysis (DA) pointed out by Gallinari [4]. It can be shown that training networks with one hidden layer using the quadratic error function is equivalent to maximizing a measure of class separability in the space spanned by the hidden units. DA techniques aim at extracting features that are effective in preserving class separability. The algorithm presented in this paper (WIDA: Weight Initialization by Discriminant Analysis) uses such features to initialize the weights of multilayer networks before training by standard back-propagation (BP) or any other learning procedure. The performance of the WIDA method is then analyzed using several synthetic and real-world data sets. We examine the effect of weight initialization on the following aspects: convergence speed (training time), generalization error, and sensitivity of the generalization error to data dimensionality and number of hidden units.

2 The initialization method

2.1 Discriminant analysis

We consider a set X of N samples in a d-dimensional space. The samples are assumed to be partitioned into M disjoint subsets. Subset $X_i$ of size $N_i$ includes the samples associated with class $\Omega_i$. Let $x_{ij}$ be the j-th d-dimensional sample vector from class $\Omega_i$. The mean vector of class $\Omega_i$ is $m_i = \frac{1}{N_i}\sum_{j=1}^{N_i} x_{ij}$, and the overall mean vector is $m = \frac{1}{N}\sum_{i=1}^{M} N_i m_i$. We define the parametric within-class scatter matrix W and the parametric between-class scatter matrix B respectively as:

$$W = \frac{1}{N} \sum_{i=1}^{M} \sum_{j=1}^{N_i} (x_{ij} - m_i)(x_{ij} - m_i)^T \qquad (1)$$

$$B = \frac{1}{N} \sum_{i=1}^{M} N_i (m_i - m)(m_i - m)^T \qquad (2)$$

where $(\cdot)^T$ denotes transposition.
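To make the definitions concrete, the following is a minimal NumPy sketch (not part of the original paper) that computes W and B from a data matrix and a label vector; the function and variable names are illustrative.

```python
import numpy as np

def parametric_scatter(X, y):
    """Parametric within-class (W) and between-class (B) scatter matrices, Eqs. (1)-(2).

    X is an (N, d) array of samples, y an (N,) array of integer class labels.
    """
    N, d = X.shape
    m = X.mean(axis=0)                       # overall mean vector m
    W = np.zeros((d, d))
    B = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]                       # samples of class Omega_i
        Ni = Xc.shape[0]
        mi = Xc.mean(axis=0)                 # class mean m_i
        centered = Xc - mi
        W += centered.T @ centered           # sum_j (x_ij - m_i)(x_ij - m_i)^T
        B += Ni * np.outer(mi - m, mi - m)   # N_i (m_i - m)(m_i - m)^T
    return W / N, B / N
```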

Matrix W is assumed to be positive definite, so that $W^{-1}$ exists. Matrix B is a positive semidefinite matrix with rank at most equal to M - 1 (we assume that $d \geq M$). The sum of W and B gives the parametric global covariance matrix G.

In parametric discriminant analysis (PDA), we seek d-dimensional feature vectors τ maximizing Fisher's criterion J(τ):

$$J(\tau) = \frac{\tau^T B \tau}{\tau^T W \tau} \qquad (3)$$

Such features are obtained as the eigenvectors of $W^{-1}B$, each eigenvalue $\lambda_i$ being equal to the Fisher criterion of its corresponding eigenvector $\tau_i$ ($J(\tau_i) = \lambda_i$).

PDA has two serious shortcomings. First, the maximum number of discriminant vectors is limited to M - 1; when M = 2, PDA can extract only one discriminant vector. The second and more fundamental problem is the intrinsically parametric nature of PDA: when the class distributions are significantly non-normal, PDA cannot be expected to determine good features preserving the complex structure needed for classification.

Non-parametric discriminant analysis (NPDA) was introduced to overcome both of these problems [3]. It is based on a non-parametric between-class scatter matrix that measures between-class scatter on a local basis, using a k-nearest neighbor (k-NN) approach. Let us first consider the case where M = 2. Let $n_{il}(x) \in X_i$ ($l = 1, \ldots, k$) be the k nearest neighbors in class $\Omega_i$ of an arbitrary sample $x \in X$. The local mean of class $\Omega_i$ (the sample mean of the k NNs from $\Omega_i$ to x) is $m_{ki}(x) = \frac{1}{k}\sum_{l=1}^{k} n_{il}(x)$. The non-parametric between-class scatter matrix is then defined as

$$B_{12,k} = \frac{1}{N}\left( \sum_{x \in X_1} p_{12}(x)\,(x - m_{k2}(x))(x - m_{k2}(x))^T + \sum_{x \in X_2} p_{12}(x)\,(x - m_{k1}(x))(x - m_{k1}(x))^T \right) \qquad (4)$$

The term $p_{12}(x)$ is defined as a function of the distances between x and its k-th nearest neighbor from each class [3]. Its role is to de-emphasize the samples located far away from the class boundary.
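In code, the discriminant vectors can be obtained by solving the generalized eigenproblem $B\tau = \lambda W\tau$. The sketch below (an illustration, not the authors' implementation) also builds a two-class non-parametric scatter matrix in the spirit of Eq. (4); the exact form of $p_{12}(x)$ is only referenced to [3] in the paper, so the k-th nearest-neighbor distance ratio and the exponent alpha used here are assumptions.

```python
import numpy as np
from scipy.linalg import eigh

def discriminant_vectors(W, B, n_vectors):
    """Leading eigenvectors of W^{-1} B, via the generalized problem B v = lambda W v."""
    eigvals, eigvecs = eigh(B, W)                # W assumed positive definite; ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_vectors]
    return eigvecs[:, order], eigvals[order]

def nonparametric_between_scatter(X1, X2, k=3, alpha=1.0):
    """Two-class non-parametric scatter B_{12,k} built from k-NN local means (cf. Eq. 4)."""
    N = len(X1) + len(X2)
    d = X1.shape[1]
    B12 = np.zeros((d, d))
    for X_own, X_other in ((X1, X2), (X2, X1)):  # x in one class, local mean taken in the other
        for x in X_own:
            dist_own = np.sort(np.linalg.norm(X_own - x, axis=1))[1:k + 1]  # skip x itself
            dist_other = np.linalg.norm(X_other - x, axis=1)
            m_local = X_other[np.argsort(dist_other)[:k]].mean(axis=0)      # local mean m_k(x)
            # Weight p12(x): emphasizes samples near the class boundary (assumed form, after [3]).
            dk_own, dk_other = dist_own[-1], np.sort(dist_other)[k - 1]
            p12 = min(dk_own ** alpha, dk_other ** alpha) / (dk_own ** alpha + dk_other ** alpha)
            diff = x - m_local
            B12 += p12 * np.outer(diff, diff)
    return B12 / N
```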

By substituting B with $B_{12,k}$ in Equation 3, we obtain a non-parametric Fisher criterion $J'(\tau)$. The features maximizing $J'(\tau)$ can be obtained as the eigenvectors of $W^{-1}B_{12,k}$. Since $B_{12,k}$ is generally of full rank, the number of discriminant vectors is no longer limited to M - 1.

To extend NPDA to general M-class problems, two alternatives have been studied. The first one consists in considering M two-class problems, or dichotomies: for each dichotomy, we take one class as $\Omega_1$ and the other M - 1 classes as $\Omega_2$, and discriminant vectors are extracted by the above procedure; afterwards, the best discriminant vectors can be chosen according to some selection procedure. The second alternative consists in defining a generalized non-parametric between-class scatter matrix as $B_k = (1/N^2)\sum_{i<j} N_i N_j B_{ij,k}$.

2.2 Application to weight initialization

The WIDA method consists in initializing the hidden-unit weights as discriminant vectors extracted by non-parametric DA, and adding bias terms. Learning is then carried out in three steps:

1. the biases of the hidden neurons are determined so as to maximize class separability in the space H spanned by the hidden units; as shown in [5, 6], a suitable measure of class separability is $\mathrm{tr}(G_h^{-1} B_h)$, where $G_h$ and $B_h$ are respectively the total and between-class scatter matrices in H;

2. the hidden-to-output weights are initialized randomly and trained separately to minimize the mean square output error;

3. finally, further training of the whole network is performed using the standard back-propagation algorithm.
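The sketch below illustrates this initialization using the helper functions defined above. It is a simplified reading, not the authors' implementation: step 1 is only approximated (each hidden bias simply centres its projection on the overall data mean instead of maximizing $\mathrm{tr}(G_h^{-1} B_h)$), and the output layer is drawn at random before being trained separately.

```python
import numpy as np

def wida_initialize(X, y, n_hidden, seed=0):
    """Simplified WIDA-style initialization of a one-hidden-layer MLP (illustration only).

    Assumes n_hidden <= d. Returns input-to-hidden weights, hidden biases,
    hidden-to-output weights, and output biases.
    """
    rng = np.random.default_rng(seed)
    # Discriminant vectors become the input-to-hidden weights; use
    # nonparametric_between_scatter for B when n_hidden exceeds M - 1.
    W_s, B_s = parametric_scatter(X, y)
    V, _ = discriminant_vectors(W_s, B_s, n_hidden)
    W_ih = V.T                                   # one discriminant vector per hidden unit
    # Crude stand-in for step 1: centre each projection on the overall mean.
    b_h = -W_ih @ X.mean(axis=0)
    # Step 2: random hidden-to-output weights, to be trained separately on the MSE.
    n_out = len(np.unique(y))
    W_ho = rng.normal(scale=0.1, size=(n_out, n_hidden))
    b_o = np.zeros(n_out)
    return W_ih, b_h, W_ho, b_o
```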

3 Comparison to random initialization

The above initialization procedure was tested and compared to other methods using the following data sets:

Waveform data: a three-class synthetic problem in a 21-dimensional feature space. Training and test sets both contain 100 samples of each class [1].

Vowel data: training and test data have 10 features and are partitioned into 11 classes. We use 528 randomly chosen samples for training and the 462 remaining samples for the test. A complete description of this data set is given in [7].

Sonar data: a real-world classification task [7] with 60 features and 2 classes. Training and test data are both of size 104.

Figure 1: Mean test misclassification rate as a function of training cycles (averages over 10 trials). Solid line: random; dashed: WIDA; dash-dotted: prototype method. Panels: vowel data (11 hidden units), sonar data (5 hidden units), waveform data (4 hidden units).

The network weights were initialized with the WIDA algorithm, with the prototype method, and randomly. For each classification task, the number n of hidden units was varied from 2 to $n_{\max}$. Training and test misclassification error rates were computed after each learning cycle. The algorithm was run 10 times for each value of n and each initialization method. Figure 1 shows the evolution of the mean error rates as a function of time for the three tasks. The means of the best error rates obtained at each trial by the three methods are represented in Figure 2 as a function of n.

As expected, these results show that the WIDA method provides good initial solutions in terms of misclassification error. This results in faster training, although the gain is not very large because we use an accelerated version of back-propagation. The main advantage of our method turns out to be better generalization performance for all three classification tasks. The test error rates obtained with the WIDA method were always significantly lower than those obtained with random initialization (and, to a lesser extent, with the prototype method).

Figure 2: Mean test misclassification rate as a function of n (averages over 10 trials). Solid line: random; dashed: WIDA; dash-dotted: prototype method. Panels: vowel data, sonar data, waveform data.

4 Influence of dimensionality and network size

The influence of dimensionality and of the number of weights on generalization performance was studied experimentally using a set of discrimination tasks similar to that used in [8]. Each task consists in discriminating between two multivariate Gaussian classes. Both classes have identity covariance matrix, and the class mean vectors are $m_1 = (2, 0, \ldots, 0)^T$ and $m_2 = -m_1$. This parameterization keeps the Mahalanobis distance, and hence the theoretical Bayes error rate, constant across dimensions. Training sets of 120 samples (60 in each class) and test sets of 400 samples (200 in each class) were randomly generated. The two initialization procedures tested were the WIDA method and random initialization. The number n of hidden units was varied from 2 to 10, and the data dimension d from 10 to 100 with a step of 10. For each of the 2 × 9 × 10 configurations, the learning algorithm was run 10 times, and the mean misclassification error rates were computed over the 10 trials. Figure 3 shows the obtained mean misclassification rates with 95% confidence intervals as a function of d and n.
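A minimal sketch (assuming NumPy; the function name, seed, and the reconstructed per-class sample counts are illustrative) of how one of these synthetic tasks can be generated:

```python
import numpy as np

def make_gaussian_task(d, n_train=60, n_test=200, seed=0):
    """Two d-dimensional Gaussian classes, identity covariance, m1 = (2, 0, ..., 0), m2 = -m1.

    The Mahalanobis distance between the classes is 4 for every d, so the Bayes
    error rate stays constant (about 2.3%) as the dimension grows.
    """
    rng = np.random.default_rng(seed)
    m1 = np.zeros(d)
    m1[0] = 2.0

    def sample(n_per_class):
        X = np.vstack([rng.normal(size=(n_per_class, d)) + m1,   # class 1
                       rng.normal(size=(n_per_class, d)) - m1])  # class 2
        y = np.repeat([0, 1], n_per_class)
        return X, y

    return sample(n_train), sample(n_test)
```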

Figure 3: Mean test misclassification rate and 95% confidence interval as a function of data dimension and number of hidden units (averages over 10 trials). -*-: WIDA initialization, -o-: random initialization, d = data dimension, Egen = generalization error, NHU = number of hidden units (n). Panels (1)-(9) correspond to NHU = 2, ..., 10.

As shown in Figure 3, the generalization performance of randomly initialized networks degrades for large values of d and n. This dependence of the test error rate on the number of parameters to be estimated is well known in the Pattern Recognition and Neural Network literature as the peaking phenomenon [8]. The phenomenon turns out to be less pronounced, in this case, when the initial weights are determined by discriminant analysis: the rate of increase of the test error rate as a function of d is smaller, and practically independent of n for 2 ≤ n ≤ 10. This finding can be interpreted by remarking that the WIDA method provides the learning algorithm with prior information about the data structure, in the form of discriminant axes. This restricts the search to a certain region of weight space, in which weight vectors lead to relatively simple discrimination boundaries. In that sense, careful initialization can be seen as performing some kind of regularization. This is consistent with the theoretical and experimental analysis performed by Raudys [8] in the case of linear classifiers, showing that a suitable selection of initial weights may cancel the influence of dimensionality on the expected probability of misclassification.

5 Conclusion

A new weight initialization procedure for multilayer perceptrons has been presented. The procedure consists in using class-separability-preserving feature vectors as the initial hidden-layer weights. Biases and output weights are then optimized separately, before fine tuning of all network parameters is performed by a standard back-propagation algorithm. This scheme has been applied to several real-world and artificial discrimination tasks, and has been shown to yield lower generalization error than random initialization and (to a lesser extent) than the procedure proposed in [2]. Experimental results also suggest that the introduction of prior knowledge about the data structure, in the form of discriminant vectors, reduces the harmful effect of excessive parameters on the expected probability of misclassification. Our current work aims at combining this initialization procedure with a constructive training algorithm.

References

[1] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984.

[2] T. Denœux and R. Lengellé. Initializing back-propagation networks with prototypes. Neural Networks, 6(3):351-363, 1993.

[3] K. Fukunaga. Introduction to Statistical Pattern Recognition. Electrical Science. 2nd edition, Academic Press, 1990.

[4] P. Gallinari, S. Thiria, F. Badran, and F. Fogelman-Soulié. On the relations between discriminant analysis and multilayer perceptrons. Neural Networks, 4:349-360, 1991.

[5] R. Lengellé and T. Denœux. Optimizing multilayer networks layer per layer without back-propagation. In I. Aleksander and J. Taylor, editors, Artificial Neural Networks II, pages 995-998. North-Holland, Amsterdam, 1992.

[6] R. Lengellé and T. Denœux. Training MLPs layer by layer using an objective function for internal representations. Neural Networks (to appear), 1995.

[7] P. M. Murphy and D. W. Aha. UCI Repository of machine learning databases [machine-readable data repository]. University of California, Department of Information and Computer Science, Irvine, CA, 1994.

[8] S. Raudys. Why do multilayer perceptrons have favorable small sample properties? In E. S. Gelsema and L. N. Kanal, editors, Pattern Recognition in Practice IV, pages 287-298, Amsterdam, 1994. Elsevier.