Recent Developments in Multilayer Perceptron Neural Networks

Walter H. Delashmit
Lockheed Martin Missiles and Fire Control
Dallas, Texas 75265
walter.delashmit@lmco.com, walter.delashmit@verizon.net

Michael T. Manry
University of Texas at Arlington
Arlington, TX 76019
manry@uta.edu

Abstract

Several neural network architectures have been developed over the past several years. One of the most popular and most powerful architectures is the multilayer perceptron. This architecture is described in detail and recent advances in training of the multilayer perceptron are presented. Multilayer perceptrons are trained using various techniques. For years the most widely used training method was backpropagation, along with various derivatives of it that incorporate gradient information. Recent developments have used output weight optimization-hidden weight optimization (OWO-HWO) and full conjugate gradient methods. OWO-HWO is a very powerful technique in terms of accuracy and rapid convergence. OWO-HWO has been used with a unique network growing technique to ensure that the mean square error is monotonically non-increasing as the network size increases (i.e., as the number of hidden layer nodes increases). This network growing technique was trained using OWO-HWO but is amenable to any training technique. This technique significantly improves training and testing performance of the MLP.

1.0 Introduction

Several neural networks have been developed and analyzed over the last few decades. These include self-organizing neural networks [1, 2], the Hopfield network [3, 4], radial basis function networks [5-6], the Boltzmann machine [7], the mean-field theory machine [8] and multilayer perceptrons (MLPs) [9-17]. MLPs have evolved over the years into a very powerful technique for solving a wide variety of problems. Much progress has been made in improving performance and in understanding how these neural networks operate. However, the need for additional improvements in training these networks still exists, since the training process is very chaotic in nature.

2.0 Structure and Operation of Multilayer Perceptron Neural Networks

MLP neural networks consist of units arranged in layers. Each layer is composed of nodes, and in the fully connected networks considered in this paper each node connects to every node in subsequent layers. Each MLP is composed of a minimum of three layers consisting of an input layer, one or more hidden layer(s) and an output layer. The above definition ignores the degenerate linear multilayer perceptron consisting of only an input layer and an output layer. The input layer distributes the inputs to subsequent layers. Input nodes have linear activation functions and no thresholds. Each hidden unit node and each output node have thresholds associated with them in addition to their weights. The hidden unit nodes have nonlinear activation functions and the outputs have linear activation functions. Hence, each signal feeding into a node in a subsequent layer has the original input multiplied by a weight with a threshold added, and is then passed through an activation function that may be linear or nonlinear (hidden units). A typical three-layer network is shown in figure 1. Only three-layer MLPs are considered in this paper, since these networks have been shown to approximate any continuous function [18-20]. For the actual three-layer MLP, all of the inputs are also connected directly to all of the outputs. These connections are not shown in figure 1 to simplify the diagram.

Figure 1. Typical three-layer multilayer perceptron neural network with one hidden layer, showing the input layer x_p(k), the hidden layer net functions net_p(j) and activations O_p(j), the output layer y_p(i), the input-to-hidden weights w_hi and the hidden-to-output weights w_oh.

The training data consists of a set of N_v training patterns (x_p, t_p), where p represents the pattern number. In figure 1, x_p corresponds to the N-dimensional input vector of the pth training pattern, and y_p corresponds to the M-dimensional output vector from the trained network for the pth pattern. For ease of notation and analysis, thresholds on hidden units and output units are handled by assigning the value of one to an augmented input vector component denoted by x_p(N+1). The output and input units have linear activations. The input to the jth hidden unit, net_p(j), is expressed by

$$\mathrm{net}_p(j) = \sum_{k=1}^{N+1} w_{hi}(j,k)\, x_p(k), \qquad 1 \le j \le N_h \qquad (1)$$

with the output activation for the pth training pattern, O_p(j), being expressed by

$$O_p(j) = f(\mathrm{net}_p(j)) \qquad (2)$$

The nonlinear activation is typically chosen to be the sigmoidal function

$$f(\mathrm{net}_p(j)) = \frac{1}{1 + e^{-\mathrm{net}_p(j)}} \qquad (3)$$

In (1) and (2), the input units are represented by the index k, and w_hi(j,k) denotes the weight connecting the kth input unit to the jth hidden unit. The overall performance of the MLP is measured by the mean square error (MSE) expressed by

$$E = \frac{1}{N_v}\sum_{p=1}^{N_v} E_p = \frac{1}{N_v}\sum_{p=1}^{N_v}\sum_{i=1}^{M}\left[t_p(i) - y_p(i)\right]^2 \qquad (4)$$

where $E_p = \sum_{i=1}^{M}\left[t_p(i) - y_p(i)\right]^2$ corresponds to the error for the pth pattern and t_p is the desired output vector for the pth pattern. This also allows the mapping error for the ith output unit to be expressed by

$$E_i = \frac{1}{N_v}\sum_{p=1}^{N_v}\left[t_p(i) - y_p(i)\right]^2 \qquad (5)$$

with the ith output for the pth training pattern being expressed by

$$y_p(i) = \sum_{k=1}^{N+1} w_{oi}(i,k)\, x_p(k) + \sum_{j=1}^{N_h} w_{oh}(i,j)\, O_p(j) \qquad (6)$$

In (6), w_oi(i,k) represents the weights from the input nodes to the output nodes and w_oh(i,j) represents the weights from the hidden nodes to the output nodes.
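To make the notation concrete, the computations in (1)-(6) and the error measures (4)-(5) can be sketched in a few lines of NumPy. The sketch below is only illustrative; the function name, array names and shapes are assumptions, not notation taken from the references.

```python
import numpy as np

def mlp_forward_mse(X, T, W_hi, W_oi, W_oh):
    """Forward pass and MSE for the three-layer MLP of Section 2.0.

    X    : (N_v, N)   input patterns x_p(k)
    T    : (N_v, M)   desired outputs t_p(i)
    W_hi : (N_h, N+1) input-to-hidden weights, last column = hidden thresholds
    W_oi : (M,  N+1)  input-to-output (bypass) weights, last column = output thresholds
    W_oh : (M,  N_h)  hidden-to-output weights
    """
    N_v = X.shape[0]
    # Augment each input vector with a constant 1 so thresholds act as ordinary weights.
    Xa = np.hstack([X, np.ones((N_v, 1))])               # x_p(N+1) = 1

    net = Xa @ W_hi.T                                    # eq. (1): net_p(j)
    O = 1.0 / (1.0 + np.exp(-net))                       # eqs. (2)-(3): sigmoid activations
    Y = Xa @ W_oi.T + O @ W_oh.T                         # eq. (6): linear output units

    E_p = np.sum((T - Y) ** 2, axis=1)                   # per-pattern error
    E = E_p.mean()                                       # eq. (4): overall MSE
    E_i = np.mean((T - Y) ** 2, axis=0)                  # eq. (5): per-output mapping error
    return Y, E, E_i
```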

2.1 Net Control

Some of the problems in MLP training are resolved by using the net-control algorithm of Olvera [21], which makes network weight behavior more stable. Sigmoid activations have small gradients when the input net function is large. This results in slow, ineffective training. In addition, large input values can dominate training and reduce the effects of smaller inputs. This can occur even when the smaller inputs are more important to the overall performance of the system. Net control reduces these problems by scaling and shifting net functions so that they do not generate small gradients and do not allow large inputs to mask the potential effects of small inputs. This process adjusts each net function mean and standard deviation to a user-defined standard deviation σ_hd and mean m_hd. The choices of these values of the standard deviation and mean are arbitrary and are the same for each net function.

Net control is implemented by setting the hidden unit net function mean and standard deviation to the desired values m_hd and σ_hd. This requires determining the mean m_h(j) and the standard deviation σ_h(j) of each hidden unit's net function. For the jth hidden unit net function (1 ≤ j ≤ N_h) and the ith augmented input (1 ≤ i ≤ N+1), the input-to-hidden unit weights are adjusted as

$$w_{hi}(j,i) \leftarrow w_{hi}(j,i)\,\frac{\sigma_{hd}}{\sigma_h(j)} \qquad (7)$$

with the hidden unit thresholds adjusted using

$$w_{hi}(j,N+1) \leftarrow w_{hi}(j,N+1) + m_{hd} - m_h(j)\,\frac{\sigma_{hd}}{\sigma_h(j)} \qquad (8)$$
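The net-control adjustment in (7)-(8) can be pictured with the following sketch. This is one reading of the procedure rather than code from [21]; the helper name and the default values of m_hd and σ_hd are assumptions (the text notes these values are a user choice).

```python
import numpy as np

def net_control(Xa, W_hi, m_hd=0.5, sigma_hd=1.0):
    """Adjust hidden-layer weights so each net function has the desired
    mean m_hd and standard deviation sigma_hd over the training patterns.

    Xa   : (N_v, N+1) augmented inputs (last column is all ones)
    W_hi : (N_h, N+1) input-to-hidden weights, last column = thresholds
    """
    net = Xa @ W_hi.T                        # current net functions net_p(j)
    m_h = net.mean(axis=0)                   # mean of each net function
    sigma_h = net.std(axis=0) + 1e-12        # std of each net function (guard against zero)

    scale = sigma_hd / sigma_h               # eq. (7): scale all weights of unit j
    W_hi = W_hi * scale[:, None]
    # eq. (8): shift the threshold so the new net-function mean equals m_hd
    W_hi[:, -1] += m_hd - m_h * scale
    return W_hi
```

After the adjustment, each hidden unit's net function has mean m_hd and standard deviation σ_hd over the training set, which keeps the sigmoids out of their flat regions.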

2.2 Training of Multilayer Perceptron Neural Networks

Training [22-24] of MLPs is an area of much research and is a very important task in developing useful network architectures. Existing training techniques include backpropagation, output weight optimization-hidden weight optimization (OWO-HWO), the full conjugate gradient method and many others.

Backpropagation Training

Backpropagation training [25] was introduced by Werbos [26], with an independent development by Parker [27]. Another discussion is contained in Rumelhart and McClelland [28]. In backpropagation training, all the weights and thresholds are updated using

$$w(j,i) \leftarrow w(j,i) - Z\,\frac{\partial E}{\partial w(j,i)} \qquad (9)$$

where Z is a constant learning factor. By using the chain rule, the gradients can be expressed by

$$\frac{\partial E_p}{\partial w(j,i)} = -\,\delta_p(j)\, O_p(i) \qquad (10)$$

where

$$\delta_p(j) = -\,\frac{\partial E_p}{\partial \mathrm{net}_p(j)} \qquad (11)$$

is called the delta function. The delta functions for the output units and the hidden units, respectively [11], are expressed by

$$\delta_p(k) = f'(\mathrm{net}_p(k))\,\bigl(t_p(k) - O_p(k)\bigr), \qquad \delta_p(j) = f'(\mathrm{net}_p(j))\sum_{n}\delta_p(n)\, w(n,j) \qquad (12)$$

where f' is the first derivative of the activation function and n indexes the units in subsequent layers that are connected to the jth unit in the previous layer.

Output Weight Optimization-Hidden Weight Optimization

Output weight optimization-hidden weight optimization (OWO-HWO) [12, 29-31] is an improved technique for updating the multilayer perceptron weights. Linear equations are used to solve for the output weights in OWO. Separate error functions for each hidden unit are used, and multiple sets of linear equations are solved to determine the weights connecting to the hidden units [29, 30] in HWO. The desired jth net function net_pd(j) for the pth pattern is

$$\mathrm{net}_{pd}(j) = \mathrm{net}_p(j) + Z\,\delta_p(j) \qquad (13)$$

where net_p(j) is the current net function from (1) and Z is the learning factor. Desired weight changes w_d(j,i) are found by solving

$$\delta_p(j) \cong \sum_{i=1}^{N+1} w_d(j,i)\, x_p(i) \qquad (14)$$

These equations are approximately solved unit-by-unit for the desired weight changes w_d(j,i). The hidden weights w(j,i) are then updated as

$$\Delta w(j,i) = Z\, w_d(j,i) \qquad (15)$$
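A rough sketch of one HWO-style pass over the hidden weights follows: delta functions are obtained by backpropagating the output error as in (12), the desired weight changes of (14) are found by a least-squares solve for each hidden unit, and (15) scales them by the learning factor. This is an illustration under assumed array shapes, not the authors' implementation, and it omits the companion OWO solution of the output weights.

```python
import numpy as np

def hwo_update(Xa, T, W_hi, W_oi, W_oh, Z=0.1):
    """One HWO-style update of the hidden weights (a sketch of eqs. (12)-(15)).

    Xa : (N_v, N+1) augmented inputs;  T : (N_v, M) desired outputs.
    """
    net = Xa @ W_hi.T
    O = 1.0 / (1.0 + np.exp(-net))              # hidden activations
    Y = Xa @ W_oi.T + O @ W_oh.T                # network outputs, eq. (6)

    # Output-unit deltas (linear outputs, f' = 1):  t_p(i) - y_p(i)
    delta_out = T - Y
    # Hidden-unit deltas, eq. (12):  f'(net) * sum_n delta_p(n) w_oh(n, j)
    delta_hid = O * (1.0 - O) * (delta_out @ W_oh)

    # Eq. (14): for each hidden unit j, solve  delta_p(j) ~ sum_i w_d(j,i) x_p(i)
    # in the least-squares sense over all patterns.
    W_d, *_ = np.linalg.lstsq(Xa, delta_hid, rcond=None)   # shape (N+1, N_h)

    # Eq. (15): scale the desired changes by the learning factor and update.
    W_hi = W_hi + Z * W_d.T
    return W_hi
```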

The total change in the error function E for a given iteration, due to the changes in all hidden weights, is approximately

$$\Delta E \cong -\frac{Z}{N_v}\sum_{j=1}^{N_h}\sum_{p=1}^{N_v}\delta_p^2(j) \qquad (16)$$

When the learning factor Z is positive and small enough to make the approximation in (14)-(16) valid, the ΔE sequence is non-positive and the sequence of training errors over iterations is non-increasing. Since non-increasing sequences of non-negative real numbers converge, the sequence of training errors converges [32]. When the error surface is highly curved, the approximation in (14)-(16) may not be valid for some iterations, resulting in increases in E. For these cases, the algorithm restores the previously calculated network and reduces the learning factor Z. After this step is repeated a finite number of times, the E sequence will again be decreasing since the error surface is continuous [32]. This results in the convergence of the sequence of E values.

Full Conjugate Gradient Method

The full conjugate gradient training method (FCG) [12, 29-30] is applied to all of the weights and thresholds in the MLP. The weight vector w of dimension N_w is formed from these weights and thresholds. An optimal learning factor is used to update all of the weights at the same time. The goal of the FCG method is to minimize the error function E(w) with respect to w, using the gradient vector g = ∂E/∂w and the direction vector p. For the first iteration, p = -g. For the subsequent N_w - 1 iterations, p is expressed by

$$\mathbf{p} \leftarrow -\mathbf{g} + \frac{E_g(i_t)}{E_g(i_t - 1)}\,\mathbf{p} \qquad (17)$$

where E_g(i_t) is the gradient vector energy in the current iteration and E_g(i_t - 1) is the gradient vector energy in the previous iteration. The learning factor Z is calculated by solving

$$\frac{dE(\mathbf{w} + Z\mathbf{p})}{dZ} = 0 \qquad (18)$$

for Z, and then the weights are updated using

$$\mathbf{w} \leftarrow \mathbf{w} + Z\,\mathbf{p} \qquad (19)$$

The optimal learning factor Z can be obtained using a Taylor series expansion of E(w + Zp) to obtain the solution of (18).
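The FCG recursion (17)-(19) can be sketched as follows. The gradient and error callbacks are hypothetical helpers, and the coarse scan over Z merely stands in for the Taylor-series line search used to solve (18).

```python
import numpy as np

def fcg_step(w, grad_fn, err_fn, p_prev=None, g_energy_prev=None):
    """One full-conjugate-gradient step over the stacked weight vector w.

    grad_fn(w) returns g = dE/dw;  err_fn(w) returns E(w).
    """
    g = grad_fn(w)
    g_energy = float(g @ g)                      # gradient vector "energy"

    if p_prev is None:                           # first iteration: p = -g
        p = -g
    else:                                        # eq. (17): conjugate direction update
        p = -g + (g_energy / g_energy_prev) * p_prev

    # Eq. (18): choose Z minimizing E(w + Z p); a coarse scan stands in here
    # for the Taylor-series line search described in the text.
    Zs = np.linspace(0.0, 1.0, 21)
    Z = min(Zs, key=lambda z: err_fn(w + z * p))

    w_new = w + Z * p                            # eq. (19)
    return w_new, p, g_energy
```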

Comparison of Training Techniques

Backpropagation training often results in excessive training time, and performance may be poor. Newer algorithms such as FCG and OWO-HWO are often preferred to backpropagation. Kim [33-34] compared FCG and OWO-HWO for many different data types. He showed that FCG performs well on random training data and that OWO-HWO performs better than FCG on correlated data. OWO-HWO is also invariant to input biases. In addition, OWO-HWO is able to reach memorization in a fixed number of iterations. In contrast, FCG is dependent on input biases and thus requires normalized training data. FCG also has problems attaining memorization for correlated training data.

2.3 MLP Design Methodologies

When investigators design neural networks for an application, there are many ways they can investigate the effects of network structure, which refers to the specification of network size (i.e., number of hidden units) when the numbers of inputs and outputs are fixed [35]. A well-organized researcher may design a set of different size networks in an orderly fashion, each with one or more hidden units than the previous network. This approach is designated Design Methodology One (DM-1). Alternatively, a researcher may design different size networks in no particular order. This approach is designated Design Methodology Two (DM-2). These two design methodologies are significantly different, and the approach chosen is often based on the goals set for the network design and the associated performance. In general, the more thorough approach often used for DM-1 may take more time to develop the selected network, since the researcher may be interested in achieving a trade-off between network performance and network size. However, the DM-1 approach may produce a superior design, since the design can be pursued until the researcher is satisfied that further increases in network size produce diminishing returns in terms of decreasing training time or testing errors. In contrast, DM-2 may be quickly pursued for only a few networks, with one of these chosen as an adequate network after fewer comparisons to other networks. It is possible that such a DM-2 network design could have been significantly improved with just a bit more attention to network design.

3.0 Network Initialization Types and Training Problems

MLP training is inherently dependent on the initialization of the networks. Training can be improved by proper network initialization. The three distinct types of networks considered are Randomly Initialized (RI) networks, Common Starting Point Initialized (CSPI) networks and Dependently Initialized (DI) networks [35].

Randomly Initialized Networks: When a set of MLPs is RI, no members of the set have any initial weights and thresholds in common. Practically, this means that the initial random number seeds (IRNS) of the networks are widely separated. RI networks have no initial weights or thresholds in common for networks of the same or different sizes. These networks are useful when the goal is to quickly design one or more networks, of the same or different sizes, whose weights are statistically independent of each other. RI networks can be designed using DM-1 or DM-2.

Common Starting Point Initialized Networks: When a set of MLPs is CSPI, each one starts with the same IRNS. These networks are useful when it is desired to make performance comparisons of networks that have the same IRNS as the starting point. CSPI networks can be designed using DM-1 or DM-2.

Dependently Initialized Networks: Here a series of different size networks is designed, with each subsequent network having one or more hidden units than the previous network. Larger networks are initialized using the final weights and thresholds that resulted from training a smaller network. DI networks are useful when the goal is a thorough analysis of network performance versus size, and they are most relevant to being designed using DM-1.
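The three initialization types can be contrasted with a short sketch (illustrative only; the helper function and the chosen seeds are assumptions): RI networks draw their weights from widely separated seeds, CSPI networks of every size reuse one seed, and DI networks are initialized from a trained smaller network as described in Section 4.

```python
import numpy as np

def init_weights(N, N_h, M, seed):
    """Random initial weights for a three-layer MLP with N inputs,
    N_h hidden units and M outputs (bypass weights included)."""
    rng = np.random.default_rng(seed)
    return {"W_hi": rng.normal(0, 0.1, (N_h, N + 1)),
            "W_oi": rng.normal(0, 0.1, (M, N + 1)),
            "W_oh": rng.normal(0, 0.1, (M, N_h))}

# RI: each network, even of the same size, gets a widely separated seed.
ri_nets = [init_weights(15, n_h, 1, seed=1000 * k) for k, n_h in enumerate(range(1, 6))]

# CSPI: every network size starts from the same initial random number seed.
cspi_nets = [init_weights(15, n_h, 1, seed=42) for n_h in range(1, 6)]
```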

3.1 Problems with MLP Training

Problems that have been identified with MLP training include:

(a) Non-monotonic E_f(N_h): For well-trained MLPs, E_f(N_h) should decrease as N_h increases. However, in some instances this does not occur, since the initial hidden unit basis functions are different for RI networks. An example of this phenomenon is illustrated in figure 2 for a typical single initial random number seed case. This can also occur for CSPI networks. When investigators analyze MLP performance versus N_h, they often use many seeds, so that the MSE averaged over a large number of seeds, as illustrated in figure 3, will be more nearly monotonic non-increasing. Developers may also show performance results using the median error, E_fmed(N_h), out of a set of seeds (figure 3), or the minimum error, E_fmin(N_h), out of a set of seeds (figure 4). The problem with using the minimum error is that the minimum for one size network will in general not correspond to the same seed as the minimum for a different size network, as illustrated in figure 4.

Figure 2. Typical single seed mapping error E_f(N_h) versus N_h.

Figure 3. Non-monotonic average error curve: mean square error versus N_h for the average and median error over a set of seeds.

Figure 4. Minimum error curve: E_fmin(N_h) and the corresponding seed number versus N_h.

The results shown in figures 2-4 are for a data set consisting of 15 inputs, 1 output and 1024 patterns, used to demodulate a frequency modulated signal containing a sinusoid (the fm data set).

(b) No standard way to initialize and train additional hidden units: As the size of the MLP is increased by adding new hidden units, a consistent technique for initializing these new units, as well as the previous hidden units, is needed for CSPI networks. However, no widely accepted technique for doing this has been developed.

(c) Net control parameters are arbitrary: Net control [21] is an effective technique for initializing the MLP. Unfortunately, appropriate values for the desired net function means and standard deviations have not been determined in a systematic manner.

(d) No procedure to initialize and train DI networks: DI networks take advantage of previous training on smaller networks in a systematic manner. Existing techniques for training and initializing these networks are heuristic in nature.

(e) Network linear and nonlinear component interference: Neural network training algorithms spend a significant amount of their time optimizing the linear output weights. Unfortunately, the performance of the MLP can be affected by simultaneous training of the linear and nonlinear components of the mapping. Techniques which concentrate mostly on determining the nonlinear mapping component are lacking. This topic will not be addressed in this paper, but will be addressed in future papers.

4.0 DI Network Development and Evaluation

DI networks [35] were defined above as a potential improvement to RI and CSPI networks. DI networks are designed such that each larger network is initialized with the final weights and thresholds from a previously well-trained smaller network. These networks are designed with DM-1, where a well-organized researcher designs a set of different size networks that achieve acceptable performance bounds. The larger network may have one or more additional hidden units than the smaller network, with the additional hidden units being initialized using random numbers as discussed above. It should be noted that this development is based on the technique of designing an initial single network (i.e., a single IRNS) and using it to extend to larger networks. Hence, this produces practical networks that can be readily implemented, rather than many networks of each size.

4.1 DI Network Basic Algorithm

These networks build on previously well-trained networks to produce improved designs. For DI networks, the numerical values of the common subset of the initial weights and thresholds for the larger network are the final weights and thresholds from the well-trained smaller network [36]. Designing a DI network requires an initial starting non-DI network of N_h hidden units. Initial starting networks can have any value of N_h. After designing the initial network, each subsequent network is a DI network.

The final weights and thresholds, w_f, for the well-trained smaller network of size N_h - 1 are used as the initial weights and thresholds, w_int, for the larger DI network of size N_h. This initialization technique applies whether the network of size N_h - 1 is the initial starting non-DI network or a previously trained DI network.

4.2 Properties of DI Networks

Based upon the DI network design technique, several very useful properties can be asserted:

(1) E_int(N_h) < E_int(N_h - 1)
(2) The E_f(N_h) curve is monotonic non-increasing (i.e., E_f(N_h) ≤ E_f(N_h - 1)).
(3) E_int(N_h) = E_f(N_h - 1)

Proofs and justifications of these properties will be presented in future papers.

4.3 Performance Results for DI Networks for Fixed Iterations

Performance results for DI networks [36] are presented in this section for training with a fixed number of iterations (N_iter = 250). These results are for the case where all weights and thresholds are retrained as new hidden units are added to the network. The techniques are applicable to adding any number of new hidden units, but the results shown are for cases where the number of hidden units is incremented by one for each subsequent network, with the initial non-DI network being designed with one hidden unit. Since these networks are designed using a single IRNS, testing results are presented in addition to the training results.

fm: Results for the fm data set are shown in figure 5. This DI network has a monotonic non-increasing E_f(N_h) as N_h increases during training. Testing performance for this network is also good, being very consistent and tracking the training error very well. These results also show that a value of N_h of nine is the maximum value that results in performance improvement.

Figure 5. Training and testing error E_f(N_h) versus N_h for the fm data set.

Additional results for several other data sets will be presented in future papers.
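The weight-copying step of Section 4.1 can be pictured with the short sketch below, reusing the weight dictionary of the earlier snippets. Whether the new hidden unit's output connections are zeroed or left random before retraining is not specified above; zeroing them here is one reading under which property (3), E_int(N_h) = E_f(N_h - 1), holds exactly at initialization, since the larger network then reproduces the smaller network's mapping.

```python
import numpy as np

def grow_di_network(trained, N, M, seed=0, new_units=1):
    """Initialize a DI network with `new_units` extra hidden units from a
    trained smaller network (dict with keys W_hi, W_oi, W_oh)."""
    rng = np.random.default_rng(seed)

    W_hi = np.vstack([trained["W_hi"],                        # copy trained hidden weights
                      rng.normal(0, 0.1, (new_units, N + 1))])  # random new hidden unit(s)
    W_oi = trained["W_oi"].copy()                             # bypass weights carried over
    # New hidden-to-output connections start at zero so the larger network
    # initially reproduces the smaller network's outputs (one reading of property (3)).
    W_oh = np.hstack([trained["W_oh"], np.zeros((M, new_units))])
    return {"W_hi": W_hi, "W_oi": W_oi, "W_oh": W_oh}
```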

5.0 Conclusions and Future Papers

This paper presented a technique that significantly improves the training performance of multilayer perceptron neural networks. It analyzed problems with MLP neural networks and developed a dependently initialized training enhancement that ensures that the MSE versus N_h curves are always monotonic non-increasing. Future related papers will discuss a monotonicity analysis developed for MLP networks. In addition, investigations into net control improvements will be presented, particularly in regard to the initial network design for DI networks. More quantification of a precise hybrid approach for training both the new and previous hidden units will also be presented. These techniques will further be applied to a separating mean technique, where the means of similar units are removed prior to network training. Rules will also be developed to determine when adding additional hidden units is not warranted, based on the additional reduction in E_f(N_h) as the network size increases.

References:

[1] R. P. Lippmann, "An introduction to computing with neural networks," IEEE ASSP Magazine, April 1987.
[2] T. Kohonen, Self-Organization and Associative Memory, Springer-Verlag, 2nd edition, 1988.
[3] J. J. Hopfield, "Neural networks and physical systems with emergent collective computational abilities," Proceedings of the National Academy of Sciences of the U.S.A., pp. 2554-2558, 1982.
[4] G. Galan-Marin and J. Perez, "Design and analysis of maximum Hopfield networks," IEEE Transactions on Neural Networks, vol. 12, no. 2, pp. 329-339, March 2001.
[5] M. J. D. Powell, "Radial basis functions for multivariable interpolation: A review," IMA Conference on Algorithms for the Approximation of Functions, pp. 143-167, RMCS, Shrivenham, U.K., 1985.
[6] C. Panchapakesan, M. Palaniswami, D. Ralph and C. Manzie, "Effects of moving the centers in an RBF network," IEEE Transactions on Neural Networks, vol. 13, no. 6, pp. 1299-1307, November 2002.
[7] G. E. Hinton and T. J. Sejnowski, "Learning and relearning in Boltzmann machines," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, MIT Press, Cambridge, MA, 1986.
[8] C. Peterson and J. R. Anderson, "A mean field theory learning algorithm for neural networks," Complex Systems, 1, pp. 995-1019, 1987.
[9] H. Chandrasekaran, Analysis and Convergence Design of Piecewise Linear Networks, Ph.D. Dissertation, The University of Texas at Arlington, 2000.
[10] K. K. Kim, A Local Approach for Sizing the Multilayer Perceptron, Ph.D. Dissertation, The University of Texas at Arlington, 1996.
[11] M. S. Chen and M. T. Manry, "Power series analysis of back-propagation neural networks," Proceedings of IJCNN 91, Seattle, WA, pp. I-295 to I-300, 1991.
[12] H. H. Chen, M. T. Manry and H. Chandrasekaran, "A neural network training algorithm utilizing multiple sets of linear equations," Neurocomputing, vol. 25, no. 1-3, pp. 55-72, April 1999.
[13] Z. J. Yang, "Hidden-layer size reducing for multilayer neural networks using the orthogonal least squares method," Proceedings of the 1997 36th IEEE Society of Instrument and Control Engineers (SICE) Annual Conference, pp. 1089-1092, July 1997.
[14] M. A. Sartori and P. J. Antsaklis, "A simple method to derive bounds on the size and to train multilayer neural networks," IEEE Transactions on Neural Networks, vol. 2, no. 4, pp. 467-471, July 1991.
[15] M. S. Chen, Analysis and Design of the Multilayer Perceptron Using Polynomial Basis Functions, Ph.D. Dissertation, The University of Texas at Arlington, 1991.
[16] A. Gopalakrishnan, X. Jiang, M. S. Chen and M. T. Manry, "Constructive proof of efficient pattern storage in the multilayer perceptron," Twenty-Seventh Asilomar Conference on Signals, Systems and Computers, Vol. 1, pp. 386-390, November 1993.

[17] M. T. Manry, H. Chandrasekaran and C. Hsieh, "Signal processing using the multilayer perceptron," in Handbook of Neural Network Signal Processing, edited by Y. H. Hu and J. Hwang, pp. 2-1 to 2-29, CRC Press, 2001.
[18] K. Hornik, M. Stinchcombe and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, no. 5, pp. 359-366, 1989.
[19] K. Hornik, M. Stinchcombe and H. White, "Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks," Neural Networks, vol. 3, no. 5, pp. 551-560, 1990.
[20] G. Cybenko, "Approximation by superpositions of a sigmoidal function," Mathematics of Control, Signals and Systems, vol. 2, no. 4, pp. 303-314, 1989.
[21] J. Olvera, Monomial Activation Functions for the Multi-Layer Perceptron, M.S. Thesis, The University of Texas at Arlington, 1992.
[22] V. P. Plagianakos, G. D. Magoulas and M. N. Vrahatis, "Deterministic nonmonotone strategies for effective training of multilayer perceptrons," IEEE Transactions on Neural Networks, vol. 13, no. 6, pp. 1268-1284, November 2002.
[23] H. Chen, Neural Network Training and Pruning Using Multiple Regressions, Ph.D. Dissertation, The University of Texas at Arlington, 1997.
[24] F. J. Maldonado and M. T. Manry, "Optimal pruning of feedforward neural networks based upon the Schmidt procedure," Thirty-Sixth Annual Asilomar Conference on Signals, Systems and Computers, pp. 1024-1028, Pacific Grove, CA, 3-6 November 2002.
[25] E. M. Johansson, F. U. Dowla and D. M. Goodman, "Backpropagation learning for multilayer feed-forward neural networks using the conjugate gradient method," International Journal of Neural Systems, vol. 2, no. 4, pp. 291-301, 1991.
[26] P. Werbos, Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences, Ph.D. Dissertation, Committee on Applied Mathematics, Harvard University, Cambridge, MA, 1974.
[27] D. B. Parker, Learning Logic, Invention Report S81-64, File 1, Office of Technology Licensing, Stanford University, 1982.
[28] D. E. Rumelhart, G. E. Hinton and R. J. Williams, "Learning internal representations by error propagation," in D. E. Rumelhart and J. L. McClelland (Eds.), Parallel Distributed Processing, Vol. I, The MIT Press, Cambridge, MA, 1986.
[29] R. Fletcher and C. M. Reeves, "Function minimization by conjugate gradients," Computer Journal, vol. 7, pp. 149-154, 1964.
[30] R. Fletcher, "Conjugate direction methods," in Numerical Methods for Unconstrained Optimization, W. Murray, editor, Academic Press, pp. 73-86, 1972.
[31] C. Yu and M. T. Manry, "A modified hidden weight optimization algorithm for feed-forward neural networks," Thirty-Sixth Annual Asilomar Conference on Signals, Systems and Computers, pp. 1034-1038, Pacific Grove, CA, 3-6 November 2002.

[32] I. S. Sokolnikoff and R. M. Redheffer, Mathematics of Physics and Modern Engineering, McGraw-Hill, 2nd edition, 1966.
[33] T. Kim, Development and Evaluation of Multilayer Perceptron Training Algorithms, Ph.D. Dissertation, The University of Texas at Arlington, 2001.
[34] T. Kim, J. Li and M. T. Manry, "Evaluation and improvement of two training algorithms," Thirty-Sixth Annual Asilomar Conference on Signals, Systems and Computers, pp. 1019-1023, Pacific Grove, CA, 3-6 November 2002.
[35] W. H. Delashmit, Multilayer Perceptron Structured Initialization and Separating Mean Processing, Ph.D. Dissertation, The University of Texas at Arlington, May 2003.
[36] W. H. Delashmit and M. T. Manry, "Enhanced robustness of multilayer perceptron training," Thirty-Sixth Annual Asilomar Conference on Signals, Systems and Computers, pp. 1029-1033, Pacific Grove, CA, 3-6 November 2002.