An Adaptively Constructing Multilayer Feedforward Neural Networks Using Hermite Polynomials


An Adaptively Constructing Multilayer Feedforward Neural Networks Using Hermite Polynomials

L. Ma 1 and K. Khorasani 2

1 Department of Applied Computer Science, Tokyo Polytechnic University, 1583 Iiyama, Atsugi, Kanagawa, Japan, maly@cs.t-kougei.ac.jp
2 Department of Electrical and Computer Engineering, Concordia University, Montreal, Quebec, Canada H3G 1M8, kash@ece.concordia.ca

Abstract. In this paper a new strategy is introduced for constructing a multi-hidden-layer feedforward neural network (FNN) in which each hidden unit employs a polynomial activation function that is different from those of the other units. The proposed scheme incorporates both a structure-level adaptation and a function-level adaptation methodology in constructing the desired network. The activation functions considered consist of orthonormal Hermite polynomials. Using this strategy, a FNN can be constructed having as many hidden layers and hidden units as dictated by the complexity of the problem being considered.

Keywords: Adaptive structure neural networks; Hermite polynomials; Multilayer feedforward neural networks.

1 Introduction

The design of a feedforward neural network (FNN) requires consideration of three important issues. The solutions to these will significantly influence the overall performance of the neural network (NN) as far as the following two considerations are concerned: (i) recognition rate on new patterns, and (ii) generalization performance on data sets that have not been presented during network training.

The first problem is the selection of data/patterns for network training. This is a problem that has practical implications and has received limited attention. The training data set selection can have considerable effects on the performance of the trained network. The second problem is the selection of an appropriate and efficient training algorithm from the large number of possible training algorithms that have been developed in the literature. Many new training algorithms with faster convergence properties and lower computational requirements are being developed by researchers in the NN community.

This research was supported in part by the NSERC (Natural Sciences and Engineering Research Council of Canada).

D.-S. Huang et al. (Eds.): ICIC 2008, LNAI 5227, pp. 642-653, © Springer-Verlag Berlin Heidelberg 2008

The third problem is the determination of the network size. This problem is more important from a practical point of view than the above two problems, and is generally more difficult to solve. The problem here is to find a network structure that is as small as possible while meeting certain desired performance specifications. What is usually done in practice is that the developer trains a number of networks of different sizes, and then the smallest network that can fulfill all or most of the required performance specifications is selected. This amounts to a tedious process of trial and error. The second and third problems are closely related to one another, in the sense that different training algorithms are suitable for different NN topologies. Therefore, the above three considerations are indeed critical when a NN is to be applied to a real-life problem.

This paper focuses on developing a systematic procedure for an automatic determination and/or adaptation of the network architecture of a FNN. Algorithms that can determine an appropriate network architecture automatically according to the complexity of the underlying function embedded in the data set are very cost-efficient, and thus highly desirable. Efforts toward network size determination have been made in the literature for many years, and many techniques have been developed [3], [6] (and the references therein).

2 Hermite Polynomials

In this section, the Hermite polynomials with their hierarchical nonlinearities are first introduced. These polynomials will be used subsequently as the activation functions of the hidden units of the proposed constructive feedforward neural network (FNN). The orthogonal Hermite polynomials are defined over the interval (-\infty, \infty) of the input space and are given formally as follows [11] (see also the references therein):

H_0(x) = 1,   (1)

H_1(x) = 2x,   (2)

H_n(x) = 2x H_{n-1}(x) - 2(n-1) H_{n-2}(x), \quad n \ge 2.   (3)

The definition of H_n(x) may be given alternatively by

H_n(x) = (-1)^n e^{x^2} \frac{d^n}{dx^n} \big( e^{-x^2} \big), \quad n > 0, \quad \text{with } H_0(x) = 1.   (4)

The polynomials given in (1)-(3) are orthogonal to each other but not orthonormal. The orthonormal Hermite polynomials may then be defined according to

h_n(x) = \alpha_n H_n(x) \phi(x),   (5)

where

\alpha_n = (n!)^{-1/2} \pi^{1/4} 2^{-(n-1)/2},   (6)

\phi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}.   (7)

The first-order derivative of h_n(x) can be easily obtained by virtue of the recursive nature of the polynomials defined in (3), that is,

\frac{dh_n(x)}{dx} = (2n)^{1/2} h_{n-1}(x) - x h_n(x), \quad n \ge 1,   (8)

\frac{dh_0(x)}{dx} = \alpha_0 \frac{d\phi(x)}{dx} = -x h_0(x), \quad n = 0.   (9)

In [11], the orthonormal Hermite polynomials are used as basis functions to model 1-D signals in the biomedical field for the purposes of signal analysis and detection. In [4], a selected combination of the orthonormal Hermite polynomials in a weighted-sum form is used as the fixed activation function of all the hidden units of a one-hidden-layer FNN (OHL-FNN). In [8], [9], Hermite coefficients are used as pre-processing filters or serve as features of the process, which are then fed to (fuzzy) neural networks for classification problems. A feedforward neural network is designed in [10] that uses a Hermite regression formula to approximate the hidden units' activation functions in order to obtain an improved generalization capability. In that FNN, a fixed number of Hermite polynomials is used.

It was indicated earlier that the Hermite polynomials are chosen due to their suitable properties. There are other polynomials that have similar properties, such as the Legendre, Tchebycheff, and Laguerre polynomials [5]. However, the input range of these polynomials does not meet the requirement for a hidden-unit activation function, which should accept values ranging from -\infty to +\infty. If one uses a polynomial with a limited input range as the activation function of a hidden unit, one must then restrict the input-side weight space in which an optimal input-side weight vector is to be searched for under a given training criterion. Restricting the input-side weight space is not desirable, as it would limit the representational capability of the network. The justification and rationale for choosing the Hermite polynomials are therefore further motivated by their restriction-free input range.

Alternatively, the above polynomials may still be considered as activation functions of the hidden units provided that the input to a hidden unit is normalized each time its input-side weight vector is updated. Although the weight vector is now unconstrained and only the input is normalized by a properly selected constant, this constant is actually a function of the weight vector. Therefore, the input to the hidden unit during the input-side training phase will no longer be linear with respect to the weight vector. Consequently, as far as the weight training is concerned, the input-side training will now involve two types of nonlinearities: one arising from the input to the hidden unit, and the other due to the activation function of the hidden unit. The corresponding optimization problem is clearly more complicated than the one with only the nonlinearity of the activation function. It is, therefore, our conclusion that the Hermite polynomials are expected to be the most suitable choice for the activation functions of the hidden units of a FNN.
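As a concrete illustration of the recursion (1)-(3), the normalization (5)-(7), and the derivative formulas (8)-(9), the following Python sketch evaluates the orthonormal Hermite activations and their first derivatives. It is not code from the paper; the function names (hermite_H, hermite_h, hermite_dh) are ours.

```python
import numpy as np
from math import factorial, pi, sqrt

def hermite_H(n, x):
    """Physicists' Hermite polynomial H_n(x) via the recursion (1)-(3)."""
    x = np.asarray(x, dtype=float)
    H_prev, H_curr = np.ones_like(x), 2.0 * x          # H_0(x), H_1(x)
    if n == 0:
        return H_prev
    for k in range(2, n + 1):                          # H_k = 2x H_{k-1} - 2(k-1) H_{k-2}
        H_prev, H_curr = H_curr, 2.0 * x * H_curr - 2.0 * (k - 1) * H_prev
    return H_curr

def hermite_h(n, x):
    """Orthonormal Hermite activation h_n(x) = alpha_n H_n(x) phi(x), eqs. (5)-(7)."""
    x = np.asarray(x, dtype=float)
    alpha_n = factorial(n) ** -0.5 * pi ** 0.25 * 2.0 ** (-(n - 1) / 2.0)
    phi = np.exp(-x ** 2 / 2.0) / sqrt(2.0 * pi)
    return alpha_n * hermite_H(n, x) * phi

def hermite_dh(n, x):
    """First derivative dh_n/dx from eq. (8) for n >= 1 and eq. (9) for n = 0."""
    x = np.asarray(x, dtype=float)
    if n == 0:
        return -x * hermite_h(0, x)
    return sqrt(2.0 * n) * hermite_h(n - 1, x) - x * hermite_h(n, x)

# Numerical sanity check of orthonormality: the integral of h_n(x)^2 over the
# real line should be approximately 1 for the modest orders used here.
xs = np.linspace(-10.0, 10.0, 20001)
for n in range(4):
    print(n, float(np.sum(hermite_h(n, xs) ** 2) * (xs[1] - xs[0])))
```

Because the Gaussian weighting \phi(x) decays as e^{-x^2/2}, these activations remain bounded over the entire real line, which is precisely the restriction-free input range argued for above.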

3 The Proposed Incremental Constructive Training Algorithm

Consider a typical constructive one-hidden-layer feedforward neural network (OHL-FNN) with a linear output layer and a polynomial-type hidden layer, as shown in Figure 1. The output layer can also be nonlinear, as will be shown in our simulation results. The hidden unit whose input and output connections are denoted by dotted lines is the neuron that is currently being trained before it is allowed to join the other existing hidden units in the network.

Suppose, without loss of generality, that a given regression problem has an M-dimensional input vector and a scalar (one-dimensional) output. The j-th input and output samples are denoted by X_j = (x_0^j, x_1^j, ..., x_M^j) (with x_0^j = 1 denoting the bias) and d^j (target), respectively, where j = 1, 2, ..., P (P is the number of training data samples). The OHL constructive algorithm starts from a null network without any hidden units. At any given point during the constructive learning process, suppose there are n-1 hidden units in the hidden layer, and the n-th hidden unit is being trained before it is added to the existing network (see Figure 1). The output of the n-th hidden unit for the j-th training sample is given by

f_n(s_n^j) = h_{n-1} \Big( \sum_{m=0}^{M} w_{n,m} x_m^j \Big)   (10)

Fig. 1. Structure of a constructive OHL-FNN that utilizes the orthonormal Hermite polynomials as its activation functions for the hidden units (the figure shows hidden units with activations h_0, ..., h_{n-1}, output-side weights v_1, ..., v_n, and indicates the input-side and output-side training stages)

where h_{n-1}(·) is the orthonormal Hermite polynomial of order n-1, whose derivative can be calculated recursively from equation (8), and s_n^j is the input to the n-th hidden unit, given by

s_n^j = \sum_{m=0}^{M} w_{n,m} x_m^j   (11)

where w_{n,0} is the bias weight of the node (with its input x_0^j = 1), and w_{n,m} (m \ne 0) is the weight from the m-th component of the input vector to the n-th hidden node. The output of the network may now be expressed as follows

y_{n-1}^j = v_0 + \sum_{i=1}^{n-1} v_i h_{i-1}(s_i^j) = v_0 + v_1 h_0(s_1^j) + v_2 h_1(s_2^j) + v_3 h_2(s_3^j) + \cdots + v_{n-1} h_{n-2}(s_{n-1}^j),   (12)

where v_0 is the bias of the output node, and v_1, v_2, ..., v_{n-1} are the weights from the hidden layer to the output node. Clearly, the above expression bears a very close resemblance to a functional series expansion that utilizes the orthonormal Hermite polynomials as its basis functions, where each additional term in the expansion contributes to improving the accuracy of the function being approximated. Through this series expansion, the approximation error becomes smaller as more terms, having higher-order nonlinearities, are included. This situation is quite similar to that of an incrementally constructed OHL-FNN, in the sense that the error is expected to become smaller as new hidden units having hierarchically higher-order nonlinearities are successively added to the network. In fact, it has been shown that under certain circumstances incorporating new hidden units in an OHL-FNN can monotonically decrease the training error [7]. Therefore, in this sense adding a new hidden unit to the network is somewhat equivalent to adding a higher-order term in a Hermite polynomial-based series expansion. It is in this sense that the present network is envisaged to be more suitable for the constructive learning paradigm than a fixed-structure FNN. Note that only if s_1^j = s_2^j = \cdots = s_{n-1}^j will y_{n-1}^j be an exact Hermite polynomial-based series expansion. Otherwise, as structured in the algorithm above, the expansion is only approximate, since the weights associated with each added neuron are adjusted separately, resulting in different inputs to the activation functions.
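The output expression (12) can be read as a Hermite-series-like evaluation over the installed units. The short sketch below is our own illustration (the helper names h and ohl_output are hypothetical), computing y_{n-1}^j for a batch of samples and using NumPy's physicists' Hermite evaluator hermval for H_n.

```python
import numpy as np
from numpy.polynomial.hermite import hermval
from math import factorial, pi

def h(n, x):
    """Orthonormal Hermite h_n(x) = alpha_n H_n(x) phi(x); see eqs. (5)-(7)."""
    x = np.asarray(x, dtype=float)
    alpha_n = factorial(n) ** -0.5 * pi ** 0.25 * 2.0 ** (-(n - 1) / 2.0)
    return alpha_n * hermval(x, [0.0] * n + [1.0]) * np.exp(-x * x / 2.0) / np.sqrt(2.0 * pi)

def ohl_output(X, W, v):
    """
    Network output of eq. (12) for an OHL-FNN with n-1 installed hidden units.
    X : (P, M+1) training inputs with the bias component X[:, 0] = 1
    W : list of n-1 input-side weight vectors; W[i-1] belongs to the i-th unit
    v : length-n sequence of output-side weights, where v[0] is the output bias v_0
    """
    X = np.asarray(X, dtype=float)
    y = np.full(X.shape[0], float(v[0]))
    for i, w_i in enumerate(W, start=1):            # the i-th unit uses polynomial order i-1
        s_i = X @ np.asarray(w_i, dtype=float)      # s_i^j of eq. (11)
        y += v[i] * h(i - 1, s_i)
    return y
```

If all the inner products s_i^j happened to coincide, this loop would reduce to a truncated orthonormal Hermite series in that common variable, which is the exactness condition noted above.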

The problem addressed now is to determine how to train the n-th hidden unit that is to be added to the active network. There are many objective functions (see, for example, [2], [6] for more details) that can be considered for the input-side training of this hidden unit. A simple but general cost function that has been shown in the literature to work quite well is the following [2]:

J_{input} = \Big| \sum_{j=1}^{P} (e_{n-1}^j - \bar{e}_{n-1}) (f_n(s_n^j) - \bar{f}_n) \Big|   (13)

where

f_n(s_n^j) = h_{n-1}(s_n^j),   (14)

\bar{e}_{n-1} = \frac{1}{P} \sum_{j=1}^{P} e_{n-1}^j,   (15)

\bar{f}_n = \frac{1}{P} \sum_{j=1}^{P} h_{n-1}(s_n^j),   (16)

e_{n-1}^j = d^j - y_{n-1}^j,   (17)

and where f_n(s_n^j) (or h_{n-1}(s_n^j)) is the orthonormal Hermite polynomial of order n-1 used as the activation function of the n-th hidden unit, e_{n-1}^j is the network output error when the network has n-1 hidden units, and d^j is the j-th target output used for the network training. The derivative of J_{input} with respect to the weight w_{n,i} is calculated as

\frac{\partial J_{input}}{\partial w_{n,i}} = \mathrm{sgn}\Big( \sum_{j=1}^{P} (e_{n-1}^j - \bar{e}_{n-1}) (f_n(s_n^j) - \bar{f}_n) \Big) \sum_{j=1}^{P} (e_{n-1}^j - \bar{e}_{n-1}) \, h'_{n-1}(s_n^j) \, x_i^j   (18)

where sgn(·) is the sign function. The first-order derivative h'_{n-1}(s_n^j) in the above expression can be easily evaluated by using the recursive expression (8) for n \ge 2 and (9) for n = 1.

A candidate unit that maximizes the objective function (13) is incorporated into the network as the n-th hidden unit. Following this stage, the output-side training is performed by solving a least squares (LS) problem, given that the output layer has a linear activation function (see Figure 1). Specifically, once the input-side training is accomplished, the network output y^j with n hidden units may be expressed as follows:

y^j = \sum_{k=0}^{n} v_k f_k(s_k^j), \quad j = 1, 2, ..., P   (19)

where v_k (k = 1, 2, ..., n) are the output-side weights of the k-th hidden unit, and v_0 is the bias of the output unit with its input fixed to f_0 = 1. The corresponding output neuron tracking error is now given by

e^j = d^j - y^j = d^j - \sum_{k=0}^{n} v_k f_k(s_k^j), \quad j = 1, 2, ..., P.   (20)
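A minimal sketch of the input-side objective (13) and its gradient (18) for a single candidate unit is given below. It is our own illustration: the activation and its derivative are passed in as callables (for instance hermite_h and hermite_dh of order n-1 from the sketch in Section 2), and the helper name candidate_objective_and_grad is hypothetical. A plain gradient-ascent step on w would only be a stand-in for the quickprop algorithm [1] actually used in the paper.

```python
import numpy as np

def candidate_objective_and_grad(X, e_prev, w, f, df):
    """
    Correlation objective J_input of eq. (13) and its gradient (18) for one candidate.
    X      : (P, M+1) inputs with bias column X[:, 0] = 1
    e_prev : (P,) residual errors e^j_{n-1} of the current network, eq. (17)
    w      : (M+1,) input-side weight vector of the candidate unit
    f, df  : candidate activation h_{n-1} and its derivative h'_{n-1}, eqs. (8)-(9)
    """
    s = X @ w                                    # s_n^j of eq. (11)
    fs = f(s)
    fc = fs - fs.mean()                          # f_n(s_n^j) - f_bar_n, eq. (16)
    ec = e_prev - e_prev.mean()                  # e^j_{n-1} - e_bar_{n-1}, eq. (15)
    corr = ec @ fc
    J = abs(corr)                                # eq. (13)
    grad = np.sign(corr) * (X.T @ (ec * df(s)))  # eq. (18)
    return J, grad

# One hedged ascent step (quickprop [1] is used in the paper instead):
# w = w + learning_rate * grad
```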

Subsequently, the output-side training is performed by solving the following LS problem, given that the output layer has a linear activation function, that is,

J_{output} = \frac{1}{2} \sum_{j=1}^{P} (e^j)^2 = \frac{1}{2} \sum_{j=1}^{P} \Big\{ d^j - \sum_{k=0}^{n} v_k f_k(s_k^j) \Big\}^2.

After performing the output-side training, a new error signal e^j is calculated for the next cycle of input-side and output-side training. Our proposed constructive FNN algorithm may now be summarized according to the following steps:

Step 1: Initialization of the network. Start the network training process with an OHL-FNN having one hidden unit. Set n = 1, e^j = d^j, and the activation function to h_0(·).

Step 2: Input-side training for the n-th hidden unit. Train only the input-side weights associated with the n-th hidden unit h_{n-1}(·). The input-side weights of the existing hidden units, if any, are all frozen. One candidate for the n-th hidden unit, initialized with random weights, is trained using the objective function defined in (13), based on the quickprop algorithm [1]. If instead a pool of candidates is trained for the n-th unit, then the candidate that yields the maximum objective function is chosen as the n-th hidden unit to be added to the active network.

Step 3: Output-side training. Train the output-side weights of all the hidden units in the present network. When the output layer has a linear activation function, the output-side training may be performed by, for example, invoking the pseudo-inverse operation resulting from the least squares optimization criterion, as indicated earlier. When the activation function of the output layer is nonlinear, the quasi-Newton algorithm is used for training.

Step 4: Network performance evaluation and training control. Evaluate the network performance by monitoring the fraction of variance unexplained (FVU) [7] on the training data set. The FVU measure is defined according to

FVU = \frac{ \sum_{j=1}^{P} \big( \hat{g}(x^j) - g(x^j) \big)^2 }{ \sum_{j=1}^{P} \big( g(x^j) - \bar{g} \big)^2 }   (21)

where g(·) is the function being implemented by the FNN, \hat{g}(·) is an estimate of g(·), i.e., the output of the trained network, and \bar{g} is the mean value of g(·). The network training is terminated provided that certain stopping conditions are satisfied. These conditions may be specified formally, for example, in terms of a prespecified FVU threshold or a maximum number of permissible hidden units, among others. If the stopping conditions are not satisfied, then the network output y^j and the network output error e^j = d^j - y^j are computed, n is increased by 1 (i.e., n = n + 1), and we proceed to Step 2.
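The four steps above can be pieced together as in the following sketch. It is our own simplified reading of the procedure, not the authors' code: plain gradient ascent with a fixed learning rate replaces quickprop [1] for the input-side phase, np.linalg.lstsq plays the role of the pseudo-inverse for the output-side phase, and a single randomly initialized candidate is trained per unit.

```python
import numpy as np
from numpy.polynomial.hermite import hermval
from math import factorial, pi

def h(n, x):   # orthonormal Hermite activation, eqs. (5)-(7)
    a = factorial(n) ** -0.5 * pi ** 0.25 * 2.0 ** (-(n - 1) / 2.0)
    return a * hermval(x, [0.0] * n + [1.0]) * np.exp(-x * x / 2.0) / np.sqrt(2.0 * pi)

def dh(n, x):  # derivative of h_n, eqs. (8)-(9)
    lead = np.sqrt(2.0 * n) * h(n - 1, x) if n > 0 else 0.0
    return lead - x * h(n, x)

def fvu(y_hat, d):  # fraction of variance unexplained, eq. (21)
    return np.sum((y_hat - d) ** 2) / np.sum((d - d.mean()) ** 2)

def train_constructive_ohl(X, d, max_units=20, fvu_target=1e-2,
                           epochs=300, lr=1e-2, seed=0):
    rng = np.random.default_rng(seed)
    P = X.shape[0]
    W, e, v = [], d.copy(), None          # Step 1: start from the null network, e^j = d^j
    for n in range(1, max_units + 1):
        # Step 2: input-side training of the n-th candidate (activation h_{n-1})
        w = rng.normal(scale=0.1, size=X.shape[1])
        for _ in range(epochs):
            s = X @ w
            fs = h(n - 1, s)
            ec, fc = e - e.mean(), fs - fs.mean()
            grad = np.sign(ec @ fc) * (X.T @ (ec * dh(n - 1, s)))   # eq. (18)
            w += lr * grad / P
        W.append(w)
        # Step 3: output-side LS training of all units plus the output bias, eq. (19)
        F = np.column_stack([np.ones(P)] + [h(i, X @ W[i]) for i in range(len(W))])
        v, *_ = np.linalg.lstsq(F, d, rcond=None)
        y = F @ v
        e = d - y                                                   # eq. (20)
        # Step 4: stop once the training FVU falls below the prescribed threshold
        if fvu(y, d) < fvu_target:
            break
    return W, v
```

Here X is assumed to carry the bias component x_0 = 1 as its first column and d is the vector of targets d^j; the FVU threshold, learning rate, and epoch count are illustrative values only.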

It is well known in the constructive OHL-FNN literature [7] that the determination of a proper initial network size is a problem that needs careful attention. This initialization may significantly influence the effectiveness and efficiency of the network training that follows. In our proposed scheme, however, the network training starts from the smallest possible architecture (that is, the null network). This is an important advantage of our proposed algorithm over similar methods previously presented in the literature.

4 Proposed Multi-Hidden-Layer FNN Strategy

Construction and design of multi-hidden-layer networks are generally more difficult than those of one-hidden-layer (OHL) networks, since one has to determine not only the depth of the network in terms of the number of hidden layers, but also the number of neurons in each layer. In this section, a new strategy for constructing a multi-hidden-layer FNN with regular connections is proposed. The new constructive algorithm has the following features and advantages:

(1) As with OHL constructive algorithms, the number of hidden units in a given hidden layer is determined automatically by the algorithm.

(2) In addition, the number of hidden layers is also determined automatically by the algorithm.

(3) The algorithm adds hidden units and hidden layers one at a time, when and if they are needed. The input-side weights of all the hidden units in an installed hidden layer are fixed (frozen), whereas the input-side weights of a new hidden unit that is being added to the network are updated during the input-side training. A constructive OHL network may, in general, need to grow a large number of hidden units, or may even fail to represent a map that actually requires more than a single hidden layer. Therefore, it is expected that our proposed algorithm, which incrementally expands its hidden layers and units automatically as a function of the complexity of the problem, will converge very fast and be computationally more efficient when compared to other constructive OHL networks.

(4) Incentives are also introduced in the proposed algorithm to control and monitor the hidden layer creation process so that a network with fewer layers is constructed. This encourages the algorithm to solve problems using smaller networks, reducing the costs of network training and implementation.

4.1 Construction, Design and Training of the Proposed Algorithm

For the sake of simplicity, and without any loss of generality, suppose one desires to construct a FNN to solve a mapping or regression problem having a multi-dimensional input and a single scalar output. Extension to the case of a multi-dimensional output space is straightforward. It is also assumed that the output unit has a linear activation function for dealing with regression problems, although a nonlinear activation function has to be used for pattern recognition problems.

Fig. 2. Multi-hidden-layer network construction process for the proposed strategy using Hermite polynomials as the nonlinear activation functions of the neurons: (a) initial linear network; (b) creation of the first hidden layer; (c) addition of hidden units to the first hidden layer; (d) creation of the second hidden layer and its units

Specifically, in the pattern recognition case the output-side training of the weights would have to be performed by utilizing the delta rule, instead of the LS solution that is used here when the activation function is selected to be linear. The proposed constructive algorithm consists of the following steps:

Step 1: Initially train the network with a single output node and no hidden layers (refer to Figure 2(a)). The training of the weights, including the bias of the output node, is achieved according to an LS solution or some gradient-based algorithm. The training error FVU (as described below) is monitored; if it is not smaller than some prespecified threshold, one then proceeds to Step 2a, otherwise one goes to Step 2c. The performance of a network is measured by the fraction of variance unexplained (FVU) [7], which is defined as

FVU = \frac{ \sum_{j=1}^{P} \big( \hat{g}(x^j) - g(x^j) \big)^2 }{ \sum_{j=1}^{P} \big( g(x^j) - \bar{g} \big)^2 }   (22)

where g(·) is the function to be implemented by the FNN, \hat{g}(·) is an estimate of g(·) realized by the network, and \bar{g} is the mean value of g(·). The FVU is equivalently a ratio of the error variance to the variance of the function being analyzed by the network. Generally, the larger the variance of the function, the more difficult it is to perform the regression analysis. Therefore, the FVU may be viewed as a measure normalized by the complexity of the function. Note that the FVU is proportional to the mean square error (MSE). Furthermore, the function under study is likely to be contaminated by an additive noise \epsilon. In this case the signal-to-noise ratio (SNR) is defined by

SNR = 10 \log_{10} \frac{ \sum_{j=1}^{P} \big( g(x^j) - \bar{g} \big)^2 }{ \sum_{j=1}^{P} \big( \epsilon^j - \bar{\epsilon} \big)^2 }   (23)

where \bar{\epsilon} is the mean value of the additive noise, which is usually assumed to be zero.

Step 2a: Treat the output node as a hidden unit with unit output-side weight and zero bias (refer to Figure 2(b)). The input-side weights of this hidden unit are fixed as determined in the previous training step. The output error of the new linear output node will be identical to that of its hidden unit. Proceed to Step 2b.

Step 2b: New hidden units are added to the current layer one at a time, just as in a constructive OHL-FNN (refer to Figure 2(c)). Each time a new hidden unit is added, the output error (FVU) is monitored to see whether it is smaller than a certain prescribed threshold value. If so, one proceeds to Step 2c. Otherwise, new hidden units are added until the output error FVU stabilizes above the prescribed threshold. Once the error has stabilized, Step 2a is invoked again and a new hidden layer is introduced to possibly reduce the output error further (refer to Figure 2(d)). Steps 2a and 2b are repeated as many times as necessary until the output error FVU falls below the prescribed threshold, following which one then proceeds to Step 2c. Input-side training of a new hidden unit's weights is performed by maximizing the correlation-based objective function, whereas the output-side training of the weights is performed through an LS-type algorithm.
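For reference, the FVU of (22), the SNR of (23), and the stop-ratio used below in (24) to decide when a layer has stabilized can be written as the following small helpers; this is a sketch with names of our choosing, not code from the paper.

```python
import numpy as np

def fvu(g_hat, g):
    """Fraction of variance unexplained, eq. (22)."""
    g_hat, g = np.asarray(g_hat, dtype=float), np.asarray(g, dtype=float)
    return np.sum((g_hat - g) ** 2) / np.sum((g - g.mean()) ** 2)

def snr_db(g, eps):
    """Signal-to-noise ratio of eq. (23); eps is a sample of the additive noise."""
    g, eps = np.asarray(g, dtype=float), np.asarray(eps, dtype=float)
    return 10.0 * np.log10(np.sum((g - g.mean()) ** 2) /
                           np.sum((eps - eps.mean()) ** 2))

def stop_ratio(fvu_prev, fvu_curr):
    """Relative FVU improvement (in percent) from adding one more hidden unit, eq. (24)."""
    return (fvu_prev - fvu_curr) / fvu_prev * 100.0

# A layer is treated as stabilized when stop_ratio stays below a chosen percentage
# for stop_num consecutive unit additions; both thresholds are set by trial and error.
```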

Step 2c: The constructive training is terminated.

For implementing the above algorithm some further specifications are needed. First, a metric called the stop-ratio is introduced to determine whether the training error FVU has stabilized as a function of the added hidden units, namely

stop\text{-}ratio = \frac{ FVU(n-1) - FVU(n) }{ FVU(n-1) } \times 100\%   (24)

where FVU(n) denotes the training error FVU when the n-th hidden unit is added to a hidden layer of the network. Note that, according to [6], the addition of a proper hidden unit to an existing network should reduce the training error. The issue that needs addressing here is the selection of the magnitude of the stop-ratio. At the early stages of the constructive training, this ratio may be large. However, as more hidden units are added to the network, the ratio will monotonically decrease. When it has become sufficiently small, say a few percent, the inclusion of further hidden units will contribute little to the reduction of the training error and, as a matter of fact, may actually start giving rise to an increase in the generalization error, as the unnecessary hidden units may force the network to begin to memorize the data. Towards this end, when the stop-ratio is determined to be smaller than a prespecified percentage successively for a given number of times (denoted by the metric stop-num), the algorithm treats the error FVU as being stabilized, and hence proceeds to the next step to explore the possibility of further reducing the training error by extending the network with a new layer. It should be noted that the stop-ratio and the stop-num are generally to be determined by trial and error, and one may utilize prior experience derived from simulation results to assign them.

5 Conclusions

In this paper, a new strategy for constructing a multi-hidden-layer FNN was presented. The proposed algorithm extends the constructive algorithm developed for adding hidden units to an OHL network by generating additional hidden layers. The network obtained by using our proposed algorithm has regular connections that facilitate hardware implementation. The constructed network adds new hidden units and layers incrementally, only when they are needed, based on monitoring the residual error that cannot be reduced any further by the existing network. We have also proposed a new type of constructive OHL-FNN that adaptively assigns appropriate orthonormal Hermite polynomials to its generated neurons. The network generally learns as effectively as, but generalizes much better than, conventional constructive OHL networks.

References

1. Fahlman, S.E.: An Empirical Study of Learning Speed in Back-Propagation Networks. Tech. Rep. CMU-CS, Carnegie Mellon University (1988)
2. Fahlman, S.E., Lebiere, C.: The Cascade-Correlation Learning Architecture. Tech. Rep. CMU-CS, Carnegie Mellon University (1991)
3. Hush, D.R., Horne, B.G.: Progress in Supervised Neural Networks. IEEE Signal Processing Magazine, 8-39 (January 1993)
4. Hwang, J.N., Lay, S.R., Maechler, M., Martin, R.D., Schimert, J.: Regression Modeling in Back-Propagation and Projection Pursuit Learning. IEEE Trans. on Neural Networks 5 (1994)
5. Korn, G.A., Korn, T.M.: Mathematical Handbook for Scientists and Engineers, 2nd edn. McGraw-Hill, New York (1968)
6. Kwok, T.Y., Yeung, D.Y.: Constructive Algorithms for Structure Learning in Feedforward Neural Networks for Regression Problems. IEEE Trans. on Neural Networks 8(3) (1997)
7. Kwok, T.Y., Yeung, D.Y.: Objective Functions for Training New Hidden Units in Constructive Neural Networks. IEEE Trans. on Neural Networks 8(5) (1997)
8. Lau, B., Chao, T.H.: Aided Target Recognition Processing of MUDSS Sonar Data. Proceedings of the SPIE 3392 (1998)
9. Linh, T.H., Osowski, S., Stodolski, M.: On-line Heart Beat Recognition Using Hermite Polynomials and Neuro-Fuzzy Network. In: IMTC/2002, Proceedings of the 19th IEEE Instrumentation and Measurement Technology Conference, Anchorage, AK, USA, vol. 1 (2002)
10. Pilato, G., Sorbello, F., Vassallo, G.: Using the Hermite Regression Algorithm to Improve the Generalization Capability of a Neural Network. In: Neural Nets WIRN Vietri, Proceedings of the 11th Italian Workshop on Neural Nets, Salerno, Italy (1999)
11. Rasiah, A.I., Togneri, R., Attikiouzel, Y.: Modeling 1-D Signals Using Hermite Basis Functions. IEE Proc.-Vis. Image Signal Process. 144(6) (1997)
