A MODIFIED HIGHER-ORDER FEED FORWARD NEURAL NETWORK WITH SMOOTHING REGULARIZATION


Kh.Sh. Mohamed, W. Wu, Y. Liu

Khidir Shaib Mohamed (corresponding author) and Wei Wu: Dalian University of Technology, School of Mathematical Sciences, Dalian 116024, China; khshm7@yahoo.com, wuweiw@dlut.edu.cn. Yan Liu: Dalian Polytechnic University, School of Information Science and Engineering, Dalian 116034, China; liuyan@dlpu.edu.cn.

Abstract: This paper proposes an offline gradient method with smoothing $L_{1/2}$ regularization for learning and pruning of pi-sigma neural networks (PSNNs). The original $L_{1/2}$ regularization term is not smooth at the origin, since it involves the absolute value function. This causes oscillation in the computation and difficulty in the convergence analysis. In this paper, we propose to use a smooth function to replace and approximate the absolute value function, ending up with a smoothing $L_{1/2}$ regularization method for PSNN. Numerical simulations show that the smoothing $L_{1/2}$ regularization method eliminates the oscillation in computation and achieves better learning accuracy. We are also able to prove a convergence theorem for the proposed learning method.

Key words: convergence, pi-sigma neural network, offline (batch) gradient method, smoothing $L_{1/2}$ regularization

Received: July 3, 2016. Revised and accepted: November 13, 2017.

DOI: /NNW

©CTU FTS

1. Introduction

Pi-sigma neural networks (PSNNs) [4, 10-12], as a kind of higher-order neural networks, can provide more powerful mapping capability than conventional feedforward neural networks, and they have been successfully applied to many problems such as the equalization of nonlinear satellite channels, the real-time classification of seafloor sediments, and image coding. The PSNN was introduced by Ghosh and Shin [11]. It computes the product of sums of the input layer, instead of the sum of products, and the weights connecting the product layer and the summation layer are fixed to 1.

There are two practical ways to carry out the gradient weight updating in the network learning process: the online gradient approach, in which the weights are updated immediately after each training sample is supplied to the network, and the offline (batch) gradient approach, in which the weights are updated after all the training samples have been presented to the network (cf. [15, 16]). We shall use the offline gradient approach in this paper.

Various kinds of regularization terms have been used in many applications, such as the method in [2] in terms of weight decay [5, 6], the methods based on Akaike's information criterion (AIC) [3, 13], and the method employing a model prior in the Bayesian framework [8]. These regularization terms are inserted into the standard cost function for network learning to improve the learning ability and to obtain a sparse network. Many regularization terms take the form of the $L_p$ norm of the weights, leading to the following new error function

$E(W) = \bar{E}(W) + \lambda\|W\|_p^p,$   (1)

where $\bar{E}(W)$ is a usual error function depending on the weights $W$ of the network, $\|W\|_p = \big(\sum_{k}|w_k|^p\big)^{1/p}$ is the $p$-norm of the weights of the network, and $\lambda$ is the regularization parameter. In particular, the $L_0$ regularizer is the earliest regularization method applied for variable selection, and its solution is the most sparse; but it leads to a combinatorial optimization problem and is shown to be NP-hard [14]. The popular $L_1$ regularizer is discussed in [9]. The $L_{1/2}$ regularizer is suggested in [17] and, among many favourable properties, it is shown to result in better sparsity than the $L_1$ regularizer. A smoothing $L_{1/2}$ regularizer is proposed in [1, 7], and in numerical experiments it behaves as well as, or even better than, the $L_1$ regularizer and the standard $L_{1/2}$ regularizer for neural networks. Therefore, in this paper, we use the smoothing $L_{1/2}$ regularizer for the network regularization.

The organization of this paper is as follows. In Section 2, we describe the PSNN and the offline gradient method with the smoothing $L_{1/2}$ regularizer. In Section 3, a convergence theorem is given. Section 4 contains some supporting simulation results. The proof of the convergence theorem is given in Section 5. Finally, a brief conclusion is provided in Section 6.

2. Offline gradient method with smoothing $L_{1/2}$ regularization

In this section, we describe the network structure of the PSNN and the offline gradient method with smoothing $L_{1/2}$ regularization.

2.1 Error function with $L_{1/2}$ regularization

Let $p$, $n$ and $1$ respectively be the dimensions of the input layer, the summation layer and the product layer of the PSNN. Denote by $\omega_j = (\omega_{j1}, \ldots, \omega_{jp})^T \in \mathbb{R}^p$ $(1 \le j \le n)$ the weight vector connecting the input layer and the $j$-th summing unit, and write $\omega = (\omega_1^T, \ldots, \omega_n^T) \in \mathbb{R}^{np}$. Note that the weights from the summing units to the product unit are fixed to 1. Let $g : \mathbb{R} \to \mathbb{R}$ be a given activation function. The topological structure of the PSNN is shown in Fig. 1.

Fig. 1 Pi-sigma neural network structure.

For an input data vector $x \in \mathbb{R}^p$, the network output is

$y = g\Big(\prod_{j=1}^{n}\Big(\sum_{i=1}^{p}\omega_{ji}x_i\Big)\Big) = g\Big(\prod_{j=1}^{n}(\omega_j \cdot x)\Big),$   (2)

where $\omega_j \cdot x$ is the usual inner product of $\omega_j$ and $x$. Suppose that we are supplied with a set of training samples $\{x^l, O^l\}_{l=1}^{L} \subset \mathbb{R}^p \times \mathbb{R}$, where $O^l$ is the desired ideal output for the input $x^l$. By adding an $L_{1/2}$ regularization term to the usual error function, the final error function takes the form

$E(\omega) = \frac{1}{2}\sum_{l=1}^{L}\Big(O^l - g\Big(\prod_{j=1}^{n}(\omega_j \cdot x^l)\Big)\Big)^2 + \lambda\sum_{j=1}^{n}\sum_{i=1}^{p}|\omega_{ji}|^{1/2}.$   (3)

The partial derivative of the above error function with respect to $\omega_{ji}$ is

$E_{\omega_{ji}}(\omega) = -\sum_{l=1}^{L}\delta_l\Big(\prod_{k=1}^{n}(\omega_k \cdot x^l)\Big)\prod_{k \neq j}(\omega_k \cdot x^l)\,x_i^l + \frac{\lambda\,\mathrm{sgn}(\omega_{ji})}{2|\omega_{ji}|^{1/2}},$   (4)

$i = 1, 2, \ldots, p$; $j = 1, 2, \ldots, n$; where $\lambda > 0$ is the regularization parameter, $E_{\omega_{ji}}(\omega) = \partial E(\omega)/\partial\omega_{ji}$, and $\delta_l(t) = (O^l - g(t))\,g'(t)$. Starting from an arbitrary initial value $\omega^0$, the offline gradient method with $L_{1/2}$ regularization term updates the weights $\omega^m$ iteratively by

$\omega_{ji}^{m+1} = \omega_{ji}^m + \Delta\omega_{ji}^m,$   (5)

with

$\Delta\omega_{ji}^m = -\eta E_{\omega_{ji}}(\omega^m) = \eta\sum_{l=1}^{L}\delta_l\Big(\prod_{k=1}^{n}(\omega_k^m \cdot x^l)\Big)\prod_{k \neq j}(\omega_k^m \cdot x^l)\,x_i^l - \frac{\eta\lambda\,\mathrm{sgn}(\omega_{ji}^m)}{2|\omega_{ji}^m|^{1/2}},$   (6)

$i = 1, 2, \ldots, p$; $j = 1, 2, \ldots, n$; $m = 0, 1, \ldots$; where $\eta > 0$ is the learning rate.
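To make the update rule concrete, the following NumPy sketch implements the forward pass of Eq. (2) and one offline update of Eqs. (5)-(6) with the plain, non-smooth $L_{1/2}$ term. It is an illustration only, not the authors' code; the function names, the sigmoid activation and the small epsilon guard against division by zero are our own choices.

```python
# A minimal sketch of the PSNN forward pass, Eq. (2), and one batch update of
# Eqs. (5)-(6) with the non-smooth L_{1/2} penalty. Names and eps are illustrative.
import numpy as np

def psnn_output(W, x, g=lambda t: 1.0 / (1.0 + np.exp(-t))):
    """PSNN output y = g(prod_j (w_j . x)); W has shape (n, p), x has shape (p,)."""
    return g(np.prod(W @ x))

def l12_batch_step(W, X, O, eta, lam, eps=1e-12):
    """One offline (batch) update of Eqs. (5)-(6) with the plain L_{1/2} term."""
    g = lambda t: 1.0 / (1.0 + np.exp(-t))
    dg = lambda t: g(t) * (1.0 - g(t))
    grad = np.zeros_like(W)
    for x, o in zip(X, O):                      # accumulate over all L samples
        s = W @ x                               # summing-unit outputs, shape (n,)
        prod = np.prod(s)
        delta = (o - g(prod)) * dg(prod)        # delta_l(t) = (O^l - g(t)) g'(t)
        for j in range(W.shape[0]):
            grad[j] -= delta * np.prod(np.delete(s, j)) * x
    # derivative of lam * sum |w|^{1/2} is lam * sgn(w)/(2 sqrt|w|); eps guards w = 0
    grad += lam * np.sign(W) / (2.0 * np.sqrt(np.abs(W)) + eps)
    return W - eta * grad

# Tiny usage example with random data (L = 4 samples, p = 3 inputs, n = 2 summing units)
rng = np.random.default_rng(0)
W = rng.uniform(-0.5, 0.5, size=(2, 3))
X, O = rng.normal(size=(4, 3)), rng.normal(size=4)
W = l12_batch_step(W, X, O, eta=0.1, lam=0.01)
```

Note that the factor $\mathrm{sgn}(\omega_{ji})/(2|\omega_{ji}|^{1/2})$ becomes arbitrarily large as a weight approaches zero; this is exactly the source of the oscillation that the smoothing introduced in the next subsection removes.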

2.2 Error function with smoothing $L_{1/2}$ regularization

The usual $L_{1/2}$ regularization term is non-differentiable at the origin, since it involves the absolute value function. We replace the absolute value function by a smooth function $f(t)$ defined below:

$f(t) = \begin{cases} -t, & t \le -a, \\ -\dfrac{1}{8a^3}t^4 + \dfrac{3}{4a}t^2 + \dfrac{3}{8}a, & -a < t < a, \\ t, & t \ge a, \end{cases}$   (7)

where $a$ is a small positive constant. Then we have

$f'(t) = \begin{cases} -1, & t \le -a, \\ -\dfrac{1}{2a^3}t^3 + \dfrac{3}{2a}t, & -a < t < a, \\ 1, & t \ge a, \end{cases} \qquad f''(t) = \begin{cases} 0, & |t| \ge a, \\ -\dfrac{3}{2a^3}t^2 + \dfrac{3}{2a}, & |t| < a, \end{cases}$

and $f(t) \in \big[\tfrac{3}{8}a, \infty\big)$, $f'(t) \in [-1, 1]$, $f''(t) \in \big[0, \tfrac{3}{2a}\big]$. Now, the new error function with the smoothing $L_{1/2}$ regularization term is

$E(\omega) = \frac{1}{2}\sum_{l=1}^{L}\Big(O^l - g\Big(\prod_{j=1}^{n}(\omega_j \cdot x^l)\Big)\Big)^2 + \lambda\sum_{j=1}^{n}\sum_{i=1}^{p} f^{1/2}(\omega_{ji}).$   (8)

The partial derivative of the error function $E(\omega)$ in Eq. (8) with respect to $\omega_{ji}$ is

$E_{\omega_{ji}}(\omega) = -\sum_{l=1}^{L}\delta_l\Big(\prod_{k=1}^{n}(\omega_k \cdot x^l)\Big)\prod_{k \neq j}(\omega_k \cdot x^l)\,x_i^l + \frac{\lambda f'(\omega_{ji})}{2 f^{1/2}(\omega_{ji})},$   (9)

$i = 1, 2, \ldots, p$; $j = 1, 2, \ldots, n$. Starting from an arbitrary initial value $\omega^0$, the offline gradient method with smoothing $L_{1/2}$ regularization term updates the weights $\omega^m$ iteratively by

$\omega_{ji}^{m+1} = \omega_{ji}^m + \Delta\omega_{ji}^m,$   (10)

with

$\Delta\omega_{ji}^m = -\eta E_{\omega_{ji}}(\omega^m) = \eta\sum_{l=1}^{L}\delta_l\Big(\prod_{k=1}^{n}(\omega_k^m \cdot x^l)\Big)\prod_{k \neq j}(\omega_k^m \cdot x^l)\,x_i^l - \frac{\eta\lambda f'(\omega_{ji}^m)}{2 f^{1/2}(\omega_{ji}^m)},$   (11)

$i = 1, 2, \ldots, p$; $j = 1, 2, \ldots, n$; $m = 0, 1, \ldots$; where again $\eta > 0$ is the learning rate.
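The sketch below (ours, written under the reconstruction of Eq. (7) given above) implements $f$, $f'$ and the resulting smoothed penalty gradient that appears in Eqs. (9) and (11); the value $a = 0.05$ is only an illustrative choice.

```python
# A sketch (not the authors' code) of the piecewise smoothing function of Eq. (7),
# its derivative, and the smoothed penalty gradient lam * f'(w) / (2 f(w)^{1/2}).
import numpy as np

def f_smooth(t, a=0.05):
    """f(t) = |t| for |t| >= a, and a quartic polynomial on (-a, a); f >= 3a/8 > 0."""
    t = np.asarray(t, dtype=float)
    inner = -t**4 / (8.0 * a**3) + 3.0 * t**2 / (4.0 * a) + 3.0 * a / 8.0
    return np.where(np.abs(t) >= a, np.abs(t), inner)

def f_smooth_prime(t, a=0.05):
    """f'(t) = sign(t) outside [-a, a], a cubic inside; takes values in [-1, 1]."""
    t = np.asarray(t, dtype=float)
    inner = -t**3 / (2.0 * a**3) + 3.0 * t / (2.0 * a)
    return np.where(np.abs(t) >= a, np.sign(t), inner)

def smoothed_penalty_grad(W, lam, a=0.05):
    """Gradient of lam * sum f(w)^{1/2}, i.e. the last term of Eq. (9)."""
    return lam * f_smooth_prime(W, a) / (2.0 * np.sqrt(f_smooth(W, a)))

# Example: the gradient is finite even at w = 0, unlike sgn(w)/(2 sqrt|w|).
print(smoothed_penalty_grad(np.array([0.0, 1e-3, 0.2]), lam=0.01))
```

Because $f(t) \ge 3a/8 > 0$ and $|f'(t)| \le 1$, the smoothed penalty gradient is bounded in magnitude by $\lambda/(2\sqrt{3a/8})$, so the blow-up of the original $L_{1/2}$ term near zero weights cannot occur.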

3. Main results

The following conditions will be used later on to prove the convergence theorem.

Proposition 1. $|\delta_l(t)| \le C_0$, $|\delta_l'(t)| \le C_0$, and $|g(t)|, |g'(t)|, |g''(t)| \le C_1$ for all $t \in \mathbb{R}$, $1 \le l \le L$.

Proposition 2. $\max\{\|x^l\|, |\omega_j^m \cdot x^l|\} \le C_2$, $1 \le j \le n$, $1 \le l \le L$, $m = 0, 1, 2, \ldots$

Proposition 3. The parameters $\eta$ and $\lambda$ are chosen to satisfy $0 < \eta < \dfrac{1}{M\lambda + C}$, where $M = \sup_{t\in\mathbb{R}}|F''(t)|$ with $F(t) = (f(t))^{1/2}$ (a finite constant depending only on $a$), and $C = C_3 + C_5/2$ with $C_3$ and $C_5$ the constants introduced in Section 5.

Proposition 4. There exists a compact set $\Phi$ such that $\omega^m \in \Phi$ for all $m$, and the set $\Phi_0 = \{\omega \in \Phi : E_\omega(\omega) = 0\}$ contains only finitely many points.

Theorem 5. Let the error function $E(\omega)$ be defined by Eq. (8), and let the weight sequence $\{\omega^m\}$ be generated by the iteration algorithm Eq. (10) for an arbitrary initial value $\omega^0$. If Propositions 1-3 are valid, we have the following error estimates:

(i) $E(\omega^{m+1}) \le E(\omega^m)$, $m = 0, 1, 2, \ldots$;

(ii) there exists $E^* \ge 0$ such that $\lim_{m\to\infty} E(\omega^m) = E^*$;

(iii) $\lim_{m\to\infty} E_{\omega_{ji}}(\omega^m) = 0$, $i = 1, 2, \ldots, p$; $j = 1, 2, \ldots, n$.

Furthermore, if Proposition 4 is also valid, we have the following strong convergence:

(iv) there exists a point $\omega^* \in \Phi$ such that $\lim_{m\to\infty} \omega^m = \omega^*$.

4. Numerical simulations

To illustrate the efficiency of the proposed learning method, we consider numerical simulations of two classification problems (the XOR and parity problems) and two function approximation problems (the Gabor and Mayas functions). We compare our batch gradient method with smoothing $L_{1/2}$ regularization (BGSL1/2) with the batch gradient method with $L_{1/2}$ regularization (BGL1/2) and the batch gradient method with $L_2$ regularization (BGL2).

4.1 Classification problems

The activation function is chosen as $g(x) = 1/(1 + e^{-x})$. Each of the three algorithms is run over repeated trials for the XOR and parity problems, respectively, and typical performances are shown in Figs. 2-5. For the XOR problem the network has 2 input nodes, 2 summation nodes, and 1 output node, with the learning rate $\eta = 0.6$ and the regularization parameter $\lambda = 0.1$. For the parity problem, the network has 4 input nodes, 5 summation nodes, and 1 output node, with the learning rate $\eta = 0.4$ and the regularization parameter $\lambda = 0.1$. In both cases, the initial weights are selected randomly within the interval $[-0.5, 0.5]$, and training is stopped after a fixed maximum number of iterations.

From Figs. 2-5, we see that the proposed learning method BGSL1/2 enjoys the best learning accuracy, while BGL2 is the worst. In particular, as predicted by Theorem 5, the error of BGSL1/2 decreases monotonically and the gradient of the error function goes to zero in the learning process.
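An end-to-end illustration of such a run on the XOR problem is sketched below. It is not the paper's experiment code: the bias input, the penalty weight, the smoothing parameter $a$ and the iteration count are assumptions made only to obtain a runnable example; the network size ($2$ inputs, $2$ summing units, $1$ output) follows the text.

```python
# Illustrative batch training of a PSNN on XOR with the smoothing L_{1/2} penalty.
# Hyperparameters, the bias input and the iteration count are assumptions.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0., 0., 1.], [0., 1., 1.], [1., 0., 1.], [1., 1., 1.]])  # inputs + bias
O = np.array([0., 1., 1., 0.])                                          # XOR targets
n, p = 2, 3
W = rng.uniform(-0.5, 0.5, size=(n, p))          # initial weights in [-0.5, 0.5]
eta, lam, a = 0.6, 0.01, 0.05                    # illustrative hyperparameters

g = lambda t: 1.0 / (1.0 + np.exp(-t))
dg = lambda t: g(t) * (1.0 - g(t))
f = lambda t: np.where(np.abs(t) >= a, np.abs(t),
                       -t**4 / (8 * a**3) + 3 * t**2 / (4 * a) + 3 * a / 8)
fp = lambda t: np.where(np.abs(t) >= a, np.sign(t),
                        -t**3 / (2 * a**3) + 3 * t / (2 * a))

for m in range(3000):                            # offline (batch) iterations
    grad = lam * fp(W) / (2.0 * np.sqrt(f(W)))   # smoothed penalty part of Eq. (11)
    for x, o in zip(X, O):
        s = W @ x                                # summing-unit outputs
        prod = np.prod(s)
        delta = (o - g(prod)) * dg(prod)         # delta_l of Eq. (4)
        for j in range(n):
            grad[j] -= delta * np.prod(np.delete(s, j)) * x
    W -= eta * grad                              # Eq. (10): w <- w - eta * E_w(w)

E = 0.5 * sum((o - g(np.prod(W @ x)))**2 for x, o in zip(X, O)) \
    + lam * np.sqrt(f(W)).sum()                  # smoothed error of Eq. (8)
print("final error E(w):", float(E))
```

Logging $E(\omega^m)$ inside the loop should show a monotone decrease whenever $\eta$ satisfies Proposition 3; with too large a learning rate the monotonicity can fail, which is the behaviour the theorem rules out.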

Fig. 2 Learning errors of different algorithms for the XOR problem.

Fig. 3 Norms of gradient of different algorithms for the XOR problem.

Fig. 4 Learning errors of different algorithms for the parity problem.

Fig. 5 Norms of gradient of different algorithms for the parity problem.

Tabs. I and II show the average training errors and the average norms of the gradient over the trials for each learning algorithm. The Average Number of neurons Eliminated (ANE, in brief) by the pruning over the trials is also shown in Tabs. I and II. The comparison convincingly shows that BGSL1/2 is more efficient and has a better sparsity-promoting property than BGL1/2 and BGL2.

Algorithm    Average training error    Norm of gradient    ANE
BGSL1/2      1.336e-5                  1.3
BGL1/2       e-4                       9.
BGL2         1.776e-3                  7.8

Tab. I Numerical results for solving the XOR problem.

Algorithm    Average training error    Norm of gradient    ANE
BGSL1/2      5.98e                     5.59
BGL1/2       1.169e-1                  3.8
BGL2         .4495e-1                  5.6

Tab. II Numerical results for solving the parity problem.

4.2 Approximation of the Gabor function

Now, we try to approximate the following Gabor function (cf. Fig. 6):

$F(x, y) = \frac{1}{2\pi(0.5)^2}\exp\Big(-\frac{x^2 + y^2}{2(0.5)^2}\Big)\cos\big(2\pi(x + y)\big).$
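A small sketch of this target function and of the training/test grids described in the next paragraph is given below; the $16 \times 16$ shape of the test grid is an assumption (the text only gives 256 test points on the same square), and the helper names are ours.

```python
# Sketch of the Gabor target of Section 4.2 and the sampling grids on [-0.5, 0.5]^2.
import numpy as np

def gabor(x, y, s=0.5):
    """F(x, y) = 1/(2*pi*s^2) * exp(-(x^2 + y^2)/(2*s^2)) * cos(2*pi*(x + y))."""
    return (1.0 / (2.0 * np.pi * s**2)
            * np.exp(-(x**2 + y**2) / (2.0 * s**2))
            * np.cos(2.0 * np.pi * (x + y)))

def grid_samples(k):
    """k x k evenly spaced samples of the Gabor function on [-0.5, 0.5]^2."""
    ticks = np.linspace(-0.5, 0.5, k)
    xx, yy = np.meshgrid(ticks, ticks)
    X = np.column_stack([xx.ravel(), yy.ravel()])
    return X, gabor(X[:, 0], X[:, 1])

X_train, F_train = grid_samples(6)     # 36 training samples
X_test,  F_test  = grid_samples(16)    # 256 test samples (assumed 16 x 16 grid)
print(X_train.shape, X_test.shape)     # (36, 2) (256, 2)
```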

Fig. 6 Gabor function.

36 input training samples are selected from an evenly spaced 6 x 6 grid on $-0.5 \le x \le 0.5$ and $-0.5 \le y \le 0.5$. Similarly, the input test samples are 256 points selected from an evenly spaced grid on $-0.5 \le x \le 0.5$ and $-0.5 \le y \le 0.5$. The neural network has 3 input nodes, 6 summation nodes, and 1 output node. The initial weights are randomly chosen from $[-0.5, 0.5]$. The learning rate is $\eta = 0.6$, the regularization parameter is $\lambda = 0.1$, and training is stopped after a fixed maximum number of iterations. Typical network approximations obtained by the three algorithms are plotted in Figs. 7-9. We observe that our method BGSL1/2 gives the best approximation of the Gabor function. Tab. III presents the average training errors, the average test errors, and the ANE over the trials for the three learning algorithms. Again, we see that our method BGSL1/2 has the best accuracy and the best sparsity-promoting property.

Tab. III Numerical results for approximating the Gabor function (average training error, average test error and ANE for BGSL1/2, BGL1/2 and BGL2).

4.3 Approximation of the Mayas function

This example is to approximate the following Mayas function:

$F(x, y) = 0.26(x^2 + y^2) - 0.48xy.$

64 input training samples are selected from an evenly spaced 8 x 8 grid on $-0.5 \le x \le 0.5$ and $-0.5 \le y \le 0.5$. Similarly, the input test samples are 900 points selected from the 30 x 30 grid on $-0.5 \le x \le 0.5$ and $-0.5 \le y \le 0.5$. The neural network has 3 input nodes, 5 summation nodes, and 1 output node. The initial weights are randomly chosen from $[-0.5, 0.5]$. The learning rate is $\eta = 0.9$, the regularization parameter is $\lambda = 0.1$, and training is stopped after a fixed maximum number of iterations.
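The Mayas target and its training grid can be generated in the same way as for the Gabor example; the coefficients 0.26 and 0.48 follow the reconstruction of the formula above and should be treated with the same caution, and the helper names are ours.

```python
# Sketch of the Mayas target of Section 4.3 (coefficients as reconstructed above)
# sampled on the 8 x 8 training grid over [-0.5, 0.5]^2 described in the text.
import numpy as np

def mayas(x, y):
    return 0.26 * (x**2 + y**2) - 0.48 * x * y

ticks = np.linspace(-0.5, 0.5, 8)
xx, yy = np.meshgrid(ticks, ticks)
X_train = np.column_stack([xx.ravel(), yy.ravel()])   # 64 training inputs
F_train = mayas(X_train[:, 0], X_train[:, 1])
print(X_train.shape, F_train.shape)                   # (64, 2) (64,)
```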

Fig. 7 Gabor function approximation performance by BGSL1/2.

Fig. 8 Gabor function approximation performance by BGL1/2.

Fig. 9 Gabor function approximation performance by BGL2.

Tab. IV shows that, among the three algorithms, BGSL1/2 exhibits the best approximation accuracy and the best sparsity for the Mayas function.

Tab. IV Numerical results for approximating the Mayas function (average training error, average test error and ANE for BGSL1/2, BGL1/2 and BGL2).

5. Proofs

In this section, the convergence theorem for the smoothing $L_{1/2}$ regularization algorithm is proved. First, we give two lemmas.

Lemma 6. Suppose that Propositions 1 and 2 are satisfied and that $\{\omega^m\}$ is generated by Eq. (10). Then, for $1 \le l \le L$ and $m = 0, 1, 2, \ldots$,

$\Big|\prod_{j=1}^{n}(\omega_j^{m+1}\cdot x^l) - \prod_{j=1}^{n}(\omega_j^m\cdot x^l)\Big| \le C_2^n\sum_{j=1}^{n}\|\Delta\omega_j^m\|,$   (12)

and

$-\sum_{l=1}^{L}\delta_l\Big(\prod_{j=1}^{n}(\omega_j^m\cdot x^l)\Big)\Big[\prod_{j=1}^{n}(\omega_j^{m+1}\cdot x^l) - \prod_{j=1}^{n}(\omega_j^m\cdot x^l)\Big] \le \Big(-\frac{1}{\eta} + C_3\Big)\sum_{j=1}^{n}\sum_{i=1}^{p}(\Delta\omega_{ji}^m)^2 - \lambda\sum_{j=1}^{n}\sum_{i=1}^{p}\frac{f'(\omega_{ji}^m)}{2f^{1/2}(\omega_{ji}^m)}\Delta\omega_{ji}^m,$   (13)

where $C_3 > 0$ is a constant independent of $m$.

Proof. Expanding the product to first and second order around $\omega^m$, we have

$\prod_{j=1}^{n}(\omega_j^{m+1}\cdot x^l) - \prod_{j=1}^{n}(\omega_j^m\cdot x^l) = \sum_{j=1}^{n}\prod_{k\neq j}(t_{j,k}^l)\,(\Delta\omega_j^m\cdot x^l)$   (14)

$= \sum_{j=1}^{n}\prod_{k\neq j}(\omega_k^m\cdot x^l)\,(\Delta\omega_j^m\cdot x^l) + \sigma^l,$   (15)

where each $t_{j,k}^l$ lies between $\omega_k^{m+1}\cdot x^l$ and $\omega_k^m\cdot x^l$, and $\sigma^l$ collects the resulting second-order terms, built from products $(\Delta\omega_j^m\cdot x^l)(\Delta\omega_s^m\cdot x^l)$ with $s \neq j$ and intermediate factors of the same kind as $t_{j,k}^l$. By Proposition 2 we see that

$|t_{j,k}^l| \le C_2, \qquad 1 \le j, k \le n,\ 1 \le l \le L.$   (16)

Thus, Eq. (12) can be easily obtained from Eq. (14), Eq. (16) and $|\Delta\omega_j^m\cdot x^l| \le C_2\|\Delta\omega_j^m\|$. By Eq. (16) we also have

$|\sigma^l| \le C_4\sum_{j=1}^{n}\sum_{i=1}^{p}(\Delta\omega_{ji}^m)^2, \qquad C_4 := nC_2^n.$   (17)

A combination of Proposition 1, Eq. (11), Eq. (15) and Eq. (17) leads to Eq. (13). Indeed, by Eq. (11),

$\sum_{l=1}^{L}\delta_l\Big(\prod_{k=1}^{n}(\omega_k^m\cdot x^l)\Big)\prod_{k\neq j}(\omega_k^m\cdot x^l)\,x_i^l = \frac{1}{\eta}\Delta\omega_{ji}^m + \frac{\lambda f'(\omega_{ji}^m)}{2f^{1/2}(\omega_{ji}^m)},$

so that, by Eq. (15) and Eq. (17),

$-\sum_{l=1}^{L}\delta_l\Big(\prod_{j=1}^{n}(\omega_j^m\cdot x^l)\Big)\Big[\prod_{j=1}^{n}(\omega_j^{m+1}\cdot x^l) - \prod_{j=1}^{n}(\omega_j^m\cdot x^l)\Big] = -\frac{1}{\eta}\sum_{j=1}^{n}\sum_{i=1}^{p}(\Delta\omega_{ji}^m)^2 - \lambda\sum_{j=1}^{n}\sum_{i=1}^{p}\frac{f'(\omega_{ji}^m)}{2f^{1/2}(\omega_{ji}^m)}\Delta\omega_{ji}^m - \sum_{l=1}^{L}\delta_l\Big(\prod_{j=1}^{n}(\omega_j^m\cdot x^l)\Big)\sigma^l \le \Big(-\frac{1}{\eta} + C_3\Big)\sum_{j=1}^{n}\sum_{i=1}^{p}(\Delta\omega_{ji}^m)^2 - \lambda\sum_{j=1}^{n}\sum_{i=1}^{p}\frac{f'(\omega_{ji}^m)}{2f^{1/2}(\omega_{ji}^m)}\Delta\omega_{ji}^m.$   (18)

Here $C_3 = LC_0C_4$. This completes the proof.

Lemma 7. Suppose that $H : \mathbb{R}^u \to \mathbb{R}$ is continuous and differentiable on a compact set $\check{D} \subset \mathbb{R}^u$ and that $\Omega = \{Z \in \check{D} : H_Z(Z) = 0\}$ contains only a finite number of points. If a sequence $\{Z^m\}_{m=1}^{\infty} \subset \check{D}$ satisfies $\lim_{m\to\infty}\|Z^{m+1} - Z^m\| = 0$ and $\lim_{m\to\infty}\|H_Z(Z^m)\| = 0$, then there exists a point $Z^* \in \Omega$ such that $\lim_{m\to\infty}Z^m = Z^*$.

Proof. Lemma 7 is almost the same as a theorem in [18], and the details of the proof are omitted.

The proof of Theorem 5 is divided into four parts, dealing with statements (i), (ii), (iii) and (iv), respectively.

Proof. For convenience, we use the notation

$\rho_m = \sum_{j=1}^{n}\sum_{i=1}^{p}(\Delta\omega_{ji}^m)^2.$   (19)

It follows from the error function $E(\omega)$ in Eq. (8) that

$E(\omega^{m+1}) = \frac{1}{2}\sum_{l=1}^{L}\Big(O^l - g\Big(\prod_{j=1}^{n}(\omega_j^{m+1}\cdot x^l)\Big)\Big)^2 + \lambda\sum_{j=1}^{n}\sum_{i=1}^{p}f^{1/2}(\omega_{ji}^{m+1})$   (20)

and

$E(\omega^m) = \frac{1}{2}\sum_{l=1}^{L}\Big(O^l - g\Big(\prod_{j=1}^{n}(\omega_j^m\cdot x^l)\Big)\Big)^2 + \lambda\sum_{j=1}^{n}\sum_{i=1}^{p}f^{1/2}(\omega_{ji}^m).$   (21)

Proof of (i) of Theorem 5. Write $\psi_l^m = \prod_{j=1}^{n}(\omega_j^m\cdot x^l)$ for brevity. By Eq. (20), Eq. (21) and a Taylor expansion of $\frac{1}{2}(O^l - g(t))^2$ at $t = \psi_l^m$, we have

$E(\omega^{m+1}) - E(\omega^m) = -\sum_{l=1}^{L}\delta_l(\psi_l^m)\big(\psi_l^{m+1} - \psi_l^m\big) - \frac{1}{2}\sum_{l=1}^{L}\delta_l'(t_l)\big(\psi_l^{m+1} - \psi_l^m\big)^2 + \lambda\sum_{j=1}^{n}\sum_{i=1}^{p}\big[f^{1/2}(\omega_{ji}^{m+1}) - f^{1/2}(\omega_{ji}^m)\big],$   (22)

where $t_l$ lies between $\psi_l^{m+1}$ and $\psi_l^m$. Set $F(t) = (f(t))^{1/2}$. Then

$F'(t) = \frac{f'(t)}{2\sqrt{f(t)}}, \qquad F''(t) = \frac{f''(t)}{2\sqrt{f(t)}} - \frac{(f'(t))^2}{4(f(t))^{3/2}},$   (23)

and, since $f(t) \ge \frac{3}{8}a$, $|f'(t)| \le 1$ and $0 \le f''(t) \le \frac{3}{2a}$,

$M = \sup_{t\in\mathbb{R}}|F''(t)| < \infty.$   (24)

By a second-order Taylor expansion of $F$ together with Eq. (23) and Eq. (24),

$\lambda\sum_{j=1}^{n}\sum_{i=1}^{p}\big[f^{1/2}(\omega_{ji}^{m+1}) - f^{1/2}(\omega_{ji}^m)\big] \le \lambda\sum_{j=1}^{n}\sum_{i=1}^{p}\frac{f'(\omega_{ji}^m)}{2f^{1/2}(\omega_{ji}^m)}\Delta\omega_{ji}^m + \frac{\lambda M}{2}\rho_m.$

By Proposition 1 and Eq. (12),

$-\frac{1}{2}\sum_{l=1}^{L}\delta_l'(t_l)\big(\psi_l^{m+1} - \psi_l^m\big)^2 \le \frac{C_0}{2}\sum_{l=1}^{L}\Big(C_2^n\sum_{j=1}^{n}\|\Delta\omega_j^m\|\Big)^2 \le \frac{C_5}{2}\rho_m, \qquad C_5 := C_0LnC_2^{2n}.$

Inserting these two estimates and Eq. (13) of Lemma 6 into Eq. (22), the first-order terms $\pm\lambda\sum_{j,i}\frac{f'(\omega_{ji}^m)}{2f^{1/2}(\omega_{ji}^m)}\Delta\omega_{ji}^m$ cancel, and we obtain

$E(\omega^{m+1}) - E(\omega^m) \le -\Big(\frac{1}{\eta} - C_3 - \frac{C_5}{2} - \frac{\lambda M}{2}\Big)\rho_m \le -\Big(\frac{1}{\eta} - \lambda M - C\Big)\rho_m \le 0,$   (25)

where $C = C_3 + C_5/2$ and the last inequality follows from Proposition 3. This completes the proof of statement (i) of Theorem 5.

Proof of (ii) of Theorem 5. From conclusion (i), we know that the non-negative sequence $\{E(\omega^m)\}$ decreases monotonically, and it is bounded below by zero. Hence, there must exist $E^* \ge 0$ such that $\lim_{m\to\infty}E(\omega^m) = E^*$. The proof of (ii) is thus completed.

Proof of (iii) of Theorem 5. Write $\beta = \frac{1}{\eta} - \lambda M - C$ and $\rho_q = \sum_{j=1}^{n}\sum_{i=1}^{p}(\Delta\omega_{ji}^q)^2$. It follows from Proposition 3 that $\beta > 0$. By using Eq. (25) we get

$E(\omega^{m+1}) \le E(\omega^m) - \beta\rho_m \le \cdots \le E(\omega^0) - \beta\sum_{q=0}^{m}\rho_q.$

Since $E(\omega^{m+1}) \ge 0$, we have

$\sum_{q=0}^{m}\rho_q \le \frac{1}{\beta}E(\omega^0) < \infty.$

Letting $m \to \infty$, we obtain $\sum_{q=0}^{\infty}\rho_q \le \frac{1}{\beta}E(\omega^0) < \infty$. Thus

$\lim_{m\to\infty}\rho_m = 0,$

which leads to

$\lim_{m\to\infty}\Delta\omega_{ji}^m = 0, \qquad i = 1, 2, \ldots, p;\ j = 1, 2, \ldots, n.$   (26)

Since $\Delta\omega_{ji}^m = -\eta E_{\omega_{ji}}(\omega^m)$ with fixed $\eta > 0$, this gives $\lim_{m\to\infty}E_{\omega_{ji}}(\omega^m) = 0$ and completes the proof.

Proof of (iv) of Theorem 5. Note that the error function $E(\omega)$ defined in Eq. (8) is continuous and differentiable. According to Eq. (26), Proposition 4 and Lemma 7, we can easily get the desired result, i.e., there exists a point $\omega^* \in \Phi$ such that $\lim_{m\to\infty}\omega^m = \omega^*$. This completes the proof of (iv).

6. Conclusion

Some weak and strong convergence results are established for an offline gradient method with smoothing $L_{1/2}$ regularization for PSNN training. Criteria for choosing an appropriate learning rate and penalty parameter are given to guarantee the convergence of the learning process. Numerical experiments show that the sparsity and the accuracy of BGSL1/2 are better than those of BGL1/2 and BGL2.

Acknowledgements

This work is supported by the National Natural Science Foundation of China, the Natural Science Foundation Guidance Project of Liaoning Province (No. 165) and the Project of Dalian Young Technology Stars.

References

[1] FAN Q., ZURADA J.M., WU W. Convergence of online gradient method for feedforward neural networks with smoothing L1/2 regularization penalty. Neurocomputing. 2014, 131, pp. 208-216.

[2] FRIEDMAN J.H. An overview of predictive learning and function approximation. In: From Statistics to Neural Networks. Springer, Berlin, Heidelberg, 1994, pp. 1-61.

[3] HINTON G. Connectionist learning procedures. Artificial Intelligence. 1989, 40, doi.org/10.1016/0004-3702(89)90049-0.

[4] HUSSAIN A.J., LIATSIS P. Recurrent pi-sigma networks for DPCM image coding. Neurocomputing. 2003, 55(1-2), doi.org/10.1016/S0925-2312(02)00629-X.

[5] KROGH A., HERTZ J., PALMER R.G. Introduction to the Theory of Neural Computation. Redwood City, CA: Addison-Wesley, 1991.

[6] LE CUN Y., DENKER J., SOLLA S., HOWARD R.E., JACKEL L.D. Optimal Brain Damage. In: Advances in Neural Information Processing Systems II. Morgan Kaufmann, San Mateo, CA, 1990.

[7] LIU Y., LI Z.X., YAN D.K., MOHAMED KH.SH., WANG J., WU W. Convergence of batch gradient learning algorithm with smoothing L1/2 regularization for Sigma-Pi-Sigma neural networks. Neurocomputing. 2015, 151.

[8] MACKAY D.J.C. Probable networks and plausible predictions: a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems. 1995, 6(3).

[9] MCLOONE S., IRWIN G. Improving neural network training solutions using regularisation. Neurocomputing. 2001, 37(1), pp. 71-90, doi.org/10.1016/S0925-2312(00)00314-3.

[10] PIAO J.L.J.X.F., WEN-PING S.C.D. Application of pi-sigma neural network to real-time classification of seafloor sediments. Applied Acoustics. 2005. Available from: http://en.cnki.com.cn/Journal_en/A-A5-yysn-5-6.htm

[11] SHIN Y., GHOSH J. The pi-sigma network: an efficient higher-order neural network for pattern classification and function approximation. International Joint Conference on Neural Networks. 1991, 1.

[12] SHIN Y., GHOSH J., SAMANI D. Computationally efficient invariant pattern recognition with higher order Pi-Sigma networks. The University of Texas at Austin.

[13] SRIVASTAVA N., HINTON G., KRIZHEVSKY A., SUTSKEVER I., SALAKHUTDINOV R. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research. 2014, 15(1). Available from: srivastava14a/srivastava14a.pdf

[14] WANG C., VENKATESH S.S., JUDD J.S. Optimal stopping and effective machine complexity in learning. In: Advances in Neural Information Processing Systems. 1994. Available from: optimal-stopping-and-effective-machine-complexity-in-learning.pdf

[15] WU W., FENG G., LI X. Training multilayer perceptrons via minimization of sum of ridge functions. Advances in Computational Mathematics. 2002, 17(4).

[16] WU W., XU Y. Deterministic convergence of an online gradient method for neural networks. Journal of Computational and Applied Mathematics. 2002, 144(1), doi.org/10.1016/S0377-0427(01)00571-4.

[17] XU Z.B., ZHANG H., WANG Y., CHANG X.Y., LIANG Y. L1/2 regularization. Science China Information Sciences. 2010, 53(6).

[18] YUAN Y.X., SUN W.Y. Optimization Theory and Methods. Science Press, Beijing.
