Stability of backpropagation learning rule

Petr Krupanský, Petr Pivoňka, Jiří Dohnal
Department of Control and Instrumentation, Brno University of Technology, Božetěchova 2, 612 66 Brno, Czech Republic
{krupan, pivonka, dohnalj}@feec.vutbr.cz

Abstract

Control of real processes requires a different approach to neural network learning. The presented modification of the backpropagation learning algorithm changes the meaning of the learning constants. The modification is based on a stability condition for the learning dynamics.

Keywords: Neural networks, ARMA model, control, backpropagation, stability, the largest singular value, Euclidean norm.

1 Introduction

The backpropagation learning algorithm has properties suitable for on-line adaptation: it is cheap in both time and memory. Its disadvantages are relatively slow convergence and possible instability when the learning constants are chosen improperly. On-line learning running simultaneously with control can bring a number of problems that substantially influence the criteria common in neural network learning. One of the main criteria of the quality of a learning process is fast convergence. The specific problems of controlling real processes require a different approach to the learning algorithm: an optimum learning speed must be chosen, and it is of crucial importance that the algorithm is able to complete the modification of the weights within a limited time period. In connection with the control of real processes, the stability of the learning algorithm must therefore be considered.

2 Backpropagation algorithm

The learning algorithm is based on minimization of the error E_k between the network output y_k and the desired value d_k:

E_k = \frac{1}{2} \sum_{j=1}^{m} [y_k(j) - d_k(j)]^2   (1)

where:
y_k(j) ... response of the network to the j-th input pattern in step k,
d_k(j) ... desired response to the j-th input pattern in step k,
m ... number of input patterns.
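As an illustration of the batch error (1), a minimal sketch in Python; the output and desired values below are invented, not taken from the paper:

```python
# Batch error (1): E_k = 1/2 * sum_j (y_k(j) - d_k(j))^2.
def batch_error(y, d):
    return 0.5 * sum((yj - dj) ** 2 for yj, dj in zip(y, d))

y = [0.8, 0.2, 0.5]   # illustrative network responses to m = 3 patterns
d = [1.0, 0.0, 0.5]   # illustrative desired responses
print(batch_error(y, d))  # approximately 0.04
```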
The gradient of the error with respect to the network weights (with a sigmoid as the output transfer function) is

\frac{\partial E_k}{\partial w_k(i,j)} = \frac{\partial E_k}{\partial y_k(i)} \cdot \frac{\partial y_k(i)}{\partial u_k(i)} \cdot \frac{\partial u_k(i)}{\partial w_k(i,j)} = [y_k(i) - d_k(i)] \cdot y_k(i)[1 - y_k(i)] \cdot x_k(j) = \delta_k(i) x_k(j)   (2)

where:
∂E_k/∂y_k(i) ... derivation of the error, the difference between the neuron output and the desired value; for an inner layer, ∂E_k/∂y_k(i) = \sum_{j=1}^{n} \delta_k(j) w_k(j,i),
∂y_k(i)/∂u_k(i) ... derivation of the output transfer function of the neuron,
∂u_k(i)/∂w_k(i,j) ... derivation of the summed input with respect to the relevant weight, i.e. the previous neuron output,
x_k(j) ... the previous neuron output (the neuron input),
δ_k(i) ... derivation of the error with respect to the summed input.

Learning parameters α and β are used: α represents the speed of movement along the gradient, and β the inertia of the previous value. The final formula for the change of the weights is

\Delta w_k(i,j) = -\alpha \delta_{k-1}(i) x_{k-1}(j) + \beta \Delta w_{k-1}(i,j)   (3)

These values Δw_k are calculated for the whole network, and the weights are changed for the next step.

3 Modification for ARMA model

If the neural network is used for the construction of an ARMA model of the system, the network configuration reduces to one neuron with a linear output function; this structure corresponds exactly to the ARMA model. For one-pattern learning, the input x, the parameter vector a and the system output d are constants, independent of the learning dynamics. The gradient of the local error is

\frac{\partial E_k}{\partial w_k(i)} = \frac{\partial E_k}{\partial y_k} \cdot \frac{\partial y_k}{\partial u_k} \cdot \frac{\partial u_k}{\partial w_k(i)} = (y_k - d) \cdot 1 \cdot x(i)   (4)

where:
d ... constant value of the system output,
x(i) ... constant value of the input vector.

The formula for Δw_k:

\Delta w_k(i) = -\alpha (y_k - d) x(i) + \beta \Delta w_{k-1}(i)   (5)

The weights:

w_{k+1}(i) = w_k(i) + \Delta w_k(i) = w_k(i) - \alpha (y_k - d) x(i) + \beta [w_k(i) - w_{k-1}(i)]   (6)

The vector form of the formula:

w_k = w_{k-1} - \alpha [y_{k-1} - d] x + \beta [w_{k-1} - w_{k-2}]   (7)

4 Stability conditions

The formula for y_k:

y_k = w_k x^T   (8)

Let us assume the system can be approximated by the formula

d = a x^T   (9)

where a is the vector of the system parameters. Substituting formulas (8) and (9) into (7), we get

w_k = w_{k-1} - \alpha (w_{k-1} x^T - a x^T) x + \beta (w_{k-1} - w_{k-2})   (10)

By modification we get the formula in the Z-transform:

w(z) = w(z) z^{-1} - \alpha [w(z) z^{-1} x^T - a x^T] x + \beta [w(z) z^{-1} - w(z) z^{-2}]   (11)

And the final formula:

w(z) = (\alpha a x^T x \, z^{-1}) (I + [\alpha x^T x - \beta I - I] z^{-1} + \beta I z^{-2})^{-1} = N(z) D(z)^{-1}   (12)

where I is the unit matrix of suitable dimension. For simplicity of the derivation it is possible to replace the matrix αx^Tx by the scalar q and the unit matrices I by the scalar 1.
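The one-neuron update (5)-(7) can be sketched directly; a minimal simulation in Python, with x, a, α and β chosen as illustrative values (for these values the learning is stable and the network output converges to the system output):

```python
# One-neuron linear learning per (5)-(7):
# w_{k+1} = w_k - alpha*(y_k - d)*x + beta*(w_k - w_{k-1}).
def learn(x, a, alpha, beta, steps):
    d = sum(ai * xi for ai, xi in zip(a, x))      # system output d = a x^T, eq. (9)
    w_prev = [0.0] * len(x)
    w = [0.0] * len(x)
    for _ in range(steps):
        y = sum(wi * xi for wi, xi in zip(w, x))  # network output y_k = w_k x^T
        w_next = [wi - alpha * (y - d) * xi + beta * (wi - wpi)
                  for wi, wpi, xi in zip(w, w_prev, x)]
        w_prev, w = w, w_next
    return w

x = [1.0, 0.5]           # illustrative input pattern
a = [0.3, -0.2]          # illustrative system parameters
w = learn(x, a, alpha=0.5, beta=0.1, steps=200)
y = sum(wi * xi for wi, xi in zip(w, x))
print(y)                 # close to d = a x^T = 0.2
```

Here q = α·(x·x^T) = 0.625 lies inside the stability interval (0, 2+2β) = (0, 2.2) derived in the next section, so the iteration converges.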
The transfer function is stable if all roots of the characteristic polynomial lie inside the unit circle in the Z plane. The characteristic polynomial D_1 (for a system with one input) is described by the formula

D_1(z) = 1 + [q - \beta - 1] z^{-1} + \beta z^{-2}   (13)

The roots of the characteristic polynomial D_1(z) are

z_{1,2} = \frac{-q + \beta + 1 \pm \sqrt{(q - \beta)^2 - 2(q + \beta) + 1}}{2}   (14)

The absolute values of the roots must be smaller than 1:

|z_{1,2}| < 1   (15)
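The root condition (14)-(15) is easy to check numerically; a small sketch in Python, with β = 0.9 and the q values chosen for illustration:

```python
import cmath

# Roots of D1(z) = 1 + (q - beta - 1) z^-1 + beta z^-2, i.e. of
# z^2 + (q - beta - 1) z + beta = 0, as in (14).
def roots(q, beta):
    disc = cmath.sqrt((q - beta) ** 2 - 2 * (q + beta) + 1)
    return ((-q + beta + 1 + disc) / 2, (-q + beta + 1 - disc) / 2)

def stable(q, beta):
    # Condition (15): both roots strictly inside the unit circle.
    return all(abs(z) < 1 for z in roots(q, beta))

beta = 0.9
print(stable(1.0, beta))  # q inside (0, 2 + 2*beta) = (0, 3.8): True
print(stable(4.0, beta))  # q above the limit 3.8: False
```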
By substituting, we get the condition for q:

0 < q < 2 + 2\beta   (16)

For a system with more inputs the condition is

0 < \|\alpha x^T x\|_S < \|2I + 2\beta I\|_S   (17)

where ‖·‖_S denotes the largest singular value norm. The solution for α has the lower limit α > 0. The upper limit is

\alpha < \frac{\|2I + 2\beta I\|_S}{\|x^T x\|_S} = \frac{2 + 2\beta}{\|x^T x\|_S}   (18)

5 Batch learning on a number of patterns

If the network is learnt on h patterns in the learning set, then for i = 1...h the network output is given by the formula

y_{k,i} = w_k x_i^T   (19)

The system output is

d_i = a x_i^T   (20)

The modification of the weights is then

w_k = w_{k-1} - \frac{\alpha}{h} \sum_{i=1}^{h} [(w_{k-1} x_i^T - a x_i^T) x_i] + \beta (w_{k-1} - w_{k-2})   (21)

After transfer to the Z-transform and modification, we get the final formula for the weights:

w(z) = \left( \frac{\alpha}{h} \sum_{i=1}^{h} a x_i^T x_i \, z^{-1} \right) \left[ I + \left( \frac{\alpha}{h} \sum_{i=1}^{h} x_i^T x_i - \beta I - I \right) z^{-1} + \beta I z^{-2} \right]^{-1}   (22)

The relationship can be compared with equation (12). After modifications we get the stability condition for a number of patterns:

0 < \left\| \frac{\alpha}{h} \sum_{i=1}^{h} x_i^T x_i \right\|_S < \|2I + 2\beta I\|_S   (23)

The upper limit for α is given by the formula

\alpha < \frac{\|2I + 2\beta I\|_S}{\left\| \frac{1}{h} \sum_{i=1}^{h} x_i^T x_i \right\|_S} = \frac{2 + 2\beta}{\left\| \frac{1}{h} \sum_{i=1}^{h} x_i^T x_i \right\|_S}   (24)

If the patterns in the learning set are not too far apart, the largest singular value can be substituted by the much simpler Euclidean norm

\|A\|_E = \sqrt{\sum_i \sum_j a_{i,j}^2}   (25)

Then the simplified computation is the following:

\alpha \le \frac{2 + 2\beta}{\left\| \frac{1}{h} \sum_{i=1}^{h} x_i^T x_i \right\|_E}   (26)

6 Examples of a learning behaviour

6.1 Learning dynamics

The example shows the learning behaviour of a neuron with two weights. Figure 1 shows the zeros and poles of the characteristic polynomial of the learning dynamics with α on the edge of stability, i.e. α equal to the upper limit (18); figure 2 shows the corresponding step response of the learning dynamics.

[Figure 1: The characteristic polynomial on the edge of stability (pole-zero map)]

The next figures, fig. 3 and fig. 4, show the learning dynamics for α at half the value of the stability limit.
The last two figures, fig. 5 and fig. 6, show the behaviour for α at 1.1 times the stability limit, i.e. slightly above it.
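The limits (24) and (26) can be evaluated directly for a concrete pattern set. A minimal sketch in Python, with illustrative patterns and β; the largest singular value of the symmetric positive semidefinite matrix (1/h)Σ x_i^T x_i equals its largest eigenvalue, so plain power iteration suffices:

```python
import math

def outer_mean(patterns):
    # M = (1/h) * sum_i x_i^T x_i (mean of outer products).
    n = len(patterns[0])
    M = [[0.0] * n for _ in range(n)]
    for x in patterns:
        for i in range(n):
            for j in range(n):
                M[i][j] += x[i] * x[j] / len(patterns)
    return M

def largest_singular_value(M, iters=100):
    # Power iteration; valid here because M is symmetric PSD.
    v = [1.0] * len(M)
    for _ in range(iters):
        w = [sum(row[j] * v[j] for j in range(len(v))) for row in M]
        norm = math.sqrt(sum(wi * wi for wi in w))
        v = [wi / norm for wi in w]
    return norm

def euclidean_norm(M):
    # Euclidean (Frobenius) norm, eq. (25).
    return math.sqrt(sum(m * m for row in M for m in row))

patterns = [[1.0, 0.5, 0.2], [0.9, 0.4, 0.1], [1.1, 0.6, 0.3]]  # illustrative
beta = 0.9
M = outer_mean(patterns)
print((2 + 2 * beta) / largest_singular_value(M))  # upper limit (24)
print((2 + 2 * beta) / euclidean_norm(M))          # simpler, smaller limit (26)
```

Since the Euclidean norm always dominates the largest singular value, the limit (26) is never larger than (24), so the substitution errs on the safe side.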
[Figure 2: The learning dynamics on the edge of stability (step response of weights 1 and 2)]

[Figure 3: Stable characteristic polynomial (pole-zero map)]

[Figure 4: Stable learning dynamics (step response of weights 1 and 2)]

[Figure 5: Non-stable characteristic polynomial (pole-zero map)]

[Figure 6: Non-stable learning dynamics (step response of weights 1 and 2)]

6.2 Norm comparison

For simplicity of computation in a real-time control process it is suitable to use the Euclidean norm instead of the largest singular value norm. The graphs show the behaviour of the norm values during the process. The modelled system is described by the transfer function

F_S(p) = \frac{1.5}{10 p^2 + 0.7 p + 1}   (27)

The sample period is T = 1 s, and the level of noise is 0.5. There are 15 patterns in the training set. The form of the patterns S(i) is as follows:

S(i) = (x(i), x(i-1), x(i-2), x(i-3), y(i-1), y(i-2), y(i-3), 1)   (28)
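Assembling the training patterns in the form (28) is straightforward; a short sketch in Python, where the input and output signal samples are invented for illustration:

```python
# Pattern per (28): S(i) = (x(i), x(i-1), x(i-2), x(i-3),
#                           y(i-1), y(i-2), y(i-3), 1).
def make_pattern(x, y, i):
    return [x[i], x[i - 1], x[i - 2], x[i - 3],
            y[i - 1], y[i - 2], y[i - 3], 1.0]

x = [0.0, 1.0, 1.0, 1.0, 1.0, 1.0]   # illustrative system input samples
y = [0.0, 0.0, 0.2, 0.5, 0.8, 1.0]   # illustrative system output samples
# Patterns exist from i = 3 on, once three past samples are available.
patterns = [make_pattern(x, y, i) for i in range(3, len(x))]
print(patterns[0])  # [1.0, 1.0, 1.0, 0.0, 0.2, 0.0, 0.0, 1.0]
```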
where:
i ... i-th pattern in the training set,
x(i) ... i-th system input,
y(i) ... i-th system output,
1 ... bias.

Figure 7 shows the system response to the input signal with noise. Figure 8 shows the difference between the two norms, and figure 9 the values of the norms themselves. The difference values are approximately 10 times smaller than the norm values, so we can choose the learning constant with an accuracy of one decimal place.

[Figure 7: The response to the input signal with noise]

[Figure 8: The difference between the norms]

[Figure 9: The norm values (Euclidean norm and the largest singular value norm)]

6.3 Algorithm comparison

For demonstration, the following backpropagation-based algorithms implemented in MATLAB were tested:

1. GD - Gradient descent backpropagation.
2. GDA - Gradient descent with adaptive learning rate backpropagation.
3. GDM - Gradient descent with momentum backpropagation.
4. GDX - Gradient descent with momentum and adaptive learning rate backpropagation.
5. RP - Resilient backpropagation.
6. SBP - Modified BP by the stability condition.

The simulations used the same learning set and the same learning constant β = 0.9. The learning constant α was set by the MATLAB function maxlinlr; for the modified BP algorithm, α was set to 0.9 times the stability limit (24). As the criterion of learning quality, the mean square error MSE = 1·10⁻⁸ was used. The main parameters of learning were the number of epochs, the learning time of one epoch, and the quality rate, calculated by the formula

\text{Quality rate} = \frac{1 \cdot 10^6}{\text{Learning time of one epoch} \cdot \text{Number of epochs}}   (29)

The following table shows the simulation results of these learning algorithms:
Table 1: Algorithm comparison.

Algorithm   Number of epochs   Learning time of one epoch (10^-3 s)   Quality rate
GD          967                1.674                                  1.73
GDA         72                 1.741                                  2.42
GDM         934                1.763                                  1.89
GDX         226                2.164                                  9.58
RP          269                2.31                                   8.55
SBP         145                1.778                                  12.26

7 Conclusion

The stability conditions were given and, on their basis, the limits for the learning constants. This principle changes the meaning of the learning constants: their magnitude not only determines the speed of the learning algorithm, but also expresses the degree of its stability. The principle also makes it possible to set the constants so that learning is faster, for different patterns and independently of the initial network values, than with the common modifications of BP.

Acknowledgements

The paper has been prepared as a part of the solution of GAČR project No. 12/1/1485, with the support of the research plan CEZ: MSM 2613 and the Ph.D. research grant FR: MSMT IS 432164 - Adaptive Controllers based on Neural Networks.