Temporal Backpropagation for FIR Neural Networks

Eric A. Wan
Stanford University, Department of Electrical Engineering, Stanford, CA 94305-4055

Abstract

The traditional feedforward neural network is a static structure which simply maps input to output. To better reflect the dynamics in the biological system, a network structure is proposed which models each synapse by a Finite Impulse Response (FIR) linear filter. An efficient gradient descent algorithm is derived which will be shown to be a temporal generalization of the familiar backpropagation algorithm.

1 Introduction

A standard neural network models a synapse by a single variable weight parameter. In a feedforward structure this results in a static network which maps input to output. Real neural networks are of course dynamic in nature, which is reflected in the temporal properties of the synapse along with such processes as impulse transmission and membrane excitation. While many accurate models of such processes do exist, from an engineering standpoint most are unrealistic to work with. The model we propose represents a synapse not by just a single weight parameter, but by an adaptive filter [1]. Further, we restrict the filter to be discrete time, characterized by a Finite Impulse Response (FIR)^1. While biologically motivated, we make no claims that the structure is necessarily biologically plausible. With this we proceed to derive algorithms for adapting the synaptic transfer functions so as to train the network as a whole.

2 Network Structure

Each synapse in the network is modeled by a Finite Impulse Response (FIR) linear filter. The coefficients for a filter can be represented by a weight vector W = [w(0), w(1), ..., w(T)]^t (t denotes transpose). The output of the filter simply corresponds to the weighted sum of delayed samples of the input x(k):

    y(k) = \sum_{i=0}^{T} w(i) x(k-i)

where k is the discrete time index.
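The tapped-delay-line behavior of a single FIR synapse can be sketched in a few lines of NumPy. The class name and buffer layout below are illustrative only (they do not appear in the paper); the state vector holds X(k) = [x(k), ..., x(k-T)] and the output is the dot product with W:

```python
import numpy as np

# A minimal sketch of one FIR synapse, assuming a tap-delay buffer of
# the last T+1 input samples:  y(k) = sum_{i=0}^{T} w(i) x(k-i).
class FIRSynapse:
    def __init__(self, taps):
        self.w = np.zeros(taps + 1)   # W = [w(0), ..., w(T)]
        self.x = np.zeros(taps + 1)   # X(k) = [x(k), ..., x(k-T)]

    def step(self, x_k):
        # Shift the delay line and insert the newest input sample.
        self.x[1:] = self.x[:-1]
        self.x[0] = x_k
        # Filter output: weighted sum of delayed input samples.
        return float(self.w @ self.x)
```

With w = [1.0, 0.5], feeding in a unit impulse simply reads out the impulse response one tap per time step.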
As usual, the total input to a neuron corresponds to the sum of all synaptic filter outputs which connect to that neuron^2. The output of the neuron is usually taken to be a non-linear sigmoidal function of its input. Subscripts are used to indicate the specific location of a synapse or neuron within the network. Thus W_{ij}^l specifies the synaptic filter connecting the output of neuron i in layer l to the input of neuron j in the next layer. For a fully connected feedforward structure with L layers and N_l neurons in each layer, the network can be completely specified as follows:

    y_j^{l+1}(k) = \sum_{i=1}^{N_l} W_{ij}^l \cdot X_i^l(k)

    x_j^{l+1}(k) = f(y_j^{l+1}(k))

    X_i^l(k) = [x_i^l(k), x_i^l(k-1), ..., x_i^l(k-T_l)]^t

where 1 ≤ i ≤ N_l, 1 ≤ j ≤ N_{l+1}, and 1 ≤ l ≤ L. Note, if we replace the vectors W and X by simple scalars, then the above equations reduce to the definition of the familiar static feedforward network [2].

^1 The Infinite Impulse Response (IIR) case has also been studied and will be presented in a future paper.
^2 Without loss of generality in the analysis, possible bias terms have been neglected for notational simplicity.

I - 575
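One forward time step through such a network can be sketched as follows. This is a hedged illustration, not code from the paper: tanh stands in for the unspecified sigmoid f(.), W[l] is assumed to have shape (N_l, N_{l+1}, T_l+1), and the tap-delay state X[l] shape (N_l, T_l+1):

```python
import numpy as np

# Sketch of one forward step through a fully connected FIR network.
# Each layer computes y_j = sum_i W_ij . X_i and x_j = tanh(y_j),
# then feeds x_j into the next layer's tap-delay lines.
def forward_step(W, X, x_in):
    X[0][:, 1:] = X[0][:, :-1]          # shift the input delay lines
    X[0][:, 0] = x_in
    for l in range(len(W)):
        # y_j^{l+1}(k) = sum_i W_ij^l . X_i^l(k) for every neuron j
        y = np.einsum('ijt,it->j', W[l], X[l])
        x = np.tanh(y)
        if l + 1 < len(X):              # store outputs as next layer's taps
            X[l + 1][:, 1:] = X[l + 1][:, :-1]
            X[l + 1][:, 0] = x
    return x                            # network output x^L(k)
```

Setting every tap vector to a scalar (T_l = 0) reduces this loop to the usual static feedforward pass, mirroring the scalar-reduction remark above.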
Figure 1: FIR Neural Network Structure

To complete the definitions for the network structure, we must specify the input/output relationships for the network as a whole:

    input:  x_i^0(k),  1 ≤ i ≤ N_0
    output: x_i^L(k),  1 ≤ i ≤ N_L    (6)

For notational purposes we have used x_i^0(k) as the external input to the network. This should not be confused with the output of a neuron. The structure of the FIR network is illustrated in Figure 1.

3 Adaptation

Given a desired response vector for the network's output at each index of time, we have:

    desired response:              d_i(k)
    error:                         e_i(k) = d_i(k) - x_i^L(k)
    total inst. squared error:     e^2(k) = \sum_i e_i^2(k)
    total squared error:           e^2 = \sum_k e^2(k)

Learning will be based on traditional gradient descent in which we attempt to minimize the total squared error over all time. The error gradient with respect to each weight vector is normally expanded as follows:

    \partial e^2 / \partial W_{ij}^l = \sum_k \partial e^2(k) / \partial W_{ij}^l    (7)

By taking each term in the expansion as an unbiased instantaneous estimate of the gradient, we may form the on-line training algorithm:

    W_{ij}^l(k+1) = W_{ij}^l(k) - \mu \, \partial e^2(k) / \partial W_{ij}^l(k)    (9)

in which the weight vectors are updated at each increment of time (\mu is defined as the learning rate). As we will show, this obvious expansion of \partial e^2 / \partial W_{ij}^l into the terms \partial e^2(k) / \partial W_{ij}^l does not lead to a desirable learning algorithm for this structure. A less intuitive expansion, in fact, yields a more attractive algorithm which exploits the structure's FIR characteristics. For now we will complete the derivation of the first algorithm by proceeding to calculate the terms \partial e^2(k) / \partial W_{ij}^l. Starting with the last layer of synapses in the network:
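The error bookkeeping above is straightforward; a minimal sketch (function name illustrative, vectors assumed as NumPy arrays) is:

```python
import numpy as np

# Sketch of the error definitions of Section 3: given the desired
# response d(k) and network output x^L(k), return the error vector
# e(k) and the total instantaneous squared error e^2(k).
def instantaneous_error(d_k, x_L_k):
    e_k = d_k - x_L_k                  # e_i(k) = d_i(k) - x_i^L(k)
    return e_k, float(e_k @ e_k)       # e^2(k) = sum_i e_i(k)^2
```

Summing e^2(k) over all k then gives the total squared error e^2 that the on-line rule of Eq. (9) descends by instantaneous estimates.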
By defining \delta_j^L(k) = -2 e_j(k) f'(y_j^L(k)), for the last layer of synapses we have:

    W_{ij}^{L-1}(k+1) = W_{ij}^{L-1}(k) - \mu \, \delta_j^L(k) \, X_i^{L-1}(k)    (13)

This is, of course, simply the LMS algorithm for a bank of FIR filters with a non-linear output. For the previous layer of synapses, however, the derivation is not as simple. Using the proper chain rule expansion we get:

    \partial e^2(k) / \partial W_{ij}^{L-2} = \sum_{m=1}^{N_{L-1}} \sum_{n=0}^{T_{L-1}} \delta_m^L(k) \, w_{jm}^{L-1}(n) \, f'(y_j^{L-1}(k-n)) \, X_i^{L-2}(k-n)

This last equation has many interpretations. The term \delta_m^L(k) w_{jm}^{L-1}(n) is similar to the recurrent formula used in backpropagation to accumulate the gradients. However, there are T_{L-1} such terms corresponding to the number of tap delays in the final layer of the network. This equation could, in fact, have been written down by inspection if we were to interpret each tap delay as a "virtual" neuron whose input is delayed the appropriate number of time steps. This can be thought of as equivalent to the common technique of viewing the structure unfolded in time. The problem with these equations, however, is that there is a loss of a sense of symmetry between the forward propagation of states and the backward propagation of the terms necessary to calculate the gradients. The desired distributed nature of the gradient computations disappears. In addition, the gradient equations do not generalize for arbitrary layers: there is no nice recurrent formula for the gradients. The gradient calculation for one more layer back results in a triple sum. In fact, the total number of operations actually grows geometrically with the number of layers.

These drawbacks can, however, be overcome if we consider a completely different formulation of the gradient descent algorithm. Our original expansion of the total error gradient is not unique. Consider the following:

    \partial e^2 / \partial W_{ij}^l = \sum_k (\partial e^2 / \partial y_j^{l+1}(k)) \cdot (\partial y_j^{l+1}(k) / \partial W_{ij}^l)    (20)

This now yields an on-line version of the form:

    W_{ij}^l(k+1) = W_{ij}^l(k) - \mu \, (\partial e^2 / \partial y_j^{l+1}(k)) \cdot (\partial y_j^{l+1}(k) / \partial W_{ij}^l(k))    (21)
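The output-layer step of Eq. (13) can be sketched directly. The shapes and the tanh derivative are assumptions for illustration: W has shape (N_{L-1}, N_L, T+1), X holds the tap vectors X_i^{L-1}(k), and f'(y) = 1 - tanh(y)^2:

```python
import numpy as np

# Sketch of Eq. (13): delta_j^L(k) = -2 e_j(k) f'(y_j^L(k)) and an
# LMS-style step W_ij <- W_ij - mu * delta_j * X_i for every FIR
# synapse feeding the output layer.  tanh is assumed for f.
def output_layer_update(W, X, y, e, mu):
    delta = -2.0 * e * (1.0 - np.tanh(y) ** 2)   # shape (N_L,)
    # Outer product of deltas and tap vectors gives one update per synapse.
    W -= mu * np.einsum('j,it->ijt', delta, X)
    return delta
```

With T = 0 and a linear output this is exactly the classical LMS weight update, which is the point of the remark following Eq. (13).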
Note the time index now runs over y_j^{l+1}(k) and not e^2(k). We may interpret \partial e^2 / \partial y_j^{l+1}(k) as the change in the total squared error over all time due to a change in the input to a neuron at a single instant of time. Furthermore,

    (\partial e^2 / \partial y_j^{l+1}(k)) \cdot (\partial y_j^{l+1}(k) / \partial W_{ij}^l) \ne \partial e^2(k) / \partial W_{ij}^l

Only under the assumption that the error and all neuron outputs are stationary processes can we regard each term as an unbiased instantaneous estimate of the true gradient^3. Now for any layer, \partial y_j^{l+1}(k) / \partial W_{ij}^l = X_i^l(k). Furthermore, we may simply define \delta_j^{l+1}(k) \triangleq \partial e^2 / \partial y_j^{l+1}(k). This allows us to rewrite Equation 21 in the more familiar notational form:

    W_{ij}^l(k+1) = W_{ij}^l(k) - \mu \, \delta_j^{l+1}(k) \, X_i^l(k)    (22)

which now holds for any layer in the network. To complete the derivation, an explicit formula for \delta_j^l(k) must be found. For the output layer we have simply:

    \delta_j^L(k) = -2 e_j(k) f'(y_j^L(k))

which is the same as before. For any hidden layer we have:

    \delta_j^l(k) = \partial e^2 / \partial y_j^l(k) = f'(y_j^l(k)) \sum_t \sum_{m=1}^{N_{l+1}} \delta_m^{l+1}(t) \, \partial y_m^{l+1}(t) / \partial x_j^l(k)

But we recall:

    y_m^{l+1}(t) = \sum_{j=1}^{N_l} \sum_{n=0}^{T_l} w_{jm}^l(n) \, x_j^l(t-n)

Thus

    \partial y_m^{l+1}(t) / \partial x_j^l(k) = { w_{jm}^l(t-k)  for 0 ≤ t-k ≤ T_l ;  0 otherwise }    (29)

which now yields

    \delta_j^l(k) = f'(y_j^l(k)) \sum_{m=1}^{N_{l+1}} \sum_{n=0}^{T_l} \delta_m^{l+1}(k+n) \, w_{jm}^l(n) = f'(y_j^l(k)) \sum_{m=1}^{N_{l+1}} \Delta_m^{l+1}(k) \cdot W_{jm}^l

where we have defined

    \Delta_m(k) = [\delta_m(k), \delta_m(k+1), ..., \delta_m(k+T_l)]

^3 This is never a valid assumption for finite time. However, the expansion in Equation 20 is always valid.
Figure 2: Backward filter propagation of gradient terms

Summarizing, the complete adaptation algorithm can be expressed as follows:

    W_{ij}^l(k+1) = W_{ij}^l(k) - \mu \, \delta_j^{l+1}(k) \, X_i^l(k)    (35)

    \delta_j^l(k) = -2 e_j(k) f'(y_j^L(k))                                   for l = L
    \delta_j^l(k) = f'(y_j^l(k)) \sum_{m=1}^{N_{l+1}} \Delta_m^{l+1}(k) \cdot W_{jm}^l    for 1 ≤ l ≤ L-1

We now have a recursive formula for the error gradients. To calculate \delta_j^l(k) for a given neuron we simply propagate the \delta's from the next layer backwards through the synaptic filters which the given neuron feeds. In this sense, these equations can be thought of as the temporal version of backpropagation, where the \delta's are formed not by simply taking weighted sums but by backward filtering. For each new input and desired response vector we increment the forward filters one time step and the backward filters one time step. Thus by manipulating the terms used to accumulate the error gradients we have preserved the symmetry between the forward propagation of states and the backward propagation of the \delta terms. This is illustrated in Figure 2. Again, if we replace the vectors X, W, and now \Delta by scalars, the above equations reduce to the familiar backpropagation algorithm for static networks. In this sense, these equations may also be thought of as a "vector" generalization of backpropagation.

This algorithm also has added computational advantages. Each neuron requires on the order of 2NT multiplications, while the first algorithm (Equation 19) takes on the order of NT^2 multiplications^4. This savings comes from grouping terms into products of sums instead of sums of products. In fact, with this algorithm the total number of operations continues to grow linearly with the number of layers, versus geometrically as in the first case.

The careful reader may have observed what appears to be a flaw in this algorithm. The calculations for the hidden-layer \delta_j^l(k)'s are in fact non-causal. The source of this non-causal filtering can be seen by considering the terms \partial e^2 / \partial y_j^l(k).
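The backward-filtering recursion for the \delta's can be sketched in an offline form, with the whole \delta^{l+1} sequence stored so the (non-causal) future terms \delta^{l+1}(k+n) are available. Shapes and the tanh derivative are assumptions for illustration; nothing here is code from the paper:

```python
import numpy as np

# Offline sketch of the delta recursion: given the deltas of layer l+1
# over K time steps (shape (K, N_{l+1})), the filters W of shape
# (N_l, N_{l+1}, T+1), and the sequence y_seq of y^l(k), compute
# delta^l(k) = f'(y^l(k)) * sum_m sum_n delta_m^{l+1}(k+n) w_jm(n).
def backprop_deltas(delta_next, W, y_seq):
    K = delta_next.shape[0]
    T = W.shape[2] - 1
    # Zero-pad so future deltas past the end of the record contribute 0.
    padded = np.vstack([delta_next, np.zeros((T, delta_next.shape[1]))])
    delta = np.empty((K, W.shape[0]))
    for k in range(K):
        # Delta_m(k) = [delta_m(k), ..., delta_m(k+T)]: a backward filter.
        acc = np.einsum('nm,jmn->j', padded[k:k + T + 1], W)
        delta[k] = (1.0 - np.tanh(y_seq[k]) ** 2) * acc
    return delta
```

Note the \delta's run forward through the tap index (k, k+1, ..., k+T) while propagating backward through the layers, which is exactly the "backward filtering" picture of Figure 2.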
Since it takes time for the output of any internal neuron to completely propagate through the network, the change in the total error due to a change in an internal state y_j^l(k) is a function of future values within the network. Since the network is FIR, we can easily remedy the algorithm by adding a finite number of simple delay operators into the network.

4 Experimentation

To compare the two algorithms derived, we experimentally verify their equivalence. Figure 3 shows the averaged learning curves for a two layer network modeling an unknown non-linear system^5. Training

^4 N = nodes per layer, T = number of tap delays. NT^2 is valid for the neurons in the first hidden layer back from the output. An explicit formula was not derived for all layers in the case of the first algorithm.
^5 1 input, 1 output, and 5 hidden units with 5-tap FIR filters for each synapse. \mu = .05
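One way to read the delay-operator remedy in code: since \delta^l(k) needs the next layer's deltas up through time k+T, a causal implementation can only form \delta^l(k-T) at time k, buffering the incoming deltas and applying the corresponding update T steps late. The sketch below is one possible interpretation of that bookkeeping, not the paper's implementation:

```python
from collections import deque

# Sketch of the causal remedy: buffer T+1 deltas from layer l+1 so
# that Delta(k-T) = [delta(k-T), ..., delta(k)] can be formed a full
# T steps late, once all of its "future" terms have arrived.
class DelayedDelta:
    def __init__(self, T):
        self.T = T
        self.buf = deque(maxlen=T + 1)   # holds delta^{l+1}(k-T .. k)

    def push(self, delta_next_k):
        self.buf.append(delta_next_k)
        if len(self.buf) == self.T + 1:
            return list(self.buf)        # Delta(k-T), now fully known
        return None                      # still waiting on future deltas
```

Because the network is FIR, the required look-ahead is finite and fixed per layer, which is why a finite number of delay operators suffices.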
[Figure 3: Experimental Learning Curves — averaged squared error vs. time increment k (0 to 400). (a) Alg. 1 - instantaneous error gradient; (b) Alg. 2 - temporal backpropagation.]

differed only in terms of the algorithm used. As can be seen, the performance of both appears to be roughly equivalent. Differences are hidden in terms of the computational efficiency of the second algorithm as discussed earlier. Initial experimentation also seems to indicate the added benefit of less misadjustment for the second algorithm. Minor discrepancies arise due to differences in the timing at which weights are adjusted relative to the calculation of the error gradients (mathematically the algorithms become equivalent as \mu \to 0).

5 Relationship to other work

The structure of the FIR networks presented here is similar to the Time-Delay Neural Networks of Waibel et al. used in speech recognition [3]. Their training, however, is based on instantaneous error gradients as derived in the first algorithm and does not fully exploit the FIR structure of the network. The temporal backpropagation algorithm presented here can, of course, be simply substituted to train their structure.

6 Conclusion

This paper has introduced an efficient gradient descent algorithm for FIR neural networks which can be considered a temporal generalization of the backpropagation algorithm. By modeling each synapse as a linear filter, the neural network as a whole may be thought of as an adaptive system with its own internal dynamics. Equivalently, we may think of the network as a complex nonlinear filter. Applications should thus include areas of pattern recognition where there is an inherent temporal quality to the data, such as speech recognition. Also, the networks should find a natural use in areas of nonlinear control and other adaptive signal processing and filtering applications such as noise cancellation or equalization^6.
The purpose of this paper, however, was to introduce the learning algorithm itself, rather than demonstrate the full potential of the actual network structure. This will be the work of future research.

References

[1] B. Widrow and S. D. Stearns, Adaptive Signal Processing, Englewood Cliffs, NJ: Prentice Hall, 1985.
[2] D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, The MIT Press, Cambridge, MA, 1986.
[3] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. Lang, "Phoneme Recognition Using Time-Delay Neural Networks," IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 37, No. 3, pp. 328-339, March 1989.

^6 Static neural networks are already used for many of these applications. With static networks, time relations are often created by rippling the data in time across the input of the network, or by using the output of the network to form a feedback loop. In both cases the dynamics are really external to the actual network itself.