Temporal Backpropagation for FIR Neural Networks

Eric A. Wan
Stanford University, Department of Electrical Engineering, Stanford, CA 94305-4055

Abstract

The traditional feedforward neural network is a static structure which simply maps input to output. To better reflect the dynamics of biological systems, a network structure is proposed in which each synapse is modeled by a Finite Impulse Response (FIR) linear filter. An efficient gradient descent algorithm is derived and shown to be a temporal generalization of the familiar backpropagation algorithm.

1 Introduction

A standard neural network models a synapse by a single variable weight parameter. In a feedforward structure this results in a static network which maps input to output. Real neural networks are, of course, dynamic in nature, which is reflected in the temporal properties of the synapse along with such processes as impulse transmission and membrane excitation. While many accurate models of such processes exist, from an engineering standpoint most are unrealistic to work with. The model we propose represents a synapse not by a single weight parameter, but by an adaptive filter [1]. Further, we restrict the filter to be discrete time, characterized by a Finite Impulse Response (FIR)¹. While biologically motivated, we make no claims that the structure is necessarily biologically plausible. With this we proceed to derive algorithms for adapting the synaptic transfer functions so as to train the network as a whole.

2 Network Structure

Each synapse in the network is modeled by a Finite Impulse Response (FIR) linear filter. The coefficients of a filter are represented by a weight vector $W = [w(0), w(1), \ldots, w(T)]^T$ (the superscript $T$ denotes transpose). The output of the filter is simply the weighted sum of delayed samples of the input $x(k)$:

$$y(k) = \sum_{n=0}^{T} w(n)\, x(k-n),$$

where $k$ is the discrete time index. As usual, the total input to a neuron is the sum of all synaptic filter outputs which connect to that neuron². The output of the neuron is taken to be a nonlinear sigmoidal function $f$ of its input. Subscripts indicate the specific location of a synapse or neuron within the network: $W_{ij}^l$ specifies the synaptic filter connecting the output of neuron $i$ in layer $l$ to the input of neuron $j$ in the next layer. For a fully connected feedforward structure with $L$ layers and $N_l$ neurons in layer $l$, the network is completely specified by

$$y_j^{l+1}(k) = \sum_{i=1}^{N_l} W_{ij}^l \cdot X_i^l(k), \qquad x_j^{l+1}(k) = f\big(y_j^{l+1}(k)\big),$$

where $X_i^l(k) = [x_i^l(k), x_i^l(k-1), \ldots, x_i^l(k-T_l)]^T$ is the vector of delayed outputs of neuron $i$ in layer $l$, and $1 \le i \le N_l$, $1 \le j \le N_{l+1}$, $1 \le l \le L$. Note that if we replace the vectors $W$ and $X$ by simple scalars, the above equations reduce to the definition of the familiar static feedforward network [2].

¹ The Infinite Impulse Response (IIR) case has also been studied and will be presented in a future paper.
² Without loss of generality in the analysis, possible bias terms have been neglected for notational simplicity.
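As a concrete illustration of these forward equations, the sketch below (a minimal NumPy rendering; the layer sizes, tap counts, tanh nonlinearity, and all variable names are illustrative assumptions rather than anything specified in the paper) advances a fully connected FIR network by one time step. Each synaptic output is the inner product $W_{ij}^l \cdot X_i^l(k)$, so one step amounts to shifting a tap-delay buffer per layer and forming the vector sums above.

```python
import numpy as np

def fir_forward_step(x_in, weights, buffers, f=np.tanh):
    """One forward time step of an FIR network (illustrative sketch).

    weights[l] has shape (N_l, N_{l+1}, T_l + 1): weights[l][i, j] is the
    FIR weight vector W^l_{ij} of the synapse from neuron i in layer l
    to neuron j in layer l+1.
    buffers[l] has shape (N_l, T_l + 1): buffers[l][i] holds the delayed
    outputs X^l_i(k) = [x^l_i(k), x^l_i(k-1), ..., x^l_i(k - T_l)].
    """
    x = np.asarray(x_in, dtype=float)          # x^0(k): external input
    for l, W in enumerate(weights):
        buf = buffers[l]
        buf[:, 1:] = buf[:, :-1]               # shift the tap-delay lines
        buf[:, 0] = x                          # insert the newest samples
        # y^{l+1}_j(k) = sum_i  W^l_{ij} . X^l_i(k)
        y = np.einsum('ijt,it->j', W, buf)
        x = f(y)                               # x^{l+1}_j(k) = f(y^{l+1}_j(k))
    return x                                   # network output x^L(k)

# Example with arbitrary sizes: 1 input -> 5 hidden -> 1 output, 5-tap filters
sizes, taps = [1, 5, 1], 4
weights = [0.1 * np.random.randn(sizes[l], sizes[l + 1], taps + 1)
           for l in range(len(sizes) - 1)]
buffers = [np.zeros((sizes[l], taps + 1)) for l in range(len(sizes) - 1)]
out = fir_forward_step([0.3], weights, buffers)
```

Because the buffers are updated in place, calling `fir_forward_step` repeatedly on a stream of inputs maintains the delayed states automatically.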

Figure 1: FIR neural network structure.

To complete the definitions for the network structure, we must specify the input/output relationships for the network as a whole:

$$\text{input: } x_i^0(k), \quad 1 \le i \le N_0; \qquad \text{output: } x_i^L(k), \quad 1 \le i \le N_L. \qquad (6)$$

For notational purposes we have used $x_i^0(k)$ for the external input to the network; this should not be confused with the output of a neuron. The structure of the FIR network is illustrated in Figure 1.

3 Adaptation

Given a desired response vector for the network's output at each index of time, we define

$$\text{desired response: } d_i(k), \qquad \text{instantaneous error: } e_i(k) = d_i(k) - x_i^L(k),$$
$$\text{total instantaneous squared error: } e^2(k) = \sum_i e_i^2(k), \qquad \text{total squared error: } e^2 = \sum_k e^2(k).$$

Learning is based on traditional gradient descent, in which we attempt to minimize the total squared error over all time. The error gradient with respect to each weight vector is normally expanded as follows:

$$\frac{\partial e^2}{\partial W_{ij}^l} = \sum_k \frac{\partial e^2(k)}{\partial W_{ij}^l}. \qquad (7)$$

By taking each term in the expansion as an unbiased instantaneous estimate of the gradient, we may form the on-line training algorithm

$$W_{ij}^l(k+1) = W_{ij}^l(k) - \mu\, \frac{\partial e^2(k)}{\partial W_{ij}^l}, \qquad (9)$$

in which the weight vectors are updated at each increment of time ($\mu$ is the learning rate). As we will show, this obvious expansion of $\partial e^2/\partial W_{ij}^l$ into the terms $\partial e^2(k)/\partial W_{ij}^l$ does not lead to a desirable learning algorithm for this structure. A less intuitive expansion, in fact, yields a more attractive algorithm which exploits the structure's FIR characteristics. For now we complete the derivation of the first algorithm by calculating the terms $\partial e^2(k)/\partial W_{ij}^l$, starting with the last layer of synapses in the network.
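The on-line rule of Equation 9 simply applies a gradient step at every time increment, using the instantaneous error $e^2(k)$ in place of the total error. A minimal sketch of that training loop follows; the hooks `forward_step` and `grad_inst` are placeholders for whichever gradient expansion (derived in the remainder of this section) is used, and all names and shapes are illustrative assumptions.

```python
import numpy as np

def online_training_loop(x_seq, d_seq, weights, forward_step, grad_inst, mu=0.05):
    """On-line gradient descent (Equation 9), sketched with placeholder hooks.

    forward_step(x_k, weights) -> (out_k, state_k): network output x^L(k)
        plus whatever internal activations the gradient needs.
    grad_inst(e_k, state_k, weights) -> list of arrays, one per weight
        vector, each holding the instantaneous estimate dE^2(k)/dW^l_{ij}.
    """
    total_sq_error = 0.0
    for x_k, d_k in zip(x_seq, d_seq):
        out_k, state_k = forward_step(x_k, weights)
        e_k = np.asarray(d_k) - out_k               # e_i(k) = d_i(k) - x^L_i(k)
        total_sq_error += float(np.dot(e_k, e_k))   # accumulate e^2 = sum_k e^2(k)
        grads = grad_inst(e_k, state_k, weights)
        for W, g in zip(weights, grads):
            W -= mu * g                             # Equation 9
    return total_sq_error
```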

By defining $\delta_j^L(k) = -2\, e_j(k)\, f'\big(y_j^L(k)\big)$, for the last layer of synapses we have

$$W_{ij}^{L-1}(k+1) = W_{ij}^{L-1}(k) - \mu\, \delta_j^L(k)\, X_i^{L-1}(k). \qquad (13)$$

This is, of course, simply the LMS algorithm for a bank of FIR filters with a nonlinear output. For the previous layer of synapses, however, the derivation is not as simple. Using the proper chain rule expansion we get

$$\frac{\partial e^2(k)}{\partial W_{ij}^{L-2}} = \sum_{m=1}^{N_L} \sum_{n=0}^{T_{L-1}} \delta_m^L(k)\, w_{jm}^{L-1}(n)\, f'\big(y_j^{L-1}(k-n)\big)\, X_i^{L-2}(k-n). \qquad (19)$$

This last equation has many interpretations. The term $\delta_m^L(k)\, w_{jm}^{L-1}(n)$ is similar to the recurrent formula used in backpropagation to accumulate the gradients. However, there are $T_{L-1}$ such terms, corresponding to the number of tap delays in the final layer of the network. This equation could, in fact, have been written down by inspection if we were to interpret each tap delay as a "virtual" neuron whose input is delayed the appropriate number of time steps. This can be thought of as equivalent to the common technique of viewing the structure unfolded in time. The problem with these equations, however, is that the symmetry between the forward propagation of the network and the backward propagation of the terms necessary to calculate the gradients is lost. The desired distributed nature of the gradient computations disappears. In addition, the gradient equations do not generalize to arbitrary layers: there is no nice recurrent formula for the gradients. The gradient calculation for one more layer back results in a triple sum, and the total number of operations actually grows geometrically with the number of layers. These drawbacks can, however, be overcome if we consider a completely different formulation of the gradient descent algorithm. Our original expansion of the total error gradient is not unique. Consider the following:

$$\frac{\partial e^2}{\partial W_{ij}^l} = \sum_k \frac{\partial e^2}{\partial y_j^{l+1}(k)} \cdot \frac{\partial y_j^{l+1}(k)}{\partial W_{ij}^l}. \qquad (20)$$

This now yields an on-line version of the form

$$W_{ij}^l(k+1) = W_{ij}^l(k) - \mu\, \frac{\partial e^2}{\partial y_j^{l+1}(k)} \cdot \frac{\partial y_j^{l+1}(k)}{\partial W_{ij}^l}. \qquad (21)$$

Note that the time index now runs over $y_j^{l+1}(k)$ and not $e^2(k)$. We may interpret $\partial e^2 / \partial y_j^{l+1}(k)$ as the change in the total squared error over all time due to a change in the input to a neuron at a single instant of time. Furthermore,

$$\frac{\partial e^2}{\partial y_j^{l+1}(k)} \cdot \frac{\partial y_j^{l+1}(k)}{\partial W_{ij}^l} \;\neq\; \frac{\partial e^2(k)}{\partial W_{ij}^l}.$$

Only under the assumption that the error and all neuron outputs are stationary processes can we regard each term as an unbiased instantaneous estimate of the true gradient³. Now, for any layer,

$$\frac{\partial y_j^{l+1}(k)}{\partial W_{ij}^l} = X_i^l(k).$$

Furthermore, we may simply define $\delta_j^l(k) \triangleq \partial e^2 / \partial y_j^l(k)$. This allows us to rewrite Equation 21 in the more familiar notational form

$$W_{ij}^l(k+1) = W_{ij}^l(k) - \mu\, \delta_j^{l+1}(k)\, X_i^l(k), \qquad (22)$$

which now holds for any layer in the network. To complete the derivation, an explicit formula for $\delta_j^l(k)$ must be found. For the output layer we have simply

$$\delta_j^L(k) = -2\, e_j(k)\, f'\big(y_j^L(k)\big),$$

which is the same as before. For any hidden layer we have

$$\delta_j^l(k) = \frac{\partial e^2}{\partial y_j^l(k)} = f'\big(y_j^l(k)\big) \sum_{m=1}^{N_{l+1}} \sum_t \delta_m^{l+1}(t)\, \frac{\partial y_m^{l+1}(t)}{\partial x_j^l(k)}.$$

But we recall that $y_m^{l+1}(t) = \sum_j W_{jm}^l \cdot X_j^l(t)$. Thus

$$\frac{\partial y_m^{l+1}(t)}{\partial x_j^l(k)} = \begin{cases} w_{jm}^l(t-k) & \text{for } 0 \le t-k \le T_l \\ 0 & \text{otherwise,} \end{cases} \qquad (29)$$

which now yields

$$\delta_j^l(k) = f'\big(y_j^l(k)\big) \sum_{m=1}^{N_{l+1}} \sum_{n=0}^{T_l} \delta_m^{l+1}(k+n)\, w_{jm}^l(n) = f'\big(y_j^l(k)\big) \sum_{m=1}^{N_{l+1}} \Delta_m^{l+1}(k) \cdot W_{jm}^l,$$

where we have defined

$$\Delta_j^l(k) = \big[\delta_j^l(k),\, \delta_j^l(k+1),\, \ldots,\, \delta_j^l(k+T_{l-1})\big].$$

³ This is never a valid assumption for finite time. However, the expansion in Equation 20 is always valid.
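Viewed operationally, the hidden-layer formula is an anti-causal filtering of the next layer's $\delta$'s: for each neuron $j$ we correlate the vector of future $\delta_m^{l+1}$ values with the synaptic filter $W_{jm}^l$. A minimal sketch, with illustrative (assumed) array layouts:

```python
import numpy as np

def hidden_delta(W_j, delta_future, fprime_yj):
    """delta^l_j(k) = f'(y^l_j(k)) * sum_m  Delta^{l+1}_m(k) . W^l_{jm}

    W_j[m]          : FIR weight vector w^l_{jm}(0..T_l) for the synapse
                      from hidden neuron j to neuron m of the next layer
    delta_future[m] : [delta^{l+1}_m(k), delta^{l+1}_m(k+1), ..., delta^{l+1}_m(k+T_l)]
                      -- future deltas, which in a purely on-line setting only
                      become available after further time steps
    fprime_yj       : f'(y^l_j(k))
    """
    return fprime_yj * sum(np.dot(delta_future[m], W_j[m])
                           for m in range(len(W_j)))
```

The dependence on $\delta_m^{l+1}(k+n)$ for $n > 0$ is exactly the non-causality discussed at the end of this section; it is handled by adding delay operators to the network.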

Figure 2: Backward filter propagation of gradient terms.

Summarizing, the complete adaptation algorithm can be expressed as follows:

$$W_{ij}^l(k+1) = W_{ij}^l(k) - \mu\, \delta_j^{l+1}(k)\, X_i^l(k), \qquad (35)$$

$$\delta_j^l(k) = \begin{cases} -2\, e_j(k)\, f'\big(y_j^l(k)\big) & l = L \\[4pt] f'\big(y_j^l(k)\big) \displaystyle\sum_{m=1}^{N_{l+1}} \Delta_m^{l+1}(k) \cdot W_{jm}^l & 1 \le l \le L-1. \end{cases}$$

We now have a recursive formula for the error gradients. To calculate $\delta_j^l(k)$ for a given neuron, we simply propagate the $\delta$'s from the next layer backwards through the synaptic filters which the given neuron feeds. In this sense these equations can be thought of as the temporal version of backpropagation, where the $\delta$'s are formed not by simply taking weighted sums but by backward filtering. For each new input and desired response vector we increment the forward filters one time step and the backward filters one time step. Thus, by manipulating the terms used to accumulate the error gradients, we have preserved the symmetry between the forward propagation of states and the backward propagation of the $\delta$ terms. This is illustrated in Figure 2. Again, if we replace the vectors $X$, $W$, and now $\Delta$ by scalars, the above equations reduce to the familiar backpropagation algorithm for static networks. In this sense, the equations may also be thought of as a "vector" generalization of backpropagation.

The algorithm also has added computational advantages. Each neuron requires on the order of $2NT$ multiplications, while the first algorithm (Equation 19) takes on the order of $NT^2$ multiplications⁴. This savings comes from grouping terms into products of sums instead of sums of products. In fact, with this algorithm the total number of operations continues to grow linearly with the number of layers, versus geometrically as in the first case.

The careful reader may have observed what appears to be a flaw in this algorithm: the calculations of the $\delta_j^{l-1}(k)$'s are in fact non-causal. The source of this non-causal filtering can be seen by considering the terms $\partial e^2 / \partial y_j^l(k)$. Since it takes time for the output of any internal neuron to completely propagate through the network, the change in the total error due to a change in an internal state is a function of future values within the network. Since the network is FIR, we can easily remedy the algorithm by adding a finite number of simple delay operators into the network.

⁴ $N$ = nodes per layer, $T$ = number of tap delays. $NT^2$ holds for the neurons in the first hidden layer back from the output; an explicit formula was not derived for all layers in the case of the first algorithm.
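The summary above translates almost line for line into code. The following is a hedged sketch (not the paper's implementation) of one pass of temporal backpropagation over a finite data sequence for a two-layer FIR network: computing the $\delta$'s offline after the full forward pass sidesteps the non-causality issue, whereas the on-line form described above would instead insert delay operators. The array shapes, the tanh nonlinearity, and all names are assumptions for illustration.

```python
import numpy as np

def temporal_backprop_epoch(x_seq, d_seq, W1, W2, mu=0.05, f=np.tanh):
    """One offline pass of temporal backpropagation for a two-layer FIR net.

    W1: shape (N0, N1, T1+1) -- first-layer synaptic FIR filters W^0_{ij}
    W2: shape (N1, N2, T2+1) -- second-layer synaptic FIR filters W^1_{jm}
    """
    fprime = lambda y: 1.0 - np.tanh(y) ** 2     # derivative of the default tanh
    N0, N1, T1 = W1.shape[0], W1.shape[1], W1.shape[2] - 1
    N2, T2 = W2.shape[1], W2.shape[2] - 1
    K = len(x_seq)

    # ---- forward pass: store tap-delay vectors and pre-activations ----
    X0 = np.zeros((K, N0, T1 + 1))               # X^0_i(k)
    X1 = np.zeros((K, N1, T2 + 1))               # X^1_j(k)
    Y1 = np.zeros((K, N1))                       # y^1_j(k)
    Y2 = np.zeros((K, N2))                       # y^2_m(k)
    x1_hist = np.zeros((K, N1))
    for k in range(K):
        for t in range(T1 + 1):
            if k - t >= 0:
                X0[k, :, t] = x_seq[k - t]
        Y1[k] = np.einsum('ijt,it->j', W1, X0[k])
        x1_hist[k] = f(Y1[k])
        for t in range(T2 + 1):
            if k - t >= 0:
                X1[k, :, t] = x1_hist[k - t]
        Y2[k] = np.einsum('jmt,jt->m', W2, X1[k])

    out = f(Y2)                                  # network outputs x^2_m(k)
    e = np.asarray(d_seq) - out                  # errors e_m(k)

    # ---- backward pass: deltas formed by backward filtering ----
    d2 = -2.0 * e * fprime(Y2)                   # delta^2_m(k), output layer
    d1 = np.zeros((K, N1))                       # delta^1_j(k), hidden layer
    for k in range(K):
        acc = np.zeros(N1)
        for n in range(T2 + 1):                  # future deltas delta^2(k+n)
            if k + n < K:
                acc += W2[:, :, n] @ d2[k + n]
        d1[k] = fprime(Y1[k]) * acc

    # ---- weight updates accumulated over the sequence (Equation 35) ----
    W2 -= mu * np.einsum('km,kjt->jmt', d2, X1)
    W1 -= mu * np.einsum('kj,kit->ijt', d1, X0)
    return float(np.sum(e ** 2))

# Illustrative usage with arbitrary sizes: 1 input, 3 hidden, 1 output, 3-tap filters
rng = np.random.default_rng(0)
W1 = 0.1 * rng.standard_normal((1, 3, 4))
W2 = 0.1 * rng.standard_normal((3, 1, 4))
x = rng.standard_normal(200)
d = np.tanh(0.6 * np.roll(x, 1) - 0.3 * np.roll(x, 2)).reshape(-1, 1)  # toy target
for epoch in range(10):
    mse = temporal_backprop_epoch(x, d, W1, W2, mu=0.05)
```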

4 Experimentation

To compare the two algorithms derived, we experimentally verify their equivalence. Figure 3 shows the averaged learning curves for a two-layer network modeling an unknown nonlinear system⁵. Training differed only in terms of the algorithm used.

Figure 3: Experimental learning curves, plotted against the time increment $k$. (a) Algorithm 1, instantaneous error gradient; (b) Algorithm 2, temporal backpropagation.

As can be seen, the performance of the two appears roughly equivalent. Differences are hidden in the computational efficiency of the second algorithm, as discussed earlier. Initial experimentation also seems to indicate the added benefit of less misadjustment for the second algorithm. Minor discrepancies arise due to differences in the timing at which the weights are adjusted relative to the calculation of the error gradients (mathematically, the algorithms become equivalent as $\mu \to 0$).

⁵ 1 input, 1 output, and 5 hidden units with 5-tap FIR filters for each synapse; $\mu = 0.05$.

5 Relationship to other work

The structure of the FIR networks presented here is similar to the Time-Delay Neural Networks of Waibel et al. used in speech recognition [3]. Their training, however, is based on instantaneous error gradients as derived in the first algorithm and does not fully exploit the FIR structure of the network. The temporal backpropagation algorithm presented here can, of course, simply be substituted to train their structure.

6 Conclusion

This paper has introduced an efficient gradient descent algorithm for FIR neural networks which can be considered a temporal generalization of the backpropagation algorithm. By modeling each synapse as a linear filter, the neural network as a whole may be thought of as an adaptive system with its own internal dynamics; equivalently, we may think of the network as a complex nonlinear filter. Applications should thus include areas of pattern recognition where there is an inherent temporal quality to the data, such as speech recognition. The networks should also find a natural use in areas of nonlinear control and other adaptive signal processing and filtering applications such as noise cancellation or equalization⁶. The purpose of this paper, however, was to introduce the learning algorithm itself, rather than demonstrate the full potential of the actual network structure. This will be the work of future research.

References

[1] B. Widrow and S. D. Stearns, Adaptive Signal Processing, Englewood Cliffs, NJ: Prentice Hall, 1985.
[2] D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, Cambridge, MA: The MIT Press, 1986.
[3] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. Lang, "Phoneme Recognition Using Time-Delay Neural Networks," IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 37, No. 3, pp. 328-339, March 1989.

⁶ Static neural networks are already used for many of these applications. With static networks, time relations are often created by rippling the data in time across the input of the network, or by using the output of the network to form a feedback loop. In both cases the dynamics are really external to the actual network itself.