Diagrammatic Derivation of Gradient Algorithms for Neural Networks


Communicated by Fernando Pineda

Eric A. Wan and Françoise Beaufays

Deriving gradient algorithms for time-dependent neural network structures typically requires numerous chain rule expansions, diligent bookkeeping, and careful manipulation of terms. In this paper, we show how to derive such algorithms via a set of simple block diagram manipulation rules. The approach provides a common framework to derive popular algorithms, including backpropagation and backpropagation-through-time, without a single chain rule expansion. Additional examples are provided for a variety of complicated architectures to illustrate both the generality and the simplicity of the approach.

1 Introduction

Deriving the appropriate gradient descent algorithm for a new network architecture or system configuration normally involves brute force derivative calculations. For example, the celebrated backpropagation algorithm for training feedforward neural networks was derived by repeatedly applying chain rule expansions backward through the network (Rumelhart et al. 1986; Werbos 1974; Parker 1982). However, the actual implementation of backpropagation may be viewed as a simple reversal of signal flow through the network. Another popular algorithm, backpropagation-through-time for recurrent networks, can be derived by Euler-Lagrange or ordered derivative methods, and involves both a signal flow reversal and a time reversal (Werbos 1992; Nguyen and Widrow 1989). For both of these algorithms, there is a reciprocal nature to the forward propagation of states and the backward propagation of gradient terms. Furthermore, both algorithms are efficient in the sense that calculations are order N, where N is the number of variable weights in the network. These properties are often attributed to the clever manner in which the algorithms

were derived for a specific network architecture. We will show, however, that these properties are universal to all network architectures and that the associated gradient algorithm may be formulated directly with virtually no effort. The approach consists of a simple diagrammatic method for constructing a reciprocal network that directly specifies the gradient derivation. This is in contrast to graphical methods, which simply illustrate the relationship between signal and gradient flow after derivation of the algorithm by an alternative method (Narendra and Parthasarathy 1990; Nerrand et al. 1993). The reciprocal networks are, in principle, identical to adjoint systems seen in N-stage optimal control problems (Bryson and Ho 1975). While adjoint methods have been applied to neural networks, such approaches have been restricted to specific architectures where the adjoint systems resulted from a disciplined Euler-Lagrange optimization technique (Parisini and Zoppoli 1994; Toomarian and Barhen 1992; Matsuoka 1991). Here we use a graphic approach for the complete derivation. We thus prefer the term "reciprocal" network, which further imposes certain topological constraints and is taken from the electrical network literature. The concepts detailed in this paper were developed in Wan (1993) and later presented in Wan and Beaufays (1994).¹

1.1 Network Adaptation and Error Gradient Propagation. In supervised learning, the goal is to find a set of network weights W that minimizes a cost function J = Σ_k L_k[d(k), y(k)], where k specifies a discrete time index (the actual order of presentation may be random or sequential), y(k) is the output of the network, d(k) is a desired response, and L_k is a generic error metric that may contain additional weight regularization terms. For illustrative purposes, we will work with the squared error metric L_k = e(k)^T e(k), where e(k) is the error vector.
Optimization techniques invariably require calculation of the gradient vector ∂J/∂W(k). At the architectural level, a variable weight w_ji may be isolated between two points in a network with corresponding signals a_i(k) and a_j(k) [i.e., a_j(k) = w_ji a_i(k)]. Using the chain rule, we get²

∂J/∂w_ji(k) = δ_j(k) a_i(k)    (1.1)

where we define the error gradient δ_j(k) ≜ ∂J/∂a_j(k). The error gradient δ_j(k) depends on the entire topology of the network. Specifying the gradients necessitates finding an explicit formula for calculating the delta

¹The method presented here is similar in spirit to Automatic Differentiation (Rall 1981; Griewank and Corliss 1991). Automatic Differentiation is a simple method for finding derivatives of functions and algorithms that can be represented by acyclic graphs. Our approach, however, applies to discrete-time systems with the possibility of feedback. In addition, we are concerned with diagrammatic derivations rather than computational rule-based implementations.

²In the general case of a variable parameter, we have a_j(k) = f[w_ji, a_i(k)], and equation 1.1 becomes ∂J/∂w_ji(k) = δ_j(k) ∂a_j(k)/∂w_ji(k), where the partial term depends on the form of f.

terms. Backpropagation, for example, is nothing more than an algorithm for generating these terms in a feedforward network. In the next section, we develop a simple nonalgebraic method for deriving the delta terms associated with any network architecture.

2 Network Representation and Reciprocal Construction Rules

An arbitrary neural network can be represented as a block diagram whose building blocks are: summing junctions, branching points, univariate functions, multivariate functions, and time-delay operators. Only discrete-time systems are considered. A signal located within the network is labeled a_i(k). A synaptic weight, for example, may be thought of as a linear transmittance, which is a special case of a univariate function. The basic neuron is simply a sum of linear transmittances followed by a univariate sigmoid function. Networks can then be constructed from individual neurons, and may include additional functional blocks and time-delay operators that allow for buffering of signals and internal memory. This block diagram representation is really nothing more than the typical pictorial description of a neural network with a bit of added formalism. Directly from the block diagram we may construct the reciprocal network by reversing the flow direction in the original network, labeling all resulting signals δ(k), and performing the following operations:

1. Summing junctions are replaced with branching points.

2. Branching points are replaced with summing junctions.

3. Univariate functions are replaced with their derivative transmittances.

Explicitly, the scalar continuous function a_j(k) = f[a_i(k)] is replaced by δ_i(k) = f'[a_i(k)] δ_j(k), where f'[a_i(k)] ≜ ∂a_j(k)/∂a_i(k). Note this rule replaces a nonlinear function by a linear time-dependent transmittance. Special cases are:

• Weights: a_j = w_ji a_i, in which case δ_i = w_ji δ_j.

• Activation functions: a_j(k) = tanh[a_i(k)], in which case f'[a_i(k)] = 1 - a_j²(k).

4. Multivariate functions are replaced with their Jacobians.

A multivariate function maps a vector of input signals into a vector of output signals, a_out = F(a_in). In the transformed network, we have δ_in(k) = F'[a_in(k)] δ_out(k), where F'[a_in(k)] ≜ ∂a_out(k)/∂a_in(k) corresponds to a matrix of partial derivatives. For shorthand, F'[a_in(k)] will be written simply as F'(k). Clearly both summing junctions and univariate functions are special cases of multivariate functions. A multivariate function may also represent a product junction (for sigma-pi units) or even another multilayer network. For a multilayer network, the product F'[a_in(k)] δ_out(k) is found directly by backpropagating δ_out through the network.

5. Delay operators are replaced with advance operators.

A delay operator q^{-1} performs a unit time delay on its argument: a_j(k) = q^{-1} a_i(k) = a_i(k - 1). In the reciprocal system, we form a unit time advance: δ_i(k) = q^{+1} δ_j(k) = δ_j(k + 1). The resulting system is thus noncausal. Actual implementation of the reciprocal network in a causal manner is addressed in specific examples.

6. Outputs become inputs.

By reversing the signal flow, output nodes a_o(k) = y_o(k) in the original network become input nodes in the reciprocal network. These inputs are then set at each time step to -2e_o(k). [For cost functions other than squared error, the input should be set to ∂L_k/∂y_o(k).]

These six rules allow direct construction of the reciprocal network from the original network.³ Note that there is a topological equivalence between the two networks. The order of computations in the reciprocal network is thus identical to the order of computations in the forward network. Whereas the original network corresponds to a nonlinear time-independent system (assuming the weights are fixed), the reciprocal network is a linear time-dependent system. The signals δ_i(k) that propagate through the reciprocal network correspond to the terms ∂J/∂a_i(k) necessary for gradient adaptation. Exact equations may then be "read out" directly from the reciprocal network, completing the derivation. A formal proof of the validity and generality of this method is presented in Appendix A.

3 Examples

3.1 Backpropagation. We start by rederiving standard backpropagation. Figure 1 shows a hidden neuron feeding other neurons and an output neuron in a multilayer network. For consistency with traditional notation, we have labeled the summing junction signal s_i^l rather than a_i, and added superscripts to denote the layer. In addition, since multilayer networks are static structures, we omit the time index k. The reciprocal network shown in Figure 2 is found by applying the construction rules of the previous section. From this figure, we may immediately write down the equations for calculating the delta terms:

δ_i^l = f'(s_i^l) Σ_j w_ji^{l+1} δ_j^{l+1}

These are precisely the equations describing standard backpropagation.
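The delta recursion read off the reciprocal network can be checked numerically. The following sketch (a NumPy toy; the layer sizes, data, and variable names are invented for illustration, not the paper's notation) injects -2e at the output, propagates deltas through the derivative transmittances and transposed weights, and verifies one weight gradient by finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-layer tanh network; the forward pass stores the summing-junction signals.
W1 = rng.normal(size=(3, 2))
W2 = rng.normal(size=(1, 3))
x = rng.normal(size=2)
d = np.array([0.5])

s1 = W1 @ x                      # layer-1 summing junctions
a1 = np.tanh(s1)
s2 = W2 @ a1                     # output summing junction
y = np.tanh(s2)
e = d - y

# Reciprocal network: inject -2e at the output and propagate backward through
# the derivative transmittances f'(s) = 1 - tanh(s)^2 and transposed weights
# (branching points become summing junctions).
delta2 = -2.0 * e * (1.0 - y**2)
delta1 = (W2.T @ delta2) * (1.0 - a1**2)

grad_W2 = np.outer(delta2, a1)   # dJ/dW2[i, j] = delta2_i * a1_j
grad_W1 = np.outer(delta1, x)

# Finite-difference spot check on one weight.
def J(W1_):
    return float(np.sum((d - np.tanh(W2 @ np.tanh(W1_ @ x)))**2))

eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
W1m = W1.copy(); W1m[0, 0] -= eps
assert abs(grad_W1[0, 0] - (J(W1p) - J(W1m)) / (2 * eps)) < 1e-6
```

The sign convention follows rule 6: the injected input is -2e, so the deltas are gradients of J (not negative gradients) and a descent update would subtract them.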
In this case, there are no delay operators and δ_i = δ_i(k) ≜ ∂J/∂s_i(k) =

³Clearly, this set of rules is not a minimal set; i.e., a summing junction can be considered a special case of a multivariate function. However, we choose this set for ease and clarity of construction.

∂[e^T(k)e(k)]/∂s_i(k) corresponds to an instantaneous gradient. Readers familiar with neural networks have undoubtedly seen these diagrams before. What is new is the concept that the diagrams themselves may be used directly, completely circumventing all intermediate steps involving tedious algebra.

Figure 1: Block diagram construction of a multilayer network.

Figure 2: Reciprocal multilayer network.

3.2 Backpropagation-Through-Time. For the next example, consider a network with output feedback (see Fig. 3) described by

y(k) = N[x(k), y(k - 1)]    (3.2)

where x(k) are external inputs, and y(k) represents the vector of outputs that form feedback connections. N is a multilayer neural network. If N has only one layer of neurons, every neuron output has a feedback

Figure 3: Recurrent network and backpropagation-through-time.

connection to the input of every other neuron, and the structure is referred to as a fully recurrent network (Williams and Zipser 1989). Typically, only a select set of the outputs have an actual desired response. The remaining outputs have no desired response (error equals zero) and are used for internal computation. Direct calculation of gradient terms using chain rule expansions is extremely complicated. A weight perturbation at a specified time step affects not only the output at future time steps, but future inputs as well. However, applying the reciprocal construction rules (see Fig. 3), we find immediately:

δ(k) = -2e(k) + N'(k) δ(k + 1)    (3.3)

These are precisely the equations describing backpropagation-through-time, which have been derived in the past using either ordered derivatives (Werbos 1974) or Euler-Lagrange techniques (Plumer 1993). The diagrammatic approach is by far the simplest and most direct method. Note that the causality constraints require these equations to be run backward in time. This implies a forward sweep of the system to generate the output states and internal activation values, followed by a backward sweep through the reciprocal network. Also, from rule 4 in the previous section, the product N'(k)δ(k + 1) may be calculated directly by a standard backpropagation of δ(k + 1) through the network at time k.⁴

⁴Backpropagation-through-time is viewed as an off-line gradient descent algorithm in which weight updates are made after each presentation of an entire training sequence. An on-line version in which adaptation occurs at each time step is possible using an algorithm called real-time backpropagation (Williams and Zipser 1989). The algorithm, however, is far more computationally expensive. The authors have presented a method based on flow graph interreciprocity to directly relate the two algorithms (Beaufays and Wan 1994a).
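The forward/backward sweep can be sketched numerically for a single-layer fully recurrent net of the form y(k) = tanh(Wx x(k) + Wy y(k-1)). This is a minimal toy (sizes, data, and variable names are invented for illustration), with the backward sweep run backward in time as required by causality:

```python
import numpy as np

rng = np.random.default_rng(1)

n_in, n_out, K = 2, 3, 5
Wx = rng.normal(size=(n_out, n_in)) * 0.5
Wy = rng.normal(size=(n_out, n_out)) * 0.5
x = rng.normal(size=(K, n_in))
d = rng.normal(size=(K, n_out))

# Forward sweep: store states for the backward pass.
y = np.zeros((K + 1, n_out))           # y[0] holds y(-1) = 0
for k in range(K):
    y[k + 1] = np.tanh(Wx @ x[k] + Wy @ y[k])
e = d - y[1:]

# Backward sweep: delta(k) = -2 e(k) + (feedback of delta(k+1) through N'),
# with delta vanishing beyond the final time K.
delta = np.zeros(n_out)
gWx = np.zeros_like(Wx); gWy = np.zeros_like(Wy)
for k in reversed(range(K)):
    dy = -2.0 * e[k] + delta           # gradient at the output node y(k)
    ds = dy * (1.0 - y[k + 1]**2)      # through the derivative transmittance
    gWx += np.outer(ds, x[k])
    gWy += np.outer(ds, y[k])          # y[k] is y(k-1)
    delta = Wy.T @ ds                  # routed back along the feedback path

# Finite-difference spot check on one recurrent weight.
def J(Wy_):
    yk = np.zeros(n_out); tot = 0.0
    for k in range(K):
        yk = np.tanh(Wx @ x[k] + Wy_ @ yk)
        tot += np.sum((d[k] - yk)**2)
    return tot

eps = 1e-6
Wp = Wy.copy(); Wp[0, 1] += eps
Wm = Wy.copy(); Wm[0, 1] -= eps
assert abs(gWy[0, 1] - (J(Wp) - J(Wm)) / (2 * eps)) < 1e-5
```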

Figure 4: Neural network control using nonlinear ARMA models.

3.2.2 Backpropagation-Through-Time and Neural Control Architectures. Backpropagation-through-time can be extended to a number of neural control architectures (Nguyen and Widrow 1989; Werbos 1992). A system may be configured using full-state feedback or more complicated ARMA (AutoRegressive Moving Average) models, as illustrated in Figure 4. To adapt the weights of the controller, it is necessary to find the gradient terms that constitute the effective error for the neural network. Figure 5 illustrates how such terms may be directly acquired using the reciprocal network. A variety of other recurrent architectures may be considered, from hierarchical networks to radial basis networks with feedback. In all cases, the diagrammatic approach provides a direct derivation of the gradient algorithm.

3.3 Cascaded Neural Networks. Let us now turn to an example of two cascaded neural networks (Fig. 6), which will further illustrate advantages of the diagrammatic approach. The inputs to the first network are samples from a time sequence x(k). Delayed outputs of the first network are fed to the second network. The cascaded networks are defined as

u(k) = N_1[W_1, x(k), x(k - 1), x(k - 2)]    (3.4)
y(k) = N_2[W_2, u(k), u(k - 1), u(k - 2)]    (3.5)

where W_1 and W_2 represent the weights parameterizing the networks, y(k) is the output, and u(k) the intermediate signal. Given a desired response for the output y of the second network, it is straightforward to use backpropagation for adapting the second network. It is not obvious, however, what the effective error should be for the first network.

Figure 5: Reciprocal network for control using nonlinear ARMA models.

Figure 6: Cascaded neural network filters and reciprocal counterpart.

From the reciprocal network, also shown in Figure 6, we simply label the desired signals and write down the gradient relations:

δ_u(k) = δ_1(k) + δ_2(k + 1) + δ_3(k + 2)    (3.6)

with

[δ_1(k) δ_2(k) δ_3(k)] = -2e(k)^T N_2'[u(k)]    (3.7)

i.e., each δ_i(k) is found by backpropagation through the output network, and the δ's (after appropriate advance operations) are summed together. The gradient for the weights in the first network is thus given by

∂L_k/∂W_1 = δ_u(k) ∂u(k)/∂W_1    (3.8)

in which the product term is found by a single backpropagation with δ_u(k) acting as the error to the first network. The equations can be made causal by simply delaying the weight update for a few time steps. Clearly, extrapolating to an arbitrary number of taps is also straightforward.

For comparison, let us consider the brute force derivative approach to finding the gradient. Using the chain rule, the instantaneous error gradient is evaluated as

∂L_k/∂W_1 = -2e(k)^T [∂y(k)/∂u(k) · ∂u(k)/∂W_1 + ∂y(k)/∂u(k-1) · ∂u(k-1)/∂W_1 + ∂y(k)/∂u(k-2) · ∂u(k-2)/∂W_1]    (3.9)

= δ_1(k) ∂u(k)/∂W_1 + δ_2(k) ∂u(k-1)/∂W_1 + δ_3(k) ∂u(k-2)/∂W_1    (3.10)

where we define δ_{i+1}(k) ≜ -2e(k)^T ∂y(k)/∂u(k-i). Again, the δ_i terms are found simultaneously by a single backpropagation of the error through the second network. Each product δ_{i+1}(k) ∂u(k - i)/∂W_1 is then found by backpropagation applied to the first network with δ_{i+1}(k) acting as the error. However, since the derivatives used in backpropagation are time-dependent, separate backpropagations are necessary for each δ_{i+1}(k). These equations, in fact, imply backpropagation through an unfolded structure and are equivalent to weight sharing (LeCun et al. 1989), as illustrated in Figure 7. In situations where there may be hundreds of taps in the second network, this algorithm is far less efficient than the one derived directly using reciprocal networks. Similar arguments can be used to derive an efficient on-line algorithm for adapting time-delay neural networks (Waibel et al. 1989).
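The advance-and-sum relation of equation 3.6 can be checked numerically. In this sketch, the second network is reduced to a single tanh unit over a 3-tap window of u (a stand-in for N_2; the weights and data are invented), and δ_u(k) is compared against finite-difference estimates of ∂J/∂u(k):

```python
import numpy as np

rng = np.random.default_rng(2)

K = 8
u = rng.normal(size=K)                 # intermediate signal (output of N1)
d = rng.normal(size=K)
w2 = rng.normal(size=3)                # hypothetical 3-tap "N2" weights

def forward(u_):
    # y(k) = tanh(w2 . [u(k), u(k-1), u(k-2)]) with zero initial conditions.
    y = np.zeros(K)
    for k in range(K):
        taps = np.array([u_[k - i] if k - i >= 0 else 0.0 for i in range(3)])
        y[k] = np.tanh(w2 @ taps)
    return y

y = forward(u)
e = d - y

# One backpropagation through "N2" at time k yields [d1(k), d2(k), d3(k)],
# the gradients of L_k with respect to u(k), u(k-1), u(k-2).
dcomp = np.zeros((K, 3))
for k in range(K):
    dcomp[k] = -2.0 * e[k] * (1.0 - y[k]**2) * w2

# Equation 3.6: delta_u(k) = d1(k) + d2(k+1) + d3(k+2) (advance operators).
delta_u = np.zeros(K)
for k in range(K):
    delta_u[k] = dcomp[k, 0]
    if k + 1 < K: delta_u[k] += dcomp[k + 1, 1]
    if k + 2 < K: delta_u[k] += dcomp[k + 2, 2]

# Finite-difference check of dJ/du(k), J = sum_k e(k)^2.
eps = 1e-6
for k in range(K):
    up = u.copy(); up[k] += eps
    um = u.copy(); um[k] -= eps
    fd = (np.sum((d - forward(up))**2) - np.sum((d - forward(um))**2)) / (2 * eps)
    assert abs(delta_u[k] - fd) < 1e-6
```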

Figure 7: Cascaded neural network filters unfolded-in-time.

3.4 Temporal Backpropagation. An extension of the feedforward network can be constructed by replacing all scalar weights with discrete-time linear filters to provide dynamic interconnectivity between neurons. Mathematically, a neuron i in layer l may be specified as

s_i^l(k) = Σ_j W_ij^l(q^{-1}) a_j^{l-1}(k)    (3.11)

a_i^l(k) = f[s_i^l(k)]    (3.12)

where a(k) are neuron output values, s(k) are summing junctions, f(·) are sigmoid functions, and W(q^{-1}) are synaptic filters.⁵ Three possible forms for W(q^{-1}) are

Case I:   W(q^{-1}) = w
Case II:  W(q^{-1}) = Σ_{m=0}^{M} w(m) q^{-m}    (3.13)
Case III: W(q^{-1}) a ratio of polynomials in q^{-1}

⁵The time domain operator q^{-1} is used instead of the more common z-domain variable z^{-1}. The z notation would imply an actual transfer function, which does not apply in nonlinear systems.
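For Case II, the forward and reciprocal filtering operations can be sketched as follows. This is a hedged toy with one FIR synapse per layer (sizes, data, and variable names are invented): the deltas are obtained by filtering backward through the time-reversed filter, and one coefficient gradient is checked by finite differences:

```python
import numpy as np

rng = np.random.default_rng(3)

K, M = 12, 3                            # sequence length, number of FIR taps
x = rng.normal(size=K)
d = rng.normal(size=K)
w1 = rng.normal(size=M) * 0.5           # synaptic filter, layer 1
w2 = rng.normal(size=M) * 0.5           # synaptic filter, layer 2

def fir(w, u):
    # Causal FIR synapse: s(k) = sum_m w[m] u(k-m), zero initial conditions.
    return np.array([sum(w[m] * u[k - m] for m in range(M) if k - m >= 0)
                     for k in range(len(u))])

s1 = fir(w1, x);  a1 = np.tanh(s1)
s2 = fir(w2, a1); y = np.tanh(s2)
e = d - y

# Output layer delta: -2 e(k) f'[s2(k)].
delta2 = -2.0 * e * (1.0 - y**2)

# Hidden layer: deltas filtered *backward* through the noncausal reciprocal
# filter, delta1(k) = f'[s1(k)] sum_m w2[m] delta2(k+m).
delta1 = np.array([sum(w2[m] * delta2[k + m] for m in range(M) if k + m < K)
                   for k in range(K)]) * (1.0 - a1**2)

# Coefficient gradients: dJ/dw1[m] = sum_k delta1(k) x(k-m).
grad_w1 = np.array([sum(delta1[k] * x[k - m] for k in range(K) if k - m >= 0)
                    for m in range(M)])

# Finite-difference spot check.
def J(w1_):
    return float(np.sum((d - np.tanh(fir(w2, np.tanh(fir(w1_, x)))))**2))

eps = 1e-6
wp = w1.copy(); wp[1] += eps
wm = w1.copy(); wm[1] -= eps
assert abs(grad_w1[1] - (J(wp) - J(wm)) / (2 * eps)) < 1e-5
```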

Figure 8: Block diagram construction of an FIR network and corresponding reciprocal structure.

In Case I, the filter reduces to a scalar weight, and we have the standard definition of a neuron for feedforward networks. Case II corresponds to a Finite Impulse Response (FIR) filter, in which the synapse forms a weighted sum of past values of its input. Such networks have been utilized for a number of time-series and system identification problems (Wan 1993a,b,c). Case III represents the more general Infinite Impulse

Figure 9: IIR filter realizations: (a) controller canonical, (b) reciprocal observer canonical, (c) lattice, (d) reciprocal lattice.

Response (IIR) filter, in which feedback is permitted. In all cases, coefficients are assumed to be adaptive. Figure 8 illustrates a network composed of FIR filter synapses realized as tap-delay lines. Deriving the gradient descent rule for adapting filter coefficients is quite formidable if we use a direct chain rule approach. However, using the construction rules described earlier, we may trivially form the reciprocal network, also shown in Figure 8. By inspection we have

δ_j^l(k) = f'[s_j^l(k)] Σ_i W_ij^{l+1}(q^{+1}) δ_i^{l+1}(k)    (3.14)

Consideration of an output neuron at layer L yields δ_i^L(k) = -2e_i(k) f'[s_i^L(k)]. These equations define the algorithm known as temporal backpropagation (Wan 1993a,b). The algorithm may be viewed as a temporal generalization of backpropagation in which error gradients are propagated not by simply taking weighted sums, but by backward filtering. Note that in the reciprocal network, backpropagation is achieved through the reciprocal filters W(q^{+1}). Since this is a noncausal filter, it is necessary to introduce a delay of a few time steps to implement the on-line adaptation. In the IIR case, it is easy to verify that equation 3.14 for temporal backpropagation still applies, with W(q^{+1}) representing a noncausal IIR filter. As with backpropagation-through-time, the network must be trained using a forward and a backward sweep, necessitating storage of all activation values at each step in time. Different realizations for the filters dictate

how signals flow through the reciprocal structure, as illustrated in Figure 9. In all cases, computations remain order N (this is in contrast with the order N² algorithms derived by Back and Tsoi (1991) using direct chain rule methods). Note that the poles of the forward IIR filters are reciprocal to the poles of the reciprocal filters. Stability monitoring can be made easier if we consider lattice realizations, in which case stability is guaranteed if the magnitude of each coefficient is less than 1. Regardless of the choice of filter realization, reciprocal networks provide a simple unified approach for deriving a learning algorithm.⁶ The above examples allow us to extrapolate the following additional construction rule: any linear subsystem H(q^{-1}) in the original network is transformed to H(q^{+1}) in the reciprocal system.

4 Summary

The previous examples served to illustrate the ease with which algorithms may be derived using the diagrammatic approach. One starts with a diagrammatic representation of the network of interest. A reciprocal network is then constructed by simply swapping summing junctions with branching points, continuous functions with derivative transmittances, and time delays with time advances. The final algorithm is read directly off the reciprocal network. No messy chain rules are needed. The approach provides a unified framework for formally deriving gradient algorithms for arbitrary network architectures, network configurations, and systems.

Appendix A: Proof of Reciprocal Construction Rules

We show that the diagrammatic method constitutes a formal derivation for arbitrary network architectures. Intuitively, the chain rule applied to individual building blocks yields the reciprocal architecture. However, delay operators, which cannot be differentiated, as well as feedback, prevent a straightforward chain rule approach to the proof.
Instead, we use a more rigorous approach that may be outlined as follows: (1) It is argued that a perturbation applied to a specific node in the network propagates through a derivative network that is topologically equivalent to the original network. (2) The derivative network is systematically unraveled in time to produce a linear time-independent flow graph. (3) Next, the principle of flow graph interreciprocity is evoked to reverse the signal flow through the unraveled network. (4) The reverse flow graph is

⁶In related work (Beaufays and Wan 1994b), the diagrammatic method was used to derive an algorithm to minimize the output power at each stage of an FIR lattice filter. This provides an adaptive lattice predictor used as a decorrelating preprocessor to a second adaptive filter. The new algorithm is more effective than the Griffiths algorithm (Griffiths 1977).

raveled back in time to produce the reciprocal network. The input, originally corresponding to a perturbation, becomes an output providing the desired gradient. (5) By symmetry, it is argued that all signals in the reciprocal network correspond to proper gradient terms.

1. We will initially assume that only univariate functions exist within the network. This is by no means restrictive. It has been shown (Hornik et al. 1989; Cybenko 1989) that a feedforward network with two or more layers and a sufficient number of internal neurons can approximate any uniformly continuous multivariate function to an arbitrary accuracy. A feedforward network is, of course, composed of simple univariate functions and summing junctions. Thus any multivariate function in the overall network architecture is assumed to be well approximated using a univariate composition. We may completely specify the topology of a network by the set of equations

a_j(k) = Σ_i T_ji ∘ a_i(k)    (A.1)

T_ji ∈ {f(·), q^{-1}}    (A.2)

where a_j(k) is the signal corresponding to the node a_j at time k. The sum is taken over all signals a_i(k) that connect to a_j(k), and T_ji is a transmittance operator corresponding to either a univariate function (e.g., sigmoid function, constant multiplicative weight) or a delay operator. (The symbol ∘ is used to remind us that T is an operator whose argument is a.) The signals a_i(k) may correspond to inputs, outputs, or internal signals of the network. Feedback of signals is permitted.

Let us add to a specific node a* a perturbation Δa*(k) at time k. The perturbation propagates through the network, resulting in effective perturbations Δa_i(k) for all nodes in the network. Through a continuous univariate function in which a_j(k) = f[a_i(k)] we have, to first order,

Δa_j(k) = f'[a_i(k)] Δa_i(k)    (A.3)

where it must be clearly understood that Δa_i(k) and Δa_j(k) are the perturbations directly resulting from the external perturbation Δa*(k).
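The first-order relation A.3 can be checked directly; tanh and the operating point below are arbitrary choices for illustration:

```python
import numpy as np

# Equation A.3: a perturbation through a univariate block a_j = f(a_i)
# scales, to first order, by the derivative transmittance f'(a_i).
f = np.tanh
a_i = 0.3
da = 1e-6                         # small external perturbation
lhs = f(a_i + da) - f(a_i)        # actual propagated perturbation
rhs = (1.0 - f(a_i)**2) * da      # derivative-network prediction f'(a_i) * da
assert abs(lhs - rhs) < 1e-11     # residual is second order in da
```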
Through a delay operator, a_j(k) = q^{-1} a_i(k) = a_i(k - 1), we have

Δa_j(k) = q^{-1} Δa_i(k) = Δa_i(k - 1)    (A.4)

Combining these two results with equation A.1 gives

Δa_j(k) = Σ_i T'_ji ∘ Δa_i(k)    (A.5)

where we define T' ∈ {f'[a_i(k)], q^{-1}}. Note that f'[a_i(k)] is a linear time-dependent transmittance. Equation A.5 defines a derivative network that is

Figure 10: (a) Time-dependent input/output system for the derivative network. (b) Same system with all delays drawn externally.

topologically identical to the original network (i.e., there is a one-to-one correspondence between signals and connections). Functions are simply replaced by their derivatives. This is a rather obvious result, and simply states that a perturbation propagates through the same connections and in the same direction as would normal signals.

2. The derivative network may be considered a time-dependent system with input Δa*(k) and outputs Δy(k), as illustrated in Figure 10a. Imagine now redrawing the network such that all delay operators q^{-1} are dragged outside the functional block (Fig. 10b). Equation A.5 still applies. Neither the definition nor the topology of the network has been changed. However, we may now remove the delay operators by cascading copies of the derivative network as illustrated in Figure 11. Each stage has a different set of transmittance values corresponding to the time step. The unraveling process stops at the final time K (K is allowed to approach ∞). Additionally, the outputs Δy(n) at each stage are multiplied by -2e(n)^T and then summed over all stages to produce a single output ΔJ ≜ Σ_n -2e(n)^T Δy(n). By removing the delay operators, the time index k may now be treated as simply a labeling index. The unraveled structure is thus a time-independent linear flow graph. By linearity, all signals in the flow graph can be divided by Δa*(k) so that the input is now 1, and the output is ΔJ/Δa*(k). In the limit of small Δa*(k),

ΔJ/Δa*(k) → ∂J/∂a*(k)    (A.6)

Since the system is causal, the partial of the error at time n with respect

198 Eric A. Wan and Françoise Beaufays

Figure 11: Flow graph corresponding to the unraveled derivative network.

Figure 12: Transposed flow graph corresponding to the unraveled derivative network.

to a*(k) is zero for n < k. Thus

∂J/∂a*(k) = Σ_{n=1}^{K} ∂[e(n)^T e(n)]/∂a*(k) = Σ_{n=k}^{K} ∂[e(n)^T e(n)]/∂a*(k) ≜ δ*(k).   (A.7)

The term δ*(k) is precisely what we were interested in finding. However, calculating all the δ_j(k) terms would require separately propagating signals through the unraveled network, with an input of 1 at each location associated with a_j(k). The entire process would then have to be repeated at every time step.

3. Next, take the unraveled network (i.e., the flow graph) and form its transpose. This is accomplished by reversing the signal-flow direction, transposing the branch gains, replacing summing junctions by branching points and vice versa, and interchanging input and output nodes. The new flow graph is represented in Figure 12.
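The transposition of step 3 can be sketched on a small linear flow graph. Assuming an invented 4-node acyclic graph with branch-gain matrix A (acyclic, hence nilpotent, so the signal series terminates), the source-to-sink transfer is unchanged when the gains are transposed and the input and output nodes are interchanged.

```python
# Linear flow graph: branch-gain matrix A (A[i][j] is the gain of the
# branch j -> i), source vector b, sink vector c.  The transfer is
# c^T (I + A + A^2 + ...) b.  Gains below are invented for illustration.
N = 4
A = [[0.0] * N for _ in range(N)]
A[1][0] = 0.5
A[2][1] = -1.2
A[3][1] = 0.3
A[3][2] = 2.0
b = [1.0, 0.0, 0.0, 0.0]            # inject at node 0
c = [0.0, 0.0, 0.0, 1.0]            # read out at node 3

def transfer(A, b, c):
    # Accumulate the output by repeated propagation; since A is
    # nilpotent the series terminates after at most N steps.
    sig = b[:]
    total = sum(ci * si for ci, si in zip(c, sig))
    for _ in range(N):
        sig = [sum(A[i][j] * sig[j] for j in range(N)) for i in range(N)]
        total += sum(ci * si for ci, si in zip(c, sig))
    return total

AT = [[A[j][i] for j in range(N)] for i in range(N)]
fwd = transfer(A, b, c)
rev = transfer(AT, c, b)            # transposed graph, input/output swapped
print(fwd, rev)                     # identical transfers (interreciprocity)
```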

From the work of Tellegen (1952) and Bordewijk (1956), we know that transposed flow graphs are a particular case of interreciprocal graphs. This means that the output obtained in one graph, when exciting the input with a given signal, equals the output of the transposed graph when its input is excited by the same signal. In other words, the two graphs have identical transfer functions.7 Thus, if an input of 1.0 is distributed along the lower horizontal branch of the transposed graph, the final output will equal δ*(k). This δ*(k) is identical to the output of our original flow graph.

4. The transposed graph can now be raveled back in time to produce the reciprocal network. Since the direction of signal flow has been reversed, delay operators q^(-1) become advance operators q^(+1). The node a*(k) that was the original source of an input perturbation is now the output δ*(k), as desired. The outputs of the original network become inputs with value -2e(k).

Summarizing the steps involved in finding the reciprocal network: we start with the original network, form the derivative network, unravel in time, transpose, and then ravel back up in time. These steps are accomplished directly by starting with the original network and simply swapping branching points and summing junctions (rules 1 and 2), functions f for derivatives f' (rule 3), and q^(-1)'s for q^(+1)'s (rule 5).

5. Note that the selection of the specific node a*(k) is completely arbitrary. Had we started with any node a_j(k), we would have arrived at the same result. In all cases, the input to the reciprocal network would still be -2e(k). Thus, by symmetry, every signal in the reciprocal network provides δ_j(k) = ∂J/∂a_j(k). This is exactly what we set out to prove for the signal flow in the reciprocal network.
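For the same hypothetical one-neuron model used above, the reciprocal network amounts to a single anticausal sweep: the advance operators q^(+1) make δ(k) depend on δ(k+1). The recursion below is a sketch of this idea for that one invented model, not the paper's general algorithm; it recovers every δ(k) = ∂J/∂a(k) in one backward pass.

```python
import math

# One-neuron model a(k) = tanh(w*x(k) + v*a(k-1)), J = sum_n (d(n)-a(n))^2.
w, v = 0.7, 0.4
x = [0.5, -1.0, 0.8, 0.3]
d = [0.2, 0.1, -0.3, 0.4]
K = len(x)

a, a_hist, u_hist = 0.0, [], []
for xk in x:
    u = w * xk + v * a
    a = math.tanh(u)
    a_hist.append(a)
    u_hist.append(u)
e = [dk - ak for dk, ak in zip(d, a_hist)]

# Reciprocal-network sweep (anticausal):
# delta(k) = -2 e(k) + v * f'(u(k+1)) * delta(k+1).
delta = [0.0] * K
for k in reversed(range(K)):
    delta[k] = -2.0 * e[k]
    if k + 1 < K:
        delta[k] += v * (1.0 - math.tanh(u_hist[k + 1]) ** 2) * delta[k + 1]

# Spot-check delta(0) against a finite difference on J.
def cost(perturb_k, eps):
    a, J = 0.0, 0.0
    for k, xk in enumerate(x):
        a = math.tanh(w * xk + v * a)
        if k == perturb_k:
            a += eps
        J += (d[k] - a) ** 2
    return J

eps = 1e-6
fd0 = (cost(0, eps) - cost(0, 0.0)) / eps
print(delta[0], fd0)
```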
Finally, for multivariate functions, a_out(k) = F[a_in(k)], where by our earlier statements it was assumed that the function was explicitly represented by a composition of summing junctions and univariate functions. F is restricted to be memoryless and thus cannot contain any delay operators. In the reciprocal network, a_in(k) becomes δ_in(k) and a_out(k) becomes δ_out(k). But

δ_in(k) = [∂a_out(k)/∂a_in(k)]^T δ_out(k) = F'[a_in(k)]^T δ_out(k).

Thus any section within a network that contains no delays may be replaced by the multivariate function F(·), and the corresponding section in the reciprocal network is replaced with the Jacobian F'[a_in(k)]. This verifies rule 4.

7 This basic property, which was first presented in the context of electrical circuit analysis (Penfield et al. 1970), finds applications in a wide variety of engineering disciplines, such as the reciprocity of emitting and receiving antennas in electromagnetism (Ramo et al. 1984), and the duality between the decimation-in-time and decimation-in-frequency formulations of the FFT algorithm in signal processing (Oppenheim and Schafer 1989). Flow graph interreciprocity was first applied to neural networks, to relate real-time backpropagation and backpropagation-through-time, by Beaufays and Wan (1994a).
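Rule 4 can be checked directly. For an invented memoryless two-input/two-output block F (chosen only for illustration), multiplying δ_out by the transposed Jacobian reproduces the finite-difference value of δ_in.

```python
import math

# Invented memoryless block F: R^2 -> R^2.
def F(a):
    return [math.tanh(a[0] + 2.0 * a[1]), a[0] * a[1]]

def jacobian_T_times(a, delta_out):
    # Analytic Jacobian of F, transposed, applied to delta_out:
    # delta_in = F'(a)^T delta_out.
    s = 1.0 - math.tanh(a[0] + 2.0 * a[1]) ** 2
    JT = [[s, a[1]],
          [2.0 * s, a[0]]]
    return [sum(JT[i][j] * delta_out[j] for j in range(2)) for i in range(2)]

a_in = [0.3, -0.7]
delta_out = [1.5, -0.5]
delta_in = jacobian_T_times(a_in, delta_out)

# Finite-difference check: delta_in[i] = sum_j (dF_j/da_i) * delta_out[j].
eps = 1e-6
F0 = F(a_in)
fd = []
for i in range(2):
    ap = a_in[:]
    ap[i] += eps
    Fp = F(ap)
    fd.append(sum((Fp[j] - F0[j]) / eps * delta_out[j] for j in range(2)))
print(delta_in, fd)
```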

Acknowledgments

This work was funded in part by EPRI under Contract RP801013 and by NSF under Grants IRI 91-12531 and ECS-9410823.

References

Back, A., and Tsoi, A. 1991. FIR and IIR synapses, a new neural network architecture for time series modeling. Neural Comp. 3(3), 375-385.

Beaufays, F., and Wan, E. 1994a. Relating real-time backpropagation and backpropagation-through-time: An application of flow graph interreciprocity. Neural Comp. 6(2), 296-306.

Beaufays, F., and Wan, E. 1994b. An efficient first-order stochastic algorithm for lattice filters. ICANN'94 2, 1021-1024.

Bordewijk, J. 1956. Inter-reciprocity applied to electrical networks. Appl. Sci. Res. 6B, 1-74.

Bryson, A., and Ho, Y. 1975. Applied Optimal Control. Hemisphere, New York.

Cybenko, G. 1989. Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. 2(4).

Griewank, A., and Corliss, G. (eds.) 1991. Automatic differentiation of algorithms: Theory, implementation, and application. Proc. First SIAM Workshop on Automatic Differentiation, Breckenridge, Colorado.

Griffiths, L. 1977. A continuously adaptive filter implemented as a lattice structure. Proc. ICASSP, Hartford, CT, 683-686.

Hornik, K., Stinchcombe, M., and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2, 359-366.

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. 1989. Backpropagation applied to handwritten zip code recognition. Neural Comp. 1, 541-551.

Matsuoka, K. 1991. Learning of neural networks using their adjoint systems. Syst. Computers Jpn. 22(11), 31-41.

Narendra, K., and Parthasarathy, K. 1990. Identification and control of dynamic systems using neural networks. IEEE Trans. Neural Networks 1(1), 4-27.

Nerrand, O., Roussel-Ragot, P., Personnaz, L., Dreyfus, G., and Marcos, S. 1993.
Neural networks and nonlinear adaptive filtering: Unifying concepts and new algorithms. Neural Comp. 5(2), 165-199.

Nguyen, D., and Widrow, B. 1989. The truck backer-upper: An example of self-learning in neural networks. Proc. Int. Joint Conf. on Neural Networks, II, Washington, DC, 357-363.

Oppenheim, A., and Schafer, R. 1989. Digital Signal Processing. Prentice-Hall, Englewood Cliffs, NJ.

Parisini, T., and Zoppoli, R. 1994. Neural networks for feedback feedforward nonlinear control systems. IEEE Trans. Neural Networks 5(3), 436-439.

Parker, D. 1982. Learning-Logic. Invention Report S81-64, File 1, Office of Technology Licensing, Stanford University, October.

Penfield, P., Spence, R., and Duinker, S. 1970. Tellegen's Theorem and Electrical Networks. MIT Press, Cambridge, MA.

Plumer, E. 1993. Optimal terminal control using feedforward neural networks. Ph.D. dissertation, Stanford University, Stanford, CA.

Rall, L. B. 1981. Automatic Differentiation: Techniques and Applications. Lecture Notes in Computer Science, Springer-Verlag, Berlin.

Ramo, S., Whinnery, J. R., and Van Duzer, T. 1984. Fields and Waves in Communication Electronics, 2nd ed. John Wiley, New York.

Rumelhart, D. E., McClelland, J. L., and the PDP Research Group. 1986. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1. MIT Press, Cambridge, MA.

Tellegen, B. D. H. 1952. A general network theorem, with applications. Philips Res. Rep. 7, 259-269.

Toomarian, N. B., and Barhen, J. 1992. Learning a trajectory using adjoint function and teacher forcing. Neural Networks 5(3), 473-484.

Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., and Lang, K. 1989. Phoneme recognition using time-delay neural networks. IEEE Trans. Acoustics, Speech, Signal Process. 37(3), 328-339.

Wan, E. 1993a. Finite impulse response neural networks with applications in time series prediction. Ph.D. dissertation, Stanford University, Stanford, CA.

Wan, E. 1993b. Time series prediction using a connectionist network with internal delay lines. In Time Series Prediction: Forecasting the Future and Understanding the Past, A. Weigend and N. Gershenfeld, eds. Addison-Wesley, Reading, MA.

Wan, E. 1993c. Modeling nonlinear dynamics with neural networks: Examples in time series prediction. Proc. Fifth Workshop on Neural Networks: Academic/Industrial/NASA/Defense, WNN93/FNN93, pp. 327-332, San Francisco.

Wan, E., and Beaufays, F. 1994. Network reciprocity: A simple approach to derive gradient algorithms for arbitrary neural network structures. WCNN'94, San Diego, CA, III, 87-93.

Werbos, P. 1974.
Beyond regression: New tools for prediction and analysis in the behavioral sciences. Ph.D. thesis, Harvard University, Cambridge, MA.

Werbos, P. 1992. Neurocontrol and supervised learning: An overview and evaluation. In Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, chap. 3, D. White and D. Sofge, eds. Van Nostrand Reinhold, New York.

Williams, R., and Zipser, D. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Comp. 1, 270-280.

Received April 29, 1994; accepted May 16, 1995.