Relating Real-Time Backpropagation and Backpropagation-Through-Time: An Application of Flow Graph Interreciprocity


Neural Computation, 1994

Francoise Beaufays and Eric A. Wan

The authors are with the Department of Electrical Engineering, Stanford University, Stanford, CA 94305-4055. This work was sponsored by EPRI under contract RP8010-13.

Abstract

We show that signal flow graph theory provides a simple way to relate two popular algorithms used for adapting dynamic neural networks: real-time backpropagation and backpropagation-through-time. Starting with the flow graph for real-time backpropagation, we use a simple transposition to produce a second graph. The new graph is shown to be interreciprocal with the original and to correspond to the backpropagation-through-time algorithm. Interreciprocity provides a theoretical argument to verify that both flow graphs implement the same overall weight update.

Introduction

Two adaptive algorithms, real-time backpropagation (RTBP) and backpropagation-through-time (BPTT), are currently used to train multilayer neural networks with output feedback connections. RTBP was first introduced for single-layer fully recurrent networks by Williams and Zipser (1989). The algorithm has since been extended to include feedforward networks with output feedback (see, e.g., Narendra, 1990). The algorithm is sometimes referred to as real-time recurrent learning, on-line backpropagation, or dynamic backpropagation (Williams and Zipser, 1989; Narendra et al., 1990; Hertz et al., 1991). The name recurrent backpropagation is also occasionally used, although this should not be confused with recurrent backpropagation as developed by Pineda (1987) for learning fixed points in feedback networks. RTBP is well suited for on-line adaptation of dynamic networks where a desired response is specified at each time step. BPTT (Rumelhart et al., 1986; Nguyen and Widrow, 1990; Werbos, 1990), on the other hand, involves unfolding the network in time and applying standard backpropagation through the unraveled system. It does not allow for on-line adaptation as in RTBP, but has been shown to be computationally less expensive.

Both algorithms attempt to minimize the same performance criterion and are equivalent in terms of what they compute (assuming all weight changes are made off-line). However, they are generally derived independently and take on very different mathematical formulations. In this paper, we use flow graph theory as a common framework for relating the two algorithms. We begin by deriving a general flow graph diagram for the weight updates associated with RTBP. A second flow graph is obtained by transposing the original one, i.e., by reversing the arrows that link the graph nodes and by interchanging the source and sink nodes. Flow graph theory shows that transposed flow graphs are interreciprocal and, for single-input single-output (SISO) systems, have identical transfer functions. This basic property, which was first presented in the context of electrical circuit analysis (Penfield et al., 1970), finds applications in a wide variety of engineering disciplines, such as the reciprocity of emitting and receiving antennas in electromagnetism (Ramo et al., 1984), the relationship between controller and observer canonical forms in control theory (Kailath, 1980), and the duality between the decimation-in-time and decimation-in-frequency formulations of the FFT algorithm in signal processing (Oppenheim and Schafer, 1989).

The transposed flow graph is shown to correspond directly to the BPTT algorithm. The interreciprocity of the two flow graphs allows us to verify that RTBP and BPTT perform the same overall computations. These principles are then extended to a more elaborate control feedback structure.

Network Equations

A neural network with output recurrence is shown in Figure 1. Let r(k-1) denote the vector of external reference inputs to the network and x(k-1) the recurrent inputs. The output vector x(k) is a function of the recurrent and external inputs, and of the adaptive weights w of the network:

x(k) = N(x(k-1), r(k-1), w).    (1)

Figure 1: Recurrent neural network (q represents a unit delay operator).

The neural network N is most generally a feedforward multilayer architecture (Rumelhart et al., 1986). If N has only a single layer of neurons, the structure of Figure 1 represents a completely recurrent network (Williams and Zipser, 1989; Pineda, 1987). Any connectionist architecture with feedback units can, in fact, be represented in this standard format (Piche, 1993).

Adapting the neural network amounts to finding the set of weights w that minimizes the cost function

J = \frac{1}{2} E\left[ \sum_{k=1}^{K} e(k)^T e(k) \right] = \frac{1}{2} \sum_{k=1}^{K} E\left[ e(k)^T e(k) \right],    (2)

where the expectation E[.] is taken over the external reference inputs r(k) and over the initial values of the recurrent inputs x(0). The error e(k) is defined at each time step as the difference between the desired state d(k) and the recurrent state x(k) whenever the desired vector d(k) is defined, and is otherwise set to zero:

e(k) = \begin{cases} d(k) - x(k) & \text{if } d(k) \text{ is defined} \\ 0 & \text{otherwise.} \end{cases}    (3)

For such problems as terminal control (Bryson and Ho, 1969; Nguyen and Widrow, 1990) a desired response may be given only at the final time k = K, while for other problems such as system identification (Ljung, 1987; Narendra, 1990) it is more common to have a desired response vector for all k. In addition, only some of the recurrent states may represent actual outputs while others may be used solely for computational purposes.
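To make equations 1-3 concrete, the following Python/NumPy fragment (an illustrative sketch, not part of the original paper; the choice of a one-hidden-layer tanh network for N, the weight matrices, and the dictionary of desired states are all assumptions) rolls the recurrent network forward and accumulates the instantaneous cost for a single input sequence.

```python
import numpy as np

def run_network(W_in, W_fb, W_out, r_seq, x0, d_seq):
    """Roll x(k) = N(x(k-1), r(k-1), w) forward (equation 1) and accumulate
    J = 1/2 sum_k e(k)^T e(k) for one sequence (equations 2-3).
    A one-hidden-layer tanh network is assumed for N (illustrative only)."""
    x, J, states = x0, 0.0, []
    for k, r in enumerate(r_seq, start=1):
        h = np.tanh(W_in @ r + W_fb @ x)      # hidden layer driven by r(k-1) and x(k-1)
        x = np.tanh(W_out @ h)                # new recurrent state x(k)
        states.append(x)
        d = d_seq.get(k)                      # desired state d(k), if one is defined
        e = d - x if d is not None else np.zeros_like(x)   # equation 3
        J += 0.5 * float(e @ e)               # one term of equation 2 (no expectation)
    return states, J
```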

In both RTBP and BPTT, a gradient descent approach is used to adapt the weights of the network. At each time step, the contribution to the weight update is given by

\Delta w(k)^T = -\frac{\eta}{2} \frac{d\left[ e(k)^T e(k) \right]}{dw} = \eta\, e(k)^T \frac{dx(k)}{dw},    (4)

where η is the learning rate. Here the derivative is used to represent the change in error due to a weight change over all time [1]. The accumulation of weight updates over k = 1...K is given by Δw = \sum_{k=1}^{K} Δw(k). Typically, RTBP uses on-line adaptation in which the weights are updated at each time k, whereas BPTT performs an update based on the aggregate Δw. The differences due to on-line versus off-line adaptation will not be considered in this paper. For consistency, we assume that in both algorithms the weights are held constant during all gradient calculations.

[1] We define the derivative of a vector a in R^n with respect to another vector b in R^m as the matrix da/db in R^{n x m} whose (i,j)-th element is da_i/db_j. Similarly, the partial derivative of a vector a in R^n with respect to another vector b in R^m is the matrix ∂a/∂b in R^{n x m} whose (i,j)-th element is ∂a_i/∂b_j. For m = n = 1, this notation reduces to the scalar derivative and partial derivative as traditionally defined in calculus. It is easy to verify that most scalar operations in calculus, such as the chain rule, also hold in the vectorial case.

Flow Graph Representation of the Adaptive Algorithms

RTBP was originally derived for fully recurrent single-layer networks [2]. A more general algorithm is obtained by using equation 1 to directly evaluate the state gradient dx(k)/dw in the above weight update formula. Applying the chain rule, we get

\frac{dx(k)}{dw} = \frac{\partial x(k)}{\partial x(k-1)} \frac{dx(k-1)}{dw} + \frac{\partial x(k)}{\partial r(k-1)} \frac{dr(k-1)}{dw} + \frac{\partial x(k)}{\partial w} \frac{dw}{dw},    (5)

in which dr(k-1)/dw = 0, since the external inputs do not depend on the network weights, and dw/dw = I, where I is the identity matrix. With these simplifications, equation 5 reduces to

\frac{dx(k)}{dw} = \frac{\partial x(k)}{\partial x(k-1)} \frac{dx(k-1)}{dw} + \frac{\partial x(k)}{\partial w}.    (6)

Equation 6 is then applied recursively, from k = 1 to k = K, with initial condition dx(0)/dw = 0. For the sake of clarity, let

\Lambda_{rec}(k) = dx(k)/dw,    (7)
\Lambda(k) = \partial x(k)/\partial w,    (8)
J(k) = \partial x(k)/\partial x(k-1).    (9)

With this new notation, equation 6 can be rewritten as

\Lambda_{rec}(k) = J(k)\, \Lambda_{rec}(k-1) + \Lambda(k),    ∀ k = 1...K,    (10)

with initial condition Λ_rec(0) = 0. The weight update at each time step is given by

\Delta w^T(k) = \eta\, e(k)^T \Lambda_{rec}(k).    (11)

[2] The linear equivalent of the RTBP algorithm was first introduced in the context of Infinite Impulse Response (IIR) filter adaptation (White, 1975).
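The forward recursion of equations 6-11 maps directly onto a few lines of code. The sketch below is illustrative only (not from the paper); the per-step Jacobians J(k) and partials Λ(k) are assumed to be supplied by whatever network model is being trained, and everything is kept as dense NumPy matrices.

```python
import numpy as np

def rtbp_update(J_seq, Lam_seq, e_seq, eta):
    """Real-time backpropagation (equations 10-11).
    J_seq[k-1]   : J(k) = dx(k)/dx(k-1),            shape (N, N)
    Lam_seq[k-1] : Lambda(k) = dx(k)/dw (w fixed),  shape (N, W)
    e_seq[k-1]   : error e(k),                      shape (N,)
    Returns the total weight update dw = sum_k dw(k), shape (W,)."""
    N, W = Lam_seq[0].shape
    Lam_rec = np.zeros((N, W))                 # Lambda_rec(0) = 0
    dw = np.zeros(W)
    for J_k, Lam_k, e_k in zip(J_seq, Lam_seq, e_seq):
        Lam_rec = J_k @ Lam_rec + Lam_k        # equation 10
        dw += eta * e_k @ Lam_rec              # equation 11: dw(k)^T = eta e(k)^T Lambda_rec(k)
    return dw
```

The matrix-matrix product J_k @ Lam_rec is what gives RTBP its higher per-step cost, which is the point of the complexity comparison drawn later in the paper.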

Figure 2: Flow graph associated with the real-time backpropagation algorithm.

Equations 10 and 11 can be illustrated by a flow graph (see Figure 2). The input to the flow graph, or source node variable, is set to 1.0 and propagated along the lower horizontal branch of the graph. The center horizontal branch computes the state derivatives Λ_rec(k), and the upper horizontal branch accumulates the weight changes Δw(k)^T. The total weight change, Δw^T, is readily available at the output (sink node). RTBP is completely defined by this flow graph.

Let us now build a new flow graph by transposing the flow graph of Figure 2. Transposing the original flow graph is accomplished by reversing the branch directions, transposing the branch gains, replacing summing junctions by branching points and vice versa, and interchanging source and sink nodes. The new flow graph is represented in Figure 3. From the work by Tellegen (1952) and Bordewijk (1956) (see Appendix A), we know that transposed flow graphs are a particular case of interreciprocal graphs. This means, in the SISO case, that the sink value obtained in one graph, when exciting the source with a given input, is the same as the sink value of the transposed graph when exciting its source with the same input. Thus, if an input of 1.0 is distributed along the upper horizontal branch of the transposed graph, the output, which is now accumulated on the lower horizontal branch, will be equal to Δw. This Δw is identical to the output of our original flow graph [3].

[3] The flow graphs introduced here are in fact single-input multiple-output (SIMO). The arguments of interreciprocity may be applied by considering the SIMO graph to be a stack of SISO graphs, each of which can be independently transposed.

With the notation introduced before, and calling δ_bp(k) the signal transmitted along the center horizontal branch, we can directly write down the equations describing the new flow graph:

\delta_{bp}(k) = J^T(k+1)\, \delta_{bp}(k+1) + e(k),    ∀ k = K...1,    (12)

with initial condition δ_bp(K+1) = 0. The weight update at each time step is given by

\Delta w(k) = \eta\, \Lambda^T(k)\, \delta_{bp}(k).    (13)

Equations 12 and 13, obtained from the new flow graph, are nothing other than a description of BPTT: δ_bp(k) is the error gradient -d[\frac{1}{2} \sum_{k=1}^{K} e(k)^T e(k)]/dx(k) backpropagated from k = K to k = 1. This provides a simple theoretical derivation of BPTT. Alternative derivations include the use of ordered derivatives (Werbos, 1974), heuristically unfolding the network in time (Rumelhart et al., 1986; Nguyen and Widrow, 1990), and solving a set of Euler-Lagrange equations (Le Cun, 1988; Plumer, 1993).
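The backward recursion of equations 12-13 is just as short in code. The following sketch (illustrative, with the same assumed inputs as the rtbp_update sketch above, whose definition it reuses) implements BPTT and then checks numerically, on random Jacobians and errors, that the two procedures return the same total weight update, which is exactly what the interreciprocity argument asserts.

```python
import numpy as np

def bptt_update(J_seq, Lam_seq, e_seq, eta):
    """Backpropagation-through-time (equations 12-13); same inputs as rtbp_update."""
    K = len(J_seq)
    W = Lam_seq[0].shape[1]
    delta_bp = np.zeros_like(e_seq[-1])               # delta_bp(K+1) = 0
    dw = np.zeros(W)
    for k in reversed(range(K)):
        # equation 12; at the last time step the J^T(K+1) delta_bp(K+1) term is zero
        delta_bp = e_seq[k] if k == K - 1 else J_seq[k + 1].T @ delta_bp + e_seq[k]
        dw += eta * Lam_seq[k].T @ delta_bp           # equation 13
    return dw

# Numerical check of the equivalence on arbitrary data (illustrative):
rng = np.random.default_rng(0)
K, N, W = 5, 3, 4
J_seq = [rng.standard_normal((N, N)) for _ in range(K)]
Lam_seq = [rng.standard_normal((N, W)) for _ in range(K)]
e_seq = [rng.standard_normal(N) for _ in range(K)]
assert np.allclose(rtbp_update(J_seq, Lam_seq, e_seq, 0.1),
                   bptt_update(J_seq, Lam_seq, e_seq, 0.1))
```

Note that each backward step involves only matrix-vector products, in contrast to the matrix-matrix product in the forward sensitivity recursion.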

Figure 3: Transposed flow graph: a representation of the backpropagation-through-time algorithm.

Clearly, one could also have taken the reverse path by starting with a derivation of BPTT, constructing the corresponding flow graph, transposing it, and then reading out the equations for RTBP. The two approaches lead to equivalent results.

Another nice feature of flow graph representations is that the computational and complexity differences between RTBP and BPTT can be directly observed from their respective flow graphs. By observing the dimensions of the terms flowing in the graphs and the necessary matrix calculations and multiplications, it can be verified that RTBP requires O(N^2 W) operations while BPTT requires only O(W), where N is the number of recurrent states and W is the number of weights [4].

[4] For fully recurrent networks W = N^2, and RTBP is O(N^4).

Extension to Controller-Plant Structures

Flow graph theory can also be applied to more complicated network arrangements, such as the dynamic controller-plant structure illustrated in Figure 4. A discrete-time dynamic plant P, described by its state-space equations, is controlled by a neural network controller C. Let x(k-1) be the state of the plant, r(k-1) the external reference inputs to the controller, and u(k-1) the control signal used to drive the plant. Figure 4 can be described formally by the following equations:

x(k) = P(x(k-1), u(k-1)),    (14)
u(k-1) = C(x(k-1), r(k-1), w).    (15)

As before, the error vector e(k) is defined as the difference between the desired state d(k) and the actual state x(k) when there exists a desired state, and zero otherwise. Using RTBP to adapt the weights of the controller requires the evaluation of the derivatives of the state with respect to the weights.

Figure 4: Controller-plant structure (the controller C maps r(k-1) and x(k-1) to the control u(k-1), which drives the plant P; q represents a unit delay operator).

Applying the chain rule to equations 14 and 15, we get:

\frac{dx(k)}{dw} = \frac{\partial x(k)}{\partial x(k-1)} \frac{dx(k-1)}{dw} + \frac{\partial x(k)}{\partial u(k-1)} \frac{du(k-1)}{dw},    (16)

\frac{du(k-1)}{dw} = \frac{\partial u(k-1)}{\partial x(k-1)} \frac{dx(k-1)}{dw} + \frac{\partial u(k-1)}{\partial w}.    (17)

Equations 16 and 17 can then be represented by a flow graph (see Figure 5a). Transposing this flow graph, we get a new graph, which corresponds to the BPTT algorithm for the controller-plant structure (see Figure 5b). Again, the argument of interreciprocity immediately shows the equivalence of the weight updates performed by the two algorithms. In addition, it can be verified that BPTT applied to this structure still requires a factor of O(N^2) fewer computations than RTBP.
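As with the plain recurrent network, equations 16-17 define a forward sensitivity recursion that can be coded directly. The sketch below is illustrative (not from the paper); the per-step plant and controller Jacobians are assumed to be computed elsewhere, for example by linearizing P and C around the current trajectory.

```python
import numpy as np

def rtbp_controller_plant(J_Px, J_Pu, J_C, Lam_u, e_seq, eta):
    """Forward sensitivity recursion for the controller-plant structure (equations 16-17).
    J_Px[k]  : dx(k)/dx(k-1)                              (N, N)
    J_Pu[k]  : dx(k)/du(k-1)                              (N, M)
    J_C[k]   : du(k-1)/dx(k-1)                            (M, N)
    Lam_u[k] : du(k-1)/dw, controller weights held fixed  (M, W)
    e_seq[k] : e(k)                                       (N,)
    Returns the total update for the controller weights, shape (W,)."""
    N, W = J_Px[0].shape[0], Lam_u[0].shape[1]
    dxdw = np.zeros((N, W))                       # dx(0)/dw = 0
    dw = np.zeros(W)
    for A, B, C_k, Lu, e in zip(J_Px, J_Pu, J_C, Lam_u, e_seq):
        dudw = C_k @ dxdw + Lu                    # equation 17
        dxdw = A @ dxdw + B @ dudw                # equation 16
        dw += eta * e @ dxdw                      # weight update, as in equation 11
    return dw
```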

Figure 5: (a) Flow graph corresponding to RTBP. (b) Transposed flow graph: a representation of BPTT. Notation: J_C(k) = ∂u(k)/∂x(k); J_Px(k) = ∂x(k)/∂x(k-1); J_Pu(k) = ∂x(k)/∂u(k-1); Λ_u(k) = ∂u(k)/∂w; Λ_rec,x(k) = dx(k)/dw; Λ_rec,u(k) = du(k)/dw; δ_bp,x(k) = -d[\frac{1}{2} \sum_{k=1}^{K} e^T(k) e(k)]/dx(k).

Conclusion

We have shown that real-time backpropagation and backpropagation-through-time are easily related when represented by signal flow graphs. In particular, the flow graphs corresponding to the two algorithms are the exact transpose of one another. As a consequence, flow graph theory could be applied to verify that the gradient calculations performed by the algorithms are equivalent. These principles were then extended to a controller-plant structure to illustrate how flow graph techniques can be applied to a variety of adaptive dynamic systems.

Appendix A: Flow Graph Interreciprocity

In this appendix we provide the formal definition of interreciprocity. We then prove that transposed flow graphs are interreciprocal, and that the transfer functions of single-input single-output interreciprocal flow graphs are identical.

Figure 6: Example of nodes and branches in a signal flow graph.

Let F be a flow graph. In F, we define: Y_k, the value associated with node k; T_{j,k}, the transmittance of the branch (j,k); and V_{j,k} = T_{j,k} Y_j, the output of branch (j,k). Let us further assume that each node k of the graph has associated to it a source node, i.e., a node connected to it by a branch of unity transmittance. Let X_k be the value of this source node (if node k has no associated source node, X_k is simply set to zero). It results from the above definitions that Y_k = \sum_j V_{j,k} + X_k = \sum_j T_{j,k} Y_j + X_k (see Figure 6). Let us now consider a second flow graph, \tilde{F}, having the same topology as F (i.e., \tilde{F} has the same set of nodes and branches as F, but the branch transmittances of the two graphs may differ). \tilde{F} is described by the variables \tilde{Y}_k, \tilde{T}_{j,k}, \tilde{V}_{j,k}, and \tilde{X}_k.

Definition 1. Two flow graphs, F and \tilde{F}, are said to be the transpose of each other iff their transmittance matrices are transposed, i.e.,

\tilde{T}_{j,k} = T_{k,j}    ∀ j, k.    (18)

Definition 2 (Bordewijk, 1956). Two flow graphs, F and \tilde{F}, are said to be interreciprocal iff

\sum_k ( \tilde{Y}_k X_k - Y_k \tilde{X}_k ) = 0.    (19)

We can now state the following theorem:

Theorem 1. Transposed flow graphs are interreciprocal.
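Before the proof, a small numerical sketch (illustrative only, not part of the appendix) of what Theorem 1 and its SISO consequence assert for a linear flow graph: solving Y_k = \sum_j T_{j,k} Y_j + X_k for the node values, the transfer from a single source node a to a sink node b is unchanged when the transmittance matrix is transposed and the roles of a and b are exchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
T = 0.3 * rng.standard_normal((n, n))       # T[j, k] = transmittance of branch (j, k)

def node_values(T, X):
    """Solve Y_k = sum_j T[j, k] * Y_j + X_k, i.e. Y = (I - T^T)^{-1} X."""
    return np.linalg.solve(np.eye(len(X)) - T.T, X)

a, b = 0, 4                                  # source node a, sink node b
X = np.zeros(n); X[a] = 1.0
Y_b = node_values(T, X)[b]                   # transfer a -> b in the original graph F

X_t = np.zeros(n); X_t[b] = 1.0
Y_a = node_values(T.T, X_t)[a]               # transfer b -> a in the transposed graph

assert np.isclose(Y_b, Y_a)                  # identical SISO transfer functions
```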

Proof: Let F be a flow graph, and let \tilde{F} be the transpose of F. We start from the identity \sum_k \tilde{Y}_k Y_k = \sum_k Y_k \tilde{Y}_k, and replace Y_k by \sum_j T_{j,k} Y_j + X_k in the first member, and \tilde{Y}_k by \sum_j \tilde{T}_{j,k} \tilde{Y}_j + \tilde{X}_k in the second member (Oppenheim and Schafer, 1989). Rearranging the terms, we get:

\sum_{j,k} ( \tilde{Y}_k V_{j,k} - Y_k \tilde{V}_{j,k} ) + \sum_k ( \tilde{Y}_k X_k - Y_k \tilde{X}_k ) = 0.    (20)

Equation 20 is usually referred to as "the two-network form of Tellegen's theorem" (Tellegen, 1952; Penfield, 1970). Since \tilde{F} is the transpose of F, the first term of equation 20 can be rewritten as

\sum_{j,k} ( \tilde{Y}_k V_{j,k} - Y_k \tilde{V}_{j,k} ) = \sum_{j,k} ( \tilde{Y}_k T_{j,k} Y_j - Y_k \tilde{T}_{j,k} \tilde{Y}_j ) = \sum_{j,k} ( \tilde{Y}_k T_{j,k} Y_j - Y_k T_{k,j} \tilde{Y}_j ) = 0,

where the last equality follows by exchanging the summation indices j and k in the second sum. Since the first term of equation 20 is zero, the second term \sum_k ( \tilde{Y}_k X_k - Y_k \tilde{X}_k ) is also zero. The flow graphs \tilde{F} and F are thus interreciprocal. QED.

The last step consists in showing that SISO interreciprocal flow graphs have the same transfer functions. Let node a be the unique source of F and node b its unique sink. From the definition of transposition, node a is the sink of \tilde{F}, and node b is its source. We thus have X_k = 0 for all k ≠ a and \tilde{X}_k = 0 for all k ≠ b. Therefore, equation 19 reduces to:

X_a \tilde{Y}_a = \tilde{X}_b Y_b.    (21)

This last equality can be interpreted as follows (Penfield, 1970; Oppenheim and Schafer, 1989): the output Y_b, obtained when exciting graph F with an input signal X_a, is identical to the output \tilde{Y}_a of the transposed graph \tilde{F} when exciting it at node b with an input \tilde{X}_b = X_a. The transfer functions of the SISO systems represented by the two flow graphs are thus identical, which is the desired conclusion.

References

Bordewijk, J. L. 1956. Inter-reciprocity applied to electrical networks. Appl. Sci. Res. 6B, 1-74.

Bryson, A. E. Jr. and Ho, Y. 1969. Applied Optimal Control, chapter 2. Blaisdell Publishing Co., New York.

Hertz, J. A., Krogh, A., and Palmer, R. G. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Reading, MA.

Kailath, T. 1980. Linear Systems. Prentice-Hall, Englewood Cliffs, NJ.

Le Cun, Y. 1988. A theoretical framework for back-propagation. In Proceedings of the 1988 Connectionist Models Summer School, D. Touretzky, G. Hinton, and T. Sejnowski, eds., Morgan Kaufmann, San Mateo, CA, 21-28.

Ljung, L. 1987. System Identification: Theory for the User. Prentice-Hall, Englewood Cliffs, NJ.

Pineda, F. J. 1987. Generalization of back-propagation to recurrent neural networks. IEEE Transactions on Neural Networks, special issue on recurrent networks.

Plumer, E. S. 1993. Time-optimal terminal control using neural networks. In Proceedings of the IEEE International Conference on Neural Networks, San Francisco, CA, 1926-1931.

Ramo, S., Whinnery, J. R., and Van Duzer, T. 1984. Fields and Waves in Communication Electronics, 2nd ed. John Wiley & Sons.

Rumelhart, D. E. and McClelland, J. L. 1986. Parallel Distributed Processing. The MIT Press, Cambridge, MA.

Tellegen, B. D. H. 1952. A general network theorem, with applications. Philips Res. Rep. 7, 259-269.

Werbos, P. 1990. Backpropagation through time: what it does and how to do it. Proc. IEEE, special issue on neural networks, 2, 1550-1560.

White, S. A. 1975. An adaptive recursive digital filter. In Proc. 9th Asilomar Conf. Circuits, Systems, and Computers, 21.

Williams, R. J. and Zipser, D. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Computation 1(2), 270-280.