Neural Computation, 1994

Relating Real-Time Backpropagation and Backpropagation-Through-Time: An Application of Flow Graph Interreciprocity

Françoise Beaufays and Eric A. Wan

Abstract

We show that signal flow graph theory provides a simple way to relate two popular algorithms used for adapting dynamic neural networks: real-time backpropagation and backpropagation-through-time. Starting with the flow graph for real-time backpropagation, we use a simple transposition to produce a second graph. The new graph is shown to be interreciprocal with the original and to correspond to the backpropagation-through-time algorithm. Interreciprocity provides a theoretical argument to verify that both flow graphs implement the same overall weight update.

Introduction

Two adaptive algorithms, real-time backpropagation (RTBP) and backpropagation-through-time (BPTT), are currently used to train multilayer neural networks with output feedback connections. RTBP was first introduced for single-layer fully recurrent networks by Williams and Zipser (1989). The algorithm has since been extended to include feedforward networks with output feedback (see, e.g., Narendra, 1990). The algorithm is sometimes referred to as real-time recurrent learning, on-line backpropagation, or dynamic backpropagation (Williams and Zipser, 1989; Narendra et al., 1990; Hertz et al., 1991). The name recurrent backpropagation is also occasionally used, although this should not be confused with recurrent backpropagation as developed by Pineda (1987) for learning fixed points in feedback networks. RTBP is well suited for on-line adaptation of dynamic networks where a desired response is specified at each time step. BPTT (Rumelhart et al., 1986; Nguyen and Widrow, 1990; Werbos, 1990), on the other hand, involves unfolding the network in time and applying standard backpropagation through the unraveled system. It does not allow for on-line adaptation as in RTBP, but has been shown to be computationally less expensive.
Both algorithms attempt to minimize the same performance criterion, and are equivalent in terms of what they compute (assuming all weight changes are made off-line). However, they are generally derived independently and take on very different mathematical formulations. In this paper, we use flow graph theory as a common support for relating the two algorithms. We begin by deriving a general flow graph diagram for the weight updates associated with RTBP. A second flow graph is obtained by transposing the original one, i.e., by reversing the arrows that link the graph nodes, and by interchanging the source and sink nodes. Flow graph theory shows that transposed flow graphs are interreciprocal and, for single-input single-output (SISO) systems, have identical transfer functions. This basic property, which was first presented in the context of electrical circuit analysis (Penfield et al., 1970), finds applications in a wide

The authors are with the Department of Electrical Engineering, Stanford University, Stanford, CA 94305-4055. This work was sponsored by EPRI under contract RP8010-13.
variety of engineering disciplines, such as the reciprocity of emitting and receiving antennas in electromagnetism (Ramo et al., 1984), the relationship between controller and observer canonical forms in control theory (Kailath, 1980), and the duality between decimation-in-time and decimation-in-frequency formulations of the FFT algorithm in signal processing (Oppenheim and Schafer, 1989). The transposed flow graph is shown to correspond directly to the BPTT algorithm. The interreciprocity of the two flow graphs allows us to verify that RTBP and BPTT perform the same overall computations. These principles are then extended to a more elaborate control feedback structure.

Network Equations

A neural network with output recurrence is shown in Figure 1. Let r(k-1) denote the vector of external reference inputs to the network and x(k-1) the recurrent inputs. The output vector x(k) is a function of the recurrent and external inputs, and of the adaptive weights w of the network:

x(k) = N(x(k-1), r(k-1), w).   (1)

Figure 1: Recurrent neural network (q represents a unit delay operator).

The neural network N is most generally a feedforward multilayer architecture (Rumelhart et al., 1986). If N has only a single layer of neurons, the structure of Figure 1 represents a completely recurrent network (Williams and Zipser, 1989; Pineda, 1987). Any connectionist architecture with feedback units can, in fact, be represented in this standard format (Piche, 1993). Adapting the neural network amounts to finding the set of weights w that minimizes the cost function

J = (1/2) E[ Σ_{k=1}^{K} e(k)^T e(k) ] = (1/2) Σ_{k=1}^{K} E[ e(k)^T e(k) ],   (2)

where the expectation E[·] is taken over the external reference inputs r(k) and over the initial values of the recurrent inputs x(0). The error e(k) is defined at each time step as the difference between the desired state d(k) and the recurrent state x(k) whenever the desired vector d(k) is defined, and is otherwise set to zero:

e(k) = d(k) - x(k) if d(k) exists, and e(k) = 0 otherwise.   (3)

For such problems as terminal control (Bryson and Ho, 1969; Nguyen and Widrow, 1990) a desired response may be given only at the final time k = K, while for other problems such as system
identification (Ljung, 1987; Narendra, 1990) it is more common to have a desired response vector for all k. In addition, only some of the recurrent states may represent actual outputs while others may be used solely for computational purposes.

In both RTBP and BPTT, a gradient descent approach is used to adapt the weights of the network. At each time step, the contribution to the weight update is given by

Δw(k)^T = -(η/2) d[e(k)^T e(k)]/dw = η e(k)^T dx(k)/dw,   (4)

where η is the learning rate. Here the derivative is used to represent the change in error due to a weight change over all time.¹ The accumulation of weight updates over k = 1...K is given by Δw = Σ_{k=1}^{K} Δw(k). Typically, RTBP uses on-line adaptation in which the weights are updated at each time k, whereas BPTT performs an update based on the aggregate Δw. The differences due to on-line versus off-line adaptation will not be considered in this paper. For consistency, we assume that in both algorithms the weights are held constant during all gradient calculations.

Flow Graph Representation of the Adaptive Algorithms

RTBP was originally derived for fully recurrent single-layer networks.² A more general algorithm is obtained by using equation 1 to directly evaluate the state gradient dx(k)/dw in the above weight update formula. Applying the chain rule, we get:

dx(k)/dw = ∂x(k)/∂x(k-1) · dx(k-1)/dw + ∂x(k)/∂r(k-1) · dr(k-1)/dw + ∂x(k)/∂w · dw/dw,   (5)

in which dr(k-1)/dw = 0, since the external inputs do not depend on the network weights, and dw/dw = I, where I is the identity matrix. With these simplifications, equation 5 reduces to:

dx(k)/dw = ∂x(k)/∂x(k-1) · dx(k-1)/dw + ∂x(k)/∂w.   (6)

Equation 6 is then applied recursively, from k = 1 to k = K, with initial condition dx(0)/dw = 0. For the sake of clarity, let

Λ_rec(k) = dx(k)/dw,   (7)
Λ(k) = ∂x(k)/∂w,   (8)
J(k) = ∂x(k)/∂x(k-1).   (9)

With this new notation, equation 6 can be rewritten as:

Λ_rec(k) = J(k) Λ_rec(k-1) + Λ(k),   ∀ k = 1...K,   (10)

with initial condition Λ_rec(0) = 0. The weight update at each time step is given by:

Δw(k)^T = η e(k)^T Λ_rec(k).   (11)

¹ We define the derivative of a vector a ∈ R^n with respect to another vector b ∈ R^m as the matrix da/db ∈ R^{n×m} whose (i,j)-th element is da_i/db_j. Similarly, the partial derivative of a vector a ∈ R^n with respect to another vector b ∈ R^m is the matrix ∂a/∂b ∈ R^{n×m} whose (i,j)-th element is ∂a_i/∂b_j. For m = n = 1, this notation reduces to the scalar derivative and partial derivative as traditionally defined in calculus. It is easy to verify that most scalar operations in calculus, such as the chain rule, also hold in the vectorial case.

² The linear equivalent of the RTBP algorithm was first introduced in the context of Infinite Impulse Response (IIR) filter adaptation (White, 1975).
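As a concrete illustration, the recursion of equations 10 and 11 can be sketched in a few lines of Python. The single-layer tanh network x(k) = tanh(W x(k-1) + r(k-1)), its dimensions, and the finite-difference check are illustrative assumptions, not constructs taken from the paper.

```python
import numpy as np

# Sketch of the RTBP recursion (equations 10-11) for an assumed
# single-layer recurrent network x(k) = tanh(W x(k-1) + r(k-1)),
# with weight vector w = vec(W).
rng = np.random.default_rng(0)
n, K, eta = 3, 10, 0.1
W = 0.5 * rng.standard_normal((n, n))
r = rng.standard_normal((K, n))          # external inputs r(0..K-1)
d = rng.standard_normal((K, n))          # desired states d(1..K)

def rtbp(W):
    x = np.zeros(n)                      # initial state x(0)
    Lam_rec = np.zeros((n, n * n))       # Lambda_rec(0) = dx(0)/dw = 0
    dw = np.zeros(n * n)                 # accumulated Delta w^T
    for k in range(K):
        x_new = np.tanh(W @ x + r[k])
        g = 1.0 - x_new ** 2             # tanh derivative at each unit
        J = g[:, None] * W               # J(k) = dx(k)/dx(k-1)
        # Lambda(k): entry (i, a*n+b) is g_i * delta_{ia} * x_b(k-1)
        Lam = g[:, None] * np.kron(np.eye(n), x[None, :])
        Lam_rec = J @ Lam_rec + Lam      # equation 10
        e = d[k] - x_new                 # error e(k)
        dw += eta * e @ Lam_rec          # equation 11, accumulated
        x = x_new
    return dw

dw_rtbp = rtbp(W)

# Sanity check: the aggregate update should equal -eta * dJ/dw for this
# single sequence (no expectation), estimated by central differences.
def cost(Wf):
    x, J_tot = np.zeros(n), 0.0
    for k in range(K):
        x = np.tanh(Wf @ x + r[k])
        J_tot += 0.5 * np.sum((d[k] - x) ** 2)
    return J_tot

eps = 1e-6
dw_fd = np.zeros(n * n)
for i in range(n * n):
    Wp, Wm = W.flatten(), W.flatten()
    Wp[i] += eps; Wm[i] -= eps
    dw_fd[i] = -eta * (cost(Wp.reshape(n, n)) - cost(Wm.reshape(n, n))) / (2 * eps)

print(np.max(np.abs(dw_rtbp - dw_fd)))   # tiny: finite-difference error only
```

Note that Λ_rec(k) is an n × W matrix, so the forward recursion carries an O(NW) quantity at every step; this is the source of the complexity gap discussed later in the paper.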
Figure 2: Flow graph associated with the real-time backpropagation algorithm.

Equations 10 and 11 can be illustrated by a flow graph (see Figure 2). The input to the flow graph, or source node variable, is set to 1.0 and propagated along the lower horizontal branch of the graph. The center horizontal branch computes the state derivatives Λ_rec(k), and the upper horizontal branch accumulates the weight changes Δw(k)^T. The total weight change, Δw^T, is readily available at the output (sink node). RTBP is completely defined with this flow graph.

Let us now build a new flow graph by transposing the flow graph of Figure 2. Transposing the original flow graph is accomplished by reversing the branch directions, transposing the branch gains, replacing summing junctions by branching points and vice versa, and interchanging source and sink nodes. The new flow graph is represented in Figure 3. From the work by Tellegen (1952) and Bordewijk (1956) (see Appendix A), we know that transposed flow graphs are a particular case of interreciprocal graphs. This means, in a SISO case, that the sink value obtained in one graph, when exciting the source with a given input, is the same as the sink value of the transposed graph, when exciting its source with the same input. Thus, if an input of 1.0 is distributed along the upper horizontal branch of the transposed graph, the output, which is now accumulated on the lower horizontal branch, will be equal to Δw. This Δw is identical to the output of our original flow graph.³ With the notation introduced before, and calling δ_bp(k) the signal transmitted along the center horizontal branch, we can directly write down the equations describing the new flow graph:

δ_bp(k) = J^T(k+1) δ_bp(k+1) + η e(k),   ∀ k = K...1,   (12)

with initial condition δ_bp(K+1) = 0.
The weight update at each time step is given by:

Δw(k) = Λ^T(k) δ_bp(k).   (13)

Equations 12 and 13, obtained from the new flow graph, are nothing other than the description of BPTT: δ_bp(k) is the error gradient -(η/2) d[Σ_{k=1}^{K} e(k)^T e(k)]/dx(k), backpropagated from k = K to k = 1. This provides a simple theoretical derivation of BPTT. Alternative derivations include the use

³ The flow graphs introduced here are in fact single-input multiple-output (SIMO). The arguments of interreciprocity may be applied by considering the SIMO graph to be a stack of SISO graphs, each of which can be independently transposed.
Figure 3: Transposed flow graph: a representation of the backpropagation-through-time algorithm.

of ordered derivatives (Werbos, 1974), heuristically unfolding the network in time (Rumelhart et al., 1986; Nguyen and Widrow, 1990), and solving a set of Euler-Lagrange equations (Le Cun, 1988; Plumer, 1993). Clearly, one could also have taken the reverse path by starting with a derivation for BPTT, constructing the corresponding flow graph, transposing it, and then reading out the equations for RTBP. The two approaches lead to equivalent results.

Another nice feature of flow graph representations is that the computational complexity differences between RTBP and BPTT can be directly observed from their respective flow graphs. By observing the dimensions of the terms flowing in the graphs and the necessary matrix calculations and multiplications, it can be verified that RTBP requires O(N²W) operations while BPTT requires only O(W), where N is the number of recurrent states and W is the number of weights.⁴

Extension to Controller-Plant Structures

Flow graph theory can also be applied to more complicated network arrangements, such as the dynamic controller-plant structure illustrated in Figure 4. A discrete-time dynamic plant P described by its state-space equations is controlled by a neural network controller C. Let x(k-1) be the state of the plant, r(k-1) the external reference inputs to the controller, and u(k-1) the control signal used to drive the plant. Figure 4 can be described formally by the following equations:

x(k) = P(x(k-1), u(k-1)),   (14)
u(k-1) = C(x(k-1), r(k-1), w).   (15)

As before, the error vector e(k) is defined as the difference between the desired state d(k) and the actual state x(k) when there exists a desired state, and zero otherwise.
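The equivalence that interreciprocity guarantees between the two algorithms can also be checked numerically. The sketch below, assuming an illustrative single-layer tanh network not taken from the paper, runs the forward RTBP recursion (equations 10-11) and the backward BPTT recursion (equations 12-13) on the same input sequence and compares the aggregate weight updates.

```python
import numpy as np

# Numerical check that RTBP (equations 10-11) and BPTT (equations 12-13)
# produce the same aggregate weight update Delta w. The network
# x(k) = tanh(W x(k-1) + r(k-1)), w = vec(W), is an illustrative choice.
rng = np.random.default_rng(1)
n, K, eta = 4, 8, 0.05
W = 0.4 * rng.standard_normal((n, n))
r = rng.standard_normal((K, n))
d = rng.standard_normal((K, n))

# Forward pass, storing the local quantities J(k), Lambda(k), e(k).
xs = [np.zeros(n)]
Js, Lams, es = [], [], []
for k in range(K):
    x_new = np.tanh(W @ xs[-1] + r[k])
    g = 1.0 - x_new ** 2
    Js.append(g[:, None] * W)                                      # J(k)
    Lams.append(g[:, None] * np.kron(np.eye(n), xs[-1][None, :]))  # Lambda(k)
    es.append(d[k] - x_new)                                        # e(k)
    xs.append(x_new)

# RTBP: forward recursion on Lambda_rec(k).
Lam_rec = np.zeros((n, n * n))
dw_rtbp = np.zeros(n * n)
for k in range(K):
    Lam_rec = Js[k] @ Lam_rec + Lams[k]       # equation 10
    dw_rtbp += eta * es[k] @ Lam_rec          # equation 11

# BPTT: backward recursion on delta_bp(k).
delta = np.zeros(n)                           # delta_bp(K+1) = 0
dw_bptt = np.zeros(n * n)
for k in reversed(range(K)):
    J_next = Js[k + 1] if k + 1 < K else np.zeros((n, n))
    delta = J_next.T @ delta + eta * es[k]    # equation 12
    dw_bptt += Lams[k].T @ delta              # equation 13

print(np.max(np.abs(dw_rtbp - dw_bptt)))      # agreement to machine precision
```

The two loops touch exactly the same quantities J(k), Λ(k), and e(k); only the direction of the recursion, and hence the shape of the carried variable (an n × W matrix for RTBP versus an n-vector for BPTT), differs.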
Using RTBP to adapt the weights of the controller requires the evaluation of the derivatives of the state with respect to the

⁴ For fully recurrent networks, W = N² and RTBP is O(N⁴).
Figure 4: Controller-Plant Structure.

weights. Applying the chain rule to equations 14 and 15, we get:

dx(k)/dw = ∂x(k)/∂x(k-1) · dx(k-1)/dw + ∂x(k)/∂u(k-1) · du(k-1)/dw,   (16)

du(k-1)/dw = ∂u(k-1)/∂x(k-1) · dx(k-1)/dw + ∂u(k-1)/∂w.   (17)

Equations 16 and 17 can then be represented by a flow graph (see Figure 5a). Transposing this flow graph, we get a new graph, which corresponds to the BPTT algorithm for the controller-plant structure (see Figure 5b). Again, the argument of interreciprocity immediately shows the equivalence of the weight updates performed by the two algorithms. In addition, it can be verified that BPTT applied to this structure still requires a factor of O(N²) fewer computations than RTBP.
Figure 5: (a) Flow graph corresponding to RTBP. (b) Transposed flow graph: a representation of BPTT. Notation: J_C(k) = ∂u(k)/∂x(k); J_Px(k) = ∂x(k)/∂x(k-1); J_Pu(k) = ∂x(k)/∂u(k-1); Λ_u(k) = ∂u(k)/∂w; Λ_rec^x(k) = dx(k)/dw; Λ_rec^u(k) = du(k)/dw; δ_bp^x(k) = -(η/2) d[Σ_{k=1}^{K} e^T(k) e(k)]/dx(k).
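A short sketch of the coupled recursion of equations 16 and 17 may help make the controller-plant case concrete. The linear plant x(k) = A x(k-1) + B u(k-1), the tanh controller u(k-1) = tanh(Wc x(k-1) + r(k-1)), and the finite-difference check below are illustrative assumptions, not constructs from the paper.

```python
import numpy as np

# Sketch of the coupled RTBP recursion for the controller-plant structure
# (equations 16-17), assuming a linear plant x(k) = A x(k-1) + B u(k-1)
# and a tanh controller u(k-1) = tanh(Wc x(k-1) + r(k-1)), w = vec(Wc).
rng = np.random.default_rng(2)
n, m, K = 3, 2, 6                      # state size, control size, horizon
A = 0.5 * rng.standard_normal((n, n))  # J_Px(k) = A for a linear plant
B = rng.standard_normal((n, m))        # J_Pu(k) = B
Wc = 0.3 * rng.standard_normal((m, n))
r = rng.standard_normal((K, m))

def final_state(Wc):
    x = np.ones(n)
    for k in range(K):
        x = A @ x + B @ np.tanh(Wc @ x + r[k])
    return x

# Coupled recursion: Lam_x = dx(k)/dw, Lam_u = du(k-1)/dw.
x = np.ones(n)
Lam_x = np.zeros((n, m * n))           # dx(0)/dw = 0
for k in range(K):
    u = np.tanh(Wc @ x + r[k])
    g = 1.0 - u ** 2
    J_C = g[:, None] * Wc                                    # du/dx
    Lam_u_loc = g[:, None] * np.kron(np.eye(m), x[None, :])  # du/dw
    Lam_u = J_C @ Lam_x + Lam_u_loc    # equation 17
    Lam_x = A @ Lam_x + B @ Lam_u      # equation 16
    x = A @ x + B @ u

# Check dx(K)/dw against central finite differences.
eps = 1e-6
fd = np.zeros((n, m * n))
for i in range(m * n):
    Wp, Wm = Wc.flatten(), Wc.flatten()
    Wp[i] += eps; Wm[i] -= eps
    fd[:, i] = (final_state(Wp.reshape(m, n)) - final_state(Wm.reshape(m, n))) / (2 * eps)

print(np.max(np.abs(Lam_x - fd)))      # tiny: recursion matches finite differences
```

Only the controller contributes a ∂/∂w term; the plant Jacobians J_Px and J_Pu simply propagate the sensitivity, mirroring the structure of Figure 5a.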
Conclusion

We have shown that real-time backpropagation and backpropagation-through-time are easily related when represented by signal flow graphs. In particular, the flow graphs corresponding to the two algorithms are the exact transpose of one another. As a consequence, flow graph theory could be applied to verify that the gradient calculations performed by the algorithms are equivalent. These principles were then extended to a controller-plant structure to illustrate how flow graph techniques can be applied to a variety of adaptive dynamic systems.

Appendix A: Flow Graph Interreciprocity

In this appendix we provide the formal definition of interreciprocity. We then prove that transposed flow graphs are interreciprocal, and that the transfer functions of single-input single-output interreciprocal flow graphs are identical.

Figure 6: Example of nodes and branches in a signal flow graph.

Let F be a flow graph. In F, we define: Y_k, the value associated with node k; T_{j,k}, the transmittance of the branch (j,k); and V_{j,k} = T_{j,k} Y_j, the output of branch (j,k). Let us further assume that each node k of the graph has associated to it a source node, i.e., a node connected to it by a branch of unity transmittance. Let X_k be the value of this source node (if node k has no associated source node, X_k is simply set to zero). It results from the above definitions that Y_k = Σ_j V_{j,k} + X_k = Σ_j T_{j,k} Y_j + X_k (see Figure 6).

Let us now consider a second flow graph, ~F, having the same topology as F (i.e., ~F has the same set of nodes and branches as F, but the branch transmittances of both graphs may differ). ~F is described with the variables ~Y_k, ~T_{j,k}, ~V_{j,k}, and ~X_k.

Definition 1: Two flow graphs, F and ~F, are said to be the transpose of each other iff their transmittance matrices are transposed, i.e.,

~T_{j,k} = T_{k,j}   ∀ j, k.   (18)

Definition 2 (Bordewijk, 1956): Two flow graphs, F and ~F, are said to be interreciprocal iff

Σ_k ( ~Y_k X_k - Y_k ~X_k ) = 0.   (19)

We can now state the following theorem:

Theorem 1: Transposed flow graphs are interreciprocal.
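Before turning to the proof, Theorem 1 and the SISO transfer-function equality derived below can be illustrated numerically. Writing the node equation Y_k = Σ_j T_{j,k} Y_j + X_k in matrix form gives Y = T^T Y + X, so Y = (I - T^T)^{-1} X, and the transposed graph simply swaps T_{j,k} for T_{k,j}. The random transmittance matrix and the choice of source and sink nodes below are illustrative.

```python
import numpy as np

# Numeric illustration of Theorem 1 / equation 21: a flow graph and its
# transpose have the same SISO transfer. Node values satisfy
# Y = T^T Y + X, hence Y = inv(I - T^T) X; the transposed graph uses T.
rng = np.random.default_rng(3)
N = 6
T = 0.3 * rng.standard_normal((N, N))  # branch gains, small enough for stability
I = np.eye(N)

a, b = 0, N - 1                        # source node of F, sink node of F

# Excite graph F at its source a with input 1.0; read the sink b.
X = np.zeros(N); X[a] = 1.0
Y = np.linalg.solve(I - T.T, X)
out_F = Y[b]

# Excite the transposed graph ~F at its source b (the old sink); read node a.
Xt = np.zeros(N); Xt[b] = 1.0
Yt = np.linalg.solve(I - T, Xt)
out_Ft = Yt[a]

print(out_F, out_Ft)                   # equal: X_a ~Y_a = ~X_b Y_b
```

The equality out_F == out_Ft is just the matrix identity [(I - T^T)^{-1}]_{b,a} = [(I - T)^{-1}]_{a,b}, which is the linear-algebra shadow of the interreciprocity argument proved next.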
Proof: Let F be a flow graph, and let ~F be the transpose of F. We start from the identity Σ_k Y_k ~Y_k = Σ_k ~Y_k Y_k, and replace Y_k by Σ_j T_{j,k} Y_j + X_k in the first member, and ~Y_k by Σ_j ~T_{j,k} ~Y_j + ~X_k in the second member (Oppenheim and Schafer, 1989). Rearranging the terms, we get:

Σ_{j,k} ( ~Y_k V_{j,k} - Y_k ~V_{j,k} ) + Σ_k ( ~Y_k X_k - Y_k ~X_k ) = 0.   (20)

Equation 20 is usually referred to as "the two-network form of Tellegen's theorem" (Tellegen, 1952; Penfield et al., 1970). Since ~F is the transpose of F, the first term of equation 20 can be rewritten as

Σ_{j,k} ( ~Y_k V_{j,k} - Y_k ~V_{j,k} ) = Σ_{j,k} ( ~Y_k T_{j,k} Y_j - Y_k ~T_{j,k} ~Y_j ) = Σ_{j,k} ( ~Y_k T_{j,k} Y_j - Y_k T_{k,j} ~Y_j ) = 0,

where the last equality follows because the two sums are identical after interchanging the summation indices j and k. Since the first term of equation 20 is zero, the second term Σ_k ( ~Y_k X_k - Y_k ~X_k ) is also zero. The flow graphs ~F and F are thus interreciprocal. QED.

The last step consists in showing that SISO interreciprocal flow graphs have the same transfer functions. Let node a be the unique source of F and node b its unique sink. From the definition of transposition, node a is the sink of ~F, and node b is its source. We thus have X_k = 0 ∀ k ≠ a and ~X_k = 0 ∀ k ≠ b. Therefore, equation 19 reduces to:

X_a ~Y_a = ~X_b Y_b.   (21)

This last equality can be interpreted as follows (Penfield et al., 1970; Oppenheim and Schafer, 1989): the output Y_b, obtained when exciting graph F with an input signal X_a, is identical to the output ~Y_a of the transposed graph ~F when exciting it at node b with an input ~X_b = X_a. The transfer functions of the SISO systems represented by the two flow graphs are thus identical, which is the desired conclusion.

References

Bordewijk, J. L. 1956. Inter-reciprocity applied to electrical networks. Appl. Sci. Res. 6B, 1-74.

Bryson, A. E., Jr., and Ho, Y. 1969. Applied Optimal Control, chapter 2. Blaisdell Publishing Co., New York.

Hertz, J. A., Krogh, A., and Palmer, R. G. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Reading, MA.

Kailath, T. 1980. Linear Systems.
Prentice-Hall, Englewood Cliffs, NJ.

Le Cun, Y. 1988. A theoretical framework for back-propagation. In Proceedings of the 1988 Connectionist Models Summer School, D. Touretzky, G. Hinton, and T. Sejnowski, eds., 21-28. Morgan Kaufmann, San Mateo, CA.

Ljung, L. 1987. System Identification: Theory for the User. Prentice-Hall, Englewood Cliffs, NJ.

Pineda, F. J. 1987. Generalization of back-propagation to recurrent neural networks. IEEE Transactions on Neural Networks, special issue on recurrent networks.

Plumer, E. S. 1993. Time-optimal terminal control using neural networks. In Proceedings of the IEEE International Conference on Neural Networks, San Francisco, CA, 1926-1931.
Ramo, S., Whinnery, J. R., and Van Duzer, T. 1984. Fields and Waves in Communication Electronics, second edition. John Wiley & Sons.

Rumelhart, D. E., and McClelland, J. L. 1986. Parallel Distributed Processing. The MIT Press, Cambridge, MA.

Tellegen, B. D. H. 1952. A general network theorem, with applications. Philips Res. Rep. 7, 259-269.

Werbos, P. 1990. Backpropagation through time: what it does and how to do it. Proc. IEEE 78(10), 1550-1560.

White, S. A. 1975. An adaptive recursive digital filter. Proc. 9th Asilomar Conf. Circuits Syst. Comput., 21.

Williams, R. J., and Zipser, D. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Computation 1(2), 270-280.