Chapter 15. Dynamically Driven Recurrent Networks


Neural Networks and Learning Machines (Haykin)
Lecture Notes on Self-learning Neural Algorithms
Byoung-Tak Zhang
School of Computer Science and Engineering, Seoul National University
Version 20171120

Contents
15.1 Introduction
15.2 Recurrent Network Architectures
15.3 Universal Approximation Theorem
15.5 Computational Power of Recurrent Networks
15.6 Learning Algorithms
15.7 Back Propagation Through Time
15.8 Real-Time Recurrent Learning
15.9 Vanishing Gradients in Recurrent Networks
15.10 Supervised Training Framework for Recurrent Networks
15.11 Computer Experiment: Mackey-Glass Attractor
15.12 Adaptivity Considerations
15.13 Case Study: Model Reference Applied to Neurocontrol
Summary and Discussion

15.1 Introduction

Global feedback is a facilitator of computational intelligence. In previous chapters we studied how the use of global feedback in a recurrent network makes it possible to achieve some useful tasks:
o Content-addressable memory
o Autoassociation
o Dynamic reconstruction of a chaotic process

In this chapter, we study other important applications of recurrent networks:
o Input-output mapping, the study of which naturally benefits from Chapter 14 on sequential state estimation
o Applying feedback from the output layer to the input of the hidden layer
o Combining all possible feedback loops in a single recurrent network structure
o Other configurations as building blocks for the construction of recurrent networks

Recurrent networks have a very rich repertoire of architectural layouts, which makes them all the more powerful in computational terms. A recurrent network responds temporally to an externally applied input signal; we may therefore speak of the recurrent networks considered in this chapter as dynamically driven recurrent networks.

15.2 Recurrent Network Architectures (1/8)

Four specific network architectures:
1) Input-Output Recurrent Model
2) State-Space Model
3) Recurrent Multilayer Perceptrons
4) Second-Order Network

They all incorporate a static multilayer perceptron, or parts thereof, and they all exploit the nonlinear mapping capability of the multilayer perceptron.

15.2 Recurrent Network Architectures (2/8)

1) Input-Output Recurrent Model
1. The model has a single input that is applied to a tapped-delay-line memory of q units.
2. It has a single output that is fed back to the input via another tapped-delay-line memory, also of q units.
3. The present value of the model input is denoted by $u_n$, and the corresponding value of the model output is denoted by $y_{n+1}$.
4. The dynamic behavior of the nonlinear autoregressive with exogenous inputs (NARX) model is described by
$y_{n+1} = F(y_n, \ldots, y_{n-q+1};\, u_n, \ldots, u_{n-q+1})$
where F is a nonlinear function of its arguments.

Figure 15.1 Nonlinear autoregressive with exogenous inputs (NARX) model; the feedback part of the network is shown in blue.
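To make the NARX recursion concrete, here is a minimal Python/NumPy sketch (not taken from the book): the nonlinear function F is assumed, for illustration only, to be a one-hidden-layer MLP with arbitrary random weights, and the two tapped-delay lines are plain NumPy arrays.

```python
import numpy as np

def narx_step(y_hist, u_hist, W_in, b_in, w_out, b_out):
    """One step of a NARX model: y_{n+1} = F(y_n,...,y_{n-q+1}; u_n,...,u_{n-q+1}).
    F is realized here as a one-hidden-layer MLP; y_hist and u_hist each hold the
    q most recent values stored in the two tapped-delay-line memories."""
    z = np.concatenate([y_hist, u_hist])      # regressor vector of length 2q
    h = np.tanh(W_in @ z + b_in)              # hidden layer of the static MLP
    return float(w_out @ h + b_out)           # scalar prediction y_{n+1}

# Usage with q = 3 delays and random (untrained) weights, run in closed loop.
rng = np.random.default_rng(0)
q, hidden = 3, 8
W_in, b_in = rng.normal(size=(hidden, 2 * q)), np.zeros(hidden)
w_out, b_out = rng.normal(size=hidden), 0.0

y_hist, u_hist = np.zeros(q), np.zeros(q)
for n in range(20):
    u_hist = np.roll(u_hist, 1); u_hist[0] = np.sin(0.2 * n)      # exogenous input u_n
    y_next = narx_step(y_hist, u_hist, W_in, b_in, w_out, b_out)
    y_hist = np.roll(y_hist, 1); y_hist[0] = y_next               # output fed back
```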

15.2 Recurrent Network Architectures (3/8)

2) State-Space Model

Figure 15.2 State-space model; the feedback part of the model is shown in blue.
Figure 15.3 Simple recurrent network (SRN); the feedback part of the network is shown in blue.

15.2 Recurrent Network Architectures (4/8)

2) State-Space Model
1. A state-space model, the basic idea of which was discussed in Chapter 14.
2. The output is fed back to the input layer via a bank of unit-time delays.
3. The input layer consists of a concatenation of feedback nodes and source nodes.
$x_{n+1} = a(x_n, u_n)$
$y_n = B x_n$
4. Elman's network (Fig. 15.3) contains recurrent connections from the hidden neurons to a layer of context units consisting of unit-time delays. These context units store the outputs of the hidden neurons for one time-step and then feed them back to the input layer.
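The state-space recursion can be written in a few lines. In the sketch below, the nonlinearity a(·,·) is assumed to be a single tanh layer acting on the concatenated feedback and source nodes; all sizes and weights are illustrative, not from the text.

```python
import numpy as np

def state_space_step(x, u, W_a, W_b, B):
    """State-space model: x_{n+1} = a(x_n, u_n), y_n = B x_n,
    with a(.,.) chosen here as a tanh layer over the feedback and source nodes."""
    y = B @ x                                  # linear read-out from the current state
    x_next = np.tanh(W_a @ x + W_b @ u)        # next state from concatenated inputs
    return x_next, y

q, m, p = 4, 2, 1                              # state, input, and output dimensions
rng = np.random.default_rng(1)
W_a, W_b = 0.5 * rng.normal(size=(q, q)), rng.normal(size=(q, m))
B = rng.normal(size=(p, q))

x = np.zeros(q)
for n in range(10):
    u = np.array([np.sin(0.3 * n), 1.0])       # second component acts as a bias input
    x, y = state_space_step(x, u, W_a, W_b, B)
```

An Elman (SRN) network fits the same template: the context units correspond to x_n, and the recurrence runs from the hidden layer back to the input layer.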

15.2 Recurrent Network Architectures (5/8)

3) Recurrent Multilayer Perceptrons

Figure 15.4 Recurrent multilayer perceptron; feedback paths in the network are printed in blue.

15.2 Recurrent Network Architectures (6/8)

3) Recurrent Multilayer Perceptrons
1. A recurrent multilayer perceptron (RMLP) has one or more hidden layers, basically for the same reasons that static multilayer perceptrons are often more effective and parsimonious than those using a single hidden layer.
2. Each computation layer of an RMLP has feedback around it, as illustrated in Fig. 15.4 for the case of an RMLP with two hidden layers:
$x_{I,n+1} = \phi_I(x_{I,n}, u_n)$
$x_{II,n+1} = \phi_{II}(x_{II,n}, x_{I,n+1})$
$\vdots$
$x_{o,n+1} = \phi_o(x_{o,n}, x_{K,n+1})$
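The layer-by-layer feedback of the RMLP can be sketched as follows; the choice of tanh layers and the weight shapes are assumptions made only for illustration.

```python
import numpy as np

def rmlp_step(states, u, layers):
    """One time-step of an RMLP: every computation layer (I, II, ..., output)
    feeds its own previous output back to its input and passes its new output
    forward to the next layer, matching the recursion on this slide."""
    new_states, forward = [], u
    for x_prev, (W_rec, W_in) in zip(states, layers):
        x_new = np.tanh(W_rec @ x_prev + W_in @ forward)   # phi(own feedback, layer input)
        new_states.append(x_new)
        forward = x_new                                    # input to the next layer
    return new_states

rng = np.random.default_rng(2)
sizes, m = [5, 4, 1], 2                        # two hidden layers plus an output layer
layers, prev = [], m
for s in sizes:
    layers.append((0.3 * rng.normal(size=(s, s)), rng.normal(size=(s, prev))))
    prev = s

states = [np.zeros(s) for s in sizes]
for n in range(10):
    states = rmlp_step(states, np.array([np.sin(0.1 * n), 1.0]), layers)
```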

15.2 Recurrent Network Architectures (7/8)

4) Second-Order Network

Figure 15.5 Second-order recurrent network; bias connections to the neurons are omitted to simplify the presentation. The network has 2 inputs and 3 state neurons, hence the need for 3 x 2 = 6 multipliers. The feedback links in the figure are printed in blue to emphasize their global role.

15.2 Recurrent Network Architectures (8/8)

4) Second-Order Network

First-order neuron:
$v_k = \sum_j w_{a,kj} x_j + \sum_i w_{b,ki} u_i$

Second-order neuron:
$v_k = \sum_i \sum_j w_{kij} x_i u_j$

Second-order recurrent network:
$v_{k,n} = b_k + \sum_i \sum_j w_{kij} x_{i,n} u_{j,n}$
$x_{k,n+1} = \varphi(v_{k,n}) = \dfrac{1}{1 + \exp(-v_{k,n})}$

State transition: $\delta(x_i, u_j) = x_k$
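A second-order recurrent update is just a bilinear form per neuron. The sketch below mirrors the equations above for the 3-state, 2-input network of Fig. 15.5, with arbitrary weights.

```python
import numpy as np

def second_order_step(x, u, W, b):
    """Second-order recurrent network:
    v_{k,n} = b_k + sum_i sum_j w_{kij} x_{i,n} u_{j,n};  x_{k,n+1} = 1/(1+exp(-v_{k,n}))."""
    v = b + np.einsum('kij,i,j->k', W, x, u)   # one multiplier per (state, input) pair
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(3)
K, M = 3, 2                                    # 3 state neurons x 2 inputs = 6 multipliers
W, b = rng.normal(size=(K, K, M)), np.zeros(K)

x = rng.uniform(size=K)                        # initial state in (0, 1)
u = np.array([1.0, 0.0])                       # one-hot encoding of the current input symbol
x = second_order_step(x, u, W, b)              # realizes the transition delta(x_i, u_j) = x_k
```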

15.3 Universal Approximation Theorem (1/2)

Any nonlinear dynamic system may be approximated by a recurrent neural network to any desired degree of accuracy and with no restrictions imposed on the compactness of the state space, provided that the network is equipped with an adequate number of hidden neurons.

$x_{n+1} = \phi(W_a x_n + W_b u_n)$
$y_n = W_c x_n$

$\phi: \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_q \end{bmatrix} \mapsto \begin{bmatrix} \varphi(x_1) \\ \varphi(x_2) \\ \vdots \\ \varphi(x_q) \end{bmatrix}$

with, for example,
$\varphi(x) = \tanh(x) = \dfrac{1 - e^{-2x}}{1 + e^{-2x}}$ or $\varphi(x) = \dfrac{1}{1 + e^{-x}}$

15.3 Universal Approximation Theorem (2/2)

Example 1: Fully Connected Recurrent Network
m = 2, q = 3, and p = 1

$W_a = \begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \\ w_{31} & w_{32} & w_{33} \end{bmatrix}, \quad W_b = \begin{bmatrix} b_1 & w_{14} & w_{15} \\ b_2 & w_{24} & w_{25} \\ b_3 & w_{34} & w_{35} \end{bmatrix}, \quad W_c = \begin{bmatrix} 1 & 0 & 0 \end{bmatrix}$

Figure 15.6 Fully connected recurrent network with two inputs, two hidden neurons, and one output neuron. The feedback connections are shown in blue to emphasize their global role.
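The example's weight layout can be checked with a short script. Here the bias is assumed (as in W_b above) to enter through the first column, so the input vector is [+1, u_1, u_2]^T; the numerical weight values are arbitrary.

```python
import numpy as np

q, m, p = 3, 2, 1                          # three state neurons, two inputs, one output
rng = np.random.default_rng(4)
W_a = 0.5 * rng.normal(size=(q, q))        # state-to-state weights w_11 ... w_33
W_b = rng.normal(size=(q, m + 1))          # first column carries the biases b_1, b_2, b_3
W_c = np.array([[1.0, 0.0, 0.0]])          # the output is the first state neuron

x = np.zeros(q)
for n in range(5):
    u_aug = np.array([1.0, np.sin(0.2 * n), np.cos(0.2 * n)])   # [+1, u_1, u_2]
    x = np.tanh(W_a @ x + W_b @ u_aug)     # x_{n+1} = phi(W_a x_n + W_b u_n)
    y = W_c @ x                            # y_n = W_c x_n
```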

15.5 Computational Power of Recurrent Networks (1/3)

"Every finite-state machine is equivalent to, and can be simulated by, some neural net. That is, given any finite-state machine M, we can build a certain neural net N_M which, regarded as a black-box machine, will behave precisely like M!"

Theorem I (Siegelmann and Sontag, 1991). All Turing machines may be simulated by fully connected recurrent networks built on neurons with sigmoidal activation functions.

Three functional blocks of a Turing machine:
o a control unit, which can assume any one of a finite number of possible states;
o a linear tape, assumed to be infinitely long in both directions, which is marked off into discrete squares, where each square is available to store a single symbol taken from a finite set of symbols;
o a read-write head, which moves along the tape and transmits information to and from the control unit.

Figure 15.7 Turing machine.

15.5 Computational Power of Recurrent Networks (2/3)

Figure 15.8 Illustration of Theorems I and II, and the corollary to them.

15.5 Computational Power of Recurrent Networks (3/3)

Theorem II (Siegelmann et al., 1997). NARX networks with one layer of hidden neurons with bounded, one-sided saturated (BOSS) activation functions and a linear output neuron can simulate fully connected recurrent networks with bounded, one-sided saturated activation functions, except for a linear slowdown.

Three conditions on the activation function:
1. The function $\varphi(\cdot)$ has a bounded range; that is, $a \le \varphi(x) \le b$, $a \ne b$, for all $x \in \mathbb{R}$.
2. The function $\varphi(\cdot)$ is saturated on the left side; that is, there exist values s and S such that $\varphi(x) = S$ for all $x \le s$.
3. The function $\varphi(\cdot)$ is nonconstant; that is, $\varphi(x_1) \ne \varphi(x_2)$ for some $x_1$ and $x_2$.

Example of a BOSS function:
$\varphi(x) = \begin{cases} \dfrac{1}{1 + \exp(-x)} & \text{for } x > s \\ 0 & \text{for } x \le s \end{cases}$
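The example BOSS function above is easy to implement and check numerically; the threshold s below is an arbitrary choice.

```python
import numpy as np

def boss(x, s=-3.0):
    """Bounded, One-sided Saturated (BOSS) activation from the slide:
    logistic for x > s, exactly zero (left saturation) for x <= s."""
    x = np.asarray(x, dtype=float)
    return np.where(x > s, 1.0 / (1.0 + np.exp(-x)), 0.0)

print(boss(np.linspace(-10, 10, 9)))   # bounded in [0, 1), constant 0 left of s, nonconstant
```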

15.6 Learning Algorithms

Two modes of training a recurrent network:
1. Epochwise training. For a given epoch, the recurrent network uses a temporal sequence of input-target response pairs and starts running from some initial state until it reaches a new state, at which point the training is stopped and the network is reset to an initial state for the next epoch.
2. Continuous training. This second method of training is suitable for situations where there are no reset states available or on-line learning is required. The distinguishing feature of continuous training is that the network learns while performing signal processing. Simply put, the learning process never stops.

Two different learning algorithms:
1. Back-propagation through time (BPTT), Section 15.7 - epochwise, continuous, or combined.
2. Real-time recurrent learning (RTRL), Section 15.8 - derived from the state-space model.

15.7 Back Propagation Through Time (1/3)

The back-propagation-through-time (BPTT) algorithm for training a recurrent network is an extension of the standard back-propagation algorithm. It may be derived by unfolding the temporal operation of the network into a layered feedforward network, the topology of which grows by one layer at every time-step.

Figure 15.9 (a) Architectural graph of a two-neuron recurrent network N. (b) Signal-flow graph of the network N unfolded in time.

15.7 Back Propagation Through Time (2/3)

Epochwise Back Propagation Through Time

$E_{\text{total}} = \dfrac{1}{2} \sum_{n=n_0}^{n_1} \sum_{j \in \mathcal{A}} e_{j,n}^2$

$\delta_{j,n} = -\dfrac{\partial E_{\text{total}}}{\partial v_{j,n}} = \begin{cases} \varphi'(v_{j,n})\, e_{j,n} & \text{for } n = n_1 \\ \varphi'(v_{j,n}) \left[ e_{j,n} + \sum_{k \in \mathcal{A}} w_{jk}\, \delta_{k,n+1} \right] & \text{for } n_0 < n < n_1 \end{cases}$

$\Delta w_{ji} = -\eta \dfrac{\partial E_{\text{total}}}{\partial w_{ji}} = \eta \sum_{n=n_0+1}^{n_1} \delta_{j,n}\, x_{i,n-1}$

15.7 Back Propagation Through Time (3/3)

Truncated Back Propagation Through Time

$E_l = \dfrac{1}{2} \sum_{j \in \mathcal{A}} e_{j,l}^2$

$\delta_{j,l} = -\dfrac{\partial E_l}{\partial v_{j,l}} \quad \text{for all } j \in \mathcal{A} \text{ and } n - h < l \le n$

$\delta_{j,l} = \begin{cases} \varphi'(v_{j,l})\, e_{j,l} & \text{for } l = n \\ \varphi'(v_{j,l}) \sum_{k \in \mathcal{A}} w_{jk,l}\, \delta_{k,l+1} & \text{for } n - h < l < n \end{cases}$

$\Delta w_{ji,n} = \eta \sum_{l=n-h+1}^{n} \delta_{j,l}\, x_{i,l-1}$

The Ordered Derivative Approach: if $a = \varphi(b, c)$, then $F^-_b = \dfrac{\partial \varphi}{\partial b} F^-_a$ and $F^-_c = \dfrac{\partial \varphi}{\partial c} F^-_a$.
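The sketch below turns the truncated-BPTT(h) equations into code for a simple fully connected network x_{l+1} = tanh(W_a x_l + W_b [1; u_l]); error is injected only at the newest step, as in the delta recursion above. The network sizes, targets, and learning rate are assumptions for illustration.

```python
import numpy as np

def tbptt_update(W_a, W_b, xi_hist, v_hist, e_vec, eta):
    """One truncated-BPTT(h) update. xi_hist[l] = (x_l, u_aug_l) and v_hist[l] are the
    stored inputs/potentials over the last h steps (oldest first); e_vec is the error
    injected at the newest step only."""
    dW_a, dW_b = np.zeros_like(W_a), np.zeros_like(W_b)
    delta = (1.0 - np.tanh(v_hist[-1]) ** 2) * e_vec            # delta at l = n
    for l in range(len(v_hist) - 1, -1, -1):
        if l < len(v_hist) - 1:                                 # n - h < l < n
            delta = (1.0 - np.tanh(v_hist[l]) ** 2) * (W_a.T @ delta)
        x_l, u_l = xi_hist[l]
        dW_a += np.outer(delta, x_l)                            # delta_{j,l} times the state that fed v_l
        dW_b += np.outer(delta, u_l)
    return W_a + eta * dW_a, W_b + eta * dW_b

# Driver: run forward keeping an h-step history, update the weights at every step.
rng = np.random.default_rng(5)
q, m, h, eta = 3, 1, 4, 0.05
W_a, W_b = 0.3 * rng.normal(size=(q, q)), rng.normal(size=(q, m + 1))

x, xi_hist, v_hist = np.zeros(q), [], []
for n in range(50):
    u_aug = np.array([1.0, np.sin(0.2 * n)])                    # bias + scalar input
    v = W_a @ x + W_b @ u_aug
    xi_hist.append((x.copy(), u_aug)); v_hist.append(v)
    xi_hist, v_hist = xi_hist[-h:], v_hist[-h:]                 # keep only h time-steps
    x = np.tanh(v)
    e_vec = np.zeros(q); e_vec[0] = np.sin(0.2 * (n + 1)) - x[0]  # error at the output neuron
    W_a, W_b = tbptt_update(W_a, W_b, xi_hist, v_hist, e_vec, eta)
```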

15.8 Real-Time Recurrent Learning (1/5)

Real-time recurrent learning (RTRL): adjustments are made to the synaptic weights of a fully connected recurrent network in real time, that is, while the network continues to perform its signal-processing function.

$x_{n+1} = \begin{bmatrix} \varphi(w_1^T \xi_n) \\ \vdots \\ \varphi(w_j^T \xi_n) \\ \vdots \\ \varphi(w_q^T \xi_n) \end{bmatrix}, \quad \xi_n = \begin{bmatrix} x_n \\ u_n \end{bmatrix}, \quad w_j = \begin{bmatrix} w_{a,j} \\ w_{b,j} \end{bmatrix}, \quad j = 1, 2, \ldots, q$

Figure 15.10 Fully connected recurrent network for formulation of the RTRL algorithm; the feedback connections are all shown in blue.

15.8 Real-Time Recurrent Learning (2/5)

$\Lambda_{j,n} = \dfrac{\partial x_n}{\partial w_j}, \quad j = 1, 2, \ldots, q$

$U_{j,n} = \begin{bmatrix} 0 \\ \xi_n^T \\ 0 \end{bmatrix} \leftarrow j\text{th row}, \quad j = 1, 2, \ldots, q$

$\Phi_n = \mathrm{diag}\big(\varphi'(w_1^T \xi_n), \ldots, \varphi'(w_j^T \xi_n), \ldots, \varphi'(w_q^T \xi_n)\big)$

$\Lambda_{j,n+1} = \Phi_n \big(W_{a,n} \Lambda_{j,n} + U_{j,n}\big), \quad j = 1, 2, \ldots, q$

15.8 Real-Time Recurrent Learning (3/5)

$e_n = d_n - y_n = d_n - W_c x_n$

$E_n = \dfrac{1}{2} e_n^T e_n$

$\dfrac{\partial E_n}{\partial w_j} = \left( \dfrac{\partial e_n}{\partial w_j} \right) e_n = -W_c \left( \dfrac{\partial x_n}{\partial w_j} \right) e_n = -W_c \Lambda_{j,n} e_n, \quad j = 1, 2, \ldots, q$

$\Delta w_{j,n} = -\eta \dfrac{\partial E_n}{\partial w_j} = \eta\, W_c \Lambda_{j,n} e_n, \quad j = 1, 2, \ldots, q$

$\Lambda_{j,0} = 0 \text{ for all } j$
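A direct transcription of the RTRL recursion is sketched below; the network follows the formulation of Fig. 15.10 with a bias appended to the vector ξ_n, and the sizes, targets, and learning rate are illustrative assumptions. The update writes the slide's η W_c Λ_{j,n} e_n explicitly as a column gradient for w_j.

```python
import numpy as np

def rtrl_step(W, W_c, Lambdas, x, u, d, eta):
    """One step of real-time recurrent learning for x_{n+1} = phi(W xi_n),
    xi_n = [x_n; 1; u_n], y_n = W_c x_n.  W stacks the rows w_j^T (j = 1..q) and
    Lambdas[j] holds the sensitivity matrix Lambda_{j,n} = dx_n/dw_j."""
    q = W.shape[0]
    xi = np.concatenate([x, [1.0], u])              # xi_n = [x_n; bias; u_n]
    v = W @ xi
    e = d - W_c @ x                                 # e_n = d_n - W_c x_n
    phi_prime = 1.0 - np.tanh(v) ** 2               # diagonal entries of Phi_n
    W_a = W[:, :q]                                  # recurrent part of the weights
    new_Lambdas, dW = [], np.zeros_like(W)
    for j in range(q):
        dW[j] = eta * (W_c @ Lambdas[j]).T @ e      # Delta w_{j,n} = eta W_c Lambda_{j,n} e_n
        U_j = np.zeros_like(Lambdas[j]); U_j[j] = xi            # xi_n^T in the j-th row
        new_Lambdas.append(phi_prime[:, None] * (W_a @ Lambdas[j] + U_j))
    return W + dW, new_Lambdas, np.tanh(v), e

# Driver: q state neurons, one input, one output read from the first neuron.
rng = np.random.default_rng(6)
q, m, eta = 3, 1, 0.05
W = 0.3 * rng.normal(size=(q, q + 1 + m))
W_c = np.array([[1.0, 0.0, 0.0]])
Lambdas = [np.zeros((q, q + 1 + m)) for _ in range(q)]          # Lambda_{j,0} = 0

x = np.zeros(q)
for n in range(200):
    u = np.array([np.sin(0.2 * n)])
    d = np.array([np.sin(0.2 * (n + 1))])                       # one-step-ahead target
    W, Lambdas, x, e = rtrl_step(W, W_c, Lambdas, x, u, d, eta)
```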

15.8 Real-Time Recurrent Learning (4/5)

15.8 Real-Time Recurrent Learning (5/5)

Teacher forcing, or the equation-error method, involves replacing the actual output of a neuron, during training of the network, with the corresponding desired response (i.e., target signal) in subsequent computation of the dynamic behavior of the network, whenever that desired response is available. It provides faster training and acts as a corrective mechanism.

$x_{j,n+1} = \varphi(v_{j,n}) = \tanh(v_{j,n})$

$\varphi'(v_{j,n}) = \dfrac{\partial \varphi(v_{j,n})}{\partial v_{j,n}} = \mathrm{sech}^2(v_{j,n}) = 1 - x_{j,n+1}^2$

Figure 15.11 Sensitivity graph of the fully recurrent network of Fig. 15.6. Note: the three nodes labeled $\xi_{l,n}$ are all to be viewed as a single input.
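In code, teacher forcing is a one-line change to the state update: whenever the desired response is available, it replaces the fed-back output before the next state is computed. This sketch reuses the state-update form of the earlier examples; the variable names are assumptions.

```python
import numpy as np

def step_with_teacher_forcing(W_a, W_b, x, u_aug, d=None, out_idx=0):
    """One state update with optional teacher forcing (equation-error method):
    if the desired response d is available, it replaces the fed-back output neuron."""
    x_fb = x.copy()
    if d is not None:
        x_fb[out_idx] = d                 # feed back the target instead of the actual output
    return np.tanh(W_a @ x_fb + W_b @ u_aug)
```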

15.9 Vanishing Gradients in Recurrent Networks (1/2)

The vanishing-gradients problem arises in the training of the network to produce a desired response at the current time that depends on input data in the distant past.

Robust information latching in a recurrent network is accomplished if the states of the network are contained in the reduced attracting set of a hyperbolic attractor.

Figure 15.12 Illustration of the vanishing-gradient problem: (a) state $x_n$ resides in the basin of attraction $\beta$, but outside the reduced attracting set $\gamma$; (b) state $x_n$ resides inside the reduced attracting set $\gamma$.

15.9 Vanishing Gradients in Recurrent Networks (2/2)

Long-Term Dependencies

$E_{\text{total}} = \dfrac{1}{2} \sum_i (d_{i,n} - y_{i,n})^2$

$\Delta w_n = -\eta \dfrac{\partial E_{\text{total}}}{\partial w} = \eta \sum_i (d_{i,n} - y_{i,n}) \dfrac{\partial y_{i,n}}{\partial w}$

$\dfrac{\partial y_{i,n}}{\partial w} = \dfrac{\partial y_{i,n}}{\partial x_{i,n}} \dfrac{\partial x_{i,n}}{\partial w}, \qquad x_{i,n+1} = \varphi_i(x_{i,n}, u_n)$

$\dfrac{\partial x_{i,n}}{\partial x_{i,k}} = J_{x,(n,k)}$

$\Delta w_n = \eta \sum_i (d_{i,n} - y_{i,n}) \dfrac{\partial y_{i,n}}{\partial x_{i,n}} \left( \sum_{k=1}^{n} \dfrac{\partial x_{i,n}}{\partial x_{i,k}} \dfrac{\partial x_{i,k}}{\partial w_k} \right)$

$\det(J_{x,(n,k)}) \to 0 \text{ as } k \to \infty \text{ for all } n$

The network is not robust to the presence of noise in the input signal, or else the network is unable to discover long-term dependencies (i.e., relationships between target outputs and inputs that occur in the distant past).
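The vanishing-gradient effect can be observed directly by accumulating the state Jacobian of a contractive tanh network over many time-steps; the weights below are arbitrary and scaled to keep the dynamics stable.

```python
import numpy as np

rng = np.random.default_rng(7)
q = 5
W_a = 0.4 * rng.normal(size=(q, q))         # contractive recurrent weights (illustrative)

x = rng.normal(size=q)
J = np.eye(q)                               # Jacobian of the current state w.r.t. the state k steps back
for step in range(1, 31):
    v = W_a @ x
    J = np.diag(1.0 - np.tanh(v) ** 2) @ W_a @ J   # chain rule through one time-step
    x = np.tanh(v)
    if step % 10 == 0:
        print(step, np.linalg.norm(J), np.linalg.det(J))
# Both the norm and the determinant of J shrink roughly geometrically with the number
# of steps, so the gradient contribution from inputs in the distant past vanishes.
```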

15.10 Supervised Training Framework for Recurrent Networks Using Nonlinear Sequential State Estimators (1/4)

Figure 15.13 Nonlinear state-space model depicting the underlying dynamics of a recurrent network undergoing supervised training.

$w_{n+1} = w_n + \omega_n$
$d_n = b(w_n, v_n, u_n) + \nu_n$

15.10 Supervised Training Framework for Recurrent Networks Using Nonlinear Sequential State Estimators (2/4)

Description of the Supervised-Training Framework Using the Extended Kalman Filter

The recurrent neural network, undergoing training, performs the role of the predictor; the extended Kalman filter, providing the supervision, performs the role of the corrector.

15.10 Supervised Training Framework for Recurrent Networks Using Nonlinear Sequential State Estimators (3/4)

Description of the Supervised-Training Framework Using the Extended Kalman Filter

$\alpha_n = d_n - b(\hat{w}_{n|n-1}, v_n, u_n)$

$\hat{w}_{n|n} = \hat{w}_{n|n-1} + G_n \alpha_n$

$\hat{w}_{n|n} = \hat{w}_{n|n-1} + G_n (d_n - y_n)$
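A generic EKF corrector for the weight vector looks as follows. This is a sketch, not the book's full algorithm: the measurement Jacobian B (the derivative of the network output with respect to the weights, obtained by linearizing the network with BPTT or RTRL) is assumed to be supplied, and Q and R are design parameters.

```python
import numpy as np

def ekf_weight_update(w_hat, P, d, y, B, Q, R):
    """One EKF corrector step for supervised training of a recurrent network.
    w_hat : predicted weight estimate w_{n|n-1}
    P     : prediction-error covariance of the weights
    d, y  : desired and actual network outputs at time n
    B     : measurement Jacobian dy/dw from linearizing the network (BPTT/RTRL)
    Q, R  : process- and measurement-noise covariances."""
    P_pred = P + Q                                  # random-walk model w_{n+1} = w_n + omega_n
    S = B @ P_pred @ B.T + R                        # innovation covariance
    G = P_pred @ B.T @ np.linalg.inv(S)             # Kalman gain G_n
    alpha = d - y                                   # innovation alpha_n = d_n - y_n
    w_new = w_hat + G @ alpha                       # w_{n|n} = w_{n|n-1} + G_n alpha_n
    P_new = (np.eye(len(w_hat)) - G @ B) @ P_pred   # updated filtering-error covariance
    return w_new, P_new
```

The decoupled EKF of the next slide applies this same update independently to each disjoint group of weights, each with its own smaller covariance block.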

15.10 Supervised Training Framework for Recurrent Networks Using Nonlinear Sequential State Estimators (4/4)

Decoupled Extended Kalman Filter

Figure 15.15 Block-diagonal representation of the filtering-error covariance matrix pertaining to the decoupled extended Kalman filter (DEKF). The shaded parts of the square represent the nonzero blocks of the covariance matrix, indexed by i = 1, 2, 3, 4 for the example illustrated in the figure.

As we make the number of disjoint weight groups, g, larger, more zeros are created in the covariance matrix $P_{n|n}$; in other words, the matrix $P_{n|n}$ becomes more sparse. The computational burden is therefore reduced, but the numerical accuracy of the state estimation becomes degraded.

15.11 Computer Experiment: Dynamic Reconstruction of the Mackey-Glass Attractor

Figure 15.16 Ensemble-averaged cumulative absolute error curves during the autonomous prediction phase of dynamic reconstruction of the Mackey-Glass attractor.

$\dfrac{dx_t}{dt} = -b x_t + \dfrac{a x_{t-\Delta t}}{1 + x_{t-\Delta t}^{10}}$

Filters compared:
o Extended Kalman filter (EKF)
o Central-difference Kalman filter (CDKF)
o Cubature Kalman filter (CKF)
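The Mackey-Glass series used for the reconstruction experiment can be generated by simple Euler integration of the delay-differential equation above; the parameter values below are common choices for a chaotic regime and are not necessarily the exact settings behind Fig. 15.16.

```python
import numpy as np

def mackey_glass(a=0.2, b=0.1, delay=30.0, dt=0.1, n_steps=5000, x0=1.2):
    """Euler integration of dx/dt = -b x(t) + a x(t - delay) / (1 + x(t - delay)^10),
    started from a constant history x(t) = x0 for t <= 0."""
    lag = int(delay / dt)
    x = np.full(n_steps + lag, x0)
    for t in range(lag, n_steps + lag - 1):
        x_del = x[t - lag]
        x[t + 1] = x[t] + dt * (-b * x[t] + a * x_del / (1.0 + x_del ** 10))
    return x[lag:]                      # drop the constant-history warm-up

series = mackey_glass()
# 'series' is then sampled into input-target pairs for one-step prediction during
# training and used for autonomous (free-running) prediction during reconstruction.
```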

15.12 Adaptivity Considerations

Adaptive Critic

Figure 15.17 Block diagram illustrating the use of an adaptive critic for the control of recurrent node activities $v_n$ in a recurrent neural network (assumed to have a single output); the part of the figure involving the critic is shown in blue.

Consider a recurrent neural network embedded in a stochastic environment with relatively small variability in its statistical behavior. Provided that the underlying probability distribution of the environment is fully represented in the supervised-training sample supplied to the network, it is possible for the network to adapt to the relatively small statistical variations in the environment without any further on-line adjustments being made to the synaptic weights of the network.

15.13 Case Study: Model Reference Applied to Neurocontrol

Figure 15.18 Model-reference adaptive control system; the feedback loop of the system is printed in blue.

$J(w_k, \theta_k) = \dfrac{1}{T} \sum_{n=1}^{T} \sum_i \big( y_{i,r}(n) - y_i(n, w_k, \theta_k) \big)^2$

Summary and Discussion (1/2)

Four main recurrent network models with global feedback:
o Nonlinear autoregressive networks with exogenous inputs (NARX networks), which use feedback from the output layer to the input layer
o Fully connected recurrent networks, which use feedback from the hidden layer to the input layer
o Recurrent multilayer perceptrons with more than one hidden layer, which use feedback from the output of each computation layer to its own input
o Second-order recurrent networks, which use second-order neurons

Properties of recurrent neural networks:
o They are universal approximators of nonlinear dynamic systems, provided that they are equipped with an adequate number of hidden neurons.
o They are locally controllable and observable, provided that their linearized versions satisfy certain conditions around the equilibrium point.
o Given any finite-state machine, we can build a recurrent neural network which, regarded as a black-box machine, will behave like that finite-state machine.
o Recurrent neural networks exhibit a meta-learning (i.e., learning-to-learn) capability.

Summary and Discussion (2/2)

Gradient-based learning algorithms:
o Back propagation through time (BPTT) - off-line learning
o Real-time recurrent learning (RTRL) - on-line learning

Supervised-learning algorithms based on nonlinear sequential state estimation:
o Extended Kalman filter (EKF), with the linearization of the measurement model pertaining to the recurrent neural network obtained by using the BPTT or RTRL algorithm.
o Derivative-free nonlinear sequential state estimators (CKF / CDKF). In so doing, not only is the applicability of this approach to supervised learning broadened, but numerical accuracy is also improved (at the cost of increased computational requirements).