Locally Connected Recurrent Networks

Lai-Wan CHAN and Evan Fung-Yu YOUNG
Computer Science Department, The Chinese University of Hong Kong
New Territories, Hong Kong
Email: lwchan@cs.cuhk.hk

Technical Report: CS-TR-95-10

Abstract

The fully connected recurrent network (FRN) using the on-line training method Real Time Recurrent Learning (RTRL) is computationally expensive. It has a computational complexity of O(N^4) and a storage complexity of O(N^3), where N is the number of non-input units. We have devised a locally connected recurrent model which has a much lower complexity in both computational time and storage space. The ring-structured recurrent network (RRN), the simplest kind of locally connected network, has the corresponding complexities of O(mn+np) and O(np) respectively, where p, n and m are the numbers of input, hidden and output units respectively. We compare the performance of RRN and FRN in sequence recognition and time series prediction. In the sequence recognition task we tested the networks' temporal memorizing power and time warping ability. In the time series prediction task, we used both networks to train on and predict three series: a periodic series with white noise, a deterministic chaotic series and the sunspots data. Both tasks show that RRN needs a much shorter training time and that its performance is comparable to that of FRN.

1 Introduction

The recurrent neural network is an attractive model because it can carry information through time, but the problem of training recurrent models to encode temporal information has not been solved satisfactorily. To make recurrent networks popular in real-life applications, we need an efficient and powerful training method that taps their full power. Many researchers have worked on this problem over the past decades and there are some exciting results on training procedures. Backpropagation Through Time (BPTT) [4] is an efficient learning algorithm but it cannot be run on-line and is impractical for tasks with input signals of unknown length. Real Time Recurrent Learning (RTRL) [9] is a powerful on-line learning algorithm but it is extremely inefficient and is also non-local in space. There are also some other variants of RTRL [12, 7].

Alternatively, non-gradient-descent methods like NBB [5] and TD [6] have many desirable features, but they do not guarantee convergence. The challenge of devising new learning algorithms for recurrent models is thus an attractive, interesting and useful task.

We usually consider a fully connected architecture for a recurrent model because this very general architecture gives the model great freedom to build its own internal representations for encoding temporal information. This general architecture, on the other hand, imposes a great burden on the training process. In a fully recurrent network, any weight w_ij can affect the activation of any unit u_k within one time step, since unit u_i is connected to unit u_k directly. Therefore the weight w_ij is "temporally" close to unit u_k although it may not be "spatially" close. In fact, in RTRL we need information from every other unit in order to update the weight w_ij. This inherent property of being fully connected slows down the training process and also makes the process non-local in space. The fully connected architecture is also unnatural as a model of a nervous system, since every unit can affect every other unit directly within one time step. Furthermore, the model is not suitable for parallelization: the communication time is long because each processing unit needs to communicate with every other unit at each epoch.

In our work, we abandon the fully connected architecture and consider one in which every hidden unit is connected to only a number of its neighbours, and we call it a "locally connected" recurrent network. We have devised an on-line learning algorithm for this new kind of network model such that information can be carried indefinitely through time, although the effect of a weight is not propagated to every unit. The new learning process has a much lower computational complexity in both time and space in comparison with the other approaches and is also more local in space, in the sense that a weight update needs only the information from those units in its vicinity. Besides, this algorithm has much flexibility for mapping onto different parallel architectures, and the parameters can be chosen in such a way that the computational process is totally local in space, i.e. the computations in a processing unit rely only on those units directly connected to it.

In this paper, we first describe the network architecture in Section 2.

Then we describe the learning algorithm and discuss its computational complexity, storage complexity and other important aspects in Section 3. Section 4 describes the simplest kind of locally connected recurrent model, called the ring-structured recurrent network (RRN). Sections 5 and 6 compare RRN with the fully recurrent network (FRN) [9] in the tasks of sequence recognition and time series prediction respectively, in terms of their performance in both training and testing.

Figure 1: Architecture of a Locally Connected Recurrent Network (input units, hidden units and output units)

2 Locally Connected Recurrent Networks

2.1 Network Topology

The model is made up of three layers of units: an input layer, a hidden layer and an output layer. Each hidden node is connected to every input unit and every output unit, as shown in Figure 1. The inter-layer connections all run forward (the solid lines in Figure 1) and recurrent links (the dotted curves in Figure 1) exist in the hidden layer only. In the hidden layer, each unit is connected to some other hidden units, forming a partially connected structure. Different structures can be formed depending on the way in which the hidden nodes are connected. For simplicity, we consider only homogeneous structures, so every unit in the structure is identical in terms of connections. Three common examples are shown in Figure 2: each unit in a ring is connected to two others (Figure 2a), each unit in a grid to four (Figure 2b), and each unit in an n-dimensional cube (hypercube) to n (Figure 2c).

Figure 2: Connections in the Hidden Layer of a Locally Connected Recurrent Network. The black dots are the hidden neurons and the solid lines are the recurrent weights (for simplicity, only one line is drawn between two nodes). (a) ring structure (b) grid structure (c) cubic structure.

Figure 3: Locally Connected Recurrent Networks after Unfolding (a 3-net and a 5-net unrolled over time steps t = 0, 1, 2)

We define the term "neighborhood" of a hidden unit u_j as follows:

Neighborhood of unit u_j:  Q_j = { u_i | u_i is a hidden unit and u_i is connected to u_j directly }

and the degree of connectivity:  q = max_j [ cardinality of Q_j ]

The neighborhood of a hidden node is thus the set of all hidden units having a direct connection to that node. A network is called a q-net if each hidden unit has q incoming recurrent connections (including the self-feedback loop). Therefore a ring is a 3-net, a grid is a 5-net and an n-dimensional cube is an (n+1)-net.

2.2 Subgrouping

The hidden layer can be unfolded in time to form a feedforward structure, which makes the subgroups easy to visualise. Figure 3 shows the connections in a 3-net and a 5-net after unfolding. We divide the hidden units into overlapping subgroups such that each unit u_j is at the center of one subgroup G_{j,τ,q}:

G_{j,τ,q} = { u_i | u_i is a hidden unit in a q-net and u_i affects u_j in at most τ time steps }

Here τ is a parameter which determines how far an effect propagates among the units. Figure 4 gives an example of a subgroup in a 3-net and a 5-net when τ is equal to two. These subgroups have the following properties (a code sketch of these definitions follows the list):

- The subgroups overlap with one another.
- All subgroups have the same size, and the size depends on τ and the degree of connectivity q (except for boundary cases). The maximum size of one subgroup is denoted by r, and r = q + (q-1)(τ-1).
- There are n subgroups in total, where n is the number of hidden units.
- Each hidden unit belongs to r subgroups, where r is the size of one subgroup, and it is the center of exactly one of them.
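To make the neighborhood and subgroup definitions concrete, the sketch below builds Q_j and G_{j,τ,q} for a ring-structured hidden layer. This is illustrative code rather than the authors' implementation; the function names and the breadth-first expansion over τ steps are our own choices.

```python
def ring_neighborhoods(n):
    """Q_j for a ring: each hidden unit feeds itself and its two nearest neighbours (a 3-net)."""
    return {j: {j, (j - 1) % n, (j + 1) % n} for j in range(n)}

def subgroup(Q, j, tau):
    """G_{j,tau,q}: hidden units that can affect unit j within at most tau time steps."""
    group, frontier = {j}, {j}
    for _ in range(tau):
        # Walk one step backwards along the recurrent links.
        frontier = {i for k in frontier for i in Q[k]} - group
        group |= frontier
    return group

n, tau = 10, 2
Q = ring_neighborhoods(n)
q = max(len(Q[j]) for j in Q)        # degree of connectivity: 3 for a ring
G0 = subgroup(Q, 0, tau)             # e.g. {8, 9, 0, 1, 2} when tau = 2
r = q + (q - 1) * (tau - 1)          # maximum subgroup size, as given above
assert len(G0) == r == 5
```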

Figure 4: Subgrouping in Locally Connected Recurrent Networks. (a) 3-net, (b) 5-net; a subgroup centered at the unfilled unit.

Since τ and q are fixed in a particular network model, we will omit these subscripts from G_{j,τ,q} in the following derivation.

2.3 Learning Algorithm

The training algorithm obeys the following rule:

If a weight w_ij can only affect the activation of a unit u_k after at least τ+1 time steps, its effect on u_k is neglected.

where τ is the fixed parameter described in the previous section. From the derivation of the backpropagation of errors in a feedforward network, the error terms in each consecutive layer are diminished by a factor equal to the derivative of the sigmoid function f(x), i.e. f(x)[1 - f(x)]. This factor is at most 0.25, so the effect of w_ij diminishes as it is propagated through time, provided the connection weights are not exceptionally large. Therefore, the rule of neglecting temporally distant terms is justified. Now w_ij affects only a fixed set of units:

{ u_k | u_k ∈ G_i }

and u_k is only affected by a fixed set of weights:

{ w_ij | u_i ∈ G_k }

An example of this is shown in Figure 5.
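As a quick numerical check of this truncation argument (our own illustration, assuming the logistic sigmoid f(x) = 1/(1+exp(-x))): the derivative f'(x) = f(x)[1 - f(x)] never exceeds 0.25, so an effect propagated through t time steps is attenuated by a factor of at most 0.25^t when the weights are of moderate size.

```python
import numpy as np

# Logistic sigmoid and its derivative f'(x) = f(x) * (1 - f(x)).
f = lambda x: 1.0 / (1.0 + np.exp(-x))
df = lambda x: f(x) * (1.0 - f(x))

xs = np.linspace(-10, 10, 10001)
print(df(xs).max())                       # ~0.25, attained at x = 0
print([0.25 ** t for t in range(1, 5)])   # bound on the attenuation after t steps
```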

Figure 5: Effect Propagation in a Locally Connected Recurrent Network. Information about the dotted-line weight is kept in five units; the unfilled unit keeps information about the solid weights only.

Suppose there are p input units, n hidden units and m output units in a q-net, and we use the capital letters P, N and M to denote the set of input units, the set of hidden units and the set of output units respectively. Each hidden unit is connected to all input units and all output units, while the hidden units themselves are interconnected as described above. At time step t, each hidden unit u_i computes its output as:

y_i(t) = g(x_i(t))    (1)
       = g\left( \sum_{u_j \in Q_i \cup P} w_{ij} z_j(t) \right),   u_i \in N    (2)

where

z_j(t) = y_j(t-1) if u_j \in Q_i, and z_j(t) = I_j(t) if u_j \in P
g can be any differentiable function
x_i(t) is the total input to u_i at time t
w_ij is the weight running from u_j to u_i
Q_i is the neighborhood of u_i
I_k(t) is the k-th input at time t

There is no direct connection between the input layer and the output layer. Each output unit u_i simply computes its activation at time t as:

y_i(t) = g(x_i(t))    (3)
       = g\left( \sum_{u_j \in N} w_{ij} z_j(t) \right),   u_i \in M    (4)
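As an illustration of the dynamics in equations (1)-(4), here is a minimal forward pass for a ring-structured hidden layer (a q-net with q = 3). This is our own sketch, not the authors' code; g is taken to be the logistic sigmoid for both hidden and output units, as in equations (1)-(4) (the prediction experiments in Section 6 use linear output units instead).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LocallyConnectedRNN:
    """Forward dynamics of equations (1)-(4) for a ring-structured hidden layer."""

    def __init__(self, p, n, m, rng=np.random.default_rng(0)):
        self.p, self.n, self.m = p, n, m
        # Q_i for a ring: the two nearest neighbours plus the self-loop.
        self.Q = [((i - 1) % n, i, (i + 1) % n) for i in range(n)]
        self.W_in = 0.1 * rng.standard_normal((n, p))    # input -> hidden
        self.W_rec = 0.1 * rng.standard_normal((n, 3))   # recurrent links, one row per hidden unit
        self.W_out = 0.1 * rng.standard_normal((m, n))   # hidden -> output
        self.y = np.zeros(n)                             # hidden activations y_j(t-1)

    def step(self, I_t):
        """One time step: equations (1)-(2) for hidden units, (3)-(4) for output units."""
        z_rec = np.array([[self.y[j] for j in self.Q[i]] for i in range(self.n)])
        x_hidden = (self.W_rec * z_rec).sum(axis=1) + self.W_in @ I_t
        self.y = sigmoid(x_hidden)
        return sigmoid(self.W_out @ self.y)

net = LocallyConnectedRNN(p=2, n=10, m=1)
outputs = [net.step(I_t) for I_t in np.random.default_rng(1).random((5, 2))]
```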

Only the output units contribute to the cost function:

G = \sum_t E(t)    (5)

E(t) = \frac{1}{2} \sum_{u_k \in M} E_k(t)^2    (6)

E_k(t) = d_k(t) - y_k(t)    (7)

where

G is the total error
E(t) is the error of the whole network at time t
E_k(t) is the error of the k-th output unit at time t
d_k(t) is the k-th target output at time t

Minimizing G will teach the network to make the y_i's imitate the d_i's. This can be done using gradient descent with a momentum term:

\Delta w_{ij}(t) = -\eta \, \partial E(t)/\partial w_{ij} + \alpha \, \Delta w_{ij}(t-1)    (8)
                 = \eta \sum_{u_k \in M} E_k(t) \, \partial y_k(t)/\partial w_{ij} + \alpha \, \Delta w_{ij}(t-1)    (9)

where

\eta is the learning rate
\alpha is the momentum coefficient
\Delta w_{ij}(t) is the change in w_{ij} at time t

The weights between the hidden layer and the output layer are non-recurrent, so their updates can be done by error backpropagation as in an ordinary feedforward network:

\Delta w_{ij}(t) = \eta \, E_i(t) \, g'(x_i(t)) \, y_j(t) + \alpha \, \Delta w_{ij}(t-1),   where u_i \in M and u_j \in N    (10)
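A small sketch of the output-layer part of this update, equations (6), (7) and (10), under our own naming (eta for the learning rate η, alpha for the momentum coefficient α); it assumes sigmoid output units as in equation (3).

```python
import numpy as np

def output_layer_update(W_out, dW_prev, y_hidden, y_out, d, eta=0.1, alpha=0.9):
    """Delta-rule update of the hidden->output weights, equations (7) and (10).

    W_out    : (m, n) weights from hidden to output units
    dW_prev  : (m, n) previous weight change, used by the momentum term
    y_hidden : (n,)   hidden activations y_j(t)
    y_out    : (m,)   output activations y_i(t) = g(x_i(t)), g the sigmoid
    d        : (m,)   target outputs d_i(t)
    """
    E = d - y_out                                 # equation (7)
    loss = 0.5 * np.sum(E ** 2)                   # equation (6)
    g_prime = y_out * (1.0 - y_out)               # g'(x_i) for the sigmoid
    dW = eta * np.outer(E * g_prime, y_hidden) + alpha * dW_prev   # equation (10)
    return W_out + dW, dW, loss
```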

For the incoming connections to the hidden units, we can propagate the error terms one step back from the output layer to the hidden layer as in equation (11) and update the weights according to equation (12):

\partial E(t)/\partial y_k = - \sum_{u_l \in M} E_l(t) \, g'(x_l(t)) \, w_{lk}(t),   u_k \in N    (11)

\Delta w_{ij}(t) = -\eta \sum_{u_k \in G_i} \left( \partial E(t)/\partial y_k \right) \left( \partial y_k(t)/\partial w_{ij} \right) + \alpha \, \Delta w_{ij}(t-1),   u_i \in N    (12)

since w_ij can only affect the units in G_i. The derivatives \partial y_k(t)/\partial w_{ij} can be obtained by differentiating the dynamic rule in equation (1):

\partial y_k(t)/\partial w_{ij} = g'(x_k(t)) \left[ \sum_{u_p \in Q_k} w_{kp} \, \partial y_p(t-1)/\partial w_{ij} + \delta_{ki} \, z_j(t-1) \right]    (13)

where \delta_{ki} = 1 if u_i = u_k and 0 if u_i \neq u_k, with u_k \in G_i and u_i \in N. This relates the derivatives \partial y_k(t)/\partial w_{ij} at time t to those at time t-1. We can thus iterate it forward from the initial conditions:

\partial y_k(0)/\partial w_{ij} = 0    (14)

\Delta w(0) = 0    (15)

The derivatives and weight changes can be calculated at each time step along the way, and the full change in w_ij is taken as the time-sum of \Delta w_{ij}(t) at the end of the training sequence.

3 Analysis

3.1 Time Complexity

In the whole training process, the most time-consuming operation is the update of the sensitivity matrix [ p^k_ij = \partial y_k(t)/\partial w_{ij} ] according to equation (13). For each element p^k_ij in the matrix, we need to go through equation (13) once in each time step to update its value, and this requires q operations to calculate the summation on the right-hand side, where q is the degree of connectivity. Since there are n(q+p) weights in the hidden layer and each weight has r such p-terms, where r is the size of one subgroup, there are nr(q+p) p-terms in total.
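The sketch below (ours, reusing the ring-structured network of the earlier examples) shows one way to store and update the truncated sensitivity terms p^k_ij of equation (13). Only the derivatives for units u_k inside the subgroup G_i of each weight's post-synaptic unit u_i are kept, which is exactly what reduces the cost below that of full RTRL; names such as P_sens are our own.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def update_sensitivities(P_sens, W_rec, Q, G, x_hidden, z_prev):
    """One step of equation (13), truncated to the subgroups G_i.

    P_sens   : dict {(i, j): {k: dy_k/dw_ij}} kept only for hidden units k in G[i]
    W_rec    : dict {(k, p): w_kp}, recurrent weight from hidden unit p to hidden unit k
    Q        : dict {k: iterable of p}, neighborhood of hidden unit k
    G        : dict {i: set of k}, subgroup centered at hidden unit i
    x_hidden : dict {k: x_k(t)}, total input to hidden unit k at the current step
    z_prev   : dict {(i, j): z_j(t-1)}, signal that travelled along weight w_ij at t-1
    """
    g_prime = {k: sigmoid(x) * (1.0 - sigmoid(x)) for k, x in x_hidden.items()}
    new_P = {}
    for (i, j), pk in P_sens.items():
        new_P[(i, j)] = {}
        for k in G[i]:                                   # only units w_ij can still affect
            recur = sum(W_rec[(k, p)] * pk.get(p, 0.0)   # dy_p(t-1)/dw_ij, zero outside G_i
                        for p in Q[k])
            kron = z_prev[(i, j)] if k == i else 0.0     # delta_{ki} * z_j(t-1)
            new_P[(i, j)][k] = g_prime[k] * (recur + kron)
    return new_P
```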

Updating the sensitivity matrix at each time step thus has a total computational complexity of order O(nqr(q+p)) operations. In a locally connected recurrent network, q and r are constants which are much smaller than n, so the computational complexity of the above algorithm is greatly improved in comparison with the O(N^4) of a FRN, the fully recurrent network derived by Williams and Zipser [9], where N is equal to the number of non-input units (i.e. N = m+n).* It is true that the n here may differ from that of a FRN, i.e. we may need more hidden units in a locally connected recurrent model, but experimental results show that this increase is far too small to counterbalance the O(N^4) of FRN. Taking into account the other operations, the total computational complexity is of order O(mn + nqr(p+q)) per time step. If we assume that the parameters q and r are fixed in a recurrent model, the complexity becomes O(mn + np), which is optimal since the network already has this order of number of connection weights.

* In a FRN the recurrent weights occur between all hidden and output units, and the stated complexity does not include the effect of the input units, whose number is assumed to be much smaller than N.

3.2 Space Complexity

We keep a p^k_ij-term for a particular weight w_ij if and only if the unit u_k is in the subgroup G_i centered at the unit u_i. Since there are r units in each subgroup and there are n(q+p) weights in the hidden layer, the total amount of storage needed is O(nr(q+p)). For fixed q and r, the complexity turns into O(np). This space complexity is again smaller than the O(N^3) of FRN.†

† If the weights between the input units and the non-input units are taken into account, the complexity should contain an extra term of Np.

3.3 Local Computations in Time and Space

This learning algorithm is local in time, since the computations at time t depend only on the information at time t and time t-1. Training can thus be done on-line. Furthermore, this algorithm is more local in space than FRN, as the update of a weight w_ij according to equation (13) needs only the p-terms from those units u_k which are in the vicinity of unit u_i, i.e. u_k ∈ G_i. Thus, unlike FRN, in which updating a weight needs some information from every other unit in the network, only those units in the proximity are involved here.
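A back-of-the-envelope comparison of the orders of growth stated in Sections 3.1 and 3.2 (our own illustration; constant factors are ignored, so the numbers only indicate scaling):

```python
def frn_ops(n, m):
    """Per-step cost of full RTRL on a FRN: O(N^4) with N = m + n non-input units."""
    return (m + n) ** 4

def local_ops(n, m, p, q=3, tau=1):
    """Per-step cost of the locally connected algorithm: O(mn + n*q*r*(q+p))."""
    r = q + (q - 1) * (tau - 1)
    return m * n + n * q * r * (q + p)

for n in (10, 50, 100):
    print(n, frn_ops(n, m=1), local_ops(n, m=1, p=1))
# For a ring (q = r = 3) the local cost grows linearly in n,
# whereas the RTRL cost grows with the fourth power of N.
```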

Figure 6: Analogy between a Fully Recurrent Network and a Ring-Structured Recurrent Network

4 Ring-Structured Recurrent Network (RRN)

A ring-structured recurrent network (RRN) is a particular type of locally connected recurrent network and is the simplest kind. It is a model with the degree of connectivity q equal to three and τ equal to one: each hidden unit is connected to itself and its two nearest neighbors only. The 3-net in Figure 2 shows its connections in the hidden layer. In spite of its simplicity, it becomes more powerful as we keep increasing the number of hidden units. To see this intuitively, we may draw an analogy between a fully connected recurrent network and an RRN by tracing out a Hamiltonian circuit in the former model (Figure 6): one unit in the former is represented by several units in the latter. Increasing the number of hidden units in an RRN therefore acts like increasing the number of units in a fully connected recurrent model, which eventually increases the memory capacity of the network. Another advantage of using an RRN is that its simplicity reduces the computational complexity to the minimum of O(mn + 27n), since q and r are now both equal to three.

5 Comparison between RRN and FRN in Sequence Recognition

We compare the performance of RRN and FRN in sequence recognition. We trained both networks to learn a set of letter sequences and compared their training speed and recalling power. The letters of a sequence were input into the network one at a time and the network was required to recognize the sequence after seeing all the letters. In each training cycle, all the sequences were fed into the network once, and this was repeated until the mean square error dropped below a small threshold. We then tested the trained network with a long stream of letters made up of words that had already been learned. However, the embedded words might be time warped to make recalling more difficult. We compare RRN and FRN in both training speed and recalling power.

5.1 Training Sets and Testing Sequences

We used nine sets of training data (Table 1), of which three contained time-warped sequences, and we used the letter streams of Table 2 to compare the recalling power of RRN and FRN. The first six training sets aimed at testing the network's temporal memorizing power. There were six to twenty-four words in each set and the average word length varied from 3 to 5.9. The third training set was the most difficult one as it contained all twenty-four possible permutations of "a", "b", "c" and "d". Some words share the same sub-sequence, either at the beginning or at the end of the word; for example, in test set 5, the words "iran" and "iraq", "poland", "ireland" and "iceland", and "algeria" and "nigeria". The corresponding testing sequences of these sets were formed by concatenating all the words in the training set together, with each word separated by a dot, which was the reset signal. The remaining three tests (tests 7, 8 and 9) aimed at testing the network's time warping performance. We added several time-warped samples into the training set to see whether the network could generalize to other arbitrarily time-warped sequences. The seventh and the eighth tests were more difficult since the training sequences were just permutations of the same set of letters. The testing letter streams were formed by concatenating a number of time-warped words together, again separated by dots, and each character in a word might persist for one to three time steps unpredictably.

Test  Training Sets
1     abc acb bac bca cab cba
2     abcd dabc acbd dacb bacd dbac bcad dbca cabd dcab cbad dcba
3     abcd abdc acbd acdb adbc adcb bacd badc bcad bcda bdac bdca cabd cadb cbad cbda cdab cdba dabc dacb dbac dbca dcab dcba
4     cat dog pig man hen bat cow fly tod duck fish bird cock swam sheep
5     iran iraq niger swedan norway poland ireland iceland algeria nigeria
6     alpha beta gamma delta epsilon zeta theta lambda mu nu xi omicron pi rho sigma
7*    abc acb bac bca cab cba aaaaabbbbbccccc aaaaacccccbbbbb bbbbbaaaaaccccc bbbbbcccccaaaaa cccccaaaaabbbbb cccccbbbbbaaaaa
8*    stop tops spot pots ssstttoooppp tttooopppsss ssspppooottt pppoootttsss
9*    kick jump clap ride it beat walk swim kkkiiiccckkk jjjuuummmppp ccclllaaappp rrriiidddeee lliiittt bbbeeeaaattt wwwaaalllkkk ssswwwiiimmm

Table 1: Training Sets for Comparing RRN and RTRL in Sequence Recognition
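For concreteness, the following sketch (ours, not the authors' code) generates a time-warped test stream in the way described above: words are concatenated with '.' as the reset signal and each character may persist for one to three time steps.

```python
import random

def time_warp(word, max_repeat=3, rng=random.Random(0)):
    """Repeat each character of `word` for 1..max_repeat time steps."""
    return "".join(ch * rng.randint(1, max_repeat) for ch in word)

def make_test_stream(words, rng=random.Random(0)):
    """Concatenate time-warped words, each followed by the '.' reset signal."""
    return "".join(time_warp(w, rng=rng) + "." for w in words)

print(make_test_stream(["stop", "tops", "spot", "pots"]))
# e.g. "sstttoopp.ttooppss.spoott.ppottss."
```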

5.2 Comparison in Training Speed

The performance of RRN and FRN in the training phase is summarized in Tables 3 and 4. The training time of RRN was always much shorter than that of FRN, although the number of iterations taken by RRN might be greater. The low efficiency of FRN was most apparent in test three, which was a difficult training set containing all twenty-four possible permutations of "a", "b", "c" and "d": we could not even train the network to learn these sequences with FRN within a reasonable amount of time (about 3.2 days for 100 iterations).

Test  Testing Sequences
1     abc.acb.bac.bca.cab.cba
2     abcd.dabc.acbd.dacb.bacd.dbac.bcad.dbca.cabd.dcab.cbad.dcba.
3     abcd.abdc.acbd.acdb.adbc.adcb.bacd.badc.bcad.bcda.bdac.bdca.cabd.cadb.cbad.cbda.cdab.cdba.dabc.dacb.dbac.dbca.dcab.dcba.
4     cat.dog.pig.man.hen.bat.cow.fly.tod.duck.fish.bird.cock.swam.sheep.
5     iran.iraq.niger.swedan.norway.poland.ireland.iceland.algeria.nigeria.
6     alpha.beta.gamma.delta.epsilon.zeta.theta.lambda.mu.nu.xi.omicron.pi.rho.sigma.
7     abc.aabbbcc.aaabbc.aaabccc.acb.aaacbbb.aaccbb.accccb.bac.bbbaaac.bbaaccc.baccc.bca.bbbcaaa.bbccca.bbcccaa.cab.caaab.cccab.cabbb.cba.cccbbbaaa.ccbba.cbbbaa.
8     stop.sssttoopp.stttooopp.sstooppp.tops.tttoopppss.ttoooppsss.tooopps.spot.ssspott.sppooot.sspppoottt.pots.pppoottss.ppooottsss.potttss.
9     kick.kkiicckk.kkkicckkk.kiiiccckk.jump.jjjuummp.jjuuummmpp.jummppp.clap.ccllappp.clllaap.cccllaaap.ride.rrriiddeee.rriiidde.riidddee.it.liiit.lliittt.iit.beat.bbbeeat.beeaaatt.bbeeeattt.walk.wwaaalllk.wwaallkkk.wwwalk.swim.ssswwiiimm.sswwwiim.swiiimmm.

Table 2: Testing Sequences for Comparing RRN and RTRL in Sequence Recognition

Test  Hidden Units  Recurrent Links  Iterations  Time Taken (sec)  Final rms Error
1     10            30               163         9                 0.136733
2     18            54               1008        386               0.044699
3     35            105              123         255               0.044699
4     5             15               30000       7631              0.258267
5     10            30               798         345               0.224450
6     10            30               528         294               0.115412
7     10            30               319         106               0.141280
8     10            30               155         50                0.140819
9     6             18               73          44                0.139700

Table 3: Performance of RRN in the Training Phase of Sequence Recognition

Test  Hidden Units  Recurrent Links  Iterations  Time Taken (sec)  Final rms Error
1     10            100              62          201               0.127389
2     18            324              294         37846             0.044565
3     35            1225             -           -                 -
4     5             25               45          1705              0.134082
5     10            100              1209        44063             0.223678
6     10            100              76          7149              0.135713
7     10            100              3334        64456             0.141421
8     10            100              482         6703              0.139743
9     6             36               62          1556              0.137040

Table 4: Performance of FRN in the Training Phase of Sequence Recognition

5.3 Comparison in Recalling Power

The performance of RRN and FRN in the testing phase is summarized in Table 5. Both RRN and FRN performed very satisfactorily in all these tests, and the percentage of successful recalls was greater than 90% in all cases. The network can be trained to have both a strong temporal memorizing power and a good time warping performance with either learning algorithm, although RRN can be trained at a much faster rate.

Test  Successful Recalls (%) in RRN  Successful Recalls (%) in FRN
1     100                            100
2     100                            100
3     100                            -
4     93.3                           100
5     90                             90
6     100                            100
7     95.8                           91.7
8     100                            100
9     100                            100

Table 5: Performance of RRN and FRN in the Recalling Phase

6 Comparison between RRN and FRN in Time Series Prediction

Another comparison between RRN and FRN was made in the task of time series prediction. Time series prediction is the task of forecasting the future values of a series based on the history of that series. Predicting the future is important in a variety of fields, and the non-linear signal processing ability of neural networks has been a new and promising approach for this purpose [8, 10, 3, 2, 1]. Recurrent networks have an advantage over feedforward nets in this problem, as the recurrent links in the network can bring the appropriate previous values back into the network in order to forecast the next one. In this paper, we compare the performance of RRN and FRN on three prediction tasks.

A periodic series with white noise (series 1) - points on a sine curve of unit magnitude, with a uniformly distributed random variable in the interval [-0.5, +0.5] added at each step:

y(t) = \sin(10t) + \mathrm{random}[-0.5, +0.5],   t = 0, 1, 2, ...

This series was periodic (Figure 7) but the regularities were masked by noise. The whole series had 120 data points; the first 100 points were for training while the remaining twenty were for testing.

A deterministic chaotic series (series 2) - A chaotic series looks completely random although it is produced by a deterministic and noise-free system. Chaos has many interesting and counter-intuitive properties. A simple example of deterministic chaos is:

y_t = 1 - 2 y_{t-1}^2,   t = 1, 2, 3, ...

All the values generated by this iterative quadratic formula lie within the unit interval if the initial value is between -1 and +1 (Figure 8). Again the whole series had 120 data points; the first 100 points were training data and the remaining twenty were for testing.
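The first two series can be regenerated directly from the formulas above (our own sketch; the exact noise samples will of course differ from the authors' data):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 120  # 100 points for training, 20 for testing

# Series 1: sine curve of unit magnitude plus uniform noise in [-0.5, +0.5].
t = np.arange(T)
series1 = np.sin(10 * t) + rng.uniform(-0.5, 0.5, size=T)

# Series 2: the iterative quadratic map y_t = 1 - 2 * y_{t-1}^2, started inside (-1, +1).
series2 = np.empty(T)
series2[0] = 0.3                      # any initial value between -1 and +1
for k in range(1, T):
    series2[k] = 1.0 - 2.0 * series2[k - 1] ** 2

train1, test1 = series1[:100], series1[100:]
train2, test2 = series2[:100], series2[100:]
```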

The sunspots numbers (series 3) - Sunspots are dark blotches on the sun. They were first observed around 1610 and yearly averages have been recorded since 1700. This sunspots series is often used as a yardstick to compare new modeling and forecasting methods. We used the data from 1700 to 1979 (Figure 9); the data from 1700 through 1920 were for training while the remainder were for testing.

The comparison emphasizes both the training speed and the predictive power. In order to have a fair comparison, FRN was slightly modified such that only the hidden layer was fully connected; the inter-layer connections were just ordinary feedforward links. Figure 10 shows the network topology of RRN and the modified FRN in these time series prediction tasks. In addition, the output units in these models were linear, since a sigmoid function gives only values between zero and one: they computed their activations as the weighted sum of their inputs from the hidden layer, and the non-linearity in these models was due to the sigmoid hidden units only. In the training phase, the data points were fed into the input unit one by one, and the network forecast the next value in the series at the output unit. The training series was fed into the network repeatedly until the root mean square error of the predictions dropped below a very small threshold. The testing data was then used to evaluate the network's predictive power. There are two ways to do prediction, single-step prediction and multi-step prediction (illustrated in the sketch below):

Single-step prediction - The input unit is given the values of the observed time series.

Multi-step prediction - The predicted output is fed back as input for the next prediction. Hence the input consists of predicted values rather than actual observations of the original time series.

We will compare RRN and FRN in both their training speed and their performance in single-step and multi-step prediction.
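A sketch of the two prediction modes (our own illustration; net.step stands for one forward pass of a trained network with a single input unit and a single linear output unit, as in Figure 10):

```python
import numpy as np

def single_step_predictions(net, observed):
    """Feed the observed series; at each step the network forecasts the next value."""
    return np.array([net.step(np.array([x]))[0] for x in observed])

def multi_step_predictions(net, last_observed, horizon):
    """Feed each prediction back as the next input, starting from the last observed value."""
    preds, x = [], last_observed
    for _ in range(horizon):
        x = net.step(np.array([x]))[0]
        preds.append(x)
    return np.array(preds)
```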

Figure 7: Series 1 - A Sine Function with White Noise

Figure 8: Series 2 - An Iterative Quadratic Series

Figure 9: Series 3 - Sunspots Activity from 1700 to 1979

Figure 10: FRN and RRN for Time Series Prediction. In each network a single input unit feeds the hidden layer (fully connected hidden units for FRN, a ring of hidden units for RRN), which feeds a single output unit.

6.1 Comparison in Training Speed

The performance of RRN and FRN in the training phase is shown in Tables 6 and 7. The learning speed of RRN was again much better than that of FRN, and the speedup was even more obvious than in sequence recognition. This was because the large training set (at least 100 data points, and 220 in the sunspots data) increased the computational cost of FRN significantly. Training FRN to learn the sunspots numbers took about four days while RRN needed only about nine hours. RRN is thus practically more useful than FRN for large-scale problems. However, the final root mean square error of FRN was smaller than that of RRN, so we might expect FRN to have a stronger predictive power.

Series  Hidden Units  Recurrent Links  Iterations  Time Taken (sec)  Final rms Error
1       10            30               90000       25675             0.129985
2       10            30               60000       24194             0.005099
3       10            30               82800       33103             11.37013

Table 6: Performance of RRN in the Training Phase of Time Series Prediction

Series  Hidden Units  Recurrent Links  Iterations  Time Taken (sec)  Final rms Error
1       10            100              30000       430451            0.128996
2       10            100              30000       148575            0.004899
3       10            100              30000       336922            10.229369

Table 7: Performance of FRN in the Training Phase of Time Series Prediction

6.2 Comparison in Predictive Power

Table 8 compares the predictive power of RRN and FRN in single-step prediction. The performance of FRN was slightly better than that of RRN in single-step prediction. This was expected because FRN has more recurrent links and can actually do more work than RRN. However, there were still many factors affecting the results; one obvious factor is that the number of hidden units used, i.e. ten in this case, might not be optimal for RRN or FRN.

To compare RRN and FRN in multi-step prediction, we plotted the correlation coefficient against prediction time for both methods on the same graph (Figures 11, 12 and 13). The correlation coefficient of FRN dropped more slowly than that of RRN in series 2, but vice versa in series 3. Hence the performance of RRN and FRN is similar, although the former is simpler and needs a much shorter training time.
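The multi-step comparison relies on the correlation coefficient between predicted and observed values at each prediction horizon. A minimal way to compute such a curve (our own sketch; it assumes several multi-step runs started from different points of the test set, so that a correlation can be taken per horizon):

```python
import numpy as np

def correlation_by_horizon(preds, actuals):
    """Correlation coefficient between predictions and observations at each horizon.

    preds, actuals : arrays of shape (num_starting_points, horizon);
    row s holds a multi-step run started at test point s.
    """
    return np.array([np.corrcoef(preds[:, h], actuals[:, h])[0, 1]
                     for h in range(preds.shape[1])])
```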

Series  rms Error of RRN  rms Error of FRN
1       0.179148          0.160873
2       0.005387          0.005004
3       17.807            17.5308

Table 8: Performance of single-step prediction by RRN and FRN in Time Series Prediction

Figure 11: RRN vs. FRN in Multi-step Prediction of Series 1 (correlation coefficient against time step; solid line - RRN, dotted line - FRN)

7 Conclusion

Recurrent networks are powerful for solving problems with temporal extent, as the recurrent links in the network can carry past information through time. Unfortunately, there is still no satisfactory algorithm for training recurrent models. The FRN using the Real Time Recurrent Learning rule (RTRL) by Williams and Zipser [9] is powerful as it performs exact gradient descent for a fully recurrent network in an on-line manner.

Figure 12: RRN vs. FRN in Multi-step Prediction of Series 2 (correlation coefficient against time step; solid line - RRN, dotted line - FRN)

Figure 13: RRN vs. FRN in Multi-step Prediction of Series 3 (correlation coefficient against year; solid line - RRN, dotted line - FRN)

However, FRN is inefficient: it has a computational complexity of O(N^4), where N is the total number of non-input units. Here we start from FRN and devise a new learning procedure for a locally connected recurrent network. This new on-line learning rule has a much lower computational complexity, O(mn+np), and storage complexity, O(np), than FRN, where p, n and m are the numbers of input units, hidden units and output units respectively. The algorithm has much flexibility for implementation on parallel architectures [11]. It is local in space if the parameter τ in the model is fixed to one; each processing element in this case needs to communicate only with the directly connected processing elements.

We compared FRN with the simplest kind of locally connected recurrent network, called the ring-structured recurrent network (RRN), in temporal sequence recognition. RRN was much more efficient than FRN, and there were even some large-scale problems which could not be solved by FRN within an acceptable amount of time. Both methods performed satisfactorily in recalling, and the percentage of successful recalls was above 90% in all trials. Thus RRN could perform as well as FRN in these sequence recognition tasks although its training time was much shorter.

We also compared RRN with FRN in time series prediction. Three typical examples were used in our experiments: a periodic series with white noise, a deterministic chaotic series and the sunspots activity data. RRN again needed a much smaller amount of training time, while both networks had comparable predictive power and performed satisfactorily in both single-step and multi-step prediction.

To conclude, RRN can be run at a much faster speed than FRN while the performance of the two is comparable. RRN is only the simplest kind of locally connected recurrent model, and we can increase the number of recurrent links by using a more densely connected network. In sum, the new learning algorithm introduced in this paper is O(N^2) times faster than FRN and should be preferred, especially in large-scale applications.

References

[1] K. Chakraborty, K. Mehrotra, C. K. Mohan, and S. Ranka. Forecasting the behaviour of multivariate time series using neural networks. Neural Networks, 5:961-970, 1992.

[2] C. de Groot and D. Wurtz. Analysis of univariate time series with connectionist nets: a case study of two classical examples. Neurocomputing, 3:177-192, 1991.

[3] J. B. Elsner. Predicting time series using a neural network as a method of distinguishing chaos from noise. J. Phys. A: Math. Gen., 25:843-850, 1992.

[4] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, 1, 1986.

[5] J. H. Schmidhuber. A local learning algorithm for dynamic feedforward and recurrent networks. Report FKI-124-90, 1990.

[6] J. H. Schmidhuber. Temporal-difference-driven learning in recurrent networks. Parallel Processing in Neural Systems and Computers, pages 626-629, 1990.

[7] J. H. Schmidhuber. A fixed size storage O(n^3) time complexity learning algorithm for fully recurrent continually running networks. Neural Computation, 4(2):243-248, 1992.

[8] A. S. Weigend, B. A. Huberman, and D. E. Rumelhart. Predicting the future: a connectionist approach. International Journal of Neural Systems, 1(3):193-209, 1990.

[9] R. J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270-280, 1989.

[10] F. S. Wong. Time series forecasting using backpropagation neural networks. Technical Report NU-CCS-90-9, 1990.

[11] E. F. Y. Young and L. W. Chan. Parallel implementation of partially connected recurrent network. In IEEE Conference on Neural Networks 1994, Orlando, volume IV, pages 2058-2063, 1994.

[12] D. Zipser. Subgrouping reduces complexity and speeds up learning in recurrent networks. In Advances in Neural Information Processing Systems, pages 638-641, 1990.