International University Bremen Guided Research Proposal Improve on chaotic time series prediction using MLPs for output training

Aakash Jain
a.jain@iu-bremen.de
Spring Semester 2004

1 Executive Summary

Echo State Networks (ESNs) present a novel approach to analysing and training recurrent neural networks (RNNs). The approach leads to a fast, simple and constructive algorithm for supervised training of RNNs. ESNs are a very powerful blackbox modeling tool for building models that simulate, predict, filter, classify, or control nonlinear dynamical systems. What makes ESNs excel over traditional techniques is that they can efficiently encode and retain a large amount of information about a long previous history in a single network state (the echo state). This makes ESNs excellent approximators of, among other nonlinear dynamical systems, chaotic time series, with obvious applications in prediction tasks such as forecasting currency exchange rates.

This research proposal aims to improve upon the best empirical result for predicting a chaotic time series, which was obtained by an ESN, by replacing the linear readout mechanism employed by ESNs with a multi-layer perceptron (MLP). This should allow the resulting network to combine the dynamical memory of an ESN with the approximating power of an MLP trained by gradient descent, resulting in a powerful approximator that is smaller than a pure ESN and allows for more feasible practical implementations in telecommunications.

2 Summary Description of Project

By now there exist many kinds of artificial neural networks (ANNs), which can mainly be characterized by their learning mechanisms (supervised or unsupervised) and network structures (feedforward-only or recurrent). In feedforward networks, such as Multi-Layer Perceptrons (MLPs), activation is piped through the network from input units to output units. Conversely, recurrent neural networks (RNNs) are characterized by feedback (recurrent) loops in their synaptic connection pathways, thus closely resembling biological neural networks and exhibiting dynamic memory.

An Echo State Network (ESN) is an artificial recurrent neural network characterized by its use of a large, randomly connected RNN (50 to 1000 neurons) in which only the synaptic connections from the RNN to the output readout neurons are modified by learning. Because there are no cyclic dependencies between the trained readout connections, training an ESN becomes a simple linear regression task, solved by any offline or online linear regression algorithm that minimizes the error $E[(d(t) - y(t))^2]$, where $d(t)$ is the desired output (the teacher signal when teacher forcing the output) and $y(t)$ is the output generated by the network. ESNs thus benefit from reduced training complexity, which makes large, sparsely connected network structures practical. It is important that the network is sparsely connected so that it develops a rich reservoir of excited dynamics, the dynamical reservoir (DR), which is then tapped by the output weights. Under certain conditions, the large recurrent network develops so-called echo states, which can be thought of as a state-space representation in which the neurons (internal units) of the network inherently encode previous and current input (and output, for networks with output feedback).
It is this echo state property which gives ESNs the capability to store and represent a large amount of information about a long previous history in a single network state, and to develop a large DR. This makes ESNs very good predictors of chaotic time series, such as that generated by the Mackey-Glass delay differential equation. ESNs have already been shown to improve upon the benchmark task of predicting the Mackey-Glass system (MGS) time series by a factor of 2400 over previous techniques [12].

This research proposal aims to study the effect of training an MLP as the readout mechanism for the ESN, instead of training only the mutually independent DR-to-output weights as conventional ESNs do. Basically, this means cascading a suitable MLP onto a modified ESN, in the sense that the internal units of the ESN are connected to the input units of the MLP in a suitable manner. The MLP is then trained to generate the desired output from the input it receives from the internal units of the ESN, that is, from the echo state. This approach replaces the linear readout mechanism of an ESN with the nonlinear method of an MLP, with the goal of further improving the prediction of a chaotic time series by harnessing the increased nonlinearity offered by the MLP while maintaining the very useful echo state property of an ESN.

It is not clear whether the new network schema will in fact be more powerful in predicting the chaotic time series. However, one hypothesis which seems quite likely is that this schema should allow a considerable reduction in the number of internal units of the ESN, resulting in a general decrease in the number of network units and faster network response times during the exploitation phase.

Section 3 provides a more detailed statement of the research problem along with some motivation for research in this field. Section 4 describes the experiments planned to study the desired effects.

3 Statement and Motivation of Research

The whole universe can be seen as a very complex high-dimensional nonlinear dynamical system, composed of smaller such systems. Such systems are not well understood, making it infeasible to obtain the executable analytical models required to simulate, predict, filter, classify or control them. In such cases, one has to resort to blackbox modeling techniques which, while ignoring the internal physical mechanisms, reproduce the outwardly observable input-output behavior of the target system.

Neural networks represent one such class of blackbox modeling techniques. Depending on the network structure, they possess the capability to approximate linear and nonlinear dynamical systems to arbitrary precision, making them practically relevant tools in applications related to telecommunications (channel equalization), control (of engines, generators, chemical plants), dynamic pattern classification (speech recognition), pattern generation (computer game animation, dynamical models of humans, machines, natural systems) and time series prediction (prediction of currency exchange rates or coronary attacks).

Echo State Networks (ESN) present a novel approach to analyzing and training recurrent neural networks [9], resulting in a fast, simple and constructive algorithm for supervised training. Recurrent neural networks are more powerful in their representative powers since, like biological networks (which are recurrent), they can approximate arbitrary nonlinear dynamical systems to arbitrary accuracy (the universal approximation property) [7], as opposed to the static nonlinear input-output mappings achieved by feedforward networks, such as multi-layer perceptrons (MLP).
Most neural network applications and a large majority of the literature are based on these feedforward networks, since the established training algorithms for recurrent neural networks, such as Back Propagation Through Time (BPTT) [21], Real Time Recurrent Learning (RTRL) [24] and the Extended Kalman Filter (EKF) [6], suffer from slow convergence and suboptimal solutions.

Section 3.1 provides a quick introduction to the training method of an ESN and section 3.2 provides the same for MLPs. Section 3.3 introduces the new network structure put forward in this proposal, which is constructed by replacing the linear readout mechanism of an ESN with an MLP, and discusses possible advantages and drawbacks of this new schema.

3.1 Echo State Networks (ESN)

Figure 3.1: ESN Schema

The state update equation of the ESN is given as:

$$x(n+1) = \tanh\big(W x(n) + w^{in} u(n+1) + w^{fb} y(n) + v(n)\big), \tag{1}$$

where
$W$: $N \times N$ matrix of internal connection weights,
$w^{in}$: $N$-size vector of input connection weights,
$w^{fb}$ (optional): weight vector for feedback connections from the output neuron to the reservoir (internal units),
$v(n)$ (optional): noise vector.

The output equation for a single-output network (as shown in figure 3.1) is:

$$y(n) = \tanh\big(w^{out} (x(n), u(n))\big), \tag{2}$$

where $w^{out}$ is the $(N+1)$-size vector of weights of connections to the output neuron and $(x(n), u(n))$ is the internal state concatenated with the input. The output nonlinearity tanh is optional.

It is the output weights $w^{out}$ that are adjusted by the learning procedure. Let us assume that the weight vector $w^{out}$ is composed of components $w_i$, where each $w_i$ is the connection weight from internal unit $i$ to the output unit. As seen from figure 3.1, there exist no cyclic dependencies between these $w_i$, so adjusting the $w_i$ boils down to a linear regression task that minimizes the error $E[(d(n) - y(n))^2]$.

Now, let us discuss an important property of ESNs that makes them such good approximators: the echo state property. The echo state property basically states that under certain conditions (e.g. $\sigma_{max} < 1$, where $\sigma_{max}$ is the largest singular value of $W$), certain I/O echo functions exist for teacher-forced output, modeled as:

$$x_i(n) = h_i\big(u(n), u(n-1), \ldots, y(n-1), y(n-2), \ldots\big), \tag{3}$$

where $h_i$ is the echo function which produces the activation of internal unit $i$. ESNs build the function $h$ of the final deterministic dynamic equation $d(t) = h(u(t), u(t-1), \ldots, d(t-1), d(t-2), \ldots)$ from linear combinations of the I/O echo functions $h_i$:

$$\sum_i w_i \, h_i\big(u(t), \ldots, y(t-1), \ldots\big). \tag{4}$$

This means that ESNs linearly tap the desired output from the dynamical reservoir (DR), which can be thought of as encoding information about current and past input and output in its state as a result of the echo functions $h_i$. This is also why the ESN should be large and sparsely connected: it allows a rich set of diverse dynamics to develop and reverberate in the dynamical memory of the DR, which can then be tapped by the readout mechanism of the ESN.
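The state update (1) and the linear readout training described above can be illustrated with a minimal sketch. It is not the Matlab implementation referred to in this proposal: the reservoir size, connectivity, input scaling, washout length and the toy task (reproducing the input delayed by three steps) are all illustrative assumptions, and the optional feedback, noise and output tanh are left out.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 100               # reservoir size (illustrative assumption)
connectivity = 0.05   # fraction of nonzero internal weights (assumption)
rho = 0.8             # target spectral radius (assumption)

# Sparse random internal weight matrix W, rescaled to spectral radius rho.
W = rng.uniform(-1.0, 1.0, (N, N)) * (rng.random((N, N)) < connectivity)
W *= rho / np.max(np.abs(np.linalg.eigvals(W)))
w_in = rng.uniform(-1.0, 1.0, N)

def run_reservoir(u):
    """Iterate x(n+1) = tanh(W x(n) + w_in u(n+1)), without feedback or noise."""
    x = np.zeros(N)
    states = np.empty((len(u), N))
    for n, u_n in enumerate(u):
        x = np.tanh(W @ x + w_in * u_n)
        states[n] = x
    return states

# Toy task (assumption): the desired output is the input delayed by 3 steps.
T, washout, delay = 1200, 200, 3
u = rng.uniform(-0.5, 0.5, T)
d = np.roll(u, delay)   # d(n) = u(n - delay); wrapped entries fall in the washout

X = run_reservoir(u)
S = np.hstack([X, u[:, None]])[washout:]   # rows are (x(n), u(n)), transient dropped

# Linear regression for the (N+1) output weights, minimizing the mean of (d - y)^2.
w_out, *_ = np.linalg.lstsq(S, d[washout:], rcond=None)
y = S @ w_out
train_mse = np.mean((d[washout:] - y) ** 2)
```

Only `w_out` is learned here; `W` and `w_in` stay fixed after their random initialization, which is what reduces training to a single least-squares solve.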
It might be useful to note here that a similar ANN model has been suggested by the research group of Wolfgang Maass, who have termed such networks Liquid State Machines (LSM) [14]. They refer to the DR as the liquid and have also suggested employing a powerful readout mechanism from the liquid, such as that proposed in section 3.3 (the main topic of this research proposal).

3.2 Multi Layer Perceptrons (MLP)

MLPs can be considered as providing a nonlinear mapping between an input vector and a corresponding output vector. From a set of input-output vectors, an MLP with a given number of hidden neurons may be trained by minimizing a least mean square cost criterion. One of the most widely known MLP training algorithms is the so-called backpropagation algorithm, which was introduced in [19] and first applied to a time series modeling task in [13]. Backpropagation is a gradient descent technique, descending the error surface in the weight space. It involves four main stages: 1) randomly initialize the network weights, 2) propagate the input $x(n)$ forward through the network, 3) backpropagate the associated error terms $\delta_i$ from the output layer to the hidden layer(s), and 4) update the network weights in the direction of steepest descent towards a local minimum. Steps 2-4 are repeated until the desired stopping criterion is met, usually realised through a crossvalidation scheme. It is well known that MLPs with just one hidden layer possess the universal approximation property, that is, they can approximate arbitrary static nonlinear functions to arbitrary accuracy.
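The four backpropagation stages listed above can be sketched for a one-hidden-layer MLP with tanh hidden units and a linear output unit. This is a minimal illustration, not the planned experimental setup: the toy target (a sine function), the hidden layer size, the fixed step size, the initialization range and the fixed number of epochs (used instead of crossvalidation) are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy static mapping to learn (assumption): d = sin(x) on [-pi, pi].
X = rng.uniform(-np.pi, np.pi, (200, 1))
D = np.sin(X)

h, eta, a = 8, 0.02, 0.5   # hidden units, step size, init range (illustrative)

# 1) Randomly initialize the network weights in [-a, a]; biases start at zero.
W1 = rng.uniform(-a, a, (1, h)); b1 = np.zeros(h)
W2 = rng.uniform(-a, a, (h, 1)); b2 = np.zeros(1)

def forward(inputs):
    """Return hidden activations and the output of the linear output unit."""
    H = np.tanh(inputs @ W1 + b1)
    return H, H @ W2 + b2

_, Y0 = forward(X)
mse_before = np.mean((D - Y0) ** 2)

for epoch in range(3000):
    # 2) Propagate the input forward through the network.
    H, Y = forward(X)
    # 3) Backpropagate the error terms from the output layer to the hidden layer.
    delta_out = (Y - D) / len(X)                    # gradient of 0.5 * mean squared error
    delta_hid = (delta_out @ W2.T) * (1 - H ** 2)   # tanh'(z) = 1 - tanh(z)^2
    # 4) Update the weights along the negative gradient.
    W2 -= eta * (H.T @ delta_out); b2 -= eta * delta_out.sum(axis=0)
    W1 -= eta * (X.T @ delta_hid); b1 -= eta * delta_hid.sum(axis=0)

_, Y1 = forward(X)
mse_after = np.mean((D - Y1) ** 2)
```

Comparing `mse_after` with `mse_before` shows how far gradient descent has moved down the error surface from the random starting point.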

3.3 ESN with MLP as Readout Mechanism

Figure 3.2: Schema of an ESN using MLP as the readout mechanism

Figure 3.2 shows the basic schema of the new network structure we would like to investigate. The idea is to replace the linear readout mechanism of a conventional ESN with a nonlinear MLP and to train the weights of the MLP using the backpropagation algorithm to achieve the desired approximation. Effectively, instead of linearly combining the echo states $x_i(n)$, we now combine them nonlinearly by feeding them as input to the MLP.

The effect of this change on the final generalization capabilities of ESNs is not easy to predict, since it could either improve or degrade performance depending on the task domain and other parameters of the network. Increased nonlinearity could drive the network into the thresholding range, resulting in poor approximation of smooth functions. On the other hand, the new network structure could emerge as a powerful approximator of nonlinear dynamical systems by exploiting the rich and dynamic memory reservoir of the ESN along with the approximating powers of the MLP (a gradient descent technique, which can localise a minimum better). The true behaviour of the network needs to be determined empirically, as does the effect of altering the various network parameters.

One clear hypothesis that can be made about the resulting network is that, to match the performance of a traditional ESN, it would require considerably fewer internal units in its dynamical reservoir; otherwise we will almost certainly overfit the training data and obtain poor performance in the exploitation stage. Adding the MLP considerably increases the number of trainable parameters in the network, thus allowing us to reduce the ESN size and obtain a generally smaller network with faster response times and activation propagation.
However, reducing the ESN size also means reducing the dimension of the dynamical reservoir, where the rich set of varied dynamics evolves and lives as echo states. This will have an adverse effect on the temporal memory capabilities of the ESN. Therefore, there seems to be a tradeoff between overfitting and memory as the size of an ESN is reduced. This is another effect that needs to be investigated empirically.

Finally, to test whether the new network structure does (or does not) emerge as a more powerful approximating tool than conventional ESNs, we shall test its performance on chaotic time series. ESNs hold the current record for predicting the Mackey-Glass 17 system (MGS 17; see section 4.1 for more information on MGS) 84 steps into the future (a benchmark task), with a $\log_{10} \mathrm{NRMSE}_{84}$ of -5.09 [12], which is an improvement by a factor of 2450 over any other chaotic time series approximator. Improving on this would certainly establish the proposed network structure as one of the best approximators of chaotic time series, and it is expected that this can be done with a smaller network than that used in [12].
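The claim in this section that an MLP readout considerably increases the number of trainable parameters can be made concrete with a small worked count. The symbols $U$, $b$, $v$ and $c$ are hypothetical names introduced only for this illustration. The count assumes that the MLP receives only the $N$ internal units as input and that each hidden unit and the output unit has a bias, and it takes $N = 1000$ and $h = 8$ hidden units, the values planned in section 4.2.

```latex
% Linear readout (output tanh omitted) versus MLP readout with h tanh hidden units
y_{\mathrm{lin}}(n) = w^{out} \, (x(n), u(n)), \qquad
y_{\mathrm{MLP}}(n) = \sum_{j=1}^{h} v_j \tanh\Big( \sum_{i=1}^{N} U_{ji} \, x_i(n) + b_j \Big) + c

% Trainable parameters
\#_{\mathrm{lin}} = N + 1 = 1001
\#_{\mathrm{MLP,\,full}} = hN + h + h + 1 = 8000 + 8 + 8 + 1 = 8017
\#_{\mathrm{MLP},\, N/h \text{ inputs per hidden unit}} = h \cdot \tfrac{N}{h} + h + h + 1 = 1000 + 8 + 8 + 1 = 1017
```

Under these assumptions, a fully connected MLP readout has roughly eight times as many trainable parameters as the linear readout on the same reservoir, whereas giving each hidden unit only $N/h$ reservoir inputs keeps the count close to the linear case.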

4 Experimental Setup

First, we would like to determine whether the new network structure proposed in section 3.3 possesses the same or better approximation capabilities for nonlinear dynamical systems (e.g. chaotic time series), quite possibly with a smaller network size. Another parameter that we would like to test is the effect of the structure of the incoming synapses from the ESN to the MLP.

During training, the training input is teacher-forced onto the output of the MLP, which is fed directly back into the ESN. The output of the ESN is then fed into the MLP, which generates the final network output. This is compared with the teacher signal to compute the error to be minimized. After training, the teacher-forced signal is decoupled from the network and the network's own output (from the MLP) is fed back into the ESN. As a stopping criterion for training, we shall use crossvalidation.

4.1 Preparation of training and testing data

The dataset is obtained from a discretized version of the Mackey-Glass (MG) delay differential equation:

$$\frac{dx}{dt} = \frac{0.2 \, x(t-\tau)}{1 + x(t-\tau)^{10}} - 0.1 \, x(t).$$

This equation was proposed by L. Glass and M.C. Mackey in [15] in 1977 to describe a model for the onset of leukaemia. Over the years, it has established itself as a benchmark dataset for time series prediction. $\tau$ is the delay; we shall first use $\tau = 17$, followed by testing on $\tau = 30$. If time permits, the network shall also be tested on datasets from other chaotic systems: the Lorenz attractor and the Laser time series. The dataset is prepared (discretized, shifted and scaled) as described in the supporting online material to [12] before it is fed to the network. Artificial noise is injected into the data.

4.2 Network Setup

Here, we have two individual networks to set up and initialize: the ESN and the MLP.

ESN: Set up as described in [11] (using the Matlab implementation provided). Initially, start with a large DR of 1000 units.
This results in a $1000 \times 1000$ weight matrix $W$ with 1% connectivity and random weights drawn from a uniform distribution over (-1, 1) and then rescaled to a spectral radius of 0.8. As the network size is one of the main parameters under investigation, and it has been postulated that we will need smaller networks, we repeat the experiments with smaller DRs until performance begins to degrade. The decay rate of the DR can be determined based on the results obtained. Output feedback is turned on and an auxiliary input unit is attached to feed in a constant bias input.

MLP: For the MLP setup, we use one hidden layer (see section 3.2) with $h = 8$ hidden units. This number has been chosen from a survey of the relevant literature, such that it is neither too big (resulting in overfitting) nor too small (resulting in underfitting). [1] and [23] suggest that the effects of overparametrized MLPs can be overcome by careful selection of the range $[-a, a]$ from which the weight values are initialized, and that one may fix the value of $h$ and carefully pick $a$. [1] also suggests a new method for the complexity analysis of an MLP, based on which the initial weights can be picked, as opposed to the commonly held belief that the smaller the value of $a$ the better. However, we shall determine the value of $a$ empirically by performing pretests with decaying values of $a$. In addition, for the final value of $a$ picked, we shall repeat the experiment several times, each time re-initializing the weights from $[-a, a]$ and thus starting in different places on the error surface of the weight space (Monte-Carlo simulation).

Another variable parameter here is the connection structure from the ESN internal units to the hidden units of the MLP. For this, we first experiment by connecting $n/h$ units from the ESN to each hidden unit of the MLP, where $n$ is the number of internal ESN units. This value can then be scaled up to $n$, in which case there would be an incoming synapse from every ESN internal unit to every MLP hidden unit.

The activation functions for the MLP are chosen to be sigmoid for the hidden units and linear for the single output unit, using the tanh function as the sigmoid function. The step size $\eta$ for the backpropagation algorithm is made to decay, with the decay rate being determined empirically through pretests. Too large an $\eta$ results in fast initial convergence followed by constant jitter around the minimum. On the other hand, a small $\eta$ might take prohibitively long to converge.

4.3 Evaluation Criterion

The evaluation criterion for our experiments is simply the $\log_{10} \mathrm{NRMSE}$, for which the best results achieved so far stand at -5.09 for MG 17 and -1.42 for MG 30. Improving upon these would certainly be one goal. For this, we will probably have to use the ideas behind the refined version of the learning method in [12]: train the network using a reservoir with dynamics closer to those encountered at exploitation time to improve modelling accuracy.

4.4 Time Scale

25th April, 2004: Set up the network as a Matlab simulation and start preliminary testing to identify various parameters.
27th April, 2004: Finish preliminary testing and start the real experiments.
3rd May, 2004: Finish experimentation phase.
7th May, 2004: Finish data analysis.
13th May, 2004: Final report due.
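The data preparation of section 4.1 and the evaluation criterion of section 4.3 can be sketched as follows. This is a hedged illustration rather than the procedure of [12]: the Euler step `dt = 0.1`, the subsampling factor, the constant initial history `x0` and the normalization by the variance of the target segment are assumptions, the function names are hypothetical, and neither the shift-and-scale step nor the noise injection is shown.

```python
import numpy as np

def mackey_glass(n_steps, tau=17, dt=0.1, subsample=10, x0=1.2):
    """Euler-discretized Mackey-Glass series with one sample per unit of time.

    dt, subsample and the constant initial history x0 are illustrative assumptions.
    """
    delay = int(round(tau / dt))          # the delay tau expressed in Euler steps
    total = n_steps * subsample
    x = np.empty(total + delay + 1)
    x[:delay + 1] = x0                    # constant history on [-tau, 0]
    for k in range(delay, total + delay):
        x_tau = x[k - delay]
        x[k + 1] = x[k] + dt * (0.2 * x_tau / (1.0 + x_tau ** 10) - 0.1 * x[k])
    return x[delay + 1::subsample][:n_steps]

def log10_nrmse(predicted, target):
    """log10 of the root mean square error normalized by the target's standard deviation."""
    predicted = np.asarray(predicted, dtype=float)
    target = np.asarray(target, dtype=float)
    nrmse = np.sqrt(np.mean((predicted - target) ** 2) / np.var(target))
    return np.log10(nrmse)

series = mackey_glass(3000)
target = series[1000:]                    # drop the initial transient (assumption)

# Reference point: always predicting the mean gives an NRMSE of 1, i.e. a score of 0.
baseline = np.full_like(target, target.mean())
baseline_score = log10_nrmse(baseline, target)
```

On this scale, a score of -5.09 corresponds to an error roughly five orders of magnitude below that of the constant-mean baseline.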

References

[1] A. Atiya and C. Ji. How initial conditions affect generalization performances in large networks. IEEE Trans. Neural Networks, 8(2):448-451, 1997.
[2] H. Bersini, M. Birattari, and G. Bontempi. Proc. IEEE World Congr. on Computational Intelligence (IJCNN 98), pages 2102-2106, 1997.
[3] T. Chow and C.T. Leung. Performance enhancement using nonlinear preprocessing. IEEE Trans. on Neural Networks, 7(4), July 1996.
[4] L. Chudy and I. Farkas. Neural Network World, 8:481, 1998.
[5] L. Fausett. Fundamentals of Neural Networks. Prentice Hall, 1994.
[6] L.A. Feldkamp, D.V. Prokhorov, C.F. Eagen, and F. Yuan. In Nonlinear Modeling: Advanced Black-Box Techniques, pages 29-54, 1998.
[7] K.-I. Funahashi and Y. Nakamura. Neural Networks, 6:801, 1993.
[8] F. Gers, D. Eck, and J.F. Schmidhuber. Applying LSTM to time series predictable through time-window approaches. IDSIA-IDSIA-22-00, 2000.
[9] H. Jaeger. The echo state approach to analysing and training recurrent neural networks. Technical Report 148, German National Research Center for Information Technology, 2001.
[10] H. Jaeger. Short term memory in echo state networks. Technical Report 152, German National Research Center for Information Technology, 2001.
[11] H. Jaeger. Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the echo state network approach. Technical Report 159, German National Research Center for Information Technology, 2002.
[12] H. Jaeger and H. Haas. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, April 2:78-80, 2004.
[13] A. Lapedes and R. Farber. Non-linear signal processing using neural networks: Prediction and system modelling. Technical Report LA-UR87-2662, Los Alamos National Laboratory, 1987.
[14] W. Maass, T. Natschläger, and H. Markram. Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14:2531-2560, 2002.
[15] M.C. Mackey and L. Glass. Science, 197:287, 1977.
[16] T.M. Martinetz, S.G. Berkovich, and K.J. Schulten. IEEE Trans. Neural Networks, 4:558, 1993.
[17] J. McNames, J.A.K. Suykens, and J. Vandewalle. Int. J. Bifurcation Chaos, 9:1485, 1999.
[18] T.M. Mitchell. Machine Learning. McGraw-Hill, 1997.
[19] D.E. Rumelhart, G.E. Hinton, and R.J. Williams. Learning internal representations by error propagation. Parallel Distributed Processing, 1:318-362, 1986.
[20] J. Vesanto. Proc. WSOM 97, 1997.
[21] P.J. Werbos. Proc. IEEE, 78(10):1550, 1990.
[22] X. Yao and Y. Liu. IEEE Trans. Neural Networks, 8:694, 1997.
[23] S. Zhong and V. Cherkassky. Factors controlling generalization ability of MLP networks.
[24] D. Zipser and R.J. Williams. Neural Comput., 1:270, 1989.