Reservoir Computing in Forecasting Financial Markets


April 9, 2015

Reservoir Computing in Forecasting Financial Markets

Jenny Su

Committee Members: Professor Daniel Gauthier, Adviser; Professor Kate Scholberg; Professor Joshua Socolar

Defense held on Wednesday, April 15, 2015, in Physics Building Room 298

Abstract

The ability of the echo state network to learn chaotic time series makes it an interesting tool for financial forecasting, where data is very nonlinear and complex. In this study I initially examine the Mackey-Glass system to determine how different global parameters can optimize training in an echo state network. In order to optimize multiple parameters simultaneously, I conduct a grid search to explore the mean squared error surface. In the grid search I find that the error is relatively stable over certain ranges of the leaking rate and spectral radius. However, the ranges over which the Mackey-Glass system minimizes error do not correspond with an error surface minimum for financial data, as a result of intrinsic qualities such as step size and the timescale of the dynamics in the data. The study of chaos in financial time series data leads me to alternate descriptions of the distribution of relative stock price changes over time. I find that the Lorentzian distribution and the Voigt profile are good models for explaining the thick tails that characterize large returns and losses, which are not captured by the common Gaussian model. These distributions act as an untrained random model to benchmark the predictions of the echo state network trained on the historical price changes in the S&P 500. The global reservoir parameters, optimized in a grid search given financial input data, do not lead to significant predictive abilities. Committees of multiple reservoirs are shown to give forecasts similar to those of single reservoirs. Compared to a benchmark random sample from the fitted distribution of previous input, the echo state network is not able to make significantly better forecasts, suggesting the necessity of more sophisticated statistical techniques and the need to better understand chaotic dynamics in finance.

Contents

1 Introduction
    Background
    Approach
2 Network Concepts
    Basic Concept
    Input
    The Reservoir and Echo State Property
    Training and Output
    Reservoir Optimization
    Summary
3 Echo State Network and Mackey-Glass
    Mackey-Glass System
    Constant Bias
    Leaking Rate, Spectral Radius, and Reservoir Size
    Summary
4 Financial Forecasting and Neural Networks
    Nonlinear Characteristics
    History of Neural Networks in Finance
    Neural Forecasting Competition
5 Network Predictions
    S&P 500 Data
    Data Processing
    Benchmarking
    Parameter Optimization
    Reservoir Results
    Committee
6 Conclusion
    Analysis of Results
    Conclusion

Chapter 1 Introduction

1.1 Background

Artificial neural networks are trainable systems with powerful learning capabilities and many applications in forecasting and classification. Their learning process has many similarities to that of the human brain because their design was inspired by research into biological nervous systems. Like the central nervous system of animals, the artificial neural network is composed of many neurons, or artificial nodes, which are connected in a defined network. Both biological and artificial neural networks send and receive feedback signals through their connections, which partially determines their expressed state. These networks are therefore constantly updating with information from external sources as well as internally from other neurons in the network. The development of artificial neural networks began in 1943 when McCulloch and Pitts [23] first proposed a network in which the state of each neuron was determined by a combination of all the signals received from connected neurons in the network. In their model, simple artificial neurons i = 1, ..., n could only have binary activations or states, $n_i = 0, 1$, where $n_i = 0$ represents the resting state and $n_i = 1$ represents the activated state of the neuron. The binary state of each neuron i is dictated by a linear combination of action potentials, $\sum_j w_{ij} n_j(t)$, where $w_{ij}$ is a matrix representation of the connection weights between neurons i and j, as seen in figure 1.1. If this value exceeds a certain threshold, then neuron i becomes activated and transmits the signal $n_i = 1$.
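For concreteness, the threshold rule described above can be written out in a few lines of Python; the weights, states, and threshold below are arbitrary illustrative values, not anything taken from the original 1943 model.

```python
import numpy as np

def mcculloch_pitts_step(weights, states, threshold):
    """Return the new binary state of neuron i given the states of its inputs."""
    # Weighted sum of incoming signals, sum_j w_ij * n_j(t)
    activation = np.dot(weights, states)
    # Fire (1) only if the summed input exceeds the threshold
    return 1 if activation > threshold else 0

# Example: three input neurons, arbitrary weights and threshold
w_i = np.array([0.5, -0.3, 0.8])   # row i of the weight matrix w_ij
n_t = np.array([1, 0, 1])          # binary states of the connected neurons
print(mcculloch_pitts_step(w_i, n_t, threshold=1.0))  # -> 1
```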

Figure 1.1: Network model of a neuron: inputs are summed as a linear combination, which must exceed a certain threshold for the neuron to be activated.

Another important foundational step occurred in 1949 when Hebb published The Organization of Behavior, in which he asserts that simultaneous activation of neurons leads to increased connection strengths between those neurons [8]. This established the basis for Hebb's learning rule, which states that the synaptic connection between neurons is adaptive. The connection between neurons can be strengthened as a result of repeated activations between the neurons: if neuron $x_1$ activations consistently lead to neuron $x_2$ activations, then the connection weight between them will increase. This idea was incorporated in 1961 by Eduardo Caianiello into the learning algorithm used to determine the weight matrix $w_{ij}$ connecting neurons [1], where the connection weight matrix had adaptive weights for neurons with similar activations; the connections between neurons that activate simultaneously are stronger. All of these discoveries contributed to the development of the first simple feed-forward network, which Rosenblatt and his collaborators called a perceptron [29]. The perceptron consisted of two layers of neurons: the input layer and the output layer. Signals travel from the input to the output in a single direction, as seen in figure 1.2. This first feed-forward network was used in a simple classification problem between two classes. After each classification, if the model predicts the class incorrectly, the weights are adjusted and the network is run again until the weights converge to the correct values that

6 allow the model to predict the class correctly. Figure 1.2: Single-layer perceptrons have adjustable weights which are trained to predict correct classification. Years later in 1982, Hopfield published the basis for a recurrent network where neurons can be updated sequentially based on information stored within the network [9]. As opposed to a feedforward network, recurrent neural networks have neurons whose connections are multidirectional as shown in figure 1.3. This allows the network to display dynamical behavior based on internal memory of signals throughout the network. In Hopfield s first network, there was no self-connection, the neuron did not receive input from itself, and the connections between neurons were symmetric so that w i j = w ji. Hopfield showed that networks with these characteristics which iteratively adjusted the state of each neuron would reach a local minimum and evolve to a final state. 4

7 Figure 1.3: Hopfield network neurons connections are multidirectional with symmetric weights. Within the field of networks, many different learning rules that have been developed are classified as supervised learning, where the network is adjusted by comparing the network output with the desired output. One of the most common learning rules, error backpropagation, was first proposed in 1974 by Werbos [37]. The learning algorithm makes small iterative adjustments to network connections to minimize the difference between the target and network output. Backpropagation has been especially successful in feed forward networks but the method is only partially successful with recurrent neural networks. Output of recurrent neural networks can bifurcate unlike the smooth continuous output of feed-forward networks. This bifurcation can lead to discontinuous error surfaces[3]. An alternative to backpropagation was introduced with echo state networks (ESNs) and liquid state machines [13, 19]. These two training models developed independently within the context of machine learning and computational neuroscience respectively share the same basic idea of a randomly connected reservoir of neurons with a trained output feedback. This developed into the current research field of Reservoir Computing. The specific network concepts defined by the echo state network approach I will be using in this project is explained in detail in chapter 2. Applications of reservoir computing are useful in tasks for function approximation, signal processing, classification, and data processing. They have been applied to system identification, natural language 5

processing, medical diagnosis, as well as spam filters. In this study, I am interested in the application of these network models to the financial sector, which historically has had great incentive to find models capable of predicting and forecasting changes in a complex, widely fluctuating market.

1.2 Approach

The goal of this project is to study the echo state network and its modeling capacity in forecasting financial data series. Within the scope of this project I first forecast a known dynamical system, the Mackey-Glass system, to better understand reservoir behavior given specified global parameters. I then apply ESNs to financial data to determine whether they may provide useful predictions of stock market trajectories. The paper is organized as follows. The second chapter discusses network concepts and the mathematical description of the echo state network. In order to determine best practices, in chapter three I study the network using the Mackey-Glass system, a nonlinear delay differential equation that is commonly used to test the modelling of complex systems. The ESN is trained on Mackey-Glass data under varying global parameters of the reservoir, and these global parameters are tuned to optimize network performance. The fourth chapter introduces the dynamics of the stock market and how previous forecasting approaches have dealt with these widely fluctuating time series. In the fifth chapter I apply the echo state network approach to financial data. I study the impact of global parameters and data processing techniques. I benchmark the reservoir predictions using random samples from distributions that best fit the historical input data. I also study the impact of committee methods, which combine output from multiple reservoirs, in comparison to single reservoir predictions as well as the random sample. The sixth and final chapter concludes with a comparative discussion of the echo state network approach in financial forecasting, as well as limitations of the model and further studies.

9 Chapter 2 Network Concepts 2.1 Basic Concept There are many supervised learning algorithms for training recurrent neural networks. In the echo state network (ESN) approach used in this project, only the output weights from the reservoir are trained. The connections of the neurons within the reservoir are randomly generated at the outset as in figure 2.1. Training all network connections is unnecessary, which makes this faster than previous learning algorithms in which all connections are trained. Because the network has recurrent loops, it maintains a memory of the past input and the output consists of echoes of the initial input time series. 7

Figure 2.1: The echo state network approach has a combination of trained and untrained connections between neurons or nodes.

In implementing the echo state network, an input teacher data series u(n) is used to train a reservoir of size $N_x$ with neuron connections determined by a randomly generated matrix $W \in \mathbb{R}^{N_x \times N_x}$. The output node of the reservoir gives a readout of a linear combination of all or a portion of the neuron activations, $W^{out} x(n)$. This matrix $W^{out}$ is computed so that the output y(n) corresponds as closely as possible to a defined target data series $y^{target}(n)$. This last step is the training portion of the learning algorithm. Once the best weight matrix $W^{out}$ is determined, new input data u(n) can be used in the reservoir to generate output, or reservoir predictions, y(n) beyond the target data. The rest of this chapter describes the mathematical properties of the different components involved in the ESN approach and also introduces the tunable global parameters used to optimize this learning architecture, as first developed by Jaeger in 2001 [13].

2.2 Input

The input data serves as the driving mechanism for the reservoir. Not only does the reservoir exhibit nonlinear dynamics as a result of the input, but the reservoir will also retain a memory

of previous input. This ability to remember, which will later be defined as the echo state property in section 2.3, occurs as a result of the recurrence of the network; the nodes form recurrent cycles through their connections in W. Typically, the teacher input series u(n) consists of $N_u$ series of data points at discrete time steps (n, n + 1, n + 2, ...). In my studies, part of the input teacher series is used as the target data $y^{target}(n)$ that the output is trained on. There are no limitations on the starting point of the data series, and therefore shifting the initial starting input does not significantly impact the ability of the reservoir to learn the input series. While in many contexts the input is one-dimensional, there is no limit to the number of arrays that can be used as input to the reservoir. The input is fed into the reservoir using a randomly generated input weight matrix $W^{in} \in \mathbb{R}^{N_x \times (1+N_u)}$. In addition to the specified input data u(n), there is also a constant bias value input to the reservoir. This bias input is weighted by a randomly generated column of values in $W^{in}$, as seen in figure 2.2. The purpose of the bias constant is to increase the variability of the input dynamics [13]. The impact of this bias constant is further studied in section 3.2. All these input values directly impact the reservoir, which is studied in the next section.

Figure 2.2: $W^{in}$ determines the connection weights for both the input data and the bias input.

2.3 The Reservoir and Echo State Property

The reservoir is defined by the neurons within it, whose states all follow the same update equation as a function of the input and feedback from other neurons. The state of each neuron, or node, depends on previous states in the reservoir according to

$$x(n + 1) = (1 - \alpha)x(n) + \alpha \tanh\left(W x(n) + W^{in} u(n + 1) + W^{fb} y(n) + v(n)\right), \qquad (2.1)$$

where $\alpha$ is the leaking rate of the network. Technically the state update can be governed by any sigmoidal function, but in the context of this project I use the tanh function because it is the standard sigmoid function used across all encountered literature in reservoir computing. According to Eq. 2.1, each new neuron state x(n + 1) is determined by its current state x(n) as well as a nonlinear expression of the other current nodes, W x(n), and the input data, $W^{in} u(n + 1)$. The state x(n + 1) can also depend on the output, adjusted by the randomly generated feedback weight matrix $W^{fb}$ acting on y(n), as well as a noise function v(n). However, both of these terms are optional and are omitted in this project. Jaeger found that noise was useful in maintaining stability of the reservoir activations when driven by highly chaotic time series such as the Mackey-Glass system [13]. He also concluded that noise insertion negatively impacted the precision of predictions, which is undesirable in my project. In this project, I do not use a separate $W^{fb}$ to feed into the reservoir but rather feed the output y(n) back as input u(n), so that effectively $W^{fb} = W^{in}$. This allows the reservoir to exhibit dynamical behavior as a result of the output as well as the input. One of the most important parameters in determining reservoir behavior is the random weight matrix W that describes the neuron connections. The recurrent loops that occur due to these connections lead to the echo state property of the network, which gives the reservoir a finite memory. The echo state property states that the reservoir retains a memory of previous states, as the input data is stored in the recurrent loops of the reservoir connections. The echo state property can be ensured in practice if the spectral radius, denoted by $\rho$ and defined as the largest absolute eigenvalue of the weight matrix W, is less than 1. However, more recent research has determined that the echo state property holds even under the more relaxed condition that the maximum absolute value of the entries $w_{ij}$ of W is less than 1 [6]. Jaeger, more recently, also noted that the echo state property can exist even when $\rho > 1$, but it may not exist for all input and never exists for the null input [22]. Increasing $\rho$ may even increase network performance by increasing the memory of the input, since the spectral radius determines how long input is remembered. He also found that the echo state property might be defined with respect to the input u(n).
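As an illustration of Eq. 2.1, a minimal sketch of the state update (omitting the optional feedback and noise terms, as in this project) is given below. The matrix shapes, random ranges, and parameter values are assumptions chosen for the example, not the exact settings used later in the thesis.

```python
import numpy as np

rng = np.random.default_rng(42)

N_x, N_u = 100, 1          # reservoir size and input dimension
alpha = 0.3                # leaking rate
W = rng.uniform(-0.5, 0.5, (N_x, N_x))          # recurrent weights (spectral radius not yet adjusted; see section 2.5)
W_in = rng.uniform(-1.0, 1.0, (N_x, 1 + N_u))   # input weights; first column drives the constant bias

def update_state(x, u, bias=1.0):
    """One step of Eq. 2.1: leaky integration of a tanh neuron layer."""
    pre = W @ x + W_in @ np.concatenate(([bias], np.atleast_1d(u)))
    return (1 - alpha) * x + alpha * np.tanh(pre)

# Drive the reservoir with a toy input sequence and collect the states
u_seq = np.sin(0.1 * np.arange(500))
x = np.zeros(N_x)
states = []
for u in u_seq:
    x = update_state(x, u)
    states.append(x)
X = np.array(states).T     # shape (N_x, time); used for training the readout
```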

Since the reservoir exhibits memory, the initial activations should not be used in training (training is defined in the next section) because of any transients that may exist in the reservoir. Since $\rho$ in most of the studies in this project is close to unity, there is slow forgetting, and many initial states need to be dismissed before actual predictions can take place. This makes the learning process input-intensive and data-wasteful, and other techniques have been developed that incorporate an auxiliary initiator network to compute an appropriate starting state for the recurrent model network [13]. These techniques are beyond the scope of my project because data length is not a major limitation for financial data.

2.4 Training and Output

Given all the reservoir activations resulting from the input series, the goal is for the output of the reservoir, y(n), to approximate the target data $y^{target}(n)$ as closely as possible; that is, to minimize the difference between the target data and the reservoir output. In this project, I use the mean squared error (MSE), defined as

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(y^{out}(i) - y^{target}(i)\right)^2. \qquad (2.2)$$

Because the reservoir output is a linear combination of the activations,

$$Y = W^{out} X, \qquad (2.3)$$

where Y and X are matrix representations of the reservoir output y(n) and reservoir states x(n) respectively, minimizing the MSE becomes a simple linear regression. Substituting the matrix forms of Eq. 2.3 into Eq. 2.2, I find

$$\mathrm{MSE} = \left(W^{out} X - Y^{target}\right)^2. \qquad (2.4)$$

Minimizing the MSE gives the solution for $W^{out}$ as

$$W^{out} = Y^{target} X^{-1}. \qquad (2.5)$$

However, because the number of input points in u(n) is often larger than the reservoir size, the system is overdetermined, and taking the inverse of an overdetermined matrix can give unstable solutions. Therefore, instead of computing the output matrix as a simple function of the target and the inverse of the activations, I implement Tikhonov regularization, also known as ridge regression, in order to find a stable solution for $W^{out}$ [17]. In Tikhonov regularization, a regularization term is added to Eq. 2.2, resulting in

$$\left(W^{out} X - Y^{target}\right)^2 + \beta \left(W^{out}\right)^2, \qquad (2.6)$$

where $\beta$ is the regularization coefficient used to penalize larger norms. Minimizing Eq. 2.6, I obtain the solution

$$W^{out} = Y^{target} X^T \left(X X^T + \beta I\right)^{-1}. \qquad (2.7)$$

Alternatively, as mentioned previously in section 2.2, the noise function v(n) can also be used to stabilize solutions of overdetermined systems. However, the ridge regression method is a more computationally efficient solution that does not penalize the precision of reservoir predictions the way added noise can.
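A minimal sketch of the ridge-regression readout of Eq. 2.7 is shown below; it assumes the reservoir states are collected as columns of X, and the random placeholder data only demonstrates the shapes involved.

```python
import numpy as np

def train_readout(X, Y_target, beta=1e-8):
    """Solve W_out = Y_target X^T (X X^T + beta I)^(-1)  (Eq. 2.7).

    X        : (N_x, T) matrix of reservoir states, one column per time step
    Y_target : (N_y, T) matrix of target outputs
    beta     : Tikhonov regularization coefficient
    """
    N_x = X.shape[0]
    # np.linalg.solve is preferred over an explicit inverse for numerical stability
    return np.linalg.solve(X @ X.T + beta * np.eye(N_x), X @ Y_target.T).T

# Example with random placeholder data
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2000))
Y_target = rng.standard_normal((1, 2000))
W_out = train_readout(X, Y_target)
print(W_out.shape)   # (1, 100)
```

Solving the linear system rather than inverting $X X^T + \beta I$ directly is a standard numerical choice and gives the same result as Eq. 2.7.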

2.5 Reservoir Optimization

In determining the optimal reservoir, there are many adjustable global parameters that influence the dynamical behavior. The goal in reservoir optimization is therefore to generate dynamical behavior that is as similar as possible to the system the reservoir is attempting to model. The reservoir dynamics, as seen in Eq. 2.1, are governed by W, $W^{in}$, and $\alpha$. These parameters are related to different quantities that characterize the network: the reservoir size $N_x$, the spectral radius $\rho$, the input scaling, and the leaking rate $\alpha$. These global parameters and their effect on the reservoir are the topic of this section. The current standard practice involves intuitive manual adjustment of each of these parameters to optimize the reservoir [17]. I will study parameter optimization in section 5.3.

Reservoir Size

The reservoir size $N_x$, or the number of neurons, dictates the model capacity of the network. The current intuition states that, in general, bigger reservoirs lead to better performance. The training method (Eq. 2.7) used for echo state networks is computationally efficient enough to generate very large reservoirs, which have been found useful in automatic speech recognition [32].

A benchmark for the lower bound of the reservoir size is given by Lukoševičius [17]: the reservoir size $N_x$ should be at least equal to an estimate of the number of independent real values that the reservoir needs to retain in memory of the input. This is a result of the echo state property described in section 2.3: the maximum length of the reservoir memory is limited by its size.

Spectral Radius

The spectral radius $\rho$ is one of the most central parameters in the echo state network because it plays an important role in governing the echo state property of the reservoir. The radius $\rho$ is defined as the maximum absolute eigenvalue of the reservoir connection matrix W, and it determines the length of the reservoir memory; a larger spectral radius corresponds to a longer memory of the input history. In this project, the spectral radius essentially scales the weight matrix W. After a random matrix W is generated, its spectral radius $\rho(W)$ is calculated. W is then divided by $\rho(W)$ to normalize its spectral radius to 1, and the connection matrix is multiplied by the selected spectral radius $\rho$ so that $\rho(W) = \rho$. In other words, the connection weight matrix is first randomly generated and its spectral radius is then adjusted to the desired value. In practice the condition $\rho < 1$ ensures the echo state property in most applications [17]. The theoretical limit established in Ref. [6], however, shows that the echo state property is guaranteed for every input only under the tighter condition $\rho < 1/2$. In application, the echo state property often holds even for $\rho \geq 1$ given nonzero input. Large $\rho$ values may push the reservoir into spontaneous chaotic behavior, which would violate the echo state property; this is shown by Jaeger [12], where the trivial zero input leads to linearly unstable solutions given $\rho > 1$. Generally the spectral radius should be tuned to optimize reservoir performance, starting from a benchmark value of $\rho = 1$ and exploring other values close to 1. When modelling systems that depend on more recent input history, $\rho$ should be smaller; a larger $\rho$ is appropriate when a more extensive memory is required.
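The rescaling procedure just described amounts to a few lines of linear algebra; a sketch, assuming numpy and a dense reservoir matrix, follows.

```python
import numpy as np

def scale_spectral_radius(W, rho_desired):
    """Rescale a random reservoir matrix W so that its spectral radius equals rho_desired."""
    # Spectral radius = largest absolute eigenvalue of W
    rho_W = np.max(np.abs(np.linalg.eigvals(W)))
    return W * (rho_desired / rho_W)

rng = np.random.default_rng(1)
W = rng.uniform(-0.5, 0.5, (500, 500))
W = scale_spectral_radius(W, rho_desired=0.95)
print(np.max(np.abs(np.linalg.eigvals(W))))   # ~0.95
```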

Input Scaling

Input scaling is an important parameter, as previously noted in section 2.3, because it also plays a role in determining the echo state property [22]. The input scaling is typically determined by the input weight matrix $W^{in}$, which is randomly sampled on a range $[-a, a]$, where a is the scaling parameter. While not strictly required, using a single scaling parameter for the entire input series, scaling all the columns of $W^{in}$ together, reduces the number of free parameters that need to be tuned to optimize reservoir performance [17]. Theoretically it is possible to scale individual input units of u(n); increasing the number of input scaling parameters could expand the components of the input that favorably drive the reservoir dynamics. However, there is no algorithm that easily scales individual input components, and exploration of this topic is beyond the scope of this project. In this project, I determine the input scaling through $W^{in}$, which is chosen randomly from the range $[-1, 1]$; the input scaling parameter a is then multiplied into the input data u(n), which is analogous to sampling $W^{in}$ on the range $[-a, a]$. $W^{in}$ not only scales the input u(n) but also scales the constant bias. Input scaling determines the nonlinearity of the network dynamics. Lukoševičius [17] advises ensuring that the inputs to the state update Eq. 2.1 are bounded. Because of the tanh function that defines the neuron activations, inputs very close to 0 lead to nearly linear behavior of the neurons, inputs close to 1 or -1 may cause binary switching behavior, and inputs in between lead to nonlinear dynamical behavior of the neurons.

Leaking Rate

The leaking rate $\alpha$, which is bounded between [0, 1], determines the speed at which the input affects, or leaks into, the reservoir, and it approximates a discrete Euler integration of the state over time. This discretization comes from an Euler integration of a state equation of the form $\dot{x} = -x + f(t)$:

$$\frac{\Delta x}{\Delta t} = \frac{x(n + 1) - x(n)}{\Delta t} \approx -x(n) + f(n), \qquad (2.8)$$

so the solution for the new state x(n + 1) is

$$x(n + 1) = (1 - \Delta t)\, x(n) + \Delta t\, f(n). \qquad (2.9)$$

In the reservoir, the leaking rate $\alpha$ plays the role of $\Delta t$ in the Euler integration of an ordinary differential equation.

The leaking rate also functions as a form of exponential smoothing of the time series, in which previous states x(n) are weighted exponentially less over time. Substituting into the state update equation (Eq. 2.1), one sees

$$x(n + 1) = (1 - \alpha)^2 x(n - 1) + \alpha(1 - \alpha) f(x(n - 1)) + \alpha f(x(n)). \qquad (2.10)$$

In general, the leaking rate is set to match the speed of the dynamics the reservoir is attempting to model from $y^{target}$. A recent study [12] has shown that the leaking rate can also impact the short-term memory of echo state networks: in some reservoirs, given a small leaking rate $\alpha$, the slow dynamics of the reservoir states x(n) can increase the length of the short-term memory of the reservoir. I study the impact of the leaking rate along with other parameters more explicitly in section 5.3. Further studies of multiple time scales may be useful for systems whose components have dynamics on different timescales, but they are beyond the scope of this paper.

2.6 Summary

In this chapter, I outlined the specific components of the echo state network as well as the specific parameters that are useful in optimizing its performance. In the following chapter I apply this knowledge in practice by studying the modelling capacity of the reservoir using a known time series.

Chapter 3 Echo State Network and Mackey-Glass

3.1 Mackey-Glass System

The Mackey-Glass equation is a nonlinear time delay differential equation. It was originally derived as a model of chaotic dynamics in blood cell generation in a collaboration between Mackey and Glass at McGill University [20]. The equation expands on a simple feedback system whose dynamics are given by

$$\frac{dx}{dt} = \lambda - \gamma x, \qquad (3.1)$$

which has a stable equilibrium at $\lambda/\gamma$ in the limit $t \to \infty$, given that $\gamma$ is positive. In this simple feedback system, the rate of change of the control variable, dx/dt, is influenced by the value of the control variable: the system increases at a constant rate $\lambda$ and decreases at a rate $\gamma x$. To better model real physiological systems, modifications to this simple feedback system include introducing a time delay component. The form of the Mackey-Glass equation studied today is

$$\frac{dx}{dt} = \beta \frac{x_\tau}{1 + x_\tau^n} - \gamma x. \qquad (3.2)$$

The equation can generate periodic behavior, bifurcations, and chaotic dynamics given specified parameters $\beta$, $\gamma$, $\tau$, and n. The derivatives of a time delay differential equation depend on solutions at previous times: $x_\tau$ represents the state x at a time delayed by the constant $\tau$, that is, $x(t - \tau)$. There may be a significant time lag between determining the control variable x and responding with an updated change. Additionally, in real physiological models, the parameters $\gamma$ and $\lambda$ are not necessarily constant for all time t, and may vary with $x(t - \tau)$, also denoted $x_\tau$.

This is true for Eq. 3.2, which also exhibits period-doubling bifurcations leading up to a chaotic regime and has been extensively studied in relation to chaos theory [7]. Because the system has been so well characterized by previous studies, the Mackey-Glass system is often used to benchmark time series prediction studies. Previous studies by Jaeger [14] have shown that ESNs improve upon the best previous techniques for modeling the Mackey-Glass chaotic system [36] by a factor of 700, which he attributes to the ESN's ability to store and remember previous states, defined as the echo state property in section 2.3. In Jaeger's article, he uses Mackey-Glass parameters $\beta = 0.2$, n = 10, $\gamma = 0.1$, and $\tau = 17$, which I reproduced in figure 3.1 using a dde23 solver [5] in Python. For $\tau > 16.8$ the system has a chaotic attractor; since I use $\tau = 17$, the system exhibits a chaotic attractor, which can be observed in the time delay embedding in figure 3.2. I use the same parameters to begin training my echo state networks.

Figure 3.1: Mackey-Glass system over time t = 3000 given $\beta = 0.2$, n = 10, $\gamma = 0.1$, and $\tau = 17$.

The Mackey-Glass system, because it has been well studied in the context of the reservoir, is a natural starting point for preparing reservoir training tasks. The rest of this chapter details studies of the constant bias global parameter discussed in section 2.5 and its impact on the optimization of the echo state network, using source code made publicly available by Lukoševičius [18]. These studies will give insight into the basic routine used in training networks, which is applied to financial data sets in subsequent chapters.
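Since the exact solver script is not reproduced here, the sketch below shows one simple way to generate a comparable Mackey-Glass series: a fixed-step Euler integration of Eq. 3.2 with the parameters above. The step size and constant-history initialization are illustrative assumptions, not the settings of the dde23 solver used for figure 3.1.

```python
import numpy as np

def mackey_glass(T=3000, beta=0.2, gamma=0.1, n=10, tau=17, dt=0.1, x0=1.2):
    """Euler integration of dx/dt = beta*x_tau/(1 + x_tau^n) - gamma*x."""
    steps = int(round(T / dt))
    delay = int(round(tau / dt))
    x = np.full(steps + delay, x0)          # constant history for t <= 0
    for t in range(delay, steps + delay - 1):
        x_tau = x[t - delay]                # delayed state x(t - tau)
        dxdt = beta * x_tau / (1.0 + x_tau**n) - gamma * x[t]
        x[t + 1] = x[t] + dt * dxdt
    # Keep one sample per unit time, matching the discrete steps fed to the reservoir
    return x[delay::int(round(1 / dt))]

series = mackey_glass()
print(series.shape)   # about 3000 samples
```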

Figure 3.2: Phase space diagram of the Mackey-Glass attractor plotted by time delay embedding given $\beta = 0.2$, n = 10, $\gamma = 0.1$, and $\tau = 17$.

3.2 Constant Bias

In determining the states of the activations (see Eq. 2.1), there is an additional bias input, randomly generated within the $W^{in}$ matrix, that is multiplied by a constant scaling parameter. In order to better understand the impact of this bias input on the reservoir dynamics, I use Mackey-Glass data and study the behavior of the reservoir as the bias is changed. I generate input Mackey-Glass data for training and testing, construct an echo state network, and train the output matrix of the network to make predictions for the next step. The rest of this section illustrates the impact of the bias on the mean squared error. As I study the MSE, I also examine the impact that the output weight matrix $W^{out}$ may have on the error, as it indirectly affects the MSE by determining y(n).

Input preparation

Using Mackey-Glass parameters $\beta = 0.2$, n = 10, $\gamma = 0.1$, and $\tau = 17$, I generate a time series as shown in figure 3.1. This data is split into training data and testing data. In this numerical study,

I use a training length of 2000 and reserve a separate portion of the series for testing. The experimental data is shifted to lie between -1 and 1 so that it falls within the nonlinear regime of the tanh function. No additional input scaling factor is needed to adjust the input in this case because the range of the series is less than 1. The training data is weighted according to the randomly generated input weight matrix used in the reservoir to drive the activation states in Eq. 2.1. The first 100 inputs are used to initialize the reservoir and are not used to train the output. The rest of the training data is used as the target data $Y^{target}$ in Eq. 2.7. The testing data is compared with the reservoir output to determine the accuracy of the reservoir predictions, as in Eq. 2.2.

Reservoir Construction

A reservoir was constructed as described in section 2.3. In order to study the constant bias, I varied the scaling parameter for the bias from 0 to 10 in increments of 0.1 and examined the impact this has on the reservoir activations by isolating the parameter. I hold all other parameters constant: reservoir size = 1000, leaking rate = 0.3, spectral radius = 1.25, input scaling = 1. Using the first 2000 data points in the training sequence to drive the reservoir according to the reservoir update Eq. 2.1, the states of all the reservoir nodes are collected in X. As seen in figure 3.3, which shows some reservoir activations, the initial activations resulting from the first few random states of the network are highly variable in the first few steps, t < 50. Therefore the first 100 activations are ignored to remove any initial transience of the random reservoir. Since reservoir connections are randomly generated, I created 10 reservoirs for each scaling parameter to test the impact of the bias on the MSE.

Figure 3.3: The activations of a few nodes for t < 50 are more variable than the stable activations for t > 50.

Training and Prediction

Training the output feedback of the reservoir requires the target data $Y^{target}$ and the reservoir activations X in Eq. 2.7. The regularization coefficient is set to $10^{-8}$ for all the reservoir training throughout this project in order to minimize the number of adjusted parameters. Once the output matrix is determined, the prediction y(n) is calculated from the output weight matrix and the reservoir states. This output y(n) is then used as the subsequent input to continue to drive the reservoir and make the following prediction. This series of predictions is compared to the test data and the MSE, given by Eq. 2.2, is determined.
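The generative prediction loop described above can be sketched as follows; `update_state` and `W_out` refer to the illustrative snippets of chapter 2, not to the actual project code.

```python
import numpy as np

def free_run(update_state, W_out, x, u_last, n_steps):
    """Feed each prediction back as the next input and collect the outputs."""
    u = u_last
    predictions = []
    for _ in range(n_steps):
        x = update_state(x, u)                 # Eq. 2.1, driven by the previous output
        y = (W_out @ x).item()                 # linear readout, Eq. 2.3
        predictions.append(y)
        u = y                                  # output becomes the next input
    return np.array(predictions)

def mse(y_pred, y_true):
    """Mean squared error of Eq. 2.2."""
    return np.mean((y_pred - y_true) ** 2)
```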

Results

After training a number of reservoirs over a range of scaling constants, I compare the prediction results of the reservoirs to the test data. Figure 3.4 shows an example of the actual data compared to the reservoir predictions for one of the runs, using a bias scaling constant of 1.

Figure 3.4: Reservoir predictions after training compared to test data points.

To compare the results across the varying scaling constants, I calculate the MSE using Eq. 2.2 for reservoir predictions over a length of 500 steps for each reservoir. The average MSE of the 10 reservoirs for each bias scaling is shown in figure 3.5.

Figure 3.5: MSE across varying scaling constants for the bias, given parameters reservoir size = 1000, leaking rate = 0.3, spectral radius = 1.25, input scaling = 1.

As seen in figure 3.5, the smallest MSE occurred when the scaling constant was kept below 3.0, but the error increased when the scaling constant was zero or close to it. In the range from 0.8 to 2.2, the MSE was on the order of $10^{-5}$ or smaller. Typically in the literature this scaling constant is kept at 1 [17]. The purpose of the constant bias input, as explained by Jaeger and in section 2.2, is to increase the variability of the reservoir dynamics. In application, the bias may help the reservoir deal with offset or non-centered data; the mean of the shifted input data is close to, but not quite, zero. While the bias input does seem to have an impact on the MSE, there may be other indicators of large error in the echo state network model. Since $W^{out}$ has a significant impact on the output y(n), I compare the different weight matrices as they affect the MSE. Higher weights in the output matrix could possibly be the result of an unstable solution to Eq. 2.5. To study whether or not the output weight matrix would be

correlated with the MSE, figure 3.6 was produced to compare the mean of the output weight matrix to the MSE.

Figure 3.6: MSE as a function of the output weight matrix $W^{out}$.

Figure 3.6 indicates no correlation between the mean of the weight matrix and the MSE. A correlation might have been expected because weight matrices with larger, more unstable values could indicate poor linear fits for $W^{out}$ from Eq. 2.5. However, because Tikhonov regularization is intended to penalize large, unstable matrix solutions, such solutions may not manifest in the final output weight matrix; the regularization already acts as a threshold that dismisses highly unstable solutions derived from Eq. 2.7. Other studies of the output matrix could examine other statistical features. In the following section I study the impact of a few other reservoir parameters.

3.3 Leaking Rate, Spectral Radius, and Reservoir Size

There are many other important parameters that can be optimized in the network, as discussed in section 5.3. In this section I study the interrelated impact of multiple parameters. One of the flaws of optimizing each parameter individually is that the MSE may have a local minimum as a function of multiple parameters. In order to better optimize across multiple parameters, I study more advanced optimization techniques.

Gradient Descent

Gradient descent is a method commonly used to minimize error functions. In gradient descent optimization, also known as the method of steepest descent, an iterative approach is taken to converge to a local minimum. From an initial starting vector of parameters, the gradient of the MSE is calculated across the multiple parameters. The direction in which the MSE falls fastest is the negative gradient at that initial starting set of parameters. The parameters are then adjusted in the direction of steepest descent so that the MSE is reduced, and the gradient is calculated again. This process is repeated to minimize the MSE until the gradient converges to 0, at which point there is a local minimum. While gradient descent algorithms are useful in finding the minimum of smooth convex functions, not much is known about the error surface in reservoir computing. The error surface could be discontinuous and contain multiple local minima in which the gradient descent algorithm could get stuck and miss the absolute minimum. In order to get a better sense of the kind of error surface that exists in reservoir optimization, I conduct a grid search, detailed in the next subsection.

Grid Search

The basic grid search explores the MSE outcome for multiple reservoirs across several ranges of global parameters. Grid searches are a computationally expensive, brute-force method of finding optimal parameters, since multiple calculations need to be made at each point along the grid. However, for the purposes of this study, it provides the ability to visualize the error surface over many reservoir parameters. In this grid search, I explore the error surface as a function of reservoir size, leaking rate, and spectral radius. Data processing follows the same procedures as described in section 3.2. However, instead of simply adjusting one parameter, I am able to tune three different parameters. I study the slices of the

error surface at different reservoir sizes $N_x$ = 100, 400, 700, and 1000. The surface plots show the MSE at different values of the leaking rate, $0.05 \leq \alpha \leq 0.3$, and the spectral radius, $0.8 \leq \rho \leq 1.2$. At each point on the grid defined by the global parameters, 10 different randomly connected reservoirs are created. Because of the computational costs associated with a grid search, I take large step sizes in all of my parameters to minimize the number of searches that need to be run.

Results

While I am not able to conduct an exhaustive grid search, the purpose of the grid search is to better understand the error surface across reservoir parameters. The surface plots in figures 3.7, 3.8, 3.9, and 3.10 show the mean of the MSE outcomes for the 10 reservoirs created at each point. Figure 3.7 excludes values with leaking rate $\alpha < 0.2$ because the MSE exploded below that value. I have already discussed in section 2.5 that smaller reservoir sizes do not have the capacity for a large memory of the input, so it makes sense that many parameter values in the grid search returned extremely high MSE values. The MSE surface also tends to be lower for the $N_x$ = 1000 reservoirs compared to the other three plots. Studying these surface plots across multiple reservoir sizes, the MSE seems to be minimized for higher leaking rates, $\alpha > 0.2$, which corresponds to the values of $\alpha$ I have been using in my studies of Mackey-Glass.
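Structurally, the grid search is a set of nested loops over the global parameters, averaging the test MSE of several randomly generated reservoirs at each grid point. In the sketch below, the `evaluate` callable is a placeholder standing in for the build-train-predict routine of chapter 2, and the dummy function in the usage example exists only to make the snippet runnable.

```python
import numpy as np
from itertools import product

def grid_search(evaluate, sizes, alphas, rhos, n_trials=10):
    """Average the MSE of n_trials random reservoirs at every (N_x, alpha, rho) grid point.

    evaluate(N_x, alpha, rho, seed) is assumed to build, train, and test one
    reservoir and return its test MSE.
    """
    results = {}
    for N_x, alpha, rho in product(sizes, alphas, rhos):
        errors = [evaluate(N_x, alpha, rho, seed) for seed in range(n_trials)]
        results[(N_x, alpha, rho)] = np.mean(errors)
    return results

# Example usage with a dummy evaluation function standing in for the real ESN run
dummy = lambda N_x, alpha, rho, seed: 1.0 / N_x + abs(alpha - 0.3) + abs(rho - 1.0)
results = grid_search(dummy,
                      sizes=[100, 400, 700, 1000],
                      alphas=np.arange(0.05, 0.31, 0.05),   # 0.05 <= alpha <= 0.3
                      rhos=np.arange(0.8, 1.21, 0.1))       # 0.8  <= rho  <= 1.2
print(min(results, key=results.get))   # grid point with the lowest mean MSE
```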

Figure 3.7: MSE surface for $N_x$ = 100 reservoirs across $\alpha$ and $\rho$.

Figure 3.8: MSE surface for $N_x$ = 400 reservoirs across $\alpha$ and $\rho$.

Figure 3.9: MSE surface for $N_x$ = 700 reservoirs across $\alpha$ and $\rho$.

Figure 3.10: MSE surface for $N_x$ = 1000 reservoirs across $\alpha$ and $\rho$.

3.4 Summary

In this chapter I examined the Mackey-Glass system and how its numerical solutions could be used as input to the echo state network and as a demonstration of the network training process. I specifically studied the bias input to the reservoir and how the scaling constant for the bias input affected the reservoir predictions and error, and determined a wide optimal range for the bias input. Studying the output weight matrix, I determined there was no direct relationship between the mean of the output weights and the prediction ability of the reservoir. I was also able to study the multivariate impact of the reservoir parameters on the MSE in a grid search across multiple reservoir sizes, leaking rates, and spectral radii. The grid search suggests using leaking rate values $\alpha > 0.2$; the leaking rate used in the Mackey-Glass studies is $\alpha = 0.3$, which confirms I have been operating in an optimal range of reservoir parameters. As I move forward, these initial studies will guide my understanding of the reservoir predictions of financial data.

Chapter 4 Financial Forecasting and Neural Networks

This chapter provides a cursory examination of the history of analysis in financial markets as it relates to forecasting and nonlinear dynamics. In section 4.1, I describe the complex dynamics of the financial input. As a result of this nonlinearity, neural networks have historically been used as forecasting models with some success, and the following section elaborates on some of the results that artificial neural networks have achieved in finance. Finally, in the last section, I discuss the specific echo state network approach used as part of the Neural Forecasting Competition in 2007, in which different neural networks were used to predict unknown financial data.

4.1 Nonlinear Characteristics

In finance, many qualitative and quantitative forecasting techniques have been applied to try to predict fluctuations in share prices. However, according to the efficient market hypothesis, one cannot outperform the market and achieve returns in excess of the average market returns based on empirical information, because the current market price reflects all known information about the future value of the stock, and investors cannot purchase undervalued stock or sell at inflated rates. This hypothesis is highly controversial and heavily debated, and an entire body of study exists to produce methods and models that aim to have substantial predictive ability for stock prices. Prediction is extremely complex and difficult, as the stock market is the result of interactions of

many variables and players. Many previous techniques used various linear models to attempt to explain these numerous complex interactions. Chaos theory offers an explanation of the underlying generating process, suggesting the stock market may be a deterministic system. This section outlines the history of chaos in the description of financial markets, the determination of chaos in a time series, and possible causes for nonlinear behavior in the financial model. As a result of the failure of linear models, models of complex systems were developed to explain nonlinear processes in financial markets. The foundation for chaos theory in the realm of finance was established by Mandelbrot in 1963 in a study of cotton spot price data [21], where he found that price changes did not fit a normal distribution. The theoretical justification for price changes fitting a Gaussian was given by Osborne [27] using the central limit theorem: if transactions are a true random walk, they are independently and identically distributed, and price changes should be normally distributed since the price changes are the simple sum of the IID transactions. Mandelbrot discovered that the price changes differed from a normal distribution [21]. In particular, he found long tails in the distribution of price changes: there are more observations at the extreme ends than a normal distribution would predict. Furthermore, his studies of cotton prices did not indicate finite variance. He expected to see the variance converge as he sampled more cotton price changes, since he was increasing the number of observations; increasing the sample size did not cause the variance to converge as would be expected for a Gaussian distribution, in which the variance of each observation is identical. He suggested instead that the distribution followed a stable Paretian distribution. Fama confirmed that a stable Paretian distribution was a better fit for price returns than a Gaussian in his studies of thirty stocks in the Dow-Jones Industrial Index [4]. Other studies, by Praetz in 1972, suggested a scaled t-distribution as an alternative [28]. Later, in section 5.2, I fit the financial data to both a Gaussian and a Lorentzian distribution, which is a stable Paretian distribution with no skew and no defined variance, the latter being an important characteristic Mandelbrot observed in cotton prices. Mandelbrot's work established the basis of chaos theory in the world of finance. The rejection of the random walk theory brought chaos and determinism into the study of economics [24]. Studies applying correlation dimension analysis have found evidence of nonlinearities in share price data [25]; these studies sought to determine whether a time series is independently and identically distributed, as would be expected for a random walk. Studies using the correlation dimension have gone on to reject the Gaussian hypothesis offered by Osborne and to argue for the existence of chaos in

financial markets [10]. Many reasons have been offered to suggest chaotic dynamics within the stock market. Deterministic processes could be the result of behavioral economics, which dictates irrationality in investment decisions [35]: rather than making purely logical decisions based on the current market, the psychology of fear and emotion plays a deterministic role in risk-taking and investment decisions. A Paretian distribution implies a Paretian market, which is inherently more risky because there are more abrupt changes, the variability is higher, and the probability of loss is greater [4]. In general, the complex interactions of the many variables and players in the stock market could behave nonlinearly. Many researchers claim there exists some degree of determinism in the market, driven by high-dimensional attractors. This debate on chaos will be important as I study the use of neural networks in the financial domain.

4.2 History of Neural Networks in Finance

Throughout the years of research on stock prices, studies have indicated that a major roadblock is the lack of mathematical and statistical techniques in the field [35]. Some of the benefits of neural networks lie in their ability to learn to approximate functions. Given the vast amount of data available in the financial world, neural networks have become invaluable tools for detecting complex processes and modeling these relationships. Networks were introduced as models for time series forecasting as early as 1990, and most early studies focused on predicting returns from indices. In the prediction of returns on the Tokyo Stock Price Index (TOPIX), using data covering the period January 1985 to September 1989, Kimoto and collaborators [16] compared the performance of modular neural networks trained through backpropagation with multiple regression analysis, a traditional method of forecasting. The modular neural networks they created were series of independent feed-forward neural networks with three layers: an input layer, a hidden layer of nodes, and an output layer. The output of these networks was combined to inform buy or sell decisions for TOPIX. They were able to determine a higher correlation coefficient between the target data and the system of networks (0.527) than for the individual networks. In 1990, Kamijo and Tanigawa [15] used a recurrent neural network approach to model candlestick charts, charts combining line charts and bar charts to illustrate the high, low, open, and close price for each day. They were able to use backpropagation to train the RNN to recognize specific patterns in the price chart but

were unable to draw conclusions about the predictive abilities of the network. Neural networks were used to predict the daily change in direction of the S&P 500 in studies conducted by Trippi and DeSieno [33] and by Choi, Lee, and Lee [2]. Trippi and DeSieno [33] developed composite rules, which are different ways to combine the trained output from multiple networks into Boolean (rise or fall) signals. Their results show that they are 99% confident their best composite rule would outperform a randomly generated trading position, and they estimated a potential annual return of $60,000. Choi, Lee, and Lee [2] used neural networks to make rise-or-fall predictions for the stock index and were able to make higher annualized gains than previous methods. To address the limitations of backpropagation in dealing with noise and non-stationary data [30], other research uses a hybrid method incorporating a rules-based system from machine learning along with recurrent neural networks. The rules-based technique categorizes the stock data into cases based on empirical rules derived from past observations. Studies by Tsaih, Hsu, and Lai, incorporating a hybrid method that uses a rules-based system to generate input data for the neural networks, found better returns over a six-year period compared to a strategy where the stock was bought and held over the same period [34]. Studies are still working to improve the training algorithms for neural networks. In the next section I will study how the echo state network training algorithm, described in earlier chapters, can be implemented to handle financial data.

4.3 Neural Forecasting Competition

In the spring of 2007, echo state networks outperformed many other neural network training algorithms in the NN3 Artificial Neural Network and Computational Intelligence Forecasting Competition, where the objective was to forecast 111 monthly financial time series by 18 months. The submission using an echo state network approach by Ilies, Jaeger, Kosuchinas, Rincon, Sakenas, and Vaskevicius was ranked first in forecasting the competition time series [26]. The team used the same recurrent neural network architecture described in chapter 2 to train blocks of the 111 competition time series with high levels of success [11]. Based on their report, the 111 time series were divided into 6 temporal blocks. Each of these blocks was preprocessed using seasonal decomposition methods before being used to train a collective of 500 echo state networks. The reservoir parameters were manually tuned using part of the time series as a validation set [11]. The promising results

from the competition give rise to many more questions about the application of echo state networks in finance, which I will address in the next chapter.

Chapter 5 Network Predictions

In this chapter, I apply all I have learned about echo state networks and finance to test the ability of the echo state network to handle financial data. In the first section, I examine the data set I will be using, the daily closing price of the S&P 500 (GSPC), as well as the data processing measures implemented in the project. In section 5.2 I examine how I benchmark the reservoir performance against a random guess drawn from a distribution that accurately models the input data u(n). In order to optimize parameters simultaneously, I conduct a series of grid searches in section 5.3. I present the results of the reservoir predictions in section 5.5. I then study other methods which may improve prediction, such as the committee method.

5.1 S&P 500 Data

The data that I use as both input data and target data is the S&P 500, an American stock index of the top 500 corporations in the New York Stock Exchange. Specifically, I used the daily close price before dividend adjustments for each day, retrieved from Yahoo Finance [38]. The time period of data used ranges from March 2007 to March. The figure below shows the raw input data from Yahoo Finance.

Figure 5.1: S&P 500 daily close prices from March 2007 to March.

Data Processing

In order to use stock prices as input, I need to apply data processing techniques that allow the reservoir to run efficiently. Raw input could lead to unstable reservoir dynamics because the raw input data could lie in a range that does not produce meaningful reservoir activations. Important considerations in processing the data include converting the non-stationary financial time series into a stationary set. Once the data set is detrended, I also need to process the data to ensure the scale is within the correct range for the reservoir dynamics. Using a stationary representation of the financial data is important in ensuring proper reservoir performance [31]. A stationary process is one whose distribution is constant over time; in other words, the mean and the variance of the data remain constant over time. Several methods for detrending are considered for this project, including simple differencing, logarithmic differencing, and relative differencing. For the original n-length time series $y(t) = y_1, y_2, \ldots, y_n$, the simple difference is

$$y_{diff}(t) = y_2 - y_1,\; y_3 - y_2,\; \ldots,\; y_n - y_{n-1}. \qquad (5.1)$$

Another method to detrend the data is logarithmic differencing,

$$y_{log}(t) = \log(y_2/y_1),\; \log(y_3/y_2),\; \ldots,\; \log(y_n/y_{n-1}). \qquad (5.2)$$

The last method, similar to the simple difference, is the relative difference,

$$y_{rel}(t) = (y_2 - y_1)/y_1,\; (y_3 - y_2)/y_2,\; \ldots,\; (y_n - y_{n-1})/y_{n-1}. \qquad (5.3)$$

I chose to implement the relative difference between data points, as it indicates the change in stock price, which has explicit implications for stock returns; the sign of the relative difference gives the direction of the price trajectory. The following is a plot of the detrended data used in the project.

Figure 5.2: Relative difference of the S&P 500 data from figure 5.1.
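The three detrending transformations of Eqs. 5.1-5.3 are one-liners with numpy; the sketch below assumes the closing prices have already been loaded into an array, and the short price series at the end is a placeholder.

```python
import numpy as np

def simple_diff(y):
    """Eq. 5.1: consecutive differences y_{t+1} - y_t."""
    return np.diff(y)

def log_diff(y):
    """Eq. 5.2: logarithmic differences log(y_{t+1} / y_t)."""
    return np.diff(np.log(y))

def relative_diff(y):
    """Eq. 5.3: relative differences (y_{t+1} - y_t) / y_t, i.e. daily returns."""
    return np.diff(y) / y[:-1]

# Example with a short placeholder price series
prices = np.array([1500.0, 1512.5, 1498.0, 1503.2])
print(relative_diff(prices))   # signs give the direction of the price moves
```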

Another important consideration in data processing is the scaling of the input data. As discussed in section 2.5, the input should fall within the nonlinear range of the tanh function. Since many of the relative differences were below 10%, an input scaling factor between 1 and 50 was used. Figure 5.3 shows the MSE, averaged over 10 reservoirs, for scaling factors between 1 and 50. This is the result for a reservoir with spectral radius $\rho = 0.8$, $N_x$ = 500 neurons, and leaking rate $\alpha = 0.1$. The plot shows that the input scale that minimizes the MSE is 1, which implies that the unscaled data may be a valid input to the reservoir. The range of input scales that causes a peak in the MSE should be avoided. The shape of the MSE curve is very interesting, but further studies are needed to understand the dynamics underlying the bell-shaped MSE curve.

Figure 5.3: MSE of the reservoir over input scaling factors between 1 and 50, for a reservoir with spectral radius $\rho = 1.1$, a reservoir size of 500 neurons, and leaking rate $\alpha = 0.1$.

5.2 Benchmarking

In order to test the reservoir performance, I compare the reservoir output to a random draw from a distribution. As discussed in section 4.1, according to random walk theory the distribution of price changes should be Gaussian: a random, independent, identically distributed variable should have a normal distribution. Therefore I fit a histogram of S&P 500 price changes to a Gaussian in figure 5.4, where the Gaussian distribution is defined by a mean $\mu$ and standard deviation $\sigma$ in

$$G(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}. \qquad (5.4)$$

The fitted parameters $\mu$ and $\sigma$ are given in table 5.1.

$\mu$ = …e-05 ± …
$\sigma$ = … ± …

Table 5.1: Gaussian fit parameters for the distribution of S&P 500 data.

In figure 5.4, the Gaussian does not fit the distribution very well; the reduced chi-square is much larger than 1, which means the distribution is not fully capturing the data. There are too many extreme values in the long tail of the distribution for the Gaussian to be a good fit. Furthermore, because the Gaussian drops off exponentially from the mean, the slope is too steep to capture all the points in the bell curve. In table 5.1, the fitted $\mu$ has a very high uncertainty, as the center may be variable. The Gaussian is not a good fit for the distribution of price changes, which also leads me to reject the Gaussian distribution as a way to model the input data.

Figure 5.4: Gaussian fit to the histogram of relative price changes in the S&P 500.

In order to find a distribution with a better fit, I attempted a few other fits using the Lorentzian distribution, also known as the Cauchy distribution, and the Voigt profile, which is a convolution of the normal and the Lorentzian distributions. Both of these distributions have thicker tails than the Gaussian and are therefore better able to capture the larger price changes within the distribution. The Lorentzian distribution is a special case of the stable Paretian distribution discussed in section 4.1, with no skewness. The probability density function of the Lorentzian is defined by the location x_0 and width γ as

$$ L(x, x_0, \gamma) = \frac{1}{\pi\gamma\left[1 + \left(\frac{x - x_0}{\gamma}\right)^{2}\right]}. \qquad (5.5) $$

Figure 5.5 shows the Lorentzian fit, and the parameters for the distribution are listed in table 5.2. The reduced chi-square for the Lorentzian fit is very close to 1, which indicates a good fit between the data and the distribution. The thicker tails of the Lorentzian are better able to capture the larger price changes that occur in the distribution. The uncertainty in the center location x_0 is relatively high compared to the value of the center location itself. However, since there was also trouble fitting µ in the Gaussian model, I assume that the center of the historical price change distribution is variable and difficult to fit.
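The exact fitting procedure is not reproduced here, but a simple way to estimate x_0 and γ is a maximum-likelihood fit with scipy.stats.cauchy; the sketch below also compares the probability of a move larger than 3% under the Lorentzian and Gaussian fits, using the same stand-in data as above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
returns = 0.01 * rng.standard_t(df=3, size=2000)   # heavy-tailed stand-in for the price changes

# Maximum-likelihood estimates of the Lorentzian (Cauchy) location x_0 and scale gamma.
x0, gamma = stats.cauchy.fit(returns)
print(f"x0 = {x0:.2e}, gamma = {gamma:.2e}")

# Tail-weight comparison: probability of a daily move larger than 3% under each fitted model.
mu, sigma = returns.mean(), returns.std()
p_lorentz = stats.cauchy.sf(0.03, loc=x0, scale=gamma) + stats.cauchy.cdf(-0.03, loc=x0, scale=gamma)
p_gauss = stats.norm.sf(0.03, loc=mu, scale=sigma) + stats.norm.cdf(-0.03, loc=mu, scale=sigma)
print("P(|x| > 0.03), Lorentzian:", p_lorentz)
print("P(|x| > 0.03), Gaussian:  ", p_gauss)
```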

Figure 5.5: Lorentzian fit to the histogram of relative price changes in the S&P 500.

x_0   …e-06 ± 7.81e-05
γ     … ± 7.81e-05
Table 5.2: Lorentzian fit parameters for the distribution of S&P 500 data.

Another view of the Lorentzian fit to the data, focused on the tails, is shown in figure 5.6. This figure shows that the tail of the Lorentzian distribution fits the data well. However, the Lorentzian does have slightly thicker tails than the data, so it would predict somewhat larger price fluctuations than actually occur.

Figure 5.6: A closer view of the tail of the relative price distribution and the Lorentzian fit; the data show fewer large price fluctuations than the Lorentzian would predict.

The next distribution that I attempt to fit to the data is the Voigt profile, a convolution of the Gaussian distribution, given by Eq. 5.4, and the Lorentzian distribution, given by Eq. 5.5. The convolution is defined by

$$ V(x, \sigma, \gamma) = \int G(x', \sigma)\, L(x - x', \gamma)\, dx'. \qquad (5.6) $$

The Voigt profile in figure 5.7 has almost the same reduced chi-square as the Lorentzian fit in figure 5.5; its parameters are shown in table 5.3. The Voigt profile has three parameters to fit, compared to two for both the Gaussian and the Lorentzian. Even with this additional parameter, the goodness of fit is almost identical to that of the Lorentzian fit. As with the Lorentzian distribution, the Voigt profile is better able to capture the thicker tail of the price change distribution, where the large fluctuations lie.
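scipy.special.voigt_profile evaluates the convolution in Eq. (5.6) directly, so the three-parameter fit can be set up the same way as the histogram fits above; the starting values and bounds here are guesses, and the data remain a stand-in.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.special import voigt_profile

def voigt(x, x0, sigma, gamma):
    # Voigt profile centered at x0: Gaussian of width sigma convolved with a Lorentzian
    # of half-width gamma, as in Eq. (5.6).
    return voigt_profile(x - x0, sigma, gamma)

rng = np.random.default_rng(1)
returns = 0.01 * rng.standard_t(df=3, size=2000)

density, edges = np.histogram(returns, bins=60, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

p0 = [0.0, 0.5 * returns.std(), 0.5 * returns.std()]   # rough starting values
bounds = ([-0.1, 1e-6, 1e-6], [0.1, 1.0, 1.0])         # keep sigma and gamma positive
popt, pcov = curve_fit(voigt, centers, density, p0=p0, bounds=bounds)
perr = np.sqrt(np.diag(pcov))
for name, val, err in zip(("x0", "sigma", "gamma"), popt, perr):
    print(f"{name} = {val:.2e} +/- {err:.2e}")
```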

As with the previous two models, the fitted parameter for the center x_0 is uncertain in the Voigt model. The σ value is also relatively uncertain; the uncertainty of σ in the fit is 13.57% of the σ value.

Figure 5.7: Voigt profile fit to the histogram of relative price changes in the S&P 500.

x_0   …e-05 ± 7.70e-05
γ     … ± …
σ     … ± …
Table 5.3: Voigt profile fit parameters for the distribution of S&P 500 data.

Ultimately, to measure the performance of the reservoir, I use values drawn randomly from the Lorentzian distribution that best approximates the inputs. The reduced chi-square for the Lorentzian fit is very similar to that of the Voigt profile, and implementing a random selection from the Lorentzian is more straightforward. Each such draw simulates a random guess for the price change on a given day, based on the distribution of all previous days.
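A sketch of this baseline, assuming a simple split into historical and test days: fit the Lorentzian to the historical changes, draw one random value per test day, and compute the MSE against the actual changes. The data and split are stand-ins. Note that because the Lorentzian has infinite variance, occasional draws far out in the tail can dominate the benchmark MSE.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
returns = 0.01 * rng.standard_t(df=3, size=1500)   # stand-in for the relative-difference series

# Fit the Lorentzian to the "historical" part and guess randomly for each test day.
train, test = returns[:1000], returns[1000:]
x0, gamma = stats.cauchy.fit(train)
random_guess = stats.cauchy.rvs(loc=x0, scale=gamma, size=len(test), random_state=3)

mse_random = np.mean((random_guess - test) ** 2)
print(f"random-draw baseline MSE = {mse_random:.3e}")
```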

This random guess establishes the baseline for a randomly determined forecasting model. Comparing this baseline to an echo state network, I can measure how well the trained network predicts price changes by comparing the MSE of each method.

5.3 Parameter Optimization

In order to optimize the reservoir to make the best predictions possible, there are a large number of global parameters that can be tuned. I discussed these parameters in section 2.5. Initially, I optimized these parameters individually to find the value of each that reduces the MSE, which is the standard practice [17]. However, many of these parameters may affect each other, so the best optimization strategy would determine the parameters simultaneously. I considered a gradient descent algorithm that would calculate the gradient of the MSE with respect to the different parameters and move in the direction of steepest descent to find a local minimum. However, this method may not be effective where there are many local minima or where the surface is discontinuous. Therefore, to better understand the landscape of the MSE as a function of multiple parameters, I conduct a coarse grid search across the leaking rate, spectral radius, and reservoir size. For each set of parameters I create 10 reservoirs and average the MSE across them. Figures 5.8, 5.9, 5.10, and 5.11 show the surface of the MSE as a function of spectral radius and leaking rate for reservoir sizes of 100, 400, 700, and 1000, respectively. These figures highlight that while the MSE does not vary greatly over much of the parameter space, certain choices of parameters cause the reservoir to fluctuate strongly and return output values with extremely high errors. Compared to the grid search in section 3.3, where α ≈ 0.3 gave the best MSE results, the MSE for financial input data is minimized for leaking rate α ≈ 0.1 and spectral radius ρ ≈ 1. There are many differences between the Mackey-Glass data and the stock data that could affect the optimal parameters for the reservoir output. In the case of Mackey-Glass, the solution to Eq. 3.2 was taken with very small time steps of Δt = 0.1. This makes the Mackey-Glass input much smoother than the financial data series, which, as seen in figure 5.2, experiences many shocks. The smaller leaking rate α could minimize the impact of these sudden shocks by integrating the input updates more slowly over time. The smaller optimal spectral radius implies a shorter memory is required for the system: financial markets may be less dependent on price changes in the distant past than on more recent price shifts.
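A skeleton of such a coarse grid search is sketched below. The grid ranges are illustrative, and average_mse is a stand-in for the actual evaluation (build 10 random reservoirs with the given parameters, train each on the price-change series, and average the test MSE); a reservoir-training sketch appears in section 5.4.

```python
import itertools
import numpy as np

def average_mse(leak_rate, spectral_radius, size, n_trials=10):
    """Stand-in scoring function: in the real search this would build `n_trials` random
    reservoirs with these global parameters, train each on the data, and return the mean MSE."""
    rng = np.random.default_rng()
    return float(np.mean(rng.random(n_trials)))   # dummy value so the skeleton runs

leak_rates = np.round(np.arange(0.1, 1.01, 0.1), 2)       # illustrative grid
spectral_radii = np.round(np.arange(0.2, 1.61, 0.2), 2)
sizes = (100, 400, 700, 1000)

results = {}
for size in sizes:
    for alpha, rho in itertools.product(leak_rates, spectral_radii):
        results[(size, alpha, rho)] = average_mse(alpha, rho, size)

best = min(results, key=results.get)
print("lowest average MSE at (size, alpha, rho) =", best)
```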

However, despite the differences in how the spectral radius and leaking rate affect the Mackey-Glass and financial inputs, the reservoir size functions similarly in both cases: increasing the reservoir size increases the stability of the MSE. The slope of the MSE surface for reservoir size N_x = 1000 is lower than that for N_x = 100, for both the Mackey-Glass input and the S&P 500 price-change input. In preparing the reservoir model to forecast S&P 500 price changes, I therefore keep the parameters in the optimal range, with α ≈ 0.1, ρ ≈ 1, and a reservoir size N_x as large as is computationally feasible.

Figure 5.8: MSE surface across spectral radius and leaking rate with reservoir size N_x = 100.

Figure 5.9: MSE surface across spectral radius and leaking rate with reservoir size N_x = 400.

Figure 5.10: MSE surface across spectral radius and leaking rate with reservoir size N_x = 700.

Figure 5.11: MSE surface across spectral radius and leaking rate with reservoir size N_x = 1000.

5.4 Reservoir

The reservoir used in forecasting the S&P 500 data is very similar to the reservoir system used in section 3.2. From figure 5.3, the optimal scaling factor is 1, which I set for all reservoirs forecasting the market index. Based on section 5.3, I use a reservoir size N_x = 1000, leaking rate α = 0.1, and spectral radius ρ = 0.8. For the input, I use the first 1000 data points u_0, u_1, ..., u_1000 of the training sequence, computed as the relative difference in Eq. 5.3, to drive the reservoir. The first 20 activations are disregarded and not used in training the network, since they serve to initialize the reservoir. The weighted output of the reservoir nodes is then trained on the target data, which comes from the relative difference of the input S&P 500 data, y_target(n) = u(n). This gives the trained output weight matrix W_out, which is used to predict the price change one day ahead. Because the market exhibits such complex behavior, it is hard to expect even a trained reservoir to make highly accurate predictions.
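A compact, self-contained sketch of the training procedure just described: a leaky-integrator tanh reservoir driven by the first 1000 relative differences, a 20-step washout, and a ridge-regression readout. The weight initialization, the ridge parameter, and the one-day-ahead target convention used here are assumptions for illustration, and the input series is a stand-in for the S&P 500 data.

```python
import numpy as np

rng = np.random.default_rng(4)
u = 0.01 * rng.standard_t(df=3, size=1101)    # stand-in for the relative-difference series

# Global parameters from sections 5.1 and 5.3; the ridge strength is an added assumption.
N_x, alpha, rho, washout, ridge = 1000, 0.1, 0.8, 20, 1e-6

# Random input and recurrent weights; W is rescaled so its spectral radius equals rho.
W_in = rng.uniform(-0.5, 0.5, size=(N_x, 2))              # column 0 multiplies a constant bias
W = rng.uniform(-0.5, 0.5, size=(N_x, N_x))
W *= rho / np.max(np.abs(np.linalg.eigvals(W)))

def run_reservoir(inputs):
    """Leaky-integrator tanh update; returns the state x(n) after each input u(n)."""
    x = np.zeros(N_x)
    states = []
    for u_n in inputs:
        x_tilde = np.tanh(W_in @ np.array([1.0, u_n]) + W @ x)
        x = (1 - alpha) * x + alpha * x_tilde
        states.append(x.copy())
    return np.array(states)

# Drive the reservoir with the first 1000 points; here the target is taken as the next
# day's relative change (one-day-ahead prediction).
train_u, target = u[:1000], u[1:1001]
X = run_reservoir(train_u)[washout:]                      # discard the first 20 activations
Y = target[washout:]

# Ridge-regression readout for row-stacked states: W_out = (X^T X + beta I)^(-1) X^T Y.
W_out = np.linalg.solve(X.T @ X + ridge * np.eye(N_x), X.T @ Y)
print("training MSE =", np.mean((X @ W_out - Y) ** 2))
```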
