
Reservoir Computing: Memory, Nonlinearity, and Spatial Observers

A Thesis Presented to The Division of Mathematics and Natural Sciences
Reed College

In Partial Fulfillment of the Requirements for the Degree Bachelor of Arts

Noah James Shofer
May 2018


Approved for the Division (Physics)
Lucas Illing


Acknowledgements

I'd like to thank Lucas for both being a fantastic thesis advisor and for all the conversations about hiking & camping in the Northwest. Hopefully I'll get to go on at least a couple more of the hikes you mentioned. Thank you also to Joel for showing me that physics theory can be fun, even if you can't put it on a breadboard or an optics table. Jenna and John, thank you for a fantastic experience with JLab, and for all the trivia about Horowitz & Hill. Finally, to Andrew, Alison, Darrell, and Johnny, thank you for all the classes that you taught, each with a different perspective on what it means to do physics. Thank you Mom & Dad and Nick & Nate and Mama & Pop and GMD. You all put up with me, for some reason, and I'm so fortunate to have you as my family. GPD, I think you would have enjoyed this one. I couldn't have done any of this without my amazing friends/burrito connoisseurs Sarah, Aja, Andrew, and Ribby, as well as all of the other denizens of the physics subbasement. Steven, Alex, Luke, and Michael, it's been great playing and listening to music with you. And lastly, thank you Ira, Avehi, Eliotte, Nikki, Oona, JR, Robby, Ella, and anyone else I missed, for nothing (and everything) in particular.


List of Abbreviations

RNN   Recurrent Neural Network
RC    Reservoir Computing
RMSE  Root Mean Square Error
CML   Coupled Map Lattice


Table of Contents

Introduction
Chapter 1: The Basics of Reservoir Computing
    1.1 Anatomy of a Recurrent Neural Network
    1.2 Training Recurrent Neural Networks
    1.3 Reservoir Computing
        The Reservoir
Chapter 2: Constructing, Training, and Running a Reservoir Computer
    2.1 Network Implementation
        Generating the Reservoir
        Generating W_in & W_fb
    2.2 Training Procedures
        The Moore-Penrose Pseudoinverse
        Ridge Regression
        A Toy Example
Chapter 3: Chaos & Attractors
    3.1 Background
    3.2 The Lorenz System
    3.3 The Rössler System
    3.4 Observability
Chapter 4: Reservoir Computers as Observers: Linear vs. Nonlinear Reservoirs
    4.1 Memory vs. Nonlinearity
    4.2 Linear Reservoir
        Rössler System
        Lorenz System
    4.3 Comparing Activation Functions
        Quadratic Reservoir
        Strictly Linear or Strictly Nonlinear Reservoir
        Results
Chapter 5: Learning Attractors

    5.1 Lorenz Attractor
    Rössler Attractor
Chapter 6: Coupled Map Lattices
    Background
    The Hagerstrom CML
    Motivation
    Mathematics
    Implementing CMLs
    In Silico Implementation
    Experimental Implementation
    Comparing Experiment to Simulation
Chapter 7: Predicting Coupled Map Lattices
    The Problem with Two Dimensions
    First Attempts, and a Different Problem
    1D Lattices
    Experimental 1D Lattice
    Faking Two Dimensions
Chapter 8: Summary & Future Work
    Linear vs. Nonlinear Reservoirs
    Learning Attractors
    Coupled Map Lattices
    Future Work
    Exploring Linearity & Nonlinearity in Chaotic Systems
    Real-time Inference and Control of Chaotic Systems
Appendix A: Reservoir Computing Algorithms
Appendix B: Coupled Map Lattice Algorithms
Appendix C: Derivation of Experimental Intensity Relationship
References

List of Tables

4.1 Linear activation function network parameters
4.2 RMSE for linear RC generated Rössler attractor components
4.3 RMSE for linear RC generated Lorenz attractor components
4.4 RMSE for inferred ỹ, z̃ generated by x̃-driven RC, Lorenz system
4.5 RMSE for inferred x̃, z̃ generated by ỹ-driven RC, Lorenz system
4.6 Network parameters for activation function comparison experiment
4.7 RMSE for inferred ỹ, z̃ generated by x̃-driven RC, Rössler system
4.8 RMSE for inferred x̃, z̃ generated by ỹ-driven RC, Rössler system
Network parameters for Lorenz attractor learning
Network parameters for Rössler attractor learning
Network parameters for the 1D coupled map lattice inference task
Network parameters for the experimental 1D CML inference task
Network parameters for 1D & fake 2D coupled map lattice inference comparison


List of Figures

1.1 A cartoon diagram of a reservoir computer-type recurrent neural network. The blue arrows represent W_in, the red arrows represent W, the teal arrows represent W_out, and the dashed purple arrows represent W_fb. The inputs on the left represent u(t), the neurons in the center represent x(t), and the outputs on the right represent y(t)
1.2 The tanh(x) function
2.1 The displacements of the toy network neurons for the τ = 250 time step training period
2.2 The reservoir computer output generated after training compared to the actual desired output
3.1 A simple trajectory
3.2 The Lorenz Attractor
3.3 The Rössler Attractor
4.1 Linearly generated Rössler system components
4.2 Linearly generated Lorenz system components
The Lorenz system, generated by a standard tanh(x) reservoir
The Lorenz system, generated by a quadratic reservoir
The Lorenz system, generated by a linear reservoir
The Rössler system, generated by the tanh(x) reservoir
The Rössler system, generated by the quadratic reservoir
The Rössler system, generated by the linear reservoir
Cobweb plots of the uncoupled lattice update equation
Evolution of uncoupled 1D lattices of 10 nodes
The experimental setup used to generate coupled map lattices
A snapshot of the experimental lattice
Comparison of experimental and simulated 1D coupled map lattices generated with n = 36, a = 0.85, ɛ = 0.9, and R =
Comparison of experimental and simulated 1D coupled map lattices generated with n = 36, a = 0.85, ɛ = 0.6, and R =
Comparison of experimental and simulated 1D coupled map lattices generated with n = 36, a = 0.85, ɛ = 0.6, and R =

6.8 Comparison of experimental and simulated 1D coupled map lattices generated with n = 36, a = 0.85, ɛ = 0.9, and R =
A few different 2D coupled map lattice observer schemes
A few time steps of the CML dataset used for the 1D parameter search
A few time steps of a 2D lattice
The RMSE for inferred 1D CML dynamics with observer nodes every fourth node plotted against one of the reservoir parameters
The RMSE for inferred 1D CML dynamics with observer nodes every eighth node plotted against one of the reservoir parameters
The observer scheme used for the fake 2D reservoir inference tests
The RMSE for inferred 1D and fake 2D CML dynamics with nodes 1 & 4 (columns 1 & 4 for the fake 2D case) as observer nodes plotted against one of the reservoir parameters

Abstract

Reservoir computers are a subset of neural networks that behave as a dynamical system rather than a function. This makes reservoir computers uniquely suited to solving problems that involve time series data, but how they do so is not well understood. This thesis covers the background and implementation of reservoir computers, before discussing the effects of linear memory and nonlinear computational ability. We use reservoir computers with either exclusively linear memory, a combination of linear memory and nonlinear computational ability, or exclusively nonlinear computational ability. As a first task, we show that these reservoir computers can act as observers for the Lorenz and Rössler chaotic systems, i.e. they are able to predict unobserved dynamical variables of these systems. We find that while the reservoir with a combination of linear memory and nonlinear computation usually performs better as an observer, for some tasks either linear memory or nonlinear computation is not necessary to accurately make inferences about the system. Next, the linear and combination linear/nonlinear reservoirs were tasked with learning the Lorenz or Rössler attractor. The combination reservoir was able to learn the attractors, but the linear reservoir was not. Lastly, coupled map lattice dynamics were generated experimentally and in silico, and used to see if reservoir computers with a nonlinear activation function could function as observers for a dataset with spatial extent. We found an optimized parameter set that enabled the reservoir computer to predict the dynamics of the unobserved nodes of a 1D lattice when given the dynamics of a few nodes within the lattice. We found that, in principle, the reservoir could also function as an observer for a 2D lattice, though the spatiotemporal dynamics of our particular 2D lattice were too complex for any reservoir size we used.


Introduction

Machine learning is, at its heart, the use of computer algorithms to process large quantities of data. There is a huge range of different machine learning techniques, from simple model-fitting algorithms to massive neural networks, and covering all of them would be an undertaking on its own. This thesis will focus on a small subset of neural network techniques known as reservoir computing. But before we discuss reservoir computing, let's take a look at neural networks more generally. While neural networks have been a subject of great scientific and public interest in the past few years, they have been a subject of scientific exploration since the late 1940s, with the first successful neurocomputer being developed in 1957 and 1958 [1]. In recent years, neural networks have been used for a wide variety of tasks where large amounts of data are available, such as computer vision (e.g. recognizing a person, or more likely, a cat, in a photograph) or speech recognition [2]. Loosely based on biological brains, a neural network consists of neurons connected by synapses [3]. These synapses are weighted, which is to say that some neurons are connected together more strongly than others, depending on the weight of the synapse connecting the neurons. Each neuron operates by summing up the weighted input signals from other neurons and outputting a nonlinear transformation of the summed input. By adjusting the weights of the synaptic connections, we can change the output of the network. A typical methodology for implementing a neural network consists of two stages: training and then running the network. The training dataset consists of a set of pre-sorted data: for example, a set of pictures that may or may not contain a dog, and a marker that tells us whether each picture has a dog in it. To train the network, we supply the network with the pictures and the markers, and update the weights of the synaptic connections so that it learns to associate certain patterns in the images with a dog. To run the network, we supply a set of pictures that the network hasn't seen before, and let the network tell us the marker for each picture. There are several types of neural network, such as feed-forward, convolutional, and recurrent neural networks. Each of these types of networks is structured differently and performs better or worse on certain types of data, but all share the same basic operating principles. Reservoir computers (RC) are a subset of recurrent neural networks. Recurrent neural networks differ from both feed-forward and convolutional neural networks in that the neurons are connected in cycles, rather than just in layers [3]. As the name suggests, in a feed-forward neural network, all the neurons are connected in layers, each of which takes inputs from the previous layer and outputs to the next layer. A convolutional neural network is organized similarly. By comparison, the

neurons of a recurrent neural network are connected in a random manner, and may have feedback loops (cycles) between sets of neurons. The cycles allow the recurrent neural network to have self-sustained temporal dynamics even when there is no input to the network, whereas a feed-forward neural network will not have any dynamics when there is no input. That is to say, the recurrent neural network is a dynamical system, while the feed-forward neural network is a function [3]. The dynamical nature of recurrent neural networks allows them to excel at learning time-series data, which is useful for tasks as diverse as predicting the weather and controlling chaotic systems [4]. However, recurrent neural networks are difficult to train using the conventional training algorithms for neural networks [3]. Reservoir computing was first proposed in 2001 by Jaeger [5] as a simpler method for training recurrent neural networks. In the reservoir computing approach, the synapse weights between neurons are kept constant, and only the output synapse weights (i.e. the weights from the neurons to the network output) are changed during training. The neurons are effectively treated as a reservoir of dynamics that reflect the input data, and we draw from this reservoir to construct the output. This training method often produces networks that are very successful at their given task, with the added benefit of being quick and easy to implement. However, despite nearly two decades of research, there are still many open questions in reservoir computing [4]. Active areas of research include reservoir structure, the computational ability of reservoir computer networks, and physical implementations of reservoir computers [6]. This thesis primarily focuses on the computational ability of reservoir computers, first covering the background and implementation of reservoir computers, then investigating linear memory and nonlinear computation in the reservoir via numerical experimentation. We then explore using reservoir computers to make inferences about spatiotemporally chaotic datasets. Background and in silico and experimental methods of generating spatiotemporally chaotic datasets are also discussed.

Chapter 1: The Basics of Reservoir Computing

1.1 Anatomy of a Recurrent Neural Network

Figure 1.1 shows a cartoon layout of a recurrent neural network (RNN). On an abstract level, an RNN can be thought of as three layers of connected neurons. The first layer of nodes consists of the time-dependent, K-dimensional input to the network, denoted here as u(t). These inputs are connected to the second, hidden, N-dimensional network layer of the RNN, which is represented by the state vector x(t). In contrast to a feed-forward neural network, this network layer is not organized in sublayers of neurons that pass information forward, but instead possesses topological cycles [3]. Finally, the third layer is the L-dimensional output layer. It is connected to the network layer of the RNN, and is denoted by y(t). Note that there might also be connections from the output back into the network layer; this is useful for cases in which feedback is desired for the system. The output layer gives the computed solutions generated by the RNN. The connections between neurons within the hidden network layer and between the three layers are sometimes called synaptic connections [3], and are represented by matrices: W ∈ R^(N×N), W_in ∈ R^(N×K), W_out ∈ R^(L×(N+K)), and W_fb ∈ R^(N×L). These matrices describe the internal connections, the input connections, the output connections, and the feedback (output to hidden network layer) connections, respectively. In general, the connection weights can be either positive or negative. Depending on how it is implemented, the hidden network layer of an RNN can evolve continuously in time or with a discrete update step [7]. While continuous models are often used to represent biological processes [7] or some physical implementations of an RNN (see Paquot et al. [8]), most implementations of RNNs use a discrete update step. For discrete updates, the most abstract form of the update equation for an RNN is [3]:

x(t + 1) = F(x(t), u(t + 1), y(t))    (1.1)

where F is some update function. A more concrete form of the equation is given by [3, 7, 9]:

x(t + 1) = f(W_in u(t + 1) + W x(t) + W_fb y(t) + ξ)    (1.2)

Figure 1.1: A cartoon diagram of a reservoir computer-type recurrent neural network. The blue arrows represent W_in, the red arrows represent W, the teal arrows represent W_out, and the dashed purple arrows represent W_fb. The inputs on the left represent u(t), the neurons in the center represent x(t), and the outputs on the right represent y(t).

for some activation function, f, that is applied element-wise to the update vector. This activation function may take a variety of forms, but is often a nonlinear function such as tanh(·) that saturates for large values. The time-dependent output y(t) of the RNN is given by [3, 7]:

y(t + 1) = f_out(W_out [x(t + 1), u(t + 1)])    (1.3)

where f_out is some function applied individually to the elements of the output vector and [·, ·] denotes concatenation of two vectors. While f_out may be a nonlinear function, many simple networks (including the networks used in this thesis) take f_out to be the identity. In this case, the network output is simply y(t + 1) = W_out [x(t + 1), u(t + 1)].

1.2 Training Recurrent Neural Networks

Similarly to feed-forward neural networks, RNNs are trained by updating the connection weights between input, output, and network nodes. The goal of training is ultimately to minimize the error between the output generated by the RNN, y, and the target output ŷ. There are a variety of error functions that can be used, but commonly employed is the root mean square error (RMSE), given by [9]:

E(y, ŷ) = (1/L) Σ_{i=1}^{L} [ (1/T) Σ_{n=1}^{T} (y_i(n) − ŷ_i(n))² ]^(1/2)    (1.4)

where E(·, ·) is the error between two signals, T is the total number of time steps over which the error is computed, and ŷ designates the target signal vector to which the

Figure 1.2: The function tanh(x). Note the linear region from roughly x = −1 to x = 1. This transitions to a nonlinear region for 1 ≲ |x| ≲ 3. The function then saturates to a constant value for |x| ≳ 3.

network output is being trained. Note also that the first sum is over all L dimensions of the output. Sometimes, however, the RMSE is computed individually for each dimension of the output, in which case the first sum is absent. There are two primary approaches to training RNNs: gradient descent methods and reservoir computing methods. Gradient descent methods iteratively reduce the error on the training dataset by changing connection weights on the network [3]. However, methods of this type are difficult to use with RNNs (not impossible though, see Jaeger [7] for an example of an RNN backpropagation algorithm), because the network dynamics may jump between qualitatively distinct behaviors as a parameter is varied even a small amount (this phenomenon is known as bifurcation) [3]. Gradient descent is also computationally expensive, especially if a long signal needs to be generated by the RNN on each iteration of training [3]. Due to these limitations of the gradient descent method, a fundamentally different method of training RNNs, now generally referred to as Reservoir Computing, was proposed independently by Jaeger [5] and Maass et al. [10].

1.3 Reservoir Computing

In contrast to gradient descent methods of training an RNN, a reservoir computing (RC) approach does not adjust the weights of all connections in the network. Instead, it leaves the input, hidden layer, and feedback weights fixed, and only the output weights are computed. The hidden network layer then becomes a driven dynamical system, with the nodes acting as echo functions for the input signal [11]. The reservoir thus acts as a nonlinear expansion of the input signal [3]. These echo functions are essentially variations on the input/feedback signal. In this RNN method, the hidden network layer acts as a reservoir of signals, from which the

output of the network is computed. Thus, we construct the RC output y(t) as [7]:

y_i(t) = Σ_{j=1}^{N} [W_out]_{i,j} x_j(t)    (1.5)

where [W_out]_{i,j} are the elements of W_out, for each output dimension i = 1, ..., L. Since we are trying to minimize the RMSE for the network, we can then rewrite the training task as [7]:

E(y, ŷ) = (1/L) Σ_{i=1}^{L} [ (1/T) Σ_{n=1}^{T} ( Σ_{j=1}^{N} [W_out]_{i,j} x_j(n) − ŷ_i(n) )² ]^(1/2)    (1.6)

Training is then a simple linear regression task, since we just need to minimize:

Σ_{n=1}^{T} ( Σ_{j=1}^{N} [W_out]_{i,j} x_j(n) − ŷ_i(n) )²    (1.7)

for each output dimension i = 1, ..., L. Training with linear regression is much simpler than training the network using gradient descent, and can be implemented simply and efficiently in many programming languages.

The Reservoir

While the reservoir computing approach to RNNs greatly simplifies implementation and training of networks, it leaves open many questions about the reservoir itself. How should we structure the reservoir? How much echo of the input signals should the reservoir produce? How strongly should the input signals influence the reservoir? While these questions are undoubtedly fundamental to efficient and reliable reservoir computing, there is unfortunately no systematic way to answer them [6]. There are, however, some concepts that can help guide our thinking about reservoirs. First, let's specify some terms. Probably the most important term for dealing with the reservoir is the spectral radius of the reservoir. The spectral radius is simply the eigenvalue of the reservoir connection matrix W with the largest absolute value. In this thesis we denote the spectral radius of the reservoir as ρ(W) [3]. Another common parameter for specifying a reservoir is the density of the reservoir. This is the fraction of nonzero entries in the connection matrix W, and is denoted by d_W. We also have the number of neurons N. These three parameters tell us all we need to know about a reservoir, for the purposes of this thesis. Properly setting the spectral radius is essential for the reservoir to function as intended. This is because the spectral radius is closely tied to a concept known as the echo state property. A reservoir with the echo state property will have dynamics

that die out eventually. A more formal way to say this is that the current input u(t) and state of the reservoir x(t) will cease to have an impact on the future state of the reservoir x(t + Δt) as Δt → ∞ [3]. This condition is essential for reservoir function [3, 5, 11], and can almost always be guaranteed by selecting ρ < 1. However, it is possible to have a reservoir with the echo state property and ρ ≥ 1, as well as to have a reservoir with ρ < 1 that lacks the echo state property, but this is not usually the case [3]. A more thorough discussion of this property can be found in Jaeger [5]. Counterintuitively, the density of the reservoir is not especially important to reservoir function [9]. The effect of internal reservoir topology on the function and performance of the reservoir is also unclear [3]. However, having a reservoir with a relatively low density can be advantageous from a computational perspective, as it allows for efficient sparse matrix algebra, which can speed up computation times for large networks. In addition to the spectral radius and density parameters for the reservoir, we have similar parameters for the input matrix W_in. The density of the input weight matrix is given by d_in. Typically, d_in is relatively close to 1, since we want the input data to go to most of the neurons. Unlike for the reservoir, however, we aren't interested in the eigenvalue scaling of the input matrix, but rather just the strength of the inputs. We denote the input radius as σ_in. The input scaling controls the degree to which the current input determines the dynamics of the reservoir compared to past inputs. The higher the input scaling, the more the current input will dominate the dynamics in the reservoir [9]. Additionally, if we have neuron activation functions that have different regimes, such as the linear regime versus the nonlinear saturation regime present in the tanh activation function, the input scaling can push the activation function into either regime [12]. For the tanh activation function specifically, a smaller input radius will keep the neuron in the linear regime of the activation function, while a higher input scaling will push the neuron into the nonlinear regime. Since different tasks require different degrees of linearity versus nonlinearity (see Chapter 4), this parameter can greatly influence the function of an RC.
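To make the pieces above concrete, the sketch below steps a small reservoir through the discrete update of Eq. 1.2 (without a feedback term) and applies a linear readout as in Eq. 1.5. It is a minimal illustration rather than the implementation used in this thesis (that implementation is described in Chapter 2 and Appendix A); the sizes, driving signal, seed, and the randomly chosen W_out are placeholders.

    import numpy as np

    def reservoir_step(x, u, W, W_in, xi=1.0, f=np.tanh):
        """One hidden-layer update, x(t+1) = f(W_in u(t+1) + W x(t) + xi) (Eq. 1.2, no feedback)."""
        return f(W_in @ u + W @ x + xi)

    def readout(x, W_out):
        """Linear output layer, y(t) = W_out x(t) (Eq. 1.5)."""
        return W_out @ x

    rng = np.random.default_rng(0)
    N, K, L = 100, 1, 1
    W = rng.uniform(-1, 1, (N, N))
    W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # set spectral radius rho(W) = 0.9 < 1 (echo state property)
    W_in = rng.uniform(-0.25, 0.25, (N, K))          # small input radius keeps tanh near its linear regime
    W_out = rng.uniform(-1, 1, (L, N))               # placeholder; training (Chapter 2) is what actually sets W_out

    x = np.zeros(N)
    for t in range(200):
        u = np.array([np.sin(0.1 * t)])              # a simple driving signal
        x = reservoir_step(x, u, W, W_in)
    y = readout(x, W_out)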


Chapter 2: Constructing, Training, and Running a Reservoir Computer

This chapter will describe the specific RC setup used for this thesis. Here, we will cover network implementation, reservoir generation & parameters, and training algorithms.

2.1 Network Implementation

The RC setup used in this thesis was implemented in Python 3.6, based primarily on networks specified in Jaeger [7] and Lu et al. [13]. The RC itself was an instantiation of a custom Python class, and all matrices that defined the reservoir were attributes of that class object. As in the general RC overview, our version had input u(t) ∈ R^K, with W_in ∈ R^(N×K), and state vector x(t) ∈ R^N, with connection matrix W ∈ R^(N×N). The output vector was still y(t) ∈ R^L, but as opposed to the general case where the output matrix is connected to the reservoir and directly to the input, we limited our output matrix to drawing signals from the reservoir, so that our W_out ∈ R^(L×N). For some tasks, we also include a feedback connection matrix W_fb ∈ R^(N×L), but not always. The update equation for our network was based primarily on the update equation used by Lu et al. [13] and Jaeger [7], but was also influenced by several other authors [3, 9, 14]:

x(t + 1) = (1 − a) x(t) + a tanh(U(u(t + 1), x(t), y(t)) + ξ)    (2.1)

where U(u(t + 1), x(t), y(t)) = W_in u(t + 1) + W x(t) + W_fb y(t) + ν is the N-dimensional reservoir update vector at time step t and ξ is some constant bias term, generally unity. The constant bias term serves to push the activation function into the nonlinear or saturation regimes [13]. A small constant noise term ν is often added to the input during output weight training. The noise term helps to immunize the reservoir against unexpected changes in the input u(t), since it helps prevent overfitting to a specific dataset. See Jaeger [5] for a more in-depth explanation of the mechanics of noise immunization. We also have a, a leak term between 0 and 1. The leak term controls to what degree the reservoir dynamics are governed by the inputs to the reservoir or the previous state of the

reservoir. As a → 0, the previous state of the reservoir dominates the input term, and causes the reservoir to operate at slower timescales [13, 14]. See also Lukoševičius [9] for a slightly different take on a. We used the identity output function f_out, so our output equation was given by:

y(t + 1) = W_out x(t + 1).    (2.2)

Generating the Reservoir

We used NetworkX's [15] gnp_random_graph() function to create our reservoirs. This function creates an N × N adjacency matrix with a specified density. The adjacency matrix specifies whether two nodes are connected by assigning W_{i,j} = 1 if the i-th and j-th nodes are connected and W_{i,j} = 0 if they are not. However, since in general we want both positive and negative connection weights, we need to rescale the entries in the adjacency matrix. These connection weights could be either continuous or binary. For continuous connection weights, we rescale each nonzero entry in W by multiplying it by a random number drawn from the uniform distribution (−σ_W, σ_W), where typically σ_W = 1. For binary connection weights, for each nonzero entry in W, a random number p ∈ [0, 1) was drawn. If p < 0.5, the entry was assigned to be σ_W; if p ≥ 0.5, the entry was assigned to be −σ_W. All random matrix entries were generated using the Python Random library function uniform(), while p was generated using the NumPy library function random.random(). The matrix was then rescaled to the user-specified spectral radius ρ. To rescale the matrix, we follow the procedure specified by Jaeger [7], which is:

W = (ρ / |λ_max|) W_0    (2.3)

where W_0 is the initial unscaled connection matrix, λ_max is the greatest-magnitude eigenvalue of W_0, and W is the rescaled connection matrix, which now has its largest eigenvalue of magnitude ρ.

Generating W_in & W_fb

The connection matrices for input and feedback were generated slightly differently than the reservoir connection matrix. To create either W_in or W_fb, we first create an all-zero matrix of the correct size and specify a density d ∈ (0, 1). Then, for each (i, j) entry in the connection matrix, a random number p is drawn. If p < d, a random number from (−σ, σ) is placed in the entry. Otherwise, the entry is left as 0. Usually σ ∈ (0, 1), but different values of σ correspond to more or less driving of the reservoir by the input or feedback. Again, all random matrix entries were generated using the Python Random library function uniform(), while p was sampled from the NumPy library function random.random().
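The generation procedure just described can be sketched in a few lines of Python. This is a simplified stand-in for the thesis's class-based implementation in Appendix A: the function names are ours, only the continuous-weight option is shown, a single NumPy generator replaces the Random/NumPy mix described above, and the parameter choices at the bottom are purely illustrative.

    import networkx as nx
    import numpy as np

    def make_reservoir(N, density, rho, sigma_W=1.0, seed=None):
        """Random reservoir matrix: gnp_random_graph adjacency, continuous weights drawn
        from (-sigma_W, sigma_W), then rescaled to spectral radius rho as in Eq. 2.3."""
        rng = np.random.default_rng(seed)
        G = nx.gnp_random_graph(N, density, seed=seed, directed=True)
        W0 = nx.to_numpy_array(G)                        # adjacency matrix of 0s and 1s
        W0 *= rng.uniform(-sigma_W, sigma_W, (N, N))     # give each existing connection a random weight
        lam_max = np.max(np.abs(np.linalg.eigvals(W0)))
        return (rho / lam_max) * W0                      # W = (rho / |lambda_max|) W_0

    def make_input_matrix(N, K, density, sigma, seed=None):
        """Input (or feedback) matrix: each entry is nonzero with probability `density`
        and drawn uniformly from (-sigma, sigma)."""
        rng = np.random.default_rng(seed)
        mask = rng.random((N, K)) < density
        return np.where(mask, rng.uniform(-sigma, sigma, (N, K)), 0.0)

    # illustrative parameter choices
    W = make_reservoir(N=400, density=0.25, rho=0.9, seed=1)
    W_in = make_input_matrix(N=400, K=1, density=0.8, sigma=0.25, seed=2)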

2.2 Training Procedures

The training procedures described here are drawn primarily from Jaeger [7] and Lukoševičius [9]. The method used to train the reservoir is known as teacher forcing [7], and consists of the following steps:

1. Generate W, W_in, W_fb as described above, and prepare the training input u(t) and training target ŷ(t).
2. Set the state vector x(t) to some initial value, usually x(0) = 0.
3. Drive the reservoir using Eq. 2.1 for T update steps, feeding in the training input u(t) and, in the feedback term, using the training target ŷ(t) in place of the network output y(t) (this substitution is the teacher forcing). After some initial washout period ζ < T to remove transients from the reservoir, collect the state vectors into the τ × N state collecting matrix M, where τ = T − ζ is defined to be the training length of the reservoir.
4. While the reservoir is running, simultaneously collect the training target ŷ(t) into a matrix for times ζ < t ≤ T. This will give a τ × L teacher collecting matrix T.

Once we have the collecting matrices for the state vectors and training targets, we can frame our problem clearly as a linear regression problem. According to Eq. 1.7, we want to minimize ŷ(t) − y(t) for our training period ζ < t ≤ T. Ideally, we would want ŷ(t) − y(t) = 0. We can use the collecting matrices to write this clearly for all the times we're interested in:

ŷ(t) − y(t) = 0  ⟹  M W_out^T − T = 0  ⟹  M W_out^T = T    (2.4)

where W_out^T denotes the transpose of the output weight matrix W_out. We thus have a linear regression problem. If, in general, we had a square state collecting matrix M, we could simply invert M to get our output weights: W_out^T = M^(−1) T. However, usually the training length τ is longer than the number of nodes N, so M is not in general a square matrix. We thus need alternative methods of solving our linear regression problem.

The Moore-Penrose Pseudoinverse

One relatively straightforward way to solve the problem of the non-square state matrix M in Eq. 2.4 would be to use a generalized inverse that is defined for non-square matrices. Jaeger [7] suggests the Moore-Penrose pseudoinverse, and this is echoed in other literature [9]. We will denote the Moore-Penrose pseudoinverse of a matrix

A ∈ R^(m×n) as A⁺ ∈ R^(n×m). It can be shown [16] that if AA⁺b = b, the pseudoinverse solves a linear system of the form A z = b with a solution given by:

z = A⁺ b + (I − A⁺A) k    (2.5)

for A ∈ R^(m×n), z ∈ R^n, b ∈ R^m, and arbitrary k ∈ R^n. Note that I is the identity matrix. When A is a square nonsingular matrix, A⁺ = A^(−1), so A⁺A = A^(−1)A = I, and thus the arbitrary component of the solution vanishes [16]. To solve the linear system in Eq. 2.4, we can extend the solution for a vector z ∈ R^n given by Eq. 2.5 to our connection matrix W_out ∈ R^(L×N) by noting that we could simply solve each column of the system separately, that is, we could solve

[W_out^T]_{·,i} = M⁺ T_{·,i}

independently for each column i = 1, ..., L and recombine the columns to form W_out^T. We have set the arbitrary component of the pseudoinverse solution to be zero for simplicity, and thus the Moore-Penrose pseudoinverse solution to the linear system is:

W_out^T = M⁺ T.    (2.6)

However, Lukoševičius [9] notes that while the Moore-Penrose pseudoinverse is numerically stable and effective when the linear system being solved is overdetermined, as in our case where 1 + N ≪ τ, it is memory intensive. This drawback brings us to the other commonly used method for solving the W_out linear system, ridge regression.

Ridge Regression

For the training and state matrices we have above, the ridge regression (also known as Tikhonov regularization) method of calculating the output weights of the reservoir is given by [3]:

W_out = T^T M (M^T M + βI)^(−1)    (2.7)

where β is a constant, I is the N × N identity matrix, and all the other matrices are as before.¹

¹ A particularly astute reader might notice that this formulation is slightly different than that presented in many reservoir computing manuals. The more common formulation of ridge regression is given by [3]: W_out = Y X^T (X X^T + βI)^(−1), where Y ∈ R^(L×τ) and X ∈ R^(N×τ) are the teacher matrix and state collecting matrix, respectively. These are just the transposes of the teacher and state collecting matrices used in this thesis, which follow the definition given by Jaeger [7].
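Both solutions of Eq. 2.4 are essentially one-liners with NumPy. The sketch below is illustrative rather than the thesis's own code (which lives in Appendix A); the function names and the random M and T used to exercise them are placeholders. For β → 0 and a full-column-rank M the two solutions coincide; the regularization only matters when M is noisy or nearly rank-deficient.

    import numpy as np

    def train_pinv(M, T):
        """Moore-Penrose solution of M W_out^T = T (Eq. 2.6): W_out^T = M+ T."""
        return (np.linalg.pinv(M) @ T).T            # returned with shape (L, N), i.e. W_out itself

    def train_ridge(M, T, beta=1e-6):
        """Ridge regression (Eq. 2.7): W_out = T^T M (M^T M + beta I)^(-1)."""
        N = M.shape[1]
        return T.T @ M @ np.linalg.inv(M.T @ M + beta * np.eye(N))

    # Illustrative shapes: tau = 1000 collected states of an N = 300 node reservoir, L = 2 outputs.
    tau, N, L = 1000, 300, 2
    M = np.random.randn(tau, N)                     # state collecting matrix (tau x N)
    T = np.random.randn(tau, L)                     # teacher collecting matrix (tau x L)
    W_out_pinv = train_pinv(M, T)
    W_out_ridge = train_ridge(M, T)                 # nearly identical here; beta matters for ill-conditioned M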

Ridge regression is considered the preferred method for calculating the output weights of the reservoir, for a couple of reasons. One, it is more efficient to compute than the pseudoinverse. We can see this if we note that it is faster to invert (M^T M + βI) ∈ R^(N×N) than to find the pseudoinverse of M ∈ R^(τ×N), since typically τ is on the order of thousands and N is on the order of a few hundred. The other, and arguably more important, benefit of the ridge regression training method is the β regularization. This constant offset term in the inverted matrix serves to decrease the size of the entries of W_out. This makes the network less sensitive to noise, and reduces any issue from overfitting [3]. See Lukoševičius [9] for an explanation of why ridge regression penalizes large W_out entries.

A Toy Example

To make the network training procedure a little more clear, let's look at a toy example. Suppose we had a sine wave, and wanted to train a reservoir computer to output a sine wave of twice the initial amplitude. Not the hardest thing to do, certainly, but it allows for a simple demonstration of the working principles of training the network. Since this task is simple, a reservoir with a few neurons should do the trick. Let's make our network with K = 1 input, N = 3 neurons, and L = 1 output. First we generate a 3 × 3 connection matrix W with parameters d_W = 0.5 and ρ = 0.9, and then generate our 3 × 1 input matrix W_in with σ_in = 1. The training input is the function f(t) = 0.5 sin(8πt) discretized in 300 time steps of dt = 0.01, while the training output is given by g(t) = sin(8πt) discretized in the same manner. Using our notation from above, we have the time series of training

inputs given by u(t) = {f(0), f(0.01), f(0.02), ..., f(2.99)} and the time series of training outputs given by ŷ(t) = {g(0), g(0.01), g(0.02), ..., g(2.99)}. Now, we drive the reservoir with the training input time series, and, after a washout period of 50 time steps to remove any transients, collect the internal reservoir states (displacements of each neuron) for each time step into the 250 × 3 state collecting matrix M = [x(0), x(1), ..., x(249)]^T, where the first column of M is the displacement of neuron 1 during the training period, the second column is the displacement of neuron 2, and the third column is the displacement of neuron 3. The displacement of each neuron is plotted in Fig. 2.1.

Figure 2.1: The displacements of the toy network neurons for the τ = 250 time step training period.

The training output is collected into the 250 × 1 teacher collecting matrix T = [ŷ(0), ŷ(1), ..., ŷ(249)]^T. We can now compute the output weights using either the Moore-Penrose pseudoinverse or ridge regression to solve the problem given in Eq. 2.4. In this case, we used the pseudoinverse to get a 1 × 3 output weight matrix W_out. The network is now ready to be used! We supplied 100 time steps of the function f(t) = 0.5 sin(8πt), discretized with time step dt = 0.01, and compared the generated reservoir output to the desired function g(t) = sin(8πt). This is shown in Fig. 2.2.

Figure 2.2: The reservoir computer output generated after training compared to the actual desired output.

Note that due to the low number of neurons, some instantiations of this network with different connection matrices fail to accurately produce the desired output. However, a 3-neuron network is certainly capable of performing the desired task, as we can see.
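The whole toy example fits in a short script. The sketch below follows the procedure described above under a few stated simplifications: the 3-neuron W is dense rather than d_W = 0.5, the update uses Eq. 2.1 with a = 1, ξ = 0, ν = 0 and no feedback, and the random seed (and therefore the resulting weights) is of course not the one used in the thesis.

    import numpy as np

    rng = np.random.default_rng(42)

    # --- tiny 3-neuron reservoir (dense here for simplicity), rescaled to rho = 0.9 ---
    N, K = 3, 1
    W = rng.uniform(-1, 1, (N, N))
    W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))
    W_in = rng.uniform(-1, 1, (N, K))                 # sigma_in = 1

    # --- training data: f(t) = 0.5 sin(8 pi t) in, g(t) = sin(8 pi t) out ---
    dt, steps, washout = 0.01, 300, 50
    t = np.arange(steps) * dt
    u_train = 0.5 * np.sin(8 * np.pi * t)
    y_train = np.sin(8 * np.pi * t)

    # --- drive the reservoir and collect states/targets after the washout ---
    x = np.zeros(N)
    M_rows, T_rows = [], []
    for n in range(steps):
        x = np.tanh(W_in[:, 0] * u_train[n] + W @ x)  # Eq. 2.1 with a = 1, xi = 0, no feedback
        if n >= washout:
            M_rows.append(x.copy())
            T_rows.append([y_train[n]])
    M, T = np.array(M_rows), np.array(T_rows)         # M is 250 x 3, T is 250 x 1

    # --- output weights from the pseudoinverse solution of M W_out^T = T (Eq. 2.6) ---
    W_out = (np.linalg.pinv(M) @ T).T                 # shape (1, 3)

    # --- run the trained network on 100 more input steps and read out the doubled sine ---
    x = np.zeros(N)
    y_generated = []
    for n in range(100):
        x = np.tanh(W_in[:, 0] * u_train[n] + W @ x)
        y_generated.append((W_out @ x).item())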

Chapter 3: Chaos & Attractors

3.1 Background

By its very nature, chaos is a tricky thing to pin down. In his influential book on chaos, Strogatz [17] defines chaos as "aperiodic long-term behavior in a deterministic system that exhibits sensitive dependence on initial conditions," but notes that this is not a universally accepted definition. However, it will certainly work for the purposes of this thesis. Let's start by breaking this definition into simple terms. Aperiodic long-term behavior tells us that the system does not settle into a repeating pattern over time, but also does not tend to a fixed point, such as dying to 0 or going off to infinity. The requirement that the system be deterministic tells us that the behavior of the system is not governed by random chance. That is, the system behaves aperiodically due to a fixed set of rules, not because we are drawing a random number and adding it to our system. Lastly, if our system exhibits sensitive dependence on initial conditions, two systems that start arbitrarily close to one another will have divergent behavior as the systems evolve. So, while this definition is not incredibly rigorous, it gives us a good intuitive grasp of what chaos entails as a physical concept.

Figure 3.1: The trajectory of a simple 3-dimensional system. x, y, and z are given in Eq. 3.1.

The most straightforward way we could look at the evolution of a system would

be to plot each dimension of the system against time. However, another way we could visualize the evolution of a system would be to plot its trajectory, which is to say we plot one dimension of the system against one or more of the other dimensions. A simple example of this type of plot is shown in Fig. 3.1. Here, we have the decidedly non-chaotic system:

x = sin(t)
y = cos(t)    (3.1)
z = 0.1 t

From this plot, we can see the value of each coordinate as a function of the other two coordinates. Plotting the coordinates x, y, and z in this manner allows us to see patterns in the evolution of the system. For example, we can see that the system repeatedly passes through the same values of x and y, but only ever passes through a value of z once. While this is a somewhat trivial observation for this system, behaviors such as this, or more complicated behaviors, may not be immediately obvious in a more interesting system, but can be revealed by plotting the trajectory of the system. One non-obvious concept that arises from these trajectories is that of an attractor. Much like the idea of chaos, an attractor is also a contentious topic to define. However, Strogatz [17] again provides an intuitive definition, along with a more rigorous one. We can think of an attractor as a minimal set of points to which all neighboring trajectories eventually converge; minimal here means that the set contains no smaller sets that are in turn converged to. A helpful metaphor (at least for the author of this thesis, but draw your own conclusions) is to think of rain falling in a hilly landscape. Rain that falls on one side of a hill might end up in one lake, while rain that falls on the other side of the hill will flow into the next lake over. Here, the lakes are the attractors, and much like the basins of a watershed, we call all the initial conditions in phase space that end up in the attractor the basin of attraction. We would also note that the small creeks in the hillside are not attractors, as they in turn converge to a lake, which all the water flows to over time. In addition to the general definition of an attractor, we can also define a strange attractor, also known as a chaotic attractor. These are attractors that are fractal [17]. Here we might not even have lakes, but a complicated self-similar collection of puddles. Let's take a look at a couple of chaotic systems that will be used in this thesis to get a sense of some of their characteristics, and see what their attractors look like.

3.2 The Lorenz System

The Lorenz attractor arises in a nonlinear system first written down by Edward Lorenz in 1963 to describe dissipative nonperiodic flow in a hydrodynamic system [18]. Lorenz motivates the system as an attempt to describe a fluid with some sort of external forcing and thermal and viscous dissipation. The canonical form of the

Lorenz system is given by:

ẋ = a(y − x)
ẏ = bx − y − xz    (3.2)
ż = xy − cz

where a, b, c are parameters and ẋ denotes the time derivative of x. Different values of the parameters will cause the Lorenz system to display periodic or chaotic behavior. Except where otherwise noted, we used values of a = 10, b = 28, and c = 8/3, which places the attractor in the chaotic regime [13]. Fig. 3.2 shows the attractor for the Lorenz system, as well as the x, y, and z components of the attractor plotted against time.

Figure 3.2: The Lorenz Attractor. (a) The characteristic butterfly wings of the Lorenz attractor in 3 dimensions. (b) The x, y, and z components of the Lorenz attractor vs. time step. Both figures were generated using SciPy's odeint routine.

3.3 The Rössler System

The Rössler equations were first described in the 1970s by Otto Rössler as a system explicitly designed to produce continuous temporal chaos [19]. While Rössler wrote down several systems that would be able to create continuous chaos, the most commonly used is the system given by [13]:

ẋ = −y − z
ẏ = x + ay    (3.3)
ż = b + z(x − c)

where the parameters a, b, c determine whether the system evolves periodically, chaotically, or converges to a static solution. Again ẋ denotes the time derivative of x. We used parameter values of a = 0.5, b = 2.0, and c = 4.0, which makes the system evolve chaotically [13]. The attractor for this parameter choice is shown in Fig. 3.3, along with the individual x, y, and z components plotted against time.

Figure 3.3: The Rössler Attractor. (a) The characteristic folded shape of the Rössler attractor in 3 dimensions. (b) The x, y, and z components of the Rössler attractor vs. time step. Both figures were generated using SciPy's odeint routine.
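For reference, both systems can be integrated with SciPy's odeint, which is the routine used to generate Figs. 3.2 and 3.3; the time grid and initial condition below are illustrative placeholders, not the values used for those figures.

    import numpy as np
    from scipy.integrate import odeint

    def lorenz(state, t, a=10.0, b=28.0, c=8.0 / 3.0):
        """Lorenz system, Eq. 3.2 (chaotic for these default parameters)."""
        x, y, z = state
        return [a * (y - x), b * x - y - x * z, x * y - c * z]

    def rossler(state, t, a=0.5, b=2.0, c=4.0):
        """Rossler system, Eq. 3.3 (chaotic for these default parameters)."""
        x, y, z = state
        return [-y - z, x + a * y, b + z * (x - c)]

    t = np.arange(0.0, 100.0, 0.01)                   # illustrative time grid
    lorenz_xyz = odeint(lorenz, [1.0, 1.0, 1.0], t)   # columns are x(t), y(t), z(t)
    rossler_xyz = odeint(rossler, [1.0, 1.0, 1.0], t)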

3.4 Observability

One thing we can think about when we look at these systems is how to determine the state of the system based on a subset of the variables. For example, if we knew the x-component of the Rössler system over some finite amount of time along with Eq. 3.3, could we infer y and z? If y and z were both dependent only on x, this would be simple enough to do. However, the value of y is dependent on both x and y, and the value of z is dependent on both x and z. So it would seem that we're stuck, since we know neither y nor z. But, we make one more observation: the value of x is dependent on y and z. Perhaps we could use this knowledge to be able to reconstruct y and z? This is the question of observability: when do we have enough information about a system to be able to reconstruct all the variables in the system based on just some subset of them? This question is not unique to chaotic systems, and is actually fully answered for linear, non-chaotic systems [20]. Suppose we had a linear, time-independent, d-dimensional system given by:

X_{t+1} = A X_t    (3.4)

where X_t is the d-dimensional column vector containing the values of the system variables at time t and A is a d × d constant matrix. We now make some observations of our system at time t:

O_t = G^T X_t    (3.5)

Here G is a d-dimensional constant column vector (note G^T is a row vector), and O_t is the scalar observed system output. It can be shown that we can reconstruct the state of the system using a time series of our observer measurement O_t [20]:

X̂_{t+1} = A X̂_t + C [O_{t+1} − G^T A X̂_t]    (3.6)

where X̂_t is the inferred state of the system at time t and C is the control vector. We want to choose the control vector to minimize the error between X_{t+1} and X̂_{t+1}. The error is given by [20]:

X_{t+1} − X̂_{t+1} = [A − C G^T A] (X_t − X̂_t).    (3.7)

To get the evolution of the error through time, we repeatedly apply the matrix [A − C G^T A] to the previous error. So, if the eigenvalues of this matrix are chosen to be less than 1 in magnitude, the error will decrease to 0 as t increases. The requirement that the eigenvalues be less than 1 tells us how to set our control vector C, since we know both A and G. This is not always possible though, depending on A and G. For example, suppose we had

A = [[1, 0], [0, 1]]

and only access to the second of the two decoupled variables, i.e.

G = [0, 1]^T,

as our given matrices. Let our control vector be C = [c₁, c₂]^T. Then the repeatedly applied matrix [A − C G^T A] would be given by:

[A − C G^T A] = [[1, −c₁], [0, 1 − c₂]],

which has eigenvalues λ₁ = 1 and λ₂ = 1 − c₂. So, in this case, we are unable to choose our eigenvalues such that the error will decrease to zero, since the first eigenvalue is fixed. So even in the simple linear case, we are not guaranteed observability. Now suppose we have a nonlinear, potentially chaotic, system, such as the Lorenz or Rössler system. In this case, our system is given by [20]:

X_{t+1} = M(X_t),    (3.8)

where X_t is the d-dimensional state of the system at time t and M(·) is a nonlinear function. We make an observation of our nonlinear system, and obtain a scalar measurement O_t:

O_t = g(X_t),    (3.9)

where g(·) is a nonlinear function. Now, if we want to reconstruct the internal state of our nonlinear system, we have [20]:

X̂_{t+1} = M(X̂_t) + C_t [O_{t+1} − g(M(X̂_t))],    (3.10)

where, as before, X̂_t is our inferred state vector at time t and C_t is a time-dependent d-dimensional control column vector. The error can now be written as:

X_{t+1} − X̂_{t+1} = M(X_t) − M(X̂_t) − C_t [g(M(X_t)) − g(M(X̂_t))].    (3.11)

If we now linearize the above equation about X̂_t to get this in the form of a matrix equation analogous to Eq. 3.7, we get:

X_{t+1} − X̂_{t+1} = [DM(X̂_t) − C_t Dg(M(X̂_t)) DM(X̂_t)] (X_t − X̂_t),    (3.12)

where DM is a d × d matrix of the derivatives of M and Dg is a d-dimensional row vector of the derivatives of g. We note that, in contrast to the linear case, the error update matrix [DM(X̂_t) − C_t Dg(M(X̂_t)) DM(X̂_t)] is now time dependent. So while we can choose the individual control vectors C_t at each time step to make the eigenvalues of the update matrix less than 1 in magnitude, this does not guarantee that the error goes to 0 as t → ∞. While in some cases it is possible to create an algorithm that will provide a set of C_t such that the error converges, it is not guaranteed, and likely to be very complicated [20].
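As a quick numerical check of the linear case, the sketch below builds a small observable system (an illustrative A, G, and C of our own choosing, not one from the thesis), verifies that the eigenvalues of [A − C G^T A] have magnitude less than 1, and runs the observer of Eq. 3.6 to watch the estimation error shrink.

    import numpy as np

    A = np.array([[0.9, 0.2],
                  [-0.2, 0.9]])      # system matrix, X_{t+1} = A X_t
    G = np.array([0.0, 1.0])         # we can only measure the second component, O_t = G^T X_t
    C = np.array([0.0, 1.0])         # control vector; here eig(A - C G^T A) = {0.9, 0}, both < 1

    err_matrix = A - np.outer(C, G @ A)
    print("error-map eigenvalues:", np.linalg.eigvals(err_matrix))

    X = np.array([1.0, -0.5])        # true (hidden) state
    X_hat = np.zeros(2)              # observer's estimate
    for t in range(30):
        X_new = A @ X                                    # true dynamics
        O = G @ X_new                                    # scalar measurement O_{t+1}
        X_hat = A @ X_hat + C * (O - G @ (A @ X_hat))    # observer update, Eq. 3.6
        X = X_new
    print("final estimation error:", np.linalg.norm(X - X_hat))  # shrinks because the eigenvalues are < 1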

Chapter 4: Reservoir Computers as Observers: Linear vs. Nonlinear Reservoirs

Reservoir computers have successfully been used as observers for nonlinear systems. More specifically, Lu et al. [13] used an RC observer scheme to infer the y and z components of the Rössler system while supplying only the x component to the reservoir (this also worked for the other permutations, where the reservoir was given only y or z). In the same paper, Lu et al. [13] were also able to perform a similar task for the Lorenz system, although the reservoir was unable to function as an observer when supplied with the z coordinate of the Lorenz attractor. While this paper demonstrates the feasibility of using an RC as an observer, it does not go far into the details of what the reservoir does that allows it to function as an observer.

4.1 Memory vs. Nonlinearity

To make an inference about the current state of a particular dataset, we would imagine that our observer needs information about past states of the system. A more succinct way to say this is that the reservoir requires memory of past states. While the past states of a system might tell us a lot about the current state of the system, we would also imagine that just knowing about the past states of the system is not sufficient to make an inference about the current state of the system, for any system worth looking at. Our observer thus requires some sort of computational power beyond the ability to mix together old inputs. We refer to this computational power as nonlinearity, for reasons that will become clear. These two properties, memory and nonlinearity, are integral to the ability of the reservoir to perform any inference or prediction task. A nice way to think about memory is to think about the activation functions of the neurons in the reservoir. We could make a reservoir with exclusively linear dynamics if we gave each neuron a linear activation function, such as f(x) = Cx for some constant C. Now suppose we input a simple signal into the reservoir, such as u(t) = sin(ωt). Since each individual neuron can only amplify the input signal by some constant factor, if we looked at the internal state of the reservoir, we would

see only linear combinations of the input signal, all at the input frequency ω [12]. Given enough neurons, we could in theory add up enough of the linear combinations of the input signal to reconstruct the original signal. This is how reservoir memory functions: it is a linear process, since linear transformations of the input signal can be undone later to recover the original signal. We can also look at the effects of a nonlinear activation function on our input signal u(t) = sin(ωt). In this case, let the activation function be given by the nonlinear function f(x) = tanh(x). Since this activation function can perform a nonlinear transformation of the input signal, the signals in the reservoir are not all at the input frequency ω, but in a range of different frequencies, or even in a steady state [12]. We could now in theory add together some internal signals in the reservoir to get some function at a different frequency, such as (uncreatively) y(t) = sin(2ωt). This computational ability is what is referred to as nonlinearity, since it requires nonlinear transformations. Unfortunately there is a trade-off between memory and nonlinearity. In a 2017 paper, Inubushi and Yoshimura [21] prove that any nonlinearity in a reservoir inherently reduces the linear memory of the reservoir. They circumvent this problem by introducing a reservoir that includes both neurons with linear activation functions and neurons with nonlinear activation functions. While this is a functional solution, it does not provide any information about when linear memory or nonlinear computation is necessary for observer inference by a reservoir computer. To work towards this end, in this chapter we use three different activation functions to look at the inference performance of linear, linear/nonlinear (quadratic), and strictly nonlinear reservoirs. The linear activation function is the first-order Taylor expansion of tanh(x) about x = 1, the linear/nonlinear activation function is the second-order Taylor expansion of tanh(x) about x = 1, and the strictly nonlinear activation function is the second-order Taylor expansion of tanh(x) about x = 1, but with the term linear in x removed. We expand about x = 1 for each activation function to be consistent with the activation function used by Lu et al. [13]. If we recall the update equation for the network, given in Eq. 2.1:

x(t + 1) = (1 − a) x(t) + a tanh(U(u(t + 1), x(t), y(t)) + ξ)

we can see that the argument of the tanh function is the reservoir update vector U(u(t + 1), x(t), y(t)) plus a constant offset ξ. Lu et al. [13] use ξ = 1 to push the tanh update function into the nonlinear regime, which they found improved the performance of the RC observer. Thus, for the linear/nonlinear and strictly nonlinear activation functions, we need to Taylor expand about x = 1 to be in the same nonlinear regime, since tanh(x) is roughly linear about x = 0. The linear activation function is expanded about x = 1 to be consistent with the other two activation functions, though in theory it doesn't matter, since there is no nonlinear regime in this case.
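For concreteness, here is a sketch of the three activation functions in Python. They are written as polynomials in (x − 1), which is our reconstruction of Eqs. 4.1–4.3 below from the Taylor coefficients of tanh at x = 1; the function names are ours. Near x = 1 the quadratic version tracks tanh closely, while the linear version keeps only the memory-friendly part and the strictly nonlinear version keeps only the curvature.

    import numpy as np

    t1 = np.tanh(1.0)
    c1 = 1.0 - t1 ** 2      # first-order Taylor coefficient of tanh at x = 1
    c2 = t1 ** 3 - t1       # second-order Taylor coefficient of tanh at x = 1

    def f_linear(x):
        """Linear activation: first-order expansion of tanh about x = 1 (Eq. 4.1)."""
        return t1 + c1 * (x - 1.0)

    def f_quadratic(x):
        """Linear/nonlinear activation: second-order expansion of tanh about x = 1 (Eq. 4.2)."""
        return t1 + c1 * (x - 1.0) + c2 * (x - 1.0) ** 2

    def f_nonlinear(x):
        """Strictly nonlinear activation: Eq. 4.2 with the linear term removed (Eq. 4.3)."""
        return t1 + c2 * (x - 1.0) ** 2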

4.2 Linear Reservoir

To linearize the reservoir, we use a linear activation function consisting of the first-order Taylor expansion of tanh(x) about x = 1:

f(x) = tanh(1) + (1 − tanh²(1))(x − 1).    (4.1)

For our experiments, we trained the reservoir as described in previous chapters, and used the Moore-Penrose pseudoinverse to generate output weights. The experimental data was generated by using either the x̃, ỹ, or z̃ component of the Lorenz or Rössler system to drive the reservoir, and then reading out the other two components of the attractor. (The x̃, ỹ, and z̃ components are merely unit-scaled versions of x, y, and z [13]; x̃ is defined by x̃(t) = (x(t) − ⟨x(t)⟩) / ⟨[x(t) − ⟨x(t)⟩]²⟩^(1/2), where ⟨·⟩ denotes the time average, and ỹ and z̃ are defined similarly.) We then compared the RC-generated estimated components to the true attractor components created using SciPy's odeint() function. The network parameters used for these experiments are given in Table 4.1. Note that for the linear activation function, the bias term does not matter, since there is no nonlinear saturation regime.

Table 4.1: Linear activation function network parameters.

Parameter                 Value
Reservoir size (N)        400
Spectral radius (ρ)       1.0
Reservoir density (d_W)   0.25
Input density (d_in)      0.8
Input radius (σ_in)       0.25
Leak rate (a)             1.0
Noise (ν)                 0
Bias (ξ)                  0
Training length (τ)       400
Washout time (ζ)

Rössler System

We first examined the Rössler system using our linearized network. We used values of a = 0.5, b = 2.0, and c = 4.0, which is in the chaotic regime [13]. We drive the system with each of the three components of the attractor, x̃, ỹ, and z̃, and train the reservoir to generate the other two components. This was done 100 times for each component of the attractor, to minimize the effect of any individual reservoir characteristics. The

RMSE, along with standard deviation, for each generated component is shown in Table 4.2. Note that the standard deviation was computed from 100 instantiations of the reservoir. Example traces for the components generated by driving the RC with each of x̃, ỹ, and z̃ are shown in Fig. 4.1. These were generated using the same network parameters as were used to calculate the average RMSE.

Figure 4.1: Linearly generated Rössler system components. The solid blue line is the trace generated by the RC, while the dashed red line is the actual component of the attractor. (a) The ỹ and z̃ components generated by x̃. (b) The x̃ and z̃ components generated by ỹ. Note that the solid blue line is under the dashed red line, but can't be seen since the RC trace and actual components are close together. (c) The x̃ and ỹ components generated by z̃.

Table 4.2: RMSE for linear RC generated Rössler attractor components.

Driving component    x̃ RMSE              ỹ RMSE    z̃ RMSE
x̃                    N/A
ỹ                    (2 ± 1.6) × 10⁻⁴     N/A       (8.4 ± 0.2) × 10⁻⁴
z̃                                                   N/A

It can be seen clearly from both the average RMSE and the example traces that both the x̃ and z̃ components of the Rössler attractor can be generated to high accuracy by driving the linear reservoir with ỹ. To some extent, a linear reservoir can also generate ỹ and z̃ when driven by x̃, but not nearly as accurately, and it cannot generate x̃ or ỹ from z̃. This is perhaps surprising, especially due to the qualitative

similarity between x̃ and ỹ. Evidently, though, we can generate the Rössler attractor very accurately using system memory and linear transformations when we supply only ỹ, whereas we need some level of nonlinearity to generate the attractor from x̃ and z̃.

Lorenz System

We also looked at the effects of a linearized RC reservoir on generating Lorenz attractor components. Here, we generated the Lorenz attractor with the canonical chaotic parameters a = 10, b = 28, and c = 8/3 [13]. As for the Rössler system, we drive the system with one of the three components of the Lorenz attractor, x̃, ỹ, and z̃, and generate the other two components of the attractor using the driven reservoir. This was done for 100 instantiations of the linear RC for each of x̃, ỹ, and z̃. The average RMSE, with standard deviation, is shown in Table 4.3 for each combination of components. Fig. 4.2 shows example traces for each of the combinations of components generated by the linear reservoir. The example traces were generated with the same network parameters as were used for the RMSE calculation.

Table 4.3: RMSE for linear RC generated Lorenz attractor components.

Driving component    x̃ RMSE    ỹ RMSE    z̃ RMSE
x̃                    N/A
ỹ                              N/A
z̃                                        N/A

As we can see from the average RMSEs and the example traces, the linear reservoir is able to generate x̃ from ỹ to fairly high accuracy. It can also mostly generate ỹ when driven by x̃, although not as well, as there is some overshoot, where the generated signal evidently needs some higher-order correction that is not present in the linear reservoir. Both of these results make intuitive sense to some degree, as the x̃ and ỹ components resemble one another very closely. Conversely, neither the x̃-driven reservoir nor the ỹ-driven reservoir can generate z̃ in any capacity, nor can the z̃-driven reservoir generate x̃ or ỹ. Evidently the relation between z̃ and x̃ & ỹ is nonlinear, whereas some linear map allows x̃ to be generated from ỹ, and most likely ỹ from x̃ as well, though this relation may benefit from a nonlinear transformation for better inference.

4.3 Comparing Activation Functions

The linear activation function inference task gave some insight about which components of the Lorenz and Rössler systems were linearly related. However, it did not tell us when nonlinearity is necessary, or when both nonlinearity and linear memory are needed for inference. To give some more intuition into what activation function

properties are necessary for each dataset, we looked at two more activation functions: a quadratic activation function, which has both a linear and a nonlinear component, and a strictly nonlinear activation function, which has only a quadratic component.

Figure 4.2: Linearly generated Lorenz system components. The solid blue line is the trace generated by the RC, while the dashed red line is the actual component of the attractor. (a) The ỹ and z components generated by x. (b) The x and z components generated by ỹ. Note that in the top panel the solid blue line is under the dashed line but can't be seen, since the RC-generated and actual components are very close. (c) The x and ỹ components generated by z.

4.3.1 Quadratic Reservoir

We used the second-order Taylor expansion of tanh(x) about x = 1 as our quadratic activation function:

f(x) = tanh(1) + (1 − tanh²(1))(x − 1) + (tanh³(1) − tanh(1))(x − 1)².    (4.2)

In this case, expansion about x = 1 approximates tanh(x) in the nonlinear regime, since tanh(x) is approximately linear about x = 0.

4.3.2 Strictly Linear or Strictly Nonlinear Reservoir

Equation 4.2 has both a linear and a nonlinear component, and thus this activation function is some hybrid of linear and nonlinear function, much like tanh(x), that

is more or less linear depending on the magnitude of the function input. However, we can break the quadratic activation function apart into its linear and nonlinear components to see how each term affects the predictive power of the RC. For our linear function, we used the linear activation function given by Eq. 4.1, while we used the second-order Taylor expansion of tanh(x) about x = 1 with the linear term removed as our purely nonlinear activation function:

f(x) = tanh(1) + (tanh³(1) − tanh(1))(x − 1)².    (4.3)

4.3.3 Results

To facilitate comparison between the linear and purely nonlinear reservoirs, we had to select network parameters that prevented the output generated by the purely nonlinear activation function from growing unboundedly when driven. For the Rössler system, we used the same parameters as we used for the linear activation function, with the exception that now we used ξ = 1. However, the Lorenz system tended to produce large output weights when using Moore-Penrose, so we both shrank the network size from 400 to 250 nodes and used ridge regression (regularization parameter β) for the output weights. Full network parameters are given in Table 4.6.

Tables 4.4 and 4.5 give the RMSE for the inferred components of the Lorenz system for the x-driven and ỹ-driven reservoirs, respectively. The error given is the standard deviation calculated over 100 instantiations of the particular reservoir.

Table 4.4: RMSE for inferred ỹ, z generated by x-driven RC, Lorenz system.

Activation Function | ỹ RMSE | z RMSE
Quadratic           |        |
Linear              |        |
Purely Nonlinear    |        |

Table 4.5: RMSE for inferred x, z generated by ỹ-driven RC, Lorenz system.

Activation Function | x RMSE      | z RMSE
Quadratic           | (6.9 ± 0.9) |
Linear              |             |
Purely Nonlinear    |             |

For the x-driven reservoir, we note that the predictive abilities of the quadratic reservoir and the purely nonlinear reservoir are roughly the same, based on the value of the RMSE for the inferred ỹ and z components. The RMSE for the inferred components generated by the linear x-driven reservoir is two or three orders of magnitude worse.

Table 4.6: Network parameters for the activation function comparison experiment.

Parameter                | Value
Reservoir size (N)       | 250
Spectral radius (ρ)      | 0.9
Reservoir density (d_W)  | 0.25
Input density (d_in)     | 0.8
Input radius (σ_in)      | 0.5
Leak rate (a)            | 1.0
Noise (ν)                | 0.01
Bias (ξ)                 | 1
Training length (τ)      | 400
Washout time (ζ)         | 100

Based on this observation, we note that the network does not require linear memory to generate ỹ or z from x. For the x component inferred from the ỹ-driven reservoir, we can see that the quadratic activation function outperforms both the linear and the purely nonlinear activation function reservoirs, the former by two orders of magnitude and the latter by a single order of magnitude. This indicates that the reservoir requires both linear memory and nonlinearity to predict x from ỹ. However, neither the x-driven nor the ỹ-driven linear reservoir can generate z at all (RMSE > 1). To generate the z component from ỹ, memory does not seem to be essential, whereas nonlinearity is crucial.

Tables 4.7 and 4.8 give the RMSE for the inferred Rössler system components for the x-driven and ỹ-driven RCs.

Table 4.7: RMSE for inferred ỹ, z generated by x-driven RC, Rössler system.

Activation Function | ỹ RMSE | z RMSE
Quadratic           |        |
Linear              |        |
Purely Nonlinear    |        |

Table 4.8: RMSE for inferred x, z generated by ỹ-driven RC, Rössler system.

Activation Function | x RMSE           | z RMSE
Quadratic           |                  |
Linear              | (2 ± 1.5) × 10⁻⁴ | (8.4 ± 0.2) × 10⁻⁴
Purely Nonlinear    |                  | ± 0.01

The error given is the standard deviation calculated

over 100 instantiations of the particular reservoir. For the x-driven RC, the quadratic and purely nonlinear RCs perform equally well for both the ỹ and z components. The RMSE for the linear network is two orders of magnitude greater, indicating that the RC requires nonlinearity, but not linear memory, to infer ỹ and z from x. The ỹ-driven reservoir yielded different results. In this case, the linear RC outperformed both the quadratic and the purely nonlinear reservoirs, with the RMSE for the inferred x and z smaller than the quadratic or purely nonlinear RMSE by an order of magnitude. Here the nonlinearity seems to be negatively affecting the predictive capacity of the network. However, it is unclear whether this is merely overshoot due to the lack of higher-order terms in the quadratic and purely nonlinear activation functions; that is, since there is no saturation, the function tends to grow unboundedly.

We omit the z-driven reservoir results for both systems due to the instability of the output. Even when using regularized output weights, the lack of higher-order terms causes the single quadratic term in the activation function to produce seemingly unbounded neuron dynamics, producing large reservoir outputs, particularly in transitions between steady states of behavior in the chaotic driving component.
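For reference, the three activation functions compared in this section are easy to write down in code. The snippet below is a minimal Python sketch of Eqs. 4.1–4.3 (the linear form of Eq. 4.1 is reconstructed here from its description as the first-order Taylor expansion of tanh(x) about x = 1); it is an illustration rather than the implementation used for the thesis, which is given in Appendix A.

```python
import numpy as np

T1 = np.tanh(1.0)      # tanh(1)
C1 = 1.0 - T1 ** 2     # first-order Taylor coefficient of tanh(x) about x = 1
C2 = T1 ** 3 - T1      # second-order Taylor coefficient of tanh(x) about x = 1

def f_linear(x):
    """Eq. 4.1: first-order Taylor expansion of tanh(x) about x = 1."""
    return T1 + C1 * (x - 1.0)

def f_quadratic(x):
    """Eq. 4.2: second-order Taylor expansion of tanh(x) about x = 1."""
    return T1 + C1 * (x - 1.0) + C2 * (x - 1.0) ** 2

def f_purely_nonlinear(x):
    """Eq. 4.3: the quadratic expansion with its linear term removed."""
    return T1 + C2 * (x - 1.0) ** 2

# Near x = 1 the quadratic form tracks tanh(x) closely; because neither quadratic
# variant saturates, both grow without bound for large |x|, which is why the
# network parameters had to be chosen to keep the driven reservoir bounded.
x = np.linspace(0.0, 2.0, 5)
print(np.tanh(x))
print(f_quadratic(x))
```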


Chapter 5

Learning Attractors

In contrast to previous chapters, where we utilized a driven reservoir to predict some of the components of a chaotic system, here we attempt to get the reservoir to generate the attractor on its own, using output feedback connections and no external driving. While the driven reservoir can easily produce components of the various chaotic attractors fed to it, it is unclear whether it is really learning the attractor. That is, is the RC just mapping the input signal to some output, or is it capable of sustaining trajectory evolution on a chaotic attractor?

We look at three different types of reservoir to generate these attractors: a conventional tanh(x) activation function reservoir, a quadratic activation function reservoir, and a linear activation function reservoir. The quadratic activation function was the second-order Taylor expansion of tanh(x) about x = 1, given by Eq. 4.2. Similarly, the linear activation function was the first-order Taylor expansion of tanh(x) about x = 1, which we used earlier for the linearization of the driven reservoir (see Eq. 4.1). Using these different activation functions allows us to see how the nonlinearity of the reservoir determines its ability to generate the attractor, and whether linear memory is sufficient to generate the attractor.

5.1 Lorenz Attractor

We first try to generate the chaotic Lorenz attractor using our feedback RC. As before, we used parameters of a = 10, b = 28, and c = 8/3 [13], the canonical chaotic butterfly attractor. The network parameters are given in Table 5.1. The network was trained to the attractor using teacher forcing with no input (i.e., just with output forcing). Instead of using the Moore-Penrose pseudoinverse to calculate output weights as we have done previously in this thesis, we used ridge regression with β = 0.01 to set the weights in this instance, as it yielded better results. This was due to the large magnitude of the output weights calculated by the Moore-Penrose method, which caused large jumps in the magnitude of the output.

Fig. 5.1, Fig. 5.2, and Fig. 5.3 show the attractor generated by the tanh reservoir, quadratic reservoir, and linear reservoir, respectively. Comparing these figures, we see that the attractors for the tanh and quadratic reservoirs resemble one another very

Figure 5.1: The Lorenz system, generated by a standard tanh(x) reservoir. (a) The x, ỹ, z components generated by the tanh(x) reservoir compared to the continuation of the numerically integrated components used for training. The solid blue line is the trace generated by the RC, while the dashed red line is the actual component of the attractor. (b) The attractor generated by the tanh(x) reservoir.
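The "numerically integrated components used for training" referred to in these comparisons come from integrating the Lorenz equations with the parameters quoted in the text. The sketch below is one way to produce such a training trajectory; it assumes the conventional assignment of a, b, and c to the Lorenz coefficients and uses SciPy's solve_ivp, neither of which is necessarily what was actually used for the thesis.

```python
import numpy as np
from scipy.integrate import solve_ivp

def lorenz(t, state, a=10.0, b=28.0, c=8.0 / 3.0):
    # Lorenz system with the chaotic parameter values quoted in the text.
    x, y, z = state
    return [a * (y - x), x * (b - z) - y, x * y - c * z]

dt = 0.02
t_eval = np.arange(0.0, 300.0, dt)
sol = solve_ivp(lorenz, (0.0, 300.0), [1.0, 1.0, 1.0],
                t_eval=t_eval, rtol=1e-9, atol=1e-9)

# Drop the initial transient so the training data lies on the attractor.
trajectory = sol.y.T[500:]          # shape (timesteps, 3): columns are x, y, z
print(trajectory.shape)
```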

Figure 5.2: The Lorenz system, generated by a quadratic reservoir. (a) The x, ỹ, z components generated by the quadratic reservoir compared to the continuation of the numerically integrated components used for training. The solid blue line is the trace generated by the RC, while the dashed red line is the actual component of the attractor. (b) The attractor generated by the quadratic reservoir.

closely, as well as the actual attractor for the Lorenz system (see Fig. 3.2(a)). The generated x, ỹ, and z, shown in Fig. 5.1(a) and Fig. 5.2(a), for both of these reservoirs follow the continuation of the output used for training for some time before diverging (note that both of these traces start immediately after training ends). This is to be expected for a chaotic time series, as any small variation of the generated signal from the comparison signal will lead to drastically different results. The attractor generated by the linear reservoir, however, does not resemble the Lorenz attractor at all, as seen in Fig. 5.3(b). Looking at the x, ỹ, and z components generated by the linear reservoir, we see that the output from the reservoir quickly dies out. Evidently, linear transformation and memory are not sufficient to generate a chaotic system, but any nonlinearity, even the simplest quadratic nonlinearity, will allow the reservoir to learn the Lorenz attractor.

Table 5.1: Network parameters for Lorenz attractor learning.

Parameter                | Value
Reservoir size (N)       | 100
Spectral radius (ρ)      | 0.8
Reservoir density (d_W)  | 0.5
Feedback density (d_fb)  | 0.5
Feedback radius (σ_fb)   | 0.8
Leak rate (a)            | 0.75
Noise (ν)                | 0.01
Bias (ξ)                 | 1
Training length (τ)      | 30,000
Washout time (ζ)         | 100

5.2 Rössler Attractor

We also tried to generate the Rössler attractor using a feedback-only RC. Here, again, we used chaotic Rössler system parameters of a = 0.5, b = 2.0, and c = 4.0 [13]. The network parameters used to generate the Rössler attractors were the same as the parameters used for the Lorenz attractor, except for the feedback weight radius, which was changed to σ_fb = 0.3. The full parameter set is shown in Table 5.2. To calculate the output weights, we used ridge regression with regularization parameter β = 0.01, again due to the large-magnitude output weights generated by the Moore-Penrose pseudoinverse.

Fig. 5.4, Fig. 5.5, and Fig. 5.6 show the components and attractor of the Rössler system generated by the tanh(x), quadratic, and linear activation function RCs, respectively. As with the Lorenz system, we see that the tanh(x) and quadratic RCs can generate the Rössler attractor, and that the generated components agree with the expected components for some number of time steps before diverging, as is

Figure 5.3: The Lorenz system, generated by a linear reservoir. (a) The x, ỹ, z components generated by the linear reservoir compared to the continuation of the numerically integrated components used for training. The solid blue line is the trace generated by the RC, while the dashed red line is the actual component of the attractor. (b) The attractor generated by the linear reservoir.

Figure 5.4: The Rössler system, generated by the tanh(x) reservoir. (a) The x, ỹ, z components generated by the tanh(x) reservoir compared to the continuation of the numerically integrated components used for training. The solid blue line is the trace generated by the RC, while the dashed red line is the actual component of the attractor. (b) The attractor generated by the tanh(x) reservoir.

to be expected for a chaotic system. Perhaps unsurprisingly, since chaotic solutions only exist in nonlinear systems of ordinary differential equations, the linear network could not generate the attractor, and the network dynamics die out relatively quickly when external driving is absent. That both the tanh(x) and quadratic networks can generate the attractor suggests that the requirement for generating a chaotic attractor is simply to have a nonlinearity, regardless of its form. That is, saturation, which is present in the tanh(x) activation function but not in the quadratic one, is not required, but merely helpful from a numerical standpoint.

Table 5.2: Network parameters for Rössler attractor learning.

Parameter                | Value
Reservoir size (N)       | 100
Spectral radius (ρ)      | 0.8
Reservoir density (d_W)  | 0.5
Feedback density (d_fb)  | 0.5
Feedback radius (σ_fb)   | 0.3
Leak rate (a)            | 0.75
Noise (ν)                | 0.01
Bias (ξ)                 | 1
Training length (τ)      | 30,000
Washout time (ζ)         | 100
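Both attractor-learning experiments set the output weights with ridge regression (β = 0.01) rather than the Moore-Penrose pseudoinverse. The snippet below is a minimal sketch of that solve, assuming the harvested reservoir states are stacked as rows of X and the teacher signals as rows of Y; the thesis's actual conventions and training procedure are those laid out in Chapter 2 and Appendix A.

```python
import numpy as np

def ridge_readout(X, Y, beta=0.01):
    """Solve for W_out minimizing ||X W - Y||^2 + beta ||W||^2.

    X : (timesteps, N) harvested reservoir states (after the washout period)
    Y : (timesteps, L) teacher outputs
    Returns W_out of shape (N, L), so predictions are X @ W_out.
    """
    N = X.shape[1]
    A = X.T @ X + beta * np.eye(N)
    return np.linalg.solve(A, X.T @ Y)

# Toy usage with random stand-in data; real runs use harvested reservoir states.
rng = np.random.default_rng(0)
X = rng.standard_normal((30000, 100))
Y = rng.standard_normal((30000, 3))
W_out = ridge_readout(X, Y, beta=0.01)
print(W_out.shape)   # (100, 3)
```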

Figure 5.5: The Rössler system, generated by the quadratic reservoir. (a) The x, ỹ, z components generated by the quadratic reservoir compared to the continuation of the numerically integrated components used for training. The solid blue line is the trace generated by the RC, while the dashed red line is the actual component of the attractor. (b) The attractor generated by the quadratic reservoir.

Figure 5.6: The Rössler system, generated by the linear reservoir. (a) The x, ỹ, z components generated by the linear reservoir compared to the continuation of the numerically integrated components used for training. The solid blue line is the trace generated by the RC, while the dashed red line is the actual component of the attractor. (b) The attractor generated by the linear reservoir.
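After the output weights are trained with teacher forcing, the attractor-generating reservoirs above run closed-loop: their own output is fed back through the feedback connections in place of the teacher signal. The sketch below shows only that free-running stage, and it assumes a generic leaky-tanh(x) update with a feedback matrix W_fb and bias ξ; the exact update equation, weight generation, and sparsity used in this thesis are those defined in Chapter 2, so treat the details here as illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N, L = 100, 3                    # reservoir nodes, output dimension (x, y, z)
a, xi = 0.75, 1.0                # leak rate and bias, as in Table 5.1

W = rng.uniform(-1.0, 1.0, (N, N))
W *= 0.8 / np.max(np.abs(np.linalg.eigvals(W)))     # spectral radius 0.8
W_fb = rng.uniform(-0.8, 0.8, (N, L))               # feedback radius 0.8
W_out = rng.standard_normal((N, L)) * 0.01          # stand-in; really set by ridge regression

# Closed-loop (autonomous) generation: feed the reservoir its own output.
x = np.zeros(N)
y = np.zeros(L)
outputs = []
for _ in range(1000):
    pre = W @ x + W_fb @ y + xi
    x = (1.0 - a) * x + a * np.tanh(pre)   # leaky tanh(x) reservoir update
    y = W_out.T @ x                        # linear readout
    outputs.append(y)
outputs = np.array(outputs)
print(outputs.shape)   # (1000, 3)
```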


Chapter 6

Coupled Map Lattices

6.1 Background

Coupled map lattices have been a field of interest in chaos research since the early 1980s [22]. In general, a CML can be thought of as a discrete lattice of dynamical nodes that interact with one another in some prescribed, iterative manner. Many natural and manmade systems that display spatiotemporal chaos can be modeled as some sort of CML, if the appropriate rules are chosen for the map. These rules usually consist of some coupling between neighboring nodes and some nonlinear transformation of the variables associated with each node, which are then iteratively applied [22]. This simple concept can produce an array of different behaviors, ranging from simple synchronous periodic behavior of the dynamic variables of all nodes in the lattice to spatially incoherent chaotic oscillations of the node variables, depending on the type and strength of the coupling [23]. CMLs are interesting for reservoir computing in that they provide dynamics with spatial extent. This allows us to investigate how reservoir computers deal with spatial information when performing various tasks.

6.2 The Hagerstrom CML

6.2.1 Motivation

For this thesis, we used a CML described by Hagerstrom et al. [23]. While Hagerstrom et al. [23] used the CML to investigate chimera states present in the transition between order and chaos, the Hagerstrom CML has several advantages for generating prediction datasets:

1. The CML can be implemented in both a 1-dimensional and a 2-dimensional lattice.¹

¹ Note that here we use 1-dimensional and 2-dimensional in the sense of a physical/spatial dimension, i.e., a 1-dimensional line or a 2-dimensional area. This is in contrast to earlier sections, where we referred to the Lorenz and Rössler systems as having 3 dimensions; there, dimension referred to the variables x, y, and z. Not the best use of language, but such is life.

2. The dynamics of the CML are invariant with respect to changes in the number of nodes in the lattice.

3. There is a range of qualitatively distinct dynamics available in the 2-dimensional parameter space of the CML.

4. Numerical implementation of the code is relatively straightforward.

5. It has been implemented experimentally.

The first three qualities are especially important. The chaotic datasets we have looked at thus far in this thesis can be considered as a single CML node with 3 variables. The CMLs we are considering now have many nodes, each with at least one dynamic variable, that the RC needs to account for, and it is unclear how reservoir size scales with the number of nodes. In addition, the effect of dimensionality on prediction is unclear. As far as we can tell, there have been no papers using RC to make predictions about 2-dimensional datasets, while 1-dimensional datasets have been previously investigated [13, 24]. Lastly, the range of dynamics, from incoherent, to periodic, to chaotic, that can be produced by varying two coupling parameters in the CML update equation allows us to easily see how well the RC performs with different dynamics.

As an added bonus, not only is it possible to implement the lattices in silico, as is usually done for CMLs, but Hagerstrom et al. [23] also outlined a method to implement their CML as an optoelectronic experiment. Future work might allow for online incorporation of RCs into the optoelectronic experiment for signal clean-up, as suggested by Jaeger and Haas [11].

6.2.2 Mathematics

The CML described by Hagerstrom et al. [23] has two variables associated with each node in the lattice: intensity and phase shift. These variables are related by the nonlinearity

I^t_{i,j} = (1/2) [1 − cos(φ^t_{i,j})],    (6.1)

where I^t_{i,j} is the dimensionless intensity of the (i, j)th node on the lattice at time t and φ^t_{i,j} is the phase shift of the (i, j)th node at time t. We update the phase shift by reading the intensity at each lattice node and coupling it to its neighbors. In the 1-dimensional case, our update equation is given by

φ^{t+1}_i = 2πa [ I^t_i + (ε / 2R) Σ_{j=−R}^{R} (I^t_{i+j} − I^t_i) ],    (6.2)

where a is an overall constant that controls the temporal dynamics of the uncoupled nodes, ε is the coupling strength, and R is the radius about the ith node within which nodes are coupled to the ith node. In 2 dimensions, we have the update equation

φ^{t+1}_{i,j} = 2πa [ I^t_{i,j} + (ε / 4R²) Σ_{k,l=−R}^{R} (I^t_{i+k,j+l} − I^t_{i,j}) ],    (6.3)

where a, ε, and R are the same as for the 1-dimensional case. For both the 1-dimensional and 2-dimensional cases, periodic boundary conditions are used.

Let's take a look at the constituent parts of these equations. Starting from the left, the first parameter we have is a. To understand its effect, we set the coupling strength ε = 0 (looking at the 1-dimensional equation for simplicity):

φ^{t+1}_i = 2πa [ I^t_i + (0 / 2R) Σ_{j=−R}^{R} (I^t_{i+j} − I^t_i) ]  ⟹  φ^{t+1}_i = 2πa I^t_i.    (6.4)

Then, substituting this into the nonlinear mapping given in Eq. 6.1, we have

φ^{t+1}_i = πa (1 − cos(φ^t_i)).    (6.5)

Figure 6.1: Cobweb plots of the uncoupled lattice update equation, where the nonlinear mapping is given by Eq. 6.5. (a) The cobweb plot for a = 0.68, (b) the plot for a = 0.718, and (c) the plot for a = 0.85. Each plot was initialized with φ⁰ᵢ = π.

So, this parameter controls the dynamics of the uncoupled nodes, since it controls the relationship between the current and previous values of the phase shift for each node. Shifting the value of a can put the uncoupled dynamics into periodic, steady-state, or chaotic regimes. Fig. 6.1 shows cobweb plots for three different values of a. Fig. 6.1(a) has a = 0.68, which gives a period-2 oscillation of the nodes, while Fig. 6.1(b) has a = 0.718, which gives a period-4 oscillation. Lastly, Fig. 6.1(c) has a = 0.85, which is chaotic. Examples of these behaviors for an uncoupled 1-dimensional lattice can be seen in Fig. 6.2. Note that in these evolution images some of the nodes are dark.

Figure 6.2: Evolution of uncoupled 1D lattices of 10 nodes. Time steps were measured after washing out an initial 500 time steps. (a) shows the evolution of the lattice for a = 0.68, (b) shows the evolution for a = 0.718, and (c) shows the evolution for a = 0.85. Each node was randomly initialized with an intensity I⁰ᵢ ∈ [0, 1).

These correspond to nodes that decayed to the φ = 0 steady state, as will happen for nodes with an initial intensity sufficiently close to 0.

While a controls the temporal dynamics of the uncoupled CML, the other parameters, ε and R, control the coupling between nodes. Analyzing the effect of the coupling constants is considerably more involved than analyzing the effect of a, and somewhat outside the scope of this thesis. Fortunately, however, Hagerstrom et al. [23] analyze these parameters in the paper where they introduce this CML. For the purposes of this thesis, we will merely note when the dynamics of a lattice are chaotic, periodic, or steady-state, since this is what is relevant for the RC tasks performed.

6.3 Implementing CMLs

As mentioned earlier in this chapter, the CMLs were created both in silico and experimentally. The simulations were advantageous in that they allowed for quick and accurate generation of datasets, and they provided a way to check that the experimental data was being generated properly. The experimental implementation, however, would allow for potential real-time prediction of data (in the future), which is more interesting from a control-systems perspective, as well as for real-world applications.

6.3.1 In Silico Implementation

Simulations were performed using Python 3.6. Custom simulation scripts were written based on Eq. 6.2 and Eq. 6.3. The simulation allowed us to specify N, a, ε, and

Figure 6.3: The experimental setup used to generate CMLs. Laser light from a HeNe laser was expanded and the intensity modulated by a λ/2 plate, polarizing beam splitter, and a second λ/2 plate before passing through a λ/4 plate and onto the screen of a spatial light modulator. The spatial light modulator applies a programmable spatially-dependent phase shift to the wavefront before it reflects back through the λ/4 plate and through a polarizing beam splitter. The light emitted from the polarizing beam splitter then is imaged onto a CMOS camera through a neutral density filter. The irradiance information from the camera is used to update the phase shift applied by the spatial light modulator, which is the update step of the CML dynamics. The λ/4 plate, spatial light modulator, and polarizing beam splitter apply the update nonlinearity specified in Eq. 6.1.

R. Each lattice was initialized such that the starting phase shift of each node was randomly assigned from a uniform distribution on [0, 2π). The algorithm used for both the 1-dimensional and 2-dimensional lattices can be seen in Appendix B.

Figure 6.4: A snapshot of the experimental lattice. This lattice was initialized so that each node was randomly set to a value between 0 and 2π.

6.3.2 Experimental Implementation

The setup used to generate CMLs experimentally is shown in Fig. 6.3. There were three primary components used to generate the experimental CMLs: beam generation and clean-up, intensity modulation, and imaging/update computation. The general scheme of the experimental setup was to use a spatial light modulator (SLM) in combination with other optics to modulate the intensity of a laser beam across its cross-sectional area, which is treated as a lattice. The irradiance of the light at different nodes in the lattice could then be recorded with a camera, and a new phase shift could be applied by the SLM based on the measured irradiance.

The beam generation and clean-up stage of the experiment was designed to produce a vertically polarized beam with close to uniform irradiance over a large region of its cross-sectional area. This stage consisted of a HeNe laser as the light source, a beam expander, a λ/2 plate, a polarizing beam splitter, and another λ/2 plate. Additionally, there is an iris just outside of the output coupler of the laser, and another at the focal point of the beam expander. The beam emitted from the laser is reflected by mirrors through the beam expander, where it is expanded roughly 8 times. The expanded beam then passes through the first λ/2 plate, which rotates the linearly polarized laser light to some angle. When the light then passes through the polarizing beam splitter, the vertical component of the light is discarded, and only the horizontally polarized component of the beam is passed on. The combination of

λ/2 plate and polarizing beam splitter controls the irradiance of the light falling onto the SLM screen. The linearly polarized light is then rotated by the second λ/2 plate to vertical polarization, since this is what is required for the phase modulation stage. The beam coming out of the second λ/2 plate is reflected off two mirrors to provide a long beam path, so the beam has further room to expand.

It then hits the intensity modulation stage, which consists of a λ/4 plate at −45°, an SLM, and a polarizing beam splitter. The beam first passes through the λ/4 plate, which converts the light to left circularly polarized light. The light then strikes the SLM, where it is reflected as right circularly polarized light, with a spatially dependent phase shift applied by the SLM. Passing back through the λ/4 plate, the right circularly polarized light is converted back to linearly polarized light, in a combination of vertical and horizontal polarization that depends on the phase shift applied by the SLM. It then passes through the polarizing beam splitter, which removes the horizontally polarized component from the beam. Since the SLM can apply a user-specified 0 to 2π phase shift at arbitrary points on the cross-sectional area of the beam, this system allows us to modulate the intensity of the beam at various points over its cross-sectional area. The intensity modulation using the SLM and associated optics implements the CML update nonlinearity given by Eq. 6.1. This relationship is derived using the Jones matrices for the various optical components in Appendix C.

At this point, the intensity-modulated light passes through an imaging lens and a final neutral density filter before falling on the camera.

Figure 6.5: Comparison of experimental and simulated 1D coupled map lattices generated with n = 36, a = 0.85, ε = 0.9, and R = 5. Experiment is on the left; simulation is on the right. Both images were taken the same number of time steps after the lattice was randomly initialized.

Figure 6.6: Comparison of experimental and simulated 1D coupled map lattices generated with n = 36, a = 0.85, ε = 0.6, and R = 3. Experiment is on the left; simulation is on the right. Both images were taken the same number of time steps after the lattice was randomly initialized. While the CMLs look similar here, the simulation later falls into a periodic state a few hundred steps after the snapshot shown.

We used a Thorlabs

DCC1545M monochrome camera to measure the intensity of the light. The image from the camera was output in real time to a computer, where the irradiance of each node in the lattice was estimated from the grayscale values of the camera pixels that comprised the node, using a MATLAB script. The script then applied Eq. 6.2 or Eq. 6.3 to generate new phase shifts for each point in the lattice. These were then output to the SLM, the new intensity measured, and so on. The measured intensity was stored at each update of the SLM, and this record was saved as our experimentally generated CML. The MATLAB script used for the experiment is included in the Supplemental Materials for this thesis.

We tried a variety of lattice sizes before settling on a 6 × 6 lattice. A picture of the lattice is shown in Fig. 6.4. This allowed us to have an n = 6 (i.e., 6 × 6) 2D lattice or an n = 36 1D lattice. Having a smaller number of lattice nodes allowed us to include more camera and SLM pixels per node, which helped even out any effects of shifting optical elements in our setup, as well as any intensity differences in the beam.

Figure 6.7: Comparison of experimental and simulated 1D coupled map lattices generated with n = 36, a = 0.85, ε = 0.6, and R = 7. Experiment is on the left; simulation is on the right. Both images were taken the same number of time steps after the lattice was randomly initialized. The experimental CML has converged to a periodic state, while the simulation continues to resemble chaos.

6.3.3 Comparing Experiment to Simulation

To test how well the experiment was working, we compared several datasets created experimentally and in silico with the same parameters a, ε, and R. A few examples are shown in Fig. 6.5, Fig. 6.6, Fig. 6.7, and Fig. 6.8. In all of these figures, the experimentally generated CML is on the left, while the numerically generated CML is shown on the right. Fig. 6.5 is a 1D CML created using parameters a = 0.85, ε = 0.9, and R = 5. For the entire run created using both methods, the data look qualitatively similar. Fig. 6.6 is a 1D CML created using parameters a = 0.85, ε = 0.6, and R = 3. While the CMLs created using both methods look similar for this parameter set, the simulated CML later adopts periodic behavior, while the experimental CML does not. Fig. 6.7 is a 1D CML created with parameters a = 0.85, ε = 0.6, and R = 7. Here there is no resemblance between the two datasets: the experimental CML has already adopted a periodic state, while the simulation continues to resemble chaos. Finally, Fig. 6.8 is a 1D CML created with parameters a = 0.85, ε = 0.9, and R = 7. In both the experimental and simulated datasets for this parameter set, the CML has gone to a periodic state.

Figure 6.8: Comparison of experimental and simulated 1D coupled map lattices generated with n = 36, a = 0.85, ε = 0.9, and R = 7. Experiment is on the left; simulation is on the right. Both images were taken the same number of time steps after the lattice was randomly initialized. Both the experimental and simulated CMLs have adopted a periodic state.

Apparently, the experiment cannot reliably reproduce the CMLs generated by the simulation. It is somewhat unclear whether this is due to a lack of resolution in the experiment or to some other experimental flaw. Another possibility is that, for some sets of lattice parameters, periodic attractors coexist with chaotic attractors. The basin of attraction that the lattice converges to would then depend strongly on the initial conditions of the lattice nodes (which are randomized), as well as on the cumulative effect of small nudges due to noise in the experimental case. Also unclear is whether the experimental CMLs that did not converge to a periodic state within the recorded iterations would later converge. However, the length of time involved in collecting each data set (a full experimental run of the CML took roughly 2 hours) made collecting longer data sets unfeasible, especially since the laser power drifts over that time, which caused the relative irradiance of the lattice to shift as data was collected.

Regardless, in some sense it does not matter that the experimental and simulated CMLs do not exactly match. For this thesis, we are not interested in the actual dynamics of the lattice. Rather, we are interested in finding out whether a reservoir computer can infer the dynamics of a 1-dimensional or 2-dimensional lattice when given the dynamics of just some subset of the nodes and appropriate training. In this sense, the experiment does exactly what we want: it creates data sets with interesting dynamics that would potentially be tricky for an RC to learn. So while it is challenging to quantitatively compare the experimental and simulated CMLs, we can still look at whether the RC can be an effective observer for these CMLs.
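The in silico lattices used throughout this chapter are produced by iterating Eqs. 6.1 and 6.2 (or Eq. 6.3 in two dimensions). The snippet below is a minimal sketch of the 1D update with periodic boundary conditions; the simulation scripts actually used for the thesis are those given in Appendix B, so the structure and parameter handling here are illustrative only.

```python
import numpy as np

def iterate_cml_1d(n=36, a=0.85, eps=0.75, R=5, steps=1000, seed=0):
    """Iterate the 1D Hagerstrom CML (Eqs. 6.1 and 6.2) with periodic boundaries."""
    rng = np.random.default_rng(seed)
    phi = rng.uniform(0.0, 2.0 * np.pi, n)     # random initial phase shifts
    record = np.empty((steps, n))
    for t in range(steps):
        I = 0.5 * (1.0 - np.cos(phi))          # Eq. 6.1: phase shift -> intensity
        coupling = np.zeros(n)
        for j in range(-R, R + 1):             # neighbors within radius R (periodic)
            coupling += np.roll(I, -j) - I
        phi = 2.0 * np.pi * a * (I + eps / (2.0 * R) * coupling)   # Eq. 6.2
        record[t] = I
    return record

lattice = iterate_cml_1d()
print(lattice.shape)   # (1000, 36); setting eps = 0 recovers the uncoupled map of Eq. 6.5
```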


Chapter 7

Predicting Coupled Map Lattices

As alluded to in the previous chapter, we use coupled map lattices as a convenient source of interesting dynamics. We then use the reservoir computing techniques discussed in previous chapters to see whether the network is able to learn the system well enough to function as an observer: if the reservoir is supplied with some subset of the lattice nodes, is it able to infer the rest? In addition, CMLs allow us to ask a couple of more specific questions that we were unable to ask with the Lorenz or Rössler systems: can the reservoir predict 2-dimensional data sets? How will the reservoir parameters scale with these larger data sets?

7.1 The Problem with Two Dimensions

The 2-dimensional CML presents an interesting problem for RC. Both the Moore-Penrose pseudoinverse and ridge regression methods of solving the RC linear regression problem rely on having the training matrix T ∈ ℝ^(τ×L) of the teacher vectors stacked over time, and it is unclear how this would generalize to 2-dimensional CML teacher data, since we would need a 3-dimensional (time, teacher dimension 1, teacher dimension 2) training array T ∈ ℝ^(τ×n×n), where n is the number of nodes on each side of the lattice. So, to try to make some predictions on 2-dimensional data, we need to find some way to reduce the number of spatial dimensions so that the lattice can be represented as a vector. While there are many ways one might represent a 2D lattice as a 1D vector, we chose the most obvious (and simplest to implement) solution: unwrapping the matrix in raster order. For a 3 × 3 lattice, this is

[ a₁₁ a₁₂ a₁₃ ; a₂₁ a₂₂ a₂₃ ; a₃₁ a₃₂ a₃₃ ]  →  [ a₁₁ a₁₂ a₁₃ a₂₁ a₂₂ a₂₃ a₃₁ a₃₂ a₃₃ ].    (7.1)

Using this method, none of the information in the nodes is lost (i.e., we aren't averaging over patches of nodes), though it is possible that some spatial information could be lost, since nodes that are actually next to each other in the lattice no longer are.
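In NumPy, the raster-order unwrapping of Eq. 7.1 is just a row-major reshape; the snippet below shows the flattening and its inverse, which is one way (not necessarily the thesis's exact code) to move between the 2D lattice and the vector handed to the reservoir.

```python
import numpy as np

lattice = np.arange(1, 10).reshape(3, 3)   # a 3 x 3 lattice of node values
vector = lattice.flatten(order="C")        # raster (row-major) order, as in Eq. 7.1
restored = vector.reshape(3, 3)            # undo the unwrapping

print(vector)                              # [1 2 3 4 5 6 7 8 9]
print(np.array_equal(lattice, restored))   # True: no node values are lost
```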

Figure 7.1: A few different 2D coupled map lattice observer schemes. The nodes in gray are the nodes supplied to the reservoir as observations, while the nodes in white are the nodes that must be inferred by the network. We use a 5 × 5 lattice for demonstration, though the patterns scale with other lattice sizes.

7.1.1 First Attempts, and a Different Problem

Having come up with a way to input 2D data into the reservoir, we attempted to use an RC as an observer for the CML dynamics. We chose a variety of different ways to select observer nodes (i.e., the subset of observed nodes), shown in Fig. 7.1. The nodes supplied to the RC when trying to reconstruct the lattice are shown in gray, while the nodes that must be reconstructed are shown in white. The patterns of supplied observer nodes scale with the lattice size, but maintain the same general shape as for the 5 × 5 case depicted. Once we had decided on the observer nodes, we supplied the vector of observer nodes to the reservoir, while we used the entire unwrapped lattice (including the observer nodes) as the training vector.

However, despite trying a range of different reservoir sizes, training lengths, reservoir parameters, lattice sizes, and variations of observer nodes, we were unable to get a successful prediction of the lattice dynamics after the training period (the RMSE was on the order of unity). The reservoir was able to reproduce the observer nodes, but couldn't infer the values of the other nodes (see Supplemental Materials). This raised the other question around 2D lattices: does the unwrapping method cause a loss of spatial information?

7.2 1D Lattices

To begin exploring why the network could not predict a 2D lattice, we started by looking at a 1D lattice. There were a couple of advantages to looking at a 1D lattice rather than a 2D lattice. The 1D lattices can be input into the reservoir directly, without any unwrapping or other modification, so we are not at risk of potentially losing spatial information by modifying the format of the input data. Another quality of the 1D lattices that seemed to bode well for making predictions was the relative simplicity of the lattice dynamics (see, for example, Fig. 6.6 or Fig. 7.2), compared to the more complicated spatiotemporal dynamics present in the 2D lattices (Fig. 7.3 and Supplemental Materials). Both of these qualities made prediction of a 1D lattice more likely to succeed than for a 2D lattice, so it seemed like a good place to (re-)start our investigation of an RC observer for CMLs.

We used a 32-node 1D lattice with a = 0.85, ε = 0.75, and R = 5 (chaotic dynamics), generated in silico, to test the ability of our RCs to reconstruct CMLs. An

Figure 7.2: A few time steps of the CML dataset used for the 1D parameter search. The dataset is a 32-node 1D lattice with parameters a = 0.85, ε = 0.75, and R = 5. This dataset was not observed to adopt a periodic state within 50,000 iterations. The snapshot starts at 25,000 iterations.

Figure 7.3: A few time steps of a 2D lattice. This CML was created with parameters a = 0.85, ε = 0.8, and R = 5. Note that the spatiotemporal patterns are much more complicated than for the 1D case.
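For the 1D observer task, the bookkeeping is direct: the intensities of a chosen subset of nodes are fed to the reservoir as input at each time step, and the full lattice record is the training target. The sketch below shows that setup only, using random stand-in data of the right shape and an evenly spaced observer subset, both of which are illustrative choices rather than the thesis's actual configuration.

```python
import numpy as np

# Stand-in for a recorded lattice of shape (timesteps, nodes); a real run would use
# the simulated or experimental intensity record described in Chapter 6.
rng = np.random.default_rng(2)
record = rng.uniform(0.0, 1.0, (5000, 32))

observer_idx = np.arange(0, 32, 4)      # evenly spaced subset of observed nodes
washout, train_len = 100, 4000

inputs = record[:, observer_idx]        # what the reservoir is driven with
targets = record                        # full lattice: what the readout must reproduce

# Split into training and testing segments, discarding the washout period.
train_in, train_out = inputs[washout:train_len], targets[washout:train_len]
test_in, test_out = inputs[train_len:], targets[train_len:]
print(train_in.shape, train_out.shape)  # (3900, 8) (3900, 32)
```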


Appendix A.1 Derivation of Nesterov s Accelerated Gradient as a Momentum Method for all t is su cient to obtain the same theoretical guarantees. This method for choosing the learning rate assumes that f is not noisy, and will result in too-large learning rates if the objective is

More information

Data Mining Part 5. Prediction

Data Mining Part 5. Prediction Data Mining Part 5. Prediction 5.5. Spring 2010 Instructor: Dr. Masoud Yaghini Outline How the Brain Works Artificial Neural Networks Simple Computing Elements Feed-Forward Networks Perceptrons (Single-layer,

More information

Ways to make neural networks generalize better

Ways to make neural networks generalize better Ways to make neural networks generalize better Seminar in Deep Learning University of Tartu 04 / 10 / 2014 Pihel Saatmann Topics Overview of ways to improve generalization Limiting the size of the weights

More information

A Search for the Simplest Chaotic Partial Differential Equation

A Search for the Simplest Chaotic Partial Differential Equation A Search for the Simplest Chaotic Partial Differential Equation C. Brummitt University of Wisconsin-Madison, Department of Physics cbrummitt@wisc.edu J. C. Sprott University of Wisconsin-Madison, Department

More information

On the use of Long-Short Term Memory neural networks for time series prediction

On the use of Long-Short Term Memory neural networks for time series prediction On the use of Long-Short Term Memory neural networks for time series prediction Pilar Gómez-Gil National Institute of Astrophysics, Optics and Electronics ccc.inaoep.mx/~pgomez In collaboration with: J.

More information

Week 4: Differentiation for Functions of Several Variables

Week 4: Differentiation for Functions of Several Variables Week 4: Differentiation for Functions of Several Variables Introduction A functions of several variables f : U R n R is a rule that assigns a real number to each point in U, a subset of R n, For the next

More information

Machine Learning. Neural Networks

Machine Learning. Neural Networks Machine Learning Neural Networks Bryan Pardo, Northwestern University, Machine Learning EECS 349 Fall 2007 Biological Analogy Bryan Pardo, Northwestern University, Machine Learning EECS 349 Fall 2007 THE

More information

Bayesian Linear Regression [DRAFT - In Progress]

Bayesian Linear Regression [DRAFT - In Progress] Bayesian Linear Regression [DRAFT - In Progress] David S. Rosenberg Abstract Here we develop some basics of Bayesian linear regression. Most of the calculations for this document come from the basic theory

More information

10 NEURAL NETWORKS Bio-inspired Multi-Layer Networks. Learning Objectives:

10 NEURAL NETWORKS Bio-inspired Multi-Layer Networks. Learning Objectives: 10 NEURAL NETWORKS TODO The first learning models you learned about (decision trees and nearest neighbor models) created complex, non-linear decision boundaries. We moved from there to the perceptron,

More information

Lecture 4: Training a Classifier

Lecture 4: Training a Classifier Lecture 4: Training a Classifier Roger Grosse 1 Introduction Now that we ve defined what binary classification is, let s actually train a classifier. We ll approach this problem in much the same way as

More information

Real Analysis Prof. S.H. Kulkarni Department of Mathematics Indian Institute of Technology, Madras. Lecture - 13 Conditional Convergence

Real Analysis Prof. S.H. Kulkarni Department of Mathematics Indian Institute of Technology, Madras. Lecture - 13 Conditional Convergence Real Analysis Prof. S.H. Kulkarni Department of Mathematics Indian Institute of Technology, Madras Lecture - 13 Conditional Convergence Now, there are a few things that are remaining in the discussion

More information

The Conjugate Gradient Method for Solving Linear Systems of Equations

The Conjugate Gradient Method for Solving Linear Systems of Equations The Conjugate Gradient Method for Solving Linear Systems of Equations Mike Rambo Mentor: Hans de Moor May 2016 Department of Mathematics, Saint Mary s College of California Contents 1 Introduction 2 2

More information

Hierarchy. Will Penny. 24th March Hierarchy. Will Penny. Linear Models. Convergence. Nonlinear Models. References

Hierarchy. Will Penny. 24th March Hierarchy. Will Penny. Linear Models. Convergence. Nonlinear Models. References 24th March 2011 Update Hierarchical Model Rao and Ballard (1999) presented a hierarchical model of visual cortex to show how classical and extra-classical Receptive Field (RF) effects could be explained

More information

MATH 415, WEEKS 14 & 15: 1 Recurrence Relations / Difference Equations

MATH 415, WEEKS 14 & 15: 1 Recurrence Relations / Difference Equations MATH 415, WEEKS 14 & 15: Recurrence Relations / Difference Equations 1 Recurrence Relations / Difference Equations In many applications, the systems are updated in discrete jumps rather than continuous

More information

Neuron, Volume 63. Supplemental Data. Generating Coherent Patterns of Activity. from Chaotic Neural Networks. David Sussillo and L.F.

Neuron, Volume 63. Supplemental Data. Generating Coherent Patterns of Activity. from Chaotic Neural Networks. David Sussillo and L.F. Neuron, Volume 63 Supplemental Data Generating Coherent Patterns of Activity from Chaotic Neural Networks David Sussillo and L.F. Abbott Supplementary Material FORCE Learning without RLS It is possible

More information

1 What a Neural Network Computes

1 What a Neural Network Computes Neural Networks 1 What a Neural Network Computes To begin with, we will discuss fully connected feed-forward neural networks, also known as multilayer perceptrons. A feedforward neural network consists

More information

Decentralized Stabilization of Heterogeneous Linear Multi-Agent Systems

Decentralized Stabilization of Heterogeneous Linear Multi-Agent Systems 1 Decentralized Stabilization of Heterogeneous Linear Multi-Agent Systems Mauro Franceschelli, Andrea Gasparri, Alessandro Giua, and Giovanni Ulivi Abstract In this paper the formation stabilization problem

More information

A Novel Three Dimension Autonomous Chaotic System with a Quadratic Exponential Nonlinear Term

A Novel Three Dimension Autonomous Chaotic System with a Quadratic Exponential Nonlinear Term ETASR - Engineering, Technology & Applied Science Research Vol., o.,, 9-5 9 A Novel Three Dimension Autonomous Chaotic System with a Quadratic Exponential Nonlinear Term Fei Yu College of Information Science

More information

Lecture 5: Logistic Regression. Neural Networks

Lecture 5: Logistic Regression. Neural Networks Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture

More information

Non-Convex Optimization. CS6787 Lecture 7 Fall 2017

Non-Convex Optimization. CS6787 Lecture 7 Fall 2017 Non-Convex Optimization CS6787 Lecture 7 Fall 2017 First some words about grading I sent out a bunch of grades on the course management system Everyone should have all their grades in Not including paper

More information

Neural Networks. Mark van Rossum. January 15, School of Informatics, University of Edinburgh 1 / 28

Neural Networks. Mark van Rossum. January 15, School of Informatics, University of Edinburgh 1 / 28 1 / 28 Neural Networks Mark van Rossum School of Informatics, University of Edinburgh January 15, 2018 2 / 28 Goals: Understand how (recurrent) networks behave Find a way to teach networks to do a certain

More information

Lecture 6. Notes on Linear Algebra. Perceptron

Lecture 6. Notes on Linear Algebra. Perceptron Lecture 6. Notes on Linear Algebra. Perceptron COMP90051 Statistical Machine Learning Semester 2, 2017 Lecturer: Andrey Kan Copyright: University of Melbourne This lecture Notes on linear algebra Vectors

More information

TWO DIMENSIONAL FLOWS. Lecture 5: Limit Cycles and Bifurcations

TWO DIMENSIONAL FLOWS. Lecture 5: Limit Cycles and Bifurcations TWO DIMENSIONAL FLOWS Lecture 5: Limit Cycles and Bifurcations 5. Limit cycles A limit cycle is an isolated closed trajectory [ isolated means that neighbouring trajectories are not closed] Fig. 5.1.1

More information

Echo State Networks with Filter Neurons and a Delay&Sum Readout

Echo State Networks with Filter Neurons and a Delay&Sum Readout Echo State Networks with Filter Neurons and a Delay&Sum Readout Georg Holzmann 2,1 (Corresponding Author) http://grh.mur.at grh@mur.at Helmut Hauser 1 helmut.hauser@igi.tugraz.at 1 Institute for Theoretical

More information

Why are Discrete Maps Sufficient?

Why are Discrete Maps Sufficient? Why are Discrete Maps Sufficient? Why do dynamical systems specialists study maps of the form x n+ 1 = f ( xn), (time is discrete) when much of the world around us evolves continuously, and is thus well

More information

Online Videos FERPA. Sign waiver or sit on the sides or in the back. Off camera question time before and after lecture. Questions?

Online Videos FERPA. Sign waiver or sit on the sides or in the back. Off camera question time before and after lecture. Questions? Online Videos FERPA Sign waiver or sit on the sides or in the back Off camera question time before and after lecture Questions? Lecture 1, Slide 1 CS224d Deep NLP Lecture 4: Word Window Classification

More information

International University Bremen Guided Research Proposal Improve on chaotic time series prediction using MLPs for output training

International University Bremen Guided Research Proposal Improve on chaotic time series prediction using MLPs for output training International University Bremen Guided Research Proposal Improve on chaotic time series prediction using MLPs for output training Aakash Jain a.jain@iu-bremen.de Spring Semester 2004 1 Executive Summary

More information

APPPHYS217 Tuesday 25 May 2010

APPPHYS217 Tuesday 25 May 2010 APPPHYS7 Tuesday 5 May Our aim today is to take a brief tour of some topics in nonlinear dynamics. Some good references include: [Perko] Lawrence Perko Differential Equations and Dynamical Systems (Springer-Verlag

More information

The Effects of Coarse-Graining on One- Dimensional Cellular Automata Alec Boyd UC Davis Physics Deparment

The Effects of Coarse-Graining on One- Dimensional Cellular Automata Alec Boyd UC Davis Physics Deparment The Effects of Coarse-Graining on One- Dimensional Cellular Automata Alec Boyd UC Davis Physics Deparment alecboy@gmail.com Abstract: Measurement devices that we use to examine systems often do not communicate

More information

Multi-Robotic Systems

Multi-Robotic Systems CHAPTER 9 Multi-Robotic Systems The topic of multi-robotic systems is quite popular now. It is believed that such systems can have the following benefits: Improved performance ( winning by numbers ) Distributed

More information

Ordinary Least Squares Linear Regression

Ordinary Least Squares Linear Regression Ordinary Least Squares Linear Regression Ryan P. Adams COS 324 Elements of Machine Learning Princeton University Linear regression is one of the simplest and most fundamental modeling ideas in statistics

More information

Fitting functions to data

Fitting functions to data 1 Fitting functions to data 1.1 Exact fitting 1.1.1 Introduction Suppose we have a set of real-number data pairs x i, y i, i = 1, 2,, N. These can be considered to be a set of points in the xy-plane. They

More information

Machine Learning and Computational Statistics, Spring 2017 Homework 2: Lasso Regression

Machine Learning and Computational Statistics, Spring 2017 Homework 2: Lasso Regression Machine Learning and Computational Statistics, Spring 2017 Homework 2: Lasso Regression Due: Monday, February 13, 2017, at 10pm (Submit via Gradescope) Instructions: Your answers to the questions below,

More information

Algorithms for Learning Good Step Sizes

Algorithms for Learning Good Step Sizes 1 Algorithms for Learning Good Step Sizes Brian Zhang (bhz) and Manikant Tiwari (manikant) with the guidance of Prof. Tim Roughgarden I. MOTIVATION AND PREVIOUS WORK Many common algorithms in machine learning,

More information