Recurrence Enhances the Spatial Encoding of Static Inputs in Reservoir Networks


Christian Emmerich, R. Felix Reinhart, and Jochen J. Steil
Research Institute for Cognition and Robotics (CoR-Lab), Bielefeld University, Universitätsstr. 25, 33615 Bielefeld, Germany
{cemmeric,freinhar,jsteil}@cor-lab.uni-bielefeld.de
http://www.cor-lab.de

Abstract. We shed light on the key ingredients of reservoir computing and analyze the contribution of the network dynamics to the spatial encoding of inputs. To this end, we introduce attractor-based reservoir networks for the processing of static patterns and compare their performance and encoding capabilities with a related feedforward approach. We show that the network dynamics improve the nonlinear encoding of inputs in the reservoir state, which can increase the task-specific performance.

Key words: reservoir computing, extreme learning machine, static pattern recognition

1 Introduction

Reservoir computing (RC), a well-established paradigm for training recurrent neural networks, is based on the idea of restricting learning to a perceptron-like read-out layer, while the hidden reservoir network is initialized with random connection strengths and remains fixed. The fixed reservoir can be understood as a random, temporal and nonlinear kernel [1] providing a suitable mixture of both spatial and temporal encoding of the input data in the network's hidden state space.

Figure 1: Key ingredients of RC.

This mixture rests on three key ingredients illustrated in Fig. 1: (i) the projection into a high-dimensional state space, (ii) the nonlinearity of the approach, and (iii) the recurrent connections in the reservoir. On the one hand, the advantages of a nonlinear projection into a high-dimensional space are beyond controversy: so-called kernel expansions rely on the concept of a nonlinear transformation of the original data into a high-dimensional feature space and the subsequent use of a simple, mostly linear, model. On the other hand, the recurrent connections implement a short-term memory by means of transient network states.

Due to this short-term memory, reservoir networks are typically utilized for temporal pattern processing such as time-series prediction, classification and generation [2]. One could argue that a short-term memory could also be implemented in a simpler fashion, e.g. by an explicit delay line. We point out, however, that it is the combination of spatial and temporal encoding that makes the reservoir approach powerful and can explain its impressive performance on various tasks [3, 4, 5]. Still, it remains unclear how the network dynamics influence the spatial encoding of inputs. Our hypothesis is that the dynamics of the reservoir network can enhance the spatial encoding of static inputs by means of a more nonlinear representation, which should consequently improve the task-specific performance. Moreover, we expect an improved performance when applying larger reservoirs, i.e. when using an increased dimensionality of the kernel expansion. Using attractor-based computation and purely static input patterns, we systematically test the contribution of the network dynamics to the spatial encoding independently of its temporal effects. A statistical analysis of the distribution of the network's attractor states allows us to assess the qualitative difference in encoding caused by the network's recurrence independently of the task-specific performance.

2 Attractor-based Computation with Reservoir Networks

We consider the three-layered network architecture depicted in Fig. 2, which comprises a recurrent hidden layer (reservoir) with a large set of nonlinear neurons. The input, reservoir and output neurons are denoted by x ∈ R^D, h ∈ R^N and y ∈ R^C, respectively. The reservoir state is governed by the discrete dynamics

h(t+1) = f( W^inp x(t) + W^res h(t) ),   (1)

where the activation functions f_i are applied componentwise. Typically, the reservoir neurons have sigmoidal activation functions such as f_i(x) = tanh(x), whereas the output layer consists of linear neurons, i.e. y(t) = W^out h(t). Learning in reservoir networks is restricted to the read-out weights W^out. All other weights are randomly initialized and remain fixed. In order to infer a desired input-to-output mapping from a set of training examples (x_k, y_k), k = 1, ..., K, the read-out weights W^out are adapted such that the mean square error is minimized. In this paper, we use a simple linear ridge regression method: for all inputs x_1, ..., x_K we collect the corresponding reservoir states h_k as well as the desired output targets y_k column-wise in a reservoir state matrix H ∈ R^(N×K) and a target matrix Y ∈ R^(C×K), respectively. The optimal read-out weights are then determined by the least squares solution with a regularization factor α ≥ 0: W^out = Y H^T (H H^T + α 1)^(-1), where 1 denotes the N×N identity matrix. The described network architecture in combination with the offline training by regression is often referred to as echo state network (ESN) [6]. The potential of the approach depends on the quality of the input encoding in the reservoir. To address that issue, Jaeger proposed to draw all weights from a random distribution, where often a sparsely connected reservoir is preferred. In addition, the reservoir weight matrix W^res is scaled to have a certain spectral radius λ_max.

Figure 2: Reservoir network.

Algorithm 1: Convergence algorithm.
Require: external input x_k
1: while Δh > δ and t < t_max do
2:   apply external input x(t) = x_k
3:   execute network iteration (1)
4:   compute state change Δh = ||h(t) − h(t−1)||_2
5:   t = t + 1
6: end while

There are two basic parameters involved in this procedure: the reservoir's weight connectivity or density 0 ≤ ρ ≤ 1 and the spectral radius λ_max, where λ_max is the largest absolute eigenvalue of W^res. The input weights W^inp are drawn from a uniform distribution in [−a, a]. In this paper, an attractor-based variant of the echo state approach is used, i.e. we map the inputs x_k to the reservoir's related attractor states h_k: the input neurons are clamped to the input pattern x_k until the network state change Δh = ||h(t+1) − h(t)||_2 approaches zero. This procedure is condensed in Alg. 1. As a prerequisite, the network must always converge to a fixed-point attractor, which relates to scaling the reservoir weights such that λ_max < 1. Note that an ESN with a spectral radius λ_max = 0 or with zero reservoir connectivity (ρ = 0) has no recurrent connections at all. The ESN then degenerates to a feedforward network with randomly initialized weights. In [7], this special case of RC has been called extreme learning machine (ELM). As our intention is to investigate the role of the recurrent reservoir connections, this feedforward approach is the natural non-recurrent baseline for our recurrent model, and we present all results in comparison to this non-dynamic model.

3 Key Ingredients of Reservoir Computing

We present test results concerning the influence of the key ingredients of RC on the network performance for several data sets (Tab. 1) in a static pattern recognition scenario. Except for Wine, none of the data sets is linearly separable; they thus constitute nontrivial classification tasks. The introduced models are used for classification of each data set. To this end, we represent class labels c by a 1-of-C coding in the target vector y such that y_c = 1 and y_i = −1 for all i ≠ c. For classification of a specific input pattern, we apply Alg. 1 and then read out the estimated class label ĉ from the network output y according to ĉ = argmax_i y_i. All results are obtained by either partitioning the data into several cross-validation sets or using an existing partition of the data into training and test sets, and are averaged over 100 different network initializations. We use normalized data in the range [−1, 1].
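To make the setup concrete, the following NumPy sketch puts Eq. (1), the fixed-point iteration of Alg. 1, and the ridge-regression readout together. It is a minimal illustration, not the authors' code: the function names, default parameters and convergence threshold are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_reservoir(D, N, density=1.0, spectral_radius=0.9, a=0.5):
    """Random input weights in [-a, a]; W_res sparsified to the given density and
    rescaled so that its largest absolute eigenvalue equals spectral_radius."""
    W_inp = rng.uniform(-a, a, size=(N, D))
    W_res = rng.uniform(-1.0, 1.0, size=(N, N)) * (rng.random((N, N)) < density)
    radius = np.max(np.abs(np.linalg.eigvals(W_res)))
    if radius > 0:
        W_res *= spectral_radius / radius   # lambda_max < 1 for convergence to a fixed point
    return W_inp, W_res

def attractor_state(x, W_inp, W_res, delta=1e-6, t_max=1000):
    """Alg. 1: clamp input x and iterate Eq. (1) until the state change drops below delta."""
    h = np.zeros(W_res.shape[0])
    for _ in range(t_max):
        h_new = np.tanh(W_inp @ x + W_res @ h)
        if np.linalg.norm(h_new - h) < delta:
            break
        h = h_new
    return h_new

def train_readout(H, Y, alpha=1e-3):
    """Ridge regression readout: W_out = Y H^T (H H^T + alpha*1)^(-1)."""
    return Y @ H.T @ np.linalg.inv(H @ H.T + alpha * np.eye(H.shape[0]))
```

Setting spectral_radius=0.0 or density=0.0 in this sketch yields the ELM baseline: W_res becomes the zero matrix and the attractor iteration settles after a single effective update.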

Role of Reservoir Size and Nonlinearity

Fig. 3 shows the impact of the reservoir size on the network's recognition rate for a fully connected reservoir, i.e. ρ = 1.0, with λ_max = 0.9 and α as in Tab. 1. The number of correctly classified samples increases strongly with the number of hidden neurons.

Figure 3: Classification performance depending on the reservoir size N (data sets: Iris, Ecoli, Olive, Wine).

On the one hand, this result shows that the projection of the input into a high-dimensional network state space is crucial for the reservoir approach: the performance of very small reservoir networks degrades to the performance of a linear model (LM). However, we observe a saturation of the performance for large reservoir sizes; it seems that the random projection cannot improve the separability of the inputs in the network state space any further. On the other hand, note that the nonlinear activation functions of the reservoir neurons are crucial as well: consider a network with linear activation functions; then the inputs are only transformed linearly into a high-dimensional representation. Hence, the read-out layer can only read from a linear transformation of the input, and the classification performance is thus not affected by the dimensionality of that representation. Consequently, the combination of a random expansion and the nonlinear activation functions is essential.

Role of Reservoir Dynamics

In this section, we focus on the role of the reservoir dynamics and restrict our studies to the Iris data set. We vary both the spectral radius λ_max of the reservoir matrix W^res and the density ρ for a fixed reservoir size of N = 50. Note again that we obtain an ELM for λ_max = 0 or ρ = 0.

Figure 4: Recognition rate for the Iris data set depending on λ_max and ρ.

Fig. 4 reveals that for recurrent networks the recognition rate increases significantly with the spectral radius λ_max and surpasses the performance of non-recurrent networks with the same parameter configuration. Interestingly enough, this is not true for the weight density in the reservoir: adding more than 10% connections between the hidden neurons has only marginal impact on the classification performance, i.e. additional connections neither improve nor deteriorate the performance. Note that this improvement comes at the cost of an increased number of iterations the network needs to settle in a stable state, which correlates with the spectral radius λ_max.
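A sweep of this kind can be reproduced in spirit with the helpers sketched above. The loop below is again only illustrative: it reuses our hypothetical make_reservoir, attractor_state and train_readout functions, loads Iris via scikit-learn for convenience, and reports accuracy on all samples rather than the paper's 10-fold cross-validation.

```python
import numpy as np
from sklearn.datasets import load_iris

# Illustrative sweep over spectral radius and density on Iris (cf. Fig. 4).
X, c = load_iris(return_X_y=True)
X = 2 * (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0)) - 1   # normalize to [-1, 1]
Y = -np.ones((3, len(c)))                                           # 1-of-C coding: y_c = 1, y_i = -1
Y[c, np.arange(len(c))] = 1

for spectral_radius in (0.0, 0.3, 0.6, 0.9):                        # 0.0 -> ELM baseline
    for density in (0.0, 0.1, 0.5, 1.0):                            # 0.0 -> ELM baseline
        W_inp, W_res = make_reservoir(D=4, N=50, density=density,
                                      spectral_radius=spectral_radius)
        H = np.column_stack([attractor_state(x, W_inp, W_res) for x in X])
        W_out = train_readout(H, Y, alpha=1e-3)
        pred = np.argmax(W_out @ H, axis=0)                         # c_hat = argmax_i y_i
        print(f"lambda_max={spectral_radius:.1f} rho={density:.1f} "
              f"accuracy={np.mean(pred == c):.3f}")
```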

4 On the Distribution of Attractor States

We give a possible explanation for the improved performance caused by the recurrent connections. Our hypothesis is that these connections spread out the network's attractors into a spatially broader distribution than a non-recurrent approach is capable of, which results in a more nonlinear hidden representation of the network's inputs. Because the networks use a linear read-out layer, we analyze that representation H with a linear method, namely principal component analysis.

Figure 5: Normalized cumulative energy content g(D) of the first D PCs (Iris, Ecoli, Olive, Wine, Optdigits, Shuttle).

Given the dimension D of the data, we expect the hidden representation to encode the input information with a significantly higher number of relevant principal components (PCs). Therefore, we calculate the shift of information or energy content from the first D PCs to the remaining N − D PCs. Let λ_1 ≥ ... ≥ λ_N ≥ 0 be the eigenvalues of the covariance matrix Cov(H). We calculate the normalized cumulative energy content of the first D PCs by

g(D) = (Σ_{i=1}^{D} λ_i) / (Σ_{i=1}^{N} λ_i),

which measures the relevance of the first D PCs. The case g(D) < 1 implies a shift of the input information onto additional PCs, because the encoded data then spans a space with more than D latent dimensions. If g(D) = 1, no shift of information content occurs, which is true for any linear transformation of the data.

Fig. 5 reveals that both approaches are able to encode the input data with more than D latent dimensions. In the case of an ELM, the information content shift is solely caused by the nonlinear activation functions. For recurrent networks, we observe the predicted effect: the cumulative energy content g(D) of the first D PCs of the attractor distribution is significantly lower for reservoir networks than for ELMs. That is, a reservoir network redistributes more of the information present in the input data onto the remaining N − D PCs than the feedforward approach. This effect, which is due to the recurrent connections, demonstrates the enhanced spatial encoding of inputs in reservoir networks and can explain the improved performance (cf. Fig. 4). However, we have to remark that the introduced measure g(D) does not strictly correlate with the task-specific performance. Although the ESN reassigns a greater amount of information content to the last N − D PCs than the ELM (cf. Fig. 5), this does not improve the generalization performance for every data set (cf. Tab. 1).
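As a sketch of this analysis (our own helper, assuming attractor states are collected column-wise in H as above), g(D) follows directly from the eigenvalues of the covariance matrix of H:

```python
import numpy as np

def cumulative_energy(H, D):
    """Normalized cumulative energy g(D) of the first D principal components.

    H: N x K matrix of attractor states (one state per column).
    Returns sum_{i<=D} lambda_i / sum_i lambda_i for the eigenvalues of Cov(H),
    sorted as lambda_1 >= ... >= lambda_N >= 0. A value below 1 indicates that
    the encoding spreads input information onto more than D principal components.
    """
    eigvals = np.sort(np.linalg.eigvalsh(np.cov(H)))[::-1]
    return eigvals[:D].sum() / eigvals.sum()
```

Comparing this quantity for the attractor states of a recurrent reservoir with that of its non-recurrent counterpart on the same inputs yields the kind of gap shown in Fig. 5.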

6 Christian Emmerich, R. Felix Reinhart, and Jochen J. Steil data set classification rate [%] network properties (L-fold cross-validation) properties D C K L LM N a α Iris [8] 4 3 150 10 83.3 88.9 ± 0.7 92.7 ± 2.1 50 0.5 0.001 Ecoli [8] 7 8 336 8 84.2 86.6 ± 0.5 86.4 ± 0.6 50 0.5 0.001 Olive [9] 8 9 572 11 82.7 95.3 ± 0.5 95.0 ± 0.7 50 0.5 0.001 Wine [8] 13 3 178 2 97.7 97.6 ± 0.7 96.9 ± 1.0 50 0.5 0.1 Optdigits [8] 64 10 5620-92.0 95.9 ± 0.4 95.8 ± 0.4 200 0.1 0.001 Statlog Shuttle [8] 9 7 58000-89.1 98.1 ± 0.2 99.2 ± 0.2 100 0.5 0.001 Table 1: Mean classification rates with standard deviations. 5 Conclusion We present an attractor-based implementation of the reservoir network approach for processing of static patterns. In order to investigate the effect of recurrence on the spatial input encoding, we systematically vary the respective network parameters and compare the recurrent reservoir approach to a related feedforward network. The inner reservoir dynamics result in an increased nonlinear representation of the input patterns in the network s attractor states which can be advantageous for the separability of patterns in terms of static pattern recognition. In temporal tasks that also require a suitable spatial encoding, the mixed spatio-temporal representation of inputs is crucial for the functioning of the reservoir approach. Incorporating the results reported in [3, 4, 5], we conclude that the spatial representation is not detoriated by the temporal component. References [1] D. Verstraeten and B. Schrauwen. On the Quantification of Dynamics in Reservoir Computing. Springer, pages 985 994, 2009. [2] H. Jaeger. Adaptive nonlinear system identification with echo state networks, volume 15. MIT Press, Cambridge, MA, 15 edition, 2002. [3] D. Verstraeten, B. Schrauwen, and D. Stroobandt. Reservoir-based techniques for speech recognition. In Proc. IJCNN, pages 1050 1053, 2006. [4] M. Ozturk and J. Principe. An associative memory readout for s with applications to dynamical pattern recognition. Neural Networks, 20(3):377 390, 2007. [5] P. Buteneers, B. Schrauwen, D. Verstraeten, and D. Stroobandt. Real-Time Epileptic Seizure Detection on Intra-cranial Rat Data Using Reservoir Computing. Advances in Neuro-Information Processing, pages 56 63, 2009. [6] H. Jaeger. The echo state approach to analysing and training recurrent neural networks, 2001. [7] G. Huang, Q. Zhu, and C. Siew. Extreme learning machine: Theory and applications. Neurocomputing, 70(1-3):489 501, 2006. [8] C. Blake, E. Keogh, and C. Merz. Uci repository of machine learning databases. http://archive.ics.uci.edu/ml/index.html, 1998. [9] M. Forina and C. Armanino. Eigenvector projection and simplified nonlinear mapping of fatty acid content of italian olive oils. Ann. Chem., (72):125 127, 1982.