Short Term Memory Quantifications in Input-Driven Linear Dynamical Systems

Peter Tiňo and Ali Rodan
School of Computer Science, The University of Birmingham
Birmingham B15 2TT, United Kingdom
E-mail: {P.Tino, a.a.rodan}@cs.bham.ac.uk

Abstract. We investigate the relation between two quantitative measures characterizing short term memory in input-driven dynamical systems, namely the short term memory capacity (MC) [2] and the Fisher memory curve (FMC) [1]. We show that under some assumptions the two quantities can be interpreted as squared Mahalanobis norms of images of the input vector under the system's dynamics and that, even though MC and FMC map the memory structure of the system from two quite different perspectives, they can be linked by a close relation.

1 Introduction

Input-driven dynamical systems play an important role as machine learning models when data sets exhibit temporal dependencies, e.g. in prediction or control. In an attempt to characterize dynamic properties of such systems, measures have been suggested to quantify how well past information can be represented in the system's internal state. In this contribution we investigate two such well known measures, namely the short term memory capacity spectrum MC_k [2] and the Fisher memory curve J(k) [1]. The two quantities map the memory structure of the system under investigation from two quite different perspectives. So far their relation has not been closely investigated. In this paper we take the first step to bridge this gap and show that under some conditions MC_k and J(k) can be related in an interpretable manner.

2 Background

We study linear input-driven state space models with N-dimensional state space and univariate inputs and outputs. Such systems can be represented e.g. by linear Echo State Networks (ESN) [3] with N recurrent (reservoir) units. The activations of the input, internal (state), and output units at time step t are denoted by s(t), x(t), and y(t), respectively. The input-to-recurrent and recurrent-to-output unit connections are given by N-dimensional weight vectors v and u, respectively; connections between the internal units are collected in an N x N weight matrix W. We assume there are no feedback connections from the output to the reservoir and no direct connections from the input to the output. Under these conditions, the reservoir units are updated according to

    x(t) = v s(t) + W x(t-1) + z(t),    (1)

where z(t) are zero-mean noise terms. The linear readout is computed as

    y(t) = u^T x(t),    (2)

where the reservoir activation vector is extended with a fixed element accounting for the bias term.
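To make the model concrete, here is a minimal numpy sketch of the state update (1) and the linear readout (2). It is an illustration under assumptions of our own rather than the authors' implementation: the reservoir size N = 15, the uniform weight ranges, the spectral-radius scaling of 0.9 (anticipating the scaling W <- alpha W / lambda_max discussed below) and the untrained random readout weights are all arbitrary choices, and the fixed bias element mentioned above is omitted.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 15                                   # number of reservoir units (arbitrary choice here)

# Weights are fixed prior to training, drawn uniformly from symmetric intervals
W = rng.uniform(-1.0, 1.0, (N, N))       # reservoir weight matrix
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # scale W to spectral radius 0.9
v = rng.uniform(-0.5, 0.5, N)            # input weight vector
u = rng.uniform(-0.5, 0.5, N)            # readout weights (would normally be trained)

def step(x_prev, s_t, eps=0.0):
    """State update x(t) = v s(t) + W x(t-1) + z(t), cf. eq. (1)."""
    z = np.sqrt(eps) * rng.standard_normal(N) if eps > 0 else 0.0
    return v * s_t + W @ x_prev + z

def readout(x_t):
    """Linear readout y(t) = u^T x(t), cf. eq. (2); the bias element is omitted."""
    return float(u @ x_t)

x = np.zeros(N)
for s_t in rng.uniform(-0.5, 0.5, 5):    # drive the reservoir with a short i.i.d. stream
    x = step(x, s_t)
    print(round(readout(x), 4))
```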

The output weights u are trained, either offline or online, by minimizing the normalized mean square error

    NMSE = ⟨ ||y(t) - τ(t)||_2^2 ⟩ / ⟨ ||τ(t) - ⟨τ(t)⟩||_2^2 ⟩,    (3)

where y(t) is the readout output, τ(t) is the desired output (target), ||·||_2 denotes the Euclidean norm and ⟨·⟩ denotes the empirical mean. In ESN, the elements of W and v are fixed prior to training with random values drawn from a uniform distribution over a (typically) symmetric interval. The reservoir connection matrix W is typically scaled as W <- α W / λ_max, where λ_max is the spectral radius of W and 0 < α < 1 is a scaling parameter [3].

Short Term Memory Capacity (MC): In [2], Jaeger quantified the ability of recurrent network architectures to represent past events through a measure correlating the past events in a (typically i.i.d.) input stream with the network output. In particular, the network (1) without dynamic noise (z(t) = 0) is driven by a univariate stationary input signal s(t). For a given delay k, we consider the network with optimal parameters for the task of outputting s(t-k) after seeing the input stream ... s(t-1) s(t) up to time t. The goodness of fit is measured in terms of the squared correlation coefficient between the desired output τ(t) = s(t-k) and the observed network output y(t):

    MC_k = Cov^2( s(t-k), y(t) ) / ( Var(s(t)) Var(y(t)) ),    (4)

where Cov and Var denote the covariance and variance operators, respectively. The short term memory (STM) capacity is then given by [2]

    MC = Σ_{k=1}^∞ MC_k.

Jaeger [2] proved that for any recurrent neural network with N recurrent neurons, under the assumption of an i.i.d. input stream, the STM capacity cannot exceed N.
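The memory capacity spectrum can be estimated empirically along the lines of definition (4). The sketch below is a rough illustration under assumptions of our own (reservoir size, stream length, washout period, uniform i.i.d. input and a plain least-squares readout fitted separately for each delay k); it is not the procedure behind any result reported in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, washout, alpha = 15, 20000, 200, 0.9     # illustrative choices, not taken from the paper

W = rng.uniform(-1.0, 1.0, (N, N))
W *= alpha / np.max(np.abs(np.linalg.eigvals(W)))   # W <- alpha * W / lambda_max
v = rng.uniform(-0.5, 0.5, N)

s = rng.uniform(-0.5, 0.5, T)                  # zero-mean i.i.d. input stream

# Collect the noiseless states x(t) = v s(t) + W x(t-1), cf. eq. (1) with z(t) = 0
X = np.zeros((T, N))
x = np.zeros(N)
for t in range(T):
    x = v * s[t] + W @ x
    X[t] = x

def mc_k(k):
    """Squared correlation between a least-squares readout and s(t-k), cf. eq. (4)."""
    Xk, target = X[washout + k:], s[washout:T - k]
    u, *_ = np.linalg.lstsq(Xk, target, rcond=None)
    y = Xk @ u
    return np.corrcoef(y, target)[0, 1] ** 2

mcs = [mc_k(k) for k in range(1, 2 * N + 1)]
print("MC_1..MC_5 ~", np.round(mcs[:5], 3))
print("partial sum over the first 2N delays ~", round(sum(mcs), 2), "(Jaeger's bound: N =", N, ")")
```

With such settings the partial sum should typically come out a little below N, consistent with Jaeger's bound.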

Fisher Memory Curve (FMC): Memory capacity MC represents one way of quantifying the amount of information that can be preserved in the reservoir about the past inputs. In [1] Ganguli, Huh and Sompolinsky proposed a different quantification of memory capacity for linear reservoirs corrupted by Gaussian state noise. In particular, it is assumed that the dynamic noise z(t) is a memoryless process of i.i.d. zero-mean Gaussian variables with covariance εI (I is the identity matrix). Then, given an input driving stream s(..t) = ... s(t-2) s(t-1) s(t), the dynamic noise induces a state distribution p(x(t)|s(..t)), which is a Gaussian with covariance [1]

    C = ε Σ_{l=0}^∞ W^l (W^T)^l.    (5)

The Fisher memory matrix quantifies the sensitivity of p(x(t)|s(..t)) with respect to small perturbations in the input driving stream s(..t) (parameters of the recurrent network are fixed),

    F_{k,l}(s(..t)) = E_{p(x(t)|s(..t))} [ - ∂^2 log p(x(t)|s(..t)) / ( ∂s(t-k) ∂s(t-l) ) ],

and its diagonal elements J(k) = F_{k,k}(s(..t)) quantify the information that x(t) retains about a change (e.g. a pulse) entering the network k time steps in the past. The collection of terms {J(k)}_{k=0}^∞ was termed the Fisher memory curve (FMC) and evaluated to [1]

    J(k) = v^T (W^T)^k C^{-1} W^k v.    (6)

Note that, unlike the short term memory capacity, it turns out that the FMC does not depend on the input driving stream.
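As a numerical illustration of eqs. (5)-(6) (ours, not the authors'), the noise covariance C can be approximated by truncating the series, after which the FMC terms follow by solving linear systems with C; the reservoir size, the noise variance ε = 0.1, the spectral radius and the truncation length below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
N, eps, rho, L = 15, 0.1, 0.9, 500      # size, noise variance, spectral radius, truncation (all illustrative)

W = rng.uniform(-1.0, 1.0, (N, N))
W *= rho / np.max(np.abs(np.linalg.eigvals(W)))
v = rng.uniform(-0.5, 0.5, N)

# C = eps * sum_{l>=0} W^l (W^T)^l, eq. (5), truncated (the series converges since rho < 1)
C = np.zeros((N, N))
M = np.eye(N)
for _ in range(L):
    C += M @ M.T
    M = W @ M
C *= eps

# Fisher memory curve J(k) = v^T (W^T)^k C^{-1} W^k v, eq. (6)
J, vk = [], v.copy()                    # vk holds W^k v, starting at k = 0
for k in range(10):
    J.append(vk @ np.linalg.solve(C, vk))
    vk = W @ vk
print(np.round(J, 4))                   # J(k) typically decays with the lag k
```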

3 Relation between short term memory capacity and Fisher memory curve

We first briefly introduce some necessary notation. Denote the image of the input weight vector v through k-fold application of the reservoir operator W by v^(k), i.e. v^(k) = W^k v. Define

    A = (1/ε) C - G,   where   G = Σ_{l=0}^∞ v^(l) (v^(l))^T.    (7)

Provided A is invertible, denote G (A^{-1} + G^{-1}) G by D. For any positive definite matrix B ∈ R^{n×n} we denote the induced norm on R^n by ||·||_B, i.e. for any v ∈ R^n, ||v||^2_B = v^T B v. We are now ready to formulate the main result.

Theorem: Let MC_k be the k-th memory capacity term (4) of network (1) with no dynamic noise, under a zero-mean i.i.d. input driving source. Let J(k) be the k-th term of the Fisher memory curve (6) of network (1) with i.i.d. dynamic noise of variance ε. If D is positive definite, then

    MC_k = ε J(k) + ||v^(k)||^2_{D^{-1}}    (8)

and MC_k > ε J(k), for all k > 0.

Proof: Given an i.i.d. zero-mean real-valued input stream s(..t) = ... s(t-2) s(t-1) s(t) of variance σ^2 emitted by a source P, the state at time t of the linear reservoir (under no dynamic noise, ε = 0) is

    x(t) = Σ_{l=0}^∞ s(t-l) W^l v = Σ_{l=0}^∞ s(t-l) v^(l).

For the task of recalling the input from k time steps back, the optimal least-squares readout vector u is given by

    u = R^{-1} p^(k),    (9)

where R = E_{P(s(..t))}[ x(t) x^T(t) ] = σ^2 G is the covariance matrix of reservoir activations and p^(k) = E_{P(s(..t))}[ s(t-k) x(t) ] = σ^2 v^(k). Provided R is of full rank, the optimal readout vector u^(k) for delay k reads

    u^(k) = G^{-1} v^(k).    (10)

The optimal recall output at time t is then y(t) = x^T(t) u^(k), yielding

    Cov( s(t-k), y(t) ) = σ^2 (v^(k))^T G^{-1} v^(k).    (11)

Since for the optimal recall output Cov( s(t-k), y(t) ) = Var( y(t) ) [2, 4], we have

    MC_k = (v^(k))^T G^{-1} v^(k).    (12)

The Fisher memory curve and memory capacity terms (6) and (12), respectively, have the same form. The matrix G = Σ_{l=0}^∞ v^(l) (v^(l))^T can be considered a scaled covariance matrix of the iterated images of v under the reservoir mapping. Then MC_k is the squared Mahalanobis norm of v^(k) under the covariance structure G,

    MC_k = ||v^(k)||^2_{G^{-1}}.    (13)
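The closed form (12)-(13) is straightforward to evaluate numerically. The sketch below uses arbitrary settings of our own, with a deliberately small reservoir because the truncated G can become badly conditioned for larger N; it also checks that Σ_{k=0}^∞ MC_k = trace(G^{-1} G) = N, an immediate consequence of (12) and the definition of G whenever G is invertible.

```python
import numpy as np

rng = np.random.default_rng(3)
N, rho, L = 6, 0.9, 400                 # small reservoir keeps the truncated G well conditioned (our choice)

W = rng.uniform(-1.0, 1.0, (N, N))
W *= rho / np.max(np.abs(np.linalg.eigvals(W)))
v = rng.uniform(-0.5, 0.5, N)

# G = sum_{l>=0} v^(l) (v^(l))^T with v^(l) = W^l v, truncated (cf. eq. (7))
G = np.zeros((N, N))
vl = v.copy()
for _ in range(L):
    G += np.outer(vl, vl)
    vl = W @ vl

# Closed-form memory capacity terms MC_k = (v^(k))^T G^{-1} v^(k), eqs. (12)-(13)
mc, vk = [], v.copy()
for k in range(200):
    mc.append(vk @ np.linalg.solve(G, vk))
    vk = W @ vk

print("MC_1..MC_5:", np.round(mc[1:6], 3))
print("sum over k >= 0 of MC_k ~", round(sum(mc), 3), " (equals trace(G^{-1} G) = N =", N, ")")
```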

Analogously, J(k) is the squared Mahalanobis norm of v^(k) under the covariance C of the state distribution p(x(t)|s(..t)) induced by the dynamic noise z(t),

    J(k) = (v^(k))^T C^{-1} v^(k) = ||v^(k)||^2_{C^{-1}}.    (14)

Denote the rank-1 matrix v v^T by Q. Then, by (5),

    (1/ε) C = A + G,   where   A = Σ_{l=0}^∞ W^l (I - Q) (W^T)^l.

It follows that ε C^{-1} = (A + G)^{-1} and, provided A is invertible (and (A^{-1} + G^{-1}) is invertible as well), by the matrix inversion lemma,

    ε C^{-1} = G^{-1} - G^{-1} (A^{-1} + G^{-1})^{-1} G^{-1}.

We therefore have

    J(k) = (v^(k))^T C^{-1} v^(k) = (1/ε) MC_k - (1/ε) (v^(k))^T D^{-1} v^(k),

where D = G (A^{-1} + G^{-1}) G. Since G and A are symmetric matrices, so are their inverses, and hence D is also a symmetric matrix. Provided D is positive definite, it can be considered (the inverse of) a metric tensor and

    MC_k = ε J(k) + ||v^(k)||^2_{D^{-1}}.

Obviously, in such a case, MC_k > ε J(k) for all k > 0.

From (8) we have Σ_{k=0}^∞ MC_k = ε Σ_{k=0}^∞ J(k) + Σ_{k=0}^∞ ||v^(k)||^2_{D^{-1}}. If the input weight vector v is a unit vector (||v||_2 = 1) and the reservoir matrix W is normal (i.e. has an orthogonal eigenvector basis), we have Σ_{k=0}^∞ J(k) = 1/ε [1]. In such cases Σ_{k=0}^∞ MC_k = N, implying

    Σ_{k=0}^∞ ||v^(k)||^2_{D^{-1}} = N - 1.    (15)
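The decomposition (8) can be checked numerically. The sketch below is our own verification under convenient assumptions: a small reservoir (so the truncated matrices stay well conditioned) and ||v||_2 < 1 (so that A, and with it D, is positive definite once G is). It builds G, A, C and D from truncated series and compares MC_k with ε J(k) + ||v^(k)||^2_{D^{-1}} for the first few lags.

```python
import numpy as np

rng = np.random.default_rng(4)
N, eps, rho, L = 6, 0.1, 0.9, 400       # reservoir size, noise variance, spectral radius, truncation (ours)

W = rng.uniform(-1.0, 1.0, (N, N))
W *= rho / np.max(np.abs(np.linalg.eigvals(W)))
v = rng.uniform(-1.0, 1.0, N)
v *= 0.9 / np.linalg.norm(v)            # ||v|| < 1 makes A (below) safely positive definite

Q = np.outer(v, v)                      # rank-1 matrix v v^T
G = np.zeros((N, N))                    # G = sum_l v^(l) (v^(l))^T
A = np.zeros((N, N))                    # A = sum_l W^l (I - Q) (W^T)^l
M = np.eye(N)
for _ in range(L):
    G += M @ Q @ M.T
    A += M @ (np.eye(N) - Q) @ M.T
    M = W @ M
C = eps * (A + G)                       # (1/eps) C = A + G, cf. eqs. (5) and (7)

D = G @ (np.linalg.inv(A) + np.linalg.inv(G)) @ G   # D = G (A^{-1} + G^{-1}) G

vk = W @ v                              # v^(k) = W^k v, starting at k = 1
for k in range(1, 6):
    mc  = vk @ np.linalg.solve(G, vk)   # MC_k, eq. (12)
    jk  = vk @ np.linalg.solve(C, vk)   # J(k), eq. (6)/(14)
    rhs = eps * jk + vk @ np.linalg.solve(D, vk)
    print(k, round(mc, 4), round(rhs, 4))   # the two columns agree up to truncation/round-off
    vk = W @ vk
```

With such settings MC_k exceeds ε J(k) at every lag, as the theorem asserts whenever D is positive definite.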

As an example of the metric structures underlying the norms in (8), (13) and (14), we show in Figure 1 the covariance structure of C (with ε = 1), G and D corresponding to a 15-node linear reservoir. The covariances were projected onto the two-dimensional space spanned by the 1st and 14th eigenvectors of C (rank determined by decreasing eigenvalues). Reservoir weights were randomly generated from a uniform distribution over an interval symmetric around zero and W was then normalized to spectral radius 0.995. Input weights were generated from a uniform distribution over [-0.5, 0.5].

Fig. 1: Covariance structure of C (left), G (middle) and D (right) for a 15-node linear reservoir, projected onto the 1st and 14th eigenvectors of C. Shown are iso-lines corresponding to 0.5, 1, 1.5, ..., 3 standard deviations.

4 Conclusions

We investigated the relation between two quantitative measures suggested in the literature to characterize short term memory in input-driven dynamical systems, namely the short term memory capacity spectrum MC_k and the Fisher memory curve J(k), for time lags k >= 0. J(k) is independent of the input driving stream s(..t) and measures the inherent memory capabilities of such systems by measuring the sensitivity of the state distribution p(x(t)|s(..t)), induced by the dynamic noise, with respect to perturbations in s(..t) occurring k time steps back. On the other hand, MC_k quantifies how well the past inputs s(t-k) can be reconstructed by linearly projecting the state vector x(t). We have shown that under some assumptions the two quantities can be interpreted as squared Mahalanobis norms of images of the input vector under the system's dynamics and that MC_k > ε J(k) for all k > 0. Even though MC_k and J(k) map the memory structure of the system under investigation from two quite different perspectives, they can be closely related.

References

[1] S. Ganguli, D. Huh, and H. Sompolinsky. Memory traces in dynamical systems. Proceedings of the National Academy of Sciences, 105:18970-18975, 2008.
[2] H. Jaeger. Short term memory in echo state networks. Technical Report GMD Report 152, German National Research Center for Information Technology, 2002.
[3] M. Lukosevicius and H. Jaeger. Reservoir computing approaches to recurrent neural network training. Computer Science Review, 3(3):127-149, 2009.
[4] A. Rodan and P. Tino. Minimum complexity echo state network. IEEE Transactions on Neural Networks, 22(1):131-144, 2011.