Learning from Temporal Data Using Dynamical Feature Space


Learning from Temporal Data Using Dynamical Feature Space

Peter Tiňo
School of Computer Science, University of Birmingham, UK

Special thanks to... Barbara Hammer, Huanhuan Chen, Alessio Micheli, Michal Čerňanský, Luba Beňúšková, Jort van Mourik, Igor Farkaš, Ali Rodan, Jochen Steil, Xin Yao, Nick Gianniotis, Dan Conford, Manfred Opper, Fengzhen Tang, ...

Learning in the data or the model space?
Complex data, e.g.
- long (possibly multivariate) time series of variable length (video, sensor data, etc.)
- fMRI data measured on normalized cognitive tasks
- ...
We more or less know what to do for fixed-dimensional vectorial data, but how should we represent such complex data in order to usefully "learn" from it?
The critical notion of "smoothness" when learning from historic data - "close" things should lead to the same response.
"Closeness" - distance, dot product, ...
How do we represent complex data items in the light of the "smoothness criterion"?

Learning in the Model Space Framework
- Each data item is represented by a model that "explains" it.
- Learning is formulated in the space of models (a function space).
- Parametrization-free notion of "input similarity": similarity is quantified between the models, not their (arbitrary) parametrizations!
- Key consideration - the base model class:
  - flexible enough to represent a variety of data items
  - sufficiently constrained to avoid overfitting

NARMA task - illustrative example
NARMA sequences of orders 10, 20 and 30, represented as state space models.
[Figure: NARMA time series of orders 10, 20 and 30 (left) and their images under the state space model representation (right).]

Representing Sequential Data
Each sequence will be represented by a dedicated model that captures everything important about the sequence in a compact manner.
If we'd like to change focus to different aspects of the sequences, we need to change the base model accordingly.
What should our base model be? There are many possible choices, but here we'll consider state space models - a very general and powerful model class capturing temporal structure through the notion of information processing state (IPS).
The IPS at time t codes for the entire history of time series items we have seen up to time t.

Discrete Time State Space Model
[Figure: a parametrized dynamical system (DS) with a delay feedback loop, driven by the input symbol stream ... 1 2 2 3 2 1 1 1 2 and followed by an output readout.]
- States of the DS form the IPS.
- These states code the entire history of sequence elements we have seen so far.

A State Space Model
N-dimensional continuous state space [-1,1]^N:

x(t) = f(x(t-1), u(t)) = σ(R x(t-1) + V u(t) + b),
y(t) = h(x(t)) = κ(W x(t) + a),

- x(t) ∈ [-1,1]^N - state vector at time t
- u(t) - input time series element at time t
- f(·) - state-transition function
- y(t) - output of the readout from the state space
- h(·) - readout (output) function
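As a concrete illustration, here is a minimal NumPy sketch of one step of such a state space model, with a tanh state transition σ and a linear readout (κ taken as the identity). All names and parameter values here are illustrative choices, not taken from the talk.

```python
import numpy as np

def reservoir_step(x_prev, u, R, V, b):
    # State transition: x(t) = sigma(R x(t-1) + V u(t) + b), with sigma = tanh
    return np.tanh(R @ x_prev + V @ u + b)

def readout(x, W, a):
    # Linear readout: y(t) = W x(t) + a (kappa = identity)
    return W @ x + a

# Tiny example: N = 50 state units driven by a scalar input series
rng = np.random.default_rng(0)
N = 50
R = rng.uniform(-1, 1, (N, N))
R *= 0.9 / np.max(np.abs(np.linalg.eigvals(R)))   # scale to spectral radius 0.9
V = rng.uniform(-1, 1, (N, 1))
b = np.zeros(N)
W, a = rng.normal(size=(1, N)), np.zeros(1)

x = np.zeros(N)
for u_t in rng.uniform(0, 0.5, 200):
    x = reservoir_step(x, np.array([u_t]), R, V, b)
    y = readout(x, W, a)
```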

Another State Space Model - a Transducer
Translate input streams over {a,b} to sequences over {x,y}.
[Figure: a two-state finite-state transducer with edges labelled by input/output pairs a/x, b/y, a/x, b/x.]

Picking the Right Base Model
An appealing feature of the previous example: the state space structure and dynamics are perfectly tailored to the particular task.
Of course, in our general scenario we may not know (and typically we do not!) the right state space structure that extracts from the data (time series) what is needed for the particular task:
- What are the appropriate IPS?
- What is their transition structure?
Since we are ML people, the answer seems obvious: learn everything!
... easier said than done ...

Picking the Right Base Model
Things are actually much worse!
If we adopted a general parametrized model

x(t) = f(x(t-1), u(t)) = σ(R x(t-1) + V u(t) + b),
y(t) = h(x(t)) = κ(W x(t) + a),

we may have problems fitting it to individual sequences (the information latching problem).
Worse still - even if we managed to do perfect model fitting, so that each sequence is nicely represented - how do we define a distance between state space models, i.e. input-driven dynamical systems?
This is highly non-trivial, as we know from the work in the machine vision and video processing communities (low-dimensional, mostly linear dynamical systems).

Our Proposal
Interpret the state space model as a "dynamical kernel machine":

x(t) = f(x(t-1), u(t)) = σ(R x(t-1) + V u(t) + b),
y(t) = h(x(t)) = W x(t) + a.

At time t the input item u(t) is mapped to the feature space - the state space of the dynamical system - and its feature space image is the state x(t).
The feature mapping is dynamic - besides providing an image x(t) of u(t), it also reflects the history of inputs up to time t.
Once in the feature space, the representations are rich enough to allow for a convenient use of linear tools!

Dynamic Feature Space
A "dynamic kernel trick":
- fixed, "general purpose" state space organization
- "redundant" high-dimensional state (feature) space
- task-dependent linear readout mapping
All the "hard" work will be done in the feature space creation.
How can the idea of a "general purpose" dynamic feature space be realized?
Let's get inspiration from information theory ("general purpose" information sources).

Simplest Case - Observable Markovian IPS
The simplest construction of IPS is based on concentrating on the very recent (finite) past, e.g. definite memory machines, finite memory machines, Markov models.
Example: only the last 5 input symbols matter when reasoning about what symbol comes next.

... 1 2 1 1 2 3 2 1 2 1 2 4 3 2 1 1
... 2 1 1 1 1 3 4 4 4 4 1 4 3 2 1 1
... 3 3 2 2 1 2 1 2 2 1 3 4 3 2 1 1

All three sequences belong to the same IPS 43211.

Probabilistic framework - Markov model (MM)
Finite context-conditional next-symbol distributions:

... 1 2 1 1 2 3 2 1 2 1 2 4 3 2 1 1 ?
... 2 1 1 1 1 3 4 4 4 4 1 4 3 2 1 1 ?
... 3 3 2 2 1 2 1 2 2 1 3 4 3 2 1 1 ?

P(s | 11111), P(s | 11112), P(s | 11113), ..., P(s | 11121), ..., P(s | 43211), ..., P(s | 44444),    s ∈ {1,2,3,4}
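A toy sketch of how such context-conditional next-symbol distributions could be estimated by simple counting (a fixed-order Markov model); the function name and the data are made up for illustration. It also hints at the problem raised on the next slide: with alphabet size 4 and memory length 5 there are already 4^5 = 1024 possible contexts.

```python
from collections import Counter, defaultdict

def fit_markov(seq, order=5, alphabet=(1, 2, 3, 4)):
    """Estimate context-conditional next-symbol distributions P(s | last `order` symbols)."""
    counts = defaultdict(Counter)
    for t in range(order, len(seq)):
        context = tuple(seq[t - order:t])
        counts[context][seq[t]] += 1
    # Normalise counts into distributions, one per observed context
    return {c: {s: counts[c][s] / sum(counts[c].values()) for s in alphabet}
            for c in counts}

seq = [1, 2, 1, 1, 2, 3, 2, 1, 2, 1, 2, 4, 3, 2, 1, 1] * 50
model = fit_markov(seq, order=5)
print(model[(4, 3, 2, 1, 1)])   # next-symbol distribution for context 43211
```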

Markov model
MMs are intuitive and simple, but...
- The number of IPS (prediction contexts) grows exponentially fast with the memory length.
- Large alphabets and long memories are infeasible: computationally demanding, and very long sequences are needed to obtain sound statistical estimates of the free parameters.

Variable memory length MM (VLMM)
- A sophisticated implementation of potentially high-order MM.
- Takes advantage of the subsequence structure in the data.
- Saves resources by making the memory depth context dependent - use deep memory only when it is needed.
- Natural representation of IPS in the form of prediction suffix trees.
- Closely linked to the idea of universal simulation of information sources.

Prediction suffix tree (PST)
- IPS are organized on the nodes of a PST.
- Traverse the tree from the root towards the leaves, reading the sequence in reversed order.
- Each node i has an associated next-symbol probability distribution P(· | i).
- Training can be complex and potentially time-consuming.
[Figure: a PST over {1,2}; example nodes for contexts 12, 122 and 222 carry the distributions P(· | 12), P(· | 122) and P(· | 222).]

Contractive DS lead to Markovian IPS
[Figure: iterated application of contractive maps - contracting under symbol 1, then symbol 4, then symbol 2 - produces nested regions of the state space indexed by the suffixes *1, *14, *142.]

Contractive DS lead to Markovian IPS
- Mapping symbolic sequences into Euclidean space.
- The coding strategy is Markovian: sequences with shared histories of recently observed symbols have close images.
- The longer the common suffix of two sequences, the closer they are mapped to each other.
- Add a simple output-generating readout on top of the fixed state dynamics!

More Formally...
Given an input symbol s ∈ A at time t and the activations of the recurrent units from the previous step, x(t-1),

x(t) = f_s(x(t-1)).

This notation can be extended to sequences over A: the recurrent activations after t time steps are

x(t) = f_{s_1...s_t}(x(0)).

If the maps f_s are contractions with Lipschitz constant 0 < C < 1, then for sequences v, w sharing a common suffix q,

‖f_{vq}(x) - f_{wq}(x)‖ ≤ C^{|q|} ‖f_v(x) - f_w(x)‖ ≤ C^{|q|} diam(X).
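The suffix-based organization of the state space can be seen in a few lines of NumPy: assign each symbol a random contractive affine map and iterate. This is only an illustrative sketch (the maps, the 0.4 scaling and the example sequences are arbitrary choices), not the exact construction used in the talk.

```python
import numpy as np

rng = np.random.default_rng(1)
N, alphabet = 2, [1, 2, 3, 4]
# One contractive affine map per symbol: f_s(x) = C_s x + t_s, with ||C_s|| < 1
maps = {s: (0.4 * rng.uniform(-1, 1, (N, N)), rng.uniform(-1, 1, N)) for s in alphabet}

def encode(seq):
    # Iterated application of the symbol maps: x(t) = f_{s_t}(x(t-1)), x(0) = 0
    x = np.zeros(N)
    for s in seq:
        C, t = maps[s]
        x = C @ x + t
    return x

a = [3, 1, 2, 2, 4, 3, 2, 1, 1]
b = [1, 4, 4, 1, 4, 3, 2, 1, 1]   # shares the suffix 4 3 2 1 1 with a
c = [3, 1, 2, 2, 1, 1, 2, 3, 4]   # no common suffix with a
# The first distance is bounded by C^5 * diam(X); the second only by diam(X)
print(np.linalg.norm(encode(a) - encode(b)), np.linalg.norm(encode(a) - encode(c)))
```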

Theoretical grounding
Theorem (Hammer & Tiňo): Recurrent networks with contractive transition function can be approximated arbitrarily well on input sequences of unbounded length by a definite memory machine. Conversely, every definite memory machine can be simulated by a recurrent network with contractive transition function.
Hence, initialization with small weights induces an architectural bias into learning with recurrent neural networks.

Learnability (PAC framework)
The architectural bias emphasizes one possible region of the weight space where generalization ability can be formally proved.
Standard RNNs are not distribution-independent learnable in the PAC sense if arbitrary precision and inputs are considered.
Theorem (Hammer & Tiňo): Recurrent networks with contractive transition function with a fixed contraction parameter fulfill the distribution-independent UCED property and so, unlike general recurrent networks, are distribution-independent PAC-learnable.

State space complexity vs. sequence complexity
The complexity of the driving input stream (topological entropy) is directly reflected in the complexity of the state space activations (fractal dimension).
Theorem (Tiňo & Hammer): Recurrent activations inside a contractive RNN form fractal clusters, the dimension of which can be bounded by the scaled entropy of the underlying driving source. The scaling factors are fixed and given by the RNN parameters.

Fix the DS to a contractive system!
[Figure: a non-autonomous DS with delay feedback, driven by the symbol stream ... 1 2 2 3 2 1 1 1 2; a "smooth" readout map takes the IPS to the output.]
- Fix the DS part to a randomly constructed contraction.
- Only train the simple linear readout (output map). This can be done efficiently.
- Reservoir computation models.

Simple, non-randomized reservoirs
[Figure: three simple reservoir topologies; each has an input unit s(t) connected through input weights V to a dynamical reservoir of N internal units x(t) with internal weights W, and readout weights U to the output unit y(t).]

Cycle reservoir with regular jumps
Still a simple, deterministic construction...
[Figure: two cycle-reservoir-with-jumps (CRJ) topologies with N = 18 units; cycle connections have weight r_c, regular jump connections have weight r_j.]

Reservoir characterizations
Memory capacity [Jaeger]: a measure correlating past events in an i.i.d. input stream with the network output.
Assume the network is driven by a univariate stationary input signal u(t).
For a given delay k > 0, construct the network with optimal parameters for the task of outputting u(t-k) after seeing the input stream ... u(t-1) u(t) up to time t.
Goodness of fit: the squared correlation coefficient between the desired output u(t-k) and the observed network output y(t).

Memory capacity

MC_k = Cov²(u(t-k), y(t)) / ( Var(u(t)) · Var(y(t)) ),

where Cov denotes the covariance and Var the variance operator. The short term memory (STM) capacity is then given by

MC = Σ_{k=1}^{∞} MC_k.

[Jaeger] For any recurrent neural network with N recurrent neurons, under the assumption of an i.i.d. input stream, the STM capacity cannot exceed N.

Memory capacity of SCR
[Tiňo & Rodan] Under the assumption of a zero-mean i.i.d. input stream, the memory capacity of the linear SCR architecture with N reservoir units can be made arbitrarily close to N.
Theorem (Tiňo & Rodan): Consider a linear SCR network with reservoir weight 0 < r < 1 and an input weight vector such that the reservoir activations are full rank. Then the SCR network memory capacity is

MC = N - (1 - r^{2N}).
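A rough NumPy sketch of how MC can be estimated empirically for a linear SCR, training one least-squares readout per delay k and summing the squared correlations. The helper name, parameter values and ridge term are illustrative, and finite data means the estimate only approximates the theoretical value above.

```python
import numpy as np

def memory_capacity(R, v, T=6000, washout=200, k_max=100, ridge=1e-10):
    """Empirical MC = sum_k MC_k for a linear reservoir x(t) = R x(t-1) + v u(t)."""
    rng = np.random.default_rng(0)
    N = R.shape[0]
    u = rng.uniform(-0.5, 0.5, T)                    # zero-mean i.i.d. input
    X, x = np.zeros((T, N)), np.zeros(N)
    for t in range(T):
        x = R @ x + v * u[t]                         # linear state transition
        X[t] = x
    mc = 0.0
    for k in range(1, k_max + 1):
        Xk, yk = X[washout:], u[washout - k:T - k]   # target: u(t - k)
        w = np.linalg.solve(Xk.T @ Xk + ridge * np.eye(N), Xk.T @ yk)
        c = np.corrcoef(yk, Xk @ w)[0, 1]
        mc += c ** 2                                  # MC_k = squared correlation
    return mc

# Simple cycle reservoir (SCR): a single unidirectional cycle with weight r
N, r = 20, 0.9
R = np.zeros((N, N))
for i in range(N):
    R[i, (i - 1) % N] = r
v = 0.1 * np.sign(np.random.default_rng(1).uniform(-1, 1, N))
print(memory_capacity(R, v))   # should come out close to N - (1 - r^(2N))
```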

Reservoir characterizations
Eigen-decomposition of the reservoir matrix.
[Figure: eigenvalue spectra in the complex plane for the ESN, SCR, CRJ and CRHJ reservoir matrices.]

Reservoir characterizations
Memory capacity.
[Figure: MC_k as a function of the delay k (log scale) for SCR, CRJ and ESN reservoirs; four panels (A)-(D).]

Reservoir characterizations
Pseudo-Lyapunov exponents.
[Figure: pseudo-Lyapunov exponent as a function of the scaling parameter (0 to 2) for ESN, SCR and CRJ on the NARMA task and the laser task.]

Time Series Kernel via Reservoir Models
1. For all sequences - the same fixed state transition (reservoir) part (deterministic and fixed).
2. For each time series, train the linear readout on the prediction task (only the readout is trained).
3. Gaussian kernel in the readout function space.
[Figure: a CRJ-style reservoir (cycle weights r_c, jump weights r_j) feeding a readout mapping; each time series becomes a point in the readout mapping space.]

Distance in the Readout Function Space
Two readouts h_1 and h_2 obtained on two sequences:

L_2(h_1, h_2) = ( ∫_C ‖h_1(x) - h_2(x)‖² dμ(x) )^{1/2},

where μ(x) is the probability density measure on the input domain and C is the integration domain. C = [-1,+1]^N, as we use the recurrent neural network reservoir formulation with the tanh transfer function.

Model Distance with Uniform Measure
Consider two readouts from the same reservoir,

h_1(x) = W_1 x + a_1,   h_2(x) = W_2 x + a_2.

Since the sigmoid activation function is employed, the domain of the readout is C = [-1,1]^N. With the uniform measure on C,

L_2(h_1, h_2) = ( ∫_C ‖h_1(x) - h_2(x)‖² dμ(x) )^{1/2}
              = ( (2^N / 3) Σ_{j=1}^{N} Σ_{i=1}^{O} w_{i,j}² + 2^N ‖a‖² )^{1/2},

where w_{i,j} is the (i,j)-th element of W_1 - W_2 and a = a_1 - a_2.
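A quick numerical sanity check of this closed form - a sketch under the slide's assumptions (linear readouts, unnormalised uniform measure on C = [-1,1]^N), with toy dimensions chosen only to keep the Monte Carlo estimate cheap.

```python
import numpy as np

rng = np.random.default_rng(0)
N, O = 3, 2                      # small state dim so the Monte Carlo check is cheap
W1, a1 = rng.normal(size=(O, N)), rng.normal(size=O)
W2, a2 = rng.normal(size=(O, N)), rng.normal(size=O)
dW, da = W1 - W2, a1 - a2

# Closed form under the (unnormalised) uniform measure on C = [-1, 1]^N
L2_sq_closed = (2 ** N / 3) * np.sum(dW ** 2) + (2 ** N) * np.sum(da ** 2)

# Monte Carlo check: average over uniform samples, times the volume 2^N
X = rng.uniform(-1, 1, size=(200000, N))
diff = X @ dW.T + da             # h1(x) - h2(x) for each sample
L2_sq_mc = (2 ** N) * np.mean(np.sum(diff ** 2, axis=1))

print(L2_sq_closed, L2_sq_mc)    # the two values should roughly agree
```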

Model Distance vs. Parameter Distance
The squared model distance, up to the constant factor 2^N, is

(1/3) Σ_{j=1}^{N} Σ_{i=1}^{O} w_{i,j}² + ‖a‖²,

whereas the squared Euclidean distance of the readout parameters is

Σ_{j=1}^{N} Σ_{i=1}^{O} w_{i,j}² + ‖a‖².

Hence, compared with the parameter distance, more importance is given to the offset than to the orientation of the readout mapping.

Non-uniform State Distribution
For a K-component Gaussian mixture model

μ(x) = Σ_{i=1}^{K} α_i μ_i(x | η_i, Σ_i),

the model distance can be obtained in closed form:

L_2²(h_1, h_2) = Σ_{k=1}^{K} α_k [ trace(Wᵀ W Σ_k) + aᵀ a + η_kᵀ Wᵀ W η_k + 2 aᵀ W η_k ],

with W = W_1 - W_2 and a = a_1 - a_2.
Sampling approximation:

L_2²(h_1, h_2) ≈ (1/m) Σ_{i=1}^{m} ‖h_1(x(i)) - h_2(x(i))‖².

Three Time Series Kernels
- Uniform state distribution - reservoir kernel: RV
- Non-uniform state distribution:
  - reservoir kernel by sampling: SamplingRV
  - reservoir kernel by Gaussian mixture models: GMMRV

K(s_i, s_j) = exp{ -γ L_2²(h_i, h_j) }
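Putting the pieces together, a minimal sketch of the RV kernel: drive one fixed reservoir (R, v, b) with every sequence, fit a per-sequence linear readout by ridge regression on the one-step-ahead prediction task, and build a Gaussian kernel from the squared readout distances. Names such as `fit_readout` and `rv_kernel` are illustrative; the constant 2^N factor of the uniform-measure formula is dropped here, since it is shared by all pairs and can be absorbed into γ.

```python
import numpy as np

def drive_reservoir(u, R, v, b):
    """Collect reservoir states x(1..l) for a univariate series u(1..l)."""
    x, X = np.zeros(R.shape[0]), []
    for u_t in u:
        x = np.tanh(R @ x + v * u_t + b)
        X.append(x.copy())
    return np.array(X)

def fit_readout(u, R, v, b, ridge=1e-6):
    """Readout (W, a) trained to predict u(t+1) from x(t) for one sequence."""
    X = drive_reservoir(u, R, v, b)[:-1]
    y = u[1:]
    Xb = np.hstack([X, np.ones((len(X), 1))])          # absorb the bias a
    wa = np.linalg.solve(Xb.T @ Xb + ridge * np.eye(Xb.shape[1]), Xb.T @ y)
    return wa[:-1], wa[-1]                              # W (a row vector here) and a

def rv_kernel(series_list, R, v, b, gamma=1.0):
    readouts = [fit_readout(u, R, v, b) for u in series_list]
    K = np.zeros((len(series_list), len(series_list)))
    for i, (Wi, ai) in enumerate(readouts):
        for j, (Wj, aj) in enumerate(readouts):
            # squared model distance under the uniform measure on [-1, 1]^N
            d2 = (1 / 3) * np.sum((Wi - Wj) ** 2) + (ai - aj) ** 2
            K[i, j] = np.exp(-gamma * d2)
    return K
```

The same reservoir - e.g. a deterministic SCR or CRJ topology - is shared across all sequences; only the readouts differ, and the resulting kernel matrix can be passed to any kernel machine (SVM, SVR, one-class SVM) as a precomputed kernel.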

Fisher Kernel
A single model for all data items; based on information geometry.
Fisher vector (score function):

U(s) = ∇_λ log p(s | λ).

Metric tensor - the Fisher information matrix F:

K(s_i, s_j) = U(s_i)ᵀ F⁻¹ U(s_j).

Fisher Kernel Based on Reservoir Model
Endowing the readout with a noise model yields a generative time series model of the form

x(t) = f(u(t), x(t-1)),
u(t+1) = W x(t) + a + ε(t).

Assume the noise follows a Gaussian distribution, ε(t) ~ N(0, σ² I).

Fisher Kernel by Reservoir Models
Partial derivatives of the log-likelihood log p(u(1..l)) with respect to the readout weights:

U = ∂ log p(u(1..l)) / ∂W
  = ∂/∂W Σ_{t=1}^{l} log P(u(t) | u(1..t-1))
  = Σ_{t=1}^{l} [ (u(t) - a) x(t-1)ᵀ - W x(t-1) x(t-1)ᵀ ] / σ².

Fisher kernel (simplified in the usual way - identity matrix instead of F) for two time series s_i and s_j with scores U_i and U_j:

K(s_i, s_j) = Σ_{o=1}^{O} Σ_{n=1}^{N} (U_i)_{o,n} (U_j)_{o,n}.
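A minimal sketch of the score computation for a univariate series (O = 1), assuming that W and a come from a single readout model shared by all sequences, as the Fisher-kernel setting requires; the function names are illustrative.

```python
import numpy as np

def fisher_score(u, R, v, b, W, a, sigma2=1.0):
    """Score U = d log p(u(1..l)) / dW for the Gaussian-readout reservoir model."""
    N = R.shape[0]
    x = np.zeros(N)
    U = np.zeros((1, N))                       # univariate output here (O = 1)
    for t in range(len(u) - 1):
        x = np.tanh(R @ x + v * u[t] + b)      # x(t)
        resid = u[t + 1] - (W @ x + a)         # u(t+1) - W x(t) - a
        U += np.outer(resid, x) / sigma2
    return U

def fisher_rv_kernel(scores_i, scores_j):
    # Simplified Fisher kernel: identity in place of the Fisher information matrix
    return np.sum(scores_i * scores_j)
```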

Other time series kernels
Kernel based on dynamic time warping (DTW):
- DTW tries to warp the time axis of one (or both) sequences to achieve a better alignment.
- DTW can generate unintuitive alignments by mapping a single point of one time series onto a large subsection of another, which can lead to inferior results.
- Efficient versions of DTW-based kernels are available.

Other time series kernels
Autoregressive kernel:
- A VAR model class of a given order is used to generate an infinite family of features from the time series.
- For a given time series s, the likelihood profile p_θ(s) across all possible parameter settings θ (under a matrix normal-inverse Wishart prior ω(θ)) forms a representation of s.

K(s_i, s_j) = ∫_θ p_θ(s_i) p_θ(s_j) ω(dθ)

Experimental Comparisons
Compared kernels:
- Autoregressive kernel: AR
- Fisher kernel with hidden Markov model: Fisher
- Dynamic time warping: DTW
Our kernels:
- Reservoir kernel: RV
- Reservoir kernel by sampling: SamplingRV
- Reservoir kernel by Gaussian mixture models: GMMRV
- Fisher kernel with reservoir models: FisherRV

Synthetic Data: NARMA
[Figure: NARMA time series of orders 10, 20 and 30.]

Order 10: s(t+1) = 0.3 s(t) + 0.05 s(t) Σ_{i=0}^{9} s(t-i) + 1.5 u(t-9) u(t) + 0.1
Order 20: s(t+1) = tanh( 0.3 s(t) + 0.05 s(t) Σ_{i=0}^{19} s(t-i) + 1.5 u(t-19) u(t) + 0.01 ) + 0.2
Order 30: s(t+1) = 0.2 s(t) + 0.004 s(t) Σ_{i=0}^{29} s(t-i) + 1.5 u(t-29) u(t) + 0.201
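For reference, a short generator for the order-10 system above; the driving input u(t) ~ U(0, 0.5) is a common convention assumed here, not stated on the slide.

```python
import numpy as np

def narma10(T, seed=0):
    """Generate a NARMA time series of order 10 driven by an i.i.d. uniform input."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(0, 0.5, T)                 # driving input, a common choice
    s = np.zeros(T)
    for t in range(9, T - 1):
        s[t + 1] = (0.3 * s[t]
                    + 0.05 * s[t] * np.sum(s[t - 9:t + 1])
                    + 1.5 * u[t - 9] * u[t]
                    + 0.1)
    return u, s
```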

Multi-Dimensional Scaling Representation
[Figure: 2D MDS projection of the model distances, with points labelled by NARMA order 10, 20 and 30.]
Illustration of MDS on the model distances among the per-sequence reservoir model readouts.

Robustness to Noise
[Figure: generalization accuracy (%) of RV, AR, Fisher, DTW and a no-kernel baseline as the Gaussian noise level increases from 0 to 0.5.]
Illustration of the performance of the compared kernels under different noise levels.

Benchmark Data

Dataset    Length  Classes  Train  Test
Symbols    398     6        25     995
OSULeaf    427     6        200    242
Oliveoil   570     4        30     30
Lighting2  637     2        60     61
Beef       470     6        30     30
Car        576     4        60     60
Fish       463     8        175    175
Coffee     286     2        28     28
Adiac      176     37       390    391

Results: Classification Accuracy (%)

Dataset    DTW    AR      Fisher  RV      FisherRV  GMMRV  SamplingRV
Symbols    94.77  91.15   94.42   98.08   95.96     97.31  95.77
OSULeaf    74.79  56.61   54.96   69.83   64.59     56.55  63.33
Oliveoil   83.33  73.33   56.67   86.67   83.33     84.00  90.00
Lighting2  64.10  77.05   64.10   77.05   75.41     78.69  80.33
Beef       66.67  78.69   58.00   80.00   68.00     79.67  86.67
Car        58.85  60.00   65.00   76.67   72.33     78.33  86.67
Fish       69.86  60.61   57.14   79.00   74.29     78.00  85.71
Coffee     85.71  100.00  81.43   100.00  92.86     96.43  100.00
Adiac      65.47  64.45   68.03   72.63   71.61     74.94  76.73

Results: CPU Time (s)

Dataset    DTW    AR     Fisher  RV   FisherRV  GMMRV  SamplingRV
Symbols    1,318  2,868  2,331   202  236       374    808
OSULeaf    6,030  1,375  3,264   98   111       186    447
Oliveoil   295    113    832     11   19        27     43
Lighting2  918    151    1,143   33   46        61     95
Beef       107    54     87      10   17        23     40
Car        679    442    902     27   42        50     84
Fish       3,353  495    1,998   81   96        159    286
Coffee     21     25     145     3    3         7      19
Adiac      550    8,131  1,122   201  213       394    699

Performance on increasing length of time series
[Figure: generalization accuracy (left) and CPU time (right, log scale) of the RV, AR, DTW and Fisher kernels for time series lengths from 200 to 1800.]

Multivariate Time Series
Our kernels are applicable to multivariate time series of variable length.

Dataset      Dim  Length  Classes  Train  Test
Libras       2    45      15       360    585
handwritten  3    60-182  20       600    2258
AUSLAN       22   45-136  95       600    1865

Multivariate Time Series: Results
[Figure: generalization accuracy (left) and CPU time (right) of the AR, Fisher, DTW, RV and SamplingRV kernels on Libras, handwritten and AUSLAN.]

Online Kernel Construction
Reservoir readouts can be trained in an on-line fashion, e.g. using Recursive Least Squares. This enables us to construct and refine reservoir kernels incrementally:

w_opt = w_0 + k (b - aᵀ w_0),
k = (A_0ᵀ A_0)⁻¹ a / ( λ + aᵀ (A_0ᵀ A_0)⁻¹ a ),
(Aᵀ A)⁻¹ = λ⁻¹ [ I - k aᵀ ] (A_0ᵀ A_0)⁻¹,

λ: forgetting factor.
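A compact sketch of this RLS update for a single readout; the class name `RLSReadout`, the forgetting-factor value and the initialisation P = δI are illustrative choices. Each sequence's readout can be maintained this way as new observations arrive, and the reservoir kernel recomputed from the current readouts.

```python
import numpy as np

class RLSReadout:
    """Recursive least squares update of a linear readout w, with forgetting factor lam."""
    def __init__(self, dim, lam=0.999, delta=1e2):
        self.w = np.zeros(dim)
        self.P = delta * np.eye(dim)     # running estimate of (A^T A)^{-1}
        self.lam = lam

    def update(self, a, b):
        # a: current feature (reservoir state) vector, b: target value
        Pa = self.P @ a
        k = Pa / (self.lam + a @ Pa)                 # gain vector
        self.w = self.w + k * (b - a @ self.w)       # w_opt = w_0 + k (b - a^T w_0)
        self.P = (self.P - np.outer(k, Pa)) / self.lam
        return self.w
```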

Long Time Series
PEMS-SF: 15 months (Jan 1st 2008 to Mar 30th 2009) of daily data from the California Department of Transportation PEMS. The data describe the occupancy rates of different car lanes of San Francisco Bay Area freeways.
PEMS-SF contains 440 time series of length 138,672.

Results
The computational complexity of our kernels is O(l), where l is the length of the time series: they scale linearly with the length of the time series!
[Figure: RV kernel accuracy as a function of the considered time series length, up to 14 × 10^4.]
1. The RV kernel achieves 86.13% accuracy.
2. The best accuracies reported for the AR, global alignment, spline smoothing and bag-of-vectors kernels are 82%-83%.

Incremental One Class Learning
1. Normal data: apply the reservoir model in a sliding window (size m) over the first t steps.
2. From the kernel matrix

K_σ(s_i, s_j) = exp{ -σ L_2(h_i, h_j) },

train a one-class SVM.
3. If needed, incrementally create new clusters of models ("signal regimes").
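A sketch of step 2, assuming per-window readouts (W, a) obtained as in the earlier kernel sketch; scikit-learn's OneClassSVM accepts a precomputed kernel matrix. Scoring new windows would additionally require the cross-kernel between new and training readouts.

```python
import numpy as np
from sklearn.svm import OneClassSVM

def one_class_from_readouts(readouts, sigma=1.0, nu=0.05):
    """Train a one-class SVM on the precomputed reservoir kernel of window readouts."""
    n = len(readouts)
    K = np.zeros((n, n))
    for i, (Wi, ai) in enumerate(readouts):
        for j, (Wj, aj) in enumerate(readouts):
            # model distance under the uniform measure, then the kernel exp(-sigma * L2)
            d = np.sqrt((1 / 3) * np.sum((Wi - Wj) ** 2) + np.sum((ai - aj) ** 2))
            K[i, j] = np.exp(-sigma * d)
    ocsvm = OneClassSVM(kernel="precomputed", nu=nu).fit(K)
    return ocsvm, K
```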

NARMA task
NARMA sequences of orders 10, 20 and 30.
[Figure: NARMA time series of orders 10, 20 and 30 and their state space model representations.]

Experiments
NARMA (3 classes)

Algorithm           Classes  Precision  Recall  Specificity
DBscan              4        0.6690     0.7650  0.8825
ARMAX-OCS           5        0.9354     0.9229  0.9615
RC-OCS              3        0.9637     0.9615  0.9808
DRC-OCS (Sampling)  3        0.9683     0.9692  0.9914
DRC-OCS             3        0.9861     0.9858  0.9929

Experiments
Van der Pol (4 classes)

Algorithm           Classes  Precision  Recall  Specificity
DBscan              10       0.7629     0.6842  0.8018
ARMAX-OCS           2        0.4309     0.4880  0.7868
RC-OCS              6        0.9606     0.9583  0.9861
DRC-OCS (Sampling)  5        0.9617     0.9726  0.9819
DRC-OCS             5        0.9736     0.9731  0.9910

Experiments
Three Tank (4 classes)

Algorithm           Classes  Precision  Recall  Specificity
DBscan              14       0.8742     0.7561  0.9253
ARMAX-OCS           5        0.9914     0.9923  0.9984
RC-OCS              9        0.9182     0.8788  0.9596
DRC-OCS (Sampling)  7        0.9940     0.9949  0.9988
DRC-OCS             10       0.9931     0.9931  0.9977

Experiments
Barcelona Water (32 classes)

Algorithm           Classes  Precision  Recall  Specificity
DBscan              61       0.8019     0.7326  0.8654
ARMAX-OCS           57       0.7826     0.7419  0.8237
RC-OCS              44       0.8913     0.8942  0.9263
DRC-OCS (Sampling)  39       0.9219     0.9310  0.9513
DRC-OCS             48       0.9538     0.9640  0.9871

Further reading
- A. Rodan, P. Tino: Minimum Complexity Echo State Network. IEEE TNNLS, 22(1), pp. 131-144, 2011.
- A. Rodan, P. Tino: Simple Deterministically Constructed Cycle Reservoirs with Regular Jumps. Neural Computation, 24(7), pp. 1822-1852, 2012.
- P. Tino, A. Rodan: Short Term Memory in Input-Driven Linear Dynamical Systems. Neurocomputing, 112, pp. 58-63, 2013.
- H. Chen, P. Tino, X. Yao, A. Rodan: Learning in the Model Space for Fault Diagnosis. IEEE TNNLS, in print, 2013.
- H. Chen, F. Tang, P. Tino, X. Yao: Model-based Kernel for Efficient Time Series Analysis. KDD 2013.