Learning from Temporal Data Using Dynamical Feature Space


1 Learning from Temporal Data Using Dynamical Feature Space Peter Tiňo School of Computer Science University of Birmingham, UK Learning from Temporal Data Using Dynamical Feature Space p.1/64

2 Special thanks to... Barbara Hammer Huanhuan Chen Alessio Micheli Michal Čerňanský Luba Beňúšková Jort van Mourik Igor Farkaš Ali Rodan Jochen Steil Xin Yao Nick Gianniotis Dan Cornford Manfred Opper Fengzhen Tang,... Learning from Temporal Data Using Dynamical Feature Space p.2/64

3 Learning in the data or model space? Complex data, e.g. - long (possibly multivariate) time series (video, sensor data, etc.) of variable length - fMRI data measured on normalized cognitive tasks - ... We kind of know what to do for fixed-dimensional vectorial data, but how do we represent such complex data in order to usefully "learn" from it? The critical notion of "smoothness" when learning from historic data: "close" things should lead to similar responses. "Closeness" - distance, dot product, ... How should we represent complex data items in the light of the "smoothness" criterion? Learning from Temporal Data Using Dynamical Feature Space p.3/64

4 Learning in the Model Space Framework Each data item is represented by a model that "explains" it. Learning is formulated in the space of models (a function space). Parametrization-free notions of "input similarity": similarity is quantified between the models, not their (arbitrary) parametrizations! Key consideration - the base model class: - flexible enough to represent a variety of data items - sufficiently constrained to avoid overfitting Learning from Temporal Data Using Dynamical Feature Space p.4/64

5 NARMA task - illustrative example NARMA sequences of orders 10, 20 and 30, represented as state space models. [Figure: 2D visualization of the sequence models; legend: order 10, order 20, order 30] Learning from Temporal Data Using Dynamical Feature Space p.5/64

6 Representing Sequential Data Each sequence will be represented by a dedicated model that captures everything important about the sequence in a compact manner. If we'd like to change focus to different aspects of the sequences, we need to change the base model accordingly. What should be our base model? Many possible choices, but here we'll consider state space models - a very general and powerful model class capturing temporal structure through the notion of information processing state (IPS). The IPS at time t codes for the entire history of time series items we have seen up to time t. Learning from Temporal Data Using Dynamical Feature Space p.6/64

7 Discrete Time State Space Model [Diagram: parametrized dynamical system with delay loop, parameters W, and an output readout] The states of the DS form the IPS. These states code the entire history of sequence elements we have seen so far. Learning from Temporal Data Using Dynamical Feature Space p.7/64

8 A State Space Model N-dimensional continuous state space [-1,1]^N: x(t) = f(x(t-1), u(t)) = σ(R x(t-1) + V u(t) + b), y(t) = h(x(t)) = κ(W x(t) + a), where x(t) ∈ [-1,1]^N is the state vector at time t, u(t) is the input time series element at time t, f(·) is the state-transition function, y(t) is the output of the readout from the state space, and h(·) is the readout (output) function. Learning from Temporal Data Using Dynamical Feature Space p.8/64
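A minimal NumPy sketch of one step of such a model (assuming tanh for σ and, as in the later slides, the identity for κ; the function names and the vector-valued input are illustrative):

```python
import numpy as np

def state_step(x_prev, u_t, R, V, b):
    # x(t) = sigma(R x(t-1) + V u(t) + b), with sigma = tanh applied elementwise
    # x_prev: state vector x(t-1), u_t: input vector u(t)
    return np.tanh(R @ x_prev + V @ u_t + b)

def readout(x_t, W, a):
    # y(t) = kappa(W x(t) + a); kappa taken here as the identity (linear readout)
    return W @ x_t + a
```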

9 Another State Space Model - Transducer Translates input streams over {a,b} into sequences over {x,y}. [Diagram: two-state finite-state transducer; transitions labelled with input/output pairs a/x, b/y (state 1) and a/x, b/x (state 2)] Learning from Temporal Data Using Dynamical Feature Space p.9/64

10 Picking the Right Base Model An appealing feature of the previous example: the state space structure and dynamics are perfectly tailored to the particular task. Of course, in our general scenario we may not know (and typically this is the case!) the right state space structure that extracts from the data (time series) what is needed for the particular task: - What are the appropriate IPS? - What is their transition structure? Since we are ML people, the answer seems obvious: learn everything!... easier said than done... Learning from Temporal Data Using Dynamical Feature Space p.10/64

11 Picking the Right Base Model Things are actually much worse! If we adopted a general parameterized model x(t) = f(x(t-1), u(t)) = σ(R x(t-1) + V u(t) + b), y(t) = h(x(t)) = κ(W x(t) + a), we may have problems fitting it to individual sequences (the information latching problem). Worse still - even if we managed to do perfect model fitting so that each sequence is nicely represented - how do we define a distance between state space models, i.e. driven dynamical systems? Highly non-trivial - as we know from the work in the machine vision and video processing communities (low-dimensional, mostly linear dynamical systems). Learning from Temporal Data Using Dynamical Feature Space p.11/64

12 Our Proposal Interpret the state space model as a "dynamical kernel machine": x(t) = f(x(t-1), u(t)) = σ(R x(t-1) + V u(t) + b), y(t) = h(x(t)) = W x(t) + a. At time t the input item u(t) is mapped to the feature space - the state space of the dynamical system - and its feature space image is the state x(t). The feature mapping is dynamic - besides providing an image x(t) of u(t), it also reflects the history of inputs up to time t. Once in the feature space, the representations are rich enough to allow for a convenient use of linear tools! Learning from Temporal Data Using Dynamical Feature Space p.12/64

13 Dynamic Feature Space The "dynamic kernel trick": - fixed, "general purpose" state space organization - "redundant" high-dimensional state (feature) space - task-dependent linear readout mapping - all the "hard" work is done in the feature space creation. How can the idea of a "general purpose" dynamic feature space be realized? Let's get inspiration from information theory ("general purpose" information sources). Learning from Temporal Data Using Dynamical Feature Space p.13/64

14 Simplest Case - Observable Markovian IPS The simplest construction of IPS is based on concentrating on the very recent (finite) past, e.g. definite memory machines, finite memory machines, Markov models. Example: only the last 5 input symbols matter when reasoning about what symbol comes next. [Figure: three example sequences sharing the same 5-symbol suffix] All three sequences belong to the same IPS. Learning from Temporal Data Using Dynamical Feature Space p.14/64

15 Probabilistic framework - Markov model (MM) Finite context-conditional next-symbol distributions: P(s | 11111), P(s | 11112), P(s | 11113), ..., P(s | 11121), ..., P(s | 43211), ..., P(s | 44444), for s ∈ {1,2,3,4}. Learning from Temporal Data Using Dynamical Feature Space p.15/64

16 Markov model MM are intuitive and simple, but... The number of IPS (prediction contexts) grows exponentially fast with memory length. Large alphabets and long memories are infeasible: computationally demanding, and very long sequences are needed to obtain sound statistical estimates of the free parameters. Learning from Temporal Data Using Dynamical Feature Space p.16/64

17 Variable memory length MM (VLMM) Sophisticated implementation of potentially high-order MM Takes advantage of subsequence structure in data Saves resources by making memory depth context dependent Use deep memory only when it is needed Natural representation of IPS in form of Prediction Suffix Trees Closely linked to the idea of universal simulation of information sources. Learning from Temporal Data Using Dynamical Feature Space p.17/64

18 Prediction suffix tree (PST) IPS are organized on the nodes of a PST. Traverse the tree from the root towards the leaves, reading the sequence in reversed order. Complex and potentially time-consuming training. Each node i has an associated next-symbol probability distribution P(· | i). [Diagram: PST over time, with nodes labelled by suffixes such as 12, 122, 222 carrying the distributions P(· | 12), P(· | 122), P(· | 222)] Learning from Temporal Data Using Dynamical Feature Space p.18/64

19 Contractive DS lead to Markovian IPS [Figure: iterated application of per-symbol contractions of the state space; sequences sharing recently observed symbols are mapped into nested, progressively smaller regions] Learning from Temporal Data Using Dynamical Feature Space p.19/64

20 Contractive DS lead to Markovian IPS Mapping symbolic sequences into Euclidean space. The coding strategy is Markovian: sequences with shared histories of recently observed symbols have close images. The longer the common suffix of two sequences, the closer they are mapped to each other. Add a simple output-generating readout on top of the fixed state dynamics! Learning from Temporal Data Using Dynamical Feature Space p.20/64

21 More Formally... Given an input symbol s ∈ A at time t, and activations of the recurrent units from the previous step, x(t-1), we have x(t) = f_s(x(t-1)). This notation can be extended to sequences over A*: the recurrent activations after t time steps are x(t) = f_{s1...st}(x(0)). If the maps f_s are contractions with Lipschitz constant 0 < C < 1, then ||f_{vq}(x) - f_{wq}(x)|| ≤ C^{|q|} ||f_v(x) - f_w(x)|| ≤ C^{|q|} diam(X). Learning from Temporal Data Using Dynamical Feature Space p.21/64

22 Theoretical grounding Theorem (Hammer & Tiňo): Recurrent networks with contractive transition function can be approximated arbitrarily well on input sequences of unbounded length by a definite memory machine. Conversely, every definite memory machine can be simulated by a recurrent network with contractive transition function. Hence initialization with small weights induces an architectural bias into learning with recurrent neural networks. Learning from Temporal Data Using Dynamical Feature Space p.22/64

23 Learnability (PAC framework) The architectural bias emphasizes one possible region of the weight space where generalization ability can be formally proved. Standard RNNs are not distribution-independent learnable in the PAC sense if arbitrary precision and inputs are considered. Theorem (Hammer & Tiňo): Recurrent networks with contractive transition function with a fixed contraction parameter fulfill the distribution-independent UCED property, and so, unlike general recurrent networks, are distribution-independent PAC-learnable. Learning from Temporal Data Using Dynamical Feature Space p.23/64

24 State space complexity vs. sequence complexity The complexity of the driving input stream (topological entropy) is directly reflected in the complexity of the state space activations (fractal dimension). Theorem (Tiňo & Hammer): Recurrent activations inside a contractive RNN form fractal clusters, the dimension of which can be bounded by the scaled entropy of the underlying driving source. The scaling factors are fixed and are given by the RNN parameters. Learning from Temporal Data Using Dynamical Feature Space p.24/64

25 Fix the DS to a contractive system! [Diagram: non-autonomous DS with delay loop; a "smooth" readout map from the IPS to the output] Fix the DS part to a randomly constructed contraction. Only train the simple linear readout output map. Can be done efficiently. Reservoir computation models. Learning from Temporal Data Using Dynamical Feature Space p.25/64
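A minimal echo-state-style sketch of this recipe: a random reservoir scaled to spectral radius below 1, with only the linear readout trained (here by ridge regression on next-step prediction). The scaling factor, input range and ridge parameter are illustrative choices, not the settings used in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_reservoir(u, W, V, b):
    """Drive a fixed reservoir with a univariate input sequence u; return all states."""
    x = np.zeros(W.shape[0])
    states = []
    for u_t in u:
        x = np.tanh(W @ x + V * u_t + b)   # fixed, contractive state transition
        states.append(x.copy())
    return np.array(states)

# Fixed, randomly constructed contraction: rescale W to spectral radius < 1
N = 100
W = rng.normal(size=(N, N))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))
V = rng.uniform(-0.5, 0.5, size=N)
b = np.zeros(N)

# Train only the linear readout (ridge regression) on next-step prediction
u = rng.uniform(0.0, 0.5, size=1000)
X = run_reservoir(u[:-1], W, V, b)    # states x(1), ..., x(T-1)
y = u[1:]                             # targets u(2), ..., u(T)
ridge = 1e-6
W_out = np.linalg.solve(X.T @ X + ridge * np.eye(N), X.T @ y)
```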

26 Simple, non-randomized reservoirs [Diagrams: three reservoir architectures, each with a dynamical reservoir of N internal units x(t), input unit s(t), input weights V, reservoir weights W, readout weights U and output unit y(t)] Learning from Temporal Data Using Dynamical Feature Space p.26/64

27 Cycle reservoir with regular jumps Still a simple deterministic construction... [Diagrams (A) and (B): cycle reservoirs with N=18 units, cycle weights r_c and jump weights r_j, with jumps of two different lengths] Learning from Temporal Data Using Dynamical Feature Space p.27/64
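An illustrative construction of such a cycle-reservoir-with-jumps weight matrix (the jump length and the weight values below are placeholders; see the Rodan & Tiňo paper in the further reading for the exact scheme):

```python
import numpy as np

def crj_reservoir(N, r_c, r_j, jump):
    """Cycle Reservoir with Jumps: a unidirectional cycle of weight r_c plus
    bidirectional 'jump' connections of weight r_j every `jump` units.
    Illustrative sketch of the deterministic construction."""
    W = np.zeros((N, N))
    for i in range(N):
        W[(i + 1) % N, i] = r_c          # cycle: unit i feeds unit i+1
    for i in range(0, N - N % jump, jump):
        j = (i + jump) % N
        W[j, i] = r_j                     # jumps in both directions
        W[i, j] = r_j
    return W

W = crj_reservoir(N=18, r_c=0.7, r_j=0.4, jump=3)
```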

28 Reservoir characterizations Memory capacity - [Jaeger] A measure correlating past events in an i.i.d. input stream with the network output. Assume the network is driven by a univariate stationary input signal u(t). For a given delay k > 0, construct the network with optimal parameters for the task of outputting u(t-k) after seeing the input stream ...u(t-1)u(t) up to time t. Goodness of fit: squared correlation coefficient between the desired output u(t-k) and the observed network output y(t). Learning from Temporal Data Using Dynamical Feature Space p.28/64

29 Memory capacity MC_k = Cov²(u(t-k), y(t)) / (Var(u(t)) Var(y(t))), where Cov denotes the covariance and Var the variance operators. The short term memory (STM) capacity is then given by MC = Σ_{k=1}^{∞} MC_k. [Jaeger] For any recurrent neural network with N recurrent neurons, under the assumption of an i.i.d. input stream, the STM capacity cannot exceed N. Learning from Temporal Data Using Dynamical Feature Space p.29/64
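A rough empirical sketch of estimating MC_k from a run of reservoir states: train a least-squares readout per delay and report the squared correlation. This is a numerical approximation, not the analytical results quoted on the following slides.

```python
import numpy as np

def memory_capacity(states, u, k_max):
    """Estimate MC_k = Cov^2(u(t-k), y(t)) / (Var(u) Var(y)) for k = 1..k_max,
    where y(t) is a linear readout trained to reproduce u(t-k) from the state x(t).
    states: array of shape (T, N); u: input sequence of length T."""
    mc = []
    for k in range(1, k_max + 1):
        X, target = states[k:], u[:-k]            # predict u(t-k) from x(t)
        w, *_ = np.linalg.lstsq(X, target, rcond=None)
        y = X @ w
        c = np.corrcoef(target, y)[0, 1]
        mc.append(c ** 2)                         # squared correlation coefficient
    return np.array(mc)                           # MC is approximated by sum(mc)
```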

30 Memory capacity of SCR [Tiňo & Rodan] Under the assumption of a zero-mean i.i.d. input stream, the memory capacity of the linear SCR architecture with N reservoir units can be made arbitrarily close to N. Theorem (Tiňo & Rodan): Consider a linear SCR network with reservoir weight 0 < r < 1 and an input weight vector such that the reservoir activations are full rank. Then, the SCR network memory capacity is equal to MC = N - (1 - r^{2N}). Learning from Temporal Data Using Dynamical Feature Space p.30/64

31 Reservoir characterizations Eigen-decomposition of the reservoir matrix. [Figure: eigenvalue spectra of the reservoir weight matrix for the ESN, SCR, CRJ and CRHJ architectures] Learning from Temporal Data Using Dynamical Feature Space p.31/64

32 Reservoir characterizations Memory capacity. [Figure, panels (A)-(D): memory capacity MC_k as a function of the delay k for SCR, CRJ and ESN reservoirs] Learning from Temporal Data Using Dynamical Feature Space p.32/64

33 Reservoir characterizations Pseudo-Lyapunov exponents. [Figure: pseudo-Lyapunov exponent as a function of the scaling parameter for ESN, SCR and CRJ on the NARMA task and the laser task] Learning from Temporal Data Using Dynamical Feature Space p.33/64

34 Time Series Kernel via Reservoir Models [Diagram: a fixed cycle-reservoir-with-jumps state transition part (weights r_c, r_j) feeding a trained readout mapping; the readouts live in a readout parameter space] 1. For all sequences - the same fixed state transition (reservoir) part. 2. For each time series, train the linear readout on the prediction task. 3. Gaussian kernel in the readout function space. Learning from Temporal Data Using Dynamical Feature Space p.34/64

35 Distance in the Readout Function Space Two readouts h_1 and h_2 obtained on two sequences: L_2(h_1, h_2) = ( ∫_C ||h_1(x) - h_2(x)||² dµ(x) )^{1/2}, where µ(x) is the probability density (measure) on the input domain and C is the integration domain. C = [-1,+1]^N, as we use the recurrent neural network reservoir formulation with tanh transfer function. Learning from Temporal Data Using Dynamical Feature Space p.35/64

36 Model Distance with Uniform Measure Consider two readouts from the same reservoir, h_1(x) = W_1 x + a_1 and h_2(x) = W_2 x + a_2. Since the sigmoid activation function is employed, the domain of the readout is C = [-1,1]^N. Then L_2(h_1, h_2) = ( ∫_C ||h_1(x) - h_2(x)||² dµ(x) )^{1/2} = ( (2^N/3) Σ_{j=1}^{N} Σ_{i=1}^{O} w_{i,j}² + 2^N ||a||² )^{1/2}, where w_{i,j} is the (i,j)-th element of W_1 - W_2 and a = a_1 - a_2. Learning from Temporal Data Using Dynamical Feature Space p.36/64
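A direct transcription of this closed-form distance into code (note the common factor 2^N, which only rescales all distances and could be dropped in practice; this is a sketch, not the authors' implementation):

```python
import numpy as np

def readout_distance_uniform(W1, a1, W2, a2):
    """L2 distance between two linear readouts h_i(x) = W_i x + a_i
    under the uniform measure on C = [-1, 1]^N (closed form from the slide)."""
    W = W1 - W2            # O x N matrix of weight differences
    a = a1 - a2            # offset difference
    N = W.shape[1]
    sq = (2 ** N / 3) * np.sum(W ** 2) + (2 ** N) * np.sum(a ** 2)
    return np.sqrt(sq)
```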

37 Model Distance vs. Parameter Distance Scaling the squared model distance (by the constant factor 2^N) gives (1/3) Σ_{j=1}^{N} Σ_{i=1}^{O} w_{i,j}² + ||a||². Compared with the squared Euclidean distance of the readout parameters, Σ_{j=1}^{N} Σ_{i=1}^{O} w_{i,j}² + ||a||², more importance is given to the offset than to the orientation of the readout mapping. Learning from Temporal Data Using Dynamical Feature Space p.37/64

38 Non-uniform State Distribution For a K-component Gaussian mixture model µ(x) = Σ_{k=1}^{K} α_k µ_k(x | η_k, Σ_k), the model distance can be obtained as L_2²(h_1, h_2) = Σ_{k=1}^{K} α_k [ trace(Wᵀ W Σ_k) + aᵀ a + η_kᵀ Wᵀ W η_k + 2 aᵀ W η_k ]. Sampling approximation: L_2²(h_1, h_2) ≈ (1/m) Σ_{i=1}^{m} ||h_1(x(i)) - h_2(x(i))||². Learning from Temporal Data Using Dynamical Feature Space p.38/64
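A sketch of the sampling approximation, where sample_states would be reservoir states collected while driving the reservoir (an empirical stand-in for µ); the callables h1, h2 are the fitted readouts:

```python
import numpy as np

def readout_distance_sampled(h1, h2, sample_states):
    """Monte Carlo approximation of L2^2(h1, h2) = E_mu ||h1(x) - h2(x)||^2,
    using states x(i) sampled from (an estimate of) the state distribution mu."""
    diffs = np.array([h1(x) - h2(x) for x in sample_states])
    return np.mean(np.sum(diffs ** 2, axis=1))
```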

39 Three Time Series Kernels Uniform state distribution: Reservoir Kernel (RV). Non-uniform state distribution: reservoir kernel by sampling (SamplingRV) and reservoir kernel by Gaussian mixture models (GMMRV). K(s_i, s_j) = exp{ -γ L_2²(h_i, h_j) }. Learning from Temporal Data Using Dynamical Feature Space p.39/64
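Putting the pieces together, a sketch of building the kernel matrix over a set of fitted readouts; dist2 could be the squared uniform-measure distance or the sampled estimate above, and gamma is a kernel width to be tuned (both names are placeholders of this illustration):

```python
import numpy as np

def rv_kernel_matrix(readouts, dist2, gamma):
    """Gaussian kernel in readout space: K(s_i, s_j) = exp(-gamma * L2^2(h_i, h_j)).
    `readouts` is a list of fitted readouts, `dist2` a squared-distance function."""
    n = len(readouts)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = K[j, i] = np.exp(-gamma * dist2(readouts[i], readouts[j]))
    return K
```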

40 Fisher Kernel A single model for all data items. Based on information geometry. Fisher vector (score function): U(s) = ∇_λ log p(s | λ). Metric tensor - the Fisher information matrix F: K(s_i, s_j) = U(s_i)ᵀ F⁻¹ U(s_j). Learning from Temporal Data Using Dynamical Feature Space p.40/64

41 Fisher Kernel Based on Reservoir Model Endowing the readout with a noise model yields a generative time series model of the form: x(t) = f(u(t), x(t-1)), u(t+1) = W x(t) + a + ε(t). Assume the noise follows a Gaussian distribution, ε(t) ~ N(0, σ²I). Learning from Temporal Data Using Dynamical Feature Space p.41/64

42 Fisher Kernel by Reservoir Models Partial derivatives of the log likelihood log p(u(1..l)): U = ∂ log p(u(1..l)) / ∂W = ∂/∂W Σ_{t=1}^{l} log P(u(t) | u(1..t-1)) = Σ_{t=1}^{l} [ (u(t) - a) x(t-1)ᵀ - W x(t-1) x(t-1)ᵀ ] / σ². Fisher kernel (simplified in the usual way - identity matrix instead of F) for two time series s_i and s_j with scores U_i and U_j: K(s_i, s_j) = Σ_{o=1}^{O} Σ_{n=1}^{N} (U_i)_{o,n} (U_j)_{o,n}. Learning from Temporal Data Using Dynamical Feature Space p.42/64
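A sketch of computing the score matrix U and the simplified Fisher kernel for the reservoir model; the array shapes and the noise variance are assumptions of this illustration:

```python
import numpy as np

def fisher_score(u, states, W, a, sigma2=1.0):
    """Score U = d log p(u(1..l)) / dW for the reservoir model with Gaussian
    readout noise, following the derivative above.
    u: inputs of shape (l, O); states: reservoir states of shape (l, N)."""
    U = np.zeros_like(W)
    for t in range(1, len(u)):
        x_prev = states[t - 1]            # x(t-1), used to predict u(t)
        U += (np.outer(u[t] - a, x_prev) - W @ np.outer(x_prev, x_prev)) / sigma2
    return U

def fisher_rv_kernel(U_i, U_j):
    # Simplified Fisher kernel: identity in place of the Fisher information matrix
    return np.sum(U_i * U_j)
```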

43 Other time series kernels Kernel based on Dynamic time warping (DTW) DTW tries to warp the time axis of one (or both) sequences to achieve a better alignment DTW can generate un-intuitive alignments by mapping a single point on one time series onto a large subsection of another time series, leading to inferior results efficient versions of DTW based kernels available Learning from Temporal Data Using Dynamical Feature Space p.43/64

44 Other time series kernels Autoregressive kernel: a VAR model class of a given order is used to generate an infinite family of features from the time series. For a given time series s, the likelihood profile p_θ(s) across all possible parameter settings θ (under a matrix normal-inverse Wishart prior ω(θ)) forms a representation of s. K(s_i, s_j) = ∫ p_θ(s_i) p_θ(s_j) ω(dθ). Learning from Temporal Data Using Dynamical Feature Space p.44/64

45 Experimental Comparisons Compared kernels: Autoregressive Kernel: AR; Fisher Kernel with Hidden Markov Model: Fisher; Dynamic time warping: DTW. Our kernels: Reservoir Kernel: RV; Reservoir kernel by sampling: SamplingRV; Reservoir kernel by Gaussian mixture models: GMMRV; Fisher kernel with reservoir models: FisherRV. Learning from Temporal Data Using Dynamical Feature Space p.45/64

46 Synthetic Data: NARMA order 10: s(t+1) = 0.3 s(t) + 0.05 s(t) Σ_{i=0}^{9} s(t-i) + 1.5 u(t-9) u(t) + 0.1; order 20: s(t+1) = tanh( 0.3 s(t) + 0.05 s(t) Σ_{i=0}^{19} s(t-i) + 1.5 u(t-19) u(t) + 0.01 ) + 0.2; order 30: s(t+1) = 0.2 s(t) + 0.004 s(t) Σ_{i=0}^{29} s(t-i) + 1.5 u(t-29) u(t) + 0.001. Learning from Temporal Data Using Dynamical Feature Space p.46/64
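A sketch of generating the order-10 NARMA benchmark as defined above, driven by i.i.d. uniform input on [0, 0.5] (an assumption consistent with the usual benchmark setup; the order-20 and order-30 variants follow the other two formulas):

```python
import numpy as np

def narma10(T, seed=0):
    """Generate a NARMA sequence of order 10 and its driving input."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(0.0, 0.5, size=T)
    s = np.zeros(T)
    for t in range(9, T - 1):
        s[t + 1] = (0.3 * s[t]
                    + 0.05 * s[t] * np.sum(s[t - 9:t + 1])   # sum over s(t-i), i=0..9
                    + 1.5 * u[t - 9] * u[t]
                    + 0.1)
    return u, s
```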

47 Multi-Dimensional Scaling Representation [Figure: MDS visualization of the model distances; legend: order 10, order 20, order 30] Illustration of MDS on the model distances among the trained readouts. Learning from Temporal Data Using Dynamical Feature Space p.47/64

48 Robustness to Noise [Figure: generalization accuracy as a function of the Gaussian noise level for RV, AR, Fisher, DTW and NoKernel] Illustration of the performance of the compared kernels under different noise levels. Learning from Temporal Data Using Dynamical Feature Space p.48/64

49 Benchmark Data [Table: benchmark datasets Symbols, OSULeaf, Oliveoil, Lighting, Beef, Car, Fish, Coffee and Adiac, with their sequence length, number of classes, and train/test set sizes] Learning from Temporal Data Using Dynamical Feature Space p.49/64

50 Results: Classification Accuracy [Table: classification accuracy of DTW, AR, Fisher, RV, FisherRV, GMMRV and SamplingRV on each benchmark dataset] Learning from Temporal Data Using Dynamical Feature Space p.50/64

51 Results: CPU Time (s) [Table: CPU time in seconds of DTW, AR, Fisher, RV, FisherRV, GMMRV and SamplingRV on each benchmark dataset] Learning from Temporal Data Using Dynamical Feature Space p.51/64

52 Performance on increasing length of time series [Figure: generalization accuracy (left) and CPU time (right) as functions of the time series length for RV, AR, DTW and Fisher] Learning from Temporal Data Using Dynamical Feature Space p.52/64

53 Multivariate Time Series Our kernels are applicable to multivariate time series of variable length. [Table: multivariate datasets Libras, handwritten and AUSLAN, with their dimensionality, length, number of classes, and train/test set sizes] Learning from Temporal Data Using Dynamical Feature Space p.53/64

54 Multivariate Time Series: Results [Figure: generalization accuracy (left) and CPU time (right) of AR, Fisher, DTW, RV and SamplingRV on Libras, handwritten and AUSLAN] Learning from Temporal Data Using Dynamical Feature Space p.54/64

55 Online Kernel Construction Reservoir readouts can be trained in an on-line fashion, e.g. using Recursive Least Squares. This enables us to construct and refine reservoir kernels: w_opt = w_0 + k (b - aᵀ w_0), k = (A_0ᵀ A_0)⁻¹ a / ( λ + aᵀ (A_0ᵀ A_0)⁻¹ a ), (A_0ᵀ A_0)⁻¹ ← [ λ⁻¹ I - k aᵀ ] (A_0ᵀ A_0)⁻¹, where λ is the forgetting factor. Learning from Temporal Data Using Dynamical Feature Space p.55/64
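A standard RLS step in this spirit (a sketch that is equivalent up to notation to the update above; P plays the role of the inverse covariance (A_0ᵀ A_0)⁻¹ and λ is the forgetting factor):

```python
import numpy as np

def rls_update(w, P, a, b, lam=0.99):
    """One Recursive Least Squares step for the linear readout.
    w - current readout weights, P - current inverse covariance,
    a - new state (regressor) vector, b - new target, lam - forgetting factor."""
    Pa = P @ a
    k = Pa / (lam + a @ Pa)                 # gain vector
    w_new = w + k * (b - a @ w)             # correct the prediction error
    P_new = (P - np.outer(k, Pa)) / lam     # update the inverse covariance
    return w_new, P_new
```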

56 Long Time series PEMS-SF is 15 months (Jan. 1st 2008 to Mar. 30th 2009) of daily data from the California Department of Transportation PEMS. The data describes the occupancy rate of different car lanes of San Francisco bay area freeways. PEMS-SF with 440 time series of length 138,672. Learning from Temporal Data Using Dynamical Feature Space p.56/64

57 Results The computational complexity of our kernels is O(l), where l is the length of the time series: they scale linearly with the length of the time series! The RV kernel achieves 86.13% accuracy. [Figure: accuracy as a function of the length of the time series] The best accuracies reported previously are 82%-83%, for the AR, global alignment, spline smoothing and bag-of-vectors kernels. Learning from Temporal Data Using Dynamical Feature Space p.57/64

58 Incremental One Class Learning 1. Normal data - apply the reservoir model in a sliding window (size m) over the first t steps. 2. Kernel matrix K_σ(s_i, s_j) = exp{ -σ L_2(h_i, h_j) } - train a one-class SVM. 3. If needed, incrementally create new clusters of models (signal "regimes"). Learning from Temporal Data Using Dynamical Feature Space p.58/64

59 NARMA task NARMA sequences - orders 10, 20 and 30. [Figure: 2D visualization of the sequence models; legend: order 10, order 20, order 30] Learning from Temporal Data Using Dynamical Feature Space p.59/64

60 Experiments NARMA (3 classes) [Table: number of discovered classes, precision, recall and specificity for DBscan, ARMAX-OCS, RC-OCS, DRC-OCS(Sampling) and DRC-OCS] Learning from Temporal Data Using Dynamical Feature Space p.60/64

61 Experiments Van der Pol (4 classes) [Table: number of discovered classes, precision, recall and specificity for DBscan, ARMAX-OCS, RC-OCS, DRC-OCS(Sampling) and DRC-OCS] Learning from Temporal Data Using Dynamical Feature Space p.61/64

62 Experiments Three Tank (4 classes) [Table: number of discovered classes, precision, recall and specificity for DBscan, ARMAX-OCS, RC-OCS, DRC-OCS(Sampling) and DRC-OCS] Learning from Temporal Data Using Dynamical Feature Space p.62/64

63 Experiments Barcelona Water (32 classes) [Table: number of discovered classes, precision, recall and specificity for DBscan, ARMAX-OCS, RC-OCS, DRC-OCS(Sampling) and DRC-OCS] Learning from Temporal Data Using Dynamical Feature Space p.63/64

64 Further reading A. Rodan, P. Tino: Minimum Complexity Echo State Network. IEEE Transactions on Neural Networks, 22(1), 2011. A. Rodan, P. Tino: Simple Deterministically Constructed Cycle Reservoirs with Regular Jumps. Neural Computation, 24(7), 2012. P. Tino, A. Rodan: Short Term Memory in Input-Driven Linear Dynamical Systems. Neurocomputing, 112, 2013. H. Chen, P. Tino, X. Yao, A. Rodan: Learning in the Model Space for Fault Diagnosis. IEEE TNNLS, in print. H. Chen, F. Tang, P. Tino, X. Yao: Model-based Kernel for Efficient Time Series Analysis. KDD 2013. Learning from Temporal Data Using Dynamical Feature Space p.64/64
