Giovanni Stilo SAX! A Symbolic Representations of Time Series

Size: px

Start display at page:

Download "Giovanni Stilo SAX! A Symbolic Representations of Time Series"

Allyson Ray
5 years ago
Views:

1 Giovanni Stilo SAX! A Symbolic Representations of Time Series

25.1750 25.2250 25.2500 25.2500 25.2750 25.3250 25.3500 25.3500 25.4000 25.4000 25.3250 25.2250 25.2000 25.1750.... 24.6250 24.6750 24.6750 24.6250 24.6250 24.6250 24.6750 24.7500 What are Time Series?

2 What are Time Series? A time series is a collection of observations made sequentially in time

3 Time Series are Ubiquitous People measure things... Schwarzeneggers popularity rating. Their blood pressure. The annual rainfall in New Zealand. The value of their Yahoo stock. The number of web hits per second. and things change over time. Thus time series occur in virtually every medical, scientific and businesses domain.

4 Image data, may best be thought of as time series

Video data, may best be thought of as time series Steady pointing Point Hand moving to shoulder level Hand at rest 0 10 20 30 40 50 60 70 80 90

5 Video data, may best be thought of as time series Steady pointing Point Hand moving to shoulder level Hand at rest Steady pointing Gun-Draw Hand moving t shoulder level Hand moving down to grasp Hand gun moving above holster Hand at rest

6 What do we want to do with the time series data? Clustering Classification Motif Discovery Rule Discovery 10 s = 0.5 c = 0.3 Query by Content Novelty Detection

7 All these problems require similarity matching Clustering Classification Motif Discovery Rule Discovery 10 s = 0.5 c = 0.3 Query by Content Novelty Detection

8 Euclidean Distance Metric Given two time series Q = q 1 q n and C = c 1 c n C their Euclidean distance is defined as: Q (, ) ( ) D Q C n i= 1 q i c i 2 D(Q,C)

The Generic Data Mining Algorithm Create an approximation of the data, which will fit in main memory, yet retains the essential features of interest Approximately solve the problem at hand in main

9 The Generic Data Mining Algorithm Create an approximation of the data, which will fit in main memory, yet retains the essential features of interest Approximately solve the problem at hand in main memory Make (hopefully very few) accesses to the original data on disk to confirm the solution obtained in Step 2, or to modify the solution so it agrees with the solution we would have obtained on the original data But which approximation should we use?

10 Time Series Representations Data Adaptive Non Data Adaptive Sorted Coefficients Piecewise Polynomial Singular Value Decomposition Symbolic Trees Wavelets Random Mappings Spectral Piecewise Aggregate Approximation Piecewise Linear Approximation Interpolation Regression Adaptive Piecewise Constant Approximat ion Natural Language Strings Orthonormal Bi - Orthonormal Discrete Fourier Transform Haar Daubechies Coiflets Symlets dbn n > 1 Discrete Cosine Transform UUCUCUCD U UC U C U DD DFT DWT SVD APCA PAA PLA SYM

Approximately solve the problem at hand in main memory Make (hopefully very

Step 2, or to modify the solution so it agrees with the solution we would have

11 The Generic Data Mining Algorithm Create an approximation of the data, which will fit in main memory, yet retains the essential features of interest Approximately solve the problem at hand in main memory Make (hopefully very few) accesses to the original data on disk to confirm the solution obtained in Step 2, or to modify the solution so it agrees with the solution we would have obtained on the original data This only works if the approximation allows lower bounding

12 What is lower bounding? Exact (Euclidean) distance D(Q,S) Q Lower bounding distance D LB (Q,S) Q S S D(Q,S) D(Q,S) n i= 1 ( ) q i s i 2 D LB (Q,S ) D LB (Q,S ) w i=1 ( ) 2 n w q i c i Lower bounding means that for all Q and S, we have D LB (Q,S ) D(Q,S)

13 Time Series Representations Data Adaptive Non Data Adaptive Sorted Coefficients Piecewise Polynomial Singular Value Decomposition Symbolic Trees Wavelets Random Mappings Spectral Piecewise Aggregate Approximation Piecewise Linear Approximation Interpolation Regression Adaptive Piecewise Constant Approximat ion Natural Language Strings Orthonormal Bi - Orthonormal Discrete Fourier Transform Haar Daubechies Coiflets Symlets dbn n > 1 Discrete Cosine Transform We can live without trees, random mappings and natural language, but it would be nice if we could lower bound strings (symbolic or discrete approximations) A lower bounding symbolic approach would allow data miners to Use suffix trees, hashing, markov models etc Use text processing and bioinformatic algorithms

14 The first symbolic representation of time series, that allows Lower bounding of Euclidean distance Dimensionality Reduction Numerosity Reduction

15 This representation SAX Symbolic Aggregate ApproXimation baabccbc

16 How do we obtain SAX? C C First convert the time series to PAA representation, then convert the PAA to symbols It take linear time - c c c b b b a a baabccbc

c b b - a a Time series subsequences tend to have a

95 0.90 0 20 40 60 80 Why a Gaussian? Probability 0.

001-10 0 10 A normal probability plot of the

17 c b b - a a Time series subsequences tend to have a highly Gaussian distribution Why a Gaussian? Probability A normal probability plot of the (cumulative) distribution of values from subsequences of length 128.

Visual Comparison 3 2 1 0-1 -2-3 f e d c b a DFT PLA Haar APCA A raw time series of length 128 is transformed into the word

18 Visual Comparison f e d c b a DFT PLA Haar APCA A raw time series of length 128 is transformed into the word ffffffeeeddcbaabceedcbaaaaacddee. We can use more symbols to represent the time series since each symbol requires fewer bits than real-numbers (float, double)

1.5 1 0.5 0-0.5 C n (, ) ( q i c i ) D Q C i= 1 Euclidean Distance 2-1 - 1.5 Q 0 20 40 60 80 100 120 DR( Q, C ) n w w i = 1 ( q c ) i i 2 1.5 1 0.5 0-0.5-1 - 1.

19 C n (, ) ( q i c i ) D Q C i= 1 Euclidean Distance Q DR( Q, C ) n w w i = 1 ( q c ) i i C Q PAA distance lower-bounds the Euclidean Distance C ˆ Q ˆ = baabccbc = babcacca MINDIST ( Qˆ, Cˆ) n w w i = 1 ( dist( qˆ, cˆ )) dist() can be implemented using a table lookup. i i 2

20 Novelty Detection Fault detection Interestingness detection Anomaly detection Surprisingness detection

21 A temporal notion of meaning Words with similar temporal behavior are similar similar = same time-frame, similar shape similar = synonym (#papafrancesco,#newpope) OR contextually related (Boston, bomb, marathon) Boston Bomb Cluster - Normalized scene muslim safe terrorist boston innoc

22 Temporal Data Mining discovering temporal patterns in text information collected over time (Mei and Zhai, 2005). When applied to large and lengthy micro-blog stream main problem is complexity. Need efficient way of representing continuous signals Need methods to prune non-relevant signals/identify relevant patterns Need algorithms to efficiently detect similarity among signals Evaluation is also an issue: millions of often obscure patterns, lack of golden datasets and benchmarks

23 SAX*: temporal clustering based on Symbolic Aggregate Approximation 3 steps: 1. Convert signals into sequences of symbols using Symbolic Aggregate Approximation 2. Learn patterns of collective attention (regular expression) 3. Cluster relevant strings in sliding windows

24 1. Symbolic Aggregate Approximation Parameters: W (dimension of temporal window) Δ(discretization step) Σ(alphabet of symbols) Signal is first Z-normalized In each (sliding) window W, signal is partitioned vertically in W/Δ slices, and the average value is computed in each slot Δj Signal is partitioned horizontally in Σ slices of equal area, let βj be the breackpoints A symbol is associated to every Δj according to: si=j, jϵσ, iff βj-1<si<βj

25 Example (two symbols, breackpoint is 0) a a a b b a a b b a W=10, Δ=1, Σ= a,b string is aaabbaabba

26 2.Patterns of collective attention Apply SAX to manually selected (about 30) words related to known event from Wikipedia Events descriptions (Arab Spring, Olympics, tsunami, Occupy wall Street..) Generate compatible regular expressions using RPNI algorithm (Oncina and Garcia 1992) The following regex is learned(one-two peaks/plateaux): a+b?bb?a+?a+b?bba*? Turns out to be compatible with shapes graphically shown in previous works on clustering patterns of collective attention (Lehmann et al. 2012; Xie et al. 2013; Weng et Lee 2011; Yang and Leskovec 2011)

27 3. Pattern clustering Bottom-up hierarchical clustering algorithm with complete linkage (Jain 2010) stop hierarchical bottom-up clustering aggregation for a cluster when: SD(d(centroid,tk))<δ Clusters smaller than f elements are purged To summarize, parameters are: W, Δ,Σ,δ,f Clusters are created in sliding windows of lenght W; Δ-clusters are subsequently generated (clusters active in slot Δ)

28 Experiments 1 year 1% Twitter stream ( ) ~7TB of Data. 2 types of experiments: event detection (in English) and (multilingual) hashtag sense clustering Parameters setting (especially W, Δ, Σ) depend on density and locality of stream. With 1% world-wide stream only big events can be detected. Best results with W=10 days, Δ=1 day, Σ=a,b For each experiment (events, hashtags), SAX* clustering is performed 365 times (one for each sliding window Wi).

29 Example: Bomb during Boston marathon, non normalized scene muslim safe terrorist boston innoc

30 Example: Bomb during Boston marathon, normalized Boston Bomb Cluster - Normalized scene muslim safe terrorist boston innoc

31 Example: London-olympics hahstags 3 Normalized hashtags time series (London 2012 Olympics) Olympics Olimpiadi2012 londra2012 London2012 Londres2012

papafrancesco,papafrancesco,pope,francois1e r,habemuspapam,

32 Example: #habemuspapam PapaFrancesco,francesco,habaemuspapam,ha bemuspapa,habemuspapam, newpope, papafrancesco,papafrancesco,pope,francois1e r,habemuspapam, NouveauPape, habemuspapam,nouveaupape,confessionsintim es, Francesco, HabemusPapa, HabemusPapam,Habemuspapam

Identifying Patterns and Anomalies in Delayed Neutron Monitor Data of Nuclear Power Plant

Identifying Patterns and Anomalies in Delayed Neutron Monitor Data of Nuclear Power Plant Durga Toshniwal, Aditya Gupta Department of Computer Science & Engineering Indian Institute of Technology Roorkee