Giovanni Stilo stilo@di.uniroma1.it SAX! A Symbolic Representation of Time Series
What are Time Series? A time series is a collection of observations made sequentially in time. [Plot: a sample series of ~500 readings with values around 24-29, e.g. 25.1750, 25.2250, 25.2500, ..., 24.7500]
Time Series are Ubiquitous. People measure things... Schwarzenegger's popularity rating. Their blood pressure. The annual rainfall in New Zealand. The value of their Yahoo stock. The number of web hits per second. ...and things change over time. Thus time series occur in virtually every medical, scientific and business domain.
Image data may best be thought of as time series.
Video data may best be thought of as time series. [Two gesture plots over frames 0-90, with phase labels: 'Point' (hand at rest, hand moving to shoulder level, steady pointing) and 'Gun-Draw' (hand at rest, hand moving above holster, hand moving down to grasp gun, hand moving to shoulder level, steady pointing)]
What do we want to do with the time series data? Clustering, Classification, Motif Discovery, Rule Discovery, Query by Content, Novelty Detection. [Each task illustrated with a small plot; the rule-discovery example shows a rule with s = 0.5, c = 0.3]
All these problems require similarity matching: Clustering, Classification, Motif Discovery, Rule Discovery, Query by Content, Novelty Detection.
Euclidean Distance Metric. Given two time series $Q = q_1 \dots q_n$ and $C = c_1 \dots c_n$, their Euclidean distance is defined as:
$$D(Q,C) \equiv \sqrt{\sum_{i=1}^{n}(q_i - c_i)^2}$$
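As a minimal illustration (not from the slides), the distance can be computed directly with NumPy:

```python
import numpy as np

def euclidean_distance(Q: np.ndarray, C: np.ndarray) -> float:
    """D(Q, C) = sqrt(sum_i (q_i - c_i)^2) for equal-length series."""
    assert Q.shape == C.shape, "Q and C must have the same length"
    return float(np.sqrt(np.sum((Q - C) ** 2)))

Q = np.array([25.17, 25.22, 25.25, 25.25, 25.27])  # toy values
C = np.array([24.62, 24.67, 24.67, 24.62, 24.62])
print(euclidean_distance(Q, C))
```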
The Generic Data Mining Algorithm
1. Create an approximation of the data, which will fit in main memory, yet retains the essential features of interest.
2. Approximately solve the problem at hand in main memory.
3. Make (hopefully very few) accesses to the original data on disk to confirm the solution obtained in Step 2, or to modify the solution so it agrees with the solution we would have obtained on the original data.
But which approximation should we use?
Time Series Representations
- Data Adaptive: Sorted Coefficients; Piecewise Polynomial (Piecewise Linear Approximation: Interpolation, Regression; Adaptive Piecewise Constant Approximation); Singular Value Decomposition; Symbolic (Trees; Natural Language; Strings)
- Non Data Adaptive: Wavelets (Orthonormal: Haar, Daubechies dbn n > 1, Coiflets, Symlets; Bi-Orthonormal); Random Mappings; Spectral (Discrete Fourier Transform, Discrete Cosine Transform); Piecewise Aggregate Approximation
[Thumbnails: the same length-128 series approximated by DFT, DWT, SVD, APCA, PAA, PLA, and a symbolic string SYM (e.g. UUCUCUCD)]
The Generic Data Mining Algorithm
1. Create an approximation of the data, which will fit in main memory, yet retains the essential features of interest.
2. Approximately solve the problem at hand in main memory.
3. Make (hopefully very few) accesses to the original data on disk to confirm the solution obtained in Step 2, or to modify the solution so it agrees with the solution we would have obtained on the original data.
This only works if the approximation allows lower bounding.
What is lower bounding? [Plot: a query Q and a series S, with the exact (Euclidean) distance D(Q,S) on the raw series and the lower bounding distance D_LB(Q,S) on their PAA approximations]
$$D(Q,S) = \sqrt{\sum_{i=1}^{n}(q_i - s_i)^2} \qquad D_{LB}(\bar{Q},\bar{S}) = \sqrt{\frac{n}{w}\sum_{i=1}^{w}(\bar{q}_i - \bar{s}_i)^2}$$
Lower bounding means that for all Q and S, we have $D_{LB}(\bar{Q},\bar{S}) \le D(Q,S)$.
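A minimal sketch of the PAA reduction and the lower bounding check, assuming n is a multiple of w (`paa` and `d_lb` are my own illustrative names):

```python
import numpy as np

def paa(x: np.ndarray, w: int) -> np.ndarray:
    """Piecewise Aggregate Approximation: mean of each of w equal segments."""
    return x.reshape(w, -1).mean(axis=1)

def d_lb(Q: np.ndarray, S: np.ndarray, w: int) -> float:
    """Lower bounding distance computed on the PAA coefficients."""
    n = len(Q)
    return float(np.sqrt(n / w * np.sum((paa(Q, w) - paa(S, w)) ** 2)))

rng = np.random.default_rng(0)
Q, S = rng.standard_normal(128), rng.standard_normal(128)
exact = float(np.sqrt(np.sum((Q - S) ** 2)))
assert d_lb(Q, S, w=8) <= exact  # holds for every Q and S
```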
Time Series Representations [the taxonomy repeated from the earlier slide]. We can live without trees, random mappings and natural language, but it would be nice if we could lower bound strings (symbolic or discrete approximations). A lower bounding symbolic approach would allow data miners to:
- Use suffix trees, hashing, Markov models, etc.
- Use text processing and bioinformatics algorithms
The first symbolic representation of time series that allows:
- Lower bounding of Euclidean distance
- Dimensionality reduction
- Numerosity reduction
This representation is SAX: Symbolic Aggregate ApproXimation (e.g., the word baabccbc).
How do we obtain SAX? First convert the time series to its PAA representation, then convert the PAA to symbols. It takes linear time. [Plot: a series C of length 128, its PAA segments, and the symbol bands a, b, c, yielding the word baabccbc]
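A compact sketch of both steps (my own illustrative code, not the authors' implementation), assuming a z-normalized input whose length is a multiple of w; the breakpoints come from the Gaussian quantiles, which makes each symbol equiprobable:

```python
import numpy as np
from scipy.stats import norm

def sax(x: np.ndarray, w: int, alphabet: str = "abc") -> str:
    """SAX word for a series: z-normalize, PAA, then discretize."""
    x = (x - x.mean()) / x.std()                  # z-normalize
    segments = x.reshape(w, -1).mean(axis=1)      # PAA (assumes len(x) % w == 0)
    a = len(alphabet)
    # Breakpoints that make each symbol equiprobable under N(0, 1)
    breakpoints = norm.ppf(np.arange(1, a) / a)
    return "".join(alphabet[int(np.searchsorted(breakpoints, s))] for s in segments)

rng = np.random.default_rng(1)
print(sax(rng.standard_normal(128), w=8))         # prints an 8-symbol word
```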
Why a Gaussian? Time series subsequences tend to have a highly Gaussian distribution. [A normal probability plot of the (cumulative) distribution of values from subsequences of length 128.] The breakpoints are therefore placed so that each symbol is equiprobable under the normal curve.
Visual Comparison. [Plots: the same raw series approximated by DFT, PLA, Haar, and APCA, alongside its SAX word.] A raw time series of length 128 is transformed into the word ffffffeeeddcbaabceedcbaaaaacddee. We can use more symbols to represent the time series, since each symbol requires fewer bits than real numbers (float, double).
Euclidean distance:
$$D(Q,C) \equiv \sqrt{\sum_{i=1}^{n}(q_i - c_i)^2}$$
The PAA distance lower-bounds the Euclidean distance:
$$DR(\bar{Q},\bar{C}) \equiv \sqrt{\frac{n}{w}}\sqrt{\sum_{i=1}^{w}(\bar{q}_i - \bar{c}_i)^2}$$
And on the SAX words $\hat{C} = \mathrm{baabccbc}$, $\hat{Q} = \mathrm{babcacca}$:
$$MINDIST(\hat{Q},\hat{C}) \equiv \sqrt{\frac{n}{w}}\sqrt{\sum_{i=1}^{w}\big(dist(\hat{q}_i,\hat{c}_i)\big)^2}$$
dist() can be implemented using a table lookup.
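A hedged sketch of the table lookup: in the standard SAX table the distance between two symbols is zero if they are equal or adjacent, and otherwise the gap between the Gaussian breakpoints that separate them (`sax_dist_table` and `mindist` are my own illustrative names):

```python
import numpy as np
from scipy.stats import norm

def sax_dist_table(a: int) -> np.ndarray:
    """Pairwise symbol distances: 0 for equal or adjacent symbols, otherwise
    the gap between the enclosing N(0, 1) breakpoints."""
    beta = norm.ppf(np.arange(1, a) / a)
    table = np.zeros((a, a))
    for r in range(a):
        for c in range(a):
            if abs(r - c) > 1:
                table[r, c] = beta[max(r, c) - 1] - beta[min(r, c)]
    return table

def mindist(q_hat: str, c_hat: str, n: int, alphabet: str = "abc") -> float:
    """MINDIST(Q^, C^) = sqrt(n/w) * sqrt(sum_i dist(q^_i, c^_i)^2)."""
    table = sax_dist_table(len(alphabet))
    idx = {ch: i for i, ch in enumerate(alphabet)}
    s = sum(table[idx[qs], idx[cs]] ** 2 for qs, cs in zip(q_hat, c_hat))
    return float(np.sqrt(n / len(q_hat)) * np.sqrt(s))

print(mindist("baabccbc", "babcacca", n=128))  # the word pair from this slide
```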
Novelty Detection: fault detection, interestingness detection, anomaly detection, surprisingness detection.
A temporal notion of meaning: words with similar temporal behavior are similar (similar = same time-frame, similar shape). Similar can mean synonym (#papafrancesco, #newpope) OR contextually related (Boston, bomb, marathon). [Plot: Boston Bomb cluster, normalized series for the words scene, muslim, safe, terrorist, boston, innoc]
Temporal Data Mining: discovering temporal patterns in text information collected over time (Mei and Zhai, 2005). When applied to large and lengthy micro-blog streams, the main problem is complexity:
- Need an efficient way of representing continuous signals
- Need methods to prune non-relevant signals / identify relevant patterns
- Need algorithms to efficiently detect similarity among signals
Evaluation is also an issue: millions of often obscure patterns, and a lack of gold-standard datasets and benchmarks.
SAX*: temporal clustering based on Symbolic Aggregate Approximation. Three steps:
1. Convert signals into sequences of symbols using Symbolic Aggregate Approximation
2. Learn patterns of collective attention (a regular expression)
3. Cluster relevant strings in sliding windows
1. Symbolic Aggregate Approximation. Parameters: W (dimension of the temporal window), Δ (discretization step), Σ (alphabet of symbols).
- The signal is first Z-normalized.
- In each (sliding) window W, the signal is partitioned vertically into W/Δ slices, and the average value is computed in each slot Δ_j.
- The signal is partitioned horizontally into |Σ| slices of equal area; let β_j be the breakpoints.
- A symbol is associated to every Δ_i according to: $s_i = j$, $j \in \Sigma$, iff $\beta_{j-1} < s_i < \beta_j$.
Example (two symbols, breakpoint is 0). With W = 10, Δ = 1, Σ = {a, b}, the string is aaabbaabba. [Plot: the per-slot averages labelled a a a b b a a b b a]
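With Σ = {a, b} the only breakpoint is 0, so discretization reduces to the sign of each slot average. A minimal sketch that reproduces the slide's string (the slot averages are hypothetical values chosen for illustration):

```python
# W = 10 window, one slot per day (Δ = 1), Σ = {a, b}: one breakpoint at 0,
# so each slot's (z-normalized) average maps to 'a' if below 0, else 'b'.
slot_means = [-0.8, -0.6, -0.2, 0.5, 1.4, -0.1, -0.3, 0.9, 1.1, -0.4]  # hypothetical
word = "".join("a" if m < 0 else "b" for m in slot_means)
print(word)  # -> 'aaabbaabba', the string on the slide
```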
2. Patterns of collective attention
- Apply SAX to about 30 manually selected words related to known events from Wikipedia event descriptions (Arab Spring, Olympics, tsunami, Occupy Wall Street, ...).
- Generate compatible regular expressions using the RPNI algorithm (Oncina and Garcia, 1992).
- The following regex is learned (one or two peaks/plateaux), as exercised in the sketch below: a+b?bb?a+?a+b?bba*?
- It turns out to be compatible with the shapes graphically shown in previous works on clustering patterns of collective attention (Lehmann et al. 2012; Xie et al. 2013; Weng and Lee 2011; Yang and Leskovec 2011).
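A sketch of the pruning step, with the pattern copied verbatim from the slide (`is_relevant` is a hypothetical helper name; whether the expression is meant as a single pattern or as alternatives is not stated, so this treats it as one):

```python
import re

# Learned pattern, copied verbatim from the slide (one or two peaks/plateaux)
COLLECTIVE_ATTENTION = re.compile(r"a+b?bb?a+?a+b?bba*?")

def is_relevant(sax_word: str) -> bool:
    """Keep a signal only if its whole SAX word matches the learned pattern."""
    return COLLECTIVE_ATTENTION.fullmatch(sax_word) is not None

print(is_relevant("aaabbaabba"))  # the two-peak example above -> True
print(is_relevant("aaaaaaaaaa"))  # a flat signal -> False
```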
3. Pattern clustering
- Bottom-up hierarchical clustering algorithm with complete linkage (Jain 2010); see the sketch after this list.
- Stop the bottom-up aggregation for a cluster when SD(d(centroid, t_k)) < δ.
- Clusters smaller than f elements are purged.
- To summarize, the parameters are: W, Δ, Σ, δ, f.
- Clusters are created in sliding windows of length W; Δ-clusters are subsequently generated (clusters active in slot Δ).
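A hedged sketch of the clustering step with SciPy, assuming Euclidean distance between the per-window signals and using a dendrogram cut at δ as a stand-in for the SD(d(centroid, t_k)) < δ stopping rule (`cluster_signals`, `delta`, and `f` follow the slide's parameter names, but the code is illustrative, not the authors' implementation):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def cluster_signals(series: np.ndarray, delta: float, f: int) -> list:
    """series: (n_signals, W) matrix of per-window signals.

    Complete-linkage bottom-up clustering, cut at distance delta;
    clusters with fewer than f members are purged."""
    Z = linkage(pdist(series), method="complete")
    labels = fcluster(Z, t=delta, criterion="distance")
    clusters = [np.where(labels == k)[0] for k in np.unique(labels)]
    return [c for c in clusters if len(c) >= f]

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 10))          # 50 signals over a W = 10 window
print(cluster_signals(X, delta=3.0, f=3))  # indices of surviving clusters
```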
Experiments: one year of the 1% Twitter stream (2012-13), ~7 TB of data. Two types of experiments: event detection (in English) and (multilingual) hashtag sense clustering.
- Parameter settings (especially W, Δ, Σ) depend on the density and locality of the stream. With the 1% world-wide stream, only big events can be detected.
- Best results with W = 10 days, Δ = 1 day, Σ = {a, b}.
- For each experiment (events, hashtags), SAX* clustering is performed 365 times (once for each sliding window W_i).
Example: bomb during the Boston marathon, non-normalized. [Plot: raw frequency series, values up to ~20000, for the words scene, muslim, safe, terrorist, boston, innoc]
Example: bomb during the Boston marathon, normalized. [Plot: Boston Bomb cluster, normalized series for the same words]
Example: London Olympics hashtags. [Plot: normalized hashtag time series (London 2012 Olympics) for Olympics, Olimpiadi2012, londra2012, London2012, Londres2012]
Example: #habemuspapam. Cluster: PapaFrancesco, francesco, habaemuspapam, habemuspapa, habemuspapam, newpope, papafrancesco, papafrancesco, pope, francois1er, habemuspapam, NouveauPape, habemuspapam, nouveaupape, confessionsintimes, Francesco, HabemusPapa, HabemusPapam, Habemuspapam