Giovanni Stilo SAX! A Symbolic Representation of Time Series

Similar documents
Identifying Patterns and Anomalies in Delayed Neutron Monitor Data of Nuclear Power Plant

Study of Wavelet Functions of Discrete Wavelet Transformation in Image Watermarking

Time Series (1) Information Systems M

Data preprocessing. DataBase and Data Mining Group 1. Data set types. Tabular Data. Document Data. Transaction Data. Ordered Data

A Framework of Detecting Burst Events from Micro-blogging Streams

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS341 info session is on Thu 3/1 5pm in Gates415. CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

Global vs. Multiscale Approaches

Noise & Data Reduction

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

Introduction to Time Series Mining. Slides edited from Keogh Eamonn s tutorial:

Lecture 24: Principal Component Analysis. Aykut Erdem May 2016 Hacettepe University

Text Analytics (Text Mining)

Time Series Analysis and Mining with R

Fast Hierarchical Clustering from the Baire Distance

Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University

Time-Series Analysis Prediction Similarity between Time-series Symbolic Approximation SAX References. Time-Series Streams

CS570 Introduction to Data Mining

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

Cambridge University Press The Mathematics of Signal Processing Steven B. Damelin and Willard Miller Excerpt More information

Internet Video Search

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES

Collaborative Filtering. Radek Pelánek

Digital Image Processing

Introduction to Information Retrieval

Multimedia analysis and retrieval

Information Management course

High Dimensional Search Min-Hashing Locality Sensitive Hashing

Compression methods: the 1 st generation

Real-time Sentiment-Based Anomaly Detection in Twitter Data Streams

WAVELET TRANSFORMS IN TIME SERIES ANALYSIS

Rainfall data analysis and storm prediction system

Statistics 202: Data Mining. c Jonathan Taylor. Week 2 Based in part on slides from textbook, slides of Susan Holmes. October 3, / 1

Part I. Other datatypes, preprocessing. Other datatypes. Other datatypes. Week 2 Based in part on slides from textbook, slides of Susan Holmes

Hidden Markov Models Part 1: Introduction

Sequential Logic Optimization. Optimization in Context. Algorithmic Approach to State Minimization. Finite State Machine Optimization

Algorithms for Data Science: Lecture on Finding Similar Items

Latent semantic indexing

Noise & Data Reduction

Activity Mining in Sensor Networks

HOT SAX: Efficiently Finding the Most Unusual Time Series Subsequence

Index. p, lip, 78 8 function, 107 v, 7-8 w, 7-8 i,7-8 sine, 43 Bo,94-96

IV Course Spring 14. Graduate Course. May 4th, Big Spatiotemporal Data Analytics & Visualization

Image Data Compression

Mining of Massive Datasets Jure Leskovec, AnandRajaraman, Jeff Ullman Stanford University

Contents. Preface to the second edition. Preface to the fírst edition. Acknowledgments PART I PRELIMINARIES

Introduction to Wavelet. Based on A. Mukherjee s lecture notes

CPSC 340: Machine Learning and Data Mining. More PCA Fall 2017

Exploring the Patterns of Human Mobility Using Heterogeneous Traffic Trajectory Data

Cluster Analysis (Sect. 9.6/Chap. 14 of Wilks) Notes by Hong Li

Rare Time Series Motif Discovery from Unbounded Streams

Final Exam, Machine Learning, Spring 2009

CS249: ADVANCED DATA MINING

PV211: Introduction to Information Retrieval

Multi-scale anomaly detection algorithm based on infrequent pattern of time series

Expectation Maximization

Natural Language Processing

Advanced Techniques for Mining Structured Data: Process Mining

Principal Component Analysis, A Powerful Scoring Technique

Advanced Introduction to Machine Learning CMU-10715

Natural Language Processing. Topics in Information Retrieval. Updated 5/10

Index. Copyright (c)2007 The Society for Industrial and Applied Mathematics From: Matrix Methods in Data Mining and Pattern Recgonition By: Lars Elden

Scalable Algorithms for Distribution Search

Engineering Physics 3W4 "Acquisition and Analysis of Experimental Information" Part II: Fourier Transforms

2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51

Identification of ELM evolution patterns with unsupervised clustering of time-series similarity metrics

ECE 901 Lecture 16: Wavelet Approximation Theory

A Bayesian Model of Diachronic Meaning Change

Leverage Sparse Information in Predictive Modeling

Principles of Pattern Recognition. C. A. Murthy Machine Intelligence Unit Indian Statistical Institute Kolkata

Knowledge Discovery in Databases II Summer Term 2017

USING SINGULAR VALUE DECOMPOSITION (SVD) AS A SOLUTION FOR SEARCH RESULT CLUSTERING

Kronecker Decomposition for Image Classification

4/26/2017. More algorithms for streams: Each element of data stream is a tuple Given a list of keys S Determine which tuples of stream are in S

Slide source: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University.

Frame Diagonalization of Matrices

CS60021: Scalable Data Mining. Dimensionality Reduction

Introduction to Hilbert Space Frames

Singular Value Decompsition

University of Florida CISE department Gator Engineering. Clustering Part 1

Course overview. Heikki Mannila Laboratory for Computer and Information Science Helsinki University of Technology

Mining of Massive Datasets Jure Leskovec, AnandRajaraman, Jeff Ullman Stanford University

Compressing Tabular Data via Pairwise Dependencies

CONTEMPORARY ANALYTICAL ECOSYSTEM PATRICK HALL, SAS INSTITUTE

Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing Distance Measures. Modified from Jeff Ullman

Maths for Signals and Systems Linear Algebra for Engineering Applications

Data Exploration and Unsupervised Learning with Clustering

CS145: INTRODUCTION TO DATA MINING

Retrieval by Content. Part 2: Text Retrieval Term Frequency and Inverse Document Frequency. Srihari: CSE 626 1

BAYESIAN MACHINE LEARNING.

Wavelet Methods for Time Series Analysis. What is a Wavelet? Part I: Introduction to Wavelets and Wavelet Transforms. sines & cosines are big waves

B490 Mining the Big Data

Modeling Complex Temporal Composition of Actionlets for Activity Prediction

Measurement and Data. Topics: Types of Data Distance Measurement Data Transformation Forms of Data Data Quality

University of Washington Department of Electrical Engineering EE512 Spring, 2006 Graphical Models

Highland Park Physics I Curriculum Semester I Weeks 1-4

Must-read Material : Multimedia Databases and Data Mining. Indexing - Detailed outline. Outline. Faloutsos

Data Fitting and Uncertainty

Transcription:

Giovanni Stilo stilo@di.uniroma1.it SAX! A Symbolic Representation of Time Series

What are Time Series? A time series is a collection of observations made sequentially in time. [Figure: an example series of raw readings (25.1750, 25.2250, ..., 24.7500), about 500 observations plotted on a y-axis running from 23 to 29.]

Time Series are Ubiquitous. People measure things: Schwarzenegger's popularity rating, their blood pressure, the annual rainfall in New Zealand, the value of their Yahoo stock, the number of web hits per second. And things change over time. Thus time series occur in virtually every medical, scientific, and business domain.

Image data may best be thought of as time series.

Video data may best be thought of as time series. [Figure: two motion-capture traces over 90 frames. "Point": hand at rest, hand moving to shoulder level, steady pointing. "Gun-Draw": hand at rest, hand moving above holster, hand moving down to grasp gun, hand moving to shoulder level, steady pointing.]

What do we want to do with the time series data? Clustering, Classification, Motif Discovery, Rule Discovery (e.g., a rule with support s = 0.5 and confidence c = 0.3), Query by Content, Novelty Detection.

All these problems require similarity matching.

Euclidean Distance Metric. Given two time series Q = q_1 ... q_n and C = c_1 ... c_n, their Euclidean distance is defined as:

$D(Q,C) \equiv \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}$

[Figure: Q and C plotted together, with D(Q,C) illustrated by the pointwise differences.]
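This definition translates directly to code (a minimal NumPy sketch; the function name is illustrative):

    import numpy as np

    def euclidean_distance(q: np.ndarray, c: np.ndarray) -> float:
        """D(Q, C): square root of the sum of squared pointwise differences."""
        assert q.shape == c.shape, "Q and C must have the same length"
        return float(np.sqrt(np.sum((q - c) ** 2)))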

The Generic Data Mining Algorithm
1. Create an approximation of the data which will fit in main memory, yet retains the essential features of interest.
2. Approximately solve the problem at hand in main memory.
3. Make (hopefully very few) accesses to the original data on disk to confirm the solution obtained in Step 2, or to modify the solution so it agrees with the solution we would have obtained on the original data.
But which approximation should we use?

Time Series Representations
- Data Adaptive: Sorted Coefficients; Piecewise Polynomial (Piecewise Linear Approximation: Interpolation, Regression; Adaptive Piecewise Constant Approximation); Singular Value Decomposition; Symbolic (Natural Language, Strings); Trees
- Non Data Adaptive: Wavelets (Orthonormal: Haar, Daubechies dbn with n > 1, Coiflets, Symlets; Bi-Orthonormal); Random Mappings; Spectral (Discrete Fourier Transform, Discrete Cosine Transform); Piecewise Aggregate Approximation
[Figure: the same series of length 128 approximated by DFT, DWT, SVD, APCA, PAA, PLA, and a symbolic representation (SYM), e.g. the string UUCUCUCD.]

The Generic Data Mining Algorithm
1. Create an approximation of the data which will fit in main memory, yet retains the essential features of interest.
2. Approximately solve the problem at hand in main memory.
3. Make (hopefully very few) accesses to the original data on disk to confirm the solution obtained in Step 2, or to modify the solution so it agrees with the solution we would have obtained on the original data.
This only works if the approximation allows lower bounding.

What is lower bounding?
Exact (Euclidean) distance: $D(Q,S) \equiv \sqrt{\sum_{i=1}^{n} (q_i - s_i)^2}$
Lower bounding distance: $D_{LB}(\bar{Q},\bar{S}) \equiv \sqrt{\frac{n}{w} \sum_{i=1}^{w} (\bar{q}_i - \bar{s}_i)^2}$
Lower bounding means that for all Q and S, we have $D_{LB}(\bar{Q},\bar{S}) \leq D(Q,S)$.
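A minimal sketch of this PAA-based lower bound (assuming n is divisible by w; function and variable names are mine):

    import numpy as np

    def paa(series: np.ndarray, w: int) -> np.ndarray:
        """Piecewise Aggregate Approximation: the mean of each of w equal-length segments."""
        n = len(series)
        assert n % w == 0, "this sketch assumes n is divisible by w"
        return series.reshape(w, n // w).mean(axis=1)

    def d_lb(q: np.ndarray, s: np.ndarray, w: int) -> float:
        """D_LB(Q, S) = sqrt(n/w * sum_i (qbar_i - sbar_i)^2); never exceeds D(Q, S)."""
        n = len(q)
        qbar, sbar = paa(q, w), paa(s, w)
        return float(np.sqrt((n / w) * np.sum((qbar - sbar) ** 2)))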

Time Series Representations (the same taxonomy as above). We can live without trees, random mappings, and natural language, but it would be nice if we could lower bound strings (symbolic or discrete approximations). A lower bounding symbolic approach would allow data miners to:
- use suffix trees, hashing, Markov models, etc.
- use text processing and bioinformatics algorithms.

The first symbolic representation of time series that allows:
- lower bounding of Euclidean distance
- dimensionality reduction
- numerosity reduction

This representation is SAX: Symbolic Aggregate ApproXimation. [Figure: a time series discretized into the SAX word baabccbc.]

How do we obtain SAX? First convert the time series to the PAA representation, then convert the PAA to symbols. It takes linear time. [Figure: a series C of length 128 reduced by PAA and mapped, with breakpoints delimiting the symbol regions a, b, c, to the SAX word baabccbc.]
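A sketch of the two-step conversion, with breakpoints chosen as equiprobable quantiles of the standard normal (the function name and defaults are illustrative):

    import numpy as np
    from scipy.stats import norm

    def sax_word(series: np.ndarray, w: int, alphabet: str = "abc") -> str:
        """Convert a z-normalized series into a SAX word of w symbols."""
        n = len(series)
        segments = series[: n - n % w].reshape(w, -1).mean(axis=1)  # PAA step
        a = len(alphabet)
        breakpoints = norm.ppf(np.arange(1, a) / a)  # a-1 equiprobable cut points
        return "".join(alphabet[np.searchsorted(breakpoints, v)] for v in segments)

With w = 8 and a three-symbol alphabet this produces 8-character words such as baabccbc.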

Why a Gaussian? Time series subsequences tend to have a highly Gaussian distribution. [Figure: a normal probability plot of the (cumulative) distribution of values from subsequences of length 128.]
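A normal probability plot like the one in the figure can be reproduced with SciPy (a sketch; the random-walk series is a stand-in for real data):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    rng = np.random.default_rng(0)
    series = np.cumsum(rng.standard_normal(10_000))  # stand-in for a real time series

    # Pool values from z-normalized subsequences of length 128.
    subs = np.lib.stride_tricks.sliding_window_view(series, 128)[::128]
    subs = (subs - subs.mean(axis=1, keepdims=True)) / subs.std(axis=1, keepdims=True)

    # A roughly straight line indicates a Gaussian distribution of values.
    stats.probplot(subs.ravel(), dist="norm", plot=plt)
    plt.show()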

Visual Comparison. [Figure: the same raw series approximated by DFT, PLA, Haar, and APCA, and discretized with a six-symbol alphabet a-f.] A raw time series of length 128 is transformed into the word ffffffeeeddcbaabceedcbaaaaacddee. We can use more symbols to represent the time series, since each symbol requires fewer bits than real numbers (float, double).

Euclidean distance: $D(Q,C) \equiv \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}$
PAA distance (lower-bounds the Euclidean distance): $DR(\bar{Q},\bar{C}) \equiv \sqrt{\frac{n}{w}} \sqrt{\sum_{i=1}^{w} (\bar{q}_i - \bar{c}_i)^2}$
SAX distance, on the words $\hat{C}$ = baabccbc and $\hat{Q}$ = babcacca: $MINDIST(\hat{Q},\hat{C}) \equiv \sqrt{\frac{n}{w}} \sqrt{\sum_{i=1}^{w} (dist(\hat{q}_i, \hat{c}_i))^2}$
dist() can be implemented using a table lookup.
[Figure: Q and C of length 128 shown raw, as PAA approximations, and as SAX words.]
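The dist() lookup table and MINDIST can be sketched as follows (the table construction follows the standard SAX rule that adjacent symbols have distance 0; function names are mine):

    import numpy as np
    from scipy.stats import norm

    def sax_dist_table(a: int) -> np.ndarray:
        """Table with dist(i, j) = 0 if |i - j| <= 1, else beta_{max(i,j)-1} - beta_{min(i,j)}."""
        beta = norm.ppf(np.arange(1, a) / a)  # a-1 Gaussian breakpoints
        table = np.zeros((a, a))
        for i in range(a):
            for j in range(a):
                if abs(i - j) > 1:
                    table[i, j] = beta[max(i, j) - 1] - beta[min(i, j)]
        return table

    def mindist(q_hat: str, c_hat: str, n: int, alphabet: str = "abc") -> float:
        """MINDIST(Q^, C^) = sqrt(n/w) * sqrt(sum_i dist(q^_i, c^_i)^2)."""
        w = len(q_hat)
        table = sax_dist_table(len(alphabet))
        index = {s: k for k, s in enumerate(alphabet)}
        sq = sum(table[index[qs], index[cs]] ** 2 for qs, cs in zip(q_hat, c_hat))
        return float(np.sqrt(n / w) * np.sqrt(sq))

    # e.g. mindist("babcacca", "baabccbc", n=128) for the words above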

Novelty Detection (also known as fault detection, interestingness detection, anomaly detection, surprisingness detection).

A temporal notion of meaning: words with similar temporal behavior are similar, where "similar" means occurring in the same time-frame with a similar shape. Such words may be synonyms (#papafrancesco, #newpope) OR contextually related (Boston, bomb, marathon). [Figure: Boston Bomb Cluster, normalized: the time series of the words scene, muslim, safe, terrorist, boston, innoc.]

Temporal Data Mining: discovering temporal patterns in text information collected over time (Mei and Zhai, 2005). When applied to a large and lengthy micro-blog stream, the main problem is complexity. We need:
- an efficient way of representing continuous signals
- methods to prune non-relevant signals and identify relevant patterns
- algorithms to efficiently detect similarity among signals
Evaluation is also an issue: millions of often obscure patterns, and a lack of gold-standard datasets and benchmarks.

SAX*: temporal clustering based on Symbolic Aggregate Approximation. Three steps:
1. Convert signals into sequences of symbols using Symbolic Aggregate Approximation.
2. Learn patterns of collective attention (a regular expression).
3. Cluster relevant strings in sliding windows.

1. Symbolic Aggregate Approximation
Parameters: W (width of the temporal window), Δ (discretization step), Σ (alphabet of symbols).
- The signal is first Z-normalized.
- In each (sliding) window W, the signal is partitioned vertically into W/Δ slices, and the average value $\bar{s}_i$ is computed in each slot $\Delta_i$.
- The signal is partitioned horizontally into |Σ| slices of equal area; let the $\beta_j$ be the breakpoints.
- A symbol is associated to every slot $\Delta_i$ according to: $s_i = j$, $j \in \Sigma$, iff $\beta_{j-1} < \bar{s}_i < \beta_j$.

Example (two symbols, breakpoint at 0): with W = 10, Δ = 1, Σ = {a, b}, the slot labels are a a a b b a a b b a, so the string is aaabbaabba. [Figure: the corresponding signal over ten slots.]
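The example can be reproduced with a short sketch (two-symbol case only; names and the toy input are mine):

    import numpy as np

    def saxstar_word(signal: np.ndarray, w: int = 10, delta: int = 1,
                     alphabet: str = "ab") -> str:
        """Z-normalize one window of length W, average per slot of size delta,
        and map each slot average through the single breakpoint at 0."""
        window = signal[:w]
        z = (window - window.mean()) / window.std()        # Z-normalization
        slots = z.reshape(w // delta, delta).mean(axis=1)  # W/delta slot averages
        return "".join(alphabet[int(v >= 0)] for v in slots)

    # A window that is below its mean in slots 1-3, 6-7 and 10, and above it
    # elsewhere, yields the word "aaabbaabba".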

2. Patterns of collective attention
- Apply SAX to manually selected words (about 30) related to known events from Wikipedia event descriptions (Arab Spring, Olympics, tsunami, Occupy Wall Street, ...).
- Generate compatible regular expressions using the RPNI algorithm (Oncina and Garcia 1992).
- The following regex is learned (one-two peaks/plateaux): a+b?bb?a+?a+b?bba*?
- It turns out to be compatible with the shapes graphically shown in previous works on clustering patterns of collective attention (Lehmann et al. 2012; Xie et al. 2013; Weng and Lee 2011; Yang and Leskovec 2011).
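The pruning implied by this step can be sketched with Python's re module (the pattern is copied verbatim from the slide; how it was originally split into alternatives is not recoverable from the transcript, so treat this reading as an assumption):

    import re

    # Learned collective-attention pattern (one-two peaks/plateaux), verbatim.
    COLLECTIVE_ATTENTION = re.compile(r"a+b?bb?a+?a+b?bba*?")

    def is_relevant(sax_word: str) -> bool:
        """Keep a signal only if its whole SAX word matches the learned pattern."""
        return COLLECTIVE_ATTENTION.fullmatch(sax_word) is not None

    # The example word from the previous slide has two short peaks and matches:
    assert is_relevant("aaabbaabba")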

3. Pattern clustering
- Bottom-up hierarchical clustering algorithm with complete linkage (Jain 2010).
- Stop hierarchical bottom-up aggregation for a cluster when: $SD(d(centroid, t_k)) < \delta$.
- Clusters smaller than f elements are purged.
To summarize, the parameters are: W, Δ, Σ, δ, f. Clusters are created in sliding windows of length W; Δ-clusters are subsequently generated (clusters active in slot Δ).
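A sketch of this step using SciPy's complete-linkage clustering; note that a plain distance cut at δ stands in for the stopping rule SD(d(centroid, t_k)) < δ above, and the per-token vectors and names are illustrative:

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import pdist

    def cluster_window(vectors: np.ndarray, delta_thr: float, f: int) -> list[np.ndarray]:
        """Bottom-up complete-linkage clustering of per-token signals in one window.
        The dendrogram is cut at distance delta_thr; clusters with fewer than f
        members are purged."""
        z = linkage(pdist(vectors), method="complete")
        labels = fcluster(z, t=delta_thr, criterion="distance")
        clusters = [np.flatnonzero(labels == k) for k in np.unique(labels)]
        return [c for c in clusters if len(c) >= f]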

Experiments: one year of the 1% Twitter stream (2012-13), ~7 TB of data. Two types of experiments: event detection (in English) and (multilingual) hashtag sense clustering. Parameter settings (especially W, Δ, Σ) depend on the density and locality of the stream; with the 1% world-wide stream, only big events can be detected. Best results with W = 10 days, Δ = 1 day, Σ = {a, b}. For each experiment (events, hashtags), SAX* clustering is performed 365 times (once for each sliding window W_i).

Example: bomb during the Boston marathon, non-normalized. [Figure: raw frequency time series of the words scene, muslim, safe, terrorist, boston, innoc.]

Example: bomb during the Boston marathon, normalized. [Figure: Boston Bomb Cluster, normalized time series of the words scene, muslim, safe, terrorist, boston, innoc.]

Example: London Olympics hashtags. [Figure: normalized hashtag time series (London 2012 Olympics) for Olympics, Olimpiadi2012, londra2012, London2012, Londres2012.]

Example: #habemuspapam. Cluster members: PapaFrancesco, francesco, habaemuspapam, habemuspapa, habemuspapam, newpope, papafrancesco, papafrancesco, pope, francois1er, habemuspapam, NouveauPape, habemuspapam, nouveaupape, confessionsintimes, Francesco, HabemusPapa, HabemusPapam, Habemuspapam.