HOT SAX: Efficiently Finding the Most Unusual Time Series Subsequence

Similar documents
Time series anomaly discovery with grammar-based compression

Time-Series Analysis Prediction Similarity between Time-series Symbolic Approximation SAX References. Time-Series Streams

Identifying Patterns and Anomalies in Delayed Neutron Monitor Data of Nuclear Power Plant

Introduction to Time Series Mining. Slides edited from Keogh Eamonn s tutorial:

Time Series (1) Information Systems M

On the Projection Matrices Influence in the Classification of Compressed Sensed ECG Signals

Anomaly Detection. Jing Gao. SUNY Buffalo

Anomaly Detection for the CERN Large Hadron Collider injection magnets

ECE 661: Homework 10 Fall 2014

Anomaly Detection in Real-Valued Multidimensional Time Series

Automatic detection and of dipoles in large area SQUID magnetometry

CS570 Data Mining. Anomaly Detection. Li Xiong. Slide credits: Tan, Steinbach, Kumar Jiawei Han and Micheline Kamber.

Classification Based on Logical Concept Analysis

MTH 517 COURSE PROJECT

Linear Classification: Perceptron

Time Series Shapelets

Frequent Itemsets and Association Rule Mining. Vinay Setty Slides credit:

Shared Segmentation of Natural Scenes. Dependent Pitman-Yor Processes

Data Mining Prof. Pabitra Mitra Department of Computer Science & Engineering Indian Institute of Technology, Kharagpur

High-Dimensional Indexing by Distributed Aggregation

Detecting Anomalies in Time Series Data Via A Meta-feature Based Approach

Improvement of Hubble Space Telescope subsystems through data mining

Unsupervised Learning: K- Means & PCA

Lecture 2. Judging the Performance of Classifiers. Nitin R. Patel

Spatial Database. Ahmad Alhilal, Dimitris Tsaras

CS229 Supplemental Lecture notes

15 Introduction to Data Mining

An Entropy-based Method for Assessing the Number of Spatial Outliers

Module 1: Analyzing the Efficiency of Algorithms

Self-Tuning Spectral Clustering

Automated P-Wave Quality Assessment for Wearable Sensors

Final Exam, Machine Learning, Spring 2009

CHAPTER 3 FEATURE EXTRACTION USING GENETIC ALGORITHM BASED PRINCIPAL COMPONENT ANALYSIS

Algorithm Design Strategies V

Data Mining in Chemometrics Sub-structures Learning Via Peak Combinations Searching in Mass Spectra

Linear Classification: Linear Programming

Finding persisting states for knowledge discovery in time series

Independent Events. The multiplication rule for independent events says that if A and B are independent, P (A and B) = P (A) P (B).

Learning Time-Series Shapelets

Geographic Analysis of Linguistically Encoded Movement Patterns A Contextualized Perspective

Classification and Regression Trees

Predictive analysis on Multivariate, Time Series datasets using Shapelets

Decision Tree Learning

Chapter 15. Probability Rules! Copyright 2012, 2008, 2005 Pearson Education, Inc.

Heartbeats Arrhythmia Classification Using Probabilistic Multi-Class Support Vector Machines: A Comparative Study

Multi-scale anomaly detection algorithm based on infrequent pattern of time series

Lecture Notes 1: Vector spaces

Aberrant Behavior Detection in Time Series for Monitoring Business-Critical Metrics (DRAFT)

Change Detection in Multivariate Data

Pattern Structures 1

11/21/2011. The Magnetic Field. Chapter 24 Magnetic Fields and Forces. Mapping Out the Magnetic Field Using Iron Filings

CS246: Mining Massive Data Sets Winter Only one late period is allowed for this homework (11:59pm 2/14). General Instructions

Introduction to Data Mining

Decision Tree Analysis for Classification Problems. Entscheidungsunterstützungssysteme SS 18

Pearson Edexcel GCSE Physics Unit P3: Applications of Physics

Gaussian Process Regression with K-means Clustering for Very Short-Term Load Forecasting of Individual Buildings at Stanford

DATA MINING LECTURE 3. Frequent Itemsets Association Rules

Tutorial 8 Raster Data Analysis

Recursion: Introduction and Correctness

CS 360, Winter Morphology of Proof: An introduction to rigorous proof techniques

CS246 Final Exam. March 16, :30AM - 11:30AM

Divide-Conquer-Glue Algorithms

Reading and Writing. Mathematical Proofs. Slides by Arthur van Goetham

Dock Ligands from a 2D Molecule Sketch

Linear Classification: Linear Programming

arxiv: v1 [cs.sd] 14 Nov 2017

ncounter PlexSet Data Analysis Guidelines

Unsupervised Vocabulary Induction

Time Series Mining and Forecasting

Exam IV, Magnetism May 1 st, Exam IV, Magnetism

Tutorial. Getting started. Sample to Insight. March 31, 2016

Meelis Kull Autumn Meelis Kull - Autumn MTAT Data Mining - Lecture 05

Reinforcement Learning

Unsupervised Anomaly Detection for High Dimensional Data

Evolving a New Feature for a Working Program

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Tailored Bregman Ball Trees for Effective Nearest Neighbors

Structural Geology, GEOL 330 Spring 2012

Basic electromagnetism and electromagnetic induction

Surprise Detection in Science Data Streams Kirk Borne Dept of Computational & Data Sciences George Mason University

Algorithm User Guide:

GLR-Entropy Model for ECG Arrhythmia Detection

An Experimental Evaluation of Passage-Based Process Discovery

5. DIVIDE AND CONQUER I

HOW TO ANALYZE SYNCHROTRON DATA

Enhance Security, Safety and Efficiency With Geospatial Visualization

Coupling Analysis of ECG and EMG based on Multiscale symbolic transfer entropy

Association Rule Mining on Web

Time Series. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech. Mining and Forecasting

CS221 Practice Midterm #2 Solutions

Accelerated warning of a bioterrorist event or endemic disease outbreak

Testing Newton s 2nd Law

Anomaly Detection via Over-sampling Principal Component Analysis

Outline for today. Information Retrieval. Cosine similarity between query and document. tf-idf weighting

Data Mining and Machine Learning (Machine Learning: Symbolische Ansätze)

Giovanni Stilo SAX! A Symbolic Representations of Time Series

IMBALANCED DATA. Phishing. Admin 9/30/13. Assignment 3: - how did it go? - do the experiments help? Assignment 4. Course feedback

Data mining, 4 cu Lecture 5:

Skylines. Yufei Tao. ITEE University of Queensland. INFS4205/7205, Uni of Queensland

15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018

Transcription:

HOT SAX: Efficiently Finding the Most Unusual Time Series Subsequence Eamonn Keogh *Jessica Lin Ada Fu University of California, Riverside George Mason Univ Chinese Univ of Hong Kong The 5 th IEEE International Conference on Data Mining Nov 27-30, Houston, TX

Anomaly (interestingness) detection We would like to be able to discover surprising (unusual, interesting, anomalous) patterns in time series. Note that we don t know in advance in what way the time series might be surprising Also note that surprising is very context dependent, application dependent, subjective etc. 2

Simple Approaches I 35 Limit Checking 30 25 20 15 10 5 0-5 -10 0 100 200 300 400 500 600 700 800 900 1000 3

Simple Approaches II 35 Discrepancy Checking 30 25 20 15 10 5 0-5 -10 0 100 200 300 400 500 600 700 800 900 1000 4

Discrepancy Checking: Example normalized sales de-noised threshold Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales Goldenberg, Shmueli, Caruana, and Fienberg Actual value Predicted value 5 The actual value is greater than the predicted value, but still less than the threshold, so no alarm is sounded.

Time Series Discord Discord: subsequence that is least similar to other subsequences Applications: Anomaly detection Clustering Data cleaning 6 ECG qtdb/sel102 (excerpt)

Background Sliding Windows Use a sliding window to extract subsequences 7

Time Series Discords Subsequence C of length n is said to be the discord if C has the largest distance to its nearest non-self match. K th Time Series Discord 8

Non-self Match Non-Self Match: Given a time series T, containing a subsequence C of length n beginning at position p and a matching subsequence M beginning at q, we say that M is a non-self match to C at distance of Dist(M,C) if p q n. 9

Why is the Notion of Non-self Match Important? Consider the following string: abcabcabcabcxxxabcabcabacabc Annotated string: With Non-self match distance: 10

Time Series Discords Subsequence C of length n is said to be the discord if C has the largest distance to its nearest non-self match. K th Time Series Discord 11

Finding Discords: Brute-force [outer loop] For each subsequence in the time series, [inner loop] find the distance to its nearest match The subsequence that has the greatest such value is the discord (i.e. discord is the subsequence with the farthest nearest-neighbor) O(m 2 ) 12

Example 5 best-so-far = 5 13

Example 2 best-so-far = 5 14

Example Optimal Ordering best-so-far = 10 15

Example Optimal Ordering best-so-far = 10 5 16

Observations from Brute-Force Alg. Our goal is to find the subsequence with the greatest distance to its nearest neighbor We keep track of the best-so-far value In the inner loop, as soon as we encounter a distance < best-so-far, we can terminate the loop Such optimization depends on the orderings of subsequences examined in both the outer and the inner loop 17

Heuristic Discord Discovery Two heuristics: One to determine the order in which the outer loop visits the subsequences invoked once need to be no larger than O(m) One for the inner loop takes the current candidate (from the outer loop) into account invoked for every iteration of the outer loop need to be O(1) 18

Three Possible Heuristics Magic O(m) Perfect ordering: for outer loop, subsequences are sorted in descending order of non-self match distance to the NN. for inner loop, subsequences are sorted in ascending order of distance to current candidate (from outer) Perverse O(m 2 ) Reverse of Magic Random - O(m) ~ O(m 2 ) Random ordering for both outer/inner loops works well in practice 19

Approximations to Magic For the outer loop, we don t actually need the perfect ordering Just need to ensure that among the first few subsequences examined, one of them has a large distance to its NN For the inner loop, we don t need the perfect ordering either Need to ensure that among the first few subsequences examined, one of them has a small distance to the current candidate (i.e. smaller than best-so-far) 20

Approximating the Magic Outer Loop Scan the counts of the array entries and find those with the smallest count (i.e. mincount = 1) Subsequences with such SAX strings (mincount = 1) are examined first in the outer loop The rest are ordered randomly Intuition: Unusual subsequences are likely to have rare or unique SAX strings 21

Approximating the Magic Inner Loop When candidate j is being examined in the outer loop Look up its SAX string by examining the array Visit the trie and find the subsequences mapped to the same string these will be examined first The rest are ordered randomly Intuition: subsequences that are mapped to the same SAX strings are likely to be similar 22

HOT SAX Because our algorithm works by using heuristics to order SAX sequences, we call it HOT SAX, short for Heuristically Order Time series using SAX 23

We have changed the original screen shot only by adding a red circle to highlight the anomaly In all the examples below, we have included screen dumps of the MIT ECG server in order to allow people to retrieve the original data independent of us. However, all data is also available from us in a convenient zip file. This is KEY only, the next 8 slides show examples in this format 24

The annotated ECG from PhysioBank (two signals) Anomalies (marked by red lines) found by the discord discovery algorithm. Each of the two traces were searched independently. 25

26 Each of the two traces were searched independently.

27 Each of the two traces were searched independently.

28 Each of the two traces were searched independently.

Adding Linear Trend This is a dataset shown in a previous example To demonstrate that the discord algorithm can find anomalies even with the presence of linear trends, we added linear trend to the ECG data on the top. The new data and the anomalies found are shown below. This is important in ECGs because of the wandering baseline effect. 29

Window Size This example shows that the discord algorithm is not sensitive to the window size. In fact on all problems above, we can double or half the discord length and still find the anomalies. Below is just one example for clarity. discord 200 Each of the two traces were searched independently. discord 100 30 Each of the two traces were searched independently.

Space Shuttle Dataset Energizing Space Shuttle Marotta Valve: Example of a normal cycle De-Energizing 0 100 200 300 400 500 600 700 800 900 1000 Space Shuttle Marotta Valve Poppet pulled significantly out of the solenoid before energizing The De-Energizing phase is normal 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 31

Space Shuttle A More Subtle Problem Space Shuttle Marotta Valve Poppet pulled significantly out of the solenoid before energizing 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 0 50 100 150 32

Premature ventricular contraction Supraventricular escape beat Premature ventricular contraction 0 2000 4000 6000 8000 10000 12000 14000 16000 3-discord, d = 18.9361, location = 4017 2-discord, d = 21.7285, location = 10014 1-discord, d = 25.0896, location = 10871 The time series is record mitdb/x_mitdb/x_108 from the PhysioNet Web Server (The local copy in the UCR archive is called mitdbx_mitdbx_108.txt). It is a two feature time series, here we are looking at just the MLII column. Cardiologists from MIT have annotated the time series, here we have added colored makers to draw attention to those annotations. Here we show the results of finding the top 3 discords on this dataset. We chose a length of 600, because this a little longer than the average length of a single heartbeat. 3 rd Discord 2 nd Discord 1 st Discord MIT-BIH Arrhythmia Database: Record 108 r S r 0 5000 10000 15000 33

Shallow breaths as waking cycle begins Stage II Eyes closed, awake or stage I Eyes open, 0 500 1000 1500 2000 A time series showing a patients respiration (measured by thorax extension), as they wake up. A medical expert, Dr. J. Rittweger, manually segmented the data. The 1- discord is a very obvious deep breath taken as the patient opened their eyes. The 2- discord is much more subtle and impossible to see at this scale. A zoom-in suggests that Dr. J. Rittweger noticed a few shallow breaths that indicated the transition of sleeping stages. Institute for Physiology. Free University of Berlin. Data shows respiration (thorax extension), sampling rate 10 Hz. This is Figure 9 in the paper. This is dataset nprs44 Beginning at 15500 Ending at 22000 The beginning and ending points were chosen for visual clarity (given the small plot size) they do not effect the results 34

Stage II sleep Awake Eyes Closed Stage I sleep 0 500 1000 1500 2000 2500 3000 3500 4000 A time series showing a patients respiration (measured by thorax extension), as they wake up. A medical expert, Dr. J. Rittweger, manually segmented the data. Institute for Physiology.Free University of Berlin. Data shows respiration (thorax extension), sampling rate 10 Hz. This is Figure 10 in the paper. This is dataset nprs43 Beginning at 1 Ending at 4000 The beginning and ending points were chosen for visual clarity (given the small plot size) they do not effect the results 35

The training data used by IMM only (The first 1,000 data points of chfdbchf15) The anomaly The test data (from 1,0001 to 3000 of dataset of chfdbchf15) discord discovery IMM TSA-tree In this experiment, we can say that all the algorithms find the anomaly. The IMM approach has a slightly higher peak value just after the anomaly, but that may simply reflect the slight discretization of the time axis. In the next slide, we consider more of the time series 500 1000 1500 2000 36

The training data used by IMM only (The first 1,000 data points of chfdbchf15) The anomaly The test data (from 1,001 to 15,000 of dataset of chfdbchf15) We can see here that the IMM approach has many false positives, in spite of very careful parameter tuning. It simply cannot handle complex datasets. Both the other algorithms do well here. Note that this problem is in Figure 11 in paper discord discovery IMM TSA-tree 0 5000 10000 15000 37

The training data used by IMM only (The first 700 data points of qtdbsele0606) The anomaly The test data (from 701 to 3,000 of dataset of qtdbsele0606) This example is Figure 12/13 in the paper. Recall that we discussed this example above, it is interesting because the anomaly is extremely subtle. Here only the discord discovery algorithm can find the anomaly. discord discovery IMM TSA-tree 0 500 1000 1500 2000 2500 R T P Q S r 900 1000 1100 1200 2 3 1 4 Discord How was the discord able to find this very subtle Premature ventricular contraction? Note that in the normal heartbeats, the ST wave increases monotonically, it is only in the Premature ventricular contractions that there is an inflection.nb, this is not necessary true for all ECGS 38

The training data used by IMM only 4 normal cycles of Space Shuttle Marotta Valve Series Poppet pulled significantly out of the solenoid before energizing The test data TEK17.txt Space Shuttle Marotta Valve Series This example is Figure 7/8 in the paper. discord discovery Here the anomaly very subtle. Only the discord discovery algorithm can find the anomaly. IMM TSA-tree 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 A reminder of the cause of the anomaly 39

Conclusion & Future Work We define time series discords We introduce the HOT SAX algorithm to efficiently find discords and demonstrate its utility in various domains Future direction includes multi-dimensional data streaming data 40