MTH 517 COURSE PROJECT

Similar documents
A least-squares approach to anomaly detection in static and sequential data

International Journal of Scientific & Engineering Research, Volume 7, Issue 2, February ISSN

The difference-sign runs length distribution in testing for serial independence

CS570 Data Mining. Anomaly Detection. Li Xiong. Slide credits: Tan, Steinbach, Kumar Jiawei Han and Micheline Kamber.

arxiv: v6 [cs.na] 12 Nov 2015

Anomaly Detection. Jing Gao. SUNY Buffalo

Fractals and Physiology: Nature s Biology and Geometry

Rare Event Discovery And Event Change Point In Biological Data Stream

WAVELET APPROXIMATION FOR IMPLEMENTATION IN DYNAMIC TRANSLINEAR CIRCUITS

Predictive analysis on Multivariate, Time Series datasets using Shapelets

Dynamic Data-Driven Adaptive Sampling and Monitoring of Big Spatial-Temporal Data Streams for Real-Time Solar Flare Detection

A COLLECTIVE BIOLOGICAL PROCESSING ALGORITHM FOR EKG SIGNALS

Support Measure Data Description for Group Anomaly Detection

ASYMMETRIC DETRENDED FLUCTUATION ANALYSIS REVEALS ASYMMETRY IN THE RR INTERVALS TIME SERIES

Available online at ScienceDirect. Procedia Computer Science 96 (2016 )

arxiv: v1 [stat.ml] 1 Dec 2018

HOT SAX: Efficiently Finding the Most Unusual Time Series Subsequence

Journal of Chemical and Pharmaceutical Research, 2014, 6(5): Research Article

Time Series Clustering Methods

Anomaly Detection via Over-sampling Principal Component Analysis

Unsupervised Anomaly Detection for High Dimensional Data

Fuzzy Geographically Weighted Clustering

Mining Infrequent Patter ns

A Comparative Evaluation of Anomaly Detection Techniques for Sequence Data. Technical Report

Construction and Analysis of Climate Networks

AN INTERNATIONAL SOLAR IRRADIANCE DATA INGEST SYSTEM FOR FORECASTING SOLAR POWER AND AGRICULTURAL CROP YIELDS

Data Mining Based Anomaly Detection In PMU Measurements And Event Detection

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

Change Detection in Multivariate Data

Interbeat RR Interval Time Series Analyzed with the Sample Entropy Methodology

Detecting Anomalous and Exceptional Behaviour on Credit Data by means of Association Rules. M. Delgado, M.D. Ruiz, M.J. Martin-Bautista, D.

International Journal of Advanced Research in Computer Science and Software Engineering

Anomaly Detection for the CERN Large Hadron Collider injection magnets

MASTER. Anomaly detection on event logs an unsupervised algorithm on ixr-messages. Severins, J.D. Award date: Link to publication

Distributed Event Identification for WSNs in Non-Stationary Environments

Application of Clustering to Earth Science Data: Progress and Challenges

Chapter 5 Identifying hydrological persistence

Finding Multiple Outliers from Multidimensional Data using Multiple Regression

Detecting Heart-Rate while Jogging: Blind Source Separation of Gait and Heartbeat

Stochastic Hydrology. a) Data Mining for Evolution of Association Rules for Droughts and Floods in India using Climate Inputs

Investigations of Cardiac Rhythm Fluctuation Using the DFA Method

Study on a Low Complexity ECG Compression Scheme with Multiple Sensors

International Journal of Advance Engineering and Research Development. Review Paper On Weather Forecast Using cloud Computing Technique

A MULTISCALE APPROACH TO DETECT SPATIAL-TEMPORAL OUTLIERS

Exam Machine Learning for the Quantified Self with answers :00-14:45

Anomaly Detection using Support Vector Machine

Data Mining for the Discovery of Ocean Climate Indices *

Data Mining for the Discovery of Ocean Climate Indices *

Fast Multidimensional Subset Scan for Outbreak Detection and Characterization

Outline. 15. Descriptive Summary, Design, and Inference. Descriptive summaries. Data mining. The centroid

Working with Temporal Data in ArcGIS

Machine Learning on temporal data

Data Fusion and Multi-Resolution Data

LOW-COMPLEXITY, MULTI-CHANNEL, LOSSLESS AND NEAR-LOSSLESS EEG COMPRESSION

Peak Misdetection In Heart-Beat-Based Security: Characterization and Tolerance

Proc. of NCC 2010, Chennai, India

Inclusion of Non-Street Addresses in Cancer Cluster Analysis

Confronting Climate Change in the Great Lakes Region. Technical Appendix Climate Change Projections EXTREME EVENTS

Cluster Analysis using SaTScan. Patrick DeLuca, M.A. APHEO 2007 Conference, Ottawa October 16 th, 2007

Multi-criteria Anomaly Detection using Pareto Depth Analysis

Models, Data, Learning Problems

Discovery of Climate Indices using Clustering

DATA STREAM MINING IN MEDICAL SENSOR-CLOUD

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

Clustering non-stationary data streams and its applications

Anomaly Detection via Online Oversampling Principal Component Analysis

Expectation-based scan statistics for monitoring spatial time series data

Streaming multiscale anomaly detection

Multimedia Communications. Mathematical Preliminaries for Lossless Compression

Non-linear Measure Based Process Monitoring and Fault Diagnosis

Detecting heart rate while jogging: blind source separation of gait and heartbeat

arxiv: v1 [stat.ap] 30 Jun 2015

Machine Learning: Basis and Wavelet 김화평 (CSE ) Medical Image computing lab 서진근교수연구실 Haar DWT in 2 levels

Training the linear classifier

Predicting investment fluxes from implicit lead-lag investor networks

Seasonal forecasting of climate anomalies for agriculture in Italy: the TEMPIO Project

Clustering Techniques and their applications at ECMWF

ECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction

Mixing Dynamics of Heart Rate Variability

A Family of Joint Sparse PCA Algorithms for Anomaly Localization in Network Data Streams. Ruoyi Jiang

SPATIAL ANALYSIS. Transformation. Cartogram Central. 14 & 15. Query, Measurement, Transformation, Descriptive Summary, Design, and Inference

An introduction to clustering techniques

Prompt Network Anomaly Detection using SSA-Based Change-Point Detection. Hao Chen 3/7/2014

More on Unsupervised Learning

Spatial Data Modelling: The Search For Gold In Otago. Presented By Matthew Hill Kenex Knowledge Systems (NZ) Kenex. Kenex

Automating biomedical time-series analysis using massive feature extraction

Assignment 3: Chapter 2 & 3 (2.6, 3.8)

Exploring Climate Patterns Embedded in Global Climate Change Datasets

8. Classifier Ensembles for Changing Environments

Real-time Sentiment-Based Anomaly Detection in Twitter Data Streams

An Overview of Outlier Detection Techniques and Applications

Spatial Extension of the Reality Mining Dataset

Point-to-point response to reviewers comments

Online Nonnegative Matrix Factorization with General Divergences

SWIM and Horizon 2020 Support Mechanism

Contextual Time Series Change Detection

We Prediction of Geological Characteristic Using Gaussian Mixture Model

Charles J. Fisk * Naval Base Ventura County - Point Mugu, CA 1. INTRODUCTION

Mining Climate Data. Michael Steinbach Vipin Kumar University of Minnesota /AHPCRC

Detecting neutrinos from the next galactic supernova in the NOvA detectors. Andrey Sheshukov DLNP JINR

Transcription:

MTH 517 COURSE PROJECT Anomaly Detection in Time Series Aditya Degala - 12051 Shreesh Laddha - 12679 1 Introduction In data mining, anomaly detection (or outlier detection) is the identification of items, events or observations which do not conform to an expected pattern or other items in a dataset. Typically the anomalous items will translate to some kind of problem such as bank fraud, a structural defect, medical problems or finding errors in text. This technique is especially useful in the context of abuse and network intrusion detection, where the interesting objects are often not rare objects, but unexpected bursts in activity. This pattern does not adhere to the common statistical definition of an outlier as a rare object, and many outlier detection methods (in particular unsupervised methods) will fail on such data, unless it has been aggregated appropriately. Instead, a cluster analysis algorithm may be able to detect the micro clusters formed by these patterns. In this dosument we investigate the problem of anomaly detection for univariate time series. Although there has been extensive work on anomaly detection, most of the techniques look for individual objects that are different from normal objects but do not consider the sequence aspect of the data into consideration. Such anomalies are also referred to as point anomalies. If the data is treated as a collection of amplitude values, while ignoring their temporal aspect, the anomaly cannot be detected. Time series data are common in many application areas including medical and environmental sciences, aerospace and industrial engineering, and finance and agriculture. Detecting emergent anomalies in time series data provides significant information for each application. For example, anomaly detection in electrocardiogram (ECG) signals or in aircraft health management saves lives, detecting anomalies in disease incident provides useful information about the possibility of occurring outbreaks, and in industrial process it may help to diagnose the incident faults. In time series, anomaly can be considered as the occurrence of any unexpected changes in a subsequence of data. The term unexpected change makes sense when we compare the available pattern in a subsequence with the existing patterns in the entire time series. As the result, one common approach to anomaly detection in time series is the use of a fixed length sliding window and generating a set of subsequences of time series. In the next step, one may use different techniques to detect and characterize anomalies i.e. assigning an anomaly score to each subsequence. Anomalies occurring in time series can be a result of a change in the amplitude of data (e.g. a heavy rainfall in a week of a year), or it may be a change in the shape (e.g. occurring an arrhythmia within a set of normal heartbeats in ECG signals). Therefore, in this document we have reported anomalies of two types: anomalies in shape and anomalies in amplitude. In this study, we propose a unified framework to detect both types of anomalies. For this purpose, after generating a set of subsequences of time series using a sliding window, a fuzzy C-Means (FCM) clustering has been employed to reveal the available structure within data. Then, a reconstruction criterion is considered to reconstruct the original subsequences from the determined cluster centers (prototypes) and partition matrix. For each subsequence, an anomaly score has been assigned based on the difference between the original subsequence and the reconstructed one. In the case of anomalies in amplitude, the original representation of time series along with the Euclidean distance 1

function is used in the clustering process, while for shape anomalies, first a representation of subsequences are considered to capture the shape information and then, the Euclidean distance in the new feature space has been employed. 2 Anomaly detection using a Fuzzy C-Means Clustering Let us consider a time series x = x 1, x 2,..., x p of length p. We aim at finding a set of subsequences of x with length q, having highest amount of unexpected changes (in shape or amplitude) in terms of anomaly score. For this purpose, a sliding window with length q, moves thorough the time series and generates a set of subsequences. Consequently, there are n subsequences coming in the form x 1 = x 11, x 12,..., x 1q x 2 = x 21, x 22,..., x 2q. (1) x n = x n1, x n2,..., x nq Note that in each movement, the sliding window moves r time steps. As the result, the number of subsequences, n is n = p q + 1 (2) r Considering a low value for r (e.g., r = 1) guarantees that no anomalous subsequences are missed, but processing a high amount of subsequences is time consuming. On the other hand, considering a high value for r (e.g., r = q) generates lower number of subsequences and processing time will be lower, but there is a risk of losing some anomalous subsequences. A trade off between accuracy and processing time can be considered. Selecting the value of r being proportional to the length of subsequences is a reasonable choice, i.e. selecting a higher value of r for longer subsequences and lower value of r for shorter subsequences. The length of sliding window, q, is another important parameter that can be selected based on the application purpose. However, one may consider different values for this parameter to find some appropriate results. Fuzzy C-Means clustering proposed by Dunn and Bezdek is one of the commonly used clustering techniques. It describes n subsequences (or their representation) x 1, x 2,..., x 3 using c (c n) prototypes v 1, v 2,..., v c and a fuzzy partition matrix U through the minimization of the following objective function: c i=1 k=1 n u m ik v i x k 2 (3) where m (m > 1) is a fuzzification coefficient, and. denotes the Euclidean distance function. The defined objective function in (3) can be minimized by calculating the cluster centers using (4) and partition matrix using (5) in an iterative fashion. v i = n k=1 um ik x k n k=1 um ik (4) u ik = 1 ( ) 2/m 1 (5) c vi x k j=1 v i x k Clustering subsequence x 1, x 2,..., x n leads to the generation of a set of prototypes representing the normal structure of subsequences. Each normal subsequence in data set is similar to one or more 2

prototypes or it can be similar to a combination (in the form of weighted average) of prototypes. The more the subsequence is similar to the prototypes, the less anomalous it is. To evaluate how much a subsequence is similar to the revealed prototypes (or their combination) a reconstruction criterion has been considered in this paper. This criterion also was exploited for clustering spatiotemporal data in. Pedrycz and de Oliveira noted that FCM can be viewed as an encoding scheme of data and the original data points (here subsequences) can be decoded (reconstructed) using the estimated cluster centers and partition matrix. Assuming that ˆx 1, ˆx 2,..., ˆx n are the reconstructed version of subsequences x 1, x 2,..., x n, by minimizing the following sum of distances: F = c n u m ik v i ˆx k 2 (6) i=1 k=1 one may arrive at: ˆx k = c k=1 um ik v i c i=1 um ik (7) After calculating the reconstructed version of each subsequence using (7), the reconstruction error in (8), that is a squared Euclidean distance between a subsequence and its reconstructed version is considered as the evaluation criterion to estimate how much a subsequence is similar to prototypes. In other words, for each subsequence the calculated reconstruction error using (8) is considered as its anomaly score. E k = x k ˆx k 2 (8) When detecting anomalies in shape is of concern, the generated subsequences cannot be employed directly in clustering. The reason is that the generated subsequences are not synchronized and using the Euclidean distance function is not efficient as similarity measure. Although there are number of viable distance functions to measure the dissimilarity of asynchronous time series with respect to their shapes (e.g. dynamic time warping distance ), one has to be aware of the challenges we may encounter for optimizing the FCM objective function in dealing with those distance functions. In this paper we confine ourselves to the Euclidean distance function. To compare subsequences based on their shape information, each subsequence is normalized to have a zero mean and a standard deviation equal to one. Then, each normalized subsequence is represented using a set of autocorrelation coefficients. Considering x k as a subsequence with length q, its autocorrelation coefficient for lag s can be estimated using (9). y k,s = q t=s+1 (x k,t x k ) (x k,t s x k ) q t=1 (x k,t x k ) 2 (9) 3

Figure 1: A monthly precipitation Time Series along with estimated Amplitude Anomaly Scores. Figure 2: Detection of Sinus Arrhythmia and PVC in ECG Time Series. Figure 3: Application of Shape Anomaly Detection in ECG. 4

Figure 4: Another example. References [1] Chandola, V.; Banerjee, A.; Kumar, V. (2009), Anomaly detection: A survey [2] Hodge, V. J.; Austin, J. (2004), A Survey of Outlier Detection Methodologies [3] Dokas, Paul; Levent Ertoz, Vipin Kumar, Aleksandar Lazarevic, Jaideep Srivastava, Pang-Ning Tan (2002), Data mining for network intrusion detection [4] Goldberger AL, Amaral LAN, Glass L, Hausdorff JM, Ivanov PCh, Mark RG, Mietus JE, Moody GB, Peng C-K, Stanley HE. PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals. Circulation 101(23):e215-e220 [5] The United States Historical Climatology Network (USHCN), Available online at: http://cdiac.ornl.gov/epubs/ndp/ushcn/ushcn.html. [6] V. Chandola, V. Mithal, V. Kumar, Comparative evaluation of anomaly detection techniques for sequence data, In 8th IEEE International Conference on Data Mining, 2008, pp. 743-748. [7] J.C. Dunn, A fuzzy relative of the ISODATA process, and its use in detecting compact wellseparated clusters,j. Cyber. 3 (3), pp. 32 57, 1973. [8] Anomaly detection in time series data using a fuzzy c-means clustering,izakian H, Dept of Electrical and Computer Engineering, University of Alberta, Edmonton, AB, Canada, Pedrycz, W 5