CS570 Data Mining. Anomaly Detection. Li Xiong. Slide credits: Tan, Steinbach, Kumar Jiawei Han and Micheline Kamber.

Similar documents
Anomaly Detection. Jing Gao. SUNY Buffalo

Anomaly (outlier) detection. Huiping Cao, Anomaly 1

Data Mining for Anomaly Detection

Unsupervised Anomaly Detection for High Dimensional Data

An Overview of Outlier Detection Techniques and Applications

Anomaly Detection via Online Oversampling Principal Component Analysis

Expert Systems with Applications

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts

Holdout and Cross-Validation Methods Overfitting Avoidance

Data Mining. Preamble: Control Application. Industrial Researcher s Approach. Practitioner s Approach. Example. Example. Goal: Maintain T ~Td

A Comparative Evaluation of Anomaly Detection Techniques for Sequence Data. Technical Report

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION

Introduction to Machine Learning

ML in Practice: CMSC 422 Slides adapted from Prof. CARPUAT and Prof. Roth

MASTER. Anomaly detection on event logs an unsupervised algorithm on ixr-messages. Severins, J.D. Award date: Link to publication

Anomaly Detection via Over-sampling Principal Component Analysis

Ensemble Methods. NLP ML Web! Fall 2013! Andrew Rosenberg! TA/Grader: David Guy Brizan

Graph-Based Anomaly Detection with Soft Harmonic Functions

Anomaly Detection using Support Vector Machine

International Journal of Scientific & Engineering Research, Volume 7, Issue 2, February ISSN

Contents Lecture 4. Lecture 4 Linear Discriminant Analysis. Summary of Lecture 3 (II/II) Summary of Lecture 3 (I/II)

Lecture 2. Judging the Performance of Classifiers. Nitin R. Patel

Introduction to Machine Learning Midterm Exam

Outlier Detection Using Rough Set Theory

Using HDDT to avoid instances propagation in unbalanced and evolving data streams

Data Mining Based Anomaly Detection In PMU Measurements And Event Detection

8. Classifier Ensembles for Changing Environments

SYSTEMATIC CONSTRUCTION OF ANOMALY DETECTION BENCHMARKS FROM REAL DATA. Outlier Detection And Description Workshop 2013

A Family of Joint Sparse PCA Algorithms for Anomaly Localization in Network Data Streams. Ruoyi Jiang

Data Mining. 3.6 Regression Analysis. Fall Instructor: Dr. Masoud Yaghini. Numeric Prediction

Density-Based Clustering

Anomaly Detection in Logged Sensor Data. Master s thesis in Complex Adaptive Systems JOHAN FLORBÄCK

An overview of Boosting. Yoav Freund UCSD

Introduction to Statistical Inference

Finding Multiple Outliers from Multidimensional Data using Multiple Regression

Surprise Detection in Science Data Streams Kirk Borne Dept of Computational & Data Sciences George Mason University

Lecture 4 Discriminant Analysis, k-nearest Neighbors

Evaluating Electricity Theft Detectors in Smart Grid Networks. Group 5:

Bayesian Networks Inference with Probabilistic Graphical Models

Change Detection in Multivariate Data

Applied Machine Learning Annalisa Marsico

CS145: INTRODUCTION TO DATA MINING

Face Recognition Using Laplacianfaces He et al. (IEEE Trans PAMI, 2005) presented by Hassan A. Kingravi

Classifier performance evaluation

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Anomaly Detection for the CERN Large Hadron Collider injection magnets

Machine Learning Linear Classification. Prof. Matteo Matteucci

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan

LARGE SCALE ANOMALY DETECTION AND CLUSTERING USING RANDOM WALKS

CSE 546 Final Exam, Autumn 2013

IV Course Spring 14. Graduate Course. May 4th, Big Spatiotemporal Data Analytics & Visualization

IMBALANCED DATA. Phishing. Admin 9/30/13. Assignment 3: - how did it go? - do the experiments help? Assignment 4. Course feedback

FINAL: CS 6375 (Machine Learning) Fall 2014

EEL 851: Biometrics. An Overview of Statistical Pattern Recognition EEL 851 1

Tools of AI. Marcin Sydow. Summary. Machine Learning

Data Exploration vis Local Two-Sample Testing

ADVANCED ALGORITHMS FOR KNOWLEDGE DISCOVERY FROM IMBALANCED AND ADVERSARIAL DATA

Logic and machine learning review. CS 540 Yingyu Liang

CSE 5243 INTRO. TO DATA MINING

Università di Pisa A.A Data Mining II June 13th, < {A} {B,F} {E} {A,B} {A,C,D} {F} {B,E} {C,D} > t=0 t=1 t=2 t=3 t=4 t=5 t=6 t=7

Available online at ScienceDirect. Procedia Computer Science 96 (2016 )

IAENG International Journal of Computer Science, 43:1, IJCS_43_1_09. Intrusion Detection System Using PCA and Kernel PCA Methods

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo

Roberto Perdisci^+, Guofei Gu^, Wenke Lee^ presented by Roberto Perdisci. ^Georgia Institute of Technology, Atlanta, GA, USA

Introduction to Signal Detection and Classification. Phani Chavali

Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation

ENTROPY FILTER FOR ANOMALY DETECTION WITH EDDY CURRENT REMOTE FIELD SENSORS

Learning Classification with Auxiliary Probabilistic Information Quang Nguyen Hamed Valizadegan Milos Hauskrecht

CS 484 Data Mining. Classification 7. Some slides are from Professor Padhraic Smyth at UC Irvine

A Framework for Adaptive Anomaly Detection Based on Support Vector Data Description

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin

ECE662: Pattern Recognition and Decision Making Processes: HW TWO

Learning with multiple models. Boosting.

COMP 5331: Knowledge Discovery and Data Mining

Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University

Intro. ANN & Fuzzy Systems. Lecture 15. Pattern Classification (I): Statistical Formulation

Data Mining und Maschinelles Lernen

FRaC: A Feature-Modeling Approach for Semi-Supervised and Unsupervised Anomaly Detection

Handling imbalanced datasets by reconstruction rules in decomposition schemes

ECS289: Scalable Machine Learning

A CUSUM approach for online change-point detection on curve sequences

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project

Your Project Proposals

Advanced Statistical Methods: Beyond Linear Regression

9/26/17. Ridge regression. What our model needs to do. Ridge Regression: L2 penalty. Ridge coefficients. Ridge coefficients

When Dictionary Learning Meets Classification

Introduction to Machine Learning Midterm Exam Solutions

Computational Genomics

MTH 517 COURSE PROJECT

Instance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016

Pattern Recognition and Machine Learning

Support Vector Machines. Introduction to Data Mining, 2 nd Edition by Tan, Steinbach, Karpatne, Kumar

Machine learning for pervasive systems Classification in high-dimensional spaces

CS534 Machine Learning - Spring Final Exam

Machine Learning, Fall 2009: Midterm

Reverse Nearest Neighbors in Unsupervised Distance-Based Outlier Detection. I.Priyanka 1. Research Scholar,Bharathiyar University. G.

ECS289: Scalable Machine Learning

Algorithm-Independent Learning Issues

CLUe Training An Introduction to Machine Learning in R with an example from handwritten digit recognition

Least Squares Classification

Transcription:

CS570 Data Mining Anomaly Detection Li Xiong Slide credits: Tan, Steinbach, Kumar Jiawei Han and Micheline Kamber April 3, 2011 1

Anomaly Detection Anomaly is a pattern in the data that does not conform to the expected behavior outliers, exceptions, peculiarities, surprise

Point Anomalies Contextual Anomalies Collective Anomalies Type of Anomaly

Point Anomalies An individual data instance is anomalous w.r.t. the data Y N 1 o 1 O 3 o 2 N 2 X

Contextual Anomalies An individual data instance is anomalous within a context Requires a notion of context Also referred to as conditional anomalies* Normal Anomaly * Xiuyao Song, Mingxi Wu, Christopher Jermaine, Sanjay Ranka, Conditional Anomaly Detection, IEEE Transactions on Data and Knowledge Engineering, 2006.

Collective Anomalies A collection of related data instances is anomalous Requires a relationship among data instances Sequential Data Spatial Data Graph Data The individual instances within a collective anomaly are not anomalous by themselves Anomalous Subsequence

Anomalies often have significant impact Cyber intrusions Credit card fraud Health condition anomaly Anomaly Detection Crust deformation anomaly

Black Swan Theory The impact of rare events is huge and highly underrated Black swan event The event is a surprise The event has a major impact. After its first recording, the event is rationalized by hindsight Almost all scientific discoveries, historical events are black swan events April 3, 2011 Data Mining: Concepts and Techniques 8

Anomaly Detection Related problems Rare Class Mining Chance discovery Anomaly Detection Novelty Detection Exception Mining Mining needle in a haystack. So much hay and so little time Noise Removal Black Swan

Intrusion Detection: Intrusion Detection Process of monitoring the events occurring in a computer system or network for intrusions Intrusions are attempts to bypass the security mechanisms of a computer or network Approaches Traditional signature-based intrusion detection systems are based on signatures of known attacks Anomaly detection

Fraud Detection Fraud detection detection of criminal activities occurring in commercial organizations Types of fraud Credit card fraud Insurance claim fraud Mobile / cell phone fraud Insider trading Challenges Fast and accurate real-time detection Misclassification cost is very high

Image Processing Detecting outliers in a image monitored over time Detecting anomalous regions within an image Used in mammography image analysis video surveillance satellite image analysis Key Challenges Detecting collective anomalies Data sets are very large Anomaly

Anomaly Detection Supervised Anomaly Detection Labels available for both normal data and anomalies Classification Semi-supervised Anomaly Detection Labels available only for normal data Classification Unsupervised Anomaly Detection No labels assumed Based on the assumption that anomalies are very rare compared to normal data

Output of Anomaly Detection Label Each test instance is given a normal or anomaly label Score Each test instance is assigned an anomaly score Allows the output to be ranked Requires an additional threshold parameter

Classification Based Techniques Main idea: build a classification model for normal (and anomalous (rare)) events based on labeled training data, and use it to classify each new unseen event Categories: Supervised classification techniques Require knowledge of both normal and anomaly class Build classifier to distinguish between normal and known anomalies Semi-supervised classification techniques Require knowledge of normal class only! Use modified classification model to learn the normal behavior and then detect any deviations from normal behavior as anomalous

Advantages: Classification Based Techniques Supervised classification techniques Models that can be easily understood High accuracy in detecting many kinds of known anomalies Semi-supervised classification techniques Models that can be easily understood Normal behavior can be accurately learned Drawbacks: Supervised classification techniques Require both labels from both normal and anomaly class Cannot detect unknown and emerging anomalies Semi-supervised classification techniques Require labels from normal class Possible high false alarm rate - previously unseen (yet legitimate) data records may be recognized as anomalies

Challenge Supervised Anomaly Detection Classification models must be able to handle skewed (imbalanced) class distributions Misclassification cost for the rare class tend to be high

Supervised Classification Techniques Blackbox approaches Manipulating data records (oversampling / undersampling / generating artificial examples) Whitebox approaches Adapt classification models Design new classification models Cost-sensitive classification techniques Ensemble based algorithms (SMOTEBoost, RareBoost

Manipulating Data Records Over-sampling the rare class [Ling98] Make the duplicates of the rare events until the data set contains as many examples as the majority class => balance the classes Down-sizing (undersampling) the majority class [Kubat97] Sample the data records from majority class (Randomly, Near miss examples, Examples far from minority class examples (far from decision boundaries) Generating artificial anomalies SMOTE (Synthetic Minority Over-sampling TEchnique) [Chawla02] - new rare class examples are generated inside the regions of existing rare class examples Artificial anomalies are generated around the edges of the sparsely populated data regions [Fan01]

Adapting Existing Rule Based Classifiers Case specific feature weighting [Cardey97] Increases the weight for rare class examples in decision tree learning Weight dynamically generated based on the path taken by that example Case specific rule weighting [Grzymala00] LERS (Learning from Examples based on Rough Sets) increases the rule strength for all rules describing the rare class

Rare Class Detection Evaluation True positive rate, true negative rate, false positive rate, false negative rate Precision/recall Implications due to imbalanced class distribution Base rate fallacy April 3, 2011 Data Mining: Concepts and Techniques 21

Base Rate Fallacy (Axelsson, 1999)

Base Rate Fallacy Even though the test is 99% certain, your chance of having the disease is 1/100, because the population of healthy people is much larger than sick people

Semi-supervised Classification Techniques Use modified classification model to learn the normal behavior and then detect any deviations from normal behavior as anomalous Recent approaches: Neural network based approaches Support Vector machines (SVM) based approaches Markov model based approaches Rule-based approaches

Using Support Vector Machines One class classification problem computes a spherically shaped decision boundary with minimal volume around a training set of objects.

Anomaly Detection Supervised Anomaly Detection Semi-supervised Anomaly Detection Unsupervised Anomaly Detection Graphical based Statistical based Nearest neighbor based techniques

The image cannot be displayed. Your computer may not have enough memory to open the image, or the image may have been corrupted. Restart your computer, and then open the file again. If the red x still appears, you may have to delete the image and then insert it again. Graphical Approaches Boxplot (1-D), Scatter plot (2-D), Spin plot (3-D) Limitations Time consuming Subjective

Statistical Approaches Assume a parametric model describing the distribution of the data (e.g., normal distribution) Apply a statistical test that depends on Data distribution Parameter of distribution (e.g., mean, variance) Number of expected outliers (confidence limit)

Grubbs Test Detect outliers in univariate data Assume data comes from normal distribution Detects one outlier at a time, remove the outlier, and repeat H 0 : There is no outlier in data H A : There is at least one outlier Grubbs test statistic: Reject H 0 if: G > ( N 1) N G = max N t 2 X s ( α/ N, N 2) 2 + t 2 X ( α/ N, N 2)

Statistical-based Likelihood Approach Assume the data set D contains samples from a mixture of two probability distributions: M (majority distribution) A (anomalous distribution) General Approach: Initially, assume all the data points belong to M Let L t (D) be the log likelihood of D at time t For each point x t that belongs to M, move it to A Let L t+1 (D) be the new log likelihood. Compute the difference, = L t (D) L t+1 (D) If > c (some threshold), then x t is declared as an anomaly and moved permanently from M to A

Limitations of Statistical Approaches Most of the tests are for a single attribute In many cases, data distribution may not be known For high dimensional data, it may be difficult to estimate the true distribution

Distance-based Approaches Data is represented as a vector of features Three major approaches Nearest-neighbor based Density based Clustering based

Nearest Neighbor Based Techniques Key assumption: normal points have close neighbors while anomalies are located far from other points General two-step approach 1. Compute neighborhood for each data record 2. Analyze the neighborhood to determine whether data record is anomaly or not Categories: Distance based methods Anomalies are data points most distant from other points Density based methods Anomalies are data points in low density regions

Nearest Neighbor Based Techniques Distance based approaches A point O in a dataset is an DB(p, d) outlier if at least fraction p of the points in the data set lies greater than distance d from the point O* Density based approaches Compute local densities of particular regions and declare instances in low density regions as potential anomalies Approaches Local Outlier Factor (LOF) Connectivity Outlier Factor (COF ) Multi-Granularity Deviation Factor (MDEF) *Knorr, Ng,Algorithms for Mining Distance-Based Outliers in Large Datasets, VLDB98

Local Outlier Factor (LOF)* For each data point q compute the distance to the k-th nearest neighbor (k-distance) Compute reachability distance (reach-dist) for each data example q with respect to data example p as: reach-dist(q, p) = max{k-distance(p), d(q,p)} Compute local reachability density (lrd) of data example q as inverse of the average reachabaility distance based on the MinPts nearest neighbors of data example q MinPts lrd(q) = reach_ dist ( q, p) Compaute LOF(q) as ratio of average local reachability density of q s k- nearest neighbors and local reachability density of the data record q LOF(q) = p p MinPts 1 lrd( p) MinPts lrd( q) * - Breunig, et al, LOF: Identifying Density-Based Local Outliers, KDD 2000.

Advantages of Density based Techniques Local Outlier Factor (LOF) approach Example: Distance from p 3 to nearest neighbor Distance from p 2 to nearest neighbor p 3 p 2 p 1 In the NN approach, p 2 is not considered as outlier, while the LOF approach find both p 1 and p 2 as outliers NN approach may consider p 3 as outlier, but LOF approach does not

Density based approach Using Support Vector Machines Main idea [Steinwart05] : Normal data records belong to high density data regions Anomalies belong to low density data regions Use unsupervised approach to learn high density and low density data regions Use SVM to classify data density level

Key assumption Clustering Based Techniques normal data records belong to large and dense clusters, while anomalies belong do not belong to any of the clusters or form very small clusters Anomalies detected using clustering based methods can be: Data records that do not fit into any cluster (residuals from clustering) Small clusters Low density clusters or local anomalies (far from other points within the same cluster)