Anomaly Detection. Jing Gao. SUNY Buffalo

Similar documents
CS570 Data Mining. Anomaly Detection. Li Xiong. Slide credits: Tan, Steinbach, Kumar Jiawei Han and Micheline Kamber.

Unsupervised Anomaly Detection for High Dimensional Data

Performance Evaluation

Anomaly (outlier) detection. Huiping Cao, Anomaly 1

An Overview of Outlier Detection Techniques and Applications

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts

Least Squares Classification

Performance evaluation of binary classifiers

Evaluation & Credibility Issues

Lecture 4 Discriminant Analysis, k-nearest Neighbors

Performance Evaluation and Comparison

Performance Evaluation

Machine Learning Concepts in Chemoinformatics

Data Mining Prof. Pabitra Mitra Department of Computer Science & Engineering Indian Institute of Technology, Kharagpur

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

Classifier Evaluation. Learning Curve cleval testc. The Apparent Classification Error. Error Estimation by Test Set. Classifier

Pointwise Exact Bootstrap Distributions of Cost Curves

Data Mining for Anomaly Detection

Model Accuracy Measures

Big Data Analytics: Evaluating Classification Performance April, 2016 R. Bohn. Some overheads from Galit Shmueli and Peter Bruce 2010

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION

BANA 7046 Data Mining I Lecture 4. Logistic Regression and Classications 1

Lecture 2. Judging the Performance of Classifiers. Nitin R. Patel

CS4445 Data Mining and Knowledge Discovery in Databases. B Term 2014 Solutions Exam 2 - December 15, 2014

Machine Learning Linear Classification. Prof. Matteo Matteucci

Anomaly Detection in Logged Sensor Data. Master s thesis in Complex Adaptive Systems JOHAN FLORBÄCK

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

Evaluation. Andrea Passerini Machine Learning. Evaluation

Introduction to Supervised Learning. Performance Evaluation

Data Mining algorithms

Evaluation requires to define performance measures to be optimized

IN Pratical guidelines for classification Evaluation Feature selection Principal component transform Anne Solberg

Stephen Scott.

Metric Embedding of Task-Specific Similarity. joint work with Trevor Darrell (MIT)

Bayesian Decision Theory

MASTER. Anomaly detection on event logs an unsupervised algorithm on ixr-messages. Severins, J.D. Award date: Link to publication

CC283 Intelligent Problem Solving 28/10/2013

ML in Practice: CMSC 422 Slides adapted from Prof. CARPUAT and Prof. Roth

Introduction to Statistical Inference

Anomaly Detection for the CERN Large Hadron Collider injection magnets

Classifier performance evaluation

Lecture 3 Classification, Logistic Regression

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition

Part I. Linear Discriminant Analysis. Discriminant analysis. Discriminant analysis

Surprise Detection in Science Data Streams Kirk Borne Dept of Computational & Data Sciences George Mason University

Applied Machine Learning Annalisa Marsico

Ensemble Methods. NLP ML Web! Fall 2013! Andrew Rosenberg! TA/Grader: David Guy Brizan

CSC314 / CSC763 Introduction to Machine Learning

EEL 851: Biometrics. An Overview of Statistical Pattern Recognition EEL 851 1

Diagnostics. Gad Kimmel

Anomaly Detection using Support Vector Machine

IMBALANCED DATA. Phishing. Admin 9/30/13. Assignment 3: - how did it go? - do the experiments help? Assignment 4. Course feedback

CptS 570 Machine Learning School of EECS Washington State University. CptS Machine Learning 1

CISC 4631 Data Mining

Graph-Based Anomaly Detection with Soft Harmonic Functions

Smart Home Health Analytics Information Systems University of Maryland Baltimore County

Performance Measures. Sören Sonnenburg. Fraunhofer FIRST.IDA, Kekuléstr. 7, Berlin, Germany

Clustering Lecture 1: Basics. Jing Gao SUNY Buffalo

CSE 546 Final Exam, Autumn 2013

Chart types and when to use them

Q1 (12 points): Chap 4 Exercise 3 (a) to (f) (2 points each)

Online Advertising is Big Business

Index of Balanced Accuracy: A Performance Measure for Skewed Class Distributions

DECISION TREE LEARNING. [read Chapter 3] [recommended exercises 3.1, 3.4]

Introduction to Data Science

Class 4: Classification. Quaid Morris February 11 th, 2011 ML4Bio

Anomaly Detection via Over-sampling Principal Component Analysis

day month year documentname/initials 1

Rare Event Discovery And Event Change Point In Biological Data Stream

Moving Average Rules to Find. Confusion Matrix. CC283 Intelligent Problem Solving 05/11/2010. Edward Tsang (all rights reserved) 1

Anomaly Detection via Online Oversampling Principal Component Analysis

Università di Pisa A.A Data Mining II June 13th, < {A} {B,F} {E} {A,B} {A,C,D} {F} {B,E} {C,D} > t=0 t=1 t=2 t=3 t=4 t=5 t=6 t=7

Lecture 9: Classification, LDA

A Family of Joint Sparse PCA Algorithms for Anomaly Localization in Network Data Streams. Ruoyi Jiang

MIRA, SVM, k-nn. Lirong Xia

Lecture 9: Classification, LDA

Expert Systems with Applications

CS145: INTRODUCTION TO DATA MINING

Surprise Detection in Multivariate Astronomical Data Kirk Borne George Mason University

Defining Statistically Significant Spatial Clusters of a Target Population using a Patient-Centered Approach within a GIS

Methods and Criteria for Model Selection. CS57300 Data Mining Fall Instructor: Bruno Ribeiro

Contextual Profiling of Homogeneous User Groups for Masquerade Detection. Pieter Bloemerus Ruthven

Data Mining and Analysis: Fundamental Concepts and Algorithms

Lecture 3. STAT161/261 Introduction to Pattern Recognition and Machine Learning Spring 2018 Prof. Allie Fletcher

Regularization. CSCE 970 Lecture 3: Regularization. Stephen Scott and Vinod Variyam. Introduction. Outline

Machine Learning, Fall 2009: Midterm

Fundamentals to Biostatistics. Prof. Chandan Chakraborty Associate Professor School of Medical Science & Technology IIT Kharagpur

8. Classifier Ensembles for Changing Environments

Contents Lecture 4. Lecture 4 Linear Discriminant Analysis. Summary of Lecture 3 (II/II) Summary of Lecture 3 (I/II)

Computational paradigms for the measurement signals processing. Metodologies for the development of classification algorithms.

ECE521 Lecture7. Logistic Regression

Basics of Multivariate Modelling and Data Analysis

Lecture Slides for INTRODUCTION TO. Machine Learning. ETHEM ALPAYDIN The MIT Press,

Algorithm-Independent Learning Issues

Statistics for classification

Hypothesis Evaluation

Algorithms for Classification: The Basic Methods

Introduction to Machine Learning

International Journal of Scientific & Engineering Research, Volume 7, Issue 2, February ISSN

Transcription:

Anomaly Detection Jing Gao SUNY Buffalo 1

Anomaly Detection Anomalies the set of objects are considerably dissimilar from the remainder of the data occur relatively infrequently when they do occur, their consequences can be quite dramatic and quite often in a negative sense Mining needle in a haystack. So much hay and so little time 2

Definition of Anomalies Anomaly is a pattern in the data that does not conform to the expected behavior Also referred to as outliers, exceptions, peculiarities, surprise, etc. Anomalies translate to significant (often critical) real life entities Cyber intrusions Credit card fraud 3

Real World Anomalies Credit Card Fraud An abnormally high purchase made on a credit card Cyber Intrusions Computer virus spread over Internet 4

Simple Example N 1 and N 2 are regions of normal behavior Points o 1 and o 2 are anomalies Y N 1 o 1 O 3 o 2 Points in region O 3 are anomalies N 2 X 5

Rare Class Mining Chance discovery Novelty Detection Exception Mining Noise Removal Related problems 6

Key Challenges Defining a representative normal region is challenging The boundary between normal and outlying behavior is often not precise The exact notion of an outlier is different for different application domains Limited availability of labeled data for training/validation Malicious adversaries Data might contain noise Normal behaviour keeps evolving 7

Aspects of Anomaly Detection Problem Nature of input data Availability of supervision Type of anomaly: point, contextual, structural Output of anomaly detection Evaluation of anomaly detection techniques 8

10 Input Data Most common form of data handled by anomaly detection techniques is Record Data Univariate Multivariate Tid SrcIP Start time Dest IP Dest Port Number of bytes Attack 1 206.135.38.95 11:07:20 160.94.179.223 139 192 No 2 206.163.37.95 11:13:56 160.94.179.219 139 195 No 3 206.163.37.95 11:14:29 160.94.179.217 139 180 No 4 206.163.37.95 11:14:30 160.94.179.255 139 199 No 5 206.163.37.95 11:14:32 160.94.179.254 139 19 Yes 6 206.163.37.95 11:14:35 160.94.179.253 139 177 No 7 206.163.37.95 11:14:36 160.94.179.252 139 172 No 8 206.163.37.95 11:14:38 160.94.179.251 139 285 Yes 9 206.163.37.95 11:14:41 160.94.179.250 139 195 No 10 206.163.37.95 11:14:44 160.94.179.249 139 163 Yes 9

Input Data Complex Data Types Relationship among data instances Sequential Temporal Spatial Spatio-temporal Graph GGTTCCGCCTTCAGCCCCGCGCC CGCAGGGCCCGCCCCGCGCCGTC GAGAAGGGCCCGCCTGGCGGGCG GGGGGAGGCGGGGCCGCCCGAGC CCAACCGAGTCCGACCAGGTGCC CCCTCTGCTCGGCCTAGACCTGA GCTCATTAGGCGGCAGCGGACAG GCCAAGTAGAACACGCGAAGCGC TGGGCTGCCTGCTGCGACCAGGG 10

Data Labels Supervised Anomaly Detection Labels available for both normal data and anomalies Similar to skewed (imbalanced) classification Semi-supervised Anomaly Detection Limited amount of labeled data Combine supervised and unsupervised techniques Unsupervised Anomaly Detection No labels assumed Based on the assumption that anomalies are very rare compared to normal data 11

Point Anomalies Contextual Anomalies Collective Anomalies Type of Anomalies 12

Point Anomalies An individual data instance is anomalous w.r.t. the data Y N 1 o 1 O 3 o 2 N 2 X 13

Contextual Anomalies An individual data instance is anomalous within a context Requires a notion of context Also referred to as conditional anomalies Normal Anomaly 14

Collective Anomalies A collection of related data instances is anomalous Requires a relationship among data instances Sequential Data Spatial Data Graph Data The individual instances within a collective anomaly are not anomalous by themselves Anomalous Subsequence 15

Label Output of Anomaly Detection Each test instance is given a normal or anomaly label This is especially true of classification-based approaches Score Each test instance is assigned an anomaly score Allows the output to be ranked Requires an additional threshold parameter 16

Metrics for Performance Evaluation Confusion Matrix PREDICTED CLASS + - ACTUAL CLASS + a b - c d a: TP (true positive) b: FN (false negative) c: FP (false positive) d: TN (true negative)

Metrics for Performance Evaluation PREDICTED CLASS + - ACTUAL CLASS + a (TP) - c (FP) b (FN) d (TN) Measure used in classification: Accuracy a a b d c d TP TP TN TN FP FN 18

Limitation of Accuracy Anomaly detection Number of negative examples = 9990 Number of positive examples = 10 If model predicts everything to be class 0, accuracy is 9990/10000 = 99.9 % Accuracy is misleading because model does not detect any positive examples 19

Cost Matrix PREDICTED CLASS C(i j) + - ACTUAL CLASS + C(+ +) C(- +) - C(+ -) C(- -) C(i j): Cost of misclassifying class j example as class i 20

Computing Cost of Classification Cost Matrix ACTUAL CLASS PREDICTED CLASS C(i j) + - + -1 100-1 0 Model M 1 PREDICTED CLASS Model M 2 PREDICTED CLASS ACTUAL CLASS + - + 150 40-60 250 ACTUAL CLASS + - + 250 45-5 200 Accuracy = 80% Cost = 3910 Accuracy = 90% Cost = 4255 21

Cost-Sensitive Measures Precision (p) a a Recall (r) a b F - measure (F) a c 2rp r p 2a 2a b c Weighted Accuracy w a 1 w a 1 w b 2 w d 4 w c 3 w d 4 22

ROC (Receiver Operating Characteristic) ROC curve plots TPR (on the y-axis) against FPR (on the x-axis) Performance of each classifier represented as a point on the ROC curve changing the threshold of algorithm, sample distribution or cost matrix changes the location of the point 23

ROC Curve - 1-dimensional data set containing 2 classes (positive and negative) - any points located at x > t is classified as positive At threshold t: TP=0.5, FN=0.5, FP=0.12, FN=0.88 24

(TPR,FPR): (0,0): declare everything to be negative class (1,1): declare everything to be positive class (1,0): ideal ROC Curve Diagonal line: Random guessing Below diagonal line: prediction is opposite of the true class 25

Using ROC for Model Comparison Comparing two models M 1 is better for small FPR M 2 is better for large FPR Area Under the ROC curve Ideal: Area = 1 Random guess: Area = 0.5 26

How to Construct an ROC curve Instance Score Label ACTUAL CLASS 1 0.95 + 2 0.93 + 3 0.87-4 0.85-5 0.85-6 0.85 + 7 0.76-8 0.53 + 9 0.43 - PREDICTED CLASS 10 0.25 + + a (TP) - c (FP) + - b (FN) d (TN) Calculate the outlier scores of the given instances Sort the instances according to the scores in decreasing order Apply threshold at each unique value of the score Count the number of TP, FP, TN, FN at each threshold TP rate, TPR = TP/(TP+FN) FP rate, FPR = FP/(FP + TN) 27

Threshold >= How to construct an ROC curve Class + - + - - - + - + + 0.25 0.43 0.53 0.76 0.85 0.85 0.85 0.87 0.93 0.95 1.00 TP 5 4 4 3 3 3 3 2 2 1 0 FP 5 5 4 4 3 2 1 1 0 0 0 TN 0 0 1 1 2 3 4 4 5 5 5 FN 0 1 1 2 2 2 2 3 3 4 5 TPR 1 0.8 0.8 0.6 0.6 0.6 0.6 0.4 0.4 0.2 0 FPR 1 1 0.8 0.8 0.6 0.4 0.2 0.2 0 0 0 ROC Curve: 28

Applications of Anomaly Detection Network intrusion detection Insurance / Credit card fraud detection Healthcare Informatics / Medical diagnostics Image Processing / Video surveillance 29

Intrusion Detection Intrusion Detection Process of monitoring the events occurring in a computer system or network and analyzing them for intrusions Intrusions are defined as attempts to bypass the security mechanisms of a computer or network Challenges Traditional signature-based intrusion detection systems are based on signatures of known attacks and cannot detect emerging cyber threats Substantial latency in deployment of newly created signatures across the computer system Anomaly detection can alleviate these limitations 30

Fraud Detection Fraud detection refers to detection of criminal activities occurring in commercial organizations Malicious users might be the actual customers of the organization or might be posing as a customer (also known as identity theft). Types of fraud Credit card fraud Insurance claim fraud Mobile / cell phone fraud Insider trading Challenges Fast and accurate real-time detection Misclassification cost is very high 31

Healthcare Informatics Detect anomalous patient records Indicate disease outbreaks, instrumentation errors, etc. Key Challenges Misclassification cost is very high Data can be complex: spatio-temporal 32

Image Processing Detecting outliers in a image monitored over time Detecting anomalous regions within an image Used in mammography image analysis video surveillance satellite image analysis Key Challenges Detecting collective anomalies Data sets are very large 50 100 150 200 250 50 100 150 200 250 300 350 Anomaly 33

General Steps Anomaly Detection Schemes Build a profile of the normal behavior Profile can be patterns or summary statistics for the overall population Use the normal profile to detect anomalies Methods Anomalies are observations whose characteristics differ significantly from the normal profile Statistical-based Distance-based Model-based 34

Statistical Approaches Assume a parametric model describing the distribution of the data (e.g., normal distribution) Apply a statistical test that depends on Data distribution Parameter of distribution (e.g., mean, variance) Number of expected outliers (confidence limit) 35

Grubbs Test Detect outliers in univariate data Assume data comes from normal distribution Detects one outlier at a time, remove the outlier, and repeat H 0 : There is no outlier in data H A : There is at least one outlier G max Grubbs test statistic: Reject H 0 if: G ( N 1) N N t 2 X s ( / N, N 2) 2 t 2 X ( / N, N 2) 36

Statistical-based Likelihood Approach Assume the data set D contains samples from a mixture of two probability distributions: M (majority distribution) A (anomalous distribution) General Approach: Initially, assume all the data points belong to M Let L t (D) be the log likelihood of D at time t For each point x t that belongs to M, move it to A Let L t+1 (D) be the new log likelihood. Compute the difference, = L t (D) L t+1 (D) If > c (some threshold), then x t is declared as an anomaly and moved permanently from M to A 37

Statistical-based Likelihood Approach Data distribution, D = (1 ) M + A M is a probability distribution estimated from data Can be based on any modeling method, e.g., mixture model A can be assumed to be uniform distribution Likelihood at time t: t i t t t i t t A x i A A M x i M M N i i D t x P x P x P D L ) ( ) ( ) (1 ) ( ) ( 1 38

Limitations of Statistical Approaches Most of the tests are for a single attribute In many cases, data distribution may not be known For high dimensional data, it may be difficult to estimate the true distribution 39

Distance-based Approaches Data is represented as a vector of features Three major approaches Nearest-neighbor based Density based Clustering based 40

Nearest-Neighbor Based Approach Approach: Compute the distance between every pair of data points There are various ways to define outliers: Data points for which there are fewer than p neighboring points within a distance D The top n data points whose distance to the k-th nearest neighbor is greatest The top n data points whose average distance to the k nearest neighbors is greatest 41

Distance-Based Outlier Detection For each object o, examine the # of other objects in the r- neighborhood of o, where r is a user-specified distance threshold An object o is an outlier if most (taking π as a fraction threshold) of the objects in D are far away from o, i.e., not in the r-neighborhood of o An object o is a DB(r, π) outlier if Equivalently, one can check the distance between o and its k- th nearest neighbor o k, where. o is an outlier if dist(o, o k ) > r 42

Local Outlier Factor (LOF) approach Example: Density-based Approach Distance from p 3 to nearest neighbor Distance from p 2 to nearest neighbor p 3 p 2 p 1 In the NN approach, p 2 is not considered as outlier, while the LOF approach find both p 1 and p 2 as outliers NN approach may consider p 3 as outlier, but LOF approach does not 43

Density-based: LOF approach For each point, compute the density of its local neighborhood Compute local outlier factor (LOF) of a sample p as the average of the ratios of the density of sample p and the density of its nearest neighbors Outliers are points with largest LOF value 44

Local Outlier Factor: LOF Reachability distance from o to o: where k is a user-specified parameter Local reachability density of o: LOF (Local outlier factor) of an object o is the average of the ratio of local reachability of o and those of o s k-nearest neighbors The higher the local reachability distance of o, and the higher the local reachability density of the knn of o, the higher LOF This captures a local outlier whose local density is relatively low comparing to the local densities of its knn 45

Clustering-Based Basic idea: Cluster the data into groups of different density Choose points in small cluster as candidate outliers Compute the distance between candidate points and non-candidate clusters. If candidate points are far from all other noncandidate points, they are outliers 46

Classification-Based Methods Idea: Train a classification model that can distinguish normal data from outliers Consider a training set that contains samples labeled as normal and others labeled as outlier But, the training set is typically heavily biased: # of normal samples likely far exceeds # of outlier samples Handle the imbalanced distribution Oversampling positives and/or undersampling negatives Alter decision threshold Cost-sensitive learning 47

One-Class Model One-class model: A classifier is built to describe only the normal class Learn the decision boundary of the normal class using classification methods such as SVM Any samples that do not belong to the normal class (not within the decision boundary) are declared as outliers Adv: can detect new outliers that may not appear close to any outlier objects in the training set 48

Take-away Message Definition of outlier detection Applications of outlier detection Evaluation of outlier detection techniques Unsupervised approaches (statistical, distance, density-based) 49