Advances in Anomaly Detection

Advances in Anomaly Detection Tom Dietterich Alan Fern Weng-Keen Wong Andrew Emmott Shubhomoy Das Md. Amran Siddiqui Tadesse Zemicheal

Outline Introduction Three application areas Two general approaches to anomaly detection Under-fitting Over-fitting DARPA ADAMS Red Team results Benchmarks for Anomaly Detection Validation Comparison Study Next Steps Anomaly Explanations Ensembles 2

Why Anomaly Detection? Data cleaning: find data points that contain errors. Science: find data points that are interesting or unusual. Security / fraud detection: find users/customers who are behaving weirdly.

Data Cleaning for Sensor Networks. An ideal method should produce two things given raw data: (1) a label that marks anomalies, and (2) an imputation of the true value (with some confidence measure). Dereszynski & Dietterich, ACM TOS 2011. [Figure: air temperature traces (degrees Celsius) from sensors x11 through x31, plotted against day index 21 to 41 from the start of the deployment.]

NASA: Finding Interesting Data Points. Ingest a data set and rank points by interestingness. Repeat: show the most interesting point to a scientist (Yes: interesting / No: not interesting); build a model of the uninteresting points; the most interesting point is the most un-uninteresting point, i.e., the most extreme outlier among the uninteresting points. Mars Science Laboratory ChemCam examples: olivine, first non-carbonate. Wagstaff, Lanza, Thompson, Dietterich, Gilmore. AAAI 2013.

Security/Fraud Detection: DARPA ADAMS Program. Desktop activity data collected from ~5000 employees of a corporation using Raytheon-Oakley SureView. The CERT Red Team overlays selected employees with insider-threat activity based on real scenarios. Example scenarios: Anomalous Encryption, Layoff Logic Bomb, Insider Startup, Circumventing SureView, Hiding Undue Affluence, Survivor's Burden. Team: LEIDOS (formerly SAIC); Ted Senator, PI; Rand Waltzman, PM.

Outline Introduction Three application areas Two general approaches to anomaly detection Under-fitting Over-fitting DARPA ADAMS Red Team results Benchmarks for Anomaly Detection Validation Comparison Study Next Steps Anomaly Explanations Ensembles 9

What is Anomaly Detection? Input: vectors $x_i \in \mathbb{R}^d$ for $i = 1, \ldots, N$, assumed to be a mix of normal and anomalous data points; the anomalies are generated by some distinct process (e.g., instrument failures, fraud, intruders). Output: an anomaly score $s_i$ for each input $x_i$ such that higher scores are more anomalous and similar scores imply similar levels of anomalousness. Metrics: AUC (the probability that a randomly chosen anomaly is ranked above a randomly chosen normal point) and precision in the top K.
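
Both metrics on this slide can be computed directly from the scores. Below is a minimal sketch (the function names are mine, not from the talk); AUC is computed as the Mann-Whitney rank statistic, i.e., the probability that a random anomaly outscores a random normal point.

```python
import numpy as np

def auc_rank(scores, labels):
    """AUC as the probability that a randomly chosen anomaly (label 1)
    is scored above a randomly chosen normal point (label 0).
    Ties count as half, matching the Mann-Whitney U statistic."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    anom, norm = scores[labels == 1], scores[labels == 0]
    # Pairwise comparisons; fine for benchmark-sized data.
    wins = (anom[:, None] > norm[None, :]).sum()
    ties = (anom[:, None] == norm[None, :]).sum()
    return (wins + 0.5 * ties) / (len(anom) * len(norm))

def precision_at_k(scores, labels, k):
    """Fraction of true anomalies among the k highest-scoring points."""
    top_k = np.argsort(scores)[::-1][:k]
    return np.asarray(labels)[top_k].mean()
```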

Two General Approaches: Anomaly Detection Methods. Anomaly detection by under-fitting: Gaussian Mixture Model (GMM), Ensemble of Gaussian Mixture Models (EGMM). Anomaly detection by over-fitting: Isolation Forest (IFOR), Repeated Impossible Discrimination Ensemble (RIDE).

Anomaly Detection by Under-Fitting. Choose a class of models and fit it to the data. Let $P_\theta(x_i)$ be the probability density assigned to data point $x_i$ by the model $\theta$. Assign score $s_i = -\log P_\theta(x_i)$. Low-density points (poorly explained by the model) are the anomalies.

Example: Gaussian Mixture Model. $P(x) = \sum_{k=1}^{K} p_k \, \mathrm{Normal}(x \mid \mu_k, \Sigma_k)$. [Figure: a mixture with K = 3 components.]
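
As a concrete illustration of the under-fitting recipe, here is a hedged sketch that fits a single Gaussian mixture (K = 3, as in the slide's example) with scikit-learn and scores each point by its negative log-density; the helper name and the toy data are mine.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_anomaly_scores(X, n_components=3, seed=0):
    """Fit one Gaussian mixture and score each point by its negative
    log-density: low-density (poorly explained) points score highest."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="full",
                          random_state=seed)
    gmm.fit(X)
    return -gmm.score_samples(X)   # s_i = -log P_theta(x_i)

# Toy usage: 2-D Gaussian blob with a few injected outliers.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(500, 2)),
               rng.uniform(-8, 8, size=(10, 2))])
scores = gmm_anomaly_scores(X)
print(np.argsort(scores)[::-1][:10])   # indices of the 10 most anomalous points
```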

Ensemble of GMMs. Train $M$ independent Gaussian mixture models: train model $m = 1, \ldots, M$ on a bootstrap replicate of the data and vary the number of clusters $K$. Delete any model with log-likelihood < 70% of the best model. Compute the average surprise: $-\frac{1}{M} \sum_{m=1}^{M} \log P_m(x_i)$.
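
A rough sketch of this ensemble under my reading of the slide: bootstrap replicates, a randomly varied number of components, a relative log-likelihood filter (the slide's "70% of the best" rule is ambiguous for negative log-likelihoods, so the threshold below is an assumption), and the average surprise as the score.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def egmm_scores(X, n_models=15, k_choices=(2, 3, 4, 5), seed=0):
    """Ensemble of GMMs: each member is fit to a bootstrap replicate with a
    randomly chosen number of components; weak members are dropped; the
    anomaly score is the average surprise -1/M * sum_m log P_m(x_i)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    models, logliks = [], []
    for m in range(n_models):
        boot = X[rng.integers(0, n, size=n)]          # bootstrap replicate
        k = rng.choice(k_choices)                     # vary the number of clusters K
        gmm = GaussianMixture(n_components=k, covariance_type="full",
                              random_state=seed + m).fit(boot)
        models.append(gmm)
        logliks.append(gmm.score(X) * n)              # total log-likelihood on X

    # Drop weak members.  The "< 70% of the best log-likelihood" rule is
    # implemented here as a relative gap; the exact rule is an assumption.
    best = max(logliks)
    keep = [g for g, ll in zip(models, logliks) if ll >= best - 0.3 * abs(best)]

    # Average surprise over the surviving members.
    log_p = np.stack([g.score_samples(X) for g in keep])   # shape (M', n)
    return -log_p.mean(axis=0)
```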

DARPA ADAMS Vegas Results. Score each user and rank them all. AUC = probability that we correctly rank a randomly chosen Red Team insert above a randomly chosen normal user. [Figure: ROC curves for Vegas Sept 2012 and Vegas Oct 2012, both with AUC = 0.970; AvgLift = 26.17.]

New Approach: Anomaly Detection by Over-Fitting. Take the input points, randomly split them in half, and label one half 0 and the other half 1. Apply supervised learning to discriminate the 0s from the 1s (which by construction is impossible) and score each point by how far its prediction lands from chance: $\mathrm{score}(x_i) = |0.5 - P(y_i = 1 \mid x_i)|$. Repeat the random split and the discrimination, accumulating the total score over 50 iterations. RIDE: Repeated Impossible Discrimination Ensemble.
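
The slide does not say which supervised learner RIDE uses, so the sketch below stands in a simple RBF kernel smoother as the deliberately over-fitting discriminator; the 50 rounds follow the slide, while the smoother, the bandwidth, and the other details are assumptions. Isolated points can be "memorized" (their prediction is dominated by their own random label, so it lands far from 0.5), whereas points in dense regions average many random labels and stay near 0.5.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def ride_scores(X, n_rounds=50, gamma=0.5, seed=0):
    """Repeated Impossible Discrimination Ensemble (sketch).
    Each round: assign random 0/1 labels, fit an over-fitting discriminator,
    and credit each point with how far its predicted P(y=1) is from 0.5."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma=gamma)        # kernel smoother as the discriminator
    scores = np.zeros(n)
    for _ in range(n_rounds):
        y = rng.permutation(np.arange(n) % 2)   # half labeled 0, half labeled 1
        p = K @ y / K.sum(axis=1)               # in-sample estimate of P(y = 1 | x)
        scores += np.abs(p - 0.5)
    return scores / n_rounds
```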

RIDE Vegas Results. [Figure: ROC curves for Vegas Sept (AUC = 0.920) and Vegas Oct (AUC = 0.981).]

Isolation Forest [Liu, Ting, Zhou, 2011]. Construct a fully random binary tree: choose an attribute $j$ at random and a splitting threshold $\theta$ uniformly from $[\min x_j, \max x_j]$, recursing until every data point is in its own leaf; let $d(x_i)$ be the depth of point $x_i$. Repeat 100 times and let $\bar{d}(x_i)$ be the average depth of $x_i$. Then $\mathrm{score}(x_i) = 2^{-\bar{d}(x_i) / r(x_i)}$, where $r(x_i)$ is the expected depth. [Figure: an example isolation tree with splits of the form $x_j > \theta$.]
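A from-scratch sketch of the procedure as described on this slide (fully random splits until isolation, 100 trees, score $2^{-\bar{d}/r}$). The expected-depth normalizer below follows the usual isolation-forest formula; for anything beyond a demo, scikit-learn's IsolationForest, or subsampling to a few hundred points per tree as the original algorithm does, is the practical choice.

```python
import numpy as np

def isolation_depths(X, rng, current_depth=0):
    """Depth at which each row of X becomes isolated in one fully random tree."""
    n = X.shape[0]
    depths = np.full(n, current_depth, dtype=float)
    if n <= 1:
        return depths
    j = rng.integers(X.shape[1])                 # random attribute
    lo, hi = X[:, j].min(), X[:, j].max()
    if lo == hi:                                 # exact duplicates: cannot split further
        return depths
    theta = rng.uniform(lo, hi)                  # random threshold
    left = X[:, j] <= theta
    depths[left] = isolation_depths(X[left], rng, current_depth + 1)
    depths[~left] = isolation_depths(X[~left], rng, current_depth + 1)
    return depths

def expected_depth(n):
    """Average depth of an unsuccessful BST search: the normalizer r(x)."""
    if n <= 1:
        return 1.0
    harmonic = np.log(n - 1) + 0.5772156649
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def isolation_forest_scores(X, n_trees=100, seed=0):
    """Higher score = easier to isolate = more anomalous."""
    rng = np.random.default_rng(seed)
    avg_depth = np.mean([isolation_depths(X, rng) for _ in range(n_trees)], axis=0)
    return 2.0 ** (-avg_depth / expected_depth(X.shape[0]))
```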

Outline Introduction Three application areas Two general approaches to anomaly detection Under-fitting Over-fitting DARPA ADAMS Red Team results Benchmarks for Anomaly Detection Validation Comparison Study Next Steps Anomaly Explanations Ensembles 19

VEGAS Results May 2013

VEGAS Results June 2013

VEGAS Results July 2013

Outline Introduction Three application areas Two general approaches to anomaly detection Under-fitting Over-fitting DARPA ADAMS Red Team results Benchmarks for Anomaly Detection Validation Comparison Study Next Steps Anomaly Explanations Ensembles 23

Needed: Benchmarks for Anomaly Detection Algorithms. Shared benchmark databases (e.g., the UCI Repository of Machine Learning Data Sets) have helped supervised learning make rapid progress. Anomaly detection lacks shared benchmarks: most data sets are proprietary and/or classified, and the one exception, the Lincoln Labs simulated network intrusion data set, is hopelessly out of date. Goal: develop a collection of benchmark data sets with known properties.

Benchmark Requirements. The underlying process generating the anomalies should be distinct from the process generating the normal points (anomalies are not merely outliers). We need many benchmark data sets, to prevent the research community from fixating on a small number of problems. Benchmark data sets should systematically vary a set of relevant properties.

Relevant Properties Point difficulty: How difficult is it to separate each individual anomaly point from the normal points? Relative frequency: How rare are the anomalies? Clusteredness: Are the anomalous points tightly clustered or widely scattered? Irrelevant features: How many features are irrelevant? 26

Creating an Anomaly Detection Benchmark Data Set. Select a UCI supervised learning dataset and choose one class to be the anomalies (call it class 0, and the union of the other classes class 1); this ensures that different processes generate the anomalies and the normal points. Computing point difficulty: fit a kernel logistic regression model to estimate $P(y = 1 \mid x)$, where $y$ is the class label (the oracle model); the difficulty of an anomaly point $x_i$ is the oracle's $P(y = 1 \mid x_i)$. For the desired relative frequency, select points based on difficulty and clusteredness. Optionally, add irrelevant features by selecting existing features and randomly permuting their values.
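
A rough sketch of this recipe covering point difficulty and relative frequency only (clusteredness control and irrelevant-feature injection are omitted). scikit-learn has no kernel logistic regression, so the oracle here is approximated with a Nystroem RBF feature map plus logistic regression; that substitution, the difficulty band, and all names are assumptions.

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def make_anomaly_benchmark(X, y, anomaly_class, rel_freq=0.01,
                           difficulty_range=(0.16, 0.33), seed=0):
    """Turn a supervised dataset into an anomaly-detection benchmark:
    one class becomes the anomalies, an oracle P(y=1 | x) is fit, and
    anomalies are sampled by point difficulty at the desired relative frequency."""
    rng = np.random.default_rng(seed)
    is_normal = (y != anomaly_class).astype(int)   # class 1 = normal, class 0 = anomaly

    # Oracle: approximate kernel logistic regression for P(y = 1 | x).
    oracle = make_pipeline(Nystroem(gamma=0.1, n_components=100, random_state=seed),
                           LogisticRegression(max_iter=1000))
    oracle.fit(X, is_normal)
    p_normal = oracle.predict_proba(X)[:, 1]

    normal_idx = np.where(is_normal == 1)[0]
    anom_idx = np.where(is_normal == 0)[0]

    # Point difficulty of an anomaly = oracle's P(y = 1 | x): higher means it
    # looks more like a normal point and is harder to detect.
    difficulty = p_normal[anom_idx]
    lo, hi = difficulty_range
    candidates = anom_idx[(difficulty >= lo) & (difficulty < hi)]
    if len(candidates) == 0:
        raise ValueError("no anomalies fall in the requested difficulty range")

    n_anom = max(1, int(rel_freq * len(normal_idx) / (1 - rel_freq)))
    chosen = rng.choice(candidates, size=min(n_anom, len(candidates)), replace=False)

    keep = np.concatenate([normal_idx, chosen])
    return X[keep], is_normal[keep]   # labels: 1 = normal, 0 = anomaly
```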

Benchmark Collection. 19 "mother" UCI data sets. Point difficulty: low (0, 0.16), medium [0.16, 0.33), high [0.33, 0.5), very high [0.5, 1). Relative frequency: 0.001, 0.005, 0.01, 0.05, 0.1. Clusteredness: 7 levels based on $\log(\sigma_n^2 / \sigma_a^2)$, the variance of the normal points divided by the variance of the anomalous points; a facility-location algorithm is used to select well-spaced points, and seed-point neighbors are used to find clustered points. Irrelevant features: 4 levels based on increasing the average distance between normal points. 24,800 benchmark data sets generated.

Benchmarking Study. State-of-the-art methods: ocsvm: one-class SVM (Schoelkopf et al., 1999); lof: Local Outlier Factor (Breunig et al., 2000); svdd: Support Vector Data Description (Tax & Duin, 2004); if: Isolation Forest (Liu et al., 2008, 2011); scif: SCiForest (Liu et al., 2010); rkde: Robust Kernel Density Estimation (Kim & Scott, 2012); egmm: ours. Analysis: measure the AUC of each method, compute the mean AUC for each method, and fit the logistic regression model $\mathrm{logit}(\mathrm{AUC}) = \mathrm{method} + \mathrm{difficulty} + \mathrm{frequency} + \mathrm{clusteredness} + \mathrm{irrelevance}$.
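
The analysis model can be sketched with the statsmodels formula API as below; the results file and column names are illustrative placeholders, not artifacts of the actual study.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# One row per (benchmark, method) with the measured AUC and the benchmark's
# properties.  File and column names here are hypothetical.
results = pd.read_csv("benchmark_results.csv")
eps = 1e-6
results["logit_auc"] = np.log((results["auc"] + eps) / (1 - results["auc"] + eps))

# logit(AUC) = method + difficulty + frequency + clusteredness + irrelevance
model = smf.ols(
    "logit_auc ~ C(method) + C(difficulty) + C(frequency)"
    " + C(clusteredness) + C(irrelevance)",
    data=results,
).fit()
print(model.summary())   # per-method coefficients, holding the other factors fixed
```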

Benchmark Validity: Point Difficulty 30

Benchmark Validity: Relative Frequency 31

Benchmark Validity: Clusteredness 32

Algorithm Comparisons: Mean AUC. [Figure: bar chart of mean AUC by method, in the order if, lof, rkde, egmm, svdd, scif, ocsvm; the axis spans roughly 0.5 to 0.62, with the lowest method labeled 0.403.]

Algorithm Comparisons: Logistic Regression Results. [Figure legend:] if: Isolation Forest (Liu et al., 2011); rkde: Robust Kernel Density Estimation (Kim & Scott, 2012); egmm: ours; lof: Local Outlier Factor (Breunig et al., 2000); ocsvm: one-class SVM (Schoelkopf et al., 1999); svdd: Support Vector Data Description (Tax & Duin, 2004).

Sensitivity to Irrelevant Features. The performance of all methods drops as the number of irrelevant features increases. RKDE and IFOR perform very well; OCSVM is extremely sensitive; EGMM is hurt by the largest level of irrelevance but is the top performer when there is no noise. [Figure: average AUC (roughly 0.45 to 0.65) vs. irrelevant-feature level (level-0 through level-3) for egmm, if, lof, rkde, svdd, scif, and ocsvm.]

Outline Introduction Three application areas Two general approaches to anomaly detection Under-fitting Over-fitting DARPA ADAMS Red Team results Benchmarks for Anomaly Detection Validation Comparison Study Next Steps Anomaly Explanations Ensembles 36

Next Steps. Generate explanations of each anomaly for the analyst. Ensembles. Model the peer-group structure of the organization: the same user in previous days; all users in the company today; users with the same job class; users who work together (shared printer, email cliques).

Anomaly Explanations. [Diagram: data points (threats and non-threats) flow into the anomaly detector; the outliers become alarms (threats and false positives) passed to the human analyst; the non-outliers are discarded (non-threats and missed threats).] Type 1 missed threats = anomaly detector false negatives; reduce these by improving the anomaly detector. Type 2 missed threats = analyst false negatives; these can occur due to information overload and time constraints. We consider reducing Type 2 misses by providing explanations: why did the detector consider an object to be an outlier? The analyst can then focus on information related to the explanation.

Sequential Feature Explanations. [Diagram: outliers plus explanations are passed as alarms to the human analyst, who separates threats from false positives.] Goal: reduce analyst effort for correctly detecting outliers that are threats. How: provide the analyst with sequential feature explanations of outlier points. A Sequential Feature Explanation (SFE) is an ordering on the features of an outlier, prioritized by importance to the anomaly detector. Protocol: incrementally reveal features in SFE order until the analyst can make a confident determination.

Typical Sequential Feature Explanation Curve Performance Metric: # of features that must be examined by the analyst in order to make a confident decision that a proposed threat (outlier) requires opening an investigation 40

Evaluating Explanations. Methodological problem: evaluation requires access to an analyst, but we can't run large-scale experiments with real analysts. Solution: construct simulated analysts that compute $P(\mathrm{normal} \mid x)$. How: start with an anomaly detection benchmark constructed from a UCI supervised learning data set [Emmott et al., 2013] and learn a classifier to predict anomaly vs. normal from the labeled data (cheating); repeat for each subset of $K$ features. [Diagram: the UCI dataset's normal and anomaly points feed a supervised learner, which produces the simulated-analyst classifier $P(\mathrm{normal} \mid x)$.]
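
A hedged sketch of such a simulated analyst: for whatever subset of features has been revealed, train a classifier on just those columns (using the true labels, i.e., "cheating") and report $P(\mathrm{normal} \mid x_S)$. A plain random forest stands in for the regularized random forests mentioned later in the talk; the class and method names are mine.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

class SimulatedAnalyst:
    """For any revealed feature subset S, report P(normal | x_S) using a
    classifier trained on just those features with the true labels.
    Classifiers are cached per subset ('repeat for each subset of K features')."""

    def __init__(self, X, is_normal, seed=0):
        self.X, self.y, self.seed = X, np.asarray(is_normal), seed
        self._cache = {}

    def p_normal(self, x, revealed):
        key = tuple(sorted(revealed))
        if key not in self._cache:
            clf = RandomForestClassifier(n_estimators=100, random_state=self.seed)
            clf.fit(self.X[:, list(key)], self.y)
            self._cache[key] = clf
        clf = self._cache[key]
        proba = clf.predict_proba(np.asarray(x)[list(key)].reshape(1, -1))
        return proba[0, list(clf.classes_).index(1)]   # probability of "normal"
```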

Explanation Methods for Density-Based Anomaly Detectors. Density-based detectors rank points $x$ by estimated density $f(x)$. Marginal methods greedily add the features that most decrease the joint marginal $f(x_1, \ldots, x_K)$. Sequential Marginal: the first feature $x_i$ minimizes $f(x_i)$, the second feature $x_j$ minimizes $f(x_i, x_j)$, and so on. Independent Marginal: order features by $f(x_i)$. Dropout methods greedily remove the features that most increase $f(x)$. Sequential Dropout: the first feature $x_i$ maximizes $f(x_{-i})$, the second feature $x_j$ maximizes $f(x_{-i,-j})$, and so on. Independent Dropout: order features by $f(x_{-i})$.
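
For a Gaussian-mixture density fit with full covariances (such as an EGMM member), the marginal over any feature subset is available in closed form: keep the mixture weights and take the corresponding sub-vectors of the means and sub-blocks of the covariances. The sketch below uses that fact to implement the sequential-marginal ordering; the independent-marginal ordering is the special case that scores each feature by $f(x_i)$ alone. Function names are mine.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_marginal_logpdf(gmm, x, dims):
    """log f(x_dims) for a fitted full-covariance GaussianMixture: the marginal
    of a Gaussian mixture keeps the weights and takes sub-blocks of mu, Sigma."""
    dims = list(dims)
    comps = [
        w * multivariate_normal.pdf(x[dims],
                                    mean=gmm.means_[k][dims],
                                    cov=gmm.covariances_[k][np.ix_(dims, dims)])
        for k, w in enumerate(gmm.weights_)
    ]
    return np.log(max(sum(comps), 1e-300))

def sequential_marginal_sfe(gmm, x, n_features):
    """Greedy SFE: repeatedly add the feature that makes the chosen
    marginal density f(x_S) as small (as anomalous) as possible."""
    chosen, remaining = [], list(range(n_features))
    while remaining:
        j = min(remaining, key=lambda f: gmm_marginal_logpdf(gmm, x, chosen + [f]))
        chosen.append(j)
        remaining.remove(j)
    return chosen   # features ordered by importance to the detector

# Usage sketch: gmm = GaussianMixture(n_components=3, covariance_type="full").fit(X)
#               order = sequential_marginal_sfe(gmm, X[outlier_idx], X.shape[1])
```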

Empirical Demonstration. Datasets: 10,000 benchmarks derived from 7 UCI datasets. Anomaly detector: Ensemble of Gaussian Mixture Models (EGMM). Simulated analysts: Regularized Random Forests (RRFs). Evaluation metric: mean minimum feature prefix (MMFP), the average number of features revealed before the analyst is able to make a decision (exonerate vs. open an investigation).

Results (EGMM + Explanation Method). [Figure: MMFP (0 to 6) by explanation method: IndDO, IndMarg, OptOracle, SeqDO, SeqMarg, Random.] In these domains, an oracle only needs 1-2 features. Dropout methods are often worse than marginal methods. There is often no benefit to sequential methods over independent methods. Random is always worst.

Results (EGMM + Explanation Method) All methods significantly beat random Marginal methods no worse and sometimes better than dropout Independent marginal is nearly as good as sequential marginal 45

KDD99 (Computer Intrusion) Results (EGMM detector). [Figure: MMFP with 95% confidence intervals for Independent Dropout, Sequential Dropout, Independent Marginal, and Sequential Marginal.] Marginal methods are best; one feature is enough!

Ensemble Methods. In supervised learning, ensemble methods (bagging, random forests, boosting) have been shown to be very powerful. Can we develop general-purpose ensemble methods for anomaly detection? Our methods employ internal ensembles; can we combine heterogeneous anomaly detection algorithms into an external ensemble?

Comparison of Ensemble Methods. [Figure: "Ensemble Comparison (MAGIC Gamma Telescope)": change in logit(AUC) with respect to the 2-component Gaussian baseline (gauss-model), on a scale of roughly -0.3 to 0.3, for PCA, Schubert, Schubert-info, glmnet (L1 logistic regression), and Isolation Forest, the best non-ensemble method.]

Ensemble Conclusions No convincing evidence that ensembles work better than simply running iforest 49

Concluding Remarks. Anomaly detection has received relatively little study in machine learning, statistics, and data mining. There are two main paradigms for designing algorithms: anomaly detection by under-fitting and anomaly detection by over-fitting. The over-fitting paradigm is producing interesting algorithms; they also require less modeling effort and can be very efficient. In the analyst case, simple marginal scores work very well for sequential feature explanations.

Questions?