Advances in Anomaly Detection
Tom Dietterich, Alan Fern, Weng-Keen Wong, Andrew Emmott, Shubhomoy Das, Md. Amran Siddiqui, Tadesse Zemicheal
Outline
Introduction
  Three application areas
Two general approaches to anomaly detection
  Under-fitting
  Over-fitting
DARPA ADAMS Red Team results
Benchmarks for Anomaly Detection
  Validation
  Comparison Study
Next Steps
  Anomaly Explanations
  Ensembles
Why Anomaly Detection?
Data cleaning: find data points that contain errors
Science: find data points that are interesting or unusual
Security / fraud detection: find users/customers who are behaving weirdly
Data Cleaning for Sensor Networks
An ideal method should produce two things given raw data:
  A label that marks anomalies
  An imputation of the true value (with some confidence measure)
[Figure: air temperature (degrees Celsius) traces for sensors x11, x12, x17, x18, x19, x20, x25, x29, and x31 vs. day index (days 21-41 from start of deployment), with anomalies flagged and imputed values shown]
Dereszynski & Dietterich, ACM TOSN 2011
NASA: Finding Interesting Data Points
Ingest the data set and rank points by interestingness.
Repeat:
  Show the most interesting point to a scientist (Yes: interesting / No: not interesting)
  Build a model of the uninteresting points
The most interesting point = the most un-uninteresting point = the most extreme outlier among the uninteresting points.
Mars Science Laboratory ChemCam: olivine; first non-carbonate.
Wagstaff, Lanza, Thompson, Dietterich, Gilmore. AAAI 2013
Security/Fraud Detection: DARPA ADAMS Program
Desktop activity data collected from ~5000 employees of a corporation using Raytheon Oakley SureView.
A CERT Red Team overlays selected employees with insider-threat activity based on real scenarios.
Example scenarios: Anomalous Encryption; Layoff Logic Bomb; Insider Startup; Circumventing SureView; Hiding Undue Affluence; Survivor's Burden.
Team: Leidos (formerly SAIC); Ted Senator, PI; Rand Waltzman, PM.
What is Anomaly Detection?
Input: vectors x_i ∈ R^d for i = 1, …, N
  Assumed to be a mix of normal and anomalous data points
  Anomalies are generated by some distinct process (e.g., instrument failures, fraud, intruders, etc.)
Output: an anomaly score s_i for each input x_i, such that higher scores are more anomalous and similar scores imply similar levels of anomalousness
Metrics:
  AUC: probability that a randomly chosen anomaly is ranked above a randomly chosen normal point
  Precision in the top K
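Both metrics can be computed directly from the scores. A minimal numpy sketch (the function names `auc` and `precision_at_k` are ours, not from the talk):

```python
import numpy as np

def auc(anomaly_scores, normal_scores):
    """AUC exactly as defined on this slide: the probability that a
    randomly chosen anomaly is ranked above a randomly chosen normal
    point (ties count one half)."""
    a = np.asarray(anomaly_scores, dtype=float)[:, None]
    b = np.asarray(normal_scores, dtype=float)[None, :]
    wins = (a > b).sum() + 0.5 * (a == b).sum()
    return wins / (a.size * b.size)

def precision_at_k(scores, labels, k):
    """Fraction of true anomalies (label 1) among the k highest scores."""
    top = np.argsort(scores)[::-1][:k]
    return float(np.mean(np.asarray(labels)[top]))
```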
Two General Approaches to Anomaly Detection
Anomaly detection by under-fitting: Gaussian Mixture Model (GMM); Ensemble of Gaussian Mixture Models (EGMM)
Anomaly detection by over-fitting: Isolation Forest (IFOR); Repeated Impossible Discrimination Ensemble (RIDE)
Anomaly Detection by Under-Fitting
Choose a class of models and fit one to the data.
Let P_θ(x_i) be the probability density assigned to data point x_i by the model θ.
Assign score s_i = −log P_θ(x_i).
Low-density points (poorly explained by the model) are the anomalies.
Example: Gaussian Mixture Model
P(x) = Σ_{k=1}^K p_k · Normal(x | μ_k, Σ_k)
[Figure: mixture with K = 3 components]
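The under-fitting recipe can be sketched in a few lines of numpy: evaluate the mixture density P(x) = Σ_k p_k Normal(x | μ_k, Σ_k) and score each point by its negative log density. This is a hand-rolled sketch with the mixture parameters assumed already given; a real system would fit them with EM (e.g., scikit-learn's GaussianMixture).

```python
import numpy as np

def gmm_log_density(X, weights, means, covs):
    """log P(x) = log sum_k p_k Normal(x | mu_k, Sigma_k), computed with
    a log-sum-exp for numerical stability.  Parameters are assumed
    already fitted (in practice, by EM)."""
    X = np.atleast_2d(X)
    n, d = X.shape
    comp = np.empty((n, len(weights)))
    for k, (w, mu, cov) in enumerate(zip(weights, means, covs)):
        diff = X - mu
        quad = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
        logdet = np.linalg.slogdet(cov)[1]
        comp[:, k] = np.log(w) - 0.5 * (d * np.log(2 * np.pi) + logdet + quad)
    m = comp.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(comp - m).sum(axis=1, keepdims=True))).ravel()

def anomaly_scores(X, weights, means, covs):
    """Under-fitting score s_i = -log P(x_i): low-density points
    (poorly explained by the model) get the highest scores."""
    return -gmm_log_density(X, weights, means, covs)
```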
Ensemble of GMMs
Train M independent Gaussian Mixture Models:
  Train model m = 1, …, M on a bootstrap replicate of the data
  Vary the number of clusters K
  Delete any model with log likelihood < 70% of the best model
Compute the average surprise: s_i = −(1/M) Σ_m log P_m(x_i)
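A rough sketch of this ensemble recipe, with two simplifications flagged in the docstring: a single Gaussian stands in for each Gaussian mixture, and the 70% pruning rule is interpreted loosely, since the slide does not pin down exactly how the threshold is applied to log likelihoods. The function names are ours.

```python
import numpy as np

def _gauss_logpdf(X, mu, cov):
    d = X.shape[1]
    diff = X - mu
    quad = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
    return -0.5 * (d * np.log(2 * np.pi) + np.linalg.slogdet(cov)[1] + quad)

def egmm_scores(X, M=25, keep_frac=0.7, rng=None):
    """Train M models on bootstrap replicates, prune poorly fitting
    models, and return the average surprise -(1/M') sum_m log P_m(x_i).
    Simplifications: a single Gaussian stands in for each Gaussian
    mixture (the real EGMM fits mixtures with varying K), and the 70%
    rule is applied as a tolerance around the best training log
    likelihood."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    models, lls = [], []
    for _ in range(M):
        boot = X[rng.integers(0, n, n)]                  # bootstrap replicate
        mu = boot.mean(axis=0)
        cov = np.cov(boot, rowvar=False) + 1e-6 * np.eye(d)
        models.append((mu, cov))
        lls.append(_gauss_logpdf(boot, mu, cov).sum())   # training log likelihood
    best = max(lls)
    kept = [m for m, ll in zip(models, lls)
            if ll >= best - (1.0 - keep_frac) * abs(best)]
    return -np.mean([_gauss_logpdf(X, mu, cov) for mu, cov in kept], axis=0)
```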
DARPA ADAMS Vegas Results
Score each user and rank them all.
AUC = probability that we correctly rank a randomly chosen Red Team insert above a randomly chosen normal user.
[ROC curves: Vegas Sept 2012, AUC = 0.970, AvgLift = 26.17; Vegas Oct 2012, AUC = 0.970]
New Approach: Anomaly Detection by Over-Fitting
Take the input points; randomly split them in half, labeling one half 0 and the other half 1.
Apply supervised learning to discriminate the 0s from the 1s (which by construction is impossible).
score(x) = |0.5 − P̂(y = 1 | x)|
Repeat the random split and the discrimination; total the scores after 50 iterations.
RIDE: Repeated Impossible Discrimination Ensemble
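The intuition is that a flexible learner can memorize the random labels of isolated points but not of points in dense regions, where the random labels average out. A self-contained sketch; the learner here is a histogram classifier on a random 1-D projection, which is our stand-in (the talk does not specify the classifier), and the function name is ours.

```python
import numpy as np

def ride_scores(X, iters=50, bins=20, rng=None):
    """Sketch of RIDE: repeatedly assign random 0/1 labels, fit a
    flexible learner to this (impossible) discrimination task, and
    credit each point with |0.5 - Phat(y=1|x)|.  Isolated points tend
    to occupy a histogram bin alone, so the learner memorizes their
    random label; dense regions average out to Phat ~ 0.5."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    total = np.zeros(n)
    for _ in range(iters):
        y = rng.integers(0, 2, n)          # random, meaningless labels
        w = rng.normal(size=d)             # random projection direction
        z = X @ w
        edges = np.linspace(z.min(), z.max(), bins + 1)
        cell = np.clip(np.searchsorted(edges, z, side="right") - 1, 0, bins - 1)
        p1 = np.array([y[cell == b].mean() if np.any(cell == b) else 0.5
                       for b in range(bins)])
        total += np.abs(0.5 - p1[cell])
    return total / iters
```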
RIDE Vegas Results
[ROC curves: Vegas Sept, AUC = 0.920; Vegas Oct, AUC = 0.981]
Isolation Forest [Liu, Ting, Zhou, 2011]
Construct a fully random binary tree:
  repeat
    choose attribute j at random
    choose splitting threshold θ uniformly from [min x_j, max x_j]
  until every data point is in its own leaf
Let d(x_i) be the depth of point x_i.
Repeat 100 times; let d̄(x_i) be the average depth of x_i.
score(x_i) = 2^(−d̄(x_i) / r(x_i)), where r(x_i) is the expected depth.
[Figure: example isolation tree with splits of the form x_j > θ]
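The tree construction and the depth-based score above fit in a short numpy sketch. Assumptions: this scores the training points themselves, uses the harmonic-number approximation for the expected depth, and skips the subsampling (256 points per tree) used in the published algorithm; the function names are ours.

```python
import numpy as np

def _c(n):
    """Expected path length of an unsuccessful BST search, used both to
    normalize depths and to credit unsplit groups at the height limit.
    Harmonic-number approximation (coarse for very small n)."""
    if n <= 1:
        return 0.0
    return 2.0 * (np.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n

def _grow(X, idx, depth, limit, out, rng):
    """Split idx on a random attribute/threshold, recording each point's
    isolation depth (plus c(size) when the height limit is reached)."""
    j = rng.integers(X.shape[1])
    col = X[idx, j]
    if len(idx) <= 1 or depth >= limit or col.min() == col.max():
        out[idx] = depth + _c(len(idx))
        return
    t = rng.uniform(col.min(), col.max())
    _grow(X, idx[col < t], depth + 1, limit, out, rng)
    _grow(X, idx[col >= t], depth + 1, limit, out, rng)

def iforest_scores(X, trees=100, rng=None):
    """score(x_i) = 2 ** (-dbar(x_i) / c(n)): higher scores mean the
    point was easier to isolate, i.e. more anomalous."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    total = np.zeros(n)
    limit = int(np.ceil(np.log2(n)))
    for _ in range(trees):
        depths = np.empty(n)
        _grow(X, np.arange(n), 0, limit, depths, rng)
        total += depths
    return 2.0 ** (-(total / trees) / _c(n))
```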
VEGAS Results May 2013
VEGAS Results June 2013
VEGAS Results July 2013
Needed: Benchmarks for Anomaly Detection Algorithms
Shared benchmark databases, such as the UCI Repository of Machine Learning Data Sets, have helped supervised learning make rapid progress.
Anomaly detection lacks shared benchmarks: most data sets are proprietary and/or classified. The one exception, the Lincoln Labs simulated network intrusion data set, is hopelessly out of date.
Goal: develop a collection of benchmark data sets with known properties.
Benchmark Requirements
The underlying process generating the anomalies should be distinct from the process generating the normal points: anomalies are not merely outliers.
We need many benchmark data sets, to prevent the research community from fixating on a small number of problems.
Benchmark data sets should systematically vary a set of relevant properties.
Relevant Properties
Point difficulty: how difficult is it to separate each individual anomaly point from the normal points?
Relative frequency: how rare are the anomalies?
Clusteredness: are the anomalous points tightly clustered or widely scattered?
Irrelevant features: how many features are irrelevant?
Creating an Anomaly Detection Benchmark Data Set
Select a UCI supervised learning dataset.
Choose one class to be the anomalies (call this class 0 and the union of the other classes class 1); this ensures that different processes generate the anomalies and the normal points.
Computing point difficulty:
  Fit a kernel logistic regression model to estimate P(y = 1 | x), where y is the class label (the "oracle" model)
  The difficulty of x_i is defined as P(y = 1 | x_i) for anomaly points, according to the oracle
For the desired relative frequency, select points based on difficulty and clusteredness.
Optionally: add irrelevant features by selecting existing features and randomly permuting their values.
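The selection machinery can be sketched once an oracle difficulty score is in hand (the kernel logistic regression fit itself is omitted here). Both function names are ours, and the relative-frequency arithmetic assumes the normal points are kept fixed.

```python
import numpy as np

def add_irrelevant_features(X, n_irrelevant, rng=None):
    """The slide's irrelevant-feature construction: copy existing
    columns and independently permute each copy's rows, so the new
    columns keep realistic marginal distributions but carry no
    information about any individual point."""
    rng = np.random.default_rng(rng)
    cols = rng.integers(0, X.shape[1], n_irrelevant)
    noise = np.column_stack([rng.permutation(X[:, j]) for j in cols])
    return np.hstack([X, noise])

def select_anomalies(difficulty, band, rel_freq, n_normal, rng=None):
    """Pick anomaly points whose oracle difficulty P(y=1|x) falls in the
    given [lo, hi) band, sized so that anomalies make up rel_freq of the
    final data set when n_normal normal points are kept."""
    rng = np.random.default_rng(rng)
    lo, hi = band
    pool = np.flatnonzero((difficulty >= lo) & (difficulty < hi))
    k = min(len(pool), int(round(rel_freq * n_normal / (1.0 - rel_freq))))
    return rng.choice(pool, size=k, replace=False)
```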
Benchmark Collection
19 "mother" UCI data sets
Point difficulty: low (0, 0.16); medium [0.16, 0.33); high [0.33, 0.5); very high [0.5, 1)
Relative frequency: 0.001, 0.005, 0.01, 0.05, 0.1
Clusteredness: 7 levels based on log(σ²_n / σ²_a), the variance of the normal points divided by the variance of the anomalous points
  A facility-location algorithm is used to select well-spaced points; seed-point neighbors are used to find clustered points
Irrelevant features: 4 levels based on increasing the average distance between normal points
24,800 benchmark data sets generated
Benchmarking Study
State-of-the-art methods:
  ocsvm: one-class SVM (Schölkopf et al., 1999)
  lof: Local Outlier Factor (Breunig et al., 2000)
  svdd: Support Vector Data Description (Tax & Duin, 2004)
  if: Isolation Forest (Liu et al., 2008, 2011)
  scif: SCiForest (Liu et al., 2010)
  rkde: Robust Kernel Density Estimation (Kim & Scott, 2012)
  egmm: ours
Analysis:
  Measure the AUC of each method and compute the mean AUC per method
  Fit a logistic regression model: logit(AUC) = method + difficulty + frequency + clusteredness + irrelevance
Benchmark Validity: Point Difficulty
Benchmark Validity: Relative Frequency
Benchmark Validity: Clusteredness
Algorithm Comparisons: Mean AUC
[Bar chart: mean AUC for if, lof, rkde, egmm, svdd, scif, and ocsvm; most methods fall between roughly 0.5 and 0.62, with one bar labeled 0.403]
Algorithm Comparisons: Logistic Regression Results
if: Isolation Forest (Liu et al., 2011)
rkde: Robust Kernel Density Estimation (Kim & Scott, 2012)
egmm: ours
lof: Local Outlier Factor (Breunig et al., 2000)
ocsvm: one-class SVM (Schölkopf et al., 1999)
svdd: Support Vector Data Description (Tax & Duin, 2004)
Sensitivity to Irrelevant Features
The performance of all methods drops as the number of irrelevant features increases.
RKDE and IFOR perform very well; OCSVM is extremely sensitive.
EGMM was hurt by the largest level of irrelevance, but it is the top performer when there is no noise.
[Line chart: average AUC (0.45-0.65) vs. irrelevant-feature level (level-0 to level-3) for egmm, if, lof, rkde, svdd, scif, ocsvm]
Next Steps
Generate explanations of each anomaly for the analyst.
Ensembles.
Model the peer-group structure of the organization: the same user in previous days; all users in the company today; users with the same job class; users who work together (shared printer, email cliques).
Anomaly Explanations
[Pipeline diagram: data points → anomaly detector → outliers become alarms (threats & false positives) shown to the human analyst; non-outliers are discarded (non-threats & missed threats)]
Type 1 missed threats = anomaly detector false negatives; reduce these by improving the anomaly detector.
Type 2 missed threats = analyst false negatives; these can occur due to information overload and time constraints.
We consider reducing Type 2 misses by providing explanations: why did the detector consider an object to be an outlier? The analyst can then focus on information related to the explanation.
Sequential Feature Explanations
Goal: reduce analyst effort for correctly detecting outliers that are threats.
How: provide the analyst with sequential feature explanations of outlier points.
Sequential Feature Explanation (SFE): an ordering on the features of an outlier, prioritized by their importance to the anomaly detector.
Protocol: incrementally reveal features in SFE order until the analyst can make a confident determination.
Typical Sequential Feature Explanation Curve
Performance metric: the number of features the analyst must examine in order to make a confident decision that a proposed threat (outlier) requires opening an investigation.
Evaluating Explanations
Methodological problem: evaluation requires access to an analyst, but we can't run large-scale experiments with real analysts.
Solution: construct simulated analysts that compute P(normal | x).
How: start with an anomaly detection benchmark constructed from a UCI supervised learning data set [Emmott et al., 2013]; learn a classifier to predict anomaly vs. normal from the labeled data ("cheating"); repeat for each subset of K features.
[Diagram: UCI dataset → normal points + anomaly points → supervised learning → classifier P(normal | x) = simulated analyst]
Explanation Methods for Density-Based Anomaly Detectors
Density-based: rank points x according to the estimated density f(x).
Marginal methods greedily add features that most decrease the joint marginal f(x_1, …, x_K):
  Sequential Marginal: the first feature x_i minimizes f(x_i); the second feature x_j minimizes f(x_i, x_j); …
  Independent Marginal: order features by f(x_i)
Dropout methods greedily remove features that most increase f(x):
  Sequential Dropout: the first feature x_i maximizes f(x_{−i}); the second feature x_j maximizes f(x_{−i,−j}); …
  Independent Dropout: order features by f(x_{−i})
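The Sequential Marginal method can be sketched concretely for a density whose marginals are available in closed form. A single Gaussian is used here for brevity (marginalizing a Gaussian just drops rows and columns of the mean and covariance); the talk's detector is an EGMM, whose marginals are mixtures of Gaussian marginals. Function names are ours.

```python
import numpy as np

def gauss_logmarg(x, mean, cov, feats):
    """Closed-form log marginal density f(x_feats) under a fitted
    Gaussian, obtained by keeping only the rows/columns of the mean and
    covariance for the chosen features."""
    f = list(feats)
    mu, C = mean[f], cov[np.ix_(f, f)]
    diff = x[f] - mu
    quad = diff @ np.linalg.solve(C, diff)
    return -0.5 * (len(f) * np.log(2 * np.pi) + np.linalg.slogdet(C)[1] + quad)

def sequential_marginal_sfe(x, mean, cov):
    """Sequential Marginal: greedily order features so each prefix
    (x_1, ..., x_k) has the lowest possible joint marginal density,
    i.e. the outlier looks as anomalous as possible as early as
    possible."""
    d = len(x)
    chosen, remaining = [], set(range(d))
    while remaining:
        best = min(remaining,
                   key=lambda j: gauss_logmarg(x, mean, cov, chosen + [j]))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

With an identity covariance the marginals factorize, so the greedy order is simply by decreasing |x_j|; with correlated features the sequential and independent orders can differ.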
Empirical Demonstration
Datasets: 10,000 benchmarks derived from 7 UCI datasets
Anomaly detector: Ensemble of Gaussian Mixture Models (EGMM)
Simulated analysts: regularized random forests (RRFs)
Evaluation metric: mean minimum feature prefix (MMFP) = the average number of features revealed before the analyst is able to make a decision (exonerate vs. open an investigation)
Results (EGMM + Explanation Method)
[Bar chart: MMFP (0-6) for IndDO, IndMarg, OptOracle, SeqDO, SeqMarg, Random]
In these domains, an oracle only needs 1-2 features.
Dropout methods are often worse than marginal methods.
There is often no benefit to sequential methods over independent methods.
Random is always worst.
Results (EGMM + Explanation Method)
All methods significantly beat random.
Marginal methods are no worse than, and sometimes better than, dropout methods.
Independent marginal is nearly as good as sequential marginal.
KDD99 (Computer Intrusion) Results (EGMM detector)
[Bar chart with 95% confidence intervals: MMFP for Independent Dropout, Sequential Dropout, Independent Marginal, Sequential Marginal]
Marginal methods are best; one feature is enough!
Ensemble Methods
In supervised learning, ensemble methods (bagging, random forests, boosting) have been shown to be very powerful.
Can we develop general-purpose ensemble methods for anomaly detection?
Our methods already employ internal ensembles; can we combine heterogeneous anomaly detection algorithms into an external ensemble?
Comparison of Ensemble Methods
[Bar chart: change in logit(AUC) relative to the 2-component Gaussian model (gauss-model) on the MAGIC Gamma Telescope data, for PCA, Schubert, Schubert-info, glmnet (L1 logistic regression), and Isolation Forest; the best non-ensemble method is marked for reference]
Ensemble Conclusions
No convincing evidence that ensembles work better than simply running iforest.
Concluding Remarks
Anomaly detection has received relatively little study in machine learning, statistics, and data mining.
There are two main paradigms for designing algorithms: anomaly detection by under-fitting and anomaly detection by over-fitting.
The over-fitting paradigm is producing interesting algorithms; they also require less modeling effort and can be very efficient.
In the analyst case, simple marginal scores work very well for sequential feature explanations.
Questions?