SYSTEMATIC CONSTRUCTION OF ANOMALY DETECTION BENCHMARKS FROM REAL DATA. Outlier Detection and Description Workshop, 2013

Authors: Andrew Emmott (emmott@eecs.oregonstate.edu), Thomas Dietterich (tgd@eecs.oregonstate.edu), Weng-Keen Wong (wong@eecs.oregonstate.edu), Shubhomoy Das (dassh@eecs.oregonstate.edu), Alan Fern (afern@eecs.oregonstate.edu)

Acknowledgements. Staff: Jed Irvine, Mike Sander, Michael Slater.

Anomaly Detection At Multiple Scales. Application domain: insider threat detection. Good results, but on private data; hard to publish.

How are anomaly detection methods evaluated?

Synthetic Data. Pro: the ground-truth density is known, so every point is perfectly labeled. Con: synthetic data is synthetic.

Real Data. Abalone: distinguish candidate anomalies by regression value. Wine Quality: the donor suggests extreme quality values as anomalies. KDD Cup 99: posed as a supervised learning problem; intrusions are far more numerous than in reality. All of these approaches require modifying existing data sets.

Requirements for anomaly detection benchmarks: Points are drawn from real data. Anomalies are semantically distinct from normal points. Many benchmark datasets are needed. Benchmark datasets should be characterized in terms of well-defined and meaningful problem dimensions that can be systematically varied.

Proposed problem dimensions. Point difficulty: how difficult is it to distinguish the anomalies from the normal points? Insider threats want to blend in; jet engine failures do not. Relative frequency: how much of the data is anomalous? Signal failures might be relatively frequent; insider threats might be exceedingly rare.

Proposed problem dimensions. Semantic variation: how similar are the anomalies to each other? Are they a de facto class, or are they simply not normal? Feature relevance/irrelevance: how well do statistical outliers in the data map to the application target? An 8-foot-tall man is an outlier, but he is not an insider threat.

How do we enforce these requirements?

Points are drawn from real data: We chose mother sets from the UC Irvine Machine Learning Repository using the following criteria. Task: classification (binary or multi-class) or regression; no time series. Instances: at least 1000; no upper limit. Features: no more than 200; no lower limit. Values: numeric only; categorical features are ignored if present; no missing values, except where easily ignored. This resulted in the selection of 19 data sets.
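
As a minimal sketch of these criteria in code (the metadata record and its field names are hypothetical illustrations, not part of the authors' tooling):

```python
# Hypothetical sketch of the selection criteria above; the metadata
# records and their field names are illustrative, not the authors' code.

def is_candidate_mother_set(meta):
    """Return True if a UCI dataset satisfies the selection criteria."""
    if meta["task"] not in {"classification", "regression"}:
        return False                  # classification or regression only
    if meta["time_series"]:
        return False                  # no time-series data
    if meta["n_instances"] < 1000:
        return False                  # at least 1000 instances
    if meta["n_features"] > 200:
        return False                  # no more than 200 features
    if meta["has_missing_values"]:
        return False                  # no hard-to-ignore missing values
    return True

# Example usage with a hypothetical metadata record:
abalone = {"task": "regression", "time_series": False,
           "n_instances": 4177, "n_features": 8,
           "has_missing_values": False}
assert is_candidate_mother_set(abalone)
```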

Anomalies are semantically distinct from normal points: We use the semantics of the mother set to distinguish candidate nominals and candidate anomalies. For binary classification tasks, we are already given two semantically distinct groups.

Anomalies are semantically distinct from normal points: We use the semantics of the mother set to distinguish candidate nominals and candidate anomalies. For regression tasks, we split the data into two groups at the median response value.
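
For instance, the median split is a one-liner in NumPy (a sketch with stand-in data; which half plays the nominal role is a per-benchmark choice):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=1000)          # stand-in regression responses

median = np.median(y)
candidate_nominals = y <= median   # boolean masks over the data points
candidate_anomalies = y > median   # (this role assignment is illustrative)
```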

Anomalies are semantically distinct from normal points: We use the semantics of the mother set to distinguish candidate nominals and candidate anomalies. For multi-class classification tasks, we partition the classes such that we maximize the confusion between the two groups.

Maximizing Confusion: How and Why. Why? We need confusion between the candidate nominal and anomaly classes to ensure a variety of point difficulties (discussed later). How? Consider a complete graph with the classes as vertices. Assign weights to the edges equal to the pairwise confusion between classes (discussed next). Find a maximum-weight spanning tree on this graph. 2-color the tree; the coloring is your partition.
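
A sketch of this procedure, assuming a symmetric matrix confusion[i][j] of pairwise class-confusion scores (how those scores are computed is the next slide); networkx supplies the maximum-weight spanning tree:

```python
import networkx as nx

def partition_classes(confusion):
    """2-color a maximum-weight spanning tree over the classes.

    confusion: symmetric 2-D array; confusion[i][j] is the pairwise
    confusion score between class i and class j.
    Returns the two class groups as sets.
    """
    k = len(confusion)
    G = nx.Graph()
    for i in range(k):
        for j in range(i + 1, k):
            G.add_edge(i, j, weight=confusion[i][j])

    # Spanning tree that keeps the most total pairwise confusion.
    tree = nx.maximum_spanning_tree(G)

    # Trees are bipartite, so a BFS 2-coloring always succeeds.
    color = {0: 0}
    for u, v in nx.bfs_edges(tree, source=0):
        color[v] = 1 - color[u]
    return ({c for c in color if color[c] == 0},
            {c for c in color if color[c] == 1})

# Example: three classes with pairwise confusion scores.
conf = [[0, 5, 1],
        [5, 0, 3],
        [1, 3, 0]]
print(partition_classes(conf))   # -> ({0, 2}, {1})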

Measuring Confusion. We used Breiman's Random Forest algorithm to model the original multi-class problem. The forest provides a probabilistic belief about each point's membership in each class. All probability mass that crosses between a pair of classes is summed as our confusion measure.
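
One way to realize this measure with scikit-learn (a sketch: out-of-fold predicted probabilities stand in for the forest's beliefs, and class labels are assumed to be 0..n_classes-1):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

def pairwise_confusion(X, y, n_classes):
    """Sum the probability mass each class's points place on other classes."""
    y = np.asarray(y)                 # labels assumed to be 0..n_classes-1
    rf = RandomForestClassifier(n_estimators=500, random_state=0)
    proba = cross_val_predict(rf, X, y, cv=5, method="predict_proba")

    C = np.zeros((n_classes, n_classes))
    for i in range(n_classes):
        for j in range(n_classes):
            if i != j:
                # probability mass that points of class i assign to class j
                C[i, j] = proba[y == i, j].sum()
    return C + C.T   # symmetric: mass flowing in both directions
```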

Many benchmark datasets are needed: Our methodology generated 5,120 benchmarks from the original 19 mother sets. Benchmark datasets should be characterized in terms of well-defined and meaningful problem dimensions that can be systematically varied: Our contribution includes a methodology for controlling and measuring point difficulty, relative frequency, and semantic variation.

Point Difficulty. We model the binary version of the mother set with kernel logistic regression (KLR), labeling the candidate anomalies as class 0 and the nominals as class 1. We take the response of the KLR model on each candidate anomaly as a measure of its difficulty. For this study, we bucket the candidate anomalies into four groups: easy (0, ⅙), medium [⅙, ⅓), hard [⅓, ½), very hard [½, 1).
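
A hedged sketch of this step: scikit-learn has no kernel logistic regression, so a Nystroem feature map feeding an ordinary logistic regression stands in for KLR here; this is an approximation, not the authors' implementation.

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def difficulty_buckets(X, y, X_anom):
    """Bucket candidate anomalies (class 0) by the model's P(class 1)."""
    # Nystroem + logistic regression as a stand-in for kernel logistic regression.
    klr = make_pipeline(
        Nystroem(kernel="rbf", n_components=100, random_state=0),
        LogisticRegression(max_iter=1000))
    klr.fit(X, y)                    # y: 0 = candidate anomaly, 1 = nominal

    # High probability of "nominal" means the anomaly blends in, i.e. is hard.
    p_nominal = klr.predict_proba(X_anom)[:, 1]
    edges = np.array([1/6, 1/3, 1/2])     # easy / medium / hard / very hard
    return np.digitize(p_nominal, edges)  # bucket indices 0..3
```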

Relative Frequency is easily controlled: simply select the desired number of anomalies. In this study we examined relative frequencies of 0.001, 0.005, 0.01, 0.05, and 0.1. Semantic Variation: We define clusteredness, the ratio of the sample variance of the nominal points to the sample variance of the anomalous points, to indicate the semantic variation in the chosen anomalies. We partially control this problem dimension by using a facility-location algorithm to choose clustered or scattered groups. We bucket the scores into six qualitative groups: high scatter (0, 0.25), medium scatter [0.25, 0.5), low scatter [0.5, 1), low cluster [1, 2), medium cluster [2, 4), high cluster [4, ∞).
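
A sketch of the clusteredness score and its buckets, assuming "sample variance" means the total variance of each group (the trace of its covariance matrix):

```python
import numpy as np

def clusteredness(nominal, anomalous):
    """Ratio of total sample variance: nominal points over anomalous points."""
    var_nominal = np.trace(np.cov(nominal, rowvar=False))
    var_anomalous = np.trace(np.cov(anomalous, rowvar=False))
    return var_nominal / var_anomalous

BUCKETS = [                       # (upper bound, label), scanned in order
    (0.25, "high scatter"), (0.5, "medium scatter"), (1.0, "low scatter"),
    (2.0, "low cluster"), (4.0, "medium cluster"), (np.inf, "high cluster"),
]

def clusteredness_bucket(score):
    return next(label for bound, label in BUCKETS if score < bound)
```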

Algorithm Summary. One-Class SVM (ocsvm) (Schölkopf et al.). Support Vector Data Description (svdd) (Tax and Duin). Local Outlier Factor (lof) (Breunig et al.). Isolation Forest (if) (Liu et al.). (New) Robust Kernel Density Estimation (rkde) (Kim and Scott): J. Kim and C. Scott, "Robust Kernel Density Estimation," Journal of Machine Learning Research, 13 (2012) 2529-2565. Ensemble Gaussian Mixture Model (egmm).

Results. We evaluate each algorithm on each benchmark with ROC AUC. We fit a linear model and measure the change Δ logit(AUC) attributable to each factor: logit(AUC) ~ set + algo + diff + freq + cluster.
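
A sketch of this analysis with statsmodels' formula API, assuming a data frame with one row per (benchmark, algorithm) run and categorical columns named as in the formula:

```python
import numpy as np
import statsmodels.formula.api as smf

def fit_auc_model(df):
    """df columns: auc, set, algo, diff, freq, cluster (categorical)."""
    df = df.copy()
    # Logit transform spreads apart AUC values that crowd near 1.
    df["logit_auc"] = np.log(df["auc"] / (1 - df["auc"]))
    model = smf.ols(
        "logit_auc ~ C(set) + C(algo) + C(diff) + C(freq) + C(cluster)",
        data=df).fit()
    return model   # coefficients are per-factor changes in logit(AUC)
```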

[Chart: Δ logit(AUC) by mother set. y-axis: change in logit(AUC); lower = harder, higher = easier.]

[Chart: Δ logit(AUC) by algorithm: if, rkde, egmm, lof, ocsvm, svdd.]

[Chart: Δ logit(AUC) by point difficulty: easy, medium, hard, very hard.]

[Chart: Δ logit(AUC) by relative frequency: very low, low, medium, high, very high.]

[Chart: Δ logit(AUC) by clusteredness: high cluster, medium cluster, low cluster, low scatter, medium scatter, high scatter.]

Thank You!