
Anomaly (outlier) detection. Huiping Cao

Outline
- General concepts
  - What are outliers
  - Types of outliers
  - Causes of anomalies
  - Challenges of outlier detection
- Outlier detection approaches

What are outliers
- Outliers are the data points that differ significantly from the rest of the objects in the data set.
- Assumption: there are considerably more normal observations than abnormal observations (outliers/anomalies) in the data.
- Applications: fraud detection (credit card usage), intrusion detection (computer systems, computer networks), ecosystem disturbances, public health, medicine.
- Related problem: novelty detection.

Types of outliers: global
- Global outliers deviate significantly from the rest of the data set; also called point anomalies.
- Most outlier detection methods are designed to find this type of outlier.
- Example: intrusion detection in network traffic.

Types of outliers: contextual (conditional)
- An object may be an outlier in one context but normal in another.
- Contextual attributes define the object's context, e.g., date and location.
- Behavioral attributes define the object's characteristics and are used to evaluate whether the object is an outlier in that context, e.g., temperature.
- A generalization of the local outlier, which is defined in density-based analysis.
- Background information is needed to determine the contextual attributes.

Types of outliers: collective
- A subset of data objects forms a collective outlier if the objects as a whole deviate significantly from the entire data set, even if the individual objects are not outliers by themselves.
- Applications: supply chains, web visit sequences, network traffic (denial-of-service attacks).
- Background information is needed to model the relationships among objects.

Causes of anomalies
- Data from different classes. Hawkins' definition of an outlier: "an observation that differs so much from other observations as to arouse suspicion that it was generated by a different mechanism."
- Natural variation: anomalies that represent extreme or unlikely variations (e.g., an extremely tall person).
- Data measurement and collection errors: removing such anomalies is the focus of data preprocessing (data cleaning).
- Others: a combination of several sources.

Outline
- General concepts
  - What are outliers
  - Types of outliers
  - Causes of anomalies
  - Challenges of outlier detection
- Outlier detection approaches

Challenges of outlier detection
- Modeling normal vs. outlier objects: it is hard to model complete normal behavior. Some methods assign a binary normal/abnormal label; others assign a score measuring the outlier-ness of each object.
- Universal outlier detection is hard to develop: similarity and distance definitions are application-dependent.
- Handling noise, a common issue in real data.
- Understandability: understand why the detected objects are outliers and provide justification for the detection.

Outline
- General concepts
  - What are outliers
  - Types of outliers
  - Challenges of outlier detection
- Outlier detection approaches
  - Statistical methods
  - Proximity-based methods
  - Clustering-based methods

Outlier detection methods
- Supervised methods: data for analysis are labeled as normal or abnormal by domain experts.
  - Can be modeled as a classification problem.
  - Special aspect to consider: the normal and abnormal points are highly imbalanced.
  - Measures: recall on the outlier class is more meaningful than overall accuracy.
- Unsupervised methods: no labels; largely utilize clustering methods.
- Semi-supervised methods: only part of the data is labeled.

Outlier detection methods
- Outlier detection algorithms make assumptions about outliers versus the rest of the data; they can be categorized according to these assumptions.
- Statistical (model-based) methods: normal data follow a statistical (stochastic) model, while outliers do not follow the model.
- Proximity-based methods: the proximity of an outlier to its neighbors differs from the proximity of most other objects to their neighbors. Includes distance-based and density-based methods.
- Clustering-based methods: normal objects belong to large and dense clusters, while outliers belong to small or sparse clusters, or to no cluster at all.

Statistical approaches
- Probabilistic definition of an outlier: an object that has a low probability with respect to a probability distribution model of the data.
  - Normal objects are generated by a stochastic process and occur in regions of high probability for that model.
  - Outliers occur in regions of low probability.
- Approach steps (see the sketch below):
  1. Learn a generative model fitting the given data.
  2. Identify the objects in low-probability regions of the model.
- Categories: parametric methods (univariate, multivariate) and nonparametric methods.
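As a concrete illustration of the two steps above, the sketch below fits a Gaussian mixture to toy 2-D data and flags the points in the lowest-density regions. It assumes NumPy and scikit-learn are available; the choice of two mixture components and the bottom-1% density threshold are illustrative, not part of the lecture.

```python
# A minimal sketch of the two-step statistical approach (assumptions noted above).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(500, 2)),      # normal data
               rng.uniform(-8, 8, size=(10, 2))])    # a few scattered points

# Step 1: learn a generative model fitting the data.
model = GaussianMixture(n_components=2, random_state=0).fit(X)

# Step 2: flag objects in low-probability regions of the model.
log_density = model.score_samples(X)          # log p(x) under the fitted model
threshold = np.quantile(log_density, 0.01)    # bottom 1% of densities (illustrative)
outliers = X[log_density < threshold]
print(len(outliers), "candidate outliers")
```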

Parametric: univariate normal distribution
- Assume the data follow a normal distribution; estimate its parameters by maximum likelihood estimation (MLE).
- Standard normal distribution: N(0,1).
- Non-standard normal distribution N(μ, σ^2): convert values to z-scores, z = (x - μ)/σ.
- Use MLE to estimate μ and σ^2.

Parametric: univariate normal distribution
- Let α = prob(|x| >= c) for x ~ N(0,1), the probability of falling more than c standard deviations from the mean:

      c     α = prob(|x| >= c)
      1.0   0.3173
      1.5   0.1336
      2.0   0.0455
      2.5   0.0124
      3.0   0.0027
      3.5   0.0005
      4.0   0.0001

- Rule of thumb: mark an object as an outlier if it is more than 3σ away from the estimated mean μ, where σ is the estimated standard deviation (the μ ± 3σ region contains 99.73% of the data).
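The (c, α) table and the 3σ rule are easy to reproduce; a minimal sketch assuming NumPy and SciPy (the helper name three_sigma_outliers is mine, for illustration):

```python
import numpy as np
from scipy.stats import norm

# Reproduce the (c, alpha) table for N(0,1).
for c in (1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0):
    alpha = 2 * (1 - norm.cdf(c))     # prob(|x| >= c) under N(0,1)
    print(f"c={c:.1f}  alpha={alpha:.4f}")

def three_sigma_outliers(x, c=3.0):
    """Return values whose |z-score| exceeds c, using MLE estimates of mu and sigma."""
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()     # np.std uses the MLE (1/n) variance by default
    z = (x - mu) / sigma
    return x[np.abs(z) > c]
```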

Parametric: univariate normal distribution, example
- A city's average temperature values over 10 years: 24, 28.9, 28.9, 29, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4.
- MLE estimates: μ = 28.61, σ^2 ≈ 2.29, so σ = sqrt(2.29) ≈ 1.51.
- Is 24 an outlier? |z-score| = |24 - 28.61| / 1.51 ≈ 3.04 > 3, so 24 is flagged as an outlier.

Parametric: other univariate outlier detection approaches (S.S.)
- Boxplot method
- Grubbs' test (maximum normed residual test)

Parametric: multivariate
- Idea: convert the multivariate problem into a univariate outlier detection problem.
- Option 1: use the Mahalanobis distance from object o to the mean μ, MDist(o, μ) = sqrt((o - μ)^T S^{-1} (o - μ)), where S is the covariance matrix, and detect outliers on the resulting univariate distances (a sketch follows below).
- Option 2: use the χ^2 statistic, χ^2 = Σ_{i=1..n} (o_i - E_i)^2 / E_i, where o_i is the value of o on the i-th dimension, E_i is the mean of the i-th dimension over all objects, and n is the number of dimensions. A large χ^2 value suggests an outlier.
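A minimal sketch of option 1, assuming NumPy; the helper name and the 97.5% quantile cutoff are illustrative choices:

```python
import numpy as np

def mahalanobis_distances(X):
    """Mahalanobis distance of every row of X from the sample mean."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    S = np.cov(X, rowvar=False)          # covariance matrix of the features
    S_inv = np.linalg.inv(S)
    diff = X - mu
    # sqrt((o - mu)^T S^{-1} (o - mu)) for each row o
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, S_inv, diff))

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 2]], size=300)
d = mahalanobis_distances(X)
outliers = X[d > np.quantile(d, 0.975)]   # univariate thresholding on the distances
```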

Nonparametric methods
- Nonparametric methods make fewer assumptions about the data distribution and are therefore applicable in more scenarios.
- Histogram approach (see the sketch below):
  - Construct a histogram (choices: equal-width or equal-depth bins, the number of bins, or the size of each bin).
  - Outliers: points that fall in no bin, or in bins with very small counts.
  - Drawback: it is hard to decide on the bin size.
- Others: kernel (density) functions, discussed more in machine learning courses.
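A minimal sketch of the histogram approach, assuming NumPy; the bin count and the "fewer than 1% of the points" rule are illustrative:

```python
import numpy as np

def histogram_outliers(x, bins=10, min_fraction=0.01):
    """Flag values that land in bins containing fewer than min_fraction of the data."""
    x = np.asarray(x, dtype=float)
    counts, edges = np.histogram(x, bins=bins)
    # Assign each value to a bin (clip so the maximum falls in the last bin).
    idx = np.clip(np.digitize(x, edges) - 1, 0, bins - 1)
    return x[counts[idx] < min_fraction * len(x)]

data = np.concatenate([np.random.default_rng(2).normal(30, 2, 500), [5.0, 70.0]])
print(histogram_outliers(data))   # the two planted extreme values fall in near-empty bins
```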

Outline
- General concepts
  - What are outliers
  - Types of outliers
  - Challenges of outlier detection
- Outlier detection approaches
  - Statistical methods
  - Proximity-based methods
  - Clustering-based methods

Proximity-based approaches
- Each data object is represented as a vector of features.
- An object is judged by its neighborhood.
- Major approaches: distance-based and density-based.

Distance-based approach
- Anomaly: an object that is distant from most other points.
- Distance to the k-th nearest neighbor: the outlier score of an object is the distance to its k-th nearest neighbor; objects whose score exceeds a threshold are outliers (a sketch follows below).
- Problem: it is hard to decide on k (see the next slide).
- Improvement: use the average of the distances to the first k nearest neighbors.
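A minimal sketch of both scores, assuming NumPy; the function name and parameters are mine, for illustration:

```python
import numpy as np

def knn_distance_scores(X, k=5, average=False):
    """k-NN distance outlier scores: distance to the k-th neighbor, or the average
    of the first k neighbor distances when average=True."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances (n x n).
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    np.fill_diagonal(D, np.inf)          # ignore the distance of a point to itself
    D.sort(axis=1)                       # each row: distances to neighbors, ascending
    return D[:, :k].mean(axis=1) if average else D[:, k - 1]

X = np.random.default_rng(3).normal(size=(200, 2))
X[0] = [8.0, 8.0]                        # plant one far-away point
scores = knn_distance_scores(X, k=5)
print(scores.argmax())                   # the planted point should get the largest score
```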

[Figure: scatter plots illustrating the effect of k on the k-NN distance score. With k=1 the isolated point O is the outlier; with k=5 all points in the small cluster at the upper-right corner become outliers.]

Distance-based outlier detection
- Given a data set D with n data points, a distance threshold r, and a fraction threshold π.
- r-neighborhood: the set of objects within distance r of an object; it separates outliers from the rest of the data.
- Object o is a DB(r, π)-outlier if at most a fraction π of the objects in D lie within distance r of o, i.e., |{o' | dist(o, o') <= r}| <= πn.
- Nested-loop approach: compute the distance between every pair of data points, which is O(n^2) in the worst case; with early stopping it often behaves like O(n) in practice (a sketch follows below).
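A minimal sketch of the nested-loop test, assuming NumPy; the early break once an object has accumulated enough neighbors is what keeps the average cost far below the O(n^2) worst case for non-outliers:

```python
import numpy as np

def db_outliers(X, r, pi):
    """Return indices of DB(r, pi)-outliers: objects with at most pi*n neighbors within r."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    max_neighbors = int(pi * n)
    outliers = []
    for i in range(n):
        count = 0
        for j in range(n):
            if i != j and np.linalg.norm(X[i] - X[j]) <= r:
                count += 1
                if count > max_neighbors:   # enough close neighbors: o cannot be an outlier
                    break
        else:                               # loop finished without the break
            outliers.append(i)
    return outliers
```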

A grid-based implementation
- Partition the data space into cells whose diagonal has length r/2, i.e., each cell edge has length r / (2 sqrt(d)), where d is the number of dimensions.
- Level-1 cells of a cell C: the cells directly adjacent to C. Any point o' in C or in a level-1 cell satisfies dist(o, o') <= r for every point o in C.
- Level-2 cells of C: the cells a little farther out (2 or 3 cells away from C in two dimensions; in general, within 2 sqrt(d) cells). Any point o' with dist(o, o') <= r for some o in C must lie in C, a level-1 cell, or a level-2 cell; points outside the level-2 cells are more than r away.

A grid-based implementation: pruning
- For a cell C, let n_0 be the number of objects in C, n_1 the number of objects in C's level-1 cells, and n_2 the number of objects in C's level-2 cells.
- Level-1 pruning: if n_0 + n_1 > πn, no object in C is an outlier (the points in C and its level-1 cells are all within distance r, so every object in C has too many close neighbors).
- Level-2 pruning: if n_0 + n_1 + n_2 < πn + 1, all the points in C are outliers (even counting every point that could possibly be within distance r, there are too few).
- Only the objects in cells that survive both rules need the pairwise distance check (see the sketch below).
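The two pruning rules can be written as a small decision function; a sketch assuming the per-cell counts n0, n1, n2 have already been collected from the grid (the function name and return strings are mine):

```python
def prune_cell(n0, n1, n2, n, pi):
    """Decide the status of all objects in a cell from its neighborhood counts."""
    if n0 + n1 > pi * n:
        return "no object in this cell is an outlier"      # level-1 pruning
    if n0 + n1 + n2 < pi * n + 1:
        return "every object in this cell is an outlier"   # level-2 pruning
    return "undetermined: check these objects with pairwise distances"

print(prune_cell(n0=4, n1=30, n2=10, n=100, pi=0.2))
```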

Distance-based outlier detection: limitation
- Distance-based methods find global outliers; they cannot handle data sets with regions of different densities.
- [Figure: two clusters of different densities, with points p1 and p2.]

Proximity-based approaches
- Major approaches: distance-based and density-based.
- Next: density-based methods.

Density-based outlier detection
- Detects local proximity-based outliers.
- Idea: compare the density around an object with the density around its local neighbors.
- [Figure: the same two clusters of different densities, with points p1 and p2.]

Density-based outlier detection
- D: a set of objects. Nearest neighbor of o: the object o' in D \ {o} that minimizes d(o, o'), i.e., d_min(o) = min{ d(o, o') | o' in D \ {o} }.
- Local outliers are defined relative to their local neighborhoods, particularly with respect to the densities of those neighborhoods.
- Density-based outlier score: the outlier score of an object is the inverse of the density around it.

Concepts
- k-distance of an object o, d_k(o): used to measure the relative density around o. Formally, d_k(o) = d(o, p) such that
  - at least k objects o' in D \ {o} satisfy d(o, o') <= d(o, p), and
  - at most k-1 objects o' in D \ {o} satisfy d(o, o') < d(o, p).
- k-distance neighborhood of o: N_k(o) = { o' | o' in D \ {o}, d(o, o') <= d_k(o) }. N_k(o) may contain more than k objects because of ties.
- A first attempt at measuring local density: the average distance from o to the objects in N_k(o). Problem: it fluctuates when neighbors are very close.

Concepts: reachability distance and local density
- Reachability distance: reachdist(o -> o') = max{ d_k(o'), d(o, o') }.
  - This alleviates the fluctuations of raw distances for very close neighbors.
  - It is not symmetric: reachdist(o -> o') is in general different from reachdist(o' -> o).
- Local (reachability) density of o: the inverse of the average reachability distance from o to its k-distance neighborhood,
  density_k(o) = |N_k(o)| / Σ_{o' in N_k(o)} reachdist(o -> o') = |N_k(o)| / Σ_{o' in N_k(o)} max{ d_k(o'), d(o, o') }.
- Note: this density is different from the density definition used in density-based clustering (global vs. local).

Example (k = 2, Euclidean distance)
- [Figure: scatter plot of o and its neighbors p1, p2, p3.]
- The distance from o to its 2nd nearest neighbor is 1, so d_k(o) = 1 and N_k(o) = {p1, p2, p3} (p2 and p3 are tied at distance 1).
- d_k(p1) = sqrt(0.64 + 1.0) = 1.28, dist(o, p1) = 0.8
- d_k(p2) = sqrt(2) = 1.41, dist(o, p2) = 1
- d_k(p3) = sqrt(0.32) = 0.57, dist(o, p3) = 1
- reachdist(o -> p1) = max{1.28, 0.8} = 1.28; reachdist(o -> p2) = max{1.41, 1} = 1.41; reachdist(o -> p3) = max{0.57, 1} = 1
- density_k(o) = 3 / (1.28 + 1.41 + 1) ≈ 0.813
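A quick check of these numbers in plain Python, using the distances given on the slide:

```python
# k = 2; the k-distances of the neighbors and the distances from o are taken from the slide.
d_k = {"p1": 1.28, "p2": 1.41, "p3": 0.57}
dist_o = {"p1": 0.8, "p2": 1.0, "p3": 1.0}

reach = {p: max(d_k[p], dist_o[p]) for p in d_k}     # reachdist(o -> p)
density_o = len(reach) / sum(reach.values())         # 3 / (1.28 + 1.41 + 1.0)
print(reach, round(density_o, 3))                    # reach distances 1.28, 1.41, 1.0 and 0.813
```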

Local outlier factor (LOF), also called the average relative density of o
- LOF_k(o) = ( Σ_{o' in N_k(o)} density_k(o') / density_k(o) ) / |N_k(o)|, i.e., the average ratio of the local reachability density of o's k-nearest neighbors to the local reachability density of o itself.
- The lower density_k(o) and the higher density_k(o') of its neighbors, the higher the LOF, and the more likely o is an outlier.
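Putting the last few slides together, the sketch below computes LOF scores from scratch (k-distance, N_k, reachability distance, local reachability density, and finally the LOF ratio). It assumes NumPy and is a plain O(n^2) illustration, not an optimized implementation:

```python
import numpy as np

def lof_scores(X, k=2):
    X = np.asarray(X, dtype=float)
    n = len(X)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    np.fill_diagonal(D, np.inf)

    # k-distance d_k(o) and k-distance neighborhood N_k(o) (ties included).
    d_k = np.sort(D, axis=1)[:, k - 1]
    N_k = [np.where(D[i] <= d_k[i])[0] for i in range(n)]

    # Local reachability density: density_k(o) = |N_k(o)| / sum of reachdist(o -> o').
    density = np.empty(n)
    for i in range(n):
        reach = np.maximum(d_k[N_k[i]], D[i, N_k[i]])   # max{d_k(o'), d(o, o')}
        density[i] = len(N_k[i]) / reach.sum()

    # LOF_k(o): average ratio of the neighbors' density to o's own density.
    return np.array([density[N_k[i]].mean() / density[i] for i in range(n)])

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.3, (100, 2)),   # dense cluster
               rng.normal(5, 1.5, (100, 2)),   # sparse cluster
               [[2.5, 2.5]]])                  # a local outlier between them
print(lof_scores(X, k=5).argmax())             # the planted point should get the largest score
```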

Example (continued)
- Using the same data and k = 2, compute density_k(p1), density_k(p2), and density_k(p3) in the same way, and then LOF_k(o) as the average of density_k(p_i) / density_k(o).

Outline
- General concepts
  - What are outliers
  - Types of outliers
  - Challenges of outlier detection
- Outlier detection approaches
  - Statistical methods
  - Proximity-based methods
  - Clustering-based methods

Clustering-based methods
- Clustering-based outlier: an object is a cluster-based outlier if it does not strongly belong to any cluster.
- An outlier is an object that belongs to a small and remote cluster, or does not belong to any cluster.

Clustering-based methods
- Basic step: cluster the data into groups of different density.
- Three general approaches:
  1. An object does not belong to any cluster: it is an outlier.
  2. There is a large distance between an object and the cluster to which it is closest: the object is an outlier.
  3. The object is part of a small and sparse cluster: all the objects in that cluster are outliers.

Approach 2
- There is a large distance between an object and the cluster to which it is closest: the object is an outlier (a sketch follows below).
- Calculate the ratio of o's distance to its closest cluster C_o over the average distance of C_o's members to that cluster:
  ratio(o) = d(o, C_o) / ( Σ_{o' in C_o} d(o', C_o) / |C_o| )
- The larger the ratio, the farther o is from its closest cluster, and the more likely o is an outlier.
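A minimal sketch of approach 2, assuming scikit-learn's KMeans and NumPy; the number of clusters and the ratio cutoff of 3.0 are illustrative choices, not part of the lecture:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(c, 0.4, size=(100, 2)) for c in ((0, 0), (5, 0), (0, 5))])
X = np.vstack([X, [[8.0, 8.0]]])                      # one far-away point

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
centers = km.cluster_centers_[km.labels_]             # closest cluster center of each object
dist_to_center = np.linalg.norm(X - centers, axis=1)  # d(o, C_o)

# Average member-to-center distance of each cluster, then the ratio per object.
avg_dist = np.array([dist_to_center[km.labels_ == c].mean() for c in range(3)])
ratio = dist_to_center / avg_dist[km.labels_]
print(np.where(ratio > 3.0)[0])                       # the planted point should appear here
```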

Outliers in lower-dimensional projections
- In high-dimensional space, data are sparse and the notion of proximity becomes almost meaningless: from the perspective of proximity-based definitions, every point is an almost equally good outlier.
- Lower-dimensional projection methods: a point is an outlier if, in some lower-dimensional projection, it lies in a local region of abnormally low density (see the sketch below).
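A rough sketch of this idea, assuming NumPy: project the data onto a few random 2-D subspaces, grid each projection, and flag points that land in near-empty cells of some projection. All parameters (number of projections, grid size, cell-count cutoff) are illustrative:

```python
import numpy as np

def low_dim_projection_outliers(X, n_proj=5, bins=10, max_count=2, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    flagged = np.zeros(len(X), dtype=bool)
    for _ in range(n_proj):
        P = rng.normal(size=(X.shape[1], 2))          # random 2-D projection
        Y = X @ P
        H, xe, ye = np.histogram2d(Y[:, 0], Y[:, 1], bins=bins)
        ix = np.clip(np.digitize(Y[:, 0], xe) - 1, 0, bins - 1)
        iy = np.clip(np.digitize(Y[:, 1], ye) - 1, 0, bins - 1)
        flagged |= H[ix, iy] <= max_count             # low-density cell in this projection
    return np.where(flagged)[0]
```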

R and Python packages
- R package outliers: https://cran.r-project.org/web/packages/outliers/outliers.pdf
- R package Rlof: a parallel implementation of the Local Outlier Factor (LOF) that uses multiple CPUs to significantly speed up the LOF computation for large data sets. https://cran.r-project.org/web/packages/rlof/rlof.pdf
- Python LOF implementation: http://shahramabyari.com/2015/12/30/myfirst-attempt-with-local-outlier-factorlof-identifying-density-based-localoutliers/
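scikit-learn also ships an LOF estimator (sklearn.neighbors.LocalOutlierFactor); a minimal usage sketch, assuming scikit-learn is installed:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.vstack([np.random.default_rng(6).normal(size=(200, 2)), [[6.0, 6.0]]])
lof = LocalOutlierFactor(n_neighbors=20)   # n_neighbors plays the role of k
labels = lof.fit_predict(X)                # -1 for outliers, 1 for inliers
scores = -lof.negative_outlier_factor_     # larger values indicate stronger outliers
print(np.where(labels == -1)[0])
```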