Anomaly (outlier) detection. Huiping Cao, Anomaly 1

Outline
- General concepts
  - What are outliers
  - Types of outliers
  - Causes of anomalies
  - Challenges of outlier detection
- Outlier detection approaches

What are outliers
- Outliers: the data points that are significantly different from the rest of the objects
- Assumption: there are considerably more normal observations than abnormal observations (outliers/anomalies) in the data
- Applications
  - Fraud detection (credit card usage)
  - Intrusion detection (computer systems, computer networks)
  - Ecosystem disturbances
  - Public health
  - Medicine
- Related: novelty detection

Types of outliers
- Global outliers: deviate significantly from the rest of the dataset
  - Also called point anomalies
  - Most outlier detection methods are designed to find such outliers
  - Example: intrusion detection in network traffic

Types of outliers
- Contextual (conditional) outliers: an object is an outlier in one context, but may be normal in another context
  - Contextual attributes: define the object's context, e.g., date, location
  - Behavior attributes: define the object's characteristics, and are used to evaluate whether the object is an outlier in the context, e.g., temperature
  - A generalization of local outliers, which are defined in density-based analysis
  - Background information is needed to determine the contextual attributes, etc.

Types of outliers
- Collective outliers: a subset of data objects forms a collective outlier if the objects as a whole deviate significantly from the entire data set
  - The individual data objects may not be outliers
  - Applications: supply chain, web visiting, network (denial-of-service)
  - Background information is needed to model the relationships among objects

Causes of anomalies
- Data from different classes
  - Hawkins' definition of an outlier: an outlier is an observation that differs so much from other observations as to arouse suspicion that it was generated by a different mechanism.
- Natural variation
  - Anomalies that represent extreme or unlikely variations (e.g., an extremely tall person)
- Data measurement and collection errors
  - Removing such anomalies is the focus of data preprocessing (data cleaning)
- Others: there may be several sources

Challenges of outlier detection
- Modeling normal/outlier objects
  - Hard to model complete normal behavior
  - Some methods assign a label (normal or abnormal)
  - Some methods assign a score measuring the outlier-ness of the object
- Universal outlier detection: hard to develop
  - Similarity and distance definitions are application-dependent
- Common issue: noise
- Understandability
  - Understand why the detected objects are outliers
  - Provide justification of the detection

Outline
- General concepts
  - What are outliers
  - Types of outliers
  - Challenges of outlier detection
- Outlier detection approaches
  - Statistical methods
  - Proximity-based methods
  - Clustering-based methods

Outlier detection methods
- The methods differ in whether the data for analysis are labeled as normal or abnormal by domain experts
- Supervised methods
  - Can be modeled as a classification problem
  - Special aspect to consider: normal and abnormal data points are imbalanced
  - Measures: recall on the outlier class is more meaningful than overall accuracy
- Unsupervised methods
  - Largely utilize clustering methods
- Semi-supervised methods
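To see why recall matters more than accuracy when the classes are imbalanced, consider a detector that labels every point as normal. The sketch below uses hypothetical labels (990 normal points, 10 outliers) and a helper name, `accuracy_and_recall`, that is not from the slides.

```python
def accuracy_and_recall(y_true, y_pred, positive=1):
    """Return (overall accuracy, recall of the positive/outlier class)."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    true_pos = sum(t == positive and p == positive for t, p in pairs)
    actual_pos = sum(t == positive for t in y_true)
    recall = true_pos / actual_pos if actual_pos else 0.0
    return correct / len(y_true), recall


# Hypothetical data: 990 normal points (label 0) and 10 outliers (label 1).
y_true = [0] * 990 + [1] * 10
# A useless detector that calls everything normal.
all_normal = [0] * 1000

acc, rec = accuracy_and_recall(y_true, all_normal)
print(f"accuracy={acc:.2f}, outlier recall={rec:.2f}")
```

The detector finds no outlier at all, yet its accuracy is 0.99; only the recall of 0 reveals the failure.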

Outlier detection methods
- Outlier detection algorithms make assumptions about outliers versus the rest of the data
- Categories according to the assumptions made
  - Statistical (model-based) methods
    - Normal data follow a statistical (stochastic) model
    - Outliers do not follow the model
  - Proximity-based methods
    - The proximity of an outlier to its neighbors differs from the proximity of most other objects to their neighbors
    - Distance-based, density-based
  - Clustering-based methods
    - Normal objects belong to large and dense clusters
    - Outliers belong to small or sparse clusters, or belong to no cluster

Statistical approaches
- Probabilistic definition of an outlier: an outlier is an object that has a low probability with respect to a probability distribution model of the data
  - Normal objects are generated by a stochastic process and occur in regions of high probability for the stochastic model
  - Outliers occur in regions of low probability
- Approach steps
  - Learn a generative model fitting the given data
  - Identify the objects in low-probability regions of the model
- Categories
  - Parametric methods (univariate, multivariate)
  - Nonparametric methods

Parametric: univariate normal distribution
- Normal distribution, maximum likelihood estimation (MLE)
  - Standard normal distribution: N(0,1)
  - Non-standard normal distribution: N(μ,σ²), mapped to N(0,1) with the z-score z = (x − μ)/σ
  - Use MLE to estimate μ and σ²
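For reference, the standard MLE estimates for a normal sample x_1, ..., x_n, and the z-score that maps a value to N(0,1), can be written as follows. The hat notation is an assumption of this sketch, not the slide's notation.

```latex
\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{\mu})^2, \qquad
z = \frac{x - \hat{\mu}}{\hat{\sigma}}
```

Note that the MLE variance divides by n, not by n − 1 as the unbiased sample variance does.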

Parametric: univariate normal distribution
- prob(|x| ≥ c) = α for N(0,1)
- Mark an object as an outlier if it is more than 3σ away from the estimated mean μ, where σ is the standard deviation (the μ ± 3σ region contains 99.73% of the data)
- (c, α) pairs for N(0,1):

  c     α
  1.0   0.3173
  1.5   0.1336
  2.0   0.0455
  2.5   0.0124
  3.0   0.0027
  3.5   0.0005
  4.0   0.0001
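The (c, α) pairs can be reproduced with the complementary error function, since for a standard normal variable prob(|x| ≥ c) = erfc(c/√2). A minimal sketch using only the Python standard library (the function name `two_sided_tail` is illustrative):

```python
import math


def two_sided_tail(c):
    """prob(|x| >= c) for x ~ N(0,1), via the complementary error function."""
    return math.erfc(c / math.sqrt(2))


for c in (1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0):
    print(f"c={c:.1f}  alpha={two_sided_tail(c):.4f}")
```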

Parametric: univariate normal distribution
- Example: a city's average temperature values in 10 years: 24, 28.9, 28.9, 29, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4
  - MLE estimates: μ = 28.61, σ² ≈ 2.38, σ = sqrt(2.38) ≈ 1.54
  - Is 24 an outlier? z-score = (24 − 28.61)/1.54 ≈ −2.99, so 24 lies about 3σ below the mean: a borderline case, right at the 3σ cutoff
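The arithmetic of this example can be checked by recomputing it from the ten listed values. The sketch assumes the MLE variance (dividing by n rather than n − 1); the variable names are illustrative.

```python
import math

temps = [24, 28.9, 28.9, 29, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4]
n = len(temps)

mu = sum(temps) / n                               # MLE of the mean
var_mle = sum((x - mu) ** 2 for x in temps) / n   # MLE of the variance (divide by n)
sigma = math.sqrt(var_mle)
z = (24 - mu) / sigma                             # z-score of the suspicious value

print(f"mu={mu:.2f}  var={var_mle:.4f}  sigma={sigma:.3f}  z={z:.3f}")
print("beyond 3 sigma:", abs(z) > 3)
```

With the unbiased variance (dividing by n − 1), σ is slightly larger and |z| slightly smaller, so the choice of estimator matters for values this close to the cutoff.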

Parametric: other univariate outlier detection approaches (S.S.)
- Boxplot method
- Grubbs' test (maximum normed residual test)
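As an illustration of the boxplot method, the sketch below flags values that fall outside the whiskers. It assumes the common 1.5 × IQR whisker convention and the "inclusive" quartile definition of Python's `statistics.quantiles`; neither choice is stated on the slide, and the function name is hypothetical.

```python
import statistics


def boxplot_outliers(values, whisker=1.5):
    """Return the values lying outside [Q1 - whisker*IQR, Q3 + whisker*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical temperature readings with one unusually low value.
temps = [24, 28.9, 28.9, 29, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4]
print(boxplot_outliers(temps))
```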

Parametric: multivariate
- Convert the problem to a univariate outlier detection problem
- Use the Mahalanobis distance from object o to the mean μ: $\sqrt{(o - \mu)^T S^{-1} (o - \mu)}$, where S is the covariance matrix
- Use the χ² statistic: $\chi^2 = \sum_{i=1}^{n} \frac{(o_i - E_i)^2}{E_i}$
  - o_i: the value of o on the i-th dimension
  - E_i: the mean of the i-th dimension over all objects
  - n: the number of dimensions
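A minimal sketch of the Mahalanobis idea, restricted to 2-D data so that the covariance matrix can be inverted in closed form without extra libraries. It returns the squared distance of each point to the sample mean, using the sample covariance (dividing by n − 1); the function name and the data are hypothetical.

```python
def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each 2-D point to the sample mean."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]] (divides by n - 1).
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Inverse of a 2x2 matrix: (1/det) * [[syy, -sxy], [-sxy, sxx]].
    result = []
    for x, y in points:
        dx, dy = x - mx, y - my
        result.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return result


# Hypothetical data: seven points near the diagonal and one point far off it.
pts = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.9), (5, 5.1), (6, 6.0), (7, 6.9), (2, 6.5)]
d2 = mahalanobis_sq_2d(pts)
print(max(range(len(pts)), key=d2.__getitem__))  # index of the most distant point
```

The resulting one-dimensional scores can then be passed to a univariate test. The point (2, 6.5) is not extreme in either coordinate alone but gets the largest score because it breaks the correlation between the two dimensions.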

Nonparametric
- Nonparametric methods make fewer assumptions about the data distribution, and are thus applicable in more scenarios
- Histogram approach
  - Construct a histogram (choices: equal width or equal depth, number of bins or size of each bin)
  - Outliers: objects that fall in no bin or in bins with small counts
  - Drawback: hard to decide the bin size
- Others: kernel functions (discussed more in machine learning)
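The histogram approach can be sketched with equal-width bins: count how many values land in each bin and flag the values in sparsely populated bins. The parameters `bin_width` and `min_count` are assumptions of this sketch, and choosing them is exactly the drawback noted above.

```python
from collections import Counter


def histogram_outliers(values, bin_width, min_count):
    """Flag values that fall in equal-width bins holding fewer than min_count values."""
    def bin_of(v):
        return int(v // bin_width)

    counts = Counter(bin_of(v) for v in values)
    return [v for v in values if counts[bin_of(v)] < min_count]


# Hypothetical measurements: two dense bins and one isolated value.
data = [1.1, 1.3, 1.2, 1.9, 2.1, 2.4, 2.2, 2.8, 7.5]
print(histogram_outliers(data, bin_width=1.0, min_count=2))
```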

Proximity-based approaches
- Data are represented as vectors of features
- Based on the neighborhood of each object
- Major approaches
  - Distance-based
  - Density-based

Distance-based approach
- Anomaly: an object that is distant from most points
- Distance to the k-th nearest neighbor: the outlier score of an object is the distance to its k-th nearest neighbor
- Outliers: objects whose score exceeds a threshold
- Problem: hard to decide k (see next slides)
- Improvement: use the average of the distances to the k nearest neighbors
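A brute-force sketch of both scores, the distance to the k-th nearest neighbor and the average distance to the k nearest neighbors. The function name and the sample points are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math


def knn_scores(points, k, average=False):
    """Outlier score per point: distance to its k-th nearest neighbor,
    or the average distance to its k nearest neighbors if average=True."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k if average else dists[k - 1])
    return scores


# Hypothetical data: a tight cluster and one distant point O = (5, 5).
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
print(knn_scores(pts, k=1))
print(knn_scores(pts, k=2, average=True))
```

This version computes all pairwise distances, so it costs O(n²) distance evaluations.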

[Figure: three scatter plots, both axes from 0 to 2.5. Panel titles: "k=1, outlier is O"; "k=1, outlier is O"; "k=5, all points at the right upper corner are outliers".]

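Distance-based outlier detection
- Given a dataset D with n data points, a distance threshold r, and a fraction threshold π
- r-neighborhood of an object o: the objects o' with dist(o,o') ≤ r
- Object o is a DB(r,π)-outlier if $\frac{|\{o' \mid dist(o,o') \le r\}|}{|D|} \le \pi$
- Approach: compute the distance between every pair of data points
  - O(n²) in the worst case
  - Practically, close to O(n), because the check for an object can stop as soon as more than πn neighbors are found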
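A nested-loop sketch of DB(r,π)-outlier detection with early termination: the inner loop stops as soon as an object has more than πn neighbors within distance r, which is why the running time is often far below the O(n²) worst case. The sketch assumes that an object is not counted as its own neighbor; the function name and the data are hypothetical.

```python
import math


def db_outliers(points, r, pi):
    """Return the DB(r, pi)-outliers: objects with at most pi*n neighbors within r."""
    n = len(points)
    limit = pi * n
    outliers = []
    for i, o in enumerate(points):
        count = 0
        for j, other in enumerate(points):
            if i != j and math.dist(o, other) <= r:
                count += 1
                if count > limit:
                    break  # enough neighbors found: o cannot be an outlier
        else:
            # Inner loop finished without a break: o has at most pi*n neighbors.
            outliers.append(o)
    return outliers


pts = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
print(db_outliers(pts, r=1.5, pi=0.2))
```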

A grid-based method implementation
- Cell diagonal length: r/2
- Cell edge length: $\frac{r}{2\sqrt{d}}$, where d is the number of dimensions
- Level-1 cells
  - Direct neighbor cells of a cell C
  - For any point o in C, any point o' in such cells has dist(o,o') ≤ r
- Level-2 cells
  - One or two cells away from a cell C
  - For any point o in C, any point o' within distance r that is not in C or a level-1 cell must be in a level-2 cell; points beyond the level-2 cells have dist(o,o') > r

A grid-based method implementation
- Pruning
  - n0: total number of objects in a cell C
  - n1: total number of objects in C's level-1 cells
  - n2: total number of objects in C's level-2 cells
  - Level-1 cell pruning: if (n0 + n1) > πn, no object o in C is an outlier
  - Level-2 cell pruning: if (n0 + n1 + n2) < πn + 1, all the points in C are outliers
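The two pruning rules amount to a per-cell decision based only on the three counts. The sketch below applies them as stated; building the grid and counting the objects per cell are omitted, and the function name and return strings are hypothetical.

```python
def prune_cell(n0, n1, n2, pi, n):
    """Decide the fate of all objects in a cell C from the counts alone.

    n0: objects in C, n1: objects in C's level-1 cells,
    n2: objects in C's level-2 cells, n: total number of objects.
    """
    if n0 + n1 > pi * n:
        return "no outliers"        # level-1 pruning: every object has enough neighbors
    if n0 + n1 + n2 < pi * n + 1:
        return "all outliers"       # level-2 pruning: too few objects anywhere within r
    return "check each object"      # counts are inconclusive: compute distances


print(prune_cell(5, 10, 0, pi=0.1, n=100))
print(prune_cell(1, 2, 3, pi=0.1, n=100))
print(prune_cell(2, 5, 20, pi=0.1, n=100))
```

Only the cells that fall through to the third case require object-by-object distance computations.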

Distance-based outlier detection
- Finds global outliers only: cannot handle data sets with regions of different densities
- [Figure with labeled points p1 and p2]

Density-based outlier detection
- Local proximity-based outliers
- Compare the density around one object with the density around its local neighbors
- [Figure with labeled points p1 and p2]

Density based
- D: a set of objects
- Distance from o to its nearest neighbor: d(o,D) = min{d(o,o') | o' in D, o' ≠ o}
- Local outliers: outliers relative to their local neighborhoods, particularly with respect to the densities of the neighborhoods
- Density-based outlier: the outlier score of an object is the inverse of the density around the object

Concepts
- k-distance of an object o, d_k(o): measures the relative density around o. Formally, d_k(o) = d(o,p) for an object p in D such that
  - at least k objects o' in D\{o} have d(o,o') ≤ d(o,p)
  - at most k−1 objects o' in D\{o} have d(o,o') < d(o,p)
- k-distance neighborhood of an object o: N_k(o) = {o' | o' in D, d(o,o') ≤ d_k(o)}
  - N_k(o) may contain more than k objects
- Measure of local density: average distance from o to the objects in N_k(o)
  - Problem: fluctuations, since very close neighbors make this average unstable

Concepts
- Reachability distance: reachdist(o'→o) = max{d_k(o), d(o,o')}
  - Alleviates fluctuations
  - Not symmetric: reachdist(o'→o) ≠ reachdist(o→o')
- Local density of o: the inverse of the average reachability distance from o to the objects in N_k(o)
  $density_k(o) = \frac{|N_k(o)|}{\sum_{o' \in N_k(o)} reachdist(o \to o')} = \frac{|N_k(o)|}{\sum_{o' \in N_k(o)} \max\{d_k(o'), d(o,o')\}}$
- Different from the density definition in density-based clustering (global vs. local)

Example
- k = 2, using Euclidean distance
- The distance from o to its 2nd nearest neighbor is 1, so d_k(o) = 1
- N_k(o) = {p1, p2, p3}
  - d_k(p1) = sqrt(0.64 + 1.0) = 1.28, dist(o,p1) = 0.8
  - d_k(p2) = sqrt(2) = 1.41, dist(o,p2) = 1
  - d_k(p3) = sqrt(0.32) = 0.57, dist(o,p3) = 1
- reachdist(o→p1) = 1.28, reachdist(o→p2) = 1.41, reachdist(o→p3) = 1
- density_k(o) = 3/(1.28 + 1.41 + 1) = 0.813
- [Figure: scatter plot with axes x and y showing o and its neighbors p1, p2, p3]

Local outlier factor (LOF) (or average relative density of o)
- Average ratio of the local reachability density of the k-nearest neighbors of o to the local reachability density of o:
  $LOF_k(o) = \frac{1}{|N_k(o)|} \sum_{o' \in N_k(o)} \frac{density_k(o')}{density_k(o)}$
- The lower density_k(o) and the higher density_k(o'), the higher the LOF → the higher the probability that o is an outlier
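The k-distance, the k-distance neighborhood, the reachability-based local density, and the LOF can be put together in a short brute-force sketch. It assumes Euclidean distance and no duplicate points (duplicates would make a density infinite); the function name and the data are hypothetical, and this is not the implementation of any particular package.

```python
import math


def lof_scores(points, k):
    """Brute-force local outlier factor for every point (O(n^2) distances)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    k_dist, neigh = [], []
    for i in range(n):
        others = sorted((dist[i][j], j) for j in range(n) if j != i)
        dk = others[k - 1][0]                 # k-distance d_k(o)
        k_dist.append(dk)
        # N_k(o): every object within d_k(o); ties can make it larger than k.
        neigh.append([j for d, j in others if d <= dk])

    def density(i):
        # |N_k(o)| divided by the sum of reachability distances max{d_k(o'), d(o,o')}.
        total = sum(max(k_dist[j], dist[i][j]) for j in neigh[i])
        return len(neigh[i]) / total

    dens = [density(i) for i in range(n)]
    # LOF: average of density_k(o') / density_k(o) over o' in N_k(o).
    return [sum(dens[j] for j in neigh[i]) / (len(neigh[i]) * dens[i])
            for i in range(n)]


# Hypothetical data: a 3x3 grid of points plus one distant point at index 9.
pts = [(x, y) for x in range(3) for y in range(3)] + [(10, 10)]
scores = lof_scores(pts, k=3)
print([round(s, 2) for s in scores])
```

Points inside the grid get scores close to 1, while the distant point gets a score several times larger.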

Clustering-based
- Clustering-based outlier: an object is a cluster-based outlier if it does not strongly belong to any cluster
- An outlier is an object that belongs to a small and remote cluster, or that does not belong to any cluster

Clustering-based
- Basic step: cluster the data into groups of different density
- Three general approaches
  - An object does not belong to any cluster → the object is an outlier
  - There is a large distance between an object and the cluster to which it is closest → the object is an outlier
  - The object is part of a small and sparse cluster → all the objects in that cluster are outliers

Approach 2
- There is a large distance between an object and the cluster to which it is closest → outlier
- Calculate a ratio; the larger the ratio, the farther away o is from its closest cluster C_o
  $ratio = \frac{d(o, c_o)}{\frac{1}{|C_o|} \sum_{o' \in C_o} d(o', c_o)}$
  where c_o is the center of C_o
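A minimal sketch of this ratio, assuming that the clusters are already given as lists of points and that the cluster center is the centroid (the mean of the members). The function name and the data are hypothetical.

```python
import math


def cluster_distance_ratio(o, clusters):
    """Distance from o to its closest cluster center, divided by the average
    distance of that cluster's members to the same center."""
    def centroid(cluster):
        return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

    centers = [centroid(c) for c in clusters]
    idx = min(range(len(clusters)), key=lambda i: math.dist(o, centers[i]))
    c_o, members = centers[idx], clusters[idx]
    avg = sum(math.dist(p, c_o) for p in members) / len(members)
    return math.dist(o, c_o) / avg


# Hypothetical clustering result: two square-shaped clusters.
clusters = [
    [(0, 0), (0, 2), (2, 0), (2, 2)],
    [(10, 10), (10, 12), (12, 10), (12, 12)],
]
print(cluster_distance_ratio((1, 5), clusters))   # far from its closest cluster
print(cluster_distance_ratio((0, 0), clusters))   # a typical cluster member
```

A ratio near 1 means the object is about as far from the center as a typical member; a threshold on the ratio then separates outliers. The sketch fails with a division by zero if all members of the closest cluster coincide with its center.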

Outliers in lower-dimensional projections
- In high-dimensional space, data are sparse and the notion of proximity becomes meaningless
  - Every point is an almost equally good outlier from the perspective of proximity-based definitions
- Lower-dimensional projection methods
  - A point is an outlier if, in some lower-dimensional projection, it lies in a local region of abnormally low density

R packages
- outliers: https://cran.r-project.org/web/packages/outliers/outliers.pdf
- rlof: an R parallel implementation of Local Outlier Factor (LOF) that uses multiple CPUs to significantly speed up the LOF computation for large datasets. https://cran.r-project.org/web/packages/rlof/rlof.pdf
Python LOF implementation
- http://shahramabyari.com/2015/12/30/myfirst-attempt-with-local-outlier-factorlof-identifying-density-based-localoutliers/