Introduction to Density Estimation and Anomaly Detection. Tom Dietterich


Outline
- Definition and Motivations
- Density Estimation
  - Parametric Density Estimation
  - Mixture Models
  - Kernel Density Estimation
  - Neural Density Estimation
- Anomaly Detection
  - Distance-based methods
  - Isolation Forest and LODA
  - RPAD theory

What is Density Estimation? Given a data set $x_1, \ldots, x_N$ where each $x_i \in \mathbb{R}^d$, we assume the data have been drawn iid from an unknown probability density: $x_i \sim P$. Goal: estimate $\hat{P}$. Requirements: $\hat{P}(x) \ge 0$ for all $x \in \mathbb{R}^d$ and $\int_{\mathbb{R}^d} \hat{P}(x)\, dx = 1$.

How to Evaluate a Density Estimator. Suppose I have computed a density estimator $\hat{P}$ for $P$. How can I evaluate it? A good density estimator should assign high density where $P$ is large and low density where $P$ is low. Standard metric: the log likelihood $\sum_i \log \hat{P}(x_i)$.

Important: Holdout Likelihood. If we use our training data to construct $\hat{P}$, we cannot use that same data to evaluate $\hat{P}$. Solution: given our initial data $S = \{x_1, \ldots, x_N\}$, randomly split it into $S_{\text{train}}$ and $S_{\text{test}}$, compute $\hat{P}$ using $S_{\text{train}}$, and evaluate $\hat{P}$ using $S_{\text{test}}$: $\sum_{x_i \in S_{\text{test}}} \log \hat{P}(x_i)$.
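
A minimal sketch of the holdout procedure, using scikit-learn's KernelDensity as a stand-in for $\hat{P}$ (the choice of estimator, bandwidth, and split proportion are illustrative assumptions, not part of the slides):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
S = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))      # toy data set S

S_train, S_test = train_test_split(S, test_size=0.3, random_state=0)

p_hat = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(S_train)

# Holdout log likelihood: sum of log P-hat(x_i) over x_i in S_test.
holdout_loglik = p_hat.score(S_test)   # score() returns the total log-likelihood
print("holdout log likelihood:", holdout_loglik)
```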

Reminder: Densities, Probabilities, Events. A density $\mu$ is a measure over some space $X$. A density can be greater than 1 at a point but must integrate to 1. An event is a subspace (region) $E \subseteq X$. The probability of an event is obtained by integration: $P(E) = \int_{x \in E} \mu(x)\, dx$.

Example from the ipython notebook. Normal probability density: $P(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right)$. Normal cumulative distribution function: $F(z; \mu, \sigma)$ is the probability of the event $x \le z$, i.e., $F(z; \mu, \sigma) = \int_{-\infty}^{z} P(x; \mu, \sigma)\, dx$.
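
A quick numerical check of these formulas using scipy.stats (the values of $\mu$, $\sigma$, and the evaluation point are arbitrary choices for illustration):

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 0.0, 1.0
x = 1.5
pdf = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
print(pdf, norm.pdf(x, loc=mu, scale=sigma))   # the two values agree
print(norm.cdf(x, loc=mu, scale=sigma))        # F(1.5; 0, 1) is about 0.933
```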

Why Estimate Densities? Anomaly detection and classification. If we can learn high-dimensional densities, then all machine learning problems can be solved using probabilistic methods.

Anomaly Detection. Anomaly: a data point generated by a different process than the process that generates the normal data points. Example: fraud detection. Normal points: legitimate financial transactions; anomalous points: fraudulent transactions. Example: sensor data. Normal points: correct data values; anomalous points: bad values (broken sensors).

Anomaly Score using Surprise. In information theory, the surprise of an observation $x$ is defined as $S(x) = -\log P(x)$. Properties: if $P(x) = 0$, then $S(x) = +\infty$; if $P(x) = 1$, then $S(x) = 0$. Surprise is only defined for events, but we often apply it to densities when we are interested in small values of $P(x)$.

Example. Nominal distribution: Normal(0,1). Anomaly distribution: Normal(3,1). Generate 100 nominal points and 10 anomalies, then plot the anomaly score as a function of $x$. Setting a threshold at 2.39 detects all anomalies and also produces 8 false alarms.
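
A sketch of this toy experiment, scoring points by their surprise under the nominal Normal(0,1) density; the exact counts depend on the random seed, so they may differ slightly from the slide's numbers:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
nominal = rng.normal(0.0, 1.0, size=100)
anomaly = rng.normal(3.0, 1.0, size=10)
x = np.concatenate([nominal, anomaly])

score = -norm.logpdf(x, loc=0.0, scale=1.0)   # surprise S(x) = -log P(x)
flagged = score > 2.39
print("anomalies flagged:", flagged[100:].sum(), "of 10")
print("false alarms among nominals:", flagged[:100].sum())
```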

Classification with Class-Conditional Models. Goal: predict $y$ from $x$. Model the process that creates the data: $y \sim P(y)$ (discrete) and $x \sim P(x \mid y) = N(x \mid \mu_y, \Sigma_y)$ (Gaussian conditional density estimation). Classification via probabilistic inference: $P(y \mid x) = \frac{P(y)\, N(x \mid \mu_y, \Sigma_y)}{\sum_{y'} P(y')\, N(x \mid \mu_{y'}, \Sigma_{y'})}$. Which class $y$ best explains the observed $x$? Challenge: $P(x \mid y)$ is usually very complex and difficult to model.

Outline
- Definition and Motivations
- Density Estimation
  - Parametric Density Estimation
  - Mixture Models
  - Kernel Density Estimation
  - Neural Density Estimation
- Anomaly Detection
  - Distance-based methods
  - Isolation Forest and LODA
  - RPAD theory

Parametric Density Estimation. Assume $P(x) = \text{Normal}(x \mid \mu, \Sigma)$, the multivariate Gaussian distribution: $P(x) = \frac{1}{\sqrt{(2\pi)^d \det \Sigma}} \exp\left(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right)$. Fit by computing moments: $\hat{\mu} = \frac{1}{N}\sum_i x_i$ and $\hat{\Sigma} = \frac{1}{N}\sum_i (x_i - \hat{\mu})(x_i - \hat{\mu})^\top$.

Example. Sample 100 points from a multivariate Gaussian with $\mu = (2, 2)$ and $\Sigma = \begin{pmatrix} 1 & 1.5 \\ 1.5 & 4 \end{pmatrix}$. Estimates: $\hat{\mu} = (1.968731, 1.894511)$ and $\hat{\Sigma} = \begin{pmatrix} 1.081423 & 1.462467 \\ 1.462467 & 4.000821 \end{pmatrix}$.
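
A sketch reproducing this example with the moment formulas from the previous slide (the estimates will differ slightly from the numbers above because the random sample differs):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([2.0, 2.0])
Sigma = np.array([[1.0, 1.5],
                  [1.5, 4.0]])
X = rng.multivariate_normal(mu, Sigma, size=100)

mu_hat = X.mean(axis=0)               # (1/N) sum_i x_i
diff = X - mu_hat
Sigma_hat = diff.T @ diff / len(X)    # (1/N) sum_i (x_i - mu_hat)(x_i - mu_hat)^T
print(mu_hat)
print(Sigma_hat)
```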

Mixture Models. $P(x) = \sum_c P(c)\, P(x \mid c)$, where $c$ indexes the mixture component. Each mixture component has its own conditional density $P(x \mid c)$.

Mixture Models Example. $P(c = 1) = 2/3$ and $P(c = 2) = 1/3$. [Figure: samples from $P(x \mid c = 1)$ plotted as 'o' and samples from $P(x \mid c = 2)$ plotted as '+'.]

Fitting Mixture Models: Expectation Maximization. Choose the number of mixture components $K$. Create a matrix $R$ of dimension $K \times N$ that will represent $P(c_i = k \mid x_i) = R[k, i]$. This is called the membership probability: the probability that data point $i$ was generated by component $k$. Randomly initialize $R[k, i] \in (0, 1)$ subject to the constraint that $\sum_k R[k, i] = 1$.

EM Algorithm Main Loop. M step: maximum likelihood estimation of the model parameters using the data $x_1, \ldots, x_N$ and $R$: $P(c = k) \leftarrow \frac{1}{N}\sum_i P(c_i = k \mid x_i)$; $\mu_k \leftarrow \frac{1}{N\, P(c = k)} \sum_i P(c_i = k \mid x_i)\, x_i$; $\Sigma_k \leftarrow \frac{1}{N\, P(c = k)} \sum_i P(c_i = k \mid x_i)\, (x_i - \mu_k)(x_i - \mu_k)^\top$. E step: re-estimate $R$: $R[k, i] \leftarrow P(c = k)\, \text{Normal}(x_i \mid \mu_k, \Sigma_k)$, then normalize $R[k, i] \leftarrow R[k, i] / \sum_{k'} R[k', i]$.
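
A compact EM sketch following these update equations (the number of components, the random initialization scheme, and the fixed iteration count are illustrative choices, not part of the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K=2, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Random membership probabilities R[k, i], columns normalized to sum to 1.
    R = rng.random((K, N))
    R /= R.sum(axis=0, keepdims=True)
    for _ in range(n_iter):
        # M step: mixing proportions, means, covariances from R.
        Nk = R.sum(axis=1)                  # N * P(c = k)
        pi = Nk / N
        mu = (R @ X) / Nk[:, None]
        Sigma = np.empty((K, d, d))
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (R[k, :, None] * diff).T @ diff / Nk[k]
        # E step: R[k, i] proportional to P(c = k) * Normal(x_i; mu_k, Sigma_k).
        for k in range(K):
            R[k] = pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
        R /= R.sum(axis=0, keepdims=True)
    return pi, mu, Sigma

# Example: fit a 2-component mixture to samples from two Gaussians.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([2, 2], 1.0, size=(200, 2)),
               rng.normal([-1, -1], 1.0, size=(100, 2))])
pi, mu, Sigma = em_gmm(X)
print(pi, mu, sep="\n")
```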

Results on Our Example [R mclust package]. Estimates: mixing proportions (0.649, 0.351); means (2.014, 1.967) and (-0.631, -0.957). True values: mixing proportions (0.667, 0.333); means (2.000, 2.000) and (-1.000, -1.000).

Mixture Model Summary. Mixture models can fit very complex distributions. EM is a local search algorithm: different random initializations can find different fitted models. There are some newer moment-based methods that find the global optimum. It is difficult to choose the number of mixture components; there are many heuristics for this.

Kernel Density Estimation. Define a mixture model with one mixture component for each data point: $P(x) = \frac{1}{N}\sum_{i=1}^{N} K(x - x_i; \sigma^2)$. A Gaussian kernel is often used: $K(x; \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{x^2}{2\sigma^2}\right)$, usually with a fixed scale $\sigma^2$. The scale is also called the bandwidth.
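
A minimal 1-D KDE sketch with a Gaussian kernel of fixed bandwidth (the data and bandwidth are illustrative):

```python
import numpy as np

def kde_gaussian(x_query, X, sigma=0.3):
    # P(x) = (1/N) * sum_i K(x - x_i; sigma^2)
    z = (x_query[:, None] - X[None, :]) / sigma
    K = np.exp(-0.5 * z ** 2) / (np.sqrt(2 * np.pi) * sigma)
    return K.mean(axis=1)

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 0.5, 100), rng.normal(1, 1.0, 200)])
grid = np.linspace(-5, 5, 200)
density = kde_gaussian(grid, X)
print(density.sum() * (grid[1] - grid[0]))   # close to 1, as a sanity check
```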

One-Dimensional Example. [Figure]

Design Decisions. Choice of kernel: generally not important as long as it is local. Choice of bandwidth: very important.

Challenges. KDE in high dimensions suffers from the curse of dimensionality: the amount of data required to achieve a desired level of accuracy scales exponentially with the dimensionality $d$ of the problem, roughly as $\exp\left(\frac{d+4}{2}\right)$.

Factoring the Joint Density into Conditional Distributions. Let $x = (x_1, \ldots, x_d)$ be a vector of random variables. We wish to model the joint distribution $P(x)$. By the chain rule of probability, we can write this as $P(x) = P(x_1)\, P(x_2 \mid x_1)\, P(x_3 \mid x_1, x_2) \cdots P(x_d \mid x_1, \ldots, x_{d-1})$. We can model $P(x_1)$ using any 1-D method (e.g., KDE), and we can model each conditional distribution using regression methods.

Linear Regression = Conditional Density Estimation. Let $x = (x_1, \ldots, x_J)$ be the predictor variables and $y$ the response variable. Standard least-squares linear regression models the conditional probability distribution as $P(y \mid x) = \text{Normal}(y; \mu(x), \sigma^2)$, where $\mu(x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_J x_J$ and $\sigma^2$ is a fitted constant. Neural networks trained to minimize squared error model the mean as $\mu(x) = \text{NeuralNet}(x; W)$, where $W$ are the weights of the network.

Deep Neural Networks for Density Estimation: Masked Autoregressive Flow (Papamakarios et al., 2017). Apply the chain rule of probability: $P(x_j \mid x_{1:j-1}) = \text{Normal}(x_j; \mu_j, \exp(\alpha_j)^2)$, where $\mu_j = f_j(x_{1:j-1})$ and $\alpha_j = g_j(x_{1:j-1})$ are implemented by neural networks. Re-parameterization trick for sampling $x_j \sim P(x_j \mid x_{1:j-1})$: let $u_j \sim \text{Normal}(0, 1)$ and set $x_j = u_j \exp(\alpha_j) + \mu_j$ (rescale by the standard deviation and displace by the mean).

Transformation View. Equivalent model: $u \sim \text{Normal}(u; 0, I)$ is a vector of $J$ standard normal random variates, and $x = F(u)$ transforms those random variates into the observed data. To compute $P(x)$ we can invert this function and evaluate $\text{Normal}(F^{-1}(x); 0, I)$ — almost: $P(x) = \text{Normal}(F^{-1}(x); 0, I)\, \left|\det \nabla F^{-1}(x)\right|$. The Jacobian determinant ensures that $P$ integrates to 1.

Derivative of $F^{-1}$ is Simple. Because of the triangular structure created by the chain rule, $\left|\det \nabla F^{-1}(x)\right| = \exp\left(-\sum_j \alpha_j\right)$.

$F$ is Easy to Invert. $u_j = (x_j - \mu_j)\exp(-\alpha_j)$, where $\mu_j = f_j(x_{1:j-1})$ and $\alpha_j = g_j(x_{1:j-1})$.
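
A sketch of this autoregressive affine transformation and its inverse, with simple strictly lower-triangular linear maps standing in for the neural networks $f_j$ and $g_j$ (an illustrative assumption; MAF implements them with masked networks):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
# Toy "networks": mu_j and alpha_j are linear in x_{1:j-1} (strictly lower-triangular weights).
W_mu = np.tril(rng.normal(size=(d, d)), k=-1)
W_alpha = 0.1 * np.tril(rng.normal(size=(d, d)), k=-1)

def forward(u):
    # x_j = u_j * exp(alpha_j) + mu_j, computed sequentially for j = 1..d
    x = np.zeros(d)
    for j in range(d):
        mu_j = W_mu[j] @ x          # depends only on x_{1:j-1}, already filled in
        alpha_j = W_alpha[j] @ x
        x[j] = u[j] * np.exp(alpha_j) + mu_j
    return x

def inverse(x):
    # u_j = (x_j - mu_j) * exp(-alpha_j); no sequential loop is needed because
    # mu_j and alpha_j depend only on the observed x_{1:j-1}.
    mu = W_mu @ x
    alpha = W_alpha @ x
    u = (x - mu) * np.exp(-alpha)
    log_det = -alpha.sum()          # log |det dF^{-1}/dx| = -sum_j alpha_j
    return u, log_det

u = rng.normal(size=d)
x = forward(u)
u_back, log_det = inverse(x)
print(np.allclose(u, u_back))       # True: F is easy to invert
```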

Masked Autoregressive Flow. [Figure: a masked network maps $x$ to $f$ and $g$.] The mask ensures that $f_j$ and $g_j$ depend only on $x_{1:j-1}$.

Stacking MAFs. One MAF network is often not sufficient. [Figure: true density, fitted density from a single MAF network, and the distribution of the $u$ values.]

Stack MAFs until the $u$ values are Normal(0, I). [Figure: true density, fitted density from a stack of 5 MAFs, and the distribution of the $u$ values.]

Test Set Log Likelihood. [Figure]

Outline
- Definition and Motivations
- Density Estimation
  - Parametric Density Estimation
  - Mixture Models
  - Kernel Density Estimation
  - Neural Density Estimation
- Anomaly Detection
  - Distance-based methods
  - Isolation Forest and LODA
  - RPAD theory

Part 2: Other Anomaly Detection Approaches. Do we need accurate density estimates to do anomaly detection? Maybe not: we only need to rank points by $\log \hat{P}(x)$ and detect the low-probability outliers.

Distance-Based Anomaly Detection. Assume a distance metric $\|x - y\|$ between any two data points $x$ and $y$. Anomaly score: $A(x) =$ distance from $x$ to its $k$-th nearest data point. Points in empty regions of the input space are likely to be anomalies. There are many refinements of this basic idea; learned distance metrics can be helpful.
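
A sketch of the $k$-th-nearest-neighbor score using scikit-learn (the value of $k$ and the toy data are illustrative):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(500, 2))
X_query = np.array([[0.0, 0.0], [4.0, 4.0]])   # one nominal-looking point, one outlier

k = 10
nn = NearestNeighbors(n_neighbors=k).fit(X_train)
dist, _ = nn.kneighbors(X_query)
A = dist[:, -1]                                # distance to the k-th nearest neighbor
print(A)                                       # the point at (4, 4) scores much higher
```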

Isolation Forest [Liu, Ting, Zhou, 2008]. Construct a fully random binary tree: choose an attribute $j$ at random, choose a splitting threshold $\theta$ uniformly from $[\min x_j, \max x_j]$, and recurse until every data point is in its own leaf. Let $d(x_i)$ be the depth of point $x_i$. Repeat 100 times and let $\bar{d}(x_i)$ be the average depth of $x_i$. Anomaly score: $A(x_i) = 2^{-\bar{d}(x_i)/r(x_i)}$, where $r(x_i)$ is the expected depth. [Figure: an isolation tree with splits of the form $x_j > \theta$.]
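
For practical use, scikit-learn provides an IsolationForest implementation; a small usage sketch (note that score_samples returns the negative of the paper's anomaly score, so lower values are more anomalous):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(1000, 2))
X_query = np.array([[0.0, 0.0], [5.0, 5.0]])

iforest = IsolationForest(n_estimators=100, random_state=0).fit(X_train)
print(iforest.score_samples(X_query))   # the outlier at (5, 5) gets a lower score
print(iforest.predict(X_query))         # +1 = inlier, -1 = outlier
```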

LODA: Lightweight Online Detector of Anomalies [Pevny, 2016]. $\Pi_1, \ldots, \Pi_M$ is a set of $M$ sparse random projections, and $f_1, \ldots, f_M$ are the corresponding 1-dimensional density estimators. The anomaly score is the average surprise across the projections: $S(x) = \frac{1}{M}\sum_m -\log f_m(\Pi_m x)$. [Figures: a 2-D data set, one random projection with its 1-D histogram density estimator, and a second projection with its density estimator.]
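
A minimal LODA-style sketch: sparse random projections, one smoothed histogram density estimator per projection, and the average surprise as the score (the sparsity level, bin count, and smoothing are illustrative choices, not Pevny's exact construction):

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_loda(X, M=50, n_bins=30):
    N, d = X.shape
    k = max(1, int(np.sqrt(d)))              # nonzero entries per sparse projection
    projections, histograms = [], []
    for _ in range(M):
        w = np.zeros(d)
        idx = rng.choice(d, size=k, replace=False)
        w[idx] = rng.normal(size=k)
        z = X @ w
        counts, edges = np.histogram(z, bins=n_bins)
        # Laplace-smoothed bin densities (probability mass / bin width).
        density = (counts + 1) / ((counts + 1).sum() * np.diff(edges))
        projections.append(w)
        histograms.append((edges, density))
    return projections, histograms

def loda_score(x, projections, histograms):
    # S(x) = (1/M) * sum_m -log f_m(w_m . x)
    s = 0.0
    for w, (edges, density) in zip(projections, histograms):
        z = x @ w
        b = np.clip(np.searchsorted(edges, z) - 1, 0, len(density) - 1)
        s += -np.log(density[b])
    return s / len(projections)

X = rng.normal(0.0, 1.0, size=(2000, 10))
proj, hists = fit_loda(X)
print(loda_score(np.zeros(10), proj, hists))      # low surprise for a typical point
print(loda_score(np.full(10, 5.0), proj, hists))  # typically much higher for an outlier
```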

Benchmarking Study [Andrew Emmott]. Most anomaly detection papers evaluate on only a few datasets, often proprietary or very easy (e.g., KDD 1999). The research community needs a large and growing collection of public anomaly detection benchmarks. [Emmott, Das, Dietterich, Fern, Wong, 2013; KDD ODD-2013] [Emmott, Das, Dietterich, Fern, Wong, 2016; arXiv 1503.01158v2]

Benchmarking Methodology. Select 19 data sets from the UC Irvine repository. Choose one or more classes to be "anomalies"; the rest are nominals. Manipulate relative frequency, point difficulty, irrelevant features, and clusteredness, with 20 replicates of each configuration. Result: 25,685 benchmark datasets.

Algorithms
- Density-based approaches: RKDE: Robust Kernel Density Estimation (Kim & Scott, 2008); EGMM: Ensemble Gaussian Mixture Model (our group)
- Quantile-based methods: OCSVM: One-Class SVM (Schoelkopf et al., 1999); SVDD: Support Vector Data Description (Tax & Duin, 2004)
- Neighbor-based methods: LOF: Local Outlier Factor (Breunig et al., 2000); ABOD: kNN Angle-Based Outlier Detector (Kriegel et al., 2008)
- Projection-based methods: IFOR: Isolation Forest (Liu et al., 2008); LODA: Lightweight Online Detector of Anomalies (Pevny, 2016)

Analysis. Linear ANOVA: metric ~ rf + pd + cl + ir + mset + algo, where rf = relative frequency, pd = point difficulty, cl = normalized clusteredness, ir = irrelevant features, mset = mother set, and algo = anomaly detection algorithm. Validate the effect of each factor, and assess the algo effect while controlling for all other factors.

Algorithm Comparison. [Figure: change in metric (logit(AUC) and log(lift)) relative to the control dataset, by algorithm: iforest, egmm, lof, rkde, abod, loda, ocsvm, svdd.]

How Much Training Data is Needed? We looked at learning curves for anomaly detection. On most benchmarks, we only need a few thousand points to get good detection rates. That is certainly better than $\exp\left(\frac{d+4}{2}\right)$. What is going on?

Rare Pattern Anomaly Detection [Siddiqui et al.; UAI 2016]. A pattern $h: \mathbb{R}^d \to \{0, 1\}$ is an indicator function for a measurable region in the input space. Examples: half planes; axis-parallel hyper-rectangles in $[-1, 1]^d$. A pattern space $H$ is a set of patterns (countable or uncountable).

Rare and Common Patterns. Let $\mu$ be a fixed measure over $\mathbb{R}^d$; typical choices are the uniform measure over $[-1, +1]^d$ and the standard Gaussian over $\mathbb{R}^d$. $\mu(h)$ is the measure of the pattern defined by $h$. Let $p$ be the nominal probability density defined on $\mathbb{R}^d$ (or on some subset); $p(h)$ is the probability of pattern $h$. A pattern $h$ is $\tau$-rare if $f(h) = \frac{p(h)}{\mu(h)} \le \tau$; otherwise it is $\tau$-common.

Rare and Common Points. A point $x$ is $\tau$-rare if there exists a $\tau$-rare pattern $h$ such that $h(x) = 1$; otherwise the point is $\tau$-common. Goal: an anomaly detection algorithm should output all $\tau$-rare points and not output any $\tau$-common points.

PAC-RPAD. An algorithm $A$ is PAC-RPAD for pattern space $H$, measure $\mu$, and parameters $\tau, \epsilon, \delta$ if, for any probability density $p$ and any $\tau$, $A$ draws a sample from $p$ and, with probability $1 - \delta$, detects all $\tau$-rare points and rejects all $(\tau + \epsilon)$-common points in the sample. The $\epsilon$ allows the algorithm some margin for error: if a point is between $\tau$-rare and $(\tau + \epsilon)$-common, the algorithm can treat it arbitrarily. The running time must be polynomial in $\frac{1}{\epsilon}$, $\frac{1}{\delta}$, $\frac{1}{\tau}$, and some measure of the complexity of $H$.

RAREPATTERNDETECT. Draw a sample of size $N(\epsilon, \delta)$ from $p$. Let $\hat{p}(h)$ be the fraction of sample points that satisfy $h$, and let $\hat{f}(h) = \frac{\hat{p}(h)}{\mu(h)}$ be the estimated rareness of $h$. A query point $x_q$ is declared to be an anomaly if there exists a pattern $h \in H$ such that $h(x_q) = 1$ and $\hat{f}(h) \le \tau$.

Results. Theorem 1: For any finite pattern space $H$, RAREPATTERNDETECT is PAC-RPAD with sample complexity $N(\epsilon, \delta) = O\left(\frac{1}{\epsilon^2}\left(\log |H| + \log \frac{1}{\delta}\right)\right)$. Theorem 2: For any pattern space $H$ with finite VC dimension $V_H$, RAREPATTERNDETECT is PAC-RPAD with sample complexity $N(\epsilon, \delta) = O\left(\frac{1}{\epsilon^2}\left(V_H \log \frac{1}{\epsilon^2} + \log \frac{1}{\delta}\right)\right)$.

Intuition: Surround the data with rare patterns. If a query point hits one of the rare patterns, declare it to be an anomaly. We only estimate the probability of each pattern, not the full density.

Isolation RPAD. Grow an isolation forest, but grow each tree only to depth $k$. Each leaf defines a pattern $h$, and $\mu$ is its volume (Lebesgue measure). Compute $\hat{f}(h)$ for each leaf. Details: grow the tree using one sample, estimate $\hat{f}$ using a second sample, then score the query point(s). [Figure: a depth-limited tree with splits such as $x_1 < 0.5$, $x_1 < 0.2$, $x_2 < 0.6$ and leaves $h_1, \ldots, h_4$.]
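
A rough sketch of this idea: depth-limited random trees grown on one sample, leaf rareness $\hat{f}(h) = \hat{p}(h)/\mu(h)$ estimated on a second sample, and a query scored by the smallest $\hat{f}$ over the leaves containing it (the data, depth, tree count, and bounding box are illustrative choices; this is not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def grow_tree(lo, hi, X, depth):
    # A node is either a leaf (its bounding box) or a random axis-parallel split.
    if depth == 0 or len(X) <= 1:
        return ("leaf", lo.copy(), hi.copy())
    j = rng.integers(len(lo))
    theta = rng.uniform(lo[j], hi[j])
    left_hi, right_lo = hi.copy(), lo.copy()
    left_hi[j], right_lo[j] = theta, theta
    mask = X[:, j] < theta
    return ("split", j, theta,
            grow_tree(lo, left_hi, X[mask], depth - 1),
            grow_tree(right_lo, hi, X[~mask], depth - 1))

def find_leaf(node, x):
    while node[0] == "split":
        _, j, theta, left, right = node
        node = left if x[j] < theta else right
    return node

def leaf_rareness(leaf, X2, total_volume):
    lo, hi = leaf[1], leaf[2]
    inside = np.all((X2 >= lo) & (X2 < hi), axis=1)
    p_hat = inside.mean()                    # fraction of second sample in the leaf
    mu = np.prod(hi - lo) / total_volume     # normalized Lebesgue measure of the leaf
    return p_hat / mu

# Two samples from the nominal distribution, inside a fixed bounding box.
X1 = rng.normal(0, 1, size=(2000, 2))
X2 = rng.normal(0, 1, size=(2000, 2))
lo, hi = np.full(2, -6.0), np.full(2, 6.0)
total_volume = np.prod(hi - lo)

trees = [grow_tree(lo, hi, X1, depth=4) for _ in range(25)]

def score(x):
    # Smaller f-hat over the patterns containing x means more anomalous.
    return min(leaf_rareness(find_leaf(t, x), X2, total_volume) for t in trees)

print(score(np.array([0.0, 0.0])))   # typical point: relatively large f-hat
print(score(np.array([5.0, 5.0])))   # outlying point: f-hat near zero
```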

Results: Covertype. PatternMin is slower, but eventually beats IFOREST. [Figure]

Results: Particle. IFOREST is much better. [Figure]

Results: Shuttle. [Figure: AUC vs. sample size (16 to 16,384) for Isolation Forest and RPAD on Shuttle, at depths k = 1, 4, 7, 10.] PatternMin consistently beats iforest for $k > 1$.

RPAD Conclusions. The PAC-RPAD theory seems to capture the behavior of algorithms such as IFOREST. It is easy to design practical RPAD algorithms. The theory requires extension to handle sample-dependent pattern spaces $H$.

Summary
- Density Estimation: Parametric Density Estimation, Mixture Models, Kernel Density Estimation, Neural Density Estimation
- Anomaly Detection: Distance-based methods, Isolation Forest and LODA, RPAD theory

Bibliography
Silverman, B. W. (2018). Density Estimation for Statistics and Data Analysis. Routledge. [updated version of the classic book]
Sheather, S. J. (2004). Density estimation. Statistical Science, 19(4), 588-597. [survey on kernel density estimation]
Hastie, T., Tibshirani, R., & Friedman, J. (2011). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd edition). New York: Springer Verlag. [good textbook descriptions]
Papamakarios, G., Pavlakou, T., & Murray, I. (2017). Masked autoregressive flow for density estimation. In NIPS 2017. http://arxiv.org/abs/1705.07057
Liu, F. T., Ting, K. M., & Zhou, Z.-H. (2008). Isolation Forest. In 2008 Eighth IEEE International Conference on Data Mining (pp. 413-422). IEEE. http://doi.org/10.1109/ICDM.2008.17
Pevny, T. (2015). Loda: Lightweight on-line detector of anomalies. Machine Learning. http://doi.org/10.1007/s10994-015-5521-0
Emmott, A., Das, S., Dietterich, T., Fern, A., & Wong, W.-K. (2015). Systematic construction of anomaly detection benchmarks from real data. http://arxiv.org/abs/1503.01158v2
Siddiqui, A., Fern, A., Dietterich, T. G., & Das, S. (2016). Finite sample complexity of rare pattern anomaly detection. In Proceedings of UAI 2016.