Anomaly Detection. Lecture Notes for Chapter 9. Introduction to Data Mining, 2nd Edition by Tan, Steinbach, Karpatne, Kumar


2/14/18, Introduction to Data Mining, 2nd Edition

Anomaly/Outlier Detection
- What are anomalies/outliers?
  - The set of data points that are considerably different than the remainder of the data
- Natural implication is that anomalies are relatively rare
  - One in a thousand occurs often if you have lots of data
  - Context is important, e.g., freezing temps in July
- Can be important or a nuisance
  - 10 foot tall 2 year old
  - Unusually high blood pressure

Importance of Anomaly Detection
Ozone Depletion History
- In 1985 three researchers (Farman, Gardiner and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels
- Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations?
- The ozone concentrations recorded by the satellite were so low they were being treated as outliers by a computer program and discarded!
Sources:
http://exploringdata.cqu.edu.au/ozone.html
http://www.epa.gov/ozone/science/hole/size.html

Causes of Anomalies
- Data from different classes
  - Measuring the weights of oranges, but a few grapefruit are mixed in
- Natural variation
  - Unusually tall people
- Data errors
  - 200 pound 2 year old

Distinction Between Noise and Anomalies
- Noise is erroneous, perhaps random, values or contaminating objects
  - Weight recorded incorrectly
  - Grapefruit mixed in with the oranges
- Noise doesn't necessarily produce unusual values or objects
- Noise is not interesting
- Anomalies may be interesting if they are not a result of noise
- Noise and anomalies are related but distinct concepts

General Issues: Number of Attributes
- Many anomalies are defined in terms of a single attribute
  - Height
  - Shape
  - Color
- Can be hard to find an anomaly using all attributes
  - Noisy or irrelevant attributes
  - Object is only anomalous with respect to some attributes
- However, an object may not be anomalous in any one attribute

General Issues: Anomaly Scoring
- Many anomaly detection techniques provide only a binary categorization
  - An object is an anomaly or it isn't
  - This is especially true of classification-based approaches
- Other approaches assign a score to all points
  - This score measures the degree to which an object is an anomaly
  - This allows objects to be ranked
- In the end, you often need a binary decision
  - Should this credit card transaction be flagged?
  - Still useful to have a score
- How many anomalies are there?

Other Issues for Anomaly Detection
- Find all anomalies at once or one at a time
  - Swamping
  - Masking
- Evaluation
  - How do you measure performance?
  - Supervised vs. unsupervised situations
- Efficiency
- Context
  - Professional basketball team

Variants of Anomaly Detection Problems
- Given a data set D, find all data points x in D with anomaly scores greater than some threshold t
- Given a data set D, find all data points x in D having the top-n largest anomaly scores
- Given a data set D, containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D

Model-Based Anomaly Detection
Build a model for the data and see
- Unsupervised
  - Anomalies are those points that don't fit well
  - Anomalies are those points that distort the model
  - Examples: statistical distribution, clusters, regression, geometric, graph
- Supervised
  - Anomalies are regarded as a rare class
  - Need to have training data
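Returning to the first two problem variants above (score threshold and top-n), the following is a minimal sketch of how each selection could be expressed once anomaly scores are available. The function names and the score values are hypothetical and chosen only for illustration; how the scores are produced is left to whichever detection method is in use.

```python
def above_threshold(scores, t):
    """Return indices of points whose anomaly score exceeds threshold t."""
    return [i for i, s in enumerate(scores) if s > t]


def top_n(scores, n):
    """Return indices of the n points with the largest anomaly scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]


# Hypothetical anomaly scores for six data points.
scores = [0.2, 3.1, 0.4, 0.1, 2.5, 0.3]
flagged = above_threshold(scores, t=1.0)  # variant 1: fixed threshold
ranked = top_n(scores, n=1)               # variant 2: top-n ranking
```

The threshold variant needs a meaningful scale for the scores, while the top-n variant only needs their ordering, which is one reason a score is still useful even when a binary decision is required in the end.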

Additional Anomaly Detection Techniques
- Proximity-based
  - Anomalies are points far away from other points
  - Can detect this graphically in some cases
- Density-based
  - Low density points are outliers
- Pattern matching
  - Create profiles or templates of atypical but important events or objects
  - Algorithms to detect these patterns are usually simple and efficient

Visual Approaches
- Boxplots or scatter plots
- Limitations
  - Not automatic
  - Subjective

Statistical Approaches
Probabilistic definition of an outlier: An outlier is an object that has a low probability with respect to a probability distribution model of the data.
- Usually assume a parametric model describing the distribution of the data (e.g., normal distribution)
- Apply a statistical test that depends on
  - Data distribution
  - Parameters of distribution (e.g., mean, variance)
  - Number of expected outliers (confidence limit)
- Issues
  - Identifying the distribution of a data set (e.g., heavy tailed distribution)
  - Number of attributes
  - Is the data a mixture of distributions?

Normal Distributions
[Figure: one-dimensional Gaussian and two-dimensional Gaussian; axes x and y; color scale: probability density]

Grubbs' Test
- Detects outliers in univariate data
- Assumes data comes from a normal distribution
- Detects one outlier at a time; remove the outlier and repeat
  - H_0: There is no outlier in the data
  - H_A: There is at least one outlier
- Grubbs' test statistic:
  \[ G = \frac{\max_i |X_i - \bar{X}|}{s} \]
- Reject H_0 if:
  \[ G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{(\alpha/N,\,N-2)}}{N - 2 + t^2_{(\alpha/N,\,N-2)}}} \]

Statistical-based Likelihood Approach
- Assume the data set D contains samples from a mixture of two probability distributions:
  - M (majority distribution)
  - A (anomalous distribution)
- General approach:
  - Initially, assume all the data points belong to M
  - Let LL_t(D) be the log likelihood of D at time t
  - For each point x_t that belongs to M, move it to A
    - Let LL_{t+1}(D) be the new log likelihood
    - Compute the difference, \Delta = LL_t(D) - LL_{t+1}(D)
    - If \Delta > c (some threshold), then x_t is declared an anomaly and moved permanently from M to A
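For the Grubbs' test above, a minimal sketch of the two computations involved, the statistic G and the rejection threshold, might look as follows. The critical value t_{(\alpha/N, N-2)} is passed in as the hypothetical parameter `t_crit` because the Python standard library has no Student's t quantile function; in practice it would come from a t table or a statistics package. The data values are invented for illustration.

```python
import math
import statistics


def grubbs_statistic(values):
    """Return (G, index): G = max |x - mean| / s and the index of that point."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation
    idx = max(range(len(values)), key=lambda i: abs(values[i] - mean))
    return abs(values[idx] - mean) / s, idx


def grubbs_critical_value(n, t_crit):
    """Rejection threshold for G given t_crit = t_(alpha/N, N-2).

    t_crit is supplied by the caller (assumption of this sketch).
    """
    return (n - 1) / math.sqrt(n) * math.sqrt(t_crit**2 / (n - 2 + t_crit**2))


# Hypothetical univariate measurements with one suspicious value.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 15.0]
g, idx = grubbs_statistic(data)
# H_0 is rejected when g > grubbs_critical_value(len(data), t_crit);
# the flagged point data[idx] is then removed and the test repeated.
```

Because the test removes one point per round, the mean and standard deviation are recomputed on the reduced data before each repetition.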

Statistical-based Likelihood Approach
- Data distribution, D = (1 - \lambda) M + \lambda A
- M is a probability distribution estimated from data
  - Can be based on any modeling method (naïve Bayes, maximum entropy, etc.)
- A is initially assumed to be a uniform distribution
- Likelihood at time t:
  \[ L_t(D) = \prod_{i=1}^{N} P_D(x_i) = \Big( (1-\lambda)^{|M_t|} \prod_{x_i \in M_t} P_{M_t}(x_i) \Big) \Big( \lambda^{|A_t|} \prod_{x_i \in A_t} P_{A_t}(x_i) \Big) \]
  \[ LL_t(D) = |M_t| \log(1-\lambda) + \sum_{x_i \in M_t} \log P_{M_t}(x_i) + |A_t| \log \lambda + \sum_{x_i \in A_t} \log P_{A_t}(x_i) \]

Strengths/Weaknesses of Statistical Approaches
- Firm mathematical foundation
- Can be very efficient
- Good results if distribution is known
- In many cases, data distribution may not be known
- For high dimensional data, it may be difficult to estimate the true distribution
- Anomalies can distort the parameters of the distribution
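The likelihood approach can be sketched for one-dimensional data as below. Several choices here are assumptions made only for illustration and are not taken from the slides: M is modeled as a Gaussian refitted to the points currently in M, A is a uniform distribution over the observed data range, and a point is kept in A when moving it raises the log likelihood by more than c (the sign convention of the threshold test is treated as an assumption). The names `log_likelihood` and `detect` and the values of `lam` and `c` are hypothetical.

```python
import math
import statistics


def log_likelihood(m_pts, a_pts, lam, lo, hi):
    """LL(D) = |M| log(1-lam) + sum log P_M(x) + |A| log(lam) + sum log P_A(x)."""
    mu = statistics.mean(m_pts)
    var = statistics.variance(m_pts)  # Gaussian model for M (assumption)
    ll_m = len(m_pts) * math.log(1 - lam)
    for x in m_pts:
        ll_m += -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
    # A is uniform on [lo, hi] (assumption), so log P_A(x) is constant.
    ll_a = len(a_pts) * (math.log(lam) + math.log(1.0 / (hi - lo)))
    return ll_m + ll_a


def detect(points, lam=0.05, c=1.0):
    """Move points from M to A one at a time when the log likelihood gain exceeds c."""
    lo, hi = min(points), max(points)
    m, a = list(points), []
    for x in points:
        if len(m) <= 2:  # keep enough points in M to fit a variance
            break
        current = log_likelihood(m, a, lam, lo, hi)
        m_new = list(m)
        m_new.remove(x)
        moved = log_likelihood(m_new, a + [x], lam, lo, hi)
        if moved - current > c:
            m, a = m_new, a + [x]
    return a


data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 15.0]  # hypothetical values
anomalies = detect(data)
```

Moving a typical point to A costs likelihood, since the uniform density weighted by lam is lower than the Gaussian density near the mean, whereas moving an extreme point lets the Gaussian tighten around the remaining data.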

Distance-Based Approaches
- Several different techniques
- An object is an outlier if a specified fraction of the objects is more than a specified distance away (Knorr, Ng 1998)
  - Some statistical definitions are special cases of this
- The outlier score of an object is the distance to its kth nearest neighbor

One Nearest Neighbor - One Outlier
[Figure: outlier scores based on the distance to the nearest neighbor, one outlier; color scale: Outlier Score]
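The kth-nearest-neighbor score defined above can be sketched with a brute-force computation. The function name and the sample points are hypothetical; Euclidean distance is assumed, and every pairwise distance is computed, which matches the O(n^2) cost usually associated with this approach.

```python
import math


def knn_outlier_scores(points, k):
    """Score each point by the Euclidean distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # Distances to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


# Four points on a unit square plus one point far away (hypothetical data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
scores = knn_outlier_scores(points, k=1)
```

With k = 1, two outliers that sit close to each other would each receive a low score, which is why the choice of k matters for this method.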

One Nearest Neighbor - Two Outliers
[Figure: outlier scores based on the distance to the nearest neighbor, two outliers; color scale: Outlier Score]

Five Nearest Neighbors - Small Cluster
[Figure: outlier scores based on the distance to the fifth nearest neighbor, small cluster; color scale: Outlier Score]

Five Nearest Neighbors - Differing Density
[Figure: outlier scores based on the distance to the fifth nearest neighbor, clusters of differing density; color scale: Outlier Score]

Strengths/Weaknesses of Distance-Based Approaches
- Simple
- Expensive: O(n^2)
- Sensitive to parameters
- Sensitive to variations in density
- Distance becomes less meaningful in high-dimensional space

Density-Based Approaches
- Density-based outlier: The outlier score of an object is the inverse of the density around the object.
  - Density can be defined in terms of the k nearest neighbors
  - One definition: inverse of the distance to the kth neighbor
  - Another definition: inverse of the average distance to the k neighbors
  - DBSCAN definition
- If there are regions of different density, this approach can have problems

Relative Density
- Consider the density of a point relative to that of its k nearest neighbors
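The density definitions above can be sketched as follows, using the inverse of the average distance to the k nearest neighbors as the density. The relative score is written here as the average density of a point's neighbors divided by the point's own density, so that larger values mean more anomalous; that orientation, the function names, and the sample data are assumptions of this sketch, and the points are assumed to be distinct so that no distance is zero.

```python
import math


def neighbors(points, i, k):
    """Indices of the k nearest neighbors of points[i] (brute force)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]


def density(points, i, k):
    """Inverse of the average distance from points[i] to its k nearest neighbors."""
    nbrs = neighbors(points, i, k)
    avg = sum(math.dist(points[i], points[j]) for j in nbrs) / k
    return 1.0 / avg


def relative_density_score(points, i, k):
    """Average density of the k nearest neighbors divided by the point's own density."""
    nbrs = neighbors(points, i, k)
    avg_nbr_density = sum(density(points, j, k) for j in nbrs) / k
    return avg_nbr_density / density(points, i, k)


# A tight cluster of four points and one point away from it (hypothetical data).
points = [(0.0, 0.0), (0.0, 0.1), (0.1, 0.0), (0.1, 0.1), (1.0, 1.0)]
inside = relative_density_score(points, 0, k=2)
outside = relative_density_score(points, 4, k=2)
```

A point inside a uniformly dense region scores close to 1 regardless of how dense that region is, which is what lets a relative score cope with regions of different density.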

Relative Density Outlier Scores
[Figure: relative density outlier scores; labeled scores 6.85 (C), 1.40, and 1.33 (A); color scale: Outlier Score]

Density-based: LOF approach
- For each point, compute the density of its local neighborhood
- Compute the local outlier factor (LOF) of a sample p as the average of the ratios of the density of sample p and the density of its nearest neighbors
- Outliers are points with the largest LOF value
[Figure: points p1 and p2]
- In the NN approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 as outliers

Strengths/Weaknesses of Density-Based Approaches
- Simple
- Expensive: O(n^2)
- Sensitive to parameters
- Density becomes less meaningful in high-dimensional space

Clustering-Based Approaches
- Clustering-based outlier: An object is a cluster-based outlier if it does not strongly belong to any cluster
  - For prototype-based clusters, an object is an outlier if it is not close enough to a cluster center
  - For density-based clusters, an object is an outlier if its density is too low
  - For graph-based clusters, an object is an outlier if it is not well connected
- Other issues include the impact of outliers on the clusters and the number of clusters
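For the prototype-based case above, a minimal sketch of scoring points by their distance to the closest cluster center is given below. The centroids are assumed to have been produced already by some clustering algorithm and are hard-coded here as hypothetical values. The relative variant divides each distance by the median distance within the point's cluster; that normalization is an assumption of this sketch, as are the function names, and it presumes the median distance is nonzero.

```python
import math
import statistics


def centroid_scores(points, centroids):
    """Return (closest centroid index, distance to it) for every point."""
    result = []
    for p in points:
        dists = [math.dist(p, c) for c in centroids]
        best = min(range(len(centroids)), key=lambda j: dists[j])
        result.append((best, dists[best]))
    return result


def relative_scores(assignments, n_clusters):
    """Divide each distance by the median distance within the point's cluster."""
    medians = []
    for j in range(n_clusters):
        member = [d for c, d in assignments if c == j]
        medians.append(statistics.median(member) if member else float("nan"))
    return [d / medians[c] for c, d in assignments]


# A compact cluster near (0, 0) and a diffuse one near (10, 10) (hypothetical).
points = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1), (1.0, 0.0),
          (12.0, 10.0), (8.0, 10.0), (10.0, 12.0), (10.0, 8.0), (10.0, 13.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
assignments = centroid_scores(points, centroids)
absolute = [d for _, d in assignments]
relative = relative_scores(assignments, len(centroids))
```

In this data the point (1.0, 0.0) has a smaller absolute distance than (10.0, 13.0) but a much larger relative distance, since it sits far outside a compact cluster.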

Distance of Points from Closest Centroids
[Figure: outlier scores based on distance from the closest centroid; labeled scores 4.6 (C), 0.17, and 1.2 (A); color scale: Outlier Score]

Relative Distance of Points from Closest Centroid
[Figure: outlier scores based on relative distance from the closest centroid; color scale: Outlier Score]

Strengths/Weaknesses of Clustering-Based Approaches
- Simple
- Many clustering techniques can be used
- Can be difficult to decide on a clustering technique
- Can be difficult to decide on the number of clusters
- Outliers can distort the clusters