Anomaly Detection
Lecture Notes for Chapter 9
Introduction to Data Mining, 2nd Edition
by Tan, Steinbach, Karpatne, Kumar
2/14/18 Introduction to Data Mining, 2nd Edition

Anomaly/Outlier Detection
- What are anomalies/outliers?
  - The set of data points that are considerably different from the remainder of the data
- Natural implication is that anomalies are relatively rare
  - One in a thousand occurs often if you have lots of data
  - Context is important, e.g., freezing temps in July
- Can be important or a nuisance
  - 10 foot tall 2 year old
  - Unusually high blood pressure

Importance of Anomaly Detection
Ozone Depletion History
- In 1985 three researchers (Farman, Gardiner and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels
- Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations?
- The ozone concentrations recorded by the satellite were so low they were being treated as outliers by a computer program and discarded!
Sources:
  http://exploringdata.cqu.edu.au/ozone.html
  http://www.epa.gov/ozone/science/hole/size.html

Causes of Anomalies
- Data from different classes
  - Measuring the weights of oranges, but a few grapefruit are mixed in
- Natural variation
  - Unusually tall people
- Data errors
  - 200 pound 2 year old

Distinction Between Noise and Anomalies
- Noise is erroneous, perhaps random, values or contaminating objects
  - Weight recorded incorrectly
  - Grapefruit mixed in with the oranges
- Noise doesn't necessarily produce unusual values or objects
- Noise is not interesting
- Anomalies may be interesting if they are not a result of noise
- Noise and anomalies are related but distinct concepts

General Issues: Number of Attributes
- Many anomalies are defined in terms of a single attribute
  - Height
  - Shape
  - Color
- Can be hard to find an anomaly using all attributes
  - Noisy or irrelevant attributes
  - Object is only anomalous with respect to some attributes
- However, an object may not be anomalous in any one attribute

General Issues: Anomaly Scoring
- Many anomaly detection techniques provide only a binary categorization
  - An object is an anomaly or it isn't
  - This is especially true of classification-based approaches
- Other approaches assign a score to all points
  - This score measures the degree to which an object is an anomaly
  - This allows objects to be ranked
- In the end, you often need a binary decision
  - Should this credit card transaction be flagged?
  - Still useful to have a score
- How many anomalies are there?

Other Issues for Anomaly Detection
- Find all anomalies at once or one at a time
  - Swamping
  - Masking
- Evaluation
  - How do you measure performance?
  - Supervised vs. unsupervised situations
- Efficiency
- Context
  - Professional basketball team

Variants of Anomaly Detection Problems
- Given a data set D, find all data points x ∈ D with anomaly scores greater than some threshold t
- Given a data set D, find all data points x ∈ D having the top-n largest anomaly scores
- Given a data set D, containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D

Model-Based Anomaly Detection
- Build a model for the data and see
  - Unsupervised
    - Anomalies are those points that don't fit well
    - Anomalies are those points that distort the model
    - Examples: statistical distribution, clusters, regression, geometric, graph
  - Supervised
    - Anomalies are regarded as a rare class
    - Need to have training data
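
The first two problem variants listed under "Variants of Anomaly Detection Problems" differ only in how scores are turned into a decision. The following is a minimal sketch, assuming the anomaly scores have already been computed by some method; the function names and the example scores are made up for illustration.

```python
import heapq

def above_threshold(scores, t):
    """Return the indices of points whose anomaly score exceeds the threshold t."""
    return [i for i, s in enumerate(scores) if s > t]

def top_n(scores, n):
    """Return the indices of the n largest anomaly scores, highest score first."""
    return heapq.nlargest(n, range(len(scores)), key=lambda i: scores[i])

# Made-up anomaly scores for six data points (illustration only).
scores = [0.2, 3.1, 0.4, 0.3, 2.5, 0.1]

flagged = above_threshold(scores, t=1.0)   # threshold variant
ranked = top_n(scores, n=2)                # top-n variant
print(flagged, ranked)
```

The threshold variant needs a meaningful cutoff t, while the top-n variant needs an estimate of how many anomalies to expect; both rely on the method producing a score rather than a binary label.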

Additional Anomaly Detection Techniques
- Proximity-based
  - Anomalies are points far away from other points
  - Can detect this graphically in some cases
- Density-based
  - Low density points are outliers
- Pattern matching
  - Create profiles or templates of atypical but important events or objects
  - Algorithms to detect these patterns are usually simple and efficient

Visual Approaches
- Boxplots or scatter plots
- Limitations
  - Not automatic
  - Subjective

Statistical Approaches
- Probabilistic definition of an outlier: An outlier is an object that has a low probability with respect to a probability distribution model of the data.
- Usually assume a parametric model describing the distribution of the data (e.g., normal distribution)
- Apply a statistical test that depends on
  - Data distribution
  - Parameters of distribution (e.g., mean, variance)
  - Number of expected outliers (confidence limit)
- Issues
  - Identifying the distribution of a data set
    - Heavy tailed distribution
  - Number of attributes
  - Is the data a mixture of distributions?

Normal Distributions
[Figure: one-dimensional Gaussian and two-dimensional Gaussian, shaded by probability density]
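
The probabilistic definition above can be illustrated with a one-dimensional Gaussian model. This is a minimal sketch, assuming the data are roughly normal and that a cutoff of three standard deviations is acceptable; the cutoff `z_cut`, the function name and the temperature readings are illustrative assumptions, not part of the notes.

```python
from statistics import NormalDist, mean, stdev

def gaussian_outliers(values, z_cut=3.0):
    """Flag values that are improbable under a normal model fitted to the data.

    Returns (value, z_score, two_sided_tail_probability) for each flagged value.
    z_cut is an assumed cutoff, not a value taken from the notes.
    """
    mu, sigma = mean(values), stdev(values)
    model = NormalDist(mu, sigma)
    flagged = []
    for v in values:
        z = abs(v - mu) / sigma
        # Probability of a value at least this far from the mean, in either direction.
        tail = 2.0 * (1.0 - model.cdf(mu + abs(v - mu)))
        if z > z_cut:
            flagged.append((v, z, tail))
    return flagged

# Made-up temperature readings with one unusual value (illustration only).
temps = [20, 21, 19, 20, 22, 18, 20, 21, 19, 20,
         21, 20, 19, 22, 18, 20, 21, 19, 20, 35]
flagged = gaussian_outliers(temps)
print(flagged)
```

Note that the mean and standard deviation are estimated from data that includes the anomaly, so a large anomaly inflates the standard deviation and can hide itself or others in small samples.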

Grubbs' Test
- Detect outliers in univariate data
- Assume data comes from normal distribution
- Detects one outlier at a time, remove the outlier, and repeat
  - H_0: There is no outlier in data
  - H_A: There is at least one outlier
- Grubbs' test statistic:
    $G = \frac{\max |X - \bar{X}|}{s}$
- Reject H_0 if:
    $G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{(\alpha/N,\,N-2)}}{N-2+t^2_{(\alpha/N,\,N-2)}}}$

Statistical-based Likelihood Approach
- Assume the data set D contains samples from a mixture of two probability distributions:
  - M (majority distribution)
  - A (anomalous distribution)
- General Approach:
  - Initially, assume all the data points belong to M
  - Let $L_t(D)$ be the log likelihood of D at time t
  - For each point $x_t$ that belongs to M, move it to A
    - Let $L_{t+1}(D)$ be the new log likelihood.
    - Compute the difference, $\Delta = L_t(D) - L_{t+1}(D)$
    - If $\Delta > c$ (some threshold), then $x_t$ is declared as an anomaly and moved permanently from M to A
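
A single pass of Grubbs' test, as described on the Grubbs' Test slide, can be sketched as follows. This is a minimal sketch under stated assumptions: the t-distribution critical value is passed in by the caller (here 3.355, the tabulated upper 0.005 critical value for 8 degrees of freedom, which corresponds to α/N with α = 0.05 and N = 10), and the function names and measurements are made up for illustration.

```python
import math
from statistics import mean, stdev

def grubbs_statistic(values):
    """Return (G, index) where G = max |x - mean| / s and index is the most extreme value."""
    m, s = mean(values), stdev(values)
    idx = max(range(len(values)), key=lambda i: abs(values[i] - m))
    return abs(values[idx] - m) / s, idx

def grubbs_critical(n, t_crit):
    """Rejection threshold for G; t_crit is the t critical value with n - 2 degrees of freedom."""
    return (n - 1) / math.sqrt(n) * math.sqrt(t_crit ** 2 / (n - 2 + t_crit ** 2))

# Made-up measurements with one suspicious value (illustration only).
values = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 14.0]

g, idx = grubbs_statistic(values)
# Assumed table value: upper 0.005 critical value of t with 8 degrees of freedom.
g_crit = grubbs_critical(len(values), t_crit=3.355)
is_outlier = g > g_crit
print(idx, round(g, 3), round(g_crit, 3), is_outlier)
```

To follow the "remove the outlier, and repeat" step, the flagged value would be dropped and the test rerun with N reduced by one, which also requires a new critical value for the smaller sample.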

Statistical-based Likelihood Approach
- Data distribution, D = (1 − λ) M + λ A
- M is a probability distribution estimated from data
  - Can be based on any modeling method (naïve Bayes, maximum entropy, etc.)
- A is initially assumed to be uniform distribution
- Likelihood at time t:
    $L_t(D) = \prod_{i=1}^{N} P_D(x_i) = \left( (1-\lambda)^{|M_t|} \prod_{x_i \in M_t} P_{M_t}(x_i) \right) \left( \lambda^{|A_t|} \prod_{x_i \in A_t} P_{A_t}(x_i) \right)$
- Log likelihood at time t:
    $LL_t(D) = |M_t| \log(1-\lambda) + \sum_{x_i \in M_t} \log P_{M_t}(x_i) + |A_t| \log \lambda + \sum_{x_i \in A_t} \log P_{A_t}(x_i)$

Strengths/Weaknesses of Statistical Approaches
- Firm mathematical foundation
- Can be very efficient
- Good results if distribution is known
- In many cases, data distribution may not be known
- For high dimensional data, it may be difficult to estimate the true distribution
- Anomalies can distort the parameters of the distribution
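
The likelihood approach can be sketched for one-dimensional data. This is a minimal sketch under several assumptions that are not in the notes: M is modeled as a Gaussian refitted to the points currently in M, A is uniform over the range of the data, and λ = 0.05 and c = 1.0 are arbitrary choices. The sketch flags a point when moving it to A raises the log likelihood by more than c, which is an assumption about how the difference between successive log likelihoods is meant to be read.

```python
import math
from statistics import NormalDist, mean, stdev

def log_likelihood(m_pts, a_pts, lam, p_uniform):
    """Log likelihood of the mixture (1 - lam) * M + lam * A (M Gaussian, A uniform)."""
    model = NormalDist(mean(m_pts), stdev(m_pts))   # needs at least two points in M
    ll = len(m_pts) * math.log(1 - lam) + sum(math.log(model.pdf(x)) for x in m_pts)
    ll += len(a_pts) * math.log(lam) + len(a_pts) * math.log(p_uniform)
    return ll

def likelihood_anomalies(data, lam=0.05, c=1.0):
    """Move points from M to A one at a time when the move raises the log likelihood by more than c."""
    p_uniform = 1.0 / (max(data) - min(data))   # assumed: A is uniform over the data range
    m_pts, a_pts = list(data), []
    for x in data:
        rest = list(m_pts)
        rest.remove(x)                           # tentatively move x out of M
        before = log_likelihood(m_pts, a_pts, lam, p_uniform)
        after = log_likelihood(rest, a_pts + [x], lam, p_uniform)
        if after - before > c:                   # the move explains the data much better
            m_pts, a_pts = rest, a_pts + [x]     # make the move permanent
    return a_pts

# Made-up measurements with one unusual value (illustration only).
data = [10.0, 10.2, 9.9, 10.1, 9.8, 10.0, 10.3, 9.7, 10.1, 25.0]
anomalies = likelihood_anomalies(data)
print(anomalies)
```

Because M is refitted after each tentative move, the result can depend on the order in which points are visited, and very extreme values may need a more robust density than `NormalDist.pdf` to avoid taking the log of zero.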

Distance-based Approaches
- Several different techniques
- An object is an outlier if a specified fraction of the objects is more than a specified distance away (Knorr, Ng 1998)
  - Some statistical definitions are special cases of this
- The outlier score of an object is the distance to its kth nearest neighbor

One Nearest Neighbor - One Outlier
[Figure: data points shaded by outlier score]
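
The kth-nearest-neighbor score defined under "Distance-based Approaches" can be sketched with a brute-force search. This is a minimal sketch, assuming Euclidean distance and a small data set; the function name and the points are made up for illustration.

```python
import math

def knn_outlier_scores(points, k):
    """Score each point by the distance to its k-th nearest neighbor (brute force, O(n^2))."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Four clustered points plus an isolated pair (illustration only).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5.2, 5.1)]

one_nn = knn_outlier_scores(points, k=1)
two_nn = knn_outlier_scores(points, k=2)
print(one_nn)
print(two_nn)
```

With k = 1 the isolated pair receives low scores because each member has the other as a close neighbor; with k = 2 both receive high scores. This is one way the choice of k changes which objects look anomalous.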

One Nearest Neighbor - Two Outliers
[Figure: data points shaded by outlier score]

Five Nearest Neighbors - Small Cluster
[Figure: data points shaded by outlier score]

Five Nearest Neighbors - Differing Density
[Figure: data points shaded by outlier score]

Strengths/Weaknesses of Distance-based Approaches
- Simple
- Expensive: O(n^2)
- Sensitive to parameters
- Sensitive to variations in density
- Distance becomes less meaningful in high-dimensional space

Density-based Approaches
- Density-based Outlier: The outlier score of an object is the inverse of the density around the object.
  - Can be defined in terms of the k nearest neighbors
  - One definition: Inverse of distance to kth neighbor
  - Another definition: Inverse of the average distance to k neighbors
  - DBSCAN definition
- If there are regions of different density, this approach can have problems

Relative Density
- Consider the density of a point relative to that of its k nearest neighbors
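
The second density definition above (inverse of the average distance to the k nearest neighbors) can be sketched as follows. This is a minimal sketch, assuming Euclidean distance and no duplicate points (a zero average distance would divide by zero); the function names and the data are made up for illustration.

```python
import math

def knn_density(points, i, k):
    """Density around points[i]: inverse of the average distance to its k nearest neighbors."""
    dists = sorted(math.dist(points[i], q) for j, q in enumerate(points) if j != i)
    return k / sum(dists[:k])

def density_outlier_scores(points, k):
    """Outlier score = inverse of the density around each point."""
    return [1.0 / knn_density(points, i, k) for i in range(len(points))]

# A tight cluster, a loose cluster, and one point just outside the tight cluster.
dense = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
sparse = [(10.0, 10.0), (12.0, 10.0), (10.0, 12.0), (12.0, 12.0)]
near_dense = (0.6, 0.6)
points = dense + sparse + [near_dense]

scores = density_outlier_scores(points, k=2)
print([round(s, 3) for s in scores])
```

In this made-up data the point just outside the tight cluster scores lower than every member of the loose cluster, even though it is far from its neighbors by the standards of its own region. That is the problem with regions of different density, and it motivates looking at density relative to the neighbors.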

Relative Density Outlier Scores
[Figure: data points shaded by outlier score; labeled points include C and A, with annotated scores 6.85, 1.40 and 1.33]

Density-based: LOF approach
- For each point, compute the density of its local neighborhood
- Compute local outlier factor (LOF) of a sample p as the average of the ratios of the density of sample p and the density of its nearest neighbors
- Outliers are points with largest LOF value
- In the NN approach, p2 is not considered as outlier, while LOF approach finds both p1 and p2 as outliers
[Figure: data set with points p1 and p2 marked]
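
A relative-density score in the spirit of the LOF slide can be sketched as the average density of a point's k nearest neighbors divided by the point's own density. This is a simplified sketch, not the full LOF algorithm (it omits reachability distances); the direction of the ratio, the density definition, the function names and the data are assumptions chosen so that larger scores mean more anomalous.

```python
import math

def k_nearest(points, i, k):
    """Indices of the k nearest neighbors of points[i] (brute force)."""
    others = [j for j in range(len(points)) if j != i]
    return sorted(others, key=lambda j: math.dist(points[i], points[j]))[:k]

def density(points, i, k):
    """Inverse of the average distance from points[i] to its k nearest neighbors."""
    return k / sum(math.dist(points[i], points[j]) for j in k_nearest(points, i, k))

def relative_density_scores(points, k):
    """Score = average density of the k nearest neighbors / density of the point itself."""
    dens = [density(points, i, k) for i in range(len(points))]
    scores = []
    for i in range(len(points)):
        nbrs = k_nearest(points, i, k)
        avg_nbr_density = sum(dens[j] for j in nbrs) / k
        scores.append(avg_nbr_density / dens[i])
    return scores

# A tight cluster, a loose cluster, and one point just outside the tight cluster.
dense = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
sparse = [(10.0, 10.0), (12.0, 10.0), (10.0, 12.0), (12.0, 12.0)]
points = dense + sparse + [(0.6, 0.6)]

scores = relative_density_scores(points, k=2)
print([round(s, 2) for s in scores])
```

Members of both clusters score close to 1 because their density matches that of their neighbors, while the point just outside the tight cluster scores well above 1 regardless of how loose the other cluster is.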

Strengths/Weaknesses of Density-based Approaches
- Simple
- Expensive: O(n^2)
- Sensitive to parameters
- Density becomes less meaningful in high-dimensional space

Clustering-Based Approaches
- Clustering-based Outlier: An object is a cluster-based outlier if it does not strongly belong to any cluster
  - For prototype-based clusters, an object is an outlier if it is not close enough to a cluster center
  - For density-based clusters, an object is an outlier if its density is too low
  - For graph-based clusters, an object is an outlier if it is not well connected
- Other issues include the impact of outliers on the clusters and the number of clusters

Distance of Points from Closest Centroids
[Figure: data points shaded by outlier score; labeled points include C and A, with annotated scores 4.6, 0.17 and 1.2]

Relative Distance of Points from Closest Centroid
[Figure: data points shaded by outlier score]
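
The two prototype-based scores named in the figure titles above can be sketched as follows. This is a minimal sketch under stated assumptions: the centroids are supplied by the caller (for example from a K-means run, which is not shown), every centroid has at least one assigned point, and the relative distance is taken as the distance to the closest centroid divided by the median distance of that cluster's points to the same centroid. The function name and the data are made up for illustration.

```python
import math
from statistics import median

def centroid_scores(points, centroids):
    """Return (distance, relative distance) outlier scores with respect to the closest centroid."""
    assign, dist = [], []
    for p in points:
        d = [math.dist(p, c) for c in centroids]
        j = d.index(min(d))            # index of the closest centroid
        assign.append(j)
        dist.append(d[j])
    # Median distance to the centroid within each cluster (assumed normalization).
    med = [median(dist[i] for i in range(len(points)) if assign[i] == j)
           for j in range(len(centroids))]
    relative = [dist[i] / med[assign[i]] for i in range(len(points))]
    return dist, relative

# A compact cluster, a point A near it, a loose cluster, and a point C near that.
compact = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]
point_a = (0.5, 0.0)
loose = [(12.0, 10.0), (8.0, 10.0), (10.0, 12.0), (10.0, 8.0)]
point_c = (10.0, 12.5)
points = compact + [point_a] + loose + [point_c]
centroids = [(0.0, 0.0), (10.0, 10.0)]   # hypothetical, assumed already computed

dist, relative = centroid_scores(points, centroids)
print(round(dist[4], 2), round(dist[9], 2))          # A vs. C, absolute distance
print(round(relative[4], 2), round(relative[9], 2))  # A vs. C, relative distance
```

By absolute distance C looks more anomalous than A, but once each distance is scaled by how spread out its cluster is, A stands out more. Relative distance therefore adjusts for clusters of differing density.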

Strengths/Weaknesses of Clustering-based Approaches
- Simple
- Many clustering techniques can be used
- Can be difficult to decide on a clustering technique
- Can be difficult to decide on number of clusters
- Outliers can distort the clusters