Data Mining: Concepts and Techniques

3rd ed., Chapter 10

Evaluation of Clustering

Clustering evaluation assesses the feasibility of clustering analysis on a data set and the quality of the results generated by a clustering method.

Three major tasks of clustering evaluation:
- Assessing the clustering tendency: whether a non-random structure exists in the data
- Determining the number of clusters in a data set
- Measuring clustering quality

Assessing Clustering Tendency

- Clustering requires a non-uniform distribution of data.
- Assess whether a non-random structure exists in the data by measuring the probability that the data is generated by a uniform data distribution.
- Hopkins statistic: a test of spatial randomness.
  - This statistic examines whether objects in a data set differ significantly from the assumption that they are uniformly distributed in the multidimensional space.
  - It compares the distances between the real objects and their nearest neighbors to the distances between artificial objects, uniformly generated over the data space, and their nearest real neighbors.
  - Given a data set D regarded as a sample of a random variable o, determine how far away o is from being uniformly distributed in the data space.

Hopkins Statistic Index

Calculate the Hopkins statistic index:
- Sample n points $p_1, \dots, p_n$ uniformly from the data space of D (artificial objects). Each point in the data space has the same probability of being included in the sample. For each $p_i$, find its nearest neighbor in D: $x_i = \min\{dist(p_i, v)\}$ where $v \in D$.
- Sample n points $q_1, \dots, q_n$ uniformly from D (real objects). For each $q_i$, find its nearest neighbor in $D \setminus \{q_i\}$: $y_i = \min\{dist(q_i, v)\}$ where $v \in D$ and $v \neq q_i$.
- Calculate the Hopkins statistic:

$$H = \frac{\sum_{i=1}^{n} x_i}{\sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i}$$

- If D is uniformly distributed, $\sum x_i$ and $\sum y_i$ will be close to each other and H is close to 0.5.
- If clusters are present, the distances for the artificial objects ($x_i$) will be larger than for the real ones ($y_i$), because the artificial objects are homogeneously distributed whereas the real ones are grouped together. The value of H then increases and approaches 1.
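The Hopkins computation above can be sketched in a few lines of Python. This is a minimal illustration and not a reference implementation: the function names (`nearest_distance`, `hopkins_statistic`) are hypothetical, and the "data space" is assumed to be the axis-aligned bounding box of D. It uses the convention stated above, in which H approaches 1 for clustered data.

```python
import math
import random

def nearest_distance(point, data, skip_index=None):
    """Distance from `point` to its nearest neighbor in `data`,
    optionally ignoring the object at `skip_index`."""
    return min(
        math.dist(point, v)
        for idx, v in enumerate(data)
        if idx != skip_index
    )

def hopkins_statistic(data, n_samples, rng):
    """H = sum(x) / (sum(x) + sum(y)).
    Assumption: the data space is the bounding box of `data`."""
    dims = len(data[0])
    lows = [min(p[d] for p in data) for d in range(dims)]
    highs = [max(p[d] for p in data) for d in range(dims)]

    # x_i: artificial points drawn uniformly from the data space,
    # each matched to its nearest real object.
    x_sum = 0.0
    for _ in range(n_samples):
        p = [rng.uniform(lows[d], highs[d]) for d in range(dims)]
        x_sum += nearest_distance(p, data)

    # y_i: real objects sampled from D, each matched to its
    # nearest neighbor in D without the object itself.
    y_sum = 0.0
    for idx in rng.sample(range(len(data)), n_samples):
        y_sum += nearest_distance(data[idx], data, skip_index=idx)

    return x_sum / (x_sum + y_sum)

rng = random.Random(0)
# Two tight blobs (clustered) versus points spread uniformly over a square.
clustered = [(rng.gauss(cx, 0.05), rng.gauss(cy, 0.05))
             for cx, cy in [(0.0, 0.0), (5.0, 5.0)] for _ in range(50)]
uniform = [(rng.uniform(0, 5), rng.uniform(0, 5)) for _ in range(100)]

h_clustered = hopkins_statistic(clustered, 20, rng)
h_uniform = hopkins_statistic(uniform, 20, rng)
print(h_clustered, h_uniform)
```

On the clustered sample H should come out near 1, and on the uniform sample near 0.5, matching the interpretation given above.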

Examples

[Figure: two example data sets. Open circles represent real objects, closed circles represent selected real objects, and asterisks represent artificial objects generated over the data space. (a) H value = 0.49; (b) H value = 0.73.]

Determine the Number of Clusters (1)

- Empirical method
  - Number of clusters: $k \approx \sqrt{n/2}$ for a data set of n points, e.g., n = 200, k = 10.
- Elbow method
  - Given a number k > 0, form k clusters on the data set using a clustering algorithm like k-means.
  - Calculate the sum of within-cluster variance, var(k).
  - Plot the curve of var with respect to k.
  - The first turning point of the curve suggests the right number.
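As a rough illustration of both methods, the sketch below computes the empirical estimate and a var(k) curve on synthetic data. Several choices here are assumptions rather than part of the slides: the helper names are hypothetical, k-means uses a deterministic farthest-first initialization so the run is reproducible, var(k) is taken as the sum of squared distances to the nearest centroid, and the "turning point" is located by the largest second difference of the curve, which is only one possible heuristic.

```python
import math
import random

def empirical_k(n):
    """Empirical rule: k is roughly sqrt(n / 2)."""
    return round(math.sqrt(n / 2))

def farthest_first(points, k):
    """Deterministic initialization (an assumption, for reproducibility)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    return centroids

def kmeans(points, k, n_iter=100):
    """Plain Lloyd iterations; returns the final centroids."""
    centroids = farthest_first(points, k)
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            groups[nearest].append(p)
        updated = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centroids[c]
            for c, g in enumerate(groups)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids

def within_cluster_sse(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)

rng = random.Random(0)
centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 10.0)]
points = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
          for cx, cy in centers for _ in range(30)]

var = {k: within_cluster_sse(points, kmeans(points, k)) for k in range(1, 7)}
# Heuristic elbow: the k where the curve bends most sharply.
elbow_k = max(range(2, 6),
              key=lambda k: (var[k - 1] - var[k]) - (var[k] - var[k + 1]))
print(empirical_k(200), elbow_k)
```

For three well-separated blobs the curve drops steeply up to k = 3 and flattens afterwards, which is the turning point the elbow method looks for.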

Determine the Number of Clusters (2)

- Cross-validation method
  - Divide a given data set into m parts.
  - Use m - 1 parts to obtain a clustering model.
  - Use the remaining part to test the quality of the clustering.
    - E.g., for each point in the test set, find the closest centroid, and use the sum of squared distances between all points in the test set and their closest centroids to measure how well the model fits the test set.
  - For any k > 0, repeat this m times and take the average quality measure as the overall quality measure.
  - Compare the overall quality measure for different values of k, and find the number of clusters that fits the data best.

Measuring Clustering Quality

- External: supervised; employs criteria not inherent to the data set.
  - Compare a clustering against prior or expert-specified knowledge (i.e., the ground truth) using a certain clustering quality measure.
- Internal: unsupervised; criteria derived from the data itself.
  - Evaluate the goodness of a clustering by considering how well the clusters are separated and how compact the clusters are, e.g., the silhouette coefficient.
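Returning to the cross-validation method for choosing the number of clusters, the procedure can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the clustering model is a simple k-means with a deterministic farthest-first initialization, and the quality measure is the test-set sum of squared distances to the closest centroid, as in the example given on the slide.

```python
import math
import random

def farthest_first(points, k):
    """Deterministic initialization (an assumption, for reproducibility)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    return centroids

def kmeans(points, k, n_iter=100):
    """Plain Lloyd iterations; returns the final centroids."""
    centroids = farthest_first(points, k)
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            groups[nearest].append(p)
        updated = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centroids[c]
            for c, g in enumerate(groups)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids

def cross_validated_sse(points, k, m, rng):
    """Average test-set SSE over m folds for a given k."""
    shuffled = points[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::m] for i in range(m)]
    scores = []
    for i, test in enumerate(folds):
        # Train on the other m - 1 parts, test on the held-out part.
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        centroids = kmeans(train, k)
        scores.append(
            sum(min(math.dist(p, c) ** 2 for c in centroids) for p in test))
    return sum(scores) / m

rng = random.Random(0)
centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 10.0)]
points = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
          for cx, cy in centers for _ in range(30)]

cv = {k: cross_validated_sse(points, k, 5, random.Random(1)) for k in range(1, 6)}
for k, score in cv.items():
    print(k, round(score, 2))
```

Comparing the averaged scores across k shows a large improvement up to the true number of blobs and little change beyond it.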

Measuring Clustering Quality: External Methods

- Clustering quality measure: Q(C, T), for a clustering C given the ground truth T.
- Q is good if it satisfies the following four essential criteria:
  - Cluster homogeneity: the purer the clusters, the better.
  - Cluster completeness: objects belonging to the same category in the ground truth should be assigned to the same cluster.
  - Rag bag: putting a heterogeneous object into a pure cluster should be penalized more than putting it into a rag bag (i.e., a "miscellaneous" or "other" category).
  - Small cluster preservation: splitting a small category into pieces is more harmful than splitting a large category into pieces.

BCubed Precision and Recall Metrics

- The precision of an object indicates how many other objects in the same cluster belong to the same category as the object.
- The recall of an object reflects how many objects of the same category are assigned to the same cluster.

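BCubed Precision and Recall Metrics

Let $D = \{o_1, \dots, o_n\}$ be the set of objects and C be a clustering on D. Let $L(o_i)$ be the category of $o_i$ given by the ground truth, and $C(o_i)$ be the cluster_ID of $o_i$. For two objects $o_i$ and $o_j$ ($1 \le i, j \le n$, $i \neq j$), the correctness of the relation between $o_i$ and $o_j$ in clustering C is given by

$$\text{Correctness}(o_i, o_j) = \begin{cases} 1 & \text{if } L(o_i) = L(o_j) \Leftrightarrow C(o_i) = C(o_j) \\ 0 & \text{otherwise} \end{cases}$$

BCubed precision is defined as

$$\text{Precision} = \frac{1}{n} \sum_{i=1}^{n} \frac{\sum_{o_j :\, i \neq j,\ C(o_i) = C(o_j)} \text{Correctness}(o_i, o_j)}{\left|\{o_j \mid i \neq j,\ C(o_i) = C(o_j)\}\right|}$$

BCubed recall is defined as

$$\text{Recall} = \frac{1}{n} \sum_{i=1}^{n} \frac{\sum_{o_j :\, i \neq j,\ L(o_i) = L(o_j)} \text{Correctness}(o_i, o_j)}{\left|\{o_j \mid i \neq j,\ L(o_i) = L(o_j)\}\right|}$$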
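The BCubed definitions translate almost directly into code. The sketch below is illustrative only: the function name `bcubed` is hypothetical, and objects with no other member in their cluster (or category) are assumed to contribute 0 to the corresponding sum, since the formulas leave that case undefined.

```python
def bcubed(categories, clusters):
    """Return (precision, recall) for parallel lists of ground-truth
    categories L(o_i) and cluster ids C(o_i)."""
    n = len(categories)

    def correctness(i, j):
        # 1 if "same category" and "same cluster" agree for the pair, else 0.
        same_category = categories[i] == categories[j]
        same_cluster = clusters[i] == clusters[j]
        return 1 if same_category == same_cluster else 0

    precision_total = 0.0
    recall_total = 0.0
    for i in range(n):
        cluster_mates = [j for j in range(n)
                         if j != i and clusters[j] == clusters[i]]
        category_mates = [j for j in range(n)
                          if j != i and categories[j] == categories[i]]
        # Assumption: an empty set of mates contributes 0.
        if cluster_mates:
            precision_total += (
                sum(correctness(i, j) for j in cluster_mates) / len(cluster_mates))
        if category_mates:
            recall_total += (
                sum(correctness(i, j) for j in category_mates) / len(category_mates))
    return precision_total / n, recall_total / n

categories = ["a", "a", "a", "b", "b"]
clusters = [0, 0, 1, 1, 1]
precision, recall = bcubed(categories, clusters)
print(precision, recall)
```

In this small example one object of category "a" is placed in the cluster that holds the "b" objects, so both precision and recall fall below 1.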

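Intrinsic Methods (1)

- Intrinsic methods evaluate a clustering by examining how well the clusters are separated and how compact the clusters are.
  - Ground truth is not available.
- The silhouette coefficient measure
  - For a data set D with n objects, D is partitioned into k clusters $C_1, C_2, \dots, C_k$.
  - For each object $o \in D$, calculate the average distance between o and all other objects in the cluster to which o belongs. Suppose $o \in C_i$ ($1 \le i \le k$):

$$a(o) = \frac{\sum_{o' \in C_i,\ o' \neq o} dist(o, o')}{|C_i| - 1}$$

Intrinsic Methods (2)

- Similarly, calculate the minimum average distance from o to all clusters to which o does not belong:

$$b(o) = \min_{C_j :\, 1 \le j \le k,\ j \neq i} \left\{ \frac{\sum_{o' \in C_j} dist(o, o')}{|C_j|} \right\}$$

- The silhouette coefficient of o is defined as

$$s(o) = \frac{b(o) - a(o)}{\max\{a(o), b(o)\}}$$

- The value of the silhouette coefficient is between -1 and 1.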
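The definitions of a(o), b(o), and s(o) can be sketched as follows. The function name `silhouette` is hypothetical, the sketch assumes at least two clusters and Euclidean distance, and an object that is alone in its cluster is assigned s(o) = 0, which is a convention adopted here because a(o) is undefined in that case.

```python
import math

def silhouette(points, labels):
    """Return the silhouette coefficient s(o) of every object."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    scores = []
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            scores.append(0.0)  # assumption: convention for singleton clusters
            continue
        # a(o): average distance to the other objects in o's own cluster.
        a = sum(math.dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        # b(o): smallest average distance to any cluster o does not belong to.
        b = min(
            sum(math.dist(p, points[j]) for j in members) / len(members)
            for label, members in clusters.items()
            if label != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return scores

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette(points, [0, 0, 1, 1])  # compact, well-separated clusters
bad = silhouette(points, [0, 1, 0, 1])   # each cluster spans both groups
mean_good = sum(good) / len(good)
mean_bad = sum(bad) / len(bad)
print(mean_good, mean_bad)
```

The first labeling yields coefficients close to 1, while the second, which mixes the two groups, yields negative values.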

Intrinsic Methods (3)

- a(o) reflects the compactness of the cluster to which o belongs. The smaller the value, the more compact the cluster.
- b(o) captures the degree to which o is separated from other clusters.
- When s(o) approaches 1, the cluster containing o is compact and o is far away from other clusters, which is preferable.
- To measure a cluster's fitness within a clustering, compute the average silhouette coefficient value of all objects in the cluster.
- To measure the quality of a clustering, compute the average silhouette coefficient value of all objects in the data set.

Summary

- Cluster analysis groups objects based on their similarity and has wide applications.
- Measures of similarity can be computed for various types of data.
- Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods.
- K-means and K-medoids are popular partitioning-based clustering algorithms.
- The quality of clustering results can be evaluated in various ways.