Chapter 3: Cluster Analysis
• 3.1 Basic Concepts of Clustering
  3.1.1 Cluster Analysis
  3.1.2 Clustering Categories
• 3.2 Partitioning Methods
  3.2.1 The Principle
  3.2.2 K-Means Method
  3.2.3 K-Medoids Method
  3.2.4 CLARA
  3.2.5 CLARANS
• 3.3 Hierarchical Methods
• 3.4 Density-Based Methods
• 3.5 Clustering High-Dimensional Data
• 3.6 Outlier Analysis
3.1.1 Cluster Analysis
• Unsupervised learning (i.e., the class label is unknown)
• Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns
• Principle: maximize intra-class similarity and minimize inter-class similarity
• Typical applications: WWW, social networks, marketing, biology, libraries, etc.
3.1.2 Clustering Categories
• Partitioning methods
  – Construct k partitions of the data
• Hierarchical methods
  – Create a hierarchical decomposition of the data
• Density-based methods
  – Grow a given cluster depending on its density (# data objects)
• Grid-based methods
  – Quantize the object space into a finite number of cells
• Model-based methods
  – Hypothesize a model for each cluster and find the best fit of the data to the given model
• Clustering high-dimensional data
  – Subspace clustering
• Constraint-based methods
  – Used for user-specific applications
3.2.1 Partitioning Methods: The Principle
• Given:
  – A data set of n objects
  – k, the number of clusters to form
• Organize the objects into k partitions (k ≤ n), where each partition represents a cluster
• The clusters are formed to optimize an objective partitioning criterion:
  – Objects within a cluster are similar
  – Objects of different clusters are dissimilar
3.2.2 K-Means Method
[Figure: k-means steps on a 2-D point set; "+" marks the cluster centroids]
• Goal: create 3 clusters (partitions)
• Choose 3 objects as the initial cluster centroids
• Assign each object to the closest centroid to form clusters
• Update the cluster centroids
K-Means Method
[Figure: the next iterations of the same example]
• Recompute the clusters from the updated centroids
• If the centroids are stable, then stop
K-Means Algorithm
• Input:
  – k: the number of clusters
  – D: a data set containing n objects
• Output: a set of k clusters
• Method:
  (1) Arbitrarily choose k objects from D as the initial cluster centers
  (2) Repeat
  (3)   Reassign each object to the most similar cluster, based on the mean value of the objects in the cluster
  (4)   Update the cluster means
  (5) Until no change
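A minimal sketch of this loop in Python (numpy assumed; the function name and defaults are illustrative, and the sketch assumes no cluster becomes empty during the iterations):

```python
import numpy as np

def kmeans(D, k, max_iter=100, seed=0):
    """Minimal k-means sketch: D is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # (1) arbitrarily choose k objects from D as the initial cluster centers
    centers = D[rng.choice(len(D), size=k, replace=False)]
    for _ in range(max_iter):                            # (2) repeat
        # (3) reassign each object to the nearest center (Euclidean distance)
        dists = np.linalg.norm(D[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # (4) update the cluster means (assumes every cluster keeps >= 1 object)
        new_centers = np.array([D[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centers, centers):            # (5) until no change
            break
        centers = new_centers
    return labels, centers
```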
K-Means Properties
• The algorithm attempts to determine k partitions that minimize the squared-error function:

  E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2

  – E: the sum of the squared error for all objects in the data set
  – p: the point in space representing a given object
  – m_i: the mean of cluster C_i
• It works well when the clusters are compact clouds that are rather well separated from one another
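For illustration, E can be computed directly from the labels and centers returned by the kmeans sketch above (numpy assumed):

```python
def squared_error(D, labels, centers):
    """E = sum over clusters C_i, sum over p in C_i, of ||p - m_i||^2."""
    return sum(((D[labels == i] - c) ** 2).sum() for i, c in enumerate(centers))
```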
K-Means Properties
Advantages:
• K-means is relatively scalable and efficient in processing large data sets
• The computational complexity of the algorithm is O(nkt)
  – n: the total number of objects
  – k: the number of clusters
  – t: the number of iterations
  – Normally, k << n and t << n
Disadvantages:
• Can be applied only when the mean of a cluster is defined
• Users need to specify k
• K-means is not suitable for discovering clusters with non-convex shapes or clusters of very different sizes
• It is sensitive to noise and outlier data points (they can influence the mean value)
Variations of the K-Means Method
• A few variants of k-means differ in:
  – Selection of the initial k means
  – Dissimilarity calculations
  – Strategies to calculate cluster means
• Handling categorical data: k-modes (Huang 1998)
  – Replaces the means of clusters with modes
  – Uses new dissimilarity measures to deal with categorical objects
  – Uses a frequency-based method to update the modes of clusters
  – Can be extended to a mixture of categorical and numerical data
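As a small illustration of the k-modes idea, a matching dissimilarity simply counts mismatched categorical attributes (a sketch of the concept, not Huang's exact formulation):

```python
def matching_dissimilarity(x, mode):
    """Count the attribute positions where a categorical object differs from a cluster mode."""
    return sum(a != b for a, b in zip(x, mode))

# e.g. matching_dissimilarity(("red", "small"), ("red", "large")) == 1
```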
3.2.3 K-Medoids Method
• Minimizes the sensitivity of k-means to outliers
• Picks actual objects to represent the clusters instead of mean values
• Each remaining object is clustered with the representative object (medoid) to which it is the most similar
• The algorithm minimizes the sum of the dissimilarities between each object and its corresponding reference point:

  E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - o_i|

  – E: the sum of absolute error for all objects in the data set
  – p: the point in space representing a given object
  – o_i: the representative object (medoid) of cluster C_i
K-Medoids Method: The Idea
• Initial representatives are chosen randomly
• The iterative process of replacing representative objects by non-representative objects continues as long as the quality of the clustering improves
• For each representative object O:
  – For each non-representative object R, swap O and R and compute the cost
• Choose the configuration with the lowest cost
• The cost function is the difference in absolute-error value when a current representative object is replaced by a non-representative object
K-Medoids Method: Example
Data objects:

        A1   A2
  O1     2    6
  O2     3    4
  O3     3    8
  O4     4    7
  O5     6    2
  O6     6    4
  O7     7    3
  O8     7    4
  O9     8    5
  O10    7    6

[Figure: the ten objects plotted on a 2-D grid]
• Goal: create two clusters
• Choose two medoids at random: O2 = (3, 4) and O8 = (7, 4)
K-Medoids Method: Example
• Assign each object to the closest representative object
• Using the L1 metric (Manhattan distance), we form the following clusters:
  – Cluster1 = {O1, O2, O3, O4}
  – Cluster2 = {O5, O6, O7, O8, O9, O10}
K-Medoids Method: Example
• Compute the absolute-error criterion for the set of medoids (O2, O8):

  E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - o_i|
    = (3 + 4 + 4) + (3 + 1 + 1 + 2 + 2) = 20

  (Cluster1 contributes 3 + 4 + 4; Cluster2 contributes 3 + 1 + 1 + 2 + 2)
K-Medoids Method: Example
• Choose a random non-representative object, e.g., O7
• Swap O8 and O7
• Compute the absolute-error criterion for the set of medoids (O2, O7):

  E = (3 + 4 + 4) + (2 + 2 + 1 + 3 + 3) = 22
K-Medoids Method: Example
• Compute the cost function:

  S = absolute error for (O2, O7) − absolute error for (O2, O8) = 22 − 20 = 2

• Since S > 0, it is a bad idea to replace O8 by O7
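The example's arithmetic can be checked with a short script; the data and medoid choices come from the slides above:

```python
data = {"O1": (2, 6), "O2": (3, 4), "O3": (3, 8), "O4": (4, 7), "O5": (6, 2),
        "O6": (6, 4), "O7": (7, 3), "O8": (7, 4), "O9": (8, 5), "O10": (7, 6)}

def manhattan(p, q):
    """L1 (Manhattan) distance between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))

def absolute_error(medoids):
    # each object contributes its L1 distance to the nearest medoid
    return sum(min(manhattan(p, data[m]) for m in medoids) for p in data.values())

print(absolute_error(["O2", "O8"]))  # 20
print(absolute_error(["O2", "O7"]))  # 22  -> swap cost S = 2 > 0, reject the swap
```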
K-Medoids Method
• In this example, changing the medoid of Cluster2 did not change the assignments of objects to clusters
• What are the possible cases when we replace a medoid by another object?
K-Medoids Method
[Figure: Cluster 1 with representative object A, Cluster 2 with representative object B; B is swapped with a random object B′]
• First case: P is currently assigned to A; after the swap, the assignment of P to A does not change
• Second case: P is currently assigned to B; after the swap, P is reassigned to A
K-Medoids Method
• Third case: P is currently assigned to B; after the swap, P is reassigned to the new representative object B′
• Fourth case: P is currently assigned to A; after the swap, P is reassigned to B′
K-Medoids Algorithm (PAM)
PAM: Partitioning Around Medoids
• Input:
  – k: the number of clusters
  – D: a data set containing n objects
• Output: a set of k clusters
• Method:
  (1) Arbitrarily choose k objects from D as the initial representative objects (seeds)
  (2) Repeat
  (3)   Assign each remaining object to the cluster with the nearest representative object
  (4)   For each representative object O_j
  (5)     Randomly select a non-representative object O_random
  (6)     Compute the total cost S of swapping O_j with O_random
  (7)     If S < 0 then replace O_j with O_random
  (8) Until no change
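A compact sketch of this loop (illustrative only; it evaluates one random candidate swap per representative per pass, as in the pseudocode, rather than an exhaustive search):

```python
import random

def pam(D, k, dist, max_iter=100, seed=0):
    """PAM sketch: D is a list of points, dist a distance function; returns medoid indices."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(D)), k)                 # (1) arbitrary seeds

    def total_cost(meds):
        # sum of distances from every object to its nearest medoid
        return sum(min(dist(p, D[m]) for m in meds) for p in D)

    for _ in range(max_iter):                              # (2) repeat
        changed = False
        for j in range(k):                                 # (4) each representative O_j
            o_random = rng.choice([i for i in range(len(D)) if i not in medoids])
            candidate = medoids[:j] + [o_random] + medoids[j + 1:]
            S = total_cost(candidate) - total_cost(medoids)  # (6) total cost of the swap
            if S < 0:                                      # (7) keep improving swaps
                medoids = candidate
                changed = True
        if not changed:                                    # (8) until no change
            break
    return medoids

# e.g. pam(list(data.values()), 2, manhattan) with the example data above
```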
K-Medoids Properties (k-medoids vs. k-means)
• The complexity of each iteration is O(k(n − k)²)
• For large values of n and k, such computation becomes very costly
• Advantages:
  – The k-medoids method is more robust than k-means in the presence of noise and outliers
• Disadvantages:
  – K-medoids is more costly than the k-means method
  – Like k-means, k-medoids requires the user to specify k
  – It does not scale well for large data sets
3.2.4 CLARA
• CLARA (Clustering LARge Applications) uses a sampling-based method to deal with large data sets
• A random sample should closely represent the original data
[Figure: original data → sample → PAM]
• The medoids chosen from the sample will likely be similar to those that would have been chosen from the whole data set
CLARA
• Draw multiple samples of the data set
• Apply PAM to each sample
• Return the best clustering
[Figure: sample 1, sample 2, …, sample m → PAM → clusters; choose the best clustering]
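A sketch of this procedure in the same style (it reuses the pam and manhattan sketches above; num_samples and sample_size are illustrative defaults, with 5 samples of size 40 + 2k being the commonly cited choice):

```python
import random

def clara(D, k, dist, num_samples=5, sample_size=40, seed=0):
    """CLARA sketch: run PAM on several random samples, keep the best medoids."""
    rng = random.Random(seed)
    best_medoids, best_cost = None, float("inf")
    for _ in range(num_samples):
        sample = rng.sample(D, min(sample_size, len(D)))
        medoids = [sample[i] for i in pam(sample, k, dist)]
        # evaluate the sample's medoids against the WHOLE data set
        cost = sum(min(dist(p, m) for m in medoids) for p in D)
        if cost < best_cost:                     # keep the best clustering found
            best_medoids, best_cost = medoids, cost
    return best_medoids
```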
CLARA Properties
• Complexity of each iteration is O(ks² + k(n − k))
  – s: the size of the sample
  – k: the number of clusters
  – n: the number of objects
• PAM finds the best k medoids among the given data; CLARA finds the best k medoids among the selected samples
• Problems:
  – The best k medoids may not be selected during the sampling process; in that case, CLARA will never find the best clustering
  – If the sampling is biased, we cannot obtain a good clustering
  – There is a trade-off of efficiency against clustering quality
3.2.5 CLARANS
• CLARANS (Clustering Large Applications based upon RANdomized Search) was proposed to improve the quality and the scalability of CLARA
• It combines sampling techniques with PAM
• It does not confine itself to any one sample at a given time
• It draws a sample with some randomness in each step of the search
CLARANS: The Idea
[Figure: clustering view; each node is a set of k medoids, and each neighbor differs by one medoid swap with an associated cost (e.g., Cost = −10, 5, −15, …); if no neighbor has a negative cost, keep the current medoids]
CLARANS: The Idea
CLARA:
• Draws a sample of nodes at the beginning of the search
• Neighbors are drawn from the chosen sample
• Restricts the search to a specific area of the original data
[Figure: first and second steps of the search; in each step the neighbors come from the chosen sample]
CLARANS: The Idea
CLARANS:
• Does not confine the search to a localized area
• Stops the search when a local minimum is found
• Finds several local optima and outputs the clustering with the best local optimum
[Figure: first and second steps of the search; in each step a random sample of neighbors is drawn from the original data; the number of neighbors sampled is specified by the user]
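A sketch of this search (illustrative; numlocal and maxneighbor play the role of the user-set parameters mentioned above, and dist is a distance function such as the manhattan sketch earlier):

```python
import random

def clarans(D, k, dist, numlocal=2, maxneighbor=250, seed=0):
    """CLARANS sketch: randomized search over sets of k medoids."""
    rng = random.Random(seed)

    def cost(meds):
        return sum(min(dist(p, D[m]) for m in meds) for p in D)

    best, best_cost = None, float("inf")
    for _ in range(numlocal):                     # restart from several random nodes
        current = rng.sample(range(len(D)), k)
        current_cost = cost(current)
        tried = 0
        while tried < maxneighbor:                # examine random neighbors only
            j = rng.randrange(k)                  # a neighbor swaps one medoid...
            o = rng.choice([i for i in range(len(D)) if i not in current])
            neighbor = current[:j] + [o] + current[j + 1:]
            neighbor_cost = cost(neighbor)
            if neighbor_cost < current_cost:      # move downhill, reset the counter
                current, current_cost, tried = neighbor, neighbor_cost, 0
            else:
                tried += 1                        # local minimum after maxneighbor failures
        if current_cost < best_cost:              # keep the best local optimum
            best, best_cost = current, current_cost
    return [D[i] for i in best]
```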
CLARANS Properties
• Advantages:
  – Experiments show that CLARANS is more effective than both PAM and CLARA
  – Handles outliers
• Disadvantages:
  – The computational complexity of CLARANS is O(n²), where n is the number of objects
  – The clustering quality depends on the sampling method
Summary of Section 3.2
• Partitioning methods find sphere-shaped clusters
• K-means is efficient for large data sets but sensitive to outliers
• PAM uses representative objects (medoids) near the centers of the clusters instead of means
• CLARA and CLARANS are used for clustering large databases