Cluster Validation: Determining the Number of Clusters. Umut ORHAN, PhD.

Cluster Validation

The procedure of evaluating the results of a clustering algorithm is known under the term cluster validity. How do we evaluate the goodness of the resulting clusters? And why do we want to evaluate them at all?
- To avoid finding patterns in noise
- To compare clusterings, or clustering algorithms

Cluster Validation

Typical tasks of cluster validation:
- Determining the clustering tendency of a set of data.
- Comparing the results of a cluster analysis to externally known results.
- Evaluating how well the results of a cluster analysis fit the data without reference to external information.
- Comparing the results of two different sets of cluster analyses to determine which is better.
- Determining the correct number of clusters.

Measures that are applied to judge various aspects of cluster validity are classified into the following two types:
- External index: used to measure the extent to which cluster labels match externally supplied class labels, e.g., entropy, precision, recall.
- Internal index: used to measure the goodness of a clustering structure without reference to external information, e.g., the sum of squared error (SSE).

Cluster Validation

- External index: validate against ground truth; compare two clusterings (how similar are they?).
- Internal index: validate without external information; compare runs with different numbers of clusters; solve for the number of clusters.

External Index

Assume that the data are labeled with some class labels; this is called the ground truth. We want the clusters to be homogeneous with respect to the classes. External measures are based on a matrix that summarizes the numbers of correct and wrong predictions (a cluster-by-class contingency table).

External Index

Notation:
- $n$ = number of points
- $m_i$ = number of points in cluster $i$
- $c_j$ = number of points in class $j$
- $n_{ij}$ = number of points in cluster $i$ coming from class $j$
- $p_{ij} = n_{ij} / m_i$ = probability of an element from cluster $i$ being assigned to class $j$

Entropy of a cluster $i$: $e_i = -\sum_{j=1}^{L} p_{ij} \log p_{ij}$, where $L$ is the number of classes.

Precision of a cluster $i$ with respect to a class $j$: $\mathrm{Prec}(i, j) = p_{ij}$, the fraction of cluster $i$ that consists of objects of class $j$.

External Index

Recall of a cluster $i$ with respect to a class $j$: $\mathrm{Rec}(i, j) = n_{ij} / c_j$, the extent to which cluster $i$ contains all objects of class $j$.

F-measure: $F(i, j) = \dfrac{2\,\mathrm{Prec}(i, j)\,\mathrm{Rec}(i, j)}{\mathrm{Prec}(i, j) + \mathrm{Rec}(i, j)}$

Example (assign to cluster $i$ the class $k$ such that $k = \arg\max_j n_{ij}$): one clustering gives per-cluster precisions of (0.94, 0.81, 0.85), overall 0.86, and recalls of (0.85, 0.9, 0.85), overall 0.87; another gives precisions of (0.38, 0.38, 0.38), overall 0.38, and recalls of (0.35, 0.42, 0.38), overall 0.39, so the first is clearly the better clustering.
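A minimal sketch, not from the slides, of how all of these indices fall out of the cluster-by-class counts $n_{ij}$; the contingency matrix below is made up for illustration:

```python
import numpy as np

def external_indices(n_ij):
    """n_ij[i, j] = number of points in cluster i belonging to class j."""
    m_i = n_ij.sum(axis=1, keepdims=True)        # points per cluster
    c_j = n_ij.sum(axis=0)                       # points per class
    prec = n_ij / m_i                            # p_ij = Prec(i, j)
    safe = np.where(prec > 0, prec, 1.0)         # avoid log(0); those terms are 0 anyway
    entropy = -np.sum(prec * np.log2(safe), axis=1)
    rec = n_ij / c_j                             # Rec(i, j)
    denom = np.where(prec + rec > 0, prec + rec, 1.0)
    f = 2 * prec * rec / denom                   # F(i, j)
    return entropy, prec, rec, f

counts = np.array([[47, 2, 1],                   # cluster 0 against classes 0..2
                   [4, 45, 3],
                   [2, 5, 41]])
entropy, prec, rec, f = external_indices(counts)
print(entropy)                                   # per-cluster entropy (low = homogeneous)
print(prec.max(axis=1))                          # precision of each cluster's majority class
```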

Internal Index

Used to measure the goodness of a clustering structure without reference to external information. Common internal indices:
- Variances within clusters and between clusters
- Silhouette coefficient
- F-ratio
- Davies-Bouldin index (DBI)

Internal validation measures are often based on the following two criteria:
- Cluster cohesion: measures how closely related the objects in a cluster are.
- Cluster separation: measures how distinct or well-separated a cluster is from other clusters.
In other words, intra-cluster variance is minimized and inter-cluster variance is maximized.

Internal Index

Cohesion is measured by the within-cluster sum of squares (SSE):
$$\mathrm{WSS} = \sum_i \sum_{x \in C_i} (x - c_i)^2$$
Separation is measured by the between-cluster sum of squares:
$$\mathrm{BSS} = \sum_i m_i\,(c - c_i)^2$$
where $c_i$ is the centroid of cluster $C_i$, $m_i$ is its size, and $c$ is the overall centroid.

Example: for a fixed data set, $\mathrm{BSS} + \mathrm{WSS} = \text{constant}$ (the total sum of squares), no matter how the points are partitioned.
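A sketch on synthetic data (the data set and the choice of three clusters are assumptions for illustration) that computes WSS and BSS for a k-means partition and checks the decomposition:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))
k = 3
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
c = X.mean(axis=0)                                    # overall centroid

wss = sum(((X[km.labels_ == i] - km.cluster_centers_[i]) ** 2).sum()
          for i in range(k))
bss = sum((km.labels_ == i).sum() * ((km.cluster_centers_[i] - c) ** 2).sum()
          for i in range(k))
tss = ((X - c) ** 2).sum()
print(wss + bss, tss)                                 # equal up to floating-point error
```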

Silhouette Coefficient

The silhouette coefficient combines the ideas of both cohesion and separation. For an individual point $x$:
- Calculate $a(x)$ = the average distance of $x$ to the other points in its own cluster (cohesion).
- Calculate $b(x)$ = the minimum, over the other clusters, of the average distance of $x$ to the points in that cluster (separation).
The silhouette of $x$ is then $s(x) = \dfrac{b(x) - a(x)}{\max(a(x), b(x))}$, which lies in $[-1, 1]$; values near 1 indicate a well-placed point.
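A sketch of the per-point computation just described, on assumed synthetic data; scikit-learn's silhouette_score gives the same quantity averaged over the whole data set:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def silhouette_point(X, labels, idx):
    """s(x) = (b - a) / max(a, b) for the single point X[idx]."""
    own = labels[idx]
    d = np.linalg.norm(X - X[idx], axis=1)
    a = d[labels == own].sum() / max((labels == own).sum() - 1, 1)   # cohesion
    b = min(d[labels == k].mean() for k in set(labels) if k != own)  # separation
    return (b - a) / max(a, b)

X = np.random.default_rng(1).normal(size=(200, 2))
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)
print(silhouette_point(X, labels, 0), silhouette_score(X, labels))
```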

Davies-Bouldin Index

The Davies-Bouldin index can be calculated by the following formula:
$$\mathrm{DB} = \frac{1}{k} \sum_{x=1}^{k} \max_{y \neq x} \frac{\sigma_x + \sigma_y}{d(c_x, c_y)}$$
where $c_x$ is the centroid of cluster $x$ and $\sigma_x$ is the average distance of all elements in cluster $x$ to the centroid $c_x$. Lower values indicate better clustering.

Dunn Index

The Dunn index can be calculated by the following formula:
$$D = \frac{\min_{i \neq j} d(i, j)}{\max_k d'(k)}$$
where $d(i, j)$ represents the distance between clusters $i$ and $j$, and $d'(k)$ measures the intra-cluster distance of cluster $k$. Higher values indicate better clustering.
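A sketch of both indices: Davies-Bouldin via scikit-learn, and a straightforward Dunn index where $d(i, j)$ is taken as the minimum pairwise distance between clusters and $d'(k)$ as the cluster diameter; these are one common choice, since the slides leave both open. The data are synthetic:

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

def dunn_index(X, labels):
    clusters = [X[labels == k] for k in np.unique(labels)]
    min_between = min(cdist(a, b).min()                  # closest pair across clusters
                      for i, a in enumerate(clusters)
                      for b in clusters[i + 1:])
    max_within = max(pdist(c).max() for c in clusters)   # largest cluster diameter
    return min_between / max_within

X = np.random.default_rng(2).normal(size=(200, 2))
labels = KMeans(n_clusters=3, n_init=10, random_state=2).fit_predict(X)
print(davies_bouldin_score(X, labels), dunn_index(X, labels))
```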

F-Ratio

Measures the ratio of the between-groups variance against the within-groups variance. A standard form, built from the within-cluster sum of squares $\mathrm{SSW}$ and the between-cluster sum of squares $\mathrm{BSS}$, is
$$F = \frac{\mathrm{BSS}/(k - 1)}{\mathrm{SSW}/(N - k)}$$
where $N$ is the number of points and $k$ the number of clusters. Larger values indicate more compact, better separated clusters (a code sketch follows the outline below).

Determining Number of Clusters

- By rule of thumb
- Elbow method
- Choosing k using the silhouette
- Cross validation
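A sketch of the ANOVA-style F-ratio given above, computed on assumed synthetic data from the same SSW and BSS quantities as before:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(3).normal(size=(300, 2))
k = 3
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
c = X.mean(axis=0)
ssw = sum(((X[km.labels_ == i] - km.cluster_centers_[i]) ** 2).sum()
          for i in range(k))
bss = sum((km.labels_ == i).sum() * ((km.cluster_centers_[i] - c) ** 2).sum()
          for i in range(k))
print((bss / (k - 1)) / (ssw / (len(X) - k)))   # larger F = better separated
```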

By Rule of Thumb

It is a simple method that can be applied to any type of data set: choose
$$k \approx \sqrt{n/2}$$
where $n$ is the number of data points. For example, with $n = 200$ points this suggests $k = 10$.

Elbow Method

The oldest method for determining the true number of clusters in a data set is inelegantly called the elbow method. The idea:
- Start with K = 2, and keep increasing it by 1 in each step.
- At each step, compute the clustering and the cost that comes with the training.
- At some value of K the cost drops dramatically, and after that it reaches a plateau as you increase K further. That K is the elbow.

Elbow Method

[Figure: the cost (within-cluster SSE) plotted against the number of clusters K; the "elbow" of the curve marks the chosen K.]
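A sketch of the elbow loop on assumed synthetic data; KMeans.inertia_ is scikit-learn's name for the within-cluster SSE used as the cost:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(4).normal(size=(300, 2))
for K in range(2, 11):
    cost = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X).inertia_
    print(K, round(cost, 1))
# Pick the K after which the cost stops dropping dramatically.
```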

Cross Valdaton Ths method splts the data n two or more (K) parts. One part s used for clusterng and the other parts are used for valdaton. The value of the objectve functon calculated for each part. These K values are calculated and averaged for each alternatve number of clusters. The cluster number s selected that leads to only a small reducton n the objectve functon. 25 Herarchcal: Dendogram Cuttng a dendrogram at a certan level gves a set of clusters. 26 13

Density-Based: Determining ε

The average distance of each point to its MinPts nearest neighbors is calculated. These MinPts-distances are then plotted in ascending order; the "knee" of the resulting curve suggests a good value of ε. The ε value can also be determined by the user.
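A sketch of this k-distance heuristic on assumed synthetic data, with MinPts = 4 chosen arbitrarily; ε is read off at the knee of the sorted curve:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.default_rng(6).normal(size=(300, 2))
min_pts = 4
dists, _ = NearestNeighbors(n_neighbors=min_pts + 1).fit(X).kneighbors(X)
k_dist = np.sort(dists[:, 1:].mean(axis=1))   # drop the zero self-distance, average, sort
print(k_dist[-10:])                           # the tail of the curve; the knee suggests eps
```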