Data Mining and Analysis: Fundamental Concepts and Algorithms


1 Data Mining and Analysis: Fundamental Concepts and Algorithms. dataminingbook.info. Mohammed J. Zaki (Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA) and Wagner Meira Jr. (Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil). Chapter 17: Clustering Validation

2 Clustering Validation and Evaluation. Cluster validation and assessment encompasses three main tasks: clustering evaluation seeks to assess the goodness or quality of the clustering; clustering stability seeks to understand the sensitivity of the clustering result to various algorithmic parameters, for example, the number of clusters; and clustering tendency assesses the suitability of applying clustering in the first place, that is, whether the data has any inherent grouping structure. Validity measures can be divided into three main types. External: external validation measures employ criteria that are not inherent to the dataset, e.g., class labels. Internal: internal validation measures employ criteria that are derived from the data itself, e.g., intracluster and intercluster distances. Relative: relative validation measures aim to directly compare different clusterings, usually those obtained via different parameter settings for the same algorithm.

3 External Measures. External measures assume that the correct or ground-truth clustering is known a priori and use it to evaluate a given clustering. Let $D = \{x_i\}_{i=1}^{n}$ be a dataset consisting of $n$ points in a $d$-dimensional space, partitioned into $k$ clusters. Let $y_i \in \{1, 2, \ldots, k\}$ denote the ground-truth cluster membership or label information for each point. The ground-truth clustering is given as $T = \{T_1, T_2, \ldots, T_k\}$, where the cluster $T_j$ consists of all the points with label $j$, i.e., $T_j = \{x_i \in D \mid y_i = j\}$. We refer to $T$ as the ground-truth partitioning, and to each $T_i$ as a partition. Let $C = \{C_1, \ldots, C_r\}$ denote a clustering of the same dataset into $r$ clusters, obtained via some clustering algorithm, and let $\hat{y}_i \in \{1, 2, \ldots, r\}$ denote the cluster label for $x_i$.

4 External Measures. External evaluation measures try to capture the extent to which points from the same partition appear in the same cluster, and the extent to which points from different partitions are grouped in different clusters. All of the external measures rely on the $r \times k$ contingency table $N$ induced by a clustering $C$ and the ground-truth partitioning $T$, defined as $N(i, j) = n_{ij} = |C_i \cap T_j|$. The count $n_{ij}$ denotes the number of points that are common to cluster $C_i$ and ground-truth partition $T_j$. Let $n_i = |C_i|$ denote the number of points in cluster $C_i$, and let $m_j = |T_j|$ denote the number of points in partition $T_j$. The contingency table can be computed from $T$ and $C$ in $O(n)$ time by examining the partition and cluster labels, $y_i$ and $\hat{y}_i$, for each point $x_i \in D$ and incrementing the corresponding count $n_{\hat{y}_i y_i}$.

5 Matching Based Measures: Purity. Purity quantifies the extent to which a cluster $C_i$ contains entities from only one partition: $\text{purity}_i = \frac{1}{n_i} \max_{j=1}^{k}\{n_{ij}\}$. The purity of clustering $C$ is defined as the weighted sum of the clusterwise purity values: $\text{purity} = \sum_{i=1}^{r} \frac{n_i}{n}\, \text{purity}_i = \frac{1}{n} \sum_{i=1}^{r} \max_{j=1}^{k}\{n_{ij}\}$, where the ratio $n_i/n$ denotes the fraction of points in cluster $C_i$.
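To make the computation concrete, the following minimal NumPy sketch (not from the book) evaluates purity directly from an r x k contingency table; the example table values are hypothetical.

```python
import numpy as np

def purity(N):
    """Purity from an r x k contingency table N, where N[i, j] = |C_i ∩ T_j|."""
    N = np.asarray(N, dtype=float)
    n = N.sum()
    # purity = (1/n) * sum_i max_j n_ij  (weighted sum of clusterwise purities)
    return N.max(axis=1).sum() / n

# hypothetical 3 x 3 contingency table (rows: clusters, columns: partitions)
N = np.array([[ 0, 47, 14],
              [50,  0,  0],
              [ 0,  3, 36]])
print(round(purity(N), 3))   # 0.887 for this particular table
```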

6 Matching Based Measures: Maximum Matching. The maximum matching measure selects the mapping between clusters and partitions such that the sum of the number of common points ($n_{ij}$) is maximized, provided that only one cluster can match with a given partition. Let $G$ be a bipartite graph over the vertex set $V = C \cup T$, with edge set $E = \{(C_i, T_j)\}$ and edge weights $w(C_i, T_j) = n_{ij}$. A matching $M$ in $G$ is a subset of $E$ such that the edges in $M$ are pairwise nonadjacent, that is, they do not share a common vertex. The maximum matching measure is given as $\text{match} = \max_{M} \left\{ \frac{w(M)}{n} \right\}$, where $w(M)$ is the sum of all the edge weights in matching $M$, given as $w(M) = \sum_{e \in M} w(e)$.

7 Matching Based Measures: F-measure. Given cluster $C_i$, let $j_i$ denote the partition that contains the maximum number of points from $C_i$, that is, $j_i = \arg\max_{j=1}^{k}\{n_{ij}\}$. The precision of a cluster $C_i$ is the same as its purity: $\text{prec}_i = \frac{1}{n_i}\max_{j=1}^{k}\{n_{ij}\} = \frac{n_{i j_i}}{n_i}$. The recall of cluster $C_i$ is defined as $\text{recall}_i = \frac{n_{i j_i}}{|T_{j_i}|} = \frac{n_{i j_i}}{m_{j_i}}$, where $m_{j_i} = |T_{j_i}|$. The F-measure is the harmonic mean of the precision and recall values for each cluster $C_i$: $F_i = \frac{2}{\frac{1}{\text{prec}_i} + \frac{1}{\text{recall}_i}} = \frac{2\,\text{prec}_i\,\text{recall}_i}{\text{prec}_i + \text{recall}_i} = \frac{2\, n_{i j_i}}{n_i + m_{j_i}}$. The F-measure for the clustering $C$ is the mean of the clusterwise F-measure values: $F = \frac{1}{r}\sum_{i=1}^{r} F_i$.
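A small sketch of the clusterwise precision, recall, and F-measure computed from the same kind of contingency table (again with hypothetical values):

```python
import numpy as np

def fmeasure(N):
    """Clusterwise precision/recall and the clustering F-measure from an r x k contingency table."""
    N = np.asarray(N, dtype=float)
    n_i = N.sum(axis=1)                     # cluster sizes |C_i|
    m_j = N.sum(axis=0)                     # partition sizes |T_j|
    j_i = N.argmax(axis=1)                  # j_i = argmax_j n_ij for each cluster
    n_iji = N[np.arange(N.shape[0]), j_i]   # n_{i j_i}
    prec = n_iji / n_i                      # precision_i (same as purity_i)
    rec = n_iji / m_j[j_i]                  # recall_i w.r.t. the matched partition
    F_i = 2 * n_iji / (n_i + m_j[j_i])      # harmonic mean, simplified form
    return F_i.mean(), prec, rec

# hypothetical contingency table (rows: clusters, columns: partitions)
N = np.array([[ 0, 47, 14],
              [50,  0,  0],
              [ 0,  3, 36]])
print(round(fmeasure(N)[0], 3))
```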

8 K-means: Iris Principal Components Data (Good Case). Scatter plot in the principal components ($u_1$, $u_2$) and contingency table over the ground-truth partitions $T_1$ (iris-setosa), $T_2$ (iris-versicolor), $T_3$ (iris-virginica) and the clusters $C_1$ (squares), $C_2$ (circles), $C_3$ (triangles), with row sums $n_i$ and column sums $m_j$. For this clustering: purity = 0.887, match = 0.887.

9 K-means: Iris Principal Components Data (Bad Case). Scatter plot in the principal components ($u_1$, $u_2$) and contingency table over $T_1$ (iris-setosa), $T_2$ (iris-versicolor), $T_3$ (iris-virginica) and the clusters $C_1$ (squares), $C_2$ (circles), $C_3$ (triangles), with $n = 150$. For this clustering: purity = 0.667, match = 0.560.

10 Entropy-based Measures: Conditional Entropy. The entropy of a clustering $C$ and of a partitioning $T$ is given as $H(C) = -\sum_{i=1}^{r} p_{C_i} \log p_{C_i}$ and $H(T) = -\sum_{j=1}^{k} p_{T_j} \log p_{T_j}$, where $p_{C_i} = n_i/n$ and $p_{T_j} = m_j/n$ are the probabilities of cluster $C_i$ and partition $T_j$. The cluster-specific entropy of $T$, that is, the conditional entropy of $T$ with respect to cluster $C_i$, is defined as $H(T \mid C_i) = -\sum_{j=1}^{k} \left(\frac{n_{ij}}{n_i}\right) \log\left(\frac{n_{ij}}{n_i}\right)$.

11 Entropy-based Measures: Conditional Entropy. The conditional entropy of $T$ given clustering $C$ is defined as the weighted sum: $H(T \mid C) = \sum_{i=1}^{r} \frac{n_i}{n} H(T \mid C_i) = -\sum_{i=1}^{r}\sum_{j=1}^{k} p_{ij} \log\left(\frac{p_{ij}}{p_{C_i}}\right) = H(C, T) - H(C)$, where $p_{ij} = n_{ij}/n$ is the probability that a point is in cluster $C_i$ and also belongs to partition $T_j$, and where $H(C, T) = -\sum_{i=1}^{r}\sum_{j=1}^{k} p_{ij} \log p_{ij}$ is the joint entropy of $C$ and $T$. $H(T \mid C) = 0$ if and only if $T$ is completely determined by $C$, corresponding to the ideal clustering. If $C$ and $T$ are independent of each other, then $H(T \mid C) = H(T)$.

12 Entropy-based Measures: Normalized Mutual Information. The mutual information quantifies the amount of shared information between the clustering $C$ and partitioning $T$; it is defined as $I(C, T) = \sum_{i=1}^{r}\sum_{j=1}^{k} p_{ij} \log\left(\frac{p_{ij}}{p_{C_i}\, p_{T_j}}\right)$. When $C$ and $T$ are independent, then $p_{ij} = p_{C_i}\, p_{T_j}$, and thus $I(C, T) = 0$. However, there is no upper bound on the mutual information. The normalized mutual information (NMI) is defined as the geometric mean: $\text{NMI}(C, T) = \sqrt{\frac{I(C, T)}{H(C)} \cdot \frac{I(C, T)}{H(T)}} = \frac{I(C, T)}{\sqrt{H(C)\, H(T)}}$. The NMI value lies in the range $[0, 1]$. Values close to 1 indicate a good clustering.
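Since the conditional entropy, NMI, and the variation of information (introduced on the next slide) are all functions of the contingency table, one hedged sketch covers them; the log base and the example table are assumptions, not the book's code.

```python
import numpy as np

def entropy_measures(N):
    """Conditional entropy H(T|C), NMI, and VI from an r x k contingency table N."""
    N = np.asarray(N, dtype=float)
    n = N.sum()
    p_ij = N / n                      # joint probabilities
    p_C = p_ij.sum(axis=1)            # cluster probabilities p_Ci = n_i / n
    p_T = p_ij.sum(axis=0)            # partition probabilities p_Tj = m_j / n

    def H(p):                         # entropy, skipping zero entries
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    H_C, H_T, H_CT = H(p_C), H(p_T), H(p_ij.ravel())
    H_T_given_C = H_CT - H_C          # H(T|C) = H(C,T) - H(C)
    I = H_C + H_T - H_CT              # mutual information I(C, T)
    NMI = I / np.sqrt(H_C * H_T)
    VI = H_T + H_C - 2 * I
    return H_T_given_C, NMI, VI

N = np.array([[ 0, 47, 14],
              [50,  0,  0],
              [ 0,  3, 36]])
print(entropy_measures(N))
```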

13 Entropy-based Measures: Variation of Information. This criterion is based on the mutual information between the clustering $C$ and the ground-truth partitioning $T$, and their entropies; it is defined as $VI(C, T) = (H(T) - I(C, T)) + (H(C) - I(C, T)) = H(T) + H(C) - 2 I(C, T)$. Variation of information (VI) is zero only when $C$ and $T$ are identical. Thus, the lower the VI value the better the clustering $C$. VI can also be expressed as $VI(C, T) = H(T \mid C) + H(C \mid T)$, or equivalently $VI(C, T) = 2 H(T, C) - H(T) - H(C)$.

14 K-means: Iris Principal Components Data (Good vs. Bad Case). Scatter plots in the principal components ($u_1$, $u_2$) for (a) the good and (b) the bad K-means clustering, together with a table comparing purity, match, $F$, $H(T \mid C)$, NMI, and VI for the two cases.

15 Pairwise Measures. Given clustering $C$ and ground-truth partitioning $T$, let $x_i, x_j \in D$ be any two points, with $i \neq j$. Let $y_i$ denote the true partition label and let $\hat{y}_i$ denote the cluster label for point $x_i$. If both $x_i$ and $x_j$ belong to the same cluster, that is, $\hat{y}_i = \hat{y}_j$, we call it a positive event, and if they do not belong to the same cluster, that is, $\hat{y}_i \neq \hat{y}_j$, we call that a negative event. Depending on whether there is agreement between the cluster labels and partition labels, there are four possibilities to consider. True Positives: $x_i$ and $x_j$ belong to the same partition in $T$, and they are also in the same cluster in $C$. The number of true positive pairs is given as $TP = |\{(x_i, x_j) : y_i = y_j \text{ and } \hat{y}_i = \hat{y}_j\}|$. False Negatives: $x_i$ and $x_j$ belong to the same partition in $T$, but they do not belong to the same cluster in $C$. The number of false negative pairs is given as $FN = |\{(x_i, x_j) : y_i = y_j \text{ and } \hat{y}_i \neq \hat{y}_j\}|$.

16 Pairwise Measures. False Positives: $x_i$ and $x_j$ do not belong to the same partition in $T$, but they do belong to the same cluster in $C$. The number of false positive pairs is given as $FP = |\{(x_i, x_j) : y_i \neq y_j \text{ and } \hat{y}_i = \hat{y}_j\}|$. True Negatives: $x_i$ and $x_j$ neither belong to the same partition in $T$, nor do they belong to the same cluster in $C$. The number of such true negative pairs is given as $TN = |\{(x_i, x_j) : y_i \neq y_j \text{ and } \hat{y}_i \neq \hat{y}_j\}|$. Because there are $N = \binom{n}{2} = \frac{n(n-1)}{2}$ pairs of points, we have the following identity: $N = TP + FN + FP + TN$.

17 Pairwise Measures: TP, TN, FP, FN. They can be computed efficiently using the contingency table $N = \{n_{ij}\}$. The number of true positives is given as $TP = \frac{1}{2}\left(\sum_{i=1}^{r}\sum_{j=1}^{k} n_{ij}^2 - n\right)$. The false negatives can be computed as $FN = \frac{1}{2}\left(\sum_{j=1}^{k} m_j^2 - \sum_{i=1}^{r}\sum_{j=1}^{k} n_{ij}^2\right)$. The number of false positives is $FP = \frac{1}{2}\left(\sum_{i=1}^{r} n_i^2 - \sum_{i=1}^{r}\sum_{j=1}^{k} n_{ij}^2\right)$. Finally, the number of true negatives can be obtained via $TN = N - (TP + FN + FP) = \frac{1}{2}\left(n^2 + \sum_{i=1}^{r}\sum_{j=1}^{k} n_{ij}^2 - \sum_{i=1}^{r} n_i^2 - \sum_{j=1}^{k} m_j^2\right)$.
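A sketch of these four pair counts computed from a contingency table, assuming integer counts:

```python
import numpy as np
from math import comb

def pair_counts(N):
    """TP, FN, FP, TN pair counts from an r x k contingency table N."""
    N = np.asarray(N, dtype=np.int64)
    n = int(N.sum())
    n_i = N.sum(axis=1)                     # cluster sizes
    m_j = N.sum(axis=0)                     # partition sizes
    s = int((N ** 2).sum())                 # sum_ij n_ij^2
    TP = (s - n) // 2
    FN = (int((m_j ** 2).sum()) - s) // 2
    FP = (int((n_i ** 2).sum()) - s) // 2
    TN = comb(n, 2) - TP - FN - FP
    return TP, FN, FP, TN
```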

18 Pairwise Measures: Jaccard Coefficient, Rand Statistic, FM Measure. Jaccard Coefficient: measures the fraction of true positive point pairs, ignoring the true negatives: $\text{Jaccard} = \frac{TP}{TP + FN + FP}$. Rand Statistic: measures the fraction of true positives and true negatives over all point pairs: $\text{Rand} = \frac{TP + TN}{N}$. Fowlkes-Mallows Measure: define the overall pairwise precision and pairwise recall values for a clustering $C$ as $\text{prec} = \frac{TP}{TP + FP}$ and $\text{recall} = \frac{TP}{TP + FN}$. The Fowlkes-Mallows (FM) measure is defined as the geometric mean of the pairwise precision and recall: $FM = \sqrt{\text{prec} \cdot \text{recall}} = \frac{TP}{\sqrt{(TP + FN)(TP + FP)}}$.
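And a small helper that turns the four pair counts into the three pairwise measures; the commented usage line assumes the pair_counts() sketch above.

```python
def pairwise_measures(TP, FN, FP, TN):
    """Jaccard, Rand, and Fowlkes-Mallows measures from the four pair counts."""
    N_pairs = TP + FN + FP + TN
    jaccard = TP / (TP + FN + FP)
    rand = (TP + TN) / N_pairs
    prec = TP / (TP + FP)           # pairwise precision
    rec = TP / (TP + FN)            # pairwise recall
    fm = (prec * rec) ** 0.5        # geometric mean = TP / sqrt((TP+FN)(TP+FP))
    return jaccard, rand, fm

# e.g., using pair_counts() from the previous sketch:
# print(pairwise_measures(*pair_counts(N)))
```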

19 K-means: Iris Principal Components Data (Good Case). Using the good-case contingency table (setosa, versicolor, virginica versus $C_1$, $C_2$, $C_3$), the number of true positives is obtained by summing $\binom{n_{ij}}{2}$ over all cells of the table. Likewise, we have $FN = 645$, $FP = 766$, $TN = 6734$, and $N = \binom{150}{2} = 11175$. We therefore have: Jaccard = 0.682, Rand = 0.887. For the bad clustering, we have: Jaccard = 0.477, Rand = 0.717.

20 Correlation Measures: Hubert Statistic. Let $X$ and $Y$ be two symmetric $n \times n$ matrices, and let $N = \binom{n}{2}$. Let $x, y \in \mathbb{R}^N$ denote the vectors obtained by linearizing the upper triangular elements (excluding the main diagonal) of $X$ and $Y$. Let $\mu_X$ denote the element-wise mean of $x$, given as $\mu_X = \frac{1}{N}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} X(i, j)$, and let $z_x$ denote the centered $x$ vector, defined as $z_x = x - \mathbf{1} \cdot \mu_X$. The Hubert statistic is defined as $\Gamma = \frac{1}{N}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} X(i, j)\, Y(i, j) = \frac{1}{N} x^T y$. The normalized Hubert statistic is defined as the element-wise correlation $\Gamma_n = \frac{z_x^T z_y}{\|z_x\|\, \|z_y\|} = \cos\theta$.

21 Correlation-based Measure: Discretized Hubert Statistic. Let $T$ and $C$ be the $n \times n$ matrices defined as $T(i, j) = 1$ if $y_i = y_j$, $i \neq j$, and 0 otherwise; and $C(i, j) = 1$ if $\hat{y}_i = \hat{y}_j$, $i \neq j$, and 0 otherwise. Let $t, c \in \mathbb{R}^N$ denote the $N$-dimensional vectors comprising the upper triangular elements (excluding the diagonal) of $T$ and $C$, and let $z_t$ and $z_c$ denote the centered $t$ and $c$ vectors. The discretized Hubert statistic is computed by setting $x = t$ and $y = c$: $\Gamma = \frac{1}{N} t^T c = \frac{TP}{N}$. The normalized version of the discretized Hubert statistic is simply the correlation between $t$ and $c$: $\Gamma_n = \frac{z_t^T z_c}{\|z_t\|\, \|z_c\|} = \frac{\frac{TP}{N} - \mu_T \mu_C}{\sqrt{\mu_T \mu_C (1 - \mu_T)(1 - \mu_C)}}$, where $\mu_T = \frac{TP + FN}{N}$ and $\mu_C = \frac{TP + FP}{N}$.

22 Internal Measures. Internal evaluation measures do not have recourse to the ground-truth partitioning. To evaluate the quality of the clustering, internal measures therefore have to utilize notions of intracluster similarity or compactness, contrasted with notions of intercluster separation, usually with a trade-off in maximizing these two aims. The internal measures are based on the $n \times n$ distance matrix, also called the proximity matrix, of all pairwise distances among the $n$ points: $W = \{\delta(x_i, x_j)\}_{i,j=1}^{n}$, where $\delta(x_i, x_j) = \|x_i - x_j\|_2$ is the Euclidean distance between $x_i, x_j \in D$. The proximity matrix $W$ is the adjacency matrix of the weighted complete graph $G$ over the $n$ points, that is, with nodes $V = \{x_i \mid x_i \in D\}$, edges $E = \{(x_i, x_j) \mid x_i, x_j \in D\}$, and edge weights $w_{ij} = W(i, j)$ for all $x_i, x_j \in D$.

23 Internal Measures. The clustering $C$ can be considered as a $k$-way cut in $G$. Given any subsets $S, R \subseteq V$, define $W(S, R)$ as the sum of the weights on all edges with one vertex in $S$ and the other in $R$: $W(S, R) = \sum_{x_i \in S}\sum_{x_j \in R} w_{ij}$. We denote by $\overline{S} = V - S$ the complementary set of vertices. The sums of all the intracluster and intercluster weights are given as $W_{in} = \frac{1}{2}\sum_{i=1}^{k} W(C_i, C_i)$ and $W_{out} = \frac{1}{2}\sum_{i=1}^{k} W(C_i, \overline{C_i}) = \sum_{i=1}^{k-1}\sum_{j>i} W(C_i, C_j)$. The number of distinct intracluster and intercluster edges is given as $N_{in} = \sum_{i=1}^{k}\binom{n_i}{2}$ and $N_{out} = \sum_{i=1}^{k-1}\sum_{j=i+1}^{k} n_i\, n_j$.

24 Clusterings as Graphs: Iris (Good Case). Graph view of the good clustering in the principal components ($u_1$, $u_2$); only intracluster edges shown.

25 Clusterings as Graphs: Iris (Bad Case). Graph view of the bad clustering in the principal components ($u_1$, $u_2$); only intracluster edges shown.

26 Internal Measures: BetaCV and C-index. BetaCV Measure: the BetaCV measure is the ratio of the mean intracluster distance to the mean intercluster distance: $\text{BetaCV} = \frac{W_{in}/N_{in}}{W_{out}/N_{out}} = \frac{N_{out}}{N_{in}} \cdot \frac{\sum_{i=1}^{k} W(C_i, C_i)}{\sum_{i=1}^{k} W(C_i, \overline{C_i})}$. The smaller the BetaCV ratio, the better the clustering. C-index: let $W_{min}(N_{in})$ be the sum of the smallest $N_{in}$ distances in the proximity matrix $W$, where $N_{in}$ is the total number of intracluster edges, or point pairs, and let $W_{max}(N_{in})$ be the sum of the largest $N_{in}$ distances in $W$. The C-index measures to what extent the clustering puts together the $N_{in}$ points that are the closest across the $k$ clusters. It is defined as $\text{Cindex} = \frac{W_{in} - W_{min}(N_{in})}{W_{max}(N_{in}) - W_{min}(N_{in})}$. The C-index lies in the range $[0, 1]$. The smaller the C-index, the better the clustering.
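A rough NumPy sketch (not from the book) of BetaCV and the C-index computed from a data matrix and integer cluster labels; it materializes the full n x n distance matrix, so it is meant only for small datasets.

```python
import numpy as np

def betacv_cindex(X, labels):
    """BetaCV and C-index from a data matrix X (n x d) and integer cluster labels."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # full pairwise distance matrix
    iu = np.triu_indices(n, k=1)                                # each point pair counted once
    d = D[iu]
    same = (labels[:, None] == labels[None, :])[iu]             # True for intracluster pairs
    W_in, W_out = d[same].sum(), d[~same].sum()
    N_in, N_out = int(same.sum()), int((~same).sum())
    betacv = (W_in / N_in) / (W_out / N_out)
    d_sorted = np.sort(d)
    W_min, W_max = d_sorted[:N_in].sum(), d_sorted[-N_in:].sum()
    cindex = (W_in - W_min) / (W_max - W_min)
    return betacv, cindex
```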

27 Internal Measures: Normalized Cut and Modularity. Normalized Cut Measure: the normalized cut objective for graph clustering can also be used as an internal clustering evaluation measure: $NC = \sum_{i=1}^{k} \frac{W(C_i, \overline{C_i})}{\text{vol}(C_i)} = \sum_{i=1}^{k} \frac{W(C_i, \overline{C_i})}{W(C_i, V)}$, where $\text{vol}(C_i) = W(C_i, V)$ is the volume of cluster $C_i$. Because the edge weights here are distances, a good clustering has large intercluster weights, so the higher the normalized cut value the better. Modularity: the modularity objective is given as $Q = \sum_{i=1}^{k}\left(\frac{W(C_i, C_i)}{W(V, V)} - \left(\frac{W(C_i, V)}{W(V, V)}\right)^2\right)$. For the same reason, the smaller the modularity measure the better the clustering.

28 Internal Measures: Dunn Index. The Dunn index is defined as the ratio between the minimum distance between point pairs from different clusters and the maximum distance between point pairs from the same cluster: $\text{Dunn} = \frac{W_{out}^{min}}{W_{in}^{max}}$, where $W_{out}^{min}$ is the minimum intercluster distance, $W_{out}^{min} = \min_{i, j > i}\{w_{ab} \mid x_a \in C_i, x_b \in C_j\}$, and $W_{in}^{max}$ is the maximum intracluster distance, $W_{in}^{max} = \max_{i}\{w_{ab} \mid x_a, x_b \in C_i\}$. The larger the Dunn index the better the clustering, because it means that even the closest distance between points in different clusters is much larger than the farthest distance between points in the same cluster.

29 Internal Measures: Davies-Bouldin Index. Let $\mu_i$ denote the cluster mean, $\mu_i = \frac{1}{n_i}\sum_{x_j \in C_i} x_j$, and let $\sigma_{\mu_i}$ denote the dispersion or spread of the points around the cluster mean, $\sigma_{\mu_i} = \sqrt{\frac{\sum_{x_j \in C_i} \delta(x_j, \mu_i)^2}{n_i}} = \sqrt{\text{var}(C_i)}$. The Davies-Bouldin measure for a pair of clusters $C_i$ and $C_j$ is defined as the ratio $DB_{ij} = \frac{\sigma_{\mu_i} + \sigma_{\mu_j}}{\delta(\mu_i, \mu_j)}$. $DB_{ij}$ measures how compact the clusters are compared to the distance between the cluster means. The Davies-Bouldin index is then defined as $DB = \frac{1}{k}\sum_{i=1}^{k} \max_{j \neq i}\{DB_{ij}\}$. The smaller the DB value the better the clustering.

30 Silhouette Coefficient. Define the silhouette coefficient of a point $x_i$ as $s_i = \frac{\mu_{out}^{min}(x_i) - \mu_{in}(x_i)}{\max\{\mu_{out}^{min}(x_i),\, \mu_{in}(x_i)\}}$, where $\mu_{in}(x_i)$ is the mean distance from $x_i$ to points in its own cluster $\hat{y}_i$, $\mu_{in}(x_i) = \frac{\sum_{x_j \in C_{\hat{y}_i},\, j \neq i} \delta(x_i, x_j)}{n_{\hat{y}_i} - 1}$, and $\mu_{out}^{min}(x_i)$ is the mean of the distances from $x_i$ to points in the closest other cluster, $\mu_{out}^{min}(x_i) = \min_{j \neq \hat{y}_i}\left\{\frac{\sum_{y \in C_j} \delta(x_i, y)}{n_j}\right\}$. The $s_i$ value lies in the interval $[-1, +1]$. A value close to $+1$ indicates that $x_i$ is much closer to points in its own cluster, a value close to zero indicates that $x_i$ is close to the boundary, and a value close to $-1$ indicates that $x_i$ is much closer to another cluster, and therefore may be mis-clustered. The silhouette coefficient is the mean $s_i$ value: $SC = \frac{1}{n}\sum_{i=1}^{n} s_i$. A value close to $+1$ indicates a good clustering.
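A direct (unoptimized) sketch of the silhouette coefficient following the formulas above; it assumes every cluster has at least two points.

```python
import numpy as np

def silhouette(X, labels):
    """Per-point silhouette values s_i and the mean silhouette coefficient SC."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        own = labels[i]
        in_mask = labels == own
        # mean distance to the other points of x_i's own cluster (cluster size >= 2 assumed)
        mu_in = D[i, in_mask].sum() / (in_mask.sum() - 1)
        # smallest mean distance from x_i to the points of any other cluster
        mu_out = min(D[i, labels == c].mean() for c in clusters if c != own)
        s[i] = (mu_out - mu_in) / max(mu_out, mu_in)
    return s, s.mean()
```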

31 Iris Data: Good vs. Bad Clustering. Scatter plots in the principal components ($u_1$, $u_2$) for (a) the good and (b) the bad clustering, with a table of the internal measures for both cases. Lower is better for BetaCV, Cindex, Q, and DB; higher is better for NC, Dunn, SC, $\Gamma$, and $\Gamma_n$.

32 Relative Measures: Silhouette Coefficient. The silhouette coefficient $s_j$ for each point, and the average SC value, can be used to estimate the number of clusters in the data. The approach consists of plotting the $s_j$ values in descending order for each cluster, and noting the overall SC value for a particular value of $k$, as well as the clusterwise SC values $SC_i = \frac{1}{n_i}\sum_{x_j \in C_i} s_j$. We then pick the value $k$ that yields the best clustering, with many points having high $s_j$ values within each cluster, as well as high values for SC and $SC_i$ ($1 \le i \le k$).

33 Iris K-means: Silhouette Coefficient Plot (k = 2). Clusterwise plots of the sorted silhouette values for the two clusters ($n_1 = 97$, $n_2 = 53$), with their $SC_1$, $SC_2$, and overall SC values. k = 2 yields the highest silhouette coefficient, with the two clusters essentially well separated.

34 Iris K-means: Silhouette Coefficient Plot (k = 3). Clusterwise silhouette plots for the three clusters ($n_1 = 61$, $n_2 = 50$, $n_3 = 39$; $SC_3 = 0.52$), with the overall SC value for k = 3.

35 Iris K-means: Silhouette Coefficient Plot (k = 4). Clusterwise silhouette plots for the four clusters ($n_1 = 49$, $n_2 = 28$, $n_3 = 50$, $n_4 = 23$), with the overall SC value for k = 4.

36 Relative Measures: Calinski-Harabasz Index. Given the dataset $D = \{x_i\}_{i=1}^{n}$, the scatter matrix for $D$ is given as $S = n\Sigma = \sum_{j=1}^{n} (x_j - \mu)(x_j - \mu)^T$, where $\mu = \frac{1}{n}\sum_{j=1}^{n} x_j$ is the mean and $\Sigma$ is the covariance matrix. The scatter matrix can be decomposed into two matrices, $S = S_W + S_B$, where $S_W$ is the within-cluster scatter matrix and $S_B$ is the between-cluster scatter matrix, given as $S_W = \sum_{i=1}^{k}\sum_{x_j \in C_i} (x_j - \mu_i)(x_j - \mu_i)^T$ and $S_B = \sum_{i=1}^{k} n_i (\mu_i - \mu)(\mu_i - \mu)^T$, where $\mu_i = \frac{1}{n_i}\sum_{x_j \in C_i} x_j$ is the mean for cluster $C_i$.

37 Relative Measures: Calinski-Harabasz Index. The Calinski-Harabasz (CH) variance ratio criterion for a given value of $k$ is defined as follows: $CH(k) = \frac{\text{tr}(S_B)/(k-1)}{\text{tr}(S_W)/(n-k)} = \frac{n-k}{k-1} \cdot \frac{\text{tr}(S_B)}{\text{tr}(S_W)}$, where tr is the trace of the matrix. We plot the CH values and look for a large increase in the value followed by little or no gain. We choose the value $k \ge 3$ that minimizes the term $\Delta(k) = \left(CH(k+1) - CH(k)\right) - \left(CH(k) - CH(k-1)\right)$. The intuition is that we want to find the value of $k$ for which $CH(k)$ is much higher than $CH(k-1)$ and there is only a little improvement or a decrease in the $CH(k+1)$ value.
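A sketch of the CH index using only the traces of S_W and S_B, which is all the criterion needs; the Delta(k) selection rule is left as a comment.

```python
import numpy as np

def ch_index(X, labels):
    """Calinski-Harabasz ratio CH(k) = [tr(S_B)/(k-1)] / [tr(S_W)/(n-k)]."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    n = X.shape[0]
    mu = X.mean(axis=0)                       # overall mean
    clusters = np.unique(labels)
    k = len(clusters)
    # trace of the within-cluster scatter: squared deviations from each cluster mean
    tr_SW = sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum() for c in clusters)
    # trace of the between-cluster scatter: n_i * squared distance of cluster mean from overall mean
    tr_SB = sum((labels == c).sum() * ((X[labels == c].mean(axis=0) - mu) ** 2).sum()
                for c in clusters)
    return (tr_SB / (k - 1)) / (tr_SW / (n - k))

# Delta(k) = (CH(k+1) - CH(k)) - (CH(k) - CH(k-1)); pick the k >= 3 that minimizes it.
```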

38 Calinski-Harabasz Variance Ratio. CH ratio for various values of $k$ on the Iris principal components data, using the K-means algorithm, with the best results chosen from 200 runs. The plot and table give the successive $CH(k)$ and $\Delta(k)$ values; $\Delta(k)$ suggests $k = 3$ as the best (lowest) value.

39 Relative Measures: Gap Statistic. The gap statistic compares the sum of intracluster weights $W_{in}$ for different values of $k$ with their expected values assuming no apparent clustering structure, which forms the null hypothesis. Let $C_k$ be the clustering obtained for a specified value of $k$, and let $W_{in}^{k}(D)$ denote the sum of intracluster weights (over all clusters) for $C_k$ on the input dataset $D$. We would like to compute the probability of the observed $W_{in}^{k}$ value under the null hypothesis. To obtain an empirical distribution for $W_{in}$, we resort to Monte Carlo simulations of the sampling process.

40 Relative Measures: Gap Statistic. We generate $t$ random samples, each comprising $n$ points. Let $R_i \in \mathbb{R}^{n \times d}$, $1 \le i \le t$, denote the $i$th sample, and let $W_{in}^{k}(R_i)$ denote the sum of intracluster weights for a given clustering of $R_i$ into $k$ clusters. From each sample dataset $R_i$, we generate clusterings for different values of $k$ and record the intracluster values $W_{in}^{k}(R_i)$. Let $\mu_W(k)$ and $\sigma_W(k)$ denote the mean and standard deviation of these intracluster weights (on the $\log_2$ scale) for each value of $k$. The gap statistic for a given $k$ is then defined as $\text{gap}(k) = \mu_W(k) - \log_2 W_{in}^{k}(D)$. Choose $k$ as the smallest value satisfying $\text{gap}(k) \ge \text{gap}(k+1) - \sigma_W(k+1)$.
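A hedged sketch of the gap-statistic procedure, assuming K-means (via scikit-learn) as the clustering algorithm, a uniform null over the bounding box of D, and a log2-scale comparison of the intracluster weights as in the plots that follow; none of these choices is prescribed by the slides.

```python
import numpy as np
from sklearn.cluster import KMeans

def w_in(X, labels):
    """Sum of all intracluster pairwise distances W_in^k for one clustering."""
    total = 0.0
    for c in np.unique(labels):
        P = X[labels == c]
        D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)
        total += D[np.triu_indices(len(P), k=1)].sum()
    return total

def gap_statistic(X, k_values=range(1, 9), t=20, seed=0):
    """gap(k) = mu_W(k) - log2 W_in^k(D), with mu_W, sigma_W over t uniform null samples."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps, sigmas = {}, {}
    for k in k_values:
        obs = np.log2(w_in(X, KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)))
        null = [np.log2(w_in(R, KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(R)))
                for R in (rng.uniform(lo, hi, size=X.shape) for _ in range(t))]
        gaps[k], sigmas[k] = np.mean(null) - obs, np.std(null)
    return gaps, sigmas

# choose the smallest k with gap(k) >= gap(k+1) - sigma_W(k+1)
```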

41 Gap Statistic: Randomly Generated Data. (a) Randomly generated data with $k = 3$ clusters.

42 Gap Statistic: Intracluster Weights and Gap Values. (b) Intracluster weights: expected $\mu_W(k)$ versus observed $\log_2 W_{in}^{k}$ as a function of $k$. (c) Gap statistic $\text{gap}(k)$ as a function of $k$.

43 Gap Statistic as a Function of k. Table of $\text{gap}(k)$ and $\sigma_W(k)$ for each value of $k$. The optimal value for the number of clusters is $k = 4$ because $\text{gap}(4) > \text{gap}(5) - \sigma_W(5)$. However, if we relax the gap test to be within two standard deviations, then the optimal value is $k = 3$ because $\text{gap}(3) > \text{gap}(4) - 2\sigma_W(4)$.

44 Cluster Stability. The main idea behind cluster stability is that the clusterings obtained from several datasets sampled from the same underlying distribution as $D$ should be similar or stable. Stability can be used to find a good value for $k$, the correct number of clusters. We generate $t$ samples of size $n$ by sampling from $D$ with replacement. Let $C_k(D_i)$ denote the clustering obtained from sample $D_i$, for a given value of $k$. Next, we compare the distance between all pairs of clusterings $C_k(D_i)$ and $C_k(D_j)$ using several of the external cluster evaluation measures. From these values we compute the expected pairwise distance for each value of $k$. Finally, the value $k^*$ that exhibits the least deviation between the clusterings obtained from the resampled datasets is the best choice for $k$, because it exhibits the most stability.

45 Clustering Stability Algorithm. CLUSTERINGSTABILITY(A, t, k_max, D):
n ← |D|
for i = 1, 2, ..., t do: D_i ← sample n points from D with replacement
for i = 1, 2, ..., t do:
    for k = 2, 3, ..., k_max do: C_k(D_i) ← cluster D_i into k clusters using algorithm A
foreach pair D_i, D_j with j > i do:
    D_ij ← D_i ∩ D_j  // create common dataset
    for k = 2, 3, ..., k_max do: d_ij(k) ← d(C_k(D_i), C_k(D_j), D_ij)  // distance between the two clusterings restricted to the common points
for k = 2, 3, ..., k_max do: μ_d(k) ← (2 / (t(t−1))) Σ_{i} Σ_{j>i} d_ij(k)
k* ← arg min_k { μ_d(k) }

46 Clustering Stability: Iris Data. t = 500 bootstrap samples; best K-means from 100 runs. Plot of the expected pairwise values as a function of $k$: $\mu_s(k)$ using the FM measure and $\mu_d(k)$ using VI. The best choice is $k = 2$.

47 Clustering Tendency: Spatial Histogram. Clustering tendency or clusterability aims to determine whether the dataset $D$ has any meaningful groups to begin with. Let $X_1, X_2, \ldots, X_d$ denote the $d$ dimensions. Given $b$, the number of bins for each dimension, we divide each dimension $X_j$ into $b$ equi-width bins, and simply count how many points lie in each of the $b^d$ $d$-dimensional cells. From this spatial histogram, we can obtain the empirical joint probability mass function (EPMF) for the dataset $D$: $f(i) = P(x_j \in \text{cell } i) = \frac{|\{x_j \in \text{cell } i\}|}{n}$, where $i = (i_1, i_2, \ldots, i_d)$ denotes a cell index, with $i_j$ denoting the bin index along dimension $X_j$.

48 Clustering Tendency: Spatial Histogram. We generate $t$ random samples, each comprising $n$ points within the same $d$-dimensional space as the input dataset $D$. Let $R_j$ denote the $j$th such random sample. We then compute the corresponding EPMF $g_j(i)$ for each $R_j$, $1 \le j \le t$. We next compute how much the distribution $f$ differs from $g_j$ (for $j = 1, \ldots, t$), using the Kullback-Leibler (KL) divergence from $f$ to $g_j$, defined as $KL(f \mid g_j) = \sum_{i} f(i) \log\left(\frac{f(i)}{g_j(i)}\right)$. The KL divergence is zero only when $f$ and $g_j$ are the same distribution. Using these divergence values, we can compute how much the dataset $D$ differs from a random dataset.
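A sketch of the spatial-histogram tendency test; the grid resolution b, the number of null samples t, and the small smoothing constant used to avoid division by zero in the KL computation are all assumptions, not part of the slides.

```python
import numpy as np

def spatial_epmf(X, b, lo, hi):
    """EPMF over the b^d grid of equi-width cells spanning [lo, hi] in each dimension."""
    H, _ = np.histogramdd(X, bins=b, range=list(zip(lo, hi)))
    return H.ravel() / len(X)

def kl(f, g, eps=1e-12):
    """KL divergence KL(f || g), with a tiny smoothing term to avoid empty cells."""
    f, g = f + eps, g + eps
    f, g = f / f.sum(), g / g.sum()
    return float((f * np.log2(f / g)).sum())

def spatial_histogram_test(X, b=5, t=100, seed=0):
    """Mean and std of KL(f || g_j) over t uniform null samples in the bounding box of X."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    f = spatial_epmf(X, b, lo, hi)
    divs = [kl(f, spatial_epmf(rng.uniform(lo, hi, size=X.shape), b, lo, hi))
            for _ in range(t)]
    return np.mean(divs), np.std(divs)
```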

49 Spatial Histogram: Iris Data versus Uniform. (a) Iris: spatial cells in the principal components ($u_1$, $u_2$). (b) Uniform: spatial cells.

50 Spatial Histogram: Empirical PMF. (c) Empirical probability mass function over the spatial cells: Iris ($f$) versus Uniform ($g_j$).

51 Spatial Histogram: KL Divergence Distribution. (d) KL-divergence distribution. We generated $t = 500$ random samples from the null distribution, and computed the KL divergence from $f$ to $g_j$ for each $1 \le j \le t$. The mean KL value is $\mu_{KL} = 1.17$.

52 Clustering Tendency: Distance Distribution. We can compare the pairwise point distances from $D$ with those from the randomly generated samples $R_i$ from the null distribution. We create the EPMF from the proximity matrix $W$ for $D$ by binning the distances into $b$ bins: $f(i) = P(w_{pq} \in \text{bin } i \mid x_p, x_q \in D,\, p < q) = \frac{|\{w_{pq} \in \text{bin } i\}|}{n(n-1)/2}$. Likewise, for each of the samples $R_j$, we determine the EPMF for the pairwise distances, denoted $g_j$. Finally, we compute the KL divergences between $f$ and $g_j$. The expected divergence indicates the extent to which $D$ differs from the null (random) distribution.

53 Iris Data: Distance Distribution. (a) EPMF of the pairwise distances: Iris ($f$) versus Uniform ($g_j$).

54 Iris Data: Distance Distribution. (b) Distribution of the KL divergences between $f$ and the $g_j$.

55 Clustering Tendency: Hopkins Statistic. Given a dataset $D$ comprising $n$ points, we generate $t$ uniform subsamples $R_i$ of $m$ points each, sampled from the same dataspace as $D$. We also generate $t$ subsamples of $m$ points directly from $D$, using sampling without replacement; let $D_i$ denote the $i$th direct subsample. Next, we compute the minimum distance between each point $x_j \in D_i$ and points in $D$: $\delta_{min}(x_j) = \min_{x_i \in D,\, x_i \neq x_j}\{\delta(x_j, x_i)\}$. We also compute the minimum distance $\delta_{min}(y_j)$ between a point $y_j \in R_i$ and points in $D$. The Hopkins statistic (in $d$ dimensions) for the $i$th pair of samples $R_i$ and $D_i$ is then defined as $HS_i = \frac{\sum_{y_j \in R_i} (\delta_{min}(y_j))^d}{\sum_{y_j \in R_i} (\delta_{min}(y_j))^d + \sum_{x_j \in D_i} (\delta_{min}(x_j))^d}$. If the data is well clustered we expect the $\delta_{min}(x_j)$ values to be smaller than the $\delta_{min}(y_j)$ values, in which case $HS_i$ tends to 1.
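A sketch of the Hopkins statistic as defined above (with the d-th power of the nearest-neighbor distances); sampling the uniform points from the bounding box of D is an assumption.

```python
import numpy as np

def hopkins(D, m=30, t=10, seed=0):
    """Mean Hopkins statistic over t pairs of subsamples of size m each."""
    rng = np.random.default_rng(seed)
    n, d = D.shape
    lo, hi = D.min(axis=0), D.max(axis=0)
    stats = []
    for _ in range(t):
        R = rng.uniform(lo, hi, size=(m, d))            # uniform subsample R_i
        idx = rng.choice(n, size=m, replace=False)      # direct subsample D_i (indices into D)
        # minimum distance from each uniform point to any point of D
        dist_R = np.linalg.norm(R[:, None, :] - D[None, :, :], axis=2)
        u = dist_R.min(axis=1) ** d
        # minimum distance from each sampled data point to any *other* point of D
        dist_X = np.linalg.norm(D[idx][:, None, :] - D[None, :, :], axis=2)
        dist_X[np.arange(m), idx] = np.inf              # exclude the point itself
        w = dist_X.min(axis=1) ** d
        stats.append(u.sum() / (u.sum() + w.sum()))
    return float(np.mean(stats))
```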

56 Iris Data: Hopkins Statistic Distribution. Histogram of the Hopkins statistic over the sample pairs, with $t = 500$ and subsample size $m = 30$. The mean of the Hopkins statistic is $\mu_{HS} = 0.935$.


More information

CMPSCI 611 Advanced Algorithms Midterm Exam Fall 2015

CMPSCI 611 Advanced Algorithms Midterm Exam Fall 2015 NAME: CMPSCI 611 Advanced Algorithms Midterm Exam Fall 015 A. McGregor 1 October 015 DIRECTIONS: Do not turn over the page until you are told to do so. This is a closed book exam. No communicating with

More information

Homework 1 Due: Thursday 2/5/2015. Instructions: Turn in your homework in class on Thursday 2/5/2015

Homework 1 Due: Thursday 2/5/2015. Instructions: Turn in your homework in class on Thursday 2/5/2015 10-704 Homework 1 Due: Thursday 2/5/2015 Instructions: Turn in your homework in class on Thursday 2/5/2015 1. Information Theory Basics and Inequalities C&T 2.47, 2.29 (a) A deck of n cards in order 1,

More information

Evaluation requires to define performance measures to be optimized

Evaluation requires to define performance measures to be optimized Evaluation Basic concepts Evaluation requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain (generalization error) approximation

More information

Unsupervised Learning. k-means Algorithm

Unsupervised Learning. k-means Algorithm Unsupervised Learning Supervised Learning: Learn to predict y from x from examples of (x, y). Performance is measured by error rate. Unsupervised Learning: Learn a representation from exs. of x. Learn

More information

DEEP LEARNING CHAPTER 3 PROBABILITY & INFORMATION THEORY

DEEP LEARNING CHAPTER 3 PROBABILITY & INFORMATION THEORY DEEP LEARNING CHAPTER 3 PROBABILITY & INFORMATION THEORY OUTLINE 3.1 Why Probability? 3.2 Random Variables 3.3 Probability Distributions 3.4 Marginal Probability 3.5 Conditional Probability 3.6 The Chain

More information

Introduction to Supervised Learning. Performance Evaluation

Introduction to Supervised Learning. Performance Evaluation Introduction to Supervised Learning Performance Evaluation Marcelo S. Lauretto Escola de Artes, Ciências e Humanidades, Universidade de São Paulo marcelolauretto@usp.br Lima - Peru Performance Evaluation

More information

Chapter 5-2: Clustering

Chapter 5-2: Clustering Chapter 5-2: Clustering Jilles Vreeken Revision 1, November 20 th typo s fixed: dendrogram Revision 2, December 10 th clarified: we do consider a point x as a member of its own ε-neighborhood 12 Nov 2015

More information

Spectral Clustering. Spectral Clustering? Two Moons Data. Spectral Clustering Algorithm: Bipartioning. Spectral methods

Spectral Clustering. Spectral Clustering? Two Moons Data. Spectral Clustering Algorithm: Bipartioning. Spectral methods Spectral Clustering Seungjin Choi Department of Computer Science POSTECH, Korea seungjin@postech.ac.kr 1 Spectral methods Spectral Clustering? Methods using eigenvectors of some matrices Involve eigen-decomposition

More information

Additive Combinatorics and Szemerédi s Regularity Lemma

Additive Combinatorics and Szemerédi s Regularity Lemma Additive Combinatorics and Szemerédi s Regularity Lemma Vijay Keswani Anurag Sahay 20th April, 2015 Supervised by : Dr. Rajat Mittal 1 Contents 1 Introduction 3 2 Sum-set Estimates 4 2.1 Size of sumset

More information

Undirected Graphical Models

Undirected Graphical Models Undirected Graphical Models 1 Conditional Independence Graphs Let G = (V, E) be an undirected graph with vertex set V and edge set E, and let A, B, and C be subsets of vertices. We say that C separates

More information

Data Mining. Linear & nonlinear classifiers. Hamid Beigy. Sharif University of Technology. Fall 1396

Data Mining. Linear & nonlinear classifiers. Hamid Beigy. Sharif University of Technology. Fall 1396 Data Mining Linear & nonlinear classifiers Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 1 / 31 Table of contents 1 Introduction

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Undirected Graphical Models Mark Schmidt University of British Columbia Winter 2016 Admin Assignment 3: 2 late days to hand it in today, Thursday is final day. Assignment 4:

More information

Estimation Theory. as Θ = (Θ 1,Θ 2,...,Θ m ) T. An estimator

Estimation Theory. as Θ = (Θ 1,Θ 2,...,Θ m ) T. An estimator Estimation Theory Estimation theory deals with finding numerical values of interesting parameters from given set of data. We start with formulating a family of models that could describe how the data were

More information

Linear Regression and Its Applications

Linear Regression and Its Applications Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start

More information

Hypothesis Evaluation

Hypothesis Evaluation Hypothesis Evaluation Machine Learning Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Hypothesis Evaluation Fall 1395 1 / 31 Table of contents 1 Introduction

More information

Multivariate Statistics

Multivariate Statistics Multivariate Statistics Chapter 6: Cluster Analysis Pedro Galeano Departamento de Estadística Universidad Carlos III de Madrid pedro.galeano@uc3m.es Course 2017/2018 Master in Mathematical Engineering

More information

Subcubic Equivalence of Triangle Detection and Matrix Multiplication

Subcubic Equivalence of Triangle Detection and Matrix Multiplication Subcubic Equivalence of Triangle Detection and Matrix Multiplication Bahar Qarabaqi and Maziar Gomrokchi April 29, 2011 1 Introduction An algorithm on n n matrix with the entries in [ M, M] has a truly

More information