Chapter DM:II (continued)
1 Chapter DM:II (continued) II. Cluster Analysis: Cluster Analysis Basics, Hierarchical Cluster Analysis, Iterative Cluster Analysis, Density-Based Cluster Analysis, Cluster Evaluation, Constrained Cluster Analysis. DM:II-198 Cluster Analysis STEIN
2 Overview "The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage." [Jain/Dubes 1990]
3-6 [Tan/Steinbach/Kumar 2005] [Figures: random points, and the clusterings that DBSCAN, k-means, and complete link find in them.]
7-8 Overview Cluster evaluation can address different issues:
- Provide evidence whether the data contains non-random structures.
- Relate structures found in the data to externally provided class information.
- Rank alternative clusterings with regard to their quality.
- Determine the ideal number of clusters.
- Provide information for choosing a suitable clustering approach.
(1) External validity measures: analyze how close a clustering is to a reference.
(2) Internal validity measures: analyze intrinsic characteristics of a clustering.
(3) Relative validity measures: analyze the sensitivity (of internal measures) during clustering generation.
9 Overview Taxonomy of cluster validity measures:
- external: covering analysis (F-Measure, Purity, Rand statistics); information-theoretic (Kullback-Leibler, Entropy)
- internal (static: structure analysis, absolute): Davies-Bouldin, Dunn, ρ, λ
- relative (dynamic: re-cluster stability): dilution analysis, elbow criterion, GAP statistics
10-12 (1) External Validity Measures: F-Measure [Figures: the data set with a class i and a cluster j highlighted.]
13-19 (1) External Validity Measures: F-Measure. Cluster j (node-based analysis)
Contingency of truth and hypothesis for the objects of cluster j:
                 Truth P    Truth N
  Hypothesis P   TP (a)     FP (b)
  Hypothesis N   FN (c)     TN (d)
Precision: a / (a + b)
Recall: a / (a + c)
F-measure: F_α = (1 + α) / (1/precision + α/recall)
  α = 1: harmonic mean
  α ∈ (0; 1): favor precision over recall
  α > 1: favor recall over precision
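The precision, recall, and F_α definitions above translate directly into code. A minimal sketch (the function name and the example counts are ours, not from the slides):

```python
def f_measure(a, b, c, alpha=1.0):
    """F_alpha from the contingency counts a = TP, b = FP, c = FN.

    alpha = 1 gives the harmonic mean of precision and recall;
    alpha < 1 favors precision, alpha > 1 favors recall."""
    precision = a / (a + b)
    recall = a / (a + c)
    return (1 + alpha) / (1 / precision + alpha / recall)
```

For example, with a = 8, b = 2, c = 8 we get precision 0.8 and recall 0.5; shrinking α pulls F_α towards the precision value.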
20-23 (1) External Validity Measures: F-Measure. Examples [figures: a target class and the objects found in a cluster]:
- Precision 0.94, Recall 0.26, F-Measure 0.40: high precision, low recall → low F-measure.
- Precision 0.53, Recall 0.59, F-Measure 0.56: low precision, low recall → low F-measure.
- Precision 0.99, Recall 0.92, F-Measure 0.95: high precision, high recall → high F-measure.
24-26 (1) External Validity Measures: F-Measure. Cluster j (node-based analysis)
Clustering C = {C_1, ..., C_k} and classification C* = {C*_1, ..., C*_l} of D.
F_{i,j} is the F-measure of a cluster j computed with respect to a class i.
Precision of cluster j with respect to class i: Prec_{i,j} = |C_j ∩ C*_i| / |C_j| (here: Prec_{i,j} = 0.71)
Recall of cluster j with respect to class i: Rec_{i,j} = |C_j ∩ C*_i| / |C*_i| (here: Rec_{i,j} = 1.0)
Micro-averaged F-measure for D, C, C*: F = Σ_{i=1}^{l} (|C*_i| / |D|) · max_{j=1,...,k} {F_{i,j}}
Macro-averaged F-measure for D, C, C*: F = (1/l) · Σ_{i=1}^{l} max_{j=1,...,k} {F_{i,j}}
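Both averaged variants can be sketched from a class-by-cluster contingency matrix. The following is our own illustration (using the balanced F-measure, α = 1, per class/cluster pair), not code from the slides:

```python
import numpy as np

def micro_macro_f(contingency):
    """contingency[i, j] = |C*_i ∩ C_j|; rows are classes, columns clusters."""
    contingency = np.asarray(contingency, float)
    n = contingency.sum()
    class_sizes = contingency.sum(axis=1)      # |C*_i|
    cluster_sizes = contingency.sum(axis=0)    # |C_j|
    prec = contingency / cluster_sizes         # Prec_{i,j}, broadcast per column
    rec = contingency / class_sizes[:, None]   # Rec_{i,j}, broadcast per row
    with np.errstate(divide="ignore", invalid="ignore"):
        f = np.where(prec + rec > 0, 2 * prec * rec / (prec + rec), 0.0)
    best = f.max(axis=1)                       # max_j F_{i,j} for each class i
    micro = float((class_sizes / n) @ best)    # weighted by class share |C*_i|/|D|
    macro = float(best.mean())                 # unweighted mean over classes
    return micro, macro
```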
27-31 (1) External Validity Measures: Entropy. Cluster j (node-based analysis)
A cluster C acts as information source L. L emits cluster labels L_1, ..., L_l with probabilities P(L_1), ..., P(L_l).
Example: P̂ = 10/14 and P̂ = 4/14 for the two labels in the cluster shown.
Entropy of L: H(L) = −Σ_{i=1}^{l} P(L_i) · log₂(P(L_i))
Entropy of C_j wrt. C*: H(C_j) = −Σ_{C*_i} (|C_j ∩ C*_i| / |C_j|) · log₂(|C_j ∩ C*_i| / |C_j|)
Entropy of C wrt. C*: H(C) = Σ_{C_j ∈ C} (|C_j| / |D|) · H(C_j)
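The two entropy formulas above can be sketched directly from per-class member counts (function names are ours):

```python
from math import log2

def cluster_entropy(counts):
    """H(C_j): entropy of one cluster from its per-class member counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def clustering_entropy(clusters):
    """H(C): size-weighted entropy; clusters = list of per-class count lists."""
    n = sum(sum(c) for c in clusters)
    return sum(sum(c) / n * cluster_entropy(c) for c in clusters)
```

For the example cluster above, cluster_entropy([10, 4]) is about 0.863 bits; a pure cluster has entropy 0.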
32-33 (1) External Validity Measures: Rand, Jaccard. Cluster j (edge-based analysis)
Object pairs (edges) are classified as true positive, false positive, false negative, or true negative.
R(C) = (TP + TN) / (TP + TN + FP + FN) = (TP + TN) / (n(n−1)/2), with n = |D|
J(C) = TP / (TP + FP + FN)
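The edge-based counting can be sketched by enumerating all object pairs (a quadratic but faithful illustration; the function name is ours):

```python
from itertools import combinations

def rand_jaccard(labels_true, labels_pred):
    """Edge-based Rand and Jaccard indices over all n(n-1)/2 object pairs."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_class = labels_true[i] == labels_true[j]
        same_cluster = labels_pred[i] == labels_pred[j]
        if same_cluster and same_class:
            tp += 1          # pair joined in both clustering and reference
        elif same_cluster:
            fp += 1          # joined by the clustering only
        elif same_class:
            fn += 1          # joined by the reference only
        else:
            tn += 1          # separated in both
    rand = (tp + tn) / (tp + tn + fp + fn)
    jaccard = tp / (tp + fp + fn) if tp + fp + fn else 1.0
    return rand, jaccard
```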
34-38 (2) Internal Validity Measures: Edge Correlation [Tan/Steinbach/Kumar 2005]
Construct an occurrence matrix based on the cluster analysis.
Compare the similarity matrix to the occurrence matrix: correlation τ.
[Figures: two k-means clusterings with their correlations τ (e.g., τ = 0.58).]
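A sketch of the idea, assuming the occurrence matrix holds 1 where two objects share a cluster and 0 otherwise (as in Tan/Steinbach/Kumar), and that τ is the Pearson correlation over the off-diagonal entries:

```python
import numpy as np

def edge_correlation(similarity, labels):
    """Correlation between the similarity matrix and the cluster
    occurrence matrix (entry 1 iff two objects share a cluster)."""
    similarity = np.asarray(similarity, float)
    labels = np.asarray(labels)
    occurrence = (labels[:, None] == labels[None, :]).astype(float)
    iu = np.triu_indices(len(labels), k=1)   # each pair once, no diagonal
    return float(np.corrcoef(similarity[iu], occurrence[iu])[0, 1])
```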
39-43 (2) Internal Validity Measures: Edge Correlation [Tan/Steinbach/Kumar 2005] [Figures: similarity matrices sorted by cluster label, for k-means at structured data, DBSCAN at random data, k-means at random data, complete link at random data, and DBSCAN at structured data.]
44-47 (2) Internal Validity Measures: Structural Analysis
Distance of two clusters, d_C(C_1, C_2).
Diameter of a cluster, Δ(C).
Scatter within a cluster, σ²(C), SSE.
48-52 (2) Internal Validity Measures: Dunn Index
I(C) = min_{i≠j} { d_C(C_i, C_j) } / max_{1≤l≤k} { Δ(C_l) },   I(C) → max
Dunn is susceptible to noise.
Dunn is biased towards the worst substructure in a clustering (cf. the min).
Dunn cannot put distances and diameters into relation.
[Figure: clusters C_1, C_2, C_3 with diameter Δ(C_1) and distance d_C(C_1, C_2).]
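The slides leave d_C and Δ open; a sketch assuming single-link cluster distance and maximum pairwise distance as diameter (both common instantiations, our choice):

```python
import numpy as np

def dunn_index(points, labels):
    """Dunn index: smallest inter-cluster distance (single link)
    over largest cluster diameter (max pairwise distance)."""
    points = np.asarray(points, float)
    labels = np.asarray(labels)
    # full pairwise Euclidean distance matrix
    dist = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    ids = np.unique(labels)
    diameters = [dist[np.ix_(labels == c, labels == c)].max() for c in ids]
    separations = [dist[np.ix_(labels == a, labels == b)].min()
                   for a in ids for b in ids if a < b]
    return min(separations) / max(diameters)
```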
53-61 (2) Internal Validity Measures: Expected Density ρ [Stein/Meyer zu Eissen 2007]
Different models (feature sets) yield different similarity graphs.
Compare (for alternative clusterings) the similarity density within the clusters to the average similarity of the entire graph.
Graph G = ⟨V, E⟩: G is called sparse [dense] if |E| = O(|V|) [O(|V|²)]; the density θ computes from the equation |E| = |V|^θ.
Similarity graph G = ⟨V, E, w⟩ with w(G) = Σ_{e ∈ E} w(e): the density θ computes from the equation w(G) = |V|^θ.
Induced subgraph G_i for class C_i: the expected density ρ compares class C_i to the density average in G, ρ(G_i) = w(G_i) / |V_i|^θ.
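A sketch of the computation, assuming similarities are given as a symmetric matrix, each undirected edge is counted once, and the cluster terms are combined with the size weights used later in the ρ formula on the scatter-plot slide:

```python
import numpy as np

def expected_density(similarity, labels):
    """ρ = Σ_i (|V_i|/|V|) · w(G_i) / |V_i|^θ, where θ solves w(G) = |V|^θ."""
    similarity = np.asarray(similarity, float)
    labels = np.asarray(labels)
    n = len(labels)
    w_total = np.triu(similarity, k=1).sum()   # w(G): each edge counted once
    theta = np.log(w_total) / np.log(n)        # from w(G) = n^θ
    rho = 0.0
    for c in np.unique(labels):
        members = labels == c
        n_i = members.sum()
        w_i = np.triu(similarity[np.ix_(members, members)], k=1).sum()
        rho += (n_i / n) * w_i / n_i ** theta
    return float(rho)
```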
62-64 (3) Relative Validity Measures: Elbow Criterion
1. Hyperparameter alternatives of a clustering algorithm: π_1, ..., π_m (number of centroids for k-means, stopping level for hierarchical algorithms, neighborhood size for DBSCAN).
2. Set of clusterings C = {C_{π_1}, ..., C_{π_m}} associated with π_1, ..., π_m.
3. Points of an error curve {(π_i, e(C_{π_i})) : i = 1, ..., m}; for example e = SSE and π = cluster number k, 1 ≤ k ≤ |V|.
4. Find the point that maximizes the error drop with respect to its predecessor.
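Step 4 can be sketched in a few lines (the function name is ours; the error curve is assumed sorted by parameter value):

```python
def elbow(params, errors):
    """Return the parameter whose error shows the largest drop
    relative to its predecessor on the error curve."""
    drops = [errors[i - 1] - errors[i] for i in range(1, len(errors))]
    return params[drops.index(max(drops)) + 1]
```

For instance, on the curve (1, 100), (2, 40), (3, 35), (4, 33), (5, 32) the largest drop happens at k = 2, which the elbow criterion selects.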
65-66 (3) Relative Validity Measures: Elbow Criterion
[Figures: d_C: Hamming distance; merging: complete link and group average link. Relations between 1377 file systems for the Linux kernel. Razvan Musaloiu 2009]
67-68 Correlation between External and Internal Measures
In the wild, we are not given a reference classification; an external evaluation is not possible (though many papers report on such experiments). Resort to an internal evaluation (connectivity, squared error sums, distance-diameter heuristics, etc.).
To which extent can an internal evaluation φ be used to predict, for a clustering, its distance from the best reference classification, say, to predict the F-measure?
argmax_φ { τ(X, Y) : x = F(C), y = φ(C), C ∈ C } [Stein/Meyer zu Eissen 2007]
69 Correlation between External and Internal Measures [Scatter plot: F-Measure vs. cluster validity measure, showing perfect correlation (desired).]
70 Correlation between External and Internal Measures [Scatter plot over a document collection (800 documents): F-Measure vs. Davies-Bouldin.] Davies-Bouldin: (1/k) · Σ_{i=1}^{k} max_{j≠i} { (s(C_i) + s(C_j)) / d_C(C_i, C_j) }. Prefers spherical clusters.
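A sketch of the Davies-Bouldin formula above, assuming s(C) is the mean distance of a cluster's members to its centroid and d_C the centroid distance (common instantiations, our choice):

```python
import numpy as np

def davies_bouldin(points, labels):
    """DB index: average over clusters of the worst
    (scatter_i + scatter_j) / centroid-distance ratio."""
    points = np.asarray(points, float)
    labels = np.asarray(labels)
    ids = np.unique(labels)
    centroids = np.array([points[labels == c].mean(axis=0) for c in ids])
    scatter = np.array([np.linalg.norm(points[labels == c] - centroids[i],
                                       axis=1).mean()
                        for i, c in enumerate(ids)])
    k = len(ids)
    ratios = [max((scatter[i] + scatter[j]) /
                  np.linalg.norm(centroids[i] - centroids[j])
                  for j in range(k) if j != i)
              for i in range(k)]
    return sum(ratios) / k
```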
71 Correlation between External and Internal Measures [Scatter plot over a document collection (800 documents): F-Measure vs. Dunn Index.] Dunn Index: min_{i≠j} { d_C(C_i, C_j) } / max_{1≤l≤k} { Δ(C_l) }. Maximizes dilatation = inter/intra-cluster-diameter.
72 Correlation between External and Internal Measures [Scatter plot over a document collection (800 documents): F-Measure vs. Expected Density.] Expected Density: ρ = Σ_{i=1}^{k} (|V_i| / |V|) · w(G_i) / |V_i|^θ. Independent of cluster forms and sizes.
More informationFINAL: CS 6375 (Machine Learning) Fall 2014
FINAL: CS 6375 (Machine Learning) Fall 2014 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run out of room for
More informationIntroduction to Supervised Learning. Performance Evaluation
Introduction to Supervised Learning Performance Evaluation Marcelo S. Lauretto Escola de Artes, Ciências e Humanidades, Universidade de São Paulo marcelolauretto@usp.br Lima - Peru Performance Evaluation
More informationModern Information Retrieval
Modern Information Retrieval Chapter 8 Text Classification Introduction A Characterization of Text Classification Unsupervised Algorithms Supervised Algorithms Feature Selection or Dimensionality Reduction
More informationClass 4: Classification. Quaid Morris February 11 th, 2011 ML4Bio
Class 4: Classification Quaid Morris February 11 th, 211 ML4Bio Overview Basic concepts in classification: overfitting, cross-validation, evaluation. Linear Discriminant Analysis and Quadratic Discriminant
More information2018 EE448, Big Data Mining, Lecture 4. (Part I) Weinan Zhang Shanghai Jiao Tong University
2018 EE448, Big Data Mining, Lecture 4 Supervised Learning (Part I) Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net http://wnzhang.net/teaching/ee448/index.html Content of Supervised Learning
More informationIndependence and chromatic number (and random k-sat): Sparse Case. Dimitris Achlioptas Microsoft
Independence and chromatic number (and random k-sat): Sparse Case Dimitris Achlioptas Microsoft Random graphs W.h.p.: with probability that tends to 1 as n. Hamiltonian cycle Let τ 2 be the moment all
More informationUniversity of Florida CISE department Gator Engineering. Clustering Part 1
Clustering Part 1 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville What is Cluster Analysis? Finding groups of objects such that the objects
More informationResonances, Divergent series and R.G.: (G. Gentile, G.G.) α = ε α f(α) α 2. 0 ν Z r
B Resonances, Divergent series and R.G.: (G. Gentile, G.G.) Eq. Motion: α = ε α f(α) A A 2 α α 2 α...... l A l Representation of phase space in terms of l rotators: α=(α,...,α l ) T l Potential: f(α) =
More informationEnabling Quality Control for Entity Resolution: A Human and Machine Cooperation Framework
School of Computer Science, Northwestern Polytechnical University Enabling Quality Control for Entity Resolution: A Human and Machine Cooperation Framework Zhaoqiang Chen, Qun Chen, Fengfeng Fan, Yanyan
More informationCS249: ADVANCED DATA MINING
CS249: ADVANCED DATA MINING Support Vector Machine and Neural Network Instructor: Yizhou Sun yzsun@cs.ucla.edu April 24, 2017 Homework 1 Announcements Due end of the day of this Friday (11:59pm) Reminder
More informationData Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts
Data Mining: Concepts and Techniques (3 rd ed.) Chapter 8 1 Chapter 8. Classification: Basic Concepts Classification: Basic Concepts Decision Tree Induction Bayes Classification Methods Rule-Based Classification
More informationQB LECTURE #4: Motif Finding
QB LECTURE #4: Motif Finding Adam Siepel Nov. 20, 2015 2 Plan for Today Probability models for binding sites Scoring and detecting binding sites De novo motif finding 3 Transcription Initiation Chromatin
More informationClustering using Mixture Models
Clustering using Mixture Models The full posterior of the Gaussian Mixture Model is p(x, Z, µ,, ) =p(x Z, µ, )p(z )p( )p(µ, ) data likelihood (Gaussian) correspondence prob. (Multinomial) mixture prior
More informationCSC 411: Lecture 03: Linear Classification
CSC 411: Lecture 03: Linear Classification Richard Zemel, Raquel Urtasun and Sanja Fidler University of Toronto Zemel, Urtasun, Fidler (UofT) CSC 411: 03-Classification 1 / 24 Examples of Problems What
More informationFINAL EXAM: FALL 2013 CS 6375 INSTRUCTOR: VIBHAV GOGATE
FINAL EXAM: FALL 2013 CS 6375 INSTRUCTOR: VIBHAV GOGATE You are allowed a two-page cheat sheet. You are also allowed to use a calculator. Answer the questions in the spaces provided on the question sheets.
More informationInteraction Network Analysis
CSI/BIF 5330 Interaction etwork Analsis Young-Rae Cho Associate Professor Department of Computer Science Balor Universit Biological etworks Definition Maps of biochemical reactions, interactions, regulations
More informationClustering Uncertain Data Based on Probability Distribution Similarity
CLUSTERING UNCERTAIN DATA BASED ON PROBABILITY DISTRIBUTION SIMILARITY Clustering Uncertain Data Based on Probability Distribution Similarity Bin Jiang, Jian Pei, Yufei Tao, and Xuemin Lin Abstract Clustering
More information1 EM algorithm: updating the mixing proportions {π k } ik are the posterior probabilities at the qth iteration of EM.
Université du Sud Toulon - Var Master Informatique Probabilistic Learning and Data Analysis TD: Model-based clustering by Faicel CHAMROUKHI Solution The aim of this practical wor is to show how the Classification
More informationBinary Decision Diagrams
Binary Decision Diagrams Binary Decision Diagrams (BDDs) are a class of graphs that can be used as data structure for compactly representing boolean functions. BDDs were introduced by R. Bryant in 1986.
More informationWhen Dictionary Learning Meets Classification
When Dictionary Learning Meets Classification Bufford, Teresa 1 Chen, Yuxin 2 Horning, Mitchell 3 Shee, Liberty 1 Mentor: Professor Yohann Tendero 1 UCLA 2 Dalhousie University 3 Harvey Mudd College August
More informationRecall: Matchings. Examples. K n,m, K n, Petersen graph, Q k ; graphs without perfect matching
Recall: Matchings A matching is a set of (non-loop) edges with no shared endpoints. The vertices incident to an edge of a matching M are saturated by M, the others are unsaturated. A perfect matching of
More informationAn Introduction to Expectation-Maximization
An Introduction to Expectation-Maximization Dahua Lin Abstract This notes reviews the basics about the Expectation-Maximization EM) algorithm, a popular approach to perform model estimation of the generative
More informationStatistical Pattern Recognition
Statistical Pattern Recognition Feature Extraction Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi, Payam Siyari Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Agenda Dimensionality Reduction
More informationNotes on MapReduce Algorithms
Notes on MapReduce Algorithms Barna Saha 1 Finding Minimum Spanning Tree of a Dense Graph in MapReduce We are given a graph G = (V, E) on V = N vertices and E = m N 1+c edges for some constant c > 0. Our
More informationProjective Clustering by Histograms
Projective Clustering by Histograms Eric Ka Ka Ng, Ada Wai-chee Fu and Raymond Chi-Wing Wong, Member, IEEE Abstract Recent research suggests that clustering for high dimensional data should involve searching
More informationInformation Retrieval Tutorial 6: Evaluation
Information Retrieval Tutorial 6: Evaluation Professor: Michel Schellekens TA: Ang Gao University College Cork 2012-11-30 IR Evaluation 1 / 19 Overview IR Evaluation 2 / 19 Precision and recall Precision
More informationthe tree till a class assignment is reached
Decision Trees Decision Tree for Playing Tennis Prediction is done by sending the example down Prediction is done by sending the example down the tree till a class assignment is reached Definitions Internal
More informationNotes on Machine Learning for and
Notes on Machine Learning for 16.410 and 16.413 (Notes adapted from Tom Mitchell and Andrew Moore.) Choosing Hypotheses Generally want the most probable hypothesis given the training data Maximum a posteriori
More informationGraph Partitioning Using Random Walks
Graph Partitioning Using Random Walks A Convex Optimization Perspective Lorenzo Orecchia Computer Science Why Spectral Algorithms for Graph Problems in practice? Simple to implement Can exploit very efficient
More informationVariations of Logistic Regression with Stochastic Gradient Descent
Variations of Logistic Regression with Stochastic Gradient Descent Panqu Wang(pawang@ucsd.edu) Phuc Xuan Nguyen(pxn002@ucsd.edu) January 26, 2012 Abstract In this paper, we extend the traditional logistic
More informationMachine Learning. Clustering 1. Hamid Beigy. Sharif University of Technology. Fall 1395
Machine Learning Clustering 1 Hamid Beigy Sharif University of Technology Fall 1395 1 Some slides are taken from P. Rai slides Hamid Beigy (Sharif University of Technology) Machine Learning Fall 1395 1
More informationModels, Data, Learning Problems
Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Models, Data, Learning Problems Tobias Scheffer Overview Types of learning problems: Supervised Learning (Classification, Regression,
More information2018 CS420, Machine Learning, Lecture 5. Tree Models. Weinan Zhang Shanghai Jiao Tong University
2018 CS420, Machine Learning, Lecture 5 Tree Models Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net http://wnzhang.net/teaching/cs420/index.html ML Task: Function Approximation Problem setting
More informationEntailment with Conditional Equality Constraints (Extended Version)
Entailment with Conditional Equality Constraints (Extended Version) Zhendong Su Alexander Aiken Report No. UCB/CSD-00-1113 October 2000 Computer Science Division (EECS) University of California Berkeley,
More informationThe Complexity of Entailment Problems over Conditional Equality Constraints
The Complexity of Entailment Problems over Conditional Equality Constraints Zhendong Su Department of Computer Science, University of California, Davis, CA 95616-8562 +1 530 7545376 (phone) +1 530 7524767
More informationDECISION TREE LEARNING. [read Chapter 3] [recommended exercises 3.1, 3.4]
1 DECISION TREE LEARNING [read Chapter 3] [recommended exercises 3.1, 3.4] Decision tree representation ID3 learning algorithm Entropy, Information gain Overfitting Decision Tree 2 Representation: Tree-structured
More informationABC-LogitBoost for Multi-Class Classification
Ping Li, Cornell University ABC-Boost BTRY 6520 Fall 2012 1 ABC-LogitBoost for Multi-Class Classification Ping Li Department of Statistical Science Cornell University 2 4 6 8 10 12 14 16 2 4 6 8 10 12
More informationManifold Coarse Graining for Online Semi-supervised Learning
for Online Semi-supervised Learning Mehrdad Farajtabar, Amirreza Shaban, Hamid R. Rabiee, Mohammad H. Rohban Digital Media Lab, Department of Computer Engineering, Sharif University of Technology, Tehran,
More informationLearning Methods for Linear Detectors
Intelligent Systems: Reasoning and Recognition James L. Crowley ENSIMAG 2 / MoSIG M1 Second Semester 2011/2012 Lesson 20 27 April 2012 Contents Learning Methods for Linear Detectors Learning Linear Detectors...2
More informationActive and Semi-supervised Kernel Classification
Active and Semi-supervised Kernel Classification Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London Work done in collaboration with Xiaojin Zhu (CMU), John Lafferty (CMU),
More informationCS-E4830 Kernel Methods in Machine Learning
CS-E4830 Kernel Methods in Machine Learning Lecture 5: Multi-class and preference learning Juho Rousu 11. October, 2017 Juho Rousu 11. October, 2017 1 / 37 Agenda from now on: This week s theme: going
More information