Data Mining and Analysis: Fundamental Concepts and Algorithms


1 Data Mining and Analysis: Fundamental Concepts and Algorithms. dataminingbook.info. Mohammed J. Zaki (Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA) and Wagner Meira Jr. (Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil). Chapter 17: Clustering Validation

2 Clustering Validation and Evaluation. Cluster validation and assessment encompasses three main tasks: clustering evaluation seeks to assess the goodness or quality of the clustering; clustering stability seeks to understand the sensitivity of the clustering result to various algorithmic parameters, for example, the number of clusters; and clustering tendency assesses the suitability of applying clustering in the first place, that is, whether the data has any inherent grouping structure. Validity measures can be divided into three main types. External: external validation measures employ criteria that are not inherent to the dataset, e.g., class labels. Internal: internal validation measures employ criteria that are derived from the data itself, e.g., intracluster and intercluster distances. Relative: relative validation measures aim to directly compare different clusterings, usually those obtained via different parameter settings for the same algorithm.

3 External Measures. External measures assume that the correct or ground-truth clustering is known a priori and use it to evaluate a given clustering. Let $D = \{x_i\}_{i=1}^{n}$ be a dataset consisting of $n$ points in a $d$-dimensional space, partitioned into $k$ clusters. Let $y_i \in \{1, 2, \ldots, k\}$ denote the ground-truth cluster membership or label information for each point. The ground-truth clustering is given as $T = \{T_1, T_2, \ldots, T_k\}$, where the cluster $T_j$ consists of all the points with label $j$, i.e., $T_j = \{x_i \in D \mid y_i = j\}$. We refer to $T$ as the ground-truth partitioning, and to each $T_i$ as a partition. Let $C = \{C_1, \ldots, C_r\}$ denote a clustering of the same dataset into $r$ clusters, obtained via some clustering algorithm, and let $\hat{y}_i \in \{1, 2, \ldots, r\}$ denote the cluster label for $x_i$.

4 External Measures. External evaluation measures try to capture the extent to which points from the same partition appear in the same cluster, and the extent to which points from different partitions are grouped in different clusters. All of the external measures rely on the $r \times k$ contingency table $N$ induced by a clustering $C$ and the ground-truth partitioning $T$, defined as $N(i, j) = n_{ij} = |C_i \cap T_j|$. The count $n_{ij}$ denotes the number of points that are common to cluster $C_i$ and ground-truth partition $T_j$. Let $n_i = |C_i|$ denote the number of points in cluster $C_i$, and let $m_j = |T_j|$ denote the number of points in partition $T_j$. The contingency table can be computed from $T$ and $C$ in $O(n)$ time by examining the partition and cluster labels, $y_i$ and $\hat{y}_i$, for each point $x_i \in D$ and incrementing the corresponding count $n_{\hat{y}_i y_i}$.

5 Matching Based Measures: Purity. Purity quantifies the extent to which a cluster $C_i$ contains entities from only one partition: $\text{purity}_i = \frac{1}{n_i} \max_{j=1}^{k}\{n_{ij}\}$. The purity of clustering $C$ is defined as the weighted sum of the clusterwise purity values: $\text{purity} = \sum_{i=1}^{r} \frac{n_i}{n}\, \text{purity}_i = \frac{1}{n} \sum_{i=1}^{r} \max_{j=1}^{k}\{n_{ij}\}$, where the ratio $n_i/n$ denotes the fraction of points in cluster $C_i$.
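To make the computation concrete, the following minimal NumPy sketch (not from the book) evaluates purity directly from an r x k contingency table; the example table values are hypothetical.

```python
import numpy as np

def purity(N):
    """Purity from an r x k contingency table N, where N[i, j] = |C_i ∩ T_j|."""
    N = np.asarray(N, dtype=float)
    n = N.sum()
    # purity = (1/n) * sum_i max_j n_ij  (weighted sum of clusterwise purities)
    return N.max(axis=1).sum() / n

# hypothetical 3 x 3 contingency table (rows: clusters, columns: partitions)
N = np.array([[ 0, 47, 14],
              [50,  0,  0],
              [ 0,  3, 36]])
print(round(purity(N), 3))   # 0.887 for this particular table
```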

6 Matching Based Measures: Maximum Matching. The maximum matching measure selects the mapping between clusters and partitions such that the sum of the number of common points ($n_{ij}$) is maximized, provided that only one cluster can match with a given partition. Let $G$ be a bipartite graph over the vertex set $V = C \cup T$, with edge set $E = \{(C_i, T_j)\}$ and edge weights $w(C_i, T_j) = n_{ij}$. A matching $M$ in $G$ is a subset of $E$ such that the edges in $M$ are pairwise nonadjacent, that is, they do not share a common vertex. The maximum matching measure is given as $\text{match} = \max_{M} \left\{ \frac{w(M)}{n} \right\}$, where $w(M)$ is the sum of all the edge weights in matching $M$, given as $w(M) = \sum_{e \in M} w(e)$.

7 Matching Based Measures: F-measure. Given cluster $C_i$, let $j_i$ denote the partition that contains the maximum number of points from $C_i$, that is, $j_i = \arg\max_{j=1}^{k}\{n_{ij}\}$. The precision of a cluster $C_i$ is the same as its purity: $\text{prec}_i = \frac{1}{n_i}\max_{j=1}^{k}\{n_{ij}\} = \frac{n_{i j_i}}{n_i}$. The recall of cluster $C_i$ is defined as $\text{recall}_i = \frac{n_{i j_i}}{|T_{j_i}|} = \frac{n_{i j_i}}{m_{j_i}}$, where $m_{j_i} = |T_{j_i}|$. The F-measure is the harmonic mean of the precision and recall values for each cluster $C_i$: $F_i = \frac{2}{\frac{1}{\text{prec}_i} + \frac{1}{\text{recall}_i}} = \frac{2\,\text{prec}_i\,\text{recall}_i}{\text{prec}_i + \text{recall}_i} = \frac{2\, n_{i j_i}}{n_i + m_{j_i}}$. The F-measure for the clustering $C$ is the mean of the clusterwise F-measure values: $F = \frac{1}{r}\sum_{i=1}^{r} F_i$.
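A small sketch of the clusterwise precision, recall, and F-measure computed from the same kind of contingency table (again with hypothetical values):

```python
import numpy as np

def fmeasure(N):
    """Clusterwise precision/recall and the clustering F-measure from an r x k contingency table."""
    N = np.asarray(N, dtype=float)
    n_i = N.sum(axis=1)                     # cluster sizes |C_i|
    m_j = N.sum(axis=0)                     # partition sizes |T_j|
    j_i = N.argmax(axis=1)                  # j_i = argmax_j n_ij for each cluster
    n_iji = N[np.arange(N.shape[0]), j_i]   # n_{i j_i}
    prec = n_iji / n_i                      # precision_i (same as purity_i)
    rec = n_iji / m_j[j_i]                  # recall_i w.r.t. the matched partition
    F_i = 2 * n_iji / (n_i + m_j[j_i])      # harmonic mean, simplified form
    return F_i.mean(), prec, rec

# hypothetical contingency table (rows: clusters, columns: partitions)
N = np.array([[ 0, 47, 14],
              [50,  0,  0],
              [ 0,  3, 36]])
print(round(fmeasure(N)[0], 3))
```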

8 K-means: Iris Principal Components Data (Good Case). Scatter plot in the principal components ($u_1$, $u_2$) and contingency table over the ground-truth partitions $T_1$ (iris-setosa), $T_2$ (iris-versicolor), $T_3$ (iris-virginica) and the clusters $C_1$ (squares), $C_2$ (circles), $C_3$ (triangles), with row sums $n_i$ and column sums $m_j$. For this clustering: purity = 0.887, match = 0.887.

9 K-means: Iris Principal Components Data (Bad Case). Scatter plot in the principal components ($u_1$, $u_2$) and contingency table over $T_1$ (iris-setosa), $T_2$ (iris-versicolor), $T_3$ (iris-virginica) and the clusters $C_1$ (squares), $C_2$ (circles), $C_3$ (triangles), with $n = 150$. For this clustering: purity = 0.667, match = 0.560.

10 Entropy-based Measures: Conditional Entropy. The entropy of a clustering $C$ and of a partitioning $T$ is given as $H(C) = -\sum_{i=1}^{r} p_{C_i} \log p_{C_i}$ and $H(T) = -\sum_{j=1}^{k} p_{T_j} \log p_{T_j}$, where $p_{C_i} = n_i/n$ and $p_{T_j} = m_j/n$ are the probabilities of cluster $C_i$ and partition $T_j$. The cluster-specific entropy of $T$, that is, the conditional entropy of $T$ with respect to cluster $C_i$, is defined as $H(T \mid C_i) = -\sum_{j=1}^{k} \left(\frac{n_{ij}}{n_i}\right) \log\left(\frac{n_{ij}}{n_i}\right)$.

11 Entropy-based Measures: Conditional Entropy. The conditional entropy of $T$ given clustering $C$ is defined as the weighted sum: $H(T \mid C) = \sum_{i=1}^{r} \frac{n_i}{n} H(T \mid C_i) = -\sum_{i=1}^{r}\sum_{j=1}^{k} p_{ij} \log\left(\frac{p_{ij}}{p_{C_i}}\right) = H(C, T) - H(C)$, where $p_{ij} = n_{ij}/n$ is the probability that a point is in cluster $C_i$ and also belongs to partition $T_j$, and where $H(C, T) = -\sum_{i=1}^{r}\sum_{j=1}^{k} p_{ij} \log p_{ij}$ is the joint entropy of $C$ and $T$. $H(T \mid C) = 0$ if and only if $T$ is completely determined by $C$, corresponding to the ideal clustering. If $C$ and $T$ are independent of each other, then $H(T \mid C) = H(T)$.

12 Entropy-based Measures: Normalized Mutual Information. The mutual information quantifies the amount of shared information between the clustering $C$ and partitioning $T$; it is defined as $I(C, T) = \sum_{i=1}^{r}\sum_{j=1}^{k} p_{ij} \log\left(\frac{p_{ij}}{p_{C_i}\, p_{T_j}}\right)$. When $C$ and $T$ are independent, then $p_{ij} = p_{C_i}\, p_{T_j}$, and thus $I(C, T) = 0$. However, there is no upper bound on the mutual information. The normalized mutual information (NMI) is defined as the geometric mean: $\text{NMI}(C, T) = \sqrt{\frac{I(C, T)}{H(C)} \cdot \frac{I(C, T)}{H(T)}} = \frac{I(C, T)}{\sqrt{H(C)\, H(T)}}$. The NMI value lies in the range $[0, 1]$. Values close to 1 indicate a good clustering.
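Since the conditional entropy, NMI, and the variation of information (introduced on the next slide) are all functions of the contingency table, one hedged sketch covers them; the log base and the example table are assumptions, not the book's code.

```python
import numpy as np

def entropy_measures(N):
    """Conditional entropy H(T|C), NMI, and VI from an r x k contingency table N."""
    N = np.asarray(N, dtype=float)
    n = N.sum()
    p_ij = N / n                      # joint probabilities
    p_C = p_ij.sum(axis=1)            # cluster probabilities p_Ci = n_i / n
    p_T = p_ij.sum(axis=0)            # partition probabilities p_Tj = m_j / n

    def H(p):                         # entropy, skipping zero entries
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    H_C, H_T, H_CT = H(p_C), H(p_T), H(p_ij.ravel())
    H_T_given_C = H_CT - H_C          # H(T|C) = H(C,T) - H(C)
    I = H_C + H_T - H_CT              # mutual information I(C, T)
    NMI = I / np.sqrt(H_C * H_T)
    VI = H_T + H_C - 2 * I
    return H_T_given_C, NMI, VI

N = np.array([[ 0, 47, 14],
              [50,  0,  0],
              [ 0,  3, 36]])
print(entropy_measures(N))
```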

13 Entropy-based Measures: Variation of Information. This criterion is based on the mutual information between the clustering $C$ and the ground-truth partitioning $T$, and their entropies; it is defined as $VI(C, T) = (H(T) - I(C, T)) + (H(C) - I(C, T)) = H(T) + H(C) - 2 I(C, T)$. Variation of information (VI) is zero only when $C$ and $T$ are identical. Thus, the lower the VI value the better the clustering $C$. VI can also be expressed as $VI(C, T) = H(T \mid C) + H(C \mid T)$, or equivalently $VI(C, T) = 2 H(T, C) - H(T) - H(C)$.

14 K-means: Iris Principal Components Data (Good vs. Bad Case). Scatter plots in the principal components ($u_1$, $u_2$) for (a) the good and (b) the bad K-means clustering, together with a table comparing purity, match, $F$, $H(T \mid C)$, NMI, and VI for the two cases.

15 Pairwise Measures. Given clustering $C$ and ground-truth partitioning $T$, let $x_i, x_j \in D$ be any two points, with $i \neq j$. Let $y_i$ denote the true partition label and let $\hat{y}_i$ denote the cluster label for point $x_i$. If both $x_i$ and $x_j$ belong to the same cluster, that is, $\hat{y}_i = \hat{y}_j$, we call it a positive event, and if they do not belong to the same cluster, that is, $\hat{y}_i \neq \hat{y}_j$, we call that a negative event. Depending on whether there is agreement between the cluster labels and partition labels, there are four possibilities to consider. True Positives: $x_i$ and $x_j$ belong to the same partition in $T$, and they are also in the same cluster in $C$. The number of true positive pairs is given as $TP = |\{(x_i, x_j) : y_i = y_j \text{ and } \hat{y}_i = \hat{y}_j\}|$. False Negatives: $x_i$ and $x_j$ belong to the same partition in $T$, but they do not belong to the same cluster in $C$. The number of false negative pairs is given as $FN = |\{(x_i, x_j) : y_i = y_j \text{ and } \hat{y}_i \neq \hat{y}_j\}|$.

16 Pairwise Measures. False Positives: $x_i$ and $x_j$ do not belong to the same partition in $T$, but they do belong to the same cluster in $C$. The number of false positive pairs is given as $FP = |\{(x_i, x_j) : y_i \neq y_j \text{ and } \hat{y}_i = \hat{y}_j\}|$. True Negatives: $x_i$ and $x_j$ neither belong to the same partition in $T$, nor do they belong to the same cluster in $C$. The number of such true negative pairs is given as $TN = |\{(x_i, x_j) : y_i \neq y_j \text{ and } \hat{y}_i \neq \hat{y}_j\}|$. Because there are $N = \binom{n}{2} = \frac{n(n-1)}{2}$ pairs of points, we have the following identity: $N = TP + FN + FP + TN$.

17 Pairwise Measures: TP, TN, FP, FN. They can be computed efficiently using the contingency table $N = \{n_{ij}\}$. The number of true positives is given as $TP = \frac{1}{2}\left(\sum_{i=1}^{r}\sum_{j=1}^{k} n_{ij}^2 - n\right)$. The false negatives can be computed as $FN = \frac{1}{2}\left(\sum_{j=1}^{k} m_j^2 - \sum_{i=1}^{r}\sum_{j=1}^{k} n_{ij}^2\right)$. The number of false positives is $FP = \frac{1}{2}\left(\sum_{i=1}^{r} n_i^2 - \sum_{i=1}^{r}\sum_{j=1}^{k} n_{ij}^2\right)$. Finally, the number of true negatives can be obtained via $TN = N - (TP + FN + FP) = \frac{1}{2}\left(n^2 + \sum_{i=1}^{r}\sum_{j=1}^{k} n_{ij}^2 - \sum_{i=1}^{r} n_i^2 - \sum_{j=1}^{k} m_j^2\right)$.
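A sketch of these four pair counts computed from a contingency table, assuming integer counts:

```python
import numpy as np
from math import comb

def pair_counts(N):
    """TP, FN, FP, TN pair counts from an r x k contingency table N."""
    N = np.asarray(N, dtype=np.int64)
    n = int(N.sum())
    n_i = N.sum(axis=1)                     # cluster sizes
    m_j = N.sum(axis=0)                     # partition sizes
    s = int((N ** 2).sum())                 # sum_ij n_ij^2
    TP = (s - n) // 2
    FN = (int((m_j ** 2).sum()) - s) // 2
    FP = (int((n_i ** 2).sum()) - s) // 2
    TN = comb(n, 2) - TP - FN - FP
    return TP, FN, FP, TN
```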

18 Pairwise Measures: Jaccard Coefficient, Rand Statistic, FM Measure. Jaccard Coefficient: measures the fraction of true positive point pairs, ignoring the true negatives: $\text{Jaccard} = \frac{TP}{TP + FN + FP}$. Rand Statistic: measures the fraction of true positives and true negatives over all point pairs: $\text{Rand} = \frac{TP + TN}{N}$. Fowlkes-Mallows Measure: define the overall pairwise precision and pairwise recall values for a clustering $C$ as $\text{prec} = \frac{TP}{TP + FP}$ and $\text{recall} = \frac{TP}{TP + FN}$. The Fowlkes-Mallows (FM) measure is defined as the geometric mean of the pairwise precision and recall: $FM = \sqrt{\text{prec} \cdot \text{recall}} = \frac{TP}{\sqrt{(TP + FN)(TP + FP)}}$.
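And a small helper that turns the four pair counts into the three pairwise measures; the commented usage line assumes the pair_counts() sketch above.

```python
def pairwise_measures(TP, FN, FP, TN):
    """Jaccard, Rand, and Fowlkes-Mallows measures from the four pair counts."""
    N_pairs = TP + FN + FP + TN
    jaccard = TP / (TP + FN + FP)
    rand = (TP + TN) / N_pairs
    prec = TP / (TP + FP)           # pairwise precision
    rec = TP / (TP + FN)            # pairwise recall
    fm = (prec * rec) ** 0.5        # geometric mean = TP / sqrt((TP+FN)(TP+FP))
    return jaccard, rand, fm

# e.g., using pair_counts() from the previous sketch:
# print(pairwise_measures(*pair_counts(N)))
```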

19 K-means: Iris Principal Components Data (Good Case). Using the good-case contingency table (setosa, versicolor, virginica versus $C_1$, $C_2$, $C_3$), the number of true positives is obtained by summing $\binom{n_{ij}}{2}$ over all cells of the table. Likewise, we have $FN = 645$, $FP = 766$, $TN = 6734$, and $N = \binom{150}{2} = 11175$. We therefore have: Jaccard = 0.682, Rand = 0.887. For the bad clustering, we have: Jaccard = 0.477, Rand = 0.717.

20 Correlation Measures: Hubert Statistic. Let $X$ and $Y$ be two symmetric $n \times n$ matrices, and let $N = \binom{n}{2}$. Let $x, y \in \mathbb{R}^N$ denote the vectors obtained by linearizing the upper triangular elements (excluding the main diagonal) of $X$ and $Y$. Let $\mu_X$ denote the element-wise mean of $x$, given as $\mu_X = \frac{1}{N}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} X(i, j)$, and let $z_x$ denote the centered $x$ vector, defined as $z_x = x - \mathbf{1} \cdot \mu_X$. The Hubert statistic is defined as $\Gamma = \frac{1}{N}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} X(i, j)\, Y(i, j) = \frac{1}{N} x^T y$. The normalized Hubert statistic is defined as the element-wise correlation $\Gamma_n = \frac{z_x^T z_y}{\|z_x\|\, \|z_y\|} = \cos\theta$.

21 Correlation-based Measure: Discretized Hubert Statistic. Let $T$ and $C$ be the $n \times n$ matrices defined as $T(i, j) = 1$ if $y_i = y_j$, $i \neq j$, and 0 otherwise; and $C(i, j) = 1$ if $\hat{y}_i = \hat{y}_j$, $i \neq j$, and 0 otherwise. Let $t, c \in \mathbb{R}^N$ denote the $N$-dimensional vectors comprising the upper triangular elements (excluding the diagonal) of $T$ and $C$, and let $z_t$ and $z_c$ denote the centered $t$ and $c$ vectors. The discretized Hubert statistic is computed by setting $x = t$ and $y = c$: $\Gamma = \frac{1}{N} t^T c = \frac{TP}{N}$. The normalized version of the discretized Hubert statistic is simply the correlation between $t$ and $c$: $\Gamma_n = \frac{z_t^T z_c}{\|z_t\|\, \|z_c\|} = \frac{\frac{TP}{N} - \mu_T \mu_C}{\sqrt{\mu_T \mu_C (1 - \mu_T)(1 - \mu_C)}}$, where $\mu_T = \frac{TP + FN}{N}$ and $\mu_C = \frac{TP + FP}{N}$.

22 Internal Measures. Internal evaluation measures do not have recourse to the ground-truth partitioning. To evaluate the quality of the clustering, internal measures therefore have to utilize notions of intracluster similarity or compactness, contrasted with notions of intercluster separation, usually with a trade-off in maximizing these two aims. The internal measures are based on the $n \times n$ distance matrix, also called the proximity matrix, of all pairwise distances among the $n$ points: $W = \{\delta(x_i, x_j)\}_{i,j=1}^{n}$, where $\delta(x_i, x_j) = \|x_i - x_j\|_2$ is the Euclidean distance between $x_i, x_j \in D$. The proximity matrix $W$ is the adjacency matrix of the weighted complete graph $G$ over the $n$ points, that is, with nodes $V = \{x_i \mid x_i \in D\}$, edges $E = \{(x_i, x_j) \mid x_i, x_j \in D\}$, and edge weights $w_{ij} = W(i, j)$ for all $x_i, x_j \in D$.

23 Internal Measures. The clustering $C$ can be considered as a $k$-way cut in $G$. Given any subsets $S, R \subseteq V$, define $W(S, R)$ as the sum of the weights on all edges with one vertex in $S$ and the other in $R$: $W(S, R) = \sum_{x_i \in S}\sum_{x_j \in R} w_{ij}$. We denote by $\overline{S} = V - S$ the complementary set of vertices. The sums of all the intracluster and intercluster weights are given as $W_{in} = \frac{1}{2}\sum_{i=1}^{k} W(C_i, C_i)$ and $W_{out} = \frac{1}{2}\sum_{i=1}^{k} W(C_i, \overline{C_i}) = \sum_{i=1}^{k-1}\sum_{j>i} W(C_i, C_j)$. The number of distinct intracluster and intercluster edges is given as $N_{in} = \sum_{i=1}^{k}\binom{n_i}{2}$ and $N_{out} = \sum_{i=1}^{k-1}\sum_{j=i+1}^{k} n_i\, n_j$.

24 Clusterings as Graphs: Iris (Good Case). Graph view of the good clustering in the principal components ($u_1$, $u_2$); only intracluster edges shown.

25 Clusterings as Graphs: Iris (Bad Case). Graph view of the bad clustering in the principal components ($u_1$, $u_2$); only intracluster edges shown.

26 Internal Measures: BetaCV and C-index. BetaCV Measure: the BetaCV measure is the ratio of the mean intracluster distance to the mean intercluster distance: $\text{BetaCV} = \frac{W_{in}/N_{in}}{W_{out}/N_{out}} = \frac{N_{out}}{N_{in}} \cdot \frac{\sum_{i=1}^{k} W(C_i, C_i)}{\sum_{i=1}^{k} W(C_i, \overline{C_i})}$. The smaller the BetaCV ratio, the better the clustering. C-index: let $W_{min}(N_{in})$ be the sum of the smallest $N_{in}$ distances in the proximity matrix $W$, where $N_{in}$ is the total number of intracluster edges, or point pairs, and let $W_{max}(N_{in})$ be the sum of the largest $N_{in}$ distances in $W$. The C-index measures to what extent the clustering puts together the $N_{in}$ points that are the closest across the $k$ clusters. It is defined as $\text{Cindex} = \frac{W_{in} - W_{min}(N_{in})}{W_{max}(N_{in}) - W_{min}(N_{in})}$. The C-index lies in the range $[0, 1]$. The smaller the C-index, the better the clustering.
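A rough NumPy sketch (not from the book) of BetaCV and the C-index computed from a data matrix and integer cluster labels; it materializes the full n x n distance matrix, so it is meant only for small datasets.

```python
import numpy as np

def betacv_cindex(X, labels):
    """BetaCV and C-index from a data matrix X (n x d) and integer cluster labels."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # full pairwise distance matrix
    iu = np.triu_indices(n, k=1)                                # each point pair counted once
    d = D[iu]
    same = (labels[:, None] == labels[None, :])[iu]             # True for intracluster pairs
    W_in, W_out = d[same].sum(), d[~same].sum()
    N_in, N_out = int(same.sum()), int((~same).sum())
    betacv = (W_in / N_in) / (W_out / N_out)
    d_sorted = np.sort(d)
    W_min, W_max = d_sorted[:N_in].sum(), d_sorted[-N_in:].sum()
    cindex = (W_in - W_min) / (W_max - W_min)
    return betacv, cindex
```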

27 Internal Measures: Normalized Cut and Modularity. Normalized Cut Measure: the normalized cut objective for graph clustering can also be used as an internal clustering evaluation measure: $NC = \sum_{i=1}^{k} \frac{W(C_i, \overline{C_i})}{\text{vol}(C_i)} = \sum_{i=1}^{k} \frac{W(C_i, \overline{C_i})}{W(C_i, V)}$, where $\text{vol}(C_i) = W(C_i, V)$ is the volume of cluster $C_i$. Because the edge weights here are distances, a good clustering has large intercluster weights, so the higher the normalized cut value the better. Modularity: the modularity objective is given as $Q = \sum_{i=1}^{k}\left(\frac{W(C_i, C_i)}{W(V, V)} - \left(\frac{W(C_i, V)}{W(V, V)}\right)^2\right)$. For the same reason, the smaller the modularity measure the better the clustering.

28 Internal Measures: Dunn Index. The Dunn index is defined as the ratio between the minimum distance between point pairs from different clusters and the maximum distance between point pairs from the same cluster: $\text{Dunn} = \frac{W_{out}^{min}}{W_{in}^{max}}$, where $W_{out}^{min}$ is the minimum intercluster distance, $W_{out}^{min} = \min_{i, j > i}\{w_{ab} \mid x_a \in C_i, x_b \in C_j\}$, and $W_{in}^{max}$ is the maximum intracluster distance, $W_{in}^{max} = \max_{i}\{w_{ab} \mid x_a, x_b \in C_i\}$. The larger the Dunn index the better the clustering, because it means that even the closest distance between points in different clusters is much larger than the farthest distance between points in the same cluster.

29 Internal Measures: Davies-Bouldin Index. Let $\mu_i$ denote the cluster mean, $\mu_i = \frac{1}{n_i}\sum_{x_j \in C_i} x_j$, and let $\sigma_{\mu_i}$ denote the dispersion or spread of the points around the cluster mean, $\sigma_{\mu_i} = \sqrt{\frac{\sum_{x_j \in C_i} \delta(x_j, \mu_i)^2}{n_i}} = \sqrt{\text{var}(C_i)}$. The Davies-Bouldin measure for a pair of clusters $C_i$ and $C_j$ is defined as the ratio $DB_{ij} = \frac{\sigma_{\mu_i} + \sigma_{\mu_j}}{\delta(\mu_i, \mu_j)}$. $DB_{ij}$ measures how compact the clusters are compared to the distance between the cluster means. The Davies-Bouldin index is then defined as $DB = \frac{1}{k}\sum_{i=1}^{k} \max_{j \neq i}\{DB_{ij}\}$. The smaller the DB value the better the clustering.

30 Silhouette Coefficient. Define the silhouette coefficient of a point $x_i$ as $s_i = \frac{\mu_{out}^{min}(x_i) - \mu_{in}(x_i)}{\max\{\mu_{out}^{min}(x_i),\, \mu_{in}(x_i)\}}$, where $\mu_{in}(x_i)$ is the mean distance from $x_i$ to points in its own cluster $\hat{y}_i$, $\mu_{in}(x_i) = \frac{\sum_{x_j \in C_{\hat{y}_i},\, j \neq i} \delta(x_i, x_j)}{n_{\hat{y}_i} - 1}$, and $\mu_{out}^{min}(x_i)$ is the mean of the distances from $x_i$ to points in the closest other cluster, $\mu_{out}^{min}(x_i) = \min_{j \neq \hat{y}_i}\left\{\frac{\sum_{y \in C_j} \delta(x_i, y)}{n_j}\right\}$. The $s_i$ value lies in the interval $[-1, +1]$. A value close to $+1$ indicates that $x_i$ is much closer to points in its own cluster, a value close to zero indicates that $x_i$ is close to the boundary, and a value close to $-1$ indicates that $x_i$ is much closer to another cluster, and therefore may be mis-clustered. The silhouette coefficient is the mean $s_i$ value: $SC = \frac{1}{n}\sum_{i=1}^{n} s_i$. A value close to $+1$ indicates a good clustering.
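A direct (unoptimized) sketch of the silhouette coefficient following the formulas above; it assumes every cluster has at least two points.

```python
import numpy as np

def silhouette(X, labels):
    """Per-point silhouette values s_i and the mean silhouette coefficient SC."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        own = labels[i]
        in_mask = labels == own
        # mean distance to the other points of x_i's own cluster (cluster size >= 2 assumed)
        mu_in = D[i, in_mask].sum() / (in_mask.sum() - 1)
        # smallest mean distance from x_i to the points of any other cluster
        mu_out = min(D[i, labels == c].mean() for c in clusters if c != own)
        s[i] = (mu_out - mu_in) / max(mu_out, mu_in)
    return s, s.mean()
```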

31 Iris Data: Good vs. Bad Clustering. Scatter plots in the principal components ($u_1$, $u_2$) for (a) the good and (b) the bad clustering, with a table of the internal measures for both cases. Lower is better for BetaCV, Cindex, Q, and DB; higher is better for NC, Dunn, SC, $\Gamma$, and $\Gamma_n$.

32 Relative Measures: Silhouette Coefficient. The silhouette coefficient $s_j$ for each point, and the average SC value, can be used to estimate the number of clusters in the data. The approach consists of plotting the $s_j$ values in descending order for each cluster, and noting the overall SC value for a particular value of $k$, as well as the clusterwise SC values $SC_i = \frac{1}{n_i}\sum_{x_j \in C_i} s_j$. We then pick the value $k$ that yields the best clustering, with many points having high $s_j$ values within each cluster, as well as high values for SC and $SC_i$ ($1 \le i \le k$).

33 Iris K-means: Silhouette Coefficient Plot (k = 2). Clusterwise plots of the sorted silhouette values for the two clusters ($n_1 = 97$, $n_2 = 53$), with their $SC_1$, $SC_2$, and overall SC values. k = 2 yields the highest silhouette coefficient, with the two clusters essentially well separated.

34 Iris K-means: Silhouette Coefficient Plot (k = 3). Clusterwise silhouette plots for the three clusters ($n_1 = 61$, $n_2 = 50$, $n_3 = 39$; $SC_3 = 0.52$), with the overall SC value for k = 3.

35 Iris K-means: Silhouette Coefficient Plot (k = 4). Clusterwise silhouette plots for the four clusters ($n_1 = 49$, $n_2 = 28$, $n_3 = 50$, $n_4 = 23$), with the overall SC value for k = 4.

36 Relative Measures: Calinski-Harabasz Index. Given the dataset $D = \{x_i\}_{i=1}^{n}$, the scatter matrix for $D$ is given as $S = n\Sigma = \sum_{j=1}^{n} (x_j - \mu)(x_j - \mu)^T$, where $\mu = \frac{1}{n}\sum_{j=1}^{n} x_j$ is the mean and $\Sigma$ is the covariance matrix. The scatter matrix can be decomposed into two matrices, $S = S_W + S_B$, where $S_W$ is the within-cluster scatter matrix and $S_B$ is the between-cluster scatter matrix, given as $S_W = \sum_{i=1}^{k}\sum_{x_j \in C_i} (x_j - \mu_i)(x_j - \mu_i)^T$ and $S_B = \sum_{i=1}^{k} n_i (\mu_i - \mu)(\mu_i - \mu)^T$, where $\mu_i = \frac{1}{n_i}\sum_{x_j \in C_i} x_j$ is the mean for cluster $C_i$.

37 Relative Measures: Calinski-Harabasz Index. The Calinski-Harabasz (CH) variance ratio criterion for a given value of $k$ is defined as follows: $CH(k) = \frac{\text{tr}(S_B)/(k-1)}{\text{tr}(S_W)/(n-k)} = \frac{n-k}{k-1} \cdot \frac{\text{tr}(S_B)}{\text{tr}(S_W)}$, where tr is the trace of the matrix. We plot the CH values and look for a large increase in the value followed by little or no gain. We choose the value $k \ge 3$ that minimizes the term $\Delta(k) = \left(CH(k+1) - CH(k)\right) - \left(CH(k) - CH(k-1)\right)$. The intuition is that we want to find the value of $k$ for which $CH(k)$ is much higher than $CH(k-1)$ and there is only a little improvement or a decrease in the $CH(k+1)$ value.
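A sketch of the CH index using only the traces of S_W and S_B, which is all the criterion needs; the Delta(k) selection rule is left as a comment.

```python
import numpy as np

def ch_index(X, labels):
    """Calinski-Harabasz ratio CH(k) = [tr(S_B)/(k-1)] / [tr(S_W)/(n-k)]."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    n = X.shape[0]
    mu = X.mean(axis=0)                       # overall mean
    clusters = np.unique(labels)
    k = len(clusters)
    # trace of the within-cluster scatter: squared deviations from each cluster mean
    tr_SW = sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum() for c in clusters)
    # trace of the between-cluster scatter: n_i * squared distance of cluster mean from overall mean
    tr_SB = sum((labels == c).sum() * ((X[labels == c].mean(axis=0) - mu) ** 2).sum()
                for c in clusters)
    return (tr_SB / (k - 1)) / (tr_SW / (n - k))

# Delta(k) = (CH(k+1) - CH(k)) - (CH(k) - CH(k-1)); pick the k >= 3 that minimizes it.
```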

38 Calinski-Harabasz Variance Ratio. CH ratio for various values of $k$ on the Iris principal components data, using the K-means algorithm, with the best results chosen from 200 runs. The plot and table give the successive $CH(k)$ and $\Delta(k)$ values; $\Delta(k)$ suggests $k = 3$ as the best (lowest) value.

39 Relative Measures: Gap Statistic. The gap statistic compares the sum of intracluster weights $W_{in}$ for different values of $k$ with their expected values assuming no apparent clustering structure, which forms the null hypothesis. Let $C_k$ be the clustering obtained for a specified value of $k$, and let $W_{in}^{k}(D)$ denote the sum of intracluster weights (over all clusters) for $C_k$ on the input dataset $D$. We would like to compute the probability of the observed $W_{in}^{k}$ value under the null hypothesis. To obtain an empirical distribution for $W_{in}$, we resort to Monte Carlo simulations of the sampling process.

40 Relative Measures: Gap Statistic. We generate $t$ random samples, each comprising $n$ points. Let $R_i \in \mathbb{R}^{n \times d}$, $1 \le i \le t$, denote the $i$th sample, and let $W_{in}^{k}(R_i)$ denote the sum of intracluster weights for a given clustering of $R_i$ into $k$ clusters. From each sample dataset $R_i$, we generate clusterings for different values of $k$ and record the intracluster values $W_{in}^{k}(R_i)$. Let $\mu_W(k)$ and $\sigma_W(k)$ denote the mean and standard deviation of these intracluster weights (on the $\log_2$ scale) for each value of $k$. The gap statistic for a given $k$ is then defined as $\text{gap}(k) = \mu_W(k) - \log_2 W_{in}^{k}(D)$. Choose $k$ as the smallest value satisfying $\text{gap}(k) \ge \text{gap}(k+1) - \sigma_W(k+1)$.
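A hedged sketch of the gap-statistic procedure, assuming K-means (via scikit-learn) as the clustering algorithm, a uniform null over the bounding box of D, and a log2-scale comparison of the intracluster weights as in the plots that follow; none of these choices is prescribed by the slides.

```python
import numpy as np
from sklearn.cluster import KMeans

def w_in(X, labels):
    """Sum of all intracluster pairwise distances W_in^k for one clustering."""
    total = 0.0
    for c in np.unique(labels):
        P = X[labels == c]
        D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)
        total += D[np.triu_indices(len(P), k=1)].sum()
    return total

def gap_statistic(X, k_values=range(1, 9), t=20, seed=0):
    """gap(k) = mu_W(k) - log2 W_in^k(D), with mu_W, sigma_W over t uniform null samples."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps, sigmas = {}, {}
    for k in k_values:
        obs = np.log2(w_in(X, KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)))
        null = [np.log2(w_in(R, KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(R)))
                for R in (rng.uniform(lo, hi, size=X.shape) for _ in range(t))]
        gaps[k], sigmas[k] = np.mean(null) - obs, np.std(null)
    return gaps, sigmas

# choose the smallest k with gap(k) >= gap(k+1) - sigma_W(k+1)
```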

41 Gap Statistic: Randomly Generated Data. (a) Randomly generated data with $k = 3$ clusters.

42 Gap Statistic: Intracluster Weights and Gap Values. (b) Intracluster weights: expected $\mu_W(k)$ versus observed $\log_2 W_{in}^{k}$ as a function of $k$. (c) Gap statistic $\text{gap}(k)$ as a function of $k$.

43 Gap Statistic as a Function of k. Table of $\text{gap}(k)$ and $\sigma_W(k)$ for each value of $k$. The optimal value for the number of clusters is $k = 4$ because $\text{gap}(4) > \text{gap}(5) - \sigma_W(5)$. However, if we relax the gap test to be within two standard deviations, then the optimal value is $k = 3$ because $\text{gap}(3) > \text{gap}(4) - 2\sigma_W(4)$.

44 Cluster Stability. The main idea behind cluster stability is that the clusterings obtained from several datasets sampled from the same underlying distribution as $D$ should be similar or stable. Stability can be used to find a good value for $k$, the correct number of clusters. We generate $t$ samples of size $n$ by sampling from $D$ with replacement. Let $C_k(D_i)$ denote the clustering obtained from sample $D_i$, for a given value of $k$. Next, we compare the distance between all pairs of clusterings $C_k(D_i)$ and $C_k(D_j)$ using several of the external cluster evaluation measures. From these values we compute the expected pairwise distance for each value of $k$. Finally, the value $k^*$ that exhibits the least deviation between the clusterings obtained from the resampled datasets is the best choice for $k$, because it exhibits the most stability.

45 Clustering Stability Algorithm. CLUSTERINGSTABILITY(A, t, k_max, D):
n ← |D|
for i = 1, 2, ..., t do: D_i ← sample n points from D with replacement
for i = 1, 2, ..., t do:
    for k = 2, 3, ..., k_max do: C_k(D_i) ← cluster D_i into k clusters using algorithm A
foreach pair D_i, D_j with j > i do:
    D_ij ← D_i ∩ D_j  // create common dataset
    for k = 2, 3, ..., k_max do: d_ij(k) ← d(C_k(D_i), C_k(D_j), D_ij)  // distance between the two clusterings restricted to the common points
for k = 2, 3, ..., k_max do: μ_d(k) ← (2 / (t(t−1))) Σ_{i} Σ_{j>i} d_ij(k)
k* ← arg min_k { μ_d(k) }

46 Clustering Stability: Iris Data. t = 500 bootstrap samples; best K-means from 100 runs. Plot of the expected pairwise values as a function of $k$: $\mu_s(k)$ using the FM measure and $\mu_d(k)$ using VI. The best choice is $k = 2$.

47 Clustering Tendency: Spatial Histogram. Clustering tendency or clusterability aims to determine whether the dataset $D$ has any meaningful groups to begin with. Let $X_1, X_2, \ldots, X_d$ denote the $d$ dimensions. Given $b$, the number of bins for each dimension, we divide each dimension $X_j$ into $b$ equi-width bins, and simply count how many points lie in each of the $b^d$ $d$-dimensional cells. From this spatial histogram, we can obtain the empirical joint probability mass function (EPMF) for the dataset $D$: $f(i) = P(x_j \in \text{cell } i) = \frac{|\{x_j \in \text{cell } i\}|}{n}$, where $i = (i_1, i_2, \ldots, i_d)$ denotes a cell index, with $i_j$ denoting the bin index along dimension $X_j$.

48 Clustering Tendency: Spatial Histogram. We generate $t$ random samples, each comprising $n$ points within the same $d$-dimensional space as the input dataset $D$. Let $R_j$ denote the $j$th such random sample. We then compute the corresponding EPMF $g_j(i)$ for each $R_j$, $1 \le j \le t$. We next compute how much the distribution $f$ differs from $g_j$ (for $j = 1, \ldots, t$), using the Kullback-Leibler (KL) divergence from $f$ to $g_j$, defined as $KL(f \mid g_j) = \sum_{i} f(i) \log\left(\frac{f(i)}{g_j(i)}\right)$. The KL divergence is zero only when $f$ and $g_j$ are the same distribution. Using these divergence values, we can compute how much the dataset $D$ differs from a random dataset.
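A sketch of the spatial-histogram tendency test; the grid resolution b, the number of null samples t, and the small smoothing constant used to avoid division by zero in the KL computation are all assumptions, not part of the slides.

```python
import numpy as np

def spatial_epmf(X, b, lo, hi):
    """EPMF over the b^d grid of equi-width cells spanning [lo, hi] in each dimension."""
    H, _ = np.histogramdd(X, bins=b, range=list(zip(lo, hi)))
    return H.ravel() / len(X)

def kl(f, g, eps=1e-12):
    """KL divergence KL(f || g), with a tiny smoothing term to avoid empty cells."""
    f, g = f + eps, g + eps
    f, g = f / f.sum(), g / g.sum()
    return float((f * np.log2(f / g)).sum())

def spatial_histogram_test(X, b=5, t=100, seed=0):
    """Mean and std of KL(f || g_j) over t uniform null samples in the bounding box of X."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    f = spatial_epmf(X, b, lo, hi)
    divs = [kl(f, spatial_epmf(rng.uniform(lo, hi, size=X.shape), b, lo, hi))
            for _ in range(t)]
    return np.mean(divs), np.std(divs)
```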

49 Spatial Histogram: Iris Data versus Uniform. (a) Iris: spatial cells in the principal components ($u_1$, $u_2$). (b) Uniform: spatial cells.

50 Spatial Histogram: Empirical PMF. (c) Empirical probability mass function over the spatial cells: Iris ($f$) versus Uniform ($g_j$).

51 Spatial Histogram: KL Divergence Distribution. (d) KL-divergence distribution. We generated $t = 500$ random samples from the null distribution, and computed the KL divergence from $f$ to $g_j$ for each $1 \le j \le t$. The mean KL value is $\mu_{KL} = 1.17$.

52 Clustering Tendency: Distance Distribution. We can compare the pairwise point distances from $D$ with those from the randomly generated samples $R_i$ from the null distribution. We create the EPMF from the proximity matrix $W$ for $D$ by binning the distances into $b$ bins: $f(i) = P(w_{pq} \in \text{bin } i \mid x_p, x_q \in D,\, p < q) = \frac{|\{w_{pq} \in \text{bin } i\}|}{n(n-1)/2}$. Likewise, for each of the samples $R_j$, we determine the EPMF for the pairwise distances, denoted $g_j$. Finally, we compute the KL divergences between $f$ and $g_j$. The expected divergence indicates the extent to which $D$ differs from the null (random) distribution.

53 Iris Data: Distance Distribution. (a) EPMF of the pairwise distances: Iris ($f$) versus Uniform ($g_j$).

54 Iris Data: Distance Distribution. (b) Distribution of the KL divergences between $f$ and the $g_j$.

55 Clustering Tendency: Hopkins Statistic. Given a dataset $D$ comprising $n$ points, we generate $t$ uniform subsamples $R_i$ of $m$ points each, sampled from the same dataspace as $D$. We also generate $t$ subsamples of $m$ points directly from $D$, using sampling without replacement; let $D_i$ denote the $i$th direct subsample. Next, we compute the minimum distance between each point $x_j \in D_i$ and points in $D$: $\delta_{min}(x_j) = \min_{x_i \in D,\, x_i \neq x_j}\{\delta(x_j, x_i)\}$. We also compute the minimum distance $\delta_{min}(y_j)$ between a point $y_j \in R_i$ and points in $D$. The Hopkins statistic (in $d$ dimensions) for the $i$th pair of samples $R_i$ and $D_i$ is then defined as $HS_i = \frac{\sum_{y_j \in R_i} (\delta_{min}(y_j))^d}{\sum_{y_j \in R_i} (\delta_{min}(y_j))^d + \sum_{x_j \in D_i} (\delta_{min}(x_j))^d}$. If the data is well clustered we expect the $\delta_{min}(x_j)$ values to be smaller than the $\delta_{min}(y_j)$ values, in which case $HS_i$ tends to 1.
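A sketch of the Hopkins statistic as defined above (with the d-th power of the nearest-neighbor distances); sampling the uniform points from the bounding box of D is an assumption.

```python
import numpy as np

def hopkins(D, m=30, t=10, seed=0):
    """Mean Hopkins statistic over t pairs of subsamples of size m each."""
    rng = np.random.default_rng(seed)
    n, d = D.shape
    lo, hi = D.min(axis=0), D.max(axis=0)
    stats = []
    for _ in range(t):
        R = rng.uniform(lo, hi, size=(m, d))            # uniform subsample R_i
        idx = rng.choice(n, size=m, replace=False)      # direct subsample D_i (indices into D)
        # minimum distance from each uniform point to any point of D
        dist_R = np.linalg.norm(R[:, None, :] - D[None, :, :], axis=2)
        u = dist_R.min(axis=1) ** d
        # minimum distance from each sampled data point to any *other* point of D
        dist_X = np.linalg.norm(D[idx][:, None, :] - D[None, :, :], axis=2)
        dist_X[np.arange(m), idx] = np.inf              # exclude the point itself
        w = dist_X.min(axis=1) ** d
        stats.append(u.sum() / (u.sum() + w.sum()))
    return float(np.mean(stats))
```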

56 Iris Data: Hopkins Statistic Distribution. Histogram of the Hopkins statistic over the sample pairs, with $t = 500$ and subsample size $m = 30$. The mean of the Hopkins statistic is $\mu_{HS} = 0.935$.


More information

CMPSCI 611 Advanced Algorithms Midterm Exam Fall 2015

CMPSCI 611 Advanced Algorithms Midterm Exam Fall 2015 NAME: CMPSCI 611 Advanced Algorithms Midterm Exam Fall 015 A. McGregor 1 October 015 DIRECTIONS: Do not turn over the page until you are told to do so. This is a closed book exam. No communicating with

More information

Homework 1 Due: Thursday 2/5/2015. Instructions: Turn in your homework in class on Thursday 2/5/2015

Homework 1 Due: Thursday 2/5/2015. Instructions: Turn in your homework in class on Thursday 2/5/2015 10-704 Homework 1 Due: Thursday 2/5/2015 Instructions: Turn in your homework in class on Thursday 2/5/2015 1. Information Theory Basics and Inequalities C&T 2.47, 2.29 (a) A deck of n cards in order 1,

More information

Evaluation requires to define performance measures to be optimized

Evaluation requires to define performance measures to be optimized Evaluation Basic concepts Evaluation requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain (generalization error) approximation

More information

Unsupervised Learning. k-means Algorithm

Unsupervised Learning. k-means Algorithm Unsupervised Learning Supervised Learning: Learn to predict y from x from examples of (x, y). Performance is measured by error rate. Unsupervised Learning: Learn a representation from exs. of x. Learn

More information

DEEP LEARNING CHAPTER 3 PROBABILITY & INFORMATION THEORY

DEEP LEARNING CHAPTER 3 PROBABILITY & INFORMATION THEORY DEEP LEARNING CHAPTER 3 PROBABILITY & INFORMATION THEORY OUTLINE 3.1 Why Probability? 3.2 Random Variables 3.3 Probability Distributions 3.4 Marginal Probability 3.5 Conditional Probability 3.6 The Chain

More information

Introduction to Supervised Learning. Performance Evaluation

Introduction to Supervised Learning. Performance Evaluation Introduction to Supervised Learning Performance Evaluation Marcelo S. Lauretto Escola de Artes, Ciências e Humanidades, Universidade de São Paulo marcelolauretto@usp.br Lima - Peru Performance Evaluation

More information

Chapter 5-2: Clustering

Chapter 5-2: Clustering Chapter 5-2: Clustering Jilles Vreeken Revision 1, November 20 th typo s fixed: dendrogram Revision 2, December 10 th clarified: we do consider a point x as a member of its own ε-neighborhood 12 Nov 2015

More information

Spectral Clustering. Spectral Clustering? Two Moons Data. Spectral Clustering Algorithm: Bipartioning. Spectral methods

Spectral Clustering. Spectral Clustering? Two Moons Data. Spectral Clustering Algorithm: Bipartioning. Spectral methods Spectral Clustering Seungjin Choi Department of Computer Science POSTECH, Korea seungjin@postech.ac.kr 1 Spectral methods Spectral Clustering? Methods using eigenvectors of some matrices Involve eigen-decomposition

More information

Additive Combinatorics and Szemerédi s Regularity Lemma

Additive Combinatorics and Szemerédi s Regularity Lemma Additive Combinatorics and Szemerédi s Regularity Lemma Vijay Keswani Anurag Sahay 20th April, 2015 Supervised by : Dr. Rajat Mittal 1 Contents 1 Introduction 3 2 Sum-set Estimates 4 2.1 Size of sumset

More information

Undirected Graphical Models

Undirected Graphical Models Undirected Graphical Models 1 Conditional Independence Graphs Let G = (V, E) be an undirected graph with vertex set V and edge set E, and let A, B, and C be subsets of vertices. We say that C separates

More information

Data Mining. Linear & nonlinear classifiers. Hamid Beigy. Sharif University of Technology. Fall 1396

Data Mining. Linear & nonlinear classifiers. Hamid Beigy. Sharif University of Technology. Fall 1396 Data Mining Linear & nonlinear classifiers Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 1 / 31 Table of contents 1 Introduction

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Undirected Graphical Models Mark Schmidt University of British Columbia Winter 2016 Admin Assignment 3: 2 late days to hand it in today, Thursday is final day. Assignment 4:

More information

Estimation Theory. as Θ = (Θ 1,Θ 2,...,Θ m ) T. An estimator

Estimation Theory. as Θ = (Θ 1,Θ 2,...,Θ m ) T. An estimator Estimation Theory Estimation theory deals with finding numerical values of interesting parameters from given set of data. We start with formulating a family of models that could describe how the data were

More information

Linear Regression and Its Applications

Linear Regression and Its Applications Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start

More information

Hypothesis Evaluation

Hypothesis Evaluation Hypothesis Evaluation Machine Learning Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Hypothesis Evaluation Fall 1395 1 / 31 Table of contents 1 Introduction

More information

Multivariate Statistics

Multivariate Statistics Multivariate Statistics Chapter 6: Cluster Analysis Pedro Galeano Departamento de Estadística Universidad Carlos III de Madrid pedro.galeano@uc3m.es Course 2017/2018 Master in Mathematical Engineering

More information

Subcubic Equivalence of Triangle Detection and Matrix Multiplication

Subcubic Equivalence of Triangle Detection and Matrix Multiplication Subcubic Equivalence of Triangle Detection and Matrix Multiplication Bahar Qarabaqi and Maziar Gomrokchi April 29, 2011 1 Introduction An algorithm on n n matrix with the entries in [ M, M] has a truly

More information