Overview of clustering analysis Yuehua Cui Email: cuiy@msu.edu http://www.stt.msu.edu/~cui
A data set with clear cluster structure How would you design an algorithm for finding the three clusters in this case?
Yahoo! Hierarchy isn't clustering, but it is the kind of output you want from clustering. www.yahoo.com/science (30): agriculture, biology, physics, CS, space, ... Subcategories: dairy, crops, botany, cell, forestry, agronomy, evolution, magnetism, relativity, ..., AI, HCI, courses, craft, missions
Hierarchical Clustering Dendrogram Venn Diagram of Clustered Data From http://www.stat.unc.edu/postscript/papers/marron/stat321fda/rimaizempresentation.ppt
Hierarchical clustering may be represented by a two-dimensional diagram known as a dendrogram, which illustrates the fusions or divisions made at each successive stage of the analysis. An example of such a dendrogram is given below:
Nearest Neighbor Algorithm Nearest Neighbor Algorithm is an agglomerative approach (bottom-up). Starts with n nodes (n is the size of our sample), merges the 2 most similar nodes at each step, and stops when the desired number of clusters is reached. From http://www.stat.unc.edu/postscript/papers/marron/stat321fda/rimaizempresentation.ppt
Nearest Neighbor, Level 2, k = 7 clusters. From http://www.stat.unc.edu/postscript/papers/marron/stat321fda/rimaizempresentation.ppt
Nearest Neighbor, Level 3, k = 6 clusters.
Nearest Neighbor, Level 4, k = 5 clusters.
Nearest Neighbor, Level 5, k = 4 clusters.
Nearest Neighbor, Level 6, k = 3 clusters.
Nearest Neighbor, Level 7, k = 2 clusters.
Nearest Neighbor, Level 8, k = 1 cluster.
Hierarchical Clustering. Key concepts: similarity, clustering. (1) Calculate the similarity between all possible combinations of two profiles. (2) Group the two most similar clusters together to form a new cluster. (3) Calculate the similarity between the new cluster and all remaining clusters, then repeat from step 2. Adapted from P. Hong
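Similarity Measurements: Pearson Correlation. Two profiles (vectors) $x = (x_1, \ldots, x_N)$ and $y = (y_1, \ldots, y_N)$:
$$r_{pearson}(x, y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2 \, \sum_{i=1}^{N} (y_i - \bar{y})^2}}, \qquad \bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n, \quad \bar{y} = \frac{1}{N} \sum_{n=1}^{N} y_n$$
A measure of the strength of linear dependence between two variables.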
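The Pearson formula can be sketched in plain Python as follows. The function name `pearson` and the two example profiles are illustrative assumptions, not part of the original slides.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of products of deviations from the means.
    num = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Denominator: square root of the product of the summed squared deviations.
    den = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) *
                    sum((yi - mean_y) ** 2 for yi in y))
    return num / den

# Hypothetical profiles with an exact linear relationship (y = 2x + 1).
r = pearson([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
```

Because the second profile is an increasing linear function of the first, `r` comes out as 1 up to rounding.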
Similarity Measurements: Spearman correlation. Spearman's rank correlation coefficient, or Spearman's rho, is a non-parametric measure of statistical dependence between two variables. It measures how well the relationship between two variables can be described by a monotonic function. It is often thought of as the Pearson correlation coefficient between the ranked variables. The n paired observations $(X_i, Y_i)$ are converted to ranks $(R_{x_i}, R_{y_i})$, and the differences $d_i = R_{x_i} - R_{y_i}$ are calculated. If there are no tied ranks, then ρ is given by:
$$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n (n^2 - 1)}$$
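A minimal sketch of the rank-difference computation, assuming no tied values and the standard no-ties formula ρ = 1 − 6 Σ d_i² / (n(n² − 1)). The helper names `ranks` and `spearman_no_ties` and the example data are illustrative assumptions.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman_no_ties(x, y):
    """Spearman's rho from rank differences d_i (valid only without ties)."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Monotonic but non-linear relationship (y = x^2 on positive x).
rho_increasing = spearman_no_ties([10, 20, 30, 40, 50], [100, 400, 900, 1600, 2500])
# Perfectly reversed ordering.
rho_decreasing = spearman_no_ties([1, 2, 3, 4, 5], [50, 40, 30, 20, 10])
```

The first pair is monotonic but not linear, which is exactly the case where Spearman's rho still reports perfect dependence.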
Similarity Measurements: Cosine Correlation. For $x = (x_1, \ldots, x_N)$ and $y = (y_1, \ldots, y_N)$:
$$C_{cosine}(x, y) = \frac{\sum_{i=1}^{N} x_i y_i}{\|x\| \, \|y\|}, \qquad -1 \le \text{Cosine Correlation} \le +1$$
Adapted from P. Hong
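As a small illustration, the cosine correlation can be computed as the dot product divided by the product of the vector norms. The function name and the test vectors below are assumptions made for this sketch.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)

same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])      # close to +1
opposite = cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])         # close to -1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])                    # 0
```

Unlike the Pearson correlation, the vectors are not centered first, so the result depends on the direction of the raw profiles only.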
Similarity Measurements: Euclidean Distance. For $x = (x_1, \ldots, x_N)$ and $y = (y_1, \ldots, y_N)$:
$$d(x, y) = \sqrt{\sum_{n=1}^{N} (x_n - y_n)^2}$$
Other distance measures, such as the Mahalanobis distance, can also be used.
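A one-function sketch of the Euclidean distance; the name `euclidean` and the example points are illustrative assumptions.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((xn - yn) ** 2 for xn, yn in zip(x, y)))

# Classic 3-4-5 right triangle.
d = euclidean([0.0, 0.0], [3.0, 4.0])
```

Note that this is a dissimilarity (larger means less alike), whereas the correlation measures are similarities (larger means more alike).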
Clustering: Agglomerative techniques. Given clusters C_1, C_2 and C_3, which pair of clusters should be merged? Adapted from P. Hong
Clustering: Single Linkage, also called the nearest neighbor technique. Dissimilarity between two clusters = minimum dissimilarity between the members of the two clusters: D(r,s) = Min { d(i,j) : object i is in cluster r and object j is in cluster s }. At each stage of hierarchical clustering, the clusters r and s for which D(r,s) is minimum are merged. Tends to generate long chains. Adapted from P. Hong
Clustering: Complete Linkage, also termed farthest neighbor. Dissimilarity between two clusters = maximum dissimilarity between the members of the two clusters: D(r,s) = Max { d(i,j) : object i is in cluster r and object j is in cluster s }. At each stage of hierarchical clustering, the clusters r and s for which D(r,s) is minimum are merged. Tends to generate clumps. Adapted from P. Hong
Clustering: Average Linkage. Dissimilarity between two clusters = average distance over all pairs of objects (one from each cluster): D(r,s) = T_rs / (N_r * N_s), where T_rs is the sum of all pairwise distances between cluster r and cluster s, and N_r and N_s are the sizes of clusters r and s respectively. Adapted from P. Hong
Clustering: Average Group Linkage, also termed centroid linkage. Dissimilarity between two clusters = distance between the two cluster means. Adapted from P. Hong
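The four linkage rules (single, complete, average, centroid) can be compared side by side in a short sketch. The function names, the use of Euclidean distance as d(i,j), and the two example clusters are assumptions made for illustration.

```python
import math
from itertools import product

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_linkage(r, s, d=euclidean):
    """Minimum distance over all cross-cluster pairs."""
    return min(d(i, j) for i, j in product(r, s))

def complete_linkage(r, s, d=euclidean):
    """Maximum distance over all cross-cluster pairs."""
    return max(d(i, j) for i, j in product(r, s))

def average_linkage(r, s, d=euclidean):
    """T_rs / (N_r * N_s): mean of all cross-cluster pairwise distances."""
    t_rs = sum(d(i, j) for i, j in product(r, s))
    return t_rs / (len(r) * len(s))

def centroid_linkage(r, s, d=euclidean):
    """Distance between the two cluster means."""
    def mean(cluster):
        return [sum(col) / len(cluster) for col in zip(*cluster)]
    return d(mean(r), mean(s))

# Two hypothetical clusters lying on a line.
c1 = [(0.0, 0.0), (1.0, 0.0)]
c2 = [(3.0, 0.0), (5.0, 0.0)]

d_single = single_linkage(c1, c2)      # closest pair: (1,0) and (3,0)
d_complete = complete_linkage(c1, c2)  # farthest pair: (0,0) and (5,0)
d_average = average_linkage(c1, c2)
d_centroid = centroid_linkage(c1, c2)
```

All four rules take the same two clusters but can return different dissimilarities, which is why the resulting dendrograms can differ.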
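Single-Link Method (Euclidean distance)

Distance matrix:
       a   b   c
   b   2
   c   5   3
   d   6   5   4

(1) Merge a and b (distance 2). Updated matrix:
       a,b   c
   c    3
   d    5    4

(2) Merge {a,b} and c (distance 3). Updated matrix:
       a,b,c
   d     4

(3) Merge {a,b,c} and d (distance 4), giving the single cluster a,b,c,d.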
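Complete-Link Method (Euclidean distance)

Distance matrix:
       a   b   c
   b   2
   c   5   3
   d   6   5   4

(1) Merge a and b (distance 2). Updated matrix:
       a,b   c
   c    5
   d    6    4

(2) Merge c and d (distance 4). Updated matrix:
       a,b
   c,d  6

(3) Merge {a,b} and {c,d} (distance 6), giving the single cluster a,b,c,d.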
Compare Dendrograms: Single-Link vs. Complete-Link [Figure: two dendrograms over the leaves a, b, c, d, with a merge-height axis running from 0 to 6]
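The single-link and complete-link examples on the points a, b, c, d can be reproduced with a small agglomerative loop. This is a sketch only: the distance values are read from the flattened slide matrix (d(a,b)=2, d(a,c)=5, d(a,d)=6, d(b,c)=3, d(b,d)=5, d(c,d)=4), and the names `DIST` and `agglomerate` are assumptions.

```python
from itertools import combinations

# Pairwise distances as read from the slide's distance matrix.
DIST = {
    frozenset("ab"): 2, frozenset("ac"): 5, frozenset("ad"): 6,
    frozenset("bc"): 3, frozenset("bd"): 5, frozenset("cd"): 4,
}

def agglomerate(points, linkage):
    """Repeatedly merge the two closest clusters until one remains.

    `linkage` is the built-in `min` for single link and `max` for complete link.
    Returns a list of (merged cluster label, merge height) pairs.
    """
    def cluster_distance(pair):
        r, s = pair
        return linkage(DIST[frozenset((i, j))] for i in r for j in s)

    clusters = [frozenset(p) for p in points]
    merges = []
    while len(clusters) > 1:
        r, s = min(combinations(clusters, 2), key=cluster_distance)
        height = cluster_distance((r, s))
        clusters = [c for c in clusters if c not in (r, s)] + [r | s]
        merges.append(("".join(sorted(r | s)), height))
    return merges

single = agglomerate("abcd", min)
complete = agglomerate("abcd", max)
```

The merge heights are the levels at which branches join in each dendrogram: single link chains c and then d onto {a,b}, while complete link pairs c with d before the final merge.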
Other clustering methods: K-means clustering (a type of partitional clustering); SOM: self-organizing map; Principal component analysis; Multidimensional scaling (MDS): information visualization for exploring similarities or dissimilarities in data
K-means algorithm. Given k, the k-means algorithm works as follows:
1) Randomly choose k data points (seeds) to be the initial centroids (cluster centers).
2) Assign each data point to the closest centroid.
3) Re-compute the centroids using the current cluster memberships.
4) If a convergence criterion is not met, go to 2).
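The four steps can be sketched in plain Python as below. The convergence criterion used here (assignments stop changing), the fixed random seed, the handling of empty clusters, and the example points are all assumptions of this sketch rather than part of the slides.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Basic k-means on a list of coordinate tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # step 1: random seeds
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each point to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop when the memberships no longer change.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment

# Two hypothetical, well-separated groups of points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, k=2)
```

Which group receives label 0 depends on the random seeds, so only the grouping itself, not the label values, should be interpreted.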
The K-Means Clustering Method: Example [Figure: scatter-plot panels, each with both axes running from 0 to 10]
Comments on the K-Means Method. Strength: relatively efficient, O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations; normally k, t << n. Comment: often terminates at a local optimum; the global optimum may be found using techniques such as deterministic annealing and genetic algorithms. Weaknesses: applicable only when the mean is defined (so what about categorical data?); k, the number of clusters, must be specified in advance; unable to handle noisy data and outliers well; not suitable for discovering clusters with non-convex shapes.
How to choose a clustering algorithm. Choosing the best algorithm is a challenge. Every algorithm has limitations and works well only with certain data distributions. It is very hard, if not impossible, to know what distribution the application data follow. The data may not fully follow any ideal structure or distribution required by the algorithms. One also needs to decide how to standardize the data, choose a suitable distance function, and select other parameter values. Due to these complexities, the common practice is to run several algorithms using different distance functions and parameter settings, and then carefully analyze and compare the results. The interpretation of the results must be based on insight into the meaning of the original data together with knowledge of the algorithms used. Clustering is highly application dependent and to a certain extent subjective (personal preferences).
Principal Component Analysis
Problem: Too much data!

Probe set ID   Pat1  Pat2  Pat3  Pat4  Pat5  Pat6  Pat7  Pat8  Pat9
209619_at      7758  4705  5342  7443  8747  4933  7950  5031  5293
32541_at        280   387   392   238   385   329   337   163   225
206398_s_at    1050   835  1268  1723  1377   804  1846  1180   252
219281_at       391   593   298   265   491   517   334   387   285
207857_at      1425   977  2027  1184   939   814   658   593   659
211338_at        37    27    28    38    33    16    36    23    31
213539_at       124   197   454   116   162   113    97    97   160
221497_x_at     120    86   175    99   115    80    83   119    66
213958_at       179   225   449   174   185   203   186   185   157
210835_s_at     203   144   197   314   250   353   173   285   325
209199_s_at     758  1234   833  1449   769  1110   987   638  1133
217979_at       570   563   972   796   869   494   673  1013   665
201015_s_at     533   343   325   270   691   460   563   321   261
203332_s_at     649   354   494   554   710   455   748   392   418
204670_x_at    5577  3216  5323  4423  5771  3374  4328  3515  2072
208788_at       648   327  1057   746   541   270   361   774   590
210784_x_at     142   151   144   173   148   145   131   146   147
204319_s_at     298   172   200   298   196   104   144   110   150
205049_s_at    3294  1351  2080  2066  3726  1396  2244  2142  1248
202114_at       833   674   733  1298   862   371   886   501   734
213792_s_at     646   375   370   436   738   497   546   406   376
203932_at      1977  1016  2436  1856  1917   822  1189  1092   623
203963_at        97    63    77   136    85    74    91    61    66
203978_at       315   279   221   260   227   222   232   141   123
203753_at      1468  1105   381  1154   980  1419  1253   554  1045
204891_s_at      78    71   152    74   127    57    66   153    70
209365_s_at     472   519   365   349   756   528   637   828   720
209604_s_at     772    74   130   216   108   311    80   235   177
211005_at        49    58   129    70    56    77    61    61    75
219686_at       694   342   345   502   960   403   535   513   258
38521_at        775   604   305   563   542   543   725   587   406
217853_at       367   168   107   160   287   264   273   113    89
217028_at      4926  2667  3542  5163  4683  3281  4822  3978  2702
Why Dimension Reduction? Computation: the complexity grows exponentially with the dimension. Visualization: projection of high-dimensional data to 2D or 3D. Interpretation: the intrinsic dimension may be small.
Dimension reduction vs. feature selection. Reduction of dimensions: Principal Component Analysis (PCA). Feature selection (gene selection): significant genes (t-test); selection of a limited number of genes.
Principal Component Analysis (PCA) Used for visualization of complex data Developed to capture as much of the variation in data as possible Generic features of principal components summary variables linear combinations of the original variables uncorrelated with each other capture as much of the original variance as possible
Principal components. 1st principal component (PC1): the direction along which there is the greatest variation. 2nd principal component (PC2): the direction with maximum variation left in the data, orthogonal to the direction (i.e. vector) of PC1. 3rd principal component (PC3): the direction with maximal variation left in the data, orthogonal to the plane of PC1 and PC2 (rarely used). Etc.
Philosophy of PCA. A PCA is concerned with explaining the variance-covariance structure of a set of variables through a few linear combinations. We typically have a data matrix of n observations on p correlated variables $x_1, x_2, \ldots, x_p$. PCA looks for a transformation of the $x_i$ into p new variables $y_i$ that are uncorrelated. We want to represent $x_1, x_2, \ldots, x_p$ with a few $y_i$'s without losing much information.
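PCA: looking for a transformation of the data matrix X ($n \times p$) such that $Y = \alpha^T X = \alpha_1 X_1 + \alpha_2 X_2 + \cdots + \alpha_p X_p$, where $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_p)^T$ is a column vector of weights with $\alpha_1^2 + \alpha_2^2 + \cdots + \alpha_p^2 = 1$.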
Maximize the variance of the projection of the observations on the Y variables: find the weight vector $\alpha$ so that $\mathrm{Var}(\alpha^T X) = \alpha^T \mathrm{Var}(X) \, \alpha$ is maximal, where $\mathrm{Var}(X)$ is the covariance matrix of the $X_i$ variables.
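A short derivation, sketched here under the assumption that the weight vector is written $\alpha$ and the covariance matrix $\mathrm{Var}(X)$ is written $\Sigma$, shows why this constrained maximization leads to an eigenvalue problem. The Lagrange multiplier $\lambda$ is introduced for the unit-length constraint.

```latex
\begin{aligned}
&\max_{\alpha}\ \alpha^{T}\Sigma\,\alpha
  \quad \text{subject to} \quad \alpha^{T}\alpha = 1, \\
&L(\alpha,\lambda) = \alpha^{T}\Sigma\,\alpha - \lambda\,(\alpha^{T}\alpha - 1), \\
&\frac{\partial L}{\partial \alpha} = 2\,\Sigma\alpha - 2\,\lambda\alpha = 0
  \;\Longrightarrow\; \Sigma\alpha = \lambda\alpha, \\
&\operatorname{Var}(\alpha^{T}X) = \alpha^{T}\Sigma\,\alpha
  = \lambda\,\alpha^{T}\alpha = \lambda .
\end{aligned}
```

So any stationary point is an eigenvector of $\Sigma$, and the variance it attains equals its eigenvalue; the maximum is therefore reached at the eigenvector with the largest eigenvalue.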
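Eigenvector and Eigenvalue. A square matrix A is said to have eigenvalue $\lambda$, with corresponding eigenvector $x \neq 0$, if $Ax = \lambda x$. Example:
$$\begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix} \begin{pmatrix} 3 \\ 2 \end{pmatrix} = \begin{pmatrix} 12 \\ 8 \end{pmatrix} = 4 \begin{pmatrix} 3 \\ 2 \end{pmatrix}$$
Only square matrices have eigenvectors, but not every square matrix has real eigenvectors. Eigenvectors of a symmetric matrix (such as a covariance matrix) that correspond to distinct eigenvalues are orthogonal to each other.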
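The slide's numeric example can be checked directly. In this sketch the helper `matvec` and the second eigenvector `v = (1, -1)` (with eigenvalue -1, worked out by hand for illustration) are assumptions that do not appear in the slides.

```python
def matvec(A, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

A = [[2, 3],
     [2, 1]]

x = [3, 2]            # eigenvector from the slide, eigenvalue 4
Ax = matvec(A, x)     # should equal 4 * x

v = [1, -1]           # second eigenvector, eigenvalue -1
Av = matvec(A, v)     # should equal -1 * v

# Dot product of the two eigenvectors of this particular matrix.
dot_xv = sum(xi * vi for xi, vi in zip(x, v))
```

For this non-symmetric A the dot product is not zero, so its two eigenvectors are not orthogonal; orthogonality is guaranteed for symmetric matrices such as the covariance matrices used in PCA.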
PCA Result: let $\Sigma$ be the covariance matrix associated with the random vector $X' = [X_1, X_2, \ldots, X_p]$. Let $\Sigma$ have the eigenvalue-eigenvector pairs $(\lambda_1, e_1), (\lambda_2, e_2), \ldots, (\lambda_p, e_p)$, where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$. Then the $i$th principal component is given by $Y_i = e_i' X = e_{i1} X_1 + e_{i2} X_2 + \cdots + e_{ip} X_p$, $i = 1, 2, \ldots, p$.
And so... We find that the direction of the weight vector $\alpha$ with maximal variance is given by the eigenvector $e_1$ corresponding to the largest eigenvalue of the matrix $\Sigma$. The second vector, orthogonal to the first (so that the two components are uncorrelated), is the one with the second highest variance, which turns out to be the eigenvector corresponding to the second largest eigenvalue. And so on.
So PCA gives new variables $Y_i$ that are linear combinations of the original variables $x_i$: $Y_i = e_{i1} x_1 + e_{i2} x_2 + \cdots + e_{ip} x_p$, $i = 1, \ldots, p$. The new variables $Y_i$ are derived in decreasing order of importance; they are called principal components.
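To make the construction concrete, here is a dependency-free sketch that estimates the covariance matrix, extracts eigenpairs by power iteration with deflation, and projects the data. Power iteration is a technique chosen for this illustration, not one named in the slides; the function names, the start vector, the iteration count, and the four data points are all assumptions.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix (divisor n - 1) of n observations on p variables."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
            for i in range(p)]

def leading_eigenpair(S, iters=500):
    """Power iteration on a symmetric positive semi-definite matrix S."""
    p = len(S)
    # Arbitrary start vector; assumed not orthogonal to the leading eigenvector.
    v = [1.0 / (j + 1) for j in range(p)]
    for _ in range(iters):
        w = [sum(S[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        v = [wi / norm for wi in w]
    # Rayleigh quotient v' S v gives the eigenvalue for the unit vector v.
    lam = sum(v[i] * S[i][j] * v[j] for i in range(p) for j in range(p))
    return lam, v

def principal_components(rows, k):
    """First k (eigenvalue, eigenvector) pairs of the sample covariance matrix."""
    S = covariance_matrix(rows)
    p = len(S)
    pairs = []
    for _ in range(k):
        lam, e = leading_eigenpair(S)
        pairs.append((lam, e))
        # Deflation: subtract lam * e e' so the next run finds the next eigenvector.
        S = [[S[i][j] - lam * e[i] * e[j] for j in range(p)] for i in range(p)]
    return pairs

# Hypothetical mean-zero data spread mainly along the diagonal x1 = x2.
rows = [(-2.0, -1.0), (-1.0, -2.0), (1.0, 2.0), (2.0, 1.0)]
(lam1, e1), (lam2, e2) = principal_components(rows, 2)

# Scores Y_i = e_i' x for every observation (the data are already centered).
scores = [[sum(e[j] * x[j] for j in range(2)) for e in (e1, e2)] for x in rows]
```

For these points the first component points along the diagonal and carries most of the variance, and the second is perpendicular to it. In practice a linear algebra library's symmetric eigensolver would normally replace the hand-written power iteration.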
Scale before PCA PCA is sensitive to scale PCA should be applied on data that have approximately the same scale in each variable
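One common way to put variables on the same scale is to standardize each one to mean 0 and standard deviation 1 before PCA. The sketch below assumes that choice (the slides do not prescribe a specific scaling) and uses the first three patients of probe sets 209619_at and 211338_at from the expression table as example values; the function name `standardize` is hypothetical.

```python
import math

def standardize(rows):
    """Center each column and divide by its sample standard deviation."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
           for j in range(p)]
    return [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]

# Rows are patients Pat1..Pat3; columns are probe sets 209619_at and 211338_at.
data = [[7758.0, 37.0],
        [4705.0, 27.0],
        [5342.0, 28.0]]
z = standardize(data)

# After scaling, both columns have mean 0 and sample variance 1.
col_means = [sum(r[j] for r in z) / len(z) for j in range(2)]
col_vars = [sum(r[j] ** 2 for r in z) / (len(z) - 1) for j in range(2)]
```

Without this step the first probe set, whose raw values are in the thousands, would dominate the covariance matrix and therefore the first principal component.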
Result: Let $X^T = [X_1, X_2, \ldots, X_p]$ have covariance matrix $\Sigma$, with eigenvalue-eigenvector pairs $(\lambda_1, e_1), \ldots, (\lambda_p, e_p)$, where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$. Let $Y_i = e_i' X$ be the PCs. Then
$$\sigma_{11} + \sigma_{22} + \cdots + \sigma_{pp} = \sum_{i=1}^{p} \mathrm{Var}(X_i) = \lambda_1 + \lambda_2 + \cdots + \lambda_p = \sum_{i=1}^{p} \mathrm{Var}(Y_i)$$
Proportion of total population variance due to the $k$th principal component:
$$\frac{\lambda_k}{\sum_{i=1}^{p} \lambda_i}, \qquad k = 1, \ldots, p$$
How many PCs to keep
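The proportion of variance explained follows directly from the eigenvalues, and a cumulative version gives one simple rule for deciding how many components to retain. The eigenvalues and the 80% threshold below are hypothetical values chosen for this sketch; the slides do not fix a threshold.

```python
def explained_variance_ratio(eigenvalues):
    """lambda_k / sum(lambda_i) for each component."""
    total = sum(eigenvalues)
    return [lam / total for lam in eigenvalues]

def components_needed(eigenvalues, threshold):
    """Smallest k whose first k components reach the cumulative threshold."""
    cumulative = 0.0
    for k, ratio in enumerate(explained_variance_ratio(eigenvalues), start=1):
        cumulative += ratio
        if cumulative >= threshold:
            return k
    return len(eigenvalues)

# Hypothetical eigenvalues, sorted in decreasing order.
eigenvalues = [4.0, 2.0, 1.0, 1.0]
ratios = explained_variance_ratio(eigenvalues)
k_80 = components_needed(eigenvalues, 0.80)
```

Here the first two components explain 75% of the total variance, so three are needed to pass the 80% mark. Other heuristics, such as looking for an elbow in a plot of variance per component, are also used.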
PCA application: genomic study Population stratification: allele frequency differences between cases and controls due to systematic ancestry differences can cause spurious associations in disease studies. PCA could be used to infer underlying population structure.
Figure 2 Nature Genetics 38, 904-909 (2006) Principal components analysis corrects for stratification in genome-wide association studies Alkes L Price, Nick J Patterson, Robert M Plenge, Michael E Weinblatt, Nancy A Shadick & David Reic
Chao Tian, Peter K. Gregersen and Michael F. Seldin. (2008) Accounting for ancestry: population substructure and genomewide association studies.
Example: 3 dimensions → 2 dimensions
PCA on all Genes Leukemia data, precursor B and T Plot of 34 patients, 8973 dimensions (genes) reduced to 2
[Figure: variance (%) per principal component, PC1 to PC10, on a 0 to 25% axis, showing variance retained and variance lost]
PCA on 100 top significant genes after DE gene selection Plot of 34 patients, 100 dimensions (genes) reduced to 2