Multivariate Statistics: Hierarchical and k-means cluster analysis


Multivariate Statistics: Hierarchical and k-means cluster analysis. Steffen Unkel, Department of Medical Statistics, University Medical Center Goettingen, Germany. Summer term 2017 1/43

What is a cluster? Cluster analysis is about discovering groups in data. Clustering methods should not be confused with discrimination and other supervised learning methods, where the groups are known a priori. Some authors suggest that the ultimate criterion for evaluating the meaning of the term cluster is the value judgement of the user. Others attempt to define just what a cluster is in terms of internal cohesion (homogeneity) and external isolation (separation). Summer term 2017 2/43

Clusters with internal cohesion and/or external separation. It is not entirely clear how a 'cluster' is recognized when displayed in the plane, but one feature of the recognition process would appear to involve assessment of the relative distances between points. How human observers draw perceptually coherent clusters out of fields of 'dots' will be considered briefly in Chapter 2. Figure 1.2: Clusters with internal cohesion and/or external isolation. (Reproduced with permission of CRC Press from Gordon, 1980.) Summer term 2017 3/43

Data containing no natural clusters and dissection. A further set of two-dimensional data is plotted in Figure 1.3. Here most observers would conclude that there is no 'natural' cluster structure, simply a single homogeneous collection of points. Ideally, then, one might expect a method of cluster analysis applied to such data to come to a similar conclusion. As will be seen later, this may not be the case, and many (most) methods of cluster analysis will divide this type of data into 'groups'. Often the process of dividing a homogeneous data set into different parts is referred to as dissection, and such a procedure may be useful in specific circumstances. If, for example, the data in Figure 1.3 represented the geographical locations of houses in a town, dissection might be a useful way of dividing the town up into compact postal districts which contain comparable numbers of houses - see Figure 1.4. (This example was suggested by Gordon, 1980.) Figure 1.3: Data containing no 'natural' clusters. Figure 1.4: Dissection of the data in Figure 1.3. (Both reproduced with permission of CRC Press from Gordon, 1980.) Summer term 2017 4/43

The measurement of proximity. To identify clusters which may be present in data, knowledge of how close observations are to each other is needed. We need a quantitative measure of closeness, more commonly referred to as dissimilarity, distance or similarity, with a general term being proximity. Two observations are close when their dissimilarity is small or their similarity is large. Proximities can be determined either directly (e.g. from tasting experiments) or indirectly (usually derived from the n × p data matrix X). Summer term 2017 5/43

Similarity measures for categorical data. With data in which all the variables are categorical, measures of similarity are most commonly used. The measures are generally scaled to be in the interval [0, 1], or they are expressed as percentages in the range 0-100%. Two individuals i and j have a similarity coefficient s_ij of unity (zero) if both have identical values for all variables (differ maximally for all variables). A similarity s_ij could be converted into a dissimilarity δ_ij by taking, for example, δ_ij = 1 − s_ij. Summer term 2017 6/43

Binary data. In some cases zero-zero matches are equivalent to one-one matches and should therefore be included in the calculated similarity measure. An example is gender, where there is no preference as to which of the two categories should be coded zero or one. But in other cases the inclusion or otherwise of d is more problematic; for example, when the zero category corresponds to the genuine absence of some property, such as wings in a study of insects. The question that then needs to be asked is: do the co-absences contain useful information about the similarity of the two individuals?
Table 3.2: Counts of binary outcomes for two individuals.
                   Individual i: 1   Individual i: 0   Total
Individual j: 1          a                 b            a + b
Individual j: 0          c                 d            c + d
Total                  a + c             b + d          p = a + b + c + d
Summer term 2017 7/43

Similarity measures for binary data. Two out of many examples:
1. Matching coefficient: s_ij = (a + d)/(a + b + c + d) = (a + d)/p.
2. Jaccard coefficient: s_ij = a/(a + b + c).
There is an apparent uncertainty as to how to deal with the count of zero-zero matches, d. In some cases, zero-zero matches are completely equivalent to one-one matches, but in other cases the inclusion of d is more problematic. Summer term 2017 8/43
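As a small illustration (not part of the original slides; function name and data are illustrative), both coefficients can be computed from the counts a, b, c, d of two binary vectors:

import numpy as np

def binary_similarity(x, y):
    # Matching and Jaccard coefficients for two binary vectors x (individual i) and y (individual j)
    x, y = np.asarray(x, dtype=bool), np.asarray(y, dtype=bool)
    a = np.sum(x & y)    # one-one matches
    b = np.sum(x & ~y)   # i = 1, j = 0
    c = np.sum(~x & y)   # i = 0, j = 1
    d = np.sum(~x & ~y)  # zero-zero matches
    p = a + b + c + d
    matching = (a + d) / p
    jaccard = a / (a + b + c) if (a + b + c) > 0 else 0.0
    return matching, jaccard

# Example: two individuals measured on six binary variables
print(binary_similarity([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0]))  # (0.666..., 0.5)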

Similarity measures for categorical data with levels > 2. One approach is to allocate a score s_ijk of zero or one to each variable k, depending on whether the two individuals i and j are the same on that variable. The similarity coefficient is then computed as s_ij = (1/p) Σ_{k=1}^p s_ijk. An alternative approach is to 1. divide all possible outcomes of the kth variable into mutually exclusive subsets of categories, 2. allocate s_ijk to zero or one depending on whether the two categories for individuals i and j are members of the same subset, and then 3. determine the proportion of shared subsets across variables. Summer term 2017 9/43
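A sketch of the first approach (not from the slides; function and labels are illustrative): score one for each variable on which the two individuals agree and average over the p variables.

import numpy as np

def simple_matching(x_i, x_j):
    # s_ij = (1/p) * sum_k s_ijk, with s_ijk = 1 if i and j share the same category on variable k
    x_i, x_j = np.asarray(x_i), np.asarray(x_j)
    return float(np.mean(x_i == x_j))

# Example with p = 4 categorical variables
print(simple_matching(["red", "A", "low", "yes"], ["red", "B", "low", "no"]))  # 0.5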

Dissimilarity and distance measures for continuous data. A dissimilarity measure δ_ij ≥ 0 with δ_ii = 0 is termed a distance measure if it fulfils the metric (triangular) inequality δ_ij + δ_im ≥ δ_jm for the pairs of individuals (i, j), (i, m) and (j, m). An n × n matrix of dissimilarities, Δ, with elements δ_ij, where δ_ii = 0 for all i, is said to be metric if the triangular inequality holds for all triples (i, j, m). We refer to metric dissimilarities as distances and denote the n × n matrix of distances D, with elements d_ij. Summer term 2017 10/43
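For a small dissimilarity matrix the metric property can be checked directly by testing the triangular inequality over all triples; a brute-force sketch (not from the slides):

import numpy as np
from itertools import permutations

def is_metric(delta, tol=1e-12):
    # Check delta_ij + delta_im >= delta_jm for all triples (i, j, m)
    delta = np.asarray(delta, dtype=float)
    n = delta.shape[0]
    return all(delta[i, j] + delta[i, m] + tol >= delta[j, m]
               for i, j, m in permutations(range(n), 3))

# Euclidean distances between three points always satisfy the inequality
pts = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
print(is_metric(D))  # True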

Distance measures. General Minkowski distance or l_r norm:
d_ij = ( Σ_{k=1}^p w_k^r |x_ik − x_jk|^r )^{1/r},   r ≥ 1,
where the w_k, k = 1, ..., p, denote the non-negative weights of the p variables (often all w_k = 1). Both the Euclidean distance (l_2 norm) and the Manhattan (city-block or taxicab) distance (l_1 norm) are special cases of the Minkowski distance. Summer term 2017 11/43
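A direct numpy transcription of the weighted Minkowski distance (a sketch, not part of the slides; names are illustrative):

import numpy as np

def minkowski(x_i, x_j, r=2, w=None):
    # d_ij = (sum_k w_k^r |x_ik - x_jk|^r)^(1/r), with r >= 1 and non-negative weights w_k
    x_i, x_j = np.asarray(x_i, dtype=float), np.asarray(x_j, dtype=float)
    w = np.ones_like(x_i) if w is None else np.asarray(w, dtype=float)
    return np.sum((w * np.abs(x_i - x_j)) ** r) ** (1.0 / r)

x, y = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
print(minkowski(x, y, r=2))  # Euclidean distance (l_2): 5.0
print(minkowski(x, y, r=1))  # Manhattan distance (l_1): 7.0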

Euclidean property. An n × n dissimilarity matrix with elements δ_ij is said to be Euclidean if the n individuals can be represented as points in space such that the Euclidean distance between points i and j is δ_ij. This Euclidean property allows the interpretation of dissimilarities as physical distances. If a dissimilarity matrix is Euclidean then it is also metric, but the converse does not follow. Summer term 2017 12/43

Illustration. Figure 3.1: An example of a set of distances that satisfy the metric inequality but which have no Euclidean representation. (Reproduced with permission from Gower and Legendre, 1986.) Summer term 2017 13/43

Similarity measures for mixed data. There are a number of approaches to constructing proximities for data in which some variables are continuous and some categorical, three examples of which are: 1. Dichotomize all variables and use a similarity measure for binary data. 2. Rescale all the variables so that they are on the same scale by replacing variable values by their ranks among the objects, and then use a measure for continuous data. 3. Construct a dissimilarity measure for each type of variable and combine these, either with or without differential weighting, into a single coefficient. Summer term 2017 14/43

Inter-group proximity measures. In clustering applications it also becomes necessary to consider how to measure the proximity between groups of individuals. There are two basic approaches: 1. Define a suitable summary of the proximities between individuals from either group. Example: if group A has mean vector x̄_A = (x̄_A1, ..., x̄_Ap)' and group B has mean vector x̄_B = (x̄_B1, ..., x̄_Bp)', then the generalized distance D² is given by D² = (x̄_A − x̄_B)' W⁻¹ (x̄_A − x̄_B), where W is the pooled within-group covariance matrix for the two groups. 2. Each group might be described by a representative observation and the inter-group proximity defined as the proximity between the representative observations. Examples: nearest-neighbour distance, furthest-neighbour distance, average dissimilarity between individuals from both groups. Summer term 2017 15/43
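A sketch of the generalized distance D² between two groups (not from the slides; variable names and the simulated data are illustrative), using the pooled within-group covariance matrix W:

import numpy as np

def generalized_distance(XA, XB):
    # D^2 = (xbar_A - xbar_B)' W^{-1} (xbar_A - xbar_B), with W the pooled within-group covariance
    XA, XB = np.asarray(XA, dtype=float), np.asarray(XB, dtype=float)
    nA, nB = len(XA), len(XB)
    diff = XA.mean(axis=0) - XB.mean(axis=0)
    W = ((nA - 1) * np.cov(XA, rowvar=False) + (nB - 1) * np.cov(XB, rowvar=False)) / (nA + nB - 2)
    return float(diff @ np.linalg.solve(W, diff))

rng = np.random.default_rng(0)
A = rng.normal(loc=0.0, size=(30, 2))
B = rng.normal(loc=2.0, size=(30, 2))
print(generalized_distance(A, B))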

Standardization. In many clustering applications the variables describing the objects to be clustered will not be measured in the same units. The solution most often suggested is to simply standardize each variable to unit variance. Standardization of variables to unit variance can be viewed as a special case of weighting. Sometimes it is preferable to standardize variables using a measure of within-group variability rather than one of total variability. Summer term 2017 16/43

Illustration of the standardization problem. Figure 3.3: (a) Data on the original scale. (b) Undesirable standardization: weights based on total standard deviations. (c) Desirable standardization: weights based on within-group standard deviations. Summer term 2017 17/43

Introduction. In hierarchical clustering the data are not partitioned into a particular number of clusters at a single step. Instead the clustering consists of a series of partitions, and the aim is to find the optimal step. Hierarchical clustering techniques may be subdivided into 1. agglomerative methods, which proceed by a series of successive fusions of the n objects into groups, and 2. divisive methods (less commonly used), which separate the n objects successively into finer groupings. With hierarchical methods, divisions or fusions, once made, are irrevocable. Summer term 2017 18/43

Dendrogram. Figure: Example of a hierarchical tree structure; agglomerative methods proceed in one direction (successive fusions), divisive methods in the other (successive splits). Summer term 2017 19/43

Agglomerative hierarchical clustering procedure. An agglomerative procedure produces a series of partitions of the data, P_n, P_{n-1}, ..., P_1, where P_n consists of n single-member clusters and P_1 consists of a single group containing all n objects. Algorithm. Start: clusters C_1, C_2, ..., C_n, each containing a single individual. (1) Find the nearest pair of distinct clusters, say C_i and C_j; merge C_i and C_j, delete C_j and decrease the number of clusters by one. (2) If the number of clusters equals one then stop, else return to (1). Differences between agglomerative methods arise because of the different ways of defining similarity between an individual and a group or between two groups of individuals. Summer term 2017 20/43

Measuring inter-cluster dissimilarity. Three inter-cluster distance measures:
d_AB = min_{i∈A, j∈B} d_ij,   (1)
d_AB = max_{i∈A, j∈B} d_ij,   (2)
d_AB = (1/(n_A n_B)) Σ_{i∈A} Σ_{j∈B} d_ij,   (3)
where d_AB is the distance between two clusters A and B, d_ij is the distance (e.g. Euclidean distance) between objects i and j, and n_A (n_B) is the number of objects in cluster A (B). The measure in (1) is the basis of single linkage clustering, that in (2) of complete linkage clustering, and that in (3) of group average clustering. Summer term 2017 21/43
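The three measures for a given pair of clusters can be read off the matrix of pairwise distances; a sketch (not from the slides), assuming Euclidean distance between objects:

import numpy as np
from scipy.spatial.distance import cdist

def intercluster_distances(A, B):
    # Rows of A and B are objects; d holds the pairwise distances d_ij for i in A, j in B
    d = cdist(A, B)
    return d.min(), d.max(), d.mean()  # single (1), complete (2), group average (3)

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [5.0, 0.0]])
print(intercluster_distances(A, B))  # (3.0, 5.0, 4.0)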

Examples of the three inter-cluster distance measures. Summer term 2017 22/43

Illustrative example (single linkage). Consider the following distance matrix D:
        1     2     3     4     5
 1     0.0
 2     2.0   0.0
 3     6.0   5.0   0.0
 4    10.0   9.0   4.0   0.0
 5     9.0   8.0   5.0   3.0   0.0
At stage one, individuals 1 and 2 are merged to form a cluster, since d_12 is the smallest non-zero entry in D. Distances between this cluster and the other three individuals are obtained as
d_(12)3 = min(d_13, d_23) = d_23 = 5.0,
d_(12)4 = min(d_14, d_24) = d_24 = 9.0,
d_(12)5 = min(d_15, d_25) = d_25 = 8.0.
Summer term 2017 23/43

Illustrative example (single linkage). A new matrix D_1 may now be constructed whose entries are inter-individual and cluster-individual distance values:
        (12)    3     4     5
 (12)   0.0
 3      5.0   0.0
 4      9.0   4.0   0.0
 5      8.0   5.0   3.0   0.0
The smallest non-zero entry in D_1 is d_45, and so individuals 4 and 5 are now merged to form a new cluster, and a new set of distances is found:
d_(12)3 = 5.0 as before,
d_(12)(45) = min(d_14, d_15, d_24, d_25) = d_25 = 8.0,
d_(45)3 = min(d_34, d_35) = d_34 = 4.0.
Summer term 2017 24/43

Illustrative example (single linkage). The new distances may be arranged to give the matrix D_2:
        (12)    3    (45)
 (12)   0.0
 3      5.0   0.0
 (45)   8.0   4.0   0.0
The smallest non-zero entry is now d_(45)3, and so individual 3 is merged with the cluster containing individuals 4 and 5. Finally, fusion of the two remaining groups takes place to form a single group containing all five individuals. Summer term 2017 25/43
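The whole single linkage sequence above can be reproduced with scipy (a sketch, not part of the slides):

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Distance matrix D from the worked example (individuals 1-5)
D = np.array([[0, 2, 6, 10, 9],
              [2, 0, 5, 9, 8],
              [6, 5, 0, 4, 5],
              [10, 9, 4, 0, 3],
              [9, 8, 5, 3, 0]], dtype=float)

Z = linkage(squareform(D), method="single")
print(Z)  # fusions at heights 2.0, 3.0, 4.0 and 5.0, as in the worked example
# scipy.cluster.hierarchy.dendrogram(Z, labels=[1, 2, 3, 4, 5]) would draw the tree (requires matplotlib)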

Dendrogram for the worked example of single linkage. Summer term 2017 26/43

Problems of agglomerative methods. To illustrate some of the problems of agglomerative methods, a set of simulated data will be clustered using single, complete and average linkage. The data consist of 50 points simulated from two bivariate normal distributions with mean vectors (0, 0) and (4, 4) and common covariance matrix
Σ = ( 16.0   1.5  )
    (  1.5   0.25 ).
Two intermediate points have been added for the first analysis, in order to illustrate the problem known as chaining, often found when using single linkage. Summer term 2017 27/43

Dendrogram of single linkage clustering of the simulated data. Summer term 2017 28/43

Problems of agglomerative methods. Figure: Clusters obtained from simulated data: (a) single linkage, with intermediate points, two-cluster solution; (b) single linkage, no intermediate points, five-cluster solution. Summer term 2017 29/43

Problems of agglomerative methods. Figure: Clusters obtained from simulated data: (c) complete linkage, no intermediate points, five-cluster solution; (d) average linkage, no intermediate points, five-cluster solution. Summer term 2017 30/43

Global fit of a hierarchical clustering solution. The method most commonly used for comparing a dendrogram with a proximity matrix or a second dendrogram is the cophenetic correlation. It is the product-moment correlation of the n(n − 1)/2 entries in the lower half of the proximity matrix and the corresponding entries in the cophenetic matrix C. The elements c_ij of C are the heights at which individuals i and j first appear together in the same cluster in the dendrogram. For the worked example, the cophenetic correlation between D and C is 0.82. Summer term 2017 31/43
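Continuing the scipy sketch from the worked example (illustrative, not part of the slides), the cophenetic correlation can be obtained directly:

import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import squareform

D = np.array([[0, 2, 6, 10, 9],
              [2, 0, 5, 9, 8],
              [6, 5, 0, 4, 5],
              [10, 9, 4, 0, 3],
              [9, 8, 5, 3, 0]], dtype=float)
Z = linkage(squareform(D), method="single")

# Correlation between the original distances and the cophenetic distances
c, coph_dists = cophenet(Z, squareform(D))
print(round(c, 2))  # 0.82, as quoted on the slide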

Choice of partition. It is often the case that the investigator is not interested in the complete hierarchy but only in one or two partitions obtained from it. Partitions are obtained by cutting the dendrogram at a particular height or by selecting one of the solutions in the nested sequence of clusterings. One informal approach for determining the number of groups is to examine the size of the differences between fusion levels in the resulting diagram. Large changes in fusion levels might be taken to indicate a particular number of clusters. Summer term 2017 32/43

Introduction to optimization clustering techniques. We now consider a clustering technique that produces a partition of the individuals into a specified number of groups. This is done by optimizing some numerical criterion. Associated with each partition of the n objects into the required number of groups, k, is an index c(n, k), the value of which measures some aspect of the quality of this particular partition. Differences between optimization techniques arise both because of the variety of clustering criteria and because of the various optimization algorithms that might be used. Summer term 2017 33/43

Deriving clustering criteria. Consider the following three p × p matrices derived from X:
T = Σ_{m=1}^k Σ_{l=1}^{n_m} (x_ml − x̄)(x_ml − x̄)',
W = Σ_{m=1}^k Σ_{l=1}^{n_m} (x_ml − x̄_m)(x_ml − x̄_m)',
B = Σ_{m=1}^k n_m (x̄_m − x̄)(x̄_m − x̄)',
where x_ml is the vector of observations of the lth object in group m, x̄ is the mean vector of all n observations, x̄_m is the vector of sample means within group m, and n_m is the number of observations in group m. Summer term 2017 34/43

Clustering criteria. The three matrices represent, respectively, the total dispersion, the within-group dispersion and the between-group dispersion. They satisfy the equation T = W + B. In the multivariate case (p > 1), a number of criteria based on the decomposition above have been suggested: 1. Minimization of tr(W), that is, minimize the sum of the within-group sums of squares over all variables. 2. Minimization of det(W). 3. Maximization of tr(BW⁻¹). Summer term 2017 35/43
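A numpy sketch (not from the slides; the data are simulated for illustration) that computes T, W and B for a given partition, verifies the decomposition T = W + B, and evaluates the three criteria:

import numpy as np

def dispersion_matrices(X, labels):
    # Total (T), within-group (W) and between-group (B) dispersion matrices
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    xbar = X.mean(axis=0)
    T = (X - xbar).T @ (X - xbar)
    W = np.zeros_like(T)
    B = np.zeros_like(T)
    for g in np.unique(labels):
        Xm = X[labels == g]
        xbar_m = Xm.mean(axis=0)
        W += (Xm - xbar_m).T @ (Xm - xbar_m)
        B += len(Xm) * np.outer(xbar_m - xbar, xbar_m - xbar)
    return T, W, B

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
labels = np.repeat([0, 1], 20)
T, W, B = dispersion_matrices(X, labels)
print(np.allclose(T, W + B))  # True: the decomposition T = W + B holds
print(np.trace(W), np.linalg.det(W), np.trace(B @ np.linalg.inv(W)))  # the three criteria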

Number of possible partitions. Having decided on a suitable criterion, consideration needs to be given to how to find a partition into k groups that optimizes the criterion. One may want to calculate the criterion value for each possible partition and choose a partition that gives an optimum. Unfortunately, the number of possible partitions, N(n, k), is very large even for moderate n and k; for example, N(5, 2) = 15, N(10, 3) = 9330, N(100, 5) ≈ 6.6 × 10^67. This problem has led to the development of hill-climbing algorithms designed to search for the optimum by rearranging an existing partition and keeping the new one only if it provides an improvement. Summer term 2017 36/43
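The quoted counts are Stirling numbers of the second kind; a small sketch (not from the slides) reproduces them with the standard recurrence N(n, k) = k·N(n−1, k) + N(n−1, k−1):

from functools import lru_cache

@lru_cache(maxsize=None)
def n_partitions(n, k):
    # Number of ways to partition n objects into k non-empty groups
    if k == 0:
        return 1 if n == 0 else 0
    if n == 0 or k > n:
        return 0
    return k * n_partitions(n - 1, k) + n_partitions(n - 1, k - 1)

print(n_partitions(5, 2))             # 15
print(n_partitions(10, 3))            # 9330
print(f"{n_partitions(100, 5):.1e}")  # 6.6e+67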

Optimization algorithm. Hill-climbing clustering algorithm: 1. Find some initial partition of the n objects into k groups. 2. Calculate the change in the clustering criterion produced by moving each object from its own group to another group. 3. Make the change which leads to the greatest improvement in the value of the clustering criterion. 4. Repeat steps 2 and 3 until no move of a single object causes the clustering criterion to improve. The k-means algorithm consists of iteratively updating a partition by simultaneously relocating each object to the group whose mean (centroid) is closest and then recalculating the group means. This can be shown to be equivalent to minimizing tr(W) when Euclidean distances are used to define closeness. Summer term 2017 37/43
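A minimal numpy sketch of the k-means iteration just described (not from the slides; it uses randomly chosen data points as initial means and does not handle empty clusters or multiple restarts):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # k random points as initial means
    for _ in range(n_iter):
        # Step 1: relocate each object to the group whose centroid is closest
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # Step 2: recalculate the group means
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (25, 2)), rng.normal(5, 1, (25, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids.round(1))  # one centroid near (0, 0), one near (5, 5)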

Illustrative example of k-means clustering. Figure: Pick k random points as cluster centers (means); shown here for k = 2. Summer term 2017 38/43

Illustrative example of k-means clustering, iterative step 1. Figure: Assign data points to the closest cluster center. Summer term 2017 39/43

Illustrative example of k-means clustering, iterative step 2. Figure: Change each cluster center to the average of its assigned points. Summer term 2017 40/43

Illustrative example of k-means clustering. Figure: Repeat until convergence. Summer term 2017 41/43

Choosing the number of clusters. The k-means approach is used to partition the data into a prespecified number of clusters set by the investigator. In practice, solutions for a range of values of the number of groups are found, but the question remains as to the optimal number of clusters for the data. A number of suggestions have been made as to how to tackle this question. For example, one involves plotting the value of the clustering criterion against the number of groups. As k increases the value of tr(W) will necessarily decrease, but some sharp change may be indicative of the best solution. Summer term 2017 42/43
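A sketch of the suggested plot of the criterion against k (not from the slides), assuming scikit-learn is available; KMeans.inertia_ is the within-group sum of squares, i.e. tr(W) under Euclidean distance:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(5, 1, (30, 2)), rng.normal(10, 1, (30, 2))])

# tr(W) for k = 1, ..., 6; a sharp change ("elbow") suggests a suitable number of groups
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))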

Properties of k-means and alternatives. The k-means algorithm is guaranteed to converge, at least to a local optimum. However, it suffers from the problems of 1. not being scale-invariant, that is, different solutions may be obtained from the raw data and from the data standardized in some way, and 2. imposing a spherical structure on the observed clusters even when the natural clusters in the data are of other shapes. Alternatives to k-means are the k-median and partitioning around medoids (PAM) algorithms. Summer term 2017 43/43