Multivariate Analysis Cluster Analysis


Multivariate Analysis Cluster Analysis Prof. Dr. Anselmo E de Oliveira anselmo.quimica.ufg.br anselmo.disciplinas@gmail.com

Cluster Analysis. [Overview diagram: system → samples → measurements → similarities / distances → clusters]

Cluster Analysis. CA searches for objects which are close together in the variable space. First of all, the choice of a distance metric must be made. The general distance is given by

d_ij = [ Σ_{k=1..n} |x_ik − x_jk|^N ]^(1/N)

where the sum runs over the n variables. For N = 2 this is the familiar n-space Euclidean distance; higher values of N give more weight to the larger differences between individual variables.
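As an illustrative sketch (variable names are assumptions; pdist is the Statistics and Machine Learning Toolbox routine for pairwise distances), these metrics can be computed directly from a samples-by-variables matrix X:

D2 = pdist(X, 'euclidean');      % N = 2, the familiar Euclidean distance
D3 = pdist(X, 'minkowski', 3);   % general order-N (Minkowski) distance with N = 3
M  = squareform(D2);             % distances rearranged as a symmetric matrix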

Cluster Analysis Secondly, a variety of ways to cluster the points have been developed. The single link method judges the nearness of a point to a cluster on the basis of the distance to the closest point in the cluster. Conversely, the more conservative complete link method uses the distance to the farthest point. A more rigorous but computationally slower method is the centroid method in which the distance of a point to the centre of gravity of the points in a cluster is used.

Cluster Analysis. [Illustration: single-link versus complete-link distance between a point and a cluster]

Cluster Analysis. Points are grouped together into clusters based on their nearness or similarity, and we assume that the nearness of points in n-space reflects the similarity of their properties. Typically, measurements are made on the samples and used to calculate interpoint distances. Similarity values, S_ij, are calculated as

S_ij = 1 − d_ij / d_max

where d_max is the largest interpoint distance in the data set.
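A minimal sketch of this conversion (X and the variable names are assumptions, not taken from the slides):

d = pdist(X);              % interpoint distances d_ij
S = 1 - d / max(d);        % similarity: 1 for coincident points, 0 for the most distant pair
Smat = squareform(S);      % symmetric similarity matrix (squareform leaves zeros on the
                           % diagonal; self-similarities are 1 by definition)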

Cluster Analysis. Hierarchical clustering: once an object has been assigned to a group, the process cannot be reversed. [Dendrogram: observations 1-7 plotted against distance/similarity]

Cluster Analysis. [Figure: representing trees with dendrograms]

Cluster Analysis. Cluster analysis methods can be classified into two main categories: agglomerative and divisive. Agglomerative methods begin with each object being its own cluster and progress by combining (agglomerating) existing clusters into larger ones. Divisive methods start with a single cluster containing all objects and progress by dividing existing clusters into smaller ones.

Cluster Analysis. All clustering methods require the specification of a distance measure to indicate distances between objects, and subsequently between clusters, during method operation. The distances can be calculated from the original X-variables or from PCA scores; using scores brings collinearity- and noise-reduction benefits, but requires the specification of an appropriate number of PCs.
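A hedged sketch of clustering on PCA scores rather than on the original X-variables (the choice of four PCs is only an example, mirroring the honey-data dendrograms later in these slides):

[coeff, score] = pca(zscore(X));   % autoscale, then principal component analysis
T   = score(:, 1:4);               % retain an appropriate number of PCs (here 4)
idx = kmeans(T, 4);                % cluster the scores instead of the raw variables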

Cluster Analysis Euclidean or Mahalanobis distance The use of Mahalanobis distance allows one to account for dominant multivariate directions in the data when performing cluster analysis

Cluster Analysis Euclidean or Mahalanobis distance Illustration of Euclidean distance (a) and Mahalanobis distance (b) where the contours represent equidistant points from the center using each distance metric. R. De Maesschalck, D. Jouan-Rimbaud and D. L. Massart, "Tutorial - The Mahalanobis distance," Chemometrics and Intelligent Laboratory Systems, 2000
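For example (a sketch; 'mahalanobis' is a built-in pdist metric that, by default, uses the covariance matrix estimated from X):

De = pdist(X, 'euclidean');     % ignores correlation between variables
Dm = pdist(X, 'mahalanobis');   % accounts for the dominant multivariate directions
Z  = linkage(Dm, 'complete');   % hierarchical clustering on the Mahalanobis distances
dendrogram(Z, 0)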

K-Means. A data mining algorithm. It starts with a random selection of K objects that are to be used as cluster targets, where K is determined a priori. During each cycle of this clustering method, the remaining objects are assigned to one of these clusters based on their distance from each of the K targets. New cluster targets are then calculated as the means of the objects in each cluster. The procedure is repeated until no objects are reassigned after the updated mean calculations.
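A minimal sketch of this cycle (assuming an autoscaled samples-by-variables matrix X already in the workspace; the variable names and the iteration cap are illustrative, not part of the slides):

K = 4;
n = size(X, 1);
targets = X(randperm(n, K), :);              % random initial cluster targets
idx = zeros(n, 1);
for cycle = 1:100                            % safety cap on the number of cycles
    newidx = zeros(n, 1);
    for i = 1:n
        d = sum((targets - X(i, :)).^2, 2);  % squared Euclidean distance to each target
        [~, newidx(i)] = min(d);             % assign object i to the nearest target
    end
    if isequal(newidx, idx), break; end      % no reassignments: converged
    idx = newidx;
    for k = 1:K
        targets(k, :) = mean(X(idx == k, :), 1);   % new targets = cluster means
    end
end

In practice the built-in kmeans function used in the following slides does this, with better handling of empty clusters and multiple restarts.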

K-Means. k-means clustering is often more suitable than hierarchical clustering for large amounts of data. Honey data, K = 4; 27 samples and 11 parameters. Sample groups: acacia (ac): 1-3; floral (of): 4-7 and 20-27; rape (ra): 8-10; honeydew (hd): 11-19.

K-Means. honeydata.mat: X, 27 x 11. To get an idea of how well separated the resulting clusters are, you can make a silhouette plot using the cluster indices output from kmeans. The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters. This measure ranges from +1, indicating points that are very distant from neighboring clusters, through 0, indicating points that are not distinctly in one cluster or another, to -1, indicating points that are probably assigned to the wrong cluster.

Cluster K-Means. Autoscaling, k = 4: idx4 = kmeans(x,4); [silh4,h] = silhouette(x,idx4); Large silhouette values, greater than 0.6, indicate that the cluster is somewhat separated from neighboring clusters; points with low silhouette values, and points with negative values, indicate that the cluster is not well separated. [Silhouette plot for k = 4]

Cluster K-Means. Decrease the number of clusters (k = 3)? idx3 = kmeans(x,3); [silh3,h] = silhouette(x,idx3); [Silhouette plot for k = 3]

Cluster K-Means. Increase the number of clusters (k = 5)? idx5 = kmeans(x,5); [silh5,h] = silhouette(x,idx5); [Silhouette plot for k = 5]

K-Means. A more quantitative way to compare the solutions is to look at the average silhouette values. Tested up to k = 9; the best value was obtained with k = 2: mean(silh2) ans = 0.4810
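A sketch of that comparison as a loop (the 'Replicates' option, which repeats the random initialization to avoid poor local optima, is an addition and not taken from the slides):

avgsilh = zeros(1, 9);
for k = 2:9
    idx = kmeans(x, k, 'Replicates', 5);   % k-means solution for this k
    s = silhouette(x, idx);                % silhouette value of every sample
    avgsilh(k) = mean(s);                  % average silhouette for this k
end
[best, kbest] = max(avgsilh)               % k with the largest average silhouette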

Cluster K-Means. 1H NMR spectra. Tested from k = 2 to 9; the best value was k = 3, mean(silh3) = 0.3955. [1H NMR data plot and silhouette plot for k = 3]

K-Means Without some knowledge of how many clusters are really in the data, it is a good idea to experiment with a range of values for k

Nearest Neighbor (KNN). Single-linkage clustering. The distance between any two clusters is defined as the minimum of all possible pairwise distances between objects in the two clusters; the two clusters with the minimum distance are joined together. This method tends to perform well with data that form elongated "chain-type" clusters.
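A minimal sketch of single-linkage clustering on the autoscaled honey data (function names are from the Statistics and Machine Learning Toolbox; x is assumed to be the 27 x 11 autoscaled matrix):

D = pdist(x);              % pairwise Euclidean distances
Z = linkage(D, 'single');  % single-linkage (nearest-neighbor) joining
dendrogram(Z, 0)           % 0 = show every sample as its own leaf
% replace 'single' with 'complete' for the furthest-neighbor method described below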

Nearest Neighbor (KNN). Honey data. [Three dendrograms of the data with preprocessing: autoscale (Euclidean); autoscale, 4 PCs; autoscale, 4 PCs, Mahalanobis. x-axis: Distance to K-Nearest Neighbor]

Furthest Neighbor. The distance between any two clusters is defined as the maximum of all possible pairwise distances between objects in the two clusters; the two clusters with the minimum distance are joined together. This method tends to perform well with data that form "round", distinct clusters.

Furthest Neighbor. Honey data. [Three dendrograms of the data with preprocessing: autoscale; x-axis: Distance to Furthest Neighbor]

Centroid. The distance between any two clusters is defined as the distance between the multivariate means (centroids) of the two clusters; the two clusters with the minimum distance are joined together.

Centroid. Honey data. [Three dendrograms of the data with preprocessing: autoscale; x-axis: Distance Between Cluster Centers]

Pair-Group Average. The distance between any two clusters is defined as the average of all possible pairwise distances between objects in the two clusters; the two clusters with the minimum distance are joined together. This method tends to perform equally well with both "chain-type" and "round" clusters.

Pair-Group Average. Honey data. [Three dendrograms of the data with preprocessing: autoscale; x-axis: Average-Paired Distance]

Median. The distance between any two clusters is defined as the distance between the weighted multivariate means (centroids) of the two clusters, where the means are weighted by the number of objects in each cluster; the two clusters with the minimum distance are joined together. This method might perform better than the Centroid method if the number of objects is expected to vary greatly between clusters.
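For reference, MATLAB's linkage function exposes similarly defined between-cluster distances by name (a sketch; the mapping of its 'median' option, which uses weighted centroids, to the method described above is approximate):

Zc = linkage(x, 'centroid');   % centroid method
Za = linkage(x, 'average');    % pair-group average
Zm = linkage(x, 'median');     % weighted-centroid (median) method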

Median. Honey data. [Three dendrograms of the data with preprocessing: autoscale; x-axis: Distance Between Cluster Centers]

Ward's Method. This method does not require calculation of the cluster centers; it joins the two existing clusters such that the resulting pooled within-cluster variance (with respect to each cluster's centroid) is minimized.
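A short sketch of Ward's method on the same data (linkage applies the Ward criterion with Euclidean distances by default):

Zw = linkage(x, 'ward');        % join the pair that minimizes the pooled within-cluster variance
dendrogram(Zw, 0)
c = cluster(Zw, 'maxclust', 4); % cut the tree into 4 clusters, e.g. to compare with the k-means groups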

Ward's Method. [Three dendrograms of the data with preprocessing: autoscale; x-axis: Variance Weighted Distance Between Cluster Centers]

PCA. Honey data. [Samples/scores plot: Scores on PC 1 (44.77%) vs. Scores on PC 2 (20.35%)]