Multivariate Analysis Cluster Analysis


1 Multivariate Analysis Cluster Analysis Prof. Dr. Anselmo E de Oliveira anselmo.quimica.ufg.br anselmo.disciplinas@gmail.com

2 Cluster Analysis [Diagram: System, Samples, Measurements, Similarities, Distances, Clusters]

3 Cluster Analysis CA searches for objects which are close together in the variable space. First of all, a distance metric must be chosen. The general distance between objects i and j is given by

$$d_{ij} = \left[ \sum_{k=1}^{n} \left| x_{ik} - x_{jk} \right|^{N} \right]^{1/N}$$

where n is the number of variables. For N = 2, this is the familiar n-space Euclidean distance. Higher values of N give more weight to the largest coordinate differences between the two objects.
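
The general distance above can be illustrated with a minimal sketch. The function name `minkowski_distance` and the example points are assumptions made for illustration; they do not come from the slides.

```python
def minkowski_distance(x_i, x_j, N=2):
    """General distance between two objects measured on the same variables.

    N is the exponent of the general distance; N=2 gives the Euclidean distance.
    """
    if len(x_i) != len(x_j):
        raise ValueError("objects must have the same number of variables")
    return sum(abs(a - b) ** N for a, b in zip(x_i, x_j)) ** (1.0 / N)


# Two hypothetical objects described by two variables.
a = [0.0, 0.0]
b = [3.0, 4.0]

d1 = minkowski_distance(a, b, N=1)    # sum of absolute differences: 3 + 4
d2 = minkowski_distance(a, b, N=2)    # Euclidean distance
d10 = minkowski_distance(a, b, N=10)  # close to the largest single difference (4)
```

With these points, the distance shrinks from 7 (N = 1) to 5 (N = 2) and approaches 4 as N grows, which shows how the exponent changes the influence of the individual variable differences.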

4 Cluster Analysis Secondly, a variety of ways to cluster the points have been developed. The single link method judges the nearness of a point to a cluster on the basis of the distance to the closest point in the cluster. Conversely, the more conservative complete link method uses the distance to the farthest point. A more rigorous but computationally slower method is the centroid method in which the distance of a point to the centre of gravity of the points in a cluster is used.
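
A small sketch of the three ways of judging how near a point is to a cluster, assuming Euclidean distance. The names `point_to_cluster`, `cluster`, and `point` are hypothetical.

```python
import math


def point_to_cluster(point, cluster, method):
    """Distance from a point to a cluster under the three rules described above."""
    if method == "single":      # distance to the closest point in the cluster
        return min(math.dist(point, member) for member in cluster)
    if method == "complete":    # distance to the farthest point in the cluster
        return max(math.dist(point, member) for member in cluster)
    if method == "centroid":    # distance to the centre of gravity of the cluster
        centroid = [sum(col) / len(cluster) for col in zip(*cluster)]
        return math.dist(point, centroid)
    raise ValueError("method must be 'single', 'complete' or 'centroid'")


# Hypothetical cluster lying on a line, and a candidate point next to it.
cluster = [(0.0, 0.0), (2.0, 0.0), (4.0, 0.0)]
point = (5.0, 0.0)

distances = {m: point_to_cluster(point, cluster, m)
             for m in ("single", "complete", "centroid")}
```

For this layout the single link distance is 1, the complete link distance is 5, and the centroid distance is 3, so the same point looks close or far depending on the rule.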

5 Cluster Analysis [Figure: single link vs. complete link]

6 Cluster Analysis Points are grouped together into clusters based on their nearness or similarity, and we assume that the nearness of points in n-space reflects the similarity of their properties. Typically, measurements are made on the samples and used to calculate interpoint distances. Similarity values, $S_{ij}$, are calculated as

$$S_{ij} = 1 - \frac{d_{ij}}{d_{\max}}$$

where $d_{\max}$ is the largest interpoint distance in the data set.
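
The conversion from interpoint distances to similarity values can be sketched as follows, assuming Euclidean distances and taking the largest interpoint distance as the normalising value. The function names and the three sample points are illustrative assumptions.

```python
import math


def distance_matrix(samples):
    """Euclidean interpoint distances d_ij for all pairs of samples."""
    n = len(samples)
    return [[math.dist(samples[i], samples[j]) for j in range(n)] for i in range(n)]


def similarity_matrix(dist):
    """Similarity S_ij = 1 - d_ij / d_max, where d_max is the largest distance."""
    d_max = max(max(row) for row in dist)
    if d_max == 0:
        raise ValueError("all samples coincide; similarity is undefined")
    return [[1.0 - d / d_max for d in row] for row in dist]


# Three hypothetical samples measured on two variables.
samples = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = distance_matrix(samples)
S = similarity_matrix(D)
```

Identical samples get a similarity of 1 and the two most distant samples get 0; here the middle sample has a similarity of 0.5 to each of the other two.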

7 Cluster Analysis Hierarchical clustering: once an object has been assigned to a group, the process cannot be reversed. [Figure: dendrogram; axes: Observation, Distance/Similarity]

8-10 Cluster Analysis [Figures: example dendrograms]

11 Cluster Analysis Representing Trees with Dendrograms [Figure: dendrogram]

12 Cluster Analysis Cluster analysis methods can be classified into two main categories: agglomerative and partitional. Agglomerative methods begin with each object being its own cluster, and progress by combining (agglomerating) existing clusters into larger ones. Partitional methods start with a single cluster containing all objects, and progress by dividing existing clusters into smaller ones.


14 Cluster Analysis All clustering methods require the specification of a distance measure to be used to indicate distances between objects, and subsequently between clusters, during method operation. The distances can be computed from the original X-variables or from PCA scores. PCA scores bring collinearity and noise-reduction benefits, but require the specification of the appropriate number of PCs.

15 Cluster Analysis Euclidean or Mahalanobis distance: the use of Mahalanobis distance allows one to account for dominant multivariate directions in the data when performing cluster analysis.

16 Cluster Analysis Euclidean or Mahalanobis distance Illustration of Euclidean distance (a) and Mahalanobis distance (b) where the contours represent equidistant points from the center using each distance metric. R. De Maesschalck, D. Jouan-Rimbaud and D. L. Massart, "Tutorial - The Mahalanobis distance," Chemometrics and Intelligent Laboratory Systems, 2000
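
The difference between the two metrics can be sketched for the two-variable case, where the covariance matrix can be inverted by hand. The data, the helper names, and the restriction to two variables are assumptions made to keep the example short.

```python
import math


def mean_and_covariance_2d(points):
    """Mean vector and sample covariance (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis_2d(point, center, cov):
    """Mahalanobis distance sqrt((x - m)^T S^-1 (x - m)) for two variables."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = point[0] - center[0], point[1] - center[1]
    # Inverse of [[sxx, sxy], [sxy, syy]] is [[syy, -sxy], [-sxy, sxx]] / det.
    return math.sqrt((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)


# Hypothetical data stretched along the diagonal direction (1, 1).
points = [(2.0, 2.0), (-2.0, -2.0), (1.0, -1.0), (-1.0, 1.0)]
center, cov = mean_and_covariance_2d(points)

along = (2.0, 2.0)    # lies along the dominant direction of the data
across = (2.0, -2.0)  # lies across the dominant direction

euclid_along = math.dist(along, center)
euclid_across = math.dist(across, center)
maha_along = mahalanobis_2d(along, center, cov)
maha_across = mahalanobis_2d(across, center, cov)
```

Both test points are equally far from the center in Euclidean terms, but the Mahalanobis distance of the point lying across the dominant direction is twice that of the point lying along it.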


17 K-Means A data mining algorithm. It starts with a random selection of K objects that are to be used as cluster targets, where K is determined a priori. During each cycle of this clustering method, the remaining objects are assigned to one of these clusters, based on distance from each of the K targets. New cluster targets are then calculated as the means of the objects in each cluster. The procedure is repeated until no objects are re-assigned after the updated mean calculations.
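
The cycle described above can be sketched in a few lines, assuming Euclidean distance. The function name `k_means`, the `seed` and `max_cycles` parameters, and the example data are illustrative assumptions; this is not the MATLAB `kmeans` implementation.

```python
import math
import random


def k_means(data, k, seed=0, max_cycles=100):
    """Minimal k-means: returns (labels, targets) for a list of points."""
    rng = random.Random(seed)
    # Random selection of K objects used as the initial cluster targets.
    targets = [tuple(p) for p in rng.sample(data, k)]
    labels = None
    for _ in range(max_cycles):
        # Assign every object to its nearest target.
        new_labels = [min(range(k), key=lambda c: math.dist(p, targets[c]))
                      for p in data]
        if new_labels == labels:  # no object was re-assigned: stop
            break
        labels = new_labels
        # Recalculate each target as the mean of the objects in its cluster.
        for c in range(k):
            members = [p for p, lab in zip(data, labels) if lab == c]
            if members:  # an empty cluster keeps its previous target
                targets[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, targets


# Hypothetical data: two loose groups of objects measured on two variables.
data = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, targets = k_means(data, k=2, seed=1)
```

Because the starting targets are random, different seeds can lead to different final clusters, which is one reason to run the method several times.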

18 K-Means k-means clustering is often more suitable than hierarchical clustering for large amounts of data. Honey data (K = 4): Rape (ra), 8-10; Honeydew (hd); Floral (of), 4-7; Acacia (ac). 27 samples and 11 parameters.

19 K-Means honeydata.mat: X, 27x11. To get an idea of how well-separated the resulting clusters are, you can make a silhouette plot using the cluster indices output from kmeans. The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters. This measure ranges from +1, indicating points that are very distant from neighboring clusters, through 0, indicating points that are not distinctly in one cluster or another, to -1, indicating points that are probably assigned to the wrong cluster.
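
A minimal sketch of the silhouette measure, using the common definition s = (b - a) / max(a, b), where a is the mean distance from a point to the other points of its own cluster and b is the smallest mean distance to the points of any other cluster. Plain Euclidean distance and a value of 0 for single-member clusters are assumptions of this sketch, so the numbers may differ from those returned by MATLAB's `silhouette`.

```python
import math


def silhouette_values(data, labels):
    """Silhouette value of every point, given one cluster label per point."""
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    values = []
    for i, (p, own) in enumerate(zip(data, labels)):
        mean_dist = {}
        for c in clusters:
            others = [q for j, (q, lab) in enumerate(zip(data, labels))
                      if lab == c and j != i]
            mean_dist[c] = (sum(math.dist(p, q) for q in others) / len(others)
                            if others else None)
        a = mean_dist[own]
        if a is None:  # the point is alone in its cluster (assumed convention: 0)
            values.append(0.0)
            continue
        b = min(mean_dist[c] for c in clusters if c != own)
        values.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return values


# Hypothetical data: two tight pairs of points far from each other.
data = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_values(data, [0, 0, 1, 1])  # sensible assignment
bad = silhouette_values(data, [0, 1, 1, 1])   # second point assigned wrongly
```

With the sensible assignment every value is close to +1, while the wrongly assigned point in the second call receives a strongly negative value.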

20 K-Means Autoscaling, k = 4: idx4=kmeans(x,4); [silh4,h] = silhouette(x,idx4); Large silhouette values, greater than 0.6, indicate that the cluster is somewhat separated from neighboring clusters. Points with low silhouette values, and points with negative values, indicate that the cluster is not well separated. [Figure: silhouette plot; axes: Silhouette Value, Cluster]

21 K-Means Decrease the number of clusters (k=3)? idx3=kmeans(x,3); [silh3,h] = silhouette(x,idx3); [Figure: silhouette plot; axes: Silhouette Value, Cluster]

22 K-Means Increase the number of clusters (k=5)? idx5=kmeans(x,5); [silh5,h] = silhouette(x,idx5); [Figure: silhouette plot; axes: Silhouette Value, Cluster]

23 K-Means A more quantitative way to compare the solutions is to look at the average silhouette values. I tested up to k=9. The best value, mean(silh2), was obtained with k=2.


24 K-Means ¹H NMR spectra: I tested from k = 2 to 9. The best value, mean(silh3), was obtained with k = 3. [Figure: silhouette plot; axes: Silhouette Value, Cluster]

25 K-Means Without some knowledge of how many clusters are really in the data, it is a good idea to experiment with a range of values for k
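
Experimenting with a range of k values and comparing the average silhouette values can be sketched as below. To keep the example deterministic, every possible choice of k starting objects is tried instead of a random one, which is only feasible for very small data sets. All function names and the data are assumptions, and plain Euclidean distance is used.

```python
import itertools
import math


def k_means_labels(data, start_targets, max_cycles=100):
    """k-means labels, starting from an explicit list of cluster targets."""
    targets = [tuple(t) for t in start_targets]
    k = len(targets)
    labels = None
    for _ in range(max_cycles):
        new_labels = [min(range(k), key=lambda c: math.dist(p, targets[c]))
                      for p in data]
        if new_labels == labels:  # no re-assignment: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(data, labels) if lab == c]
            if members:
                targets[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def mean_silhouette(data, labels):
    """Average silhouette value; single-member clusters contribute 0."""
    total = 0.0
    for i, p in enumerate(data):
        sums, counts = {}, {}
        for j, q in enumerate(data):
            if j != i:
                sums[labels[j]] = sums.get(labels[j], 0.0) + math.dist(p, q)
                counts[labels[j]] = counts.get(labels[j], 0) + 1
        own = labels[i]
        other_means = [sums[c] / counts[c] for c in sums if c != own]
        if own not in sums or not other_means:
            continue
        a, b = sums[own] / counts[own], min(other_means)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(data)


# Hypothetical data: three tight, well-separated groups of three objects.
data = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5),
        (10.0, 0.0), (10.5, 0.0), (10.0, 0.5),
        (5.0, 9.0), (5.5, 9.0), (5.0, 9.5)]

scores = {}
for k in range(2, 6):
    # Best average silhouette over all possible sets of k starting objects.
    scores[k] = max(mean_silhouette(data, k_means_labels(data, start))
                    for start in itertools.combinations(data, k))
best_k = max(scores, key=scores.get)
```

For these three groups the highest average silhouette is reached at k = 3. On real data the maximum is usually less clear-cut, so the silhouette plots remain worth inspecting.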


26 Nearest Neighbor (KNN) Single-linkage clustering The distance between any two clusters is defined as the minimum of all possible pair-wise distances of objects between the two clusters; the two clusters with the minimum distance are joined together. This method tends to perform well with data that form elongated "chain-type" clusters.
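
A minimal sketch of agglomerative clustering with this nearest-neighbor rule, assuming Euclidean distance. The function name `single_linkage`, the `n_clusters` stopping parameter, and the chain-shaped example data are illustrative assumptions.

```python
import math


def single_linkage(data, n_clusters=1):
    """Agglomerate objects until n_clusters remain; returns (clusters, merges).

    Clusters are lists of object indices. Each merge is recorded as
    (members of first cluster, members of second cluster, joining distance).
    """
    clusters = [[i] for i in range(len(data))]  # each object starts on its own
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Cluster distance: minimum of all pair-wise object distances.
                d = min(math.dist(data[i], data[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]  # join the two closest clusters
        del clusters[b]
    return clusters, merges


# Hypothetical chain of four objects plus a separate pair.
data = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
clusters, merges = single_linkage(data, n_clusters=2)
```

The chain of four objects is collected into one cluster link by link, each time at a joining distance of 1, which illustrates why this rule suits elongated clusters. The recorded joining distances are the heights at which a dendrogram would draw the merges.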

27 Nearest Neighbor (KNN) Honey data [Figures: three dendrograms; axis: Distance to K-Nearest Neighbor. Preprocessing: (a) autoscale, Euclidean; (b) autoscale, 4 PCs; (c) autoscale, 4 PCs, Mahalanobis]


28 Furthest Neighbor The distance between any two clusters is defined as the maximum of all possible pair-wise distances of objects between the two clusters; the two clusters with the minimum distance are joined together. This method tends to perform well with data that form "round", distinct clusters.

29 Furthest Neighbor Honey data [Figures: three dendrograms of data with preprocessing: autoscale; axis: Distance to Furthest Neighbor]

30 Centroid The distance between any two clusters is defined as the difference in the multivariate means (centroids) of each cluster; the two clusters with the minimum distance are joined together.

31 Centroid Honey data [Figures: three dendrograms of data with preprocessing: autoscale; axis: Distance Between Cluster Centers]


32 Pair-Group Average The distance between any two clusters is defined as the average distance of all possible pair-wise distances between objects in the two clusters; the two clusters with the minimum distance are joined together. This method tends to perform equally well with both "chain-type" and "round" clusters.
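
The cluster-to-cluster distances defined so far (nearest neighbor, furthest neighbor, centroid, and pair-group average) can be compared on the same pair of clusters. This is a sketch assuming Euclidean distance; the function name `cluster_distance` and the data are hypothetical.

```python
import math


def cluster_distance(c1, c2, method):
    """Distance between two clusters of points under a given linkage rule."""
    pairwise = [math.dist(p, q) for p in c1 for q in c2]
    if method == "nearest":    # minimum of all pair-wise distances
        return min(pairwise)
    if method == "furthest":   # maximum of all pair-wise distances
        return max(pairwise)
    if method == "average":    # average of all pair-wise distances
        return sum(pairwise) / len(pairwise)
    if method == "centroid":   # distance between the multivariate means
        m1 = [sum(col) / len(c1) for col in zip(*c1)]
        m2 = [sum(col) / len(c2) for col in zip(*c2)]
        return math.dist(m1, m2)
    raise ValueError("unknown method: " + method)


# Two hypothetical clusters on a line.
c1 = [(0.0, 0.0), (2.0, 0.0)]
c2 = [(5.0, 0.0), (9.0, 0.0)]

result = {m: cluster_distance(c1, c2, m)
          for m in ("nearest", "furthest", "average", "centroid")}
```

Here the nearest-neighbor distance is 3, the furthest-neighbor distance is 9, and both the average and the centroid distance are 6. In an agglomerative run, the rule chosen decides which pair of clusters counts as closest and is joined next.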

33 Pair-Group Average [Figures: three dendrograms of data with preprocessing: autoscale; axis: Average-Paired Distance]

34 Median The distance between any two clusters is defined as the difference in the weighted multivariate means (centroids) of each cluster, where the means are weighted by the number of objects in each cluster; the two clusters with the minimum distance are joined together. This method might perform better than the Centroid method if the number of objects is expected to vary greatly between clusters.

35 Median [Figures: three dendrograms of data with preprocessing: autoscale; axis: Distance Between Cluster Centers]

36 Ward's Method This method does not require calculation of the cluster centers; it joins the two existing clusters such that the resulting pooled within-cluster variance (with respect to each cluster's centroid) is minimized.
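
One merge step of this criterion can be sketched by measuring, for every pair of clusters, how much the pooled within-cluster sum of squares would grow if the pair were joined, and picking the pair with the smallest increase. The sketch computes centroids explicitly only to evaluate that sum of squares. The function names and the data are assumptions.

```python
def within_ss(cluster):
    """Sum of squared deviations of a cluster's points from its centroid."""
    centroid = [sum(col) / len(cluster) for col in zip(*cluster)]
    return sum(sum((x - m) ** 2 for x, m in zip(p, centroid)) for p in cluster)


def ward_best_merge(clusters):
    """Return (increase, a, b): the pair of clusters whose merge adds the least
    to the pooled within-cluster sum of squares."""
    best = None
    for a in range(len(clusters)):
        for b in range(a + 1, len(clusters)):
            increase = (within_ss(clusters[a] + clusters[b])
                        - within_ss(clusters[a]) - within_ss(clusters[b]))
            if best is None or increase < best[0]:
                best = (increase, a, b)
    return best


# Three hypothetical clusters on a line.
clusters = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(3.0, 0.0)],
    [(10.0, 0.0), (11.0, 0.0), (12.0, 0.0)],
]
increase, a, b = ward_best_merge(clusters)
```

For these clusters the first two are joined, since merging them increases the within-cluster sum of squares by only 25/6, far less than any merge involving the third cluster.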

37 Ward's Method [Figures: three dendrograms of data with preprocessing: autoscale; axis: Variance Weighted Distance Between Cluster Centers]

38 PCA Honey data [Figure: Samples/Scores Plot; axes: Scores on PC 1 (44.77%), Scores on PC 2 (20.35%)]
