Multivariate Analysis Cluster Analysis


Multivariate Analysis Cluster Analysis Prof. Dr. Anselmo E de Oliveira anselmo.quimica.ufg.br anselmo.disciplinas@gmail.com

Cluster Analysis. [Overview diagram: system → samples → measurements → similarities / distances → clusters]

Cluster Analysis. CA searches for objects which are close together in the variable space. First of all, the choice of a distance metric must be made. The general distance is given by

d_ij = [ Σ_{k=1..n} |x_ik − x_jk|^N ]^(1/N)

where the sum runs over the n variables. For N = 2 this is the familiar n-space Euclidean distance; higher values of N give more weight to the larger differences between individual variables.
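As an illustrative sketch (variable names are assumptions; pdist is the Statistics and Machine Learning Toolbox routine for pairwise distances), these metrics can be computed directly from a samples-by-variables matrix X:

D2 = pdist(X, 'euclidean');      % N = 2, the familiar Euclidean distance
D3 = pdist(X, 'minkowski', 3);   % general order-N (Minkowski) distance with N = 3
M  = squareform(D2);             % distances rearranged as a symmetric matrix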

Cluster Analysis Secondly, a variety of ways to cluster the points have been developed. The single link method judges the nearness of a point to a cluster on the basis of the distance to the closest point in the cluster. Conversely, the more conservative complete link method uses the distance to the farthest point. A more rigorous but computationally slower method is the centroid method in which the distance of a point to the centre of gravity of the points in a cluster is used.

Cluster Analysis. [Illustration: single-link versus complete-link distance between a point and a cluster]

Cluster Analysis. Points are grouped together into clusters based on their nearness or similarity, and we assume that the nearness of points in n-space reflects the similarity of their properties. Typically, measurements are made on the samples and used to calculate interpoint distances. Similarity values, S_ij, are calculated as

S_ij = 1 − d_ij / d_max

where d_max is the largest interpoint distance in the data set.
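A minimal sketch of this conversion (X and the variable names are assumptions, not taken from the slides):

d = pdist(X);              % interpoint distances d_ij
S = 1 - d / max(d);        % similarity: 1 for coincident points, 0 for the most distant pair
Smat = squareform(S);      % symmetric similarity matrix (squareform leaves zeros on the
                           % diagonal; self-similarities are 1 by definition)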

Cluster Analysis. Hierarchical clustering: once an object has been assigned to a group, the process cannot be reversed. [Dendrogram: observations 1-7 plotted against distance/similarity]

Cluster Analysis. [Figure: representing trees with dendrograms]

Cluster Analysis. Cluster analysis methods can be classified into two main categories: agglomerative and divisive. Agglomerative methods begin with each object being its own cluster and progress by combining (agglomerating) existing clusters into larger ones. Divisive methods start with a single cluster containing all objects and progress by dividing existing clusters into smaller ones.

Cluster Analysis. All clustering methods require the specification of a distance measure to indicate distances between objects, and subsequently between clusters, during method operation. The distances can be calculated from the original X-variables or from PCA scores; using scores brings collinearity- and noise-reduction benefits, but requires the specification of an appropriate number of PCs.
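A hedged sketch of clustering on PCA scores rather than on the original X-variables (the choice of four PCs is only an example, mirroring the honey-data dendrograms later in these slides):

[coeff, score] = pca(zscore(X));   % autoscale, then principal component analysis
T   = score(:, 1:4);               % retain an appropriate number of PCs (here 4)
idx = kmeans(T, 4);                % cluster the scores instead of the raw variables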

Cluster Analysis Euclidean or Mahalanobis distance The use of Mahalanobis distance allows one to account for dominant multivariate directions in the data when performing cluster analysis

Cluster Analysis Euclidean or Mahalanobis distance Illustration of Euclidean distance (a) and Mahalanobis distance (b) where the contours represent equidistant points from the center using each distance metric. R. De Maesschalck, D. Jouan-Rimbaud and D. L. Massart, "Tutorial - The Mahalanobis distance," Chemometrics and Intelligent Laboratory Systems, 2000
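For example (a sketch; 'mahalanobis' is a built-in pdist metric that, by default, uses the covariance matrix estimated from X):

De = pdist(X, 'euclidean');     % ignores correlation between variables
Dm = pdist(X, 'mahalanobis');   % accounts for the dominant multivariate directions
Z  = linkage(Dm, 'complete');   % hierarchical clustering on the Mahalanobis distances
dendrogram(Z, 0)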

K-Means. A data mining algorithm. It starts with a random selection of K objects that are to be used as cluster targets, where K is determined a priori. During each cycle of this clustering method, the remaining objects are assigned to one of these clusters based on their distance from each of the K targets. New cluster targets are then calculated as the means of the objects in each cluster. The procedure is repeated until no objects are reassigned after the updated mean calculations.
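A minimal sketch of this cycle (assuming an autoscaled samples-by-variables matrix X already in the workspace; the variable names and the iteration cap are illustrative, not part of the slides):

K = 4;
n = size(X, 1);
targets = X(randperm(n, K), :);              % random initial cluster targets
idx = zeros(n, 1);
for cycle = 1:100                            % safety cap on the number of cycles
    newidx = zeros(n, 1);
    for i = 1:n
        d = sum((targets - X(i, :)).^2, 2);  % squared Euclidean distance to each target
        [~, newidx(i)] = min(d);             % assign object i to the nearest target
    end
    if isequal(newidx, idx), break; end      % no reassignments: converged
    idx = newidx;
    for k = 1:K
        targets(k, :) = mean(X(idx == k, :), 1);   % new targets = cluster means
    end
end

In practice the built-in kmeans function used in the following slides does this, with better handling of empty clusters and multiple restarts.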

K-Means. k-means clustering is often more suitable than hierarchical clustering for large amounts of data. Honey data, K = 4; 27 samples and 11 parameters. Sample groups: acacia (ac): 1-3; floral (of): 4-7 and 20-27; rape (ra): 8-10; honeydew (hd): 11-19.

K-Means. honeydata.mat: X, 27 x 11. To get an idea of how well separated the resulting clusters are, you can make a silhouette plot using the cluster indices output from kmeans. The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters. This measure ranges from +1, indicating points that are very distant from neighboring clusters, through 0, indicating points that are not distinctly in one cluster or another, to -1, indicating points that are probably assigned to the wrong cluster.

Cluster K-Means. Autoscaling, k = 4: idx4 = kmeans(x,4); [silh4,h] = silhouette(x,idx4); Large silhouette values, greater than 0.6, indicate that the cluster is somewhat separated from neighboring clusters; points with low silhouette values, and points with negative values, indicate that the cluster is not well separated. [Silhouette plot for k = 4]

Cluster K-Means. Decrease the number of clusters (k = 3)? idx3 = kmeans(x,3); [silh3,h] = silhouette(x,idx3); [Silhouette plot for k = 3]

Cluster K-Means. Increase the number of clusters (k = 5)? idx5 = kmeans(x,5); [silh5,h] = silhouette(x,idx5); [Silhouette plot for k = 5]

K-Means. A more quantitative way to compare the solutions is to look at the average silhouette values. Tested up to k = 9; the best value was obtained with k = 2: mean(silh2) ans = 0.4810
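A sketch of that comparison as a loop (the 'Replicates' option, which repeats the random initialization to avoid poor local optima, is an addition and not taken from the slides):

avgsilh = zeros(1, 9);
for k = 2:9
    idx = kmeans(x, k, 'Replicates', 5);   % k-means solution for this k
    s = silhouette(x, idx);                % silhouette value of every sample
    avgsilh(k) = mean(s);                  % average silhouette for this k
end
[best, kbest] = max(avgsilh)               % k with the largest average silhouette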

Cluster K-Means. 1H NMR spectra. Tested from k = 2 to 9; the best value was k = 3, mean(silh3) = 0.3955. [1H NMR data plot and silhouette plot for k = 3]

K-Means Without some knowledge of how many clusters are really in the data, it is a good idea to experiment with a range of values for k

Nearest Neighbor (KNN). Single-linkage clustering. The distance between any two clusters is defined as the minimum of all possible pairwise distances between objects in the two clusters; the two clusters with the minimum distance are joined together. This method tends to perform well with data that form elongated "chain-type" clusters.
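A minimal sketch of single-linkage clustering on the autoscaled honey data (function names are from the Statistics and Machine Learning Toolbox; x is assumed to be the 27 x 11 autoscaled matrix):

D = pdist(x);              % pairwise Euclidean distances
Z = linkage(D, 'single');  % single-linkage (nearest-neighbor) joining
dendrogram(Z, 0)           % 0 = show every sample as its own leaf
% replace 'single' with 'complete' for the furthest-neighbor method described below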

Nearest Neighbor (KNN). Honey data. [Three dendrograms of the data with preprocessing: autoscale (Euclidean); autoscale, 4 PCs; autoscale, 4 PCs, Mahalanobis. x-axis: Distance to K-Nearest Neighbor]

Furthest Neighbor. The distance between any two clusters is defined as the maximum of all possible pairwise distances between objects in the two clusters; the two clusters with the minimum distance are joined together. This method tends to perform well with data that form "round", distinct clusters.

Furthest Neighbor. Honey data. [Three dendrograms of the data with preprocessing: autoscale; x-axis: Distance to Furthest Neighbor]

Centroid. The distance between any two clusters is defined as the distance between the multivariate means (centroids) of the two clusters; the two clusters with the minimum distance are joined together.

Centroid. Honey data. [Three dendrograms of the data with preprocessing: autoscale; x-axis: Distance Between Cluster Centers]

Pair-Group Average. The distance between any two clusters is defined as the average of all possible pairwise distances between objects in the two clusters; the two clusters with the minimum distance are joined together. This method tends to perform equally well with both "chain-type" and "round" clusters.

Pair-Group Average. Honey data. [Three dendrograms of the data with preprocessing: autoscale; x-axis: Average-Paired Distance]

Median. The distance between any two clusters is defined as the distance between the weighted multivariate means (centroids) of the two clusters, where the means are weighted by the number of objects in each cluster; the two clusters with the minimum distance are joined together. This method might perform better than the Centroid method if the number of objects is expected to vary greatly between clusters.
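For reference, MATLAB's linkage function exposes similarly defined between-cluster distances by name (a sketch; the mapping of its 'median' option, which uses weighted centroids, to the method described above is approximate):

Zc = linkage(x, 'centroid');   % centroid method
Za = linkage(x, 'average');    % pair-group average
Zm = linkage(x, 'median');     % weighted-centroid (median) method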

Median. Honey data. [Three dendrograms of the data with preprocessing: autoscale; x-axis: Distance Between Cluster Centers]

Ward's Method. This method does not require calculation of the cluster centers; it joins the two existing clusters such that the resulting pooled within-cluster variance (with respect to each cluster's centroid) is minimized.
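A short sketch of Ward's method on the same data (linkage applies the Ward criterion with Euclidean distances by default):

Zw = linkage(x, 'ward');        % join the pair that minimizes the pooled within-cluster variance
dendrogram(Zw, 0)
c = cluster(Zw, 'maxclust', 4); % cut the tree into 4 clusters, e.g. to compare with the k-means groups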

Ward's Method. [Three dendrograms of the data with preprocessing: autoscale; x-axis: Variance Weighted Distance Between Cluster Centers]

PCA. Honey data. [Samples/scores plot: Scores on PC 1 (44.77%) vs. Scores on PC 2 (20.35%)]