Clusters. Unsupervised Learning. Luc Anselin. Copyright 2017 by Luc Anselin, All Rights Reserved

Clusters Unsupervised Learning Luc Anselin http://spatial.uchicago.edu 1

curse of dimensionality principal components multidimensional scaling classical clustering methods 2

Curse of Dimensionality 3

Curse of dimensionality in a nutshell: low-dimensional techniques break down in high dimensions; the complexity of some functions increases exponentially with the variable dimension 4

Example 1: change with p (variable dimension) of the distance in the unit cube required to reach a fraction of the total data volume. Source: Hastie, Tibshirani, Friedman (2009) 5
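The relationship behind this figure is that a sub-cube capturing a fraction r of the volume of the p-dimensional unit cube needs edge length r^(1/p). A minimal Python sketch of that formula (the loop values are arbitrary illustrations, not taken from the slides):

```python
# edge length of a sub-cube that captures a fraction r of the volume
# of the p-dimensional unit cube: e_p(r) = r**(1/p)
for p in (1, 2, 3, 10):
    for r in (0.01, 0.1):
        print(f"p={p:2d}, fraction={r:.2f}: edge length = {r ** (1.0 / p):.3f}")
```

For p = 10, capturing even 1% of the volume already requires covering about 63% of the range of each variable, which is the essence of the curse of dimensionality.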

Example 2: nearest neighbor distance in one vs. two dimensions. Source: Hastie, Tibshirani, Friedman (2009) 6

Dimension reduction: reduce multiple variables into a smaller number of functions of the original variables - principal component analysis (PCA); visualize the multivariate similarity (distance) between observations in a lower dimension - multidimensional scaling (MDS), projection pursuit 7

Principal Components 8

Principle: capture the variance in the p by p covariance matrix X'X through a set of k principal components with k << p; principal components capture most of the variance; principal components are orthogonal; principal component coefficients (loadings) are scaled 9

principal components variance decomposition Source: Hastie, Tibshirani, Friedman (2009) 10

More formally: c_i = a_i1 x_1 + a_i2 x_2 + ... + a_ip x_p - each principal component is a weighted sum of the original variables; c_i'c_j = 0 - components are orthogonal to each other; Σ_k a_ik^2 = 1 - the sum of the squared loadings equals one; components in the direction of greatest variation, closest linear fit to the data points 11

Eigenvalue Decomposition: eigenvalue and eigenvectors of a square matrix A, Av = γv, with γ as eigenvalue and v as eigenvector; v is a special vector such that transformation by A (in general a rotation and a rescaling) keeps Av on the line spanned by v; eigen decomposition of the covariance matrix X'X (with X n by p, X'X is p by p): X'X = VGV', a p by p matrix; V is the p by p matrix of eigenvectors; G is the diagonal matrix of eigenvalues 12
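A minimal NumPy sketch of this decomposition, using a random placeholder for the n by p matrix of standardized variables (the names X, C, V are hypothetical, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((85, 6))            # placeholder for n by p standardized data
X = (X - X.mean(axis=0)) / X.std(axis=0)    # center and scale

C = X.T @ X                                 # p by p cross-product ("covariance") matrix
evals, V = np.linalg.eigh(C)                # eigenvalues (G) and eigenvectors (V)
evals, V = evals[::-1], V[:, ::-1]          # order from largest to smallest eigenvalue

# check the decomposition X'X = V G V'
assert np.allclose(C, V @ np.diag(evals) @ V.T)
```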

Singular Value Decomposition (SVD): applied to an n by p matrix X (not the square X'X): X = UDV'; U is orthogonal n by p, U'U = I; V is orthogonal p by p, V'V = I; D is diagonal p by p; SVD applied to X'X (covariance matrix): X'X = VDU'UDV' = VD^2 V' since U'U = I; D^2 is a diagonal matrix with the eigenvalues 13
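The correspondence between the SVD of X and the eigen decomposition of X'X can be checked directly; a sketch under the same placeholder-data assumption as above:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((85, 6))                   # placeholder n by p matrix

U, d, Vt = np.linalg.svd(X, full_matrices=False)   # X = U D V', with d the diagonal of D

# the eigenvalues of X'X are the squared singular values (D^2)
evals = np.linalg.eigvalsh(X.T @ X)[::-1]          # descending order
assert np.allclose(evals, d ** 2)
```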

Principal Components and Eigenvectors: the principal components are Xv, c_im = Σ_k x_ik v_km for each eigenvector m - each variable multiplied by its loading in the eigenvector; signs are arbitrary (v and -v are equivalent); Σ_k v_km^2 = 1 14
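Since the components are just Xv, the scores follow from one matrix product; a short sketch continuing the placeholder setup (names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((85, 6))               # placeholder standardized data
_, V = np.linalg.eigh(X.T @ X)                 # columns of V are the eigenvectors
V = V[:, ::-1]                                 # largest eigenvalue first

scores = X @ V                                 # c_im = sum_k x_ik v_km, one column per component
assert np.allclose((V ** 2).sum(axis=0), 1.0)  # each eigenvector has unit-length loadings
assert np.allclose(X @ (-V), -scores)          # v and -v give equivalent components
```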

Typical results of interest: loadings for each principal component - the contribution of each of the original variables to that component; principal component score - the value of the principal component for each observation; variance proportion explained - the proportion of the overall variance each principal component explains 15
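In practice, all three results come straight from a library routine; a sketch with scikit-learn, using random placeholder data in place of the six Guerry indicators (the array names are hypothetical):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
data = rng.standard_normal((85, 6))       # placeholder for the six indicators

Z = StandardScaler().fit_transform(data)  # standardize the variables
pca = PCA()
scores = pca.fit_transform(Z)             # principal component score for each observation

print(pca.components_)                    # loadings: one row per component, one column per variable
print(pca.explained_variance_ratio_)      # variance proportion explained by each component
```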

Illustration 16

André-Michel Guerry - Moral Statistics of France (1833): social indicators for 86 French départements; (one of the) first multivariate statistical analyses with geographic visualization; available as the R package Guerry by Friendly and Dray (see also Friendly 2007 for an illustration); converted to a GeoDa sample data set; re-analyzed by Dray and Jombart (2011) 17

Moral Statistics: crime against persons (pop/crime) - Crm_prs; crime against property (pop/crime) - Crm_prp; literacy - Litercy; donations - Donatns; births out of wedlock (pop/births) - Infants; suicides (pop/suicides) - Suicids; indicators inverted such that larger values correspond to better outcomes 18

box maps 19

correlation matrix 20

Principal Components and Variance Decomposition: eigenvalue = variance; standard deviation = square root of the eigenvalue; sum of eigenvalues = total variance: 2.1405 + 1.2008 + 1.1021 + 0.6667 + 0.5487 + 0.3410 = 5.9997; variance proportion = 2.1405/5.9997 = 0.357 21

how many components? elbow in the scree plot (ideal shape vs. Guerry data) 22

cumulative scree plot (ideal shape vs. Guerry data) 23

principal component c_i = a_i1 x_1 + a_i2 x_2 + ... + a_ip x_p; sum of squared loadings = 1: 0.0043 + 0.2625 + 0.2619 + 0.0113 + 0.2037 + 0.2563 = 1 24

squared correlations between variable and pc = proportion of the variance of the variable explained by each pc 25

principal components biplot relative loadings of individual variables 26

principal components are uncorrelated 27

Spatializing the PCA 28

choropleth maps for principal components 29

local Moran and local Geary cluster maps for pc1 and pc2 30

bivariate local Geary, pc1-pc2 31

Multidimensional Scaling 32

Principle: n observations are points in a p-dimensional data hypercube; p-variate distance or dissimilarity between all pairs of points (e.g., Euclidean distance in p dimensions); represent the n observations in a lower dimensional space (at most p - 1) while respecting the pairwise distances 33

More formally: n by n distance or dissimilarity matrix D, with d_ij = ||x_i - x_j|| the Euclidean distance in p dimensions; find values z_1, z_2, ..., z_n in k-dimensional space (with k << p) that minimize the stress function S(z) = Σ_i,j (d_ij - ||z_i - z_j||)^2; least squares or Kruskal-Shepard scaling 34
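A sketch of least-squares (metric) MDS with scikit-learn's SMACOF-based implementation, again on placeholder standardized data (names hypothetical); GeoDa's own implementation may differ in details:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
Z = rng.standard_normal((85, 6))                 # placeholder standardized data

D = squareform(pdist(Z, metric="euclidean"))     # n by n matrix of d_ij
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)                    # z_1, ..., z_n in two dimensions

print(coords.shape)                              # (85, 2)
print(mds.stress_)                               # value of the stress criterion
```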

MDS plot 35

Spatializing MDS 36

neighbors in multivariate space are not always neighbors in geographic space 37

brushing the MDS plot 38

Classical Clustering Methods 39

Principle: grouping of similar observations; maximize within-group similarity; minimize between-group similarity or, equivalently, maximize between-group dissimilarity; each observation belongs to one and only one group 40

Issues: similarity criterion - Euclidean distance, correlation; how many groups - many rules of thumb; computational challenges - combinatorial problem, NP-hard; n observations in k groups gives k^n possible partitions; k = 4 with n = 85 gives k^n = 1.5 x 10^51; no guarantee of a global optimum 41

Dissimilarity: data x_ij with variable j at observation i; distance by attribute: d_j(x_ij, x_i'j) = (x_ij - x_i'j)^2, typically Euclidean distance; distance between objects: D(x_i, x_i') = Σ_j d_j(x_ij, x_i'j); weighted distance between objects: D(x_i, x_i') = Σ_j w_j d_j(x_ij, x_i'j) with Σ_j w_j = 1, to reflect the relative importance of attributes; equal weighting = inverse attribute variance, automatic for standardized variables 42
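A small sketch of the attribute-wise and weighted distance computations, with hypothetical values:

```python
import numpy as np

x_i  = np.array([1.0, 2.0, 3.0])      # observation i on p = 3 attributes
x_i2 = np.array([0.5, 2.5, 1.0])      # observation i'
w    = np.array([0.5, 0.25, 0.25])    # attribute weights, summing to one

d_j = (x_i - x_i2) ** 2               # per-attribute squared distances d_j
D_equal = d_j.sum()                   # D(x_i, x_i') with equal weights
D_weighted = (w * d_j).sum()          # weighted distance, sum_j w_j d_j
print(D_equal, D_weighted)
```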

Two main approaches: hierarchical clustering - start from the bottom, determine the number of clusters later; partitioning clustering (k-means) - start with a random assignment to k groups, number of clusters pre-determined; many clustering algorithms 43

Hierarchical Clustering 44

Algorithm: find the two observations that are closest (most similar), they form a cluster; determine the next closest pair, including the existing clusters in the comparisons; continue grouping until all observations have been included; the result is a dendrogram, a hierarchical tree structure 45
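A sketch of this agglomerative procedure with SciPy, on placeholder standardized data (the variable names and the choice of complete linkage with k = 5 simply anticipate the next slides):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
Z = rng.standard_normal((85, 6))                       # placeholder standardized data

merges = linkage(Z, method="complete")                 # sequence of agglomerative merges
labels = fcluster(merges, t=5, criterion="maxclust")   # cut the tree into k = 5 clusters
print(np.bincount(labels))                             # cluster sizes

# dendrogram(merges) draws the hierarchical tree (requires matplotlib)
```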

[figure: hierarchical clustering algorithm, step by step - 2 and 3 are the closest points and become a cluster; 5 and 7 are the closest points and become a cluster; 1 and the cluster of 2 and 3 are the closest; 4 and the cluster of 1, 2, and 3 are the closest; 8 and the cluster of 5 and 7 are the closest; the two remaining clusters are the closest and the algorithm has finished; rewind ("cut") the tree to reveal the desired number of clusters, e.g., stop higher for two clusters or lower for four clusters. Source: Grolemund and Wickham (2016)] 46

Practical issues: measure of similarity (dissimilarity) between clusters = linkage; complete - compact clusters; single - elongated clusters, singletons; average; centroid; others; how many clusters - where to cut the tree 47

[figure: types of linkages - complete, single, average, and centroid - each panel showing the inter-cluster distance δ between example clusters. Source: Grolemund and Wickham (2016)] 48

example: complete linkage dendrogram 49

hierarchical clustering: dendrogram and map complete linkage, k = 5 50

Characteristics of Clusters: total sum of squares (total SS) - relative to the overall mean; within sum of squares (within SS) - for each cluster, relative to the mean of that cluster; between sum of squares (between SS) = total SS - sum of within SS; measure of cluster quality: ratio of between SS / total SS, higher is better 51
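These quantities follow directly from the cluster labels; a minimal sketch with hypothetical data and labels:

```python
import numpy as np

def cluster_quality(Z, labels):
    """Return total SS, within SS, between SS, and the ratio between SS / total SS."""
    total_ss = ((Z - Z.mean(axis=0)) ** 2).sum()
    within_ss = sum(((Z[labels == g] - Z[labels == g].mean(axis=0)) ** 2).sum()
                    for g in np.unique(labels))
    between_ss = total_ss - within_ss
    return total_ss, within_ss, between_ss, between_ss / total_ss

rng = np.random.default_rng(0)
Z = rng.standard_normal((85, 6))             # placeholder standardized data
labels = rng.integers(0, 5, size=85)         # placeholder cluster labels for k = 5
print(cluster_quality(Z, labels))
```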

cluster characteristics with k=5 52

hierarchical clustering: dendrogram and map complete linkage, k = 4 53

cluster characteristics with k=4 54

example: single linkage dendrogram 55

hierarchical clustering: dendrogram and map single linkage, k = 4 56

cluster characteristics with k=4, single linkage 57

k-means Clustering 58

Algorithm: randomly assign the n observations to k groups; compute each group centroid (or other representative point); assign observations to the closest centroid; iterate until convergence 59
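A bare-bones NumPy sketch of exactly this loop, with a random placeholder data set; library implementations add smarter initialization and multiple restarts:

```python
import numpy as np

def kmeans(Z, k, n_iter=100, seed=0):
    """Plain k-means: random assignment, centroid update, reassignment until stable."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(Z))              # random initial assignment
    for _ in range(n_iter):
        # group centroids; re-seed any group that happens to become empty
        centroids = np.array([Z[labels == g].mean(axis=0) if np.any(labels == g)
                              else Z[rng.integers(len(Z))] for g in range(k)])
        dists = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        new_labels = dists.argmin(axis=1)                 # assign to the closest centroid
        if np.array_equal(new_labels, labels):            # converged: membership unchanged
            break
        labels = new_labels
    return labels

rng = np.random.default_rng(0)
Z = rng.standard_normal((85, 6))                          # placeholder standardized data
print(np.bincount(kmeans(Z, 4)))                          # cluster sizes for k = 4
```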

[figure: k-means clustering algorithm - randomly assign points to k groups (here k = 3); compute the centroid of each group; reassign each point to the group of the closest centroid; re-compute the centroids; repeat until group membership ceases to change. Source: Grolemund and Wickham (2016)] 60

Practical issues: which k to select? compare solutions on within-group and between-group similarities (sum of squares); sensitivity to the starting point - use several random assignments and pick the best, to avoid local optima; sensitivity analysis; replicability - set the random seed 61
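With scikit-learn, the two replicability concerns map onto the n_init (number of random starts, keeping the best solution) and random_state (fixed seed) arguments; a usage sketch on placeholder data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
Z = rng.standard_normal((85, 6))                         # placeholder standardized data

km = KMeans(n_clusters=4, n_init=25, random_state=123)   # 25 random starts, fixed seed
labels = km.fit_predict(Z)
print(km.inertia_)                                       # total within-cluster sum of squares
print(np.bincount(labels))                               # cluster sizes
```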

k-means, k=4 62

k-means, k=5 63

summary - cluster characteristics, k = 4

                    Total SS   Within SS   Between SS   Ratio B/T
  complete linkage       504       335.3        168.7       0.335
  single linkage         504       424.2         79.8       0.158
  k-means                504       285.7        218.3       0.433

64