SC4/SM4 Data Mining and Machine Learning Clustering


SC4/SM4 Data Mining and Machine Learning: Clustering
Dino Sejdinovic, Department of Statistics, Oxford
Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/dmml
(SC4/SM4 DMML, HT2017)

Clustering

Introduction
Many datasets consist of multiple heterogeneous subsets. Cluster analysis: given unlabelled data, we want algorithms that automatically group the datapoints into coherent subsets/clusters. Examples: market segmentation of shoppers based on browsing and purchase histories; different types of breast cancer based on gene expression measurements; discovering communities in social networks; image segmentation.

Types of Clustering
Model-free clustering: defined by similarity/dissimilarity among instances within clusters. Model-based clustering: each cluster is described using a probability model.

Model-free clustering
The notion of similarity/dissimilarity between data items is central: there are many ways to define it, and the choice depends on the dataset being analysed and on domain-specific knowledge. The most common approach is partition-based clustering: one divides the n data items into K clusters $C_1,\dots,C_K$ where for all $k, k' \in \{1,\dots,K\}$,
$$C_k \subset \{1,\dots,n\},\qquad C_k \cap C_{k'} = \emptyset \;\;\forall k \neq k',\qquad \bigcup_{k=1}^K C_k = \{1,\dots,n\}.$$
Intuitively, clustering aims to group similar items together and to place dissimilar items into different groups. These two objectives can contradict each other (similarity is not a transitive relation, while being in the same cluster is an equivalence relation).

Axiomatic approach
A clustering method is a map $F : (D = \{x_i\}_{i=1}^n, \rho) \mapsto \{C_1,\dots,C_K\}$ which takes as input a dataset D and a dissimilarity function ρ and returns a partition of D. Three basic properties are required:
Scale invariance. For any $\alpha > 0$, $F(D, \alpha\rho) = F(D, \rho)$.
Richness. For any partition $C = \{C_1,\dots,C_K\}$ of D, there exists a dissimilarity ρ such that $F(D, \rho) = C$.
Consistency. If ρ and ρ′ are two dissimilarities such that for all $x_i, x_j \in D$: if $x_i, x_j$ belong to the same cluster in $F(D, \rho)$ then $\rho'(x_i, x_j) \le \rho(x_i, x_j)$, and if $x_i, x_j$ belong to different clusters in $F(D, \rho)$ then $\rho'(x_i, x_j) \ge \rho(x_i, x_j)$, then $F(D, \rho') = F(D, \rho)$.
Kleinberg (2003) proves that there exists no clustering method satisfying all three properties!

Examples of Model-free Clustering
K-means clustering: a partition-based method into K clusters. Finds groups such that the variation within each group is small. The number of clusters K is usually fixed beforehand, or various values of K are investigated as part of the analysis.
Spectral clustering: the similarity/dissimilarity between data items defines a graph. Find a partition of the vertices which does not "cut" many edges. Can be interpreted as nonlinear dimensionality reduction followed by K-means.
Hierarchical clustering: nearby data items are joined into clusters, then clusters into super-clusters, forming a hierarchy. Typically the hierarchy forms a binary tree (a dendrogram) where each cluster has two children clusters. The dendrogram allows one to view the clusterings for each possible number of clusters, from 1 to n (the number of data items).

K-means Clustering
Goal: divide the data items into a pre-assigned number K of clusters $C_1,\dots,C_K$ where for all $k, k' \in \{1,\dots,K\}$,
$$C_k \subset \{1,\dots,n\},\qquad C_k \cap C_{k'} = \emptyset \;\;\forall k \neq k',\qquad \bigcup_{k=1}^K C_k = \{1,\dots,n\}.$$
Each cluster is represented using a prototype or cluster centroid $\mu_k$. We can measure the quality of a cluster by its within-cluster deviance
$$W(C_k, \mu_k) = \sum_{i \in C_k} \|x_i - \mu_k\|_2^2.$$
The overall quality of the clustering is given by the total within-cluster deviance
$$W = \sum_{k=1}^K W(C_k, \mu_k).$$
W is the overall objective function used to select both the cluster centroids and the assignment of points to clusters.

K-means
$$W = \sum_{k=1}^K \sum_{i \in C_k} \|x_i - \mu_k\|_2^2 = \sum_{i=1}^n \|x_i - \mu_{c_i}\|_2^2,$$
where $c_i = k$ if and only if $i \in C_k$.
Given the partition $\{C_k\}$, we can find the optimal prototypes easily by differentiating W with respect to $\mu_k$:
$$\frac{\partial W}{\partial \mu_k} = -2 \sum_{i \in C_k} (x_i - \mu_k) = 0 \;\Rightarrow\; \mu_k = \frac{1}{|C_k|} \sum_{i \in C_k} x_i.$$
Given the prototypes, we can easily find the optimal partition by assigning each data point to the closest cluster prototype:
$$c_i = \operatorname*{argmin}_k \|x_i - \mu_k\|_2^2.$$
But joint minimization over both is computationally difficult.

K-means algorithm
The K-means algorithm is a widely used method that returns a local optimum of the objective function W, using iterative and alternating minimization.
1. Randomly initialize K cluster centroids $\mu_1,\dots,\mu_K$.
2. Cluster assignment: for each $i = 1,\dots,n$, assign $x_i$ to the cluster with the nearest centroid, $c_i := \operatorname*{argmin}_k \|x_i - \mu_k\|_2^2$, and set $C_k := \{i : c_i = k\}$ for each k.
3. Move centroids: set $\mu_1,\dots,\mu_K$ to the averages of the new clusters, $\mu_k := \frac{1}{|C_k|} \sum_{i \in C_k} x_i$.
4. Repeat steps 2-3 until convergence.
5. Return the partition $\{C_1,\dots,C_K\}$ and means $\mu_1,\dots,\mu_K$.
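The alternating minimization above translates directly into a few lines of R. The following is a minimal sketch, not the lecture's code: the function simple_kmeans and its arguments are illustrative names, empty clusters are not handled, and in practice one would use the built-in kmeans().

simple_kmeans <- function(X, K, max_iter = 100) {        # X: an n x p numeric matrix
  mu <- X[sample(nrow(X), K), , drop = FALSE]             # step 1: random centroids picked from the data
  c_old <- rep(0, nrow(X))
  for (iter in 1:max_iter) {
    d2 <- sapply(1:K, function(k) rowSums(sweep(X, 2, mu[k, ])^2))  # squared distances to each centroid
    c_new <- apply(d2, 1, which.min)                       # step 2: assign each point to the nearest centroid
    if (all(c_new == c_old)) break                         # step 4: stop when assignments no longer change
    mu <- t(sapply(1:K, function(k) colMeans(X[c_new == k, , drop = FALSE])))  # step 3: move centroids
    c_old <- c_new
  }
  W <- sum(d2[cbind(1:nrow(X), c_new)])                    # total within-cluster deviance W
  list(cluster = c_new, centers = mu, W = W)               # step 5: return partition, centroids and W
}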

[Figure: K-means illustration on a 2-D toy dataset. Successive iterations alternate the two steps, with the objective W decreasing monotonically: assign points (W = 128.1), move centroids (W = 50.979), assign points (W = 31.969), move centroids (W = 19.72), assign points (W = 19.688), move centroids (W = 19.632).]

K-means
The algorithm stops in a finite number of iterations. Between steps 2 and 3, W either stays constant or decreases; this implies that we never revisit the same partition. As there are only finitely many partitions, the number of iterations cannot exceed the number of partitions.
The K-means algorithm need not converge to the global optimum. K-means is a heuristic search algorithm, so it can get stuck at suboptimal configurations; the result depends on the starting configuration. Typically one performs a number of runs from different starting configurations and picks the end result with minimum W.
[Figure: three runs from different initializations on the same toy data, reaching local optima with W = 9.184, W = 3.418 and W = 9.264.]
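In R's built-in kmeans() this is done through the nstart argument, which runs the algorithm from several random initializations and keeps the run with the smallest total within-cluster deviance (reported as tot.withinss). A small sketch on synthetic data; the variable names, the seed and the choice of 3 centers are illustrative.

set.seed(1)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),     # two well-separated synthetic clusters
           matrix(rnorm(100, mean = 3), ncol = 2))
fit1  <- kmeans(X, centers = 3, nstart = 1)             # a single random start can get stuck
fit20 <- kmeans(X, centers = 3, nstart = 20)            # best of 20 random starts
c(single_start = fit1$tot.withinss, best_of_20 = fit20$tot.withinss)  # best-of-20 W is typically no larger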

K-means on Crabs
Looking at the Crabs data again.
library(MASS)
library(lattice)
data(crabs)
splom(~log(crabs[,4:8]), pch=as.numeric(crabs[,2]), col=as.numeric(crabs[,1]),
      main="circle/triangle is gender, black/red is species")

[Figure: scatter plot matrix of the log-transformed Crabs measurements (FL, RW, CL, CW, BD); circle/triangle is gender, black/red is species.]

K-means on Crabs
Apply K-means with 2 clusters and plot the results.
Crabs.kmeans <- kmeans(log(crabs[,4:8]), 2, nstart=1, iter.max=10)
splom(~log(crabs[,4:8]), col=Crabs.kmeans$cluster+2,
      main="blue/green is cluster; finds big/small")

[Figure: scatter plot matrix coloured by the two K-means clusters (blue/green); the clustering essentially separates big crabs from small crabs.]

K-means on Crabs
Whiten or sphere the data using PCA, i.e. apply a linear transformation so that the covariance matrix is the identity.
pcp <- princomp(log(crabs[,4:8]))
Crabs.sphered <- pcp$scores %*% diag(1/pcp$sdev)
splom(~Crabs.sphered[,1:3], col=as.numeric(crabs[,1]), pch=as.numeric(crabs[,2]),
      main="circle/triangle is gender, black/red is species")
And apply K-means again.
Crabs.kmeans <- kmeans(Crabs.sphered, 2, nstart=1, iter.max=20)
splom(~Crabs.sphered[,1:3], col=Crabs.kmeans$cluster+2, main="blue/green is cluster")

[Figure: scatter plot matrices of the first three sphered components; one panel coloured by true gender/species, the other by the two K-means clusters.]
K-means now discovers the gender difference... but the result depends crucially on sphering the data first!

K-means on Crabs with K = 4
[Figure: scatter plot matrices of the first three sphered components; one panel coloured by true gender/species, the other by the four K-means clusters.]
> table(Crabs.kmeans$cluster, Crabs.class)
   Crabs.class
     BF BM OF OM
   1  3  0 41  0
   2 39  8  6  0
   3  8 42  0  0
   4  0  0  3 50

K-means: Additional Comments
Good practice initialization. Randomly pick K training examples (without replacement) and set $\mu_1, \mu_2,\dots,\mu_K$ equal to those examples.
Sensitivity to the distance measure. Euclidean distance can be greatly affected by the measurement units and by strong correlations. One can use the Mahalanobis distance instead:
$$\|x - y\|_M = \sqrt{(x - y)^\top M^{-1} (x - y)},$$
where M is a positive semi-definite matrix, e.g. the sample covariance.
Determination of K. The K-means objective always improves with a larger number of clusters K, so determining K requires an additional regularization criterion. E.g., in DP-means (see the DP-means paper), one uses
$$W = \sum_{k=1}^K \sum_{i \in C_k} \|x_i - \mu_k\|_2^2 + \lambda K.$$
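Both comments can be illustrated with a short, hedged R sketch; the whitening step, the penalty value lambda and all names below are illustrative choices, not from the slides. K-means with Mahalanobis distance, with M equal to the sample covariance, is equivalent to whitening the data and running Euclidean K-means, and the penalized objective can be scanned over K.

X <- as.matrix(log(crabs[, 4:8]))                     # assumes the Crabs data loaded above
S <- cov(X)
Xw <- X %*% solve(chol(S))                            # whitened data: sample covariance becomes the identity
fit_mahal <- kmeans(Xw, centers = 2, nstart = 20)     # Euclidean K-means on Xw = Mahalanobis K-means on X

lambda <- 5                                           # illustrative per-cluster penalty
penalized_W <- sapply(1:8, function(K)
  kmeans(Xw, centers = K, nstart = 20)$tot.withinss + lambda * K)   # W + lambda * K, as in DP-means
which.min(penalized_W)                                # penalized choice of K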

Other partition-based methods
Other partition-based methods with related ideas:
K-medoids: requires each cluster centroid $\mu_k$ to be one of the observations $x_j$ (see also affinity propagation); a short R sketch follows below.
K-medians: cluster centroids represented by the median in each dimension.
K-modes: cluster centroids represented by a mode estimated from the cluster.
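A hedged K-medoids example using pam() from the cluster package (the package is standard in R but not named on the slide; the data and choice of K are illustrative):

library(cluster)                          # pam(): partitioning around medoids (K-medoids)
X <- as.matrix(log(crabs[, 4:8]))         # assumes the Crabs data loaded above
fit <- pam(X, k = 2)
fit$medoids                               # unlike K-means centroids, the medoids are actual observations
table(fit$clustering)                     # cluster sizes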

Nonlinear cluster structures
The K-means algorithm will often fail when applied to data with elongated or non-convex cluster structures.

Clustering and Graph Cuts
Construct a weighted undirected similarity graph $G = (\{1,\dots,n\}, W)$, where vertices correspond to data items and W is the matrix of edge weights corresponding to pairwise item similarities. Partition the graph vertices into $C_1, C_2,\dots,C_K$ to minimize the graph cut. The unnormalized graph cut across clusters is given by
$$\mathrm{cut}(C_1,\dots,C_K) = \sum_{k=1}^K \mathrm{cut}(C_k, \bar C_k),$$
where $\bar C_k$ is the complement of $C_k$ and $\mathrm{cut}(A, B) = \sum_{i \in A,\, j \in B} w_{ij}$ is the sum of the weights separating a vertex subset A from a disjoint vertex subset B. Minimizing this typically results in singleton clusters, so one needs to balance the cuts by the cluster sizes in the partition. One approach is the notion of "ratio cut":
$$\mathrm{ratio\text{-}cut}(C_1,\dots,C_K) = \sum_{k=1}^K \frac{\mathrm{cut}(C_k, \bar C_k)}{|C_k|}.$$
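The two quantities are easy to compute for a given weight matrix; a small illustrative helper (function names are not from the lecture):

cut_weight <- function(W, A, B) sum(W[A, B])           # total weight of edges between vertex sets A and B
ratio_cut <- function(W, labels) {
  sum(sapply(unique(labels), function(k) {
    A <- which(labels == k)
    B <- which(labels != k)
    cut_weight(W, A, B) / length(A)                     # cut(C_k, complement of C_k) / |C_k|
  }))
}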

Graph Laplacian
The (unnormalized) Laplacian of a graph $G = (\{1,\dots,n\}, W)$ is an $n \times n$ matrix given by $L = D - W$, where D is a diagonal matrix with $D_{ii} = \deg(i)$, and $\deg(i)$ denotes the degree of vertex i, defined as $\deg(i) = \sum_{j=1}^n w_{ij}$.
The Laplacian always has the column vector 1 as an eigenvector with eigenvalue 0, since all its rows sum to zero (exercise). The Laplacian is a positive semi-definite matrix, so all its eigenvalues are non-negative.
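A minimal R sketch (illustrative; it assumes a symmetric weight matrix with zero diagonal) that builds L and checks both properties numerically:

graph_laplacian <- function(W) diag(rowSums(W)) - W     # L = D - W

set.seed(1)
W <- matrix(runif(25), 5, 5); W <- (W + t(W)) / 2; diag(W) <- 0   # a random symmetric weight matrix
L <- graph_laplacian(W)
range(L %*% rep(1, 5))                    # numerically zero: 1 is an eigenvector of L with eigenvalue 0
min(eigen(L, symmetric = TRUE)$values)    # >= 0 up to rounding: L is positive semi-definite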

Laplacian and Ratio Cuts
Lemma. For a given partition $C_1, C_2,\dots,C_K$ define the column vectors $h_k \in \mathbb{R}^n$ as $h_{k,i} = \frac{1}{\sqrt{|C_k|}} \mathbf{1}\{i \in C_k\}$. Then
$$\mathrm{ratio\text{-}cut}(C_1,\dots,C_K) = \sum_{k=1}^K h_k^\top L h_k. \qquad (1)$$
To minimize the ratio cut, search for orthonormal vectors $h_k$ with entries either 0 or $1/\sqrt{|C_k|}$ which minimize the RHS in (1). This is equivalent to integer programming, so computationally hard.
Relaxation: search for any collection of orthonormal vectors $h_k$ in $\mathbb{R}^n$ that minimize the RHS in (1), which corresponds to the eigendecomposition of the Laplacian.

Laplacian and Connected Components
If the original graph is disconnected then, in addition to 1, there are other 0-eigenvectors of L, corresponding to the indicators of the connected components of the graph (Murphy, Theorem 25.4.1). Spectral clustering treats the constructed graph as a "small perturbation" of a disconnected graph.

Eigenvectors as dimensionality reduction
Spectral clustering: eigendecompose L and take the K eigenvectors corresponding to the K smallest eigenvalues. This gives a new "data matrix" $Z = [u_1,\dots,u_K] \in \mathbb{R}^{n \times K}$, on which we can apply a more conventional clustering algorithm, such as K-means.
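Putting the pieces together, a hedged end-to-end sketch in R; the Gaussian similarity, its bandwidth sigma, the two-circles test data and all function names are illustrative choices rather than anything prescribed by the slides.

spectral_cluster <- function(X, K, sigma = 1) {
  W <- exp(-as.matrix(dist(X))^2 / (2 * sigma^2))       # Gaussian similarities between all pairs of points
  diag(W) <- 0                                          # no self-edges
  L <- diag(rowSums(W)) - W                             # unnormalized Laplacian L = D - W
  ev <- eigen(L, symmetric = TRUE)                      # eigenvalues returned in decreasing order
  Z <- ev$vectors[, (nrow(W) - K + 1):nrow(W)]          # eigenvectors of the K smallest eigenvalues
  kmeans(Z, centers = K, nstart = 20)$cluster           # conventional K-means on the new data matrix Z
}

# Two concentric circles: raw K-means fails, spectral clustering should typically separate them.
set.seed(1)
theta <- runif(200, 0, 2 * pi)
radius <- rep(c(1, 3), each = 100)
X <- cbind(radius * cos(theta), radius * sin(theta)) + matrix(rnorm(400, sd = 0.1), 200, 2)
table(spectral_cluster(X, K = 2, sigma = 0.5), rep(1:2, each = 100))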

Hierarchical Clustering
Hierarchically structured data is ubiquitous (genus, species, subspecies, individuals, ...). There are two general strategies for generating hierarchical clusters, both proceeding by seeking to minimize some measure of overall dissimilarity: agglomerative (bottom-up, merging) and divisive (top-down, splitting). Higher-level clusters are created by merging clusters at lower levels; this process can easily be viewed as a tree/dendrogram, and it avoids specifying how many clusters are appropriate. In R: hclust, and agnes in the cluster package.

EU Indicators Data
6 economic indicators for EU countries in 2011.
> eu <- read.csv("http://www.stats.ox.ac.uk/~sejdinov/sdmml/data/eu_indicators.csv", sep= )
> eu[1:15,]
    Countries abbr    CPI   UNE    INP     BOP     PRC UN_perc
1     Belgium   BE 116.03  4.77 125.59   908.6  6716.5    -1.6
2    Bulgaria   BG 141.20  7.31 102.39    27.8  1094.7     3.5
3   CzechRep.   CZ 116.20  4.88 119.01  -277.9  2616.4    -0.6
4     Denmark   DK 114.20  6.03  88.20  1156.4  7992.4     0.5
5     Germany   DE 111.60  4.63 111.30   499.4  6774.6    -1.3
6     Estonia   EE 135.08  9.71 111.50   153.4  2194.1    -7.7
7     Ireland   IE 106.80 10.20 111.20  -166.5  6525.1     2.0
8      Greece   EL 122.83 11.30  78.22  -764.1  5620.1     6.4
9       Spain   ES 116.97 15.79  83.44  -280.8  4955.8     0.7
10     France   FR 111.55  6.77  92.60  -337.1  6828.5    -0.9
11      Italy   IT 115.00  5.05  87.80  -366.2  5996.6    -0.5
12     Cyprus   CY 116.44  5.14  86.91 -1090.6  5310.3    -0.4
13     Latvia   LV 144.47 12.11 110.39    42.3  1968.3    -3.6
14  Lithuania   LT 135.08 11.47 114.50   -77.4  2130.6    -4.3
15 Luxembourg   LU 118.19  3.14  85.51  2016.5 10051.6    -3.0
Data from Greenacre (2012)

EU Indicators Data
dat <- scale(eu[,3:8])
rownames(dat) <- eu$Countries
biplot(princomp(dat))
[Figure: PCA biplot of the scaled indicators (Comp.1 vs Comp.2), showing the countries together with the loading directions of CPI, UNE, INP, BOP, PRC and UN_perc.]

Visualising Hierarchical Clustering
> hc <- hclust(dist(dat))
> plot(hc, hang=-1)
> library(ape)
> plot(as.phylo(hc), type = "fan")
[Figure: cluster dendrogram of the EU countries from hclust(dist(dat)) with complete linkage, and the same tree drawn in a fan layout using the ape package.]

Visualising Hierarchical Clustering
Levels in the dendrogram represent a dissimilarity between examples. The tree dissimilarity $d^T_{ij}$ is the minimum height in the tree at which examples i and j belong to the same cluster. It satisfies the ultrametric inequality (stronger than the triangle inequality): $d^T_{ij} \le \max\{d^T_{ik}, d^T_{kj}\}$. Hierarchical clustering can be interpreted as an approximation of a given dissimilarity $d_{ij}$ by an ultrametric dissimilarity.
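In R this tree dissimilarity is the cophenetic distance, returned by cophenetic(). Continuing with the hc object from the previous slide, one can spot-check the ultrametric inequality; the tolerance and the loop below are illustrative.

dT <- as.matrix(cophenetic(hc))           # d^T_ij: height at which i and j first join the same cluster
n <- nrow(dT)
ok <- TRUE
for (k in 1:n)                            # check d^T_ij <= max(d^T_ik, d^T_kj) for all i, j and this k
  ok <- ok && all(dT <= outer(dT[, k], dT[k, ], pmax) + 1e-12)
ok                                        # TRUE: the tree dissimilarity is an ultrametric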

Measuring Dissimilarity Between Clusters
To join clusters $C_i$ and $C_j$ into super-clusters, we need a way to measure the dissimilarity $D(C_i, C_j)$ between them.
[Figure: two clusters $\{x_1, x_2\}$ and $\{x_3, x_4, x_5\}$; single linkage uses the closest pair $d(x_1, x_3)$, complete linkage the farthest pair $d(x_2, x_5)$, and average linkage the average $\frac{1}{6}\sum_{i=1}^2\sum_{j=3}^5 d(x_i, x_j)$.]

Measuring Dissimilarity Between Clusters
(a) Single linkage: produces elongated, loosely connected clusters; $D(C_i, C_j) = \min_{x, y} \{d(x, y) : x \in C_i, y \in C_j\}$.
(b) Complete linkage: produces compact clusters, but relatively similar objects can remain separated at high levels; $D(C_i, C_j) = \max_{x, y} \{d(x, y) : x \in C_i, y \in C_j\}$.
(c) Average linkage: tries to balance the two above, but is affected by the scale of the dissimilarities; $D(C_i, C_j) = \operatorname{avg}_{x, y} \{d(x, y) : x \in C_i, y \in C_j\}$.
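In hclust the linkage is chosen with the method argument. A brief hedged comparison on the scaled EU indicators data prepared above; the choice of 4 clusters is illustrative.

d <- dist(dat)                                         # Euclidean dissimilarities between countries
hc_single   <- hclust(d, method = "single")            # single linkage: prone to chaining
hc_complete <- hclust(d, method = "complete")          # complete linkage (hclust's default): compact clusters
hc_average  <- hclust(d, method = "average")           # average linkage: a compromise between the two
table(cutree(hc_single, k = 4), cutree(hc_complete, k = 4))   # the resulting partitions generally differ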

Using Dendrograms
Different ways of measuring dissimilarity result in different trees. Dendrograms are useful for getting a feel for the structure of high-dimensional data, though they don't represent distances between observations well. Dendrograms depict cluster assignments with respect to increasing values of the dissimilarity threshold: cutting a dendrogram horizontally at a particular height partitions the data into disjoint clusters, represented by the vertical lines it intersects. Despite the simplicity of this idea and the above drawbacks, hierarchical clustering methods provide users with interpretable dendrograms that allow clusters in high-dimensional data to be better understood.

Further reading
Hastie et al., Section 14.3; Murphy, Chapter 25; Shalev-Shwartz and Ben-David, Chapter 22; von Luxburg, A Tutorial on Spectral Clustering; Clustering on scikit-learn.