Clustering: K-means. -Applied Multivariate Analysis- Lecturer: Darren Homrighausen, PhD

Clustering Introduction. When clustering, we seek to simplify the data via a small(er) number of summarizing variables. PCA looks for a low-dimensional representation of the observations that explains a good fraction of the total sum of squares. Clustering approaches instead look for subgroups among the observations within which the observations are similar.

Clustering Introduction. We will focus on two particular clustering algorithms.
K-means: seeks to partition the observations into a pre-specified number of clusters.
Hierarchical: produces a tree-like representation of the observations, known as a dendrogram.
There are advantages and disadvantages to both approaches. We can cluster observations on the basis of the covariates in order to find subgroups of observations. (It is common in clustering to refer to covariates as features.) Just as easily, we can cluster the features based on the observations to find subgroups in the features. We will focus on clustering the observations; you can cluster features by transposing X (that is, clustering on X^T), as in the sketch below.
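
A minimal sketch of the transposition idea (not from the slides), assuming a numeric data matrix X with observations in rows and features in columns; the scaling step is my own assumption:

# cluster the observations (rows of X), as in the rest of these notes
obs.clusters = kmeans(X, centers = 3, nstart = 20)$cluster
# cluster the features instead: transpose X so each feature becomes a "row";
# scaling first keeps features on comparable scales
feat.clusters = kmeans(t(scale(X)), centers = 2, nstart = 20)$cluster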

K-means
1. Select a number of clusters K.
2. Let C_1, ..., C_K partition {1, 2, 3, ..., n} such that all observations belong to some set C_j and no observation belongs to more than one set.
3. K-means attempts to form these sets by making the within-cluster variation, W(C_k), as small as possible:

   \min_{C_1,\dots,C_K} \sum_{k=1}^K W(C_k).

4. To define W, we need a concept of distance. By far the most common is Euclidean:

   W(C_k) = \frac{1}{|C_k|} \sum_{i, i' \in C_k} \|X_i - X_{i'}\|_2^2.

That is, the average squared Euclidean distance between all pairs of cluster members.
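
As an aside not on the original slide, here is a small R sketch of computing W(C_k) directly from this definition for a given cluster assignment; the function name W.Ck and the use of dist() are my own choices, and X is assumed to be a numeric matrix with observations in rows:

W.Ck = function(X, cluster, k) {
  Xk = X[cluster == k, , drop = FALSE]      # observations assigned to cluster k
  # dist() returns each unordered pair once; doubling counts both (i, i') and (i', i)
  2 * sum(dist(Xk)^2) / nrow(Xk)
}

For example, W.Ck(X, kmeans(X, centers = 3)$cluster, 1) gives the within-cluster variation of the first cluster.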

K-means. It turns out that

\min_{C_1,\dots,C_K} \sum_{k=1}^K W(C_k)    (1)

is too hard a problem to solve exactly (roughly K^n possible partitions!). So we make a greedy approximation:
1. Randomly assign observations to the K clusters.
2. Iterate until the cluster assignments stop changing: for each of the K clusters, compute the centroid, which is the p-length vector of feature means in that cluster; then assign each observation to the cluster whose centroid is closest (in Euclidean distance).
This procedure is guaranteed not to increase (1) at each step. Warning: it finds only a local minimum, not necessarily the global one, and which local minimum you get depends on the random assignment in step 1. A sketch of the iteration follows.
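
A sketch of this iteration, for illustration only (in practice you would call kmeans()); the function name lloyd.kmeans is made up, and empty clusters are not handled:

lloyd.kmeans = function(X, K, max.iter = 100) {
  n = nrow(X)
  cluster = sample(1:K, n, replace = TRUE)   # step 1: random assignment
  for (iter in 1:max.iter) {
    # centroids: the p-length vector of feature means within each cluster
    centers = t(sapply(1:K, function(k) colMeans(X[cluster == k, , drop = FALSE])))
    # distances from every observation to every centroid
    d = as.matrix(dist(rbind(centers, X)))[-(1:K), 1:K]
    new.cluster = apply(d, 1, which.min)     # reassign to the nearest centroid
    if (all(new.cluster == cluster)) break   # step 2: stop when assignments settle
    cluster = new.cluster
  }
  list(cluster = cluster, centers = centers)
}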

K-means: A Summary. To fit K-means, you need to
1. Pick K (inherent in the method).
2. Convince yourself you have found a good solution (due to the randomized nature of the algorithm).
It turns out that 1. is difficult to do in a principled way; we will discuss this shortly. For 2., a commonly used approach is to run K-means many times with different starting points and pick the solution with the smallest value of

\min_{C_1,\dots,C_K} \sum_{k=1}^K W(C_k).

A sketch of this restarting strategy follows. As an aside, why can't we use this approach for picking K? (We would choose K = n.)
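
A minimal sketch of the restarting strategy (assuming the data matrix X and K = 3 defined in the R example later in these notes); in practice the nstart argument of kmeans() does this restarting for you:

best = NULL
for (s in 1:50) {
  fit = kmeans(X, centers = 3)   # one random start per call
  if (is.null(best) || fit$tot.withinss < best$tot.withinss) best = fit
}
# equivalent, and simpler: let kmeans() handle the restarts internally
best2 = kmeans(X, centers = 3, nstart = 50)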

K-means: Various K's (figure)

K-means: Algorithm at Work (figure)

K-means: Finding a Good Local Minimum (figure)

K-means in R. As usual, the interface in R is very basic:

n = 30
X1 = rnorm(n)
X2 = rnorm(n)
X = cbind(X1, X2)
K = 3
kmeans.out = kmeans(X, centers = K)

> names(kmeans.out)
[1] "cluster"      "centers"      "totss"        "withinss"
[5] "tot.withinss" "betweenss"    "size"
> kmeans.out$cluster
 [1] 2 2 2 2 2 2 1 1 2 2 3 1 2 1 2 2 2 3 1 2 2 1 3 2 1 3 3 1 2 3

plot(X, col = (kmeans.out$cluster + 1), xlab = "", ylab = "", pch = 20, cex = 2)

K-means in R (figure: the simulated data X plotted with points colored by cluster assignment)

K-means in R. Another example:

x = matrix(rnorm(50*2), ncol = 2)
x[1:25, 1] = x[1:25, 1] + 3
x[1:25, 2] = x[1:25, 2] - 4
kmeans.out = kmeans(x, centers = 2, nstart = 20)

K-means in R. Figure: Two clusters (which is the true number).

K-means in R. Figure: Three clusters.

K-means in R: Comparison. R provides several objects in the kmeans output.
W(C_k), for k = 1, ..., K, is the same as: kmeans(x, centers = K)$withinss
\sum_{k=1}^K W(C_k) is the same as: kmeans(x, centers = K)$tot.withinss

> kmeans(x, centers = 4, nstart = 1)$tot.withinss
[1] 19.12722
> kmeans(x, centers = 4, nstart = 20)$tot.withinss
[1] 18.5526
> kmeans(x, centers = 5, nstart = 20)$tot.withinss
[1] 12.01893

K-means using PCA. We can also use the PC scores as the input to the clustering to get a different look at the data. Note: this is fundamentally different from using PC scores to plot and visually inspect your data.

forest = read.table("../data/forestfires.csv", sep = ",", header = T)
forestred = forest[, -c(3, 4, 13)]
fires = forest[, 13]
pca.out = prcomp(forestred, center = T, scale = T)
cum.sum = cumsum(pca.out$sdev^2 / sum(pca.out$sdev^2))

> round(cum.sum, 2)
 [1] 0.29 0.44 0.57 0.69 0.79 0.86 0.90 0.95 0.98 1.00

ncomps = min(which(cum.sum > .9))
> ncomps
[1] 7

K-means using PCA

kmeans.out = kmeans(pca.out$x[, 1:ncomps], centers = 2, nstart = 20)
Y = rep(1, nrow(forest))
Y[fires > 0] = 2

(Here, we are dividing the response fires into two groups, labeling them 1 and 2 to match the output of kmeans.)

> table(Y, kmeans.out$cluster)
Y     1   2
  1  63 184
  2  51 219
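
As a small follow-up not in the original slides, one way to summarize this table is the agreement rate between clusters and labels, remembering that the cluster labels 1 and 2 are arbitrary:

tab = table(Y, kmeans.out$cluster)
# take the better of the two possible matchings of cluster labels to Y labels
agree = max(sum(diag(tab)), tab[1, 2] + tab[2, 1]) / sum(tab)
agree   # about 0.55 for the table shown above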

K-means using PCA. Figure: Left: plot of PC scores, colored by cluster. Right: plot of PC scores, colored by correctness of the cluster assignment: red means clustered to the zero-area group when positive area was the label (or the opposite); blue means clustered properly.

Choosing the Number of Clusters. Sometimes the number of clusters is fixed ahead of time: segmenting a client database into K clusters for K salesmen, or compressing an image using vector quantization (where K determines the compression rate). Most of the time, it isn't so straightforward. Why is this a hard problem? Determining the number of clusters is hard, even for humans, unless the data are low dimensional. It is just as hard to state what we are looking for (in classification, we wanted a classifier that predicts well; in clustering, we want a clusterer to... do what?).

Choosing the Number of Clusters. Why is it important? It might make a big difference (concluding there are K = 2 cancer subtypes versus K = 3). One of the major goals of statistical learning is automatic inference; a good way of choosing K is certainly a part of this.

Reminder: What does K-means do? Given a number of clusters K, we (approximately) minimize

\sum_{k=1}^K W(C_k) = \sum_{k=1}^K \frac{1}{|C_k|} \sum_{i, i' \in C_k} \|X_i - X_{i'}\|_2^2.

We can rewrite this (up to a factor of 2) in terms of the centroids as

W(K) = \sum_{k=1}^K \sum_{i \in C_k} \|X_i - \bar{X}_k\|_2^2,

where \bar{X}_k \in R^? (what is ?). Answer: ? = p.
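
A quick numerical check of this rewrite (a sketch, not part of the slides), using the two-cluster simulated data x from earlier; the pairwise form equals exactly twice the centroid form:

km = kmeans(x, centers = 2, nstart = 20)
xk = x[km$cluster == 1, ]
pairwise = 2 * sum(dist(xk)^2) / nrow(xk)                          # (1/|C_1|) sum over ordered pairs
centroid = sum(scale(xk, center = colMeans(xk), scale = FALSE)^2)  # sum_i ||X_i - Xbar_1||_2^2
c(pairwise, 2 * centroid, 2 * km$withinss[1])                      # all three agree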

Minimizing W in K. Of course, a lower value of W is better. Why not minimize W?

plotw = rep(0, 49)
for (k in 1:49) {
  plotw[k] = kmeans(x, centers = k, nstart = 20)$tot.withinss
}

Figure: W (tot.withinss) plotted against K = 1, ..., 49.

Minimizing W in K. Of course, a lower value of W is better. Why not minimize W? A look at the cluster solutions.

Figure: scatterplots of the data (x1 versus x2), colored by cluster assignment, for K = 1, 2, 3, 5, 10, 15, 20, 30, and 49.

Between-cluster variation. Within-cluster variation measures how tightly grouped the clusters are. As we increase K, this will always decrease. What we are missing is between-cluster variation, i.e., how spread apart the groups are:

B = \sum_{k=1}^K |C_k| \|\bar{X}_k - \bar{X}\|_2^2,

where |C_k| is the number of points in C_k, and \bar{X} is the grand mean of all observations:

\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i.
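
A hedged sketch (not from the slides) of computing B from a kmeans fit; at convergence this should match the betweenss component that kmeans() returns:

km = kmeans(x, centers = 2, nstart = 20)
xbar = colMeans(x)                                   # grand mean of all observations
# B = sum_k |C_k| * ||Xbar_k - Xbar||_2^2
B = sum(km$size * rowSums(sweep(km$centers, 2, xbar)^2))
c(B, km$betweenss)                                   # should agree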

Between-cluster variation: Example.

Figure: scatterplot of the two-cluster data (x1 versus x2), with the cluster centroids \bar{X}_1 and \bar{X}_2 and the grand mean \bar{X} marked.

B = |C_1| \|\bar{X}_1 - \bar{X}\|_2^2 + |C_2| \|\bar{X}_2 - \bar{X}\|_2^2
W = \sum_{i \in C_1} \|\bar{X}_1 - X_i\|_2^2 + \sum_{i \in C_2} \|\bar{X}_2 - X_i\|_2^2

R tip detour. The code that produced the previous figure:

x1bar = apply(x[1:25, ], 2, mean)
x2bar = apply(x[26:50, ], 2, mean)
xbar = apply(x, 2, mean)
plot(x, xlab = "x1", ylab = "x2", col = kmeans(x, centers = 2, nstart = 20)$cluster)
points(x1bar[1], x1bar[2])
points(x2bar[1], x2bar[2])
points(xbar[1], xbar[2])
segments(x1bar[1], x1bar[2], x2bar[1], x2bar[2], col = "blue")
text(x1bar[1] + .15, x1bar[2] + .15, expression(bar(x)[1]))
text(x2bar[1] + .15, x2bar[2] + .15, expression(bar(x)[2]))
text(xbar[1] + .15, xbar[2] + .15, expression(bar(x)))

Can we just maximize B? Sadly, no. Just like W can be made arbitrarily small, B will always increase with increasing K.

Figure: B plotted against K.
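
A sketch mirroring the earlier loop for W, this time recording the betweenss component (again using the simulated x):

plotb = rep(0, 49)
for (k in 1:49) {
  plotb[k] = kmeans(x, centers = k, nstart = 20)$betweenss
}
plot(1:49, plotb, xlab = "K", ylab = "B", type = "b")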

CH index. Ideally, we would like our cluster assignment to simultaneously have small W and large B. This is the idea behind the CH index. For a clustering assignment coming from K clusters, we record the CH score

CH(K) = \frac{B(K)/(K-1)}{W(K)/(n-K)}.

To choose K, pick some maximum number of clusters to be considered (K_max = 20, for example) and choose the value of K with the { smallest, largest } CH score:

\hat{K} = \arg\max_{K \in \{2, \dots, K_{max}\}} CH(K).

Note: CH is undefined for K = 1.

CH index

ch.index = function(x, kmax, iter.max = 100, nstart = 10, algorithm = "Lloyd") {
  ch = numeric(length = kmax - 1)
  n = nrow(x)
  for (k in 2:kmax) {
    a = kmeans(x, k, iter.max = iter.max, nstart = nstart, algorithm = algorithm)
    w = a$tot.withinss
    b = a$betweenss
    ch[k - 1] = (b / (k - 1)) / (w / (n - k))
  }
  return(list(k = 2:kmax, ch = ch))
}
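
A short usage sketch (assuming the simulated x from earlier):

out = ch.index(x, kmax = 20)
out$k[which.max(out$ch)]                                  # the K with the largest CH score
plot(out$k, out$ch, xlab = "K", ylab = "CH", type = "b")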

Revisiting the simulated example

x = matrix(rnorm(50*2), ncol = 2)
x[1:25, 1] = x[1:25, 1] + 3
x[1:25, 2] = x[1:25, 2] - 4

We want to cluster this data set using K-means with K chosen via the CH index.

CH plot. Figure: CH(K) plotted against K = 2, ..., 20.

Corresponding solution. Figure: scatterplot of the data (x1 versus x2), colored by the K-means solution at the chosen K.

Alternate approach: Gap statistic. While it is true that W(K) keeps dropping as K grows, how much it drops might be informative. The gap statistic is based on this idea. We compare the observed within-cluster variation W(K) to the within-cluster variation we would observe if the data were uniformly distributed, W_unif(K):

Gap(K) = \log W_{unif}(K) - \log W(K).

After simulating many values of \log W_{unif}(K), we compute their standard deviation s(K). Then we choose K by

\hat{K} = \min\{K \in \{1, \dots, K_{max}\} : Gap(K) \ge Gap(K+1) - s(K+1)\}.

Gap statistic: Summary. I don't want to dwell too long on Gap(K), other than to say the following: as Gap(K) is defined for K = 1, it can pick the null model (all observations in one cluster), and in fact this is what it is best at (why?). It can be computed with the R package SAGx or the package lga; in both cases the function is called gap. Beware: these functions are poorly documented, and it is unclear exactly which clustering methods and algorithms they use.
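
As an additional option not mentioned on the slides, the cluster package (a recommended package that typically ships with R) provides clusGap, along with maxSE for applying the selection rule above. A minimal sketch, assuming the simulated x:

library(cluster)
gap.out = clusGap(x, FUNcluster = kmeans, K.max = 10, B = 100, nstart = 20)
# apply the rule K.hat = min{ K : Gap(K) >= Gap(K+1) - s(K+1) }
maxSE(gap.out$Tab[, "gap"], gap.out$Tab[, "SE.sim"], method = "Tibs2001SEmax")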