Machine Learning for Data Science (CS4786) Lecture 8


Machine Learning for Data Science (CS4786) Lecture 8: Clustering. Course webpage: http://www.cs.cornell.edu/courses/cs4786/2016fa/

Announcement: Those of you who submitted HW1 and are still on the waitlist, please email me.

CLUSTERING [Figure: the data as an n × d matrix X (n points, each with d features).]

CLUSTERING Grouping data points such that points in the same group are similar and points in different groups are dissimilar. A form of unsupervised classification: there are no predefined labels.

SOME NOTATIONS A K-ary clustering is a partition of x_1, ..., x_n into K groups; for now, assume the magical K is given to us. A clustering is given by C_1, ..., C_K, the partition of the data points. Given a clustering, we shall use c(x_t) to denote the cluster identity of point x_t according to that clustering. Let n_j denote |C_j|; clearly \sum_{j=1}^{K} n_j = n.
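As a concrete illustration of this notation, here is a minimal sketch (the variable names X, c, clusters, and sizes are mine, not the lecture's) showing the assignment view c(x_t) and the partition view C_1, ..., C_K side by side:

```python
import numpy as np

# Toy data: n = 6 points in d = 2 dimensions, with a K = 3 clustering
# given as an assignment vector c, where c[t] plays the role of c(x_t).
X = np.array([[0.0, 0.0], [0.1, 0.2],
              [5.0, 5.1], [5.2, 4.9],
              [9.0, 0.1], [8.8, 0.0]])
c = np.array([0, 0, 1, 1, 2, 2])
K = 3

# The partition view: C_j is the set of indices of points assigned to cluster j.
clusters = [np.where(c == j)[0] for j in range(K)]
sizes = [len(C_j) for C_j in clusters]      # n_j = |C_j|

print(clusters)                             # [array([0, 1]), array([2, 3]), array([4, 5])]
assert sum(sizes) == len(X)                 # sum_j n_j = n
```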

How do we formalize a good clustering objective?

How do we formalize? Say dissimilarity(x_t, x_s) measures the dissimilarity between x_t and x_s. Given two clusterings {C_1, ..., C_K} (or c) and {C'_1, ..., C'_K} (or c'), how do we decide which is better? We want that points in the same cluster are not dissimilar and points in different clusters are dissimilar.

CLUSTERING CRITERION
Minimize total within-cluster dissimilarity:
M_1 = \sum_{j=1}^{K} \sum_{s,t \in C_j} \text{dissimilarity}(x_t, x_s)
Maximize between-cluster dissimilarity:
M_2 = \sum_{x_s, x_t : c(x_s) \ne c(x_t)} \text{dissimilarity}(x_t, x_s)
Maximize the smallest between-cluster dissimilarity:
M_3 = \min_{x_s, x_t : c(x_s) \ne c(x_t)} \text{dissimilarity}(x_t, x_s)
Minimize the largest within-cluster dissimilarity:
M_4 = \max_{j \in [K]} \max_{s,t \in C_j} \text{dissimilarity}(x_t, x_s)
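A minimal sketch of these four criteria in code (the helper name clustering_criteria and the inputs D and c are my own choices, not prescribed by the lecture), assuming a precomputed pairwise dissimilarity matrix D and an assignment vector c:

```python
import numpy as np

def clustering_criteria(D, c):
    """Compute M1-M4 from an n x n pairwise dissimilarity matrix D and
    an assignment vector c, where c[t] is the cluster identity of x_t."""
    n = len(c)
    same = (c[:, None] == c[None, :])        # same[s, t] is True iff c(x_s) == c(x_t)
    off_diag = ~np.eye(n, dtype=bool)        # drop the trivial (t, t) pairs

    M1 = D[same & off_diag].sum()            # total within-cluster dissimilarity
    M2 = D[~same].sum()                      # total between-cluster dissimilarity
    M3 = D[~same].min()                      # smallest between-cluster dissimilarity
    M4 = D[same & off_diag].max()            # largest within-cluster dissimilarity
    return M1, M2, M3, M4

# Example with squared Euclidean distance as the dissimilarity.
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])
c = np.array([0, 0, 1, 1])
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
print(clustering_criteria(D, c))
```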

CLUSTERING CRITERION
Minimize average within-cluster dissimilarity:
M_6 = \sum_{j=1}^{K} \frac{1}{|C_j|} \sum_{s \in C_j} \text{dissimilarity}(x_s, C_j)
    = \sum_{j=1}^{K} \frac{1}{|C_j|} \sum_{s \in C_j} \sum_{t \in C_j, t \ne s} \text{dissimilarity}(x_s, x_t)
    = \sum_{j=1}^{K} \frac{1}{|C_j|} \sum_{s \in C_j} \sum_{t \in C_j, t \ne s} \|x_s - x_t\|_2^2   (for squared Euclidean dissimilarity)
Minimize within-cluster variance, where r_j = \frac{1}{n_j} \sum_{x \in C_j} x is the centroid of cluster j:
M_5 = \sum_{j=1}^{K} \sum_{t \in C_j} \|x_t - r_j\|_2^2
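To make M_5 and M_6 concrete, here is a minimal sketch (my own function names, not code from the lecture) that computes both for squared Euclidean dissimilarity; numerically one can check that M_6 = 2 M_5, which is why minimizing the two coincides, as noted shortly:

```python
import numpy as np

def M5_within_cluster_variance(X, c, K):
    """M5: sum over clusters of squared distances to the cluster centroid r_j."""
    total = 0.0
    for j in range(K):
        C_j = X[c == j]
        r_j = C_j.mean(axis=0)                        # r_j = (1/n_j) sum_{x in C_j} x
        total += ((C_j - r_j) ** 2).sum()
    return total

def M6_average_within_dissimilarity(X, c, K):
    """M6 with squared Euclidean dissimilarity: (1/|C_j|) * sum over pairs in C_j."""
    total = 0.0
    for j in range(K):
        C_j = X[c == j]
        D = ((C_j[:, None, :] - C_j[None, :, :]) ** 2).sum(axis=-1)
        total += D.sum() / len(C_j)                    # diagonal terms are zero anyway
    return total

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
c = rng.integers(0, 3, size=30)
print(np.isclose(M6_average_within_dissimilarity(X, c, 3),
                 2 * M5_within_cluster_variance(X, c, 3)))   # True: M6 = 2 * M5
```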

How different are these criteria?

CLUSTERING CRITERION Some of these criteria coincide: minimizing M_1 ≡ maximizing M_2, and minimizing M_5 ≡ minimizing M_6 (when the dissimilarity is squared Euclidean distance). A short derivation is sketched below.
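The derivation behind these two equivalences, written out in LaTeX (my own write-up in the notation above; the second identity is the standard decomposition of pairwise squared distances around the centroid):

```latex
% Minimizing M_1 is the same as maximizing M_2: the total pairwise
% dissimilarity does not depend on the clustering, and it splits as
\sum_{s,t} \text{dissimilarity}(x_t, x_s)
  = \underbrace{\sum_{j=1}^{K}\sum_{s,t \in C_j} \text{dissimilarity}(x_t, x_s)}_{M_1}
  + \underbrace{\sum_{s,t \,:\, c(x_s) \neq c(x_t)} \text{dissimilarity}(x_t, x_s)}_{M_2},
% so M_1 = \text{const} - M_2.
%
% Minimizing M_5 is the same as minimizing M_6 for squared Euclidean
% dissimilarity: for each cluster C_j with centroid r_j,
\frac{1}{|C_j|}\sum_{s \in C_j}\sum_{t \in C_j}\|x_s - x_t\|_2^2
  = 2\sum_{t \in C_j}\|x_t - r_j\|_2^2,
% and summing over j gives M_6 = 2 M_5.
```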

CLUSTERING There are multiple clustering criteria, all equally valid. Different criteria lead to different algorithms and different solutions. Which notion of distance or cost we use matters.

Let's build algorithms for two criteria:
M_5 = \sum_{j=1}^{K} \sum_{t \in C_j} \|x_t - r_j\|_2^2
M_3 = \min_{x_s, x_t : c(x_s) \ne c(x_t)} \text{dissimilarity}(x_t, x_s)

Let's build an algorithm for
M_5 = \sum_{j=1}^{K} \sum_{t \in C_j} \|x_t - r_j\|_2^2,  where r_j = \frac{1}{|C_j|} \sum_{x_t \in C_j} x_t

K-MEANS CLUSTERING
For all j \in [K], initialize cluster centroids \hat{r}^1_j randomly and set m = 1.
Repeat until convergence (or until patience runs out):
1. For each t \in {1, ..., n}, set the cluster identity of the point: \hat{c}^m(x_t) = \arg\min_{j \in [K]} \|x_t - \hat{r}^m_j\|_2
2. For each j \in [K], set the new representative: \hat{r}^{m+1}_j = \frac{1}{|\hat{C}^m_j|} \sum_{x_t \in \hat{C}^m_j} x_t
3. m \leftarrow m + 1
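A minimal Python sketch of these steps (my own implementation, for illustration only; the initialization here picks K random data points as the initial centroids, which is one common choice the slide does not specify):

```python
import numpy as np

def k_means(X, K, max_iters=100, seed=0):
    """K-means iterations as on the slide: assign each point to its nearest
    centroid, then recompute each centroid as the mean of its cluster."""
    rng = np.random.default_rng(seed)
    r = X[rng.choice(len(X), size=K, replace=False)]     # initial centroids \hat{r}^1_j
    c = None
    for m in range(max_iters):
        # Step 1: cluster identity of each point, c^m(x_t) = argmin_j ||x_t - r_j||.
        dists = ((X[:, None, :] - r[None, :, :]) ** 2).sum(axis=-1)
        new_c = dists.argmin(axis=1)
        if c is not None and np.array_equal(new_c, c):   # converged: assignments unchanged
            break
        c = new_c
        # Step 2: new representative r^{m+1}_j = mean of the points assigned to cluster j.
        for j in range(K):
            if np.any(c == j):                            # keep old centroid if cluster is empty
                r[j] = X[c == j].mean(axis=0)
    return c, r

# Usage: three well-separated blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in ([0, 0], [5, 5], [9, 0])])
c, r = k_means(X, K=3)
```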

K-MEANS CONVERGENCE
The K-means algorithm converges to a local minimum of the objective
O(c; r_1, ..., r_K) = \sum_{j=1}^{K} \sum_{t : c(x_t) = j} \|x_t - r_j\|_2^2
Proof: The cluster-assignment step improves the objective:
O(\hat{c}^{m-1}; \hat{r}^m_1, ..., \hat{r}^m_K) \ge O(\hat{c}^m; \hat{r}^m_1, ..., \hat{r}^m_K)   (by the definition of \hat{c}^m(x_t) as the nearest centroid)
Recomputing centroids improves the objective:
O(\hat{c}^m; \hat{r}^m_1, ..., \hat{r}^m_K) \ge O(\hat{c}^m; \hat{r}^{m+1}_1, ..., \hat{r}^{m+1}_K)   (by the fact that the centroid minimizes the sum of squared distances to a cluster's points)
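This monotonicity argument can also be checked numerically. The standalone sketch below (my own code, not from the lecture) records the objective after every assignment step and every centroid step and verifies it never increases:

```python
import numpy as np

def objective(X, c, r):
    """O(c; r_1, ..., r_K) = sum_j sum_{t: c(x_t)=j} ||x_t - r_j||^2."""
    return sum(((X[c == j] - r[j]) ** 2).sum() for j in range(len(r)))

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
K = 4
r = X[rng.choice(len(X), size=K, replace=False)].copy()

values = []
for m in range(20):
    # Assignment step: each point moves to its nearest centroid, so O cannot increase.
    c = ((X[:, None, :] - r[None, :, :]) ** 2).sum(axis=-1).argmin(axis=1)
    values.append(objective(X, c, r))
    # Centroid step: the mean minimizes the sum of squared distances, so O cannot increase.
    r = np.array([X[c == j].mean(axis=0) if np.any(c == j) else r[j] for j in range(K)])
    values.append(objective(X, c, r))

assert all(a >= b - 1e-9 for a, b in zip(values, values[1:]))   # non-increasing
```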

Let's build an algorithm for
M_3 = \min_{x_s, x_t : c(x_s) \ne c(x_t)} \text{dissimilarity}(x_t, x_s)

dissimilarity(C_i, C_j) = \min_{t \in C_i, s \in C_j} \text{dissimilarity}(x_t, x_s)


SINGLE LINK CLUSTERING
Initialize n clusters, with each point x_t in its own cluster.
Until there are only K clusters, do:
1. Find the closest two clusters and merge them into one cluster.
2. Update the between-cluster distances (the proximity matrix), using
dissimilarity(C_i, C_j) = \min_{t \in C_i, s \in C_j} \text{dissimilarity}(x_t, x_s)
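A naive Python sketch of this agglomerative procedure (my own implementation; it uses squared Euclidean distance as the point-level dissimilarity and recomputes cluster distances directly rather than maintaining an explicit proximity matrix):

```python
import numpy as np

def single_link(X, K):
    """Start with n singleton clusters and repeatedly merge the closest pair
    under the single-link dissimilarity until only K clusters remain."""
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)   # pairwise dissimilarities
    clusters = [[t] for t in range(len(X))]                   # each point in its own cluster

    while len(clusters) > K:
        # Closest pair of clusters: dissimilarity(C_i, C_j) = min over t in C_i, s in C_j.
        best = (np.inf, None, None)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = D[np.ix_(clusters[i], clusters[j])].min()
                if d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]                # merge the two closest clusters
        del clusters[j]

    # Return an assignment vector c(x_t) for convenience.
    c = np.empty(len(X), dtype=int)
    for label, members in enumerate(clusters):
        c[members] = label
    return c

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc, 0.2, size=(20, 2)) for loc in ([0, 0], [4, 4], [8, 0])])
print(np.bincount(single_link(X, K=3)))   # cluster sizes; [20, 20, 20] for these blobs
```

For larger datasets, scipy.cluster.hierarchy.linkage(X, method='single') computes the same single linkage far more efficiently.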

SINGLE LINK OBJECTIVE
Objective for single-link:
M_3 = \min_{x_s, x_t : c(x_s) \ne c(x_t)} \text{dissimilarity}(x_t, x_s)   (here with dissimilarity(x_t, x_s) = \|x_s - x_t\|_2^2)
Single link clustering is optimal for the above objective!