Machine Learning for Data Science (CS4786) Lecture 2

Joel Payne
6 years ago

1 Machine Learning for Data Science (CS4786) Lecture 2 Clustering Course Webpage :

2 REPRESENTING DATA AS FEATURE VECTORS
How do we represent data? Each data point is often represented as a vector, referred to as a feature vector.
E.g., a text document is represented by a vector in which each coordinate corresponds to a word and its value is the number of times that word occurs in the document.
E.g., an image is represented as a vector in which each coordinate corresponds to a pixel and its value is the grayscale value of that pixel.

3 EXAMPLE: IMAGES A $K \times K$ image is vectorized into a feature vector of dimension $d = K^2$.
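A minimal sketch of the vectorization step, assuming the image is stored as a list of K rows and is flattened row by row (the slide does not say which pixel order is used, so the row-major order and the name `vectorize` are illustrative choices):

```python
def vectorize(image):
    """Flatten a K x K grayscale image (a list of K rows) into a list of length K * K."""
    k = len(image)
    if any(len(row) != k for row in image):
        raise ValueError("expected a square K x K image")
    # Assumption: row-major order, i.e. rows are concatenated top to bottom.
    return [pixel for row in image for pixel in row]


# A made-up 3 x 3 image with grayscale values in 0..255.
image = [
    [0, 128, 255],
    [64, 32, 16],
    [255, 0, 7],
]
x = vectorize(image)  # feature vector of dimension d = K^2 = 9
```

Any fixed pixel order works, as long as the same order is used for every image in the data set.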

4 EXAMPLE: TEXT (BAG OF WORDS)
Documents:
1. car engine hood tires truck trunk
2. car emissions hood make model trunk
3. Chomsky corpus noun parsing tagging wonderful
Vocabulary (one coordinate per word): car, Chomsky, corpus, emissions, engine, hood, make, model, noun, parsing, tagging, tires, truck, trunk, wonderful
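The bag-of-words construction can be sketched as follows. This reads the slide as three documents over a 15-word vocabulary; the whitespace tokenization, the case-insensitive alphabetical ordering of the vocabulary, and the helper name `bag_of_words` are assumptions made for illustration.

```python
from collections import Counter

documents = [
    "car engine hood tires truck trunk",
    "car emissions hood make model trunk",
    "Chomsky corpus noun parsing tagging wonderful",
]

# Vocabulary: every distinct word, sorted case-insensitively so the coordinate order is fixed.
vocabulary = sorted({w for doc in documents for w in doc.split()}, key=str.lower)


def bag_of_words(doc, vocabulary):
    """Return the count of each vocabulary word in doc, in vocabulary order."""
    counts = Counter(doc.split())
    return [counts[w] for w in vocabulary]


# One feature vector (row) per document.
X = [bag_of_words(doc, vocabulary) for doc in documents]
```

Each row of `X` has one coordinate per vocabulary word, so the two car-related documents share nonzero coordinates (car, hood, trunk) while the linguistics document shares none with them.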

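5 REPRESENTING DATA AS FEATURE VECTORS The $n$ data points are stacked as the rows $x_1^\top, \ldots, x_n^\top$ of a data matrix $X$ with $n$ rows and $d$ columns.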

6 CLUSTERING (figure: the data matrix $X$ with $n$ rows and $d$ columns)

7-12 EXAMPLES What are the clusters? (six example figures, not reproduced)

13 CLUSTERING
Grouping sets of data points s.t. points in the same group are similar and points in different groups are dissimilar.
A form of unsupervised classification where there are no predefined labels.

14 SOME NOTATIONS
A K-ary clustering is a partition of $x_1, \ldots, x_n$ into $K$ groups. For now, assume the magical $K$ is given to us.
A clustering is given by $C_1, \ldots, C_K$, the partition of the data points.
Given a clustering, we use $c(x_t)$ to denote the cluster identity of point $x_t$ according to that clustering.
Let $n_j$ denote $|C_j|$; clearly $\sum_{j=1}^{K} n_j = n$.

15 How do we formalize a good clustering objective?

16 How do we formalize?
Say $\mathrm{dissimilarity}(x_t, x_s)$ measures the dissimilarity between $x_t$ and $x_s$.
Given two clusterings $\{C_1, \ldots, C_K\}$ (or $c$) and $\{C'_1, \ldots, C'_K\}$ (or $c'$), how do we decide which is better?
Points in the same cluster are not dissimilar; points in different clusters are dissimilar.

17 CLUSTERING CRITERION
Minimize total within-cluster dissimilarity: $M_1 = \sum_{j=1}^{K} \sum_{s,t \in C_j} \mathrm{dissimilarity}(x_t, x_s)$
Maximize between-cluster dissimilarity: $M_2 = \sum_{x_s, x_t : c(x_s) \neq c(x_t)} \mathrm{dissimilarity}(x_t, x_s)$
Maximize smallest between-cluster dissimilarity: $M_3 = \min_{x_s, x_t : c(x_s) \neq c(x_t)} \mathrm{dissimilarity}(x_t, x_s)$
Minimize largest within-cluster dissimilarity: $M_4 = \max_{j \in [K]} \max_{s,t \in C_j} \mathrm{dissimilarity}(x_t, x_s)$
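To make the four criteria concrete, here is a small sketch that evaluates them for a given clustering. It assumes squared Euclidean distance as the dissimilarity and counts each unordered pair of points once (counting ordered pairs instead would only double the two sums); the function names and the toy data are illustrative, not from the lecture.

```python
from itertools import combinations


def sq_dist(a, b):
    """Squared Euclidean distance, used here as the dissimilarity (an assumption)."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def criteria(points, labels, dissimilarity=sq_dist):
    """Evaluate M1..M4 for the clustering given by labels[t] = c(x_t)."""
    within, between = [], []
    for s, t in combinations(range(len(points)), 2):
        d = dissimilarity(points[s], points[t])
        (within if labels[s] == labels[t] else between).append(d)
    return {
        "M1": sum(within),                         # total within-cluster (minimize)
        "M2": sum(between),                        # total between-cluster (maximize)
        "M3": min(between, default=float("inf")),  # smallest between-cluster (maximize)
        "M4": max(within, default=0.0),            # largest within-cluster (minimize)
    }


points = [(0, 0), (0, 1), (5, 0), (5, 1)]
good = criteria(points, [0, 0, 1, 1])  # groups the two nearby pairs
bad = criteria(points, [0, 1, 0, 1])   # splits each nearby pair
```

On this toy data all four criteria prefer the first clustering: it has the smaller `M1` and `M4` and the larger `M2` and `M3`. On less clear-cut data the criteria can disagree.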


19 How different are these criteria?

20 CLUSTERING CRITERION (figure panels: minimizing $M_1$, maximizing $M_2$, minimizing $M_5$, minimizing $M_6$)

21 CLUSTERING
Multiple clustering criteria, all equally valid.
Different criteria lead to different algorithms and solutions.
Which notion of distance or cost we use matters.

22 Let's Build an Algorithm
$M_3 = \min_{x_s, x_t : c(x_s) \neq c(x_t)} \mathrm{dissimilarity}(x_t, x_s)$

23 SINGLE LINK CLUSTERING
Initialize $n$ clusters, with each point $x_t$ in its own cluster.
Until there are only $K$ clusters, do:
1. Find the closest two clusters and merge them into one cluster.
2. Update the between-cluster distances (called the proximity matrix): $\mathrm{dissimilarity}(C_i, C_j) = \min_{t \in C_i, s \in C_j} \mathrm{dissimilarity}(x_t, x_s)$
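The procedure on this slide can be sketched as below. This is a deliberately simple version under stated assumptions: squared Euclidean distance is used as the dissimilarity, and the between-cluster distances are recomputed on every pass rather than kept in an incrementally updated proximity matrix, which is slower but gives the same merges. The names `single_link` and `link` are illustrative.

```python
def sq_dist(a, b):
    """Squared Euclidean distance, used here as the dissimilarity (an assumption)."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def single_link(points, K, dissimilarity=sq_dist):
    """Merge the closest two clusters until K remain; return clusters as sorted index lists."""
    clusters = [[t] for t in range(len(points))]  # each point starts in its own cluster

    def link(ci, cj):
        # Single-link distance: smallest dissimilarity between a point of ci and a point of cj.
        return min(dissimilarity(points[t], points[s]) for t in ci for s in cj)

    while len(clusters) > K:
        pairs = ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: link(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return sorted(sorted(c) for c in clusters)


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0)]
clusters = single_link(points, K=3)
```

For real data sets, a library routine that maintains the proximity matrix is preferable; this sketch is only meant to mirror the two steps on the slide.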

24-40 Demo (figures not reproduced)

41 SINGLE LINK OBJECTIVE
Objective for single-link: $M_3 = \min_{x_s, x_t : c(x_s) \neq c(x_t)} \mathrm{dissimilarity}(x_t, x_s)$, for example with $\mathrm{dissimilarity}(x_t, x_s) = \|x_s - x_t\|_2^2$.
Single link clustering is optimal for the above objective!
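The optimality claim can be checked numerically on a tiny example by comparing single-link against every possible clustering. The sketch below assumes squared Euclidean distance, seven random points, and K = 3; the brute-force search over all labelings is only feasible for very small n, and all function names are illustrative.

```python
import random
from itertools import combinations, product


def sq_dist(a, b):
    """Squared Euclidean distance, used here as the dissimilarity (an assumption)."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def smallest_between(points, labels):
    """The objective: smallest dissimilarity between points in different clusters."""
    return min(
        sq_dist(points[s], points[t])
        for s, t in combinations(range(len(points)), 2)
        if labels[s] != labels[t]
    )


def single_link_labels(points, K):
    """Single-link clustering, returned as one cluster label per point."""
    labels = list(range(len(points)))  # each point starts in its own cluster
    while len(set(labels)) > K:
        # The closest pair of points in different clusters identifies the closest two clusters.
        s, t = min(
            ((s, t) for s, t in combinations(range(len(points)), 2) if labels[s] != labels[t]),
            key=lambda st: sq_dist(points[st[0]], points[st[1]]),
        )
        old, new = labels[t], labels[s]
        labels = [new if label == old else label for label in labels]
    return labels


def best_by_brute_force(points, K):
    """Largest objective value over all clusterings with exactly K non-empty clusters."""
    return max(
        smallest_between(points, labels)
        for labels in product(range(K), repeat=len(points))
        if len(set(labels)) == K
    )


rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(7)]
K = 3
single = smallest_between(points, single_link_labels(points, K))
best = best_by_brute_force(points, K)
```

Here `single` and `best` coincide: no clustering into K groups has a larger smallest between-cluster dissimilarity than the single-link one.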

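42 SINGLE LINK OBJECTIVE
Proof: Say $c$ is the solution produced by single-link clustering.
Key observation: $\min_{t,s : c(x_t) \neq c(x_s)} \mathrm{dissimilarity}(x_t, x_s) > \max \mathrm{dissimilarity}(x_t, x_s)$, where the max is over the pairs $t, s$ with $c(x_t) = c(x_s)$ that were merged (on the tree). That is, every remaining between-cluster distance exceeds the distance of any points merged.
Say $c' \neq c$. Then $\exists\, t, s$ s.t. $c'(x_t) \neq c'(x_s)$ but $c(x_t) = c(x_s)$.
(figure: $x_t$ and $x_s$ are connected on the tree through points $a$ and $b$, and the edge from $a$ to $b$ crosses the $c'$ boundary)
The merged pair $a, b$ lies in different clusters of $c'$, so the objective of $c'$ is at most $\mathrm{dissimilarity}(a, b)$, which by the key observation is smaller than the objective of $c$.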

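43 CLUSTERING CRITERION
Minimize average dissimilarity within cluster:
$M_6 = \sum_{j=1}^{K} \frac{1}{|C_j|} \sum_{s \in C_j} \mathrm{dissimilarity}(x_s, C_j) = \sum_{j=1}^{K} \frac{1}{|C_j|} \sum_{s \in C_j} \sum_{t \in C_j, t \neq s} \mathrm{dissimilarity}(x_s, x_t) = \sum_{j=1}^{K} \frac{1}{|C_j|} \sum_{s \in C_j} \sum_{t \in C_j, t \neq s} \|x_s - x_t\|_2^2$
Minimize within-cluster variance: with $r_j = \frac{1}{n_j} \sum_{x \in C_j} x$,
$M_5 = \sum_{j=1}^{K} \sum_{t \in C_j} \|x_t - r_j\|_2^2$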
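A small sketch that evaluates both criteria on a toy clustering may help relate them. It assumes squared Euclidean dissimilarity and sums over ordered pairs $s \neq t$ in $M_6$; under those assumptions the standard identity $\sum_{s,t \in C} \|x_s - x_t\|_2^2 = 2|C| \sum_{t \in C} \|x_t - r\|_2^2$ gives $M_6 = 2 M_5$, so the two criteria rank clusterings identically. The function names and data are illustrative.

```python
def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def centroid(cluster):
    """Coordinate-wise mean r_j of the points in one cluster."""
    n = len(cluster)
    return tuple(sum(x[i] for x in cluster) / n for i in range(len(cluster[0])))


def m5(clusters):
    """Within-cluster variance: squared distance of every point to its cluster centroid."""
    return sum(sq_dist(x, centroid(c)) for c in clusters for x in c)


def m6(clusters):
    """Average within-cluster dissimilarity, summing over ordered pairs s != t."""
    return sum(
        sum(sq_dist(xs, xt) for i, xs in enumerate(c) for j, xt in enumerate(c) if i != j)
        / len(c)
        for c in clusters
    )


clusters = [
    [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)],
    [(10.0, 10.0), (12.0, 10.0)],
]
value_m5 = m5(clusters)
value_m6 = m6(clusters)
```

On this example `value_m6` is exactly twice `value_m5`, as the identity predicts.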

44 CLUSTERING CRITERION (figure panels: minimizing $M_1$, maximizing $M_2$, minimizing $M_5$, minimizing $M_6$)

45 Let's build an Algorithm
$M_5 = \sum_{j=1}^{K} \sum_{x_t \in C_j} \|x_t - r_j\|_2^2$, where $r_j = \frac{1}{|C_j|} \sum_{x_t \in C_j} x_t$ and $|C_j| = n_j$.

46 K-MEANS CLUSTERING
For all $j \in [K]$, initialize cluster centroids $\hat{r}^1_j$ randomly and set $m = 1$.
Repeat until convergence (or until patience runs out):
1. For each $t \in \{1, \ldots, n\}$, set the cluster identity of the point: $\hat{c}^m(x_t) = \arg\min_{j \in [K]} \|x_t - \hat{r}^m_j\|_2$
2. For each $j \in [K]$, set the new representative as $\hat{r}^{m+1}_j = \frac{1}{|\hat{C}^m_j|} \sum_{x_t \in \hat{C}^m_j} x_t$
3. $m \leftarrow m + 1$
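The iteration on this slide can be sketched as follows. Several details are assumptions rather than part of the lecture: centroids are initialized by sampling K distinct data points (the slide only says "randomly"), "convergence" is taken to mean that no cluster identity changes, an empty cluster keeps its previous centroid, and `max_iters` stands in for "until patience runs out".

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(x[i] for x in points) / n for i in range(len(points[0])))


def k_means(points, K, max_iters=100, seed=0):
    """Alternate assignment and centroid updates; return (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, K)  # assumption: initialize from K distinct data points
    labels = None
    for _ in range(max_iters):
        # Step 1: assign every point to its nearest centroid.
        new_labels = [min(range(K), key=lambda j: sq_dist(x, centroids[j])) for x in points]
        if new_labels == labels:  # no assignment changed: converged
            break
        labels = new_labels
        # Step 2: move each centroid to the mean of the points assigned to it.
        for j in range(K):
            members = [x for x, label in zip(points, labels) if label == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = mean(members)
    return labels, centroids


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = k_means(points, K=2)
```

On these two well-separated groups the procedure recovers the grouping, but in general the result depends on the initialization, so it is common to run it from several random starts and keep the solution with the smallest $M_5$.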
