
Machine Learning: Clustering. Hamid Beigy, Sharif University of Technology, Fall 1395. Some slides are taken from P. Rai's slides.

Table of contents: 1. Introduction; 2. K-means clustering; 3. Hierarchical clustering; 4. Model-based clustering; 5. Cluster validation and assessment.


Introduction. Clustering is the process of grouping a set of data objects into multiple groups or clusters so that objects within a cluster have high similarity, but are very dissimilar to objects in other clusters. Dissimilarities and similarities are assessed based on the feature values describing the objects and often involve distance measures. Clustering is usually an unsupervised learning problem: we are given N unlabeled examples and the number of desired partitions. Consider a dataset X = {x_1, ..., x_N}, x_i ∈ R^D, and assume there are K clusters C_1, ..., C_K. The goal is to group the examples into K homogeneous partitions. (Picture courtesy: Data Clustering: 50 Years Beyond K-Means, A. K. Jain, 2008.)

Introduction. A good clustering is one that achieves high within-cluster similarity and low inter-cluster similarity. Applications of clustering: document/image/webpage clustering, image segmentation, clustering web-search results, clustering (people) nodes in (social) networks/graphs, and as a pre-processing phase.

Comparing clustering methods. Clustering methods can be compared using the following aspects. The partitioning criteria: in some methods, all the objects are partitioned so that no hierarchy exists among the clusters, while in other methods the clusters are organized hierarchically. Separation of clusters: in some methods, data are partitioned into mutually exclusive clusters, while in other methods the clusters may not be exclusive, that is, a data object may belong to more than one cluster. Similarity measure: some methods determine the similarity between two objects by the distance between them, while in other methods the similarity may be defined by connectivity based on density or contiguity. Clustering space: many clustering methods search for clusters within the entire data space. These methods are useful for low-dimensional data sets. With high-dimensional data, however, there can be many irrelevant attributes, which can make similarity measurements unreliable; consequently, clusters found in the full space are often meaningless. It is often better to instead search for clusters within different subspaces of the same data set.

Types of clustering. 1. Flat or partitional clustering: partitions are independent of each other. 2. Hierarchical clustering: partitions can be visualized using a tree structure (a dendrogram); it is possible to view partitions at different levels of granularity (i.e., to refine/coarsen clusters) using different K.


Flat clustering: K-means algorithm (Lloyd, 1957). Associate a prototype µ_k, k = 1, ..., K, with each cluster, and let r_nk be the indicator of x_n ∈ C_k. The goal is to find {r_nk} and {µ_k} that minimize J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_nk ||x_n − µ_k||^2. Optimization is performed by alternating minimization: optimize over {r_nk} for fixed {µ_k}, then optimize over {µ_k} for fixed {r_nk}. Initialize with arbitrary choices of {µ_k}; iterative updates continue until convergence. Convergence is guaranteed, as the objective is monotonically decreasing.

Flat clustering: K-means algorithm (Lloyd, 1957). Input: N examples {x_1, ..., x_N}, x_n ∈ R^D, and the number of partitions K. Initialize: K cluster means µ_1, ..., µ_K, each µ_k ∈ R^D. They are usually initialized randomly, but good initialization is crucial; many smarter initialization heuristics exist (e.g., K-means++, Arthur & Vassilvitskii, 2007). Repeat: (Re)assign each example x_n to its closest cluster center (based on the smallest Euclidean distance): r_nk = 1 if k = argmin_j ||x_n − µ_j||^2, and r_nk = 0 otherwise. Let C_k be the set of examples assigned to cluster k with center µ_k. Update the cluster means: µ_k = (1/|C_k|) \sum_{x_n ∈ C_k} x_n = \sum_{n=1}^{N} r_nk x_n / \sum_{n=1}^{N} r_nk. Stop: when the cluster means or the loss (defined later) don't change by much.

K-means example (figure from Roland Memisevic's machine learning slides).

K-means example in one dimension (Figure 13.1, Representative-based Clustering). The dataset {2, 3, 4, 10, 11, 12, 20, 25, 30} is clustered with K = 2, starting from µ_1 = 2, µ_2 = 4. The means move through µ_1 = 2.5, µ_2 = 16 (t = 1), µ_1 = 3, µ_2 = 18 (t = 2), µ_1 = 4.75, µ_2 = 19.60 (t = 3), and µ_1 = 7, µ_2 = 25 (t = 4), at which point the algorithm has converged (t = 5).
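To make the assignment and update steps concrete, here is a minimal NumPy sketch of Lloyd's algorithm run on the one-dimensional data and initial means from the figure above; the function name and structure are illustrative, not the original course code.

```python
import numpy as np

def kmeans_1d(X, mu, n_iter=10):
    """Lloyd's algorithm in 1-D: alternate closest-mean assignment and mean updates."""
    for _ in range(n_iter):
        d2 = (X[:, None] - mu[None, :]) ** 2   # squared distance of each point to each mean
        r = d2.argmin(axis=1)                  # index of the closest mean (hard assignment)
        mu = np.array([X[r == k].mean() for k in range(len(mu))])  # assumes no cluster goes empty
    return mu, r

# One-dimensional data and initial means taken from the figure above
X = np.array([2, 3, 4, 10, 11, 12, 20, 25, 30], dtype=float)
mu, r = kmeans_1d(X, mu=np.array([2.0, 4.0]))
print(mu)  # [ 7. 25.] -- matches the converged means in the figure
print(r)   # hard assignments of the nine points to the two clusters
```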

K-means example: image compression. Replace the RGB value (a 3-D vector) at each pixel with one of K prototypes (figure).

Decrease in the objective function J over iterations (plot of J versus iteration number).

K-means objective function. Consider the K-means objective function J(X, µ, r) = \sum_{n=1}^{N} \sum_{k=1}^{K} r_nk ||x_n − µ_k||^2. It is a non-convex objective function, so it may have many local minima; it is also NP-hard to minimize in general (note that r is discrete). The K-means algorithm is a heuristic for optimizing this function: it alternates between assigning points to the closest centers and recomputing the center means. The algorithm usually converges to a local minimum, so multiple runs with different initializations are usually tried to find a good solution.

K-means (choosing K). One way to select K for the K-means algorithm is to try different values of K, plot the K-means objective versus K, and look at the elbow point of the plot (in the figure, K = 6 is the elbow point). We can also use an information criterion such as AIC (Akaike Information Criterion) or BIC (Bayesian Information Criterion) and choose the K that gives the smallest AIC/BIC (both penalize large K values).
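As a rough illustration of the elbow heuristic, the following sketch computes the K-means objective J for several values of K on a small synthetic dataset of three well-separated blobs; the data and function names are hypothetical, not from the lecture.

```python
import numpy as np

def kmeans_objective(X, K, n_iter=50, seed=0):
    """Run Lloyd's algorithm and return the final objective J = sum_n min_k ||x_n - mu_k||^2."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # squared distance to each mean
        r = d2.argmin(axis=1)                                   # assign each point to its closest mean
        mu = np.array([X[r == k].mean(axis=0) if (r == k).any() else mu[k] for k in range(K)])
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).sum()

# Synthetic data: three well-separated 2-D blobs (illustrative only)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in ([0, 0], [5, 0], [0, 5])])
for K in range(1, 8):
    print(K, round(kmeans_objective(X, K), 1))  # typically J drops sharply until K = 3, then flattens (the elbow)
```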

K-means (limitations). K-means makes hard assignments of points to clusters: a point either completely belongs to a cluster or doesn't belong at all; there is no notion of a soft assignment. It also works well only if the clusters are roughly of equal sizes. Probabilistic clustering methods such as Gaussian mixture models (which model each cluster using a Gaussian distribution) can handle both these issues. K-means also works well only when the clusters are round-shaped and does badly if the clusters have non-convex shapes; kernel K-means or spectral clustering can handle non-convex clusters.

Kernel K-means. The idea is to replace the Euclidean distance/similarity computations in K-means by their kernelized versions: d^2(x_n, µ_k) = ||ϕ(x_n) − ϕ(µ_k)||^2 = K(x_n, x_n) + K(µ_k, µ_k) − 2 K(µ_k, x_n). The feature maps don't have to be computed/stored for the data {x_n}_{n=1}^{N}, because the computations only depend on the kernel.

Kernel K-means. Computing K(x_n, x_n) is easy: simply evaluate the kernel function. Computing the terms involving µ_k requires some simple algebra. Compute K(µ_k, µ_k) as follows (assume N_k is the number of points in cluster k): K(µ_k, µ_k) = ϕ(µ_k)^T ϕ(µ_k) = [(1/N_k) \sum_{n=1}^{N_k} ϕ(x_n)]^T [(1/N_k) \sum_{m=1}^{N_k} ϕ(x_m)] = (1/N_k^2) \sum_{n=1}^{N_k} \sum_{m=1}^{N_k} ϕ(x_n)^T ϕ(x_m) = (1/N_k^2) \sum_{n=1}^{N_k} \sum_{m=1}^{N_k} K(x_n, x_m). K(µ_k, x_n) can be computed as K(µ_k, x_n) = ϕ(µ_k)^T ϕ(x_n) = [(1/N_k) \sum_{m=1}^{N_k} ϕ(x_m)]^T ϕ(x_n) = (1/N_k) \sum_{m=1}^{N_k} K(x_m, x_n).
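The following NumPy sketch implements exactly these identities, running kernel K-means directly on the Gram matrix; the RBF kernel, the gamma value, and all function names are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * d2)

def kernel_kmeans(K, n_clusters, n_iter=50, seed=0):
    """Kernel K-means using only the Gram matrix K (no explicit feature map)."""
    rng = np.random.default_rng(seed)
    N = K.shape[0]
    labels = rng.integers(n_clusters, size=N)          # random initial assignment
    for _ in range(n_iter):
        dist = np.zeros((N, n_clusters))
        for k in range(n_clusters):
            members = labels == k
            Nk = max(members.sum(), 1)
            # d^2(x_n, mu_k) = K(x_n, x_n) + K(mu_k, mu_k) - 2 K(mu_k, x_n)
            term_mu = K[np.ix_(members, members)].sum() / Nk ** 2
            term_cross = K[:, members].sum(axis=1) / Nk
            dist[:, k] = np.diag(K) + term_mu - 2 * term_cross
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# Example: two noisy concentric rings, a case where Euclidean K-means struggles.
# gamma and the random initialization are illustrative; results depend on both.
rng = np.random.default_rng(1)
theta = rng.uniform(0, 2 * np.pi, 200)
radius = np.repeat([1.0, 4.0], 100)
X = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)]) + rng.normal(0, 0.1, (200, 2))
labels = kernel_kmeans(rbf_kernel(X, gamma=2.0), n_clusters=2)
```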


Hierarchical clustering. A hierarchical clustering method works by grouping data objects into a hierarchy or tree of clusters. Agglomerative hierarchical clustering (AGNES) builds the hierarchy bottom-up by successively merging clusters, while divisive hierarchical clustering (DIANA) builds it top-down by successively splitting clusters. (Figure 10.6: agglomerative and divisive hierarchical clustering on data objects {a, b, c, d, e}.)

Distance measures (linkage measures). How do we compute the dissimilarity between two clusters, where each cluster is generally a set of objects? Four widely used measures for the distance between clusters are as follows. Minimum distance (single-link): d_min(C_i, C_j) = min_{p ∈ C_i, q ∈ C_j} ||p − q||; single-link can result in chaining (clusters can get very large). Maximum distance (complete-link): d_max(C_i, C_j) = max_{p ∈ C_i, q ∈ C_j} ||p − q||; complete-link results in small, round-shaped clusters. Mean distance: d_mean(C_i, C_j) = ||µ_i − µ_j||. Average distance (average-link): d_avg(C_i, C_j) = (1/(|C_i| |C_j|)) \sum_{p ∈ C_i} \sum_{q ∈ C_j} ||p − q||; average-link is a compromise between single and complete linkage.
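As an illustration, here is a naive NumPy sketch of agglomerative clustering in which a linkage argument selects one of the four distance measures above; the function names and the tiny example dataset are assumptions made for this sketch.

```python
import numpy as np
from itertools import combinations

def cluster_distance(A, B, linkage="single"):
    """Distance between clusters A and B (arrays of points) under the chosen linkage."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)  # all pairwise point distances
    if linkage == "single":    # minimum distance
        return d.min()
    if linkage == "complete":  # maximum distance
        return d.max()
    if linkage == "average":   # average distance
        return d.mean()
    if linkage == "mean":      # distance between cluster means
        return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
    raise ValueError(linkage)

def agglomerative(X, n_clusters=2, linkage="single"):
    """Naive O(N^3) agglomerative clustering; clusters are lists of row indices of X."""
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        # Find and merge the closest pair of clusters under the chosen linkage
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: cluster_distance(X[clusters[ij[0]]], X[clusters[ij[1]]], linkage))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Example: cluster five 1-D points into two groups with complete linkage
X = np.array([[1.0], [2.0], [9.0], [10.0], [25.0]])
print(agglomerative(X, n_clusters=2, linkage="complete"))  # [[0, 1, 2, 3], [4]]
```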

Hierarchical methods. Dendrogram representation for the hierarchical clustering of data objects {a, b, c, d, e}: the tree has levels l = 0, ..., 4, shown against a similarity scale running from 1.0 down to 0.0 (figure).


Model-based clustering. K-means is closely related to a probabilistic model known as the Gaussian mixture model: p(x) = \sum_{k=1}^{K} π_k N(x | µ_k, Σ_k), where π_k, µ_k, Σ_k are parameters. The π_k are called mixing proportions and each Gaussian is called a mixture component. The model is simply a weighted sum of Gaussians, but it is much more powerful than a single Gaussian, because it can model multi-modal distributions. Note that for p(x) to be a probability distribution, we require that \sum_k π_k = 1 and that π_k > 0 for all k; thus, we may interpret the π_k as probabilities themselves. The set of parameters is θ = {{π_k}, {µ_k}, {Σ_k}}. (Figure: a single Gaussian fit to the data versus a mixture of three Gaussians.)

Model-based clustering (cont.). Let us use a K-dimensional binary random variable z in which a particular element z_k equals 1 and all other elements are 0; the values of z_k therefore satisfy z_k ∈ {0, 1} and \sum_k z_k = 1. We define the joint distribution p(x, z) in terms of a marginal distribution p(z) and a conditional distribution p(x | z). The marginal distribution over z is specified in terms of π_k, such that p(z_k = 1) = π_k. We can write this distribution in the form p(z) = \prod_{k=1}^{K} π_k^{z_k}. The conditional distribution of x given a particular value for z is a Gaussian, p(x | z_k = 1) = N(x | µ_k, Σ_k), which can also be written in the form p(x | z) = \prod_{k=1}^{K} N(x | µ_k, Σ_k)^{z_k}.

Model-based clustering (cont.). The marginal distribution of x equals p(x) = \sum_z p(z) p(x | z) = \sum_{k=1}^{K} π_k N(x | µ_k, Σ_k). We can write p(z_k = 1 | x) as γ(z_k) ≜ p(z_k = 1 | x) = p(z_k = 1) p(x | z_k = 1) / p(x) = p(z_k = 1) p(x | z_k = 1) / \sum_{j=1}^{K} p(z_j = 1) p(x | z_j = 1) = π_k N(x | µ_k, Σ_k) / \sum_{j=1}^{K} π_j N(x | µ_j, Σ_j). We shall view π_k as the prior probability of z_k = 1, and the quantity γ(z_k) as the corresponding posterior probability once we have observed x.

Gaussian mixture model (example). Illustration of a mixture of 3 Gaussians in a two-dimensional space (Bishop, Figure 2.23): (a) contours of constant density for each of the mixture components, denoted red, blue, and green, with the values of the mixing coefficients shown below each component; (b) contours of the marginal probability density p(x) of the mixture distribution; (c) a surface plot of p(x). The mixing coefficients satisfy the requirements to be probabilities, and from the sum and product rules the marginal density is p(x) = \sum_{k=1}^{K} p(k) p(x | k), in which we can view π_k = p(k) as the prior probability of picking the k-th component and N(x | µ_k, Σ_k) = p(x | k) as its density. (Bishop, Figure 9.5: 500 points drawn from the mixture of 3 Gaussians shown in Figure 2.23.)

Model-based clustering (cont.). Let X = {x_1, ..., x_N} be drawn i.i.d. from a mixture of Gaussians. The log-likelihood of the observations equals ln p(X | µ, π, Σ) = \sum_{n=1}^{N} ln [ \sum_{k=1}^{K} π_k N(x_n | µ_k, Σ_k) ]. Setting the derivative of ln p(X | µ, π, Σ) with respect to µ_k equal to zero, we obtain 0 = \sum_{n=1}^{N} γ(z_nk) Σ_k^{-1} (x_n − µ_k), where γ(z_nk) = π_k N(x_n | µ_k, Σ_k) / \sum_{j=1}^{K} π_j N(x_n | µ_j, Σ_j). Multiplying by Σ_k and then simplifying, we obtain µ_k = (1/N_k) \sum_{n=1}^{N} γ(z_nk) x_n, where N_k = \sum_{n=1}^{N} γ(z_nk).

Model-based clustering (cont.). Setting the derivative of ln p(X | µ, π, Σ) with respect to Σ_k equal to zero, we obtain Σ_k = (1/N_k) \sum_{n=1}^{N} γ(z_nk) (x_n − µ_k)(x_n − µ_k)^T. We maximize ln p(X | µ, π, Σ) with respect to π_k under the constraint \sum_{k=1}^{K} π_k = 1. This can be achieved using a Lagrange multiplier and maximizing the quantity ln p(X | µ, π, Σ) + λ (\sum_{k=1}^{K} π_k − 1), which gives 0 = \sum_{n=1}^{N} N(x_n | µ_k, Σ_k) / \sum_{j=1}^{K} π_j N(x_n | µ_j, Σ_j) + λ. If we now multiply both sides by π_k and sum over k, making use of the constraint \sum_{k=1}^{K} π_k = 1, we find λ = −N. Using this to eliminate λ and rearranging, we obtain π_k = N_k / N.

EM for Gaussian mixture models. 1. Initialize µ_k, Σ_k, and π_k, and evaluate the initial value of the log-likelihood. 2. E step: evaluate γ(z_nk) using the current parameter values, γ(z_nk) = π_k N(x_n | µ_k, Σ_k) / \sum_{j=1}^{K} π_j N(x_n | µ_j, Σ_j). 3. M step: re-estimate the parameters using the current values of γ(z_nk): µ_k = (1/N_k) \sum_{n=1}^{N} γ(z_nk) x_n, Σ_k = (1/N_k) \sum_{n=1}^{N} γ(z_nk) (x_n − µ_k)(x_n − µ_k)^T, and π_k = N_k / N, where N_k = \sum_{n=1}^{N} γ(z_nk). 4. Evaluate the log-likelihood ln p(X | µ, π, Σ) = \sum_{n=1}^{N} ln [ \sum_{k=1}^{K} π_k N(x_n | µ_k, Σ_k) ] and check for convergence of either the parameters or the log-likelihood. If the convergence criterion is not satisfied, return to step 2. Please read Section 9.2 of Bishop.
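Below is a compact NumPy sketch of these E and M steps; the initialization choices, the small covariance regularization, and the synthetic usage data are my own illustrative assumptions rather than part of the lecture.

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Multivariate normal density N(x | mu, Sigma) evaluated at each row of X."""
    D = X.shape[1]
    diff = X - mu
    quad = np.einsum("nd,de,ne->n", diff, np.linalg.inv(Sigma), diff)
    norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def em_gmm(X, K, n_iter=100, seed=0):
    """EM for a Gaussian mixture model; returns (pi, mu, Sigma, responsibilities)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)]              # initialize means at random data points
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(D)] * K)    # shared initial covariance
    for _ in range(n_iter):
        # E step: responsibilities gamma(z_nk)
        dens = np.column_stack([pi[k] * gaussian_pdf(X, mu[k], Sigma[k]) for k in range(K)])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: re-estimate pi, mu, Sigma from the responsibilities
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N
    return pi, mu, Sigma, gamma

# Example: fit K = 2 components to two well-separated synthetic blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
pi, mu, Sigma, gamma = em_gmm(X, K=2)
print(np.round(mu, 2))  # approximately the two blob centers (0, 0) and (5, 5)
```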

Model-based clustering (example). Figure from Bishop (Section 9.2, Mixtures of Gaussians): the EM algorithm applied to a two-dimensional dataset, with snapshots after L = 1, 2, 5, and 20 iterations.

Model-based clustering (GMM vs K-means). Consider a dataset clustered by K-means and by a GMM. For the GMM clustering (rightmost figure), the most probable cluster for each point has been labeled. K-means, unlike GMM, tends to learn equi-sized clusters. In what situation are the results of GMM equivalent to the results of K-means? (Do it.)


Cluster validation and assessment. How good is the clustering generated by a method, and how can we compare the clusterings generated by different methods? Clustering is an unsupervised learning technique, and it is hard to evaluate the quality of the output of any given method. If we use probabilistic models, we can always evaluate the likelihood of a test set, but this has two drawbacks: 1. it does not directly assess any clustering that is discovered by the model; 2. it does not apply to non-probabilistic methods. We therefore discuss some performance measures not based on likelihood. The goal of clustering is to assign points that are similar to the same cluster, and to ensure that points that are dissimilar are in different clusters. There are several ways of measuring these quantities. 1. Internal criterion: typical objective functions in clustering formalize the goal of attaining high intra-cluster similarity and low inter-cluster similarity. But good scores on an internal criterion do not necessarily translate into good effectiveness in an application; an alternative to internal criteria is direct evaluation in the application of interest. 2. External criterion: suppose we have labels for each object. Then we can compare the clustering with the labels using various metrics. We will use some of these metrics later, when we compare clustering methods.

Purity. Purity is a simple and transparent evaluation measure. Consider the following clustering (three clusters with labeled objects inside; based on Figure 16.4 of Manning et al.). Let N_ij be the number of objects in cluster i that belong to class j, and N_i = \sum_{j=1}^{C} N_ij be the total number of objects in cluster i. We define the purity of cluster i as p_i ≜ max_j (N_ij / N_i), and the overall purity of a clustering as purity ≜ \sum_i (N_i / N) p_i. For the above figure, the purity is (6/17)(5/6) + (6/17)(4/6) + (5/17)(3/5) = (5 + 4 + 3)/17 = 0.71. Bad clusterings have purity values close to 0; a perfect clustering has a purity of 1. However, high purity is easy to achieve when the number of clusters is large; in particular, purity is 1 if each point gets its own cluster. Thus, we cannot use purity to trade off the quality of the clustering against the number of clusters.
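A short NumPy sketch of this computation; the purity function and the label vectors reproducing the 6/6/5 example above are illustrative assumptions.

```python
import numpy as np

def purity(labels_pred, labels_true):
    """Overall purity = (1/N) * sum_i max_j N_ij, where N_ij counts objects of class j in cluster i."""
    labels_pred = np.asarray(labels_pred)
    labels_true = np.asarray(labels_true)
    clusters, classes = np.unique(labels_pred), np.unique(labels_true)
    # Contingency counts N_ij: objects in cluster i that belong to class j
    N_ij = np.array([[np.sum((labels_pred == c) & (labels_true == t)) for t in classes]
                     for c in clusters])
    return N_ij.max(axis=1).sum() / len(labels_true)

# Cluster sizes 6, 6, 5 with majority counts 5, 4, 3, mimicking the worked example above
pred = [1] * 6 + [2] * 6 + [3] * 5
true = list("xxxxxo") + list("xooood") + list("xxddd")
print(round(purity(pred, true), 2))  # 0.71
```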

Rand index. Let U = {u_1, ..., u_R} and V = {v_1, ..., v_C} be two different (flat) clusterings of N data points; for example, U might be the estimated clustering and V a reference clustering derived from the class labels. Define a 2×2 contingency table containing the following numbers: 1. TP is the number of pairs that are in the same cluster in both U and V (true positives); 2. TN is the number of pairs that are in different clusters in both U and V (true negatives); 3. FN is the number of pairs that are in different clusters in U but the same cluster in V (false negatives); 4. FP is the number of pairs that are in the same cluster in U but different clusters in V (false positives). The Rand index is defined as RI ≜ (TP + TN) / (TP + FP + FN + TN); it can be interpreted as the fraction of pairwise clustering decisions that are correct. Clearly RI ∈ [0, 1].

Rand index (example). Consider the clustering of the earlier figure (three clusters with labeled objects inside, based on Figure 16.4 of Manning et al. 2008). The three clusters contain 6, 6, and 5 points, so the number of pairs placed in the same cluster is TP + FP = C(6,2) + C(6,2) + C(5,2) = 40. The number of true positives is TP = C(5,2) + C(4,2) + C(3,2) + C(2,2) = 20. Then FP = 40 − 20 = 20; similarly, FN = 24 and TN = 72. Hence the Rand index is RI = (20 + 72) / (20 + 20 + 24 + 72) = 0.68. The Rand index only achieves its lower bound of 0 if TP = TN = 0, which is a rare event. We can therefore define an adjusted Rand index ARI ≜ (index − E[index]) / (max index − E[index]).
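A brute-force sketch that counts the four pair types directly; the label vectors below reconstruct the cluster sizes and class counts used in the worked example and are otherwise hypothetical.

```python
from itertools import combinations

def rand_index(labels_pred, labels_true):
    """Fraction of pairwise clustering decisions that agree with the reference labels."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_pred = labels_pred[i] == labels_pred[j]
        same_true = labels_true[i] == labels_true[j]
        if same_pred and same_true:
            tp += 1      # same cluster in both U and V
        elif same_pred:
            fp += 1      # same cluster in U, different in V
        elif same_true:
            fn += 1      # different in U, same cluster in V
        else:
            tn += 1      # different clusters in both
    return (tp + tn) / (tp + fp + fn + tn)

U = [1] * 6 + [2] * 6 + [3] * 5                      # estimated clusters (sizes 6, 6, 5)
V = list("xxxxxo") + list("xooood") + list("xxddd")  # reference class labels
print(round(rand_index(U, V), 2))  # 0.68, as in the worked example
```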

Mutual information. Another measure of cluster quality is the mutual information between U and V. Let P_UV(i, j) = |u_i ∩ v_j| / N be the probability that a randomly chosen object belongs to cluster u_i in U and to cluster v_j in V, let P_U(i) = |u_i| / N be the probability that a randomly chosen object belongs to cluster u_i in U, and let P_V(j) = |v_j| / N be the probability that a randomly chosen object belongs to cluster v_j in V. Then the mutual information is defined as I(U, V) ≜ \sum_{i=1}^{R} \sum_{j=1}^{C} P_UV(i, j) log [ P_UV(i, j) / (P_U(i) P_V(j)) ]. This lies between 0 and min{H(U), H(V)}. The maximum value can be achieved by using lots of small clusters, which have low entropy. To compensate for this, we can use the normalized mutual information (NMI), NMI(U, V) ≜ I(U, V) / ( (1/2) [H(U) + H(V)] ), which lies between 0 and 1. Please read Section 25.1 of Murphy.
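Finally, a direct NumPy translation of these definitions, estimating the empirical distributions from two label vectors; the function names are illustrative.

```python
import numpy as np

def mutual_info(U, V):
    """I(U, V) computed from the empirical joint distribution P_UV(i, j) = |u_i ∩ v_j| / N."""
    U, V = np.asarray(U), np.asarray(V)
    N = len(U)
    I = 0.0
    for i in np.unique(U):
        for j in np.unique(V):
            p_uv = np.sum((U == i) & (V == j)) / N
            p_u, p_v = np.sum(U == i) / N, np.sum(V == j) / N
            if p_uv > 0:
                I += p_uv * np.log(p_uv / (p_u * p_v))
    return I

def entropy(labels):
    """H(U) of the empirical cluster-size distribution."""
    labels = np.asarray(labels)
    p = np.array([np.mean(labels == c) for c in np.unique(labels)])
    return -np.sum(p * np.log(p))

def nmi(U, V):
    """Normalized mutual information: I(U, V) / ((H(U) + H(V)) / 2), which lies in [0, 1]."""
    return mutual_info(U, V) / (0.5 * (entropy(U) + entropy(V)))
```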