Data Mining and Analysis: Fundamental Concepts and Algorithms


Data Mining and Analysis: Fundamental Concepts and Algorithms (dataminingbook.info)
Mohammed J. Zaki, Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
Wagner Meira Jr., Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
Chapter 13: Representative-based Clustering

Representative-based Clustering

Given a dataset with $n$ points in a $d$-dimensional space, $D = \{x_i\}_{i=1}^{n}$, and given the number of desired clusters $k$, the goal of representative-based clustering is to partition the dataset into $k$ groups or clusters, called a clustering and denoted $\mathcal{C} = \{C_1, C_2, \ldots, C_k\}$. For each cluster $C_i$ there exists a representative point that summarizes the cluster, a common choice being the mean (also called the centroid) $\mu_i$ of all points in the cluster, that is,

$\mu_i = \frac{1}{n_i} \sum_{x_j \in C_i} x_j$

where $n_i = |C_i|$ is the number of points in cluster $C_i$.

A brute-force or exhaustive algorithm for finding a good clustering is simply to generate all possible partitions of $n$ points into $k$ clusters, evaluate some optimization score for each of them, and retain the clustering that yields the best score. However, this is clearly infeasible, since there are $O(k^n / k!)$ clusterings of $n$ points into $k$ groups.
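As a small illustration of the centroid formula, the following Python sketch (our own, not from the book; it assumes numpy, a data matrix D of shape (n, d), and a hypothetical list clusters of index arrays, one per cluster) computes the mean of each cluster:

import numpy as np

def centroids(D, clusters):
    # mu_i = (1/n_i) * sum of the points x_j in cluster C_i
    # D: (n, d) data matrix; clusters: list of index arrays, one per C_i
    return np.array([D[idx].mean(axis=0) for idx in clusters])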

K-means Algorithm: Objective

The sum of squared errors (SSE) scoring function is defined as

$SSE(\mathcal{C}) = \sum_{i=1}^{k} \sum_{x_j \in C_i} \lVert x_j - \mu_i \rVert^2$

The goal is to find the clustering that minimizes the SSE score:

$\mathcal{C}^* = \arg\min_{\mathcal{C}} \{ SSE(\mathcal{C}) \}$

K-means employs a greedy iterative approach to find a clustering that minimizes the SSE objective. As such, it can converge to a local optimum instead of a globally optimal clustering.
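The SSE score can be computed directly from its definition. A minimal sketch (assuming the same numpy representation as above; the helper name sse is ours):

import numpy as np

def sse(D, clusters, mus):
    # SSE(C) = sum_i sum_{x_j in C_i} ||x_j - mu_i||^2
    return sum(np.sum((D[idx] - mu) ** 2) for idx, mu in zip(clusters, mus))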

K-means Algorithm: Objective

K-means initializes the cluster means by randomly generating $k$ points in the data space. Each iteration of K-means consists of two steps: (1) cluster assignment, and (2) centroid update.

Given the $k$ cluster means, in the cluster assignment step each point $x_j \in D$ is assigned to the closest mean, which induces a clustering, with each cluster $C_i$ comprising points that are closer to $\mu_i$ than to any other cluster mean. That is, each point $x_j$ is assigned to cluster $C_{j^*}$, where

$j^* = \arg\min_{i=1}^{k} \{ \lVert x_j - \mu_i \rVert^2 \}$

Given a set of clusters $C_i$, $i = 1, \ldots, k$, in the centroid update step new mean values are computed for each cluster from the points in $C_i$.

The cluster assignment and centroid update steps are carried out iteratively until we reach a fixed point or local minimum.

K-Means Algorithm

K-MEANS (D, k, ε):
  t = 0
  Randomly initialize k centroids: µ_1^t, µ_2^t, ..., µ_k^t ∈ R^d
  repeat
    t ← t + 1
    C_j ← ∅ for all j = 1, ..., k
    // Cluster assignment step
    foreach x_j ∈ D do
      j* ← arg min_i { ||x_j − µ_i^{t−1}||^2 }   // assign x_j to the closest centroid
      C_{j*} ← C_{j*} ∪ {x_j}
    // Centroid update step
    foreach i = 1 to k do
      µ_i^t ← (1/|C_i|) Σ_{x_j ∈ C_i} x_j
  until Σ_{i=1}^{k} ||µ_i^t − µ_i^{t−1}||^2 ≤ ε
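A straightforward Python rendering of this pseudocode might look as follows; this is a sketch under our own conventions (numpy arrays, uniform random initialization within the data range, empty clusters keep their old centroid), not the authors' reference implementation:

import numpy as np

def kmeans(D, k, eps=1e-6, max_iter=100, seed=None):
    rng = np.random.default_rng(seed)
    n, d = D.shape
    # Randomly initialize k centroids within the bounding box of the data
    lo, hi = D.min(axis=0), D.max(axis=0)
    mu = rng.uniform(lo, hi, size=(k, d))
    for t in range(max_iter):
        # Cluster assignment step: assign each point to its closest centroid
        dist = ((D[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (n, k)
        labels = dist.argmin(axis=1)
        # Centroid update step: recompute each mean (keep old mean if cluster is empty)
        new_mu = np.array([D[labels == i].mean(axis=0) if np.any(labels == i) else mu[i]
                           for i in range(k)])
        # Stop when the total squared centroid displacement falls below eps
        done = np.sum((new_mu - mu) ** 2) <= eps
        mu = new_mu
        if done:
            break
    return labels, mu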

K-means in One Dimension

[Figure: K-means on a one-dimensional dataset, showing the cluster means µ_1 and µ_2. Panels: (a) initial dataset, (b) iteration t = 1, (c) iteration t = 2.]

K-means in One Dimension (contd.)

[Figure: panels (d) iteration t = 3, (e) iteration t = 4, (f) iteration t = 5 (converged), showing the cluster means µ_1 and µ_2.]

K-means in 2D: Iris Principal Components

[Figure: (a) random initialization, t = 0; axes are the first two principal components u_1 and u_2.]

K-means in 2D: Iris Principal Components

[Figure: (b) iteration t = 1.]

K-means in 2D: Iris Principal Components

[Figure: (c) iteration t = 8 (converged).]

Kernel K-means

In K-means, the separating boundary between clusters is linear. Kernel K-means allows one to extract nonlinear boundaries between clusters via the kernel trick, i.e., all the required operations can be expressed using only kernel values between pairs of points.

Let $x_i \in D$ be mapped to $\phi(x_i)$ in feature space, and let $K = \{K(x_i, x_j)\}_{i,j=1,\ldots,n}$ denote the $n \times n$ symmetric kernel matrix, where $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$.

The cluster means in feature space are $\{\mu_1^\phi, \ldots, \mu_k^\phi\}$, where $\mu_i^\phi = \frac{1}{n_i} \sum_{x_j \in C_i} \phi(x_j)$.

The sum of squared errors in feature space is

$\min_{\mathcal{C}} SSE(\mathcal{C}) = \sum_{i=1}^{k} \sum_{x_j \in C_i} \lVert \phi(x_j) - \mu_i^\phi \rVert^2 = \sum_{j=1}^{n} K(x_j, x_j) - \sum_{i=1}^{k} \frac{1}{n_i} \sum_{x_a \in C_i} \sum_{x_b \in C_i} K(x_a, x_b)$

Thus, the kernel K-means SSE objective function can be expressed purely in terms of the kernel function.
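To make the identity concrete, the following sketch (ours; clusters is again a hypothetical list of index arrays) evaluates the feature-space SSE using only entries of the kernel matrix K:

import numpy as np

def kernel_sse(K, clusters):
    # SSE(C) = sum_j K(x_j, x_j) - sum_i (1/n_i) * sum_{a,b in C_i} K(x_a, x_b)
    total = np.trace(K)
    for idx in clusters:
        total -= K[np.ix_(idx, idx)].sum() / len(idx)
    return total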

Kernel K-means: Cluster Reassignment

Consider the distance of a point $\phi(x_j)$ to the mean $\mu_i^\phi$ in feature space:

$\lVert \phi(x_j) - \mu_i^\phi \rVert^2 = \lVert \phi(x_j) \rVert^2 - 2\,\phi(x_j)^T \mu_i^\phi + \lVert \mu_i^\phi \rVert^2 = K(x_j, x_j) - \frac{2}{n_i} \sum_{x_a \in C_i} K(x_a, x_j) + \frac{1}{n_i^2} \sum_{x_a \in C_i} \sum_{x_b \in C_i} K(x_a, x_b)$

Kernel K-means assigns a point to the closest cluster mean as follows:

$\mathcal{C}^*(x_j) = \arg\min_i \left\{ \lVert \phi(x_j) - \mu_i^\phi \rVert^2 \right\} = \arg\min_i \left\{ \frac{1}{n_i^2} \sum_{x_a \in C_i} \sum_{x_b \in C_i} K(x_a, x_b) - \frac{2}{n_i} \sum_{x_a \in C_i} K(x_a, x_j) \right\}$

since the term $K(x_j, x_j)$ is the same for all clusters and can be dropped.

Kernel K-means Algorithm

KERNEL-KMEANS (K, k, ε):
  t ← 0
  C^t ← {C_1^t, ..., C_k^t}   // randomly partition points into k clusters
  repeat
    t ← t + 1
    // Compute squared norm of cluster means
    foreach C_i ∈ C^{t−1} do
      sqnorm_i ← (1/n_i^2) Σ_{x_a ∈ C_i} Σ_{x_b ∈ C_i} K(x_a, x_b)
    // Average kernel value for x_j and C_i
    foreach x_j ∈ D do
      foreach C_i ∈ C^{t−1} do
        avg_ji ← (1/n_i) Σ_{x_a ∈ C_i} K(x_a, x_j)
    // Find closest cluster for each point
    foreach x_j ∈ D do
      foreach C_i ∈ C^{t−1} do
        d(x_j, C_i) ← sqnorm_i − 2·avg_ji
      j* ← arg min_i { d(x_j, C_i) }
      C_{j*}^t ← C_{j*}^t ∪ {x_j}   // cluster reassignment
    C^t ← {C_1^t, ..., C_k^t}
  until 1 − (1/n) Σ_{i=1}^{k} |C_i^t ∩ C_i^{t−1}| ≤ ε
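A compact Python sketch of the same iteration (our own approximation; it stops when no point changes cluster instead of thresholding the fraction of reassigned points, and it leaves the distance to an empty cluster at zero):

import numpy as np

def kernel_kmeans(K, k, max_iter=100, seed=None):
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    labels = rng.integers(0, k, size=n)          # random initial partition
    for t in range(max_iter):
        d = np.zeros((n, k))
        for i in range(k):
            idx = np.where(labels == i)[0]
            if len(idx) == 0:
                continue                          # empty cluster: leave d[:, i] at 0
            n_i = len(idx)
            # sqnorm_i = (1/n_i^2) * sum_{a,b in C_i} K(x_a, x_b)
            sqnorm = K[np.ix_(idx, idx)].sum() / n_i**2
            # avg_ji = (1/n_i) * sum_{a in C_i} K(x_a, x_j), for every point x_j
            avg = K[idx, :].sum(axis=0) / n_i
            d[:, i] = sqnorm - 2.0 * avg
        new_labels = d.argmin(axis=1)             # cluster reassignment
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels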

K-means or Kernel K-means with Linear Kernel

[Figure: (a) linear kernel, converged after t = 5 iterations; axes X_1 and X_2.]

Kernel K-means: Gaussian Kernel

[Figure: (b) Gaussian kernel, converged after t = 4 iterations.]

Expectation-Maximization Clustering: Gaussian Mixture Model

Let $X_a$ denote the random variable corresponding to the $a$-th attribute, and let $X = (X_1, X_2, \ldots, X_d)$ denote the vector random variable across the $d$ attributes, with $x_j$ being a data sample from $X$.

We assume that each cluster $C_i$ is characterized by a multivariate normal distribution

$f_i(x) = f(x \mid \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{d/2} \lvert \Sigma_i \rvert^{1/2}} \exp\left\{ -\frac{(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)}{2} \right\}$

where the cluster mean $\mu_i \in \mathbb{R}^d$ and covariance matrix $\Sigma_i \in \mathbb{R}^{d \times d}$ are both unknown parameters.

The probability density function of $X$ is given as a Gaussian mixture model over all the $k$ clusters:

$f(x) = \sum_{i=1}^{k} f_i(x)\, P(C_i) = \sum_{i=1}^{k} f(x \mid \mu_i, \Sigma_i)\, P(C_i)$

where the prior probabilities $P(C_i)$ are called the mixture parameters, which must satisfy the condition $\sum_{i=1}^{k} P(C_i) = 1$.

Expectation-Maximization Clustering: Maximum Likelihood Estimation

We write the set of all the model parameters compactly as

$\theta = \{\mu_1, \Sigma_1, P(C_1), \ldots, \mu_k, \Sigma_k, P(C_k)\}$

Given the dataset $D$, we define the likelihood of $\theta$ as the conditional probability of the data $D$ given the model parameters $\theta$:

$P(D \mid \theta) = \prod_{j=1}^{n} f(x_j)$

The goal of maximum likelihood estimation (MLE) is to choose the parameters $\theta$ that maximize the likelihood. We do this by maximizing the log of the likelihood function:

$\theta^* = \arg\max_{\theta} \{\ln P(D \mid \theta)\}$

where the log-likelihood function is given as

$\ln P(D \mid \theta) = \sum_{j=1}^{n} \ln f(x_j) = \sum_{j=1}^{n} \ln\left( \sum_{i=1}^{k} f(x_j \mid \mu_i, \Sigma_i)\, P(C_i) \right)$
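For concreteness, the log-likelihood can be evaluated with an off-the-shelf multivariate normal density; a sketch assuming the parameters are given as lists mus and Sigmas and an array priors (these names are ours):

import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(D, mus, Sigmas, priors):
    # ln P(D | theta) = sum_j ln( sum_i f(x_j | mu_i, Sigma_i) * P(C_i) )
    mix = np.zeros(D.shape[0])
    for mu, Sigma, p in zip(mus, Sigmas, priors):
        mix += p * multivariate_normal.pdf(D, mean=mu, cov=Sigma)
    return np.sum(np.log(mix))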

Expectation-Maximization Clustering

Directly maximizing the log-likelihood over $\theta$ is hard. Instead, we can use the expectation-maximization (EM) approach to find the maximum likelihood estimates for the parameters $\theta$. EM is a two-step iterative approach that starts from an initial guess for the parameters $\theta$.

Given the current estimates for $\theta$, in the expectation step EM computes the cluster posterior probabilities $P(C_i \mid x_j)$ via Bayes' theorem:

$P(C_i \mid x_j) = \frac{P(C_i \text{ and } x_j)}{P(x_j)} = \frac{P(x_j \mid C_i)\, P(C_i)}{\sum_{a=1}^{k} P(x_j \mid C_a)\, P(C_a)} = \frac{f_i(x_j)\, P(C_i)}{\sum_{a=1}^{k} f_a(x_j)\, P(C_a)}$

In the maximization step, using the weights $P(C_i \mid x_j)$, EM re-estimates $\theta$, that is, it re-estimates the parameters $\mu_i$, $\Sigma_i$, and $P(C_i)$ for each cluster $C_i$. The re-estimated mean is the weighted average of all the points, the re-estimated covariance matrix is the weighted covariance over all pairs of dimensions, and the re-estimated prior probability of each cluster is the fraction of the total weight that contributes to that cluster.
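The expectation step is then a direct application of Bayes' theorem. A minimal sketch, reusing the parameter representation assumed above:

import numpy as np
from scipy.stats import multivariate_normal

def posteriors(D, mus, Sigmas, priors):
    # w[j, i] = P(C_i | x_j) = f_i(x_j) P(C_i) / sum_a f_a(x_j) P(C_a)
    dens = np.column_stack([p * multivariate_normal.pdf(D, mean=mu, cov=Sigma)
                            for mu, Sigma, p in zip(mus, Sigmas, priors)])
    return dens / dens.sum(axis=1, keepdims=True)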

EM in One Dimension: Expectation Step

Let $D$ comprise a single attribute $X$, with each point $x_j \in \mathbb{R}$ a random sample from $X$. For the mixture model, we use univariate normals for each cluster:

$f_i(x) = f(x \mid \mu_i, \sigma_i^2) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left\{ -\frac{(x - \mu_i)^2}{2\sigma_i^2} \right\}$

with the cluster parameters $\mu_i$, $\sigma_i^2$, and $P(C_i)$.

Initialization: for each cluster $C_i$, with $i = 1, 2, \ldots, k$, we can randomly initialize the cluster parameters $\mu_i$, $\sigma_i^2$, and $P(C_i)$.

Expectation step: given the mean $\mu_i$, variance $\sigma_i^2$, and prior probability $P(C_i)$ for each cluster, the cluster posterior probability is computed as

$w_{ij} = P(C_i \mid x_j) = \frac{f(x_j \mid \mu_i, \sigma_i^2)\, P(C_i)}{\sum_{a=1}^{k} f(x_j \mid \mu_a, \sigma_a^2)\, P(C_a)}$

EM in One Dimension: Maximization Step

Given the $w_{ij}$ values, the re-estimated cluster mean is

$\mu_i = \frac{\sum_{j=1}^{n} w_{ij}\, x_j}{\sum_{j=1}^{n} w_{ij}}$

The re-estimated value of the cluster variance is computed as the weighted variance across all the points:

$\sigma_i^2 = \frac{\sum_{j=1}^{n} w_{ij} (x_j - \mu_i)^2}{\sum_{j=1}^{n} w_{ij}}$

The prior probability of cluster $C_i$ is re-estimated as

$P(C_i) = \frac{\sum_{j=1}^{n} w_{ij}}{n}$
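Putting the one-dimensional expectation and maximization steps together, one possible sketch of the full loop (ours; the means are initialized by sampling k data points, which is an assumption, and a fixed number of iterations replaces a convergence test):

import numpy as np
from scipy.stats import norm

def em_1d(x, k, n_iter=100, seed=None):
    rng = np.random.default_rng(seed)
    # Initialize mu_i from the data, sigma_i^2 from the overall variance, P(C_i) = 1/k
    mu = rng.choice(x, size=k, replace=False)
    var = np.full(k, x.var())
    prior = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # Expectation step: w_ij = P(C_i | x_j)
        dens = np.array([p * norm.pdf(x, m, np.sqrt(v))
                         for m, v, p in zip(mu, var, prior)])     # (k, n)
        w = dens / dens.sum(axis=0, keepdims=True)
        # Maximization step: weighted mean, weighted variance, cluster prior
        wsum = w.sum(axis=1)
        mu = (w @ x) / wsum
        var = np.array([wi @ (x - m) ** 2 for wi, m in zip(w, mu)]) / wsum
        prior = wsum / len(x)
    return mu, var, prior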

EM in One Dimension

[Figure: the estimated cluster densities at (a) initialization (t = 0), with µ_1 = 6.63, and (b) iteration t = 1.]

EM in One Dimension: Final Clusters

[Figure: (c) iteration t = 5 (converged), showing the final cluster means µ_1 and µ_2.]

EM in d Dimensions

For each cluster we have to re-estimate the $d \times d$ covariance matrix:

$\Sigma_i = \begin{pmatrix} (\sigma_1^i)^2 & \sigma_{12}^i & \cdots & \sigma_{1d}^i \\ \sigma_{21}^i & (\sigma_2^i)^2 & \cdots & \sigma_{2d}^i \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{d1}^i & \sigma_{d2}^i & \cdots & (\sigma_d^i)^2 \end{pmatrix}$

This requires $O(d^2)$ parameters, which may be too many for reliable estimation. A simplification is to assume that all dimensions are independent, which leads to a diagonal covariance matrix:

$\Sigma_i = \begin{pmatrix} (\sigma_1^i)^2 & 0 & \cdots & 0 \\ 0 & (\sigma_2^i)^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & (\sigma_d^i)^2 \end{pmatrix}$

EM in d Dimensions

Expectation step: given $\mu_i$, $\Sigma_i$, and $P(C_i)$, the posterior probability is given as

$w_{ij} = P(C_i \mid x_j) = \frac{f_i(x_j)\, P(C_i)}{\sum_{a=1}^{k} f_a(x_j)\, P(C_a)}$

Maximization step: given the weights $w_{ij}$, we re-estimate $\mu_i$, $\Sigma_i$, and $P(C_i)$. The mean $\mu_i$ for cluster $C_i$ can be estimated as

$\mu_i = \frac{\sum_{j=1}^{n} w_{ij}\, x_j}{\sum_{j=1}^{n} w_{ij}}$

The covariance matrix $\Sigma_i$ is re-estimated via the outer-product form

$\Sigma_i = \frac{\sum_{j=1}^{n} w_{ij} (x_j - \mu_i)(x_j - \mu_i)^T}{\sum_{j=1}^{n} w_{ij}}$

The prior probability $P(C_i)$ for each cluster is

$P(C_i) = \frac{\sum_{j=1}^{n} w_{ij}}{n}$
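The three re-estimation formulas translate directly into numpy. A sketch of the maximization step alone (W is the n × k matrix of weights w_ij produced by the expectation step; the names are ours):

import numpy as np

def m_step(D, W):
    n, d = D.shape
    wsum = W.sum(axis=0)                       # total weight per cluster
    mus = (W.T @ D) / wsum[:, None]            # weighted means
    Sigmas = []
    for i in range(W.shape[1]):
        diff = D - mus[i]                      # (n, d) deviations from the new mean
        Sigmas.append((W[:, i][:, None] * diff).T @ diff / wsum[i])   # outer-product form
    priors = wsum / n                          # re-estimated priors
    return mus, Sigmas, priors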

Expectation-Maximization Clustering Algorithm

EXPECTATION-MAXIMIZATION (D, k, ε):
  t ← 0
  Randomly initialize µ_1^t, ..., µ_k^t
  Σ_i^t ← I, for i = 1, ..., k
  P^t(C_i) ← 1/k, for i = 1, ..., k
  repeat
    t ← t + 1
    // Expectation step
    for i = 1, ..., k and j = 1, ..., n do
      w_ij ← f(x_j | µ_i, Σ_i) P(C_i) / Σ_{a=1}^{k} f(x_j | µ_a, Σ_a) P(C_a)   // posterior probability P^t(C_i | x_j)
    // Maximization step
    for i = 1, ..., k do
      µ_i^t ← Σ_{j=1}^{n} w_ij x_j / Σ_{j=1}^{n} w_ij                        // re-estimate mean
      Σ_i^t ← Σ_{j=1}^{n} w_ij (x_j − µ_i)(x_j − µ_i)^T / Σ_{j=1}^{n} w_ij   // re-estimate covariance matrix
      P^t(C_i) ← Σ_{j=1}^{n} w_ij / n                                        // re-estimate priors
  until Σ_{i=1}^{k} ||µ_i^t − µ_i^{t−1}||^2 ≤ ε
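Combining the expectation and maximization steps gives a complete, if simplified, EM sketch (ours; the means are initialized from k random data points rather than arbitrary points, and scipy supplies the Gaussian density):

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(D, k, eps=1e-6, max_iter=200, seed=None):
    rng = np.random.default_rng(seed)
    n, d = D.shape
    mus = D[rng.choice(n, size=k, replace=False)]    # initial means
    Sigmas = [np.eye(d) for _ in range(k)]           # identity covariance matrices
    priors = np.full(k, 1.0 / k)
    for t in range(max_iter):
        # Expectation step: posterior weights w_ij
        dens = np.column_stack([p * multivariate_normal.pdf(D, mean=mu, cov=S)
                                for mu, S, p in zip(mus, Sigmas, priors)])
        W = dens / dens.sum(axis=1, keepdims=True)
        # Maximization step: re-estimate means, covariance matrices, and priors
        wsum = W.sum(axis=0)
        new_mus = (W.T @ D) / wsum[:, None]
        Sigmas = [((W[:, i][:, None] * (D - new_mus[i])).T @ (D - new_mus[i])) / wsum[i]
                  for i in range(k)]
        priors = wsum / n
        # Stop when the total squared change in the means is at most eps
        done = np.sum((new_mus - mus) ** 2) <= eps
        mus = new_mus
        if done:
            break
    return mus, Sigmas, priors, W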

EM Clustering in 2D: Mixture of k = 3 Gaussians

[Figure: mixture density f(x) over X_1 and X_2 at (a) iteration t = 0.]

EM Clustering in 2D: Mixture of k = 3 Gaussians

[Figure: mixture density f(x) over X_1 and X_2 at (b) iteration t = 1.]

EM Clustering in 2D: Mixture of k = 3 Gaussians

[Figure: mixture density f(x) over X_1 and X_2 at (c) iteration t = 36 (converged).]

Iris Principal Components Data: Full Covariance Matrix

[Figure: (a) clustering with full covariance matrices, converged at t = 36; axes X_1 and X_2.]

Iris Principal Components Data: Diagonal Covariance Matrix

[Figure: (b) clustering with diagonal covariance matrices, converged at t = 29; axes X_1 and X_2.]
