Data Mining and Analysis: Fundamental Concepts and Algorithms
dataminingbook.info

Mohammed J. Zaki (1), Wagner Meira Jr. (2)

(1) Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
(2) Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

Chapter 16: Spectral & Graph Clustering
Graphs and Matrices: Adjacency Matrix

Given a dataset $D = \{x_i\}_{i=1}^n$ consisting of $n$ points in $\mathbb{R}^d$, let $A$ denote the $n \times n$ symmetric similarity matrix between the points, given as

$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix}$$

where $A(i,j) = a_{ij}$ denotes the similarity or affinity between points $x_i$ and $x_j$. We require the similarity to be symmetric and non-negative, that is, $a_{ij} = a_{ji}$ and $a_{ij} \ge 0$, respectively. The matrix $A$ is the weighted adjacency matrix for the data graph. If all affinities are 0 or 1, then $A$ represents the regular adjacency relationship between the vertices.
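As one concrete way to obtain such a similarity matrix, the Gaussian (heat) kernel below is a common choice; the kernel and the bandwidth `sigma` are illustrative assumptions here, not something the slide prescribes:

```python
import numpy as np

def gaussian_similarity(X, sigma=1.0):
    """Symmetric, non-negative affinity matrix with
    A[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    # Pairwise squared Euclidean distances via broadcasting.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
A = gaussian_similarity(X)
# A is symmetric (a_ij = a_ji) and non-negative (a_ij >= 0), as required above.
```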
Iris Similarity Graph: Mutual Nearest Neighbors

[Figure: mutual nearest-neighbor similarity graph over the Iris dataset, with $|V| = n = 150$ vertices and $|E| = m = 1730$ edges.]
Graphs and Matrices: Degree Matrix

For a vertex $x_i$, let $d_i$ denote the degree of the vertex, defined as

$$d_i = \sum_{j=1}^n a_{ij}$$

We define the degree matrix $\Delta$ of graph $G$ as the $n \times n$ diagonal matrix:

$$\Delta = \begin{pmatrix} d_1 & 0 & \cdots & 0 \\ 0 & d_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & d_n \end{pmatrix} = \begin{pmatrix} \sum_{j=1}^n a_{1j} & 0 & \cdots & 0 \\ 0 & \sum_{j=1}^n a_{2j} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sum_{j=1}^n a_{nj} \end{pmatrix}$$

$\Delta$ can be compactly written as $\Delta(i,i) = d_i$ for all $1 \le i \le n$.
Graphs and Matrices: Normalized Adjacency Matrix

The normalized adjacency matrix is obtained by dividing each row of the adjacency matrix by the degree of the corresponding node. Given the weighted adjacency matrix $A$ for a graph $G$, its normalized adjacency matrix is defined as

$$M = \Delta^{-1} A = \begin{pmatrix} \frac{a_{11}}{d_1} & \frac{a_{12}}{d_1} & \cdots & \frac{a_{1n}}{d_1} \\ \frac{a_{21}}{d_2} & \frac{a_{22}}{d_2} & \cdots & \frac{a_{2n}}{d_2} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{a_{n1}}{d_n} & \frac{a_{n2}}{d_n} & \cdots & \frac{a_{nn}}{d_n} \end{pmatrix}$$

Because $A$ is assumed to have non-negative elements, each element of $M$, namely $m_{ij}$, is also non-negative, as $m_{ij} = \frac{a_{ij}}{d_i} \ge 0$. Each row in $M$ sums to 1, which implies that 1 is an eigenvalue of $M$. In fact, $\lambda_1 = 1$ is the largest eigenvalue of $M$, and the other eigenvalues satisfy $|\lambda_i| \le 1$. Because $M$ is not symmetric, its eigenvectors are not necessarily orthogonal.
Example Graph: Adjacency and Degree Matrices

[Figure: example graph on vertices 1–7.]

Its adjacency and degree matrices are given as

$$A = \begin{pmatrix} 0 & 1 & 0 & 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 & 0 & 0 & 1 \\ 1 & 1 & 1 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 1 & 1 \\ 1 & 0 & 0 & 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 & 1 & 1 & 0 \end{pmatrix} \qquad \Delta = \begin{pmatrix} 3 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 3 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 3 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 4 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 3 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 3 \end{pmatrix}$$
Example Graph: Normalized Adjacency Matrix

[Figure: example graph on vertices 1–7.]

The normalized adjacency matrix is as follows:

$$M = \Delta^{-1} A = \begin{pmatrix} 0 & 0.33 & 0 & 0.33 & 0 & 0.33 & 0 \\ 0.33 & 0 & 0.33 & 0.33 & 0 & 0 & 0 \\ 0 & 0.33 & 0 & 0.33 & 0 & 0 & 0.33 \\ 0.25 & 0.25 & 0.25 & 0 & 0.25 & 0 & 0 \\ 0 & 0 & 0 & 0.33 & 0 & 0.33 & 0.33 \\ 0.33 & 0 & 0 & 0 & 0.33 & 0 & 0.33 \\ 0 & 0 & 0.33 & 0 & 0.33 & 0.33 & 0 \end{pmatrix}$$

The eigenvalues of $M$ are: $\lambda_1 = 1$, $\lambda_2 = 0.483$, $\lambda_3 = 0.206$, $\lambda_4 = -0.045$, $\lambda_5 = -0.405$, $\lambda_6 = -0.539$, $\lambda_7 = -0.7$.
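These properties can be checked in a few lines of NumPy; a minimal sketch (restating the example adjacency matrix for self-containment):

```python
import numpy as np

# Adjacency matrix of the 7-vertex example graph from the slides.
A = np.array([
    [0, 1, 0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 1],
    [1, 1, 1, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 1, 1],
    [1, 0, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 1, 0],
], dtype=float)

d = A.sum(axis=1)            # degrees: 3, 3, 3, 4, 3, 3, 3
M = A / d[:, None]           # normalized adjacency M = Δ^{-1} A

# M is not symmetric, so the general eigenvalue solver is used;
# its eigenvalues are still real because M is similar to a symmetric matrix.
eigvals = np.sort(np.linalg.eigvals(M).real)[::-1]
```

Each row of `M` sums to 1, so the all-ones vector is an eigenvector with eigenvalue 1, and all eigenvalue magnitudes are bounded by 1.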
Graph Laplacian Matrix

The Laplacian matrix of a graph is defined as

$$L = \Delta - A = \begin{pmatrix} \sum_{j=1}^n a_{1j} & 0 & \cdots & 0 \\ 0 & \sum_{j=1}^n a_{2j} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sum_{j=1}^n a_{nj} \end{pmatrix} - \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix} = \begin{pmatrix} \sum_{j \ne 1} a_{1j} & -a_{12} & \cdots & -a_{1n} \\ -a_{21} & \sum_{j \ne 2} a_{2j} & \cdots & -a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ -a_{n1} & -a_{n2} & \cdots & \sum_{j \ne n} a_{nj} \end{pmatrix}$$

$L$ is a symmetric, positive semidefinite matrix. This means that $L$ has $n$ real, non-negative eigenvalues, which can be arranged in decreasing order as follows: $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$. Because $L$ is symmetric, its eigenvectors are orthonormal. The rank of $L$ is at most $n - 1$, and the smallest eigenvalue is $\lambda_n = 0$.
Example Graph: Laplacian Matrix

[Figure: example graph on vertices 1–7.]

The graph Laplacian is given as

$$L = \Delta - A = \begin{pmatrix} 3 & -1 & 0 & -1 & 0 & -1 & 0 \\ -1 & 3 & -1 & -1 & 0 & 0 & 0 \\ 0 & -1 & 3 & -1 & 0 & 0 & -1 \\ -1 & -1 & -1 & 4 & -1 & 0 & 0 \\ 0 & 0 & 0 & -1 & 3 & -1 & -1 \\ -1 & 0 & 0 & 0 & -1 & 3 & -1 \\ 0 & 0 & -1 & 0 & -1 & -1 & 3 \end{pmatrix}$$

The eigenvalues of $L$ are as follows: $\lambda_1 = 5.618$, $\lambda_2 = 4.618$, $\lambda_3 = 4.414$, $\lambda_4 = 3.382$, $\lambda_5 = 2.382$, $\lambda_6 = 1.586$, $\lambda_7 = 0$.
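The stated properties of the Laplacian — symmetry, positive semidefiniteness, and a zero eigenvalue with the constant vector in the null space — can be verified directly; a sketch on the example graph:

```python
import numpy as np

A = np.array([
    [0, 1, 0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 1],
    [1, 1, 1, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 1, 1],
    [1, 0, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 1, 0],
], dtype=float)

L = np.diag(A.sum(axis=1)) - A      # L = Δ - A

# L is symmetric, so eigvalsh applies; eigenvalues return in ascending order.
w = np.linalg.eigvalsh(L)
# Each row of L sums to 0, so L @ 1 = 0 and the smallest eigenvalue is 0.
```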
Normalized Laplacian Matrices

The normalized symmetric Laplacian matrix of a graph is defined as

$$L^s = \Delta^{-1/2} L \, \Delta^{-1/2} = \begin{pmatrix} \frac{\sum_{j \ne 1} a_{1j}}{d_1} & -\frac{a_{12}}{\sqrt{d_1 d_2}} & \cdots & -\frac{a_{1n}}{\sqrt{d_1 d_n}} \\ -\frac{a_{21}}{\sqrt{d_2 d_1}} & \frac{\sum_{j \ne 2} a_{2j}}{d_2} & \cdots & -\frac{a_{2n}}{\sqrt{d_2 d_n}} \\ \vdots & \vdots & \ddots & \vdots \\ -\frac{a_{n1}}{\sqrt{d_n d_1}} & -\frac{a_{n2}}{\sqrt{d_n d_2}} & \cdots & \frac{\sum_{j \ne n} a_{nj}}{d_n} \end{pmatrix}$$

$L^s$ is a symmetric, positive semidefinite matrix, with rank at most $n - 1$. The smallest eigenvalue is $\lambda_n = 0$.

The normalized asymmetric Laplacian matrix is defined as

$$L^a = \Delta^{-1} L = \begin{pmatrix} \frac{\sum_{j \ne 1} a_{1j}}{d_1} & -\frac{a_{12}}{d_1} & \cdots & -\frac{a_{1n}}{d_1} \\ -\frac{a_{21}}{d_2} & \frac{\sum_{j \ne 2} a_{2j}}{d_2} & \cdots & -\frac{a_{2n}}{d_2} \\ \vdots & \vdots & \ddots & \vdots \\ -\frac{a_{n1}}{d_n} & -\frac{a_{n2}}{d_n} & \cdots & \frac{\sum_{j \ne n} a_{nj}}{d_n} \end{pmatrix}$$

$L^a$ is also a positive semidefinite matrix with $n$ real eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n = 0$.
Example Graph: Normalized Symmetric Laplacian Matrix

[Figure: example graph on vertices 1–7.]

The normalized symmetric Laplacian is given as

$$L^s = \begin{pmatrix} 1 & -0.33 & 0 & -0.29 & 0 & -0.33 & 0 \\ -0.33 & 1 & -0.33 & -0.29 & 0 & 0 & 0 \\ 0 & -0.33 & 1 & -0.29 & 0 & 0 & -0.33 \\ -0.29 & -0.29 & -0.29 & 1 & -0.29 & 0 & 0 \\ 0 & 0 & 0 & -0.29 & 1 & -0.33 & -0.33 \\ -0.33 & 0 & 0 & 0 & -0.33 & 1 & -0.33 \\ 0 & 0 & -0.33 & 0 & -0.33 & -0.33 & 1 \end{pmatrix}$$

The eigenvalues of $L^s$ are as follows: $\lambda_1 = 1.7$, $\lambda_2 = 1.539$, $\lambda_3 = 1.405$, $\lambda_4 = 1.045$, $\lambda_5 = 0.794$, $\lambda_6 = 0.517$, $\lambda_7 = 0$.
Example Graph: Normalized Asymmetric Laplacian Matrix

[Figure: example graph on vertices 1–7.]

The normalized asymmetric Laplacian matrix is given as

$$L^a = \Delta^{-1} L = \begin{pmatrix} 1 & -0.33 & 0 & -0.33 & 0 & -0.33 & 0 \\ -0.33 & 1 & -0.33 & -0.33 & 0 & 0 & 0 \\ 0 & -0.33 & 1 & -0.33 & 0 & 0 & -0.33 \\ -0.25 & -0.25 & -0.25 & 1 & -0.25 & 0 & 0 \\ 0 & 0 & 0 & -0.33 & 1 & -0.33 & -0.33 \\ -0.33 & 0 & 0 & 0 & -0.33 & 1 & -0.33 \\ 0 & 0 & -0.33 & 0 & -0.33 & -0.33 & 1 \end{pmatrix}$$

The eigenvalues of $L^a$ are identical to those for $L^s$, namely $\lambda_1 = 1.7$, $\lambda_2 = 1.539$, $\lambda_3 = 1.405$, $\lambda_4 = 1.045$, $\lambda_5 = 0.794$, $\lambda_6 = 0.517$, $\lambda_7 = 0$.
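The fact that $L^s$ and $L^a$ share the same spectrum follows from the similarity transform $L^a = \Delta^{-1/2} L^s \Delta^{1/2}$, and can be checked numerically:

```python
import numpy as np

A = np.array([
    [0, 1, 0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 1],
    [1, 1, 1, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 1, 1],
    [1, 0, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 1, 0],
], dtype=float)

d = A.sum(axis=1)
L = np.diag(d) - A

D_isqrt = np.diag(1.0 / np.sqrt(d))
L_s = D_isqrt @ L @ D_isqrt          # symmetric normalized Laplacian
L_a = np.diag(1.0 / d) @ L           # asymmetric normalized Laplacian

w_s = np.sort(np.linalg.eigvalsh(L_s))
# L_a is not symmetric, but it is similar to L_s, so its eigenvalues are real.
w_a = np.sort(np.linalg.eigvals(L_a).real)
```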
Clustering as Graph Cuts

A k-way cut in a graph is a partitioning or clustering of the vertex set, given as $C = \{C_1, \ldots, C_k\}$. We require $C$ to optimize some objective function that captures the intuition that nodes within a cluster should have high similarity, and nodes from different clusters should have low similarity.

Given a weighted graph $G$ defined by its similarity matrix $A$, let $S, T \subseteq V$ be any two subsets of the vertices. We denote by $W(S, T)$ the sum of the weights on all edges with one vertex in $S$ and the other in $T$, given as

$$W(S, T) = \sum_{v_i \in S} \sum_{v_j \in T} a_{ij}$$

Given $S \subseteq V$, we denote by $\overline{S}$ the complementary set of vertices, that is, $\overline{S} = V - S$. A (vertex) cut in a graph is defined as a partitioning of $V$ into $S \subset V$ and $\overline{S}$. The weight of the cut or cut weight is defined as the sum of all the weights on edges between vertices in $S$ and $\overline{S}$, given as $W(S, \overline{S})$.
Cuts and Matrix Operations

Given a clustering $C = \{C_1, \ldots, C_k\}$ comprising $k$ clusters, let $c_i \in \{0, 1\}^n$ be the cluster indicator vector that records the cluster membership for cluster $C_i$, defined as

$$c_{ij} = \begin{cases} 1 & \text{if } v_j \in C_i \\ 0 & \text{if } v_j \notin C_i \end{cases}$$

The cluster size can be written as

$$|C_i| = c_i^T c_i = \|c_i\|^2$$

The volume of a cluster $C_i$ is defined as the sum of all the weights on edges with one end in cluster $C_i$:

$$\mathrm{vol}(C_i) = W(C_i, V) = \sum_{v_r \in C_i} d_r = \sum_{v_r \in C_i} c_{ir} \, d_r \, c_{ir} = \sum_{r=1}^n \sum_{s=1}^n c_{ir} \, \Delta_{rs} \, c_{is} = c_i^T \Delta c_i$$

The sum of weights of all internal edges is:

$$W(C_i, C_i) = \sum_{v_r \in C_i} \sum_{v_s \in C_i} a_{rs} = \sum_{r=1}^n \sum_{s=1}^n c_{ir} \, a_{rs} \, c_{is} = c_i^T A c_i$$

We can get the sum of weights for all the external edges as follows:

$$W(C_i, \overline{C_i}) = \sum_{v_r \in C_i} \sum_{v_s \in V - C_i} a_{rs} = W(C_i, V) - W(C_i, C_i) = c_i^T (\Delta - A) c_i = c_i^T L c_i$$
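These matrix identities are easy to sanity-check on the 7-vertex example graph; the split into {v1, v2, v3, v4} versus the rest used below is just an illustrative choice:

```python
import numpy as np

A = np.array([
    [0, 1, 0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 1],
    [1, 1, 1, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 1, 1],
    [1, 0, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 1, 0],
], dtype=float)
D = np.diag(A.sum(axis=1))
L = D - A

c = np.array([1, 1, 1, 1, 0, 0, 0], dtype=float)  # indicator of C = {v1..v4}

size = c @ c            # |C| = 4
vol = c @ D @ c         # vol(C) = 3 + 3 + 3 + 4 = 13
internal = c @ A @ c    # W(C, C): each internal edge counted in both directions
cut = c @ L @ c         # W(C, C̄) = vol(C) - W(C, C)
```

Here the cut weight comes out to 3, matching the three edges (1,6), (3,7), and (4,5) that cross the split.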
Clustering Objective Functions: Ratio Cut

The clustering objective function can be formulated as an optimization problem over the k-way cut $C = \{C_1, \ldots, C_k\}$. The ratio cut objective is defined over a k-way cut as follows:

$$\min_C J_{rc}(C) = \sum_{i=1}^k \frac{W(C_i, \overline{C_i})}{|C_i|} = \sum_{i=1}^k \frac{c_i^T L c_i}{c_i^T c_i} = \sum_{i=1}^k \frac{c_i^T L c_i}{\|c_i\|^2}$$

Ratio cut tries to minimize the sum of the similarities from a cluster $C_i$ to points not in the cluster, taking into account the size of each cluster. Unfortunately, for binary cluster indicator vectors $c_i$, minimizing the ratio cut objective is NP-hard. An obvious relaxation is to allow $c_i$ to take on any real value. In this case, we can rewrite the objective as

$$\min_C J_{rc}(C) = \sum_{i=1}^k \frac{c_i^T L c_i}{\|c_i\|^2} = \sum_{i=1}^k \left(\frac{c_i}{\|c_i\|}\right)^T L \left(\frac{c_i}{\|c_i\|}\right) = \sum_{i=1}^k u_i^T L u_i$$

The optimal solution comprises the eigenvectors corresponding to the $k$ smallest eigenvalues of $L$; that is, the eigenvectors $u_n, u_{n-1}, \ldots, u_{n-k+1}$ represent the relaxed cluster indicator vectors.
Clustering Objective Functions: Normalized Cut

Normalized cut is similar to ratio cut, except that it divides the cut weight of each cluster by the volume of the cluster instead of its size. The objective function is given as

$$\min_C J_{nc}(C) = \sum_{i=1}^k \frac{W(C_i, \overline{C_i})}{\mathrm{vol}(C_i)} = \sum_{i=1}^k \frac{c_i^T L c_i}{c_i^T \Delta c_i}$$

We can obtain an optimal solution by allowing $c_i$ to be an arbitrary real vector. The optimal solution comprises the eigenvectors corresponding to the $k$ smallest eigenvalues of either the normalized symmetric or asymmetric Laplacian matrices, $L^s$ and $L^a$.
Spectral Clustering Algorithm

The spectral clustering algorithm takes a dataset $D$ as input and computes the similarity matrix $A$. For normalized cut we choose either $L^s$ or $L^a$, whereas for ratio cut we choose $L$. Next, we compute the $k$ smallest eigenvalues and corresponding eigenvectors of the chosen matrix.

The main problem is that the eigenvectors $u_i$ are not binary, and thus it is not immediately clear how we can assign points to clusters. One solution to this problem is to treat the $n \times k$ matrix of eigenvectors as a new data matrix, and to normalize its rows:

$$U = \begin{pmatrix} u_n & u_{n-1} & \cdots & u_{n-k+1} \end{pmatrix} \xrightarrow{\text{normalize rows}} Y = \begin{pmatrix} y_1^T \\ y_2^T \\ \vdots \\ y_n^T \end{pmatrix}$$

We then cluster the new points in $Y$ into $k$ clusters via the K-means algorithm or any other fast clustering method to obtain binary cluster indicator vectors $c_i$.
Spectral Clustering Algorithm

SPECTRAL CLUSTERING (D, k):
1. Compute the similarity matrix $A \in \mathbb{R}^{n \times n}$
2. if ratio cut then $B \leftarrow L$
3. else if normalized cut then $B \leftarrow L^s$ or $L^a$
4. Solve $B u_i = \lambda_i u_i$ for $i = n, \ldots, n - k + 1$, where $\lambda_n \le \lambda_{n-1} \le \cdots \le \lambda_{n-k+1}$
5. $U \leftarrow (u_n \;\; u_{n-1} \;\; \cdots \;\; u_{n-k+1})$
6. $Y \leftarrow$ normalize rows of $U$
7. $C \leftarrow \{C_1, \ldots, C_k\}$ via K-means on $Y$
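The pseudocode above can be rendered as a short ratio-cut sketch (using $L$ as the chosen matrix, with a tiny Lloyd's k-means standing in for a library call; all function names here are illustrative):

```python
import numpy as np

def spectral_clustering(A, k, n_iter=100, seed=0):
    """Ratio-cut spectral clustering sketch: embed via the k smallest
    eigenvectors of L = D - A, normalize rows, then run a small k-means."""
    d = A.sum(axis=1)
    L = np.diag(d) - A
    w, V = np.linalg.eigh(L)               # eigenvalues ascending
    U = V[:, :k]                           # k smallest eigenvectors as columns
    norms = np.linalg.norm(U, axis=1, keepdims=True)
    Y = U / np.where(norms > 0, norms, 1)  # normalize each row of U

    # Minimal Lloyd's k-means on the embedded points Y.
    rng = np.random.default_rng(seed)
    centers = Y[rng.choice(len(Y), k, replace=False)]
    for _ in range(n_iter):
        dists = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = np.argmin(dists, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Y[labels == j].mean(axis=0)
    return labels
```

On a graph made of two disconnected cliques, the embedding separates the components exactly, so k-means recovers them.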
Spectral Clustering on Example Graph

k = 2, normalized cut (normalized asymmetric Laplacian)

[Figure: the 7-vertex example graph, and a scatter plot of its vertices in the spectral embedding space $(u_1, u_2)$; vertices 5, 6, 7 fall in one cluster and vertices 1, 2, 3, 4 in the other.]
Normalized Cut on Iris Graph

k = 3, normalized asymmetric Laplacian

                  setosa   virginica   versicolor
C_1 (triangle)      50         0           4
C_2 (square)         0        36           0
C_3 (circle)         0        14          46
Maximization Objectives: Average Cut

The average weight objective is defined as

$$\max_C J_{aw}(C) = \sum_{i=1}^k \frac{W(C_i, C_i)}{|C_i|} = \sum_{i=1}^k \frac{c_i^T A c_i}{c_i^T c_i} = \sum_{i=1}^k u_i^T A u_i$$

where $u_i$ is an arbitrary real vector, which is a relaxation of the binary cluster indicator vectors $c_i$. We can maximize the objective by selecting the $k$ largest eigenvalues of $A$, and the corresponding eigenvectors:

$$\max_C J_{aw}(C) = u_1^T A u_1 + \cdots + u_k^T A u_k = \lambda_1 + \cdots + \lambda_k$$

where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$. In general, while $A$ is symmetric, it may not be positive semidefinite. This means that $A$ can have negative eigenvalues, and to maximize the objective we must consider only the positive eigenvalues and the corresponding eigenvectors.
Maximization Objectives: Modularity

Given $A$, the weighted adjacency matrix, the modularity of a clustering is the difference between the observed and expected fraction of weights on edges within the clusters. The clustering objective is given as

$$\max_C J_Q(C) = \sum_{i=1}^k \left( \frac{c_i^T A c_i}{\mathrm{tr}(\Delta)} - \frac{(d^T c_i)^2}{\mathrm{tr}(\Delta)^2} \right) = \sum_{i=1}^k c_i^T Q c_i$$

where $Q$ is the modularity matrix:

$$Q = \frac{1}{\mathrm{tr}(\Delta)} \left( A - \frac{d \, d^T}{\mathrm{tr}(\Delta)} \right)$$

The optimal solution comprises the eigenvectors corresponding to the $k$ largest eigenvalues of $Q$. Since $Q$ is symmetric, but not positive semidefinite, we use only the positive eigenvalues.
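A quick construction of $Q$ for the example graph (NumPy sketch); note that $Q$ annihilates the all-ones vector, so the trivial "everything in one cluster" direction carries no modularity:

```python
import numpy as np

A = np.array([
    [0, 1, 0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 1],
    [1, 1, 1, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 1, 1],
    [1, 0, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 1, 0],
], dtype=float)
d = A.sum(axis=1)                      # degree vector
tr = d.sum()                           # tr(Δ): total weight, each edge twice

Q = (A - np.outer(d, d) / tr) / tr     # modularity matrix

# Q is symmetric with Q @ 1 = 0; the useful directions are the
# eigenvectors with positive eigenvalues.
w = np.linalg.eigvalsh(Q)
```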
Markov Chain Clustering

A Markov chain is a discrete-time stochastic process over a set of states, in our case the set of vertices $V$. The Markov chain makes a transition from one node to another at discrete timesteps $t = 1, 2, \ldots$, with the probability of making a transition from node $i$ to node $j$ given as $m_{ij}$.

Let the random variable $X_t$ denote the state at time $t$. The Markov property means that the distribution of $X_t$ depends only on the state at time $t - 1$, that is,

$$P(X_t = i \mid X_0, X_1, \ldots, X_{t-1}) = P(X_t = i \mid X_{t-1})$$

Further, we assume that the Markov chain is homogeneous, that is, the transition probability

$$P(X_t = j \mid X_{t-1} = i) = m_{ij}$$

is independent of the time step $t$.
Markov Chain Clustering: Markov Matrix

The normalized adjacency matrix $M = \Delta^{-1} A$ can be interpreted as the $n \times n$ transition matrix whose entry $m_{ij} = \frac{a_{ij}}{d_i}$ is the probability of transitioning or jumping from node $i$ to node $j$ in the graph $G$.

The matrix $M$ is thus the transition matrix for a Markov chain, or a Markov random walk, on graph $G$. That is, given node $i$, the transition matrix $M$ specifies the probabilities of reaching any other node $j$ in one time step. In general, the transition probability matrix over $t$ time steps is given as

$$M^{t-1} M = M^t$$
Markov Chain Clustering: Random Walk

A random walk on $G$ thus corresponds to taking successive powers of the transition matrix $M$. Let $\pi_0$ specify the initial state probability vector at time $t = 0$. The state probability vector after $t$ steps is

$$\pi_t^T = \pi_{t-1}^T M = \pi_{t-2}^T M^2 = \cdots = \pi_0^T M^t$$

Equivalently, taking the transpose on both sides, we get

$$\pi_t = (M^t)^T \pi_0 = (M^T)^t \pi_0$$

The state probability vector thus converges to the dominant eigenvector of $M^T$.
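Numerically, repeated application of $M^T$ drives any starting distribution toward the dominant eigenvector. For a connected, non-bipartite graph — the example graph qualifies, since it contains triangles — this limit is the degree-proportional stationary distribution $\pi_i = d_i / \sum_j d_j$:

```python
import numpy as np

A = np.array([
    [0, 1, 0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 1],
    [1, 1, 1, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 1, 1],
    [1, 0, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 1, 0],
], dtype=float)
M = A / A.sum(axis=1, keepdims=True)   # row-stochastic transition matrix

pi = np.full(7, 1.0 / 7)               # uniform starting distribution
for _ in range(200):
    pi = M.T @ pi                      # π_t = M^T π_{t-1}

stationary = A.sum(axis=1) / A.sum()   # π_i ∝ d_i for an undirected walk
```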
Markov Clustering Algorithm

Consider a variation of the random walk, where the probability of transitioning from node $i$ to $j$ is inflated by taking each element $m_{ij}$ to the power $r \ge 1$. Given a transition matrix $M$, define the inflation operator $\Upsilon$ as follows:

$$\Upsilon(M, r) = \left\{ \frac{(m_{ij})^r}{\sum_{a=1}^n (m_{ia})^r} \right\}_{i,j=1}^n$$

The net effect of the inflation operator is to increase the higher probability transitions and decrease the lower probability transitions.

The Markov clustering algorithm (MCL) is an iterative method that interleaves matrix expansion and inflation steps. Matrix expansion corresponds to taking successive powers of the transition matrix, leading to random walks of longer lengths. Matrix inflation, on the other hand, makes the higher probability transitions even more likely and reduces the lower probability transitions.

MCL takes as input the inflation parameter $r \ge 1$. Higher values lead to more, smaller clusters, whereas smaller values lead to fewer, but larger clusters.
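The inflation operator is a one-liner in NumPy; a small sketch showing how $r = 2$ sharpens each row (the 3-state matrix is a made-up example):

```python
import numpy as np

def inflate(M, r):
    """Υ(M, r): raise each entry to the power r, then re-normalize every row."""
    P = M ** r
    return P / P.sum(axis=1, keepdims=True)

M = np.array([[0.5, 0.3, 0.2],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])
M2 = inflate(M, 2.0)
# Rows of M2 still sum to 1, but the dominant entry in each row has grown:
# row 0 goes from (0.5, 0.3, 0.2) to roughly (0.66, 0.24, 0.11).
```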
Markov Clustering Algorithm: MCL

The final clusters are found by enumerating the weakly connected components in the directed graph $G_t$ induced by the converged transition matrix $M_t$, where the edges are defined as:

$$E = \{ (i, j) \mid M_t(i, j) > 0 \}$$

A directed edge $(i, j)$ exists only if node $i$ can transition to node $j$ within $t$ steps of the expansion and inflation process. A node $j$ is called an attractor if $M_t(j, j) > 0$, and we say that node $i$ is attracted to attractor $j$ if $M_t(i, j) > 0$. The MCL process yields a set of attractor nodes $V_a \subseteq V$, such that other nodes are attracted to at least one attractor in $V_a$.

To extract the clusters from $G_t$, MCL first finds the strongly connected components $S_1, S_2, \ldots, S_q$ over the set of attractors $V_a$. Next, for each strongly connected set of attractors $S_j$, MCL finds the weakly connected components consisting of all nodes $i \in V - V_a$ attracted to an attractor in $S_j$. If a node $i$ is attracted to multiple strongly connected components, it is added to each such cluster, resulting in possibly overlapping clusters.
Algorithm MARKOV CLUSTERING

MARKOV CLUSTERING (A, r, ε):
1. $t \leftarrow 0$
2. Add self-edges to $A$ if they do not exist
3. $M_t \leftarrow \Delta^{-1} A$
4. repeat
5.   $t \leftarrow t + 1$
6.   $M_t \leftarrow M_{t-1} \cdot M_{t-1}$
7.   $M_t \leftarrow \Upsilon(M_t, r)$
8. until $\|M_t - M_{t-1}\|_F \le \varepsilon$
9. $G_t \leftarrow$ directed graph induced by $M_t$
10. $C \leftarrow$ {weakly connected components in $G_t$}
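A compact NumPy rendering of the algorithm above; the positivity threshold used to induce $G_t$ (treating tiny nonzero floats as zero) and the in-line component search are pragmatic choices of this sketch:

```python
import numpy as np

def mcl(A, r=2.5, eps=1e-6, max_iter=100):
    """Markov clustering sketch: alternate expansion (M @ M) and inflation
    (entrywise power r followed by row normalization) until convergence,
    then read clusters off as weakly connected components."""
    A = A + np.eye(len(A))                 # add self-edges (step 2)
    M = A / A.sum(axis=1, keepdims=True)   # M = Δ^{-1} A (step 3)
    for _ in range(max_iter):
        M_prev = M
        M = M @ M                          # expansion (step 6)
        M = M ** r                         # inflation (step 7) ...
        M = M / M.sum(axis=1, keepdims=True)
        if np.linalg.norm(M - M_prev) <= eps:   # Frobenius norm (step 8)
            break
    # Weakly connected components of the graph induced by M > 0;
    # a small threshold stands in for "> 0" to ignore float residue.
    adj = (M > 1e-8) | (M > 1e-8).T        # symmetrize for weak connectivity
    labels = -np.ones(len(M), dtype=int)
    for i in range(len(M)):
        if labels[i] == -1:
            stack, labels[i] = [i], i
            while stack:                   # depth-first component sweep
                u = stack.pop()
                for v in np.flatnonzero(adj[u]):
                    if labels[v] == -1:
                        labels[v] = labels[i]
                        stack.append(v)
    return labels
```

On a graph of two disconnected triangles, the transition matrix stays block diagonal throughout, so MCL returns the two components as clusters.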
MCL Attractors and Clusters

r = 2.5

[Figure: the 7-vertex example graph (top) and the converged MCL transition graph (bottom), with converged edge weights of 1 and 0.5 pointing into the attractor nodes.]
MCL on Iris Graph

[Figure: MCL clusterings of the Iris similarity graph for (a) r = 1.3 and (b) r = 2.]
Contingency Table: MCL Clusters versus Iris Types

r = 1.3

                  iris-setosa   iris-virginica   iris-versicolor
C_1 (triangle)        50              0                 1
C_2 (square)           0             36                 0
C_3 (circle)           0             14                49