Spectral Clustering
Spectral Clustering

Seungjin Choi
Department of Computer Science, POSTECH, Korea

Spectral Methods

What is spectral clustering?
- Spectral methods are methods that use eigenvectors of some matrices; they involve an eigen-decomposition (also called spectral decomposition).
- Spectral clustering methods are algorithms that cluster data points using eigenvectors of matrices derived from the data. They are closely related to spectral graph partitioning.
- Pairwise (similarity-based) clustering: standard statistical clustering methods assume a probabilistic model that generates the observed data points, whereas pairwise clustering methods define a similarity function between pairs of data points and then formulate a criterion that the clustering must optimize.

Spectral Clustering Algorithm: Bipartitioning (Two Moons Data)
1. Construct the affinity matrix
      W_ij = exp(-β ||v_i - v_j||²) if i ≠ j, and W_ii = 0.
2. Calculate the graph Laplacian L = D - W, where D = diag(d_1, ..., d_n) and d_i = Σ_j W_ij.
3. Compute the eigenvector of the second smallest eigenvalue of L, u = [u_1, ..., u_n]^T, known as the Fiedler vector.
4. Partition the entries u_i by a pre-specified threshold and assign each data point v_i to the corresponding cluster.
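The four steps above can be sketched in Python. This is an illustrative sketch, not from the slides: two well-separated Gaussian blobs stand in for the two-moons set, and β = 1 and a threshold of 0 are arbitrary choices.

```python
# Bipartitioning via the Fiedler vector (illustrative sketch; the blob data,
# beta = 1, and the zero threshold are assumptions, not from the slides).
import numpy as np

rng = np.random.default_rng(0)
# Toy data: two tight clusters standing in for the two-moons example.
V = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
               rng.normal(3.0, 0.1, (20, 2))])

beta = 1.0
# 1. Affinity matrix W_ij = exp(-beta * ||v_i - v_j||^2), zero diagonal.
sq = ((V[:, None, :] - V[None, :, :]) ** 2).sum(-1)
W = np.exp(-beta * sq)
np.fill_diagonal(W, 0.0)

# 2. Unnormalized graph Laplacian L = D - W.
D = np.diag(W.sum(axis=1))
L = D - W

# 3. Fiedler vector: eigenvector of the second smallest eigenvalue.
vals, vecs = np.linalg.eigh(L)
u = vecs[:, 1]

# 4. Threshold (here at 0) to assign data points to the two clusters.
labels = (u > 0).astype(int)
```

Thresholding the Fiedler vector recovers the two blobs here; the two-moons case is where this approach pays off over plain k-means, since the clusters are not linearly separable.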
Two Moons Data
[Figures: the two-moons data clustered by k-means, the Fiedler vector, and the spectral clustering result.]

Graphs
Consider a connected graph G(V, E), where V = {v_1, ..., v_n} is the set of vertices and E the set of edges, with pairwise similarity values assigned as edge weights.
- Adjacency matrix (also called similarity, proximity, or affinity matrix): W = [W_ij] ∈ R^{n×n}.
- Degree of a node: d_i = Σ_j W_ij.
- Volume of a subset: vol(S_1) = d_{S_1} = Σ_{i∈S_1} d_i.
Neighborhood Graphs
The Gaussian similarity function is given by
   w(v_i, v_j) = W_ij = exp(-||v_i - v_j||² / (2σ²)).
Common constructions are the ε-neighborhood graph and the k-nearest-neighbor graph.

Graph Laplacian
The (unnormalized) graph Laplacian is defined as L = D - W.
1. For every vector x ∈ R^n,
      x^T L x = (1/2) Σ_i Σ_j W_ij (x_i - x_j)² ≥ 0,
   so L is positive semidefinite. To see this, write
      x^T L x = x^T D x - x^T W x
              = Σ_i d_i x_i² - Σ_i Σ_j W_ij x_i x_j
              = (1/2) ( Σ_i d_i x_i² - 2 Σ_i Σ_j W_ij x_i x_j + Σ_j d_j x_j² )
              = (1/2) Σ_i Σ_j W_ij (x_i - x_j)².
2. The smallest eigenvalue of L is 0 and the corresponding eigenvector is 1 = [1, ..., 1]^T, since D1 = W1, i.e., L1 = 0.
3. L has n nonnegative eigenvalues, 0 = λ_1 ≤ λ_2 ≤ ... ≤ λ_n.

Normalized Graph Laplacian
Two different normalization methods are popular:
- Symmetric normalization: L_s = D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2}.
- Normalization related to random walks: L_rw = D^{-1} L = I - D^{-1} W.
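These properties are easy to verify numerically. The following sketch (illustrative; the small random symmetric W is an arbitrary stand-in for a similarity matrix) checks the quadratic form, the nullvector, and the relation between the spectra of L_s and L_rw:

```python
# Numerical check of the Laplacian properties above on a small random graph
# (illustrative sketch; symmetric nonnegative W with zero diagonal assumed).
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((6, 6))
W = (A + A.T) / 2          # symmetric nonnegative weights
np.fill_diagonal(W, 0.0)

d = W.sum(axis=1)
D = np.diag(d)
L = D - W

# Property 1: x^T L x = (1/2) sum_ij W_ij (x_i - x_j)^2 >= 0.
x = rng.standard_normal(6)
quad = 0.5 * (W * (x[:, None] - x[None, :]) ** 2).sum()
assert np.isclose(x @ L @ x, quad) and quad >= 0

# Property 2: L 1 = 0, so 0 is an eigenvalue with eigenvector 1.
assert np.allclose(L @ np.ones(6), 0.0)

# The two popular normalizations.
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L_s = D_inv_sqrt @ L @ D_inv_sqrt        # symmetric normalization
L_rw = np.diag(1.0 / d) @ L              # random-walk normalization

# L_s and L_rw share the same (real, nonnegative) eigenvalues.
assert np.allclose(np.sort(np.linalg.eigvals(L_rw).real),
                   np.sort(np.linalg.eigvalsh(L_s)))
```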
Properties:
1. For every vector x ∈ R^n,
      x^T L_s x = (1/2) Σ_i Σ_j W_ij ( x_i/√d_i - x_j/√d_j )².
2. L_s and L_rw are positive semidefinite and have n nonnegative real eigenvalues, 0 = λ_1 ≤ ... ≤ λ_n.
3. λ is an eigenvalue of L_rw with eigenvector u if and only if λ is an eigenvalue of L_s with eigenvector D^{1/2} u.
4. λ is an eigenvalue of L_rw with eigenvector u if and only if λ and u solve the generalized eigenvalue problem L u = λ D u.
5. 0 is an eigenvalue of L_rw with the constant vector 1 as eigenvector; 0 is an eigenvalue of L_s with eigenvector D^{1/2} 1.

Unnormalized Spectral Clustering
1. Construct a neighborhood graph with corresponding adjacency matrix W.
2. Compute the unnormalized graph Laplacian L = D - W.
3. Find the eigenvectors of the k smallest eigenvalues of L and form the matrix U = [u_1 ... u_k] ∈ R^{n×k}.
4. Treating each row of U as a point in R^k, cluster the rows into k groups using the k-means algorithm.
5. Assign v_i to cluster j if and only if row i of U is assigned to cluster j.

Normalized Spectral Clustering: Shi-Malik
1. Construct a neighborhood graph with corresponding adjacency matrix W.
2. Compute the unnormalized graph Laplacian L = D - W.
3. Find the k smallest generalized eigenvectors u_1, ..., u_k of the problem L u = λ D u and form the matrix U = [u_1 ... u_k] ∈ R^{n×k}.
4. Treating each row of U as a point in R^k, cluster the rows into k groups using the k-means algorithm.
5. Assign v_i to cluster j if and only if row i of U is assigned to cluster j.

Normalized Spectral Clustering: Ng-Jordan-Weiss
1. Construct a neighborhood graph with corresponding adjacency matrix W.
2. Compute the normalized graph Laplacian L_s = D^{-1/2} L D^{-1/2}.
3. Find the k smallest eigenvectors u_1, ..., u_k of L_s and form the matrix U = [u_1 ... u_k] ∈ R^{n×k}.
4. Form the matrix Ũ from U by re-normalizing each row of U to unit norm, i.e., Ũ_ij = U_ij / (Σ_k U_ik²)^{1/2}.
5. Treating each row of Ũ as a point in R^k, cluster the rows into k groups using the k-means algorithm.
6. Assign v_i to cluster j if and only if row i of Ũ is assigned to cluster j.
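The Ng-Jordan-Weiss procedure above can be sketched compactly in Python. This is an illustrative sketch, not the authors' code: the three Gaussian blobs, the bandwidth σ = 0.5, and scipy's kmeans2 are assumptions.

```python
# Minimal Ng-Jordan-Weiss sketch (illustrative; blob data, sigma = 0.5,
# and scipy's k-means are stand-in choices, not from the slides).
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(2)
# Three well-separated blobs, k = 3 clusters.
X = np.vstack([rng.normal(c, 0.1, (15, 2)) for c in (0.0, 4.0, 8.0)])
k = 3

# Steps 1-2: Gaussian affinity and L_s = D^{-1/2} L D^{-1/2}.
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-sq / (2 * 0.5 ** 2))
np.fill_diagonal(W, 0.0)
d = W.sum(axis=1)
Dih = np.diag(1.0 / np.sqrt(d))
L_s = np.eye(len(X)) - Dih @ W @ Dih

# Step 3: eigenvectors of the k smallest eigenvalues of L_s.
_, vecs = np.linalg.eigh(L_s)
U = vecs[:, :k]

# Step 4: re-normalize each row of U to unit norm.
U_tilde = U / np.linalg.norm(U, axis=1, keepdims=True)

# Steps 5-6: k-means on the rows gives the cluster of each data point v_i.
_, labels = kmeans2(U_tilde, k, seed=3, minit="++")
```

On near-block-diagonal affinities, the rows of Ũ for each cluster collapse onto (nearly) orthogonal points on the unit sphere, which is why k-means on them separates the clusters easily.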
Where does this spectral clustering algorithm come from?
- Spectral graph partitioning
- Properties of block (diagonal) matrices
- Markov random walks

Graph Partitioning: Bipartitioning
Consider a connected graph G(V, E), where V = {v_1, ..., v_n} is the set of vertices and E the set of edges, with pairwise similarity values assigned as edge weights. Graph bipartitioning takes the set V apart into two coherent groups, S_1 and S_2, satisfying V = S_1 ∪ S_2 (|V| = n) and S_1 ∩ S_2 = ∅, by simply cutting the edges connecting the two parts. The total weight of the cut edges is cut(S_1, S_2), and the weight attached to each side is its volume, vol(S_1) and vol(S_2). As before:
- Adjacency matrix (similarity, proximity, affinity matrix): W = [W_ij] ∈ R^{n×n}.
- Degree of a node: d_i = Σ_j W_ij.
- Volume: vol(S_1) = d_{S_1} = Σ_{i∈S_1} d_i.
Graph Partitioning
Given G = (V, E), the task is to find k disjoint sets S_1, ..., S_k, where S_i ∩ S_j = ∅ for i ≠ j and S_1 ∪ ... ∪ S_k = V, such that a certain cut criterion is minimized:
1. Bipartitioning: cut(S_1, S_2) = Σ_{i∈S_1} Σ_{j∈S_2} W_ij.
2. Multiway partitioning: cut(S_1, ..., S_k) = Σ_{i=1}^k cut(S_i, S̄_i).
3. Ratio cut: Rcut(S_1, ..., S_k) = Σ_{i=1}^k cut(S_i, S̄_i) / |S_i|.
4. Normalized cut: Ncut(S_1, ..., S_k) = Σ_{i=1}^k cut(S_i, S̄_i) / vol(S_i).

Cut: Bipartitioning
The degree of dissimilarity between S_1 and S_2 can be computed as the total weight of the removed edges:
   Cut(S_1, S_2) = Σ_{i∈S_1} Σ_{j∈S_2} W_ij
                 = (1/2) { Σ_{i∈S_1} d_i + Σ_{j∈S_2} d_j - Σ_{i∈S_1} Σ_{j∈S_1} W_ij - Σ_{i∈S_2} Σ_{j∈S_2} W_ij }
                 = (1/4) (q_1 - q_2)^T L (q_1 - q_2),
where q_j = [q_{1j}, ..., q_{nj}]^T ∈ R^n is the indicator vector representing partition j:
   q_ij = 1 if i ∈ S_j, and q_ij = 0 if i ∉ S_j, for i = 1, ..., n and j = 1, 2.
Note that q_1 and q_2 are orthogonal, i.e., q_1^T q_2 = 0.

Introducing the bipolar indicator vector x = q_1 - q_2 ∈ {+1, -1}^n, the cut criterion simplifies to
   Cut(S_1, S_2) = (1/4) x^T L x.
The balanced cut involves the combinatorial optimization problem
   arg min_x x^T L x   subject to 1^T x = 0, x ∈ {+1, -1}^n.
Dropping the integer constraints (spectral relaxation) leads to a symmetric eigenvalue problem: the eigenvector of the second smallest eigenvalue of L is the solution, since the smallest eigenvalue of L is 0 and its associated eigenvector is 1. This second smallest eigenvector is known as the Fiedler vector.

Rcut and Unnormalized Spectral Clustering: k = 2
Define the indicator vector x = [x_1, ..., x_n]^T with entries
   x_i = √(|S̄|/|S|) if v_i ∈ S,   x_i = -√(|S|/|S̄|) if v_i ∈ S̄.
Then one can easily check that
   x^T L x = |V| · Rcut(S, S̄),   x^T 1 = 0,   ||x||² = n.
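The identity Cut(S_1, S_2) = (1/4) x^T L x can be checked numerically. This is an illustrative sketch on a small random graph; the split into the first and last four vertices is an arbitrary assumption.

```python
# Check Cut(S1, S2) = (1/4) x^T L x for a bipolar indicator x
# (illustrative sketch; random weights and an arbitrary bipartition).
import numpy as np

rng = np.random.default_rng(4)
A = rng.random((8, 8))
W = (A + A.T) / 2          # symmetric nonnegative weights
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W

# Arbitrary bipartition S1 = {0..3}, S2 = {4..7}; x = q1 - q2 in {+1,-1}^n.
x = np.array([1, 1, 1, 1, -1, -1, -1, -1], dtype=float)

cut = W[:4, 4:].sum()      # total weight of the removed edges
assert np.isclose(cut, 0.25 * x @ L @ x)
```

Each edge crossing the cut contributes W_ij (x_i - x_j)²/2 = 2 W_ij per ordered pair to x^T L x, hence the factor 1/4.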
   arg min_{S⊂V} Rcut(S, S̄)  ⇔  arg min_x x^T L x
subject to x^T 1 = 0, x defined as on the previous slide, ||x||² = n. The relaxation, discarding the condition on the discrete values of x and instead allowing x ∈ R^n, leads to
   arg min_{x∈R^n} x^T L x   subject to x^T 1 = 0 and ||x||² = n.
The solution is the eigenvector associated with the second smallest eigenvalue of L; rounding this eigenvector gives an approximate solution to the ratio cut problem.

Rcut and Unnormalized Spectral Clustering: k > 2
Define the indicator matrix X = [x_1 ... x_k] ∈ R^{n×k}, where x_j = [x_{1,j}, ..., x_{n,j}]^T ∈ R^n has entries
   x_{i,j} = 1/√|S_j| if v_i ∈ S_j,   x_{i,j} = 0 if v_i ∉ S_j.
Then we have
   x_j^T L x_j = cut(S_j, S̄_j) / |S_j|,   X^T X = I,
and therefore
   Rcut(S_1, ..., S_k) = Σ_{j=1}^k x_j^T L x_j = tr{X^T L X}.
Hence
   arg min_{S_1,...,S_k} Rcut(S_1, ..., S_k)  ⇔  arg min tr{X^T L X}
subject to X^T X = I, with X of the indicator form above. The relaxed problem becomes
   arg min_{X∈R^{n×k}} tr{X^T L X}   subject to X^T X = I.
We determine the k smallest eigenvectors of L to form U and run the k-means algorithm on the rows of U.

Ncut and Normalized Spectral Clustering: k = 2
Define the indicator vector x = [x_1, ..., x_n]^T with entries
   x_i = √(vol(S̄)/vol(S)) if v_i ∈ S,   x_i = -√(vol(S)/vol(S̄)) if v_i ∈ S̄.
Then one can easily check that
   x^T L x = vol(V) · Ncut(S, S̄),   (Dx)^T 1 = 0,   x^T D x = vol(V).
   arg min_{S⊂V} Ncut(S, S̄)  ⇔  arg min_x x^T L x
subject to (Dx)^T 1 = 0, x defined as on the previous slide, x^T D x = vol(V). Relaxing the problem gives
   arg min_{x∈R^n} x^T L x   subject to (Dx)^T 1 = 0 and x^T D x = vol(V).
Define y = D^{1/2} x; then the problem becomes
   arg min_{y∈R^n} y^T D^{-1/2} L D^{-1/2} y = y^T L_s y   subject to y^T D^{1/2} 1 = 0 and ||y||² = vol(V).
The solution y is given by the eigenvector of the second smallest eigenvalue of L_s, implying that x is the second smallest eigenvector of L_rw, or equivalently the second smallest generalized eigenvector of L u = λ D u.

Ncut and Normalized Spectral Clustering: k > 2
Define the indicator matrix X = [x_1 ... x_k] ∈ R^{n×k}, where x_j = [x_{1,j}, ..., x_{n,j}]^T ∈ R^n has entries
   x_{i,j} = 1/√vol(S_j) if v_i ∈ S_j,   x_{i,j} = 0 if v_i ∉ S_j.
Then we have
   x_j^T L x_j = cut(S_j, S̄_j) / vol(S_j),   x_j^T D x_j = 1,   X^T D X = I,
and therefore
   Ncut(S_1, ..., S_k) = Σ_{j=1}^k x_j^T L x_j = tr{X^T L X}.
Hence
   arg min_{S_1,...,S_k} Ncut(S_1, ..., S_k)  ⇔  arg min tr{X^T L X}
subject to X^T D X = I, with X of the indicator form above. Substituting Y = D^{1/2} X, the relaxed problem becomes
   arg min_{Y∈R^{n×k}} tr{Y^T D^{-1/2} L D^{-1/2} Y}   subject to Y^T Y = I.
We determine the k smallest eigenvectors of L_s to form U and run the k-means algorithm on the rows of U.

Markov Random Walk View of Normalized Cut (Meila and Shi, 2001)
- Probabilistic interpretation of the normalized cut.
- Data points are clustered on the basis of the eigenvectors of the resulting transition probability matrix (constructed from the edge weights).
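The equivalence used above, that the relaxed Ncut solution can be read off L_s, L_rw, or the generalized problem L u = λ D u, can be verified numerically (illustrative sketch on a small random graph):

```python
# Check that the generalized problem L u = lam D u and the eigenproblem of
# L_s = D^{-1/2} L D^{-1/2} share eigenvalues, with eigenvectors related by
# u = D^{-1/2} y (illustrative sketch on a random symmetric graph).
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(5)
A = rng.random((7, 7))
W = (A + A.T) / 2
np.fill_diagonal(W, 0.0)
d = W.sum(axis=1)
D = np.diag(d)
L = D - W

Dih = np.diag(d ** -0.5)
L_s = Dih @ L @ Dih

gen_vals, gen_vecs = eigh(L, D)          # generalized: L u = lam D u
sym_vals, sym_vecs = eigh(L_s)           # ordinary:    L_s y = lam y

# Same spectrum either way.
assert np.allclose(gen_vals, sym_vals)

# Second smallest eigenvectors agree (up to sign) after the D^{1/2} change
# of variables, i.e. u is parallel to D^{-1/2} y.
u = gen_vecs[:, 1] / np.linalg.norm(gen_vecs[:, 1])
y = Dih @ sym_vecs[:, 1]
y = y / np.linalg.norm(y)
assert np.isclose(abs(u @ y), 1.0)
```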
Transition Probability Matrix
We define a Markov random walk over the graph by constructing a transition probability matrix from the edge weights:
   P_ij = W_ij / Σ_j W_ij,
so that Σ_j P_ij = 1 for all i. The random walk proceeds by successively selecting the next point according to P_ij, where i is the current location.

Ncut via Transition Probabilities
Define P(S̄ | S) = P(X_{t+1} ∈ S̄ | X_t ∈ S), where X_t is the state of the Markov random walk at time t. The Ncut and the Markov random walk model have the following relation:
   Ncut(S, S̄) = P(S̄ | S) + P(S | S̄).
The minimization of Ncut therefore seeks a cut through the graph such that a random walk seldom transitions from S to S̄ or vice versa. If the graph is connected and non-bipartite (an ergodic Markov chain), the random walk possesses a unique stationary distribution π = [π_1, ..., π_n]^T with π^T P = π^T, given by π_i = d_i / vol(G).

Random Walk: Properties
Starting from i_0, the distribution over the point i_t reached after t steps is
   step 1: P_{i_0 i_1},
   step 2: Σ_{i_1} P_{i_0 i_1} P_{i_1 i_2} = [P²]_{i_0 i_2},
   step 3: Σ_{i_1} Σ_{i_2} P_{i_0 i_1} P_{i_1 i_2} P_{i_2 i_3} = [P³]_{i_0 i_3},
and in general [P^t]_{i_0 i_t}, where P^t = P P ... P and [·]_{ij} denotes the (i, j) entry of the matrix. This distribution converges as t increases.

Stochastic Matrix
The stochastic matrix P is defined by P = D^{-1} W. To find out how P^t behaves for large t, it is useful to examine the eigen-decomposition of the symmetric matrix
   D^{-1/2} W D^{-1/2} = λ_1 u_1 u_1^T + λ_2 u_2 u_2^T + ... + λ_n u_n u_n^T.
This symmetric matrix is related to P since
   (D^{-1/2} W D^{-1/2}) (D^{-1/2} W D^{-1/2}) = D^{1/2} (P P) D^{-1/2}.
This allows us to write the t-step transition probability matrix in terms of the eigenvalues/eigenvectors of the symmetric matrix:
   P^t = D^{-1/2} (D^{-1/2} W D^{-1/2})^t D^{1/2}
       = D^{-1/2} (λ_1^t u_1 u_1^T + λ_2^t u_2 u_2^T + ... + λ_n^t u_n u_n^T) D^{1/2},
where λ_1 = 1 and P^∞ = D^{-1/2} (u_1 u_1^T) D^{1/2}.
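The random-walk facts above can be verified numerically. This is an illustrative sketch; a dense random W is used because it yields a connected, non-bipartite graph almost surely.

```python
# Verify: rows of P = D^{-1} W sum to one, pi_i = d_i / vol(G) is stationary,
# and P^t converges to the rank-one limit 1 pi^T (illustrative sketch).
import numpy as np

rng = np.random.default_rng(6)
A = rng.random((6, 6))
W = (A + A.T) / 2
np.fill_diagonal(W, 0.0)
d = W.sum(axis=1)

P = W / d[:, None]                       # transition probabilities
assert np.allclose(P.sum(axis=1), 1.0)   # stochastic matrix

pi = d / d.sum()                         # pi_i = d_i / vol(G)
assert np.allclose(pi @ P, pi)           # stationarity: pi^T P = pi^T

# For a connected, non-bipartite graph, P^t -> 1 pi^T as t grows,
# since |lambda_2| < 1 and the correction terms decay as lambda_i^t.
Pt = np.linalg.matrix_power(P, 200)
assert np.allclose(Pt, np.outer(np.ones(6), pi), atol=1e-8)
```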
Spectral Clustering: Stochastic Matrix
We are interested in the largest correction to the asymptotic limit:
   P^t ≈ P^∞ + D^{-1/2} (λ_2^t u_2 u_2^T) D^{1/2}.
Note that [u_2 u_2^T]_{ij} = u_{i2} u_{j2}, so the largest correction term increases the probability of transitions between points whose entries of u_2 share the same sign and decreases transitions across points with different signs.

Equivalence
Proposition 1. If (λ, x) solves P x = λ x with P = D^{-1} W, then (1 - λ, x) solves the generalized problem, i.e., L x = (1 - λ) D x.
This proposition shows the equivalence between the spectral clustering formulated by the normalized cut and the eigenvalues/eigenvectors of the stochastic matrix P. The largest eigenvector of P is 1, which contains no information; the second smallest eigenvector in the normalized cut corresponds to the second largest eigenvector of the stochastic matrix.
Binary spectral clustering: divide the points into clusters based on the sign of the elements of u_2: if u_{j2} > 0, assign cluster 1; otherwise, cluster 2.

Suggested Further Readings
1. P. K. Chan, M. D. F. Schlag, and J. Y. Zien, "Spectral k-way ratio-cut partitioning and clustering," IEEE Trans. CAD of Integrated Circuits and Systems.
2. J. Shi and J. Malik, "Normalized cuts and image segmentation," CVPR.
3. Y. Weiss, "Segmentation using eigenvectors: A unifying view," ICCV.
4. M. Meila and J. Shi, "A random walks view of spectral segmentation," AISTATS.
5. C. Ding et al., "A min-max cut algorithm for graph partitioning and data clustering," ICDM.
6. A. Ng, M. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," NIPS.
7. U. von Luxburg, "A tutorial on spectral clustering," MPI for Biological Cybernetics, TR-149.
Lecture 10: Dimension Reduction Techniques Radu Balan Department of Mathematics, AMSC, CSCAMM and NWC University of Maryland, College Park, MD April 17, 2018 Input Data It is assumed that there is a set
More informationConsistency of Spectral Clustering
Max Planck Institut für biologische Kybernetik Max Planck Institute for Biological Cybernetics Technical Report No. TR-34 Consistency of Spectral Clustering Ulrike von Luxburg, Mikhail Belkin 2, Olivier
More informationA lower bound for the Laplacian eigenvalues of a graph proof of a conjecture by Guo
A lower bound for the Laplacian eigenvalues of a graph proof of a conjecture by Guo A. E. Brouwer & W. H. Haemers 2008-02-28 Abstract We show that if µ j is the j-th largest Laplacian eigenvalue, and d
More informationLimits of Spectral Clustering
Limits of Spectral Clustering Ulrike von Luxburg and Olivier Bousquet Max Planck Institute for Biological Cybernetics Spemannstr. 38, 72076 Tübingen, Germany {ulrike.luxburg,olivier.bousquet}@tuebingen.mpg.de
More informationClustering and Embedding using Commute Times
Clustering and Embedding using Commute Times Huaijun Qiu and Edwin R. Hancock Abstract This paper exploits the properties of the commute time between nodes of a graph for the purposes of clustering and
More informationLecture 12 : Graph Laplacians and Cheeger s Inequality
CPS290: Algorithmic Foundations of Data Science March 7, 2017 Lecture 12 : Graph Laplacians and Cheeger s Inequality Lecturer: Kamesh Munagala Scribe: Kamesh Munagala Graph Laplacian Maybe the most beautiful
More informationPart I: Preliminary Results. Pak K. Chan, Martine Schlag and Jason Zien. Computer Engineering Board of Studies. University of California, Santa Cruz
Spectral K-Way Ratio-Cut Partitioning Part I: Preliminary Results Pak K. Chan, Martine Schlag and Jason Zien Computer Engineering Board of Studies University of California, Santa Cruz May, 99 Abstract
More informationCSE 291. Assignment Spectral clustering versus k-means. Out: Wed May 23 Due: Wed Jun 13
CSE 291. Assignment 3 Out: Wed May 23 Due: Wed Jun 13 3.1 Spectral clustering versus k-means Download the rings data set for this problem from the course web site. The data is stored in MATLAB format as
More informationMarkov Random Fields
Markov Random Fields Umamahesh Srinivas ipal Group Meeting February 25, 2011 Outline 1 Basic graph-theoretic concepts 2 Markov chain 3 Markov random field (MRF) 4 Gauss-Markov random field (GMRF), and
More informationThe f-adjusted Graph Laplacian: a Diagonal Modification with a Geometric Interpretation
The f-adjusted Graph Laplacian: a Diagonal Modification with a Geometric Interpretation Sven Kurras Ulrike von Luxburg Department of Computer Science, University of Hamburg, Germany Gilles Blanchard Department
More informationThe Second Eigenvalue of the Google Matrix
The Second Eigenvalue of the Google Matrix Taher H. Haveliwala and Sepandar D. Kamvar Stanford University {taherh,sdkamvar}@cs.stanford.edu Abstract. We determine analytically the modulus of the second
More informationGAUSSIAN PROCESS TRANSFORMS
GAUSSIAN PROCESS TRANSFORMS Philip A. Chou Ricardo L. de Queiroz Microsoft Research, Redmond, WA, USA pachou@microsoft.com) Computer Science Department, Universidade de Brasilia, Brasilia, Brazil queiroz@ieee.org)
More informationSpectral Clustering. by HU Pili. June 16, 2013
Spectral Clustering by HU Pili June 16, 2013 Outline Clustering Problem Spectral Clustering Demo Preliminaries Clustering: K-means Algorithm Dimensionality Reduction: PCA, KPCA. Spectral Clustering Framework
More informationCommunities, Spectral Clustering, and Random Walks
Communities, Spectral Clustering, and Random Walks David Bindel Department of Computer Science Cornell University 26 Sep 2011 20 21 19 16 22 28 17 18 29 26 27 30 23 1 25 5 8 24 2 4 14 3 9 13 15 11 10 12
More informationStochastic processes. MAS275 Probability Modelling. Introduction and Markov chains. Continuous time. Markov property
Chapter 1: and Markov chains Stochastic processes We study stochastic processes, which are families of random variables describing the evolution of a quantity with time. In some situations, we can treat
More informationDiffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators
Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators Boaz Nadler Stéphane Lafon Ronald R. Coifman Department of Mathematics, Yale University, New Haven, CT 652. {boaz.nadler,stephane.lafon,ronald.coifman}@yale.edu
More informationSupplemental Material for Discrete Graph Hashing
Supplemental Material for Discrete Graph Hashing Wei Liu Cun Mu Sanjiv Kumar Shih-Fu Chang IM T. J. Watson Research Center Columbia University Google Research weiliu@us.ibm.com cm52@columbia.edu sfchang@ee.columbia.edu
More informationMATH 5720: Unconstrained Optimization Hung Phan, UMass Lowell September 13, 2018
MATH 57: Unconstrained Optimization Hung Phan, UMass Lowell September 13, 18 1 Global and Local Optima Let a function f : S R be defined on a set S R n Definition 1 (minimizers and maximizers) (i) x S
More informationProbabilistic Latent Semantic Analysis
Probabilistic Latent Semantic Analysis Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr
More informationStatistical and Computational Analysis of Locality Preserving Projection
Statistical and Computational Analysis of Locality Preserving Projection Xiaofei He xiaofei@cs.uchicago.edu Department of Computer Science, University of Chicago, 00 East 58th Street, Chicago, IL 60637
More informationEigenvalue Problems Computation and Applications
Eigenvalue ProblemsComputation and Applications p. 1/36 Eigenvalue Problems Computation and Applications Che-Rung Lee cherung@gmail.com National Tsing Hua University Eigenvalue ProblemsComputation and
More informationThe Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space
The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space Robert Jenssen, Deniz Erdogmus 2, Jose Principe 2, Torbjørn Eltoft Department of Physics, University of Tromsø, Norway
More informationCholesky Decomposition Rectification for Non-negative Matrix Factorization
Cholesky Decomposition Rectification for Non-negative Matrix Factorization Tetsuya Yoshida Graduate School of Information Science and Technology, Hokkaido University N-14 W-9, Sapporo 060-0814, Japan yoshida@meme.hokudai.ac.jp
More informationLecture 14: SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Lecturer: Sanjeev Arora
princeton univ. F 13 cos 521: Advanced Algorithm Design Lecture 14: SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Lecturer: Sanjeev Arora Scribe: Today we continue the
More informationPower Grid Partitioning: Static and Dynamic Approaches
Power Grid Partitioning: Static and Dynamic Approaches Miao Zhang, Zhixin Miao, Lingling Fan Department of Electrical Engineering University of South Florida Tampa FL 3320 miaozhang@mail.usf.edu zmiao,
More informationSelf-Tuning Spectral Clustering
Self-Tuning Spectral Clustering Lihi Zelnik-Manor Pietro Perona Department of Electrical Engineering Department of Electrical Engineering California Institute of Technology California Institute of Technology
More informationDiffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators
Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators Boaz Nadler Stéphane Lafon Ronald R. Coifman Department of Mathematics, Yale University, New Haven, CT 652. {boaz.nadler,stephane.lafon,ronald.coifman}@yale.edu
More information