Spectral Clustering. Guokun Lai 2016/10


1 Spectral Clustering
Guokun Lai, 2016/10

2 Organization
1. Graph Cut
2. Fundamental Limitations of Spectral Clustering
3. Ng 2002 paper (if we have time)

3 Notation
We define an undirected weighted graph $G(V, E)$, where $V$ is the node set of $G$ and $E$ is its edge set.
The adjacency matrix $W$ has entries $W_{ij} = E(i, j)$, with $W_{ij} \ge 0$.
The degree matrix $D \in \mathbb{R}^{n \times n}$ is a diagonal matrix with $D_{i,i} = \sum_{j=1}^{n} W_{i,j}$.
The Laplacian matrix $L \in \mathbb{R}^{n \times n}$ is $L = D - W$.
Indicator vector of a cluster: the indicator vector $I_C$ of a cluster $C$ is
$$I_{C,i} = \begin{cases} 1 & v_i \in C \\ 0 & \text{otherwise} \end{cases} \tag{1}$$

4 Graph Cut
The intuition of clustering is to separate points into different groups according to their similarities.
If we separate the node set $V$ into two disjoint sets $A$ and $B$, we define
$$\mathrm{Cut}(A, B) = \sum_{i \in A,\, j \in B} w_{ij}$$
If we split the node set into $K$ disjoint sets, then
$$\mathrm{Cut}(A_1, \dots, A_K) = \sum_{i=1}^{K} \mathrm{Cut}(A_i, \bar{A}_i)$$
where $\bar{A}_i$ is the complement of $A_i$.
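As a small illustration of the cut definition, the sketch below computes $\mathrm{Cut}(A, B)$ directly from the adjacency matrix and also through the Laplacian and indicator vector from the Notation slide, using the identity $I_A^T L I_A = \mathrm{Cut}(A, \bar{A})$. The six-node graph and the helper name `cut` are assumptions made for this example only.

```python
import numpy as np

def cut(W, A, B):
    """Cut(A, B): total weight of edges with one end in A and the other in B."""
    return W[np.ix_(A, B)].sum()

# Hypothetical example graph: two triangles joined by one weak edge (2-3).
W = np.zeros((6, 6))
for i, j, w in [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
                (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0),
                (2, 3, 0.1)]:
    W[i, j] = W[j, i] = w

A, B = [0, 1, 2], [3, 4, 5]
L = np.diag(W.sum(axis=1)) - W      # Laplacian L = D - W
ind = np.zeros(6)
ind[A] = 1.0                        # indicator vector of A

cut_direct = cut(W, A, B)           # sum of w_ij over i in A, j in B
cut_laplacian = ind @ L @ ind       # same quantity via the quadratic form
```

Both values equal the weight of the single bridging edge, which is why quadratic forms in $L$ can stand in for cut values in the later slides.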

5 Defect of Graph Cut
The simplest idea for clustering the node set $V$ is to find a partition that minimizes the graph cut function.
But this usually leads to solutions in which one subset contains only a few nodes, for example a single weakly connected node split off from the rest.

6 Normalized Cut
To overcome this defect of the graph cut, Shi and Malik proposed a new cost function that regularizes the size of the subsets.
First, we define $\mathrm{Vol}(A) = \sum_{i \in A,\, j \in V} w(i, j)$, and then
$$\mathrm{NCut}(A, B) = \frac{\mathrm{Cut}(A, B)}{\mathrm{Vol}(A)} + \frac{\mathrm{Cut}(A, B)}{\mathrm{Vol}(B)}$$
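The difference between the two criteria can be seen on a toy graph. The sketch below is only an illustration under assumed data: a hypothetical seven-node graph with two triangles and one lightly attached pendant node, and helper names `cut`, `vol`, and `ncut` chosen for this example.

```python
import numpy as np

def cut(W, A, B):
    """Total edge weight between node lists A and B."""
    return W[np.ix_(A, B)].sum()

def vol(W, A):
    """Vol(A): sum of the degrees of the nodes in A."""
    return W[A, :].sum()

def ncut(W, A, B):
    c = cut(W, A, B)
    return c / vol(W, A) + c / vol(W, B)

# Hypothetical graph: triangles 0-1-2 and 3-4-5 joined by edge 2-3,
# plus a pendant node 6 attached to node 5 by a very light edge.
edges = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
         (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0),
         (2, 3, 0.3), (5, 6, 0.2)]
W = np.zeros((7, 7))
for i, j, w in edges:
    W[i, j] = W[j, i] = w

lopsided = ([6], [0, 1, 2, 3, 4, 5])     # split off the single pendant node
balanced = ([0, 1, 2], [3, 4, 5, 6])     # split between the two triangles

cut_lopsided, cut_balanced = cut(W, *lopsided), cut(W, *balanced)
ncut_lopsided, ncut_balanced = ncut(W, *lopsided), ncut(W, *balanced)
```

On this graph the plain cut is smaller for the lopsided split (0.2 against 0.3), while the normalized cut is much smaller for the balanced split, because dividing by the tiny volume of the single node penalizes the lopsided partition.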

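7 Relation between NCut and Spectral Clustering
Given vertex subsets $A_i \subset V$, $i = 1, \dots, K$, we define the vectors $f_i = \frac{1}{\sqrt{\mathrm{Vol}(A_i)}} I_{A_i}$ and collect them as the columns of $F$. Since $f_i^T L f_i = \mathrm{Cut}(A_i, \bar{A}_i) / \mathrm{Vol}(A_i)$, we can write the optimization problem as
$$\min_{A_1, \dots, A_K} \mathrm{NCut} = \sum_{i=1}^{K} f_i^T L f_i = \mathrm{Tr}(F^T L F) \quad \text{s.t.} \quad f_i = \frac{1}{\sqrt{\mathrm{Vol}(A_i)}} I_{A_i}, \quad F^T D F = I \tag{2}$$
A constant factor such as the $\frac{1}{2}$ used in some definitions of the cut does not change the minimizer.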

8 Optimization
Because of the constraint $f_i = \frac{1}{\sqrt{\mathrm{Vol}(A_i)}} I_{A_i}$, the optimization problem is NP-hard. So we relax this constraint and allow $f_i \in \mathbb{R}^n$. The optimization problem then becomes
$$\min_{F} \mathrm{Tr}(F^T L F) \quad \text{s.t.} \quad F^T D F = I \tag{3}$$
The solution is given by the eigenvectors of $D^{-1} L$ that belong to the $K$ smallest eigenvalues. Based on $F$, we recover the $A_i$ with the k-means algorithm.
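A minimal sketch of the relaxed problem, assuming a small hypothetical graph (two triangles joined by a weak edge). The function name `relaxed_ncut_embedding` is made up for this example, and for two clusters the k-means step is replaced by a sign threshold on the second eigenvector, which is a simplification and not the procedure stated on the slide.

```python
import numpy as np

def relaxed_ncut_embedding(W, k):
    """Columns f_1..f_k minimising Tr(F^T L F) subject to F^T D F = I."""
    d = W.sum(axis=1)
    L = np.diag(d) - W
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # D^{-1/2} L D^{-1/2} is symmetric and has the same eigenvalues as D^{-1} L.
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
    eigvals, U = np.linalg.eigh(L_sym)        # eigenvalues in ascending order
    F = d_inv_sqrt[:, None] * U[:, :k]        # eigenvectors of D^{-1} L
    return eigvals[:k], F

# Hypothetical example graph: two triangles joined by one weak edge (2-3).
W = np.zeros((6, 6))
for i, j, w in [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
                (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0),
                (2, 3, 0.1)]:
    W[i, j] = W[j, i] = w

eigvals, F = relaxed_ncut_embedding(W, k=2)
# Simplification for two clusters: threshold the second eigenvector at zero.
labels = (F[:, 1] > 0).astype(int)
```

Going through the symmetric matrix lets `numpy.linalg.eigh` be used, and the rescaling by $D^{-1/2}$ makes the returned columns satisfy the constraint $F^T D F = I$.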

9 Unnormalized Laplacian Matrix
Similar to the above approach, we can prove that the eigenvectors of the unnormalized Laplacian matrix are the relaxed solution for
$$\mathrm{RatioCut}(A, B) = \frac{\mathrm{Cut}(A, B)}{|A|} + \frac{\mathrm{Cut}(A, B)}{|B|}$$
We can set $f_i = \frac{1}{\sqrt{|A_i|}} I_{A_i}$ and get the relaxed optimization problem
$$\min_{F} \mathrm{Tr}(F^T L F) \quad \text{s.t.} \quad F^T F = I \tag{4}$$

10 Approximation
The solution from the spectral method only approximates the minimizer of the Normalized Cut objective function, and there is no bound on the gap between them.
We can easily construct a case in which the solution of the relaxed problem is very different from that of the original problem.

11 Experimental Results of the Shi Paper
[Figure]

12 Organization
1. Graph Cut
2. Fundamental Limitations of Spectral Clustering
3. Ng 2002 paper (if we have time)

13 Fundamental Limitations of Spectral Clustering
As mentioned above, spectral clustering approximately solves the Normalized Graph Cut objective function.
But is the Normalized Graph Cut a good criterion in all situations?

14 Limitation of NCut
The NCut function is more likely to capture the global structure. But sometimes we want to extract local features of the graph.
For example, the normalized graph cut cannot separate a Gaussian-distributed cluster from a band.

15 Limitation of Spectral Clustering
Next we analyze the spectral method from the viewpoint of a random walk.
We define the Markov transition matrix as $M = D^{-1} W$, with eigenvalues $\lambda_j$ and eigenvectors $v_j$. The random walk on the graph converges to the unique equilibrium distribution $\pi_s$.
Then we can find the relationship between the eigenvectors and the diffusion distance between points $x$ and $y$:
$$\sum_j \lambda_j^{2t} \left( v_j(x) - v_j(y) \right)^2 = \left\| p(z, t \mid x) - p(z, t \mid y) \right\|^2_{L_2(1/\pi_s)}$$
where $p(z, t \mid x)$ is the probability that a walk started at $x$ is at $z$ after $t$ steps.
So the spectral method tries to capture the dominant patterns of the random walk on the whole graph.
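The identity between the spectral sum and the diffusion distance can be checked numerically. The sketch below makes two assumptions that are not stated on the slide: the eigenvectors $v_j$ are taken to be right eigenvectors of $M$ normalized so that $\sum_x \pi_s(x) v_j(x)^2 = 1$, and the graph is a hypothetical pair of triangles joined by a weak edge. The function name is also made up for this example.

```python
import numpy as np

def diffusion_distance_both_ways(W, t, x, y):
    """Return (spectral sum, direct weighted L2 distance) for nodes x and y."""
    d = W.sum(axis=1)
    total = d.sum()
    pi = d / total                                # equilibrium distribution
    # D^{-1/2} W D^{-1/2} is symmetric and shares eigenvalues with M = D^{-1} W.
    S = W / np.sqrt(np.outer(d, d))
    lam, U = np.linalg.eigh(S)
    # Right eigenvectors of M, scaled so that sum_x pi(x) v_j(x)^2 = 1 (assumed).
    V = np.sqrt(total) * U / np.sqrt(d)[:, None]
    spectral = np.sum(lam ** (2 * t) * (V[x] - V[y]) ** 2)
    M = W / d[:, None]
    Mt = np.linalg.matrix_power(M, t)             # row x holds p(z, t | x)
    direct = np.sum((Mt[x] - Mt[y]) ** 2 / pi)
    return spectral, direct

# Hypothetical example graph: two triangles joined by one weak edge (2-3).
W = np.zeros((6, 6))
for i, j, w in [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
                (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0),
                (2, 3, 0.1)]:
    W[i, j] = W[j, i] = w

within_spec, within_direct = diffusion_distance_both_ways(W, t=3, x=0, y=1)
across_spec, across_direct = diffusion_distance_both_ways(W, t=3, x=0, y=4)
```

With this normalization the two quantities agree, and the distance between nodes of the same triangle is far smaller than the distance across the weak edge.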

16 Limitation of Spectral Clustering
But this method fails when the scales of the clusters are very different.

17 Self-Tuning Spectral Clustering
One way to handle the above case is to accelerate the random walk in low-density areas.
We define the affinity between nodes as
$$A_{i,j} = \exp\left( -\frac{d(v_i, v_j)^2}{\sigma_i \sigma_j} \right)$$
with $\sigma_i = d(v_i, v_k)$, where $v_k$ is the $k$-th nearest neighbor of $v_i$.
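A short sketch of the locally scaled affinity, assuming Euclidean distance for $d(\cdot, \cdot)$, points stored as rows of an array, and $k = 7$ as a default chosen only for this example. The function name `self_tuning_affinity` is hypothetical.

```python
import numpy as np

def self_tuning_affinity(X, k=7):
    """A_ij = exp(-d(x_i, x_j)^2 / (sigma_i * sigma_j)), sigma_i = distance to k-th neighbour."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))      # pairwise Euclidean distances (assumed metric)
    # Column 0 of each sorted row is the point itself, so column k is the k-th neighbour.
    sigma = np.sort(dist, axis=1)[:, k]
    return np.exp(-dist ** 2 / np.outer(sigma, sigma))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                      # hypothetical sample points
A = self_tuning_affinity(X, k=7)
A_rescaled = self_tuning_affinity(3.0 * X, k=7)   # same data in different units
```

Because each distance is divided by local scales measured in the same units, rescaling all points leaves the affinity matrix unchanged, which is the property that lets dense and sparse clusters be treated alike. The sketch assumes no duplicate points, since a zero $\sigma_i$ would cause a division by zero.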

18 Results of Self-Tuning Spectral Clustering
[Figure]

19 Failure Case
[Figure]

20 Another Solution
The paper proposes to split the graph into two subsets recursively.
The stopping criterion is based on the relaxation time of the graph, which is $\tau_V = 1 / (1 - \lambda_2)$.
If the sizes of the two subsets after splitting are comparable, we expect $\tau_V \gg \tau_1 + \tau_2$.
Otherwise, we expect $\max(\tau_1, \tau_2) \gg \min(\tau_1, \tau_2)$.
If the partition satisfies either condition, we accept the split and continue to split the subsets. If not, we stop.
But the paper does not address how to deal with the K-clustering problem.
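The first acceptance test can be sketched as follows, assuming $\lambda_2$ is the second largest eigenvalue of $M = D^{-1} W$, that each subset's relaxation time is computed on its induced subgraph, and that "much larger" is read as a factor of 10. That factor, the example graph, and the function name are assumptions for illustration only.

```python
import numpy as np

def relaxation_time(W):
    """tau = 1 / (1 - lambda_2), with lambda_2 the second largest eigenvalue of D^{-1} W."""
    d = W.sum(axis=1)
    # D^{-1/2} W D^{-1/2} is symmetric and shares its eigenvalues with D^{-1} W.
    S = W / np.sqrt(np.outer(d, d))
    lam = np.linalg.eigvalsh(S)                   # ascending order
    return 1.0 / (1.0 - lam[-2])

# Hypothetical example graph: two triangles joined by one weak edge (2-3).
W = np.zeros((6, 6))
for i, j, w in [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
                (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0),
                (2, 3, 0.1)]:
    W[i, j] = W[j, i] = w

A, B = [0, 1, 2], [3, 4, 5]
tau_V = relaxation_time(W)
tau_1 = relaxation_time(W[np.ix_(A, A)])          # induced subgraph on A
tau_2 = relaxation_time(W[np.ix_(B, B)])          # induced subgraph on B
accept_split = tau_V > 10 * (tau_1 + tau_2)       # assumed reading of ">>"
```

The weak bridge makes the walk on the whole graph mix slowly, so $\tau_V$ is large, while each triangle mixes quickly on its own, so the split is accepted.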

21 Tong Zhang 2007 Paper
This paper gives an upper bound on the expected error of semi-supervised learning on graphs.
Because of limited presentation time, I will only introduce one interesting conclusion of this paper.

22 S-Normalized Laplacian Matrix
We define the S-normalized Laplacian matrix as
$$L_S = S^{-1/2} L S^{-1/2}$$
where $S$ is a diagonal matrix.
According to the analysis in this paper, the best choice is $S_{i,i} = C_j$, where $C_j$ is the size of the cluster $j$ that contains node $i$.
This approach therefore targets the problem of clusters with different scales, which standard spectral clustering cannot handle.
It is similar to self-tuning spectral clustering in that it renormalizes the adjacency matrix as
$$\hat{W}_{ij} = \frac{W_{ij}}{\sqrt{C_i C_j}}$$
where $C_i$ and $C_j$ here denote the sizes of the clusters containing nodes $i$ and $j$.

23 S-Normalized Laplacian Matrix
But we do not know $C_j$, so the author proposes a method to compute it approximately.
We define $K^{-1} = \alpha I + L_S$, $\alpha \in \mathbb{R}$.
In the ideal case, where the graph has $q$ disjoint connected components, we can prove that as $\alpha \to 0$,
$$\alpha K = \sum_{j=1}^{q} \frac{1}{C_j} v_j v_j^T + O(\alpha)$$
where $v_j$ is the indicator vector of cluster $j$.
So for small $\alpha$, we can assume $K_{i,i} \propto 1 / C_j$. Then we can set $S_{i,i} \propto 1 / K_{i,i}$.

24 Comparison
[Figure]

25 Organization
1. Graph Cut
2. Fundamental Limitations of Spectral Clustering
3. Ng 2002 paper (if we have time)

26 Ng 2002 Paper
This paper analyzes the spectral clustering problem using matrix perturbation theory.
It obtains an error bound for the spectral clustering algorithm under several assumptions.

27 Algorithm
1. Define the weighted adjacency matrix $W$ and construct the Laplacian matrix $L = D^{-1/2} W D^{-1/2}$.
2. Find $x_1, \dots, x_k$, the eigenvectors of $L$ belonging to the $k$ largest eigenvalues, and form the matrix $X = [x_1 \cdots x_k] \in \mathbb{R}^{n \times k}$.
3. Normalize every row of $X$ to have unit length: $Y_{ij} = X_{ij} / \left( \sum_j X_{ij}^2 \right)^{1/2}$.
4. Treating each row of $Y$ as a point in $\mathbb{R}^k$, cluster the rows into $k$ clusters via K-means.
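The steps of this slide can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the `kmeans` helper is a bare-bones Lloyd iteration with a deterministic farthest-point start written for this example (not a routine from the paper), and the test graph is a hypothetical pair of triangles joined by a weak edge.

```python
import numpy as np

def kmeans(Y, k, n_iter=50):
    """Minimal Lloyd's k-means with farthest-point initialisation (illustrative helper)."""
    centers = [Y[0]]
    for _ in range(1, k):
        d2 = np.min([((Y - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Y[np.argmax(d2)])          # point farthest from chosen centers
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)                # assign each row to nearest center
        for c in range(k):
            if np.any(labels == c):
                centers[c] = Y[labels == c].mean(axis=0)
    return labels

def njw_spectral_clustering(W, k):
    d = W.sum(axis=1)
    L = W / np.sqrt(np.outer(d, d))               # step 1: D^{-1/2} W D^{-1/2}
    _, vecs = np.linalg.eigh(L)                   # eigenvalues in ascending order
    X = vecs[:, -k:]                              # step 2: k largest eigenvectors
    Y = X / np.linalg.norm(X, axis=1, keepdims=True)   # step 3: unit-length rows
    return kmeans(Y, k)                           # step 4: cluster the rows

# Hypothetical example graph: two triangles joined by one weak edge (2-3).
W = np.zeros((6, 6))
for i, j, w in [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
                (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0),
                (2, 3, 0.1)]:
    W[i, j] = W[j, i] = w

labels = njw_spectral_clustering(W, k=2)
```

On this graph the rows of $Y$ for the two triangles sit near two well separated unit vectors, so even this simple k-means recovers the two groups. In practice a library k-means with several restarts would be a more robust choice.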

28 Ideal Case
Assume the graph $G$ contains $k$ clusters and does not contain any cross-cluster edges.
In this case, provided each cluster is connected, the Laplacian matrix has exactly $k$ eigenvectors with eigenvalue 1.

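29 Y Matrix of Ideal Case
After running the algorithm on such a graph with three clusters of sizes $n_1, n_2, n_3$, we get the matrix
$$Y = \begin{bmatrix} \mathbf{1}_{n_1} & 0 & 0 \\ 0 & \mathbf{1}_{n_2} & 0 \\ 0 & 0 & \mathbf{1}_{n_3} \end{bmatrix} R$$
where $\mathbf{1}_{n_i}$ is a column of $n_i$ ones and $R$ is any rotation matrix, so the rows of $Y$ naturally fall into 3 groups.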
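The two ideal-case claims (exactly as many unit eigenvalues as clusters, and rows of $Y$ that coincide within a cluster) can be checked on a small block-diagonal graph. The cluster sizes and the fully connected blocks below are assumptions chosen for this example.

```python
import numpy as np

sizes = [2, 3, 4]                                 # hypothetical cluster sizes
n, k = sum(sizes), len(sizes)
W = np.zeros((n, n))
membership = np.zeros((n, k))
start = 0
for c, s in enumerate(sizes):
    block = slice(start, start + s)
    W[block, block] = 1.0                         # fully connected cluster
    membership[block, c] = 1.0
    start += s
np.fill_diagonal(W, 0.0)                          # no self-loops, no cross-cluster edges

d = W.sum(axis=1)
L = W / np.sqrt(np.outer(d, d))                   # D^{-1/2} W D^{-1/2}
eigvals, vecs = np.linalg.eigh(L)                 # ascending eigenvalues
num_unit_eigvals = int(np.sum(np.isclose(eigvals, 1.0)))

X = vecs[:, -k:]                                  # eigenvectors for the k largest eigenvalues
Y = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-length rows
same_cluster = membership @ membership.T          # 1 if two nodes share a cluster, else 0
```

Here `Y @ Y.T` matches `same_cluster`: rows from the same cluster are identical unit vectors and rows from different clusters are orthogonal, whichever orthonormal basis of the eigenspace the solver happens to return.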

30 The General Case
In real-world data, we have cross-cluster edges. So the authors analyze the influence of the cross-cluster edges on the $Y$ matrix using matrix perturbation theory.

31 The General Case
Assumption 1: There exists $\delta > 0$ so that, for the second largest eigenvalue $\lambda_2^{(i)}$ of each cluster $i = 1, \dots, k$,
$$\lambda_2^{(i)} \le 1 - \delta$$
Assumption 2: There is some fixed $\epsilon_1 > 0$ so that for every $i_1, i_2 \in \{1, \dots, k\}$, $i_1 \ne i_2$, we have
$$\sum_{j \in S_{i_1}} \sum_{k \in S_{i_2}} \frac{W_{jk}^2}{\hat{d}_j \hat{d}_k} \le \epsilon_1$$
where $\hat{d}_j$ is the degree of node $j$ within its own cluster.
The intuition of this inequality is to limit the weight of the cross-cluster edges compared to the weight of the intra-cluster edges.

32 The General Case
Assumption 3: There is some fixed $\epsilon_2 > 0$ so that for every $j \in S_i$, we have
$$\frac{\sum_{k \notin S_i} W_{jk}^2}{\hat{d}_j} \le \epsilon_2 \left( \sum_{k, l \in S_i} \frac{W_{kl}^2}{\hat{d}_k \hat{d}_l} \right)^{-1/2}$$
The intuition of this inequality is also to limit the weight of the cross-cluster edges compared to the weight of the intra-cluster edges.
Assumption 4: There is some constant $C > 0$ so that for every $i = 1, \dots, k$, $j = 1, \dots, n_i$, we have
$$\hat{d}_j^{(i)} \ge \frac{\sum_{k=1}^{n_i} \hat{d}_k^{(i)}}{C n_i}$$
The intuition of this inequality is that no point in a cluster should be much less connected than the other points in the same cluster.

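33 The General Case
If all of the assumptions hold, set $\epsilon = \sqrt{k(k-1)\epsilon_1 + k \epsilon_2^2}$.
If $\delta > (2 + \sqrt{2})\epsilon$, there exist $k$ orthogonal vectors $r_1, \dots, r_k$ so that
$$\frac{1}{n} \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left\| y_j^{(i)} - r_i \right\|_2^2 \le 4C \left( 4 + 2\sqrt{k} \right)^2 \frac{\epsilon^2}{(\delta - \sqrt{2}\epsilon)^2}$$
where $y_j^{(i)}$ is the row of $Y$ for the $j$-th point of cluster $i$.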

34 Liu's 2016 Paper: Motivation
The original semi-supervised learning problem can be formalized as
$$\min_f \sum_i l(f_i, y_i) + f^T L f$$
We can enrich the label propagation patterns with a spectral transformation, which is called ST-enhanced semi-supervised learning:
$$\min_f \sum_i l(f_i, y_i) + f^T \sigma(L) f$$

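35 Spectral Transform
We can define $L = \sum_i \lambda_i \phi_i \phi_i^T$ and $\theta_i = \sigma(\lambda_i)^{-1}$, where $\sigma(x)$ should be a non-decreasing function.
We can substitute this into the objective function:
$$\min_f C(f; \theta) = \sum_{i \in \mathcal{T}} l(f_i, y_i) + \gamma \sum_{i=1}^{m} \theta_i^{-1} \langle \phi_i, f \rangle^2$$
where $\theta_1 \ge \theta_2 \ge \dots \ge \theta_m \ge 0$ and the first sum runs over the index set $\mathcal{T}$.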

36 Joint Optimization
We can try to jointly optimize the eigenvalue set $\theta$ and the label set $f$, so we have
$$\min_\theta \left( \min_f C(f; \theta) \right) + \tau \|\theta\|_1$$
We can prove that this function is convex in $\theta$.
The optimization process can be described as follows. First, with $\theta$ fixed, we solve the convex problem in $f$. After that, we optimize $\theta$ over its domain.

37 Proof of Convexity
We can rewrite the objective function using the dual form of $C(f; \theta)$, denoted $C^*(u; \theta)$:
$$\min_\theta \left( \max_u C^*(u; \theta) \right) + \tau \|\theta\|_1$$
where
$$C^*(u; \theta) = -w(-u) - \frac{1}{4\gamma} \sum_i \theta_i \langle \phi_i, u \rangle^2$$
and $w$ is the conjugate function of $l$.
For each fixed $u$, $C^*(u; \theta)$ is affine in $\theta$, so the objective is the point-wise maximum of a set of convex functions. Hence it is still convex in $\theta$.
