Zanetti, L., Sun, H., & Peng, R. (2015). Partitioning Well-Clustered Graphs: Spectral Clustering Works! Paper presented at the Conference on Learning Theory, Paris, France.

Publisher's PDF, also known as Version of record. Link to publication record in Explore Bristol Research. PDF-document.

University of Bristol - Explore Bristol Research

General rights: This document is made available in accordance with publisher policies. Please cite only the published version using the reference above. Full terms of use are available:

JMLR: Workshop and Conference Proceedings vol 40:1–33, 2015

Partitioning Well-Clustered Graphs: Spectral Clustering Works!

Richard Peng, Massachusetts Institute of Technology, Cambridge, USA
He Sun, University of Bristol, Bristol, UK
Luca Zanetti, University of Bristol, Bristol, UK

Abstract

In this work we study the widely used spectral clustering algorithms, i.e. partition a graph into k clusters via (1) embedding the vertices of a graph into a low-dimensional space using the bottom eigenvectors of the Laplacian matrix, and (2) partitioning the embedded points via k-means algorithms. We show that, for a wide class of graphs, spectral clustering algorithms give a good approximation of the optimal clustering. While this approach was proposed in the early 1990s and has comprehensive applications, prior to our work similar results were known only for graphs generated from stochastic models. We also give a nearly-linear time algorithm for partitioning well-clustered graphs, which is based on heat kernel embeddings and approximate nearest neighbor data structures.

1. Introduction

Partitioning a graph into two or more pieces is one of the most fundamental problems in combinatorial optimization, and has comprehensive applications in various disciplines of computer science. One of the most studied graph partitioning problems is the edge expansion problem, i.e., finding a cut with few crossing edges normalized by the size of the smaller side of the cut. Formally, let G = (V, E) be an undirected graph. For any set S, the conductance of S is defined by

    φ_G(S) ≜ |E(S, V \ S)| / vol(S),

where vol(S) is the total weight of edges incident to vertices in S, and the conductance of G is

    φ(G) ≜ min_{S : vol(S) ≤ vol(G)/2} φ_G(S).

The edge expansion problem asks for a set S ⊆ V with vol(S) ≤ vol(V)/2 such that φ_G(S) = φ(G). This problem is known to be NP-hard (Matula and Shahrokhi, 1990) and, assuming the Small Set Expansion Conjecture (Raghavendra et al., 2012), does not admit a polynomial-time algorithm which achieves a constant factor approximation in the worst case.

The k-way partitioning problem is a natural generalization of the edge expansion problem. We call subsets of vertices (i.e. clusters) A_1, …, A_k a k-way partition of G if A_i ∩ A_j = ∅ for different i and j, and ⋃_{i=1}^k A_i = V.

© 2015 R. Peng, H. Sun & L. Zanetti.
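As a concrete illustration of the definitions above, here is a minimal Python sketch (not part of the paper; the function name and the toy graph are ours) that evaluates φ_G(S) for a small unweighted graph given as an adjacency-list dictionary.

```python
from typing import Dict, Hashable, Set

def conductance(adj: Dict[Hashable, Set[Hashable]], S: Set[Hashable]) -> float:
    """phi_G(S) = |E(S, V \\ S)| / vol(S) for an unweighted graph.

    `adj` maps every vertex to the set of its neighbours; vol(S) is the sum
    of the degrees of the vertices in S.  A toy helper for small graphs.
    """
    vol_S = sum(len(adj[u]) for u in S)
    crossing = sum(1 for u in S for v in adj[u] if v not in S)
    return crossing / vol_S if vol_S > 0 else 0.0

# Two triangles joined by a single edge: the natural 2-way partition
# has conductance 1/7 on each side.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(conductance(adj, {0, 1, 2}))   # 0.142857...
```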

The k-way partitioning problem asks for a k-way partition of G such that the conductance of any A_i in the partition is at most the k-way expansion constant, defined by

    ρ(k) ≜ min_{partition A_1, …, A_k} max_{1≤i≤k} φ_G(A_i).    (1.1)

Clusters of low conductance in a real network usually capture the notion of community, and algorithms for finding these subsets have applications in various domains such as community detection and network analysis. In computer vision, most image segmentation procedures are based on region-based merge and split (Coleman and Andrews, 1979), which in turn rely on partitioning graphs into multiple subsets (Shi and Malik, 2000). On the theoretical side, decomposing vertex/edge sets into multiple disjoint subsets is used in designing approximation algorithms for Unique Games (Trevisan, 2008), and efficient algorithms for graph problems (Kelner et al., 2014; Leighton and Rao, 1999; Spielman and Teng, 2011).

Despite the widespread use of various graph partitioning schemes over the past decades, the quantitative relationship between the k-way expansion constant and the eigenvalues of the graph Laplacian was unknown until a sequence of very recent results (Lee et al., 2012; Louis et al., 2012). For instance, Lee et al. (2012) proved the following higher-order Cheeger inequality:

    λ_k / 2 ≤ ρ(k) ≤ O(k²) √λ_k,    (1.2)

where 0 = λ_1 ≤ ⋯ ≤ λ_n are the eigenvalues of the normalized Laplacian matrix 𝓛 of G. Informally, the higher-order Cheeger inequality shows that a graph G has a k-way partition with low ρ(k) if and only if λ_k is small. Indeed, (1.2) implies that a large gap between λ_{k+1} and ρ(k) guarantees (i) the existence of a k-way partition {S_i}_{i=1}^k with bounded φ_G(S_i) ≤ ρ(k), and (ii) that any (k+1)-way partition of G contains a subset with significantly higher conductance ρ(k+1) ≥ λ_{k+1}/2 compared with ρ(k). Hence, a suitable lower bound on the gap Υ for some k, defined by

    Υ ≜ λ_{k+1} / ρ(k),    (1.3)

implies the existence of a k-way partition for which every cluster has low conductance, and that G is a well-clustered graph.

In this paper we study well-clustered graphs which satisfy a gap assumption on Υ; similar assumptions were addressed in previous references, e.g. (Zhu et al., 2013; Oveis Gharan and Trevisan, 2014). To further justify the assumption on Υ, notice first that if the gap assumption does not hold, then G can be partitioned into at least k + 1 subsets of low conductance; secondly, the assumption on Υ is closely related to the gap between λ_{k+1} and λ_k for some k. The value of k for which there is a gap between λ_{k+1} and λ_k has been observed empirically as an indication of the correct number of clusters (Fortunato, 2010; von Luxburg, 2007).

1.1. Our Results

We give structural results that show close connections between the eigenvectors and the indicator vectors of the clusters. This characterization allows us to show that many variants of k-means algorithms, that are based on the spectral embedding and that work in practice, can be rigorously analyzed in theory. Moreover, exploiting our gap assumption, we can approximate this spectral embedding using the heat kernel of the graph.

Combining this with locality-sensitive hashing, we give a nearly-linear time algorithm for the k-way partitioning problem.

Our structural results can be summarized as follows. Let {S_i}_{i=1}^k be a k-way partition of G achieving ρ(k) defined in (1.1). We define {ḡ_i}_{i=1}^k to be the normalized indicator vectors of the clusters {S_i}_{i=1}^k, and {f_i}_{i=1}^k to be the eigenvectors corresponding to the k smallest eigenvalues of 𝓛. We show that, under the condition Υ = Ω(k²), the span of {ḡ_i}_{i=1}^k and the span of {f_i}_{i=1}^k are close to each other, which is stated formally in Theorem 1.1.

Theorem 1.1 (The Structure Theorem) Let {S_i}_{i=1}^k be a k-way partition of G achieving ρ(k), and let Υ = λ_{k+1}/ρ(k) = Ω(k²). Assume that {f_i}_{i=1}^k are the first k eigenvectors of the matrix 𝓛, and ḡ_1, …, ḡ_k ∈ ℝ^n are the indicator vectors of {S_i}_{i=1}^k with proper normalization.¹ Then, the following statements hold:

1. For every ḡ_i, there is a linear combination of {f_i}_{i=1}^k, called f̂_i, such that ‖ḡ_i − f̂_i‖² ≤ 1/Υ.
2. For every f_i, there is a linear combination of {ḡ_i}_{i=1}^k, called ĝ_i, such that ‖f_i − ĝ_i‖² ≤ 1.1k/Υ.

This theorem generalizes a result shown by Arora et al. (Arora et al. 2010, Theorem 2.2), which proves the easier direction (the first statement of Theorem 1.1), and can be considered as a stronger version of the well-known Davis-Kahan theorem (Davis and Kahan, 1970). We remark that, although we use the higher-order Cheeger inequality (1.2) to motivate the definition of Υ, our proof of the structure theorem is self-contained. Specifically, it omits much of the machinery used in the proofs of higher-order and improved Cheeger inequalities (Kwok et al., 2013; Lee et al., 2012).

The structure theorem has several applications. For instance, we look at the well-known spectral embedding F : V → ℝ^k defined by

    F(u) ≜ (1/NormalizationFactor(u)) · (f_1(u), …, f_k(u))^⊺,    (1.4)

where NormalizationFactor(u) ∈ ℝ is the normalization factor for u ∈ V. We use Theorem 1.1 to show that this well-known spectral embedding exhibits very nice geometric properties: (i) all points F(u) from the same cluster are close to each other, (ii) most pairs of points F(u), F(v) from different clusters are far from each other, and (iii) the bigger the value of Υ, the more concentrated the embedded points within the same cluster.

Based on these facts, we analyze the performance of spectral k-means algorithms², aiming at answering the following longstanding open question: Why do spectral k-means algorithms perform well in practice? We show that the partition {A_i}_{i=1}^k produced by the spectral k-means algorithm gives a good approximation of any optimal partition {S_i}_{i=1}^k: every A_i has low conductance, and has large overlap with its corresponding S_i. This allows us to apply various k-means algorithms (Kumar et al., 2004; Ostrovsky et al., 2012) to give spectral clustering algorithms, with different time versus approximation tradeoffs. These algorithms have comprehensive applications, and have been the subject of extensive experimental studies for more than 20 years, e.g. (Ng et al., 2002; von Luxburg, 2007).

1. See the formal definition in Section 3.
2. For simplicity, we use the phrase spectral k-means algorithms to refer to the algorithms which combine a spectral embedding with a k-means algorithm in Euclidean space.
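For readers who want to experiment with the embedding (1.4), the following sketch (assuming SciPy; the function name and the √d_u normalization choice anticipate Section 4.2, and this is not the paper's implementation) computes F together with the bottom eigenvalues, so that the gap between λ_{k+1} and λ_k can also be inspected.

```python
import numpy as np
from scipy.sparse import csgraph
from scipy.sparse.linalg import eigsh

def spectral_embedding(A, k):
    """Row u of the returned (n x k) array is
    F(u) = (1/sqrt(d_u)) * (f_1(u), ..., f_k(u)),
    where f_1, ..., f_k are orthonormal eigenvectors of the k smallest
    eigenvalues of the normalized Laplacian.  The k+1 smallest eigenvalues
    are returned as well, so the gap between lambda_{k+1} and lambda_k
    can be examined."""
    d = np.asarray(A.sum(axis=1)).ravel()
    L = csgraph.laplacian(A, normed=True)       # I - D^{-1/2} A D^{-1/2}
    vals, vecs = eigsh(L, k=k + 1, which='SM')  # 'SM': smallest eigenvalues
    order = np.argsort(vals)
    vals, vecs = vals[order], vecs[:, order]
    F = vecs[:, :k] / np.sqrt(d)[:, None]       # divide row u by sqrt(d_u)
    return F, vals
```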

While similar results on spectral clustering mainly focus on graphs in the stochastic block model, to the best of our knowledge this is the first rigorous analysis of spectral clustering for general graphs exhibiting a multi-cut structure. Our result is as follows:

Theorem 1.2 (Approximation Guarantee of Spectral k-Means Algorithms) Let G be a graph satisfying the condition Υ = λ_{k+1}/ρ(k) = Ω(k³), and k ∈ ℕ. Let F : V → ℝ^k be the embedding defined above. Let {A_i}_{i=1}^k be a k-way partition by any k-means algorithm running in ℝ^k that achieves an approximation ratio APT. Then, the following statements hold: (i) vol(A_i △ S_i) = O(APT · k³/Υ · vol(S_i)), and (ii) φ_G(A_i) = O(φ_G(S_i) + APT · k³/Υ).

We further study fast algorithms for partitioning well-clustered graphs. Notice that, for moderately large values of k, e.g. k = ω(log n), directly applying k-means algorithms and Theorem 1.2 does not give a nearly-linear time algorithm, since (i) obtaining the spectral embedding (1.4) requires Ω(mk) time for computing k eigenvectors, and (ii) most k-means algorithms run in time Ω(nk).

To overcome the first obstacle, we study the so-called heat kernel embedding, an embedding from V to ℝ^n, and approximate the squared distance ‖F(u) − F(v)‖² of the embedded points F(u) and F(v) via their heat-kernel distance. Since a desired approximation of the heat kernel distances of vertices is computable in nearly-linear time, this approach avoids the computation of eigenvectors. For the second obstacle, instead of applying k-means algorithms as a black box we apply approximate nearest-neighbor algorithms. Our approach presents an ad-hoc version of k-means algorithms, and indicates that in certain scenarios the standard Lloyd-type heuristic widely used in k-means algorithms can eventually be avoided. Our third result is as follows:

Theorem 1.3 (Nearly-Linear Time Algorithm For Partitioning Graphs) Let G = (V, E) be a graph of n vertices and m edges, and let k ∈ ℕ be a parameter. Assume that Υ = λ_{k+1}/ρ(k) = Ω(k⁵), and {S_i}_{i=1}^k is a k-way partition such that φ_G(S_i) ≤ ρ(k). Then there is an algorithm which runs in Õ(m) time³ and outputs a k-way partition {A_i}_{i=1}^k such that (i) vol(A_i △ S_i) = O(k³ log² k/Υ · vol(S_i)), and (ii) φ_G(A_i) = O(φ_G(S_i) + k³ log² k/Υ).

We remark that bounds for other expansion parameters of k-way partitioning can be derived from our analysis as well. For instance, one quantity studied extensively in machine learning is the normalized cut (Shi and Malik, 2000). While the k-way expansion of a graph is the maximum conductance of the sets in the partition, the corresponding normalized cut is the sum of the conductances of all the sets in this partition. It is easy to see that these two measures differ by at most a factor of k, and so the normalized cut of a k-way partition from spectral clustering can be bounded as well.

1.2. Related Work

There is a large amount of literature on graph partitioning under various settings. Oveis Gharan and Trevisan (2014) formulate the notion of clusters with respect to the inner and outer conductance: a cluster S should have low outer conductance, and the conductance of the induced subgraph on S should be high. Under a gap assumption between λ_{k+1} and λ_k, they present a polynomial-time algorithm which finds a k-way partition {A_i}_{i=1}^k satisfying the inner- and outer-conductance conditions. In order to ensure that every A_i has high inner conductance, they assume that λ_{k+1} ≥ poly(k) · λ_k^{1/4}, which is much stronger than our assumption. Moreover, their algorithm runs in polynomial time, in contrast to our nearly-linear time algorithm.

3. The Õ(·) term hides a factor of poly log n.

Dey et al. (2014) study the properties of the spectral embedding for graphs having a gap between λ_k and λ_{k+1} and present a k-way partition algorithm, which is based on k-center clustering and is similar in spirit to our work. Using combinatorial arguments, they are able to show that the clusters concentrate around k distant points in the spectral embedding. In contrast to our work, their result only holds for bounded-degree graphs, and cannot provide an approximation guarantee for individual clusters. Moreover, their algorithm runs in nearly-linear time only if k = O(poly log n).

We also explore the separation between λ_k and λ_{k+1} from an algorithmic perspective, and show that this assumption interacts well with heat-kernel embeddings. The heat kernel has been used in previous algorithms for local partitioning (Chung, 2009) and balanced separators (Orecchia et al., 2012). It also plays a key role in current efficient approximation algorithms for finding low-conductance cuts (Orecchia et al., 2008; Sherman, 2009). However, most of these theoretical guarantees are obtained through the matrix multiplicative weights update framework (Arora et al., 2012; Arora and Kale, 2007). Our algorithm instead directly uses the heat-kernel embedding to find low-conductance cuts.

There is also a considerable amount of research on partitioning random graphs. For instance, in the Stochastic Block Model (SBM) (McSherry, 2001), the input graph with k clusters is generated according to probabilities p and q with p > q: an edge between any two vertices within the same cluster is placed with probability p, and an edge between any two vertices from different clusters is placed with probability q. It is proven that spectral algorithms give the correct clustering for certain ranges of p and q, e.g. (McSherry, 2001; Rohe et al., 2011; Vu, 2014). However, the analysis of these algorithms cannot be easily generalized to our setting: we consider graphs where edges are not necessarily chosen independently with certain probabilities, but can be added in an adversarial way. For this reason, standard perturbation theorems used in the analysis of algorithms for SBMs, such as the Davis-Kahan theorem (Davis and Kahan, 1970), cannot always be applied, and ad-hoc arguments specific for graphs, like our structure theorem (Theorem 1.1), become necessary.

2. Preliminaries

Let G = (V, E) be an undirected and unweighted graph with n vertices and m edges. The set of neighbors of a vertex u is represented by N(u), and its degree is d_u = |N(u)|. For any set S ⊆ V, let vol(S) ≜ Σ_{u∈S} d_u. For any sets S, T ⊆ V, we define E(S, T) to be the set of edges between S and T, i.e. E(S, T) ≜ {{u, v} : u ∈ S and v ∈ T}. For simplicity, we write ∂S = E(S, V \ S) for any set S ⊆ V. For two sets X and Y, the symmetric difference of X and Y is defined as X △ Y ≜ (X \ Y) ∪ (Y \ X).

We will work extensively with algebraic objects related to G. We use D to denote the n × n diagonal matrix with D_{uu} = d_u for u ∈ V[G]. The Laplacian matrix of G is defined by L ≜ D − A, where A is the adjacency matrix of G. The normalized Laplacian matrix of G is defined by 𝓛 ≜ D^{-1/2} L D^{-1/2} = I − D^{-1/2} A D^{-1/2}. We denote its n eigenvalues by 0 = λ_1 ≤ ⋯ ≤ λ_n, with corresponding orthonormal eigenvectors f_1, …, f_n.

Note that if G is connected, the first eigenvector is f_1 = D^{1/2} f / ‖D^{1/2} f‖, where f is any non-zero constant vector.
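The following small NumPy sketch (illustrative only; the toy graph is ours) builds L and 𝓛 for a concrete graph and confirms the two facts just stated: λ_1 = 0, and for a connected graph the corresponding eigenvector is proportional to D^{1/2} applied to a constant vector.

```python
import numpy as np

def laplacians(A):
    """Given a dense adjacency matrix A of an undirected graph, return the
    Laplacian L = D - A and the normalized Laplacian
    N = D^{-1/2} L D^{-1/2} = I - D^{-1/2} A D^{-1/2}."""
    d = A.sum(axis=1)
    L = np.diag(d) - A
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return L, D_inv_sqrt @ L @ D_inv_sqrt

# A triangle with a pendant vertex: the graph is connected, so lambda_1 = 0
# and its eigenvector is proportional to D^{1/2} * (1, 1, 1, 1).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L, N = laplacians(A)
vals, vecs = np.linalg.eigh(N)
print(np.isclose(vals[0], 0.0))                                       # True
f1 = vecs[:, 0]
print(np.allclose(f1 / f1[0], np.sqrt(A.sum(axis=1)) / np.sqrt(2)))   # True
```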

For a vector x ∈ ℝ^n, the Euclidean norm of x is given by ‖x‖ = (Σ_{i=1}^n x_i²)^{1/2}. For any f : V → ℝ, the Rayleigh quotient of f with respect to graph G is given by

    R(f) ≜ f^⊺ L f / ‖f‖²_D = (Σ_{{u,v}∈E(G)} (f(u) − f(v))²) / (Σ_u d_u f(u)²),

where ‖f‖²_D ≜ f^⊺ D f. Based on the Rayleigh quotient, the conductance of a cluster S_i can be expressed as φ_G(S_i) = R(g_i), where g_i is the indicator vector of S_i defined in (3.1) below, and the gap Υ can be written as

    Υ = λ_{k+1}/ρ(k) = min_{1≤i≤k} λ_{k+1}/φ_G(S_i) = min_{1≤i≤k} λ_{k+1}/R(g_i).    (2.1)

Throughout the rest of the paper, we will use S_1, …, S_k to denote a k-way partition of G achieving the minimum conductance ρ(k). Note that this partition may not be unique.

3. Connection Between Eigenvectors and Indicator Vectors of Clusters

In this section we study the relations between the multiple cuts of a graph and the eigenvectors of the graph's normalized Laplacian matrix. Given clusters S_1, …, S_k, define the indicator vector of cluster S_i by

    g_i(u) = 1 if u ∈ S_i, and g_i(u) = 0 if u ∉ S_i,    (3.1)

and define the corresponding normalized indicator vector by

    ḡ_i = D^{1/2} g_i / ‖D^{1/2} g_i‖.    (3.2)

A basic result in spectral graph theory states that G has k connected components if and only if the k smallest eigenvalues are 0, implying that the spaces spanned by f_1, …, f_k and ḡ_1, …, ḡ_k are the same. Generalizing this result, we expect that these two spaces would still be similar if the k components of G are loosely connected, in the sense that (i) every eigenvector f_i can be approximately expressed by a linear combination of {ḡ_i}_{i=1}^k, and (ii) every indicator vector ḡ_i can be approximately expressed by a linear combination of {f_i}_{i=1}^k. This leads to our structure theorem, which is illustrated in Figure 1.

Figure 1: Relations among {f̂_i}, {f_i}, {ḡ_i}, and {ĝ_i} given in Theorem 3.1. Here f̂_i is a linear combination of {f_j} with coefficients α_j^(i) and satisfies ‖ḡ_i − f̂_i‖² ≤ 1/Υ (Part 1), while ĝ_i is a linear combination of {ḡ_j} with coefficients β_j^(i) and satisfies ‖f_i − ĝ_i‖² ≤ 1.1k/Υ (Part 2). Υ is the gap defined with respect to λ_{k+1} and ρ(k).
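As a quick sanity check of the identity φ_G(S_i) = R(g_i) used in (2.1), the following NumPy sketch (illustrative only; the helper name and toy graph are ours) evaluates both quantities for an explicit vertex set.

```python
import numpy as np

def indicator_rayleigh(A, S):
    """Return (R(g_S), phi_G(S)) for the indicator vector g_S of S, where
    R(f) = f^T L f / f^T D f.  The numerator counts the edges leaving S and
    the denominator is vol(S), so the two values coincide."""
    n = A.shape[0]
    d = A.sum(axis=1)
    L = np.diag(d) - A
    g = np.zeros(n)
    g[list(S)] = 1.0
    rayleigh = (g @ L @ g) / (g @ (d * g))
    crossing = sum(A[u, v] for u in S for v in range(n) if v not in S)
    return rayleigh, crossing / d[list(S)].sum()

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(indicator_rayleigh(A, {0, 1, 2}))   # both values equal 1/7
```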

Theorem 3.1 Let Υ = Ω(k²), and 1 ≤ i ≤ k. Then, the following statements hold:

1. There is a linear combination of the eigenvectors f_1, …, f_k with coefficients α_j^(i), namely f̂_i = α_1^(i) f_1 + ⋯ + α_k^(i) f_k, such that ‖ḡ_i − f̂_i‖² ≤ 1/Υ.
2. There is a linear combination of the vectors ḡ_1, …, ḡ_k with coefficients β_j^(i), namely ĝ_i = β_1^(i) ḡ_1 + ⋯ + β_k^(i) ḡ_k, such that ‖f_i − ĝ_i‖² ≤ 1.1k/Υ.

Part 1 of Theorem 3.1 shows that the normalized indicator vector ḡ_i of every cluster S_i can be approximated by a linear combination of the first k eigenvectors, with an error depending on the value of Υ. The proof follows from the fact that if ḡ_i has small Rayleigh quotient, then the inner product between ḡ_i and the eigenvectors corresponding to larger eigenvalues must be small. This statement was also shown implicitly in Theorem 2.2 of Arora et al. (2010).

Part 2 of Theorem 3.1 is more interesting, and shows that the opposite direction holds as well, i.e., any f_i (1 ≤ i ≤ k) can be approximated by a linear combination of the normalized indicator vectors {ḡ_i}_{i=1}^k. To sketch the proof, note that if we could write every ḡ_i exactly as a linear combination of {f_i}_{i=1}^k, then we could write every f_i (1 ≤ i ≤ k) as a linear combination of {ḡ_i}_{i=1}^k. This is because both {f_i}_{i=1}^k and {ḡ_i}_{i=1}^k are sets of linearly independent vectors of the same dimension and span{ḡ_1, …, ḡ_k} ⊆ span{f_1, …, f_k}. However, the ḡ_i's are only close to a linear combination of the first k eigenvectors, as shown in Part 1. We denote this combination by f̂_i, and use the fact that the approximation errors are small to show that the vectors {f̂_i}_{i=1}^k are almost orthogonal to each other. This allows us to show that span{f̂_1, …, f̂_k} = span{f_1, …, f_k}, which implies Part 2.

Theorem 3.1 shows a close connection between the first k eigenvectors and the indicator vectors of the clusters. We leverage this, and the fact that the ĝ_i's are almost orthogonal to each other, to show that for any two different clusters S_i, S_j there exists an eigenvector having reasonably different values on the coordinates which correspond to S_i and S_j.

Lemma 3.2 Let Υ = Ω(k³). For any 1 ≤ i ≤ k, let ĝ_i = β_1^(i) ḡ_1 + ⋯ + β_k^(i) ḡ_k be such that ‖f_i − ĝ_i‖² ≤ 1.1k/Υ. Then, for any ℓ ≠ j, there exists i ∈ {1, …, k} such that

    |β_ℓ^(i) − β_j^(i)| ≥ ζ ≜ 1/(10√k).    (3.3)

We point out that it was already shown in Kwok et al. (2013) that the first k eigenvectors can be approximated by a (2k+1)-step function. The quality of that approximation is the same as the one given by our structure theorem. However, a (2k+1)-step approximation is not enough to show that an entire cluster is concentrated around a certain point.

We further compare the structure theorem with standard matrix perturbation results, and point out that the standard matrix perturbation theorems cannot be applied in our setting. For instance, consider a well-clustered graph G that contains a subset C of a cluster such that most neighbors of vertices in C are outside the cluster. In this case the adjacency matrix representing the crossing edges of G has high spectral norm, and hence a standard matrix perturbation argument would not give us a meaningful result. However, our structure theorem takes into account the fact that vol(C) must be small, and that is why the structure theorem is needed to analyze the cut-structure of a graph.

4. Analysis of Spectral k-Means Algorithms

In this section we analyze an algorithm based on the classical spectral clustering paradigm, and give an approximation guarantee of this method on well-clustered graphs. We will show that any k-means algorithm AlgoMean(X, k) with a certain approximation guarantee can be used for the k-way partitioning problem. Furthermore, it suffices to call AlgoMean in a black-box manner with a point set X ⊆ ℝ^d.

This section is structured as follows. We first give a quick overview of spectral and k-means clustering in Section 4.1. In Section 4.2, we use the structure theorem to analyze the spectral embedding. Section 4.3 gives a general result about the k-means algorithm when applied to this embedding, and the proof sketch of Theorem 1.2.

4.1. k-Means Clustering

Given a set of points X ⊆ ℝ^d, a k-means algorithm AlgoMean(X, k) seeks to find a set K of k centers c_1, …, c_k that minimize the sum of the squared distances between every x ∈ X and the center to which it is assigned. Formally, for any partition X_1, …, X_k of the set X ⊆ ℝ^d, we define the cost function by

    COST(X_1, …, X_k) ≜ min_{c_1,…,c_k ∈ ℝ^d} Σ_{i=1}^k Σ_{x∈X_i} ‖x − c_i‖²,

i.e., the COST function minimizes the total squared distance between the points x and their individually closest centers c_i, where c_1, …, c_k are chosen arbitrarily in ℝ^d. We further define the optimal clustering cost by

    Δ_k²(X) ≜ min_{partition X_1,…,X_k} COST(X_1, …, X_k).    (4.1)

A typical spectral k-means algorithm on graphs can be described as follows: (i) Compute the bottom k eigenvectors f_1, …, f_k of the normalized Laplacian matrix⁴ of graph G. (ii) Map every vertex u ∈ V[G] to a point F(u) ∈ ℝ^k according to

    F(u) = (1/NormalizationFactor(u)) · (f_1(u), …, f_k(u))^⊺,    (4.2)

with a proper normalization factor NormalizationFactor(u) ∈ ℝ for each u ∈ V. (iii) Let X ≜ {F(u) : u ∈ V} be the set of embedded points obtained from the vertices of G. Run AlgoMean(X, k), and group the vertices of G into k clusters according to the output of AlgoMean(X, k).

This approach that combines a k-means algorithm with a spectral embedding has been widely used in practice for a long time, although there was a lack of rigorous analysis of its performance prior to our result.

4.2. Analysis of the Spectral Embedding

The first step of the k-means clustering described above is to map the vertices of a graph into points in Euclidean space, through the spectral embedding (4.2). This subsection analyzes the properties of this embedding. Let us define the normalization factor to be NormalizationFactor(u) ≜ √d_u. We will show that the embedding (4.2) with the normalization factor above has very nice properties: embedded points from the same cluster S_i are concentrated around their center c_i ∈ ℝ^k, and embedded points from different clusters of G are far from each other.

4. Other graph matrices (e.g. the adjacency matrix, and the Laplacian matrix) are also widely used in practice. Notice that, with proper normalization, the choice of these matrices does not substantially influence the performance of k-means algorithms.
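The three steps (i)–(iii) can be put together in a few lines. The sketch below assumes SciPy and uses scikit-learn's KMeans as the black-box AlgoMean; any k-means procedure with an approximation guarantee could be substituted, and this is not the implementation analyzed in the paper.

```python
import numpy as np
from scipy.sparse import csgraph
from scipy.sparse.linalg import eigsh
from sklearn.cluster import KMeans

def spectral_kmeans(A, k, random_state=0):
    """Spectral k-means: bottom-k eigenvectors of the normalized Laplacian,
    the embedding F(u) = (1/sqrt(d_u)) * (f_1(u), ..., f_k(u)), and a k-means
    call on the embedded points.  Returns an array of n cluster labels."""
    d = np.asarray(A.sum(axis=1)).ravel()
    L = csgraph.laplacian(A, normed=True)
    _, vecs = eigsh(L, k=k, which='SM')          # columns are f_1, ..., f_k
    X = vecs / np.sqrt(d)[:, None]               # rows are the points F(u)
    return KMeans(n_clusters=k, n_init=10,
                  random_state=random_state).fit_predict(X)
```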

These properties imply that a simple k-means algorithm is able to produce a good clustering.⁵

We first define k points p^(i) ∈ ℝ^k (1 ≤ i ≤ k), where

    p^(i) ≜ (1/√vol(S_i)) · (β_1^(i), …, β_k^(i))^⊺.    (4.3)

We will show in Lemma 4.1 that all embedded points X_i ≜ {F(u) : u ∈ S_i} (1 ≤ i ≤ k) are concentrated around p^(i). Moreover, we bound the total squared distance between the points of the ith cluster X_i and p^(i), which is proportional to 1/Υ: the bigger the value of Υ, the more concentrated the points within the same cluster. Notice that we do not claim that p^(i) is the actual center of X_i. However, these approximate centers p^(i) suffice for our analysis.

Lemma 4.1 It holds that Σ_{i=1}^k Σ_{u∈S_i} d_u ‖F(u) − p^(i)‖² ≤ 1.1k²/Υ.

We will further show in Lemma 4.2 that these points p^(i) (1 ≤ i ≤ k) exhibit another excellent property: the distance between p^(i) and p^(j) is inversely proportional to the volume of the smaller cluster between S_i and S_j. Therefore, points in a cluster S_i of smaller vol(S_i) are far from points in a cluster S_j of bigger vol(S_j). Notice that, if this were not the case, a small misclassification of points in a bigger cluster S_j could introduce a large error in the cluster of smaller volume.

Lemma 4.2 For every i ≠ j, it holds that ‖p^(i) − p^(j)‖² ≥ ζ²/(10 · min{vol(S_i), vol(S_j)}), where ζ is defined in (3.3).

4.3. Proof Sketch of Theorem 1.2

Now we analyze why spectral k-means algorithms perform well for solving the k-way partitioning problem. We assume that A_1, …, A_k is any k-way partition returned by a k-means algorithm with an approximation ratio of APT. We map every vertex u to d_u identical points in ℝ^k. This trick allows us to bound the volume of the overlap between the clusters retrieved by a k-means algorithm and the optimal ones. For this reason we define the cost function of a partition A_1, …, A_k of V[G] by

    COST(A_1, …, A_k) ≜ min_{c_1,…,c_k ∈ ℝ^k} Σ_{i=1}^k Σ_{u∈A_i} d_u ‖F(u) − c_i‖²,

and the optimal clustering cost is defined by

    Δ_k² ≜ min_{partition A_1,…,A_k} COST(A_1, …, A_k).

Lemma 4.3 It holds that Δ_k² ≤ 1.1k²/Υ.

5. Notice that this embedding is similar to the one used in Lee et al. (2012), with the only difference that F(u) is not normalized and so is not necessarily a unit vector. This difference, though, is crucial for our analysis.

Since A_1, …, A_k is the output of a k-means algorithm with approximation ratio APT, by Lemma 4.3 we have that COST(A_1, …, A_k) ≤ APT · Δ_k² ≤ 1.1 · APT · k²/Υ. We will show that this upper bound of 1.1 · APT · k²/Υ suffices to show that the approximate clustering A_1, …, A_k is close to the actual clustering S_1, …, S_k, in the sense that (i) every A_i has low conductance, and (ii) under a proper permutation σ : {1, …, k} → {1, …, k}, the symmetric difference between A_i and S_{σ(i)} is small. The fact is proven by contradiction: if we could always find a set A_i with a large symmetric difference with its corresponding S_{σ(i)}, regardless of how we map {A_i} to their corresponding {S_{σ(i)}}, then the COST value would be high, which contradicts the fact that COST(A_1, …, A_k) ≤ 1.1 · APT · k²/Υ. The core of the whole contradiction argument is the following technical lemma:

Lemma 4.4 Let A_1, …, A_k be a partition of V. Suppose that, for every permutation of the indices σ : {1, …, k} → {1, …, k}, there exists i such that vol(A_i △ S_{σ(i)}) ≥ 2ε · vol(S_{σ(i)}) for ε ≥ 10⁵ · k³/Υ. Then COST(A_1, …, A_k) ≥ 10⁻⁴ · ε/k.

5. Partitioning Well-Clustered Graphs in Nearly-Linear Time

In this section we present a nearly-linear time algorithm for partitioning well-clustered graphs. At a high level, our algorithm follows the general framework of k-means algorithms, and consists of two steps: the seeding step and the grouping step. The seeding step chooses k candidate centers such that each one is close to the actual center of a different cluster. The grouping step assigns the remaining vertices to their individually closest candidate centers. We show that, thanks to the well-separation properties of the embedded points, both steps of a k-means algorithm can be greatly simplified. Together with the heat kernel embedding, we obtain an ad-hoc version of k-means algorithms, which runs in nearly-linear time. Throughout the whole section, we assume that Υ = Ω(k⁵), where Ω(·) here hides a factor of log^c k for some constant c.

5.1. The Seeding Step

The seeding step chooses k points from {F(u)}_{u∈V[G]} such that, with constant probability, every chosen point (i) belongs to a different cluster, and (ii) is close to the center of a cluster. To give an intuition for this seeding step, notice that our approximate center p^(i) satisfies ‖p^(i)‖² ≈ 1/vol(S_i), and most embedded points F(u) are close to their approximate centers, implying that ‖F(u)‖² is approximately equal to 1/vol(S_i) for most vertices u ∈ S_i. Hence, sampling points F(u) with probability proportional to d_u · ‖F(u)‖² ensures that points from different clusters are sampled with approximately the same probability. We prove that, after sampling Θ(k log k) points, which constitute the sampling set C, with constant probability there is at least one point sampled from each cluster. Next we remove the sampled points in C which are close to each other, until there are exactly k points left, and prove that with constant probability the resulting set contains exactly one point from each cluster. See Algorithm 1 for a formal description.

We remark that choosing good candidate centers is crucial for most k-means algorithms, and has been studied extensively in the literature (e.g. Arthur and Vassilvitskii 2007; Ostrovsky et al. 2012). While recent algorithms obtain good initial centers by iteratively picking points from a non-uniform distribution and take Ω(nk) time, Algorithm 1 runs in Õ(k²) time.

Algorithm 1 SEEDANDTRIM(k, x)
1: input: the number of clusters k, and the embedding {x(u)}_{u∈V[G]}
2: Let N = Θ(k log k).
3: for i = 1, …, N do
4:   Set c_i = u with probability proportional to d_u ‖x(u)‖².
5: end for
6: for i = 1, …, N do
7:   Delete all c_j with j < i such that ‖x(c_i) − x(c_j)‖² < ‖x(c_i)‖² / (10⁴ k).
8: end for
9: return c_1, …, c_k

5.2. The Grouping Step

After obtaining the candidate centers, we group the remaining vertices by assigning vertex u to cluster S_i if F(u) is closer to c_i than to the other candidate centers. A naive implementation of this step takes Ω(nk) time. To speed it up, we apply the ε-approximate nearest neighbor data structure (Indyk and Motwani, 1998), which gives us a nearly-linear time algorithm for the grouping step.

5.3. Dealing with Large k

Since computing the spectral embedding F : V[G] → ℝ^k takes Ω(mk) time, SEEDANDTRIM(k, F) and the grouping step together run in nearly-linear time as long as k = O(poly log n). However, for larger values of k this approach becomes problematic, as it is not clear how to obtain the embedding F in nearly-linear time. To handle this case, we notice that both steps only use the pairwise distances of the embedded points F(u) and F(v), rather than the spectral embedding itself. We show that these distances can be approximated by the so-called heat kernel distance, which can be approximately computed in nearly-linear time. Formally speaking, the heat kernel of G with parameter t ≥ 0, called the temperature, is defined by

    H_t ≜ e^{-t𝓛} = Σ_{i=1}^n e^{-tλ_i} f_i f_i^⊺.    (5.1)

We view the heat kernel as a geometric embedding from V[G] to ℝ^n, where the jth coordinate of the embedding is given by the jth bottom eigenvector of H_t, i.e.

    x_t(u) ≜ (1/√d_u) · (e^{-tλ_1} f_1(u), …, e^{-tλ_n} f_n(u)).    (5.2)

We define the squared distance between the points x_t(u) and x_t(v) by

    η_t(u, v) ≜ ‖x_t(u) − x_t(v)‖².    (5.3)

We prove that, under the condition k = Ω(log n) and the gap assumption on Υ, there is a wide range of t for which η_t(u, v) gives a good approximation of ‖F(u) − F(v)‖².
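The heat-kernel distances η_t(u, v) can be estimated without computing any eigenvectors. One standard route, sketched below under the assumption that SciPy's expm_multiply is available, applies exp(−t𝓛) to O(log n) random Gaussian vectors: the rows of D^{-1/2} exp(−t𝓛) have pairwise squared distances exactly η_t(u, v), and a Johnson–Lindenstrauss projection (cf. Dasgupta and Gupta, 2003) preserves these distances up to a small multiplicative error. This is only an illustration; the paper's nearly-linear time guarantee relies on a different approximation of the matrix exponential.

```python
import numpy as np
from scipy.sparse import csgraph
from scipy.sparse.linalg import expm_multiply

def heat_kernel_sketch(A, t, num_projections=None, seed=0):
    """Randomly projected heat-kernel embedding.

    Instead of forming H_t = exp(-t * N) explicitly (N the normalized
    Laplacian), apply exp(-t * N) to O(log n) random Gaussian vectors.  The
    pairwise squared distances between the rows of the result approximate
    eta_t(u, v) up to a small multiplicative error."""
    n = A.shape[0]
    if num_projections is None:
        num_projections = max(1, int(np.ceil(24 * np.log(n))))
    rng = np.random.default_rng(seed)
    d = np.asarray(A.sum(axis=1)).ravel()
    N = csgraph.laplacian(A, normed=True)
    G = rng.standard_normal((n, num_projections)) / np.sqrt(num_projections)
    Y = expm_multiply(-t * N, G)       # exp(-tN) applied to the random vectors
    return Y / np.sqrt(d)[:, None]     # pairwise row distances approximate eta_t

def eta(Y, u, v):
    """Approximate heat-kernel distance eta_t(u, v) from the sketched rows."""
    diff = Y[u] - Y[v]
    return float(diff @ diff)
```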

Lemma 5.1 Let k = Ω(log n) and Υ = Ω(k³). Then, there is a t such that, with high probability, the embedding {x_t(u)}_{u∈V[G]} defined by (5.2) satisfies

    e^{-1} · ‖F(u) − F(v)‖² ≤ ‖x_t(u) − x_t(v)‖² ≤ ‖F(u) − F(v)‖² + n^{-c},
    e^{-1} · ‖F(u)‖² ≤ ‖x_t(u)‖² ≤ ‖F(u)‖² + n^{-c}

for all u, v, where c is some constant. Moreover, these pairwise distances ‖x_t(u) − x_t(v)‖² for all {u, v} ∈ E(G) can be approximated up to a constant factor in Õ(m) time.

Now we use a doubling argument to find a required t: we run the seeding and the grouping steps using the heat kernel embedding for all t of the form t = 2^i, where i = 1, 2, …, O(log n), and keep track of the best partition found so far. It is easy to see that the final partition after enumerating these values of t gives a desired approximation. The overall description of our algorithm for large values of k is shown in Algorithm 2.

Algorithm 2 Clustering Algorithm
1: INPUT: G, k
2: for i = 1, …, k do
3:   A_i := ∅
4: end for
5: COST(A_1, …, A_k) := ∞
6: for t = 2, 4, 8, …, poly(n) do
7:   (c_1, …, c_k) ← SEEDANDTRIM(k, x_t)
8:   Compute a partition A′_1, …, A′_k of V: for every v ∈ V, assign v to its nearest center c_i using the ε-NNS algorithm with ε = log k.
9:   If COST(A′_1, …, A′_k) ≤ COST(A_1, …, A_k), set A_i := A′_i for 1 ≤ i ≤ k.
10: end for
11: return (A_1, …, A_k)

Acknowledgments

Part of this work was done while He Sun and Luca Zanetti worked at the Max Planck Institute for Informatics, and while He Sun was visiting the Simons Institute for the Theory of Computing, UC Berkeley. We are grateful to Luca Trevisan for insightful comments on an early version of our paper, and to Gary Miller for very helpful discussions about heat kernels on graphs.

References

Sanjeev Arora and Satyen Kale. A combinatorial, primal-dual approach to semidefinite programs. In 39th Annual ACM Symposium on Theory of Computing (STOC'07), pages 227–236, 2007.

Sanjeev Arora, Boaz Barak, and David Steurer. Subexponential algorithms for unique games and related problems. In 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS'10), 2010.

Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121–164, 2012.

David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In 18th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'07), 2007.

Fan R. K. Chung. A local graph partitioning algorithm using heat kernel pagerank. Internet Mathematics, 6(3):315–330, 2009.

Guy B. Coleman and Harry C. Andrews. Image segmentation by clustering. Proceedings of the IEEE, 67(5), 1979.

Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures & Algorithms, 22(1):60–65, 2003.

Chandler Davis and William M. Kahan. The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis, 7(1):1–46, 1970.

Tamal K. Dey, Alfred Rossi, and Anastasios Sidiropoulos. Spectral concentration, robust k-center, and simple clustering. arXiv preprint, 2014.

Santo Fortunato. Community detection in graphs. Physics Reports, 486(3-5):75–174, 2010.

Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 2012.

Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In 30th Annual ACM Symposium on Theory of Computing (STOC'98), 1998.

Jonathan A. Kelner, Yin Tat Lee, Lorenzo Orecchia, and Aaron Sidford. An almost-linear-time algorithm for approximate max flow in undirected graphs, and its multicommodity generalizations. In 25th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'14), pages 217–226, 2014.

Ioannis Koutis, Alex Levin, and Richard Peng. Improved spectral sparsification and numerical algorithms for SDD matrices. In 29th International Symposium on Theoretical Aspects of Computer Science (STACS'12), pages 266–277, 2012.

Amit Kumar, Yogish Sabharwal, and Sandeep Sen. A simple linear time (1 + ε)-approximation algorithm for geometric k-means clustering in any dimensions. In 45th Annual IEEE Symposium on Foundations of Computer Science (FOCS'04), 2004.

Tsz Chiu Kwok, Lap Chi Lau, Yin Tat Lee, Shayan Oveis Gharan, and Luca Trevisan. Improved Cheeger's inequality: analysis of spectral partitioning algorithms through higher order spectral gap. In 45th Annual ACM Symposium on Theory of Computing (STOC'13), pages 11–20, 2013.

James R. Lee, Shayan Oveis Gharan, and Luca Trevisan. Multi-way spectral partitioning and higher-order Cheeger inequalities. In 44th Annual ACM Symposium on Theory of Computing (STOC'12), pages 1117–1130, 2012.

Frank T. Leighton and Satish Rao. Multicommodity max-flow min-cut theorems and their use in designing approximation algorithms. Journal of the ACM, 46(6):787–832, 1999.

Anand Louis, Prasad Raghavendra, Prasad Tetali, and Santosh Vempala. Many sparse cuts via higher eigenvalues. In 44th Annual ACM Symposium on Theory of Computing (STOC'12), pages 1131–1140, 2012.

David W. Matula and Farhad Shahrokhi. Sparsest cuts and bottlenecks in graphs. Discrete Applied Mathematics, 27(1-2):113–123, 1990.

Frank McSherry. Spectral partitioning of random graphs. In 42nd Annual IEEE Symposium on Foundations of Computer Science (FOCS'01), 2001.

Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: analysis and an algorithm. Advances in Neural Information Processing Systems, 2002.

Lorenzo Orecchia, Leonard J. Schulman, Umesh V. Vazirani, and Nisheeth K. Vishnoi. On partitioning graphs via single commodity flows. In 40th Annual ACM Symposium on Theory of Computing (STOC'08), 2008.

Lorenzo Orecchia, Sushant Sachdeva, and Nisheeth K. Vishnoi. Approximating the exponential, the Lanczos method and an Õ(m)-time spectral algorithm for balanced separator. In 44th Annual ACM Symposium on Theory of Computing (STOC'12), pages 1141–1160, 2012.

Rafail Ostrovsky, Yuval Rabani, Leonard J. Schulman, and Chaitanya Swamy. The effectiveness of Lloyd-type methods for the k-means problem. Journal of the ACM, 59(6):28, 2012.

Shayan Oveis Gharan and Luca Trevisan. Partitioning into expanders. In 25th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'14), pages 1256–1266, 2014.

Prasad Raghavendra, David Steurer, and Madhur Tulsiani. Reductions between expansion problems. In 27th Conference on Computational Complexity (CCC'12), pages 64–73, 2012.

Karl Rohe, Sourav Chatterjee, and Bin Yu. Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics, 39(4):1878–1915, 2011.

Jonah Sherman. Breaking the multicommodity flow barrier for O(√log n)-approximations to sparsest cut. In 50th Annual IEEE Symposium on Foundations of Computer Science (FOCS'09), 2009.

Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 2000.

Daniel A. Spielman and Nikhil Srivastava. Graph sparsification by effective resistances. SIAM Journal on Computing, 40(6):1913–1926, 2011.

Daniel A. Spielman and Shang-Hua Teng. Spectral sparsification of graphs. SIAM Journal on Computing, 40(4):981–1025, 2011.

Luca Trevisan. Approximation algorithms for unique games. Theory of Computing, 4(1):111–128, 2008.

Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.

Van Vu. A simple SVD algorithm for finding hidden partitions. arXiv preprint, 2014.

Zeyuan Allen Zhu, Silvio Lattanzi, and Vahab S. Mirrokni. A local algorithm for finding well-connected clusters. In 30th International Conference on Machine Learning (ICML'13), 2013.

Appendix A. Omitted Proofs of Section 3

In this section we prove Theorem 3.1 and Lemma 3.2. We begin by proving the first direction of Theorem 3.1.

Proof of Theorem 3.1, Part 1. We write ḡ_i as a linear combination of the eigenvectors of 𝓛, i.e.

    ḡ_i = α_1^(i) f_1 + ⋯ + α_n^(i) f_n,

and let the vector f̂_i be the projection of ḡ_i onto the subspace spanned by {f_j}_{j=1}^k, i.e.

    f̂_i = α_1^(i) f_1 + ⋯ + α_k^(i) f_k.

By the definition of Rayleigh quotients, we have that

    R(g_i) = ḡ_i^⊺ 𝓛 ḡ_i = (α_1^(i) f_1 + ⋯ + α_n^(i) f_n)^⊺ 𝓛 (α_1^(i) f_1 + ⋯ + α_n^(i) f_n)
           = (α_1^(i))² λ_1 + ⋯ + (α_n^(i))² λ_n
           ≥ (α_1^(i))² λ_1 + ⋯ + (α_k^(i))² λ_k + (1 − α²) λ_{k+1},

where α² ≜ (α_1^(i))² + ⋯ + (α_k^(i))². Therefore, we have that 1 − α² ≤ R(g_i)/λ_{k+1} ≤ 1/Υ, and hence

    ‖ḡ_i − f̂_i‖² = (α_{k+1}^(i))² + ⋯ + (α_n^(i))² = 1 − α² ≤ R(g_i)/λ_{k+1} ≤ 1/Υ,

which finishes the proof.

Next we prove Part 2 of Theorem 3.1. We first state two classical results that will be used in our proof.

Theorem A.1 (Geršgorin Circle Theorem) Let A be an n × n matrix, and let R_i(A) = Σ_{j≠i} |A_{i,j}| for 1 ≤ i ≤ n. Then, all eigenvalues of A lie in the union of the Geršgorin discs, defined by

    ⋃_{i=1}^n {z ∈ ℂ : |z − A_{i,i}| ≤ R_i(A)}.

Theorem A.2 (Corollary 6.3.4, Horn and Johnson 2012) Let A be an n × n normal matrix with eigenvalues λ_1, …, λ_n, and let E be an n × n matrix. If λ̂ is an eigenvalue of A + E, then there is some eigenvalue λ_i of A for which |λ̂ − λ_i| ≤ ‖E‖.

17 PENG SUN ZANETTI Proof of Theorem 3. Part By Part, every g i is approximated by a vector ˆf i defined by ˆf i = α i f + α i k f k. Define a k by k matrix A such that A i, = α i, i.e., the th column of matrix A consists of values { } α k i representing ˆf. We define a vector α by i= α = α,, α k. Notice that i the norm of every column of A is close to, and ii different columns are almost orthogonal to each other, since by Cauchy-Schwarz inequality it holds that α i, α α i α max {Rḡi /λ k+, Rḡ /λ k+ } /Υ, for i. This implies that A is almost an orthogonal matrix. Moreover, it holds for any i that A k k A i, = A l,i A l, = α i l α l = α i, α /Υ l= l= while A A i,i = k l= α i l /Υ. Then, by the Geršgorin Circle Theorem cf. Theorem A., it holds that all the eigenvalues of A A are at least /Υ k /Υ = k/υ. Therefore, A has no eigenvalue with value 0 as long as Υ > k, i.e., the vectors { α } k = are linearly independent. Combining this with the fact that span { ˆf,..., ˆf k } span {f,..., f k } and dimspan {f,..., f k } = k, it holds that span { ˆf,..., ˆf k } = span {f,..., f k }. Hence, we can write every f i i k as a linear combination of { ˆf i } k i=, i.e., Now define the value of ĝ i as f i = β i Without loss of generality we assume that = f i = ˆf + β i ĝ i = β i g + β i g + + β i k l= β i β i β i β i l ˆf + + β i k ˆf k. A. = max { β i ˆfl + ˆf β i /Υ β i l l l k g k. A. },..., β i. Then, it holds k β i l ˆf l, ˆf l ˆf l, ˆf l l l k /Υ, β i 6

18 PARTITIONING WELL-CLUSTERED GRAPHS: SPECTRAL CLUSTERING WORKS! which implies for a large enough value of Υ that β i Υ k.. Υ Combining this with Part of Theorem 3., it is easy to see that f i ĝ i k max k β i ˆf g.k/υ. Now we prove Lemma 3., which shows that for any two different clusters S i, S there exists an eigenvector having reasonably different values on the coordinates which correspond to S i and S., Proof of Lemma 3. Let β i = β i,..., βi k for i k. Since ḡi ḡ for any i, we have by the orthonormality of g,, g k that ĝ i, ĝ = β i g + + β i k g k, β g + + β k g k k = β i l β l g l = β i, β, l= and β i, β = ĝi, ĝ f i f i ĝ i, f f ĝ = f i, f f i ĝ i, f f ĝ, f i + f i ĝ i, f ĝ f i ĝ i + f ĝ + f i ĝ i f ĝ. k/υ +.k/υ. Moreover, it holds that β i = ĝi = f i + ĝ i f i + ĝ i f i +.k/υ, and β i = ĝ i = f i + ĝ i f i ĝ i f i.k/υ, which implies that β i. k/υ +.k/υ, +. k/υ +.k/υ. A.3 In other words, we showed that β i s are almost orthonormal. Now we construct a k by k matrix B, where the th column of B is β. By the Geršgorin Circle Theorem Theorem A., all eigenvalues λ of B B satisfies λ B B i,i k. k/υ +.k/υ A.4 7

19 PENG SUN ZANETTI for any i. Combing this with A.3, we have that B eigenvalues have modulus close to. Now we show that β i l and β i are far from each other by contradiction. Suppose there exist l such that ζ max β i ik l β i < 0 k. This implies that the th row and lth row of matrix B are somewhat close to each other. Let us now define matrix E R k k, where E l,i β i β i l, and E t,i = 0 for any t l and i k. Moreover, let Q = B + E. Notice that Q has two identical rows, and rank at most k. Therefore, Q has an eigenvalue with value 0, and the spectral norm E of E, the largest singular value of E, is at most kζ. By definition of matrix Q we have that Q Q = B B + B E + E B + E E. Since B B is symmetric and 0 is an eigenvalue of Q Q, by Theorem A. we know that, if ˆλ is an eigenvalue of Q Q, then there is an eigenvalue λ of B B such that which implies that ˆλ λ B E + E B + E E B E + E B + E E 4 kζ + kζ, ˆλ λ 4 kζ kζ k. k/υ +.k/υ 4 kζ kζ, due to A.3 and A.4. By setting ˆλ = 0, we have that k. k/υ +.k/υ 4 kζ kζ 0. By the condition of Υ = Ωk 3, the inequality above implies that ζ contradiction. 0 k, which leads to a Appendix B. Omitted Proofs of Section 4 In this section we list all the proofs omitted from Section 4. We start proving Lemma 4., which shows that the embedded points of a cluster S i are concentrated around the corresponding point p i. Proof of Lemma 4. Since x = D / x D holds for any x R n, by Theorem 3. we have for any k that k d u i= u S i F u p i Summing over all for k implies that k d u i= u S i F u p i = = D / f D / ĝ D = f ĝ.k/υ. k k d u i= = u S i 8 F u p i.k /Υ.

20 PARTITIONING WELL-CLUSTERED GRAPHS: SPECTRAL CLUSTERING WORKS! The next lemma shows that the l -norm of p i is inversely proportional to the volume of S i. This implies that embedded points from a big cluster are close to the origin, while embedded points from a small cluster are far from the origin. Lemma B. It holds for every i k that 9 p 0 vols i i Proof By 4.3, we have that p i = vols i β i 0 vols i.,..., β k i. Notice that p i is ust the ith row of the matrix B defined in the proof of Lemma 3., normalized by volsi. Since B and B share the same singular values this follows from the SVD decomposition, by A.4 the eigenvalues of BB are close to. But since BB i,i is equal to the squared norm of the ith row of B, we have that β i,..., β k i which implies the statement.. k/υ +.k/υ, +. k/υ +.k/υ, B. We can now prove Lemma 4. which shows that, for every i, p i is far from p. Moreover, their distance is inversely proportional to the volume of the smaller cluster between S i and S. Proof of Lemma 4. Let S i and S be two arbitrary clusters. By Lemma 3., there exists l k such that β l i β l ζ. By the definition of p i and p it follows that p i p i p p By B., we know that k l= β l = β Therefore, we have that p i p i p p β l i k t= β t i β l k t= β t,..., β k ζ 0, + ζ. 0 β l i β l ζ,. 9

21 PENG SUN ZANETTI and p i p i, p p ζ /4. Without loss of generality, we assume that p i p. By Lemma B., it holds that p i 9 0 vols i, and p i p 9 0 vols. Hence, it holds that p i 9 0 min {vols i, vols }. We can now finish the proof by considering two cases based on p i. Case : Suppose that p i 4 p. We have that p i p p i p 3 p i, 4 which implies that p i p 9 p i 6 min {vols i, vols }. Case : Suppose p = α p i for α 4, ]. In this case, we have that p i p p = i + p p i p i, p p i p p p i p + ζ /4 p i p = + α p i ζ /4α p i = + α α + αζ / p i αζ p i ζ 0 min {vols i, vols }, and the lemma follows. The next lemma gives an upper bound to the cost of the optimal k-means clustering which depends on the gap Υ. Proof of Lemma 4.3 Since k is obtained by minimizing over all partitions A,..., A k and c,..., c k, we have that k k d u i= u S i F u p i. B. 0

22 PARTITIONING WELL-CLUSTERED GRAPHS: SPECTRAL CLUSTERING WORKS! Hence the statement follows by applying Lemma 4.. By Lemma 4.3 and the assumption that A,, A k is an APT-factor approximation of an optimal clustering, we have that COSTA,..., A k. APT k /Υ. Now we prove Theorem., which is based on the upper bound of COST and Lemma 4.4. Proof of Theorem. Let A,..., A k be a k-way partition that achieves an approximation ratio of APT. We first show that there exists a permutation σ of the indices such that vol A i S σi 0 5 k 3 APT Υ vols σi for any i k. Assume for contradiction that there is i k such that This implies by Lemma 4.4 that vola i S σi > 05 k 3 APT Υ vol S σi. COSTA,..., A k 0 APT k /Υ, which contradicts to the fact that A,..., A k is an APT-approximation to a k-way partition, whose optimal cost is at most APT k /Υ. Next we bound the conductance of every cluster A i. Let ε = 05 k 3 APT Υ k3 APT = O Υ For any i k, the number of leaving edges of A i is upper bounded by On the other hand, we have that A i A i \ S σi + A i S σi. A i S σi + A i S σi ε vol S σi + φg Sσi vol Sσi + ε vol Sσi = ε + φ G Sσi volsσi. vol A i vol A i S σi ε volsσi. Hence, φ G A i ε + φ GS σi vols σi ε vols σi = ε + φ GS σi ε = Oφ G S σi + APT k 3 /Υ. The rest of this section is devoted to prove Lemma 4.4, which is based on the following highlevel idea: Suppose by contradiction that there is a cluster S which is very different from every cluster A l. Then there is a cluster A i with significant overlap with two different clusters S and S Lemma B.. However, we already proved in Lemma 4. that any two clusters are far from each other. This implies that the COST value of A,..., A k must be high, which leads to a contradiction.

23 PENG SUN ZANETTI Lemma B. such that Suppose for every permutation π : {,..., k} {,..., k} there exists an index i vol A i S πi ε vol Sπi. Then, for any index i there are indices i i and ε i 0 such that and k i= ε i ε. vola i S i vola i S i ε i min {vols i, vols i }, Proof Let σ : {,..., k} {,..., k} be the function defined by σi = argmax k vola i S. vols We first assume that σ is one-to-one, i.e. σ is a permutation. By the hypothesis of the lemma, there exists an index i such that vola i S σi ε vols σi. Without loss of generality, we assume that i =. Notice that vol A S σ = vol A S σ + vol A S. B.3 σ Hence, one of the summations on the right hand side of B.3 is at least ε vol S σ. Now the proof is based on the case distinction. Case : Assume that vol A S σ ε volsσ. We define τ for k to be τ = vol A S σ vol S σ. We have that and by the definition of σ it holds that vol A S σ vol S σ τ ε, vol A S σ vol = τ S σ for k. Setting ε = τ for k and ε = 0 finishes the proof of Case. Case : Assume that vol A S ε vols σ. σ Let us define τ for k, σ, to be B.4 τ = vola S vol S σ. Then, B.4 implies that σ τ ε.

24 PARTITIONING WELL-CLUSTERED GRAPHS: SPECTRAL CLUSTERING WORKS! S i A i S i B i u c i p i p i Figure : We use the fact that p i c i p i c i, and lower bound the value of COST function by only looking at the contribution of points u B i for all i k. The statement in this case holds by assuming vol A S σ ε vol Sσ, since otherwise we have vol S σ vol A S σ = vol A S σ ε vol Sσ ε vol Sσ, and this case was proven in Case. So it suffices to study the case in which σ defined earlier is not one-to-one. Then, there is k such that {σ,, σk}. For any l k, let l = vola l S. vols Then, k l= τ l = ε and it holds for any l k that vol A l S σl vol vola l S = τ l S σl vols. τ Proof of Lemma 4.4 By the assumption of Lemma 4.4 and Lemma B., for every i there exist i i such that vola i S i ε i min {vols i, vols i }, B.5 vola i S i ε i min {vols i, vols i }, for some ε 0, and k ε i ε. i= 3


More information

Approximate Spectral Clustering via Randomized Sketching

Approximate Spectral Clustering via Randomized Sketching Approximate Spectral Clustering via Randomized Sketching Christos Boutsidis Yahoo! Labs, New York Joint work with Alex Gittens (Ebay), Anju Kambadur (IBM) The big picture: sketch and solve Tradeoff: Speed

More information

Discrepancy Theory in Approximation Algorithms

Discrepancy Theory in Approximation Algorithms Discrepancy Theory in Approximation Algorithms Rajat Sen, Soumya Basu May 8, 2015 1 Introduction In this report we would like to motivate the use of discrepancy theory in algorithms. Discrepancy theory

More information

A Schur Complement Cheeger Inequality

A Schur Complement Cheeger Inequality A Schur Complement Cheeger Inequality arxiv:8.834v [cs.dm] 27 Nov 28 Aaron Schild University of California, Berkeley EECS aschild@berkeley.edu November 28, 28 Abstract Cheeger s inequality shows that any

More information

1 Adjacency matrix and eigenvalues

1 Adjacency matrix and eigenvalues CSC 5170: Theory of Computational Complexity Lecture 7 The Chinese University of Hong Kong 1 March 2010 Our objective of study today is the random walk algorithm for deciding if two vertices in an undirected

More information

A local algorithm for finding dense bipartite-like subgraphs

A local algorithm for finding dense bipartite-like subgraphs A local algorithm for finding dense bipartite-like subgraphs Pan Peng 1,2 1 State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences 2 School of Information Science

More information

Kernels of Directed Graph Laplacians. J. S. Caughman and J.J.P. Veerman

Kernels of Directed Graph Laplacians. J. S. Caughman and J.J.P. Veerman Kernels of Directed Graph Laplacians J. S. Caughman and J.J.P. Veerman Department of Mathematics and Statistics Portland State University PO Box 751, Portland, OR 97207. caughman@pdx.edu, veerman@pdx.edu

More information

18.S096: Spectral Clustering and Cheeger s Inequality

18.S096: Spectral Clustering and Cheeger s Inequality 18.S096: Spectral Clustering and Cheeger s Inequality Topics in Mathematics of Data Science (Fall 015) Afonso S. Bandeira bandeira@mit.edu http://math.mit.edu/~bandeira October 6, 015 These are lecture

More information

Lecture: Local Spectral Methods (3 of 4) 20 An optimization perspective on local spectral methods

Lecture: Local Spectral Methods (3 of 4) 20 An optimization perspective on local spectral methods Stat260/CS294: Spectral Graph Methods Lecture 20-04/07/205 Lecture: Local Spectral Methods (3 of 4) Lecturer: Michael Mahoney Scribe: Michael Mahoney Warning: these notes are still very rough. They provide

More information

Constant Arboricity Spectral Sparsifiers

Constant Arboricity Spectral Sparsifiers Constant Arboricity Spectral Sparsifiers arxiv:1808.0566v1 [cs.ds] 16 Aug 018 Timothy Chu oogle timchu@google.com Jakub W. Pachocki Carnegie Mellon University pachocki@cs.cmu.edu August 0, 018 Abstract

More information

Sum-of-Squares Method, Tensor Decomposition, Dictionary Learning

Sum-of-Squares Method, Tensor Decomposition, Dictionary Learning Sum-of-Squares Method, Tensor Decomposition, Dictionary Learning David Steurer Cornell Approximation Algorithms and Hardness, Banff, August 2014 for many problems (e.g., all UG-hard ones): better guarantees

More information

Introduction to Spectral Graph Theory and Graph Clustering

Introduction to Spectral Graph Theory and Graph Clustering Introduction to Spectral Graph Theory and Graph Clustering Chengming Jiang ECS 231 Spring 2016 University of California, Davis 1 / 40 Motivation Image partitioning in computer vision 2 / 40 Motivation

More information

8.1 Concentration inequality for Gaussian random matrix (cont d)

8.1 Concentration inequality for Gaussian random matrix (cont d) MGMT 69: Topics in High-dimensional Data Analysis Falll 26 Lecture 8: Spectral clustering and Laplacian matrices Lecturer: Jiaming Xu Scribe: Hyun-Ju Oh and Taotao He, October 4, 26 Outline Concentration

More information

Lecture 12 : Graph Laplacians and Cheeger s Inequality

Lecture 12 : Graph Laplacians and Cheeger s Inequality CPS290: Algorithmic Foundations of Data Science March 7, 2017 Lecture 12 : Graph Laplacians and Cheeger s Inequality Lecturer: Kamesh Munagala Scribe: Kamesh Munagala Graph Laplacian Maybe the most beautiful

More information

approximation algorithms I

approximation algorithms I SUM-OF-SQUARES method and approximation algorithms I David Steurer Cornell Cargese Workshop, 201 meta-task encoded as low-degree polynomial in R x example: f(x) = i,j n w ij x i x j 2 given: functions

More information

Spectral Partitiong in a Stochastic Block Model

Spectral Partitiong in a Stochastic Block Model Spectral Graph Theory Lecture 21 Spectral Partitiong in a Stochastic Block Model Daniel A. Spielman November 16, 2015 Disclaimer These notes are not necessarily an accurate representation of what happened

More information

Topic: Balanced Cut, Sparsest Cut, and Metric Embeddings Date: 3/21/2007

Topic: Balanced Cut, Sparsest Cut, and Metric Embeddings Date: 3/21/2007 CS880: Approximations Algorithms Scribe: Tom Watson Lecturer: Shuchi Chawla Topic: Balanced Cut, Sparsest Cut, and Metric Embeddings Date: 3/21/2007 In the last lecture, we described an O(log k log D)-approximation

More information

Finding normalized and modularity cuts by spectral clustering. Ljubjana 2010, October

Finding normalized and modularity cuts by spectral clustering. Ljubjana 2010, October Finding normalized and modularity cuts by spectral clustering Marianna Bolla Institute of Mathematics Budapest University of Technology and Economics marib@math.bme.hu Ljubjana 2010, October Outline Find

More information

Randomized Algorithms

Randomized Algorithms Randomized Algorithms Saniv Kumar, Google Research, NY EECS-6898, Columbia University - Fall, 010 Saniv Kumar 9/13/010 EECS6898 Large Scale Machine Learning 1 Curse of Dimensionality Gaussian Mixture Models

More information

Principal Component Analysis

Principal Component Analysis Machine Learning Michaelmas 2017 James Worrell Principal Component Analysis 1 Introduction 1.1 Goals of PCA Principal components analysis (PCA) is a dimensionality reduction technique that can be used

More information

Optimal compression of approximate Euclidean distances

Optimal compression of approximate Euclidean distances Optimal compression of approximate Euclidean distances Noga Alon 1 Bo az Klartag 2 Abstract Let X be a set of n points of norm at most 1 in the Euclidean space R k, and suppose ε > 0. An ε-distance sketch

More information

Graphs in Machine Learning

Graphs in Machine Learning Graphs in Machine Learning Michal Valko INRIA Lille - Nord Europe, France Partially based on material by: Ulrike von Luxburg, Gary Miller, Doyle & Schnell, Daniel Spielman January 27, 2015 MVA 2014/2015

More information

Lecture Introduction. 2 Brief Recap of Lecture 10. CS-621 Theory Gems October 24, 2012

Lecture Introduction. 2 Brief Recap of Lecture 10. CS-621 Theory Gems October 24, 2012 CS-62 Theory Gems October 24, 202 Lecture Lecturer: Aleksander Mądry Scribes: Carsten Moldenhauer and Robin Scheibler Introduction In Lecture 0, we introduced a fundamental object of spectral graph theory:

More information

LOCAL CHEEGER INEQUALITIES AND SPARSE CUTS

LOCAL CHEEGER INEQUALITIES AND SPARSE CUTS LOCAL CHEEGER INEQUALITIES AND SPARSE CUTS CAELAN GARRETT 18.434 FINAL PROJECT CAELAN@MIT.EDU Abstract. When computing on massive graphs, it is not tractable to iterate over the full graph. Local graph

More information

Graph Matching & Information. Geometry. Towards a Quantum Approach. David Emms

Graph Matching & Information. Geometry. Towards a Quantum Approach. David Emms Graph Matching & Information Geometry Towards a Quantum Approach David Emms Overview Graph matching problem. Quantum algorithms. Classical approaches. How these could help towards a Quantum approach. Graphs

More information

Singular Value Decomposition

Singular Value Decomposition Chapter 6 Singular Value Decomposition In Chapter 5, we derived a number of algorithms for computing the eigenvalues and eigenvectors of matrices A R n n. Having developed this machinery, we complete our

More information

Some Useful Background for Talk on the Fast Johnson-Lindenstrauss Transform

Some Useful Background for Talk on the Fast Johnson-Lindenstrauss Transform Some Useful Background for Talk on the Fast Johnson-Lindenstrauss Transform Nir Ailon May 22, 2007 This writeup includes very basic background material for the talk on the Fast Johnson Lindenstrauss Transform

More information

A Simple Algorithm for Clustering Mixtures of Discrete Distributions

A Simple Algorithm for Clustering Mixtures of Discrete Distributions 1 A Simple Algorithm for Clustering Mixtures of Discrete Distributions Pradipta Mitra In this paper, we propose a simple, rotationally invariant algorithm for clustering mixture of distributions, including

More information

Data Analysis and Manifold Learning Lecture 3: Graphs, Graph Matrices, and Graph Embeddings

Data Analysis and Manifold Learning Lecture 3: Graphs, Graph Matrices, and Graph Embeddings Data Analysis and Manifold Learning Lecture 3: Graphs, Graph Matrices, and Graph Embeddings Radu Horaud INRIA Grenoble Rhone-Alpes, France Radu.Horaud@inrialpes.fr http://perception.inrialpes.fr/ Outline

More information

Spectral clustering. Two ideal clusters, with two points each. Spectral clustering algorithms

Spectral clustering. Two ideal clusters, with two points each. Spectral clustering algorithms A simple example Two ideal clusters, with two points each Spectral clustering Lecture 2 Spectral clustering algorithms 4 2 3 A = Ideally permuted Ideal affinities 2 Indicator vectors Each cluster has an

More information

Hardness of Embedding Metric Spaces of Equal Size

Hardness of Embedding Metric Spaces of Equal Size Hardness of Embedding Metric Spaces of Equal Size Subhash Khot and Rishi Saket Georgia Institute of Technology {khot,saket}@cc.gatech.edu Abstract. We study the problem embedding an n-point metric space

More information

Approximating the Exponential, the Lanczos Method and an Õ(m)-Time Spectral Algorithm for Balanced Separator

Approximating the Exponential, the Lanczos Method and an Õ(m)-Time Spectral Algorithm for Balanced Separator Approximating the Exponential, the Lanczos Method and an Õm-Time Spectral Algorithm for Balanced Separator Lorenzo Orecchia MIT Cambridge, MA, USA. orecchia@mit.edu Sushant Sachdeva Princeton University

More information

Reductions Between Expansion Problems

Reductions Between Expansion Problems Reductions Between Expansion Problems Prasad Raghavendra David Steurer Madhur Tulsiani November 11, 2010 Abstract The Small-Set Expansion Hypothesis (Raghavendra, Steurer, STOC 2010) is a natural hardness

More information

Random Feature Maps for Dot Product Kernels Supplementary Material

Random Feature Maps for Dot Product Kernels Supplementary Material Random Feature Maps for Dot Product Kernels Supplementary Material Purushottam Kar and Harish Karnick Indian Institute of Technology Kanpur, INDIA {purushot,hk}@cse.iitk.ac.in Abstract This document contains

More information

SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices)

SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Chapter 14 SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Today we continue the topic of low-dimensional approximation to datasets and matrices. Last time we saw the singular

More information

Dissertation Defense

Dissertation Defense Clustering Algorithms for Random and Pseudo-random Structures Dissertation Defense Pradipta Mitra 1 1 Department of Computer Science Yale University April 23, 2008 Mitra (Yale University) Dissertation

More information

A Tutorial on Matrix Approximation by Row Sampling

A Tutorial on Matrix Approximation by Row Sampling A Tutorial on Matrix Approximation by Row Sampling Rasmus Kyng June 11, 018 Contents 1 Fast Linear Algebra Talk 1.1 Matrix Concentration................................... 1. Algorithms for ɛ-approximation

More information

Graph Clustering Algorithms

Graph Clustering Algorithms PhD Course on Graph Mining Algorithms, Università di Pisa February, 2018 Clustering: Intuition to Formalization Task Partition a graph into natural groups so that the nodes in the same cluster are more

More information

Lecture 14: Random Walks, Local Graph Clustering, Linear Programming

Lecture 14: Random Walks, Local Graph Clustering, Linear Programming CSE 521: Design and Analysis of Algorithms I Winter 2017 Lecture 14: Random Walks, Local Graph Clustering, Linear Programming Lecturer: Shayan Oveis Gharan 3/01/17 Scribe: Laura Vonessen Disclaimer: These

More information

Approximating maximum satisfiable subsystems of linear equations of bounded width

Approximating maximum satisfiable subsystems of linear equations of bounded width Approximating maximum satisfiable subsystems of linear equations of bounded width Zeev Nutov The Open University of Israel Daniel Reichman The Open University of Israel Abstract We consider the problem

More information

Stanford University CS366: Graph Partitioning and Expanders Handout 13 Luca Trevisan March 4, 2013

Stanford University CS366: Graph Partitioning and Expanders Handout 13 Luca Trevisan March 4, 2013 Stanford University CS366: Graph Partitioning and Expanders Handout 13 Luca Trevisan March 4, 2013 Lecture 13 In which we construct a family of expander graphs. The problem of constructing expander graphs

More information

Spectral Graph Theory Lecture 2. The Laplacian. Daniel A. Spielman September 4, x T M x. ψ i = arg min

Spectral Graph Theory Lecture 2. The Laplacian. Daniel A. Spielman September 4, x T M x. ψ i = arg min Spectral Graph Theory Lecture 2 The Laplacian Daniel A. Spielman September 4, 2015 Disclaimer These notes are not necessarily an accurate representation of what happened in class. The notes written before

More information

Locally-biased analytics

Locally-biased analytics Locally-biased analytics You have BIG data and want to analyze a small part of it: Solution 1: Cut out small part and use traditional methods Challenge: cutting out may be difficult a priori Solution 2:

More information

Learning symmetric non-monotone submodular functions

Learning symmetric non-monotone submodular functions Learning symmetric non-monotone submodular functions Maria-Florina Balcan Georgia Institute of Technology ninamf@cc.gatech.edu Nicholas J. A. Harvey University of British Columbia nickhar@cs.ubc.ca Satoru

More information

RITZ VALUE BOUNDS THAT EXPLOIT QUASI-SPARSITY

RITZ VALUE BOUNDS THAT EXPLOIT QUASI-SPARSITY RITZ VALUE BOUNDS THAT EXPLOIT QUASI-SPARSITY ILSE C.F. IPSEN Abstract. Absolute and relative perturbation bounds for Ritz values of complex square matrices are presented. The bounds exploit quasi-sparsity

More information

Iterative solvers for linear equations

Iterative solvers for linear equations Spectral Graph Theory Lecture 15 Iterative solvers for linear equations Daniel A. Spielman October 1, 009 15.1 Overview In this and the next lecture, I will discuss iterative algorithms for solving linear

More information

Sum-of-Squares and Spectral Algorithms

Sum-of-Squares and Spectral Algorithms Sum-of-Squares and Spectral Algorithms Tselil Schramm June 23, 2017 Workshop on SoS @ STOC 2017 Spectral algorithms as a tool for analyzing SoS. SoS Semidefinite Programs Spectral Algorithms SoS suggests

More information

Using Laplacian Eigenvalues and Eigenvectors in the Analysis of Frequency Assignment Problems

Using Laplacian Eigenvalues and Eigenvectors in the Analysis of Frequency Assignment Problems Using Laplacian Eigenvalues and Eigenvectors in the Analysis of Frequency Assignment Problems Jan van den Heuvel and Snežana Pejić Department of Mathematics London School of Economics Houghton Street,

More information

Lecture 8: Linear Algebra Background

Lecture 8: Linear Algebra Background CSE 521: Design and Analysis of Algorithms I Winter 2017 Lecture 8: Linear Algebra Background Lecturer: Shayan Oveis Gharan 2/1/2017 Scribe: Swati Padmanabhan Disclaimer: These notes have not been subjected

More information

Linear Spectral Hashing

Linear Spectral Hashing Linear Spectral Hashing Zalán Bodó and Lehel Csató Babeş Bolyai University - Faculty of Mathematics and Computer Science Kogălniceanu 1., 484 Cluj-Napoca - Romania Abstract. assigns binary hash keys to

More information

First Efficient Convergence for Streaming k-pca: a Global, Gap-Free, and Near-Optimal Rate

First Efficient Convergence for Streaming k-pca: a Global, Gap-Free, and Near-Optimal Rate 58th Annual IEEE Symposium on Foundations of Computer Science First Efficient Convergence for Streaming k-pca: a Global, Gap-Free, and Near-Optimal Rate Zeyuan Allen-Zhu Microsoft Research zeyuan@csail.mit.edu

More information

Spectral Generative Models for Graphs

Spectral Generative Models for Graphs Spectral Generative Models for Graphs David White and Richard C. Wilson Department of Computer Science University of York Heslington, York, UK wilson@cs.york.ac.uk Abstract Generative models are well known

More information

Cutting Graphs, Personal PageRank and Spilling Paint

Cutting Graphs, Personal PageRank and Spilling Paint Graphs and Networks Lecture 11 Cutting Graphs, Personal PageRank and Spilling Paint Daniel A. Spielman October 3, 2013 11.1 Disclaimer These notes are not necessarily an accurate representation of what

More information

An Algorithmist s Toolkit Nov. 10, Lecture 17

An Algorithmist s Toolkit Nov. 10, Lecture 17 8.409 An Algorithmist s Toolkit Nov. 0, 009 Lecturer: Jonathan Kelner Lecture 7 Johnson-Lindenstrauss Theorem. Recap We first recap a theorem (isoperimetric inequality) and a lemma (concentration) from

More information

An Improved Approximation Algorithm for Requirement Cut

An Improved Approximation Algorithm for Requirement Cut An Improved Approximation Algorithm for Requirement Cut Anupam Gupta Viswanath Nagarajan R. Ravi Abstract This note presents improved approximation guarantees for the requirement cut problem: given an

More information

Kernelization by matroids: Odd Cycle Transversal

Kernelization by matroids: Odd Cycle Transversal Lecture 8 (10.05.2013) Scribe: Tomasz Kociumaka Lecturer: Marek Cygan Kernelization by matroids: Odd Cycle Transversal 1 Introduction The main aim of this lecture is to give a polynomial kernel for the

More information

Spectral Techniques for Clustering

Spectral Techniques for Clustering Nicola Rebagliati 1/54 Spectral Techniques for Clustering Nicola Rebagliati 29 April, 2010 Nicola Rebagliati 2/54 Thesis Outline 1 2 Data Representation for Clustering Setting Data Representation and Methods

More information

Self-Tuning Spectral Clustering

Self-Tuning Spectral Clustering Self-Tuning Spectral Clustering Lihi Zelnik-Manor Pietro Perona Department of Electrical Engineering Department of Electrical Engineering California Institute of Technology California Institute of Technology

More information

A better approximation ratio for the Vertex Cover problem

A better approximation ratio for the Vertex Cover problem A better approximation ratio for the Vertex Cover problem George Karakostas Dept. of Computing and Software McMaster University October 5, 004 Abstract We reduce the approximation factor for Vertex Cover

More information

EIGENVECTOR NORMS MATTER IN SPECTRAL GRAPH THEORY

EIGENVECTOR NORMS MATTER IN SPECTRAL GRAPH THEORY EIGENVECTOR NORMS MATTER IN SPECTRAL GRAPH THEORY Franklin H. J. Kenter Rice University ( United States Naval Academy 206) orange@rice.edu Connections in Discrete Mathematics, A Celebration of Ron Graham

More information

MATH 567: Mathematical Techniques in Data Science Clustering II

MATH 567: Mathematical Techniques in Data Science Clustering II This lecture is based on U. von Luxburg, A Tutorial on Spectral Clustering, Statistics and Computing, 17 (4), 2007. MATH 567: Mathematical Techniques in Data Science Clustering II Dominique Guillot Departments

More information

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces.

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces. Math 350 Fall 2011 Notes about inner product spaces In this notes we state and prove some important properties of inner product spaces. First, recall the dot product on R n : if x, y R n, say x = (x 1,...,

More information

Discrete quantum random walks

Discrete quantum random walks Quantum Information and Computation: Report Edin Husić edin.husic@ens-lyon.fr Discrete quantum random walks Abstract In this report, we present the ideas behind the notion of quantum random walks. We further

More information

Appendix A. Proof to Theorem 1

Appendix A. Proof to Theorem 1 Appendix A Proof to Theorem In this section, we prove the sample complexity bound given in Theorem The proof consists of three main parts In Appendix A, we prove perturbation lemmas that bound the estimation

More information

Unique Games Conjecture & Polynomial Optimization. David Steurer

Unique Games Conjecture & Polynomial Optimization. David Steurer Unique Games Conjecture & Polynomial Optimization David Steurer Newton Institute, Cambridge, July 2013 overview Proof Complexity Approximability Polynomial Optimization Quantum Information computational

More information

The Simplest Construction of Expanders

The Simplest Construction of Expanders Spectral Graph Theory Lecture 14 The Simplest Construction of Expanders Daniel A. Spielman October 16, 2009 14.1 Overview I am going to present the simplest construction of expanders that I have been able

More information

Algorithmic Primitives for Network Analysis: Through the Lens of the Laplacian Paradigm

Algorithmic Primitives for Network Analysis: Through the Lens of the Laplacian Paradigm Algorithmic Primitives for Network Analysis: Through the Lens of the Laplacian Paradigm Shang-Hua Teng Computer Science, Viterbi School of Engineering USC Massive Data and Massive Graphs 500 billions web

More information

Fixed Parameter Algorithms for Interval Vertex Deletion and Interval Completion Problems

Fixed Parameter Algorithms for Interval Vertex Deletion and Interval Completion Problems Fixed Parameter Algorithms for Interval Vertex Deletion and Interval Completion Problems Arash Rafiey Department of Informatics, University of Bergen, Norway arash.rafiey@ii.uib.no Abstract We consider

More information

Fast Approximation Algorithms for Graph Partitioning using Spectral and Semidefinite-Programming Techniques

Fast Approximation Algorithms for Graph Partitioning using Spectral and Semidefinite-Programming Techniques Fast Approximation Algorithms for Graph Partitioning using Spectral and Semidefinite-Programming Techniques Lorenzo Orecchia Electrical Engineering and Computer Sciences University of California at Berkeley

More information

Fiedler s Theorems on Nodal Domains

Fiedler s Theorems on Nodal Domains Spectral Graph Theory Lecture 7 Fiedler s Theorems on Nodal Domains Daniel A. Spielman September 19, 2018 7.1 Overview In today s lecture we will justify some of the behavior we observed when using eigenvectors

More information

Lecture 9: SVD, Low Rank Approximation

Lecture 9: SVD, Low Rank Approximation CSE 521: Design and Analysis of Algorithms I Spring 2016 Lecture 9: SVD, Low Rank Approimation Lecturer: Shayan Oveis Gharan April 25th Scribe: Koosha Khalvati Disclaimer: hese notes have not been subjected

More information

An SDP-Based Algorithm for Linear-Sized Spectral Sparsification

An SDP-Based Algorithm for Linear-Sized Spectral Sparsification An SDP-Based Algorithm for Linear-Sized Spectral Sparsification ABSTRACT Yin Tat Lee Microsoft Research Redmond, USA yile@microsoft.com For any undirected and weighted graph G = V, E, w with n vertices

More information

An indicator for the number of clusters using a linear map to simplex structure

An indicator for the number of clusters using a linear map to simplex structure An indicator for the number of clusters using a linear map to simplex structure Marcus Weber, Wasinee Rungsarityotin, and Alexander Schliep Zuse Institute Berlin ZIB Takustraße 7, D-495 Berlin, Germany

More information