Dissertation Defense
1 Clustering Algorithms for Random and Pseudo-random Structures. Dissertation Defense. Pradipta Mitra, Department of Computer Science, Yale University. April 23, 2008.
2 Committee: Ravi Kannan (Advisor), Dana Angluin, Dan Spielman, Mike Mahoney (Yahoo!).
3 Outline. 1. Introduction: clustering and spectral algorithms. 2. Four results: (a) clustering using bi-partitioning, (b) clustering in sparse graphs, (c) a robust clustering algorithm, (d) an entrywise notion of spectral stability. 3. Future work.
4 Clustering. What is clustering? Given a set of objects S, partition it into disjoint sets or clusters S_1, ..., S_k. The partitioning is done according to some notion of closeness, i.e. objects in a cluster S_r are close to each other, and far from objects in other clusters. Issues: What is the right definition of closeness? Algorithms to find clusters given the right definition.
5 Clustering: Examples. Figure: from the Yale face dataset.
6 Clustering: Matrices. Term-document matrices: a table whose rows are documents (CS Doc 1-3, Medicine Doc 1-3) and whose columns are terms M, V, C, A, H, K, F, P, where M = Microprocessor, V = Virtual Memory, C = L2 Cache, A = Algorithm, H = Hemoglobin, K = Kidney, F = Fracture, P = Painkiller. Moral: clustering problems can be modelled as object-feature matrices, and objects can be seen as vectors in a high dimensional space.
7 Mixture models. Each cluster is defined by a simple (high-dimensional) probability distribution; objects are samples from these distributions. Hope: we can successfully cluster if the centers (means) are far apart. How large does the separation ‖µ_1 − µ_2‖ need to be? Figure: two circles whose centers are separated.
8 Random graphs. A G_{n,p} random graph is generated by selecting each possible edge with independent probability p. Example: G_{5,0.5}, with its expected adjacency matrix E[A] and one sample A.
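The following is a small illustrative sketch (not from the slides) of sampling a G_{n,p} adjacency matrix with numpy; the function name sample_gnp and the use of numpy are my own choices.

```python
import numpy as np

def sample_gnp(n, p, rng=None):
    # Each of the n*(n-1)/2 possible edges is included independently with
    # probability p; the adjacency matrix is symmetric with zero diagonal.
    rng = np.random.default_rng() if rng is None else rng
    coins = rng.random((n, n)) < p
    A = np.triu(coins, k=1).astype(int)        # keep the strict upper triangle
    return A + A.T

A = sample_gnp(5, 0.5)                         # one draw of G_{5,0.5}
EA = 0.5 * (np.ones((5, 5)) - np.eye(5))       # E[A]: every off-diagonal entry is p
```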
9 Planted partition model for graphs. There are n vertices in total, divided into k clusters T_1, T_2, ..., T_k of sizes n_1, ..., n_k. There are k(k+1)/2 probabilities P_rs (= P_sr) such that if v ∈ T_r and u ∈ T_s, the edge e(u, v) is present with probability P_rs.
10 Planted partition model for graphs. Example: P = E[A], a 6x6 matrix whose first three rows equal µ_1 and last three rows equal µ_2, together with one sample A, where µ_1 = (0.5, 0.5, 0.5, 0.1, 0.1, 0.1) and µ_2 = (0.1, 0.1, 0.1, 0.5, 0.5, 0.5).
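As an illustration of the planted partition model (not part of the original slides), here is a sketch that samples an adjacency matrix from given cluster sizes and a probability matrix P; the 6-vertex example matches the two centers µ_1 and µ_2 on the slide, and all names are illustrative.

```python
import numpy as np

def sample_planted_partition(sizes, P, rng=None):
    # Vertices are split into clusters T_1, ..., T_k of the given sizes; an edge
    # between u in T_r and v in T_s appears independently with probability P[r, s].
    # Returns one sample A, its expectation E[A], and the cluster label of each vertex.
    rng = np.random.default_rng() if rng is None else rng
    labels = np.repeat(np.arange(len(sizes)), sizes)
    EA = P[np.ix_(labels, labels)]                 # rows of E[A] are the centers mu_r
    A = np.triu(rng.random(EA.shape) < EA, k=1)
    return (A + A.T).astype(int), EA, labels

P = np.array([[0.5, 0.1],
              [0.1, 0.5]])
A, EA, labels = sample_planted_partition([3, 3], P)
# EA[0] = (0.5, 0.5, 0.5, 0.1, 0.1, 0.1) = mu_1
# EA[3] = (0.1, 0.1, 0.1, 0.5, 0.5, 0.5) = mu_2
```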
11 Algorithmic Issues. Heuristic analysis: analyze an algorithm known to work in practice. Spectral algorithms: use information about the spectrum (eigenvalues, eigenvectors, singular vectors etc.) of the data matrix to do the clustering. Quite popular, and seems to work in practice. Singular values and vectors can be computed efficiently. For a matrix A, the span of the top k singular vectors gives A_k, the rank-k matrix such that ‖A − A_k‖ ≤ ‖A − M‖ for all rank-k matrices M.
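A minimal sketch (my own, not from the slides) of computing the best rank-k approximation A_k from the top k singular vectors with numpy.

```python
import numpy as np

def rank_k_approx(A, k):
    # A_k keeps the top k singular values/vectors; among all rank-k matrices M
    # it minimizes the spectral norm ||A - M||.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

# Example: for a matrix sampled from a model with 2 clusters, the rows of
# rank_k_approx(A, 2) are typically much closer to the centers than the rows of A.
```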
12 Why might this work? Intuition: it eliminates noise, avoids the curse of dimensionality, and connects to Cheeger's inequality (relation to sparsest cut). Convention: eigen/singular values are often sorted from largest to smallest (in absolute value), |λ_1| ≥ |λ_2| ≥ |λ_3| ≥ ... The eigen/singular vector corresponding to λ_i is the i-th eigen/singular vector.
13 Why might this work? Example: an adjacency matrix A with two clusters (shown on the slide). Quick definition: if A is square and symmetric, v is an eigenvector of A if ‖v‖ = 1 and Av = λv for some λ (an eigenvalue).
14 Why might this work? For 1 = {1, ..., 1}^T we have A1 = 4·1, so 1 is the first eigenvector. For v = {1, 1, 1, 1, −1, −1, −1, −1}^T we have Av = 2v; this is the second eigenvector, and it reveals the clusters.
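A quick numerical check (not from the slides) of this picture. The slide's 8x8 matrix is not reproduced in the transcription, so the block matrix below is only a stand-in with the same numbers: two clusters of size 4, top eigenvalue 4 with the all-ones eigenvector, and second eigenvalue 2 with an eigenvector that is constant on each cluster with opposite signs.

```python
import numpy as np

g = np.array([1, 1, 1, 1, -1, -1, -1, -1], dtype=float)
A = 0.5 * np.ones((8, 8)) + 0.25 * np.outer(g, g)   # 0.75 within a cluster, 0.25 across

vals, vecs = np.linalg.eigh(A)       # eigenvalues in increasing order
print(vals[-1], vecs[:, -1])         # 4.0, proportional to the all-ones vector
print(vals[-2], vecs[:, -2])         # 2.0, the signs of the entries reveal the two clusters
```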
15 Previous Work. There is a lot of work: [B 87], [DF 89], [AK 97], [AKS 98], [VW 2002]... McSherry 2001: an instance of the planted partition model A with k clusters can be clustered with probability 1 − o(1) if the following separation condition holds (centers are far apart): for all r ≠ s, ‖µ_r − µ_s‖² ≥ cσ² log n, where σ² = max_rs P_rs and n is the number of vertices. Assumption: σ² ≥ log^6 n / n, i.e. at least polylogarithmic degree. Spectral method: take the best rank-k approximation A_k and run a greedy procedure on that matrix; this gives an approximate clustering. Clean-up: use combinatorial projections, i.e. counting edges to the approximate partitions.
16 Our Contribution. Clustering by recursive bi-partitioning: use the second singular vector to bi-partition the data, then repeat. Pseudo-random models of clustering: used to model clustering problems for sparse (constant-degree) graphs. Rotationally invariant algorithms: remove combinatorial/ad-hoc techniques for discrete distributions. Entrywise bounds for eigenvectors: a different notion of spectral stability for random graphs.
17 Spectral Clustering by Recursive Bi-partitioning
18 Spectral Clustering by Recursive Bi-partitioning. Joint work with Dasgupta, Hopcroft and Kannan (ESA 2006). Goal: instead of a rank-k approximation based method, use an incremental algorithm that bi-partitions the data at each step. Result: clustering is possible if, for all r ≠ s, ‖µ_r − µ_s‖² ≥ c(σ_r + σ_s)² log n, where σ_r² = max_s P_rs ≥ log^6 n / n.
19 Basic Step. Given A, find the unit vector v_1 that maximizes ‖AJv_1‖, where J = I − (1/n)11^T. Sort the entries of v_1: v_1(1) ≥ v_1(2) ≥ ... ≥ v_1(n). Find i such that v_1(i) − v_1(i+1) is largest. Return {1, ..., i} and {i+1, ..., n} as the bi-partition. Definition refresher: v_1 is the first right singular vector of AJ, and is close to the second right singular vector of A.
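A sketch of this Basic Step (my own reading, not code from the slides) using numpy; basic_bipartition is an illustrative name.

```python
import numpy as np

def basic_bipartition(A):
    # Multiply by J = I - (1/n) 11^T on the right (subtract each row's mean),
    # take the first right singular vector v1 of AJ, sort its entries, and cut
    # the columns at the largest gap between consecutive sorted entries.
    AJ = A - A.mean(axis=1, keepdims=True)
    _, _, Vt = np.linalg.svd(AJ, full_matrices=False)
    v1 = Vt[0]
    order = np.argsort(-v1)                      # entries in decreasing order
    gaps = v1[order][:-1] - v1[order][1:]
    i = int(np.argmax(gaps))                     # position of the largest gap
    return order[:i + 1], order[i + 1:]          # the two sides of the bi-partition
```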
20 Main algorithm. Given A, randomly partition the rows into t = 4 log n parts B_i (i = 1 to t) of equal size. Bi-partition the (same) columns t times using the Basic Step (last slide). Combine these (approximate) bi-partitions to find an accurate bi-partition.
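Putting the pieces together, here is a sketch (again my own, not from the slides) of this main loop; it assumes a basic_bipartition like the sketch above and a combine_bipartitions like the one sketched further below, and eps is an assumed misclassification rate.

```python
import numpy as np

def bipartition_columns(A, eps=0.05, rng=None):
    # Randomly split the rows into t = 4 log n equal parts, bi-partition the
    # (same) columns once per part with the Basic Step, and combine the t noisy
    # bi-partitions into one accurate bi-partition.
    rng = np.random.default_rng() if rng is None else rng
    m, n = A.shape
    t = max(1, int(4 * np.log(n)))
    parts = np.array_split(rng.permutation(m), t)
    sides = np.zeros((t, n), dtype=int)
    for b, part in enumerate(parts):
        left, _ = basic_bipartition(A[part])     # one noisy bi-partition of the columns
        sides[b, left] = 1
    return combine_bipartitions(sides, eps)      # connected components of the agreement graph
```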
21-22 Analysis. Let us focus on one B_i; call it B, and let B̄ = E(B). Claim: v_1(BJ) is almost structured. Let v_1 = v_1(BJ). Then v_1 = Σ_r α_r g^(r) + v⊥, where g^(r) is the characteristic vector of T_r, v⊥ is orthogonal to each g^(r), and ‖v⊥‖ ≤ 1/c_2. Füredi-Komlós 81: if σ² ≥ log^6 n / n, then ‖B − B̄‖ ≤ 3σ√n. Proof sketch: write v_1 = v + v⊥ with v = Σ_r α_r g^(r). Then ‖BJ‖ = ‖BJv_1‖ ≤ ‖BJv‖ + ‖BJv⊥‖ ≤ ‖B̄J‖‖v‖ + ‖B̄Jv⊥‖ + ‖(B − B̄)Jv⊥‖ ≤ ‖B̄J‖√(1 − ‖v⊥‖²) + ‖B − B̄‖‖v⊥‖, since B̄Jv⊥ = 0. Using √(1 − x) ≤ 1 − x/2 and ‖BJ‖ ≥ ‖B̄J‖ − ‖B − B̄‖, this gives ‖v⊥‖² ≤ 4‖B − B̄‖ / ‖B̄J‖ ≤ 1/c_2².
23 Analysis. v_1 = v + v⊥, v = Σ_r α_r g^(r). Claim: when sorted, there is an Ω(1) gap in the α's. v is orthogonal to 1; this implies that the α_r cannot all have the same sign. On the other hand, 1 = ‖v_1‖² = ‖v‖² + ‖v⊥‖² ≤ Σ_r α_r² + 1/c_2², so Σ_r α_r² ≥ 1/2. (Figure: what v looks like.) Combining these proves the existence of an Ω(1) gap.
24 Analysis. v_1 = v + v⊥, v = Σ_r α_r g^(r). Claim: no more than n_min/c_3 vertices cross the gap (where n_min = min_r n_r). This is implied by the fact that ‖v⊥‖ is small: an Ω(1) gap in the α's gives a gap of at least 1/(4√n_min) in the entries of v. Suppose m vertices cross the gap; then m/(16 n_min) ≤ ‖v⊥‖² ≤ 1/c_2², so m ≤ 16 n_min / c_2². (Figure: what v_1 looks like.)
25-29 Combining the 4 log n bi-partitions. We showed that no more than n_min/c_3 vertices cross the gap; equivalently, a vertex has probability ε = 1/c_3 of being misclassified. Given the 4 log n bi-partitions, construct a graph on the vertices as follows: for each u, v ∈ [n], set e(u, v) = 1 if vertices u and v are on the same side of the bi-partition in at least a 1 − 2ε fraction of cases. Find the connected components of this graph and return them as a (bi-)partition. Need to show: (1) Clean clusters: no two vertices from the same cluster can be put in different components. Let u, v ∈ T_r. Vertex v is on the right side of the bi-partition in at least a 1 − ε fraction of cases, and the same is true for u; so u and v are on the same side in at least a 1 − 2ε fraction of cases. (2) Nontrivial partitions: we find at least two components, by a counting argument.
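A sketch (not from the slides) of this combining step; it assumes scipy is available for the connected-components computation, and all names are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def combine_bipartitions(sides, eps):
    # sides is a t x n 0/1 array: sides[b, v] is the side of vertex v in the b-th
    # bi-partition.  Connect u and v if they land on the same side in at least a
    # (1 - 2*eps) fraction of the t bi-partitions, then return the connected
    # components of this agreement graph as the final (bi-)partition.
    t, n = sides.shape
    agree = (sides[:, :, None] == sides[:, None, :]).mean(axis=0)
    E = agree >= 1 - 2 * eps
    np.fill_diagonal(E, False)
    _, labels = connected_components(csr_matrix(E), directed=False)
    return labels
```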
30 Pseudo-randomness and Clustering
31 Sparse graphs? Goal: design a model that allows constant-degree graphs. Problem: the standard condition is ‖µ_r − µ_s‖² ≥ cσ² log n, and a planted partition model with σ² = Θ(d/n) for constant d will have vertices with logarithmic degree. Our result: we introduce a model in which clustering is possible if, for constant α, ‖µ_r − µ_s‖² ≥ c (α²/n) log² α.
32 Solution: use pseudo-randomness. A graph G(V, E) is (p, α) pseudo-random if for all A, B ⊆ V, |e(A, B) − p|A||B|| ≤ α√(|A||B|). Theorem: a G_{n,p} random graph is (p, 2√(np)) pseudo-random (for p ≥ log^6 n / n). Proof: E(e(A, B)) = p|A||B|. Using a Chernoff bound, P(|e(A, B) − E(e(A, B))| > 2√(np)·√(|A||B|)) ≤ exp(−2n). But there are only 2^n · 2^n = 2^(2n) pairs of sets A, B, so a union bound gives a failure probability of at most 2^(2n) exp(−2n) = o(1). The claim follows. Intuition: pseudo-random graphs are deterministic versions of random graphs.
33 The model. A graph G with k clusters T_r, r ∈ [k]. For some α and for each r, s ∈ [k] there is p_rs such that G(T_r, T_s) is (p_rs, α) pseudo-random. Also, |e(x, T_s) − p_rs|T_s|| ≤ 2α if x ∈ T_r. Algorithmic issue: Füredi-Komlós doesn't apply, and there is no independence! (Figure: Ā and A.)
34 Rotationally Invariant Algorithm for Discrete Distributions
35 Discrete vs. Continuous. Similar results can be proved for discrete and continuous models: ‖µ_r − µ_s‖² ≥ Ω(σ² log n). The algorithms (1) share the spectral part that gives an approximation, but (2) differ in the clean-up phase; continuous models seem to have more natural algorithms. Mixture of Gaussians: k high-dimensional Gaussians with centers µ_r, r = 1 to k. The pdf of the r-th cluster/Gaussian is f_r(x) ∝ exp(−(1/2)(x − µ_r)^T Σ_r^{-1} (x − µ_r)), where Σ_r is the covariance matrix.
36-40 Discrete vs. Continuous. We would like an algorithm that (1) has a simple, natural clean-up phase, (2) is rotationally invariant, and (3) is easily extensible to more complex models. Simplicity: a one-shot distance-based or projection-based algorithm, instead of combinatorial, incremental or sampling techniques. Natural assumption: if the vectors are rotated, the clustering remains the same. Extension: simpler algorithms are easier to adapt, e.g. to models without complete independence or without block structuring. McSherry 2001 conjecture: such an algorithm exists. Our result: the conjecture is true. Theorem: consider a matrix generated from a discrete mixture model with k clusters, m objects and n features. Clustering is possible if ‖µ_r − µ_s‖² ≥ cσ²(1 + n/m) log m.
41 Our algorithm. Cluster(A, k): divide A into A_1 and A_2; compute {µ̃_r} = Centers(A_1, k); run Project(A_2, µ̃_1, ..., µ̃_k). Project(A_2, µ̃_1, ..., µ̃_k): group each v ∈ A_2 with the µ̃_r that minimizes ‖v − µ̃_r‖. Centers(A_1, k): uses a spectral algorithm to find approximate clusters P_r, r ∈ [k], and returns the empirical centers µ̃_r = (1/|P_r|) Σ_{v ∈ P_r} v. (Here µ̃_r denotes the empirical center and µ_r the true center.)
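A rough sketch of this outline (not the slides' actual implementation). The slides' Centers step is a spectral algorithm; here k-means on the rank-k projection is used only as an illustrative stand-in, and the scikit-learn dependency and all names are my own choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def centers(A1, k):
    # Approximate clusters of the rows of A1 (stand-in for the spectral Centers
    # step), then return the empirical center of each cluster.
    U, s, Vt = np.linalg.svd(A1, full_matrices=False)
    A1_k = (U[:, :k] * s[:k]) @ Vt[:k, :]                 # best rank-k approximation
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(A1_k)
    return np.vstack([A1[labels == r].mean(axis=0) for r in range(k)])

def cluster(A, k, rng=None):
    # Split the rows into A1 and A2, estimate centers from A1, then assign each
    # row of A2 to the nearest estimated center (the Project step).
    rng = np.random.default_rng() if rng is None else rng
    perm = rng.permutation(A.shape[0])
    half = A.shape[0] // 2
    A1, A2 = A[perm[:half]], A[perm[half:]]
    mu = centers(A1, k)
    dists = ((A2[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    return perm[half:], dists.argmin(axis=1)              # rows of A2 and their clusters
```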
42 Analysis. Lemma: ‖µ̃_r − µ_r‖² ≤ c_1 σ²(1 + n/m) ≤ (1/20) ‖µ_r − µ_s‖² for all s ≠ r. (1) Proof idea: the spectral method returns an approximately correct partition. Let P_r* be the correctly classified part of P_r, with p_r = |P_r| and p_r* = |P_r*|, and let Q_rs be the set of vectors that should be in P_s but were placed in P_r, with q_rs = |Q_rs|. Then µ̃_r = (1/p_r) Σ_{v ∈ P_r} v, so p_r µ̃_r = Σ_{v ∈ P_r} v = Σ_{v ∈ P_r*} v + Σ_s Σ_{v ∈ Q_rs} v, and hence p_r(µ̃_r − µ_r) = Σ_{v ∈ P_r*}(v − µ_r) + Σ_s Σ_{v ∈ Q_rs}(v − µ_r).
43 Analysis. p_r(µ̃_r − µ_r) = Σ_{v ∈ P_r*}(v − µ_r) + Σ_s Σ_{v ∈ Q_rs}(v − µ_r). We need to bound ‖Σ_{v ∈ P_r*}(v − µ_r)‖ and, for all s, ‖Σ_{v ∈ Q_rs}(v − µ_r)‖ ≤ ‖Σ_{v ∈ Q_rs}(v − µ_s)‖ + ‖q_rs µ_s − q_rs µ_r‖, where ‖q_rs µ_s − q_rs µ_r‖ = q_rs ‖µ_s − µ_r‖. It turns out that q_rs decreases as ‖µ_s − µ_r‖ increases, so the two factors cancel each other out. The bound on ‖Σ_{v ∈ P_r*}(v − µ_r)‖ follows from an argument based on a spectral norm bound (a la Füredi-Komlós).
44-47 Analysis. Lemma: for each sample u, if u ∈ T_r, then for all s ≠ r, |(u − µ̃_r)·(µ̃_r − µ̃_s)| ≤ (2/5)‖µ_r − µ_s‖². Assume µ̃_r = µ_r + δ_r for all r. Then (u − µ̃_r)·(µ̃_s − µ̃_r) = (u − µ_r − δ_r)·(µ_s − µ_r − δ_r + δ_s) = (u − µ_r)·(µ_s − µ_r) − δ_r·(µ_s − µ_r) − δ_r·(δ_s − δ_r) + (u − µ_r)·(δ_s − δ_r). The term (u − µ_r)·(µ_s − µ_r) is small by the separation assumption. The term δ_r·(µ_s − µ_r) satisfies |δ_r·(µ_s − µ_r)| ≤ ‖δ_r‖ ‖µ_s − µ_r‖ by Cauchy-Schwarz, which is small since ‖δ_r‖ is small. The term δ_r·(δ_s − δ_r) is similarly small. Main challenge: bounding (u − µ_r)·(δ_s − δ_r).
48 Completing the proof. Claim: |(u − µ_r)·(δ_s − δ_r)| < c_3 σ²(1 + n/m) log m. Proof idea: (u − µ_r)·δ_r = Σ_{i ∈ [n]} (u(i) − µ_r(i)) δ_r(i) = Σ_{i ∈ [n]} x(i). This is a sum of zero-mean random variables x(i), with E(x(i)²) ≤ 2 δ_r(i)² σ², so Σ_i E(x(i)²) ≤ 2σ² ‖δ_r‖² ≤ c_3 k σ⁴(1 + n/m), and |x(i)| ≤ |δ_r(i)| ≤ 2c_4 σ², because the number of 1's in a column can be at most 1.1 m σ².
49 Completing the proof. So we have a sum of absolutely bounded, zero-mean, bounded-variance random variables, and we can apply Bernstein's inequality: let {X_i}_{i=1}^n be a collection of independent random variables with Pr{|X_i| ≤ M} = 1 for all i. Then for any ε ≥ 0, Pr{ |Σ_{i=1}^n (X_i − E[X_i])| ≥ ε } ≤ exp( −ε² / (2(θ² + Mε/3)) ), where θ² = Σ_i E[X_i²]. Plugging in our values, Pr{ |Σ_{i ∈ [n]} x(i)| ≥ c_3 σ²((1 + n/m) + log m) } ≤ 1/m³.
50 Entrywise Bounds for Eigenvectors of Random Graphs
51 Well studied: ℓ_2 norm bounds. We already saw that if A is the adjacency matrix of a G_{n,p} graph, then ‖A − E(A)‖ ≤ 3√(np); there is a lot of research on similar bounds. Here v = v_1(E(A)) = (1/√n)·1. Question: what is u = v_1(A)? Goal: study ‖u − v‖_∞ = max_{i ∈ [n]} |u(i) − v(i)|, a potentially useful notion of spectral stability.
52 Can ℓ_2 give ℓ_∞? Not directly! The spectral norm bound on ‖A − E(A)‖ can be converted to a bound on ‖u − v‖. The best bound we can get is ‖u − v‖_∞ ≤ ‖u − v‖ ≤ 3/√(np). Too weak: 1/√(np) is much larger than 1/√n.
53 Eigenvector of a Random Graph. Figure: G_{400, 0.2}.
54 Our result. Let A be the adjacency matrix of a G_{n,p} graph, and u = v_1(A). Then with probability 1 − o(1), for all i, u(i) = (1/√n)(1 ± ε), where ε = c_2 (log n / log np) √(log n / np), p ≥ log^6 n / n, and c_2 is a constant. Essentially optimal.
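An empirical check of this statement (my own illustration, not from the slides); n = 400 and p = 0.2 are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 400, 0.2
coins = rng.random((n, n)) < p
A = np.triu(coins, k=1).astype(float)
A = A + A.T                              # adjacency matrix of one G_{n,p} sample

vals, vecs = np.linalg.eigh(A)
u = vecs[:, -1]                          # eigenvector of the largest eigenvalue
u = u * np.sign(u.sum())                 # fix the sign so the entries are positive
print(vals[-1] / (n * p))                # lambda_1 is close to np
print((u * np.sqrt(n)).min(), (u * np.sqrt(n)).max())   # entries stay in a narrow window around 1/sqrt(n)
```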
55-56 Proof. We only need a few elementary properties. Let Δ = 2√(log n / np). With high probability: the degree e(i) = np(1 ± Δ) for all i ∈ [n]; |e(A, B) − p|A||B|| ≤ 2√(np)√(|A||B|) for all A, B; and λ_1(A) ≈ np. Normalize u = v_1(A) so that max_i u(i) = u(1) = 1. Then Au ≈ (np)u, so (Au)(1) ≈ (np)u(1), i.e. Σ_i A(1, i) u(i) = Σ_{i ∈ N(1)} u(i) ≈ np. Claim: at least np/2 vertices of N(1) have u(i) ≥ 1/2. If not, then Σ_{i ∈ N(1)} u(i) ≤ (np/2)·1 + (np(1 + Δ) − np/2)·(1/2) ≤ np(3/4 + Δ/2) < np, contradicting Σ_{i ∈ N(1)} u(i) ≈ np.
57 Proof (contd.). Idea: extend the argument to successive neighborhood sets. We define a sequence of sets {S_t} for t = 1, ...: S_1 = {1}, and S_{t+1} = {i : i ∈ N(S_t) and u(i) ≥ 1/(c(t + 1))}. How quickly does S_{t+1} grow? Lemma: let t* be the last index such that |S_{t*}| ≤ 2n/3. For all t ≤ t*, |S_{t+1}| ≥ (np)|S_t| / (9t²). Exponential increase!
58 Connection to Clustering. Experiments show that for our models, no clean-up is necessary at all. Needed: subtle entrywise bounds for the second (and smaller) eigenvectors in the planted model. Figure: second eigenvector of a graph with two clusters.
59 Connection to Clustering. We can show this for models with stronger separation conditions. Theorem: assume σ² = Ω(1/n). Then the second eigenvector provides a clean clustering if ‖µ_r − µ_s‖² ≥ σ^{2/3} log n. This is stronger than the standard assumption ‖µ_r − µ_s‖² ≥ σ² log n.
60 Future Work. 1. Clustering without clean-up. 2. Clustering below the variance bound Ω(σ²). 3. A Chernoff-type bound for the entrywise error? Algorithmic applications?
61 Thanks!