Separating Populations with Wide Data: a Spectral Analysis


Separating Populations with Wide Data: a Spectral Analysis
Avrim Blum, Amin Coja-Oghlan, Alan Frieze, and Shuheng Zhou
Carnegie Mellon University, Pittsburgh, PA 15213, USA

Abstract. In this paper, we consider the problem of partitioning a small data sample drawn from a mixture of k product distributions. We are interested in the case that individual features are of low average quality γ, and we want to use as few of them as possible to correctly partition the sample. We analyze a spectral technique that is able to approximately optimize the total data size (the product of the number of data points n and the number of features K) needed to correctly perform this partitioning, as a function of 1/γ, for K > n. Our goal is motivated by an application in clustering individuals according to their population of origin using markers, when the divergence between any two of the populations is small.

1 Introduction
We explore a type of classification problem that arises in the context of computational biology. The problem is that we are given a small sample of size n, e.g., DNA of n individuals (think of n in the hundreds or thousands), each described by the values of K features or markers, e.g., SNPs (Single Nucleotide Polymorphisms; think of K as an order of magnitude larger than n). Our goal is to use these features to classify the individuals according to their population of origin. Features have slightly different probabilities depending on which population the individual belongs to, and are assumed to be independent of each other (i.e., our data is a small sample from a mixture of k very similar product distributions). The objective we consider is to minimize the total data size D = nK needed to correctly classify the individuals in the sample as a function of the average quality γ of the features, under the assumption that K > n. Throughout the paper, we use p_i^j and µ_i^j as shorthands for p_i^(j) and µ_i^(j), respectively.

Statistical Model: We have k probability spaces Ω_1, ..., Ω_k over the set {0, 1}^K. Further, the components (features) of z ∈ Ω_t are independent and Pr_{Ω_t}[z_i = 1] = p_t^i (1 ≤ t ≤ k, 1 ≤ i ≤ K). Hence, the probability spaces Ω_1, ..., Ω_k comprise the distributions of the features for each of the k populations. Moreover, the input of the algorithm consists of a collection (mixture) of n = Σ_{t=1}^k N_t unlabeled samples, N_t points from Ω_t, and the algorithm is to determine, for each data point, from which of Ω_1, ..., Ω_k it was chosen. In general we do not assume that N_1, ..., N_k are revealed to the algorithm, but we do require some bounds on their relative sizes.

Supported in part by the NSF under grant CCF-05149. Supported by the German Research Foundation under grant CO 646. Supported in part by the NSF under grant CCF-050793. Supported in part by the NSF under grants CCF-065879 and CNF 043538.
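As an illustration of the statistical model (this sketch and its function names, such as sample_mixture, are ours and not part of the paper), the mixture can be simulated and its divergence γ, as defined in (1) below, computed in a few lines of NumPy:

```python
import numpy as np

def sample_mixture(P, sizes, seed=None):
    """Draw n = sum(sizes) unlabeled samples from a mixture of k product
    distributions over {0,1}^K.  P is a (k, K) array with P[t, i] = Pr[z_i = 1]
    under Omega_t; sizes = [N_1, ..., N_k]."""
    rng = np.random.default_rng(seed)
    blocks, labels = [], []
    for t, N_t in enumerate(sizes):
        blocks.append((rng.random((N_t, P.shape[1])) < P[t]).astype(int))
        labels += [t] * N_t
    return np.vstack(blocks), np.array(labels)   # data matrix and hidden labels

def divergence(P):
    """The divergence gamma of equation (1): the minimum over population pairs
    of (1/K) * ||p_s - p_t||_2^2."""
    k, K = P.shape
    return min(((P[s] - P[t]) ** 2).sum() / K
               for s in range(k) for t in range(s + 1, k))
```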

An important parameter of the probability ensemble Ω_1, ..., Ω_k is the measure of divergence
γ = min_{1 ≤ s < t ≤ k} (1/K) Σ_{i=1}^K (p_s^i − p_t^i)²   (1)
between any two distributions. Note that Kγ measures the minimum squared Euclidean distance between the means of any two distributions and thus represents their separation. Further, let N = n/k (so if the populations were balanced we would have N of each type) and assume from now on that kn < K. Let D = nK denote the size of the data set. In addition, let σ² = max_{i,t} p_t^i(1 − p_t^i) denote the maximum variance of any random bit.

The biological context for this problem is that we are given DNA information from n individuals from k populations of origin and we wish to classify each individual into the correct category. DNA contains a series of markers called SNPs, each of which has two variants (alleles). Given the population of origin of an individual, the genotypes can reasonably be assumed to be generated by drawing alleles independently from the appropriate distribution. The following theorem gives a sufficient condition for a balanced (N_1 = N_2) input instance when k = 2.

Theorem 1. (Zhou 06 [?]) Assume N_1 = N_2 = N. If K = Ω(ln N / γ) and KN = Ω(ln N log log N / γ²), then with probability 1 − 1/poly(N), among all balanced cuts in the complete graph formed on the sample individuals, the maximum-weight cut corresponds to the partition of the individuals according to their population of origin. Here the weight of a cut is the sum of weights across all edges in the cut, and the edge weight equals the Hamming distance between the bit vectors of the two endpoints.

Variants of the above theorem, based on a model that allows two random draws from each SNP for an individual, are given in [?,?]. In particular, notice that edge weights based on the inner product of two individuals' bit vectors correspond to the sample covariance, in which case the max-cut corresponds to the correct partition [?] with high probability. Finding a max-cut is computationally intractable; hence, in the same paper [?], a hill-climbing algorithm is given that finds the correct partition for balanced input instances, but with a stronger requirement on the sizes of both K and nK.

A Spectral Approach: In this paper, we construct two simpler algorithms using spectral techniques, attempting to reproduce the conditions above. In particular, we study the requirements on the parameters of the model (namely, γ, N, k, and K) that allow us to classify every individual correctly and efficiently with high probability. The two algorithms, CLASSIFY and PARTITION, compare as follows. Both algorithms are based on spectral methods originally developed for graph partitioning. More precisely, Theorem 2 is based on computing the singular vectors corresponding to the two largest singular values of the n × K input random matrix. The procedure is conceptually simple, easy to implement, and efficient in practice. For simplicity, Procedure Classify assumes the separation parameter γ is known in order to decide which singular vector to examine; in practice, one can just try both singular vectors, as we do in the simulations. The proof techniques for Theorem 2, however, are difficult to extend to the case of multiple populations, i.e., k > 2. Procedure Partition is based on computing a rank-k

approximation of the input random matrix and can cope with a mixture of a constant number of populations. It is more intricate to implement and run than Classify. It does not require γ as an input; it only requires that the constant k be given. We prove the following theorems.

Theorem 2. Let ω = min(N_1, N_2)/n and let ω_min be a lower bound on ω. Let γ be given. Assume that K > n ln n and k = 2. Procedure CLASSIFY separates the two populations w.h.p. when n = Ω(σ²/(γ ω_min ω)), where σ² is the largest variance of any random bit, i.e., σ² = max_{i,t} p_t^i(1 − p_t^i). Thus if the populations are roughly balanced, then n ≥ cσ²/γ suffices for some constant c. This implies that the data required is D = nK = O(ln n · σ⁴/(γ² ω² ω_min²)). Letting P_s = (p_s^i)_{i=1,...,K}, we have
‖P_1 − P_2‖² = Kγ = Σ_{i=1}^K (p_1^i − p_2^i)² ≥ σ² ln n / (ω_min ω).   (2)

Theorem 3. Let ω = min(N_1, ..., N_k)/n. There is a polynomial time algorithm PARTITION that satisfies the following. Suppose that K > n log n, γ > K^{-2}, n > C_k σ²/(γω) for some large enough constant C_k, and ω = Ω(1). Then, given the empirical n × K matrix comprising the K features for each of the n individuals along with the parameter k, PARTITION separates the k populations correctly w.h.p.

Summary and Future Direction: Note that, unlike Theorem 1, both Theorem 2 and Theorem 3 require a lower bound on n, even when k = 2 and the input instance is balanced. We illustrate through simulations that this does not seem to be a fundamental constraint of the spectral techniques; our experimental results show that even when n is small, by increasing K so that nK = Ω(1/γ²), one can classify a mixture of two populations using the ideas in Procedure Classify with a success rate reaching an oracle curve, which is computed assuming the distributions are known. Here the success rate means the ratio between the number of correctly classified individuals and N. Exploring the tradeoffs between n and K that are sufficient for classification when the sample size n is small is of both theoretical interest and practical value.

1.1 Related Work
In their seminal paper [?], Pritchard, Stephens, and Donnelly presented a model-based clustering method to separate populations using genotype data. They assume that observations from each cluster are drawn at random from some parametric model. Inference for the parameters corresponding to each population is done jointly with inference for the cluster membership of each individual, and for the number k of components in the mixture, using Bayesian methods. The idea of exploiting the eigenvectors corresponding to the first two eigenvalues of the adjacency matrix to partition graphs goes back to the work of Fiedler [?], and has been used in heuristics for various NP-hard graph partitioning problems (e.g., [?]). The main difference between graph partitioning problems and the classification problem

that we study is that the matrices occurring in graph partitioning are symmetric and hence diagonalizable, while our input matrix is rectangular in general. Thus, the contribution of Theorem 2 is to show that a conceptually simple and efficient algorithm based on singular value decompositions performs well in the framework of a fairly general probabilistic model, where the probabilities of each of the K features for each of the k populations are allowed to vary. Indeed, the analysis of CLASSIFY requires exploring new ideas, such as the Separation Lemma and the normalization of the random matrix X, for generating a large gap between the top two singular values of the expectation matrix X̄ and for bounding the angle between the random singular vectors and their static counterparts; details are included in Section 2, with the analysis in the full version.

Procedure Partition and its analysis build upon the spectral techniques of McSherry [?] on graph partitioning, and an extension due to Coja-Oghlan [?]. McSherry provides a comprehensive probabilistic model and presents a spectral algorithm for solving the partitioning problem on random graphs, provided that a separation condition similar to (2) is satisfied. Indeed, [?] encompasses a considerable portion of the prior work on Graph Coloring, Minimum Bisection, and finding Maximum Clique. Moreover, McSherry's approach easily yields an algorithm that solves the classification problem studied in the present paper under assumptions similar to those of Theorem 3, provided that the algorithm is given the parameter γ as an additional input; this is actually pointed out in the conclusions of [?]. In the context of graph partitioning, an algorithm that does not need the separation parameter as an input was devised in [?]. The main difference between PARTITION and the algorithm presented in [?] is that PARTITION deals with the asymmetric n × K matrix of individuals/features, whereas [?] deals with graph partitioning (i.e., a symmetric matrix).

There are two streams of related work in the learning community. The first stream is the recent progress in learning from the point of view of clustering: given samples drawn from a mixture of well-separated Gaussians (component distributions), one aims to classify each sample according to which component distribution it comes from, as studied in [?,?,?,?,?,?,?]. This framework has been extended to more general distributions, such as log-concave distributions in [?,?] and heavy-tailed distributions in [?], as well as to more than two populations. These results focus mainly on reducing the requirement on the separation between any two centers P_1 and P_2. In contrast, we focus on the total data size D. This is motivated by previous results [?,?] stating that by acquiring enough attributes along the same set of dimensions from each component distribution, with high probability, we can correctly classify every individual. While our aim is different from those results, where n > K is almost universal and we focus on cases with K > n, we do have one common axis for comparison: the ℓ_2-distance between any two centers of the distributions. In earlier works [?,?], the separation requirement depended on the number of dimensions of each distribution; this has recently been reduced to be independent of K, the dimensionality of the distribution, for certain classes of distributions [?,?]. This is comparable to our requirement (2) for the discrete distributions.
For example, according to Theorem 7 in [?], in order to separate a mixture of two Gaussians, a separation of
‖P_1 − P_2‖ = Ω( (σ/√ω + σ) √(log n) )   (3)

is required. Besides the Gaussian and log-concave cases, a general theorem (Theorem 6 in [?]) is derived that in principle also applies to mixtures of discrete distributions. The key difficulty in applying their theorem directly to our scenario is that it relies on a concentration property of the distribution (Eq. (10) of [?]) that need not hold in our case. In addition, once the distance between any two centers is fixed (i.e., once γ is fixed in the discrete distribution), the sample size n in their algorithms is always larger than Ω((K/ω) log⁵ K) [?,?] for log-concave distributions (in fact, in Theorem 3 of [?], they discard at least this many individuals in order to correctly classify the rest of the sample), and larger than Ω(K/ω) for Gaussians [?], whereas in our case n < K always holds. Hence, our analysis allows one to obtain a clean bound on n in the discrete case.

The second stream of work is under the PAC-learning framework: given a sample generated from some target distribution Z, the goal is to output a distribution Z_1 that is close to Z in Kullback-Leibler divergence KL(Z ‖ Z_1), where Z is a mixture of product distributions over discrete domains or Gaussians [?,?,?,?,?,?,?]. These works do not require a minimal distance between any two distributions, but they do not aim to classify every sample point correctly either, and in general they require much more data.

2 A Simple Algorithm Using Singular Vectors
As described in Theorem 2, we assume we have a mixture of two product distributions. Let N_1, N_2 be the number of individuals from each population class. Our goal is to correctly classify all individuals according to their distributions. Let n = N_1 + N_2, and refer to the case N_1 = N_2 as the balanced input case. For convenience, let us redefine K so that we have O(log n) blocks of K features each (so the total number of features is really O(K log n)), and we assume that each set of K features has divergence at least γ. (If we perform this partitioning of features into blocks randomly, then with high probability the divergence changes by only a constant factor for most blocks.) The high-level idea of the algorithm is to repeat the following procedure for each block of K features: use the K features to create an n × K matrix X, such that each row X_i, i = 1, ..., n, is the feature vector of one sample point across its K dimensions. We then compute the top two left singular vectors u_1, u_2 of X and use these to classify each sample. This classification induces some probability of error f for each individual at each round, so we repeat the procedure for each of the O(log n) blocks and then take a majority vote over the different runs. Each round we require K ≥ n features, so we need O(n log n) features in total.

In more detail, we repeat the following procedure O(log n) times. Let T = Θ(ω_min √(KNγ)) be the threshold fixed in (37) in the appendix, where ω_min is a lower bound on the minimum weight min{N_1/n, N_2/n} that is independent of the actual instance. Let s_1(X), s_2(X) be the top two singular values of X.

Procedure Classify: given γ, N, ω_min; assume that N ≥ 1/γ.
Normalization: use the K features to form a random n × K matrix X; each entry X_{i,j} is a normalized random variable based on the original Bernoulli r.v. b_{i,j} ∈ {0, 1} with Pr[b_{i,j} = 1] = p_1^j for X_i ∈ P_1 and Pr[b_{i,j} = 1] = p_2^j for X_i ∈ P_2, such that X_{i,j} = (b_{i,j} + 1)/2.

Take the top two left singular vectors u_1, u_2 of X, where u_i = [u_{i,1}, ..., u_{i,n}], i = 1, 2.
1. If s_2(X) > T (with T as defined above), use u_2 to partition the individuals with 0 as the threshold, i.e., partition j ∈ [n] according to u_{2,j} < 0 or u_{2,j} ≥ 0.
2. Otherwise, use u_1 to partition, with the mixture mean M = (1/n) Σ_{j=1}^n u_{1,j} as the threshold.

Analysis of the Simple Algorithm: Our analysis is based on comparing the entries of the top two singular vectors of the normalized random n × K matrix X with those of a static matrix X̄, where each entry X̄_{i,j} = E[X_{i,j}] is the expected value of the corresponding entry of X. Hence for i = 1, ..., N_1, X̄_i = [µ_1^1, µ_1^2, ..., µ_1^K], where µ_1^j = (1 + p_1^j)/2 for all j, and for i = N_1 + 1, ..., n, X̄_i = [µ_2^1, µ_2^2, ..., µ_2^K], where µ_2^j = (1 + p_2^j)/2 for all j. We assume the divergence is exactly γ among the K chosen features in all calculations. The inspiration for this approach is the following lemma, whose proof builds on a theorem presented in a lecture note by Spielman [?]. For an n × K matrix A, let s_1(A) ≥ s_2(A) ≥ ... ≥ s_n(A) be the singular values of A. Let u_1, ..., u_n, v_1, ..., v_n be the n left and right singular vectors of X corresponding to s_1(X), ..., s_n(X), such that ‖u_i‖ = ‖v_i‖ = 1 for all i. We denote the n left and right singular vectors of X̄ by ū_1, ..., ū_n, v̄_1, ..., v̄_n.

Lemma 4. Let X be the random n × K matrix and X̄ its expected value matrix. Let A = X − X̄ be the zero-mean random matrix. Let θ_i be the angle between the two vectors [u_i, v_i] and [ū_i, v̄_i], where ‖[u_i, v_i]‖ = ‖[ū_i, v̄_i]‖ = √2 and [u, v] denotes the concatenation of the two vectors u, v. Then
‖u_i − ū_i‖ ≤ ‖[u_i, v_i] − [ū_i, v̄_i]‖ ≤ √2 θ_i ≤ 2√2 sin(θ_i) ≤ 4√2 s_1(A) / gap(i, X̄),   (4)
where gap(i, X̄) = min_{j ≠ i} |s_i(X̄) − s_j(X̄)|.

We first bound the largest singular value s_1(A) = s_1(X − X̄) of the matrix (a_{i,j}) with independent zero-mean entries, which equals the Euclidean operator norm
‖(a_{i,j})‖ := sup { Σ_{i,j} a_{i,j} x_i y_j : Σ_i x_i² ≤ 1, Σ_j y_j² ≤ 1 }.   (5)
The behavior of the largest singular value of an n × m random matrix A with i.i.d. entries is well studied. Latala [?] shows that the weakest assumption for its regular behavior is boundedness of the fourth moments of the entries, even if they are not identically distributed. Combining Theorem 5 of Latala with the concentration result in Theorem 6 by Meckes [?] proves Theorem 7, which is what we need.¹

Theorem 5. (Bounded Norm of Random Matrices [?]) For any finite n × m matrix A of independent mean zero r.v.'s a_{i,j} we have, for an absolute constant C,
E ‖(a_{i,j})‖ ≤ C ( max_i (Σ_j E a_{i,j}²)^{1/2} + max_j (Σ_i E a_{i,j}²)^{1/2} + (Σ_{i,j} E a_{i,j}⁴)^{1/4} ).   (6)

¹ One can also obtain an upper bound of O(√n + √K) on s_1(A) using a theorem by Vu [?], through the construction of an (n + K) × (n + K) square matrix out of A.
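The two classification rules above, together with the block-splitting and majority vote described at the start of this section, can be sketched in a few lines of NumPy. This is an illustrative rendering, not the authors' implementation: the threshold T is passed in as a parameter, and the alignment of labels across rounds (each round's labeling is only determined up to swapping the two groups) is handled by a simple heuristic of our own.

```python
import numpy as np

def classify_block(B, T):
    """One round of Procedure Classify on an n x K block of 0/1 features B.
    T is the threshold on the second singular value (a given parameter here)."""
    X = (B + 1) / 2.0                                 # normalization X_ij = (b_ij + 1)/2
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    u1, u2 = U[:, 0], U[:, 1]
    if s[1] > T:                                      # step 1: split u2 at 0
        return np.where(u2 >= 0, 1, -1)
    return np.where(u1 >= u1.mean(), 1, -1)           # step 2: split u1 at the mixture mean M

def classify(B_all, n_blocks, T):
    """Split the features into blocks, classify each block, and take a majority
    vote per individual.  Each round's labels are aligned with the first round."""
    votes = np.zeros(B_all.shape[0])
    reference = None
    for cols in np.array_split(np.arange(B_all.shape[1]), n_blocks):
        labels = classify_block(B_all[:, cols], T)
        if reference is None:
            reference = labels
        if np.count_nonzero(labels == reference) < labels.size / 2:
            labels = -labels                          # undo an arbitrary group swap
        votes += labels
    return np.where(votes >= 0, 1, -1)
```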

Theorem 6. (Concentration of Largest Singular Value: Bounded Range [?]) For any finite n m, where n m, matrix A, such that entries a i,j are independent r.v. supported in an interval of length at most D, then, for all t, Pr[ s 1 (A) Ms 1 (A) t] 4e t /4D. (7) Theorem 7. (Largest Singular Value of a Mean-zero Random Matrix) For any finite n K, where n K, matrix A, such that entries a i,j are independent mean zero r.v. supported in an interval of length at most D, with fourth moment upper bounded by B, then [ Pr s 1 (A) CB 1/4 K + 4D ] π + t 4e t /4 (8) for all t. Hence A C 1 B 1/4 K for an absolute constant C 1..1 Generating a Large Gap in s 1 (X), s (X) In order to apply Lemma 4 to the top two singular vectors of X and X through u 1 ū 1 4s 1(X X) s 1 (X) s (X) (9) 4s 1 (X X) u ū min ( s 1 (X) s (X), s (X) ), (10) we need to first bound s 1 (X) s (X) away from zero, since otherwise, RHSs on both (9) and (10) become unbounded. We then analyze gap(, X) = min ( s 1 (X) s (X), s (X) ). Let us first define values a, b, c that we use throughout the rest of the paper: K K K a = (µ k 1), b = µ k 1µ k, c = (µ k ). (11) k=1 k=1 For the following analysis, we can assume that a, b, c [K/4, K], given that X is normalized in Procedure Classify. We first show that normalization of X as described in Procedure Classify guarantees that not only s 1 (X) s (X) 0, but there also exists a Θ( NK) amount of gap between s 1 (X) and s (X) in Proposition 8: k=1 gap(x) := s 1 (X) s (X) = Θ( NK). (1) Proposition 8. For a normalized random matrix X, its expected value matrix X satisfies 4c0 K 5 gap(x) K, where c 0 = b ac K(a+c) is a constant, given that a, b, c [K/4, K] as defined in (11). In addition, KN 4 s 1 (X) NK K, and s 1 (X) + s (X) K. (13)
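Equation (12) above states that the top two singular values of the expectation matrix X̄ are separated by a gap of order √(NK). This is easy to see numerically: the following small sketch (our code, with arbitrary illustrative parameters chosen to match the experimental setting γ = 0.0016 used later) builds the rank-two matrix X̄ of normalized means and prints s_1(X̄) − s_2(X̄) next to √(NK), which are of the same order.

```python
import numpy as np

rng = np.random.default_rng(0)
N1 = N2 = 200; K = 2000; gamma = 0.0016
p1 = 0.5 + rng.choice([-1.0, 1.0], K) * np.sqrt(gamma) / 2   # |p1^i - p2^i| = sqrt(gamma)
p2 = 1.0 - p1
Xbar = np.vstack([np.tile((1 + p1) / 2, (N1, 1)),            # N1 copies of mu_1
                  np.tile((1 + p2) / 2, (N2, 1))])           # N2 copies of mu_2
s = np.linalg.svd(Xbar, compute_uv=False)
print(s[0] - s[1], np.sqrt((N1 + N2) / 2 * K))   # gap(Xbar) vs. sqrt(N*K): same order
print(s[2])                                      # ~0: Xbar has rank two
```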

We next state a few important results that justify Procedure Classify. Note that the left singular vectors ū i, i of X are of the form [x i,..., x i, y i,..., y i ] T : ū 1 = [x 1,...,x 1, y 1,...,y 1 ] T, and ū = [x,...,x, y,...,y ] T, (14) where x i repeats N 1 times and y i repeats N times. We first show Proposition 9 regarding signs of x i, y i, i = 1,, followed by a lemma bounding the separation of x, y. We then state the key Separation Lemma that allows us to conclude that least one of top two left singular vectors of X can be used to classify data at each round. It can be extended to cases when k >. Proposition 9. Let b as defined in (11): when b > 0, entries x 1, y 1 in ū 1 have the same sign while x, y in ū have opposite signs. ( Lemma 10. x y Cmax where C max = C x min where C x min = 1 ω 1 + ω 4ω ; y Cy min 1 +ω1ω where C y min = ) 1 ω 4 ω min ; x ω 1 4ω +ω1ω. Lemma 11. (Separation Lemma) Kγ = s 1 (X) (x 1 y 1 ) + s (X) (x y ). Proof. Let := P 1 P as in Theorem, and b = [1, 0,..., 0, 1, 0,..., 0] T, where 1 appears in the first and 1 appears in the N 1 +1 st positions. Then = X T b = [µ 1 1 µ 1, µ 1 µ,...,µ K 1 µ K ]. Given X = s 1 (X)ū 1 v T 1 + s (X)ū v T, we thus rewrite as: = X T b = s 1 (X) v 1 ū T 1 b+s (X) v ū T b = s 1(X) v 1 (x 1 y 1 )+s (X) v (x y ). The lemma follows from the fact that = Kγ and v 1, v are orthonormal. Combining Proposition 9, Lemma 10, (13), and Lemma 11, we have Corollary 1. s (X) s (X) ) = s (X) for a sufficiently small γ. Kγ cxmin+ c y min, and hence gap(, X) = min(s (X), s 1 (X) Finally, we show that the probability of error at each round for each individual is at most f = 1/10, given the sample size n as specified in Theorem. Hence by taking majority vote over the different runs for each sample, our algorithm will find the correct partition with probability 1 1/n, given that at each round we take a set of K > n independent features. We leave the detailed analysis in full version. 3 The Algorithm PARTITION 3.1 Preliminaries Let V = {1,..., n} be the set of all n individuals, and let ψ : V {1,..., k} be the map that assigns to each individual the population it belongs to. Set V t = ψ 1 (t) and N t = V t. Moreover, let E = (E vi ) 1 v n,1 i K be the n K matrix with entries E vi = p i ψ(v). For any 1 t k we let EVt = (p l t) l=1,...,k be the row of E corresponding to any v V t. In addition, let A = (a vi ) denote the empirical n K input matrix. Thus, the entries of E equal the expectations of the entries of A.

As in Theorem 3, we let γ = K^{-1} min_{1 ≤ i < j ≤ k} ‖E_{V_i} − E_{V_j}‖² and Γ = Kγ. Further, set λ² = Kσ². Then the assumption from Theorem 3 can be rephrased as
n_min Γ > C_k λ²  and  Γ > K^{-1},   (15)
where n_min = min_{1 ≤ t ≤ k} N_t and C_k signifies a sufficiently large number that depends on k only (the precise value of C_k will be specified implicitly in the course of the analysis). As in the previous section, by repeating the partitioning process log n times, we may restrict our attention to the problem of classifying a constant fraction of the individuals correctly. That is, it is sufficient to establish the following claim.

Claim 13. There is a polynomial time algorithm Partition that satisfies the following. Suppose that (15) is true. Then whp Partition(A, k) outputs a partition (S_1, ..., S_k) of V such that there exists a permutation σ with Σ_{i=1}^k |V_i △ S_{σ(i)}| < 0.001 n_min.

Let X = (x_{ij})_{1 ≤ i ≤ n, 1 ≤ j ≤ K} be an n × K matrix. By X_i we denote the i-th row (x_{i1}, ..., x_{iK}) of X. Moreover, we let ‖X‖ = max { ‖Xξ‖ : ξ ∈ R^K, ‖ξ‖ = 1 } signify the operator norm of X. A rank-k approximation of X is a matrix X̂ of rank at most k such that for any n × K matrix Y of rank at most k we have ‖X − X̂‖ ≤ ‖X − Y‖. Given X, a rank-k approximation X̂ can be computed as follows. Letting ρ = rank(X), we compute the singular value decomposition X = Σ_{i=1}^ρ λ_i ξ_i η_i^T; here (ξ_i)_{1 ≤ i ≤ ρ} is an orthonormal family in R^n, (η_i)_{1 ≤ i ≤ ρ} is an orthonormal family in R^K, and we assume that the singular values λ_i are in decreasing order (i.e., λ_1 ≥ ... ≥ λ_ρ). This can be accomplished in polynomial time within any numerical precision. Then X̂ = Σ_{i=1}^{min{k,ρ}} λ_i ξ_i η_i^T is easily verified to be a rank-k approximation. In addition to the operator norm, we are going to work with the Frobenius norm ‖X‖_F = (Σ_{i=1}^n Σ_{j=1}^K x_{ij}²)^{1/2}. Although the following fact is well known, we provide its proof for completeness.

Lemma 14. If X has rank k, then ‖X‖_F ≤ √k ‖X‖.

Proof. Let X = Σ_{i=1}^k λ_i ξ_i η_i^T be a singular value decomposition as above. Then
‖X‖_F² = Σ_{i,j} x_{ij}² = Σ_{i,j=1}^k λ_i λ_j ⟨ξ_i, ξ_j⟩ ⟨η_i, η_j⟩.
Since ξ_1, ..., ξ_k and η_1, ..., η_k are orthonormal families, we have ⟨ξ_i, ξ_j⟩ = ⟨η_i, η_j⟩ = 1 if i = j and ⟨ξ_i, ξ_j⟩ = ⟨η_i, η_j⟩ = 0 if i ≠ j. Hence, ‖X‖_F² = Σ_{i=1}^k λ_i². This implies the assertion, because λ_i ≤ ‖X‖ for all 1 ≤ i ≤ k.

3.2 Description of the algorithm
Algorithm 15. PARTITION(A, k)
Input: An n × K matrix A and the parameter k. Output: A partition S_1, ..., S_k of V.
1. Compute a rank-k approximation Â of A. For j = 1, ..., 2 log K, perform Steps 2-4:
2. Let Γ_j = K · 2^{-j} and compute Q^{(j)}(v) = {w ∈ V : ‖Â_w − Â_v‖² ≤ 0.01 Γ_j} for all v ∈ V. Then determine sets Q_1^{(j)}, ..., Q_k^{(j)} as follows: for i = 1, ..., k do
3. Pick v ∈ V \ ∪_{l=1}^{i-1} Q_l^{(j)} such that |Q^{(j)}(v) \ ∪_{l=1}^{i-1} Q_l^{(j)}| is maximum. Set Q_i^{(j)} = Q^{(j)}(v) \ ∪_{l=1}^{i-1} Q_l^{(j)} and ξ_i^{(j)} = |Q_i^{(j)}|^{-1} Σ_{w ∈ Q_i^{(j)}} Â_w.
4. Partition the entire set V as follows: first, let S_i^{(j)} = Q_i^{(j)} for all 1 ≤ i ≤ k. Then add each v ∈ V \ ∪_{l=1}^k Q_l^{(j)} to a set S_i^{(j)} such that ‖Â_v − ξ_i^{(j)}‖ is minimum. Set r_j = Σ_{i=1}^k Σ_{v ∈ S_i^{(j)}} ‖Â_v − ξ_i^{(j)}‖².
5. Let J be such that r = r_J is minimum. Return S_1^{(J)}, ..., S_k^{(J)}.

The basic idea behind PARTITION is to classify each individual v ∈ V according to its row vector Â_v in the rank-k approximation Â. That is, two individuals v, w are deemed to belong to the same population iff ‖Â_v − Â_w‖² ≤ 0.01 Γ. Hence, PARTITION tries to determine sets S_1, ..., S_k such that for any two v, w in the same set S_j the distance ‖Â_v − Â_w‖ is small. To justify this approach, we show that Â is close to the expectation E of A in the following sense.

Lemma 16. There is a constant C > 0 such that Σ_{v ∈ V} ‖Â_v − E_v‖² ≤ Ckλ² whp.

Proof. Since both Â and E have rank at most k, and therefore Â − E has rank at most 2k, Lemma 14 yields
Σ_{v ∈ V} ‖Â_v − E_v‖² = ‖Â − E‖_F² ≤ 2k ‖Â − E‖².
Furthermore, ‖Â − E‖ ≤ ‖Â − A‖ + ‖A − E‖ ≤ 2‖A − E‖, because Â is a rank-k approximation of A. As Theorem 7 implies that ‖A − E‖² ≤ Cλ²/8 for a certain constant C > 0, we thus obtain Σ_{v ∈ V} ‖Â_v − E_v‖² ≤ 8k‖A − E‖² ≤ Ckλ².
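For concreteness, the following NumPy sketch mirrors the structure of Algorithm 15: a rank-k approximation via truncated SVD, a geometric grid of candidate values Γ_j, greedy extraction of k large neighborhoods Q_1, ..., Q_k with their centers ξ_i, assignment of the remaining rows to the nearest center, and selection of the candidate partition with the smallest error r_j. It is our reading of the pseudocode above (constants follow the text), not the authors' implementation.

```python
import numpy as np

def partition(A, k):
    """Sketch of PARTITION(A, k): returns a label in {0, ..., k-1} for each row of A,
    or None if no candidate round produces k clusters."""
    n, K = A.shape
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Ahat = (U[:, :k] * s[:k]) @ Vt[:k, :]             # rank-k approximation of A
    sq = (Ahat ** 2).sum(axis=1)                      # pairwise squared distances
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * Ahat @ Ahat.T, 0.0)
    best_r, best_labels = np.inf, None
    for j in range(1, 2 * int(np.log2(K)) + 2):       # grid of candidate Gamma_j values
        Gamma_j = K * 2.0 ** (-j)
        taken = np.zeros(n, dtype=bool)
        clusters, centers = [], []
        for _ in range(k):                            # greedily extract Q_1, ..., Q_k
            free = np.flatnonzero(~taken)
            if free.size == 0:
                break
            near = d2[np.ix_(free, free)] <= 0.01 * Gamma_j
            v = free[np.argmax(near.sum(axis=1))]     # v with the largest neighborhood
            Q_i = free[d2[v, free] <= 0.01 * Gamma_j]
            clusters.append(Q_i)
            centers.append(Ahat[Q_i].mean(axis=0))    # xi_i = mean row over Q_i
            taken[Q_i] = True
        if len(clusters) < k:
            continue
        centers = np.array(centers)
        labels = ((Ahat[:, None, :] - centers[None]) ** 2).sum(-1).argmin(axis=1)
        for i, Q_i in enumerate(clusters):            # rows already in Q_i keep label i
            labels[Q_i] = i
        r_j = sum(((Ahat[labels == i] - centers[i]) ** 2).sum() for i in range(k))
        if r_j < best_r:                              # keep the partition minimizing r_j
            best_r, best_labels = r_j, labels
    return best_labels
```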

Observe that Lemma 16 implies that for most v we have Âv E v 10 6 Γ, say. For letting z = {v : Âv E v > 10 6 Γ }, we get 10 6 Γz v V Âv E v Ckλ, whence z n min due to our assumption that n min Γ kλ. Thus, most rows of  are close to the corresponding rows of the expected matrix E. Therefore, the separation assumption n > C kσ γω from Theorem 3 imlies that for most pairs of elements in different classes v V i, w V j the squared distance Âv Âw will be large (at least 0.99Γ, say). By contrast, for most pairs u, v V i of elements belonging to the same class Âv Âw will be small (at most 0.01Γ, say), because E v = E w. As the above discussion indicates, if the algorithm were given Γ as an input paramter, the procedure described in Steps 4 (with Γ j replaced by Γ ) would yield the desired partition of V. The procedure described in Steps 4 is very similar to the spectral partitioning algorithm from [?]. However, since Γ is not given to the algorithm as an input parameter, PARTITION has to estimate Γ on its own. (This is the new aspect here in comparision to [?], and this fact necessitates a significantly more involved analysis.) To this end, the outer loop goes through logk candidate values Γ j. These values are then used to obtain partitions Q (j) 1,...,Q(j) k in Steps 4. More precisely, Step uses Γ j to compute for each v V the set Q(v) of elements w such that Âw Âv 0.01Γj. Then, Step 3 tries to compute big disjoint Q (j) 1,...,Q(j) k, where each Q(j) i results from some Q(v i ). Further, Step 4 assigns all elements v not covered by Q (j) 1,..., Q(j) k to that Q (j) i whose center vector ξ (j) i is closest to Âv. In addition, Step 4 computes an error parameter r j. Finally, Step 5 outputs the partition that minimizes the error parameter r j. Thus, we need to show that eventually picking the partition whose error term r j is minimum yields a good approximation to the ideal partition V 1,...,V k. The basic reason why this is true is that the empirical mean ξ (j) i should approximate the expectation E Vi for class V i well iff Q (j) i is a good approximation of V i. Hence, if Q (j) 1,..., Q(j) k is close to V 1,..., V k, then r j = k v S (j) i Âv ξ (j) i will be about as small as  E F (cf. Lemma 16). In fact, the following lemma shows that if Γ j is close to the ideal Γ, then r j will be small. Lemma 17. If 1 Γ Γ j Γ, then r j C 0 k 3 λ for a certain constant C 0 > 0. We defer the proof of Lemma 17 to Section 3.3. Furthermore, the next lemma shows that any partition such that r j is small yields a good approximation to V 1,...,V k. Lemma 18. Let S 1,..., S k be a partition and ξ 1,..., ξ k a sequence of vectors such that k v S i ξ i Âv C 0 k 3 λ. Then there is a bijection Ξ : {1,...,k} {1,..., k} such that the following holds.

1. ξ i E V Ξ(i) 0.001Γ for all i = 1,...,k, and. k S i V Ξ(i) < 0.001n min. The proof of Lemma 18 can be found in Section 3.4. Proof of Claim 13. Since the rank k approximation  can be computed in polynomial time (within any numerical precision), Partition is a polynomial time algorithm. Hence, we just need to show that K 1 Γ K; for then Partition will eventually try a Γ j such that 1 Γ Γ j Γ, so that the claim follows from Lemmas 17 and 18. To see that K 1 Γ K, recall that we explicitly assume that Γ > K 1. Furthermore, all entries of the vectors E Vi lie between 0 and 1, whence Γ = max i<j E Vi E Vj K. 3.3 Proof of Lemma 17 Suppose that 1 Γ Γ j Γ. To ease up the notation, we omit the superscript j; thus, we let S i = S (j) i, Q i = Q (j) i for 1 i k, and Q(v) = Q (j) (v) for v V (cf. Steps 4 of Partition). The following lemma shows that there is a permutation π such that ξ i is close to E V π(i) for all 1 i k, and that the sets Qi are not too small. Lemma 19. Suppose that 1 Γ Γ j Γ. There is a bijection π : {1,..., k} {1,...,k} such that for each 1 i k we have Q i 1 V π(i) and ξ i E V π(i) 0.1Γ. Proof. For 1 i k we choose π(i) so that Q i V π(i) is maximum. We shall prove below that for all 1 l k we have ξ l E V π(l) 0.1Γ, (16) Q l max{ V i : i {1,...,k} \ π({1,...,l 1})} 0.01n min,(17) Q l V π(l) Q l 0.01n min. (18) These three inequalities imply the assertion. To see that π is a bijection, let us assume that π(l) = π(l ) for two indices 1 l < l k. Indeed, suppose that l = min π 1 (l). Then Q l V π(l) 0.01n min by (17), and thus V π(l) \ Q l 0.1n min by (18). As Q l Q l = by construction, we obtain the contradiction 0.99n min (17) Q l (18) 1.1 Q l V π(l) 1.1 V π(l) \ Q l 0.11n min. Finally, as π is bijective, (17) entails that Q l 0.9V π(l) for all 1 l k. Hence, due to (18) we obtain Q l V l 0.9 Q l 1 V π(l), as desired. The remaining task is to establish (16) (18). We proceed by induction on l. Thus, let us assume that (16) (18) hold for all l < L; we are to show that then (16) (18) are true for l = L as well. As a first step, we establish (17). To this end, consider a class V i such that i π({1,...,l 1}) and let Z i = {v V i : Âv E v 0.001Γ }. Then 0.001Γ( V i Z i ) v V i\z i Âv E v  E F Ckλ (cf. Lemma 16) whence the assumption (15) on Γ yields Z i V i 0.01n min, (19)

provided that C k is sufficiently large. Moreover, for all v Z i we have Q(v) = {w V : Âv Âw 0.01Γ j } Z i, (0) because we are assuming that Γ j Γ/. In addition, let w Q l for some l < L; since our choice of i ensures that v V i V π(l), we have Γ E V π(l) E v E v Âv + Âw Âv + ξ l Âw + ξ l E V π(l) (1). Now, the construction in Step 3 of Partition ensures that Âw ξ l 0.1 Γ. Furthermore, ξ l E V π(l) Γ/3 by induction (cf. (16)), and  v E v 0.1 Γ, because v Z i. Hence, (1) entails that Âw Âv > 0.1 Γ, so that w Q(v). Consequently, (0) yields Z i Q l = for all l < L. () Finally, let v L signify the element chosen by Step 3 of Partition to construct Q L. Then by construction Q L = Q(v L ) \ L 1 l=1 Q l Q(v) \ L 1 l=1 Q l. Therefore, L 1 (0), () Q L Q(v) \ Q l Z i (19) V i 0.01n min. l=1 As this estimate holds for all i π({1,...,l 1}), (17) follows. Thus, we know that Q L is big. As a next step, we prove (18), i.e., we show that Q L mainly consists of vertices in V π(l). To this end, let 1 i k be such that E Vi Âv L is minimum. Let Y = Q L \V i. Then for all w Y we have E w Âv L E Vi Âv. Further, since Γ E w E Vi E w Âv L + E Vi Âv L E w Âv L, we conclude that E w Âv L 1 4 Γ. On the other hand, as w Q L, we have Âw Âv L 0.01Γ. Therefore, we obtain Âw E w 0.1Γ for all w Y, so that 0.1 Y Γ w Y Âw E w  E F Lemma 16 c k λ. (3) Hence, due to our assumption (15) on Γ, (3) yields that Y < 0.01n min. Consequently, (17) entails that V i Q L 0.99 Q L, so that i = π(l). Hence, we obtain Q L V π(l) = Q L V i = Q L \ Y Q L 0.01n min, thereby establishing (18). Finally, to show (16), we note that by construction ξ L Âv L 0.01Γ and Âw Âv L 0.01Γ for all w Q L V π(l) (cf. Step 3 of Partition). Therefore, Q L V π(l) E π(l) ξ L 3 ξ L Âv L + Âw Âv L + Âw Êπ(L) w Q L V π(l) 0.06Γ Q L V π(l) + 3  E F Lemma 16 0.06Γ Q L V π(l) + 3c k λ. (4) Since Q L V π(l) 0.9n min due to (17) and (18), (4) entails that E π(l) ξ L 0.07Γ + 6c kλ n min 0.1Γ. Thus, (16) follows.

In the sequel, we shall assume without loss of generality that the map π from Lemma 19 is just the identity, i.e., π(i) = i for all i. Bootstrapping on the estimate ξ i E Vi 0.1Γ for 1 i k from Lemma 19, we derive the following stronger estimate. Corollary 0. For all 1 i k we have ξ i E Vi 100 Q i 1 v Q i Âv E v. Proof. By the Cauchy-Schwarz inequality, ξ i E Vi = Q i 1  v E Vi v Q i Q i 1/ Âv E Vi v Q i Furthermore, as ξ i E Vi 0.1Γ by Lemma 19, for all v Q i \ V i we have 1/.(5) Âv E Vi ( Âv ξ i + ξ i E Vi ) Γ 1/3, (6) because the construction of Q i in Step 3 of Partition ensures that Âv ξ i 0.01Γ. Hence, as E v E Vi Γ, (6) implies that Âv E v 0.1 Âv E Vi. Therefore, the assertion follows from (5). Corollary 1. For all v S i \ V i we have Âv ξ i 3 Âv E v. Proof. Let i l and consider a v S i V l. We shall establish below that Âv ξ i Âv ξ l. (7) Then by Lemma 19  v ξ i  v E v + E v ξ l  v E v + Γ/3, and thus Γ Ev E Vi  v ξ i + ξ i E Vi +  v E v  v E v + 3 Γ. Consequently, we obtain Âv E v 6 1 Γ, so that the assertion follows from the estimate Âv ξ i (7) Lemma 19 Âv ξ l Âv E v + E v ξ l Âv E v + Γ 3 3 Âv E v. Finally, we prove (7). If v S i V l \ Q i, then the construction of S i in Step 4 of Partition guarantees that Âv ξ i Âv ξ l, as claimed. Thus, assume that v Q i V l. Then Âv ξ i 0.15 Γ [by the definition of Q i in Step 3], max{ ξ i E Vi, ξ l E v } 1 3 Γ [by Lemma 19], E Vi E v Γ. Therefore, if Âv ξ l < Âv ξ i, then we would arrive at the contradiction Γ E V i E v E Vi ξ i + E v ξ l + ξ i ξ l Γ +  v ξ i + Âv ξ l < Γ +  v ξ i 0.99 Γ. 3 3 Thus, we conclude that Âv ξ l Âv ξ i, thereby completing the proof.

Proof of Lemma 17. Since Q i 1 V i by Lemma 19, we have the estimate k w S i V i Âw ξ i k w S i V i Cor. 0  E F + 00 k Furthermore, by Corollary 1 k v S i\v i Âv ξ i 9 [ Âw E w + E w ξ i ] S i V i Q i k Âv E v 500  E (8) F. v Q i v S i\v i Âv E v 9  E F. (9) Since  E F Ckλ by Lemma 16, the bounds (8) and (9) imply the assertion. 3.4 Proof of Lemma 18 Set S ab = S a V b for 1 a, b k. Moreover, for each 1 a k let 1 π(a) k be such that E V π(a) ξa is minimum. Then for all b π(a) we have Γ E V π(a) E V b E V π(a) ξ a + E V b ξ a E V b ξ a, (30) so that E V b ξ a Γ/. Therefore, by our assumption that k v S i ξ i  v C 0 k 3 λ, we have Γ 4 Hence, k a=1 1 b k:b π(a) k S a V π(a) = a=1 S ab k a,b=1 k a,b=1 S ab E V b ξ a v S ab E v Âv + Âv ξ a  E F + k a,b=1 v S ab Âv ξ a Lemma 16 4C 0 k 3 λ + C 0 k 3 λ C 0 k3 λ. (31) 1 a,b k:b π(a) S ab 8c 0k 3 λ Γ 0.001n min, (3) provided that C k is sufficiently large (cf. (15)). Combining (31) and (3), we obtain n min EV π(a) ξa S a V π(a) E π(a) ξ a c 0k 3 λ, whence E π(a) ξ a c 0 k3 λ n min 0.001Γ for all 1 a k, (33)

provided that C k is large enough. Thus, we have established the first two parts of the lemma. In addition, observe that (3) implies that π is bijective (because the sets S 1,..., S k are pairwise disjoint and V a n min for all 1 a k). Finally, the third assertion follows from the estimate k a,b=1 S ab E V π(a) E V π(b) (30) 8 k a,b=1 k a,b=1 S ab ( E V π(a) ξ a + E V π(b) ξ a ) S ab E V π(b) ξ a (31) 8C 0k 3 λ 0.001Γn min, where we assume once more that C k is sufficiently large. 4 Experiments We illustrate the effectiveness of spectral techniques using simulations. In particular, we explore the case when we have a mixture of two populations; we show that when NK > 1/γ and K > 1/γ, either the first or the second left singular vector of X shows an approximately correct partitioning, meaning that the success rate is well above 1/. The entry-wise expected value matrix X is: among K/ features, p i 1 > pi and for the other half, p i 1 < p i such that i, p i 1, p i { 1+α + ǫ, 1 α + ǫ }, where ǫ = 0.1α. Hence γ = α. We report results on balanced cases only, but we do observe that unbalanced cases show similar tradeoffs. For each population P, the success rate is defined as the number of individuals that are correctly classified, i.e., they belong to a group that P is the majority of that group, versus the size of the population P. Each point on the SVD curve corresponds to an average rate over 100 trials. Since we are interested in exploring the tradeoffs of N, K in all ranges (e.g., when N << K or N >> K), rather than using the threshold T in Procedure Classify that is chosen in case both N, K > 1/γ, to decide which singular vector to use, we try both u 1 and u and use the more effective one to measure the success rate at each trial. For each data point, the distribution of X is fixed across all trials and we generate an independent X K for each trial to measure success rate based on the more effective classifier between u 1 and u. One can see from the plot that when K < 1/γ, i.e., when K = 00 and 400, no matter how much we increase N, the success rate is consistently low. Note that 50/100 of success rate is equivalent to a total failure. In contrast, when N is smaller than 1/γ, as we increase K, we can always classify with a high success rate, where in general, NK > 1/γ is indeed necessary to see a high success rate. In particular, the curves for K = 5000, 500, 150 show the sharpness of the threshold behavior for increasing sample size n from below 1/Kγ to above. For each curve, we also compute the best possible classification one could hope to make if one knew in advance which features satisfied p i 1 > pi and which satisfied pi 1 < pi. These are the horizontal(ish) dotted lines

above each curve. The fact that the solid curves are approaching these informationtheoretic upper bounds shows that the spectral technique is correctly using the available information. γ=0.0016, Balanced case Succ rate(%) 50 60 70 80 90 100 K=500 K=5000 K=150 K=65 K=400 K=00 K= 5000 K= 500 K= 150 K= 65 K= 400 K= 00 0 500 1000 1500 000 500 3000 N Fig. 1. Plots show success rate as a function of N for several values of K, when γ = (0.04). Each point is an average over 100 trials. Horizontal lines ( oracles ) indicate the informationtheoretically best possible success rate for that value of K (how well one could do if one knew in advance which features satisfied p i 1 > p i and which satisfied p i 1 < p i ; they are not exactly horizontal because they are also an average over 100 runs). Vertical bars indicate the value of N for which NK = 1/γ. 5 Acknowledgments We thank John Lafferty, Frank McSherry, Roman Vershynin and Larry Wasserman for many helpful discussions on this work. A Detailed Analysis for the Simple Algorithm In this section, we first prove a proposition regarding a, b, c as defined in (11). We next provide the proof to Theorem 7 regarding the largest singular value of (X X). We then use Lemma 4 to bound the number of individuals that we misclassify at each round in Section A.. We finish by showing that with high probability, we can correctly classify all individuals by taking majority vote over O(log N) different runs. Throughout the rest of the paper, we use X, Y, H to represent random matrices, where H = XX T and Y = [ 0 X X T 0 ]. We use X, Y, H to represent the corresponding

static matrices. Let us substitute a, b, c in H = X X T, where the blocks in H from top to bottom and from left to right are of size: N 1 N 1, N 1 N, N N 1 and N N respectively: a... a b... b... H = X X T = a... a b... b b... b c... c. (34)... b... b c... c Proposition. For any choices of µ k i, ac b ; By definition, a + c b = K Proof. a + c b = k α k holds by definition. ( K K K ac b = (µ k 1) (µ k ) (µ k 1µ k ) k=1 k=1 k=1 α k, where α k = µ k 1 µk. (35) K = k=1(µ k 1 µk 1 ) + K ((µ k 1 µj ) + (µ j 1 µk ) ) j k k=1(µ k 1 µk 1 ) + j k = (µ k 1 µj ) + (µ j 1 µk ) µ k 1 µk µj 1 µj = j k j k(µ k 1 µj µj 1 µk ) 0. ) µ k 1 µk µj 1 µj Remark 3. Both matrices of X and X X T have rank at most two. When ac = b, H has rank 1. A.1 Proof of Theorem 7 Proof of Theorem 7. By having an upper bound on both maximum variance and fourth moment of any entry, we have the following corollary of Theorem 5. Corollary 4. (Largest Singular Value: Bounded Fourth Moment [?]) For any finite n m, where n m, matrix of independent mean zero r.v. s a i,j, such that the maximum variances of any entry is at most σ, and each entry has a finite fourth moment B we have ( E (a i,j ) C σ( m + n) + (mnb) 1/4) CB 1/4 m (36) for an absolute constant C. Remark 5. The requirement that σ is upper bounded is not essential. The conclusion in Corollary 4 works so long as fourth moment is bounded by B.

Let Ms 1 (A) be the median of s 1 (A). Following a calculation from [?], we have E[s 1 (A)] Ms 1 (A) E[ s 1 (A) Ms 1 (A) ] = 4 0 0 Pr[ s 1 (A) Ms 1 (A) t]dt e t /4D dt = 4D π, where D 1 for Bernoulli random variables that we consider. This allows us to conclude Theorem 7. A. Correctness of Classification for the Simple Algorithm We now prove correctness of our algorithm. We first show how to choose T for Procedure Classify. Let B denote the fourth moment bound for a single random variable in the mean zero random matrix X X ; for the type of normalized Bernoulli r.v.s that we care about, B is in the order of σ, where σ is defined in Theorem. Let Nγ be a large enough constant. Let s 1 (X X) C 0 K, where C0 = C 1 B 1/4 as defined in Theorem 7 and let the threshold which requires that T = C 3 KNγ 15C 0 K, (37) C 3 Nγ 5C 0, where C 3 satisfies (41). (38) Following Lemma 3, (56), (58), and Proposition 30, we have We have two cases, s (X) s (X) s 1 (X X) C 0 K. (39) 1. When s (X) T, by Lemma 10 and the fact that s (X) s (X)+s 1 (X X) T + C 0 K 16T 15, we have s (X) x y 56T 5 C max 18C 3KγC max 18C 3Kγ4 (40) 5 5ω min for C max as defined in Lemma 10. We want s (X) x y 3Kγ 4 so long as 18C3KγCmax 5 18C3Kγ4 5ω min 3Kγ 4, which is true if. This holds C 3 675ω min 048 ; thus we take C 3 = 675ω min from this point on. (41) 048 It follows from Lemma 11 that s 1 (X) x 1 y 1 Kγ 4. Hence by (13) x 1 y 1 Kγ s 1 (X) Kγ K 1 γ. (4)

Thus the condition of Theorem 6 holds with c = 1, so long as Nγ 048C 1 B, (43) 3ω min due to (38) and (41); This is a weaker condition than (54) for f < 1.. When s (X) T, we have s (X) s (X) s 1 (X X) T C 0 K 14T This satisfies the condition of Theorem 8, with c 3 = 14 C 3 15 = 7 3ω min 16. Let us first denote the first singular vector u 1 and its noise vector ǫ as follows: u T 1 = (x + δ 1,..., x + δ N1, y + τ 1,..., y + τ N ), ǫ T = (δ 1,...,δ N1, τ 1,...,τ N ). It turns out that we only need to use the mixture mean N1 M = (x + δ i) + N (y + τ i) to decide which side to put a node, i.e., to partition j [] according to u 1,j < M or u 1,j M, given that N 1 /N is a constant; Misclassifying any entry will contribute Ω ( γ ) amount to ū1 u 1. Theorem 6. Assume w.l.o.g. that N 1 N and K. Let ω 1 = N 1 / and ω = N /. Suppose x 1 y 1 c γ for some constant c = 1. By requiring N 048C 1 B as in (43), and 3γω min 15 ; (44) N 1 c 1σ fc γω 1ω, or equivalently c 1σ fc γω ω 1 = 5C 1 B fc γω ω1, (45) where c 1 = 5C1B1/4 c0σ for C 1 specified in Theorem 7 and c 0 specified in Proposition 8, we can classify the two population using the mixture mean M with the error factor at most f for N 1, N respectively whp. By Lemma 4 and Theorem 7, we immediately have the following claim. Claim. For c 1 chosen as in Theorem 6, ǫ = N 1 δ i + N τ i c 1 σ N. 1 Proof. Given that c 1 = 5C1 4 B such that C c0σ 1 appears in Theorem 7 and c 0 appears in Proposition 8, N1 N δi + τi = u 1 ū 1 θ 1 sin(θ 1 ) This allows us to conclude the claim. 4s 1(X X) gap(1, X) 4C 1 4 1 B K = c 1σ. 4c 0 K/5 N

We need the following lemma, proof of which appears in Appendix C. Lemma 7. Assume that K and Condition (45) in Theorem 6, we have M x N (1 γ) y x, y M N 1(1 γ) y x. (46) Proof of Theorem 6. Recall that the largest ū 1, ū have the form of [x,..., x, y,...,y], where x repeats N 1 times and y repeats N times; hence w.l.o.g., assume that x < y, we have i, s.t. x + δ i > M, it contributes δ i > M x N c γ 8N 3 to ǫ, (47) i, s.t. y τ i < M, it contributes δ i > M y N c γ 8N 3 to ǫ. (48) Hence the total number of entries that goes above M from P 1, and those goes below M from P can not be too many since their total contribution is upper bounded by ǫ = u 1 ū 1. Let l 1 be the number of misclassified entries from N 1, i.e., those described in (47), by Lemma 4, l 1 N c γ 8N 3 l 1 M x ǫ c 1 σ N. (49) Thus given that N 1 8c 1 σ fc γ c 1 σ fω c γ ; hence it suffices to guarantee that l 1 c 1 σ ω c γ fn 1. We next bound the number of entries from P that goes below M, which can not be too many either; let l be the number of misclassified entries from P, hence by requiring l N 1c γ 8N 3 l M y ǫ c 1σ N, (50) N c 1σ fω 1 c γ, (51) it suffices to guarantee that l c 1 σ ω c γ fn. Condition (51) is equivalent to Thus by requiring we have satisfied all requirements. Combining Lemma 4 and Corollary 1, we have N 1 = N ω 1 ω c 1σ fω 1 ω c γ, (5) N 1 c 1σ fc γω ω 1, (53)

Claim. Given that s (X) c 3 KNγ, u ū 16s1(X X) s (X) 16C 0 K c 3 KNγ. This allows us to prove the following theorem. Let the classification error factor be the number of misclassified individuals from one group over total amount of people in that group. Theorem 8. Assume N 1 N and K. Let ω 1 = N 1 / and ω = N /. Let s (X) c 3 KNγ, where c3 = 7 3ω min 16 and ω min is the minimum possible weight allowed by the algorithm. By requiring 360C 0 ω min fγ ( ω ω 1 + 1 ) ( = Θ σ fγω min ω 1 ), (54) we can classify the two population using 0 to separate components in u, with error factor at most f for both P 1, P whp. Proof. Let l 1, l be the number of misclassified entries from P 1 and P respectively; Cx min they each contribute at least, and Cy min amount to u ū, and hence by Claim A., l 1 C x min u ū 16C 0 K c 3 KNγ 16C 0. (55) Nγ ( Hence l 1 3C 0 c fn 3 γcx 1 given that N 16C 0 min c 3 fγ 4 ω1 ω + 1 C Similarly, by Claim A., we have l x min u ū and thus l 3C 0 c 3γCy min ( ) N f so long as N 16C 0 c 3 fγ 4 ω ω 1 + 1 ; the bound on follows by plugging in c 3 = 7 3ω min 16. Finally, ( ) σ Theorem 9. Given a set of n Ω γfωω min individuals, by trying Procedure Classify for log n rounds, with probability of error at each round for each individual being f = 1/10, where each round we take a set of K > n independent features, and by taking majority vote over the different runs for each sample, our algorithm will find the correct partition with probability 1 1/n. Proof. A sample is put in the wrong side with a probability 1/10 at each round. Let E i be the event that sample i is misclassified for more than log n times, thus Pr[E i ] = ( 1 10) log n 1/n 3.3 ; hence by union bound, with probability 1 1/n, none of the individuals is misclassified. c 3 ). B More Proofs for the Simple Algorithm Classify B.1 Proof of Lemma 4 Let u 1,..., u n, v 1,...,v n be the n left and right singular vectors of X, corresponding to s 1 (X) s (X)... s n (X), we have for i, u i = 1, v i = 1 such that X T u i = s i (X)v i and Xv i = s i (X)u i.

Before we prove Lemma 4, given an n K matrix X, where n < K, let us first define H = XX T and a block matrix [ ] 0 X Y = X T. (56) 0 (+K) (+K) Recall that singular values of a real n K matrix X are exactly the nonnegative square roots of the n largest eigenvalues of H = XX T, i.e, s i (X) = λ i (H), i = 1,...,n, given that Hu i = XX T u i = s i (X)Xv i = s i(x)u i. (57) Hence the left singular vectors u 1,..., u n of X are eigenvectors of H corresponding to λ i (H) = s i (X). We next show that the first n eigenvalues of Y and their corresponding eigenvectors: Y [ ui and hence ] = v i [ 0 X X T 0 ] [ ] ui = v i [ ] Xvi X T = u i [ ] si (X)u i = s s i (X)v i (X) i [ ui v i ], (58) Proposition 30. The largest n eigenvalues of Y are s 1 (X),...,s n (X) with corresponding eigenvectors [u i, v i ], i = 1,..., n, where u i, v i, i, are left and right singular vectors of X corresponding to s i (X). In fact both ±s i (X) are eigenvalues of Y, which is irrelevant. Proof of Lemma 4. We first state a theorem, whose statement appears in a lecture note by Spielman [?], with a slight modification (off by a factor on RHS). Our proof for this theorem is included here for completeness. It is known that for any real symmetric matrix, there exist a set of n orthonormal eigenvectors. Theorem 31. (Modified Version of Spielman [?]) For A and M being two symmetric matrices and E = M A. Let λ 1 (A) λ (A)... λ n (A) be eigenvalues of A, with orthonormal eigenvectors v 1, v,..., v n and let λ 1 (M) λ (M)... λ n (M) be eigenvalues of M and w 1, w,..., w n be the corresponding orthonormal eigenvectors of M, with θ i = (v i, w i ). Then θ i sin(θ i ) E ii gap(i, A) E + i gap(i, A) E gap(i, A) (59) where gap(i, A) = min j i λ i (A) λ j (A) and i = λ i (M) λ i (A). Let us apply Theorem 31 to the symmetric matrix Y in (56). In particular, we only compare the first n eigenvectors of Y of Y. For the numerator of RHS of (59), we have E = Y Y, and E = Y Y = s 1 (Y Y) by a derivation similar to (58), where eigenvectors of E are concatenations of left and right singular vectors of X X ; For the denominator, we have by Proposition 30, gap(i, Y) = min j i λ i (Y) λ j (Y) = min j i s i (X) s j (X). We first prove the following claim.
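Proposition 30 above, together with the identity s_i(X)² = λ_i(H) from (57), is easy to verify numerically. The following small sketch (our code, with arbitrary dimensions) builds H and the block matrix Y for a random X and checks both facts.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 6, 15
X = rng.standard_normal((n, K))
H = X @ X.T
Y = np.block([[np.zeros((n, n)), X],
              [X.T, np.zeros((K, K))]])
s = np.linalg.svd(X, compute_uv=False)                           # s_1(X) >= ... >= s_n(X)
assert np.allclose(np.sort(np.linalg.eigvalsh(H))[::-1], s**2)   # lambda_i(H) = s_i(X)^2
top_n = np.sort(np.linalg.eigvalsh(Y))[::-1][:n]
assert np.allclose(top_n, s)                                     # Proposition 30
```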