Online Principal Components Analysis

Christos Boutsidis (Yahoo! Labs, New York, NY, boutsidis@yahooinc.com), Dan Garber (Technion - Israel Institute of Technology, dangar@tx.technion.ac.il), Zohar Karnin (Yahoo! Labs, Haifa, Israel, zkarnin@yahoo-inc.com), Edo Liberty (Yahoo! Labs, New York, NY, edo@yahoo-inc.com)

Abstract

We consider the online version of the well known Principal Component Analysis (PCA) problem. In standard PCA, the input to the problem is a set of d-dimensional vectors $X = [x_1, \ldots, x_n]$ and a target dimension $k < d$; the output is a set of $k$-dimensional vectors $Y = [y_1, \ldots, y_n]$ that minimize the reconstruction error $\min_\Phi \sum_i \|x_i - \Phi y_i\|_2^2$. Here, $\Phi \in \mathbb{R}^{d \times k}$ is restricted to being isometric. The global minimum of this quantity, OPT, is obtainable by offline PCA. In online PCA (OPCA) the setting is identical except for two differences: (i) the vectors $x_t$ are presented to the algorithm one by one, and for every presented $x_t$ the algorithm must output a vector $y_t$ before receiving $x_{t+1}$; (ii) the output vectors $y_t$ are $\ell$-dimensional with $\ell \ge k$, to compensate for the handicap of operating online. To the best of our knowledge, this paper is the first to consider this setting of OPCA. Our algorithm produces $y_t \in \mathbb{R}^\ell$ with $\ell = O(k \cdot \mathrm{poly}(1/\varepsilon))$ such that $\mathrm{ALG} \le \mathrm{OPT} + \varepsilon \|X\|_F^2$.

1 Introduction

Principal Component Analysis (PCA) is one of the most well known and widely used methods in scientific computing. It is used for dimension reduction, signal denoising, regression, correlation analysis, visualization, etc. [8]. It can be described in many ways, but one that is particularly appealing in the context of online algorithms is the following. Given $n$ high-dimensional vectors $x_1, \ldots, x_n \in \mathbb{R}^d$ and a target dimension $k < d$, produce $n$ low-dimensional vectors $y_1, \ldots, y_n \in \mathbb{R}^k$ such that the reconstruction error is minimized. To define the reconstruction error, let $O_{d,k}$ denote the set of $d \times k$ isometric embedding matrices: $O_{d,k} = \{\Phi \in \mathbb{R}^{d \times k} \mid \forall y \in \mathbb{R}^k,\ \|\Phi y\|_2 = \|y\|_2\}$. Then, the PCA reconstruction error is:

(1.1)  $\min_{\Phi \in O_{d,k}} \sum_{t=1}^n \|x_t - \Phi y_t\|_2^2$.

PCA can be cast as an optimization problem. Given $x_1, \ldots, x_n \in \mathbb{R}^d$ and $k < d$, the PCA problem is: find $y_1, \ldots, y_n \in \mathbb{R}^k$ that are the optimal solution to the problem:

(1.2)  $\min_{y_t \in \mathbb{R}^k} \min_{\Phi \in O_{d,k}} \sum_{t=1}^n \|x_t - \Phi y_t\|_2^2$.

The solution to this problem goes through computing the Singular Value Decomposition (SVD) of the matrix $X = [x_1, \ldots, x_n] \in \mathbb{R}^{d \times n}$. Throughout the paper we assume that $d < n$ and $\mathrm{rank}(X) = d$. Let $U_k \in \mathbb{R}^{d \times k}$ contain only the $k < d$ left singular vectors corresponding to the top $k$ singular values of $X$. Then, the optimal PCA vectors $y_t$ are $y_t = U_k^\top x_t$. Equivalently, if $Y = [y_1, \ldots, y_n] \in \mathbb{R}^{k \times n}$, then $Y = U_k^\top X$. Also, the best isometry matrix is $\Phi = U_k$. We denote this optimal solution by OPT:

(1.3)  $\mathrm{OPT} := \min_{y_t \in \mathbb{R}^k} \min_{\Phi \in O_{d,k}} \sum_{t=1}^n \|x_t - \Phi y_t\|_2^2$.

Computing the optimal vectors $y_t = U_k^\top x_t$ requires several passes over the matrix $X$. Power-iteration based methods for computing $U_k$, for example, are memory and CPU efficient but require $\omega(1)$ passes over $X$. Two passes also naively suffice: one to compute $X X^\top$, from which $U_k$ is computed, and one to generate the mapping $y_t = U_k^\top x_t$. The bottleneck is in computing $X X^\top$, which demands $\Omega(d^2)$ auxiliary space in memory and $\Omega(d^2)$ operations per vector $x_t$ (assuming they are dense). This is prohibitive even for moderate values of $d$. A significant amount of research went into reducing the computational overhead of obtaining a good approximation for $U_k$ in one pass [9, 7, 6, 18, 17, 4, 14, 11, 13, 10]. Still, for all those algorithms, a second pass is needed to produce the reduced-dimension vectors $y_t$, with the exception of [4, 11, 18], which we discuss shortly.
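To make the offline baseline concrete, here is a minimal numpy sketch (ours, not from the paper) of Eqns. (1.2)-(1.3): it computes $U_k$ via one SVD, the optimal outputs $Y = U_k^\top X$, and the value OPT as the sum of the discarded squared singular values.

```python
import numpy as np

def offline_pca(X, k):
    """Offline PCA baseline: X is d x n, k is the target dimension.

    Returns Y (k x n), the isometry Phi = U_k (d x k), and OPT, the minimal
    reconstruction error sum_t ||x_t - Phi y_t||_2^2.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Phi = U[:, :k]                 # top-k left singular vectors of X
    Y = Phi.T @ X                  # optimal low-dimensional vectors
    opt = np.sum(s[k:] ** 2)       # OPT = sum of the discarded squared singular values
    return Y, Phi, opt

# sanity check: OPT equals the reconstruction error of Phi @ Y
d, n, k = 20, 100, 3
X = np.random.randn(d, n)
Y, Phi, opt = offline_pca(X, k)
assert np.isclose(opt, np.linalg.norm(X - Phi @ Y, "fro") ** 2)
```

This is exactly the two-pass computation discussed above: one pass (here, the SVD) to obtain $U_k$ and a second pass to map every $x_t$ to $y_t$.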

1.1 Online PCA

In our setting of online PCA, the input, output, and cost function are identical to the offline case. The only difference is that the vectors $x_t$ are presented to the algorithm one by one, and for every presented $x_t$ the algorithm must output a vector $y_t$ before receiving $x_{t+1}$. Once some vector $y_t$ is returned by the algorithm, it cannot be changed at any later time. To compensate for the handicap of operating online, the target dimension of the algorithm, $\ell$, is potentially larger than $k$. This is a natural model for PCA when a downstream online algorithm is applied to $y_t$. Examples include online algorithms for clustering ($k$-means, $k$-median), regression, classification (SVM, logistic regression), facility location, and $k$-server. Our model departs from earlier definitions of online PCA. We shortly review three other definitions, point out the differences, and highlight their limitations in this context.

1.1.1 Random projections

Most similar to our work is the result of Sarlos [18], which uses the random projection method to perform online PCA. Using random projections, $y_t = S^\top x_t$ where $S \in \mathbb{R}^{d \times \ell}$ is generated randomly and independently from the data. For example, each element of $S$ can be $\pm 1$ with equal probability (Theorem 4.4 in [4]) or drawn from a normal (Gaussian) distribution (Theorem 10.5 in [11]). Then, with constant probability and for $\ell = \Theta(k/\varepsilon)$,

$\min_{\Psi \in \mathbb{R}^{d \times \ell}} \sum_{t=1}^n \|x_t - \Psi y_t\|_2^2 \le (1+\varepsilon)\,\mathrm{OPT}$.

Here, the best reconstruction matrix is $\Psi = X Y^{+}$, which is not an isometry in general.(1) We claim that this seemingly minute departure from our model is actually very significant. To see this, let $\Phi \in O_{d,k}$ be the optimal PCA projection for $X$. Consider $y_t \in \mathbb{R}^\ell$ whose first $k$ coordinates contain $\Phi^\top x_t$ and whose remaining $\ell - k$ coordinates contain an arbitrary vector $z_t \in \mathbb{R}^{\ell - k}$. In the case where $\|z_t\|_2 \gg \|\Phi^\top x_t\|_2$, the geometric arrangement of the $y_t$ potentially shares very little with that of the signal in $x_t$. Yet, $\min_{\Psi \in \mathbb{R}^{d \times \ell}} \sum_{t=1}^n \|x_t - \Psi y_t\|_2^2 \le \mathrm{OPT}$, by setting $\Psi = [\Phi \ \ 0_{d \times (\ell-k)}]$. This would have been impossible had $\Psi$ been restricted to being an isometry.

(1) The notation $Y^{+}$ stands for the Moore-Penrose inverse (pseudo inverse) of $Y$.

1.1.2 Regret minimization

A regret minimization approach to online PCA was investigated in [19, 16]. In their setting of online PCA, at time $t$, before receiving the vector $x_t$, the algorithm produces a rank-$k$ projection matrix $P_t \in \mathbb{R}^{d \times d}$.(2) The authors present two methods for computing projections $P_t$ such that the quantity $\sum_t \|x_t - P_t x_t\|_2^2$ converges to OPT in the usual no-regret sense. Since each $P_t$ can be written as $P_t = U_t U_t^\top$ for $U_t \in O_{d,k}$, it would seem that setting $y_t = U_t^\top x_t$ should solve our problem. Alas, the decomposition $P_t = U_t U_t^\top$ (and therefore $y_t$) is underdetermined. Even if we ignore this issue, this objective is problematic for the sake of dimension reduction. Consider our setting, where we can observe $x_t$ before outputting $y_t$. One can simply choose the rank-1 projection $P_t = x_t x_t^\top / \|x_t\|_2^2$. On the one hand this gives $\sum_t \|x_t - P_t x_t\|_2^2 = 0$. On the other, it clearly does not provide meaningful dimension reduction.

(2) Here, $P_t$ is a square projection matrix with $P_t^2 = P_t$.

1.1.3 Stochastic setting

There are three recent results [2, 15, 3] that efficiently approximate the PCA objective in Equation (1.1). They assume the input vectors $x_t$ are drawn i.i.d. from a fixed and unknown distribution. In this setting, after observing $n_0$ columns $x_t$, one can efficiently compute $U_{n_0} \in O_{d,k}$ such that it approximately spans the top $k$ singular vectors of $X$. Returning $y_t = 0$ for $t < n_0$ and $y_t = U_{n_0}^\top x_t$ for $t \ge n_0$ completes the algorithm (a short sketch of this baseline is given below). This algorithm is provably correct for some $n_0$ which is independent of $n$.
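The buffer-then-project baseline just described can be sketched in a few lines of numpy. This is our illustration under the i.i.d. assumption; the function name and interface are ours, not the algorithms of [2, 15, 3].

```python
import numpy as np

def stochastic_opca(stream, k, n0):
    """Buffer-then-project baseline: output y_t = 0 for the first n0 vectors,
    estimate the top-k left singular space from that buffer, then project
    every later vector onto it."""
    buffer, U = [], None
    for t, x in enumerate(stream):
        if t < n0:
            buffer.append(x)
            yield np.zeros(k)              # nothing useful can be said yet
            if t == n0 - 1:
                B = np.column_stack(buffer)
                U = np.linalg.svd(B, full_matrices=False)[0][:, :k]
        else:
            yield U.T @ x                  # project onto the estimated subspace
```

The point of the discussion that follows is that this only helps when the distribution is fixed; under drift or adversarial input the buffered estimate $U_{n_0}$ can become arbitrarily stale.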
This is intuitively correct but non-trivial to show; see [2, 15, 3] for more details. While the stochastic setting is very common in machine learning (e.g. the PAC model), in online systems the data distribution is expected to change or at least drift over time. In systems that deal with abuse detection or prevention, one can expect an almost adversarial input.

1.2 Summary of contributions

Our first contribution is a deterministic online algorithm (see Algorithm 1 in Section 2) for the standard PCA objective in Eqn. (1.2). Our main result (see Theorem 2.1) shows that, for any $X = [x_1, \ldots, x_n] \in \mathbb{R}^{d \times n}$ (under some assumptions discussed below), $k < d$, and $\varepsilon > 0$, the proposed algorithm produces a set of vectors $y_1, \ldots, y_n$ in $\mathbb{R}^\ell$ such that

$\min_{\Phi \in O_{d,\ell}} \sum_{t=1}^n \|x_t - \Phi y_t\|_2^2 \le \mathrm{OPT} + \varepsilon \|X\|_F^2$,

where $\ell = 8k/\varepsilon^2$. The description of the algorithm and the proof of its correctness are given in Section 2. While Algorithm 1 solves the main conceptual difficulty in online PCA, it has certain drawbacks:

1. It must assume that $\|x_t\|_2^2 \le \|X\|_F^2/\ell$.

2. It requires $\|X\|_F^2$ as input.

3. It spends $\Omega(d^2)$ floating point operations per $x_t$ and requires $\Theta(d^2)$ auxiliary space in memory.

In Section 4 we present Algorithm 2, which addresses all the issues above, at the cost of slightly increasing the target dimension and the additive error (see Theorem 4.1 in Section 4). We briefly explain here how we deal with the above issues:

1. We deal with arbitrary input vectors by special handling of large-norm input vectors. This is a simple amendment to the algorithm which only doubles the required target dimension.

2. The new algorithm avoids requiring $\|X\|_F$ as input by estimating it on the fly. A doubling-argument analysis shows that the target dimension grows only to $\ell = O(k \log(n)/\varepsilon^2)$.(3) Bounding the target dimension by $\ell = O(k/\varepsilon^3)$ requires a significant conceptual change to the algorithm.

3. The new algorithm spends only $O(d\ell)$ floating point operations per input vector and uses only $O(d\ell)$ space. It avoids storing and manipulating the covariance matrix $C$ by utilizing a streaming matrix approximation technique [13].

(3) Here, we assume that the $\|x_t\|$ are polynomial in $n$.

2 A simple and inefficient OPCA algorithm

Let $X = [x_1, \ldots, x_n] \in \mathbb{R}^{d \times n}$ be the input high-dimensional vectors and $\ell < d$ be the target dimension. Algorithm 1 returns $Y = [y_1, \ldots, y_n] \in \mathbb{R}^{\ell \times n}$ that is close to $X$ in the PCA sense (see Theorem 2.1 below for a precise statement of our quality-of-approximation result regarding Algorithm 1). Besides $X$ and $\ell$, the algorithm also requires $\|X\|_F^2$ in its input. Moreover, we assume $\|x_t\|_2^2 \le \|X\|_F^2/\ell$.

Algorithm 1 (An online algorithm for Principal Component Analysis)
Input: $X = [x_1, \ldots, x_n] \in \mathbb{R}^{d \times n}$ with $\|x_t\|_2^2 \le \|X\|_F^2/\ell$, $\ell < d$, $\|X\|_F^2$
  $U = 0_{d \times \ell}$; $C = 0_{d \times d}$; $\theta = 2\|X\|_F^2/\ell$
  for $t = 1, \ldots, n$ do
    $r_t = x_t - U U^\top x_t$
    while $\|C + r_t r_t^\top\|_2 \ge \theta$ do
      $[u, \lambda] \leftarrow$ TopEigenVectorAndValueOf($C$)
      add $u$ to the next zero column of $U$
      $r_t \leftarrow x_t - U U^\top x_t$
      $C \leftarrow C - \lambda u u^\top$
    end while
    $C \leftarrow C + r_t r_t^\top$
    $y_t \leftarrow U^\top x_t$
  end for

Algorithm 1 initializes $U$ and $C$ to $d \times \ell$ and $d \times d$ all-zeros matrices, respectively. It subsequently updates those matrices appropriately. The matrix $U$ is the so-called projection matrix, i.e., $y_t = U^\top x_t$. The matrix $C$ is an auxiliary matrix that accumulates all the residual errors. The residual for a vector $x_t$ is $r_t = x_t - U U^\top x_t$. The algorithm starts with a rank-one update of $C$ as $C = C + r_1 r_1^\top$. Notice that, by the assumption on $x_t$, we have $\|r_1 r_1^\top\|_2 \le \|X\|_F^2/\ell$, and hence for $t = 1$ the algorithm will not enter the while-loop. Then, for the second input vector $x_2$, the algorithm proceeds by checking the spectral norm of $C + r_2 r_2^\top = r_1 r_1^\top + r_2 r_2^\top$. If this does not exceed the threshold $\theta$, which is always fixed to $\theta = 2\|X\|_F^2/\ell$, the algorithm keeps $U$ unchanged, and it can go all the way to $t = n$ if this is the case for all $t > 1$. Notice, then, that the fact that $\|C\|_2 \le \theta$ implies that the spectral norm squared of $R = [r_1, \ldots, r_n] \in \mathbb{R}^{d \times n}$ is bounded by $\theta$, because in this case $C = R R^\top$; as we will see in the proof of Theorem 2.1, this is a key technical component in proving that the algorithm returns $y_t$ that are close to $x_t$ in the PCA sense. If, however, for some iterate $t$ the spectral norm of $C + r_t r_t^\top$ exceeds the threshold $\theta$, then the algorithm does a correction to $U$ (and consequently to $r_t$) in order to ensure that this is not the case. Specifically, it updates $U$ with the principal eigenvector of $C$ at that point of the algorithm execution. At the same time it downdates $C$ inside the while-loop by removing this eigenvector.
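The following is our numpy rendering of the Algorithm 1 pseudocode above (the helper names are ours). The quick check at the end uses the offline_pca sketch from the introduction and the parameter choice $\ell = 8k/\varepsilon^2$ from Section 1.2.

```python
import numpy as np

def online_pca(X, ell):
    """Numpy sketch of Algorithm 1.  X is d x n with ||x_t||_2^2 <= ||X||_F^2/ell
    for every column, and ell < d.  Returns Y (ell x n) and the final projection
    matrix U (d x ell), whose non-zero columns are orthonormal."""
    d, n = X.shape
    theta = 2.0 * np.linalg.norm(X, "fro") ** 2 / ell   # threshold; needs ||X||_F^2 upfront
    U = np.zeros((d, ell))
    C = np.zeros((d, d))
    Y = np.zeros((ell, n))
    filled = 0                                          # number of directions stored in U
    for t in range(n):
        x = X[:, t]
        r = x - U @ (U.T @ x)
        while np.linalg.norm(C + np.outer(r, r), 2) >= theta:
            lam, vecs = np.linalg.eigh(C)               # eigenvalues in ascending order
            u, lam_top = vecs[:, -1], lam[-1]
            U[:, filled] = u                            # add u to the next zero column of U
            filled += 1
            r = x - U @ (U.T @ x)
            C = C - lam_top * np.outer(u, u)
        C = C + np.outer(r, r)
        Y[:, t] = U.T @ x
    return Y, U

# quick numerical check of the claimed guarantee (offline_pca defined earlier)
d, n, k, eps = 100, 300, 2, 0.5
X = np.random.randn(d, n)
ell = int(np.ceil(8 * k / eps ** 2))                    # = 64 < d
Y, U = online_pca(X, ell)
_, _, opt = offline_pca(X, k)
alg = np.linalg.norm(X - U @ Y, "fro") ** 2
assert alg <= opt + eps * np.linalg.norm(X, "fro") ** 2
```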
That way the algorithm ensures that, at the end of each iterate $t$, the following relation is true: $C = \sum_{t' \le t} r_{t'} r_{t'}^\top - \sum_j \lambda_j u_j u_j^\top$, with $C \cdot \big(\sum_j \lambda_j u_j u_j^\top\big) = 0_{d \times d}$ (for a formal statement of the latter orthogonality condition see Lemma 3.1). Hence, when the condition in the while-loop gives $\|C + r_t r_t^\top\|_2 \le 2\|X\|_F^2/\ell$, which implies that $\|C\|_2 \le 2\|X\|_F^2/\ell$, it really means (by an orthogonality argument, see Lemma 3.6) that $\|\sum_{t'} r_{t'} r_{t'}^\top\|_2 \le 2\|X\|_F^2/\ell$, which is the ultimate goal of the algorithm, as we argued before.

2.1 Main result

The following theorem is our main quality-of-approximation result regarding Algorithm 1. We prove the theorem based on several other facts which we state and prove in the next section.

Theorem 2.1. Let $X = [x_1, \ldots, x_n] \in \mathbb{R}^{d \times n}$ with $\|x_t\|_2^2 \le \|X\|_F^2/\ell$, $\ell < d$, and $\|X\|_F^2$ be inputs to Algorithm 1. For any target dimension $k < d$ and any accuracy parameter $\varepsilon > 0$, let the target dimension of the algorithm be $\ell = 8k/\varepsilon^2 < d$. Then:

1. The algorithm terminates, and the while-loop runs at most $\ell$ times in total (see Lemma 3.7).

2. Upon termination of the algorithm:
$\mathrm{ALG}_\ell := \min_{\Phi \in O_{d,\ell}} \sum_{t=1}^n \|x_t - \Phi y_t\|_2^2 \le \mathrm{OPT} + \varepsilon \|X\|_F^2$.

3. The algorithm uses amortized $O(d^3)$ arithmetic operations per vector $x_t$ and $O(d^2)$ auxiliary space.

Proof. We quickly prove item 2 in the theorem by using several intermediate results, all proven in Section 3. Let $R$ be the $d \times n$ matrix containing $r_t$ in its $t$-th column,(4) and let $Y$ be the $\ell \times n$ matrix containing $y_t$ in its $t$-th column. In Lemma 3.4 we prove
$\min_{\Phi \in O_{d,\ell}} \sum_{t=1}^n \|x_t - \Phi y_t\|_2^2 \le \|R\|_F^2$;
next, in Lemma 3.5 we prove
$\|R\|_F^2 \le \mathrm{OPT} + 2\sqrt{k}\,\|X\|_F \|R\|_2$;
and in Lemma 3.6 we prove
$\|R\|_2^2 \le 2\|X\|_F^2/\ell$.
Combining these three bounds gives
$\min_{\Phi \in O_{d,\ell}} \sum_{t=1}^n \|x_t - \Phi y_t\|_2^2 \le \mathrm{OPT} + 2\sqrt{k}\,\|X\|_F \cdot \sqrt{2/\ell}\,\|X\|_F$.
Use $\ell = 8k/\varepsilon^2$ in the latter equation to wrap up the proof of the second item in the theorem.

(4) Here, $r_t$ is the corresponding vector upon termination of the corresponding iterate of the algorithm.

2.2 Finding the best isometry matrix

We remark that our algorithm does not compute the best isometry matrix $\Phi$ for which the bound in Theorem 2.1 holds. The algorithm indeed only returns the low-dimensional vectors $y_t$. The best $\Phi$, after $X$ and $Y$ are fixed, is related to the so-called Procrustes problem:
$\arg\min_{\Phi \in O_{d,\ell}} \|X - \Phi Y\|_F^2$.
Let $X Y^\top = U \Sigma V^\top \in \mathbb{R}^{d \times \ell}$ be the SVD of $X Y^\top$ with $U \in \mathbb{R}^{d \times \ell}$, $\Sigma \in \mathbb{R}^{\ell \times \ell}$ and $V \in \mathbb{R}^{\ell \times \ell}$. Then, $\Phi = U V^\top$. Clearly, it is not possible to construct this matrix in an online fashion. However, our algorithm does find an isometry matrix (the matrix $U \in \mathbb{R}^{d \times \ell}$ upon termination of the algorithm) which satisfies the bound in Theorem 2.1 (see the proof of Lemma 3.4 to verify this claim).
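The Procrustes step described in Section 2.2 is a standard construction; a minimal numpy sketch of it (ours) is below. As pointed out above, it needs all of $X$ and $Y$, so it is only available offline.

```python
import numpy as np

def best_isometry(X, Y):
    """Solve argmin over isometric Phi of ||X - Phi Y||_F.

    X is d x n, Y is ell x n with ell <= d; the optimum is Phi = U V^T,
    where U Sigma V^T is the SVD of X Y^T.
    """
    U, _, Vt = np.linalg.svd(X @ Y.T, full_matrices=False)
    return U @ Vt                 # d x ell, columns are orthonormal
```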
3 Auxiliary Lemmas

3.1 Notation

We introduce notation that, though not necessary to describe the algorithm, is useful for proving our results regarding Algorithm 1. Let $\ell'$ be the number of vectors $u_j$ inserted in $U \in \mathbb{R}^{d \times \ell}$. This is exactly the number of times the while-loop is executed. We argue in Lemma 3.7 that $\ell' \le \ell$. Notice that after the execution of the algorithm $U$ may still contain some all-zero columns. We reserve the index $t$ to refer to the iterate of the algorithm where the algorithm receives $x_t$. Also, let $U_t \in \mathbb{R}^{d \times \ell}$ denote the projection matrix $U$ used for the vector $x_t$. We reserve the index $i$ to index the various instances of the matrix $C$ during the execution of the algorithm. Hence, $i$ ranges from $i = 1$ ($C_1 = 0_{d \times d}$) to $i = z$, where $z$ is the number of times the algorithm updates $C$. Clearly, $z \ge n$, because for each $t$ there is at least one such update outside the while-loop. Also, $z = n + \ell' \le n + \ell$, since the while-loop is executed at most $\ell$ times. We reserve the index $j$ to index the various vectors $u_j$ added to the matrix $U$ during the execution of the algorithm. For $j = 1, 2, \ldots, \ell'$, the $u_j \in \mathbb{R}^d$ are the vectors $u$ computed inside the while-loop in Algorithm 1. Notice that $u_j$ is the eigenvector of some $C_i$. Also, let $\lambda_j$ be the corresponding largest eigenvalue of $C_i$ ($C_i$ is a symmetric matrix, as we argue around Eqn. (3.4)), such that $\lambda_j = u_j^\top C_i u_j$. Let $X \in \mathbb{R}^{d \times n}$, $Y \in \mathbb{R}^{\ell \times n}$, $\hat{X} \in \mathbb{R}^{d \times n}$ and $R \in \mathbb{R}^{d \times n}$ denote the matrices whose $t$-th column is $x_t \in \mathbb{R}^d$, $y_t \in \mathbb{R}^\ell$, $\hat{x}_t = U_t U_t^\top x_t \in \mathbb{R}^d$, and $r_t \in \mathbb{R}^d$, respectively. The vectors $r_t$ are taken as the ones at the end of the corresponding iteration. By construction of the algorithm the following relation is also true:

(3.4)  $C_z = R R^\top - \sum_{j=1}^{\ell'} \lambda_j u_j u_j^\top$.

3.2 Results

The following lemma argues about some properties of the vectors $u_j$ and the matrix $C$. It argues that the vectors $u_j$ inserted in $U$ are orthogonal to each other. This is important in interpreting our algorithm as a true PCA algorithm, since those vectors play the role of the left singular vectors of $X$, and as such should at the very least be orthogonal to each other. The lemma also shows that the vectors $u_j$ inserted in $U$ are perpendicular to the subspace spanned by the columns of the matrix $C_z$.

Recall that in Eqn. (3.4) we showed that $R R^\top = C_z + \sum_{j=1}^{\ell'} \lambda_j u_j u_j^\top$; hence the lemma argues that $C_z$ and $\sum_{j=1}^{\ell'} \lambda_j u_j u_j^\top$ span two perpendicular subspaces which, combined, span the subspace spanned by $R R^\top$. This will be useful in bounding the spectral norm of $R R^\top$ in Lemma 3.6. The idea is that for such a matrix $R R^\top$, its spectral norm can be bounded by the maximum of the spectral norms of $C_z$ and $\sum_{j=1}^{\ell'} \lambda_j u_j u_j^\top$, which in turn are bounded by design of the algorithm.

Lemma 3.1. Let $U_t \in \mathbb{R}^{d \times \ell}$ be the instance of $U$ in Algorithm 1 when calculating the corresponding vector $y_t = U^\top x_t$. Then, for all $t = 1, \ldots, n$ and for some $j \in [1, \ldots, \ell']$ (not necessarily the same for different $t$):
$U_t^\top U_t = \begin{pmatrix} I_j & 0 \\ 0 & 0 \end{pmatrix} \in \mathbb{R}^{\ell \times \ell}$.
Also, for the final value taken by the matrix $C$ it holds that
$C_z \cdot \left( \sum_{j=1}^{\ell'} \lambda_j u_j u_j^\top \right) = 0_{d \times d}$.

Proof. See Section 5.1.

The following two lemmas prove upper and lower bounds for the values $\lambda_j$ calculated in the algorithm. The lower bounds are useful in upper bounding the number of times the algorithm enters the while-loop (see Lemma 3.7); the upper bounds will be useful in providing an upper bound for the spectral norm of the matrix $R$ (see Lemma 3.6), which is crucial for providing the error accuracy guarantee of the algorithm in Theorem 2.1.

Lemma 3.2. (Upper bound on the $\lambda_j$'s) Let $\lambda_j$, for $j = 1, \ldots, \ell'$, be the eigenvalues computed in Algorithm 1, with $\lambda_j$ computed before $\lambda_{j+1}$. Then, for all $j = 1, 2, \ldots, \ell'$: $\lambda_j \le 2\|X\|_F^2/\ell$.

Proof. Each $\lambda_j$ corresponds to the largest eigenvalue/singular value of some $C_i$.(5) Hence, it suffices to argue that for all $i = 1, \ldots, z$: $\sigma_1(C_i) = \|C_i\|_2 \le 2\|X\|_F^2/\ell$. We prove the result by induction on $i$. For $i = 1$, it is $C_1 = 0_{d \times d}$ and it is trivial that $\sigma_1(C_1) = 0 \le 2\|X\|_F^2/\ell$. By the induction hypothesis, for some $i \ge 1$ let $\|C_i\|_2 \le 2\|X\|_F^2/\ell$. We want to show that $\|C_{i+1}\|_2 \le 2\|X\|_F^2/\ell$. Notice that either $C_{i+1} = C_i + r_t r_t^\top$ for some iterate $t$, or $C_{i+1} = C_i - \lambda_j u_j u_j^\top$ for some $j$. If $C_{i+1} = C_i + r_t r_t^\top$, then $\|C_i + r_t r_t^\top\|_2 \le 2\|X\|_F^2/\ell$ is ensured by the while-loop condition in the algorithm, since that update happens outside the while-loop. If $C_{i+1} = C_i - \lambda_j u_j u_j^\top$ (inside the while-loop), then

(3.5)  $\|C_{i+1}\|_2 = \|C_i - \lambda_j u_j u_j^\top\|_2 = \sigma_1(C_i - \lambda_j u_j u_j^\top) = \sigma_2(C_i) \le \sigma_1(C_i) = \|C_i\|_2 \le 2\|X\|_F^2/\ell$,

where $\|C_i\|_2 \le 2\|X\|_F^2/\ell$ uses the induction hypothesis. Also, in the third equality we use the fact that $(\lambda_j, u_j)$ is the top eigen-pair of $C_i$, hence removing it from $C_i$ turns the second largest eigenvalue of $C_i$ into the largest one of $C_i - \lambda_j u_j u_j^\top$.

(5) For the purposes of this proof, $\sigma_i(A)$ denotes the $i$-th largest singular value of some symmetric matrix $A$.

Lemma 3.3. (Lower bound on the $\lambda_j$'s) Let $\lambda_j$, for $j = 1, \ldots, \ell'$, be the eigenvalues computed in Algorithm 1, with $\lambda_j$ computed before $\lambda_{j+1}$. Assuming that for all $t$, $\|x_t\|_2^2 \le \|X\|_F^2/\ell$, then for all $j = 1, 2, \ldots, \ell'$: $\lambda_j \ge \|X\|_F^2/\ell$.

Proof. Each $\lambda_j$ corresponds to the largest eigenvalue of some $C_i$, and by design of the algorithm the extraction of $\lambda_j$ from $C_i$ is done inside the while-loop. Note that the condition in the while-loop is $\|C + r_t r_t^\top\|_2 \ge 2\|X\|_F^2/\ell$. This implies that for the iterate $t$:
$\|C_i\|_2 \ge \|C_i + r_t r_t^\top\|_2 - \|r_t r_t^\top\|_2 \ge \|X\|_F^2/\ell$.
The first inequality follows by the triangle inequality: $\|C_i + r_t r_t^\top\|_2 \le \|C_i\|_2 + \|r_t r_t^\top\|_2$. The second inequality uses $\|C_i + r_t r_t^\top\|_2 \ge 2\|X\|_F^2/\ell$ and $\|r_t r_t^\top\|_2 \le \|X\|_F^2/\ell$. To verify that $\|r_t r_t^\top\|_2 \le \|X\|_F^2/\ell$, recall that $r_t = (I_d - U_t U_t^\top) x_t$. Then, using the fact that $I_d - U_t U_t^\top$ is a projection matrix: $\|r_t r_t^\top\|_2 = \|r_t\|_2^2 \le \|x_t\|_2^2 \le \|X\|_F^2/\ell$, where the last inequality uses the assumption in the lemma.

The next lemma argues that the reconstruction error of our algorithm is bounded from above by $\|R\|_F^2$.
This result is based on the fact that the matrix $U$ contains orthonormal columns and the updates on $U$ occur by adding columns one after the other, without ever deleting any of them. The first inequality in the derivation in the proof of the lemma also indicates that the bound in Theorem 2.1 not only holds for the best isometry matrix $\Phi$ but also for the matrix $U_n$, i.e., the matrix $U$ upon termination of Algorithm 1, which is also an isometry.

Lemma 3.4. Let $X \in \mathbb{R}^{d \times n}$, $Y \in \mathbb{R}^{\ell \times n}$, and $R \in \mathbb{R}^{d \times n}$ be the matrices whose $t$-th column is $x_t \in \mathbb{R}^d$, $y_t \in \mathbb{R}^\ell$, and $r_t \in \mathbb{R}^d$ (the vectors $r_t$ are taken as the ones at the end of the corresponding iteration), respectively. Then,
$\min_{\Phi \in O_{d,\ell}} \|X - \Phi Y\|_F^2 \le \|R\|_F^2$.

Proof. We manipulate the term $\min_{\Phi \in O_{d,\ell}} \|X - \Phi Y\|_F^2$ as follows:
$\min_{\Phi \in O_{d,\ell}} \|X - \Phi Y\|_F^2 \le \|X - U_n Y\|_F^2 = \sum_{t=1}^n \|x_t - U_n U_t^\top x_t\|_2^2 = \sum_{t=1}^n \|x_t - U_t U_t^\top x_t\|_2^2 = \|R\|_F^2$.
The first inequality holds because $U_n$ is an isometry: $U_n$ contains a subset of $\ell'$ orthonormal columns, according to Lemma 3.1, and the remaining columns are all-zeros; also, the bottom $\ell - \ell'$ rows in $Y$ are all-zeros. In the second equality, $U_n U_t^\top = U_t U_t^\top$, since $U_t = [u_1, u_2, \ldots, u_{j_t}, 0, \ldots, 0]$ for some $j_t$ and $U_n = [u_1, u_2, \ldots, u_{j_n}, 0, \ldots, 0]$ with $j_n \ge j_t$, and all the vectors $u_j$ are orthonormal.

Next, we provide an upper bound for the Frobenius norm squared of $R$ with respect to the spectral norm of $R$. That helps because the algorithm only provides a bound for $\|R\|_2$ (see Lemma 3.6).

Lemma 3.5. Let $X, R \in \mathbb{R}^{d \times n}$ be the matrices whose $t$-th column is $x_t$ and $r_t \in \mathbb{R}^d$ (the vectors $r_t$ are taken as the ones at the end of the corresponding iteration), respectively. Then,
$\|R\|_F^2 \le \mathrm{OPT} + 2\sqrt{k}\,\|X\|_F \|R\|_2$.

Proof. For $i = 1, \ldots, k$, let $v_i \in \mathbb{R}^d$ denote the $i$-th left singular vector of $X$, corresponding to the $i$-th largest singular value of $X$; also let $V_k \in \mathbb{R}^{d \times k}$ contain those vectors as columns. Recall that $\hat{X}$ is the matrix whose $t$-th column is $\hat{x}_t = U_t U_t^\top x_t$. We manipulate the term $\|R\|_F^2$ as follows:

$\|R\|_F^2 \overset{(i)}{=} \|X\|_F^2 - \|\hat{X}\|_F^2 \overset{(ii)}{\le} \|X\|_F^2 - \sum_{i=1}^k \|v_i^\top \hat{X}\|_2^2 \overset{(iii)}{=} \|X\|_F^2 - \sum_{i=1}^k \|v_i^\top X - v_i^\top R\|_2^2 \overset{(iv)}{=} \|X\|_F^2 - \sum_{i=1}^k \left( \|v_i^\top X\|_2^2 + \|v_i^\top R\|_2^2 - 2 \langle v_i^\top X, v_i^\top R \rangle \right) \overset{(v)}{\le} \|X\|_F^2 - \sum_{i=1}^k \|v_i^\top X\|_2^2 + 2 \sum_{i=1}^k \|v_i^\top X\|_2\, \|v_i^\top R\|_2 \overset{(vi)}{\le} \|X\|_F^2 - \sum_{i=1}^k \|v_i^\top X\|_2^2 + 2 \sqrt{\sum_{i=1}^k \|v_i^\top X\|_2^2}\, \sqrt{\sum_{i=1}^k \|v_i^\top R\|_2^2} \overset{(vii)}{\le} \mathrm{OPT} + 2\sqrt{k}\,\|X\|_F \|R\|_2$.

(i) follows by Lemma 5.1 in Section 5. In (ii) we used $\|\hat{X}\|_F^2 \ge \sum_{i=1}^k \|v_i^\top \hat{X}\|_2^2$. This is true because
$\sum_{i=1}^k \|v_i^\top \hat{X}\|_2^2 = \|V_k^\top \hat{X}\|_F^2 \le \|V_k^\top\|_2^2 \|\hat{X}\|_F^2 = \|\hat{X}\|_F^2$;
in the first inequality we used the fact that $\|A B\|_F^2 \le \|A\|_2^2 \|B\|_F^2$ for any $A$ and $B$, and in the last equality we used $\|V_k^\top\|_2^2 = 1$. In (iii) we use the relation $\hat{X} = X - R$. In (iv) we used the fact that for any two vectors $\alpha, \beta$: $\|\alpha - \beta\|_2^2 = \|\alpha\|_2^2 + \|\beta\|_2^2 - 2\langle \alpha, \beta \rangle$; we used this property $k$ times, for all $i = 1, \ldots, k$, with $\alpha = v_i^\top X$ and $\beta = v_i^\top R$. In (v) we dropped the nonnegative terms $\|v_i^\top R\|_2^2$ and used $\langle v_i^\top X, v_i^\top R \rangle \le \|v_i^\top X\|_2 \|v_i^\top R\|_2$. In (vi) we used the Cauchy-Schwarz inequality: for numbers $\gamma_i \ge 0$, $\delta_i \ge 0$, $i = 1, \ldots, k$, it is $\sum_i \gamma_i \delta_i \le \sqrt{\sum_i \gamma_i^2} \sqrt{\sum_i \delta_i^2}$; we used this inequality for $\gamma_i = \|v_i^\top X\|_2$ and $\delta_i = \|v_i^\top R\|_2$. In (vii) we used that
$\|X\|_F^2 - \sum_{i=1}^k \|v_i^\top X\|_2^2 = \sum_{i=1}^d \sigma_i^2(X) - \sum_{i=1}^k \sigma_i^2(X) = \sum_{i=k+1}^d \sigma_i^2(X) = \mathrm{OPT}$.
Also, we used that $\sum_{i=1}^k \|v_i^\top X\|_2^2 \le \|X\|_F^2$, and that $\sum_{i=1}^k \|v_i^\top R\|_2^2 \le k \|R\|_2^2$; the latter inequality is true because $\|v_i^\top R\|_2^2 \le \|R\|_2^2$ for all $i = 1, \ldots, k$.

Next, we provide an upper bound on the spectral norm squared of $R$. The previous lemma motivates the need for such an analysis.

Lemma 3.6. Let $X, R \in \mathbb{R}^{d \times n}$ be the matrices whose $t$-th column is $x_t$ and $r_t \in \mathbb{R}^d$ (the vectors $r_t$ are taken as the ones at the end of the corresponding iteration), respectively. Then, $\|R\|_2^2 \le 2\|X\|_F^2/\ell$.

Proof. Notice that according to the updates of $C$ we have that
$R R^\top = C_z + \sum_{j=1}^{\ell'} \lambda_j u_j u_j^\top$.
According to Lemma 3.1 we have that $C_z \cdot (\sum_{j=1}^{\ell'} \lambda_j u_j u_j^\top) = 0_{d \times d}$, meaning that
$\|R\|_2^2 = \|R R^\top\|_2 = \max\left\{ \|C_z\|_2,\ \left\|\sum_{j=1}^{\ell'} \lambda_j u_j u_j^\top\right\|_2 \right\}$.
The second equality is proved in Lemma 5.4 in Section 5. Now, according to Lemma 3.2 we have that $\|\sum_{j=1}^{\ell'} \lambda_j u_j u_j^\top\|_2 \le 2\|X\|_F^2/\ell$. From the structure of the algorithm it is straightforward that upon termination it must be the case that $\|C_z\|_2 \le 2\|X\|_F^2/\ell$: the update $C \leftarrow C + r_t r_t^\top$ happens only when $\|C + r_t r_t^\top\|_2$ is at most $2\|X\|_F^2/\ell$, for otherwise the while-loop in the algorithm would not halt. Combining the three inequalities leads to the claim.

Finally, we provide an upper bound on the number of times the algorithm enters the while-loop; this immediately implies an upper bound on the number of vectors $u_j$ inserted in $U$. The main idea here is to study the trace of the matrix $\sum_{j=1}^{\ell'} \lambda_j u_j u_j^\top$ and provide a lower bound that depends on $\ell'$ and an upper bound that depends on $\ell$. Combining the two bounds gives the desired relation between $\ell'$ and $\ell$.

Lemma 3.7. Let $X \in \mathbb{R}^{d \times n}$ be the matrix whose $t$-th column is $x_t$. Then, assuming that for all $t$, $\|x_t\|_2^2 \le \|X\|_F^2/\ell$, the while-loop in Algorithm 1 can occur a total of at most
$\ell' \le \ell \cdot \min\left\{ 1,\ \mathrm{OPT}/\|X\|_F^2 + \sqrt{8k/\ell} \right\} \le \ell$
times.

Proof. For notational convenience, let $Z := \sum_{j=1}^{\ell'} \lambda_j u_j u_j^\top$. First, using Lemma 3.3, we calculate a lower bound for $\mathrm{Trace}(Z)$:

(3.6)  $\mathrm{Trace}(Z) = \sum_{j=1}^{\ell'} \lambda_j \ge \ell'\, \|X\|_F^2/\ell$.

Since $C_z$ is clearly positive semidefinite we have

(3.7)  $\mathrm{Trace}(Z) = \mathrm{Trace}(R R^\top) - \mathrm{Trace}(C_z) \le \mathrm{Trace}(R R^\top) = \|R\|_F^2 \le \|X\|_F^2$.

The inequality $\|R\|_F^2 \le \|X\|_F^2$ follows because for all $t$, $\|r_t\|_2^2 = \|(I_d - U_t U_t^\top) x_t\|_2^2 \le \|I_d - U_t U_t^\top\|_2^2 \|x_t\|_2^2 = \|x_t\|_2^2$, since $I_d - U_t U_t^\top$ is a projection. Also, by the proof of Theorem 2.1,
$\|R\|_F^2 \le \mathrm{OPT} + \sqrt{8k/\ell}\,\|X\|_F^2$,
hence

(3.8)  $\mathrm{Trace}(Z) \le \|R\|_F^2 \le \mathrm{OPT} + \sqrt{8k/\ell}\,\|X\|_F^2$.

Combining Eqn. (3.6), Eqn. (3.7), and Eqn. (3.8) shows the claim.

4 A complex and efficient OPCA algorithm

In this section we modify Algorithm 1 in order to (1) remove the assumption of knowledge of $\|X\|_F^2$ and (2) improve the time and space complexity. These two improvements are made at the expense of sacrificing the accuracy slightly. To sidestep trivial technicalities, we assume some prior knowledge of the quantity $\|X\|_F^2$; we merely assume that we have some quantity $w_0$ such that $w_0 \le \|X\|_F^2$ and $\|x_t\|_2^2 \le w_0$ for all $t$. The assumption of knowing $w_0$ can be removed by working with a buffer of roughly $k/\varepsilon^3$ columns. Since we already require this amount of space, the resulting increase in the memory complexity is asymptotically negligible. Specifically, we provide the following theorem:

Theorem 4.1. There exists an algorithm that, on input $X \in \mathbb{R}^{d \times n}$, target dimension $k < d$, and $\varepsilon \in (0, 1/15)$, produces vectors $y_1, y_2, \ldots, y_n \in \mathbb{R}^\ell$ with
$\min_{\Phi \in O_{d,\ell}} \sum_{t=1}^n \|x_t - \Phi y_t\|_2^2 \le \mathrm{OPT} + \varepsilon \|X\|_F^2$.
For $\delta = \min\{\varepsilon,\ \varepsilon^2 \|X\|_F^2/\mathrm{OPT}\}$, this algorithm requires $O(dk/\delta^3)$ space and $O(ndk/\delta^3 + \log(n) \cdot d k^3/\delta^6)$ arithmetic operations (assuming the $\|x_t\|_2$ are polynomial in $n$). The target dimension of the algorithm is $\ell = k/\delta^3$.
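Algorithm 2, presented next, maintains its sketch $Z$ of the residual stream via the Frequent Directions technique of [13] (the call Sketch$[Z, r_t]$). The following numpy class is our simplified illustration of that subroutine, not the implementation from [13]; it assumes $d \ge 2\ell$ and keeps at most $2\ell$ columns.

```python
import numpy as np

class FrequentDirections:
    """Simplified column-stream Frequent Directions sketch (our illustration).

    Maintains Z with at most 2*ell columns such that Z Z^T never overestimates
    R R^T and ||R R^T - Z Z^T||_2 <= ||R||_F^2 / ell, where R is the matrix of
    all columns appended so far.  Assumes d >= 2*ell.
    """

    def __init__(self, d, ell):
        self.ell = ell
        self.Z = np.zeros((d, 2 * ell))
        self.next_col = 0              # index of the next free column of Z

    def append(self, r):
        if self.next_col == 2 * self.ell:
            # Sketch is full: shrink every direction by the ell-th squared
            # singular value; this zeroes out at least ell columns.
            U, s, _ = np.linalg.svd(self.Z, full_matrices=False)
            delta = s[self.ell - 1] ** 2
            s = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
            self.Z = U * s             # rescale the left singular vectors
            self.next_col = int(np.count_nonzero(s))
        self.Z[:, self.next_col] = r
        self.next_col += 1
```

In Algorithm 2, each residual $r_t$ would be passed to append, and $Z Z^\top$ stands in for the exact residual covariance that Algorithm 1 keeps in $C$.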

Algorithm 2 (An efficient online PCA algorithm)
input: $X$, $k$, $\varepsilon \in (0, 1/15)$, $w_0$ (which defaults to $w_0 = 0$)
  $U = 0_{d \times k/\varepsilon^3}$, $Z = 0_{d \times k/\varepsilon^2}$, $w = 0$, $w_U = 0_{k/\varepsilon^3}$
  for $t = 1, \ldots, n$ do
    $w = w + \|x_t\|_2^2$
    $r_t = x_t - U U^\top x_t$
    $C = (I - U U^\top) Z Z^\top (I - U U^\top)$
    while $\|C + r_t r_t^\top\|_2 \ge \max\{w_0, w\}\, \varepsilon^2/k$ do
      $(u, \lambda) =$ TopEigenvectorAndValue($C$)
      $w_U(u) \leftarrow \lambda$
      If $U$ has a zero column, write $u$ in its place. Otherwise, write $u$ instead of the column $v$ of $U$ with the minimal quantity $w_U(v)$.
      $C = (I - U U^\top) Z Z^\top (I - U U^\top)$
      $r_t = x_t - U U^\top x_t$
    end while
    $Z \leftarrow$ Sketch$[Z, r_t]$
    For each non-zero column $u$ in $U$, $w_U(u) \leftarrow w_U(u) + \langle x_t, u \rangle^2$
    yield: $y_t \leftarrow U^\top x_t$
  end for

4.1 Notations

For a variable $\xi$ in the algorithm, where $\xi$ can either be a matrix ($Z$, $X$, $C$) or a vector ($r$, $x$), we denote by $\xi_t$ the value of $\xi$ at the end of iteration number $t$; $\xi_0$ denotes the value of the variable at the beginning of the algorithm. For variables that may change during the while-loop we denote by $\xi_{t,z}$ the value of $\xi$ at the end of the $z$-th while-loop iteration within the $t$-th iteration. In particular, if the while-loop was never entered, the value of $\xi$ at the end of iteration $t$ equals $\xi_{t,0}$. For such $\xi$ that may change during the while-loop, notice that $\xi_t$ is its value at the end of the iteration, meaning after the last while-loop iteration has ended. An exception to the above is the variable $w$: we denote by $w_t$ the value of $\max\{w_0, w\}$ at the end of iteration number $t$. We denote by $n$ the index of the last iteration of the algorithm; in particular $\xi_n$ denotes the value of $\xi$ upon termination of the algorithm.

4.2 Matrix Sketching

We use the Frequent Directions matrix sketching algorithm presented by Liberty in [13]. This algorithm provides a sketch for the streaming setting where we observe a matrix $R$ column by column. The idea is to maintain a matrix $Z$ of low rank that at all times has the property that $Z Z^\top$ and $R R^\top$ are close in operator norm.

Lemma 4.1. Let $R_1, \ldots, R_t, \ldots$ be a sequence of matrices with columns of dimension $d$, where $R_{t+1}$ is obtained by appending a column vector $r_{t+1}$ to $R_t$. Let $Z_1, \ldots, Z_t, \ldots$ be the corresponding sketches of $R$ obtained by adding these columns as they arrive, according to the Frequent Directions algorithm in [13] with parameter $\ell$.

1. The worst-case time required by the sketch to add a single column is $O(\ell d)$.

2. Each $Z_t$ is a matrix of dimensions $d \times O(\ell)$.

3. Let $u$ be a left singular vector of $Z_t$ with singular value $\sigma$ and assume that $r_{t+1}$ is orthogonal to $u$. Then $u$ is a singular vector of $Z_{t+1}$ with singular value $\sigma$.

4. For any vector $u$ and time $t$ it holds that $\|Z_t^\top u\| \le \|R_t^\top u\|$.

5. For any $t$ there exists a positive semidefinite matrix $E_t$ such that $\|E_t\|_2 \le \|R_t\|_F^2/\ell$ and $Z_t Z_t^\top + E_t = R_t R_t^\top$.

4.3 Proof of Theorem 4.1

Lemma 4.2. At all times, the matrix $U^\top U$ is a diagonal matrix with either zeros or ones across the diagonal. In other words, the non-zero column vectors of $U$, at any point in time, are orthonormal.

Proof. We prove the claim for any $U_{t,z}$ by induction on $(t, z)$. For $t = 0$ the claim is trivial as $U_0 = 0$. For the induction step we need only consider $U_{t,z}$ for $z > 0$, since $U_{t,0} = U_{t-1}$. Let $u$ be the new vector in $U_{t,z}$. Since $u$ is defined as an eigenvector of $C_{t,z-1} = (I - U_{t,z-1} U_{t,z-1}^\top) Z_{t-1} Z_{t-1}^\top (I - U_{t,z-1} U_{t,z-1}^\top)$, we have that $U_{t,z-1}^\top u = 0$ and the claim immediately follows.

Lemma 4.3. For all $t, z$, the non-zero columns of $U_{t,z}$ are left singular vectors of $Z_{t-1}$ (possibly with singular value zero). In particular, the claim holds for the final values of the matrices $U$ and $Z$.

Proof. We prove the claim by induction on $(t, z)$, with a trivial base case of $t = 1$ where $Z_{t-1} = 0$. Let $t > 1$.
For $z = 0$, each non-zero column vector $u$ of $U_{t,0}$ is a non-zero column vector of $U_{t-1,z'}$ for the largest valid $z'$ with respect to $t-1$. By the induction hypothesis it holds that $u$ is a singular vector of $Z_{t-2}$. According to Lemma 4.2 we have that $r_{t-1}$, the vector added to $Z_{t-2}$, is orthogonal to $u$, hence Lemma 4.1, item 3, indicates that $u$ is a singular vector of $Z_{t-1}$ as required. Consider now $z > 0$. In this case $u$ is a vector added in the while-loop. Recall that $u$ is defined as an eigenvector of
$C_{t,z-1} = (I - U_{t,z-1} U_{t,z-1}^\top)\, Z_{t-1} Z_{t-1}^\top (I - U_{t,z-1} U_{t,z-1}^\top)$.

According to our induction hypothesis, all of the non-zero column vectors of $U_{t,z-1}$ are singular vectors of $Z_{t-1}$, hence
$Z_{t-1} Z_{t-1}^\top = C_{t,z-1} + U_{t,z-1} U_{t,z-1}^\top Z_{t-1} Z_{t-1}^\top U_{t,z-1} U_{t,z-1}^\top$.
The two equalities above imply that any eigenvector of $C_{t,z-1}$ is an eigenvector of $Z_{t-1} Z_{t-1}^\top$ as well. It follows that $u$ is a singular vector of $Z_{t-1}$ as required, thus proving the claim.

Lemma 4.4. Let $v$ be a column vector of $U_{t,z}$ that is not in $U_{t,z+1}$. Let $(t_v, z_v)$ be the earliest time stamp from which $v$ was a column vector in $U$ consecutively up to time $(t, z)$. It holds that
$\|Z_{t-1}^\top v\|_2^2,\ \|(X_{t-1} - X_{t_v-1})^\top v\|_2^2 \le 2\varepsilon^3 w_{t-1}/k$.

Proof. We denote by $w_U(u)$ the values of the $w_U$ vector for the different directions $u$ at time $(t, z)$. Let $\lambda_v$ be the eigenvalue associated with $v$ at the time it was entered into $U$. Then, at that time, $\|Z^\top v\|_2^2 = v^\top C v = \lambda_v$ (recall that $v$ is chosen as a vector orthogonal to the columns of $U$, hence $v^\top C v = \|Z^\top (I - U U^\top) v\|_2^2 = \|Z^\top v\|_2^2$). Furthermore, since $v$ was a column in $U$ up to time $t$, all vectors added to $R$ from the insertion of $v$ up to time $t$ are orthogonal to $v$, and Lemma 4.1, item 3, shows that the corresponding singular value in the sketch is unchanged. Since $v$ was a column vector of $U$ from iteration $t_v$ up to iteration $t$, we have that
$\|Z_{t-1}^\top v\|_2^2,\ \|(X_{t-1} - X_{t_v-1})^\top v\|_2^2 \le \lambda_v + \sum_{\tau=t_v}^{t} \langle v, x_\tau \rangle^2 = w_U(v)$.
It remains to bound the quantity $w_U(v)$. We will bound the sum $\sum_{u \in U_{t,z}} w_U(u)$ and use the fact that $v$ is the minimizer of the corresponding expression, hence $w_U(v)$ is upper bounded by $\big( \sum_u w_U(u) \big)\, \varepsilon^3/k$ (the matrix $U$ has $k/\varepsilon^3$ columns). Let $t_u$ be the index of the iteration in which $u$ was inserted into $U$. It is easy to verify that for $\lambda_u = \|C_{t_u-1} u\|_2$ it holds that
$w_U(u) = \lambda_u + \|(X_{t-1} - X_{t_u-1})^\top u\|_2^2$.
Now, since $\lambda_u = u^\top C_{t_u-1} u \le \|Z_{t_u-1}^\top u\|_2^2 \le \|R_{t_u-1}^\top u\|_2^2$, we have that
$w_U(u) \le \|R_{t_u-1}^\top u\|_2^2 + \|(X_{t-1} - X_{t_u-1})^\top u\|_2^2$.
Finally,
$\sum_u w_U(u) \le \sum_u \left( \|R_{t_u-1}^\top u\|_2^2 + \|(X_{t-1} - X_{t_u-1})^\top u\|_2^2 \right) \overset{(i)}{\le} \sum_u \left( \|R_{t-1}^\top u\|_2^2 + \|X_{t-1}^\top u\|_2^2 \right) \overset{(ii)}{\le} \|R_{t-1}\|_F^2 + \|X_{t-1}\|_F^2 \overset{(iii)}{\le} 2\|X_{t-1}\|_F^2 \le 2 w_{t-1}$.
Inequality (i) is immediate from the definitions of $X_t$, $R_t$ as concatenations of $t$ column vectors. Inequality (ii) follows since any orthonormal vector set and any matrix $A$ admit $\sum_u \|A^\top u\|_2^2 \le \|A\|_F^2$. Inequality (iii) is due to the fact that each column of $R$ is obtained by projecting a column of $X$ onto some subspace. Combining the two bounds gives $w_U(v) \le 2\varepsilon^3 w_{t-1}/k$, as claimed.

Lemma 4.5. At all times, $\|C_{t,z}\|_2 \le w_{t-1}\, \varepsilon^2/k$.

Proof. We prove the claim by induction over $(t, z)$. The base case $t = 0$ is trivial. For $t > 0$ and $z = 0$: if the while-loop in iteration $t$ was entered, we have that $C_{t,0} = C_{t-1,z'}$ for some $z'$; since $w_{t-1} \ge w_{t-2}$, the claim holds. If $t$ is such that the while-loop of the iteration was not entered, the condition of the while-loop asserts that $\|C_{t,z}\|_2 = \|C_t\|_2 \le w_{t-1}\, \varepsilon^2/k$. Consider now $(t, z)$ with $z > 0$. We have that
$C_{t,z-1} = (I_d - U_{t,z-1} U_{t,z-1}^\top)\, Z_{t-1} Z_{t-1}^\top (I_d - U_{t,z-1} U_{t,z-1}^\top)$,
$C_{t,z} = (I_d - U_{t,z} U_{t,z}^\top)\, Z_{t-1} Z_{t-1}^\top (I_d - U_{t,z} U_{t,z}^\top)$.
If $U_{t,z}$ is obtained by writing $u$ instead of a zero column of $U_{t,z-1}$, then $C_{t,z}$ is a projection of $C_{t,z-1}$ and the claim holds due to the induction hypothesis. If not, $u$ is inserted instead of some vector $v$. According to Lemmas 4.3 and 4.4, $v$ is an eigenvector of $Z_{t-1} Z_{t-1}^\top$ with eigenvalue $\lambda_v \le 2\varepsilon^3 w_{t-1}/k \le w_{t-1}\, \varepsilon^2/k$, assuming $\varepsilon \le 0.5$. It follows that $C_{t,z}$ is a projection of $C_{t,z-1} + \lambda_v v v^\top$. Now, since $C_{t,z-1} v = 0$ (as $v$ is a column vector of $U_{t,z-1}$), we have that
$\|C_{t,z}\|_2 \le \|C_{t,z-1} + \lambda_v v v^\top\|_2 = \max\{\|C_{t,z-1}\|_2,\ \lambda_v\}$.
According to our induction hypothesis and the bound for $\lambda_v$, the above expression is bounded by $w_{t-1}\, \varepsilon^2/k$ as required.

Lemma 4.6. Let $u$ be a vector that is not in $U_{t,z}$ and is in $U_{t,z+1}$. Let $\lambda$ be the eigenvalue associated to it with respect to $C_{t,z}$. It holds that $\lambda \le w_{t-1}\, \varepsilon^2/k$. Furthermore, if $u$ is a column vector in $U_{t',z'}$ for all $(t', z') \ge (t, z)$, it holds that $\|Z_n^\top u\|_2^2 \le \lambda \le \|X_n\|_F^2\, \varepsilon^2/k$.

Proof.
Since $u$ is chosen as the top eigenvector of $C_{t,z}$, we have by Lemma 4.5 that
$\lambda = \|C_{t,z} u\|_2 = \|C_{t,z}\|_2 \le w_{t-1}\, \varepsilon^2/k$.

For the second claim in the lemma we note that since $u$ is an eigenvector of $C_{t,z} = (I - U_{t,z} U_{t,z}^\top) Z_{t-1} Z_{t-1}^\top (I - U_{t,z} U_{t,z}^\top)$, we have that $U_{t,z}^\top u = 0$, hence
$\|Z_{t-1}^\top u\|_2^2 = \|Z_{t-1}^\top (I - U_{t,z} U_{t,z}^\top) u\|_2^2 = u^\top C_{t,z} u \le \|C_{t,z}\|_2 \le w_{t-1}\, \varepsilon^2/k$.
Since $u$ is assumed to be an element of $U$ throughout the remaining running time of the algorithm, all future vectors $r$ added to the sketch $Z$ are orthogonal to $u$. The claim now follows from Lemma 4.1, item 3.

Lemma 4.7. $\|R_n\|_2^2 \le 2\varepsilon^2 \|X_n\|_F^2/k$.

Proof. Let $u_1, \ldots, u_{\ell'}$ and $\lambda_1, \ldots, \lambda_{\ell'}$ be the columns of $U_n$ and their corresponding eigenvalues in $C$ at the time of their addition to $U$. From Lemmas 4.3 and 4.6 we have that each $u_j$ is an eigenvector of $Z_n Z_n^\top$ with eigenvalue at most $\lambda_j \le \varepsilon^2 \|X_n\|_F^2/k$. It follows that
$\|Z_n Z_n^\top\|_2 \le \max\left\{ \frac{\varepsilon^2}{k} \|X_n\|_F^2,\ \|(I - U_n U_n^\top) Z_n Z_n^\top (I - U_n U_n^\top)\|_2 \right\} = \max\left\{ \frac{\varepsilon^2}{k} \|X_n\|_F^2,\ \|C_n\|_2 \right\} \le \frac{\varepsilon^2}{k} \|X_n\|_F^2$.
The last inequality is due to Lemma 4.5. Next, by the sketching property (Lemma 4.1, item 5), for an appropriate matrix $E$: $R_n R_n^\top = Z_n Z_n^\top + E$, with $\|E\|_2 \le \frac{\varepsilon^2}{k} \|R_n\|_F^2$. As the columns of $R$ are projections of those of $X$ we have $\|R_n\|_F^2 \le \|X_n\|_F^2$, hence
$\|R_n\|_2^2 = \|R_n R_n^\top\|_2 = \|Z_n Z_n^\top + E\|_2 \le \|Z_n Z_n^\top\|_2 + \|E\|_2 \le \frac{2\varepsilon^2}{k} \|X_n\|_F^2$,
which concludes the proof.

Lemma 4.8. $\|R_n\|_F^2 \le \mathrm{OPT} + 4\varepsilon \|X_n\|_F^2$.

Proof. The lemma can be proven analogously to Theorem 2.1, as the only difference is the bound on $\|R_n\|_2^2$.

Lemma 4.9. Assume that for all $t$, $\|x_t\|_2^2 \le w_t\, \varepsilon^2/(5k)$, and assume that $\varepsilon \le 0.1$. For $\tau > 0$, consider the iterations of the algorithm during which $w_t \in [2^\tau, 2^{\tau+1})$. During this time, the while-loop will be executed at most $5k/\varepsilon^2$ times.

Proof. For the proof, we define a potential function $\Phi_{t,z} = \mathrm{Trace}(R_{t-1} R_{t-1}^\top) - \mathrm{Trace}(C_{t,z})$. We first notice that since $C$ is clearly PSD,
$\Phi_{t,z} \le \mathrm{Trace}(R_{t-1} R_{t-1}^\top) = \|R_{t-1}\|_F^2 \le \|X_{t-1}\|_F^2 \le 2^{\tau+1}$.
The first inequality is correct because the columns of $R$ are projections of those of $X$, and the second because $\|X_{t-1}\|_F^2 \le w_{t-1}$. We show that $\Phi$ is non-decreasing with time and that, for valid $z > 0$, $\Phi_{t,z} \ge \Phi_{t,z-1} + \frac{\varepsilon^2}{5k}\, 2^{\tau+1}$. The result immediately follows.

Consider a pair $(t, z)$ followed by the pair of indices $(t+1, 0)$. Here, $\Phi_{t+1,0} - \Phi_{t,z} \ge \|r_t\|_2^2 - \|r_t\|_2^2 = 0$, since the trace of $R R^\top$ grows by $\|r_t\|_2^2$ while sketching $r_t$ into $Z$ increases $\mathrm{Trace}(C)$ by at most $\|r_t\|_2^2$; hence for such pairs the potential is non-decreasing. Now consider some pair $(t, z)$ for $z > 0$. Since $(t, z)$ is a valid pair it holds that

(4.9)  $\|C_{t,z-1}\|_2 \ge \|C_{t,z-1} + r_{t,z-1} r_{t,z-1}^\top\|_2 - \|r_{t,z-1} r_{t,z-1}^\top\|_2 \ge \frac{\varepsilon^2 w_t}{k} - \frac{\varepsilon^2 w_t}{5k} \ge \frac{2\varepsilon^2}{5k}\, 2^{\tau+1}$.

Denote by $u$ the column vector in $U_{t,z}$ that is not in $U_{t,z-1}$. Let $U'$ be the matrix obtained by appending the column $u$ to the matrix $U_{t,z-1}$, and let
$C' = (I - U' U'^\top)\, Z_{t-1} Z_{t-1}^\top (I - U' U'^\top) = (I - u u^\top)\, C_{t,z-1}\, (I - u u^\top)$.
Since $u$ is the top eigenvector of $C_{t,z-1}$, we have by equation (4.9) that
$\mathrm{Trace}(C') \le \mathrm{Trace}(C_{t,z-1}) - \frac{2\varepsilon^2}{5k}\, 2^{\tau+1}$.
If $U_{t,z-1}$ had a zero column then $C_{t,z} = C'$ and we are done. If not, let $v$ be the vector that was replaced by $u$. According to Lemma 4.3, $v$ is a singular vector of $Z_{t-1}$. According to Lemma 4.4 and $\varepsilon \le 0.1$ we have that
$\|Z_{t-1}^\top v\|_2^2 \le \frac{2\varepsilon^3}{k} \|X_{t-1}\|_F^2 \le \frac{\varepsilon^2}{5k}\, 2^{\tau+1}$.

Hence, $C_{t,z} = C' + \|Z_{t-1}^\top v\|_2^2\, v v^\top$, meaning that
$\mathrm{Trace}(C_{t,z}) - \mathrm{Trace}(C_{t,z-1}) = \left[\mathrm{Trace}(C_{t,z}) - \mathrm{Trace}(C')\right] + \left[\mathrm{Trace}(C') - \mathrm{Trace}(C_{t,z-1})\right] \le \frac{\varepsilon^2}{5k}\, 2^{\tau+1} - \frac{2\varepsilon^2}{5k}\, 2^{\tau+1} = -\frac{\varepsilon^2}{5k}\, 2^{\tau+1}$.
We conclude that, as required, $\Phi$ is non-decreasing over time and, in each iteration of the while-loop, increases by at least $\frac{\varepsilon^2}{5k}\, 2^{\tau+1}$. Since $\Phi$ is upper bounded by $2^{\tau+1}$ during the discussed iterations, the lemma follows.

Lemma 4.10. Let $(v_1, t_1+1, t_1'+1), \ldots, (v_j, t_j+1, t_j'+1), \ldots$ be the sequence of triplets of vectors removed from $U$, the times at which they were added to $U$, and the times at which they were removed from $U$. Then
$\mathrm{ALG} \le \left( \|R_n\|_F + 2 \sqrt{ \sum_j \|(X_{t_j'} - X_{t_j})^\top v_j\|_2^2 } \right)^2$.

Proof. For any time $t$, denote by $U_t$ the matrix $U$ at the end of iteration $t$, by $U_t^1$ the outcome of zeroing out every column of $U_t$ that is different from the corresponding column in $U_n$, and by $U_t^2$ its complement, that is, the outcome of zeroing out every column in $U_t$ that is identical to the corresponding column in $U_n$. In the same way, define $\bar{U}_t^2$ to be the outcome of zeroing out the columns in $U_n$ that are all-zeros in $U_t^2$. It holds that
$\|x_t - U_n U_t^\top x_t\|_2 = \|x_t - (U_t + (U_n - U_t)) U_t^\top x_t\|_2 \le \|x_t - U_t U_t^\top x_t\|_2 + \|(U_n - U_t) U_t^\top x_t\|_2 = \|r_t\|_2 + \|(\bar{U}_t^2 - U_t^2)(U_t^2)^\top x_t\|_2 \le \|r_t\|_2 + 2\|(U_t^2)^\top x_t\|_2$.
Summing over all times $t$ we have
$\mathrm{ALG} \le \sum_{t=1}^n \left( \|r_t\|_2 + 2 \|(U_t^2)^\top x_t\|_2 \right)^2 = \sum_{t=1}^n \|r_t\|_2^2 + 4 \sum_{t=1}^n \|r_t\|_2\, \|(U_t^2)^\top x_t\|_2 + 4 \sum_{t=1}^n \|(U_t^2)^\top x_t\|_2^2 \le \|R_n\|_F^2 + 4 \sqrt{ \sum_{t=1}^n \|r_t\|_2^2 }\, \sqrt{ \sum_{t=1}^n \|(U_t^2)^\top x_t\|_2^2 } + 4 \sum_{t=1}^n \|(U_t^2)^\top x_t\|_2^2$,
where the last inequality follows from applying the Cauchy-Schwarz inequality to the dot product between the vectors $(\|r_1\|_2, \|r_2\|_2, \ldots, \|r_n\|_2)$ and $(\|(U_1^2)^\top x_1\|_2, \|(U_2^2)^\top x_2\|_2, \ldots, \|(U_n^2)^\top x_n\|_2)$. Since $U_t^2$ contains only vectors that were columns of $U$ at time $t$ but were replaced later, and are not present in $U_n$, we have that $\|(U_t^2)^\top x_t\|_2^2 \le \sum_{j : t_j < t \le t_j'} \langle x_t, v_j \rangle^2$, and so
$\sum_{t=1}^n \|(U_t^2)^\top x_t\|_2^2 \le \sum_j \|(X_{t_j'} - X_{t_j})^\top v_j\|_2^2$.
Thus we have that
$\mathrm{ALG} \le \|R_n\|_F^2 + 4 \|R_n\|_F \sqrt{ \sum_j \|(X_{t_j'} - X_{t_j})^\top v_j\|_2^2 } + 4 \sum_j \|(X_{t_j'} - X_{t_j})^\top v_j\|_2^2 = \left( \|R_n\|_F + 2 \sqrt{ \sum_j \|(X_{t_j'} - X_{t_j})^\top v_j\|_2^2 } \right)^2$,
which concludes the proof.

Lemma 4.11. Let $(v_1, t_1+1, t_1'+1), \ldots, (v_j, t_j+1, t_j'+1), \ldots$ be the sequence of triplets of vectors removed from $U$, the times at which they were added to $U$, and the times at which they were removed from $U$. Then
$\sum_j \|(X_{t_j'} - X_{t_j})^\top v_j\|_2^2 \le 40\, \varepsilon\, \|X_n\|_F^2$.

Proof. For some $\tau > 0$, consider the execution of the algorithm during the period in which $w_t \in [2^\tau, 2^{\tau+1})$. According to Lemma 4.9, at most $5k/\varepsilon^2$ vectors $v_j$ were removed from $U$ during that period. According to Lemma 4.4, for each such $v_j$ it holds that
$\|(X_{t_j'} - X_{t_j})^\top v_j\|_2^2 \le \frac{2\varepsilon^3}{k}\, w_t \le \frac{2\varepsilon^3}{k}\, 2^{\tau+1}$.
It follows that the contribution of the vectors $v_j$ removed from the set during the discussed time period is at most
$\frac{5k}{\varepsilon^2} \cdot \frac{2\varepsilon^3}{k}\, 2^{\tau+1} = 10\varepsilon\, 2^{\tau+1}$.
The entire sum can now be bounded by a geometric series, ending at $\tau = \log_2 \|X\|_F^2$, thus proving the lemma.

Corollary 4.1. Combining Lemmas 4.8, 4.10 and 4.11,
$\mathrm{ALG}_{k,\varepsilon} \le \left( \sqrt{\mathrm{OPT} + 4\varepsilon \|X_n\|_F^2} + 2\sqrt{40\,\varepsilon}\, \|X_n\|_F \right)^2 \le \mathrm{OPT} + O\!\left( \sqrt{\varepsilon\, \mathrm{OPT}}\, \|X_n\|_F + \varepsilon \|X_n\|_F^2 \right)$.

In particular, for $\varepsilon = \delta^2\, \mathrm{OPT}/\|X_n\|_F^2$, we get $\mathrm{ALG}_{k,\varepsilon} = \mathrm{OPT}\,(1 + O(\delta))$.

Lemma 4.12. (Time complexity) Algorithm 2 requires $O\!\left(n d k/\varepsilon^3 + \log(w_n/w_0)\, d k^3/\varepsilon^6\right)$ arithmetic operations.

Proof. We begin by pointing out that in the algorithm we work with the $d \times d$ matrices $C$ and $Z Z^\top$. These matrices have a bounded rank and, furthermore, we are always able to maintain a representation of them of the form $A A^\top$ with $A$ being a $d \times r$ matrix, $r$ being the rank of $C$ or $Z Z^\top$. Hence, all computations involving them require $O(dr)$ or $O(dr^2)$ arithmetic operations, corresponding to a product with a vector and to computing the eigendecomposition, respectively.

Consider first the cost of the operations outside the while-loop that do not involve the matrix $C$. It is easy to see, with Lemma 4.1, item 1, that the amortized time required in each iteration is $O(dk/\varepsilon^3)$. The two operations of computing $C$ and its top singular value may require $O(dk^2/\varepsilon^4)$ time (the rank of $C$ is at most the number of columns of $Z$, which is $O(k/\varepsilon^2)$). However, these operations do not necessarily have to occur in every iteration if we use the following trick: when entering the while-loop we require the norm of $C + r_t r_t^\top$ to be bounded by $\varepsilon^2 w_t/(2k)$ rather than $\varepsilon^2 w_t/k$. Assume now that we entered the while-loop at time $t'$. In this case, we do not need to check the condition of the while-loop again until we reach an iteration $t$ where $w_t \ge w_{t'}(1 + \varepsilon^2/2)$. Other than checking the condition of the while-loop there is no need to compute $C$, hence the costly operations of the external loop need only be executed $O(\log(w_n/w_0)/\varepsilon^2)$ times. We now proceed to analyze the running time of the inner while-loop. Each such iteration requires $O(dk^2/\varepsilon^4)$ arithmetic operations. However, according to Lemma 4.9, the total number of such iterations is bounded by $O(k \log(w_n/w_0)/\varepsilon^2)$. The lemma immediately follows.

Lemma 4.13. (Space complexity) Algorithm 2 requires $O(dk/\varepsilon^3)$ space.

Proof. Immediate from the algorithm and the properties of the Frequent Directions sketch (Lemma 4.1, item 2).

5 Conclusion

Recent discussions with Mark Tygert raise the conjecture that a random-projection based PCA technique (see Section 1.1.1) could potentially yield bounds similar to the ones we achieve here. Another interesting observation is that our algorithm actually bounds $\|X - \Phi Y\|_2^2$. To the best of our understanding, comparable results are unlikely to be obtainable via random-projection based techniques. Such bounds are useful in the case of noisy data where the optimal registration error is large but the signal is still recoverable. A proof of this fact will appear in the full version of this paper.

References

[1] Dimitris Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. J. Comput. Syst. Sci., 66(4), 2003.

[2] Raman Arora, Andy Cotter, and Nati Srebro. Stochastic optimization of PCA with capped MSG. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, 2013.

[3] Akshay Balsubramani, Sanjoy Dasgupta, and Yoav Freund. The fast convergence of incremental PCA. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, 2013.

[4] Kenneth L. Clarkson and David P. Woodruff. Numerical linear algebra in the streaming model. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing (STOC). ACM, 2009.

[5] Sanjoy Dasgupta and Anupam Gupta. An elementary proof of the Johnson-Lindenstrauss lemma. Technical report, Berkeley, CA, 1999.
[6] Amit Deshpande and Santosh Vempala. Adaptive sampling and fast low-rank matrix approximation. In APPROX-RANDOM, 2006.

[7] Petros Drineas and Ravi Kannan. Pass efficient algorithms for approximating large matrices. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '03), Philadelphia, PA, USA, 2003. Society for Industrial and Applied Mathematics.

[8] George H. Dunteman. Principal Components Analysis. Number 69. Sage.

[9] Alan Frieze, Ravi Kannan, and Santosh Vempala. Fast Monte-Carlo algorithms for finding low-rank approximations. In Proceedings of the 39th Annual Symposium on Foundations of Computer Science (FOCS '98), Washington, DC, USA, 1998. IEEE Computer Society.

[10] Mina Ghashami and Jeff M. Phillips. Relative errors for deterministic low-rank matrix approximations. In SODA, 2014.

[11] Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2), 2011.

[12] W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemp. Math., 26, 1984.

[13] Edo Liberty. Simple and deterministic matrix sketching. In KDD, 2013.

[14] Edo Liberty, Franco Woolfe, Per-Gunnar Martinsson, Vladimir Rokhlin, and Mark Tygert. Randomized algorithms for the low-rank approximation of matrices. Proceedings of the National Academy of Sciences, 104(51), December 2007.

[15] Ioannis Mitliagkas, Constantine Caramanis, and Prateek Jain. Memory limited, streaming PCA. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, 2013.

[16] Jiazhong Nie, Wojciech Kotlowski, and Manfred K. Warmuth. Online PCA with optimal regrets. In ALT, 2013.

[17] Mark Rudelson and Roman Vershynin. Sampling from large matrices: An approach through geometric functional analysis. J. ACM, 54(4), July 2007.

[18] Tamás Sarlós. Improved approximation algorithms for large matrices via random projections. In FOCS. IEEE Computer Society, 2006.

[19] Manfred K. Warmuth and Dima Kuzmin. Randomized online PCA algorithms with regret bounds that are logarithmic in the dimension. Journal of Machine Learning Research, 2008.

APPENDIX: Omitted Proofs

5.1 Proof of Lemma 3.1

Proof. Let $C_i$ be a value taken by the matrix $C$ at some point during the algorithm. Let $l(i)$ be the number of non-zero vectors in $U$ at the time of the update to $C$. For $j \le \ell'$, denote by $U_j$ the $d \times \ell$ matrix whose $p$-th column is $u_p$ for $p < j$ and the zero column for $p \ge j$. We prove by induction on $i$ that

1. $U_{l(i)}^\top u_{l(i)} = 0$;

2. $C_i \cdot \left( \sum_{j=1}^{l(i)} \lambda_j u_j u_j^\top \right) = 0_{d \times d}$.

The claim will immediately follow. Notice that since all $u_j$ are chosen as unit vectors, statement 1 above also indicates that $u_1, \ldots, u_{l(i)}$ form an orthonormal set.

Base case. The base case of the proof is trivial, as for $i = 1$ it holds that $l(i) = 0$, hence $U_{l(i)}$ is the all-zeros matrix and $u_{l(i)} = 0$.

Induction hypothesis. Let $i \ge 1$. We assume the following:

1. $U_{l(i)}^\top u_{l(i)} = 0$;

2. $C_i \cdot \left( \sum_{j=1}^{l(i)} \lambda_j u_j u_j^\top \right) = 0_{d \times d}$.

Induction step. We prove that for $i+1$:

1. $U_{l(i+1)}^\top u_{l(i+1)} = 0$;

2. $C_{i+1} \cdot \left( \sum_{j=1}^{l(i+1)} \lambda_j u_j u_j^\top \right) = 0_{d \times d}$.

Consider the matrix $C_{i+1}$. There are two possible reasons for the change in $C$, and we split the proof to deal with each reason separately.

First reason: $C_{i+1} = C_i + r_t r_t^\top$ for some $t$. In this case item 1 holds trivially, as $l(i+1) = l(i)$. For item 2, denote $P = \sum_{j=1}^{l(i)} u_j u_j^\top$. We have
$C_{i+1} \cdot \left( \sum_{j=1}^{l(i)} \lambda_j u_j u_j^\top \right) = (C_i + r_t r_t^\top) \sum_{j=1}^{l(i)} \lambda_j u_j u_j^\top \overset{(i)}{=} r_t r_t^\top \sum_{j=1}^{l(i)} \lambda_j u_j u_j^\top \overset{(ii)}{=} (I_d - P)\, x_t x_t^\top (I_d - P) \sum_{j=1}^{l(i)} \lambda_j u_j u_j^\top \overset{(iii)}{=} 0_{d \times d}$.
Equality (i) is due to induction hypothesis 2. Equality (ii) is due to the definition of $r_t$, and equality (iii) holds since the induction hypothesis for item 1 indicates that $u_1, \ldots, u_{l(i)}$ are orthonormal, hence $(I_d - P) \sum_{j=1}^{l(i)} \lambda_j u_j u_j^\top = 0_{d \times d}$.

Second reason: $l(i+1) = l(i) + 1$ and $C_{i+1} = C_i - \lambda_{l(i+1)} u_{l(i+1)} u_{l(i+1)}^\top$. Furthermore, $(u_{l(i+1)}, \lambda_{l(i+1)})$ is the top eigen-pair of $C_i$. For item 1, notice that according to our induction hypothesis (item 2), $C_i$ is a symmetric matrix whose row and column spaces are orthogonal to $u_1, \ldots, u_{l(i)}$. Hence, as $u_{l(i+1)}$ is an eigenvector of $C_i$, it must be the case that $u_{l(i+1)}$ is orthogonal to $u_1, \ldots, u_{l(i)}$. Next, we proceed to item 2.

Denote $Z = \sum_{j=1}^{l(i+1)} \lambda_j u_j u_j^\top$. Notice that

$C_{i+1} \cdot \left( \sum_{j=1}^{l(i+1)} \lambda_j u_j u_j^\top \right) = \left( C_i - \lambda_{l(i+1)} u_{l(i+1)} u_{l(i+1)}^\top \right) Z \overset{(i)}{=} C_i \left( I_d - u_{l(i+1)} u_{l(i+1)}^\top \right) Z \overset{(ii)}{=} C_i \left( I_d - \sum_{j=1}^{l(i)} u_j u_j^\top \right)\left( I_d - u_{l(i+1)} u_{l(i+1)}^\top \right) Z \overset{(iii)}{=} C_i \left( I_d - \sum_{j=1}^{l(i+1)} u_j u_j^\top \right) Z \overset{(iv)}{=} 0_{d \times d}$.

Equality (i) holds as $(u_{l(i+1)}, \lambda_{l(i+1)})$ is an eigen-pair of $C_i$. Equality (ii) is due to induction hypothesis 2. Equalities (iii) and (iv) hold as $u_1, \ldots, u_{l(i+1)}$ are an orthonormal set.

5.2 Basic Facts

First, we show that the matrices $X$, $\hat{X}$, and $R$ satisfy some form of the Pythagorean theorem for matrices. This result is useful in a technical manipulation in Lemma 3.5, where we provide an upper bound for $\|R\|_F^2$.

Lemma 5.1. Let $X \in \mathbb{R}^{d \times n}$, $\hat{X} \in \mathbb{R}^{d \times n}$ and $R \in \mathbb{R}^{d \times n}$ be the matrices whose $t$-th column is $x_t \in \mathbb{R}^d$, $\hat{x}_t = U_t U_t^\top x_t \in \mathbb{R}^d$, and $r_t = x_t - \hat{x}_t \in \mathbb{R}^d$, respectively (the vectors $r_t$ are taken as the ones at the end of the corresponding iteration in Algorithm 1). Then,
$\|X\|_F^2 = \|\hat{X}\|_F^2 + \|R\|_F^2$.

Proof. Recall that for any projection matrix $P$ and vector $x$ it holds that $\|x\|_2^2 = \|P x\|_2^2 + \|(I - P) x\|_2^2$. According to Lemma 3.1 we have that $P_t = U_t U_t^\top$ is a projection matrix for all $t$, hence
$\|x_t\|_2^2 = \|P_t x_t\|_2^2 + \|(I - P_t) x_t\|_2^2 = \|\hat{x}_t\|_2^2 + \|r_t\|_2^2$.
Summing over all $t$ yields the claim.

The next two lemmas provide upper bounds for the dimensions of the subspaces spanned by the columns of $\sum_{j=1}^{\ell'} \lambda_j u_j u_j^\top$ and $C_z$, respectively. We will use those results in Lemma 5.4, where one requires a description of the SVD of the sum $\sum_{j=1}^{\ell'} \lambda_j u_j u_j^\top + C_z$ to calculate a bound for $\|R\|_2^2$.

Lemma 5.2. Let $[\lambda_j, u_j]$, for $j = 1, \ldots, \ell'$, be the pairs of eigenvalues/eigenvectors computed in Algorithm 1, with $[\lambda_j, u_j]$ computed before $[\lambda_{j+1}, u_{j+1}]$. Then,
$\mathrm{rank}\left( \sum_{j=1}^{\ell'} \lambda_j u_j u_j^\top \right) = \ell'$.

Proof. This result follows immediately from Lemma 3.1 and Lemma 3.3. Specifically, from Lemma 3.1 we have that the $u_j$'s are orthogonal to each other, and from Lemma 3.3 we have that $\lambda_j > 0$, hence $\mathrm{rank}(\sum_{j=1}^{\ell'} \lambda_j u_j u_j^\top) = \ell'$.

Lemma 5.3. Let $C_z$ be the matrix $C$ upon termination of Algorithm 1. Then, $\mathrm{rank}(C_z) \le d - \ell'$.

Proof. We give a proof by contradiction. Assume that $\mathrm{rank}(C_z) := \rho > d - \ell'$. For notational convenience, let $B := \sum_{j=1}^{\ell'} \lambda_j u_j u_j^\top$. From Lemma 5.2, we have that $\mathrm{rank}(B) = \ell'$, hence $B = U_B \Sigma_B U_B^\top$ is the SVD of $B$ with $U_B \in \mathbb{R}^{d \times \ell'}$ and $\Sigma_B \in \mathbb{R}^{\ell' \times \ell'}$. From our assumption, we have that $\mathrm{rank}(C_z) = \rho$, hence $C_z = U_{C_z} \Sigma_{C_z} U_{C_z}^\top$ is the SVD of $C_z$ with $U_{C_z} \in \mathbb{R}^{d \times \rho}$ and $\Sigma_{C_z} \in \mathbb{R}^{\rho \times \rho}$; here, $\Sigma_{C_z}$ contains strictly positive diagonal entries. From Lemma 3.1 we have that $C_z B = 0_{d \times d}$, which is equivalent to saying that $U_{C_z}^\top U_B = 0$. Also, $U_{C_z}$ contains $\rho > d - \ell'$ columns. On the one hand, based on the relation $U_{C_z}^\top U_B = 0$, every column of $U_{C_z}$ lies in $\mathrm{span}(I - U_B U_B^\top)$, which has dimension $d - \ell'$. On the other hand, the $\rho > d - \ell'$ orthonormal columns of $U_{C_z}$ cannot all lie in a subspace of dimension $d - \ell'$. The two statements contradict each other, which proves the claim.

Finally, we prove a result which says that the spectral norm squared of $R$ is at most the maximum of the spectral norm of $C_z$ and the spectral norm of $\sum_{j=1}^{\ell'} \lambda_j u_j u_j^\top$. This result is useful in proving Lemma 3.6.

Lemma 5.4. Let $R \in \mathbb{R}^{d \times n}$ be the matrix whose $t$-th column is $r_t \in \mathbb{R}^d$ (the vectors $r_t$ are taken as the ones at the end of the corresponding iteration). Let $C_z$ be the $d \times d$ matrix $C$ at the end of the algorithm, and let $\lambda_j, u_j$, for $j = 1, \ldots, \ell'$, be the eigenvalues and corresponding eigenvectors computed in Algorithm 1, with $\lambda_j$ computed before $\lambda_{j+1}$. Then,
$\|R\|_2^2 = \|R R^\top\|_2 = \max\left\{ \|C_z\|_2,\ \left\| \sum_{j=1}^{\ell'} \lambda_j u_j u_j^\top \right\|_2 \right\}$.

Proof. For notational convenience, let $B := \sum_{j=1}^{\ell'} \lambda_j u_j u_j^\top$. From Lemma 5.2, we have that $\mathrm{rank}(B) = \ell'$, hence $B = U_B \Sigma_B U_B^\top$ is the SVD of $B$ with $U_B \in \mathbb{R}^{d \times \ell'}$
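As a small numerical illustration (ours, not part of the paper) of the claim of Lemma 5.4, namely that for positive semidefinite matrices with mutually orthogonal column spaces the spectral norm of the sum equals the larger of the two spectral norms:

```python
import numpy as np

d = 50
Q, _ = np.linalg.qr(np.random.randn(d, 10))                 # 10 orthonormal directions
A = Q[:, :6] @ np.diag([5., 4, 3, 2, 1, 1]) @ Q[:, :6].T    # PSD, spanned by the first 6
B = Q[:, 6:] @ np.diag([7., 2, 2, 1]) @ Q[:, 6:].T          # PSD, spanned by the other 4
assert np.isclose(A @ B, 0).all()                           # orthogonal column spaces: A B = 0
lhs = np.linalg.norm(A + B, 2)
assert np.isclose(lhs, max(np.linalg.norm(A, 2), np.linalg.norm(B, 2)))   # both equal 7
```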


More information

Dot Products. K. Behrend. April 3, Abstract A short review of some basic facts on the dot product. Projections. The spectral theorem.

Dot Products. K. Behrend. April 3, Abstract A short review of some basic facts on the dot product. Projections. The spectral theorem. Dot Products K. Behrend April 3, 008 Abstract A short review of some basic facts on the dot product. Projections. The spectral theorem. Contents The dot product 3. Length of a vector........................

More information

Basic Calculus Review

Basic Calculus Review Basic Calculus Review Lorenzo Rosasco ISML Mod. 2 - Machine Learning Vector Spaces Functionals and Operators (Matrices) Vector Space A vector space is a set V with binary operations +: V V V and : R V

More information

Lecture 3: Review of Linear Algebra

Lecture 3: Review of Linear Algebra ECE 83 Fall 2 Statistical Signal Processing instructor: R Nowak Lecture 3: Review of Linear Algebra Very often in this course we will represent signals as vectors and operators (eg, filters, transforms,

More information

Lecture 3: Review of Linear Algebra

Lecture 3: Review of Linear Algebra ECE 83 Fall 2 Statistical Signal Processing instructor: R Nowak, scribe: R Nowak Lecture 3: Review of Linear Algebra Very often in this course we will represent signals as vectors and operators (eg, filters,

More information

Non-Asymptotic Theory of Random Matrices Lecture 4: Dimension Reduction Date: January 16, 2007

Non-Asymptotic Theory of Random Matrices Lecture 4: Dimension Reduction Date: January 16, 2007 Non-Asymptotic Theory of Random Matrices Lecture 4: Dimension Reduction Date: January 16, 2007 Lecturer: Roman Vershynin Scribe: Matthew Herman 1 Introduction Consider the set X = {n points in R N } where

More information

Methods for sparse analysis of high-dimensional data, II

Methods for sparse analysis of high-dimensional data, II Methods for sparse analysis of high-dimensional data, II Rachel Ward May 26, 2011 High dimensional data with low-dimensional structure 300 by 300 pixel images = 90, 000 dimensions 2 / 55 High dimensional

More information

THE SINGULAR VALUE DECOMPOSITION MARKUS GRASMAIR

THE SINGULAR VALUE DECOMPOSITION MARKUS GRASMAIR THE SINGULAR VALUE DECOMPOSITION MARKUS GRASMAIR 1. Definition Existence Theorem 1. Assume that A R m n. Then there exist orthogonal matrices U R m m V R n n, values σ 1 σ 2... σ p 0 with p = min{m, n},

More information

Lecture 18 Nov 3rd, 2015

Lecture 18 Nov 3rd, 2015 CS 229r: Algorithms for Big Data Fall 2015 Prof. Jelani Nelson Lecture 18 Nov 3rd, 2015 Scribe: Jefferson Lee 1 Overview Low-rank approximation, Compression Sensing 2 Last Time We looked at three different

More information

PCA, Kernel PCA, ICA

PCA, Kernel PCA, ICA PCA, Kernel PCA, ICA Learning Representations. Dimensionality Reduction. Maria-Florina Balcan 04/08/2015 Big & High-Dimensional Data High-Dimensions = Lot of Features Document classification Features per

More information

642:550, Summer 2004, Supplement 6 The Perron-Frobenius Theorem. Summer 2004

642:550, Summer 2004, Supplement 6 The Perron-Frobenius Theorem. Summer 2004 642:550, Summer 2004, Supplement 6 The Perron-Frobenius Theorem. Summer 2004 Introduction Square matrices whose entries are all nonnegative have special properties. This was mentioned briefly in Section

More information

Fast Random Projections using Lean Walsh Transforms Yale University Technical report #1390

Fast Random Projections using Lean Walsh Transforms Yale University Technical report #1390 Fast Random Projections using Lean Walsh Transforms Yale University Technical report #1390 Edo Liberty Nir Ailon Amit Singer Abstract We present a k d random projection matrix that is applicable to vectors

More information

Dimensionality reduction: Johnson-Lindenstrauss lemma for structured random matrices

Dimensionality reduction: Johnson-Lindenstrauss lemma for structured random matrices Dimensionality reduction: Johnson-Lindenstrauss lemma for structured random matrices Jan Vybíral Austrian Academy of Sciences RICAM, Linz, Austria January 2011 MPI Leipzig, Germany joint work with Aicke

More information

Sketching as a Tool for Numerical Linear Algebra

Sketching as a Tool for Numerical Linear Algebra Sketching as a Tool for Numerical Linear Algebra (Part 2) David P. Woodruff presented by Sepehr Assadi o(n) Big Data Reading Group University of Pennsylvania February, 2015 Sepehr Assadi (Penn) Sketching

More information

Solution-recovery in l 1 -norm for non-square linear systems: deterministic conditions and open questions

Solution-recovery in l 1 -norm for non-square linear systems: deterministic conditions and open questions Solution-recovery in l 1 -norm for non-square linear systems: deterministic conditions and open questions Yin Zhang Technical Report TR05-06 Department of Computational and Applied Mathematics Rice University,

More information

Noisy Streaming PCA. Noting g t = x t x t, rearranging and dividing both sides by 2η we get

Noisy Streaming PCA. Noting g t = x t x t, rearranging and dividing both sides by 2η we get Supplementary Material A. Auxillary Lemmas Lemma A. Lemma. Shalev-Shwartz & Ben-David,. Any update of the form P t+ = Π C P t ηg t, 3 for an arbitrary sequence of matrices g, g,..., g, projection Π C onto

More information

Lecture 4: Purifications and fidelity

Lecture 4: Purifications and fidelity CS 766/QIC 820 Theory of Quantum Information (Fall 2011) Lecture 4: Purifications and fidelity Throughout this lecture we will be discussing pairs of registers of the form (X, Y), and the relationships

More information

Contents. 1 Vectors, Lines and Planes 1. 2 Gaussian Elimination Matrices Vector Spaces and Subspaces 124

Contents. 1 Vectors, Lines and Planes 1. 2 Gaussian Elimination Matrices Vector Spaces and Subspaces 124 Matrices Math 220 Copyright 2016 Pinaki Das This document is freely redistributable under the terms of the GNU Free Documentation License For more information, visit http://wwwgnuorg/copyleft/fdlhtml Contents

More information

Simple and Deterministic Matrix Sketches

Simple and Deterministic Matrix Sketches Simple and Deterministic Matrix Sketches Edo Liberty + ongoing work with: Mina Ghashami, Jeff Philips and David Woodruff. Edo Liberty: Simple and Deterministic Matrix Sketches 1 / 41 Data Matrices Often

More information

A Randomized Algorithm for Large Scale Support Vector Learning

A Randomized Algorithm for Large Scale Support Vector Learning A Randomized Algorithm for Large Scale Support Vector Learning Krishnan S. Department of Computer Science and Automation, Indian Institute of Science, Bangalore-12 rishi@csa.iisc.ernet.in Chiranjib Bhattacharyya

More information

Lecture 9: Low Rank Approximation

Lecture 9: Low Rank Approximation CSE 521: Design and Analysis of Algorithms I Fall 2018 Lecture 9: Low Rank Approximation Lecturer: Shayan Oveis Gharan February 8th Scribe: Jun Qi Disclaimer: These notes have not been subjected to the

More information

SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices)

SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Chapter 14 SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Today we continue the topic of low-dimensional approximation to datasets and matrices. Last time we saw the singular

More information

A fast randomized algorithm for overdetermined linear least-squares regression

A fast randomized algorithm for overdetermined linear least-squares regression A fast randomized algorithm for overdetermined linear least-squares regression Vladimir Rokhlin and Mark Tygert Technical Report YALEU/DCS/TR-1403 April 28, 2008 Abstract We introduce a randomized algorithm

More information

sublinear time low-rank approximation of positive semidefinite matrices Cameron Musco (MIT) and David P. Woodru (CMU)

sublinear time low-rank approximation of positive semidefinite matrices Cameron Musco (MIT) and David P. Woodru (CMU) sublinear time low-rank approximation of positive semidefinite matrices Cameron Musco (MIT) and David P. Woodru (CMU) 0 overview Our Contributions: 1 overview Our Contributions: A near optimal low-rank

More information

arxiv: v1 [cs.ds] 30 May 2010

arxiv: v1 [cs.ds] 30 May 2010 ALMOST OPTIMAL UNRESTRICTED FAST JOHNSON-LINDENSTRAUSS TRANSFORM arxiv:005.553v [cs.ds] 30 May 200 NIR AILON AND EDO LIBERTY Abstract. The problems of random projections and sparse reconstruction have

More information

Elementary linear algebra

Elementary linear algebra Chapter 1 Elementary linear algebra 1.1 Vector spaces Vector spaces owe their importance to the fact that so many models arising in the solutions of specific problems turn out to be vector spaces. The

More information

4 Bias-Variance for Ridge Regression (24 points)

4 Bias-Variance for Ridge Regression (24 points) Implement Ridge Regression with λ = 0.00001. Plot the Squared Euclidean test error for the following values of k (the dimensions you reduce to): k = {0, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500,

More information

Boutsidis, Garber, Karnin, Liberty. PRESENTED BY Firstname Lastname August 25, 2013 PRESENTED BY Zohar Karnin November 23, 2014

Boutsidis, Garber, Karnin, Liberty. PRESENTED BY Firstname Lastname August 25, 2013 PRESENTED BY Zohar Karnin November 23, 2014 A Online PowerPoint Principal Presentation Component Analysis Boutsidis, Garber, Karnin, Liberty PRESENTED BY Firstname Lastname August 5, 013 PRESENTED BY Zohar Karnin November 3, 014 Data Matrix Often,

More information

A Simple Algorithm for Clustering Mixtures of Discrete Distributions

A Simple Algorithm for Clustering Mixtures of Discrete Distributions 1 A Simple Algorithm for Clustering Mixtures of Discrete Distributions Pradipta Mitra In this paper, we propose a simple, rotationally invariant algorithm for clustering mixture of distributions, including

More information

Normalized power iterations for the computation of SVD

Normalized power iterations for the computation of SVD Normalized power iterations for the computation of SVD Per-Gunnar Martinsson Department of Applied Mathematics University of Colorado Boulder, Co. Per-gunnar.Martinsson@Colorado.edu Arthur Szlam Courant

More information

Vector Spaces, Orthogonality, and Linear Least Squares

Vector Spaces, Orthogonality, and Linear Least Squares Week Vector Spaces, Orthogonality, and Linear Least Squares. Opening Remarks.. Visualizing Planes, Lines, and Solutions Consider the following system of linear equations from the opener for Week 9: χ χ

More information

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A =

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = 30 MATHEMATICS REVIEW G A.1.1 Matrices and Vectors Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = a 11 a 12... a 1N a 21 a 22... a 2N...... a M1 a M2... a MN A matrix can

More information

Large Scale Data Analysis Using Deep Learning

Large Scale Data Analysis Using Deep Learning Large Scale Data Analysis Using Deep Learning Linear Algebra U Kang Seoul National University U Kang 1 In This Lecture Overview of linear algebra (but, not a comprehensive survey) Focused on the subset

More information

SPECTRAL PROPERTIES OF THE LAPLACIAN ON BOUNDED DOMAINS

SPECTRAL PROPERTIES OF THE LAPLACIAN ON BOUNDED DOMAINS SPECTRAL PROPERTIES OF THE LAPLACIAN ON BOUNDED DOMAINS TSOGTGEREL GANTUMUR Abstract. After establishing discrete spectra for a large class of elliptic operators, we present some fundamental spectral properties

More information

arxiv: v5 [math.na] 16 Nov 2017

arxiv: v5 [math.na] 16 Nov 2017 RANDOM PERTURBATION OF LOW RANK MATRICES: IMPROVING CLASSICAL BOUNDS arxiv:3.657v5 [math.na] 6 Nov 07 SEAN O ROURKE, VAN VU, AND KE WANG Abstract. Matrix perturbation inequalities, such as Weyl s theorem

More information

Dense Fast Random Projections and Lean Walsh Transforms

Dense Fast Random Projections and Lean Walsh Transforms Dense Fast Random Projections and Lean Walsh Transforms Edo Liberty, Nir Ailon, and Amit Singer Abstract. Random projection methods give distributions over k d matrices such that if a matrix Ψ (chosen

More information

7 Principal Component Analysis

7 Principal Component Analysis 7 Principal Component Analysis This topic will build a series of techniques to deal with high-dimensional data. Unlike regression problems, our goal is not to predict a value (the y-coordinate), it is

More information

Column Selection via Adaptive Sampling

Column Selection via Adaptive Sampling Column Selection via Adaptive Sampling Saurabh Paul Global Risk Sciences, Paypal Inc. saupaul@paypal.com Malik Magdon-Ismail CS Dept., Rensselaer Polytechnic Institute magdon@cs.rpi.edu Petros Drineas

More information

Some Useful Background for Talk on the Fast Johnson-Lindenstrauss Transform

Some Useful Background for Talk on the Fast Johnson-Lindenstrauss Transform Some Useful Background for Talk on the Fast Johnson-Lindenstrauss Transform Nir Ailon May 22, 2007 This writeup includes very basic background material for the talk on the Fast Johnson Lindenstrauss Transform

More information

Recall the convention that, for us, all vectors are column vectors.

Recall the convention that, for us, all vectors are column vectors. Some linear algebra Recall the convention that, for us, all vectors are column vectors. 1. Symmetric matrices Let A be a real matrix. Recall that a complex number λ is an eigenvalue of A if there exists

More information

Using the Johnson-Lindenstrauss lemma in linear and integer programming

Using the Johnson-Lindenstrauss lemma in linear and integer programming Using the Johnson-Lindenstrauss lemma in linear and integer programming Vu Khac Ky 1, Pierre-Louis Poirion, Leo Liberti LIX, École Polytechnique, F-91128 Palaiseau, France Email:{vu,poirion,liberti}@lix.polytechnique.fr

More information

Math 102, Winter Final Exam Review. Chapter 1. Matrices and Gaussian Elimination

Math 102, Winter Final Exam Review. Chapter 1. Matrices and Gaussian Elimination Math 0, Winter 07 Final Exam Review Chapter. Matrices and Gaussian Elimination { x + x =,. Different forms of a system of linear equations. Example: The x + 4x = 4. [ ] [ ] [ ] vector form (or the column

More information

Random Methods for Linear Algebra

Random Methods for Linear Algebra Gittens gittens@acm.caltech.edu Applied and Computational Mathematics California Institue of Technology October 2, 2009 Outline The Johnson-Lindenstrauss Transform 1 The Johnson-Lindenstrauss Transform

More information

Tikhonov Regularization of Large Symmetric Problems

Tikhonov Regularization of Large Symmetric Problems NUMERICAL LINEAR ALGEBRA WITH APPLICATIONS Numer. Linear Algebra Appl. 2000; 00:1 11 [Version: 2000/03/22 v1.0] Tihonov Regularization of Large Symmetric Problems D. Calvetti 1, L. Reichel 2 and A. Shuibi

More information

Least Squares Optimization

Least Squares Optimization Least Squares Optimization The following is a brief review of least squares optimization and constrained optimization techniques. Broadly, these techniques can be used in data analysis and visualization

More information

Approximating sparse binary matrices in the cut-norm

Approximating sparse binary matrices in the cut-norm Approximating sparse binary matrices in the cut-norm Noga Alon Abstract The cut-norm A C of a real matrix A = (a ij ) i R,j S is the maximum, over all I R, J S of the quantity i I,j J a ij. We show that

More information

Chapter XII: Data Pre and Post Processing

Chapter XII: Data Pre and Post Processing Chapter XII: Data Pre and Post Processing Information Retrieval & Data Mining Universität des Saarlandes, Saarbrücken Winter Semester 2013/14 XII.1 4-1 Chapter XII: Data Pre and Post Processing 1. Data

More information

Randomized methods for computing the Singular Value Decomposition (SVD) of very large matrices

Randomized methods for computing the Singular Value Decomposition (SVD) of very large matrices Randomized methods for computing the Singular Value Decomposition (SVD) of very large matrices Gunnar Martinsson The University of Colorado at Boulder Students: Adrianna Gillman Nathan Halko Sijia Hao

More information

Improved Bounds on the Dot Product under Random Projection and Random Sign Projection

Improved Bounds on the Dot Product under Random Projection and Random Sign Projection Improved Bounds on the Dot Product under Random Projection and Random Sign Projection Ata Kabán School of Computer Science The University of Birmingham Birmingham B15 2TT, UK http://www.cs.bham.ac.uk/

More information

CS168: The Modern Algorithmic Toolbox Lecture #8: How PCA Works

CS168: The Modern Algorithmic Toolbox Lecture #8: How PCA Works CS68: The Modern Algorithmic Toolbox Lecture #8: How PCA Works Tim Roughgarden & Gregory Valiant April 20, 206 Introduction Last lecture introduced the idea of principal components analysis (PCA). The

More information

A randomized algorithm for approximating the SVD of a matrix

A randomized algorithm for approximating the SVD of a matrix A randomized algorithm for approximating the SVD of a matrix Joint work with Per-Gunnar Martinsson (U. of Colorado) and Vladimir Rokhlin (Yale) Mark Tygert Program in Applied Mathematics Yale University

More information

A Fast Algorithm For Computing The A-optimal Sampling Distributions In A Big Data Linear Regression

A Fast Algorithm For Computing The A-optimal Sampling Distributions In A Big Data Linear Regression A Fast Algorithm For Computing The A-optimal Sampling Distributions In A Big Data Linear Regression Hanxiang Peng and Fei Tan Indiana University Purdue University Indianapolis Department of Mathematical

More information

The Johnson-Lindenstrauss Lemma

The Johnson-Lindenstrauss Lemma The Johnson-Lindenstrauss Lemma Kevin (Min Seong) Park MAT477 Introduction The Johnson-Lindenstrauss Lemma was first introduced in the paper Extensions of Lipschitz mappings into a Hilbert Space by William

More information

Least Squares Optimization

Least Squares Optimization Least Squares Optimization The following is a brief review of least squares optimization and constrained optimization techniques, which are widely used to analyze and visualize data. Least squares (LS)

More information

Functional Analysis Review

Functional Analysis Review Outline 9.520: Statistical Learning Theory and Applications February 8, 2010 Outline 1 2 3 4 Vector Space Outline A vector space is a set V with binary operations +: V V V and : R V V such that for all

More information

A fast randomized algorithm for approximating an SVD of a matrix

A fast randomized algorithm for approximating an SVD of a matrix A fast randomized algorithm for approximating an SVD of a matrix Joint work with Franco Woolfe, Edo Liberty, and Vladimir Rokhlin Mark Tygert Program in Applied Mathematics Yale University Place July 17,

More information

Exercises * on Principal Component Analysis

Exercises * on Principal Component Analysis Exercises * on Principal Component Analysis Laurenz Wiskott Institut für Neuroinformatik Ruhr-Universität Bochum, Germany, EU 4 February 207 Contents Intuition 3. Problem statement..........................................

More information

1 Singular Value Decomposition and Principal Component

1 Singular Value Decomposition and Principal Component Singular Value Decomposition and Principal Component Analysis In these lectures we discuss the SVD and the PCA, two of the most widely used tools in machine learning. Principal Component Analysis (PCA)

More information

CS 229r: Algorithms for Big Data Fall Lecture 17 10/28

CS 229r: Algorithms for Big Data Fall Lecture 17 10/28 CS 229r: Algorithms for Big Data Fall 2015 Prof. Jelani Nelson Lecture 17 10/28 Scribe: Morris Yau 1 Overview In the last lecture we defined subspace embeddings a subspace embedding is a linear transformation

More information

Tighter Low-rank Approximation via Sampling the Leveraged Element

Tighter Low-rank Approximation via Sampling the Leveraged Element Tighter Low-rank Approximation via Sampling the Leveraged Element Srinadh Bhojanapalli The University of Texas at Austin bsrinadh@utexas.edu Prateek Jain Microsoft Research, India prajain@microsoft.com

More information

Mathematical foundations - linear algebra

Mathematical foundations - linear algebra Mathematical foundations - linear algebra Andrea Passerini passerini@disi.unitn.it Machine Learning Vector space Definition (over reals) A set X is called a vector space over IR if addition and scalar

More information

10 Distributed Matrix Sketching

10 Distributed Matrix Sketching 10 Distributed Matrix Sketching In order to define a distributed matrix sketching problem thoroughly, one has to specify the distributed model, data model and the partition model of data. The distributed

More information

APPENDIX A. Background Mathematics. A.1 Linear Algebra. Vector algebra. Let x denote the n-dimensional column vector with components x 1 x 2.

APPENDIX A. Background Mathematics. A.1 Linear Algebra. Vector algebra. Let x denote the n-dimensional column vector with components x 1 x 2. APPENDIX A Background Mathematics A. Linear Algebra A.. Vector algebra Let x denote the n-dimensional column vector with components 0 x x 2 B C @. A x n Definition 6 (scalar product). The scalar product

More information

RandNLA: Randomized Numerical Linear Algebra

RandNLA: Randomized Numerical Linear Algebra RandNLA: Randomized Numerical Linear Algebra Petros Drineas Rensselaer Polytechnic Institute Computer Science Department To access my web page: drineas RandNLA: sketch a matrix by row/ column sampling

More information

Throughout these notes we assume V, W are finite dimensional inner product spaces over C.

Throughout these notes we assume V, W are finite dimensional inner product spaces over C. Math 342 - Linear Algebra II Notes Throughout these notes we assume V, W are finite dimensional inner product spaces over C 1 Upper Triangular Representation Proposition: Let T L(V ) There exists an orthonormal

More information

arxiv: v1 [math.na] 1 Sep 2018

arxiv: v1 [math.na] 1 Sep 2018 On the perturbation of an L -orthogonal projection Xuefeng Xu arxiv:18090000v1 [mathna] 1 Sep 018 September 5 018 Abstract The L -orthogonal projection is an important mathematical tool in scientific computing

More information

Yale university technical report #1402.

Yale university technical report #1402. The Mailman algorithm: a note on matrix vector multiplication Yale university technical report #1402. Edo Liberty Computer Science Yale University New Haven, CT Steven W. Zucker Computer Science and Appled

More information

Chapter 3 Transformations

Chapter 3 Transformations Chapter 3 Transformations An Introduction to Optimization Spring, 2014 Wei-Ta Chu 1 Linear Transformations A function is called a linear transformation if 1. for every and 2. for every If we fix the bases

More information

MATRICES ARE SIMILAR TO TRIANGULAR MATRICES

MATRICES ARE SIMILAR TO TRIANGULAR MATRICES MATRICES ARE SIMILAR TO TRIANGULAR MATRICES 1 Complex matrices Recall that the complex numbers are given by a + ib where a and b are real and i is the imaginary unity, ie, i 2 = 1 In what we describe below,

More information

17 Random Projections and Orthogonal Matching Pursuit

17 Random Projections and Orthogonal Matching Pursuit 17 Random Projections and Orthogonal Matching Pursuit Again we will consider high-dimensional data P. Now we will consider the uses and effects of randomness. We will use it to simplify P (put it in a

More information