Online Principal Components Analysis

Christos Boutsidis (Yahoo! Labs, New York, NY, boutsidis@yahooinc.com), Dan Garber (Technion - Israel Institute of Technology, dangar@tx.technion.ac.il), Zohar Karnin (Yahoo! Labs, Haifa, Israel, zkarnin@yahoo-inc.com), Edo Liberty (Yahoo! Labs, New York, NY, edo@yahoo-inc.com)

Abstract

We consider the online version of the well known Principal Component Analysis (PCA) problem. In standard PCA, the input to the problem is a set of d-dimensional vectors $X = [x_1, \ldots, x_n]$ and a target dimension $k < d$; the output is a set of $k$-dimensional vectors $Y = [y_1, \ldots, y_n]$ that minimize the reconstruction error $\min_\Phi \sum_i \|x_i - \Phi y_i\|_2^2$. Here, $\Phi \in \mathbb{R}^{d \times k}$ is restricted to being isometric. The global minimum of this quantity, OPT, is obtainable by offline PCA. In online PCA (OPCA) the setting is identical except for two differences: (i) the vectors $x_t$ are presented to the algorithm one by one, and for every presented $x_t$ the algorithm must output a vector $y_t$ before receiving $x_{t+1}$; (ii) the output vectors $y_t$ are $\ell$-dimensional with $\ell \ge k$, to compensate for the handicap of operating online. To the best of our knowledge, this paper is the first to consider this setting of OPCA. Our algorithm produces $y_t \in \mathbb{R}^\ell$ with $\ell = O(k \cdot \mathrm{poly}(1/\varepsilon))$ such that $\mathrm{ALG} \le \mathrm{OPT} + \varepsilon \|X\|_F^2$.

1 Introduction

Principal Component Analysis (PCA) is one of the most well known and widely used methods in scientific computing. It is used for dimension reduction, signal denoising, regression, correlation analysis, visualization, etc. [8]. It can be described in many ways, but one that is particularly appealing in the context of online algorithms is the following. Given $n$ high-dimensional vectors $x_1, \ldots, x_n \in \mathbb{R}^d$ and a target dimension $k < d$, produce $n$ low-dimensional vectors $y_1, \ldots, y_n \in \mathbb{R}^k$ such that the reconstruction error is minimized. To define the reconstruction error, let $O_{d,k}$ denote the set of $d \times k$ isometric embedding matrices: $O_{d,k} = \{\Phi \in \mathbb{R}^{d \times k} \mid \forall y \in \mathbb{R}^k,\ \|\Phi y\|_2 = \|y\|_2\}$. Then, the PCA reconstruction error is:

(1.1)  $\min_{\Phi \in O_{d,k}} \sum_{t=1}^n \|x_t - \Phi y_t\|_2^2$.

PCA can be cast as an optimization problem. Given $x_1, \ldots, x_n \in \mathbb{R}^d$ and $k < d$, the PCA problem is: find $y_1, \ldots, y_n \in \mathbb{R}^k$ that are the optimal solution to the problem:

(1.2)  $\min_{y_t \in \mathbb{R}^k} \min_{\Phi \in O_{d,k}} \sum_{t=1}^n \|x_t - \Phi y_t\|_2^2$.

The solution to this problem goes through computing the Singular Value Decomposition (SVD) of the matrix $X = [x_1, \ldots, x_n] \in \mathbb{R}^{d \times n}$. Throughout the paper we assume that $d < n$ and $\mathrm{rank}(X) = d$. Let $U_k \in \mathbb{R}^{d \times k}$ contain only the $k < d$ left singular vectors corresponding to the top $k$ singular values of $X$. Then, the optimal PCA vectors $y_t$ are $y_t = U_k^\top x_t$. Equivalently, if $Y = [y_1, \ldots, y_n] \in \mathbb{R}^{k \times n}$, then $Y = U_k^\top X$. Also, the best isometry matrix is $\Phi = U_k$. We denote this optimal solution by OPT:

(1.3)  $\mathrm{OPT} := \min_{y_t \in \mathbb{R}^k} \min_{\Phi \in O_{d,k}} \sum_{t=1}^n \|x_t - \Phi y_t\|_2^2$.

Computing the optimal vectors $y_t = U_k^\top x_t$ requires several passes over the matrix $X$. Power-iteration based methods for computing $U_k$, for example, are memory and CPU efficient but require $\omega(1)$ passes over $X$. Two passes also naively suffice: one to compute $X X^\top$, from which $U_k$ is computed, and one to generate the mapping $y_t = U_k^\top x_t$. The bottleneck is in computing $X X^\top$, which demands $\Omega(d^2)$ auxiliary space in memory and $\Omega(d^2)$ operations per vector $x_t$ (assuming they are dense). This is prohibitive even for moderate values of $d$. A significant amount of research went into reducing the computational overhead of obtaining a good approximation for $U_k$ in one pass [9, 7, 6, 18, 17, 4, 14, 11, 13, 10]. Still, for all those algorithms, a second pass is needed to produce the reduced-dimension vectors $y_t$, with the exception of [4, 11, 18], which we discuss shortly.
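To make the offline baseline concrete, here is a minimal numpy sketch (ours, not from the paper) of Eqns. (1.2)-(1.3): it computes $U_k$ via one SVD, the optimal outputs $Y = U_k^\top X$, and the value OPT as the sum of the discarded squared singular values.

```python
import numpy as np

def offline_pca(X, k):
    """Offline PCA baseline: X is d x n, k is the target dimension.

    Returns Y (k x n), the isometry Phi = U_k (d x k), and OPT, the minimal
    reconstruction error sum_t ||x_t - Phi y_t||_2^2.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Phi = U[:, :k]                 # top-k left singular vectors of X
    Y = Phi.T @ X                  # optimal low-dimensional vectors
    opt = np.sum(s[k:] ** 2)       # OPT = sum of the discarded squared singular values
    return Y, Phi, opt

# sanity check: OPT equals the reconstruction error of Phi @ Y
d, n, k = 20, 100, 3
X = np.random.randn(d, n)
Y, Phi, opt = offline_pca(X, k)
assert np.isclose(opt, np.linalg.norm(X - Phi @ Y, "fro") ** 2)
```

This is exactly the two-pass computation discussed above: one pass (here, the SVD) to obtain $U_k$ and a second pass to map every $x_t$ to $y_t$.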

1.1 Online PCA

In our setting of online PCA, the input, output, and cost function are identical to the offline case. The only difference is that the vectors $x_t$ are presented to the algorithm one by one, and for every presented $x_t$ the algorithm must output a vector $y_t$ before receiving $x_{t+1}$. Once some vector $y_t$ is returned by the algorithm, it cannot be changed at any later time. To compensate for the handicap of operating online, the target dimension of the algorithm, $\ell$, is potentially larger than $k$. This is a natural model for PCA when a downstream online algorithm is applied to $y_t$. Examples include online algorithms for clustering ($k$-means, $k$-median), regression, classification (SVM, logistic regression), facility location, and $k$-server. Our model departs from earlier definitions of online PCA. We shortly review three other definitions, point out the differences, and highlight their limitations in this context.

1.1.1 Random projections

Most similar to our work is the result of Sarlos [18], which uses the random projection method to perform online PCA. Using random projections, $y_t = S^\top x_t$ where $S \in \mathbb{R}^{d \times \ell}$ is generated randomly and independently from the data. For example, each element of $S$ can be $\pm 1$ with equal probability (Theorem 4.4 in [4]) or drawn from a normal (Gaussian) distribution (Theorem 10.5 in [11]). Then, with constant probability and for $\ell = \Theta(k/\varepsilon)$,

$\min_{\Psi \in \mathbb{R}^{d \times \ell}} \sum_{t=1}^n \|x_t - \Psi y_t\|_2^2 \le (1+\varepsilon)\,\mathrm{OPT}$.

Here, the best reconstruction matrix is $\Psi = X Y^{+}$, which is not an isometry in general.(1) We claim that this seemingly minute departure from our model is actually very significant. To see this, let $\Phi \in O_{d,k}$ be the optimal PCA projection for $X$. Consider $y_t \in \mathbb{R}^\ell$ whose first $k$ coordinates contain $\Phi^\top x_t$ and whose remaining $\ell - k$ coordinates contain an arbitrary vector $z_t \in \mathbb{R}^{\ell - k}$. In the case where $\|z_t\|_2 \gg \|\Phi^\top x_t\|_2$, the geometric arrangement of the $y_t$ potentially shares very little with that of the signal in $x_t$. Yet, $\min_{\Psi \in \mathbb{R}^{d \times \ell}} \sum_{t=1}^n \|x_t - \Psi y_t\|_2^2 \le \mathrm{OPT}$, by setting $\Psi = [\Phi \ \ 0_{d \times (\ell-k)}]$. This would have been impossible had $\Psi$ been restricted to being an isometry.

(1) The notation $Y^{+}$ stands for the Moore-Penrose inverse (pseudo inverse) of $Y$.

1.1.2 Regret minimization

A regret minimization approach to online PCA was investigated in [19, 16]. In their setting of online PCA, at time $t$, before receiving the vector $x_t$, the algorithm produces a rank-$k$ projection matrix $P_t \in \mathbb{R}^{d \times d}$.(2) The authors present two methods for computing projections $P_t$ such that the quantity $\sum_t \|x_t - P_t x_t\|_2^2$ converges to OPT in the usual no-regret sense. Since each $P_t$ can be written as $P_t = U_t U_t^\top$ for $U_t \in O_{d,k}$, it would seem that setting $y_t = U_t^\top x_t$ should solve our problem. Alas, the decomposition $P_t = U_t U_t^\top$ (and therefore $y_t$) is underdetermined. Even if we ignore this issue, this objective is problematic for the sake of dimension reduction. Consider our setting, where we can observe $x_t$ before outputting $y_t$. One can simply choose the rank-1 projection $P_t = x_t x_t^\top / \|x_t\|_2^2$. On the one hand this gives $\sum_t \|x_t - P_t x_t\|_2^2 = 0$. On the other, it clearly does not provide meaningful dimension reduction.

(2) Here, $P_t$ is a square projection matrix with $P_t^2 = P_t$.

1.1.3 Stochastic setting

There are three recent results [2, 15, 3] that efficiently approximate the PCA objective in Equation (1.1). They assume the input vectors $x_t$ are drawn i.i.d. from a fixed and unknown distribution. In this setting, after observing $n_0$ columns $x_t$, one can efficiently compute $U_{n_0} \in O_{d,k}$ such that it approximately spans the top $k$ singular vectors of $X$. Returning $y_t = 0$ for $t < n_0$ and $y_t = U_{n_0}^\top x_t$ for $t \ge n_0$ completes the algorithm (a short sketch of this baseline is given below). This algorithm is provably correct for some $n_0$ which is independent of $n$.
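The buffer-then-project baseline just described can be sketched in a few lines of numpy. This is our illustration under the i.i.d. assumption; the function name and interface are ours, not the algorithms of [2, 15, 3].

```python
import numpy as np

def stochastic_opca(stream, k, n0):
    """Buffer-then-project baseline: output y_t = 0 for the first n0 vectors,
    estimate the top-k left singular space from that buffer, then project
    every later vector onto it."""
    buffer, U = [], None
    for t, x in enumerate(stream):
        if t < n0:
            buffer.append(x)
            yield np.zeros(k)              # nothing useful can be said yet
            if t == n0 - 1:
                B = np.column_stack(buffer)
                U = np.linalg.svd(B, full_matrices=False)[0][:, :k]
        else:
            yield U.T @ x                  # project onto the estimated subspace
```

The point of the discussion that follows is that this only helps when the distribution is fixed; under drift or adversarial input the buffered estimate $U_{n_0}$ can become arbitrarily stale.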
This is intuitively correct but non-trivial to show; see [2, 15, 3] for more details. While the stochastic setting is very common in machine learning (e.g. the PAC model), in online systems the data distribution is expected to change or at least drift over time. In systems that deal with abuse detection or prevention, one can expect an almost adversarial input.

1.2 Summary of contributions

Our first contribution is a deterministic online algorithm (see Algorithm 1 in Section 2) for the standard PCA objective in Eqn. (1.2). Our main result (see Theorem 2.1) shows that, for any $X = [x_1, \ldots, x_n] \in \mathbb{R}^{d \times n}$ (under some assumptions discussed below), $k < d$, and $\varepsilon > 0$, the proposed algorithm produces a set of vectors $y_1, \ldots, y_n$ in $\mathbb{R}^\ell$ such that

$\min_{\Phi \in O_{d,\ell}} \sum_{t=1}^n \|x_t - \Phi y_t\|_2^2 \le \mathrm{OPT} + \varepsilon \|X\|_F^2$,

where $\ell = 8k/\varepsilon^2$. The description of the algorithm and the proof of its correctness are given in Section 2. While Algorithm 1 solves the main conceptual difficulty in online PCA, it has certain drawbacks:

1. It must assume that $\|x_t\|_2^2 \le \|X\|_F^2/\ell$.

2. It requires $\|X\|_F^2$ as input.

3. It spends $\Omega(d^2)$ floating point operations per $x_t$ and requires $\Theta(d^2)$ auxiliary space in memory.

In Section 4 we present Algorithm 2, which addresses all the issues above, at the cost of slightly increasing the target dimension and the additive error (see Theorem 4.1 in Section 4). We briefly explain here how we deal with the above issues:

1. We deal with arbitrary input vectors by special handling of large-norm input vectors. This is a simple amendment to the algorithm which only doubles the required target dimension.

2. The new algorithm avoids requiring $\|X\|_F$ as input by estimating it on the fly. A doubling-argument analysis shows that the target dimension grows only to $\ell = O(k \log(n)/\varepsilon^2)$.(3) Bounding the target dimension by $\ell = O(k/\varepsilon^3)$ requires a significant conceptual change to the algorithm.

3. The new algorithm spends only $O(d\ell)$ floating point operations per input vector and uses only $O(d\ell)$ space. It avoids storing and manipulating the covariance matrix $C$ by utilizing a streaming matrix approximation technique [13].

(3) Here, we assume that the $\|x_t\|$ are polynomial in $n$.

2 A simple and inefficient OPCA algorithm

Let $X = [x_1, \ldots, x_n] \in \mathbb{R}^{d \times n}$ be the input high-dimensional vectors and $\ell < d$ be the target dimension. Algorithm 1 returns $Y = [y_1, \ldots, y_n] \in \mathbb{R}^{\ell \times n}$ that is close to $X$ in the PCA sense (see Theorem 2.1 below for a precise statement of our quality-of-approximation result regarding Algorithm 1). Besides $X$ and $\ell$, the algorithm also requires $\|X\|_F^2$ in its input. Moreover, we assume $\|x_t\|_2^2 \le \|X\|_F^2/\ell$.

Algorithm 1 (An online algorithm for Principal Component Analysis)
Input: $X = [x_1, \ldots, x_n] \in \mathbb{R}^{d \times n}$ with $\|x_t\|_2^2 \le \|X\|_F^2/\ell$, $\ell < d$, $\|X\|_F^2$
  $U = 0_{d \times \ell}$; $C = 0_{d \times d}$; $\theta = 2\|X\|_F^2/\ell$
  for $t = 1, \ldots, n$ do
    $r_t = x_t - U U^\top x_t$
    while $\|C + r_t r_t^\top\|_2 \ge \theta$ do
      $[u, \lambda] \leftarrow$ TopEigenVectorAndValueOf($C$)
      add $u$ to the next zero column of $U$
      $r_t \leftarrow x_t - U U^\top x_t$
      $C \leftarrow C - \lambda u u^\top$
    end while
    $C \leftarrow C + r_t r_t^\top$
    $y_t \leftarrow U^\top x_t$
  end for

Algorithm 1 initializes $U$ and $C$ to $d \times \ell$ and $d \times d$ all-zeros matrices, respectively. It subsequently updates those matrices appropriately. The matrix $U$ is the so-called projection matrix, i.e., $y_t = U^\top x_t$. The matrix $C$ is an auxiliary matrix that accumulates all the residual errors. The residual for a vector $x_t$ is $r_t = x_t - U U^\top x_t$. The algorithm starts with a rank-one update of $C$ as $C = C + r_1 r_1^\top$. Notice that, by the assumption on $x_t$, we have $\|r_1 r_1^\top\|_2 \le \|X\|_F^2/\ell$, and hence for $t = 1$ the algorithm will not enter the while-loop. Then, for the second input vector $x_2$, the algorithm proceeds by checking the spectral norm of $C + r_2 r_2^\top = r_1 r_1^\top + r_2 r_2^\top$. If this does not exceed the threshold $\theta$, which is always fixed to $\theta = 2\|X\|_F^2/\ell$, the algorithm keeps $U$ unchanged, and it can go all the way to $t = n$ if this is the case for all $t > 1$. Notice, then, that the fact that $\|C\|_2 \le \theta$ implies that the spectral norm squared of $R = [r_1, \ldots, r_n] \in \mathbb{R}^{d \times n}$ is bounded by $\theta$, because in this case $C = R R^\top$; as we will see in the proof of Theorem 2.1, this is a key technical component in proving that the algorithm returns $y_t$ that are close to $x_t$ in the PCA sense. If, however, for some iterate $t$ the spectral norm of $C + r_t r_t^\top$ exceeds the threshold $\theta$, then the algorithm does a correction to $U$ (and consequently to $r_t$) in order to ensure that this is not the case. Specifically, it updates $U$ with the principal eigenvector of $C$ at that point of the algorithm execution. At the same time it downdates $C$ inside the while-loop by removing this eigenvector.
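The following is our numpy rendering of the Algorithm 1 pseudocode above (the helper names are ours). The quick check at the end uses the offline_pca sketch from the introduction and the parameter choice $\ell = 8k/\varepsilon^2$ from Section 1.2.

```python
import numpy as np

def online_pca(X, ell):
    """Numpy sketch of Algorithm 1.  X is d x n with ||x_t||_2^2 <= ||X||_F^2/ell
    for every column, and ell < d.  Returns Y (ell x n) and the final projection
    matrix U (d x ell), whose non-zero columns are orthonormal."""
    d, n = X.shape
    theta = 2.0 * np.linalg.norm(X, "fro") ** 2 / ell   # threshold; needs ||X||_F^2 upfront
    U = np.zeros((d, ell))
    C = np.zeros((d, d))
    Y = np.zeros((ell, n))
    filled = 0                                          # number of directions stored in U
    for t in range(n):
        x = X[:, t]
        r = x - U @ (U.T @ x)
        while np.linalg.norm(C + np.outer(r, r), 2) >= theta:
            lam, vecs = np.linalg.eigh(C)               # eigenvalues in ascending order
            u, lam_top = vecs[:, -1], lam[-1]
            U[:, filled] = u                            # add u to the next zero column of U
            filled += 1
            r = x - U @ (U.T @ x)
            C = C - lam_top * np.outer(u, u)
        C = C + np.outer(r, r)
        Y[:, t] = U.T @ x
    return Y, U

# quick numerical check of the claimed guarantee (offline_pca defined earlier)
d, n, k, eps = 100, 300, 2, 0.5
X = np.random.randn(d, n)
ell = int(np.ceil(8 * k / eps ** 2))                    # = 64 < d
Y, U = online_pca(X, ell)
_, _, opt = offline_pca(X, k)
alg = np.linalg.norm(X - U @ Y, "fro") ** 2
assert alg <= opt + eps * np.linalg.norm(X, "fro") ** 2
```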
That way the algorithm ensures that, at the end of each iterate $t$, the following relation is true: $C = \sum_{t' \le t} r_{t'} r_{t'}^\top - \sum_j \lambda_j u_j u_j^\top$, with $C \cdot \big(\sum_j \lambda_j u_j u_j^\top\big) = 0_{d \times d}$ (for a formal statement of the latter orthogonality condition see Lemma 3.1). Hence, when the condition in the while-loop gives $\|C + r_t r_t^\top\|_2 \le 2\|X\|_F^2/\ell$, which implies that $\|C\|_2 \le 2\|X\|_F^2/\ell$, it really means (by an orthogonality argument, see Lemma 3.6) that $\|\sum_{t'} r_{t'} r_{t'}^\top\|_2 \le 2\|X\|_F^2/\ell$, which is the ultimate goal of the algorithm, as we argued before.

2.1 Main result

The following theorem is our main quality-of-approximation result regarding Algorithm 1. We prove the theorem based on several other facts which we state and prove in the next section.

Theorem 2.1. Let $X = [x_1, \ldots, x_n] \in \mathbb{R}^{d \times n}$ with $\|x_t\|_2^2 \le \|X\|_F^2/\ell$, $\ell < d$, and $\|X\|_F^2$ be inputs to Algorithm 1. For any target dimension $k < d$ and any accuracy parameter $\varepsilon > 0$, let the target dimension of the algorithm be $\ell = 8k/\varepsilon^2 < d$. Then:

1. The algorithm terminates, and the while-loop runs at most $\ell$ times in total (see Lemma 3.7).

2. Upon termination of the algorithm:
$\mathrm{ALG}_\ell := \min_{\Phi \in O_{d,\ell}} \sum_{t=1}^n \|x_t - \Phi y_t\|_2^2 \le \mathrm{OPT} + \varepsilon \|X\|_F^2$.

3. The algorithm uses amortized $O(d^3)$ arithmetic operations per vector $x_t$ and $O(d^2)$ auxiliary space.

Proof. We quickly prove item 2 in the theorem by using several intermediate results, all proven in Section 3. Let $R$ be the $d \times n$ matrix containing $r_t$ in its $t$-th column,(4) and let $Y$ be the $\ell \times n$ matrix containing $y_t$ in its $t$-th column. In Lemma 3.4 we prove
$\min_{\Phi \in O_{d,\ell}} \sum_{t=1}^n \|x_t - \Phi y_t\|_2^2 \le \|R\|_F^2$;
next, in Lemma 3.5 we prove
$\|R\|_F^2 \le \mathrm{OPT} + 2\sqrt{k}\,\|X\|_F \|R\|_2$;
and in Lemma 3.6 we prove
$\|R\|_2^2 \le 2\|X\|_F^2/\ell$.
Combining these three bounds gives
$\min_{\Phi \in O_{d,\ell}} \sum_{t=1}^n \|x_t - \Phi y_t\|_2^2 \le \mathrm{OPT} + 2\sqrt{k}\,\|X\|_F \cdot \sqrt{2/\ell}\,\|X\|_F$.
Use $\ell = 8k/\varepsilon^2$ in the latter equation to wrap up the proof of the second item in the theorem.

(4) Here, $r_t$ is the corresponding vector upon termination of the corresponding iterate of the algorithm.

2.2 Finding the best isometry matrix

We remark that our algorithm does not compute the best isometry matrix $\Phi$ for which the bound in Theorem 2.1 holds. The algorithm indeed only returns the low-dimensional vectors $y_t$. The best $\Phi$, after $X$ and $Y$ are fixed, is related to the so-called Procrustes problem:
$\arg\min_{\Phi \in O_{d,\ell}} \|X - \Phi Y\|_F^2$.
Let $X Y^\top = U \Sigma V^\top \in \mathbb{R}^{d \times \ell}$ be the SVD of $X Y^\top$ with $U \in \mathbb{R}^{d \times \ell}$, $\Sigma \in \mathbb{R}^{\ell \times \ell}$ and $V \in \mathbb{R}^{\ell \times \ell}$. Then, $\Phi = U V^\top$. Clearly, it is not possible to construct this matrix in an online fashion. However, our algorithm does find an isometry matrix (the matrix $U \in \mathbb{R}^{d \times \ell}$ upon termination of the algorithm) which satisfies the bound in Theorem 2.1 (see the proof of Lemma 3.4 to verify this claim).
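The Procrustes step described in Section 2.2 is a standard construction; a minimal numpy sketch of it (ours) is below. As pointed out above, it needs all of $X$ and $Y$, so it is only available offline.

```python
import numpy as np

def best_isometry(X, Y):
    """Solve argmin over isometric Phi of ||X - Phi Y||_F.

    X is d x n, Y is ell x n with ell <= d; the optimum is Phi = U V^T,
    where U Sigma V^T is the SVD of X Y^T.
    """
    U, _, Vt = np.linalg.svd(X @ Y.T, full_matrices=False)
    return U @ Vt                 # d x ell, columns are orthonormal
```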
3 Auxiliary Lemmas

3.1 Notation

We introduce notation that, though not necessary to describe the algorithm, is useful for proving our results regarding Algorithm 1. Let $\ell'$ be the number of vectors $u_j$ inserted in $U \in \mathbb{R}^{d \times \ell}$. This is exactly the number of times the while-loop is executed. We argue in Lemma 3.7 that $\ell' \le \ell$. Notice that after the execution of the algorithm $U$ may still contain some all-zero columns. We reserve the index $t$ to refer to the iterate of the algorithm where the algorithm receives $x_t$. Also, let $U_t \in \mathbb{R}^{d \times \ell}$ denote the projection matrix $U$ used for the vector $x_t$. We reserve the index $i$ to index the various instances of the matrix $C$ during the execution of the algorithm. Hence, $i$ ranges from $i = 1$ ($C_1 = 0_{d \times d}$) to $i = z$, where $z$ is the number of times the algorithm updates $C$. Clearly, $z \ge n$, because for each $t$ there is at least one such update outside the while-loop. Also, $z = n + \ell' \le n + \ell$, since the while-loop is executed at most $\ell$ times. We reserve the index $j$ to index the various vectors $u_j$ added to the matrix $U$ during the execution of the algorithm. For $j = 1, 2, \ldots, \ell'$, the $u_j \in \mathbb{R}^d$ are the vectors $u$ computed inside the while-loop in Algorithm 1. Notice that $u_j$ is the eigenvector of some $C_i$. Also, let $\lambda_j$ be the corresponding largest eigenvalue of $C_i$ ($C_i$ is a symmetric matrix, as we argue around Eqn. (3.4)), such that $\lambda_j = u_j^\top C_i u_j$. Let $X \in \mathbb{R}^{d \times n}$, $Y \in \mathbb{R}^{\ell \times n}$, $\hat{X} \in \mathbb{R}^{d \times n}$ and $R \in \mathbb{R}^{d \times n}$ denote the matrices whose $t$-th column is $x_t \in \mathbb{R}^d$, $y_t \in \mathbb{R}^\ell$, $\hat{x}_t = U_t U_t^\top x_t \in \mathbb{R}^d$, and $r_t \in \mathbb{R}^d$, respectively. The vectors $r_t$ are taken as the ones at the end of the corresponding iteration. By construction of the algorithm the following relation is also true:

(3.4)  $C_z = R R^\top - \sum_{j=1}^{\ell'} \lambda_j u_j u_j^\top$.

3.2 Results

The following lemma argues about some properties of the vectors $u_j$ and the matrix $C$. It argues that the vectors $u_j$ inserted in $U$ are orthogonal to each other. This is important in interpreting our algorithm as a true PCA algorithm, since those vectors play the role of the left singular vectors of $X$, and as such should at the very least be orthogonal to each other. The lemma also shows that the vectors $u_j$ inserted in $U$ are perpendicular to the subspace spanned by the columns of the matrix $C_z$.

Recall that in Eqn. (3.4) we showed that $R R^\top = C_z + \sum_{j=1}^{\ell'} \lambda_j u_j u_j^\top$; hence the lemma argues that $C_z$ and $\sum_{j=1}^{\ell'} \lambda_j u_j u_j^\top$ span two perpendicular subspaces which, combined, span the subspace spanned by $R R^\top$. This will be useful in bounding the spectral norm of $R R^\top$ in Lemma 3.6. The idea is that for such a matrix $R R^\top$, its spectral norm can be bounded by the maximum of the spectral norms of $C_z$ and $\sum_{j=1}^{\ell'} \lambda_j u_j u_j^\top$, which in turn are bounded by design of the algorithm.

Lemma 3.1. Let $U_t \in \mathbb{R}^{d \times \ell}$ be the instance of $U$ in Algorithm 1 when calculating the corresponding vector $y_t = U^\top x_t$. Then, for all $t = 1, \ldots, n$ and for some $j \in [1, \ldots, \ell']$ (not necessarily the same for different $t$):
$U_t^\top U_t = \begin{pmatrix} I_j & 0 \\ 0 & 0 \end{pmatrix} \in \mathbb{R}^{\ell \times \ell}$.
Also, for the final value taken by the matrix $C$ it holds that
$C_z \cdot \left( \sum_{j=1}^{\ell'} \lambda_j u_j u_j^\top \right) = 0_{d \times d}$.

Proof. See Section 5.1.

The following two lemmas prove upper and lower bounds for the values $\lambda_j$ calculated in the algorithm. The lower bounds are useful in upper bounding the number of times the algorithm enters the while-loop (see Lemma 3.7); the upper bounds will be useful in providing an upper bound for the spectral norm of the matrix $R$ (see Lemma 3.6), which is crucial for providing the error accuracy guarantee of the algorithm in Theorem 2.1.

Lemma 3.2. (Upper bound on the $\lambda_j$'s) Let $\lambda_j$, for $j = 1, \ldots, \ell'$, be the eigenvalues computed in Algorithm 1, with $\lambda_j$ computed before $\lambda_{j+1}$. Then, for all $j = 1, 2, \ldots, \ell'$: $\lambda_j \le 2\|X\|_F^2/\ell$.

Proof. Each $\lambda_j$ corresponds to the largest eigenvalue/singular value of some $C_i$.(5) Hence, it suffices to argue that for all $i = 1, \ldots, z$: $\sigma_1(C_i) = \|C_i\|_2 \le 2\|X\|_F^2/\ell$. We prove the result by induction on $i$. For $i = 1$, it is $C_1 = 0_{d \times d}$ and it is trivial that $\sigma_1(C_1) = 0 \le 2\|X\|_F^2/\ell$. By the induction hypothesis, for some $i \ge 1$ let $\|C_i\|_2 \le 2\|X\|_F^2/\ell$. We want to show that $\|C_{i+1}\|_2 \le 2\|X\|_F^2/\ell$. Notice that either $C_{i+1} = C_i + r_t r_t^\top$ for some iterate $t$, or $C_{i+1} = C_i - \lambda_j u_j u_j^\top$ for some $j$. If $C_{i+1} = C_i + r_t r_t^\top$, then $\|C_i + r_t r_t^\top\|_2 \le 2\|X\|_F^2/\ell$ is ensured by the while-loop condition in the algorithm, since that update happens outside the while-loop. If $C_{i+1} = C_i - \lambda_j u_j u_j^\top$ (inside the while-loop), then

(3.5)  $\|C_{i+1}\|_2 = \|C_i - \lambda_j u_j u_j^\top\|_2 = \sigma_1(C_i - \lambda_j u_j u_j^\top) = \sigma_2(C_i) \le \sigma_1(C_i) = \|C_i\|_2 \le 2\|X\|_F^2/\ell$,

where $\|C_i\|_2 \le 2\|X\|_F^2/\ell$ uses the induction hypothesis. Also, in the third equality we use the fact that $(\lambda_j, u_j)$ is the top eigen-pair of $C_i$, hence removing it from $C_i$ turns the second largest eigenvalue of $C_i$ into the largest one of $C_i - \lambda_j u_j u_j^\top$.

(5) For the purposes of this proof, $\sigma_i(A)$ denotes the $i$-th largest singular value of some symmetric matrix $A$.

Lemma 3.3. (Lower bound on the $\lambda_j$'s) Let $\lambda_j$, for $j = 1, \ldots, \ell'$, be the eigenvalues computed in Algorithm 1, with $\lambda_j$ computed before $\lambda_{j+1}$. Assuming that for all $t$, $\|x_t\|_2^2 \le \|X\|_F^2/\ell$, then for all $j = 1, 2, \ldots, \ell'$: $\lambda_j \ge \|X\|_F^2/\ell$.

Proof. Each $\lambda_j$ corresponds to the largest eigenvalue of some $C_i$, and by design of the algorithm the extraction of $\lambda_j$ from $C_i$ is done inside the while-loop. Note that the condition in the while-loop is $\|C + r_t r_t^\top\|_2 \ge 2\|X\|_F^2/\ell$. This implies that for the iterate $t$:
$\|C_i\|_2 \ge \|C_i + r_t r_t^\top\|_2 - \|r_t r_t^\top\|_2 \ge \|X\|_F^2/\ell$.
The first inequality follows by the triangle inequality: $\|C_i + r_t r_t^\top\|_2 \le \|C_i\|_2 + \|r_t r_t^\top\|_2$. The second inequality uses $\|C_i + r_t r_t^\top\|_2 \ge 2\|X\|_F^2/\ell$ and $\|r_t r_t^\top\|_2 \le \|X\|_F^2/\ell$. To verify that $\|r_t r_t^\top\|_2 \le \|X\|_F^2/\ell$, recall that $r_t = (I_d - U_t U_t^\top) x_t$. Then, using the fact that $I_d - U_t U_t^\top$ is a projection matrix: $\|r_t r_t^\top\|_2 = \|r_t\|_2^2 \le \|x_t\|_2^2 \le \|X\|_F^2/\ell$, where the last inequality uses the assumption in the lemma.

The next lemma argues that the reconstruction error of our algorithm is bounded from above by $\|R\|_F^2$.
This result is based on the fact that the matrix $U$ contains orthonormal columns and the updates on $U$ occur by adding columns one after the other, without ever deleting any of them. The first inequality in the derivation in the proof of the lemma also indicates that the bound in Theorem 2.1 not only holds for the best isometry matrix $\Phi$ but also for the matrix $U_n$, i.e., the matrix $U$ upon termination of Algorithm 1, which is also an isometry.

Lemma 3.4. Let $X \in \mathbb{R}^{d \times n}$, $Y \in \mathbb{R}^{\ell \times n}$, and $R \in \mathbb{R}^{d \times n}$ be the matrices whose $t$-th column is $x_t \in \mathbb{R}^d$, $y_t \in \mathbb{R}^\ell$, and $r_t \in \mathbb{R}^d$ (the vectors $r_t$ are taken as the ones at the end of the corresponding iteration), respectively. Then,
$\min_{\Phi \in O_{d,\ell}} \|X - \Phi Y\|_F^2 \le \|R\|_F^2$.

Proof. We manipulate the term $\min_{\Phi \in O_{d,\ell}} \|X - \Phi Y\|_F^2$ as follows:
$\min_{\Phi \in O_{d,\ell}} \|X - \Phi Y\|_F^2 \le \|X - U_n Y\|_F^2 = \sum_{t=1}^n \|x_t - U_n U_t^\top x_t\|_2^2 = \sum_{t=1}^n \|x_t - U_t U_t^\top x_t\|_2^2 = \|R\|_F^2$.
The first inequality holds because $U_n$ is an isometry: $U_n$ contains a subset of $\ell'$ orthonormal columns, according to Lemma 3.1, and the remaining columns are all-zeros; also, the bottom $\ell - \ell'$ rows in $Y$ are all-zeros. In the second equality, $U_n U_t^\top = U_t U_t^\top$, since $U_t = [u_1, u_2, \ldots, u_{j_t}, 0, \ldots, 0]$ for some $j_t$ and $U_n = [u_1, u_2, \ldots, u_{j_n}, 0, \ldots, 0]$ with $j_n \ge j_t$, and all the vectors $u_j$ are orthonormal.

Next, we provide an upper bound for the Frobenius norm squared of $R$ with respect to the spectral norm of $R$. That helps because the algorithm only provides a bound for $\|R\|_2$ (see Lemma 3.6).

Lemma 3.5. Let $X, R \in \mathbb{R}^{d \times n}$ be the matrices whose $t$-th column is $x_t$ and $r_t \in \mathbb{R}^d$ (the vectors $r_t$ are taken as the ones at the end of the corresponding iteration), respectively. Then,
$\|R\|_F^2 \le \mathrm{OPT} + 2\sqrt{k}\,\|X\|_F \|R\|_2$.

Proof. For $i = 1, \ldots, k$, let $v_i \in \mathbb{R}^d$ denote the $i$-th left singular vector of $X$, corresponding to the $i$-th largest singular value of $X$; also let $V_k \in \mathbb{R}^{d \times k}$ contain those vectors as columns. Recall that $\hat{X}$ is the matrix whose $t$-th column is $\hat{x}_t = U_t U_t^\top x_t$. We manipulate the term $\|R\|_F^2$ as follows:

$\|R\|_F^2 \overset{(i)}{=} \|X\|_F^2 - \|\hat{X}\|_F^2 \overset{(ii)}{\le} \|X\|_F^2 - \sum_{i=1}^k \|v_i^\top \hat{X}\|_2^2 \overset{(iii)}{=} \|X\|_F^2 - \sum_{i=1}^k \|v_i^\top X - v_i^\top R\|_2^2 \overset{(iv)}{=} \|X\|_F^2 - \sum_{i=1}^k \left( \|v_i^\top X\|_2^2 + \|v_i^\top R\|_2^2 - 2 \langle v_i^\top X, v_i^\top R \rangle \right) \overset{(v)}{\le} \|X\|_F^2 - \sum_{i=1}^k \|v_i^\top X\|_2^2 + 2 \sum_{i=1}^k \|v_i^\top X\|_2\, \|v_i^\top R\|_2 \overset{(vi)}{\le} \|X\|_F^2 - \sum_{i=1}^k \|v_i^\top X\|_2^2 + 2 \sqrt{\sum_{i=1}^k \|v_i^\top X\|_2^2}\, \sqrt{\sum_{i=1}^k \|v_i^\top R\|_2^2} \overset{(vii)}{\le} \mathrm{OPT} + 2\sqrt{k}\,\|X\|_F \|R\|_2$.

(i) follows by Lemma 5.1 in Section 5. In (ii) we used $\|\hat{X}\|_F^2 \ge \sum_{i=1}^k \|v_i^\top \hat{X}\|_2^2$. This is true because
$\sum_{i=1}^k \|v_i^\top \hat{X}\|_2^2 = \|V_k^\top \hat{X}\|_F^2 \le \|V_k^\top\|_2^2 \|\hat{X}\|_F^2 = \|\hat{X}\|_F^2$;
in the first inequality we used the fact that $\|A B\|_F^2 \le \|A\|_2^2 \|B\|_F^2$ for any $A$ and $B$, and in the last equality we used $\|V_k^\top\|_2^2 = 1$. In (iii) we use the relation $\hat{X} = X - R$. In (iv) we used the fact that for any two vectors $\alpha, \beta$: $\|\alpha - \beta\|_2^2 = \|\alpha\|_2^2 + \|\beta\|_2^2 - 2\langle \alpha, \beta \rangle$; we used this property $k$ times, for all $i = 1, \ldots, k$, with $\alpha = v_i^\top X$ and $\beta = v_i^\top R$. In (v) we dropped the nonnegative terms $\|v_i^\top R\|_2^2$ and used $\langle v_i^\top X, v_i^\top R \rangle \le \|v_i^\top X\|_2 \|v_i^\top R\|_2$. In (vi) we used the Cauchy-Schwarz inequality: for numbers $\gamma_i \ge 0$, $\delta_i \ge 0$, $i = 1, \ldots, k$, it is $\sum_i \gamma_i \delta_i \le \sqrt{\sum_i \gamma_i^2} \sqrt{\sum_i \delta_i^2}$; we used this inequality for $\gamma_i = \|v_i^\top X\|_2$ and $\delta_i = \|v_i^\top R\|_2$. In (vii) we used that
$\|X\|_F^2 - \sum_{i=1}^k \|v_i^\top X\|_2^2 = \sum_{i=1}^d \sigma_i^2(X) - \sum_{i=1}^k \sigma_i^2(X) = \sum_{i=k+1}^d \sigma_i^2(X) = \mathrm{OPT}$.
Also, we used that $\sum_{i=1}^k \|v_i^\top X\|_2^2 \le \|X\|_F^2$, and that $\sum_{i=1}^k \|v_i^\top R\|_2^2 \le k \|R\|_2^2$; the latter inequality is true because $\|v_i^\top R\|_2^2 \le \|R\|_2^2$ for all $i = 1, \ldots, k$.

Next, we provide an upper bound on the spectral norm squared of $R$. The previous lemma motivates the need for such an analysis.

Lemma 3.6. Let $X, R \in \mathbb{R}^{d \times n}$ be the matrices whose $t$-th column is $x_t$ and $r_t \in \mathbb{R}^d$ (the vectors $r_t$ are taken as the ones at the end of the corresponding iteration), respectively. Then, $\|R\|_2^2 \le 2\|X\|_F^2/\ell$.

Proof. Notice that according to the updates of $C$ we have that
$R R^\top = C_z + \sum_{j=1}^{\ell'} \lambda_j u_j u_j^\top$.
According to Lemma 3.1 we have that $C_z \cdot (\sum_{j=1}^{\ell'} \lambda_j u_j u_j^\top) = 0_{d \times d}$, meaning that
$\|R\|_2^2 = \|R R^\top\|_2 = \max\left\{ \|C_z\|_2,\ \left\|\sum_{j=1}^{\ell'} \lambda_j u_j u_j^\top\right\|_2 \right\}$.
The second equality is proved in Lemma 5.4 in Section 5. Now, according to Lemma 3.2 we have that $\|\sum_{j=1}^{\ell'} \lambda_j u_j u_j^\top\|_2 \le 2\|X\|_F^2/\ell$. From the structure of the algorithm it is straightforward that upon termination it must be the case that $\|C_z\|_2 \le 2\|X\|_F^2/\ell$: the update $C \leftarrow C + r_t r_t^\top$ happens only when $\|C + r_t r_t^\top\|_2$ is at most $2\|X\|_F^2/\ell$, for otherwise the while-loop in the algorithm would not halt. Combining the three inequalities leads to the claim.

Finally, we provide an upper bound on the number of times the algorithm enters the while-loop; this immediately implies an upper bound on the number of vectors $u_j$ inserted in $U$. The main idea here is to study the trace of the matrix $\sum_{j=1}^{\ell'} \lambda_j u_j u_j^\top$ and provide a lower bound that depends on $\ell'$ and an upper bound that depends on $\ell$. Combining the two bounds gives the desired relation between $\ell'$ and $\ell$.

Lemma 3.7. Let $X \in \mathbb{R}^{d \times n}$ be the matrix whose $t$-th column is $x_t$. Then, assuming that for all $t$, $\|x_t\|_2^2 \le \|X\|_F^2/\ell$, the while-loop in Algorithm 1 can occur a total of at most
$\ell' \le \ell \cdot \min\left\{ 1,\ \mathrm{OPT}/\|X\|_F^2 + \sqrt{8k/\ell} \right\} \le \ell$
times.

Proof. For notational convenience, let $Z := \sum_{j=1}^{\ell'} \lambda_j u_j u_j^\top$. First, using Lemma 3.3, we calculate a lower bound for $\mathrm{Trace}(Z)$:

(3.6)  $\mathrm{Trace}(Z) = \sum_{j=1}^{\ell'} \lambda_j \ge \ell'\, \|X\|_F^2/\ell$.

Since $C_z$ is clearly positive semidefinite we have

(3.7)  $\mathrm{Trace}(Z) = \mathrm{Trace}(R R^\top) - \mathrm{Trace}(C_z) \le \mathrm{Trace}(R R^\top) = \|R\|_F^2 \le \|X\|_F^2$.

The inequality $\|R\|_F^2 \le \|X\|_F^2$ follows because for all $t$, $\|r_t\|_2^2 = \|(I_d - U_t U_t^\top) x_t\|_2^2 \le \|I_d - U_t U_t^\top\|_2^2 \|x_t\|_2^2 = \|x_t\|_2^2$, since $I_d - U_t U_t^\top$ is a projection. Also, by the proof of Theorem 2.1,
$\|R\|_F^2 \le \mathrm{OPT} + \sqrt{8k/\ell}\,\|X\|_F^2$,
hence

(3.8)  $\mathrm{Trace}(Z) \le \|R\|_F^2 \le \mathrm{OPT} + \sqrt{8k/\ell}\,\|X\|_F^2$.

Combining Eqn. (3.6), Eqn. (3.7), and Eqn. (3.8) shows the claim.

4 A complex and efficient OPCA algorithm

In this section we modify Algorithm 1 in order to (1) remove the assumption of knowledge of $\|X\|_F^2$ and (2) improve the time and space complexity. These two improvements are made at the expense of sacrificing the accuracy slightly. To sidestep trivial technicalities, we assume some prior knowledge of the quantity $\|X\|_F^2$; we merely assume that we have some quantity $w_0$ such that $w_0 \le \|X\|_F^2$ and $\|x_t\|_2^2 \le w_0$ for all $t$. The assumption of knowing $w_0$ can be removed by working with a buffer of roughly $k/\varepsilon^3$ columns. Since we already require this amount of space, the resulting increase in the memory complexity is asymptotically negligible. Specifically, we provide the following theorem:

Theorem 4.1. There exists an algorithm that, on input $X \in \mathbb{R}^{d \times n}$, target dimension $k < d$, and $\varepsilon \in (0, 1/15)$, produces vectors $y_1, y_2, \ldots, y_n \in \mathbb{R}^\ell$ with
$\min_{\Phi \in O_{d,\ell}} \sum_{t=1}^n \|x_t - \Phi y_t\|_2^2 \le \mathrm{OPT} + \varepsilon \|X\|_F^2$.
For $\delta = \min\{\varepsilon,\ \varepsilon^2 \|X\|_F^2/\mathrm{OPT}\}$, this algorithm requires $O(dk/\delta^3)$ space and $O(ndk/\delta^3 + \log(n) \cdot d k^3/\delta^6)$ arithmetic operations (assuming the $\|x_t\|_2$ are polynomial in $n$). The target dimension of the algorithm is $\ell = k/\delta^3$.
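Algorithm 2, presented next, maintains its sketch $Z$ of the residual stream via the Frequent Directions technique of [13] (the call Sketch$[Z, r_t]$). The following numpy class is our simplified illustration of that subroutine, not the implementation from [13]; it assumes $d \ge 2\ell$ and keeps at most $2\ell$ columns.

```python
import numpy as np

class FrequentDirections:
    """Simplified column-stream Frequent Directions sketch (our illustration).

    Maintains Z with at most 2*ell columns such that Z Z^T never overestimates
    R R^T and ||R R^T - Z Z^T||_2 <= ||R||_F^2 / ell, where R is the matrix of
    all columns appended so far.  Assumes d >= 2*ell.
    """

    def __init__(self, d, ell):
        self.ell = ell
        self.Z = np.zeros((d, 2 * ell))
        self.next_col = 0              # index of the next free column of Z

    def append(self, r):
        if self.next_col == 2 * self.ell:
            # Sketch is full: shrink every direction by the ell-th squared
            # singular value; this zeroes out at least ell columns.
            U, s, _ = np.linalg.svd(self.Z, full_matrices=False)
            delta = s[self.ell - 1] ** 2
            s = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
            self.Z = U * s             # rescale the left singular vectors
            self.next_col = int(np.count_nonzero(s))
        self.Z[:, self.next_col] = r
        self.next_col += 1
```

In Algorithm 2, each residual $r_t$ would be passed to append, and $Z Z^\top$ stands in for the exact residual covariance that Algorithm 1 keeps in $C$.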

Algorithm 2 (An efficient online PCA algorithm)
input: $X$, $k$, $\varepsilon \in (0, 1/15)$, $w_0$ (which defaults to $w_0 = 0$)
  $U = 0_{d \times k/\varepsilon^3}$, $Z = 0_{d \times k/\varepsilon^2}$, $w = 0$, $w_U = 0_{k/\varepsilon^3}$
  for $t = 1, \ldots, n$ do
    $w = w + \|x_t\|_2^2$
    $r_t = x_t - U U^\top x_t$
    $C = (I - U U^\top) Z Z^\top (I - U U^\top)$
    while $\|C + r_t r_t^\top\|_2 \ge \max\{w_0, w\}\, \varepsilon^2/k$ do
      $(u, \lambda) =$ TopEigenvectorAndValue($C$)
      $w_U(u) \leftarrow \lambda$
      If $U$ has a zero column, write $u$ in its place. Otherwise, write $u$ instead of the column $v$ of $U$ with the minimal quantity $w_U(v)$.
      $C = (I - U U^\top) Z Z^\top (I - U U^\top)$
      $r_t = x_t - U U^\top x_t$
    end while
    $Z \leftarrow$ Sketch$[Z, r_t]$
    For each non-zero column $u$ in $U$, $w_U(u) \leftarrow w_U(u) + \langle x_t, u \rangle^2$
    yield: $y_t \leftarrow U^\top x_t$
  end for

4.1 Notations

For a variable $\xi$ in the algorithm, where $\xi$ can either be a matrix ($Z$, $X$, $C$) or a vector ($r$, $x$), we denote by $\xi_t$ the value of $\xi$ at the end of iteration number $t$; $\xi_0$ denotes the value of the variable at the beginning of the algorithm. For variables that may change during the while-loop we denote by $\xi_{t,z}$ the value of $\xi$ at the end of the $z$-th while-loop iteration within the $t$-th iteration. In particular, if the while-loop was never entered, the value of $\xi$ at the end of iteration $t$ equals $\xi_{t,0}$. For such $\xi$ that may change during the while-loop, notice that $\xi_t$ is its value at the end of the iteration, meaning after the last while-loop iteration has ended. An exception to the above is the variable $w$: we denote by $w_t$ the value of $\max\{w_0, w\}$ at the end of iteration number $t$. We denote by $n$ the index of the last iteration of the algorithm; in particular $\xi_n$ denotes the value of $\xi$ upon termination of the algorithm.

4.2 Matrix Sketching

We use the Frequent Directions matrix sketching algorithm presented by Liberty in [13]. This algorithm provides a sketch for the streaming setting where we observe a matrix $R$ column by column. The idea is to maintain a matrix $Z$ of low rank that at all times has the property that $Z Z^\top$ and $R R^\top$ are close in operator norm.

Lemma 4.1. Let $R_1, \ldots, R_t, \ldots$ be a sequence of matrices with columns of dimension $d$, where $R_{t+1}$ is obtained by appending a column vector $r_{t+1}$ to $R_t$. Let $Z_1, \ldots, Z_t, \ldots$ be the corresponding sketches of $R$ obtained by adding these columns as they arrive, according to the Frequent Directions algorithm in [13] with parameter $\ell$.

1. The worst-case time required by the sketch to add a single column is $O(\ell d)$.

2. Each $Z_t$ is a matrix of dimensions $d \times O(\ell)$.

3. Let $u$ be a left singular vector of $Z_t$ with singular value $\sigma$ and assume that $r_{t+1}$ is orthogonal to $u$. Then $u$ is a singular vector of $Z_{t+1}$ with singular value $\sigma$.

4. For any vector $u$ and time $t$ it holds that $\|Z_t^\top u\| \le \|R_t^\top u\|$.

5. For any $t$ there exists a positive semidefinite matrix $E_t$ such that $\|E_t\|_2 \le \|R_t\|_F^2/\ell$ and $Z_t Z_t^\top + E_t = R_t R_t^\top$.

4.3 Proof of Theorem 4.1

Lemma 4.2. At all times, the matrix $U^\top U$ is a diagonal matrix with either zeros or ones across the diagonal. In other words, the non-zero column vectors of $U$, at any point in time, are orthonormal.

Proof. We prove the claim for any $U_{t,z}$ by induction on $(t, z)$. For $t = 0$ the claim is trivial as $U_0 = 0$. For the induction step we need only consider $U_{t,z}$ for $z > 0$, since $U_{t,0} = U_{t-1}$. Let $u$ be the new vector in $U_{t,z}$. Since $u$ is defined as an eigenvector of $C_{t,z-1} = (I - U_{t,z-1} U_{t,z-1}^\top) Z_{t-1} Z_{t-1}^\top (I - U_{t,z-1} U_{t,z-1}^\top)$, we have that $U_{t,z-1}^\top u = 0$ and the claim immediately follows.

Lemma 4.3. For all $t, z$, the non-zero columns of $U_{t,z}$ are left singular vectors of $Z_{t-1}$ (possibly with singular value zero). In particular, the claim holds for the final values of the matrices $U$ and $Z$.

Proof. We prove the claim by induction on $(t, z)$, with a trivial base case of $t = 1$ where $Z_{t-1} = 0$. Let $t > 1$.
For $z = 0$, each non-zero column vector $u$ of $U_{t,0}$ is a non-zero column vector of $U_{t-1,z'}$ for the largest valid $z'$ with respect to $t-1$. By the induction hypothesis it holds that $u$ is a singular vector of $Z_{t-2}$. According to Lemma 4.2 we have that $r_{t-1}$, the vector added to $Z_{t-2}$, is orthogonal to $u$, hence Lemma 4.1, item 3, indicates that $u$ is a singular vector of $Z_{t-1}$ as required. Consider now $z > 0$. In this case $u$ is a vector added in the while-loop. Recall that $u$ is defined as an eigenvector of
$C_{t,z-1} = (I - U_{t,z-1} U_{t,z-1}^\top)\, Z_{t-1} Z_{t-1}^\top (I - U_{t,z-1} U_{t,z-1}^\top)$.

According to our induction hypothesis, all of the non-zero column vectors of $U_{t,z-1}$ are singular vectors of $Z_{t-1}$, hence
$Z_{t-1} Z_{t-1}^\top = C_{t,z-1} + U_{t,z-1} U_{t,z-1}^\top Z_{t-1} Z_{t-1}^\top U_{t,z-1} U_{t,z-1}^\top$.
The two equalities above imply that any eigenvector of $C_{t,z-1}$ is an eigenvector of $Z_{t-1} Z_{t-1}^\top$ as well. It follows that $u$ is a singular vector of $Z_{t-1}$ as required, thus proving the claim.

Lemma 4.4. Let $v$ be a column vector of $U_{t,z}$ that is not in $U_{t,z+1}$. Let $(t_v, z_v)$ be the earliest time stamp from which $v$ was a column vector in $U$ consecutively up to time $(t, z)$. It holds that
$\|Z_{t-1}^\top v\|_2^2,\ \|(X_{t-1} - X_{t_v-1})^\top v\|_2^2 \le 2\varepsilon^3 w_{t-1}/k$.

Proof. We denote by $w_U(u)$ the values of the $w_U$ vector for the different directions $u$ at time $(t, z)$. Let $\lambda_v$ be the eigenvalue associated with $v$ at the time it was entered into $U$. Then, at that time, $\|Z^\top v\|_2^2 = v^\top C v = \lambda_v$ (recall that $v$ is chosen as a vector orthogonal to the columns of $U$, hence $v^\top C v = \|Z^\top (I - U U^\top) v\|_2^2 = \|Z^\top v\|_2^2$). Furthermore, since $v$ was a column in $U$ up to time $t$, all vectors added to $R$ from the insertion of $v$ up to time $t$ are orthogonal to $v$, and Lemma 4.1, item 3, shows that the corresponding singular value in the sketch is unchanged. Since $v$ was a column vector of $U$ from iteration $t_v$ up to iteration $t$, we have that
$\|Z_{t-1}^\top v\|_2^2,\ \|(X_{t-1} - X_{t_v-1})^\top v\|_2^2 \le \lambda_v + \sum_{\tau=t_v}^{t} \langle v, x_\tau \rangle^2 = w_U(v)$.
It remains to bound the quantity $w_U(v)$. We will bound the sum $\sum_{u \in U_{t,z}} w_U(u)$ and use the fact that $v$ is the minimizer of the corresponding expression, hence $w_U(v)$ is upper bounded by $\big( \sum_u w_U(u) \big)\, \varepsilon^3/k$ (the matrix $U$ has $k/\varepsilon^3$ columns). Let $t_u$ be the index of the iteration in which $u$ was inserted into $U$. It is easy to verify that for $\lambda_u = \|C_{t_u-1} u\|_2$ it holds that
$w_U(u) = \lambda_u + \|(X_{t-1} - X_{t_u-1})^\top u\|_2^2$.
Now, since $\lambda_u = u^\top C_{t_u-1} u \le \|Z_{t_u-1}^\top u\|_2^2 \le \|R_{t_u-1}^\top u\|_2^2$, we have that
$w_U(u) \le \|R_{t_u-1}^\top u\|_2^2 + \|(X_{t-1} - X_{t_u-1})^\top u\|_2^2$.
Finally,
$\sum_u w_U(u) \le \sum_u \left( \|R_{t_u-1}^\top u\|_2^2 + \|(X_{t-1} - X_{t_u-1})^\top u\|_2^2 \right) \overset{(i)}{\le} \sum_u \left( \|R_{t-1}^\top u\|_2^2 + \|X_{t-1}^\top u\|_2^2 \right) \overset{(ii)}{\le} \|R_{t-1}\|_F^2 + \|X_{t-1}\|_F^2 \overset{(iii)}{\le} 2\|X_{t-1}\|_F^2 \le 2 w_{t-1}$.
Inequality (i) is immediate from the definitions of $X_t$, $R_t$ as concatenations of $t$ column vectors. Inequality (ii) follows since any orthonormal vector set and any matrix $A$ admit $\sum_u \|A^\top u\|_2^2 \le \|A\|_F^2$. Inequality (iii) is due to the fact that each column of $R$ is obtained by projecting a column of $X$ onto some subspace. Combining the two bounds gives $w_U(v) \le 2\varepsilon^3 w_{t-1}/k$, as claimed.

Lemma 4.5. At all times, $\|C_{t,z}\|_2 \le w_{t-1}\, \varepsilon^2/k$.

Proof. We prove the claim by induction over $(t, z)$. The base case $t = 0$ is trivial. For $t > 0$ and $z = 0$: if the while-loop in iteration $t$ was entered, we have that $C_{t,0} = C_{t-1,z'}$ for some $z'$; since $w_{t-1} \ge w_{t-2}$, the claim holds. If $t$ is such that the while-loop of the iteration was not entered, the condition of the while-loop asserts that $\|C_{t,z}\|_2 = \|C_t\|_2 \le w_{t-1}\, \varepsilon^2/k$. Consider now $(t, z)$ with $z > 0$. We have that
$C_{t,z-1} = (I_d - U_{t,z-1} U_{t,z-1}^\top)\, Z_{t-1} Z_{t-1}^\top (I_d - U_{t,z-1} U_{t,z-1}^\top)$,
$C_{t,z} = (I_d - U_{t,z} U_{t,z}^\top)\, Z_{t-1} Z_{t-1}^\top (I_d - U_{t,z} U_{t,z}^\top)$.
If $U_{t,z}$ is obtained by writing $u$ instead of a zero column of $U_{t,z-1}$, then $C_{t,z}$ is a projection of $C_{t,z-1}$ and the claim holds due to the induction hypothesis. If not, $u$ is inserted instead of some vector $v$. According to Lemmas 4.3 and 4.4, $v$ is an eigenvector of $Z_{t-1} Z_{t-1}^\top$ with eigenvalue $\lambda_v \le 2\varepsilon^3 w_{t-1}/k \le w_{t-1}\, \varepsilon^2/k$, assuming $\varepsilon \le 0.5$. It follows that $C_{t,z}$ is a projection of $C_{t,z-1} + \lambda_v v v^\top$. Now, since $C_{t,z-1} v = 0$ (as $v$ is a column vector of $U_{t,z-1}$), we have that
$\|C_{t,z}\|_2 \le \|C_{t,z-1} + \lambda_v v v^\top\|_2 = \max\{\|C_{t,z-1}\|_2,\ \lambda_v\}$.
According to our induction hypothesis and the bound for $\lambda_v$, the above expression is bounded by $w_{t-1}\, \varepsilon^2/k$ as required.

Lemma 4.6. Let $u$ be a vector that is not in $U_{t,z}$ and is in $U_{t,z+1}$. Let $\lambda$ be the eigenvalue associated to it with respect to $C_{t,z}$. It holds that $\lambda \le w_{t-1}\, \varepsilon^2/k$. Furthermore, if $u$ is a column vector in $U_{t',z'}$ for all $(t', z') \ge (t, z)$, it holds that $\|Z_n^\top u\|_2^2 \le \lambda \le \|X_n\|_F^2\, \varepsilon^2/k$.

Proof.
Since $u$ is chosen as the top eigenvector of $C_{t,z}$, we have by Lemma 4.5 that
$\lambda = \|C_{t,z} u\|_2 = \|C_{t,z}\|_2 \le w_{t-1}\, \varepsilon^2/k$.

For the second claim in the lemma we note that since $u$ is an eigenvector of $C_{t,z} = (I - U_{t,z} U_{t,z}^\top) Z_{t-1} Z_{t-1}^\top (I - U_{t,z} U_{t,z}^\top)$, we have that $U_{t,z}^\top u = 0$, hence
$\|Z_{t-1}^\top u\|_2^2 = \|Z_{t-1}^\top (I - U_{t,z} U_{t,z}^\top) u\|_2^2 = u^\top C_{t,z} u \le \|C_{t,z}\|_2 \le w_{t-1}\, \varepsilon^2/k$.
Since $u$ is assumed to be an element of $U$ throughout the remaining running time of the algorithm, all future vectors $r$ added to the sketch $Z$ are orthogonal to $u$. The claim now follows from Lemma 4.1, item 3.

Lemma 4.7. $\|R_n\|_2^2 \le 2\varepsilon^2 \|X_n\|_F^2/k$.

Proof. Let $u_1, \ldots, u_{\ell'}$ and $\lambda_1, \ldots, \lambda_{\ell'}$ be the columns of $U_n$ and their corresponding eigenvalues in $C$ at the time of their addition to $U$. From Lemmas 4.3 and 4.6 we have that each $u_j$ is an eigenvector of $Z_n Z_n^\top$ with eigenvalue at most $\lambda_j \le \varepsilon^2 \|X_n\|_F^2/k$. It follows that
$\|Z_n Z_n^\top\|_2 \le \max\left\{ \frac{\varepsilon^2}{k} \|X_n\|_F^2,\ \|(I - U_n U_n^\top) Z_n Z_n^\top (I - U_n U_n^\top)\|_2 \right\} = \max\left\{ \frac{\varepsilon^2}{k} \|X_n\|_F^2,\ \|C_n\|_2 \right\} \le \frac{\varepsilon^2}{k} \|X_n\|_F^2$.
The last inequality is due to Lemma 4.5. Next, by the sketching property (Lemma 4.1, item 5), for an appropriate matrix $E$: $R_n R_n^\top = Z_n Z_n^\top + E$, with $\|E\|_2 \le \frac{\varepsilon^2}{k} \|R_n\|_F^2$. As the columns of $R$ are projections of those of $X$ we have $\|R_n\|_F^2 \le \|X_n\|_F^2$, hence
$\|R_n\|_2^2 = \|R_n R_n^\top\|_2 = \|Z_n Z_n^\top + E\|_2 \le \|Z_n Z_n^\top\|_2 + \|E\|_2 \le \frac{2\varepsilon^2}{k} \|X_n\|_F^2$,
which concludes the proof.

Lemma 4.8. $\|R_n\|_F^2 \le \mathrm{OPT} + 4\varepsilon \|X_n\|_F^2$.

Proof. The lemma can be proven analogously to Theorem 2.1, as the only difference is the bound on $\|R_n\|_2^2$.

Lemma 4.9. Assume that for all $t$, $\|x_t\|_2^2 \le w_t\, \varepsilon^2/(5k)$, and assume that $\varepsilon \le 0.1$. For $\tau > 0$, consider the iterations of the algorithm during which $w_t \in [2^\tau, 2^{\tau+1})$. During this time, the while-loop will be executed at most $5k/\varepsilon^2$ times.

Proof. For the proof, we define a potential function $\Phi_{t,z} = \mathrm{Trace}(R_{t-1} R_{t-1}^\top) - \mathrm{Trace}(C_{t,z})$. We first notice that since $C$ is clearly PSD,
$\Phi_{t,z} \le \mathrm{Trace}(R_{t-1} R_{t-1}^\top) = \|R_{t-1}\|_F^2 \le \|X_{t-1}\|_F^2 \le 2^{\tau+1}$.
The first inequality is correct because the columns of $R$ are projections of those of $X$, and the second because $\|X_{t-1}\|_F^2 \le w_{t-1}$. We show that $\Phi$ is non-decreasing with time and that, for valid $z > 0$, $\Phi_{t,z} \ge \Phi_{t,z-1} + \frac{\varepsilon^2}{5k}\, 2^{\tau+1}$. The result immediately follows.

Consider a pair $(t, z)$ followed by the pair of indices $(t+1, 0)$. Here, $\Phi_{t+1,0} - \Phi_{t,z} \ge \|r_t\|_2^2 - \|r_t\|_2^2 = 0$, since the trace of $R R^\top$ grows by $\|r_t\|_2^2$ while sketching $r_t$ into $Z$ increases $\mathrm{Trace}(C)$ by at most $\|r_t\|_2^2$; hence for such pairs the potential is non-decreasing. Now consider some pair $(t, z)$ for $z > 0$. Since $(t, z)$ is a valid pair it holds that

(4.9)  $\|C_{t,z-1}\|_2 \ge \|C_{t,z-1} + r_{t,z-1} r_{t,z-1}^\top\|_2 - \|r_{t,z-1} r_{t,z-1}^\top\|_2 \ge \frac{\varepsilon^2 w_t}{k} - \frac{\varepsilon^2 w_t}{5k} \ge \frac{2\varepsilon^2}{5k}\, 2^{\tau+1}$.

Denote by $u$ the column vector in $U_{t,z}$ that is not in $U_{t,z-1}$. Let $U'$ be the matrix obtained by appending the column $u$ to the matrix $U_{t,z-1}$, and let
$C' = (I - U' U'^\top)\, Z_{t-1} Z_{t-1}^\top (I - U' U'^\top) = (I - u u^\top)\, C_{t,z-1}\, (I - u u^\top)$.
Since $u$ is the top eigenvector of $C_{t,z-1}$, we have by equation (4.9) that
$\mathrm{Trace}(C') \le \mathrm{Trace}(C_{t,z-1}) - \frac{2\varepsilon^2}{5k}\, 2^{\tau+1}$.
If $U_{t,z-1}$ had a zero column then $C_{t,z} = C'$ and we are done. If not, let $v$ be the vector that was replaced by $u$. According to Lemma 4.3, $v$ is a singular vector of $Z_{t-1}$. According to Lemma 4.4 and $\varepsilon \le 0.1$ we have that
$\|Z_{t-1}^\top v\|_2^2 \le \frac{2\varepsilon^3}{k} \|X_{t-1}\|_F^2 \le \frac{\varepsilon^2}{5k}\, 2^{\tau+1}$.

Hence, $C_{t,z} = C' + \|Z_{t-1}^\top v\|_2^2\, v v^\top$, meaning that
$\mathrm{Trace}(C_{t,z}) - \mathrm{Trace}(C_{t,z-1}) = \left[\mathrm{Trace}(C_{t,z}) - \mathrm{Trace}(C')\right] + \left[\mathrm{Trace}(C') - \mathrm{Trace}(C_{t,z-1})\right] \le \frac{\varepsilon^2}{5k}\, 2^{\tau+1} - \frac{2\varepsilon^2}{5k}\, 2^{\tau+1} = -\frac{\varepsilon^2}{5k}\, 2^{\tau+1}$.
We conclude that, as required, $\Phi$ is non-decreasing over time and, in each iteration of the while-loop, increases by at least $\frac{\varepsilon^2}{5k}\, 2^{\tau+1}$. Since $\Phi$ is upper bounded by $2^{\tau+1}$ during the discussed iterations, the lemma follows.

Lemma 4.10. Let $(v_1, t_1+1, t_1'+1), \ldots, (v_j, t_j+1, t_j'+1), \ldots$ be the sequence of triplets of vectors removed from $U$, the times at which they were added to $U$, and the times at which they were removed from $U$. Then
$\mathrm{ALG} \le \left( \|R_n\|_F + 2 \sqrt{ \sum_j \|(X_{t_j'} - X_{t_j})^\top v_j\|_2^2 } \right)^2$.

Proof. For any time $t$, denote by $U_t$ the matrix $U$ at the end of iteration $t$, by $U_t^1$ the outcome of zeroing out every column of $U_t$ that is different from the corresponding column in $U_n$, and by $U_t^2$ its complement, that is, the outcome of zeroing out every column in $U_t$ that is identical to the corresponding column in $U_n$. In the same way, define $\bar{U}_t^2$ to be the outcome of zeroing out the columns in $U_n$ that are all-zeros in $U_t^2$. It holds that
$\|x_t - U_n U_t^\top x_t\|_2 = \|x_t - (U_t + (U_n - U_t)) U_t^\top x_t\|_2 \le \|x_t - U_t U_t^\top x_t\|_2 + \|(U_n - U_t) U_t^\top x_t\|_2 = \|r_t\|_2 + \|(\bar{U}_t^2 - U_t^2)(U_t^2)^\top x_t\|_2 \le \|r_t\|_2 + 2\|(U_t^2)^\top x_t\|_2$.
Summing over all times $t$ we have
$\mathrm{ALG} \le \sum_{t=1}^n \left( \|r_t\|_2 + 2 \|(U_t^2)^\top x_t\|_2 \right)^2 = \sum_{t=1}^n \|r_t\|_2^2 + 4 \sum_{t=1}^n \|r_t\|_2\, \|(U_t^2)^\top x_t\|_2 + 4 \sum_{t=1}^n \|(U_t^2)^\top x_t\|_2^2 \le \|R_n\|_F^2 + 4 \sqrt{ \sum_{t=1}^n \|r_t\|_2^2 }\, \sqrt{ \sum_{t=1}^n \|(U_t^2)^\top x_t\|_2^2 } + 4 \sum_{t=1}^n \|(U_t^2)^\top x_t\|_2^2$,
where the last inequality follows from applying the Cauchy-Schwarz inequality to the dot product between the vectors $(\|r_1\|_2, \|r_2\|_2, \ldots, \|r_n\|_2)$ and $(\|(U_1^2)^\top x_1\|_2, \|(U_2^2)^\top x_2\|_2, \ldots, \|(U_n^2)^\top x_n\|_2)$. Since $U_t^2$ contains only vectors that were columns of $U$ at time $t$ but were replaced later, and are not present in $U_n$, we have that $\|(U_t^2)^\top x_t\|_2^2 \le \sum_{j : t_j < t \le t_j'} \langle x_t, v_j \rangle^2$, and so
$\sum_{t=1}^n \|(U_t^2)^\top x_t\|_2^2 \le \sum_j \|(X_{t_j'} - X_{t_j})^\top v_j\|_2^2$.
Thus we have that
$\mathrm{ALG} \le \|R_n\|_F^2 + 4 \|R_n\|_F \sqrt{ \sum_j \|(X_{t_j'} - X_{t_j})^\top v_j\|_2^2 } + 4 \sum_j \|(X_{t_j'} - X_{t_j})^\top v_j\|_2^2 = \left( \|R_n\|_F + 2 \sqrt{ \sum_j \|(X_{t_j'} - X_{t_j})^\top v_j\|_2^2 } \right)^2$,
which concludes the proof.

Lemma 4.11. Let $(v_1, t_1+1, t_1'+1), \ldots, (v_j, t_j+1, t_j'+1), \ldots$ be the sequence of triplets of vectors removed from $U$, the times at which they were added to $U$, and the times at which they were removed from $U$. Then
$\sum_j \|(X_{t_j'} - X_{t_j})^\top v_j\|_2^2 \le 40\, \varepsilon\, \|X_n\|_F^2$.

Proof. For some $\tau > 0$, consider the execution of the algorithm during the period in which $w_t \in [2^\tau, 2^{\tau+1})$. According to Lemma 4.9, at most $5k/\varepsilon^2$ vectors $v_j$ were removed from $U$ during that period. According to Lemma 4.4, for each such $v_j$ it holds that
$\|(X_{t_j'} - X_{t_j})^\top v_j\|_2^2 \le \frac{2\varepsilon^3}{k}\, w_t \le \frac{2\varepsilon^3}{k}\, 2^{\tau+1}$.
It follows that the contribution of the vectors $v_j$ removed from the set during the discussed time period is at most
$\frac{5k}{\varepsilon^2} \cdot \frac{2\varepsilon^3}{k}\, 2^{\tau+1} = 10\varepsilon\, 2^{\tau+1}$.
The entire sum can now be bounded by a geometric series, ending at $\tau = \log_2 \|X\|_F^2$, thus proving the lemma.

Corollary 4.1. Combining Lemmas 4.8, 4.10 and 4.11,
$\mathrm{ALG}_{k,\varepsilon} \le \left( \sqrt{\mathrm{OPT} + 4\varepsilon \|X_n\|_F^2} + 2\sqrt{40\,\varepsilon}\, \|X_n\|_F \right)^2 \le \mathrm{OPT} + O\!\left( \sqrt{\varepsilon\, \mathrm{OPT}}\, \|X_n\|_F + \varepsilon \|X_n\|_F^2 \right)$.

In particular, for $\varepsilon = \delta^2\, \mathrm{OPT}/\|X_n\|_F^2$, we get $\mathrm{ALG}_{k,\varepsilon} = \mathrm{OPT}\,(1 + O(\delta))$.

Lemma 4.12. (Time complexity) Algorithm 2 requires $O\!\left(n d k/\varepsilon^3 + \log(w_n/w_0)\, d k^3/\varepsilon^6\right)$ arithmetic operations.

Proof. We begin by pointing out that in the algorithm we work with the $d \times d$ matrices $C$ and $Z Z^\top$. These matrices have a bounded rank and, furthermore, we are always able to maintain a representation of them of the form $A A^\top$ with $A$ being a $d \times r$ matrix, $r$ being the rank of $C$ or $Z Z^\top$. Hence, all computations involving them require $O(dr)$ or $O(dr^2)$ arithmetic operations, corresponding to a product with a vector and to computing the eigendecomposition, respectively.

Consider first the cost of the operations outside the while-loop that do not involve the matrix $C$. It is easy to see, with Lemma 4.1, item 1, that the amortized time required in each iteration is $O(dk/\varepsilon^3)$. The two operations of computing $C$ and its top singular value may require $O(dk^2/\varepsilon^4)$ time (the rank of $C$ is at most the number of columns of $Z$, which is $O(k/\varepsilon^2)$). However, these operations do not necessarily have to occur in every iteration if we use the following trick: when entering the while-loop we require the norm of $C + r_t r_t^\top$ to be bounded by $\varepsilon^2 w_t/(2k)$ rather than $\varepsilon^2 w_t/k$. Assume now that we entered the while-loop at time $t'$. In this case, we do not need to check the condition of the while-loop again until we reach an iteration $t$ where $w_t \ge w_{t'}(1 + \varepsilon^2/2)$. Other than checking the condition of the while-loop there is no need to compute $C$, hence the costly operations of the external loop need only be executed $O(\log(w_n/w_0)/\varepsilon^2)$ times. We now proceed to analyze the running time of the inner while-loop. Each such iteration requires $O(dk^2/\varepsilon^4)$ arithmetic operations. However, according to Lemma 4.9, the total number of such iterations is bounded by $O(k \log(w_n/w_0)/\varepsilon^2)$. The lemma immediately follows.

Lemma 4.13. (Space complexity) Algorithm 2 requires $O(dk/\varepsilon^3)$ space.

Proof. Immediate from the algorithm and the properties of the Frequent Directions sketch (Lemma 4.1, item 2).

5 Conclusion

Recent discussions with Mark Tygert raise the conjecture that a random-projection based PCA technique (see Section 1.1.1) could potentially yield bounds similar to the ones we achieve here. Another interesting observation is that our algorithm actually bounds $\|X - \Phi Y\|_2^2$. To the best of our understanding, comparable results are unlikely to be obtainable via random-projection based techniques. Such bounds are useful in the case of noisy data where the optimal registration error is large but the signal is still recoverable. A proof of this fact will appear in the full version of this paper.

References

[1] Dimitris Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. J. Comput. Syst. Sci., 66(4), 2003.

[2] Raman Arora, Andy Cotter, and Nati Srebro. Stochastic optimization of PCA with capped MSG. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, 2013.

[3] Akshay Balsubramani, Sanjoy Dasgupta, and Yoav Freund. The fast convergence of incremental PCA. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, 2013.

[4] Kenneth L. Clarkson and David P. Woodruff. Numerical linear algebra in the streaming model. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing (STOC). ACM, 2009.

[5] Sanjoy Dasgupta and Anupam Gupta. An elementary proof of the Johnson-Lindenstrauss lemma. Technical report, Berkeley, CA, 1999.
[6] Amit Deshpande and Santosh Vempala. Adaptive sampling and fast low-rank matrix approximation. In APPROX-RANDOM, 2006.

[7] Petros Drineas and Ravi Kannan. Pass efficient algorithms for approximating large matrices. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '03), Philadelphia, PA, USA, 2003. Society for Industrial and Applied Mathematics.

[8] George H. Dunteman. Principal Components Analysis. Number 69. Sage.

[9] Alan Frieze, Ravi Kannan, and Santosh Vempala. Fast Monte-Carlo algorithms for finding low-rank approximations. In Proceedings of the 39th Annual Symposium on Foundations of Computer Science (FOCS '98), Washington, DC, USA, 1998. IEEE Computer Society.

[10] Mina Ghashami and Jeff M. Phillips. Relative errors for deterministic low-rank matrix approximations. In SODA, 2014.

[11] Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2), 2011.

[12] W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemp. Math., 26, 1984.

[13] Edo Liberty. Simple and deterministic matrix sketching. In KDD, 2013.

[14] Edo Liberty, Franco Woolfe, Per-Gunnar Martinsson, Vladimir Rokhlin, and Mark Tygert. Randomized algorithms for the low-rank approximation of matrices. Proceedings of the National Academy of Sciences, 104(51), December 2007.

[15] Ioannis Mitliagkas, Constantine Caramanis, and Prateek Jain. Memory limited, streaming PCA. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, 2013.

[16] Jiazhong Nie, Wojciech Kotlowski, and Manfred K. Warmuth. Online PCA with optimal regrets. In ALT, 2013.

[17] Mark Rudelson and Roman Vershynin. Sampling from large matrices: An approach through geometric functional analysis. J. ACM, 54(4), July 2007.

[18] Tamás Sarlós. Improved approximation algorithms for large matrices via random projections. In FOCS. IEEE Computer Society, 2006.

[19] Manfred K. Warmuth and Dima Kuzmin. Randomized online PCA algorithms with regret bounds that are logarithmic in the dimension. Journal of Machine Learning Research, 2008.

APPENDIX: Omitted Proofs

5.1 Proof of Lemma 3.1

Proof. Let $C_i$ be a value taken by the matrix $C$ at some point during the algorithm. Let $l(i)$ be the number of non-zero vectors in $U$ at the time of the update to $C$. For $j \le \ell'$, denote by $U_j$ the $d \times \ell$ matrix whose $p$-th column is $u_p$ for $p < j$ and the zero column for $p \ge j$. We prove by induction on $i$ that

1. $U_{l(i)}^\top u_{l(i)} = 0$;

2. $C_i \cdot \left( \sum_{j=1}^{l(i)} \lambda_j u_j u_j^\top \right) = 0_{d \times d}$.

The claim will immediately follow. Notice that since all $u_j$ are chosen as unit vectors, statement 1 above also indicates that $u_1, \ldots, u_{l(i)}$ form an orthonormal set.

Base case. The base case of the proof is trivial, as for $i = 1$ it holds that $l(i) = 0$, hence $U_{l(i)}$ is the all-zeros matrix and $u_{l(i)} = 0$.

Induction hypothesis. Let $i \ge 1$. We assume the following:

1. $U_{l(i)}^\top u_{l(i)} = 0$;

2. $C_i \cdot \left( \sum_{j=1}^{l(i)} \lambda_j u_j u_j^\top \right) = 0_{d \times d}$.

Induction step. We prove that for $i+1$:

1. $U_{l(i+1)}^\top u_{l(i+1)} = 0$;

2. $C_{i+1} \cdot \left( \sum_{j=1}^{l(i+1)} \lambda_j u_j u_j^\top \right) = 0_{d \times d}$.

Consider the matrix $C_{i+1}$. There are two possible reasons for the change in $C$, and we split the proof to deal with each reason separately.

First reason: $C_{i+1} = C_i + r_t r_t^\top$ for some $t$. In this case item 1 holds trivially, as $l(i+1) = l(i)$. For item 2, denote $P = \sum_{j=1}^{l(i)} u_j u_j^\top$. We have
$C_{i+1} \cdot \left( \sum_{j=1}^{l(i)} \lambda_j u_j u_j^\top \right) = (C_i + r_t r_t^\top) \sum_{j=1}^{l(i)} \lambda_j u_j u_j^\top \overset{(i)}{=} r_t r_t^\top \sum_{j=1}^{l(i)} \lambda_j u_j u_j^\top \overset{(ii)}{=} (I_d - P)\, x_t x_t^\top (I_d - P) \sum_{j=1}^{l(i)} \lambda_j u_j u_j^\top \overset{(iii)}{=} 0_{d \times d}$.
Equality (i) is due to induction hypothesis 2. Equality (ii) is due to the definition of $r_t$, and equality (iii) holds since the induction hypothesis for item 1 indicates that $u_1, \ldots, u_{l(i)}$ are orthonormal, hence $(I_d - P) \sum_{j=1}^{l(i)} \lambda_j u_j u_j^\top = 0_{d \times d}$.

Second reason: $l(i+1) = l(i) + 1$ and $C_{i+1} = C_i - \lambda_{l(i+1)} u_{l(i+1)} u_{l(i+1)}^\top$. Furthermore, $(u_{l(i+1)}, \lambda_{l(i+1)})$ is the top eigen-pair of $C_i$. For item 1, notice that according to our induction hypothesis (item 2), $C_i$ is a symmetric matrix whose row and column spaces are orthogonal to $u_1, \ldots, u_{l(i)}$. Hence, as $u_{l(i+1)}$ is an eigenvector of $C_i$, it must be the case that $u_{l(i+1)}$ is orthogonal to $u_1, \ldots, u_{l(i)}$. Next, we proceed to item 2.

Denote $Z = \sum_{j=1}^{l(i+1)} \lambda_j u_j u_j^\top$. Notice that

$C_{i+1} \cdot \left( \sum_{j=1}^{l(i+1)} \lambda_j u_j u_j^\top \right) = \left( C_i - \lambda_{l(i+1)} u_{l(i+1)} u_{l(i+1)}^\top \right) Z \overset{(i)}{=} C_i \left( I_d - u_{l(i+1)} u_{l(i+1)}^\top \right) Z \overset{(ii)}{=} C_i \left( I_d - \sum_{j=1}^{l(i)} u_j u_j^\top \right)\left( I_d - u_{l(i+1)} u_{l(i+1)}^\top \right) Z \overset{(iii)}{=} C_i \left( I_d - \sum_{j=1}^{l(i+1)} u_j u_j^\top \right) Z \overset{(iv)}{=} 0_{d \times d}$.

Equality (i) holds as $(u_{l(i+1)}, \lambda_{l(i+1)})$ is an eigen-pair of $C_i$. Equality (ii) is due to induction hypothesis 2. Equalities (iii) and (iv) hold as $u_1, \ldots, u_{l(i+1)}$ are an orthonormal set.

5.2 Basic Facts

First, we show that the matrices $X$, $\hat{X}$, and $R$ satisfy some form of the Pythagorean theorem for matrices. This result is useful in a technical manipulation in Lemma 3.5, where we provide an upper bound for $\|R\|_F^2$.

Lemma 5.1. Let $X \in \mathbb{R}^{d \times n}$, $\hat{X} \in \mathbb{R}^{d \times n}$ and $R \in \mathbb{R}^{d \times n}$ be the matrices whose $t$-th column is $x_t \in \mathbb{R}^d$, $\hat{x}_t = U_t U_t^\top x_t \in \mathbb{R}^d$, and $r_t = x_t - \hat{x}_t \in \mathbb{R}^d$, respectively (the vectors $r_t$ are taken as the ones at the end of the corresponding iteration in Algorithm 1). Then,
$\|X\|_F^2 = \|\hat{X}\|_F^2 + \|R\|_F^2$.

Proof. Recall that for any projection matrix $P$ and vector $x$ it holds that $\|x\|_2^2 = \|P x\|_2^2 + \|(I - P) x\|_2^2$. According to Lemma 3.1 we have that $P_t = U_t U_t^\top$ is a projection matrix for all $t$, hence
$\|x_t\|_2^2 = \|P_t x_t\|_2^2 + \|(I - P_t) x_t\|_2^2 = \|\hat{x}_t\|_2^2 + \|r_t\|_2^2$.
Summing over all $t$ yields the claim.

The next two lemmas provide upper bounds for the dimensions of the subspaces spanned by the columns of $\sum_{j=1}^{\ell'} \lambda_j u_j u_j^\top$ and $C_z$, respectively. We will use those results in Lemma 5.4, where one requires a description of the SVD of the sum $\sum_{j=1}^{\ell'} \lambda_j u_j u_j^\top + C_z$ to calculate a bound for $\|R\|_2^2$.

Lemma 5.2. Let $[\lambda_j, u_j]$, for $j = 1, \ldots, \ell'$, be the pairs of eigenvalues/eigenvectors computed in Algorithm 1, with $[\lambda_j, u_j]$ computed before $[\lambda_{j+1}, u_{j+1}]$. Then,
$\mathrm{rank}\left( \sum_{j=1}^{\ell'} \lambda_j u_j u_j^\top \right) = \ell'$.

Proof. This result follows immediately from Lemma 3.1 and Lemma 3.3. Specifically, from Lemma 3.1 we have that the $u_j$'s are orthogonal to each other, and from Lemma 3.3 we have that $\lambda_j > 0$, hence $\mathrm{rank}(\sum_{j=1}^{\ell'} \lambda_j u_j u_j^\top) = \ell'$.

Lemma 5.3. Let $C_z$ be the matrix $C$ upon termination of Algorithm 1. Then, $\mathrm{rank}(C_z) \le d - \ell'$.

Proof. We give a proof by contradiction. Assume that $\mathrm{rank}(C_z) := \rho > d - \ell'$. For notational convenience, let $B := \sum_{j=1}^{\ell'} \lambda_j u_j u_j^\top$. From Lemma 5.2, we have that $\mathrm{rank}(B) = \ell'$, hence $B = U_B \Sigma_B U_B^\top$ is the SVD of $B$ with $U_B \in \mathbb{R}^{d \times \ell'}$ and $\Sigma_B \in \mathbb{R}^{\ell' \times \ell'}$. From our assumption, we have that $\mathrm{rank}(C_z) = \rho$, hence $C_z = U_{C_z} \Sigma_{C_z} U_{C_z}^\top$ is the SVD of $C_z$ with $U_{C_z} \in \mathbb{R}^{d \times \rho}$ and $\Sigma_{C_z} \in \mathbb{R}^{\rho \times \rho}$; here, $\Sigma_{C_z}$ contains strictly positive diagonal entries. From Lemma 3.1 we have that $C_z B = 0_{d \times d}$, which is equivalent to saying that $U_{C_z}^\top U_B = 0$. Also, $U_{C_z}$ contains $\rho > d - \ell'$ columns. On the one hand, based on the relation $U_{C_z}^\top U_B = 0$, every column of $U_{C_z}$ lies in $\mathrm{span}(I - U_B U_B^\top)$, which has dimension $d - \ell'$. On the other hand, the $\rho > d - \ell'$ orthonormal columns of $U_{C_z}$ cannot all lie in a subspace of dimension $d - \ell'$. The two statements contradict each other, which proves the claim.

Finally, we prove a result which says that the spectral norm squared of $R$ is at most the maximum of the spectral norm of $C_z$ and the spectral norm of $\sum_{j=1}^{\ell'} \lambda_j u_j u_j^\top$. This result is useful in proving Lemma 3.6.

Lemma 5.4. Let $R \in \mathbb{R}^{d \times n}$ be the matrix whose $t$-th column is $r_t \in \mathbb{R}^d$ (the vectors $r_t$ are taken as the ones at the end of the corresponding iteration). Let $C_z$ be the $d \times d$ matrix $C$ at the end of the algorithm, and let $\lambda_j, u_j$, for $j = 1, \ldots, \ell'$, be the eigenvalues and corresponding eigenvectors computed in Algorithm 1, with $\lambda_j$ computed before $\lambda_{j+1}$. Then,
$\|R\|_2^2 = \|R R^\top\|_2 = \max\left\{ \|C_z\|_2,\ \left\| \sum_{j=1}^{\ell'} \lambda_j u_j u_j^\top \right\|_2 \right\}$.

Proof. For notational convenience, let $B := \sum_{j=1}^{\ell'} \lambda_j u_j u_j^\top$. From Lemma 5.2, we have that $\mathrm{rank}(B) = \ell'$, hence $B = U_B \Sigma_B U_B^\top$ is the SVD of $B$ with $U_B \in \mathbb{R}^{d \times \ell'}$
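As a small numerical illustration (ours, not part of the paper) of the claim of Lemma 5.4, namely that for positive semidefinite matrices with mutually orthogonal column spaces the spectral norm of the sum equals the larger of the two spectral norms:

```python
import numpy as np

d = 50
Q, _ = np.linalg.qr(np.random.randn(d, 10))                 # 10 orthonormal directions
A = Q[:, :6] @ np.diag([5., 4, 3, 2, 1, 1]) @ Q[:, :6].T    # PSD, spanned by the first 6
B = Q[:, 6:] @ np.diag([7., 2, 2, 1]) @ Q[:, 6:].T          # PSD, spanned by the other 4
assert np.isclose(A @ B, 0).all()                           # orthogonal column spaces: A B = 0
lhs = np.linalg.norm(A + B, 2)
assert np.isclose(lhs, max(np.linalg.norm(A, 2), np.linalg.norm(B, 2)))   # both equal 7
```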


More information

Dot Products. K. Behrend. April 3, Abstract A short review of some basic facts on the dot product. Projections. The spectral theorem.

Dot Products. K. Behrend. April 3, Abstract A short review of some basic facts on the dot product. Projections. The spectral theorem. Dot Products K. Behrend April 3, 008 Abstract A short review of some basic facts on the dot product. Projections. The spectral theorem. Contents The dot product 3. Length of a vector........................

More information

Basic Calculus Review

Basic Calculus Review Basic Calculus Review Lorenzo Rosasco ISML Mod. 2 - Machine Learning Vector Spaces Functionals and Operators (Matrices) Vector Space A vector space is a set V with binary operations +: V V V and : R V

More information

Lecture 3: Review of Linear Algebra

Lecture 3: Review of Linear Algebra ECE 83 Fall 2 Statistical Signal Processing instructor: R Nowak Lecture 3: Review of Linear Algebra Very often in this course we will represent signals as vectors and operators (eg, filters, transforms,

More information

Lecture 3: Review of Linear Algebra

Lecture 3: Review of Linear Algebra ECE 83 Fall 2 Statistical Signal Processing instructor: R Nowak, scribe: R Nowak Lecture 3: Review of Linear Algebra Very often in this course we will represent signals as vectors and operators (eg, filters,

More information

Non-Asymptotic Theory of Random Matrices Lecture 4: Dimension Reduction Date: January 16, 2007

Non-Asymptotic Theory of Random Matrices Lecture 4: Dimension Reduction Date: January 16, 2007 Non-Asymptotic Theory of Random Matrices Lecture 4: Dimension Reduction Date: January 16, 2007 Lecturer: Roman Vershynin Scribe: Matthew Herman 1 Introduction Consider the set X = {n points in R N } where

More information

Methods for sparse analysis of high-dimensional data, II

Methods for sparse analysis of high-dimensional data, II Methods for sparse analysis of high-dimensional data, II Rachel Ward May 26, 2011 High dimensional data with low-dimensional structure 300 by 300 pixel images = 90, 000 dimensions 2 / 55 High dimensional

More information

THE SINGULAR VALUE DECOMPOSITION MARKUS GRASMAIR

THE SINGULAR VALUE DECOMPOSITION MARKUS GRASMAIR THE SINGULAR VALUE DECOMPOSITION MARKUS GRASMAIR 1. Definition Existence Theorem 1. Assume that A R m n. Then there exist orthogonal matrices U R m m V R n n, values σ 1 σ 2... σ p 0 with p = min{m, n},

More information

Lecture 18 Nov 3rd, 2015

Lecture 18 Nov 3rd, 2015 CS 229r: Algorithms for Big Data Fall 2015 Prof. Jelani Nelson Lecture 18 Nov 3rd, 2015 Scribe: Jefferson Lee 1 Overview Low-rank approximation, Compression Sensing 2 Last Time We looked at three different

More information

PCA, Kernel PCA, ICA

PCA, Kernel PCA, ICA PCA, Kernel PCA, ICA Learning Representations. Dimensionality Reduction. Maria-Florina Balcan 04/08/2015 Big & High-Dimensional Data High-Dimensions = Lot of Features Document classification Features per

More information

642:550, Summer 2004, Supplement 6 The Perron-Frobenius Theorem. Summer 2004

642:550, Summer 2004, Supplement 6 The Perron-Frobenius Theorem. Summer 2004 642:550, Summer 2004, Supplement 6 The Perron-Frobenius Theorem. Summer 2004 Introduction Square matrices whose entries are all nonnegative have special properties. This was mentioned briefly in Section

More information

Fast Random Projections using Lean Walsh Transforms Yale University Technical report #1390

Fast Random Projections using Lean Walsh Transforms Yale University Technical report #1390 Fast Random Projections using Lean Walsh Transforms Yale University Technical report #1390 Edo Liberty Nir Ailon Amit Singer Abstract We present a k d random projection matrix that is applicable to vectors

More information

Dimensionality reduction: Johnson-Lindenstrauss lemma for structured random matrices

Dimensionality reduction: Johnson-Lindenstrauss lemma for structured random matrices Dimensionality reduction: Johnson-Lindenstrauss lemma for structured random matrices Jan Vybíral Austrian Academy of Sciences RICAM, Linz, Austria January 2011 MPI Leipzig, Germany joint work with Aicke

More information

Sketching as a Tool for Numerical Linear Algebra

Sketching as a Tool for Numerical Linear Algebra Sketching as a Tool for Numerical Linear Algebra (Part 2) David P. Woodruff presented by Sepehr Assadi o(n) Big Data Reading Group University of Pennsylvania February, 2015 Sepehr Assadi (Penn) Sketching

More information

Solution-recovery in l 1 -norm for non-square linear systems: deterministic conditions and open questions

Solution-recovery in l 1 -norm for non-square linear systems: deterministic conditions and open questions Solution-recovery in l 1 -norm for non-square linear systems: deterministic conditions and open questions Yin Zhang Technical Report TR05-06 Department of Computational and Applied Mathematics Rice University,

More information

Noisy Streaming PCA. Noting g t = x t x t, rearranging and dividing both sides by 2η we get

Noisy Streaming PCA. Noting g t = x t x t, rearranging and dividing both sides by 2η we get Supplementary Material A. Auxillary Lemmas Lemma A. Lemma. Shalev-Shwartz & Ben-David,. Any update of the form P t+ = Π C P t ηg t, 3 for an arbitrary sequence of matrices g, g,..., g, projection Π C onto

More information

Lecture 4: Purifications and fidelity

Lecture 4: Purifications and fidelity CS 766/QIC 820 Theory of Quantum Information (Fall 2011) Lecture 4: Purifications and fidelity Throughout this lecture we will be discussing pairs of registers of the form (X, Y), and the relationships

More information

Contents. 1 Vectors, Lines and Planes 1. 2 Gaussian Elimination Matrices Vector Spaces and Subspaces 124

Contents. 1 Vectors, Lines and Planes 1. 2 Gaussian Elimination Matrices Vector Spaces and Subspaces 124 Matrices Math 220 Copyright 2016 Pinaki Das This document is freely redistributable under the terms of the GNU Free Documentation License For more information, visit http://wwwgnuorg/copyleft/fdlhtml Contents

More information

Simple and Deterministic Matrix Sketches

Simple and Deterministic Matrix Sketches Simple and Deterministic Matrix Sketches Edo Liberty + ongoing work with: Mina Ghashami, Jeff Philips and David Woodruff. Edo Liberty: Simple and Deterministic Matrix Sketches 1 / 41 Data Matrices Often

More information

A Randomized Algorithm for Large Scale Support Vector Learning

A Randomized Algorithm for Large Scale Support Vector Learning A Randomized Algorithm for Large Scale Support Vector Learning Krishnan S. Department of Computer Science and Automation, Indian Institute of Science, Bangalore-12 rishi@csa.iisc.ernet.in Chiranjib Bhattacharyya

More information

Lecture 9: Low Rank Approximation

Lecture 9: Low Rank Approximation CSE 521: Design and Analysis of Algorithms I Fall 2018 Lecture 9: Low Rank Approximation Lecturer: Shayan Oveis Gharan February 8th Scribe: Jun Qi Disclaimer: These notes have not been subjected to the

More information

SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices)

SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Chapter 14 SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Today we continue the topic of low-dimensional approximation to datasets and matrices. Last time we saw the singular

More information

A fast randomized algorithm for overdetermined linear least-squares regression

A fast randomized algorithm for overdetermined linear least-squares regression A fast randomized algorithm for overdetermined linear least-squares regression Vladimir Rokhlin and Mark Tygert Technical Report YALEU/DCS/TR-1403 April 28, 2008 Abstract We introduce a randomized algorithm

More information

sublinear time low-rank approximation of positive semidefinite matrices Cameron Musco (MIT) and David P. Woodru (CMU)

sublinear time low-rank approximation of positive semidefinite matrices Cameron Musco (MIT) and David P. Woodru (CMU) sublinear time low-rank approximation of positive semidefinite matrices Cameron Musco (MIT) and David P. Woodru (CMU) 0 overview Our Contributions: 1 overview Our Contributions: A near optimal low-rank

More information

arxiv: v1 [cs.ds] 30 May 2010

arxiv: v1 [cs.ds] 30 May 2010 ALMOST OPTIMAL UNRESTRICTED FAST JOHNSON-LINDENSTRAUSS TRANSFORM arxiv:005.553v [cs.ds] 30 May 200 NIR AILON AND EDO LIBERTY Abstract. The problems of random projections and sparse reconstruction have

More information

Elementary linear algebra

Elementary linear algebra Chapter 1 Elementary linear algebra 1.1 Vector spaces Vector spaces owe their importance to the fact that so many models arising in the solutions of specific problems turn out to be vector spaces. The

More information

4 Bias-Variance for Ridge Regression (24 points)

4 Bias-Variance for Ridge Regression (24 points) Implement Ridge Regression with λ = 0.00001. Plot the Squared Euclidean test error for the following values of k (the dimensions you reduce to): k = {0, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500,

More information

Boutsidis, Garber, Karnin, Liberty. PRESENTED BY Firstname Lastname August 25, 2013 PRESENTED BY Zohar Karnin November 23, 2014

Boutsidis, Garber, Karnin, Liberty. PRESENTED BY Firstname Lastname August 25, 2013 PRESENTED BY Zohar Karnin November 23, 2014 A Online PowerPoint Principal Presentation Component Analysis Boutsidis, Garber, Karnin, Liberty PRESENTED BY Firstname Lastname August 5, 013 PRESENTED BY Zohar Karnin November 3, 014 Data Matrix Often,

More information

A Simple Algorithm for Clustering Mixtures of Discrete Distributions

A Simple Algorithm for Clustering Mixtures of Discrete Distributions 1 A Simple Algorithm for Clustering Mixtures of Discrete Distributions Pradipta Mitra In this paper, we propose a simple, rotationally invariant algorithm for clustering mixture of distributions, including

More information

Normalized power iterations for the computation of SVD

Normalized power iterations for the computation of SVD Normalized power iterations for the computation of SVD Per-Gunnar Martinsson Department of Applied Mathematics University of Colorado Boulder, Co. Per-gunnar.Martinsson@Colorado.edu Arthur Szlam Courant

More information

Vector Spaces, Orthogonality, and Linear Least Squares

Vector Spaces, Orthogonality, and Linear Least Squares Week Vector Spaces, Orthogonality, and Linear Least Squares. Opening Remarks.. Visualizing Planes, Lines, and Solutions Consider the following system of linear equations from the opener for Week 9: χ χ

More information

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A =

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = 30 MATHEMATICS REVIEW G A.1.1 Matrices and Vectors Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = a 11 a 12... a 1N a 21 a 22... a 2N...... a M1 a M2... a MN A matrix can

More information

Large Scale Data Analysis Using Deep Learning

Large Scale Data Analysis Using Deep Learning Large Scale Data Analysis Using Deep Learning Linear Algebra U Kang Seoul National University U Kang 1 In This Lecture Overview of linear algebra (but, not a comprehensive survey) Focused on the subset

More information

SPECTRAL PROPERTIES OF THE LAPLACIAN ON BOUNDED DOMAINS

SPECTRAL PROPERTIES OF THE LAPLACIAN ON BOUNDED DOMAINS SPECTRAL PROPERTIES OF THE LAPLACIAN ON BOUNDED DOMAINS TSOGTGEREL GANTUMUR Abstract. After establishing discrete spectra for a large class of elliptic operators, we present some fundamental spectral properties

More information

arxiv: v5 [math.na] 16 Nov 2017

arxiv: v5 [math.na] 16 Nov 2017 RANDOM PERTURBATION OF LOW RANK MATRICES: IMPROVING CLASSICAL BOUNDS arxiv:3.657v5 [math.na] 6 Nov 07 SEAN O ROURKE, VAN VU, AND KE WANG Abstract. Matrix perturbation inequalities, such as Weyl s theorem

More information

Dense Fast Random Projections and Lean Walsh Transforms

Dense Fast Random Projections and Lean Walsh Transforms Dense Fast Random Projections and Lean Walsh Transforms Edo Liberty, Nir Ailon, and Amit Singer Abstract. Random projection methods give distributions over k d matrices such that if a matrix Ψ (chosen

More information

7 Principal Component Analysis

7 Principal Component Analysis 7 Principal Component Analysis This topic will build a series of techniques to deal with high-dimensional data. Unlike regression problems, our goal is not to predict a value (the y-coordinate), it is

More information

Column Selection via Adaptive Sampling

Column Selection via Adaptive Sampling Column Selection via Adaptive Sampling Saurabh Paul Global Risk Sciences, Paypal Inc. saupaul@paypal.com Malik Magdon-Ismail CS Dept., Rensselaer Polytechnic Institute magdon@cs.rpi.edu Petros Drineas

More information

Some Useful Background for Talk on the Fast Johnson-Lindenstrauss Transform

Some Useful Background for Talk on the Fast Johnson-Lindenstrauss Transform Some Useful Background for Talk on the Fast Johnson-Lindenstrauss Transform Nir Ailon May 22, 2007 This writeup includes very basic background material for the talk on the Fast Johnson Lindenstrauss Transform

More information

Recall the convention that, for us, all vectors are column vectors.

Recall the convention that, for us, all vectors are column vectors. Some linear algebra Recall the convention that, for us, all vectors are column vectors. 1. Symmetric matrices Let A be a real matrix. Recall that a complex number λ is an eigenvalue of A if there exists

More information

Using the Johnson-Lindenstrauss lemma in linear and integer programming

Using the Johnson-Lindenstrauss lemma in linear and integer programming Using the Johnson-Lindenstrauss lemma in linear and integer programming Vu Khac Ky 1, Pierre-Louis Poirion, Leo Liberti LIX, École Polytechnique, F-91128 Palaiseau, France Email:{vu,poirion,liberti}@lix.polytechnique.fr

More information

Math 102, Winter Final Exam Review. Chapter 1. Matrices and Gaussian Elimination

Math 102, Winter Final Exam Review. Chapter 1. Matrices and Gaussian Elimination Math 0, Winter 07 Final Exam Review Chapter. Matrices and Gaussian Elimination { x + x =,. Different forms of a system of linear equations. Example: The x + 4x = 4. [ ] [ ] [ ] vector form (or the column

More information

Random Methods for Linear Algebra

Random Methods for Linear Algebra Gittens gittens@acm.caltech.edu Applied and Computational Mathematics California Institue of Technology October 2, 2009 Outline The Johnson-Lindenstrauss Transform 1 The Johnson-Lindenstrauss Transform

More information

Tikhonov Regularization of Large Symmetric Problems

Tikhonov Regularization of Large Symmetric Problems NUMERICAL LINEAR ALGEBRA WITH APPLICATIONS Numer. Linear Algebra Appl. 2000; 00:1 11 [Version: 2000/03/22 v1.0] Tihonov Regularization of Large Symmetric Problems D. Calvetti 1, L. Reichel 2 and A. Shuibi

More information

Least Squares Optimization

Least Squares Optimization Least Squares Optimization The following is a brief review of least squares optimization and constrained optimization techniques. Broadly, these techniques can be used in data analysis and visualization

More information

Approximating sparse binary matrices in the cut-norm

Approximating sparse binary matrices in the cut-norm Approximating sparse binary matrices in the cut-norm Noga Alon Abstract The cut-norm A C of a real matrix A = (a ij ) i R,j S is the maximum, over all I R, J S of the quantity i I,j J a ij. We show that

More information

Chapter XII: Data Pre and Post Processing

Chapter XII: Data Pre and Post Processing Chapter XII: Data Pre and Post Processing Information Retrieval & Data Mining Universität des Saarlandes, Saarbrücken Winter Semester 2013/14 XII.1 4-1 Chapter XII: Data Pre and Post Processing 1. Data

More information

Randomized methods for computing the Singular Value Decomposition (SVD) of very large matrices

Randomized methods for computing the Singular Value Decomposition (SVD) of very large matrices Randomized methods for computing the Singular Value Decomposition (SVD) of very large matrices Gunnar Martinsson The University of Colorado at Boulder Students: Adrianna Gillman Nathan Halko Sijia Hao

More information

Improved Bounds on the Dot Product under Random Projection and Random Sign Projection

Improved Bounds on the Dot Product under Random Projection and Random Sign Projection Improved Bounds on the Dot Product under Random Projection and Random Sign Projection Ata Kabán School of Computer Science The University of Birmingham Birmingham B15 2TT, UK http://www.cs.bham.ac.uk/

More information

CS168: The Modern Algorithmic Toolbox Lecture #8: How PCA Works

CS168: The Modern Algorithmic Toolbox Lecture #8: How PCA Works CS68: The Modern Algorithmic Toolbox Lecture #8: How PCA Works Tim Roughgarden & Gregory Valiant April 20, 206 Introduction Last lecture introduced the idea of principal components analysis (PCA). The

More information

A randomized algorithm for approximating the SVD of a matrix

A randomized algorithm for approximating the SVD of a matrix A randomized algorithm for approximating the SVD of a matrix Joint work with Per-Gunnar Martinsson (U. of Colorado) and Vladimir Rokhlin (Yale) Mark Tygert Program in Applied Mathematics Yale University

More information

A Fast Algorithm For Computing The A-optimal Sampling Distributions In A Big Data Linear Regression

A Fast Algorithm For Computing The A-optimal Sampling Distributions In A Big Data Linear Regression A Fast Algorithm For Computing The A-optimal Sampling Distributions In A Big Data Linear Regression Hanxiang Peng and Fei Tan Indiana University Purdue University Indianapolis Department of Mathematical

More information

The Johnson-Lindenstrauss Lemma

The Johnson-Lindenstrauss Lemma The Johnson-Lindenstrauss Lemma Kevin (Min Seong) Park MAT477 Introduction The Johnson-Lindenstrauss Lemma was first introduced in the paper Extensions of Lipschitz mappings into a Hilbert Space by William

More information

Least Squares Optimization

Least Squares Optimization Least Squares Optimization The following is a brief review of least squares optimization and constrained optimization techniques, which are widely used to analyze and visualize data. Least squares (LS)

More information

Functional Analysis Review

Functional Analysis Review Outline 9.520: Statistical Learning Theory and Applications February 8, 2010 Outline 1 2 3 4 Vector Space Outline A vector space is a set V with binary operations +: V V V and : R V V such that for all

More information

A fast randomized algorithm for approximating an SVD of a matrix

A fast randomized algorithm for approximating an SVD of a matrix A fast randomized algorithm for approximating an SVD of a matrix Joint work with Franco Woolfe, Edo Liberty, and Vladimir Rokhlin Mark Tygert Program in Applied Mathematics Yale University Place July 17,

More information

Exercises * on Principal Component Analysis

Exercises * on Principal Component Analysis Exercises * on Principal Component Analysis Laurenz Wiskott Institut für Neuroinformatik Ruhr-Universität Bochum, Germany, EU 4 February 207 Contents Intuition 3. Problem statement..........................................

More information

1 Singular Value Decomposition and Principal Component

1 Singular Value Decomposition and Principal Component Singular Value Decomposition and Principal Component Analysis In these lectures we discuss the SVD and the PCA, two of the most widely used tools in machine learning. Principal Component Analysis (PCA)

More information

CS 229r: Algorithms for Big Data Fall Lecture 17 10/28

CS 229r: Algorithms for Big Data Fall Lecture 17 10/28 CS 229r: Algorithms for Big Data Fall 2015 Prof. Jelani Nelson Lecture 17 10/28 Scribe: Morris Yau 1 Overview In the last lecture we defined subspace embeddings a subspace embedding is a linear transformation

More information

Tighter Low-rank Approximation via Sampling the Leveraged Element

Tighter Low-rank Approximation via Sampling the Leveraged Element Tighter Low-rank Approximation via Sampling the Leveraged Element Srinadh Bhojanapalli The University of Texas at Austin bsrinadh@utexas.edu Prateek Jain Microsoft Research, India prajain@microsoft.com

More information

Mathematical foundations - linear algebra

Mathematical foundations - linear algebra Mathematical foundations - linear algebra Andrea Passerini passerini@disi.unitn.it Machine Learning Vector space Definition (over reals) A set X is called a vector space over IR if addition and scalar

More information

10 Distributed Matrix Sketching

10 Distributed Matrix Sketching 10 Distributed Matrix Sketching In order to define a distributed matrix sketching problem thoroughly, one has to specify the distributed model, data model and the partition model of data. The distributed

More information

APPENDIX A. Background Mathematics. A.1 Linear Algebra. Vector algebra. Let x denote the n-dimensional column vector with components x 1 x 2.

APPENDIX A. Background Mathematics. A.1 Linear Algebra. Vector algebra. Let x denote the n-dimensional column vector with components x 1 x 2. APPENDIX A Background Mathematics A. Linear Algebra A.. Vector algebra Let x denote the n-dimensional column vector with components 0 x x 2 B C @. A x n Definition 6 (scalar product). The scalar product

More information

RandNLA: Randomized Numerical Linear Algebra

RandNLA: Randomized Numerical Linear Algebra RandNLA: Randomized Numerical Linear Algebra Petros Drineas Rensselaer Polytechnic Institute Computer Science Department To access my web page: drineas RandNLA: sketch a matrix by row/ column sampling

More information

Throughout these notes we assume V, W are finite dimensional inner product spaces over C.

Throughout these notes we assume V, W are finite dimensional inner product spaces over C. Math 342 - Linear Algebra II Notes Throughout these notes we assume V, W are finite dimensional inner product spaces over C 1 Upper Triangular Representation Proposition: Let T L(V ) There exists an orthonormal

More information

arxiv: v1 [math.na] 1 Sep 2018

arxiv: v1 [math.na] 1 Sep 2018 On the perturbation of an L -orthogonal projection Xuefeng Xu arxiv:18090000v1 [mathna] 1 Sep 018 September 5 018 Abstract The L -orthogonal projection is an important mathematical tool in scientific computing

More information

Yale university technical report #1402.

Yale university technical report #1402. The Mailman algorithm: a note on matrix vector multiplication Yale university technical report #1402. Edo Liberty Computer Science Yale University New Haven, CT Steven W. Zucker Computer Science and Appled

More information

Chapter 3 Transformations

Chapter 3 Transformations Chapter 3 Transformations An Introduction to Optimization Spring, 2014 Wei-Ta Chu 1 Linear Transformations A function is called a linear transformation if 1. for every and 2. for every If we fix the bases

More information

MATRICES ARE SIMILAR TO TRIANGULAR MATRICES

MATRICES ARE SIMILAR TO TRIANGULAR MATRICES MATRICES ARE SIMILAR TO TRIANGULAR MATRICES 1 Complex matrices Recall that the complex numbers are given by a + ib where a and b are real and i is the imaginary unity, ie, i 2 = 1 In what we describe below,

More information

17 Random Projections and Orthogonal Matching Pursuit

17 Random Projections and Orthogonal Matching Pursuit 17 Random Projections and Orthogonal Matching Pursuit Again we will consider high-dimensional data P. Now we will consider the uses and effects of randomness. We will use it to simplify P (put it in a

More information