0368-3248-01 Algorithms in Data Mining, Fall 2013
Lecturer: Edo Liberty

Lecture 9: Matrix approximation continued

Warning: This note may contain typos and other inaccuracies which are usually discussed during class. Please do not cite this note as a reliable source. If you find mistakes, please inform me.

Recap

Last class was dedicated to approximating matrices by sampling individual entries from them. We used the following useful concentration result for sums of random matrices.

Lemma 0.1 (Matrix Bernstein inequality [1]). Let $X_1, \ldots, X_s$ be independent $m \times n$ matrix-valued random variables such that for all $k \in [s]$, $\mathbb{E}[X_k] = 0$ and $\|X_k\| \le R$. Set $\sigma^2 = \max\big\{\big\|\sum_k \mathbb{E}[X_k X_k^T]\big\|, \big\|\sum_k \mathbb{E}[X_k^T X_k]\big\|\big\}$. Then

$$\Pr\Big[\Big\|\sum_k X_k\Big\| > t\Big] \le (m+n)\, e^{-\frac{t^2}{\sigma^2 + Rt/3}}.$$

We obtained an approximate sampled matrix $B$ which was sparser than $A$. For example, for a matrix $A$ containing values in $\{-1, 0, 1\}$ it sufficed that $B$ contains only $s$ entries with

$$s \in O\Big(\frac{n r \log(n/\delta)}{\varepsilon^2}\Big).$$

Moreover, we got that $\|A - B\| \le \varepsilon\|A\|$. We also claimed that we can compute the PCA projection of $B$, $P_k^B$, instead of that of $A$, $P_k$, and still have:

$$\sigma_{k+1} \le \|A - P_k^B A\| \le \sigma_{k+1} + \varepsilon\|A\|.$$

In this class we will reduce the amount of space needed to be independent of $n$. We will do this by approximating $AA^T$ directly. Here we follow the ideas in [2] and give a much simpler proof which, unfortunately, obtains slightly worse bounds.

PCA with an approximate covariance matrix

Here we see that approximating $BB^T \approx AA^T$ is sufficient in order to compute an approximate PCA projection (Lemma 3.8 in [2]). Let $P_k^B$ denote the projection on the top $k$ left singular vectors of $B$.

$$\|A - P_k^B A\|^2 = \sup_{x,\, \|x\| \le 1} \|x^T A - x^T P_k^B A\|^2 \tag{1}$$
$$= \sup_{x \in \mathrm{null}(P_k^B),\, \|x\| = 1} \|x^T A\|^2 \tag{2}$$
$$= \sup_{x \in \mathrm{null}(P_k^B),\, \|x\| = 1} x^T A A^T x \tag{3}$$
$$\le \sup_{x \in \mathrm{null}(P_k^B),\, \|x\| = 1} x^T (AA^T - BB^T) x + x^T BB^T x \tag{4}$$
$$\le \|AA^T - BB^T\| + \sigma_{k+1}(BB^T) \tag{5}$$
$$\le 2\|AA^T - BB^T\| + \sigma_{k+1}(AA^T) \tag{6}$$
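The recapped claim can be checked numerically. The sketch below is not from the lecture: it uses a simple keep-each-entry-with-probability-$p$ variant of entrywise sampling (rescaled so that $\mathbb{E}[B] = A$), and all variable names are my own. It then projects $A$ onto the top-$k$ left singular vectors of the sparse sample $B$ and compares the residual against the lower bound $\sigma_{k+1}$, which holds for any rank-$k$ projection.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k, p = 60, 100, 5, 0.5

A = rng.standard_normal((m, n))

# Keep each entry independently with probability p, rescaled by 1/p,
# so that E[B] = A. B is (roughly) a factor 1/p sparser than A.
mask = rng.random((m, n)) < p
B = np.where(mask, A / p, 0.0)

# P_k^B: projection onto the top-k left singular vectors of B.
U, _, _ = np.linalg.svd(B, full_matrices=False)
Pk = U[:, :k] @ U[:, :k].T

err = np.linalg.norm(A - Pk @ A, 2)          # spectral-norm residual
sigma = np.linalg.svd(A, compute_uv=False)   # singular values of A

# Any rank-k projection leaves at least sigma_{k+1}(A) behind.
print(err >= sigma[k] - 1e-9)  # True
```

How close the residual gets to $\sigma_{k+1} + \varepsilon\|A\|$ depends on $p$; the point is only that the projection computed from the sparse $B$ can stand in for the one computed from $A$.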
The last transition is due to the fact that $\sigma_{k+1}(BB^T) \le \sigma_{k+1}(AA^T) + \|AA^T - BB^T\|$. By taking the square root and recalling that $\sigma_{k+1}(AA^T) = \sigma_{k+1}^2(A) = \sigma_{k+1}^2$ we get:

$$\|A - P_k^B A\| \le \sqrt{\sigma_{k+1}^2 + 2\|AA^T - BB^T\|} \le \sigma_{k+1} + \sqrt{2\|AA^T - BB^T\|}.$$

Therefore, we have that

$$\|AA^T - BB^T\| \le \varepsilon^2\|A\|^2/2 \;\Rightarrow\; \|A - P_k^B A\| \le \sigma_{k+1} + \varepsilon\|A\|.$$

Column subset selection by sampling

From this point on, our goal is to find a matrix $B$ such that $\|AA^T - BB^T\|$ is small and $B$ is as sparse or small as possible. Note that $AA^T = \sum_j A_j A_j^T$, where we denote by $A_j$ the $j$-th column of the matrix $A$. Let

$$B = A_j/\sqrt{p_j} \quad \text{with probability } p_j.$$

Computing the expectation of $BB^T$ we have

$$\mathbb{E}[BB^T] = \sum_j p_j (A_j/\sqrt{p_j})(A_j/\sqrt{p_j})^T = \sum_j A_j A_j^T = AA^T.$$

Clearly, we cannot hope to approximate the matrix $A$ with only one column. We therefore define $B$ to be $s$ such sampled columns from $A$, side by side:

$$B = \frac{1}{\sqrt{s}}[B_1 \ldots B_s].$$

Just to clarify, $B$ is an $m \times s$ matrix containing $s$ (rescaled) columns from $A$, where $B_k = A_j/\sqrt{p_j}$ with probability $p_j$. Computing the expectation of $BB^T$ we get that

$$\mathbb{E}[BB^T] = \mathbb{E}\Big[\frac{1}{s}\sum_{k=1}^s B_k B_k^T\Big] = \frac{1}{s}\sum_{k=1}^s \mathbb{E}[B_k B_k^T] = AA^T.$$

We are now ready to use the matrix Bernstein inequality above:

$$BB^T - AA^T = \sum_{k=1}^s \frac{1}{s}(B_k B_k^T - AA^T) = \sum_{k=1}^s X_k.$$

To make things simpler we pick $p_j = \|A_j\|^2/\|A\|_F^2$. In words, the columns are picked with probability proportional to their squared norm.

$$R = \max_k \|X_k\| \le \max_j \frac{1}{s}\|A_j A_j^T\|/p_j + \frac{1}{s}\|AA^T\| = \frac{1}{s}\|A\|_F^2 + \frac{1}{s}\|A\|^2$$

$$\sigma^2 = \Big\|\sum_k \mathbb{E}[X_k X_k^T]\Big\| = \frac{1}{s}\big\|\mathbb{E}[B_k B_k^T B_k B_k^T] - AA^T AA^T\big\| \tag{7}$$
$$\le \frac{1}{s}\Big\|\sum_j p_j\, A_j A_j^T A_j A_j^T / p_j^2\Big\| + \frac{1}{s}\|A\|^4 = \frac{1}{s}\|A\|_F^2\|A\|^2 + \frac{1}{s}\|A\|^4 \tag{8}$$
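The sampling scheme just described is short enough to write out in full. The following sketch (my own code; variable names are assumptions) draws $s$ columns with $p_j \propto \|A_j\|^2$, rescales each by $1/\sqrt{s\,p_j}$, and also verifies the unbiasedness computation $\mathbb{E}[BB^T] = AA^T$ by summing over all outcomes explicitly.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, s = 40, 200, 50

A = rng.standard_normal((m, n))

# Sample columns with probability proportional to their squared norms.
p = np.sum(A**2, axis=0) / np.linalg.norm(A, 'fro')**2

# B holds s i.i.d. rescaled columns: column j with probability p_j,
# scaled by 1/sqrt(s * p_j).
idx = rng.choice(n, size=s, p=p)
B = A[:, idx] / np.sqrt(s * p[idx])

# Unbiasedness check: summing p_j (A_j/sqrt(p_j))(A_j/sqrt(p_j))^T over
# all j recovers AA^T exactly, as computed in the text.
expected = sum(p[j] * np.outer(A[:, j] / np.sqrt(p[j]), A[:, j] / np.sqrt(p[j]))
               for j in range(n))
print(np.allclose(expected, A @ A.T))  # True

# One random draw: BB^T approximates AA^T in spectral norm.
rel_err = np.linalg.norm(B @ B.T - A @ A.T, 2) / np.linalg.norm(A @ A.T, 2)
print("relative error:", rel_err)
```

The printed relative error is a single random draw; the concentration argument below is what controls it uniformly with high probability.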
Plugging both expressions into the matrix Bernstein inequality above we get that

$$\Pr[\|BB^T - AA^T\| > \varepsilon^2\|A\|^2/2] \le m\, e^{-\frac{s\,\varepsilon^4\|A\|^4/4}{\|A\|_F^2\|A\|^2 + \|A\|^4 + \varepsilon^2\|A\|^2\|A\|_F^2/3 + \varepsilon^2\|A\|^4/3}}$$

By using that $\|A\|_F \ge \|A\|$ and that $\varepsilon \le 1$, and denoting the numeric rank of $A$ by $r = \|A\|_F^2/\|A\|^2$, we can simplify this to

$$\Pr[\|BB^T - AA^T\| > \varepsilon^2\|A\|^2/2] \le m\, e^{-\frac{s\varepsilon^4}{16r}}.$$

To conclude, it is enough to sample

$$s = \frac{16\, r \log(m/\delta)}{\varepsilon^4}$$

columns from $A$ (with probability proportional to their squared norm) to form a matrix $B$ such that $\|BB^T - AA^T\| \le \varepsilon^2\|A\|^2/2$ with probability at least $1 - \delta$. According to the above, this also gives us that $\|A - P_k^B A\| \le \sigma_{k+1} + \varepsilon\|A\|$, which completes our claim.

Note that the number of non-zeros in $B$ is bounded by $s \cdot m$, which is independent of $n$, the number of columns. This is potentially significantly better than the results obtained in last week's class.

Remark 0.1. More elaborate column selection algorithms exist which provide better approximations, but these will not be discussed here. See [3] for the latest result that I am aware of.

Deterministic lossy SVD

In this section we will see that the above approximation can be improved using a simple trick [4]. The algorithm keeps an $m \times s$ sketch matrix $B$ which is updated every time a new column from the input matrix $A$ is added. It maintains the invariant that the last column in the sketch is always all zeros. When a new input column is added, it is placed in the last (all-zeros) column of the sketch. Then, using its SVD, the sketch is rotated from the right so that its columns are orthogonal. Finally, the sketch column norms are shrunk so that the last column is again all zeros.

In the algorithm we denote by $[U, \Sigma, V] = \mathrm{SVD}(B)$ the singular value decomposition of $B$. We use the convention that $U\Sigma V^T = B$, $U^T U = V^T V = I$, and $\Sigma = \mathrm{diag}([\sigma_1, \ldots, \sigma_s])$, $\sigma_1 \ge \ldots \ge \sigma_s$. The notation $I$ stands for the $s \times s$ identity matrix, while $B_s$ denotes the $s$-th column of $B$ (similarly $A_j$).
Input: $s$, $A \in \mathbb{R}^{m \times n}$
$B^0 \leftarrow$ all-zeros matrix in $\mathbb{R}^{m \times s}$
for $i \in [n]$ do
    $C^i \leftarrow B^{i-1}$
    $C^i_s \leftarrow A_i$    (put $A_i$ in the last, all-zeros, column of $C^i$)
    $[U^i, \Sigma^i, V^i] \leftarrow \mathrm{SVD}(C^i)$
    $D^i \leftarrow U^i \Sigma^i$
    $\delta_i \leftarrow (\Sigma^i_{s,s})^2$
    $W^i \leftarrow \sqrt{I - \delta_i (\Sigma^i)^{-2}}$
    $B^i \leftarrow D^i W^i$
end for
Return: $B \leftarrow B^n$

Lemma 0.2. Let $B$ be the output of the above algorithm for a matrix $A$ and an integer $s$. Then:

$$\|AA^T - BB^T\| \le \|A\|_F^2/s.$$

Proof. We start by bounding $\|AA^T - BB^T\|$ from above:

$$\|AA^T - BB^T\| = \max_{\|x\|=1} \big(x^T AA^T x - x^T BB^T x\big) = \max_{\|x\|=1} \big(\|x^T A\|^2 - \|x^T B\|^2\big).$$
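The loop above translates almost line for line into code. The following is a sketch of the procedure (the function and variable names are my own), assuming $m \ge s$; it uses the identity $B^i = D^i W^i = U^i \sqrt{(\Sigma^i)^2 - \delta_i I}$ to implement the shrinking step, and the final check tests the guarantee of Lemma 0.2.

```python
import numpy as np

def lossy_svd_sketch(A, s):
    """Deterministic lossy SVD sketch: returns B in R^{m x s} with
    ||AA^T - BB^T|| <= ||A||_F^2 / s. Assumes m >= s."""
    m, n = A.shape
    B = np.zeros((m, s))
    for i in range(n):
        B[:, -1] = A[:, i]                  # C^i: new column into the zero slot
        U, sig, _ = np.linalg.svd(B, full_matrices=False)
        delta = sig[-1] ** 2                # delta_i = (Sigma^i_{s,s})^2
        # B^i = U sqrt(Sigma^2 - delta_i I); the last column becomes zero again.
        B = U * np.sqrt(np.maximum(sig**2 - delta, 0.0))
    return B

rng = np.random.default_rng(2)
A = rng.standard_normal((50, 300))
s = 20
B = lossy_svd_sketch(A, s)

err = np.linalg.norm(A @ A.T - B @ B.T, 2)
bound = np.linalg.norm(A, 'fro')**2 / s
print(err <= bound)  # True, by Lemma 0.2
```

Note that `U * v` with a length-$s$ vector `v` scales the columns of `U`, which is exactly the multiplication by the diagonal shrinking matrix.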
We open this with a simple telescoping sum, $\|x^T B\|^2 = \|x^T B^n\|^2 = \sum_{i=1}^n \big(\|x^T B^i\|^2 - \|x^T B^{i-1}\|^2\big)$, and by replacing $\|x^T A\|^2 = \sum_{i=1}^n (x^T A_i)^2$:

$$\|x^T A\|^2 - \|x^T B\|^2 = \sum_{i=1}^n \big[(x^T A_i)^2 - (\|x^T B^i\|^2 - \|x^T B^{i-1}\|^2)\big] \tag{9}$$
$$= \sum_{i=1}^n \big[\big((x^T A_i)^2 + \|x^T B^{i-1}\|^2\big) - \|x^T B^i\|^2\big] \tag{10}$$
$$= \sum_{i=1}^n \big[\|x^T C^i\|^2 - \|x^T B^i\|^2\big] \tag{11}$$
$$= \sum_{i=1}^n \big[\|x^T D^i\|^2 - \|x^T D^i W^i\|^2\big] \tag{12}$$
$$= \sum_{i=1}^n x^T \big[D^i (D^i)^T - D^i (W^i)^2 (D^i)^T\big] x \tag{13}$$
$$= \sum_{i=1}^n x^T \big[\delta_i D^i (\Sigma^i)^{-2} (D^i)^T\big] x \tag{14}$$
$$= \sum_{i=1}^n x^T \big[\delta_i U^i (U^i)^T\big] x \tag{15}$$
$$\le \sum_{i=1}^n \delta_i \|U^i (U^i)^T\| \le \sum_{i=1}^n \delta_i \tag{16}$$

For this to mean anything we have to bound the term $\sum_{i=1}^n \delta_i$. We do this by computing the Frobenius norm of $B$:

$$\|B^n\|_F^2 = \sum_{i=1}^n \big[\|B^i\|_F^2 - \|B^{i-1}\|_F^2\big] \tag{17}$$
$$= \sum_{i=1}^n \big[\|B^i\|_F^2 - \|D^i\|_F^2\big] + \big[\|D^i\|_F^2 - \|C^i\|_F^2\big] + \big[\|C^i\|_F^2 - \|B^{i-1}\|_F^2\big] \tag{18}$$

Let us deal with each term separately:

$$\|B^i\|_F^2 - \|D^i\|_F^2 = \mathrm{tr}\big(B^i (B^i)^T - D^i (D^i)^T\big) = -\mathrm{tr}\big(\delta_i U^i (U^i)^T\big) = -s\delta_i \tag{19}$$
$$\|D^i\|_F^2 - \|C^i\|_F^2 = 0 \tag{20}$$
$$\|C^i\|_F^2 - \|B^{i-1}\|_F^2 = \|A_i\|^2 \tag{21}$$

Putting these together we get

$$\|B\|_F^2 = \|B^n\|_F^2 = \|A\|_F^2 - s\sum_{i=1}^n \delta_i.$$

Since $\|B\|_F^2 \ge 0$ we conclude that $\sum_{i=1}^n \delta_i \le \|A\|_F^2/s$. Combining with the above:

$$\|BB^T - AA^T\| \le \|A\|_F^2/s.$$

Finally, we recall that to achieve the approximation guarantee $\|A - P_k^B A\| \le \sigma_{k+1} + \varepsilon\|A\|$ it is sufficient to require $\|BB^T - AA^T\| \le \varepsilon^2\|A\|^2/2$, which is obtained when

$$s \ge \frac{2r}{\varepsilon^2}.$$

This completes the claim.
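The end-to-end guarantee can also be exercised numerically. The self-contained sketch below (my own code, implementing the algorithm above) builds a matrix with decaying spectrum so that its numeric rank $r$ is small, sets $s = \lceil 2r/\varepsilon^2 \rceil$, and verifies $\|A - P_k^B A\| \le \sigma_{k+1} + \varepsilon\|A\|$.

```python
import numpy as np

def lossy_svd_sketch(A, s):
    # Deterministic lossy SVD sketch; assumes A.shape[0] >= s.
    m, n = A.shape
    B = np.zeros((m, s))
    for i in range(n):
        B[:, -1] = A[:, i]
        U, sig, _ = np.linalg.svd(B, full_matrices=False)
        B = U * np.sqrt(np.maximum(sig**2 - sig[-1]**2, 0.0))
    return B

rng = np.random.default_rng(3)
m, n, k, eps = 60, 200, 5, 0.5

# A with geometrically decaying singular values, so r = ||A||_F^2/||A||^2 is small.
U0, _ = np.linalg.qr(rng.standard_normal((m, m)))
V0, _ = np.linalg.qr(rng.standard_normal((n, m)))
A = U0 @ np.diag(0.9 ** np.arange(m)) @ V0.T

sigma = np.linalg.svd(A, compute_uv=False)
r = np.linalg.norm(A, 'fro')**2 / sigma[0]**2   # numeric rank
s = int(np.ceil(2 * r / eps**2))                # s >= 2r/eps^2

B = lossy_svd_sketch(A, s)
U, _, _ = np.linalg.svd(B, full_matrices=False)
Pk = U[:, :k] @ U[:, :k].T                      # P_k^B

err = np.linalg.norm(A - Pk @ A, 2)
print(err <= sigma[k] + eps * sigma[0])  # True: sigma_{k+1} + eps * ||A||
```

Here $s$ depends only on the numeric rank and $\varepsilon$, not on $n$, which is the point of the whole construction.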
Discussion

Note that while the deterministic lossy SVD procedure requires less space, it also requires more operations per column insertion. Namely, it computes the SVD of the sketch in every iteration. This can be somewhat improved (see [4]) but it is still far from being as efficient as column sampling. Making this algorithm faster while maintaining its approximation guarantee is an open problem. Another interesting problem is to make this algorithm take advantage of the matrix's sparsity.

References

[1] Emmanuel Candès and Benjamin Recht. Exact matrix completion via convex optimization. Commun. ACM, 55(6):111–119, June 2012.

[2] Mark Rudelson and Roman Vershynin. Sampling from large matrices: An approach through geometric functional analysis. J. ACM, 54(4), July 2007.

[3] Christos Boutsidis, Petros Drineas, and Malik Magdon-Ismail. Near-optimal column-based matrix reconstruction. In Proceedings of the 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science, FOCS '11, pages 305–314, Washington, DC, USA, 2011. IEEE Computer Society.

[4] Edo Liberty. Simple and deterministic matrix sketching. CoRR, abs/1206.0594, 2012.