CS 229r: Algorithms for Big Data Fall 2013
Prof. Jelani Nelson Lecture 20 November 7, 2013
Scribe: Yun William Yu

1 Introduction

Today we're going to go through the analysis of matrix completion. First, though, let's go through the history of prior work on this problem. Recall the setup and model:

Matrix completion setup:
Model: We want to recover M ∈ R^{n1×n2}, under the assumption that rank(M) = r, where r is small. Only some small subset of the entries (M_ij)_{(i,j)∈Ω} is revealed, where Ω ⊆ [n1]×[n2], |Ω| = m: m times we sample (i,j) uniformly at random and insert it into Ω (so Ω is a multiset).

Note that the same results hold if we choose entries without replacement, but it's easier to analyze this way. In fact, you can show that if recovery works with replacement, then recovery works without replacement, which makes sense because you'd only be seeing more information about M.

We recover M by Nuclear Norm Minimization (NNM): solve the program

  min ||X||_*  s.t.  for all (i,j) ∈ Ω, X_ij = M_ij.

[Recht, Fazel, Parrilo '10] [RFP10] was the first to give rigorous guarantees for NNM. As you'll see on the pset, you can actually solve this in polynomial time since it's an instance of what's known as a semidefinite program.

[Candès, Recht '09] [CR09] was the first paper to show provable guarantees for NNM applied to matrix completion.

There were some quantitative improvements (in the parameters) in two papers: [Candès, Tao '09] [CT10] and [Keshavan, Montanari, Oh '09] [KMO10].

Today we're going to cover an even newer analysis given in [Recht, 2011] [Rec11], which has a couple of advantages. First, it has the laxest of all the conditions. Second, it's also the simplest of all the analyses in the papers. Thus, it's really better in every way there is. The approach of [Rec11] was inspired by work in quantum tomography [GLF+10]. A more general theorem than the one proven in class today was later proven by Gross [Gross].
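To make the observation model concrete, here is a minimal numpy sketch of the setup (the sizes and the sampling budget m are illustrative choices of mine, not values from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, r, m = 50, 40, 3, 600  # illustrative sizes; m entries sampled with replacement

# rank-r ground truth: (n1 x r) times (r x n2)
M = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))

# Omega: m i.i.d. uniform index pairs, so it is a multiset (collisions allowed)
Omega = [(int(rng.integers(n1)), int(rng.integers(n2))) for _ in range(m)]

# the revealed information: M_ij for (i, j) in Omega
revealed = {(i, j): M[i, j] for (i, j) in Omega}
```

Since Ω is sampled with replacement, `revealed` can have fewer than m distinct entries, which is exactly the multiset subtlety mentioned above.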
2 Theorem Statement

We're almost ready to formally state the main theorem, but we need a couple of definitions first.

Definition 1. Let M = UΣV* be the singular value decomposition. (Note that U ∈ R^{n1×r} and V ∈ R^{n2×r}.)

Definition 2. Define the incoherence of the subspace U as µ(U) = (n1/r) · max_i ||P_U e_i||^2, where P_U is the projection onto U. Similarly, the incoherence of V is µ(V) = (n2/r) · max_i ||P_V e_i||^2, where P_V is the projection onto V.

Definition 3. µ0 := max{µ(U), µ(V)}.

Definition 4. µ1 := ||UV*||_∞ · sqrt(n1·n2/r), where ||UV*||_∞ is the largest magnitude of an entry of UV*.

Theorem 1. If m ≳ max{µ1^2, µ0} · n2 · r · log^2(n2), then with high probability M is the unique solution to the semidefinite program min ||X||_* s.t. for all (i,j) ∈ Ω, X_ij = M_ij.

Note that 1 ≤ µ0 ≤ n2/r. The way µ0 can be n2/r is if a standard basis vector appears in a column of V, and the way µ0 can get all the way down to 1 is the best-case scenario where all the entries of V are like 1/sqrt(n2) and all the entries of U are like 1/sqrt(n1), so for example if you took a Fourier matrix and cut off some of its columns. Thus, the condition on m is a good bound if the matrix has low incoherence.

One might wonder about the necessity of all the funny terms in the condition on m. Unfortunately, [Candès, Tao '09] [CT10] showed m ≳ µ0 · n2 · r · log(n2) is needed: if you want to have any decent chance of recovering M over the random choice of Ω using this SDP, then you need to sample at least that many entries. The condition isn't completely tight because of the square in the log factor and the dependence on µ1^2. However, you can show that µ1^2 ≤ µ0^2 · r.

Just like in compressed sensing, there are also some iterative algorithms to recover M, but we're not going to analyze them in class. For example, there is the SpaRSA algorithm given in [Wright, Nowak, Figueiredo '09] [WNF09] (thanks to Ben Recht for pointing this out to me).
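The incoherence quantities in Definitions 2-4 are easy to compute directly. Below is a small numpy sketch (the function name and the test matrices are mine): a flat all-ones matrix hits the best case µ0 = 1, while a single spike e_1 e_1^T hits the worst case max(n1, n2)/r.

```python
import numpy as np

def incoherences(M, r):
    """mu0 and mu1 from Definitions 2-4 (a sketch; assumes M has exact rank r)."""
    n1, n2 = M.shape
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    U, V = U[:, :r], Vt[:r].T
    # ||P_U e_i||^2 = ||U^T e_i||^2 = sum of squares of row i of U
    muU = (n1 / r) * np.max(np.sum(U**2, axis=1))
    muV = (n2 / r) * np.max(np.sum(V**2, axis=1))
    mu0 = max(muU, muV)
    mu1 = np.abs(U @ V.T).max() * np.sqrt(n1 * n2 / r)
    return mu0, mu1

# best case: a flat rank-1 matrix has mu0 = 1
mu0_flat, _ = incoherences(np.ones((8, 6)), 1)

# worst case: a single spike e_1 e_1^T has mu0 = max(n1, n2)/r = 8
spike = np.zeros((8, 6))
spike[0, 0] = 1.0
mu0_spike, _ = incoherences(spike, 1)
```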
That algorithm roughly looks as follows when one wants to minimize ||AX − M||_F^2 + µ||X||_*: pick X_0 and a step size t, and iterate (a)-(d) some number of times:
(a) Z = X_k − t·A^T(A·X_k − M)
(b) [U, diag(s), V] = svd(Z)
(c) r = max(s − µt, 0)
(d) X_{k+1} = U·diag(r)·V^T
As an aside, trace-norm minimization is actually tolerant to noise, but I'm not going to cover that.

3 Analysis

The way the analysis is going to go is that we're going to condition on lots of good events all happening, and if those good events happen, then the minimization works. The way I'm going to structure the proof is: I'll first state what all those events are, then I'll show why those events make the minimization work, and finally I'll bound the probability of those events not happening.
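Before diving into the analysis, the iteration (a)-(d) above is easy to try out. Here is a hedged numpy sketch in which A is specialized to entrywise sampling (a fixed 0/1 mask, the matrix-completion case); the step size t, the regularization µ, and the iteration count are guesses of mine, not values from [WNF09]:

```python
import numpy as np

def svt_step(Xk, M, mask, t, mu):
    # (a) gradient step on ||A X - M||_F^2; here A is the entrywise mask
    Z = Xk - t * mask * (Xk - M)
    # (b) SVD of the intermediate point
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    # (c) soft-threshold the singular values (prox step for the trace norm)
    r = np.maximum(s - mu * t, 0.0)
    # (d) recombine
    return U @ (r[:, None] * Vt)

# toy run: rank-1 truth, about half the entries observed
rng = np.random.default_rng(1)
u, v = rng.standard_normal((20, 1)), rng.standard_normal((20, 1))
M = u @ v.T
mask = (rng.random(M.shape) < 0.5).astype(float)
X = np.zeros_like(M)
for _ in range(500):
    X = svt_step(X, M, mask, t=1.0, mu=0.01)
```

After the loop, X fits the observed entries closely while the soft-thresholding keeps it near low rank.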
3.1 Background and more notation

Before I do that, I want to say one thing about the trace norm. How many people are familiar with dual norms? How many people have heard of the Hahn-Banach theorem? OK, good.

Definition 5. ⟨A, B⟩ := Tr(A*B) = Σ_{i,j} A_ij·B_ij

Claim 1. The dual of the trace norm is the operator norm:

  ||A||_* = sup_{B : ||B|| ≤ 1} ⟨A, B⟩.

This makes sense because the dual of ℓ1 for vectors is ℓ∞, and this sort of looks like that: the trace norm and the operator norm are respectively like the ℓ1 and ℓ∞ norms of the singular value vector. More rigorously, we can prove it by proving the inequality in both directions. One direction is not so hard, but the other requires the following lemma.

Lemma 1.

  ||A||_*                                               (1)
  = min_{X,Y : A=XY*} ||X||_F · ||Y||_F                 (2)
  = min_{X,Y : A=XY*} (1/2)(||X||_F^2 + ||Y||_F^2)      (3)

Proof of lemma. (2) ≤ (3): AM-GM inequality: xy ≤ (1/2)(x^2 + y^2).

(3) ≤ (1): We basically just need to exhibit X and Y that give something that is at most the trace norm. Write the SVD A = UΣV* and set X = UΣ^{1/2}, Y = VΣ^{1/2}. (In general, given f : R^+ → R^+, one defines f(A) = U·f(Σ)·V*, i.e., write the SVD of A and apply f to each diagonal entry of Σ; for positive semidefinite A the choice above is just X = Y = A^{1/2}.) You can easily check that XY* = A and that ||X||_F^2 = ||Y||_F^2 is exactly the trace norm of A.

(1) ≤ (2): Let X, Y be any matrices such that A = XY*. Then

  ||A||_* = ||XY*||_* = sup_{{a_i}, {b_i} orthonormal bases} Σ_i ⟨XY*·a_i, b_i⟩

(this can be seen to be true by letting a_i = v_i and b_i = u_i from the SVD, when we get equality)

  = sup Σ_i ⟨Y*·a_i, X*·b_i⟩
  ≤ sup Σ_i ||Y*·a_i|| · ||X*·b_i||                       (by Cauchy-Schwarz)
  ≤ sup (Σ_i ||Y*·a_i||^2)^{1/2} · (Σ_i ||X*·b_i||^2)^{1/2}   (Cauchy-Schwarz again)
  = ||X||_F · ||Y||_F

because {a_i}, {b_i} are orthonormal bases and the Frobenius norm is rotationally invariant. ∎
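Here is a quick numeric sanity check of the lemma (my own illustration, not from the lecture): the factorization X = UΣ^{1/2}, Y = VΣ^{1/2} from the proof achieves ||X||_F·||Y||_F = ||A||_*, and any other factorization of A can only give a larger product.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
trace_norm = s.sum()  # ||A||_* = sum of singular values

# the minimizing factorization from the proof of (3) <= (1)
X = U * np.sqrt(s)      # U Sigma^{1/2}
Y = Vt.T * np.sqrt(s)   # V Sigma^{1/2}
assert np.allclose(X @ Y.T, A)
assert np.isclose(np.linalg.norm(X) * np.linalg.norm(Y), trace_norm)

# any other factorization A = X' Y'^T does no better, per the lemma
G = rng.standard_normal((4, 4))
Xp, Yp = X @ G, Y @ np.linalg.inv(G).T
assert np.allclose(Xp @ Yp.T, A)
assert np.linalg.norm(Xp) * np.linalg.norm(Yp) >= trace_norm - 1e-8
```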
Proof of claim. Part 1: ||A||_* ≤ sup_{||B||=1} ⟨A, B⟩. We show this by writing A = UΣV*. Then take B = Σ_i u_i·v_i*. That will give you something on the right that is at least the trace norm.

As an aside, in general this is how dual norms are defined. Given a norm ||·||_X, the dual norm is defined by ||Z||_{X*} = sup_{||Y||_X ≤ 1} ⟨Z, Y⟩. In this case, we're proving the dual of the operator norm is the trace norm. Or, for example, the dual norm of the Schatten p-norm is the Schatten q-norm, where 1/p + 1/q = 1. As a further aside, if X is a normed space with norm ||·||, then X* is the set of all bounded linear functionals λ on X, with dual norm ||λ|| = sup_{||y|| ≤ 1} λ(y). One can then map x ∈ X into (X*)* by the evaluation map f : X → (X*)*: for λ ∈ X*, f(x)(λ) = λ(x). Then f is injective, and the norms of x and f(x) are equal by the Hahn-Banach theorem, though f need not be surjective (in the case where it is, X is called a reflexive Banach space). You can learn more on Wikipedia if you want, or take a functional analysis class.

Part 2: ||A||_* ≥ ⟨A, B⟩ for all B s.t. ||B|| = 1. We show this using the lemma. Write A = XY* s.t. ||A||_* = ||X||_F·||Y||_F (the lemma guarantees that such X and Y exist). Write B = Σ_i σ_i·a_i·b_i* with σ_i ≤ 1 for all i. Then, using a similar argument to last time,

  ⟨A, B⟩ = ⟨XY*, Σ_i σ_i·a_i·b_i*⟩
  = Σ_i σ_i·⟨Y*·b_i, X*·a_i⟩
  ≤ Σ_i ||Y*·b_i|| · ||X*·a_i||
  ≤ (Σ_i ||Y*·b_i||^2)^{1/2} · (Σ_i ||X*·a_i||^2)^{1/2}
  = ||X||_F · ||Y||_F = ||A||_*,

which concludes the proof of the claim. ∎

Recall that the set of n1×n2 matrices is itself a vector space. I'm going to decompose that vector space into a subspace T and its orthogonal complement T⊥ by defining the following projection operators:

  P_{T⊥}(Z) := (I − P_U)·Z·(I − P_V)
  P_T(Z) := Z − P_{T⊥}(Z)
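And a numeric check of the claim itself (again my own illustration): B = UV* achieves the sup, and random B normalized to operator norm 1 never beat ||A||_*.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 3))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
trace_norm = s.sum()

# Part 1: B = sum_i u_i v_i^T has operator norm 1 and <A, B> = ||A||_*
B_opt = U @ Vt
assert np.isclose(np.linalg.norm(B_opt, 2), 1.0)
assert np.isclose(np.trace(A.T @ B_opt), trace_norm)

# Part 2: <A, B> <= ||A||_* for any B with operator norm at most 1
for _ in range(100):
    B = rng.standard_normal(A.shape)
    B /= np.linalg.norm(B, 2)  # ord=2 on a matrix: largest singular value
    assert np.trace(A.T @ B) <= trace_norm + 1e-8
```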
So basically, the matrices that are in the subspace T⊥ are the matrices that can be written as a sum of rank-1 matrices a_i·b_i*, where the a_i's are orthogonal to all the u's and the b_i's are orthogonal to all the v's.

Also define R_Ω(Z) as keeping only the entries in Ω, each multiplied by its multiplicity in Ω. If you think of the operator R_Ω : R^{n1·n2} → R^{n1·n2} as a matrix, it is a diagonal matrix with the multiplicities of the entries of Ω on the diagonal.

3.2 Good events

With high probability (probability 1 − 1/poly(n2), where you can make the 1/poly(n2) factor decay as much as you want by increasing the constant in front of m), all of the following events happen:

1. ||(n1·n2/m)·P_T·R_Ω·P_T − P_T|| ≲ sqrt(µ0·r·(n1+n2)·log(n2)/m) ≤ 1/2
(this is a deviation inequality from the expectation, over the randomness coming from Ω)

2. ||((n1·n2/m)·R_Ω − I)(Z)|| ≲ sqrt(n1·n2·log(n1+n2)/m) · ||Z||_∞
(this is another deviation inequality from the expectation)

3. If Z ∈ T, then ||(n1·n2/m)·P_T·R_Ω(Z) − Z||_∞ ≲ sqrt(µ0·r·n2·log(n2)/m) · ||Z||_∞

4. ||R_Ω|| ≲ log(n2)
This one is actually really easy (also the shortest): it's just balls and bins. We've already said R_Ω is a diagonal matrix, so the operator norm is just the largest diagonal entry. Imagine we have m balls, and we're throwing them independently at random into the n1·n2 bins, namely the diagonal entries; the norm is just how loaded the maximum bin is. In particular, m < n1·n2, or else we wouldn't be doing matrix completion since we'd have the whole matrix. In general, when you throw t balls into at least t bins, the maximum load is, by the Chernoff bound, at most O(log t) with high probability. In fact, it's at most O(log t / log log t), but who cares, since that would save us an extra log log somewhere. Actually, I'm not even sure it would save us that, since there are other log's that come into play.

5. There exists Y ∈ range(R_Ω) s.t.
(5a) ||P_T(Y) − UV*||_F ≤ sqrt(r/(2·n2))
(5b) ||P_{T⊥}(Y)|| < 1/2

3.3 Recovery conditioned on good events

Now that we've stated all these things, let's show that they imply trace norm minimization actually works. We want to make sure that

  argmin_{X : R_Ω(X) = R_Ω(M)} ||X||_*

is unique and equal to M.
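The operators P_T, P_{T⊥}, and R_Ω just defined can be written down directly. Here is a small numpy sketch (the helper names are mine) verifying that P_T and P_{T⊥} really are complementary orthogonal projections:

```python
import numpy as np

def make_projectors(U, V):
    """P_T and P_Tperp on matrices, given orthonormal bases U (n1 x r), V (n2 x r)."""
    PU, PV = U @ U.T, V @ V.T
    IU, IV = np.eye(U.shape[0]), np.eye(V.shape[0])
    P_Tperp = lambda Z: (IU - PU) @ Z @ (IV - PV)
    P_T = lambda Z: Z - P_Tperp(Z)
    return P_T, P_Tperp

def R_Omega(Z, counts):
    # counts[i, j] = multiplicity of (i, j) in the multiset Omega
    return counts * Z

rng = np.random.default_rng(4)
n1, n2, r = 8, 6, 2
M = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))
U, _, Vt = np.linalg.svd(M, full_matrices=False)
P_T, P_Tperp = make_projectors(U[:, :r], Vt[:r].T)

Z = rng.standard_normal((n1, n2))
assert np.allclose(P_T(Z) + P_Tperp(Z), Z)           # they sum to the identity
assert np.isclose(np.sum(P_T(Z) * P_Tperp(Z)), 0.0)  # T and Tperp are orthogonal
assert np.allclose(P_T(P_T(Z)), P_T(Z))              # idempotent
```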
Let Z ∈ ker(R_Ω), Z ≠ 0; we want to show ||M + Z||_* > ||M||_*. First we want to argue that ||P_T(Z)||_F cannot be big.

Lemma 2. ||P_T(Z)||_F ≤ sqrt(n2/(2r)) · ||P_{T⊥}(Z)||_F.

Proof. We have

  0 = ||R_Ω(Z)||_F ≥ ||R_Ω(P_T(Z))||_F − ||R_Ω(P_{T⊥}(Z))||_F.

Also

  ||R_Ω(P_T(Z))||_F^2 = ⟨R_Ω·P_T·Z, R_Ω·P_T·Z⟩
  ≥ ⟨P_T·Z, R_Ω·P_T·Z⟩    (since R_Ω^2 ⪰ R_Ω: the diagonal entries are nonnegative integers d, and d^2 ≥ d)
  = ⟨P_T·Z, P_T·R_Ω·P_T·P_T·Z⟩
  = (m/(n1·n2))·||P_T·Z||_F^2 + ⟨P_T·Z, (P_T·R_Ω·P_T − (m/(n1·n2))·P_T)·P_T·Z⟩
  ≥ (m/(2·n1·n2))·||P_T·Z||_F^2    (by good event (1))

Also we have

  ||R_Ω(P_{T⊥}(Z))||_F^2 ≤ ||R_Ω||^2·||P_{T⊥}(Z)||_F^2 ≲ log^2(n2)·||P_{T⊥}(Z)||_F^2    (by good event (4))

To summarize: combining all the inequalities together, and then making use of our choice of m,

  sqrt(m/(2·n1·n2))·||P_T(Z)||_F ≲ log(n2)·||P_{T⊥}(Z)||_F  ⟹  ||P_T(Z)||_F ≤ sqrt(n2/(2r))·||P_{T⊥}(Z)||_F. ∎

Now pick U⊥, V⊥ s.t. ⟨U⊥·V⊥*, P_{T⊥}(Z)⟩ = ||P_{T⊥}(Z)||_* and s.t. [U, U⊥], [V, V⊥] are orthogonal matrices. We know from Claim 1 that the trace norm is exactly the sup, over all B matrices with ||B|| ≤ 1, of the inner product; the B matrix achieving the sup can be taken to have all singular values equal to 1, and since P_{T⊥}(Z) lies in the orthogonal space T⊥, B can also be taken in T⊥, i.e., B = U⊥·V⊥*.

Now we have a long chain of inequalities to show that the trace norm of M + Z is greater than the trace norm of M:
  ||M + Z||_* ≥ ⟨UV* + U⊥·V⊥*, M + Z⟩    by Claim 1
  = ||M||_* + ⟨UV* + U⊥·V⊥*, Z⟩    since ⟨UV*, M⟩ = ||M||_* and M ⊥ U⊥·V⊥*
  = ||M||_* + ⟨UV* + U⊥·V⊥* − Y, Z⟩    since Z ∈ ker(R_Ω) and Y ∈ range(R_Ω)
  = ||M||_* + ⟨UV* − P_T(Y), P_T(Z)⟩ + ⟨U⊥·V⊥* − P_{T⊥}(Y), P_{T⊥}(Z)⟩    decomposing into T and T⊥
  ≥ ||M||_* − ||UV* − P_T(Y)||_F·||P_T(Z)||_F    (since ⟨x, y⟩ ≤ ||x||_2·||y||_2)
    + ||P_{T⊥}(Z)||_*    (by our choice of U⊥·V⊥*)
    − ||P_{T⊥}(Y)||·||P_{T⊥}(Z)||_*    (norm duality inequality)

But note that the trace norm always dominates the Frobenius norm, so ||P_{T⊥}(Z)||_* ≥ ||P_{T⊥}(Z)||_F. We want to ensure that this positive term is strictly bigger than the two negative terms. By condition (5b), we ensure that ||P_{T⊥}(Y)||·||P_{T⊥}(Z)||_* < (1/2)·||P_{T⊥}(Z)||_*. By condition (5a) and Lemma 2, we can also ensure that

  ||UV* − P_T(Y)||_F·||P_T(Z)||_F ≤ sqrt(r/(2·n2))·sqrt(n2/(2r))·||P_{T⊥}(Z)||_F = (1/2)·||P_{T⊥}(Z)||_F.

Thus, back to the main equation:

  ||M + Z||_* > ||M||_* − (1/2)·||P_{T⊥}(Z)||_F + (1/2)·||P_{T⊥}(Z)||_* ≥ ||M||_*.

(The inequality is strict because Z ≠ 0: if P_{T⊥}(Z) were 0, then Lemma 2 would force P_T(Z) = 0 and hence Z = 0.) Hence, when all of the good conditions hold, minimizing the trace norm recovers M.

3.4 Probability of good events holding

Unfortunately, we do not have enough time to go through the full analysis. We might overflow some of this into the next lecture, but for now, let's introduce the noncommutative Bernstein inequality we use to get conditions (1) and (2). As an aside, I tend to call all of these inequalities Chernoff inequalities, since they're all quite similar, but this one really should have a different name, because the proof of this matrix Bernstein inequality is very different from the proof of the ordinary Chernoff bound.

Theorem 2 (Noncommutative Bernstein inequality). Suppose X_1, ..., X_N are independent random matrices of the same dimensions with E X_i = 0, s.t.
1. ||X_i|| ≤ M for all i, with probability 1
2. σ_i^2 = max{||E X_i*·X_i||, ||E X_i·X_i*||}
Then

  P(||Σ_{i=1}^N X_i|| > λ) ≤ (n1 + n2) · max{ exp(−C·λ^2 / Σ_i σ_i^2), exp(−C·λ / M) }.

As mentioned, conditions (2) and (3) were deviation inequalities from expectation, so we can get them using Bernstein on the random matrices over the distribution of Ω (subtracting out the expectation to make things mean-zero where appropriate). As an additional aside, conditions (4), (5), and (1) were used in the proofs above; however, we only need conditions (2) and (3) to show (5).
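As a toy illustration of the regime Theorem 2 addresses (my own experiment, not from the lecture): sum N independent zero-mean, one-sparse sign matrices, the kind of summand that shows up when analyzing R_Ω. The operator norm of the sum lands far below the trivial bound N·max_i ||X_i|| = N, consistent with a sqrt(Σ_i σ_i^2)-type Bernstein tail.

```python
import numpy as np

rng = np.random.default_rng(5)
n, N, trials = 30, 400, 20
norms = []
for _ in range(trials):
    S = np.zeros((n, n))
    for _ in range(N):
        # X_i: a single +/-1 at a uniformly random position; E X_i = 0, ||X_i|| = 1,
        # and sigma_i^2 = ||E X_i^T X_i|| = 1/n, so sum_i sigma_i^2 = N/n
        S[rng.integers(n), rng.integers(n)] += rng.choice([-1.0, 1.0])
    norms.append(np.linalg.norm(S, 2))
avg = float(np.mean(norms))  # typically on the order of sqrt((N/n) * log n), far below N
```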
Next time, if we have time, we might say something about proving (5).
4 Concluding remarks

Why would you think of trace norm minimization as solving matrix completion? Analogously, why would you use ℓ1 minimization for compressed sensing? These two questions are very similar: rank is the support size of the singular value vector, and the trace norm is the ℓ1 norm of the singular value vector, so the two are very analogous. ℓ1 minimization seems like a natural choice, since it is the closest convex function to support size among all the ℓp norms (and being convex allows us to solve the program in polynomial time).

References

[CR09] Emmanuel J. Candès and Benjamin Recht, Exact matrix completion via convex optimization, Foundations of Computational Mathematics 9 (2009), no. 6, 717-772.

[CT10] Emmanuel J. Candès and Terence Tao, The power of convex relaxation: near-optimal matrix completion, IEEE Transactions on Information Theory 56 (2010), no. 5, 2053-2080.

[Gross] David Gross, Recovering low-rank matrices from few coefficients in any basis, IEEE Transactions on Information Theory 57 (2011), 1548-1566.

[GLF+10] David Gross, Yi-Kai Liu, Steven T. Flammia, Stephen Becker, and Jens Eisert, Quantum state tomography via compressed sensing, Physical Review Letters 105 (2010), no. 15, 150401.

[KMO10] Raghunandan H. Keshavan, Andrea Montanari, and Sewoong Oh, Matrix completion from noisy entries, Journal of Machine Learning Research 11 (2010), 2057-2078.

[Rec11] Benjamin Recht, A simpler approach to matrix completion, Journal of Machine Learning Research 12 (2011), 3413-3430.

[RFP10] Benjamin Recht, Maryam Fazel, and Pablo A. Parrilo, Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization, SIAM Review 52 (2010), no. 3, 471-501.

[WNF09] Stephen J. Wright, Robert D. Nowak, and Mário A. T. Figueiredo, Sparse reconstruction by separable approximation, IEEE Transactions on Signal Processing 57 (2009), no. 7, 2479-2493.