Low correlation tensor decomposition via entropy maximization

CS369H: Hierarchies of Integer Programming Relaxations                Spring 2017

Lecture and notes by Sam Hopkins                                Scribe: James Hong

Overview

These notes are adapted from Sam Hopkins's notes for his talk on May 19, 2017 (Lecture 7 in CS369H).

1 Introduction

Tensor decomposition has recently become an invaluable algorithmic primitive. It has seen much use in new algorithms with provable guarantees for fundamental statistics and machine learning problems. In these settings, some low-rank k-tensor A = ∑_i a_i^{⊗k} which we would like to decompose into components a_1, ..., a_r ∈ R^n is often not directly accessible. This could happen for many reasons; a common one is that A = E X^{⊗k} for some random variable X, and estimating A to high precision may require too many independent samples from X.

1.1 Contexts

Suppose x ∈ R^n is a mixture of r Gaussians with centers a_1, ..., a_r and we are allowed to see x_1, ..., x_m, IID samples of X. Can we recover the centers, given enough samples? Specifically, the samples are drawn as follows: (1) sample i ∼ [r], (2) output x ∼ N(a_i, I). Some possible ways we could attempt this problem based on moments:

1. E x ≈ (1/m) ∑_{j=1}^m x_j. This does not work: the estimated quantity is (1/r) ∑_i a_i, which is for example 0 when ∑_i a_i = 0.

2. E xx^⊤ ≈ (1/m) ∑_{j=1}^m x_j x_j^⊤. This also does not work. Consider what happens if the centers are the coordinate vectors: then E xx^⊤ is proportional to the identity. Moreover, the same thing occurs when a_1, ..., a_r are a rotation of the coordinate basis, so the result is not unique.

3. E x^{⊗3} ≈ (1/m) ∑_{j=1}^m x_j^{⊗3} = (1/m) ∑_{j=1}^m x_j ⊗ x_j ⊗ x_j. Instead, we should estimate E x^{⊗3}, from which the centers can be recovered uniquely. We must ensure that the tensor decomposition algorithm is robust to estimation errors: a more robust algorithm allows for lower sample complexity.
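To make the moment-based viewpoint concrete, here is a minimal numpy sketch (not from the notes) of the plug-in moment estimates for such a mixture; the parameters n, r, m and the helper name empirical_moments are illustrative choices.

```python
import numpy as np

def empirical_moments(samples):
    """Plug-in estimates of E x, E xx^T, and E x^{(x)3} from IID samples (one per row)."""
    m, n = samples.shape
    first = samples.mean(axis=0)
    second = samples.T @ samples / m
    third = np.einsum('si,sj,sk->ijk', samples, samples, samples) / m
    return first, second, third

# Toy mixture: r orthonormal centers in R^n, identity-covariance components.
rng = np.random.default_rng(0)
n, r, m = 20, 5, 20000
centers = np.linalg.qr(rng.standard_normal((n, r)))[0].T   # rows are orthonormal centers a_i
labels = rng.integers(r, size=m)
samples = centers[labels] + rng.standard_normal((m, n))

first, second, third = empirical_moments(samples)
# `first` and `second` do not pin down the centers (items 1 and 2 above), while `third`
# approximates E x^{(x)3}; the decomposition algorithms below operate on such a tensor.
```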

In this lecture we will dig in to algorithms for robust tensor decomposition, that is, how to accomplish tensor decomposition efficiently in the presence of errors.

2 Jennrich's algorithm for orthogonal tensor decomposition

We will focus on orthogonal tensor decomposition, where the components a_1, ..., a_r ∈ R^n of the tensor A = ∑_i a_i^{⊗3} to be decomposed are orthogonal unit vectors. Tensor decomposition is already both algorithmically nontrivial and quite useful in this setting: the orthogonal setting is good enough to give the best known algorithms for Gaussian mixtures, some kinds of dictionary learning, and the stochastic blockmodel.

The algorithm

Input: A = ∑_i a_i^{⊗3} for orthogonal unit vectors a_1, ..., a_r ∈ R^n.

Algorithm: sample g ∼ N(0, I) and compute the contraction M = ∑_i ⟨g, a_i⟩ a_i a_i^⊤. Output the top eigenvector of M.

Analysis: clearly the top eigenvector is a_i for i = argmax_i ⟨a_i, g⟩. By symmetry, each vector a_i is equally likely to be the output of the algorithm, so running the algorithm n log n times recovers all the vectors.

2.1 Robustness to 1/poly(n) errors

Jennrich's algorithm is already robust to a small amount of error in the input.

Input: B = A + C, where A is as above and every entry of C has magnitude at most n^{-10}.

Algorithm: same as above.

Analysis: Now the matrix M takes the form

M = ∑_i ⟨a_i, g⟩ a_i a_i^⊤ + C′,

where the entries of C′ are of the form C′_{ij} = ∑_{k=1}^n g_k C_{ijk}. It is elementary to show that for every a_i,

Pr{ ⟨a_i, g⟩ ≥ 200 max_{j ≠ i} ⟨a_j, g⟩ and ⟨a_i, g⟩ ≥ 200 ‖C′‖_op } ≥ n^{-O(1)}.

Suppose this occurs for a_1. Then there is a number c such that M = c a_1 a_1^⊤ + M′, where ‖M′‖ ≤ c/100. Thus, the top eigenvalue of M is at least 99c/100, and so ⟨a_1, v⟩ ≥ 0.9, where v is the top eigenvector of M. It follows that for every i, with probability at least n^{-O(1)}, the algorithm outputs b such that ⟨a_i, b⟩ ≥ 0.9. Intuitively, this means that we can still recover each a_i w.h.p. and every a_i appears. To turn this into an algorithm to recover a_1, ..., a_r to accuracy 0.9 we need a way to check that this n^{-O(1)}-probability event has occurred, but we will ignore this issue for this lecture. (Same as in the dictionary learning algorithm.)

Exercise 7.1. Let v be the max eigenvector of M. Show that w.h.p. max_i ⟨a_i, v⟩^2 ≥ 1 − o(1) and every a_i is still maximal w.p. n^{-O(1)}.

2.2 Nonrobustness to larger errors

What about larger errors? If we think of A = ∑_i a_i^{⊗3} as an n^3-dimensional vector, its 2-norm is ‖A‖_2 = √r. Previously the error tensor C we considered had ‖C‖_2 ≤ n^{-5} ‖A‖_2. Could we tolerate errors as large as ‖C‖_2 ≈ ‖A‖_2?

Example: This example will show that Jennrich's algorithm will not work out of the box in the presence of such large errors. Suppose r = n/2 and A = ∑_i a_i^{⊗3} for a_1, ..., a_r orthogonal unit vectors in R^n, as usual. Let C = c^{⊗3}, where c is a vector of norm 0.01 n^{1/10}. Then the matrix M computed by Jennrich's algorithm has the form

M = ∑_i ⟨a_i, g⟩ a_i a_i^⊤ + ⟨g, c⟩ cc^⊤.

Because ‖c‖ ≫ 1, the probability that max_i ⟨a_i, g⟩ > ⟨g, c⟩ ‖c‖^2 is exponentially small; the top eigenvector of M will instead be very close to c. However, ‖A‖_2 = √r ≈ √n, while ‖C‖_2 = ‖c‖^3 ≈ n^{0.3}. So measured in the 2-norm, the error C is small compared to A.

We might even consider error tensors C which are large in metrics other than the 2-norm. For instance, we will define the injective tensor norm below. Note that if T = ∑_i a_i^{⊗3}, then

‖T‖_inj = max_{‖x‖=1} ⟨T, x^{⊗3}⟩ = max_{‖x‖=1} ∑_i ⟨a_i, x⟩^3.

Exercise 7.2. Suppose we instead defined T = ∑_i a_i^{⊗3} + R where the entries of R are IID Gaussian. How large can R be for the problem to still be well defined? Explain why Jennrich's algorithm will fail.
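The contraction step of Jennrich's algorithm is easy to experiment with. Below is a minimal numpy sketch (not from the notes); the dimensions and helper names are illustrative, and the closing comment points at the Section 2.2 failure mode.

```python
import numpy as np

def jennrich_contract(T, g):
    """Contract a symmetric 3-tensor T along its first mode with g: M = sum_k g_k T[k,:,:]."""
    return np.einsum('k,kij->ij', g, T)

def jennrich_top_component(T, rng):
    """One run of the random-contraction step: return the top eigenvector of M_g."""
    n = T.shape[0]
    g = rng.standard_normal(n)
    M = jennrich_contract(T, g)
    # symmetric eigendecomposition; take the eigenvector of largest |eigenvalue|
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, np.argmax(np.abs(vals))]

rng = np.random.default_rng(1)
n, r = 30, 10
a = np.linalg.qr(rng.standard_normal((n, r)))[0].T      # orthonormal components a_1..a_r (rows)
A = np.einsum('ri,rj,rk->ijk', a, a, a)                 # A = sum_i a_i^{(x)3}

b = jennrich_top_component(A, rng)
print(np.max(np.abs(a @ b)))   # close to 1: b is aligned with some a_i

# With a large planted spike c^{(x)3} added to A (Section 2.2), the same contraction is
# dominated by c and the output correlates with c instead of any a_i: the nonrobustness.
```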

3 The high spectral entropy tensor decomposition algorithm

Ma, Shi, and Steurer introduced a method to get around the problem that Jennrich's algorithm has in the presence of larger errors.

3.1 Aside: SoS norms

For a k-tensor T and d ∈ N, the sos_d norm of T is

‖T‖_{sos_d} = max { Ẽ ⟨x^{⊗k}, T⟩ : Ẽ has degree d and satisfies {‖x‖² = 1} }.

As d → ∞, the sos_d norm becomes the injective tensor norm. But we will be interested in the constant-d setting, which as usual will correspond to polynomial time algorithms.

By the usual duality, ‖T‖_{sos_d} is also the best bound certifiable on the polynomial ⟨x^{⊗k}, T⟩ by degree-d SoS proofs. That is, if c = ‖T‖_{sos_d}, there is an SoS proof

c − ⟨T, x^{⊗k}⟩ + q(x)(‖x‖² − 1) ≥ 0

for some q of degree at most d. (And this is not true for any c′ < c.)

Exercise 7.3. Show that ‖T‖_{sos_k} is a norm.

Exercise 7.4. Show that the 2-norm of a k-tensor (that is, its 2-norm as a large vector in R^{n^k}) is an upper bound on its sos_d norm, for any d ≥ k.

Exercise 7.5. Show that if k is even, the norm ‖T‖_op given by unfolding T to an n^{k/2} × n^{k/2} matrix and measuring its spectral norm satisfies ‖T‖_{sos_k} ≤ ‖T‖_op.

Exercise 7.6. What is the analogue of the sos norms for matrices? Prove that they collapse to ‖M‖_inj.

One note here is that computing ‖T‖_inj is intractable. For matrices, we can use the power method: for the problem

max_{‖x‖=1} ⟨M, xx^⊤⟩ = max_{‖x‖=1} x^⊤ M x,

we can sample x ∼ N(0, I) and repeatedly apply M; M^ℓ x converges to the max eigenvector (a short code sketch of power iteration appears below). We can apply a similar method for orthogonal tensors.

3.2 Aside: decomposing tensors is the same thing as rounding moments

As usual, consider the goal of recovering a_1, ..., a_r, unit vectors in R^n, from a tensor T = ∑_i a_i^{⊗3} + C. From now on, instead of applying Jennrich-like algorithms directly to input tensors, we will think of algorithms which work in two phases:

1. Solve a convex relaxation formed from the input tensor T to find moments of a (pseudo)distribution which is correlated with the vectors a_1, ..., a_r.

2. Round a moment tensor (usually the third or fourth moments) of the (pseudo)distribution to output vectors b_1, ..., b_r correlated with a_1, ..., a_r.
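Returning to the power-method remark in Section 3.1, here is a minimal numpy sketch of power iteration for a symmetric matrix; it is not from the notes, and the iteration count is an arbitrary illustrative choice.

```python
import numpy as np

def power_method(M, iters=200, rng=None):
    """Power iteration for a symmetric matrix M: approximates the eigenvector of
    largest-magnitude eigenvalue (for PSD M this solves max_{||x||=1} x^T M x)."""
    rng = rng or np.random.default_rng()
    x = rng.standard_normal(M.shape[0])
    for _ in range(iters):
        x = M @ x
        x /= np.linalg.norm(x)
    return x, x @ M @ x    # approximate eigenvector and its Rayleigh quotient
```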

Let us return to our first example of zero-error orthogonal tensor decomposition with input A = ∑_i a_i^{⊗3}. Rescaling, the tensor (1/r) A is the third moment tensor of the finitely-supported distribution µ on the unit sphere which uniformly chooses one of the vectors a_1, ..., a_r. That is, E_{x∼µ} x^{⊗3} = (1/r) A. Applying Jennrich's algorithm to this tensor (via the matrix M = E_{x∼µ} ⟨x, g⟩ xx^⊤) was enough to recover the vectors a_1, ..., a_r. Here we did not even have to solve a convex relaxation to obtain a good moment tensor E x^{⊗3}.

3.3 Orthogonal tensor decomposition with SoS-bounded errors

Theorem 7.1 (Ma-Shi-Steurer (weakened parameters for easier proof)). There is an n^{O(d)}-time algorithm with the following guarantees. Let a_1, ..., a_r ∈ R^n be orthonormal and let A = ∑_i a_i^{⊗3}. Let T = A + C where ‖C‖_{sos_d} ≤ o(1). The algorithm takes input T and outputs a (randomized) unit vector b ∈ R^n such that for every i ≤ r,

Pr{ ⟨a_i, b⟩ ≥ 1 − o(1) } ≥ n^{-O(1)}.

The first ingredient in the proof uses the SoS algorithm to find a pseudodistribution whose moments are correlated with those of the uniform distribution over a_1, ..., a_r.

Claim 7.1. In the setting of the above theorem, if Ẽ of degree d solves

argmax { Ẽ ⟨T, x^{⊗3}⟩ : Ẽ satisfies {‖x‖² = 1} },

then ∑_i Ẽ ⟨a_i, x⟩^3 ≥ 1 − o(1).

Proof. Let µ be the uniform distribution on a_1, ..., a_r. On the one hand, the maximum value of this optimization problem is at least

E_{x∼µ} [ ∑_i ⟨x, a_i⟩^3 + ⟨C, x^{⊗3}⟩ ] ≥ 1 − o(1),

where we have used the sos_d-boundedness of C. On the other hand, any Ẽ which achieves objective value δ must satisfy ∑_i Ẽ ⟨x, a_i⟩^3 ≥ δ − o(1) by similar reasoning. All together, the optimizer satisfies ∑_i Ẽ ⟨x, a_i⟩^3 ≥ 1 − o(1).

It will be technically convenient also to assume that Ẽ's fourth moments are correlated with the fourth moments of the uniform distribution on a_1, ..., a_r. This is allowed, because if ∑_i Ẽ ⟨a_i, x⟩^3 ≥ 1 − o(1), then also

1 − o(1) ≤ ∑_i Ẽ ⟨a_i, x⟩^3 ≤ (∑_i Ẽ ⟨a_i, x⟩^2)^{1/2} (∑_i Ẽ ⟨a_i, x⟩^4)^{1/2} ≤ (∑_i Ẽ ⟨a_i, x⟩^4)^{1/2}.

Exercise 7.7. Show that ∑_i Ẽ ⟨a_i, x⟩^2 ≤ 1.

Thus we can assume access to a pseudodistribution with ∑_i Ẽ ⟨x, a_i⟩^4 ≥ 1 − o(1). We are hoping that Ẽ's moments look enough like those of µ that we can extract the a_i's from Ẽ using Jennrich's algorithm. Unfortunately, knowing only that ∑_i Ẽ ⟨x, a_i⟩^4 ≥ 1 − o(1) is not enough.

3.4 High entropy saves the day

The key observation of Ma, Shi, and Steurer is that a distribution (or a pseudodistribution) on the unit sphere which is correlated with A and has high entropy (in a sense we will momentarily make precise) is enough like the uniform distribution on a_1, ..., a_r that it can be rounded using Jennrich's algorithm. This should make sense in light of the preceding exercise. The counterexample ν (described in the hint) places all of its probability, rather than 1/r, on a single vector a_i, a very low entropy thing to do! If we can force our pseudodistribution not to do something like this, we can remove spurious vectors appearing in the spectrum of the matrices in Jennrich's algorithm.

The MSS algorithm is as follows:

1. Solve

   argmax Ẽ ⟨T, x^{⊗3}⟩ such that
     Ẽ has degree 6 and satisfies {‖x‖² = 1}       (1)
     ‖Ẽ xx^⊤‖_op ≤ 1/r                             (2)
     ‖Ẽ (x ⊗ x)(x ⊗ x)^⊤‖_op ≤ 1/r                 (3)

2. Apply Jennrich's algorithm to Ẽ x^{⊗4}.

Exercise 7.8. Show that the above is a convex program.

We require that our pseudodistribution's moment matrices do not have large eigenvalues. Notice that if µ is the uniform distribution over orthonormal vectors a_1, ..., a_r, then ‖E_{x∼µ} xx^⊤‖ = 1/r.

Claim 7.2. Let a_1, ..., a_r ∈ R^n be orthonormal. If Ẽ is a degree-4 pseudodistribution satisfying {‖x‖² = 1} and ‖Ẽ xx^⊤‖_op, ‖Ẽ (x ⊗ x)(x ⊗ x)^⊤‖_op ≤ 1/r, with ∑_i Ẽ ⟨a_i, x⟩^4 ≥ 1 − o(1), then for all but an o(1)-fraction of a_1, ..., a_r,

Ẽ ⟨a_i, x⟩^4 ≥ (1 − o(1))/r.

Proof. Suppose to the contrary that a δ = Ω(1)-fraction of a_1, ..., a_r have Ẽ ⟨a_i, x⟩^4 ≤ (1 − δ)/r. Then there is some a_i with Ẽ ⟨a_i, x⟩^4 > 1/r, by averaging. But for any unit vector a,

Ẽ ⟨a, x⟩^4 ≤ ‖Ẽ (x ⊗ x)(x ⊗ x)^⊤‖_op ≤ 1/r,

a contradiction.
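As a quick sanity check (not part of the notes), the following numpy snippet verifies numerically that the uniform distribution µ over orthonormal vectors a_1, ..., a_r meets the two operator-norm constraints of the MSS program with value exactly 1/r; the dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 12, 4
a = np.linalg.qr(rng.standard_normal((n, r)))[0].T        # rows: orthonormal a_1..a_r

# Moments of the uniform distribution mu over {a_1, ..., a_r}.
M2 = sum(np.outer(ai, ai) for ai in a) / r                 # E_{x~mu} x x^T
X = np.array([np.outer(ai, ai).ravel() for ai in a])       # rows: a_i (x) a_i, flattened
M4 = X.T @ X / r                                           # E_{x~mu} (x(x)x)(x(x)x)^T

print(np.linalg.norm(M2, 2), np.linalg.norm(M4, 2))        # both equal 1/r = 0.25
```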

Next we show how to exploit the constraints ‖Ẽ xx^⊤‖, ‖Ẽ (x ⊗ x)(x ⊗ x)^⊤‖ ≤ 1/r to round a pseudodistribution Ẽ to produce estimates of the vectors a_1, ..., a_r.

Lemma 7.1. Let a ∈ R^n be a unit vector and let Ẽ be a degree-6 pseudodistribution satisfying {‖x‖² = 1} and ‖Ẽ xx^⊤‖_op, ‖Ẽ (x ⊗ x)(x ⊗ x)^⊤‖_op, ‖Ẽ (x^{⊗3})(x^{⊗3})^⊤‖_op ≤ 1/r. Suppose Ẽ ⟨x, a⟩^4 ≥ (1 − o(1))/r. Then with probability n^{-O(1)}, the top eigenvector v of the matrix M_g := Ẽ ⟨x ⊗ x, g⟩ xx^⊤ for g ∼ N(0, Id) satisfies ⟨v, a⟩ ≥ 0.99.

Together with the preceding claim this is enough to prove a (slightly weakened) version of the theorem. To prove Lemma 7.1 will require two claims.

Claim 7.3. Let g ∼ N(0, Σ) for some Σ ⪯ Id. Then

E_g ‖Ẽ ⟨x ⊗ x, g⟩ xx^⊤‖ ≤ O(log n)^{1/2}/r.

Proof sketch. We prove the case Σ = Id, from which general Σ ⪯ Id can be derived. In this case,

Ẽ ⟨x ⊗ x, g⟩ xx^⊤ = ∑_{i,j=1}^n g_{ij} Ẽ x_i x_j xx^⊤,

where the g_{ij} ∼ N(0, 1) are independent. Let M_{ij} = Ẽ x_i x_j xx^⊤. By standard matrix concentration bounds, the expected spectral norm of this matrix is at most

O(log n)^{1/2} ‖∑_{i,j=1}^n M_{ij} M_{ij}^⊤‖^{1/2}.

It is an exercise to show that our assumptions on spectral norms of moments of Ẽ imply ‖∑_{i,j=1}^n M_{ij} M_{ij}^⊤‖^{1/2} ≤ 1/r. The claim follows.

Exercise 7.9. Show that ‖∑_{i,j=1}^n M_{ij} M_{ij}^⊤‖^{1/2} ≤ 1/r.

Claim 7.4. The matrix Ẽ ⟨x, a⟩^2 xx^⊤ can be expressed as

Ẽ ⟨x, a⟩^2 xx^⊤ = (1/r) aa^⊤ + E,

where ‖E‖_op ≤ o(1/r).

Proof sketch. To save on notation, without loss of generality suppose that a = e_1. Consider the submatrix of Ẽ x_1^2 xx^⊤ given by rows and columns 2, ..., n; we will show this submatrix has small spectral norm. We assumed that Ẽ x_1^4 ≥ (1 − o(1))/r, but at the same time Ẽ x_1^2 ‖x‖^2 = Ẽ x_1^2 ≤ 1/r

by our eigenvalue bounds on Ẽ xx^⊤. So

Ẽ x_1^2 ∑_{i=2}^n x_i^2 = Ẽ x_1^2 − Ẽ x_1^4 ≤ o(1/r).

Let v ∈ R^n be a unit vector orthogonal to a. Then

Ẽ x_1^2 ⟨x, v⟩^2 ≤ Ẽ x_1^2 ‖Π x‖^2 ≤ o(1/r),

where Π is the projector to the last n − 1 coordinates. Since Ẽ x_1^2 xx^⊤ ⪰ 0, this implies that ‖Ẽ x_1^2 xx^⊤ − e_1 e_1^⊤/r‖ ≤ o(1/r).

Proof sketch of Lemma 7.1. We sample the vector g as g = ξ (a ⊗ a) + g′, where g′ is a unit-variance multivariate Gaussian in the subspace orthogonal to a ⊗ a, and ξ is a unit-variance univariate Gaussian. Furthermore, ξ and g′ are independent. So we can write M_g as

M_g = ξ Ẽ ⟨x, a⟩^2 xx^⊤ + Ẽ ⟨x ⊗ x, g′⟩ xx^⊤.

By Markov's inequality, our claims above, and Gaussian anti-concentration, with probability n^{-O(1)} we can write M_g = ξ aa^⊤/r + E, where ‖E‖ ≤ 0.001 ξ/r and ξ > 0. The lemma follows.

4 What if the errors are not bounded in SoS norm?

Many tensors do not have errors bounded in SoS norm but should nonetheless be easy to decompose. For example, consider the tensor T = ∑_i a_i^{⊗3} + c^{⊗3}, where as usual the a_1, ..., a_r are orthonormal but c has norm 100. The tensor c^{⊗3} does not have SoS norm at most 1, but at least intuitively this should not present a real difficulty in decomposing this tensor. However, the solution to max_Ẽ Ẽ ⟨T, x^{⊗3}⟩ will put all its weight on c, so the resulting Ẽ will (probably) not contain any information about a_1, ..., a_r.

There are likely many kinds of errors not bounded by 1 in SoS norm but which do not present a problem for tensor decomposition. Hopkins and Steurer study the setting in which the input tensor is correlated in the Euclidean sense with the target orthogonal tensor. More formally, the goal is to decompose an orthogonal tensor A = ∑_i a_i^{⊗3}, and the input is a tensor T such that

⟨T, A⟩ ≥ δ ‖T‖ ‖A‖,   δ = Ω(1).

By standard linear algebra, up to scaling we can think of T = A + B where ⟨A, B⟩ = 0 and ‖B‖ = O(‖A‖). Note that the condition ⟨A, B⟩ = 0 cannot be dropped: if T = A + B and we do not require ⟨A, B⟩ = 0, then setting B = −A would destroy all the information about A in the input T.

Even assuming ⟨A, B⟩ = 0, it is possible in this setting that not all the vectors a_1, ..., a_r can be recovered. For example, if B = a_1^{⊗3} − a_2^{⊗3}, then ⟨A, B⟩ = 0 but A + B contains no information about a_2 (see the numerical check below). We will have to set our sights on recovering just some of the vectors.
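A quick numerical check (not from the notes) of this example: with B = a_1^{⊗3} − a_2^{⊗3}, the tensors A and B are orthogonal, yet A + B retains no component along a_2.

```python
import numpy as np

rng = np.random.default_rng(3)
n, r = 10, 4
a = np.linalg.qr(rng.standard_normal((n, r)))[0].T     # orthonormal a_1..a_r (rows)

cube = lambda v: np.einsum('i,j,k->ijk', v, v, v)       # v^{(x)3}
A = sum(cube(ai) for ai in a)
B = cube(a[0]) - cube(a[1])

print(np.sum(A * B))                                    # <A, B> = 0 (up to roundoff)
T = A + B
# Contracting T twice with a_2 gives the zero vector: T carries no trace of a_2.
print(np.linalg.norm(np.einsum('ijk,j,k->i', T, a[1], a[1])))
```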

In light of the lemma on rounding pseudodistributions Ẽ having ∑_i Ẽ ⟨x, a_i⟩^3 ≥ δ, it would be enough to show how to take input T and produce such a pseudodistribution. For this we have the following lemma.

Lemma 7.2 (Hopkins-Steurer). Let T satisfy

⟨T, A⟩ ≥ δ ‖T‖ ‖A‖,   δ = Ω(1).

The solution to the following convex program satisfies ∑_i Ẽ ⟨a_i, x⟩^3 ≥ poly(δ):

min ‖Ẽ x^{⊗3}‖ such that                                  (4)
  Ẽ has degree 4 and satisfies {‖x‖² = 1}                  (5)
  ⟨Ẽ x^{⊗3}, T⟩ ≥ δ ‖T‖ / √r                               (6)
  ‖Ẽ xx^⊤‖_op ≤ 1/r                                        (7)
  ‖Ẽ (x ⊗ x)(x ⊗ x)^⊤‖_op ≤ 1/r                            (8)

Before we prove the lemma, how should we interpret this convex program? The objective function may be unfamiliar, but we can obtain some good intuition if we think about what ‖E_{x∼µ} x^{⊗3}‖ means for µ a distribution supported on orthonormal vectors a_1, ..., a_r (but not necessarily the uniform distribution on those vectors). In this case,

‖E_{x∼µ} x^{⊗3}‖² = ∑_{i,j} µ(i) µ(j) ⟨a_i, a_j⟩^3 = ∑_i µ(i)² = COLLISION-PROBABILITY(µ).

The collision probability is an ℓ2 version of entropy: as µ becomes closer to uniform, the collision probability decreases. It is a good exercise to convince yourself that in the motivating example from before, T = ∑_i a_i^{⊗3} + c^{⊗3} where ‖c‖ = 100, the distribution µ of minimal collision probability which obtains ⟨E_{x∼µ} x^{⊗3}, T⟩ ≥ δ also has ∑_i E_{x∼µ} ⟨x, a_i⟩^3 ≥ poly(δ), for small enough constants δ > 0.

The lemma follows from the following general fact.

Theorem 7.2 (Appears in this form in Hopkins-Steurer). Let C be a convex set and Y ∈ C. Let P be a vector with ⟨P, Y⟩ ≥ δ ‖P‖ ‖Y‖. Then, if we let Q be the vector that minimizes ‖Q‖ subject to Q ∈ C and ⟨P, Q⟩ ≥ δ ‖P‖ ‖Y‖, we have

⟨Q, Y⟩ ≥ (δ/2) ‖Q‖ ‖Y‖.                                    (9)

Furthermore, Q satisfies ‖Q‖ ≥ δ ‖Y‖.

Proof. By construction, Q is the Euclidean projection of 0 onto the set C′ := {Q ∈ C : ⟨P, Q⟩ ≥ δ ‖P‖ ‖Y‖}. It is a basic geometric fact (sometimes called the Pythagorean inequality) that a Euclidean projection onto a set decreases distances to points in the set. Therefore, ‖Y − Q‖² ≤ ‖Y − 0‖² (using that Y ∈ C′). Thus, ⟨Y, Q⟩ ≥ ‖Q‖²/2. On the other hand, ⟨P, Q⟩ ≥ δ ‖P‖ ‖Y‖ means that ‖Q‖ ≥ δ ‖Y‖ by Cauchy-Schwarz. We conclude ⟨Y, Q⟩ ≥ (δ/2) ‖Y‖ ‖Q‖.

Now we can prove the lemma.

Proof of lemma. Consider the convex set

C = { Ẽ : Ẽ has degree 4, satisfies {‖x‖² = 1}, ‖Ẽ xx^⊤‖_op ≤ 1/r, and ‖Ẽ (x ⊗ x)(x ⊗ x)^⊤‖_op ≤ 1/r }.

The uniform distribution µ over a_1, ..., a_r is in C, and T satisfies

⟨T, E_{x∼µ} x^{⊗3}⟩ ≥ δ ‖T‖ ‖E_{x∼µ} x^{⊗3}‖.

Let Ẽ be the solution to the convex program in the lemma. According to the theorem on correlation-preserving projections,

⟨Ẽ x^{⊗3}, E_{x∼µ} x^{⊗3}⟩ ≥ (δ/2) ‖Ẽ x^{⊗3}‖ ‖E_{x∼µ} x^{⊗3}‖ ≥ (δ²/2) ‖E_{x∼µ} x^{⊗3}‖² = δ²/(2r),

where in the last step we have used that the collision probability of µ is 1/r. Rearranging, since

⟨Ẽ x^{⊗3}, E_{x∼µ} x^{⊗3}⟩ = (1/r) ∑_i Ẽ ⟨a_i, x⟩^3,

we get ∑_i Ẽ ⟨a_i, x⟩^3 ≥ δ²/2, which proves the lemma.

To turn the above into an algorithm requires a version of Lemma 7.1 suitable for this low-correlation regime, stated below. The proof uses mostly the same ideas as that of Lemma 7.1.

Lemma 7.3 (Hopkins-Steurer). For every 0 < δ < 1 there is a polynomial time algorithm with the following guarantees. Suppose Ẽ is a degree-4 pseudoexpectation in the variables x_1, ..., x_n satisfying {‖x‖² = 1}. Furthermore, suppose that

1. ∑_i Ẽ ⟨x, a_i⟩^3 ≥ δ.
2. ‖Ẽ xx^⊤‖_op ≤ 1/r (this is a convex constraint!).
3. ‖Ẽ (x ⊗ x)(x ⊗ x)^⊤‖_op ≤ 1/r (this is also a convex constraint!).

Then for at least r′ = poly(δ) · r of the vectors a_1, ..., a_r, the algorithm takes input Ẽ and produces a unit vector b such that

Pr{ ⟨a_i, b⟩ ≥ poly(δ) } ≥ n^{-poly(1/δ)}.
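As an illustration (not from the notes), Theorem 7.2 can be checked numerically in the simplest case C = R^N, where the norm-minimizing Q subject to ⟨P, Q⟩ ≥ τ has the closed form Q = τ P / ‖P‖²; the vectors and δ below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(4)
N, delta = 50, 0.3
Y = rng.standard_normal(N)
u = Y / np.linalg.norm(Y)
w = rng.standard_normal(N)
w -= (w @ u) * u
w /= np.linalg.norm(w)
P = 0.6 * u + 0.8 * w          # unit vector with <P, Y> = 0.6 ||Y|| >= delta ||P|| ||Y||

# Correlation-preserving projection with C = R^N: minimize ||Q|| s.t. <P, Q> >= tau.
tau = delta * np.linalg.norm(P) * np.linalg.norm(Y)
Q = tau * P / np.linalg.norm(P) ** 2

print(Q @ Y >= 0.5 * delta * np.linalg.norm(Q) * np.linalg.norm(Y))   # inequality (9): True
print(np.linalg.norm(Q), delta * np.linalg.norm(Y))                   # ||Q|| >= delta ||Y|| (equality here)
```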

5 Bibliography

[Hopkins-Steurer]: Efficient Bayesian estimation from few samples: community detection and related problems. Samuel B. Hopkins, David Steurer. In submission.

[Ma-Shi-Steurer]: Polynomial-time Tensor Decompositions with Sum-of-Squares. Tengyu Ma, Jonathan Shi, David Steurer. FOCS 2016.
