BEYOND MATRIX COMPLETION

1 BEYOND MATRIX COMPLETION ANKUR MOITRA MASSACHUSETTS INSTITUTE OF TECHNOLOGY Based on joint work with Boaz Barak (MSR)

2 Part I: Recommendation systems and partially observed matrices

7 THE NETFLIX PROBLEM
[Figure: a users x movies matrix of ratings (1-5 stars), 480,189 users by 17,770 movies, 1% observed]
Can we (approximately) fill in the missing entries?

14 MATRIX COMPLETION
Let M be an unknown, approximately low-rank matrix.
[Figure: M = (drama) + (comedy) + (sports) + ..., a sum of rank-one terms]
Model: we are given random observations $M_{i,j}$ for all $(i,j) \in \Omega$.
Is there an efficient algorithm to recover M?

16 MATRIX COMPLETION
The natural formulation is non-convex and NP-hard:
$\min \operatorname{rank}(X)$ s.t. $\frac{1}{|\Omega|} \sum_{(i,j) \in \Omega} |X_{i,j} - M_{i,j}| \le \eta$
There is a powerful convex relaxation.

19 THE NUCLEAR NORM
Consider the singular value decomposition of X: $X = U \Sigma V^T$, with U orthogonal, $\Sigma$ diagonal, and V orthogonal.
Let $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > \sigma_{r+1} = \dots = \sigma_m = 0$ be the singular values.
Then $\operatorname{rank}(X) = r$, and $\|X\|_* = \sigma_1 + \sigma_2 + \dots + \sigma_r$ (nuclear norm).
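The definitions on this slide can be checked numerically. The following is a minimal NumPy sketch, not part of the original talk; the matrix size, the rank-2 construction, and the tolerance are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumption: a 6 x 5 matrix of rank 2, built from two thin factors.
X = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 5))

# Thin SVD: X = U @ diag(s) @ Vt, with s sorted in decreasing order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Count singular values that are nonzero up to floating-point error.
tol = 1e-10 * s[0]
rank = int(np.sum(s > tol))

# Nuclear norm: the sum of the singular values.
nuclear_norm = float(np.sum(s))

print("rank:", rank)
print("nuclear norm:", nuclear_norm)
```

The rank counts nonzero singular values while the nuclear norm sums them, which is why the latter serves as a convex surrogate for the former.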

23 This yields a convex relaxation that can be solved efficiently:
(P) $\min \|X\|_*$ s.t. $\frac{1}{|\Omega|} \sum_{(i,j) \in \Omega} |X_{i,j} - M_{i,j}| \le \eta$
[Fazel], [Srebro, Shraibman], [Recht, Fazel, Parrilo], [Candes, Recht], [Candes, Tao], [Candes, Plan], [Recht]
Theorem: If M is n x n, has rank r, and is C-incoherent, then (P) recovers M exactly from $C^6 n r \log^2 n$ observations.
This is nearly optimal, since there are 2nr parameters.
Many other approaches, e.g. alternating minimization: [Keshavan, Montanari, Oh], [Jain, Netrapalli, Sanghavi], [Hardt]
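To make the alternating minimization idea concrete, here is a minimal NumPy sketch of regularized alternating least squares on the observed entries. It is a generic textbook-style illustration, not the specific algorithm of any of the cited papers; the function name `als_complete`, the ridge parameter `lam`, the iteration count, and the test matrix are all assumptions made for this example.

```python
import numpy as np

def als_complete(M_obs, mask, r, lam=1e-3, iters=50, seed=0):
    """Alternating least squares on the observed entries (where mask is True)."""
    rng = np.random.default_rng(seed)
    n1, n2 = M_obs.shape
    U = rng.standard_normal((n1, r))
    V = rng.standard_normal((n2, r))
    reg = lam * np.eye(r)
    history = []  # regularized objective after each full sweep
    for _ in range(iters):
        # Fix V: each row of U solves a small ridge regression.
        for i in range(n1):
            cols = mask[i]
            Vi = V[cols]
            U[i] = np.linalg.solve(Vi.T @ Vi + reg, Vi.T @ M_obs[i, cols])
        # Fix U: each row of V solves a small ridge regression.
        for j in range(n2):
            rows = mask[:, j]
            Uj = U[rows]
            V[j] = np.linalg.solve(Uj.T @ Uj + reg, Uj.T @ M_obs[rows, j])
        resid = (U @ V.T - M_obs)[mask]
        history.append(np.sum(resid**2) + lam * (np.sum(U**2) + np.sum(V**2)))
    return U @ V.T, history

# Hypothetical test case: a random 40 x 40 rank-2 matrix, half of it observed.
rng = np.random.default_rng(1)
n, r = 40, 2
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
mask = rng.random((n, n)) < 0.5
X, history = als_complete(np.where(mask, M, 0.0), mask, r)
rel_err = np.linalg.norm(X - M) / np.linalg.norm(M)
print("relative error on all entries:", rel_err)
```

Each half-step minimizes the regularized objective exactly over one factor, so the objective never increases. Unlike the convex program (P), this heuristic carries no recovery guarantee without further assumptions on initialization and incoherence.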

24 Part II: Higher order structure?

31 TENSOR PREDICTION
Can using more than two attributes lead to better recommendations?
[Figure: a users x activities x time tensor, e.g. Groupon; time: season, time of day, weekday/weekend, etc.; example component: snow sports]
$T = \sum_{i=1}^{r} a_i \otimes b_i \otimes c_i$
Can we (approximately) fill in the missing entries?
More attributes lead to better recommendations, but more complex objects.

33 THE TROUBLE WITH TENSORS
Natural approach (suggested by many authors):
(P) $\min \|X\|_*$ s.t. $\frac{1}{|\Omega|} \sum_{(i,j,k) \in \Omega} |X_{i,j,k} - T_{i,j,k}| \le \eta$, where $\|X\|_*$ is the tensor nuclear norm
The tensor nuclear norm is NP-hard to compute! [Gurvits], [Liu], [Harrow, Montanaro]

36 In fact, most of the linear algebra toolkit is ill-posed or computationally hard for tensors,
e.g. [Hillar, Lim] "Most Tensor Problems are NP-Hard"

Table I. Tractability of Tensor Problems
Problem                                    Complexity
Bivariate Matrix Functions over R, C       Undecidable (Proposition 12.2)
Bilinear System over R, C                  NP-hard (Theorems 2.6, 3.7, 3.8)
Eigenvalue over R                          NP-hard (Theorem 1.3)
Approximating Eigenvector over R           NP-hard (Theorem 1.5)
Symmetric Eigenvalue over R                NP-hard (Theorem 9.3)
Approximating Symmetric Eigenvalue over R  NP-hard (Theorem 9.6)
Singular Value over R, C                   NP-hard (Theorem 1.7)
Symmetric Singular Value over R            NP-hard (Theorem 10.2)
Approximating Singular Vector over R, C    NP-hard (Theorem 6.3)
Spectral Norm over R                       NP-hard (Theorem 1.10)
Symmetric Spectral Norm over R             NP-hard (Theorem 10.2)
Approximating Spectral Norm over R         NP-hard (Theorem 1.11)
Nonnegative Definiteness                   NP-hard (Theorem 11.2)
Best Rank-1 Approximation                  NP-hard (Theorem 1.13)
Best Symmetric Rank-1 Approximation        NP-hard (Theorem 10.2)
Rank over R or C                           NP-hard (Theorem 8.2)
Enumerating Eigenvectors over R            #P-hard (Corollary 1.16)
Combinatorial Hyperdeterminant             NP-, #P-, VNP-hard (Theorems 4.1, 4.2, Corollary 4.3)
Geometric Hyperdeterminant                 Conjectures 1.9, 13.1
Symmetric Rank                             Conjecture 13.2
Bilinear Programming                       Conjecture 13.4
Bilinear Least Squares                     Conjecture 13.5
Note: Except for positive definiteness and the combinatorial hyperdeterminant, which apply to -tensors,

42 FLATTENING A TENSOR
Many tensor methods rely on flattening:
flat(T) is the $n_1 \times n_2 n_3$ matrix with rows indexed by i, columns indexed by pairs (j,k), and entries $T_{i,j,k}$ (e.g. rows: users; columns: activities x time).
This is a rearrangement of the entries into a matrix that does not increase its rank:
$\mathrm{flat}\left(\sum_{i=1}^{r} a_i \otimes b_i \otimes c_i\right) = \sum_{i=1}^{r} a_i \, \mathrm{vec}(b_i \otimes c_i)^T$, where $\mathrm{vec}(b_i \otimes c_i)$ is an $n_2 n_3$-dimensional vector.
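The flattening identity can be verified on a small example. The NumPy sketch below is illustrative only: the dimensions, the random factors, and the row-major ordering of the pair (j,k) are assumptions chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, n3, r = 4, 5, 6, 2  # illustrative sizes

# Columns of A, B, C play the roles of a_l, b_l, c_l.
A = rng.standard_normal((n1, r))
B = rng.standard_normal((n2, r))
C = rng.standard_normal((n3, r))

# T[i, j, k] = sum_l A[i, l] * B[j, l] * C[k, l]
T = np.einsum("il,jl,kl->ijk", A, B, C)

# flat(T): rows indexed by i, columns indexed by the pair (j, k).
flat_T = T.reshape(n1, n2 * n3)

# The same matrix written as sum_l a_l vec(b_l (x) c_l)^T.
flat_sum = sum(
    np.outer(A[:, l], np.outer(B[:, l], C[:, l]).reshape(-1)) for l in range(r)
)

print(np.allclose(flat_T, flat_sum))
print(np.linalg.matrix_rank(flat_T))
```

Because the flattened matrix is a sum of r rank-one matrices, its rank is at most r, so matrix completion methods apply to it directly.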

46 Let $n_1 = n_2 = n_3 = n$.
We would need $O(n^2 r)$ observations to fill in flat(T).
There are many other variants of flattening, but with comparable guarantees: [Liu, Musialski, Wonka, Ye], [Gandy, Recht, Yamada], [Signoretto, De Lathauwer, Suykens], [Tomioka, Hayashi, Kashima], [Mu, Huang, Wright, Goldfarb]
Can we beat flattening?
Can we make better predictions than we do by treating each activity x time as unrelated?

47 Part III: Nearly optimal algorithms for tensor prediction

51 OUR RESULTS
$T = \sum_{i=1}^{r} \sigma_i \, a_i \otimes b_i \otimes c_i + \text{noise}$ [slide annotation: standard Gaussian r.v.]
Theorem: Suppose $\mathrm{var}(T_{i,j,k}) \ge r / (n^3 \, \mathrm{polylog}(n))$. Then there is an efficient algorithm that outputs X which satisfies $X_{i,j,k} = (1 \pm o(1)) \, T_{i,j,k}$ for a $1 - o(1)$ fraction of entries, provided $m = \Omega(n^{3/2} r)$.
This variance bound holds for random tensors, but also for tensors where the factors (the $a_i$'s, $b_i$'s, $c_i$'s) have large inner product.
Even for $r = n^{3/2 - \delta}$ (highly overcomplete), we only need to observe an o(1) fraction of the entries to predict almost everything.
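To see why the overcomplete regime still needs only a vanishing fraction of the entries, one can plug the rank into the sample bound. The calculation below is a sketch that assumes m is on the order of $n^{3/2} r$, ignoring constants and polylogarithmic factors, with $r = n^{3/2 - \delta}$ for a fixed $\delta > 0$:

```latex
\[
\frac{m}{n^3} \;\approx\; \frac{n^{3/2}\, r}{n^3}
  \;=\; \frac{n^{3/2} \cdot n^{3/2-\delta}}{n^3}
  \;=\; n^{-\delta} \;\longrightarrow\; 0
  \quad \text{as } n \to \infty .
\]
```

Polylogarithmic factors do not change the conclusion, since $n^{-\delta} \, \mathrm{polylog}(n)$ still tends to zero.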

55 LOWER BOUNDS
Not only is the tensor nuclear norm hard to compute, but:
Tensor prediction with m observations → Refute random 3-SAT with m clauses
The best known algorithms require $m = n^{3/2}$, and even powerful SDP hierarchies fail with fewer clauses [Shor], [Nesterov], [Parrilo], [Lasserre], [Grigoriev], [Schoenebeck]
Corollary [informal]: Any algorithm for solving tensor prediction in the sum-of-squares hierarchy that uses $m = n^{3/2 - \delta} r$ observations must run in exponential time.

56 Part IV: Our techniques: connections to random CSPs

62 Can we distinguish between low-rank and random?
Case #1: Approximately low-rank ($a^T a$ + noise). For each $(i,j) \in \Omega$: $M_{i,j} = a_i a_j$ w/ probability ¾, and a random ±1 w/ probability ¼, where each $a_i = \pm 1$.
Case #2: Random. For each $(i,j) \in \Omega$: $M_{i,j}$ = random ±1.
In Case #1 the entries are (somewhat) predictable, but in Case #2 they are completely unpredictable.
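The two cases can be simulated, and a simple spectral statistic already separates them when enough entries are observed. The sketch below is a standard illustration, not the method from the talk; the function name `sample_case`, the observation probability `p`, and the choice of the top singular value as the test statistic are assumptions.

```python
import numpy as np

def sample_case(n, p, planted, rng):
    """Return an n x n array with observed entries in {-1, +1} and 0 elsewhere."""
    omega = rng.random((n, n)) < p                # observed positions
    noise = rng.choice([-1, 1], size=(n, n))      # uniformly random signs
    if planted:
        a = rng.choice([-1, 1], size=n)
        keep = rng.random((n, n)) < 0.75          # a_i * a_j w/ probability 3/4
        values = np.where(keep, np.outer(a, a), noise)
    else:
        values = noise
    return np.where(omega, values, 0)

rng = np.random.default_rng(0)
n, p = 200, 0.3  # illustrative sizes

# Largest singular value of the zero-filled observation matrix in each case.
top_planted = np.linalg.norm(sample_case(n, p, True, rng), 2)
top_random = np.linalg.norm(sample_case(n, p, False, rng), 2)
print(top_planted, top_random)
```

In Case #1 the expected observation matrix is a multiple of the rank-one matrix with entries $a_i a_j$, which pushes the top singular value well above what purely random signs produce.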

69 There are two very different communities that (essentially) attacked this same distinguishing problem:
The community working on matrix completion
The community working on refuting random CSPs
Matrix prediction with m observations ↔ Strongly refute* random 2-XOR with m clauses
Tensor prediction with m observations ↔ (Rademacher complexity) Strongly refute* random 3-XOR with m clauses [Coja-Oghlan, Goerdt, Lanka]
We then embed this algorithm into the sixth level of the sum-of-squares hierarchy, to get a relaxation for tensor prediction.
*Want an algorithm that certifies a formula is far from satisfiable
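The correspondence can be made concrete by reading each ±1 observation as an XOR clause. The sketch below uses only the Python standard library; the helper names, the sizes, and the planted tensor with entries $a_i a_j a_k$ (a third-order analogue of Case #1) are assumptions for illustration rather than the construction from the talk.

```python
import random

def observations_to_3xor(observations):
    """Read each +/-1 observation T[i,j,k] = s as the clause x_i * x_j * x_k = s."""
    return [((i, j, k), s) for (i, j, k), s in observations.items()]

def fraction_satisfied(clauses, x):
    """Fraction of clauses x_i * x_j * x_k = s satisfied by a +/-1 assignment x."""
    good = sum(1 for (i, j, k), s in clauses if x[i] * x[j] * x[k] == s)
    return good / len(clauses)

random.seed(0)
n, m = 50, 5000  # illustrative sizes
a = [random.choice([-1, 1]) for _ in range(n)]

planted, uniform = {}, {}
while len(planted) < m:
    idx = tuple(random.randrange(n) for _ in range(3))
    # Planted case: a_i * a_j * a_k w/ probability 3/4, else a random sign.
    signal = a[idx[0]] * a[idx[1]] * a[idx[2]]
    planted[idx] = signal if random.random() < 0.75 else random.choice([-1, 1])
    # Random case: a uniformly random sign.
    uniform[idx] = random.choice([-1, 1])

frac_planted = fraction_satisfied(observations_to_3xor(planted), a)
frac_uniform = fraction_satisfied(observations_to_3xor(uniform), a)
print(frac_planted, frac_uniform)
```

In the planted case the assignment a satisfies far more than half of the clauses, while a random formula is close to half-satisfied by any fixed assignment. Certifying that no assignment does much better than half is the strong refutation task.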

72 SUMMARY
New Algorithm: We gave an algorithm for 3rd-order tensor prediction that uses $m = n^{3/2} r \log^4 n$ observations.
An Inefficient Algorithm (via tensor nuclear norm): There is an inefficient algorithm that uses $m = n r \log n$ observations.
A Phase Transition: Even for $n^{\delta}$ rounds of the powerful sum-of-squares hierarchy, no norm solves tensor prediction with $m = n^{3/2 - \delta} r$ observations.

76 DISCUSSION
Convex programs are unreasonably effective for linear inverse problems!
But we gave simple linear inverse problems that exhibit striking gaps between efficient and inefficient estimators.
Where else are there computational vs. statistical tradeoffs?
New Direction: Explore computational vs. statistical tradeoffs through the powerful sum-of-squares hierarchy.
