Statistical Performance of Convex Tensor Decomposition


1 Slides available: http:// tokyo.ac.jp/ryotat/tensor12kyoto.pdf. Statistical Performance of Convex Tensor Decomposition. Ryota Tomioka, Kyoto University. Perspectives in Informatics 4B. Collaborators: Taiji Suzuki, Kohei Hayashi, Hisashi Kashima

2 Netflix challenge ( ): $1,000,000 prize. Goal: improve the performance of a video recommendation system (predict who likes which movies). Example: he likes Star Wars and E.T., but doesn't like Minority Report. Does he like Blade Runner?

3 Matrix completion view. [Table: users A-D by movies, with entries +1 (likes), -1 (dislikes), and ? (missing).] Goal: fill the missing entries!

4 Matrix completion. Impossible without an assumption (the missing entries can be arbitrary): the problem is ill-posed. Most common assumption: a low-rank decomposition $Y \approx U V^T$.

5 Matrix completion. Most common assumption: a low-rank decomposition (rank r), $y_{ij} \approx \langle u_i, v_j \rangle$, a dot product in an r-dimensional space.

6 Geometric intuition (r: the rank of the decomposition). [Figure: in the r-dimensional space, a user vector $u_a$ (a row of $U$) points toward the movie vectors (columns of $V^T$) he likes and away from the movies he doesn't like.]

7 Geometric intuition, continued. [Figure: two user vectors $u_a$ and $u_b$ in the same r-dimensional space of movie vectors.]

8 Tensor data completion. Tensor = multi-dimensional array (beyond 2D). Movie preference: users x movies, plus time / context / action.

9 Tensor data completion. Tensor = multi-dimensional array (beyond 2D). Climate monitoring: temperature, humidity, rainfall.

10 Tensor data completion. Tensor = multi-dimensional array (beyond 2D). Neuroscience (brain imaging): sensors x time.

11 Rest of this talk: computing low-rank matrix decompositions; generalizing from matrices to tensors; analyzing the performance (statistical learning theory).

12 Computing low-rank matrix decomposition

13 Computing low-rank decomposition. If all entries are observed (no missing entries): given $Y$ ($m \times n$), compute the singular value decomposition (SVD) $Y \approx U_r \Sigma_r V_r^T$, where $U$, $V$ are orthogonal ($U^T U = I$, $V^T V = I$) and $\Sigma_r = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$, $\sigma_j$ being the $j$-th largest singular value.
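
A minimal numpy sketch of this step (the matrix Y below is synthetic, for illustration only):

```python
import numpy as np

def truncated_svd(Y, r):
    """Best rank-r approximation of a fully observed matrix via the SVD."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U[:, :r], np.diag(s[:r]), Vt[:r, :]

rng = np.random.default_rng(0)
Y = rng.standard_normal((50, 2)) @ rng.standard_normal((2, 30))  # rank-2 matrix
U_r, S_r, Vt_r = truncated_svd(Y, 2)
print(np.linalg.norm(Y - U_r @ S_r @ Vt_r))  # ~0: Y really is rank 2
```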

14 Tolerating missing entries. Optimization problem: $\min_{U, V} \sum_{(i,j) \in \Omega} (y_{ij} - \langle u_i, v_j \rangle)^2$, where $U$ is $m \times r$, $V$ is $n \times r$, and $\Omega$ is the set of observed index pairs.
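
The objective only touches observed entries; a sketch with a boolean mask standing in for $\Omega$ (all names are illustrative):

```python
import numpy as np

def masked_objective(Y, mask, U, V):
    """Sum of squared errors over the observed entries (mask == True)."""
    residual = Y - U @ V.T   # full residual; unobserved entries are discarded below
    return np.sum(residual[mask] ** 2)
```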

15 Tolerating missing entries. The same optimization problem is non-convex! E.g., in the scalar case, $(u, v) = (1, 1)$ and $(u, v) = (-1, -1)$ give the same objective value: two symmetric minima.

16 Tolerating missing entries. Still non-convex: $\min_W \sum_{(i,j) \in \Omega} (y_{ij} - w_{ij})^2$ subject to $\mathrm{rank}(W) \le r$. The rank constraint is NP-hard.

17 Convex relaxation of rank. The Schatten p-norm (to the p-th power) is $\|W\|_{S_p}^p := \sum_{j=1}^r \sigma_j^p(W)$, where $\sigma_j(W)$ is the $j$-th largest singular value. As $p \to 0$, $\|W\|_{S_p}^p \to \mathrm{rank}(W)$. [Figure: curves of $|\sigma|^p$ for several values of p, interpolating between the rank indicator (p near 0) and the absolute value (p = 1).] $p = 1$ is the tightest convex relaxation (also known as the trace norm / nuclear norm).
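
A sketch of this quantity computed from the singular values:

```python
import numpy as np

def schatten_p_power(W, p):
    """||W||_{S_p}^p = sum_j sigma_j(W)^p."""
    s = np.linalg.svd(W, compute_uv=False)
    return np.sum(s ** p)

# p = 1 gives the trace norm / nuclear norm; small p approaches the rank,
# i.e., the count of nonzero singular values.
```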

18 Tolerating missing entries, convex relaxation: $\min_W \sum_{(i,j) \in \Omega} (y_{ij} - w_{ij})^2$ subject to $\|W\|_{S_1} \le \tau$, where $\|W\|_{S_1} = \sum_{j=1}^r \sigma_j(W)$ is the Schatten 1-norm (nuclear norm, trace norm). Cf. the Lasso ($L_1$ norm = linear sum of absolute coefficients) for variable selection.
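
The slide states the constrained form; a common way to solve the closely related penalized form is proximal gradient descent, whose key step soft-thresholds the singular values (singular value thresholding). A sketch under these assumptions, not the speaker's specific solver:

```python
import numpy as np

def svt(W, tau):
    """Prox of tau * ||.||_{S1}: soft-threshold the singular values of W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def complete(Y, mask, lam, iters=500):
    """Proximal gradient for min_W 0.5*||P_Omega(W - Y)||_F^2 + lam*||W||_{S1}."""
    W = np.zeros_like(Y)
    for _ in range(iters):
        grad = np.where(mask, W - Y, 0.0)  # gradient of the smooth part
        W = svt(W - grad, lam)             # step size 1 (Lipschitz constant is 1)
    return W
```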

19 Take-home messages. Rank-constrained minimization is hard to solve (non-convex and NP-hard). It can be relaxed into a tractable convex problem using the Schatten 1-norm.

20 How about tensors? How do we define tensor rank? How is it related to matrix rank?

21 Rank of a tensor. Definition: let $\mathcal{X} \in \mathbb{R}^{n_1 \times \cdots \times n_K}$ (a $K$-th order tensor). The rank is the smallest number $R$ such that $\mathcal{X} = \sum_{r=1}^R \mathcal{A}_r$, where each $\mathcal{A}_r$ is rank one, i.e., can be written as an outer product of $K$ vectors. This is called the CP (CANDECOMP/PARAFAC) decomposition. Bad news: it is NP-hard to compute the rank $R$, even for a fully observed $\mathcal{X}$.
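
Rank-one terms and their sums are easy to build; a numpy sketch of a CP-structured tensor (sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
R, n1, n2, n3 = 2, 4, 5, 6
A = rng.standard_normal((n1, R))   # the vectors a_r as columns
B = rng.standard_normal((n2, R))
C = rng.standard_normal((n3, R))

# X = sum_r a_r (outer) b_r (outer) c_r: each term is an outer product of
# three vectors, so X has CP rank at most R by construction.
X = np.einsum('ir,jr,kr->ijk', A, B, C)
```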

22 Bad news 2: tensor rank is not closed [Kolda & Bader 2009]. $\mathcal{X}$ is rank 3: $\mathcal{X} = a_1 \circ b_1 \circ c_2 + a_1 \circ b_2 \circ c_1 + a_2 \circ b_1 \circ c_1$. $\mathcal{Y}_\alpha$ is rank 2: $\mathcal{Y}_\alpha = \alpha \left(a_1 + \tfrac{1}{\alpha} a_2\right) \circ \left(b_1 + \tfrac{1}{\alpha} b_2\right) \circ \left(c_1 + \tfrac{1}{\alpha} c_2\right) - \alpha\, a_1 \circ b_1 \circ c_1$. Yet $\|\mathcal{X} - \mathcal{Y}_\alpha\|_F \to 0$ as $\alpha \to \infty$: a rank-3 tensor is a limit of rank-2 tensors.

23 Tucker decomposition [Tucker 66]: $\mathcal{X}_{ijk} = \sum_{a=1}^{r_1} \sum_{b=1}^{r_2} \sum_{c=1}^{r_3} \mathcal{C}_{abc}\, U^{(1)}_{ia} U^{(2)}_{jb} U^{(3)}_{kc}$. Also known as the higher-order SVD [De Lathauwer+ 00]. The rank $(r_1, r_2, r_3)$ can be computed in polynomial time using unfolding operations.

24 Mode-k unfoldings (matricization)
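
A sketch of the mode-k unfolding (one common ordering convention; other papers permute the columns differently):

```python
import numpy as np

def unfold(X, k):
    """Mode-k unfolding X_(k): mode k indexes the rows, and all the
    remaining modes are flattened into the columns."""
    return np.moveaxis(X, k, 0).reshape(X.shape[k], -1)
```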

25 Computing Tucker rank. For each $k = 1, \ldots, K$: compute the mode-$k$ unfolding $X_{(k)}$ and its (matrix) rank $r_k = \mathrm{rank}(X_{(k)})$. Unfolding/folding: the tensor $\mathcal{X}$ is low-rank in the $k$-th mode exactly when the matrix $X_{(k)}$ is low-rank (as a matrix).
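
Reusing unfold() from the sketch above:

```python
import numpy as np

def tucker_rank(X, tol=1e-10):
    """(r_1, ..., r_K): the matrix rank of each mode-k unfolding."""
    return tuple(int(np.linalg.matrix_rank(unfold(X, k), tol=tol))
                 for k in range(X.ndim))
```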

26 Computing Tucker rank, continued. Difference between tensor rank and Tucker rank: the tensor rank is a single number $R$ (may not be easy to compute), while the Tucker rank is defined for each mode (easy to compute). The CP decomposition is the special case of the Tucker decomposition with $R = r_1 = r_2 = \cdots = r_K$ and a diagonal $R \times R \times \cdots \times R$ core $\mathcal{C}$.

27 Basic idea. We know how to do matrix completion with the Schatten 1-norm (a tractable convex optimization). We know how to compute the Tucker rank (= the ranks of the mode-$k$ unfoldings). Schatten 1-norm + unfolding = convex, tractable tensor completion.

28 Overlapped Schatten 1-norm for tensors: $\|\mathcal{W}\|_{S_1} := \frac{1}{K} \sum_{k=1}^K \|W_{(k)}\|_{S_1}$. It measures the overall low-rank-ness (not just a single mode).
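
A sketch, again reusing unfold() from above:

```python
import numpy as np

def overlapped_schatten1(X):
    """Average of the nuclear norms of all mode-k unfoldings of X."""
    K = X.ndim
    return sum(np.linalg.svd(unfold(X, k), compute_uv=False).sum()
               for k in range(K)) / K
```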

29 Convex tensor estimation. Matrix: estimation of a low-rank matrix (hard); its convex relaxation is Schatten 1-norm minimization (tractable) [Fazel, Hindi, Boyd 01]. Tensor: estimation of a low-rank tensor (hard); the generalized convex relaxation is overlapped Schatten 1-norm minimization [Liu+ 09, Signoretto+ 10, Tomioka+ 10, Gandy+ 11].

30 Empirical performance: tensor completion result [Tomioka et al. 2010]. [Figure: estimation error versus the fraction of observed entries $M/(n_1 n_2 n_3)$, no noise; the convex method recovers the true tensor once the fraction exceeds a threshold.]

31 Analyzing the performance of convex tensor decomposition

32-35 Problem setting. Observation model: $y_i = \langle \mathcal{X}_i, \mathcal{W}^* \rangle + \epsilon_i$ ($i = 1, \ldots, M$), where $\mathcal{W}^*$ is the true tensor with rank $(r_1, \ldots, r_K)$ and $\epsilon_i$ is Gaussian noise. Example (tensor completion): the design tensors $\mathcal{X}_1, \mathcal{X}_2, \mathcal{X}_3, \mathcal{X}_4$, and so on, are $n_1 \times n_2 \times n_3$ indicators, each selecting a single observed entry.

36 Problem setting, with the estimator. Observation model: $y_i = \langle \mathcal{X}_i, \mathcal{W}^* \rangle + \epsilon_i$ ($i = 1, \ldots, M$), $\epsilon_i \sim N(0, \sigma^2)$. Optimization: $\hat{\mathcal{W}} = \mathrm{argmin}_{\mathcal{W} \in \mathbb{R}^{n_1 \times \cdots \times n_K}} \left( \frac{1}{2M} \|y - \mathfrak{X}(\mathcal{W})\|^2 + \lambda_M \|\mathcal{W}\|_{S_1} \right)$, where $N = \prod_{k=1}^K n_k$ and $\mathfrak{X}: \mathbb{R}^N \to \mathbb{R}^M$, $\mathfrak{X}(\mathcal{W}) = (\langle \mathcal{X}_1, \mathcal{W} \rangle, \ldots, \langle \mathcal{X}_M, \mathcal{W} \rangle)$.
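
A sketch of simulating this observation model with a random Gaussian design (all sizes and names below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, M, sigma = (10, 9, 8), (3, 2, 2), 500, 0.1

# Low Tucker-rank ground truth: a small core multiplied by factor matrices.
core = rng.standard_normal(r)
U = [rng.standard_normal((n[k], r[k])) for k in range(3)]
W_star = np.einsum('abc,ia,jb,kc->ijk', core, U[0], U[1], U[2])

# y_i = <X_i, W*> + eps_i with i.i.d. standard normal design tensors X_i.
Xs = rng.standard_normal((M,) + n)
y = np.tensordot(Xs, W_star, axes=3) + sigma * rng.standard_normal(M)
```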

37 Analysis objective. We would like to show something like: the mean squared error satisfies $\frac{\|\hat{\mathcal{W}} - \mathcal{W}^*\|_F^2}{N} \le O_p\!\left(\frac{c(n, r)}{M}\right)$, where $\hat{\mathcal{W}}$ is the estimated tensor, $\mathcal{W}^*$ the true low-rank tensor, $n = (n_1, \ldots, n_K)$ the size, $r = (r_1, \ldots, r_K)$ the rank, and $M$ the number of samples.

38 Theorem: random Gaussian design. Assume the elements of $\mathcal{X}_i$ are drawn i.i.d. from the standard normal distribution, and moreover $\frac{\#\text{samples } (M)}{\#\text{variables } (N)} \ge c_1 \|n^{-1}\|_{1/2} \|r\|_{1/2}$, where $\|n^{-1}\|_{1/2} := \left(\frac{1}{K} \sum_{k=1}^K \sqrt{1/n_k}\right)^2$ and $\|r\|_{1/2} := \left(\frac{1}{K} \sum_{k=1}^K \sqrt{r_k}\right)^2$.

39 Theorem: random Gaussian design, continued. Under the same assumptions, convergence: $\frac{\|\hat{\mathcal{W}} - \mathcal{W}^*\|_F^2}{N} \le O_p\!\left(\frac{\sigma^2 \|n^{-1}\|_{1/2} \|r\|_{1/2}}{M}\right)$, with $\|n^{-1}\|_{1/2}$ and $\|r\|_{1/2}$ as defined above.

40 Proof idea. What we want to derive: $\frac{\|\hat{\mathcal{W}} - \mathcal{W}^*\|_F^2}{N} \le O_p\!\left(\frac{c(n, r)}{M}\right)$. Since $\hat{\mathcal{W}}$ minimizes the objective, $\mathrm{Obj}(\hat{\mathcal{W}}) \le \mathrm{Obj}(\mathcal{W}^*)$, and it is not so hard to see that $\frac{1}{2M} \|\mathfrak{X}(\hat{\mathcal{W}} - \mathcal{W}^*)\|_2^2 \le \langle \mathfrak{X}^*(\epsilon)/M,\, \hat{\mathcal{W}} - \mathcal{W}^* \rangle + \lambda_M \|\hat{\mathcal{W}} - \mathcal{W}^*\|_{S_1}$, where $\mathfrak{X}^*(\epsilon) = \sum_{i=1}^M \epsilon_i \mathcal{X}_i$.

41 Proof outline (1/3). Inequality 1: upper-bound the dot product, $\langle \mathfrak{X}^*(\epsilon)/M,\, \hat{\mathcal{W}} - \mathcal{W}^* \rangle \le O_p\!\left(\sqrt{\frac{\sigma^2 N \|n^{-1}\|_{1/2}}{M}}\right) \|\hat{\mathcal{W}} - \mathcal{W}^*\|_{S_1}$ (optimization duality / random matrix theory).

42 Proof outline (1/3), continued: $\frac{1}{2M} \|\mathfrak{X}(\hat{\mathcal{W}} - \mathcal{W}^*)\|_2^2 \le \left(O_p\!\left(\sqrt{\frac{\sigma^2 N \|n^{-1}\|_{1/2}}{M}}\right) + \lambda_M\right) \|\hat{\mathcal{W}} - \mathcal{W}^*\|_{S_1}$. Trading off the two terms, the optimal regularization constant is $\lambda_M = O_p\!\left(\sqrt{\frac{\sigma^2 N \|n^{-1}\|_{1/2}}{M}}\right)$.

43 Proof outline (2/3). Inequality 2: relate the Schatten 1-norm to the Frobenius norm, $\|\hat{\mathcal{W}} - \mathcal{W}^*\|_{S_1} \le \sqrt{\|r\|_{1/2}}\, \|\hat{\mathcal{W}} - \mathcal{W}^*\|_F$ (the analogue of the relation between the $L_1$ and $L_2$ norms).

44 Proof outline (2/3), continued: combining Inequalities 1 and 2, $\frac{1}{2M} \|\mathfrak{X}(\hat{\mathcal{W}} - \mathcal{W}^*)\|_2^2 \le O_p\!\left(\sqrt{\frac{\sigma^2 N \|n^{-1}\|_{1/2} \|r\|_{1/2}}{M}}\right) \|\hat{\mathcal{W}} - \mathcal{W}^*\|_F$.
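
Inequality 2 is a Cauchy-Schwarz step on the singular values; a quick numeric sanity check for a single matrix (not part of the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 3)) @ rng.standard_normal((3, 30))  # rank 3
s = np.linalg.svd(A, compute_uv=False)
# ||A||_{S1} = sum_j sigma_j <= sqrt(rank) * ||A||_F  (Cauchy-Schwarz)
print(s.sum() <= np.sqrt(3) * np.linalg.norm(A))  # True
```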

45 Proof outline (3/3). Inequality 3: lower-bound the left-hand side, $\frac{1}{M} \|\mathfrak{X}(\hat{\mathcal{W}} - \mathcal{W}^*)\|_2^2 \ge \kappa \|\hat{\mathcal{W}} - \mathcal{W}^*\|_F^2$, which holds if $\frac{\#\text{samples } (M)}{\#\text{variables } (N)} \ge c_1 \|n^{-1}\|_{1/2} \|r\|_{1/2}$ (Gordon-Slepian theorem in Gaussian process theory).

46 Proof outline (3/3), continued: combining the three inequalities gives $\kappa \|\hat{\mathcal{W}} - \mathcal{W}^*\|_F^2 \le O_p\!\left(\frac{\sigma^2 N \|n^{-1}\|_{1/2} \|r\|_{1/2}}{M}\right)$, which is the claimed bound after dividing by $N$.

47 Back to the theorem statement. Assume the elements of $\mathcal{X}_i$ are drawn i.i.d. from the standard normal distribution and $\frac{\#\text{samples } (M)}{\#\text{variables } (N)} \ge c_1 \|n^{-1}\|_{1/2} \|r\|_{1/2}$; then $\frac{\|\hat{\mathcal{W}} - \mathcal{W}^*\|_F^2}{N} \le O_p\!\left(\frac{\sigma^2 \|n^{-1}\|_{1/2} \|r\|_{1/2}}{M}\right)$. Notice: the sample-size condition is independent of the noise $\sigma^2$; the right-hand side of the bound is proportional to $\sigma^2$; hence a threshold behavior in the limit $\sigma^2 \to 0$.

48 Tensor completion results. [Figure: the fraction of samples $M/N$ at the recovery threshold plotted against the normalized rank $\|n^{-1}\|_{1/2}^{1/2} \|r\|_{1/2}^{1/2}$ for sizes (50, 50, 20) and (100, 100, 50); the empirical thresholds scale linearly with the normalized rank.]

49 Including 4th-order tensors. [Figure: the same threshold plot including sizes (50, 50, 20), (100, 100, 50), (50, 50, 20, 10), and (100, 100, 20, 10).] Gap between necessity and sufficiency?

50 Matrix vs. tensor completion. [Figures: threshold vs. normalized rank for matrix completion ($K = 2$, several sizes) and for tensors ($K \ge 3$, the sizes from the previous slide).]

51 Conclusion. Many real-world problems can be cast in the form of tensor data analysis. Convex optimization is a useful tool also for the analysis of higher-order tensors. We proposed a convex tensor decomposition algorithm with a performance guarantee. The normalized rank predicts the empirical scaling behavior well.

52 Issues. Why is matrix completion more difficult than tensor completion? How big is the gap between necessity and sufficiency? From random Gaussian design to tensor completion: incoherence (Candes & Recht 09), spikiness (Negahban et al. 10). When only some modes are low-rank, the Schatten 1-norm is too strong; a mixture approach helps. E.g., modes 1 and 4 are low-rank but the rest are not (a combinatorial problem).

53 Thank you!
