Multi-Way Compressed Sensing for Big Tensor Data
1 Multi-Way Compressed Sensing for Big Tensor Data
Nikos Sidiropoulos, Dept. ECE, University of Minnesota
MIIS, July 1, 2013
2 STAR Group, Collaborators, Credits
Signal and Tensor Analytics Research (STAR) group: signal processing, big data, preference measurement, cognitive radio, spectrum sensing.
Christos Faloutsos, Tom Mitchell, George Karypis (UMN), NSF-NIH/BIGDATA: Big Tensor Mining: Theory, Scalable Algorithms and Applications.
Vaggelis Papalexakis (CMU), Tasos Kyrillidis (EPFL).
3 Tensor? What is this?
- Has a different formal meaning in physics (spin, symmetries).
- Informally adopted in CS as shorthand for a three-way array: a dataset X indexed by three indices, with (i, j, k)-th entry X(i, j, k).
- For two vectors a (I × 1) and b (J × 1), the outer product a ∘ b is an I × J rank-one matrix with (i, j)-th element a(i)b(j); i.e., a ∘ b = ab^T.
- For three vectors a (I × 1), b (J × 1), c (K × 1), a ∘ b ∘ c is an I × J × K rank-one three-way array with (i, j, k)-th element a(i)b(j)c(k).
- The rank of a three-way array X is the smallest number of outer products needed to synthesize X.
Curiosities:
- Two-way (I × J): row-rank = column-rank = rank ≤ min(I, J); three-way: row-rank, column-rank, "tube"-rank, and rank can all differ.
- Two-way: rank(randn(I,J)) = min(I,J) w.p. 1; three-way: rank(randn(2,2,2)) is a random variable (2 w.p. π/4 ≈ 0.79, 3 w.p. ≈ 0.21).
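These definitions are easy to make concrete in numpy; a minimal sketch (the outer product ∘ is built with einsum):

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K = 4, 5, 6
a, b, c = rng.standard_normal(I), rng.standard_normal(J), rng.standard_normal(K)

# Rank-one three-way array: (i,j,k)-th element is a(i)*b(j)*c(k)
X = np.einsum('i,j,k->ijk', a, b, c)
assert np.allclose(X[1, 2, 3], a[1] * b[2] * c[3])

# Two-way sanity check: the outer product a ∘ b equals a b^T
assert np.allclose(np.einsum('i,j->ij', a, b), np.outer(a, b))

# A random I × J matrix has full rank min(I, J) with probability 1
M = rng.standard_normal((I, J))
assert np.linalg.matrix_rank(M) == min(I, J)
```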
4 CMU / Tom Mitchell
- Crawl the web, learn language like children do: encounter new concepts, learn from context.
- NELL triplets of subject-verb-object naturally lead to a 3-mode tensor (subject × object × verb):
X ≈ a_1 ∘ b_1 ∘ c_1 + a_2 ∘ b_2 ∘ c_2 + ... + a_F ∘ b_F ∘ c_F.
- Each rank-one factor corresponds to a concept, e.g., "leaders" or "tools".
- E.g., say (a_1, b_1, c_1) corresponds to "leaders": subjects/rows with a high score on a_1 will be Obama, Merkel, Steve Jobs; objects/columns with a high score on b_1 will be USA, Germany, Apple Inc.; and verbs/fibers with a high score on c_1 will be verbs like "leads", "is-president-of", and "is-ceo-of".
5 Semantic analysis of brain fMRI data
- fMRI × semantic category scores.
- The fMRI mode is vectorized (O( )).
- Could treat it as three separate spatial modes: a 5-way array...
- ... or even include time as another dimension: a 6-way array.
6 Kronecker and Khatri-Rao products
⊗ stands for the Kronecker product: for a 2 × 2 matrix A,
A ⊗ B = [ A(1,1)B  A(1,2)B ; A(2,1)B  A(2,2)B ].
⊙ stands for the Khatri-Rao (column-wise Kronecker) product: given A (I × F) and B (J × F), A ⊙ B is the JI × F matrix
A ⊙ B = [ A(:,1) ⊗ B(:,1)  ...  A(:,F) ⊗ B(:,F) ].
Useful identities: vec(ABC) = (C^T ⊗ A) vec(B); if D = diag(d), then vec(ADC) = (C^T ⊙ A) d.
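Both identities can be checked numerically; a sketch assuming SciPy's `scipy.linalg.khatri_rao` and column-major (Fortran-order) vectorization:

```python
import numpy as np
from scipy.linalg import khatri_rao

rng = np.random.default_rng(1)
I, F, K = 3, 4, 5
A = rng.standard_normal((I, F))
B = rng.standard_normal((F, F))
C = rng.standard_normal((F, K))
d = rng.standard_normal(F)

vec = lambda M: M.flatten(order='F')  # column-major vectorization

# vec(ABC) = (C^T ⊗ A) vec(B)
assert np.allclose(vec(A @ B @ C), np.kron(C.T, A) @ vec(B))

# If D = diag(d): vec(ADC) = (C^T ⊙ A) d, with ⊙ the Khatri-Rao product
assert np.allclose(vec(A @ np.diag(d) @ C), khatri_rao(C.T, A) @ d)
```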
7 Tucker3
I × J × K three-way array X; A: I × L, B: J × M, C: K × N mode loading matrices; G: L × M × N Tucker3 core.
8 Tucker3, continued
Consider an I × J × K three-way array X comprising K matrix slabs {X_k}, k = 1, ..., K, arranged into the matrix X := [vec(X_1), ..., vec(X_K)]. The Tucker3 model can be written as X ≈ (B ⊗ A) G C^T, where A, B, C are the three mode loading matrices, assumed orthogonal without loss of generality, and G is the so-called Tucker3 core tensor G recast in matrix form. The nonzero elements of the core tensor determine the interactions between columns of A, B, C. The associated model-fitting problem is
min_{A,B,C,G} || X − (B ⊗ A) G C^T ||_F^2,
which is usually solved using an alternating least squares procedure. The Tucker3 model can be fully vectorized as vec(X) ≈ (C ⊗ B ⊗ A) vec(G).
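To pin down the matricization conventions, the following sketch synthesizes a Tucker3 tensor and verifies both identities (column-major vec assumed throughout):

```python
import numpy as np

rng = np.random.default_rng(2)
I, J, K = 4, 5, 6
L, M, N = 2, 3, 2
A = rng.standard_normal((I, L))
B = rng.standard_normal((J, M))
C = rng.standard_normal((K, N))
G = rng.standard_normal((L, M, N))  # Tucker3 core

# Synthesize X(i,j,k) = sum_{l,m,n} A(i,l) B(j,m) C(k,n) G(l,m,n)
X = np.einsum('il,jm,kn,lmn->ijk', A, B, C, G)

# Slab-wise matricization X = [vec(X_1), ..., vec(X_K)] (column-major vec)
Xmat = X.reshape(I * J, K, order='F')
Gmat = G.reshape(L * M, N, order='F')

assert np.allclose(Xmat, np.kron(B, A) @ Gmat @ C.T)  # X = (B ⊗ A) G C^T
assert np.allclose(X.flatten('F'),
                   np.kron(C, np.kron(B, A)) @ G.flatten('F'))  # vec(X) = (C ⊗ B ⊗ A) vec(G)
```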
9 Low-rank tensor decomposition / approximation
X ≈ Σ_{f=1}^{F} a_f ∘ b_f ∘ c_f.
Parallel factor analysis (PARAFAC) model [Harshman 70-72], a.k.a. canonical decomposition [Carroll & Chang, 70], a.k.a. CP; cf. [Hitchcock, 27].
PARAFAC can be written as a system of matrix equations X_k ≈ A D_k(C) B^T, where D_k(C) is a diagonal matrix holding the k-th row of C on its diagonal; or in compact matrix form as X ≈ (B ⊙ A) C^T, using the Khatri-Rao product. In particular, employing a property of the Khatri-Rao product,
X ≈ (B ⊙ A) C^T  ⇒  vec(X) ≈ (C ⊙ B ⊙ A) 1,
where 1 is a vector of all 1's.
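The same kind of numerical check works for PARAFAC; a sketch (SciPy's `khatri_rao`, column-major vec) verifying both Khatri-Rao forms on a synthetic rank-F tensor:

```python
import numpy as np
from scipy.linalg import khatri_rao

rng = np.random.default_rng(3)
I, J, K, F = 4, 5, 6, 3
A = rng.standard_normal((I, F))
B = rng.standard_normal((J, F))
C = rng.standard_normal((K, F))

# Rank-F PARAFAC tensor: X = sum_f a_f ∘ b_f ∘ c_f
X = np.einsum('if,jf,kf->ijk', A, B, C)

Xmat = X.reshape(I * J, K, order='F')  # [vec(X_1), ..., vec(X_K)]
assert np.allclose(Xmat, khatri_rao(B, A) @ C.T)  # X = (B ⊙ A) C^T
assert np.allclose(X.flatten('F'),
                   khatri_rao(C, khatri_rao(B, A)) @ np.ones(F))  # vec(X) = (C ⊙ B ⊙ A) 1
```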
10 Uniqueness
The distinguishing feature of the PARAFAC model is its essential uniqueness: under certain conditions, (A, B, C) can be identified from X, i.e., they are unique up to permutation and scaling of columns [Kruskal 77, Sidiropoulos et al 00-07, de Lathauwer 04-, Stegeman 06-].
Consider an I × J × K tensor X of rank F. In vectorized form, it can be written as the IJK × 1 vector x = (A ⊙ B ⊙ C) 1, for some A (I × F), B (J × F), and C (K × F) - a PARAFAC model of size I × J × K and order F, parameterized by (A, B, C).
The Kruskal-rank of A, denoted k_A, is the maximum k such that any k columns of A are linearly independent (k_A ≤ r_A := rank(A)).
Given X (↔ x), if k_A + k_B + k_C ≥ 2F + 2, then (A, B, C) are unique up to a common column permutation and scaling.
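For small matrices the Kruskal-rank can be computed by brute force over column subsets; a sketch (hypothetical helper `kruskal_rank`, cost exponential in F):

```python
import numpy as np
from itertools import combinations

def kruskal_rank(A, tol=1e-10):
    """Largest k such that EVERY set of k columns of A is linearly independent."""
    F = A.shape[1]
    k = 0
    for r in range(1, F + 1):
        if all(np.linalg.matrix_rank(A[:, list(cols)], tol=tol) == r
               for cols in combinations(range(F), r)):
            k = r
        else:
            break
    return k

rng = np.random.default_rng(4)
A = rng.standard_normal((4, 3))      # generic: k_A = min(I, F) = 3
assert kruskal_rank(A) == 3

# Make the third column a linear combination of the first two: k drops to 2
B = np.c_[A[:, :2], A[:, 0] + A[:, 1]]
assert kruskal_rank(B) == 2
```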
11 Special case: Two slabs
Assume square/tall, full column rank A (I × F) and B (J × F), and distinct nonzero elements along the diagonal of D:
X_1 = A B^T,  X_2 = A D B^T.
Introduce the compact F-component SVD of the stacked slabs:
[ X_1 ; X_2 ] = [ A ; AD ] B^T = U Σ V^H.
rank(B) = F implies that span(U) = span([ A ; AD ]); hence there exists a nonsingular matrix P such that
U = [ U_1 ; U_2 ] = [ A ; AD ] P.
12 Special case: Two slabs, continued
Next, construct the auto- and cross-product matrices
R_1 = U_1^H U_1 = P^H A^H A P =: QP,
R_2 = U_1^H U_2 = P^H A^H A D P = QDP.
All matrices on the right-hand side are square and full rank. Solving the first equation for Q and substituting into the second yields the eigenvalue problem
(R_1^{-1} R_2) P^{-1} = P^{-1} D.
P^{-1} (and P) are recovered up to permutation and scaling; from the bottom of the previous slide, A is likewise recovered, then B.
Starts making sense...
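A small numerical sketch of this two-slab recovery, assuming real-valued data (so Hermitian transposes reduce to transposes):

```python
import numpy as np

rng = np.random.default_rng(5)
I, J, F = 6, 5, 3
A = rng.standard_normal((I, F))
B = rng.standard_normal((J, F))
d = np.array([1.0, 2.0, 3.0])        # distinct nonzero diagonal of D

X1 = A @ B.T
X2 = A @ np.diag(d) @ B.T

# Compact F-component SVD of the stacked slabs [X1; X2]
U, s, Vh = np.linalg.svd(np.vstack([X1, X2]), full_matrices=False)
U1, U2 = U[:I, :F], U[I:, :F]

# Eigenvalue problem: (R1^{-1} R2) P^{-1} = P^{-1} D
R1 = U1.T @ U1
R2 = U1.T @ U2
eigvals, Pinv = np.linalg.eig(np.linalg.solve(R1, R2))

# Eigenvalues recover the diagonal of D (up to permutation)
assert np.allclose(np.sort(eigvals.real), np.sort(d))

# A is recovered as U1 P^{-1}, up to column permutation and scaling
A_hat = U1 @ Pinv.real
for f in range(F):
    cos = np.abs(A.T @ A_hat[:, f]) / (np.linalg.norm(A, axis=0)
                                       * np.linalg.norm(A_hat[:, f]))
    assert np.max(cos) > 1 - 1e-6    # some column of A is parallel to A_hat[:, f]
```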
13 Alternating Least Squares (ALS)
Based on the matrix view: X_(KJ×I) = (B ⊙ C) A^T.
Given interim estimates of B, C, solve for the conditional LS update of A:
A_CLS = ((B ⊙ C)^† X_(KJ×I))^T.
Similarly for the CLS updates of B, C (symmetry); repeat in a circular fashion until the cost converges.
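A minimal CP-ALS sketch (hypothetical helper `cp_als`; it uses the standard mode-n unfoldings, which differ from the slide's X_(KJ×I) arrangement only by transposition/reordering, and omits normalization and convergence checks):

```python
import numpy as np
from scipy.linalg import khatri_rao

def cp_als(X, F, n_iter=500, seed=0):
    """Plain CP-ALS for a 3-way array X (illustrative sketch only)."""
    I, J, K = X.shape
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((I, F))
    B = rng.standard_normal((J, F))
    C = rng.standard_normal((K, F))
    # Mode-n unfoldings (column-major convention): X_(1) = A (C ⊙ B)^T, etc.
    X1 = X.reshape(I, J * K, order='F')
    X2 = np.moveaxis(X, 1, 0).reshape(J, I * K, order='F')
    X3 = np.moveaxis(X, 2, 0).reshape(K, I * J, order='F')
    for _ in range(n_iter):
        A = X1 @ np.linalg.pinv(khatri_rao(C, B)).T   # conditional LS update of A
        B = X2 @ np.linalg.pinv(khatri_rao(C, A)).T   # ... of B
        C = X3 @ np.linalg.pinv(khatri_rao(B, A)).T   # ... of C
    return A, B, C

# Noiseless rank-2 sanity check
rng = np.random.default_rng(1)
A0 = rng.standard_normal((5, 2))
B0 = rng.standard_normal((6, 2))
C0 = rng.standard_normal((7, 2))
X = np.einsum('if,jf,kf->ijk', A0, B0, C0)
Ah, Bh, Ch = cp_als(X, 2)
Xh = np.einsum('if,jf,kf->ijk', Ah, Bh, Ch)
rel_err = np.linalg.norm(Xh - X) / np.linalg.norm(X)
```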
14 Heavy-tailed noise, sparse residuals
Instead of LS, use a least-absolute-deviations conditional update, solved as an LP:
min_A || X_(KJ×I) − (B ⊙ C) A^T ||_1.
Almost as good: coordinate-wise updates using weighted median filtering (very cheap!).
Vorobyov, Rong, Sidiropoulos, Gershman, 2005.
15 Big data: need for compression
- Tensors can easily become really big! Size is exponential in the number of dimensions ("ways", or "modes").
- Cannot load in main memory; can reside in cloud storage.
- Tensor compression?
- Commonly used compression method for moderate-size tensors: fit an orthogonal Tucker3 model, regress the data onto the fitted mode bases. Lossless if exact mode bases are used [CANDELINC]; but Tucker3 fitting is itself cumbersome for big tensors (big matrix SVDs), and cannot compress below the mode ranks without introducing errors.
- If the tensor is sparse, it can be stored as [i, j, k, value] tuples + specialized sparse matrix/tensor algorithms [(Sparse) Tensor Toolbox, Bader & Kolda]. Useful if the sparse representation can fit in main memory.
16 Tensor compression
Consider compressing x into y = Sx, where S is d × IJK, d ≪ IJK. In particular, consider the specially structured compression matrix
S = U^T ⊗ V^T ⊗ W^T.
This corresponds to multiplying (every slab of) X from the I-mode with U^T, from the J-mode with V^T, and from the K-mode with W^T, where U is I × L, V is J × M, and W is K × N, with L ≤ I, M ≤ J, N ≤ K, and LMN ≪ IJK.
[figure: the I × J × K tensor X is compressed to an L × M × N tensor Y]
17 Key
Due to a property of the Kronecker product,
(U^T ⊗ V^T ⊗ W^T)(A ⊙ B ⊙ C) = (U^T A) ⊙ (V^T B) ⊙ (W^T C),
from which it follows that
y = ((U^T A) ⊙ (V^T B) ⊙ (W^T C)) 1 = (Ã ⊙ B̃ ⊙ C̃) 1,
i.e., the compressed data follow a PARAFAC model of size L × M × N and order F, parameterized by (Ã, B̃, C̃), with Ã := U^T A, B̃ := V^T B, C̃ := W^T C.
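This mixed-product identity is easy to verify numerically; a sketch assuming SciPy's `khatri_rao`:

```python
import numpy as np
from scipy.linalg import khatri_rao

rng = np.random.default_rng(6)
I, J, K, F = 5, 6, 7, 3
L, M, N = 3, 3, 3
A, B, C = (rng.standard_normal(s) for s in [(I, F), (J, F), (K, F)])
U, V, W = (rng.standard_normal(s) for s in [(I, L), (J, M), (K, N)])

x = khatri_rao(A, khatri_rao(B, C)) @ np.ones(F)  # uncompressed vectorized tensor
S = np.kron(U.T, np.kron(V.T, W.T))               # multi-way compression matrix
y = S @ x

# Compressed data follow a small PARAFAC model parameterized by (U^T A, V^T B, W^T C)
y2 = khatri_rao(U.T @ A, khatri_rao(V.T @ B, W.T @ C)) @ np.ones(F)
assert np.allclose(y, y2)
```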
18 Random multi-way compression can be better!
Sidiropoulos & Kyrillidis, IEEE SPL, Oct. 2012.
Assume that the columns of A, B, C are sparse, and let n_a (n_b, n_c) be an upper bound on the number of nonzero elements per column of A (respectively B, C). Let the mode-compression matrices U (I × L, L ≤ I), V (J × M, M ≤ J), and W (K × N, N ≤ K) be randomly drawn from an absolutely continuous distribution with respect to the Lebesgue measure in R^{IL}, R^{JM}, and R^{KN}, respectively. If
min(L, k_A) + min(M, k_B) + min(N, k_C) ≥ 2F + 2, L ≥ 2n_a, M ≥ 2n_b, and N ≥ 2n_c,
then the original factor loadings A, B, C are almost surely identifiable from the compressed data.
19 Proof rests on two lemmas + Kruskal
Lemma 1: Consider Ã := U^T A, where A is I × F, and let the I × L matrix U be randomly drawn from an absolutely continuous distribution with respect to the Lebesgue measure in R^{IL} (e.g., multivariate Gaussian with a nonsingular covariance matrix). Then k_Ã = min(L, k_A) almost surely (with probability 1).
Lemma 2: Consider Ã := U^T A, where Ã and U are given and A is sought. Suppose that every column of A has at most n_a nonzero elements, and that k_{U^T} ≥ 2n_a. (The latter holds with probability 1 if the I × L matrix U is randomly drawn from an absolutely continuous distribution with respect to the Lebesgue measure in R^{IL}, and min(I, L) ≥ 2n_a.) Then A is the unique solution with at most n_a nonzero elements per column [Donoho & Elad, 03].
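A toy illustration of Lemma 2: when L ≥ 2n_a, the n_a-sparse column is the unique sparse solution, found here by brute-force support enumeration (illustration only; at scale one would use an l1 solver instead):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(7)
I, L, na = 8, 4, 2                  # L >= 2*na, so an na-sparse column is identifiable
U = rng.standard_normal((I, L))

a = np.zeros(I)
a[[1, 5]] = [1.0, -2.0]             # na-sparse ground-truth column
atilde = U.T @ a                    # compressed column

# Enumerate all supports of size na; generically only the true support fits exactly
solutions = []
for supp in combinations(range(I), na):
    Us = U[list(supp), :].T         # L x na subsystem U^T[:, supp]
    z, *_ = np.linalg.lstsq(Us, atilde, rcond=None)
    if np.linalg.norm(Us @ z - atilde) < 1e-9:
        x = np.zeros(I)
        x[list(supp)] = z
        solutions.append(x)

assert len(solutions) == 1          # the na-sparse solution is unique
assert np.allclose(solutions[0], a)
```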
20 Complexity
First fitting PARAFAC in the compressed space and then recovering the sparse A, B, C from the fitted compressed factors entails complexity O(LMNF + (I^{3.5} + J^{3.5} + K^{3.5})F). Using sparsity first and then fitting PARAFAC in the raw space entails complexity O(IJKF + (IJK)^{3.5}) - the difference is huge. Also note that the proposed approach does not require computations in the uncompressed data domain, which is important for big data that do not fit in memory for processing.
21 Further compression - down to O(√F) in 2/3 modes
Sidiropoulos & Kyrillidis, IEEE SPL, Oct. 2012.
Assume that the columns of A, B, C are sparse, and let n_a (n_b, n_c) be an upper bound on the number of nonzero elements per column of A (respectively B, C). Let the mode-compression matrices U (I × L, L ≤ I), V (J × M, M ≤ J), and W (K × N, N ≤ K) be randomly drawn from an absolutely continuous distribution with respect to the Lebesgue measure in R^{IL}, R^{JM}, and R^{KN}, respectively. If
r_A = r_B = r_C = F, L(L−1)M(M−1) ≥ 2F(F−1), N ≥ F, L ≥ 2n_a, M ≥ 2n_b, and N ≥ 2n_c,
then the original factor loadings A, B, C are almost surely identifiable from the compressed data up to a common column permutation and scaling.
22 Proof: Lemma 3 + results on a.s. identifiability of PARAFAC
Lemma 3: Consider Ã = U^T A, where A (I × F) is deterministic, tall/square (I ≥ F) and full column rank (r_A = F), and the elements of U (I × L) are i.i.d. Gaussian, zero-mean, unit-variance random variables. Then the distribution of Ã is nonsingular multivariate Gaussian.
From [Stegeman, ten Berge, de Lathauwer 2006] (see also [Jiang, Sidiropoulos 2004]), we know that PARAFAC is almost surely identifiable if the loading matrices Ã, B̃ are randomly drawn from an absolutely continuous distribution with respect to the Lebesgue measure in R^{(L+M)F}, C̃ is full column rank, and L(L−1)M(M−1) ≥ 2F(F−1).
23 Generalization to higher-way arrays
Theorem 3: Let x = (A_1 ⊙ ... ⊙ A_δ) 1 ∈ R^{∏_{d=1}^{δ} I_d}, where A_d is I_d × F, and consider compressing it to
y = (U_1^T ⊗ ... ⊗ U_δ^T) x = ((U_1^T A_1) ⊙ ... ⊙ (U_δ^T A_δ)) 1 = (Ã_1 ⊙ ... ⊙ Ã_δ) 1 ∈ R^{∏_{d=1}^{δ} L_d},
where the mode-compression matrices U_d (I_d × L_d, L_d ≤ I_d) are randomly drawn from an absolutely continuous distribution with respect to the Lebesgue measure in R^{I_d L_d}. Assume that the columns of A_d are sparse, and let n_d be an upper bound on the number of nonzero elements per column of A_d, for each d ∈ {1, ..., δ}. If
∑_{d=1}^{δ} min(L_d, k_{A_d}) ≥ 2F + δ − 1, and L_d ≥ 2n_d for all d ∈ {1, ..., δ},
then the original factor loadings {A_d}, d = 1, ..., δ, are almost surely identifiable from the compressed data y up to a common column permutation and scaling.
Various additional results are possible, e.g., a generalization of Theorem 2.
24 Latest on PARAFAC uniqueness
Luca Chiantini and Giorgio Ottaviani, "On Generic Identifiability of 3-Tensors of Small Rank," SIAM J. Matrix Anal. & Appl., 33(3), 2012:
- Consider an I × J × K tensor X of rank F, and order the dimensions so that I ≤ J ≤ K.
- Let i be maximal such that 2^i ≤ I, and likewise j maximal such that 2^j ≤ J.
- If F ≤ 2^{i+j−2}, then X has a unique decomposition almost surely.
- For I, J powers of 2, the condition simplifies to F ≤ IJ/4.
- More generally, the condition implies: if F ≤ (I+1)(J+1)/16, then X has a unique decomposition almost surely.
25 Even further compression
Assume that the columns of A, B, C are sparse, and let n_a (n_b, n_c) be an upper bound on the number of nonzero elements per column of A (respectively B, C). Let the mode-compression matrices U (I × L, L ≤ I), V (J × M, M ≤ J), and W (K × N, N ≤ K) be randomly drawn from an absolutely continuous distribution with respect to the Lebesgue measure in R^{IL}, R^{JM}, and R^{KN}, respectively. Assume L ≤ M ≤ N, and that L, M are powers of 2, for simplicity. If
r_A = r_B = r_C = F, LM ≥ 4F, N ≥ M ≥ L, and L ≥ 2n_a, M ≥ 2n_b, N ≥ 2n_c,
then the original factor loadings A, B, C are almost surely identifiable from the compressed data up to a common column permutation and scaling.
This allows compression down to order √F in all three modes.
26 What if A, B, C are not sparse?
If A, B, C are sparse with respect to known bases, i.e., A = R Ǎ, B = S B̌, and C = T Č, with R, S, T the respective sparsifying bases and Ǎ, B̌, Č sparse, then the previous results carry over under appropriate conditions, e.g., when R, S, T are nonsingular.
27 PARCUBE: Parallel sampling-based tensor decomposition
Papalexakis, Faloutsos, Sidiropoulos, ECML-PKDD 2012.
[figure: randomly sampled sub-tensors are decomposed in parallel and their factors merged]
- Challenge: different permutations, scaling.
- Anchor in a small common sample.
- Hadoop implementation.
- 100-fold improvement (size/speedup).
28 Road ahead
- Important first steps / results pave the way, but we have merely scratched the surface.
- Randomized tensor algorithms based on generalized sampling.
- Other models? Rate-distortion theory for big tensor data compression?
- Statistically and computationally efficient algorithms - a big open issue.
- Distributed computations - not all data reside in one place - Hadoop / multicore.
- Statistical inference for big tensors.
- Applications.
Chapter 1 SVD, PCA & Preprocessing Part 2: Pre-processing and selecting the rank Pre-processing Skillicorn chapter 3.1 2 Why pre-process? Consider matrix of weather data Monthly temperatures in degrees
More informationLearning goals: students learn to use the SVD to find good approximations to matrices and to compute the pseudoinverse.
Application of the SVD: Compression and Pseudoinverse Learning goals: students learn to use the SVD to find good approximations to matrices and to compute the pseudoinverse. Low rank approximation One
More informationLinear Algebra V = T = ( 4 3 ).
Linear Algebra Vectors A column vector is a list of numbers stored vertically The dimension of a column vector is the number of values in the vector W is a -dimensional column vector and V is a 5-dimensional
More informationWindow-based Tensor Analysis on High-dimensional and Multi-aspect Streams
Window-based Tensor Analysis on High-dimensional and Multi-aspect Streams Jimeng Sun Spiros Papadimitriou Philip S. Yu Carnegie Mellon University Pittsburgh, PA, USA IBM T.J. Watson Research Center Hawthorne,
More informationThis can be accomplished by left matrix multiplication as follows: I
1 Numerical Linear Algebra 11 The LU Factorization Recall from linear algebra that Gaussian elimination is a method for solving linear systems of the form Ax = b, where A R m n and bran(a) In this method
More information.. CSC 566 Advanced Data Mining Alexander Dekhtyar..
.. CSC 566 Advanced Data Mining Alexander Dekhtyar.. Information Retrieval Latent Semantic Indexing Preliminaries Vector Space Representation of Documents: TF-IDF Documents. A single text document is a
More informationAMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences)
AMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences) Lecture 1: Course Overview; Matrix Multiplication Xiangmin Jiao Stony Brook University Xiangmin Jiao Numerical
More informationOrthogonal tensor decomposition
Orthogonal tensor decomposition Daniel Hsu Columbia University Largely based on 2012 arxiv report Tensor decompositions for learning latent variable models, with Anandkumar, Ge, Kakade, and Telgarsky.
More information14.2 QR Factorization with Column Pivoting
page 531 Chapter 14 Special Topics Background Material Needed Vector and Matrix Norms (Section 25) Rounding Errors in Basic Floating Point Operations (Section 33 37) Forward Elimination and Back Substitution
More informationSingular Value Decompsition
Singular Value Decompsition Massoud Malek One of the most useful results from linear algebra, is a matrix decomposition known as the singular value decomposition It has many useful applications in almost
More informationLecture 9: Numerical Linear Algebra Primer (February 11st)
10-725/36-725: Convex Optimization Spring 2015 Lecture 9: Numerical Linear Algebra Primer (February 11st) Lecturer: Ryan Tibshirani Scribes: Avinash Siravuru, Guofan Wu, Maosheng Liu Note: LaTeX template
More informationStructure in Data. A major objective in data analysis is to identify interesting features or structure in the data.
Structure in Data A major objective in data analysis is to identify interesting features or structure in the data. The graphical methods are very useful in discovering structure. There are basically two
More informationCS281 Section 4: Factor Analysis and PCA
CS81 Section 4: Factor Analysis and PCA Scott Linderman At this point we have seen a variety of machine learning models, with a particular emphasis on models for supervised learning. In particular, we
More informationAMS526: Numerical Analysis I (Numerical Linear Algebra)
AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 1: Course Overview & Matrix-Vector Multiplication Xiangmin Jiao SUNY Stony Brook Xiangmin Jiao Numerical Analysis I 1 / 20 Outline 1 Course
More informationMatrix decompositions
Matrix decompositions Zdeněk Dvořák May 19, 2015 Lemma 1 (Schur decomposition). If A is a symmetric real matrix, then there exists an orthogonal matrix Q and a diagonal matrix D such that A = QDQ T. The
More informationLinear Algebra (Review) Volker Tresp 2018
Linear Algebra (Review) Volker Tresp 2018 1 Vectors k, M, N are scalars A one-dimensional array c is a column vector. Thus in two dimensions, ( ) c1 c = c 2 c i is the i-th component of c c T = (c 1, c
More informationPostgraduate Course Signal Processing for Big Data (MSc)
Postgraduate Course Signal Processing for Big Data (MSc) Jesús Gustavo Cuevas del Río E-mail: gustavo.cuevas@upm.es Work Phone: +34 91 549 57 00 Ext: 4039 Course Description Instructor Information Course
More informationThis work has been submitted to ChesterRep the University of Chester s online research repository.
This work has been submitted to ChesterRep the University of Chester s online research repository http://chesterrep.openrepository.com Author(s): Daniel Tock Title: Tensor decomposition and its applications
More informationThe Canonical Tensor Decomposition and Its Applications to Social Network Analysis
The Canonical Tensor Decomposition and Its Applications to Social Network Analysis Evrim Acar, Tamara G. Kolda and Daniel M. Dunlavy Sandia National Labs Sandia is a multiprogram laboratory operated by
More informationImage Registration Lecture 2: Vectors and Matrices
Image Registration Lecture 2: Vectors and Matrices Prof. Charlene Tsai Lecture Overview Vectors Matrices Basics Orthogonal matrices Singular Value Decomposition (SVD) 2 1 Preliminary Comments Some of this
More informationIntroduction to the Tensor Train Decomposition and Its Applications in Machine Learning
Introduction to the Tensor Train Decomposition and Its Applications in Machine Learning Anton Rodomanov Higher School of Economics, Russia Bayesian methods research group (http://bayesgroup.ru) 14 March
More informationA Simpler Approach to Low-Rank Tensor Canonical Polyadic Decomposition
A Simpler Approach to Low-ank Tensor Canonical Polyadic Decomposition Daniel L. Pimentel-Alarcón University of Wisconsin-Madison Abstract In this paper we present a simple and efficient method to compute
More informationDefinition 2.3. We define addition and multiplication of matrices as follows.
14 Chapter 2 Matrices In this chapter, we review matrix algebra from Linear Algebra I, consider row and column operations on matrices, and define the rank of a matrix. Along the way prove that the row
More informationsparse and low-rank tensor recovery Cubic-Sketching
Sparse and Low-Ran Tensor Recovery via Cubic-Setching Guang Cheng Department of Statistics Purdue University www.science.purdue.edu/bigdata CCAM@Purdue Math Oct. 27, 2017 Joint wor with Botao Hao and Anru
More informationBasic Concepts in Matrix Algebra
Basic Concepts in Matrix Algebra An column array of p elements is called a vector of dimension p and is written as x p 1 = x 1 x 2. x p. The transpose of the column vector x p 1 is row vector x = [x 1
More informationDS-GA 1002 Lecture notes 10 November 23, Linear models
DS-GA 2 Lecture notes November 23, 2 Linear functions Linear models A linear model encodes the assumption that two quantities are linearly related. Mathematically, this is characterized using linear functions.
More informationMATRIX COMPLETION AND TENSOR RANK
MATRIX COMPLETION AND TENSOR RANK HARM DERKSEN Abstract. In this paper, we show that the low rank matrix completion problem can be reduced to the problem of finding the rank of a certain tensor. arxiv:1302.2639v2
More informationInformation Retrieval
Introduction to Information CS276: Information and Web Search Christopher Manning and Pandu Nayak Lecture 13: Latent Semantic Indexing Ch. 18 Today s topic Latent Semantic Indexing Term-document matrices
More informationConceptual Questions for Review
Conceptual Questions for Review Chapter 1 1.1 Which vectors are linear combinations of v = (3, 1) and w = (4, 3)? 1.2 Compare the dot product of v = (3, 1) and w = (4, 3) to the product of their lengths.
More informationEE731 Lecture Notes: Matrix Computations for Signal Processing
EE731 Lecture Notes: Matrix Computations for Signal Processing James P. Reilly c Department of Electrical and Computer Engineering McMaster University September 22, 2005 0 Preface This collection of ten
More informationLarge Scale Tensor Decompositions: Algorithmic Developments and Applications
Large Scale Tensor Decompositions: Algorithmic Developments and Applications Evangelos Papalexakis, U Kang, Christos Faloutsos, Nicholas Sidiropoulos, Abhay Harpale Carnegie Mellon University, KAIST, University
More informationLeast Squares. Tom Lyche. October 26, Centre of Mathematics for Applications, Department of Informatics, University of Oslo
Least Squares Tom Lyche Centre of Mathematics for Applications, Department of Informatics, University of Oslo October 26, 2010 Linear system Linear system Ax = b, A C m,n, b C m, x C n. under-determined
More informationLecture 4. Tensor-Related Singular Value Decompositions. Charles F. Van Loan
From Matrix to Tensor: The Transition to Numerical Multilinear Algebra Lecture 4. Tensor-Related Singular Value Decompositions Charles F. Van Loan Cornell University The Gene Golub SIAM Summer School 2010
More informationBOOLEAN TENSOR FACTORIZATIONS. Pauli Miettinen 14 December 2011
BOOLEAN TENSOR FACTORIZATIONS Pauli Miettinen 14 December 2011 BACKGROUND: TENSORS AND TENSOR FACTORIZATIONS X BACKGROUND: TENSORS AND TENSOR FACTORIZATIONS C X A B BACKGROUND: BOOLEAN MATRIX FACTORIZATIONS
More informationLecture 1: Introduction to low-rank tensor representation/approximation. Center for Uncertainty Quantification. Alexander Litvinenko
tifica Lecture 1: Introduction to low-rank tensor representation/approximation Alexander Litvinenko http://sri-uq.kaust.edu.sa/ KAUST Figure : KAUST campus, 5 years old, approx. 7000 people (include 1400
More informationLinear Algebra in Computer Vision. Lecture2: Basic Linear Algebra & Probability. Vector. Vector Operations
Linear Algebra in Computer Vision CSED441:Introduction to Computer Vision (2017F Lecture2: Basic Linear Algebra & Probability Bohyung Han CSE, POSTECH bhhan@postech.ac.kr Mathematics in vector space Linear
More informationBasics of Multivariate Modelling and Data Analysis
Basics of Multivariate Modelling and Data Analysis Kurt-Erik Häggblom 6. Principal component analysis (PCA) 6.1 Overview 6.2 Essentials of PCA 6.3 Numerical calculation of PCs 6.4 Effects of data preprocessing
More informationSPARSE signal representations have gained popularity in recent
6958 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 10, OCTOBER 2011 Blind Compressed Sensing Sivan Gleichman and Yonina C. Eldar, Senior Member, IEEE Abstract The fundamental principle underlying
More informationc 2008 Society for Industrial and Applied Mathematics
SIAM J MATRIX ANAL APPL Vol 30, No 3, pp 1219 1232 c 2008 Society for Industrial and Applied Mathematics A JACOBI-TYPE METHOD FOR COMPUTING ORTHOGONAL TENSOR DECOMPOSITIONS CARLA D MORAVITZ MARTIN AND
More information