High Performance Parallel Tucker Decomposition of Sparse Tensors
1 High Performance Parallel Tucker Decomposition of Sparse Tensors
Oguz Kaya, INRIA and LIP, ENS Lyon, France
SIAM PP 16, April 14, 2016, Paris, France
Joint work with: Bora Uçar, CNRS and LIP, ENS Lyon, France
1/16 Parallel Sparse Tucker Decompositions
2 Tucker Tensor Decomposition
[Figure: a third-order I x J x K tensor X approximated by a core tensor G multiplied in each mode by factor matrices A (I x R1), B (J x R2), and C (K x R3).]
Tucker decomposition provides a rank-(R1, ..., RN) approximation of a tensor.
It consists of a core tensor G ∈ R^(R1 x ... x RN) and N factor matrices having R1, ..., RN columns.
We are interested in the case where X is big, sparse, and of low rank.
Examples: Google web queries, Netflix movie ratings, Amazon product reviews, etc.
4 Tucker Tensor Decomposition
Related work:
Matlab Tensor Toolbox by Kolda et al.
Efficient and scalable computations with sparse tensors (Baskaran et al., 2012)
Parallel tensor compression for large-scale scientific data (Austin et al., 2015)
HaTen2: Billion-scale tensor decompositions (Jeon et al., 2015)
Applications (in data mining):
CubeSVD: A novel approach to personalized web search (Sun et al., 2005)
Tag recommendations based on tensor dimensionality reduction (Symeonidis et al., 2008)
Extended feature combination model for recommendations in location-based mobile services (Sattari et al., 2015)
Goal: compute a sparse Tucker decomposition in parallel (shared/distributed memory).
7 Higher Order Orthogonal Iteration (HOOI)
Algorithm: HOOI for 3rd-order tensors
repeat
  1  Â ← [X ×2 B ×3 C](1)
  2  A ← TRSVD(Â, R1)   // R1 leading left singular vectors
  3  B̂ ← [X ×1 A ×3 C](2)
  4  B ← TRSVD(B̂, R2)
  5  Ĉ ← [X ×1 A ×2 B](3)
  6  C ← TRSVD(Ĉ, R3)
until no more improvement or maximum iterations reached
  7  G ← X ×1 A ×2 B ×3 C
  8  return [G; A, B, C]
We discuss the case where R1 = R2 = ... = RN = R and N = 3.
A ∈ R^(I x R), B ∈ R^(J x R), and C ∈ R^(K x R) are dense.
Â ← [X ×2 B ×3 C](1) ∈ R^(I x R²) is called a tensor-times-matrix multiply (TTM).
Â ∈ R^(I x R^(N-1)), B̂ ∈ R^(J x R^(N-1)), and Ĉ ∈ R^(K x R^(N-1)) are dense (R² columns for N = 3).
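The alternating structure of HOOI can be sketched serially for a small dense tensor. This is only an illustration of the algorithm above, not the sparse parallel implementation the talk develops; all function names are ours, and the random initialization and einsum-based contractions are our choices.

```python
import numpy as np

def unfold(T, mode):
    # Mode-n unfolding: bring `mode` axis first, flatten the rest.
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def trsvd(M, r):
    # r leading left singular vectors of M (truncated SVD).
    U, _, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :r]

def hooi(X, R, iters=10):
    """Dense HOOI sketch for a 3rd-order tensor X, rank (R, R, R)."""
    rng = np.random.default_rng(0)
    # Random orthonormal starting factors (an HOSVD init would also work).
    A = np.linalg.qr(rng.standard_normal((X.shape[0], R)))[0]
    B = np.linalg.qr(rng.standard_normal((X.shape[1], R)))[0]
    C = np.linalg.qr(rng.standard_normal((X.shape[2], R)))[0]
    for _ in range(iters):
        # Each factor update: TTM in the other two modes, then TRSVD.
        A = trsvd(unfold(np.einsum('ijk,jq,kr->iqr', X, B, C), 0), R)
        B = trsvd(unfold(np.einsum('ijk,ip,kr->pjr', X, A, C), 1), R)
        C = trsvd(unfold(np.einsum('ijk,ip,jq->pqk', X, A, B), 2), R)
    # Core tensor: project X onto the factor column spaces.
    G = np.einsum('ijk,ip,jq,kr->pqr', X, A, B, C)
    return G, A, B, C
```

On a tensor of exact multilinear rank (R, R, R), one sweep already recovers the factor column spaces, so the reconstruction error drops to machine precision.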
12 Tensor-Times-Matrix Multiply (TTM)
Â ← [X ×2 B ×3 C](1), Â ∈ R^(I x R²).
B(j,:) ⊗ C(k,:) ∈ R^(R²) is a Kronecker product of rows.
For each nonzero x_{i,j,k}, Â(i,:) receives the update x_{i,j,k} [B(j,:) ⊗ C(k,:)].
Algorithm: Â ← [X ×2 B ×3 C](1)
1  Â ← zeros(I, R²)
foreach x_{i,j,k} ∈ X do
2    Â(i,:) ← Â(i,:) + x_{i,j,k} [B(j,:) ⊗ C(k,:)]
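The per-nonzero update above takes only a few lines of Python; numpy's kron plays the role of the row Kronecker product. The coordinate-list representation and function name are ours, chosen for the sketch.

```python
import numpy as np

def ttm_23(nonzeros, B, C, I):
    """Sketch of A_hat <- [X x_2 B x_3 C]_(1) for a sparse 3rd-order X.

    nonzeros: iterable of (i, j, k, value) entries of X
    B: J x R factor matrix, C: K x R factor matrix
    Returns the dense I x R^2 matrix A_hat.
    """
    R = B.shape[1]
    A_hat = np.zeros((I, R * R))
    for i, j, k, v in nonzeros:
        # Each nonzero contributes v * (B(j,:) ⊗ C(k,:)) to row i.
        A_hat[i, :] += v * np.kron(B[j, :], C[k, :])
    return A_hat
```

The cost is O(nnz(X) * R²) and each nonzero touches exactly one row of Â, which is what makes the per-nonzero update a natural fine-grain task later in the talk.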
13 Outline 1 Introduction 2 Parallel HOOI 3 Results 4 Conclusion
14 (Bad) Fine-Grain Parallel TTM within Tucker-ALS
A and Â are distributed rowwise: process p owns and computes A(Ip,:) and Â(Ip,:).
Tensor nonzeros are partitioned (arbitrarily): process p owns the subset of nonzeros Xp.
Performing x_{i,j,k} [B(j,:) ⊗ C(k,:)] and generating a partial result for Â(i,:) is a fine-grain task.
We use a post-communication scheme at each iteration:
at the beginning, rows of A, B, and C are available;
at the end, only A is updated and communicated.
Partial row results of Â are sent/received (fold); rows of A(Ip,:) are sent/received (expand).
Algorithm: Computing A in fine-grain HOOI at process p
foreach x_{i,j,k} ∈ Xp do
1    Â(i,:) ← Â(i,:) + x_{i,j,k} [B(j,:) ⊗ C(k,:)]
2  Send/receive and sum up partial rows of Â
3  A(Ip,:) ← TRSVD(Â, R)
4  Send/receive rows of A
15 (Bad) Fine-Grain Parallel TTM within Tucker-ALS
The numbers of rows sent/received in fold and expand are equal, but:
each communication unit of expand has size R,
while each communication unit of fold has size R^(N-1).
We therefore want to avoid assembling Â in the fold communication, yet we still need to compute TRSVD(Â, R).
17 Computing TRSVD
[Figure: Â = Â1 + Â2 + Â3, the sum of the unassembled partial results owned by the processes.]
Gram matrix ÂÂᵀ? Iterative solvers?
Either way, we need to perform Âx and Âᵀx efficiently.
20 Computing y ← Âx
[Figure: y ← Âx computed as y1 + y2 + y3, where each process applies its unassembled partial result Âp to x locally.]
For each unit of communication, we perform extra work in MxV.
Instead of communicating R^(N-1) entries, we communicate 1 (per SVD iteration)!
y ← Âᵀx works in reverse with the same communication cost.
The row distribution of y and of the left singular vectors is the same as Â's, so A gets the same row distribution as Â.
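The point of the two slides above is that Â = Σp Âp never needs to be assembled: an iterative SVD solver only requires Âx and Âᵀx callbacks. A serial stand-in for the distributed SLEPc solve can be sketched with scipy's LinearOperator; the function name, the (rows, block) representation of a partial result, and the two-piece splitting are our illustrative choices.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, svds

def trsvd_from_pieces(pieces, R):
    """Leading R left singular vectors of A_hat = sum of partial results,
    without ever assembling A_hat (mirroring the fold-avoiding scheme).

    pieces: list of (rows, block) pairs, where `block` holds the partial
    rows of A_hat produced by one process and `rows` their row indices.
    """
    m = 1 + max(int(r) for rows, _ in pieces for r in rows)
    n = pieces[0][1].shape[1]

    def matvec(x):
        x = np.asarray(x).ravel()
        y = np.zeros(m)
        for rows, block in pieces:      # y += A_hat_p @ x, one piece at a time
            y[rows] += block @ x
        return y

    def rmatvec(y):
        y = np.asarray(y).ravel()
        x = np.zeros(n)
        for rows, block in pieces:      # x += A_hat_p^T @ y
            x += block.T @ y[rows]
        return x

    A = LinearOperator((m, n), matvec=matvec, rmatvec=rmatvec)
    U, s, _ = svds(A, k=R)
    return U[:, np.argsort(-s)]         # columns by decreasing singular value
```

In the distributed setting each matvec additionally requires the one-entry-per-row exchange described above; here the summation over pieces plays that role in shared memory.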
25 (Good) Fine-Grain Parallel TTM within Tucker-ALS
Algorithm: Computing A in fine-grain HOOI at process p
foreach x_{i,j,k} ∈ Xp do
1    Â(i,:) ← Â(i,:) + x_{i,j,k} [B(j,:) ⊗ C(k,:)]
2  A(Ip,:) ← TRSVD(Â, R, MxV(...), MTxV(...))
3  Send/receive rows of A
26 Hypergraph Model for Parallel HOOI
Multi-constraint hypergraph partitioning: we balance computation and memory costs.
By minimizing the cutsize of the hypergraph, we minimize:
the total communication volume of MxV/MTxV,
the total extra MxV/MTxV work, and
the total volume of communication for TTM.
Ideally, we should minimize the maximum, not the total.
[Figure: hypergraph for X = {(1,2,3), (2,3,1), (3,1,2)}; each nonzero x_{i,j,k} is a vertex connected to the nets a_i, b_j, and c_k of the factor-matrix rows it touches.]
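The fine-grain model in the figure can be built directly from the nonzero coordinates: one vertex per nonzero, one net per factor-matrix row. The sketch below only constructs the structure (a real run would hand it to a multi-constraint partitioner such as PaToH, which is not invoked here); the dict-of-sets representation is our choice.

```python
from collections import defaultdict

def fine_grain_hypergraph(nonzeros):
    """One vertex per nonzero of X, one net per row of A, B, and C.

    Net ('a', i) connects every nonzero whose mode-1 index is i,
    and similarly ('b', j) and ('c', k) for modes 2 and 3.
    Returns a dict mapping each net to its set of vertex ids.
    """
    nets = defaultdict(set)
    for v, (i, j, k) in enumerate(nonzeros):
        nets[('a', i)].add(v)   # row i of A
        nets[('b', j)].add(v)   # row j of B
        nets[('c', k)].add(v)   # row k of C
    return dict(nets)

# The slide's toy tensor: X = {(1,2,3), (2,3,1), (3,1,2)}.
H = fine_grain_hypergraph([(1, 2, 3), (2, 3, 1), (3, 1, 2)])
```

A net spanning several parts then corresponds to a row whose partial results must be communicated, which is why cutsize tracks communication volume.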
27 Outline 1 Introduction 2 Parallel HOOI 3 Results 4 Conclusion
28 Experimental Setup
HyperTensor:
Hybrid OpenMP/MPI code in C++.
Depends on BLAS, LAPACK, and the C++11 STL.
SLEPc/PETSc for distributed-memory TRSVD computations.
IBM BlueGene/Q machine:
16 GB memory and 16 cores (at 1.6 GHz) per node.
Experiments using up to 4096 cores (256 nodes).
Ri is set to 5 for the 4-dimensional tensors and 10 for the 3-dimensional ones.
Tensor sizes:
Tensor    | I1   | I2   | I3  | I4   | #nonzeros
Netflix   | 480K | 17K  | 2K  | -    | 100M
NELL      | 3.2M |      | K   | -    | 78M
Delicious | 1K   | 530K | 17M | 2.4M | 140M
Flickr    |      | K    | 28M | 1.6M | 112M
29 Results - Flickr/Delicious
Per-iteration runtime of the parallel HOOI (in seconds).
[Table: runtimes per #nodes/#cores for fine-hp, fine-rd, coarse-hp, and coarse-bl on Delicious and Flickr.]
The coarse-grain kernel is slow due to load imbalance and communication.
On Delicious, fine-hp is 1.8x/5.4x/6.4x faster than fine-rd/coarse-hp/coarse-bl.
On Flickr, fine-hp is 1.5x/3.7x/4.3x faster than fine-rd/coarse-hp/coarse-bl.
All instances achieve scalability to 4096 cores.
30 Results - NELL/Netflix
Per-iteration runtime of the parallel HOOI (in seconds).
[Table: runtimes per #nodes/#cores for fine-hp, fine-rd, coarse-hp, and coarse-bl on NELL and Netflix.]
On Netflix, fine-hp is 2.8x/5x/5.5x faster than fine-rd/coarse-hp/coarse-bl.
On NELL, fine-rd is faster than fine-hp (fine-hp achieves 5x less total communication but incurs 2x more maximum communication).
31 Outline 1 Introduction 2 Parallel HOOI 3 Results 4 Conclusion
32 Conclusion
We provide:
the first high-performance shared/distributed-memory parallel algorithm and implementation for the Tucker decomposition of sparse tensors, and
hypergraph partitioning models of these computations for better scalability.
We achieve scalability up to 4096 cores even with random partitioning.
We enable Tucker-based tensor analysis of very big sparse data.
35 References
[1] O. Kaya and B. Uçar, High-performance parallel algorithms for the Tucker decomposition of higher order sparse tensors, Tech. Rep. RR-8801, Inria, Grenoble Rhône-Alpes, Oct. 2015.
[2] W. Austin, G. Ballard, and T. G. Kolda, Parallel tensor compression for large-scale scientific data, tech. rep., arXiv, 2015.
[3] I. Jeon, E. E. Papalexakis, U. Kang, and C. Faloutsos, HaTen2: Billion-scale tensor decompositions, in IEEE 31st International Conference on Data Engineering (ICDE), 2015.
[4] M. Baskaran, B. Meister, N. Vasilache, and R. Lethin, Efficient and scalable computations with sparse tensors, in IEEE Conference on High Performance Extreme Computing (HPEC), pp. 1-6, Sept. 2012.
[5] J. Sun, H. Zeng, H. Liu, Y. Lu, and Z. Chen, CubeSVD: A novel approach to personalized web search, in Proceedings of the 14th International Conference on World Wide Web (WWW '05), New York, NY, USA, ACM, 2005.
36 Contact
perso.ens-lyon.fr/bora.ucar
More informationTENSOR APPROXIMATION TOOLS FREE OF THE CURSE OF DIMENSIONALITY
TENSOR APPROXIMATION TOOLS FREE OF THE CURSE OF DIMENSIONALITY Eugene Tyrtyshnikov Institute of Numerical Mathematics Russian Academy of Sciences (joint work with Ivan Oseledets) WHAT ARE TENSORS? Tensors
More informationA Randomized Approach for Crowdsourcing in the Presence of Multiple Views
A Randomized Approach for Crowdsourcing in the Presence of Multiple Views Presenter: Yao Zhou joint work with: Jingrui He - 1 - Roadmap Motivation Proposed framework: M2VW Experimental results Conclusion
More informationLarge-scale Matrix Factorization. Kijung Shin Ph.D. Student, CSD
Large-scale Matrix Factorization Kijung Shin Ph.D. Student, CSD Roadmap Matrix Factorization (review) Algorithms Distributed SGD: DSGD Alternating Least Square: ALS Cyclic Coordinate Descent: CCD++ Experiments
More informationSTA141C: Big Data & High Performance Statistical Computing
STA141C: Big Data & High Performance Statistical Computing Numerical Linear Algebra Background Cho-Jui Hsieh UC Davis May 15, 2018 Linear Algebra Background Vectors A vector has a direction and a magnitude
More informationarxiv: v1 [cs.lg] 18 Nov 2018
THE CORE CONSISTENCY OF A COMPRESSED TENSOR Georgios Tsitsikas, Evangelos E. Papalexakis Dept. of Computer Science and Engineering University of California, Riverside arxiv:1811.7428v1 [cs.lg] 18 Nov 18
More informationProbabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms
Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms Adrien Todeschini Inria Bordeaux JdS 2014, Rennes Aug. 2014 Joint work with François Caron (Univ. Oxford), Marie
More informationUsing SVD to Recommend Movies
Michael Percy University of California, Santa Cruz Last update: December 12, 2009 Last update: December 12, 2009 1 / Outline 1 Introduction 2 Singular Value Decomposition 3 Experiments 4 Conclusion Last
More informationParallel Numerical Algorithms
Parallel Numerical Algorithms Chapter 6 Structured and Low Rank Matrices Section 6.3 Numerical Optimization Michael T. Heath and Edgar Solomonik Department of Computer Science University of Illinois at
More informationThe Singular Value Decomposition (SVD) and Principal Component Analysis (PCA)
Chapter 5 The Singular Value Decomposition (SVD) and Principal Component Analysis (PCA) 5.1 Basics of SVD 5.1.1 Review of Key Concepts We review some key definitions and results about matrices that will
More informationLU factorization with Panel Rank Revealing Pivoting and its Communication Avoiding version
1 LU factorization with Panel Rank Revealing Pivoting and its Communication Avoiding version Amal Khabou Advisor: Laura Grigori Université Paris Sud 11, INRIA Saclay France SIAMPP12 February 17, 2012 2
More informationMatrix Decomposition and Latent Semantic Indexing (LSI) Introduction to Information Retrieval INF 141/ CS 121 Donald J. Patterson
Matrix Decomposition and Latent Semantic Indexing (LSI) Introduction to Information Retrieval INF 141/ CS 121 Donald J. Patterson Latent Semantic Indexing Outline Introduction Linear Algebra Refresher
More informationDFacTo: Distributed Factorization of Tensors
DFacTo: Distributed Factorization of Tensors Joon Hee Choi Electrical and Computer Engineering Purdue University West Lafayette IN 47907 choi240@purdue.edu S. V. N. Vishwanathan Statistics and Computer
More informationEfficient CP-ALS and Reconstruction From CP
Efficient CP-ALS and Reconstruction From CP Jed A. Duersch & Tamara G. Kolda Sandia National Laboratories Livermore, CA Sandia National Laboratories is a multimission laboratory managed and operated by
More informationTechnical Report TR SPLATT: Efficient and Parallel Sparse Tensor-Matrix Multiplication
Technical Report Department of Computer Science and Engineering University of Minnesota 4-192 Keller Hall 200 Union Street SE Minneapolis, MN 55455-0159 USA TR 15-008 SPLATT: Efficient and Parallel Sparse
More informationTransposition Mechanism for Sparse Matrices on Vector Processors
Transposition Mechanism for Sparse Matrices on Vector Processors Pyrrhos Stathis Stamatis Vassiliadis Sorin Cotofana Electrical Engineering Department, Delft University of Technology, Delft, The Netherlands
More informationComputing least squares condition numbers on hybrid multicore/gpu systems
Computing least squares condition numbers on hybrid multicore/gpu systems M. Baboulin and J. Dongarra and R. Lacroix Abstract This paper presents an efficient computation for least squares conditioning
More informationLearning goals: students learn to use the SVD to find good approximations to matrices and to compute the pseudoinverse.
Application of the SVD: Compression and Pseudoinverse Learning goals: students learn to use the SVD to find good approximations to matrices and to compute the pseudoinverse. Low rank approximation One
More informationA fast randomized algorithm for overdetermined linear least-squares regression
A fast randomized algorithm for overdetermined linear least-squares regression Vladimir Rokhlin and Mark Tygert Technical Report YALEU/DCS/TR-1403 April 28, 2008 Abstract We introduce a randomized algorithm
More informationSTA141C: Big Data & High Performance Statistical Computing
STA141C: Big Data & High Performance Statistical Computing Lecture 5: Numerical Linear Algebra Cho-Jui Hsieh UC Davis April 20, 2017 Linear Algebra Background Vectors A vector has a direction and a magnitude
More informationSparse LU Factorization on GPUs for Accelerating SPICE Simulation
Nano-scale Integrated Circuit and System (NICS) Laboratory Sparse LU Factorization on GPUs for Accelerating SPICE Simulation Xiaoming Chen PhD Candidate Department of Electronic Engineering Tsinghua University,
More informationLU Factorization with Panel Rank Revealing Pivoting and Its Communication Avoiding Version
LU Factorization with Panel Rank Revealing Pivoting and Its Communication Avoiding Version Amal Khabou James Demmel Laura Grigori Ming Gu Electrical Engineering and Computer Sciences University of California
More informationAMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences)
AMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences) Lecture 19: Computing the SVD; Sparse Linear Systems Xiangmin Jiao Stony Brook University Xiangmin Jiao Numerical
More informationUSING SINGULAR VALUE DECOMPOSITION (SVD) AS A SOLUTION FOR SEARCH RESULT CLUSTERING
POZNAN UNIVE RSIY OF E CHNOLOGY ACADE MIC JOURNALS No. 80 Electrical Engineering 2014 Hussam D. ABDULLA* Abdella S. ABDELRAHMAN* Vaclav SNASEL* USING SINGULAR VALUE DECOMPOSIION (SVD) AS A SOLUION FOR
More informationLinear Algebra (Review) Volker Tresp 2017
Linear Algebra (Review) Volker Tresp 2017 1 Vectors k is a scalar (a number) c is a column vector. Thus in two dimensions, c = ( c1 c 2 ) (Advanced: More precisely, a vector is defined in a vector space.
More informationAN INDEPENDENT LOOPS SEARCH ALGORITHM FOR SOLVING INDUCTIVE PEEC LARGE PROBLEMS
Progress In Electromagnetics Research M, Vol. 23, 53 63, 2012 AN INDEPENDENT LOOPS SEARCH ALGORITHM FOR SOLVING INDUCTIVE PEEC LARGE PROBLEMS T.-S. Nguyen *, J.-M. Guichon, O. Chadebec, G. Meunier, and
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015 Outline One versus all/one versus one Ranking loss for multiclass/multilabel classification Scaling to millions of labels Multiclass
More informationAn Optimized Interestingness Hotspot Discovery Framework for Large Gridded Spatio-temporal Datasets
IEEE Big Data 2015 Big Data in Geosciences Workshop An Optimized Interestingness Hotspot Discovery Framework for Large Gridded Spatio-temporal Datasets Fatih Akdag and Christoph F. Eick Department of Computer
More informationSparse Solutions of Linear Systems of Equations and Sparse Modeling of Signals and Images: Final Presentation
Sparse Solutions of Linear Systems of Equations and Sparse Modeling of Signals and Images: Final Presentation Alfredo Nava-Tudela John J. Benedetto, advisor 5/10/11 AMSC 663/664 1 Problem Let A be an n
More informationA Note on Google s PageRank
A Note on Google s PageRank According to Google, google-search on a given topic results in a listing of most relevant web pages related to the topic. Google ranks the importance of webpages according to
More informationarxiv: v1 [cs.lg] 26 Jul 2017
Updating Singular Value Decomposition for Rank One Matrix Perturbation Ratnik Gandhi, Amoli Rajgor School of Engineering & Applied Science, Ahmedabad University, Ahmedabad-380009, India arxiv:70708369v
More informationParallel Singular Value Decomposition. Jiaxing Tan
Parallel Singular Value Decomposition Jiaxing Tan Outline What is SVD? How to calculate SVD? How to parallelize SVD? Future Work What is SVD? Matrix Decomposition Eigen Decomposition A (non-zero) vector
More informationLarge Scale Tensor Decompositions: Algorithmic Developments and Applications
Large Scale Tensor Decompositions: Algorithmic Developments and Applications Evangelos Papalexakis, U Kang, Christos Faloutsos, Nicholas Sidiropoulos, Abhay Harpale Carnegie Mellon University, KAIST, University
More information