Parallel Sparse Tensor Decompositions using HiCOO Format


1 Parallel Sparse Tensor Decompositions using HiCOO Format
Jiajia Li, Jee Choi, Richard Vuduc
May 8, SIAM ALA 8
Figure sources: A brief survey of tensors by Berton Earnshaw and NVIDIA Tensor Cores

2 Outline Background HiCOO Format Multicore CP-ALS Distributed CPDs

3 Tensors Social Networks Quantum Chemistry NOT ENOUGH Deep Learning

4 Applications of Tensor Methods
Data analytics and compression: natural language processing, healthcare analytics, social network analytics, brain signal processing, and deep learning.
Scientific computing: quantum chemistry, computational physics, quantum mechanics.
[Figure: EHR phenotyping example. Labels: Demographic, Procedure, Diagnosis, Medication, EHR, Clinical notes, Lab tests, Phenotyping, Medical concepts (phenotypes).]

5 Tensors & Decompositions
Tensors, or multi-way arrays, provide a natural way to represent multidimensional data. Special cases: matrices are 2-D tensors and vectors are 1-D tensors.
Tensor mode or order: the tensor dimension.
A sparse tensor, a tensor consisting mostly of zero entries, is common in real applications.
Tensor decomposition: an extension of matrix factorization, such as the singular value decomposition (SVD), used to analyze tensor features.
[Figure: SVD of a matrix next to the CANDECOMP/PARAFAC (CP) decomposition of a 3-D tensor with indices i = 1, ..., I; j = 1, ..., J; k = 1, ..., K.]
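The CP model named on this slide can be made concrete with a short NumPy sketch. It assumes a 3-D tensor and three factor matrices A (I x R), B (J x R), and C (K x R); the function name `cp_reconstruct` and the rank R = 2 are illustrative choices, not taken from the slides.

```python
import numpy as np

def cp_reconstruct(A, B, C):
    """Rebuild a dense 3-D tensor from rank-R CP factors.

    A is I x R, B is J x R, C is K x R. Entry (i, j, k) of the result is
    the sum over r of A[i, r] * B[j, r] * C[k, r].
    """
    return np.einsum("ir,jr,kr->ijk", A, B, C)

# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
A, B, C = rng.random((4, 2)), rng.random((3, 2)), rng.random((5, 2))
X = cp_reconstruct(A, B, C)
print(X.shape)  # (4, 3, 5)
```

A CP decomposition runs in the opposite direction: given X, it searches for factors whose reconstruction is close to X.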

6 CP-ALS and CP-APR
CP-ALS (alternating least squares): for general data. Models the input data with a Gaussian distribution. Fitting function: least squares. Core operation: MTTKRP.
CP-APR (alternating Poisson regression): for non-negative data, e.g. count data. Models the input data with a Poisson distribution. Fitting function: Poisson likelihood. Core operation: element-wise division.

7 CP-ALS
Bottleneck: MTTKRP
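To show where MTTKRP sits inside CP-ALS, here is a minimal dense sketch of the textbook algorithm for a 3-D tensor. It is an assumption-laden illustration, not the ParTI! or SPLATT implementation: the names `cp_als` and `rel_error`, the random initialization, and the use of a pseudo-inverse are all choices made for this example.

```python
import numpy as np

def rel_error(X, A, B, C):
    """Relative least-squares misfit between X and its CP model."""
    model = np.einsum("ir,jr,kr->ijk", A, B, C)
    return np.linalg.norm(X - model) / np.linalg.norm(X)

def cp_als(X, R, iters, seed=0):
    """Plain CP-ALS for a dense 3-D array X (illustrative, not optimized)."""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A, B, C = rng.random((I, R)), rng.random((J, R)), rng.random((K, R))
    for _ in range(iters):
        # Each update is one MTTKRP (the einsum) followed by a small R x R solve.
        A = np.einsum("ijk,jr,kr->ir", X, B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = np.einsum("ijk,ir,kr->jr", X, A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = np.einsum("ijk,ir,jr->kr", X, A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

# Hypothetical test tensor of exact rank 2.
rng = np.random.default_rng(1)
X = np.einsum("ir,jr,kr->ijk", rng.random((6, 2)), rng.random((5, 2)), rng.random((4, 2)))
err_1 = rel_error(X, *cp_als(X, R=2, iters=1))
err_20 = rel_error(X, *cp_als(X, R=2, iters=20))
print(err_1, err_20)
```

Each factor update solves a least-squares problem exactly, so the misfit cannot increase from one sweep to the next. The three einsum calls are the MTTKRPs, one per mode, which is why the kernel dominates the running time on large sparse tensors.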

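9 MTTKRP Operation
Matricized Tensor Times Khatri-Rao Product (MTTKRP): the updated factor matrix is the product of the tensor matricization and the Khatri-Rao product of the other factor matrices.
[Figure: tensor matricization multiplied by the Khatri-Rao product of factor matrices C and B, giving the updated factor matrix. Each factor matrix has R columns.]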
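For a sparse tensor, the matricization and the Khatri-Rao product are normally never formed explicitly. The sketch below computes the first-mode MTTKRP directly from COO coordinates and checks it against a dense computation. The function name `mttkrp_coo_mode0`, the zero-based mode numbering, and the small example tensor are assumptions made for illustration.

```python
import numpy as np

def mttkrp_coo_mode0(inds, vals, B, C, num_rows):
    """First-mode MTTKRP for a 3-D sparse tensor stored in COO form.

    inds is an nnz x 3 integer array of (i, j, k) coordinates and vals holds
    the matching nonzero values. Row i of the result is the sum, over all
    nonzeros with first index i, of val * (B[j, :] * C[k, :]).
    """
    i, j, k = inds[:, 0], inds[:, 1], inds[:, 2]
    contrib = vals[:, None] * B[j] * C[k]   # one length-R row per nonzero
    out = np.zeros((num_rows, B.shape[1]))
    np.add.at(out, i, contrib)              # scatter-add, safe for repeated i
    return out

# Hypothetical 3 x 4 x 2 tensor with five nonzeros, rank R = 2.
inds = np.array([[0, 1, 0], [0, 3, 1], [1, 0, 0], [2, 2, 1], [2, 3, 0]])
vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
rng = np.random.default_rng(0)
B, C = rng.random((4, 2)), rng.random((2, 2))
M = mttkrp_coo_mode0(inds, vals, B, C, num_rows=3)

# Dense cross-check of the same contraction.
X = np.zeros((3, 4, 2))
X[inds[:, 0], inds[:, 1], inds[:, 2]] = vals
M_dense = np.einsum("ijk,jr,kr->ir", X, B, C)
print(np.allclose(M, M_dense))  # True
```

The scatter-add into `out` is the step where parallel threads can collide when several nonzeros share the same row index.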

10 CP-APR Bottleneck: element-wise product

11 Bottleneck of CP-APR Consider CP-APR MU Stored in sparse format, size nnz * R

12 Outline Background HiCOO Format Multicore CP-ALS Distributed CPDs

13 Mode Orientation
[Figure: a tensor decomposition runs a kernel in every mode. Formats are either mode oriented (e.g. CSF) or neutral in mode orientation (coordinate, COO). Labels: Efficient, Inefficient, "Kernel in Mode-??".]

14 Mode Oriented Formats
Three CSF/F-COO representations are required or preferred for the three MTTKRPs of a CP decomposition.
[Figure: the MTTKRP in each mode uses its own CSF representation, with index orders (i, j, k), (j, i, k), and (k, i, j). Label: More storage.]

15 Mode Oriented Formats
Three CSF/F-COO representations are required or preferred for the three MTTKRPs of a CP decomposition.
[Figure: a single CSF representation with index order (i, j, k) serves the MTTKRP in all three modes. Label: Performance payoff.]

18 Current Sparse Tensor Formats
COO: coordinate format [Bader et al., 6]. Neutral mode orientation.
CSF: Compressed Sparse Fibers, an extension of CSR [Smith et al. 5]. Mode oriented.
F-COO: Flagged COO format [Liu et al., 7]. Mode oriented.
The mode-oriented formats prefer different representations for different modes.
[Figure: the same sparse tensor stored as (a) COO with arrays i, j, k, val; (b) CSF as a tree over i, j, k with val at the leaves; (c) F-COO with arrays sf, bf, j, k, val.]

19 HiCOO Format
Store a sparse tensor in units of small sparse blocks, saving storage:
Shorten the bit-length of element indices.
Compress the number of block indices.
HiCOO is an improvement of the Compressed Sparse Blocks (CSB) format. For matrices: better data locality. For tensors: reduced storage and memory footprint.
[Figure: (a) COO with arrays i, j, k, val; (b) HiCOO with block pointer bptr, block indices bi, bj, bk, 8-bit element indices ei, ej, ek, and val, for block size B. A full index is recovered as i = bi * B + ei.]
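The relation i = bi * B + ei can be sketched in a few lines. The example assumes a power-of-two block size so that the split is a shift and a mask; the value B = 128 and the helper names `split_index` and `join_index` are assumptions for illustration, chosen so that the element index fits in 8 bits.

```python
BLOCK_BITS = 7        # assumed: block size B = 2**7 = 128
B = 1 << BLOCK_BITS

def split_index(i):
    """Split a full coordinate into (block index, element index)."""
    return i >> BLOCK_BITS, i & (B - 1)

def join_index(bi, ei):
    """Recover the full coordinate: i = bi * B + ei."""
    return bi * B + ei

bi, ei = split_index(1000)
print(bi, ei)              # 7 104
print(join_index(bi, ei))  # 1000
```

Because ei is always smaller than B, it can be stored in a single byte, while bi is stored once per block rather than once per nonzero.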

20 Construct HiCOO Representation
Sort the input COO tensor in Morton order (Z-order).
Partition it into blocks and determine each block's location.
Construct the HiCOO representation by compressing the index length: i = bi*B + ei; j = bj*B + ej; k = bk*B + ek.
[Figure: COO (i, j, k, val) -> Z-order sorting -> partition (bptr) -> compress -> HiCOO (bptr, bi, bj, bk, ei, ej, ek, val), with element indices stored as bytes.]
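The three construction steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the ParTI! code: the function names are hypothetical, the block size is a power of two given by `block_bits`, the Morton key is built bit by bit in pure Python, and mode 0 is placed in the lowest bit of each interleaved group.

```python
import numpy as np

def morton_key(coord, bits):
    """Interleave the low `bits` bits of each coordinate into one Z-order key."""
    key = 0
    for bit in range(bits):
        for mode, c in enumerate(coord):
            key |= ((int(c) >> bit) & 1) << (len(coord) * bit + mode)
    return key

def coo_to_hicoo(inds, vals, block_bits):
    # Step 1: Morton-order (Z-order) sort of the nonzeros.
    bits = max(1, int(inds.max()).bit_length())
    keys = np.array([morton_key(c, bits) for c in inds])
    order = np.argsort(keys, kind="stable")
    inds, vals = inds[order], vals[order]
    # Step 2: partition into blocks; a block starts where the block index changes.
    binds = inds >> block_bits
    new_block = np.r_[True, np.any(binds[1:] != binds[:-1], axis=1)]
    starts = np.flatnonzero(new_block)
    bptr = np.r_[starts, len(vals)]
    # Step 3: compress; keep one block index per block, byte-sized element indices.
    einds = (inds & ((1 << block_bits) - 1)).astype(np.uint8)
    return bptr, binds[starts], einds, vals

def hicoo_to_coo(bptr, binds, einds, vals, block_bits):
    """Expand back to full coordinates using i = bi * B + ei."""
    counts = np.diff(bptr)
    return (np.repeat(binds, counts, axis=0) << block_bits) + einds, vals

# Hypothetical 4 x 4 x 4 tensor with six nonzeros and block size B = 2.
inds = np.array([[0, 0, 0], [3, 1, 2], [1, 0, 1], [2, 3, 3], [0, 1, 0], [3, 0, 2]])
vals = np.arange(1.0, 7.0)
bptr, binds, einds, hvals = coo_to_hicoo(inds, vals, block_bits=1)
print(bptr.tolist())  # [0, 3, 5, 6]
```

Sorting by the full Morton key places all nonzeros of one block next to each other, because the block-index bits occupy the high part of the key. That is what lets step 2 find block boundaries with a single pass.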

21 Outline Background HiCOO Format Multicore CP-ALS Distributed CPDs

22 Two-level Blocking for Efficient Thread Parallelism
Use a two-level blocking strategy: large bulks (logical) plus small blocks (physical).
To avoid using locks, we add a bit of extra storage for scheduling information.
[Figure: bulks of the tensor grouped into iterations, with regions labeled "No Write Conflicts" and "Write Conflicts". HiCOO arrays shown: bptr, bi, bj, bk, ei, ej, ek, val.]

23 Platform and Dataset
Platform: an Intel Xeon CPU E7-485 platform with 56 physical cores, using gcc 5.4. and parallelized with OpenMP.
Dataset: FROSTT [Smith et al. 7] and HaTen [Jeon et al. 5].
Reference: J. Li, J. Sun, R. Vuduc. HiCOO: Hierarchical Storage of Sparse Tensors. 8

24 Multicore MTTKRP
ParTI! library: speedups of HiCOO over COO.
SPLATT library: speedups of HiCOO over CSF.
[Figures: speedup per mode for the tensors nell, choa, darpa, fb-m, fb-s, deli, nell, crime, nips, enron, flickr, and deli4d, grouped into -D tensors and 4-D tensors. Annotations: ".8x" on the ParTI! plot; "Storage size: .x" and "comparable" on the SPLATT plot.]

25 Multicore CP-ALS
HiCOO outperforms COO by up to 9.4 and CSF by up to .4.
[Figure: normalized CP-ALS time of HiCOO, CSF, and COO for the tensors nell, choa, darpa, fb-m, fb-s, deli, nell, crime, nips, enron, flickr, and deli4d, grouped into -D tensors and 4-D tensors.]

26 Outline Background HiCOO Format Multicore CP-ALS Distributed CPDs

27 Data Distribution
Tensor: two-level blocking.
Bulk level: node partitioning to ensure nonzero balance.
Block level: ensure fast local computation.
Factor matrices: each processor owns the portion of the factor matrices that its local tensor needs. Some duplication exists.
[Figure: a tensor divided into bulks and blocks of size B, with the HiCOO arrays bptr, bi, bj, bk, ei, ej, ek, val.]

28 Data Distribution Cont.
X: a 4*4*4 sparse tensor.
Processors are split into ** grids.
Use sub-communicators to update factor matrices.
[Figure: each processor P(,,) holds its slices A(:), B(:), C(:) of the factor matrices.]

29 Platform and Dataset
Platform: the Cori supercomputer at NERSC, which is a Cray XC4. Each node is an Intel Xeon CPU E5-698 v (Haswell) consisting physical cores. We use Intel MPI 8 version. We test to 4 processors with processors per node.
Dataset: synthetic tensors generated from a Poisson distribution.

30 Distributed MTTKRP Performance
[Figures: MTTKRP time (s) versus number of processes for COO and for HiCOO on the synthetic tensors; MTTKRP time using 4 procs, HiCOO versus COO, per tensor.]

31 Distributed CP-ALS Performance
[Figures: CP-ALS time (s) versus number of processes for COO and for HiCOO on the synthetic tensors; CP-ALS time using 4 procs, HiCOO versus COO, per tensor.]

32 Distributed CompPhi Performance
[Figures: CompPhi time (s) versus number of processes for COO and for HiCOO on the synthetic tensors; CompPhi time using 4 procs, HiCOO versus COO, per tensor.]

33 Distributed CP-APR Performance
[Figures: CP-APR time (s) versus number of processes for COO and for HiCOO on the synthetic tensors; CP-APR time using 4 procs, HiCOO versus COO, per tensor.]

34 Conclusion and Future Work
Conclusion: HiCOO WORKS!
Future work:
Combine multicore and distributed parallelism to build hybrid MPI+OpenMP HiCOO CPDs.
Also include our GPU implementations.
Consider Newton-based optimization methods for CP-APR algorithms.

35 ParTI! Project
Hardware: multicore CPUs, GPUs, distributed memory, Emu.
ParTI! library: sparse baseline tensor routines, tensor decomposition algorithms.
Code: jiajiali@gatech.edu

36 Contributors
Jiajia Li, PhD, GaTech & PNNL* (* from Aug 8)
Yuchen Ma, Undergrad, HDU
Nick Liu, Undergrad, GaTech
Srinivas Eswar, PhD, GaTech
Junghyun Kim, Undergrad, GaTech
Richard Vuduc, Professor, GaTech

37 Acknowledgement

38 Related Work
S. Smith et al. A Medium-Grained Algorithm for Distributed Sparse Tensor Factorization, IPDPS 6
O. Kaya et al. Scalable Sparse Tensor Decompositions in Distributed Memory Systems, SC 5
E. Solomonik et al. Sparse Tensor Algebra as a Parallel Programming Model, TR, 5
S. Rajbhandari et al. A communication-optimal framework for contracting distributed tensors, SC'4
W. Austin et al. Parallel Tensor Compression for Large-Scale Scientific Data, IPDPS 6
J. Choi et al. DFacTo: Distributed Factorization of Tensors, NIPS 4
