Tensor Contractions with Extended BLAS Kernels on CPU and GPU
1 Tensor Contractions with Extended BLAS Kernels on CPU and GPU
Cris Cecka, Senior Research Scientist, NVIDIA Research, Santa Clara, California
Joint work with Yang Shi, U.N. Niranjan, and Animashree Anandkumar, Electrical Engineering and Computer Science, University of California, Irvine
Workshop on Batched, Reproducible, and Reduced Precision BLAS, February 25, 2017
2-5 Tensor Contraction - Motivation
Modern data is inherently multi-dimensional.
[Figure: examples of multi-dimensional data, including a neural network with input, two hidden layers (Hidden 1, Hidden 2), and output.]
6-9 Tensor Contraction - Motivation
What is tensor contraction?
[Figure: a 4x2x2 tensor A contracted with a 2x1 matrix B yields a 4x2x1 tensor C; each slice A(:,1,:), A(:,2,:) is multiplied by B.]
e.g. C_mnp = A_mnk B_kp (summed over the shared index k)

Why do we need tensor contraction?
1. Core primitive of multilinear algebra.
2. BLAS Level 3: unbounded compute intensity.
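This particular contraction needs no data movement at all: fusing m and n into a single index of size M*N turns it into one GEMM, C_(mn)p = A_(mn)k B_kp. A minimal reference sketch (not from the slides), assuming packed column-major storage with the first index fastest:

#include <cstddef>
#include <vector>

// Reference: C_mnp = sum_k A_mnk * B_kp, with A of extents (M,N,K), B (K,P), C (M,N,P).
// The fused index i = m + n*M runs over the first two modes, so the loop nest is
// literally a GEMM on an (M*N) x K by K x P problem.
void contract_ref(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, int M, int N, int K, int P) {
  for (int p = 0; p < P; ++p)
    for (int i = 0; i < M * N; ++i) {
      float acc = 0.f;
      for (int k = 0; k < K; ++k)
        acc += A[i + (std::size_t)k * M * N] * B[k + (std::size_t)p * K];
      C[i + (std::size_t)p * M * N] = acc;
    }
}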
10-12 Tensor Contraction - Motivation
Lots of hot applications at the moment: machine learning, deep learning.
e.g. Learning a latent variable model with tensor decomposition: topic model [1]
- h: PDF of topics in a document.
- A: topic-word matrix, A_ij = P(x_m = i | y_m = j).
Form the third-order tensor M_3 = E(x ⊗ x ⊗ x) = Σ_i h_i a_i ⊗ a_i ⊗ a_i.

[1] Tensor Decompositions for Learning Latent Variable Models, Anima Anandkumar, Rong Ge, Daniel Hsu, et al.
13-14 Tensor Contraction - Motivation
Distributed FFT
15-17 Tensor Contraction - Motivation
What do we have?
Tensor computation libraries - arbitrary/restricted tensor operations of any order and dimension:
1. Tensor Toolbox (MATLAB)
2. FTensor (C++)
3. Cyclops (C++)
4. BTAS (C++)
5. All the Python...
Efficient computing frameworks:
1. Static analysis solutions: PPCG [ISL] (polyhedral), TCE (DSL)
2. Parallel and distributed primitives: BLAS, cuBLAS, BLIS, BLASX, cublasXt
18-20 Tensor Contraction - Motivation: Libraries
Explicit permutation dominates. Consider C_mnp = A_km B_pkn; a typical library computes it as
1. A_km → A_mk
2. B_pkn → B_kpn
3. C_mnp → C_mpn
4. C_m(pn) = A_mk B_k(pn)
5. C_mpn → C_mnp
[Figure: fraction of time spent in copies/transpositions vs. n, CPU (top) and GPU (bottom); lines shown for 1, 2, 3, and 6 transpositions.]
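For scale, a minimal sketch (not from the slides) of just one of the permutation steps above, B_pkn → B_kpn: every such pass streams the entire tensor through memory and performs no floating-point work, which is why the permutations dominate at small and moderate sizes.

#include <cstddef>

// Permute B_pkn -> B_kpn for packed column-major storage (first index fastest).
// Pure data movement: 2*P*K*N element accesses, zero flops.
void permute_pkn_to_kpn(const float* B, float* Bt, int P, int K, int N) {
  for (int n = 0; n < N; ++n)
    for (int k = 0; k < K; ++k)
      for (int p = 0; p < P; ++p)
        Bt[k + (std::size_t)p * K + (std::size_t)n * K * P] =
            B[p + (std::size_t)k * P + (std::size_t)n * P * K];
}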
21 Existing Primitives
GEMM: suboptimal for many small matrices.
Pointer-to-Pointer BatchedGEMM: available in MKL 11.3β and cuBLAS 4.1.
C[p] = α op(A[p]) op(B[p]) + β C[p]

cublas<T>gemmBatched(cublasHandle_t handle,
                     cublasOperation_t transa, cublasOperation_t transb,
                     int M, int N, int K,
                     const T* alpha,
                     const T** A, int lda,
                     const T** B, int ldb,
                     const T* beta,
                     T** C, int ldc,
                     int batchCount)
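For contrast with the strided interface introduced later, a hedged sketch of what the pointer-to-pointer interface requires when the batch entries are really just slices of contiguous tensors at a fixed stride: arrays of pointers must be built on the host and copied to the device before the call. The function and pointer names here are illustrative, not from the slides.

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Batch GEMMs over the last mode of contiguous tensors dB and dC;
// dA is a single matrix reused by every batch entry. All pointers are
// device pointers; strides are element counts.
void batched_via_pointer_arrays(cublasHandle_t handle,
                                const float* dA, int lda,
                                const float* dB, int ldb, long long strideB,
                                float* dC, int ldc, long long strideC,
                                int M, int N, int K, int batch) {
  std::vector<const float*> hA(batch), hB(batch);
  std::vector<float*> hC(batch);
  for (int p = 0; p < batch; ++p) {
    hA[p] = dA;                  // same A for every batch entry
    hB[p] = dB + p * strideB;    // p-th slice of B
    hC[p] = dC + p * strideC;    // p-th slice of C
  }
  const float **dAarr, **dBarr; float **dCarr;
  cudaMalloc((void**)&dAarr, batch * sizeof(float*));
  cudaMalloc((void**)&dBarr, batch * sizeof(float*));
  cudaMalloc((void**)&dCarr, batch * sizeof(float*));
  cudaMemcpy(dAarr, hA.data(), batch * sizeof(float*), cudaMemcpyHostToDevice);
  cudaMemcpy(dBarr, hB.data(), batch * sizeof(float*), cudaMemcpyHostToDevice);
  cudaMemcpy(dCarr, hC.data(), batch * sizeof(float*), cudaMemcpyHostToDevice);
  const float alpha = 1.f, beta = 0.f;
  cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
                     &alpha, dAarr, lda, dBarr, ldb,
                     &beta, dCarr, ldc, batch);
  cudaFree(dAarr); cudaFree(dBarr); cudaFree(dCarr);
}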
22-23 Existing Primitives: Pointer-to-Pointer BatchedGEMM
Except, actually... Solution: StridedBatchedGEMM.
24-25 StridedBatchedGEMM
Exists!... Still no documentation?!? Documentation as of last Tuesday!
In cuBLAS 8.0:

$ grep StridedBatched -A 17 /usr/local/cuda/include/cublas_api.h
2320:CUBLASAPI cublasStatus_t cublasSgemmStridedBatched(cublasHandle_t handle,
         cublasOperation_t transa, cublasOperation_t transb,
         int m, int n, int k,
         const float *alpha,          // host or device pointer
         const float *A, int lda,
         long long int strideA,       // purposely signed
         const float *B, int ldb,
         long long int strideB,
         const float *beta,           // host or device pointer
         float *C, int ldc,
         long long int strideC,
         int batchCount);
...
26 StridedBatchedGEMM

cublas<T>gemmStridedBatched(cublasHandle_t handle,
                            cublasOperation_t transa, cublasOperation_t transb,
                            int M, int N, int K,
                            const T* alpha,
                            const T* A, int lda1, int strideA,
                            const T* B, int ldb1, int strideB,
                            const T* beta,
                            T* C, int ldc1, int strideC,
                            int batchCount)

- Covers the common use case of pointer-to-pointer BatchedGEMM.
- No pointer-to-pointer data structure or overhead.
- Performance on par with pure GEMM (P100 and beyond).
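To make the ld/stride parameters concrete, a minimal sketch (an illustrative helper, not part of cuBLAS): for a packed column-major tensor T of extents (X, Y, Z), element (x, y, z) lives at offset x + y*X + z*X*Y, so the two natural batched matrix views are:

// Leading dimension and batch stride of a packed column-major X x Y x Z tensor,
// depending on which mode plays the batch role.
struct MatView { long long ld, stride; };
// Batch over the last mode z: matrices T(:,:,z) are X x Y.
MatView batch_last(long long X, long long Y)   { return { X, X * Y }; }
// Batch over the middle mode y: matrices T(:,y,:) are X x Z.
MatView batch_middle(long long X, long long Y) { return { X * Y, X }; }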
27 Tensor Contraction with Extended BLAS Primitives
C_mnp = A · B, where C_mnp is stored as C[m + n*ldc1 + p*ldc2].
Brackets [ ] mark the mode that is batched over (it becomes the batchCount of a StridedBatchedGEMM); parentheses ( ) mark modes flattened into a single matrix dimension.

Case | Contraction | Kernel 1                | Kernel 2
1.1  | A_mk B_knp  | C_m(np) = A_mk B_k(np)  | C_mn[p] = A_mk B_kn[p]
1.2  | A_mk B_kpn  | C_mn[p] = A_mk B_k[p]n  | C_m[n]p = A_mk B_kp[n]
1.3  | A_mk B_nkp  | C_mn[p] = A_mk B_nk[p]  |
1.4  | A_mk B_pkn  | C_m[n]p = A_mk B_pk[n]  |
1.5  | A_mk B_npk  | C_m(np) = A_mk B_(np)k  | C_mn[p] = A_mk B_n[p]k
1.6  | A_mk B_pnk  | C_m[n]p = A_mk B_p[n]k  |
2.1  | A_km B_knp  | C_m(np) = A_km B_k(np)  | C_mn[p] = A_km B_kn[p]
2.2  | A_km B_kpn  | C_mn[p] = A_km B_k[p]n  | C_m[n]p = A_km B_kp[n]
2.3  | A_km B_nkp  | C_mn[p] = A_km B_nk[p]  |
2.4  | A_km B_pkn  | C_m[n]p = A_km B_pk[n]  |
2.5  | A_km B_npk  | C_m(np) = A_km B_(np)k  | C_mn[p] = A_km B_n[p]k
2.6  | A_km B_pnk  | C_m[n]p = A_km B_p[n]k  |
3.1  | A_nk B_kmp  | C_mn[p] = B_km[p] A_nk  |
3.2  | A_nk B_kpm  | C_mn[p] = B_k[p]m A_nk  |
3.3  | A_nk B_mkp  | C_mn[p] = B_mk[p] A_nk  |
3.4  | A_nk B_pkm  |                         |
3.5  | A_nk B_mpk  | C_mn[p] = B_m[p]k A_nk  |
3.6  | A_nk B_pmk  |                         |
4.1  | A_kn B_kmp  | C_mn[p] = B_km[p] A_kn  |
4.2  | A_kn B_kpm  | C_mn[p] = B_k[p]m A_kn  |
4.3  | A_kn B_mkp  | C_mn[p] = B_mk[p] A_kn  |
4.4  | A_kn B_pkm  |                         |
4.5  | A_kn B_mpk  | C_mn[p] = B_m[p]k A_kn  |
4.6  | A_kn B_pmk  |                         |
5.1  | A_pk B_kmn  | C_(mn)p = B_k(mn) A_pk  | C_m[n]p = B_km[n] A_pk
5.2  | A_pk B_knm  | C_m[n]p = B_k[n]m A_pk  |
5.3  | A_pk B_mkn  | C_m[n]p = B_mk[n] A_pk  |
5.4  | A_pk B_nkm  |                         |
5.5  | A_pk B_mnk  | C_(mn)p = B_(mn)k A_pk  | C_m[n]p = B_m[n]k A_pk
5.6  | A_pk B_nmk  |                         |
6.1  | A_kp B_kmn  | C_(mn)p = B_k(mn) A_kp  | C_m[n]p = B_km[n] A_kp
6.2  | A_kp B_knm  | C_m[n]p = B_k[n]m A_kp  |
6.3  | A_kp B_mkn  | C_m[n]p = B_mk[n] A_kp  |
6.4  | A_kp B_nkm  |                         |
6.5  | A_kp B_mnk  | C_(mn)p = B_(mn)k A_kp  | C_m[n]p = B_m[n]k A_kp
6.6  | A_kp B_nmk  |                         |
28 Tensor Contraction with Extended BLAS Primitives

Case | Contraction | Kernel 1               | Kernel 2               | Kernel 3
1.1  | A_mk B_knp  | C_m(np) = A_mk B_k(np) | C_mn[p] = A_mk B_kn[p] | C_m[n]p = A_mk B_k[n]p
6.1  | A_kp B_kmn  | C_(mn)p = B_k(mn) A_kp | C_m[n]p = B_km[n] A_kp |

Example: mapping to Level 3 BLAS routines.
Case 1.1, Kernel 2: C_mn[p] = A_mk B_kn[p]

cublasDgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                          M, N, K,
                          1.0, A, lda1, 0,
                               B, ldb1, ldb2,
                          0.0, C, ldc1, ldc2,
                          P)
29 Tensor Contraction with Extended BLAS Primitives
Example: mapping to Level 3 BLAS routines.
Case 6.1, Kernel 2: C_m[n]p = B_km[n] A_kp

cublasDgemmStridedBatched(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                          M, P, K,
                          1.0, B, ldb1, ldb2,
                               A, lda1, 0,
                          0.0, C, ldc2, ldc1,
                          N)
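For reference, a hedged, compilable form of the Case 1.1, Kernel 2 call, with the leading dimensions and strides written out for packed column-major tensors A (M x K), B (K x N x P), C (M x N x P). The real cuBLAS API takes alpha and beta by pointer; the device pointer names dA, dB, dC are illustrative.

const double alpha = 1.0, beta = 0.0;
cublasDgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                          M, N, K,
                          &alpha,
                          dA, /*lda=*/M, /*strideA=*/0,                 // same A for every batch entry
                          dB, /*ldb=*/K, /*strideB=*/(long long)K * N,  // B(:,:,p)
                          &beta,
                          dC, /*ldc=*/M, /*strideC=*/(long long)M * N,  // C(:,:,p)
                          /*batchCount=*/P);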
30 Performance: Flattening vs. StridedBatchedGEMM
[Figure: flattening speedup (batched / flat) vs. n for Case 1.1 [n], Case 1.1 [p], Case 1.5 [p], and Case 6.1 [n].]
Prefer flattening to pure GEMM.
31 Performance: Batching in the last mode versus the middle mode
[Figure: last-mode speedup ([n] / [p]) vs. n, including Case 1.1.]
On CPU: prefer batching in the last mode.
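To make the comparison concrete, a hedged sketch of the two batching choices for Case 1.1 (C_mnp = A_mk B_knp) with packed column-major tensors, written with the cuBLAS interface for concreteness; the device pointer names are illustrative. Both calls perform the same M*N*K*P multiply-adds and differ only in how the slices are addressed.

const float alpha = 1.f, beta = 0.f;
// Kernel 2 - batch over the last mode p:   C_mn[p] = A_mk B_kn[p]
cublasSgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
                          &alpha, dA, M, 0,
                                  dB, K, (long long)K * N,
                          &beta,  dC, M, (long long)M * N, P);
// Kernel 3 - batch over the middle mode n: C_m[n]p = A_mk B_k[n]p
cublasSgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, P, K,
                          &alpha, dA, M, 0,
                                  dB, K * N, K,
                          &beta,  dC, M * N, M, N);

In the middle-mode version the columns of each slice are K*N (input) and M*N (output) elements apart, so every GEMM touches memory with a much larger leading dimension; that layout difference is what drives the measured gap.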
32 Performance: Mixed-mode batching
[Figure: last-output-mode speedup ([n] / [p]) vs. n, including Case 1.2.]
On CPU: the batching mode of the output tensor matters more than the batching mode of the input tensor.
33 Analysis
Exceptional cases: cannot be computed by StridedBatchedGEMM.

Case | Contraction
3.4  | C_mnp = A_nk B_pkm
6.4  | C_mnp = A_kp B_nkm
     | C_mnp = A_mkp B_mkn
     | C_mnp = A_pkm B_nkp

Examples of exceptional cases.
These cases are precisely the interleaved GEMMs: when the batching index is the major (fastest-varying) index of an argument, that argument is interpreted as interleaved matrices. This may apply to one or both inputs and/or the output.
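A minimal reference sketch (not from the slides) of Case 3.4 makes the obstruction visible: within each p-slice of B, neither remaining index has unit stride (the unit-stride direction is the batch index p itself), so no leading-dimension/stride pair can describe the operand for a standard GEMM.

#include <cstddef>

// Case 3.4: C_mnp = sum_k A_nk * B_pkm, packed column-major (first index fastest).
// B is stored P x K x M, so for a fixed p the matrix B_p(k, m) has element
// addresses p + k*P + m*P*K -- strides P and P*K, never 1.
void case_3_4_ref(const float* A,   // N x K
                  const float* B,   // P x K x M
                  float* C,         // M x N x P
                  int M, int N, int K, int P) {
  for (int p = 0; p < P; ++p)
    for (int n = 0; n < N; ++n)
      for (int m = 0; m < M; ++m) {
        float acc = 0.f;
        for (int k = 0; k < K; ++k)
          acc += A[n + (std::size_t)k * N] *
                 B[p + (std::size_t)k * P + (std::size_t)m * P * K];
        C[m + (std::size_t)n * M + (std::size_t)p * M * N] = acc;
      }
}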
34 3D Tiled GEMM
Implement GEMM with a 3D tile:
- Transpositions are performed on the way to smem/registers.
- Keep the canonical GEMM core.
Considers three modes rather than two:
- Major mode: A_mnkpqr
- Reduction mode: A_mnkpqr
- Aux (batch, row, col) mode: A_mnkpqr
(Optional) The third tile dimension interpolates between pure GEMM and interleaved GEMM.
A nested loop over the remaining modes performs the full contraction.
35 3D Tiled GEMM
Tile-size tuning with PPCG for exceptional cases:
[Figure: case 6.5 tile-size tuning; time [µs] over (m, n) and (p, k) blockings from (1,1) to (128,1); PPCG compared with BatchedGEMV, BatchedGEMM, and GEAM as a function of n.]
36 3D Tiled GEMM
- C_mnp = A_mkp B_nkp: increasing BLK_P decreases the effective tile size.
- C_pmn = A_pmk B_pnk: increasing BLK_P increases cache-line utilization.
e.g. BLK_P = 1, 2, 4, 8.
BLK_P = 1 is equivalent to BLIS (strides in both row and column).
37 3D Tiled GEMM
Interface? Extend the StridedBatchedGEMM transpose parameters?

C_mnp = A_pmk B_pkn : EX_N, EX_N
C_mpn = A_pmk B_pnk : EX_N, EX_T
C_pmn = A_pkm B_pkn : EX_T, EX_N
C_pmn = A_pkm B_pnk : EX_T, EX_T
38 contract

contract(cublas::par, alpha,
         A, {M,P,K}, _<m,p,k>,
         B, {K,N,P}, _<k,n,p>,
         beta,
         C, _<m,n,p>);

Kernel dispatch on the role of each index (ROWIDX, COLIDX, BATIDX, REDIDX): dot, scal, ger, gemv, dgmm, gemm, and batched gemm cover most combinations, with exceptional (XXX) entries such as c_p = a_p b_p, c_p = a_pk b_pk, interleaved batched gemm, and one combination that might map to batched gemm with k = 1 or to a batched ger.
39-40 Applications: Tucker Decomposition
T_mnp = G_ijk A_mi B_nj C_pk
[Figure: Tucker decomposition of the tensor T_mnp into a core tensor G_ijk and factor matrices A_mi, B_nj, C_pk.]

Main steps in the algorithm (t is the iteration index):
Y_mjk = T_mnp B_nj^t C_pk^t
Y_ink = T_mnp A_mi^{t+1} C_pk^t
Y_ijp = T_mnp B_nj^{t+1} A_mi^{t+1}
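A hedged sketch of how the first update above could map onto the extended primitives: assume packed column-major tensors T (M x N x P), B (N x J), C (P x K), a scratch tensor Z (M x J x P), and illustrative device pointer names. The contraction over n is a StridedBatchedGEMM over p; the contraction over p then flattens the fused (m, j) mode into a single GEMM.

const float one = 1.f, zero = 0.f;
// Stage 1: Z_mj[p] = T_mn[p] * B_nj   (batched over p)
cublasSgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                          M, J, N, &one,
                          dT, M, (long long)M * N,   // T(:,:,p) is M x N
                          dB, N, 0,                  // B reused for every p
                          &zero,
                          dZ, M, (long long)M * J,   // Z(:,:,p) is M x J
                          P);
// Stage 2: Y_(mj)k = Z_(mj)p * C_pk   (single GEMM over the fused (m, j) index)
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
            M * J, K, P, &one,
            dZ, M * J,   // Z viewed as an (M*J) x P matrix
            dC, P,       // C is P x K
            &zero,
            dY, M * J);  // Y viewed as an (M*J) x K matrix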
41 Applications: Tucker Decomposition
Performance on Tucker decomposition:
[Figure: time (sec) vs. n for TensorToolbox, BTAS, Cyclops, CPU Batched, and GPU Batched.]
42 Applications: FFT
Low-communication FFT for multiple GPUs.
- StridedBatchedGEMM composes 75%+ of the runtime and is essential to the performance.
- The two custom kernels are precisely interleaved GEMMs.
- 2 P100 GPUs: 1.3x over cufftXt.
- 8 P100 GPUs: 2.1x over cufftXt.
43-44 Conclusion
- StridedBatchedGEMM in cuBLAS for generalized tensor contractions.
- Avoids explicit transpositions or permutations.
- 10x (GPU) and 2x (CPU) speedup on small/moderate-sized tensors.
- Available in cuBLAS 8.0.

Future work:
- Exceptional-case kernels/performance/interface??
- Library optimizations:
  - Matrix stride zero: Persistent Matrix StridedBatchedGEMM
  - Staged RNNs: Staged Persistent Matrix StridedBatchedGEMM
45 Thank you! Questions?