Antti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA

1 S7255: CUTT: A HIGH-PERFORMANCE TENSOR TRANSPOSE LIBRARY FOR GPUS
Antti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA

2 MOTIVATION
Tensor contractions are the most computationally intensive part of the quantum many-body methods used in NWCHEM, DIRAC, LS-DALTON, and ACES-IV.
Example tensor contraction, with a sum over the repeated indices c and k: $D_{a,b,i} \mathrel{+}= L_{a,c,i,k} R_{k,b,c}$
Evaluating tensor contractions directly requires implementing a lot of hard-to-write custom code.
The indirect approach transposes the tensors and uses efficient linear algebra libraries (such as cuBLAS) to perform the matrix multiply.

3 TENSOR CONTRACTIONS
Indirect approach
Reduction over a pair of indices shared by two tensors, e.g. $D_{a,b,i} \mathrel{+}= L_{a,c,i,k} R_{k,b,c}$
This can be evaluated as
    $L_{a,c,i,k} \rightarrow L_{a,i,k,c}$    # tensor transpose
    $R_{k,b,c} \rightarrow R_{k,c,b}$    # tensor transpose
    $D_{a,i,b} \mathrel{+}= L_{a,i,k,c} R_{k,c,b}$    # matrix multiply
    $D_{a,i,b} \rightarrow D_{a,b,i}$    # tensor transpose
Able to take advantage of the high-performance matrix multiply routines provided by cuBLAS.
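The four steps on this slide can be checked on the host with a small C++ sketch. It is not cuTT or cuBLAS code: `transpose`, `contractDirect`, and `contractIndirect` are hypothetical helper names, the matrix multiply is a plain triple loop standing in for a GEMM call, and tensors are assumed to be stored densely with the first index fastest.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Permute a dense tensor stored with the first index fastest.
// Output dimension j is input dimension perm[j].
std::vector<double> transpose(const std::vector<double>& in,
                              const std::vector<int>& dim,
                              const std::vector<int>& perm) {
  assert(perm.size() == dim.size());
  const std::size_t rank = dim.size();
  std::vector<std::size_t> stride(rank);  // strides of the input tensor
  std::size_t vol = 1;
  for (std::size_t i = 0; i < rank; i++) { stride[i] = vol; vol *= dim[i]; }
  assert(in.size() == vol);
  std::vector<double> out(vol);
  std::vector<int> idx(rank, 0);  // multi-index in output order
  for (std::size_t o = 0; o < vol; o++) {
    std::size_t src = 0;
    for (std::size_t j = 0; j < rank; j++) src += idx[j] * stride[perm[j]];
    out[o] = in[src];
    for (std::size_t j = 0; j < rank; j++) {  // odometer-style increment
      if (++idx[j] < dim[perm[j]]) break;
      idx[j] = 0;
    }
  }
  return out;
}

// Reference: D(a,b,i) = sum over c,k of L(a,c,i,k) * R(k,b,c), evaluated directly.
std::vector<double> contractDirect(const std::vector<double>& L,
                                   const std::vector<double>& R,
                                   int na, int nb, int nc, int ni, int nk) {
  std::vector<double> D(static_cast<std::size_t>(na) * nb * ni, 0.0);
  for (int i = 0; i < ni; i++)
    for (int b = 0; b < nb; b++)
      for (int a = 0; a < na; a++)
        for (int k = 0; k < nk; k++)
          for (int c = 0; c < nc; c++)
            D[a + na * (b + nb * i)] +=
                L[a + na * (c + nc * (i + ni * k))] * R[k + nk * (b + nb * c)];
  return D;
}

// Indirect approach: two input transposes, one matrix multiply, one output transpose.
std::vector<double> contractIndirect(const std::vector<double>& L,
                                     const std::vector<double>& R,
                                     int na, int nb, int nc, int ni, int nk) {
  auto Lt = transpose(L, {na, nc, ni, nk}, {0, 2, 3, 1});  // L(a,c,i,k) -> L(a,i,k,c)
  auto Rt = transpose(R, {nk, nb, nc}, {0, 2, 1});         // R(k,b,c) -> R(k,c,b)
  // Column-major GEMM: Dt[(a,i), b] += Lt[(a,i), (k,c)] * Rt[(k,c), b]
  const int M = na * ni, J = nk * nc, N = nb;
  std::vector<double> Dt(static_cast<std::size_t>(M) * N, 0.0);
  for (int n = 0; n < N; n++)
    for (int j = 0; j < J; j++)
      for (int m = 0; m < M; m++)
        Dt[m + M * n] += Lt[m + M * j] * Rt[j + J * n];
  return transpose(Dt, {na, ni, nb}, {0, 2, 1});  // D(a,i,b) -> D(a,b,i)
}
```

After the two input transposes, the index pairs (a,i) and (k,c) are contiguous, which is what allows the contraction to be handed to a single matrix multiply.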

4 PREVIOUS WORK
No runtime high-performance tensor transpose library exists for GPUs.
The previous implementation by my co-author [1] was sub-optimal on GPU platforms.
The work in [2] relies on a compiler to build custom kernels, i.e. it is not a runtime approach.
[1] Dmitry I. Lyakh. An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU. Computer Physics Communications 189, (2015), DOI:
[2] Paul Springer, Aravind Sankaran, and Paolo Bientinesi. TTC: A Tensor Transposition Compiler for Multiple Architectures.

5 TENSOR TRANSPOSE ALGORITHMS

6 MATRIX TRANSPOSE: TILED ALGORITHM
Step 1: Read a 32x32 tile from global memory into shared memory.
__syncthreads()
Step 2: Read shared memory in transposed order and write to global memory.
Mark Harris, An Efficient Matrix Transpose in CUDA C/C++, Parallel Forall Blog:
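The two steps can be illustrated with a host-side C++ emulation. This is a sketch only: the local `tile` array stands in for the shared-memory tile, the comment marks where `__syncthreads()` would sit in a kernel, and the name `transposeTiled` and the row-major layout are assumptions made for the example. Thread mapping and shared-memory bank-conflict padding are not modeled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kTile = 32;  // tile edge, as on the slide

// Transpose a rows x cols row-major matrix into a cols x rows row-major matrix.
std::vector<float> transposeTiled(const std::vector<float>& in, int rows, int cols) {
  assert(in.size() == static_cast<std::size_t>(rows) * cols);
  std::vector<float> out(in.size());
  float tile[kTile][kTile];  // stands in for the __shared__ tile
  for (int r0 = 0; r0 < rows; r0 += kTile) {
    for (int c0 = 0; c0 < cols; c0 += kTile) {
      const int h = std::min(kTile, rows - r0);  // partial tiles at the edges
      const int w = std::min(kTile, cols - c0);
      // Step 1: read the tile from "global memory" along contiguous input rows.
      for (int r = 0; r < h; r++)
        for (int c = 0; c < w; c++)
          tile[r][c] = in[static_cast<std::size_t>(r0 + r) * cols + (c0 + c)];
      // In a kernel, __syncthreads() separates the two steps.
      // Step 2: read the tile in transposed order, write contiguous output rows.
      for (int c = 0; c < w; c++)
        for (int r = 0; r < h; r++)
          out[static_cast<std::size_t>(c0 + c) * rows + (r0 + r)] = tile[r][c];
    }
  }
  return out;
}
```

Both the global read in step 1 and the global write in step 2 walk contiguous memory; the strided access is confined to the tile, which is the point of staging through shared memory.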

7 TILED ALGORITHM
[Diagram labels: shared memory; volume looped over using TB (thread block)]
Constant shared memory usage (~32x32).
Performs well when d1 and d5 are fairly large (~32).
Poor performance for small (2-8) dimensions.
Would it be possible to pack multiple small dimensions into shared memory?

8 PACKED ALGORITHM
[Diagram labels: shared memory; TB loop volume]
No longer uses a 32x32 shared memory tile.
Loads entire dimensions into shared memory (not tiled).
As much shared memory is allocated as it takes to store the elements.
Must choose which dimensions to pack.
New problem: what if e.g. d5 is very large?

9 PACKED-SPLIT ALGORITHM
[Diagram labels: shared memory; TB loop volume]
Split the largest dimension.
The number of splits is determined by the shared memory size.
Must choose which dimensions to pack, and the number of splits.
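One way to read "the number of splits is determined by the shared memory size" is as a lower bound: the smallest split count for which one chunk of the packed volume fits in shared memory. The C++ sketch below computes that bound. The function name `minSplitsToFit` and the selection rule are assumptions made for illustration, not the library's actual plan search, which may consider several split counts.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Smallest number of splits of the largest packed dimension such that one chunk
// (all other packed dimensions times a slice of the largest) fits in smemBytes.
// Returns -1 if even a slice of length 1 does not fit.
int minSplitsToFit(const std::vector<int>& packedDims, std::size_t sizeofType,
                   std::size_t smemBytes) {
  assert(!packedDims.empty());
  const int largest = *std::max_element(packedDims.begin(), packedDims.end());
  std::size_t volume = 1;
  for (int dim : packedDims) volume *= dim;
  const std::size_t rest = volume / largest;  // volume of the unsplit dimensions
  for (int numSplit = 1; numSplit <= largest; numSplit++) {
    const std::size_t chunk = (largest - 1) / numSplit + 1;  // ceil(largest / numSplit)
    if (rest * chunk * sizeofType <= smemBytes) return numSplit;
  }
  return -1;
}
```

For example, packing dimensions (4, 1000, 3) of 8-byte elements into 48 KiB needs at least two splits under this rule, since 12 x 1000 x 8 bytes does not fit but 12 x 500 x 8 bytes does.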

10 MEMORY POSITION CALCULATION

11 GLOBAL MEMORY POSITION CALCULATION
[Diagram labels: glread, shread, glwrite]
s = 0, ..., H-1, where H = number of elements in shared memory
p = 0, ..., M-1, where M = number of elements in loop volume
Need to convert the scalar positions s and p to global memory positions:
glread = global memory read
glwrite = global memory write
The global memory position is split into:
glread = glminorread(s) + glmajorread(p)
glwrite = glminorwrite(s) + glmajorwrite(p)

12 MAJOR POSITION CALCULATION
p = 0, ..., M-1
$\mathrm{glmajorread}(p) = \sum_{i=1}^{n} \operatorname{mod}\left(\left\lfloor p / c_i \right\rfloor, d_i\right) t_i$
    // int p = 0,...,M-1
    // int c[n] = {1, d3, d3*d4}
    // int d[n] = {d3, d4, d6}
    // int t[n] = {d1*d2, d1*d2*d3, d1*d2*d3*d4*d5}
    int glmajorread = 0;
    for (int i = 0; i < n; i++) {
      glmajorread += ((p / c[i]) % d[i]) * t[i];
    }
O(n)
Observation: p is constant within a thread block (and therefore within a warp).
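The loop on this slide can be wrapped into a small host-side C++ function and checked against the rank-6 example in its comments, where dimensions 3, 4, and 6 form the loop volume. The function name `glMajorRead` and the use of `std::vector` are choices made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Convert the scalar loop-volume position p into a global memory offset.
// c[i]: cumulative size of the loop dimensions before dimension i
// d[i]: extent of loop dimension i
// t[i]: stride of loop dimension i in the full tensor
int glMajorRead(int p, const std::vector<int>& c, const std::vector<int>& d,
                const std::vector<int>& t) {
  assert(c.size() == d.size() && d.size() == t.size());
  int pos = 0;
  for (std::size_t i = 0; i < c.size(); i++) {
    pos += ((p / c[i]) % d[i]) * t[i];  // index along dimension i, times its stride
  }
  return pos;
}
```

Each term costs one divide, one modulo, and one multiply, so the serial version is O(n) in the number of loop dimensions.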

13 WARP-PARALLEL POSITION CALCULATION
p = 0, ..., M-1
$\mathrm{glmajorread}(p) = \sum_{i=1}^{n} \operatorname{mod}\left(\left\lfloor p / c_i \right\rfloor, d_i\right) t_i$
    // int p = 0,...,M-1
    // One element of c, d, and t per warp lane:
    // int c = {1, d3, d3*d4, 1,..., 1}
    // int d = {d3, d4, d6, 1,..., 1}
    // int t = {d1*d2, d1*d2*d3, d1*d2*d3*d4*d5,...}
    int glmajorread = ((p / c) % d) * t;
    for (int i = 16; i >= 1; i /= 2) {
      glmajorread += __shfl_xor(glmajorread, i);
    }
Single divide, modulo, and multiply.
O(1), i.e. performance is independent of tensor rank.
Works up to n = 32.
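The butterfly reduction on this slide can be emulated on the host to see why every lane ends up with the full sum. In the C++ sketch below, array element `[lane]` plays the role of that lane's register and `val[lane ^ i]` plays the role of the XOR shuffle. The names `Lanes` and `glMajorReadWarp` are made up for this example, and padding lanes are assumed to hold c = 1 and d = 1 so that they contribute zero.

```cpp
#include <array>
#include <cassert>

constexpr int kWarpSize = 32;
using Lanes = std::array<int, kWarpSize>;  // one value per warp lane

// Host emulation of the warp-parallel major position calculation.
// Returns the per-lane results; after the reduction all lanes hold the same sum.
Lanes glMajorReadWarp(int p, const Lanes& c, const Lanes& d, const Lanes& t) {
  assert(p >= 0);
  Lanes val;
  // Each lane evaluates one term: a single divide, modulo, and multiply.
  for (int lane = 0; lane < kWarpSize; lane++)
    val[lane] = ((p / c[lane]) % d[lane]) * t[lane];
  // Butterfly reduction: five XOR-shuffle steps for 32 lanes.
  for (int i = kWarpSize / 2; i >= 1; i /= 2) {
    Lanes next;
    for (int lane = 0; lane < kWarpSize; lane++)
      next[lane] = val[lane] + val[lane ^ i];  // value read from the partner lane
    val = next;
  }
  return val;
}
```

The number of shuffle steps is fixed at log2(32) = 5 regardless of how many lanes carry real dimensions, which is where the rank-independent cost comes from.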

14 MINOR POSITION CALCULATION
[Diagram labels: glminorread(s), shread(s), glminorwrite(s); shared memory; s = 0, ..., H-1]
For the Tiled algorithm this is trivial.
For Packed and Packed-Split, pre-compute the positions and store them in registers.
Number of registers per thread: numreg = (H - 1)/blockDim.x + 1
    int glminorread[numreg]
    int shread[numreg]
    int glminorwrite[numreg]
The kernel is templated on numreg.
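The register count formula is a ceiling division of H by the block size. The C++ sketch below evaluates it and shows one plausible way a thread could enumerate the shared-memory positions it owns. The strided assignment s = threadIdx.x + j * blockDim.x and the names `numRegFor` and `ownedPositions` are assumptions made for this example; the slide does not state the mapping. Making the count a template parameter mirrors the templated kernel, since fixed-size arrays are what allow the values to live in registers.

```cpp
#include <array>
#include <cassert>

// numreg = (H - 1) / blockDim.x + 1, i.e. ceil(H / blockDim.x).
inline int numRegFor(int H, int blockDimX) {
  assert(H >= 1 && blockDimX >= 1);
  return (H - 1) / blockDimX + 1;
}

// Shared-memory positions handled by one thread, assuming a strided assignment.
// Slots past the end of shared memory are marked with -1.
template <int numReg>
std::array<int, numReg> ownedPositions(int threadIdxX, int blockDimX, int H) {
  std::array<int, numReg> s;
  for (int j = 0; j < numReg; j++) {
    const int pos = threadIdxX + j * blockDimX;
    s[j] = (pos < H) ? pos : -1;
  }
  return s;
}
```

With H = 100 and 32 threads per block, each thread needs 4 slots, and only threads 0 to 3 use the last one.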

15 ALGORITHM & PARAMETER CHOICE

16 CHOOSING THE BEST ALGORITHM
Algorithm choice: Tiled, Packed, Packed-Split
Tiled: no free parameters
Packed: input and output ranks
Packed-Split: input and output ranks, number of splits
Large performance differences between different algorithm and parameter choices.

17 CUTT PLANS
    cuttResult cuttPlanMeasure(cuttHandle* handle, int rank, int* dim, int* permutation, size_t sizeofType, cudaStream_t stream, void* idata, void* odata);
    cuttResult cuttPlan(cuttHandle* handle, int rank, int* dim, int* permutation, size_t sizeofType, cudaStream_t stream);
Measure plans perform all possible tensor transposes and choose the best-performing plan. LARGE overhead.
Heuristic plans choose the best plan by estimating the transpose runtime with an analytical GPU performance model. SMALL overhead.
Heuristic plans must be used in QM calculations.
Getting the heuristic planning to work accurately was a major hurdle.
A better approach is needed for choosing the heuristic plans (machine learning?).

18 BENCHMARKS

19 BENCHMARK 1
Tensor ranks 2 to 7.
Ratio between largest and smallest tensor dimensions: 1:1, 5:1, and 15:1.
Tensor volume normally distributed with an average of 200M elements and a standard deviation of 20M elements.
500 random permutations for each tensor rank and ratio.
9000 tensor transposes in total.

20 TESLA K20X
* Maximum bandwidth measured using GPU-STREAM: Tom Deakin, James Price, Matt J. Martineau, and Simon N. McIntosh-Smith. GPU-STREAM v2.0: Benchmarking the achievable memory bandwidth of many-core processors across diverse parallel programming models. Paper presented at the P^3MA Workshop at ISC High Performance, Frankfurt, Germany.

21 TESLA M40

22 TESLA P100

23 BENCHMARK 2
Tensor ranks 8 and 12.
Rank 8: (5, 3, 2, 4, 35, 33, 37, 40), 200M elements.
Rank 12: (2, 3, 4, 3, 2, 2, 3, 2, 20, 18, 22, 24), 328M elements.
500 random permutations for both tensor ranks.
Simulates a realistic workload in quantum chemistry calculations.

24 TESLA K20X

25 TESLA M40

26 TESLA P100

27 PERFORMANCE DISTRIBUTION

28 BENCHMARK 3
Set of 57 tensor transposes from (TTC): P. Springer, J. R. Hammond, and P. Bientinesi. TTC: A high performance compiler for tensor transpositions. CoRR,
Somewhat easy benchmark due to the small number of permutations.

29 TESLA K40M
TTC average: 140 GiB/s
cuTT average: 144 GiB/s
TTC data from: Paul Springer, Aravind Sankaran, and Paolo Bientinesi. TTC: A Tensor Transposition Compiler for Multiple Architectures.

30 BENCHMARK 4
Real-world tensor contractions performed with TAL-SH (Tensor Algebra Library for Shared Memory Computers), Dmitry I. Lyakh at Oak Ridge National Laboratory.
9306 random permutations on tensors up to rank 8.
Matrix multiply performed using cuBLAS.

31 TESLA K20X GPU
[Figure: GFlop/s and percentage of max. performance versus arithmetic intensity; panels (a) single precision and (b) double precision; curves: best, average, worst]
Contraction: D = D + L R
$\text{Arithmetic Intensity} = \dfrac{2 \sqrt{\mathrm{vol}(D)\, \mathrm{vol}(L)\, \mathrm{vol}(R)}}{\mathrm{vol}(D) + \mathrm{vol}(L) + \mathrm{vol}(R)}$
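A short C++ sketch of the arithmetic intensity used on the x-axis, read as floating-point operations divided by elements touched. The numerator is taken to be 2 sqrt(vol(D) vol(L) vol(R)); the square root is an assumption of this sketch, chosen because every index extent appears in exactly two of the three tensors, so the square root of the volume product is the number of multiply-add pairs. The function name `arithmeticIntensity` is made up for the example.

```cpp
#include <cassert>
#include <cmath>

// Arithmetic intensity of D = D + L R in flops per tensor element touched.
// volD, volL, volR are the element counts of the three tensors.
double arithmeticIntensity(double volD, double volL, double volR) {
  assert(volD > 0.0 && volL > 0.0 && volR > 0.0);
  // sqrt(volD * volL * volR) is the product of all index extents;
  // each index combination costs one multiply and one add.
  const double flops = 2.0 * std::sqrt(volD * volL * volR);
  return flops / (volD + volL + volR);
}
```

For the contraction D(a,b,i) += L(a,c,i,k) R(k,b,c) with extents a=2, b=3, c=4, i=5, k=6, the volumes are 30, 240, and 72, the flop count is 2 x 720, and the intensity is 1440/342, roughly 4.2.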

32 TESLA M40
Single precision

33 TESLA P100
[Figure: GFlop/s and percentage of max. performance versus arithmetic intensity; panels (a) single precision and (b) double precision; curves: best, average, worst]

34 CONCLUSIONS & ACKNOWLEDGEMENTS

35 CONCLUSIONS
Fully runtime library for high-performance tensor transposing on NVIDIA GPUs.
Extensive benchmarking:
Achieves a median of 70-80% of the maximum achievable memory bandwidth.
Performance equals or exceeds that of the compiler-based approach (TTC).
Enables close to peak FLOP tensor contractions on P100.
Integrated as part of TAL-SH (
Work underway to be used in NWCHEM, DIRAC, LS-DALTON, and ACES-IV.
Source code available at:
Manuscript available at:

36 ACKNOWLEDGEMENTS
Dmitry I. Lyakh at the Oak Ridge Leadership Computing Facility at ORNL.
ORNL, where 80% of the work was done.
