Targeting Extreme Scale Computational Challenges with Heterogeneous Systems
Oreste Villa, Antonino Tumeo
Pacific Northwest National Laboratory (PNNL)
Introduction
- PNNL Laboratory Directed Research & Development (LDRD) initiative
- XSCI - eXtreme Scale Computing Initiative
- Focuses on the computational challenges associated with programmability of heterogeneous systems
- Targets two PNNL key applications:
  - Quantum Chemistry - NWChem
  - Subsurface Transport - STOMP
NWChem (http://www.nwchem-sw.org)
- Scalable computational chemistry tool
- Treats large scientific computational chemistry problems
- Exploits workstations, clusters, and high performance parallel supercomputers
- Handles:
  - Biomolecules, nanostructures, and solid-state systems
  - From quantum to classical, and all combinations
  - Gaussian basis functions or plane waves
  - Scaling from one to thousands of processors
  - Properties and relativity
- Actively developed by a consortium
- Maintained by the Environmental Molecular Science Laboratory (EMSL) located at the Pacific Northwest National Laboratory (PNNL)
NWChem Quantum Chemistry Module
- Quantum Chemistry Coupled Cluster module
- Approximates the solution to the Schrödinger equation using the equivalent of a Taylor expansion
- Only NWChem module capable of achieving several petaflops of sustained performance
- Used, for instance, to understand fundamental optical processes in solar cells, other optically active materials, and photosynthesis
NWChem - Computational Challenge

Phase                    | Iterations               | Comp. Complexity | Data Movement       | Variations
Hartree-Fock             | 1                        | no^4             | no^3                | 1
4-index transformation   | 1                        | N^4              | N^3                 | 1
CCSD (iterative)         | 4-40 (till convergence)  | N^5              | N^4 in, N^4 out     | ~100 types
CCSD(T) (non-iterative)  | 1                        | N^7              | N^6 in, scalar out  | 18 types

- no = number of occupied orbitals (grows with molecule size)
- nu = number of unoccupied orbitals (grows with energy/temperature and precision)
- N = no + nu
- Problem size (uracil molecule: 54 no, 230 nu): 4 N^4 tensors of ~6 GB each; N^6 tensors are not fully stored
- Permutations and symmetry of contractions (the 18 and ~100 types)
TCG - Tensor Contraction Generator
- Builds on the NWChem Tensor Contraction Engine (TCE)
- Developed in Python
- Tiling: the whole spinorbital domain is partitioned into tiles which contain several spinorbitals of the same spatial and spin symmetry
- Domain Specific Language (DSL): takes as input the mathematical description of the tensor contractions
- Limited hints for enhancing code generation
- Generates code for:
  - CUDA, OpenCL, OpenMP, OpenACC
  - (the NWChem TCE itself targets Fortran)
- Optimizes for the different targets
- Currently mostly focused on CUDA

W. Ma, S. Krishnamoorthy, O. Villa, K. Kowalski, "GPU-based implementations of the noniterative regularized-CCSD(T) corrections: Applications to strongly correlated systems," Journal of Chemical Theory and Computation, 2011.
W. Ma, S. Krishnamoorthy, O. Villa, K. Kowalski, "Optimizing tensor contraction expressions for hybrid CPU-GPU execution," IEEE Cluster 2011.
K. Kowalski, S. Krishnamoorthy, O. Villa, J.R. Hammond, N. Govind, J. Chem. Phys., 2011.
Basic CUDA kernel
- Memory management
  - Reuses memory allocated for previous kernel calls
- Kernel arguments
  - Common stride offsets pre-computed by the host
  - Encoding of thread-block specific arguments
  - CUDA block dimensions used to map index dimensions
- Computation in a thread block
  - One thread computes one element in the output array
  - The threads in each thread block cooperate in moving data between GPU memory and shared memory
  - Default thread block size: 16x16 = 256 threads
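For illustration, a minimal sketch in the spirit of the baseline generated kernel: one thread per output element, a 16x16 thread block, and cooperative staging of tiles in shared memory. The two-dimensional shape, the names, and the assumption that all sizes are multiples of the tile are simplifications for this sketch, not the actual TCG-generated code.

```cuda
#define TILE 16

// Sketch: each thread computes one element of a 2-D output "brick"
// C[i,j] = sum_k A[i,k] * B[k,j]; the block stages TILE x TILE tiles of A and B
// in shared memory. M, N, K are assumed to be multiples of TILE here.
__global__ void contract_brick(const double *A, const double *B, double *C,
                               int M, int N, int K)
{
    __shared__ double sA[TILE][TILE];
    __shared__ double sB[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;   // output row handled by this thread
    int col = blockIdx.x * TILE + threadIdx.x;   // output column handled by this thread
    double acc = 0.0;

    for (int t = 0; t < K; t += TILE) {
        // The whole block cooperates in moving one tile of A and one tile of B.
        sA[threadIdx.y][threadIdx.x] = A[row * K + (t + threadIdx.x)];
        sB[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * N + col];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += sA[threadIdx.y][k] * sB[k][threadIdx.x];
        __syncthreads();                         // wait before overwriting the tiles
    }
    C[row * N + col] = acc;                      // one thread writes one output element
}
```

Real generated kernels handle higher-dimensional index sets with stride offsets pre-computed by the host, which is where the small-dimension and index-arithmetic problems of the next slide come from.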
GPU code generation challenges
- Large tensors plus the memory requirements of the application result in each dimension being relatively small
  - Interferes with achieving good locality in the SM
  - Incurs high index computation overhead (modulo operations)
  - Poor thread block utilization on the GPU
- In general, tensor contraction is harder to compute efficiently than the square matrix-matrix multiplication often employed in benchmark studies
- Problem sizes are not known until runtime
- CCSD(T) has 18 contraction types
CUDA optimizations
- Index combining (general optimization)
  - When a pair of indices occurs in the same order in every occurrence, they can be replaced by one index whose size is their product
  - Reduces index calculations
- Dimension flattening (increases utilization for specific tile sizes)
  - The baseline approach works well only when the index dimensions are multiples of the thread block configuration; we flatten the loops and recreate them at the right size, at the cost of some extra index operations (see the sketch after the next slide)
  - More index operations, but better utilization
- Pipelining on the outer dimensions (streaming)
  - Most effective optimization to hide PCI Express transfers
Dimension flattening
[Figure: flattening several small loop dimensions into one thread-block-sized loop]
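A hedged illustration of the index arithmetic behind dimension flattening: several small tensor dimensions are collapsed into a single flat index so that the thread block is fully utilized even when no individual dimension matches the block size, and the original indices are recovered with div/mod operations (the extra index operations mentioned above). Kernel name, permutation, and layout are illustrative assumptions.

```cuda
// Sketch: three small dimensions i, j, k (sizes di, dj, dk) are flattened into one
// index; each thread recovers (i, j, k) and performs, as an example, a permuted copy
// out[k][j][i] = in[i][j][k].
__global__ void flattened_copy(const double *in, double *out, int di, int dj, int dk)
{
    int flat  = blockIdx.x * blockDim.x + threadIdx.x;
    int total = di * dj * dk;
    if (flat >= total) return;                 // only the tail block is partially idle

    int k = flat % dk;                         // innermost original index
    int j = (flat / dk) % dj;
    int i = flat / (dk * dj);                  // outermost original index

    out[(k * dj + j) * di + i] = in[(i * dj + j) * dk + k];
}
```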
CCSD(T) - Application Structure
[Diagram: inputs and outputs reside in Global Arrays; data is staged through CPU memory and moved as individual bricks to and from GPU memory]
NWChem Full Application Results
- Results for Kepler K20 on the NVIDIA PSG cluster (4 nodes, 2 Intel Xeon E5-2667 sockets + 2 K20 GPUs)
- Uracil (functions = 132, atoms = 12, closed shells = 29); Dodecane (functions = 160, atoms = 38, closed shells = 49)
[Bar charts: execution time (s) for Uracil and Dodecane]
- Note: results of the code generator tuned for Fermi, currently under refactoring for Kepler. System/cluster too small; waiting for ORNL Titan.
- Note: currently working on an SC13 Gordon Bell prize submission, > 1 Pflops
Under Development
- Code generators for other targets
  - OpenMP, OpenCL, and OpenACC
  - Mostly working, not yet optimized
- Many parameters to optimize for
  - Trying to distill the most relevant ones
- CCSD: evaluate cublas_dgemm + sorting (CUDA) instead of the custom kernel (a sketch follows below)
- Auto-tuning mechanisms
  - Generate all the possible code permutations
  - Choose the best code and processing units depending on the tensor contraction sizes
  - Evolutionary methods for the design space exploration?
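A hedged sketch of the dgemm + sorting idea: permute ("sort") the input tensors so the contracted indices become contiguous, then express the contraction as a single cublasDgemm call. The helper permute_to_matrix is a hypothetical reordering kernel mentioned only in comments; it is not part of cuBLAS or of the TCG code shown in this talk.

```cuda
#include <cublas_v2.h>

// Sketch: after sorting, the contraction reduces to C = A * B with A (m x k),
// B (k x n), C (m x n) in column-major layout as expected by cuBLAS.
void contract_with_dgemm(cublasHandle_t handle,
                         const double *dA, const double *dB, double *dC,
                         int m, int n, int k)
{
    const double one = 1.0, zero = 0.0;
    // ... launch permute_to_matrix(...) on the inputs here (hypothetical sorting step) ...
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k, &one, dA, m, dB, k, &zero, dC, m);
    // ... permute dC back to the tensor layout expected by the caller ...
}
```

The trade-off being evaluated is whether the cost of the extra sorting passes is recovered by the higher arithmetic efficiency of the vendor dgemm compared with the custom contraction kernels.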
STOMP - Subsurface Transport Over Multiple Phases (http://stomp.pnnl.gov)
- Hydrology code that simulates the flow, transport, and chemical reaction of species in the subsurface in the following phases:
  - Aqueous, gas, non-aqueous liquid, ice, and solid
STOMP Application Structure
- The application consists of two main phases:
  - Flow transport phase (large sparse linear system solver in PETSc)
  - ECKEChem (Equilibrium, Conservation, Kinetic)
- The domain is discretized into millions of small grid nodes ("cubes")
- Partial differential equations that describe the conservation of mass or energy are assembled for each grid node ("cube"), with from 8 to 100 species
- The resulting equations are nonlinear and solved using Newton-Raphson iteration, which involves performing LU decomposition on many small matrices (95% of total execution time in certain simulations)
STOMP Application Structure (II)
[Flowchart] For each time step:
1. Solve flow (PETSc solver, OpenMP threads)
2. Solve for flux (OpenMP threads, PETSc solver) and obtain flow velocities
3. Solute transport loop: obtain values for kinetic and conservative species
4. Solve reaction chemistry - ECKEChem (hybrid CUDA and OpenMP threads)
5. Obtain new values for kinetic and conservative species; if not converged, repeat; on convergence, advance to the next time step
ECKEChem
[Diagram] For each grid node:
- Inputs: species concentrations (O(N)), conservation, kinetic, and equilibrium equations (constant for the whole program, all grid nodes), porosity (constant for the whole program)
- Assembly of the Jacobian matrix A (NxN) and solution of A x = b via LU decomposition (~O(N^3))
- Newton-Raphson iteration (typically 2-20 iterations): new species concentrations (O(N)), check residuum
- Embarrassingly parallel across grid nodes; lots of parallelism; small matrices that do not always fit in shared memory (N from ~8 to ~100)
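A hedged sketch of the per-grid-node Newton-Raphson loop described above; the solve_step callback stands in for the ECKEChem Jacobian assembly plus LU solve (a hypothetical interface, not the actual STOMP routines).

```cuda
#include <cmath>
#include <functional>
#include <vector>

// Sketch: one Newton-Raphson iteration assembles A(x) and b(x), solves A dx = b
// (LU, ~O(N^3)), updates the species concentrations (O(N)), and checks the residuum.
void newton_raphson_node(std::vector<double> &x,
                         std::function<void(const std::vector<double>&,
                                            std::vector<double>&)> solve_step,
                         int max_iter, double tol)
{
    std::vector<double> dx(x.size());
    for (int it = 0; it < max_iter; ++it) {    // typically 2-20 iterations
        solve_step(x, dx);                     // assemble Jacobian and solve for the update
        double res = 0.0;
        for (std::size_t i = 0; i < x.size(); ++i) {
            x[i] += dx[i];                     // new species concentrations
            res += dx[i] * dx[i];
        }
        if (std::sqrt(res) < tol) break;       // convergence check on the update norm
    }
}
```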
STOMP - Solver Requirement
- Manage systems up to ~100x100 with full pivoting support
- Three solutions available for CUDA:
  1. One thread block, one matrix (M. Fatica et al.)
     - Loads small matrices into shared memory
     - Limited by shared memory size (max 76x76)
     - Full pivoting is complex, as it disrupts parallelism
  2. One warp, one matrix (cuBLAS version <= 5.0)
     - cublasDgetrfBatched (i.e., PA = LU), cublasDtrsmBatched (i.e., Lc = y, Ux = c)
     - No full pivoting, but easy to maintain (waiting for the next CUDA version)
     - Limited by warp size (max 32x32)
  3. One thread, one matrix (PNNL)
     - No use of shared memory
     - Coalescing of the matrices in heap space allocated by one thread per thread block
     - Manages matrices of size up to 128x128 (limited by memory size; above this size it is not needed, as other methods are more efficient)
     - Easy to implement full pivoting
Batched LU: heterogeneous approach
[Diagram: the batch of matrices A1 ... Ai+1 is partitioned between the GPU SMs (SM0, SM1, ...) and the CPU; both perform LU factorizations concurrently]
- A time-stamp based load balancer uses timing information from the previous execution to partition the work among GPUs and CPU
- CPU
  - One OpenMP thread per matrix
- GPU
  - No use of shared memory
  - Coalescing of the matrices in heap space allocated by one thread per thread block
  - A custom memory manager avoids calling the real malloc and free
  - Manages matrices of size up to 128x128 (limited by memory size; above this size it is not needed, as other methods are more efficient)
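A hedged sketch of how such a time-stamp based split could be computed; the function and parameter names are illustrative, not taken from the STOMP sources.

```cuda
// Sketch: given the measured time per matrix on the CPU and on the GPU during the
// previous ECKEChem call, choose the GPU share so that both devices, running
// concurrently, finish at roughly the same time:
//   n_gpu * t_gpu ~= (n - n_gpu) * t_cpu   =>   n_gpu = n * t_cpu / (t_cpu + t_gpu)
int gpu_share(int n_matrices, double t_cpu_per_matrix, double t_gpu_per_matrix)
{
    double f = t_cpu_per_matrix / (t_cpu_per_matrix + t_gpu_per_matrix);
    int n_gpu = (int)(f * n_matrices + 0.5);
    return n_gpu;   // the remaining n_matrices - n_gpu matrices go to the OpenMP threads
}
```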
Batched LU: one thread, one matrix (CUDA kernel)
[Kernel structure]
1. One thread allocates (heap) memory for the whole block
2. Coalesce A and B into the block's buffer
3. Each thread performs the LU factorization of a single matrix
4. Un-coalesce A and B back
5. One thread frees the memory for the block
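A minimal sketch of the one-thread-one-matrix idea, assuming row-major storage and omitting the heap allocation, coalescing, and full-pivoting steps listed above; names and the no-pivoting simplification are assumptions for this sketch, not the actual PNNL kernel.

```cuda
// Sketch: one thread factorizes one N x N matrix in place (Doolittle LU, no pivoting).
// The real kernel stages the matrices in a per-block heap buffer for coalesced access
// and implements full pivoting.
template <int N>
__global__ void batched_lu_one_thread(double *A, int batch)
{
    int m = blockIdx.x * blockDim.x + threadIdx.x;   // matrix handled by this thread
    if (m >= batch) return;
    double *a = A + (size_t)m * N * N;               // row-major N x N matrix

    for (int k = 0; k < N; ++k) {
        for (int i = k + 1; i < N; ++i) {
            double l = a[i * N + k] / a[k * N + k];  // multiplier, stored in place of L
            a[i * N + k] = l;
            for (int j = k + 1; j < N; ++j)
                a[i * N + j] -= l * a[k * N + j];    // update the trailing submatrix
        }
    }
}
```

With one matrix per thread there is no intra-matrix synchronization at all, which is what makes pivoting easy to add; the price is the uncoalesced access pattern that the coalescing steps above are designed to fix.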
Batched LU: one thread, one matrix (coalescing overhead)
[Chart: percentage of time spent coalescing before solving the systems]
Performance Results on Kepler K20c
*CPU = Xeon X5650 at 2.67 GHz, 6 Nehalem cores with 12 threads and 12 MB of L3
Power Results on Kepler K20c and Fermi M2090
[Charts: power profiles on Fermi and Kepler]
O. Villa, M. Fatica, A. Tumeo, N. Gawande, "Power/Performance Trade-offs of Small Batched LU Based Solvers on GPUs," submitted to Euro-Par 2013.
STOMP Full Application Results
- 84 equations per grid node
- Approx. 105,000 grid nodes
- PETSc iterations: 2902
- ECKEChem iterations: 26
- Heterogeneous version: OpenMP + GPU
  - PETSc multi-threading
  - Inter-node: the 3D domain is split into regions
    - No load balancing (as in the original code)
  - Intra-node:
    - Task-based load balancer, since each grid node reaches convergence (in Newton-Raphson) at a different time
    - Assembly of the Jacobian directly on the GPU (only O(N) data are moved to/from GPU memory)
STOMP Full Application Results
- Each cluster node: 2 AMD Opteron 6272 (Bulldozer/Interlagos) + 1 M2090

Configuration                               | total_time (s) | ECKEChem_time (s)
N = 8, n = 128, npernode = 16, w/o OpenMP   | 656.60         | 519.40
N = 8, n = 8, npernode = 1, threads = 16    | 195.60         | 142.70
N = 8, n = 8, npernode = 1, OpenMP & GPU    | 132.40         | 78.80

[Chart annotations: ~5x overall, 2.6x PETSc, 3.6x ECKEChem, 1.8x ECKEChem]
Intra-Node Load Balancing (16 CPU cores + GPU) - Node 0
[Chart: grid nodes assigned to the GPU and to the CPU for Process 0, with the corresponding GPU/CPU times (s), over the ECKEChem calls in a single step]
Intra-Node Load Balancing (16 CPU cores + GPU) - Node 1
[Chart: grid nodes assigned to the GPU and to the CPU for Process 1, with the corresponding GPU/CPU times (s), over the ECKEChem calls in a single step]
Conclusions
- We have presented two important DOE PNNL applications
  - NWChem (quantum chemistry) and STOMP (subsurface transport) use both the CPU and the GPU concurrently during the most computationally intensive portions of the applications
- Full application speedup using both GPUs and CPUs
  - NWChem CCSD(T) is ~4x on dual-socket Sandy Bridge platforms when using Kepler
  - STOMP is ~5x after restructuring, using Fermi on an Interlagos host platform
- Challenges for NWChem CCSD
  - Unfavorable computation/data-movement ratio
  - Tensor sorting routine to couple with cublas_dgemm
- Challenges for STOMP
  - More efficient 100x100 batched LU solver
Thank you very much for your attention!
Questions?
Oreste Villa - oreste.villa@pnnl.gov
Antonino Tumeo - antonino.tumeo@pnnl.gov
Profiler of LU decomposition on Kepler K20c
[Profiler screenshot]
Pipelined execution
- We can avoid the O(N^6) copy in, keep only the copy out, and then accumulate on the host
- We can create streams of kernels using one loop dimension and asynchronously copy out partial data
[Diagram: host support and streams]
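A hedged sketch of the streaming pattern: each slice of the outer loop dimension runs in its own CUDA stream and its partial result is copied back asynchronously, so PCIe transfers overlap with the kernels of the following slices. The kernel, names, and launch configuration are illustrative, not the generated code.

```cuda
#include <cuda_runtime.h>

__global__ void compute_slice(double *out, size_t n) { /* generated kernel body */ }

// Sketch: overlap copy-out with computation using a few streams. For real overlap,
// h_out must be pinned host memory (cudaHostAlloc); the host accumulates the partial
// results after synchronization.
void pipelined_copy_out(double *d_out, double *h_out, size_t slice_elems, int n_slices)
{
    const int NSTREAMS = 4;
    cudaStream_t s[NSTREAMS];
    for (int i = 0; i < NSTREAMS; ++i) cudaStreamCreate(&s[i]);

    for (int i = 0; i < n_slices; ++i) {
        cudaStream_t st = s[i % NSTREAMS];
        double *d_slice = d_out + (size_t)i * slice_elems;
        compute_slice<<<256, 256, 0, st>>>(d_slice, slice_elems);      // slice i
        cudaMemcpyAsync(h_out + (size_t)i * slice_elems, d_slice,      // copy-out overlaps
                        slice_elems * sizeof(double),                  // with later slices
                        cudaMemcpyDeviceToHost, st);
    }
    cudaDeviceSynchronize();
    for (int i = 0; i < NSTREAMS; ++i) cudaStreamDestroy(s[i]);
}
```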