Targeting Extreme Scale Computational Challenges with Heterogeneous Systems

Targeting Extreme Scale Computational Challenges with Heterogeneous Systems
Oreste Villa, Antonino Tumeo
Pacific Northwest National Laboratory (PNNL)

Introduction
- PNNL Laboratory Directed Research & Development (LDRD) initiative
- XSCI: eXtreme Scale Computing Initiative
- Focuses on the computational challenges associated with the programmability of heterogeneous systems
- Targets two key PNNL applications:
  - Quantum chemistry: NWChem
  - Subsurface transport: STOMP

NWChem (http://www.nwchem-sw.org)
- Scalable computational chemistry tool
- Treats large scientific computational chemistry problems
- Exploits workstations, clusters, and high-performance parallel supercomputers
- Handles:
  - Biomolecules, nanostructures, and solid state
  - From quantum to classical, and all combinations
  - Gaussian basis functions or plane waves
  - Scaling from one to thousands of processors
  - Properties and relativity
- Actively developed by a consortium; maintained by the Environmental Molecular Sciences Laboratory (EMSL) located at the Pacific Northwest National Laboratory (PNNL)

NWChem Quantum Chemistry Module
- Quantum chemistry Coupled Cluster module
- Approximates the solution of the Schrödinger equation using the equivalent of a Taylor expansion
- The only NWChem module capable of achieving several petaflops of sustained performance
- Used, for instance, to understand fundamental optical processes in solar cells, other optically active materials, and photosynthesis

NWChem - Computational Challenge

| Phase | Iterations | Comp. complexity | Data movement | Variations |
|---|---|---|---|---|
| Hartree-Fock | 1 | no^4 | no^3 | 1 |
| 4-index transformation | 1 | N^4 | N^3 | 1 |
| CCSD (iterative) | 4-40 (till convergence) | N^5 | N^4 in, N^4 out | ~100 types |
| CCSD(T) (not iterative) | 1 | N^7 | N^6 in, scalar out | 18 types |

- no = number of occupied orbitals (grows with molecule size)
- nu = number of unoccupied orbitals (grows with energy/temperature and precision)
- N = no + nu
- Problem size (uracil molecule: no = 54, nu = 230): four N^4 tensors of ~6 GB each; N^6 tensors are not fully stored
- Permutations and symmetry of the contractions give the 18 and ~100 contraction types

TCG - Tensor Contraction Generator
- Builds on the NWChem Tensor Contraction Engine (TCE)
- Developed in Python
- Tiling: the whole spinorbital domain is partitioned into tiles, each containing several spinorbitals of the same spatial and spin symmetry
- Domain-Specific Language (DSL): takes as input the mathematical description of the tensor contractions
  - Limited hints for enhancing code generation
- Generates code for CUDA, OpenCL, OpenMP, and OpenACC (the NWChem TCE targets Fortran)
- Optimizes for the different targets; currently mostly focused on CUDA

References:
- W. Ma, S. Krishnamoorthy, O. Villa, K. Kowalski, "GPU-based implementations of the noniterative regularized-CCSD(T) corrections: Applications to strongly correlated systems," Journal of Chemical Theory and Computation, 2011
- W. Ma, S. Krishnamoorthy, O. Villa, K. Kowalski, "Optimizing tensor contraction expressions for hybrid CPU-GPU execution," IEEE Cluster, 2011
- K. Kowalski, S. Krishnamoorthy, O. Villa, J.R. Hammond, N. Govind, J. Chem. Phys., 2011

Basic CUDA kernel
- Memory management
  - Reuses memory allocated for previous kernel calls
- Kernel arguments
  - Common stride offsets pre-computed by the host
  - Encoding of thread-block-specific arguments
  - CUDA block dimensions used to map the index dimensions
- Computation in a thread block
  - One thread computes one element of the output array
  - The threads in each thread block cooperate in moving data between GPU (global) memory and shared memory
  - Default thread block size: 16x16 = 256 threads
(A minimal sketch of such a kernel follows below.)
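The sketch below is not the TCG-generated code; it is a minimal illustration of the scheme described on this slide, assuming the contraction has already been cast as C[e,f] += sum_m A[e,m] * B[m,f] with row-major storage. One thread computes one output element, and the 16x16 thread block cooperatively stages tiles of the inputs in shared memory.

```cuda
// Hedged sketch of a "basic" contraction kernel in the style described above.
#define TILE 16

__global__ void contract_tile(const double* __restrict__ A,
                              const double* __restrict__ B,
                              double* __restrict__ C,
                              int ne, int nf, int nm)   // extents of indices e, f, m
{
    __shared__ double sA[TILE][TILE];
    __shared__ double sB[TILE][TILE];

    int e = blockIdx.y * TILE + threadIdx.y;   // output row handled by this thread
    int f = blockIdx.x * TILE + threadIdx.x;   // output column handled by this thread
    double acc = 0.0;

    for (int m0 = 0; m0 < nm; m0 += TILE) {
        // Cooperative load of one tile of A and one tile of B (zero-padded at edges).
        int mA = m0 + threadIdx.x;
        int mB = m0 + threadIdx.y;
        sA[threadIdx.y][threadIdx.x] = (e < ne && mA < nm) ? A[e * nm + mA] : 0.0;
        sB[threadIdx.y][threadIdx.x] = (mB < nm && f < nf) ? B[mB * nf + f] : 0.0;
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += sA[threadIdx.y][k] * sB[k][threadIdx.x];
        __syncthreads();
    }

    if (e < ne && f < nf)
        C[e * nf + f] += acc;   // accumulate into the output brick
}
```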

GPU code generation challenges
- Large tensors plus the memory requirements of the application result in each dimension being relatively small
  - Interferes with achieving good locality in shared memory
  - Incurs high index-computation overhead (modulo operations)
  - Poor thread-block utilization on the GPU
- In general, tensor contraction is harder to compute efficiently than the square matrix-matrix multiplication often employed in benchmark studies
- Problem sizes are not known until runtime
- CCSD(T) has 18 contraction types

CUDA optimizations
- Index combining (general optimization)
  - When a pair of indices occurs in the same order in every occurrence, they can be replaced by a single index whose size is the product of the two
  - Reduces index calculations
- Dimension flattening (increases utilization for specific tile sizes)
  - The baseline approach works well only when the index dimensions are multiples of the thread-block configuration; we flatten the loops and recreate them with the right size, at the cost of some extra index operations
  - More index operations, but better utilization
- Pipelining on the outer dimensions (streaming)
  - The most effective optimization to hide PCI Express transfers
(A sketch of index combining and dimension flattening follows below.)
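The following is a hedged illustration (not TCG output) of the first two optimizations on the addressing side: two index pairs that always appear together are fused into "fat" indices, and the flattened output index is remapped onto 1-D blocks of 256 threads so that small, odd-sized extents no longer waste threads. The index extents ne, nf, ng, nh, nm are assumed small, as on the previous slide.

```cuda
// Hedged sketch of index combining + dimension flattening.
__global__ void contract_flattened(const double* A, const double* B, double* C,
                                   int ne, int nf, int ng, int nh, int nm)
{
    // Index combining: (e,f) and (g,h) always occur in the same order,
    // so they are fused into two combined indices ef and gh.
    int nef = ne * nf;
    int ngh = ng * nh;

    // Dimension flattening: instead of mapping (ef, gh) onto a fixed 16x16 block
    // (which under-utilizes threads when nef or ngh is not a multiple of 16),
    // map the flat output index onto a 1-D grid of 256-thread blocks.
    long total = (long)nef * ngh;
    long idx = (long)blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= total) return;

    int ef = idx / ngh;            // recreate the two loop indices
    int gh = idx % ngh;            // (the extra divide/modulo is the price paid)

    double acc = 0.0;
    for (int m = 0; m < nm; ++m)
        acc += A[(long)ef * nm + m] * B[(long)m * ngh + gh];
    C[idx] += acc;
}
```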

Dimension flattening (figure)

CCSD(T) - Application Structure (figure: inputs and outputs reside in Global Arrays in CPU memory; individual bricks are staged into GPU memory and back)

NWChem Full Application Results
Results for Kepler K20 on the NVIDIA PSG cluster (4 nodes, 2 Intel Xeon E5-2667 sockets + 2 K20 GPUs per node)
(charts: time in seconds for Uracil - functions = 132, atoms = 12, closed shells = 29 - and Dodecane - functions = 160, atoms = 38, closed shells = 49)
- Note: results are from the code generator tuned for Fermi, currently under refactoring for Kepler. The system/cluster is too small; waiting for ORNL Titan.
- Note: currently working on the SC13 Gordon Bell prize submission, > 1 Pflop/s

Under development
- Code generators for other targets
  - OpenMP, OpenCL, and OpenACC: mostly working, not yet optimized
  - Many parameters to optimize for; trying to distill the most relevant ones
- CCSD: evaluate cublasDgemm + sorting (CUDA) instead of a custom kernel (see the sketch below)
- Auto-tuning mechanisms
  - Generate all the possible code permutations
  - Choose the best code and processing units depending on the tensor-contraction sizes
  - Evolutionary methods for the design-space exploration?
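As an illustration of the "sort + cublasDgemm" alternative mentioned above, the sketch below permutes ("sorts") a 4-index tensor so the contracted indices become contiguous and then hands the resulting matrix product to cuBLAS. The permutation kernel and the host wrapper are hypothetical names, not NWChem code; only cublasDgemm is the real library call.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical permutation kernel: reorders a 4-index tensor from layout
// [i0][i1][i2][i3] to [i2][i3][i0][i1] so the contracted pair becomes contiguous.
__global__ void sort_4index(const double* in, double* out,
                            int n0, int n1, int n2, int n3)
{
    long idx = (long)blockIdx.x * blockDim.x + threadIdx.x;
    long total = (long)n0 * n1 * n2 * n3;
    if (idx >= total) return;
    int i3 = idx % n3;  long t = idx / n3;
    int i2 = t % n2;    t /= n2;
    int i1 = t % n1;
    int i0 = t / n1;
    out[(((long)i2 * n3 + i3) * n0 + i0) * n1 + i1] = in[idx];
}

// Host side: the contraction C[e,f] += sum_m A'[e,m] * B'[m,f] as one DGEMM.
// cuBLAS is column-major; treating row-major C as C^T lets one call compute
// C^T = B^T * A^T directly.
void contract_with_cublas(cublasHandle_t h, const double* dA, const double* dB,
                          double* dC, int ne, int nf, int nm)
{
    const double alpha = 1.0, beta = 1.0;
    cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N,
                nf, ne, nm,
                &alpha, dB, nf, dA, nm,
                &beta, dC, nf);
}
```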

STOMP - Subsurface Transport Over Multiple Phases (http://stomp.pnnl.gov)
- Hydrology code that simulates the flow, transport, and chemical reaction of species in the subsurface in the following phases:
  - Aqueous, gas, non-aqueous liquid, ice, and solid

STOMP Application Structure
- The application consists of two main phases:
  - Flow/transport phase (large sparse linear systems solved with PETSc)
  - ECKEChem (Equilibrium, Conservation, Kinetic Equation Chemistry)
- The domain is discretized into millions of small grid nodes ("cubes")
- Partial differential equations describing the conservation of mass or energy are assembled for each grid node (from 8 to 100 species)
- The resulting equations are nonlinear and are solved with Newton-Raphson iteration, which involves performing LU decomposition on many small matrices (95% of total execution time in certain simulations)

STOMP Application Structure (II) (flowchart)
- Solve flow (PETSc solver, OpenMP threads) and obtain flow velocities
- Solute transport loop: solve for flux (OpenMP threads, PETSc solver)
- Obtain values for kinetic and conservative species
- Solve reaction chemistry with ECKEChem (hybrid CUDA and OpenMP threads); obtain new values for kinetic and conservative species
- Iterate until convergence, then advance to the next time step

ECKEChem (diagram)
- For each grid node: the species concentrations (O(N)) together with the conservation, kinetic, and equilibrium equations and the porosity (constant for the whole program, for all grid nodes) feed the assembly of an NxN Jacobian matrix A
- Solve A x = b via LU decomposition (~O(N^3)) inside a Newton-Raphson loop (typically 2-20 iterations); obtain new species concentrations (O(N)) and check the residuum
- Embarrassingly parallel across grid nodes: lots of parallelism, but small matrices that do not always fit in shared memory (N from ~8 to ~100)
(A toy sketch of the per-node Newton-Raphson structure follows below.)
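To make the structure concrete, here is a self-contained toy sketch of the per-grid-node pattern just described: a Newton-Raphson loop whose inner step builds an NxN Jacobian and solves A*dx = b with a small dense LU (Gaussian elimination with partial pivoting). The residual used here is a placeholder, not the ECKEChem species equations, and all names are illustrative.

```cuda
#include <math.h>
#include <stdio.h>

#define N 8   // ECKEChem matrices range from ~8x8 to ~100x100

// Toy residual b_i(c) = c_i^2 - 2 (stand-in for conservation/kinetic equations).
static void residual(const double* c, double* b) {
    for (int i = 0; i < N; ++i) b[i] = c[i] * c[i] - 2.0;
}
// Corresponding (diagonal) Jacobian A_ij = d b_i / d c_j.
static void jacobian(const double* c, double A[N][N]) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            A[i][j] = (i == j) ? 2.0 * c[i] : 0.0;
}
// Dense solve of A*x = b by Gaussian elimination with partial pivoting (~O(N^3)).
static void lu_solve(double A[N][N], double* x, double* b) {
    for (int k = 0; k < N; ++k) {
        int p = k;
        for (int i = k + 1; i < N; ++i) if (fabs(A[i][k]) > fabs(A[p][k])) p = i;
        for (int j = 0; j < N; ++j) { double t = A[k][j]; A[k][j] = A[p][j]; A[p][j] = t; }
        double tb = b[k]; b[k] = b[p]; b[p] = tb;
        for (int i = k + 1; i < N; ++i) {
            double f = A[i][k] / A[k][k];
            for (int j = k; j < N; ++j) A[i][j] -= f * A[k][j];
            b[i] -= f * b[k];
        }
    }
    for (int i = N - 1; i >= 0; --i) {
        x[i] = b[i];
        for (int j = i + 1; j < N; ++j) x[i] -= A[i][j] * x[j];
        x[i] /= A[i][i];
    }
}

int main(void) {
    double c[N], b[N], dx[N], A[N][N];
    for (int i = 0; i < N; ++i) c[i] = 1.0;          // initial species concentrations
    for (int it = 0; it < 20; ++it) {                // typically 2-20 iterations
        residual(c, b);
        double nrm = 0.0;
        for (int i = 0; i < N; ++i) nrm += b[i] * b[i];
        if (sqrt(nrm) < 1e-12) break;                // residuum check
        jacobian(c, A);
        lu_solve(A, dx, b);                          // the batched-LU hot spot
        for (int i = 0; i < N; ++i) c[i] -= dx[i];
    }
    printf("c[0] = %f (expect sqrt(2))\n", c[0]);
    return 0;
}
```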

STOMP - Solver Requirement
- Manage systems up to ~100x100 with full pivoting support
- Three solutions available for CUDA:
  1. One thread block per matrix (M. Fatica et al.)
     - Loads small matrices into shared memory
     - Limited by shared-memory size (max 76x76)
     - Full pivoting is complex, as it disrupts parallelism
  2. One warp per matrix (cuBLAS version <= 5.0)
     - cublasDgetrfBatched (i.e., PA = LU), cublasDtrsmBatched (i.e., Lc = y, Ux = c)
     - No full pivoting, but easy to maintain (waiting for the next CUDA version)
     - Limited by warp size (max 32x32)
  3. One thread per matrix (PNNL)
     - No use of shared memory
     - Matrices coalesced in heap space allocated by one thread per thread block
     - Manages matrices up to 128x128 (limited by memory size; above this size other methods are more efficient)
     - Easy to implement full pivoting
(A hedged usage sketch of the batched cuBLAS path, option 2, follows below.)
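The sketch below shows how option 2 can be driven from the host with the real cuBLAS batched calls. It is not the STOMP code: matrices are assumed stored contiguously in column-major order, pivoting is disabled (PivotArray = NULL) so that the two triangular solves can be applied directly, and error checking is omitted.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Factorize and solve a batch of n x n systems, one right-hand side each.
void batched_lu_solve_cublas(cublasHandle_t handle,
                             double* dA,   // batch of n*n matrices, contiguous on device
                             double* dB,   // batch of n-element right-hand sides
                             int n, int batch)
{
    // Build the arrays of per-matrix device pointers that the batched API expects.
    std::vector<double*> hA(batch), hB(batch);
    for (int i = 0; i < batch; ++i) {
        hA[i] = dA + (size_t)i * n * n;
        hB[i] = dB + (size_t)i * n;
    }
    double **dAarr, **dBarr;
    cudaMalloc(&dAarr, batch * sizeof(double*));
    cudaMalloc(&dBarr, batch * sizeof(double*));
    cudaMemcpy(dAarr, hA.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);
    cudaMemcpy(dBarr, hB.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);

    int* dInfo;
    cudaMalloc(&dInfo, batch * sizeof(int));

    // A = L*U for every matrix in the batch (non-pivoting variant).
    cublasDgetrfBatched(handle, n, dAarr, n, /*PivotArray=*/NULL, dInfo, batch);

    // Forward solve L*c = b, then backward solve U*x = c, in place in dB.
    const double one = 1.0;
    cublasDtrsmBatched(handle, CUBLAS_SIDE_LEFT, CUBLAS_FILL_MODE_LOWER,
                       CUBLAS_OP_N, CUBLAS_DIAG_UNIT,
                       n, 1, &one, (const double* const*)dAarr, n, dBarr, n, batch);
    cublasDtrsmBatched(handle, CUBLAS_SIDE_LEFT, CUBLAS_FILL_MODE_UPPER,
                       CUBLAS_OP_N, CUBLAS_DIAG_NON_UNIT,
                       n, 1, &one, (const double* const*)dAarr, n, dBarr, n, batch);

    cudaFree(dAarr); cudaFree(dBarr); cudaFree(dInfo);
}
```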

Batched LU: heterogeneous approach (figure: the batch of matrices A1...An is partitioned between the GPU SMs and the CPU, each side performing LU factorizations)
- Time-stamp-based load balancer: uses timing information from the previous execution to partition the work between the GPUs and the CPU
- CPU: one OpenMP thread per matrix
- GPU:
  - No use of shared memory
  - Matrices coalesced in heap space allocated by one thread per thread block
  - A custom memory manager avoids calling the real malloc and free
  - Manages matrices up to 128x128 (limited by memory size; above this size other methods are more efficient)
(A hedged sketch of the CPU/GPU split follows below.)
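The sketch below illustrates the heterogeneous split and the timing-based repartitioning described above; it is not the STOMP implementation. lu_one_thread_per_matrix and lu_cpu are hypothetical placeholders standing in for the GPU kernel and a CPU factorization routine.

```cuda
#include <omp.h>
#include <cuda_runtime.h>

__global__ void lu_one_thread_per_matrix(double* A, int n, int count); // placeholder
void lu_cpu(double* A, int n);            // placeholder: factor one n x n matrix in place

void batched_lu_hetero(double* hA, double* dA, int n, int batch,
                       double* gpu_fraction)        // persists across calls
{
    int n_gpu = (int)(*gpu_fraction * batch);
    int n_cpu = batch - n_gpu;
    size_t mat = (size_t)n * n * sizeof(double);

    // GPU share: copy, then launch the one-thread-per-matrix kernel asynchronously.
    double t0 = omp_get_wtime();
    if (n_gpu > 0) {
        cudaMemcpy(dA, hA, n_gpu * mat, cudaMemcpyHostToDevice);
        lu_one_thread_per_matrix<<<(n_gpu + 63) / 64, 64>>>(dA, n, n_gpu);
    }

    // CPU share: one matrix per OpenMP iteration, overlapping with the GPU.
    double t1 = omp_get_wtime();
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n_cpu; ++i)
        lu_cpu(hA + (size_t)(n_gpu + i) * n * n, n);
    double t_cpu = omp_get_wtime() - t1;

    cudaDeviceSynchronize();                         // wall time until the GPU is done
    double t_gpu = omp_get_wtime() - t0;
    if (n_gpu > 0)
        cudaMemcpy(hA, dA, n_gpu * mat, cudaMemcpyDeviceToHost);

    // Time-stamp-based rebalancing for the next call: repartition according to
    // the measured per-matrix throughput of each side.
    if (n_gpu > 0 && n_cpu > 0 && t_gpu > 0.0 && t_cpu > 0.0) {
        double rate_gpu = n_gpu / t_gpu, rate_cpu = n_cpu / t_cpu;
        *gpu_fraction = rate_gpu / (rate_gpu + rate_cpu);
    }
}
```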

Batched LU: one thread per matrix (CUDA kernel)
(figure: kernel structure - one thread allocates memory for the block, A and B are coalesced, each thread performs a single-matrix LU, A and B are un-coalesced, one thread frees the memory for the block)
(A hedged sketch of this kernel structure follows below.)
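The following sketch mirrors the kernel structure in the figure above but is illustrative rather than the PNNL implementation: thread 0 of each block allocates one heap buffer (device-side malloc; the heap may need enlarging with cudaDeviceSetLimit), the block's matrices are copied into an interleaved ("coalesced") layout, each thread factorizes its own matrix (pivoting omitted here for brevity), and the factors are copied back before the buffer is freed.

```cuda
__global__ void batched_lu_one_thread(double* A, int n, int count)
{
    __shared__ double* buf;                          // heap buffer shared by the block
    int tid  = threadIdx.x;
    int gid  = blockIdx.x * blockDim.x + tid;        // which matrix this thread owns
    int bdim = blockDim.x;

    if (tid == 0)                                    // one thread allocates for the block
        buf = (double*)malloc((size_t)bdim * n * n * sizeof(double));
    __syncthreads();
    if (buf == NULL) return;

    // Coalesce: element e of this thread's matrix goes to buf[e*bdim + tid],
    // so consecutive threads touch consecutive addresses.
    if (gid < count)
        for (int e = 0; e < n * n; ++e)
            buf[(size_t)e * bdim + tid] = A[(size_t)gid * n * n + e];
    __syncthreads();

    // Single-matrix LU, in place (no pivoting in this sketch).
    if (gid < count) {
        for (int k = 0; k < n; ++k)
            for (int i = k + 1; i < n; ++i) {
                double f = buf[(size_t)(i * n + k) * bdim + tid] /
                           buf[(size_t)(k * n + k) * bdim + tid];
                buf[(size_t)(i * n + k) * bdim + tid] = f;
                for (int j = k + 1; j < n; ++j)
                    buf[(size_t)(i * n + j) * bdim + tid] -=
                        f * buf[(size_t)(k * n + j) * bdim + tid];
            }
        // Un-coalesce: copy the factors back to the caller's layout.
        for (int e = 0; e < n * n; ++e)
            A[(size_t)gid * n * n + e] = buf[(size_t)e * bdim + tid];
    }
    __syncthreads();
    if (tid == 0) free(buf);                         // one thread frees for the block
}
```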

Batched LU: one thread per matrix (coalescing overhead)
(chart: percentage of time spent coalescing before solving the systems)

Performance Results on Kepler K20c
(chart; CPU = Xeon X5650 at 2.67 GHz, 6 Nehalem cores / 12 threads, 12 MB of L3)

Power Results on Kepler K20c and Fermi M2090
(charts for Fermi and Kepler)
O. Villa, M. Fatica, A. Tumeo, N. Gawande, "Power/Performance Trade-offs of Small Batched LU Based Solvers on GPUs," submitted to Euro-Par 2013

STOMP Full Application Results
- 84 equations per grid node
- Approx. 105,000 grid nodes
- PETSc iterations: 2902
- ECKEChem iterations: 26
- Heterogeneous version: OpenMP + GPU
  - PETSc multi-threading
- Inter-node: the 3D domain is split into regions, with no load balancing (as in the original code)
- Intra-node: task-based load balancer, since each grid node reaches convergence (in Newton-Raphson) at a different time
- Assembly of the Jacobian directly on the GPU (only O(N) data are moved to/from GPU memory)

STOMP Full Application Results
- Each cluster node: 2 AMD Opteron 6272 (Bulldozer Interlagos) + 1 M2090

| Configuration | total_time (s) | ECKEChem_time (s) |
|---|---|---|
| N = 8, n = 128, npernode = 16, w/o OpenMP | 656.60 | 519.40 |
| N = 8, n = 8, npernode = 1, threads = 16 | 195.60 | 142.70 |
| N = 8, n = 8, npernode = 1, OpenMP & GPU | 132.40 | 78.80 |

- Roughly 5x overall; multithreading gives 2.6x on PETSc and 3.6x on ECKEChem, and the GPU gives a further 1.8x on ECKEChem

Intra-Node Load Balancing (16 CPU cores + GPU) - Node 0
(chart: grid nodes assigned to GPU vs. CPU and GPU vs. CPU time for Process 0, over the ECKEChem calls in a single step; axes: grid nodes, time in s)

Intra-Node Load Balancing (16 CPU cores + GPU) - Node 1
(chart: grid nodes assigned to GPU vs. CPU and GPU vs. CPU time for Process 1, over the ECKEChem calls in a single step; axes: grid nodes, time in s)

Conclusions
- We have presented two important DOE PNNL applications
  - NWChem (quantum chemistry) and STOMP (subsurface transport) use both CPU and GPU concurrently during the most computationally intensive portions of the applications
- Full-application speedup using both GPUs and CPUs
  - NWChem CCSD(T): ~4x on dual-socket Sandy Bridge platforms when using Kepler
  - STOMP: ~5x after restructuring, using Fermi on an Interlagos host platform
- Challenges for NWChem CCSD
  - Unfavorable computation/data-movement ratio
  - Tensor sorting routine to couple with cublasDgemm
- Challenges for STOMP
  - A more efficient 100x100 batched LU solver

Thank you very much for your attention! Questions?
Oreste Villa - oreste.villa@pnnl.gov
Antonino Tumeo - antonino.tumeo@pnnl.gov

Profiler output for the LU decomposition on Kepler K20c (figure)

Pipelined execution
- We can avoid the O(N^6) copy in, keep only the copy out, and then accumulate on the host
- We can create streams of kernels over one loop dimension and asynchronously copy out partial data (host support and streams)
(A hedged CUDA-streams sketch follows below.)
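The sketch below illustrates the pipelining idea with CUDA streams; it is not the generated code. The outer loop dimension is split into slices, each stream launches the contraction for its slice and asynchronously copies the partial result back so that PCI Express transfers overlap with computation. contract_slice is a hypothetical kernel standing in for the generated contraction, and hC is assumed to be pinned host memory (cudaMallocHost) so the copies are truly asynchronous.

```cuda
#include <cuda_runtime.h>

__global__ void contract_slice(const double* A, const double* B, double* C,
                               int slice, int slice_size);   // placeholder kernel

void pipelined_contraction(const double* dA, const double* dB, double* dC,
                           double* hC, int n_slices, size_t slice_elems)
{
    const int NSTREAMS = 4;
    cudaStream_t streams[NSTREAMS];
    for (int s = 0; s < NSTREAMS; ++s) cudaStreamCreate(&streams[s]);

    for (int i = 0; i < n_slices; ++i) {
        cudaStream_t st = streams[i % NSTREAMS];
        double* dCi = dC + (size_t)i * slice_elems;

        // Kernel for slice i, then async copy-out of its partial output;
        // both are ordered within the stream but overlap with other streams.
        contract_slice<<<256, 256, 0, st>>>(dA, dB, dCi, i, (int)slice_elems);
        cudaMemcpyAsync(hC + (size_t)i * slice_elems, dCi,
                        slice_elems * sizeof(double),
                        cudaMemcpyDeviceToHost, st);
        // Host-side accumulation into the O(N^6) result would happen here,
        // after synchronizing on this slice's stream (omitted for brevity).
    }

    for (int s = 0; s < NSTREAMS; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
}
```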