Heterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry

Heterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry. With Eugene DePrince, Argonne National Laboratory (LCF and CNM). (Eugene moved to Georgia Tech last week.) 8 September 2011

Abstract (for posterity) We have implemented complex computational chemistry algorithms known as coupled-cluster theory (a quantum many-body method) using CUDA and OpenMP for efficient execution on hybrid CPU-GPU systems. While many of the floating-point operations of this code are performed inside the CPU or GPU BLAS library, the large memory footprint of the data structures requires careful consideration of data motion. The latest version of our code is able to exploit multiple CPU cores and one GPU at the same time, maximizing overlap of communication and computation using CUDA streams and handling load balancing by dynamically rescheduling computation between iterations. The techniques used for our computational chemistry application should be applicable to other domains, especially those with large memory footprints and iterative solvers. In addition to our computational chemistry application performance results, we will show microbenchmarks relevant to GPU codes executing over multiple nodes using MPI. The role of GPUDirect and related developments in the CUDA software stack will be discussed.

Power. Power efficiency forces us to use parallelism all the way down... Green500 summary: #1 Blue Gene/Q (2097.19 MF/W); #3 Intel+ATI (1375.88 MF/W); #4 Intel+NVIDIA (958.35 MF/W); #14 POWER7 (565.97 MF/W); #19 Cell-based (458.33 MF/W); #21 Intel-only (441.18 MF/W).

Central Points Given: Power compels kitchen-sink architecture design with many-level parallelism (evolving from multi-level parallelism). CPU+GPU just scratches the surface of the programmer pain to come. Compilers will never be good enough. Memory (bandwidth, capacity, granularity) will always be the bottleneck. [Diagram: a node with 1000s of threads (private cache) over per-node shared memory built from many 10 GB partitions at 300 GB/s each, and ~1 million nodes connected by a hierarchical network at ~50 GB/s.]

Central Points Try: High-level code generation (in addition to low-level code generation). Factorize code around inter/intra-node parallelism. Dynamic task parallelism on top of data parallelism; asynchrony, overlap, nonblocking, one-sided, etc. Performance portability is impossible with traditional languages (Fortran, C, C++). Regarding PGAS, compilers that cannot block for cache should not be allowed to generate network communication.

Tensor Contraction Engine

Tensor Contraction Engine What does it do? (1) GUI input: a quantum many-body theory, e.g. CCSD. (2) Operator specification of the theory (as in a theory paper). (3) Apply Wick's theorem to transform operator expressions into array expressions (as in a computational paper). (4) Transform the input array expression into an operation tree using many types of optimization (i.e. compile). (5) Produce a Fortran+Global Arrays+NXTVAL implementation. The developer can intercept at various stages to modify the theory, algorithm or implementation.

TCE Input We get 73 lines of serial F90 or 604 lines of parallel F77 from this:

1.0/1.0 Sum( g1 g2 p3 h4 ) f( g1 g2 ) t( p3 h4 ) { g1+ g2 } { p3+ h4 }
1.0/4.0 Sum( g1 g2 g3 g4 p5 h6 ) v( g1 g2 g3 g4 ) t( p5 h6 ) { g1+ g2+ g4 g3 } { p5+ h6 }
1.0/16.0 Sum( g1 g2 g3 g4 p5 p6 h7 h8 ) v( g1 g2 g3 g4 ) t( p5 p6 h7 h8 ) { g1+ g2+ g4 g3 } { p5+ p6+ h8 h7 }
1.0/8.0 Sum( g1 g2 g3 g4 p5 h6 p7 h8 ) v( g1 g2 g3 g4 ) t( p5 h6 ) t( p7 h8 ) { g1+ g2+ g4 g3 } { p5+ h6 } { p7+ h8 }

LaTeX equivalent of the first term: $\sum_{g_1,g_2,p_3,h_4} f_{g_1,g_2}\, t_{p_3,h_4}\, \{g_1^+ g_2\}\{p_3^+ h_4\}$
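For reference, a mechanical LaTeX rendering of all four input terms above (same convention as the first term, writing the dagger as +):

$\sum_{g_1,g_2,p_3,h_4} f_{g_1,g_2}\, t_{p_3,h_4}\, \{g_1^+ g_2\}\{p_3^+ h_4\}$
$\tfrac{1}{4} \sum_{g_1,g_2,g_3,g_4,p_5,h_6} v_{g_1,g_2,g_3,g_4}\, t_{p_5,h_6}\, \{g_1^+ g_2^+ g_4 g_3\}\{p_5^+ h_6\}$
$\tfrac{1}{16} \sum_{g_1,g_2,g_3,g_4,p_5,p_6,h_7,h_8} v_{g_1,g_2,g_3,g_4}\, t_{p_5,p_6,h_7,h_8}\, \{g_1^+ g_2^+ g_4 g_3\}\{p_5^+ p_6^+ h_8 h_7\}$
$\tfrac{1}{8} \sum_{g_1,g_2,g_3,g_4,p_5,h_6,p_7,h_8} v_{g_1,g_2,g_3,g_4}\, t_{p_5,h_6}\, t_{p_7,h_8}\, \{g_1^+ g_2^+ g_4 g_3\}\{p_5^+ h_6\}\{p_7^+ h_8\}$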

Summary of TCE module (cloc v1.53, http://cloc.sourceforge.net, T=30.0 s):

Language     files    blank    comment    code
Fortran 77   11451    1004     115129     2824724
SUM          11451    1004     115129     2824724

Only <25 KLOC are hand-written; ~100 KLOC is utility code following the TCE data-parallel template. The expansion from TCE input to massively parallel F77 is about 200x.

TCE Template Pseudocode for $R^{a,b}_{i,j} \mathrel{+}= T^{c,d}_{i,j}\, V^{c,d}_{a,b}$:

for i,j in occupied blocks:
  for a,b in virtual blocks:
    for c,d in virtual blocks:
      if symmetry_criteria(i,j,a,b,c,d):
        if dynamic_load_balancer(me):
          Get block t(i,j,c,d) from T
          Permute t(i,j,c,d)
          Get block v(a,b,c,d) from V
          Permute v(a,b,c,d)
          r(i,j,a,b) += t(i,j,c,d) * v(a,b,c,d)
          Permute r(i,j,a,b)
          Accumulate r(i,j,a,b) block to R
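A minimal serial sketch of this template (hypothetical names throughout): get_block/accumulate_block stand in for Global Arrays one-sided Get/Accumulate, my_turn() mimics the NXTVAL shared counter, and the index pairs (i,j), (a,b), (c,d) are flattened into single block indices so the contraction reads R(ij,ab) += T(ij,cd) * V(ab,cd).

// Serial stand-in for the TCE data-parallel template (hypothetical names).
#include <atomic>
#include <cstdio>
#include <vector>

constexpr int NB = 4;                 // blocks per (flattened) dimension
constexpr int BS = 32;                // block edge length
constexpr int N  = NB * BS;

using Matrix = std::vector<double>;   // row-major N x N

std::atomic<long> nxtval{0};
int nranks = 1, me = 0;
bool my_turn() { return nxtval.fetch_add(1) % nranks == me; }  // NXTVAL-style round robin

void get_block(const Matrix& A, int bi, int bj, std::vector<double>& buf) {
  for (int r = 0; r < BS; ++r)
    for (int c = 0; c < BS; ++c)
      buf[r * BS + c] = A[(bi * BS + r) * N + (bj * BS + c)];
}

void accumulate_block(Matrix& A, int bi, int bj, const std::vector<double>& buf) {
  for (int r = 0; r < BS; ++r)
    for (int c = 0; c < BS; ++c)
      A[(bi * BS + r) * N + (bj * BS + c)] += buf[r * BS + c];
}

int main() {
  Matrix T(N * N, 1.0), V(N * N, 1.0), R(N * N, 0.0);
  std::vector<double> t(BS * BS), v(BS * BS), r(BS * BS);

  for (int ij = 0; ij < NB; ++ij)         // occupied block pair (i,j)
    for (int ab = 0; ab < NB; ++ab)       // virtual block pair (a,b)
      for (int cd = 0; cd < NB; ++cd) {   // contracted virtual block pair (c,d)
        // (a spin/spatial symmetry screen would go here)
        if (!my_turn()) continue;         // dynamic load balancing: one task per block triple
        get_block(T, ij, cd, t);          // "Get block t(i,j,c,d) from T"
        get_block(V, ab, cd, v);          // "Get block v(a,b,c,d) from V"
        // r(ij,ab) = t(ij,cd) * v(ab,cd)^T; the Permute steps are folded into the
        // index order chosen here, and the real code calls DGEMM at this point.
        for (int x = 0; x < BS; ++x)
          for (int y = 0; y < BS; ++y) {
            double s = 0.0;
            for (int z = 0; z < BS; ++z) s += t[x * BS + z] * v[y * BS + z];
            r[x * BS + y] = s;
          }
        accumulate_block(R, ij, ab, r);   // "Accumulate r(i,j,a,b) block to R"
      }

  std::printf("R(0,0) = %.1f (expect %d)\n", R[0], N);
  return 0;
}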

TCE for Heterogeneous Systems New Get and Accumulate that communicate between remote memory and the GPU instead of the CPU. Implement Permute in GPU code; use GPU BLAS. In this scenario, porting to a new heterogeneous architecture requires a few hundred lines of code. This is only a first-order solution: load balancing is significantly harder when the compute portion goes significantly faster, and single-level blocking is not optimal for heterogeneous nodes.
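A hypothetical sketch of the per-task body once Get/Permute/GEMM/Accumulate target the GPU: the blocks fetched by Get land in host buffers, are copied to the device, contracted with cuBLAS, and the result is copied back for Accumulate. The Permute kernel and all error checking are omitted. Compile with: nvcc sketch.cu -lcublas

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main() {
  const int bs = 32;                                   // block edge length (hypothetical)
  const size_t bytes = (size_t)bs * bs * sizeof(double);
  std::vector<double> t_h(bs * bs, 1.0), v_h(bs * bs, 1.0), r_h(bs * bs, 0.0);

  double *t_d, *v_d, *r_d;
  cudaMalloc((void**)&t_d, bytes);
  cudaMalloc((void**)&v_d, bytes);
  cudaMalloc((void**)&r_d, bytes);
  cublasHandle_t h;
  cublasCreate(&h);

  cudaMemcpy(t_d, t_h.data(), bytes, cudaMemcpyHostToDevice);  // "Get" now fills device memory
  cudaMemcpy(v_d, v_h.data(), bytes, cudaMemcpyHostToDevice);
  // (a small Permute kernel would reorder the blocks here before the GEMM)
  const double one = 1.0, zero = 0.0;
  // Row-major r(ij,ab) = sum_cd t(ij,cd) * v(ab,cd); operands are transposed/swapped
  // because cuBLAS is column-major.
  cublasDgemm(h, CUBLAS_OP_T, CUBLAS_OP_N, bs, bs, bs,
              &one, v_d, bs, t_d, bs, &zero, r_d, bs);
  cudaMemcpy(r_h.data(), r_d, bytes, cudaMemcpyDeviceToHost);  // result flows back to Accumulate

  std::printf("r[0] = %.1f (expect %d)\n", r_h[0], bs);
  cublasDestroy(h);
  cudaFree(t_d); cudaFree(v_d); cudaFree(r_d);
  return 0;
}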

Quantum Chemistry on Heterogeneous Nodes

Coupled-cluster theory:

$|\mathrm{CC}\rangle = \exp(T)\,|0\rangle, \qquad T = T_1 + T_2 + \cdots + T_n \quad (n \le N)$
$T_1 = \sum_{ia} t_{i}^{a}\, \hat{a}_{a}^{\dagger} \hat{a}_{i}, \qquad T_2 = \sum_{ijab} t_{ij}^{ab}\, \hat{a}_{a}^{\dagger} \hat{a}_{b}^{\dagger} \hat{a}_{j} \hat{a}_{i}$
$|\Psi_{\mathrm{CCD}}\rangle = \exp(T_2)\,|\Psi_{\mathrm{HF}}\rangle = (1 + T_2 + T_2^2)\,|\Psi_{\mathrm{HF}}\rangle$
$|\Psi_{\mathrm{CCSD}}\rangle = \exp(T_1 + T_2)\,|\Psi_{\mathrm{HF}}\rangle = (1 + T_1 + \cdots + T_1^4 + T_2 + T_2^2 + T_1 T_2 + T_1^2 T_2)\,|\Psi_{\mathrm{HF}}\rangle$

Coupled cluster (CCD) implementation:

$R_{ij}^{ab} = V_{ij}^{ab} + P(ia,jb)\left[ T_{ij}^{ae} I_{e}^{b} - T_{im}^{ab} I_{j}^{m} + \tfrac{1}{2} V_{ef}^{ab} T_{ij}^{ef} + \tfrac{1}{2} T_{mn}^{ab} I_{ij}^{mn} - T_{mj}^{ae} I_{ie}^{mb} - I_{ie}^{ma} T_{mj}^{eb} + (2T_{mi}^{ea} - T_{im}^{ea})\, I_{ej}^{mb} \right]$

$I_{b}^{a} = -(2V_{eb}^{mn} - V_{be}^{mn})\, T_{mn}^{ea}$
$I_{j}^{i} = (2V_{ef}^{mi} - V_{ef}^{im})\, T_{mj}^{ef}$
$I_{kl}^{ij} = V_{kl}^{ij} + V_{ef}^{ij} T_{kl}^{ef}$
$I_{jb}^{ia} = V_{jb}^{ia} - \tfrac{1}{2} V_{eb}^{im} T_{jm}^{ea}$
$I_{bj}^{ia} = V_{bj}^{ia} + V_{be}^{im}\,(T_{mj}^{ea} - \tfrac{1}{2} T_{mj}^{ae}) - \tfrac{1}{2} V_{be}^{mi} T_{mj}^{ae}$

Tensor contractions are currently implemented as GEMM plus PERMUTE.
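As an illustration (not from the slides), the particle-particle ladder term above becomes a single GEMM once the PERMUTE step makes the compound indices contiguous:

$R_{(ab),(ij)} \mathrel{+}= \tfrac{1}{2} \sum_{(ef)} V_{(ab),(ef)}\, T_{(ef),(ij)}$

i.e., with (ab), (ef) and (ij) treated as matrix row/column indices, this contraction is one DGEMM of an $n_v^2 \times n_v^2$ matrix with an $n_v^2 \times n_o^2$ matrix.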

Relative Performance of GEMM GPU versus SMP CPU (8 threads). [Plots: SGEMM and DGEMM performance in gigaflop/s versus matrix rank for the X5550 (CPU) and C2050 (GPU).] SGEMM: CPU = 156.2 GF, GPU = 717.6 GF. DGEMM: CPU = 79.2 GF, GPU = 335.6 GF. We expect roughly 4-5 times speedup based upon this evaluation because GEMM should be 90% of the execution time.
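For reference, the expected speedup is just the ratio of the measured GEMM rates: $717.6/156.2 \approx 4.6$ (SGEMM) and $335.6/79.2 \approx 4.2$ (DGEMM).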

CPU or GPU CCD This code was written to minimize communication. Only one term is memory-bound and requires communication during the iteration. Iteration time in seconds for double precision:

molecule   C2050   C1060   X5550
C8H10        0.3     0.8     1.3
C10H12       0.8     2.5     3.5
C12H14       2.0     7.1    10.0
C14H10       2.7    10.2    13.9
C14H16       4.5    16.7    21.6
C16H18      10.5    35.9    50.2
C18H20      20.1    73.0    86.6

Numerical Precision versus Performance Iteration time in seconds:

             C1060          C2050          X5550
molecule     SP     DP      SP     DP      SP     DP
C8H10        0.2    0.8     0.2    0.3     0.7    1.3
C10H12       0.7    2.5     0.4    0.8     2.0    3.5
C12H14       1.8    7.1     1.0    2.0     5.6   10.0
C14H10       2.6   10.2     1.5    2.7     8.4   13.9
C14H16       4.1   16.7     2.4    4.5    12.1   21.6
C16H18       9.0   35.9     5.0   10.5    28.8   50.2
C18H20      17.2   73.0    10.1   20.1    47.0   86.6

Mixed precision is essentially trivial when you have a relaxation solver and is always worth at least 2x (except on BG).
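A minimal, generic sketch (not from the talk's code) of why mixed precision is "essentially trivial" for a relaxation solver: run the same fixed-point iteration in single precision until it reaches float-level tolerance, then promote the iterate and finish in double. The model problem is Jacobi on a small diagonally dominant system; names are hypothetical.

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// One relaxation solver, parameterized over the working precision T.
template <typename T>
int jacobi(const std::vector<double>& A, const std::vector<double>& b,
           std::vector<T>& x, double tol, int maxit)
{
  const int n = (int)x.size();
  std::vector<T> xn(n);
  for (int it = 0; it < maxit; ++it) {
    double dmax = 0.0;
    for (int i = 0; i < n; ++i) {
      T s = (T)b[i];
      for (int j = 0; j < n; ++j)
        if (j != i) s -= (T)A[i * n + j] * x[j];
      xn[i] = s / (T)A[i * n + i];
      dmax = std::max(dmax, std::fabs((double)xn[i] - (double)x[i]));
    }
    x = xn;
    if (dmax < tol) return it + 1;   // converged at this precision
  }
  return maxit;
}

int main() {
  const int n = 64;
  std::vector<double> A(n * n, 0.01), b(n, 1.0);
  for (int i = 0; i < n; ++i) A[i * n + i] = 2.0;   // diagonally dominant

  // Phase 1: cheap single-precision sweeps down to ~1e-5.
  std::vector<float> xs(n, 0.0f);
  int it_sp = jacobi(A, b, xs, 1e-5, 1000);

  // Phase 2: promote the iterate and finish in double precision to ~1e-12.
  std::vector<double> xd(xs.begin(), xs.end());
  int it_dp = jacobi(A, b, xd, 1e-12, 1000);

  std::printf("SP iterations: %d, DP iterations after promotion: %d\n", it_sp, it_dp);
  return 0;
}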

CPU and GPU CCSD This code was rewritten to minimize memory and overlap communication. Focus on overlap and pipelining. Iteration time in seconds for double precision:

molecule   Hybrid    CPU   Molpro
C8H10         0.6    1.4     2.4
C10H12        1.4    4.1     7.2
C12H14        3.3   11.1    19.0
C14H10        4.4   15.5    31.0
C14H16        6.3   24.1    43.1
C16H18       10.0   38.9    84.1
C18H20       22.5   95.9   161.4

Most diagrams are statically distributed between GPU and CPU; leftovers are distributed dynamically. Small terms are always done on the CPU.

Details (1) Preallocate buffers on the GPU; use pinned buffers on the CPU. (2) Put GPU transfers and kernels on a stream. (3) Backfill with CPU computation. (4) See who finishes first, rebalance the next iteration (sketched below). Complications: CUBLAS stream support took a while, and multi-GPU nodes with threads were a pain; we now use one GPU per MPI rank. Older gripes: early CUBLAS required us to pad dimensions, and GPUDirect arrived two years later than I wanted it.
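A hypothetical sketch of steps 1-4: pinned host buffers, one CUDA stream carrying the transfers and the GEMM, CPU work backfilled behind the stream, and wall-clock timing of both sides to drive rebalancing. Compile with: nvcc pipeline.cu -lcublas

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

int main() {
  const int n = 2048;
  const size_t bytes = (size_t)n * n * sizeof(double);

  // 1. Preallocate device buffers and pinned host buffers (once, outside the iterations).
  double *a_h, *c_h, *a_d, *c_d;
  cudaMallocHost((void**)&a_h, bytes);   // pinned memory enables true asynchronous copies
  cudaMallocHost((void**)&c_h, bytes);
  cudaMalloc((void**)&a_d, bytes);
  cudaMalloc((void**)&c_d, bytes);
  for (size_t i = 0; i < (size_t)n * n; ++i) a_h[i] = 1.0;

  cudaStream_t stream;
  cudaStreamCreate(&stream);
  cublasHandle_t handle;
  cublasCreate(&handle);
  cublasSetStream(handle, stream);       // 2. transfers and GEMMs share one stream

  const double one = 1.0, zero = 0.0;
  auto t0 = std::chrono::steady_clock::now();
  cudaMemcpyAsync(a_d, a_h, bytes, cudaMemcpyHostToDevice, stream);
  cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
              &one, a_d, n, a_d, n, &zero, c_d, n);            // a GPU "diagram"
  cudaMemcpyAsync(c_h, c_d, bytes, cudaMemcpyDeviceToHost, stream);

  // 3. Backfill: the CPU does its share of the work while the stream is busy.
  double cpu_work = 0.0;
  for (long i = 0; i < 100000000; ++i) cpu_work += 1e-9;       // stand-in for CPU diagrams
  auto t_cpu = std::chrono::steady_clock::now();

  cudaStreamSynchronize(stream);         // wait for the GPU pipeline to drain
  auto t_gpu = std::chrono::steady_clock::now();

  // 4. Compare who finished first; a real code would shift work accordingly
  //    before the next iteration.
  std::printf("CPU side: %.3f s, GPU side: %.3f s (cpu_work=%.1f, c_h[0]=%.1f)\n",
              std::chrono::duration<double>(t_cpu - t0).count(),
              std::chrono::duration<double>(t_gpu - t0).count(),
              cpu_work, c_h[0]);

  cublasDestroy(handle);
  cudaStreamDestroy(stream);
  cudaFreeHost(a_h); cudaFreeHost(c_h);
  cudaFree(a_d); cudaFree(c_d);
  return 0;
}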

Lessons learned Do not GPU-ize legacy code! Reimplementation from scratch was faster and easier; verification of the new implementation is a challenge. Task parallelism is absolutely critical for heterogeneous utilization. Threading ameliorates memory capacity and bandwidth bottlenecks. (How many cores are required to saturate STREAM bandwidth?) Data-parallel kernels are very easy to implement in both OpenMP and CUDA (see the sketch below). Careful organization of asynchronous data movement hides the entire PCIe transfer cost for non-trivial problems. Naive data movement leads to 2x; smart data movement leads to 8x.
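To illustrate the OpenMP/CUDA point above (not from the talk's code): the same data-parallel permute kernel written both ways. Compile with: nvcc permute.cu -Xcompiler -fopenmp

#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

const int N = 1024;

// OpenMP version: out(j,i) = in(i,j)
void permute_omp(const double* in, double* out, int n) {
  #pragma omp parallel for
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j)
      out[j * n + i] = in[i * n + j];
}

// CUDA version: one thread per element
__global__ void permute_cuda(const double* in, double* out, int n) {
  int i = blockIdx.y * blockDim.y + threadIdx.y;
  int j = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n && j < n) out[j * n + i] = in[i * n + j];
}

int main() {
  const size_t bytes = (size_t)N * N * sizeof(double);
  std::vector<double> in(N * N), out(N * N);
  for (int k = 0; k < N * N; ++k) in[k] = k;

  permute_omp(in.data(), out.data(), N);            // CPU (threaded) version

  double *in_d, *out_d;
  cudaMalloc((void**)&in_d, bytes);
  cudaMalloc((void**)&out_d, bytes);
  cudaMemcpy(in_d, in.data(), bytes, cudaMemcpyHostToDevice);
  dim3 block(16, 16), grid((N + 15) / 16, (N + 15) / 16);
  permute_cuda<<<grid, block>>>(in_d, out_d, N);    // GPU version of the same kernel
  cudaMemcpy(out.data(), out_d, bytes, cudaMemcpyDeviceToHost);
  cudaFree(in_d); cudaFree(out_d);

  std::printf("out(1,0) = %.1f (expect 1.0)\n", out[1 * N + 0]);
  return 0;
}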

Acknowledgments

Hardware Details

                           CPU                   GPU
                           X5550    2x X5550     C1060    C2050
processor speed (MHz)      2660     2660         1300     1150
memory bandwidth (GB/s)    32       64           102      144
memory speed (MHz)         1066     1066         800      1500
ECC available              yes      yes          no       yes
SP peak (GF)               85.1     170.2        933      1030
DP peak (GF)               42.6     83.2         78       515
power usage (W)            95       190          188      238

Note that power consumption is apples-to-oranges since the CPU figure does not include DRAM, whereas the GPU figure does.