A Quantum Chemistry Domain-Specific Language for Heterogeneous Clusters


A Quantum Chemistry Domain-Specific Language for Heterogeneous Clusters
Antonino Tumeo, Oreste Villa
Collaborators: Karol Kowalski, Sriram Krishnamoorthy, Wenjing Ma, Simone Secchi
May 15, 2012

Outline
- Introduction
- NWChem Tensor Contraction Engine
- Tensor Contraction Generator
- CUDA optimizations
- Experimental results

NWChem
- Scalable computational chemistry tool
- Treats large scientific computational chemistry problems
- Exploits workstations, clusters, and high-performance parallel supercomputers
- Handles:
  - biomolecules, nanostructures, and solid state
  - from quantum to classical, and all combinations
  - Gaussian basis functions or plane waves
  - scaling from one to thousands of processors
  - properties and relativity
- Actively developed by a consortium of developers; maintained by the Environmental Molecular Science Laboratory (EMSL) at Pacific Northwest National Laboratory (PNNL)

NWChem TCE (Tensor Contraction Engine)
- Quantum chemistry Coupled Cluster theory is used to understand fundamental optical processes in solar cells, photosynthesis, and other optically active materials.
- ORNL's Jaguar achieved 1.31 PFlops (over 50% of peak) on 225,000 processors during a CCSD calculation.
- ORNL-PNNL collaboration, led by Edo Apra (Gordon Bell finalist at SC 2009)

Anatomy of NWChem-TCE

Phase                                  Complexity
Hartree-Fock                           N^4
4-index transformation                 N^5
CCSD (iterative)                       no^2 * nu^4
CCSD(T) (not iterative)                no^3 * nu^4

no = number of occupied orbitals; nu = number of unoccupied orbitals; N = no + nu. For closed-shell systems the number of electrons is N_el = 2*no. N basis functions are used (usually defined as a set of atomic orbitals localized on the atoms of the system). The 4-index transformation converts 2-electron atomic integrals into 2-electron molecular integrals. The molecular orbitals are obtained in the Hartree-Fock calculation and are expressed as linear combinations of atomic orbitals.

Coupled cluster
- Coupled Cluster involves the solution of the time-independent Schrödinger equation.
- CCSD(T) consists of calculating the cluster operator T up to the third contribution: Coupled Cluster Singles and Doubles (with perturbative Triples correction).
- T_1 = O(n^(1*2+1)) = O(n^3); T_2 = O(n^(2*2+1)) = O(n^5); T_3 = O(n^(3*2+1)) = O(n^7)

Tensor Contractions
- Tensor contractions are generalized multidimensional matrix multiplication operations that occur widely in quantum chemistry.
- A tensor contraction involves computing a tensor R from tensors T and V.
- The index l is common to both input tensors, corresponding to the common dimension in matrix-matrix multiplication, and is referred to as the contracted index.
- CCSD(T) consists of 18 such contraction expressions, each O(N^7).
- CCSD has about 100 types, but they are O(N^5) ("types" refers to different index permutations). We can't do it by hand!
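As an illustration (not code from the talk), a single contraction of this shape can be written with NumPy's einsum; the index names i, j, k, l and the tile sizes are made up for the example. The second form shows the index-permutation-plus-dgemm mapping that NWChem traditionally uses:

```python
import numpy as np

# Hypothetical tile sizes; real tile dimensions come from the
# symmetry-based tiling described on the TCG slide.
ni, nj, nk, nl = 4, 5, 6, 7

T = np.random.rand(ni, nl, nk)   # input tensor T[i, l, k]
V = np.random.rand(nl, nj)       # input tensor V[l, j]

# Contract over the shared index l: R[i, j, k] = sum_l T[i, l, k] * V[l, j]
R = np.einsum('ilk,lj->ijk', T, V)

# The same contraction expressed as an index permutation followed by a
# matrix-matrix multiplication (the dgemm route).
R_gemm = (T.transpose(0, 2, 1).reshape(ni * nk, nl) @ V) \
             .reshape(ni, nk, nj).transpose(0, 2, 1)

assert np.allclose(R, R_gemm)
```

The permutation step is exactly the overhead the generated CUDA kernels avoid by computing the contraction directly.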

TCG - Tensor Contraction Generator
- Builds on NWChem TCE
- Developed in Python
- Tiling: the whole spinorbital domain is partitioned into tiles, each containing several spinorbitals of the same spatial and spin symmetry.
- Domain-Specific Language: takes as input the mathematical description of the tensor contractions
  - limited hints for enhancing code generation
- Generates code for CUDA, OpenCL, OpenMP, and OpenACC
  - NWChem TCE targets Fortran
- Optimizes for the different targets
- Currently mostly focused on CUDA

Basic CUDA kernel
- Memory management: reuses memory allocated for previous kernel calls
- Kernel arguments: common strides and offsets pre-computed by the host
- Encoding thread-block-specific arguments: CUDA block dimensions map index dimensions
- Computation in a thread block:
  - one thread computes one element of the output array
  - the threads in each thread block cooperate in moving data between GPU memory and shared memory
- Default thread block size: 16x16 = 256 threads

GPU code generation challenges
- Large tensors plus the memory requirements of the application result in each dimension being relatively small, which:
  - interferes with achieving good locality in shared memory
  - incurs high index-computation overhead (modulo operations)
  - leads to poor thread-block utilization on the GPU
- In general, tensor contraction is harder to compute efficiently than the square matrix-matrix multiplication often employed in benchmark studies.
- Problem sizes are not known until runtime.
- CCSD(T) has 18 contraction types.

CUDA optimizations
- Index combining (general optimization): when a pair of indices occurs in the same order in every occurrence, they can be replaced by one index whose size is their product
  - reduces index calculations
- Dimension flattening (increases utilization for specific tile sizes): the baseline approach works well only when the index dimensions are multiples of the thread-block configuration; we flatten the loops and recreate them at the right size, with some extra index operations
  - more index operations, but better utilization
- Pipelining on the outer dimensions (streaming), with host support for accumulation
  - the most effective optimization for hiding PCI Express transfers
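A minimal NumPy sketch of index combining (index names and sizes are made up): if indices j and k appear adjacent and in the same order everywhere they occur, they can be fused into a single index of size nj*nk, turning a 4-index contraction into a 3-index one:

```python
import numpy as np

# Hypothetical tile sizes for illustration.
ni, nj, nk, nl = 3, 4, 5, 6

T = np.random.rand(ni, nj, nk, nl)   # T[i, j, k, l]
V = np.random.rand(nl)               # V[l]

# Original 4-index contraction: R[i, j, k] = sum_l T[i, j, k, l] * V[l]
R = np.einsum('ijkl,l->ijk', T, V)

# Index combining: j and k always occur together and in the same order,
# so they can be replaced by one index m of size nj*nk. In the generated
# kernel this means one fewer divide/modulo per index decode.
T_fused = T.reshape(ni, nj * nk, nl)
R_fused = np.einsum('iml,l->im', T_fused, V)

# The fused result is the same tensor, viewed with one fewer index.
assert np.allclose(R, R_fused.reshape(ni, nj, nk))
```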

Dimension flattening (figure slide)
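The flattening idea can be sketched in plain Python (sizes and the per-element computation are hypothetical): the nested loops over the tile dimensions are collapsed into one flat index space that is walked in block-sized chunks, as a flattened CUDA grid would, with divmod recovering the multi-indices at the cost of the extra index operations mentioned above:

```python
import numpy as np

# Hypothetical tile dimensions that are NOT multiples of the block size.
d1, d2, d3 = 5, 7, 3        # 105 output elements in total
BLOCK = 16                  # stand-in for the thread-block size

out = np.zeros((d1, d2, d3))
total = d1 * d2 * d3

# Walk the flattened index space in BLOCK-sized chunks; only the last
# chunk is partially filled (105 = 6*16 + 9), instead of wasting threads
# in every block as the unflattened mapping would.
for base in range(0, total, BLOCK):
    for flat in range(base, min(base + BLOCK, total)):
        i, rest = divmod(flat, d2 * d3)   # extra index operations
        j, k = divmod(rest, d3)
        out[i, j, k] = i + j + k          # stand-in for the contraction

expected = np.fromfunction(lambda i, j, k: i + j + k, (d1, d2, d3))
assert np.allclose(out, expected)
```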

Pipelined execution
- We can avoid the O(N^6) copy-in, keep only the copy-out, and accumulate on the host.
- We can create streams of kernels over one loop dimension and asynchronously copy out partial data (host support and streams).
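A rough host-side sketch of the scheme, with made-up sizes and threads standing in for CUDA streams: the contracted dimension is split into chunks, each chunk's partial contraction is computed independently (as a kernel launch plus asynchronous copy-out would be), and the host accumulates the partial results:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sizes for illustration.
ni, nl = 16, 64
T = np.random.rand(ni, nl)
V = np.random.rand(nl)

def partial(chunk):
    # Stands in for one kernel launch plus an asynchronous copy-out
    # of the partial data for this slice of the contracted index.
    return T[:, chunk] @ V[chunk]

chunks = [slice(s, s + 16) for s in range(0, nl, 16)]
R = np.zeros(ni)
with ThreadPoolExecutor(max_workers=4) as pool:
    for part in pool.map(partial, chunks):
        R += part            # host-side accumulation

assert np.allclose(R, T @ V)
```

Because each chunk's output is accumulated, no full-size intermediate ever has to be copied in, only the partial contributions copied out.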

Fermi optimizations (I)
- Fermi significantly improves the computation rate relative to memory bandwidth; execution is bounded by data transfers.
- Intermediates are kept in GPU memory across tensor contractions:
  - only the inputs need to be transferred to the GPU
  - the intermediate is computed and translated into a scalar contribution, which is transferred back to host memory
- Fermi includes a wider register file and larger shared memory.
- Register tiling: each thread contributes to 16 output elements; the contributions are stored in 16 registers and finally written back to GPU memory.

Fermi optimizations (II)
- Register tiling results in each thread performing more computation.
- Enables scalar optimizations across the calculations of the 16 output elements.
- Each write to GPU memory is enclosed in a boundary check to ensure the thread has a valid contribution to make (when the executed tile is smaller than the thread block size).
- Condition checks are coalesced.

Hybrid execution: task-based framework
- A CPU core drives each GPU; accumulation is done on that CPU core.
- Nodes have more CPU cores than GPUs, so we exploit a task-based framework:
  - builds on the Global Arrays toolkit (PGAS)
  - load balancing at the node level
  - load balancing at the cluster level
- Free cores execute a sequential version of the tensor contraction.
- We expect to be able to use the OpenMP (or OpenCL-for-CPUs) version in the next few months.

Experimental setup
- A 60-node cluster named Barracuda, hosted at PNNL/EMSL.
- Each node consists of two quad-core Intel Xeon X5560 CPUs with 8 MB L2 cache running at 2.80 GHz.
- A Tesla S1070 box, which contains four Tesla T10 GPUs, is shared between two compute nodes, resulting in two GPUs per SMP node (128 GPUs in total).
- Each SMP node is connected to the cluster with an InfiniBand QDR card.

Single-node performance (single tile, figure slide)
- Cases: 6 dimensions of size 16; 3 dimensions of size 16 and 3 dimensions of size 17
- Baseline: cublasDgemm (the first approach used in the past). In NWChem, each tensor contraction is translated into index permutations and dgemm operations.

Results for a 60-node GPU cluster (millions of tiles)

Minutes to completion:

Configuration          Minutes
1 core                 113
2 cores                56
4 cores (1 socket)     28
8 cores (2 sockets)    17
1 GPU                  13
2 GPUs + 6 cores       5

One-level hybrid execution: the GPU performs the contraction and the CPU accumulates. Two-level hybrid execution: the GPUs share the PCI bus.

Double-precision calculations for green fluorescent protein (GFP) with 284 and 476 basis functions. In all calculations core electrons were not correlated.

Under development
- Code generators for other targets: OpenMP, OpenCL, and OpenACC
  - mostly working, not yet optimized
- Many parameters to optimize for; trying to distill the most relevant
- We are considering auto-tuning mechanisms:
  - generate all the possible code permutations
  - choose the best code and processing units depending on the tensor contraction sizes
  - evolutionary methods for the design-space exploration?

Conclusion
- NWChem TCE: a domain-specific language for generating tensor contractions
- New, re-targetable code generator:
  - generates optimized code for CUDA
  - initial support for OpenMP, OpenCL, and OpenACC
- Task-based, cluster-wide load balancing:
  - based on Global Arrays (PGAS)
  - supports hybrid execution
  - independent of NWChem TCE

Thank you for your attention! Questions?