GPU Acceleration of Cutoff Pair Potentials for Molecular Modeling Applications

Similar documents
CS-206 Concurrency. Lecture 13. Wrap Up. Spring 2015 Prof. Babak Falsafi parsa.epfl.ch/courses/cs206/

arxiv: v1 [hep-lat] 7 Oct 2010

Accelerating Molecular Modeling Applications with Graphics Processors

Klaus Schulten Department of Physics and Theoretical and Computational Biophysics Group University of Illinois at Urbana-Champaign

Welcome to MCS 572. content and organization expectations of the course. definition and classification

GPU-Accelerated Analysis and Visualization of Large Structures Solved by Molecular Dynamics Flexible Fitting

Faster Kinetics: Accelerate Your Finite-Rate Combustion Simulation with GPUs

Multiscale simulations of complex fluid rheology

Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters

COMPARISON OF CPU AND GPU IMPLEMENTATIONS OF THE LATTICE BOLTZMANN METHOD

CRYPTOGRAPHIC COMPUTING

NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1

Parallel Longest Common Subsequence using Graphics Hardware

ICS 233 Computer Architecture & Assembly Language

Administrivia. Course Objectives. Overview. Lecture Notes Week markem/cs333/ 2. Staff. 3. Prerequisites. 4. Grading. 1. Theory and application

Research on GPU-accelerated algorithm in 3D finite difference neutron diffusion calculation method

PuReMD-GPU: A Reactive Molecular Dynamic Simulation Package for GPUs

Heterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry

Introduction to numerical computations on the GPU

Direct Self-Consistent Field Computations on GPU Clusters

Accelerating linear algebra computations with hybrid GPU-multicore systems.

Accelerating Decoupled Look-ahead to Exploit Implicit Parallelism

Efficient Molecular Dynamics on Heterogeneous Architectures in GROMACS

GPU-accelerated Chemical Similarity Assessment for Large Scale Databases

CS 700: Quantitative Methods & Experimental Design in Computer Science

INF2270 Spring Philipp Häfliger. Lecture 8: Superscalar CPUs, Course Summary/Repetition (1/2)

Vector Lane Threading

Parallel stochastic simulation using graphics processing units for the Systems Biology Toolbox for MATLAB

Mayer Sampling Monte Carlo Calculation of Virial. Coefficients on Graphics Processors

Performance Evaluation of Scientific Applications on POWER8

A Quantum Chemistry Domain-Specific Language for Heterogeneous Clusters

S XMP LIBRARY INTERNALS. Niall Emmart University of Massachusetts. Follow on to S6151 XMP: An NVIDIA CUDA Accelerated Big Integer Library

Transposition Mechanism for Sparse Matrices on Vector Processors

Introducing a Bioinformatics Similarity Search Solution

AUTHOR QUERY FORM. Fax: For correction or revision of any artwork, please consult

Dynamic Scheduling for Work Agglomeration on Heterogeneous Clusters

Optimization Techniques for Parallel Code 1. Parallel programming models

Probing Biomolecular Machines with Graphics Processors

A CUDA Solver for Helmholtz Equation

CMP 338: Third Class

Lecture 23: Illusiveness of Parallel Performance. James C. Hoe Department of ECE Carnegie Mellon University

Parallel Numerics. Scope: Revise standard numerical methods considering parallel computations!

sri 2D Implicit Charge- and Energy- Conserving Particle-in-cell Application Using CUDA Christopher Leibs Karthik Murthy

Accelerating computation of eigenvectors in the nonsymmetric eigenvalue problem

GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic

Accelerating computation of eigenvectors in the dense nonsymmetric eigenvalue problem

CMP N 301 Computer Architecture. Appendix C

NIH Center for Macromolecular Modeling and Bioinformatics Developer of VMD and NAMD. Beckman Institute

Accelerating Quantum Chromodynamics Calculations with GPUs

A microsecond a day keeps the doctor away: Efficient GPU Molecular Dynamics with GROMACS

POLITECNICO DI MILANO DATA PARALLEL OPTIMIZATIONS ON GPU ARCHITECTURES FOR MOLECULAR DYNAMIC SIMULATIONS

Che-Wei Chang Department of Computer Science and Information Engineering, Chang Gung University

Case Study: Quantum Chromodynamics

Research into GPU accelerated pattern matching for applications in computer security

Randomized Selection on the GPU. Laura Monroe, Joanne Wendelberger, Sarah Michalak Los Alamos National Laboratory

In 1830, Auguste Comte wrote in his work

MONTE CARLO NEUTRON TRANSPORT SIMULATING NUCLEAR REACTIONS ONE NEUTRON AT A TIME Tony Scudiero NVIDIA

Unit 6: Branch Prediction

/ : Computer Architecture and Design

PERFORMANCE METRICS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah

N-body Simulations. On GPU Clusters

Parallel Rabin-Karp Algorithm Implementation on GPU (preliminary version)

Information Sciences Institute 22 June 2012 Bob Lucas, Gene Wagenbreth, Dan Davis, Roger Grimes and

Scalable Non-blocking Preconditioned Conjugate Gradient Methods

Outline. policies for the first part. with some potential answers... MCS 260 Lecture 10.0 Introduction to Computer Science Jan Verschelde, 9 July 2014

A roofline model of energy

Accelerating Protein Coordinate Conversion using GPUs

of 19 23/02/ :23

High-Performance Computing, Planet Formation & Searching for Extrasolar Planets

Block AIR Methods. For Multicore and GPU. Per Christian Hansen Hans Henrik B. Sørensen. Technical University of Denmark

Exploiting In-Memory Processing Capabilities for Density Functional Theory Applications

Computation of Large Sparse Aggregated Areas for Analytic Database Queries

Parallelization of Molecular Dynamics (with focus on Gromacs) SeSE 2014 p.1/29

CHAPTER 5 - PROCESS SCHEDULING

Performance Metrics for Computer Systems. CASS 2018 Lavanya Ramapantulu

Real-time signal detection for pulsars and radio transients using GPUs

S0214 : GPU Based Stacking Sequence Generation For Composite Skins Using GA

CPSC 3300 Spring 2017 Exam 2

UC Santa Barbara. Operating Systems. Christopher Kruegel Department of Computer Science UC Santa Barbara

Department of Electrical and Computer Engineering University of Wisconsin - Madison. ECE/CS 752 Advanced Computer Architecture I.

Performance, Power & Energy. ELEC8106/ELEC6102 Spring 2010 Hayden Kwok-Hay So

TDDB68 Concurrent programming and operating systems. Lecture: CPU Scheduling II

SP-CNN: A Scalable and Programmable CNN-based Accelerator. Dilan Manatunga Dr. Hyesoon Kim Dr. Saibal Mukhopadhyay

An Implementation of the MRRR Algorithm on a Data-Parallel Coprocessor

University of Alberta

Data Structures. Outline. Introduction. Andres Mendez-Vazquez. December 3, Data Manipulation Examples

Efficient implementation of the overlap operator on multi-gpus

Statistics of the Universe: Exa-calculations and Cosmology's Data Deluge

QR Decomposition in a Multicore Environment

Cache Contention and Application Performance Prediction for Multi-Core Systems

TDDI04, K. Arvidsson, IDA, Linköpings universitet CPU Scheduling. Overview: CPU Scheduling. [SGG7] Chapter 5. Basic Concepts.

Instruction Set Extensions for Reed-Solomon Encoding and Decoding

Measurement & Performance

Measurement & Performance

Efficient Serial and Parallel Coordinate Descent Methods for Huge-Scale Convex Optimization

ACCELERATED LEARNING OF GAUSSIAN PROCESS MODELS

EXPLOITING NETWORK PROCESSORS FOR LOW LATENCY, HIGH THROUGHPUT, RATE-BASED SENSOR UPDATE DELIVERY. Kim Christian Swenson

Practical Free-Start Collision Attacks on full SHA-1

Exploring performance and power properties of modern multicore chips via simple machine models

Efficient algorithms for symmetric tensor contractions

Transcription:

GPU Acceleration of Cutoff Pair Potentials for Molecular Modeling Applications Christopher Rodrigues, David J. Hardy, John E. Stone, Klaus Schulten, Wen-Mei W. Hwu University of Illinois at Urbana-Champaign Computing Frontiers, Ischia, Italy 7 May 2008

Cutoff Pair Potentials Essential to molecular modeling applications E.g., van der Waals, electrostatic potential Often the most costly part of computation Evaluate a cutoff electrostatic potential on a 3D lattice Applications include Structure building Ion placement Time-averaged potential Analysis Visualizing electrostatic potential 2D slice through an electrostatic potential map Red: positive Blue: negative 1

Algorithm for Pair Potentials At each grid point, sum the electrostatic potential from all atoms Highly data-parallel But has quadratic complexity Number of grid points number of atoms Both proportional to volume 2

Algorithm for Pair Potentials With a Cutoff Ignore atoms beyond a cutoff distance, r c Typically 8Å 12Å Long-range potential may be computed separately Number of atoms within cutoff distance is roughly constant On the order of 1000 3

Spatial Sorting Presort atoms into bins by location in space Each bin holds several atoms Cutoff potential only uses bins within r c Yields a linear complexity cutoff potential algorithm 4

The Draw of GPU Computing Raw power Highly parallel, throughput-oriented design 345.6 GFLOPS peak performance Programmability C-language programming interface via CUDA Adoptability Commodity hardware, easy for users to add to a desktop computer Large install base (1 million CUDA-capable GPUs sold per week) 5

Architecture of the G80 GPU 16 Streaming Multiprocessors 8 processors 768 thread contexts Groups of 32 threads (warps) run in lockstep Large, long-latency, off-chip global memory > 200 cycles 64-byte, aligned accesses are most efficient Effected with 16 consecutive accesses from a half-warp Scratchpad shared memory 16 single-ported banks No general-purpose cache 6

CUDA Programming Model Lightweight threads multiplexed onto processors 32 threads bundled into a warp SIMD-like simultaneous instruction issue Warps grouped into thread blocks Share resources on one Streaming Mutiprocessor CPU launches a single grid of many thread blocks Thread blocks start executing asynchronously as resources become available 7

GPU Programming Principles Create hundreds of thousands of small, independent threads Keep each thread s resource use small Registers, shared memory Allows many threads to be active simultaneously Exploit data locality and conserve memory bandwidth Avoid waiting for off-chip memory accesses Threads in a thread block can take advantage of shared memory on an SM 8

Previous Cutoff Kernel 6 speedup relative to CPU version Work-inefficient Coarse spatial hashing into (24Å) 3 bins Only 6.5% of the atoms a thread tests are within the cutoff distance Better adaptation of the algorithm to the GPU will gain another 2.5 9

Design Considerations for the New Cutoff Kernel High memory throughput to atom data essential Group threads together for locality Fetch blocks of data into shared memory Structure atom data to allow fetching After taking care of memory demand, optimize to reduce instruction count Loop and instruction-level optimization 10

Improving Work Efficiency (4Å) 3 cube of the potential map computed by each thread block 8 8 8 potential map points 128 threads per block 34% of atoms are within cutoff distance Thread block needs atom data up to the cutoff distance Use a sphere of bins All threads in a block scan the same atoms No hardware penalty for multiple simultaneous reads of the same address Simplifies fetching of data 11

Caching Atom Data >200 cycle global memory latency Effectively 1 cycle shared memory latency Shared memory used in software as a cache Threads in a thread block collectively load one bin at a time into shared memory Once loaded, threads scan atoms in shared memory Reuse: Loaded bins used 128 times Execution cycle of a thread block Threads individually compute potentials using bin in shared mem Time Collectively load next bin Suspend Data returned from global memory Another thread block runs while this one waits Ready Write bin to shared memory 12

High-Throughput Access to Atom Data Full global memory bandwidth only with 64- byte, 64-byte-aligned memory accesses Each bin is exactly 128 bytes Bins stored in a 3D array 128 bytes = 8 atoms (x,y,z,q) Nearly uniform density of atoms in typical systems 1 atom per 10 Å 3 Bins hold atoms from exactly (4Å) 3 of space Number of atoms in a bin varies For water test systems, 5.35 atoms in a bin on average Some bins overfull 13

Handling Overfull Bins 2.6% of atoms exceed bin capacity Spatial sorting puts these into a list of extra atoms Extra atoms processed by the CPU Computed with CPU-optimized algorithm Takes about 66% as long as GPU computation Overlapping GPU and CPU computation yields in additional speedup 14

GPU Thread Optimization Each thread computes potentials at four potential map points Reuse x and z components of distance calculation Check x and z components against cutoff distance (cylinder test) Exit inner loop early upon encountering the first empty slot in a bin 15

GPU Thread Inner Loop Exit when an empty atom bin entry is encountered Cylinder test for (i = 0; i < BIN_DEPTH; i++) { aq = AtomBinCache[i].w; if (aq == 0) break; dx = AtomBinCache[i].x - x; dz = AtomBinCache[i].z - z; dxdz2 = dx*dx + dz*dz; if (dxdz2 < cutoff2) continue; Cutoff test and potential value calculation } dy = AtomBinCache[i].y - y; r2 = dy*dy + dxdz2; if (r2 < cutoff2) poten0 += aq * rsqrtf(r2); dy = dy - 2 * grid_spacing; /* Repeat three more times */ 16

Cutoff Summation Runtime GPU cutoff with CPU overlap: 12x-21x faster than CPU core 50k 1M atom structure size 17

Cutoff Summation Speedup Diminished overlap benefit due to limited queue size (16 entries) 50k 1M atom structure size Cutoff summation alone 9-13 faster than CPU 18

Improving Floating-Point Accuracy GPU provides single-precision FP with slightly reduced accuracy on some operations Accuracy depends on summation order FP addition is not associative Compensated summation improves accuracy Less than 10% performance cost on GPU CPU GPU GPU with Compensated Summation Maximum % relative error 0.4793 0.8715 0.5710 Maximum % absolute error 0.0000939 0.0001579 0.0001579 Error relative to double-precision floating point on CPU 19

Summary Cutoff pair potentials heavily used in molecular modeling applications Use CPU to regularize the work given to the GPU to optimize its performance GPU performs very well on 64-byte-aligned array data Run CPU and GPU concurrently to improve performance Use shared memory as a program-managed cache 20

Thank You Publications in Molecular Modeling: Accelerating Molecular Modeling Applications with Graphics Processors. J. Stone, J. Phillips, P. Freddolino, D. Hardy, L. Trabuco, K. Schulten. J. Comp. Chem., 28:2618-2640, 2007. in GPU Optimization: Program Optimization Space Pruning for a Multithreaded GPU. S. Ryoo, C. I. Rodrigues, S. S. Stone, S. S. Baghsorkhi, S.-Z. Ueng, J. A. Stratton, W.-M. W. Hwu. in CGO 2008. Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA. S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, W.-M. W. Hwu. in PPoPP 2008. 21

Backup Slides 22

Ion Placement Selection of initial conditions for a negatively charged virus in water Neutralize charge by adding positively charged ions Also stabilizes the virus structure Ions placed at sites with most negative electrostatic potential 23

Simplified Pseudocode of the Cutoff Pair Potentials Algorithm for each grid point, r j : for each nearby bin, B: for each atom (q,r) in B: dr = r - r j if dr < r c : s = (1 - (dr/r c ) 2 ) 2 V(r j ) = V(r j ) + q/dr * s 24