Mechanisms for exploiting heterogeneous computing: Harnessing hundreds of GPUs and CPUs

Similar documents
Adaptive Heterogeneous Computing with OpenCL: Harnessing hundreds of GPUs and CPUs

Adaptive Heterogeneous Computing with OpenCL: Harnessing hundreds of GPUs and CPUs

BUDE. A General Purpose Molecular Docking Program Using OpenCL. Richard B Sessions

Julian Merten. GPU Computing and Alternative Architecture

Direct Self-Consistent Field Computations on GPU Clusters

Real-time signal detection for pulsars and radio transients using GPUs

Klaus Schulten Department of Physics and Theoretical and Computational Biophysics Group University of Illinois at Urbana-Champaign

Scalable and Power-Efficient Data Mining Kernels

Docking. GBCB 5874: Problem Solving in GBCB

Predictive Molecular Simulation for Drug Discovery

Petascale Quantum Simulations of Nano Systems and Biomolecules

Molecular dynamics simulation. CS/CME/BioE/Biophys/BMI 279 Oct. 5 and 10, 2017 Ron Dror

Progress of Compound Library Design Using In-silico Approach for Collaborative Drug Discovery

Homology modeling. Dinesh Gupta ICGEB, New Delhi 1/27/2010 5:59 PM

5.1. Hardwares, Softwares and Web server used in Molecular modeling

SP-CNN: A Scalable and Programmable CNN-based Accelerator. Dilan Manatunga Dr. Hyesoon Kim Dr. Saibal Mukhopadhyay

Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers

Chemical properties that affect binding of enzyme-inhibiting drugs to enzymes

The QMC Petascale Project

Population annealing study of the frustrated Ising antiferromagnet on the stacked triangular lattice

Perm State University Research-Education Center Parallel and Distributed Computing

A model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization)

Cheminformatics platform for drug discovery application

Tips Geared Towards R. Adam J. Suarez. Arpil 10, 2015

Open-Source Parallel FE Software : FrontISTR -- Performance Considerations about B/F (Byte per Flop) of SpMV on K-Supercomputer and GPU-Clusters --

est Drive K20 GPUs! Experience The Acceleration Run Computational Chemistry Codes on Tesla K20 GPU today

How is molecular dynamics being used in life sciences? Davide Branduardi

GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic

Graphics Card Computing for Materials Modelling

Research on GPU-accelerated algorithm in 3D finite difference neutron diffusion calculation method

Parallelization of Molecular Dynamics (with focus on Gromacs) SeSE 2014 p.1/29

Hybrid CPU/GPU Acceleration of Detection of 2-SNP Epistatic Interactions in GWAS

Introduction to numerical computations on the GPU

HYCOM and Navy ESPC Future High Performance Computing Needs. Alan J. Wallcraft. COAPS Short Seminar November 6, 2017

Introduction to Benchmark Test for Multi-scale Computational Materials Software

Performance Analysis of Lattice QCD Application with APGAS Programming Model

arxiv: v1 [hep-lat] 7 Oct 2010

Welcome to MCS 572. content and organization expectations of the course. definition and classification

CRYPTOGRAPHIC COMPUTING

MURCIA: Fast parallel solvent accessible surface area calculation on GPUs and application to drug discovery and molecular visualization

Efficient Molecular Dynamics on Heterogeneous Architectures in GROMACS

Claude Tadonki. MINES ParisTech PSL Research University Centre de Recherche Informatique

The Fast Multipole Method in molecular dynamics

11 Parallel programming models

Heterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry

A Quantum Chemistry Domain-Specific Language for Heterogeneous Clusters

Molecular Modelling for Medicinal Chemistry (F13MMM) Room A36

Mesoscopic Simulation of Aggregate Structure and Stability of Heavy Crude Oil by GPU Accelerated DPD

Ultra High Throughput Screening using THINK on the Internet

RAID+: Deterministic and Balanced Data Distribution for Large Disk Enclosures

The Quantum Landscape

Performance Evaluation of Scientific Applications on POWER8

Leveraging Task-Parallelism in Energy-Efficient ILU Preconditioners

Using AmgX to accelerate a PETSc-based immersed-boundary method code

User Guide for LeDock

Molecular Mechanics, Dynamics & Docking

MM-GBSA for Calculating Binding Affinity A rank-ordering study for the lead optimization of Fxa and COX-2 inhibitors

BeiHang Short Course, Part 7: HW Acceleration: It s about Performance, Energy and Power

A CUDA Solver for Helmholtz Equation

Ligand Scout Tutorials

GPU Computing Activities in KISTI

MD Simulation in Pose Refinement and Scoring Using AMBER Workflows

Benchmark of the CPMD code on CRESCO HPC Facilities for Numerical Simulation of a Magnesium Nanoparticle.

Random Sampling for Short Lattice Vectors on Graphics Cards

Using a CUDA-Accelerated PGAS Model on a GPU Cluster for Bioinformatics

Quantum Chemical Calculations by Parallel Computer from Commodity PC Components

Using AutoDock for Virtual Screening

Randomized Selection on the GPU. Laura Monroe, Joanne Wendelberger, Sarah Michalak Los Alamos National Laboratory

Alchemical free energy calculations in OpenMM

Accelerating linear algebra computations with hybrid GPU-multicore systems.

Hit Finding and Optimization Using BLAZE & FORGE

NIH Center for Macromolecular Modeling and Bioinformatics Developer of VMD and NAMD. Beckman Institute

APPLICATION OF CUDA TECHNOLOGY FOR CALCULATION OF GROUND STATES OF FEW-BODY NUCLEI BY FEYNMAN'S CONTINUAL INTEGRALS METHOD

Kd = koff/kon = [R][L]/[RL]

Nuclear Physics and Computing: Exascale Partnerships. Juan Meza Senior Scientist Lawrence Berkeley National Laboratory

GPU Accelerated Markov Decision Processes in Crowd Simulation

Softwares for Molecular Docking. Lokesh P. Tripathi NCBS 17 December 2007

N-body Simulations. On GPU Clusters

Chemical properties that affect binding of enzyme-inhibiting drugs to enzymes

MPI at MPI. Jens Saak. Max Planck Institute for Dynamics of Complex Technical Systems Computational Methods in Systems and Control Theory

PROTEIN-PROTEIN DOCKING REFINEMENT USING RESTRAINT MOLECULAR DYNAMICS SIMULATIONS

VMware VMmark V1.1 Results

Atomistic Simulation of Nuclear Materials

Piz Daint & Piz Kesch : from general purpose supercomputing to an appliance for weather forecasting. Thomas C. Schulthess

P214 Efficient Computation of Passive Seismic Interferometry

The Memory Intensive System

arxiv: v1 [physics.data-an] 19 Feb 2017

CIS 371 Computer Organization and Design

Integrated Cheminformatics to Guide Drug Discovery

The PhilOEsophy. There are only two fundamental molecular descriptors

INITIAL INTEGRATION AND EVALUATION

Numerical estimation of the area of the Mandelbrot set using quad tree tessellation and distance estimation

Accelerating Model Reduction of Large Linear Systems with Graphics Processors

Simulation Laboratories at JSC

GPU Acceleration of Cutoff Pair Potentials for Molecular Modeling Applications

On Portability, Performance and Scalability of a MPI OpenCL Lattice Boltzmann Code

Practical QSAR and Library Design: Advanced tools for research teams

Computational chemical biology to address non-traditional drug targets. John Karanicolas

Delayed and Higher-Order Transfer Entropy

Cheminformatics Role in Pharmaceutical Industry. Randal Chen Ph.D. Abbott Laboratories Aug. 23, 2004 ACS

Transcription:

Mechanisms for exploiting heterogeneous computing: Harnessing hundreds of GPUs and CPUs Simon McIntosh-Smith simonm@cs.bris.ac.uk Head of Microelectronics Research University of Bristol, UK 1

! Collaborators Richard B. Sessions, Amaurys Avila Ibarra University of Bristol, Biochemistry Developers of the docking algorithm and original serial code James Price (port to OpenCL) University of Bristol, Computer Science Tsuyoshi Hamada, Felipe Cruz (GPUs) University of Nagasaki, Japan 2

The Microelectronics Group at the University of Bristol http://www.cs.bris.ac.uk/research/micro/ 3

! The team Prof David May Dr Jose Nunez-Yanez Prof Dhiraj Pradhan Simon McIntosh-Smith Head of Group Dr Kerstin Eder Dr Dinesh Pamunuwa Dr Simon Hollis 7 tenured staff, 6 research assistants, 16 PhD students 4

! Research Group expertise Energy Aware COmputing (EACO): Multi-core and many-core computer architectures Algorithms for heterogeneous architectures (GPUs, OpenCL) Electronic and Optical Network on Chip (NoC) Reconfigurable architectures (FPGA) Design verification (formal and simulation-based), formal specification and analysis Silicon process variation Fault tolerant design (hardware and software) Design methodologies, modelling & simulation of MNT based structures and systems 5

! Industrial Collaboration The Bristol Microelectronics Group partners with most of Europe s chip industry: ARM Imagination Technology NVIDIA (Icera) Broadcom STMicro, ST-Ericsson Toshiba Infineon Gnodal Picochip And many more XMOS is a successful CPU spin-out from the group Global partners include AMD, Intel, IBM, 6

Molecular docking 7

! Molecular docking Proteins typically O(1000) atoms Ligands typically O(100) atoms 8

! BUDE: Bristol University Docking Engine Accuracy Speed Typical docking scoring functions Empirical Free Energy Forcefield BUDE Free Energy calculations MM 1,2 QM/MM 3 Entropy: solvation configurational Electrostatics All atom Explicit solvent No Yes Yes Approx Approx Yes? Approx Yes No Yes Yes No No Yes 1. MD Tyka, AR Clarke, RB Sessions, J. Phys. Chem. B 110 17212-20 (2006) 2. MD Tyka, RB Sessions, AR Clarke, J. Phys. Chem. B 111 9571-80 (2007) 3. CJ Woods, FR Manby, AJ Mulholland, J. Chem. Phys. 128 014109 (2008) 9

! Empirical Free Energy Function (atom-atom) ΔG ligand binding = i=1 N proteinj=1 N ligand f(xi,x j ) Parameterised using experimental data N. Gibbs, A.R. Clarke & R.B. Sessions, "Ab-initio Protein Folding using Physicochemical Potentials and a Simplified Off-Lattice Model", Proteins 43:186-202,2001 10

! BUDE Acceleration with OpenCL START (input) Copy protein and ligand coordinates (once) GA like, energy minimisation (output) END ( ) R x R xy R xz T x R yx R y R yz T y R zx R zy R z T z Energy of pose Geometry (transform ligand) Energy i=1 N prot j=1 N lig f(x i,x j ) "Benchmarking energy efficiency, power costs and carbon emissions on heterogeneous systems", Simon McIntosh-Smith, Terry Wilson, 11 Amaurys Avila Ibarra, Jon Crisp and Richard B.Sessions, The Computer Journal, September 12th 2011. DOI: 10.1093/comjnl/bxr091

! Multiple levels of parallelism O(10 8 ) conformers from O(10 7 ) ligands O(10 5 ) poses per conformer (ligand) O(10 3 ) atoms per protein O(10 2 ) atoms per ligand (drug molecule) Conformers all independent Poses can all be independent, but there are benefits in grouping all poses of one conformer to one OpenCL device Parallelism strategy: Distribute ligands across nodes 10 7 -way parallelism All the poses of one conformer distributed across all the OpenCL devices in a node 10 3 -way parallelism Each Work-Item (thread) performs an entire conformer-protein docking 10 5 atom-atom force calculations per Work-Item 12

! How BUDE s EMC works 13

Experimental results 14

! Redocking into Xray Structure 1CIL (Human carbonic anhydrase II) 15

! Another example 1EZQ (Human Factor XA) 16

! The science is working well 53 Experimental results from the Ligand Binding Database 17

Heterogeneous Systems 18

! OpenCL for heterogeneous computing A modern computer includes: One or more CPUs One or more GPUs CPU GPU CPU GPU OpenCL (Open Compute Language) lets programmers write a single portable program that uses ALL resources in the heterogeneous platform 19

! Kinds of Heterogeneous Systems SuperMicro 4 GPU workstation 20

! Kinds of Heterogeneous Systems SuperMicro GPU servers 2 GPUs in 1U 21

! Kinds of Heterogeneous Systems HP SL390 2 CPUs, 8 GPUs in half width 4U 22

! Kinds of Heterogeneous Systems Dell PowerEdge C410x 16 GPUs in 4U 23

! Kinds of Heterogeneous Systems AMD Llano Fusion APUs 24

! Kinds of Heterogeneous Systems Intel Core2 Duo CPU P8600 @ 2.40GHz, NVIDIA GeForce 9400M integrated GPU, NVIDIA GeForce 9600M GT discrete GPU 25

Benchmark results 26

! BUDE s heterogeneous approach 1. Discover all OpenCL platforms/devices, including CPUs and GPUs 2. Run a micro benchmark on each device, ideally a short piece of real work 3. Load balance using micro benchmark results 4. Re-run micro benchmark at regular intervals in case load changes 27

! Benchmarking methodology Use the same power measurement equipment for all the systems under test Watts Up? Pro meter Measures complete system power at the wall Run as fast as possible on all available resources (i.e. all cores or all GPUs simultaneously) 28

! MacBook Pro 2009 results Time in seconds, lower is better 29

! Benchmark results 30

! Performance results Speedup (higher is better) 16 14 12 10 8 6 4 13.4 11.7 8.1 GPUs 7.3 6.5 5.3 2 0 M2050 (x2) GTX-580 M2070 M2050 GTX-285 HD5870 A8-3850 0.8 31

! Performance results Speedup (higher is better) 1.2 1.0 0.8 0.6 0.4 2.8GHz 1.0 2.4GHz 3.4GHz 0.9 0.9 2.3GHz 0.7 CPUs 2.9GHz 0.3 0.2 0.0 8 cores E5462 (Fortran) x2 8 cores 4 cores 4 cores 4 cores E5620 x2 i7-2600k i5-2500t A8-3850 CPU 32

! Performance results Speedup (higher is better) 16 14 12 10 8 6 13.6 Heterogeneous (CPUs+GPUs) 11.7 5.9 4 2 1.2 0 M2050 (x2) & E5620 (x2) GTX-580 & E8500 HD5870 & i5-2500t A8-3850 GPU & CPU 33

! Performance results 16 Speedup (higher is better) 14 12 10 8 6 13.4 11.7 Where we finished Where we could go Where we started 4 2 1.0 0 M2050 x2 GTX-580 E5462 (Fortran) x2 ~$2,000 x2 ~$400 34

! Relative energy and run-time 18 16 88% reduction in energy 93% reduction in time 16.2 15.4 Relative Performance 15.2 14 13.2 Relative Performance Per Watt 12 10.6 11.4 10 9.3 9.1 8 6 6.6 5.9 4.5 4 2 1.0 1.0 0.8 0.7 2.4 0 M2050 + E5620 x2 M2050 x2 GTX-580 HD5870 + i5-2500t HD5870 E5620 x2 A8-3850 GPU i5-2500t Measurements are for a constant amount of work. Energy measurements are at the wall and include any idle components. 35

What does this let us do? 36

! Potentially save lives New Delhi Metallo-betalactamase-1 (NDM-1) is an enzyme that makes bacteria resistant to antibiotics, giving rise to superbugs http://news.discovery.com/ human/superbug-found-injapan.html 37

! NDM-1 as a docking target NDM-1 protein made up of 939 atoms 38

! GPU-system DEGIMA Used 222 GPUs in parallel for drug docking simulations ATI Radeon HD5870 (2.72 TFLOPS) & Intel i5-2500t ~600 TFLOPS single precision Courtesy of Tsuyoshi Hamada and Felipe Cruz, Nagasaki 39

! NDM-1 experiment 7.65 million candidate drug molecules, 21.8 conformers each à 166.7x10 6 dockings 4.168 x 10 12 poses calculated ~98 hours actual wall-time Top 300 hits being analysed, down selecting to 10 compounds for wetlab trials 40

Future heterogeneous systems 41

! Future heterogeneous systems ARM Cortex Quad-core CPU @1.4GHz, 10GbE, 4GB DDR3 DRAM 42

! Future heterogeneous systems PandaBoard TI OMAP4, dual core ARM, PowerVR GPU 43

! Future heterogeneous systems Sony PSVita quad core ARM, PowerVR GPU 44

! Future heterogeneous systems Asus quad-core tablet announced at CES 2012 45

! Future heterogeneous systems 46

! Conclusions OpenCL enables truly heterogeneous computing, harnessing all hardware resources in a system GPUs can yield significant savings in energy costs (and equipment costs) OpenCL can work just as well for multi-core CPUs as it does for GPUs It s possible to screen libraries of millions of molecules against complex targets using highly accurate, computationally-expensive methods in one weekend using equipment costing O( 100K) 47

! For an introduction to GPUs The GPU Computing Revolution A Knowledge Transfer Report from the London Mathematical Society and the KTN for Industrial Mathematics https://ktn.innovateuk.org/web/mathsktn/ articles/-/blogs/the-gpu-computingrevolution 48

! References S. McIntosh-Smith, T. Wilson, A.A. Ibarra, J. Crisp and R.B. Sessions, "Benchmarking energy efficiency, power costs and carbon emissions on heterogeneous systems", The Computer Journal, September 12th 2011. DOI: 10.1093/comjnl/bxr091 N. Gibbs, A.R. Clarke & R.B. Sessions, "Abinitio Protein Folding using Physicochemical Potentials and a Simplified Off-Lattice Model", Proteins 43:186-202,200 49