Parallelization of Molecular Dynamics (with focus on Gromacs) SeSE 2014 p.1/29

Outline
- A few words on MD applications and the GROMACS package
- The main work in an MD simulation
- Parallelization
- Stream computing
SeSE 2014 p.2/29

Supercomputers
- Lindgren (PDC): Cray XE6, 36 000 AMD cores, 305 teraflops
- K computer (RIKEN, JP): 705 024 SPARC64 cores, 10 petaflops (faster than the next 5 computers together)
- Cray XK (just on the market): nodes with 2 AMD 16-core Interlagos processors and 2 NVIDIA Tesla X2090 GPUs
SeSE 2014 p.3/29

Challenges for exascale computing
Issues:
- Power usage (K computer: 10 MW; cooling doubles this)
- Reliability: the chance of components failing increases
- Interconnect speed needs to keep up with millions of cores
- Data handling
- Software does not scale!
Question: Do we really need an exaflop computer? Does that get us better science?
SeSE 2014 p.4/29

Two general computing trends
Over the past decades:
- Moore's law: transistor count doubles every two years
- Processor speed doubles every 18 months
- Memory speed does not increase this fast
Result: memory access is relatively expensive!
Second important development:
- Past two decades: slow increase of parallel computing
- This decade: no MHz increase, instead core count increases
- Arrival of GPGPUs: even more cores!
Result: parallel algorithms are very important
SeSE 2014 p.5/29

MD versus typical HPC applications
Other typical HPC applications, e.g. turbulence:
- Weak scaling: scale the problem size with the computer size
- Calculation times stay the same
Molecular dynamics:
- Little increase in problem size
- Time per step decreases (sub-millisecond)
- Need lots of computing power, but use of supercomputers is challenging
- Algorithms need to be re-designed
SeSE 2014 p.6/29

How to write efficient code for MD?
Hardware and software:
- Multi-core CPUs: C (SIMD intrinsics?) or Fortran + MPI + OpenMP
- GPUs: CUDA or OpenCL
- Intel Xeon Phi (Larrabee, MIC, ...): C or something else?
- FPGAs?
- OpenACC for everything: much less work
Molecular dynamics needs sub-millisecond iteration times, so we need low-level optimization: C + SIMD intrinsics + CUDA (a scalar baseline loop is sketched below)
A lot of work, but it might be worth it
SeSE 2014 p.7/29
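To make the target of these optimizations concrete, here is a minimal scalar C sketch of a Lennard-Jones pair-force inner loop of the kind that later gets vectorized; the array layout and names are illustrative assumptions, not GROMACS code.

```c
/* Scalar Lennard-Jones inner loop for V(r) = A/r^12 - B/r^6.
 * Illustrative only; layout and names are assumptions, not GROMACS code. */
void lj_forces(int npairs, const int (*pairs)[2],
               const float *x,       /* coordinates, x[3*i + d]        */
               float A, float B,     /* one atom-type pair for brevity */
               float *f)             /* forces, accumulated in place   */
{
    for (int k = 0; k < npairs; k++) {
        int i = pairs[k][0], j = pairs[k][1];
        float dx = x[3*i]   - x[3*j];
        float dy = x[3*i+1] - x[3*j+1];
        float dz = x[3*i+2] - x[3*j+2];
        float rinv2 = 1.0f/(dx*dx + dy*dy + dz*dz);
        float rinv6 = rinv2*rinv2*rinv2;
        /* |F|/r = (12*A/r^12 - 6*B/r^6)/r^2 */
        float fscal = (12.0f*A*rinv6*rinv6 - 6.0f*B*rinv6)*rinv2;
        f[3*i]   += fscal*dx;  f[3*j]   -= fscal*dx;
        f[3*i+1] += fscal*dy;  f[3*j+1] -= fscal*dy;
        f[3*i+2] += fscal*dz;  f[3*j+2] -= fscal*dz;
    }
}
```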

Classical molecular simulation
Common applications:
- simple liquids: 10^3 - 10^4 atoms, ns - µs
- peptide/protein folding: 10^4 - 10^5 atoms, µs - s
- protein functional motions: 10^5 - 10^6 atoms, µs - s
- polymers: 10^4 - 10^6 atoms, µs - ?
- materials science: 10^4 - 10^8 atoms, ns - ?
SeSE 2014 p.8/29

Simulations of bio-molecules
- Proteins, DNA, RNA
- Fixed system sizes
- Functional motions
- Overdamped dynamics
- Stochastic kinetics
- Need to sample many events
- The time scales often increase exponentially with system size
SeSE 2014 p.9/29

Simulations in (Chemical) Physics
- Simulations of biomass for bio-ethanol (millions of atoms)
- Understanding and controlling interactions of droplets on surfaces (100 million particles)
SeSE 2014 p.10/29

Molecular Dynamics
Basically solving Newton's equation of motion:
$$m_i \frac{d^2 x_i}{dt^2} = -\nabla_i V(x) = F_i(x), \qquad i = 1,\ldots,N_{\mathrm{atoms}}$$
Symplectic leap-frog integrator:
$$v_i(n+\tfrac{1}{2}) = v_i(n-\tfrac{1}{2}) + \frac{\Delta t}{m_i}\,F_i(x(n)), \qquad x_i(n+1) = x_i(n) + \Delta t\, v_i(n+\tfrac{1}{2})$$
$$V(x) = \sum_{\mathrm{bonded}} V_b(x) + \sum_{i<j} \left( \frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}} + \frac{q_i q_j}{r_{ij}} \right)$$
Computational cost: integration: c N; force calculation: C N^2
SeSE 2014 p.11/29
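A minimal sketch of the leap-frog update above in plain C, assuming the forces for the current step have already been computed; names and array layout are illustrative, not GROMACS code.

```c
/* One leap-frog step for natoms atoms, following the update above.
 * x, v, f are length 3*natoms (xyz interleaved); invmass has length natoms. */
void leapfrog_step(int natoms, double dt,
                   double *x, double *v,
                   const double *f, const double *invmass)
{
    for (int i = 0; i < natoms; i++) {
        for (int d = 0; d < 3; d++) {
            v[3*i + d] += dt * invmass[i] * f[3*i + d];  /* v(n+1/2) */
            x[3*i + d] += dt * v[3*i + d];               /* x(n+1)   */
        }
    }
}
```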

Molecular Dynamics in practice
- Setting up simulations requires a lot of physical/chemical/biological knowledge
- The force field (parameters) is critical: the results are as good/bad as the force field
- Developing algorithms for simulations is a completely different business, but you need to know the needs of the application
SeSE 2014 p.12/29

Computational cost: force calculation
All forces are short-ranged, except for the Coulomb forces.
[Figure: Coulomb and Lennard-Jones potentials versus distance (nm)]
Thus use a (smooth) cut-off for the particle-particle interactions and do the remaining part of the Coulomb interaction on a grid.
Particle Mesh Ewald electrostatics: solve the Poisson equation in reciprocal space using a 3D FFT
SeSE 2014 p.13/29
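For context (a standard identity, not spelled out on the slide), Ewald/PME splits each Coulomb term into a short-ranged real-space part, handled in the cut-off pair loop, and a smooth part evaluated on the grid in reciprocal space via the 3D FFT; $\beta$ is the splitting parameter:

$$\frac{1}{r_{ij}} = \underbrace{\frac{\operatorname{erfc}(\beta r_{ij})}{r_{ij}}}_{\text{short-ranged, cut-off pair loop}} + \underbrace{\frac{\operatorname{erf}(\beta r_{ij})}{r_{ij}}}_{\text{smooth, reciprocal space (3D FFT)}}$$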

Computational breakdown
(computation: cost, communication)
- bonded forces: c N, along with the non-bonded forces
- non-bonded forces: c M N with M = 10^2 - 10^3, local
- spread charge: c N, local
- 3D FFT: c N log N, global
- gather forces: c N, local
- integration: c N
- constraints: c N, local (Δt: 4)
- virtual sites: c N, local (Δt: 2)
Time step: 2-5 fs, simulation length 10^7 to 10^9 steps. A step takes 1-100 milliseconds.
SeSE 2014 p.14/29
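A quick worked example with the numbers above (illustrative arithmetic, not from the slide): simulating 1 µs at a 2 fs time step takes $5\times 10^{8}$ steps; at 2 ms of wall-clock time per step that is

$$N_{\text{steps}} = \frac{1\ \mu\text{s}}{2\ \text{fs}} = 5\times 10^{8}, \qquad t_{\text{wall}} = 5\times 10^{8} \times 2\ \text{ms} = 10^{6}\ \text{s} \approx 11.6\ \text{days}.$$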

CPU vs GPU
How many pair interactions per second?
- Intel Core i7, 2.67 GHz, 4 cores + Hyper-Threading
- NVIDIA GTX 580, 1.5 GHz, 512 CUDA cores
SeSE 2014 p.15/29

CPU vs GPU
How many pair interactions per second?
- Intel Core i7, 2.67 GHz, 4 cores + Hyper-Threading: 192 million per core, 960 million total
- NVIDIA GTX 580, 1.5 GHz, 512 CUDA cores: 3200 million total
SeSE 2014 p.15/29
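To put these rates in perspective, a rough estimate assuming a water-like atom density of about 100 atoms/nm^3 and a 1 nm cut-off (neither stated on the slide): for N = 10^5 atoms,

$$n_{\text{pairs}} \approx \frac{N}{2}\cdot\frac{4}{3}\pi r_c^{3}\,\rho \approx \frac{10^{5}}{2}\cdot 4.19\cdot (1\ \text{nm})^{3}\cdot 100\ \text{nm}^{-3} \approx 2\times 10^{7},$$

so at the GPU rate of $3.2\times 10^{9}$ pairs/s the non-bonded work alone takes roughly $2\times 10^{7} / 3.2\times 10^{9}\ \text{s}^{-1} \approx 6$ ms per step.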

Speeding up simulations
For many applications we would like orders of magnitude more sampling. Speed up simulations through:
- Increasing the time step (constraints, virtual interaction sites)
- Using fewer particles (triclinic, more spherical periodic cells)
- Using more processors: parallelize
- Using stream computing
Parallelization nowadays is MPI + threads (see the sketch below)
SeSE 2014 p.16/29
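A minimal sketch of the MPI + threads pattern (hybrid MPI + OpenMP in C): one MPI rank per spatial domain, OpenMP threads inside each rank. The force routine is a hypothetical placeholder, not a GROMACS function.

```c
/* Hybrid MPI + OpenMP skeleton for an MD-like step loop. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

static void compute_forces_range(int thread, int nthreads)
{
    /* placeholder: each thread would process its slice of the pair list */
    (void)thread; (void)nthreads;
}

int main(int argc, char **argv)
{
    int provided, rank, nranks;
    const int nsteps = 100;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    printf("rank %d of %d\n", rank, nranks);

    for (int step = 0; step < nsteps; step++) {
        /* MPI: halo exchange of coordinates with neighboring domains ... */

        #pragma omp parallel
        {
            compute_forces_range(omp_get_thread_num(),
                                 omp_get_num_threads());
        }

        /* MPI: communicate forces, then integrate locally ... */
    }

    MPI_Finalize();
    return 0;
}
```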

Scaling limits
- Strong scaling limited by: all communication latencies/bandwidth and load imbalance
- Weak scaling limited by: electrostatics, global communication
Electrostatics solvers:
- PME/PM: O(N log N), small pre-factor
- Multi-level methods: O(N), large pre-factor
- Fast multipole: O(N), discontinuous gradient
Note: (nearly) embarrassing parallelism: run 1-1000 simulations in parallel
SeSE 2014 p.17/29

GROMACS simulation package
- Started at the University of Groningen in the early 1990s; now all main developers are in Sweden (Stockholm/Uppsala)
- Thousands of users worldwide; a million users through Folding@home
- Open source (GPL), www.gromacs.org
- Support for all major biomolecular force fields
- Highly optimized algorithms and code
- Advanced algorithms for increasing the time step
- Language: C, with essential parts in assembly/intrinsics: SSE2/Altivec/CUDA/...
- Efficient single-processor performance
- Efficient parallelization for the past three years
Hess, Kutzner, van der Spoel, Lindahl; J. Chem. Theory Comput. 4, 435 (2008)
SeSE 2014 p.18/29

Domain decomposition methods
[Figure: (a) half shell, (b) eighth shell, (c) midpoint; communication cut-offs r_c, r_c, and ½ r_c]
                          Half shell | Eighth shell | Midpoint
#cells    r_c < L_d           13     |      7       |   26
          r_c < 2 L_d         62     |     26       |   26
volume    r_c = ½ L_d     2.94 L_d^3 |  2.15 L_d^3  |
          r_c = L_d       9.81 L_d^3 |  5.88 L_d^3  |
          r_c >> L_d       ½ sphere  |   ⅛ sphere   |
Eighth shell: Liem, Brown, Clarke; Comput. Phys. Commun. 67(2), 261 (1991)
Midpoint: Bowers, Dror, Shaw; JCP 124, 184109 (2006)
SeSE 2014 p.19/29

Full, dynamic load balancing
Load imbalance can occur for three reasons:
- inhomogeneous particle distribution
- inhomogeneous interaction cost distribution (charged/uncharged, water/non-water due to the GROMACS water inner loops)
- statistical fluctuations (only with small particle numbers)
SeSE 2014 p.20/29

Full, dynamic load balancing
Load imbalance can occur for three reasons:
- inhomogeneous particle distribution
- inhomogeneous interaction cost distribution (charged/uncharged, water/non-water due to the GROMACS water inner loops)
- statistical fluctuations (only with small particle numbers)
So we need a dynamic load balancing algorithm where the volume of each domain decomposition cell can be adjusted independently (a minimal sketch follows below)
SeSE 2014 p.20/29
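A minimal 1-D sketch of the idea, assuming each cell reports the wall-clock time it spent on force computation in the previous steps; this illustrates boundary scaling only and is not the actual GROMACS DLB algorithm.

```c
/* Resize ncells 1-D cell widths so that slow (overloaded) cells shrink and
 * fast cells grow. widths[] sums to the box length; load[] is measured time.
 * Illustrative only; real DLB also enforces a minimum cell size >= r_c. */
void balance_cells_1d(int ncells, double *widths, const double *load)
{
    double box = 0.0, inv_sum = 0.0;
    for (int c = 0; c < ncells; c++) {
        box     += widths[c];
        inv_sum += widths[c] / load[c];   /* capacity ~ width per unit time */
    }
    for (int c = 0; c < ncells; c++) {
        /* new width proportional to how much work the cell can absorb */
        widths[c] = box * (widths[c] / load[c]) / inv_sum;
    }
}
```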

Triclinic unit cells with load balancing
[Figure: staggered, dynamically load-balanced domain decomposition cells in a triclinic unit cell]
SeSE 2014 p.21/29

Dynamic load balancing in action SeSE 2014 p.22/29

Multiple-program, multiple-data (MPMD)
[Diagram: SPMD layout (PME interleaved with MD on all nodes) vs. MPMD layouts with dedicated PME nodes: latency 4/3 lower, bandwidth 4/3 higher; grouping the PME nodes: latency 4x lower]
Further advantage: 4 to 16 times fewer MPI messages
Enables 4x further scaling!
SeSE 2014 p.23/29
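The MPMD split can be expressed with an MPI communicator split; a minimal sketch (the 1-in-4 PME ratio and the routine bodies are illustrative assumptions, not GROMACS code):

```c
/* Split MPI_COMM_WORLD into particle-particle (PP) and PME ranks,
 * here 1 PME rank for every 3 PP ranks. Illustrative sketch only. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int is_pme = (rank % 4 == 3);          /* every 4th rank does PME      */
    MPI_Comm work_comm;                    /* PP-only or PME-only subgroup */
    MPI_Comm_split(MPI_COMM_WORLD, is_pme, rank, &work_comm);

    if (is_pme) {
        /* PME ranks: receive charges, 3D FFT, solve, send forces back */
        printf("rank %d: PME\n", rank);
    } else {
        /* PP ranks: domain decomposition, pair forces, integration    */
        printf("rank %d: PP\n", rank);
    }

    MPI_Comm_free(&work_comm);
    MPI_Finalize();
    return 0;
}
```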

Flow chart, CPU only
[Flow chart of one MD step, two processes side by side]
Particle (PP) process: send x to PME process; communicate x; calculate f; receive f from PME process; reduce f; integrate equations of motion; apply constraints
PME mesh process: receive x from PP process; redistribute x; spread charges on the grid; 3D FFT; solve; inverse 3D FFT; spread forces; redistribute f; send f to PP process
SeSE 2014 p.24/29

2D PME (pencil) decomposition
Cellulose + lignocellulose + water, 2.2 million atoms, Cray XT5
[Plot: performance (ns/day) vs. #cores (up to 3072) for 1D PME decomposition, 2D PME decomposition, and 2D PME with "cube"]
SeSE 2014 p.25/29

How far can we scale?
100 million atoms, half protein, half water, no PME (!)
Jaguar, Cray XT5 at Oak Ridge
[Plot: ns per day vs. # CPU cores, up to 150 000 cores]
Issues:
- a µs is far too short for a 100 million atom system
- no full electrostatics
SeSE 2014 p.26/29

How far can we scale?
100 million atoms, half protein, half water, no PME (!)
Jaguar, Cray XT5 at Oak Ridge
[Plot: ns per day vs. # CPU cores, up to 150 000 cores]
Issues:
- a µs is far too short for a 100 million atom system
- no full electrostatics
Solutions:
- combine thread and MPI parallelization (a lot of work)
- develop electrostatics methods with less communication
SeSE 2014 p.26/29

Stream calculations
- Single-CPU code: focus on functions
- Stream calculations: focus on data streams, operations done in parallel
Common stream-type architectures:
- SSE: operations on 4 floats at once (or 2 doubles)
- AVX: operations on 8 floats at once (Intel Sandy Bridge, AMD Bulldozer)
- CUDA (NVIDIA GPUs): operations on 32 floats/doubles at once
So 4 to 32 times speedup, assuming the same #cores and frequency (see the intrinsics sketch below)
SeSE 2014 p.27/29
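A minimal AVX intrinsics sketch in C, processing 8 floats per instruction as described above; it vectorizes only the squared-distance part of a pair loop, and the array layout is an illustrative assumption.

```c
/* Squared distances for 8 pairs at once with AVX (8 floats per operation).
 * dx, dy, dz, r2 hold n floats, n a multiple of 8, 32-byte aligned.
 * Compile with AVX enabled (e.g. -mavx). Illustrative sketch only. */
#include <immintrin.h>

void dist2_avx(int n, const float *dx, const float *dy,
               const float *dz, float *r2)
{
    for (int i = 0; i < n; i += 8) {
        __m256 vx = _mm256_load_ps(dx + i);
        __m256 vy = _mm256_load_ps(dy + i);
        __m256 vz = _mm256_load_ps(dz + i);
        __m256 vr2 = _mm256_add_ps(
            _mm256_mul_ps(vx, vx),
            _mm256_add_ps(_mm256_mul_ps(vy, vy),
                          _mm256_mul_ps(vz, vz)));
        _mm256_store_ps(r2 + i, vr2);
    }
}
```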

Hybrid acceleration
[Diagram: compute node with CPU and GPU sharing the simulation box]
- The GPU only does the non-bonded interactions
- Re-use most of the CPU code
- Use the GPU for what it's good at
SeSE 2014 p.28/29
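A sketch of the resulting per-step CPU/GPU overlap; all functions are empty, clearly hypothetical placeholders (not a real GROMACS or CUDA API), and only the control flow is the point.

```c
/* Per-step CPU/GPU overlap: the GPU computes the non-bonded forces while
 * the CPU handles the bonded and PME work. Placeholder functions only. */
struct md_state { /* coordinates, forces, pair lists, ... */ int dummy; };

static void gpu_launch_nonbonded_async(struct md_state *s) { (void)s; }
static void compute_bonded_forces(struct md_state *s)      { (void)s; }
static void compute_pme_forces(struct md_state *s)         { (void)s; }
static void gpu_wait_forces(struct md_state *s)            { (void)s; }
static void sum_and_integrate(struct md_state *s)          { (void)s; }

void do_md_step(struct md_state *s)
{
    gpu_launch_nonbonded_async(s);  /* non-blocking: GPU starts pair forces   */
    compute_bonded_forces(s);       /* CPU does bonded forces in the meantime */
    compute_pme_forces(s);          /* ... and the PME mesh part              */
    gpu_wait_forces(s);             /* wait for the GPU non-bonded forces     */
    sum_and_integrate(s);           /* combine forces, leap-frog, constraints */
}
```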

Cut-off schemes in Gromacs
- Gromacs 4.5: group cut-off scheme, based on (ancient) charge groups
- Gromacs 4.6: Verlet buffered cut-off scheme optional
- Gromacs 5.0: Verlet cut-off scheme default
Group cut-off scheme:
- Fast, since we process groups of atoms at once; water is a group: very fast
- No Verlet list buffer by default, but optional: energy drift
Verlet cut-off scheme:
- Fast on wide SIMD and GPUs
- Verlet list buffer by default
- Supports OpenMP parallelization (+MPI)
A sketch of a buffered pair-list build follows below.
SeSE 2014 p.29/29
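A minimal O(N^2) sketch of building a buffered Verlet pair list: all pairs within the cut-off plus a buffer are stored, so the list stays valid for several steps. Production code uses cell/cluster search instead of the brute-force double loop; names and layout are illustrative assumptions.

```c
/* Build a buffered Verlet pair list: store all pairs within rcut + rbuf. */
#include <stdlib.h>

struct pairlist { int n; int (*pairs)[2]; };

struct pairlist build_verlet_list(int natoms, const float *x,
                                  float rcut, float rbuf)
{
    float rlist2 = (rcut + rbuf) * (rcut + rbuf);
    struct pairlist pl;
    pl.n = 0;
    /* worst-case allocation for clarity; real code sizes this properly */
    pl.pairs = malloc((size_t)natoms * natoms / 2 * sizeof(*pl.pairs));

    for (int i = 0; i < natoms; i++) {
        for (int j = i + 1; j < natoms; j++) {
            float dx = x[3*i]   - x[3*j];
            float dy = x[3*i+1] - x[3*j+1];
            float dz = x[3*i+2] - x[3*j+2];
            if (dx*dx + dy*dy + dz*dz < rlist2) {
                pl.pairs[pl.n][0] = i;
                pl.pairs[pl.n][1] = j;
                pl.n++;
            }
        }
    }
    return pl;
}
```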

Later today
- What is stream computing (SIMD, GPUs, CUDA)
- How to use it efficiently in MD
SeSE 2014 p.30/29