Optimizing GROMACS for parallel performance


Optimizing GROMACS for parallel performance

Outline
1. Why optimize? The performance status quo
2. GROMACS as a black box (PME)
3. How does GROMACS spend its time? (MPE)
4. What you can do / What I want to do next

1 Why optimize? The performance status quo
GROMACS offers high single-CPU performance compared to AMBER, CHARMM, and GROMOS96 (Lindahl et al. 2001). The question is how much faster it gets on multiple CPUs.

2 Why optimize? The performance status quo
For 1 ... n processors, define the scaling/efficiency E and the speedup S from the run times T_n:

E = \frac{T_1}{n\,T_n}, \qquad S = \frac{T_1}{T_n}

Run times T_n (±0.5%) on Orca1:

n    T_n (s)   scaling E   speedup S
1    878       1.00        1.00
2    456       0.96        1.93     good
4    372       0.59        2.36     acceptable
8    652       0.17        1.35     waste of resources
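
The E and S columns follow directly from the measured run times; a minimal C sketch (not part of the original slides) that reproduces them from the Orca1 timings quoted above:

#include <stdio.h>

/* Sketch: reproduce scaling E = T1/(n*Tn) and speedup S = T1/Tn
 * from the Orca1 run times quoted on this slide. */
int main(void)
{
    const int    ncpu[] = { 1, 2, 4, 8 };
    const double T[]    = { 878.0, 456.0, 372.0, 652.0 };  /* run times in s */
    const double T1     = T[0];

    printf("%4s %8s %10s %10s\n", "n", "T_n(s)", "E", "S");
    for (int i = 0; i < 4; i++)
    {
        double S = T1 / T[i];       /* speedup                 */
        double E = S / ncpu[i];     /* scaling / efficiency    */
        printf("%4d %8.0f %10.2f %10.2f\n", ncpu[i], T[i], E, S);
    }
    return 0;
}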

What can be done?

3 Potential for optimizations
GROMACS as a black box:
- LAM / MPICH parameters: communication module (TCP, usysv, VIA), size of rpi_tcp_short, ...
- GROMACS parameters: shuffle & sort, optimize FFT, fourierspacing & PME order
Inside the box:
- Local optimizations: MPI realization of a given task
- Restructure the communication scheme

4 Why PME is used
- Van der Waals: OK to use a cut-off radius of 1.0-1.2 nm
- Coulomb: a plain cutoff causes unphysical artefacts in structure and dynamics
- ~90% of the run time goes into calculating the non-bonded electrostatic forces

5 Particle Mesh Ewald = Mesh up the Ewald sum!
N particles, charges q_i, positions r_i, neutral cubic box of length L, periodic boundary conditions. Electrostatic energy (conditionally convergent, S-L-O-W):

V = \frac{1}{2} \sum_{i,j=1}^{N} \sum_{\vec{n} \in \mathbb{Z}^3} \frac{q_i q_j}{|\vec{r}_{ij} + \vec{n}L|}    (1)

Trick 1: Ewald summation. Split eq. (1) via

\frac{1}{r} = \frac{f(r)}{r} + \frac{1 - f(r)}{r}    (2)

The Coulomb potential varies rapidly at small r and decays slowly at large r:
- f(r)/r -> 0 beyond some cutoff r_max (real-space part)
- (1 - f(r))/r is a slowly varying function of r, so its Fourier transform needs only a few \vec{k} vectors (reciprocal-space part)

6 Ewald formula for f(r) = erfc(alpha r)
Ewald parameter \alpha: relative weight of V_{dir} to V_{rec}.

V = V_{dir} + V_{rec} + V_0    (3)

V_{dir} = \frac{1}{2} \sum_{i,j} \sum_{\vec{m} \in \mathbb{Z}^3} q_i q_j \frac{\mathrm{erfc}(\alpha |\vec{r}_{ij} + \vec{m}L|)}{|\vec{r}_{ij} + \vec{m}L|}    (4)

V_{rec} = \frac{1}{2L^3} \sum_{\vec{k} \neq 0} \frac{4\pi}{k^2} e^{-k^2/4\alpha^2} |\hat{\rho}(\vec{k})|^2, \qquad \vec{k} \in \{2\pi\vec{n}/L : \vec{n} \in \mathbb{Z}^3\}    (5)

V_0 = -\frac{\alpha}{\sqrt{\pi}} \sum_i q_i^2    (6)

The exponentially converging sums over \vec{m} and \vec{k} in eqs. (4), (5) allow the introduction of cutoffs. Here \hat{\rho}(\vec{k}) = \sum_{j=1}^{N} q_j e^{i\vec{k}\cdot\vec{r}_j} is the Fourier-transformed charge density. The forces finally follow from F_i = -\nabla_{\vec{r}_i} V.
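
To make the split concrete, here is a minimal C sketch (not from the slides) of the direct-space part V_dir restricted to the minimum-image term (m = 0), with hypothetical arrays q[] and r[][3]; it illustrates only the erfc-screened pair sum with a cutoff, not the reciprocal or self terms:

#include <math.h>

/* Sketch: direct-space Ewald energy for the minimum-image term (m = 0),
 * assuming charges q[N] and positions r[N][3] in a cubic box of length L. */
double ewald_direct(int N, const double *q, const double (*r)[3],
                    double L, double alpha, double rmax)
{
    double Vdir = 0.0;
    for (int i = 0; i < N; i++)
    {
        for (int j = i + 1; j < N; j++)   /* i < j replaces the 1/2 prefactor */
        {
            double d2 = 0.0;
            for (int d = 0; d < 3; d++)
            {
                double x = r[i][d] - r[j][d];
                x -= L * round(x / L);    /* minimum-image convention */
                d2 += x * x;
            }
            double dist = sqrt(d2);
            if (dist < rmax)              /* erfc decays fast, so a cutoff is OK */
                Vdir += q[i] * q[j] * erfc(alpha * dist) / dist;
        }
    }
    return Vdir;
}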

7 Ewald summation on a grid
Trick 2: discretize the charge density: map continuous charge positions onto a discrete mesh and use the FFT.
SPME: approximate e^{ikx} by cardinal B-splines M_P of order P (= pme_order), for even P (x: continuous particle coordinate):

e^{ikx} \approx b(k) \sum_{l \in \mathbb{Z}} M_P(x - lh)\, e^{iklh}    (7)

Insert (7) into (5) and derive

V_{rec} \approx \frac{1}{2} h^3 \sum_{\vec{r}_p \in M} \rho_M(\vec{r}_p)\, [\rho_M \ast G](\vec{r}_p)    (8)

\rho_M \ast G = \mathrm{FFT}^{-1}[\, \mathrm{FFT}(\rho_M) \cdot \mathrm{FFT}(G) \,]    (9)
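
Charge spreading with cardinal B-splines is the part of SPME that stays local. A self-contained 1D sketch (not GROMACS code; Mn and spread_charge_1d are made-up names) of assigning one charge to P grid points with order-P B-spline weights:

#include <math.h>
#include <stdio.h>

/* Cardinal B-spline M_n(u), nonzero on (0, n), via the standard recursion. */
static double Mn(int n, double u)
{
    if (u <= 0.0 || u >= n)
        return 0.0;
    if (n == 2)
        return 1.0 - fabs(u - 1.0);
    return (u / (n - 1)) * Mn(n - 1, u) + ((n - u) / (n - 1)) * Mn(n - 1, u - 1.0);
}

/* Spread one charge q at scaled coordinate u = x/h onto a periodic 1D grid
 * of K points using an order-P B-spline; the P weights sum to 1. */
static void spread_charge_1d(double *grid, int K, int P, double q, double u)
{
    int k0 = (int)floor(u);
    for (int p = 0; p < P; p++)
    {
        int    k = k0 - p;                 /* grid point index              */
        double w = Mn(P, u - k);           /* B-spline weight, arg in (0,P) */
        grid[((k % K) + K) % K] += q * w;  /* periodic wrap                 */
    }
}

int main(void)
{
    enum { K = 16 };
    double grid[K] = { 0.0 };
    spread_charge_1d(grid, K, 6, 1.0, 7.3);   /* order-6 spline, u = 7.3 */
    for (int k = 0; k < K; k++)
        printf("%2d %.4f\n", k, grid[k]);
    return 0;
}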

8 Reduce communication
Spline interpolation is local; the FFT is global. Reducing the charge mesh size therefore reduces communication; the force error is kept constant by enlarging the interpolation order.
Test system: Aquaporin-1, 80 000 atoms, a protein (tetramer) embedded in a lipid bilayer membrane surrounded by water.

9 A measure of accuracy
Just consider absolute force values:
- Exact (absolute) force on particle i: F_i^{exa}, approximated numerically by F_i
- Difference in absolute force: |F_i - F_i^{exa}|
- Mean force deviation: FD_{mean} = \frac{1}{N} \sum_{i=1}^{N} |F_i - F_i^{exa}|
- Relative force deviation: FD_{rel} = \frac{\sum_{i=1}^{N} |F_i - F_i^{exa}|}{\sum_{i=1}^{N} F_i^{exa}}
The F_i are taken from one time step for 80k particles out of traj.trr; the reference F_i^{exa} comes from a calculation at a very fine mesh.
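
A plain C illustration of these two error measures (not taken from the slides), assuming hypothetical arrays f[] and f_exa[] of absolute force values:

#include <math.h>

/* Sketch: compute FD_mean and FD_rel from absolute force values f[N]
 * (test setting) and f_exa[N] (fine-mesh reference). */
void force_deviation(int N, const double *f, const double *f_exa,
                     double *fd_mean, double *fd_rel)
{
    double sum_dev = 0.0, sum_exa = 0.0;
    for (int i = 0; i < N; i++)
    {
        sum_dev += fabs(f[i] - f_exa[i]);
        sum_exa += f_exa[i];
    }
    *fd_mean = sum_dev / N;        /* mean force deviation     */
    *fd_rel  = sum_dev / sum_exa;  /* relative force deviation */
}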

10 Parameter combinations at the same error level
Characteristics of the test system: maximum force F_max = 5102 kJ/(mol nm), mean force F_mean = 868 kJ/(mol nm).

fourierspacing   grid size      PME order   FD_mean [kJ/(mol nm)]   FD_rel    FD_max [kJ/(mol nm)]
0.030            360x350x320    10          reference
0.120            90x88x80       4           0.165                   0.00019   3.6
0.178            64x60x54       6           0.150                   0.00017   3.7
0.200            54x52x48       8           0.147                   0.00017   3.5
0.217            52x48x44       12          0.156                   0.00018   4.3

Which possibility performs best? It is a function of ncpu, CPU speed, and network speed.

11 Scaling at optimal PME settings (Dolphin)

       default             optimal PME settings
n      scaling  speedup    scaling  speedup  PME order
1      1.00*    1.00       1.00     1.00     4
2      0.86     1.72       0.92     1.84     6
4      0.34     1.36       0.54     2.16     6
8      0.13     1.04       0.29     2.32     8

12 Scaling at optimal PME settings

Orca1 (Ethernet):
       default             optimal PME settings
n      scaling  speedup    scaling  speedup  PME order
2      1.00*    1.00       1.01     1.01     6
4      0.57     1.14       0.75     1.50     6
8                          0.42     1.68     8

Orca2 (Myrinet):
       default             optimal PME settings
n      scaling  speedup    scaling  speedup  PME order
2      1.00*    1.00       1.10     1.10     6
4      0.73     1.46       0.97     1.94     6
6      0.54     1.62       0.79     2.37     6
8                          0.75     3.00     6

13 Scaling at optimal PME settings (IBM p690)

       default             optimal PME settings
n      scaling  speedup    scaling  speedup  PME order  grid size
1      1.00     1.00       1.00     1.00     4          90x88x80
2      0.96     1.92       0.96     1.92     4          90x88x80
4      0.89     3.56       0.89     3.56     4          96x88x80
8      0.76     6.08       0.77     6.16     6          64x64x60
9                          0.70     6.30     8          54x54x48
16     0.47     7.53       0.61     9.76     6          64x64x60
18                         0.61     11.0     8          54x54x48
27                         0.53     14.3+    8          54x54x48
32                         0.36     11.5     8          64x64x48
32     0.31     10.0       0.37     11.8     6          64x64x60

+ i.e. 1.5x the performance of 8 Orca2 CPUs

What does GROMACS do all the time? Detailed analysis of time step

14 Installation of MPE logging
MPE provides automatic logging of MPI calls; manual MPE logging works by defining events:

#include <mpi.h>
#include <mpe.h>
...
MPI_Init( &argc, &argv );
...
/* ev1..ev4 are integer event numbers; each start/end pair defines a state */
MPE_Describe_state( ev1, ev2, "doing PME", "grey" );
MPE_Describe_state( ev3, ev4, "whatever", "orange" );
...
MPE_Log_event( ev1, 0, "" );   /* state "doing PME" starts here */
<code fragment to be logged>
MPE_Log_event( ev2, 0, "" );   /* state "doing PME" ends here   */
...
MPI_Finalize( );
...

15 Calculation of the non-bonded forces

LOOP OVER TIME STEPS ...
  do_force
    do_fnbf                calculate V_vdw and the V_dir part of Coulomb
    do_pme                 calculate the V_rec part of Coulomb
      spread_on_grid       spread home-atom charges on the full grid
      sum_qgrid            sum contributions to the local (z-slice) grid from other CPUs (n x MPI_Reduce)
      gmxfft3d             \hat{\rho}_M = FFT(\rho_M) (n slices in z)
      solve_pme
      gmxfft3d             \rho_M \ast G = FFT^{-1}[\hat{\rho}_M \cdot \hat{G}]
      sum_qgrid            distribute the local (= z-slice) grid to all nodes
      gather_f_bsplines    get forces on home atoms, F_i = -\nabla_{r_i} V
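
One way to obtain per-stage timings like those shown on the following slides is to wrap the PME stages of this call tree in MPE states, following the event-pair pattern of slide 14. A sketch under that assumption (the event variables and the commented-out calls are placeholders, not GROMACS code):

#include <mpi.h>
#include <mpe.h>

static int ev_spread_a, ev_spread_b, ev_fft_a, ev_fft_b;

/* Define one MPE state per PME stage to be timed. */
void init_pme_logging(void)
{
    ev_spread_a = MPE_Log_get_event_number();
    ev_spread_b = MPE_Log_get_event_number();
    ev_fft_a    = MPE_Log_get_event_number();
    ev_fft_b    = MPE_Log_get_event_number();
    MPE_Describe_state(ev_spread_a, ev_spread_b, "spread_on_grid", "green");
    MPE_Describe_state(ev_fft_a,    ev_fft_b,    "gmxfft3d",       "blue");
}

/* Bracket each stage of the PME step with its start/end events. */
void do_pme_logged(void)
{
    MPE_Log_event(ev_spread_a, 0, "");
    /* spread_on_grid(...);  placeholder for the real call */
    MPE_Log_event(ev_spread_b, 0, "");

    MPE_Log_event(ev_fft_a, 0, "");
    /* gmxfft3d(...);        placeholder for the real call */
    MPE_Log_event(ev_fft_b, 0, "");
}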

16 Detailed analysis - ncpu 1, 2, 4, 6

17 Detailed analysis - PME order 4, 6, 8

18 Shuffle and Sort

19 Changing MPI routines in sum_qgrid
Timing results at n = 4, pme_order 6:

Operation                time (s) (Dolphin)   time (s) (Orca1)
n x MPI_Reduce           0.149                0.021
1 x MPI_Reduce_scatter   0.142                0.022
1 x MPI_Alltoall + sum   0.089                0.016

Time step length: 0.99 s (Dolphin), 0.34 s (Orca1).
Use of MPI_Alltoall enhances the 4-CPU scaling from 0.54 to 0.60 on Dolphin and from 0.81 to 0.82 on Orca1.
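
A rough sketch (not GROMACS code) of the idea behind the third variant: each rank sends every other rank its contribution to that rank's z-slice in a single MPI_Alltoall and then sums locally, instead of issuing n successive MPI_Reduce calls. Buffer layout and slice size are simplified placeholders:

#include <mpi.h>

/* Sketch: sum the charge grid with one MPI_Alltoall plus a local sum.
 * Each rank owns one z-slice of "slice" grid points; sendbuf holds this
 * rank's contribution to every rank's slice, laid out rank by rank. */
void sum_qgrid_alltoall(double *sendbuf, double *recvbuf,
                        double *local_slice, int slice, MPI_Comm comm)
{
    int nranks;
    MPI_Comm_size(comm, &nranks);

    /* one slice-sized block to/from every rank */
    MPI_Alltoall(sendbuf, slice, MPI_DOUBLE,
                 recvbuf, slice, MPI_DOUBLE, comm);

    /* accumulate the nranks received contributions into the local slice */
    for (int r = 0; r < nranks; r++)
        for (int i = 0; i < slice; i++)
            local_slice[i] += recvbuf[r * slice + i];
}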

20 Problem: Communication delays for ncpu 6

21 Summary
For ncpu > 1, replace
    pme_order = 4
    fourierspacing = 0.120
by
    pme_order = 6
    fourierspacing = 0.178
in your .mdp file.
Orca/Ethernet (4 CPUs): switching to PME order 6 raises the scaling from 57% to 75%.
Orca/Myrinet (2 to 8 CPUs): speedup = 3.

22 What next
1. Replace n MPI_Bcast calls by 1 MPI_Alltoall for a further enhancement of the 4-CPU scaling.
2. Find the cause of the communication delays by monitoring the network traffic.
3. Overlap the V_rec communication with the V_dir calculation.