Optimizing GROMACS for parallel performance

Outline
1. Why optimize? Performance status quo
2. GROMACS as a black box (PME)
3. How does GROMACS spend its time? (MPE)
4. What you can do / What I want to do next
Why optimize? The performance status quo

GROMACS: high single-CPU performance compared to AMBER, CHARMM, GROMOS96 (Lindahl et al. 2001).
How much faster does it get on multiple CPUs?
Why optimize? The performance status quo

For n processors, define the scaling (efficiency) E and the speedup S:
\[
  E = \frac{T_1}{n\,T_n}, \qquad S = \frac{T_1}{T_n}
\]

Run times T_n (±0.5%) on Orca1:

  n   T_n (s)   scaling E   speedup S
  1    878        1.00        1.00
  2    456        0.96        1.93      good
  4    372        0.59        2.36      acceptable
  8    652        0.17        1.35      waste of resources
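For example, for n = 4: $S = T_1/T_4 = 878\,\mathrm{s}/372\,\mathrm{s} \approx 2.36$ and $E = S/4 \approx 0.59$.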
What can be done?
Potential for optimizations

GROMACS as a black box:
- LAM / MPICH parameters: communication module (TCP, usysv, VIA), size of rpi_tcp_short, ...
- GROMACS parameters: shuffle & sort, optimize_fft, fourierspacing & pme_order

Inside the box:
- Local optimizations: MPI realization of a given task
- Restructure the communication scheme
Why PME is used

Van der Waals: OK to use a cut-off radius of 1.0-1.2 nm.
Coulomb: a cut-off causes unphysical artefacts in structure and dynamics.
90% of the run time is spent calculating the non-bonded electrostatic forces.
Particle Mesh Ewald = Mesh up the Ewald sum!

N particles with charges q_i at positions \vec r_i in a neutral cubic box of length L, periodic boundary conditions.

Electrostatic energy (conditionally convergent, S-L-O-W):
\[
  V = \frac{1}{2} \sum_{i,j=1}^{N} \sum_{\vec n \in \mathbb{Z}^3}
      \frac{q_i q_j}{|\vec r_{ij} + \vec n L|}
  \tag{1}
\]

Trick 1: Ewald summation. Split eq. (1) using
\[
  \frac{1}{r} = \frac{f(r)}{r} + \frac{1 - f(r)}{r}
  \tag{2}
\]

The Coulomb term varies rapidly at small r and decays slowly at large r:
- f(r)/r → 0 beyond some cutoff r_max, so the direct-space sum needs only a few terms
- (1 - f(r))/r is a slowly varying function of r, so its Fourier transform needs only a few \vec k
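With the choice f(r) = erfc(αr) used on the next slide, the split (2) reads
\[
  \frac{1}{r} = \frac{\operatorname{erfc}(\alpha r)}{r} + \frac{\operatorname{erf}(\alpha r)}{r}
\]
since erf(αr) + erfc(αr) = 1.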
Ewald formula for f(r) = erfc(αr)

Ewald parameter α: relative weight of V_dir to V_rec.
\[
  V = V_{dir} + V_{rec} + V_0
  \tag{3}
\]
\[
  V_{dir} = \frac{1}{2} \sum_{i,j} \sum_{\vec m \in \mathbb{Z}^3} q_i q_j\,
            \frac{\operatorname{erfc}(\alpha\,|\vec r_{ij} + \vec m L|)}{|\vec r_{ij} + \vec m L|}
  \tag{4}
\]
\[
  V_{rec} = \frac{1}{2L^3} \sum_{\vec k \neq 0} \frac{4\pi}{k^2}\,
            e^{-k^2/4\alpha^2}\, |\tilde\rho(\vec k)|^2,
  \qquad \vec k \in \{2\pi \vec n / L : \vec n \in \mathbb{Z}^3\}
  \tag{5}
\]
\[
  V_0 = -\frac{\alpha}{\sqrt{\pi}} \sum_i q_i^2
  \tag{6}
\]
with the Fourier-transformed charge density \tilde\rho(\vec k) = \sum_{j=1}^{N} q_j\, e^{i \vec k \cdot \vec r_j}.

The exponentially converging sums over \vec m and \vec k in eqs. (4, 5) allow the introduction of cutoffs.
The forces are finally obtained from F_i = -\nabla_{\vec r_i} V.
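For illustration (a standard textbook form, not taken from the slide): differentiating the \vec m = 0 term of eq. (4) gives the direct-space pair forces
\[
  \vec F_i^{\,dir} = q_i \sum_{j \neq i} q_j
      \left[ \operatorname{erfc}(\alpha r_{ij})
      + \frac{2\alpha r_{ij}}{\sqrt{\pi}}\, e^{-\alpha^2 r_{ij}^2} \right]
      \frac{\vec r_{ij}}{r_{ij}^{3}}
\]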
Ewald summation on a grid

Trick 2: discretize the charge density: continuous charge positions → discrete mesh, use the FFT.

SPME: approximate e^{ikx} by cardinal B-splines M_P of order P (= pme_order), P even (x: continuous particle coordinate):
\[
  e^{ikx} \approx b(k) \sum_{l \in \mathbb{Z}} M_P(x - lh)\, e^{iklh}
  \tag{7}
\]
Inserting (7) into (5) yields
\[
  V_{rec} \approx \frac{1}{2}\, h^3 \sum_{\vec r_p \in M}
      \rho_M(\vec r_p)\, [\rho_M \star G](\vec r_p)
  \tag{8}
\]
\[
  \rho_M \star G = \mathrm{FFT}^{-1}\bigl[\, \mathrm{FFT}(\rho_M) \cdot \mathrm{FFT}(G) \,\bigr]
  \tag{9}
\]
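As a rough sketch of the convolution step (9), here is what it looks like with FFTW3 instead of GROMACS's own gmxfft3d wrapper, assuming a cubic K x K x K complex charge grid rho and a precomputed reciprocal-space influence function Ghat (both names are illustrative):

  /* complex.h before fftw3.h so that fftw_complex is the C99 complex type */
  #include <complex.h>
  #include <fftw3.h>

  void convolve_with_G(int K, fftw_complex *rho, const double *Ghat)
  {
      /* FFTW_ESTIMATE: plan creation does not overwrite the input data */
      fftw_plan fwd = fftw_plan_dft_3d(K, K, K, rho, rho,
                                       FFTW_FORWARD,  FFTW_ESTIMATE);
      fftw_plan bwd = fftw_plan_dft_3d(K, K, K, rho, rho,
                                       FFTW_BACKWARD, FFTW_ESTIMATE);

      fftw_execute(fwd);                         /* FFT(rho_M)                */
      for (long i = 0; i < (long)K * K * K; i++)
          rho[i] *= Ghat[i];                     /* multiply by FFT(G)        */
      fftw_execute(bwd);                         /* back-transform: rho_M * G */
      /* FFTW's backward transform is unnormalized: divide by K^3 if needed */

      fftw_destroy_plan(fwd);
      fftw_destroy_plan(bwd);
  }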
Reduce communication

Spline interpolation is local, the FFT is global:
- reducing the charge-mesh size reduces communication
- the force error is kept constant by enlarging the interpolation order

Test system: Aquaporin-1, 80 000 atoms, protein (tetramer) embedded in a lipid bilayer membrane, surrounded by water.
A measure of accuracy

Only absolute force values are considered.

Exact (absolute) force on particle i: F_i^{exa}, approximated numerically by F_i.
Difference in absolute force: |F_i - F_i^{exa}|.

Mean force deviation:
\[
  FD_{mean} = \frac{1}{N} \sum_{i=1}^{N} \bigl| F_i - F_i^{exa} \bigr|
\]
Relative force deviation:
\[
  FD_{rel} = \frac{\sum_{i=1}^{N} \bigl| F_i - F_i^{exa} \bigr|}{\sum_{i=1}^{N} F_i^{exa}}
\]

F_i: one time step for the 80k particles, taken from traj.trr.
F_i^{exa}: reference calculation at a fine mesh.
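As a sketch, these deviation measures (plus the FD_max used on the next slide) can be computed from plain arrays of absolute force values; the helper below is illustrative, not GROMACS code:

  #include <math.h>

  /* F[i]    : numerically approximated |F_i| for one time step          */
  /* Fexa[i] : reference |F_i^exa| from the fine-mesh calculation        */
  void force_deviation(int N, const double *F, const double *Fexa,
                       double *fd_mean, double *fd_rel, double *fd_max)
  {
      double sum_dev = 0.0, sum_exa = 0.0, max_dev = 0.0;
      for (int i = 0; i < N; i++) {
          double dev = fabs(F[i] - Fexa[i]);
          sum_dev += dev;
          sum_exa += Fexa[i];
          if (dev > max_dev) max_dev = dev;
      }
      *fd_mean = sum_dev / N;        /* mean force deviation FD_mean     */
      *fd_rel  = sum_dev / sum_exa;  /* relative force deviation FD_rel  */
      *fd_max  = max_dev;            /* maximum single-particle deviation */
  }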
Parameter combinations at the same error level

Characteristics of the test system: maximum force F_max = 5102 kJ/(mol nm), mean force F_mean = 868 kJ/(mol nm).

  fourierspacing   grid size     PME order   FD_mean [kJ/(mol nm)]   FD_rel    FD_max [kJ/(mol nm)]
  0.030            360x350x320      10       reference
  0.120            90x88x80          4       0.165                   0.00019   3.6
  0.178            64x60x54          6       0.150                   0.00017   3.7
  0.200            54x52x48          8       0.147                   0.00017   3.5
  0.217            52x48x44         12       0.156                   0.00018   4.3

Which combination performs best? → f(number of CPUs, CPU speed, network speed)
Scaling at optimal PME settings

Dolphin
          default             optimal PME settings
  n    scaling  speedup     scaling  speedup   PME order
  1     1.00*    1.00        1.00     1.00         4
  2     0.86     1.72        0.92     1.84         6
  4     0.34     1.36        0.54     2.16         6
  8     0.13     1.04        0.29     2.32         8
Scaling at optimal PME settings

Orca1 (Ethernet)
          default             optimal PME settings
  n    scaling  speedup     scaling  speedup   PME order
  2     1.00*    1.00        1.01     1.01         6
  4     0.57     1.14        0.75     1.50         6
  8       -        -         0.42     1.68         8

Orca2 (Myrinet)
          default             optimal PME settings
  n    scaling  speedup     scaling  speedup   PME order
  2     1.00*    1.00        1.10     1.10         6
  4     0.73     1.46        0.97     1.94         6
  6     0.54     1.62        0.79     2.37         6
  8       -        -         0.75     3.00         6
Scaling at optimal PME settings

IBM p690
          default             optimal PME settings
  n    scaling  speedup     scaling  speedup   PME order   grid size
  1     1.00     1.00        1.00     1.00         4       90x88x80
  2     0.96     1.92        0.96     1.92         4       90x88x80
  4     0.89     3.56        0.89     3.56         4       96x88x80
  8     0.76     6.08        0.77     6.16         6       64x64x60
  9       -        -         0.70     6.30         8       54x54x48
 16     0.47     7.53        0.61     9.76         6       64x64x60
 18       -        -         0.61    11.0          8       54x54x48
 27       -        -         0.53    14.3+         8       54x54x48
 32       -        -         0.36    11.5          8       64x64x48
 32     0.31    10.0         0.37    11.8          6       64x64x60

  + i.e. 1.5 x the performance of 8 Orca2 CPUs
What does GROMACS do all the time? Detailed analysis of a time step
Installation of MPE logging

MPE: automatic logging of MPI calls; manual MPE logging by defining events:

  #include <mpi.h>
  #include <mpe.h>
  ...
  MPI_Init( &argc, &argv );
  ...
  /* obtain unique event numbers and describe the states to be logged */
  ev1 = MPE_Log_get_event_number();  ev2 = MPE_Log_get_event_number();
  ev3 = MPE_Log_get_event_number();  ev4 = MPE_Log_get_event_number();
  MPE_Describe_state( ev1, ev2, "doing PME", "grey" );
  MPE_Describe_state( ev3, ev4, "whatever", "orange" );
  ...
  MPE_Log_event( ev1, 0, NULL );     /* start of the "doing PME" state */
  <code fragment to be logged>
  MPE_Log_event( ev2, 0, NULL );     /* end of the "doing PME" state   */
  ...
  MPI_Finalize( );
  ...
Calculation of the non-bonded forces

  do_force                     LOOP OVER TIME STEPS
    ...
    force
      do_fnbf                  calculate V_vdw and the V_dir part of Coulomb
      do_pme                   calculate the V_rec part of Coulomb:
        spread_on_grid           spread home-atom charges on the full grid
        sum_qgrid                sum contributions to the local (z-slice) grid from the other CPUs (n x MPI_Reduce)
        gmxfft3d                 forward FFT of ρ_M (n slices in z)
        solve_pme                multiply by FFT(G) in reciprocal space
        gmxfft3d                 backward FFT: ρ_M ⋆ G (eq. 9)
        sum_qgrid                distribute the local (= z-slice) grid to all nodes
        gather_f_bsplines        get forces on home atoms, F_i = -∇_{r_i} V
Detailed analysis - ncpu 1, 2, 4, 6
Detailed analysis - PME order 4, 6, 8
Shuffle and Sort
Changing MPI routines in sum_qgrid

Timing results at n = 4, pme_order = 6:

  Operation                      time (s), Dolphin   time (s), Orca1
  n x MPI_Reduce                      0.149               0.021
  1 x MPI_Reduce_scatter              0.142               0.022
  1 x MPI_Alltoall + local sum        0.089               0.016

  (time step length: 0.99 s on Dolphin, 0.34 s on Orca1)

Using MPI_Alltoall enhances the 4-CPU scaling:
  0.54 → 0.60 on Dolphin
  0.81 → 0.82 on Orca1
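A minimal sketch of the three variants, assuming each rank holds a full-size contribution grid of nprocs*slice doubles and needs only the summed z-slice it owns; the names are illustrative (not the GROMACS ones), and in practice only one variant would be used:

  #include <mpi.h>

  /* grid:    this rank's contribution to the full grid (nprocs*slice doubles)
     myslice: receives the summed z-slice owned by this rank (slice doubles)
     tmp:     scratch buffer of nprocs*slice doubles
     recvcounts[p] = slice for all p                                          */
  void sum_qgrid_variants(double *grid, double *myslice, double *tmp,
                          const int *recvcounts, int slice, int nprocs,
                          MPI_Comm comm)
  {
      /* variant 1: n successive MPI_Reduce calls, one per slice owner */
      for (int root = 0; root < nprocs; root++)
          MPI_Reduce(grid + root * slice, myslice, slice,
                     MPI_DOUBLE, MPI_SUM, root, comm);

      /* variant 2: a single MPI_Reduce_scatter */
      MPI_Reduce_scatter(grid, myslice, (int *)recvcounts,
                         MPI_DOUBLE, MPI_SUM, comm);

      /* variant 3: a single MPI_Alltoall followed by a local sum */
      MPI_Alltoall(grid, slice, MPI_DOUBLE, tmp, slice, MPI_DOUBLE, comm);
      for (int i = 0; i < slice; i++) {
          myslice[i] = 0.0;
          for (int p = 0; p < nprocs; p++)
              myslice[i] += tmp[p * slice + i];
      }
  }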
Problem: Communication delays for ncpu ≥ 6
Summary

For ncpu > 1, replace

  pme_order      = 4
  fourierspacing = 0.120

by

  pme_order      = 6
  fourierspacing = 0.178

in your .mdp file.

Orca/Ethernet (4 CPUs): switching to PME order 6 raises the scaling from 57% to 75%.
Orca/Myrinet, 2 → 8 CPUs: speedup = 3.
What next

1. Replace the n MPI_Bcast calls by one MPI_Alltoall: further enhancement of the 4-CPU scaling.
2. Find the cause of the communication delays: monitor the network traffic.
3. Overlap the V_rec communication with the V_dir calculation (see the sketch below).
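A minimal sketch of idea 3 using non-blocking point-to-point MPI; the helper names are hypothetical stand-ins for the real GROMACS routines:

  #include <stdlib.h>
  #include <mpi.h>

  /* hypothetical stand-ins for the real V_dir and V_rec work */
  static void compute_direct_space_part(void)     { /* ... */ }
  static void compute_reciprocal_space_part(void) { /* ... */ }

  void pme_step_overlapped(double *sendgrid, double *recvgrid, int slice,
                           int nprocs, int myrank, MPI_Comm comm)
  {
      MPI_Request *reqs = malloc(2 * nprocs * sizeof(MPI_Request));
      int nreq = 0;

      /* start the charge-grid exchange, but do not wait for it yet */
      for (int p = 0; p < nprocs; p++) {
          if (p == myrank)
              continue;
          MPI_Irecv(recvgrid + p * slice, slice, MPI_DOUBLE, p, 0, comm,
                    &reqs[nreq++]);
          MPI_Isend(sendgrid + p * slice, slice, MPI_DOUBLE, p, 0, comm,
                    &reqs[nreq++]);
      }

      compute_direct_space_part();       /* V_dir: local work hides the latency */

      MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
      free(reqs);

      compute_reciprocal_space_part();   /* V_rec: needs the exchanged grid */
  }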