Efficient Molecular Dynamics on Heterogeneous Architectures in GROMACS

1 Efficient Molecular Dynamics on Heterogeneous Architectures in GROMACS Berk Hess, Szilárd Páll KTH Royal Institute of Technology GTC 2012

2 GROMACS: fast, scalable, free
- Classical molecular dynamics package
- Main developers: Stockholm, Sweden & world-wide
- Open source: GPL
- User base: thousands worldwide, both academic & private; hundreds of thousands through (300k active CPUs, Apr. 2012)
- Force-fields: AMBER, CHARMM, OPLS, GROMOS
- Coarse-grained simulations
- Strong focus on optimized algorithms and efficient code
- Philosophy: do more with less; scaling != absolute performance

3 Target application areas
- Membrane protein: 200k atoms
- Water droplet on substrate: 1.5 million atoms
- Cellulose + lignocellulose + water: 5 million atoms

4 GROMACS acceleration
GROMACS 4.5:
- highly optimized non-bonded SSE assembly kernels
- single-GPU acceleration using OpenMM
Wall-time per iteration as low as 1 ms
GROMACS 4.6:
- SSE/AVX intrinsics in all compute-intensive code
- GPU acceleration: hard to beat the CPU; re-implementing everything is not an option

5 What/how to accelerate with GPUs?
Design principles:
- support all features (arbitrary unit cells, parallel constraints, virtual interaction sites)
- maximize both CPU and GPU utilization
- develop future-proof algorithms
Offload non-bonded force calculation to GPUs:
- strategy successfully used by other packages
- our challenges: a fast code is hard to accelerate; at a sub-millisecond iteration rate, latencies hurt more

6 Hybrid parallelization
- Each domain maps to an MPI process
- OpenMP within a process
- Single GPU per process
Automated multi-level load balancing:
- inter-process: dynamic domain resizing
- intra-process: automated CPU-GPU work shifting

7 Non-bonded cluster pair-list
- Standard cell grid (x,y,z gridding): spatially uniform
- Atom clusters (x,y gridding, z sorting, z binning): #atoms uniform
- Cluster pair-list generation

8 Non-bonded algorithm
- CPU: SSE (AVX); cluster: 4 atoms; work unit: 4x4 pair-forces
- GPU: CUDA; cluster: 8 atoms; work unit: 8x8 pair-forces (2 warps)
- Optimize for caching: super-clusters with 8 clusters each

9 Heterogeneous scheme: data & control flow
[Figure: timeline of one MD iteration; pair-search step every iterations. CPU (OpenMP threads): pair search, bonded F, PME, wait for GPU, integration and constraints. GPU (CUDA): non-bonded F & pair-list pruning, otherwise idle. Transfers to the GPU: pair-list, coordinates, charges; transfers to the CPU: forces (energies).]
Avg. CPU/GPU overlap: 60-80% per iteration

10 GPU non-bonded kernel
sci = i-supercluster index = block index
for each cj cluster (loop over all neighbors of any ci in sci)
    load i cluster interaction and exclusion mask
    if cj not masked
        load j atom data
        for each ci cluster in sci (loop over the 8 i-clusters)
            load i atom data
            r2 = |xj - xi|^2
            load exclusion mask (one per warp)
            extract exclusion bit for i-j atom-pair: excl_bit
            if (r2 < rcoulomb_squared * excl_bit)
                calculate i-j coulomb and LJ forces
                accumulate i- and j-forces in registers
        store per-thread j-forces in shmem
        reduce j-forces
reduce i-forces
Launch configuration:
- grid: #i-superclusters x 1 (one supercluster/block)
- block: 8x8x1, 64 threads
- shared mem: Fermi: 768 bytes (reduction); Kepler: 0 bytes
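The cut-off test in the inner loop multiplies the squared cut-off by the exclusion bit, so an excluded pair (bit 0) can never pass the test and needs no separate branch. A minimal sketch of that idea in plain Python rather than CUDA, with Coulomb forces only; the function name, the bit layout of `excl_mask` and the unit-free force expression are assumptions made for illustration, not the GROMACS kernel:

```python
def cluster_pair_forces(xi, xj, qi, qj, excl_mask, rcoulomb_sq):
    """Coulomb forces between one i-cluster and one j-cluster.

    xi, xj: lists of (x, y, z) tuples; qi, qj: lists of charges.
    excl_mask: integer whose bit (i * len(xj) + j) is 1 when the pair i-j
    interacts and 0 when it is excluded (hypothetical layout).
    Pairs at zero distance are assumed to be excluded via the mask.
    """
    nj = len(xj)
    fi = [[0.0, 0.0, 0.0] for _ in xi]
    fj = [[0.0, 0.0, 0.0] for _ in xj]
    for i, (pi, q_i) in enumerate(zip(xi, qi)):
        for j, (pj, q_j) in enumerate(zip(xj, qj)):
            d = [a - b for a, b in zip(pi, pj)]
            r2 = sum(c * c for c in d)
            excl_bit = (excl_mask >> (i * nj + j)) & 1
            # r2 >= 0, so the test always fails when excl_bit == 0
            if r2 < rcoulomb_sq * excl_bit:
                scale = q_i * q_j / (r2 * r2 ** 0.5)  # q_i * q_j / r^3
                for k in range(3):
                    fi[i][k] += scale * d[k]  # force on atom i
                    fj[j][k] -= scale * d[k]  # equal and opposite on atom j
    return fi, fj
```

With one i-atom at the origin, j-atoms at (1, 0, 0) and (0, 2, 0), and a mask that enables only the first pair, the second j-atom receives no force even though it is inside the cut-off.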

11 Pruning
- All vs all atom distance check is expensive
- Pair-list is built with a cluster bounding-box distance check on the CPU
- Atom-pair distances are calculated on the GPU anyway
Solution: prune using warp vote (Fermi+)
- any(r2 < rlist_sq) == false if no pairs are within range
- 10-25% overhead, but need to prune only every pair-search step!
- prunes 25-35% of the atom-pairs
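To see why pruning pays off, note that a bounding-box check can accept a cluster pair in which no single atom pair is within range. The sketch below, in plain Python, contrasts the two checks; `bbox`, `bbox_dist_sq` and `prune_j_clusters` are hypothetical names, and the built-in `any()` stands in for the warp vote:

```python
def bbox(atoms):
    """Axis-aligned bounding box (lower corner, upper corner) of a cluster."""
    lo = tuple(min(a[k] for a in atoms) for k in range(3))
    hi = tuple(max(a[k] for a in atoms) for k in range(3))
    return lo, hi


def bbox_dist_sq(b1, b2):
    """Squared distance between two boxes; a lower bound on atom distances."""
    d2 = 0.0
    for k in range(3):
        gap = max(b1[0][k] - b2[1][k], b2[0][k] - b1[1][k], 0.0)
        d2 += gap * gap
    return d2


def prune_j_clusters(ci_atoms, cj_clusters, rlist_sq):
    """Keep the j-clusters that have at least one atom within rlist of ci.

    cj_clusters: list of (cj_index, atoms) pairs.
    """
    def dist_sq(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    kept = []
    for cj_index, cj_atoms in cj_clusters:
        # Counterpart of the warp vote: true if any atom pair is in range
        if any(dist_sq(a, b) < rlist_sq for a in ci_atoms for b in cj_atoms):
            kept.append(cj_index)
    return kept
```

For example, an i-cluster with atoms at (0, 0, 0) and (1, 1, 0) and a j-cluster with atoms at (1.1, 2, 0) and (2, 1.1, 0) pass the bounding-box check for a squared list cut-off of 1.0, yet every atom pair is out of range, so the exact check prunes the j-cluster.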

12 GPU non-bonded kernel (with pair-list pruning)
sci = i-supercluster index = block index
for each cj cluster (loop over all neighbors of any ci in sci)
    load i cluster interaction and exclusion mask
    if cj not masked
        load j atom data
        for each ci cluster in sci (loop over the 8 i-clusters)
            load i atom data
            r2 = |xj - xi|^2
            if !any(r2 < rlist_squared)
                prune cj from current ci
            load exclusion mask (one per warp)
            extract exclusion bit for i-j atom-pair: excl_bit
            if (r2 < rcoulomb_squared * excl_bit)
                calculate i-j coulomb and LJ forces
                accumulate i- and j-forces in registers
        store per-thread j-forces in shmem
        reduce j-forces
        store pruned i-mask
reduce i-forces
Launch configuration:
- grid: #i-superclusters x 1 (one supercluster/block)
- block: 8x8x1, 64 threads
- shared mem: Fermi: 768 bytes (reduction); Kepler: 0 bytes

13 Kernel characteristics
- Full warp skips, branch-free execution
- Kernel emphasizes data reuse, relies heavily on caching: 95% L1 hit rate, ~75 flops/byte; compute bound, ECC agnostic
- ~39 flops, ~150 ops total/inner loop; many iops (Kepler concern!)
- 15.5 warps/cycle in flight
- IPC: on Fermi
- Force accumulation requires lots of shmem/registers (limiting):
  - i-force: 8 x 64 x 12 bytes shmem or 8 x 12 bytes reg+shmem
  - j-force: (only) 64 x 12 bytes shmem

14 GPU non-bonded kernel work-efficiency
[Figure: work-efficiency of Verlet, cluster-pair pruned and cluster-pair unpruned pair lists for rc=1.5, rc=1.2 and rc=0.9 (rl=1.0).]
- Work-efficiency = number of non-zero forces calculated
- Number of 0-s calculated is only 40-60%
- Pruning improves by 1.6-2x

15 Pair force calculation rate
[Figure: pair force calculation rate (Mpairs/s) on a GeForce GTX 580 (PME CUDA, PME non-zero CUDA) and a Core i GHz CPU (PME SSE, PME AVX, PME non-zero SSE, PME non-zero AVX) for rc=0.9 and rc=1.2 at several nstlist values.]
rc: cut-off, nstlist: pair-list update interval
2x-5x faster on GPUs
Nonbonded force evaluation:
- GeForce GTX 580: effective: Gpairs/s, useful: Gpairs/s
- Core i GHz (SSE4.1+AVX): effective: ~1.4 Gpairs/s, useful: Gpairs/s

16 Single-node weak scaling
[Figure: two panels, PME and reaction-field. Iteration time per 1000 atoms (ms/step) vs. system size/GPU (1000s of atoms). Curves: 1xC2075 CUDA F kernel, xC2075 CPU total, 2xC2075 CPU total, 4xC2075 CPU total. Annotation: limit to strong scaling.]
Systems: water boxes million atoms
Settings: electrostatic cut-off 0.9 nm with PME (auto-tuned), 0.9 nm with reaction-field, LJ cut-off 0.9 nm, 2 fs time step
Hardware: workstation with 2x Intel Xeon X5650 (6C), 4x NVIDIA Tesla C2075

17 Absolute performance & speedup
[Figure: performance (ns/day) of PME CUDA- vs SSE-accelerated non-bonded kernels for four setups (cubic box; cubic box, NPT; dodec box + vsites; dodec box) on configurations 3C, 3C+C2075, 6C, 6C+2xC2075 and upward to 12C+4xC2075.]
System: RNase in water: atoms cubic box, atoms in dodecahedron box
Settings: elec. cut-off tuned 0.9 nm, LJ cut-off 0.9 nm, 2 fs and 5 fs (with vsites)
Hardware: 2x Xeon X5650 (6C), 4x Tesla C2075

18 Strong scaling on GPU clusters
[Figure: three panels of performance (ns/day) vs. #GPUs: ADH solvated protein, 134k atoms; box of water, 1.5M atoms; cellulose + lignin + water, 23M atoms. Curves: RF, RF linear scaling, PME, PME linear scaling. Annotation: bonded F imbalance + kernel atoms/GPU.]
Hardware: BSC Bullx cluster, 2x Intel Xeon E5649 (6C), 2x NVIDIA Tesla M2090, 2x QDR Infiniband 40 Gb/s
Settings: cut-off: elec. 0.9 nm with PME (tuned), 0.9 nm with reaction-field; LJ 0.9 nm; 2 fs time-step
Hardware: Cray XK6, Jaguar GPU partition, 480 nodes
Settings: reaction-field, cut-off 1.2 nm, 2 fs time-step
Courtesy: Roland Schulz, ORNL-CMB

19 Kepler outlook
Current performance: GTX 680 GTX %
Concerns:
- integer throughput
- nvcc 4.2 kernels slower than 4.0 even on Kepler; worked around it, but required ninja moves
- unrolling can result in spilling
shfl-based reduction is not only elegant:
- no in-shared memory accumulation/needed
- shfl reduction + sm_30 = shmem reduction + sm_ %
Dual-chip boards: GTX 690/equivalent Tesla + PCI-E 3.0: close to having 2x GTX 680 in PCI-E 2.0
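The general shuffle-reduction pattern can be illustrated without a GPU. The following plain-Python sketch simulates a warp in which every lane adds the value held by the lane `offset` positions above it, halving `offset` each round, so the sum ends up in lane 0 without any shared-memory buffer. It mimics the documented behaviour of CUDA's `__shfl_down`, where a lane whose source is out of range gets its own value back; the function name is hypothetical and this is not the GROMACS reduction code:

```python
def warp_reduce_sum(lane_values):
    """Simulate a shuffle-down tree reduction over the lanes of one warp."""
    vals = list(lane_values)
    width = len(vals)
    assert width > 0 and (width & (width - 1)) == 0, "width must be a power of two"
    offset = width // 2
    while offset > 0:
        # All lanes exchange simultaneously, so build the new values from
        # the old list; out-of-range sources return the lane's own value.
        vals = [
            v + (vals[lane + offset] if lane + offset < width else v)
            for lane, v in enumerate(vals)
        ]
        offset //= 2
    return vals[0]  # only lane 0 is guaranteed to hold the full sum
```

A 32-lane warp needs five rounds (offsets 16, 8, 4, 2, 1); the values left in the other lanes are partial sums and should be ignored.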

20 Future directions
- Accelerating dihedral force calculations on GPU: improve CPU-GPU load-balance, better scaling
- Further workload balancing/regularization: improve scaling to small systems, better strong scaling
- Mont Blanc: Tegra 3 + GeForce 520M; 38 ~5 KW; 7.5 GFlops/W, 3.5x better than BG/Q

21 Acknowledgements
- Developers: Roland Schulz, Erik Lindahl, Sander Pronk
- The GROMACS community
- NVIDIA: Gernot Ziegler, engineering team
- Hardware / support
- Funding
We are looking for computer scientists/engineers to join our team!

22 Extra material

23 Atom-cluster pair algorithm
- Cluster: algorithmic work unit, 4 with SSE, 8 with CUDA
- Super-cluster: hierarchical grouping for cache-efficiency on GPUs
- Flexible cluster size: adjust to match the architecture's SIMD width

Super-cluster definition and particle sorting:
set x/y super-cluster size (s.t. super-clusters will be approx. cubic)
for each p < np
    sx = x[p]/size_x
    sy = y[p]/size_y
    scluster_xy[p] = sx*nsy + sy    (column of super-clusters for given x,y)
for each scluster_xy
    sort p on z
    add dummy particles to get to Ns_xy*64 particles    (now we have Ns_xy scluster in this column)
    for each scluster with this scluster_xy
        for upper and lower half
            sort p on y
            for upper and lower half
                sort p on x
                for upper and lower half
                    define a cluster for the 8 particles we have here

Pair search and cluster generation:
for each si in scluster
    for each sj in scluster in range of si
        for each cj cluster in sj
            for each ci cluster in si
                if (ci in range of cj)
                    add ci,cj to pair-list of si,sj to create interaction
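The first stages of this sorting can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: it bins particles into x,y columns, sorts each column on z, pads with dummies and cuts fixed-size clusters, but it omits the recursive y and x sub-sorting that shapes the super-clusters. The function name, the single `cell_size` for both x and y, and the use of `None` as a dummy particle are hypothetical choices:

```python
from collections import defaultdict


def build_clusters(positions, cell_size, cluster_size=8):
    """Group particle indices into clusters with a uniform atom count.

    positions: list of (x, y, z) tuples.
    Returns a list of clusters, each a list of particle indices padded
    with None (dummy particles) to exactly cluster_size entries.
    """
    # x,y gridding: assign every particle to a column of the grid
    columns = defaultdict(list)
    for p, (x, y, _z) in enumerate(positions):
        columns[(int(x // cell_size), int(y // cell_size))].append(p)

    clusters = []
    for key in sorted(columns):
        # z sorting within the column
        members = sorted(columns[key], key=lambda p: positions[p][2])
        # pad with dummies so the column holds a whole number of clusters
        members += [None] * (-len(members) % cluster_size)
        # z binning: consecutive runs of cluster_size particles
        for start in range(0, len(members), cluster_size):
            clusters.append(members[start:start + cluster_size])
    return clusters
```

Because every cluster holds the same number of entries, each cluster pair maps to a fixed-size work unit (4x4 or 8x8 pair-forces); the cost is that dummy particles add some zero-force work.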

24 GPU non-bonded kernel in/out
In:
- Simulation constants: C6/C12 params & tabulated Ewald coulomb force: texture memory (fully cached)
- Coordinates + charges: updated every iteration
- Pair-list: updated every iterations
  - list of j-clusters: grouped 4-by-4 for coalesced loading
  - list of i-superclusters (8 i-clusters each) + reference to all j-clusters in range
  - interaction bit-masks encoding: i-j cluster in-range relationship (updated at pruning), exclusions
Out (calculate only what/when needed):
- forces: every iteration
- energies: typically every iterations
- pruned pair-list: at pair-search (kept on GPU)
12 kernels: 4 per output type, 3 per electrostatics type

25 Data & control flow: parallel case
[Figure: timeline of one MD iteration with domain decomposition; pair-search step every iterations. CPU: local pair search, non-local pair search, MPI receive non-local x, bonded F, PME, wait for non-local F, MPI send non-local F, wait for local F, integration, constraints. GPU local stream: transfer local pair-list, transfer local x,q, local non-bonded F + pair-list pruning, transfer local F, idle. GPU non-local stream: transfer non-local pair-list, transfer non-local x,q, non-local non-bonded F + pair-list pruning, transfer non-local F, idle.]

26 Load balancing on GPU: balanced pair lists
Pair-list splitting balances workload:
- improves SM load balance
- improves scaling performance with small inputs
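One way to picture pair-list splitting: an i-supercluster with a very long j-cluster list is cut into several shorter list entries, so that no single thread block gets a disproportionate share of the work. The sketch below is an assumption-laden illustration in Python; the function name, the tuple layout and the fixed `max_len` threshold are hypothetical and not taken from GROMACS:

```python
def split_pair_lists(pair_lists, max_len):
    """Split long j-cluster lists into chunks of at most max_len entries.

    pair_lists: list of (sci, cj_list) tuples, one per i-supercluster.
    Returns a new list in which an sci may appear several times, each
    time with a different slice of its original j-cluster list.
    """
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    balanced = []
    for sci, cj_list in pair_lists:
        for start in range(0, len(cj_list), max_len):
            balanced.append((sci, cj_list[start:start + max_len]))
    return balanced
```

After splitting, several work items contribute to the forces of the same i-supercluster, so their partial i-forces have to be summed when they are written out.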

27 CPU-GPU load balancing
[Figure: PME tuning on Opteron C + Tesla M; performance (ns/day) vs. #cores, without tuning and with tuning.]
Settings: cut-off: electrostatic 0.9 nm with tuning, 0.9 nm without; LJ 0.9 nm; pair-list update every 20 steps; 2 fs time-step
