LAMMPS Performance Benchmark on VSC-1 and VSC-2

Daniel Tunega and Roland Šolc
Institute of Soil Research, University of Natural Resources and Life Sciences
VSC meeting, Neusiedl am See, February 27-28, 2012

Objectives
• LAMMPS features
• LAMMPS benchmark tests
• Parallel performance issues on VSC-1 and VSC-2
• Memory performance
• VSC-1 / VSC-2 comparison

LAMMPS features
• Large-scale Atomic/Molecular Massively Parallel Simulator
• classical molecular dynamics code for modeling the following kinds of systems:
  o atomic
  o polymeric
  o biological
  o metallic
  o granular and coarse-grained
  o hybrid and mesoscopic
• various implemented force fields
• ensembles, constraints, boundary conditions, integrators
• multi-replica models (e.g. parallel tempering)
• pre- and post-processing
• specialized features (e.g. GCMC, peridynamics)

LAMMPS features
• runs on a single processor or in parallel
• runs efficiently in parallel using MPI
• highly portable C++ code
• easily modified and extended
• runs on various platforms, including GPUs (CUDA and OpenCL)
• open-source code, distributed under the GNU General Public License (GPL)
• developed at Sandia National Laboratories
• core group: S. Plimpton, A. Thompson and P. Crozier
• http://lammps.sandia.gov/authors.html

LAMMPS in parallel
• spatial decomposition techniques partition the simulation domain into small 3D sub-domains, one of which is assigned to each processor
• for computational efficiency, LAMMPS uses neighbor lists to keep track of nearby particles
• processors communicate and store "ghost" atom information for atoms that border their sub-domain

CPU time per time step δt (loop: i = 1 to N), by timing category:
  {Comm}            communication
  {Bond}            compute bonded terms
  {Neigh}           update the neighbor list, if needed
  {Pair}            compute short- and long-range interaction terms for energy/forces
  {Other + Output}  collect forces, time integration (update positions), adjust T/p, print/write output
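The spatial decomposition idea above can be illustrated with a toy sketch. This is not LAMMPS code: the function names, the box size, the 2 x 2 x 2 processor grid and the cutoff are all hypothetical values chosen for illustration. Each atom is assigned to the sub-domain that contains it, and atoms lying within one cutoff of a sub-domain face are flagged as candidates for "ghost" atom communication.

```python
def subdomain_index(pos, box, grid):
    """Return the (ix, iy, iz) cell of the processor grid that owns `pos`."""
    idx = []
    for x, length, n in zip(pos, box, grid):
        # Wrap into the periodic box, then scale to the processor grid.
        frac = (x % length) / length
        idx.append(min(int(frac * n), n - 1))
    return tuple(idx)


def near_boundary(pos, box, grid, cutoff):
    """True if `pos` lies within `cutoff` of a face of its own sub-domain,
    i.e. a neighbouring processor would need it as a "ghost" atom."""
    for x, length, n in zip(pos, box, grid):
        width = length / n                 # sub-domain edge length
        offset = (x % length) % width      # distance from the lower face
        if offset < cutoff or width - offset < cutoff:
            return True
    return False


# Hypothetical example: cubic box, 8 processors, three atoms.
box = (40.0, 40.0, 40.0)
grid = (2, 2, 2)
atoms = [(1.0, 1.0, 1.0), (10.0, 10.0, 10.0), (21.0, 5.0, 39.0)]

owners = [subdomain_index(a, box, grid) for a in atoms]
ghosts = [near_boundary(a, box, grid, cutoff=2.5) for a in atoms]
```

In this sketch the second atom sits in the interior of its sub-domain and needs no communication, while the other two are close enough to a face that a neighbouring processor would have to store a copy. The smaller the sub-domain, the larger the fraction of such boundary atoms, which is one reason efficiency drops when few particles are left per core.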

LAMMPS on VSC-1 and VSC-2

                            VSC-1 (Intel X5550)    VSC-2 (Opteron 6132HE)
LAMMPS version              20Feb2010              27Oct2011
C++ compiler, opt. flags    icc (11.1), -O         icc (12.1.2), -O
MPI                         Openmpi-1.3.2          Openmpi-1.4.3
FFTW                        fftw-2.1.5             fftw-3.3

LAMMPS benchmark tests
• the standard LAMMPS distribution includes 5 benchmark tests
• short-range forces modeled with a cutoff distance
  o Chain: polymer chain melt (coarse-grained, FENE/LJ potential)
  o LJ: Lennard-Jones liquid (LJ potential)
  o EAM: metallic solid (EAM potential)
  o Chute: granular chute flow (granular potential)
• long-range forces
  o Rhodopsin: solvated rhodopsin protein (CHARMM)
• all five tests can be run as
  o fixed-size (default: 32000 particles, 100 time steps)
  o scaled-size: the system size increases with the number of cores

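LAMMPS benchmark tests
• our test is based on our VSC project: a molecular dynamics study of the wetting of mineral surfaces (SiO2, clays, FeOOH)
• force fields (FF): CLAYFF for minerals and SPC/E for water
[Figure: θ]

CLAYFF total energy:

$$E_{\mathrm{total}} = E_{\mathrm{Coul}} + E_{\mathrm{LJ}} + E_{\mathrm{bond\ stretch}}$$

$$E_{\mathrm{total}} = \frac{e^{2}}{4\pi\varepsilon_{0}} \sum_{i \neq j} \frac{q_{i} q_{j}}{r_{ij}} + \sum_{i \neq j} 4\varepsilon_{ij} \left[ \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{12} - \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{6} \right] + k_{1} \left( r_{ij} - r_{0} \right)^{2}$$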
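The three energy terms of the force field above can be written out for a single pair of sites as a minimal sketch. The function names are hypothetical, and the example parameters are commonly quoted SPC/E oxygen values (charge -0.8476 e, ε = 0.1553 kcal/mol, σ = 3.166 Å) used purely for illustration; they are not taken from the slides. The Coulomb prefactor is the conversion constant for energies in kcal/mol with distances in Å and charges in units of e.

```python
# e^2 / (4*pi*eps0) expressed in kcal*Angstrom / (mol * e^2)
COULOMB_CONSTANT = 332.06371


def coulomb_energy(qi, qj, r):
    """Coulomb term for one pair of point charges at distance r."""
    return COULOMB_CONSTANT * qi * qj / r


def lj_energy(epsilon, sigma, r):
    """12-6 Lennard-Jones term: 4*eps*[(sigma/r)^12 - (sigma/r)^6]."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def bond_stretch_energy(k1, r, r0):
    """Harmonic bond stretch term: k1 * (r - r0)^2."""
    return k1 * (r - r0) ** 2


# Illustrative (assumed) parameters for an oxygen-oxygen pair of SPC/E water.
Q_O = -0.8476       # e
EPS_OO = 0.1553     # kcal/mol
SIGMA_OO = 3.166    # Angstrom

r = 3.5             # Angstrom, arbitrary separation
nonbonded = coulomb_energy(Q_O, Q_O, r) + lj_energy(EPS_OO, SIGMA_OO, r)
```

Two sanity checks follow directly from the functional form: the LJ term vanishes at r = σ and reaches its minimum value of -ε at r = 2^(1/6) σ, which is also the cutoff used for the purely repulsive Chain benchmark.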

LAMMPS benchmark tests

Description
• Chain: bead-spring polymer melt of 100-mer chains
• LJ: atomic fluid
• Rhodo: rhodopsin protein in solvated lipid bilayer
• Our test: kaolinite layer with water droplet of 500 water molecules

              Chain          LJ             Rhodo            Our test
# of atoms    32000          32000          32k/64k/128k     6940
FF            FENE/LJ        LJ             CHARMM           CLAYFF/SPC/E
Cutoff        2^(1/6) σ      2.5 σ          10 Å             12 / 40 Å
Long-range    N/A            N/A            PPPM             Ewald
Ensemble      NVE            NVE            NpT              NVT
T / p         1 kBT/ε        0 kBT/ε        300 K / 1 atm    300 K
Time step     0.012 τ        0.005 τ        2 fs             1 fs
Run time      12000 τ        5000 τ         200 ps           200 ps
Output        100 steps      at the end     50 steps         50 steps
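The time step and run time rows above fix the number of integration steps in each run. A small sketch of that arithmetic follows; the helper name `n_steps` is hypothetical, and the picosecond run times are converted to femtoseconds by hand.

```python
def n_steps(run_time, time_step):
    """Number of MD steps for a run; both arguments in the same time unit."""
    return round(run_time / time_step)


# (run time, time step) per benchmark, taken from the table above.
runs = {
    "Chain": (12000.0, 0.012),     # reduced LJ time units (tau)
    "LJ": (5000.0, 0.005),         # reduced LJ time units (tau)
    "Rhodo": (200_000.0, 2.0),     # 200 ps expressed in fs
    "Our test": (200_000.0, 1.0),  # 200 ps expressed in fs
}

steps = {name: n_steps(t, dt) for name, (t, dt) in runs.items()}
```

Under this reading, the Chain and LJ runs each take one million steps, the Rhodo runs 100 000 steps and our test 200 000 steps, which is worth keeping in mind when comparing absolute run times between benchmarks.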

LAMMPS VSC-1 scalability
[Figure: scalability (0 to 250) vs. # of cores (0 to 250) on VSC-1, compared with ideal scaling. Curves: Test benchmark (6940 atoms), Chain benchmark (32000 atoms), LJ benchmark (32000 atoms), Rhodo benchmark (32000, 64000 and 128000 atoms).]

LAMMPS VSC-2 scalability
[Figure: scalability (0 to 500) vs. # of cores (0 to 500) on VSC-2, compared with ideal scaling. Curves: Test benchmark (6940 atoms), Chain benchmark (32000 atoms), LJ benchmark (32000 atoms), Rhodo benchmark (32000, 64000 and 128000 atoms).]

LAMMPS VSC-1 and VSC-2 speed-up
[Figure: run time [s] (0 to 30000) vs. # of cores (0 to 300). Curves: Rhodo 32k, 64k and 128k on VSC-1 and on VSC-2.]

LAMMPS VSC-1 and VSC-2 run time [s]
(rate = VSC-1 run time / VSC-2 run time)

           Test (6940 atoms)        Chain (32k atoms)        LJ (32k atoms)
# cores    VSC-1   VSC-2   rate     VSC-1   VSC-2   rate     VSC-1   VSC-2   rate
16         29493   65232   0.45      1365    1547   0.88      2918    3963   0.74
32         15682   32312   0.49       913     933   0.98      1668    1968   0.85
64          8232   18285   0.45       396     600   0.66      1043    1222   0.85
128         8341   17054   0.49       258     451   0.57       768     713   1.08
256         5070   10589   0.48       197     373   0.53       310     460   0.68
512            -    6688      -         -     355      -         -     348      -
Aver.:                     0.47                     0.72                     0.84

           Rhodo 32k                Rhodo 64k                Rhodo 128k
# cores    VSC-1   VSC-2   rate     VSC-1   VSC-2   rate     VSC-1   VSC-2   rate
16          3514    6141   0.57      6809   12474   0.55     13794   24869   0.55
32          1888    3530   0.53      3658    6734   0.54      7874   13642   0.58
64          1062    1979   0.54      1924    3634   0.53      3911    6948   0.56
128          615    1290   0.48      1111    2478   0.45      2019    3834   0.53
256          412     971   0.42       664    1426   0.47      1159    2307   0.50
512            -    1534      -         -    2019      -         -    2344      -
Aver.:                     0.51                     0.51                     0.54
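The "rate" columns in the run time table are consistent with the ratio of the VSC-1 run time to the VSC-2 run time, so a value below 1 means VSC-1 is faster. As a small check, the sketch below recomputes the Rhodo 32k column; the variable names are hypothetical and the numbers are copied from the table.

```python
# Run times [s] for the Rhodo 32k benchmark, copied from the table above.
CORES = [16, 32, 64, 128, 256]
VSC1 = [3514, 1888, 1062, 615, 412]
VSC2 = [6141, 3530, 1979, 1290, 971]


def rate(t_vsc1, t_vsc2):
    """Run-time ratio VSC-1 / VSC-2 (below 1: VSC-1 is faster)."""
    return t_vsc1 / t_vsc2


rates = [round(rate(a, b), 2) for a, b in zip(VSC1, VSC2)]
average = round(sum(rates) / len(rates), 2)
```

The recomputed values reproduce the tabulated rates and the tabulated average for this column.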

LAMMPS VSC-1 and VSC-2 parallel efficiency
• efficiency >80% for an optimal sub-domain size of >1000 particles
[Figure: parallel efficiency [%] (0 to 120) vs. log2(n), n = # of cores (0 to 10). Curves: Rhodo 32k and Rhodo 128k on VSC-1 and on VSC-2.]
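The link between efficiency and sub-domain size can be checked against the tabulated run times. The sketch below is a rough estimate under one loud assumption: the 16-core run is taken as the reference (100 %), because it is the smallest core count in the run time table, whereas the plotted efficiencies start at a single core. The function name is hypothetical; the run times are the VSC-1 Rhodo 128k values from that table.

```python
import math

CORES = [16, 32, 64, 128, 256]
VSC1_RHODO_128K = [13794, 7874, 3911, 2019, 1159]  # run times [s]
N_ATOMS = 128000


def relative_efficiency(cores, times):
    """Parallel efficiency in %, relative to the first (smallest) run.

    Assumption: the first entry is treated as 100 % efficient.
    """
    base_cores, base_time = cores[0], times[0]
    return [100.0 * (base_time * base_cores) / (t * n)
            for n, t in zip(cores, times)]


eff = relative_efficiency(CORES, VSC1_RHODO_128K)
atoms_per_core = [N_ATOMS // n for n in CORES]

for n, a, e in zip(CORES, atoms_per_core, eff):
    print(f"log2(n) = {math.log2(n):.0f}  atoms/core = {a:5d}  efficiency = {e:5.1f} %")
```

With this baseline, the relative efficiency stays above 80 % down to 1000 atoms per core and falls below it at 500 atoms per core, in line with the rule of thumb stated on the slide.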

LAMMPS VSC-1 and VSC-2 parallel efficiency: VSC-1 vs VSC-2
[Figure, two panels. First panel: parallel efficiency [%] (0 to 120) vs. log2(n), n = # of cores (0 to 10), with curves for Rhodo 32k and Rhodo 128k on VSC-1 and on VSC-2. Second panel: log2(n) axis from 0 to 9.]
http://lammps.sandia.gov/bench.html

LAMMPS - memory per processor [MB]
(for 1, 4 and 8 cores a single value is listed per benchmark)

           Test (6940)      Chain (32k)      LJ (32k)         Rhodopsin 32k    Rhodopsin 64k    Rhodopsin 128k
# cores    VSC-1  VSC-2     VSC-1  VSC-2     VSC-1  VSC-2     VSC-1  VSC-2     VSC-1  VSC-2     VSC-1  VSC-2
1              91               8               12               124              231              428
4              51               4                4                48               77              128
8              32               3                2                29               48               78
16             8     11         2      3         2      2        23     27        29     33        48     55
32             7     10         2      3         2      2        12     14        23     27        29     33
64             7     10         2      2         2      2        11     12        12     14        23     27
128            5      6         2      2         2      2        10     11        11     13        12     14
256            4      6         2      2         1      2         9     11        10     11        11     12
512            -      6         -      2         -      2         -     11         -     11         -     12
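Multiplying the per-processor memory by the number of cores gives a rough aggregate footprint, which shows how much duplicated data (such as ghost atoms and per-process overhead) accumulates as the sub-domains shrink. The sketch below does this for the VSC-1 Rhodopsin 128k column; the variable names are hypothetical, and the result is only an estimate because the tabulated values are rounded to whole MB.

```python
# Memory per processor [MB], VSC-1, Rhodopsin 128k, copied from the table above.
CORES = [16, 32, 64, 128, 256]
MB_PER_PROC = [48, 29, 23, 12, 11]

# Aggregate memory over all processes (estimate from rounded table values).
total_mb = [n * m for n, m in zip(CORES, MB_PER_PROC)]
```

In this estimate the memory per processor drops by roughly a factor of four between 16 and 256 cores, while the aggregate footprint grows by more than a factor of three.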

CONCLUSIONS
• the LAMMPS performance boost depends on the model: better parallel efficiency for long-range models
• LAMMPS shows good parallel performance, however
  o the parallel efficiency of LAMMPS varies with the size of the benchmark data
  o the performance advantage grows as the cluster size increases
  o LAMMPS scales better to more processors for larger systems
• LAMMPS runs faster on VSC-1 than on VSC-2 (VSC-1/VSC-2 run-time ratio of ~0.5 for long-range models, 0.7-0.8 for short-range models)