Accelerating Three-Body Molecular Dynamics Potentials Using NVIDIA Tesla K20X GPUs
Masako Yamada, GE Global Research

2 Overview of MD Simulations
Non-icing surfaces for wind turbines
Large simulations: ~1 million water molecules
Long simulations: ~1 microsecond
Many simulations: ~1000 independent droplets
Awarded two DOE ALCC grants:
  40M CPU-h on Jaguar (Cray XK6) at ORNL
  40M CPU/GPU-h on Titan (Cray XK7, hybrid) at ORNL

3 Overview of Titan
Total 18,600 nodes. Each node has:
  16 AMD Opteron CPU cores
  1 Tesla K20X GPU accelerator (2688 compute cores)
Gemini interconnect (ASIC, MPI messages)
PCI-Express 2.0 bus
LAMMPS was part of acceptance testing

4 Overview of MD
Atom-by-atom modeling of materials
N-body problem
Discrete, numerical integration
Biology and chemistry require good water models
Dozens of potentials available
Most use pair-wise interactions
Most are non-polarizable/rigid
MD always on the forefront of HPC

5 Overview of LAMMPS
Open-source molecular dynamics code developed by Sandia Nat'l Lab
Pre-populated with many popular pair-wise and many-body potentials
  TIP3P/TIP4P water potentials
  Stillinger-Weber three-body potential
User can also modify/define potentials

6 Billion-fold growth in a (half) career
Year     Software/Language    # of Molecules  Hardware
1995     Pascal               Few             Desktop Mac
2000     C, Fortran90         Hundreds        IBM SP, SGI O2K
2010     NAMD, LAMMPS         1000s           Linux HPC
Present  GPU-enabled LAMMPS   Millions        Titan

7 Why use a three-body potential?
Stillinger-Weber 3-body potential
Particle = one water molecule
mW water introduced in 2009, Nature paper in 2011
Properties comparable to or better than existing models
Much faster than point-charge models
  Exemplary test case by authors: 180x faster than SPC/E
  Our production simulation: 40-50x faster than SPC/E (asymmetric million-molecule droplet on engineered surface; loaded onto 64 nodes)
[Figure panels: SPC/E, mW]

8 Relevant GPU acceleration activity
Pair-wise potentials
  LAMMPS already GPU-enabled
Three-body potentials
  Impressive acceleration, but for crystal solids only
Present work
  >5x acceleration demonstrated using LAMMPS
  Works for liquids, glass, vapor

9 Parallelization scheme
Host
  Time integration
  Thermostat/barostat
  Bond/angle calculations
  Statistics
Accelerator
  3-body potential
  Neighbor lists

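10 Generic 3-body potential

$$U = \sum_{i} \sum_{j \neq i} \sum_{k > j} \begin{cases} \phi(p_i, p_j, p_k) & \text{if } r_{ij} < r_c \text{ and } r_{ik} < r_c \\ 0 & \text{otherwise} \end{cases}$$

$r_c$ = cutoff, $r_\alpha$ = neighbor skin
Good candidate for GPU:
1. Occupies majority of computational time
2. Can be decomposed into independent kernels/work-items
Examples: Stillinger-Weber, MEAM, Tersoff, REBO/AIREBO, Bond-order
[Figure: atoms i, j, k at positions p_i, p_j, p_k measured from the origin (0,0,0), with separations r_ij and r_ik]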
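The triple sum on this slide can be sketched serially in Python. This is a minimal illustration, not the LAMMPS implementation: the function name `three_body_energy` and the callable `phi` are hypothetical, and neighbors are found by brute force rather than with a neighbor list.

```python
import itertools
import math

def three_body_energy(positions, phi, r_c):
    """Sum phi(p_i, p_j, p_k) over triplets with r_ij < r_c and r_ik < r_c.

    `phi` is any user-supplied three-body function (hypothetical placeholder).
    """
    n = len(positions)
    total = 0.0
    for i in range(n):
        # Atoms within the cutoff of the central atom i (brute-force search).
        near = [j for j in range(n)
                if j != i and math.dist(positions[i], positions[j]) < r_c]
        # combinations() yields each pair once with k > j, as in the sum.
        for j, k in itertools.combinations(near, 2):
            total += phi(positions[i], positions[j], positions[k])
    return total

# Usage: a toy phi that returns 1.0 simply counts the qualifying triplets.
chain = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
count = three_body_energy(chain, lambda pi, pj, pk: 1.0, r_c=1.5)
```

In the three-atom chain only the middle atom has two neighbors inside the cutoff, so a single triplet contributes. Each central atom's inner loop is independent of the others, which is the property the slide points to when it calls the potential a good GPU candidate.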

11 Stillinger-Weber Parallelization

$$U = \sum_{i} \sum_{j < i} \phi_2(r_{ij}) + \sum_{i} \sum_{j \neq i} \sum_{k > j} \phi_3(r_{ij}, r_{ik}, \theta_{jik})$$

Atom i: 3 kernels, no data dependencies
1. 2-body operations
2. 3-body operations where (r_ij < r_α) AND (r_ik < r_α) is TRUE: update forces on i only
3. 3-body operations where (r_ij < r_α) AND (r_ik < r_α) is FALSE: neighbor-of-neighbor interactions
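For reference, the two terms of the Stillinger-Weber energy can be written out as a short Python sketch. The functional forms are the standard Stillinger-Weber ones; the numerical parameters are the mW water values as I recall them from Molinero and Moore (2009) and should be treated as assumptions to check against the original paper or the parameter file shipped with LAMMPS. The function names `phi2` and `phi3` are illustrative.

```python
import math

# Assumed mW parameters (energy in kcal/mol, length in angstrom); verify before use.
EPS = 6.189
SIGMA = 2.3925
A, B, P, Q = 7.049556277, 0.6022245584, 4, 0
LAM, GAMMA, CUT_A = 23.15, 1.20, 1.80
COS_THETA0 = -1.0 / 3.0          # tetrahedral angle

R_CUT = CUT_A * SIGMA            # both terms vanish at and beyond this distance

def phi2(r):
    """Two-body term: A*eps*[B*(sigma/r)^p - (sigma/r)^q] * exp(sigma/(r - a*sigma))."""
    if r >= R_CUT:
        return 0.0
    s = SIGMA / r
    return A * EPS * (B * s**P - s**Q) * math.exp(SIGMA / (r - R_CUT))

def phi3(r_ij, r_ik, cos_theta):
    """Three-body term penalizing deviation of angle j-i-k from tetrahedral."""
    if r_ij >= R_CUT or r_ik >= R_CUT:
        return 0.0
    damp = (math.exp(GAMMA * SIGMA / (r_ij - R_CUT))
            * math.exp(GAMMA * SIGMA / (r_ik - R_CUT)))
    return LAM * EPS * (cos_theta - COS_THETA0) ** 2 * damp

# Usage: pair energy at the well minimum and a non-tetrahedral triplet.
r_min = 2 ** (1 / 6) * SIGMA
e_pair = phi2(r_min)
e_angle = phi3(2.7, 2.7, 0.0)
```

With these parameters the pair term has a well depth of about one `EPS` at `r_min`, and the three-body term is zero for a perfectly tetrahedral angle and positive otherwise.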

12 Redundant Computation Approach
Atom-decomposition
  1 atom: 1 computational kernel only
  Fewest operations (and effective parallelization), but shared memory access is a bottleneck
Force-decomposition
  1 atom: 3 computational kernels required
  Redundant computations, but reduced shared memory issues
  Many work-items = more effective use of cores
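The trade-off on this slide can be illustrated with a serial toy in Python. This is a sketch under stated assumptions, not the GPU kernels: atoms live on a line (1D), the three-body energy is a made-up function E = (x_j - x_i)^2 (x_k - x_i)^2 with no cutoff, and the names `forces_scatter` and `forces_gather` are hypothetical. The scatter version evaluates each triplet once but writes to three atoms, which would need synchronized memory access when run in parallel. The gather version lets every atom recompute each triplet it belongs to and write only its own force.

```python
from itertools import combinations

def triplet_forces(xi, xj, xk):
    """Forces on (i, j, k) for the toy energy E = (xj-xi)^2 * (xk-xi)^2."""
    dj, dk = xj - xi, xk - xi
    fj = -2.0 * dj * dk * dk          # -dE/dxj
    fk = -2.0 * dk * dj * dj          # -dE/dxk
    return -(fj + fk), fj, fk         # force on i balances the other two

def triplets(n):
    """All (i, j, k) with central atom i and an unordered pair j < k."""
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for j, k in combinations(others, 2):
            yield i, j, k

def forces_scatter(x):
    """Each triplet evaluated once; results written to three different atoms."""
    f, evals = [0.0] * len(x), 0
    for i, j, k in triplets(len(x)):
        fi, fj, fk = triplet_forces(x[i], x[j], x[k])
        f[i] += fi
        f[j] += fj
        f[k] += fk
        evals += 1
    return f, evals

def forces_gather(x):
    """Each atom recomputes every triplet it is part of; writes only to itself."""
    f, evals = [0.0] * len(x), 0
    for me in range(len(x)):
        for i, j, k in triplets(len(x)):
            if me in (i, j, k):
                fi, fj, fk = triplet_forces(x[i], x[j], x[k])
                f[me] += {i: fi, j: fj, k: fk}[me]
                evals += 1
    return f, evals

x = [0.0, 1.0, 2.5, 4.0]
f_scatter, n_scatter = forces_scatter(x)
f_gather, n_gather = forces_gather(x)
```

Both routines return the same forces, but the gather version performs three times as many triplet evaluations. That is the redundancy the slide accepts in exchange for writes that never collide.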

13 Neighbor List on GPU
3-body force-decomposition approach involves neighbor-of-neighbor operations
Requires additional overhead:
  Increase in border size shared by two processes
  Neighbor list for ghost atoms straddling across cores
GPU implementation not necessarily faster than CPU, but less time spent in host-accelerator data transfer (note: neighbor lists are huge)
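A small Python sketch may help show why neighbor-of-neighbor operations enlarge the border region. It is a brute-force illustration with hypothetical function names, not the GPU neighbor-list build: it constructs a list of neighbors within a list radius (cutoff plus skin) and then collects the atoms reachable in two hops.

```python
import math

def build_neighbor_list(positions, r_list):
    """For each atom, indices of all other atoms closer than r_list (O(N^2) search)."""
    n = len(positions)
    return [[j for j in range(n)
             if j != i and math.dist(positions[i], positions[j]) < r_list]
            for i in range(n)]

def neighbors_of_neighbors(neigh, i):
    """Atoms reached from i in exactly two hops (not i itself, not direct neighbors)."""
    direct = set(neigh[i])
    second = set()
    for j in neigh[i]:
        second.update(neigh[j])
    return second - direct - {i}

# Usage: four atoms on a line, spaced 1.0 apart, with a list radius of 1.2.
chain = [(float(x), 0.0, 0.0) for x in range(4)]
neigh = build_neighbor_list(chain, r_list=1.2)
two_hop = neighbors_of_neighbors(neigh, 0)
```

Atom 0 has only atom 1 as a direct neighbor, yet atom 2 (at distance 2.0) appears in its two-hop set. In this toy, atoms up to twice the list radius away can be needed, which is one way to picture the larger border and the ghost-atom neighbor lists the slide mentions.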

14 >200x overall speedup since 2011
1. Switched to mW water potential
   3-body model is more expensive/complex than 2-body, but:
   Particle reduction: at least 3x
   Timestep increase: 10x
   No long-range forces
2. LAMMPS dynamic load balancing: 2-3x
3. GPU acceleration of 3-body model: 5x
2011: 6 femtoseconds/1024 CPU-seconds (SPC/E)
2013: 2 picoseconds/1024 CPU-seconds (mW)
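The numbers on this slide can be cross-checked with a few lines of Python. This is only arithmetic on the quoted figures. It assumes the individual factors multiply independently and ignores the unquantified items (the extra cost of the 3-body model and the saving from dropping long-range forces).

```python
# Throughput quoted on the slide, converted to femtoseconds per 1024 CPU-seconds.
fs_2011 = 6.0              # SPC/E
fs_2013 = 2.0 * 1000.0     # mW: 2 picoseconds = 2000 femtoseconds

measured = fs_2013 / fs_2011          # ratio of the two quoted throughputs

# Product of the quoted factors: particles 3x, timestep 10x,
# load balancing 2-3x, GPU acceleration 5x.
product_low = 3 * 10 * 2 * 5
product_high = 3 * 10 * 3 * 5

print(f"measured ratio: {measured:.0f}x")
print(f"product of factors: {product_low}x to {product_high}x")
```

The throughput ratio is roughly 333x, which sits inside the 300x to 450x range implied by the individual factors and is consistent with the ">200x" headline.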

15 Post-processing and Viz
Big Data
  Total 50 TB
  1 million molecules per snapshot
  Dozens of snapshots per file
  10,000s of files
Big Compute
  NOT a simple search/sort
  Execute three-body calculation again
  Subtle pattern-matching of intra-molecular position
  Post-processing is a Titan job in itself!
Big Visualization
  Need dedicated viz resource

16 Visualizing crystalline regions
Steinhardt-Nelson order parameter
Particle mobility
[Figure panels: Bottom View, Side View]

17 Credits
Mike Brown (ORNL): GPU acceleration
Paul Crozier (Sandia): dynamic load balancing
Valeria Molinero (Utah): mW potential
Aaron Keys (UMich, Berkeley): Steinhardt-Nelson order parameters
Art Voter/Danny Perez (LANL): Parallel Replica method
Mike Matheson (ORNL): visualization
Jack Wells, Suzy Tichenor (ORNL): general
Azar Alizadeh, Branden Moore, Rick Arthur, Margaret Blohm (GE Global Research)
This research was conducted in part under the auspices of the Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy under Contract No. DE-AC05-00OR22725 with UT-Battelle, LLC. This research was also conducted in part under the auspices of the GE Global Research High Performance Computing program.

18 Backup

19 Load 1 million molecules on Host/CPU
1 million molecules on 64 nodes
Processor sub-domains correspond to spatial partitioning of droplet
8 MPI tasks/node
1 core/paired-unit

20 Host and accelerator layout
Per node: ~15,000 molecules
Host: AMD Opteron 6274 CPU (Core0 to Core15)
Accelerator: NVIDIA Tesla K20X GPU (Processor 1 to Processor 14, each with Core 1 to Core 192)
Memory levels shown: host memory, private, local, global
Execution hierarchy shown: kernel, work group, work item
Work item = fundamental unit of activity
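The per-node figure and the GPU core count quoted in the deck can be checked with a little arithmetic. This sketch assumes the molecules are split evenly across nodes and MPI tasks, which the slides do not state exactly. The 8 tasks per node and the 14 x 192 layout are taken from the slides.

```python
molecules = 1_000_000
nodes = 64
mpi_tasks_per_node = 8

per_node = molecules / nodes                   # molecules handled by one node
per_task = per_node / mpi_tasks_per_node       # molecules per MPI task (even split assumed)

gpu_processors = 14
cores_per_processor = 192
gpu_cores = gpu_processors * cores_per_processor

print(f"{per_node:.0f} molecules per node, about {per_task:.0f} per MPI task")
print(f"{gpu_cores} GPU cores")
```

An even split gives 15,625 molecules per node, in line with the "~15,000" on the slide, and 14 processors of 192 cores each give the 2688 compute cores quoted for the K20X.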

21 Dynamic load balancing
Introduced in LAMMPS in 2012
Adjusts size of processor sub-domains to equalize number of particles
2-3x speedup for 1 million molecule droplets on 64 nodes (with user-specified processor mapping)
[Figure panels: No load balancing, Default load balancing, User-specified mapping]
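The idea of resizing sub-domains to equalize particle counts can be sketched in one dimension. This is a toy illustration with hypothetical function names, not the algorithm LAMMPS uses: it compares equal-width slabs with slabs whose boundaries are placed at particle-count quantiles.

```python
import bisect

def equal_width_cuts(lo, hi, nproc):
    """Boundaries that split [lo, hi] into nproc slabs of equal width."""
    width = (hi - lo) / nproc
    return [lo + width * p for p in range(1, nproc)]

def balanced_cuts(xs, nproc):
    """Boundaries placed so each slab holds (nearly) the same number of particles."""
    s = sorted(xs)
    n = len(s)
    # Put each cut midway between the two particles that straddle the quantile.
    return [0.5 * (s[(p * n) // nproc - 1] + s[(p * n) // nproc])
            for p in range(1, nproc)]

def counts(xs, cuts):
    """Number of particles that fall into each slab."""
    out = [0] * (len(cuts) + 1)
    for x in xs:
        out[bisect.bisect_right(cuts, x)] += 1
    return out

# Usage: a dense cluster (90 particles) plus a sparse region (10 particles).
xs = [0.1 * i for i in range(90)] + [90.0 + i for i in range(10)]
static = counts(xs, equal_width_cuts(0.0, 100.0, 4))
balanced = counts(xs, balanced_cuts(xs, 4))
```

With equal-width slabs one processor owns 90 of the 100 particles while two own none. With quantile boundaries each owns 25. A droplet on a surface is similarly non-uniform, which is why adjusting sub-domain sizes pays off in the simulations described here.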

22 Development of water-surface interaction potential
Interaction potential developed at GE Global Research

23 References
Brown, W. M. and Yamada, M. Implementing Molecular Dynamics on Hybrid High Performance Computers - Three-Body Potentials. Computer Physics Communications (2013).
Hou, C., Xu, J., Wang, P., Huang, W. and Wang, X. Computer Physics Communications (2013).
Shi, B. and Dhir, V. K. Molecular dynamics simulation of the contact angle of liquids on solid surfaces. The Journal of Chemical Physics, 130, 3 (2009).
Sergi, D., Scocchi, G. and Ortona, A. Molecular dynamics simulations of the contact angle between water droplets and graphite surfaces. Fluid Phase Equilibria, 332 (2012).
Oxtoby, D. W. Homogeneous nucleation: theory and experiment. Journal of Physics: Condensed Matter, 4.
Plimpton, S. Fast Parallel Algorithms for Short-Range Molecular Dynamics. Journal of Computational Physics, 117, 1 (1995).
Humphrey, W., Dalke, A. and Schulten, K. VMD: Visual molecular dynamics. Journal of Molecular Graphics, 14, 1 (1996).
Keys, A. S. Shape Matching Analysis Code. University of Michigan, 2011.
Keys, A. S., Iacovella, C. R. and Glotzer, S. C. Characterizing Structure Through Shape Matching and Applications to Self-Assembly. Annual Review of Condensed Matter Physics, 2, 1 (2011).
Steinhardt, P. J., Nelson, D. R. and Ronchetti, M. Bond-orientational order in liquids and glasses. Physical Review B, 28, 2 (1983).
Stillinger, F. H. and Weber, T. A. Computer simulation of local order in condensed phases of silicon. Physical Review B, 31, 8 (1985).
Berendsen, H. J. C., Grigera, J. R. and Straatsma, T. P. The missing term in effective pair potentials. The Journal of Physical Chemistry, 91, 24 (1987).
Molinero, V. and Moore, E. B. Water Modeled As an Intermediate Element between Carbon and Silicon. The Journal of Physical Chemistry B, 113, 13 (2009).
Moore, E. B. and Molinero, V. Structural transformation in supercooled water controls the crystallization rate of ice. Nature, 479, 7374 (2011).
Yamada, M., Mossa, S., Stanley, H. E. and Sciortino, F. Interplay between Time-Temperature Transformation and the Liquid-Liquid Phase Transition in Water. Physical Review Letters, 88, 19 (2002).
Brown, W. M., Wang, P., Plimpton, S. J. and Tharrington, A. N. Implementing molecular dynamics on hybrid high performance computers - short range forces. Computer Physics Communications, 182, 4 (2011).
Voter, A. F. Parallel replica method for dynamics of infrequent events. Physical Review B, 57, 22 (1998), R13985-R13988.