2D Implicit Charge- and Energy-Conserving Particle-in-Cell Application Using CUDA. Christopher Leibs, Karthik Murthy. Mentors: Dana Knoll and Allen McPherson. IS&T Co-Design Summer School 2012, Los Alamos National Laboratory, NM. LA-UR-12-25342: Approved for public release; distribution is unlimited.

Agenda
- Co-Design Summer School @ LANL
- Problem: 2D implicit energy and charge conservation
- 2D implicit PIC method outline
- CUDA implementation
  - Successful strategies: exploiting texture memory for storing the electric and magnetic fields; usage of intrinsics and strength-reduction operations; sorting particles by cell-x and cell-y; sorting particles by done-ness and velocity directions; 1/2 ions + 1/2 electrons on each GPU
  - Unsuccessful strategies: red-black strategy of launching blocks of GPU threads; ions on one GPU and electrons on another GPU

Co-Design Summer School. The Los Alamos IS&T Co-Design Summer School was inaugurated in 2011. Students from diverse technical backgrounds, including nuclear engineering, applied mathematics, and computer science, form teams that work together to solve a focused co-design problem.
- Emmanuel Cieren, Applied Mathematics, ENSTA ParisTech
- Nicolas Feltman, Computer Science, Carnegie Mellon University
- Christopher Leibs, Applied Mathematics, University of Colorado
- Colleen McCarthy, Applied Mathematics, North Carolina State University
- Karthik Murthy, Computer Science, Rice University
- Yijie Wang, Computer Science, University of South Florida

Problem: Plasma Simulation. [Diagram: the PIC cycle.] MOMENT SOLVER: given the charge and current density (rho, J), solve Maxwell's equations for the electric and magnetic fields E, B. Interpolate fields to particles. PARTICLE PUSHER: push particles with the force F = q(E + v × B), updating the position and velocity r, v. Interpolate particle quantities back to the field locations. The cycle can be closed with either an implicit or an explicit method.

Problem - Explicit Particle-In-Cell Method. Main idea: interpolate field values to the particles; push the particles; interpolate particle information back to the field locations; solve the field equations and update the values. Constraints: finite grid instability (need dx on the order of the Debye length or smaller), tight CFL constraint (need dt small enough), and the method can be computationally demanding. Solution: try to use implicit methods to relax these conditions!
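As a quick illustration of these constraints (not from the original slides), the following minimal sketch checks the usual explicit-PIC limits for assumed plasma and grid parameters; the density, temperature, step sizes, and maximum velocity below are placeholder values.

    /* Sketch: check the explicit-PIC constraints that the implicit method
     * is meant to relax (SI units; all input values are assumptions). */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        const double eps0 = 8.854e-12, e = 1.602e-19, me = 9.109e-31, kB = 1.381e-23;
        double ne   = 1e18;    /* electron density [m^-3], assumed value   */
        double Te   = 1e5;     /* electron temperature [K], assumed value  */
        double dx   = 1e-5;    /* grid spacing [m], assumed                */
        double dt   = 1e-12;   /* time step [s], assumed                   */
        double vmax = 1e7;     /* fastest particle speed [m/s], assumed    */

        double wpe     = sqrt(ne * e * e / (eps0 * me));       /* plasma frequency */
        double lambdaD = sqrt(eps0 * kB * Te / (ne * e * e));  /* Debye length     */

        printf("finite grid instability: dx <~ lambda_D ? %s\n",
               dx < lambdaD ? "ok" : "violated");
        printf("explicit (leapfrog) stability: wpe*dt < 2 ? %s\n",
               wpe * dt < 2.0 ? "ok" : "violated");
        printf("CFL: vmax*dt <= dx ? %s\n",
               vmax * dt <= dx ? "ok" : "violated");
        return 0;
    }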

Problem - Implicit Particle-In-Cell Method. Chen, Chacón, and Barnes* developed a 1D electrostatic PIC method that: relaxes the CFL condition, is stable against the finite grid instability, conserves charge, conserves energy, and controls momentum. We will draw heavily from many of these ideas. * An energy- and charge-conserving, implicit, electrostatic particle-in-cell algorithm. Journal of Computational Physics, 230:7018-7036, 2011.

Today's Problem. Application to demonstrate the 2D implicit method: Island Equilibrium. [Figure 2.3: Initial conditions. A contour plot of the density function (in blue) with the field lines of the magnetic field (in orange); for this figure, Omega_ce/omega_pe = 0.3.]

2D Implicit PIC - Cell in the Electric-Magnetic Field. [Figure: a single staggered grid cell, showing the locations of the E, J, and B components at the integer and half-integer indices i, i+1/2, i+1 and j, j+1/2, j+1.]

2D Implicit PIC - Particle Sub-stepping Outline. [Flowchart:] initialization (fields, particles) -> compute work -> loop over all particles: { while the accumulated sub-step time tau < dt: time estimator -> particle push -> cell crossing -> accumulation } -> write output.

2D Implicit PIC - Time Estimation (Control Momentum). Sub-step times are chosen to help control momentum by comparing a first-order (Euler) and a second-order (Heun) integration scheme. The estimate is then compared with a fractional value of the gyro frequency and a distance limiter in order to help alleviate stresses in the Picard iteration.

\ell_{e,r} \approx \frac{\Delta\tau^2}{2}\, a(r), \qquad \ell_{e,v} \approx \frac{\Delta\tau^2}{2}\, (\nabla a \cdot v)

We choose \Delta\tau such that

\sqrt{\ell_{e,r}(\Delta\tau)^2 + \ell_{e,v}(\Delta\tau)^2} < \epsilon_a + \epsilon_r \, \lVert r^0(\Delta\tau) \rVert_2

where r^0(\Delta\tau) is the initial residual of the equations of motion.
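Below is a minimal sketch of the Euler-versus-Heun controller described above, under assumed uniform fields and placeholder tolerances (eps_a, eps_r); the gyro-frequency fraction and distance limiter mentioned on the slide are omitted, and all variable names are illustrative.

    /* Sketch: choose a sub-step dtau by comparing a first-order (Euler) and
     * a second-order (Heun) update; shrink dtau until the estimate passes. */
    #include <math.h>
    #include <stdio.h>

    typedef struct { double x, y, vx, vy; } State;

    static const double QM = -1.0;            /* q/m (assumed)             */
    static const double EX = 0.1, EY = 0.0;   /* uniform E field (assumed) */
    static const double BZ = 1.0;             /* uniform B_z (assumed)     */

    /* a = (q/m)(E + v x B) with B = B_z zhat */
    static void accel(const State *s, double *ax, double *ay)
    {
        *ax = QM * (EX + s->vy * BZ);
        *ay = QM * (EY - s->vx * BZ);
    }

    static double choose_dtau(const State *s, double dtau, double eps_a, double eps_r)
    {
        for (;;) {
            double ax0, ay0, ax1, ay1;
            accel(s, &ax0, &ay0);

            /* Euler predictor */
            State e = { s->x + dtau * s->vx, s->y + dtau * s->vy,
                        s->vx + dtau * ax0,  s->vy + dtau * ay0 };

            /* Heun corrector: trapezoidal average of the slopes */
            accel(&e, &ax1, &ay1);
            State h = { s->x + 0.5 * dtau * (s->vx + e.vx),
                        s->y + 0.5 * dtau * (s->vy + e.vy),
                        s->vx + 0.5 * dtau * (ax0 + ax1),
                        s->vy + 0.5 * dtau * (ay0 + ay1) };

            /* error estimate = difference between the two schemes */
            double lr   = hypot(h.x - e.x, h.y - e.y);
            double lv   = hypot(h.vx - e.vx, h.vy - e.vy);
            double norm = hypot(hypot(s->x, s->y), hypot(s->vx, s->vy));

            if (sqrt(lr * lr + lv * lv) < eps_a + eps_r * norm)
                return dtau;      /* accepted sub-step size */
            dtau *= 0.5;          /* too large: shrink and retry */
        }
    }

    int main(void)
    {
        State s = { 0.0, 0.0, 1.0, 0.0 };
        printf("accepted dtau = %g\n", choose_dtau(&s, 0.5, 1e-6, 1e-6));
        return 0;
    }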

2D Implicit PIC - Energy-Conserving Particle Push. Crank-Nicolson discretization:

\frac{r_p^{\nu+1} - r_p^{\nu}}{\Delta\tau} = v_p^{\nu+1/2}, \qquad
\frac{v_p^{\nu+1} - v_p^{\nu}}{\Delta\tau} = \frac{q_p}{m_p} \left[ E(r_p^{\nu+1/2}) + v_p^{\nu+1/2} \times B(r_p^{\nu+1/2}) \right]

with

v_p^{\nu+1/2} = \frac{v_p^{\nu} + v_p^{\nu+1}}{2}, \qquad
r_p^{\nu+1/2} = \frac{r_p^{\nu} + r_p^{\nu+1}}{2}, \qquad
F(r_p^{\nu+1/2}) = \sum_{i,j} F_{i,j}\, S(r_{i,j} - r_p^{\nu+1/2})

[Figure: staggered grid cell with the E, J, and B component locations.] The system is converged through fixed-point (Picard) iterations for r_p^{\nu+1} and v_p^{\nu+1}.
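A minimal sketch of this push for a single particle follows, with uniform placeholder fields standing in for the interpolated E(r^{nu+1/2}), B(r^{nu+1/2}); the variable names and tolerances are illustrative, not the application's actual code.

    /* Sketch: Crank-Nicolson push for one particle, converged with Picard
     * (fixed-point) iteration.  2D position, (vx, vy) velocity, B = B_z zhat. */
    #include <math.h>
    #include <stdio.h>

    typedef struct { double x, y, vx, vy; } Particle;

    static const double QM = -1.0;                     /* q/m (assumed)            */
    static const double EX = 0.1, EY = 0.0, BZ = 1.0;  /* uniform fields (assumed) */

    static void cn_push(Particle *p, double dtau, int max_iter, double tol)
    {
        double xn = p->x,  yn = p->y,  vxn = p->vx, vyn = p->vy;  /* old state */
        double x1 = xn, y1 = yn, vx1 = vxn, vy1 = vyn;            /* new guess */

        for (int k = 0; k < max_iter; ++k) {
            /* half-time-level (Crank-Nicolson) velocity average */
            double vxh = 0.5 * (vxn + vx1), vyh = 0.5 * (vyn + vy1);
            /* (the half-level position would be used to interpolate E, B) */

            /* fixed-point update of the CN equations */
            double nx1  = xn  + dtau * vxh;
            double ny1  = yn  + dtau * vyh;
            double nvx1 = vxn + dtau * QM * (EX + vyh * BZ);
            double nvy1 = vyn + dtau * QM * (EY - vxh * BZ);

            double dmax = fmax(fmax(fabs(nx1 - x1), fabs(ny1 - y1)),
                               fmax(fabs(nvx1 - vx1), fabs(nvy1 - vy1)));
            x1 = nx1; y1 = ny1; vx1 = nvx1; vy1 = nvy1;
            if (dmax < tol) break;                     /* Picard converged */
        }
        p->x = x1; p->y = y1; p->vx = vx1; p->vy = vy1;
    }

    int main(void)
    {
        Particle p = { 0.0, 0.0, 1.0, 0.0 };
        cn_push(&p, 0.1, 50, 1e-12);
        printf("x=%g y=%g vx=%g vy=%g\n", p.x, p.y, p.vx, p.vy);
        return 0;
    }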

2D Implicit PIC - Cell Crossing (Conserve Charge). Some attempts:
- The linear intercept is good enough (fast but not accurate)
- Bisection method wrapped around the original CN push (accurate but slow; see the sketch below)
- Fix the final boundary value in CN and solve a new system for the free dimension and time (fast but not stable)
- Estimate the time of crossing with an explicit solve to accelerate the above methods
Lesson learned: cell crossing was (much) harder than we anticipated.
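Below is a sketch of the bisection idea from the second bullet: treat the updated x position as a function of the sub-step length and bisect until the particle lands on the cell face. The trajectory function x_of_tau() is a placeholder standing in for re-running the Crank-Nicolson/Picard push with that sub-step; all numbers are assumptions.

    /* Sketch: bisection on the sub-step length to find the cell-face crossing. */
    #include <math.h>
    #include <stdio.h>

    /* placeholder: "run the CN push with sub-step tau and return the new x" */
    static double x_of_tau(double tau)
    {
        double x0 = 0.0, vx = 1.0, ax = -0.3;          /* assumed values */
        return x0 + vx * tau + 0.5 * ax * tau * tau;
    }

    static double crossing_time(double x_face, double tau_max, double tol)
    {
        double lo = 0.0, hi = tau_max;
        double flo = x_of_tau(lo) - x_face;            /* assumes a sign change on [0, tau_max] */
        for (int k = 0; k < 100 && hi - lo > tol; ++k) {
            double mid = 0.5 * (lo + hi);
            double fm  = x_of_tau(mid) - x_face;
            if ((flo < 0.0) == (fm < 0.0)) { lo = mid; flo = fm; }
            else                           { hi = mid; }
        }
        return 0.5 * (lo + hi);
    }

    int main(void)
    {
        double tau = crossing_time(0.5, 1.0, 1e-10);
        printf("crossing at tau = %g, x(tau) = %g\n", tau, x_of_tau(tau));
        return 0;
    }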

2D Implicit PIC - Current Accumulation. Each particle must accumulate its sub-step-weighted current to the grid:

J_{i,j}^{n+1/2} = \frac{1}{\Delta t} \frac{1}{\Delta x \, \Delta y} \sum_{p} \sum_{\nu} q_p \, S(r_{i,j} - r_p^{\nu+1/2}) \, v_p^{\nu+1/2} \, \Delta\tau^{\nu}

[Figure: staggered grid cell with the J component locations.] Lesson learned (for the parallel implementation): this is a map from a high-dimensional set (particles) to a lower-dimensional set (grid); we must be careful to ensure particles are not competing for write access.
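One common way to honor this lesson on a GPU, sketched below under assumed mesh-block sizes and a simplified nearest-grid-point weighting, is to accumulate each mesh block's current into shared memory with cheap shared-memory atomics and then issue only one global atomic per cell. This is a sketch in the spirit of the strategy, not the application's actual kernel.

    /* Sketch (CUDA): per-block Jx accumulation into shared memory, then one
     * merge into the global array.  Block size and struct layout are assumed. */
    #define BLK_NX 8                /* cells per mesh block in x (assumed) */
    #define BLK_NY 8                /* cells per mesh block in y (assumed) */

    struct ParticleSub { float x, y, vx, q, dtau; int ci, cj; };

    __global__ void accum_jx(const ParticleSub *p, int np,
                             float *Jx_global, int nx, float dt, float dxdy)
    {
        __shared__ float Jx_local[BLK_NY][BLK_NX];

        /* zero the block-local tile */
        for (int k = threadIdx.x; k < BLK_NX * BLK_NY; k += blockDim.x)
            ((float *)Jx_local)[k] = 0.0f;
        __syncthreads();

        int ci0 = blockIdx.x * BLK_NX;   /* mesh-block origin in cell indices */
        int cj0 = blockIdx.y * BLK_NY;

        /* in the application particles are sorted so a block sees mostly its
         * own cells' particles; here we simply filter for brevity */
        for (int i = threadIdx.x; i < np; i += blockDim.x) {
            ParticleSub s = p[i];
            int li = s.ci - ci0, lj = s.cj - cj0;
            if (li >= 0 && li < BLK_NX && lj >= 0 && lj < BLK_NY) {
                /* nearest-grid-point weighting; the talk uses higher-order
                 * shape functions spanning several cells */
                float w = s.q * s.vx * s.dtau / (dt * dxdy);
                atomicAdd(&Jx_local[lj][li], w);     /* cheap: shared memory */
            }
        }
        __syncthreads();

        /* one global atomic per local cell instead of one per particle sub-step */
        for (int k = threadIdx.x; k < BLK_NX * BLK_NY; k += blockDim.x) {
            int li = k % BLK_NX, lj = k / BLK_NX;
            atomicAdd(&Jx_global[(cj0 + lj) * nx + (ci0 + li)], Jx_local[lj][li]);
        }
    }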

2D Implicit PIC - Implementation

void runpic(){
    read_fields();
    read_particles();
    for(int p = 0; p < n; ++p){
        while(tau < dt){
            time_estimator();
            push_particle();
            cell_crossing();
            accum_current();
        }
        accum_charge();
    }
    time_average_current();
    export_data();
}

GPUs. Built a version of PIC using CUDA, capable of exploiting multiple GPUs. Experimental results on: one node of Darwin (2x Tesla M2090) and Scooter (1x Kepler GTX 680). [Fig. credit: NVIDIA documentation.]

GPUs. Kernels launch a grid of blocks; each block contains a set of threads. Blocks are scheduled onto SMs by a hardware scheduler, so we can't guarantee the order of execution of threads or blocks. [Fig. credit: NVIDIA documentation.]

CUDA 2D PIC - Lesson 1: Locality. Parallelization strategy: assign groups of cells (mesh blocks) to a single CUDA block.

CUDA 2D PIC - Lesson 2: Locality. Parallelization strategy: reflect the memory hierarchy in the accumulation of current density.

CUDA 2D PIC - Lesson 3: Locality. Parallelization strategy: drifting particles need to be re-sorted.

CUDA 2D PIC - Exploiting Texture Memory. Texture memory is special:
- Read-only memory optimized for access patterns exhibiting spatial locality
- Each SM has its own texture cache
- Special texture units help accelerate fetching of data (Z-order curve)
Employed for the electric and magnetic fields:
- The electric and magnetic fields are constant
- Field access patterns in the force computation exhibit spatial locality
- The span of the shape functions allows for efficient texture cache performance
Perfect candidates for texture memory.
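A sketch of serving a read-only field through the texture path is shown below. It uses the CUDA texture-object API with placeholder field data and sizes; the 2012 code may instead have used the older texture-reference API, and the layout here is purely illustrative.

    /* Sketch (CUDA): read a 2D field through a texture object. */
    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void sample_field(cudaTextureObject_t exTex, float x, float y, float *out)
    {
        /* unnormalized coordinates; spatially local reads hit the texture cache */
        *out = tex2D<float>(exTex, x + 0.5f, y + 0.5f);
    }

    int main(void)
    {
        const int nx = 64, ny = 32;                      /* grid size (assumed) */
        float h_Ex[ny * nx];
        for (int j = 0; j < ny; ++j)
            for (int i = 0; i < nx; ++i)
                h_Ex[j * nx + i] = 0.01f * i;            /* placeholder field   */

        cudaChannelFormatDesc ch = cudaCreateChannelDesc<float>();
        cudaArray_t arr;
        cudaMallocArray(&arr, &ch, nx, ny);
        cudaMemcpy2DToArray(arr, 0, 0, h_Ex, nx * sizeof(float),
                            nx * sizeof(float), ny, cudaMemcpyHostToDevice);

        cudaResourceDesc res = {};
        res.resType = cudaResourceTypeArray;
        res.res.array.array = arr;
        cudaTextureDesc td = {};
        td.addressMode[0] = td.addressMode[1] = cudaAddressModeClamp;
        td.filterMode = cudaFilterModePoint;
        td.readMode = cudaReadModeElementType;

        cudaTextureObject_t tex;
        cudaCreateTextureObject(&tex, &res, &td, NULL);

        float *d_out, h_out;
        cudaMalloc(&d_out, sizeof(float));
        sample_field<<<1, 1>>>(tex, 10.0f, 5.0f, d_out);
        cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
        printf("Ex(10,5) = %f\n", h_out);

        cudaDestroyTextureObject(tex);
        cudaFreeArray(arr);
        cudaFree(d_out);
        return 0;
    }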

Big Picture. [Diagram:] each block works on a mesh of cells; it reads the E, B fields, accumulates into local J fields, and merges them into the global J fields.

Performance and Optimizations (1). Tunable parameters: mesh cells per block; number of particle sub-steps before re-sort; max number of crossings; red-black offsets (discussed later). Time in seconds (bar chart): 129 -> 118.

Performance and Optimizations (2). Bitwise hacks, intrinsics, and strength reductions: optimized shape functions using bitwise operations (combo_hack!); usage of fused multiply-add (__fmaf_rn) and other intrinsics; converting division into multiplication by pre-computing constant values; loop unrolling (#pragma unroll). Time in seconds (bar chart): 129 -> 118 -> 49.7.

#define SIGN_MASK 0x7fffffff

union combo_hack {
    unsigned int in;
    float fl;
};

__device__ float b2(float x){
    combo_hack flip;
    flip.fl = x;
    flip.in = flip.in & SIGN_MASK;   /* clears the sign bit: flip.fl = |x| */
    if(flip.fl <= 1.5f){
        if(flip.fl > 0.5f)
            return __fmaf_rn(0.5f*flip.fl, (flip.fl - 3.0f), 1.125f);
        return __fmaf_rn(-x, x, 0.75f);
    }
    return 0.0f;
}
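For reference (not part of the original slide), the piecewise function that the bitwise version implements is the quadratic B-spline shape function; a plain sketch for comparison:

    #include <math.h>

    /* Plain reference for b2() above: the quadratic B-spline shape function. */
    float b2_reference(float x)
    {
        float ax = fabsf(x);
        if (ax <= 0.5f) return 0.75f - ax * ax;                   /* inner piece */
        if (ax <= 1.5f) return 0.5f * (ax - 1.5f) * (ax - 1.5f);  /* outer piece */
        return 0.0f;
    }

(0.5*|x|*(|x| - 3) + 1.125 expands to 0.5*(|x| - 1.5)^2, so the two forms agree.)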

Performance and Optimizations (3). Sorting strategies: particles are sorted by cell-x and cell-y; within mesh cells, particles are sorted by particle done-ness, particle x-velocity direction, and particle y-velocity direction. Time in seconds (bar chart): 129 -> 118 -> 49.7 -> 41.8.
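A sketch of the cell-major sort using Thrust follows; the packed key layout, Particle fields, and cell-size parameters are assumptions, and the secondary done-ness/velocity-direction keys (which could be packed into the low bits of the same key) are omitted here.

    /* Sketch (CUDA/Thrust): sort particles by (cell-y, cell-x) with a packed key. */
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/transform.h>

    struct Particle { float x, y, vx, vy; };

    struct CellKey {
        float dx, dy; int nx;
        __host__ __device__ unsigned int operator()(const Particle &p) const {
            /* assumes x, y >= 0 inside the domain */
            int ci = (int)(p.x / dx);
            int cj = (int)(p.y / dy);
            return (unsigned int)(cj * nx + ci);   /* cell-y major, cell-x minor */
        }
    };

    void sort_particles(thrust::device_vector<Particle> &parts,
                        float dx, float dy, int nx)
    {
        thrust::device_vector<unsigned int> keys(parts.size());
        CellKey keyfun; keyfun.dx = dx; keyfun.dy = dy; keyfun.nx = nx;
        thrust::transform(parts.begin(), parts.end(), keys.begin(), keyfun);
        thrust::sort_by_key(keys.begin(), keys.end(), parts.begin());
    }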

Performance and Optimizations (4). Intuition: avoid write conflicts in the overlap region (atomics are expensive).

Performance and Optimizations (4). Red-black scheduling: thwarted by the block scheduler. Trade-off: advantage from the reduction in atomics vs. additional texture cache misses. Time in seconds (bar chart): 129, 118, 119, 49.7, 41.8, 71 (the red-black variants were slower).

Performance and Optimizations (5). Targeting multiple GPUs (Tesla M2090s). Unsuccessful attempt: ions on one GPU and electrons on the second GPU. Successful attempt: 1/2 ions + 1/2 electrons on each GPU. Time in seconds (bar chart): 70 -> 42.
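A host-side sketch of the successful split is shown below; push_species() is a placeholder kernel, the launch configuration is arbitrary, and the per-device particle buffers are assumed to have been allocated on their respective devices.

    /* Sketch (CUDA): drive "half ions + half electrons per GPU" from the host. */
    #include <cuda_runtime.h>

    __global__ void push_species(float *particles, int n)
    {
        /* placeholder kernel; the real application runs the full
           sub-stepping pipeline here */
        (void)particles; (void)n;
    }

    void run_two_gpus(float *d_ions[2], float *d_elec[2], int n_ions, int n_elec)
    {
        int ndev = 0;
        cudaGetDeviceCount(&ndev);
        if (ndev < 2) return;                  /* single-GPU fallback handled elsewhere */

        for (int d = 0; d < 2; ++d) {
            cudaSetDevice(d);
            /* each device gets half of each species; launches are asynchronous,
               so both GPUs work concurrently */
            push_species<<<256, 256>>>(d_ions[d], n_ions / 2);
            push_species<<<256, 256>>>(d_elec[d], n_elec / 2);
        }
        for (int d = 0; d < 2; ++d) {          /* wait for both devices */
            cudaSetDevice(d);
            cudaDeviceSynchronize();
        }
    }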

Conclusions. Co-Design was a wonderful experience.
Successful strategies:
- Exploiting texture memory for storing the electric and magnetic fields
- Usage of intrinsics and strength-reduction operations
- Sorting particles by cell-x and cell-y
- Sorting particles by done-ness and velocity directions
- 1/2 ions + 1/2 electrons on each GPU
Unsuccessful strategies:
- Red-black strategy of launching blocks of GPU threads
- Ions on one GPU and electrons on another GPU
Future work:
- Dynamic load balancing: launch blocks to match the density profile
- Domain decomposition across multiple GPUs

EXTRA. For a typical run, we load 80 x 10^6 particles (40 million ions, 40 million electrons) on grids of size 256 x 128 or 512 x 256. The total simulation time is t = 10/omega_pe, with an artificial mass ratio of m_i/m_e = 100. [Figure 2.3 repeated: initial conditions, as shown earlier.]