2D Implicit Charge- and Energy-Conserving Particle-in-Cell Application Using CUDA
Christopher Leibs, Karthik Murthy
Mentors: Dana Knoll and Allen McPherson
IS&T Co-Design Summer School 2012, Los Alamos National Laboratory, NM
LA-UR-12-25342: Approved for public release; distribution is unlimited.
Agenda
- Co-Design Summer School @ LANL
- Problem: 2D implicit energy- and charge-conservation
- 2D implicit PIC method outline
- CUDA implementation
- Successful strategies
  - Exploiting texture memory for storing the electric and magnetic fields
  - Usage of intrinsics and strength-reduction operations
  - Sorting particles by Cell-x and Cell-y
  - Sorting particles by done-ness and velocity directions
  - 1/2 ions + 1/2 electrons on each GPU
- Unsuccessful strategies
  - Red-black strategy of launching blocks of GPU threads
  - Ions on one GPU and electrons on another GPU
Co-Design Summer School
The Los Alamos IS&T Co-Design Summer School was inaugurated in 2011. Students from diverse technical backgrounds, including nuclear engineering, applied mathematics, and computer science, form teams that work together to solve a focused co-design problem.
- Emmanuel Cieren, Applied Mathematics, ENSTA ParisTech
- Nicolas Feltman, Computer Science, Carnegie Mellon University
- Christopher Leibs, Applied Mathematics, University of Colorado
- Colleen McCarthy, Applied Mathematics, North Carolina State University
- Karthik Murthy, Computer Science, Rice University
- Yijie Wang, Computer Science, University of South Florida
Problem - Plasma Simulation
[Diagram: the PIC cycle. A MOMENT SOLVER takes the charge and current density and solves Maxwell's equations for J, E, B (electric and magnetic fields); the fields are interpolated to the particles. A PARTICLE PUSHER interpolates the fields to the particles and pushes them, updating r, v (position, velocity) under the force $F = q(E + v \times B)$; the particle moments are interpolated back to the grid. The coupling may be treated implicitly or explicitly.]
Problem - Explicit Particle-In-Cell Method
Main idea:
- Interpolate field values to the particles
- Push particles
- Interpolate particle information to field locations
- Solve field equations and update values
Constraints!
- finite grid instability (need $\Delta x$ on the order of the Debye length $\lambda_D$)
- tight CFL constraint (need $\Delta t$ small enough)
- can be computationally demanding
Solution: try implicit methods to relax these conditions!
Problem - Implicit Particle-In-Cell Method
Chen, Chacón, and Barnes* developed a 1D electrostatic PIC method that:
- relaxes the CFL condition,
- is stable against the finite grid instability,
- conserves charge,
- conserves energy, and
- controls momentum.
We draw heavily on these ideas.
* An energy- and charge-conserving, implicit, electrostatic particle-in-cell algorithm. Journal of Computational Physics, 230:7018-7036, 2011.
Today's Problem
Application to demonstrate the 2D implicit method: Island Equilibrium
[Figure 2.3: Initial conditions. A contour plot of the density function (blue) with the field lines of the magnetic field (orange). For this figure, $\omega_{ce}/\omega_{pe} = 0.3$, $\beta = 0.25$, and the domain is $[-2\pi, 2\pi] \times [-\pi, \pi]$.]
2D Implicit PIC - Cell in the Electric-Magnetic Field
[Figure: a single cell of the staggered grid, with indices $i, i+1/2, i+1$ and $j, j+1/2, j+1$, showing where $E_x, J_x$; $E_y, J_y$; $E_z, J_z$ and $B_x$, $B_y$, $B_z$ live at the integer and half-integer locations.]
2D Implicit PIC - Particle Sub-stepping Outline
- initialization: fields, particles
- compute fields
- loop over all particles:
  - while $\tau < \Delta t$: time estimator → particle push → cell crossing → accumulation
- write output
2D Implicit PIC - Time Estimation (Control Momentum)
- Sub-step times are chosen to help control momentum by comparing a first-order (Euler) and a second-order (Heun) integration scheme.
- The estimate is then compared with a fractional value of the gyro frequency and a distance limiter in order to help alleviate stresses in the Picard iteration.

$\ell_{e,r} = \frac{\Delta\tau^2}{2}\, a(r^\nu), \qquad \ell_{e,v} = \frac{\Delta\tau^2}{2}\,(\nabla_r a \cdot v)$

We choose $\Delta\tau$ such that
$\sqrt{\ell_{e,r}(\Delta\tau)^2 + \ell_{e,v}(\Delta\tau)^2} < \epsilon_a + \epsilon_r \,\lVert r_0(\Delta\tau) \rVert_2$
where $r_0(\Delta\tau)$ is the initial residual of the equations of motion.
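To make the estimator concrete, here is a minimal device-side sketch of the sub-step selection, not the production code: accel() is an assumed helper that gathers the acceleration $q/m\,(E + v \times B)$ at a position, and the tolerance names (eps_a, eps_r), the 0.1 gyro-frequency fraction, and the one-cell distance limiter are illustrative assumptions.

// Illustrative sketch of the sub-step estimator (names are assumptions).
__device__ float2 accel(float2 r, float2 v);   // assumed field gather

__device__ float estimate_dtau(float2 r, float2 v, float dtau,
                               float eps_a, float eps_r, float res0,
                               float omega_c, float dx)
{
    float2 a0 = accel(r, v);
    float2 r1 = make_float2(r.x + dtau * v.x, r.y + dtau * v.y);
    float2 a1 = accel(r1, v);

    // Euler and Heun differ at second order:
    //   position error  l_er ~ (dtau^2/2) |a|
    //   velocity error  l_ev ~ (dtau^2/2) |da/dt| ~ (dtau/2) |a1 - a0|
    float l_er = 0.5f * dtau * dtau * hypotf(a0.x, a0.y);
    float l_ev = 0.5f * dtau * hypotf(a1.x - a0.x, a1.y - a0.y);
    float err  = sqrtf(l_er * l_er + l_ev * l_ev);

    if (err > eps_a + eps_r * res0)        // res0: ||r_0(dtau)||_2
        dtau *= 0.5f;                      // shrink toward the tolerance

    dtau = fminf(dtau, 0.1f / omega_c);    // fraction of the gyro frequency
    float speed = hypotf(v.x, v.y);
    if (speed > 0.0f)
        dtau = fminf(dtau, dx / speed);    // distance limiter (one cell)
    return dtau;
}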
2D Implicit PIC - Energy-Conserving Particle Push
Crank-Nicolson discretization:

$r_p^{\nu+1} - r_p^\nu = \Delta\tau\, v_p^{\nu+1/2}$
$v_p^{\nu+1} - v_p^\nu = \Delta\tau\, \frac{q_p}{m_p}\left[ E(r_p^{\nu+1/2}) + v_p^{\nu+1/2} \times B(r_p^{\nu+1/2}) \right]$

with the midpoint values
$v_p^{\nu+1/2} = \frac{v_p^\nu + v_p^{\nu+1}}{2}, \qquad r_p^{\nu+1/2} = \frac{r_p^\nu + r_p^{\nu+1}}{2}$

and fields gathered through the shape function
$F(r_p^{\nu+1/2}) = \sum_{i,j} F_{i,j}\, S(r_{i,j} - r_p^{\nu+1/2})$

Converged through fixed-point (Picard) iterations for $r_p^{\nu+1}$ and $v_p^{\nu+1}$.
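A minimal Picard sketch of this push, reduced to two position/velocity components and a $B_z$-only rotation for brevity (the full scheme also carries $v_z$, $B_x$, and $B_y$). E_at(), Bz_at(), and the MAX_PICARD bound are illustrative stand-ins for the shape-function gathers and a proper convergence test on the update norm.

// Illustrative Picard loop for the Crank-Nicolson push (2D, Bz only).
#define MAX_PICARD 10                      // a convergence check would
                                           // normally replace this bound
__device__ float2 E_at(float2 r);          // assumed shape-function gathers
__device__ float  Bz_at(float2 r);

__device__ void cn_push(float2 *r, float2 *v, float qm, float dtau)
{
    float2 r_new = *r, v_new = *v;
    for (int k = 0; k < MAX_PICARD; ++k) {
        // Midpoint (nu+1/2) values close the implicit system.
        float2 rh = make_float2(0.5f * (r->x + r_new.x),
                                0.5f * (r->y + r_new.y));
        float2 vh = make_float2(0.5f * (v->x + v_new.x),
                                0.5f * (v->y + v_new.y));
        float2 E  = E_at(rh);
        float  Bz = Bz_at(rh);
        v_new.x = v->x + dtau * qm * (E.x + vh.y * Bz);
        v_new.y = v->y + dtau * qm * (E.y - vh.x * Bz);
        r_new.x = r->x + dtau * 0.5f * (v->x + v_new.x);
        r_new.y = r->y + dtau * 0.5f * (v->y + v_new.y);
    }
    *r = r_new;
    *v = v_new;
}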
2D Implicit PIC - Cell Crossing (Conserve Charge)
Some attempts:
- The linear intercept is good enough (fast but not accurate)
- Bisection method wrapped around the original CN solve (accurate but slow; see the sketch below)
- Fix the final boundary value in CN and solve a new system for the free dimension and time (fast but not stable)
- Estimate the time of crossing with an explicit solve to accelerate the above methods
Lesson learned: cell crossing was (much) harder than we anticipated.
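As an illustration of why the bisection attempt is accurate but slow: every bisection step re-runs a full Crank-Nicolson solve. A sketch, assuming a cn_push() routine like the one above, a particle moving in +x toward the face at x_face, and a fixed iteration budget (all names are illustrative).

// Illustrative bisection on the sub-step length: find the dtau for which
// the CN push lands the particle on the cell face x_face (motion in +x).
__device__ float bisect_crossing(float2 r, float2 v, float qm,
                                 float x_face, float dtau_hi)
{
    float lo = 0.0f, hi = dtau_hi;
    for (int k = 0; k < 30; ++k) {         // each step = one full CN solve!
        float mid = 0.5f * (lo + hi);
        float2 rm = r, vm = v;
        cn_push(&rm, &vm, qm, mid);        // trial sub-step
        if (rm.x < x_face) lo = mid;
        else               hi = mid;
    }
    return 0.5f * (lo + hi);
}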
2D Implicit PIC - Current Accumulation
Each particle must accumulate its sub-step-weighted current to the grid:

$J_{i,j}^{n+1/2} = \frac{1}{\Delta t}\,\frac{1}{\Delta x\, \Delta y} \sum_p \sum_\nu q_p\, S(r_{i,j} - r_p^{\nu+1/2})\, v_p^{\nu+1/2}\, \Delta\tau^\nu$

Lesson learned (for a parallel implementation): this is a map from a high-dimensional set (particles) to a lower-dimensional set (grid). One must be careful to ensure particles are not competing for write access; a scatter sketch follows.
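A sketch of the per-sub-step scatter for one current component. The 3x3 stencil matches a quadratic spline's span, shape() stands in for the 1D shape function, and boundary handling is omitted; all names are assumptions. atomicAdd() is what keeps particles owned by different threads from corrupting a shared grid node.

// Illustrative scatter of one sub-step's Jx contribution (row-major grid).
__device__ float shape(float x);           // assumed 1D shape function

__device__ void scatter_jx(float *Jx, int nx, float2 rh, float vxh,
                           float q, float dtau, float dt, float dx, float dy)
{
    int i = (int)floorf(rh.x / dx);
    int j = (int)floorf(rh.y / dy);
    float w = q * vxh * dtau / (dt * dx * dy);   // sub-step-weighted current
    for (int dj = -1; dj <= 1; ++dj)
        for (int di = -1; di <= 1; ++di) {
            float s = shape(rh.x / dx - (float)(i + di))
                    * shape(rh.y / dy - (float)(j + dj));
            // Conflict-safe write: several particles may hit this node.
            atomicAdd(&Jx[(j + dj) * nx + (i + di)], w * s);
        }
}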
2D Implicit PIC - Implementation

void runpic() {
    read_fields();
    read_particles();
    for (int p = 0; p < n; ++p) {
        while (tau < dt) {      // sub-step this particle across the time step
            time_estimator();
            push_particle();
            cell_crossing();
            accum_current();    // sub-step-weighted current accumulation
        }
        accum_charge();
    }
    time_average_current();
    export_data();
}
GPUs
- Built a version of PIC using CUDA
- Capable of exploiting multiple GPUs
- Experimental results on:
  - one node of Darwin (2x Tesla M2090)
  - Scooter (1x Kepler GTX 680)
Fig. credit: Nvidia documentation
GPUs
- Kernels launch a grid of blocks
- Each block contains a set of threads
- Blocks are scheduled onto SMs by a hardware scheduler
- Can't guarantee the order of execution of threads or blocks
Fig. credit: Nvidia documentation
CUDA 2D PIC - Lesson 1: Locality
Parallelization strategy: assign groups of cells (mesh blocks) to a single CUDA block.
CUDA 2D PIC - Lesson 2: Locality
Parallelization strategy: reflect the memory hierarchy in the accumulation of current density (sketch below).
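A sketch of what "reflecting the memory hierarchy" can look like: each block accumulates its mesh block's current into a shared-memory tile (fast atomics among the block's own threads), then merges the tile, including a one-cell halo shared with neighbors, into global memory. TILE_X/TILE_Y, the halo width, and the kernel shape are assumptions for illustration.

#define TILE_X 16
#define TILE_Y 16
#define LW (TILE_X + 2)                     // local width including halo
#define LH (TILE_Y + 2)

__global__ void accumulate_block(float *J_global, int nx, int ny)
{
    __shared__ float J_local[LH * LW];      // per-block tile + 1-cell halo
    for (int t = threadIdx.x; t < LH * LW; t += blockDim.x)
        J_local[t] = 0.0f;
    __syncthreads();

    // ... push this block's particles here, scattering each sub-step's
    //     current with atomicAdd(&J_local[...], w * s) as in the scatter
    //     sketch above (shared-memory atomics are far cheaper than global) ...
    __syncthreads();

    // Merge the tile into the global grid; only halo cells can collide
    // with writes from neighboring blocks, so global atomics remain.
    for (int t = threadIdx.x; t < LH * LW; t += blockDim.x) {
        int gi = (int)(blockIdx.x * TILE_X) + t % LW - 1;
        int gj = (int)(blockIdx.y * TILE_Y) + t / LW - 1;
        if (gi >= 0 && gj >= 0 && gi < nx && gj < ny)
            atomicAdd(&J_global[gj * nx + gi], J_local[t]);
    }
}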
CUDA 2D PIC - Lesson 3: Locality
Parallelization strategy: drifting particles need to be re-sorted.
CUDA 2D PIC - Exploiting Texture Memory
Texture memory is special:
- read-only memory
- optimized for access patterns exhibiting spatial locality
- each SM has its own texture cache
- special texture units help accelerate fetching of data (Z-order curve)
Employed for the electric and magnetic fields:
- the electric and magnetic fields are constant while particles are pushed
- field access patterns in the force computation exhibit spatial locality
- the span of the shape functions allows for efficient texture cache performance
- perfect candidates for texture memory (sketch below)
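For concreteness, here is how one field component could be put behind the texture path. The 2012 code would have used CUDA texture references; this sketch uses the later texture-object API, which expresses the same idea. In practice the field should be allocated with cudaMallocPitch() so the pitch meets the texture alignment requirement.

// Illustrative texture setup for one field component (e.g., Ex).
#include <cuda_runtime.h>

cudaTextureObject_t make_field_texture(float *d_field, size_t pitch,
                                       int nx, int ny)
{
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypePitch2D;
    res.res.pitch2D.devPtr       = d_field;       // from cudaMallocPitch
    res.res.pitch2D.desc         = cudaCreateChannelDesc<float>();
    res.res.pitch2D.width        = nx;
    res.res.pitch2D.height       = ny;
    res.res.pitch2D.pitchInBytes = pitch;

    cudaTextureDesc td = {};
    td.readMode       = cudaReadModeElementType;
    td.addressMode[0] = cudaAddressModeClamp;
    td.addressMode[1] = cudaAddressModeClamp;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, NULL);
    return tex;
}

// Device-side gathers then flow through the texture cache:
//   float ex = tex2D<float>(tex, i + 0.5f, j + 0.5f);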
Big Picture
[Diagram: each block works on a mesh of cells, reading the E, B fields and accumulating into local J fields, which are then merged into the global J fields.]
Performance and Optimizations (1)
Tunable parameters:
- mesh cells per block
- number of particle sub-steps before re-sort
- max number of crossings
- red-black offsets (discussed later)
Time in seconds: 129 → 118
Performance and Optimizations (2)
Bitwise hacks, intrinsics, and strength reductions:
- optimized shape functions using bitwise operations (the combo_hack union!)
- usage of fused multiply-add (__fmaf_rn) and other intrinsics
- converting division into multiplication by pre-computing constant values
- loop unrolling (#pragma unroll)
Time in seconds: 129 → 118 → 49.7

#define SIGN_MASK 0x7fffffff

union combo_hack {
    unsigned int in;
    float fl;
};

__device__ float b2(float x) {
    combo_hack flip;
    flip.fl = x;
    flip.in = flip.in & SIGN_MASK;   // clears the sign bit: flip.fl = |x|
    if (flip.fl <= 1.5f) {
        if (flip.fl > 0.5f)
            return __fmaf_rn(0.5f * flip.fl, (flip.fl - 3.0f), 1.125f);
        return __fmaf_rn(-x, x, 0.75f);
    }
    return 0.0f;
}
Performance and Optimizations (3)
Sorting strategies:
- particles are sorted by Cell-x and Cell-y
- within mesh cells, particles are sorted by:
  - particle done-ness
  - particle x-velocity direction
  - particle y-velocity direction
Time in seconds: 129 → 118 → 49.7 → 41.8
(A key-packing sketch follows.)
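One way to realize this multi-level sort in a single pass is to pack the criteria into one integer key, most significant criterion first, and sort particles by it with Thrust (available in 2012). The Particle layout and field names here are assumptions; for very large grids the cell index would need more than the 29 bits this packing leaves it.

// Illustrative key packing for the multi-level particle sort.
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/sort.h>

struct Particle { int cell_x, cell_y; int done; float vx, vy; };

struct MakeKey {
    int nx;
    __host__ __device__ unsigned int operator()(const Particle &p) const {
        unsigned int cell = (unsigned int)(p.cell_y * nx + p.cell_x);
        unsigned int done = p.done ? 1u : 0u;
        unsigned int sx   = p.vx >= 0.0f ? 1u : 0u;
        unsigned int sy   = p.vy >= 0.0f ? 1u : 0u;
        // cell index dominates, then done-ness, then velocity directions
        return (cell << 3) | (done << 2) | (sx << 1) | sy;
    }
};

void sort_particles(thrust::device_vector<Particle> &parts, int nx)
{
    thrust::device_vector<unsigned int> keys(parts.size());
    MakeKey make_key; make_key.nx = nx;
    thrust::transform(parts.begin(), parts.end(), keys.begin(), make_key);
    thrust::sort_by_key(keys.begin(), keys.end(), parts.begin());
}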
Performance and Optimizations (4)
Intuition: avoid write conflicts in the overlap region between neighboring mesh blocks (atomics are expensive).
Performance and Optimizations (4)
Red-black scheduling: thwarted by the block scheduler.
Trade-off: the advantage from reducing atomics did not outweigh the added texture cache misses.
Time in seconds: the red-black variants regressed to 119 and 71 (vs. 118 and 41.8 without).
Performance and Optimizations (5)
Targeting multiple GPUs (Tesla M2090s):
- unsuccessful attempt: ions on one GPU and electrons on the second GPU
- successful attempt: 1/2 ions + 1/2 electrons on each GPU
Time in seconds: 70 (species split) vs. 42 (half-and-half split)
(A sketch of the half-and-half launch follows.)
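A sketch of the half-and-half dispatch; launch_push(), the stream array, and the single-pointer particle buffers are illustrative simplifications (per-device allocation and the cross-device reduction of the partial J grids are elided). The species split plausibly loses because the lighter electrons take far more sub-steps than the ions, leaving the ion GPU idle, whereas the half-and-half split gives both GPUs the same mix of work.

// Illustrative two-GPU dispatch: half of each species per device.
void launch_push(Particle *p, int n, cudaStream_t s);   // assumed wrapper

void push_on_two_gpus(Particle *ions, int n_ions,
                      Particle *elecs, int n_elecs,
                      cudaStream_t streams[2])
{
    for (int g = 0; g < 2; ++g) {
        cudaSetDevice(g);
        launch_push(ions  + g * (n_ions  / 2), n_ions  / 2, streams[g]);
        launch_push(elecs + g * (n_elecs / 2), n_elecs / 2, streams[g]);
    }
    for (int g = 0; g < 2; ++g) {           // wait for both devices, then
        cudaSetDevice(g);                   // reduce the partial J grids
        cudaDeviceSynchronize();
    }
}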
Conclusions
Co-Design was a wonderful experience.
Successful strategies:
- exploiting texture memory for storing the electric and magnetic fields
- usage of intrinsics and strength-reduction operations
- sorting particles by Cell-x and Cell-y
- sorting particles by done-ness and velocity directions
- 1/2 ions + 1/2 electrons on each GPU
Unsuccessful strategies:
- red-black strategy of launching blocks of GPU threads
- ions on one GPU and electrons on another GPU
Future work:
- dynamic load balancing: launch blocks to match the density profile
- domain decomposition across multiple GPUs
EXTRA
For a typical run, we load 80 × 10^6 particles (40 million ions, 40 million electrons) on grids of size 256 × 128 or 512 × 256. The total time of the simulation is $t = 10/\omega_{pe}$, with an artificial mass ratio of $m_i/m_e = 100$.
[Figure 2.3 repeated from earlier: initial conditions for the island equilibrium.]