Massively parallel semi-Lagrangian solution of the 6d Vlasov-Poisson problem
Massively parallel semi-Lagrangian solution of the 6d Vlasov-Poisson problem. Katharina Kormann (1), Klaus Reuter (2), Markus Rampp (2), Eric Sonnendrücker (1). (1) Max-Planck-Institut für Plasmaphysik, (2) Max Planck Computing and Data Facility. October 20, 2016.
Outline: Introduction, Interpolation and Parallelization, Numerical comparison of interpolators, Code optimization, Overlap of computation and communication.
Vlasov-Poisson equation and characteristics. Vlasov-Poisson equation for electrons in a neutralizing background:
$$\frac{\partial f}{\partial t}(t,x,v) + v \cdot \nabla_x f(t,x,v) - E(t,x) \cdot \nabla_v f(t,x,v) = 0,$$
$$-\Delta \phi(t,x) = 1 - \rho(t,x), \qquad E(t,x) = -\nabla \phi(t,x), \qquad \rho(t,x) = \int f(t,x,v)\, dv.$$
The advection equation keeps values constant along the characteristics:
$$\frac{dX}{dt} = V, \qquad \frac{dV}{dt} = -E(t,X).$$
Solution: $f(t,x,v) = f_0\big(X(0;t,x,v), V(0;t,x,v)\big)$.
Split semi-Lagrangian scheme. Given $f^{(m)}$ and $E^{(m)}$ at time $t_m$, we compute $f^{(m+1)}$ at time $t_m + \Delta t$ for all grid points $(x_i, v_j)$ as follows:
1. Solve $\partial_t f - E^{(m)} \cdot \nabla_v f = 0$ on a half time step: $f^{(m,*)}(x_i, v_j) = f^{(m)}\big(x_i, v_j + E^{(m)}_i \tfrac{\Delta t}{2}\big)$.
2. Solve $\partial_t f + v \cdot \nabla_x f = 0$ on a full time step: $f^{(m,**)}(x_i, v_j) = f^{(m,*)}(x_i - v_j \Delta t, v_j)$.
3. Compute $\rho(x_i)$ and solve the Poisson equation for $E^{(m+1)}$.
4. Solve $\partial_t f - E^{(m+1)} \cdot \nabla_v f = 0$ on a half time step: $f^{(m+1)}(x_i, v_j) = f^{(m,**)}\big(x_i, v_j + E^{(m+1)}_i \tfrac{\Delta t}{2}\big)$.
Use cascade interpolation for the x and v advection steps to reduce the interpolations to successive 1d interpolations on stripes of the domain. Main building block: 1d interpolation on stripes of the domain of the form $g(x_j) = f(x_j + \alpha)$.
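To make the structure of the split step concrete, here is a minimal 1d1v Python/NumPy sketch of the Strang-split semi-Lagrangian cycle for a weak Landau damping setup. The grid sizes, the FFT Poisson solver and the simple linear shift interpolation are assumptions standing in for the 6d production code and its Lagrange/spline interpolators.

```python
import numpy as np

# Minimal 1d1v sketch of the Strang-split semi-Lagrangian step (illustrative
# only; linear interpolation stands in for the Lagrange/spline interpolators).
nx, nv, dt = 64, 128, 0.1
x = np.linspace(0, 4 * np.pi, nx, endpoint=False); dx = x[1] - x[0]
v = np.linspace(-6.0, 6.0, nv, endpoint=False);    dv = v[1] - v[0]
X, V = np.meshgrid(x, v, indexing="ij")
f = np.exp(-V**2 / 2) / np.sqrt(2 * np.pi) * (1 + 0.01 * np.cos(0.5 * X))

def shift(g, alpha, dz):
    """Return g evaluated at z + alpha on a periodic grid (linear interp.)."""
    s = alpha / dz
    j = int(np.floor(s)); beta = s - j
    idx = (np.arange(g.size) + j) % g.size
    return (1 - beta) * g[idx] + beta * g[(idx + 1) % g.size]

def efield(f):
    """Solve -phi'' = 1 - rho with an FFT and return E = -phi'."""
    rho = f.sum(axis=1) * dv
    k = 2 * np.pi * np.fft.fftfreq(nx, d=dx)
    rhs_hat = np.fft.fft(1.0 - rho)
    phi_hat = np.zeros(nx, dtype=complex)
    phi_hat[1:] = rhs_hat[1:] / k[1:] ** 2
    return np.fft.ifft(-1j * k * phi_hat).real

E = efield(f)
for step in range(100):
    for i in range(nx):                       # 1. half step in v
        f[i, :] = shift(f[i, :], E[i] * dt / 2, dv)
    for j in range(nv):                       # 2. full step in x
        f[:, j] = shift(f[:, j], -v[j] * dt, dx)
    E = efield(f)                             # 3. Poisson solve
    for i in range(nx):                       # 4. half step in v
        f[i, :] = shift(f[i, :], E[i] * dt / 2, dv)
```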
Outline: Introduction, Interpolation and Parallelization, Numerical comparison of interpolators, Code optimization, Overlap of computation and communication.
Interpolation schemes. Let $z_j$ be a grid point and $\alpha = (\beta + \gamma)\Delta z$ be the shift of the grid point to the origin of the characteristic ($\beta \in [0,1]$, $\gamma \in \mathbb{Z}$).
- Fixed-interval Lagrange (odd number of points $q$): $f(z_j + \alpha) = \sum_{i=j-(q-1)/2}^{j+(q-1)/2} l_i(\alpha)\, f(z_i)$ for $|\alpha| \le \Delta z$.
- Centered-interval Lagrange (even number of points $q$): $f(z_j + \alpha) = f(z_{j+\gamma} + \beta) = \sum_{i=j+\gamma-q/2+1}^{j+\gamma+q/2} l_i(\beta)\, f(z_i)$.
- Cubic splines: global spline interpolant obtained by solving a linear system, evaluated as $f(z_j + \alpha) = \sum_{i=j+\gamma-1}^{j+\gamma+2} c_i\, S_i(\beta)$.
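A short sketch of the centered-interval Lagrange formula on a single periodic stripe may make the beta/gamma splitting concrete; the choice q = 4 and the periodic index wrap are assumptions made for the illustration, not properties of the production interpolators.

```python
import numpy as np

def lagrange_shift(g, alpha, dz, q=4):
    """Evaluate g(z_j + alpha) for all j of a periodic stripe with an
    even-q-point Lagrange stencil centred on the interval that contains
    the displaced point, where alpha = (beta + gamma) * dz."""
    s = alpha / dz
    gamma = int(np.floor(s))                  # integer part of the shift
    beta = s - gamma                          # fractional part in [0, 1)
    offs = range(-q // 2 + 1, q // 2 + 1)     # stencil offsets around the interval
    out = np.zeros_like(g)
    for m in offs:
        w = np.prod([(beta - k) / (m - k) for k in offs if k != m])
        out += w * np.roll(g, -(gamma + m))   # picks up g[j + gamma + m]
    return out

# quick check against the exact shift of a smooth periodic function
z = np.linspace(0, 2 * np.pi, 64, endpoint=False)
g = np.sin(z)
alpha = 0.37 * (z[1] - z[0])
print(np.max(np.abs(lagrange_shift(g, alpha, z[1] - z[0]) - np.sin(z + alpha))))
```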
Parallelization strategy 1: Remapping scheme. Two domain partitionings: one keeping x sequential and one keeping v sequential. Impact on interpolation: none, as long as the interpolation is 1d (or at least split into x and v parts). MPI communication: all-to-all communication; the fraction of the data to be communicated for $p$ MPI processes is $(p-1)/p$. Memory requirements: two copies of the 6d array (+ MPI communication buffers).
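The remapping step is essentially a distributed transpose. The following mpi4py sketch illustrates it for a small 2d analogue (one x and one v dimension), where the all-to-all moves exactly the $(p-1)/p$ fraction of the local data; the array sizes and layout are assumptions for the illustration.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
p, rank = comm.Get_size(), comm.Get_rank()

nx, nv = 8 * p, 8 * p                       # global grid, divisible by p
f_xseq = np.random.rand(nx // p, nv)        # v-sequential slab (x distributed)

# Cut the slab into p blocks along v, one destined for each rank, and
# exchange them; (p-1)/p of the local data leaves the process.
send = np.ascontiguousarray(
    f_xseq.reshape(nx // p, p, nv // p).transpose(1, 0, 2))
recv = np.empty_like(send)
comm.Alltoall(send, recv)

# Stack the received x-blocks: now x is sequential and v is distributed.
f_vseq = recv.reshape(nx, nv // p)
```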
Parallelization strategy 2: Domain decomposition. Patches of six-dimensional data blocks. Impact on interpolation: a local interpolant is needed (Lagrange, or local splines glued together with Hermite-type boundary conditions), which imposes an artificial CFL number; communication increases with the order. MPI communication: nearest-neighbor communication of halo cells around the local domain, with a size depending on the required halo width of the interpolator and the maximal displacement: $2wn^5$ per 1d interpolation. Memory requirements: two alternative implementations:
- Connected buffers: $(n + 2w)^6$ (+ MPI communication buffers).
- Dynamic halo buffers ("DD slim"): memory overhead of $2wn^5$; exploits the fact that only halos in one dimension at a time are necessary (+ MPI communication buffers, partly reused).
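The halo exchange along one direction can be sketched with mpi4py as below; a 1d ring of processes and the halo width w are assumptions made for the illustration (the production code exchanges halos of size $2wn^5$ around the local 6d blocks).

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
p, rank = comm.Get_size(), comm.Get_rank()
left, right = (rank - 1) % p, (rank + 1) % p

n_loc, w = 32, 3                              # local points and halo width (assumed)
g = np.random.rand(n_loc)                     # one local 1d stripe

# Nearest-neighbour exchange of w halo cells on a periodic ring.
halo_l, halo_r = np.empty(w), np.empty(w)
comm.Sendrecv(g[:w], dest=left, recvbuf=halo_r, source=right)
comm.Sendrecv(g[-w:], dest=right, recvbuf=halo_l, source=left)

g_ext = np.concatenate([halo_l, g, halo_r])   # n_loc + 2w points
# Interpolation of g(z + alpha) is now purely local as long as the
# displacement satisfies the CFL-like condition |alpha| <= w * dz.
```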
Lagrange interpolation. Let $x_j$ be a grid point and $\alpha = (\beta + \gamma)\Delta x$ be the shift of the grid point to the origin of the characteristic ($\beta \in [0,1]$, $\gamma \in \mathbb{Z}$). Interpolate $f$ at $x_j + \alpha$.
- $q$-point Lagrange interpolation, $q$ odd, with a fixed stencil around $x_j$: $f(x_j + \alpha) = \sum_{i=j-(q-1)/2}^{j+(q-1)/2} l_i(\alpha)\, f(x_i)$.
- $q$-point Lagrange interpolation, $q$ even, centered around the interval $[x_{j+\gamma}, x_{j+\gamma+1}]$: $f(x_j + \alpha) = f(x_{j+\gamma} + \beta) = \sum_{i=j+\gamma-q/2+1}^{j+\gamma+q/2} l_i(\beta)\, f(x_i)$.
Parallelization for distributed domains:
- Fixed stencil: CFL-like condition $|\alpha| \le \Delta z$, exchange of $(q-1)/2$ data points on each side.
- Centered stencil: CFL-like condition $|\alpha| \le w\Delta z$, exchange of $w + q/2$ data points on each side.
Impact of domain decomposition. The decomposition imposes a CFL-like condition. Vlasov-Poisson: the CFL-like condition is dominated by the x-advections, but there the displacement $\alpha = \Delta t\, v$ is constant over time. Idea: use the knowledge of the sign of $\alpha$ to reduce the data transfer. Resulting data transfer for the CFL-like condition $\alpha = (w + \beta)\Delta z$ with a centered stencil: $\max(q/2 - w, 0)$ points on the left side, $q/2 + w$ points on the right side. Total data to be sent: $q$ if $w \le q/2$.
Local cubic splines. Computation of the interpolant: use a local spline on each domain with Hermite-type boundary conditions from the neighboring domains [1]; use the fast algorithm introduced by Unser et al. [2]. Algorithm for $x_1, \dots, x_N$ processor-local and $\alpha = (\beta + \gamma)\Delta x$:
$$d_0 = \frac{1}{a}\Big( f(x_\gamma) + \sum_{i=1}^{M} \Big(\frac{b}{a}\Big)^i f(x_{\gamma-i}) \Big),$$
$$d_i = \frac{1}{a}\big( f(x_{i+\gamma}) - b\, d_{i-1} \big), \quad i = 1,\dots,N+1,$$
$$c_{N+2} = \frac{\sqrt{3}}{a(2+\sqrt{3})}\Big( f(x_{N+\gamma+2}) + \sum_{i=1}^{M} \Big(\frac{b}{a}\Big)^i \big( f(x_{N+2+\gamma-i}) + f(x_{N+2+\gamma+i}) \big) \Big),$$
$$c_i = \frac{1}{a}\big( d_i - b\, c_{i+1} \big), \quad i = N+1,\dots,0.$$
Here $a$ and $b$ are fixed coefficients of the spline recursion and $M$ determines the accuracy ($M = 27$ for machine precision).
Data exchange: the remote part of $d_0$ and $c_{N+2}$, plus $\max(-\gamma, 0)$ points on the left or $\max(\gamma + 1, 0)$ points on the right side.
[1] Crouseilles et al., J. Comput. Phys. 228, 2009. [2] Unser et al., IEEE Trans. Pattern Anal. Mach. Intell. 13, 1991.
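For reference, the global periodic cubic-spline interpolation that the patch-local algorithm approximates can be sketched as follows; a dense solve of the cyclic tridiagonal system is used here purely for clarity, so this is not the local algorithm of the slide.

```python
import numpy as np

def spline_shift_periodic(g, alpha, dz):
    """Cubic B-spline interpolation of g(z + alpha) on a periodic grid.
    Builds the *global* spline by solving the cyclic tridiagonal system
    (c[i-1] + 4 c[i] + c[i+1]) / 6 = g[i] and evaluates it with the
    standard uniform cubic B-spline weights."""
    n = g.size
    A = (4 * np.eye(n) + np.roll(np.eye(n), 1, axis=1)
         + np.roll(np.eye(n), -1, axis=1)) / 6.0
    c = np.linalg.solve(A, g)
    s = alpha / dz
    gamma = int(np.floor(s)); beta = s - gamma
    # weights of the four coefficients surrounding the displaced point
    w_m1 = (1 - beta) ** 3 / 6
    w_0 = (3 * beta ** 3 - 6 * beta ** 2 + 4) / 6
    w_p1 = (-3 * beta ** 3 + 3 * beta ** 2 + 3 * beta + 1) / 6
    w_p2 = beta ** 3 / 6
    idx = np.arange(n)
    return (w_m1 * c[(idx + gamma - 1) % n] + w_0 * c[(idx + gamma) % n]
            + w_p1 * c[(idx + gamma + 1) % n] + w_p2 * c[(idx + gamma + 2) % n])
```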
Outline: Introduction, Interpolation and Parallelization, Numerical comparison of interpolators, Code optimization, Overlap of computation and communication.
Weak Landau damping. Initial condition: $f_0(x,v) = \frac{1}{(2\pi)^{3/2}} \exp\!\big(-\tfrac{|v|^2}{2}\big)\big(1 + \alpha \sum_{l=1}^{3} \cos(k_l x_l)\big)$. Parameters: $\alpha = 0.01$, $k_x = 0.5$, periodic boundaries. The weak perturbation $\alpha = 0.01$ yields a mostly linear phenomenon; no real 6d effects; relatively good resolution on the studied grids. Error measure: absolute error in the field energy. Reference: created from a 1d solution with a spectral method at very high resolution (Jakob Ameres). Helios cluster: Sandy Bridge EP 2.7 GHz, 16 cores and 58 GB of usable memory per node, InfiniBand. Compiler: Intel 15, Intel MPI 5.0.3.
Figure: Interpolation error for various interpolators; error in the field energy vs. time step for spline, lag55, lag77, lag65 and lag67; $N_x = 16$, $N_v = 64$, weak Landau damping. The CFL-like condition (for the x-interpolation steps) is marked at the corresponding $\Delta t$.
Figure: Interpolation error for various interpolators; error vs. total CPU time for dds65, dds67, dds77, dds77 (h32), rmp spline. Data points: ($N_x = 8$, $N_v = 32$, $\Delta t = 0.1$, 1 MPI), ($N_x = 16$, $N_v = 64$, $\Delta t = 0.05$, 8 MPI), ($N_x = 32$, $N_v = 128$, $\Delta t = 0.01$, 2048 MPI).
Bump-on-tail. Initial condition: $f_0(x,v) = \frac{1}{(2\pi)^{3/2}}\big(0.9\, e^{-v_1^2/2} + 0.2\, e^{-2(v_1 - 4.5)^2}\big)\, e^{-(v_2^2+v_3^2)/2}\big(1 + \varepsilon \sum_{l=1}^{3}\cos(0.3 x_l)\big)$. Instability and nonlinear effects. Relatively bad resolution on the studied grid. Error measure: absolute error in the field energy (until time 15). Reference: solution with Lagrange interpolation of order 6,7 on a grid with $64^6$ data points.
Figure: Bump-on-tail, field energy as a function of time. Simulation with memory-slim domain decomposition and Lagrange 6,7 (7,7 for N = 128). Number of processes (MPI x OMP): 1x1, 16x1, ...
Figure: Interpolation error (until t = 15) vs. time step; $N_x = N_v = 32$, 16 MPI processes; curves for rmp, lag77; rmp, spl33; dds, lag77; dds, lag67; dds, spl33. The CFL-like condition (for the x-interpolation steps) is marked at the corresponding $\Delta t$.
Figure: Interpolation error (until t = 15) vs. wall time; $N_x = N_v = 32$, 16 MPI processes; curves for rmp, lag77; rmp, spl33; dds, lag77; dds, lag67; dds, spl33. The CFL-like condition (for the x-interpolation steps) is marked.
Outline: Introduction, Interpolation and Parallelization, Numerical comparison of interpolators, Code optimization, Overlap of computation and communication.
Single core performance. A speedup of the total domain decomposition code (main loop) by at least a factor of 2 was obtained by:
- Avoiding Fortran convenience idioms (:).
- Forcing inlining of the interpolators into the advector modules.
- Cache blocking: memory access to 1d stripes with large stride in the 6d array is slow; instead, extract them in blocks along the first dimension to exploit hardware prefetching.
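The cache-blocking idea can be illustrated in NumPy terms; the production code operates on the 6d Fortran array where the first index is contiguous, so the 3d C-ordered array and the trivial stand-in interpolation below are assumptions for the illustration only.

```python
import numpy as np

n, bs = 64, 8
f = np.random.rand(n, n, n)        # C order: the last index is contiguous
out = np.empty_like(f)

# Advection along axis 0: the 1d stripes f[:, i2, i3] have a large stride.
# Instead of extracting them one by one, copy a block of 'bs' neighbouring
# stripes along the contiguous index and process them together.
for i2 in range(n):
    for i3 in range(0, n, bs):
        block = np.ascontiguousarray(f[:, i2, i3:i3 + bs])   # blocked copy
        out[:, i2, i3:i3 + bs] = np.roll(block, 1, axis=0)    # stand-in interpolation
```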
Effect of cache blocking. Configuration: Lagrange 7,7 on a cube of $32^6$ points, 5 time steps. Hardware: Sandy Bridge node (mick@mpcdf) with 16 cores. Table: CPU time (in s) for the advections in each direction, with and without cache blocking, and their sum.
Single node performance (on Sandy Bridge). Figure: wall clock time vs. number of processors for OMP and MPI parallelization, with a linear-scaling reference line.
Single node scalability. Configuration: Lagrange 7,7 on a cube of $32^6$ points, 5 time steps. Hardware: Sandy Bridge node (mick@mpcdf) with 16 cores. Figure: speedup compared to a single CPU for 1 MPI x 16 OMP, 2 MPI x 8 OMP and 16 MPI x 1 OMP, for dds (with and without cache blocking), dd (with and without cache blocking) and rmp (with cache blocking [6.2], without [5.4]). Note: the remap send/receive-buffer copying is not OMP-parallelized.
Single node performance. Configuration: Lagrange 7,7. Hardware: Haswell node. Table: time dd slim [s] and time dd [s] for various MPI/OMP combinations.
Multi node performance. Configuration: Lagrange 7,7. Hardware: 32 Haswell nodes.

MPI/OMP    time dd slim    time dd    time rmp
64/...     ... s           103 s      [336 s]
1024/...   ... s           117 s      177 s
Memory consumption of the parallelization algorithms. Parameters: $N_x = 16$, $N_v = 64$, 20 time steps. Configuration: 8 MPI processes, 1 OMP thread each, on a MAIK node of the RZG (Sandy Bridge), Intel 15.

Interpolator    Algorithm    main memory [GB]
Lagrange 7,7    Remap        6.3
Lagrange 7,7    DD           8.0
Lagrange 7,7    DD slim      1.6
Lagrange 6,7    DD slim      1.4
Splines         Remap        6.3
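A back-of-the-envelope evaluation of the two memory formulas from the domain-decomposition slide shows where the "DD slim" savings come from; the local grid size n, halo width w and 8-byte reals below are assumptions for the illustration, not the measured numbers in the table above.

```python
# Per-process memory model: (n + 2w)^6 for connected halo buffers versus
# n^6 + 2*w*n^5 for the dynamic ("DD slim") halos in one dimension at a time.
n, w, bytes_per_value = 32, 3, 8

base = n**6 * bytes_per_value
connected = (n + 2 * w)**6 * bytes_per_value
slim = (n**6 + 2 * w * n**5) * bytes_per_value

for name, b in [("local block", base), ("connected halos", connected),
                ("DD slim", slim)]:
    print(f"{name:>15}: {b / 2**30:6.2f} GiB ({b / base:4.2f}x the block)")
```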
Strong scaling. Configuration: $N = 64^6$, 50 time steps, 7-point Lagrange, 4 MPI ranks with 5 OMP threads each per node. Hardware: Ivy Bridge (hydra@mpcdf), 64 GB per node, InfiniBand FDR14. Figure: wall clock time [s] vs. number of cores for remap, domain decomposition with 64-bit halos, and domain decomposition with 32-bit halos.
Is the code portable to the Intel Xeon Phi (KNL)?

            KNL                       Xeon (Draco)
clock       ... GHz                   ... GHz
memory      16 GB HBM, 96 GB DRAM     128 GB DRAM
Results on KNL. Configuration: Lagrange 7,7. Hardware: KNL in cache mode, Intel 17. Table: time dd slim [s], time dd [s] and speedup for different MPI/OMP combinations.
Results on KNL. Configuration: 7-point Lagrange, OMP only, no hyperthreading. Table: dd and dd slim timings on KNL (HBM), KNL (DDR) and Haswell for different grid sizes.
Outline: Introduction, Interpolation and Parallelization, Numerical comparison of interpolators, Code optimization, Overlap of computation and communication.
Overlap of communication and computation. Algorithm: advection with fixed-interval Lagrange interpolation in the domain decomposition.
Copy data to send buffer;
MPI communication of halos;
for i6 do
  for i5 do
    for i4 do
      for i2 do
        for i1 do
          Copy 1d stripe over i3 into scratch buffer;
          Interpolation along x3;
          Copy 1d stripe back to 6d array;
        end
      end
    end
  end
end
Algorithm 1: Advection along x_3.
for block do
  Copy data to send buffer for block;
  MPI communication of halos for block;
  for i6 in block do
    for i5 do
      for i4 do
        for i2 do
          for i1 do
            Copy 1d stripe over i3 into scratch buffer;
            Interpolation along x3;
            Copy 1d stripe back to 6d array;
          end
        end
      end
    end
  end
end
Algorithm 2: Advection along x_3.
Overlap of communication and computation. Algorithm: advection with fixed-interval Lagrange interpolation in the domain decomposition. Idea: split the advection into blocks with separate MPI communication, so that the communication for one block can be overlapped with the computation on another block. Blocking in $x_6 = v_3$ for the x-advections and in $x_3$ for the v-advections. Implemented with nested OMP parallelism and OMP locks for the memory-slim domain decomposition. First result on 32 Haswell nodes (draco@mpcdf, 64 MPI processes, 8 OMP threads each), $64^6$ grid, 5 time steps, Lagrange 7,7, 4 blocks per advection: dd slim overlap: ... s, dd slim plain: ... s.
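The block-wise overlap can also be sketched with non-blocking MPI. The production code instead drives the overlap with nested OpenMP threads and locks, so this mpi4py version (ring topology, block sizes, stand-in interpolation) is only an illustration of the pipelining idea.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
p, rank = comm.Get_size(), comm.Get_rank()
left, right = (rank - 1) % p, (rank + 1) % p

nblocks, nloc, w = 4, 32, 3                    # blocks per advection, points, halo width
blocks = [np.random.rand(nloc) for _ in range(nblocks)]
halos = [(np.empty(w), np.empty(w)) for _ in range(nblocks)]

def post_halo_exchange(b, k):
    """Start the non-blocking halo sends/receives for block k."""
    hl, hr = halos[k]
    return [comm.Irecv(hl, source=left, tag=2 * k),
            comm.Irecv(hr, source=right, tag=2 * k + 1),
            comm.Isend(b[-w:], dest=right, tag=2 * k),
            comm.Isend(b[:w], dest=left, tag=2 * k + 1)]

def advect(b, hl, hr):
    """Stand-in for the 1d interpolation on the halo-extended stripe."""
    return np.roll(np.concatenate([hl, b, hr]), 1)[w:-w]

pending = post_halo_exchange(blocks[0], 0)
for k in range(nblocks):
    if k + 1 < nblocks:                        # start communication for the next block
        nxt = post_halo_exchange(blocks[k + 1], k + 1)
    MPI.Request.Waitall(pending)               # halos of block k have arrived
    blocks[k] = advect(blocks[k], *halos[k])   # compute while the next exchange runs
    if k + 1 < nblocks:
        pending = nxt
```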
Overlap: Preliminary results. Figure: timeline of the advection (adv), halo preparation (prep) and halo exchange (exch) phases per thread id over time [s]. Note: advection (adv) and halo preparation (prep) use nested OpenMP threads to utilize all available CPU cores.
Overlap: Zoom into the first advection block. Note: advection (adv) and halo preparation (prep) use nested OpenMP threads to utilize all available CPU cores.
Conclusions.
Summary:
- Interpolation: Lagrange better than splines for good resolution, splines better for low resolution. Lagrange is better suited for distributed domains.
- The memory-slim implementation of the domain decomposition enables the solution of large-scale problems.
- Domain decomposition scales better than remap on thousands of processors.
- The remap algorithm gives more flexibility in the time step size and for global interpolants.
Outlook:
- Include a magnetic field and multidimensional interpolation.
- Further exploit the potential of overlapping communication and computation.