S4283 - Subdivide: Micromagnetism FEM/BEM Simulations on Single-Node/Multi-GPU Systems
Elmar Westphal, Forschungszentrum Jülich GmbH
Contents
- Micromagnetism
- TetraMag, a FEM/BEM micromagnetism simulator
- Porting TetraMag to CUDA
- Porting TetraMag to multi-GPU CUDA
- Benchmarks
Micromagnetism
In micromagnetism, we investigate:
- the structure of ferromagnetic domains
- the structure, dynamics and motion of domain walls
- the structure and dynamics of magnetic vortices
- spin waves, etc.
As a mesoscopic theory, it can provide a link between simulation and experiment.
Magnetism on Different Length Scales
- Macroscopic models: hysteresis models, response to external parameters
- Domain theory: subdivision into domains, details of the magnetic structure are neglected
- Micromagnetism: continuum theory; magnetic nanostructures, domain walls, vortices
- Heisenberg model: atomistic effects, spin chains
- Quantum theory: electronic structure
Magnetism on Different Time Scales
(figure)
Most Recent Achievement
Discovery of the Spin Cherenkov effect [1] (the magnetic equivalent of the sonic boom).
Geometry:
- 2 µm x 1 µm x 1 µm Permalloy prism
- 5 nm resolution (100 million tetrahedrons, 16 million discretisation nodes)
[1] M. Yan, A. Kákay, C. Andreas, R. Hertel, "Spin Cherenkov effect and magnonic Mach cones", Physical Review B (Rapid Communications) 88, 220412(R) (2013).
About TetraMag
Code started by Riccardo Hertel [1], extended and ported by Attila Kakay and Elmar Westphal [2][3].
Upcoming: details about
- calculation steps
- matrix properties
- older CUDA versions
- new challenges
[1] R. Hertel, "Micromagnetic simulations of magnetostatically coupled Nickel nanowires", Journal of Applied Physics 90(11), 5752-5758 (2001).
[2] A. Kakay, E. Westphal, R. Hertel, "Speedup of FEM micromagnetic simulations with graphical processing units", IEEE Transactions on Magnetics 46(6), 2303-2306 (2010).
[3] http://www.fz-juelich.de/pgi/pgi-6/de/forschung/MagnetizationDynamics/Simulations/_node.html
Calculation Steps
- Calculate the magnetostatic field with a scalar potential, split into two parts: U = U1 + U2
  - U1 is the solution of the inhomogeneous Neumann problem
  - U2 satisfies Laplace's equation with Dirichlet boundary conditions
- Solve/integrate the Landau-Lifshitz-Gilbert equation of motion
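The splitting above follows the standard Fredkin-Koehler approach; a sketch of the governing equations (sign conventions and normalisation may differ from TetraMag's internals; B denotes the dense boundary operator evaluated in the BEM step):

```latex
\begin{align*}
U &= U_1 + U_2 \\
\nabla^2 U_1 &= \nabla\cdot\vec{M} \quad\text{in } \Omega,
  & \frac{\partial U_1}{\partial n} &= \vec{M}\cdot\vec{n} \quad\text{on } \partial\Omega
  \quad\text{(inhomogeneous Neumann problem)}\\
\nabla^2 U_2 &= 0 \quad\text{in } \Omega,
  & U_2\big|_{\partial\Omega} &= B\, U_1\big|_{\partial\Omega}
  \quad\text{(Laplace equation, Dirichlet data from BEM)}
\end{align*}
```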
Magnetostatic Field
Calculated in three steps:
1. Iterative solution of a sparse linear system for U1
2. Dense/hierarchical matrix-vector multiplication to obtain U2 in the boundary regions (not covered in this talk)
3. Iterative solution of a sparse linear system for U2 within the magnetic region
The sparse systems are solved using multi-GPU linbcg and BiCGSTAB solvers.
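As an illustration of the iterative solver step, here is a minimal, unpreconditioned BiCGSTAB sketch operating on a CSR matrix in pure Python. This is only a sketch of the algorithm family named above; the actual TetraMag solvers are multi-GPU CUDA implementations, and all function names here are illustrative.

```python
def csr_matvec(vals, cols, row_ptr, x):
    """y = A @ x for a matrix stored in CSR format."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[cols[k]]
    return y

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def bicgstab(vals, cols, row_ptr, b, tol=1e-10, max_iter=1000):
    """Unpreconditioned BiCGSTAB (van der Vorst) for A x = b, x0 = 0."""
    n = len(b)
    x = [0.0] * n
    r = [bi - yi for bi, yi in zip(b, csr_matvec(vals, cols, row_ptr, x))]
    r_hat = r[:]                      # fixed shadow residual
    rho = alpha = omega = 1.0
    v = [0.0] * n
    p = [0.0] * n
    for _ in range(max_iter):
        rho_new = dot(r_hat, r)
        beta = (rho_new / rho) * (alpha / omega)
        rho = rho_new
        p = [ri + beta * (pi - omega * vi) for ri, pi, vi in zip(r, p, v)]
        v = csr_matvec(vals, cols, row_ptr, p)
        alpha = rho / dot(r_hat, v)
        s = [ri - alpha * vi for ri, vi in zip(r, v)]
        t = csr_matvec(vals, cols, row_ptr, s)
        omega = dot(t, s) / dot(t, t)
        x = [xi + alpha * pi + omega * si for xi, pi, si in zip(x, p, s)]
        r = [si - omega * ti for si, ti in zip(s, t)]
        if dot(r, r) ** 0.5 < tol:    # converged
            break
    return x
```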
Time Integration
- Includes field calculations and vector operations
- Performed using CVODE from the SUNDIALS package
- CVODE's NVector structure and its operations have been ported to CUDA for single-host multi-GPU systems
- Memory consuming (~1 KB/node, the limiting factor for GPU usage):
  - field calculations use a sparse matrix and several field vectors
  - CVODE internally needs many helper vectors
Properties of TetraMag's Sparse Matrices
- Contain ~15 non-zero elements per row/column
- Distribution of elements depends on the underlying mesh (regular to seemingly random)
- Symmetric (for the magnetostatic field calculation) or asymmetric (for the exchange field calculation)
- Static over the whole program run
TetraMagCUDA
- Around since ~2009 (see GTC 2010 poster [1])
- CUDA parts were piggy-backed onto CPU routines
- Tries to copy as many sparse matrices to device memory as possible; the remainder stays on the CPU
- GPU-only execution limited to problem sizes of ~1M nodes
- Sufficient for most use cases at the time
[1] http://www.gputechconf.com/content/gtc/posters/2010/Q02-Massively-Parallel-Micromagnetic-FEM-Calculations-with-Graphical-Processing-Units.pdf
New Challenges
- Simulations at experimental scale (µm) require larger problem sizes (10M+ nodes)
- A single matrix plus the solver/integrator vectors often exceeds device memory
- Copying matrices sequentially is no longer sufficient
Possible solutions:
- Reduce the memory footprint
- Distribute the problem over multiple GPUs
Reduce Memory Footprint
- Use symmetry
  - Effective, but not possible for all matrices
  - Often needs atomic operations (slow)
- Reduce the precision of off-diagonal elements
  - Moderately effective (8+4 -> 4+4 bytes/value)
  - May lead to unacceptable loss of precision
Reduce Memory Footprint (cont.)
- Extract diagonals
  - Good for performance (coalesced access)
  - Very good results for symmetric matrices based on regular meshes (2 x (8+4) -> 8 bytes)
  - Mixed results otherwise
- Combinations of some/all of the above
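The precision-reduction idea can be illustrated with Python's array module: matrix values stored as 4-byte floats instead of 8-byte doubles (indices stay 4-byte integers), while the accumulation still happens in double precision. A toy sketch with a made-up 3x3 matrix, not TetraMag code:

```python
from array import array

# Toy 3x3 CSR matrix, stored once in double and once in single precision.
vals64 = array('d', [4.0, 1.0, 1.0, 3.0, 1.0, 1.0, 2.0])  # 8 bytes/value
vals32 = array('f', vals64)                                # 4 bytes/value
cols = array('i', [0, 1, 0, 1, 2, 1, 2])                   # 4 bytes/index
row_ptr = array('i', [0, 2, 5, 7])

def matvec(vals, cols, row_ptr, x):
    # Accumulation happens in Python floats (double precision),
    # regardless of the storage precision of the matrix values.
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[cols[k]]
        y.append(acc)
    return y

x = [1.0, 2.0, 3.0]
y64 = matvec(vals64, cols, row_ptr, x)
y32 = matvec(vals32, cols, row_ptr, x)
# Storage per off-diagonal element drops from 8+4 to 4+4 bytes,
# at the cost of rounding the stored matrix values to float32.
```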
Preprocessing Approaches for Matrix Distribution
Matrices can be preprocessed for the GPU(s) in different ways:
- Traditional single-GPU calculation
- Naive multi-GPU distribution
- Checkerboard-style distribution
Traditional MVM on a Single GPU
(figure: the complete matrix and vector are multiplied on a single GPU)
Distribution over Multiple GPUs
Naive approach:
- Divide the matrix into N_GPU sub-matrices with N_row/N_GPU rows each
- Copy one sub-matrix to each GPU
- Copy the vector to all GPUs
- Perform the partial multiplications
- Copy the partial results to all other GPUs
- Repeat (if needed)
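The steps above can be simulated on the host to see where the transfer overhead comes from. A toy sketch in which "GPUs" are plain Python lists and the matrix is dense; all names are illustrative, and the transfer counter simply counts vector elements moved between devices:

```python
def split_rows(matrix, n_gpu):
    """Divide a row-list matrix into n_gpu blocks of consecutive rows."""
    n = len(matrix)
    chunk = (n + n_gpu - 1) // n_gpu
    return [matrix[g * chunk:(g + 1) * chunk] for g in range(n_gpu)]

def naive_multi_gpu_mvm(matrix, x, n_gpu):
    """Naive scheme: full vector to every GPU, partial results broadcast."""
    sub_matrices = split_rows(matrix, n_gpu)   # one sub-matrix per GPU
    transferred = len(x) * n_gpu               # full vector to every GPU
    partials = []
    for sub in sub_matrices:                   # partial multiplications
        partials.append([sum(a * b for a, b in zip(row, x)) for row in sub])
    # each partial result is copied to all other GPUs
    transferred += sum(len(p) for p in partials) * (n_gpu - 1)
    y = [v for p in partials for v in p]       # gathered result
    return y, transferred
```

Even in this toy model, the number of transferred elements grows with the number of GPUs while the arithmetic work per GPU shrinks, which is the imbalance the checkerboard scheme addresses.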
Naive Distribution, Preparation and First Multiplication
(figure: copy the partial matrices to the other GPUs, copy the vectors to all other GPUs, then calculate)
Naive Distribution, Subsequent Multiplications
(figure: copy the vectors to the other GPUs, then calculate)
Naive Distribution Approach
Pro:
- Easy to implement
Con:
- All sub-matrices need vector data from all GPUs at the beginning of the calculation
- The data transfer overhead is about as expensive as the actual calculation -> performance often below the single-GPU solution
Checkerboard Approach
- Split each sub-matrix into N_GPU sub-sub-matrices; each of these needs vector values from only one GPU
- Perform the multiplication of the first sub-sub-matrix with its partial vector
- At the same time, copy the vector part needed for the next multiplication into a (double) buffer in a different stream
- Repeat for the other sub-sub-matrices
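A minimal host-side sketch of the checkerboard loop for the rows owned by one GPU, again with a dense toy matrix. In the real code the next vector slice is prefetched asynchronously in a separate CUDA stream while the current block is multiplied; here the loop only shows the data dependencies, and all names are illustrative:

```python
def checkerboard_mvm(matrix, x, n_gpu):
    """One GPU's row block, split column-wise into n_gpu sub-sub-matrices.

    Column block j only touches the slice of x owned by GPU j, so each
    loop iteration needs a transfer from exactly one other GPU (which the
    real code overlaps with the previous iteration's multiplication).
    """
    n = len(x)
    chunk = (n + n_gpu - 1) // n_gpu
    y = [0.0] * len(matrix)
    for j in range(n_gpu):                 # loop over sub-sub-matrices
        lo, hi = j * chunk, min((j + 1) * chunk, n)
        x_slice = x[lo:hi]                 # "copied" from GPU j's buffer
        for i, row in enumerate(matrix):   # accumulate partial products
            y[i] += sum(a * b for a, b in zip(row[lo:hi], x_slice))
    return y
```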
Checkerboard-Style MVM
(figure: the partial matrix on GPU 0 is split into blocks 0-3; block i needs vector data from GPU i; the partial vectors are staged through a double buffer while the partial results accumulate)
Matrix Preprocessing
- Original matrix in a CSR-like format
- Blocks of N_warpsize rows are transposed to enable coalesced memory access
- Distribution of the data destroys the uniformity of the row lengths; zero padding may be necessary -> wasted memory
- Rows are sorted and re-indexed by number of non-zero elements -> minimal padding
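The sort-and-pad step can be sketched as follows (a sliced-ELL-style layout; block=4 stands in for the warp size, the row data are made up, and the transposition for coalescing is omitted):

```python
def pack_rows(rows, block=4):
    """Sort rows by length, then pad each block of `block` rows to that
    block's maximum length. Returns (row order, padded blocks, number of
    padding zeros added)."""
    order = sorted(range(len(rows)), key=lambda i: len(rows[i]))
    blocks, padded = [], 0
    for s in range(0, len(order), block):
        ids = order[s:s + block]
        width = max(len(rows[i]) for i in ids)   # block-local max length
        blocks.append([rows[i] + [0.0] * (width - len(rows[i]))
                       for i in ids])
        padded += sum(width - len(rows[i]) for i in ids)
    return order, blocks, padded
```

Sorting groups rows of similar length into the same block, so the per-block padding (and hence the wasted memory) stays small; the `order` list is the re-indexing that maps sorted rows back to their original positions.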
Optimizing Data Transfers
- Depending on the problem, sub-sub-matrices can get very small and access only very few vector elements
- The multiplication time is then short, but transferring potentially unneeded elements takes much longer
- Solution: transfer only those elements that are really needed
Creating Export Vectors
Each GPU needs an individual set of vector elements from every other GPU:
- Approach one: create N_GPU x (N_GPU-1) export vectors
  - Consumes much memory
- Approach two: rewrite the export vector N_GPU x (N_GPU-1) times
  - May need to read/write all elements N_GPU x (N_GPU-1) times
- Approach three: do some (lots of) work in preprocessing
Building the Perfect(?) Set of Export Vectors
During preprocessing:
- For each vector element, build a key value encoding which other GPUs need it
- Using one bit per target GPU results in at most 2^(N_GPU-1) key values (usually N_GPU <= 8 in a node)
- Use the key values as sort key, re-index in sorted order
- Store the block offsets for the keys
Building the Perfect(?) Set of Export Vectors (cont.)
During multiplication:
- Build the export vector according to the index generated in preprocessing
- During the loop over all sub-sub-matrices:
  - Loop over all export blocks
  - Asynchronously copy the blocks containing elements needed by the next GPU into its buffer
Key values are calculated relative to the originating GPU (example with GPUs 0-3):
- Columns 0-2 are originally on GPU 0; relative key bit values for GPUs 0-3: 0, 1 (001), 2 (010), 4 (100)
  - An element needed only by GPU 3: key 0+0+0+4 = 4
  - An element needed by GPUs 1 and 2: key 0+1+2+0 = 3
- Columns 6-8 are originally on GPU 2; relative key bit values for GPUs 0-3: 2 (010), 4 (100), 0, 1 (001)
  - An element needed by GPUs 1 and 3: key 0+4+0+1 = 5
Build the export buffer during preprocessing (keys from the matrix for GPU 0):

index:    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
ref. key: 4  2  6  3  0  4  0  6  7  6  1  2  5  4  4  0  3  1  2  3

Reorder (stable sort by key; key 0 means the element is needed by no other GPU and is not exported) to obtain the export index:

export:    10  17   1  11  18   3  16  19   0   5  13  14  12   2   7   9   8
ref. key:   1   1   2   2   2   3   3   3   4   4   4   4   5   6   6   6   7
binary:   001 001 010 010 010 011 011 011 100 100 100 100 101 110 110 110 111

Blocks sent to the export streams during the multiplication loop:
- exported to GPU 1 (bit 001): the blocks whose key has bit 001 set (keys 1, 3, 5, 7)
- exported to GPU 2 (bit 010): the blocks whose key has bit 010 set (keys 2, 3, 6, 7)
- exported to GPU 3 (bit 100): the blocks whose key has bit 100 set (keys 4, 5, 6, 7)
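The key-sort construction above can be reproduced in a few lines, using the example key values for GPU 0. A host-side sketch; in the real code this runs once as GPU-side preprocessing, and `blocks_for_gpu` is an illustrative name:

```python
# Example key values for GPU 0 (three other GPUs, one bit each).
keys = [4, 2, 6, 3, 0, 4, 0, 6, 7, 6, 1, 2, 5, 4, 4, 0, 3, 1, 2, 3]

# Elements with key 0 are needed by no other GPU and are not exported.
# A stable sort by key groups the remaining elements into one
# contiguous block per key value.
export = sorted((i for i, k in enumerate(keys) if k != 0),
                key=lambda i: keys[i])

# Block offsets: start of each key's block inside the export vector.
offsets = {}
for pos, i in enumerate(export):
    offsets.setdefault(keys[i], pos)

def blocks_for_gpu(bit):
    """Keys whose bit is set, i.e. the blocks copied to that target GPU."""
    return [k for k in sorted(offsets) if k & bit]

# blocks_for_gpu(0b001) -> [1, 3, 5, 7]   (blocks sent to GPU 1)
# blocks_for_gpu(0b010) -> [2, 3, 6, 7]   (blocks sent to GPU 2)
# blocks_for_gpu(0b100) -> [4, 5, 6, 7]   (blocks sent to GPU 3)
```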
Pros and Cons
Pro:
- The export vector is only built once per multiplication
- No element needs to be stored more than once
- No element is transferred more often than necessary
Con:
- Limited number of GPUs, because the number of blocks grows exponentially
- Time-consuming preprocessing
Further Optimisations
- Sub-sub-matrix multiplications are ordered by matrix size (number of non-zero elements)
  - More likely to correlate larger transfers with longer calculations
- Export vectors are copied to host memory during the initial calculations
  - Allows parallel import on devices with only one copy engine
Preprocessing Time
- Preprocessing involves multiple expensive indexing and sorting steps
- May take 10-20 seconds per matrix (with n ~ 20M), depending on the number of GPUs used
- Happens only once, because the matrices are static; a typical run includes millions of solver iterations/matrix multiplications
Benchmarks
Solver iteration times for:
- A cube with a regular mesh: very diagonal-dominant matrices, little data transfer
- Two problems of similar size and nature:
  - A round disk with an irregular mesh structure: few extractable diagonals, lots of data transfer
  - A round disk with a partially regular mesh structure: some extractable diagonals, moderate data transfer
Time per solver iteration for the 1 µm cube (8.1M nodes, regular mesh)
(chart: time per solver iteration [ms] vs. number of GPUs (1-8), for Titan and GTX 690; CPU comparison: 2 x E5-2680 (10 cores, 2.8 GHz): 134 ms)
Time per solver iteration for the disk (6.x M nodes, different mesh structures)
(chart: time per solver iteration [ms] vs. number of GPUs (1-8), for Titan and GTX 690 with regular and irregular meshes, plus GTX 690 with 1 GPU/card on the irregular mesh; CPU comparison: 2 x E5-2680 (10 cores, 2.8 GHz): 100 ms regular, 118 ms irregular)
Conclusions
- Distributing matrices and vectors over multiple GPUs allows us to simulate significantly larger samples
- Performance scaling depends largely on the amount of data exchanged between the GPUs
- Optimising the mesh structure is very important in multi-GPU setups
Questions?