S4283 - Subdivide, Preprocess and Conquer: Micromagnetism FEM/BEM-Simulations on Single-Node/Multi-GPU Systems


S4283 - Subdivide, Preprocess and Conquer: Micromagnetism FEM/BEM-Simulations on Single-Node/Multi-GPU Systems. Elmar Westphal - Forschungszentrum Jülich GmbH

Contents
- Micromagnetism
- TetraMag, a FEM/BEM Micromagnetism Simulator
- Porting TetraMag to CUDA
- Porting TetraMag to multi-GPU CUDA
- Benchmarks

Micromagnetism
In micromagnetism, we investigate:
- the structure of ferromagnetic domains
- the structure, dynamics and motion of domain walls
- the structure and dynamics of magnetic vortices
- spin waves, etc.
As a mesoscopic theory, it can provide a link between simulation and experiment.

Magnetism on Different Length Scales (applied to magnetic nanostructures)
- Macroscopic models: hysteresis models, response to external parameters
- Domain theory: subdivision into domains, details of the magnetic structure are neglected
- Micromagnetism: continuum theory, domain walls, vortices
- Heisenberg model: atomistic effects, spin chains
- Quantum theory: electronic structure

Magnetism on Different Time Scales (figure)

Most Recent Achievement
Discovery of the Spin Cherenkov Effect [1] (the magnetic equivalent of the sonic boom)
Geometry:
- 2 µm x 1 µm x 1 µm Permalloy prism
- 5 nm resolution (100 million tetrahedrons, 16 million discretisation nodes)
[1] M. Yan, A. Kákay, C. Andreas, & R. Hertel. Spin Cherenkov effect and magnonic Mach cones. Physical Review B (Rapid Communications), 88, 220412(R) (2013)

About TetraMag
- Code started by Riccardo Hertel [1], extended and ported by Attila Kakay and Elmar Westphal [2][3]
- Upcoming: details about
  - calculation steps
  - matrix properties
  - older CUDA versions
  - new challenges
[1] Hertel, R. (2001). Micromagnetic simulations of magnetostatically coupled Nickel nanowires. Journal of Applied Physics, 90(11), 5752-5758.
[2] Kakay, A., Westphal, E., & Hertel, R. (2010). Speedup of FEM micromagnetic simulations with graphical processing units. IEEE Transactions on Magnetics, 46(6), 2303-2306.
[3] http://www.fz-juelich.de/pgi/pgi-6/de/forschung/MagnetizationDynamics/Simulations/_node.html

Calculation Steps
- Calculate the magnetostatic field from a scalar potential U, split into two parts: U = U1 + U2
  - U1 is the solution of an inhomogeneous Neumann problem
  - U2 satisfies Laplace's equation with Dirichlet boundary conditions
- Solve/integrate the Landau-Lifshitz-Gilbert equation of motion
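For reference, the Landau-Lifshitz-Gilbert equation integrated in the last step can be written in its standard Gilbert form (notation mine, not transcribed from the slides):

    \frac{\partial \mathbf{M}}{\partial t}
      = -\gamma\, \mathbf{M} \times \mathbf{H}_{\mathrm{eff}}
      + \frac{\alpha}{M_s}\, \mathbf{M} \times \frac{\partial \mathbf{M}}{\partial t}

Here \gamma is the gyromagnetic ratio, \alpha the Gilbert damping constant, M_s the saturation magnetisation, and \mathbf{H}_{\mathrm{eff}} the effective field, which contains (among other contributions) the exchange field and the magnetostatic field derived from the potential U.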

Magnetostatic Field
Calculated in 3 steps:
- iterative solution of a sparse linear system for U1
- dense/hierarchical matrix-vector multiplication to obtain U2 in the boundary regions (not covered in this talk)
- iterative solution of a sparse linear system for U2 within the magnetic region
The sparse systems are solved using multi-GPU linbcg and BiCGStab solvers.
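For orientation, the BiCGStab iteration such a solver runs looks roughly like the textbook, unpreconditioned sketch below. This is not the multi-GPU TetraMag solver; in the real code the matrix-vector product and the dot products run distributed across the GPUs, here they are abstracted by 'spmv' and a plain host loop.

    // Textbook unpreconditioned BiCGStab for A*x = b; 'spmv' abstracts the
    // (in TetraMag: multi-GPU) sparse matrix-vector product.
    #include <cmath>
    #include <functional>
    #include <vector>

    using Vec  = std::vector<double>;
    using Spmv = std::function<void(const Vec& x, Vec& y)>;   // y = A*x

    static double dot(const Vec& a, const Vec& b) {
        double s = 0.0;
        for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
        return s;
    }

    bool bicgstab(const Spmv& spmv, const Vec& b, Vec& x, int max_iter, double tol)
    {
        size_t n = b.size();
        Vec r(n), r0(n), p(n, 0.0), v(n, 0.0), s(n), t(n);
        spmv(x, r);
        for (size_t i = 0; i < n; ++i) r[i] = b[i] - r[i];      // r = b - A*x
        r0 = r;
        double rho = 1.0, alpha = 1.0, omega = 1.0;
        double bnorm = std::sqrt(dot(b, b));
        if (bnorm == 0.0) bnorm = 1.0;
        for (int it = 0; it < max_iter; ++it) {
            double rho_new = dot(r0, r);
            double beta = (rho_new / rho) * (alpha / omega);
            rho = rho_new;
            for (size_t i = 0; i < n; ++i) p[i] = r[i] + beta * (p[i] - omega * v[i]);
            spmv(p, v);
            alpha = rho / dot(r0, v);
            for (size_t i = 0; i < n; ++i) s[i] = r[i] - alpha * v[i];
            spmv(s, t);
            omega = dot(t, s) / dot(t, t);
            for (size_t i = 0; i < n; ++i) {
                x[i] += alpha * p[i] + omega * s[i];
                r[i]  = s[i] - omega * t[i];
            }
            if (std::sqrt(dot(r, r)) / bnorm < tol) return true;  // converged
        }
        return false;
    }

The dominant cost per iteration is the two calls to spmv, which is why the rest of this talk concentrates on the sparse matrix-vector product.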

Time Integration
- Includes field calculations and vector operations
- Performed using CVODE from the SUNDIALS package
- CVODE's NVector structure and its operations have been ported to CUDA for single-host/multi-GPU systems
- Memory consuming (~1 KB/node, the limiting factor for GPU usage):
  - field calculations use a sparse matrix and several field vectors
  - CVODE internally needs many helper vectors
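As an illustration (not the actual TetraMag/CVODE port), a multi-GPU NVector operation such as the linear sum z = a*x + b*y could look like the minimal sketch below, where each GPU owns a contiguous slice of every vector; all names and the data layout are assumptions.

    // Hypothetical sketch of a multi-GPU NVector operation (z = a*x + b*y).
    #include <cuda_runtime.h>

    __global__ void linear_sum_kernel(int n, double a, const double* x,
                                      double b, const double* y, double* z)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            z[i] = a * x[i] + b * y[i];
    }

    // Each GPU holds its own slice of x, y and z; len[g] is the slice length.
    void nvector_linear_sum(int n_gpus, const int* len, double a, double** x,
                            double b, double** y, double** z)
    {
        for (int g = 0; g < n_gpus; ++g) {
            cudaSetDevice(g);
            int blocks = (len[g] + 255) / 256;
            linear_sum_kernel<<<blocks, 256>>>(len[g], a, x[g], b, y[g], z[g]);
        }
        for (int g = 0; g < n_gpus; ++g) {   // wait for all devices to finish
            cudaSetDevice(g);
            cudaDeviceSynchronize();
        }
    }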

Properties of TetraMag's Sparse Matrices
- ~15 non-zero elements per row/column
- Distribution of elements depends on the underlying mesh (from regular to seemingly random)
- Symmetric (for the magnetostatic field calculation) or asymmetric (for the exchange field calculation)
- Static over the whole program run

TetraMagCUDA
- Around since ~2009 (see the GTC 2010 poster [1])
- CUDA parts were piggy-backed onto the CPU routines
- Copies as many sparse matrices to device memory as possible; the remainder stays on the CPU
- GPU-only execution is limited to problem sizes of ~1M nodes
- Sufficient for most use cases at the time
[1] http://www.gputechconf.com/content/gtc/posters/2010/Q02-Massively-Parallel-Micromagnetic-FEM-Calculations-with-Graphical-Processing-Units.pdf

New Challenges
- Simulations at experimental scale (µm) require larger problem sizes (10M+ nodes)
- A single matrix plus the solver/integrator vectors often exceeds device memory
- Copying matrices sequentially is no longer sufficient
Possible solutions:
- reduce the memory footprint
- distribute the problem over multiple GPUs

Reduce Memory Footprint
- Use symmetry
  - effective, but not possible for all matrices
  - often needs atomic operations (slow)
- Reduce the precision of off-diagonal elements
  - moderately effective (8+4 -> 4+4 bytes/value)
  - may lead to an unacceptable loss of precision

Reduce Memory Footprint (cont.)
- Extract diagonals
  - good for performance (coalesced access)
  - very good results for symmetric matrices based on regular meshes (2 x (8+4) -> 8 bytes)
  - mixed results otherwise
- Combinations of some/all of the above
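A minimal sketch of the reduced-precision idea (my own illustration, not the TetraMag kernel, and assuming the 4+4 figure means a float value plus a 32-bit column index): keep the diagonal in double precision, store the off-diagonal values as single-precision floats, and accumulate in double.

    // Reduced-precision CSR-style SpMV: double diagonal, float off-diagonals
    // (4+4 bytes/value instead of 8+4). Layout is assumed for illustration.
    __global__ void spmv_mixed(int n_rows, const int* row_ptr, const int* col,
                               const float* off_val, const double* diag,
                               const double* x, double* y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= n_rows) return;
        double sum = diag[row] * x[row];            // diagonal kept in double
        for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
            sum += (double)off_val[k] * x[col[k]];  // off-diagonals stored as float
        y[row] = sum;
    }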

Preprocessing Approaches for Matrix Distribution
Matrices can be preprocessed for the GPU(s) in different ways:
- traditional single-GPU calculation
- naive multi-GPU distribution
- checkerboard-style distribution

Traditional MVM on a Single GPU (figure: the full matrix is multiplied with the full vector on one device)

Distribution over Multiple GPUs
Naive approach:
- divide the matrix into N_GPU sub-matrices of N_rows/N_GPU rows each
- copy one sub-matrix to each GPU
- copy the full vector to all GPUs
- perform the partial multiplications
- copy the partial results to all other GPUs
- repeat (if needed)
A minimal sketch of this scheme follows.
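The host-side sketch below is illustrative only; the broadcast from GPU 0, the dense stand-in kernel and all names are my assumptions, not TetraMag code. The point it shows is that every GPU needs the full input vector before it can start.

    // Naive multi-GPU SpMV sketch: each GPU g owns a block of rows but needs the
    // full input vector, which is therefore broadcast to every device each time.
    #include <cuda_runtime.h>

    // trivial per-row kernel standing in for the real sparse multiplication
    __global__ void spmv_rows(int rows, int n, const double* A_block,
                              const double* x_full, double* y_part)
    {
        int r = blockIdx.x * blockDim.x + threadIdx.x;
        if (r >= rows) return;
        double s = 0.0;
        for (int c = 0; c < n; ++c)              // dense stand-in; real code is sparse
            s += A_block[(size_t)r * n + c] * x_full[c];
        y_part[r] = s;
    }

    void naive_multi_gpu_spmv(int n_gpus, int n, int rows_per_gpu,
                              double** d_A_block, double** d_x_full, double** d_y_part)
    {
        // broadcast the full input vector from GPU 0 to all other GPUs
        for (int g = 1; g < n_gpus; ++g)
            cudaMemcpyPeer(d_x_full[g], g, d_x_full[0], 0, (size_t)n * sizeof(double));

        // each GPU multiplies its block of rows with the full vector
        for (int g = 0; g < n_gpus; ++g) {
            cudaSetDevice(g);
            spmv_rows<<<(rows_per_gpu + 255) / 256, 256>>>(
                rows_per_gpu, n, d_A_block[g], d_x_full[g], d_y_part[g]);
        }
        // wait for all partial products; exchanging the partial results is omitted
        for (int g = 0; g < n_gpus; ++g) { cudaSetDevice(g); cudaDeviceSynchronize(); }
    }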

Naive Distribution, Preparation and First Multiplication (figure): copy the partial matrices to the GPUs, copy the vectors to the other GPUs, then calculate.

Naive Distribution, Subsequent Multiplications (figure): copy the vectors to the other GPUs, then calculate.

Naive Distribution Approach
Pro:
- easy to implement
Con:
- all sub-matrices need vector data from all GPUs at the beginning of the calculation
- the data transfer overhead is about as expensive as the actual calculation -> performance is often below the single-GPU solution

Checkerboard Approach
- Split each sub-matrix into N_GPU sub-sub-matrices; each of these needs vector values from only one GPU
- Perform the multiplication of the first sub-sub-matrix with its partial vector
- At the same time, copy the vector part needed for the next multiplication into a (double-)buffer in a different stream
- Repeat for the other sub-sub-matrices
A sketch of this overlap loop follows.
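The sketch below shows the compute/copy overlap on one GPU under assumed names and layouts (SubSubMatrix, the double buffer and the scheduling are illustrations, not the actual TetraMag implementation): sub-sub-matrix k is multiplied while the partial vector for k+1 is fetched with cudaMemcpyPeerAsync on a second stream.

    // Checkerboard-style SpMV on one GPU (illustrative sketch).
    #include <cuda_runtime.h>

    struct SubSubMatrix {        // CSR-like block: rows owned by this GPU,
        int rows;                // column indices local to the owning GPU's vector part
        const int* row_ptr;
        const int* col;
        const double* val;
    };

    __global__ void spmv_accumulate(SubSubMatrix m, const double* x_part, double* y_part)
    {
        int r = blockIdx.x * blockDim.x + threadIdx.x;
        if (r >= m.rows) return;
        double s = 0.0;
        for (int k = m.row_ptr[r]; k < m.row_ptr[r + 1]; ++k)
            s += m.val[k] * x_part[m.col[k]];
        y_part[r] += s;          // accumulate contributions of all sub-sub-matrices
    }

    void checkerboard_spmv(int g, int n_gpus, const SubSubMatrix* sub,
                           double* const* remote_x, int part_len,
                           double* buf[2], double* y_part,
                           cudaStream_t compute, cudaStream_t copy)
    {
        cudaSetDevice(g);
        cudaMemsetAsync(y_part, 0, sub[0].rows * sizeof(double), compute);
        int cur = g;                                   // start with the locally owned part
        cudaMemcpyPeerAsync(buf[0], g, remote_x[cur], cur,
                            part_len * sizeof(double), copy);
        for (int step = 0; step < n_gpus; ++step) {
            cudaStreamSynchronize(copy);               // vector part for 'cur' has arrived
            int blocks = (sub[cur].rows + 255) / 256;
            spmv_accumulate<<<blocks, 256, 0, compute>>>(sub[cur], buf[step & 1], y_part);
            int next = (cur + 1) % n_gpus;
            if (step + 1 < n_gpus)                     // prefetch the next vector part
                cudaMemcpyPeerAsync(buf[(step + 1) & 1], g, remote_x[next], next,
                                    part_len * sizeof(double), copy);
            cudaStreamSynchronize(compute);            // conservative: avoid buffer-reuse races
            cur = next;
        }
    }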

Checkerboard-Style MVM (figure): the partial matrix on each GPU is split into sub-sub-matrices 0-3, each of which needs vector data from exactly one GPU; the partial vectors arrive through a double buffer while the partial results are accumulated.

Matrix Preprocessing
- The original matrix is in a CSR-like format
- Blocks of N_warpsize rows are transposed to enable coalesced memory access
- Distributing the data destroys the uniformity of the row lengths; zero padding may be necessary -> wasted memory
- Rows are sorted and re-indexed by their number of non-zero elements -> minimal padding
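A host-side sketch of the transposed-block idea (my own illustration, not the TetraMag format): within a block of warp-size rows, element k of row r is stored at packed[k * 32 + r], so that consecutive threads of a warp read consecutive addresses; shorter rows are zero-padded up to the longest row in the block.

    // Repack a CSR block of up to 32 rows into a column-major ("transposed") layout.
    #include <algorithm>
    #include <vector>

    constexpr int WARP = 32;

    void pack_block(const int* row_ptr, const int* col, const double* val,
                    int first_row, int rows_in_block,                 // <= WARP rows
                    std::vector<int>& pcol, std::vector<double>& pval)
    {
        int max_len = 0;                                              // longest row decides padding
        for (int r = 0; r < rows_in_block; ++r)
            max_len = std::max(max_len, row_ptr[first_row + r + 1] - row_ptr[first_row + r]);

        pcol.assign((size_t)max_len * WARP, 0);                       // padded with harmless zeros
        pval.assign((size_t)max_len * WARP, 0.0);
        for (int r = 0; r < rows_in_block; ++r) {
            int start = row_ptr[first_row + r];
            int len   = row_ptr[first_row + r + 1] - start;
            for (int k = 0; k < len; ++k) {
                pcol[(size_t)k * WARP + r] = col[start + k];          // transposed storage
                pval[(size_t)k * WARP + r] = val[start + k];
            }
        }
    }

Sorting the rows by their non-zero count before forming these blocks keeps max_len within a block close to the average row length, which is what keeps the padding minimal.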

Optimizing Data Transfers
- Depending on the problem, sub-sub-matrices can get very small and access only very few vector elements
- The multiplication time is short, but transferring potentially unneeded elements takes much longer
- Solution: transfer only those elements that are really needed

Creating Export Vectors
Each GPU needs an individual set of vector elements from every other GPU:
- Approach one: create N_GPU x (N_GPU-1) export vectors -> consumes much memory
- Approach two: rewrite the export vector N_GPU x (N_GPU-1) times -> may need to read/write all elements N_GPU x (N_GPU-1) times
- Approach three: do some (lots of) work in preprocessing

Building the Perfect(?) Set of Export Vectors
During preprocessing:
- For every vector element, build a key value encoding which other GPUs need it
- Using one bit per target GPU results in at most 2^(N_GPU-1) key values (usually N_GPU <= 8 in a node)
- Use these values as the sort key and re-index in sorted order
- Store the block offsets for the keys
A sketch of these preprocessing steps follows.
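The sketch below illustrates the preprocessing with Thrust under assumed names and inputs (not the actual TetraMag code): for each remote GPU, OR its bit into the key of every vector element its sub-sub-matrix touches, then sort the element indices by key so that equal keys form contiguous export blocks.

    // Build per-element "needed by" bitmasks and sort element indices by them.
    #include <thrust/device_vector.h>
    #include <thrust/sequence.h>
    #include <thrust/sort.h>

    // For one target GPU (represented by 'bit'), mark every element whose column
    // index appears in that GPU's sub-sub-matrix.
    __global__ void mark_needed(int nnz, const int* col, unsigned bit, unsigned* keys)
    {
        int k = blockIdx.x * blockDim.x + threadIdx.x;
        if (k < nnz)
            atomicOr(&keys[col[k]], bit);
    }

    // keys[i]: bit t set if relative GPU t needs element i. Sorting the indices by
    // key yields the export order; the block offsets per key value can then be
    // found with thrust::lower_bound on the sorted keys (omitted here).
    void build_export_index(thrust::device_vector<unsigned>& keys,
                            thrust::device_vector<int>& export_index)
    {
        export_index.resize(keys.size());
        thrust::sequence(export_index.begin(), export_index.end());  // 0, 1, 2, ...
        thrust::sort_by_key(keys.begin(), keys.end(), export_index.begin());
    }

mark_needed would be launched once per remote GPU t with bit = 1u << t over that GPU's column-index array (raw pointers obtained via thrust::raw_pointer_cast).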

Building the Perfect(?) Set of Export Vectors (cont.)
During each multiplication:
- Build the export vector according to the index generated in preprocessing
- During the loop over all sub-sub-matrices:
  - loop over all export blocks
  - asynchronously copy the blocks with elements needed by the next GPU into its buffer
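Filling the export vector at multiplication time then reduces to a single gather over the precomputed index (again only a sketch; the names are assumptions):

    // Gather the locally owned vector elements into export order. Blocks of this
    // export vector (one per key value) are later copied to every GPU whose bit
    // is set in that key.
    __global__ void gather_export(int n_export, const int* export_index,
                                  const double* x_local, double* export_vec)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n_export)
            export_vec[i] = x_local[export_index[i]];
    }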

Key Values Are Calculated Relative to the Originating GPU (figure, example with 4 GPUs)
- Columns 0-2 are originally on GPU 0; relative to GPU 0, accesses from GPUs 0/1/2/3 contribute the key bits 0, 1(001), 2(010) and 4(100).
  - An element accessed only by GPU 3: 0+0+0+4 = key 4
  - An element accessed by GPUs 1 and 2: 0+1+2+0 = key 3
- Columns 6-8 are originally on GPU 2; relative to GPU 2, accesses from GPUs 0/1/2/3 contribute 2(010), 4(100), 0 and 1(001).
  - An element accessed by GPUs 1 and 3: 0+4+0+1 = key 5

Build the Export Buffer During Preprocessing (example: keys from the matrix for GPU 0)

index:    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
ref. key: 4  2  6  3  0  4  0  6  7  6  1  2  5  4  4  0  3  1  2  3

Reorder to the resulting export index (elements with key 0 are needed by no other GPU and are left out):

export:    10  17   1  11  18   3  16  19   0   5  13  14  12   2   7   9   8   -   -   -
ref. key:   1   1   2   2   2   3   3   3   4   4   4   4   5   6   6   6   7   -   -   -
binary:   001 001 010 010 010 011 011 011 100 100 100 100 101 110 110 110 111   -   -   -

Blocks sent to the export streams during the multiplication loop:
- exported to GPU 1 (bit 001): all blocks whose key has bit 001 set (keys 001, 011, 101, 111)
- exported to GPU 2 (bit 010): all blocks whose key has bit 010 set (keys 010, 011, 110, 111)
- exported to GPU 3 (bit 100): all blocks whose key has bit 100 set (keys 100, 101, 110, 111)

Pros and Cons
Pro:
- the export vector is built only once per multiplication
- no element needs to be stored more than once
- no element is transferred more often than necessary
Con:
- limited number of GPUs, because the number of blocks grows exponentially
- time-consuming preprocessing

Further Optimisations
- Sub-sub-matrix multiplications are ordered by matrix size (number of non-zero elements), which makes it more likely that larger transfers overlap with longer calculations (see the sketch below)
- Export vectors are copied to host memory during the initial calculations, which allows parallel import on devices with only one copy engine
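The ordering step can be as simple as sorting the block descriptors by non-zero count before the multiplication loop (a trivial sketch, not the actual code):

    // Process the largest sub-sub-matrices first so long transfers overlap long kernels.
    #include <algorithm>
    #include <vector>

    struct BlockInfo { int id; long long nnz; };

    void order_by_size(std::vector<BlockInfo>& blocks)
    {
        std::sort(blocks.begin(), blocks.end(),
                  [](const BlockInfo& a, const BlockInfo& b) { return a.nnz > b.nnz; });
    }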

Preprocessing Time
- Preprocessing involves multiple expensive indexing and sorting steps
- May take up to 10-20 seconds per matrix (with n ~20M)
- Depends on the number of GPUs used
- Happens only once, because the matrices are static
- A typical run includes millions of solver iterations/matrix multiplications

Benchmarks
Solver iteration times for:
- a cube with a regular mesh: very diagonally dominant matrices, little data transfer
- two problems of similar size and nature:
  - a round disk with an irregular mesh structure: few extractable diagonals, lots of data transfer
  - a round disk with a partially regular mesh structure: some extractable diagonals, moderate data transfer

Time per solver iteration for the 1 µm cube (8.1M nodes, regular mesh)
(chart: time per solver iteration [ms] vs. number of GPUs, 1-8; series: Titan, GTX 690)
Comparison: 2 x E5-2680 (10 cores, 2.8 GHz): 134 ms

Time per solver iteration for the disk (6.x M nodes, different mesh structures)
(chart: time per solver iteration [ms] vs. number of GPUs, 1-8; series: Titan reg./irreg. mesh, GTX 690 reg./irreg. mesh, GTX 690 with 1 GPU/card irreg. mesh)
Comparison: 2 x E5-2680 (10 cores, 2.8 GHz): 100 ms (regular mesh), 118 ms (irregular mesh)

Conclusions
- Distributing the matrices and vectors over multiple GPUs allows us to simulate significantly larger samples
- Performance scaling depends largely on the amount of data exchanged between the GPUs
- Optimising the mesh structure is very important in multi-GPU setups

Questions?