Explore Computational Power of GPU in Electromagnetics and Micromagnetics


1 Explore Computational Power of GPU in Electromagnetics and Micromagnetics Presenter: Sidi Fu, PhD candidate, UC San Diego Advisor: Prof. Vitaliy Lomakin Center of Magnetic Recording Research, Department of Electrical and Computer Engineering, University of California, San Diego 4/20/2014 COMPUTATIONAL ELECTROMAGNETICS AND MICROMAGNETICS GROUP, ELECTRICAL AND COMPUTER ENGINEERING DEPARTMENT, UCSD 1

2 Outline
Motivation: Micromagnetics (FastMag solver); Electromagnetics
GPU Acceleration Projects: Non-uniform Fast Fourier Transform; Sparse Matrix-Vector Multiplication; Finite Difference Method solver OOMMF
Simulation examples



4 Motivation
Typical applications of micromagnetic simulations: hard drives, magnetic materials, magnetic memory.
Typical problem scale: 100K to 100M.
CPU? Too slow. MPI? Possible but expensive. GPU? (Relatively) low cost, high performance.

5 Motivation
Landau-Lifshitz-Gilbert (LLG) equation for magnetization dynamics:
$\frac{\partial \hat m}{\partial t} = -\frac{\gamma}{1+\alpha^2}\left[\hat m \times H_{eff} + \alpha\, \hat m \times (\hat m \times H_{eff})\right]$
This nonlinear differential equation is solved by marching on in time, e.g. by advancing $\hat m(t)$ to $\hat m(t+\Delta t)$ using $\hat m(t) \times H_{eff}(t)$.
The effective field $H_{eff}$ has two kinds of contributions:
Long-range field (far field): demagnetization field. Integral operator acting on $M_s \hat m$ with the kernel $1/|r - r'|$. Dense matrix -> bottleneck: O(N^2).
Local field (near field): exchange field, $\frac{2A}{M_s}\nabla^2 \hat m$. Differential operator. Sparse matrix -> can become a bottleneck.
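The time marching mentioned on this slide can be illustrated with a minimal sketch. It assumes a forward Euler step followed by renormalization of the magnetization, normalized units ($\gamma = 1$), and a constant effective field; none of these choices are stated on the slide, and the function names `llg_rhs` and `march` are hypothetical.

```python
import numpy as np

def llg_rhs(m, h_eff, gamma=1.0, alpha=0.2):
    """Right-hand side of the LLG equation for unit vectors m of shape (N, 3)."""
    mxh = np.cross(m, h_eff)      # precession term  m x H_eff
    mxmxh = np.cross(m, mxh)      # damping term     m x (m x H_eff)
    return -gamma / (1.0 + alpha**2) * (mxh + alpha * mxmxh)

def march(m0, h_eff_fn, dt, n_steps, gamma=1.0, alpha=0.2):
    """Forward Euler time marching (an assumed scheme, chosen for brevity)."""
    m = m0.copy()
    for _ in range(n_steps):
        m = m + dt * llg_rhs(m, h_eff_fn(m), gamma, alpha)
        m /= np.linalg.norm(m, axis=1, keepdims=True)  # keep |m| = 1
    return m

# One spin starting along x in a constant field along z (illustrative values).
m0 = np.array([[1.0, 0.0, 0.0]])
constant_field = lambda m: np.tile([0.0, 0.0, 1.0], (m.shape[0], 1))
m_final = march(m0, constant_field, dt=0.01, n_steps=5000)
```

With damping present, the spin precesses around the field and relaxes toward the field direction. A production solver would evaluate the demagnetization and exchange fields inside `h_eff_fn`, which is where the dense and sparse operators listed above enter.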

6 Motivation
FastMag: a versatile GPU micromagnetic simulator.
Framework components: input interface; FastMag LLG simulators; temperature/optics; hybrid simulators; fast demag (NUFFT); fast exchange; fast SpMV; time integration; fast Jacobian; parallelization; CPU-GPU hybrid; output interface.


7 Motivation
Typical applications of electromagnetic simulations: biomedical EM; EM wave scattering from an airplane; radar cross section.
[Figure: radar cross section, RCS (dBsw), versus angle (degree), comparing the Mie series with MoM.]
Equations to solve: Maxwell's equations for the fields E, H, D and B with source current J, written in terms of the vector potential A and the scalar potential V.

8 Motivation
Electromagnetic problem example: field-based volume integral equation. The equation relates the unknown electric flux $D$ to the incident term $D^i$ through volume integrals over the Green's function $\frac{e^{-jk_0|r-r'|}}{4\pi|r-r'|}$.
Goal: solve for the electric flux $D$.
Step 1: Quadrature points represent the integral. With $D(r) = \sum_{n=1}^{N} D_n f_n(r)$, the quadrature sources are $Q = PD$. Sparse matrix $P$: maps basis functions to quadrature source points.
Step 2: Quadrature source to potential, $\Phi = ZQ$. Dense matrix $Z$: summation of the products between the sources and the Green's function.
Step 3: Quadrature observer function to testing function, $P^T\Phi$, matched to $D^i$. Sparse matrix $P^T$: maps quadrature potential points to testing functions.
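The three steps can be sketched as a matrix-free operator application. This is only an illustration under assumptions: the matrices are small random stand-ins, `P` is stored densely although it would be sparse in practice, `Z` uses the scalar kernel $e^{-jk_0R}/(4\pi R)$ with a zeroed diagonal, and the name `apply_three_step_operator` is hypothetical.

```python
import numpy as np

def apply_three_step_operator(P, Z, d):
    """Apply P^T Z P to the coefficient vector d without forming the product matrix."""
    q = P @ d          # Step 1: basis coefficients -> quadrature sources
    phi = Z @ q        # Step 2: quadrature sources -> potentials (dense part)
    return P.T @ phi   # Step 3: potentials -> testing functions

rng = np.random.default_rng(0)
n_basis, n_quad, k0 = 4, 8, 2.0

# Stand-in for the sparse map P: one nonzero weight per quadrature point.
P = np.zeros((n_quad, n_basis))
P[np.arange(n_quad), np.arange(n_quad) % n_basis] = rng.random(n_quad)

# Stand-in for the dense map Z: Green's function between random points.
points = rng.random((n_quad, 3))
R = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
np.fill_diagonal(R, 1.0)                      # avoid division by zero
Z = np.exp(-1j * k0 * R) / (4.0 * np.pi * R)
np.fill_diagonal(Z, 0.0)                      # drop the self term in this sketch

d = rng.random(n_basis)
result = apply_three_step_operator(P, Z, d)
```

Only the middle step involves a dense matrix, which is why it is the natural target for a fast method such as the NUFFT, while the two outer steps are sparse matrix-vector products.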


10 NUFFT
Traditional Fast Fourier Transform
Advantages: computational complexity reduced from O(N^2) to O(N log N); well-known libraries, e.g. FFTW, Intel MKL, NVIDIA cuFFT.
Disadvantages: cannot solve problems with non-uniform source distributions; non-periodic problems require zero padding.
Electromagnetic problems: $u_j = \sum_{i=1,\, i \neq j}^{N} \frac{e^{-jk|r_i - r_j|}}{|r_i - r_j|}\, q(r_i)$, where the fraction is the Green's function.
Micromagnetic problems: integrals of $M_s \hat m$ against the kernel $1/|r - r'|$.
NUFFT: Non-uniform Fast Fourier Transform (or Adaptive Integral Method)*. Uniform sampling of general structures turns a non-uniform problem into a uniform problem.
* Ref: Zhu, Zhenhai, Ben Song, and Jacob White. "pfft++: A general and extensible fast integral equation solver based on a pre-corrected FFT algorithm."
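The zero-padding remark can be made concrete with a small sketch for sources on a uniform 1D grid. It assumes the static kernel $1/|x_i - x_j|$ and a grid spacing `h`; the function names are hypothetical. The direct sum costs O(N^2), while the zero-padded FFT version costs O(N log N) and gives the same non-periodic result.

```python
import numpy as np

def direct_sum(q, h):
    """O(N^2) evaluation of u_j = sum_{i != j} q_i / |x_i - x_j| on a uniform grid."""
    n = len(q)
    idx = np.arange(n)
    dist = np.abs(idx[:, None] - idx[None, :]) * h
    np.fill_diagonal(dist, np.inf)            # self term contributes zero
    return (q[None, :] / dist).sum(axis=1)

def fft_sum(q, h):
    """Same sum via FFT; padding to length 2N removes the periodic wrap-around."""
    n = len(q)
    m = 2 * n
    lags = np.arange(m)
    lags = np.where(lags < n, lags, lags - m)  # lags 0..n-1 followed by -n..-1
    kernel = np.zeros(m)
    nonzero = lags != 0
    kernel[nonzero] = 1.0 / (np.abs(lags[nonzero]) * h)
    q_pad = np.zeros(m)
    q_pad[:n] = q                              # zero padding of the sources
    u = np.fft.ifft(np.fft.fft(kernel) * np.fft.fft(q_pad)).real
    return u[:n]

rng = np.random.default_rng(1)
q = rng.random(64)
u_direct = direct_sum(q, h=0.1)
u_fft = fft_sum(q, h=0.1)
```

Sources that do not sit on a uniform grid cannot be handled this way directly, which is the gap the projection and back-projection steps of the NUFFT are meant to close.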

11 NUFFT
Algorithm building blocks: 1. Projection 2. Fast Fourier Transform 3. Back-projection 4. Near-field correction
CUDA implementation: coalesced memory access; shared memory; thread independence; workload balancing; heavy floating-point operations.



15 NUFFT
NUFFT on a single GPU: the CPU supplies the source coordinates and the domain structure; the GPU gets the source amplitudes, then performs projection (source amplitudes on grids), FFT (source amplitudes on grids in k-space), k-space multiplication (TensorMul, field in k-space), inverse FFT (field on grids) and near-field correction, giving the observer field.
NUFFT on multiple GPUs: each GPU receives the source coordinates, the domain structure and its source amplitudes. Projection runs on every GPU, followed by a parallel FFT in 3D, a wait, k-space multiplication on every GPU, a parallel inverse FFT in 3D and near-field correction on every GPU, giving the observer field on each GPU.

16 NUFFT
Single GPU results: Intel Xeon @ 3.2 GHz vs. NVIDIA GeForce GTX 690 (1 card). 100x ~ 300x CPU-GPU speed-up!

Problem size   Direct CPU/s   Direct GPU/s   NUFFT CPU (cubic)/s   NUFFT GPU (cubic)/s   NUFFT GPU (linear)/s
16K            7.02E…         …              …                     …                     …E-3
64K            4.47E1         7.98E…         …E0                   1.23E…                …E-3
256K           7.17E2         1.14E0         8.87E0                3.33E…                …E-2
1M             N/A            1.79E1         3.99E1                1.26E…                …E-2
4M             N/A            N/A            N/A                   4.76E…                …E-1

Multiple GPU results: 2 x NVIDIA GeForce GTX 690 (4 GPUs), problem size = 4M.
Parallel efficiency: $E_p = \frac{S_P}{P} = \frac{T_1}{P\,T_P}$; $E_p$ = 77% across 4 GPUs.
[Figure: simulation time (ms) versus number of GPUs, with speed-ups of 1.8x, 2.6x and 3.1x on 2, 3 and 4 GPUs.]
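As a worked check of the efficiency figure, the sketch below applies $E_p = S_P / P$ to the speed-ups read off the multi-GPU plot. The helper name `parallel_efficiency` is hypothetical.

```python
def parallel_efficiency(speedup, n_gpus):
    """E_p = S_P / P, with S_P the speed-up over one GPU and P the GPU count."""
    return speedup / n_gpus

# Speed-ups for 2, 3 and 4 GPUs as quoted on the slide.
speedups = {2: 1.8, 3: 2.6, 4: 3.1}
efficiency = {p: parallel_efficiency(s, p) for p, s in speedups.items()}
```

For 4 GPUs this gives 3.1 / 4 = 0.775, consistent with the 77% quoted on the slide.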

17 SPMV
Sparse Matrix-Vector Multiplication (SpMV)
Application: differential operators, projections or interpolations.
Feature: #non-zero elements << #zero elements.
GPU memory: only non-zero elements are kept in memory.
Computational complexity: only non-zero elements are computed.
Example: compressed sparse row (CSR) format, e.g. for the exchange operator $\frac{2A}{M_s}\nabla^2 \hat m$. The matrix A is stored in the arrays RowOffset, Ptr and Data.
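A minimal CSR sketch follows. It uses the common array names `indptr`, `indices` and `data` (row pointers, column indices, values), which are an assumption and may not map one-to-one onto the array names on the slide; the example matrix is a small 1D second-difference operator chosen for illustration.

```python
import numpy as np

def dense_to_csr(a):
    """Convert a dense 2D array to CSR arrays (row pointers, column indices, values)."""
    indptr, indices, data = [0], [], []
    for row in a:
        cols = np.nonzero(row)[0]
        indices.extend(cols.tolist())
        data.extend(row[cols].tolist())
        indptr.append(len(indices))           # running count of stored non-zeros
    return (np.array(indptr),
            np.array(indices, dtype=int),
            np.array(data, dtype=float))

def csr_matvec(indptr, indices, data, x):
    """y = A x, touching only the stored non-zero elements."""
    y = np.zeros(len(indptr) - 1)
    for i in range(len(y)):
        start, end = indptr[i], indptr[i + 1]
        y[i] = np.dot(data[start:end], x[indices[start:end]])
    return y

# Tridiagonal second-difference matrix as a stand-in for a differential operator.
n = 5
A = -2.0 * np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1)
x = np.arange(n, dtype=float)
indptr, indices, data = dense_to_csr(A)
y = csr_matvec(indptr, indices, data, x)
```

On a GPU the loop over rows is what gets parallelized, with one thread or one group of threads per row.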


18 SPMV
Implementation: single GPU
Bind the input vector to texture memory.
Parallel reduction with shuffle operations.
Maximize the CPU-GPU memory transfer throughput: important for mixed CPU-GPU solvers. Pinned host memory increases the memory transfer throughput by 100%.
Ref: 1. CUDA C Best Practices Guide, NVIDIA; 2. Optimizing Parallel Reduction in CUDA, M. Harris; 3. How to Optimize Data Transfers in CUDA C/C++, M. Harris

19 SPMV
Implementation: sorting
[Figure: sparse matrix x input vector = output vector, shown before and after sorting.]
Sorting methods compared: box-sorting vs. RCM (reverse Cuthill-McKee).

20 SPMV
Implementation: multiple GPUs
Only part of the matrix and of the input vector is assigned to each GPU.
Workload balance: the number of non-zero elements is balanced among the GPUs.
Problem: memory scalability across GPUs.
[Figure: input-vector segments V1 to V8 required by GPU0 to GPU3 before sorting; several segments are required by more than one GPU.]


21 SPMV
Implementation: multiple GPUs
Only part of the matrix and of the input vector is assigned to each GPU.
Workload balance: the number of non-zero elements is balanced among the GPUs.
Sorting helps to preserve the scalability of the multi-GPU implementation.
[Figure: input-vector segments V1 to V8 required by GPU0 to GPU3 after sorting; far fewer segments are shared between GPUs.]
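One possible way to balance non-zero elements among GPUs is to cut the CSR row-pointer array at equal shares of the total non-zero count. The sketch below is an assumption about how such a split could be done, not the scheme used in the talk; `partition_rows_by_nnz` is a hypothetical name.

```python
import numpy as np

def partition_rows_by_nnz(indptr, n_parts):
    """Return row boundaries so that each part holds roughly nnz / n_parts non-zeros."""
    nnz = indptr[-1]
    targets = nnz * np.arange(1, n_parts) / n_parts
    cuts = np.searchsorted(indptr, targets, side="left")  # first row boundary reaching each target
    return np.concatenate(([0], cuts, [len(indptr) - 1]))

# Row pointers for 8 rows holding 4, 1, 1, 2, 3, 1, 2, 2 non-zeros (illustrative).
nnz_per_row = np.array([4, 1, 1, 2, 3, 1, 2, 2])
indptr = np.concatenate(([0], np.cumsum(nnz_per_row)))
bounds = partition_rows_by_nnz(indptr, n_parts=4)
nnz_per_part = np.diff(indptr[bounds])
```

Here the four parts own different numbers of rows but the same number of non-zeros. Each part still needs every input-vector entry referenced by its column indices, which is why a bandwidth-reducing sort lowers the amount of vector data each GPU has to hold.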

22 SPMV
Speed results
Two matrices generated from FEM meshes of a cube and a sphere, respectively; three matrices chosen from the Florida sparse matrix collection.
Hardware: Intel 3.2 GHz CPU with 1 core running vs. 2 x NVIDIA GeForce GTX 690. CPU-GPU memory transfer time is included.
Table columns: matrix; nnz / (nnz/row); computational time (ms) for serial CPU, MKL CPU, cuSPARSE GPU, and SPMV on 1, 2, 3 and 4 GPUs.
Matrices (nnz): FEM Cube (17.5M); FEM Sphere (31.8M); dielFilterV3real (89.3M); gsm_… (…M); Cube_Coup_dt6 (124M).

23 SPMV
Speed results: multiple GPUs
[Figure: three panels (FEM Sphere, Cube_Coup_dt6, DielFilterV3Real) showing simulation time (ms) versus number of GPUs, split into memcpy and kernel time.]
Speed-up labels on the panels, for 1 to 4 GPUs: (…x, 1.6x, 2.2x, 3.1x); (…x, 1.9x, 2.6x, …x); (1.0x, 1.8x, 2.7x, 2.8x).


24 OOMMF GPU
OOMMF (Object Oriented MicroMagnetic Framework) by NIST: open source, thousands of users worldwide.
Micromagnetic simulator: Landau-Lifshitz-Gilbert equation; finite difference method; object-oriented coding framework; periodic and non-periodic boundary conditions; 6-point and 12-point exchange field; uniaxial and cubic anisotropy field; flexibility in changing material properties.
Problem: the CPU is too slow for large problems.
Solution: GPU parallel computation.

25 OOMMF GPU
GPU parallelism: m initialization, then evaluation of the field terms, the effective field and the LLG update.
$H_{anisotropy} = H_k\,(\hat k \cdot m)\,\hat k$
$H_{exchange} = M_s\, l_{ex}^2\, \nabla^2 m$
$H_{eff} = H_{applied} + H_{exchange} + H_{anisotropy} + H_{stray}$
$\frac{\partial m}{\partial t} = -\frac{\gamma}{1+\alpha^2}\left[m \times H_{eff} + \alpha\, m \times (m \times H_{eff})\right]$
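A finite-difference exchange field with a 6-point stencil can be sketched as below. It assumes cubic cells, periodic boundaries (implemented with `np.roll`) and the form $H_{exchange} = M_s l_{ex}^2 \nabla^2 m$; the function name and all parameter values are hypothetical and not taken from OOMMF.

```python
import numpy as np

def exchange_field_6pt(m, cell, ms, l_ex):
    """6-point Laplacian of m (shape (nx, ny, nz, 3)) scaled by ms * l_ex**2.

    Periodic boundaries are assumed; every cell is updated independently,
    which is what makes this term easy to parallelize on a GPU.
    """
    lap = np.zeros_like(m)
    for axis in range(3):
        lap += (np.roll(m, 1, axis=axis) + np.roll(m, -1, axis=axis) - 2.0 * m) / cell**2
    return ms * l_ex**2 * lap

# Test pattern: m_x varies as one cosine period along x (illustrative values).
nx, ny, nz, cell = 16, 4, 4, 1.0
x = np.arange(nx) * cell
k = 2.0 * np.pi / (nx * cell)
m = np.zeros((nx, ny, nz, 3))
m[..., 0] = np.cos(k * x)[:, None, None]
h_ex = exchange_field_6pt(m, cell, ms=1.0, l_ex=1.0)

# Discrete eigenvalue of the stencil for this cosine mode.
expected_x = (2.0 * np.cos(k * cell) - 2.0) / cell**2 * m[..., 0]
```

A uniform magnetization gives zero exchange field with this stencil, and the cosine pattern is scaled by the discrete eigenvalue `(2 cos(k * cell) - 2) / cell**2`.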

26 OOMMF GPU
Speed results
Test case: cubic geometry with various problem sizes.
Hardware: CPU: Xeon with 1 core running; GPU: NVIDIA GTX 690 @ 915 MHz with 1536 cores running.
Computational time and speed-up (columns: problem size, CPU/ms, GPU/ms, speed-up) for problem sizes 16^3 = 4K, 32^3 = 32K, 64^3 = 256K and 128^3 = 2M; the speed-up listed for 2M is x24.5.



28 Micromagnetic Simulation: Magnetic head
Challenges and features: complex geometry (5-10 micron size, ~1000 aspect ratio, complex shapes and coupled parts); hundreds of millions of elements may be needed.
Parameters: M_s = … emu/cc, α = 0.2; 5 x 3 x 5 micron size, 50 x 80 nm tip; Adams and BDF time stepping.
Hardware: Tesla S2070 GPU, i7 CPU.
* Coils surround the head.

Tip resolution   Largest element   # of tetrahedral elements   Time per 1 ns
10 nm            130 nm            130K                        1.75 min
10 nm            57 nm             1.2M                        17 min
10 nm            33 nm             4.8M                        107 min
10 nm            10 nm             126M                        ~3 days

29 Micromagnetic Simulation: Granular media
Features: general Voronoi tessellation; distributions of particle size, shape, separation, material parameters, etc.; single and multiple layers with an option for sub-layer discretization; surface and bulk exchange.


30 Micromagnetic Simulation: Magnetic memories
Spin-transfer-torque based magnetic RAM (MRAM).
[Figure: spin valve structure with applied voltage V and electron flow; in-plane MRAM and perpendicular MRAM, each with a free layer and a fixed layer.]

31 Electromagnetic Simulation: Human body scattering
Human body simulation
Method: Potential Integral Equation
Key algorithm: Non-uniform Fast Fourier Transform
Mesh: 8.4 million tetrahedrons, 2 mm resolution
Total number of iterations: 109
Simulation time: 48 min
Incident wave: x polarization, λ = 1.25 m; $\varepsilon_r = 41.4 - j18$
[Figure: current distribution along x, with x, y, z axes.]


32 Summary
Completed work:
A Finite Element Method based micromagnetic solver, FastMag.
Two GPU algorithms, Non-uniform Fast Fourier Transform and Sparse Matrix-Vector Multiplication, with 20x ~ 300x GPU-CPU speed-up.
Multi-GPU implementation of the two algorithms, reaching 65% - 85% parallel efficiency.
Electromagnetic and micromagnetic simulation examples.
Future work:
The entire FastMag solver is going to be implemented on GPU.
With the release of CUDA 6.0, the multi-GPU implementation will be more efficient.
More information? Please find it on our group's website:
Acknowledgement: Shaojing Li, Ruinan Chang, Marko Lubarda, Marco Escobar, Majd Kuteifan, Marco Menarini, Simon Couture, Javier Espigares
