上海超级计算中心 Shanghai Supercomputer Center. Lei Xu Shanghai Supercomputer Center San Jose



2 Overview
Introduction; Fundamentals of the FDTD method; Implementation of the 3D UPML-FDTD algorithm on GPU clusters; Optimization techniques; Results; Conclusions & future work.
Slide 2 03/26/2014@GTC Shanghai Supercomputer Center lu@ssc.net.cn

3 Introduction: Motivations
Pre-developing a massively parallel electromagnetic (EM) simulation application for the future exascale high-productivity HPC system (supported by NSFC and the 863 Program).
Fast turnaround for EM simulation.

4 Introduction: Related work
The FDTD (finite-difference time-domain) method has been used on CPU clusters for decades to accelerate EM simulation.
Most research in recent years has focused on implementation and optimization on fewer than 10 GPU cards.
Highlight of this work: the optimized FDTD code shows strong scaling from 12 to 80 Tesla K20m GPU cards.


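5 Fundamentals of the FDTD Algorithm
Propagation of EM waves is governed by Maxwell's equations.
Maxwell's curl equations:
$$\frac{\partial \mathbf{B}}{\partial t} = -\nabla \times \mathbf{E} - \mathbf{M}, \qquad \frac{\partial \mathbf{D}}{\partial t} = \nabla \times \mathbf{H} - \mathbf{J}$$
Six coupled scalar equations:
$$\begin{aligned}
\frac{\partial H_x}{\partial t} &= \frac{1}{\mu}\left[\frac{\partial E_y}{\partial z} - \frac{\partial E_z}{\partial y} - \left(M_{\mathrm{source},x} + \sigma_m H_x\right)\right] \\
\frac{\partial H_y}{\partial t} &= \frac{1}{\mu}\left[\frac{\partial E_z}{\partial x} - \frac{\partial E_x}{\partial z} - \left(M_{\mathrm{source},y} + \sigma_m H_y\right)\right] \\
\frac{\partial H_z}{\partial t} &= \frac{1}{\mu}\left[\frac{\partial E_x}{\partial y} - \frac{\partial E_y}{\partial x} - \left(M_{\mathrm{source},z} + \sigma_m H_z\right)\right] \\
\frac{\partial E_x}{\partial t} &= \frac{1}{\varepsilon}\left[\frac{\partial H_z}{\partial y} - \frac{\partial H_y}{\partial z} - \left(J_{\mathrm{source},x} + \sigma E_x\right)\right] \\
\frac{\partial E_y}{\partial t} &= \frac{1}{\varepsilon}\left[\frac{\partial H_x}{\partial z} - \frac{\partial H_z}{\partial x} - \left(J_{\mathrm{source},y} + \sigma E_y\right)\right] \\
\frac{\partial E_z}{\partial t} &= \frac{1}{\varepsilon}\left[\frac{\partial H_y}{\partial x} - \frac{\partial H_x}{\partial y} - \left(J_{\mathrm{source},z} + \sigma E_z\right)\right]
\end{aligned}$$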

6 Fundamentals of the FDTD Algorithm: Yee cell
[Figure: Yee cell showing the staggered positions of the E_x, E_y, E_z and H_x, H_y, H_z components.]
Spatial: centered finite differences.
Temporal: second-order accuracy.
Every E/H component is surrounded by four H/E components.
Leapfrog algorithm: the E and H components are updated alternately in time, at intervals of Δt/2.
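The leapfrog scheme on a staggered grid can be illustrated with a one-dimensional sketch. This is not the 3D UPML-FDTD code from the talk: it assumes normalized units (ε = μ = 1, Δx = 1), only the E_z and H_y components, perfectly conducting ends, and a soft Gaussian source; the function name `run_fdtd_1d` and all parameter values are hypothetical.

```python
import math


def run_fdtd_1d(n_cells=201, n_steps=120, courant=0.5, src=100):
    """1D Yee/leapfrog sketch in normalized units (assumed: eps = mu = 1, dx = 1).

    E_z lives on integer grid points, H_y on the half-integer points between
    them, and the two fields are advanced alternately (leapfrog in time).
    """
    ez = [0.0] * n_cells          # E_z at points k
    hy = [0.0] * (n_cells - 1)    # H_y at points k + 1/2

    for n in range(n_steps):
        # Update H from the spatial difference of the two neighbouring E values.
        for k in range(n_cells - 1):
            hy[k] += courant * (ez[k + 1] - ez[k])
        # Update E from the two neighbouring H values; the end points stay at
        # zero, which acts as a perfectly conducting wall.
        for k in range(1, n_cells - 1):
            ez[k] += courant * (hy[k] - hy[k - 1])
        # Soft source: a Gaussian in time added to E at one grid point.
        ez[src] += math.exp(-(((n - 30) / 10.0) ** 2))
    return ez


if __name__ == "__main__":
    field = run_fdtd_1d()
    peak = max(range(101, 201), key=lambda k: field[k])
    print("right-going pulse peak near cell", peak)
```

With a Courant number of 0.5 the pulse advances half a cell per step, so after 120 steps the two outgoing pulses sit roughly 45 cells on either side of the source.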

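7 Fundamentals of the FDTD Algorithm: The finite-difference expression for Maxwell's equations in three dimensions
$$\frac{\partial H_x}{\partial t} = \frac{1}{\mu}\left[\frac{\partial E_y}{\partial z} - \frac{\partial E_z}{\partial y} - \left(M_{\mathrm{source},x} + \sigma_m H_x\right)\right], \qquad J_{\mathrm{source}} = 0,\; M_{\mathrm{source}} = 0$$
$$H_x\Big|^{n+1/2}_{i,j+1/2,k+1/2} = CP(m)\, H_x\Big|^{n-1/2}_{i,j+1/2,k+1/2} + CQ(m)\left[\frac{E_y\big|^{n}_{i,j+1/2,k+1} - E_y\big|^{n}_{i,j+1/2,k}}{\Delta z} - \frac{E_z\big|^{n}_{i,j+1,k+1/2} - E_z\big|^{n}_{i,j,k+1/2}}{\Delta y}\right] \qquad (1)$$
where $m = (i,\, j+1/2,\, k+1/2)$ and
$$CP(m) = \frac{\dfrac{\mu(m)}{\Delta t} - \dfrac{\sigma_m(m)}{2}}{\dfrac{\mu(m)}{\Delta t} + \dfrac{\sigma_m(m)}{2}} = \frac{1 - \dfrac{\sigma_m(m)\,\Delta t}{2\mu(m)}}{1 + \dfrac{\sigma_m(m)\,\Delta t}{2\mu(m)}}, \qquad CQ(m) = \frac{1}{\dfrac{\mu(m)}{\Delta t} + \dfrac{\sigma_m(m)}{2}} = \frac{\dfrac{\Delta t}{\mu(m)}}{1 + \dfrac{\sigma_m(m)\,\Delta t}{2\mu(m)}}$$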

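8 Fundamentals of the FDTD Algorithm: The difference expressions under the UPML (uniaxial perfectly matched layer) absorbing boundary conditions
Geometric grading:
$$\sigma_x(x) = \left(g^{1/\Delta}\right)^{x} \sigma_{x,0}, \qquad \sigma_{x,0} = -\frac{\ln[R(0)]\,\ln(g)}{2\eta\Delta\left(g^{d/\Delta} - 1\right)} \qquad (2)$$
[The slide also gives a graded profile for $\kappa_x(x)$ involving $\kappa_{\max}$; its exact form is not recoverable from the extracted text.]
Update of $B_x$ and $H_x$:
$$B_x\Big|^{n+3/2}_{i,j+1/2,k+1/2} = \frac{2\varepsilon\kappa_y - \sigma_y\Delta t}{2\varepsilon\kappa_y + \sigma_y\Delta t}\, B_x\Big|^{n+1/2}_{i,j+1/2,k+1/2} - \frac{2\varepsilon\Delta t}{2\varepsilon\kappa_y + \sigma_y\Delta t}\left[\frac{E_z\big|^{n+1}_{i,j+1,k+1/2} - E_z\big|^{n+1}_{i,j,k+1/2}}{\Delta y} - \frac{E_y\big|^{n+1}_{i,j+1/2,k+1} - E_y\big|^{n+1}_{i,j+1/2,k}}{\Delta z}\right]$$
$$H_x\Big|^{n+3/2}_{i,j+1/2,k+1/2} = \frac{2\varepsilon\kappa_z - \sigma_z\Delta t}{2\varepsilon\kappa_z + \sigma_z\Delta t}\, H_x\Big|^{n+1/2}_{i,j+1/2,k+1/2} + \frac{1}{\left(2\varepsilon\kappa_z + \sigma_z\Delta t\right)\mu}\left[\left(2\varepsilon\kappa_x + \sigma_x\Delta t\right) B_x\Big|^{n+3/2}_{i,j+1/2,k+1/2} - \left(2\varepsilon\kappa_x - \sigma_x\Delta t\right) B_x\Big|^{n+1/2}_{i,j+1/2,k+1/2}\right] \qquad (3)$$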

9 Implementation of 3D UPML-FDTD algorithm on GPU clusters: FDTD communication pattern
[Figure: a subdomain whose faces are annotated with paired send/recv calls, each exchanging a pair of H components or a pair of E components with the neighboring subdomain.]
Each subdomain needs to exchange the E/H values on its boundaries with the subdomains in front, behind, to the left, to the right, above and below.

10 Implementation of 3D UPML-FDTD algorithm on GPU clusters: Analysis of the domain decomposition of the UPML-FDTD algorithm
The communication overhead depends on the surface area of the subdomains.
The surface area of a subdomain:
$$S = 2\left(\frac{n_x n_y}{x\,y} + \frac{n_y n_z}{y\,z} + \frac{n_x n_z}{x\,z}\right) \qquad (4)$$
where $x$, $y$, $z$ are the values of the virtual topology in the three directions, $x\,y\,z$ is the number of processors, and $n_x \times n_y \times n_z$ is the problem size.

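11 Implementation of 3D UPML-FDTD algorithm on GPU clusters: Analysis of the domain decomposition of the UPML-FDTD algorithm
With $x$, $y$, $z$ the virtual topology in the three directions and $n_x \times n_y \times n_z$ the problem size, the subdomain surface area is bounded from below:
$$2\left(\frac{n_x n_y}{x\,y} + \frac{n_y n_z}{y\,z} + \frac{n_x n_z}{x\,z}\right) \ge 6\left(\frac{n_x n_y n_z}{x\,y\,z}\right)^{2/3} \qquad (5)$$
with equality when $n_x/x = n_y/y = n_z/z$, that is, when the subdomains are cubes.
General advice for domain decomposition:
The optimum virtual topology should be created in three dimensions.
The topology should be split along the directions with more Yee cells.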
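The advice above can be checked by brute force. The sketch below enumerates every way of factoring the processor count into a three-dimensional virtual topology and picks the one with the smallest subdomain surface area, using the surface-area expression from the slides. The function names `subdomain_surface` and `best_topology` are hypothetical, and the model assumes all six faces of a subdomain are communicated, which overstates the cost for subdomains on the outer boundary.

```python
def subdomain_surface(grid, topo):
    """Surface area (in Yee-cell faces) of one subdomain.

    grid = (nx, ny, nz) is the problem size, topo = (x, y, z) the virtual
    topology; each subdomain then measures nx/x by ny/y by nz/z cells.
    """
    a, b, c = (n / p for n, p in zip(grid, topo))
    return 2.0 * (a * b + b * c + a * c)


def best_topology(grid, n_procs):
    """Return the (x, y, z) factorization of n_procs with minimal surface area."""
    best = None
    for x in range(1, n_procs + 1):
        if n_procs % x:
            continue
        for y in range(1, n_procs // x + 1):
            if (n_procs // x) % y:
                continue
            z = n_procs // (x * y)
            area = subdomain_surface(grid, (x, y, z))
            if best is None or area < best[0]:
                best = (area, (x, y, z))
    return best[1]


if __name__ == "__main__":
    print(best_topology((512, 512, 512), 64))    # cubic grid
    print(best_topology((1024, 512, 256), 8))    # elongated grid
```

For a cubic grid the search returns a balanced three-dimensional topology, and for an elongated grid it places more subdomains along the longer directions, which matches both pieces of advice.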

12 Implementation of 3D UPML-FDTD algorithm on GPU clusters: FDTD data partitioning & workflow
Data partitioning: E_x(i, j, k), E_y(i, j, k), E_z(i, j, k), H_x(i, j, k), H_y(i, j, k) and H_z(i, j, k) belong to Cell(i, j, k).
FDTD workflow: Start; domain decomposition; initialize parameters; update H; communicate H; update E; communicate E; source excitation; n_t = n_t + 1; if n_t < N_t (yes), return to update H; otherwise (no), gather the result; end.

13 Optimization techniques: Using non-blocking communication
[Figure: CPU/GPU timelines for blocking and non-blocking communication. The subscripts "in" and "e" appear to denote the main (interior) grid and the boundary region of a subdomain.]
Blocking communication, per iteration after init: GPU update H_in; memcpy_htod(buf E); GPU update H_e; memcpy_dtoh(buf H); MPI_Sendrecv(buf H); GPU update E_in; memcpy_htod(buf H); GPU update E_e; memcpy_dtoh(buf E); MPI_Sendrecv(buf E).
Non-blocking communication, per iteration after init: GPU update H_in while the CPU calls MPI_Waitall(comm E); memcpy_htod(buf E); GPU update H_e; memcpy_dtoh(buf H); MPI_Isend(buf H) and MPI_Irecv(buf H); GPU update E_in while the CPU calls MPI_Waitall(comm H); memcpy_htod(buf H); GPU update E_e; memcpy_dtoh(buf E); MPI_Isend(buf E) and MPI_Irecv(buf E).
MPI communications are overlapped with the update kernels that update the main grid on the GPU.
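The interior/boundary split behind this overlap can be sketched without MPI or CUDA. In the toy example below a thread-pool future stands in for the MPI_Irecv / MPI_Waitall pair, and a one-dimensional H update stands in for the GPU kernels; `fetch_halo`, `update_h_overlapped` and the constant `S` are hypothetical names, not part of the original code. The point it illustrates is that interior cells need no neighbor data and can be updated while the halo value is in flight, and that the result is identical to an update of the undivided grid.

```python
import time
from concurrent.futures import ThreadPoolExecutor

S = 0.5  # assumed update coefficient (dt/dx in normalized units)


def update_h_reference(e, h):
    """H update on the undivided grid; the last H has no right neighbour."""
    new_h = list(h)
    for i in range(len(h) - 1):
        new_h[i] = h[i] + S * (e[i + 1] - e[i])
    return new_h


def fetch_halo(neighbor_e):
    """Stand-in for a non-blocking receive of the neighbour's first E value."""
    time.sleep(0.01)  # pretend network latency
    return neighbor_e[0]


def update_h_overlapped(e_local, h_local, neighbor_e, pool):
    """Update one subdomain while its halo exchange is still in flight."""
    pending = pool.submit(fetch_halo, neighbor_e)   # plays the role of MPI_Irecv
    new_h = list(h_local)
    # Interior cells depend only on local data, so update them right away.
    for i in range(len(h_local) - 1):
        new_h[i] = h_local[i] + S * (e_local[i + 1] - e_local[i])
    halo = pending.result()                         # plays the role of MPI_Waitall
    # The boundary cell needs the neighbour's E value.
    new_h[-1] = h_local[-1] + S * (halo - e_local[-1])
    return new_h


if __name__ == "__main__":
    e = [float(i * i) for i in range(8)]
    h = [0.0] * 8
    with ThreadPoolExecutor(max_workers=1) as pool:
        left = update_h_overlapped(e[:4], h[:4], e[4:], pool)
    print(left == update_h_reference(e, h)[:4])
```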

14 Optimization techniques: Using the read-only cache
Data are read from global memory with the __ldg() function, so that they are cached in the read-only data cache.

15 Optimization techniques: Concurrent kernel execution
Data packing on the six sides for communication is independent, so the six packing kernels are executed concurrently.
This minimizes logic branches and maximizes GPU utilization.
Running kernels concurrently uses GPU resources more efficiently.


16 Optimization techniques: Asynchronous data transfer from device to host
Multiple CUDA streams are created to pack data and transfer them to the host asynchronously.
Sync mode: GPU pack top H; GPU pack right H; cudaMemcpy dtoh; MPI_Send.
Async mode, cudaStream 1: GPU pack top H; cudaMemcpyAsync dtoh; MPI_Isend.
Async mode, cudaStream 2: GPU pack right H; cudaMemcpyAsync dtoh; MPI_Isend.

17 Results: Test environment
Location: Shanghai, China.
50 GPU nodes, each with 2 NVIDIA Tesla Kepler K20m GPUs (4.7 GB with ECC on) and 2 Intel Xeon E CPUs (2.6 GHz, 20 MB cache), with 64 GB DDR3 memory.
Mellanox InfiniBand FDR switch.
CUDA 5.5.

18 Results
The numerical results are validated against the analytical solution for an electric dipole excitation source, using a Gaussian pulse:
$$p(t) = 10^{-10} \exp\left[-\left(\frac{t - 3T}{T}\right)^{2}\right], \qquad T = 2\ \mathrm{ns}$$
The relative error is less than 2%.
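As a small illustration of the excitation, the sketch below evaluates a Gaussian dipole pulse. It assumes the pulse has the form p(t) = 10^-10 exp(-((t - 3T)/T)^2) with T = 2 ns, which is a reading of the slide's formula rather than a confirmed transcription; the names `dipole_pulse`, `P0` and `T` are hypothetical.

```python
import math

T = 2e-9      # assumed pulse width parameter: 2 ns
P0 = 1e-10    # assumed peak amplitude of the dipole moment


def dipole_pulse(t):
    """Gaussian pulse centred at t = 3T (assumed form of the source)."""
    return P0 * math.exp(-(((t - 3.0 * T) / T) ** 2))


if __name__ == "__main__":
    for step in range(7):
        t = step * T
        print(f"t = {t:.1e} s  p(t) = {dipole_pulse(t):.3e}")
```

Delaying the peak to 3T means the source starts at about exp(-9) of its peak value, so switching it on at t = 0 introduces almost no discontinuity.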

19 Results: Scalability of the UPML-FDTD
The parallel efficiency on K20m is 82.5% (32 cards) and 67.5% (64 cards).

20 Results: Performance with and without optimization
The optimization techniques raise the speedup from 22.8x to 27.4x on 64 K20m cards.
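The two metrics quoted on slides 19 and 20 can be expressed as short helpers. The slides do not state the baseline used for parallel efficiency, so the sketch assumes the usual strong-scaling definition relative to a smaller run; the function names and the runtimes in the efficiency example are hypothetical, while the speedup figures 22.8 and 27.4 come from the slide.

```python
def parallel_efficiency(t_base, n_base, t_n, n):
    """Strong-scaling efficiency of a run on n cards relative to n_base cards.

    Assumed definition: ideal runtime (t_base * n_base / n) divided by the
    measured runtime t_n.
    """
    return (t_base * n_base) / (t_n * n)


def relative_gain(speedup_before, speedup_after):
    """Fractional improvement when the speedup rises from one value to another."""
    return speedup_after / speedup_before - 1.0


if __name__ == "__main__":
    # Hypothetical runtimes, for illustration only: 100 s on 12 cards, 30 s on 48.
    print(f"efficiency: {parallel_efficiency(100.0, 12, 30.0, 48):.1%}")
    # Speedups reported on the slide for 64 K20m cards.
    print(f"gain from optimization: {relative_gain(22.8, 27.4):.1%}")
```

Going from 22.8x to 27.4x corresponds to roughly a 20% improvement from the optimization techniques at 64 cards.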

21 Results: Testing the UPML-FDTD on 80 K20m cards
The parallel efficiency reaches up to 91% on 80 K20m cards.

22 Conclusion & Future Work
The three-dimensional UPML-FDTD algorithm achieves high numerical accuracy.
A set of optimization techniques improves the performance: non-blocking communication, the read-only data cache, concurrent kernel execution, and asynchronous data transfer.
The scalability of the algorithm is shown for up to 80 Tesla K20m GPUs.
The proposed solution brings more flexibility: it allows the CPU and GPU to be used simultaneously in the FDTD numerical simulation, combining CPU and GPU processing power with the CPU memory capacity.

23 Acknowledgments

24 Thanks for your attention!
