Perm State University Research-Education Center Parallel and Distributed Computing


Perm State University Research-Education Center Parallel and Distributed Computing
A 25-minute talk (S4493) at the GPU Technology Conference (GTC) 2014, March 24-27, 2014, San Jose, CA
GPU-accelerated modeling of coherent processes in magnetic nano-structures
Aleksey G. Demenev, Tatyana S. Belozerova, Petr V. Kharebov, Aleksandr V. Polyakov, Viktor K. Henner, Evgeniy K. Khenner
Perm State University, Russia 1

Outline Introduction; Physical model; Method of numerical modeling; Analysis of the potential parallelization of the initial codes; Creation of the Magnetodynamics-F program (description of the OpenMP and OpenACC versions); Examples of application of the program (case studies 1 and 2; experimental comparison of CPU+OpenMP vs GPU+OpenACC); Acceleration of the parallel algorithm (a priori and a posteriori estimates); Conclusions; Acknowledgments 2

Introduction. Problems The problem is the creation of high-performance and reliable software for computer simulation of the spin dynamics of magnetic nanostructures. The elements of such systems can be nano-molecules, nano-clusters, molecular crystals, etc. A spin is a magnetic moment (in the physics of magnetic phenomena) and the analogue of the classical angular momentum of a particle (in quantum mechanics). 3

Introduction. The spins of nanomolecules 4

Introduction. Coherent effects Coherent effects arise when the effective spin-spin interactions do not decrease with distance; the time scale of the relaxation processes is then inversely proportional to the number of spins. Superradiance is a coherent-effect phenomenon in which the radiated power is proportional to the square of the number of spins. The condition for its observation is a low-temperature sample in a passive resonator. A future prospect is the possible use of high-speed coherent processes in nanostructures in various kinds of sensors and switches, especially in nanodevices. The application domain is the development of technologies for producing nano-detectors of weak radiation and for the rapid creation of compact magnetic recording systems. 5
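Stated as scaling relations, with N the number of spins (the incoherent baseline $I \propto N$ is added here only for comparison):

$$\tau_{\mathrm{coh}} \propto \frac{1}{N}, \qquad I_{\mathrm{superradiant}} \propto N^{2} \quad (\text{versus } I_{\mathrm{incoherent}} \propto N).$$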

Introduction. Problems The collective dynamic behavior of many-spin systems is usually described by time correlation functions, so effective and reliable methods are required for computing such functions for spin systems far from equilibrium with long-range inter-particle interactions. The mathematical difficulty is the presence of a broad continuous spectrum of characteristic times of the processes that determine the multi-scale dynamics of the system. The computational complexity increases nonlinearly with the number of structural elements and with the observation time of the system for realistic models. The technological barrier is the unacceptably long time of sequential calculations. 6

Introduction. Approaches The approach to overcoming this barrier is parallelization of the algorithms, which significantly increases the number of structural elements and the evolution time of the systems available for study. Additional difficulties: the classical theory of convergence does not apply to parallel numerical methods; parallel algorithms have specific errors that are not characteristic of sequential ones; the overhead of parallel computation can cancel the benefits of parallelization. Additional tasks: verification to ensure the correctness of the results, and analysis and evaluation of the efficiency of mapping the computational algorithms onto modern parallel computer architectures. Promising architectures are hybrids of multi-core CPUs with many-core accelerators. 7

Physical model 8

Physical model 9

Physical model. The system of equations 10

Physical model. The system of equations 11
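As an illustrative sketch only (the exact field terms and notation used in Magnetodynamics-F may differ from this generic form), the classical dynamics of N interacting magnetic moments is usually written as a precession equation for each spin in its local effective field:

$$\frac{d\mathbf{S}_i}{dt} = -\gamma\,\mathbf{S}_i \times \mathbf{H}_i^{\mathrm{eff}}, \qquad \mathbf{H}_i^{\mathrm{eff}} = \mathbf{H}_{\mathrm{ext}} + \sum_{j \ne i} \mathbf{H}_{ij}^{\mathrm{dip}} + \mathbf{H}_{\mathrm{res}}, \qquad i = 1,\dots,N,$$

where $\gamma$ is the gyromagnetic ratio, $\mathbf{H}_{\mathrm{ext}}$ the external field, $\mathbf{H}_{ij}^{\mathrm{dip}}$ the dipole field of spin j at the position of spin i, and $\mathbf{H}_{\mathrm{res}}$ the feedback field of the passive resonator. The all-to-all sum over $j \ne i$ is what makes the cost of one time step grow quadratically with the number of particles (see slide 15).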

Method of numerical modeling 12

Method of numerical modeling 13

Analysis of the potential parallelization of the initial codes The initial software implemented only sequential algorithms under MS Windows: the program "Spins" in the MS Visual Studio C++ environment, and the program "MagnetoDynamics" in the Borland Delphi environment. These restrictions prevented the effective use of high-performance computing in research: most supercomputers run the Linux operating system. Porting Spins to a cross-platform environment is difficult because the Microsoft .NET 4.0 library it uses is not cross-platform. Porting MagnetoDynamics to a cross-platform development environment is difficult because the Borland Delphi language has no international standard. Therefore, the program MagnetoDynamics-F was written anew in Fortran as an HPC code. 14

Analysis of the potential parallelization of the initial codes Methods: analysis of the information structure of the algorithms; asymptotic analysis of the algorithms' complexity. Computational complexity: the cost T(1) of the algorithms grows asymptotically quadratically with an increasing number of simulated nanoparticles at a constant integration step, and directly proportionally to the number of integration steps when the step is chosen automatically. The memory cost Mem(1) of the Magnetodynamics algorithms grows asymptotically quadratically with an increasing number of simulated nanoparticles, which is better than Mem(1) of the Spins algorithm. Typical problems are considered. Asymptotic estimates of the speedup and efficiency of the multi-threaded parallel algorithms implemented in the codes are obtained a priori: by theory, in accordance with Amdahl's law; and by semiempirical formulae that take into account the overhead of multi-threading support on multi-core processors and many-core accelerators, as sketched below. 15
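A minimal sketch of the a priori estimate, assuming a fixed sequential fraction f of the single-thread run time T(1) and p threads; the generic overhead term $T_{\mathrm{ovh}}(p)$ stands in for the multi-threading support costs (the authors' exact semiempirical formula is not reproduced here):

$$S_{\mathrm{Amdahl}}(p) = \frac{1}{f + \dfrac{1-f}{p}}, \qquad S_{\mathrm{semi}}(p) = \frac{T(1)}{f\,T(1) + \dfrac{(1-f)\,T(1)}{p} + T_{\mathrm{ovh}}(p)}.$$

For example, with a hypothetical f = 0.05 and p = 12 threads, the Amdahl bound is $1/(0.05 + 0.95/12) \approx 7.7$; the overhead term only lowers this further.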

Creation of the Magnetodynamics-F program The parallel Fortran code Magnetodynamics-F was created using the OpenMP and OpenACC application programming interfaces. The first (sequential) part includes reading the input parameters, creating the output files, and setting up the spin system with a given polarization. The second (parallelized) part includes the integration of the equations of motion and the calculation of the intensity of the magnetic dipole radiation. In the OpenMP version, the loops computing the right-hand sides of the equations of motion and the loop computing the intensities of the magnetic dipole radiation were multithreaded and automatically vectorized by the compiler. In the OpenACC version, only the loops computing the right-hand sides of the equations of motion were multithreaded and automatically vectorized by the compiler. The program was compiled with the Intel Fortran Compiler 2011 and PGI Accelerator Server 13.1. A sketch of how such a loop can be annotated is shown below. 16
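A minimal sketch, not the authors' actual code, of how one of the right-hand-side loops can be annotated for OpenMP and for OpenACC (the subroutine, array names, and loop body are hypothetical; only one set of directives is active in a given build):

```fortran
! Sketch: right-hand sides of the precession equations for n spins.
! s(:,i) is the i-th spin, h(:,i) a precomputed effective field, dsdt(:,i) its time derivative.
subroutine rhs(n, s, h, dsdt)
  implicit none
  integer, intent(in)  :: n
  real(8), intent(in)  :: s(3, n), h(3, n)
  real(8), intent(out) :: dsdt(3, n)
  integer :: i

  !$omp parallel do            ! OpenMP build: distribute iterations across CPU threads
  !$acc parallel loop          ! OpenACC build: offload the loop to the GPU
  do i = 1, n
     ! dS_i/dt ~ S_i x H_i (cross product; physical constants omitted)
     dsdt(1, i) = s(2, i)*h(3, i) - s(3, i)*h(2, i)
     dsdt(2, i) = s(3, i)*h(1, i) - s(1, i)*h(3, i)
     dsdt(3, i) = s(1, i)*h(2, i) - s(2, i)*h(1, i)
  end do
end subroutine rhs
```

In the OpenACC build, data directives (e.g. an !$acc data region around the time-integration loop) are typically also needed so that the arrays stay resident on the GPU between steps; they are omitted from this sketch.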

Creation of the Magnetodynamics-F program 17

Case 1. CPU+OpenMP vs GPU+OpenACC Case 1 is a computation with about 1000 particles. The PGI Accelerator 13.1 compiler was used because it supports both standards, OpenACC and OpenMP. The experimental speedup of the OpenMP version is roughly equal to the number of CPU cores of the dual Intel Xeon 5670 system. The acceleration of the OpenACC version on an NVIDIA Tesla 2050 (448 CUDA cores) is about 2x better than that of the OpenMP version on one Intel Xeon 5670 (6 cores), and equal to that of the dual Intel Xeon 5670 system (12 cores). It is appropriate to use computers with GPU accelerators for studying the magnetic dynamics of systems with N greater than about 1000. 18

Case 2. CPU+OpenMP vs GPU+OpenACC Case 2 is a computation with about 5000 particles. The PGI Accelerator 13.1 compiler was used because it supports both standards, OpenACC and OpenMP. The experimental speedup of the OpenMP version is roughly equal to the number of CPU cores of the dual Intel Xeon E5-2680 system (16 cores). The acceleration of the OpenACC version on an NVIDIA Tesla K20 (2496 CUDA cores) is nearly 5x better than that of the OpenMP version on one Intel Xeon E5-2680 (8 cores), and over 2x better than that of the OpenMP version on the dual Intel Xeon E5-2680 system (16 cores). It is shown that the use of NVIDIA Tesla accelerates simulation in studies of the magnetic dynamics of systems containing thousands of magnetic nanoparticles. 19

Case 2. CPU+OpenMP vs GPU+OpenACC 20

Acceleration of the parallel algorithm 21

Acceleration of the parallel algorithm 22

Conclusions The multi-scale molecular dynamics of systems of nanomagnets is investigated by numerical simulation using parallel algorithms. The Fortran code Magnetodynamics-F supports several types of research: study of the possibility of controlling the switching time of the magnetic moment of a nanostructure; estimation of the role of nanocrystal geometry in the super-radiation of 1-, 2- and 3-dimensional objects; study of the magnetodynamics of nanodots inductively coupled to a passive resonator; study of the dependence of the solution on the initial orientation of the magnetic moments, in order to find configurations for which super-radiance and radiative damping are maximal. The parallel programs were created using the OpenMP and OpenACC application programming interfaces. Estimates of the speedup and efficiency of the implemented algorithms in comparison with the sequential algorithms have been obtained. It is shown that the use of NVIDIA Tesla accelerates simulation in studies of the magnetic dynamics of systems containing thousands of magnetic nanoparticles. 23

Acknowledgments The work was carried out at the Research-Education Center Parallel and Distributed Computing of Perm State University, Russia. We used the supercomputers "PSU-Tesla" (T-Platforms, December 2010) and "PSU-Kepler" (IBM + TC "Garmoniya", December 2012), unique equipment purchased under the Perm State University Development Programme as a national research university. The work was supported by the Russian Foundation for Basic Research and the Perm Krai Government (projects 11-07-96007 and 13-02-96018). 24

Contacts Aleksey Demenev, PhD, Assoc. Prof.; Director of the Research-Education Center Parallel and Distributed Computing of Perm State University. Phone: +7(342)2396409, fax: +7(342)2396584. E-mail: A-demenev@psu.ru http://demenev.livejournal.com