Perm State University Research-Education Center Parallel and Distributed Computing


Perm State University Research-Education Center Parallel and Distributed Computing
A 25-minute talk (S4493) at the GPU Technology Conference (GTC) 2014, March 24-27, 2014, San Jose, CA
GPU-accelerated modeling of coherent processes in magnetic nano-structures
Aleksey G. Demenev, Tatyana S. Belozerova, Petr V. Kharebov, Aleksandr V. Polyakov, Viktor K. Henner, Evgeniy K. Khenner
Perm State University, Russia 1

Outline Introduction; Physical model; Method of numerical modeling; Analysis of the potential parallelization of the initial codes; Creation of the Magnetodynamics-F program (description of the OpenMP and OpenACC versions); Examples of application of the program (case studies 1 and 2; experimental comparison of CPU+OpenMP vs GPU+OpenACC); Acceleration of the parallel algorithm (a priori and a posteriori estimates); Conclusions; Acknowledgments 2

Introduction. Problems The problem is the creation of high-performance and reliable software for computer simulation of the spin dynamics of magnetic nanostructures. The elements of such systems can be nano-molecules, nano-clusters, molecular crystals, etc. A spin is a magnetic moment (in the physics of magnetic phenomena) and the analogue of the classical angular momentum of a particle (in quantum mechanics). 3

Introduction. The spins of nanomolecules 4

Introduction. Coherent effects Coherent effects arise when the effective spin-spin interactions do not decrease with distance; the time scale of the relaxation processes is then inversely proportional to the number of spins. Superradiance is a coherent-effect phenomenon in which the radiated power is proportional to the square of the number of spins. The condition for its observation is a low-temperature sample in a passive resonator. A future prospect is the possible use of high-speed coherent processes in nanostructures in various kinds of sensors and switches, especially in nanodevices. The application domain is the development of technologies for producing nano-detectors of weak radiation and for the rapid creation of compact magnetic recording systems. 5
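Stated as scaling relations, with N the number of spins (the incoherent baseline $I \propto N$ is added here only for comparison):

$$\tau_{\mathrm{coh}} \propto \frac{1}{N}, \qquad I_{\mathrm{superradiant}} \propto N^{2} \quad (\text{versus } I_{\mathrm{incoherent}} \propto N).$$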

Introduction. Problems The collective dynamic behavior of many-spin systems is usually described by time correlation functions, so effective and reliable methods are required for computing such functions for spin systems far from equilibrium with long-range inter-particle interactions. The mathematical difficulty is the presence of a broad continuous spectrum of characteristic times of the processes that determine the multi-scale dynamics of the system. The computational complexity increases nonlinearly with the number of structural elements and with the observation time of the system for realistic models. The technological barrier is the unacceptably long time of sequential calculations. 6

Introduction. Approaches The approach to overcoming this barrier is parallelization of the algorithms, which significantly increases the number of structural elements and the evolution time of the systems available for study. Additional difficulties: the classical theory of convergence does not apply to parallel numerical methods; parallel algorithms have specific errors that are not characteristic of sequential ones; the overhead of parallel computation can cancel the benefits of parallelization. Additional tasks: verification to ensure the correctness of the results, and analysis and evaluation of the efficiency of mapping the computational algorithms onto modern parallel computer architectures. Promising architectures are hybrids of multi-core CPUs with many-core accelerators. 7

Physical model 8

Physical model 9

Physical model. The system of equations 10

Physical model. The system of equations 11
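As an illustrative sketch only (the exact field terms and notation used in Magnetodynamics-F may differ from this generic form), the classical dynamics of N interacting magnetic moments is usually written as a precession equation for each spin in its local effective field:

$$\frac{d\mathbf{S}_i}{dt} = -\gamma\,\mathbf{S}_i \times \mathbf{H}_i^{\mathrm{eff}}, \qquad \mathbf{H}_i^{\mathrm{eff}} = \mathbf{H}_{\mathrm{ext}} + \sum_{j \ne i} \mathbf{H}_{ij}^{\mathrm{dip}} + \mathbf{H}_{\mathrm{res}}, \qquad i = 1,\dots,N,$$

where $\gamma$ is the gyromagnetic ratio, $\mathbf{H}_{\mathrm{ext}}$ the external field, $\mathbf{H}_{ij}^{\mathrm{dip}}$ the dipole field of spin j at the position of spin i, and $\mathbf{H}_{\mathrm{res}}$ the feedback field of the passive resonator. The all-to-all sum over $j \ne i$ is what makes the cost of one time step grow quadratically with the number of particles (see slide 15).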

Method of numerical modeling 12

Method of numerical modeling 13

Analysis of the potential parallelization of the initial codes The initial software implemented only sequential algorithms under MS Windows: the program "Spins" in the MS Visual Studio C++ environment, and the program "MagnetoDynamics" in the Borland Delphi environment. These restrictions prevented the effective use of high-performance computing in research: most supercomputers run the Linux operating system. Porting Spins to a cross-platform environment is difficult because the Microsoft .NET 4.0 library it uses is not cross-platform. Porting MagnetoDynamics to a cross-platform development environment is difficult because the Borland Delphi language has no international standard. Therefore, the program MagnetoDynamics-F was written anew in Fortran as an HPC code. 14

Analysis of the potential parallelization of the initial codes Methods: analysis of the information structure of the algorithms; asymptotic analysis of the algorithms' complexity. Computational complexity: the cost T(1) of the algorithms grows asymptotically quadratically with an increasing number of simulated nanoparticles at a constant integration step, and directly proportionally to the number of integration steps when the step is chosen automatically. The memory cost Mem(1) of the Magnetodynamics algorithms grows asymptotically quadratically with an increasing number of simulated nanoparticles, which is better than Mem(1) of the Spins algorithm. Typical problems are considered. Asymptotic estimates of the speedup and efficiency of the multi-threaded parallel algorithms implemented in the codes are obtained a priori: by theory, in accordance with Amdahl's law; and by semiempirical formulae that take into account the overhead of multi-threading support on multi-core processors and many-core accelerators, as sketched below. 15
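A minimal sketch of the a priori estimate, assuming a fixed sequential fraction f of the single-thread run time T(1) and p threads; the generic overhead term $T_{\mathrm{ovh}}(p)$ stands in for the multi-threading support costs (the authors' exact semiempirical formula is not reproduced here):

$$S_{\mathrm{Amdahl}}(p) = \frac{1}{f + \dfrac{1-f}{p}}, \qquad S_{\mathrm{semi}}(p) = \frac{T(1)}{f\,T(1) + \dfrac{(1-f)\,T(1)}{p} + T_{\mathrm{ovh}}(p)}.$$

For example, with a hypothetical f = 0.05 and p = 12 threads, the Amdahl bound is $1/(0.05 + 0.95/12) \approx 7.7$; the overhead term only lowers this further.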

Creation of the Magnetodynamics-F program The parallel Fortran code Magnetodynamics-F was created using the OpenMP and OpenACC application programming interfaces. The first (sequential) part includes reading the input parameters, creating the output files, and setting up the spin system with a given polarization. The second (parallelized) part includes the integration of the equations of motion and the calculation of the intensity of the magnetic dipole radiation. In the OpenMP version, the loops computing the right-hand sides of the equations of motion and the loop computing the intensities of the magnetic dipole radiation were multithreaded and automatically vectorized by the compiler. In the OpenACC version, only the loops computing the right-hand sides of the equations of motion were multithreaded and automatically vectorized by the compiler. The program was compiled with the Intel Fortran Compiler 2011 and PGI Accelerator Server 13.1. A sketch of how such a loop can be annotated is shown below. 16
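A minimal sketch, not the authors' actual code, of how one of the right-hand-side loops can be annotated for OpenMP and for OpenACC (the subroutine, array names, and loop body are hypothetical; only one set of directives is active in a given build):

```fortran
! Sketch: right-hand sides of the precession equations for n spins.
! s(:,i) is the i-th spin, h(:,i) a precomputed effective field, dsdt(:,i) its time derivative.
subroutine rhs(n, s, h, dsdt)
  implicit none
  integer, intent(in)  :: n
  real(8), intent(in)  :: s(3, n), h(3, n)
  real(8), intent(out) :: dsdt(3, n)
  integer :: i

  !$omp parallel do            ! OpenMP build: distribute iterations across CPU threads
  !$acc parallel loop          ! OpenACC build: offload the loop to the GPU
  do i = 1, n
     ! dS_i/dt ~ S_i x H_i (cross product; physical constants omitted)
     dsdt(1, i) = s(2, i)*h(3, i) - s(3, i)*h(2, i)
     dsdt(2, i) = s(3, i)*h(1, i) - s(1, i)*h(3, i)
     dsdt(3, i) = s(1, i)*h(2, i) - s(2, i)*h(1, i)
  end do
end subroutine rhs
```

In the OpenACC build, data directives (e.g. an !$acc data region around the time-integration loop) are typically also needed so that the arrays stay resident on the GPU between steps; they are omitted from this sketch.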

Creation of the Magnetodynamics-F program 17

Case 1. CPU+OpenMP vs GPU+OpenACC Case 1 is a computation with about 1000 particles. The PGI Accelerator 13.1 compiler was used because it supports both standards, OpenACC and OpenMP. The experimental speedup of the OpenMP version is roughly equal to the number of CPU cores of the dual Intel Xeon 5670 system. The acceleration of the OpenACC version on an NVIDIA Tesla 2050 (448 CUDA cores) is about 2x better than that of the OpenMP version on one Intel Xeon 5670 (6 cores), and equal to that of the dual Intel Xeon 5670 system (12 cores). It is appropriate to use computers with GPU accelerators for studying the magnetic dynamics of systems with N greater than about 1000. 18

Case 2. CPU+OpenMP vs GPU+OpenACC Case 2 is a computation with about 5000 particles. The PGI Accelerator 13.1 compiler was used because it supports both standards, OpenACC and OpenMP. The experimental speedup of the OpenMP version is roughly equal to the number of CPU cores of the dual Intel Xeon E5-2680 system (16 cores). The acceleration of the OpenACC version on an NVIDIA Tesla K20 (2496 CUDA cores) is nearly 5x better than that of the OpenMP version on one Intel Xeon E5-2680 (8 cores), and over 2x better than that of the OpenMP version on the dual Intel Xeon E5-2680 system (16 cores). It is shown that the use of NVIDIA Tesla accelerates simulation in studies of the magnetic dynamics of systems containing thousands of magnetic nanoparticles. 19

Case 2. CPU+OpenMP vs GPU+OpenACC 20

Acceleration of the parallel algorithm 21

Acceleration of the parallel algorithm 22

Conclusions The multi-scale molecular dynamics of systems of nanomagnets is investigated by numerical simulation using parallel algorithms. The Fortran code Magnetodynamics-F supports several types of research: study of the possibility of controlling the switching time of the magnetic moment of a nanostructure; estimation of the role of nanocrystal geometry in the super-radiation of 1-, 2- and 3-dimensional objects; study of the magnetodynamics of nanodots inductively coupled to a passive resonator; study of the dependence of the solution on the initial orientation of the magnetic moments, in order to find configurations for which super-radiance and radiative damping are maximal. The parallel programs were created using the OpenMP and OpenACC application programming interfaces. Estimates of the speedup and efficiency of the implemented algorithms in comparison with the sequential algorithms have been obtained. It is shown that the use of NVIDIA Tesla accelerates simulation in studies of the magnetic dynamics of systems containing thousands of magnetic nanoparticles. 23

Acknowledgments The work was carried out at the Research-Education Center Parallel and Distributed Computing of Perm State University, Russia. We used the supercomputers "PSU-Tesla" (T-Platforms, December 2010) and "PSU-Kepler" (IBM + TC "Garmoniya", December 2012), unique equipment purchased under the Perm State University Development Programme as a national research university. The work was supported by the Russian Foundation for Basic Research and the Perm Krai Government (projects 11-07-96007 and 13-02-96018). 24

Contacts Aleksey Demenev, PhD, Assoc. Prof.; Director of the Research-Education Center Parallel and Distributed Computing of Perm State University. Phone: +7(342)2396409, fax: +7(342)2396584. E-mail: A-demenev@psu.ru http://demenev.livejournal.com