Parallelization of the Dirac operator. Pushan Majumdar, Indian Association for the Cultivation of Science, Jadavpur, Kolkata

Outline: Introduction; Algorithms; Parallelization; Comparison of performances; Conclusions

The use of QCD to describe the strong force was motivated by a whole series of experimental and theoretical discoveries made in the 1960s and 70s: the successful classification of particles by the quark model, and the multiplicities of certain high-energy strong interactions. QCD is a non-abelian gauge theory with gauge group SU(3). The matter fields describe quarks, which are spin-1/2 objects transforming under the fundamental representation of SU(3). The QCD action is $S = \bar{\psi}\,\slashed{D}(A)\,\psi + F^2(A)$. To date, QCD is our best candidate theory for describing the strong interactions.

Observables in non-abelian gauge theories: $\langle O \rangle = Z^{-1}\int \mathcal{D}A\,\mathcal{D}\bar{\psi}\,\mathcal{D}\psi\; O\, e^{-S}$, where $S = \bar{\psi}\,\slashed{D}(A)\,\psi + F^2(A)$. S is quartic in A, so the functional integral cannot be done analytically. Two options: (1) expand $e^{-S}$ as a series in the coupling g; (2) estimate the integral numerically. Option (1) works for small values of g and leads to perturbative QCD. Option (2) is fully non-perturbative, but requires a discretization for its implementation; this leads to lattice gauge theory. The value of g depends on the energy with which you probe the system.

Wilson's LGT: a discretization maintaining gauge invariance. Finite lattice spacing: UV cut-off. Finite lattice size: IR cut-off. Periodic boundary conditions to avoid end effects ($T^4$ topology). Finally one must extrapolate to zero lattice spacing.

Dimension of the integral: $4N^4$ (one integration per link). Domain of integration: the manifold of the group SU(3) for each link. For example, on a $16^4$ lattice this is $4 \times 16^4 = 262144$ SU(3) integrations, each over an 8-dimensional group manifold. Such an integral is not possible to do exactly, so one uses sampling techniques.

Molecular Dynamics. The Euclidean path integral of the quantum theory is the partition function of a classical statistical-mechanical system in 4 space dimensions, with dynamics in simulation time:
$$\langle O\rangle_{\rm can}(\beta)\;\xrightarrow{\text{thermodynamic limit}}\;\big[\langle O\rangle_{\rm micro}\big]_{E=\bar E(\beta)}\;\xrightarrow{\text{ergodicity}}\;\lim_{T\to\infty}\frac{1}{T}\int_0^T d\tau\, O(\phi_i(\tau)),$$
i.e. the canonical ensemble average equals the microcanonical ensemble average at $E=\bar E(\beta)$ (the energy at the equilibrium temperature), which in turn equals the time average over the classical trajectories
$$\frac{d^2\phi_i}{d\tau^2} = -\frac{\partial S[\phi]}{\partial \phi_i}$$
(the configuration $\phi$ having energy $\bar E$).

The complexity of the motion in the higher-dimensional system plays the role of the quantum fluctuations in the original theory:
$$\langle O\rangle = \frac{1}{Z}\int \mathcal{D}\phi\; O[\phi]\, e^{-S[\beta,\phi]} \quad\text{(original theory)},$$
$$\langle O\rangle = \frac{1}{Z}\int \mathcal{D}\phi\,\mathcal{D}\pi\; O[\phi]\, e^{-\frac{1}{2}\sum_i\pi_i^2 - S[\beta,\phi]} \quad\text{(with auxiliary field $\pi$)}.$$
Equations of motion: with $H_{4\,\rm dim} = \frac{1}{2}\sum_i \pi_i^2 + S[\beta,\phi]$,
$$\dot\phi_i = \frac{\partial H[\phi,\pi]}{\partial \pi_i}, \qquad \dot\pi_i = -\frac{\partial H[\phi,\pi]}{\partial \phi_i}.$$
Problems: the dynamics is not ergodic, and it is not possible to take the thermodynamic limit.

Hybrid algorithms: Langevin dynamics + Molecular Dynamics. MD is used to move fast through configuration space, while drawing the momenta from a random distribution (the Langevin step) makes the algorithm ergodic. Implementation: (i) Choose $\{\phi_i\}$ in some arbitrary way. (ii) Choose $\{\pi_i\}$ from a Gaussian (random) ensemble, $P\{\pi_i\} = \left(\prod_i \frac{1}{\sqrt{2\pi}}\right) e^{-\frac{1}{2}\sum_i \pi_i^2}$. (iii) Evaluate $\pi_i(1) \to \pi_i(1+\epsilon/2)$, the initial half-step for the momenta. (iv) Iterate the MD equations for a few steps and store $\phi_i(n)$ (to be used as $\phi_i(1)$ for the next MD update). (v) Go back to step (ii) and repeat. The systematic error introduced by the finite time step remains.
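A minimal sketch of the momentum refresh in step (ii), drawing each $\pi_i$ from a unit Gaussian with the Box-Muller transform (the routine name, the array size npi and the use of the intrinsic random_number are illustrative assumptions; a production code would use its own random number generator):

subroutine refresh_momenta(pmom, npi)
  ! Draw each pmom(i) from the Gaussian distribution exp(-pmom(i)**2/2)
  ! using the Box-Muller transform.
  implicit none
  integer, intent(in)  :: npi
  real(8), intent(out) :: pmom(npi)
  real(8) :: u1, u2, twopi
  integer :: i
  twopi = 8.d0*atan(1.d0)
  do i = 1, npi, 2
     call random_number(u1)
     call random_number(u2)
     u1 = max(u1, 1.d-12)                       ! guard against log(0)
     pmom(i) = sqrt(-2.d0*log(u1))*cos(twopi*u2)
     if (i+1 <= npi) pmom(i+1) = sqrt(-2.d0*log(u1))*sin(twopi*u2)
  end do
end subroutine refresh_momenta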

Hybrid Monte Carlo: the Hybrid algorithm plus a Metropolis accept/reject step. Implementation: (0) Start with a configuration $\{\phi^1_i, \pi^1_i\}$. (i)-(iv) Do the first four steps as in the Hybrid algorithm to reach $\{\phi^N_i, \pi^N_i\}$ after N MD steps. (v) Accept the configuration $\{\phi^N_i, \pi^N_i\}$ with probability $p_{\rm acc} = \min\left\{1,\; \frac{\exp(-H[\phi^N_i, \pi^N_i])}{\exp(-H[\phi^1_i, \pi^1_i])}\right\}$. (vi) If $\{\phi^N_i, \pi^N_i\}$ is not accepted, keep $\{\phi^1_i, \pi^1_i\}$ and choose new momenta; otherwise make $\{\phi^N_i, \pi^N_i\}$ the new $\{\phi^1_i, \pi^1_i\}$ and choose new momenta. This algorithm satisfies detailed balance (necessary for reaching the equilibrium distribution) and eliminates the error due to the finite time step.
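Step (v) is a Metropolis test on the change in H. A minimal sketch (the function name and the use of random_number are illustrative assumptions):

logical function accept_hmc(h_old, h_new)
  ! Accept with probability min(1, exp(-(h_new - h_old))).
  implicit none
  real(8), intent(in) :: h_old, h_new
  real(8) :: r, dh
  dh = h_new - h_old
  if (dh <= 0.d0) then
     accept_hmc = .true.              ! H decreased: always accept
  else
     call random_number(r)
     accept_hmc = (r < exp(-dh))      ! accept with probability exp(-dh)
  end if
end function accept_hmc

If exact arithmetic were used, H would be conserved along the trajectory and every proposal would be accepted; the test only corrects the error of the finite integration step.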

Equilibration (figure): $\beta = 5.2$, $am = 0.03$. From top to bottom: mean plaquette; number of CG iterations for the accept/reject step; $\min(\mathrm{Re}(\lambda))$, where $\lambda$ is an eigenvalue of the Dirac operator; topological charge $\nu$. The values in the lower two rows have been determined only for every 5th configuration.

Hybrid Monte Carlo - parallelization. Generate the Gaussian noise (random momenta): cheap. Integrate the Hamiltonian equations of motion: the integration is done with the leap-frog scheme (improved schemes such as Sexton-Weingarten or Omelyan exist); this part is expensive and is the target for parallelization. Accept/Reject step: corrects for the numerical errors introduced during the integration of the equations of motion.
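For a scalar field the leap-frog scheme can be sketched as follows (the force routine, the array size n and the step count are assumptions; for the gauge links the field update is replaced by the group exponential, $U \to e^{i\epsilon p}\,U$, so that U stays in SU(3)):

subroutine leapfrog(phi, p, n, nsteps, eps)
  ! Leap-frog integration of  dphi/dt = p,  dp/dt = -dS/dphi.
  ! force(phi, f, n) is assumed to return f = -dS/dphi.
  implicit none
  integer, intent(in)    :: n, nsteps
  real(8), intent(in)    :: eps
  real(8), intent(inout) :: phi(n), p(n)
  real(8) :: f(n)
  integer :: k

  call force(phi, f, n)
  p = p + 0.5d0*eps*f                ! initial half-step for the momenta
  do k = 1, nsteps
     phi = phi + eps*p               ! full step for the fields
     call force(phi, f, n)
     if (k < nsteps) then
        p = p + eps*f                ! full step for the momenta
     else
        p = p + 0.5d0*eps*f          ! closing half-step
     end if
  end do
end subroutine leapfrog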

Example: lattice gauge theory. Degrees of freedom: SU(3) matrices and complex vector fields. For two flavours of mass-degenerate quarks,
$$H = \tfrac{1}{2}\,\mathrm{tr}\,p^2 + \text{Gauge Action}(U) + \phi^\dagger (M^\dagger M)^{-1}(U)\,\phi,$$
where M is a matrix representation of the Dirac operator and the U's are SU(3) matrices. The Hamiltonian equations of motion update the p's and the U's; $\phi$ is held constant during the molecular dynamics. The equations of motion are $\dot U = i\,p\,U$ and $\partial H/\partial t = 0$.

The conjugate gradient algorithm. The most expensive part of the calculation is the evaluation of $\partial_t\,\phi^\dagger (M^\dagger M)^{-1}\phi$, which requires the evaluation of $(M^\dagger M)^{-1}\phi$ at every molecular-dynamics step. For a hermitian (positive-definite) matrix A the inversion is best done by conjugate gradient: if we want to evaluate $x = A^{-1}b$, we can just as well solve the linear system $Ax = b$, i.e. minimize the quadratic form $\tfrac{1}{2}x^\dagger A x - b^\dagger x$.
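The inverse enters through the fermion force: differentiating the pseudofermion term with respect to molecular-dynamics time and using $\partial(A^{-1}) = -A^{-1}(\partial A)A^{-1}$ gives (a standard manipulation, sketched here for orientation)
$$\partial_t\Big[\phi^\dagger (M^\dagger M)^{-1}\phi\Big] = -\,\phi^\dagger (M^\dagger M)^{-1}\,\partial_t(M^\dagger M)\,(M^\dagger M)^{-1}\phi,$$
so every force evaluation needs the solution $X$ of the linear system $(M^\dagger M)X = \phi$.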

Let $\{p_n\}$ be a sequence of conjugate vectors, $\langle p_i, A p_j\rangle = 0$ for $i \ne j$. Then $\{p_n\}$ forms a basis of $\mathbb{R}^n$. Expanding the solution of $Ax = b$ as $x = \sum_i \alpha_i p_i$, one finds $\alpha_i = \dfrac{\langle p_i, b\rangle}{\langle p_i, A p_i\rangle}$. This requires the repeated application of A to the $p_i$ and then dot products between various vectors. Objective: try to parallelize the matrix-vector multiplication $A\,p$. Two options: OpenMP and MPI.
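A minimal sketch of the conjugate-gradient iteration for a real symmetric positive-definite matrix, showing where the matrix-vector product and the dot products enter (the routine matvec, the tolerance handling and the dense real storage are illustrative assumptions; in the lattice code the matrix-vector product is the application of $M^\dagger M$):

subroutine cg_solve(matvec, b, x, n, tol, maxit)
  ! Solve A x = b for a symmetric positive-definite matrix A.
  ! matvec(v, av, n) is assumed to return av = A v.
  implicit none
  external :: matvec
  integer, intent(in)  :: n, maxit
  real(8), intent(in)  :: b(n), tol
  real(8), intent(out) :: x(n)
  real(8) :: r(n), p(n), ap(n)
  real(8) :: rr, rr_new, alpha, beta
  integer :: it

  x  = 0.d0
  r  = b                       ! residual r = b - A*x with x = 0
  p  = r                       ! first search direction
  rr = dot_product(r, r)
  do it = 1, maxit
     call matvec(p, ap, n)     ! the expensive step: one application of A
     alpha = rr / dot_product(p, ap)
     x = x + alpha*p
     r = r - alpha*ap
     rr_new = dot_product(r, r)
     if (sqrt(rr_new) < tol) exit
     beta = rr_new / rr        ! keeps the new direction conjugate to the old ones
     p = r + beta*p
     rr = rr_new
  end do
end subroutine cg_solve

Only the call to matvec and the dot products involve the full vectors, which is why the rest of the talk concentrates on parallelizing the matrix-vector multiplication.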

OpenMP concentrates on parallelizing loops; the compiler needs to be told which loops can safely be parallelized. Evaluation of b = D a (diagonal term):

!$OMP PARALLEL DO
do isb = 1,nsite
   do idb = 1,4
      do icb = 1,3
         b(icb,idb,isb) = diag*a(icb,idb,isb)
      end do
   end do
end do
!$OMP END PARALLEL DO

Hopping-term part of the same evaluation:

!$OMP PARALLEL DO PRIVATE(isb,ida,ica,idb,icb,idir,isap,isam,uu,uudag) SHARED(a,b)
do isb = 1,nsite
   do idir = 1,4
      isap = neib(idir,isb)
      isam = neib(-idir,isb)
      call uextract(uu,idir,isb)
      call udagextract(uudag,idir,isam)
      ! internal loop over colour & Dirac indices (ida,ica,idb,icb)
      b(icb,idb,isb) = b(icb,idb,isb)                          &
           & - gp(idb,ida,idir)*uu(icb,ica)*a(ica,ida,isap)    &
           & - gm(idb,ida,idir)*uudag(icb,ica)*a(ica,ida,isam)
   enddo
enddo
!$OMP END PARALLEL DO

Pros: Simple. Data layout and decomposition are handled automatically by directives. Unified code for both serial and parallel applications. Original (serial) code statements need not, in general, be modified when parallelizing with OpenMP.

Cons: Currently runs efficiently only on shared-memory multiprocessor platforms. Requires a compiler that supports OpenMP. Scalability is limited by the memory architecture. Reliable error handling is missing.

What is MPI? A standard protocol for passing messages between parallel processes. Small: only six library functions are needed to write any parallel code. Large: there are more than 200 functions in MPI-2. It uses communicators and topologies to organize the patterns of communication; a communicator is a handle identifying a collection of processes that can send messages to each other. It can be called from Fortran, C or C++ codes. Small MPI: MPI_Init, MPI_Comm_size, MPI_Comm_rank, MPI_Send, MPI_Recv, MPI_Finalize.
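For orientation, these six calls are enough for a complete (if trivial) message-passing program. A minimal sketch, not taken from the HMC code (which uses the collective MPI_ALLGATHER instead of explicit sends and receives) and assuming it is run on at least two processes:

program ping
  ! Rank 0 sends one double to rank 1 using the six basic MPI calls.
  implicit none
  include "mpif.h"
  integer :: rank, nprocs, ierr
  integer :: stat(MPI_STATUS_SIZE)
  real(8) :: x

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

  if (rank == 0) then
     x = 3.14d0
     call MPI_SEND(x, 1, MPI_DOUBLE_PRECISION, 1, 99, MPI_COMM_WORLD, ierr)
  else if (rank == 1) then
     call MPI_RECV(x, 1, MPI_DOUBLE_PRECISION, 0, 99, MPI_COMM_WORLD, stat, ierr)
     print *, "rank 1 received", x
  end if

  call MPI_FINALIZE(ierr)
end program ping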

program hmc_main
  use paramod
  use params
  implicit none
  integer i, ierr, numprocs
  include "mpif.h"

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
  print *, "Process ", rank, " of ", numprocs, " is alive"

  call init_once
  call do_hmc_wilson

  call MPI_FINALIZE(ierr)
end program hmc_main

Launched with: mpirun -n 8 ./executable < input > out &

subroutine ddagonvec(b,a)
  ! Evaluates b = D^dagger a (local block of nsub sites per process).
  base = rank*nsub
  do isb = 1,nsub
     ! loop over colour and Dirac indices (icb,idb)
     b(icb,idb,isb) = diag*a(icb,idb,isb+base)
  end do
  do isb = base+1,base+nsub
     do idir = 1,4
        isap = neib(idir,isb)
        isam = neib(-idir,isb)
        call uextract(uu,idir,isb)
        call udagextract(uudag,idir,isam)
        ! loop over internal colour & Dirac indices (ida,ica,idb,icb)
        b(icb,idb,isb-base) = b(icb,idb,isb-base)                  &
             & - gp(idb,ida,idir)*uu(icb,ica)*a(ica,ida,isap)      &
             & - gm(idb,ida,idir)*uudag(icb,ica)*a(ica,ida,isam)
     enddo
  enddo
  return
end subroutine ddagonvec

subroutine matvec_wilson_ddagd(vec,res)
  use lattsize
  use paramod
  implicit none
  include "mpif.h"
  complex(8), dimension(blocksize*nsite), intent(in)  :: vec
  complex(8), dimension(blocksize*nsite)              :: vec1
  complex(8), dimension(blocksize*nsite), intent(out) :: res
  complex(8), dimension(blocksize*nsub) :: res_loc1, res_loc2
  integer :: ierr

  ! local piece of D*vec, then gather the full vector on every process
  call donvec(res_loc1,vec)
  call MPI_ALLGATHER(res_loc1, blocksize*nsub, MPI_DOUBLE_COMPLEX,      &
                     vec1,     blocksize*nsub, MPI_DOUBLE_COMPLEX,      &
                     MPI_COMM_WORLD, ierr)
  ! local piece of D^dagger*(D*vec), then gather the result
  call ddagonvec(res_loc2,vec1)
  call MPI_ALLGATHER(res_loc2, blocksize*nsub, MPI_DOUBLE_COMPLEX,      &
                     res,      blocksize*nsub, MPI_DOUBLE_COMPLEX,      &
                     MPI_COMM_WORLD, ierr)
end subroutine matvec_wilson_ddagd

Performance comparisons. Amdahl's law: $S = \dfrac{1}{f + (1-f)/P}$, where S is the speed-up factor over the serial code, f is the fraction of the code that is serial, and P is the number of processes. If f = 0.1 then, even as $P \to \infty$, $S \to 10$; similarly, if f = 0.2 then $S \to 5$.
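Inverting Amdahl's law gives an estimate of the serial fraction from a measured speed-up (a rough consistency check, using the $16^4$ MPI inversion timings from the tables below, $S \approx 46.57/10.52 \approx 4.4$ at $P = 16$):
$$f = \frac{1/S - 1/P}{1 - 1/P} \approx \frac{1/4.4 - 1/16}{1 - 1/16} \approx 0.17,$$
i.e. roughly 80-85% of the work is parallel, consistent with the figure quoted in the conclusions.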

Tests were performed on an IBM p-590, which has 32 cores and 64 GB of memory. Each core can handle 2 threads, giving 64 logical CPUs. The compiler used was the thread-safe IBM Fortran compiler xlf90. The runs were:

Lattice   OpenMP (no. of threads)   MPI (no. of processes)
 4^4      2, 4, 8                   2, 4, 8
 8^4      2, 4, 8                   2, 4, 8
12^4      2, 4, 8, 16               2, 4, 8, 16
16^4      2, 4, 8, 16               2, 4, 8, 16

All timings are for 5 full MD trajectories plus the accept/reject step. The conjugate-gradient tolerance was kept at $10^{-10}$. Each MD trajectory involved 50 integration steps with a time step of 0.02.

Typical inversion time

OpenMP:
Lattice   Serial   2 threads   4 threads   8 threads   16 threads
12^4      10.67    11.86/2     14.32/4     11.66/8     13.64/16
16^4      46.57    43.25/2     41.68/4     62.75/8     59.45/16

MPI:
Lattice   Serial   2 procs   4 procs   8 procs   16 procs
12^4      10.67    6.52      4.09      2.74      2.26
16^4      46.57    31.13     15.35     12.11     10.52

Typical force calculation time

OpenMP:
Lattice   Serial   2 threads   4 threads   8 threads   16 threads
12^4      2.99     2.97/2      4.28/4      2.95/8      3.23/16
16^4      9.52     9.44/2      9.43/4      13.21/8     12.73/16

MPI:
Lattice   Serial   2 procs   4 procs   8 procs   16 procs
12^4      2.99     1.56      0.82      0.43      0.25
16^4      9.52     6.36      2.57      1.38      0.84

Reference: (16/12)^5 = 4.213, (16/12)^4 = 3.16

Conclusions. Both the MPI and the OpenMP parallelization show a speed-up of about a factor of four using 16 processes. Applying Amdahl's law, we see that about 85% of our code is parallelized. The overhead for MPI is higher than for OpenMP. OpenMP can scale up to 64, or at most 128, processors, but this is extremely costly: a 32-core IBM p590 (2.2 GHz PPC) costs more than Rs. 2 crores.

Memory is an issue for these kinds of machines; again the limit is either 64 GB or at most 128 GB. The Wilson Dirac operator with the Wilson gauge action on a $16^4$ lattice takes up close to 300 MB of memory, so a $64^4$ lattice would require about $256 \times 0.3 = 76.8$ GB. With OpenMP one has to be careful about synchronization: a variable may suffer unwanted updates by several threads; use PRIVATE and SHARED declarations to avoid such problems (see the sketch after this paragraph). With OpenMP, coding is easier: unlike MPI, where the parallelization has to be done globally, here it can be done incrementally, loop by loop.
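As an illustration of the synchronization caution above, a shared accumulator has to be protected; a minimal sketch using a REDUCTION clause (the dot-product function is illustrative and not part of the original code):

real(8) function omp_dot(a, b, n)
  ! Dot product with OpenMP.  Each thread accumulates a private partial
  ! sum; REDUCTION(+:s) combines them safely at the end.  Without it, the
  ! shared variable s would suffer unwanted updates from several threads.
  implicit none
  integer, intent(in) :: n
  real(8), intent(in) :: a(n), b(n)
  real(8) :: s
  integer :: i
  s = 0.d0
!$OMP PARALLEL DO PRIVATE(i) SHARED(a,b) REDUCTION(+:s)
  do i = 1, n
     s = s + a(i)*b(i)
  end do
!$OMP END PARALLEL DO
  omp_dot = s
end function omp_dot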

The main advantage of MPI is that it can scale to thousands of processors, so it does not suffer from the memory constraints. If memory is not a problem and you have enough independent runs to keep all your nodes busy, then use OpenMP; if not, you will have to use MPI. Thank you.