Efficient implementation of the overlap operator on multi-GPUs

Efficient implementation of the overlap operator on multi-GPUs
Andrei Alexandru, Mike Lujan, Craig Pelissier, Ben Gamari, Frank Lee
SAAHPC 2011, University of Tennessee

Outline: Motivation; Overlap operator; Multi-GPU Wilson-Dirac kernel; Eigensolver and inverter; Conclusions

Building blocks of matter: quarks are the constituents of matter, and they interact strongly by exchanging gluons. They have peculiar properties: confinement and asymptotic freedom (Nobel Prize 2004). The theory of strong interactions is Quantum Chromodynamics (QCD).

Lattice QCD: replace space-time with a four-dimensional lattice; differential operators are replaced with finite-difference operators. Typical lattice sizes are 20-40 sites per dimension, with the 4th dimension 1.5-3 times longer; for example, 24^3 x 48 = 663,552 sites. Typical project size: ~1 Petaflop.

Why overlap fermions on multi-GPUs? We want to study QCD dynamics in the chiral regime, and overlap fermions preserve chiral symmetry at finite lattice spacing. Overlap fermions are computationally demanding; we use GPUs since they have good memory bandwidth, and the memory requirements of the overlap operator force us to use multiple GPUs.

Lattice QCD: QCD is a field theory, and lattice QCD is defined on a 4D grid; quarks live on the sites and gluons on the links. [Figure: 2D slice of the lattice showing site fields Ψ and link variables U.] The links are randomly generated according to the dynamics.

Wilson-Dirac operator: Wilson fermions are one of the simplest discretizations; the operator is numerically fast and very sparse, but it breaks chiral symmetry. The continuum operator m + D̸ becomes
D_w = (ma + 4)·1 − ½ Σ_µ T_µ,
with (T_µ ψ)_n = U_µ(n) ψ_{n+µ̂} (1 − γ_µ) for µ > 0 and (T_µ ψ)_n = U_µ(n−µ̂)† ψ_{n−µ̂} (1 + γ_µ) for µ < 0.
It serves as the kernel for the overlap operator, and the quark propagator is ⟨0| ψ(x) ψ̄(y) |0⟩ = (D_w^{-1})_{x,y}.

Wilson-Dirac operator: the Wilson operator multiplies Wilson fields, 4x3 complex matrices (spin x color) living at every site of the lattice:
Y(n) = (M X)(n) = X(n) − κ Σ_µ [ V_µ(n) X(n+µ̂) + V_µ†(n−µ̂) X(n−µ̂) ].
The value of Y at a site depends on the value of X at the same site and at the 8 neighboring sites. Each neighboring field must be transported to the final site, which involves a multiplication with a color matrix (3x3) and a spinor matrix (4x4). The color matrices differ from link to link, whereas the spinor matrices depend only on the direction; all matrices and vectors are complex. [Figure: the transport T_µ Ψ written as a 3x3 color matrix times the 3x4 field times a 4x4 spinor matrix.] A simplified sketch of the per-site stencil follows.
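To make the structure concrete, here is a heavily simplified CUDA sketch of the per-site stencil. All names and the data layout are hypothetical, the (1 ∓ γ_µ) spin projection is omitted, and a precomputed link matrix is assumed per (site, direction); the production kernel applies the projectors and uses compressed, reordered storage.

```cuda
#include <cuda_runtime.h>
#include <cuComplex.h>

// Simplified illustration only: fields are arrays of cuFloatComplex, site-major.
// A "spinor" here is 4 spin x 3 color components; a link is a 3x3 matrix.
#define NCOL 3
#define NSPIN 4

__device__ inline cuFloatComplex cmadd(cuFloatComplex a, cuFloatComplex b, cuFloatComplex c)
{
    return cuCaddf(c, cuCmulf(a, b));   // c + a*b
}

// out[site] = in[site] - kappa * sum_dir U(site,dir) * in[neigh(site,dir)]
// (spin projection omitted; in the real code backward hops use the adjoint
//  of the neighbor's link, here we assume one stored matrix per (site,dir))
__global__ void wilson_hop_sketch(cuFloatComplex *out, const cuFloatComplex *in,
                                  const cuFloatComplex *links, const int *neigh,
                                  float kappa, int nsites)
{
    int site = blockIdx.x * blockDim.x + threadIdx.x;
    if (site >= nsites) return;

    cuFloatComplex acc[NSPIN * NCOL];
    for (int i = 0; i < NSPIN * NCOL; ++i)
        acc[i] = in[site * NSPIN * NCOL + i];                 // diagonal term X(n)

    for (int dir = 0; dir < 8; ++dir) {                       // 8 forward/backward hops
        int nb = neigh[site * 8 + dir];                       // neighbor site index
        const cuFloatComplex *U = links + (site * 8 + dir) * NCOL * NCOL;
        const cuFloatComplex *X = in + nb * NSPIN * NCOL;
        for (int s = 0; s < NSPIN; ++s)                       // color matrix times field
            for (int a = 0; a < NCOL; ++a) {
                cuFloatComplex t = make_cuFloatComplex(0.f, 0.f);
                for (int b = 0; b < NCOL; ++b)
                    t = cmadd(U[a * NCOL + b], X[s * NCOL + b], t);
                acc[s * NCOL + a] = cuCaddf(acc[s * NCOL + a],
                                            make_cuFloatComplex(-kappa * cuCrealf(t),
                                                                -kappa * cuCimagf(t)));
            }
    }
    for (int i = 0; i < NSPIN * NCOL; ++i)
        out[site * NSPIN * NCOL + i] = acc[i];
}
```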

Overlap operator: D = 1 + γ_5 sign(H_w), with H_w = γ_5 D_w. The overlap operator is dense and about 100 times more expensive than the Wilson kernel. Its cost is proportional to the condition number of (H_w)^2 and to log δ, where the sign function is approximated as sign(H_w) ≈ Q P(Q^2) with Q = H_w/||H_w|| and δ = max_{x∈[ε,1]} |1 − x P(x)|. [Figure: approximation of the sign function on [−1,1]; ε marks the edge of the approximated spectral window.]

Requirements: the overlap operator needs the Wilson kernel, the vector routines, and an Hwilson eigensolver; the propagator calculation needs an overlap inverter and an overlap eigensolver.

System architecture: each GPU has 1-6 GB of memory at ~140 GB/s; the CPU has 12-48 GB of memory at 10-20 GB/s; GPU and CPU are connected over the PCI bus at ~5 GB/s, and nodes communicate over Infiniband at ~2 x 2.5 GB/s.

Computational strategy: we use one process per GPU and MPI for communication. All data resides in GPU memory. Lattice sites are split evenly between the nodes, and all data belonging to a particular site resides on the node that owns the site. Communication is mainly implemented via shifts and is overlapped with computation where possible. A minimal sketch of the process-to-GPU binding follows.
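A minimal sketch of the one-process-per-GPU binding (assumed setup, not the authors' code), selecting a device from the MPI rank:

```cuda
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>

// One MPI process per GPU: each rank picks a device based on its rank.
// (In production one would use the node-local rank; here we assume one
//  process per node or a simple round-robin assignment.)
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs, ndevices;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    cudaGetDeviceCount(&ndevices);
    cudaSetDevice(rank % ndevices);        // bind this process to one GPU

    printf("rank %d/%d using device %d of %d\n", rank, nprocs, rank % ndevices, ndevices);

    // ... allocate the local sub-lattice in GPU memory and run the solvers ...

    MPI_Finalize();
    return 0;
}
```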

Vector routines: Expression Templates + the Thrust library auto-generate optimized kernels for expressions such as φ ← αψ_1 + βψ_2 + γψ_3 + .... Non-reduction kernels scale perfectly; the maximum bandwidth on an M2070 with ECC on is about 85 GB/s. Reduction kernels have poor scaling because of their small computational fraction, and most of that poor scaling is due to poor single-node kernel performance on small vectors. [Figure: bandwidth per GPU (GB/s) vs GPU count for vector addition and scalar product.] A Thrust sketch of such a fused kernel follows.
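As an illustration (not the authors' expression-template machinery, which generates such functors automatically), a hand-written Thrust kernel that fuses φ = αψ_1 + βψ_2 + γψ_3 into a single pass over the vectors might look like this:

```cuda
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>
#include <thrust/complex.h>

using cplx = thrust::complex<double>;

// Functor evaluating a*x + b*y + c*z for one element; fusing the whole
// expression keeps the operation bandwidth-bound with a single read per input.
struct axpbypcz {
    cplx a, b, c;
    axpbypcz(cplx a_, cplx b_, cplx c_) : a(a_), b(b_), c(c_) {}
    __host__ __device__
    cplx operator()(const thrust::tuple<cplx, cplx, cplx> &t) const {
        return a * thrust::get<0>(t) + b * thrust::get<1>(t) + c * thrust::get<2>(t);
    }
};

int main()
{
    const int n = 1 << 20;
    thrust::device_vector<cplx> psi1(n, cplx(1, 0)), psi2(n, cplx(0, 1)),
                                psi3(n, cplx(1, 1)), phi(n);

    auto first = thrust::make_zip_iterator(thrust::make_tuple(psi1.begin(), psi2.begin(), psi3.begin()));
    auto last  = thrust::make_zip_iterator(thrust::make_tuple(psi1.end(),   psi2.end(),   psi3.end()));
    thrust::transform(first, last, phi.begin(),
                      axpbypcz(cplx(0.5, 0), cplx(2, 0), cplx(-1, 0)));
    return 0;
}
```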

Wilson-Dirac kernel

Wilson-Dirac kernel: the cost of the Wilson-Dirac operator is 1368 flops/site: 600 multiplications (44%) and 768 additions (56%), a balanced load. The data traffic for one site is: in, 8 spinors + 8 links (neighbors) + 1 spinor; out, 1 spinor. In double precision this is 3072 bytes/site, so the computational density is 1368 flops / 3072 bytes = 0.45 flop/byte (twice that in single precision). For 85 GB/s maximum bandwidth the maximum kernel performance is 38.25 GFlops (double) and 76.5 GFlops (single). The kernel has a fair amount of parallelism: the 8 transports can be executed in parallel, and each transport can be split into 2 parallel tasks. The bandwidth-bound estimate is spelled out below.
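For concreteness, the bandwidth-bound (roofline-style) estimate using the numbers above, as an illustrative calculation rather than project code:

```cuda
#include <cstdio>

// Roofline-style estimate for the Wilson-Dirac kernel: the kernel is memory
// bound, so peak performance ~= bandwidth * (flops/byte).
int main()
{
    const double flops_per_site = 1368.0;   // 600 mul + 768 add
    const double bytes_double   = 3072.0;   // 10 spinors (192 B) + 8 links (144 B)
    const double bandwidth_GBs  = 85.0;     // measured on M2070 with ECC on

    double intensity_d = flops_per_site / bytes_double;        // ~0.445 flop/byte
    double intensity_s = flops_per_site / (bytes_double / 2);  // single precision

    // The slide rounds the double-precision intensity to 0.45, giving 38.25 GFlops.
    printf("double: %.3f flop/byte -> %.1f GFlops\n", intensity_d, intensity_d * bandwidth_GBs);
    printf("single: %.3f flop/byte -> %.1f GFlops\n", intensity_s, intensity_s * bandwidth_GBs);
    return 0;
}
```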

Calculation steps: the communication time is overlapped with the computation time to hide latency. 1. Gather: compute the compressed boundary fields and fill the communication buffers. 2. Comm: initiate non-blocking communication. 3. Bulk: compute the dslash on the interior points. 4. Scatter: finish the communication and add the results. A sketch of this overlap with CUDA streams and MPI appears below.
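A sketch of the gather/comm/bulk/scatter overlap under assumed names (the kernels below are trivial stubs standing in for the real pack, interior-dslash, and boundary-update kernels, and only one neighbor exchange is shown):

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

// Boundary traffic runs in stream_comm while the interior dslash runs
// concurrently in stream_bulk; step numbers match the slide.
__global__ void pack_boundary(float *buf, const float *in, int n)
{ int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) buf[i] = in[i]; }
__global__ void dslash_interior(float *out, const float *in, int n)
{ int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) out[i] = in[i]; }
__global__ void apply_boundary(float *out, const float *buf, int n)
{ int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) out[i] += buf[i]; }

void dslash_step(float *d_out, const float *d_in, float *d_send, float *d_recv,
                 float *h_send, float *h_recv, int n_bulk, int n_boun,
                 int fwd, int bwd, cudaStream_t stream_comm, cudaStream_t stream_bulk)
{
    const int T = 256;

    // 1. Gather: pack the boundary sites into a contiguous buffer on the GPU.
    pack_boundary<<<(n_boun + T - 1) / T, T, 0, stream_comm>>>(d_send, d_in, n_boun);
    cudaMemcpyAsync(h_send, d_send, n_boun * sizeof(float), cudaMemcpyDeviceToHost, stream_comm);

    // 3. Bulk: interior dslash runs concurrently in its own stream.
    dslash_interior<<<(n_bulk + T - 1) / T, T, 0, stream_bulk>>>(d_out, d_in, n_bulk);

    // 2. Comm: non-blocking MPI once the device-to-host copy has finished.
    cudaStreamSynchronize(stream_comm);
    MPI_Request req[2];
    MPI_Irecv(h_recv, n_boun, MPI_FLOAT, bwd, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(h_send, n_boun, MPI_FLOAT, fwd, 0, MPI_COMM_WORLD, &req[1]);
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

    // 4. Scatter: copy the received boundary back and add its contribution.
    cudaMemcpyAsync(d_recv, h_recv, n_boun * sizeof(float), cudaMemcpyHostToDevice, stream_comm);
    apply_boundary<<<(n_boun + T - 1) / T, T, 0, stream_comm>>>(d_out, d_recv, n_boun);
    cudaDeviceSynchronize();
}
```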

Minimal surface: the lattice is cut into hypercubes with the same dimensions. The longest dimension is always cut first, and an already cut dimension is preferred; as the lattice is cut, the boundary-to-interior ratio increases (a small sketch of the cutting rule follows the table).

GPUs  N_interior   N_boundary   local dimensions
1     6.6 x 10^5   0            24 x 24 x 24 x 48
2     3.3 x 10^5   2.8 x 10^4   24 x 24 x 24 x 24
4     1.7 x 10^5   2.8 x 10^4   24 x 24 x 24 x 12
8     8.3 x 10^4   2.8 x 10^4   12 x 24 x 24 x 12
16    4.1 x 10^4   2.1 x 10^4   12 x 12 x 24 x 12
32    2.1 x 10^4   1.4 x 10^4   12 x 12 x 12 x 12
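A minimal sketch of the cutting rule described above (an illustration, not the actual partitioning code): repeatedly halve one dimension, choosing the longest one and, among equally long candidates, preferring one that has already been cut. It reproduces the local dimensions in the table.

```cuda
#include <cstdio>

// Halve one local dimension per doubling of the GPU count: always pick the
// longest dimension, and among ties prefer a dimension already cut.
void cut_lattice(int dims[4], int ngpu)
{
    bool cut[4] = {false, false, false, false};
    for (int g = 1; g < ngpu; g *= 2) {
        int best = -1;
        for (int d = 0; d < 4; ++d) {
            if (best < 0 || dims[d] > dims[best] ||
                (dims[d] == dims[best] && cut[d] && !cut[best]))
                best = d;
        }
        dims[best] /= 2;
        cut[best] = true;
    }
}

int main()
{
    for (int ngpu = 1; ngpu <= 32; ngpu *= 2) {
        int dims[4] = {24, 24, 24, 48};       // global lattice from the table
        cut_lattice(dims, ngpu);
        printf("%2d GPUs -> %2d x %2d x %2d x %2d\n", ngpu, dims[0], dims[1], dims[2], dims[3]);
    }
    return 0;
}
```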

Dslash anatomy. [Figure: timeline of one dslash application on two CUDA streams. One stream runs the bulk (interior) dslash; the other runs 1: gather, 2: GPU-to-CPU copy (PCI), 3: CPU-to-CPU communication (Infiniband), 4: CPU-to-GPU copy (PCI), 5: scatter.]

Dslash timing:

GPUs  gather  scatter  gpu>cpu  cpu>cpu  cpu>gpu  comm  dslash bulk
1     0.0     0.0      0.0      0.0      0.0      0.0   28.9
2     0.2     0.3      0.6      1.3      0.6      2.6   14.5
4     0.2     0.3      0.6      1.6      0.6      2.9   7.3
8     0.2     0.3      0.6      1.5      0.6      2.7   3.6
16    0.2     0.2      0.7      1.6      0.6      2.8   1.8

Strong scaling for 24^3 x 64. [Figure: performance per GPU (GFLOPS) vs GPU count, up to 32 GPUs, for double precision, single precision, and the performance model.]

Comparison with other codes (GFLOPS):

Lattice      GPUs   Our code (double)   Our code (single)   QUDA (single)
32^3 x 256   16     521                 1005                1327
32^3 x 256   32     928                 1825                2247
24^3 x 128   16     487                 935                 971
24^3 x 128   32     503                 913                 1007

Overlap operator

Sign approximation. Polynomial approximation: P(Q^2)ψ = Σ_{i=1}^{n} c_i T_i(Q^2) ψ. Rational approximation: P(Q^2)ψ = Σ_{i=1}^{n} b_i/(Q^2 + c_i) ψ. [Figure: time (s) vs GPU count for the double-pass (rational) and polynomial evaluations.] A sketch of the polynomial (Chebyshev) evaluation follows.
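The polynomial part can be applied with the Clenshaw recurrence, using only matrix-vector products. The sketch below (real arithmetic for brevity, assumed coefficient convention c_0..c_n) takes the Chebyshev coefficients of the sign-function fit as input and assumes the operator has already been rescaled so its spectrum lies in [−1, 1]; it is an illustration, not the authors' kernel.

```cuda
#include <vector>
#include <functional>
#include <cstddef>

// Clenshaw recurrence for sum_k c_k T_k(A) applied to a vector v, where A
// stands for the rescaled Q^2 and is only accessed through matrix-vector
// products.  Coefficients c_0..c_n come from the Chebyshev fit of sign().
using Vec    = std::vector<double>;
using MatVec = std::function<void(const Vec &in, Vec &out)>;   // out = A * in

Vec chebyshev_apply(const MatVec &A, const Vec &v, const std::vector<double> &c)
{
    const std::size_t N = v.size();
    Vec b1(N, 0.0), b2(N, 0.0), Ab(N, 0.0);

    // b_k = c_k v + 2 A b_{k+1} - b_{k+2}, for k = n .. 1
    for (std::size_t k = c.size() - 1; k >= 1; --k) {
        A(b1, Ab);
        Vec bk(N);
        for (std::size_t i = 0; i < N; ++i)
            bk[i] = c[k] * v[i] + 2.0 * Ab[i] - b2[i];
        b2.swap(b1);
        b1.swap(bk);
    }

    // result = c_0 v + A b_1 - b_2
    A(b1, Ab);
    Vec out(N);
    for (std::size_t i = 0; i < N; ++i)
        out[i] = c[0] * v[i] + Ab[i] - b2[i];
    return out;
}
```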

Wilson-Dirac kernel performance comparison. [Figure: performance (GFLOPS) vs GPU-equivalent count for the CPU and GPU codes.]

Performance comparison: the GPU cluster uses one GPU per node and QDR Infiniband interconnects; the CPU machine is a Cray XT5 with dual hex-core AMD processors. We compare the performance of 32 GPUs (the target cluster size) vs 256 CPU cores (the optimal performance point for the CPU code).

Overlap performance: for the 24^3 x 64 lattice, one overlap matrix-vector multiplication takes 1.1 s on 32 GPUs and 3.3 s on 256 cores of the Cray XT5. This translates into a ratio of 1 GPU ≈ 24 CPU cores.

Hwilson eigensolver

Small eigenspace dimension. [Figure: two plots vs the number of deflated eigenvectors (0-300): the eigenvalue λ, and the polynomial order required for the sign function.] The approximation error falls off as δ = A e^{−bn}, with ε = λ/λ_max setting the spectral window.

Eigensolvers: we use implicitly restarted Arnoldi factorization,
A V_k = V_k H_k + f_k e_k†, with (e_k)_n = δ_{k,n};
the implicit restart applies shifts µ via a QR step: H_k − µ = QR, V_k ← V_k Q, H_k ← RQ + µ.
The method requires storage for temporary vectors: for optimal convergence we need 2.5 times more vectors than requested, k = 2.5 l. For efficiency we also need to code a matrix-matrix multiplication routine (for V_k Q). Each iteration requires k matrix-vector multiplications and ~k^2 vector orthogonalizations. We use locking of the converged eigenvectors to accelerate convergence. A sketch of the basic Arnoldi build-up follows.
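For reference, a sketch of the basic k-step Arnoldi factorization built with modified Gram-Schmidt (real arithmetic for brevity; the actual operator is complex and Hermitian, for which the Hessenberg matrix becomes tridiagonal). The implicit restart, shifts, and locking described above are omitted, and all names are illustrative.

```cuda
#include <vector>
#include <cmath>
#include <functional>

// Builds A V_k = V_k H_k + f_k e_k^T; on return V has k+1 orthonormal columns
// and H is the (k+1) x k Hessenberg matrix in row-major order.
using Vec    = std::vector<double>;
using MatVec = std::function<void(const Vec &, Vec &)>;

static double dot(const Vec &a, const Vec &b)
{ double s = 0; for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i]; return s; }
static void axpy(double a, const Vec &x, Vec &y)
{ for (std::size_t i = 0; i < x.size(); ++i) y[i] += a * x[i]; }
static void scale(double a, Vec &x) { for (double &v : x) v *= a; }

void arnoldi(const MatVec &A, const Vec &v0, int k,
             std::vector<Vec> &V, std::vector<double> &H)
{
    const std::size_t n = v0.size();
    V.assign(1, v0);
    scale(1.0 / std::sqrt(dot(v0, v0)), V[0]);
    H.assign((std::size_t)(k + 1) * k, 0.0);

    for (int j = 0; j < k; ++j) {
        Vec w(n, 0.0);
        A(V[j], w);                                   // w = A v_j
        for (int i = 0; i <= j; ++i) {                // orthogonalize against V
            double h = dot(V[i], w);
            H[(std::size_t)i * k + j] = h;
            axpy(-h, V[i], w);
        }
        double beta = std::sqrt(dot(w, w));           // breakdown check omitted
        H[(std::size_t)(j + 1) * k + j] = beta;
        scale(1.0 / beta, w);                         // next basis vector
        V.push_back(w);
    }
}
```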

Hwilson eigensolver: we use Chebyshev acceleration of order 100, so the Arnoldi eigensolver converges in one iteration. Computing 200 eigenvectors requires storage for 500 vectors = 85 GB. Total time: 0.27 hours on the GPU cluster vs 0.60 hours on the Cray XT5, which corresponds to 1 GPU ≈ 18 CPU cores. In situations with reduced GPU memory we use a mixed mode in which the eigensystem is stored in CPU memory; this is feasible thanks to the Chebyshev acceleration. In this mode the GPU code takes 0.43 hours.

Overlap eigensolver

Overlap eigensystem: deflation speeds up inversions considerably. One propagator = 12 inversions; at m_π = 200 MeV this costs 12 x 2,000 = 24,000 without deflation, versus 6,600 + 12 x 200 = 9,000 with deflation, so for one propagator per configuration deflation gives about a 2.5x speed-up. We compute the eigenvectors of the hermitian overlap operator and then rebuild the overlap eigenvectors from them. A sketch of the deflation step is given below.
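To illustrate what deflation buys (a generic sketch in real arithmetic, not the authors' implementation): with eigenpairs (λ_i, v_i) of a Hermitian operator A in hand, the solution component inside the deflation subspace is obtained exactly, and the iterative solver only works on the projected remainder, which no longer sees the small eigenvalues that dominate the iteration count.

```cuda
#include <vector>
#include <cstddef>

// Generic deflation sketch for A x = b with known eigenpairs (lambda_k, v_k):
//   x = sum_k (v_k . b / lambda_k) v_k  +  iterative solve on the projected b.
using Vec = std::vector<double>;

static double dot(const Vec &a, const Vec &b)
{ double s = 0; for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i]; return s; }

// Returns the deflated part of the solution and overwrites b with the
// projected right-hand side that is handed to the (CG-type) solver.
Vec deflate(const std::vector<Vec> &evecs, const std::vector<double> &evals, Vec &b)
{
    Vec x_defl(b.size(), 0.0);
    for (std::size_t k = 0; k < evecs.size(); ++k) {
        double c = dot(evecs[k], b);
        for (std::size_t i = 0; i < b.size(); ++i) {
            x_defl[i] += (c / evals[k]) * evecs[k][i];  // exact solution in span{v_k}
            b[i]      -= c * evecs[k][i];               // project v_k out of b
        }
    }
    return x_defl;   // final solution = x_defl + iterative solve of the projected system
}
```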

Overlap eigensolver: we compute 100 eigenvector pairs to 10^-10 precision. On the GPU cluster this takes 2.7 hours; on the Cray machine it takes 10.6 hours, which translates into 1 GPU ≈ 26 CPU cores. When memory is limited we can use a mixed mode that stores the overlap Krylov space in CPU memory; the code then takes 4 hours to converge.

Overlap inverter

Overlap inverter: we use m_π = 200 MeV and a precision of 10^-8. We use an adaptive CG method, which is 60% faster than regular CG, together with a multi-shifted inverter. The overlap eigensystem is stored in CPU memory, and the Hwilson eigensystem and the solutions are stored in GPU memory. The GPU cluster takes 0.52 hours vs 2.3 hours for the Cray machine; this translates to 1 GPU ≈ 35 CPU cores.

Summary (CPU vs GPU). Hwilson eigensolver: expensive orthogonalization vs Chebyshev acceleration; 2.5 x 200 vectors vs 200 Hwilson vectors. Overlap eigensolver: 2.5 x 100 eigenpairs. Overlap inverter: 100 Overlap eigenpairs; 200 Hwilson vectors + solutions (100). Total time: pure GPU (32) 3.5 hours vs Cray XT5 (256) 13.5 hours.

Conclusions: we showed how to efficiently implement the overlap operator on GPUs. For efficiency all data must reside in GPU memory, which forces us to use GPUs in parallel. For the 24^3 x 64 lattices of interest the Wilson kernel scaling efficiency is 50% on 32 GPUs, and the scaling efficiency is better than that of CPU codes of equivalent performance. For the sign function needed by the overlap operator, the polynomial approximation is better both in terms of memory use and performance. Most of the time is spent in eigensolvers; we use implicitly restarted Arnoldi eigensolvers. On systems with reduced memory a mixed strategy can be used with only a 50-60% performance penalty. Overall, the GPU/CPU performance ratio for our codes is compatible with the ratio measured for the dslash routine. This is not surprising, since the most time-consuming part of these codes is the dslash routine, but it takes careful planning to work around all possible bottlenecks.

Outlook: most of the time is spent in the overlap eigensolver. Chebyshev acceleration gives a preliminary 20-30% boost; further directions are mixed precision, a different eigensolver method, and different inversion/deflation strategies.