Efficient implementation of the overlap operator on multi-GPUs
1 Efficient implementation of the overlap operator on multi-GPUs. Andrei Alexandru, Mike Lujan, Craig Pelissier, Ben Gamari, Frank Lee. SAAHPC, University of Tennessee.
2 Outline: Motivation; Overlap operator; Multi-GPU Wilson-Dirac kernel; Eigensolver and inverter; Conclusions.
3 Building blocks of matter. Quarks are the constituents of matter; they interact strongly by exchanging gluons. Peculiar properties: confinement and asymptotic freedom (Nobel Prize 2004). The theory of strong interactions is Quantum Chromodynamics (QCD).
4 Lattice QCD. Replace space-time with a four-dimensional lattice; differential operators are replaced with finite-difference operators. Typical lattices have a few tens of sites per spatial dimension, with the 4th (time) dimension about twice as long; for example, 24³×48 = 663,552 sites. Typical project size ~ 1 Petaflop.
5 Why overlap fermions on multi-GPUs? We want to study QCD dynamics in the chiral regime, and overlap fermions preserve chiral symmetry at finite lattice spacing. Overlap fermions are computationally demanding, so we use GPUs, which have good memory bandwidth. The memory requirements of the overlap operator force us to use multiple GPUs.
6 Lattice QCD. QCD is a field theory; lattice QCD is defined on a 4D grid. Quarks Ψ live on the sites, gluons U live on the links. [Figure: a 2D slice of the lattice, with spinor fields Ψ at the sites and link variables U on the links between them.] The links are generated randomly according to the QCD dynamics.
7 Wilson-Dirac operator. Wilson fermions are one of the simplest discretizations of the continuum operator $m + \slashed{D}$. The operator is numerically fast and very sparse, but it breaks chiral symmetry:
$$D_w = (ma + 4)\,\mathbf{1} - \frac{1}{2}\sum_{\mu} T_\mu,$$
$$\mu > 0:\ (T_\mu\psi)_n = U_\mu(n)\,\psi_{n+\hat\mu}\,(1-\gamma_\mu), \qquad \mu < 0:\ (T_\mu\psi)_n = U_{-\mu}(n-\hat\mu)^\dagger\,\psi_{n-\hat\mu}\,(1+\gamma_\mu).$$
It serves as the kernel for the overlap operator. The quark propagator is $\langle 0|\,\psi(x)\,\bar\psi(y)\,|0\rangle = (D_w^{-1})_{x,y}$.
8 Wilson-Dirac operator.
$$Y(n) = (MX)(n) = X(n) - \kappa \sum_{\mu}\left[ V_\mu(n)\,X(n+\hat\mu) + V_\mu^\dagger(n-\hat\mu)\,X(n-\hat\mu) \right]$$
The Wilson operator multiplies Wilson fields, 4×3 matrices living at every site of the lattice. The value of Y at a site depends on the value of X at the same site and at the 8 neighboring sites. Each of the fields at the neighboring sites has to be transported to the final site; this involves a multiplication with a color matrix (3×3) and a spinor matrix (4×4). The color matrices differ from link to link, whereas the spinor matrices depend only on the direction. The matrices and the vectors are all complex. [Figure: the transport $T_\mu\Psi$ written out as the product of a 3×3 color matrix $U_{n,\mu}$, the 3×4 spinor field $\Psi_n$, and a 4×4 spinor matrix.]
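To make the per-site work concrete, here is a minimal sketch of the color part of one transport (the 3×3 link matrix acting on the 3×4 spinor at one site). The data layout and names are illustrative, not the talk's actual kernel; the spin projection is left as a separate step.

#include <cuComplex.h>

// One transport's color multiplication for a single site:
// out[c][s] = sum_c' U[c][c'] * psi[c'][s]  (3x3 color times 3x4 spinor).
__device__ void transport_site(const cuDoubleComplex U[3][3],
                               const cuDoubleComplex psi[3][4],
                               cuDoubleComplex out[3][4])
{
    for (int c = 0; c < 3; ++c)
        for (int s = 0; s < 4; ++s) {
            cuDoubleComplex acc = make_cuDoubleComplex(0.0, 0.0);
            for (int cp = 0; cp < 3; ++cp)
                acc = cuCadd(acc, cuCmul(U[c][cp], psi[cp][s]));
            out[c][s] = acc;
        }
    // The 4x4 spin projection (1 -/+ gamma_mu) depends only on the
    // direction mu and is applied to the spin index s afterwards.
}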
9 Overlap operator. The overlap operator is dense and about 100 times more expensive to apply than the Wilson kernel. The cost is proportional to the condition number of $(H_w)^2$ and to $\log\delta$:
$$D = 1 + \gamma_5\,\mathrm{sign}(H_w), \qquad H_w = \gamma_5 D_w,$$
where the sign function is approximated as $\mathrm{sign}(H_w) \approx Q\,P(Q^2)$ with $Q = H_w/\lVert H_w\rVert$ and approximation error $\delta = \max_{x\in[\epsilon,1]} \lvert 1 - \sqrt{x}\,P(x)\rvert$.
10 Requirements. The overlap operator requires the Wilson kernel + vector routines and an Hwilson eigensolver; the propagator calculation requires the overlap inverter and the overlap eigensolver.
11 System architecture. [Diagram: each GPU has 1-6 GB of memory at ~140 GB/s; the GPU talks to the CPU over PCI at ~5 GB/s; each node has 12-48 GB of CPU memory; nodes are connected by Infiniband at ~2 x 2.5 GB/s.]
12 Computational strategy. We use one process per GPU and MPI for communication. All data resides in GPU memory. Lattice sites are split evenly between the nodes, and all data belonging to a particular site resides on the node that owns that site. Communication is mainly implemented via shifts and is overlapped, when possible, with computation.
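As a sketch of the process-per-GPU setup (not the talk's code; the round-robin device choice assumes ranks on a node are numbered consecutively):

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, ndev;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaGetDeviceCount(&ndev);
    cudaSetDevice(rank % ndev);   // bind this MPI rank to one GPU
    // ... allocate the local sub-lattice entirely in GPU memory here ...
    MPI_Finalize();
    return 0;
}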
13 Vector routines. Expression templates + the THRUST library auto-generate optimized kernels for expressions such as $\phi \leftarrow \alpha\psi_1 + \beta\psi_2 + \gamma\psi_3$. Non-reduction kernels scale perfectly; max bandwidth on an M2070 with ECC on is about 85 GB/s. Reduction kernels have poor scaling, since their computational fraction is small; most of the poor scaling is due to poor single-node kernel performance on small vectors. [Figure: bandwidth per GPU (GB/s) vs GPU count for vector addition and scalar product.]
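A minimal Thrust sketch of the kind of fused kernel such expression templates generate for $\phi = \alpha\psi_1 + \beta\psi_2 + \gamma\psi_3$; the functor and function names here are illustrative. The point is that the whole expression compiles to a single kernel, so each input is read once and the output written once.

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/complex.h>

using cplx = thrust::complex<double>;

struct axpbypgz {
    cplx a, b, g;
    axpbypgz(cplx a_, cplx b_, cplx g_) : a(a_), b(b_), g(g_) {}
    template <typename Tuple>
    __host__ __device__ cplx operator()(const Tuple& t) const {
        // One fused pass over all three inputs.
        return a * thrust::get<0>(t) + b * thrust::get<1>(t) + g * thrust::get<2>(t);
    }
};

void combine(const thrust::device_vector<cplx>& p1,
             const thrust::device_vector<cplx>& p2,
             const thrust::device_vector<cplx>& p3,
             thrust::device_vector<cplx>& phi,
             cplx alpha, cplx beta, cplx gamma)
{
    auto first = thrust::make_zip_iterator(
        thrust::make_tuple(p1.begin(), p2.begin(), p3.begin()));
    auto last = thrust::make_zip_iterator(
        thrust::make_tuple(p1.end(), p2.end(), p3.end()));
    thrust::transform(first, last, phi.begin(), axpbypgz(alpha, beta, gamma));
}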
14 Wilson-Dirac kernel
15 Wilson-Dirac kernel. The cost of the Wilson-Dirac operator is 1368 flops/site: 600 multiplications (44%) and 768 additions (56%), a balanced load. The data for one site's computation is, in: 8 spinors + 8 links (neighbors) + 1 spinor; out: 1 spinor. In double precision this is 3072 bytes/site, so the computational density is 1368 flops / 3072 bytes = 0.45 flop/byte (double that in single precision). For 85 GB/s max bandwidth the maximum kernel performance is 38.25 GFlops (double) and 76.5 GFlops (single). The kernel has a fair amount of parallelism: the 8 transports can be implemented in parallel, and each transport can be split into 2 parallel tasks.
16 Calculation steps. The communication time is overlapped with the computation time to hide latency: 1. Gather: compute compressed fields and fill the communication buffers. 2. Comm: initiate non-blocking communication. 3. Bulk: compute the dslash for the interior points. 4. Scatter: finish the communication and add the boundary results.
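A control-flow sketch of these four steps with two CUDA streams and non-blocking MPI. The kernel names, buffers, and neighbor ranks are hypothetical placeholders (the host buffers are assumed pinned for async copies); a full implementation does this once per cut direction.

#include <mpi.h>
#include <cuda_runtime.h>

// Hypothetical kernels standing in for the real ones:
__global__ void gather_kernel(double2* send_buf, const double2* in);
__global__ void dslash_interior(double2* out, const double2* in);
__global__ void scatter_kernel(double2* out, const double2* recv_buf);

void dslash(double2* d_out, const double2* d_in,
            double2* d_send, double2* d_recv,
            double2* h_send, double2* h_recv, size_t buf_bytes,
            int nbr_fwd, int nbr_bwd, MPI_Comm comm,
            cudaStream_t s_bulk, cudaStream_t s_bnd,
            dim3 grid_bnd, dim3 grid_int, dim3 block)
{
    // 1. Gather: project and pack boundary spinors, copy them to the host.
    gather_kernel<<<grid_bnd, block, 0, s_bnd>>>(d_send, d_in);
    cudaMemcpyAsync(h_send, d_send, buf_bytes, cudaMemcpyDeviceToHost, s_bnd);
    cudaStreamSynchronize(s_bnd);

    // 2. Comm: start the non-blocking exchange with the neighboring ranks.
    MPI_Request reqs[2];
    MPI_Isend(h_send, (int)buf_bytes, MPI_BYTE, nbr_fwd, 0, comm, &reqs[0]);
    MPI_Irecv(h_recv, (int)buf_bytes, MPI_BYTE, nbr_bwd, 0, comm, &reqs[1]);

    // 3. Bulk: interior sites need no remote data, so they overlap the comm.
    dslash_interior<<<grid_int, block, 0, s_bulk>>>(d_out, d_in);

    // 4. Scatter: finish the comm and accumulate the boundary contributions.
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    cudaMemcpyAsync(d_recv, h_recv, buf_bytes, cudaMemcpyHostToDevice, s_bnd);
    scatter_kernel<<<grid_bnd, block, 0, s_bnd>>>(d_out, d_recv);
    cudaDeviceSynchronize();
}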
17 Minimal surface. Cut the lattice into hypercubes with the same dimensions. The longest dimension is always cut first, and an already-cut dimension is preferred. As the lattice is cut, the boundary-to-interior ratio increases. [Table: for each GPU count, the interior and boundary site counts (N_int, N_boun) and the local sub-lattice dimensions, shrinking from 24×24×24×64 on one GPU down to 12×12×12-sized slices on 32 GPUs.]
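An illustrative host-side version of this cutting rule (not the authors' code): halve one dimension per factor of two in the GPU count, picking the longest dimension and, on ties, preferring one that has already been cut.

#include <vector>
#include <cstdio>

std::vector<int> partition(std::vector<int> dims, int ngpus) {
    std::vector<bool> cut(dims.size(), false);
    for (int n = 1; n < ngpus; n *= 2) {         // one halving per factor of 2
        int best = -1;
        for (size_t d = 0; d < dims.size(); ++d) {
            if (dims[d] % 2 != 0) continue;      // dimension must stay divisible
            if (best < 0 || dims[d] > dims[best] ||
                (dims[d] == dims[best] && cut[d] && !cut[best]))
                best = (int)d;
        }
        if (best < 0) break;                     // nothing left to cut
        dims[best] /= 2;
        cut[best] = true;
    }
    return dims;                                 // local sub-lattice dimensions
}

int main() {
    auto local = partition({24, 24, 24, 64}, 32);
    for (int d : local) std::printf("%d ", d);   // local dims for 32 GPUs
    std::printf("\n");
}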
18 Dslash anatomy. [Figure: timeline of one dslash application across two CUDA streams: stream 1 runs the bulk dslash; stream 2 runs 1: gather, 2: GPU→CPU copy over PCI, 3: CPU→CPU communication over Infiniband, 4: CPU→GPU copy over PCI, 5: scatter.]
19 Dslash timing. [Table: timing breakdown vs GPU count: gather, scatter, GPU→CPU, CPU→CPU, and CPU→GPU copies, total communication, and bulk dslash.]
20 Strong scaling for 24³×64. [Figure: performance per GPU (GFLOPS) vs GPU count, comparing double precision, single precision, and the performance model.]
21 Comparison with other codes. [Table: performance of our code (double and single precision) vs QUDA (single precision) on 32³ lattices, at 16 and 32 GPUs.]
22 Overlap operator
23 Sign approximation. Polynomial approximation: $P(Q^2)\,\psi = \sum_{i=1}^{n} c_i\,T_i(Q^2)\,\psi$ (Chebyshev polynomials $T_i$). Rational approximation: $P(Q^2)\,\psi = \sum_{i=1}^{n} \frac{b_i}{Q^2 + c_i}\,\psi$. [Figure: time (s) vs GPU count for the double-pass (rational) and polynomial methods.]
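A sketch of how the polynomial approximation can be applied via the Clenshaw recurrence for the Chebyshev sum; apply_Q2 and the fused vector helper are hypothetical placeholders for the kernels described earlier.

#include <thrust/device_vector.h>
#include <thrust/fill.h>
#include <thrust/complex.h>
#include <vector>

using vec = thrust::device_vector< thrust::complex<double> >;

// Hypothetical placeholders:
void apply_Q2(vec& out, const vec& in);                    // out = Q^2 * in
void axpbypcz(double a, const vec& x, double b, const vec& y,
              double c, const vec& z, vec& out);           // out = ax + by + cz (elementwise; out may alias an input)

// y = sum_{k=0}^{n} c[k] T_k(Q^2) psi via Clenshaw:
// b_k = 2 Q^2 b_{k+1} - b_{k+2} + c_k psi,  y = Q^2 b_1 - b_2 + c_0 psi.
void cheby_apply(const std::vector<double>& c, const vec& psi, vec& y)
{
    const int n = (int)c.size() - 1;
    vec b1(psi.size()), b2(psi.size()), t(psi.size());
    thrust::fill(b1.begin(), b1.end(), thrust::complex<double>(0.0));
    thrust::fill(b2.begin(), b2.end(), thrust::complex<double>(0.0));
    for (int k = n; k >= 1; --k) {
        apply_Q2(t, b1);                            // t  = Q^2 b_{k+1}
        axpbypcz(2.0, t, -1.0, b2, c[k], psi, b2);  // b2 = b_k
        b1.swap(b2);                                // b1 = b_k, b2 = b_{k+1}
    }
    apply_Q2(t, b1);
    axpbypcz(1.0, t, -1.0, b2, c[0], psi, y);
}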
24 Wilson-Dirac kernel performance comparison. [Figure: performance (GFLOPS) vs GPU / equivalent CPU-core count, comparing the CPU and GPU codes.]
25 Performance comparison. The GPU cluster uses 1 GPU per node and QDR Infiniband interconnects. The CPU machine is a Cray XT5 with dual hex-core AMD processors. We compare the performance of 32 GPUs (the target cluster dimension) vs 256 CPU cores (the optimal performance point for the CPU).
26 Overlap performance. For a 24³×64 lattice, one overlap matrix-vector multiplication takes 1.1 s on 32 GPUs; on 256 cores of the Cray XT5 it takes 3.3 s. This translates into a ratio of 1 GPU = 24 CPU cores.
27 Hwilson eigensolver
28 Small eigenspace dimension. [Figure: two panels vs the number of deflated eigenvectors: the normalized eigenvalue $\bar\lambda = \lambda/\lambda_{\max}$ and the required polynomial order; the error decays as $\delta = A e^{-bn}$.]
29 Eigensolvers. We use an implicitly restarted Arnoldi factorization,
$$A V_k = V_k H_k + f_k e_k^\dagger \quad\text{with}\quad (e_k)_n = \delta_{k,n},$$
restarted with shifted QR steps: $H_k - \mu = QR$, $V_k \leftarrow V_k Q$, $H_k \leftarrow RQ + \mu$ (the basis update $V_k Q$ is why we also need an efficient matrix-matrix multiplication routine). The method requires storage for temporary vectors: for optimal convergence we need 2.5 times more vectors than requested, $k = 2.5\,\ell$. Each iteration requires $k$ matrix-vector multiplications and $k^2$ vector orthogonalizations. We use locking of the converged eigenvectors to accelerate convergence.
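For concreteness, a host-side sketch of the k-step Arnoldi factorization that the restarts build on; matvec stands in for the accelerated Hwilson application, the vector helpers are placeholders, and real arithmetic is used for brevity.

#include <cmath>
#include <vector>

using Vec = std::vector<double>;                  // stand-in for a GPU vector

// Hypothetical placeholders for accelerated routines:
void   matvec(Vec& out, const Vec& in);           // out = A * in
double dot(const Vec& x, const Vec& y);
void   axpy(double a, const Vec& x, Vec& y);      // y += a * x
void   scale(double a, Vec& x);                   // x *= a

// Build A V_k = V_k H_k + f_k e_k^T: V holds k+1 basis vectors (V[0] given,
// normalized); H is a (k+1) x k upper-Hessenberg matrix stored row-major.
void arnoldi(std::vector<Vec>& V, std::vector<double>& H, int k)
{
    for (int j = 0; j < k; ++j) {
        Vec w(V[j].size());
        matvec(w, V[j]);                          // w = A v_j
        for (int i = 0; i <= j; ++i) {            // Gram-Schmidt: the k^2 dots
            H[i*k + j] = dot(V[i], w);
            axpy(-H[i*k + j], V[i], w);
        }
        double beta = std::sqrt(dot(w, w));
        H[(j+1)*k + j] = beta;                    // subdiagonal of H
        scale(1.0/beta, w);
        V[j+1] = w;                               // next basis vector
    }
}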
30 Hwilson eigensolver. We use Chebyshev acceleration of order 100; the Arnoldi eigensolver then converges in one iteration. We compute 200 eigenvectors, which requires storage for 500 vectors = 85 GB. Total time: 0.27 hours on the GPU cluster vs 0.60 hours on the Cray XT5, which corresponds to 1 GPU = 18 CPU cores. In situations with reduced GPU memory we use a mixed mode where the eigensystem is stored in CPU memory; this is feasible due to the Chebyshev acceleration. In this mode the GPU code takes 0.43 hours.
31 Overlap eigensolver
32 Overlap eigensystem. Deflation speeds up inversions considerably. One propagator = 12 inversions. At $m_\pi = 200$ MeV, without deflation: 12 x 2,000 = 24,000 matrix-vector multiplications; with deflation: 6,600 + 12 x 200 = 9,000. For one propagator per configuration this is a 2.5 times speed-up. We compute the eigenvectors of the hermitian overlap operator and then rebuild the overlap eigenvectors.
33 Overlap eigensystem. We compute 100 eigenvector pairs to the target precision. On the GPU cluster this takes 2.7 hours; on the Cray machine it takes 10.6 hours, which translates into 1 GPU = 26 CPU cores. When memory is limited we can use a mixed mode, storing the overlap Krylov space in CPU memory; the code then takes 4 hours to converge.
34 Overlap inverter
35 Overlap inverter. We use $m_\pi = 200$ MeV and a precision of $10^{-8}$. We use an adaptive CG method, which is 60% faster than regular CG, and a multi-shifted inverter. We store the overlap eigensystem in CPU memory, and the Hwilson eigensystem and the solutions in GPU memory. The GPU cluster takes 0.52 hours vs 2.3 hours for the Cray machine; the performance translates to 1 GPU = 35 CPU cores.
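An illustrative deflated-CG skeleton (plain CG with an eigenvector-deflated starting guess; the adaptive, multi-shifted solver used in the talk adds shift and precision bookkeeping on top of this structure). All names are placeholders, and real arithmetic is used for brevity.

#include <cmath>
#include <vector>

using Vec = std::vector<double>;
void   matvec(Vec& out, const Vec& in);           // overlap operator (placeholder)
double dot(const Vec& x, const Vec& y);
void   axpy(double a, const Vec& x, Vec& y);      // y += a * x

// evecs/evals: precomputed low modes of the Hermitian operator, kept for deflation.
void solve(const Vec& b, Vec& x,
           const std::vector<Vec>& evecs, const std::vector<double>& evals,
           double tol)
{
    // Deflation: x0 = sum_i (v_i . b / lambda_i) v_i is the exact low-mode
    // part of the solution, removing the modes that dominate the condition number.
    x.assign(b.size(), 0.0);
    for (size_t i = 0; i < evecs.size(); ++i)
        axpy(dot(evecs[i], b) / evals[i], evecs[i], x);

    Vec r(b), Ap(b.size()), p;
    matvec(Ap, x);
    axpy(-1.0, Ap, r);                            // r = b - A x0
    p = r;
    double rr = dot(r, r), bb = dot(b, b);
    while (rr > tol * tol * bb) {                 // standard CG on the rest
        matvec(Ap, p);
        double alpha = rr / dot(p, Ap);
        axpy(alpha, p, x);
        axpy(-alpha, Ap, r);
        double rr_new = dot(r, r);
        double beta = rr_new / rr;
        for (size_t i = 0; i < p.size(); ++i) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
}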
36 Summary. [Table: CPU vs GPU resources per component.] Hwilson eigensolver: expensive orthogonalization; Chebyshev acceleration; 2.5 x 200 vectors. Overlap eigensolver: 200 Hwilson vectors; 2.5 x 100 eigenpairs. Overlap inverter: 100 overlap eigenpairs; 200 Hwilson vectors + solutions (100). Totals: pure GPU (32): 3.5 hours; Cray XT5 (256): 13.5 hours.
37 Conclusions. We showed how to efficiently implement the overlap operator on GPUs. For efficiency we need to store the data in GPU memory, which forces us to use GPUs in parallel. For the 24³×64 lattices of interest, the Wilson kernel scaling efficiency is 50% on 32 GPUs; the scaling efficiency is better than that of CPU codes of equivalent performance. For the sign function needed by the overlap operator, the polynomial approximation is better both in terms of memory use and performance. Most of the time is spent in the eigensolvers; we use implicitly restarted Arnoldi eigensolvers. On systems with reduced memory a mixed strategy can be used, with only a 50-60% performance penalty. Overall, the GPU/CPU performance ratio for our codes is compatible with the ratio measured for the dslash routine. This is not surprising, since the most time-consuming part of these codes is the dslash routine, but it takes careful planning to work around all possible bottlenecks.
38 Outlook. Most of the time is spent in the overlap eigensolver. Chebyshev acceleration: preliminary tests show a 20-30% boost. Mixed precision -- a different eigensolver method. Use a different inversion/deflation strategy.