Efficient Implementation of the Overlap Operator on Multi-GPUs
Andrei Alexandru, Mike Lujan, Craig Pelissier, Ben Gamari, Frank Lee
SAAHPC 2011 - University of Tennessee
Outline
- Motivation
- Overlap operator
- Multi-GPU Wilson-Dirac kernel
- Eigensolver and inverter
- Conclusions
Building blocks of matter
Quarks are the constituents of matter; they interact strongly by exchanging gluons. They have peculiar properties: confinement and asymptotic freedom (Nobel Prize 2004). The theory of strong interactions is Quantum Chromodynamics (QCD).
Lattice QCD
Replace space-time with a four-dimensional lattice; differential operators are replaced with finite-difference operators. Typical lattice sizes are 20-40 sites per dimension, and 1.5-3 times longer in the 4th dimension; for example, 24^3 x 48 = 663,552 sites. Typical project size ~ 1 Petaflop.
Why overlap fermions on multi-GPUs?
We want to study QCD dynamics in the chiral regime: overlap fermions preserve chiral symmetry at finite lattice spacing, but they are computationally demanding. We use GPUs since they have good memory bandwidth; the memory requirements of the overlap operator force us to use multiple GPUs.
Lattice QCD
QCD is a field theory; lattice QCD is defined on a 4D grid. Quarks live on the sites (spinor fields Ψ) and gluons on the links (matrices U).
[Figure: 2D slice of the lattice showing site fields Ψ and link variables U.]
Links are randomly generated according to the dynamics.
Wilson-Dirac operator
Wilson fermions are one of the simplest discretizations: numerically fast and very sparse, but chiral symmetry is broken. The continuum operator m + D-slash is discretized as

  D_w = (ma + 4)·1 − (1/2) Σ_μ T_μ,
  μ > 0: (T_μ ψ)_n = U_μ(n) ψ_{n+μ̂} (1 − γ_μ),
  μ < 0: (T_μ ψ)_n = U_μ(n − μ̂)† ψ_{n−μ̂} (1 + γ_μ).

It serves as a kernel for the overlap operator; the quark propagator is ⟨0|ψ(x) ψ̄(y)|0⟩ = (D_w^{-1})_{x,y}.
Wilson-Dirac operator

  Y(n) = (MX)(n) = X(n) − κ Σ_{μ>0} [ V_μ(n) X(n + μ̂) + V_μ(n − μ̂)† X(n − μ̂) ]

The Wilson operator multiplies Wilson fields: 4x3 complex matrices living at every site of the lattice. The value of Y at a site depends on the value of X at the same site and at the 8 neighboring sites. Each field at a neighboring site needs to be transported to the final site; this involves a multiplication with a color matrix (3x3) and a spinor matrix (4x4). The color matrices differ from link to link, whereas the spinor matrices depend only on the direction. The matrices and the vectors are all complex.
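To make the transport step concrete, here is a minimal Python sketch (not the production GPU code) of one hopping term acting on a spinor stored as a 3 (color) x 4 (spin) nested list. The `gamma_ex` matrix is only an illustrative Hermitian involution; the actual γ_μ matrices depend on the chosen basis.

```python
# Sketch of one hopping term (T_mu psi)_n = U_mu(n) (1 - gamma_mu) psi_{n+mu}.
# A spinor is a 3 (color) x 4 (spin) nested list of complex numbers.

def spin_project(gamma, psi, sign=+1):
    # apply (1 - sign * gamma) on the spin index
    return [[psi[a][s] - sign * sum(gamma[s][t] * psi[a][t] for t in range(4))
             for s in range(4)] for a in range(3)]

def color_mult(U, psi):
    # apply the 3x3 color (link) matrix on the color index
    return [[sum(U[a][b] * psi[b][s] for b in range(3)) for s in range(4)]
            for a in range(3)]

def transport(U, gamma, psi, sign=+1):
    return color_mult(U, spin_project(gamma, psi, sign))

# Illustrative gamma: a Hermitian involution standing in for one gamma_mu
# (the concrete matrices depend on the basis convention).
gamma_ex = [[0, 0, 1, 0],
            [0, 0, 0, 1],
            [1, 0, 0, 0],
            [0, 1, 0, 0]]
```

Because (1 − γ_μ)(1 + γ_μ) = 0, only half of the spin components are independent after projection; this is what allows the compressed fields mentioned later in the gather step.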
Overlap operator
[Figure: polynomial approximation of sign(x) on [−1, 1]; the error is controlled down to |x| = ε.]
The overlap operator is dense, and about 100 times more expensive than the Wilson kernel. The cost is proportional to the condition number of (H_w)^2 and to log δ:

  D = 1 + γ_5 sign(H_w), with H_w = γ_5 D_w,
  sign(H_w) ≈ Q P(Q^2), with Q = H_w/‖H_w‖,
  δ = max_{x ∈ [ε,1]} |1 − x P(x^2)|.
Requirements
- Overlap operator: Wilson kernel + vector routines, Hwilson eigensolver
- Propagator calculation: overlap inverter, overlap eigensolver
System architecture
[Diagram] GPU: 1-6 GB of memory at 140 GB/s. CPU: 12-48 GB of memory at 10-20 GB/s. GPU-CPU transfers over PCIe at ~2 x 2.5 GB/s; node-to-node communication over Infiniband at ~5 GB/s.
Computational strategy
We use one process per GPU and MPI for communication. All data resides in GPU memory. Lattice sites are split evenly between the nodes, and all data belonging to a particular site resides on the node that owns the site. Communication is implemented mainly via shifts and is overlapped with computation where possible.
Vector routines
Expression templates + the THRUST library auto-generate optimized kernels for expressions like φ ← αψ_1 + βψ_2 + γψ_3 + ...
Non-reduction kernels scale perfectly; the maximum bandwidth on an M2070 with ECC on is about 85 GB/s. Reduction kernels scale poorly because of their small computational fraction; most of the poor scaling is due to poor single-node kernel performance on small vectors.
[Figure: bandwidth per GPU (GB/s) vs GPU count for vector addition and scalar product.]
Wilson-Dirac kernel
Wilson-Dirac kernel
The cost of the Wilson-Dirac operator is 1368 flops/site: 600 multiplications (44%) and 768 additions (56%) -- a balanced load.
The data for one site computation is: in: 1 spinor + 8 neighbor spinors + 8 links; out: 1 spinor. In double precision this is 3072 bytes/site, so the computational density is 1368 flops / 3072 bytes = 0.45 flop/byte -- double that in single precision.
For 85 GB/s maximum bandwidth, the maximum kernel performance is 38.25 GFlops (double) and 76.5 GFlops (single).
The kernel has a fair amount of parallelism: the 8 transports can be implemented in parallel, and each transport can be split into 2 parallel tasks.
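The byte and flop accounting above can be verified in a few lines; this sketch just redoes the slide's arithmetic (spinor = 4x3 complex, link = 3x3 complex, double precision):

```python
# Flop/byte accounting for the double-precision Wilson-Dirac kernel.
bytes_per_complex = 16                      # one double-precision complex
spinor = 4 * 3 * bytes_per_complex          # 192 bytes
link = 3 * 3 * bytes_per_complex            # 144 bytes

# Reads: the site spinor, 8 neighbor spinors, 8 links; writes: 1 spinor.
traffic = (1 + 8 + 1) * spinor + 8 * link   # 3072 bytes/site
flops = 600 + 768                           # 1368 flops/site

intensity = flops / traffic                 # ~0.45 flop/byte
peak_gflops = 85 * intensity                # ~38 GFlops at 85 GB/s
```

This is the roofline bound quoted on the slide: the kernel is bandwidth-limited, so halving the data size (single precision) doubles the attainable performance.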
Calculation steps
The communication time is overlapped with the computation to hide latency:
1. Gather: compute compressed fields and fill the communication buffers
2. Comm: initiate non-blocking communication
3. Bulk: compute the dslash for the interior points
4. Scatter: finish communication and add the results
Minimal surface
Cut the lattice into hypercubes with the same dimensions. The longest dimension is always cut first, and an already-cut dimension is preferred. As the lattice is cut, the boundary-to-interior ratio increases.

GPUs   N_int       N_boun      sub-lattice dims
1      6.6 x 10^5  0           24, 24, 24, 48
2      3.3 x 10^5  2.8 x 10^4  24, 24, 24, 24
4      1.7 x 10^5  2.8 x 10^4  24, 24, 24, 12
8      8.3 x 10^4  2.8 x 10^4  12, 24, 24, 12
16     4.1 x 10^4  2.1 x 10^4  12, 12, 24, 12
32     2.1 x 10^4  1.4 x 10^4  12, 12, 12, 12
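The table entries can be reproduced with a short counting function. This sketch assumes the boundary count is the sum of the two cut-face areas in every cut direction (which is what reproduces the numbers above); `parts` is the number of pieces per dimension, chosen by the longest-dimension-first rule.

```python
def site_counts(full, parts):
    # Split a 4D lattice `full` into equal hypercubes; parts[d] pieces per dim.
    sub = [f // p for f, p in zip(full, parts)]
    vol = 1
    for s in sub:
        vol *= s
    # boundary sites = two cut faces per direction that is actually cut
    boun = sum(2 * vol // sub[d] for d in range(4) if parts[d] > 1)
    return vol, boun
```

For example, `site_counts([24, 24, 24, 48], [2, 2, 2, 4])` gives the 32-GPU row: 20736 sites per GPU, 13824 of them on the boundary.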
Dslash anatomy
[Diagram: two GPU streams. Stream 1 runs the bulk dslash (step 2). Stream 2 runs the boundary pipeline: 1: gather, 2: gpu->cpu (PCI), 3: cpu->cpu (Infiniband), 4: cpu->gpu (PCI), 5: scatter.]
Dslash timing
GPUs   gather  scatter  gpu>cpu  cpu>cpu  cpu>gpu  comm  dslash bulk
1      0.0     0.0      0.0      0.0      0.0      0.0   28.9
2      0.2     0.3      0.6      1.3      0.6      2.6   14.5
4      0.2     0.3      0.6      1.6      0.6      2.9   7.3
8      0.2     0.3      0.6      1.5      0.6      2.7   3.6
16     0.2     0.2      0.7      1.6      0.6      2.8   1.8
Strong scaling for 24^3 x 64
[Figure: performance per GPU (GFLOPS) vs GPU count for double precision, single precision, and the performance model.]
Comparison with other codes
Performance of our code vs QUDA:

Lattice      GPUs  Our code (double)  Our code (single)  QUDA (single)
32^3 x 256   16    521                1005               1327
32^3 x 256   32    928                1825               2247
24^3 x 128   16    487                935                971
24^3 x 128   32    503                913                1007
Overlap operator
Sign approximation
Polynomial approximation: P(Q^2)ψ = Σ_{i=1}^{n} c_i T_i(Q^2) ψ
Rational approximation: P(Q^2)ψ = Σ_{i=1}^{n} b_i/(Q^2 + c_i) ψ
[Figure: time (s) vs GPU count for the double-pass (rational) and polynomial methods.]
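A scalar sketch of the polynomial branch, assuming P is a Chebyshev fit of 1/√y on [ε², 1] so that x·P(x²) ≈ sign(x) for ε ≤ |x| ≤ 1. In the real solver the argument is the operator Q and the series is applied via the Clenshaw-style recurrence in matrix-vector form; here scalars stand in for eigenvalues, and ε = 0.05 is an arbitrary illustrative choice.

```python
import math

def cheb_fit(f, a, b, n):
    # Chebyshev coefficients of f on [a, b], sampled at Gauss-Chebyshev nodes
    nodes = [math.cos(math.pi * (k + 0.5) / n) for k in range(n)]
    fvals = [f(0.5 * (b - a) * t + 0.5 * (b + a)) for t in nodes]
    coeffs = [2.0 / n * sum(fvals[k] * math.cos(math.pi * j * (k + 0.5) / n)
                            for k in range(n)) for j in range(n)]
    coeffs[0] *= 0.5
    return coeffs

def cheb_eval(coeffs, a, b, y):
    # Clenshaw recurrence on the variable mapped to [-1, 1]
    t = (2.0 * y - (a + b)) / (b - a)
    b1 = b2 = 0.0
    for c in reversed(coeffs[1:]):
        b1, b2 = 2.0 * t * b1 - b2 + c, b1
    return t * b1 - b2 + coeffs[0]

eps = 0.05                       # lower edge of the approximated spectrum
coeffs = cheb_fit(lambda y: 1.0 / math.sqrt(y), eps**2, 1.0, 200)

def sign_approx(x):
    # sign(x) ~ x * P(x^2) for eps <= |x| <= 1
    return x * cheb_eval(coeffs, eps**2, 1.0, x * x)
```

The required polynomial order grows as the spectral window [ε, 1] widens toward zero, which is why deflating the smallest Hwilson eigenvalues (later slides) pays off.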
Wilson-Dirac kernel performance comparison
[Figure: performance (GFLOPS) vs GPU-equivalent count for the CPU and GPU codes.]
Performance comparison
The GPU cluster uses one GPU per node and QDR Infiniband interconnects. The CPU machine is a Cray XT5 with dual hex-core AMD processors. We compare the performance of 32 GPUs (the target cluster dimension) vs 256 CPU cores (optimal performance for the CPU).
Overlap performance
For a 24^3 x 64 lattice, one overlap operator matrix-vector multiplication takes 1.1 s on 32 GPUs. On 256 cores of the Cray XT5 it takes 3.3 s. This translates into a ratio of 1 GPU = 24 CPU cores.
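The core-equivalence ratio follows directly from the two timings; a one-liner makes the conversion explicit:

```python
# Equivalent CPU cores per GPU from the overlap mat-vec timings above.
gpu_time, n_gpus = 1.1, 32      # seconds, GPU cluster
cpu_time, n_cores = 3.3, 256    # seconds, Cray XT5

# cores that would match one GPU at equal time-to-solution
cores_per_gpu = n_cores * (cpu_time / gpu_time) / n_gpus   # 24.0
```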
Hwilson eigensolver
Small eigenspace dimension
[Figures: (left) eigenvalue λ vs number of eigenvectors; (right) polynomial order vs number of eigenvectors.]
δ = A e^{−bn}, ε = λ/λ_max
Eigensolvers
We use implicitly restarted Arnoldi factorization:

  A V_k = V_k H_k + f_k e_k^T, with (e_k)_n = δ_{k,n}
  implicit restart with shift μ: H_k − μ = QR, V_k ← V_k Q, H_k ← RQ + μ

The method requires storage for temporary vectors; for optimal convergence we need 2.5 times more vectors than eigenvectors requested: k = 2.5 l. For efficiency we also need to code a matrix-matrix multiplication routine (for the V_k Q update). Each iteration needs k matrix-vector multiplications and k^2 vector orthogonalizations. We use locking of the converged eigenvectors to accelerate convergence.
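The basic Arnoldi factorization behind the method can be sketched in plain Python (dense, real, no restarts or locking; the production solver adds Chebyshev acceleration and implicit restarts on top of this):

```python
import random

def matvec(A, v):
    return [sum(A[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def arnoldi(A, v0, k):
    # Build k orthonormal vectors V and upper-Hessenberg H (k x k), plus the
    # residual f, satisfying A V_k = V_k H_k + f e_k^T.
    n = len(v0)
    nrm = dot(v0, v0) ** 0.5
    V = [[x / nrm for x in v0]]
    H = [[0.0] * k for _ in range(k)]
    f = None
    for j in range(k):
        w = matvec(A, V[j])
        for i in range(len(V)):            # Gram-Schmidt vs previous vectors
            H[i][j] = dot(V[i], w)
            w = [wx - H[i][j] * vx for wx, vx in zip(w, V[i])]
        if j + 1 < k:
            H[j + 1][j] = dot(w, w) ** 0.5
            V.append([x / H[j + 1][j] for x in w])
        else:
            f = w                          # residual of the k-step factorization
    return V, H, f
```

After k steps the Ritz values of H approximate the extremal eigenvalues of A; restarting compresses the factorization back to l vectors while filtering out unwanted spectrum, which is where the k = 2.5 l workspace goes.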
Hwilson eigensolver
We use Chebyshev acceleration of order 100; the Arnoldi eigensolver then converges in one iteration. Computing 200 eigenvectors requires storage for 500 vectors ≈ 85 GB. The total time is 0.27 hours on the GPU cluster vs 0.60 hours on the Cray XT5; this corresponds to 1 GPU = 18 CPU cores.
In situations with reduced GPU memory we use a mixed mode: the eigensystem is stored in CPU memory. This is feasible due to the Chebyshev acceleration. In this mode the GPU code takes 0.43 hours.
Overlap eigensolver
Overlap eigensystem
Deflation speeds up inversions considerably. One propagator = 12 inversions. At m_π = 200 MeV:
- without deflation: 12 x 2,000 = 24,000
- with deflation: 6,600 + 12 x 200 = 9,000
With one propagator per configuration, this is a 2.5-times speed-up.
We compute the hermitian-overlap eigenvectors and then rebuild the overlap eigenvectors.
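The deflation trade-off above is just bookkeeping; this snippet redoes the slide's arithmetic (costs in the slide's iteration-count units, with the 6,600 eigensolver cost amortized over one propagator per configuration):

```python
# Deflation cost accounting for one propagator at m_pi = 200 MeV.
inversions = 12                       # one propagator = 12 inversions
cost_plain = inversions * 2000        # 24,000 without deflation
cost_deflated = 6600 + inversions * 200   # eigensolver + deflated solves = 9,000

speedup = cost_plain / cost_deflated  # ~2.7x, quoted as ~2.5x on the slide
```

If more propagators are computed per configuration, the 6,600 up-front cost is shared and the speed-up grows further.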
Overlap eigensystem
We compute 100 eigenvector pairs to 10^-10 precision. On the GPU cluster this takes 2.7 hours; on the Cray machine it takes 10.6 hours. This translates into 1 GPU = 26 CPU cores.
When memory is limited, we can use a mixed mode: the overlap Krylov space is stored in CPU memory. The code takes 4 hours to converge in this case.
Overlap inverter
Overlap inverter
We use m_π = 200 MeV and a precision of 10^-8. We use an adaptive CG method, which is 60% faster than regular CG, and a multi-shift inverter for the shifts. We store the overlap eigensystem in CPU memory, and the Hwilson eigensystem and the solutions in GPU memory. The GPU cluster takes 0.52 hours vs 2.3 hours for the Cray machine; this performance translates to 1 GPU = 35 CPU cores.
Summary
- Hwilson eigensolver: expensive orthogonalization on the CPU vs Chebyshev acceleration on the GPU; storage: 2.5 x 200 vectors
- Overlap eigensolver: 200 Hwilson vectors + 2.5 x 100 eigenpairs
- Overlap inverter: 100 overlap eigenpairs + 200 Hwilson vectors + solutions (100)
Total: pure GPU (32): 3.5 hours; Cray XT5 (256): 13.5 hours
Conclusions
We showed how to efficiently implement the overlap operator on GPUs. For efficiency we need to store the data in GPU memory, which forces us to use GPUs in parallel.
For the 24^3 x 64 lattices of interest, the Wilson kernel scaling efficiency is 50% on 32 GPUs; the scaling efficiency is better than that of CPU codes of equivalent performance.
For the sign function needed by the overlap operator, the polynomial approximation is better both in terms of memory use and performance.
Most of the time is spent in eigensolvers; we use implicitly restarted Arnoldi eigensolvers. On systems with reduced memory, a mixed strategy can be used with only a 50-60% performance penalty.
Overall, the GPU/CPU performance ratio for our codes is compatible with the ratio measured for the dslash routine. This result is not surprising, since the most time-consuming part of these codes is the dslash routine, but it takes careful planning to work around all possible bottlenecks.
Outlook
Most of the time is spent in the overlap eigensolver:
- Chebyshev acceleration: preliminary 20-30% boost
- Mixed precision -- different eigensolver method
- Use a different inversion/deflation strategy