Communication Avoiding Strategies for the Numerical Kernels in Coupled Physics Simulations

1 ExaScience Lab, Intel Labs Europe: Exascale Computing
Communication Avoiding Strategies for the Numerical Kernels in Coupled Physics Simulations
SIAM Conference on Parallel Processing for Scientific Computing, Savannah, Georgia, Feb 15-17, 2012
P. Ghysels (1,2), T. Ashby (2,3), K. Meerbergen (4), W. Vanroose (1)
(1) University of Antwerp, Dept. Math & CS, Belgium (pieter.ghysels@ua.ac.be)
(2) Intel ExaScience Lab, Leuven, Belgium
(3) Imec vzw, Leuven, Belgium
(4) KU Leuven, Dept. CS, Leuven, Belgium

2 Outline
1. Introduction
2. GMRES: overview
3. Avoiding the reduction for normalization; shifted matrix-vector product for stability
4. Pipelined GMRES: $p_1$-GMRES, deeper pipelining
5. Numerical results: changing the Krylov basis
6. Performance model: predicted scaling
7. Discussion and Outlook

3 Introduction
- Increasing gap between computation and communication costs
  - Floating point performance steadily increases
  - Network latencies only go down marginally
  - Memory latencies decline slowly
- Avoid communication by trading communication for computation
- Hide latency of communications
- Generalized Minimal Residual (GMRES) method
  - Dot-products are latency dominated
  - Dot-products cannot be overlapped by other work in standard GMRES
  - All other operations (stencil/axpy) scale well

4 Global reductions and variability
- Load imbalance, OS jitter, hardware variability, ...
- Global communication implies expensive global synchronization
[Figure: influence of OS noise on the (left) CHiC cluster and (right) Jaguar XT4; latency in microseconds, with noise and noiseless, versus number of processes]
Thanks to T. Hoefler, see T. Hoefler, T. Schneider and A. Lumsdaine (2010)

5 GMRES, a short overview
- Problem: find $x$ from $Ax = b$
- Initial guess $x^{(0)}$ and initial residual $r^{(0)} = b - Ax^{(0)}$
- Find an approximate solution $x_k \in x_0 + K_k(r^{(0)}, A)$, with $K_k(r^{(0)}, A) = \mathrm{span}\{r^{(0)}, Ar^{(0)}, A^2 r^{(0)}, \ldots, A^{k-1} r^{(0)}\}$
- Orthonormal Krylov basis $V_k = [v_0, v_1, \ldots, v_{k-1}]$, $V_k^T V_k = I_k$
  - Create a new basis vector $z_k = A v_{k-1}$ using a matrix-vector product
  - Make $z_k$ orthogonal to all previous $v_0, \ldots, v_{k-1}$: classical/modified Gram-Schmidt or Householder orthogonalization
  - Satisfy the Arnoldi recurrence relation $A V_i = V_{i+1} H_{i+1,i}$
- Find $x_k = x_0 + V_k y_k$, with the coefficient vector $y_k$ chosen to give the minimal residual
  - Transform $H$ from upper Hessenberg to upper triangular form to solve the least squares problem

6 GMRES, classical Gram-Schmidt
    $r_0 \leftarrow b - A x_0$
    $v_0 \leftarrow r_0 / \|r_0\|_2$
    for $i = 0, \ldots, m-1$ do
        $w \leftarrow A v_i$
        for $j = 0, \ldots, i$ do
            $h_{j,i} \leftarrow \langle w, v_j \rangle$
        end for
        $\tilde v_{i+1} \leftarrow w - \sum_{j=0}^{i} h_{j,i} v_j$
        $h_{i+1,i} \leftarrow \|\tilde v_{i+1}\|_2$
        $v_{i+1} \leftarrow \tilde v_{i+1} / h_{i+1,i}$
        {apply Givens rotations to $h_{:,i}$}
    end for
    $y_m \leftarrow \mathrm{argmin} \, \| H_{m+1,m} y_m - \|r_0\|_2 e_1 \|_2$
    $x \leftarrow x_0 + V_m y_m$
- Sparse matrix-vector product: only communication with neighbors, good scaling
- Dot-product: global communication, scales as $\log(P)$
- Scalar-vector multiplication, vector-vector addition: no communication
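The Arnoldi part of the algorithm above can be sketched in a few lines of plain Python. This is a minimal, sequential illustration, not the authors' implementation: the helper names `dot`, `matvec` and `arnoldi_cgs` and the small dense test matrix are assumptions made for the example, and the Givens rotations and least squares solve are left out. The comments mark where a parallel code would communicate.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def matvec(A, x):
    return [dot(row, x) for row in A]

def arnoldi_cgs(A, r0, m):
    """Run m Arnoldi steps with classical Gram-Schmidt; return (V, H)."""
    beta = math.sqrt(dot(r0, r0))
    V = [[x / beta for x in r0]]             # v_0 = r_0 / ||r_0||
    H = [[0.0] * m for _ in range(m + 1)]    # (m+1) x m upper Hessenberg
    for i in range(m):
        w = matvec(A, V[i])                  # SpMV: neighbour communication only
        h = [dot(w, v) for v in V]           # i+1 dot-products against the same w:
                                             # one global reduction in parallel
        for j, hj in enumerate(h):
            H[j][i] = hj
            w = [wk - hj * vk for wk, vk in zip(w, V[j])]  # axpy, no communication
        H[i + 1][i] = math.sqrt(dot(w, w))   # norm: a second global reduction
        V.append([wk / H[i + 1][i] for wk in w])
    return V, H

# Hypothetical 4 x 4 unsymmetric test matrix
A = [[2.0, 1.0, 0.0, 0.0],
     [0.5, 3.0, 1.0, 0.0],
     [0.0, 0.5, 4.0, 1.0],
     [0.0, 0.0, 0.5, 5.0]]
V, H = arnoldi_cgs(A, [1.0, 1.0, 1.0, 1.0], 3)
```

The two reductions per iteration (projections, then norm) are the ones the following slides set out to merge and then hide.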

7 GMRES iteration on 4 nodes
[Figure: timeline for processes p1 to p4, not to scale: SpMV, local dot, reduction, b-cast, axpy (Gram-Schmidt), then norm, reduction, b-cast, scale (normalization), and H update]
- Communication for SpMV can be overlapped with stencil calculations
- Typically a binomial tree reduction with height $\log_2(P)$
- Processors idle during the all-reduce latency
- For modified Gram-Schmidt, $(i+1)$ global reductions

8 Avoiding reduction for normalization
Can the extra reduction for the normalization be avoided?
[Figure: the same timeline for p1 to p4, with the normalization phase (norm, reduction, b-cast, scale) highlighted]

11 Avoiding reduction for normalization
Given an orthonormal basis for the Krylov space $K_{i+1}(A, v_0)$, $V_{i+1} := [v_0, v_1, \ldots, v_{i-1}, v_i]$, which satisfies the Arnoldi relation $A V_i = V_{i+1} H_{i+1,i}$, then the steps
    1: $z_{i+1} = A v_i$
    2: Compute $\langle z_{i+1}, v_j \rangle$, for $j = 0, \ldots, i$, and $\|z_{i+1}\|$
    3: $h_{j,i} = \langle z_{i+1}, v_j \rangle$, for $j = 0, \ldots, i$
    4: $h_{i+1,i} = \sqrt{\|z_{i+1}\|^2 - \sum_{j=0}^{i} \langle z_{i+1}, v_j \rangle^2}$
    5: $v_{i+1} = \left( z_{i+1} - \sum_{j=0}^{i} \langle z_{i+1}, v_j \rangle v_j \right) / h_{i+1,i}$
expand the basis to $V_{i+2}$ and the Hessenberg matrix to $H_{i+2,i+1}$.
- Dot-products in line 2 can be combined in a single reduction
- Line 4 can lead to numerical instability
- Line 4 can lead to square-root breakdown (negative argument under the root)
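The identity behind line 4 is the Pythagorean theorem applied to the orthonormal vectors $v_0, \ldots, v_i$. A short worked derivation, assuming real arithmetic and exact orthonormality of the $v_j$:

```latex
\begin{aligned}
\tilde v_{i+1} &= z_{i+1} - \sum_{j=0}^{i} \langle z_{i+1}, v_j \rangle v_j, \\
\|\tilde v_{i+1}\|_2^2
  &= \|z_{i+1}\|_2^2
     - 2 \sum_{j=0}^{i} \langle z_{i+1}, v_j \rangle^2
     + \sum_{j=0}^{i} \langle z_{i+1}, v_j \rangle^2 \\
  &= \|z_{i+1}\|_2^2 - \sum_{j=0}^{i} \langle z_{i+1}, v_j \rangle^2 .
\end{aligned}
```

When $z_{i+1}$ lies almost entirely in the span of the existing basis, the two terms on the right nearly cancel. In floating point the difference can then lose most of its digits or even turn negative, which is the instability and breakdown noted for line 4.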

13 Avoiding reduction for normalization, shifted matrix-vector product to improve stability
Given an orthonormal basis for the Krylov space $K_{i+1}(A, v_0)$, $V_{i+1} := [v_0, v_1, \ldots, v_{i-1}, v_i]$, with $A V_i = V_{i+1} H_{i+1,i}$ and the shifts $\sigma_i \in \mathbb{C}$, then the steps
    $z_{i+1} = (A - \sigma_i I) v_i$
    Compute $\langle z_{i+1}, v_j \rangle$, for $j = 0, \ldots, i$, and $\|z_{i+1}\|$
    $h_{j,i} = \langle z_{i+1}, v_j \rangle$, for $j = 0, \ldots, i-1$
    $h_{i,i} = \langle z_{i+1}, v_i \rangle + \sigma_i$
    $h_{i+1,i} = \sqrt{\|z_{i+1}\|^2 - \sum_{j=0}^{i} \langle z_{i+1}, v_j \rangle^2}$
    $v_{i+1} = \left( z_{i+1} - \sum_{j=0}^{i} \langle z_{i+1}, v_j \rangle v_j \right) / h_{i+1,i}$
expand the basis to $V_{i+2}$ and the Hessenberg matrix to $H_{i+2,i+1}$.
If square-root breakdown still occurs: re-orthogonalize or restart.
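A minimal sketch of one such step in plain Python, for real shifts only. The function name `l1_step`, the test matrix and the shift value 2.5 are assumptions for illustration. Because the shift only changes $h_{i,i}$, running the same steps with and without a shift should produce the same Hessenberg column, which the loop at the end exercises.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def l1_step(A, V, sigma=0.0):
    """Extend the orthonormal basis V by one vector; return (h column, v_new).

    The norm of the new vector is not measured directly. It is derived from
    ||z||^2 and the projections, so all dot-products can share one reduction.
    """
    v = V[-1]
    z = [dot(row, v) - sigma * vk for row, vk in zip(A, v)]  # z = (A - sigma I) v_i
    proj = [dot(z, vj) for vj in V]          # <z, v_j>, j = 0..i
    zz = dot(z, z)                           # ||z||^2, part of the same reduction
    arg = zz - sum(p * p for p in proj)
    if arg <= 0.0:                           # square-root breakdown
        raise ArithmeticError("breakdown: re-orthogonalize or restart")
    h = list(proj)
    h[-1] += sigma                           # h_{i,i} = <z, v_i> + sigma_i
    h.append(math.sqrt(arg))                 # h_{i+1,i}
    for p, vj in zip(proj, V):
        z = [zk - p * vk for zk, vk in zip(z, vj)]
    return h, [zk / h[-1] for zk in z]

# Hypothetical test problem
A = [[2.0, 1.0, 0.0, 0.0],
     [0.5, 3.0, 1.0, 0.0],
     [0.0, 0.5, 4.0, 1.0],
     [0.0, 0.0, 0.5, 5.0]]
V_plain, V_shift = [[0.5] * 4], [[0.5] * 4]
columns = []
for _ in range(3):
    h0, v0_new = l1_step(A, V_plain, 0.0)
    h1, v1_new = l1_step(A, V_shift, 2.5)
    V_plain.append(v0_new)
    V_shift.append(v1_new)
    columns.append((h0, h1))
```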

14 Avoiding reduction for normalization
- Only a single reduction per iteration: both reductions are combined into one
- The reduction now carries $i+1$ floating point numbers instead of $i$
[Figure: timeline for p1 to p4: SpMV, local dot, reduction, b-cast, axpy (Gram-Schmidt and normalization), H update]

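15 Overlapping dot-products and SpMV
Given the bases $V_i := [v_0, v_1, \ldots, v_{i-1}]$ and $Z_{i+1} := [z_0, z_1, \ldots, z_{i-1}, z_i]$, related by $z_0 = v_0$ and $z_j = (A - \sigma I) v_{j-1}$ ($\sigma \in \mathbb{C}$), then the following steps
    1: Compute $\langle z_i, v_j \rangle$, for $j = 0, \ldots, i-1$, and $\|z_i\|$
    2: $w = A z_i$
    3: $h_{j,i-1} = \langle z_i, v_j \rangle$, for $j = 0, \ldots, i-2$
    4: $h_{i-1,i-1} = \langle z_i, v_{i-1} \rangle + \sigma$
    5: $h_{i,i-1} = \sqrt{\|z_i\|^2 - \sum_{j=0}^{i-1} \langle z_i, v_j \rangle^2}$
    6: $v_i = \left( z_i - \sum_{j=0}^{i-1} \langle z_i, v_j \rangle v_j \right) / h_{i,i-1}$
    7: $z_{i+1} = \left( w - \sum_{j=0}^{i-1} h_{j,i-1} z_{j+1} \right) / h_{i,i-1}$
expand the bases to $V_{i+1}$ and $Z_{i+2}$, and $H_{i,i-1}$ to $H_{i+1,i}$.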

16 Overlapping dot-products and SpMV
- Dot-products are computed in line 1, but first used in line 3, after the SpMV in line 2
- Sliding window
  - Add new vector $z_{i+1}$ to $Z_{i+1}$ at one end, using SpMV
  - Add orthogonal vector $v_i$ to $V_i$ at the other end, using dot-products
- Proof
  - Starting from the Arnoldi relation: $v_i = \left( A v_{i-1} - \sum_{j=0}^{i-1} h_{j,i-1} v_j \right) / h_{i,i-1}$
  - Multiplying with $(A - \sigma I)$, with $A z_i$ already calculated in line 2: $z_{i+1} = \left( A z_i - \sum_{j=0}^{i-1} h_{j,i-1} z_{j+1} \right) / h_{i,i-1}$
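The sliding window can be sketched sequentially in plain Python. This is an illustration under assumptions, not the authors' code: `p1_arnoldi`, the test matrix and the real shift 1.5 are hypothetical, and the comments only mark where a parallel code could overlap the reduction with the SpMV. The line numbers in the comments refer to the seven steps of the previous slide.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def matvec(A, x):
    return [dot(row, x) for row in A]

def p1_arnoldi(A, v0, m, sigma=0.0):
    """Build V = [v_0..v_m] and H ((m+1) x m) with the p(1) sliding window."""
    V = [v0]
    Z = [v0, [a - sigma * b for a, b in zip(matvec(A, v0), v0)]]  # z_0, z_1
    H = [[0.0] * m for _ in range(m + 1)]
    for i in range(1, m + 1):
        zi = Z[i]
        # line 1: local dot-products; their reduction could be started here ...
        proj = [dot(zi, vj) for vj in V]
        zz = dot(zi, zi)
        # line 2: ... and overlapped with this SpMV, which does not need them
        w = matvec(A, zi)
        # lines 3-5: column i-1 of H
        for j, p in enumerate(proj):
            H[j][i - 1] = p
        H[i - 1][i - 1] += sigma
        H[i][i - 1] = math.sqrt(zz - sum(p * p for p in proj))
        # line 6: new orthonormal vector v_i
        vi = list(zi)
        for p, vj in zip(proj, V):
            vi = [a - p * b for a, b in zip(vi, vj)]
        V.append([a / H[i][i - 1] for a in vi])
        # line 7: next z from w, using the recurrence instead of a new SpMV on v_i
        znext = list(w)
        for j in range(i):
            znext = [a - H[j][i - 1] * b for a, b in zip(znext, Z[j + 1])]
        Z.append([a / H[i][i - 1] for a in znext])
    return V, H

# Hypothetical test problem
A = [[2.0, 1.0, 0.0, 0.0],
     [0.5, 3.0, 1.0, 0.0],
     [0.0, 0.5, 4.0, 1.0],
     [0.0, 0.0, 0.5, 5.0]]
V, H = p1_arnoldi(A, [0.5, 0.5, 0.5, 0.5], 3, sigma=1.5)
```

If the recurrence in line 7 is right, the resulting `V` and `H` still satisfy the ordinary Arnoldi relation for `A`, which is a convenient correctness check.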

17 p(1)-GMRES
[Figure: timeline for p1 to p4: reduction, b-cast, scale, axpy, H update, correction, local dot]
- Overlap dot-product global communication latency with SpMV
- Some additional flops compared to standard GMRES (the "correction" step in the figure)
- Still waiting if the reduction takes longer than the SpMV

18 Deeper pipelining
What if a global all-reduce takes longer than a matrix-vector product?
$V_{i-l+1} = [v_0, v_1, \ldots, v_{i-l}]$
$Z_{i+1} = [z_0, z_1, \ldots, z_{i-l}, \underbrace{z_{i-l+1}, \ldots, z_i}_{l}]$
$$z_j = \begin{cases} v_0, & j = 0, \\ P_j(A) v_0, & 0 < j < l, \\ P_l(A) v_{j-l}, & j \geq l, \end{cases} \qquad \text{with } P_i(t) = \prod_{j=0}^{i-1} (t - \sigma_j)$$
- Sliding window now $l$ iterations long
- All shifts zero: $z_j = A^l v_{j-l}$, called the monomial basis; like the power method, convergence to the dominant eigenvector
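The remark about the monomial basis can be made concrete with a small experiment: repeated multiplication by $A$ makes successive vectors almost parallel, so they form a poorly conditioned basis. The sketch below uses a hypothetical symmetric 3 x 3 matrix and start vector chosen only for illustration, and measures the cosine of the angle between consecutive powers.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def matvec(A, x):
    return [dot(row, x) for row in A]

def cosine(x, y):
    return dot(x, y) / math.sqrt(dot(x, x) * dot(y, y))

# Hypothetical test matrix with eigenvalues 3 + sqrt(3), 3 and 3 - sqrt(3)
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]

z = [[1.0, 1.0, 1.0]]                 # z_0 = v_0
for _ in range(9):
    z.append(matvec(A, z[-1]))        # monomial basis: z_{k+1} = A z_k

# Cosine of the angle between consecutive basis vectors
cosines = [cosine(z[k], z[k + 1]) for k in range(9)]
```

The values in `cosines` approach 1 as `k` grows, which is why the later slides replace the zero shifts by Ritz values or Chebyshev zeros.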

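19 The change of basis matrix
The $Z_i$ basis satisfies a similar Arnoldi relation $A Z_i = Z_{i+1} B_{i+1,i}$, with the upper Hessenberg change of basis matrix $B_{i+1,i}$:
$$B_{i+1,i} = \begin{pmatrix}
\sigma_0 & & & & & \\
1 & \ddots & & & & \\
 & \ddots & \sigma_{l-1} & & & \\
 & & 1 & h_{0,0} & \cdots & h_{0,i-l} \\
 & & & h_{1,0} & \ddots & \vdots \\
 & & & & \ddots & h_{i+1-l,i-l}
\end{pmatrix}$$
See also M. Hoemmen (2010) and E. Carson, N. Knight and J. Demmel (2011)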

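20 The $Z_k$ basis can be made orthonormal by a QR factorization $Z_k = V_k G_k$
The last column of $G_k$, i.e. $g_{j,k-1}$ with $j = 0, \ldots, k-1$, can be calculated using only the dot-products
$$\begin{cases} \langle z_{k-1}, v_j \rangle, & j \leq k-l-1, \\ \langle z_{k-1}, z_j \rangle, & k-l-1 < j \leq k-1, \end{cases}$$
and the elements of $G_{k-1}$.
$V_{k-l} = [v_0, v_1, \ldots, v_{k-l-1}]$
$Z_k = [z_0, z_1, \ldots, z_{k-l-1}, z_{k-l}, \ldots, z_{k-2}, z_{k-1}]$
$$g_{j,k-1} = \langle z_{k-1}, v_j \rangle = \left( \langle z_{k-1}, z_j \rangle - \sum_{m=0}^{j-1} g_{m,j} g_{m,k-1} \right) / g_{j,j}, \qquad g_{j,j} = \sqrt{\langle z_j, z_j \rangle - \sum_{m=0}^{j-1} g_{m,j}^2}$$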
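The formulas for $g_{j,k-1}$ and $g_{j,j}$ are those of a column-by-column Cholesky factorization of the Gram matrix $Z_k^T Z_k$. A minimal sketch in plain Python follows. Two simplifications are assumptions of the example: it uses only the $\langle z, z \rangle$ dot-products (the slide also uses $\langle z, v \rangle$ for the older vectors), and `cholesky_from_gram` and the three test vectors are hypothetical.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def cholesky_from_gram(gram):
    """Upper-triangular G with G^T G = gram, built one column at a time."""
    k = len(gram)
    G = [[0.0] * k for _ in range(k)]
    for c in range(k):                       # column c needs <z_c, z_j>, j <= c
        for j in range(c):
            s = sum(G[m][j] * G[m][c] for m in range(j))
            G[j][c] = (gram[j][c] - s) / G[j][j]
        d = gram[c][c] - sum(G[m][c] ** 2 for m in range(c))
        if d <= 0.0:
            raise ArithmeticError("breakdown: basis is numerically dependent")
        G[c][c] = math.sqrt(d)
    return G

# Hypothetical, linearly independent basis vectors z_0, z_1, z_2
Z = [[1.0, 1.0, 0.0, 0.0],
     [1.0, 2.0, 1.0, 0.0],
     [0.0, 1.0, 3.0, 1.0]]
gram = [[dot(zi, zj) for zj in Z] for zi in Z]   # the only global reductions
G = cholesky_from_gram(gram)

# Recover the orthonormal vectors: v_c = (z_c - sum_{j<c} g_{j,c} v_j) / g_{c,c}
V = []
for c, zc in enumerate(Z):
    v = list(zc)
    for j in range(c):
        v = [a - G[j][c] * b for a, b in zip(v, V[j])]
    V.append([a / G[c][c] for a in v])
```

Only the dot-products in `gram` need global communication; `G` and the update of `V` are local work.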

21 Extending the Arnoldi recurrence
Using the Arnoldi recurrence relation $A V_k = V_{k+1} H_{k+1,k}$ and the QR factorization $Z_k = V_k G_k$, do an incremental update of Arnoldi's Hessenberg matrix:
$$\begin{aligned}
H_{k+1,k} &= V_{k+1}^T A V_k = V_{k+1}^T A Z_k G_k^{-1} = V_{k+1}^T Z_{k+1} B_{k+1,k} G_k^{-1} = G_{k+1} B_{k+1,k} G_k^{-1} = \ldots \\
&= \begin{pmatrix} H_{k,k-1} & \left( G_k b_{:,k} + g_{:,k+1} b_{k+1,k} - H_{k,k-1} g_{:,k} \right) g_{k,k}^{-1} \\ 0 & g_{k+1,k+1} b_{k+1,k} g_{k,k}^{-1} \end{pmatrix}
\end{aligned}$$

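22 p(l)-GMRES
    $r_0 \leftarrow b - A x_0$; $v_0 \leftarrow r_0 / \|r_0\|_2$; $z_0 \leftarrow v_0$
    for $i = 0, \ldots, m + l$ do
        $w \leftarrow (A - \sigma_i I) z_i$ if $i < l$, $w \leftarrow A z_i$ if $i \geq l$
        $a \leftarrow i - l$
        if $a \geq 0$ then
            $g_{j,a+1} \leftarrow \left( g_{j,a+1} - \sum_{k=0}^{j-1} g_{k,j} g_{k,a+1} \right) / g_{j,j}$, $j = a-l+2, \ldots, a$
            $g_{a+1,a+1} \leftarrow \sqrt{g_{a+1,a+1} - \sum_{k=0}^{a} g_{k,a+1}^2}$
            # Check for breakdown and restart or re-orthogonalize if necessary
            if $a < l$ then
                $h_{j,a} \leftarrow \left( g_{j,a+1} + \sigma_a g_{j,a} - \sum_{k=0}^{a-1} h_{j,k} g_{k,a} \right) / g_{a,a}$, $j = 0, \ldots, a$
                $h_{a+1,a} \leftarrow g_{a+1,a+1} / g_{a,a}$
            else
                $h_{j,a} \leftarrow \left( \sum_{k=0}^{a+1-l} g_{j,k+l} h_{k,a-l} - \sum_{k=j-1}^{a-1} h_{j,k} g_{k,a} \right) / g_{a,a}$, $j = 0, \ldots, a$
                $h_{a+1,a} \leftarrow g_{a+1,a+1} h_{a+1-l,a-l} / g_{a,a}$
            end if
            $v_a \leftarrow \left( z_a - \sum_{j=0}^{a-1} g_{j,a} v_j \right) / g_{a,a}$
            $z_{i+1} \leftarrow \left( w - \sum_{j=0}^{a-1} h_{j,a-1} z_{j+l} \right) / h_{a,a-1}$
        end if
        $g_{j,i+1} \leftarrow \langle z_{i+1}, v_j \rangle$ for $j = 0, \ldots, a$; $g_{j,i+1} \leftarrow \langle z_{i+1}, z_j \rangle$ for $j = a+1, \ldots, i+1$
    end for
    $y_m \leftarrow \mathrm{argmin} \, \| H_{m+1,m} y_m - \|r_0\|_2 e_1 \|_2$
    $x \leftarrow x_0 + V_m y_m$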

23 Random lower bidiagonal matrix
- Random lower bidiagonal matrix $A$ with $A_{i,i} = 1 + p_i$ and $A_{i+1,i} = q_i$
- $p_i$ and $q_i$ uniformly distributed on $[0, 1]$
- $\kappa_2(A) \approx 5$
- Unsymmetric, nonnegative, not normal, positive definite and diagonally dominant
[Figure: residual norm $\|r_k\|$ versus iterations, monomial basis ($\sigma_i = 0$), for GMRES c-GS, GMRES m-GS, $l_1$-GMRES, $p_1$-GMRES, p(1)-GMRES, p(2)-GMRES, p(3)-GMRES and p(4)-GMRES]

24 Changing the basis for stability
- Newton basis
  - Take $l$ Ritz values as shifts for p(l)-GMRES
- Chebyshev basis
  - Chebyshev polynomial, minimal over an ellipse surrounding the spectrum
  - Three term recurrence: $z_{i+1} = \frac{\sigma_i}{\sigma_{i+1}} \left[ \frac{2}{c} (dI - A) z_i - \frac{\sigma_{i-1}}{\sigma_i} z_{i-1} \right]$
  - When all eigenvalues are real, the zeros of the Chebyshev polynomial are $\sigma_i = \frac{\lambda_{\min} + \lambda_{\max}}{2} + \frac{\lambda_{\max} - \lambda_{\min}}{2} \cos\left( \frac{(2i+1)\pi}{2l} \right)$, for $i = 0, \ldots, l-1$
- Complex shifts can be avoided for real matrices
  - $z_{j+1} = (A - \Re(\lambda_j) I) z_j$
  - $z_{j+2} = (A - \Re(\lambda_j) I) z_{j+1} + \Im(\lambda_j)^2 z_j \; \left( \equiv (A - \lambda_j I)(A - \bar\lambda_j I) z_j \right)$
- Put shifts in (modified) Leja ordering to maximize their internal distances
M. Hoemmen, PhD thesis (2010)
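For real spectra, the Chebyshev shifts and their ordering can be sketched as follows. The zeros use the standard formula for the degree-$l$ Chebyshev polynomial mapped to $[\lambda_{\min}, \lambda_{\max}]$. The ordering is a plain greedy Leja ordering for real points, not the modified variant for complex conjugate pairs. Both function names are hypothetical.

```python
import math

def chebyshev_shifts(lmin, lmax, l):
    """Zeros of the degree-l Chebyshev polynomial mapped to [lmin, lmax]."""
    mid = 0.5 * (lmax + lmin)
    half = 0.5 * (lmax - lmin)
    return [mid + half * math.cos((2 * i + 1) * math.pi / (2 * l))
            for i in range(l)]

def leja_order(shifts):
    """Greedy Leja ordering: start from the point of largest magnitude, then
    repeatedly pick the point maximizing the product of distances to the
    points already chosen."""
    rest = list(shifts)
    ordered = [max(rest, key=abs)]
    rest.remove(ordered[0])
    while rest:
        nxt = max(rest, key=lambda s: math.prod(abs(s - t) for t in ordered))
        ordered.append(nxt)
        rest.remove(nxt)
    return ordered

# Shifts for a spectrum in [1, 2] and pipeline depth l = 4
shifts = chebyshev_shifts(1.0, 2.0, 4)
ordered = leja_order(shifts)
```

After the first point, the ordering jumps to the opposite end of the interval. That keeps consecutive factors $(A - \sigma_j I)$ from amplifying the same part of the spectrum twice in a row.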

25 Random bidiagonal matrix
[Figure: residual norm $\|r_k\|$ versus iterations, two panels (Newton basis and Chebyshev basis), for GMRES c-GS, GMRES m-GS, $l_1$-GMRES, $p_1$-GMRES, p(1)-GMRES, p(2)-GMRES, p(3)-GMRES and p(4)-GMRES]
- Eigenvalues are randomly distributed between 1 and 2
- Compute $l$ Ritz values (in $[1, 2]$)
- Compute zeros of the degree $l$ Chebyshev polynomial minimal over $[1, 2]$
- Residual for p(l)-GMRES is delayed $l$ steps
- Restart after 30 iterations, re-fill pipeline, again $l$ steps delay

26 Finite difference stencil, PDE900
- Real, unsymmetric matrix
- Finite difference point stencil for a linear elliptic equation with Dirichlet boundary conditions on a regular grid
[Figure: residual norm $\|r_k\|_2$ versus iterations, two panels (monomial basis and Newton basis), for GMRES c-GS, GMRES m-GS, $l_1$-GMRES, $p_1$-GMRES, p(1)-GMRES, p(2)-GMRES, p(3)-GMRES and p(4)-GMRES]
- Longer pipelining depth leads to more frequent restarts

27 Oil reservoir simulation (ORSIRR1)
- Real, unsymmetric matrix with 6858 non-zeros and $\kappa(A) \approx 100$
- Matrix Market, generated from oil reservoir simulation
[Figure: residual norm $\|r_k\|_2$ versus iterations, two panels (monomial basis and Newton basis), for GMRES c-GS, GMRES m-GS, $l_1$-GMRES, $p_1$-GMRES, p(1)-GMRES, p(2)-GMRES, p(3)-GMRES and p(4)-GMRES]

28 Performance model
- In total $N$ unknowns, $P_c$ cores, $P_n$ nodes
- A single message of size $m$ takes $T(m) = t_s + m t_w$, with latency $t_s$ and inverse bandwidth $t_w$
- SpMV communication cost: send 6 faces of a 3D cubic grid, $T_{spmv} = 6 \left( t_s + (N/P_n)^{2/3} t_w \right)$
- All-reduce uses a binomial tree (also for b-cast): $T_{red} = 2 \log_2(P_n) (t_s + m t_w)$
- Flops for dot-product: $2N/P_c + \log_2(P_n)$
- Matrix has $n_z$ non-zeros per row: $2 n_z$ flops per row in an SpMV
- A single flop takes $t_c$; take half of the peak flop rate
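The building blocks of the model translate directly into small functions. The function names and the parameter values in the example calls are assumptions for illustration; they are not the measured machine parameters from the talk.

```python
import math

def t_message(m, ts, tw):
    """Time for one message of m numbers: latency plus size times inverse bandwidth."""
    return ts + m * tw

def t_spmv(N, Pn, ts, tw):
    """Halo exchange for an SpMV: 6 faces of a cubic subdomain per node."""
    return 6.0 * (ts + (N / Pn) ** (2.0 / 3.0) * tw)

def t_allreduce(m, Pn, ts, tw):
    """Binomial-tree reduction followed by a binomial-tree broadcast."""
    return 2.0 * math.log2(Pn) * (ts + m * tw)

def dot_flops(N, Pc, Pn):
    """Local multiply-adds plus one addition per level of the reduction tree."""
    return 2.0 * N / Pc + math.log2(Pn)

# Illustrative values only: 1 microsecond latency, 10 ns per number
ts, tw = 1e-6, 1e-8
halo = t_spmv(N=8_000_000, Pn=8000, ts=ts, tw=tw)     # 1000 unknowns per node
reduction = t_allreduce(m=1, Pn=1024, ts=ts, tw=tw)   # tree of height 10
```

With these numbers a one-number all-reduce over 1024 nodes costs about 20 latencies while the halo exchange costs about 6. That is the gap the pipelined variants try to hide.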

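29 Counting flops and messages
[Figure: timeline for p1 to p4 annotated with the cost of each phase: $T_{SpMV}$ and $2 n_z N/P_c \, t_c$ for the SpMV; $2iN/P_c \, t_c$, $2(i+1)N/P_c \, t_c$ and $N/P_c \, t_c$ for the local vector operations; $\log_2(P_n)(t_s + i t_w + i t_c)$ for the reduction and for the broadcast]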

33 Cost functions for iteration $i$
- GMRES classical Gram-Schmidt
  - $T_{calc} = t_c \left( (2n_z + 4i + 3) N/P_c + (i+1) \log_2(P_n) \right)$
  - $T_{red} = \underbrace{2 \log_2(P_n)(t_s + i t_w)}_{\text{orthogonalization}} + \underbrace{2 \log_2(P_n)(t_s + t_w)}_{\text{normalization}}$
- GMRES modified Gram-Schmidt
  - $T_{calc} = t_c \left( (2n_z + 4i + 3) N/P_c + (i+1) \log_2(P_n) \right)$
  - $T_{red} = 2(i+1) \log_2(P_n)(t_s + t_w)$
- $l_1$-GMRES
  - $T_{calc} = t_c \left( (2n_z + 4i + 4) N/P_c + (i+1) \log_2(P_n) \right)$
  - $T_{red} = 2 \log_2(P_n)(t_s + (i+1) t_w)$
- p(l)-GMRES
  - $T_{calc} = t_c \left( (2n_z + 6i - 4l + 4) N/P_c + (i+1) \log_2(P_n) \right)$
  - Reduction can be overlapped with other calculations and communications
  - $T_{red} = \underbrace{2 \log_2(P_n)(t_s + (i+1) t_w)}_{\text{total time reduction}} - \min\Big( \underbrace{l \left( t_c (2n_z + 4(i-l)) N/P_c + T_{spmv} \right)}_{\text{local calc and comm for } l \text{ its}}, \; \underbrace{2 \log_2(P_n)(t_s + (i+1) t_w)}_{\text{total time reduction}} \Big)$
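The four cost functions can be compared side by side with a short script. It is a sketch of the model as written on this slide: the function `iteration_cost`, the variant labels and all parameter values are assumptions for illustration, not the measured machine parameters.

```python
import math

def iteration_cost(variant, i, N, Pc, Pn, nz, ts, tw, tc, l=1):
    """Return (T_calc, T_red) for iteration i of the given GMRES variant."""
    lg = math.log2(Pn)
    if variant in ("cgs", "mgs"):
        calc = tc * ((2 * nz + 4 * i + 3) * N / Pc + (i + 1) * lg)
        if variant == "cgs":
            red = 2 * lg * (ts + i * tw) + 2 * lg * (ts + tw)
        else:
            red = 2 * (i + 1) * lg * (ts + tw)
    elif variant == "l1":
        calc = tc * ((2 * nz + 4 * i + 4) * N / Pc + (i + 1) * lg)
        red = 2 * lg * (ts + (i + 1) * tw)
    elif variant == "p":
        calc = tc * ((2 * nz + 6 * i - 4 * l + 4) * N / Pc + (i + 1) * lg)
        total = 2 * lg * (ts + (i + 1) * tw)
        spmv = 6.0 * (ts + (N / Pn) ** (2.0 / 3.0) * tw)
        overlap = l * (tc * (2 * nz + 4 * (i - l)) * N / Pc + spmv)
        red = total - min(overlap, total)    # only the part that is not hidden
    else:
        raise ValueError("unknown variant: " + variant)
    return calc, red

# Illustrative parameters only
params = dict(i=10, N=1_000_000, Pc=4096, Pn=1024, nz=7,
              ts=5e-6, tw=1e-8, tc=1e-9)
red_times = {v: iteration_cost(v, l=2, **params)[1]
             for v in ("mgs", "cgs", "l1", "p")}
```

For these values the exposed reduction time drops at each step from modified Gram-Schmidt to classical Gram-Schmidt to $l_1$-GMRES to p(2)-GMRES. The calculation time of the pipelined variant is slightly higher.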

34 Counting flops and messages
[Figure: timeline for p1 to p4 with $T_{red}$ marked]
For the reduction, take only the part that is not overlapped by other work.

35 Machine parameters
Machine parameters $t_s$ (µs), $t_w$ (µs) and $t_c$ (s) for 5 large scale machines; machine (peak) and processors ($P_c/P_n$):
- CHiC Cluster (11.2 TFlop/s), Opteron 2218 (2x2)
- SGI Altix 4700 (13.1 TFlop/s), Itanium II (2x2)
- Jugene, BG/P, CNK (825.5 TFlop/s), PowerPC 450 (4)
- Intrepid, BG/P, ZeptoOS (458.6 TFlop/s), PowerPC 450 (4)
- Jaguar, Cray XT-4 (1.38 PFlop/s), Opteron 1354 (4)
- Lynx, Intel Exa Leuven (4.3 TFlop/s), Xeon X5660 (2x6)
Notes:
- The model does not consider network topology
- In BlueGene/P there is a reduction network
- Parameters from the LogGOPS model are measured using the tool netgauge
T. Hoefler, T. Schneider and A. Lumsdaine (2010); T. Hoefler, T. Mehlan, A. Lumsdaine and W. Rehm (2007)

36 Predicted scaling up to 200,000 nodes
Prediction of a strong scaling experiment on the XT4 part of Cray Jaguar, for a fixed number of unknowns $N$.
[Figure: iterations/s (x1000) versus nodes (x1000) for GMRES cgs, GMRES mgs, $l_1$-GMRES, $p_1$-GMRES, p(1)-GMRES, p(2)-GMRES and p(3)-GMRES]

37 Predicted timings on 200,000 nodes
[Figure: stacked bars of time per iteration (sec), split into global all-reduce, matrix-vector communication and calculations, for cgs, $l_1$, $p_1$, p(1), p(2) and p(3) on CHiC, Altix, CNK, ZeptoOS, XT4 and Lynx]
- Calculation part grows slightly, due to redundant computations
- Global reductions can be overlapped
- Larger benefits for long latencies and fast processors

39 Discussion
- Non-blocking collective operations
  - Proposed for the MPI-3 standard
  - MPICH2 from svn already has support: MPI_Iallreduce uses MPI_Isend/recv; call MPI_Test or MPI_Wait
  - libnbc, see T. Hoefler and A. Lumsdaine (2006)
  - Call a blocking reduce from a dedicated thread
  - Some network cards support reductions in hardware (e.g. BlueGene/P)
- Preconditioning
  - Pipelining can be combined with preconditioning
  - A preconditioner makes the SpMV more costly, so shorter pipelines suffice
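The "blocking reduce from a dedicated thread" option can be illustrated without MPI by simulating the latency. In the sketch below everything is a stand-in: `blocking_allreduce` and `local_work` are hypothetical functions that only sleep, and the 0.2 s durations are arbitrary. The point is the control flow: start the reduction, do independent work, then wait for the result.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def blocking_allreduce(local_values, latency=0.2):
    """Stand-in for a blocking global reduction: waits, then returns the sum."""
    time.sleep(latency)
    return sum(local_values)

def local_work(duration=0.2):
    """Stand-in for an SpMV that does not depend on the reduction result."""
    time.sleep(duration)
    return "spmv done"

with ThreadPoolExecutor(max_workers=1) as pool:
    start = time.perf_counter()
    pending = pool.submit(blocking_allreduce, [1.0, 2.0, 3.0])  # start reduction
    status = local_work()                  # overlapped with the reduction
    total = pending.result()               # wait for the result, as MPI_Wait would
    elapsed = time.perf_counter() - start
```

Run back to back, the two calls would take about 0.4 s. Overlapped, the elapsed time stays close to the longer of the two.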

41 Discussion
- What if classical Gram-Schmidt is not good enough?
  - Check for orthogonality of the basis and reorthogonalize if necessary
  - Reorthogonalization can also be pipelined
  - Increases pipeline depth?
- Comparison with s-step GMRES
  - s-step GMRES improves arithmetic intensity via dense QR (TSQR) and the matrix-powers kernel
  - s-step GMRES is hard to combine with preconditioning
  - p(l)-GMRES is easy to combine with preconditioning
  - For p(l)-GMRES, a small $l$ will be enough, which is good for stability
  - In s-step GMRES, $s$ should be as large as possible
  - Pipelining avoids bursts of communication
M. Hoemmen, PhD thesis (2010)

43 Conclusions & Outlook
- Conclusions
  - Presented GMRES variations where dot-product latency can be overlapped, at the expense of some redundant flops
  - Studied performance/scaling with a simple analytical model
  - Pipelining can already help to hide expensive global synchronization costs
- Outlook
  - Efficient implementations, benchmarking
  - Pipelining for short term recurrence methods: CG, BiCGStab, ...
  - Less redundant work
  - Fault tolerance?

44 References
- E. Carson, N. Knight and J. Demmel (2011)
- T. Hoefler, T. Schneider and A. Lumsdaine (2010)
- T. Hoefler, T. Mehlan, A. Lumsdaine and W. Rehm (2007)
- M. Hoemmen, PhD thesis (2010)
