B. Reps 1, P. Ghysels 2, O. Schenk 4, K. Meerbergen 3,5, W. Vanroose 1,3. IHP 2015, Paris FRx

Size: px

Start display at page:

Download "B. Reps 1, P. Ghysels 2, O. Schenk 4, K. Meerbergen 3,5, W. Vanroose 1,3. IHP 2015, Paris FRx"

Debra Wilkinson
6 years ago
Views:

1 Communication Avoiding and Hiding in preconditioned Krylov solvers 1 Applied Mathematics, University of Antwerp, Belgium 2 Future Technology Group, Berkeley Lab, USA 3 Intel ExaScience Lab, Belgium 4 USI, Lugano, CH 5 KU Leuven, Belgium B. Reps 1, P. Ghysels 2, O. Schenk 4, K. Meerbergen 3,5, W. Vanroose 1,3 IHP 2015, Paris FRx

2 Overview Arithmetic Intensity in Krylov Arithmetic Intensity in Multigrid Pipelining communication and computation in Krylov

3 GMRES, classical Gram-Schmidt 1: r 0 b Ax 0 2: v 0 r 0 / r 0 2 3: for i = 0,..., m 1 do 4: w Av i 5: for j = 0,..., i do 6: h j,i w, v j 7: end for 8: ṽ i+1 w i j=1 h j,iv j 9: h i+1,i ṽ i : v i+1 ṽ i+1 /h i+1,i 11: {apply Givens rotations to h :,i } 12: end for 13: y m argmin H m+1,m y m r 0 2 e : x x 0 + V m y m Sparse Matrix-Vector product Only communication with neighbors Good scaling Dot-product Global communication Scales as log(p) Scalar vector multiplication, vector-vector addition No communication

4 Part I Arithmetic Intensity and Sparse Matrix vector product

5 Performance of Kernel of dependent SpMV SIMD: SSE AVX mm512 on Xeon Phi. Similar with NEON on ARM. Bandwidth is the bottleneck and will remain even with 3D stacked memory. attainable GFLOP/sec peak memory BW with vectorization peak floating-point 1 1/8 1/4 1/ Arithmetic Intensity FLOP/Byte Arithmetic intensity (Flop/byte) S. Williams, A. Waterman and D. Patterson (2008)

6 Arithmetic intensity of k dependent SpMVs 1 SpMV k SpMVs k SpMVs in place flops 2n nz 2k n nz 2k n nz words moved n nz + 2n kn nz + 2kn n nz + 2n q 2 2 2k J. Demmel, CS on Communication-Avoiding algorithms M. Hoemmen, Communication-avoiding Krylov subspace methods. (2010).

7 Increasing the arithmetic intensity Divide the domain in tiles which fit in the cache Ground surface is loaded in cache and reused k times Redundant work at the tile edges s+1 u s+1 i,j 3 s u s i 1,j u s i,j u s i,j+1 u s i+1,j s 2 1 u s i,j B 0 B P. Ghysels, P. Klosiewicz, and W. Vanroose. Improving the arithmetic intensity of multigrid with the help of polynomial smoothers. Num. Lin. Alg. Appl. 19 (2012):

8 Stencil Compilers. (polytope model) Pluto Automatic loop parallelization, Locality optimizations based on the polyhedral model, add OpenMP pragma s around the outer loop, ivdep and vector always. compilers (guided) auto-vectorization. U. Bondhugula, M. Baskaran, S. Krishnamoorthy, J. Ramanujam, A. Rountev, and P. Sadayappan, Automatic Transformations for Communication-Minimized Parallelization and Locality Optimization in the Polyhedral Model, Int. Conf. Compiler Construction (ETAPS CC), Apr 2008, Budapest, Hungary.

9 Increasing the arithmetic intensity with a stencil compiler (Pluto) naive naive, no-vec pluto pluto, no-vec peak roofline fitted roofline naive naive, no-vec pluto pluto, no-vec peak roofline fitted roofline GFlop/s 64 GFlop/s arithmetic intensity (flop/byte) Dual socket Intel Sandy Bridge, 16 2 threads arithmetic intensity (flop/byte) Intel Xeon Phi, 61 4 threads

10 Krylov methods Classical Krylov K k (A, v) = span{v, Av, A 2 v,..., A k 1 v} the residual and error can then be written as r (k) = P k (A)r (0) (1)

11 Krylov methods Classical Krylov K k (A, v) = span{v, Av, A 2 v,..., A k 1 v} the residual and error can then be written as r (k) = P k (A)r (0) (1) Polynomial Preconditioning P m (A)x = q(a)ax = q(a)b K k (P m (A), v) = span{v, P m (A)v, P m (A) 2 v,..., P m (A) k 1 v} where P m (A) = (A σ 1 )(A σ 2 )... (A σ m ) is a fixed low-order polynomial r (k) = Q k (P m (A))r (0) (2)

12 Incomplete list of the literature on polynomial preconditioning Y. Saad, Iterative methods for sparse linear systems Chapter 12, SIAM (2003). D. O Leary, Yet another polynomials preconditioner for the Conjugate Gradient Algorithm, Lin. Alg. Appl. (1991) p377 A. van der Ploeg, Preconditioning for sparse matrices with applications, (1994) M. Van Gijzen, A polynomial preconditioner for the GMRES algorithm J. Comp. Appl. Math 59 (1995): A. Basermann, B. Reichel, C. Schelthoff, Preconditioned CG methods for sparse matrices on massively parallel machines, 23 (1997),

13 Convergence of CG with different short Polynomials

14 Time to solution is reduced

15 Recursive calculation of w k+1 = p k+1 (A)v 1: σ 1 = θ/δ 2: ρ 0 = 1/σ 1 3: w 0 = 0, w 1 = 1 θ Av 4: w 1 = w 1 w 0 5: for k=1,... do 6: ρ k = 1/(2σ 1 ρ k 1 ) 7: w k+1 = ρ k [ 2 δ A(v w k) + ρ k 1 w k ] 8: w k+1 = w k + w k+1 9: end for Mangled with Pluto to raise arithmetic intensity.

16 Average cost for each matvec averaged time for each matvec total matvec dot-product axpy order of polynomial 2D finite difference poisson problem on i7-2860qm

17 attainable GFLOP/sec peak memory BW with vectorization peak floating-point 1 1/8 1/4 1/ Arithmetic Intensity FLOP/Byte Arithmetic intensity (Flop/byte) S. Williams, A. Waterman and D. Patterson (2008)

18 Relying on compiler autovectorization averaged time for each matvec total matvec dot-product axpy total SIMD matvec SIMD order of polynomial Intel i7-2860qm

19 Part II Arithmetic Intensity in Multigrid

20 Arithmetic Intensity Multigrid where MG = (I I2h h 2h (...)Ih A)S ν 1, (3) S is the smoother, for example ω-jacobi where S = I ωd 1 A, The interpolation I2h h and restriction I 2h h. These are all low arithmetic intensity.

21 Multigrid Cost Model Increasing the number of smoothing steps per level 1 60 convergence rate # V-cycles smoothing steps smoothing steps MG iterations = log(10 8 )/ log(ρ(v-cycle)) where ρ(v-cycle) is the spectral radius of the V-cycle, can be rounded to higher integer

22 Work Unit Cost model Work Unit cost model ignores memory bandwidth 1 WU = smoother cost = O(n) (9ν + 19)( ) log(tol) log(ρ) WU (9ν + 19) 4 log(tol) 3 log(ρ) WU # V-cycles work units smoothing steps smoothing steps

23 avg perf (Gflop/s) peak BW, peak Flops no SIMD stream BW # smoothing steps Attainable GFlop/sec avg cost (s/gflop) peak BW, peak Flops no SIMD stream BW # smoothing steps Average cost in sec/gflop ν 1 ω-jac = flops (ν 1 ω-jac) roof (q(ν 1 ω-jac)) ν 1flops (ω-jac) roof (ν 1 q 1 (ω-jac)) (4)

24 Roofline-based vs naive cost model roofline model naive model work units # smoothing steps

25 Timings for 2D (Sandy Bridge) time (s) naive pluto naive model roofline model smoothing steps P. Ghysels and W. Vanroose, Modelling the performance of geometric multigrid on many-core computer architectures, Exascience.com techreport Sept, 2013,

26 3d results (Sandy Bridge)

27 Current Krylov Solvers Krylov Solver User Matvec User provides a matvec routine and selects Krylov method at command line. No opportunities to increase arithmetic intensity.

28 Current Krylov Solvers Krylov Solver Future Krylov Solvers User provides a stencil routine. User Matvec Stencil compiler increases arithmetic intensity. Krylov Solver User provides a matvec routine and selects Krylov method at command line. No opportunities to increase arithmetic intensity. Stencil Compiler (Pochoir, Pluto,...) W.Vanroose, P. Ghysels, D. Roose and K. Meerbergen, Position paper at DOE Exascale Mathematics workshop User Stencil

29 Part III Pipelining communication and computations

30 GMRES, classical Gram-Schmidt 1: r 0 := b Ax 0 2: v 0 := r 0 / r 0 2 3: for i = 0,..., m 1 do 4: w := Av i 5: for j = 0,..., i do 6: h j,i := w, v j 7: end for 8: ṽ i+1 := w i j=1 h j,iv j 9: h i+1,i := ṽ i : v i+1 := ṽ i+1 /h i+1,i 11: {apply Givens rotations to h :,i } 12: end for 13: y m := argmin H m+1,m y m r 0 2 e : x := x 0 + V m y m Sparse Matrix-Vector product Only communication with neighbors Good scaling but BW limited. Dot-product Global communication Scales as log(p) Scalar vector multiplication, vector-vector addition No communication

31 GMRES vs Pipelined GMRES iteration on 4 nodes Classical GMRES p4 p3 + + p2 p SpMV local dot reduction b-cast Gram-Schmidt axpy norm reduction b-castscale H update Normalization Pipelined GMRES p4 p3 + p2 p1 + + reduction bcastscale axpy H update correction local dot P. Ghysels, T. Ashby, K. Meerbergen and W. Vanroose Hiding global communication latency in the GMRES algorithm on massively parallel machines. SIAM J. Scientific Computing, 35(1):C48C71, (2013).

32 Better Scaling x1000 iterations/s GMRES cgs GMRES mgs l 1 -GMRES p 1 -GMRES p(1)-gmres p(2)-gmres p(3)-gmres x1000 nodes Prediction of a strong scaling experiment for GMRES on XT4 part of Cray Jaguar.

33 Are there similar opportunities in preconditioned Conjugate Gradients? Preconditioned Conjugate Gradient 1: r 0 := b Ax 0 ; u 0 := M 1 r 0 ; p 0 := u 0 2: for i = 0,..., m 1 do 3: s := Ap i 4: α := r i, u i / s, p i 5: x i+1 := x i + αp i 6: r i+1 := r i αs 7: u i+1 := M 1 r i+1 8: β := r i+1, u i+1 / r i, u i 9: p i+1 := u i+1 + βp i 10: end for

34 Chronopoulos/Gear CG Only one global reduction each iteration. 1: r 0 := b Ax 0 ; u 0 := M 1 r 0 ; w 0 := Au 0 2: α 0 := r 0, u 0 / w 0, u 0 ; β := 0; γ 0 := r 0, u 0 3: for i = 0,..., m 1 do 4: p i := u i + β i p i 1 5: s i := w i + β i s i 1 6: x i+1 := x i + αp i 7: r i+1 := r i αs i 8: u i+1 := M 1 r i+1 9: w i+1 := Au i+1 10: γ i+1 := r i+1, u i+1 11: δ := w i+1, u i+1 12: β i+1 := γ i+1 /γ i 13: α i+1 := γ i+1 /(δ β i+1 γ i+1 /α i ) 14: end for

35 pipelined Chronopoulos/Gear CG Global reduction overlaps with matrix vector product. 1: r 0 := b Ax 0 ; w 0 := Au 0 2: for i = 0,..., m 1 do 3: γ i := r i, r i 4: δ := w i, r i 5: q i := Aw i 6: if i > 0 then 7: β i := γ i /γ i 1 ; α i := γ i /(δ β i γ i /α i 1 ) 8: else 9: β i := 0; α i := γ i /δ 10: end if 11: z i := q i + β i z i 1 12: s i := w i + β i s i 1 13: p i := r i + β i p i 1 14: x i+1 := x i + α i p i 15: r i+1 := r i α i s i 16: w i+1 := w i α i z i 17: end for

36 Preconditioned pipelined CG 1: r 0 := b Ax 0 ; u 0 := M 1 r 0 ; w 0 := Au 0 2: for i = 0,... do 3: γ i := r i, u i 4: δ := w i, u i 5: m i := M 1 w i 6: n i := Am i 7: if i > 0 then 8: β i := γ i /γ i 1 ; α i := γ i / (δ β i γ i /α i 1 ) 9: else 10: β i := 0; α i := γ i /δ 11: end if 12: z i := n i + β i z i 1 13: q i := m i + β i q i 1 14: s i := w i + β i s i 1 15: p i := u i + β i p i 1 16: x i+1 := x i + α i p i 17: r i+1 := r i α i s i 18: u i+1 := u i α i q i 19: w i+1 := w i α i z i 20: end for

37 Preconditioned pipelined CR 1: r 0 := b Ax 0 ; u 0 := M 1 r 0 ; w 0 := Au 0 2: for i = 0,... do 3: m i := M 1 w i 4: γ i := w i, u i 5: δ := m i, w i 6: n i := Am i 7: if i > 0 then 8: β i := γ i /γ i 1 ; α i := γ i / (δ β i γ i /α i 1 ) 9: else 10: β i := 0; α i := γ i /δ 11: end if 12: z i := n i + β i z i 1 13: q i := m i + β i q i 1 14: p i := u i + β i p i 1 15: x i+1 := x i + α i p i 16: u i+1 := u i α i q i 17: w i+1 := w i α i z i 18: end for

38 Cost Model G := time for a global reduction SpMV := time for a sparse-matrix vector product PC := time for preconditioner application local work such as AXPY is neglected flops time (excl, axpys, dots) #glob syncs memory CG 10 2G + SpMV + PC 2 4 Chro/Gea 12 G + SpMV + PC 1 5 CR 12 2G + SpMV + PC 2 5 pipe-cg 20 max(g, SpMV + PC) 1 9 pipe-cr 16 max(g, SpMV) + PC 1 7 Gropp-CG 14 max(g, SpMV) + max(g, PC) 2 6 P. Ghysels and W. Vanroose, Hiding global synchronization latency in the preconditioned Conjugate Gradient algorithm, Parallel Computing, 40, (2014), Pages

39 Better Scaling Hydrostatic ice sheet flow, Q1 finite elements line search Newton method (rtol = 10 8, atol = ) CG with block Jacobi ICC(0) precond (rtol = 10 5, atol = ) Measured speedup over standard CG for different variations of pipelined CG.

40 MPI trace

41 Pipelined methods in PETSc (from 3.4.2) Krylov methods: KSPPIPECR, KSPPIPECG, KSPGROPPCG, KSPPGMRES Uses MPI-3 non-blocking collectives export MPICH ASYNC PROGRES=1 1:... 2: KSP PCApply(ksp,W,M); /* m Bw */ 3: if (i > 0 && ksp normtype == KSP NORM PRECONDITIONED) 4: VecNormBegin(U,NORM 2,&dp); 5: VecDotBegin(W,U,&gamma); 6: VecDotBegin(M,W,&delta); 7: PetscCommSplitReductionBegin(PetscObjectComm((PetscObject)U)); 8: KSP MatMult(ksp,Amat,M,N); /* n Am */ 9: if (i > 0 && ksp normtype == KSP NORM PRECONDITIONED) 10: VecNormEnd(U,NORM 2,&dp); 11: VecDotEnd(W,U,&gamma); 12: VecDotEnd(M,W,&delta); 13:...

42 Krylov subspace methods with additional global reductions Deflation: Remove a few known annoying eigenvectors. Helmholtz. FETI methods.... Augmenting: adds a subspace to the Krylov subspace, e.g. recycling. Newton-Krylov methods. Numerical Continuation. Coarse Solver in multigrid.... Sn := Kn(A, v) + U (5) where U forms a basis for U. x n = x 0 + V n y n + Uu n

43 Deflation with eigenvectors with smallest eigenvalues Smallest eigenvalues and vectors: A [w 1, w 2,..., w m ] = [λ 1 w 1, λ 2 w 2,..., λ m w m ] + ɛ[θ] where λ 1 < λ 2 < λ 3 <... < λ m <... and Θ 1. W := [w 1, w 2,..., w m ] Correction step e = W (W T AW ) 1 W T r

44 Deflated CG 1: r 1 := b Ax 1 2: x 0 := x 1 + W (W T AW ) 1 W T r 1 3: r 0 := b Ax 0 4: p 0 := r 0 W (W T AW ) 1 W T Ar 0 5: for i = 0,... do 6: s := Ap i 7: α i := r i, r i / s, p i 8: x i+1 := x i + α i p i 9: r i+1 := r i α i s 10: β i := r i+1, r i+1 / r i, r i 11: w := Ar i+1 12: σ := W, w 13: p i+1 := r i+1 + β i p i W (W T AW ) 1 σ 14: end for x 0 such that W T r 0 = 0 with r 0 = b Ax 0 (cfr init-cg) σ = W, w = W, Ar i+1 = AW, r i+1 : store AW? CG on H T AH x = H T b with H = I W (W T AW ) 1 (AW ) T

45 Pipelined Deflation CG

46 Selective Deflation

50 Conclusions Two communication bottlenecks: limited BW in Av. No benefit from SIMD. Latencies in (u, v). Poor strong scaling.

51 Conclusions Two communication bottlenecks: limited BW in Av. No benefit from SIMD. Latencies in (u, v). Poor strong scaling. Replace Av with a P m (A)v with a fixed coefficients. Execute with a stencil compiler. Combine with Multigrid. Many open question for unstructured matrices.

52 Conclusions Two communication bottlenecks: limited BW in Av. No benefit from SIMD. Latencies in (u, v). Poor strong scaling. Replace Av with a P m (A)v with a fixed coefficients. Execute with a stencil compiler. Combine with Multigrid. Many open question for unstructured matrices. Pipelining of Krylov algorithms reduce the number of global reductions and hide their latencies. They can scale as the Sparse Matrix-Vector product. The pipelined GMRES and CG code is in PETSc-current. Extension to Deflated and Augmented Krylov methods. Algorithms with additional global reductions.

53 Conclusions Two communication bottlenecks: limited BW in Av. No benefit from SIMD. Latencies in (u, v). Poor strong scaling. Replace Av with a P m (A)v with a fixed coefficients. Execute with a stencil compiler. Combine with Multigrid. Many open question for unstructured matrices. Pipelining of Krylov algorithms reduce the number of global reductions and hide their latencies. They can scale as the Sparse Matrix-Vector product. The pipelined GMRES and CG code is in PETSc-current. Extension to Deflated and Augmented Krylov methods. Algorithms with additional global reductions. Acknowledgements: EU s Seventh Framework Programme (FP7/ ) under grant agreement n Intel and the Institute for the Promotion of Innovation through Science and Technology in Flanders (IWT).

EXASCALE COMPUTING. Hiding global synchronization latency in the preconditioned Conjugate Gradient algorithm. P. Ghysels W.

EXASCALE COMPUTING. Hiding global synchronization latency in the preconditioned Conjugate Gradient algorithm. P. Ghysels W. ExaScience Lab Intel Labs Europe EXASCALE COMPUTING Hiding global synchronization latency in the preconditioned Conjugate Gradient algorithm P. Ghysels W. Vanroose Report 12.2012.1, December 2012 This