MS 28: Scalable Communication-Avoiding and -Hiding Krylov Subspace Methods I
1 MS 28: Scalable Communication-Avoiding and -Hiding Krylov Subspace Methods I
Organizers: Siegfried Cools, University of Antwerp, Belgium; Erin C. Carson, New York University, USA
10:50-11:10 High Performance Variants of Krylov Subspace Methods, Erin C. Carson, New York University, USA
11:15-11:35 About Parallel Variants of GMRES Algorithm, Jocelyne Erhel, Inria-Rennes, France
11:40-12:00 Enlarged GMRES for Reducing Communication, Olivier Tissot, Inria, France
2 MS 40: Scalable Communication-Avoiding and -Hiding Krylov Subspace Methods II
Organizers: Siegfried Cools, University of Antwerp, Belgium; Erin C. Carson, New York University, USA
2:40-3:00 Impact of Noise Models on Pipelined Krylov Methods, Hannah Morgan, University of Chicago, USA
3:05-3:25 Scalable Krylov Methods for Spectral Graph Partitioning, Pieter Ghysels, Lawrence Berkeley National Laboratory, USA
3:30-3:50 Using Non-Blocking Communication to Achieve Scalability for Preconditioned Conjugate Gradient Methods, William D. Gropp, University of Illinois at Urbana-Champaign, USA
3:55-4:15 Performance of S-Step and Pipelined Krylov Methods, Piotr Luszczek, University of Tennessee, Knoxville, USA
3 High Performance Variants of Krylov Subspace Methods
Erin Carson, New York University
SIAM PP18, Tokyo, Japan, March 8, 2018
4 Collaborators
Emmanuel Agullo, Inria, France
Siegfried Cools, University of Antwerp, Belgium
James Demmel, University of California, Berkeley, USA
Pieter Ghysels, Lawrence Berkeley National Laboratory, USA
Luc Giraud, Inria, France
Miro Rozložník, Czech Academy of Sciences, Czech Republic
Zdeněk Strakoš, Charles University, Czech Republic
Petr Tichý, Czech Academy of Sciences, Czech Republic
Miroslav Tůma, Czech Academy of Sciences, Czech Republic
Wim Vanroose, University of Antwerp, Belgium
Emrullah Fatih Yetkin, Inria, France
5-7 Exascale System Projections

                         Today's Systems   Predicted Exascale Systems*   Factor Improvement
System Peak              10^16 flops/s     10^18 flops/s                 100
Node Memory Bandwidth    10^2 GB/s         10^3 GB/s                     10
Interconnect Bandwidth   10^1 GB/s         10^2 GB/s                     10
Memory Latency           10^-7 s           5*10^-8 s                     2
Interconnect Latency     10^-6 s           5*10^-7 s                     2

*Sources: P. Beckman (ANL), J. Shalf (LBL), and D. Unat (LBL)

Movement of data (communication) is much more expensive than floating point operations (computation), in terms of both time and energy. Reducing time spent moving data/waiting for data will be essential for applications at exascale!
Two complementary strategies: communication avoiding & communication hiding.
8-10 Krylov Subspace Methods

Krylov subspace method: projection process onto the Krylov subspace
    K_i(A, r_0) = span{r_0, A r_0, A^2 r_0, ..., A^{i-1} r_0},
where A is an N x N matrix and r_0 is a length-N vector.

In each iteration:
Add a dimension to the Krylov subspace; this forms a nested sequence of Krylov subspaces
    K_1(A, r_0) ⊆ K_2(A, r_0) ⊆ ... ⊆ K_i(A, r_0)
Orthogonalize (with respect to some C_i)
Linear systems: select the approximate solution x_i ∈ x_0 + K_i(A, r_0) using r_i = b - A x_i ⊥ C_i

Conjugate gradient method: A is symmetric positive definite and C_i = K_i(A, r_0), so that
    r_i ⊥ K_i(A, r_0),
    ||x - x_i||_A = min_{z ∈ x_0 + K_i(A, r_0)} ||x - z||_A,
    r_N = 0.
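As a concrete illustration (not from the slides), the sketch below builds the monomial spanning set {r_0, A r_0, ..., A^{i-1} r_0} of K_i(A, r_0) for a small synthetic SPD matrix and checks the nesting of the subspaces via their numerical ranks; the test matrix, sizes, and variable names are assumptions added for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, i = 50, 6
M = rng.standard_normal((N, N))
A = M @ M.T + N * np.eye(N)          # synthetic SPD test matrix (assumption)
r0 = rng.standard_normal(N)

# Columns of K span K_j(A, r0); each column is normalized so that the
# rank test below is not distorted by the growth of ||A^j r0||.
K = np.empty((N, i))
K[:, 0] = r0 / np.linalg.norm(r0)
for j in range(1, i):
    K[:, j] = A @ K[:, j - 1]
    K[:, j] /= np.linalg.norm(K[:, j])

# Nested subspaces: the dimension grows by one with each new direction.
for j in range(1, i + 1):
    print(f"dim K_{j}(A, r0) = {np.linalg.matrix_rank(K[:, :j])}")
```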
11-14 Conjugate Gradient on the World's Fastest Computer

Current #1 on the TOP500:
Linpack benchmark (dense Ax = b, direct): 74% efficiency
HPCG benchmark (sparse Ax = b, iterative): 0.4% efficiency
15-22 The Conjugate Gradient (CG) Method

r_0 = b - A x_0,  p_0 = r_0
for i = 1:nmax                                               [iteration loop]
    α_{i-1} = (r_{i-1}^T r_{i-1}) / (p_{i-1}^T A p_{i-1})    [sparse matrix-vector product; inner products]
    x_i = x_{i-1} + α_{i-1} p_{i-1}                          [vector update]
    r_i = r_{i-1} - α_{i-1} A p_{i-1}                        [vector update]
    β_i = (r_i^T r_i) / (r_{i-1}^T r_{i-1})                  [inner products]
    p_i = r_i + β_i p_{i-1}                                  [vector update]
end                                                          [end loop]
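A minimal sketch of the method above in NumPy (dense A for simplicity; a real solver would use a sparse matrix and distributed vectors). The function name, stopping rule, and tolerance are assumptions added for illustration.

```python
import numpy as np

def cg(A, b, x0, nmax, tol=1e-12):
    # Classical (Hestenes-Stiefel) CG, following the slide's recurrences.
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    rr = r @ r                      # r^T r: one inner product
    for i in range(1, nmax + 1):
        Ap = A @ p                  # SpMV
        alpha = rr / (p @ Ap)       # second inner product of the iteration
        x = x + alpha * p           # vector updates
        r = r - alpha * Ap
        rr_new = r @ r
        beta = rr_new / rr
        p = r + beta * p
        rr = rr_new
        if np.sqrt(rr) <= tol * np.linalg.norm(b):
            break
    return x
```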
23-25 Cost Per Iteration

Sparse matrix-vector multiplication (SpMV):
    O(nnz) flops
    Must communicate vector entries with neighboring processors (nearest-neighbor MPI communication)
Inner products:
    O(N) flops
    Global synchronization (MPI_Allreduce): all processors must exchange data and wait for all communication to finish before proceeding

Low computation/communication ratio: performance is communication-bound.
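To make the communication-bound claim concrete, here is a back-of-the-envelope sketch (not from the slides; all machine parameters and the flop count are assumptions) comparing per-iteration arithmetic time with the latency of CG's two global reductions under strong scaling.

```python
import math

N_global = 1_000_000_000   # total unknowns (assumption)
nnz_per_row = 27           # e.g., a 3D 27-point stencil (assumption)
flop_rate = 1e10           # sustained flops/s per process (assumption)
hop_latency = 1e-6         # per-stage allreduce latency in seconds (assumption)

for p in [64, 4096, 262_144]:
    n_local = N_global // p
    # Rough flops per iteration: SpMV (2*nnz) plus ~10 N for vector
    # updates and inner products.
    t_compute = (2 * nnz_per_row + 10) * n_local / flop_rate
    t_sync = 2 * hop_latency * math.log2(p)   # two allreduces, log(p) depth each
    print(f"p = {p:7d}: compute {t_compute*1e6:9.1f} us/iter, "
          f"sync {t_sync*1e6:5.1f} us/iter")
```

Under these (assumed) numbers the reductions are negligible at 64 processes but dominate the iteration at a quarter-million processes, which is the regime the following slides target.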
26 Reducing Synchronization Cost

Communication cost has motivated many approaches to reducing synchronization cost in Krylov subspace methods:

Hiding communication: pipelined Krylov subspace methods
    Introduce auxiliary vectors to decouple SpMV and inner products
    Enables overlapping of communication and computation
Avoiding communication: s-step Krylov subspace methods
    Compute iterations in blocks of s (using a different Krylov subspace basis)
    Reduces the number of synchronizations per iteration by a factor of O(s)

* Both are equivalent to classical CG in exact arithmetic.
27-35 Pipelined CG (Ghysels and Vanroose 2013)

r_0 = b - A x_0,  p_0 = r_0,  s_0 = A p_0,  w_0 = A r_0,  z_0 = A w_0,  α_0 = (r_0^T r_0) / (p_0^T s_0)
for i = 1:nmax                                               [iteration loop]
    x_i = x_{i-1} + α_{i-1} p_{i-1}                          [vector updates]
    r_i = r_{i-1} - α_{i-1} s_{i-1}
    w_i = w_{i-1} - α_{i-1} z_{i-1}
    q_i = A w_i                                              [(preconditioned) SpMV]  } overlapped
    β_i = (r_i^T r_i) / (r_{i-1}^T r_{i-1})                  [inner products]         }
    α_i = (r_i^T r_i) / (w_i^T r_i - (β_i / α_{i-1}) r_i^T r_i)
    p_i = r_i + β_i p_{i-1}                                  [vector updates]
    s_i = w_i + β_i s_{i-1}
    z_i = q_i + β_i z_{i-1}
end                                                          [end loop]

Uses auxiliary vectors s_i ≡ A p_i, w_i ≡ A r_i, z_i ≡ A^2 r_i.
Removes the sequential dependency between the SpMV and the inner products.
Allows the use of nonblocking (asynchronous) MPI communication to overlap the SpMV with the inner products, hiding the latency of global communications. (See talk by W. Gropp in Part II: MS 40.)
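A sequential NumPy sketch of the algorithm above. It is only illustrative: in a distributed implementation the SpMV q = A w would be overlapped with a nonblocking MPI_Iallreduce for the inner products, whereas here everything runs in order. The function name, stopping rule, and tolerance are assumptions.

```python
import numpy as np

def pipelined_cg(A, b, x0, nmax, tol=1e-12):
    # Pipelined CG (Ghysels and Vanroose 2013), sequential sketch.
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    s = A @ p
    w = A @ r
    z = A @ w
    rr = r @ r
    alpha = rr / (p @ s)
    for i in range(1, nmax + 1):
        x = x + alpha * p
        r = r - alpha * s
        w = w - alpha * z
        q = A @ w            # SpMV: overlapped with the reductions in parallel
        rr_new = r @ r       # inner products (one allreduce in parallel)
        wr = w @ r
        beta = rr_new / rr
        alpha = rr_new / (wr - (beta / alpha) * rr_new)
        p = r + beta * p
        s = w + beta * s
        z = q + beta * z
        rr = rr_new
        if np.sqrt(rr) <= tol * np.linalg.norm(b):
            break
    return x
```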
36 Overview of Pipelined KSMs

Pipelined GMRES (Ghysels et al. 2013)
    Deep pipelines: compute l new Krylov basis vectors during global communication; orthogonalize after l iterations
    (Talk by W. Vanroose, IP7, Sat March 10)
Pipelined CG (Ghysels et al. 2013)
    With deep pipelines (Cornelis et al. 2018)
Pipelined BiCGSTAB (Cools et al. 2017)
Probabilistic performance modeling of pipelined KSMs
    (Talk by H. Morgan, Part II: MS 40)
37-44 s-step CG

r_0 = b - A x_0,  p_0 = r_0
for k = 0:nmax/s                                             [outer loop]
    Compute Y_k and B_k such that A Y_k = Y_k B_k and
        span(Y_k) = K_{s+1}(A, p_{sk}) + K_s(A, r_{sk})       [compute basis: O(s) SpMVs]
    G_k = Y_k^T Y_k                                           [O(s^2) inner products: one synchronization]
    x'_0 = 0,  r'_0 = e_{s+2},  p'_0 = e_1
    for j = 1:s                                               [inner loop, s times]
        α_{sk+j-1} = (r'_{j-1}^T G_k r'_{j-1}) / (p'_{j-1}^T G_k B_k p'_{j-1})
        x'_j = x'_{j-1} + α_{sk+j-1} p'_{j-1}                 [local vector updates (no communication)]
        r'_j = r'_{j-1} - α_{sk+j-1} B_k p'_{j-1}
        β_{sk+j} = (r'_j^T G_k r'_j) / (r'_{j-1}^T G_k r'_{j-1})
        p'_j = r'_j + β_{sk+j} p'_{j-1}
    end                                                       [end inner loop]
    [x_{s(k+1)} - x_{sk}, r_{s(k+1)}, p_{s(k+1)}] = Y_k [x'_s, r'_s, p'_s]
end                                                           [end outer loop]

Block iterations into groups of s:
    Construct the basis matrix Y_k to expand the Krylov subspace s dimensions at once; same latency cost as one SpMV (under assumptions on sparsity)
    One global synchronization to compute the inner products between basis vectors (G_k)
    Update the coordinates x'_j, r'_j, p'_j of the iteration vectors in the constructed basis; this requires no communication
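A dense NumPy sketch of the outer/inner-loop structure above, using the monomial basis Y_k = [p, Ap, ..., A^s p, r, Ar, ..., A^{s-1} r]. This is an illustrative toy (function name, dense storage, and the fixed outer-loop count are assumptions), not a communication-avoiding implementation; it is equivalent to classical CG in exact arithmetic, and the monomial basis degrades quickly for larger s.

```python
import numpy as np

def s_step_cg(A, b, x0, s, n_outer):
    n = len(b)
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    m = 2 * s + 1
    # B encodes multiplication by A on the basis: A Y[:, j] = Y[:, j+1]
    # within each of the two monomial sub-blocks.
    B = np.zeros((m, m))
    for j in range(s):                 # p-block occupies columns 0..s
        B[j + 1, j] = 1.0
    for j in range(s + 1, 2 * s):      # r-block occupies columns s+1..2s
        B[j + 1, j] = 1.0
    for k in range(n_outer):
        # Build the basis: O(s) SpMVs (one round of neighbor
        # communication under suitable sparsity assumptions).
        Y = np.zeros((n, m))
        Y[:, 0] = p
        for j in range(s):
            Y[:, j + 1] = A @ Y[:, j]
        Y[:, s + 1] = r
        for j in range(s + 1, 2 * s):
            Y[:, j + 1] = A @ Y[:, j]
        G = Y.T @ Y                    # Gram matrix: one global synchronization
        # Coordinates of x - x_{sk}, r, and p in the basis Y.
        xc = np.zeros(m)
        rc = np.zeros(m); rc[s + 1] = 1.0
        pc = np.zeros(m); pc[0] = 1.0
        for j in range(s):             # s iterations with no communication
            Bp = B @ pc
            alpha = (rc @ G @ rc) / (pc @ G @ Bp)
            xc = xc + alpha * pc
            rc_new = rc - alpha * Bp
            beta = (rc_new @ G @ rc_new) / (rc @ G @ rc)
            pc = rc_new + beta * pc
            rc = rc_new
        # Map coordinates back to long vectors.
        x = x + Y @ xc
        r = Y @ rc
        p = Y @ pc
    return x
```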
45 Overview of s-step KSMs

s-step CG/Lanczos: (Van Rosendale, 1983), (Chronopoulos and Gear, 1989), (Leland, 1989), (Toledo, 1995), (Hoemmen et al., 2010)
s-step GMRES/Arnoldi: (Walker, 1988), (Chronopoulos and Kim, 1990), (Bai, Hu, and Reichel, 1991), (de Sturler, 1991), (Joubert and Carey, 1992), (Erhel, 1995), (Hoemmen et al., 2010)
s-step BiCGSTAB: (C. et al., 2012)
s-step QMR: (Feuerriegel and Bücker, 2013)
s-step LSQR: (C., 2015)
Many others...

Recent work:
Hybrid pipelined s-step methods (Yamazaki et al., 2017); talk by P. Luszczek in Part II, MS 40
Improving convergence rate and scalability in preconditioned s-step GMRES methods; talk by J. Erhel in MS 28 (this session)
46-49 The Effect of Finite Precision

Well-known that roundoff error has two effects:
1. Delay of convergence
    No longer have an exact Krylov subspace
    The basis can become numerically rank deficient
    Residuals no longer orthogonal: minimization no longer exact!
2. Loss of attainable accuracy
    Rounding errors cause the true residual b - A x_i and the updated residual r_i to deviate!

[Figure: CG convergence on A = bcsstk03 from the UFSMC, N = 112, κ(A) ≈ 7e6; b has equal components in the eigenbasis of A and ||b|| = 1]

Much work exists on these results for CG; see Meurant and Strakoš (2006) for a thorough summary of early developments in the finite precision analysis of Lanczos and CG.
50-54 Optimizing High Performance Iterative Solvers

Synchronization-reducing variants are designed to reduce the time per iteration. But this is not the whole story! What we really want to minimize is the runtime, subject to some constraint on accuracy.

Changes to how the recurrences are computed can exacerbate the finite precision effects of convergence delay and loss of accuracy. It is crucial that we understand and take into account how algorithm modifications will affect the convergence rate and attainable accuracy!

[Figure: convergence comparison on A = bcsstk03 from the UFSMC, N = 112, κ(A) ≈ 7e6; b has equal components in the eigenbasis of A and ||b|| = 1]
55-58 Maximum Attainable Accuracy

Accuracy depends on the size of the true residual: ||b - A x̂_i||.
Rounding errors cause the true residual, b - A x̂_i, and the updated residual, r̂_i, to deviate.
Writing b - A x̂_i = r̂_i + (b - A x̂_i - r̂_i), we have
    ||b - A x̂_i|| ≤ ||r̂_i|| + ||b - A x̂_i - r̂_i||.
As r̂_i → 0, ||b - A x̂_i|| depends on ||b - A x̂_i - r̂_i||.

Many results on bounding attainable accuracy, e.g.: Greenbaum (1989, 1994, 1997), Sleijpen, van der Vorst and Fokkema (1994), Sleijpen, van der Vorst and Modersitzki (2001), Björck, Elfving and Strakoš (1998), and Gutknecht and Strakoš (2000).
59-65 Maximum Attainable Accuracy of HSCG

In finite precision HSCG, the iterates are updated by
    x̂_i = x̂_{i-1} + α̂_{i-1} p̂_{i-1} - δx_i   and   r̂_i = r̂_{i-1} - α̂_{i-1} A p̂_{i-1} - δr_i.
Let f_i ≡ b - A x̂_i - r̂_i. Then
    f_i = b - A(x̂_{i-1} + α̂_{i-1} p̂_{i-1} - δx_i) - (r̂_{i-1} - α̂_{i-1} A p̂_{i-1} - δr_i)
        = f_{i-1} + A δx_i + δr_i
        = f_0 + Σ_{m=1}^{i} (A δx_m + δr_m),
i.e., an accumulation of local rounding errors.

Bounds on the attainable accuracy:
    ||f_i|| ≤ O(ε) Σ_{m=0}^{i} (N_A ||A|| ||x̂_m|| + ||r̂_m||)       (van der Vorst and Ye, 2000)
    ||f_i|| ≤ O(ε) ||A|| (||x|| + max_{m=0,...,i} ||x̂_m||)          (Greenbaum, 1997)
    ||f_i|| ≤ O(ε) N_A ||A|| ||A^{-1}|| Σ_{m=0}^{i} ||r̂_m||         (Sleijpen and van der Vorst)
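A small experiment (not from the slides; the synthetic test matrix, sizes, and print schedule are assumptions) that instruments classical CG to track f_i = (b - A x_i) - r_i, showing the updated residual continuing to shrink while the gap, and hence the true residual, stagnates near the attainable-accuracy level.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(np.logspace(0, 8, n)) @ Q.T   # SPD, cond ~ 1e8 (assumption)
b = Q @ np.ones(n) / np.sqrt(n)               # equal components in eigenbasis, ||b|| = 1

x = np.zeros(n)
r = b - A @ x
p = r.copy()
rr = r @ r
for i in range(1, 201):
    Ap = A @ p
    alpha = rr / (p @ Ap)
    x = x + alpha * p
    r = r - alpha * Ap
    rr_new = r @ r
    if i % 25 == 0:
        gap = np.linalg.norm((b - A @ x) - r)   # f_i: true minus updated residual
        print(f"i={i:3d}  ||r_i||={np.sqrt(rr_new):.2e}  ||f_i||={gap:.2e}")
    beta = rr_new / rr
    p = r + beta * p
    rr = rr_new
```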
66-71 Attainable Accuracy of Pipelined CG

Pipelined CG updates x_i and r_i via:
    x_i = x_{i-1} + α_{i-1} p_{i-1},   r_i = r_{i-1} - α_{i-1} s_{i-1}.
In finite precision:
    x̂_i = x̂_{i-1} + α̂_{i-1} p̂_{i-1} + δx_i,
    r̂_i = r̂_{i-1} - α̂_{i-1} ŝ_{i-1} + δr_i.
Then, with f_i ≡ r̂_i - (b - A x̂_i),
    f_i = f_{i-1} - α̂_{i-1} (ŝ_{i-1} - A p̂_{i-1}) + δr_i + A δx_i,
so
    f_i = f_0 + Σ_{m=1}^{i} (δr_m + A δx_m) - G_i d_i,
where G_i = Ŝ_i - A P̂_i and d_i = [α̂_0, ..., α̂_{i-1}]^T.

Amplification of local rounding errors is possible, depending on the α_i's and β_i's. See recent work: (Cools et al., 2017), (Carson et al., 2017).
72-73 Numerical Example

[Figure: convergence comparison on A = bcsstk03 from the UFSMC, N = 112, κ(A) ≈ 7e6; b has equal components in the eigenbasis of A and ||b|| = 1]
[Figure: the same comparison on A = nos4 from the UFSMC, N = 100, κ(A) ≈ 2e3]
74-76 Attainable Accuracy of s-step CG

f_i ≡ b - A x̂_i - r̂_i

For CG:
    ||f_i|| ≤ ||f_0|| + ε Σ_{m=1}^{i} (1 + N) (||A|| ||x̂_m|| + ||r̂_m||)
For s-step CG, with i = sk + j (see C., 2015):
    ||f_{sk+j}|| ≤ ||f_0|| + ε c Γ Σ_{m=1}^{sk+j} (1 + N) (||A|| ||x̂_m|| + ||r̂_m||),
where c is a low-degree polynomial in s, and Γ = max_{l≤k} ||Ŷ_l^+|| ||Ŷ_l||.

Amplification of local rounding errors is possible, depending on the conditioning of the basis.
77-80 Numerical Example

s-step CG with the monomial basis Y = [p_i, A p_i, ..., A^s p_i, r_i, A r_i, ..., A^{s-1} r_i]:

[Figure: convergence of s-step CG for increasing s on A = bcsstk03 from the UFSMC, N = 112, κ(A) ≈ 7e6; b has equal components in the eigenbasis of A and ||b|| = 1]

* Other, more well-conditioned bases can also be used to improve the convergence rate and accuracy (see, e.g., Philippe and Reichel, 2012).

81 Numerical Example

[Figure: the same experiment on A = nos4 from the UFSMC, N = 100, κ(A) ≈ 2e3]
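To see why the basis-conditioning factor Γ matters, the sketch below (a synthetic diagonal test matrix; all values are assumptions, not the slides' test problems) measures the condition number of the normalized monomial basis [v, Av, ..., A^s v] as s grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
A = np.diag(np.linspace(1.0, 1e3, n))   # SPD diagonal test matrix (assumption)
v = rng.standard_normal(n)

for s in [2, 4, 8, 16]:
    Y = np.empty((n, s + 1))
    Y[:, 0] = v / np.linalg.norm(v)
    for j in range(s):
        Y[:, j + 1] = A @ Y[:, j]
        # Normalize so that cond(Y) reflects loss of linear independence,
        # not just the growth of ||A^j v||.
        Y[:, j + 1] /= np.linalg.norm(Y[:, j + 1])
    print(f"s = {s:2d}:  cond(Y) = {np.linalg.cond(Y):.2e}")
```

The condition number grows rapidly with s, which is exactly the mechanism behind the delayed convergence and accuracy loss seen in the monomial-basis experiments above.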
82-85 Residual Replacement

Idea: improve accuracy by replacing r_i with fl(b - A x̂_i) in certain iterations (van der Vorst and Ye, 2000).

Choose when to replace based on an estimate of f_i ≡ b - A x̂_i - r̂_i:
    Replace often enough that f_i remains small,
    but don't replace when the error in computing fl(b - A x̂_i) would perturb the recurrence and cause convergence delay; see (Strakoš and Tichý, 2002).

This strategy can be adapted for both pipelined KSMs (Cools and Vanroose, 2017) and s-step KSMs (Carson and Demmel, 2014). In both cases, the estimate of f_i can be computed inexpensively, and accuracy improves to a level comparable with the classical method in many cases.
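A heavily simplified sketch of this idea in the spirit of van der Vorst and Ye (2000). The deviation estimate, the replacement condition, and the norm surrogate below are assumptions chosen for brevity; the criteria in the cited papers are more refined (in particular, they use a threshold-crossing condition rather than a simple threshold test).

```python
import numpy as np

def cg_with_residual_replacement(A, b, x0, nmax, eps=np.finfo(float).eps):
    threshold = np.sqrt(eps)             # common choice for the replacement threshold
    normA = np.linalg.norm(A, 1)         # cheap norm surrogate (assumption)
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    rr = r @ r
    # d: running estimate of the deviation ||b - A x_i - r_i||
    d = eps * (np.linalg.norm(b) + normA * np.linalg.norm(x))
    for i in range(nmax):
        Ap = A @ p
        alpha = rr / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rr_new = r @ r
        # Accumulate an estimate of this step's local rounding error.
        d = d + eps * (normA * np.linalg.norm(x) + np.sqrt(rr_new))
        if d > threshold * np.sqrt(rr_new):
            r = b - A @ x                # replace updated residual by true residual
            rr_new = r @ r
            d = eps * (np.linalg.norm(b) + normA * np.linalg.norm(x))
        beta = rr_new / rr
        p = r + beta * p
        rr = rr_new
    return x
```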
86 Scalability of Pipelined CG with Residual Replacement

[Performance figure]
87-90 Conclusions and Takeaways

We need new approaches to reducing synchronization cost in Krylov subspace methods for large-scale problems: pipelined methods, s-step methods, hybrid approaches. But we must also consider finite precision behavior!

Other communication-avoiding and communication-hiding approaches are possible, e.g., enlarged Krylov subspace methods (talk by O. Tissot, MS 28, this session).

The best approach is highly application-dependent (application of pipelined KSMs to graph partitioning: talk by P. Ghysels in Part II: MS 40).

Many interesting open problems and challenges remain as we push toward exascale-level computing!
91 Thank You! math.nyu.edu/~erinc