MS 28: Scalable Communication-Avoiding and -Hiding Krylov Subspace Methods I
1 MS 28: Scalable Communication-Avoiding and -Hiding Krylov Subspace Methods I
Organizers: Siegfried Cools, University of Antwerp, Belgium; Erin C. Carson, New York University, USA
10:50-11:10 High Performance Variants of Krylov Subspace Methods, Erin C. Carson, New York University, USA
11:15-11:35 About Parallel Variants of GMRES Algorithm, Jocelyne Erhel, Inria-Rennes, France
11:40-12:00 Enlarged GMRES for Reducing Communication, Olivier Tissot, Inria, France
2 MS 40: Scalable Communication-Avoiding and -Hiding Krylov Subspace Methods II
Organizers: Siegfried Cools, University of Antwerp, Belgium; Erin C. Carson, New York University, USA
2:40-3:00 Impact of Noise Models on Pipelined Krylov Methods, Hannah Morgan, University of Chicago, USA
3:05-3:25 Scalable Krylov Methods for Spectral Graph Partitioning, Pieter Ghysels, Lawrence Berkeley National Laboratory, USA
3:30-3:50 Using Non-Blocking Communication to Achieve Scalability for Preconditioned Conjugate Gradient Methods, William D. Gropp, University of Illinois at Urbana-Champaign, USA
3:55-4:15 Performance of S-Step and Pipelined Krylov Methods, Piotr Luszczek, University of Tennessee, Knoxville, USA
3 High Performance Variants of Krylov Subspace Methods
Erin Carson, New York University
SIAM PP18, Tokyo, Japan, March 8, 2018
4 Collaborators
Emmanuel Agullo, Inria, France
Siegfried Cools, University of Antwerp, Belgium
James Demmel, University of California, Berkeley, USA
Pieter Ghysels, Lawrence Berkeley National Laboratory, USA
Luc Giraud, Inria, France
Miro Rozložník, Czech Academy of Sciences, Czech Republic
Zdeněk Strakoš, Charles University, Czech Republic
Petr Tichý, Czech Academy of Sciences, Czech Republic
Miroslav Tůma, Czech Academy of Sciences, Czech Republic
Wim Vanroose, University of Antwerp, Belgium
Emrullah Fatih Yetkin, Inria, France
5-7 Exascale System Projections

                         Today's Systems   Predicted Exascale Systems*   Factor Improvement
System Peak              10^16 flops/s     10^18 flops/s                 100
Node Memory Bandwidth    10^2 GB/s         10^3 GB/s                     10
Interconnect Bandwidth   10^1 GB/s         10^2 GB/s                     10
Memory Latency           10^-7 s           5*10^-8 s                     2
Interconnect Latency     10^-6 s           5*10^-7 s                     2

*Sources: P. Beckman (ANL), J. Shalf (LBL), and D. Unat (LBL)

Movement of data (communication) is much more expensive than floating point operations (computation), in terms of both time and energy. Reducing time spent moving data/waiting for data will be essential for applications at exascale!
Two complementary strategies: communication avoiding & communication hiding.
8-10 Krylov Subspace Methods

Krylov subspace method: projection process onto the Krylov subspace
    K_i(A, r_0) = span{r_0, A r_0, A^2 r_0, ..., A^{i-1} r_0},
where A is an N x N matrix and r_0 is a length-N vector.

In each iteration:
Add a dimension to the Krylov subspace; this forms a nested sequence of Krylov subspaces
    K_1(A, r_0) ⊆ K_2(A, r_0) ⊆ ... ⊆ K_i(A, r_0)
Orthogonalize (with respect to some C_i)
Linear systems: select the approximate solution x_i ∈ x_0 + K_i(A, r_0) using r_i = b - A x_i ⊥ C_i

Conjugate gradient method: A is symmetric positive definite and C_i = K_i(A, r_0), so that
    r_i ⊥ K_i(A, r_0),
    ||x - x_i||_A = min_{z ∈ x_0 + K_i(A, r_0)} ||x - z||_A,
    r_N = 0.
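As a concrete illustration (not from the slides), the sketch below builds the monomial spanning set {r_0, A r_0, ..., A^{i-1} r_0} of K_i(A, r_0) for a small synthetic SPD matrix and checks the nesting of the subspaces via their numerical ranks; the test matrix, sizes, and variable names are assumptions added for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, i = 50, 6
M = rng.standard_normal((N, N))
A = M @ M.T + N * np.eye(N)          # synthetic SPD test matrix (assumption)
r0 = rng.standard_normal(N)

# Columns of K span K_j(A, r0); each column is normalized so that the
# rank test below is not distorted by the growth of ||A^j r0||.
K = np.empty((N, i))
K[:, 0] = r0 / np.linalg.norm(r0)
for j in range(1, i):
    K[:, j] = A @ K[:, j - 1]
    K[:, j] /= np.linalg.norm(K[:, j])

# Nested subspaces: the dimension grows by one with each new direction.
for j in range(1, i + 1):
    print(f"dim K_{j}(A, r0) = {np.linalg.matrix_rank(K[:, :j])}")
```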
11-14 Conjugate Gradient on the World's Fastest Computer

Current #1 on the TOP500:
Linpack benchmark (dense Ax = b, direct): 74% efficiency
HPCG benchmark (sparse Ax = b, iterative): 0.4% efficiency
15-22 The Conjugate Gradient (CG) Method

r_0 = b - A x_0,  p_0 = r_0
for i = 1:nmax                                               [iteration loop]
    α_{i-1} = (r_{i-1}^T r_{i-1}) / (p_{i-1}^T A p_{i-1})    [sparse matrix-vector product; inner products]
    x_i = x_{i-1} + α_{i-1} p_{i-1}                          [vector update]
    r_i = r_{i-1} - α_{i-1} A p_{i-1}                        [vector update]
    β_i = (r_i^T r_i) / (r_{i-1}^T r_{i-1})                  [inner products]
    p_i = r_i + β_i p_{i-1}                                  [vector update]
end                                                          [end loop]
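A minimal sketch of the method above in NumPy (dense A for simplicity; a real solver would use a sparse matrix and distributed vectors). The function name, stopping rule, and tolerance are assumptions added for illustration.

```python
import numpy as np

def cg(A, b, x0, nmax, tol=1e-12):
    # Classical (Hestenes-Stiefel) CG, following the slide's recurrences.
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    rr = r @ r                      # r^T r: one inner product
    for i in range(1, nmax + 1):
        Ap = A @ p                  # SpMV
        alpha = rr / (p @ Ap)       # second inner product of the iteration
        x = x + alpha * p           # vector updates
        r = r - alpha * Ap
        rr_new = r @ r
        beta = rr_new / rr
        p = r + beta * p
        rr = rr_new
        if np.sqrt(rr) <= tol * np.linalg.norm(b):
            break
    return x
```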
23-25 Cost Per Iteration

Sparse matrix-vector multiplication (SpMV):
    O(nnz) flops
    Must communicate vector entries with neighboring processors (nearest-neighbor MPI communication)
Inner products:
    O(N) flops
    Global synchronization (MPI_Allreduce): all processors must exchange data and wait for all communication to finish before proceeding

Low computation/communication ratio: performance is communication-bound.
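To make the communication-bound claim concrete, here is a back-of-the-envelope sketch (not from the slides; all machine parameters and the flop count are assumptions) comparing per-iteration arithmetic time with the latency of CG's two global reductions under strong scaling.

```python
import math

N_global = 1_000_000_000   # total unknowns (assumption)
nnz_per_row = 27           # e.g., a 3D 27-point stencil (assumption)
flop_rate = 1e10           # sustained flops/s per process (assumption)
hop_latency = 1e-6         # per-stage allreduce latency in seconds (assumption)

for p in [64, 4096, 262_144]:
    n_local = N_global // p
    # Rough flops per iteration: SpMV (2*nnz) plus ~10 N for vector
    # updates and inner products.
    t_compute = (2 * nnz_per_row + 10) * n_local / flop_rate
    t_sync = 2 * hop_latency * math.log2(p)   # two allreduces, log(p) depth each
    print(f"p = {p:7d}: compute {t_compute*1e6:9.1f} us/iter, "
          f"sync {t_sync*1e6:5.1f} us/iter")
```

Under these (assumed) numbers the reductions are negligible at 64 processes but dominate the iteration at a quarter-million processes, which is the regime the following slides target.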
26 Reducing Synchronization Cost

Communication cost has motivated many approaches to reducing synchronization cost in Krylov subspace methods:

Hiding communication: pipelined Krylov subspace methods
    Introduce auxiliary vectors to decouple SpMV and inner products
    Enables overlapping of communication and computation
Avoiding communication: s-step Krylov subspace methods
    Compute iterations in blocks of s (using a different Krylov subspace basis)
    Reduces the number of synchronizations per iteration by a factor of O(s)

* Both are equivalent to classical CG in exact arithmetic.
27-35 Pipelined CG (Ghysels and Vanroose 2013)

r_0 = b - A x_0,  p_0 = r_0,  s_0 = A p_0,  w_0 = A r_0,  z_0 = A w_0,  α_0 = (r_0^T r_0) / (p_0^T s_0)
for i = 1:nmax                                               [iteration loop]
    x_i = x_{i-1} + α_{i-1} p_{i-1}                          [vector updates]
    r_i = r_{i-1} - α_{i-1} s_{i-1}
    w_i = w_{i-1} - α_{i-1} z_{i-1}
    q_i = A w_i                                              [(preconditioned) SpMV]  } overlapped
    β_i = (r_i^T r_i) / (r_{i-1}^T r_{i-1})                  [inner products]         }
    α_i = (r_i^T r_i) / (w_i^T r_i - (β_i / α_{i-1}) r_i^T r_i)
    p_i = r_i + β_i p_{i-1}                                  [vector updates]
    s_i = w_i + β_i s_{i-1}
    z_i = q_i + β_i z_{i-1}
end                                                          [end loop]

Uses auxiliary vectors s_i ≡ A p_i, w_i ≡ A r_i, z_i ≡ A^2 r_i.
Removes the sequential dependency between the SpMV and the inner products.
Allows the use of nonblocking (asynchronous) MPI communication to overlap the SpMV with the inner products, hiding the latency of global communications. (See talk by W. Gropp in Part II: MS 40.)
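A sequential NumPy sketch of the algorithm above. It is only illustrative: in a distributed implementation the SpMV q = A w would be overlapped with a nonblocking MPI_Iallreduce for the inner products, whereas here everything runs in order. The function name, stopping rule, and tolerance are assumptions.

```python
import numpy as np

def pipelined_cg(A, b, x0, nmax, tol=1e-12):
    # Pipelined CG (Ghysels and Vanroose 2013), sequential sketch.
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    s = A @ p
    w = A @ r
    z = A @ w
    rr = r @ r
    alpha = rr / (p @ s)
    for i in range(1, nmax + 1):
        x = x + alpha * p
        r = r - alpha * s
        w = w - alpha * z
        q = A @ w            # SpMV: overlapped with the reductions in parallel
        rr_new = r @ r       # inner products (one allreduce in parallel)
        wr = w @ r
        beta = rr_new / rr
        alpha = rr_new / (wr - (beta / alpha) * rr_new)
        p = r + beta * p
        s = w + beta * s
        z = q + beta * z
        rr = rr_new
        if np.sqrt(rr) <= tol * np.linalg.norm(b):
            break
    return x
```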
36 Overview of Pipelined KSMs

Pipelined GMRES (Ghysels et al. 2013)
    Deep pipelines: compute l new Krylov basis vectors during global communication; orthogonalize after l iterations
    (Talk by W. Vanroose, IP7, Sat March 10)
Pipelined CG (Ghysels et al. 2013)
    With deep pipelines (Cornelis et al. 2018)
Pipelined BiCGSTAB (Cools et al. 2017)
Probabilistic performance modeling of pipelined KSMs
    (Talk by H. Morgan, Part II: MS 40)
37-44 s-step CG

r_0 = b - A x_0,  p_0 = r_0
for k = 0:nmax/s                                             [outer loop]
    Compute Y_k and B_k such that A Y_k = Y_k B_k and
        span(Y_k) = K_{s+1}(A, p_{sk}) + K_s(A, r_{sk})       [compute basis: O(s) SpMVs]
    G_k = Y_k^T Y_k                                           [O(s^2) inner products: one synchronization]
    x'_0 = 0,  r'_0 = e_{s+2},  p'_0 = e_1
    for j = 1:s                                               [inner loop, s times]
        α_{sk+j-1} = (r'_{j-1}^T G_k r'_{j-1}) / (p'_{j-1}^T G_k B_k p'_{j-1})
        x'_j = x'_{j-1} + α_{sk+j-1} p'_{j-1}                 [local vector updates (no communication)]
        r'_j = r'_{j-1} - α_{sk+j-1} B_k p'_{j-1}
        β_{sk+j} = (r'_j^T G_k r'_j) / (r'_{j-1}^T G_k r'_{j-1})
        p'_j = r'_j + β_{sk+j} p'_{j-1}
    end                                                       [end inner loop]
    [x_{s(k+1)} - x_{sk}, r_{s(k+1)}, p_{s(k+1)}] = Y_k [x'_s, r'_s, p'_s]
end                                                           [end outer loop]

Block iterations into groups of s:
    Construct the basis matrix Y_k to expand the Krylov subspace s dimensions at once; same latency cost as one SpMV (under assumptions on sparsity)
    One global synchronization to compute the inner products between basis vectors (G_k)
    Update the coordinates x'_j, r'_j, p'_j of the iteration vectors in the constructed basis; this requires no communication
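A dense NumPy sketch of the outer/inner-loop structure above, using the monomial basis Y_k = [p, Ap, ..., A^s p, r, Ar, ..., A^{s-1} r]. This is an illustrative toy (function name, dense storage, and the fixed outer-loop count are assumptions), not a communication-avoiding implementation; it is equivalent to classical CG in exact arithmetic, and the monomial basis degrades quickly for larger s.

```python
import numpy as np

def s_step_cg(A, b, x0, s, n_outer):
    n = len(b)
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    m = 2 * s + 1
    # B encodes multiplication by A on the basis: A Y[:, j] = Y[:, j+1]
    # within each of the two monomial sub-blocks.
    B = np.zeros((m, m))
    for j in range(s):                 # p-block occupies columns 0..s
        B[j + 1, j] = 1.0
    for j in range(s + 1, 2 * s):      # r-block occupies columns s+1..2s
        B[j + 1, j] = 1.0
    for k in range(n_outer):
        # Build the basis: O(s) SpMVs (one round of neighbor
        # communication under suitable sparsity assumptions).
        Y = np.zeros((n, m))
        Y[:, 0] = p
        for j in range(s):
            Y[:, j + 1] = A @ Y[:, j]
        Y[:, s + 1] = r
        for j in range(s + 1, 2 * s):
            Y[:, j + 1] = A @ Y[:, j]
        G = Y.T @ Y                    # Gram matrix: one global synchronization
        # Coordinates of x - x_{sk}, r, and p in the basis Y.
        xc = np.zeros(m)
        rc = np.zeros(m); rc[s + 1] = 1.0
        pc = np.zeros(m); pc[0] = 1.0
        for j in range(s):             # s iterations with no communication
            Bp = B @ pc
            alpha = (rc @ G @ rc) / (pc @ G @ Bp)
            xc = xc + alpha * pc
            rc_new = rc - alpha * Bp
            beta = (rc_new @ G @ rc_new) / (rc @ G @ rc)
            pc = rc_new + beta * pc
            rc = rc_new
        # Map coordinates back to long vectors.
        x = x + Y @ xc
        r = Y @ rc
        p = Y @ pc
    return x
```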
45 Overview of s-step KSMs

s-step CG/Lanczos: (Van Rosendale, 1983), (Chronopoulos and Gear, 1989), (Leland, 1989), (Toledo, 1995), (Hoemmen et al., 2010)
s-step GMRES/Arnoldi: (Walker, 1988), (Chronopoulos and Kim, 1990), (Bai, Hu, and Reichel, 1991), (de Sturler, 1991), (Joubert and Carey, 1992), (Erhel, 1995), (Hoemmen et al., 2010)
s-step BiCGSTAB: (C. et al., 2012)
s-step QMR: (Feuerriegel and Bücker, 2013)
s-step LSQR: (C., 2015)
Many others...

Recent work:
Hybrid pipelined s-step methods (Yamazaki et al., 2017); talk by P. Luszczek in Part II, MS 40
Improving convergence rate and scalability in preconditioned s-step GMRES methods; talk by J. Erhel in MS 28 (this session)
46-49 The Effect of Finite Precision

Well-known that roundoff error has two effects:
1. Delay of convergence
    No longer have an exact Krylov subspace
    The basis can become numerically rank deficient
    Residuals no longer orthogonal: minimization no longer exact!
2. Loss of attainable accuracy
    Rounding errors cause the true residual b - A x_i and the updated residual r_i to deviate!

[Figure: CG convergence on A = bcsstk03 from the UFSMC, N = 112, κ(A) ≈ 7e6; b has equal components in the eigenbasis of A and ||b|| = 1]

Much work exists on these results for CG; see Meurant and Strakoš (2006) for a thorough summary of early developments in the finite precision analysis of Lanczos and CG.
50-54 Optimizing High Performance Iterative Solvers

Synchronization-reducing variants are designed to reduce the time per iteration. But this is not the whole story! What we really want to minimize is the runtime, subject to some constraint on accuracy.

Changes to how the recurrences are computed can exacerbate the finite precision effects of convergence delay and loss of accuracy. It is crucial that we understand and take into account how algorithm modifications will affect the convergence rate and attainable accuracy!

[Figure: convergence comparison on A = bcsstk03 from the UFSMC, N = 112, κ(A) ≈ 7e6; b has equal components in the eigenbasis of A and ||b|| = 1]
55-58 Maximum Attainable Accuracy

Accuracy depends on the size of the true residual: ||b - A x̂_i||.
Rounding errors cause the true residual, b - A x̂_i, and the updated residual, r̂_i, to deviate.
Writing b - A x̂_i = r̂_i + (b - A x̂_i - r̂_i), we have
    ||b - A x̂_i|| ≤ ||r̂_i|| + ||b - A x̂_i - r̂_i||.
As r̂_i → 0, ||b - A x̂_i|| depends on ||b - A x̂_i - r̂_i||.

Many results on bounding attainable accuracy, e.g.: Greenbaum (1989, 1994, 1997), Sleijpen, van der Vorst and Fokkema (1994), Sleijpen, van der Vorst and Modersitzki (2001), Björck, Elfving and Strakoš (1998), and Gutknecht and Strakoš (2000).
59-65 Maximum Attainable Accuracy of HSCG

In finite precision HSCG, the iterates are updated by
    x̂_i = x̂_{i-1} + α̂_{i-1} p̂_{i-1} - δx_i   and   r̂_i = r̂_{i-1} - α̂_{i-1} A p̂_{i-1} - δr_i.
Let f_i ≡ b - A x̂_i - r̂_i. Then
    f_i = b - A(x̂_{i-1} + α̂_{i-1} p̂_{i-1} - δx_i) - (r̂_{i-1} - α̂_{i-1} A p̂_{i-1} - δr_i)
        = f_{i-1} + A δx_i + δr_i
        = f_0 + Σ_{m=1}^{i} (A δx_m + δr_m),
i.e., an accumulation of local rounding errors.

Bounds on the attainable accuracy:
    ||f_i|| ≤ O(ε) Σ_{m=0}^{i} (N_A ||A|| ||x̂_m|| + ||r̂_m||)       (van der Vorst and Ye, 2000)
    ||f_i|| ≤ O(ε) ||A|| (||x|| + max_{m=0,...,i} ||x̂_m||)          (Greenbaum, 1997)
    ||f_i|| ≤ O(ε) N_A ||A|| ||A^{-1}|| Σ_{m=0}^{i} ||r̂_m||         (Sleijpen and van der Vorst)
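A small experiment (not from the slides; the synthetic test matrix, sizes, and print schedule are assumptions) that instruments classical CG to track f_i = (b - A x_i) - r_i, showing the updated residual continuing to shrink while the gap, and hence the true residual, stagnates near the attainable-accuracy level.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(np.logspace(0, 8, n)) @ Q.T   # SPD, cond ~ 1e8 (assumption)
b = Q @ np.ones(n) / np.sqrt(n)               # equal components in eigenbasis, ||b|| = 1

x = np.zeros(n)
r = b - A @ x
p = r.copy()
rr = r @ r
for i in range(1, 201):
    Ap = A @ p
    alpha = rr / (p @ Ap)
    x = x + alpha * p
    r = r - alpha * Ap
    rr_new = r @ r
    if i % 25 == 0:
        gap = np.linalg.norm((b - A @ x) - r)   # f_i: true minus updated residual
        print(f"i={i:3d}  ||r_i||={np.sqrt(rr_new):.2e}  ||f_i||={gap:.2e}")
    beta = rr_new / rr
    p = r + beta * p
    rr = rr_new
```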
66-71 Attainable Accuracy of Pipelined CG

Pipelined CG updates x_i and r_i via:
    x_i = x_{i-1} + α_{i-1} p_{i-1},   r_i = r_{i-1} - α_{i-1} s_{i-1}.
In finite precision:
    x̂_i = x̂_{i-1} + α̂_{i-1} p̂_{i-1} + δx_i,
    r̂_i = r̂_{i-1} - α̂_{i-1} ŝ_{i-1} + δr_i.
Then, with f_i ≡ r̂_i - (b - A x̂_i),
    f_i = f_{i-1} - α̂_{i-1} (ŝ_{i-1} - A p̂_{i-1}) + δr_i + A δx_i,
so
    f_i = f_0 + Σ_{m=1}^{i} (δr_m + A δx_m) - G_i d_i,
where G_i = Ŝ_i - A P̂_i and d_i = [α̂_0, ..., α̂_{i-1}]^T.

Amplification of local rounding errors is possible, depending on the α_i's and β_i's. See recent work: (Cools et al., 2017), (Carson et al., 2017).
72-73 Numerical Example

[Figure: convergence comparison on A = bcsstk03 from the UFSMC, N = 112, κ(A) ≈ 7e6; b has equal components in the eigenbasis of A and ||b|| = 1]
[Figure: the same comparison on A = nos4 from the UFSMC, N = 100, κ(A) ≈ 2e3]
74-76 Attainable Accuracy of s-step CG

f_i ≡ b - A x̂_i - r̂_i

For CG:
    ||f_i|| ≤ ||f_0|| + ε Σ_{m=1}^{i} (1 + N) (||A|| ||x̂_m|| + ||r̂_m||)
For s-step CG, with i = sk + j (see C., 2015):
    ||f_{sk+j}|| ≤ ||f_0|| + ε c Γ Σ_{m=1}^{sk+j} (1 + N) (||A|| ||x̂_m|| + ||r̂_m||),
where c is a low-degree polynomial in s, and Γ = max_{l≤k} ||Ŷ_l^+|| ||Ŷ_l||.

Amplification of local rounding errors is possible, depending on the conditioning of the basis.
77-80 Numerical Example

s-step CG with the monomial basis Y = [p_i, A p_i, ..., A^s p_i, r_i, A r_i, ..., A^{s-1} r_i]:

[Figure: convergence of s-step CG for increasing s on A = bcsstk03 from the UFSMC, N = 112, κ(A) ≈ 7e6; b has equal components in the eigenbasis of A and ||b|| = 1]

* Other, more well-conditioned bases can also be used to improve the convergence rate and accuracy (see, e.g., Philippe and Reichel, 2012).

81 Numerical Example

[Figure: the same experiment on A = nos4 from the UFSMC, N = 100, κ(A) ≈ 2e3]
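To see why the basis-conditioning factor Γ matters, the sketch below (a synthetic diagonal test matrix; all values are assumptions, not the slides' test problems) measures the condition number of the normalized monomial basis [v, Av, ..., A^s v] as s grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
A = np.diag(np.linspace(1.0, 1e3, n))   # SPD diagonal test matrix (assumption)
v = rng.standard_normal(n)

for s in [2, 4, 8, 16]:
    Y = np.empty((n, s + 1))
    Y[:, 0] = v / np.linalg.norm(v)
    for j in range(s):
        Y[:, j + 1] = A @ Y[:, j]
        # Normalize so that cond(Y) reflects loss of linear independence,
        # not just the growth of ||A^j v||.
        Y[:, j + 1] /= np.linalg.norm(Y[:, j + 1])
    print(f"s = {s:2d}:  cond(Y) = {np.linalg.cond(Y):.2e}")
```

The condition number grows rapidly with s, which is exactly the mechanism behind the delayed convergence and accuracy loss seen in the monomial-basis experiments above.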
82-85 Residual Replacement

Idea: improve accuracy by replacing r_i with fl(b - A x̂_i) in certain iterations (van der Vorst and Ye, 2000).

Choose when to replace based on an estimate of f_i ≡ b - A x̂_i - r̂_i:
    Replace often enough that f_i remains small,
    but don't replace when the error in computing fl(b - A x̂_i) would perturb the recurrence and cause convergence delay; see (Strakoš and Tichý, 2002).

This strategy can be adapted for both pipelined KSMs (Cools and Vanroose, 2017) and s-step KSMs (Carson and Demmel, 2014). In both cases, the estimate of f_i can be computed inexpensively, and accuracy improves to a level comparable with the classical method in many cases.
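A heavily simplified sketch of this idea in the spirit of van der Vorst and Ye (2000). The deviation estimate, the replacement condition, and the norm surrogate below are assumptions chosen for brevity; the criteria in the cited papers are more refined (in particular, they use a threshold-crossing condition rather than a simple threshold test).

```python
import numpy as np

def cg_with_residual_replacement(A, b, x0, nmax, eps=np.finfo(float).eps):
    threshold = np.sqrt(eps)             # common choice for the replacement threshold
    normA = np.linalg.norm(A, 1)         # cheap norm surrogate (assumption)
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    rr = r @ r
    # d: running estimate of the deviation ||b - A x_i - r_i||
    d = eps * (np.linalg.norm(b) + normA * np.linalg.norm(x))
    for i in range(nmax):
        Ap = A @ p
        alpha = rr / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rr_new = r @ r
        # Accumulate an estimate of this step's local rounding error.
        d = d + eps * (normA * np.linalg.norm(x) + np.sqrt(rr_new))
        if d > threshold * np.sqrt(rr_new):
            r = b - A @ x                # replace updated residual by true residual
            rr_new = r @ r
            d = eps * (np.linalg.norm(b) + normA * np.linalg.norm(x))
        beta = rr_new / rr
        p = r + beta * p
        rr = rr_new
    return x
```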
86 Scalability of Pipelined CG with Residual Replacement

[Performance figure]
87-90 Conclusions and Takeaways

We need new approaches to reducing synchronization cost in Krylov subspace methods for large-scale problems: pipelined methods, s-step methods, hybrid approaches. But we must also consider finite precision behavior!

Other communication-avoiding and communication-hiding approaches are possible, e.g., enlarged Krylov subspace methods (talk by O. Tissot, MS 28, this session).

The best approach is highly application-dependent (application of pipelined KSMs to graph partitioning: talk by P. Ghysels in Part II: MS 40).

Many interesting open problems and challenges remain as we push toward exascale-level computing!
91 Thank You! math.nyu.edu/~erinc