High Performance Krylov Subspace Method Variants and Their Behavior in Finite Precision
Erin Carson, New York University
HPCSE17, May 24, 2017
Collaborators
- Miroslav Rozložník, Institute of Computer Science, Czech Academy of Sciences
- Zdeněk Strakoš, Department of Numerical Mathematics, Faculty of Mathematics and Physics, Charles University
- Petr Tichý, Department of Numerical Mathematics, Faculty of Mathematics and Physics, Charles University
- Miroslav Tůma, Department of Numerical Mathematics, Faculty of Mathematics and Physics, Charles University

Preprint NCMM/2016/08: http://www.karlin.mff.cuni.cz/~strakos/download/2016_carrozstrtictum_16.pdf
Conjugate Gradient method for solving Ax = b
Double precision ($\varepsilon = 2^{-53}$). Error measured in the A-norm:
$\|x_i - x\|_A^2 = (x_i - x)^T A (x_i - x)$
Iterate updates:
$x_i = x_{i-1} + \alpha_{i-1} p_{i-1}$, $\quad r_i = r_{i-1} - \alpha_{i-1} A p_{i-1}$, $\quad p_i = r_i + \beta_i p_{i-1}$
Krylov subspace methods
For linear systems Ax = b, eigenvalue problems, singular value problems, least squares, etc.
Best for: A large and very sparse, stored implicitly, or when only an approximation is needed.
A Krylov subspace method is a projection process onto the Krylov subspace
$\mathcal{K}_i(A, r_0) = \mathrm{span}\{r_0, Ar_0, A^2 r_0, \ldots, A^{i-1} r_0\}$,
where A is an $N \times N$ matrix and $r_0 = b - Ax_0$ is a length-$N$ vector.
In each iteration:
- Add a dimension to the Krylov subspace; this forms a nested sequence of Krylov subspaces $\mathcal{K}_1(A, r_0) \subseteq \mathcal{K}_2(A, r_0) \subseteq \cdots \subseteq \mathcal{K}_i(A, r_0)$
- Orthogonalize (with respect to some constraint space $\mathcal{C}_i$): select the approximate solution $x_i \in x_0 + \mathcal{K}_i(A, r_0)$ such that $r_i = b - Ax_i \perp \mathcal{C}_i$
Examples: Lanczos/Conjugate Gradient (CG), Arnoldi/Generalized Minimal Residual (GMRES), Biconjugate Gradient (BICG), BICGSTAB, GKL, LSQR, etc.
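As a concrete illustration, here is a minimal numpy sketch (the SPD matrix and starting vector are made-up test data) that forms the monomial basis of $\mathcal{K}_i(A, r_0)$ and checks the nesting property:

```python
import numpy as np

def krylov_basis(A, r0, i):
    """Return the N x i matrix [r0, A r0, ..., A^{i-1} r0] whose columns
    span K_i(A, r0). Illustrative only: the monomial basis is never formed
    explicitly in practice, since its columns quickly become nearly
    linearly dependent."""
    K = np.zeros((len(r0), i))
    v = r0.copy()
    for j in range(i):
        K[:, j] = v
        v = A @ v
    return K

# Made-up SPD matrix and residual vector for illustration
rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
A = M @ M.T + 6 * np.eye(6)
r0 = rng.standard_normal(6)

K3 = krylov_basis(A, r0, 3)
K4 = krylov_basis(A, r0, 4)
# Nested sequence: K_3(A, r0) is contained in K_4(A, r0)
assert np.allclose(K4[:, :3], K3)
```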
The conjugate gradient method
A is symmetric positive definite; $\mathcal{C}_i = \mathcal{K}_i(A, r_0)$, so
$r_i \perp \mathcal{K}_i(A, r_0)$ and $\|x - x_i\|_A = \min_{z \in x_0 + \mathcal{K}_i(A, r_0)} \|x - z\|_A$.
In exact arithmetic the method terminates: $r_i = 0$ for some $i \le N$.
Connection with Lanczos: with $v_1 = r_0 / \|r_0\|$, $i$ iterations of Lanczos produce the $N \times i$ matrix $V_i = [v_1, \ldots, v_i]$ and the $i \times i$ tridiagonal matrix $T_i$ such that
$A V_i = V_i T_i + \delta_{i+1} v_{i+1} e_i^T$, $\quad T_i = V_i^T A V_i$.
The CG approximation $x_i$ is obtained by solving the reduced model
$T_i y_i = \|r_0\| e_1$, $\quad x_i = x_0 + V_i y_i$.
Connections with orthogonal polynomials, the Stieltjes problem of moments, Gauss-Christoffel quadrature, and others (see the 2013 book of Liesen and Strakoš).
CG (and other Krylov subspace methods) are highly nonlinear: good for convergence, bad for ease of finite precision analysis.
Implementation of CG
The standard implementation due to Hestenes and Stiefel (1952) (HSCG) uses three coupled 2-term recurrences for updating $x_i$, $r_i$, $p_i$:

$r_0 = b - Ax_0$, $p_0 = r_0$
for $i = 1{:}\mathrm{nmax}$
  $\alpha_{i-1} = \dfrac{r_{i-1}^T r_{i-1}}{p_{i-1}^T A p_{i-1}}$
  $x_i = x_{i-1} + \alpha_{i-1} p_{i-1}$
  $r_i = r_{i-1} - \alpha_{i-1} A p_{i-1}$
  $\beta_i = \dfrac{r_i^T r_i}{r_{i-1}^T r_{i-1}}$
  $p_i = r_i + \beta_i p_{i-1}$
end

The choice of $\alpha_{i-1}$ minimizes $\|x - x_i\|_A$ along the line $z(\alpha) = x_{i-1} + \alpha p_{i-1}$. Since $p_i^T A p_j = 0$ for $i \ne j$, the 1-dimensional minimizations in each iteration give an $i$-dimensional minimization over the whole subspace $x_0 + \mathcal{K}_i(A, r_0) = x_0 + \mathrm{span}\{p_0, \ldots, p_{i-1}\}$.
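The HSCG recurrences translate directly into code. A minimal dense-matrix sketch (no preconditioning; the SPD test problem is made up for illustration):

```python
import numpy as np

def hscg(A, b, x0, nmax, tol=1e-12):
    """Hestenes-Stiefel CG (HSCG): three coupled 2-term recurrences.
    Assumes A is a symmetric positive definite dense numpy array."""
    x = x0.copy()
    r = b - A @ x0
    p = r.copy()
    rr = r @ r
    for _ in range(nmax):
        Ap = A @ p
        alpha = rr / (p @ Ap)   # minimizes ||x - x_i||_A along x_{i-1} + alpha*p_{i-1}
        x = x + alpha * p
        r = r - alpha * Ap
        rr_new = r @ r
        if np.sqrt(rr_new) <= tol * np.linalg.norm(b):
            break
        beta = rr_new / rr
        p = r + beta * p
        rr = rr_new
    return x

# Made-up SPD test problem
rng = np.random.default_rng(1)
M = rng.standard_normal((50, 50))
A = M @ M.T + 50 * np.eye(50)
b = rng.standard_normal(50)
x = hscg(A, b, np.zeros(50), 200)
assert np.linalg.norm(b - A @ x) <= 1e-10 * np.linalg.norm(b)
```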
Communication in CG
The projection process in terms of communication:
- Add a dimension to $\mathcal{K}_i$: sparse matrix-vector multiplication (SpMV). Must communicate vector entries with neighboring processors (point-to-point communication).
- Orthogonalize with respect to $\mathcal{C}_i$: inner products require global synchronization (MPI_Allreduce); all processors must exchange data and wait for all communication to finish before proceeding.
Dependencies between these communication-bound kernels in each iteration limit performance!
Communication in HSCG
Annotating the HSCG iteration loop by kernel:
SpMV ($Ap_{i-1}$) → inner product ($p_{i-1}^T A p_{i-1}$, giving $\alpha_{i-1}$) → vector updates ($x_i$, $r_i$) → inner product ($r_i^T r_i$, giving $\beta_i$) → vector update ($p_i$) → end loop.
This gives two global synchronization points per iteration, and the second inner product cannot start until the vector updates depending on the first one have completed.
Future exascale systems

                                     Petascale (2009)   Predicted exascale   Factor improvement
  System peak                        -                  -                    ~1000
  Node memory bandwidth              25 GB/s            0.4-4 TB/s           ~10-100
  Total node interconnect bandwidth  3.5 GB/s           100-400 GB/s         ~100
  Memory latency                     100 ns             50 ns                ~1
  Interconnect latency               -                  -                    ~1

* Sources: P. Beckman (ANL), J. Shalf (LBL), and D. Unat (LBL)

Gaps between communication and computation cost are only growing larger in future systems. Reducing time spent moving data and waiting for data will be essential for applications at exascale!
Synchronization-reducing variants
Communication cost has motivated many approaches to reducing synchronization in CG:
- Early work: CG with a single synchronization point per iteration
  - 3-term recurrence CG
  - Using modified computation of recurrence coefficients
  - Using auxiliary vectors
- Pipelined Krylov subspace methods
  - Use modified coefficients and auxiliary vectors to reduce synchronization points to 1 per iteration
  - The modifications also allow decoupling of the SpMV and the inner products, which enables overlapping
- s-step Krylov subspace methods
  - Compute iterations in blocks of s using a different Krylov subspace basis
  - Enables one synchronization per s iterations
The effects of finite precision
It is well known that roundoff error has two effects:
1. Delay of convergence
   - No longer have an exact Krylov subspace
   - The computed basis can lose full rank numerically
   - Residuals are no longer orthogonal, so the minimization is no longer exact!
2. Loss of attainable accuracy
   - Rounding errors cause the true residual $b - Ax_i$ and the updated residual $r_i$ to deviate!
Example: A = bcsstk03 from the UFSMC, b chosen with equal components in the eigenbasis of A and $\|b\| = 1$; $N = 112$, $\kappa(A) \approx 7\times 10^6$.
There is much work on these results for CG; see Meurant and Strakoš (2006) for a thorough summary of early developments in the finite precision analysis of Lanczos and CG.
Optimizing high performance iterative solvers
Synchronization-reducing variants are designed to reduce the time per iteration. But this is not the whole story! What we really want to minimize is the runtime, subject to some constraint on accuracy:
runtime = (time/iteration) × (# iterations)
Changes to how the recurrences are computed can exacerbate the finite precision effects of convergence delay and loss of accuracy. It is crucial that we understand and take into account how algorithm modifications will affect the convergence rate and attainable accuracy!
Maximum attainable accuracy
Accuracy depends on the size of the true residual, $\|b - A\hat{x}_i\|$. Rounding errors cause the true residual $b - A\hat{x}_i$ and the updated residual $\hat{r}_i$ to deviate. Writing
$b - A\hat{x}_i = \hat{r}_i + (b - A\hat{x}_i - \hat{r}_i)$,
we have
$\|b - A\hat{x}_i\| \le \|\hat{r}_i\| + \|b - A\hat{x}_i - \hat{r}_i\|$.
As $\hat{r}_i \to 0$, $\|b - A\hat{x}_i\|$ depends on $\|b - A\hat{x}_i - \hat{r}_i\|$.
Many results exist on bounding attainable accuracy, e.g.: Greenbaum (1989, 1994, 1997); Sleijpen, van der Vorst and Fokkema (1994); Sleijpen, van der Vorst and Modersitzki (2001); Björck, Elfving and Strakoš (1998); and Gutknecht and Strakoš (2000).
Maximum attainable accuracy of HSCG
In finite precision HSCG, the iterates are updated by
$\hat{x}_i = \hat{x}_{i-1} + \hat{\alpha}_{i-1}\hat{p}_{i-1} - \delta x_i$ and $\hat{r}_i = \hat{r}_{i-1} - \hat{\alpha}_{i-1} A \hat{p}_{i-1} - \delta r_i$.
Let $f_i \equiv b - A\hat{x}_i - \hat{r}_i$. Then
$f_i = b - A(\hat{x}_{i-1} + \hat{\alpha}_{i-1}\hat{p}_{i-1} - \delta x_i) - (\hat{r}_{i-1} - \hat{\alpha}_{i-1} A \hat{p}_{i-1} - \delta r_i)$
$\;\;\; = f_{i-1} + A\,\delta x_i + \delta r_i$
$\;\;\; = f_0 + \sum_{m=1}^{i} (A\,\delta x_m + \delta r_m)$.
Bounds on the residual gap:
$\|f_i\| \le O(\varepsilon) \sum_{m=0}^{i} \big(N_A \|A\| \|\hat{x}_m\| + \|\hat{r}_m\|\big)$ (van der Vorst and Ye, 2000)
$\|f_i\| \le O(\varepsilon) \|A\| \big(\|x\| + \max_{m=0,\ldots,i} \|\hat{x}_m\|\big)$ (Greenbaum, 1997)
$\|f_i\| \le O(\varepsilon) N_A \|A\| \|A^{-1}\| \sum_{m=0}^{i} \|\hat{r}_m\|$ (Sleijpen and van der Vorst, 1995)
where $N_A$ is a constant depending on the maximal number of nonzeros per row of A.
Early approaches to reducing synchronization
Goal: reduce the 2 synchronization points per iteration in HSCG to 1 synchronization point per iteration.
- Compute $\beta_i$ from $\alpha_{i-1}$ and $Ap_{i-1}$ using the relation
  $\|r_i\|_2^2 = \alpha_{i-1}^2 \|Ap_{i-1}\|_2^2 - \|r_{i-1}\|_2^2$.
  Can then also merge the updates of $x_i$, $r_i$, and $p_i$. Developed independently by Johnson (1983, 1984), van Rosendale (1983, 1984), and Saad (1985); many other similar approaches exist.
- Could also compute $\alpha_{i-1}$ from $\beta_{i-1}$:
  $\alpha_{i-1} = \left( \dfrac{r_{i-1}^T A r_{i-1}}{r_{i-1}^T r_{i-1}} - \dfrac{\beta_{i-1}}{\alpha_{i-2}} \right)^{-1}$
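The norm relation for $\|r_i\|_2^2$ can be checked numerically: it lets $\beta_i$ be predicted from quantities available before $r_i$ is even formed. A small sketch on a made-up SPD problem:

```python
import numpy as np

# Check ||r_i||^2 = alpha_{i-1}^2 ||A p_{i-1}||^2 - ||r_{i-1}||^2 along HSCG,
# which allows beta_i to be computed together with alpha_{i-1} in a single
# synchronization.
rng = np.random.default_rng(3)
M = rng.standard_normal((30, 30))
A = M @ M.T + 30 * np.eye(30)   # made-up SPD test matrix
b = rng.standard_normal(30)

x = np.zeros(30); r = b.copy(); p = r.copy()
for _ in range(5):
    Ap = A @ p
    rr = r @ r
    alpha = rr / (p @ Ap)
    # predicted ||r_i||^2 from quantities available before updating r
    rr_pred = alpha**2 * (Ap @ Ap) - rr
    x += alpha * p
    r -= alpha * Ap
    assert np.isclose(r @ r, rr_pred, rtol=1e-8)
    beta = (r @ r) / rr
    p = r + beta * p
```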
Modified recurrence coefficient computation
What is the effect of changing the way the recurrence coefficients ($\alpha$ and $\beta$) are computed in HSCG?
Notice that neither $\alpha$ nor $\beta$ appears in the bounds on $\|f_i\|$:
$f_i = b - A\hat{x}_i - \hat{r}_i = b - A(\hat{x}_{i-1} + \hat{\alpha}_{i-1}\hat{p}_{i-1} - \delta x_i) - (\hat{r}_{i-1} - \hat{\alpha}_{i-1} A \hat{p}_{i-1} - \delta r_i)$
As long as the same $\hat{\alpha}_{i-1}$ is used in updating $x_i$ and $r_i$, the relation
$f_i = f_{i-1} + A\,\delta x_i + \delta r_i$
still holds. Rounding errors made in computing $\hat{\alpha}_{i-1}$ therefore do not contribute to the residual gap. But they may change the computed $\hat{x}_i$, $\hat{r}_i$, which can affect the convergence rate...
Modified recurrence coefficient computation
Example: HSCG with the modified formula for $\alpha_{i-1}$:
$\alpha_{i-1} = \left( \dfrac{r_{i-1}^T A r_{i-1}}{r_{i-1}^T r_{i-1}} - \dfrac{\beta_{i-1}}{\alpha_{i-2}} \right)^{-1}$
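In exact arithmetic the modified formula equals the standard one; a short numerical sketch (made-up SPD data) comparing both along an HSCG run:

```python
import numpy as np

# Compare the standard alpha = r'r / p'Ap with the modified formula
# alpha_{i-1} = ( r'Ar/r'r - beta_{i-1}/alpha_{i-2} )^{-1} along HSCG.
rng = np.random.default_rng(4)
M = rng.standard_normal((40, 40))
A = M @ M.T + 40 * np.eye(40)   # made-up SPD test matrix
b = rng.standard_normal(40)

x = np.zeros(40); r = b.copy(); p = r.copy()
alpha_prev = None
beta = None
for i in range(6):
    Ap = A @ p
    rr = r @ r
    alpha = rr / (p @ Ap)            # standard formula
    if i > 0:
        # modified formula, using only r, beta from the previous step,
        # and the previous alpha
        alpha_mod = 1.0 / ((r @ (A @ r)) / rr - beta / alpha_prev)
        assert np.isclose(alpha, alpha_mod, rtol=1e-8)
    x += alpha * p
    r -= alpha * Ap
    beta = (r @ r) / rr
    p = r + beta * p
    alpha_prev = alpha
```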
CG with two three-term recurrences (STCG)
The HSCG recurrences can be written as $A P_i = R_{i+1} L_i$, $R_i = P_i U_i$; combining these gives a 3-term recurrence for the residuals (STCG): $A R_i = R_{i+1} T_i$, $T_i = L_i U_i$.
First developed by Stiefel (1952/53); see also Rutishauser (1959) and Hageman and Young (1981). Motivated by the relation to three-term recurrences for orthogonal polynomials.

$r_0 = b - Ax_0$, $x_{-1} = x_0$, $r_{-1} = r_0$, $e_{-1} = 0$
for $i = 1{:}\mathrm{nmax}$
  $q_{i-1} = \dfrac{(r_{i-1}, A r_{i-1})}{(r_{i-1}, r_{i-1})} - e_{i-2}$
  $x_i = x_{i-1} + \dfrac{1}{q_{i-1}}\big( r_{i-1} + e_{i-2}(x_{i-1} - x_{i-2}) \big)$
  $r_i = r_{i-1} + \dfrac{1}{q_{i-1}}\big( -A r_{i-1} + e_{i-2}(r_{i-1} - r_{i-2}) \big)$
  $e_{i-1} = q_{i-1} \dfrac{(r_i, r_i)}{(r_{i-1}, r_{i-1})}$
end

Can be accomplished with a single synchronization point on parallel computers (Strakoš 1985, 1987). A similar approach (computing $\alpha_i$ using $\beta_{i-1}$) was used by D'Azevedo, Eijkhout, and Romine (1992, 1993).
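The three-term recurrences above can be sketched directly in numpy (the SPD test problem is made up; no stopping criterion, fixed iteration count):

```python
import numpy as np

def stcg(A, b, nmax):
    """Three-term recurrence CG (STCG) following the recurrences on the
    slide. Assumes A is a symmetric positive definite dense numpy array."""
    x_prev = np.zeros(len(b)); x = x_prev.copy()
    r_prev = b.copy(); r = r_prev.copy()
    e = 0.0   # e_{-1} = 0; the first pass reduces to a steepest-descent step
    for _ in range(nmax):
        rr = r @ r
        q = (r @ (A @ r)) / rr - e
        x_new = x + (1.0 / q) * (r + e * (x - x_prev))
        r_new = r + (1.0 / q) * (-(A @ r) + e * (r - r_prev))
        e = q * (r_new @ r_new) / rr
        x_prev, x = x, x_new
        r_prev, r = r, r_new
    return x

# Made-up SPD test problem
rng = np.random.default_rng(5)
M = rng.standard_normal((40, 40))
A = M @ M.T + 40 * np.eye(40)
b = rng.standard_normal(40)
x = stcg(A, b, 60)
assert np.linalg.norm(b - A @ x) <= 1e-8 * np.linalg.norm(b)
```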
Attainable accuracy of STCG
Analyzed by Gutknecht and Strakoš (2000): the attainable accuracy of STCG can be much worse than that of HSCG. The residual gap is bounded by a sum of local errors PLUS local errors multiplied by factors which depend on
$\max_{0 \le l < j \le i} \dfrac{\|r_j\|_2}{\|r_l\|_2}$.
Large residual oscillations can cause these factors to be large, so the local errors can be amplified!
STCG
[figure]
Chronopoulos and Gear's CG (ChG CG)
Chronopoulos and Gear (1989). Looks like HSCG, but is very similar to 3-term recurrence CG (STCG). Reduces synchronizations per iteration to 1 by changing the computation of $\alpha_i$ and using an auxiliary recurrence for $Ap_i$:

$r_0 = b - Ax_0$, $p_0 = r_0$, $s_0 = A p_0$, $\alpha_0 = (r_0, r_0)/(p_0, s_0)$
for $i = 1{:}\mathrm{nmax}$
  $x_i = x_{i-1} + \alpha_{i-1} p_{i-1}$
  $r_i = r_{i-1} - \alpha_{i-1} s_{i-1}$
  $w_i = A r_i$
  $\beta_i = \dfrac{(r_i, r_i)}{(r_{i-1}, r_{i-1})}$
  $\alpha_i = \dfrac{(r_i, r_i)}{(w_i, r_i) - (\beta_i / \alpha_{i-1})(r_i, r_i)}$
  $p_i = r_i + \beta_i p_{i-1}$
  $s_i = w_i + \beta_i s_{i-1}$
end

Kernel sequence per iteration: vector updates → SpMV → inner products → vector updates, with a single synchronization for the inner products.
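A minimal numpy sketch of ChG CG (made-up SPD test data); note that the two inner products $(r_i, r_i)$ and $(w_i, r_i)$ can be fused into one reduction, which is the single synchronization point:

```python
import numpy as np

def chg_cg(A, b, nmax):
    """Chronopoulos/Gear CG: auxiliary recurrence s_i ~ A p_i plus a
    modified alpha formula. Both inner products per iteration can share
    one global reduction."""
    x = np.zeros(len(b))
    r = b.copy()
    p = r.copy()
    s = A @ p
    rr = r @ r
    alpha = rr / (p @ s)
    for _ in range(nmax):
        x = x + alpha * p
        r = r - alpha * s
        w = A @ r
        rr_new = r @ r          # (r_i, r_i); fused with wr in one reduction
        wr = w @ r              # (w_i, r_i)
        beta = rr_new / rr
        alpha = rr_new / (wr - (beta / alpha) * rr_new)
        p = r + beta * p
        s = w + beta * s        # keeps s_i = A p_i in exact arithmetic
        rr = rr_new
    return x

# Made-up SPD test problem
rng = np.random.default_rng(6)
M = rng.standard_normal((40, 40))
A = M @ M.T + 40 * np.eye(40)
b = rng.standard_normal(40)
x = chg_cg(A, b, 60)
assert np.linalg.norm(b - A @ x) <= 1e-8 * np.linalg.norm(b)
```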
Pipelined CG (GVCG)
Pipelined CG of Ghysels and Vanroose (2014):
- Similar to the Chronopoulos and Gear approach: uses the auxiliary vector $s_i \equiv A p_i$ and the same formula for $\alpha_i$
- Also uses auxiliary vectors for $A r_i$ and $A^2 r_i$ to remove the sequential dependency between the SpMV and the inner products
- Allows the use of nonblocking (asynchronous) MPI communication to overlap the SpMV and the inner products; hides the latency of global communications
GVCG (Ghysels and Vanroose 2014)

$r_0 = b - Ax_0$, $p_0 = r_0$, $s_0 = A p_0$, $w_0 = A r_0$, $z_0 = A w_0$, $\alpha_0 = r_0^T r_0 / p_0^T s_0$
for $i = 1{:}\mathrm{nmax}$
  $x_i = x_{i-1} + \alpha_{i-1} p_{i-1}$
  $r_i = r_{i-1} - \alpha_{i-1} s_{i-1}$
  $w_i = w_{i-1} - \alpha_{i-1} z_{i-1}$
  $q_i = A w_i$
  $\beta_i = \dfrac{r_i^T r_i}{r_{i-1}^T r_{i-1}}$
  $\alpha_i = \dfrac{r_i^T r_i}{w_i^T r_i - (\beta_i / \alpha_{i-1}) r_i^T r_i}$
  $p_i = r_i + \beta_i p_{i-1}$
  $s_i = w_i + \beta_i s_{i-1}$
  $z_i = q_i + \beta_i z_{i-1}$
end

Kernel sequence per iteration: vector updates → inner products overlapped with the SpMV (and, when present, the preconditioner application) → vector updates.
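A serial numpy sketch of GVCG (made-up SPD test data; in a real MPI code the reductions for $r_i^T r_i$ and $w_i^T r_i$ would be posted as a nonblocking allreduce and overlapped with the SpMV $q_i = A w_i$):

```python
import numpy as np

def gvcg(A, b, nmax):
    """Pipelined CG (Ghysels/Vanroose style): auxiliary vectors
    s ~ A p, w ~ A r, z ~ A^2 r decouple the single SpMV per iteration
    from the inner products."""
    x = np.zeros(len(b))
    r = b.copy()
    p = r.copy()
    s = A @ p
    w = A @ r
    z = A @ w
    rr = r @ r
    alpha = rr / (p @ s)
    for _ in range(nmax):
        x = x + alpha * p
        r = r - alpha * s
        w = w - alpha * z       # maintains w_i = A r_i without an SpMV
        q = A @ w               # the only SpMV; overlappable with reductions
        rr_new = r @ r
        wr = w @ r
        beta = rr_new / rr
        alpha = rr_new / (wr - (beta / alpha) * rr_new)
        p = r + beta * p
        s = w + beta * s
        z = q + beta * z        # maintains z_i = A s_i in exact arithmetic
        rr = rr_new
    return x

# Made-up SPD test problem
rng = np.random.default_rng(7)
M = rng.standard_normal((40, 40))
A = M @ M.T + 40 * np.eye(40)
b = rng.standard_normal(40)
x = gvcg(A, b, 60)
assert np.linalg.norm(b - A @ x) <= 1e-8 * np.linalg.norm(b)
```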
Attainable accuracy of pipelined CG

What is the effect of adding auxiliary recurrences to the CG method? To isolate the effects, we consider a simplified version of a pipelined method. It uses the same update formulas for α and β as HSCG, but adds a recurrence for s_i ≈ A p_i:

r_0 = b − A x_0, p_0 = r_0, s_0 = A p_0
for i = 1:nmax
    α_{i−1} = (r_{i−1}, r_{i−1}) / (p_{i−1}, s_{i−1})
    x_i = x_{i−1} + α_{i−1} p_{i−1}
    r_i = r_{i−1} − α_{i−1} s_{i−1}
    β_i = (r_i, r_i) / (r_{i−1}, r_{i−1})
    p_i = r_i + β_i p_{i−1}
    s_i = A r_i + β_i s_{i−1}
end 24
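A minimal NumPy sketch of this simplified method (our naming), instrumented to record the residual gap f_i = r_i − (b − A x_i) that the finite precision analysis studies:

```python
import numpy as np

def simple_pipelined_cg(A, b, x0, maxit):
    """Simplified pipelined CG: HSCG formulas for alpha and beta, plus the
    auxiliary recurrence s_i = A r_i + beta_i s_{i-1} for s_i ~ A p_i.
    Records the gap f_i = r_i - (b - A x_i) between the updated residual
    and the true residual at every iteration."""
    x, r = x0.copy(), b - A @ x0
    p, s = r.copy(), A @ r
    gaps = []
    for _ in range(maxit):
        alpha = (r @ r) / (p @ s)
        x = x + alpha * p
        r_old = r
        r = r - alpha * s
        beta = (r @ r) / (r_old @ r_old)
        p = r + beta * p
        s = A @ r + beta * s
        gaps.append(np.linalg.norm(r - (b - A @ x)))  # residual gap f_i
    return x, gaps
```

For a well-conditioned problem the gap stays near machine precision; the analysis below explains when it can grow.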
Attainable accuracy of simple pipelined CG

In finite precision, the computed iterates satisfy

    x_i = x_{i−1} + α_{i−1} p_{i−1} + δx_i
    r_i = r_{i−1} − α_{i−1} s_{i−1} + δr_i

The gap between the updated and true residuals is then

    f_i ≡ r_i − (b − A x_i)
        = f_{i−1} − α_{i−1} (s_{i−1} − A p_{i−1}) + δr_i + A δx_i
        = f_0 + Σ_{m=1}^{i} (δr_m + A δx_m) − G_i d_i

where G_i = S_i − A P_i and d_i = [α_0, …, α_{i−1}]^T. 25
Attainable accuracy of simple pipelined CG

The gap term admits a bound of the form

    ||G_i|| ≤ O(ε) κ(U_i) ( ||A|| ||P_i|| + ||A R_i|| )

where U_i is the upper bidiagonal matrix with ones on the diagonal and −β_1, …, −β_{i−1} on the superdiagonal (so that R_i = P_i U_i), and U_i^{−1} is upper triangular with entries built from products of the β's:

    (U_i^{−1})_{l,j} = β_l β_{l+1} ⋯ β_{j−1},    where    β_l β_{l+1} ⋯ β_j = ||r_j||² / ||r_{l−1}||²,  l < j

Residual oscillations can cause these factors to be large! Errors in computed recurrence coefficients can be amplified! Very similar to the results for attainable accuracy in the 3-term STCG: a seemingly innocuous change can cause a drastic loss of accuracy. 26
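The β-product identity is easy to check numerically. The helper below (our construction) runs plain HSCG and records the β's and residual norms, so that products of consecutive β's can be compared against the corresponding residual-norm ratios:

```python
import numpy as np

def cg_beta_products(A, b, its):
    """Run HSCG for a fixed number of iterations, returning the residual
    norms ||r_0||, ..., ||r_its|| and the coefficients beta_1, ..., beta_its,
    so the identity beta_l * ... * beta_j = ||r_j||^2 / ||r_{l-1}||^2
    (which governs the entries of U_i^{-1}) can be verified."""
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rnorms, betas = [np.linalg.norm(r)], []
    for _ in range(its):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)
        x = x + alpha * p
        r_old = r
        r = r - alpha * Ap
        beta = (r @ r) / (r_old @ r_old)
        betas.append(beta)
        rnorms.append(np.linalg.norm(r))
        p = r + beta * p
    return rnorms, betas
```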
Simple pipelined CG

[Figure: convergence history of simple pipelined CG, showing the effect of changing the formula for the recurrence coefficient α and of using the auxiliary vectors s_i ≈ A p_i, w_i ≈ A r_i, z_i ≈ A² r_i] 27
s-step CG

Idea: compute blocks of s iterations at once
- Compute updates in a different basis
- Communicate every s iterations instead of every iteration
- Reduces the number of synchronizations per iteration by a factor of s

An idea rediscovered many times:
- First related work: s-dimensional steepest descent, least squares: Khabaza ('63), Forsythe ('68), Marchuk and Kuznecov ('68)
- Flurry of work on s-step Krylov methods in the '80s/early '90s: see, e.g., Van Rosendale (1983); Chronopoulos and Gear (1989)
- Resurgence of interest in recent years due to growing problem sizes and the growing relative cost of communication 28
s-step CG

Key observation: after iteration i, for j ∈ {0, …, s},

    x_{i+j} − x_i, r_{i+j}, p_{i+j} ∈ K_{s+1}(A, p_i) + K_s(A, r_i)

s steps of s-step CG:
- Expand the solution space s dimensions at once: compute a basis matrix Y such that span(Y) = K_{s+1}(A, p_i) + K_s(A, r_i), satisfying the recurrence A Y = Y B
- Compute the inner products between basis vectors in one synchronization: G = Y^T Y
- Perform s iterations of vector updates by updating coordinates in the basis Y:
    x_{i+j} − x_i = Y x'_j,  r_{i+j} = Y r'_j,  p_{i+j} = Y p'_j 29
s-step CG

For s iterations of updates, the inner products and SpMVs (in the basis Y) can be computed independently by each processor without communication:

    A p_{i+j} = A Y p'_j = Y (B p'_j)           (length-n SpMV becomes O(s)-dimensional local work)
    (r_{i+j}, r_{i+j}) = r'^T_j Y^T Y r'_j = r'^T_j G r'_j   (length-n inner product becomes an O(s)-dimensional product) 30
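These identities can be checked directly for a single monomial basis block (a sketch with our naming; the actual method uses the two-block basis for p and r): with Y = [v, Av, …, A^s v] and B the down-shift matrix, A(Yc) = Y(Bc) holds whenever the last coordinate of c is zero, and every inner product reduces to the small Gram matrix G = Y^T Y.

```python
import numpy as np

def sstep_basis(A, v, s):
    """Monomial Krylov basis Y = [v, Av, ..., A^s v] and the (s+1)x(s+1)
    down-shift matrix B satisfying A Y c = Y (B c) for any coordinate
    vector c whose last entry is zero (A times the last basis vector
    would leave the space, hence the restriction)."""
    cols = [v]
    for _ in range(s):
        cols.append(A @ cols[-1])
    Y = np.column_stack(cols)
    B = np.diag(np.ones(s), -1)   # maps e_j -> e_{j+1}
    return Y, B
```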
s-step CG

r_0 = b − A x_0, p_0 = r_0
for k = 0:nmax/s                                          [outer loop]
    Compute Y_k and B_k such that A Y_k = Y_k B_k and
        span(Y_k) = K_{s+1}(A, p_{sk}) + K_s(A, r_{sk})   [compute basis: O(s) SpMVs]
    G_k = Y_k^T Y_k                                       [O(s²) inner products, one synchronization]
    x'_0 = 0, r'_0 = e_{s+2}, p'_0 = e_1
    for j = 1:s                                           [inner loop: local vector updates, no communication]
        α_{sk+j−1} = (r'^T_{j−1} G_k r'_{j−1}) / (p'^T_{j−1} G_k B_k p'_{j−1})
        x'_j = x'_{j−1} + α_{sk+j−1} p'_{j−1}
        r'_j = r'_{j−1} − α_{sk+j−1} B_k p'_{j−1}
        β_{sk+j} = (r'^T_j G_k r'_j) / (r'^T_{j−1} G_k r'_{j−1})
        p'_j = r'_j + β_{sk+j} p'_{j−1}
    end
    [x_{s(k+1)} − x_{sk}, r_{s(k+1)}, p_{s(k+1)}] = Y_k [x'_s, r'_s, p'_s]
end 31
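Putting the pieces together, here is a compact, unoptimized NumPy sketch of s-step CG with a (column-normalized) monomial basis. Variable names are ours; a practical code would use a better-conditioned basis and a parallel reduction for the Gram matrix.

```python
import numpy as np

def sstep_cg(A, b, s, nouter):
    """s-step CG sketch. Per outer loop: O(s) SpMVs build the basis, ONE
    Gram matrix G = Y^T Y stands in for all inner products (one global
    synchronization), then s inner iterations act on small
    (2s+1)-dimensional coordinate vectors with no communication."""

    def basis_block(v, length):
        # [v, Av, ..., A^{length-1} v] with normalized columns;
        # the scalings go into B so that A Y = Y B still holds
        cols = [v / np.linalg.norm(v)]
        scal = []
        for _ in range(length - 1):
            w = A @ cols[-1]
            nw = np.linalg.norm(w)
            cols.append(w / nw)
            scal.append(nw)
        return cols, scal

    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    m = 2 * s + 1
    for _ in range(nouter):
        Pc, sp = basis_block(p, s + 1)       # spans K_{s+1}(A, p)
        Rc, sr = basis_block(r, s)           # spans K_s(A, r)
        Y = np.column_stack(Pc + Rc)
        B = np.zeros((m, m))
        for j in range(s):
            B[j + 1, j] = sp[j]              # A * (j-th p-basis column)
        for j in range(s - 1):
            B[s + 2 + j, s + 1 + j] = sr[j]  # A * (j-th r-basis column)
        G = Y.T @ Y                          # one synchronization
        xc = np.zeros(m)
        rc = np.zeros(m); rc[s + 1] = np.linalg.norm(r)   # r in coordinates
        pc = np.zeros(m); pc[0] = np.linalg.norm(p)       # p in coordinates
        for _ in range(s):                   # local updates, no communication
            Bp = B @ pc
            alpha = (rc @ G @ rc) / (pc @ G @ Bp)
            xc = xc + alpha * pc
            rc_old = rc
            rc = rc - alpha * Bp
            beta = (rc @ G @ rc) / (rc_old @ G @ rc_old)
            pc = rc + beta * pc
        x = x + Y @ xc                       # recover CG vectors
        r = Y @ rc
        p = Y @ pc
    return x
```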
Sources of local roundoff error in s-step CG

Computing the s-step Krylov subspace basis (error in computing the s-step basis):

    A Y_k = Y_k B_k + ΔY_k

Updating coordinate vectors in the inner loop (error in updating coefficient vectors):

    x'_{k,j} = x'_{k,j−1} + q'_{k,j−1} + ξ_{k,j}
    r'_{k,j} = r'_{k,j−1} − B_k q'_{k,j−1} + η_{k,j}
    with q'_{k,j−1} = fl(α_{sk+j−1} p'_{k,j−1})

Recovering CG vectors for use in the next outer loop (error in basis change):

    x_{sk+j} = Y_k x'_{k,j} + x_{sk} + φ_{sk+j}
    r_{sk+j} = Y_k r'_{k,j} + ψ_{sk+j} 32
Attainable accuracy of s-step CG

We can write the gap between the true and updated residuals, f, in terms of these errors:

    f_{sk+j} = f_0 − Σ_{l=0}^{k−1} [ A φ_{sl+s} + ψ_{sl+s} + Σ_{i=1}^{s} ( A Y_l ξ_{l,i} + Y_l η_{l,i} − ΔY_l q'_{l,i−1} ) ]
                   − A φ_{sk+j} − ψ_{sk+j} − Σ_{i=1}^{j} ( A Y_k ξ_{k,i} + Y_k η_{k,i} − ΔY_k q'_{k,i−1} )

Using standard rounding error results, this allows us to obtain an upper bound on ||f_{sk+j}||. 33
Attainable accuracy of s-step CG

    f_i ≡ b − A x_i − r_i

For CG:

    ||f_i|| ≤ ||f_0|| + ε Σ_{m=1}^{i} (1 + N) ( ||A|| ||x_m|| + ||r_m|| )

For s-step CG, with i = sk + j:

    ||f_{sk+j}|| ≤ ||f_0|| + ε c Γ_k Σ_{m=1}^{sk+j} (1 + N) ( ||A|| ||x_m|| + ||r_m|| )

where c is a low-degree polynomial in s, and Γ_k = max_{l≤k} Γ_l with Γ_l = ||Y_l^+|| ||Y_l|| (see C., 2015). 34
s-step CG

[Figure: convergence of s-step CG with the monomial basis Y = [p_i, A p_i, …, A^s p_i, r_i, A r_i, …, A^{s−1} r_i] for several values of s]

Can also use other, more well-conditioned bases to improve the convergence rate and accuracy (see, e.g., Philippe and Reichel, 2012). 35
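The conditioning gap between bases is easy to observe. The sketch below (our construction) compares the condition number of a column-normalized monomial Krylov basis with that of a Chebyshev basis built from the three-term recurrence, shifted and scaled to an interval [lmin, lmax] assumed to contain the spectrum:

```python
import numpy as np

def basis_condition_numbers(A, v, s, lmin, lmax):
    """cond of the normalized monomial basis [v, Av, ..., A^s v] versus the
    Chebyshev basis [T_0(M)v, ..., T_s(M)v] with
    M = (2A - (lmax + lmin) I) / (lmax - lmin)."""
    # Monomial basis, columns normalized (still ill-conditioned: the
    # columns align with the dominant eigenvector as the power grows)
    mono = [v]
    for _ in range(s):
        mono.append(A @ mono[-1])
    mono = [y / np.linalg.norm(y) for y in mono]
    # Chebyshev basis via the three-term recurrence T_{k+1} = 2 M T_k - T_{k-1}
    c, d = (lmax + lmin) / 2.0, (lmax - lmin) / 2.0
    cheb = [v, (A @ v - c * v) / d]
    for _ in range(s - 1):
        cheb.append(2.0 * (A @ cheb[-1] - c * cheb[-1]) / d - cheb[-2])
    return (np.linalg.cond(np.column_stack(mono)),
            np.linalg.cond(np.column_stack(cheb)))
```

On a matrix with eigenvalues spread over [1, 1000], the Chebyshev basis stays orders of magnitude better conditioned than the monomial one as s grows.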
s-step CG Even assuming perfect parallel scalability with s (which is usually not the case due to extra SpMVs and inner products), already at s = 4 we are worse than HSCG in terms of number of synchronizations! 36
A different problem...

A: nos4 from the UFSMC; b: equal components in the eigenbasis of A and ||b|| = 1; N = 100, κ(A) ≈ 2e3

If the application only requires ||x − x_i||_A ≤ 10^{−10}, any of these methods will work! 37
Performance benefit from s-step BICGSTAB

Speedups for real applications:
- s-step BICGSTAB bottom-solver implemented in BoxLib (AMR framework from LBL)
- Low Mach Number Combustion Code (LMC): gas-phase combustion simulation
- Compared GMG with BICGSTAB vs. GMG with s-step BICGSTAB (s = 4) on a Cray XE6 for two different applications
- Up to 2.5x speedup in the bottom solve; up to 1.5x in the overall MG solve

[Figure: LMC - 3D mac_project solve; speedups of the bottom solver and the overall MG solver (up to ~3.0x) at 64-32768 cores (flat MPI) and 48-24576 cores (MPI+OpenMP)]

(see Williams et al., IPDPS 2014) 38
Conclusions and takeaways

Think of the bigger picture:
- Much focus on modifying methods to speed up iterations, but the speed of an iteration is only part of the runtime: runtime = (time/iteration) x (#iterations)
- And a solver that can't solve to the required accuracy is useless!
- Solver runtime is only part of a larger scientific code

Design and implementation of iterative solvers requires a holistic approach:
- Selecting the right method, parameters, stopping criteria
- Selecting the right preconditioner (closely linked with the discretization! see Málek and Strakoš, 2015) 39