High performance Krylov subspace method variants
and their behavior in finite precision
Erin Carson, New York University
HPCSE17, May 24, 2017
Collaborators
Miroslav Rozložník, Institute of Computer Science, Czech Academy of Sciences
Zdeněk Strakoš, Department of Numerical Mathematics, Faculty of Mathematics and Physics, Charles University
Petr Tichý, Department of Numerical Mathematics, Faculty of Mathematics and Physics, Charles University
Miroslav Tůma, Department of Numerical Mathematics, Faculty of Mathematics and Physics, Charles University
Preprint NCMM/2016/08
Conjugate Gradient method for solving Ax = b
Double precision ($\varepsilon = 2^{-53}$). Error measured in the A-norm:
$\|x_i - x\|_A = \left( (x_i - x)^T A (x_i - x) \right)^{1/2}$
Recurrences:
$x_i = x_{i-1} + \alpha_i p_i$, $r_i = r_{i-1} - \alpha_i A p_i$, $p_i = r_i + \beta_i p_{i-1}$
Krylov subspace methods
For linear systems Ax = b, eigenvalue problems, singular value problems, least squares, etc. Best for: A large and very sparse, stored implicitly, or when only an approximation is needed.
A Krylov subspace method is a projection process onto the Krylov subspace
$\mathcal{K}_i(A, r_0) = \operatorname{span}\{r_0, Ar_0, A^2 r_0, \dots, A^{i-1} r_0\}$,
where $A$ is an $N \times N$ matrix and $r_0 = b - Ax_0$ is a length-$N$ vector.
In each iteration:
- Add a dimension to the Krylov subspace, forming a nested sequence of Krylov subspaces $\mathcal{K}_1(A, r_0) \subseteq \mathcal{K}_2(A, r_0) \subseteq \cdots \subseteq \mathcal{K}_i(A, r_0)$
- Orthogonalize (with respect to some $\mathcal{C}_i$): select the approximate solution $x_i \in x_0 + \mathcal{K}_i(A, r_0)$ such that the residual $r_i = b - Ax_i$ is orthogonal to $\mathcal{C}_i$
Examples: Lanczos/Conjugate Gradient (CG), Arnoldi/Generalized Minimal Residual (GMRES), Biconjugate Gradient (BICG), BICGSTAB, GKL, LSQR, etc.
The conjugate gradient method
$A$ is symmetric positive definite, $\mathcal{C}_i = \mathcal{K}_i(A, r_0)$. Then $r_i \perp \mathcal{K}_i(A, r_0)$ and
$\|x - x_i\|_A = \min_{z \in x_0 + \mathcal{K}_i(A, r_0)} \|x - z\|_A$,
with $r_{N+1} = 0$ in exact arithmetic.
Connection with Lanczos: with $v_1 = r_0 / \|r_0\|$, $i$ iterations of Lanczos produce the $N \times i$ matrix $V_i = [v_1, \dots, v_i]$ and the $i \times i$ tridiagonal matrix $T_i$ such that
$AV_i = V_i T_i + \delta_{i+1} v_{i+1} e_i^T$, with $T_i = V_i^T A V_i$.
The CG approximation $x_i$ is obtained by solving the reduced model
$T_i y_i = \|r_0\| e_1$, $x_i = x_0 + V_i y_i$.
There are also connections with orthogonal polynomials, the Stieltjes problem of moments, Gauss-Christoffel quadrature, and others (see the 2013 book of Liesen and Strakoš).
CG (and other Krylov subspace methods) are highly nonlinear: good for convergence, bad for ease of finite precision analysis.
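The Lanczos connection above can be sketched numerically: build $V_i$ and $T_i$, then solve the reduced model to recover the CG approximation. This is a minimal dense sketch with no reorthogonalization; the SPD test matrix and all variable names are illustrative, not from the talk.

```python
import numpy as np

# Sketch of the Lanczos connection: build V_i and tridiagonal T_i, then
# recover the CG iterate from the reduced model T_i y_i = ||r_0|| e_1.
# Illustrative SPD test matrix; no reorthogonalization is performed.
np.random.seed(1)
N, i = 50, 10
Q, _ = np.linalg.qr(np.random.randn(N, N))
A = Q @ np.diag(np.linspace(1, 100, N)) @ Q.T   # SPD, kappa(A) = 100
b = np.random.randn(N)
x0 = np.zeros(N)

r0 = b - A @ x0
beta0 = np.linalg.norm(r0)
V = np.zeros((N, i + 1))
T = np.zeros((i + 1, i))
V[:, 0] = r0 / beta0
for j in range(i):
    w = A @ V[:, j]
    if j > 0:
        w -= T[j, j - 1] * V[:, j - 1]       # subtract delta_j v_{j-1}
    T[j, j] = V[:, j] @ w                    # diagonal entry of T_i
    w -= T[j, j] * V[:, j]
    T[j + 1, j] = np.linalg.norm(w)          # delta_{j+1}
    if j + 1 < i:
        T[j, j + 1] = T[j + 1, j]            # T_i is symmetric tridiagonal
    V[:, j + 1] = w / T[j + 1, j]

# Reduced model: T_i y_i = ||r_0|| e_1, then x_i = x_0 + V_i y_i
y = np.linalg.solve(T[:i, :i], beta0 * np.eye(i)[:, 0])
x_i = x0 + V[:, :i] @ y
```

In exact arithmetic `x_i` equals the $i$-th CG iterate, so its A-norm error decreases monotonically with $i$.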
Implementation of CG
The standard implementation is due to Hestenes and Stiefel (1952) (HSCG). It uses three 2-term recurrences for updating $x_i$, $r_i$, and $p_i$:

$r_0 = b - Ax_0$, $p_0 = r_0$
for i = 1:nmax
    $\alpha_{i-1} = \dfrac{r_{i-1}^T r_{i-1}}{p_{i-1}^T A p_{i-1}}$
    $x_i = x_{i-1} + \alpha_{i-1} p_{i-1}$
    $r_i = r_{i-1} - \alpha_{i-1} A p_{i-1}$
    $\beta_i = \dfrac{r_i^T r_i}{r_{i-1}^T r_{i-1}}$
    $p_i = r_i + \beta_i p_{i-1}$
end

This choice of $\alpha_{i-1}$ minimizes $\|x - x_i\|_A$ along the line $z(\alpha) = x_{i-1} + \alpha p_{i-1}$. Since $p_i \perp_A p_j$ for $i \neq j$, the 1-dimensional minimizations in each iteration give an $i$-dimensional minimization over the whole subspace $x_0 + \mathcal{K}_i(A, r_0) = x_0 + \operatorname{span}\{p_0, \dots, p_{i-1}\}$.
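As a concrete reference point, HSCG can be sketched in a few lines of serial, dense NumPy; the function name and the convergence test are illustrative additions, not from the slides. The two inner products per iteration ($p^T Ap$ and $r^T r$) are the two synchronization points discussed below.

```python
import numpy as np

def hscg(A, b, x0, nmax, tol=1e-12):
    """Hestenes-Stiefel CG (HSCG): three 2-term recurrences for x, r, p."""
    x = x0.copy()
    r = b - A @ x                  # r_0
    p = r.copy()                   # p_0
    rr = r @ r
    bnorm = np.linalg.norm(b)
    for _ in range(nmax):
        Ap = A @ p                 # SpMV
        alpha = rr / (p @ Ap)      # inner product: synchronization point 1
        x = x + alpha * p
        r = r - alpha * Ap
        rr_new = r @ r             # inner product: synchronization point 2
        if np.sqrt(rr_new) <= tol * bnorm:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x
```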
Communication in CG
The projection process in terms of communication:
- Add a dimension to $\mathcal{K}_i$: sparse matrix-vector multiplication (SpMV). Must communicate vector entries with neighboring processors (P2P communication).
- Orthogonalize with respect to $\mathcal{C}_i$: inner products require global synchronization (MPI_Allreduce); all processors must exchange data and wait for all communication to finish before proceeding.
Dependencies between these communication-bound kernels in each iteration limit performance!
Communication in HSCG
$r_0 = b - Ax_0$, $p_0 = r_0$
for i = 1:nmax
    $\alpha_{i-1} = \dfrac{r_{i-1}^T r_{i-1}}{p_{i-1}^T A p_{i-1}}$    [SpMV + inner products]
    $x_i = x_{i-1} + \alpha_{i-1} p_{i-1}$    [vector updates]
    $r_i = r_{i-1} - \alpha_{i-1} A p_{i-1}$
    $\beta_i = \dfrac{r_i^T r_i}{r_{i-1}^T r_{i-1}}$    [inner products]
    $p_i = r_i + \beta_i p_{i-1}$    [vector updates]
end
Each iteration thus contains one SpMV and two separate inner-product (global synchronization) phases, interleaved with vector updates.
Future exascale systems
Petascale systems (2009) vs. predicted exascale systems (factor improvement):
- System peak: factor ~1000
- Node memory bandwidth: 25 GB/s, growing to TB/s scale
- Total node interconnect bandwidth: factor ~100
- Memory latency: 100 ns vs. 50 ns, factor ~1
- Interconnect latency: factor ~1
(Sources: P. Beckman (ANL), J. Shalf (LBL), and D. Unat (LBL))
Gaps between communication and computation cost are only growing larger in future systems. Reducing time spent moving data and waiting for data will be essential for applications at exascale!
Synchronization-reducing variants
Communication cost has motivated many approaches to reducing synchronization in CG:
- Early work: CG with a single synchronization point per iteration
  - 3-term recurrence CG
  - Using modified computation of recurrence coefficients
  - Using auxiliary vectors
- Pipelined Krylov subspace methods
  - Use modified coefficients and auxiliary vectors to reduce synchronization points to 1 per iteration
  - The modifications also allow decoupling of the SpMV and inner products, which enables overlapping
- s-step Krylov subspace methods
  - Compute iterations in blocks of s using a different Krylov subspace basis
  - Enables one synchronization per s iterations
The effects of finite precision
It is well known that roundoff error has two effects:
1. Delay of convergence
   - We no longer have an exact Krylov subspace
   - The computed basis can lose numerical rank
   - Residuals are no longer orthogonal, so the minimization is no longer exact!
2. Loss of attainable accuracy
   - Rounding errors cause the true residual $b - Ax_i$ and the updated residual $r_i$ to deviate!
[Figure: CG convergence for A = bcsstk03 from the UFSMC, with b having equal components in the eigenbasis of A and $\|b\| = 1$; $N = 112$, $\kappa(A) \approx 7 \times 10^6$.]
Much work exists on these results for CG; see Meurant and Strakoš (2006) for a thorough summary of early developments in the finite precision analysis of Lanczos and CG.
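The deviation between the true and updated residuals can be observed directly. The experiment below is illustrative only: it uses a synthetic diagonal matrix with a condition number comparable to the bcsstk03 example, not the matrix from the talk.

```python
import numpy as np

# Run HSCG on an ill-conditioned SPD matrix and measure the residual gap
# || (b - A x_i) - r_i ||. Synthetic diagonal test matrix (illustrative);
# kappa(A) ~ 7e6, roughly matching the bcsstk03 example in the talk.
N = 112
A = np.diag(np.logspace(0, np.log10(7e6), N))
b = np.ones(N) / np.sqrt(N)             # ||b|| = 1

x = np.zeros(N)
r = b.copy()                            # updated residual r_i
p = r.copy()
rr = r @ r
for _ in range(200):
    Ap = A @ p
    alpha = rr / (p @ Ap)
    x += alpha * p
    r -= alpha * Ap
    rr_new = r @ r
    p = r + (rr_new / rr) * p
    rr = rr_new

gap = np.linalg.norm((b - A @ x) - r)   # nonzero purely due to rounding
print(gap)
```

In exact arithmetic the gap would be identically zero; in double precision it is small but nonzero, and it is exactly this quantity that limits the attainable accuracy.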
Optimizing high performance iterative solvers
Synchronization-reducing variants are designed to reduce the time per iteration. But this is not the whole story! What we really want to minimize is the runtime, subject to some constraint on accuracy:

runtime = (time/iteration) × (# iterations)

Changes to how the recurrences are computed can exacerbate the finite precision effects of convergence delay and loss of accuracy. It is crucial that we understand, and take into account, how algorithm modifications will affect the convergence rate and attainable accuracy!
Maximum attainable accuracy
Accuracy depends on the size of the true residual: $\|b - Ax_i\|$. Rounding errors cause the true residual, $b - Ax_i$, and the updated residual, $r_i$, to deviate. Writing $b - Ax_i = r_i + (b - Ax_i - r_i)$,
$\|b - Ax_i\| \le \|r_i\| + \|b - Ax_i - r_i\|$.
As $r_i \to 0$, $\|b - Ax_i\|$ depends on $\|b - Ax_i - r_i\|$.
There are many results on bounding attainable accuracy, e.g.: Greenbaum (1989, 1994, 1997), Sleijpen, van der Vorst and Fokkema (1994), Sleijpen, van der Vorst and Modersitzki (2001), Björck, Elfving and Strakoš (1998), and Gutknecht and Strakoš (2000).
Maximum attainable accuracy of HSCG
In finite precision HSCG, the iterates are updated by
$x_i = x_{i-1} + \alpha_{i-1} p_{i-1} - \delta x_i$ and $r_i = r_{i-1} - \alpha_{i-1} A p_{i-1} - \delta r_i$.
Let $f_i \equiv b - Ax_i - r_i$ denote the residual gap. Then
$f_i = b - A(x_{i-1} + \alpha_{i-1} p_{i-1} - \delta x_i) - (r_{i-1} - \alpha_{i-1} A p_{i-1} - \delta r_i) = f_{i-1} + A\delta x_i + \delta r_i = f_0 + \sum_{m=1}^{i} (A\delta x_m + \delta r_m)$.
Bounds on the gap include:
$\|f_i\| \le O(\varepsilon) \sum_{m=0}^{i} \left( N_A \|A\| \|x_m\| + \|r_m\| \right)$  (van der Vorst and Ye, 2000)
$\|f_i\| \le O(\varepsilon) \|A\| \left( \|x\| + \max_{m=0,\dots,i} \|x_m\| \right)$  (Greenbaum, 1997)
$\|f_i\| \le O(\varepsilon) N_A \|A\| \|A^{-1}\| \sum_{m=0}^{i} \|r_m\|$  (Sleijpen and van der Vorst)
Early approaches to reducing synchronization
Goal: reduce the 2 synchronization points per iteration in HSCG to 1 synchronization point per iteration.
Compute $\beta_i$ from $\alpha_{i-1}$ and $Ap_{i-1}$ using the relation
$\|r_i\|_2^2 = \alpha_{i-1}^2 \|Ap_{i-1}\|_2^2 - \|r_{i-1}\|_2^2$.
The updates of $x_i$, $r_i$, and $p_i$ can then also be merged. Developed independently by Johnson (1983, 1984), van Rosendale (1983, 1984), and Saad (1985); there are many other similar approaches.
One could also compute $\alpha_{i-1}$ from $\beta_{i-1}$:
$\alpha_{i-1} = \left( \dfrac{r_{i-1}^T A r_{i-1}}{r_{i-1}^T r_{i-1}} - \dfrac{\beta_{i-1}}{\alpha_{i-2}} \right)^{-1}$
Modified recurrence coefficient computation
What is the effect of changing the way the recurrence coefficients ($\alpha$ and $\beta$) are computed in HSCG?
Notice that neither $\alpha$ nor $\beta$ appears in the bounds on $f_i$:
$f_i = b - Ax_i - r_i = b - A(x_{i-1} + \alpha_{i-1} p_{i-1} - \delta x_i) - (r_{i-1} - \alpha_{i-1} A p_{i-1} - \delta r_i)$
As long as the same $\alpha_{i-1}$ is used in updating $x_i$ and $r_i$,
$f_i = f_{i-1} + A\delta x_i + \delta r_i$
still holds. Rounding errors made in computing $\alpha_{i-1}$ do not contribute to the residual gap. But they may change the computed $x_i$, $r_i$, which can affect the convergence rate...
Example: HSCG with the modified formula for $\alpha_{i-1}$:
$\alpha_{i-1} = \left( \dfrac{r_{i-1}^T A r_{i-1}}{r_{i-1}^T r_{i-1}} - \dfrac{\beta_{i-1}}{\alpha_{i-2}} \right)^{-1}$
CG with two three-term recurrences (STCG)
The HSCG recurrences can be written as
$AP_i = R_{i+1} L_i$, $R_i = P_i U_i$;
we can combine these to obtain a 3-term recurrence for the residuals (STCG):
$AR_i = R_{i+1} T_i$, $T_i = L_i U_i$.
First developed by Stiefel (1952/53), also Rutishauser (1959) and Hageman and Young (1981); motivated by the relation to three-term recurrences for orthogonal polynomials.

$r_0 = b - Ax_0$, $x_{-1} = x_0$, $r_{-1} = r_0$, $e_{-1} = 0$
for i = 1:nmax
    $q_{i-1} = \dfrac{(r_{i-1}, Ar_{i-1})}{(r_{i-1}, r_{i-1})} - e_{i-2}$
    $x_i = x_{i-1} + \dfrac{1}{q_{i-1}} \left( r_{i-1} + e_{i-2}(x_{i-1} - x_{i-2}) \right)$
    $r_i = r_{i-1} + \dfrac{1}{q_{i-1}} \left( -Ar_{i-1} + e_{i-2}(r_{i-1} - r_{i-2}) \right)$
    $e_{i-1} = q_{i-1} \dfrac{(r_i, r_i)}{(r_{i-1}, r_{i-1})}$
end

This can be accomplished with a single synchronization point on parallel computers (Strakoš 1985, 1987). A similar approach (computing $\alpha_i$ using $\beta_{i-1}$) was used by D'Azevedo, Eijkhout, and Romine (1992, 1993).
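The STCG recurrences above translate directly into a serial sketch (function and variable names are illustrative); here $q_{i-1}$ plays the role of $1/\alpha_{i-1}$ and $e_{i-1}$ that of $\beta_i/\alpha_{i-1}$.

```python
import numpy as np

def stcg(A, b, x0, nmax):
    """CG with two three-term (Stiefel-type) recurrences for x and r."""
    x_prev, x = x0.copy(), x0.copy()       # x_{-1} = x_0
    r = b - A @ x
    r_prev = r.copy()                      # r_{-1} = r_0
    e = 0.0                                # e_{-1} = 0
    rr = r @ r
    for _ in range(nmax):
        Ar = A @ r
        q = (r @ Ar) / rr - e              # q_{i-1} (equals 1/alpha_{i-1})
        x_new = x + (r + e * (x - x_prev)) / q
        r_new = r + (-Ar + e * (r - r_prev)) / q
        rr_new = r_new @ r_new
        e = q * rr_new / rr                # e_{i-1}
        x_prev, x = x, x_new
        r_prev, r = r, r_new
        rr = rr_new
    return x
```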
Attainable accuracy of STCG
Analyzed by Gutknecht and Strakoš (2000): the attainable accuracy of STCG can be much worse than that of HSCG. The residual gap is bounded by a sum of local errors PLUS local errors multiplied by factors which depend on
$\max_{0 \le l < j \le i} \dfrac{\|r_j\|_2}{\|r_l\|_2}$.
Large residual oscillations can cause these factors to be large, so the local errors can be amplified!
Chronopoulos and Gear's CG (ChG CG)
Chronopoulos and Gear (1989). Looks like HSCG, but is very similar to the 3-term recurrence CG (STCG). Reduces synchronizations per iteration to 1 by changing the computation of $\alpha_i$ and using an auxiliary recurrence for $Ap_i$:

$r_0 = b - Ax_0$, $p_0 = r_0$, $s_0 = Ap_0$, $\alpha_0 = (r_0, r_0)/(p_0, s_0)$
for i = 1:nmax
    $x_i = x_{i-1} + \alpha_{i-1} p_{i-1}$    [vector updates]
    $r_i = r_{i-1} - \alpha_{i-1} s_{i-1}$
    $w_i = Ar_i$    [SpMV]
    $\beta_i = \dfrac{(r_i, r_i)}{(r_{i-1}, r_{i-1})}$    [inner products]
    $\alpha_i = \dfrac{(r_i, r_i)}{(w_i, r_i) - (\beta_i / \alpha_{i-1})(r_i, r_i)}$
    $p_i = r_i + \beta_i p_{i-1}$    [vector updates]
    $s_i = w_i + \beta_i s_{i-1}$
end
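A serial sketch of ChG CG follows (function name illustrative); in a parallel code the two inner products below would be combined into a single MPI_Allreduce, giving one synchronization per iteration.

```python
import numpy as np

def chg_cg(A, b, x0, nmax):
    """Chronopoulos/Gear CG: one inner-product phase per iteration,
    with an auxiliary recurrence maintaining s_i = A p_i."""
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    s = A @ p
    rr = r @ r
    alpha = rr / (p @ s)
    for _ in range(nmax):
        x = x + alpha * p
        r = r - alpha * s
        w = A @ r                          # SpMV
        rr_new, wr = r @ r, w @ r          # single fused reduction
        beta = rr_new / rr
        alpha = rr_new / (wr - (beta / alpha) * rr_new)
        p = r + beta * p
        s = w + beta * s                   # s_i = A p_i without a second SpMV
        rr = rr_new
    return x
```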
Pipelined CG (GVCG)
The pipelined CG of Ghysels and Vanroose (2014) is similar to the Chronopoulos and Gear approach: it uses the auxiliary vector $s_i \equiv Ap_i$ and the same formula for $\alpha_i$. It also uses auxiliary vectors for $Ar_i$ and $A^2 r_i$ to remove the sequential dependency between the SpMV and the inner products. This allows the use of nonblocking (asynchronous) MPI communication to overlap the SpMV and the inner products, hiding the latency of global communications.
GVCG (Ghysels and Vanroose 2014)
$r_0 = b - Ax_0$, $p_0 = r_0$, $s_0 = Ap_0$, $w_0 = Ar_0$, $z_0 = Aw_0$, $\alpha_0 = r_0^T r_0 / p_0^T s_0$
for i = 1:nmax
    $x_i = x_{i-1} + \alpha_{i-1} p_{i-1}$    [vector updates]
    $r_i = r_{i-1} - \alpha_{i-1} s_{i-1}$
    $w_i = w_{i-1} - \alpha_{i-1} z_{i-1}$
    $q_i = Aw_i$    [SpMV]
    $\beta_i = \dfrac{r_i^T r_i}{r_{i-1}^T r_{i-1}}$    [inner products, overlapped with the SpMV]
    $\alpha_i = \dfrac{r_i^T r_i}{w_i^T r_i - (\beta_i / \alpha_{i-1}) r_i^T r_i}$
    $p_i = r_i + \beta_i p_{i-1}$    [vector updates]
    $s_i = w_i + \beta_i s_{i-1}$
    $z_i = q_i + \beta_i z_{i-1}$
end
The inner products and the SpMV $q_i = Aw_i$ are independent, so they can be overlapped (and, in the preconditioned variant, the preconditioner application is overlapped as well).
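A serial stand-in for GVCG is sketched below (names illustrative). In an MPI code, the reduction for the two inner products would be started nonblocking (e.g. with MPI_Iallreduce) and completed only after the SpMV $q = Aw$; here the two steps simply execute back to back.

```python
import numpy as np

def gvcg(A, b, x0, nmax):
    """Ghysels-Vanroose pipelined CG: auxiliary vectors s = Ap, w = Ar,
    z = A^2 r remove the dependency between inner products and the SpMV."""
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    s = A @ p
    w = A @ r
    z = A @ w
    rr = r @ r
    alpha = rr / (p @ s)
    for _ in range(nmax):
        x = x + alpha * p
        r = r - alpha * s
        w = w - alpha * z                  # keeps w = A r via a recurrence
        rr_new, wr = r @ r, w @ r          # start reduction (nonblocking in MPI)...
        q = A @ w                          # ...and overlap it with this SpMV
        beta = rr_new / rr
        alpha = rr_new / (wr - (beta / alpha) * rr_new)
        p = r + beta * p
        s = w + beta * s
        z = q + beta * z
        rr = rr_new
    return x
```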
89 Attainable accuracy of pipelined CG

What is the effect of adding auxiliary recurrences to the CG method? To isolate the effects, we consider a simplified version of a pipelined method. It uses the same update formulas for α and β as HSCG, but an additional recurrence for Ap_i:

r_0 = b − Ax_0, p_0 = r_0, s_0 = Ap_0
for i = 1:nmax
    α_{i−1} = (r_{i−1}, r_{i−1}) / (p_{i−1}, s_{i−1})
    x_i = x_{i−1} + α_{i−1} p_{i−1}
    r_i = r_{i−1} − α_{i−1} s_{i−1}
    β_i = (r_i, r_i) / (r_{i−1}, r_{i−1})
    p_i = r_i + β_i p_{i−1}
    s_i = A r_i + β_i s_{i−1}
end 24
92 Attainable accuracy of simple pipelined CG

In finite precision, the computed iterates satisfy

x_i = x_{i−1} + α_{i−1} p_{i−1} + δx_i
r_i = r_{i−1} − α_{i−1} s_{i−1} + δr_i

so the gap between the updated and true residuals satisfies

f_i = r_i − (b − A x_i)
    = f_{i−1} − α_{i−1} ( s_{i−1} − A p_{i−1} ) + δr_i + A δx_i
    = f_0 + Σ_{m=1}^{i} ( δr_m + A δx_m ) − G_i d_i,

where G_i = S_i − A P_i and d_i = (α_0, …, α_{i−1})^T. 25
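The gap f_i = r_i − (b − A x_i) derived above can be measured directly in a small experiment. The sketch below (function name mine, purely illustrative) runs the simplified pipelined method and records the gap at every iteration; on a well-conditioned problem the gap stays near machine precision, which is what bounds the attainable accuracy.

```python
import numpy as np

def simple_pipelined_cg_gap(A, b, maxiter=50):
    """Run the simplified pipelined CG (auxiliary recurrence s ~ A p)
    and return the history of the residual gap ||r_i - (b - A x_i)||.
    """
    n = b.size
    x = np.zeros(n)
    r = b.copy()
    p = r.copy()
    s = A @ p
    gaps = []
    for _ in range(maxiter):
        rr = r @ r
        alpha = rr / (p @ s)
        x = x + alpha * p
        r = r - alpha * s            # updated residual
        beta = (r @ r) / rr
        p = r + beta * p
        s = A @ r + beta * s         # auxiliary recurrence instead of A @ p
        gaps.append(np.linalg.norm(r - (b - A @ x)))   # the gap f_i
    return x, np.array(gaps)
```

Repeating this on an ill-conditioned matrix with oscillating residuals is where the gap, and hence the loss of attainable accuracy, becomes visible.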
98 Attainable accuracy of simple pipelined CG

The term G_i = S_i − A P_i can be bounded, up to O(ε) factors, in terms of κ(U_i) and ‖A‖ ‖P_i‖ + ‖A R_i‖:

‖G_i‖ ≤ O(ε) κ(U_i) ( ‖A‖ ‖P_i‖ + ‖A R_i‖ ),

where U_i is an upper bidiagonal matrix with unit diagonal whose superdiagonal entries are the coefficients β_j, so that its inverse is the full upper triangular matrix with entries given by the products

β_l β_{l+1} ⋯ β_j = ‖r_j‖² / ‖r_{l−1}‖², for l < j.

Residual oscillations can cause these factors to be large! Errors in computed recurrence coefficients can be amplified! This is very similar to the results for attainable accuracy in the 3-term (STCG) variant: a seemingly innocuous change can cause a drastic loss of accuracy. 26
102 Simple pipelined CG

(Figures) Convergence plots for simple pipelined CG, showing in turn: the effect of using the auxiliary vector s_i ≈ Ap_i; the effect of additionally changing the formula for the recurrence coefficient α; and the effect of using the auxiliary vectors s_i ≈ Ap_i, w_i ≈ Ar_i, z_i ≈ A²r_i. 27
106 s-step CG

Idea: Compute blocks of s iterations at once. Compute updates in a different basis and communicate every s iterations instead of every iteration. This reduces the number of synchronizations per iteration by a factor of s.

An idea rediscovered many times. First related work: s-dimensional steepest descent, least squares: Khabaza ('63), Forsythe ('68), Marchuk and Kuznecov ('68). Flurry of work on s-step Krylov methods in the '80s/early '90s: see, e.g., Van Rosendale (1983); Chronopoulos and Gear (1989). Resurgence of interest in recent years due to growing problem sizes and the growing relative cost of communication. 28
111 s-step CG

Key observation: After iteration i, for j ∈ {0, …, s},

x_{i+j} − x_i, r_{i+j}, p_{i+j} ∈ K_{s+1}(A, p_i) + K_s(A, r_i).

s steps of s-step CG:
Expand the solution space s dimensions at once: compute a basis matrix Y such that span(Y) = K_{s+1}(A, p_i) + K_s(A, r_i), satisfying the recurrence AY = YB.
Compute the inner products between basis vectors in one synchronization: G = Y^T Y.
Perform s iterations of vector updates by updating coordinates in the basis Y: x_{i+j} − x_i = Y x'_j, r_{i+j} = Y r'_j, p_{i+j} = Y p'_j. 29
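The basis and Gram matrix construction can be sketched concretely. In the snippet below (function name mine), Y spans K_{s+1}(A, p) + K_s(A, r) in the monomial basis; forming G = Y^T Y is the single synchronization point in a parallel code, and afterwards any inner product of vectors carried in coordinates, u = Y u', v = Y v', is the local O(s²) computation u'^T G v'.

```python
import numpy as np

def sstep_blocks(A, p, r, s):
    """Build the s-step basis Y = [p, Ap, ..., A^s p, r, Ar, ..., A^{s-1} r]
    spanning K_{s+1}(A, p) + K_s(A, r), and its Gram matrix G = Y^T Y."""
    cols = [p]
    for _ in range(s):                  # p, Ap, ..., A^s p
        cols.append(A @ cols[-1])
    cols.append(r)
    for _ in range(s - 1):              # r, Ar, ..., A^{s-1} r
        cols.append(A @ cols[-1])
    Y = np.column_stack(cols)           # n x (2s+1)
    G = Y.T @ Y                         # (2s+1) x (2s+1): one reduction
    return Y, G
```

After this point, inner products no longer need the length-n vectors at all, which is exactly what makes the inner loop communication-free.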
116 s-step CG

For s iterations of updates, the inner products and SpMVs (in the basis Y) can be computed independently by each processor without communication:

Ap_{i+j} = A Y p'_j = Y (B p'_j)    (a length-n SpMV becomes a local operation on O(s) coordinates)

(r_{i+j}, r_{i+j}) = r'_j^T Y^T Y r'_j = r'_j^T G r'_j    (a length-n inner product becomes a local O(s) computation) 30
123 s-step CG

r_0 = b − Ax_0, p_0 = r_0
for k = 0:nmax/s    (outer loop)
    Compute Y_k and B_k such that A Y_k = Y_k B_k and span(Y_k) = K_{s+1}(A, p_{sk}) + K_s(A, r_{sk})    (compute basis: O(s) SpMVs)
    G_k = Y_k^T Y_k    (O(s²) inner products, one synchronization)
    x'_0 = 0, r'_0 = e_{s+2}, p'_0 = e_1
    for j = 1:s    (inner loop, executed s times: local vector updates, no communication)
        α_{sk+j−1} = (r'_{j−1})^T G_k r'_{j−1} / (p'_{j−1})^T G_k B_k p'_{j−1}
        x'_j = x'_{j−1} + α_{sk+j−1} p'_{j−1}
        r'_j = r'_{j−1} − α_{sk+j−1} B_k p'_{j−1}
        β_{sk+j} = (r'_j)^T G_k r'_j / (r'_{j−1})^T G_k r'_{j−1}
        p'_j = r'_j + β_{sk+j} p'_{j−1}
    end
    [x_{s(k+1)} − x_{sk}, r_{s(k+1)}, p_{s(k+1)}] = Y_k [x'_s, r'_s, p'_s]
end 31
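A didactic NumPy sketch of this algorithm with the monomial basis follows. The names (`sstep_cg`, `shift_block`) and the explicit construction of B_k are mine; a production code would use a better-conditioned basis and a cheaper stopping test. The matrix B is block diagonal, one shift block per Krylov subspace, and applying it to a coordinate vector is the O(s) "SpMV in coordinates" from the previous slide.

```python
import numpy as np

def shift_block(m):
    """m x m matrix with ones on the first subdiagonal: maps the monomial
    coordinates of v to those of A v, valid while the highest-degree
    coordinate of the input is zero (which holds for j <= s steps)."""
    B = np.zeros((m, m))
    for i in range(m - 1):
        B[i + 1, i] = 1.0
    return B

def sstep_cg(A, b, s=3, nouter=60, tol=1e-8):
    """s-step CG with the monomial basis (illustrative sketch)."""
    n = b.size
    x = np.zeros(n)
    r = b.copy()
    p = r.copy()
    m = 2 * s + 1
    B = np.zeros((m, m))
    B[: s + 1, : s + 1] = shift_block(s + 1)   # block for K_{s+1}(A, p)
    B[s + 1 :, s + 1 :] = shift_block(s)       # block for K_s(A, r)
    bnorm = np.linalg.norm(b)
    for _ in range(nouter):
        # Basis for K_{s+1}(A, p) + K_s(A, r): O(s) SpMVs
        cols = [p]
        for _ in range(s):
            cols.append(A @ cols[-1])
        cols.append(r)
        for _ in range(s - 1):
            cols.append(A @ cols[-1])
        Y = np.column_stack(cols)
        G = Y.T @ Y                            # one synchronization
        xc = np.zeros(m)                       # coordinates of x - x_outer
        rc = np.zeros(m); rc[s + 1] = 1.0      # r = Y e_{s+2}
        pc = np.zeros(m); pc[0] = 1.0          # p = Y e_1
        for _ in range(s):                     # communication-free inner loop
            Bp = B @ pc                        # SpMV in coordinates, O(s)
            rr = rc @ G @ rc
            alpha = rr / (pc @ G @ Bp)
            xc = xc + alpha * pc
            rc = rc - alpha * Bp
            beta = (rc @ G @ rc) / rr
            pc = rc + beta * pc
        x = x + Y @ xc                         # recover CG vectors
        r = Y @ rc
        p = Y @ pc
        if np.linalg.norm(r) <= tol * bnorm:
            break
    return x
```

In exact arithmetic each inner iteration reproduces one HSCG step; in finite precision the conditioning of Y enters the error bounds, as the next slides quantify.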
128 Sources of local roundoff error in s-step CG

Computing the s-step Krylov subspace basis:
A Y_k = Y_k B_k + ΔY_k    (error in computing the s-step basis)

Updating coordinate vectors in the inner loop:
x'_{k,j} = x'_{k,j−1} + q'_{k,j−1} + ξ_{k,j}, with q'_{k,j−1} = fl(α_{sk+j−1} p'_{k,j−1})
r'_{k,j} = r'_{k,j−1} − B_k q'_{k,j−1} + η_{k,j}    (error in updating coefficient vectors)

Recovering CG vectors for use in the next outer loop:
x_{sk+j} = Y_k x'_{k,j} + x_{sk} + φ_{sk+j}
r_{sk+j} = Y_k r'_{k,j} + ψ_{sk+j}    (error in basis change) 32
132 Attainable accuracy of s-step CG

We can write the gap between the true and updated residuals, f, in terms of these errors:

f_{sk+j} = f_0 − Σ_{l=0}^{k−1} [ A φ_{sl+s} + ψ_{sl+s} + Σ_{i=1}^{s} ( A Y_l ξ_{l,i} + Y_l η_{l,i} − ΔY_l q'_{l,i−1} ) ]
           − A φ_{sk+j} − ψ_{sk+j} − Σ_{i=1}^{j} ( A Y_k ξ_{k,i} + Y_k η_{k,i} − ΔY_k q'_{k,i−1} )

Using standard rounding error results, this allows us to obtain an upper bound on ‖f_{sk+j}‖. 33
136 Attainable accuracy of s-step CG

Let f_i = b − A x_i − r_i denote the gap between the true and updated residuals.

For CG:
‖f_i‖ ≤ ‖f_0‖ + ε Σ_{m=1}^{i} (1 + N) ( ‖A‖ ‖x_m‖ + ‖r_m‖ )

For s-step CG, with i = sk + j:
‖f_{sk+j}‖ ≤ ‖f_0‖ + ε c Γ̄_k Σ_{m=1}^{sk+j} (1 + N) ( ‖A‖ ‖x_m‖ + ‖r_m‖ ),

where c is a low-degree polynomial in s, and Γ̄_k = max_{l≤k} Γ_l with Γ_l = ‖Y_l^+‖ ‖Y_l‖ measuring the conditioning of the computed basis (see C., 2015). 34
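The CG bound can be checked numerically. The sketch below (function name mine, constants simplified; a numerical illustration rather than a proof) runs HSCG, accumulates the right-hand side ε Σ (1 + N)(‖A‖‖x_m‖ + ‖r_m‖), and compares it with the measured gap ‖r_i − (b − A x_i)‖.

```python
import numpy as np

def hscg_gap_vs_bound(A, b, maxiter=40):
    """Run HSCG; return measured residual-gap history and the accumulated
    bound eps * sum_m (1 + N)(||A|| ||x_m|| + ||r_m||) (with f_0 = 0)."""
    eps = np.finfo(float).eps
    N = b.size
    normA = np.linalg.norm(A, 2)
    x = np.zeros(N)
    r = b.copy()
    p = r.copy()
    bound = 0.0
    gap, bnd = [], []
    for _ in range(maxiter):
        Ap = A @ p
        rr = r @ r
        alpha = rr / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        beta = (r @ r) / rr
        p = r + beta * p
        bound += eps * (1 + N) * (normA * np.linalg.norm(x) + np.linalg.norm(r))
        gap.append(np.linalg.norm(r - (b - A @ x)))   # measured gap
        bnd.append(bound)
    return np.array(gap), np.array(bnd)
```

The accumulated bound grows monotonically with the iteration count, while the measured gap typically grows much more slowly, which is consistent with the bound being a worst-case estimate.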
138 s-step CG

(Figures) Convergence of s-step CG with the monomial basis Y = [p_i, Ap_i, …, A^s p_i, r_i, Ar_i, …, A^{s−1} r_i] for several values of s. One can also use other, more well-conditioned bases to improve the convergence rate and accuracy (see, e.g., Philippe and Reichel, 2012). 35
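The motivation for better-conditioned bases is easy to demonstrate: the condition number of the monomial basis [v, Av, …, A^s v] can grow exponentially in s, since the columns align with the dominant eigendirections. The numbers below are illustrative, not from the talk.

```python
import numpy as np

def monomial_basis_cond(A, v, s):
    """2-norm condition number of the monomial basis [v, Av, ..., A^s v]."""
    cols = [v]
    for _ in range(s):
        cols.append(A @ cols[-1])
    Y = np.column_stack(cols)
    sv = np.linalg.svd(Y, compute_uv=False)   # works for rectangular Y
    return sv[0] / sv[-1]
```

Scaled Newton or Chebyshev bases keep this growth under control, which is why they improve both convergence and attainable accuracy in practice.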
142 s-step CG Even assuming perfect parallel scalability with s (which is usually not the case due to extra SpMVs and inner products), already at s = 4 we are worse than HSCG in terms of number of synchronizations! 36
143 A different problem...

A: nos4 from the UFSMC (SuiteSparse Matrix Collection), b: equal components in the eigenbasis of A with ‖b‖ = 1; N = 100, κ(A) ≈ 2e3. If the application only requires ‖x − x_i‖_A ≤ 10^{−10}, any of these methods will work! 37
148 Performance benefit from s-step BICGSTAB

Speedups for real applications: an s-step BICGSTAB bottom-solver implemented in BoxLib (the AMR framework from LBL), applied to the Low Mach Number Combustion Code (LMC), a gas-phase combustion simulation. Compared GMG with BICGSTAB vs. GMG with s-step BICGSTAB (s = 4) on a Cray XE6 for two different applications, in both flat MPI and MPI+OpenMP configurations. Up to 2.5x speedup in the bottom solve and up to 1.5x in the overall MG solve (chart: LMC 3D mac_project solve, bottom solver vs. MG solver overall). (See Williams et al., IPDPS 2014.) 38
149 Conclusions and takeaways

Think of the bigger picture. Much focus is on modifying methods to speed up iterations, but the speed of an iteration is only part of the runtime: runtime = (time/iteration) × (#iterations). And a solver that can't solve to the required accuracy is useless! Solver runtime is only part of a larger scientific code.

Design and implementation of iterative solvers requires a holistic approach: selecting the right method, parameters, and stopping criteria; selecting the right preconditioner (closely linked with the discretization! see Málek and Strakoš, 2015). 39
Applied Mathematics 205 Unit V: Eigenvalue Problems Lecturer: Dr. David Knezevic Unit V: Eigenvalue Problems Chapter V.4: Krylov Subspace Methods 2 / 51 Krylov Subspace Methods In this chapter we give
More informationOn the influence of eigenvalues on Bi-CG residual norms
On the influence of eigenvalues on Bi-CG residual norms Jurjen Duintjer Tebbens Institute of Computer Science Academy of Sciences of the Czech Republic duintjertebbens@cs.cas.cz Gérard Meurant 30, rue
More informationNotes on Some Methods for Solving Linear Systems
Notes on Some Methods for Solving Linear Systems Dianne P. O Leary, 1983 and 1999 and 2007 September 25, 2007 When the matrix A is symmetric and positive definite, we have a whole new class of algorithms
More informationON EFFICIENT NUMERICAL APPROXIMATION OF THE BILINEAR FORM c A 1 b
ON EFFICIENT NUMERICAL APPROXIMATION OF THE ILINEAR FORM c A 1 b ZDENĚK STRAKOŠ AND PETR TICHÝ Abstract. Let A C N N be a nonsingular complex matrix and b and c complex vectors of length N. The goal of
More informationConjugate Gradient (CG) Method
Conjugate Gradient (CG) Method by K. Ozawa 1 Introduction In the series of this lecture, I will introduce the conjugate gradient method, which solves efficiently large scale sparse linear simultaneous
More informationIDR(s) Master s thesis Goushani Kisoensingh. Supervisor: Gerard L.G. Sleijpen Department of Mathematics Universiteit Utrecht
IDR(s) Master s thesis Goushani Kisoensingh Supervisor: Gerard L.G. Sleijpen Department of Mathematics Universiteit Utrecht Contents 1 Introduction 2 2 The background of Bi-CGSTAB 3 3 IDR(s) 4 3.1 IDR.............................................
More informationRecent computational developments in Krylov Subspace Methods for linear systems. Valeria Simoncini and Daniel B. Szyld
Recent computational developments in Krylov Subspace Methods for linear systems Valeria Simoncini and Daniel B. Szyld A later version appeared in : Numerical Linear Algebra w/appl., 2007, v. 14(1), pp.1-59.
More information1. Introduction. Krylov subspace methods are used extensively for the iterative solution of linear systems of equations Ax = b.
SIAM J. SCI. COMPUT. Vol. 31, No. 2, pp. 1035 1062 c 2008 Society for Industrial and Applied Mathematics IDR(s): A FAMILY OF SIMPLE AND FAST ALGORITHMS FOR SOLVING LARGE NONSYMMETRIC SYSTEMS OF LINEAR
More informationA stable variant of Simpler GMRES and GCR
A stable variant of Simpler GMRES and GCR Miroslav Rozložník joint work with Pavel Jiránek and Martin H. Gutknecht Institute of Computer Science, Czech Academy of Sciences, Prague, Czech Republic miro@cs.cas.cz,
More informationIterative Methods for Solving A x = b
Iterative Methods for Solving A x = b A good (free) online source for iterative methods for solving A x = b is given in the description of a set of iterative solvers called templates found at netlib: http
More informationCommunication Avoiding Strategies for the Numerical Kernels in Coupled Physics Simulations
ExaScience Lab Intel Labs Europe EXASCALE COMPUTING Communication Avoiding Strategies for the Numerical Kernels in Coupled Physics Simulations SIAM Conference on Parallel Processing for Scientific Computing
More informationSolving large sparse Ax = b.
Bob-05 p.1/69 Solving large sparse Ax = b. Stopping criteria, & backward stability of MGS-GMRES. Chris Paige (McGill University); Miroslav Rozložník & Zdeněk Strakoš (Academy of Sciences of the Czech Republic)..pdf
More informationA Domain Decomposition Based Jacobi-Davidson Algorithm for Quantum Dot Simulation
A Domain Decomposition Based Jacobi-Davidson Algorithm for Quantum Dot Simulation Tao Zhao 1, Feng-Nan Hwang 2 and Xiao-Chuan Cai 3 Abstract In this paper, we develop an overlapping domain decomposition
More informationComposite convergence bounds based on Chebyshev polynomials and finite precision conjugate gradient computations
Numerical Algorithms manuscript No. (will be inserted by the editor) Composite convergence bounds based on Chebyshev polynomials and finite precision conjugate gradient computations Tomáš Gergelits Zdeněk
More informationExploiting Low-Rank Structure in Computing Matrix Powers with Applications to Preconditioning
Exploiting Low-Rank Structure in Computing Matrix Powers with Applications to Preconditioning Erin C. Carson, Nicholas Knight, James Demmel, Ming Gu U.C. Berkeley SIAM PP 12, Savannah, Georgia, USA, February
More informationThe Conjugate Gradient Method
The Conjugate Gradient Method Jason E. Hicken Aerospace Design Lab Department of Aeronautics & Astronautics Stanford University 14 July 2011 Lecture Objectives describe when CG can be used to solve Ax
More informationAMS526: Numerical Analysis I (Numerical Linear Algebra)
AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 21: Sensitivity of Eigenvalues and Eigenvectors; Conjugate Gradient Method Xiangmin Jiao Stony Brook University Xiangmin Jiao Numerical Analysis
More informationIntroduction to Iterative Solvers of Linear Systems
Introduction to Iterative Solvers of Linear Systems SFB Training Event January 2012 Prof. Dr. Andreas Frommer Typeset by Lukas Krämer, Simon-Wolfgang Mages and Rudolf Rödl 1 Classes of Matrices and their
More informationB. Reps 1, P. Ghysels 2, O. Schenk 4, K. Meerbergen 3,5, W. Vanroose 1,3. IHP 2015, Paris FRx
Communication Avoiding and Hiding in preconditioned Krylov solvers 1 Applied Mathematics, University of Antwerp, Belgium 2 Future Technology Group, Berkeley Lab, USA 3 Intel ExaScience Lab, Belgium 4 USI,
More informationMinisymposia 9 and 34: Avoiding Communication in Linear Algebra. Jim Demmel UC Berkeley bebop.cs.berkeley.edu
Minisymposia 9 and 34: Avoiding Communication in Linear Algebra Jim Demmel UC Berkeley bebop.cs.berkeley.edu Motivation (1) Increasing parallelism to exploit From Top500 to multicores in your laptop Exponentially
More informationPerformance Evaluation of GPBiCGSafe Method without Reverse-Ordered Recurrence for Realistic Problems
Performance Evaluation of GPBiCGSafe Method without Reverse-Ordered Recurrence for Realistic Problems Seiji Fujino, Takashi Sekimoto Abstract GPBiCG method is an attractive iterative method for the solution
More informationOn prescribing Ritz values and GMRES residual norms generated by Arnoldi processes
On prescribing Ritz values and GMRES residual norms generated by Arnoldi processes Jurjen Duintjer Tebbens Institute of Computer Science Academy of Sciences of the Czech Republic joint work with Gérard
More informationIDR(s) A family of simple and fast algorithms for solving large nonsymmetric systems of linear equations
IDR(s) A family of simple and fast algorithms for solving large nonsymmetric systems of linear equations Harrachov 2007 Peter Sonneveld en Martin van Gijzen August 24, 2007 1 Delft University of Technology
More information1 Extrapolation: A Hint of Things to Come
Notes for 2017-03-24 1 Extrapolation: A Hint of Things to Come Stationary iterations are simple. Methods like Jacobi or Gauss-Seidel are easy to program, and it s (relatively) easy to analyze their convergence.
More informationITERATIVE METHODS BASED ON KRYLOV SUBSPACES
ITERATIVE METHODS BASED ON KRYLOV SUBSPACES LONG CHEN We shall present iterative methods for solving linear algebraic equation Au = b based on Krylov subspaces We derive conjugate gradient (CG) method
More informationKRYLOV SUBSPACE ITERATION
KRYLOV SUBSPACE ITERATION Presented by: Nab Raj Roshyara Master and Ph.D. Student Supervisors: Prof. Dr. Peter Benner, Department of Mathematics, TU Chemnitz and Dipl.-Geophys. Thomas Günther 1. Februar
More informationLecture 9: Krylov Subspace Methods. 2 Derivation of the Conjugate Gradient Algorithm
CS 622 Data-Sparse Matrix Computations September 19, 217 Lecture 9: Krylov Subspace Methods Lecturer: Anil Damle Scribes: David Eriksson, Marc Aurele Gilles, Ariah Klages-Mundt, Sophia Novitzky 1 Introduction
More informationA DISSERTATION. Extensions of the Conjugate Residual Method. by Tomohiro Sogabe. Presented to
A DISSERTATION Extensions of the Conjugate Residual Method ( ) by Tomohiro Sogabe Presented to Department of Applied Physics, The University of Tokyo Contents 1 Introduction 1 2 Krylov subspace methods
More informationIterative Methods and Multigrid
Iterative Methods and Multigrid Part 3: Preconditioning 2 Eric de Sturler Preconditioning The general idea behind preconditioning is that convergence of some method for the linear system Ax = b can be
More informationSolutions and Notes to Selected Problems In: Numerical Optimzation by Jorge Nocedal and Stephen J. Wright.
Solutions and Notes to Selected Problems In: Numerical Optimzation by Jorge Nocedal and Stephen J. Wright. John L. Weatherwax July 7, 2010 wax@alum.mit.edu 1 Chapter 5 (Conjugate Gradient Methods) Notes
More informationA short course on: Preconditioned Krylov subspace methods. Yousef Saad University of Minnesota Dept. of Computer Science and Engineering
A short course on: Preconditioned Krylov subspace methods Yousef Saad University of Minnesota Dept. of Computer Science and Engineering Universite du Littoral, Jan 19-3, 25 Outline Part 1 Introd., discretization
More informationBlock Bidiagonal Decomposition and Least Squares Problems
Block Bidiagonal Decomposition and Least Squares Problems Åke Björck Department of Mathematics Linköping University Perspectives in Numerical Analysis, Helsinki, May 27 29, 2008 Outline Bidiagonal Decomposition
More informationParallel Numerics, WT 2016/ Iterative Methods for Sparse Linear Systems of Equations. page 1 of 1
Parallel Numerics, WT 2016/2017 5 Iterative Methods for Sparse Linear Systems of Equations page 1 of 1 Contents 1 Introduction 1.1 Computer Science Aspects 1.2 Numerical Problems 1.3 Graphs 1.4 Loop Manipulations
More informationSOLVING SPARSE LINEAR SYSTEMS OF EQUATIONS. Chao Yang Computational Research Division Lawrence Berkeley National Laboratory Berkeley, CA, USA
1 SOLVING SPARSE LINEAR SYSTEMS OF EQUATIONS Chao Yang Computational Research Division Lawrence Berkeley National Laboratory Berkeley, CA, USA 2 OUTLINE Sparse matrix storage format Basic factorization
More informationEIGIFP: A MATLAB Program for Solving Large Symmetric Generalized Eigenvalue Problems
EIGIFP: A MATLAB Program for Solving Large Symmetric Generalized Eigenvalue Problems JAMES H. MONEY and QIANG YE UNIVERSITY OF KENTUCKY eigifp is a MATLAB program for computing a few extreme eigenvalues
More informationThe Conjugate Gradient Method for Solving Linear Systems of Equations
The Conjugate Gradient Method for Solving Linear Systems of Equations Mike Rambo Mentor: Hans de Moor May 2016 Department of Mathematics, Saint Mary s College of California Contents 1 Introduction 2 2
More informationComparison of Fixed Point Methods and Krylov Subspace Methods Solving Convection-Diffusion Equations
American Journal of Computational Mathematics, 5, 5, 3-6 Published Online June 5 in SciRes. http://www.scirp.org/journal/ajcm http://dx.doi.org/.436/ajcm.5.5 Comparison of Fixed Point Methods and Krylov
More informationBLOCK KRYLOV SPACE METHODS FOR LINEAR SYSTEMS WITH MULTIPLE RIGHT-HAND SIDES: AN INTRODUCTION
BLOCK KRYLOV SPACE METHODS FOR LINEAR SYSTEMS WITH MULTIPLE RIGHT-HAND SIDES: AN INTRODUCTION MARTIN H. GUTKNECHT / VERSION OF February 28, 2006 Abstract. In a number of applications in scientific computing
More informationCourse Notes: Week 4
Course Notes: Week 4 Math 270C: Applied Numerical Linear Algebra 1 Lecture 9: Steepest Descent (4/18/11) The connection with Lanczos iteration and the CG was not originally known. CG was originally derived
More informationPreconditioning Techniques Analysis for CG Method
Preconditioning Techniques Analysis for CG Method Huaguang Song Department of Computer Science University of California, Davis hso@ucdavis.edu Abstract Matrix computation issue for solve linear system
More informationFEM and sparse linear system solving
FEM & sparse linear system solving, Lecture 9, Nov 19, 2017 1/36 Lecture 9, Nov 17, 2017: Krylov space methods http://people.inf.ethz.ch/arbenz/fem17 Peter Arbenz Computer Science Department, ETH Zürich
More informationOn the interplay between discretization and preconditioning of Krylov subspace methods
On the interplay between discretization and preconditioning of Krylov subspace methods Josef Málek and Zdeněk Strakoš Nečas Center for Mathematical Modeling Charles University in Prague and Czech Academy
More informationThe solution of the discretized incompressible Navier-Stokes equations with iterative methods
The solution of the discretized incompressible Navier-Stokes equations with iterative methods Report 93-54 C. Vuik Technische Universiteit Delft Delft University of Technology Faculteit der Technische
More informationHOW TO MAKE SIMPLER GMRES AND GCR MORE STABLE
HOW TO MAKE SIMPLER GMRES AND GCR MORE STABLE PAVEL JIRÁNEK, MIROSLAV ROZLOŽNÍK, AND MARTIN H. GUTKNECHT Abstract. In this paper we analyze the numerical behavior of several minimum residual methods, which
More informationSome minimization problems
Week 13: Wednesday, Nov 14 Some minimization problems Last time, we sketched the following two-step strategy for approximating the solution to linear systems via Krylov subspaces: 1. Build a sequence of
More information