High performance Krylov subspace method variants


1 High performance Krylov subspace method variants and their behavior in finite precision
Erin Carson, New York University
HPCSE17, May 24, 2017

2 Collaborators
Miroslav Rozložník, Institute of Computer Science, Czech Academy of Sciences
Zdeněk Strakoš, Department of Numerical Mathematics, Faculty of Mathematics and Physics, Charles University
Petr Tichý, Department of Numerical Mathematics, Faculty of Mathematics and Physics, Charles University
Miroslav Tůma, Department of Numerical Mathematics, Faculty of Mathematics and Physics, Charles University
Preprint NCMM/2016/08

3 Conjugate Gradient method for solving Ax = b
double precision (ε = 2^{-53})
‖x_i - x‖_A = ((x_i - x)^T A (x_i - x))^{1/2}
x_i = x_{i-1} + α_{i-1} p_{i-1},  r_i = r_{i-1} - α_{i-1} A p_{i-1},  p_i = r_i + β_i p_{i-1}


5 Krylov subspace methods
Linear systems Ax = b, eigenvalue problems, singular value problems, least squares, etc.
Best for: A large and very sparse, stored implicitly, or only an approximation needed
A Krylov subspace method is a projection process onto the Krylov subspace
K_i(A, r_0) = span{r_0, Ar_0, A²r_0, ..., A^{i-1} r_0}
where A is an N x N matrix and r_0 = b - Ax_0 is a length-N vector
In each iteration:
Add a dimension to the Krylov subspace: forms a nested sequence of Krylov subspaces K_1(A, r_0) ⊆ K_2(A, r_0) ⊆ ... ⊆ K_i(A, r_0)
Orthogonalize (with respect to some C_i): select the approximate solution x_i ∈ x_0 + K_i(A, r_0) such that r_i = b - Ax_i ⟂ C_i
Ex: Lanczos/Conjugate Gradient (CG), Arnoldi/Generalized Minimum Residual (GMRES), Biconjugate Gradient (BICG), BICGSTAB, GKL, LSQR, etc.


11 The conjugate gradient method
A is symmetric positive definite, C_i = K_i(A, r_0):
r_i ⟂ K_i(A, r_0),  ‖x - x_i‖_A = min_{z ∈ x_0 + K_i(A, r_0)} ‖x - z‖_A
Finite termination in exact arithmetic: r_{N+1} = 0
Connection with Lanczos: with v_1 = r_0/‖r_0‖, i iterations of Lanczos produce the N x i matrix V_i = [v_1, ..., v_i] and the i x i tridiagonal matrix T_i such that
A V_i = V_i T_i + δ_{i+1} v_{i+1} e_i^T,  T_i = V_i^T A V_i
The CG approximation x_i is obtained by solving the reduced model
T_i y_i = ‖r_0‖ e_1,  x_i = x_0 + V_i y_i
Connections with orthogonal polynomials, the Stieltjes problem of moments, Gauss-Christoffel quadrature, and others (see the 2013 book of Liesen and Strakoš)
CG (and other Krylov subspace methods) are highly nonlinear: good for convergence, bad for ease of finite precision analysis
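
To make the Lanczos connection concrete, here is a minimal NumPy sketch (my own illustration, not from the talk; the function and variable names are assumptions): it runs i Lanczos steps without reorthogonalization and recovers the CG iterate from the reduced model T_i y_i = ‖r_0‖ e_1, x_i = x_0 + V_i y_i.

```python
import numpy as np

def lanczos_cg_iterate(A, b, x0, i):
    """Recover the i-th CG iterate from i steps of Lanczos (sketch only).

    Assumes A is symmetric positive definite and i <= N; no reorthogonalization.
    """
    r0 = b - A @ x0
    beta0 = np.linalg.norm(r0)
    n = len(b)
    V = np.zeros((n, i + 1))
    alphas = np.zeros(i)          # diagonal of T_i
    betas = np.zeros(i)           # off-diagonals of T_i
    V[:, 0] = r0 / beta0
    for j in range(i):
        w = A @ V[:, j]
        if j > 0:
            w -= betas[j - 1] * V[:, j - 1]
        alphas[j] = V[:, j] @ w
        w -= alphas[j] * V[:, j]
        betas[j] = np.linalg.norm(w)
        V[:, j + 1] = w / betas[j]
    # Reduced model: T_i y_i = ||r_0|| e_1, then x_i = x_0 + V_i y_i
    T = np.diag(alphas) + np.diag(betas[:i - 1], 1) + np.diag(betas[:i - 1], -1)
    rhs = np.zeros(i)
    rhs[0] = beta0
    y = np.linalg.solve(T, rhs)
    return x0 + V[:, :i] @ y
```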


14 Implementation of CG
Standard implementation due to Hestenes and Stiefel (1952) (HSCG)
Uses three 2-term recurrences for updating x_i, r_i, p_i:
r_0 = b - Ax_0, p_0 = r_0
for i = 1:nmax
    α_{i-1} = r_{i-1}^T r_{i-1} / (p_{i-1}^T A p_{i-1})    [minimizes ‖x - x_i‖_A along the line z(α) = x_{i-1} + α p_{i-1}]
    x_i = x_{i-1} + α_{i-1} p_{i-1}
    r_i = r_{i-1} - α_{i-1} A p_{i-1}
    β_i = r_i^T r_i / (r_{i-1}^T r_{i-1})
    p_i = r_i + β_i p_{i-1}
end
If p_i ⟂_A p_j for i ≠ j, the 1-dimensional minimizations in each iteration give an i-dimensional minimization over the whole subspace x_0 + K_i(A, r_0) = x_0 + span{p_0, ..., p_{i-1}}
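
A minimal NumPy sketch of HSCG as written above (my own transcription; the function name, dense storage, and the simple stopping test are assumptions, not part of the talk):

```python
import numpy as np

def hscg(A, b, x0, nmax, tol=1e-12):
    """Hestenes-Stiefel CG (HSCG): three coupled 2-term recurrences.

    Sketch only: dense matrix, no preconditioning, simple relative stopping test.
    """
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    rr_old = r @ r
    for _ in range(nmax):
        Ap = A @ p                      # SpMV
        alpha = rr_old / (p @ Ap)       # inner product (synchronization)
        x = x + alpha * p               # vector updates
        r = r - alpha * Ap
        rr_new = r @ r                  # inner product (synchronization)
        if np.sqrt(rr_new) <= tol * np.linalg.norm(b):
            break
        beta = rr_new / rr_old
        p = r + beta * p
        rr_old = rr_new
    return x
```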


18 Communication in CG
The projection process in terms of communication:
Add a dimension to K_i: sparse matrix-vector multiplication (SpMV). Must communicate vector entries with neighboring processors (P2P communication).
Orthogonalize with respect to C_i: inner products, i.e. a global synchronization (MPI_Allreduce). All processors must exchange data and wait for all communication to finish before proceeding.
Dependencies between communication-bound kernels in each iteration limit performance!

19 Communication in HSCG
r_0 = b - Ax_0, p_0 = r_0
for i = 1:nmax
    α_{i-1} = r_{i-1}^T r_{i-1} / (p_{i-1}^T A p_{i-1})    [SpMV + inner products]
    x_i = x_{i-1} + α_{i-1} p_{i-1}                         [vector updates]
    r_i = r_{i-1} - α_{i-1} A p_{i-1}
    β_i = r_i^T r_i / (r_{i-1}^T r_{i-1})                   [inner products]
    p_i = r_i + β_i p_{i-1}                                 [vector updates]
end
Each iteration: SpMV, inner products, vector updates, inner products, vector updates



28 Future exascale systems
                                    Petascale Systems (2009)   Predicted Exascale Systems   Factor Improvement
System Peak                                                                                 ~1000
Node Memory Bandwidth               25 GB/s                    TB/s
Total Node Interconnect Bandwidth   GB/s                       GB/s                         ~100
Memory Latency                      100 ns                     50 ns                        ~1
Interconnect Latency                                                                        ~1
* Sources: from P. Beckman (ANL), J. Shalf (LBL), and D. Unat (LBL)
Gaps between communication and computation cost are only growing larger in future systems
Reducing time spent moving data/waiting for data will be essential for applications at exascale!


32 Synchronization-reducing variants
Communication cost has motivated many approaches to reducing synchronization in CG:
Early work: CG with a single synchronization point per iteration
- 3-term recurrence CG
- Using modified computation of recurrence coefficients
- Using auxiliary vectors
Pipelined Krylov subspace methods
- Use modified coefficients and auxiliary vectors to reduce synchronization points to 1 per iteration
- Modifications also allow decoupling of the SpMV and the inner products, which enables overlapping
s-step Krylov subspace methods
- Compute iterations in blocks of s using a different Krylov subspace basis
- Enables one synchronization per s iterations


36 The effects of finite precision
Well-known that roundoff error has two effects:
1. Delay of convergence
- No longer have an exact Krylov subspace
- The computed basis can lose linear independence (numerical rank deficiency)
- Residuals no longer orthogonal, so the minimization is no longer exact!
2. Loss of attainable accuracy
- Rounding errors cause the true residual b - Ax_i and the updated residual r_i to deviate!
[Convergence plot. A: bcsstk03 from UFSMC, b: equal components in the eigenbasis of A and ‖b‖ = 1; N = 112, κ(A) ≈ 7e6]
Much work on these results for CG; see Meurant and Strakoš (2006) for a thorough summary of early developments in the finite precision analysis of Lanczos and CG


41 Optimizing high performance iterative solvers
Synchronization-reducing variants are designed to reduce the time per iteration
But this is not the whole story!
What we really want to minimize is the runtime, subject to some constraint on accuracy:
runtime = (time/iteration) x (# iterations)
Changes to how the recurrences are computed can exacerbate the finite precision effects of convergence delay and loss of accuracy
Crucial that we understand and take into account how algorithm modifications will affect the convergence rate and attainable accuracy!


46 Maximum attainable accuracy
Accuracy depends on the size of the true residual: ‖b - A x_i‖
Rounding errors cause the true residual, b - A x_i, and the updated residual, r_i, to deviate
Writing b - A x_i = r_i + (b - A x_i - r_i),
‖b - A x_i‖ ≤ ‖r_i‖ + ‖b - A x_i - r_i‖
As ‖r_i‖ → 0, ‖b - A x_i‖ depends on ‖b - A x_i - r_i‖
Many results on bounding attainable accuracy, e.g.: Greenbaum (1989, 1994, 1997), Sleijpen, van der Vorst and Fokkema (1994), Sleijpen, van der Vorst and Modersitzki (2001), Björck, Elfving and Strakoš (1998), and Gutknecht and Strakoš (2000)


52 Maximum attainable accuracy of HSCG
In finite precision HSCG, the iterates are updated by
x_i = x_{i-1} + α_{i-1} p_{i-1} - δx_i  and  r_i = r_{i-1} - α_{i-1} A p_{i-1} - δr_i
Let f_i ≡ b - A x_i - r_i. Then
f_i = b - A(x_{i-1} + α_{i-1} p_{i-1} - δx_i) - (r_{i-1} - α_{i-1} A p_{i-1} - δr_i)
    = f_{i-1} + A δx_i + δr_i
    = f_0 + Σ_{m=1}^{i} (A δx_m + δr_m)
Bounds on the residual gap:
‖f_i‖ ≤ O(ε) Σ_{m=0}^{i} (N_A ‖A‖ ‖x_m‖ + ‖r_m‖)    [van der Vorst and Ye, 2000]
‖f_i‖ ≤ O(ε) ‖A‖ (‖x‖ + max_{m=0,...,i} ‖x_m‖)    [Greenbaum, 1997]
‖f_i‖ ≤ O(ε) N_A ‖A‖ ‖A^{-1}‖ Σ_{m=0}^{i} ‖r_m‖    [Sleijpen and van der Vorst]
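
The residual gap is easy to observe numerically. A small sketch (my own illustration, not from the talk): it runs HSCG and records the updated residual ‖r_i‖, the true residual ‖b - A x_i‖, and their difference; once the gap reaches the level of the attainable accuracy, the true residual stagnates while the updated residual keeps decreasing.

```python
import numpy as np

def hscg_residual_gap(A, b, x0, nmax):
    """Track updated residual, true residual, and the gap f_i for HSCG."""
    x, r = x0.copy(), b - A @ x0
    p = r.copy()
    hist = []
    for _ in range(nmax):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)
        x = x + alpha * p
        r_new = r - alpha * Ap
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p
        r = r_new
        true_res = b - A @ x
        hist.append((np.linalg.norm(r),               # updated (recursively computed) residual
                     np.linalg.norm(true_res),        # true residual
                     np.linalg.norm(true_res - r)))   # gap f_i = (b - A x_i) - r_i
    return np.array(hist)
```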


55 Early approaches to reducing synchronization
Goal: Reduce the 2 synchronization points per iteration in HSCG to 1 synchronization point per iteration
Compute β_i from α_{i-1} and Ap_{i-1} using the relation
‖r_i‖_2^2 = α_{i-1}^2 ‖Ap_{i-1}‖_2^2 - ‖r_{i-1}‖_2^2
Can then also merge the updates of x_i, r_i, and p_i
Developed independently by Johnson (1983, 1984), van Rosendale (1983, 1984), Saad (1985); many other similar approaches
Could also compute α_{i-1} from β_{i-1}:
α_{i-1} = ( r_{i-1}^T A r_{i-1} / (r_{i-1}^T r_{i-1}) - β_{i-1}/α_{i-2} )^{-1}
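
For reference, the relation used to compute β_i follows in one line from the residual recurrence and the exact-arithmetic orthogonality r_i ⟂ r_{i-1} (a sketch of the standard argument, not reproduced from the slides):

```latex
% In exact arithmetic, r_i^T r_{i-1} = 0 and r_i = r_{i-1} - \alpha_{i-1} A p_{i-1}, so
% r_{i-1}^T A p_{i-1} = \|r_{i-1}\|^2 / \alpha_{i-1}, and therefore
\|r_i\|^2 = \|r_{i-1}\|^2 - 2\alpha_{i-1}\, r_{i-1}^T A p_{i-1} + \alpha_{i-1}^2 \|A p_{i-1}\|^2
          = \alpha_{i-1}^2 \|A p_{i-1}\|^2 - \|r_{i-1}\|^2 .
```

Thus, once α_{i-1} and ‖Ap_{i-1}‖^2 are available, β_i = ‖r_i‖^2/‖r_{i-1}‖^2 can be formed without waiting for a second global reduction.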


60 Modified recurrence coefficient computation
What is the effect of changing the way the recurrence coefficients (α and β) are computed in HSCG?
Notice that neither α nor β appears in the bounds on ‖f_i‖:
f_i = b - A x_i - r_i = b - A(x_{i-1} + α_{i-1} p_{i-1} - δx_i) - (r_{i-1} - α_{i-1} A p_{i-1} - δr_i)
As long as the same α_{i-1} is used in updating x_i and r_i,
f_i = f_{i-1} + A δx_i + δr_i
still holds
Rounding errors made in computing α_{i-1} do not contribute to the residual gap
But they may change the computed x_i, r_i, which can affect the convergence rate...

61 Modified recurrence coefficient computation
Example: HSCG with the modified formula for α_{i-1}:
α_{i-1} = ( r_{i-1}^T A r_{i-1} / (r_{i-1}^T r_{i-1}) - β_{i-1}/α_{i-2} )^{-1}
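
A sketch of this variant (my own illustration; the names are assumptions): the only change relative to HSCG is that α_{i-1} is computed from r_{i-1}^T A r_{i-1}, β_{i-1}, and α_{i-2}. For simplicity the sketch spends an extra SpMV on Ar; a real single-synchronization variant would instead maintain Ap_i via an auxiliary recurrence such as Ap_i = Ar_i + β_i Ap_{i-1}.

```python
import numpy as np

def hscg_modified_alpha(A, b, x0, nmax):
    """HSCG with alpha computed from (r, Ar), beta_{i-1}, and alpha_{i-2}.

    Sketch only; the first iteration falls back to the standard formula,
    since alpha_{i-2} does not exist yet.
    """
    x, r = x0.copy(), b - A @ x0
    p = r.copy()
    alpha_prev, beta_prev = None, None
    for _ in range(nmax):
        Ar = A @ r
        rr = r @ r
        if alpha_prev is None:
            alpha = rr / (p @ (A @ p))                        # standard formula at i = 1
        else:
            alpha = 1.0 / ((r @ Ar) / rr - beta_prev / alpha_prev)
        Ap = A @ p                                            # extra SpMV, for clarity only
        x = x + alpha * p
        r = r - alpha * Ap
        beta = (r @ r) / rr
        p = r + beta * p
        alpha_prev, beta_prev = alpha, beta
    return x
```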


65 CG with two three-term recurrences (STCG)
The HSCG recurrences can be written as
A P_i = R_{i+1} L_i,  R_i = P_i U_i
and combined to obtain a 3-term recurrence for the residuals (STCG):
A R_i = R_{i+1} T_i,  T_i = L_i U_i
First developed by Stiefel (1952/53), also Rutishauser (1959) and Hageman and Young (1981); motivated by the relation to three-term recurrences for orthogonal polynomials
r_0 = b - Ax_0, x_{-1} = x_0, r_{-1} = r_0, e_{-1} = 0
for i = 1:nmax
    q_{i-1} = (r_{i-1}, Ar_{i-1})/(r_{i-1}, r_{i-1}) - e_{i-2}
    x_i = x_{i-1} + (1/q_{i-1}) [ r_{i-1} + e_{i-2}(x_{i-1} - x_{i-2}) ]
    r_i = r_{i-1} + (1/q_{i-1}) [ -A r_{i-1} + e_{i-2}(r_{i-1} - r_{i-2}) ]
    e_{i-1} = q_{i-1} (r_i, r_i)/(r_{i-1}, r_{i-1})
end
Can be accomplished with a single synchronization point on parallel computers (Strakoš 1985, 1987)
A similar approach (computing α_i using β_{i-1}) was used by D'Azevedo, Eijkhout, Romine (1992, 1993)
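
A NumPy sketch of the three-term recurrences above (my own transcription of the recurrences as displayed; names and structure are assumptions, not production code):

```python
import numpy as np

def stcg(A, b, x0, nmax):
    """CG with two three-term recurrences (STCG), following the recurrences above."""
    x = x0.copy()
    r = b - A @ x
    x_prev, r_prev = x.copy(), r.copy()   # x_{-1} = x_0, r_{-1} = r_0
    e_prev = 0.0                          # e_{-1} = 0
    for _ in range(nmax):
        Ar = A @ r
        rr = r @ r
        q = (r @ Ar) / rr - e_prev
        x_new = x + (r + e_prev * (x - x_prev)) / q
        r_new = r + (-Ar + e_prev * (r - r_prev)) / q
        e_prev = q * (r_new @ r_new) / rr
        x_prev, r_prev = x, r
        x, r = x_new, r_new
    return x
```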


68 Attainable accuracy of STCG
Analyzed by Gutknecht and Strakoš (2000)
Attainable accuracy for STCG can be much worse than for HSCG
The residual gap is bounded by a sum of local errors PLUS local errors multiplied by factors which depend on
max_{0 ≤ l < j ≤ i} ‖r_j‖_2 / ‖r_l‖_2
Large residual oscillations can cause these factors to be large! Local errors can be amplified!

69 STCG (convergence plots)


74 Chronopoulos and Gear's CG (ChG CG)
Chronopoulos and Gear (1989)
Looks like HSCG, but very similar to 3-term recurrence CG (STCG)
Reduces synchronizations per iteration to 1 by changing the computation of α_i and using an auxiliary recurrence for Ap_i
r_0 = b - Ax_0, p_0 = r_0, s_0 = Ap_0, α_0 = (r_0, r_0)/(p_0, s_0)
for i = 1:nmax
    x_i = x_{i-1} + α_{i-1} p_{i-1}    [vector updates]
    r_i = r_{i-1} - α_{i-1} s_{i-1}
    w_i = A r_i    [SpMV]
    β_i = (r_i, r_i)/(r_{i-1}, r_{i-1})    [inner products, one synchronization]
    α_i = (r_i, r_i) / ( (w_i, r_i) - (β_i/α_{i-1})(r_i, r_i) )
    p_i = r_i + β_i p_{i-1}    [vector updates]
    s_i = w_i + β_i s_{i-1}
end
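
A NumPy sketch of ChG CG following the pseudocode above (my own transcription; sequential, so the single reduction is only indicated by comments):

```python
import numpy as np

def chg_cg(A, b, x0, nmax):
    """Chronopoulos/Gear CG: one synchronization per iteration (sketch only).

    In a parallel implementation, the two inner products (r, r) and (w, r)
    would be fused into a single global reduction.
    """
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    s = A @ p
    rr_old = r @ r
    alpha = rr_old / (p @ s)
    for _ in range(nmax):
        x = x + alpha * p          # vector updates
        r = r - alpha * s
        w = A @ r                  # SpMV
        rr = r @ r                 # inner products (single synchronization)
        wr = w @ r
        beta = rr / rr_old
        alpha = rr / (wr - (beta / alpha) * rr)
        p = r + beta * p           # vector updates
        s = w + beta * s           # auxiliary recurrence for A p
        rr_old = rr
    return x
```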



81 Pipelined CG (GVCG)
Pipelined CG of Ghysels and Vanroose (2014)
Similar to the Chronopoulos and Gear approach: uses the auxiliary vector s_i ≡ Ap_i and the same formula for α_i
Also uses auxiliary vectors for Ar_i and A²r_i to remove the sequential dependency between the SpMV and the inner products
Allows the use of nonblocking (asynchronous) MPI communication to overlap the SpMV and the inner products
Hides the latency of global communications


88 GVCG (Ghysels and Vanroose 2014)
r_0 = b - Ax_0, p_0 = r_0, s_0 = Ap_0, w_0 = Ar_0, z_0 = Aw_0, α_0 = r_0^T r_0 / (p_0^T s_0)
for i = 1:nmax
    x_i = x_{i-1} + α_{i-1} p_{i-1}    [vector updates]
    r_i = r_{i-1} - α_{i-1} s_{i-1}
    w_i = w_{i-1} - α_{i-1} z_{i-1}
    q_i = A w_i    [(preconditioner +) SpMV, overlapped with the inner products]
    β_i = r_i^T r_i / (r_{i-1}^T r_{i-1})    [inner products]
    α_i = r_i^T r_i / ( w_i^T r_i - (β_i/α_{i-1}) r_i^T r_i )
    p_i = r_i + β_i p_{i-1}    [vector updates]
    s_i = w_i + β_i s_{i-1}
    z_i = q_i + β_i z_{i-1}
end
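
A NumPy sketch of GVCG following the pseudocode above (my own transcription; sequential, so the overlap of the reduction with the SpMV is only indicated by comments):

```python
import numpy as np

def gvcg(A, b, x0, nmax):
    """Pipelined CG of Ghysels and Vanroose (GVCG), sketch only.

    In a parallel code, the reduction for (r, r) and (w, r) would be started
    non-blockingly and overlapped with the SpMV q = A w.
    """
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    s = A @ p
    w = A @ r
    z = A @ w
    rr_old = r @ r
    alpha = rr_old / (p @ s)
    for _ in range(nmax):
        x = x + alpha * p          # vector updates
        r = r - alpha * s
        w = w - alpha * z          # recurrence for A r
        rr = r @ r                 # inner products: overlap with q = A w below
        wr = w @ r
        q = A @ w                  # SpMV (independent of the inner products)
        beta = rr / rr_old
        alpha = rr / (wr - (beta / alpha) * rr)
        p = r + beta * p           # vector updates
        s = w + beta * s
        z = q + beta * z
        rr_old = rr
    return x
```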


91 Attainable accuracy of pipelined CG
What is the effect of adding auxiliary recurrences to the CG method?
To isolate the effects, we consider a simplified version of a pipelined method: it uses the same update formulas for α and β as HSCG, but uses an additional recurrence for s_i ≡ Ap_i
r_0 = b - Ax_0, p_0 = r_0, s_0 = Ap_0
for i = 1:nmax
    α_{i-1} = (r_{i-1}, r_{i-1})/(p_{i-1}, s_{i-1})
    x_i = x_{i-1} + α_{i-1} p_{i-1}
    r_i = r_{i-1} - α_{i-1} s_{i-1}
    β_i = (r_i, r_i)/(r_{i-1}, r_{i-1})
    p_i = r_i + β_i p_{i-1}
    s_i = A r_i + β_i s_{i-1}
end
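
A NumPy sketch of this simplified pipelined method (my own transcription; names are assumptions):

```python
import numpy as np

def simple_pipelined_cg(A, b, x0, nmax):
    """Simplified pipelined CG used above to isolate the auxiliary recurrence.

    Same alpha/beta formulas as HSCG; the only change is that A p_i is never
    applied directly but maintained via s_i = A r_i + beta_i s_{i-1}.
    """
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    s = A @ p
    for _ in range(nmax):
        rr = r @ r
        alpha = rr / (p @ s)
        x = x + alpha * p
        r = r - alpha * s
        beta = (r @ r) / rr
        p = r + beta * p
        s = A @ r + beta * s      # auxiliary recurrence replaces the SpMV with p
    return x
```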


95 Attainable accuracy of simple pipelined CG
In finite precision, the iterates are updated by
x_i = x_{i-1} + α_{i-1} p_{i-1} + δx_i,  r_i = r_{i-1} - α_{i-1} s_{i-1} + δr_i
f_i = r_i - (b - A x_i)
    = f_{i-1} - α_{i-1}(s_{i-1} - A p_{i-1}) + δr_i + A δx_i
    = f_0 + Σ_{m=1}^{i} (δr_m + A δx_m) - G_i d_i
where G_i = S_i - A P_i and d_i = (α_0, ..., α_{i-1})^T



101 Attainable accuracy of simple pipelined CG
‖G_i‖ is bounded by O(ε) terms involving ‖A‖ ‖P_i‖, ‖A R_i‖, and the conditioning of U_i, where U_i is the unit upper bidiagonal matrix relating R_i = P_i U_i, and
U_i^{-1} =
[ 1  β_1  β_1β_2  ...  β_1β_2...β_i ]
[     1   β_2     ...  β_2...β_i    ]
[               ...                 ]
[                   1   β_i         ]
[                        1          ]
with
β_l β_{l+1} ... β_j = ‖r_j‖_2^2 / ‖r_{l-1}‖_2^2,  l < j
Residual oscillations can cause these factors to be large! Errors in the computed recurrence coefficients can be amplified!
Very similar to the results for attainable accuracy in the 3-term STCG: a seemingly innocuous change can cause a drastic loss of accuracy

102 Simple pipelined CG (convergence plots)

103 Simple pipelined CG: effect of using the auxiliary vector s_i ≡ Ap_i

104 Simple pipelined CG: effect of changing the formula for the recurrence coefficient α and using the auxiliary vector s_i ≡ Ap_i

105 Simple pipelined CG: effect of changing the formula for α and using the auxiliary vectors s_i ≡ Ap_i, w_i ≡ Ar_i, z_i ≡ A²r_i


110 s-step CG
Idea: Compute blocks of s iterations at once
- Compute updates in a different basis
- Communicate every s iterations instead of every iteration
- Reduces the number of synchronizations per iteration by a factor of s
An idea rediscovered many times
- First related work: s-dimensional steepest descent, least squares: Khabaza ('63), Forsythe ('68), Marchuk and Kuznecov ('68)
- Flurry of work on s-step Krylov methods in the 80s/early 90s: see, e.g., Van Rosendale (1983); Chronopoulos and Gear (1989)
- Resurgence of interest in recent years due to growing problem sizes and the growing relative cost of communication


115 s-step CG
Key observation: after iteration i, for j ∈ {0, ..., s},
x_{i+j} - x_i, r_{i+j}, p_{i+j} ∈ K_{s+1}(A, p_i) + K_s(A, r_i)
s steps of s-step CG:
Expand the solution space s dimensions at once: compute a basis matrix Y such that span(Y) = K_{s+1}(A, p_i) + K_s(A, r_i) according to the recurrence AY = YB
Compute the inner products between basis vectors in one synchronization: G = Y^T Y
Perform s iterations of vector updates by updating coordinates in the basis Y:
x_{i+j} - x_i = Y x'_j,  r_{i+j} = Y r'_j,  p_{i+j} = Y p'_j


122 s-step CG
For s iterations of updates, the inner products and SpMVs (in the basis Y) can be computed independently by each processor without communication:
A p_{i+j} = A Y p'_j = Y (B p'_j)    [a length-n SpMV becomes a length-O(s) operation on coordinates]
(r_{i+j}, r_{i+j}) = r'_j^T Y^T Y r'_j = r'_j^T G r'_j


124 s-step CG
r_0 = b - Ax_0, p_0 = r_0
for k = 0:nmax/s
    Compute Y_k and B_k such that A Y_k = Y_k B_k and span(Y_k) = K_{s+1}(A, p_sk) + K_s(A, r_sk)    [compute basis: O(s) SpMVs]
    G_k = Y_k^T Y_k    [O(s²) inner products, one synchronization]
    x'_0 = 0, r'_0 = e_{s+2}, p'_0 = e_1
    for j = 1:s    [inner loop: local vector updates, no communication]
        α_{sk+j-1} = (r'_{j-1}^T G_k r'_{j-1}) / (p'_{j-1}^T G_k B_k p'_{j-1})
        x'_j = x'_{j-1} + α_{sk+j-1} p'_{j-1}
        r'_j = r'_{j-1} - α_{sk+j-1} B_k p'_{j-1}
        β_{sk+j} = (r'_j^T G_k r'_j) / (r'_{j-1}^T G_k r'_{j-1})
        p'_j = r'_j + β_{sk+j} p'_{j-1}
    end
    [x_{s(k+1)} - x_{sk}, r_{s(k+1)}, p_{s(k+1)}] = Y_k [x'_s, r'_s, p'_s]
end
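
A compact NumPy sketch of the algorithm above using the monomial basis (my own illustration; the function name and the dense-matrix setting are assumptions, and the monomial basis is numerically fragile for larger s). The Gram matrix G = Y^T Y stands in for the single synchronization per outer loop; the inner loop touches only length-(2s+1) coordinate vectors.

```python
import numpy as np

def s_step_cg(A, b, x0, s, n_outer):
    """s-step CG with the monomial basis (sketch only, no convergence test)."""
    n = len(b)
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    for _ in range(n_outer):
        # Basis Y = [p, Ap, ..., A^s p, r, Ar, ..., A^{s-1} r]  (2s+1 columns)
        Y = np.zeros((n, 2 * s + 1))
        Y[:, 0] = p
        for j in range(1, s + 1):
            Y[:, j] = A @ Y[:, j - 1]
        Y[:, s + 1] = r
        for j in range(s + 2, 2 * s + 1):
            Y[:, j] = A @ Y[:, j - 1]
        # B shifts coordinates within each block so that A (Y c) = Y (B c)
        # for the coordinate vectors produced below.
        B = np.zeros((2 * s + 1, 2 * s + 1))
        for j in range(s):
            B[j + 1, j] = 1.0
        for j in range(s + 1, 2 * s):
            B[j + 1, j] = 1.0
        G = Y.T @ Y                      # all inner products: one synchronization
        # Coordinate vectors: x'_0 = 0, r'_0 = e_{s+2}, p'_0 = e_1
        xc = np.zeros(2 * s + 1)
        rc = np.zeros(2 * s + 1)
        rc[s + 1] = 1.0
        pc = np.zeros(2 * s + 1)
        pc[0] = 1.0
        for _ in range(s):               # s iterations with no communication
            Bp = B @ pc
            alpha = (rc @ G @ rc) / (pc @ G @ Bp)
            xc = xc + alpha * pc
            rc_new = rc - alpha * Bp
            beta = (rc_new @ G @ rc_new) / (rc @ G @ rc)
            pc = rc_new + beta * pc
            rc = rc_new
        # Recover CG vectors for the next outer iteration
        x = x + Y @ xc
        r = Y @ rc
        p = Y @ pc
    return x
```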



131 Sources of local roundoff error in s-step CG
Computing the s-step Krylov subspace basis:
A Y_k = Y_k B_k + ΔY_k    [error in computing the s-step basis]
Updating the coordinate vectors in the inner loop:
x'_{k,j} = x'_{k,j-1} + q_{k,j-1} + ξ_{k,j}
r'_{k,j} = r'_{k,j-1} - B_k q_{k,j-1} + η_{k,j}    with q_{k,j-1} = fl(α_{sk+j-1} p'_{k,j-1})    [error in updating the coefficient vectors]
Recovering the CG vectors for use in the next outer loop:
x_{sk+j} = Y_k x'_{k,j} + x_{sk} + φ_{sk+j}
r_{sk+j} = Y_k r'_{k,j} + ψ_{sk+j}    [error in basis change]

132 Attainable accuracy of s-step CG
We can write the gap between the true and updated residuals, f, in terms of these errors:
f_{sk+j} = f_0 - Σ_{l=0}^{k-1} [ A φ_{sl+s} + ψ_{sl+s} + Σ_{i=1}^{s} ( A Y_l ξ_{l,i} + Y_l η_{l,i} - ΔY_l q_{l,i-1} ) ]
           - A φ_{sk+j} - ψ_{sk+j} - Σ_{i=1}^{j} ( A Y_k ξ_{k,i} + Y_k η_{k,i} - ΔY_k q_{k,i-1} )
Using standard rounding error results, this allows us to obtain an upper bound on ‖f_{sk+j}‖.



137 Attainable accuracy of s-step CG
f_i ≡ b - A x_i - r_i
For CG:
‖f_i‖ ≤ ‖f_0‖ + ε Σ_{m=1}^{i} (1 + N) ( ‖A‖ ‖x_m‖ + ‖r_m‖ )
For s-step CG, with i = sk + j:
‖f_{sk+j}‖ ≤ ‖f_0‖ + ε c Γ_k Σ_{m=1}^{sk+j} (1 + N) ( ‖A‖ ‖x_m‖ + ‖r_m‖ )
where c is a low-degree polynomial in s, and Γ_k = max_{l ≤ k} Γ_l with Γ_l = ‖Y_l^+‖ ‖Y_l‖ (see C., 2015)


141 s-step CG
s-step CG with the monomial basis Y = [p_i, Ap_i, ..., A^s p_i, r_i, Ar_i, ..., A^{s-1} r_i]  (convergence plots)
Can also use other, more well-conditioned bases to improve the convergence rate and accuracy (see, e.g., Philippe and Reichel, 2012).

142 s-step CG
Even assuming perfect parallel scalability with s (which is usually not the case, due to the extra SpMVs and inner products), the convergence delay means that already at s = 4 we are worse than HSCG in terms of the total number of synchronizations!


147 A different problem...
A: nos4 from UFSMC, b: equal components in the eigenbasis of A and ‖b‖ = 1; N = 100, κ(A) ≈ 2e3  (convergence plots)
If the application only requires ‖x - x_i‖_A ≤ 10^{-10}, any of these methods will work!

148 Performance benefit from s-step BICGSTAB
Speedups for real applications:
s-step BICGSTAB bottom-solver implemented in BoxLib (AMR framework from LBL)
Low Mach Number Combustion Code (LMC): gas-phase combustion simulation
Compared GMG with BICGSTAB vs. GMG with s-step BICGSTAB (s = 4) on a Cray XE6 for two different applications
Up to 2.5x speedup in the bottom solve; up to 1.5x in the overall MG solve
[Bar chart: speedup (0.0x-3.0x) for the LMC 3D mac_project solve, bottom solver and overall MG solver, flat MPI and MPI+OpenMP]
(see Williams et al., IPDPS 2014)


152 Conclusions and takeaways
Think of the bigger picture:
- Much focus on modifying methods to speed up iterations, but the speed of an iteration is only part of the runtime: runtime = (time/iteration) x (# iterations)
- And a solver that can't solve to the required accuracy is useless!
- Solver runtime is only part of a larger scientific code
Design and implementation of iterative solvers requires a holistic approach:
- Selecting the right method, parameters, stopping criteria
- Selecting the right preconditioner (closely linked with the discretization! see Málek and Strakoš, 2015)


More information

S-Step and Communication-Avoiding Iterative Methods

S-Step and Communication-Avoiding Iterative Methods S-Step and Communication-Avoiding Iterative Methods Maxim Naumov NVIDIA, 270 San Tomas Expressway, Santa Clara, CA 95050 Abstract In this paper we make an overview of s-step Conjugate Gradient and develop

More information

Linear Solvers. Andrew Hazel

Linear Solvers. Andrew Hazel Linear Solvers Andrew Hazel Introduction Thus far we have talked about the formulation and discretisation of physical problems...... and stopped when we got to a discrete linear system of equations. Introduction

More information

Topics. The CG Algorithm Algorithmic Options CG s Two Main Convergence Theorems

Topics. The CG Algorithm Algorithmic Options CG s Two Main Convergence Theorems Topics The CG Algorithm Algorithmic Options CG s Two Main Convergence Theorems What about non-spd systems? Methods requiring small history Methods requiring large history Summary of solvers 1 / 52 Conjugate

More information

Contents. Preface... xi. Introduction...

Contents. Preface... xi. Introduction... Contents Preface... xi Introduction... xv Chapter 1. Computer Architectures... 1 1.1. Different types of parallelism... 1 1.1.1. Overlap, concurrency and parallelism... 1 1.1.2. Temporal and spatial parallelism

More information

Conjugate gradient method. Descent method. Conjugate search direction. Conjugate Gradient Algorithm (294)

Conjugate gradient method. Descent method. Conjugate search direction. Conjugate Gradient Algorithm (294) Conjugate gradient method Descent method Hestenes, Stiefel 1952 For A N N SPD In exact arithmetic, solves in N steps In real arithmetic No guaranteed stopping Often converges in many fewer than N steps

More information

On the Vorobyev method of moments

On the Vorobyev method of moments On the Vorobyev method of moments Zdeněk Strakoš Charles University in Prague and Czech Academy of Sciences http://www.karlin.mff.cuni.cz/ strakos Conference in honor of Volker Mehrmann Berlin, May 2015

More information

1 Conjugate gradients

1 Conjugate gradients Notes for 2016-11-18 1 Conjugate gradients We now turn to the method of conjugate gradients (CG), perhaps the best known of the Krylov subspace solvers. The CG iteration can be characterized as the iteration

More information

Numerical behavior of inexact linear solvers

Numerical behavior of inexact linear solvers Numerical behavior of inexact linear solvers Miro Rozložník joint results with Zhong-zhi Bai and Pavel Jiránek Institute of Computer Science, Czech Academy of Sciences, Prague, Czech Republic The fourth

More information

Computational Linear Algebra

Computational Linear Algebra Computational Linear Algebra PD Dr. rer. nat. habil. Ralf Peter Mundani Computation in Engineering / BGU Scientific Computing in Computer Science / INF Winter Term 2017/18 Part 3: Iterative Methods PD

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 23: GMRES and Other Krylov Subspace Methods; Preconditioning

AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 23: GMRES and Other Krylov Subspace Methods; Preconditioning AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 23: GMRES and Other Krylov Subspace Methods; Preconditioning Xiangmin Jiao SUNY Stony Brook Xiangmin Jiao Numerical Analysis I 1 / 18 Outline

More information

DELFT UNIVERSITY OF TECHNOLOGY

DELFT UNIVERSITY OF TECHNOLOGY DELFT UNIVERSITY OF TECHNOLOGY REPORT 16-02 The Induced Dimension Reduction method applied to convection-diffusion-reaction problems R. Astudillo and M. B. van Gijzen ISSN 1389-6520 Reports of the Delft

More information

Conjugate Gradient algorithm. Storage: fixed, independent of number of steps.

Conjugate Gradient algorithm. Storage: fixed, independent of number of steps. Conjugate Gradient algorithm Need: A symmetric positive definite; Cost: 1 matrix-vector product per step; Storage: fixed, independent of number of steps. The CG method minimizes the A norm of the error,

More information

Communication-Avoiding Krylov Subspace Methods in Theory and Practice. Erin Claire Carson. A dissertation submitted in partial satisfaction of the

Communication-Avoiding Krylov Subspace Methods in Theory and Practice. Erin Claire Carson. A dissertation submitted in partial satisfaction of the Communication-Avoiding Krylov Subspace Methods in Theory and Practice by Erin Claire Carson A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in

More information

EECS 275 Matrix Computation

EECS 275 Matrix Computation EECS 275 Matrix Computation Ming-Hsuan Yang Electrical Engineering and Computer Science University of California at Merced Merced, CA 95344 http://faculty.ucmerced.edu/mhyang Lecture 20 1 / 20 Overview

More information

The Lanczos and conjugate gradient algorithms

The Lanczos and conjugate gradient algorithms The Lanczos and conjugate gradient algorithms Gérard MEURANT October, 2008 1 The Lanczos algorithm 2 The Lanczos algorithm in finite precision 3 The nonsymmetric Lanczos algorithm 4 The Golub Kahan bidiagonalization

More information

Iterative methods for Linear System

Iterative methods for Linear System Iterative methods for Linear System JASS 2009 Student: Rishi Patil Advisor: Prof. Thomas Huckle Outline Basics: Matrices and their properties Eigenvalues, Condition Number Iterative Methods Direct and

More information

The amount of work to construct each new guess from the previous one should be a small multiple of the number of nonzeros in A.

The amount of work to construct each new guess from the previous one should be a small multiple of the number of nonzeros in A. AMSC/CMSC 661 Scientific Computing II Spring 2005 Solution of Sparse Linear Systems Part 2: Iterative methods Dianne P. O Leary c 2005 Solving Sparse Linear Systems: Iterative methods The plan: Iterative

More information

ON ORTHOGONAL REDUCTION TO HESSENBERG FORM WITH SMALL BANDWIDTH

ON ORTHOGONAL REDUCTION TO HESSENBERG FORM WITH SMALL BANDWIDTH ON ORTHOGONAL REDUCTION TO HESSENBERG FORM WITH SMALL BANDWIDTH V. FABER, J. LIESEN, AND P. TICHÝ Abstract. Numerous algorithms in numerical linear algebra are based on the reduction of a given matrix

More information

Contribution of Wo¹niakowski, Strako²,... The conjugate gradient method in nite precision computa

Contribution of Wo¹niakowski, Strako²,... The conjugate gradient method in nite precision computa Contribution of Wo¹niakowski, Strako²,... The conjugate gradient method in nite precision computations ªaw University of Technology Institute of Mathematics and Computer Science Warsaw, October 7, 2006

More information

4.8 Arnoldi Iteration, Krylov Subspaces and GMRES

4.8 Arnoldi Iteration, Krylov Subspaces and GMRES 48 Arnoldi Iteration, Krylov Subspaces and GMRES We start with the problem of using a similarity transformation to convert an n n matrix A to upper Hessenberg form H, ie, A = QHQ, (30) with an appropriate

More information

Variants of BiCGSafe method using shadow three-term recurrence

Variants of BiCGSafe method using shadow three-term recurrence Variants of BiCGSafe method using shadow three-term recurrence Seiji Fujino, Takashi Sekimoto Research Institute for Information Technology, Kyushu University Ryoyu Systems Co. Ltd. (Kyushu University)

More information

Chapter 7 Iterative Techniques in Matrix Algebra

Chapter 7 Iterative Techniques in Matrix Algebra Chapter 7 Iterative Techniques in Matrix Algebra Per-Olof Persson persson@berkeley.edu Department of Mathematics University of California, Berkeley Math 128B Numerical Analysis Vector Norms Definition

More information

Conjugate Gradient Method

Conjugate Gradient Method Conjugate Gradient Method direct and indirect methods positive definite linear systems Krylov sequence spectral analysis of Krylov sequence preconditioning Prof. S. Boyd, EE364b, Stanford University Three

More information

Golub-Kahan iterative bidiagonalization and determining the noise level in the data

Golub-Kahan iterative bidiagonalization and determining the noise level in the data Golub-Kahan iterative bidiagonalization and determining the noise level in the data Iveta Hnětynková,, Martin Plešinger,, Zdeněk Strakoš, * Charles University, Prague ** Academy of Sciences of the Czech

More information

Communication-avoiding Krylov subspace methods

Communication-avoiding Krylov subspace methods Motivation Communication-avoiding Krylov subspace methods Mark mhoemmen@cs.berkeley.edu University of California Berkeley EECS MS Numerical Libraries Group visit: 28 April 2008 Overview Motivation Current

More information

Iterative methods for Linear System of Equations. Joint Advanced Student School (JASS-2009)

Iterative methods for Linear System of Equations. Joint Advanced Student School (JASS-2009) Iterative methods for Linear System of Equations Joint Advanced Student School (JASS-2009) Course #2: Numerical Simulation - from Models to Software Introduction In numerical simulation, Partial Differential

More information

On the accuracy of saddle point solvers

On the accuracy of saddle point solvers On the accuracy of saddle point solvers Miro Rozložník joint results with Valeria Simoncini and Pavel Jiránek Institute of Computer Science, Czech Academy of Sciences, Prague, Czech Republic Seminar at

More information

Iterative Methods for Sparse Linear Systems

Iterative Methods for Sparse Linear Systems Iterative Methods for Sparse Linear Systems Luca Bergamaschi e-mail: berga@dmsa.unipd.it - http://www.dmsa.unipd.it/ berga Department of Mathematical Methods and Models for Scientific Applications University

More information

Applied Mathematics 205. Unit V: Eigenvalue Problems. Lecturer: Dr. David Knezevic

Applied Mathematics 205. Unit V: Eigenvalue Problems. Lecturer: Dr. David Knezevic Applied Mathematics 205 Unit V: Eigenvalue Problems Lecturer: Dr. David Knezevic Unit V: Eigenvalue Problems Chapter V.4: Krylov Subspace Methods 2 / 51 Krylov Subspace Methods In this chapter we give

More information

On the influence of eigenvalues on Bi-CG residual norms

On the influence of eigenvalues on Bi-CG residual norms On the influence of eigenvalues on Bi-CG residual norms Jurjen Duintjer Tebbens Institute of Computer Science Academy of Sciences of the Czech Republic duintjertebbens@cs.cas.cz Gérard Meurant 30, rue

More information

Notes on Some Methods for Solving Linear Systems

Notes on Some Methods for Solving Linear Systems Notes on Some Methods for Solving Linear Systems Dianne P. O Leary, 1983 and 1999 and 2007 September 25, 2007 When the matrix A is symmetric and positive definite, we have a whole new class of algorithms

More information

ON EFFICIENT NUMERICAL APPROXIMATION OF THE BILINEAR FORM c A 1 b

ON EFFICIENT NUMERICAL APPROXIMATION OF THE BILINEAR FORM c A 1 b ON EFFICIENT NUMERICAL APPROXIMATION OF THE ILINEAR FORM c A 1 b ZDENĚK STRAKOŠ AND PETR TICHÝ Abstract. Let A C N N be a nonsingular complex matrix and b and c complex vectors of length N. The goal of

More information

Conjugate Gradient (CG) Method

Conjugate Gradient (CG) Method Conjugate Gradient (CG) Method by K. Ozawa 1 Introduction In the series of this lecture, I will introduce the conjugate gradient method, which solves efficiently large scale sparse linear simultaneous

More information

IDR(s) Master s thesis Goushani Kisoensingh. Supervisor: Gerard L.G. Sleijpen Department of Mathematics Universiteit Utrecht

IDR(s) Master s thesis Goushani Kisoensingh. Supervisor: Gerard L.G. Sleijpen Department of Mathematics Universiteit Utrecht IDR(s) Master s thesis Goushani Kisoensingh Supervisor: Gerard L.G. Sleijpen Department of Mathematics Universiteit Utrecht Contents 1 Introduction 2 2 The background of Bi-CGSTAB 3 3 IDR(s) 4 3.1 IDR.............................................

More information

Recent computational developments in Krylov Subspace Methods for linear systems. Valeria Simoncini and Daniel B. Szyld

Recent computational developments in Krylov Subspace Methods for linear systems. Valeria Simoncini and Daniel B. Szyld Recent computational developments in Krylov Subspace Methods for linear systems Valeria Simoncini and Daniel B. Szyld A later version appeared in : Numerical Linear Algebra w/appl., 2007, v. 14(1), pp.1-59.

More information

1. Introduction. Krylov subspace methods are used extensively for the iterative solution of linear systems of equations Ax = b.

1. Introduction. Krylov subspace methods are used extensively for the iterative solution of linear systems of equations Ax = b. SIAM J. SCI. COMPUT. Vol. 31, No. 2, pp. 1035 1062 c 2008 Society for Industrial and Applied Mathematics IDR(s): A FAMILY OF SIMPLE AND FAST ALGORITHMS FOR SOLVING LARGE NONSYMMETRIC SYSTEMS OF LINEAR

More information

A stable variant of Simpler GMRES and GCR

A stable variant of Simpler GMRES and GCR A stable variant of Simpler GMRES and GCR Miroslav Rozložník joint work with Pavel Jiránek and Martin H. Gutknecht Institute of Computer Science, Czech Academy of Sciences, Prague, Czech Republic miro@cs.cas.cz,

More information

Iterative Methods for Solving A x = b

Iterative Methods for Solving A x = b Iterative Methods for Solving A x = b A good (free) online source for iterative methods for solving A x = b is given in the description of a set of iterative solvers called templates found at netlib: http

More information

Communication Avoiding Strategies for the Numerical Kernels in Coupled Physics Simulations

Communication Avoiding Strategies for the Numerical Kernels in Coupled Physics Simulations ExaScience Lab Intel Labs Europe EXASCALE COMPUTING Communication Avoiding Strategies for the Numerical Kernels in Coupled Physics Simulations SIAM Conference on Parallel Processing for Scientific Computing

More information

Solving large sparse Ax = b.

Solving large sparse Ax = b. Bob-05 p.1/69 Solving large sparse Ax = b. Stopping criteria, & backward stability of MGS-GMRES. Chris Paige (McGill University); Miroslav Rozložník & Zdeněk Strakoš (Academy of Sciences of the Czech Republic)..pdf

More information

A Domain Decomposition Based Jacobi-Davidson Algorithm for Quantum Dot Simulation

A Domain Decomposition Based Jacobi-Davidson Algorithm for Quantum Dot Simulation A Domain Decomposition Based Jacobi-Davidson Algorithm for Quantum Dot Simulation Tao Zhao 1, Feng-Nan Hwang 2 and Xiao-Chuan Cai 3 Abstract In this paper, we develop an overlapping domain decomposition

More information

Composite convergence bounds based on Chebyshev polynomials and finite precision conjugate gradient computations

Composite convergence bounds based on Chebyshev polynomials and finite precision conjugate gradient computations Numerical Algorithms manuscript No. (will be inserted by the editor) Composite convergence bounds based on Chebyshev polynomials and finite precision conjugate gradient computations Tomáš Gergelits Zdeněk

More information

Exploiting Low-Rank Structure in Computing Matrix Powers with Applications to Preconditioning

Exploiting Low-Rank Structure in Computing Matrix Powers with Applications to Preconditioning Exploiting Low-Rank Structure in Computing Matrix Powers with Applications to Preconditioning Erin C. Carson, Nicholas Knight, James Demmel, Ming Gu U.C. Berkeley SIAM PP 12, Savannah, Georgia, USA, February

More information

The Conjugate Gradient Method

The Conjugate Gradient Method The Conjugate Gradient Method Jason E. Hicken Aerospace Design Lab Department of Aeronautics & Astronautics Stanford University 14 July 2011 Lecture Objectives describe when CG can be used to solve Ax

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra)

AMS526: Numerical Analysis I (Numerical Linear Algebra) AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 21: Sensitivity of Eigenvalues and Eigenvectors; Conjugate Gradient Method Xiangmin Jiao Stony Brook University Xiangmin Jiao Numerical Analysis

More information

Introduction to Iterative Solvers of Linear Systems

Introduction to Iterative Solvers of Linear Systems Introduction to Iterative Solvers of Linear Systems SFB Training Event January 2012 Prof. Dr. Andreas Frommer Typeset by Lukas Krämer, Simon-Wolfgang Mages and Rudolf Rödl 1 Classes of Matrices and their

More information

B. Reps 1, P. Ghysels 2, O. Schenk 4, K. Meerbergen 3,5, W. Vanroose 1,3. IHP 2015, Paris FRx

B. Reps 1, P. Ghysels 2, O. Schenk 4, K. Meerbergen 3,5, W. Vanroose 1,3. IHP 2015, Paris FRx Communication Avoiding and Hiding in preconditioned Krylov solvers 1 Applied Mathematics, University of Antwerp, Belgium 2 Future Technology Group, Berkeley Lab, USA 3 Intel ExaScience Lab, Belgium 4 USI,

More information

Minisymposia 9 and 34: Avoiding Communication in Linear Algebra. Jim Demmel UC Berkeley bebop.cs.berkeley.edu

Minisymposia 9 and 34: Avoiding Communication in Linear Algebra. Jim Demmel UC Berkeley bebop.cs.berkeley.edu Minisymposia 9 and 34: Avoiding Communication in Linear Algebra Jim Demmel UC Berkeley bebop.cs.berkeley.edu Motivation (1) Increasing parallelism to exploit From Top500 to multicores in your laptop Exponentially

More information

Performance Evaluation of GPBiCGSafe Method without Reverse-Ordered Recurrence for Realistic Problems

Performance Evaluation of GPBiCGSafe Method without Reverse-Ordered Recurrence for Realistic Problems Performance Evaluation of GPBiCGSafe Method without Reverse-Ordered Recurrence for Realistic Problems Seiji Fujino, Takashi Sekimoto Abstract GPBiCG method is an attractive iterative method for the solution

More information

On prescribing Ritz values and GMRES residual norms generated by Arnoldi processes

On prescribing Ritz values and GMRES residual norms generated by Arnoldi processes On prescribing Ritz values and GMRES residual norms generated by Arnoldi processes Jurjen Duintjer Tebbens Institute of Computer Science Academy of Sciences of the Czech Republic joint work with Gérard

More information

IDR(s) A family of simple and fast algorithms for solving large nonsymmetric systems of linear equations

IDR(s) A family of simple and fast algorithms for solving large nonsymmetric systems of linear equations IDR(s) A family of simple and fast algorithms for solving large nonsymmetric systems of linear equations Harrachov 2007 Peter Sonneveld en Martin van Gijzen August 24, 2007 1 Delft University of Technology

More information

1 Extrapolation: A Hint of Things to Come

1 Extrapolation: A Hint of Things to Come Notes for 2017-03-24 1 Extrapolation: A Hint of Things to Come Stationary iterations are simple. Methods like Jacobi or Gauss-Seidel are easy to program, and it s (relatively) easy to analyze their convergence.

More information

ITERATIVE METHODS BASED ON KRYLOV SUBSPACES

ITERATIVE METHODS BASED ON KRYLOV SUBSPACES ITERATIVE METHODS BASED ON KRYLOV SUBSPACES LONG CHEN We shall present iterative methods for solving linear algebraic equation Au = b based on Krylov subspaces We derive conjugate gradient (CG) method

More information

KRYLOV SUBSPACE ITERATION

KRYLOV SUBSPACE ITERATION KRYLOV SUBSPACE ITERATION Presented by: Nab Raj Roshyara Master and Ph.D. Student Supervisors: Prof. Dr. Peter Benner, Department of Mathematics, TU Chemnitz and Dipl.-Geophys. Thomas Günther 1. Februar

More information

Lecture 9: Krylov Subspace Methods. 2 Derivation of the Conjugate Gradient Algorithm

Lecture 9: Krylov Subspace Methods. 2 Derivation of the Conjugate Gradient Algorithm CS 622 Data-Sparse Matrix Computations September 19, 217 Lecture 9: Krylov Subspace Methods Lecturer: Anil Damle Scribes: David Eriksson, Marc Aurele Gilles, Ariah Klages-Mundt, Sophia Novitzky 1 Introduction

More information

A DISSERTATION. Extensions of the Conjugate Residual Method. by Tomohiro Sogabe. Presented to

A DISSERTATION. Extensions of the Conjugate Residual Method. by Tomohiro Sogabe. Presented to A DISSERTATION Extensions of the Conjugate Residual Method ( ) by Tomohiro Sogabe Presented to Department of Applied Physics, The University of Tokyo Contents 1 Introduction 1 2 Krylov subspace methods

More information

Iterative Methods and Multigrid

Iterative Methods and Multigrid Iterative Methods and Multigrid Part 3: Preconditioning 2 Eric de Sturler Preconditioning The general idea behind preconditioning is that convergence of some method for the linear system Ax = b can be

More information

Solutions and Notes to Selected Problems In: Numerical Optimzation by Jorge Nocedal and Stephen J. Wright.

Solutions and Notes to Selected Problems In: Numerical Optimzation by Jorge Nocedal and Stephen J. Wright. Solutions and Notes to Selected Problems In: Numerical Optimzation by Jorge Nocedal and Stephen J. Wright. John L. Weatherwax July 7, 2010 wax@alum.mit.edu 1 Chapter 5 (Conjugate Gradient Methods) Notes

More information

A short course on: Preconditioned Krylov subspace methods. Yousef Saad University of Minnesota Dept. of Computer Science and Engineering

A short course on: Preconditioned Krylov subspace methods. Yousef Saad University of Minnesota Dept. of Computer Science and Engineering A short course on: Preconditioned Krylov subspace methods Yousef Saad University of Minnesota Dept. of Computer Science and Engineering Universite du Littoral, Jan 19-3, 25 Outline Part 1 Introd., discretization

More information

Block Bidiagonal Decomposition and Least Squares Problems

Block Bidiagonal Decomposition and Least Squares Problems Block Bidiagonal Decomposition and Least Squares Problems Åke Björck Department of Mathematics Linköping University Perspectives in Numerical Analysis, Helsinki, May 27 29, 2008 Outline Bidiagonal Decomposition

More information

Parallel Numerics, WT 2016/ Iterative Methods for Sparse Linear Systems of Equations. page 1 of 1

Parallel Numerics, WT 2016/ Iterative Methods for Sparse Linear Systems of Equations. page 1 of 1 Parallel Numerics, WT 2016/2017 5 Iterative Methods for Sparse Linear Systems of Equations page 1 of 1 Contents 1 Introduction 1.1 Computer Science Aspects 1.2 Numerical Problems 1.3 Graphs 1.4 Loop Manipulations

More information

SOLVING SPARSE LINEAR SYSTEMS OF EQUATIONS. Chao Yang Computational Research Division Lawrence Berkeley National Laboratory Berkeley, CA, USA

SOLVING SPARSE LINEAR SYSTEMS OF EQUATIONS. Chao Yang Computational Research Division Lawrence Berkeley National Laboratory Berkeley, CA, USA 1 SOLVING SPARSE LINEAR SYSTEMS OF EQUATIONS Chao Yang Computational Research Division Lawrence Berkeley National Laboratory Berkeley, CA, USA 2 OUTLINE Sparse matrix storage format Basic factorization

More information

EIGIFP: A MATLAB Program for Solving Large Symmetric Generalized Eigenvalue Problems

EIGIFP: A MATLAB Program for Solving Large Symmetric Generalized Eigenvalue Problems EIGIFP: A MATLAB Program for Solving Large Symmetric Generalized Eigenvalue Problems JAMES H. MONEY and QIANG YE UNIVERSITY OF KENTUCKY eigifp is a MATLAB program for computing a few extreme eigenvalues

More information

The Conjugate Gradient Method for Solving Linear Systems of Equations

The Conjugate Gradient Method for Solving Linear Systems of Equations The Conjugate Gradient Method for Solving Linear Systems of Equations Mike Rambo Mentor: Hans de Moor May 2016 Department of Mathematics, Saint Mary s College of California Contents 1 Introduction 2 2

More information

Comparison of Fixed Point Methods and Krylov Subspace Methods Solving Convection-Diffusion Equations

Comparison of Fixed Point Methods and Krylov Subspace Methods Solving Convection-Diffusion Equations American Journal of Computational Mathematics, 5, 5, 3-6 Published Online June 5 in SciRes. http://www.scirp.org/journal/ajcm http://dx.doi.org/.436/ajcm.5.5 Comparison of Fixed Point Methods and Krylov

More information

BLOCK KRYLOV SPACE METHODS FOR LINEAR SYSTEMS WITH MULTIPLE RIGHT-HAND SIDES: AN INTRODUCTION

BLOCK KRYLOV SPACE METHODS FOR LINEAR SYSTEMS WITH MULTIPLE RIGHT-HAND SIDES: AN INTRODUCTION BLOCK KRYLOV SPACE METHODS FOR LINEAR SYSTEMS WITH MULTIPLE RIGHT-HAND SIDES: AN INTRODUCTION MARTIN H. GUTKNECHT / VERSION OF February 28, 2006 Abstract. In a number of applications in scientific computing

More information

Course Notes: Week 4

Course Notes: Week 4 Course Notes: Week 4 Math 270C: Applied Numerical Linear Algebra 1 Lecture 9: Steepest Descent (4/18/11) The connection with Lanczos iteration and the CG was not originally known. CG was originally derived

More information

Preconditioning Techniques Analysis for CG Method

Preconditioning Techniques Analysis for CG Method Preconditioning Techniques Analysis for CG Method Huaguang Song Department of Computer Science University of California, Davis hso@ucdavis.edu Abstract Matrix computation issue for solve linear system

More information

FEM and sparse linear system solving

FEM and sparse linear system solving FEM & sparse linear system solving, Lecture 9, Nov 19, 2017 1/36 Lecture 9, Nov 17, 2017: Krylov space methods http://people.inf.ethz.ch/arbenz/fem17 Peter Arbenz Computer Science Department, ETH Zürich

More information

On the interplay between discretization and preconditioning of Krylov subspace methods

On the interplay between discretization and preconditioning of Krylov subspace methods On the interplay between discretization and preconditioning of Krylov subspace methods Josef Málek and Zdeněk Strakoš Nečas Center for Mathematical Modeling Charles University in Prague and Czech Academy

More information

The solution of the discretized incompressible Navier-Stokes equations with iterative methods

The solution of the discretized incompressible Navier-Stokes equations with iterative methods The solution of the discretized incompressible Navier-Stokes equations with iterative methods Report 93-54 C. Vuik Technische Universiteit Delft Delft University of Technology Faculteit der Technische

More information

HOW TO MAKE SIMPLER GMRES AND GCR MORE STABLE

HOW TO MAKE SIMPLER GMRES AND GCR MORE STABLE HOW TO MAKE SIMPLER GMRES AND GCR MORE STABLE PAVEL JIRÁNEK, MIROSLAV ROZLOŽNÍK, AND MARTIN H. GUTKNECHT Abstract. In this paper we analyze the numerical behavior of several minimum residual methods, which

More information

Some minimization problems

Some minimization problems Week 13: Wednesday, Nov 14 Some minimization problems Last time, we sketched the following two-step strategy for approximating the solution to linear systems via Krylov subspaces: 1. Build a sequence of

More information