High performance Krylov subspace method variants

High performance Krylov subspace method variants and their behavior in finite precision Erin Carson New York University HPCSE17, May 24, 2017

Collaborators:
Miroslav Rozložník, Institute of Computer Science, Czech Academy of Sciences
Zdeněk Strakoš, Department of Numerical Mathematics, Faculty of Mathematics and Physics, Charles University
Petr Tichý, Department of Numerical Mathematics, Faculty of Mathematics and Physics, Charles University
Miroslav Tůma, Department of Numerical Mathematics, Faculty of Mathematics and Physics, Charles University
Preprint NCMM/2016/08: http://www.karlin.mff.cuni.cz/~strakos/download/2016_carrozstrtictum_16.pdf

Conjugate Gradient method for solving Ax = b, in double precision ($\varepsilon = 2^{-53}$). The error is measured in the A-norm, $\|x_i - x\|_A = \sqrt{(x_i - x)^T A (x_i - x)}$, and the iterates are updated by
$x_i = x_{i-1} + \alpha_{i-1} p_{i-1}$, $r_i = r_{i-1} - \alpha_{i-1} A p_{i-1}$, $p_i = r_i + \beta_i p_{i-1}$.

Krylov subspace methods. Used for linear systems Ax = b, eigenvalue problems, singular value problems, least squares, etc. Best when A is large and very sparse, stored implicitly, or when only an approximation is needed. A Krylov subspace method is a projection process onto the Krylov subspace
$K_i(A, r_0) = \mathrm{span}\{r_0, Ar_0, A^2 r_0, \dots, A^{i-1} r_0\}$,
where A is an N x N matrix and $r_0 = b - Ax_0$ is a length-N vector. In each iteration:
- Add a dimension to the Krylov subspace, forming the nested sequence $K_1(A, r_0) \subset K_2(A, r_0) \subset \dots \subset K_i(A, r_0)$
- Orthogonalize with respect to some constraint space $C_i$
- Select the approximate solution $x_i \in x_0 + K_i(A, r_0)$ such that $r_i = b - Ax_i \perp C_i$
Examples: Lanczos/Conjugate Gradient (CG), Arnoldi/Generalized Minimum Residual (GMRES), Biconjugate Gradient (BiCG), BiCGSTAB, GKL, LSQR, etc.

The conjugate gradient method. A is symmetric positive definite and $C_i = K_i(A, r_0)$, so $r_i \perp K_i(A, r_0)$ and
$\|x - x_i\|_A = \min_{z \in x_0 + K_i(A, r_0)} \|x - z\|_A$.
In exact arithmetic the method has finite termination: $r_i = 0$ for some $i \le N$.
Connection with Lanczos: with $v_1 = r_0 / \|r_0\|$, i iterations of Lanczos produce the $N \times i$ matrix $V_i = [v_1, \dots, v_i]$ and the $i \times i$ tridiagonal matrix $T_i$ such that
$A V_i = V_i T_i + \delta_{i+1} v_{i+1} e_i^T$, $T_i = V_i^T A V_i$.
The CG approximation $x_i$ is obtained by solving the reduced model
$T_i y_i = \|r_0\| e_1$, $x_i = x_0 + V_i y_i$.
There are also connections with orthogonal polynomials, the Stieltjes problem of moments, Gauss-Christoffel quadrature, and others (see the 2013 book of Liesen and Strakoš).
CG (and other Krylov subspace methods) are highly nonlinear: good for convergence, bad for ease of finite precision analysis.
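To make the Lanczos connection concrete, here is a minimal numpy sketch (not the talk's code) that runs i Lanczos steps on an assumed small SPD test matrix and forms the CG iterate from the reduced tridiagonal model; the names V, T, delta follow the slide's notation.

```python
import numpy as np

def lanczos_cg(A, b, x0, i):
    """Run i Lanczos steps and form x_i = x_0 + V_i y_i with T_i y_i = ||r_0|| e_1."""
    r0 = b - A @ x0
    r0norm = np.linalg.norm(r0)
    V = np.zeros((A.shape[0], i + 1))
    alpha = np.zeros(i)          # diagonal of T_i
    delta = np.zeros(i + 1)      # off-diagonal of T_i (delta[0] unused)
    V[:, 0] = r0 / r0norm
    for j in range(i):
        w = A @ V[:, j]
        if j > 0:
            w -= delta[j] * V[:, j - 1]
        alpha[j] = V[:, j] @ w
        w -= alpha[j] * V[:, j]
        delta[j + 1] = np.linalg.norm(w)
        V[:, j + 1] = w / delta[j + 1]
    T = np.diag(alpha) + np.diag(delta[1:i], 1) + np.diag(delta[1:i], -1)
    y = np.linalg.solve(T, r0norm * np.eye(i)[:, 0])
    return x0 + V[:, :i] @ y

# assumed SPD test problem, for illustration only
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((50, 50)))
A = Q @ np.diag(np.linspace(1, 100, 50)) @ Q.T
b = rng.standard_normal(50)
x = np.linalg.solve(A, b)
for i in (5, 15, 30):
    xi = lanczos_cg(A, b, np.zeros(50), i)
    print(i, np.sqrt((x - xi) @ A @ (x - xi)))   # A-norm of the error decreases with i
```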

Implementation of CG. The standard implementation is due to Hestenes and Stiefel (1952) (HSCG). It uses three 2-term recurrences for updating x_i, r_i, p_i:
r_0 = b - Ax_0, p_0 = r_0
for i = 1:nmax
    α_{i-1} = (r_{i-1}^T r_{i-1}) / (p_{i-1}^T A p_{i-1})
    x_i = x_{i-1} + α_{i-1} p_{i-1}
    r_i = r_{i-1} - α_{i-1} A p_{i-1}
    β_i = (r_i^T r_i) / (r_{i-1}^T r_{i-1})
    p_i = r_i + β_i p_{i-1}
end
The coefficient α_{i-1} minimizes $\|x - x_i\|_A$ along the line $z(\alpha) = x_{i-1} + \alpha p_{i-1}$. If $p_i \perp_A p_j$ for $i \ne j$, these 1-dimensional minimizations in each iteration give an i-dimensional minimization over the whole subspace $x_0 + K_i(A, r_0) = x_0 + \mathrm{span}\{p_0, \dots, p_{i-1}\}$.
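A minimal numpy version of HSCG as written above, tracking both the updated residual r_i and the true residual b - Ax_i so that their deviation (the residual gap discussed later) can be observed; this is an illustrative sketch, not the talk's code.

```python
import numpy as np

def hscg(A, b, x0, nmax, tol=1e-12):
    """Hestenes-Stiefel CG; returns the solution and (updated, true) residual norms."""
    x = x0.astype(float).copy()
    r = b - A @ x              # updated (recursively computed) residual
    p = r.copy()
    rr = r @ r
    hist = []
    for _ in range(nmax):
        Ap = A @ p
        alpha = rr / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r
        beta = rr_new / rr
        rr = rr_new
        p = r + beta * p
        hist.append((np.sqrt(rr), np.linalg.norm(b - A @ x)))
        if np.sqrt(rr) <= tol * np.linalg.norm(b):
            break
    return x, hist
```

Plotting the two columns of hist against the iteration number reproduces the qualitative behavior discussed in the following slides: the updated residual keeps decreasing while the true residual stagnates at the attainable accuracy.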

Communication in CG. The projection process in terms of communication:
- Add a dimension to K_i: a sparse matrix-vector multiplication (SpMV). Vector entries must be communicated with neighboring processors (point-to-point communication).
- Orthogonalize with respect to C_i: inner products, i.e., a global synchronization (MPI_Allreduce); all processors must exchange data and wait for all communication to finish before proceeding (see the sketch below).
Dependencies between these communication-bound kernels in each iteration limit performance!
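To illustrate why each inner product is a global synchronization, here is a hedged mpi4py sketch; the block distribution of the vectors across ranks is an assumption for illustration, not part of the talk.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def distributed_dot(u_local, v_local):
    """Dot product of block-distributed vectors: local work plus one global reduction."""
    local = np.dot(u_local, v_local)          # purely local computation
    return comm.allreduce(local, op=MPI.SUM)  # global synchronization point

# usage (assumed layout): each rank owns a contiguous block of the global vectors
n_local = 1000
rng = np.random.default_rng(comm.Get_rank())
u = rng.standard_normal(n_local)
v = rng.standard_normal(n_local)
print(comm.Get_rank(), distributed_dot(u, v))
```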

Communication in HSCG. Each pass through the iteration loop contains an SpMV, two groups of inner products, and vector updates:
r_0 = b - Ax_0, p_0 = r_0
for i = 1:nmax
    α_{i-1} = (r_{i-1}^T r_{i-1}) / (p_{i-1}^T A p_{i-1})    [SpMV + inner products]
    x_i = x_{i-1} + α_{i-1} p_{i-1}                          [vector update]
    r_i = r_{i-1} - α_{i-1} A p_{i-1}                        [vector update]
    β_i = (r_i^T r_i) / (r_{i-1}^T r_{i-1})                  [inner products]
    p_i = r_i + β_i p_{i-1}                                  [vector update]
end

Future exascale systems (sources: P. Beckman (ANL), J. Shalf (LBL), and D. Unat (LBL)). Petascale systems (2009) vs. predicted exascale systems, with the factor of improvement:
- System Peak: ~1000x
- Node Memory Bandwidth: 25 GB/s vs. 0.4-4 TB/s (~10-100x)
- Total Node Interconnect Bandwidth: 3.5 GB/s vs. 100-400 GB/s (~100x)
- Memory Latency: 100 ns vs. 50 ns (~1x)
- Interconnect Latency: ~1x
Gaps between communication and computation cost are only growing larger in future systems. Reducing the time spent moving data and waiting for data will be essential for applications at exascale!

Synchronization-reducing variants. Communication cost has motivated many approaches to reducing synchronization in CG:
- Early work: CG with a single synchronization point per iteration, via a 3-term recurrence CG, a modified computation of the recurrence coefficients, or auxiliary vectors.
- Pipelined Krylov subspace methods: use modified coefficients and auxiliary vectors to reduce the synchronization points to 1 per iteration; the modifications also decouple the SpMV from the inner products, which enables overlapping.
- s-step Krylov subspace methods: compute iterations in blocks of s using a different Krylov subspace basis, enabling one synchronization per s iterations.

The effects of finite precision. It is well known that roundoff error has two effects:
1. Delay of convergence: we no longer have an exact Krylov subspace, the computed basis can lose numerical rank, and the residuals are no longer orthogonal, so the minimization is no longer exact.
2. Loss of attainable accuracy: rounding errors cause the true residual b - Ax_i and the updated residual r_i to deviate.
Example problem: A = bcsstk03 from the UFSMC (N = 112, κ(A) ≈ 7e6), b with equal components in the eigenbasis of A and ||b|| = 1.
There is much work on these results for CG; see Meurant and Strakoš (2006) for a thorough summary of early developments in the finite precision analysis of Lanczos and CG.

Optimizing high performance iterative solvers. Synchronization-reducing variants are designed to reduce the time per iteration, but this is not the whole story! What we really want to minimize is the runtime, subject to some constraint on accuracy:
runtime = (time/iteration) x (# iterations)
Changes to how the recurrences are computed can exacerbate the finite precision effects of convergence delay and loss of accuracy. It is crucial to understand, and take into account, how algorithm modifications will affect the convergence rate and attainable accuracy!

Maximum attainable accuracy. Accuracy depends on the size of the true residual $b - A\hat{x}_i$. Rounding errors cause the true residual, $b - A\hat{x}_i$, and the updated residual, $\hat{r}_i$, to deviate. Writing $b - A\hat{x}_i = \hat{r}_i + (b - A\hat{x}_i - \hat{r}_i)$,
$\|b - A\hat{x}_i\| \le \|\hat{r}_i\| + \|b - A\hat{x}_i - \hat{r}_i\|$.
As $\hat{r}_i \to 0$, $\|b - A\hat{x}_i\|$ depends on $\|b - A\hat{x}_i - \hat{r}_i\|$. There are many results on bounding attainable accuracy, e.g., Greenbaum (1989, 1994, 1997), Sleijpen, van der Vorst and Fokkema (1994), Sleijpen, van der Vorst and Modersitzki (2001), Björck, Elfving and Strakoš (1998), and Gutknecht and Strakoš (2000).

Maximum attainable accuracy of HSCG. In finite precision HSCG, the iterates are updated by
$\hat{x}_i = \hat{x}_{i-1} + \hat{\alpha}_{i-1}\hat{p}_{i-1} - \delta x_i$ and $\hat{r}_i = \hat{r}_{i-1} - \hat{\alpha}_{i-1} A\hat{p}_{i-1} - \delta r_i$.
Let $f_i \equiv b - A\hat{x}_i - \hat{r}_i$. Then
$f_i = b - A(\hat{x}_{i-1} + \hat{\alpha}_{i-1}\hat{p}_{i-1} - \delta x_i) - (\hat{r}_{i-1} - \hat{\alpha}_{i-1} A\hat{p}_{i-1} - \delta r_i) = f_{i-1} + A\delta x_i + \delta r_i = f_0 + \sum_{m=1}^{i} (A\delta x_m + \delta r_m)$.
Bounds on the residual gap include:
$\|f_i\| \le O(\varepsilon) \sum_{m=0}^{i} \left( N_A \|A\| \|\hat{x}_m\| + \|\hat{r}_m\| \right)$  (van der Vorst and Ye, 2000)
$\|f_i\| \le O(\varepsilon) \|A\| \left( \|x\| + \max_{m=0,\dots,i} \|\hat{x}_m\| \right)$  (Greenbaum, 1997)
$\|f_i\| \le O(\varepsilon)\, N_A \|A\| \|A^{-1}\| \sum_{m=0}^{i} \|\hat{r}_m\|$  (Sleijpen and van der Vorst, 1995)

Early approaches to reducing synchronization. Goal: reduce the 2 synchronization points per iteration in HSCG to 1 synchronization point per iteration.
- Compute β_i from α_{i-1} and Ap_{i-1} using the relation $\|r_i\|_2^2 = \alpha_{i-1}^2 \|Ap_{i-1}\|_2^2 - \|r_{i-1}\|_2^2$; the updates of x_i, r_i, and p_i can then also be merged. Developed independently by Johnson (1983, 1984), van Rosendale (1983, 1984), and Saad (1985); many other similar approaches exist (a minimal sketch follows below).
- One could also compute α_{i-1} from β_{i-1}: $\alpha_{i-1} = \left( \frac{r_{i-1}^T A r_{i-1}}{r_{i-1}^T r_{i-1}} - \frac{\beta_{i-1}}{\alpha_{i-2}} \right)^{-1}$.
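A minimal numpy sketch of the first idea, assuming the exact-arithmetic relation above; the three scalar quantities of an iteration are formed from vectors that are already available, so in a parallel code they could share one global reduction. This is an illustration, not the talk's code.

```python
import numpy as np

def cg_single_reduction(A, b, x0, nmax):
    """CG variant where beta_i is obtained from alpha_{i-1} and ||Ap_{i-1}||,
    so all inner products of an iteration can be fused into one reduction."""
    x = x0.astype(float).copy()
    r = b - A @ x
    p = r.copy()
    rr = r @ r
    for _ in range(nmax):
        Ap = A @ p
        # in a parallel code these two local sums would go into a single Allreduce
        pAp = p @ Ap
        ApAp = Ap @ Ap
        alpha = rr / pAp
        rr_new = alpha**2 * ApAp - rr   # ||r_i||^2 without an extra inner product
        # (in finite precision this can go negative near convergence; a real code would guard it)
        x += alpha * p
        r -= alpha * Ap
        beta = rr_new / rr
        rr = rr_new
        p = r + beta * p
    return x
```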

Modified recurrence coefficient computation. What is the effect of changing the way the recurrence coefficients (α and β) are computed in HSCG? Notice that neither α nor β appears in the bounds on $\|f_i\|$:
$f_i = b - A\hat{x}_i - \hat{r}_i = b - A(\hat{x}_{i-1} + \hat{\alpha}_{i-1}\hat{p}_{i-1} - \delta x_i) - (\hat{r}_{i-1} - \hat{\alpha}_{i-1} A\hat{p}_{i-1} - \delta r_i)$.
As long as the same $\hat{\alpha}_{i-1}$ is used in updating both $x_i$ and $r_i$, $f_i = f_{i-1} + A\delta x_i + \delta r_i$ still holds, so rounding errors made in computing $\hat{\alpha}_{i-1}$ do not contribute to the residual gap. But they may change the computed $x_i$ and $r_i$, which can affect the convergence rate.

Modified recurrence coefficient computation. Example: HSCG with the modified formula for $\alpha_{i-1}$,
$\alpha_{i-1} = \left( \frac{r_{i-1}^T A r_{i-1}}{r_{i-1}^T r_{i-1}} - \frac{\beta_{i-1}}{\alpha_{i-2}} \right)^{-1}$.

CG with two three-term recurrences (STCG). The HSCG recurrences can be written as $AP_i = R_{i+1} L_i$, $R_i = P_i U_i$; combining these gives a 3-term recurrence for the residuals (STCG): $AR_i = R_{i+1} T_i$, $T_i = L_i U_i$. First developed by Stiefel (1952/53), also Rutishauser (1959) and Hageman and Young (1981), motivated by the relation to three-term recurrences for orthogonal polynomials.
r_0 = b - Ax_0, x_{-1} = x_0, r_{-1} = r_0, e_{-1} = 0
for i = 1:nmax
    q_{i-1} = (r_{i-1}, Ar_{i-1}) / (r_{i-1}, r_{i-1}) - e_{i-2}
    x_i = x_{i-1} + (1/q_{i-1}) [ r_{i-1} + e_{i-2} (x_{i-1} - x_{i-2}) ]
    r_i = r_{i-1} + (1/q_{i-1}) [ -A r_{i-1} + e_{i-2} (r_{i-1} - r_{i-2}) ]
    e_{i-1} = q_{i-1} (r_i, r_i) / (r_{i-1}, r_{i-1})
end
This can be accomplished with a single synchronization point on parallel computers (Strakoš 1985, 1987). A similar approach (computing α_i using β_{i-1}) was used by D'Azevedo, Eijkhout, and Romine (1992, 1993).
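A minimal numpy sketch of the three-term recurrence above, storing only the quantities from the two previous steps; this is an illustration of the recurrence, not the talk's implementation.

```python
import numpy as np

def stcg(A, b, x0, nmax):
    """CG with two three-term (Stiefel-type) recurrences for x and r."""
    x_prev = x0.astype(float).copy()      # x_{i-2}
    x = x_prev.copy()                     # x_{i-1}
    r_prev = b - A @ x                    # r_{i-2}
    r = r_prev.copy()                     # r_{i-1}
    e = 0.0                               # e_{i-2}
    for _ in range(nmax):
        Ar = A @ r
        q = (r @ Ar) / (r @ r) - e
        x_new = x + (r + e * (x - x_prev)) / q
        r_new = r + (-Ar + e * (r - r_prev)) / q
        e = q * (r_new @ r_new) / (r @ r)   # e_{i-1}, used in the next step
        x_prev, x = x, x_new
        r_prev, r = r, r_new
    return x
```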

Attainable accuracy of STCG. Analyzed by Gutknecht and Strakoš (2000): the attainable accuracy of STCG can be much worse than that of HSCG. The residual gap is bounded by a sum of local errors PLUS local errors multiplied by factors which depend on $\max_{0 \le l < j \le i} \|\hat{r}_j\|_2 / \|\hat{r}_l\|_2$. Large residual oscillations can cause these factors to be large, so local errors can be amplified!

STCG [figure]

Chronopoulos and Gear's CG (ChG CG). Chronopoulos and Gear (1989). Looks like HSCG, but is very similar to the 3-term recurrence CG (STCG). It reduces the synchronizations per iteration to 1 by changing the computation of α_i and using an auxiliary recurrence for Ap_i; each iteration now consists of vector updates, one SpMV, and one block of inner products:
r_0 = b - Ax_0, p_0 = r_0, s_0 = Ap_0, α_0 = (r_0, r_0) / (p_0, s_0)
for i = 1:nmax
    x_i = x_{i-1} + α_{i-1} p_{i-1}
    r_i = r_{i-1} - α_{i-1} s_{i-1}
    w_i = A r_i
    β_i = (r_i, r_i) / (r_{i-1}, r_{i-1})
    α_i = (r_i, r_i) / [ (w_i, r_i) - (β_i / α_{i-1}) (r_i, r_i) ]
    p_i = r_i + β_i p_{i-1}
    s_i = w_i + β_i s_{i-1}
end
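A numpy sketch of the Chronopoulos-Gear recurrences above (illustrative only); note that β_i and the denominator of α_i use only w_i = Ar_i and inner products that could come from one fused reduction.

```python
import numpy as np

def chg_cg(A, b, x0, nmax):
    """Chronopoulos/Gear CG: one SpMV and one fused block of inner products per iteration."""
    x = x0.astype(float).copy()
    r = b - A @ x
    p = r.copy()
    s = A @ p                      # s_i, recurring approximation of A p_i
    rr = r @ r
    alpha = rr / (p @ s)
    for _ in range(nmax):
        x += alpha * p
        r -= alpha * s
        w = A @ r                  # the only SpMV in the iteration
        rr_new = r @ r
        wr = w @ r                 # these two dots can share one reduction
        beta = rr_new / rr
        alpha = rr_new / (wr - (beta / alpha) * rr_new)
        p = r + beta * p
        s = w + beta * s
        rr = rr_new
    return x
```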

Pipelined CG (GVCG). The pipelined CG of Ghysels and Vanroose (2014) is similar to the Chronopoulos and Gear approach: it uses the auxiliary vector $s_i \equiv Ap_i$ and the same formula for α_i. It also uses auxiliary vectors for $Ar_i$ and $A^2 r_i$ to remove the sequential dependency between the SpMV and the inner products. This allows the use of nonblocking (asynchronous) MPI communication to overlap the SpMV and the inner products, hiding the latency of the global communication.

GVCG (Ghysels and Vanroose 2014). In each iteration the inner products (one global reduction) can be overlapped with the (preconditioned) SpMV:
r_0 = b - Ax_0, p_0 = r_0, s_0 = Ap_0, w_0 = Ar_0, z_0 = Aw_0, α_0 = (r_0, r_0) / (p_0, s_0)
for i = 1:nmax
    x_i = x_{i-1} + α_{i-1} p_{i-1}
    r_i = r_{i-1} - α_{i-1} s_{i-1}
    w_i = w_{i-1} - α_{i-1} z_{i-1}
    q_i = A w_i
    β_i = (r_i, r_i) / (r_{i-1}, r_{i-1})
    α_i = (r_i, r_i) / [ (w_i, r_i) - (β_i / α_{i-1}) (r_i, r_i) ]
    p_i = r_i + β_i p_{i-1}
    s_i = w_i + β_i s_{i-1}
    z_i = q_i + β_i z_{i-1}
end
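A hedged mpi4py fragment showing the overlap that GVCG enables: the two inner products of the iteration are started with a nonblocking Iallreduce, and the local part of the SpMV q_i = A w_i proceeds while the reduction is in flight. The routine local_spmv and the distributed layout are assumptions for illustration; this is not the talk's code, and a full distributed SpMV would also need point-to-point halo exchange.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def overlapped_dots_and_spmv(r_local, w_local, local_spmv):
    """Start the global reduction for (r,r) and (w,r); do local SpMV work meanwhile."""
    sendbuf = np.array([r_local @ r_local, w_local @ r_local])
    recvbuf = np.empty_like(sendbuf)
    req = comm.Iallreduce(sendbuf, recvbuf, op=MPI.SUM)  # nonblocking global sum
    q_local = local_spmv(w_local)   # overlap: communication-free local SpMV work
    req.Wait()                      # synchronize before the dot products are used
    rr, wr = recvbuf
    return rr, wr, q_local
```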

Attainable accuracy of pipelined CG. What is the effect of adding auxiliary recurrences to the CG method? To isolate the effects, we consider a simplified version of a pipelined method, which uses the same update formulas for α and β as HSCG but an additional recurrence for Ap_i (see the sketch below):
r_0 = b - Ax_0, p_0 = r_0, s_0 = Ap_0
for i = 1:nmax
    α_{i-1} = (r_{i-1}, r_{i-1}) / (p_{i-1}, s_{i-1})
    x_i = x_{i-1} + α_{i-1} p_{i-1}
    r_i = r_{i-1} - α_{i-1} s_{i-1}
    β_i = (r_i, r_i) / (r_{i-1}, r_{i-1})
    p_i = r_i + β_i p_{i-1}
    s_i = A r_i + β_i s_{i-1}
end
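A numpy sketch of this simplified pipelined variant, tracking the residual gap ||b - Ax_i - r_i|| analyzed below (illustrative only, on whatever test matrix the caller supplies):

```python
import numpy as np

def simple_pipelined_cg(A, b, x0, nmax):
    """HSCG formulas for alpha and beta, plus the extra recurrence s_i = A r_i + beta_i s_{i-1}."""
    x = x0.astype(float).copy()
    r = b - A @ x
    p = r.copy()
    s = A @ p
    gaps = []
    for _ in range(nmax):
        rr = r @ r
        alpha = rr / (p @ s)
        x += alpha * p
        r -= alpha * s
        beta = (r @ r) / rr
        p = r + beta * p
        s = A @ r + beta * s        # auxiliary recurrence replaces s = A p
        gaps.append(np.linalg.norm(b - A @ x - r))   # residual gap f_i
    return x, gaps
```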

Attainable accuracy of simple pipelined CG. In finite precision,
$\hat{x}_i = \hat{x}_{i-1} + \hat{\alpha}_{i-1}\hat{p}_{i-1} + \delta x_i$, $\hat{r}_i = \hat{r}_{i-1} - \hat{\alpha}_{i-1}\hat{s}_{i-1} + \delta r_i$,
and the residual gap satisfies
$f_i = \hat{r}_i - (b - A\hat{x}_i) = f_{i-1} - \hat{\alpha}_{i-1}(\hat{s}_{i-1} - A\hat{p}_{i-1}) + \delta r_i + A\delta x_i = f_0 + \sum_{m=1}^{i} (\delta r_m + A\delta x_m) - G_i d_i$,
where $G_i = \hat{S}_i - A\hat{P}_i$ and $d_i = [\hat{\alpha}_0, \dots, \hat{\alpha}_{i-1}]^T$.

Attainable accuracy of simple pipelined CG. Bounding $\|G_i\|$ gives terms of the form $O(\varepsilon)\,\kappa(\hat{U}_i)\,\|A\|\,\|\hat{P}_i\|$ and $O(\varepsilon)\,\|A\|\,\|\hat{R}_i\|\,\|\hat{U}_i^{-1}\|$, where
$\hat{U}_i = \begin{bmatrix} 1 & -\beta_1 & & \\ & 1 & \ddots & \\ & & \ddots & -\beta_{i-1} \\ & & & 1 \end{bmatrix}$, $\quad \hat{U}_i^{-1} = \begin{bmatrix} 1 & \beta_1 & \beta_1\beta_2 & \cdots & \beta_1\beta_2\cdots\beta_{i-1} \\ & 1 & \beta_2 & \cdots & \beta_2\cdots\beta_{i-1} \\ & & \ddots & & \vdots \\ & & & 1 & \beta_{i-1} \\ & & & & 1 \end{bmatrix}$,
and $\beta_l \beta_{l+1} \cdots \beta_j = \|r_j\|^2 / \|r_{l-1}\|^2$ for $l < j$.
Residual oscillations can cause these factors to be large, so errors in the computed recurrence coefficients can be amplified! This is very similar to the results for attainable accuracy of the 3-term STCG: a seemingly innocuous change can cause a drastic loss of accuracy.

Simple pipelined CG [figures]: the effect of using the auxiliary vector $s_i \equiv Ap_i$; the effect of changing the formula for the recurrence coefficient α and using the auxiliary vector $s_i \equiv Ap_i$; and the effect of changing the formula for α and using the auxiliary vectors $s_i \equiv Ap_i$, $w_i \equiv Ar_i$, $z_i \equiv A^2 r_i$.

s-step CG. Idea: compute blocks of s iterations at once, computing the updates in a different basis and communicating every s iterations instead of every iteration. This reduces the number of synchronizations per iteration by a factor of s. It is an idea rediscovered many times. First related work: s-dimensional steepest descent and least squares, Khabaza ('63), Forsythe ('68), Marchuk and Kuznecov ('68). A flurry of work on s-step Krylov methods appeared in the 80s and early 90s: see, e.g., Van Rosendale (1983) and Chronopoulos and Gear (1989). There has been a resurgence of interest in recent years due to growing problem sizes and the growing relative cost of communication.

s-step CG. Key observation: after iteration i, for $j \in \{0, \dots, s\}$,
$x_{i+j} - x_i,\; r_{i+j},\; p_{i+j} \in K_{s+1}(A, p_i) + K_s(A, r_i)$.
s steps of s-step CG:
- Expand the solution space s dimensions at once: compute a basis matrix Y such that $\mathrm{span}(Y) = K_{s+1}(A, p_i) + K_s(A, r_i)$, satisfying the recurrence $AY = YB$.
- Compute the inner products between basis vectors in one synchronization: $G = Y^T Y$.
- Perform s iterations of vector updates by updating coordinates in the basis Y: $x_{i+j} - x_i = Y x'_j$, $r_{i+j} = Y r'_j$, $p_{i+j} = Y p'_j$. (A small numerical check of the key observation follows below.)
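A quick numerical check of the key observation on an assumed small SPD test matrix (illustrative, not the talk's code): after a few HSCG steps produce x_i, r_i, p_i, the vectors generated by the next s steps lie, up to rounding, in the span of the monomial basis [p_i, Ap_i, ..., A^s p_i, r_i, Ar_i, ..., A^{s-1} r_i].

```python
import numpy as np

def hscg_iterates(A, b, nmax):
    """Return lists of HSCG iterates x_i, r_i, p_i for the membership check below."""
    x = np.zeros_like(b); r = b - A @ x; p = r.copy()
    xs, rs, ps = [x.copy()], [r.copy()], [p.copy()]
    for _ in range(nmax):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)
        x = x + alpha * p
        r_new = r - alpha * Ap
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p
        r = r_new
        xs.append(x.copy()); rs.append(r.copy()); ps.append(p.copy())
    return xs, rs, ps

rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((60, 60)))
A = Q @ np.diag(np.linspace(1, 50, 60)) @ Q.T     # assumed SPD test matrix
b = rng.standard_normal(60)

s, i = 4, 3
xs, rs, ps = hscg_iterates(A, b, i + s)
# monomial basis of K_{s+1}(A, p_i) + K_s(A, r_i)
cols = [np.linalg.matrix_power(A, k) @ ps[i] for k in range(s + 1)]
cols += [np.linalg.matrix_power(A, k) @ rs[i] for k in range(s)]
Y = np.column_stack(cols)
for j in range(1, s + 1):
    for v in (xs[i + j] - xs[i], rs[i + j], ps[i + j]):
        coeff = np.linalg.lstsq(Y, v, rcond=None)[0]
        print(j, np.linalg.norm(Y @ coeff - v))   # small, on the rounding-error scale
```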

s-step CG. For s iterations of updates, the inner products and SpMVs (in the basis Y) can be computed independently by each processor without communication: an SpMV becomes $Ap_{i+j} = AYp'_j = Y(Bp'_j)$, replacing work on length-n vectors by work on length-O(s) coordinate vectors, and an inner product becomes $(r_{i+j}, r_{i+j}) = r'^T_j Y^T Y r'_j = r'^T_j G r'_j$ (demonstrated in the sketch below).
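A small numpy demonstration of the two kernels in coordinates, using a single assumed monomial Krylov block for simplicity (illustrative only): with G = YᵀY computed once, an inner product of vectors represented in the basis needs only their short coordinate vectors, and multiplication by A becomes multiplication by the small matrix B.

```python
import numpy as np

rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.standard_normal((80, 80)))
A = Q @ np.diag(np.linspace(1, 20, 80)) @ Q.T      # assumed SPD test matrix
v = rng.standard_normal(80)

s = 4
# monomial Krylov basis Y = [v, Av, ..., A^s v] and its shift matrix B (A Y e_j = Y e_{j+1})
Y = np.column_stack([np.linalg.matrix_power(A, k) @ v for k in range(s + 1)])
B = np.diag(np.ones(s), -1)                        # (s+1) x (s+1) shift matrix
G = Y.T @ Y                                        # one block of inner products

# coordinate vectors with last entry zero, so A times the vector stays inside the basis
a = np.r_[rng.standard_normal(s), 0.0]
c = np.r_[rng.standard_normal(s), 0.0]

# inner product via the Gram matrix: (Ya, Yc) = a^T G c, no new global reduction
print(abs((Y @ a) @ (Y @ c) - a @ G @ c))

# SpMV in coordinates: A (Y c) = Y (B c), no communication and only O(s) work
print(np.linalg.norm(A @ (Y @ c) - Y @ (B @ c)))
```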

s-step CG. The outer loop computes the basis (O(s) SpMVs) and the Gram matrix $G_k = Y_k^T Y_k$ (O(s²) inner products, one synchronization); the inner loop performs s iterations of local vector updates with no communication:
r_0 = b - Ax_0, p_0 = r_0
for k = 0:nmax/s
    Compute Y_k and B_k such that A Y_k = Y_k B_k and span(Y_k) = K_{s+1}(A, p_{sk}) + K_s(A, r_{sk})
    G_k = Y_k^T Y_k
    x'_0 = 0, r'_0 = e_{s+2}, p'_0 = e_1
    for j = 1:s
        α_{sk+j-1} = (r'^T_{j-1} G_k r'_{j-1}) / (p'^T_{j-1} G_k B_k p'_{j-1})
        x'_j = x'_{j-1} + α_{sk+j-1} p'_{j-1}
        r'_j = r'_{j-1} - α_{sk+j-1} B_k p'_{j-1}
        β_{sk+j} = (r'^T_j G_k r'_j) / (r'^T_{j-1} G_k r'_{j-1})
        p'_j = r'_j + β_{sk+j} p'_{j-1}
    end
    [x_{s(k+1)} - x_{sk}, r_{s(k+1)}, p_{s(k+1)}] = Y_k [x'_s, r'_s, p'_s]
end

Sources of local roundoff error in s-step CG.
Computing the s-step Krylov subspace basis: $A\hat{Y}_k = \hat{Y}_k B_k + \Delta Y_k$ (error in computing the s-step basis).
Updating coordinate vectors in the inner loop: $\hat{x}'_{k,j} = \hat{x}'_{k,j-1} + \hat{q}_{k,j-1} + \xi_{k,j}$ and $\hat{r}'_{k,j} = \hat{r}'_{k,j-1} - B_k \hat{q}_{k,j-1} + \eta_{k,j}$, with $\hat{q}_{k,j-1} = \mathrm{fl}(\hat{\alpha}_{sk+j-1} \hat{p}'_{k,j-1})$ (error in updating the coefficient vectors).
Recovering the CG vectors for use in the next outer loop: $\hat{x}_{sk+j} = \hat{Y}_k \hat{x}'_{k,j} + \hat{x}_{sk} + \phi_{sk+j}$ and $\hat{r}_{sk+j} = \hat{Y}_k \hat{r}'_{k,j} + \psi_{sk+j}$ (error in the basis change).

Attainable accuracy of s-step CG. We can write the gap between the true and updated residuals, f, in terms of these errors:
$f_{sk+j} = f_0 - \sum_{l=0}^{k-1}\left[ A\phi_{sl+s} + \psi_{sl+s} + \sum_{i=1}^{s}\left( A\hat{Y}_l \xi_{l,i} + \hat{Y}_l \eta_{l,i} - \Delta Y_l \hat{q}_{l,i-1} \right) \right] - A\phi_{sk+j} - \psi_{sk+j} - \sum_{i=1}^{j}\left( A\hat{Y}_k \xi_{k,i} + \hat{Y}_k \eta_{k,i} - \Delta Y_k \hat{q}_{k,i-1} \right)$.
Using standard rounding error results, this allows us to obtain an upper bound on $\|f_{sk+j}\|$.

Attainable accuracy of s-step CG. Recall $f_i \equiv b - A\hat{x}_i - \hat{r}_i$.
For classical CG: $\|f_i\| \le \|f_0\| + \varepsilon \sum_{m=1}^{i} (1 + N) \left( \|A\| \|\hat{x}_m\| + \|\hat{r}_m\| \right)$.
For s-step CG, with $i = sk + j$: $\|f_{sk+j}\| \le \|f_0\| + \varepsilon\, c\, \bar{\Gamma}_k \sum_{m=1}^{sk+j} (1 + N) \left( \|A\| \|\hat{x}_m\| + \|\hat{r}_m\| \right)$,
where c is a low-degree polynomial in s, and $\bar{\Gamma}_k = \max_{l \le k} \Gamma_l$ with $\Gamma_l = \| |\hat{Y}_l^{+}| \| \cdot \| |\hat{Y}_l| \|$ (see C., 2015).

s-step CG [figures]: convergence of s-step CG with the monomial basis $Y = [p_i, Ap_i, \dots, A^s p_i, r_i, Ar_i, \dots, A^{s-1} r_i]$. One can also use other, more well-conditioned bases to improve the convergence rate and accuracy (see, e.g., Philippe and Reichel, 2012).

s-step CG. Even assuming perfect parallel scalability in s (which is usually not the case, due to the extra SpMVs and inner products), already at s = 4 the delay of convergence means we are worse off than HSCG in terms of the total number of synchronizations!

A different problem: A = nos4 from the UFSMC (N = 100, κ(A) ≈ 2e3), b with equal components in the eigenbasis of A and ||b|| = 1. If the application only requires $\|x - x_i\|_A \le 10^{-10}$, any of these methods will work!

Performance benefit from s-step BiCGStab: speedups for real applications. An s-step BiCGStab bottom-solver was implemented in BoxLib (an AMR framework from LBL) and used in the Low Mach Number Combustion Code (LMC), a gas-phase combustion simulation. Comparing GMG with BiCGStab vs. GMG with s-step BiCGStab (s = 4) on a Cray XE6 for two different applications gave up to a 2.5x speedup in the bottom solve and up to 1.5x in the overall MG solve (LMC 3D mac_project solve, flat MPI and MPI+OpenMP runs; see Williams et al., IPDPS 2014).

Conclusions and takeaways. Think of the bigger picture: much focus is placed on modifying methods to speed up iterations, but the speed of an iteration is only part of the runtime, runtime = (time/iteration) x (# iterations), and a solver that can't solve to the required accuracy is useless! Solver runtime is itself only part of a larger scientific code. The design and implementation of iterative solvers requires a holistic approach: selecting the right method, parameters, and stopping criteria, and selecting the right preconditioner (closely linked with the discretization; see Málek and Strakoš, 2015).