Finite-choice algorithm optimization in Conjugate Gradients

Jack Dongarra and Victor Eijkhout

January 2003

LAPACK Working Note 159; University of Tennessee Computer Science report UT-CS-03-502. Innovative Computing Laboratory, Department of Computer Science, University of Tennessee, Knoxville, TN 37996.

Abstract. We present computational aspects of mathematically equivalent implementations of the Conjugate Gradient method. Since the implementations have different performance characteristics but compute the same quantities, an adaptive implementation can pick the optimal choice at run time.

1 Introduction

Recent times have seen the emergence of software that dynamically adapts itself to its computational environment. One example is ATLAS [9], where upon installation a generator program produces and times many equivalent implementations of BLAS level 3 routines, keeping the most efficient one. This process consumes up to several hours, but only once per installation; linking to the generated library can improve the performance of a code by several factors.

Here we investigate the issue of optimizing Conjugate Gradient methods. The choice between different but mathematically equivalent implementations can be made at run time rather than at installation time as in the ATLAS case. Another important difference between our case and ATLAS is that the search space in ATLAS is multi-dimensional, with so much freedom along each dimension that we can essentially consider it infinite. The choices among different implementations of CG, by contrast, are very finite. They can be realized by simple switches in one overarching implementation. What is more, since all implementations compute the same quantities, switching between one implementation and another can be done during a run, without restarting.
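As an illustration of this kind of run-time selection, consider the following minimal sketch (our own, not the authors' implementation): the variant classes ClassicalCG, SaadMeurantCG, ChronopoulosGearCG, EijkhoutCG and their step/export_state interface are hypothetical placeholders. The idea is to time a few iterations of each mathematically equivalent variant and then continue the same solve with the fastest one.

```python
import time

def adaptive_cg(A, b, M, variants, trial_iters=5, tol=1e-8, maxit=10000):
    """Pick a CG variant at run time (sketch with a hypothetical interface).
    Because all variants compute the same scalar and vector quantities, the
    iteration state (x, r, p, M^{-1}r) can be handed from one variant to the
    next without restarting the solve."""
    state = None
    timings = {}
    for Variant in variants:                     # e.g. [ClassicalCG, SaadMeurantCG, ...]
        solver = Variant(A, b, M, state=state)   # resume from the shared iteration state
        t0 = time.perf_counter()
        for _ in range(trial_iters):
            solver.step()                        # one CG iteration
        timings[Variant] = time.perf_counter() - t0
        state = solver.export_state()
    best = min(timings, key=timings.get)         # fastest variant on this machine
    solver = best(A, b, M, state=state)
    while solver.residual_norm() > tol and solver.iterations() < maxit:
        solver.step()
    return solver.solution()
```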

The variants of CG we consider are all geared towards optimization in the case of parallel execution. Several authors [1, 2, 4, 6] have proposed numerically equivalent implementations of the Conjugate Gradient method that rearrange the inner product computations, usually at the expense of some extra computation. The new arrangements of inner products are such that all inner products in a single iteration can combine their communication stages. On parallel architectures with a large latency this can give appreciable savings in overall execution time. We also consider an implementation of IC-preconditioned CG by van der Vorst [3] which overlaps one inner product with half of the preconditioner application, effectively hiding its communication time.

Since all these methods compute the same scalar and vector quantities in each iteration, a dynamic implementation can evaluate performance characteristics of the architecture during the first few iterations, and finish the process with what is perceived to be the optimal method for the given input parameters. Our choice of CG is no intrinsic limitation of the techniques described; the inner product elimination strategies of [1, 2, 4, 6] can be applied immediately to BiCG (and hence to QMR) and with slightly more effort to such methods as BiCGstab.

2 Variants of Conjugate Gradients

In this section we consider several rearrangement strategies for the Conjugate Gradient method. The Conjugate Gradient method in its original form [5] has two inter-dependent inner products: each inner product is used to compute a vector, which in turn is needed for the other inner product. In a parallel context, each inner product can potentially take a time disproportionate to its scalar operation count, and the thought of rearranging the method to combine the inner products is then a natural one.

The Saad/Meurant, Chronopoulos/Gear, and Eijkhout variants of CG below proceed by combining inner products; they involve more scalar operations than the original method. Since all methods have the same convergence speed in terms of numbers of iterations, on a single processor these variants clearly offer no advantage over the classical formulation. On a large number of processors and a high-latency network they will probably offer an advantage, offsetting the extra operations with decreased communication time.

2.1 The Saad/Meurant method

A first variant, proposed by Saad [7], uses

    $r_{i+1}^t M^{-1} r_{i+1} + r_i^t M^{-1} r_i = \alpha_i^2 \, (Ap_i)^t M^{-1} (Ap_i),$    (1)

so the term $r_{i+1}^t M^{-1} r_{i+1}$ can be computed recursively without the need for an inner product, and before the vector $r_{i+1}$ is actually constructed. Instead of computing $r_i^t M^{-1} r_i$, we compute $(Ap_i)^t M^{-1} (Ap_i)$ at the same time we compute $p_i^t A p_i$. The vector $M^{-1} r_i$ is now computed by an extra recurrence $M^{-1} r_{i+1} = M^{-1} r_i - \alpha_i M^{-1} A p_i$.

However, this method illustrates the potential dangers of rearranging operations: the variant methods are not guaranteed to be as stable as the original. Indeed, this method is unstable in many cases.
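Relation (1) can be checked directly. The small NumPy experiment below (our illustration, not part of the paper; the random SPD matrix and Jacobi preconditioner are arbitrary choices) runs a few steps of standard preconditioned CG and compares the recursively predicted value of $r_{i+1}^t M^{-1} r_{i+1}$ with the explicitly computed one.

```python
import numpy as np

# Small SPD test problem and a Jacobi preconditioner (illustrative choices).
rng = np.random.default_rng(0)
n = 50
Q = rng.standard_normal((n, n))
A = Q @ Q.T + n * np.eye(n)          # symmetric positive definite
b = rng.standard_normal(n)
Minv = np.diag(1.0 / np.diag(A))     # M^{-1} for a Jacobi preconditioner

x = np.zeros(n)
r = b - A @ x
z = Minv @ r
p = z.copy()
rho = r @ z                          # rho_i = r_i^t M^{-1} r_i

for i in range(5):
    Ap = A @ p
    alpha = rho / (p @ Ap)
    # Relation (1): predict the next rho before r_{i+1} is constructed.
    rho_pred = -rho + alpha**2 * (Ap @ Minv @ Ap)
    x += alpha * p
    r -= alpha * Ap
    z = Minv @ r
    rho_new = r @ z                  # explicitly computed rho_{i+1}
    print(f"iter {i}: predicted {rho_pred:.6e}  explicit {rho_new:.6e}")
    beta = rho_new / rho
    p = z + beta * p
    rho = rho_new
```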

The first variant we actually consider for implementation is a modification by Meurant [6] of Saad's method; in practice this modification has been shown to be as stable as the original method. Meurant proposed to use equation (1) only as a predictor, whose value is used in $\beta_i$; after $r_{i+1}$ and $M^{-1} r_{i+1}$ have been constructed, the value of $r_{i+1}^t M^{-1} r_{i+1}$ is computed as an explicit inner product, to be used in $\alpha_{i+1}$, combined with the other two inner products. We give the structure of the algorithm in figure 1.

    From $p_i$ form $Ap_i$ and $M^{-1} Ap_i$.
    Now compute simultaneously the inner products $p_i^t A p_i$, $r_i^t M^{-1} r_i$, and $(Ap_i)^t M^{-1} (Ap_i)$;
    with $r_i^t M^{-1} r_i$ compute $\alpha_i = r_i^t M^{-1} r_i / p_i^t A p_i$,
    and the value of the inner product $r_{i+1}^t M^{-1} r_{i+1} = -r_i^t M^{-1} r_i + \alpha_i^2 (Ap_i)^t M^{-1} (Ap_i)$.
    With this, also compute $\beta_i = r_{i+1}^t M^{-1} r_{i+1} / r_i^t M^{-1} r_i$.
    Save the vector $M^{-1} Ap_i$.
    Update $r_{i+1} = r_i - \alpha_i Ap_i$ and $M^{-1} r_{i+1} = M^{-1} r_i - \alpha_i M^{-1} Ap_i$.
    Update $p_{i+1} = M^{-1} r_{i+1} + \beta_i p_i$.

    Figure 1: Meurant's modification of Saad's CG method.

2.2 The Chronopoulos and Gear method

The second variant we consider was published by Chronopoulos and Gear [1] and eliminates the other inner product, $p^t A p$. This method was later independently rediscovered by D'Azevedo, Eijkhout, and Romine [2], who proved that it is as stable as the original CG method. The basic relation for this variant,

    $p_i^t A p_i = r_i^t M^{-t} A M^{-1} r_i - \beta_i^2 \, p_{i-1}^t A p_{i-1},$    (2)

replaces the $p^t A p$ inner product by $r^t M^{-t} A M^{-1} r$, which can be computed combined with the $r^t M^{-1} r$ inner product. Also, we now construct $Ap_i$ recursively from $Ap_{i+1} = A M^{-1} r_{i+1} + \sum_{k=1}^{i} Ap_k u_{k,i+1}$ rather than explicitly by a matrix-vector product. We give the structure of the algorithm in figure 2.

    From $r_i$ and $M^{-1} r_i$ compute simultaneously $r_i^t M^{-1} r_i$ and $(M^{-1} r_i)^t A M^{-1} r_i$;
    with this, compute $\beta_i = r_i^t M^{-1} r_i / r_{i-1}^t M^{-1} r_{i-1}$.
    Recursively compute $p_i^t A p_i$ from $r_i^t M^{-t} A M^{-1} r_i$ and $p_{i-1}^t A p_{i-1}$ by
    $p_i^t A p_i = r_i^t M^{-t} A M^{-1} r_i - \beta_i^2 \, p_{i-1}^t A p_{i-1}$.
    With $p_i^t A p_i$ compute $\alpha_i = r_i^t M^{-1} r_i / p_i^t A p_i$, and use this to update $x_i$ and $r_i$ to $x_{i+1}$ and $r_{i+1}$.
    Apply the preconditioner to form $M^{-1} r_{i+1}$, then the matrix to form $A M^{-1} r_{i+1}$.
    Now construct $Ap_{i+1}$ recursively by $Ap_{i+1} = A M^{-1} r_{i+1} + \beta_i Ap_i$.

    Figure 2: Chronopoulos and Gear's method.
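For concreteness, here is a serial NumPy sketch of a preconditioned CG iteration rearranged along the lines of equation (2) (our own illustration, not the authors' code). The point is that the two inner products $\rho$ and $\zeta$ are formed from the same pair of vectors, so on a parallel machine their global reductions could share one communication.

```python
import numpy as np

def cg_chronopoulos_gear(A, Minv, b, tol=1e-10, maxit=200):
    """Preconditioned CG with the Chronopoulos/Gear rearrangement of eq. (2):
    p^t A p is obtained recursively from zeta = (M^{-1}r)^t A (M^{-1}r), so the
    two inner products of an iteration (rho and zeta) use the same vectors and
    their reductions could be combined on a parallel machine.  Serial sketch."""
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x
    u = Minv @ r                 # u = M^{-1} r
    w = A @ u                    # w = A M^{-1} r
    rho = r @ u                  # rho  = r^t M^{-1} r
    zeta = u @ w                 # zeta = (M^{-1}r)^t A (M^{-1}r)
    p = u.copy()
    s = w.copy()                 # s = A p, maintained recursively
    pi = zeta                    # pi = p^t A p (first iteration: p = u)
    for _ in range(maxit):
        alpha = rho / pi
        x += alpha * p
        r -= alpha * s
        if np.linalg.norm(r) < tol:
            break
        u = Minv @ r
        w = A @ u
        rho_new = r @ u
        zeta = u @ w             # the two reductions that would be combined
        beta = rho_new / rho
        pi = zeta - beta**2 * pi         # eq. (2): p^tAp = zeta - beta^2 * old p^tAp
        p = u + beta * p
        s = w + beta * s                 # A p updated without an extra matrix-vector product
        rho = rho_new
    return x
```

On a single processor this does the same work as classical preconditioned CG plus one extra vector update for $Ap$, matching the operation counts summarized in figure 3.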

2.3 The Eijkhout variant

Eijkhout has proposed [2, 4] a method that is very similar to the Chronopoulos and Gear one. It is also based on recursive calculation of $p^t A p$ from an actually constructed inner product $(M^{-1} r)^t A (M^{-1} r)$. Equation (2) is then replaced by

    $p_i^t A p_i = r_i^t M^{-t} A M^{-1} r_i + \beta_i \, (M^{-1} r_i)^t (Ap_{i-1}).$    (3)

This requires the recursive construction of the $Ap_i$ vectors from the actual matrix-vector product $A M^{-1} r_i$. Additionally, there is now a third inner product, $(M^{-1} r)^t (Ap)$, in each iteration. This inner product can be combined with the already existing ones, so there is only an increase in the scalar work per iteration, not in the communication behaviour. There is no stability analysis for this method, but in tests it has appeared as stable as or better than the Chronopoulos/Gear variant.

2.4 Van der Vorst's method

Finally, for the special case of preconditioning by a block Jacobi method with an Incomplete Cholesky solve of the blocks, we consider a method by van der Vorst [3] which overlaps one inner product computation with the second half of the Cholesky solve. The crucial observation here is that, in the presence of a Cholesky preconditioner $M = LL^t$, the inner product $r^t M^{-1} r$ can be computed as $(L^{-1} r)^t (L^{-1} r)$, that is, before the preconditioned residual $M^{-1} r$ has been formed completely. The method then is:

    construct $s = L^{-1} r$;
    calculate $z = L^{-t} s$, and simultaneously $z^t r = s^t s$.
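The identity behind the overlap, $r^t M^{-1} r = (L^{-1} r)^t (L^{-1} r)$, is easy to verify. In the sketch below (our illustration, with a complete Cholesky factor standing in for an incomplete one), the value $s^t s$ is available as soon as the forward solve is done, which is what allows its reduction to be overlapped with the backward solve on a parallel machine.

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(1)
n = 30
G = rng.standard_normal((n, n))
A = G @ G.T + n * np.eye(n)
L = np.linalg.cholesky(A)        # stand-in for an incomplete Cholesky factor, M = L L^t
r = rng.standard_normal(n)

s = solve_triangular(L, r, lower=True)       # forward solve: s = L^{-1} r
rho_early = s @ s                            # r^t M^{-1} r already available here ...
z = solve_triangular(L.T, s, lower=False)    # ... while the backward solve z = L^{-t} s runs
print(np.allclose(rho_early, z @ r))         # True: z^t r == s^t s
```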

By heuristic reasoning, this method is exactly as stable as the original method; in fact, since it uses symmetry to a larger extent than the original method, it may well be more stable.

While the other methods presented here can be generalized beyond CG, for instance to the BiCG method, this method essentially relies on the symmetry of the preconditioner and on Arnoldi orthogonalization, as opposed to Lanczos orthogonalization in the BiCG method. Thus, no generalization to methods other than CG offers itself. However, by combining van der Vorst's overlapping trick with the Saad/Meurant method, we find a variant of CG where all inner product communications can be hidden.

2.5 Norm calculation

For all methods except van der Vorst's and Meurant's (note that this includes the original formulation), communication latency can be eliminated in the norm calculation that is performed for the convergence test. These CG methods feature the following lines:

    calculate $\|r\|$; and possibly terminate the iteration
    calculate $z = M^{-1} r$
    calculate $z^t r$

Rearranging this as

    calculate $z = M^{-1} r$
    calculate combined $\|r\|$ and $z^t r$; and possibly terminate the iteration

combines the norm and inner product calculations, but in the final iteration one superfluous inner product is computed. This variant may be an improvement if the savings in the inner product computation outweigh the loss due to the extra inner product. This will often be the case, but not when there are no savings in the inner products because the communication costs are negligible, or when the preconditioner is relatively expensive and the number of iterations is low. Of course, if the stopping test is on $z^t r$ there is no separate norm calculation; with a test on $\|z\|$ the norm calculation can be combined with the inner product.
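A sketch of the combined reduction follows (our illustration; it assumes an mpi4py environment with the vectors already distributed, which is not described in the paper).

```python
import numpy as np
from mpi4py import MPI   # assumes an MPI environment; illustration only

def fused_norm_and_dot(r_local, z_local, comm):
    """One combined all-reduce delivering both ||r||^2 and z^t r.
    The classical ordering needs two separate latency-bound reductions (one
    for the convergence test, one for rho); fusing them trades that for a
    single message plus, in the final iteration, one superfluous inner
    product, as discussed in section 2.5."""
    local = np.array([r_local @ r_local, z_local @ r_local])
    glob = np.empty(2)
    comm.Allreduce(local, glob, op=MPI.SUM)
    return np.sqrt(glob[0]), glob[1]
```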

2.6 Block orthogonalization of residuals

The orthogonal residuals of the CG method are linear combinations of a Krylov sequence based on the first residual. Thus, communication latency could be lessened by computing explicitly a number $s$ of Krylov vectors, and orthogonalizing them en bloc [1, 8]. The clearest disadvantage of this approach is instability that grows with the number of Krylov vectors computed. The computational benefits are a mixed story. If all inner products are computed simultaneously, in essence using a classical Gram-Schmidt method for orthogonalization, the influence of latency is greatly diminished. The QR factorization used for the Gram-Schmidt orthogonalization is a BLAS level 3 kernel, hence more efficient in execution than the inner products in the original method. On the other hand, while in CG a residual is only orthogonalized against the previous two, in this method it has to be orthogonalized against all $s-1$ other vectors. Hence, the scalar computation cost is higher than that of the original method: $O(s^2)$ inner products as opposed to $O(s)$. We expect that this method may be advantageous on machines that have a large difference between BLAS level 1 and level 3 performance.
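A minimal sketch of the block idea (our illustration; the monomial basis and the dense test matrix are arbitrary choices): build $s$ Krylov vectors explicitly and orthonormalize them in one block QR, a BLAS level 3 operation, instead of one vector and one latency-bound orthogonalization per iteration.

```python
import numpy as np

def krylov_block_orthonormal(A, r0, s):
    """Build the Krylov block [r0, A r0, ..., A^{s-1} r0] and orthonormalize it
    with a single QR factorization.  The monomial basis generated here becomes
    ill-conditioned quickly as s grows, which is the instability noted above."""
    V = np.empty((r0.shape[0], s))
    V[:, 0] = r0
    for k in range(1, s):
        V[:, k] = A @ V[:, k - 1]
    Q, _ = np.linalg.qr(V)           # one block (BLAS level 3) orthogonalization
    return Q

# Example use on a small SPD test matrix.
A = np.diag(np.linspace(1.0, 100.0, 200))
r0 = np.random.default_rng(2).standard_normal(200)
Q = krylov_block_orthonormal(A, r0, 8)
print(np.allclose(Q.T @ Q, np.eye(8)))   # orthonormal basis of the Krylov block
```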

| Operation | Classical | Saad/Meurant | Chronopoulos/Gear | Eijkhout |
|---|---|---|---|---|
| Norm calculation | error = r^t r | | | |
| Preconditioner application | z ← M^{-1} r | z ← z − αq | z ← M^{-1} r | id |
| Matrix-vector product | | | az ← A z | id |
| Inner products 1 | ρ ← z^t r | ρ_predict ← −ρ_true + α²µ | error = r^t r; ρ ← z^t r; ζ ← z^t az | error = r^t r; ρ ← z^t r; ζ ← z^t az; ε ← (M^{-1}r)^t (Ap) |
| | β ← ρ/ρ_old | β ← ρ_predict/ρ_old | β ← ρ/ρ_old | id |
| Search direction update | p ← z + βp | id | id | id |
| Matrix-vector product | ap ← A p | id | ap ← az + β·ap | id |
| Preconditioner application | | q ← M^{-1} ap | | |
| Inner products 2 | π ← p^t ap | π ← p^t ap; µ ← ap^t q; error = r^t r; ρ_true ← z^t r | π ← ζ − β²π | π ← ζ + βε |
| | α ← ρ/π | α ← ρ_true/π | α ← ρ/π | id |
| Residual update | r ← r − α·ap | id | id | id |
| Inner products per iteration | 3 separate | 4 combined | 3 combined | 4 combined |
| Extra vector updates | | 1 extra vector update | id | id |

Figure 3: Structure of the inner loop of the CG variants (id: same as the entry in the column to its left; an empty cell means the operation does not occur at that point in the variant).

References

[1] Chronopoulos, A., and Gear, C. s-step iterative methods for symmetric linear systems. Journal of Computational and Applied Mathematics 25 (1989), 153-168.

[2] D'Azevedo, E., Eijkhout, V., and Romine, C. LAPACK Working Note 56: Reducing communication costs in the conjugate gradient algorithm on distributed memory multiprocessor. Tech. Rep. CS-93-185, Computer Science Department, University of Tennessee, Knoxville, 1993.

[3] Demmel, J., Heath, M., and van der Vorst, H. Parallel numerical linear algebra. In Acta Numerica 1993. Cambridge University Press, Cambridge, 1993.

[4] Eijkhout, V. LAPACK Working Note 51: Qualitative properties of the conjugate gradient and Lanczos methods in a matrix framework. Tech. Rep. CS-92-170, Computer Science Department, University of Tennessee, 1992.

[5] Hestenes, M., and Stiefel, E. Methods of conjugate gradients for solving linear systems. Nat. Bur. Stand. J. Res. 49 (1952), 409-436.

[6] Meurant, G. Multitasking the conjugate gradient method on the CRAY X-MP/48. Parallel Computing 5 (1987), 267-280.

[7] Saad, Y. Practical use of polynomial preconditionings for the conjugate gradient method. SIAM J. Sci. Stat. Comput. 6 (1985), 865-881.

[8] van Rosendale, J. Minimizing inner product data dependencies in conjugate gradient iteration. Tech. Rep. 172178, ICASE, NASA Langley Research Center, Hampton, Virginia, 1983.

[9] Whaley, R. C., Petitet, A., and Dongarra, J. J. Automated empirical optimization of software and the ATLAS project. To appear in Parallel Computing (2001). Also available as University of Tennessee LAPACK Working Note 147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.ps).