SOR as a Preconditioner

A Dissertation Presented to
The Faculty of the School of Engineering and Applied Science
University of Virginia

In Partial Fulfillment of the Requirements for the Degree
Doctor of Philosophy (Computer Science)

by Michael A. DeLong

May 1997

Abstract

Preconditioning is usually necessary for CG-type iterative algorithms for the solution of large sparse nonsymmetric linear systems. However, many good preconditioners have only marginal intrinsic parallelism: ILU and SSOR in the natural ordering are essentially sequential algorithms. Reordering can yield very good parallelism, but can cause (severe) degradation in the rate of convergence. We consider multi-step SOR in the red-black ordering as a preconditioner. It can have very good parallel properties, but it does not cause degradation in the overall rate of convergence, as ILU and SSOR do. We present one-processor results using multi-step SOR to precondition GMRES which substantiate this claim. There has been some confusion in the literature as to whether this approach should be effective. We deal with these objections, and discuss the relationship between the spectral properties of the preconditioned system and the convergence of GMRES. We also present results from the Intel Paragon and the IBM SP2 showing that as a preconditioner, multi-step red-black SOR can give good scaled speedup.

Contents

1 Introduction
  1.1 The Conjugate Gradient (CG) Method
  1.2 Nonsymmetric Problems
  1.3 Summary of our results
  1.4 Organization of this thesis

2 Linear Algebra, Test Problems, and Methods
  2.1 Basic linear algebra
    2.1.1 Inner products
    2.1.2 Orthogonality and conjugacy
    2.1.3 Norms
    2.1.4 Bases
    2.1.5 Special matrices
    2.1.6 The Jordan canonical form
  2.2 Three test problems
    2.2.1 Problem 1
    2.2.2 Problem 2
    2.2.3 Problem 3
  2.3 The red-black ordering
  2.4 Two important properties
    2.4.1 Consistently ordered matrices
    2.4.2 Property A
  2.5 The GMRES algorithm
    2.5.1 Arnoldi's method
    2.5.2 The GMRES algorithm
    2.5.3 Restarting GMRES
    2.5.4 Breakdown and stagnation
  2.6 Preconditioning GMRES
    2.6.1 Left preconditioning
    2.6.2 Right preconditioning
  2.7 The SOR iteration
    2.7.1 The Gauss-Seidel and SOR iterations
    2.7.2 SOR in the red-black ordering
    2.7.3 Optimal values for ω
  2.8 SOR as a preconditioner
  2.9 The Bi-CGSTAB algorithm
    2.9.1 Background
    2.9.2 The Bi-Conjugate Gradient (BCG) algorithm
    2.9.3 The Conjugate Gradients Squared (CGS) algorithm
    2.9.4 The Bi-CGSTAB algorithm
    2.9.5 Preconditioning the CGS and Bi-CGSTAB algorithms
  2.10 Summary

3 Serial Results
  3.1 Implementation
    3.1.1 The compressed sparse row (CSR) format
    3.1.2 Reordering
  3.2 Gauss-Seidel vs. ILU(0) as a preconditioner
  3.3 The effect of ω
  3.4 The effect of …
  3.5 Results for Problems 2 and 3
  3.6 Experiments using Bi-CGSTAB
  3.7 Summary and conclusions

4 Theory
  4.1 Convergence of CG
  4.2 Convergence of GMRES
    4.2.1 Residual norm bounds
    4.2.2 "Superlinear" convergence
  4.3 Form of the preconditioned matrix $\hat{A}$
  4.4 Convergence of GMRES for $\hat{A}$
    4.4.1 The Gauss-Seidel case: ω = 1
    4.4.2 $\sigma(\hat{A})$ lying in a circle: ω ≥ ω_opt
    4.4.3 The "suboptimal" ω case: 1 < ω < ω_opt
  4.5 Tighter bounds
  4.6 Summary

5 Parallel Implementation Issues
  5.1 Hardware
    5.1.1 Intel Paragon
    5.1.2 IBM SP2
  5.2 Fortran 77
  5.3 Message Passing Interface (MPI)
  5.4 Minimizing paging on the Paragon
  5.5 Data structures
  5.6 Parallel data distribution
  5.7 Speedup and scaled speedup
    5.7.1 Speedup
    5.7.2 Scaled Speedup
  5.8 Summary

6 Parallel Results
  6.1 Problem Sizes
  6.2 Choosing GMRES parameters
  6.3 Estimated and actual work
  6.4 Experiments on the Intel Paragon
  6.5 Experiments on the IBM SP2
  6.6 Summary

7 Summary
  7.1 Summary and conclusions
  7.2 Open problems and future directions

Chapter 1

Introduction

In this dissertation we address the problem of finding parallel preconditioners for Conjugate Gradient type methods for solving large, sparse, nonsymmetric linear systems. Our goal is to show that the SOR iteration provides an effective preconditioner for an important class of problems, and is highly parallel when used in the red-black ordering. In this chapter we briefly overview the problem of using iterative methods to solve symmetric systems of linear equations, and use it as an introduction to the more difficult problem of solving sparse nonsymmetric systems of linear equations. We then outline the rest of the dissertation.

Given an n × n system of linear equations

$$Ax = b \quad (1)$$

where A is large and sparse, such as arises from the finite difference or finite element discretization of elliptic partial differential equations, direct methods such as Gaussian Elimination have storage and computational requirements that make them prohibitively expensive when (1) arises from the discretization of three-dimensional partial differential equations. One alternative is to use iterative methods, which generally have more modest storage requirements and may also require less time to produce a solution of comparable accuracy, depending on the method and the problem. They usually also have better parallel properties. One such iterative method is the Conjugate Gradient (CG) method.

1.1 The Conjugate Gradient (CG) Method

If A is real and symmetric ($A^T = A$) and positive definite ($x^T A x > 0$ for any nonzero vector x), then (1) can be solved using the Conjugate Gradient (CG) method [HS52], shown in Figure 1. In this figure, and henceforth, we use Greek letters for scalars (e.g. $\alpha$ and $\beta$), upper-case letters for matrices (A), and bold lower-case letters for vectors, including the iterates ($x_k$), residuals ($r_k$), and direction vectors ($p_k$).

1. Choose $x_0$. Set $p_0 = r_0 = b - Ax_0$.
2. For $k = 0, 1, 2, \ldots$ until convergence do:
   (a) $\alpha_k = -\,r_k^T r_k / p_k^T A p_k$
   (b) $x_{k+1} = x_k - \alpha_k p_k$
   (c) $r_{k+1} = r_k + \alpha_k A p_k$
   (d) If $r_{k+1}^T r_{k+1} < \epsilon$ then stop.
   (e) $\beta_k = r_{k+1}^T r_{k+1} / r_k^T r_k$
   (f) $p_{k+1} = r_{k+1} + \beta_k p_k$

Figure 1: The Conjugate Gradient algorithm

It is well-known (see e.g. [GO93]) that the CG algorithm has three desirable properties:

- It is guaranteed to converge in at most n iterations. If A has $m \le n$ distinct eigenvalues, it is guaranteed to converge in at most m iterations.
- If x is the exact solution to (1) and $x_k$ is the approximation to x generated by the CG method at iteration k, then the error $x - x_k$ is minimized in some norm at each iteration k.
- The iterates are related by short recurrences. Rutishauser (see e.g. [Ort88], Appendix 3) has shown that they are related by a three-term recurrence.

The finite termination property is only of theoretical interest, since for large n convergence is generally needed in far fewer than n iterations. As a result, it is usually desirable to accelerate the rate of convergence of the CG algorithm by preconditioning, which transforms the original system to a new system

$$Ax = b \;\Rightarrow\; SAS^T (S^{-T} x) = Sb \quad (2)$$

having the same solution. The term preconditioning is generally credited to Turing [Tur48], and while the idea of preconditioning the CG algorithm was first presented by Hestenes [Hes56], the term and the idea were not brought together until the papers by Evans [Eva73] and Axelsson [Axe74]. It is not necessary to form S or $SAS^T$ explicitly; instead, the preconditioning is usually incorporated into the CG algorithm. This is shown in Figure 2, where $M = (S^T S)^{-1}$. Since the CG algorithm requires that the matrix A be symmetric and positive-definite, the preconditioned matrix $\hat{A} = SAS^T$ must also be symmetric and positive-definite.

1. Choose $x_0$ and compute $r_0 = b - Ax_0$. Solve $Mz_0 = r_0$. Set $p_0 = z_0$.
2. For $k = 0, 1, 2, \ldots$ until convergence do:
   (a) $\alpha_k = -\,r_k^T p_k / p_k^T A p_k$
   (b) $x_{k+1} = x_k - \alpha_k p_k$
   (c) $r_{k+1} = r_k + \alpha_k A p_k$
   (d) Solve $Mz_{k+1} = r_{k+1}$
   (e) $\beta_k = z_{k+1}^T r_{k+1} / z_k^T r_k$
   (f) $p_{k+1} = z_{k+1} + \beta_k p_k$

Figure 2: The Preconditioned Conjugate Gradient Algorithm
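For readers who want to experiment with the iterations in Figures 1 and 2, the following is a minimal NumPy sketch, not the dissertation's Fortran 77 implementation. It follows the sign convention used above; passing M = None reduces it to the unpreconditioned algorithm of Figure 1. The function name, the default tolerance, and the dense numpy.linalg.solve call for the auxiliary system are illustrative choices only.

import numpy as np

def pcg(A, b, M=None, x0=None, tol=1e-8, maxiter=1000):
    """Sketch of the (preconditioned) CG iteration of Figures 1 and 2.

    M is the preconditioning matrix; M=None gives plain CG. The auxiliary
    solves use dense numpy.linalg.solve purely for illustration.
    """
    solve_M = (lambda v: v.copy()) if M is None else (lambda v: np.linalg.solve(M, v))
    x = np.zeros_like(b, dtype=float) if x0 is None else x0.astype(float)
    r = b - A @ x                      # r_0 = b - A x_0
    z = solve_M(r)                     # solve M z_0 = r_0  (z_0 = r_0 for plain CG)
    p = z.copy()                       # p_0 = z_0
    rz = r @ z
    for k in range(maxiter):
        Ap = A @ p
        alpha = -(r @ p) / (p @ Ap)    # alpha_k = -r_k^T p_k / p_k^T A p_k
        x = x - alpha * p              # x_{k+1} = x_k - alpha_k p_k
        r = r + alpha * Ap             # r_{k+1} = r_k + alpha_k A p_k
        if r @ r < tol ** 2:           # stop once r^T r is below the tolerance
            break
        z = solve_M(r)                 # solve M z_{k+1} = r_{k+1}
        rz_new = r @ z
        beta = rz_new / rz             # beta_k = z_{k+1}^T r_{k+1} / z_k^T r_k
        p = z + beta * p               # p_{k+1} = z_{k+1} + beta_k p_k
        rz = rz_new
    return x, k + 1

In this sketch the matrices are dense only to keep the code short; everything except the auxiliary solve is a matrix-vector product or a vector update, which is what makes CG attractive for large sparse problems.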

Thus we have three (not entirely distinct) criteria for a good preconditioner for the CG method:

1. The CG algorithm must converge faster for the preconditioned problem (2) than for the original system. Otherwise there is no gain in performance.
2. The auxiliary system $Mz_{k+1} = r_{k+1}$ must be easy to solve. This is primarily a performance consideration: if the solution of the auxiliary system is expensive, then the preconditioner must be of very high quality to offset the expense.
3. The preconditioned matrix $\hat{A}$ must be symmetric positive definite. If $\hat{A}$ is not symmetric positive definite the CG algorithm may produce nonpositive inner products and break down.

One popular preconditioning method is the Incomplete Cholesky (IC) factorization [Mv77] (Figure 3). This is an incomplete factorization based on the Cholesky factorization for symmetric positive-definite matrices, a variant of Gaussian Elimination. The IC factorization computes

$$M = LL^T, \quad \text{where } A = M - R, \quad (3)$$

and $R \ne 0$ is referred to as the remainder matrix. By discarding fill-in elements as the factorization progresses, this factorization also retains the sparsity of the original matrix. The form given in Figure 3 is the no-fill form, but allowing limited fill-in makes R approach 0 in some sense, and improves the quality of the preconditioner.

$l_{11} = \sqrt{a_{11}}$
For $i = 2$ to $n$
  For $j = 1$ to $i - 1$
    If $a_{ij} = 0$ then $l_{ij} = 0$
    else $l_{ij} = \big( a_{ij} - \sum_{k=1}^{j-1} l_{ik} l_{jk} \big) / l_{jj}$
  $l_{ii} = \sqrt{\, a_{ii} - \sum_{k=1}^{i-1} l_{ik}^2 \,}$

Figure 3: Incomplete Cholesky Factorization

Another approach is to use several steps of another iterative method as a subsidiary iteration. This basic idea is implicit in the paper by Concus, Golub, and O'Leary [CGO76], where they consider CG and another iterative method, Symmetric Successive Over-Relaxation (SSOR), working together as a "generalized CG method." They cite three previous papers ([Axe74], [YHS75], [Ehr75]), all of which consider CG as an accelerator for some other iterative method, in particular, the SSOR method. Their formulation is equivalent to preconditioning CG with a single SSOR step. More generally, subsidiary iteration preconditioning effectively forms

$$z = (I + H + H^2 + \cdots + H^{m-1}) P^{-1} r, \quad (4)$$

where H is the iteration matrix of the subsidiary iteration, so $M^{-1} = (I + H + H^2 + \cdots + H^{m-1}) P^{-1}$. Dubois, Greenbaum, and Rodrigue [DGR79] used a Jacobi iteration as a preconditioner. Johnson, Micchelli, and Paul [JMP83] use least-squares and min-max polynomials in conjunction with the Jacobi iteration, and give a framework for more general subsidiary iteration preconditioning with their discussion of inner and outer iterations. Adams [Ada83], [Ada85] used m-step SSOR preconditioning and also discussed the more general case of a polynomial SSOR preconditioner. Adams [Ada83] briefly discusses preconditioning CG with SOR, and gives an example showing that the SOR iteration is not a valid preconditioner for CG: the loss of symmetry of the preconditioned system leads to a loss of convergence.

One problem that arises when using IC or SSOR on a parallel computer is that they require the solution of large sparse triangular systems of equations, making them difficult to carry out in parallel. It is possible to extract some parallelism in the triangular solve using wavefront orderings ([AG88], [Gre86], [Sal90]), and this approach is somewhat effective for vector computers. But Stotland [Sto93] has shown that the wavefront algorithm computation is completely swamped by communication on distributed-memory multicomputers such as the Intel iPSC/860.

A different approach is to reorder the equations. An important classical reordering for systems arising from partial differential equations is the red-black ordering [You71]. It decouples the underlying equations into two groups, and effectively replaces the triangular solve with two half-sized matrix-vector multiplies. However, it has been shown ([AG88], [DM89], [DFT90], [HO90], [Ort91]) for both IC and SSOR preconditionings that the red-black ordering may cause severe degradation in the rate of convergence of the Preconditioned Conjugate Gradient algorithm as compared with the row-wise natural ordering. Poole and Ortega [PO87] gave results showing this to be the case for orderings with more than two colors as well.

There are other parallel preconditioners for the CG method that are not IC, SSOR, nor any of the polynomial preconditioners mentioned earlier. A survey of preconditioners for the CG method may be found in [BBC+94] and [Ort88].
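The multi-step idea in (4) is easy to realize in code: running m sweeps of the subsidiary iteration on Az = r from a zero initial guess applies exactly the operator $(I + H + \cdots + H^{m-1})P^{-1}$. The following minimal NumPy sketch does this for the SOR splitting (Gauss-Seidel when ω = 1); the function name, the dense triangular solve, and the default parameters are illustrative assumptions, not the red-black implementation developed later in the dissertation.

import numpy as np

def sor_preconditioner_apply(A, r, omega=1.0, m=2):
    """Apply z = M^{-1} r for m-step SOR preconditioning, as in (4).

    Running m SOR sweeps on A z = r, starting from z = 0, produces
    z = (I + H + ... + H^{m-1}) P^{-1} r, where H = I - P^{-1} A is the
    SOR iteration matrix and P = (1/omega) D + L is the splitting matrix
    (D the diagonal of A, L its strictly lower part). omega = 1 gives
    m-step Gauss-Seidel. Dense solves are used here only for illustration.
    """
    D = np.diag(np.diag(A))
    L = np.tril(A, -1)                 # strictly lower part of A
    U = np.triu(A, 1)                  # strictly upper part of A
    P = D / omega + L                  # SOR splitting matrix (lower triangular)
    z = np.zeros_like(r, dtype=float)
    for _ in range(m):
        # one SOR sweep:  P z_new = ((1/omega - 1) D - U) z + r
        rhs = (1.0 / omega - 1.0) * np.diag(A) * z - U @ z + r
        z = np.linalg.solve(P, rhs)
    return z

Preconditioning a Krylov method with k SOR steps then amounts to calling a routine of this kind wherever a solve with M is required.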

1.2 Nonsymmetric Problems

If A is nonsymmetric ($A^T \ne A$) the CG algorithm is not valid: its minimization property is lost, and its finite termination property may be lost as well (see e.g. [Ada83], p. 127). Instead, several methods have been developed that are CG-like in that

- They are optimal in some sense, or
- Their iterates or residuals are related by short recurrences.

We will discuss several of these methods in Chapter 2.

The crucial properties of preconditioners for nonsymmetric CG-type methods are not as well understood as they are for the CG method: there is no need to maintain symmetry; furthermore, eigenvalue information alone for the preconditioned system may not be sufficient (see e.g. [GS94a], [GPS96]). In addition, characteristics of a problem that may be good for one nonsymmetric CG-type method may be bad for another [NRT92]. It is difficult to know a priori what will make a good preconditioner for using a given solver to solve a particular problem or class of problems. As a result, finding the "best" preconditioner-solver combination for a given problem is often a process of trial and error [BBC+94].

One popular preconditioner for nonsymmetric systems uses the Incomplete LU (ILU) factorization [Mv77]. Like IC, it is a variant of Gaussian Elimination, or LU factorization, in which A is factored as

$$A = LU + R \quad (5)$$

where U is upper triangular, L is lower triangular, and $R \ne 0$ is the residual matrix. The preconditioning matrix M is just

$$M = LU. \quad (6)$$

The preconditioned residual $z = M^{-1} r$ is then obtained by solving the triangular systems $Ly = r$ and $Uz = y$. There are many ways to do ILU preconditioning, distinguished by the characteristics of the residual matrix R. No-fill, or ILU(0), preconditioning (Figure 4) retains the sparsity pattern of A. As is the case with IC preconditioning, it is possible to improve the quality of the preconditioner by adding some kind of diagonal compensation (the so-called modified factorizations) [Gus78], or by allowing fill elements, according to their level, profile, or magnitude. A discussion of fill techniques can be found in [Saa95], Chapter 10. A more concise discussion can be found in the tech report [CSW96]. It is also possible to use wavefront and multicolor orderings to parallelize ILU preconditioning, and, like IC, multicolor orderings can lead to a severe degradation in the overall rate of convergence of the iterative method. We will give a few experimental results to this effect in Chapter 3.

For $k = 1$ to $n$
  For $i = k+1$ to $n$
    If $a_{ik} = 0$ then $l_{ik} = 0$
    else
      $l_{ik} = a_{ik} / a_{kk}$
      For $j = k+1$ to $n$
        If $a_{ij} \ne 0$ then $a_{ij} = a_{ij} - l_{ik} a_{kj}$

Figure 4: No-fill ILU Factorization

Since it is not necessary to maintain the symmetry of the coefficient matrix, it is possible to consider the SOR iteration itself as a preconditioner for nonsymmetric systems, even though it is not useful with CG. However, in [BBC+94], the authors claim that for nonsymmetric problems "[SOR and Gauss-Seidel] are never used as preconditioners, for a rather technical reason." They explain that preconditioning with SOR maps the eigenvalues of the preconditioned matrix onto a circle in the complex plane, and that polynomial acceleration of SOR yields no improvement over simple SOR with optimal ω, where ω is the SOR relaxation parameter. Both of these claims are in fact true, but as we will show they do not speak to the question of whether SOR can be used effectively as a preconditioner.

Another negative aspect of SOR as a preconditioner appears in a paper by Shadid and Tuminaro [ST94], in which they compare a variety of parallel preconditioners. One of these is one-step Gauss-Seidel, which does not perform well as compared to their other preconditioners. However, as we will show, it is critical to use multiple Gauss-Seidel steps to obtain useful preconditioning. In fact, Saad [Saa94] briefly considers multiple Gauss-Seidel steps, but does not use an over-relaxation factor.
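The following is a minimal sketch of the no-fill factorization of Figure 4, written against dense NumPy arrays purely for illustration; the dissertation's implementation works in the compressed sparse row format discussed in Chapter 3. The helper that applies the preconditioner by two triangular solves uses numpy.linalg.solve rather than a dedicated triangular solver, which is another illustrative simplification.

import numpy as np

def ilu0(A):
    """No-fill ILU factorization (Figure 4), on a dense copy for illustration.

    Returns unit lower-triangular L and upper-triangular U with the same
    sparsity pattern as A, so that A is approximately L U.
    """
    n = A.shape[0]
    U = A.astype(float).copy()
    L = np.eye(n)
    for k in range(n):
        for i in range(k + 1, n):
            if A[i, k] == 0.0:                  # no fill: skip entries outside the pattern
                continue
            L[i, k] = U[i, k] / U[k, k]         # multiplier l_ik = a_ik / a_kk
            U[i, k] = 0.0
            for j in range(k + 1, n):
                if A[i, j] != 0.0:              # update only existing nonzeros of row i
                    U[i, j] -= L[i, k] * U[k, j]
    return L, U

def ilu0_solve(L, U, r):
    """Apply the ILU(0) preconditioner: solve L y = r, then U z = y."""
    y = np.linalg.solve(L, r)
    z = np.linalg.solve(U, y)
    return z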

1.3 Summary of our results

In this dissertation we will show that

- For an important class of problems, SOR is an effective preconditioner, despite the statement to the contrary in [BBC+94].
- Unlike SSOR, SOR does not suffer serious degradation in the rate of convergence in the red-black ordering.
- It is critical to use multiple steps of Gauss-Seidel. This is why the use of a single Gauss-Seidel step in [ST94] was ineffective.
- At least a factor of two improvement over Gauss-Seidel may be achieved by using SOR with a nominal value for the relaxation parameter ω. We will also give results showing the effect of the variation of ω on the rate of convergence.
- SOR as a preconditioner can be highly parallel in the red-black ordering. In fact, for our implementation on the Paragon, our data partitioning and distribution yields superlinear scaled speedup.

This suggests that multi-step red-black SOR may be an effective preconditioner for nonsymmetric CG-type iterative methods.

1.4 Organization of this thesis

In Chapter 2 we introduce three partial differential equations that lead to nonsymmetric test matrices, review some basic linear algebra, and discuss GMRES, a CG-type method for nonsymmetric problems that has a minimization property, and the BCG family of methods, which have short recurrences. We also review the SOR method, and show how SOR may be used to precondition GMRES and the BCG methods.

In Chapter 3 we show that the matrices that arise when the partial differential equations are discretized using finite differences can be put into the red-black ordering, and show that for these problems, in the red-black ordering, the SOR iteration is highly parallel. We also give experimental results on a serial machine showing the degradation in the rate of convergence of ILU(0)-preconditioned GMRES, and we show that for the same problem SOR-preconditioned GMRES does not suffer degradation. We then compare SOR-GMRES to stand-alone SOR and to unpreconditioned GMRES, and conclude that SOR can be used effectively as a preconditioner for GMRES.

In Chapter 4 we give some mathematical justification for these experimental results by showing the effect of the preconditioner on the spectrum of the preconditioned matrix and discuss the rate of convergence in that context. In Chapter 5 we discuss some of the details of our parallel implementation, including the data partitioning and distribution, and in Chapter 6 we give parallel experimental results from the Intel Paragon, some of which show superlinear scaled speedup. We then explain the reasons for this, and show that the data distribution we have chosen may be the preferred way to proceed on this machine. We also give experimental results from the IBM SP2. In Chapter 7 we discuss some conclusions, future directions, and open problems.

Chapter 2

Linear Algebra, Test Problems, and Methods

The Generalized Minimal Residual (GMRES) [SS86] method is an iterative method that can be used to solve nonsymmetric systems of linear equations. As is the case for the Conjugate Gradient (CG) method, the convergence of GMRES can be accelerated through the use of an effective preconditioner. Another iterative method, SOR, can be used as a preconditioner, and under some conditions it can have good parallel properties.

We begin this chapter by reviewing some basic linear algebra. We then introduce three test problems, and show that in the discretization scheme we have chosen, the matrices that arise are consistently ordered with Property A, two properties that are crucial to our discussion of SOR. We then describe the GMRES method and the SOR iteration, and show how SOR can be used to precondition GMRES. We will use the resulting method, SOR(k)-GMRES(m), applied to these three test problems, for experiments in later chapters. We also discuss the Bi-Conjugate Gradient (BCG) family of methods, including the Stabilized Bi-Conjugate Gradient (Bi-CGSTAB) method, another nonsymmetric iterative method we will use in the next chapter.

2.1 Basic linear algebra

In this section we review some basic linear algebra that we will use in the remainder of this dissertation. As in Chapter 1 we use the following conventions: Greek letters for scalars (e.g. $\alpha$ and $\beta$), upper-case letters for matrices (A and M), and bold lower-case letters for vectors (x and y). We also use these definitions:

- x is a column vector and $x^T$ is a row vector. For both matrices and vectors, $^T$ denotes the transpose.

- $A = [a_1, a_2, \ldots, a_n]$ is a matrix with columns $a_1, a_2, \ldots, a_n$. $A^T$ is the transpose of A, and will have rows $a_1^T, a_2^T, \ldots, a_n^T$.
- The elements $a_{ii}$, $i = 1, 2, \ldots, n$, are the main diagonal. The diagonals above it are referred to as the first superdiagonal, the second superdiagonal, and so on. Analogously, the diagonals below the main diagonal are referred to as the subdiagonals.

2.1.1 Inner products

Let $C^n$ be the space of complex vectors of length n. Then an inner product is a complex-valued function of two vector variables, denoted $(\cdot\,,\cdot)$, that satisfies

1. $(x, x) \ge 0$ for all $x \in C^n$, and $(x, x) = 0$ only if $x = 0$.
2. $(x, y) = (y, x)$ for all real $x, y \in C^n$; $(x, y) = \overline{(y, x)}$ if any entry of x or y is complex.
3. $(\alpha x, y) = \alpha (x, y)$ for all $x, y \in C^n$ and complex scalars $\alpha$.
4. $(x + y, z) = (x, z) + (y, z)$ for all $x, y, z \in C^n$.

For the space $R^n$ of real vectors, the usual Euclidean inner product is just the dot product

$$(x, y) = x^T y = \sum_{i=1}^n x_i y_i. \quad (7)$$

If A is a symmetric positive-definite matrix the product $(x, y)_A \equiv x^T A y$ also defines an inner product, as it satisfies the definition above. It is referred to as the A inner product.

2.1.2 Orthogonality and conjugacy

If $(x, y) = 0$ for some inner product, the vectors x and y are said to be orthogonal or conjugate with respect to that inner product. A sequence of vectors $x_i$, $i = 1, 2, \ldots, k$, is said to be orthogonal if $(x_i, x_j) = 0$ for $i \ne j$, and orthonormal if they are orthogonal and $(x_i, x_i) = 1$ for $i = 1, 2, \ldots, k$. If a sequence of vectors is orthogonal with respect to the A inner product, they are said to be A-conjugate. Lastly, if there are two sequences of vectors $x_i$, $i = 1, 2, \ldots, k$, and $y_j$, $j = 1, 2, \ldots, k$, such that each of the $x_i$ is orthogonal to all of the $y_j$ and vice versa, then the two sequences of vectors are said to be biconjugate. It can be shown (see e.g. [GO93], Chapter 9) that the residuals $r_i$ generated by the CG method are orthogonal, while the direction vectors $p_i$ are A-conjugate.

2.1.3 Norms

A norm is a real-valued function over $R^n$, denoted by $\|\cdot\|$, having these three properties:

1. $\|x\| \ge 0$, and $\|x\| = 0$ only if $x = 0$.
2. $\|\alpha x\| = |\alpha| \, \|x\|$ for all $x \in R^n$ and scalars $\alpha$.
3. $\|x + y\| \le \|x\| + \|y\|$ for all $x, y \in R^n$.

For any inner product there is a related norm:

$$\|x\| = \sqrt{(x, x)}. \quad (8)$$

This leads directly to the Euclidean norm and the A-norm, where A is symmetric and positive-definite:

$$\|x\|_2 = \Big( \sum_{i=1}^n x_i^2 \Big)^{1/2}, \qquad \|x\|_A = \sqrt{x^T A x}. \quad (9)$$

There are many other norms we will not mention here. The only other norm we will consider is the infinity norm

$$\|x\|_\infty = \max_{1 \le i \le n} |x_i|, \quad (10)$$

which is also called the max norm. Any vector norm gives rise to a matrix norm by the definition

$$\|A\| = \max_{x \ne 0} \frac{\|Ax\|}{\|x\|} = \max_{\|x\| = 1} \|Ax\|. \quad (11)$$

For the infinity norm this leads to

$$\|A\|_\infty = \max_{1 \le i \le n} \sum_{j=1}^n |a_{ij}|. \quad (12)$$

If $\lambda_i$, $1 \le i \le n$, are the eigenvalues of A, then the spectral radius of A is

$$\rho(A) = \max_{1 \le i \le n} |\lambda_i|. \quad (13)$$

It can be shown (see e.g. [Ort90]) that the Euclidean norm of A is

$$\|A\|_2 = \sqrt{\rho(A^T A)}, \quad (14)$$

and if A is symmetric this reduces to $\|A\|_2 = \rho(A)$. For a matrix A and some norm, A has a condition number relative to that norm:

$$\kappa(A) = \|A\| \, \|A^{-1}\|, \quad (15)$$

where $A^{-1}$ is the inverse of A. In a sense, a condition number measures the difficulty of solving a linear system with coefficient matrix A: the identity matrix has a condition number of $\kappa(I) = 1$, and matrices that have condition numbers for a given norm that are close to one are close to the identity matrix I in that norm. In fact, we will show in a later chapter that condition numbers enter explicitly into the estimates of the rate of convergence for both CG and GMRES.
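As a quick numerical illustration of (14) and (15), the following NumPy fragment compares the formulas against the library's built-in norm and condition-number routines; the random test matrix is an arbitrary choice for the sake of the check.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))

# ||A||_2 = sqrt(rho(A^T A)), equation (14)
rho = max(abs(np.linalg.eigvals(A.T @ A)))
print(np.sqrt(rho), np.linalg.norm(A, 2))        # the two values agree

# kappa(A) = ||A|| ||A^{-1}||, equation (15), here in the infinity norm
kappa_inf = np.linalg.norm(A, np.inf) * np.linalg.norm(np.linalg.inv(A), np.inf)
print(kappa_inf, np.linalg.cond(A, np.inf))      # same value from numpy.linalg.cond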

2.1.4 Bases

A vector v is said to be a linear combination of some set of $p \le n$ vectors $\{x_1, x_2, \ldots, x_p\}$ if there are scalars $\alpha_1, \alpha_2, \ldots, \alpha_p$ such that

$$v = \sum_{i=1}^p \alpha_i x_i. \quad (16)$$

The vectors $x_i$, $i = 1, \ldots, p$, are linearly independent if no $x_i$ can be written as a linear combination of the other $x_j$, $j \ne i$. The set of all linear combinations of the $x_i$, denoted $\mathrm{span}\{x_1, x_2, \ldots, x_p\}$, forms a subspace of $R^n$ of dimension p. The $x_i$ span the subspace, and are called its basis.

2.1.5 Special matrices

There are several special types of matrices that will be of interest in our discussion. A triangular matrix has non-zero entries on its main diagonal and either above it or below it, but not both. A matrix which is triangular except for non-zero entries on the first diagonal above or below the main diagonal is a Hessenberg matrix. A real matrix Q such that $Q^T Q = I$, where I is the identity matrix, is an orthogonal matrix. Its columns are orthonormal. A special kind of orthogonal matrix having entries that are either ones or zeros effectively reorders the equations represented by the matrix, and for this reason it is called a permutation matrix.

2.1.6 The Jordan canonical form

The eigenvalues of A are known collectively as the spectrum of A, and are denoted

$$\sigma(A) = \{\lambda_1, \lambda_2, \ldots, \lambda_n\}. \quad (17)$$

Associated with each distinct eigenvalue $\lambda_i$ there is a vector $p_i$ such that

$$A p_i = \lambda_i p_i. \quad (18)$$

The $p_i$ are called the eigenvectors of A. Every matrix A can be written in its Jordan canonical form

$$A = P J P^{-1}, \quad (19)$$

where P is a nonsingular matrix and J is diagonal or nearly so. If A has n linearly independent eigenvectors, then $P = [p_1, p_2, \ldots, p_n]$ is the matrix of eigenvectors of A and J is a diagonal matrix having $j_{ii} = \lambda_i$. Symmetric matrices are guaranteed to have a full set of linearly independent eigenvectors (see e.g. [Ort87]) but nonsymmetric matrices are not. If A has fewer than n linearly independent eigenvectors, then it has some number of generalized eigenvectors that do not satisfy (18). In this case J is not diagonal: in addition to having $j_{ii} = \lambda_i$, it will also have some number of

ones on the first superdiagonal. Associated with the Jordan canonical form is the Jordan condition number, which is the condition number of P measured in the Euclidean norm:

$$\kappa_J(A) = \|P\|_2 \, \|P^{-1}\|_2. \quad (20)$$

When A is symmetric and positive definite, $\kappa_J(A) = 1$, but when A is nonsymmetric its eigenvectors can fail to be orthogonal, so that $\kappa_J(A) > 1$. In fact, $\kappa_J(A)$ can be arbitrarily large.

2.2 Three test problems

Our three test problems are all partial differential equations that, when discretized, yield large, sparse nonsymmetric systems of linear equations.

2.2.1 Problem 1

Our first problem is a two-dimensional convection-diffusion equation

$$-\Delta u + \beta u_x + \gamma u_y = f, \quad (21)$$

where $\Delta u$ is the Laplacian operator applied to u ($\Delta u = u_{xx} + u_{yy}$), and the first-derivative terms $u_x$ and $u_y$ are referred to as the convection terms. For simplicity we consider only the unit square $[0, 1] \times [0, 1]$ as the domain, and we use uniform grid spacing $h = 1/(N+1)$, where N is the number of gridpoints in a given direction, in both the x and y directions. Then the grid points are

$$(x_i, y_j) = (ih, jh), \quad i, j = 0, 1, \ldots, N, N+1, \quad (22)$$

and the grid points interior to the domain are

$$(x_i, y_j) = (ih, jh), \quad i, j = 1, \ldots, N. \quad (23)$$

We use Dirichlet conditions (u = 0) on the boundary. To discretize we use standard second-order finite differences to approximate $u_{xx}$ and $u_{yy}$ at every interior node $(x_i, y_j)$:

$$u_{xx}(x_i, y_j) \approx \frac{1}{h^2} \big( u(x_{i+1}, y_j) - 2u(x_i, y_j) + u(x_{i-1}, y_j) \big), \qquad u_{yy}(x_i, y_j) \approx \frac{1}{h^2} \big( u(x_i, y_{j+1}) - 2u(x_i, y_j) + u(x_i, y_{j-1}) \big), \quad (24)$$

so that the Laplacian $\Delta u$ at $(x_i, y_j)$ is

$$\Delta u(x_i, y_j) \approx -\frac{1}{h^2} \big( 4u(x_i, y_j) - u(x_{i+1}, y_j) - u(x_{i-1}, y_j) - u(x_i, y_{j+1}) - u(x_i, y_{j-1}) \big). \quad (25)$$

We use centered differences to approximate the first-derivative terms $u_x$ and $u_y$ at $(x_i, y_j)$:

$$u_x(x_i, y_j) \approx \frac{1}{2h} \big( u(x_{i+1}, y_j) - u(x_{i-1}, y_j) \big), \qquad u_y(x_i, y_j) \approx \frac{1}{2h} \big( u(x_i, y_{j+1}) - u(x_i, y_{j-1}) \big). \quad (26)$$

We substitute (25) and (26) into (21) and collect like terms. Now we order the gridpoints in the row-wise "natural" ordering:

$$u_1, \ldots, u_{N^2} = u(x_1, y_1), \ldots, u(x_N, y_1),\; u(x_1, y_2), \ldots, u(x_N, y_2),\; \ldots,\; u(x_1, y_N), \ldots, u(x_N, y_N). \quad (27)$$

Figure 5(a) shows the interior grid points, labeled in the natural ordering, for N = 5.

[Figure 5: A 5 × 5 grid on the unit square in the natural ordering (a), and the related matrix (b). In the matrix panel the letters d, e, w, n, s mark each row's diagonal entry and its couplings to the east, west, north, and south neighbors.]

As there are $n = N^2$ interior gridpoints, (21) is approximated by an $n \times n$ linear system

$$Au = b \quad (28)$$

in which a typical equation is of the form

$$d_i u_i + e_i u_{i+1} + w_i u_{i-1} + n_i u_{i+N} + s_i u_{i-N} = h^2 f_i, \quad (29)$$

where

$$d_i = 4, \quad e_i = -1 + \frac{\beta h}{2}, \quad w_i = -1 - \frac{\beta h}{2}, \quad n_i = -1 + \frac{\gamma h}{2}, \quad s_i = -1 - \frac{\gamma h}{2}. \quad (30)$$

The assembled matrix A has a structure analogous to the one shown in Figure 5(b) for the case N = 5. This matrix A is structurally symmetric, but it is not numerically symmetric unless $\beta = \gamma = 0$. For testing purposes we choose b for this problem so that the exact solution to (28) is known (u is a vector of ones).
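The following sketch assembles the matrix of (29)-(30) in the natural ordering. It treats the convection coefficients as the β and γ of (21) as reconstructed above, uses dense storage purely for illustration, and chooses the right-hand side as in the text so that the exact solution is a vector of ones; the function name is an illustrative choice.

import numpy as np

def problem1_matrix(N, beta, gamma):
    """Assemble the n x n matrix (n = N^2) of (29)-(30) for Problem 1 on the
    unit square, row-wise natural ordering, dense storage for illustration.
    """
    h = 1.0 / (N + 1)
    n = N * N
    A = np.zeros((n, n))
    d = 4.0
    e = -1.0 + beta * h / 2.0          # east coefficient
    w = -1.0 - beta * h / 2.0          # west coefficient
    no = -1.0 + gamma * h / 2.0        # north coefficient
    s = -1.0 - gamma * h / 2.0         # south coefficient
    for j in range(N):                 # grid row (y index)
        for i in range(N):             # grid column (x index)
            k = j * N + i              # natural ordering index
            A[k, k] = d
            if i + 1 < N: A[k, k + 1] = e
            if i - 1 >= 0: A[k, k - 1] = w
            if j + 1 < N: A[k, k + N] = no
            if j - 1 >= 0: A[k, k - N] = s
    return A

# Example: the 25 x 25 matrix of Figure 5(b); with beta = gamma = 0 it is symmetric.
A = problem1_matrix(5, beta=1.0, gamma=1.0)
b = A @ np.ones(A.shape[0])            # right-hand side so the exact solution is all ones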

2.2.2 Problem 2

Our next problem is taken from [van81], and was also used in [ST94]:

$$u_x - u_{xx} + (1 + y^2)(u_y - u_{yy}) = f \quad (31)$$

over the unit square with Dirichlet boundary conditions and right-hand side chosen such that the exact solution is

$$u(x, y) = e^{x+y} + x^2 (1 - x)^2 \ln(1 + y^2). \quad (32)$$

It is discretized in the same way as (21) and its typical discrete equation has the form (29) with

$$d_i = 4 + 2y^2, \quad e_i = -1 + \tfrac{1}{2} h, \quad w_i = -1 - \tfrac{1}{2} h, \quad n_i = (1 + y^2) e_i, \quad s_i = (1 + y^2) w_i. \quad (33)$$

2.2.3 Problem 3

Like Problem 2, our third problem is also used in [ST94]:

$$-\Delta u + (x^2 u_x + y^2 u_y) - u = f \quad (34)$$

over the unit square with Dirichlet boundary conditions. We use the same discretization scheme as in Problems 1 and 2, so that

$$d_i = 4 - h^2, \quad e_i = -1 + \tfrac{1}{2} x^2 h, \quad w_i = -1 - \tfrac{1}{2} x^2 h, \quad n_i = -1 + \tfrac{1}{2} y^2 h, \quad s_i = -1 - \tfrac{1}{2} y^2 h, \quad (35)$$

and again we choose the right-hand side so the exact solution to the discrete problem is known (u is a vector of ones).

2.3 The red-black ordering

Because we have chosen the five-point discretization scheme (25) and Dirichlet boundary conditions, all three problems yield matrices with the structure shown in Figure 5(b) when the matrix is assembled using the natural ordering. If instead we label the gridpoints in the red-black ordering [You71], shown in Figure 6(a) for a 5 × 5 grid, and assemble the matrix by ordering first all the red points, then all the black points, the resulting matrix has the structure shown in Figure 6(b).

[Figure 6: A 5 × 5 grid on the unit square in the red-black ordering (a), and the related matrix (b).]

Matrices in the natural and red-black orderings are related by

$$A_{nat} = P A_{rb} P^T, \quad (36)$$

where P is a permutation matrix. The matrix in the red-black ordering, $A_{rb}$ (Figure 6(b)), has a block structure

$$A_{rb} = \begin{bmatrix} D_R & E \\ F & D_B \end{bmatrix} \quad (37)$$

where $D_R$ and $D_B$ are diagonal matrices. We will exploit this special structure in a later section in our discussion of preconditioning with SOR.
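A short sketch of the permutation behind (36)-(37): red points (those with i + j even) are listed before black points, and the permuted matrix can be checked to have diagonal blocks $D_R$ and $D_B$. It reuses the problem1_matrix routine from the earlier sketch; the function name and the dense permutation via NumPy indexing are illustrative only.

import numpy as np

def red_black_permutation(N):
    """Indices that list red gridpoints (i + j even) before black gridpoints
    (i + j odd) for an N x N grid in the natural ordering, so that
    A_rb = A_nat[perm][:, perm] has the block form (37).
    """
    red, black = [], []
    for j in range(N):
        for i in range(N):
            k = j * N + i
            (red if (i + j) % 2 == 0 else black).append(k)
    return np.array(red + black)

perm = red_black_permutation(5)
A_nat = problem1_matrix(5, beta=1.0, gamma=1.0)     # from the earlier sketch
A_rb = A_nat[np.ix_(perm, perm)]
n_red = (5 * 5 + 1) // 2
# The two diagonal blocks D_R and D_B of (37) are themselves diagonal:
assert np.count_nonzero(A_rb[:n_red, :n_red] - np.diag(np.diag(A_rb[:n_red, :n_red]))) == 0
assert np.count_nonzero(A_rb[n_red:, n_red:] - np.diag(np.diag(A_rb[n_red:, n_red:]))) == 0

The asserts pass because every neighbor in the five-point stencil has the opposite color, so no red point couples to another red point, and likewise for black.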

2.4 Two important properties

The matrices arising from all three of the test problems discussed in this chapter have two properties which will be very important when we discuss SOR, namely that they are consistently ordered and have what Young calls Property A [You71]. In this section we give the definitions of these two properties and show that the matrices arising from Problems 1, 2, and 3 are consistently ordered with Property A.

2.4.1 Consistently ordered matrices

Young [You71] gives the following definitions. Given a matrix $A = (a_{i,j})$, the integers i and j are associated with respect to A if $a_{i,j} \ne 0$ or $a_{j,i} \ne 0$. The matrix A of order N is consistently ordered if for some t there exist subsets $S_1, S_2, \ldots, S_t$ of $W = \{1, 2, \ldots, N\}$ such that $\bigcup_{k=1}^t S_k = W$ and such that if i and j are associated, then $j \in S_{k+1}$ if $j > i$ and $j \in S_{k-1}$ if $j < i$, where $i \in S_k$.

If the grid of interest is $m \times m$, then the five-point finite-difference stencil connects any gridpoint i to the gridpoints $i+1$, $i-1$, $i+m$, $i-m$. Thus, if $W = \{1, \ldots, m^2\}$ and the subsets $S_i$ of W are

$$S_1 = \{1\},\; S_2 = \{2, m+1\},\; S_3 = \{3, m+2, 2m+1\},\; \ldots,\; S_{2m-2} = \{m^2 - m, m^2 - 1\},\; S_{2m-1} = \{m^2\}, \quad (38)$$

corresponding to diagonal lines of gridpoints, then the assembled matrix A is consistently ordered. For a 5 × 5 square grid, the subsets are shown in Figure 7, where $S_1 = \{1\}$, $S_2 = \{2, 6\}$, $S_3 = \{3, 7, 11\}$,

through $S_8 = \{20, 24\}$ and $S_9 = \{25\}$. For example, gridpoint 9 will be associated with gridpoints 4, 8, 10, and 14 by the finite-difference approximations (25) and (26). It is in $S_5$, while gridpoints 4 and 8 are in $S_4$, and gridpoints 10 and 14 are in $S_6$. Following this example, the matrices arising from Problems 1-3 are consistently ordered.

[Figure 7: Diagonal subsets for a 5 × 5 grid.]

2.4.2 Property A

Ortega [Ort90] gives the following definition for Property A: the matrix A is 2-cyclic or has Property A if there is a permutation matrix P such that

$$P A P^T = \begin{bmatrix} D_1 & C_1 \\ C_2 & D_2 \end{bmatrix} \quad (39)$$

where $D_1$ and $D_2$ are diagonal. From our discussion of the block structure of $A_{rb}$ (see (37)), it is clear that the matrices arising from Problems 1-3 have Property A.

2.5 The GMRES algorithm

Recall from Chapter 1 that the CG algorithm has three desirable properties:

- In exact arithmetic, it converges in at most n iterations if A is n × n.
- It minimizes the error $x - x_k$ in some norm at every iteration.
- The iterates are related by short recurrences.

In this section we develop the GMRES algorithm [SS86] from Arnoldi's method, and show that it is also guaranteed to converge in at most n iterations, and that it minimizes the residual $r_k = b - Ax_k$ in some sense at every iteration. In a later section we will discuss the BCG family of methods, which if they avoid breakdown are guaranteed to converge in no more than n iterations, and produce iterates that are related by short recurrences.

2.5.1 Arnoldi's method

In exact arithmetic, Arnoldi's method generates a basis for the Krylov subspace

$$K_m = K_m(A, v_1) = \mathrm{span}\{ v_1, Av_1, A^2 v_1, \ldots, A^{m-1} v_1 \}, \quad (40)$$

where $m \le n$. It can be implemented using a Gram-Schmidt process (Figure 8).

1. Choose an initial vector $v_1$ with $\|v_1\|_2 = 1$
2. For $j = 1, 2, \ldots, m$
   (a) $h_{i,j} = (A v_j, v_i)$, $i = 1, 2, \ldots, j$
   (b) $\hat{v}_{j+1} = A v_j - \sum_{i=1}^j h_{i,j} v_i$
   (c) $h_{j+1,j} = \|\hat{v}_{j+1}\|_2$
   (d) $v_{j+1} = (1/h_{j+1,j}) \hat{v}_{j+1}$

Figure 8: Arnoldi's method using Gram-Schmidt orthogonalization

The $v_j$ vectors it produces are orthonormal,

$$v_i^T v_j = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{otherwise,} \end{cases} \quad (41)$$

and they span $K_m$. By construction

$$h_{j+1,j} v_{j+1} = A v_j - \sum_{i=1}^j h_{i,j} v_i, \quad \text{or} \quad A v_j = \sum_{i=1}^{j+1} h_{i,j} v_i, \quad j = 1, 2, \ldots, m. \quad (42)$$

If we define the matrix $V_m = [v_1, \ldots, v_m]$, then this becomes

$$A V_m = V_m H_m + \hat{v}_{m+1} e_m^T, \quad (43)$$

where $e_m^T = (0, 0, \ldots, 0, 1)$ and $H_m = (h_{i,j})$. $H_m$ is upper Hessenberg (upper triangular except for the first subdiagonal). The last term, $\hat{v}_{m+1} e_m^T$, is a matrix that is zero except for the last column, which is just the "unorthogonalized" Arnoldi vector, $\hat{v}_{m+1} = h_{m+1,m} v_{m+1}$, the other columns being forced to be zero by the orthogonality of the $v_i$. We rewrite (43) as

$$A V_m = V_{m+1} \bar{H}_m, \quad \text{where} \quad \bar{H}_m = \begin{bmatrix} H_m \\ 0 \;\; \cdots \;\; 0 \;\; \|\hat{v}_{m+1}\|_2 \end{bmatrix}. \quad (44)$$

That is, $\bar{H}_m$ is an $(m+1) \times m$ matrix consisting of $H_m$ augmented by a row that is zero except for the $(m+1, m)$ entry. If we multiply (44) by $V_m^T$ and make use of the orthogonality of the $v_j$, we obtain

$$V_m^T A V_m = H_m. \quad (45)$$

In practice the vectors generated using the Gram-Schmidt process can lose orthogonality: because of rounding error and cancellation there may be vectors $v_i$ and $v_j$ that are far from orthogonal. This can be mitigated by implementing Arnoldi's method using a modified Gram-Schmidt process [Ste73] (Figure 9), which rearranges the computations in steps 2(a) and 2(b) of Figure 8.

1. Choose an initial vector $v_1$ with $\|v_1\|_2 = 1$
2. For $j = 1, 2, \ldots, m$
   (a) $\hat{v}_{j+1} = A v_j$
   (b) For $i = 1, 2, \ldots, j$
       $h_{i,j} = (\hat{v}_{j+1}, v_i)$
       $\hat{v}_{j+1} = \hat{v}_{j+1} - h_{i,j} v_i$
   (c) $h_{j+1,j} = \|\hat{v}_{j+1}\|_2$
   (d) $v_{j+1} = (1/h_{j+1,j}) \hat{v}_{j+1}$

Figure 9: Arnoldi's method using modified Gram-Schmidt orthogonalization

The two methods are mathematically equivalent. However, the parallel properties of the modified Gram-Schmidt process are less desirable: since each of the inner products $(\hat{v}_{j+1}, v_i)$ in Figure 9 requires all the processors to synchronize, the j inner products require j synchronizations, while the unmodified Gram-Schmidt method allows its j inner products to be done all at once, effectively saving $j - 1$ synchronizations.
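A compact NumPy sketch of Arnoldi's method with modified Gram-Schmidt (Figure 9) follows. It returns $V_{m+1}$ and $\bar{H}_m$ satisfying (44); the early return on a zero $h_{j+1,j}$ (the "happy breakdown" discussed in Section 2.5.4) is an addition for robustness and is not part of Figure 9.

import numpy as np

def arnoldi_mgs(A, v1, m):
    """Arnoldi's method with modified Gram-Schmidt (Figure 9).

    Returns V (n x (m+1), orthonormal columns) and Hbar ((m+1) x m upper
    Hessenberg) satisfying A V[:, :m] = V @ Hbar, equation (44). If
    h_{j+1,j} = 0 the loop stops early and truncated V, Hbar are returned.
    """
    n = v1.shape[0]
    V = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    V[:, 0] = v1 / np.linalg.norm(v1)
    for j in range(m):
        w = A @ V[:, j]                     # v_hat_{j+1} = A v_j
        for i in range(j + 1):              # orthogonalize against v_1, ..., v_j
            H[i, j] = w @ V[:, i]
            w = w - H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)     # h_{j+1,j} = ||v_hat_{j+1}||_2
        if H[j + 1, j] == 0.0:              # happy breakdown: Krylov space is invariant
            return V[:, :j + 1], H[:j + 2, :j + 1]
        V[:, j + 1] = w / H[j + 1, j]
    return V, H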

2.5.2 The GMRES algorithm

If $x_0$ is an initial guess for the solution to $Ax = b$, the initial residual is

$$r_0 = b - A x_0. \quad (46)$$

The GMRES algorithm (Figure 10) uses Arnoldi's method to generate a basis for the Krylov space $K_m(A, r_0)$. It then forms the iterate

$$x_m = x_0 + \sum_{i=1}^m y_i v_i \quad (47)$$

as a linear combination of the Arnoldi vectors, where the coefficients $y_i$ are chosen so the iterate $x_m$ minimizes the $l_2$ norm of the residual ($\|r_m\|_2$) over the space $K_m(A, r_0)$. Thus, the GMRES algorithm has two desirable properties:

- In exact arithmetic, it is guaranteed to converge in at most n iterations.
- The Euclidean norms of the residuals are non-increasing: $\|r_{k+1}\|_2 \le \|r_k\|_2$ for $k \ge 0$.

1. Choose $x_0$. Compute $r_0 = b - Ax_0$, $\beta = \|r_0\|_2$, $v_1 = (1/\beta) r_0$.
2. For $j = 1, 2, \ldots, m, \ldots$ until converged do:
   (a) $\hat{v}_{j+1} = A v_j$
   (b) $h_{i,j} = (\hat{v}_{j+1}, v_i)$, $i = 1, 2, \ldots, j$
   (c) $\hat{v}_{j+1} = \hat{v}_{j+1} - \sum_{i=1}^j h_{i,j} v_i$
   (d) $h_{j+1,j} = \|\hat{v}_{j+1}\|_2$
   (e) Check the convergence criterion: if $\|r_k\|_2 < \epsilon$ then GMRES has converged.
   (f) $v_{j+1} = (1/h_{j+1,j}) \hat{v}_{j+1}$
3. Form the approximate solution $x_m = x_0 + V_m y_m$, where $y_m$ minimizes $\|\beta e_1 - \bar{H}_m y\|_2$, $y \in R^m$.

Figure 10: The GMRES algorithm
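Built on the Arnoldi sketch above, one GMRES cycle can be written as follows. For simplicity the small least-squares problem $\min_y \|\beta e_1 - \bar{H}_m y\|_2$ is solved with a dense lstsq call rather than the incremental QR factorization derived in the following paragraphs; everything else follows Figure 10, and the function name is an illustrative choice.

import numpy as np

def gmres_cycle(A, b, x0, m):
    """One cycle of (restarted) GMRES, Figure 10, using arnoldi_mgs above.

    Returns the new iterate x_m and the residual norm ||beta e_1 - Hbar y||_2,
    which by (53) equals ||r_m||_2 in exact arithmetic.
    """
    r0 = b - A @ x0
    beta = np.linalg.norm(r0)
    if beta == 0.0:
        return x0, 0.0
    V, Hbar = arnoldi_mgs(A, r0, m)
    k = Hbar.shape[1]                           # k <= m if a breakdown occurred
    e1 = np.zeros(Hbar.shape[0])
    e1[0] = beta                                # beta * e_1
    y, *_ = np.linalg.lstsq(Hbar, e1, rcond=None)
    x = x0 + V[:, :k] @ y                       # x_m = x_0 + V_m y_m
    return x, np.linalg.norm(e1 - Hbar @ y)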

We now show how to find the minimizer $y_m = (y_1, y_2, \ldots, y_m)^T$ and how to monitor convergence. Our presentation follows that given in [SS86]. A more detailed discussion can be found in [Saa95]. Let $\beta = \|r_0\|_2$ and let $v_1 = (1/\beta) r_0$ be the starting vector for Arnoldi's method. If $V_m = [v_1, \ldots, v_m]$, then (47) can be written $x_m = x_0 + V_m y$, where y is an m-vector. Define

$$J(y) \equiv \|b - Ax\|_2 = \|b - A(x_0 + V_m y)\|_2. \quad (48)$$

We want to choose y so that $V_m y$ minimizes the residual norm J(y) over $K_m(A, r_0)$. Note that

$$b - A(x_0 + V_m y) = (b - Ax_0) - A V_m y \quad (49)$$
$$= r_0 - A V_m y, \quad (50)$$

and recall that $r_0 = \beta v_1$. Since $V_{m+1} e_1 = v_1$, from the relationship (44)

$$r_0 - A V_m y = \beta v_1 - V_{m+1} \bar{H}_m y \quad (51)$$
$$= V_{m+1} (\beta e_1 - \bar{H}_m y). \quad (52)$$

Because all the $v_i$ are orthonormal, $V_{m+1}^T V_{m+1} = I$ and $\|V_{m+1} z\|_2 = \|z\|_2$ for any vector z. Thus (48) through (52) give

$$J(y) = \|b - Ax\|_2 = \|V_{m+1} (\beta e_1 - \bar{H}_m y)\|_2 = \|\beta e_1 - \bar{H}_m y\|_2. \quad (53)$$

Therefore, the minimizer $y_m$ will minimize the $l_2$ norm of $\beta e_1 - \bar{H}_m y$. This can be found using QR reduction:

$$Q_m \bar{H}_m = \bar{R}_m, \quad (54)$$

factoring the Hessenberg matrix $\bar{H}_m$ into $Q_m^T$, an $(m+1) \times (m+1)$ orthogonal matrix, and $\bar{R}_m$, an $(m+1) \times m$ upper-triangular matrix whose last row is zero. This decomposition yields

$$J(y) = \|\beta e_1 - \bar{H}_m y\|_2 = \|Q_m (\beta e_1 - \bar{H}_m y)\|_2 = \|g_m - \bar{R}_m y\|_2. \quad (55)$$

The last line can be rewritten

$$J(y) = \left\| \begin{bmatrix} g \\ g_{m+1} \end{bmatrix} - \begin{bmatrix} R \\ 0 \end{bmatrix} y \right\|_2 \quad (56)$$

where R is an upper triangular $m \times m$ matrix, and $g_{m+1}$ is the (m+1)st entry of $g_m$. The minimizer $y_m$ is therefore the solution of the nonsingular $m \times m$ triangular system

$$R y_m = g. \quad (57)$$

Substituting back into (56) gives

$$J(y_m) = |g_{m+1}|. \quad (58)$$

We only need the minimizer $y_m$ every m iterations, but we need the residual norm $\|r_k\|_2$ at every iteration for the convergence test. By factoring the Hessenberg matrix $\bar{H}_k$ at every iteration we can form $g_k = Q_k (\beta e_1)$ inexpensively, and from the above construction we have

$$\|r_k\|_2 = |g_{k+1}|, \quad (59)$$

the absolute value of the last entry of $g_k$.

2.5.3 Restarting GMRES

As GMRES progresses, both its memory and computational requirements increase. At iteration j it needs to store the new vector $\hat{v}_{j+1}$ and it performs j inner products and j "axpys" (vector operations that add a vector to a scalar times a vector: $z = \alpha x + y$) to orthogonalize $\hat{v}_{j+1}$ against all previous $v_i$. If $\hat{v}_{j+1}$ is an n-vector it is easy to see that GMRES has memory requirements that grow like jn and computational requirements that grow like $j^2 n$. A modification of the algorithm that is less expensive per iteration restarts every m steps, where m is small relative to n. After step m the restarted algorithm, denoted GMRES(m), updates the iterate $x_m$ and the residual $r_m$, discards the vectors $v_1, \ldots, v_m$, and restarts with $x_0 = x_m$ and $r_0 = r_m$. Since GMRES does not compute the residual $r_k$ at each step it will need to be computed at restart. This can be done in either of two ways:

- Compute $r_m = b - A x_m$, or
- Use (52): $r_m = V_{m+1} (\beta e_1 - \bar{H}_m y_m)$.

These two expressions are mathematically equivalent, but if m is small the second expression can be substantially cheaper, and for this reason it is the form we use for our experiments. However, as m grows large the second expression can give an $r_m$ that is less accurate, due to a loss of orthogonality of the columns of $V_m$, and the first expression can be more desirable.

GMRES(m) is not guaranteed to converge in n steps, but will still converge if m is chosen large enough. Saad and Schultz [SS86] give conditions under which GMRES(m) will converge. Joubert [Jou94] has introduced a strategy for choosing the restart frequency adaptively. At each iteration the method produces two estimates of the residual norm for the next iteration, one with restarting and one without, and chooses to restart if restarting is expected to be more efficient. The effectiveness of this strategy is determined by the accuracy of the residual norm estimates, and by the correctness of the underlying assumption that a single-step prediction can find the strategy that is most efficient overall. We do not use Joubert's strategy.

2.5.4 Breakdown and stagnation

Many iterative methods for solving nonsymmetric linear systems can break down, and fail to produce a new iterate. Brown [Bro91] contrasts the breakdown behavior of GMRES with that of an earlier method developed by Saad [Saa81], the Full Orthogonalization Method (FOM). FOM uses Arnoldi's method to generate a basis for the Krylov subspace $K_m(A, r_0)$, and constructs the iterate $x_m = x_0 + V_m y$, but instead of minimizing the $l_2$ norm of the residual it solves

$$H_m y = \beta e_1, \quad (60)$$

where $H_m$ is the $m \times m$ Hessenberg matrix in (44). In exact arithmetic, FOM can break down in one of two ways:

- $h_{k+1,k} = 0$. Then $A V_k = V_k H_k$, the vectors $v_1, \ldots, v_k$ span $K_m(A, r_0)$, so the current iterate $x_k$ solves $Ax = b$ exactly. This is called lucky breakdown [SS86] or happy breakdown [Bro91]. It rarely occurs.
- $H_k$ is singular. When $H_k$ is singular FOM cannot generate the new iterate $x_m$.

In the first case both FOM and GMRES generate the exact solution and stop. Brown [Bro91] shows that when FOM generates a singular $H_k$, GMRES will under the same circumstances generate the same iterate again, $x_k = x_{k-1}$, but does not break down. This behavior can persist for several iterations, and it is called stagnation. In exact arithmetic GMRES can recover from stagnation and produce a sequence of iterates that converge to the solution. However, when GMRES is restarted this recovery can be lost: if GMRES(m) stagnates for a full restart cycle, then convergence is lost. Saad and Schultz [SS86] give the example

$$A = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix}, \quad b = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \quad x_0 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \quad (61)$$

where full GMRES converges in two steps, but the restarted method GMRES(1) never does.
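The restarted driver of Section 2.5.3 is just a loop around the cycle sketched earlier, and running it on the example (61) as reconstructed here illustrates the stagnation just described: with this matrix the first Arnoldi vector is orthogonal to its image under A, so GMRES(1) never moves off the initial guess, while GMRES(2) solves the 2 × 2 system. The driver below is an illustrative sketch, not the dissertation's implementation.

import numpy as np

def gmres_restarted(A, b, x0, m, tol=1e-10, max_restarts=50):
    """GMRES(m): repeat gmres_cycle (earlier sketch), restarting from x_m."""
    x = x0.astype(float)
    for _ in range(max_restarts):
        x, resnorm = gmres_cycle(A, b, x, m)
        if resnorm < tol:
            break
    return x, resnorm

# The 2 x 2 example (61) as reconstructed above:
A = np.array([[0.0, 1.0], [-1.0, 0.0]])
b = np.array([1.0, 1.0])
x0 = np.zeros(2)
print(gmres_restarted(A, b, x0, m=1))   # stagnates: the iterate never changes
print(gmres_restarted(A, b, x0, m=2))   # converges; full GMRES needs two steps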

In machine arithmetic it is possible for GMRES to produce a sequence of nearly singular $H_k$. Because of rounding error this can cause fatal stagnation: GMRES can produce a sequence of iterates that change very little, and fail to converge to the solution $A^{-1} b$.

2.6 Preconditioning GMRES

Since n, the order of A, is usually very large, the finite convergence property of CG or GMRES is of only theoretical interest. In practice, convergence is determined by some convergence criterion such as the norm of the residual being suitably small (see Figure 10). In many cases, the rate of convergence can be increased by preconditioning. For CG, this means transforming A so as to maintain symmetry and positive-definiteness:

$$\hat{A} = S A S^T. \quad (62)$$

Here S is a nonsingular matrix, chosen so that $\hat{A}$ has fewer distinct eigenvalues than A, or has a smaller condition number. For a nonsymmetric matrix, (62) is replaced by

$$\hat{A} = S_1 A S_2, \quad (63)$$

where $S_1$ and $S_2$ are nonsingular. If $S_1 = I$, the identity matrix, this is called right preconditioning, and if $S_2 = I$, it is called left preconditioning. We will consider preconditioning on only one side, so that the preconditioning matrix is either $M = S_1^{-1}$ or $M = S_2^{-1}$. The preconditioned system is not formed explicitly, but rather a solution of an auxiliary system involving the matrix M is incorporated into each iteration.

In principle, we could choose M = A, which would give $M^{-1} A = I$, and GMRES would converge in a single step. However this is not practical, since applying the preconditioner would be as difficult as solving the original problem $Ax = b$. Instead we think of M as approximating A in some sense. For CG, M must be chosen so that $\hat{A}$ is symmetric positive definite, but for a nonsymmetric matrix the only constraint is that M be nonsingular, and the auxiliary system be "easy" to solve.

2.6.1 Left preconditioning

GMRES is preconditioned on the left by solving $Mz = y$ with $z = \hat{v}_{j+1}$ and $y = A v_j$, so that step 2(a) in Figure 10 becomes: Solve $M \hat{v}_{j+1} = A v_j$. It also requires that the initial residual $r_0$ be preconditioned. The resulting algorithm is given in Figure 11. The modified residuals are

$$\hat{r}_j = M^{-1} b - M^{-1} A x_j = M^{-1} (b - A x_j) = M^{-1} r_j, \quad (64)$$
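In terms of the sketches above, left preconditioning amounts to running GMRES on $M^{-1}A$ and $M^{-1}b$. The wrapper below forms $M^{-1}A$ explicitly with a dense solve, which is an illustrative shortcut: the algorithm of Figure 11 instead solves $M\hat{v}_{j+1} = A v_j$ step by step, and with multi-step SOR preconditioning the solve would be replaced by the sweep routine sketched in Chapter 1. Convergence is then monitored in the preconditioned residual (64).

import numpy as np

def left_preconditioned_gmres(A, b, M, x0, m, tol=1e-10, max_restarts=50):
    """Left-preconditioned GMRES(m): run the restarted sketch on the
    transformed system (M^{-1} A) x = M^{-1} b. Forming M^{-1} A explicitly
    is done here only to keep the sketch short.
    """
    A_hat = np.linalg.solve(M, A)       # M^{-1} A, for illustration only
    b_hat = np.linalg.solve(M, b)       # M^{-1} b
    return gmres_restarted(A_hat, b_hat, x0, m, tol=tol, max_restarts=max_restarts)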