S-Step and Communication-Avoiding Iterative Methods


Maxim Naumov
NVIDIA, 2701 San Tomas Expressway, Santa Clara, CA 95050

Abstract

In this paper we give an overview of the s-step Conjugate Gradient method and develop a novel formulation for the s-step BiConjugate Gradient Stabilized iterative method. Also, we show how to add preconditioning to both of these s-step schemes. We explain their relationship to the standard, block and communication-avoiding counterparts. Finally, we explore their advantages, such as the availability of the matrix-power kernel A^k x and the use of block dot-products B = X^T Y that group individual dot products together, as well as their drawbacks, such as the extra computational overhead and the numerical stability issues related to the use of the monomial basis for the Krylov subspace K_s = {r, Ar, ..., A^{s-1} r}. We note that the mathematical techniques used in this paper can be applied to other methods in sparse linear algebra and related fields, such as optimization.

1 Introduction

In this paper we are concerned with investigating iterative methods for the solution of the linear system

    Ax = f                                                                      (1)

where the coefficient matrix A ∈ R^{n×n} is nonsingular, and the right-hand-side (RHS) f and the solution x are vectors in R^n. A comprehensive overview of iterative methods can be found in [1, 17]. In this brief study we focus on two popular algorithms: (i) the Conjugate Gradient (CG) and (ii) the BiConjugate Gradient Stabilized (BiCGStab) Krylov subspace iterative methods. These algorithms are often used to solve linear systems with symmetric positive definite (SPD) and nonsymmetric coefficient matrices, respectively.

NVIDIA Technical Report NVR, April 2016. © 2016 NVIDIA Corporation. All rights reserved. This research was developed, in part, with funding from the Defense Advanced Research Projects Agency (DARPA). The views, opinions, and findings contained in this article are those of the author and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.

The CG and BiCGStab methods have already been generalized to their block counterparts in [11, 15]. The block methods work with s vectors simultaneously and were originally designed to solve linear systems with multiple RHS, such as

    AX = F                                                                      (2)

where the multiple RHS and solutions are column vectors in the tall matrices F and X ∈ R^{n×s}, respectively. However, they can be adapted to solve systems with a single RHS, for example by randomly selecting the remaining s−1 RHS [14]. In the latter case the block methods often perform fewer iterations than their standard counterparts, but they do not compute exactly the same solution in i iterations as the standard methods do in i × s iterations.

The s-step methods occupy a middle ground between their standard and block counterparts. They are usually applied to linear systems with a single RHS and they work with s vectors simultaneously, but they are also designed to produce an identical solution in i iterations as the standard methods in i × s iterations, when the computation is performed in exact arithmetic. In fact, the s-step methods can be viewed as an unrolling of s iterations and a clever re-grouping of the recurrence terms in them.

The s-step methods were first investigated by J. Van Rosendale, A. T. Chronopoulos and C. W. Gear in [4, 6, 16], where the authors proposed the s-step CG algorithm for SPD linear systems. Subsequent work by A. T. Chronopoulos and C. D. Swanson in [5, 7] led to the development of s-step methods, such as GMRES and Orthomin, but not BiCGStab, for nonsymmetric linear systems. These methods were further improved and generalized to the solution of eigenvalue problems using Lanczos and Arnoldi iterations in [8, 9].

There are several advantages to using s-step methods. The first advantage is that the matrix-vector multiplications used to build the Krylov subspace K_s = {r, Ar, ..., A^{s-1} r} can be performed together, one immediately after another. This allows us to develop a more efficient matrix-vector multiplication, using a pipeline where the results from the current multiplication are immediately forwarded as inputs to the next. The second advantage is that the dot products spread across multiple iterations are bundled together. This allows us to minimize the fan-in communication that is associated with dot products and that often limits the scalability of parallel platforms.

There are also some disadvantages to using s-step methods. We perform more work per iteration of an s-step method than per s iterations of a standard method. The extra work is often the result of one extra matrix-vector multiplication required to recompute the residual. Also, notice that in practice the vectors computed by the matrix-power kernel A^k x for k = 0, ..., s quickly become linearly dependent. Therefore, s-step methods can suffer from issues related to numerical stability resulting from the use of a monomial basis for the Krylov subspace corresponding to the s internal iterations.

In this paper we first give an overview of how to use properties of the direction and residual vectors in CG to build the s-step CG iterative method. Then, we find the properties of the direction and residual vectors in BiCGStab. Finally, we use this information in conjunction with the recurrence relationships for these vectors to develop a novel s-step BiCGStab iterative method. We also show how to add preconditioning to both of these s-step schemes.
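To make the two advantages concrete, the following sketch (ours, in Python with NumPy/SciPy, not part of the original paper) builds the monomial Krylov basis with back-to-back matrix-vector products and replaces s·s individual dot products by a single small GEMM; the 1D Laplacian test matrix is only an illustration.

```python
# A minimal sketch of the two kernels that make s-step methods attractive:
# the matrix-power kernel that builds K_s = [r, Ar, ..., A^{s-1} r], and a
# block dot-product that fuses many individual dot products into one GEMM.
import numpy as np
import scipy.sparse as sp

def monomial_basis(A, r, s):
    """Return K = [r, Ar, ..., A^{s-1} r] as an n-by-s array (matrix-power kernel)."""
    K = np.empty((r.shape[0], s))
    K[:, 0] = r
    for j in range(1, s):
        K[:, j] = A @ K[:, j - 1]   # each product immediately feeds the next one
    return K

# Example with a small SPD matrix (1D Laplacian) and a random residual.
n, s = 1000, 4
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
r = np.random.rand(n)

K = monomial_basis(A, r, s)
B = K.T @ (A @ K)   # block dot-product: s*s inner products in a single reduction
print(B.shape)      # (s, s)
```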

Finally, we point out that s-step iterative methods are closely related to the communication-avoiding algorithms proposed in [2, 3, 10, 12]. In fact, the main differences between these methods lie in (i) a different basis used by the communication-avoiding algorithms to address numerical stability, and (ii) an efficient (communication-avoiding) implementation of the matrix-power kernel A^k x as well as of distributed dense linear algebra operations, such as the dense QR-factorization used in GMRES [18]. Let us now show how to derive the well known s-step CG and the novel s-step BiCGStab iterative methods, as well as how to add preconditioning to both algorithms.

2 S-Step Conjugate Gradient (CG)

The standard CG method is shown in Alg. 1.

Algorithm 1 Standard CG
1:  Let A ∈ R^{n×n} be an SPD matrix and x_0 be an initial guess.
2:  Compute r_0 = f − Ax_0
3:  for i = 0, 1, ... until convergence do
4:    if i == 0 then                                            ▷ Find directions p
5:      Set p_i = r_i
6:    else
7:      Compute β_i = (r_i^T r_i)/(r_{i−1}^T r_{i−1})            ▷ Dot-product
8:      Compute p_i = r_i + β_i p_{i−1}
9:    end if
10:   Compute q_i = Ap_i                                        ▷ Matrix-vector multiply
11:   Compute α_i = (r_i^T r_i)/(r_i^T q_i)                      ▷ Dot-product
12:   Compute x_{i+1} = x_i + α_i p_i
13:   Compute r_{i+1} = r_i − α_i q_i
14:   if ||r_{i+1}||/||r_0|| ≤ tol then stop end                 ▷ Check convergence
15: end for

Let us now prove that the directions p_i and residuals r_i computed by CG satisfy a few properties. These properties will subsequently be used to set up the s-step CG method.

Lemma 1. The residuals r_i are orthogonal and the directions p_i are A-orthogonal

    r_i^T r_j = 0  and  p_i^T A p_j = 0  for i ≠ j                              (3)

Proof. See Proposition 6.13 in [17].

Theorem 1. The current residual r_i and all previous directions p_j are orthogonal

    r_i^T p_j = 0  for i > j                                                    (4)
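For reference, a minimal NumPy sketch of Alg. 1 follows; it mirrors the pseudo-code above (the denominator r_i^T q_i equals p_i^T q_i in exact arithmetic) and is not an optimized implementation.

```python
# A minimal NumPy sketch of the standard CG method of Alg. 1; the dot products
# on lines 7 and 11 are the global reductions that the s-step variant bundles.
import numpy as np

def cg(A, f, x0, tol=1e-10, maxiter=1000):
    x = x0.copy()
    r = f - A @ x
    r0_norm = np.linalg.norm(r)
    rho = r @ r
    p = r.copy()
    for i in range(maxiter):
        q = A @ p                      # matrix-vector multiply
        alpha = rho / (r @ q)          # dot-product (r^T q = p^T q in exact arithmetic)
        x = x + alpha * p
        r = r - alpha * q
        if np.linalg.norm(r) / r0_norm <= tol:   # check convergence
            break
        rho_new = r @ r
        beta = rho_new / rho           # dot-product
        p = r + beta * p
        rho = rho_new
    return x
```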

Proof. Using induction. Notice that for the base case i = 1 we have

    r_1^T p_0 = r_1^T r_0 = 0                                                   (5)

For the induction step, let us assume r_i^T p_{j−1} = 0. Then

    r_i^T p_j = r_i^T (r_j + β_j p_{j−1}) = β_j r_i^T p_{j−1} = 0               (6)

Let us now derive an alternative expression for the scalars α_i and β_i.

Corollary 1. The scalars α_i and β_i satisfy

    (p_i^T A p_i) α_i = p_i^T r_i                                               (7)
    (p_i^T A p_i) β_{i+1} = −p_i^T A r_{i+1}                                    (8)

Proof. Using Lemma 1 and Theorem 1 we have

    r_i^T r_{i+1} = r_i^T (r_i − α_i A p_i)
                  = (p_i − β_i p_{i−1})^T r_i − α_i (p_i − β_i p_{i−1})^T A p_i
                  = p_i^T r_i − α_i p_i^T A p_i = 0                             (9)

and

    p_i^T A p_{i+1} = p_i^T A (r_{i+1} + β_{i+1} p_i)
                    = p_i^T A r_{i+1} + β_{i+1} p_i^T A p_i = 0                 (10)

Let us now assume that we have s directions p_i available at once; then we can write

    x_{i+s} = x_i + α_i p_i + ... + α_{i+s−1} p_{i+s−1} = x_i + P_i a_i         (11)

where P_i = [p_i, ..., p_{i+s−1}] ∈ R^{n×s} is a tall matrix and a_i = [α_i, ..., α_{i+s−1}]^T ∈ R^s is a small vector. Consequently, the residual can be expressed as

    r_{i+s} = f − A x_{i+s} = r_i − A P_i a_i                                   (12)

Also, let us assume that we have the monomial basis of the Krylov subspace for s iterations

    R_i = [r_i, A r_i, ..., A^{s−1} r_i]                                        (13)

Then, following the CG algorithm, it would be natural to express the next directions as a combination of the previous directions and the basis vectors

    P_{i+1} = R_{i+1} + P_i B_i                                                 (14)

where B_i = [β_{j,k}] ∈ R^{s×s} is a small matrix. At this point we have almost all the building blocks of the s-step CG method, but we are still missing a way to find the scalars α and β stored in the vector a_i and the matrix B_i. These coefficients can be found using the following Corollary.

Corollary 2. The vector a_i = [α_i, ..., α_{i+s−1}]^T and matrix B_i = [β_{j,k}] satisfy

    (P_i^T A P_i) a_i = P_i^T r_i                                               (15)
    (P_i^T A P_i) B_i = −P_i^T A R_{i+1}                                        (16)

Proof. Enforcing the orthogonality of the residual and the search directions in Theorem 1 we have

    P_i^T r_{i+s} = P_i^T (r_i − A P_i a_i) = P_i^T r_i − (P_i^T A P_i) a_i = 0   (17)

Also, enforcing A-orthogonality between the search directions in Lemma 1 we obtain

    P_i^T A P_{i+1} = P_i^T A (R_{i+1} + P_i B_i) = P_i^T A R_{i+1} + (P_i^T A P_i) B_i = 0   (18)

Notice the similarity between Corollary 1 and Corollary 2 for the standard and s-step CG, respectively. Finally, putting all the equations (11) - (16) together we obtain the pseudo-code for the s-step CG iterative method, shown in Alg. 2.

Algorithm 2 S-Step CG
1:  Let A ∈ R^{n×n} be an SPD matrix and x_0 be an initial guess.
2:  Compute r_0 = f − Ax_0 and let index k = i·s (initially k = 0)
3:  for i = 0, 1, ... until convergence do
4:    Compute T = [r_k, Ar_k, ..., A^s r_k]                      ▷ Matrix-power kernel
5:    Let R_i = [r_k, Ar_k, ..., A^{s−1} r_k]                    ▷ Extract R = T(:, 1:s) from T
6:    Let Q_i = [Ar_k, A^2 r_k, ..., A^s r_k] = A R_i            ▷ Extract Q = T(:, 2:s+1) from T
7:    if i == 0 then                                             ▷ Find directions p
8:      Set P_i = R_i
9:    else
10:     Compute C_i = Q_i^T P_{i−1}                              ▷ Block dot-products
11:     Solve W_{i−1} B_{i−1} = −C_i^T                           ▷ Find scalars β
12:     Compute P_i = R_i + P_{i−1} B_{i−1}
13:   end if
14:   Compute W_i = Q_i^T P_i                                    ▷ Block dot-products
15:   Compute g_i = P_i^T r_k
16:   Solve W_i a_i = g_i                                        ▷ Find scalars α
17:   Compute x_{k+s} = x_k + P_i a_i                            ▷ Compute new approximation x
18:   Compute r_{k+s} = f − A x_{k+s}                            ▷ Recompute residual r
19:   if ||r_{k+s}||/||r_0|| ≤ tol then stop end                 ▷ Check convergence
20: end for
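A minimal NumPy sketch of Alg. 2 follows. It relies on the exact-arithmetic identities above (in particular W_i = Q_i^T P_i = P_i^T A P_i and C_i^T = P_{i−1}^T A R_i), uses the monomial basis, and is meant to expose the structure of the method rather than serve as a robust implementation.

```python
# A minimal dense-linear-algebra sketch of the s-step CG method of Alg. 2
# (monomial basis, no preconditioning), mirroring the pseudo-code above.
import numpy as np
import scipy.sparse as sp

def s_step_cg(A, f, x0, s=4, tol=1e-8, maxouter=300):
    x = x0.copy()
    r = f - A @ x
    r0_norm = np.linalg.norm(r)
    for i in range(maxouter):
        # Matrix-power kernel: T = [r, Ar, ..., A^s r]
        T = np.empty((len(r), s + 1))
        T[:, 0] = r
        for j in range(1, s + 1):
            T[:, j] = A @ T[:, j - 1]
        R, Q = T[:, :s], T[:, 1:]                 # R = [r, ..., A^{s-1} r], Q = A R
        if i == 0:
            P = R.copy()
        else:
            C = Q.T @ P_prev                      # block dot-products
            B = np.linalg.solve(W_prev, -C.T)     # find scalars beta
            P = R + P_prev @ B
        W = Q.T @ P                               # block dot-products (= P^T A P in exact arithmetic)
        g = P.T @ r
        a = np.linalg.solve(W, g)                 # find scalars alpha
        x = x + P @ a
        r = f - A @ x                             # recompute residual
        if np.linalg.norm(r) / r0_norm <= tol:
            break
        P_prev, W_prev = P, W
    return x

# Illustration on a small 1D Laplacian with f = A e.
n = 400
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
f = A @ np.ones(n)
x = s_step_cg(A, f, np.zeros(n), s=4)
print(np.linalg.norm(f - A @ x) / np.linalg.norm(f))   # relative residual reached
```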

Let us now introduce preconditioning into the s-step CG in a manner similar to the standard algorithm. Let the preconditioned linear system be

    (L^{−1} A L^{−T})(L^T x) = L^{−1} f                                         (19)

Then, we can apply the s-step scheme to the preconditioned system (19) and re-factor the recurrences in terms of the preconditioner M = LL^T. If we express everything in terms of the preconditioned directions M^{−1} P_i and work with the original solution vector x_i we obtain the preconditioned s-step CG method, shown in Alg. 3.

Notice that the tall matrices Z_i and Q_i on lines 6 and 7 can be constructed during the computation of T on line 5 in Alg. 3. The key observation is that the computation of T proceeds in the following fashion

    r_i, M^{−1} r_i, (A M^{−1}) r_i, M^{−1}(A M^{−1}) r_i, (A M^{−1})^2 r_i, ..., M^{−1}(A M^{−1})^{s−1} r_i, (A M^{−1})^s r_i

and therefore the alternating odd and even terms define the columns of the tall matrices Z_i and Q_i.

Algorithm 3 Preconditioned S-Step CG
1:  Let A ∈ R^{n×n} be an SPD matrix and x_0 be an initial guess.
2:  Let M = LL^T be an SPD preconditioner, so that M ≈ A.
3:  Compute r_0 = f − Ax_0, r̂_0 = M^{−1} r_0 and let index k = i·s (initially k = 0)
4:  for i = 0, 1, ... until convergence do
5:    Compute T = M^{−1}[r_k, Âr_k, ..., Â^s r_k]                ▷ Matrix-power kernel
6:    Let Z_i = M^{−1}[r_k, Âr_k, ..., Â^{s−1} r_k] = M^{−1} R_i  ▷ Build Z & Q during
7:    Let Q_i = [Âr_k, Â^2 r_k, ..., Â^s r_k] = A Z_i             ▷ computation of T
8:        where the auxiliary Â = A M^{−1} and R_i = [r_k, Âr_k, ..., Â^{s−1} r_k]
9:    if i == 0 then                                             ▷ Find directions p
10:     Set P_i = Z_i
11:   else
12:     Compute C_i = Q_i^T P_{i−1}                              ▷ Block dot-products
13:     Solve W_{i−1} B_{i−1} = −C_i^T                           ▷ Find scalars β
14:     Compute P_i = Z_i + P_{i−1} B_{i−1}
15:   end if
16:   Compute W_i = Q_i^T P_i                                    ▷ Block dot-products
17:   Compute g_i = P_i^T r_k
18:   Solve W_i a_i = g_i                                        ▷ Find scalars α
19:   Compute x_{k+s} = x_k + P_i a_i                            ▷ Compute new approximation x
20:   Compute r_{k+s} = f − A x_{k+s}                            ▷ Recompute residual r
21:   Compute r̂_{k+s} = M^{−1} r_{k+s}
22:   if ||r_{k+s}||/||r_0|| ≤ tol then stop end                 ▷ Check convergence
23: end for
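The interleaving described above can be sketched as follows; prec_solve stands for an application of M^{−1} (a Jacobi solve is used only as a stand-in example, not the paper's preconditioner).

```python
# A minimal sketch of how the preconditioned matrix-power kernel of Alg. 3
# interleaves preconditioner solves and matrix-vector products, so that the
# tall matrices Z_i and Q_i fall out as the odd and even terms of the sequence
# r, M^{-1}r, (AM^{-1})r, M^{-1}(AM^{-1})r, ...
import numpy as np
import scipy.sparse as sp

def preconditioned_power_kernel(A, prec_solve, r, s):
    """Return Z = M^{-1}[r, Ahat r, ..., Ahat^{s-1} r] and Q = A Z, with Ahat = A M^{-1}."""
    n = r.shape[0]
    Z = np.empty((n, s))
    Q = np.empty((n, s))
    v = r
    for j in range(s):
        Z[:, j] = prec_solve(v)    # odd terms:  M^{-1} (A M^{-1})^j r
        Q[:, j] = A @ Z[:, j]      # even terms: (A M^{-1})^{j+1} r
        v = Q[:, j]
    return Z, Q

# Example with a Jacobi (diagonal) preconditioner as a stand-in for M = LL^T.
n, s = 1000, 4
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
diag = A.diagonal()
Z, Q = preconditioned_power_kernel(A, lambda v: v / diag, np.random.rand(n), s)
```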

Finally, notice that the preconditioned s-step CG performs one extra matrix-vector multiplication with A and one extra preconditioner solve with M per iteration, when compared with s iterations of the standard preconditioned CG method. The extra work happens during the re-computation of the residual at the end of every s-step iteration.

3 S-Step BiConjugate Gradient Stabilized (BiCGStab)

Let us now focus on the s-step BiCGStab method, which has not been previously derived by A. T. Chronopoulos et al. The standard BiCGStab is given in Alg. 4.

Algorithm 4 Standard BiCGStab
1:  Let A ∈ R^{n×n} be a nonsingular matrix and x_0 be an initial guess.
2:  Compute r_0 = f − Ax_0 and set p_0 = r_0
3:  Let ř_0 be arbitrary (for example ř_0 = r_0)
4:  for i = 0, 1, ... until convergence do
5:    Compute z_i = Ap_i                                         ▷ Matrix-vector multiply
6:    Compute α_i = (ř_0^T r_i)/(ř_0^T z_i)                       ▷ Dot-product
7:    Compute s_i = r_i − α_i z_i                                 ▷ Find directions s
8:    if ||s_i||/||r_0|| ≤ tol then set x_{i+1} = x_i + α_i p_i and stop end   ▷ Early exit
9:    Compute v_i = As_i                                          ▷ Matrix-vector multiply
10:   Compute ω_i = (s_i^T v_i)/(v_i^T v_i)                       ▷ Dot-product
11:   Compute x_{i+1} = x_i + α_i p_i + ω_i s_i
12:   Compute r_{i+1} = s_i − ω_i v_i
13:   if ||r_{i+1}||/||r_0|| ≤ tol or early exit then stop end    ▷ Check convergence
14:   Compute β_i = (ř_0^T r_{i+1})/(ř_0^T r_i) · (α_i/ω_i)
15:   Compute p_{i+1} = r_{i+1} + β_i (p_i − ω_i z_i)             ▷ Find directions p
16: end for

Recall that the directions p_i and residuals r_i in BiCGStab have been derived from the BiCG method using polynomial recurrences to avoid multiplication with the transpose of the coefficient matrix A^T [17]. The properties of the directions and residuals in the BiCG method are well known and can easily be used to set up the corresponding s-step method. The relationships between them in BiCGStab are far less clear. Therefore, the first step is to understand the properties of the directions s_i, p_i and residuals r_i, ř_0 in the BiCGStab method.

Theorem 2. The directions s_i, p_i and residuals r_i, ř_0 satisfy the following relationships

    ř_0^T s_i = 0                                                               (20)
    r_{i+1}^T A s_i = 0                                                         (21)
    ř_0^T (p_{i+1} − β_i p_i) = 0                                               (22)
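A minimal NumPy sketch of Alg. 4, mirroring the pseudo-code above (again not an optimized implementation):

```python
# A minimal NumPy sketch of the standard BiCGStab method of Alg. 4.
import numpy as np

def bicgstab(A, f, x0, tol=1e-10, maxiter=1000):
    x = x0.copy()
    r = f - A @ x
    r0_norm = np.linalg.norm(r)
    p = r.copy()
    r_hat = r.copy()                         # arbitrary shadow residual, here r_hat = r_0
    for i in range(maxiter):
        z = A @ p                            # matrix-vector multiply
        alpha = (r_hat @ r) / (r_hat @ z)    # dot-product
        s = r - alpha * z                    # find direction s
        if np.linalg.norm(s) / r0_norm <= tol:
            return x + alpha * p             # early exit
        v = A @ s                            # matrix-vector multiply
        omega = (s @ v) / (v @ v)            # dot-product
        x = x + alpha * p + omega * s
        r_new = s - omega * v
        if np.linalg.norm(r_new) / r0_norm <= tol:
            return x                         # check convergence
        beta = (r_hat @ r_new) / (r_hat @ r) * (alpha / omega)
        p = r_new + beta * (p - omega * z)   # find direction p
        r = r_new
    return x
```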

Proof. The first relationship (20) follows from

    ř_0^T s_i = ř_0^T (r_i − α_i A p_i) = ř_0^T r_i − α_i ř_0^T A p_i = ř_0^T r_i − ř_0^T r_i = 0   (23)

The second relationship (21) follows from

    r_{i+1}^T A s_i = (s_i − ω_i A s_i)^T A s_i = s_i^T A s_i − ω_i (A s_i)^T (A s_i) = s_i^T A s_i − s_i^T A s_i = 0   (24)

The last relationship (22) follows from

    ř_0^T (p_{i+1} − β_i p_i) = ř_0^T (r_{i+1} − β_i ω_i A p_i)                 (25)
                              = ř_0^T r_{i+1} − (ř_0^T r_{i+1} / ř_0^T A p_i) ř_0^T A p_i = 0   (26)

Notice that we can express the recurrences of BiCGStab in the following convenient form

    s_i = r_i − α_i A p_i                                                       (27)
    r_{i+1} = (I − ω_i A) s_i                                                   (28)
    q_i = (I − ω_i A) p_i                                                       (29)
    p_{i+1} = r_{i+1} + β_i q_i                                                 (30)

Corollary 3. Let the auxiliary vector y_i = s_i + β_i p_i; then the directions p_{i+1} = (I − ω_i A) y_i and

    ř_0^T A y_i = 0                                                             (31)

as long as ω_i ≠ 0.

Proof. Notice that using the recurrences (28), (29) and (30) we have

    p_{i+1} = r_{i+1} + β_i q_i = (I − ω_i A) s_i + β_i (I − ω_i A) p_i = (I − ω_i A)(s_i + β_i p_i)   (32)

so that A(s_i + β_i p_i) = (1/ω_i)(s_i + β_i p_i − p_{i+1}), and using the first and last relationships in Theorem 2 we have

    ř_0^T A(s_i + β_i p_i) = (1/ω_i) ( ř_0^T s_i − ř_0^T (p_{i+1} − β_i p_i) ) = 0   (33)

These properties are interesting by themselves, but it is difficult to rely only on them to set up a linear system for finding the scalars α, ω and β similarly to the CG and BiCG methods, because they involve a single vector corresponding to the shadow residual ř_0, rather than all directions p in CG or shadow directions p̌ in BiCG. Therefore, we have to find additional conditions and a different way of setting up the s-step method.
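The relationships (20) - (22) are easy to confirm numerically; the following small sketch performs one BiCGStab iteration by hand on a random nonsymmetric system (the matrix and seed are arbitrary) and prints the three inner products, which should vanish up to roundoff.

```python
# A small numerical check of the three relationships in Theorem 2 for the
# very first BiCGStab iteration on a random, diagonally dominant system.
import numpy as np

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned nonsymmetric matrix
f = rng.standard_normal(n)

r0 = f.copy()                                     # x_0 = 0, so r_0 = f
p0, r_hat = r0.copy(), r0.copy()

alpha = (r_hat @ r0) / (r_hat @ (A @ p0))
s0 = r0 - alpha * (A @ p0)
omega = (s0 @ (A @ s0)) / ((A @ s0) @ (A @ s0))
r1 = s0 - omega * (A @ s0)
beta = (r_hat @ r1) / (r_hat @ r0) * (alpha / omega)
p1 = r1 + beta * (p0 - omega * (A @ p0))

print(abs(r_hat @ s0))                 # (20): ~0 up to roundoff
print(abs(r1 @ (A @ s0)))              # (21): ~0 up to roundoff
print(abs(r_hat @ (p1 - beta * p0)))   # (22): ~0 up to roundoff
```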

Let us now assume that we have s directions p_i and s_i available at once; then we have

    x_{i+s} = x_i + α_i p_i + ... + α_{i+s−1} p_{i+s−1} + ω_i s_i + ... + ω_{i+s−1} s_{i+s−1} = x_i + P_i a_i + S_i o_i   (34)

where P_i = [p_i, ..., p_{i+s−1}] and S_i = [s_i, ..., s_{i+s−1}] ∈ R^{n×s} are tall matrices and a_i = [α_i, ..., α_{i+s−1}]^T and o_i = [ω_i, ..., ω_{i+s−1}]^T ∈ R^s are small vectors. Consequently, the residual can be expressed as

    r_{i+s} = f − A x_{i+s} = r_i − A (P_i a_i + S_i o_i)                        (35)

Also, let us assume that we have the monomial basis of the Krylov subspace for s iterations and the monomial basis for the directions

    R_i = [r_i, A r_i, ..., A^{2(s−1)} r_i]                                      (36)
    P_i = [p_i, A p_i, ..., A^{2(s−1)} p_i]                                      (37)

Notice that using the recurrences (27) - (30) we can write the following

    [s_i, As_i, q_i, Aq_i] = [r_i, Ar_i, A^2 r_i, p_i, Ap_i, A^2 p_i] C          (38)
    [r_{i+1}, p_{i+1}] = [s_i, As_i, q_i, Aq_i] D                                (39)

where the transitional matrices of scalar coefficients C and D are

    C = C^{(1)} C^{(2)} = [  1      0      0      0   ]
                          [  0      1      0      0   ]
                          [  0      0      0      0   ]
                          [  0      0      1      0   ]
                          [ −α_i    0     −ω_i    1   ]
                          [  0     −α_i    0     −ω_i ]                          (40)

with C^{(1)} mapping the basis [r_i, Ar_i, A^2 r_i, p_i, Ap_i, A^2 p_i] to [s_i, As_i, p_i, Ap_i, A^2 p_i] and C^{(2)} mapping the latter to [s_i, As_i, q_i, Aq_i], and

    D = D^{(1)} D^{(2)} = [  1      1   ]
                          [ −ω_i   −ω_i ]
                          [  0      β_i ]
                          [  0      0   ]                                        (41)

with D^{(1)} mapping [s_i, As_i, q_i, Aq_i] to [r_{i+1}, q_i, Aq_i] and D^{(2)} mapping the latter to [r_{i+1}, p_{i+1}].
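The entries of C and D follow directly from the recurrences (27) - (30); the sketch below forms them for a single step and checks the basis relations (38) - (39) numerically (the test matrix is an arbitrary diagonally dominant example, not one from the paper).

```python
# A small sketch that forms the transitional matrices C and D of (40)-(41) for
# one BiCGStab step and checks the basis relations (38)-(39); alpha, omega and
# beta are computed by the standard recurrences first.
import numpy as np

rng = np.random.default_rng(1)
n = 200
A = rng.standard_normal((n, n)) + n * np.eye(n)
r = rng.standard_normal(n)
p, r_hat = r.copy(), r.copy()

alpha = (r_hat @ r) / (r_hat @ (A @ p))
s = r - alpha * (A @ p)
omega = (s @ (A @ s)) / ((A @ s) @ (A @ s))
r_next = s - omega * (A @ s)
beta = (r_hat @ r_next) / (r_hat @ r) * (alpha / omega)
q = p - omega * (A @ p)
p_next = r_next + beta * q

# Basis [r, Ar, A^2 r, p, Ap, A^2 p] and the transitional matrices C (6x4), D (4x2).
V = np.column_stack([r, A @ r, A @ (A @ r), p, A @ p, A @ (A @ p)])
C = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 0, 0],
              [0, 0, 1, 0],
              [-alpha, 0, -omega, 1],
              [0, -alpha, 0, -omega]])
D = np.array([[1, 1],
              [-omega, -omega],
              [0, beta],
              [0, 0]])

SQ = np.column_stack([s, A @ s, q, A @ q])
print(np.linalg.norm(V @ C - SQ))                                   # (38): ~0
print(np.linalg.norm(SQ @ D - np.column_stack([r_next, p_next])))   # (39): ~0
```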

Also, note that for B = CD and k ≥ 0 we have

    A^k [r_{i+1}, p_{i+1}] = A^k [r_i, Ar_i, A^2 r_i, p_i, Ap_i, A^2 p_i] B      (42)

Therefore, letting B_i ∈ R^{(4s−2)×(4s−6)} follow the same pattern as B we have

    [r_{i+1}, ..., A^{2(s−2)} r_{i+1}, p_{i+1}, ..., A^{2(s−2)} p_{i+1}] = [r_i, ..., A^{2(s−1)} r_i, p_i, ..., A^{2(s−1)} p_i] B_i   (43)

and consequently

    [r_{i+s}, p_{i+s}] = [r_i, ..., A^{2(s−1)} r_i, p_i, ..., A^{2(s−1)} p_i] B_i B_{i+1} ⋯ B_{i+s−2}   (44)

Hence, we can use (44) to build the directions s_i, p_i and residuals r_i after s iterations, as long as we are able to compute the scalars α, ω and β used in the transitional matrices. Let us now compute the matrix W_i ∈ R^{(4s−2)×(4s−2)} and vector w_i ∈ R^{4s−2} such that

    W_i = [R_i, P_i]^T [R_i, P_i]                                               (45)
    w_i^T = ř_0^T [R_i, P_i] = [g_i^T, h_i^T]                                   (46)

and let W_i(j, k) denote the element in the j-th row and k-th column of the matrix W_i, while w_i(j) denotes the j-th element of the vector w_i. Also, let us use 1-based indexing. Then,

    α_i = (ř_0^T r_i)/(ř_0^T A p_i) = g_i(1)/h_i(2)                             (47)

Further, after the update

    W_i = C_i^{(1)T} W_i C_i^{(1)}                                              (48)

we have

    ω_i = (s_i^T A s_i)/((A s_i)^T (A s_i)) = W_i(1, 2)/W_i(2, 2)               (49)

and, after the update

    w_i = (C_i D_i^{(1)})^T w_i                                                 (50)

we obtain

    β_i = (ř_0^T r_{i+1})/(ř_0^T r_i) · (α_i/ω_i) = (w_i(1)/g_i(1)) · (α_i/ω_i)   (51)

Finally,

    W_{i+1} = (C_i^{(2)} D_i)^T W_i (C_i^{(2)} D_i) = B_i^T W_i B_i             (52)

and

    w_{i+1} = D_i^{(2)T} w_i = B_i^T w_i                                        (53)

Finally, putting all the equations (34) - (53) together we obtain the pseudo-code for the s-step BiCGStab iterative method, shown in Alg. 5.
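The following sketch illustrates the bookkeeping of (45) - (49) for the first inner iteration: once W and w have been formed with block dot-products, α is read off from w and, after the C^{(1)} update, ω and ||s||^2 are read off from W, with no further global reductions. The test matrix is arbitrary; the cross-checks against explicit dot products should agree up to roundoff.

```python
# A small sketch of the Gram-matrix bookkeeping of (45)-(49): alpha and omega of
# the first inner iteration come from entries of W and w only.
import numpy as np

rng = np.random.default_rng(2)
n = 200
A = rng.standard_normal((n, n)) + n * np.eye(n)
r = rng.standard_normal(n)
p, r_hat = r.copy(), r.copy()

# Monomial bases for r and p (three powers suffice for one inner iteration).
RP = np.column_stack([r, A @ r, A @ (A @ r), p, A @ p, A @ (A @ p)])
W = RP.T @ RP            # block dot-products
w = r_hat @ RP           # w = [g, h]

alpha = w[0] / w[4]      # alpha = (r_hat^T r)/(r_hat^T A p) = g(1)/h(2)

# Update W to the basis [s, As, p, Ap, A^2 p] with the transitional matrix C^(1).
C1 = np.zeros((6, 5))
C1[0, 0] = C1[1, 1] = 1.0
C1[4, 0] = C1[5, 1] = -alpha
C1[3, 2] = C1[4, 3] = C1[5, 4] = 1.0
Ws = C1.T @ W @ C1

omega = Ws[0, 1] / Ws[1, 1]          # omega = s^T A s / (A s)^T (A s), from W only

# Cross-check against the explicit dot products.
s = r - alpha * (A @ p)
print(abs(omega - (s @ (A @ s)) / ((A @ s) @ (A @ s))))   # ~0
print(abs(Ws[0, 0] - s @ s))                              # ||s||^2 from W: ~0
```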

Algorithm 5 S-Step BiCGStab
1:  Let A ∈ R^{n×n} be a nonsingular coefficient matrix and let x_0 be an initial guess.
2:  Compute r_0 = f − Ax_0 and set p_0 = r_0
3:  Let ř_0 be arbitrary (for example ř_0 = r_0) and let index k = i·s + j (initially k = 0)
4:  for i = 0, 1, ... until convergence do
5:    Compute T = [[r_k, p_k], A[r_k, p_k], ..., A^{2s−1}[r_k, p_k]]            ▷ Matrix-power kernel
6:    Let R_i = [r_k, Ar_k, ..., A^{2s−1} r_k] = [R̃_i, A^{2s−1} r_k]            ▷ Extract R from T
7:    Let P_i = [p_k, Ap_k, ..., A^{2s−1} p_k] = [P̃_i, A^{2s−1} p_k] = [p_k, P̄_i]   ▷ Extract P from T
8:    Compute W_i = [R_i, P_i]^T [R_i, P_i]                                     ▷ Block dot-products
9:    Compute w_i^T = ř_0^T [R_i, P_i] = [g_i^T, h_i^T]
10:   Let ˜ and ¯ indicate all but the last and all but the first columns, as shown for R_i and P_i
11:   for j = 0, ..., s−1 do
12:     Set α_k = g_k(1)/h_k(2) and store a_i(j) = α_k                          ▷ Note α_k = ř_0^T r_k / ř_0^T A p_k
13:     Compute S_k = R̃_k − α_k P̄_k                                             ▷ S_k = [s_k, As_k, ..., A^{2(s−j−1)} s_k]
14:     Update W_k = C_k^{(1)T} W_k C_k^{(1)}                                    ▷ Using s_k = r_k − α_k A p_k
15:     if √W_k(1, 1) / ||r_0|| ≤ tol then break end                            ▷ Check ||s_k||_2 for early exit
16:     Set ω_k = W_k(1, 2)/W_k(2, 2) and store o_i(j) = ω_k                    ▷ Note ω_k = s_k^T A s_k / (A s_k)^T (A s_k)
17:     Compute Q_k = P̃_k − ω_k P̄_k                                             ▷ Q_k = [q_k, Aq_k, ..., A^{2(s−j−1)} q_k]
18:     if (j == s−1) then break end                                            ▷ Check for last iteration
19:     Compute R_{k+1} = S̃_k − ω_k S̄_k                                         ▷ R_{k+1} = [r_{k+1}, ..., A^{2(s−j−1)−2} r_{k+1}]
20:     Update w_k = (C_k D_k^{(1)})^T w_k                                       ▷ Using r_{k+1} = s_k − ω_k A s_k
21:     Set β_k = (w_k(1)/g_k(1)) · (α_k/ω_k)                                   ▷ Note β_k = (ř_0^T r_{k+1} / ř_0^T r_k)(α_k/ω_k)
22:     Compute P_{k+1} = R_{k+1} + β_k Q̃_k                                      ▷ P_{k+1} = [p_{k+1}, ..., A^{2(s−j−1)−2} p_{k+1}]
23:     Update W_{k+1} = B_k^T W_k B_k                                          ▷ Using p_{k+1} = r_{k+1} + β_k q_k
24:     Update w_{k+1} = B_k^T w_k
25:   end for
26:   Let P_i = [p_{is}, ..., p_k] and S_i = [s_{is}, ..., s_k] have the constructed directions
27:   Compute x_{i+s} = x_i + P_i a_i + S_i o_i                                 ▷ Do not use the last s_k if early exit
28:   Compute r_{i+s} = f − A x_{i+s}                                           ▷ Note k = is + s − 1 unless early exit
29:   if ||r_{i+s}||/||r_0|| ≤ tol or early exit then stop end                  ▷ Check convergence
30:   Compute β_k = (ř_0^T r_{i+s} / ř_0^T r_k)(α_k/ω_k) = (ř_0^T r_{i+s} / g_k(1)) · (a_i(s−1)/o_i(s−1))
31:   Compute p_{i+s} = r_{i+s} + β_k q_k                                       ▷ Note α_k = a_i(s−1) and ω_k = o_i(s−1)
32: end for

It is worthwhile to mention again that the updates using the transitional matrices of scalar coefficients B_i = C_i D_i can be expressed in terms of recurrences. In fact, notice that the matrices C^{(1)}, C^{(2)}, D^{(1)}, D^{(2)} in equations (40) - (41) have a one-to-one correspondence to the recurrences in equations (27) - (30). We use a mix of transitional matrices and recurrences for clarity in the pseudo-code. However, we point out that the latter allows for a more efficient implementation of the algorithm in practice.

Also, notice that at each inner iteration the size of the tall matrices R_i and P_i, as well as of the small square matrix W_i and vector w_i, is reduced by 4 elements (columns for R_i and P_i, rows/columns for W_i and vector elements for w_i). These 4 elements correspond to the two extra powers A and A^2 applied to the residual r and the directions p. Consequently, the updates can be done in-place, preserving all previously computed residuals and directions without requiring extra memory storage for them.

Let us now introduce preconditioning into the s-step BiCGStab in a manner similar to the standard algorithm. Let the preconditioned linear system be

    (A M^{−1})(M x) = f                                                         (54)

Then, we can apply the s-step scheme to the preconditioned system (54) and re-factor the recurrences in terms of the preconditioner M = LU. If we express everything in terms of the preconditioned directions M^{−1} P_i and M^{−1} S_i and work with the original solution vector x_i we obtain the preconditioned s-step BiCGStab method, shown in Alg. 6.

Notice that the tall matrices R_i, P_i, V_i and Z_i on lines 7-10 can be constructed during the computation of T on line 6 in Alg. 6. The key observation is that the computation of T proceeds in the following fashion

    [r_i, p_i], M^{−1}[r_i, p_i], (A M^{−1})[r_i, p_i], M^{−1}(A M^{−1})[r_i, p_i], (A M^{−1})^2 [r_i, p_i], ..., M^{−1}(A M^{−1})^{2s−2}[r_i, p_i], (A M^{−1})^{2s−1}[r_i, p_i]   (55)

and therefore the alternating terms in the first and second columns define the tall matrices V_i, R_i and Z_i, P_i, respectively.

It is important to point out that in the preconditioned s-step BiCGStab we carry both R_i, P_i and V_i, Z_i through the s inner iterations. The latter pair is needed to compute the original x_{i+s} on line 30 and then the residual r_{i+s} on line 31, while the former pair is needed to compute the direction p̂_{i+s} on line 35 in Alg. 6. Both the residual and the direction are needed to compute T in the next outer iteration. Notice that this is different from the preconditioned s-step CG, where only the preconditioned residual is needed for the next outer iteration.

Finally, notice that the preconditioned s-step BiCGStab performs one extra matrix-vector multiplication with A and one extra preconditioner solve with M per iteration, when compared with s iterations of the standard preconditioned BiCGStab method. The extra work happens during the re-computation of the residual at the end of every s-step iteration.
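As an illustration of the right-preconditioned operator Â = A M^{−1} (using SciPy's incomplete LU purely as an example of M = LU; this is not the paper's setup), each application of Â is one preconditioner solve followed by one sparse matrix-vector product, which is exactly what the matrix-power kernel in Alg. 6 chains together:

```python
# A minimal sketch of the right-preconditioned operator Ahat = A M^{-1},
# with SciPy's ILU standing in for the M = LU preconditioner.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 1000
A = sp.diags([-1.0, 2.2, -1.1], [-1, 0, 1], shape=(n, n), format="csc")  # nonsymmetric example
ilu = spla.spilu(A, drop_tol=1e-3)       # incomplete factorization M = LU

def a_hat(v):
    """Apply Ahat = A M^{-1}: preconditioner solve, then matrix-vector product."""
    return A @ ilu.solve(v)

# The matrix-power kernel of Alg. 6 is then a chain of a_hat applications, with
# the intermediate M^{-1}-applied vectors kept as the columns of V_i and Z_i.
r = np.random.rand(n)
print(np.linalg.norm(a_hat(r) - r) / np.linalg.norm(r))   # small when M approximates A well
```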

Algorithm 6 Preconditioned S-Step BiCGStab
1:  Let A ∈ R^{n×n} be a nonsingular coefficient matrix and let x_0 be an initial guess.
2:  Let M = LU be the preconditioner, so that M ≈ A.
3:  Compute r_0 = f − Ax_0 and set p_0 = r_0
4:  Let ř_0 be arbitrary (for example ř_0 = r_0) and let index k = i·s + j (initially k = 0)
5:  for i = 0, 1, ... until convergence do
6:    Compute T = [[r_k, p_k], Â[r_k, p_k], ..., Â^{2s−1}[r_k, p_k]] with Â = A M^{−1}   ▷ Matrix-power kernel
7:    Let R_i = [r_k, Âr_k, ..., Â^{2s−1} r_k] = [R̃_i, Â^{2s−1} r_k]             ▷ Build R, P, V and Z
8:    Let P_i = [p_k, Âp_k, ..., Â^{2s−1} p_k] = [P̃_i, Â^{2s−1} p_k] = [p_k, P̄_i]   ▷ during computation of T
9:    Let V_i = M^{−1}[r_k, Âr_k, ..., Â^{2s−2} r_k] = M^{−1} R̃_i
10:   Let Z_i = M^{−1}[p_k, Âp_k, ..., Â^{2s−2} p_k] = M^{−1} P̃_i
11:   Compute W_i = [R_i, P_i]^T [R_i, P_i]                                      ▷ Block dot-products
12:   Compute w_i^T = ř_0^T [V_i, Z_i] = [g_i^T, h_i^T]
13:   Let ˜ and ¯ indicate all but the last and all but the first columns, as shown for R_i and P_i
14:   for j = 0, ..., s−1 do
15:     Set α_k = g_k(1)/h_k(2) and store a_i(j) = α_k                           ▷ Note α_k = ř_0^T r_k / ř_0^T Âp_k
16:     Compute [S_k, Ŝ_k] = [R̃_k, Ṽ_k] − α_k [P̄_k, Z̄_k]                        ▷ S_k = [s_k, Âs_k, ..., Â^{2(s−j−1)} s_k]
17:     Update W_k = C_k^{(1)T} W_k C_k^{(1)}                                    ▷ Using s_k = r_k − α_k Âp_k
18:     if √W_k(1, 1) / ||r_0|| ≤ tol then break end                             ▷ Check ||s_k||_2 for early exit
19:     Set ω_k = W_k(1, 2)/W_k(2, 2) and store o_i(j) = ω_k                     ▷ Note ω_k = s_k^T Âs_k / (Âs_k)^T (Âs_k)
20:     Compute [Q_k, Q̂_k] = [P̃_k, Z̃_k] − ω_k [P̄_k, Z̄_k]                        ▷ Q_k = [q_k, Âq_k, ..., Â^{2(s−j−1)} q_k]
21:     if (j == s−1) then break end                                             ▷ Check for last iteration
22:     Compute [R_{k+1}, V_{k+1}] = [S̃_k, Ŝ̃_k] − ω_k [S̄_k, Ŝ̄_k]                ▷ R_{k+1} = [r_{k+1}, ..., Â^{2(s−j−1)−2} r_{k+1}]
23:     Update w_k = (C_k D_k^{(1)})^T w_k                                       ▷ Using r_{k+1} = s_k − ω_k Âs_k
24:     Set β_k = (w_k(1)/g_k(1)) · (α_k/ω_k)                                    ▷ Note β_k = (ř_0^T r_{k+1} / ř_0^T r_k)(α_k/ω_k)
25:     Compute [P_{k+1}, Z_{k+1}] = [R_{k+1}, V_{k+1}] + β_k [Q̃_k, Q̂̃_k]        ▷ P_{k+1} = [p_{k+1}, ..., Â^{2(s−j−1)−2} p_{k+1}]
26:     Update W_{k+1} = B_k^T W_k B_k                                           ▷ Using p_{k+1} = r_{k+1} + β_k q_k
27:     Update w_{k+1} = B_k^T w_k
28:   end for
29:   Let Ẑ_i = M^{−1}[p_{is}, ..., p_k] and Ŝ_i = M^{−1}[s_{is}, ..., s_k] have the constructed directions
30:   Compute x_{i+s} = x_i + Ẑ_i a_i + Ŝ_i o_i                                  ▷ Do not use the last s_k if early exit
31:   Compute r_{i+s} = f − A x_{i+s}                                            ▷ Note k = is + s − 1 unless early exit
32:   if ||r_{i+s}||/||r_0|| ≤ tol or early exit then stop end                   ▷ Check convergence
33:   Compute r̂_{i+s} = M^{−1} r_{i+s}
34:   Compute β_k = (ř_0^T r_{i+s} / g_k(1)) · (a_i(s−1)/o_i(s−1))               ▷ Note α_k = a_i(s−1) and ω_k = o_i(s−1)
35:   Compute p̂_{i+s} = r̂_{i+s} + β_k q̂_k
36: end for

4 Numerical Experiments

In this section we study the numerical behavior of the preconditioned s-step CG and BiCGStab iterative methods. In order to facilitate comparisons, we use the same matrices as previous investigations of their standard and block counterparts [13, 14]. The seven SPD and five nonsymmetric matrices from the UFSMC [19], with their respective number of rows (m), columns (n = m) and non-zero elements (nnz), are grouped and shown in increasing order in Tab. 1.

We compare the s-step iterative methods using a reference MATLAB implementation. We let the initial guess be zero and the RHS f = Ae, where e = (1, ..., 1)^T. Also, unless stated otherwise, we let the stopping criteria be the maximum number of iterations 40 or relative residual ||r_i||/||r_0|| < 10^{-2}, where r_i = f − Ax_i is the residual at the i-th iteration. These modest stopping criteria allow us to illustrate the behavior of the algorithms without being significantly affected by the accumulation of floating point errors through iterations, which is addressed later on separately with different stopping criteria that match previous studies [13, 14].

#    Matrix            m, n        nnz         SPD   Application
1.   offshore          259,789     4,242,673   yes   Geophysics
2.   af_shell3         504,855     17,562,051  yes   Mechanics
3.   parabolic_fem     525,825     3,674,625   yes   General
4.   apache2           715,176     4,817,870   yes   Mechanics
5.   ecology2          999,999     4,995,991   yes   Biology
6.   thermal2          1,228,045   8,580,313   yes   Thermal Simulation
7.   G3_circuit        1,585,478   7,660,826   yes   Circuit Simulation
8.   FEM_3D_thermal2   147,900     3,489,300   no    Mechanics
9.   thermomech_dK     204,316     2,846,228   no    Mechanics
10.  ASIC_320ks        321,671     1,316,085   no    Circuit Simulation
11.  cage13            445,315     7,479,343   no    Biology
12.  atmosmodd         1,270,432   8,814,880   no    Atmospheric Model

Table 1: Symmetric positive definite (SPD) and nonsymmetric test matrices

First, let us illustrate that s-step iterative methods indeed produce the same approximation to the solution x_i after i iterations as their standard counterparts after i × s iterations. Notice that when we plot ||r_i|| through iterations for the standard scheme as a grey line and for the s-step scheme with s = 1 as blue circles, s = 2 as green stars and s = 4 as orange diamonds, these markers lie on the same curve and coincide in strides of s iterations, as can be seen in Fig. 1 and 2. The only difference might happen in the last iteration, where standard schemes can stop at an arbitrary iteration number, while s-step schemes must proceed to the next multiple of s. This indicates that both unpreconditioned and preconditioned s-step CG and BiCGStab indeed produce the same solution x_i as the standard iterative methods.
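A sketch of this experimental setup in Python (the Matrix Market file name is a placeholder for a locally downloaded UFSMC matrix, and SciPy's CG stands in for the reference MATLAB implementation):

```python
# A sketch of the experimental setup: RHS f = A e with e = (1,...,1)^T, zero
# initial guess, and the relative residual recorded at every iteration.
import numpy as np
import scipy.io
import scipy.sparse.linalg as spla

A = scipy.io.mmread("af_shell3.mtx").tocsr()      # hypothetical local copy of a UFSMC matrix
n = A.shape[0]
f = A @ np.ones(n)                                # RHS f = A e
x0 = np.zeros(n)

history = []
def record(xk):                                   # callback invoked once per iteration
    history.append(np.linalg.norm(f - A @ xk) / np.linalg.norm(f))

x, info = spla.cg(A, f, x0=x0, maxiter=40, callback=record)
print(history[-1])                                # relative residual after at most 40 iterations
```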

[Figure 1: Plot of convergence of the standard and s-step CG for s = 1, 2, 4 on the af_shell3 matrix. (a) Conjugate Gradient (CG); (b) Preconditioned CG.]

[Figure 2: Plot of convergence of the standard and s-step BiCGStab for s = 1, 2, 4 on the atmosmodd matrix. (a) BiConjugate Gradient Stabilized (BiCGStab); (b) Preconditioned BiCGStab.]

Second, let us make an important observation about the numerical stability of s-step methods. As we have mentioned earlier, the vectors in the monomial basis of the Krylov subspace used in s-step methods can become linearly dependent relatively quickly. In fact, if we plot ||r_i|| through iterations for s = 8 as red squares, we can easily identify several occasions where these markers do not coincide with the grey line used for the standard scheme, see Fig. 3 and 4. Notice that in the case of s-step CG in Fig. 3 (a) the s-step method diverged, while in the case of s-step BiCGStab in Fig. 4 (a) the method exited early, because the intermediate value ||s_k|| fell below a specified threshold. This behavior is in line with the observations made in [6] that for s > 5 the s-step CG may suffer from loss of orthogonality, and that the preconditioned s-step CG would likely fare better than its unpreconditioned version.

Third, let us look at the convergence of the preconditioned s-step iterative methods for different stopping criteria. In particular, we are interested in what happens when we ask the algorithm to achieve a tighter tolerance ||r_i||/||r_0|| < 10^{-7} and allow a larger maximum number of iterations. Notice that the residuals computed by the s-step methods always start by following the ones computed by the standard algorithms, see Fig. 5 and 6. In some cases, the s-step methods match their standard counterparts exactly until the solution is found, as shown for the offshore matrix in Fig. 5 (a). In other cases, they follow the residuals approximately, while still finding the solution towards the end, as shown for the atmosmodd matrix in Fig. 5 (b).

[Figure 3: Plot of convergence of the standard and s-step CG for s = 8 on the af_shell3 matrix. (a) Conjugate Gradient (CG); (b) Preconditioned CG.]

[Figure 4: Plot of convergence of the standard and s-step BiCGStab for s = 8 on the atmosmodd matrix. (a) BiConjugate Gradient Stabilized (BiCGStab); (b) Preconditioned BiCGStab.]

However, there are also cases where the residuals computed by the standard and s-step methods initially coincide, but the latter diverge from the solution towards the end, as shown for the matrices af_shell3 and apache2 in Fig. 6. These empirical results lead us to conclude that s-step methods are often better suited for linear systems where we are looking for an approximate solution with a relatively relaxed tolerance, such as 10^{-2}, and expect to find it in a modest number of iterations, for example < 80 iterations. Also, we recall that s ≤ 5 is recommended for most cases.

Finally, reverting to our original stopping criteria stated at the beginning of this section, we plot −log(||r_i||/||r_0||) obtained by the preconditioned standard and s-step methods for all the matrices in Fig. 7. Notice that because we plot the negative of the log, the higher the value on the plot in Fig. 7, the lower the relative residual. Notice that for most matrices the value of the relative residual across all schemes is either the same or slightly better for the s-step schemes, because the latter stop at an iteration number that is a multiple of s, which might be slightly higher than the iteration at which the standard algorithm has stopped. There are also a few cases, such as thermomech_dK corresponding to matrix 9, where the relative residual for the s-step method with s = 8 is worse, due to the loss of orthogonality of the vectors in the monomial basis.

[Figure 5: Plot of convergence of the preconditioned methods for the tighter tolerance ||r_i||/||r_0|| < 10^{-7}. (a) offshore; (b) atmosmodd.]

[Figure 6: Plot of convergence of the preconditioned methods for the tighter tolerance ||r_i||/||r_0|| < 10^{-7}. (a) af_shell3 matrix; (b) apache2 matrix.]

[Figure 7: Plot of −log(||r_i||/||r_0||) obtained by the preconditioned standard and s-step methods (s = 1, 2, 4, 8) for all test matrices; higher is better.]

The detailed results of the numerical experiments are shown in Tab. 2 and 3.

[Table 2: Results for the standard and s-step (s = 1, 2, 4, 8) unpreconditioned CG and BiCGStab methods: number of iterations and relative residual ||r_i||/||r_0|| for each test matrix.]

[Table 3: Results for the standard and s-step (s = 1, 2, 4, 8) preconditioned CG and BiCGStab methods: number of iterations and relative residual ||r_i||/||r_0|| for each test matrix.]

5 Conclusion

In this paper we gave an overview of the existing s-step CG iterative method and developed a novel s-step BiCGStab iterative method. We have also introduced preconditioning into both s-step algorithms, limiting the extra work performed for every s iterations to one extra matrix-vector multiplication and one preconditioner solve. We have discussed the advantages of these methods, such as the use of the matrix-power kernel and block dot-products, as well as their disadvantages, such as the loss of orthogonality of the monomial Krylov subspace basis and the extra work performed per iteration. Also, we have pointed out the connection between s-step and communication-avoiding algorithms. Finally, we performed numerical experiments to validate the theory behind the s-step methods, and will look at their performance in the future.

These algorithms can potentially be a better alternative to their standard counterparts when a modest tolerance and number of iterations are required to solve a linear system on parallel platforms. Also, they could be used for a larger class of problems if the numerical stability issues related to the monomial basis are resolved.

6 Acknowledgements

The author would like to acknowledge Erik Boman and Michael Garland for their useful comments and suggestions.

References

[1] R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine and H. van der Vorst, Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, SIAM, Philadelphia, PA, 1994.

[2] E. Carson, N. Knight and J. Demmel, Avoiding Communication in Nonsymmetric Lanczos-Based Krylov Subspace Methods, SIAM J. Sci. Comput., Vol. 35, pp. 42-61, 2012.

[3] E. Carson, Communication-Avoiding Krylov Subspace Methods in Theory and Practice, Ph.D. Thesis, University of California, Berkeley, 2015.

[4] A. T. Chronopoulos, A Class of Parallel Iterative Methods on Multiprocessors, Ph.D. Thesis, University of Illinois at Urbana-Champaign, 1987.

[5] A. T. Chronopoulos, S-Step Iterative Methods for (Non)Symmetric (In)Definite Linear Systems, SIAM Journal on Numerical Analysis, Vol. 28, 1991.

[6] A. T. Chronopoulos and C. W. Gear, S-Step Iterative Methods for Symmetric Linear Systems, Journal of Computational and Applied Mathematics, Vol. 25, 1989.

[7] A. T. Chronopoulos and C. D. Swanson, Parallel Iterative S-step Methods for Unsymmetric Linear Systems, Parallel Computing, Vol. 22, 1996.

[8] A. T. Chronopoulos and S. K. Kim, S-Step Orthomin and GMRES Implemented on Parallel Computers, Technical Report UMSI 90/43R, University of Minnesota Supercomputing Institute, 1990.

[9] A. T. Chronopoulos and S. K. Kim, S-step Lanczos and Arnoldi Methods on Parallel Computers, Technical Report UMSI 90/4R, University of Minnesota Supercomputing Institute, Minneapolis, 1990.

[10] M. M. Dehnavi, J. Demmel, D. Giannacopoulos and Y. El-Kurdi, Communication-Avoiding Krylov Techniques on Graphics Processing Units, IEEE Trans. on Magnetics, Vol. 49, 2013.

[11] A. El Guennouni, K. Jbilou and H. Sadok, A Block Version of BiCGStab for Linear Systems with Multiple Right-Hand Sides, ETNA, Vol. 16, 2003.

[12] M. Hoemmen, Communication-Avoiding Krylov Subspace Methods, Ph.D. Thesis, University of California, Berkeley, 2010.

[13] M. Naumov, Parallel Incomplete-LU and Cholesky Factorization in the Preconditioned Iterative Methods on the GPU, NVIDIA Technical Report, 2012.

[14] M. Naumov, Preconditioned Block-Iterative Methods on GPUs, Proc. International Association of Applied Mathematics and Mechanics, PAMM, Vol. 12, pp. 11-14, 2012.

[15] D. P. O'Leary, The Block Conjugate Gradient Algorithm and Related Methods, Linear Algebra Appl., Vol. 29, 1980.

[16] J. van Rosendale, Minimizing Inner Product Data Dependence in Conjugate Gradient Iteration, Technical Report 172178, NASA Langley Research Center, 1983.

[17] Y. Saad, Iterative Methods for Sparse Linear Systems, SIAM, Philadelphia, PA, 2nd Ed., 2003.

[18] I. Yamazaki, S. Rajamanickam, E. G. Boman, M. Hoemmen, M. A. Heroux and S. Tomov, Domain Decomposition Preconditioners for Communication-Avoiding Krylov Methods on a Hybrid CPU/GPU Cluster, Proc. International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2014.

[19] The University of Florida Sparse Matrix Collection (UFSMC).


More information

CME342 Parallel Methods in Numerical Analysis. Matrix Computation: Iterative Methods II. Sparse Matrix-vector Multiplication.

CME342 Parallel Methods in Numerical Analysis. Matrix Computation: Iterative Methods II. Sparse Matrix-vector Multiplication. CME342 Parallel Methods in Numerical Analysis Matrix Computation: Iterative Methods II Outline: CG & its parallelization. Sparse Matrix-vector Multiplication. 1 Basic iterative methods: Ax = b r = b Ax

More information

Solving Sparse Linear Systems: Iterative methods

Solving Sparse Linear Systems: Iterative methods Scientific Computing with Case Studies SIAM Press, 2009 http://www.cs.umd.edu/users/oleary/sccs Lecture Notes for Unit VII Sparse Matrix Computations Part 2: Iterative Methods Dianne P. O Leary c 2008,2010

More information

Solving Sparse Linear Systems: Iterative methods

Solving Sparse Linear Systems: Iterative methods Scientific Computing with Case Studies SIAM Press, 2009 http://www.cs.umd.edu/users/oleary/sccswebpage Lecture Notes for Unit VII Sparse Matrix Computations Part 2: Iterative Methods Dianne P. O Leary

More information

Key words. linear equations, polynomial preconditioning, nonsymmetric Lanczos, BiCGStab, IDR

Key words. linear equations, polynomial preconditioning, nonsymmetric Lanczos, BiCGStab, IDR POLYNOMIAL PRECONDITIONED BICGSTAB AND IDR JENNIFER A. LOE AND RONALD B. MORGAN Abstract. Polynomial preconditioning is applied to the nonsymmetric Lanczos methods BiCGStab and IDR for solving large nonsymmetric

More information

Preface to the Second Edition. Preface to the First Edition

Preface to the Second Edition. Preface to the First Edition n page v Preface to the Second Edition Preface to the First Edition xiii xvii 1 Background in Linear Algebra 1 1.1 Matrices................................. 1 1.2 Square Matrices and Eigenvalues....................

More information

Communication-avoiding Krylov subspace methods

Communication-avoiding Krylov subspace methods Motivation Communication-avoiding Krylov subspace methods Mark mhoemmen@cs.berkeley.edu University of California Berkeley EECS MS Numerical Libraries Group visit: 28 April 2008 Overview Motivation Current

More information

The Conjugate Gradient Method

The Conjugate Gradient Method The Conjugate Gradient Method Classical Iterations We have a problem, We assume that the matrix comes from a discretization of a PDE. The best and most popular model problem is, The matrix will be as large

More information

Lecture Note 7: Iterative methods for solving linear systems. Xiaoqun Zhang Shanghai Jiao Tong University

Lecture Note 7: Iterative methods for solving linear systems. Xiaoqun Zhang Shanghai Jiao Tong University Lecture Note 7: Iterative methods for solving linear systems Xiaoqun Zhang Shanghai Jiao Tong University Last updated: December 24, 2014 1.1 Review on linear algebra Norms of vectors and matrices vector

More information

Course Notes: Week 1

Course Notes: Week 1 Course Notes: Week 1 Math 270C: Applied Numerical Linear Algebra 1 Lecture 1: Introduction (3/28/11) We will focus on iterative methods for solving linear systems of equations (and some discussion of eigenvalues

More information

Tikhonov Regularization of Large Symmetric Problems

Tikhonov Regularization of Large Symmetric Problems NUMERICAL LINEAR ALGEBRA WITH APPLICATIONS Numer. Linear Algebra Appl. 2000; 00:1 11 [Version: 2000/03/22 v1.0] Tihonov Regularization of Large Symmetric Problems D. Calvetti 1, L. Reichel 2 and A. Shuibi

More information

Preconditioning Techniques Analysis for CG Method

Preconditioning Techniques Analysis for CG Method Preconditioning Techniques Analysis for CG Method Huaguang Song Department of Computer Science University of California, Davis hso@ucdavis.edu Abstract Matrix computation issue for solve linear system

More information

the method of steepest descent

the method of steepest descent MATH 3511 Spring 2018 the method of steepest descent http://www.phys.uconn.edu/ rozman/courses/m3511_18s/ Last modified: February 6, 2018 Abstract The Steepest Descent is an iterative method for solving

More information

Solving Symmetric Indefinite Systems with Symmetric Positive Definite Preconditioners

Solving Symmetric Indefinite Systems with Symmetric Positive Definite Preconditioners Solving Symmetric Indefinite Systems with Symmetric Positive Definite Preconditioners Eugene Vecharynski 1 Andrew Knyazev 2 1 Department of Computer Science and Engineering University of Minnesota 2 Department

More information

Jae Heon Yun and Yu Du Han

Jae Heon Yun and Yu Du Han Bull. Korean Math. Soc. 39 (2002), No. 3, pp. 495 509 MODIFIED INCOMPLETE CHOLESKY FACTORIZATION PRECONDITIONERS FOR A SYMMETRIC POSITIVE DEFINITE MATRIX Jae Heon Yun and Yu Du Han Abstract. We propose

More information

Conjugate Gradient Method

Conjugate Gradient Method Conjugate Gradient Method direct and indirect methods positive definite linear systems Krylov sequence spectral analysis of Krylov sequence preconditioning Prof. S. Boyd, EE364b, Stanford University Three

More information

Numerical Methods - Numerical Linear Algebra

Numerical Methods - Numerical Linear Algebra Numerical Methods - Numerical Linear Algebra Y. K. Goh Universiti Tunku Abdul Rahman 2013 Y. K. Goh (UTAR) Numerical Methods - Numerical Linear Algebra I 2013 1 / 62 Outline 1 Motivation 2 Solving Linear

More information

Chapter 7. Iterative methods for large sparse linear systems. 7.1 Sparse matrix algebra. Large sparse matrices

Chapter 7. Iterative methods for large sparse linear systems. 7.1 Sparse matrix algebra. Large sparse matrices Chapter 7 Iterative methods for large sparse linear systems In this chapter we revisit the problem of solving linear systems of equations, but now in the context of large sparse systems. The price to pay

More information

Linear algebra issues in Interior Point methods for bound-constrained least-squares problems

Linear algebra issues in Interior Point methods for bound-constrained least-squares problems Linear algebra issues in Interior Point methods for bound-constrained least-squares problems Stefania Bellavia Dipartimento di Energetica S. Stecco Università degli Studi di Firenze Joint work with Jacek

More information

Fine-Grained Parallel Algorithms for Incomplete Factorization Preconditioning

Fine-Grained Parallel Algorithms for Incomplete Factorization Preconditioning Fine-Grained Parallel Algorithms for Incomplete Factorization Preconditioning Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology, USA SPPEXA Symposium TU München,

More information

Incomplete LU Preconditioning and Error Compensation Strategies for Sparse Matrices

Incomplete LU Preconditioning and Error Compensation Strategies for Sparse Matrices Incomplete LU Preconditioning and Error Compensation Strategies for Sparse Matrices Eun-Joo Lee Department of Computer Science, East Stroudsburg University of Pennsylvania, 327 Science and Technology Center,

More information

Preliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012

Preliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012 Instructions Preliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012 The exam consists of four problems, each having multiple parts. You should attempt to solve all four problems. 1.

More information

Scientific Computing with Case Studies SIAM Press, Lecture Notes for Unit VII Sparse Matrix

Scientific Computing with Case Studies SIAM Press, Lecture Notes for Unit VII Sparse Matrix Scientific Computing with Case Studies SIAM Press, 2009 http://www.cs.umd.edu/users/oleary/sccswebpage Lecture Notes for Unit VII Sparse Matrix Computations Part 1: Direct Methods Dianne P. O Leary c 2008

More information

Computational Linear Algebra

Computational Linear Algebra Computational Linear Algebra PD Dr. rer. nat. habil. Ralf-Peter Mundani Computation in Engineering / BGU Scientific Computing in Computer Science / INF Winter Term 2018/19 Part 4: Iterative Methods PD

More information

High performance Krylov subspace method variants

High performance Krylov subspace method variants High performance Krylov subspace method variants and their behavior in finite precision Erin Carson New York University HPCSE17, May 24, 2017 Collaborators Miroslav Rozložník Institute of Computer Science,

More information

Comparison of Fixed Point Methods and Krylov Subspace Methods Solving Convection-Diffusion Equations

Comparison of Fixed Point Methods and Krylov Subspace Methods Solving Convection-Diffusion Equations American Journal of Computational Mathematics, 5, 5, 3-6 Published Online June 5 in SciRes. http://www.scirp.org/journal/ajcm http://dx.doi.org/.436/ajcm.5.5 Comparison of Fixed Point Methods and Krylov

More information

Lecture 17 Methods for System of Linear Equations: Part 2. Songting Luo. Department of Mathematics Iowa State University

Lecture 17 Methods for System of Linear Equations: Part 2. Songting Luo. Department of Mathematics Iowa State University Lecture 17 Methods for System of Linear Equations: Part 2 Songting Luo Department of Mathematics Iowa State University MATH 481 Numerical Methods for Differential Equations Songting Luo ( Department of

More information

ON THE GLOBAL KRYLOV SUBSPACE METHODS FOR SOLVING GENERAL COUPLED MATRIX EQUATIONS

ON THE GLOBAL KRYLOV SUBSPACE METHODS FOR SOLVING GENERAL COUPLED MATRIX EQUATIONS ON THE GLOBAL KRYLOV SUBSPACE METHODS FOR SOLVING GENERAL COUPLED MATRIX EQUATIONS Fatemeh Panjeh Ali Beik and Davod Khojasteh Salkuyeh, Department of Mathematics, Vali-e-Asr University of Rafsanjan, Rafsanjan,

More information

APPLIED NUMERICAL LINEAR ALGEBRA

APPLIED NUMERICAL LINEAR ALGEBRA APPLIED NUMERICAL LINEAR ALGEBRA James W. Demmel University of California Berkeley, California Society for Industrial and Applied Mathematics Philadelphia Contents Preface 1 Introduction 1 1.1 Basic Notation

More information

The quadratic eigenvalue problem (QEP) is to find scalars λ and nonzero vectors u satisfying

The quadratic eigenvalue problem (QEP) is to find scalars λ and nonzero vectors u satisfying I.2 Quadratic Eigenvalue Problems 1 Introduction The quadratic eigenvalue problem QEP is to find scalars λ and nonzero vectors u satisfying where Qλx = 0, 1.1 Qλ = λ 2 M + λd + K, M, D and K are given

More information

EXASCALE COMPUTING. Hiding global synchronization latency in the preconditioned Conjugate Gradient algorithm. P. Ghysels W.

EXASCALE COMPUTING. Hiding global synchronization latency in the preconditioned Conjugate Gradient algorithm. P. Ghysels W. ExaScience Lab Intel Labs Europe EXASCALE COMPUTING Hiding global synchronization latency in the preconditioned Conjugate Gradient algorithm P. Ghysels W. Vanroose Report 12.2012.1, December 2012 This

More information

A Method for Constructing Diagonally Dominant Preconditioners based on Jacobi Rotations

A Method for Constructing Diagonally Dominant Preconditioners based on Jacobi Rotations A Method for Constructing Diagonally Dominant Preconditioners based on Jacobi Rotations Jin Yun Yuan Plamen Y. Yalamov Abstract A method is presented to make a given matrix strictly diagonally dominant

More information

In order to solve the linear system KL M N when K is nonsymmetric, we can solve the equivalent system

In order to solve the linear system KL M N when K is nonsymmetric, we can solve the equivalent system !"#$% "&!#' (%)!#" *# %)%(! #! %)!#" +, %"!"#$ %*&%! $#&*! *# %)%! -. -/ 0 -. 12 "**3! * $!#%+,!2!#% 44" #% &#33 # 4"!#" "%! "5"#!!#6 -. - #% " 7% "3#!#3! - + 87&2! * $!#% 44" ) 3( $! # % %#!!#%+ 9332!

More information

Preconditioned inverse iteration and shift-invert Arnoldi method

Preconditioned inverse iteration and shift-invert Arnoldi method Preconditioned inverse iteration and shift-invert Arnoldi method Melina Freitag Department of Mathematical Sciences University of Bath CSC Seminar Max-Planck-Institute for Dynamics of Complex Technical

More information

Fast iterative solvers

Fast iterative solvers Utrecht, 15 november 2017 Fast iterative solvers Gerard Sleijpen Department of Mathematics http://www.staff.science.uu.nl/ sleij101/ Review. To simplify, take x 0 = 0, assume b 2 = 1. Solving Ax = b for

More information

A Block Compression Algorithm for Computing Preconditioners

A Block Compression Algorithm for Computing Preconditioners A Block Compression Algorithm for Computing Preconditioners J Cerdán, J Marín, and J Mas Abstract To implement efficiently algorithms for the solution of large systems of linear equations in modern computer

More information

Block Krylov Space Solvers: a Survey

Block Krylov Space Solvers: a Survey Seminar for Applied Mathematics ETH Zurich Nagoya University 8 Dec. 2005 Partly joint work with Thomas Schmelzer, Oxford University Systems with multiple RHSs Given is a nonsingular linear system with

More information

Lecture 17: Iterative Methods and Sparse Linear Algebra

Lecture 17: Iterative Methods and Sparse Linear Algebra Lecture 17: Iterative Methods and Sparse Linear Algebra David Bindel 25 Mar 2014 Logistics HW 3 extended to Wednesday after break HW 4 should come out Monday after break Still need project description

More information