(Technical Report 832, Mathematical Institute, University of Utrecht, October 1993) E. de Sturler. Delft University of Technology.
Reducing the effect of global communication in GMRES(m) and CG on Parallel Distributed Memory Computers
(Technical Report 832, Mathematical Institute, University of Utrecht, October 1993)

E. de Sturler, Faculty of Technical Mathematics and Informatics, Delft University of Technology, Mekelweg 4, Delft, The Netherlands

and

H. A. van der Vorst, Mathematical Institute, Utrecht University, Budapestlaan 6, Utrecht, The Netherlands

Abstract. In this paper we study possibilities to reduce the communication overhead introduced by inner products in the iterative solution methods CG and GMRES(m). The performance of these methods on parallel distributed memory machines is often limited by the global communication required for the inner products. We investigate two ways of improvement. One is to assemble the results of a number of local inner products of a processor and to accumulate them collectively. The other is to try to overlap communication with computation. The matrix vector products may also introduce some communication overhead, but for many relevant problems this involves communication with a few nearby processors only, and this does not necessarily degrade the performance of the algorithm.

Key words. parallel computing, distributed memory computers, conjugate gradient methods, performance, GMRES, modified Gram-Schmidt.

AMS(MOS) subject classification. 65Y05, 65F10, 65Y20.

1 Introduction

The Conjugate Gradients (CG) method [9] and the GMRES(m) method [11] are widely used methods for the iterative solution of specific classes of linear systems. The time-consuming kernels in these methods are: inner products, vector updates, and matrix vector products (including preconditioning operations). In many situations, especially when the matrix operations are well-structured, these operations are suited for implementation on vector computers and shared memory parallel computers [8].

The first author wishes to acknowledge Shell Research B.V. and STIPT for the financial support of his research.
For parallel distributed memory machines the picture is entirely different. In general the vectors are distributed over the processors, so that even when the matrix operations can be implemented efficiently by parallel operations, we cannot avoid the global communication required for inner product computations. These global communication costs become relatively more and more important as the number of parallel processors increases, and thus they have the potential to affect the scalability of the algorithms in a very negative way [5]. This aspect has received much attention, and several approaches have been suggested to improve the performance of these algorithms. For CG the approaches come down to reformulating the orthogonalization part of the algorithm, so that the required inner products can be computed in the same phase of the iteration step (see, e.g., [4, 10]), or to combining the orthogonalization for several successive iteration steps, as in the s-step methods [2]. The numerical stability of these approaches is a major point of concern. For GMRES(m) the approach comes down to some variant of the s-step methods [1, 3]. After basis vectors for part of the Krylov subspace have been generated by some suitable recurrence relation, they have to be orthogonalized. One often resorts to cheap but potentially unstable methods like Gram-Schmidt orthogonalization. In the present study we investigate other ways to reduce the global communication overhead due to the inner products. Our approach is to identify operations that may be executed while communication takes place, since our aim is to overlap communication with computation. For CG this is done by rescheduling the operations, without changing the numerical stability of the method [7]. For GMRES(m) it is achieved by reformulating the modified Gram-Schmidt orthogonalization step [5, 6].
For GMRES(m) we also exploit the possibility of packing the results of the local inner products of a processor into one message and accumulating them collectively. We believe that our findings are relevant for other Krylov subspace methods as well, since methods like BiCG and its variants CGS, BiCGSTAB, and QMR have much in common with CG from the implementation point of view. Likewise, the communication problems with GMRES(m) are representative of the problems in methods like ORTHODIR, GENCG, FOM, and ORTHOMIN. We have carried out our experiments on a 400-processor Parsytec Supercluster at the Koninklijke/Shell-Laboratorium in Amsterdam. The processors are connected in a fixed 20×20 mesh, of which arbitrary submeshes can be used. Each processor is a T800-20 transputer. The transputer supports only nearest neighbor synchronous communication; more complicated communication has to be programmed explicitly. The communication rate is fast compared to the flop rate, but by current standards the T800 is a slow processor. Another feature of the transputer is the support of time-shared execution of multiple `parallel' CPU-processes on a single processor, which facilitates the implementation of programs that switch between tasks when necessary (on an interrupt basis), e.g., between communication and computation. Finally, transputers have the possibility of concurrent communication and computation. As a result it is possible to overlap computation and communication on a single processor. The program that runs on each processor consists of two processes that run time-shared: a computation process and a communication process. The computation process functions as the master. If at some point communication is necessary, the computation process sends the data to the communication process (on the same processor) or requests the data from the communication process, which then handles the actual communication.
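The interplay between these two processes can be sketched in ordinary Python, with threads standing in for the transputer's time-shared CPU-processes; all names and message contents below are illustrative, not transputer or Parsytec code.

```python
import threading
import queue

# A sketch of the two-process organization described above: the computation
# process is the master and hands communication requests to a communication
# process running concurrently on the same processor.

def communication_process(requests, replies):
    # Handles the actual 'network' traffic so computation never blocks on it.
    while True:
        req = requests.get()
        if req is None:              # shutdown signal from the master
            break
        kind, payload = req
        if kind == "send":
            pass                     # here: a synchronous nearest-neighbor send
        elif kind == "recv":
            replies.put(payload)     # here: the message actually received

def computation_process():
    requests, replies = queue.Queue(), queue.Queue()
    comm = threading.Thread(target=communication_process,
                            args=(requests, replies))
    comm.start()
    requests.put(("send", [1.0, 2.0]))          # hand off data, keep computing
    partial = sum(x * x for x in range(1000))   # overlapped local work
    requests.put(("recv", [3.0, 4.0]))          # stand-in for a real receive
    received = replies.get()                    # block only when result needed
    requests.put(None)
    comm.join()
    return partial, received

partial, received = computation_process()
```

The master blocks only at the moment it actually needs the received data, which is what makes the asynchronous behaviour described next possible.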
This organization permits the computation processes on different processors to work asynchronously even though the actual communication is synchronous. The communication process is given the higher priority, so that
if there is something to communicate, this is started as soon as possible.

2 The algorithms for GMRES(m) and CG

Preconditioned CG:

    start:   x_0 = initial guess; r_0 = b − A x_0;
             p_{−1} = 0; β_{−1} = 0;
             solve for w_0 in K w_0 = r_0;
             ρ_0 = (r_0, w_0);
    iterate: for i = 0, 1, 2, ... do
                 p_i = w_i + β_{i−1} p_{i−1};
                 q_i = A p_i;
                 α_i = ρ_i / (p_i, q_i);
                 x_{i+1} = x_i + α_i p_i;
                 r_{i+1} = r_i − α_i q_i;
                 compute ‖r_{i+1}‖; if accurate enough then quit;
                 solve for w_{i+1} in K w_{i+1} = r_{i+1};
                 ρ_{i+1} = (r_{i+1}, w_{i+1});
                 β_i = ρ_{i+1} / ρ_i;
             end

Figure 1: The preconditioned CG algorithm

GMRES(m):

    start:   x_0 = initial guess; r_0 = b − A x_0;
             v_1 = r_0 / ‖r_0‖;
    iterate: for j = 1, ..., m do
                 v̂_{j+1} = A v_j;
                 for i = 1, ..., j do
                     h_{i,j} = (v̂_{j+1}, v_i);
                     v̂_{j+1} = v̂_{j+1} − h_{i,j} v_i;
                 end
                 h_{j+1,j} = ‖v̂_{j+1}‖;
                 v_{j+1} = v̂_{j+1} / h_{j+1,j};
             end
             form the approximate solution:
             x_m = x_0 + V_m y_m, where y_m minimizes ‖ ‖r_0‖ e_1 − H̄_m y ‖, y ∈ IR^m;
    restart: compute r_m = b − A x_m;
             if satisfied then stop,
             else x_0 = x_m; v_1 = r_m / ‖r_m‖; goto iterate.

Figure 2: The GMRES(m) algorithm

In this section we will discuss the time-consuming kernels in CG and GMRES(m): the vector update (daxpy), the preconditioner, the matrix vector product, and the inner product (ddot); see Figures 1 and 2. Because the results of the inner products are needed on all processors, the Hessenberg matrix H̄_m (see Figure 2) is available on each processor. Hence, the computation of y_m can be done on each processor. This is often efficient, because it would be a synchronization point if implemented on a single processor, and then the other processors would have to wait for the result. However, if the size of the reduced system is large compared to the local number of unknowns, the computation might be expensive enough to make distribution and parallel solution worthwhile. We have not pursued this idea. The parallel implementation of the vector update (daxpy) poses no problem, since it involves only local computation.
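As a concrete reference for these kernels, here is a minimal pure-Python sketch of the preconditioned CG iteration of Figure 1; the dense matvec and the identity preconditioner in the example are illustrative only, not the distributed implementation discussed in this paper.

```python
# A sketch of preconditioned CG (Figure 1); dot, axpy and matvec below are
# exactly the ddot, daxpy and matrix vector kernels discussed in the text.

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def axpy(a, u, v):                       # returns a*u + v
    return [a * ui + vi for ui, vi in zip(u, v)]

def matvec(A, u):
    return [dot(row, u) for row in A]

def pcg(A, b, solve_K, x0, tol=1e-12, maxit=100):
    x = x0[:]
    r = axpy(-1.0, matvec(A, x), b)      # r_0 = b - A x_0
    w = solve_K(r)                       # K w_0 = r_0
    rho = dot(r, w)                      # rho_0 = (r_0, w_0)
    p = [0.0] * len(b)                   # p_{-1} = 0
    beta = 0.0                           # beta_{-1} = 0
    for _ in range(maxit):
        p = axpy(beta, p, w)             # p_i = w_i + beta_{i-1} p_{i-1}
        q = matvec(A, p)
        alpha = rho / dot(p, q)
        x = axpy(alpha, p, x)
        r = axpy(-alpha, q, r)
        if dot(r, r) ** 0.5 < tol:       # compute ||r||
            break
        w = solve_K(r)                   # K w_{i+1} = r_{i+1}
        rho, rho_old = dot(r, w), rho
        beta = rho / rho_old             # beta_i = rho_{i+1} / rho_i
    return x

# usage on a small SPD system, with the identity as (trivial) preconditioner
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = pcg(A, b, solve_K=lambda r: r[:], x0=[0.0, 0.0])
```

Note that each iteration contains three inner products (the two in the coefficients and the norm), which is the count used in the performance model later on.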
In this paper we restrict ourselves to problems for which the parallelism in the matrix vector product does not pose serious problems. That is, our model problems have a strong data locality, which is typical for many finite difference and finite element problems. A suitable
domain decomposition approach preserves this locality more or less independently of the number of processors, so that the matrix vector product requires only neighbor-neighbor communication or communication with only a few nearby processors. This could be overlapped with computations for the interior of the domain, but it is relatively less important, since the number of boundary operations is in general an order of magnitude smaller than the number of interior operations (this is the surface-to-volume effect). The communication overhead introduced by the preconditioner is obviously strongly dependent on the selected preconditioner. Popular preconditioners on sequential computers, like the (M)ILU variants, are highly sequential or introduce irregular communication patterns (as in the hyperplane approach, see [8]), and therefore these are not suitable. Obviously we prefer preconditioners which require only a limited amount of communication, for instance comparable to or less than that of the matrix vector product. On the other hand we would like to retain the iteration-reducing effect of the preconditioners, and these considerations are often in conflict. In our study we have avoided discussing the convergence-accelerating effects of the preconditioner, and we have used a simple Incomplete Block Jacobi preconditioner with blocks corresponding to the domains. In this case we have no communication at all for the preconditioner. Since the vectors are distributed over the processor grid, the inner product (ddot) is computed in two steps. All processors start by computing the local inner product in parallel. After that, the local inner products are accumulated on one `central' processor and broadcast. We will describe the implementation in a little more detail for a 2-dimensional mesh of processors, see Figure 3.

Figure 3: accumulation over the processor grid (no outgoing step can be taken before all incoming steps have taken place)
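The two-step inner product can be sketched as a small simulation; the grid layout and the step counting below are illustrative, assuming a fan-in along grid lines in the x-direction followed by one in the y-direction, as described for Figure 3.

```python
# A simulation (not message-passing code) of the two-phase accumulation of
# local inner products over a p x p processor grid. We count nearest-neighbor
# hops on the longest path to show the time grows with the grid diameter.

def accumulate(grid):
    """grid[y][x] holds the local inner product of processor (x, y)."""
    p = len(grid)
    cx = p // 2                              # 'accumulation' column
    steps = 0
    # phase 1: fan-in along each x-line onto column cx
    col = []
    for y in range(p):
        col.append(sum(grid[y]))
        steps = max(steps, cx, p - 1 - cx)   # hops from the farthest end
    # phase 2: fan-in along the accumulation column in the y-direction
    cy = p // 2
    total = sum(col)
    steps += max(cy, p - 1 - cy)
    return total, steps                      # the broadcast reverses the path

grid = [[1.0] * 4 for _ in range(4)]         # 4x4 grid, each local result 1.0
total, steps = accumulate(grid)
```

The hop count is on the order of the grid diameter, which is the scalability concern analyzed below.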
The processors on each processor line in the x-direction accumulate their results along this line on an `accumulation' processor at the same position on each line: each processor waits for the result from its neighbor further from the accumulation processor, adds this result to its own partial result, and sends the new result along. Then the `accumulation' processors perform a similar accumulation in the y-direction. The broadcast consists of the reverse process. Each processor is active in only a limited number of steps and will be idle for the rest of the time. So there are opportunities to make it available for other tasks. The communication time of an accumulation or a broadcast is of the order of the diameter of the processor grid. This means that for an increasing number of processors the communication time for the inner products increases as well, and hence this is a potential threat to the scalability
of the method. Indeed, if the global communication for the inner products is not overlapped, it often becomes a bottleneck on large processor grids, as will be shown later. In [5] a simple performance model based on these considerations is introduced, which clearly shows quantitatively the dramatic influence of the global communication for inner products over large processor grids on the performance of Krylov subspace methods. That model also shows that the degradation of performance depends on the relative costs of local computation and global communication. This means that results analogous to those presented in Sections 5 and 7 will be seen for larger problems on processor configurations with relatively faster computational speed (this is the current trend in parallel computers). Moreover, if the problem size increases proportionally to the number of processors, the local computation time remains the same but the global communication cost increases. This emphasizes the necessity of reducing the effect of the global communication costs.

3 Parallel performance of GMRES(m) and CG

We will now briefly describe a model for the computation time, the communication cost, and the communication time of the main kernels in Krylov subspace methods. We use the term communication cost to indicate the wall clock time spent in communication that is not overlapped with useful computation (so that it really contributes to the wall clock time). The term communication time is used to indicate the wall clock time of the whole communication. In the case of nonoverlapped communication, the communication time and the communication cost are the same. Our quantitative formulas are not meant to give very accurate predictions of the exact execution times, but they will be used to identify the bottlenecks and to evaluate improvements. Several of the parameters that we introduce may vary over the processor grid.
In that case the value to use is either a maximum or an average, whichever is the most appropriate.

Computation time. We will only be concerned with the local computation time, since the cost of communication and synchronization is modeled explicitly. The computation time for the solution of the Hessenberg system is neglected in our model. For a vector update (daxpy) or an inner product (ddot) the computation time is given by 2 t_fl N/P, where N/P is the local number of unknowns of a processor and t_fl is the average time for a double precision floating point operation. The computation time for the (sparse) matrix vector product is given by (2n_z − 1) t_fl N/P, where n_z is the average number of non-zero elements per row of the matrix. As preconditioner we chose Block-(M)ILU variants without fill-in, of the form L D^{−1} U for GMRES(m) and L L^T for CG. For CG we have scaled the system so that diag(L) = I. The computation time of the preconditioner for GMRES(m) is (2n_z + 1) t_fl N/P, and for CG it is (2n_z − 1) t_fl N/P. A full GMRES(m) cycle has approximately ½(m² + 3m) inner products, the same number of vector updates, and (m + 1) multiplications with the matrix and with the preconditioner, if one computes the exact residual at the end of each cycle. The complete (local) computation time for the GMRES(m) algorithm is given by the equation:

    T^gmr_cmp1 = ( 2(m² + 3m) + 4n_z(m + 1) ) (N/P) t_fl.    (1)

A single iteration of CG has three inner products, the same number of vector updates, and one multiplication with the matrix and with the preconditioner. The complete (local) computation time is given by the equation:

    T^cg_cmp = ( 10 + 4n_z ) (N/P) t_fl.    (2)
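The computation-time model (1) and (2) can be written out directly; the parameter values in the example are illustrative, not measured.

```python
# The computation-time formulas (1) and (2) as a small model; t_fl, n_z, N
# and P are machine and problem inputs.

def t_gmres_cmp1(m, n_z, N, P, t_fl):
    # (1): (m^2+3m)/2 ddots and as many daxpys at 2 flops per element,
    # plus (m+1) matvecs (2n_z - 1) and preconditioner solves (2n_z + 1)
    return (2 * (m * m + 3 * m) + 4 * n_z * (m + 1)) * (N / P) * t_fl

def t_cg_cmp(n_z, N, P, t_fl):
    # (2): 3 ddots + 3 daxpys (2 flops each per element), one matvec
    # (2n_z - 1) and one preconditioner solve (2n_z - 1)
    return (10 + 4 * n_z) * (N / P) * t_fl

# example: m = 30, 5-point stencil (n_z = 5), N = 90000 unknowns, P = 100
t_cycle = t_gmres_cmp1(30, 5, 90_000, 100, t_fl=1e-6)
t_iter = t_cg_cmp(5, 90_000, 100, t_fl=1e-6)
```

The example values are only meant to make the orders of magnitude concrete.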
Communication cost. As we mentioned already, the solution of the Hessenberg system in GMRES(m) and the vector update are local and involve no communication cost. The most important communication is for the global inner products. If we do not overlap this global communication, then we are concerned with the wall clock time for the entire, global operation and not with the local part of a single processor. We note that we can view the time for the accumulation and broadcast either as the communication time for the entire operation, or as a small local communication time and a long delay because of global synchronization. In the first interpretation we would consider overlapping the global communication, whereas in the second one we would consider removing the delays by reducing the number of synchronization points. We will take the first point of view. Consider a processor grid with P = p² processors. With p_d = 2⌈p/2⌉ (≈ √P), the maximum distance to the `most central' processor over the processor grid is p_d. Let the communication start-up time be given by t_s and the word (32 bits) transmission time by t_w. The time to communicate one double precision number between two neighboring processors is then (t_s + 3t_w), since a double precision number takes two words and we need a one-word header to accompany each message. Hence, the global accumulation and broadcast of one double precision number takes 2p_d(t_s + 3t_w), and the global accumulation and broadcast of a vector of k double precision numbers takes 2p_d(t_s + (2k + 1)t_w). For GMRES(m) in the nonoverlapped case, the communication time for the modified Gram-Schmidt algorithm (with ½(m² + 3m) accumulations and broadcasts) is

    T^gmr_a+b = (m² + 3m) p_d (t_s + 3t_w),    (3)

where `a+b' indicates the accumulation and broadcast.
For CG in the nonoverlapped case, the communication time of the three inner products per iteration is

    T^cg_a+b = 6 p_d (t_s + 3t_w).    (4)

The communication for the matrix vector product is necessary for the exchange of so-called boundary data: sending boundary data to other processors and receiving boundary data from other processors. Assume that each processor has to send and to receive n_m messages, which each take d steps of nearest neighbor communication from source to destination, and let the number of boundary data elements on a processor be given by n_b. The total number of words that have to be communicated (sent and received) is then 2(2n_b + n_m) per processor. For GMRES(m) the communication time of the (m + 1) matrix vector products is

    T^gmr_bde = 2 d n_m (m + 1) t_s + 2 d (m + 1)(2n_b + n_m) t_w,    (5)

where `bde' refers to the boundary exchange. For CG the communication time of one matrix vector product is

    T^cg_bde = 2 d n_m t_s + 2 d (2n_b + n_m) t_w.    (6)

Note that we have assumed no overlap. For preconditioners that need only boundary exchanges, we could have used the same formulas with a different choice of the parameter values if necessary, but in our experiments we have used only local block preconditioners (without communication).
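The nonoverlapped communication times (3) through (6) can likewise be written as code; the grid and message parameters in the example are illustrative.

```python
# Formulas (3)-(6): p_d ~ sqrt(P) is the grid diameter term, t_s the
# start-up time, t_w the word transmission time, d the number of hops,
# n_m the number of messages, n_b the number of boundary elements.

def t_gmres_acc_bcast(m, p_d, t_s, t_w):
    return (m * m + 3 * m) * p_d * (t_s + 3 * t_w)                 # (3)

def t_cg_acc_bcast(p_d, t_s, t_w):
    return 6 * p_d * (t_s + 3 * t_w)                               # (4)

def t_gmres_bde(m, d, n_m, n_b, t_s, t_w):
    return (2 * d * n_m * (m + 1) * t_s
            + 2 * d * (m + 1) * (2 * n_b + n_m) * t_w)             # (5)

def t_cg_bde(d, n_m, n_b, t_s, t_w):
    return 2 * d * n_m * t_s + 2 * d * (2 * n_b + n_m) * t_w       # (6)

# on a 20x20 grid (p_d = 20) the inner product term (3) dominates (5):
t3 = t_gmres_acc_bcast(30, 20, t_s=1.0, t_w=1.0)
t5 = t_gmres_bde(30, d=1, n_m=4, n_b=100, t_s=1.0, t_w=1.0)
```

For nearest-neighbor boundary exchanges (d = 1) the matvec term stays bounded as P grows, whereas (3) grows with √P, which is the point made in the next section.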
4 Communication overhead reduction in GMRES(m)

From the expressions (1), (3) and (5) we conclude that the communication cost for GMRES(m) is of the order O(m²√P), and for large processor grids this will become a bottleneck. Moreover, in the standard implementation we cannot reduce these costs by accumulating multiple inner products together (saving on start-up times), or overlap this expensive communication with computation (reducing the runtime lost in communication). The problem stems from the fact that the modified Gram-Schmidt orthogonalization of a single vector against some set of vectors, followed by its normalization, is an inherently sequential process. However, if the modified Gram-Schmidt orthogonalization of a set of vectors is considered, there is no such problem, since the orthogonalizations of all intermediate vectors against the previously orthogonalized vectors are independent. Therefore, we can compute several or all of the local inner products first and then accumulate the subresults collectively. Suppose the set of vectors v_1, v̂_2, v̂_3, ..., v̂_{m+1} has to be orthogonalized, where ‖v_1‖ = 1. The modified Gram-Schmidt process can be implemented as sketched in Figure 4.

    for i = 1, ..., m do
        orthogonalize v̂_{i+1}, ..., v̂_{m+1} against v_i
        v_{i+1} = v̂_{i+1} / ‖v̂_{i+1}‖
    end

Figure 4: a block-wise modified Gram-Schmidt orthogonalization

This reduces the number of accumulations to only m, instead of ½(m² + 3m) for the usual implementation of GMRES(m), but the length of the messages has increased. In this way, start-up time is saved by packing the small messages corresponding to one block of orthogonalizations into one larger message. Moreover, we also reduce the amount of data transfer, because we have fewer message headers. Instead of computing all local inner products in one block and accumulating these partial results only once for the whole block, it is preferable to split each step into two blocks of orthogonalizations, since this offers the possibility to overlap with communication.
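The block-wise step of Figure 4 can be sketched in a few lines; the sequential Python below only illustrates the independence of the inner products within one step, it is not the distributed implementation.

```python
# A sketch of block-wise modified Gram-Schmidt (Figure 4): in step i the
# inner products of v_i with all remaining vectors are independent, so on a
# distributed machine their local parts could be packed into one message.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def block_mgs(vs):
    # Orthonormalizes vs[0], ..., vs[m] in place; vs[0] must have unit norm.
    m = len(vs) - 1
    for i in range(m):
        # one block of independent inner products (one collective accumulation)
        h = [dot(vs[i], vs[k]) for k in range(i + 1, m + 1)]
        for k in range(i + 1, m + 1):
            vs[k] = [a - h[k - i - 1] * b for a, b in zip(vs[k], vs[i])]
        nrm = dot(vs[i + 1], vs[i + 1]) ** 0.5
        vs[i + 1] = [a / nrm for a in vs[i + 1]]
    return vs

q = block_mgs([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 1.0, 1.0]])
```

Numerically this performs exactly the same operations as standard modified Gram-Schmidt on the same set of vectors, only grouped per step.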
This overlap is achieved by performing the accumulation and broadcast of the local inner products of the first block concurrently with the computation of the local inner products of the second block, and performing the accumulation and broadcast of the local inner products of the second block concurrently with the vector updates of the first block, see Figure 5. Note that the computation time for this approach is equal to that of the standard modified Gram-Schmidt algorithm. For the parallel `overlapped' implementation of the modified Gram-Schmidt algorithm given in Figure 5, we will neglect potential effects of overlap of the communication with computation on a single processor. We will only consider overlapping the time that a processor is not active in the global accumulation and broadcast with useful computational work. If we assume that sufficient computational work can be done to completely fill this time, the communication cost T^gmr_a+b, see (3), reduces to only the communication time spent locally by a processor. This `local' communication cost for the accumulation and broadcast of a vector of k double precision numbers is given by 4t_s + 4(2k + 1)t_w, for a receive and a send in the accumulation phase and a receive and a send in the broadcast phase, if the processor only participates in the accumulation
along the x-direction, and by 8t_s + 8(2k + 1)t_w if the processor also participates in the accumulation along the y-direction. The latter case is obviously the most important, since all processors finish the modified Gram-Schmidt algorithm more or less at the same time. The communication cost of the entire parallel modified Gram-Schmidt algorithm (mgs) now becomes

    T^l_mgs = 16m t_s + 8(m² + 5m) t_w.    (7)

    for i = 1, ..., m do
        split v̂_{i+1}, ..., v̂_{m+1} into two blocks
        compute local inner products (LIPs) of block 1
        { accumulate LIPs of block 1 ‖ compute LIPs of block 2 }
        update v̂_{i+1}, compute the LIP for ‖v̂_{i+1}‖, place this LIP into block 2
        { accumulate LIPs of block 2 ‖ update vectors of block 1 }
        update vectors of block 2
        normalize v̂_{i+1}
    end

Figure 5: the implementation of the modified Gram-Schmidt process (‖ denotes concurrent execution)

In general we may not have enough computational work to overlap all the communication time in a global communication process. For the wall clock time of (parallel) operations, it is the longest time that matters. Here it is the global communication time for the modified Gram-Schmidt algorithm (mgs):

    T^g_mgs = 4m p_d t_s + 2(m² + 5m) p_d t_w.    (8)

Since the communication is partly overlapped, the communication cost is in general significantly lower than the communication time, and then it may still be better described by (7) instead of (8). Two important facts are highlighted by expressions (3), (7) and (8). First, assuming sufficient computational work, the contribution of start-up times to the communication cost is reduced from O(m²√P) in the standard GMRES(m) (3) to O(m) using the parallel modified Gram-Schmidt algorithm (7). Especially for machines with relatively high start-up times this is important. In fact, if the start-ups dominate the communication cost, then we can reduce this contribution by a factor of two by using the algorithm given in Figure 4 (even if we neglect the overlap).
Second, assuming sufficient computational work, the communication cost no longer depends on the size of the processor grid: instead of being of the order of the diameter p_d of the processor grid, it is now more or less constant. If we lack sufficient computational work, the communication cost is described by (8) minus the time for the overlapped computation.
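The gap between (3), (7) and (8) is easy to make concrete in code; the parameter values below (m = 30 on a 20×20 grid, t_s = t_w = 1 in arbitrary units) are illustrative.

```python
# Equations (7) and (8), next to (3), to make the reduction concrete.

def t_mgs_local(m, t_s, t_w):
    return 16 * m * t_s + 8 * (m * m + 5 * m) * t_w                 # (7)

def t_mgs_global(m, p_d, t_s, t_w):
    return 4 * m * p_d * t_s + 2 * (m * m + 5 * m) * p_d * t_w      # (8)

def t_gmres_acc_bcast(m, p_d, t_s, t_w):
    return (m * m + 3 * m) * p_d * (t_s + 3 * t_w)                  # (3)

costs = (t_mgs_local(30, 1.0, 1.0),          # fully overlapped cost
         t_mgs_global(30, 20, 1.0, 1.0),     # total communication time
         t_gmres_acc_bcast(30, 20, 1.0, 1.0))  # standard, nonoverlapped
```

With these values the local cost (7) is an order of magnitude below the nonoverlapped cost (3), and (8) comes out at roughly half of (3), consistent with the discussion of transputer-like machines later in the paper.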
In order to be able to use this parallel modified Gram-Schmidt algorithm in GMRES(m), a basis for the Krylov subspace has to be generated first. The idea to first generate a basis for the Krylov subspace and then to orthogonalize this basis was already suggested for the CG algorithm, referred to as s-step CG, in [2] for shared (hierarchical) memory parallel vector processors. In [2] it is also reported that the s-step CG algorithm may converge slowly due to numerical instability for s > 5. In the pargmres(m) algorithm stability seems to be much less of a problem, since each vector is explicitly orthogonalized against all the other vectors, and we generate a polynomial basis for the Krylov subspace such as to minimize the condition number; see [1], where the Krylov subspace is generated first to exploit higher level BLAS in the orthogonalization, and [6]. The basis vectors v̂_i for the Krylov subspace are generated as indicated in Figure 6, where the parameters d_i are used to keep the condition number of the matrix [v_1, v̂_2, ..., v̂_{m+1}] sufficiently small.

    v̂_1 = v_1 = r/‖r‖
    for i = 1, ..., m do
        v̂_{i+1} = v̂_i − d_i A v̂_i
    end

Figure 6: Generation of a polynomial basis for the Krylov subspace

Bai, Hu, and Reichel [1] discuss a strategy for this. Their idea is to use one cycle of standard GMRES(m). Then the eigenvalues of the resulting Hessenberg matrix, which approximate those of A, are used in the so-called Leja ordering as the parameters d_i^{−1} in the rest of the modified GMRES(m) cycles. Their examples indicate that the convergence of such a GMRES(m) is (virtually) the same as that of standard GMRES(m). This is also borne out by our experience. Therefore, in the next section we limit our experiments to the evaluation of a single GMRES(m) cycle. Our parallel computation of the Krylov subspace basis requires m extra daxpys. It is obvious from (1) that this cost is negligible.
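The recurrence of Figure 6 is a one-liner per step; in this sketch the shifts d are assumed given (e.g. derived from the Leja-ordered Ritz values discussed above), and the small matrix in the example is illustrative.

```python
# The polynomial basis generation of Figure 6: v_{i+1} = (I - d_i A) v_i.

def matvec(A, u):
    return [sum(a * x for a, x in zip(row, u)) for row in A]

def polynomial_basis(A, r, d):
    nrm = sum(x * x for x in r) ** 0.5
    vs = [[x / nrm for x in r]]                     # v_1 = r / ||r||
    for d_i in d:                                   # i = 1, ..., m
        Av = matvec(A, vs[-1])
        vs.append([v - d_i * a for v, a in zip(vs[-1], Av)])
    return vs                                       # m + 1 basis vectors

A = [[2.0, 1.0], [1.0, 2.0]]
basis = polynomial_basis(A, [3.0, 4.0], d=[0.5])
```

Each step is a matvec plus one daxpy, which is where the m extra daxpys mentioned above come from.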
However, for completeness we give the computation time including these extra daxpys:

    T^gmr_cmp2 = ( 2m(m + 4) + 4n_z(m + 1) ) (N/P) t_fl.    (9)

Because we generate the Krylov subspace basis first and then orthogonalize it, the Hessenberg matrix that we obtain from the inner products is not V^T_{m+1} A V_m, as in the standard GMRES(m) algorithm, and therefore we need to solve the least squares problem in a slightly different way. Define v̂_1 = v_1 = ‖r‖^{−1} r, and generate the other basis vectors as v̂_{i+1} = (I − d_i A) v̂_i, for i = 1, ..., m. This gives the following relation:

    [v̂_2, v̂_3, ..., v̂_{m+1}] = V̂_m − A V̂_m D_m,    (10)

where D_m = diag(d_i) and V̂_m is the matrix with the vectors v̂_i as its columns. This relation between vectors and matrices composed from these vectors will be used throughout this discussion. The parallel modified Gram-Schmidt orthogonalization gives the orthogonal set of vectors {v_1, ..., v_{m+1}}, for which we have

    v_{j+1} = h^{−1}_{j+1,j+1} ( v̂_{j+1} − Σ_{i=1}^{j} h_{i,j+1} v_i ),  for j = 1, ..., m,    (11)
where h_{i,j} is defined by (but computed differently)

    h_{i,j} = (v_i, v̂_j)  for i ≤ j,    h_{i,j} = 0  for i > j.    (12)

Notice the subtle difference with the definition of H̄_m in the standard implementation of GMRES(m). Here the matrix H_{m+1} is upper triangular. Furthermore, as long as h_{i,i} ≠ 0 the matrix H_i is nonsingular, whereas h_{i,i} = 0 indicates a lucky breakdown. We will further assume, without loss of generality, that h_{i,i} ≠ 0, for i = 1, ..., m + 1. Let h_i denote the i-th column of H_{m+1}. From equations (11) and (12) it follows that

    V̂_i = V_i H_i,  for i = 1, ..., m + 1.    (13)

Equation (10) can be rewritten as

    V̂_m − [v̂_2, ..., v̂_{m+1}] = A V̂_m D_m = A V_m H_m D_m.    (14)

Define Ĥ_m = [h_1, h_2, ..., h_m] − [h_2, h_3, ..., h_{m+1}], so that Ĥ_m is an upper Hessenberg matrix of rank m, since h_{i,i} ≠ 0, for i = 1, ..., m + 1. Substituting this in (14) finally leads to

    V_{m+1} Ĥ_m = A V_m H_m D_m.    (15)

Using this expression the least squares problem can be solved in the same way as for standard GMRES(m):

    min_y ‖r − A V_m y‖ = min_ŷ ‖r − A V_m H_m D_m ŷ‖,  where H_m D_m ŷ = y.    (16)

Because H_m and D_m are nonsingular, the latter by definition, H_m D_m ŷ = y is always well-defined. Combining (15) and (16) yields

    min_ŷ ‖r − V_{m+1} Ĥ_m ŷ‖ = min_ŷ ‖ ‖r‖ e_1 − Ĥ_m ŷ ‖.    (17)

The additional computational work in this approach is only O(m²) and therefore negligible. We will refer to this adapted version of GMRES(m) as pargmres(m).

5 Performance of GMRES(m) and pargmres(m)

Before we discuss the experiments below, we present a short theoretical analysis. The communication time for the exchange of boundary data and the computation time for the m additional vector updates in the pargmres(m) implementation will be neglected in this analysis, because they are relatively unimportant. The runtime of a GMRES(m) cycle on P ≥ 4 processors is then given by T_P = T^gmr_cmp1 + T^gmr_a+b, see (1) and (3):

    T_P = ( 2(m² + 3m) + 4n_z(m + 1) ) t_fl N/P + (m² + 3m)(t_s + 3t_w) √P.    (18)

This equation shows that for sufficiently large P the communication will dominate.
Following the analysis in [5] we introduce the value P_max as the number of processors that minimizes the runtime of GMRES(m). We have studied the performance of GMRES(m) and pargmres(m) for numbers of processors less than or approximately equal to P_max. Note that for pargmres(m)
we can improve the performance further with more processors than P_max, because it has a lower communication cost. The cost of communication is reduced in pargmres(m) in two steps. First, we reduce the communication time by accumulating and broadcasting multiple inner products in groups. This reduces the communication time from T^gmr_a+b to T^g_mgs, see (3) and (8). Second, we overlap the non-local part of the remaining communication time with half the computation in the modified Gram-Schmidt algorithm, see Figure 5. The length of the overlap then determines the performance of pargmres(m) and the improvement over GMRES(m). Therefore we introduce the value P_ovl, which is the number of processors for which the overlap is exact. The performance and the improvement are then related to whether P ≤ P_ovl or P > P_ovl, and to how large P_ovl is relative to P_max, because the fraction of the runtime spent in communication increases for increasing P, see (18). We will now give relations for P_max and P_ovl. The minimization of (18) gives

    P_max = ( [4(m² + 3m) + 8n_z(m + 1)] t_fl N / ( (m² + 3m)(t_s + 3t_w) ) )^{2/3},    (19)

and the efficiency E_P = T_1/(P T_P) for P_max processors is given by E_{P_max} = 1/3, where T_1 is T^gmr_cmp1 for P = 1. This means that ⅔ T_{P_max} is spent in communication, because in this model efficiency is lost only through communication. For P_ovl we have that the (total) communication time T^g_mgs, see (8), is equal to the sum of the overlapping computation time, (m² + 2m) t_fl N/P_ovl, and the local communication time T^l_mgs, see (7):

    √P_ovl ( 4m t_s + (2m² + 10m) t_w ) = (m² + 2m) t_fl N/P_ovl + 16m t_s + (8m² + 40m) t_w.    (20)

If P ≤ P_ovl then the communication cost is reduced to T^l_mgs, see (7). This means that the cost of start-ups is reduced by a factor of ((m + 3)/16)√P and the cost of data transfer by a factor of (3/8)√P. Furthermore, as long as P < P_ovl, an increase in the number of processors will not result in an increase of the communication cost, and hence the efficiency remains constant.
If P > P_ovl, then the overlap is no longer complete and the communication cost is given by the communication time minus the computation time of the overlapping computation: T^g_mgs − (m² + 2m) t_fl N/P. The runtime is then given by

    T̃_P = ( (m² + 4m) + 4n_z(m + 1) ) t_fl N/P + ( 4m t_s + (2m² + 10m) t_w ) √P.    (21)

For P > P_ovl we see that the efficiency decreases again, because the communication time increases and the computation time of the overlap decreases. Equation (20) gives

    P_ovl ≈ ( (m² + 2m) t_fl N / ( 4m t_s + (2m² + 10m) t_w ) )^{2/3}.    (22)

Comparing (19) with (22), we see that if t_s dominates the communication, that is t_s ≫ t_w, then P_ovl > P_max and we always have P ≤ P_ovl, so that we can overlap all communication after the reduction of start-ups. This means that we can reduce the runtime by almost a factor of three. For transputers we have t_s ≈ t_w, and comparing (19) and (22) we see that P_ovl < P_max. One can prove that the improvement of pargmres(m) compared to GMRES(m), T_P/T̃_P, as a function of P is either constant or a strictly increasing or decreasing function. The maximum improvement is therefore found for either P = P_ovl or P = P_max.
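Relations (19) and (22) can be evaluated directly; the machine parameters below are illustrative, chosen only to show the transputer-like regime t_s ≈ t_w.

```python
# Equations (19) and (22): the processor count that minimizes the GMRES(m)
# runtime, and the one for which the overlap in pargmres(m) is exact.

def p_max(m, n_z, N, t_fl, t_s, t_w):
    num = (4 * (m * m + 3 * m) + 8 * n_z * (m + 1)) * t_fl * N
    den = (m * m + 3 * m) * (t_s + 3 * t_w)
    return (num / den) ** (2.0 / 3.0)                               # (19)

def p_ovl(m, N, t_fl, t_s, t_w):
    num = (m * m + 2 * m) * t_fl * N
    den = 4 * m * t_s + (2 * m * m + 10 * m) * t_w
    return (num / den) ** (2.0 / 3.0)                               # (22)

# with t_s comparable to t_w (transputer-like), P_ovl comes out below P_max:
pm = p_max(30, 5, 90_000, t_fl=1.0, t_s=1.0, t_w=1.0)
po = p_ovl(30, 90_000, t_fl=1.0, t_s=1.0, t_w=1.0)
```

Varying t_s/t_w in this sketch reproduces the two regimes discussed above: for start-up-dominated machines P_ovl exceeds P_max, while for t_s ≈ t_w it falls below it.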
For P = P_ovl, the communication time is strongly reduced. Furthermore, (19) and (22) indicate that for m large enough P_ovl ≈ (1/2)^{2/3} P_max, which means that the efficiency at P_ovl is less than about 50%. Therefore we may expect an improvement by about a factor of two. For P = P_max the runtime is given by (21). When t_s ≈ t_w we get T^g_mgs ≈ ½ T^gmr_a+b, and we may say that due to the overlap the cost of computation is reduced by (m² + 2m) t_fl N/P, that is, approximately by a factor of

    ( 2m² + 6m + 4n_z(m + 1) ) / ( m² + 4m + 4n_z(m + 1) ) ≈ ( 2m + 6 + 4n_z ) / ( m + 4 + 4n_z ),

which is a little less than a factor of two. Hence we may expect an improvement by a factor of about two in this case also. We now discuss our experimental observations on the parallel performance of GMRES(m) and the adapted algorithm pargmres(m) on the 400-transputer machine. We will only consider the performance of one (par)gmres(m) cycle, because both algorithms take about the same number of iterations, which generally leads to the same number of GMRES(m) cycles, with only a possible difference in the last cycle. The difference may be that GMRES(m) stops before it completes the full m iterations of the last cycle. This gives on average a difference of only half a GMRES(m) cycle, which is often more than compensated by the much better performance of pargmres(m) in the other cycles. In our experiments we used square processor grids (minimal diameter), and this is optimal for GMRES(m). For other processor grids the degradation of performance for GMRES(m) will be even worse. The pargmres(m) algorithm is much less sensitive to the diameter of the processor grid. We have solved a convection diffusion problem, discretized by finite volumes over a grid, resulting in the familiar five-diagonal matrix with a tridiagonal block structure, corresponding to the 5-point star.
This relatively small problem size was chosen because, for processor grids of increasing size, it shows very well both the degradation of the performance of GMRES(m) and the large improvements of pargmres(m) over GMRES(m). As we will see, the pargmres(m) variant has much better scaling properties than GMRES(m). The measured runtimes for a single (par)gmres(m) cycle are listed in Table 1 for m = 30 and m = 50. For m = 30 we have P_max ≈ 400 and P_ovl ≈ 236; for m = 50 we have P_max ≈ 375 and P_ovl ≈ 244. We give speed-ups and efficiencies in Table 2. These are calculated from the measured runtimes of GMRES(m) and pargmres(m) and an estimated sequential runtime for GMRES(m), because the problem was too large to run on a single processor. The estimated T_1 is the net computation time derived from (1). We mention that for CG (see Section 7) the measured T_1 is approximately 9% less than the estimated T_1, but this is not necessarily the case for GMRES(m) too. The difference between the estimated sequential runtime and the measured one for CG is probably due to a simpler implementation (e.g., less indirect addressing and copying of buffers) for the sequential program, which results in a higher (average) flop-rate.

Table 1: measured runtimes (s) for GMRES(m) and pargmres(m), for m = 30 and m = 50, per processor grid (table data not preserved in this transcription).

Table 2: efficiencies E (%) and speed-ups S for GMRES(m) and pargmres(m), for m = 30 and m = 50, based on measured runtimes and an estimated sequential runtime for GMRES(m) (table data not preserved in this transcription).

The runtime for GMRES(m) is reduced by approximately 25% when increasing the number of processors from 100 to 196. When increasing this from 100 to 289, the runtime reduces only by some 35%. When we further increase the number of processors to 400, the runtime is already larger than for 289 processors, which is in agreement with the previous discussion, because P ≈ P_max for m = 30 and P > P_max for m = 50. Hence the cost of communication spoils the performance of GMRES(m) completely for large P. On the other hand, for pargmres(m) the runtime reduction when increasing from 100 to 196 processors is approximately 45%, where the upper bound is 49%, so this is almost optimal. Such a speed-up shows that the efficiency remains almost constant for this increase in the number of processors; see also Table 2. This is to be expected because we have P < P_ovl, so that any increase in the communication time of the inner products is more than compensated by the overlapping computation. On 289 processors the runtime is about 53% of the runtime on 100 processors, which is still quite good. If we continue to increase the number of processors, we see that for 400 processors the runtime is not much better than for 289 processors, although it is still decreasing. At this point the speed-up for pargmres(m) levels off, because there is insufficient computational work to overlap the communication (P > P_ovl).
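These measured gains can be set against the predicted asymptotic improvement factor. The sketch below evaluates the reconstructed ratio (2m^2 + 6m + 4 n_z (m+1)) / (m^2 + 4m + 4 n_z (m+1)) and its large-m simplification; the exact coefficients are reconstructed from context, so treat the numbers as indicative only.

```python
def improvement_factor(m, n_z=5):
    """Predicted runtime ratio GMRES(m)/pargmres(m) near P = P_max,
    together with its large-m simplification."""
    full = (2*m**2 + 6*m + 4*n_z*(m + 1)) / (m**2 + 4*m + 4*n_z*(m + 1))
    approx = (2*m + 6 + 4*n_z) / (m + 4 + 4*n_z)
    return full, approx

for m in (30, 50):
    print(m, improvement_factor(m))   # a little less than a factor of two
```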
A direct comparison between the runtimes of GMRES(m) and pargmres(m) shows that, for 100 processors, GMRES(m) is about 25% slower than pargmres(m). However, for 196 processors this has already increased to 65% and 81% for m = 30 and m = 50, respectively. From then on the relative difference increases more gradually, to a maximum of about a factor of two for P_max processors. These results are very much in agreement with our theoretical expectations. Note that although the maximum is reached for P_max processors, the improvement is already substantial for 196 processors, which is near P_ovl.

In Table 4 we give the estimated runtimes from expressions (1), (3), and (5) for GMRES(m) and formulas (5), (7), and (9) for pargmres(m). Table 3 gives a short overview of the relevant parameters and their meaning (see Section 3); if the value of a parameter is fixed, its value is given as well. The parameters d, n_z and n_m are derived from our model problem and implementation; the parameters t_s, t_w and t_fl have been determined experimentally. A comparison of the estimates with the measured execution times indicates that the formulas are quite accurate, except for the 400-processor case. The first reason for this discrepancy is that for both algorithms the neglected costs become more important when the size of the local problem is small. These neglected costs are due to, e.g., the copying of buffers for communication and indirect addressing using exterior data, the organization of the communication, and the solution of the least squares problem.

Table 3: parameters and their meaning (fixed values in parentheses)
  t_w  (4.80 μs)  communication word rate
  t_s  (5.30 μs)  communication start-up time
  t_fl (3.00 μs)  average time for a single floating-point operation
  d    (1)        (maximum) number of communication steps in the boundary exchange
  n_m  (4)        number of messages (to send and receive) in the boundary exchange
  n_z  (5)        average number of non-zero elements per row in the matrix
  p_d             maximum distance to the `most central' processor
  n_b             (maximum) number of boundary data elements on a processor
  m               size of the Krylov subspace over which (par)gmres(m) minimizes

Table 4: estimated runtimes (s) for GMRES(m) and pargmres(m), for m = 30 and m = 50, per processor grid, together with p_d, N_l and n_b (table data not preserved in this transcription).

For the pargmres(m) algorithm there is a second and more important reason, viz. that due to the small size of the local problem we can no longer assume an almost complete overlap of the communication in the modified Gram-Schmidt algorithm (P > P_ovl). This is illustrated in Table 5, which gives estimates for the two overlapping parts given in (20). We refer to the sum of the local communication time and half of the computation time in the modified Gram-Schmidt algorithm as comp, and to the total communication time for the accumulation as comm. Already for the 17 × 17 grid we do not have a complete overlap, although the overlap is still good. For the 20 × 20 processor grid an overlap of about 55% is already the maximum. Obviously, for a larger problem this would improve.

6 Communication overhead reduction in CG

For a reduction of the communication overhead in preconditioned CG we follow the approach suggested in [7]. In that approach the operations are rescheduled to create more opportunities for overlap. This leads to an algorithm (parcg) like the one given in Figure 7, where we have assumed that the preconditioner K can be written as K = LL^T.
For a discussion of the ideas behind this scheme we refer to [7]. For our purposes it is relevant to point at the inner products at lines (1), (2) and (3). The communication for these inner products is overlapped by the computational work in the following line. We split the preconditioner to create an overlap for the inner products (1) and (3), and we have extra overlap possibilities, since the inner product (2) is followed by the update for x corresponding to the previous iteration step.

Table 5: comparison of the estimated costs for overlapping computation (comp) and `global' communication (comm) of the modified Gram-Schmidt implementation in pargmres(m), for m = 30 and m = 50, per processor grid (table data not preserved in this transcription).

Under the assumption of a complete overlap for the time that a processor is not active in the accumulation and broadcast of the inner products, and following the derivation of (7), the communication cost for the three inner products in a parcg iteration reduces from T^cg_{a+b}, see (4), to the communication time spent locally by a processor:

    T^{cg,l}_{a+b} = 24 (t_s + 3 t_w).    (23)

Therefore, the communication cost is reduced from O(√P) to O(1), which means that (in theory) the communication cost is independent of the processor grid size.

7 Performance of CG variants

We will follow closely the lines set forth in the analysis for (par)gmres(m) in Section 5. The communication time for the exchange of boundary data will be neglected in this analysis, because it is relatively unimportant for our kind of model problems. The problem-dependent and the machine-dependent parameters have the same values as in the discussion for GMRES(m); see Table 3. The runtime for a CG iteration with P ≥ 4 processors is given by T_P = T^cg_cmp + T^cg_{a+b}, see (2) and (4):

    T_P = (9 + 4 n_z) t_fl N/P + 6 (t_s + 3 t_w) √P.    (24)

This expression shows that for sufficiently large P the communication time will dominate. Here we can also define a P_max as the number of processors that gives the minimal runtime, and a P_ovl as the number of processors for which the (total) communication time of the inner products T^cg_{a+b} (see (4)) is equal to the sum of the computation time of the preconditioner and one vector update (2 n_z t_fl N/P) and the local communication time T^{cg,l}_{a+b} (see (23)). Minimizing the runtime (24) with respect to P gives

    P_max = ( (18 + 8 n_z) N t_fl / (6 (t_s + 3 t_w)) )^{2/3}.    (25)

For P = P_max processors the efficiency E_P = T_1/(P T_P) is again E_{P_max} = 1/3, where T_1 = P T^cg_cmp; therefore, the communication time is (2/3) T_{P_max}. The value of P_ovl is given by

    6 (t_s + 3 t_w) √P_ovl = 2 n_z t_fl N/P_ovl + 24 (t_s + 3 t_w).    (26)
parcg:
    x_{-1} = x_0 = initial guess
    r_0 = b − A x_0
    p_{-1} = 0;  α_{-1} = 0
    s = L^{-1} r_0
    ρ_{-1} = 1
    for i = 0, 1, 2, ... do
(1)     ρ_i = (s, s)
        w_i = L^{-T} s
        β_{i-1} = ρ_i / ρ_{i-1}
        p_i = w_i + β_{i-1} p_{i-1}
        q_i = A p_i
(2)     ξ_i = (p_i, q_i)
        x_i = x_{i-1} + α_{i-1} p_{i-1}
        α_i = ρ_i / ξ_i
        r_{i+1} = r_i − α_i q_i
(3)     compute ||r_{i+1}||
        s = L^{-1} r_{i+1}
        if accurate enough then x_{i+1} = x_i + α_i p_i; quit
    end

Figure 7: The parcg algorithm

For P ≤ P_ovl the communication cost is reduced from T^cg_{a+b} to T^{cg,l}_{a+b}, which gives a reduction by a factor of (1/4)√P. For P > P_ovl the communication cost is given by T^cg_{a+b} − 2 n_z t_fl N/P. A comparison of (25) and (26) shows that P_ovl < P_max. Even though the preconditioner is strongly problem- and implementation-dependent, this holds in general, because for P = P_ovl the communication time is equal to only a part of the computation time, whereas for P = P_max the communication time is already twice the computation time.

This leads to three phases in the performance of parcg. Let a be the computation time, for γ ∈ [0, 1] let γa be the computation time of the `potential' overlap, and let c be the communication time. Then the runtime of CG is given by a + c, whereas for parcg it is given by (1 − γ)a + max(γa, c). For increasing P, a decreases and c increases, as described above. For small P (c ≪ γa, P ≪ P_ovl) all communication can be overlapped, but the communication time is relatively unimportant. For medium P (c ≈ γa, P ≈ P_ovl) the communication time is more or less in balance with the computation time of the overlap, and the improvement is maximal; see below. For large P (c ≫ γa, P ≫ P_ovl) the communication time is dominant, and we do not have enough computational work to overlap it sufficiently. It is easy to prove that the fraction (a + c)/((1 − γ)a + max(γa, c)) is maximal if γa = c, that is, for P = P_ovl, and then the improvement is

    (a + c) / ((1 − γ)a + max(γa, c)) = (a + γa)/a = 1 + γ.    (27)

Hence, the maximum improvement of parcg over CG is determined by this fraction. The larger this fraction is, the larger the maximum improvement by parcg.
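On one process the rescheduling in Figure 7 only reorders the operations, so the algorithm can be checked sequentially. The sketch below is a hypothetical NumPy transcription (dense triangular solves stand in for the preconditioner application; in a parallel run the lines after each inner product would overlap its accumulation):

```python
import numpy as np

def parcg(A, b, L, tol=1e-10, maxit=500):
    """parcg (Figure 7): preconditioned CG with K = L L^T in which the
    update of x lags one step, so that in parallel it can overlap the
    accumulation of inner product (2)."""
    x = np.zeros_like(b)              # holds x_{i-1}
    r = b - A @ x
    p_prev = np.zeros_like(b)
    alpha_prev, rho_prev = 0.0, 1.0
    s = np.linalg.solve(L, r)
    for _ in range(maxit):
        rho = s @ s                   # (1); overlapped by the L^{-T} solve
        w = np.linalg.solve(L.T, s)
        beta = rho / rho_prev
        p = w + beta * p_prev
        q = A @ p
        xi = p @ q                    # (2); overlapped by the delayed update:
        x = x + alpha_prev * p_prev   #     x_i = x_{i-1} + alpha_{i-1} p_{i-1}
        alpha = rho / xi
        r = r - alpha * q
        rnorm = np.linalg.norm(r)     # (3); overlapped by the L^{-1} solve
        s = np.linalg.solve(L, r)
        if rnorm <= tol:
            break
        p_prev, alpha_prev, rho_prev = p, alpha, rho
    return x + alpha * p              # complete the delayed update

# 1D diffusion test problem; split preconditioner L = sqrt(diag(A)).
n = 50
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
L = np.diag(np.sqrt(np.diag(A)))
x = parcg(A, np.ones(n), L)
print(np.linalg.norm(np.ones(n) - A @ x))  # small residual
```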
If the computation time of the preconditioner is dominant, e.g. when n_z is large or when we use preconditioners from a factorization with fill-in, then γ ≈ 1, and we can expect an improvement by a factor of two. In our model we have γ = 2 n_z / (9 + 4 n_z) < 1/2, so that for n_z large enough we can expect a reduction by a factor of 1.5. For our model problem we have n_z = 5, so that the improvement is limited to a factor of 1.33.

We will now discuss the results for the parallel implementation of the standard CG algorithm and the adapted version parcg on the 400-transputer machine for a model problem. Since the algorithms are equivalent, they take the same number of iterations, and therefore we only consider the runtime of one single iteration. We have solved a diffusion problem, discretized by finite volumes over a grid, resulting in a symmetric positive definite five-diagonal matrix (corresponding to the 5-point star). We have solved this relatively small problem on processor grids of increasing size. This problem size was chosen because, for processor grids of increasing size, it shows the three different phases mentioned before.

Table 6: measured runtimes T_P (ms) for one CG and one parcg iteration, with speed-up S_P and efficiency E_P (%) relative to the sequential runtime of CG, and the runtime difference (%), per processor grid (table data not preserved in this transcription).

Table 6 gives the measured runtimes for one iteration step, the speed-ups, and the efficiencies for both CG and parcg for several processor grids. The speed-ups and efficiencies are computed relative to the measured sequential runtime of the CG iteration, which is given by T_1 = 0.788 s. Although CG has far fewer inner products than GMRES(m) per iteration (i.e., per matrix-vector product), we observe that the performance levels off fairly quickly. This is in agreement with the findings reported in [5], which show that such behavior is to be expected for any Krylov subspace method. For our test problem we have P_max ≈ 600 and P_ovl ≈ 228.
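With the reconstructed expressions (25) and (26) and the parameter values of Table 3, these thresholds can be reproduced numerically. The sketch below assumes N = 10000 unknowns, a value inferred here from the sequential estimate T_1 ≈ 0.870 s = (9 + 4 n_z) t_fl N rather than stated explicitly in this excerpt:

```python
from math import sqrt

t_s, t_w, t_fl = 5.30, 4.80, 3.00   # microseconds (Table 3)
n_z, N = 5, 10_000                  # N assumed: (9 + 4*n_z) * t_fl * N = 0.87 s

# (25): the processor count that minimises the runtime (24).
p_max = ((18 + 8*n_z) * N * t_fl / (6 * (t_s + 3*t_w))) ** (2 / 3)

# (26): bisection for 6(t_s+3t_w)*sqrt(P) = 2*n_z*t_fl*N/P + 24(t_s+3t_w).
def f(P):
    return 6*(t_s + 3*t_w)*sqrt(P) - 2*n_z*t_fl*N/P - 24*(t_s + 3*t_w)

lo, hi = 4.0, p_max                 # f(lo) < 0 < f(hi)
for _ in range(60):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)

print(round(p_max), round(0.5 * (lo + hi)))   # roughly 600 and 228
```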
For the processor grids that we used we have P < P_max, so that the runtime decreases for increasing numbers of processors, as predicted by our analysis. Note also the large relative difference between P_ovl and P_max, compared with the relatively small difference for GMRES(m). This indicates that for this test problem, with a small n_z and a relatively cheap preconditioner, we have a small γ. Hence, the improvement in the runtime will be limited, as is illustrated in Table 6. We see that the parcg algorithm leads to better speed-ups than the standard CG algorithm, especially on the 14 × 14 and 17 × 17 processor grids, where the number of processors is closest to P_ovl. Moreover, for parcg we observe that if the number of processors is increased from 100 to 196, the efficiency remains almost constant, and the runtime is reduced by a factor of about 1.75 (against a maximum of 1.96). Just as for GMRES(m), this is predicted by our analysis, because P < P_ovl, so that the increase in the communication time is masked by the overlapping computation. The initial decrease of efficiency when going from 1 to 100 processors is due to a substantial initial overhead.

Table 7: estimated runtimes (ms) for CG and parcg, the non-overlapped communication time, and the corrected estimate for parcg, per processor grid (table data not preserved in this transcription).

This parallel overhead is also illustrated by the fact that the estimated sequential runtime from T^cg_cmp, see (2), is 0.870 s, which is about 10% larger than the measured sequential runtime. The three phases in the performance of parcg are illustrated by the difference in runtime between CG and parcg. For small processor grids the communication time is not very important, and we see only small differences. For processor grids with P near P_ovl the communication and the overlapping computation are in balance, and we see an increase in the runtime difference. For larger processor grids we can no longer overlap the communication, which dominates the runtime, to a sufficient degree, and we see the differences decrease again. We cannot quite match the improvements for pargmres(m), but on the other hand it is important to note that the improvement of parcg comes virtually for free. Besides, for GMRES(m) we have the possibility to combine messages as well as to overlap communication, whereas for CG we can only exploit overlap of communication, unless we combine multiple iterations. Expression (27) indicates that for our problem we cannot expect much more: γ ≈ 1/3, so that the maximum improvement is approximately 33%. This estimate is rather optimistic in view of the large initial parallel overhead. When the computation time of the preconditioner is large or even dominant (γ ≈ 1), the improvement may also be large. This would be the case if n_z is large or when (M)ILU preconditioners with fill-in are used. For many problems this may be a realistic assumption.
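The three phases can be made concrete with the cost model used above: CG costs a + c, parcg costs (1 − γ)a + max(γa, c). A hypothetical sweep over the communication-to-computation ratio shows the improvement peaking at c = γa with value 1 + γ:

```python
def improvement(c, a=1.0, gamma=10/29):
    """Runtime ratio CG/parcg in the overlap cost model;
    gamma = 2*n_z/(9 + 4*n_z) with n_z = 5 gives 10/29."""
    return (a + c) / ((1 - gamma) * a + max(gamma * a, c))

g = 10 / 29
print(improvement(g))                           # peak: equals 1 + gamma
print(improvement(0.1 * g), improvement(10 * g))  # smaller on both sides
```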
Another important observation is that, as long as P > P_ovl, we can increase the computation time of the preconditioner without increasing the runtime of the iteration, because the preconditioner is overlapped with the accumulation and distribution. That means that we can decrease the number of iterations without increasing the runtime per iteration.

In Table 7 we show estimates for the execution times of the CG algorithm and the parcg algorithm. The total cost for CG is computed from (2), (6), and (4), and for parcg we have used (2), (6), and (23). Just as for GMRES(m), the estimates for CG are relatively accurate, except for the 20 × 20 case. Again, this is probably caused by neglected costs in the implementation, which become more important when the local problem size becomes small. For parcg, as well as for pargmres(m), there is also a discrepancy between the measured execution time and the estimated time, due to an incomplete overlap. When we cannot overlap all communication, we can correct the estimate for the runtime of parcg by adding an estimate for the non-overlapped communication time. These corrections can be computed from Table 8 and from the local communication time for one accumulation and broadcast (0.158 ms). Note that we need the computation time for three inner products in one iteration (see (23)). For example, for the 20 × 20 processor grid the computation time of the vector update is not sufficient to overlap the non-local communication time for the accumulation
Stabilization and Acceleration of Algebraic Multigrid Method Recursive Projection Algorithm A. Jemcov J.P. Maruszewski Fluent Inc. October 24, 2006 Outline 1 Need for Algorithm Stabilization and Acceleration
More informationLast Time. Social Network Graphs Betweenness. Graph Laplacian. Girvan-Newman Algorithm. Spectral Bisection
Eigenvalue Problems Last Time Social Network Graphs Betweenness Girvan-Newman Algorithm Graph Laplacian Spectral Bisection λ 2, w 2 Today Small deviation into eigenvalue problems Formulation Standard eigenvalue
More informationKey words. linear equations, polynomial preconditioning, nonsymmetric Lanczos, BiCGStab, IDR
POLYNOMIAL PRECONDITIONED BICGSTAB AND IDR JENNIFER A. LOE AND RONALD B. MORGAN Abstract. Polynomial preconditioning is applied to the nonsymmetric Lanczos methods BiCGStab and IDR for solving large nonsymmetric
More informationCourse Notes: Week 1
Course Notes: Week 1 Math 270C: Applied Numerical Linear Algebra 1 Lecture 1: Introduction (3/28/11) We will focus on iterative methods for solving linear systems of equations (and some discussion of eigenvalues
More informationSolving Large Nonlinear Sparse Systems
Solving Large Nonlinear Sparse Systems Fred W. Wubs and Jonas Thies Computational Mechanics & Numerical Mathematics University of Groningen, the Netherlands f.w.wubs@rug.nl Centre for Interdisciplinary
More informationIn order to solve the linear system KL M N when K is nonsymmetric, we can solve the equivalent system
!"#$% "&!#' (%)!#" *# %)%(! #! %)!#" +, %"!"#$ %*&%! $#&*! *# %)%! -. -/ 0 -. 12 "**3! * $!#%+,!2!#% 44" #% ! # 4"!#" "%! "5"#!!#6 -. - #% " 7% "3#!#3! - + 87&2! * $!#% 44" ) 3( $! # % %#!!#%+ 9332!
More informationScalable Non-blocking Preconditioned Conjugate Gradient Methods
Scalable Non-blocking Preconditioned Conjugate Gradient Methods Paul Eller and William Gropp University of Illinois at Urbana-Champaign Department of Computer Science Supercomputing 16 Paul Eller and William
More informationOn the influence of eigenvalues on Bi-CG residual norms
On the influence of eigenvalues on Bi-CG residual norms Jurjen Duintjer Tebbens Institute of Computer Science Academy of Sciences of the Czech Republic duintjertebbens@cs.cas.cz Gérard Meurant 30, rue
More informationUniversiteit-Utrecht. Department. of Mathematics. Jacobi-Davidson algorithms for various. eigenproblems. - A working document -
Universiteit-Utrecht * Department of Mathematics Jacobi-Davidson algorithms for various eigenproblems - A working document - by Gerard L.G. Sleipen, Henk A. Van der Vorst, and Zhaoun Bai Preprint nr. 1114
More informationCommunication-avoiding Krylov subspace methods
Motivation Communication-avoiding Krylov subspace methods Mark mhoemmen@cs.berkeley.edu University of California Berkeley EECS MS Numerical Libraries Group visit: 28 April 2008 Overview Motivation Current
More informationBarrier. Overview: Synchronous Computations. Barriers. Counter-based or Linear Barriers
Overview: Synchronous Computations Barrier barriers: linear, tree-based and butterfly degrees of synchronization synchronous example : Jacobi Iterations serial and parallel code, performance analysis synchronous
More informationSolving Symmetric Indefinite Systems with Symmetric Positive Definite Preconditioners
Solving Symmetric Indefinite Systems with Symmetric Positive Definite Preconditioners Eugene Vecharynski 1 Andrew Knyazev 2 1 Department of Computer Science and Engineering University of Minnesota 2 Department
More informationON ORTHOGONAL REDUCTION TO HESSENBERG FORM WITH SMALL BANDWIDTH
ON ORTHOGONAL REDUCTION TO HESSENBERG FORM WITH SMALL BANDWIDTH V. FABER, J. LIESEN, AND P. TICHÝ Abstract. Numerous algorithms in numerical linear algebra are based on the reduction of a given matrix
More informationReduced Synchronization Overhead on. December 3, Abstract. The standard formulation of the conjugate gradient algorithm involves
Lapack Working Note 56 Conjugate Gradient Algorithms with Reduced Synchronization Overhead on Distributed Memory Multiprocessors E. F. D'Azevedo y, V.L. Eijkhout z, C. H. Romine y December 3, 1999 Abstract
More informationSolving Ax = b, an overview. Program
Numerical Linear Algebra Improving iterative solvers: preconditioning, deflation, numerical software and parallelisation Gerard Sleijpen and Martin van Gijzen November 29, 27 Solving Ax = b, an overview
More informationDELFT UNIVERSITY OF TECHNOLOGY
DELFT UNIVERSITY OF TECHNOLOGY REPORT 18-05 Efficient and robust Schur complement approximations in the augmented Lagrangian preconditioner for high Reynolds number laminar flows X. He and C. Vuik ISSN
More informationSolving Sparse Linear Systems: Iterative methods
Scientific Computing with Case Studies SIAM Press, 2009 http://www.cs.umd.edu/users/oleary/sccs Lecture Notes for Unit VII Sparse Matrix Computations Part 2: Iterative Methods Dianne P. O Leary c 2008,2010
More informationSolving Sparse Linear Systems: Iterative methods
Scientific Computing with Case Studies SIAM Press, 2009 http://www.cs.umd.edu/users/oleary/sccswebpage Lecture Notes for Unit VII Sparse Matrix Computations Part 2: Iterative Methods Dianne P. O Leary
More information7.2 Steepest Descent and Preconditioning
7.2 Steepest Descent and Preconditioning Descent methods are a broad class of iterative methods for finding solutions of the linear system Ax = b for symmetric positive definite matrix A R n n. Consider
More informationAMS526: Numerical Analysis I (Numerical Linear Algebra)
AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 23: GMRES and Other Krylov Subspace Methods Xiangmin Jiao SUNY Stony Brook Xiangmin Jiao Numerical Analysis I 1 / 9 Minimizing Residual CG
More informationIterative Methods for Solving A x = b
Iterative Methods for Solving A x = b A good (free) online source for iterative methods for solving A x = b is given in the description of a set of iterative solvers called templates found at netlib: http
More information6. Iterative Methods for Linear Systems. The stepwise approach to the solution...
6 Iterative Methods for Linear Systems The stepwise approach to the solution Miriam Mehl: 6 Iterative Methods for Linear Systems The stepwise approach to the solution, January 18, 2013 1 61 Large Sparse
More informationAlgebraic Multigrid as Solvers and as Preconditioner
Ò Algebraic Multigrid as Solvers and as Preconditioner Domenico Lahaye domenico.lahaye@cs.kuleuven.ac.be http://www.cs.kuleuven.ac.be/ domenico/ Department of Computer Science Katholieke Universiteit Leuven
More informationHenk van der Vorst. Abstract. We discuss a novel approach for the computation of a number of eigenvalues and eigenvectors
Subspace Iteration for Eigenproblems Henk van der Vorst Abstract We discuss a novel approach for the computation of a number of eigenvalues and eigenvectors of the standard eigenproblem Ax = x. Our method
More informationBounding the End-to-End Response Times of Tasks in a Distributed. Real-Time System Using the Direct Synchronization Protocol.
Bounding the End-to-End Response imes of asks in a Distributed Real-ime System Using the Direct Synchronization Protocol Jun Sun Jane Liu Abstract In a distributed real-time system, a task may consist
More informationAlternative correction equations in the Jacobi-Davidson method
Chapter 2 Alternative correction equations in the Jacobi-Davidson method Menno Genseberger and Gerard Sleijpen Abstract The correction equation in the Jacobi-Davidson method is effective in a subspace
More information1 Extrapolation: A Hint of Things to Come
Notes for 2017-03-24 1 Extrapolation: A Hint of Things to Come Stationary iterations are simple. Methods like Jacobi or Gauss-Seidel are easy to program, and it s (relatively) easy to analyze their convergence.
More informationSimple iteration procedure
Simple iteration procedure Solve Known approximate solution Preconditionning: Jacobi Gauss-Seidel Lower triangle residue use of pre-conditionner correction residue use of pre-conditionner Convergence Spectral
More informationFrom Stationary Methods to Krylov Subspaces
Week 6: Wednesday, Mar 7 From Stationary Methods to Krylov Subspaces Last time, we discussed stationary methods for the iterative solution of linear systems of equations, which can generally be written
More informationParallel Numerics, WT 2016/ Iterative Methods for Sparse Linear Systems of Equations. page 1 of 1
Parallel Numerics, WT 2016/2017 5 Iterative Methods for Sparse Linear Systems of Equations page 1 of 1 Contents 1 Introduction 1.1 Computer Science Aspects 1.2 Numerical Problems 1.3 Graphs 1.4 Loop Manipulations
More informationPeter Deuhard. for Symmetric Indenite Linear Systems
Peter Deuhard A Study of Lanczos{Type Iterations for Symmetric Indenite Linear Systems Preprint SC 93{6 (March 993) Contents 0. Introduction. Basic Recursive Structure 2. Algorithm Design Principles 7
More informationDELFT UNIVERSITY OF TECHNOLOGY
DELFT UNIVERSITY OF TECHNOLOGY REPORT 16-02 The Induced Dimension Reduction method applied to convection-diffusion-reaction problems R. Astudillo and M. B. van Gijzen ISSN 1389-6520 Reports of the Delft
More informationAMS Mathematics Subject Classification : 65F10,65F50. Key words and phrases: ILUS factorization, preconditioning, Schur complement, 1.
J. Appl. Math. & Computing Vol. 15(2004), No. 1, pp. 299-312 BILUS: A BLOCK VERSION OF ILUS FACTORIZATION DAVOD KHOJASTEH SALKUYEH AND FAEZEH TOUTOUNIAN Abstract. ILUS factorization has many desirable
More informationMultigrid absolute value preconditioning
Multigrid absolute value preconditioning Eugene Vecharynski 1 Andrew Knyazev 2 (speaker) 1 Department of Computer Science and Engineering University of Minnesota 2 Department of Mathematical and Statistical
More informationA short course on: Preconditioned Krylov subspace methods. Yousef Saad University of Minnesota Dept. of Computer Science and Engineering
A short course on: Preconditioned Krylov subspace methods Yousef Saad University of Minnesota Dept. of Computer Science and Engineering Universite du Littoral, Jan 19-3, 25 Outline Part 1 Introd., discretization
More informationMathematics Research Report No. MRR 003{96, HIGH RESOLUTION POTENTIAL FLOW METHODS IN OIL EXPLORATION Stephen Roberts 1 and Stephan Matthai 2 3rd Febr
HIGH RESOLUTION POTENTIAL FLOW METHODS IN OIL EXPLORATION Stephen Roberts and Stephan Matthai Mathematics Research Report No. MRR 003{96, Mathematics Research Report No. MRR 003{96, HIGH RESOLUTION POTENTIAL
More informationThe Lanczos and conjugate gradient algorithms
The Lanczos and conjugate gradient algorithms Gérard MEURANT October, 2008 1 The Lanczos algorithm 2 The Lanczos algorithm in finite precision 3 The nonsymmetric Lanczos algorithm 4 The Golub Kahan bidiagonalization
More informationModelling and implementation of algorithms in applied mathematics using MPI
Modelling and implementation of algorithms in applied mathematics using MPI Lecture 3: Linear Systems: Simple Iterative Methods and their parallelization, Programming MPI G. Rapin Brazil March 2011 Outline
More informationLecture 17: Iterative Methods and Sparse Linear Algebra
Lecture 17: Iterative Methods and Sparse Linear Algebra David Bindel 25 Mar 2014 Logistics HW 3 extended to Wednesday after break HW 4 should come out Monday after break Still need project description
More informationJos L.M. van Dorsselaer. February Abstract. Continuation methods are a well-known technique for computing several stationary
Computing eigenvalues occurring in continuation methods with the Jacobi-Davidson QZ method Jos L.M. van Dorsselaer February 1997 Abstract. Continuation methods are a well-known technique for computing
More informationA DISSERTATION. Extensions of the Conjugate Residual Method. by Tomohiro Sogabe. Presented to
A DISSERTATION Extensions of the Conjugate Residual Method ( ) by Tomohiro Sogabe Presented to Department of Applied Physics, The University of Tokyo Contents 1 Introduction 1 2 Krylov subspace methods
More informationParallel programming using MPI. Analysis and optimization. Bhupender Thakur, Jim Lupo, Le Yan, Alex Pacheco
Parallel programming using MPI Analysis and optimization Bhupender Thakur, Jim Lupo, Le Yan, Alex Pacheco Outline l Parallel programming: Basic definitions l Choosing right algorithms: Optimal serial and
More informationA Hybrid Method for the Wave Equation. beilina
A Hybrid Method for the Wave Equation http://www.math.unibas.ch/ beilina 1 The mathematical model The model problem is the wave equation 2 u t 2 = (a 2 u) + f, x Ω R 3, t > 0, (1) u(x, 0) = 0, x Ω, (2)
More information