(Technical Report 832, Mathematical Institute, University of Utrecht, October 1993) E. de Sturler. Delft University of Technology.
Reducing the effect of global communication in GMRES(m) and CG on Parallel Distributed Memory Computers
(Technical Report 832, Mathematical Institute, University of Utrecht, October 1993)

E. de Sturler, Faculty of Technical Mathematics and Informatics, Delft University of Technology, Mekelweg 4, Delft, The Netherlands

and

H. A. van der Vorst, Mathematical Institute, Utrecht University, Budapestlaan 6, Utrecht, The Netherlands

Abstract. In this paper we study possibilities to reduce the communication overhead introduced by inner products in the iterative solution methods CG and GMRES(m). The performance of these methods on parallel distributed memory machines is often limited by the global communication required for the inner products. We investigate two ways of improvement. One is to assemble the results of a number of local inner products of a processor and to accumulate them collectively. The other is to try to overlap communication with computation. The matrix vector products may also introduce some communication overhead, but for many relevant problems this involves communication with a few nearby processors only, and this does not necessarily degrade the performance of the algorithm.

Key words. parallel computing, distributed memory computers, conjugate gradient methods, performance, GMRES, modified Gram-Schmidt.

AMS(MOS) subject classification. 65Y05, 65F10, 65Y20.

1 Introduction

The Conjugate Gradients (CG) method [9] and the GMRES(m) method [11] are widely used methods for the iterative solution of specific classes of linear systems. The time-consuming kernels in these methods are: inner products, vector updates, and matrix vector products (including preconditioning operations). In many situations, especially when the matrix operations are well-structured, these operations are suited for implementation on vector computers and shared memory parallel computers [8].

The first author wishes to acknowledge Shell Research B.V. and STIPT for the financial support of his research.
For parallel distributed memory machines the picture is entirely different. In general the vectors are distributed over the processors, so that even when the matrix operations can be implemented efficiently by parallel operations, we cannot avoid the global communication required for inner product computations. These global communication costs become relatively more and more important as the number of parallel processors increases, and thus they have the potential to affect the scalability of the algorithms in a very negative way [5]. This aspect has received much attention, and several approaches have been suggested to improve the performance of these algorithms. For CG the approaches come down to reformulating the orthogonalization part of the algorithm, so that the required inner products can be computed in the same phase of the iteration step (see, e.g., [4, 10]), or to combining the orthogonalization for several successive iteration steps, as in the s-step methods [2]. The numerical stability of these approaches is a major point of concern. For GMRES(m) the approach comes down to some variant of the s-step methods [1, 3]. After basis vectors for part of the Krylov subspace have been generated by some suitable recurrence relation, they have to be orthogonalized. One often resorts to cheap but potentially unstable methods like Gram-Schmidt orthogonalization. In the present study we investigate other ways to reduce the global communication overhead due to the inner products. Our approach is to identify operations that may be executed while communication takes place, since our aim is to overlap communication with computation. For CG this is done by rescheduling the operations, without changing the numerical stability of the method [7]. For GMRES(m) it is achieved by reformulating the modified Gram-Schmidt orthogonalization step [5, 6].
For GMRES(m) we also exploit the possibility of packing the results of the local inner products of a processor into one message and accumulating them collectively. We believe that our findings are relevant for other Krylov subspace methods as well, since methods like BiCG and its variants CGS, BiCGSTAB, and QMR have much in common with CG from the implementation point of view. Likewise, the communication problems with GMRES(m) are representative of the problems in methods like ORTHODIR, GENCG, FOM, and ORTHOMIN. We have carried out our experiments on a 400-processor Parsytec Supercluster at the Koninklijke/Shell-Laboratorium in Amsterdam. The processors are connected in a fixed 20×20 mesh, of which arbitrary submeshes can be used. Each processor is a T800-20 transputer. The transputer supports only nearest neighbor synchronous communication; more complicated communication has to be programmed explicitly. The communication rate is fast compared to the flop rate, but by current standards the T800 is a slow processor. Another feature of the transputer is the support of time-shared execution of multiple `parallel' CPU-processes on a single processor, which facilitates the implementation of programs that switch between tasks when necessary (on an interrupt basis), e.g., between communication and computation. Finally, transputers have the possibility of concurrent communication and computation. As a result it is possible to overlap computation and communication on a single processor. The program that runs on each processor consists of two processes that run time-shared: a computation process and a communication process. The computation process functions as the master. If at some point communication is necessary, the computation process sends the data to the communication process (on the same processor) or requests the data from the communication process, which then handles the actual communication.
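The interplay between these two processes can be sketched in ordinary Python, with threads standing in for the transputer's time-shared CPU-processes; all names and message contents below are illustrative, not transputer or Parsytec code.

```python
import threading
import queue

# A sketch of the two-process organization described above: the computation
# process is the master and hands communication requests to a communication
# process running concurrently on the same processor.

def communication_process(requests, replies):
    # Handles the actual 'network' traffic so computation never blocks on it.
    while True:
        req = requests.get()
        if req is None:              # shutdown signal from the master
            break
        kind, payload = req
        if kind == "send":
            pass                     # here: a synchronous nearest-neighbor send
        elif kind == "recv":
            replies.put(payload)     # here: the message actually received

def computation_process():
    requests, replies = queue.Queue(), queue.Queue()
    comm = threading.Thread(target=communication_process,
                            args=(requests, replies))
    comm.start()
    requests.put(("send", [1.0, 2.0]))          # hand off data, keep computing
    partial = sum(x * x for x in range(1000))   # overlapped local work
    requests.put(("recv", [3.0, 4.0]))          # stand-in for a real receive
    received = replies.get()                    # block only when result needed
    requests.put(None)
    comm.join()
    return partial, received

partial, received = computation_process()
```

The master blocks only at the moment it actually needs the received data, which is what makes the asynchronous behaviour described next possible.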
This organization permits the computation processes on different processors to work asynchronously even though the actual communication is synchronous. The communication process is given the higher priority, so that
if there is something to communicate, this is started as soon as possible.

2 The algorithms for GMRES(m) and CG

Preconditioned CG:

    start:   x_0 = initial guess; r_0 = b − A x_0;
             p_{−1} = 0; β_{−1} = 0;
             solve for w_0 in K w_0 = r_0;
             ρ_0 = (r_0, w_0);
    iterate: for i = 0, 1, 2, ... do
                 p_i = w_i + β_{i−1} p_{i−1};
                 q_i = A p_i;
                 α_i = ρ_i / (p_i, q_i);
                 x_{i+1} = x_i + α_i p_i;
                 r_{i+1} = r_i − α_i q_i;
                 compute ‖r_{i+1}‖; if accurate enough then quit;
                 solve for w_{i+1} in K w_{i+1} = r_{i+1};
                 ρ_{i+1} = (r_{i+1}, w_{i+1});
                 β_i = ρ_{i+1} / ρ_i;
             end

Figure 1: The preconditioned CG algorithm

GMRES(m):

    start:   x_0 = initial guess; r_0 = b − A x_0;
             v_1 = r_0 / ‖r_0‖;
    iterate: for j = 1, ..., m do
                 v̂_{j+1} = A v_j;
                 for i = 1, ..., j do
                     h_{i,j} = (v̂_{j+1}, v_i);
                     v̂_{j+1} = v̂_{j+1} − h_{i,j} v_i;
                 end
                 h_{j+1,j} = ‖v̂_{j+1}‖;
                 v_{j+1} = v̂_{j+1} / h_{j+1,j};
             end
             form the approximate solution:
             x_m = x_0 + V_m y_m, where y_m minimizes ‖ ‖r_0‖ e_1 − H̄_m y ‖, y ∈ IR^m;
    restart: compute r_m = b − A x_m;
             if satisfied then stop,
             else x_0 = x_m; v_1 = r_m / ‖r_m‖; goto iterate.

Figure 2: The GMRES(m) algorithm

In this section we will discuss the time-consuming kernels in CG and GMRES(m): the vector update (daxpy), the preconditioner, the matrix vector product, and the inner product (ddot); see Figures 1 and 2. Because the results of the inner products are needed on all processors, the Hessenberg matrix H̄_m (see Figure 2) is available on each processor. Hence, the computation of y_m can be done on each processor. This is often efficient, because it would be a synchronization point if implemented on a single processor, and then the other processors would have to wait for the result. However, if the size of the reduced system is large compared to the local number of unknowns, the computation might be expensive enough to make distribution and parallel solution worthwhile. We have not pursued this idea. The parallel implementation of the vector update (daxpy) poses no problem, since it involves only local computation.
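As a concrete reference for these kernels, here is a minimal pure-Python sketch of the preconditioned CG iteration of Figure 1; the dense matvec and the identity preconditioner in the example are illustrative only, not the distributed implementation discussed in this paper.

```python
# A sketch of preconditioned CG (Figure 1); dot, axpy and matvec below are
# exactly the ddot, daxpy and matrix vector kernels discussed in the text.

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def axpy(a, u, v):                       # returns a*u + v
    return [a * ui + vi for ui, vi in zip(u, v)]

def matvec(A, u):
    return [dot(row, u) for row in A]

def pcg(A, b, solve_K, x0, tol=1e-12, maxit=100):
    x = x0[:]
    r = axpy(-1.0, matvec(A, x), b)      # r_0 = b - A x_0
    w = solve_K(r)                       # K w_0 = r_0
    rho = dot(r, w)                      # rho_0 = (r_0, w_0)
    p = [0.0] * len(b)                   # p_{-1} = 0
    beta = 0.0                           # beta_{-1} = 0
    for _ in range(maxit):
        p = axpy(beta, p, w)             # p_i = w_i + beta_{i-1} p_{i-1}
        q = matvec(A, p)
        alpha = rho / dot(p, q)
        x = axpy(alpha, p, x)
        r = axpy(-alpha, q, r)
        if dot(r, r) ** 0.5 < tol:       # compute ||r||
            break
        w = solve_K(r)                   # K w_{i+1} = r_{i+1}
        rho, rho_old = dot(r, w), rho
        beta = rho / rho_old             # beta_i = rho_{i+1} / rho_i
    return x

# usage on a small SPD system, with the identity as (trivial) preconditioner
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = pcg(A, b, solve_K=lambda r: r[:], x0=[0.0, 0.0])
```

Note that each iteration contains three inner products (the two in the coefficients and the norm), which is the count used in the performance model later on.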
In this paper we restrict ourselves to problems for which the parallelism in the matrix vector product does not pose serious problems. That is, our model problems have a strong data locality, which is typical for many finite difference and finite element problems. A suitable
domain decomposition approach preserves this locality more or less independently of the number of processors, so that the matrix vector product requires only neighbor-neighbor communication or communication with only a few nearby processors. This could be overlapped with computations for the interior of the domain, but it is relatively less important, since the number of boundary operations is in general an order of magnitude smaller than the number of interior operations (this is the surface-to-volume effect). The communication overhead introduced by the preconditioner is obviously strongly dependent on the selected preconditioner. Popular preconditioners on sequential computers, like the (M)ILU variants, are highly sequential or introduce irregular communication patterns (as in the hyperplane approach, see [8]), and therefore these are not suitable. Obviously we prefer preconditioners which require only a limited amount of communication, for instance comparable to or less than that of the matrix vector product. On the other hand we would like to retain the iteration-reducing effect of the preconditioners, and these considerations are often in conflict. In our study we have avoided discussing the convergence-accelerating effects of the preconditioner, and we have used a simple Incomplete Block Jacobi preconditioner with blocks corresponding to the domains. In this case we have no communication at all for the preconditioner. Since the vectors are distributed over the processor grid, the inner product (ddot) is computed in two steps. All processors start by computing the local inner product in parallel. After that, the local inner products are accumulated on one `central' processor and broadcast. We will describe the implementation in a little more detail for a 2-dimensional mesh of processors, see Figure 3.

Figure 3: accumulation over the processor grid (no outgoing step can be taken before all incoming steps have taken place)
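The two-step inner product can be sketched as a small simulation; the grid layout and the step counting below are illustrative, assuming a fan-in along grid lines in the x-direction followed by one in the y-direction, as described for Figure 3.

```python
# A simulation (not message-passing code) of the two-phase accumulation of
# local inner products over a p x p processor grid. We count nearest-neighbor
# hops on the longest path to show the time grows with the grid diameter.

def accumulate(grid):
    """grid[y][x] holds the local inner product of processor (x, y)."""
    p = len(grid)
    cx = p // 2                              # 'accumulation' column
    steps = 0
    # phase 1: fan-in along each x-line onto column cx
    col = []
    for y in range(p):
        col.append(sum(grid[y]))
        steps = max(steps, cx, p - 1 - cx)   # hops from the farthest end
    # phase 2: fan-in along the accumulation column in the y-direction
    cy = p // 2
    total = sum(col)
    steps += max(cy, p - 1 - cy)
    return total, steps                      # the broadcast reverses the path

grid = [[1.0] * 4 for _ in range(4)]         # 4x4 grid, each local result 1.0
total, steps = accumulate(grid)
```

The hop count is on the order of the grid diameter, which is the scalability concern analyzed below.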
The processors on each processor line in the x-direction accumulate their results along this line on an `accumulation' processor at the same position on each line: each processor waits for the result from its neighbor further from the accumulation processor, adds this result to its own partial result, and sends the new result along. Then the `accumulation' processors perform a similar accumulation in the y-direction. The broadcast consists of the reverse process. Each processor is active in only a limited number of steps and will be idle for the rest of the time. So there are opportunities to make it available for other tasks. The communication time of an accumulation or a broadcast is of the order of the diameter of the processor grid. This means that for an increasing number of processors the communication time for the inner products increases as well, and hence this is a potential threat to the scalability
of the method. Indeed, if the global communication for the inner products is not overlapped, it often becomes a bottleneck on large processor grids, as will be shown later. In [5] a simple performance model based on these considerations is introduced, which clearly shows quantitatively the dramatic influence of the global communication for inner products over large processor grids on the performance of Krylov subspace methods. That model also shows that the degradation of performance depends on the relative costs of local computation and global communication. This means that results analogous to those presented in Sections 5 and 7 will be seen for larger problems on processor configurations with relatively faster computational speed (this is the current trend in parallel computers). Moreover, if the problem size increases proportionally to the number of processors, the local computation time remains the same but the global communication cost increases. This emphasizes the necessity of reducing the effect of the global communication costs.

3 Parallel performance of GMRES(m) and CG

We will now briefly describe a model for the computation time, the communication cost, and the communication time of the main kernels in Krylov subspace methods. We use the term communication cost to indicate the wall clock time spent in communication that is not overlapped with useful computation (so that it really contributes to the wall clock time). The term communication time is used to indicate the wall clock time of the whole communication. In the case of nonoverlapped communication, the communication time and the communication cost are the same. Our quantitative formulas are not meant to give very accurate predictions of the exact execution times, but they will be used to identify the bottlenecks and to evaluate improvements. Several of the parameters that we introduce may vary over the processor grid.
In that case the value to use is either a maximum or an average, whichever is the most appropriate.

Computation time. We will only be concerned with the local computation time, since the cost of communication and synchronization is modeled explicitly. The computation time for the solution of the Hessenberg system is neglected in our model. For a vector update (daxpy) or an inner product (ddot) the computation time is given by 2 t_fl N/P, where N/P is the local number of unknowns of a processor and t_fl is the average time for a double precision floating point operation. The computation time for the (sparse) matrix vector product is given by (2n_z − 1) t_fl N/P, where n_z is the average number of non-zero elements per row of the matrix. As preconditioner we chose Block-(M)ILU variants without fill-in, of the form L D^{−1} U for GMRES(m) and L L^T for CG. For CG we have scaled the system so that diag(L) = I. The computation time of the preconditioner for GMRES(m) is (2n_z + 1) t_fl N/P, and for CG it is (2n_z − 1) t_fl N/P. A full GMRES(m) cycle has approximately ½(m² + 3m) inner products, the same number of vector updates, and (m + 1) multiplications with the matrix and with the preconditioner, if one computes the exact residual at the end of each cycle. The complete (local) computation time for the GMRES(m) algorithm is given by the equation:

    T^gmr_cmp1 = ( 2(m² + 3m) + 4n_z(m + 1) ) (N/P) t_fl.    (1)

A single iteration of CG has three inner products, the same number of vector updates, and one multiplication with the matrix and with the preconditioner. The complete (local) computation time is given by the equation:

    T^cg_cmp = ( 10 + 4n_z ) (N/P) t_fl.    (2)
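The computation-time model (1) and (2) can be written out directly; the parameter values in the example are illustrative, not measured.

```python
# The computation-time formulas (1) and (2) as a small model; t_fl, n_z, N
# and P are machine and problem inputs.

def t_gmres_cmp1(m, n_z, N, P, t_fl):
    # (1): (m^2+3m)/2 ddots and as many daxpys at 2 flops per element,
    # plus (m+1) matvecs (2n_z - 1) and preconditioner solves (2n_z + 1)
    return (2 * (m * m + 3 * m) + 4 * n_z * (m + 1)) * (N / P) * t_fl

def t_cg_cmp(n_z, N, P, t_fl):
    # (2): 3 ddots + 3 daxpys (2 flops each per element), one matvec
    # (2n_z - 1) and one preconditioner solve (2n_z - 1)
    return (10 + 4 * n_z) * (N / P) * t_fl

# example: m = 30, 5-point stencil (n_z = 5), N = 90000 unknowns, P = 100
t_cycle = t_gmres_cmp1(30, 5, 90_000, 100, t_fl=1e-6)
t_iter = t_cg_cmp(5, 90_000, 100, t_fl=1e-6)
```

The example values are only meant to make the orders of magnitude concrete.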
Communication cost. As we mentioned already, the solution of the Hessenberg system in GMRES(m) and the vector update are local and involve no communication cost. The most important communication is for the global inner products. If we do not overlap this global communication, then we are concerned with the wall clock time for the entire, global operation and not with the local part of a single processor. We note that we can view the time for the accumulation and broadcast either as the communication time for the entire operation, or as a small local communication time and a long delay because of global synchronization. In the first interpretation we would consider overlapping the global communication, whereas in the second one we would consider removing the delays by reducing the number of synchronization points. We will take the first point of view. Consider a processor grid with P = p² processors. With p_d = 2⌈p/2⌉ (≈ √P), the maximum distance to the `most central' processor over the processor grid is p_d. Let the communication start-up time be given by t_s and the word (32 bits) transmission time by t_w. The time to communicate one double precision number between two neighboring processors is then (t_s + 3t_w), since a double precision number takes two words and we need a one-word header to accompany each message. Hence, the global accumulation and broadcast of one double precision number takes 2p_d(t_s + 3t_w), and the global accumulation and broadcast of a vector of k double precision numbers takes 2p_d(t_s + (2k + 1)t_w). For GMRES(m) in the nonoverlapped case, the communication time for the modified Gram-Schmidt algorithm (with ½(m² + 3m) accumulations and broadcasts) is

    T^gmr_a+b = (m² + 3m) p_d (t_s + 3t_w),    (3)

where `a+b' indicates the accumulation and broadcast.
For CG in the nonoverlapped case, the communication time of the three inner products per iteration is

    T^cg_a+b = 6 p_d (t_s + 3t_w).    (4)

The communication for the matrix vector product is necessary for the exchange of so-called boundary data: sending boundary data to other processors and receiving boundary data from other processors. Assume that each processor has to send and to receive n_m messages, which each take d steps of nearest neighbor communication from source to destination, and let the number of boundary data elements on a processor be given by n_b. The total number of words that have to be communicated (sent and received) is then 2(2n_b + n_m) per processor. For GMRES(m) the communication time of the (m + 1) matrix vector products is

    T^gmr_bde = 2 d n_m (m + 1) t_s + 2 d (m + 1)(2n_b + n_m) t_w,    (5)

where `bde' refers to the boundary exchange. For CG the communication time of one matrix vector product is

    T^cg_bde = 2 d n_m t_s + 2 d (2n_b + n_m) t_w.    (6)

Note that we have assumed no overlap. For preconditioners that need only boundary exchanges, we could have used the same formulas with a different choice of the parameter values if necessary, but in our experiments we have used only local block preconditioners (without communication).
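The nonoverlapped communication times (3) through (6) can likewise be written as code; the grid and message parameters in the example are illustrative.

```python
# Formulas (3)-(6): p_d ~ sqrt(P) is the grid diameter term, t_s the
# start-up time, t_w the word transmission time, d the number of hops,
# n_m the number of messages, n_b the number of boundary elements.

def t_gmres_acc_bcast(m, p_d, t_s, t_w):
    return (m * m + 3 * m) * p_d * (t_s + 3 * t_w)                 # (3)

def t_cg_acc_bcast(p_d, t_s, t_w):
    return 6 * p_d * (t_s + 3 * t_w)                               # (4)

def t_gmres_bde(m, d, n_m, n_b, t_s, t_w):
    return (2 * d * n_m * (m + 1) * t_s
            + 2 * d * (m + 1) * (2 * n_b + n_m) * t_w)             # (5)

def t_cg_bde(d, n_m, n_b, t_s, t_w):
    return 2 * d * n_m * t_s + 2 * d * (2 * n_b + n_m) * t_w       # (6)

# on a 20x20 grid (p_d = 20) the inner product term (3) dominates (5):
t3 = t_gmres_acc_bcast(30, 20, t_s=1.0, t_w=1.0)
t5 = t_gmres_bde(30, d=1, n_m=4, n_b=100, t_s=1.0, t_w=1.0)
```

For nearest-neighbor boundary exchanges (d = 1) the matvec term stays bounded as P grows, whereas (3) grows with √P, which is the point made in the next section.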
4 Communication overhead reduction in GMRES(m)

From the expressions (1), (3) and (5) we conclude that the communication cost for GMRES(m) is of the order O(m²√P), and for large processor grids this will become a bottleneck. Moreover, in the standard implementation we cannot reduce these costs by accumulating multiple inner products together (saving on start-up times), or overlap this expensive communication with computation (reducing the runtime lost in communication). The problem stems from the fact that the modified Gram-Schmidt orthogonalization of a single vector against some set of vectors, followed by its normalization, is an inherently sequential process. However, if the modified Gram-Schmidt orthogonalization of a set of vectors is considered, there is no such problem, since the orthogonalizations of all intermediate vectors against the previously orthogonalized vectors are independent. Therefore, we can compute several or all of the local inner products first and then accumulate the subresults collectively. Suppose the set of vectors v_1, v̂_2, v̂_3, ..., v̂_{m+1} has to be orthogonalized, where ‖v_1‖ = 1. The modified Gram-Schmidt process can be implemented as sketched in Figure 4.

    for i = 1, ..., m do
        orthogonalize v̂_{i+1}, ..., v̂_{m+1} against v_i
        v_{i+1} = v̂_{i+1} / ‖v̂_{i+1}‖
    end

Figure 4: a block-wise modified Gram-Schmidt orthogonalization

This reduces the number of accumulations to only m, instead of ½(m² + 3m) for the usual implementation of GMRES(m), but the length of the messages has increased. In this way, start-up time is saved by packing the small messages corresponding to one block of orthogonalizations into one larger message. Moreover, we also reduce the amount of data transfer, because we have fewer message headers. Instead of computing all local inner products in one block and accumulating these partial results only once for the whole block, it is preferable to split each step into two blocks of orthogonalizations, since this offers the possibility to overlap with communication.
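The block-wise step of Figure 4 can be sketched in a few lines; the sequential Python below only illustrates the independence of the inner products within one step, it is not the distributed implementation.

```python
# A sketch of block-wise modified Gram-Schmidt (Figure 4): in step i the
# inner products of v_i with all remaining vectors are independent, so on a
# distributed machine their local parts could be packed into one message.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def block_mgs(vs):
    # Orthonormalizes vs[0], ..., vs[m] in place; vs[0] must have unit norm.
    m = len(vs) - 1
    for i in range(m):
        # one block of independent inner products (one collective accumulation)
        h = [dot(vs[i], vs[k]) for k in range(i + 1, m + 1)]
        for k in range(i + 1, m + 1):
            vs[k] = [a - h[k - i - 1] * b for a, b in zip(vs[k], vs[i])]
        nrm = dot(vs[i + 1], vs[i + 1]) ** 0.5
        vs[i + 1] = [a / nrm for a in vs[i + 1]]
    return vs

q = block_mgs([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 1.0, 1.0]])
```

Numerically this performs exactly the same operations as standard modified Gram-Schmidt on the same set of vectors, only grouped per step.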
This overlap is achieved by performing the accumulation and broadcast of the local inner products of the first block concurrently with the computation of the local inner products of the second block, and performing the accumulation and broadcast of the local inner products of the second block concurrently with the vector updates of the first block, see Figure 5. Note that the computation time for this approach is equal to that of the standard modified Gram-Schmidt algorithm. For the parallel `overlapped' implementation of the modified Gram-Schmidt algorithm given in Figure 5, we will neglect potential effects of overlap of the communication with computation on a single processor. We will only consider overlapping the time that a processor is not active in the global accumulation and broadcast with useful computational work. If we assume that sufficient computational work can be done to completely fill this time, the communication cost T^gmr_a+b, see (3), reduces to only the communication time spent locally by a processor. This `local' communication cost for the accumulation and broadcast of a vector of k double precision numbers is given by 4t_s + 4(2k + 1)t_w, for a receive and a send in the accumulation phase and a receive and a send in the broadcast phase, if the processor only participates in the accumulation
along the x-direction, and by 8t_s + 8(2k + 1)t_w if the processor also participates in the accumulation along the y-direction. The latter case is obviously the most important, since all processors finish the modified Gram-Schmidt algorithm more or less at the same time. The communication cost of the entire parallel modified Gram-Schmidt algorithm (mgs) now becomes

    T^l_mgs = 16m t_s + 8(m² + 5m) t_w.    (7)

    for i = 1, ..., m do
        split v̂_{i+1}, ..., v̂_{m+1} into two blocks
        compute local inner products (LIPs) of block 1
        { accumulate LIPs of block 1 ‖ compute LIPs of block 2 }
        update v̂_{i+1}, compute the LIP for ‖v̂_{i+1}‖, place this LIP into block 2
        { accumulate LIPs of block 2 ‖ update vectors of block 1 }
        update vectors of block 2
        normalize v̂_{i+1}
    end

Figure 5: the implementation of the modified Gram-Schmidt process (‖ denotes concurrent execution)

In general we may not have enough computational work to overlap all the communication time in a global communication process. For the wall clock time of (parallel) operations, it is the longest time that matters. Here it is the global communication time for the modified Gram-Schmidt algorithm (mgs):

    T^g_mgs = 4m p_d t_s + 2(m² + 5m) p_d t_w.    (8)

Since the communication is partly overlapped, the communication cost is in general significantly lower than the communication time, and then it may still be better described by (7) instead of (8). Two important facts are highlighted by expressions (3), (7) and (8). First, assuming sufficient computational work, the contribution of start-up times to the communication cost is reduced from O(m²√P) in the standard GMRES(m) (3) to O(m) using the parallel modified Gram-Schmidt algorithm (7). Especially for machines with relatively high start-up times this is important. In fact, if the start-ups dominate the communication cost, then we can reduce this contribution by a factor of two by using the algorithm given in Figure 4 (even if we neglect the overlap).
Second, assuming sufficient computational work, the communication cost no longer depends on the size of the processor grid: instead of being of the order of the diameter p_d of the processor grid, it is now more or less constant. If we lack sufficient computational work, the communication cost is described by (8) minus the time for the overlapped computation.
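The gap between (3), (7) and (8) is easy to make concrete in code; the parameter values below (m = 30 on a 20×20 grid, t_s = t_w = 1 in arbitrary units) are illustrative.

```python
# Equations (7) and (8), next to (3), to make the reduction concrete.

def t_mgs_local(m, t_s, t_w):
    return 16 * m * t_s + 8 * (m * m + 5 * m) * t_w                 # (7)

def t_mgs_global(m, p_d, t_s, t_w):
    return 4 * m * p_d * t_s + 2 * (m * m + 5 * m) * p_d * t_w      # (8)

def t_gmres_acc_bcast(m, p_d, t_s, t_w):
    return (m * m + 3 * m) * p_d * (t_s + 3 * t_w)                  # (3)

costs = (t_mgs_local(30, 1.0, 1.0),          # fully overlapped cost
         t_mgs_global(30, 20, 1.0, 1.0),     # total communication time
         t_gmres_acc_bcast(30, 20, 1.0, 1.0))  # standard, nonoverlapped
```

With these values the local cost (7) is an order of magnitude below the nonoverlapped cost (3), and (8) comes out at roughly half of (3), consistent with the discussion of transputer-like machines later in the paper.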
In order to be able to use this parallel modified Gram-Schmidt algorithm in GMRES(m), a basis for the Krylov subspace has to be generated first. The idea to first generate a basis for the Krylov subspace and then to orthogonalize this basis was already suggested for the CG algorithm, referred to as s-step CG, in [2] for shared (hierarchical) memory parallel vector processors. In [2] it is also reported that the s-step CG algorithm may converge slowly due to numerical instability for s > 5. In the pargmres(m) algorithm stability seems to be much less of a problem, since each vector is explicitly orthogonalized against all the other vectors, and we generate a polynomial basis for the Krylov subspace such as to minimize the condition number; see [1], where the Krylov subspace is generated first to exploit higher level BLAS in the orthogonalization, and [6]. The basis vectors v̂_i for the Krylov subspace are generated as indicated in Figure 6, where the parameters d_i are used to keep the condition number of the matrix [v_1, v̂_2, ..., v̂_{m+1}] sufficiently small.

    v̂_1 = v_1 = r/‖r‖
    for i = 1, ..., m do
        v̂_{i+1} = v̂_i − d_i A v̂_i
    end

Figure 6: Generation of a polynomial basis for the Krylov subspace

Bai, Hu, and Reichel [1] discuss a strategy for this. Their idea is to use one cycle of standard GMRES(m). Then the eigenvalues of the resulting Hessenberg matrix, which approximate those of A, are used in the so-called Leja ordering as the parameters d_i^{−1} in the rest of the modified GMRES(m) cycles. Their examples indicate that the convergence of such a GMRES(m) is (virtually) the same as that of standard GMRES(m). This is also borne out by our experience. Therefore, in the next section we limit our experiments to the evaluation of a single GMRES(m) cycle. Our parallel computation of the Krylov subspace basis requires m extra daxpys. It is obvious from (1) that this cost is negligible.
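The recurrence of Figure 6 is a one-liner per step; in this sketch the shifts d are assumed given (e.g. derived from the Leja-ordered Ritz values discussed above), and the small matrix in the example is illustrative.

```python
# The polynomial basis generation of Figure 6: v_{i+1} = (I - d_i A) v_i.

def matvec(A, u):
    return [sum(a * x for a, x in zip(row, u)) for row in A]

def polynomial_basis(A, r, d):
    nrm = sum(x * x for x in r) ** 0.5
    vs = [[x / nrm for x in r]]                     # v_1 = r / ||r||
    for d_i in d:                                   # i = 1, ..., m
        Av = matvec(A, vs[-1])
        vs.append([v - d_i * a for v, a in zip(vs[-1], Av)])
    return vs                                       # m + 1 basis vectors

A = [[2.0, 1.0], [1.0, 2.0]]
basis = polynomial_basis(A, [3.0, 4.0], d=[0.5])
```

Each step is a matvec plus one daxpy, which is where the m extra daxpys mentioned above come from.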
However, for completeness we give the computation time including these extra daxpys:

    T^gmr_cmp2 = ( 2m(m + 4) + 4n_z(m + 1) ) (N/P) t_fl.    (9)

Because we generate the Krylov subspace basis first and then orthogonalize it, the Hessenberg matrix that we obtain from the inner products is not V^T_{m+1} A V_m, as in the standard GMRES(m) algorithm, and therefore we need to solve the least squares problem in a slightly different way. Define v̂_1 = v_1 = ‖r‖^{−1} r, and generate the other basis vectors as v̂_{i+1} = (I − d_i A) v̂_i, for i = 1, ..., m. This gives the following relation:

    [v̂_2, v̂_3, ..., v̂_{m+1}] = V̂_m − A V̂_m D_m,    (10)

where D_m = diag(d_i) and V̂_m is the matrix with the vectors v̂_i as its columns. This relation between vectors and matrices composed from these vectors will be used throughout this discussion. The parallel modified Gram-Schmidt orthogonalization gives the orthogonal set of vectors {v_1, ..., v_{m+1}}, for which we have

    v_{j+1} = h^{−1}_{j+1,j+1} ( v̂_{j+1} − Σ_{i=1}^{j} h_{i,j+1} v_i ),  for j = 1, ..., m,    (11)
where h_{i,j} is defined by (but computed differently)

    h_{i,j} = (v_i, v̂_j)  for i ≤ j,    h_{i,j} = 0  for i > j.    (12)

Notice the subtle difference with the definition of H̄_m in the standard implementation of GMRES(m). Here the matrix H_{m+1} is upper triangular. Furthermore, as long as h_{i,i} ≠ 0 the matrix H_i is nonsingular, whereas h_{i,i} = 0 indicates a lucky breakdown. We will further assume, without loss of generality, that h_{i,i} ≠ 0, for i = 1, ..., m + 1. Let h_i denote the i-th column of H_{m+1}. From equations (11) and (12) it follows that

    V̂_i = V_i H_i,  for i = 1, ..., m + 1.    (13)

Equation (10) can be rewritten as

    V̂_m − [v̂_2, ..., v̂_{m+1}] = A V̂_m D_m = A V_m H_m D_m.    (14)

Define Ĥ_m = [h_1, h_2, ..., h_m] − [h_2, h_3, ..., h_{m+1}], so that Ĥ_m is an upper Hessenberg matrix of rank m, since h_{i,i} ≠ 0, for i = 1, ..., m + 1. Substituting this in (14) finally leads to

    V_{m+1} Ĥ_m = A V_m H_m D_m.    (15)

Using this expression the least squares problem can be solved in the same way as for standard GMRES(m):

    min_y ‖r − A V_m y‖ = min_ŷ ‖r − A V_m H_m D_m ŷ‖,  where H_m D_m ŷ = y.    (16)

Because H_m and D_m are nonsingular, the latter by definition, H_m D_m ŷ = y is always well-defined. Combining (15) and (16) yields

    min_ŷ ‖r − V_{m+1} Ĥ_m ŷ‖ = min_ŷ ‖ ‖r‖ e_1 − Ĥ_m ŷ ‖.    (17)

The additional computational work in this approach is only O(m²) and therefore negligible. We will refer to this adapted version of GMRES(m) as pargmres(m).

5 Performance of GMRES(m) and pargmres(m)

Before we discuss the experiments below, we present a short theoretical analysis. The communication time for the exchange of boundary data and the computation time for the m additional vector updates in the pargmres(m) implementation will be neglected in this analysis, because they are relatively unimportant. The runtime of a GMRES(m) cycle on P ≥ 4 processors is then given by T_P = T^gmr_cmp1 + T^gmr_a+b, see (1) and (3):

    T_P = ( 2(m² + 3m) + 4n_z(m + 1) ) t_fl N/P + (m² + 3m)(t_s + 3t_w) √P.    (18)

This equation shows that for sufficiently large P the communication will dominate.
Following the analysis in [5] we introduce the value P_max as the number of processors that minimizes the runtime of GMRES(m). We have studied the performance of GMRES(m) and pargmres(m) for numbers of processors less than or approximately equal to P_max. Note that for pargmres(m)
we can improve the performance further with more processors than P_max, because it has a lower communication cost. The cost of communication is reduced in pargmres(m) in two steps. First, we reduce the communication time by accumulating and broadcasting multiple inner products in groups. This reduces the communication time from T^gmr_a+b to T^g_mgs, see (3) and (8). Second, we overlap the non-local part of the remaining communication time with half the computation in the modified Gram-Schmidt algorithm, see Figure 5. The length of the overlap then determines the performance of pargmres(m) and the improvement over GMRES(m). Therefore we introduce the value P_ovl, which is the number of processors for which the overlap is exact. The performance and the improvement are then related to whether P ≤ P_ovl or P > P_ovl, and to how large P_ovl is relative to P_max, because the fraction of the runtime spent in communication increases for increasing P, see (18). We will now give relations for P_max and P_ovl. The minimization of (18) gives

    P_max = ( [4(m² + 3m) + 8n_z(m + 1)] t_fl N / ( (m² + 3m)(t_s + 3t_w) ) )^{2/3},    (19)

and the efficiency E_P = T_1/(P T_P) for P_max processors is given by E_{P_max} = 1/3, where T_1 is T^gmr_cmp1 for P = 1. This means that ⅔ T_{P_max} is spent in communication, because in this model efficiency is lost only through communication. For P_ovl we have that the (total) communication time T^g_mgs, see (8), is equal to the sum of the overlapping computation time, (m² + 2m) t_fl N/P_ovl, and the local communication time T^l_mgs, see (7):

    √P_ovl ( 4m t_s + (2m² + 10m) t_w ) = (m² + 2m) t_fl N/P_ovl + 16m t_s + (8m² + 40m) t_w.    (20)

If P ≤ P_ovl then the communication cost is reduced to T^l_mgs, see (7). This means that the cost of start-ups is reduced by a factor of ((m + 3)/16)√P and the cost of data transfer by a factor of (3/8)√P. Furthermore, as long as P < P_ovl, an increase in the number of processors will not result in an increase of the communication cost, and hence the efficiency remains constant.
If P > P_ovl, then the overlap is no longer complete and the communication cost is given by the communication time minus the computation time of the overlapping computation: T^g_mgs − (m² + 2m) t_fl N/P. The runtime is then given by

    T̃_P = ( (m² + 4m) + 4n_z(m + 1) ) t_fl N/P + ( 4m t_s + (2m² + 10m) t_w ) √P.    (21)

For P > P_ovl we see that the efficiency decreases again, because the communication time increases and the computation time of the overlap decreases. Equation (20) gives

    P_ovl ≈ ( (m² + 2m) t_fl N / ( 4m t_s + (2m² + 10m) t_w ) )^{2/3}.    (22)

Comparing (19) with (22), we see that if t_s dominates the communication, that is t_s ≫ t_w, then P_ovl > P_max and we always have P ≤ P_ovl, so that we can overlap all communication after the reduction of start-ups. This means that we can reduce the runtime by almost a factor of three. For transputers we have t_s ≈ t_w, and comparing (19) and (22) we see that P_ovl < P_max. One can prove that the improvement of pargmres(m) compared to GMRES(m), T_P/T̃_P, as a function of P is either constant or a strictly increasing or decreasing function. The maximum improvement is therefore found for either P = P_ovl or P = P_max.
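Relations (19) and (22) can be evaluated directly; the machine parameters below are illustrative, chosen only to show the transputer-like regime t_s ≈ t_w.

```python
# Equations (19) and (22): the processor count that minimizes the GMRES(m)
# runtime, and the one for which the overlap in pargmres(m) is exact.

def p_max(m, n_z, N, t_fl, t_s, t_w):
    num = (4 * (m * m + 3 * m) + 8 * n_z * (m + 1)) * t_fl * N
    den = (m * m + 3 * m) * (t_s + 3 * t_w)
    return (num / den) ** (2.0 / 3.0)                               # (19)

def p_ovl(m, N, t_fl, t_s, t_w):
    num = (m * m + 2 * m) * t_fl * N
    den = 4 * m * t_s + (2 * m * m + 10 * m) * t_w
    return (num / den) ** (2.0 / 3.0)                               # (22)

# with t_s comparable to t_w (transputer-like), P_ovl comes out below P_max:
pm = p_max(30, 5, 90_000, t_fl=1.0, t_s=1.0, t_w=1.0)
po = p_ovl(30, 90_000, t_fl=1.0, t_s=1.0, t_w=1.0)
```

Varying t_s/t_w in this sketch reproduces the two regimes discussed above: for start-up-dominated machines P_ovl exceeds P_max, while for t_s ≈ t_w it falls below it.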
For P = P_ovl, the communication time is strongly reduced. Furthermore, (19) and (22) indicate that for m large enough P_ovl ≈ (1/2)^{2/3} P_max, which means that the efficiency at P_ovl is less than about 50%. Therefore we may expect an improvement by about a factor of two. For P = P_max the runtime is given by (21). When t_s ≈ t_w we get T^g_mgs ≈ ½ T^gmr_a+b, and we may say that due to the overlap the cost of computation is reduced by (m² + 2m) t_fl N/P, that is, approximately by a factor of

    ( 2m² + 6m + 4n_z(m + 1) ) / ( m² + 4m + 4n_z(m + 1) ) ≈ ( 2m + 6 + 4n_z ) / ( m + 4 + 4n_z ),

which is a little less than a factor of two. Hence we may expect an improvement by a factor of about two in this case also. We now discuss our experimental observations on the parallel performance of GMRES(m) and the adapted algorithm pargmres(m) on the 400-transputer machine. We will only consider the performance of one (par)gmres(m) cycle, because both algorithms take about the same number of iterations, which generally leads to the same number of GMRES(m) cycles, with only a possible difference in the last cycle. The difference may be that GMRES(m) stops before it completes the full m iterations of the last cycle. This gives on average a difference of only half a GMRES(m) cycle, which is often more than compensated by the much better performance of pargmres(m) in the other cycles. In our experiments we used square processor grids (minimal diameter), and this is optimal for GMRES(m). For other processor grids the degradation of performance for GMRES(m) will be even worse. The pargmres(m) algorithm is much less sensitive to the diameter of the processor grid. We have solved a convection diffusion problem, discretized by finite volumes over a grid, resulting in the familiar five-diagonal matrix with a tridiagonal block structure, corresponding to the 5-point star.
This relatively small problem size was chosen because, for processor grids of increasing size, it shows very well both the degradation of the performance of GMRES(m) and the large improvements of pargmres(m) over GMRES(m). As we will see, the pargmres(m) variant has much better scaling properties than GMRES(m). The measured runtimes for a single (par)gmres(m) cycle are listed in Table 1 for m = 30 and m = 50. For m = 30 we have P_max ≈ 400 and P_ovl ≈ 236; for m = 50 we have P_max ≈ 375 and P_ovl ≈ 244. We give speed-ups and efficiencies in Table 2. These are calculated from the measured runtimes of GMRES(m) and pargmres(m) and an estimated sequential runtime for GMRES(m), because the problem was too large to run on a single processor. The estimated T_1 is the net computation time derived from (1). We mention that for CG (see Section 7) the measured T_1 is approximately 9% less than the estimated T_1, but this is not necessarily the case for GMRES(m) too. The difference between the estimated sequential runtime and the measured one for CG is probably due to a simpler implementation (e.g., less indirect addressing and copying of buffers) for the sequential program, which results in a higher (average) flop-rate.

Table 1: measured runtimes (s) for GMRES(m) and pargmres(m), for m = 30 and m = 50, per processor grid (table data not preserved in this transcription).

Table 2: efficiencies E (%) and speed-ups S for GMRES(m) and pargmres(m), for m = 30 and m = 50, based on measured runtimes and an estimated sequential runtime for GMRES(m) (table data not preserved in this transcription).

The runtime for GMRES(m) is reduced by approximately 25% when increasing the number of processors from 100 to 196. When increasing this from 100 to 289, the runtime reduces only by some 35%. When we further increase the number of processors to 400, the runtime is already larger than for 289 processors, which is in agreement with the previous discussion, because P ≈ P_max for m = 30 and P > P_max for m = 50. Hence the cost of communication spoils the performance of GMRES(m) completely for large P. On the other hand, for pargmres(m) the runtime reduction when increasing from 100 to 196 processors is approximately 45%, where the upper bound is 49%, so this is almost optimal. Such a speed-up shows that the efficiency remains almost constant for this increase in the number of processors; see also Table 2. This is to be expected because we have P < P_ovl, so that any increase in the communication time of the inner products is more than compensated by the overlapping computation. On 289 processors the runtime is about 53% of the runtime on 100 processors, which is still quite good. If we continue to increase the number of processors, we see that for 400 processors the runtime is not much better than for 289 processors, although it is still decreasing. At this point the speed-up for pargmres(m) levels off, because there is insufficient computational work to overlap the communication (P > P_ovl).
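These measured gains can be set against the predicted asymptotic improvement factor. The sketch below evaluates the reconstructed ratio (2m^2 + 6m + 4 n_z (m+1)) / (m^2 + 4m + 4 n_z (m+1)) and its large-m simplification; the exact coefficients are reconstructed from context, so treat the numbers as indicative only.

```python
def improvement_factor(m, n_z=5):
    """Predicted runtime ratio GMRES(m)/pargmres(m) near P = P_max,
    together with its large-m simplification."""
    full = (2*m**2 + 6*m + 4*n_z*(m + 1)) / (m**2 + 4*m + 4*n_z*(m + 1))
    approx = (2*m + 6 + 4*n_z) / (m + 4 + 4*n_z)
    return full, approx

for m in (30, 50):
    print(m, improvement_factor(m))   # a little less than a factor of two
```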
A direct comparison between the runtimes of GMRES(m) and pargmres(m) shows that, for 100 processors, GMRES(m) is about 25% slower than pargmres(m). However, for 196 processors this has already increased to 65% and 81% for m = 30 and m = 50, respectively. From then on the relative difference increases more gradually, to a maximum of about a factor of two for P_max processors. These results are very much in agreement with our theoretical expectations. Note that although the maximum is reached for P_max processors, the improvement is already substantial for 196 processors, which is near P_ovl.

In Table 4 we give the estimated runtimes from expressions (1), (3), and (5) for GMRES(m) and formulas (5), (7), and (9) for pargmres(m). Table 3 gives a short overview of the relevant parameters and their meaning (see Section 3); if the value of a parameter is fixed, its value is given as well. The parameters d, n_z and n_m are derived from our model problem and implementation; the parameters t_s, t_w and t_fl have been determined experimentally. A comparison of the estimates with the measured execution times indicates that the formulas are quite accurate, except for the 400-processor case. The first reason for this discrepancy is that for both algorithms the neglected costs become more important when the size of the local problem is small. These neglected costs are due to, e.g., the copying of buffers for communication and indirect addressing using exterior data, the organization of the communication, and the solution of the least squares problem.

Table 3: parameters and their meaning (fixed values in parentheses)
  t_w  (4.80 μs)  communication word rate
  t_s  (5.30 μs)  communication start-up time
  t_fl (3.00 μs)  average time for a single floating-point operation
  d    (1)        (maximum) number of communication steps in the boundary exchange
  n_m  (4)        number of messages (to send and receive) in the boundary exchange
  n_z  (5)        average number of non-zero elements per row in the matrix
  p_d             maximum distance to the `most central' processor
  n_b             (maximum) number of boundary data elements on a processor
  m               size of the Krylov subspace over which (par)gmres(m) minimizes

Table 4: estimated runtimes (s) for GMRES(m) and pargmres(m), for m = 30 and m = 50, per processor grid, together with p_d, N_l and n_b (table data not preserved in this transcription).

For the pargmres(m) algorithm there is a second and more important reason, viz. that due to the small size of the local problem we can no longer assume an almost complete overlap of the communication in the modified Gram-Schmidt algorithm (P > P_ovl). This is illustrated in Table 5, which gives estimates for the two overlapping parts given in (20). We refer to the sum of the local communication time and half of the computation time in the modified Gram-Schmidt algorithm as comp, and to the total communication time for the accumulation as comm. Already for the 17 × 17 grid we do not have a complete overlap, although the overlap is still good. For the 20 × 20 processor grid an overlap of about 55% is already the maximum. Obviously, for a larger problem this would improve.

6 Communication overhead reduction in CG

For a reduction of the communication overhead in preconditioned CG we follow the approach suggested in [7]. In that approach the operations are rescheduled to create more opportunities for overlap. This leads to an algorithm (parcg) like the one given in Figure 7, where we have assumed that the preconditioner K can be written as K = LL^T.
For a discussion of the ideas behind this scheme we refer to [7]. For our purposes it is relevant to point at the inner products at lines (1), (2) and (3). The communication for these inner products is overlapped by the computational work in the following line. We split the preconditioner to create an overlap for the inner products (1) and (3), and we have extra overlap possibilities, since the inner product (2) is followed by the update for x corresponding to the previous iteration step.

Table 5: comparison of the estimated costs for overlapping computation (comp) and `global' communication (comm) of the modified Gram-Schmidt implementation in pargmres(m), for m = 30 and m = 50, per processor grid (table data not preserved in this transcription).

Under the assumption of a complete overlap for the time that a processor is not active in the accumulation and broadcast of the inner products, and following the derivation of (7), the communication cost for the three inner products in a parcg iteration reduces from T^cg_{a+b}, see (4), to the communication time spent locally by a processor:

    T^{cg,l}_{a+b} = 24 (t_s + 3 t_w).    (23)

Therefore, the communication cost is reduced from O(√P) to O(1), which means that (in theory) the communication cost is independent of the processor grid size.

7 Performance of CG variants

We will follow closely the lines set forth in the analysis for (par)gmres(m) in Section 5. The communication time for the exchange of boundary data will be neglected in this analysis, because it is relatively unimportant for our kind of model problems. The problem-dependent and the machine-dependent parameters have the same values as in the discussion for GMRES(m); see Table 3. The runtime for a CG iteration with P ≥ 4 processors is given by T_P = T^cg_cmp + T^cg_{a+b}, see (2) and (4):

    T_P = (9 + 4 n_z) t_fl N/P + 6 (t_s + 3 t_w) √P.    (24)

This expression shows that for sufficiently large P the communication time will dominate. Here we can also define a P_max as the number of processors that gives the minimal runtime, and a P_ovl as the number of processors for which the (total) communication time of the inner products T^cg_{a+b} (see (4)) is equal to the sum of the computation time of the preconditioner and one vector update (2 n_z t_fl N/P) and the local communication time T^{cg,l}_{a+b} (see (23)). Minimizing the runtime (24) with respect to P gives

    P_max = ( (18 + 8 n_z) N t_fl / (6 (t_s + 3 t_w)) )^{2/3}.    (25)

For P = P_max processors the efficiency E_P = T_1/(P T_P) is again E_{P_max} = 1/3, where T_1 = P T^cg_cmp; therefore, the communication time is (2/3) T_{P_max}. The value of P_ovl is given by

    6 (t_s + 3 t_w) √P_ovl = 2 n_z t_fl N/P_ovl + 24 (t_s + 3 t_w).    (26)
parcg:
    x_{-1} = x_0 = initial guess
    r_0 = b − A x_0
    p_{-1} = 0;  α_{-1} = 0
    s = L^{-1} r_0
    ρ_{-1} = 1
    for i = 0, 1, 2, ... do
(1)     ρ_i = (s, s)
        w_i = L^{-T} s
        β_{i-1} = ρ_i / ρ_{i-1}
        p_i = w_i + β_{i-1} p_{i-1}
        q_i = A p_i
(2)     ξ_i = (p_i, q_i)
        x_i = x_{i-1} + α_{i-1} p_{i-1}
        α_i = ρ_i / ξ_i
        r_{i+1} = r_i − α_i q_i
(3)     compute ||r_{i+1}||
        s = L^{-1} r_{i+1}
        if accurate enough then x_{i+1} = x_i + α_i p_i; quit
    end

Figure 7: The parcg algorithm

For P ≤ P_ovl the communication cost is reduced from T^cg_{a+b} to T^{cg,l}_{a+b}, which gives a reduction by a factor of (1/4)√P. For P > P_ovl the communication cost is given by T^cg_{a+b} − 2 n_z t_fl N/P. A comparison of (25) and (26) shows that P_ovl < P_max. Even though the preconditioner is strongly problem- and implementation-dependent, this holds in general, because for P = P_ovl the communication time is equal to only a part of the computation time, whereas for P = P_max the communication time is already twice the computation time.

This leads to three phases in the performance of parcg. Let a be the computation time, for γ ∈ [0, 1] let γa be the computation time of the `potential' overlap, and let c be the communication time. Then the runtime of CG is given by a + c, whereas for parcg it is given by (1 − γ)a + max(γa, c). For increasing P, a decreases and c increases, as described above. For small P (c ≪ γa, P ≪ P_ovl) all communication can be overlapped, but the communication time is relatively unimportant. For medium P (c ≈ γa, P ≈ P_ovl) the communication time is more or less in balance with the computation time of the overlap, and the improvement is maximal; see below. For large P (c ≫ γa, P ≫ P_ovl) the communication time is dominant, and we do not have enough computational work to overlap it sufficiently. It is easy to prove that the fraction (a + c)/((1 − γ)a + max(γa, c)) is maximal if γa = c, that is, for P = P_ovl, and then the improvement is

    (a + c) / ((1 − γ)a + max(γa, c)) = (a + γa)/a = 1 + γ.    (27)

Hence, the maximum improvement of parcg over CG is determined by this fraction. The larger this fraction is, the larger the maximum improvement by parcg.
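On one process the rescheduling in Figure 7 only reorders the operations, so the algorithm can be checked sequentially. The sketch below is a hypothetical NumPy transcription (dense triangular solves stand in for the preconditioner application; in a parallel run the lines after each inner product would overlap its accumulation):

```python
import numpy as np

def parcg(A, b, L, tol=1e-10, maxit=500):
    """parcg (Figure 7): preconditioned CG with K = L L^T in which the
    update of x lags one step, so that in parallel it can overlap the
    accumulation of inner product (2)."""
    x = np.zeros_like(b)              # holds x_{i-1}
    r = b - A @ x
    p_prev = np.zeros_like(b)
    alpha_prev, rho_prev = 0.0, 1.0
    s = np.linalg.solve(L, r)
    for _ in range(maxit):
        rho = s @ s                   # (1); overlapped by the L^{-T} solve
        w = np.linalg.solve(L.T, s)
        beta = rho / rho_prev
        p = w + beta * p_prev
        q = A @ p
        xi = p @ q                    # (2); overlapped by the delayed update:
        x = x + alpha_prev * p_prev   #     x_i = x_{i-1} + alpha_{i-1} p_{i-1}
        alpha = rho / xi
        r = r - alpha * q
        rnorm = np.linalg.norm(r)     # (3); overlapped by the L^{-1} solve
        s = np.linalg.solve(L, r)
        if rnorm <= tol:
            break
        p_prev, alpha_prev, rho_prev = p, alpha, rho
    return x + alpha * p              # complete the delayed update

# 1D diffusion test problem; split preconditioner L = sqrt(diag(A)).
n = 50
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
L = np.diag(np.sqrt(np.diag(A)))
x = parcg(A, np.ones(n), L)
print(np.linalg.norm(np.ones(n) - A @ x))  # small residual
```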
If the computation time of the preconditioner is dominant, e.g. when n_z is large or when we use preconditioners from a factorization with fill-in, then γ ≈ 1, and we can expect an improvement by a factor of two. In our model we have γ = 2 n_z / (9 + 4 n_z) < 1/2, so that for n_z large enough we can expect a reduction by a factor of 1.5. For our model problem we have n_z = 5, so that the improvement is limited to a factor of 1.33.

We will now discuss the results for the parallel implementation of the standard CG algorithm and the adapted version parcg on the 400-transputer machine for a model problem. Since the algorithms are equivalent, they take the same number of iterations, and therefore we only consider the runtime of one single iteration. We have solved a diffusion problem, discretized by finite volumes over a grid, resulting in a symmetric positive definite five-diagonal matrix (corresponding to the 5-point star). We have solved this relatively small problem on processor grids of increasing size. This problem size was chosen because, for processor grids of increasing size, it shows the three different phases mentioned before.

Table 6: measured runtimes T_P (ms) for one CG and one parcg iteration, with speed-up S_P and efficiency E_P (%) relative to the sequential runtime of CG, and the runtime difference (%), per processor grid (table data not preserved in this transcription).

Table 6 gives the measured runtimes for one iteration step, the speed-ups, and the efficiencies for both CG and parcg for several processor grids. The speed-ups and efficiencies are computed relative to the measured sequential runtime of the CG iteration, which is given by T_1 = 0.788 s. Although CG has far fewer inner products than GMRES(m) per iteration (i.e., per matrix-vector product), we observe that the performance levels off fairly quickly. This is in agreement with the findings reported in [5], which show that such behavior is to be expected for any Krylov subspace method. For our test problem we have P_max ≈ 600 and P_ovl ≈ 228.
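With the reconstructed expressions (25) and (26) and the parameter values of Table 3, these thresholds can be reproduced numerically. The sketch below assumes N = 10000 unknowns, a value inferred here from the sequential estimate T_1 ≈ 0.870 s = (9 + 4 n_z) t_fl N rather than stated explicitly in this excerpt:

```python
from math import sqrt

t_s, t_w, t_fl = 5.30, 4.80, 3.00   # microseconds (Table 3)
n_z, N = 5, 10_000                  # N assumed: (9 + 4*n_z) * t_fl * N = 0.87 s

# (25): the processor count that minimises the runtime (24).
p_max = ((18 + 8*n_z) * N * t_fl / (6 * (t_s + 3*t_w))) ** (2 / 3)

# (26): bisection for 6(t_s+3t_w)*sqrt(P) = 2*n_z*t_fl*N/P + 24(t_s+3t_w).
def f(P):
    return 6*(t_s + 3*t_w)*sqrt(P) - 2*n_z*t_fl*N/P - 24*(t_s + 3*t_w)

lo, hi = 4.0, p_max                 # f(lo) < 0 < f(hi)
for _ in range(60):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)

print(round(p_max), round(0.5 * (lo + hi)))   # roughly 600 and 228
```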
For the processor grids that we used we have P < P_max, so that the runtime decreases for increasing numbers of processors, as predicted by our analysis. Note also the large relative difference between P_ovl and P_max, compared with the relatively small difference for GMRES(m). This indicates that for this test problem, with a small n_z and a relatively cheap preconditioner, we have a small γ. Hence, the improvement in the runtime will be limited, as is illustrated in Table 6. We see that the parcg algorithm leads to better speed-ups than the standard CG algorithm, especially on the 14 × 14 and 17 × 17 processor grids, where the number of processors is closest to P_ovl. Moreover, for parcg we observe that if the number of processors is increased from 100 to 196, the efficiency remains almost constant, and the runtime is reduced by a factor of about 1.75 (against a maximum of 1.96). Just as for GMRES(m), this is predicted by our analysis, because P < P_ovl, so that the increase in the communication time is masked by the overlapping computation. The initial decrease of efficiency when going from 1 to 100 processors is due to a substantial initial overhead.

Table 7: estimated runtimes (ms) for CG and parcg, the non-overlapped communication time, and the corrected estimate for parcg, per processor grid (table data not preserved in this transcription).

This parallel overhead is also illustrated by the fact that the estimated sequential runtime from T^cg_cmp, see (2), is 0.870 s, which is about 10% larger than the measured sequential runtime. The three phases in the performance of parcg are illustrated by the difference in runtime between CG and parcg. For small processor grids the communication time is not very important, and we see only small differences. For processor grids with P near P_ovl the communication and the overlapping computation are in balance, and we see an increase in the runtime difference. For larger processor grids we can no longer overlap the communication, which dominates the runtime, to a sufficient degree, and we see the differences decrease again. We cannot quite match the improvements for pargmres(m), but on the other hand it is important to note that the improvement of parcg comes virtually for free. Besides, for GMRES(m) we have the possibility to combine messages as well as to overlap communication, whereas for CG we can only exploit overlap of communication, unless we combine multiple iterations. Expression (27) indicates that for our problem we cannot expect much more: γ ≈ 1/3, so that the maximum improvement is approximately 33%. This estimate is rather optimistic in view of the large initial parallel overhead. When the computation time of the preconditioner is large or even dominant (γ ≈ 1), the improvement may also be large. This would be the case if n_z is large or when (M)ILU preconditioners with fill-in are used. For many problems this may be a realistic assumption.
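The three phases can be made concrete with the cost model used above: CG costs a + c, parcg costs (1 − γ)a + max(γa, c). A hypothetical sweep over the communication-to-computation ratio shows the improvement peaking at c = γa with value 1 + γ:

```python
def improvement(c, a=1.0, gamma=10/29):
    """Runtime ratio CG/parcg in the overlap cost model;
    gamma = 2*n_z/(9 + 4*n_z) with n_z = 5 gives 10/29."""
    return (a + c) / ((1 - gamma) * a + max(gamma * a, c))

g = 10 / 29
print(improvement(g))                           # peak: equals 1 + gamma
print(improvement(0.1 * g), improvement(10 * g))  # smaller on both sides
```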
Another important observation is that, as long as P > P_ovl, we can increase the computation time of the preconditioner without increasing the runtime of the iteration, because the preconditioner is overlapped with the accumulation and distribution. That means that we can decrease the number of iterations without increasing the runtime per iteration.

In Table 7 we show estimates for the execution times of the CG algorithm and the parcg algorithm. The total cost for CG is computed from (2), (6), and (4), and for parcg we have used (2), (6), and (23). Just as for GMRES(m), the estimates for CG are relatively accurate, except for the 20 × 20 case. Again, this is probably caused by neglected costs in the implementation, which become more important when the local problem size becomes small. For parcg, as well as for pargmres(m), there is also a discrepancy between the measured execution time and the estimated time, due to an incomplete overlap. When we cannot overlap all communication, we can correct the estimate for the runtime of parcg by adding an estimate for the non-overlapped communication time. These corrections can be computed from Table 8 and from the local communication time for one accumulation and broadcast (0.158 ms). Note that we need the computation time for three inner products in one iteration (see (23)). For example, for the 20 × 20 processor grid the computation time of the vector update is not sufficient to overlap the non-local communication time for the accumulation
Stabilization and Acceleration of Algebraic Multigrid Method Recursive Projection Algorithm A. Jemcov J.P. Maruszewski Fluent Inc. October 24, 2006 Outline 1 Need for Algorithm Stabilization and Acceleration
More informationLast Time. Social Network Graphs Betweenness. Graph Laplacian. Girvan-Newman Algorithm. Spectral Bisection
Eigenvalue Problems Last Time Social Network Graphs Betweenness Girvan-Newman Algorithm Graph Laplacian Spectral Bisection λ 2, w 2 Today Small deviation into eigenvalue problems Formulation Standard eigenvalue
More informationKey words. linear equations, polynomial preconditioning, nonsymmetric Lanczos, BiCGStab, IDR
POLYNOMIAL PRECONDITIONED BICGSTAB AND IDR JENNIFER A. LOE AND RONALD B. MORGAN Abstract. Polynomial preconditioning is applied to the nonsymmetric Lanczos methods BiCGStab and IDR for solving large nonsymmetric
More informationCourse Notes: Week 1
Course Notes: Week 1 Math 270C: Applied Numerical Linear Algebra 1 Lecture 1: Introduction (3/28/11) We will focus on iterative methods for solving linear systems of equations (and some discussion of eigenvalues
More informationSolving Large Nonlinear Sparse Systems
Solving Large Nonlinear Sparse Systems Fred W. Wubs and Jonas Thies Computational Mechanics & Numerical Mathematics University of Groningen, the Netherlands f.w.wubs@rug.nl Centre for Interdisciplinary
More informationIn order to solve the linear system KL M N when K is nonsymmetric, we can solve the equivalent system
!"#$% "&!#' (%)!#" *# %)%(! #! %)!#" +, %"!"#$ %*&%! $#&*! *# %)%! -. -/ 0 -. 12 "**3! * $!#%+,!2!#% 44" #% ! # 4"!#" "%! "5"#!!#6 -. - #% " 7% "3#!#3! - + 87&2! * $!#% 44" ) 3( $! # % %#!!#%+ 9332!
More informationScalable Non-blocking Preconditioned Conjugate Gradient Methods
Scalable Non-blocking Preconditioned Conjugate Gradient Methods Paul Eller and William Gropp University of Illinois at Urbana-Champaign Department of Computer Science Supercomputing 16 Paul Eller and William
More informationOn the influence of eigenvalues on Bi-CG residual norms
On the influence of eigenvalues on Bi-CG residual norms Jurjen Duintjer Tebbens Institute of Computer Science Academy of Sciences of the Czech Republic duintjertebbens@cs.cas.cz Gérard Meurant 30, rue
More informationUniversiteit-Utrecht. Department. of Mathematics. Jacobi-Davidson algorithms for various. eigenproblems. - A working document -
Universiteit-Utrecht * Department of Mathematics Jacobi-Davidson algorithms for various eigenproblems - A working document - by Gerard L.G. Sleipen, Henk A. Van der Vorst, and Zhaoun Bai Preprint nr. 1114
More informationCommunication-avoiding Krylov subspace methods
Motivation Communication-avoiding Krylov subspace methods Mark mhoemmen@cs.berkeley.edu University of California Berkeley EECS MS Numerical Libraries Group visit: 28 April 2008 Overview Motivation Current
More informationBarrier. Overview: Synchronous Computations. Barriers. Counter-based or Linear Barriers
Overview: Synchronous Computations Barrier barriers: linear, tree-based and butterfly degrees of synchronization synchronous example : Jacobi Iterations serial and parallel code, performance analysis synchronous
More informationSolving Symmetric Indefinite Systems with Symmetric Positive Definite Preconditioners
Solving Symmetric Indefinite Systems with Symmetric Positive Definite Preconditioners Eugene Vecharynski 1 Andrew Knyazev 2 1 Department of Computer Science and Engineering University of Minnesota 2 Department
More informationON ORTHOGONAL REDUCTION TO HESSENBERG FORM WITH SMALL BANDWIDTH
ON ORTHOGONAL REDUCTION TO HESSENBERG FORM WITH SMALL BANDWIDTH V. FABER, J. LIESEN, AND P. TICHÝ Abstract. Numerous algorithms in numerical linear algebra are based on the reduction of a given matrix
More informationReduced Synchronization Overhead on. December 3, Abstract. The standard formulation of the conjugate gradient algorithm involves
Lapack Working Note 56 Conjugate Gradient Algorithms with Reduced Synchronization Overhead on Distributed Memory Multiprocessors E. F. D'Azevedo y, V.L. Eijkhout z, C. H. Romine y December 3, 1999 Abstract
More informationSolving Ax = b, an overview. Program
Numerical Linear Algebra Improving iterative solvers: preconditioning, deflation, numerical software and parallelisation Gerard Sleijpen and Martin van Gijzen November 29, 27 Solving Ax = b, an overview
More informationDELFT UNIVERSITY OF TECHNOLOGY
DELFT UNIVERSITY OF TECHNOLOGY REPORT 18-05 Efficient and robust Schur complement approximations in the augmented Lagrangian preconditioner for high Reynolds number laminar flows X. He and C. Vuik ISSN
More informationSolving Sparse Linear Systems: Iterative methods
Scientific Computing with Case Studies SIAM Press, 2009 http://www.cs.umd.edu/users/oleary/sccs Lecture Notes for Unit VII Sparse Matrix Computations Part 2: Iterative Methods Dianne P. O Leary c 2008,2010
More informationSolving Sparse Linear Systems: Iterative methods
Scientific Computing with Case Studies SIAM Press, 2009 http://www.cs.umd.edu/users/oleary/sccswebpage Lecture Notes for Unit VII Sparse Matrix Computations Part 2: Iterative Methods Dianne P. O Leary
More information7.2 Steepest Descent and Preconditioning
7.2 Steepest Descent and Preconditioning Descent methods are a broad class of iterative methods for finding solutions of the linear system Ax = b for symmetric positive definite matrix A R n n. Consider
More informationAMS526: Numerical Analysis I (Numerical Linear Algebra)
AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 23: GMRES and Other Krylov Subspace Methods Xiangmin Jiao SUNY Stony Brook Xiangmin Jiao Numerical Analysis I 1 / 9 Minimizing Residual CG
More informationIterative Methods for Solving A x = b
Iterative Methods for Solving A x = b A good (free) online source for iterative methods for solving A x = b is given in the description of a set of iterative solvers called templates found at netlib: http
More information6. Iterative Methods for Linear Systems. The stepwise approach to the solution...
6 Iterative Methods for Linear Systems The stepwise approach to the solution Miriam Mehl: 6 Iterative Methods for Linear Systems The stepwise approach to the solution, January 18, 2013 1 61 Large Sparse
More informationAlgebraic Multigrid as Solvers and as Preconditioner
Ò Algebraic Multigrid as Solvers and as Preconditioner Domenico Lahaye domenico.lahaye@cs.kuleuven.ac.be http://www.cs.kuleuven.ac.be/ domenico/ Department of Computer Science Katholieke Universiteit Leuven
More informationHenk van der Vorst. Abstract. We discuss a novel approach for the computation of a number of eigenvalues and eigenvectors
Subspace Iteration for Eigenproblems Henk van der Vorst Abstract We discuss a novel approach for the computation of a number of eigenvalues and eigenvectors of the standard eigenproblem Ax = x. Our method
More informationBounding the End-to-End Response Times of Tasks in a Distributed. Real-Time System Using the Direct Synchronization Protocol.
Bounding the End-to-End Response imes of asks in a Distributed Real-ime System Using the Direct Synchronization Protocol Jun Sun Jane Liu Abstract In a distributed real-time system, a task may consist
More informationAlternative correction equations in the Jacobi-Davidson method
Chapter 2 Alternative correction equations in the Jacobi-Davidson method Menno Genseberger and Gerard Sleijpen Abstract The correction equation in the Jacobi-Davidson method is effective in a subspace
More information1 Extrapolation: A Hint of Things to Come
Notes for 2017-03-24 1 Extrapolation: A Hint of Things to Come Stationary iterations are simple. Methods like Jacobi or Gauss-Seidel are easy to program, and it s (relatively) easy to analyze their convergence.
More informationSimple iteration procedure
Simple iteration procedure Solve Known approximate solution Preconditionning: Jacobi Gauss-Seidel Lower triangle residue use of pre-conditionner correction residue use of pre-conditionner Convergence Spectral
More informationFrom Stationary Methods to Krylov Subspaces
Week 6: Wednesday, Mar 7 From Stationary Methods to Krylov Subspaces Last time, we discussed stationary methods for the iterative solution of linear systems of equations, which can generally be written
More informationParallel Numerics, WT 2016/ Iterative Methods for Sparse Linear Systems of Equations. page 1 of 1
Parallel Numerics, WT 2016/2017 5 Iterative Methods for Sparse Linear Systems of Equations page 1 of 1 Contents 1 Introduction 1.1 Computer Science Aspects 1.2 Numerical Problems 1.3 Graphs 1.4 Loop Manipulations
More informationPeter Deuhard. for Symmetric Indenite Linear Systems
Peter Deuhard A Study of Lanczos{Type Iterations for Symmetric Indenite Linear Systems Preprint SC 93{6 (March 993) Contents 0. Introduction. Basic Recursive Structure 2. Algorithm Design Principles 7
More informationDELFT UNIVERSITY OF TECHNOLOGY
DELFT UNIVERSITY OF TECHNOLOGY REPORT 16-02 The Induced Dimension Reduction method applied to convection-diffusion-reaction problems R. Astudillo and M. B. van Gijzen ISSN 1389-6520 Reports of the Delft
More informationAMS Mathematics Subject Classification : 65F10,65F50. Key words and phrases: ILUS factorization, preconditioning, Schur complement, 1.
J. Appl. Math. & Computing Vol. 15(2004), No. 1, pp. 299-312 BILUS: A BLOCK VERSION OF ILUS FACTORIZATION DAVOD KHOJASTEH SALKUYEH AND FAEZEH TOUTOUNIAN Abstract. ILUS factorization has many desirable
More informationMultigrid absolute value preconditioning
Multigrid absolute value preconditioning Eugene Vecharynski 1 Andrew Knyazev 2 (speaker) 1 Department of Computer Science and Engineering University of Minnesota 2 Department of Mathematical and Statistical
More informationA short course on: Preconditioned Krylov subspace methods. Yousef Saad University of Minnesota Dept. of Computer Science and Engineering
A short course on: Preconditioned Krylov subspace methods Yousef Saad University of Minnesota Dept. of Computer Science and Engineering Universite du Littoral, Jan 19-3, 25 Outline Part 1 Introd., discretization
More informationMathematics Research Report No. MRR 003{96, HIGH RESOLUTION POTENTIAL FLOW METHODS IN OIL EXPLORATION Stephen Roberts 1 and Stephan Matthai 2 3rd Febr
HIGH RESOLUTION POTENTIAL FLOW METHODS IN OIL EXPLORATION Stephen Roberts and Stephan Matthai Mathematics Research Report No. MRR 003{96, Mathematics Research Report No. MRR 003{96, HIGH RESOLUTION POTENTIAL
More informationThe Lanczos and conjugate gradient algorithms
The Lanczos and conjugate gradient algorithms Gérard MEURANT October, 2008 1 The Lanczos algorithm 2 The Lanczos algorithm in finite precision 3 The nonsymmetric Lanczos algorithm 4 The Golub Kahan bidiagonalization
More informationModelling and implementation of algorithms in applied mathematics using MPI
Modelling and implementation of algorithms in applied mathematics using MPI Lecture 3: Linear Systems: Simple Iterative Methods and their parallelization, Programming MPI G. Rapin Brazil March 2011 Outline
More informationLecture 17: Iterative Methods and Sparse Linear Algebra
Lecture 17: Iterative Methods and Sparse Linear Algebra David Bindel 25 Mar 2014 Logistics HW 3 extended to Wednesday after break HW 4 should come out Monday after break Still need project description
More informationJos L.M. van Dorsselaer. February Abstract. Continuation methods are a well-known technique for computing several stationary
Computing eigenvalues occurring in continuation methods with the Jacobi-Davidson QZ method Jos L.M. van Dorsselaer February 1997 Abstract. Continuation methods are a well-known technique for computing
More informationA DISSERTATION. Extensions of the Conjugate Residual Method. by Tomohiro Sogabe. Presented to
A DISSERTATION Extensions of the Conjugate Residual Method ( ) by Tomohiro Sogabe Presented to Department of Applied Physics, The University of Tokyo Contents 1 Introduction 1 2 Krylov subspace methods
More informationParallel programming using MPI. Analysis and optimization. Bhupender Thakur, Jim Lupo, Le Yan, Alex Pacheco
Parallel programming using MPI Analysis and optimization Bhupender Thakur, Jim Lupo, Le Yan, Alex Pacheco Outline l Parallel programming: Basic definitions l Choosing right algorithms: Optimal serial and
More informationA Hybrid Method for the Wave Equation. beilina
A Hybrid Method for the Wave Equation http://www.math.unibas.ch/ beilina 1 The mathematical model The model problem is the wave equation 2 u t 2 = (a 2 u) + f, x Ω R 3, t > 0, (1) u(x, 0) = 0, x Ω, (2)
More information