
Reducing the effect of global communication in GMRES(m) and CG on parallel distributed memory computers

(Technical Report 832, Mathematical Institute, University of Utrecht, October 1993)

E. de Sturler, Faculty of Technical Mathematics and Informatics, Delft University of Technology, Mekelweg 4, Delft, The Netherlands

and

H. A. van der Vorst, Mathematical Institute, Utrecht University, Budapestlaan 6, Utrecht, The Netherlands

Abstract. In this paper we study possibilities to reduce the communication overhead introduced by inner products in the iterative solution methods CG and GMRES(m). The performance of these methods on parallel distributed memory machines is often limited because of the global communication required for the inner products. We investigate two ways of improvement. One is to assemble the results of a number of local inner products of a processor and to accumulate them collectively. The other is to try to overlap communication with computation. The matrix-vector products may also introduce some communication overhead, but for many relevant problems this involves communication with only a few nearby processors, and this does not necessarily degrade the performance of the algorithm.

Key words. parallel computing, distributed memory computers, conjugate gradient methods, performance, GMRES, modified Gram-Schmidt.

AMS(MOS) subject classification. 65Y05, 65F10, 65Y20.

1 Introduction

The Conjugate Gradients (CG) method [9] and the GMRES(m) method [11] are widely used methods for the iterative solution of specific classes of linear systems. The time-consuming kernels in these methods are: inner products, vector updates, and matrix-vector products (including preconditioning operations). In many situations, especially when the matrix operations are well-structured, these operations are suited for implementation on vector computers and shared memory parallel computers [8].

(The first author wishes to acknowledge Shell Research B.V. and STIPT for the financial support of his research.)

For parallel distributed memory machines the picture is entirely different. In general the vectors are distributed over the processors, so that even when the matrix operations can be implemented efficiently by parallel operations, we cannot avoid the global communication required for inner product computations. These global communication costs become relatively more and more important when the number of parallel processors is increased, and thus they have the potential to affect the scalability of the algorithms in a very negative way [5].

This aspect has received much attention, and several approaches have been suggested to improve the performance of these algorithms. For CG the approaches come down to reformulating the orthogonalization part of the algorithm, so that the required inner products can be computed in the same phase of the iteration step (see, e.g., [4, 10]), or to combining the orthogonalization for several successive iteration steps, as in the s-step methods [2]. The numerical stability of these approaches is a major point of concern. For GMRES(m) the approach comes down to some variant of the s-step methods [1, 3]. After having generated basis vectors for part of the Krylov subspace by some suitable recurrence relation, they have to be orthogonalized. One often resorts to cheap but potentially unstable methods like Gram-Schmidt orthogonalization.

In the present study we investigate other ways to reduce the global communication overhead due to the inner products. Our approach is to identify operations that may be executed while communication takes place, since our aim is to overlap communication with computation. For CG this is done by rescheduling the operations, without changing the numerical stability of the method [7]. For GMRES(m) it is achieved by reformulating the modified Gram-Schmidt orthogonalization step [5, 6]. For GMRES(m) we also exploit the possibility of packing the results of the local inner products of a processor into one message and accumulating them collectively. We believe that our findings are relevant for other Krylov subspace methods as well, since methods like Bi-CG, and its variants CGS, Bi-CGSTAB, and QMR, have much in common with CG from the implementation point of view. Likewise, the communication problems with GMRES(m) are representative for the problems in methods like ORTHODIR, GENCG, FOM, and ORTHOMIN.

We have carried out our experiments on a 400-processor Parsytec Supercluster at the Koninklijke/Shell-Laboratorium in Amsterdam. The processors are connected in a fixed 20 x 20 mesh, of which arbitrary submeshes can be used. Each processor is a T800-20 transputer. The transputer supports only nearest neighbor synchronous communication; more complicated communication has to be programmed explicitly. The communication rate is fast compared to the flop rate, but by current standards the T800 is a slow processor. Another feature of the transputer is the support of time-shared execution of multiple `parallel' CPU-processes on a single processor, which facilitates the implementation of programs that switch between tasks when necessary (on an interrupt basis), e.g., between communication and computation. Finally, transputers have the possibility of concurrent communication and computation. As a result it is possible to overlap computation and communication on a single processor.

The program that runs on each processor consists of two processes that run time-shared: a computation process and a communication process. The computation process functions as the master. If at some point communication is necessary, the computation process sends the data to the communication process (on the same processor) or requests the data from the communication process, which then handles the actual communication. This organization permits the computation processes on different processors to work asynchronously, even though the actual communication is synchronous. The communication process is given the higher priority, so that if there is something to communicate this is started as soon as possible.

2 The algorithms for GMRES(m) and CG

    Preconditioned CG:
    start:    x_0 = initial guess; r_0 = b - A x_0;
              p_{-1} = 0; beta_{-1} = 0;
              solve for w_0 in K w_0 = r_0;
              rho_0 = (r_0, w_0)
    iterate:  for i = 0, 1, 2, ... do
                p_i = w_i + beta_{i-1} p_{i-1}
                q_i = A p_i
                alpha_i = rho_i / (p_i, q_i)
                x_{i+1} = x_i + alpha_i p_i
                r_{i+1} = r_i - alpha_i q_i
                compute ||r_{i+1}||; if accurate enough then quit
                solve for w_{i+1} in K w_{i+1} = r_{i+1}
                rho_{i+1} = (r_{i+1}, w_{i+1})
                beta_i = rho_{i+1} / rho_i
              end

    Figure 1: the preconditioned CG algorithm

    GMRES(m):
    start:    x_0 = initial guess; r_0 = b - A x_0;
              v_1 = r_0 / ||r_0||
    iterate:  for j = 1, ..., m do
                v̂_{j+1} = A v_j
                for i = 1, ..., j do
                  h_{i,j} = (v̂_{j+1}, v_i)
                  v̂_{j+1} = v̂_{j+1} - h_{i,j} v_i
                end
                h_{j+1,j} = ||v̂_{j+1}||
                v_{j+1} = v̂_{j+1} / h_{j+1,j}
              end
              form the approximate solution:
                x_m = x_0 + V_m y_m, where y_m minimizes || ||r_0|| e_1 - H̄_m y ||, y in R^m
    restart:  compute r_m = b - A x_m;
              if satisfied then stop,
              else x_0 = x_m; v_1 = r_m / ||r_m||; goto iterate

    Figure 2: the GMRES(m) algorithm

In this section we will discuss the time-consuming kernels in CG and GMRES(m): the vector update (daxpy), the preconditioner, the matrix-vector product, and the inner product (ddot); see Figures 1 and 2. Because the results of the inner products are needed on all processors, the Hessenberg matrix H̄_m (see Figure 2) is available on each processor. Hence, the computation of y_m can be done on each processor. This is often efficient, because it would be a synchronization point if implemented on a single processor, and then the other processors would have to wait for the result. However, if the size of the reduced system is large compared to the local number of unknowns, the computation might be expensive enough to make its distribution and parallel solution worthwhile. We have not pursued this idea.

The parallel implementation of the vector update (daxpy) poses no problem, since it involves only local computation.
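The kernel structure is easiest to see in code. Below is a minimal serial NumPy sketch of the preconditioned CG iteration of Figure 1; the test matrix, the Jacobi-like preconditioner solve, and the tolerance are illustrative assumptions and not part of the report. Each iteration contains exactly the three inner products ((p_i, q_i), ||r|| and (r_{i+1}, w_{i+1})) that require global communication in a distributed implementation.

    # A minimal serial NumPy sketch of the preconditioned CG iteration of Figure 1.
    # The test matrix A, the preconditioner solve (K w = r), the right-hand side b,
    # and the tolerance are illustrative assumptions, not taken from the report.
    import numpy as np

    def pcg(A, b, solve_K, x0, tol=1e-8, maxit=200):
        x = x0.copy()
        r = b - A @ x                      # r_0 = b - A x_0
        p = np.zeros_like(b)               # p_{-1} = 0
        beta = 0.0                         # beta_{-1} = 0
        w = solve_K(r)                     # solve K w_0 = r_0
        rho = r @ w                        # rho_0 = (r_0, w_0)
        for i in range(maxit):
            p = w + beta * p               # p_i = w_i + beta_{i-1} p_{i-1}
            q = A @ p                      # q_i = A p_i
            alpha = rho / (p @ q)          # alpha_i = rho_i / (p_i, q_i)
            x = x + alpha * p
            r = r - alpha * q
            if np.linalg.norm(r) < tol:    # compute ||r||, stop if accurate enough
                break
            w = solve_K(r)                 # solve K w_{i+1} = r_{i+1}
            rho, rho_old = r @ w, rho      # rho_{i+1} = (r_{i+1}, w_{i+1})
            beta = rho / rho_old           # beta_i = rho_{i+1} / rho_i
        return x

    if __name__ == "__main__":
        n = 100
        # 1-D Poisson matrix as a stand-in SPD test problem (an assumption)
        A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
        b = np.ones(n)
        jacobi = lambda r: r / np.diag(A)  # trivial diagonal preconditioner solve
        x = pcg(A, b, jacobi, np.zeros(n))
        print(np.linalg.norm(b - A @ x))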

In this paper we restrict ourselves to problems for which the parallelism in the matrix-vector product does not pose serious problems. That is, our model problems have a strong data locality, which is typical for many finite difference and finite element problems. A suitable domain decomposition approach preserves this locality more or less independently of the number of processors, so that the matrix-vector product requires only neighbor-to-neighbor communication, or communication with only a few nearby processors. This could be overlapped with computations for the interior of the domain, but it is relatively less important, since the number of boundary operations is in general an order of magnitude smaller than the number of interior operations (this is the surface-to-volume effect).

The communication overhead introduced by the preconditioner is obviously strongly dependent on the selected preconditioner. Popular preconditioners on sequential computers, like the (M)ILU variants, are highly sequential or introduce irregular communication patterns (as in the hyperplane approach, see [8]), and therefore these are not suitable. Obviously we prefer preconditioners which require only a limited amount of communication, for instance comparable to or less than that of the matrix-vector product. On the other hand, we would like to retain the iteration reducing effect of the preconditioners, and these considerations are often in conflict. In our study we have avoided discussing the convergence accelerating effects of the preconditioner, and we have used a simple Incomplete Block Jacobi preconditioner with blocks corresponding to the domains. In this case we have no communication at all for the preconditioner.

Since the vectors are distributed over the processor grid, the inner product (ddot) is computed in two steps. All processors start to compute in parallel the local inner product. After that, the local inner products are accumulated on one `central' processor and broadcast. We will describe the implementation in a little more detail for a 2-dimensional mesh of processors, see Figure 3. The processors on each processor line in the x-direction accumulate their results along this line on an `accumulation' processor at the same place on each processor line: each processor waits for the result from its neighbor further from the accumulation processor, then adds this result to its own partial result and sends the new result along. Then the `accumulation' processors do a likewise accumulation in the y-direction. The broadcast consists of the reverse process. Each processor is active at only a limited number of steps and will be idle for the rest of the time. So, here are opportunities to make it available for other tasks.

    Figure 3: accumulation over the processor grid (no outgoing step can be taken before all incoming steps have taken place)

The communication time of an accumulation or a broadcast is of the order of the diameter of the processor grid. This means that for an increasing number of processors the communication time for the inner products increases as well, and hence this is a potential threat to the scalability of the method.
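As an illustration of this two-step inner product, the sketch below expresses the same pattern with mpi4py on an assumed square p x p grid. MPI sub-communicator reductions and broadcasts stand in for the hand-rolled nearest-neighbour chains along the x- and y-directions that the report describes for the transputer mesh, so the code shows the structure of the operation, not the original implementation.

    # A hedged sketch (not the report's transputer code) of the two-step distributed
    # inner product on a p x p processor grid: local ddot, accumulation along the
    # x-direction, then along the y-direction, followed by the reverse broadcast.
    from mpi4py import MPI
    import numpy as np

    def grid_dot(x_local, y_local, comm, p):
        rank = comm.Get_rank()
        row, col = divmod(rank, p)
        row_comm = comm.Split(color=row, key=col)   # processors on one x-line
        col_comm = comm.Split(color=col, key=row)   # processors on one y-line

        local = float(np.dot(x_local, y_local))     # step 1: local inner product
        # step 2a: accumulate along the x-direction onto the column-0 processors
        row_sum = row_comm.reduce(local, op=MPI.SUM, root=0)
        total = None
        if col == 0:
            # step 2b: the accumulation processors accumulate along the y-direction
            total = col_comm.reduce(row_sum, op=MPI.SUM, root=0)
            total = col_comm.bcast(total, root=0)   # broadcast back along y
        total = row_comm.bcast(total, root=0)       # broadcast back along x
        row_comm.Free(); col_comm.Free()
        return total

    if __name__ == "__main__":
        comm = MPI.COMM_WORLD
        p = int(round(comm.Get_size() ** 0.5))      # assumes a square p x p grid
        n_local = 1000
        x = np.ones(n_local); y = np.full(n_local, 2.0)
        print(grid_dot(x, y, comm, p))              # every rank prints 2 * p*p*n_local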

Indeed, if the global communication for the inner products is not overlapped, it often becomes a bottleneck on large processor grids, as will be shown later. In [5] a simple performance model based on these considerations is introduced, which clearly shows quantitatively the dramatic influence of the global communication for inner products over large processor grids on the performance of Krylov subspace methods. That model also shows that the degradation of performance depends on the relative costs of local computation and global communication. This means that results analogous to those presented in Sections 5 and 7 will be seen for larger problems on processor configurations with relatively faster computational speed (this is the current trend in parallel computers). Moreover, if the problem size increases proportionally to the number of processors, the local computation time remains the same but the global communication cost increases. This emphasizes the necessity to reduce the effect of the global communication costs.

3 Parallel performance of GMRES(m) and CG

We will now describe briefly a model for the computation time, the communication cost, and the communication time of the main kernels in Krylov subspace methods. We use the term communication cost to indicate the wall-clock time spent in communication that is not overlapped with useful computation (so that it really contributes to the wall-clock time). The term communication time is used to indicate the wall-clock time of the whole communication. In the case of non-overlapped communication, the communication time and the communication cost are the same. Our quantitative formulas are not meant to give very accurate predictions of the exact execution times, but they will be used to identify the bottlenecks and to evaluate improvements. Several of the parameters that we introduce may vary over the processor grid. In that case the value to use is either a maximum or an average, whichever is the most appropriate.

Computation time. We will only be concerned with the local computation time, since the cost of communication and synchronization is modeled explicitly. The computation time for the solution of the Hessenberg system is neglected in our model. For a vector update (daxpy) or an inner product (ddot) the computation time is given by 2 t_fl N/P, where N/P is the local number of unknowns of a processor and t_fl is the average time for a double precision floating point operation. The computation time for the (sparse) matrix-vector product is given by (2 n_z - 1) t_fl N/P, where n_z is the average number of non-zero elements per row of the matrix. As preconditioner we chose Block-(M)ILU variants without fill-in, of the form L D^{-1} U for GMRES(m) and L L^T for CG. For CG we have scaled the system so that diag(L) = I. The computation time of the preconditioner for GMRES(m) is (2 n_z + 1) t_fl N/P, and for CG it is 2 (n_z - 1) t_fl N/P.

A full GMRES(m) cycle has approximately (m^2 + 3m)/2 inner products, the same number of vector updates, and (m + 1) multiplications with the matrix and the preconditioner, if one computes the exact residual at the end of each cycle. The complete (local) computation time for the GMRES(m) algorithm is given by the equation:

    T^gmr_cmp1 = ( 2(m^2 + 3m) + 4 n_z (m + 1) ) (N/P) t_fl.      (1)

A single iteration of CG has three inner products, the same number of vector updates, and one multiplication with the matrix and the preconditioner. The complete (local) computation time is given by the equation:

    T^cg_cmp = (9 + 4 n_z) (N/P) t_fl.      (2)
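For reference, the two computation-time formulas translate directly into a small calculator; the parameter names follow Table 3, and the problem size and grid size used in the example call are assumptions, not the report's values.

    # A small calculator for the local computation-time model of equations (1) and (2).
    def t_cmp_gmres(m, n_z, N, P, t_fl):
        """Local computation time of one GMRES(m) cycle, equation (1)."""
        return (2*(m**2 + 3*m) + 4*n_z*(m + 1)) * (N / P) * t_fl

    def t_cmp_cg(n_z, N, P, t_fl):
        """Local computation time of one CG iteration, equation (2)."""
        return (9 + 4*n_z) * (N / P) * t_fl

    if __name__ == "__main__":
        # Table 3 machine parameters (t_fl = 3.0 microseconds, n_z = 5);
        # N and P below are assumed example values.
        t_fl, n_z, N, P = 3.0e-6, 5, 10_000, 100
        print(t_cmp_gmres(30, n_z, N, P, t_fl), t_cmp_cg(n_z, N, P, t_fl))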

Communication cost. As we mentioned already, the solution of the Hessenberg system in GMRES(m) and the vector update are local and involve no communication cost. The most important communication is for the global inner products. If we do not overlap this global communication, then we are concerned with the wall-clock time for the entire, global operation and not with the local part of a single processor. We note that we can view the time for the accumulation and broadcast either as the communication time for the entire operation, or as a small local communication time and a long delay because of global synchronization. In the first interpretation we would consider overlapping the global communication, whereas in the second one we would consider removing the delays by reducing the number of synchronization points. We will take the first point of view.

Consider a processor grid with P = p^2 processors. With p_d = 2 ceil(p/2) (approximately sqrt(P)), the maximum distance to the `most central' processor over the processor grid is p_d. Let the communication start-up time be given by t_s and the transmission time per (32-bit) word by t_w. The time to communicate one double precision number between two neighboring processors is then (t_s + 3 t_w), since a double precision number takes two words and we need a one-word header to accompany each message. Hence, the global accumulation and broadcast of one double precision number takes 2 p_d (t_s + 3 t_w), and the global accumulation and broadcast of a vector of k double precision numbers takes 2 p_d (t_s + (2k + 1) t_w).

For GMRES(m) in the non-overlapped case the communication time for the modified Gram-Schmidt algorithm (with (m^2 + 3m)/2 accumulations and broadcasts) is

    T^gmr_a+b = (m^2 + 3m) p_d (t_s + 3 t_w),      (3)

where `a+b' indicates the accumulation and broadcast. For CG in the non-overlapped case the communication time of the three inner products per iteration is

    T^cg_a+b = 6 p_d (t_s + 3 t_w).      (4)

The communication for the matrix-vector product is necessary for the exchange of so-called boundary data: sending boundary data to other processors and receiving boundary data from other processors. Assume that each processor has to send and to receive n_m messages, which each take d steps of nearest neighbor communication from source to destination, and let the number of boundary data elements on a processor be given by n_b. The total number of words that have to be communicated (sent and received) is then 2 (n_b + n_m) per processor. For GMRES(m) the communication time of the (m + 1) matrix-vector products is

    T^gmr_bde = 2 d n_m (m + 1) t_s + 2 d (m + 1)(n_b + n_m) t_w,      (5)

where `bde' refers to the boundary data exchange. For CG the communication time of one matrix-vector product is

    T^cg_bde = 2 d n_m t_s + 2 d (n_b + n_m) t_w.      (6)

Note that we have assumed no overlap. For preconditioners that need only boundary data exchanges, we could have used the same formulas with a different choice of the parameter values if necessary, but in our experiments we have used only local block preconditioners (without communication).
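The communication-time formulas (3)-(6) can be collected in the same way; this is again only a sketch of the model, with parameter names as in Table 3 and p_d approximately sqrt(P) for a square grid.

    # Companion sketch to the communication-time model, equations (3)-(6).
    def t_acc_bcast(k, p_d, t_s, t_w):
        """Accumulate and broadcast a vector of k doubles: 2*p_d*(t_s + (2k+1)*t_w)."""
        return 2 * p_d * (t_s + (2*k + 1) * t_w)

    def t_aplusb_gmres(m, p_d, t_s, t_w):
        """Equation (3): (m^2+3m)/2 single-number accumulations and broadcasts."""
        return (m**2 + 3*m) * p_d * (t_s + 3*t_w)

    def t_aplusb_cg(p_d, t_s, t_w):
        """Equation (4): three inner products per CG iteration."""
        return 6 * p_d * (t_s + 3*t_w)

    def t_bde(n_matvecs, d, n_m, n_b, t_s, t_w):
        """Equations (5)/(6): boundary-data exchange for n_matvecs matrix-vector products."""
        return 2*d*n_m*n_matvecs*t_s + 2*d*n_matvecs*(n_b + n_m)*t_w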

4 Communication overhead reduction in GMRES(m)

From the expressions (1), (3) and (5) we conclude that the communication cost for GMRES(m) is of the order O(m^2 sqrt(P)), and for large processor grids this will become a bottleneck. Moreover, in the standard implementation we cannot reduce these costs by accumulating multiple inner products together (saving on start-up times), or overlap this expensive communication with computation (reducing the runtime lost in communication). The problem stems from the fact that the modified Gram-Schmidt orthogonalization of a single vector against some set of vectors, and its subsequent normalization, is an inherently sequential process. However, if the modified Gram-Schmidt orthogonalization of a set of vectors is considered, there is no such problem, since the orthogonalizations of all intermediate vectors on the previously orthogonalized vectors are independent. Therefore, we can compute several or all of the local inner products first and then accumulate the subresults collectively.

Suppose the set of vectors v_1, v̂_2, v̂_3, ..., v̂_{m+1} has to be orthogonalized, where ||v_1|| = 1. The modified Gram-Schmidt process can be implemented as sketched in Figure 4.

    for i = 1, ..., m do
      orthogonalize v̂_{i+1}, ..., v̂_{m+1} on v_i
      v_{i+1} = v̂_{i+1} / ||v̂_{i+1}||
    end

    Figure 4: a block-wise modified Gram-Schmidt orthogonalization

This reduces the number of accumulations to only m, instead of (m^2 + 3m)/2 for the usual implementation of GMRES(m), but the length of the messages has increased. In this way, start-up time is saved by packing the small messages corresponding to one block of orthogonalizations into one larger message. Moreover, we also reduce the amount of data transfer, because we have fewer message headers.

Instead of computing all local inner products in one block and accumulating these partial results only once for the whole block, it is preferable to split each step into two blocks of orthogonalizations, since this offers the possibility to overlap with communication. This overlap is achieved by performing the accumulation and broadcast of the local inner products of the first block concurrently with the computation of the local inner products of the second block, and performing the accumulation and broadcast of the local inner products of the second block concurrently with the vector updates of the first block, see Figure 5. Note that the computation time for this approach is equal to that for the standard modified Gram-Schmidt algorithm.
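A serial NumPy sketch of the block-wise scheme of Figure 4 is given below: in step i all remaining vectors are orthogonalized against v_i at once, so that on a distributed-memory machine their local inner products could be accumulated in a single message. The dense storage of the vectors as columns of one array is an assumption for compactness.

    # A serial NumPy sketch of the block-wise modified Gram-Schmidt of Figure 4.
    import numpy as np

    def blockwise_mgs(Vhat):
        """Orthonormalize the columns of Vhat (first column assumed of unit norm)."""
        V = np.array(Vhat, dtype=float)
        m1 = V.shape[1]                        # m+1 vectors: v_1, vhat_2, ..., vhat_{m+1}
        for i in range(m1 - 1):
            # all inner products against v_i at once: a single accumulation per step
            h = V[:, i+1:].T @ V[:, i]
            V[:, i+1:] -= np.outer(V[:, i], h) # orthogonalize vhat_{i+1}, ..., vhat_{m+1} on v_i
            V[:, i+1] /= np.linalg.norm(V[:, i+1])   # v_{i+1} = vhat_{i+1} / ||vhat_{i+1}||
        return V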

    for i = 1, ..., m do
      split v̂_{i+1}, ..., v̂_{m+1} into two blocks
      compute local inner products (LIPs) of block 1
      accumulate LIPs of block 1   ||   compute LIPs of block 2;
                                        update v̂_{i+1}, compute the LIP for ||v̂_{i+1}||,
                                        and place this LIP in block 2
      accumulate LIPs of block 2   ||   update the vectors of block 1
      update the vectors of block 2
      normalize v̂_{i+1}
    end

    Figure 5: the implementation of the modified Gram-Schmidt process

For the parallel `overlapped' implementation of the modified Gram-Schmidt algorithm given in Figure 5, we will neglect potential effects of overlap of the communication with computation on a single processor. We will only consider the overlap with useful computational work of the time that a processor is not active in the global accumulation and broadcast. If we assume that sufficient computational work can be done to completely fill this time, the communication cost T^gmr_a+b, see (3), reduces to only the communication time spent locally by a processor. This `local' communication cost for the accumulation and broadcast of a vector of k double precision numbers is given by 4 t_s + 4(2k + 1) t_w, for a receive and a send in the accumulation phase and a receive and a send in the broadcast phase, if the processor only participates in the accumulation along the x-direction, and it is given by 8 t_s + 8(2k + 1) t_w if the processor also participates in the accumulation along the y-direction. The latter case is obviously the most important, since all processors finish the modified Gram-Schmidt algorithm more or less at the same time. The communication cost of the entire parallel modified Gram-Schmidt algorithm (mgs) now becomes

    T^l_mgs = 16 m t_s + 8 (m^2 + 5m) t_w.      (7)

In general we may not have enough computational work to overlap all the communication time of a global communication process. For the wall-clock time of (parallel) operations, it is the longest time that matters. Here this is the global communication time for the modified Gram-Schmidt algorithm (mgs):

    T^g_mgs = 4 m p_d t_s + 2 (m^2 + 5m) p_d t_w.      (8)

Since the communication is partly overlapped, the communication cost is in general significantly lower than the communication time, and then it may still be better described by (7) than by (8).

Two important facts are highlighted by expressions (3), (7) and (8). First, assuming sufficient computational work, the contribution of start-up times to the communication cost is reduced from O(m^2 sqrt(P)) in the standard GMRES(m) (3) to O(m) using the parallel modified Gram-Schmidt algorithm (7). Especially for machines with relatively high start-up times this is important. In fact, if the start-ups dominate the communication cost, then we can reduce this contribution by a factor of two by the algorithm given in Figure 4 (even if we neglect the overlap). Second, assuming sufficient computational work, the communication cost no longer depends on the size of the processor grid: instead of being of the order of the diameter of the processor grid, p_d, it is now more or less constant. If we lack sufficient computational work, the communication cost is described by (8) minus the time for the overlapped computation.
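The split into two blocks maps naturally onto non-blocking collectives. The sketch below shows one step i of the schedule of Figure 5 using mpi4py's MPI-3 Iallreduce as a stand-in for the transputer's separate communication process; the norm of v̂_{i+1} and its normalization are omitted for brevity, so this illustrates the scheduling only and is not the report's implementation.

    # A hedged mpi4py sketch of one step of the split-block schedule of Figure 5:
    # the reduction of the first block of local inner products is posted
    # non-blocking and overlapped with the local inner products of the second
    # block; the reduction of the second block is then overlapped with the
    # vector updates of the first block.
    from mpi4py import MPI
    import numpy as np

    def mgs_step_overlapped(v_i, Vhat, comm):
        """Orthogonalize the local parts of the columns of Vhat against v_i."""
        n2 = Vhat.shape[1] // 2                      # split into two blocks
        lip1 = Vhat[:, :n2].T @ v_i                  # local inner products, block 1
        h1 = np.empty_like(lip1)
        req1 = comm.Iallreduce(lip1, h1, op=MPI.SUM) # accumulate block 1 ...
        lip2 = Vhat[:, n2:].T @ v_i                  # ... overlapped with LIPs of block 2
        h2 = np.empty_like(lip2)
        req1.Wait()
        req2 = comm.Iallreduce(lip2, h2, op=MPI.SUM) # accumulate block 2 ...
        Vhat[:, :n2] -= np.outer(v_i, h1)            # ... overlapped with updates of block 1
        req2.Wait()
        Vhat[:, n2:] -= np.outer(v_i, h2)            # update the vectors of block 2
        return Vhat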

In order to be able to use this parallel modified Gram-Schmidt algorithm in GMRES(m), a basis for the Krylov subspace has to be generated first. The idea to generate a basis for the Krylov subspace first and then to orthogonalize this basis was already suggested for the CG algorithm, referred to as s-step CG, in [2] for shared (hierarchical) memory parallel vector processors. In [2] it is also reported that the s-step CG algorithm may converge slowly, due to numerical instability, for s > 5. In the pargmres(m) algorithm stability seems to be much less of a problem, since each vector is explicitly orthogonalized against all the other vectors, and we generate a polynomial basis for the Krylov subspace such as to minimize the condition number; see [1], where the Krylov subspace is generated first to exploit higher level BLAS in the orthogonalization, and [6].

    v̂_1 = v_1 = r / ||r||
    for i = 1, ..., m do
      v̂_{i+1} = v̂_i - d_i A v̂_i
    end

    Figure 6: generation of a polynomial basis for the Krylov subspace

The basis vectors v̂_i for the Krylov subspace are generated as indicated in Figure 6, where the parameters d_i are used to keep the condition number of the matrix [v_1 v̂_2 ... v̂_{m+1}] sufficiently small. Bai, Hu, and Reichel [1] discuss a strategy for this. Their idea is to use one cycle of standard GMRES(m). The eigenvalues of the resulting Hessenberg matrix, which approximate those of A, are then used, in the so-called Leja ordering, as the parameters d_i^{-1} in the rest of the modified GMRES(m) cycles. Their examples indicate that the convergence of such a GMRES(m) is (virtually) the same as that of standard GMRES(m). This is also borne out by our experience. Therefore, in the next section we limit our experiments to the evaluation of a single GMRES(m) cycle.

Our parallel computation of the Krylov subspace basis requires m extra daxpys. It is obvious from (1) that this cost is negligible. However, for completeness we give the computation time including these extra daxpys:

    T^gmr_cmp2 = ( 2m(m + 4) + 4 n_z (m + 1) ) (N/P) t_fl.      (9)

Because we generate the Krylov subspace basis first and then orthogonalize it, the Hessenberg matrix that we obtain from the inner products is not V^T_{m+1} A V_m, as in the standard GMRES(m) algorithm, and therefore we need to solve the least squares problem in a slightly different way. Define v̂_1 = v_1 = ||r||^{-1} r, and generate the other basis vectors as v̂_{i+1} = (I - d_i A) v̂_i, for i = 1, ..., m. This gives the following relation:

    [v̂_2 v̂_3 ... v̂_{m+1}] = V̂_m - A V̂_m D_m,      (10)

where D_m = diag(d_i) and V̂_m is the matrix with the vectors v̂_i as its columns. This relation between vectors and matrices composed from these vectors will be used throughout this discussion.
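The recurrence of Figure 6 is straightforward in code. In the NumPy sketch below the shift parameters d_i are simply taken as given; the report obtains their reciprocals from the Ritz values of one standard GMRES(m) cycle in Leja ordering, following [1].

    # A NumPy sketch of the basis generation of Figure 6: vhat_{i+1} = (I - d_i A) vhat_i.
    import numpy as np

    def krylov_basis(A, r, d):
        """Return Vhat = [vhat_1, ..., vhat_{m+1}], with vhat_1 = r/||r|| and m = len(d)."""
        v = r / np.linalg.norm(r)
        V = [v]
        for d_i in d:
            v = v - d_i * (A @ v)          # vhat_{i+1} = vhat_i - d_i A vhat_i
            V.append(v)
        return np.column_stack(V)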

The parallel modified Gram-Schmidt orthogonalization gives the orthogonal set of vectors {v_1, ..., v_{m+1}}, for which we have

    v_{j+1} = h_{j+1,j+1}^{-1} ( v̂_{j+1} - sum_{i=1}^{j} h_{i,j+1} v_i ),   for j = 1, ..., m,      (11)

where h_{i,j} is defined by (but computed differently)

    h_{i,j} = (v_i, v̂_j)  for i <= j,   and   h_{i,j} = 0  for i > j.      (12)

Notice the subtle difference with the definition of H̄_m in the standard implementation of GMRES(m). Here the matrix H_{m+1} is upper triangular. Furthermore, as long as h_{i,i} != 0 the matrix H_i is nonsingular, whereas h_{i,i} = 0 indicates a lucky breakdown. We will further assume, without loss of generality, that h_{i,i} != 0 for i = 1, ..., m+1. Let h_i denote the i-th column of H_{m+1}. From equations (11) and (12) it follows that

    V̂_i = V_i H_i,   for i = 1, ..., m+1.      (13)

Equation (10) can be rewritten as

    V̂_m - [v̂_2 ... v̂_{m+1}] = A V̂_m D_m = A V_m H_m D_m.      (14)

Define Ĥ_m = [h_1 h_2 ... h_m] - [h_2 h_3 ... h_{m+1}], so that Ĥ_m is an upper Hessenberg matrix of rank m, since h_{i,i} != 0 for i = 1, ..., m+1. Substituting this in (14) finally leads to

    V_{m+1} Ĥ_m = A V_m H_m D_m.      (15)

Using this expression the least squares problem can be solved in the same way as for standard GMRES(m):

    min_y || r - A V_m y || = min_ŷ || r - A V_m H_m D_m ŷ ||,   where  H_m D_m ŷ = y.      (16)

Because H_m and D_m are nonsingular, the latter by definition, H_m D_m ŷ = y is always well-defined. Combining (15) and (16) yields

    ŷ :  min_ŷ || r - V_{m+1} Ĥ_m ŷ || = min_ŷ || ||r|| e_1 - Ĥ_m ŷ ||.      (17)

The additional computational work in this approach is only O(m^2) and therefore negligible. We will refer to this adapted version of GMRES(m) as pargmres(m).
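A compact serial sketch of this solution step is given below, under the assumption that a QR factorization may stand in for the parallel modified Gram-Schmidt process (it produces the same V_{m+1} and upper triangular H_{m+1} up to signs, which do not affect the derivation above).

    # A serial NumPy sketch of the pargmres(m) solution step, equations (13)-(17).
    import numpy as np

    def pargmres_solution(A, b, x0, d):
        m = len(d)
        d = np.asarray(d, dtype=float)
        r = b - A @ x0
        # generate the polynomial Krylov basis of Figure 6
        v = r / np.linalg.norm(r)
        cols = [v]
        for d_i in d:
            v = v - d_i * (A @ v)
            cols.append(v)
        Vhat = np.column_stack(cols)              # n x (m+1)
        V, H = np.linalg.qr(Vhat)                 # Vhat_i = V_i H_i, H upper triangular
        Hhat = H[:, :m] - H[:, 1:m+1]             # Hhat_m = [h_1..h_m] - [h_2..h_{m+1}]
        rhs = np.zeros(m + 1); rhs[0] = np.linalg.norm(r)
        yhat = np.linalg.lstsq(Hhat, rhs, rcond=None)[0]   # min || ||r|| e_1 - Hhat_m yhat ||
        y = H[:m, :m] @ (d * yhat)                # y = H_m D_m yhat
        return x0 + V[:, :m] @ y                  # x_m = x_0 + V_m y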

5 Performance of GMRES(m) and pargmres(m)

Before we discuss the experiments below, we present a short theoretical analysis. The communication time for the exchange of boundary data and the computation time for the m additional vector updates in the pargmres(m) implementation will be neglected in this analysis, because they are relatively unimportant. The runtime of a GMRES(m) cycle on P >= 4 processors is then given by T_P = T^gmr_cmp1 + T^gmr_a+b, see (1) and (3):

    T_P = ( 2(m^2 + 3m) + 4 n_z (m + 1) ) t_fl N/P + (m^2 + 3m)(t_s + 3 t_w) sqrt(P).      (18)

This equation shows that for sufficiently large P the communication will dominate. Following the analysis in [5] we introduce the value P_max as the number of processors that minimizes the runtime of GMRES(m). We have studied the performance of GMRES(m) and pargmres(m) for numbers of processors less than or approximately equal to P_max. Note that for pargmres(m) we can improve the performance further with more processors than P_max, because it has a lower communication cost.

The cost of communication is reduced in pargmres(m) in two steps. First, we reduce the communication time by accumulating and broadcasting multiple inner products in groups. This reduces the communication time from T^gmr_a+b to T^g_mgs, see (3) and (8). Second, we overlap the non-local part of the remaining communication time with half the computation in the modified Gram-Schmidt algorithm, see Figure 5. The length of the overlap then determines the performance of pargmres(m) and the improvement over GMRES(m). Therefore we introduce the value P_ovl, which is the number of processors for which the overlap is exact. The performance and the improvement are then related to whether P <= P_ovl or P > P_ovl and to how large P_ovl is relative to P_max, because the fraction of the runtime spent in communication increases for increasing P, see (18). We will now give relations for P_max and P_ovl.

The minimization of (18) gives

    P_max = ( [4(m^2 + 3m) + 8 n_z (m + 1)] t_fl N / ((m^2 + 3m)(t_s + 3 t_w)) )^(2/3),      (19)

and the efficiency E_P = T_1 / (P T_P) for P_max processors is given by E_{P_max} = 1/3, where T_1 = T^gmr_cmp1 for P = 1. This means that 2/3 of T_{P_max} is spent in communication, because in this model efficiency is lost only through communication. For P_ovl we have that the (total) communication time T^g_mgs, see (8), is equal to the sum of the overlapping computation time, (m^2 + 2m) t_fl N/P, and the local communication time T^l_mgs, see (7):

    sqrt(P_ovl) ( 4m t_s + (2m^2 + 10m) t_w ) = (m^2 + 2m) t_fl N/P_ovl + 16m t_s + (8m^2 + 40m) t_w.      (20)

If P <= P_ovl then the communication cost is reduced to T^l_mgs, see (7). This means that the cost of start-ups is reduced by a factor of (m + 3) sqrt(P) / 16 and the cost of data transfer by a factor of 3 sqrt(P) / 8. Furthermore, as long as P < P_ovl an increase in the number of processors will not result in an increase of the communication cost, and hence the efficiency remains constant. If P > P_ovl then the overlap is no longer complete and the communication cost is given by the communication time minus the computation time of the overlapping computation: T^g_mgs - (m^2 + 2m) t_fl N/P. The runtime is then given by

    ~T_P = ( (m^2 + 4m) + 4 n_z (m + 1) ) t_fl N/P + ( 4m t_s + (2m^2 + 10m) t_w ) sqrt(P).      (21)

For P > P_ovl we see that the efficiency decreases again, because the communication time increases and the computation time of the overlap decreases. Equation (20) gives

    P_ovl ~ ( (m^2 + 2m) t_fl N / (4m t_s + (2m^2 + 10m) t_w) )^(2/3).      (22)

Comparing (19) with (22), we see that if t_s dominates the communication, that is t_s >> t_w, then P_ovl > P_max and we always have P <= P_ovl, so that we can overlap all communication after the reduction of start-ups. This means that we can reduce the runtime by almost a factor of three. For transputers we have t_s ~ t_w, and comparing (19) and (22) we see that then P_ovl < P_max. One can prove that the improvement of pargmres(m) compared to GMRES(m), T_P / ~T_P, as a function of P is either constant or a strictly increasing or decreasing function. The maximum improvement is therefore found for either P = P_ovl or P = P_max.
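The two characteristic grid sizes are easily evaluated from (19) and (22); the sketch below uses the Table 3 machine parameters as defaults, while the problem size N in the example call is an assumption.

    # Sketch of the P_max and P_ovl estimates of equations (19) and (22).
    def gmres_p_max(m, N, n_z=5, t_s=5.3e-6, t_w=4.8e-6, t_fl=3.0e-6):
        """Equation (19): number of processors minimizing the GMRES(m) runtime (18)."""
        num = (4*(m**2 + 3*m) + 8*n_z*(m + 1)) * t_fl * N
        den = (m**2 + 3*m) * (t_s + 3*t_w)
        return (num / den) ** (2.0 / 3.0)

    def gmres_p_ovl(m, N, t_s=5.3e-6, t_w=4.8e-6, t_fl=3.0e-6):
        """Equation (22): approximate grid size up to which the overlap is complete."""
        num = (m**2 + 2*m) * t_fl * N
        den = 4*m*t_s + (2*m**2 + 10*m) * t_w
        return (num / den) ** (2.0 / 3.0)

    print(gmres_p_max(30, N=10_000), gmres_p_ovl(30, N=10_000))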

For P = P_ovl, the communication time is strongly reduced. Furthermore, (19) and (22) indicate that for m large enough P_ovl ~ (1/2)^(2/3) P_max, which means that the efficiency at P_ovl is about 50%. Therefore we may expect an improvement by about a factor of two. For P = P_max the runtime is given by (21). When t_s ~ t_w we get T^g_mgs ~ (1/2) T^gmr_a+b, and we may say that due to the overlap the cost of computation is reduced by (m^2 + 2m) t_fl N/P, that is, approximately by a factor of

    ( 2m^2 + 6m + 4 n_z (m + 1) ) / ( m^2 + 4m + 4 n_z (m + 1) )  ~  ( 2m + 6 + 4 n_z ) / ( m + 4 + 4 n_z ),

which is a little less than a factor of two. Hence we may expect an improvement by a factor of about two in this case also.

We now discuss our experimental observations on the parallel performance of GMRES(m) and the adapted algorithm pargmres(m) on the 400-transputer machine. We will only consider the performance of one (par)gmres(m) cycle, because both algorithms take about the same number of iterations, which generally leads to the same number of GMRES(m) cycles, with only a possible difference in the last cycle. The difference may be that GMRES(m) stops before it completes the full m iterations of the last cycle. This gives on average a difference of only half a GMRES(m) cycle, which is often more than compensated by the much better performance of pargmres(m) for the other cycles. In our experiments we used square processor grids (minimal diameter), and this is optimal for GMRES(m). For other processor grids the degradation of performance for GMRES(m) will be even worse. The pargmres(m) algorithm is much less sensitive to the diameter of the processor grid.

We have solved a convection-diffusion problem discretized by finite volumes over a grid, resulting in the familiar five-diagonal matrix with a tridiagonal block structure, corresponding to the 5-point star. This relatively small problem size was chosen because, for processor grids of increasing size, it very well shows the degradation of performance for GMRES(m) and the large improvements of pargmres(m) over GMRES(m). As we will see, the pargmres(m) variant has much better scaling properties than GMRES(m).

The measured runtimes for a single (par)gmres(m) cycle are listed in Table 1 for m = 30 and m = 50. For m = 30 we have that P_max ~ 400 and P_ovl ~ 236. For m = 50 we have P_max ~ 375 and P_ovl ~ 244. We give speed-ups and efficiencies in Table 2. These are calculated from the measured runtimes of GMRES(m) and pargmres(m) and an estimated sequential runtime for GMRES(m), because the problem was too large to run on a single processor. The estimated T_1 is the net computation time derived from (1). We mention that for CG (see Section 7) the measured T_1 is approximately 9% less than the estimated T_1, but this is not necessarily the case for GMRES(m) too.

    Table 1: measured runtimes (in s) of a single GMRES(m) and pargmres(m) cycle, for m = 30 and m = 50, on each processor grid used.

    Table 2: efficiencies E (%) and speed-ups S for GMRES(m) and pargmres(m), for m = 30 and m = 50, based on the measured runtimes and an estimated sequential runtime for GMRES(m).

The difference between the estimated sequential runtime and the measured one for CG is probably due to a simpler implementation (e.g., less indirect addressing and copying of buffers) for the sequential program, which results in a higher (average) flop rate.

The runtime for GMRES(m) is reduced by approximately 25% when increasing the number of processors from 100 to 196. When increasing this from 100 to 289, the runtime reduces only by some 35%. When we further increase the number of processors to 400, the runtime is already more than for 289 processors, which is in agreement with the previous discussion because P ~ P_max for m = 30 and P > P_max for m = 50. Hence the cost of communication spoils the performance of GMRES(m) completely for large P. On the other hand, for pargmres(m) the runtime reduction when increasing from 100 to 196 processors is approximately 45%, where the upper bound is 49%, so this is almost optimal. Such a speed-up shows that the efficiency remains almost constant for this increase in the number of processors, see also Table 2. This is to be expected because we have P < P_ovl, so that any increase in the communication time of the inner products is more than compensated by the overlapping computation. On 289 processors the runtime is about 53% of the runtime on 100 processors, which is still quite good. If we continue to increase the number of processors, we see that for 400 processors the runtime is not much better than for 289 processors, although it is still decreasing. At this point the speed-up for pargmres(m) levels off, because there is insufficient computational work to overlap the communication (P > P_ovl).

A direct comparison between the runtimes of GMRES(m) and pargmres(m) shows that, for 100 processors, GMRES(m) is about 25% slower than pargmres(m). However, for 196 processors this has increased already to 65% and 81% for m = 30 and m = 50, respectively. From then on the relative difference increases more gradually, to a maximum of about a factor of two for P_max processors. These results are very much in agreement with our theoretical expectations. Note that although the maximum is reached for P_max, the improvement is already substantial for 196 processors, which is near P_ovl.

In Table 4 we give the estimated runtimes from expressions (1), (3), and (5) for GMRES(m), and from (5), (7), and (9) for pargmres(m). Table 3 gives a short overview of the relevant parameters and their meaning (see Section 3). If the value of a parameter is fixed, its value is given as well. The parameters d, n_z and n_m are derived from our model problem and implementation; the parameters t_s, t_w and t_fl have been determined experimentally. A comparison of the estimates with the measured execution times indicates that the formulas are quite accurate, except for the 400-processor case. The first reason for this discrepancy is that for both algorithms the neglected costs become more important when the size of the local problem is small. These neglected costs are due to, e.g., the copying of buffers for communication, indirect addressing using exterior data, the organization of the communication, and the solution of the least squares problem.

    parameter          meaning
    t_w  (4.80 µs)     communication transmission time per (32-bit) word
    t_s  (5.30 µs)     communication start-up time
    t_fl (3.00 µs)     average time for a double precision floating point operation
    d    (1)           (maximum) number of communication steps in the boundary exchange
    n_m  (4)           number of messages (to send and to receive) in the boundary exchange
    n_z  (5)           average number of non-zero elements per row of the matrix
    p_d                maximum distance to the `most central' processor
    n_b                (maximum) number of boundary data elements on a processor
    m                  size of the Krylov subspace over which (par)gmres(m) minimizes

    Table 3: parameters and their meaning

    Table 4: estimated runtimes (in s) for GMRES(m) and pargmres(m), for m = 30 and m = 50 (columns: processor grid, p_d, N_l, n_b, and the estimated runtimes).

For the pargmres(m) algorithm there is a second and more important reason, viz. due to the small size of the local problem we can no longer assume an almost complete overlap of the communication in the modified Gram-Schmidt algorithm (P > P_ovl). This is illustrated in Table 5, which gives estimates for the two overlapping parts given in (20). We refer to the sum of the local communication time and half of the computation time in the modified Gram-Schmidt algorithm as `comp', and to the total communication time for the accumulation as `comm'. Already for the 17 x 17 grid we do not have a complete overlap, although the overlap will still be good. For the 20 x 20 processor grid an overlap of about 55% is already the maximum. Obviously, for a larger problem this would improve.

    Table 5: comparison of the estimated costs (in s) of the overlapping computation (`comp') and of the `global' communication (`comm') in the modified Gram-Schmidt implementation of pargmres(m), for m = 30 and m = 50 (columns: processor grid, p_d, N_l).

6 Communication overhead reduction in CG

For a reduction of the communication overhead of preconditioned CG we follow the approach suggested in [7]. In that approach the operations are rescheduled to create more opportunities for overlap. This leads to an algorithm (parcg) like the one given in Figure 7, where we have assumed that the preconditioner K can be written as K = L L^T.

    parcg:
      x_{-1} = x_0 = initial guess; r_0 = b - A x_0;
      p_{-1} = 0; alpha_{-1} = 0;
      s = L^{-1} r_0;
      rho_{-1} = 1
      for i = 0, 1, 2, ... do
        (1)  rho_i = (s, s)
             w_i = L^{-T} s
             beta_{i-1} = rho_i / rho_{i-1}
             p_i = w_i + beta_{i-1} p_{i-1}
             q_i = A p_i
        (2)  sigma_i = (p_i, q_i)
             x_i = x_{i-1} + alpha_{i-1} p_{i-1}
             alpha_i = rho_i / sigma_i
             r_{i+1} = r_i - alpha_i q_i
        (3)  compute ||r_{i+1}||
             s = L^{-1} r_{i+1}
             if accurate enough then x_{i+1} = x_i + alpha_i p_i; quit
      end

    Figure 7: the parcg algorithm

For a discussion of the ideas behind this scheme we refer to [7]. For our purposes it is relevant to point at the inner products at lines (1), (2) and (3). The communication for these inner products is overlapped by the computational work on the following line. We split the preconditioner to create an overlap for the inner products (1) and (3), and we have extra overlap possibilities since the inner product (2) is followed by the update for x corresponding to the previous iteration step.
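A serial NumPy/SciPy sketch of the rescheduled iteration of Figure 7 is given below, with K = L L^T. In the parallel code each of the inner products (1)-(3) would be accumulated while the operation on the following line (a triangular solve, or the delayed x-update) is being executed; here that schedule is only mirrored in the ordering of the statements, so this is an illustration of the rescheduling, not the report's transputer implementation. L is assumed to be any nonsingular lower triangular factor, e.g., from an incomplete Cholesky decomposition.

    # A serial sketch of the parcg iteration of Figure 7 (K = L L^T).
    import numpy as np
    from scipy.linalg import solve_triangular

    def parcg(A, L, b, x0, tol=1e-8, maxit=500):
        x = x0.copy()                                  # x_{-1} = x_0
        r = b - A @ x
        p_old = np.zeros_like(b)                       # p_{-1} = 0
        alpha_old = 0.0                                # alpha_{-1} = 0
        s = solve_triangular(L, r, lower=True)         # s = L^{-1} r_0
        rho_old = 1.0                                  # rho_{-1} = 1
        for i in range(maxit):
            rho = s @ s                                # (1): rho_i = (s, s) ...
            w = solve_triangular(L.T, s, lower=False)  #      ... overlapped with w_i = L^{-T} s
            beta = rho / rho_old
            p = w + beta * p_old
            q = A @ p
            sigma = p @ q                              # (2): sigma_i = (p_i, q_i) ...
            x = x + alpha_old * p_old                  #      ... overlapped with the delayed x-update
            alpha = rho / sigma
            r = r - alpha * q
            res = np.linalg.norm(r)                    # (3): compute ||r|| ...
            s = solve_triangular(L, r, lower=True)     #      ... overlapped with s = L^{-1} r_{i+1}
            if res < tol:
                break
            p_old, alpha_old, rho_old = p, alpha, rho
        return x + alpha * p                           # final update x_{i+1} = x_i + alpha_i p_i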

Under the assumption of a complete overlap for the time that a processor is not active in the accumulation and broadcast of the inner products, and following the derivation of (7), the communication cost for the three inner products in a parcg iteration reduces from T^cg_a+b, see (4), to the communication time spent locally by a processor:

    T^cg,l_a+b = 24 (t_s + 3 t_w).      (23)

Therefore, the communication cost is reduced from O(sqrt(P)) to O(1), which means that (in theory) the communication cost is independent of the processor grid size.

7 Performance of CG variants

We will follow closely the lines set forth in the analysis for (par)gmres(m) in Section 5. The communication time for the exchange of boundary data will be neglected in this analysis, because it is relatively unimportant for our kind of model problems. The problem dependent parameters and the machine dependent parameters have the same values as in the discussion for GMRES(m), see Table 3. The runtime for a CG iteration on P >= 4 processors is given by T_P = T^cg_cmp + T^cg_a+b, see (2) and (4):

    T_P = (9 + 4 n_z) t_fl N/P + 6 (t_s + 3 t_w) sqrt(P).      (24)

This expression shows that for sufficiently large P the communication time will dominate. Here we can also define a P_max as the number of processors that gives the minimal runtime, and a P_ovl as the number of processors for which the (total) communication time of the inner products, T^cg_a+b (see (4)), is equal to the sum of the computation time of the preconditioner and one vector update (together 2 n_z t_fl N/P) and the local communication time T^cg,l_a+b (see (23)):

    P_max = ( (18 + 8 n_z) N t_fl / (6 (t_s + 3 t_w)) )^(2/3).      (25)

For P = P_max processors the efficiency E_P = T_1 / (P T_P) is again E_{P_max} = 1/3, where T_1 = T^cg_cmp for P = 1; therefore, 2/3 of T_{P_max} is spent in communication. The value of P_ovl is given by

    6 (t_s + 3 t_w) sqrt(P_ovl) = 2 n_z t_fl N/P_ovl + 24 (t_s + 3 t_w).      (26)
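Equation (26) defines P_ovl only implicitly; a small bisection sketch follows, with the Table 3 machine parameters as defaults and the problem size N an assumption of the example call.

    # Solve equation (26) for P_ovl numerically.
    def cg_p_ovl(N, n_z=5, t_s=5.3e-6, t_w=4.8e-6, t_fl=3.0e-6):
        # f(P) = left-hand side of (26) minus right-hand side; f is increasing in P
        f = lambda P: 6*(t_s + 3*t_w)*P**0.5 - (2*n_z*t_fl*N/P + 24*(t_s + 3*t_w))
        lo, hi = 1.0, 1.0e6
        for _ in range(100):                 # plain bisection on the bracket [lo, hi]
            mid = 0.5*(lo + hi)
            lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
        return 0.5*(lo + hi)

    print(round(cg_p_ovl(N=10_000)))   # a couple of hundred processors for this assumed N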

For P <= P_ovl the communication cost is reduced from T^cg_a+b to T^cg,l_a+b, which gives a reduction by a factor of (1/4) sqrt(P). For P > P_ovl the communication cost is given by T^cg_a+b - 2 n_z t_fl N/P. A comparison of (25) and (26) shows that P_ovl < P_max. Even though the preconditioner is strongly problem- and implementation-dependent, this holds in general, because for P = P_ovl the communication time is equal to a part of the computation time, whereas for P = P_max the communication time is already twice the computation time.

This leads to three phases in the performance of parcg. Let a be the computation time, and for gamma in [0,1] let gamma*a be the computation time of the `potential' overlap, and let c be the communication time. Then the runtime of CG is given by a + c, whereas for parcg it is given by (1 - gamma) a + max(gamma*a, c). For increasing P, a decreases and c increases, as described above. For small P, c < gamma*a (P < P_ovl): all communication can be overlapped, but the communication time is relatively unimportant. For medium P, c ~ gamma*a (P ~ P_ovl): the communication time is more or less in balance with the computation time of the overlap, and the improvement is maximal, see below. For large P, c > gamma*a (P > P_ovl): the communication time will be dominant, and then we will not have enough computational work to overlap it sufficiently. It is easy to prove that the fraction (a + c) / ((1 - gamma) a + max(gamma*a, c)) is maximal if gamma*a = c, that is for P = P_ovl, and then the improvement is

    (a + c) / ((1 - gamma) a + max(gamma*a, c)) = (a + gamma*a) / a = 1 + gamma.      (27)

Hence, the maximum improvement of parcg over CG is determined by this fraction. The larger this fraction is, the larger is the maximum improvement by parcg. If the computation time of the preconditioner is dominant, e.g., when n_z is large or when we use preconditioners from a factorization with fill-in, then gamma ~ 1, and we can expect an improvement by a factor of two.

In our model we have gamma = 2 n_z / (9 + 4 n_z) < 1/2, so that for n_z large enough we can expect a reduction by a factor of 1.5. For our model problem we have n_z = 5, so that the improvement is limited to a factor of about 1.33.

We will now discuss the results for the parallel implementation of the standard CG algorithm and the adapted version parcg on the 400-transputer machine for a model problem. Since the algorithms are equivalent, they take the same number of iterations, and therefore we will only consider the runtime of one single iteration. We have solved a diffusion problem discretized by finite volumes over a grid, resulting in a symmetric positive definite five-diagonal matrix (corresponding to the 5-point star). We have solved this relatively small problem on processor grids of increasing size. This problem size was chosen because, for processor grids of increasing size, it shows the three different phases mentioned before.

    Table 6: measured runtimes T_P (in ms) of one CG and one parcg iteration, with speed-up S_P and efficiency E_P (%) relative to the sequential runtime of CG, and the relative difference (%) between the two methods, for each processor grid.

Table 6 gives the measured runtimes for one iteration step, the speed-ups, and the efficiencies for both CG and parcg for several processor grids. The speed-ups and efficiencies are computed relative to the measured sequential runtime of the CG iteration, which is given by T_1 = 0.788 s. Although CG has far fewer inner products than GMRES(m) per iteration (i.e., per matrix-vector product), we observe that the performance levels off fairly quickly. This is in agreement with the findings reported in [5], which show that such behavior is to be expected for any Krylov subspace method.

For our test problem we have that P_max ~ 600 and P_ovl ~ 228. For the processor grids that we used we have P < P_max, so that the runtime decreases for increasing numbers of processors, as predicted by our analysis. Note also the large relative difference between P_ovl and P_max, compared to the relatively small difference for GMRES(m). This indicates that for this test problem, with a small n_z and with a relatively cheap preconditioner, we have a small gamma. Hence, the improvement in the runtime will be limited, as is illustrated in Table 6. We see that the parcg algorithm leads to better speed-ups than the standard CG algorithm, especially on the processor grids where the number of processors is closest to P_ovl. Moreover, for parcg we observe that if the number of processors is increased from 100 to 196, the efficiency remains almost constant, and the runtime is reduced by a factor of about 1.75 (against a maximum of 1.96). Just as for GMRES(m), this is predicted by our analysis, because P < P_ovl, so that the increase in the communication time is masked by the overlapping computation. The initial decrease of efficiency when going from 1 to 100 processors is due to a substantial initial overhead.

    Table 7: estimated runtimes (in ms) for CG and parcg, the estimated non-overlapped communication time, and the corrected estimate for parcg (parcg plus non-overlapped communication).

This parallel overhead is also illustrated by the fact that the estimated sequential runtime from T^cg_cmp, see (2), is 0.870 s, which is about 10% larger than the measured sequential runtime. The three phases in the performance of parcg are illustrated by the difference in runtime between CG and parcg. For small processor grids the communication time is not very important, and we see only small differences. For processor grids with P near P_ovl the communication and the overlapping computation are in balance, and we see an increase in the runtime difference. For larger processor grids we can no longer overlap the communication, which dominates the runtime, to a sufficient degree, and we see the differences decrease again.

We cannot quite match the improvements of pargmres(m), but on the other hand it is important to note that the improvement for parcg comes virtually for free. Besides, for GMRES(m) we have the possibility to combine messages as well as to overlap communication, whereas for CG we can only exploit overlap of communication, unless we combine multiple iterations. Expression (27) indicates that for our problem we cannot expect much more: gamma ~ 1/3, so that the maximum improvement is approximately 33%. This estimate is rather optimistic in view of the large initial parallel overhead. When the computation time of the preconditioner is large or even dominant (gamma ~ 1), the improvement may also be large. This would be the case if n_z is large or when (M)ILU preconditioners with fill-in are used. For many problems this may be a realistic assumption. Another important observation is that as long as P > P_ovl, we can increase the computation time of the preconditioner without increasing the runtime of the iteration, because the preconditioner is overlapped with the accumulation and distribution. That means that we can decrease the number of iterations without increasing the runtime of an iteration.

In Table 7 we show estimates for the execution times of the CG algorithm and the parcg algorithm. The total cost for CG is computed from (2), (6), and (24); for parcg we have used (2), (6), and (23). Just as for GMRES(m), the estimates for CG are relatively accurate, except for the 20 x 20 case. Again, this is probably caused by neglected costs in the implementation that become more important when the local problem size becomes small. For parcg, as well as for pargmres(m), there is also a discrepancy between the measured execution time and the estimated time, due to an incomplete overlap. When we cannot overlap all communication, we can correct the estimate for the runtime of parcg by adding an estimate for the non-overlapped communication time. These corrections can be computed from Table 8 and from the local communication time for one accumulation and broadcast (0.158 ms). Note that we need overlapping computation for three inner products in one iteration (see (23)). For example, for the largest processor grids the computation time of the vector update is not sufficient to overlap the non-local communication time for the accumulation


The amount of work to construct each new guess from the previous one should be a small multiple of the number of nonzeros in A. AMSC/CMSC 661 Scientific Computing II Spring 2005 Solution of Sparse Linear Systems Part 2: Iterative methods Dianne P. O Leary c 2005 Solving Sparse Linear Systems: Iterative methods The plan: Iterative

More information

The rate of convergence of the GMRES method

The rate of convergence of the GMRES method The rate of convergence of the GMRES method Report 90-77 C. Vuik Technische Universiteit Delft Delft University of Technology Faculteit der Technische Wiskunde en Informatica Faculty of Technical Mathematics

More information

Iterative Methods for Linear Systems of Equations

Iterative Methods for Linear Systems of Equations Iterative Methods for Linear Systems of Equations Projection methods (3) ITMAN PhD-course DTU 20-10-08 till 24-10-08 Martin van Gijzen 1 Delft University of Technology Overview day 4 Bi-Lanczos method

More information

Krylov Subspace Methods that Are Based on the Minimization of the Residual

Krylov Subspace Methods that Are Based on the Minimization of the Residual Chapter 5 Krylov Subspace Methods that Are Based on the Minimization of the Residual Remark 51 Goal he goal of these methods consists in determining x k x 0 +K k r 0,A such that the corresponding Euclidean

More information

Iterative methods for Linear System

Iterative methods for Linear System Iterative methods for Linear System JASS 2009 Student: Rishi Patil Advisor: Prof. Thomas Huckle Outline Basics: Matrices and their properties Eigenvalues, Condition Number Iterative Methods Direct and

More information

ITERATIVE METHODS FOR SPARSE LINEAR SYSTEMS

ITERATIVE METHODS FOR SPARSE LINEAR SYSTEMS ITERATIVE METHODS FOR SPARSE LINEAR SYSTEMS YOUSEF SAAD University of Minnesota PWS PUBLISHING COMPANY I(T)P An International Thomson Publishing Company BOSTON ALBANY BONN CINCINNATI DETROIT LONDON MADRID

More information

Overview: Synchronous Computations

Overview: Synchronous Computations Overview: Synchronous Computations barriers: linear, tree-based and butterfly degrees of synchronization synchronous example 1: Jacobi Iterations serial and parallel code, performance analysis synchronous

More information

Domain decomposition on different levels of the Jacobi-Davidson method

Domain decomposition on different levels of the Jacobi-Davidson method hapter 5 Domain decomposition on different levels of the Jacobi-Davidson method Abstract Most computational work of Jacobi-Davidson [46], an iterative method suitable for computing solutions of large dimensional

More information

problem Au = u by constructing an orthonormal basis V k = [v 1 ; : : : ; v k ], at each k th iteration step, and then nding an approximation for the e

problem Au = u by constructing an orthonormal basis V k = [v 1 ; : : : ; v k ], at each k th iteration step, and then nding an approximation for the e A Parallel Solver for Extreme Eigenpairs 1 Leonardo Borges and Suely Oliveira 2 Computer Science Department, Texas A&M University, College Station, TX 77843-3112, USA. Abstract. In this paper a parallel

More information

-.- Bi-CG... GMRES(25) --- Bi-CGSTAB BiCGstab(2)

-.- Bi-CG... GMRES(25) --- Bi-CGSTAB BiCGstab(2) .... Advection dominated problem -.- Bi-CG... GMRES(25) --- Bi-CGSTAB BiCGstab(2) * Universiteit Utrecht -2 log1 of residual norm -4-6 -8 Department of Mathematics - GMRES(25) 2 4 6 8 1 Hybrid Bi-Conjugate

More information

Applied Mathematics 205. Unit V: Eigenvalue Problems. Lecturer: Dr. David Knezevic

Applied Mathematics 205. Unit V: Eigenvalue Problems. Lecturer: Dr. David Knezevic Applied Mathematics 205 Unit V: Eigenvalue Problems Lecturer: Dr. David Knezevic Unit V: Eigenvalue Problems Chapter V.4: Krylov Subspace Methods 2 / 51 Krylov Subspace Methods In this chapter we give

More information

Lab 1: Iterative Methods for Solving Linear Systems

Lab 1: Iterative Methods for Solving Linear Systems Lab 1: Iterative Methods for Solving Linear Systems January 22, 2017 Introduction Many real world applications require the solution to very large and sparse linear systems where direct methods such as

More information

Summary of Iterative Methods for Non-symmetric Linear Equations That Are Related to the Conjugate Gradient (CG) Method

Summary of Iterative Methods for Non-symmetric Linear Equations That Are Related to the Conjugate Gradient (CG) Method Summary of Iterative Methods for Non-symmetric Linear Equations That Are Related to the Conjugate Gradient (CG) Method Leslie Foster 11-5-2012 We will discuss the FOM (full orthogonalization method), CG,

More information

Institute for Advanced Computer Studies. Department of Computer Science. Iterative methods for solving Ax = b. GMRES/FOM versus QMR/BiCG

Institute for Advanced Computer Studies. Department of Computer Science. Iterative methods for solving Ax = b. GMRES/FOM versus QMR/BiCG University of Maryland Institute for Advanced Computer Studies Department of Computer Science College Park TR{96{2 TR{3587 Iterative methods for solving Ax = b GMRES/FOM versus QMR/BiCG Jane K. Cullum

More information

APPARC PaA3a Deliverable. ESPRIT BRA III Contract # Reordering of Sparse Matrices for Parallel Processing. Achim Basermannn.

APPARC PaA3a Deliverable. ESPRIT BRA III Contract # Reordering of Sparse Matrices for Parallel Processing. Achim Basermannn. APPARC PaA3a Deliverable ESPRIT BRA III Contract # 6634 Reordering of Sparse Matrices for Parallel Processing Achim Basermannn Peter Weidner Zentralinstitut fur Angewandte Mathematik KFA Julich GmbH D-52425

More information

Restarting parallel Jacobi-Davidson with both standard and harmonic Ritz values

Restarting parallel Jacobi-Davidson with both standard and harmonic Ritz values Centrum voor Wiskunde en Informatica REPORTRAPPORT Restarting parallel Jacobi-Davidson with both standard and harmonic Ritz values M. Nool, A. van der Ploeg Modelling, Analysis and Simulation (MAS) MAS-R9807

More information

Krylov Space Methods. Nonstationary sounds good. Radu Trîmbiţaş ( Babeş-Bolyai University) Krylov Space Methods 1 / 17

Krylov Space Methods. Nonstationary sounds good. Radu Trîmbiţaş ( Babeş-Bolyai University) Krylov Space Methods 1 / 17 Krylov Space Methods Nonstationary sounds good Radu Trîmbiţaş Babeş-Bolyai University Radu Trîmbiţaş ( Babeş-Bolyai University) Krylov Space Methods 1 / 17 Introduction These methods are used both to solve

More information

Iterative Methods for Sparse Linear Systems

Iterative Methods for Sparse Linear Systems Iterative Methods for Sparse Linear Systems Luca Bergamaschi e-mail: berga@dmsa.unipd.it - http://www.dmsa.unipd.it/ berga Department of Mathematical Methods and Models for Scientific Applications University

More information

ITERATIVE METHODS BASED ON KRYLOV SUBSPACES

ITERATIVE METHODS BASED ON KRYLOV SUBSPACES ITERATIVE METHODS BASED ON KRYLOV SUBSPACES LONG CHEN We shall present iterative methods for solving linear algebraic equation Au = b based on Krylov subspaces We derive conjugate gradient (CG) method

More information

Laboratoire d'informatique Fondamentale de Lille

Laboratoire d'informatique Fondamentale de Lille Laboratoire d'informatique Fondamentale de Lille Publication AS-181 Modied Krylov acceleration for parallel environments C. Le Calvez & Y. Saad February 1998 c LIFL USTL UNIVERSITE DES SCIENCES ET TECHNOLOGIES

More information

Numerical Methods I Non-Square and Sparse Linear Systems

Numerical Methods I Non-Square and Sparse Linear Systems Numerical Methods I Non-Square and Sparse Linear Systems Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 MATH-GA 2011.003 / CSCI-GA 2945.003, Fall 2014 September 25th, 2014 A. Donev (Courant

More information

WHEN studying distributed simulations of power systems,

WHEN studying distributed simulations of power systems, 1096 IEEE TRANSACTIONS ON POWER SYSTEMS, VOL 21, NO 3, AUGUST 2006 A Jacobian-Free Newton-GMRES(m) Method with Adaptive Preconditioner and Its Application for Power Flow Calculations Ying Chen and Chen

More information

Conjugate gradient method. Descent method. Conjugate search direction. Conjugate Gradient Algorithm (294)

Conjugate gradient method. Descent method. Conjugate search direction. Conjugate Gradient Algorithm (294) Conjugate gradient method Descent method Hestenes, Stiefel 1952 For A N N SPD In exact arithmetic, solves in N steps In real arithmetic No guaranteed stopping Often converges in many fewer than N steps

More information

Notes on PCG for Sparse Linear Systems

Notes on PCG for Sparse Linear Systems Notes on PCG for Sparse Linear Systems Luca Bergamaschi Department of Civil Environmental and Architectural Engineering University of Padova e-mail luca.bergamaschi@unipd.it webpage www.dmsa.unipd.it/

More information

4.8 Arnoldi Iteration, Krylov Subspaces and GMRES

4.8 Arnoldi Iteration, Krylov Subspaces and GMRES 48 Arnoldi Iteration, Krylov Subspaces and GMRES We start with the problem of using a similarity transformation to convert an n n matrix A to upper Hessenberg form H, ie, A = QHQ, (30) with an appropriate

More information

EECS 275 Matrix Computation

EECS 275 Matrix Computation EECS 275 Matrix Computation Ming-Hsuan Yang Electrical Engineering and Computer Science University of California at Merced Merced, CA 95344 http://faculty.ucmerced.edu/mhyang Lecture 20 1 / 20 Overview

More information

Chapter 7. Iterative methods for large sparse linear systems. 7.1 Sparse matrix algebra. Large sparse matrices

Chapter 7. Iterative methods for large sparse linear systems. 7.1 Sparse matrix algebra. Large sparse matrices Chapter 7 Iterative methods for large sparse linear systems In this chapter we revisit the problem of solving linear systems of equations, but now in the context of large sparse systems. The price to pay

More information

Lecture 18 Classical Iterative Methods

Lecture 18 Classical Iterative Methods Lecture 18 Classical Iterative Methods MIT 18.335J / 6.337J Introduction to Numerical Methods Per-Olof Persson November 14, 2006 1 Iterative Methods for Linear Systems Direct methods for solving Ax = b,

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences)

AMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences) AMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences) Lecture 19: Computing the SVD; Sparse Linear Systems Xiangmin Jiao Stony Brook University Xiangmin Jiao Numerical

More information

Largest Bratu solution, lambda=4

Largest Bratu solution, lambda=4 Largest Bratu solution, lambda=4 * Universiteit Utrecht 5 4 3 2 Department of Mathematics 1 0 30 25 20 15 10 5 5 10 15 20 25 30 Accelerated Inexact Newton Schemes for Large Systems of Nonlinear Equations

More information

Algorithms that use the Arnoldi Basis

Algorithms that use the Arnoldi Basis AMSC 600 /CMSC 760 Advanced Linear Numerical Analysis Fall 2007 Arnoldi Methods Dianne P. O Leary c 2006, 2007 Algorithms that use the Arnoldi Basis Reference: Chapter 6 of Saad The Arnoldi Basis How to

More information

Topics. The CG Algorithm Algorithmic Options CG s Two Main Convergence Theorems

Topics. The CG Algorithm Algorithmic Options CG s Two Main Convergence Theorems Topics The CG Algorithm Algorithmic Options CG s Two Main Convergence Theorems What about non-spd systems? Methods requiring small history Methods requiring large history Summary of solvers 1 / 52 Conjugate

More information

Stabilization and Acceleration of Algebraic Multigrid Method

Stabilization and Acceleration of Algebraic Multigrid Method Stabilization and Acceleration of Algebraic Multigrid Method Recursive Projection Algorithm A. Jemcov J.P. Maruszewski Fluent Inc. October 24, 2006 Outline 1 Need for Algorithm Stabilization and Acceleration

More information

Last Time. Social Network Graphs Betweenness. Graph Laplacian. Girvan-Newman Algorithm. Spectral Bisection

Last Time. Social Network Graphs Betweenness. Graph Laplacian. Girvan-Newman Algorithm. Spectral Bisection Eigenvalue Problems Last Time Social Network Graphs Betweenness Girvan-Newman Algorithm Graph Laplacian Spectral Bisection λ 2, w 2 Today Small deviation into eigenvalue problems Formulation Standard eigenvalue

More information

Key words. linear equations, polynomial preconditioning, nonsymmetric Lanczos, BiCGStab, IDR

Key words. linear equations, polynomial preconditioning, nonsymmetric Lanczos, BiCGStab, IDR POLYNOMIAL PRECONDITIONED BICGSTAB AND IDR JENNIFER A. LOE AND RONALD B. MORGAN Abstract. Polynomial preconditioning is applied to the nonsymmetric Lanczos methods BiCGStab and IDR for solving large nonsymmetric

More information

Course Notes: Week 1

Course Notes: Week 1 Course Notes: Week 1 Math 270C: Applied Numerical Linear Algebra 1 Lecture 1: Introduction (3/28/11) We will focus on iterative methods for solving linear systems of equations (and some discussion of eigenvalues

More information

Solving Large Nonlinear Sparse Systems

Solving Large Nonlinear Sparse Systems Solving Large Nonlinear Sparse Systems Fred W. Wubs and Jonas Thies Computational Mechanics & Numerical Mathematics University of Groningen, the Netherlands f.w.wubs@rug.nl Centre for Interdisciplinary

More information

In order to solve the linear system KL M N when K is nonsymmetric, we can solve the equivalent system

In order to solve the linear system KL M N when K is nonsymmetric, we can solve the equivalent system !"#$% "&!#' (%)!#" *# %)%(! #! %)!#" +, %"!"#$ %*&%! $#&*! *# %)%! -. -/ 0 -. 12 "**3! * $!#%+,!2!#% 44" #% &#33 # 4"!#" "%! "5"#!!#6 -. - #% " 7% "3#!#3! - + 87&2! * $!#% 44" ) 3( $! # % %#!!#%+ 9332!

More information

Scalable Non-blocking Preconditioned Conjugate Gradient Methods

Scalable Non-blocking Preconditioned Conjugate Gradient Methods Scalable Non-blocking Preconditioned Conjugate Gradient Methods Paul Eller and William Gropp University of Illinois at Urbana-Champaign Department of Computer Science Supercomputing 16 Paul Eller and William

More information

On the influence of eigenvalues on Bi-CG residual norms

On the influence of eigenvalues on Bi-CG residual norms On the influence of eigenvalues on Bi-CG residual norms Jurjen Duintjer Tebbens Institute of Computer Science Academy of Sciences of the Czech Republic duintjertebbens@cs.cas.cz Gérard Meurant 30, rue

More information

Universiteit-Utrecht. Department. of Mathematics. Jacobi-Davidson algorithms for various. eigenproblems. - A working document -

Universiteit-Utrecht. Department. of Mathematics. Jacobi-Davidson algorithms for various. eigenproblems. - A working document - Universiteit-Utrecht * Department of Mathematics Jacobi-Davidson algorithms for various eigenproblems - A working document - by Gerard L.G. Sleipen, Henk A. Van der Vorst, and Zhaoun Bai Preprint nr. 1114

More information

Communication-avoiding Krylov subspace methods

Communication-avoiding Krylov subspace methods Motivation Communication-avoiding Krylov subspace methods Mark mhoemmen@cs.berkeley.edu University of California Berkeley EECS MS Numerical Libraries Group visit: 28 April 2008 Overview Motivation Current

More information

Barrier. Overview: Synchronous Computations. Barriers. Counter-based or Linear Barriers

Barrier. Overview: Synchronous Computations. Barriers. Counter-based or Linear Barriers Overview: Synchronous Computations Barrier barriers: linear, tree-based and butterfly degrees of synchronization synchronous example : Jacobi Iterations serial and parallel code, performance analysis synchronous

More information

Solving Symmetric Indefinite Systems with Symmetric Positive Definite Preconditioners

Solving Symmetric Indefinite Systems with Symmetric Positive Definite Preconditioners Solving Symmetric Indefinite Systems with Symmetric Positive Definite Preconditioners Eugene Vecharynski 1 Andrew Knyazev 2 1 Department of Computer Science and Engineering University of Minnesota 2 Department

More information

ON ORTHOGONAL REDUCTION TO HESSENBERG FORM WITH SMALL BANDWIDTH

ON ORTHOGONAL REDUCTION TO HESSENBERG FORM WITH SMALL BANDWIDTH ON ORTHOGONAL REDUCTION TO HESSENBERG FORM WITH SMALL BANDWIDTH V. FABER, J. LIESEN, AND P. TICHÝ Abstract. Numerous algorithms in numerical linear algebra are based on the reduction of a given matrix

More information

Reduced Synchronization Overhead on. December 3, Abstract. The standard formulation of the conjugate gradient algorithm involves

Reduced Synchronization Overhead on. December 3, Abstract. The standard formulation of the conjugate gradient algorithm involves Lapack Working Note 56 Conjugate Gradient Algorithms with Reduced Synchronization Overhead on Distributed Memory Multiprocessors E. F. D'Azevedo y, V.L. Eijkhout z, C. H. Romine y December 3, 1999 Abstract

More information

Solving Ax = b, an overview. Program

Solving Ax = b, an overview. Program Numerical Linear Algebra Improving iterative solvers: preconditioning, deflation, numerical software and parallelisation Gerard Sleijpen and Martin van Gijzen November 29, 27 Solving Ax = b, an overview

More information

DELFT UNIVERSITY OF TECHNOLOGY

DELFT UNIVERSITY OF TECHNOLOGY DELFT UNIVERSITY OF TECHNOLOGY REPORT 18-05 Efficient and robust Schur complement approximations in the augmented Lagrangian preconditioner for high Reynolds number laminar flows X. He and C. Vuik ISSN

More information

Solving Sparse Linear Systems: Iterative methods

Solving Sparse Linear Systems: Iterative methods Scientific Computing with Case Studies SIAM Press, 2009 http://www.cs.umd.edu/users/oleary/sccs Lecture Notes for Unit VII Sparse Matrix Computations Part 2: Iterative Methods Dianne P. O Leary c 2008,2010

More information

Solving Sparse Linear Systems: Iterative methods

Solving Sparse Linear Systems: Iterative methods Scientific Computing with Case Studies SIAM Press, 2009 http://www.cs.umd.edu/users/oleary/sccswebpage Lecture Notes for Unit VII Sparse Matrix Computations Part 2: Iterative Methods Dianne P. O Leary

More information

7.2 Steepest Descent and Preconditioning

7.2 Steepest Descent and Preconditioning 7.2 Steepest Descent and Preconditioning Descent methods are a broad class of iterative methods for finding solutions of the linear system Ax = b for symmetric positive definite matrix A R n n. Consider

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra)

AMS526: Numerical Analysis I (Numerical Linear Algebra) AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 23: GMRES and Other Krylov Subspace Methods Xiangmin Jiao SUNY Stony Brook Xiangmin Jiao Numerical Analysis I 1 / 9 Minimizing Residual CG

More information

Iterative Methods for Solving A x = b

Iterative Methods for Solving A x = b Iterative Methods for Solving A x = b A good (free) online source for iterative methods for solving A x = b is given in the description of a set of iterative solvers called templates found at netlib: http

More information

6. Iterative Methods for Linear Systems. The stepwise approach to the solution...

6. Iterative Methods for Linear Systems. The stepwise approach to the solution... 6 Iterative Methods for Linear Systems The stepwise approach to the solution Miriam Mehl: 6 Iterative Methods for Linear Systems The stepwise approach to the solution, January 18, 2013 1 61 Large Sparse

More information

Algebraic Multigrid as Solvers and as Preconditioner

Algebraic Multigrid as Solvers and as Preconditioner Ò Algebraic Multigrid as Solvers and as Preconditioner Domenico Lahaye domenico.lahaye@cs.kuleuven.ac.be http://www.cs.kuleuven.ac.be/ domenico/ Department of Computer Science Katholieke Universiteit Leuven

More information

Henk van der Vorst. Abstract. We discuss a novel approach for the computation of a number of eigenvalues and eigenvectors

Henk van der Vorst. Abstract. We discuss a novel approach for the computation of a number of eigenvalues and eigenvectors Subspace Iteration for Eigenproblems Henk van der Vorst Abstract We discuss a novel approach for the computation of a number of eigenvalues and eigenvectors of the standard eigenproblem Ax = x. Our method

More information

Bounding the End-to-End Response Times of Tasks in a Distributed. Real-Time System Using the Direct Synchronization Protocol.

Bounding the End-to-End Response Times of Tasks in a Distributed. Real-Time System Using the Direct Synchronization Protocol. Bounding the End-to-End Response imes of asks in a Distributed Real-ime System Using the Direct Synchronization Protocol Jun Sun Jane Liu Abstract In a distributed real-time system, a task may consist

More information

Alternative correction equations in the Jacobi-Davidson method

Alternative correction equations in the Jacobi-Davidson method Chapter 2 Alternative correction equations in the Jacobi-Davidson method Menno Genseberger and Gerard Sleijpen Abstract The correction equation in the Jacobi-Davidson method is effective in a subspace

More information

1 Extrapolation: A Hint of Things to Come

1 Extrapolation: A Hint of Things to Come Notes for 2017-03-24 1 Extrapolation: A Hint of Things to Come Stationary iterations are simple. Methods like Jacobi or Gauss-Seidel are easy to program, and it s (relatively) easy to analyze their convergence.

More information

Simple iteration procedure

Simple iteration procedure Simple iteration procedure Solve Known approximate solution Preconditionning: Jacobi Gauss-Seidel Lower triangle residue use of pre-conditionner correction residue use of pre-conditionner Convergence Spectral

More information

From Stationary Methods to Krylov Subspaces

From Stationary Methods to Krylov Subspaces Week 6: Wednesday, Mar 7 From Stationary Methods to Krylov Subspaces Last time, we discussed stationary methods for the iterative solution of linear systems of equations, which can generally be written

More information

Parallel Numerics, WT 2016/ Iterative Methods for Sparse Linear Systems of Equations. page 1 of 1

Parallel Numerics, WT 2016/ Iterative Methods for Sparse Linear Systems of Equations. page 1 of 1 Parallel Numerics, WT 2016/2017 5 Iterative Methods for Sparse Linear Systems of Equations page 1 of 1 Contents 1 Introduction 1.1 Computer Science Aspects 1.2 Numerical Problems 1.3 Graphs 1.4 Loop Manipulations

More information

Peter Deuhard. for Symmetric Indenite Linear Systems

Peter Deuhard. for Symmetric Indenite Linear Systems Peter Deuhard A Study of Lanczos{Type Iterations for Symmetric Indenite Linear Systems Preprint SC 93{6 (March 993) Contents 0. Introduction. Basic Recursive Structure 2. Algorithm Design Principles 7

More information

DELFT UNIVERSITY OF TECHNOLOGY

DELFT UNIVERSITY OF TECHNOLOGY DELFT UNIVERSITY OF TECHNOLOGY REPORT 16-02 The Induced Dimension Reduction method applied to convection-diffusion-reaction problems R. Astudillo and M. B. van Gijzen ISSN 1389-6520 Reports of the Delft

More information

AMS Mathematics Subject Classification : 65F10,65F50. Key words and phrases: ILUS factorization, preconditioning, Schur complement, 1.

AMS Mathematics Subject Classification : 65F10,65F50. Key words and phrases: ILUS factorization, preconditioning, Schur complement, 1. J. Appl. Math. & Computing Vol. 15(2004), No. 1, pp. 299-312 BILUS: A BLOCK VERSION OF ILUS FACTORIZATION DAVOD KHOJASTEH SALKUYEH AND FAEZEH TOUTOUNIAN Abstract. ILUS factorization has many desirable

More information

Multigrid absolute value preconditioning

Multigrid absolute value preconditioning Multigrid absolute value preconditioning Eugene Vecharynski 1 Andrew Knyazev 2 (speaker) 1 Department of Computer Science and Engineering University of Minnesota 2 Department of Mathematical and Statistical

More information

A short course on: Preconditioned Krylov subspace methods. Yousef Saad University of Minnesota Dept. of Computer Science and Engineering

A short course on: Preconditioned Krylov subspace methods. Yousef Saad University of Minnesota Dept. of Computer Science and Engineering A short course on: Preconditioned Krylov subspace methods Yousef Saad University of Minnesota Dept. of Computer Science and Engineering Universite du Littoral, Jan 19-3, 25 Outline Part 1 Introd., discretization

More information

Mathematics Research Report No. MRR 003{96, HIGH RESOLUTION POTENTIAL FLOW METHODS IN OIL EXPLORATION Stephen Roberts 1 and Stephan Matthai 2 3rd Febr

Mathematics Research Report No. MRR 003{96, HIGH RESOLUTION POTENTIAL FLOW METHODS IN OIL EXPLORATION Stephen Roberts 1 and Stephan Matthai 2 3rd Febr HIGH RESOLUTION POTENTIAL FLOW METHODS IN OIL EXPLORATION Stephen Roberts and Stephan Matthai Mathematics Research Report No. MRR 003{96, Mathematics Research Report No. MRR 003{96, HIGH RESOLUTION POTENTIAL

More information

The Lanczos and conjugate gradient algorithms

The Lanczos and conjugate gradient algorithms The Lanczos and conjugate gradient algorithms Gérard MEURANT October, 2008 1 The Lanczos algorithm 2 The Lanczos algorithm in finite precision 3 The nonsymmetric Lanczos algorithm 4 The Golub Kahan bidiagonalization

More information

Modelling and implementation of algorithms in applied mathematics using MPI

Modelling and implementation of algorithms in applied mathematics using MPI Modelling and implementation of algorithms in applied mathematics using MPI Lecture 3: Linear Systems: Simple Iterative Methods and their parallelization, Programming MPI G. Rapin Brazil March 2011 Outline

More information

Lecture 17: Iterative Methods and Sparse Linear Algebra

Lecture 17: Iterative Methods and Sparse Linear Algebra Lecture 17: Iterative Methods and Sparse Linear Algebra David Bindel 25 Mar 2014 Logistics HW 3 extended to Wednesday after break HW 4 should come out Monday after break Still need project description

More information

Jos L.M. van Dorsselaer. February Abstract. Continuation methods are a well-known technique for computing several stationary

Jos L.M. van Dorsselaer. February Abstract. Continuation methods are a well-known technique for computing several stationary Computing eigenvalues occurring in continuation methods with the Jacobi-Davidson QZ method Jos L.M. van Dorsselaer February 1997 Abstract. Continuation methods are a well-known technique for computing

More information

A DISSERTATION. Extensions of the Conjugate Residual Method. by Tomohiro Sogabe. Presented to

A DISSERTATION. Extensions of the Conjugate Residual Method. by Tomohiro Sogabe. Presented to A DISSERTATION Extensions of the Conjugate Residual Method ( ) by Tomohiro Sogabe Presented to Department of Applied Physics, The University of Tokyo Contents 1 Introduction 1 2 Krylov subspace methods

More information

Parallel programming using MPI. Analysis and optimization. Bhupender Thakur, Jim Lupo, Le Yan, Alex Pacheco

Parallel programming using MPI. Analysis and optimization. Bhupender Thakur, Jim Lupo, Le Yan, Alex Pacheco Parallel programming using MPI Analysis and optimization Bhupender Thakur, Jim Lupo, Le Yan, Alex Pacheco Outline l Parallel programming: Basic definitions l Choosing right algorithms: Optimal serial and

More information

A Hybrid Method for the Wave Equation. beilina

A Hybrid Method for the Wave Equation.   beilina A Hybrid Method for the Wave Equation http://www.math.unibas.ch/ beilina 1 The mathematical model The model problem is the wave equation 2 u t 2 = (a 2 u) + f, x Ω R 3, t > 0, (1) u(x, 0) = 0, x Ω, (2)

More information