
APPARC PaA3a Deliverable
ESPRIT BRA III Contract # 6634

Reordering of Sparse Matrices for Parallel Processing

Achim Basermann, Peter Weidner
Zentralinstitut für Angewandte Mathematik, KFA Jülich GmbH, D-Jülich, Germany
A.Basermann@kfa-juelich.de, P.Weidner@kfa-juelich.de

Per Christian Hansen, Tzvetan Ostromsky
UNI-C, Danish Computing Center for Research and Education, Building 305, Technical University of Denmark, DK-2800 Lyngby, Denmark
Per.Christian.Hansen@uni-c.dk, Tzvetan.Ostromsky@uni-c.dk

Zahari Zlatev
Danish Environmental Research Institute, Frederiksborgvej 399, DK-4000 Roskilde, Denmark
luzz@sun2.dmu.min.dk

February 28, 1994

In addition to the support from the EEC ESPRIT Basic Research Action Programme, Project 6634 (APPARC), Tz. Ostromsky was supported by Danish Government Scholarship # ..., and Achim Basermann was supported by the Graduiertenkolleg `Informatik und Technik', RWTH Aachen, Germany.

Contents

1 Introduction
  1.1 The Need for Reordering Algorithms
  1.2 Organization of the Report

2 Case Study I: Iterative Methods
  2.1 Theoretical Background
    2.1.1 The Method of Conjugate Gradients
    2.1.2 The Lanczos Tridiagonalization
  2.2 Storage Scheme
  2.3 Parallelization
    2.3.1 Data Distribution
    2.3.2 Reordering and Communication Scheme
  2.4 Results
    2.4.1 Numerical Results
    2.4.2 Performance Results

3 Case Study II: Direct Methods
  3.1 The Locally Optimized Reordering Algorithm
  3.2 Implementation of LORA by Means of Binary Trees
  3.3 Using the Reordered Matrix in the Solution Process
    Step 1 - Reorder the matrix
    Step 2 - Partition the matrix (data distribution)
    Step 3 - Perform the first phase of the factorization
    Step 4 - Perform the second phase of the factorization
    Step 5 - Carry out the second reordering
    Step 6 - Perform the third phase of the factorization
    Step 7 - Find a first solution (back substitution)
    Step 8 - Improve the first solution by a modified preconditioned orthomin algorithm
  3.4 Stability Considerations
  3.5 Numerical Results

4 Conclusion

1 Introduction

The present report summarizes our work on reordering algorithms for sparse matrix computations on parallel computers. During our investigations in Work Package PaA2 [11], it became clear to us that the use of "alternative algorithms" for dense matrix computations is fairly limited in connection with high-performance parallel computers (such algorithms may still be useful on other architectures, e.g., systolic arrays). Hence, it was decided in the present Work Package PaA3a to switch the focus to reordering algorithms for sparse matrices. This work is strongly connected to the future Work Packages PaA4, PaA5a, PaA5b and PaA6a.

1.1 The Need for Reordering Algorithms

Reordering of sparse matrices is essential for good performance on parallel computers. In fact, reordering is often crucial even on a sequential computer, but on a parallel computer a good reordering algorithm can lead to a much better load balance and thus to a dramatic increase in performance compared to a "naive" ordering. We remark that in many circumstances the reordering is only necessary once, as long as the structure of the matrix does not change; this is the case, for example, when partial differential equations are solved by the finite element method. The reason is that the reordering depends solely on the structure of the matrix, not on the numerical values of the matrix elements.

It is important to realize that the matrix reordering is tightly connected to both the data distribution and the communication scheme used on the parallel computer. On distributed memory machines, the sparse matrix must be distributed to the processors. Criteria for the data distribution can be load balancing and as little data transfer as possible, i.e., as many local computations as possible. It is usually difficult to achieve both load balancing and minimum data transfer. By reordering the matrix, the number of local computations can be increased and the data exchange decreased. Therefore, the matrix reordering influences the data distribution as well as the communication scheme; it further depends on the data structure of the sparse matrix. Moreover, reordering the matrix can make it possible to overlap communication and local computations; in this case, the order of the computations is suitably changed.

The underlying idea in our work is that we will accept a fairly advanced reordering algorithm, as long as it ensures good parallel performance of the core algorithm. Hence, it is acceptable that the reordering algorithm is somewhat slower than a simple scheme, provided that the gain in total execution time is high enough when switching to the more advanced algorithm.

Some of the work in this report is based on ideas from graph theory. Graph theory has often been used in sparse matrix studies, especially in connection with symmetric positive definite systems [31]. However, the use of graph theory in connection with general sparse matrices is not as widespread, although some applications exist, based on bipartite graphs; see, e.g., [32]. The application used in this work is different from the above-mentioned applications.

1.2 Organization of the Report

We will illustrate the design and use of reordering algorithms in connection with both iterative and direct parallel methods for sparse matrices.

For iterative methods, the biggest challenge is to implement these methods on distributed memory computers, where message passing can introduce a serious overhead. The "standard iterations", such as Gauss-Seidel, SOR, SSOR, etc., all have fairly slow convergence in general. Therefore, as our case study we consider in Section 2 a message passing implementation of the conjugate gradient (CG) method for the solution of sparse symmetric positive definite systems of equations. We will also consider the computation of eigenvalues and eigenvectors by means of the Lanczos algorithm for solving large sparse symmetric eigenproblems, which has a close theoretical connection to the CG method [14]. We apply alternative forms of the CG and Lanczos algorithms which have better parallelization properties than the standard forms, and we illustrate that the developed data distribution and communication scheme, together with the reordering of the sparse matrix, have a dramatic influence on the execution times.

For direct methods the problem of matrix reordering is more complex, because the reordering now influences the numerical accuracy of the computed solution. Therefore, in Section 3 we concentrate on shared memory computers, because we must understand the situation on this type of computer before we move on to message passing computers. As our case study we consider the direct solution of general sparse systems of equations by means of Gaussian elimination. In particular, we describe a new reordering algorithm, LORA, which is based on graph theory (in particular, we introduce a new class of binary trees), and we illustrate its use in connection with coarse-grained parallel Gaussian elimination. The LORA algorithm tries to reorder the matrix so that it is close to upper triangular form. We also show that there is a trade-off between fast parallel execution and numerical stability and, as a consequence, extra iterative improvement steps may be required.

Finally, in Section 4 we summarize our results and point to future research.

2 Case Study I: Iterative Methods

For the analysis and solution of discretized ordinary or partial differential equations it is necessary to solve systems of equations or eigenproblems with coefficient matrices of different sparsity patterns, depending on the discretization method. In many cases, the use of the finite element (FE) method results in largely unstructured systems of equations. Sparse eigenproblems play an important role in the analysis of elastic solids and structures [13] [37] [45]. In the corresponding FE models, the natural frequencies and mode shapes of free vibration are determined, as are buckling loads and modes. Another class of problems is related to stability analysis, e.g. of electrical networks. Moreover, approximations of extreme eigenvalues are useful for solving sets of linear equations, e.g. for determining condition numbers of symmetric positive definite matrices or for conjugate gradient methods with polynomial preconditioning [9].

The main computational effort in iterative methods for solving linear systems and eigenproblems consists of matrix-vector products and vector-vector operations; the main work in each iteration is usually the computation of matrix-vector products. Therein, the access to the vector is determined by the sparsity pattern and the storage scheme of the matrix. For parallelizing iterative solvers on a multiprocessor system with distributed memory, the data distribution and the communication scheme (which depend on the data structures used for sparse matrices) are of greatest importance for efficient execution. In this context, different reordering strategies for the sparse matrix have been investigated in order to reduce waiting times by overlapping communication and computation. Additionally, the reverse Cuthill-McKee scheme [47] is applied to diminish the bandwidth of the matrix. Depending on the sparsity pattern of the matrix, bandwidth reduction results in a considerable decrease of communication.

The data distribution and the communication scheme are determined before the execution of the solver by preprocessing the symbolic structure of the sparse matrix and are exploited in each iteration. The schemes can be reused as long as the sparsity pattern of the matrix (which is determined by the discretization mesh and the element types) does not change. For example, they can be used in each time step of a time-dependent problem or in each iterative step of a nonlinear problem which is solved by linearization.

In this section, data distribution and communication schemes are presented which are based on the analysis of the column indices of the non-zero matrix elements. Performance tests using the conjugate gradient algorithm (CG) with preconditioning [36] [42] for solving systems of equations and the Lanczos tridiagonalization [14] [15] [43] [51] [52] as the first step for solving eigenproblems have been carried out on the distributed memory system INTEL iPSC/860 of the Research Centre Jülich with sparse matrices from two FE models. The first FE model comes from environmental science; it simulates the behavior of pollutants in geological systems [1] [49]. In the second FE model, from structural mechanics, stresses in materials induced by thermal expansion are calculated by applying the FE program SMART [2].

2.1 Theoretical Background

The variant of the CG algorithm suggested in [10], and the modified Lanczos tridiagonalization from [38], have been employed in our investigations.
The main difference between the original and the modified algorithms is that in the modified versions all dot products are computed without any operations in between. If each iteration is performed in parallel on a distributed memory system, the local values of the dot products can therefore be included in one message for determining the global values.
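To make this concrete, the following minimal sketch (using mpi4py and NumPy; the variable and function names are ours, not taken from the report) combines the local contributions of two dot products into a single global reduction, so that only one synchronization point per iteration is needed:

```python
# Minimal sketch of combining two dot products into one global reduction
# (mpi4py assumed; variable names are hypothetical, not from the report).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def fused_dot_products(g_loc, d_loc, Ad_loc):
    # Local contributions of both dot products needed by the modified CG step.
    buf = np.array([g_loc @ g_loc, d_loc @ Ad_loc])
    # One collective call delivers both global values to every processor.
    comm.Allreduce(MPI.IN_PLACE, buf, op=MPI.SUM)
    return buf[0], buf[1]   # global g^T g and d^T A d
```

In the standard formulations the two reductions would be separated by vector updates and would therefore require two separate communication steps.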

2.1.1 The Method of Conjugate Gradients

The method of conjugate gradients [36] is an algorithm for solving systems of linear equations Ax = b, particularly for sparse coefficient matrices A. The method applies to matrices which are symmetric and positive definite. Aykanat et al. [10] suggested a modified CG algorithm which is described as follows:

Algorithm 2.1. The modified CG method

    Choose an arbitrary x_0 ∈ R^n; g_0 = A x_0 - b; d_0 = -g_0.
    For i = 0, 1, ...:
        α_i = (g_i^T g_i) / (d_i^T A d_i)
        β_i = α_i (A d_i)^T (A d_i) / (d_i^T A d_i) - 1
        g_{i+1}^T g_{i+1} = β_i g_i^T g_i
        x_{i+1} = x_i + α_i d_i
        g_{i+1} = g_i + α_i A d_i
        d_{i+1} = -g_{i+1} + β_i d_i
    until ||g_{i+1}||_2 ≤ ε_r.

In each iteration, the vectors x_i, g_i, and d_i are computed: x_i approximates the solution vector, g_i is the residual, and d_i determines the direction in which the next approximation of the solution vector is searched for. For some sparse matrices, the main work in each iteration consists solely of the computation of the matrix-vector product A d_i; for other sparse matrices, this work is comparable with the work involved in the computation of inner products and saxpys. The iteration is continued until the Euclidean norm of the residual is less than or equal to ε_r. Another stopping criterion, which uses the maximum scaled absolute difference of the components of the latest two approximations of the solution vector, is determined as follows:

    max_{j=1,...,n}  2 |x_{i+1}^j - x_i^j| / (|x_{i+1}^j| + |x_i^j|)  ≤  ε_s.        (2.1)

In the investigations, Algorithm 2.1 has been performed with and without diagonal scaling [42], a simple preconditioner which hardly contributes to the total execution time but usually accelerates the convergence considerably.
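As an illustration, here is a minimal sequential NumPy sketch of Algorithm 2.1 (function and variable names are ours); the parallel version distributes the vectors and the matrix-vector product as described in Section 2.3:

```python
# Sequential sketch of Algorithm 2.1 (modified CG); names are illustrative only.
import numpy as np

def modified_cg(A, b, x0, eps_r=1e-10, max_it=1000):
    x = x0.copy()
    g = A @ x - b            # residual
    d = -g                   # search direction
    for _ in range(max_it):
        Ad = A @ d
        gg = g @ g
        dAd = d @ Ad
        alpha = gg / dAd
        # beta is available before x and g are updated, so both dot
        # products (gg and dAd) can be reduced in a single message.
        beta = alpha * (Ad @ Ad) / dAd - 1.0
        gg_next = beta * gg  # = g_{i+1}^T g_{i+1}, no extra dot product needed
        x += alpha * d
        g += alpha * Ad
        d = -g + beta * d
        if np.sqrt(max(gg_next, 0.0)) <= eps_r:
            break
    return x
```

For a symmetric positive definite A this is algebraically equivalent to the standard CG method; only the order of the operations differs, which is what allows the two dot products to be fused into one message.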

2.1.2 The Lanczos Tridiagonalization

Lanczos methods are most commonly used to approximate a small number of extreme eigenvalues and eigenvectors of a real symmetric large sparse matrix [14] [15] [43] [51] [52]. The principle of the modified Lanczos tridiagonalization from [38] is described in the following algorithm.

Algorithm 2.2. The modified Lanczos tridiagonalization

    Choose an arbitrary r_0 with r_0 ≠ 0 and set q_0 = 0.
    For i = 1, 2, ...:
        β_i = ||r_{i-1}||_2
        α_i = (r_{i-1}^T A r_{i-1}) / (r_{i-1}^T r_{i-1})
        q_i = r_{i-1} / β_i
        r_i = A r_{i-1} / β_i - β_i q_{i-1} - α_i q_i

In this method, a vector sequence q_i ∈ R^n, i = 1, 2, 3, ..., and a sequence of i × i symmetric tridiagonal matrices T_i, i = 1, 2, 3, ..., are generated by an iterative process starting with an n × n real symmetric matrix A and an initial residual vector r_0 ∈ R^n. The orthonormal vectors q_i ∈ R^n, i = 1, 2, 3, ..., are called the Lanczos vectors, and the symmetric tridiagonal matrices T_i the Lanczos matrices. The matrices T_i have the following form, with α_m, m = 1, 2, ..., i, as the diagonal elements and β_m, m = 2, 3, ..., i, as the off-diagonal elements:

    T_i = | α_1  β_2                          |
          | β_2  α_2  β_3                     |
          |      ...  ...  ...                |
          |           β_{i-1}  α_{i-1}  β_i   |
          |                    β_i      α_i   |

In Algorithm 2.2, ||r_{i-1}||_2 = (r_{i-1}^T r_{i-1})^{1/2} denotes the Euclidean norm. The main work in each iteration consists in the computation of the matrix-vector product A r_{i-1} and, in some situations, also in the vector-vector operations.
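A corresponding sequential NumPy sketch of Algorithm 2.2 is given below (illustrative names; reorthogonalization and the eigenvalue extraction from T_i are omitted):

```python
# Sequential sketch of Algorithm 2.2 (modified Lanczos tridiagonalization).
# Names are illustrative; no reorthogonalization is performed.
import numpy as np

def lanczos_tridiag(A, r0, num_steps):
    n = len(r0)
    q_prev = np.zeros(n)
    r = r0.copy()
    alphas, betas = [], []
    for _ in range(num_steps):
        beta = np.linalg.norm(r)            # beta_i = ||r_{i-1}||_2
        Ar = A @ r
        alpha = (r @ Ar) / (r @ r)          # alpha_i = r^T A r / r^T r
        q = r / beta                        # new Lanczos vector
        r = Ar / beta - beta * q_prev - alpha * q
        q_prev = q
        alphas.append(alpha)
        betas.append(beta)
    # T_i is the symmetric tridiagonal matrix with alphas on the diagonal
    # and betas[1:] on the off-diagonals.
    return np.array(alphas), np.array(betas)
```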

2.2 Storage Scheme

Storage schemes for large sparse matrices depend on the sparsity pattern of the matrix, the considered algorithm, and the architecture of the computer system used. In the literature, many variants of storage schemes can be found [21] [22] [40] [41] [44] [46]. The storage scheme considered here is often used in FE programs and is suitable for regular as well as for irregular discretization meshes. It can be found in similar form, e.g., in [40]. The scheme is illustrated in (2.4) for matrix (2.3). The non-zeros of matrix (2.3) are stored row-wise in three one-dimensional arrays: a_w contains the values of the non-zeros, a_s the corresponding column indices, and in a_z the position of the beginning of each row in a_w and a_s is stored. The subdivisions in a_w and a_s have been added to mark the beginning of a new row. The order of the matrix elements per row in a_w and a_s is different from that in matrix (2.3), since this is usually the case in FE programs due to the assembly of the coefficient matrix from the single elements.

    A = [example 8 × 8 sparse matrix; entries not recoverable from the source]        (2.3)

    a_w = (1 | 9 2 | ... | 18 8),
    a_s = (1 | 3 2 | ... | 4 8),        (2.4)
    a_z = (...).

2.3 Parallelization

2.3.1 Data Distribution

For parallelizing Algorithms 2.1 and 2.2 on a distributed memory system, the matrix and vector arrays must be suitably distributed to the processors. For the considered data distribution schemes, the arrays a_w and a_s are distributed row-wise; the rows of each processor succeed one another. The distribution of the vector arrays corresponds component-wise to the row distribution of the matrix arrays. Criteria for the data distribution can be: each processor gets the same number of rows, or so many rows that each processor has nearly the same number of non-zeros. The number of operations for the computation of the matrix-vector product is proportional to the number of non-zeros; the remaining vector operations of one iteration are proportional to the number of rows. The criterion considered here is that each processor has to compute nearly the same number of operations. If the discretization mesh is regular, i.e., the sparsity pattern of the coefficient matrix is regular, then all three criteria result in nearly the same data distribution. If the mesh is very irregular, the three distributions differ considerably. Our algorithm for distributing the rows onto the processors in the latter case can be described as follows (a small code sketch of this criterion is given below).

The row distribution is determined by analyzing the array a_z. Let n_k denote the number of rows assigned to processor k, and let p denote the number of processors. Also, let e_k denote the number of non-zeros assigned to processor k, and let e denote the total number of non-zeros in the matrix. Then processor k is assigned rows until the following requirement, involving a machine-dependent weighting parameter σ, is satisfied for the first time:

    (e_k + σ n_k) / (e + σ n)  ≥  1/p,   for e_k, n_k ≫ 10.        (2.5)

The parameter σ depends firstly on the number of vector operations which are additional to the operations of the matrix-vector product in each iteration. Secondly, it reflects the execution times of arithmetical, logical, and memory operations on the processor used; it is therefore dependent on the processor architecture. The numerator in (2.5) is proportional to the number of operations of one partial iteration on processor k; the denominator is proportional to the total number of operations of one iteration. It should be remarked that for σ → 0 each processor gets nearly the same number of non-zeros, and for σ → ∞ nearly the same number of rows. The first case means that the execution time of all vector-vector operations is negligible compared with the execution time of the matrix-vector product. In the second case, the execution time of the matrix-vector product hardly contributes to the total execution time.
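The following sketch (with illustrative names) builds the three arrays a_w, a_s, a_z for a given matrix and assigns contiguous row blocks to p processors according to criterion (2.5); the parameter sigma plays the role of σ:

```python
# Sketch of the row-wise storage scheme (a_w, a_s, a_z) and of the row
# distribution criterion (2.5). Names are illustrative only.
import numpy as np

def to_row_storage(A):
    """Store the non-zeros of A row-wise: values, column indices, row starts."""
    a_w, a_s, a_z = [], [], []
    for row in A:
        a_z.append(len(a_w))             # position where this row begins
        for j, v in enumerate(row):
            if v != 0:
                a_w.append(v)
                a_s.append(j)            # 0-based column index
    a_z.append(len(a_w))                 # one past the last row
    return np.array(a_w), np.array(a_s), np.array(a_z)

def distribute_rows(a_z, p, sigma):
    """Assign contiguous row blocks to p processors following criterion (2.5)."""
    n = len(a_z) - 1                     # number of rows
    e = a_z[-1]                          # total number of non-zeros
    bounds, first = [], 0
    for _ in range(p - 1):
        e_k = n_k = 0
        row = first
        # add rows until (e_k + sigma*n_k) / (e + sigma*n) >= 1/p
        while row < n and (e_k + sigma * n_k) * p < (e + sigma * n):
            e_k += a_z[row + 1] - a_z[row]
            n_k += 1
            row += 1
        bounds.append((first, row))
        first = row
    bounds.append((first, n))            # last processor gets the remainder
    return bounds
```

The same σ reappears in the estimate (2.6) below.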

With these considerations, the contribution of the matrix-vector product to one iteration can be approximated by

    a_MVP ≈ e / (e + σ n) = 1 / (1 + σ/m_z),   for e, n ≫ 10.        (2.6)

Here, m_z = e/n is the mean number of non-zeros per row. Additionally, (2.6) provides a means for measuring σ: if a_MVP is determined by timings, then an approximation of σ can be computed from

    σ ≈ m_z (1/a_MVP - 1).        (2.7)

On the INTEL i860XR, the timings result in an approximate value of σ of about 8.3 for the CG method and of about 2 for the Lanczos tridiagonalization.

The data distribution according to criterion (2.5) is shown in (2.8) by distributing matrix (2.3) to four processors. The other arrays are distributed analogously. For this simple small example, the data distribution is the same for both the CG method and the Lanczos tridiagonalization. For large sparse matrices from FE applications, the data distributions usually vary for the considered algorithms because of the different values of σ.

    Processor 0: a_w^0 = (1 | 9 2 | ...),
    Processor 1: a_w^1 = (... | ...),        (2.8)
    Processor 2: a_w^2 = (... | ...),
    Processor 3: a_w^3 = (18 8).

2.3.2 Reordering and Communication Scheme

On a distributed memory system, the computation of the matrix-vector product requires communication because each processor owns only a partial vector. For the efficient computation of the matrix-vector product, it is necessary to develop a suitable communication scheme by preprocessing the distributed column index arrays. Here, we describe two different schemes based on two different reorderings of the matrix.

First, the arrays a_s^k are analyzed on each processor k to determine which data result in accesses to components of d_i owned by other processors. Then, a_s^k and a_w^k are reordered in such a way that the data which result in accesses to processor h are collected in block h. The data of block h succeed one another row-wise, with increasing column index per row. Block k is the first block in a_s^k and a_w^k and contains the data which result in local accesses. The goal of the reordering is to overlap computation and communication. The first reordering scheme is shown in (2.9) for the data distribution from (2.8) and the matrix-vector product A d_i from Algorithm 2.1. Here, merely array a_s^1 is analyzed and reordered.

    Processor 0: a_s^0 = (1 | 3 2 | 4 2 3),   d_{i,0} = (d_i^1 d_i^2 d_i^3)
    Processor 1: a_s^1 = (... | ...),          d_{i,1} = (d_i^4 d_i^5)
    Processor 2: a_s^2 = (7 4 6 | ...),        d_{i,2} = (d_i^6 d_i^7)        (2.9)
    Processor 3: a_s^3 = (4 8),                d_{i,3} = (d_i^8)

    Reordering: a_s^1 = (4 5 | 4 5 || 3 || 6 7 | 7 || 8)   [blocks 1, 0, 2, 3]

Computing the operation row-times-vector of the matrix-vector product on processor 1, the index 3 results in an access to component d_i^3 of processor 0, the index 8 in an access to d_i^8 of processor 3, and the indices 6 and 7 in accesses to d_i^6 and d_i^7 of processor 2. The data blocks in (2.9) are separated by double bars for elucidation; the block numbers are given in brackets after the reordered array. After reordering, the data of block 1 result in local accesses, the data of block 0 in accesses to processor 0, the data of block 2 in accesses to processor 2, and the data of block 3 in accesses to processor 3.

After having analyzed the column index array a_s^k, each processor k knows which components of d_i are required by which processors. This information is broadcast to all processors. Then, each processor can decide which data must be sent to which processors. This communication scheme is determined once before starting the parallel CG algorithm or Lanczos tridiagonalization and applies unchanged to each iteration. The communication scheme for the example discussed before is displayed in Fig. 2.1: processor 1 receives the third component of d_i from processor 0, the sixth and seventh components from processor 2, and the eighth component from processor 3. On the other hand, the fourth component of processor 1 is sent to processor 0, the fourth and fifth to processor 2, and the fourth to processor 3.

[Figure 2.1: Communication scheme, reordering 1]

In Fig. 2.2, the parallel computation of the matrix-vector product is described for both considered algorithms. First, on each processor, the data which are necessary for other processors are sent asynchronously. After asynchronous receive-routines for the non-local data have been posted, all local computations are performed, in particular the local part of the matrix-vector product. Then each processor waits until the data of an arbitrary processor arrive and continues the computation of the matrix-vector product. Thereafter, each processor awaits the data of the other processors until the computation of the matrix-vector product is complete. Computation and communication are performed overlapped: while required data are on the network, operations with local or already arrived data of other processors are executed.

[Figure 2.2: The parallel matrix-vector product, reordering 1]
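The sketch below mirrors the flow of Fig. 2.2 for the first reordering scheme, written with mpi4py instead of the message-passing routines of the iPSC/860 used for the reported experiments; all data structures and names are illustrative assumptions, not the report's code:

```python
# Sketch of the overlapped matrix-vector product y = A*d for reordering 1
# (mpi4py assumed; all data structures and names are illustrative only).
import numpy as np
from mpi4py import MPI

def overlapped_matvec(comm, blocks, recv_len, send_map, d_local, y_local):
    """blocks[h] = (rows, cols, vals): the local matrix entries whose column
    indices refer to d-components owned by processor h; for h != me the cols
    are assumed to index into the buffer received from h. send_map[h] lists
    the local d-indices requested by processor h; recv_len[h] is the length
    of the buffer expected from h. y_local is accumulated in place."""
    me = comm.Get_rank()
    # 1. Asynchronously send the locally owned components needed elsewhere.
    send_bufs = {h: np.ascontiguousarray(d_local[idx]) for h, idx in send_map.items()}
    send_reqs = [comm.Isend(buf, dest=h, tag=0) for h, buf in send_bufs.items()]
    # 2. Post asynchronous receives for the non-local components.
    recv_bufs = {h: np.empty(recv_len[h]) for h in blocks if h != me}
    owners = list(recv_bufs)
    recv_reqs = [comm.Irecv(recv_bufs[h], source=h, tag=0) for h in owners]
    # 3. Local part of the matrix-vector product while messages are in transit.
    for r, c, v in zip(*blocks[me]):
        y_local[r] += v * d_local[c]
    # 4. Continue with each block as soon as its data have arrived.
    while recv_reqs:
        k = MPI.Request.Waitany(recv_reqs)
        h = owners.pop(k)
        recv_reqs.pop(k)
        for r, c, v in zip(*blocks[h]):
            y_local[r] += v * recv_bufs[h][c]
    MPI.Request.Waitall(send_reqs)
    return y_local
```

As reported in Section 2.4.2, this overlap of communication and computation reduces the execution time per CG iteration by nearly 20% on 32 processors.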

In the second reordering scheme, the data blocks, built as discussed before, are sent to the processors which own the corresponding components of the vector of the matrix-vector product. The goal is to increase the number of local computations while required data are on the network. In this case, the processors compute partial results of the result vector of the matrix-vector product. Then, y_{k,l} denotes the partial result of y_k = A_k d_i of processor k which is computed on processor l. After the computation, the partial results except the local one are sent to the corresponding processors and added to the local result of the receiving processor. The new distribution of the matrix data is presented in (2.10).

    Processor 0: a_s^0 = (1 | 2 3 | 2 3 || 3),              d_{i,0} = (d_i^1 d_i^2 d_i^3)
    Processor 1: a_s^1 = (4 5 | 4 5 || 4 || 4 | 4 5 || 4),  d_{i,1} = (d_i^4 d_i^5)
    Processor 2: a_s^2 = (6 7 | 6 7 || 6 7 | 7),            d_{i,2} = (d_i^6 d_i^7)        (2.10)
    Processor 3: a_s^3 = (8 || 8),                           d_{i,3} = (d_i^8)

The blocks in (2.10) are labelled by two numbers: the first number denotes the processor to which the partial result is sent, and the second number indicates the processor on which the computation is performed. On processor 0 the blocks are (0,0) and (1,0); on processor 1 they are (1,1), (0,1), (2,1) and (3,1); on processor 2 they are (2,2) and (1,2); and on processor 3 they are (3,3) and (1,3). Processor 1, e.g., computes the local result y_{1,1} with the first block, the partial result y_{0,1} of processor 0 with the second block, the partial result y_{2,1} of processor 2 with the third block, and the partial result y_{3,1} of processor 3 with the fourth block.

Fig. 2.3 shows the communication scheme for the block distribution from (2.10): processor 1 sends a value to processor 0, and this value is added to the third component of y_0; simultaneously, processor 1 receives a value from processor 0 which must be added to the first component of y_1.

[Figure 2.3: Communication scheme, reordering 2]

[Figure 2.4: The parallel matrix-vector product, reordering 2]

In Fig. 2.4, the parallel computation of the matrix-vector product is presented for the second reordering scheme. First, asynchronous receive-routines for receiving all necessary partial results of other processors are posted on each processor. After that, each processor computes the partial results which are to be sent to other processors. The computation is performed per data block; the results are sent asynchronously to the corresponding processors after each computation. Then, all local computations are performed, in particular the computation of the local part of the matrix-vector product. Thereafter, each processor waits until the data of an arbitrary processor arrive and then adds the values to the corresponding components of the local result. This is repeated until the computation of the matrix-vector product is complete. Computation and communication are performed overlapped. Since partial results of the matrix-vector product are exchanged, most computations are local; after having received non-local data, merely a summation of vector components is necessary. The disadvantage of this method is that load balancing is no longer guaranteed after the new distribution of the blocks; some processors may own more or larger data blocks than others. However, this scheme allows arbitrary data distributions; each processor can get arbitrary parts of arbitrary rows, which need not succeed one another. For a specific FE application, suitable data distributions for this scheme can be found by considering the discretization mesh. The data distribution and the communication scheme presented here do not require any knowledge about a specific discretization mesh; the schemes are determined automatically by analyzing the column indices of the non-zero matrix elements.

2.4 Results

The numerical and performance tests of the developed parallel CG method and the parallel Lanczos tridiagonalization have been performed on the distributed-memory system iPSC/860 of the Research Centre Jülich. The INTEL computer system has 32 processors with 16 Megabytes of private memory each, interconnected by a hypercube network. The maximum transfer rate is 2.8 Megabytes/second per channel in both directions.

2.4.1 Numerical Results

The tests presented here have been carried out with one matrix each from the FE models from environmental science and structural mechanics. In Table 2.1, numerical data of the coefficient matrices and for the convergence of the CG method are shown. The matrix from environmental science has ... rows, that from structural mechanics ... rows. In the first case, the mean number of non-zeros per row is near the maximum number; this is caused by a regular discretization mesh. For the second case, the mean and the maximum number are considerably different; the discretization mesh is much more irregular. The operational contribution of the matrix-vector product to one iteration is 75% for the matrix from environmental science and 95% for the matrix from structural mechanics in the case of the CG method; in the case of the Lanczos tridiagonalization, the values are 93% and 99%.

In Table 2.1, the number of CG iterations with and without diagonal scaling is given. The iteration is stopped when the maximum scaled absolute difference from (2.1) is less than 10^{-5}; this corresponds to a precision of the solution vector of about five decimals. With diagonal scaling, the number of iterations is considerably smaller in both cases. The contribution of this preconditioner to the total execution time is in both cases below 1%.
For the preconditioned method, the Euclidean norm of the residual after 84 and 658 iterations, respectively, is given in addition. The sparsity patterns of both matrices are shown in Fig. 2.5. The matrix from environmental science has essentially band structure with a maximum bandwidth of ...

[Figure 2.5: Top: sparsity patterns of the matrices from environmental science (left) and structural mechanics (right). Bottom: the same matrices with bandwidth reduction.]

                                             Environmental science   Structural mechanics
    Rows                                     ...                      ...
    Non-zeros                                ...                      ...
    Density                                  0.05%                    0.6%
    Non-zeros per row, max                   ...                      ...
    Non-zeros per row, mean m_z              ...                      ...
    a_MVP, CG method                         75%                      95%
    a_MVP, Lanczos tridiagonalization        93%                      99%
    CG method (max. scal. abs. diff. ≤ 10^{-5}):
    Iterations without scaling               ...                      ...
    Iterations with scaling                  84                       658
    ||g_{i+1}||_2                            4.5 · 10^{-4}            1.5 · 10^{-5}

    Table 2.1: Numerical data of the considered large sparse matrices.

The matrix from structural mechanics has a much more irregular structure; its maximum bandwidth is ... . By reducing the bandwidth of the matrices, the communication overhead in each iteration of both considered algorithms can be reduced. Since communication is necessary for the operation row times vector of the matrix-vector product, a smaller bandwidth results in a smaller message length or even in communication with fewer processors. Here, the matrix is reordered by the reverse Cuthill-McKee (RCM) scheme [47]. In FE models, this scheme is frequently used for the assembly of the coefficient matrix; it is performed merely once if the mesh does not change, whereas in many cases equation systems or eigenproblems are solved frequently, e.g. in each time step of a time-dependent problem. Fig. 2.5 also shows the sparsity patterns of both matrices with bandwidth reduction. For the matrix from environmental science, the bandwidth is reduced by 45%; the maximum bandwidth is then ... . The maximum bandwidth of the matrix from structural mechanics after applying the reverse Cuthill-McKee scheme is 2989; this is a reduction of merely 14%.

2.4.2 Performance Results

In the first two investigations, bandwidth reduction has not been applied to the matrices. Fig. 2.6 shows execution times per iteration of the CG method on 32 processors for both presented reordering schemes. For the matrix from environmental science, using reordering 1 or reordering 2 results in almost the same execution times; this is caused by the regular discretization mesh. For the matrix from structural mechanics, the time using reordering 2 increases distinctly compared with the time using reordering 1: because of the irregular structure of this matrix, the new distribution of the data blocks destroys the load balancing. Since reordering 2 does not result in an improvement over reordering 1 for the considered matrices, the latter scheme is applied in all following investigations.

Fig. 2.6 also shows the execution times per iteration on 32 processors with and without overlapping of communication and computation for the parallel CG algorithm. The overlapped execution reduces the execution times by nearly 20%.

In Fig. 2.7, speedups on 4 to 32 processors are shown for the CG method with and without bandwidth reduction of the matrices. The equation system from environmental science, together with the program code and the remaining data, requires the memory of more than two processors, that from structural mechanics the memory of more than four processors.

[Figure 2.6: Execution times per iteration, CG method, 32 processors. Left: the influence of the reordering; right: the influence of overlapping.]

[Figure 2.7: Speedups, CG method (left) and Lanczos tridiagonalization (right).]

For up to four and, in the second case, up to eight processors, linear speedup was therefore assumed, because nearly linear speedup was observed in tests with smaller systems of equations for up to 8 processors. Performance tests with a smaller system of equations from environmental science (17368 rows, ... non-zeros, bandwidth reduced) resulted in speedups of 1.0, 1.9, 3.6, and 6.7 on 1, 2, 4, and 8 processors, respectively. Using one processor, one CG iteration required an execution time of 216 milliseconds.

For 16 processors and without bandwidth reduction, the speedup is 13.2 in the first case and 15.2 in the second case; this corresponds to efficiencies of 83% and 95%. With bandwidth reduction, the speedups increase to 14.6 and 15.6, respectively; the efficiencies are 91% and 97%. For 32 processors, speedups of 21.6 and 27.2 without bandwidth reduction and of 24.8 and 28.5 with bandwidth reduction are achieved. The efficiencies decrease to 68% and 85% without bandwidth reduction and to 78% and 89% with bandwidth reduction, because the communication overhead increases.

Fig. 2.7 also shows the speedups for the Lanczos tridiagonalization. In the case of the matrix from environmental science, the speedups are 21.1 without bandwidth reduction and 23.8 with bandwidth reduction using 32 processors; this corresponds to efficiencies of 66% and 74%, respectively. In the case of the matrix from structural mechanics, speedups of 28.5 without bandwidth reduction and of 29.8 with bandwidth reduction have been obtained on 32 processors; the corresponding efficiencies are 89% and 93%, respectively. Using 32 processors, 100 Lanczos iterations require an execution time of 3.3 seconds without and 2.8 seconds with bandwidth reduction for the matrix from environmental science; the corresponding times for the matrix from structural mechanics are 5.7 seconds and 5.5 seconds, respectively.
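For reference, the efficiencies quoted above follow from the usual definition of parallel efficiency (not stated explicitly in the report), e.g. for the CG speedups on 16 and 32 processors without bandwidth reduction:

```latex
E_p = \frac{S_p}{p}, \qquad
E_{16} = \frac{13.2}{16} \approx 0.83 = 83\%, \qquad
E_{32} = \frac{27.2}{32} = 0.85 = 85\%.
```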

3 Case Study II: Direct Methods

When coarse-grain parallelism is to be used for solving Ax = b by Gaussian elimination, the original matrix A has to be reordered before the start of the elimination in order to obtain several relatively large blocks that can be treated concurrently. The idea is not a new one; it has been used in many applications (not only in order to obtain a parallel algorithm). An efficient reordering has been proposed by Hellerman and Rarick [34, 35]. It has been used, with some modifications, by many other authors; see, for example, [8] or [20]. Other preliminary reorderings have also been proposed in the literature; see, for example, [26]. A common feature of all these reorderings is that one always imposes a requirement to obtain square blocks on the main diagonal. Moreover, it is also required that the reordered matrix is either an upper block-triangular matrix or a bordered matrix, in both cases with square blocks on the main diagonal (see the references given above or [17]). For some matrices these two requirements are too restrictive; therefore they should not always be imposed. The main purpose of this chapter is to show how to avoid them (when this is appropriate).

A direct solver in which one attempts to exploit coarse-grain parallelism without imposing the above two requirements is described and tested in [30, 54]. This solver is based on partitioning the matrix into an upper block-triangular form with rectangular diagonal blocks. A reordering algorithm, by which as many zero elements as possible are obtained in the lower left corner of the matrix, is to be applied before the partitioning. After that, the matrix must be divided into block rows, each of them containing approximately the same number of rows. If the reordering algorithm is efficient, then it is rather easy to obtain large block rows that contain approximately the same number of rows during the partitioning (because it is allowed to use rectangular diagonal blocks). This is why we concentrate our attention on the initial reordering.

In the remainder of this chapter we discuss an improvement of the reordering algorithm proposed in [30] and [54], and its application in the solution of systems of linear algebraic equations by Gaussian elimination. Throughout, n and NZ denote respectively the order of the matrix and the number of its non-zero elements. More details about the algorithm can be found in [24].

3.1 The Locally Optimized Reordering Algorithm

It is convenient first to sketch the initial reordering scheme used in the old solver. It consists of two completely separated steps: column reordering and row reordering. The following definition is needed in order to describe the column reordering.

Definition 1. The number c_j of non-zero elements in a given column j, j = 1, 2, ..., n, of the matrix A is called the count of this column.

When the column ordering is completed, the columns of matrix A are ordered by increasing counts:

    j < k  ⇒  c_j ≤ c_k.

Let r_i be the column index of the first non-zero element in row i (i.e., a_{i,r_i} ≠ 0, but a_{i,1} = a_{i,2} = ... = a_{i,r_i - 1} = 0). When the row ordering is completed, the rows are ordered so that the following relationship is satisfied:

    j < k  ⇒  r_j ≤ r_k.

It is clear that by reordering the matrix in this way, many of the zero elements are moved to the lower left corner of the matrix. It can easily be proved that the cost of this reordering is very low; it requires only O(NZ) operations. However, the number of zeros in the lower left corner of the reordered matrix may be considerably smaller than the maximal number that can be achieved. A sketch of this old scheme is given below.
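The old scheme can be sketched in a few lines (illustrative names; a dense boolean pattern is used only for brevity):

```python
# Sketch of the old reordering scheme: column ordering by counts,
# then row ordering by first non-zero index. Names are illustrative.
import numpy as np

def old_reordering(pattern):
    """pattern: boolean array, pattern[i, j] = True iff a_ij != 0."""
    n = pattern.shape[0]
    # Step 1: order columns by increasing count c_j.
    counts = pattern.sum(axis=0)
    col_perm = np.argsort(counts, kind="stable")
    pattern = pattern[:, col_perm]
    # Step 2: order rows by increasing index r_i of the first non-zero.
    first_nz = np.argmax(pattern, axis=1)          # r_i (0-based)
    first_nz[~pattern.any(axis=1)] = n             # empty rows go last
    row_perm = np.argsort(first_nz, kind="stable")
    return pattern[row_perm, :], row_perm, col_perm
```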

Consider the 6 × 6 matrix shown in Fig. 3.1. This matrix has two non-zero elements in each column. Therefore the column ordering will not lead to any column permutations. It is easy to see that the row ordering does not lead to row permutations either. Thus, the old ordering algorithm leaves this matrix unchanged. This leads to a rather poor result (only six zero elements in the lower left corner).

[Figure 3.1: The original sparsity pattern of a matrix; the old reordering algorithm preserves this structure.]

Not only is the main drawback of the old algorithm revealed by this simple example, but the example also shows how the reordering can be made more efficient. When the first column with a minimal count is chosen, the counts of the other columns should be updated by removing all non-zero elements from the rows that have non-zero elements in that column. Then the second column is chosen (among the columns with minimal updated count) and brought by permutations to position 2. The counts of the other columns are again updated, and the process is continued in this manner. The most important feature of this new algorithm is that after choosing a column with a best count, the counts of some columns are reduced. A more rigorous description of the algorithm will be given below. It is worthwhile to illustrate here the effect of this reordering by using the same matrix as in the example shown in Fig. 3.1. The result obtained by the new ordering is shown in Fig. 3.2. It is immediately seen that the number of zero elements in the lower left corner is considerably greater.

[Figure 3.2: An optimal reordering of the matrix from Fig. 3.1. The column permutation sequence is given above the matrix. The reordered matrix is obtained without row interchanges. This reordering is produced by the new algorithm.]

The examples demonstrated in Fig. 3.1 and Fig. 3.2 show that better results can be achieved by using a new reordering algorithm. Two definitions are needed in order to formulate the new algorithm more rigorously.

Definition 2. Assume that the first k columns of the reordered matrix have been selected (and brought to the first k positions by permutations). Let i ∈ {1, 2, ..., n} be an arbitrary row number. Consider a column j, j ∈ {k+1, k+2, ..., n}. The elements in column j which have row numbers greater than i form the active part of column j with index i.

Remark 1. The above definition can be considered as a generalization of a definition which is sometimes used in connection with Gaussian elimination. In the original definition i is equal to k. In the new reordering algorithm (see below) i will in general be greater than k.

Definition 3. The number of non-zero elements in the active part of column j with index i is called the active count c_j^i of column j with index i.

Locally Optimized Reordering Algorithm (LORA)

1. Initialization. Set j = 1, i = 0, and find the active counts c_k^0, k = 1, 2, ..., n, of the columns.
2. Choice of the next column. Find the first column k ≥ j with minimal active count c_k^i with index i among the columns j, j+1, ..., n, and interchange the elements of columns j and k.
3. Row permutations. Perform row permutations such that the q rows with row numbers greater than i having non-zero elements in column j are moved to locations i+1, i+2, ..., i+q.
4. Stopping check. If i + q = n, then STOP.
5. Updating the active counts. Find the active counts of the columns j+1, j+2, ..., n with index i+q.
6. Preparation for a new column search. Increase i by q and j by 1, and go to Step 2.

The typical structure of a matrix that is reordered by LORA is shown in Fig. 3.3. The reordered matrix contains three different types of blocks:

1. the black blocks are dense matrices (these blocks are normally small and rectangular),
2. the shaded blocks are sparse matrices (all such blocks are above the black blocks or to the right of them, i.e., in the upper right corner of the reordered matrix A),
3. all elements that are located under the black blocks and to the left of them (i.e., the elements in the lower left corner of the reordered matrix A) are zeros.

The black blocks in Fig. 3.3 will be called the dense separator of the reordered matrix. As stated above, the new algorithm tries to reorder the matrix so that the number of zero elements under the dense separator is as large as possible. The following theorem is proved in [24].

Theorem 1. All blocks of the dense separator, which is obtained when the new reordering algorithm is applied, are dense matrices.

The strategy by which one attempts to obtain as many zero elements as possible under the dense separator is local. The algorithm finds consecutively columns that are locally best: if a column j is chosen at some stage of LORA, then this column is the column that adds the maximal number of zeros to the zero blocks. The local optimization does not ensure a global optimum, but in Section 3.5 it will be shown that the algorithm performs well in practice.

[Figure 3.3: The structure of the reordered matrix: dense blocks (the dense separator), sparse blocks in the upper right corner, and zero blocks in the lower left corner.]
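For concreteness, a straightforward, unoptimized sketch of LORA on a dense boolean pattern is given below (names are ours); its bookkeeping is O(n^2), which is precisely what the binary trees of Section 3.2 are introduced to avoid:

```python
# Straightforward sketch of LORA (Steps 1-6) on a dense boolean pattern.
# Names are illustrative; no attempt is made at the O(NZ log n) version.
import numpy as np

def lora(pattern):
    """pattern[i, j] = True iff a_ij != 0; returns the reordered pattern."""
    P = pattern.copy()
    n = P.shape[0]
    i, j = 0, 0                                   # i rows and j columns fixed so far
    while j < n and i < n:
        # Step 2: first column (among j..n-1) with minimal active count
        # with index i, i.e. counting only the rows below row i.
        active = P[i:, j:].sum(axis=0)
        k = j + int(np.argmin(active))            # argmin returns the first minimum
        P[:, [j, k]] = P[:, [k, j]]               # column interchange
        # Step 3: move the q rows (below row i) with a non-zero in column j
        # up to the positions directly after row i.
        rows_below = np.arange(i, n)
        hit = rows_below[P[i:, j]]
        miss = rows_below[~P[i:, j]]
        P[i:, :] = P[np.concatenate([hit, miss]), :]
        q = len(hit)
        # Step 4: stopping check.
        if i + q >= n:
            break
        # Steps 5-6: the active counts with index i+q are recomputed at the
        # top of the loop; advance the indices.
        i += q
        j += 1
    return P
```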

3.2 Implementation of LORA by Means of Binary Trees

If LORA is implemented in a straightforward way, then O(n^2) operations are required. For sparse matrix algorithms this is not a negligible cost and, therefore, it is highly desirable to reduce it. The crucial point is to reduce the cost of Step 2. The use of binary trees in the effort to reduce the total number of operations needed to perform LORA is discussed in this section. The discussion is based on a set of fundamental concepts from graph theory which are assumed to be known; this set includes nodes, edges, paths, roots, parents, children, ancestors, descendants, trees, subtrees, etc. (a good description of these concepts can be found, for example, in [3, 4]). Several definitions are needed in order to introduce the concept of binary trees with leveled nodes; a binary tree with leveled nodes can be considered as a modification of the leftist tree discussed in Knuth [39].

Definition 4. A tree is called a binary tree if each of its nodes has at most two children.

Definition 5. Consider a binary tree T(V, E), where V is the set of nodes and E is the set of edges, and assume that V contains n nodes. Sometimes it is necessary to keep the nodes of T in a certain order. If this is so, then a special function must be attached to the nodes of T; the value of this function for a given node j, j ∈ V = {1, 2, ..., n}, will be called the key of node j.

Definition 6. Consider again T(V, E) from the previous definition. The minimal distance between a given node j, j ∈ V = {1, 2, ..., n}, and one of its descendants with less than two children is called the level of node j. The level of a node with less than two children is equal to zero.

Some modifications of binary trees are often carried out many times consecutively during the solution of a given problem; this is needed, for example, if LORA is implemented by using binary trees. Such modifications can be performed efficiently if some extra information about the binary tree is stored and used. A special class of binary trees, the binary trees with leveled nodes, can be very useful for some special kinds of modifications. Five extra arrays, each of them of length n, are necessary to store the information and to update it efficiently when binary trees with leveled nodes are to be used. The contents of the j-th component of these arrays are given below:

1. LEFT(j) contains any of the children of the j-th node, or LEFT(j) = 0 if node j has no children.
2. RIGHT(j): if node j has two children, then RIGHT(j) contains the child that is not stored in LEFT(j); RIGHT(j) = 0 when node j has only one child or no child.
3. PREV(j) contains the parent of the j-th node; PREV(j) = 0 for the root.
4. KEY(j) contains the key of the j-th node.
5. LEV(j) contains the level of the j-th node.

Definition 7. A binary tree T(V, KEY, LEV, PREV, LEFT, RIGHT) is called a binary tree with leveled nodes if the following relations are satisfied:

1. KEY(PREV(j)) ≤ KEY(j) for all j ∈ V, j ≠ root;
2. if LEFT(j) = m ≠ 0 then PREV(m) = j; if RIGHT(j) = k ≠ 0 then PREV(k) = j;
3. if LEFT(j) ≠ 0 then LEFT(j) ≠ RIGHT(j);
4. if LEFT(j) = 0 or RIGHT(j) = 0 then LEV(j) = 0;
5. if LEFT(j) ≠ 0 and RIGHT(j) ≠ 0 then LEV(j) = min(LEV(LEFT(j)), LEV(RIGHT(j))) + 1.

[Figure 3.4: A binary tree with ten leveled nodes; the key and the level of each node are shown.]

A binary tree with leveled nodes is shown in Fig. 3.4. The relations stated by (1)-(5) in Definition 7 can easily be checked using the information about the key and the level of each node, which is also given in Fig. 3.4. A collection of properties of binary trees with leveled nodes can be found in [24].

Consider a binary tree T(V, KEY, LEV, PREV, LEFT, RIGHT) with n leveled nodes. If such a tree is to be used in connection with LORA, then the nodes of the tree must be associated with the columns of matrix A. The active count of column j is to be associated with the value of KEY(j). These counts are modified in LORA; this leads to the necessity of modifying T(V, KEY, LEV, PREV, LEFT, RIGHT) after the choice of each column (Step 2 of LORA). Appropriate values of the active count of column j with some index i are to be used as KEY(j) during the consecutive modifications of T(V, KEY, LEV, PREV, LEFT, RIGHT). These remarks suffice to explain how binary trees with leveled nodes are used to implement LORA efficiently. The main result about the complexity of LORA is given in the theorem below; the proof can be found in [24].

Theorem 3. The reordering algorithm LORA can be implemented in O(NZ log n) operations when binary trees with leveled nodes are used.

In the above considerations we assumed for simplicity that KEY(k) represents the active count of the k-th column. In this particular case a more straightforward implementation exists, using doubly linked lists (see [17, Sect. 2.8]), and this implementation leads to an algorithm ...
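The binary trees with leveled nodes themselves are not reproduced here. As a stand-in, the following sketch uses a plain binary heap (Python's heapq) with lazy invalidation to illustrate the role such a structure plays inside LORA, namely keeping the column with minimal active count retrievable in O(log n) while active counts are decreased; this is a swapped-in data structure for illustration, not the report's:

```python
# Stand-in for the leveled binary tree of Section 3.2: a binary heap with
# lazy deletion keeps the column of minimal active count available in
# O(log n) per update. This is NOT the report's data structure, only an
# illustration of the priority-queue role it plays inside LORA.
import heapq

class ActiveCountQueue:
    def __init__(self, counts):
        # counts[j] = initial active count c_j of column j
        self.count = dict(counts)
        self.heap = [(c, j) for j, c in counts.items()]
        heapq.heapify(self.heap)

    def decrease(self, j, amount):
        """Decrease the active count of column j (Step 5 of LORA)."""
        self.count[j] -= amount
        heapq.heappush(self.heap, (self.count[j], j))   # stale entries stay, ignored later

    def pop_min(self):
        """Return a column with minimal active count (Step 2 of LORA)."""
        while self.heap:
            c, j = heapq.heappop(self.heap)
            if j in self.count and self.count[j] == c:  # skip stale entries
                del self.count[j]
                return j, c
        raise IndexError("no columns left")
```

Each selection and each count update then costs O(log n), which is the kind of bound that leads to the O(NZ log n) complexity of Theorem 3.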


More information

FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION

FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION EDMOND CHOW AND AFTAB PATEL Abstract. This paper presents a new fine-grained parallel algorithm for computing an incomplete LU factorization. All nonzeros

More information

ITERATIVE METHODS FOR SPARSE LINEAR SYSTEMS

ITERATIVE METHODS FOR SPARSE LINEAR SYSTEMS ITERATIVE METHODS FOR SPARSE LINEAR SYSTEMS YOUSEF SAAD University of Minnesota PWS PUBLISHING COMPANY I(T)P An International Thomson Publishing Company BOSTON ALBANY BONN CINCINNATI DETROIT LONDON MADRID

More information

Calculus and linear algebra for biomedical engineering Week 3: Matrices, linear systems of equations, and the Gauss algorithm

Calculus and linear algebra for biomedical engineering Week 3: Matrices, linear systems of equations, and the Gauss algorithm Calculus and linear algebra for biomedical engineering Week 3: Matrices, linear systems of equations, and the Gauss algorithm Hartmut Führ fuehr@matha.rwth-aachen.de Lehrstuhl A für Mathematik, RWTH Aachen

More information

Mathematics Research Report No. MRR 003{96, HIGH RESOLUTION POTENTIAL FLOW METHODS IN OIL EXPLORATION Stephen Roberts 1 and Stephan Matthai 2 3rd Febr

Mathematics Research Report No. MRR 003{96, HIGH RESOLUTION POTENTIAL FLOW METHODS IN OIL EXPLORATION Stephen Roberts 1 and Stephan Matthai 2 3rd Febr HIGH RESOLUTION POTENTIAL FLOW METHODS IN OIL EXPLORATION Stephen Roberts and Stephan Matthai Mathematics Research Report No. MRR 003{96, Mathematics Research Report No. MRR 003{96, HIGH RESOLUTION POTENTIAL

More information

Algorithm for Sparse Approximate Inverse Preconditioners in the Conjugate Gradient Method

Algorithm for Sparse Approximate Inverse Preconditioners in the Conjugate Gradient Method Algorithm for Sparse Approximate Inverse Preconditioners in the Conjugate Gradient Method Ilya B. Labutin A.A. Trofimuk Institute of Petroleum Geology and Geophysics SB RAS, 3, acad. Koptyug Ave., Novosibirsk

More information

Linear Algebra and Eigenproblems

Linear Algebra and Eigenproblems Appendix A A Linear Algebra and Eigenproblems A working knowledge of linear algebra is key to understanding many of the issues raised in this work. In particular, many of the discussions of the details

More information

Incomplete Block LU Preconditioners on Slightly Overlapping. E. de Sturler. Delft University of Technology. Abstract

Incomplete Block LU Preconditioners on Slightly Overlapping. E. de Sturler. Delft University of Technology. Abstract Incomplete Block LU Preconditioners on Slightly Overlapping Subdomains for a Massively Parallel Computer E. de Sturler Faculty of Technical Mathematics and Informatics Delft University of Technology Mekelweg

More information

Numerical Methods in Matrix Computations

Numerical Methods in Matrix Computations Ake Bjorck Numerical Methods in Matrix Computations Springer Contents 1 Direct Methods for Linear Systems 1 1.1 Elements of Matrix Theory 1 1.1.1 Matrix Algebra 2 1.1.2 Vector Spaces 6 1.1.3 Submatrices

More information

Iterative Methods for Solving A x = b

Iterative Methods for Solving A x = b Iterative Methods for Solving A x = b A good (free) online source for iterative methods for solving A x = b is given in the description of a set of iterative solvers called templates found at netlib: http

More information

Lecture 9: Numerical Linear Algebra Primer (February 11st)

Lecture 9: Numerical Linear Algebra Primer (February 11st) 10-725/36-725: Convex Optimization Spring 2015 Lecture 9: Numerical Linear Algebra Primer (February 11st) Lecturer: Ryan Tibshirani Scribes: Avinash Siravuru, Guofan Wu, Maosheng Liu Note: LaTeX template

More information

Numerical Methods I Non-Square and Sparse Linear Systems

Numerical Methods I Non-Square and Sparse Linear Systems Numerical Methods I Non-Square and Sparse Linear Systems Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 MATH-GA 2011.003 / CSCI-GA 2945.003, Fall 2014 September 25th, 2014 A. Donev (Courant

More information

An exploration of matrix equilibration

An exploration of matrix equilibration An exploration of matrix equilibration Paul Liu Abstract We review three algorithms that scale the innity-norm of each row and column in a matrix to. The rst algorithm applies to unsymmetric matrices,

More information

FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION

FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION EDMOND CHOW AND AFTAB PATEL Abstract. This paper presents a new fine-grained parallel algorithm for computing an incomplete LU factorization. All nonzeros

More information

ADDITIVE SCHWARZ FOR SCHUR COMPLEMENT 305 the parallel implementation of both preconditioners on distributed memory platforms, and compare their perfo

ADDITIVE SCHWARZ FOR SCHUR COMPLEMENT 305 the parallel implementation of both preconditioners on distributed memory platforms, and compare their perfo 35 Additive Schwarz for the Schur Complement Method Luiz M. Carvalho and Luc Giraud 1 Introduction Domain decomposition methods for solving elliptic boundary problems have been receiving increasing attention

More information

SOR as a Preconditioner. A Dissertation. Presented to. University of Virginia. In Partial Fulllment. of the Requirements for the Degree

SOR as a Preconditioner. A Dissertation. Presented to. University of Virginia. In Partial Fulllment. of the Requirements for the Degree SOR as a Preconditioner A Dissertation Presented to The Faculty of the School of Engineering and Applied Science University of Virginia In Partial Fulllment of the Reuirements for the Degree Doctor of

More information

Linear Regression and Its Applications

Linear Regression and Its Applications Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start

More information

Improvements for Implicit Linear Equation Solvers

Improvements for Implicit Linear Equation Solvers Improvements for Implicit Linear Equation Solvers Roger Grimes, Bob Lucas, Clement Weisbecker Livermore Software Technology Corporation Abstract Solving large sparse linear systems of equations is often

More information

Fine-grained Parallel Incomplete LU Factorization

Fine-grained Parallel Incomplete LU Factorization Fine-grained Parallel Incomplete LU Factorization Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology Sparse Days Meeting at CERFACS June 5-6, 2014 Contribution

More information

The Solution of Linear Systems AX = B

The Solution of Linear Systems AX = B Chapter 2 The Solution of Linear Systems AX = B 21 Upper-triangular Linear Systems We will now develop the back-substitution algorithm, which is useful for solving a linear system of equations that has

More information

A JACOBI-DAVIDSON ITERATION METHOD FOR LINEAR EIGENVALUE PROBLEMS. GERARD L.G. SLEIJPEN y AND HENK A. VAN DER VORST y

A JACOBI-DAVIDSON ITERATION METHOD FOR LINEAR EIGENVALUE PROBLEMS. GERARD L.G. SLEIJPEN y AND HENK A. VAN DER VORST y A JACOBI-DAVIDSON ITERATION METHOD FOR LINEAR EIGENVALUE PROBLEMS GERARD L.G. SLEIJPEN y AND HENK A. VAN DER VORST y Abstract. In this paper we propose a new method for the iterative computation of a few

More information

Power System Analysis Prof. A. K. Sinha Department of Electrical Engineering Indian Institute of Technology, Kharagpur. Lecture - 21 Power Flow VI

Power System Analysis Prof. A. K. Sinha Department of Electrical Engineering Indian Institute of Technology, Kharagpur. Lecture - 21 Power Flow VI Power System Analysis Prof. A. K. Sinha Department of Electrical Engineering Indian Institute of Technology, Kharagpur Lecture - 21 Power Flow VI (Refer Slide Time: 00:57) Welcome to lesson 21. In this

More information

Numerical Linear Algebra

Numerical Linear Algebra Numerical Linear Algebra Decompositions, numerical aspects Gerard Sleijpen and Martin van Gijzen September 27, 2017 1 Delft University of Technology Program Lecture 2 LU-decomposition Basic algorithm Cost

More information

Direct solution methods for sparse matrices. p. 1/49

Direct solution methods for sparse matrices. p. 1/49 Direct solution methods for sparse matrices p. 1/49 p. 2/49 Direct solution methods for sparse matrices Solve Ax = b, where A(n n). (1) Factorize A = LU, L lower-triangular, U upper-triangular. (2) Solve

More information

Program Lecture 2. Numerical Linear Algebra. Gaussian elimination (2) Gaussian elimination. Decompositions, numerical aspects

Program Lecture 2. Numerical Linear Algebra. Gaussian elimination (2) Gaussian elimination. Decompositions, numerical aspects Numerical Linear Algebra Decompositions, numerical aspects Program Lecture 2 LU-decomposition Basic algorithm Cost Stability Pivoting Cholesky decomposition Sparse matrices and reorderings Gerard Sleijpen

More information

Math 471 (Numerical methods) Chapter 3 (second half). System of equations

Math 471 (Numerical methods) Chapter 3 (second half). System of equations Math 47 (Numerical methods) Chapter 3 (second half). System of equations Overlap 3.5 3.8 of Bradie 3.5 LU factorization w/o pivoting. Motivation: ( ) A I Gaussian Elimination (U L ) where U is upper triangular

More information

Research Reports on Mathematical and Computing Sciences

Research Reports on Mathematical and Computing Sciences ISSN 1342-284 Research Reports on Mathematical and Computing Sciences Exploiting Sparsity in Linear and Nonlinear Matrix Inequalities via Positive Semidefinite Matrix Completion Sunyoung Kim, Masakazu

More information

Technical University Hamburg { Harburg, Section of Mathematics, to reduce the number of degrees of freedom to manageable size.

Technical University Hamburg { Harburg, Section of Mathematics, to reduce the number of degrees of freedom to manageable size. Interior and modal masters in condensation methods for eigenvalue problems Heinrich Voss Technical University Hamburg { Harburg, Section of Mathematics, D { 21071 Hamburg, Germany EMail: voss @ tu-harburg.d400.de

More information

Applied Mathematics 205. Unit V: Eigenvalue Problems. Lecturer: Dr. David Knezevic

Applied Mathematics 205. Unit V: Eigenvalue Problems. Lecturer: Dr. David Knezevic Applied Mathematics 205 Unit V: Eigenvalue Problems Lecturer: Dr. David Knezevic Unit V: Eigenvalue Problems Chapter V.4: Krylov Subspace Methods 2 / 51 Krylov Subspace Methods In this chapter we give

More information

6.4 Krylov Subspaces and Conjugate Gradients

6.4 Krylov Subspaces and Conjugate Gradients 6.4 Krylov Subspaces and Conjugate Gradients Our original equation is Ax = b. The preconditioned equation is P Ax = P b. When we write P, we never intend that an inverse will be explicitly computed. P

More information

Numerical methods part 2

Numerical methods part 2 Numerical methods part 2 Alain Hébert alain.hebert@polymtl.ca Institut de génie nucléaire École Polytechnique de Montréal ENE6103: Week 6 Numerical methods part 2 1/33 Content (week 6) 1 Solution of an

More information

Notes for CS542G (Iterative Solvers for Linear Systems)

Notes for CS542G (Iterative Solvers for Linear Systems) Notes for CS542G (Iterative Solvers for Linear Systems) Robert Bridson November 20, 2007 1 The Basics We re now looking at efficient ways to solve the linear system of equations Ax = b where in this course,

More information

Minisymposia 9 and 34: Avoiding Communication in Linear Algebra. Jim Demmel UC Berkeley bebop.cs.berkeley.edu

Minisymposia 9 and 34: Avoiding Communication in Linear Algebra. Jim Demmel UC Berkeley bebop.cs.berkeley.edu Minisymposia 9 and 34: Avoiding Communication in Linear Algebra Jim Demmel UC Berkeley bebop.cs.berkeley.edu Motivation (1) Increasing parallelism to exploit From Top500 to multicores in your laptop Exponentially

More information

Jordan Journal of Mathematics and Statistics (JJMS) 5(3), 2012, pp A NEW ITERATIVE METHOD FOR SOLVING LINEAR SYSTEMS OF EQUATIONS

Jordan Journal of Mathematics and Statistics (JJMS) 5(3), 2012, pp A NEW ITERATIVE METHOD FOR SOLVING LINEAR SYSTEMS OF EQUATIONS Jordan Journal of Mathematics and Statistics JJMS) 53), 2012, pp.169-184 A NEW ITERATIVE METHOD FOR SOLVING LINEAR SYSTEMS OF EQUATIONS ADEL H. AL-RABTAH Abstract. The Jacobi and Gauss-Seidel iterative

More information

have invested in supercomputer systems, which have cost up to tens of millions of dollars each. Over the past year or so, however, the future of vecto

have invested in supercomputer systems, which have cost up to tens of millions of dollars each. Over the past year or so, however, the future of vecto MEETING THE NVH COMPUTATIONAL CHALLENGE: AUTOMATED MULTI-LEVEL SUBSTRUCTURING J. K. Bennighof, M. F. Kaplan, y M. B. Muller, y and M. Kim y Department of Aerospace Engineering & Engineering Mechanics The

More information

c 2015 Society for Industrial and Applied Mathematics

c 2015 Society for Industrial and Applied Mathematics SIAM J. SCI. COMPUT. Vol. 37, No. 2, pp. C169 C193 c 2015 Society for Industrial and Applied Mathematics FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION EDMOND CHOW AND AFTAB PATEL Abstract. This paper

More information

Parallelization of Multilevel Preconditioners Constructed from Inverse-Based ILUs on Shared-Memory Multiprocessors

Parallelization of Multilevel Preconditioners Constructed from Inverse-Based ILUs on Shared-Memory Multiprocessors Parallelization of Multilevel Preconditioners Constructed from Inverse-Based ILUs on Shared-Memory Multiprocessors J.I. Aliaga 1 M. Bollhöfer 2 A.F. Martín 1 E.S. Quintana-Ortí 1 1 Deparment of Computer

More information

Numerical Linear Algebra

Numerical Linear Algebra Numerical Linear Algebra The two principal problems in linear algebra are: Linear system Given an n n matrix A and an n-vector b, determine x IR n such that A x = b Eigenvalue problem Given an n n matrix

More information

Block-tridiagonal matrices

Block-tridiagonal matrices Block-tridiagonal matrices. p.1/31 Block-tridiagonal matrices - where do these arise? - as a result of a particular mesh-point ordering - as a part of a factorization procedure, for example when we compute

More information

OUTLINE ffl CFD: elliptic pde's! Ax = b ffl Basic iterative methods ffl Krylov subspace methods ffl Preconditioning techniques: Iterative methods ILU

OUTLINE ffl CFD: elliptic pde's! Ax = b ffl Basic iterative methods ffl Krylov subspace methods ffl Preconditioning techniques: Iterative methods ILU Preconditioning Techniques for Solving Large Sparse Linear Systems Arnold Reusken Institut für Geometrie und Praktische Mathematik RWTH-Aachen OUTLINE ffl CFD: elliptic pde's! Ax = b ffl Basic iterative

More information

Parallel Iterative Methods for Sparse Linear Systems. H. Martin Bücker Lehrstuhl für Hochleistungsrechnen

Parallel Iterative Methods for Sparse Linear Systems. H. Martin Bücker Lehrstuhl für Hochleistungsrechnen Parallel Iterative Methods for Sparse Linear Systems Lehrstuhl für Hochleistungsrechnen www.sc.rwth-aachen.de RWTH Aachen Large and Sparse Small and Dense Outline Problem with Direct Methods Iterative

More information

problem Au = u by constructing an orthonormal basis V k = [v 1 ; : : : ; v k ], at each k th iteration step, and then nding an approximation for the e

problem Au = u by constructing an orthonormal basis V k = [v 1 ; : : : ; v k ], at each k th iteration step, and then nding an approximation for the e A Parallel Solver for Extreme Eigenpairs 1 Leonardo Borges and Suely Oliveira 2 Computer Science Department, Texas A&M University, College Station, TX 77843-3112, USA. Abstract. In this paper a parallel

More information

Incomplete Cholesky preconditioners that exploit the low-rank property

Incomplete Cholesky preconditioners that exploit the low-rank property anapov@ulb.ac.be ; http://homepages.ulb.ac.be/ anapov/ 1 / 35 Incomplete Cholesky preconditioners that exploit the low-rank property (theory and practice) Artem Napov Service de Métrologie Nucléaire, Université

More information

Chapter 7. Iterative methods for large sparse linear systems. 7.1 Sparse matrix algebra. Large sparse matrices

Chapter 7. Iterative methods for large sparse linear systems. 7.1 Sparse matrix algebra. Large sparse matrices Chapter 7 Iterative methods for large sparse linear systems In this chapter we revisit the problem of solving linear systems of equations, but now in the context of large sparse systems. The price to pay

More information

Numerical Linear Algebra

Numerical Linear Algebra Numerical Linear Algebra By: David McQuilling; Jesus Caban Deng Li Jan.,31,006 CS51 Solving Linear Equations u + v = 8 4u + 9v = 1 A x b 4 9 u v = 8 1 Gaussian Elimination Start with the matrix representation

More information

Key words. conjugate gradients, normwise backward error, incremental norm estimation.

Key words. conjugate gradients, normwise backward error, incremental norm estimation. Proceedings of ALGORITMY 2016 pp. 323 332 ON ERROR ESTIMATION IN THE CONJUGATE GRADIENT METHOD: NORMWISE BACKWARD ERROR PETR TICHÝ Abstract. Using an idea of Duff and Vömel [BIT, 42 (2002), pp. 300 322

More information

Numerical Programming I (for CSE)

Numerical Programming I (for CSE) Technische Universität München WT 1/13 Fakultät für Mathematik Prof. Dr. M. Mehl B. Gatzhammer January 1, 13 Numerical Programming I (for CSE) Tutorial 1: Iterative Methods 1) Relaxation Methods a) Let

More information

Bindel, Fall 2016 Matrix Computations (CS 6210) Notes for

Bindel, Fall 2016 Matrix Computations (CS 6210) Notes for 1 Iteration basics Notes for 2016-11-07 An iterative solver for Ax = b is produces a sequence of approximations x (k) x. We always stop after finitely many steps, based on some convergence criterion, e.g.

More information

Lab 1: Iterative Methods for Solving Linear Systems

Lab 1: Iterative Methods for Solving Linear Systems Lab 1: Iterative Methods for Solving Linear Systems January 22, 2017 Introduction Many real world applications require the solution to very large and sparse linear systems where direct methods such as

More information

Scientific Computing: An Introductory Survey

Scientific Computing: An Introductory Survey Scientific Computing: An Introductory Survey Chapter 2 Systems of Linear Equations Prof. Michael T. Heath Department of Computer Science University of Illinois at Urbana-Champaign Copyright c 2002. Reproduction

More information

Lecture Note 7: Iterative methods for solving linear systems. Xiaoqun Zhang Shanghai Jiao Tong University

Lecture Note 7: Iterative methods for solving linear systems. Xiaoqun Zhang Shanghai Jiao Tong University Lecture Note 7: Iterative methods for solving linear systems Xiaoqun Zhang Shanghai Jiao Tong University Last updated: December 24, 2014 1.1 Review on linear algebra Norms of vectors and matrices vector

More information

Solving Ax = b, an overview. Program

Solving Ax = b, an overview. Program Numerical Linear Algebra Improving iterative solvers: preconditioning, deflation, numerical software and parallelisation Gerard Sleijpen and Martin van Gijzen November 29, 27 Solving Ax = b, an overview

More information

Solving PDEs with CUDA Jonathan Cohen

Solving PDEs with CUDA Jonathan Cohen Solving PDEs with CUDA Jonathan Cohen jocohen@nvidia.com NVIDIA Research PDEs (Partial Differential Equations) Big topic Some common strategies Focus on one type of PDE in this talk Poisson Equation Linear

More information

Course Notes: Week 1

Course Notes: Week 1 Course Notes: Week 1 Math 270C: Applied Numerical Linear Algebra 1 Lecture 1: Introduction (3/28/11) We will focus on iterative methods for solving linear systems of equations (and some discussion of eigenvalues

More information

Conceptual Questions for Review

Conceptual Questions for Review Conceptual Questions for Review Chapter 1 1.1 Which vectors are linear combinations of v = (3, 1) and w = (4, 3)? 1.2 Compare the dot product of v = (3, 1) and w = (4, 3) to the product of their lengths.

More information

Lecture 18 Classical Iterative Methods

Lecture 18 Classical Iterative Methods Lecture 18 Classical Iterative Methods MIT 18.335J / 6.337J Introduction to Numerical Methods Per-Olof Persson November 14, 2006 1 Iterative Methods for Linear Systems Direct methods for solving Ax = b,

More information

In order to solve the linear system KL M N when K is nonsymmetric, we can solve the equivalent system

In order to solve the linear system KL M N when K is nonsymmetric, we can solve the equivalent system !"#$% "&!#' (%)!#" *# %)%(! #! %)!#" +, %"!"#$ %*&%! $#&*! *# %)%! -. -/ 0 -. 12 "**3! * $!#%+,!2!#% 44" #% &#33 # 4"!#" "%! "5"#!!#6 -. - #% " 7% "3#!#3! - + 87&2! * $!#% 44" ) 3( $! # % %#!!#%+ 9332!

More information

Algebraic Multigrid as Solvers and as Preconditioner

Algebraic Multigrid as Solvers and as Preconditioner Ò Algebraic Multigrid as Solvers and as Preconditioner Domenico Lahaye domenico.lahaye@cs.kuleuven.ac.be http://www.cs.kuleuven.ac.be/ domenico/ Department of Computer Science Katholieke Universiteit Leuven

More information

Solution of Linear Systems

Solution of Linear Systems Solution of Linear Systems Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico May 12, 2016 CPD (DEI / IST) Parallel and Distributed Computing

More information

Parallel Programming. Parallel algorithms Linear systems solvers

Parallel Programming. Parallel algorithms Linear systems solvers Parallel Programming Parallel algorithms Linear systems solvers Terminology System of linear equations Solve Ax = b for x Special matrices Upper triangular Lower triangular Diagonally dominant Symmetric

More information

Finite-choice algorithm optimization in Conjugate Gradients

Finite-choice algorithm optimization in Conjugate Gradients Finite-choice algorithm optimization in Conjugate Gradients Jack Dongarra and Victor Eijkhout January 2003 Abstract We present computational aspects of mathematically equivalent implementations of the

More information

Today s class. Linear Algebraic Equations LU Decomposition. Numerical Methods, Fall 2011 Lecture 8. Prof. Jinbo Bi CSE, UConn

Today s class. Linear Algebraic Equations LU Decomposition. Numerical Methods, Fall 2011 Lecture 8. Prof. Jinbo Bi CSE, UConn Today s class Linear Algebraic Equations LU Decomposition 1 Linear Algebraic Equations Gaussian Elimination works well for solving linear systems of the form: AX = B What if you have to solve the linear

More information

Sparse Linear Systems. Iterative Methods for Sparse Linear Systems. Motivation for Studying Sparse Linear Systems. Partial Differential Equations

Sparse Linear Systems. Iterative Methods for Sparse Linear Systems. Motivation for Studying Sparse Linear Systems. Partial Differential Equations Sparse Linear Systems Iterative Methods for Sparse Linear Systems Matrix Computations and Applications, Lecture C11 Fredrik Bengzon, Robert Söderlund We consider the problem of solving the linear system

More information

Review of Linear Algebra

Review of Linear Algebra Review of Linear Algebra Definitions An m n (read "m by n") matrix, is a rectangular array of entries, where m is the number of rows and n the number of columns. 2 Definitions (Con t) A is square if m=

More information

A Comparison of Parallel Solvers for Diagonally. Dominant and General Narrow-Banded Linear. Systems II.

A Comparison of Parallel Solvers for Diagonally. Dominant and General Narrow-Banded Linear. Systems II. A Comparison of Parallel Solvers for Diagonally Dominant and General Narrow-Banded Linear Systems II Peter Arbenz 1, Andrew Cleary 2, Jack Dongarra 3, and Markus Hegland 4 1 Institute of Scientic Computing,

More information

The amount of work to construct each new guess from the previous one should be a small multiple of the number of nonzeros in A.

The amount of work to construct each new guess from the previous one should be a small multiple of the number of nonzeros in A. AMSC/CMSC 661 Scientific Computing II Spring 2005 Solution of Sparse Linear Systems Part 2: Iterative methods Dianne P. O Leary c 2005 Solving Sparse Linear Systems: Iterative methods The plan: Iterative

More information

Review: From problem to parallel algorithm

Review: From problem to parallel algorithm Review: From problem to parallel algorithm Mathematical formulations of interesting problems abound Poisson s equation Sources: Electrostatics, gravity, fluid flow, image processing (!) Numerical solution:

More information

MTH Linear Algebra. Study Guide. Dr. Tony Yee Department of Mathematics and Information Technology The Hong Kong Institute of Education

MTH Linear Algebra. Study Guide. Dr. Tony Yee Department of Mathematics and Information Technology The Hong Kong Institute of Education MTH 3 Linear Algebra Study Guide Dr. Tony Yee Department of Mathematics and Information Technology The Hong Kong Institute of Education June 3, ii Contents Table of Contents iii Matrix Algebra. Real Life

More information

LU Factorization. Marco Chiarandini. DM559 Linear and Integer Programming. Department of Mathematics & Computer Science University of Southern Denmark

LU Factorization. Marco Chiarandini. DM559 Linear and Integer Programming. Department of Mathematics & Computer Science University of Southern Denmark DM559 Linear and Integer Programming LU Factorization Marco Chiarandini Department of Mathematics & Computer Science University of Southern Denmark [Based on slides by Lieven Vandenberghe, UCLA] Outline

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 23: GMRES and Other Krylov Subspace Methods; Preconditioning

AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 23: GMRES and Other Krylov Subspace Methods; Preconditioning AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 23: GMRES and Other Krylov Subspace Methods; Preconditioning Xiangmin Jiao SUNY Stony Brook Xiangmin Jiao Numerical Analysis I 1 / 18 Outline

More information

Enhancing Scalability of Sparse Direct Methods

Enhancing Scalability of Sparse Direct Methods Journal of Physics: Conference Series 78 (007) 0 doi:0.088/7-6596/78//0 Enhancing Scalability of Sparse Direct Methods X.S. Li, J. Demmel, L. Grigori, M. Gu, J. Xia 5, S. Jardin 6, C. Sovinec 7, L.-Q.

More information

Parallel Sparse Matrix Vector Multiplication (PSC 4.3)

Parallel Sparse Matrix Vector Multiplication (PSC 4.3) Parallel Sparse Matrix Vector Multiplication (PSC 4.) original slides by Rob Bisseling, Universiteit Utrecht, accompanying the textbook Parallel Scientific Computation adapted for the lecture HPC Algorithms

More information