
APPARC PaA3a Deliverable
ESPRIT BRA III Contract # 6634

Reordering of Sparse Matrices for Parallel Processing

Achim Basermann, Peter Weidner
Zentralinstitut für Angewandte Mathematik, KFA Jülich GmbH, D-Jülich, Germany
A.Basermann@kfa-juelich.de, P.Weidner@kfa-juelich.de

Per Christian Hansen, Tzvetan Ostromsky
UNI-C, Danish Computing Center for Research and Education, Building 305, Technical University of Denmark, DK-2800 Lyngby, Denmark
Per.Christian.Hansen@uni-c.dk, Tzvetan.Ostromsky@uni-c.dk

Zahari Zlatev
Danish Environmental Research Institute, Frederiksborgvej 399, DK-4000 Roskilde, Denmark
luzz@sun2.dmu.min.dk

February 28, 1994

In addition to the support from the EEC ESPRIT Basic Research Action Programme, Project 6634 (APPARC), Tz. Ostromsky was supported by Danish Government Scholarship # ..., and Achim Basermann was supported by the Graduiertenkolleg `Informatik und Technik', RWTH Aachen, Germany.

Contents

1 Introduction
  1.1 The Need for Reordering Algorithms
  1.2 Organization of the Report

2 Case Study I: Iterative Methods
  2.1 Theoretical Background
    2.1.1 The Method of Conjugate Gradients
    2.1.2 The Lanczos Tridiagonalization
  2.2 Storage Scheme
  2.3 Parallelization
    2.3.1 Data Distribution
    2.3.2 Reordering and Communication Scheme
  2.4 Results
    2.4.1 Numerical Results
    2.4.2 Performance Results

3 Case Study II: Direct Methods
  3.1 The Locally Optimized Reordering Algorithm
  3.2 Implementation of LORA by Means of Binary Trees
  3.3 Using the Reordered Matrix in the Solution Process
    Step 1 - Reorder the matrix
    Step 2 - Partition the matrix (data distribution)
    Step 3 - Perform the first phase of the factorization
    Step 4 - Perform the second phase of the factorization
    Step 5 - Carry out the second reordering
    Step 6 - Perform the third phase of the factorization
    Step 7 - Find a first solution (back substitution)
    Step 8 - Improve the first solution by a modified preconditioned orthomin algorithm
  3.4 Stability Considerations
  3.5 Numerical Results

4 Conclusion

1 Introduction

The present report summarizes our work on reordering algorithms for sparse matrix computations on parallel computers. During our investigations in Work Package PaA2 [11], it became clear to us that the use of "alternative algorithms" for dense matrix computations is fairly limited in connection with high-performance parallel computers (such algorithms may still be useful on other architectures, e.g., systolic arrays). Hence, it was decided in the present Work Package PaA3a to switch the focus to reordering algorithms for sparse matrices. This work is strongly connected to the future Work Packages PaA4, PaA5a, PaA5b and PaA6a.

1.1 The Need for Reordering Algorithms

Reordering of sparse matrices is essential for good performance on parallel computers. In fact, reordering is often crucial even on a sequential computer, but on a parallel computer a good reordering algorithm can lead to a much better load balance and thus to a dramatic increase in performance compared to a "naive" ordering. We remark that in many circumstances the reordering is only necessary once, as long as the structure of the matrix does not change; this is the case, for example, when partial differential equations are solved by the finite element method. The reason is that the reordering depends solely on the structure of the matrix, not on the numerical values of the matrix elements.

It is important to realize that the matrix reordering is tightly connected to both the data distribution and the communication scheme used on the parallel computer. On distributed memory machines, the sparse matrix must be distributed to the processors. Criteria for the data distribution can be load balancing and as little data transfer as possible, i.e., as many local computations as possible. It is usually difficult to achieve both load balancing and minimum data transfer. By reordering the matrix, the number of local computations can be increased and the data exchange decreased. Therefore, the matrix reordering influences the data distribution as well as the communication scheme; it further depends on the data structure of the sparse matrix. Moreover, reordering the matrix can make it possible to overlap communication and local computations; in this case, the order of the computations is suitably changed.

The underlying idea in our work is that we will accept a fairly advanced reordering algorithm, as long as it ensures good parallel performance of the core algorithm. Hence, it is acceptable that the reordering algorithm is somewhat slower than a simple scheme, provided that the gain in total execution time is high enough when switching to the more advanced algorithm.

Some of the work in this report is based on ideas from graph theory. Graph theory has often been used in sparse matrix studies, especially in connection with symmetric positive definite systems [31]. However, the use of graph theory in connection with general sparse matrices is not as widespread, although some applications exist, based on bipartite graphs; see, e.g., [32]. The application used in this work is different from the above-mentioned applications.

1.2 Organization of the Report

We will illustrate the design and use of reordering algorithms in connection with both iterative and direct parallel methods for sparse matrices.

For iterative methods, the biggest challenge is to implement these methods on distributed memory computers, where message passing can introduce a serious overhead. The "standard iterations", such as Gauss-Seidel, SOR, SSOR, etc., all have fairly slow convergence in general. Therefore, as our case study we consider in Section 2 a message passing implementation of the conjugate gradient (CG) method for the solution of sparse symmetric positive definite systems of equations. We will also consider the computation of eigenvalues and eigenvectors by means of the Lanczos algorithm for solving large sparse symmetric eigenproblems, which has a close theoretical connection to the CG method [14]. We apply alternative forms of the CG and Lanczos algorithms which have better parallelization properties than the standard forms, and we illustrate that the developed data distribution and communication scheme, together with the reordering of the sparse matrix, have a dramatic influence on the execution times.

For direct methods the problem of matrix reordering is more complex, because the reordering now influences the numerical accuracy of the computed solution. Therefore, in Section 3 we concentrate on shared memory computers, because we must understand the situation on this type of computer before we move on to message passing computers. As our case study we consider the direct solution of general sparse systems of equations by means of Gaussian elimination. In particular, we describe a new reordering algorithm, LORA, which is based on graph theory (in particular, we introduce a new class of binary trees), and we illustrate its use in connection with coarse-grained parallel Gaussian elimination. The LORA algorithm tries to reorder the matrix so that it is close to upper triangular form. We also show that there is a trade-off between fast parallel execution and numerical stability and, as a consequence, extra iterative improvement steps may be required.

Finally, in Section 4 we summarize our results and point to future research.

2 Case Study I: Iterative Methods

For the analysis and solution of discretized ordinary or partial differential equations it is necessary to solve systems of equations or eigenproblems with coefficient matrices of different sparsity patterns, depending on the discretization method. In many cases, the use of the finite element (FE) method results in largely unstructured systems of equations. Sparse eigenproblems play an important role in the analysis of elastic solids and structures [13] [37] [45]. In the corresponding FE models, the natural frequencies and mode shapes of free vibration are determined, as are buckling loads and modes. Another class of problems is related to stability analysis, e.g. of electrical networks. Moreover, approximations of extreme eigenvalues are useful for solving sets of linear equations, e.g. for determining condition numbers of symmetric positive definite matrices or for conjugate gradient methods with polynomial preconditioning [9].

The main computational effort in iterative methods for solving linear systems and eigenproblems consists of matrix-vector products and vector-vector operations; the main work in each iteration is usually the computation of matrix-vector products. Therein, the access to the vector is determined by the sparsity pattern and the storage scheme of the matrix. For parallelizing iterative solvers on a multiprocessor system with distributed memory, the data distribution and the communication scheme (which depend on the data structures used for sparse matrices) are of greatest importance for efficient execution. In this context, different reordering strategies for the sparse matrix have been investigated in order to reduce waiting times by overlapping communication and computation. Additionally, the reverse Cuthill-McKee scheme [47] is applied to diminish the bandwidth of the matrix. Depending on the sparsity pattern of the matrix, bandwidth reduction results in a considerable decrease of communication.

The data distribution and the communication scheme are determined before the execution of the solver by preprocessing the symbolic structure of the sparse matrix and are exploited in each iteration. The schemes can be reused as long as the sparsity pattern of the matrix (which is determined by the discretization mesh and the element types) does not change. For example, they can be used in each time step of a time-dependent problem or in each iterative step of a nonlinear problem which is solved by linearization.

In this section, data distribution and communication schemes are presented which are based on the analysis of the column indices of the non-zero matrix elements. Performance tests using the conjugate gradient algorithm (CG) with preconditioning [36] [42] for solving systems of equations and the Lanczos tridiagonalization [14] [15] [43] [51] [52] as the first step for solving eigenproblems have been carried out on the distributed memory system INTEL iPSC/860 of the Research Centre Jülich with sparse matrices from two FE models. The first FE model comes from environmental science; it simulates the behavior of pollutants in geological systems [1] [49]. In the second FE model, from structural mechanics, stresses in materials induced by thermal expansion are calculated by applying the FE program SMART [2].

2.1 Theoretical Background

The variant of the CG algorithm suggested in [10], and the modified Lanczos tridiagonalization from [38], have been employed in our investigations.
The main difference between the original and the modified algorithms is that in the modified versions all dot products are computed without any operations in between. If each iteration is performed in parallel on a distributed memory system, the local values of the dot products can therefore be included in one message for determining the global values.
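To make this concrete, the following minimal sketch (using mpi4py and NumPy; the variable and function names are ours, not taken from the report) combines the local contributions of two dot products into a single global reduction, so that only one synchronization point per iteration is needed:

```python
# Minimal sketch of combining two dot products into one global reduction
# (mpi4py assumed; variable names are hypothetical, not from the report).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def fused_dot_products(g_loc, d_loc, Ad_loc):
    # Local contributions of both dot products needed by the modified CG step.
    buf = np.array([g_loc @ g_loc, d_loc @ Ad_loc])
    # One collective call delivers both global values to every processor.
    comm.Allreduce(MPI.IN_PLACE, buf, op=MPI.SUM)
    return buf[0], buf[1]   # global g^T g and d^T A d
```

In the standard formulations the two reductions would be separated by vector updates and would therefore require two separate communication steps.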

2.1.1 The Method of Conjugate Gradients

The method of conjugate gradients [36] is an algorithm for solving systems of linear equations Ax = b, particularly for sparse coefficient matrices A. The method applies to matrices which are symmetric and positive definite. Aykanat et al. [10] suggested a modified CG algorithm which is described as follows:

Algorithm 2.1. The modified CG method

    Choose an arbitrary x_0 ∈ R^n; g_0 = A x_0 - b; d_0 = -g_0.
    For i = 0, 1, ...:
        α_i = (g_i^T g_i) / (d_i^T A d_i)
        β_i = α_i (A d_i)^T (A d_i) / (d_i^T A d_i) - 1
        g_{i+1}^T g_{i+1} = β_i g_i^T g_i
        x_{i+1} = x_i + α_i d_i
        g_{i+1} = g_i + α_i A d_i
        d_{i+1} = -g_{i+1} + β_i d_i
    until ||g_{i+1}||_2 ≤ ε_r.

In each iteration, the vectors x_i, g_i, and d_i are computed: x_i approximates the solution vector, g_i is the residual, and d_i determines the direction in which the next approximation of the solution vector is searched for. For some sparse matrices, the main work in each iteration consists solely of the computation of the matrix-vector product A d_i; for other sparse matrices, this work is comparable with the work involved in the computation of inner products and saxpys. The iteration is continued until the Euclidean norm of the residual is less than or equal to ε_r. Another stopping criterion, which uses the maximum scaled absolute difference of the components of the latest two approximations of the solution vector, is determined as follows:

    max_{j=1,...,n}  2 |x_{i+1}^j - x_i^j| / (|x_{i+1}^j| + |x_i^j|)  ≤  ε_s.        (2.1)

In the investigations, Algorithm 2.1 has been performed with and without diagonal scaling [42], a simple preconditioner which hardly contributes to the total execution time but usually accelerates the convergence considerably.
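As an illustration, here is a minimal sequential NumPy sketch of Algorithm 2.1 (function and variable names are ours); the parallel version distributes the vectors and the matrix-vector product as described in Section 2.3:

```python
# Sequential sketch of Algorithm 2.1 (modified CG); names are illustrative only.
import numpy as np

def modified_cg(A, b, x0, eps_r=1e-10, max_it=1000):
    x = x0.copy()
    g = A @ x - b            # residual
    d = -g                   # search direction
    for _ in range(max_it):
        Ad = A @ d
        gg = g @ g
        dAd = d @ Ad
        alpha = gg / dAd
        # beta is available before x and g are updated, so both dot
        # products (gg and dAd) can be reduced in a single message.
        beta = alpha * (Ad @ Ad) / dAd - 1.0
        gg_next = beta * gg  # = g_{i+1}^T g_{i+1}, no extra dot product needed
        x += alpha * d
        g += alpha * Ad
        d = -g + beta * d
        if np.sqrt(max(gg_next, 0.0)) <= eps_r:
            break
    return x
```

For a symmetric positive definite A this is algebraically equivalent to the standard CG method; only the order of the operations differs, which is what allows the two dot products to be fused into one message.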

2.1.2 The Lanczos Tridiagonalization

Lanczos methods are most commonly used to approximate a small number of extreme eigenvalues and eigenvectors of a real symmetric large sparse matrix [14] [15] [43] [51] [52]. The principle of the modified Lanczos tridiagonalization from [38] is described in the following algorithm.

Algorithm 2.2. The modified Lanczos tridiagonalization

    Choose an arbitrary r_0 with r_0 ≠ 0 and set q_0 = 0.
    For i = 1, 2, ...:
        β_i = ||r_{i-1}||_2
        α_i = (r_{i-1}^T A r_{i-1}) / (r_{i-1}^T r_{i-1})
        q_i = r_{i-1} / β_i
        r_i = A r_{i-1} / β_i - β_i q_{i-1} - α_i q_i

In this method, a vector sequence q_i ∈ R^n, i = 1, 2, 3, ..., and a sequence of i × i symmetric tridiagonal matrices T_i, i = 1, 2, 3, ..., are generated by an iterative process starting with an n × n real symmetric matrix A and an initial residual vector r_0 ∈ R^n. The orthonormal vectors q_i ∈ R^n, i = 1, 2, 3, ..., are called the Lanczos vectors, and the symmetric tridiagonal matrices T_i the Lanczos matrices. The matrices T_i have the following form, with α_m, m = 1, 2, ..., i, as the diagonal elements and β_m, m = 2, 3, ..., i, as the off-diagonal elements:

    T_i = | α_1  β_2                          |
          | β_2  α_2  β_3                     |
          |      ...  ...  ...                |
          |           β_{i-1}  α_{i-1}  β_i   |
          |                    β_i      α_i   |

In Algorithm 2.2, ||r_{i-1}||_2 = (r_{i-1}^T r_{i-1})^{1/2} denotes the Euclidean norm. The main work in each iteration consists in the computation of the matrix-vector product A r_{i-1} and, in some situations, also in the vector-vector operations.
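A corresponding sequential NumPy sketch of Algorithm 2.2 is given below (illustrative names; reorthogonalization and the eigenvalue extraction from T_i are omitted):

```python
# Sequential sketch of Algorithm 2.2 (modified Lanczos tridiagonalization).
# Names are illustrative; no reorthogonalization is performed.
import numpy as np

def lanczos_tridiag(A, r0, num_steps):
    n = len(r0)
    q_prev = np.zeros(n)
    r = r0.copy()
    alphas, betas = [], []
    for _ in range(num_steps):
        beta = np.linalg.norm(r)            # beta_i = ||r_{i-1}||_2
        Ar = A @ r
        alpha = (r @ Ar) / (r @ r)          # alpha_i = r^T A r / r^T r
        q = r / beta                        # new Lanczos vector
        r = Ar / beta - beta * q_prev - alpha * q
        q_prev = q
        alphas.append(alpha)
        betas.append(beta)
    # T_i is the symmetric tridiagonal matrix with alphas on the diagonal
    # and betas[1:] on the off-diagonals.
    return np.array(alphas), np.array(betas)
```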

2.2 Storage Scheme

Storage schemes for large sparse matrices depend on the sparsity pattern of the matrix, the considered algorithm, and the architecture of the computer system used. In the literature, many variants of storage schemes can be found [21] [22] [40] [41] [44] [46]. The storage scheme considered here is often used in FE programs and is suitable for regular as well as for irregular discretization meshes. It can be found in similar form, e.g., in [40]. The scheme is illustrated in (2.4) for matrix (2.3). The non-zeros of matrix (2.3) are stored row-wise in three one-dimensional arrays: a_w contains the values of the non-zeros, a_s the corresponding column indices, and in a_z the position of the beginning of each row in a_w and a_s is stored. The subdivisions in a_w and a_s have been added to mark the beginning of a new row. The order of the matrix elements per row in a_w and a_s is different from that in matrix (2.3), since this is usually the case in FE programs due to the assembly of the coefficient matrix from the single elements.

    A = [example 8 × 8 sparse matrix; entries not recoverable from the source]        (2.3)

    a_w = (1 | 9 2 | ... | 18 8),
    a_s = (1 | 3 2 | ... | 4 8),        (2.4)
    a_z = (...).

2.3 Parallelization

2.3.1 Data Distribution

For parallelizing Algorithms 2.1 and 2.2 on a distributed memory system, the matrix and vector arrays must be suitably distributed to the processors. For the considered data distribution schemes, the arrays a_w and a_s are distributed row-wise; the rows of each processor succeed one another. The distribution of the vector arrays corresponds component-wise to the row distribution of the matrix arrays. Criteria for the data distribution can be: each processor gets the same number of rows, or so many rows that each processor has nearly the same number of non-zeros. The number of operations for the computation of the matrix-vector product is proportional to the number of non-zeros; the remaining vector operations of one iteration are proportional to the number of rows. The criterion considered here is that each processor has to compute nearly the same number of operations. If the discretization mesh is regular, i.e., the sparsity pattern of the coefficient matrix is regular, then all three criteria result in nearly the same data distribution. If the mesh is very irregular, the three distributions differ considerably. Our algorithm for distributing the rows onto the processors in the latter case can be described as follows (a small code sketch of this criterion is given below).

The row distribution is determined by analyzing the array a_z. Let n_k denote the number of rows assigned to processor k, and let p denote the number of processors. Also, let e_k denote the number of non-zeros assigned to processor k, and let e denote the total number of non-zeros in the matrix. Then processor k is assigned rows until the following requirement, involving a machine-dependent weighting parameter σ, is satisfied for the first time:

    (e_k + σ n_k) / (e + σ n)  ≥  1/p,   for e_k, n_k ≫ 10.        (2.5)

The parameter σ depends firstly on the number of vector operations which are additional to the operations of the matrix-vector product in each iteration. Secondly, it reflects the execution times of arithmetical, logical, and memory operations on the processor used; it is therefore dependent on the processor architecture. The numerator in (2.5) is proportional to the number of operations of one partial iteration on processor k; the denominator is proportional to the total number of operations of one iteration. It should be remarked that for σ → 0 each processor gets nearly the same number of non-zeros, and for σ → ∞ nearly the same number of rows. The first case means that the execution time of all vector-vector operations is negligible compared with the execution time of the matrix-vector product. In the second case, the execution time of the matrix-vector product hardly contributes to the total execution time.
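The following sketch (with illustrative names) builds the three arrays a_w, a_s, a_z for a given matrix and assigns contiguous row blocks to p processors according to criterion (2.5); the parameter sigma plays the role of σ:

```python
# Sketch of the row-wise storage scheme (a_w, a_s, a_z) and of the row
# distribution criterion (2.5). Names are illustrative only.
import numpy as np

def to_row_storage(A):
    """Store the non-zeros of A row-wise: values, column indices, row starts."""
    a_w, a_s, a_z = [], [], []
    for row in A:
        a_z.append(len(a_w))             # position where this row begins
        for j, v in enumerate(row):
            if v != 0:
                a_w.append(v)
                a_s.append(j)            # 0-based column index
    a_z.append(len(a_w))                 # one past the last row
    return np.array(a_w), np.array(a_s), np.array(a_z)

def distribute_rows(a_z, p, sigma):
    """Assign contiguous row blocks to p processors following criterion (2.5)."""
    n = len(a_z) - 1                     # number of rows
    e = a_z[-1]                          # total number of non-zeros
    bounds, first = [], 0
    for _ in range(p - 1):
        e_k = n_k = 0
        row = first
        # add rows until (e_k + sigma*n_k) / (e + sigma*n) >= 1/p
        while row < n and (e_k + sigma * n_k) * p < (e + sigma * n):
            e_k += a_z[row + 1] - a_z[row]
            n_k += 1
            row += 1
        bounds.append((first, row))
        first = row
    bounds.append((first, n))            # last processor gets the remainder
    return bounds
```

The same σ reappears in the estimate (2.6) below.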

With these considerations, the contribution of the matrix-vector product to one iteration can be approximated by

    a_MVP ≈ e / (e + σ n) = 1 / (1 + σ/m_z),   for e, n ≫ 10.        (2.6)

Here, m_z = e/n is the mean number of non-zeros per row. Additionally, (2.6) provides a means for measuring σ: if a_MVP is determined by timings, then an approximation of σ can be computed from

    σ ≈ m_z (1/a_MVP - 1).        (2.7)

On the INTEL i860XR, the timings result in an approximate value of σ of about 8.3 for the CG method and of about 2 for the Lanczos tridiagonalization.

The data distribution according to criterion (2.5) is shown in (2.8) by distributing matrix (2.3) to four processors. The other arrays are distributed analogously. For this simple small example, the data distribution is the same for both the CG method and the Lanczos tridiagonalization. For large sparse matrices from FE applications, the data distributions usually vary for the considered algorithms because of the different values of σ.

    Processor 0: a_w^0 = (1 | 9 2 | ...),
    Processor 1: a_w^1 = (... | ...),        (2.8)
    Processor 2: a_w^2 = (... | ...),
    Processor 3: a_w^3 = (18 8).

2.3.2 Reordering and Communication Scheme

On a distributed memory system, the computation of the matrix-vector product requires communication because each processor owns only a partial vector. For the efficient computation of the matrix-vector product, it is necessary to develop a suitable communication scheme by preprocessing the distributed column index arrays. Here, we describe two different schemes based on two different reorderings of the matrix.

First, the arrays a_s^k are analyzed on each processor k to determine which data result in accesses to components of d_i owned by other processors. Then, a_s^k and a_w^k are reordered in such a way that the data which result in accesses to processor h are collected in block h. The data of block h succeed one another row-wise, with increasing column index per row. Block k is the first block in a_s^k and a_w^k and contains the data which result in local accesses. The goal of the reordering is to overlap computation and communication. The first reordering scheme is shown in (2.9) for the data distribution from (2.8) and the matrix-vector product A d_i from Algorithm 2.1. Here, merely array a_s^1 is analyzed and reordered.

    Processor 0: a_s^0 = (1 | 3 2 | 4 2 3),   d_{i,0} = (d_i^1 d_i^2 d_i^3)
    Processor 1: a_s^1 = (... | ...),          d_{i,1} = (d_i^4 d_i^5)
    Processor 2: a_s^2 = (7 4 6 | ...),        d_{i,2} = (d_i^6 d_i^7)        (2.9)
    Processor 3: a_s^3 = (4 8),                d_{i,3} = (d_i^8)

    Reordering: a_s^1 = (4 5 | 4 5 || 3 || 6 7 | 7 || 8)   [blocks 1, 0, 2, 3]

Computing the operation row-times-vector of the matrix-vector product on processor 1, the index 3 results in an access to component d_i^3 of processor 0, the index 8 in an access to d_i^8 of processor 3, and the indices 6 and 7 in accesses to d_i^6 and d_i^7 of processor 2. The data blocks in (2.9) are separated by double bars for elucidation; the block numbers are given in brackets after the reordered array. After reordering, the data of block 1 result in local accesses, the data of block 0 in accesses to processor 0, the data of block 2 in accesses to processor 2, and the data of block 3 in accesses to processor 3.

After having analyzed the column index array a_s^k, each processor k knows which components of d_i are required by which processors. This information is broadcast to all processors. Then, each processor can decide which data must be sent to which processors. This communication scheme is determined once before starting the parallel CG algorithm or Lanczos tridiagonalization and applies unchanged to each iteration. The communication scheme for the example discussed before is displayed in Fig. 2.1: processor 1 receives the third component of d_i from processor 0, the sixth and seventh components from processor 2, and the eighth component from processor 3. On the other hand, the fourth component of processor 1 is sent to processor 0, the fourth and fifth to processor 2, and the fourth to processor 3.

[Figure 2.1: Communication scheme, reordering 1]

In Fig. 2.2, the parallel computation of the matrix-vector product is described for both considered algorithms. First, on each processor, the data which are necessary for other processors are sent asynchronously. After asynchronous receive-routines for the non-local data have been posted, all local computations are performed, in particular the local part of the matrix-vector product. Then each processor waits until the data of an arbitrary processor arrive and continues the computation of the matrix-vector product. Thereafter, each processor awaits the data of the other processors until the computation of the matrix-vector product is complete. Computation and communication are performed overlapped: while required data are on the network, operations with local or already arrived data of other processors are executed.

[Figure 2.2: The parallel matrix-vector product, reordering 1]
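The sketch below mirrors the flow of Fig. 2.2 for the first reordering scheme, written with mpi4py instead of the message-passing routines of the iPSC/860 used for the reported experiments; all data structures and names are illustrative assumptions, not the report's code:

```python
# Sketch of the overlapped matrix-vector product y = A*d for reordering 1
# (mpi4py assumed; all data structures and names are illustrative only).
import numpy as np
from mpi4py import MPI

def overlapped_matvec(comm, blocks, recv_len, send_map, d_local, y_local):
    """blocks[h] = (rows, cols, vals): the local matrix entries whose column
    indices refer to d-components owned by processor h; for h != me the cols
    are assumed to index into the buffer received from h. send_map[h] lists
    the local d-indices requested by processor h; recv_len[h] is the length
    of the buffer expected from h. y_local is accumulated in place."""
    me = comm.Get_rank()
    # 1. Asynchronously send the locally owned components needed elsewhere.
    send_bufs = {h: np.ascontiguousarray(d_local[idx]) for h, idx in send_map.items()}
    send_reqs = [comm.Isend(buf, dest=h, tag=0) for h, buf in send_bufs.items()]
    # 2. Post asynchronous receives for the non-local components.
    recv_bufs = {h: np.empty(recv_len[h]) for h in blocks if h != me}
    owners = list(recv_bufs)
    recv_reqs = [comm.Irecv(recv_bufs[h], source=h, tag=0) for h in owners]
    # 3. Local part of the matrix-vector product while messages are in transit.
    for r, c, v in zip(*blocks[me]):
        y_local[r] += v * d_local[c]
    # 4. Continue with each block as soon as its data have arrived.
    while recv_reqs:
        k = MPI.Request.Waitany(recv_reqs)
        h = owners.pop(k)
        recv_reqs.pop(k)
        for r, c, v in zip(*blocks[h]):
            y_local[r] += v * recv_bufs[h][c]
    MPI.Request.Waitall(send_reqs)
    return y_local
```

As reported in Section 2.4.2, this overlap of communication and computation reduces the execution time per CG iteration by nearly 20% on 32 processors.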

In the second reordering scheme, the data blocks, built as discussed before, are sent to the processors which own the corresponding components of the vector of the matrix-vector product. The goal is to increase the number of local computations while required data are on the network. In this case, the processors compute partial results of the result vector of the matrix-vector product. Then, y_{k,l} denotes the partial result of y_k = A_k d_i of processor k which is computed on processor l. After the computation, the partial results except the local one are sent to the corresponding processors and added to the local result of the receiving processor. The new distribution of the matrix data is presented in (2.10).

    Processor 0: a_s^0 = (1 | 2 3 | 2 3 || 3),              d_{i,0} = (d_i^1 d_i^2 d_i^3)
    Processor 1: a_s^1 = (4 5 | 4 5 || 4 || 4 | 4 5 || 4),  d_{i,1} = (d_i^4 d_i^5)
    Processor 2: a_s^2 = (6 7 | 6 7 || 6 7 | 7),            d_{i,2} = (d_i^6 d_i^7)        (2.10)
    Processor 3: a_s^3 = (8 || 8),                           d_{i,3} = (d_i^8)

The blocks in (2.10) are labelled by two numbers: the first number denotes the processor to which the partial result is sent, and the second number indicates the processor on which the computation is performed. On processor 0 the blocks are (0,0) and (1,0); on processor 1 they are (1,1), (0,1), (2,1) and (3,1); on processor 2 they are (2,2) and (1,2); and on processor 3 they are (3,3) and (1,3). Processor 1, e.g., computes the local result y_{1,1} with the first block, the partial result y_{0,1} of processor 0 with the second block, the partial result y_{2,1} of processor 2 with the third block, and the partial result y_{3,1} of processor 3 with the fourth block.

Fig. 2.3 shows the communication scheme for the block distribution from (2.10): processor 1 sends a value to processor 0, and this value is added to the third component of y_0; simultaneously, processor 1 receives a value from processor 0 which must be added to the first component of y_1.

[Figure 2.3: Communication scheme, reordering 2]

[Figure 2.4: The parallel matrix-vector product, reordering 2]

In Fig. 2.4, the parallel computation of the matrix-vector product is presented for the second reordering scheme. First, asynchronous receive-routines for receiving all necessary partial results of other processors are posted on each processor. After that, each processor computes the partial results which are to be sent to other processors. The computation is performed per data block; the results are sent asynchronously to the corresponding processors after each computation. Then, all local computations are performed, in particular the computation of the local part of the matrix-vector product. Thereafter, each processor waits until the data of an arbitrary processor arrive and then adds the values to the corresponding components of the local result. This is repeated until the computation of the matrix-vector product is complete. Computation and communication are performed overlapped. Since partial results of the matrix-vector product are exchanged, most computations are local; after having received non-local data, merely a summation of vector components is necessary. The disadvantage of this method is that load balancing is no longer guaranteed after the new distribution of the blocks; some processors may own more or larger data blocks than others. However, this scheme allows arbitrary data distributions; each processor can get arbitrary parts of arbitrary rows, which need not succeed one another. For a specific FE application, suitable data distributions for this scheme can be found by considering the discretization mesh. The data distribution and the communication scheme presented here do not require any knowledge about a specific discretization mesh; the schemes are determined automatically by analyzing the column indices of the non-zero matrix elements.

2.4 Results

The numerical and performance tests of the developed parallel CG method and the parallel Lanczos tridiagonalization have been performed on the distributed-memory system iPSC/860 of the Research Centre Jülich. The INTEL computer system has 32 processors with 16 Megabytes of private memory each, interconnected by a hypercube network. The maximum transfer rate is 2.8 Megabytes/second per channel in both directions.

2.4.1 Numerical Results

The tests presented here have been carried out with one matrix each from the FE models from environmental science and structural mechanics. In Table 2.1, numerical data of the coefficient matrices and for the convergence of the CG method are shown. The matrix from environmental science has ... rows, that from structural mechanics ... rows. In the first case, the mean number of non-zeros per row is near the maximum number; this is caused by a regular discretization mesh. For the second case, the mean and the maximum number are considerably different; the discretization mesh is much more irregular. The operational contribution of the matrix-vector product to one iteration is 75% for the matrix from environmental science and 95% for the matrix from structural mechanics in the case of the CG method; in the case of the Lanczos tridiagonalization, the values are 93% and 99%.

In Table 2.1, the number of CG iterations with and without diagonal scaling is given. The iteration is stopped when the maximum scaled absolute difference from (2.1) is less than 10^{-5}; this corresponds to a precision of the solution vector of about five decimals. With diagonal scaling, the number of iterations is considerably smaller in both cases. The contribution of this preconditioner to the total execution time is in both cases below 1%.
For the preconditioned method, the Euclidean norm of the residual after 84 and 658 iterations, respectively, is given in addition. The sparsity patterns of both matrices are shown in Fig. 2.5. The matrix from environmental science has essentially band structure with a maximum bandwidth of ...

[Figure 2.5: Top: sparsity patterns of the matrices from environmental science (left) and structural mechanics (right). Bottom: the same matrices with bandwidth reduction.]

                                             Environmental science   Structural mechanics
    Rows                                     ...                      ...
    Non-zeros                                ...                      ...
    Density                                  0.05%                    0.6%
    Non-zeros per row, max                   ...                      ...
    Non-zeros per row, mean m_z              ...                      ...
    a_MVP, CG method                         75%                      95%
    a_MVP, Lanczos tridiagonalization        93%                      99%
    CG method (max. scal. abs. diff. ≤ 10^{-5}):
    Iterations without scaling               ...                      ...
    Iterations with scaling                  84                       658
    ||g_{i+1}||_2                            4.5 · 10^{-4}            1.5 · 10^{-5}

    Table 2.1: Numerical data of the considered large sparse matrices.

The matrix from structural mechanics has a much more irregular structure; its maximum bandwidth is ... . By reducing the bandwidth of the matrices, the communication overhead in each iteration of both considered algorithms can be reduced. Since communication is necessary for the operation row times vector of the matrix-vector product, a smaller bandwidth results in a smaller message length or even in communication with fewer processors. Here, the matrix is reordered by the reverse Cuthill-McKee (RCM) scheme [47]. In FE models, this scheme is frequently used for the assembly of the coefficient matrix; it is performed merely once if the mesh does not change, whereas in many cases equation systems or eigenproblems are solved frequently, e.g. in each time step of a time-dependent problem. Fig. 2.5 also shows the sparsity patterns of both matrices with bandwidth reduction. For the matrix from environmental science, the bandwidth is reduced by 45%; the maximum bandwidth is then ... . The maximum bandwidth of the matrix from structural mechanics after applying the reverse Cuthill-McKee scheme is 2989; this is a reduction of merely 14%.

2.4.2 Performance Results

In the first two investigations, bandwidth reduction has not been applied to the matrices. Fig. 2.6 shows execution times per iteration of the CG method on 32 processors for both presented reordering schemes. For the matrix from environmental science, using reordering 1 or reordering 2 results in almost the same execution times; this is caused by the regular discretization mesh. For the matrix from structural mechanics, the time using reordering 2 increases distinctly compared with the time using reordering 1: because of the irregular structure of this matrix, the new distribution of the data blocks destroys the load balancing. Since reordering 2 does not result in an improvement over reordering 1 for the considered matrices, the latter scheme is applied in all following investigations.

Fig. 2.6 also shows the execution times per iteration on 32 processors with and without overlapping of communication and computation for the parallel CG algorithm. The overlapped execution reduces the execution times by nearly 20%.

In Fig. 2.7, speedups on 4 to 32 processors are shown for the CG method with and without bandwidth reduction of the matrices. The equation system from environmental science, together with the program code and the remaining data, requires the memory of more than two processors, that from structural mechanics the memory of more than four processors.

[Figure 2.6: Execution times per iteration, CG method, 32 processors. Left: the influence of the reordering; right: the influence of overlapping.]

[Figure 2.7: Speedups, CG method (left) and Lanczos tridiagonalization (right).]

For up to four and, in the second case, up to eight processors, linear speedup was therefore assumed, because nearly linear speedup was observed in tests with smaller systems of equations for up to 8 processors. Performance tests with a smaller system of equations from environmental science (17368 rows, ... non-zeros, bandwidth reduced) resulted in speedups of 1.0, 1.9, 3.6, and 6.7 on 1, 2, 4, and 8 processors, respectively. Using one processor, one CG iteration required an execution time of 216 milliseconds.

For 16 processors and without bandwidth reduction, the speedup is 13.2 in the first case and 15.2 in the second case; this corresponds to efficiencies of 83% and 95%. With bandwidth reduction, the speedups increase to 14.6 and 15.6, respectively; the efficiencies are 91% and 97%. For 32 processors, speedups of 21.6 and 27.2 without bandwidth reduction and of 24.8 and 28.5 with bandwidth reduction are achieved. The efficiencies decrease to 68% and 85% without bandwidth reduction and to 78% and 89% with bandwidth reduction, because the communication overhead increases.

Fig. 2.7 also shows the speedups for the Lanczos tridiagonalization. In the case of the matrix from environmental science, the speedups are 21.1 without bandwidth reduction and 23.8 with bandwidth reduction using 32 processors; this corresponds to efficiencies of 66% and 74%, respectively. In the case of the matrix from structural mechanics, speedups of 28.5 without bandwidth reduction and of 29.8 with bandwidth reduction have been obtained on 32 processors; the corresponding efficiencies are 89% and 93%, respectively. Using 32 processors, 100 Lanczos iterations require an execution time of 3.3 seconds without and 2.8 seconds with bandwidth reduction for the matrix from environmental science; the corresponding times for the matrix from structural mechanics are 5.7 seconds and 5.5 seconds, respectively.
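For reference, the efficiencies quoted above follow from the usual definition of parallel efficiency (not stated explicitly in the report), e.g. for the CG speedups on 16 and 32 processors without bandwidth reduction:

```latex
E_p = \frac{S_p}{p}, \qquad
E_{16} = \frac{13.2}{16} \approx 0.83 = 83\%, \qquad
E_{32} = \frac{27.2}{32} = 0.85 = 85\%.
```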

3 Case Study II: Direct Methods

When coarse-grain parallelism is to be used for solving Ax = b by Gaussian elimination, the original matrix A has to be reordered before the start of the elimination in order to obtain several relatively large blocks that can be treated concurrently. The idea is not a new one; it has been used in many applications (not only in order to obtain a parallel algorithm). An efficient reordering has been proposed by Hellerman and Rarick [34, 35]. It has been used, with some modifications, by many other authors; see, for example, [8] or [20]. Other preliminary reorderings have also been proposed in the literature; see, for example, [26]. A common feature of all these reorderings is that one always imposes a requirement to obtain square blocks on the main diagonal. Moreover, it is also required that the reordered matrix is either an upper block-triangular matrix or a bordered matrix, in both cases with square blocks on the main diagonal (see the references given above or [17]). For some matrices these two requirements are too restrictive; therefore they should not always be imposed. The main purpose of this chapter is to show how to avoid them (when this is appropriate).

A direct solver in which one attempts to exploit coarse-grain parallelism without imposing the above two requirements is described and tested in [30, 54]. This solver is based on partitioning the matrix into an upper block-triangular form with rectangular diagonal blocks. A reordering algorithm, by which as many zero elements as possible are obtained in the lower left corner of the matrix, is to be applied before the partitioning. After that, the matrix must be divided into block rows, each of them containing approximately the same number of rows. If the reordering algorithm is efficient, then it is rather easy to obtain large block rows that contain approximately the same number of rows during the partitioning (because it is allowed to use rectangular diagonal blocks). This is why we concentrate our attention on the initial reordering.

In the remainder of this chapter we discuss an improvement of the reordering algorithm proposed in [30] and [54], and its application in the solution of systems of linear algebraic equations by Gaussian elimination. Throughout, n and NZ denote respectively the order of the matrix and the number of its non-zero elements. More details about the algorithm can be found in [24].

3.1 The Locally Optimized Reordering Algorithm

It is convenient first to sketch the initial reordering scheme used in the old solver. It consists of two completely separated steps: column reordering and row reordering. The following definition is needed in order to describe the column reordering.

Definition 1. The number c_j of non-zero elements in a given column j, j = 1, 2, ..., n, of the matrix A is called the count of this column.

When the column ordering is completed, the columns of matrix A are ordered by increasing counts:

    j < k  ⇒  c_j ≤ c_k.

Let r_i be the column index of the first non-zero element in row i (i.e., a_{i,r_i} ≠ 0, but a_{i,1} = a_{i,2} = ... = a_{i,r_i - 1} = 0). When the row ordering is completed, the rows are ordered so that the following relationship is satisfied:

    j < k  ⇒  r_j ≤ r_k.

It is clear that by reordering the matrix in this way, many of the zero elements are moved to the lower left corner of the matrix. It can easily be proved that the cost of this reordering is very low; it requires only O(NZ) operations. However, the number of zeros in the lower left corner of the reordered matrix may be considerably smaller than the maximal number that can be achieved. A sketch of this old scheme is given below.
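The old scheme can be sketched in a few lines (illustrative names; a dense boolean pattern is used only for brevity):

```python
# Sketch of the old reordering scheme: column ordering by counts,
# then row ordering by first non-zero index. Names are illustrative.
import numpy as np

def old_reordering(pattern):
    """pattern: boolean array, pattern[i, j] = True iff a_ij != 0."""
    n = pattern.shape[0]
    # Step 1: order columns by increasing count c_j.
    counts = pattern.sum(axis=0)
    col_perm = np.argsort(counts, kind="stable")
    pattern = pattern[:, col_perm]
    # Step 2: order rows by increasing index r_i of the first non-zero.
    first_nz = np.argmax(pattern, axis=1)          # r_i (0-based)
    first_nz[~pattern.any(axis=1)] = n             # empty rows go last
    row_perm = np.argsort(first_nz, kind="stable")
    return pattern[row_perm, :], row_perm, col_perm
```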

Consider the 6 × 6 matrix shown in Fig. 3.1. This matrix has two non-zero elements in each column. Therefore the column ordering will not lead to any column permutations. It is easy to see that the row ordering does not lead to row permutations either. Thus, the old ordering algorithm leaves this matrix unchanged. This leads to a rather poor result (only six zero elements in the lower left corner).

[Figure 3.1: The original sparsity pattern of a matrix; the old reordering algorithm preserves this structure.]

Not only is the main drawback of the old algorithm revealed by this simple example, but the example also shows how the reordering can be made more efficient. When the first column with a minimal count is chosen, the counts of the other columns should be updated by removing all non-zero elements from the rows that have non-zero elements in that column. Then the second column is chosen (among the columns with minimal updated count) and brought by permutations to position 2. The counts of the other columns are again updated, and the process is continued in this manner. The most important feature of this new algorithm is that after choosing a column with a best count, the counts of some columns are reduced. A more rigorous description of the algorithm will be given below. It is worthwhile to illustrate here the effect of this reordering by using the same matrix as in the example shown in Fig. 3.1. The result obtained by the new ordering is shown in Fig. 3.2. It is immediately seen that the number of zero elements in the lower left corner is considerably greater.

[Figure 3.2: An optimal reordering of the matrix from Fig. 3.1. The column permutation sequence is given above the matrix. The reordered matrix is obtained without row interchanges. This reordering is produced by the new algorithm.]

The examples demonstrated in Fig. 3.1 and Fig. 3.2 show that better results can be achieved by using a new reordering algorithm. Two definitions are needed in order to formulate the new algorithm more rigorously.

Definition 2. Assume that the first k columns of the reordered matrix have been selected (and brought to the first k positions by permutations). Let i ∈ {1, 2, ..., n} be an arbitrary row number. Consider a column j, j ∈ {k+1, k+2, ..., n}. The elements in column j which have row numbers greater than i form the active part of column j with index i.

Remark 1. The above definition can be considered as a generalization of a definition which is sometimes used in connection with Gaussian elimination. In the original definition i is equal to k. In the new reordering algorithm (see below) i will in general be greater than k.

Definition 3. The number of non-zero elements in the active part of column j with index i is called the active count c_j^i of column j with index i.

Locally Optimized Reordering Algorithm (LORA)

1. Initialization. Set j = 1, i = 0, and find the active counts c_k^0, k = 1, 2, ..., n, of the columns.
2. Choice of the next column. Find the first column k ≥ j with minimal active count c_k^i with index i among the columns j, j+1, ..., n, and interchange the elements of columns j and k.
3. Row permutations. Perform row permutations such that the q rows with row numbers greater than i having non-zero elements in column j are moved to locations i+1, i+2, ..., i+q.
4. Stopping check. If i + q = n, then STOP.
5. Updating the active counts. Find the active counts of the columns j+1, j+2, ..., n with index i+q.
6. Preparation for a new column search. Increase i by q and j by 1, and go to Step 2.

The typical structure of a matrix that is reordered by LORA is shown in Fig. 3.3. The reordered matrix contains three different types of blocks:

1. the black blocks are dense matrices (these blocks are normally small and rectangular),
2. the shaded blocks are sparse matrices (all such blocks are above the black blocks or to the right of them, i.e., in the upper right corner of the reordered matrix A),
3. all elements that are located under the black blocks and to the left of them (i.e., the elements in the lower left corner of the reordered matrix A) are zeros.

The black blocks in Fig. 3.3 will be called the dense separator of the reordered matrix. As stated above, the new algorithm tries to reorder the matrix so that the number of zero elements under the dense separator is as large as possible. The following theorem is proved in [24].

Theorem 1. All blocks of the dense separator, which is obtained when the new reordering algorithm is applied, are dense matrices.

The strategy by which one attempts to obtain as many zero elements as possible under the dense separator is local. The algorithm finds consecutively columns that are locally best: if a column j is chosen at some stage of LORA, then this column is the column that adds the maximal number of zeros to the zero blocks. The local optimization does not ensure a global optimum, but in Section 3.5 it will be shown that the algorithm performs well in practice.

[Figure 3.3: The structure of the reordered matrix: dense blocks (the dense separator), sparse blocks in the upper right corner, and zero blocks in the lower left corner.]
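For concreteness, a straightforward, unoptimized sketch of LORA on a dense boolean pattern is given below (names are ours); its bookkeeping is O(n^2), which is precisely what the binary trees of Section 3.2 are introduced to avoid:

```python
# Straightforward sketch of LORA (Steps 1-6) on a dense boolean pattern.
# Names are illustrative; no attempt is made at the O(NZ log n) version.
import numpy as np

def lora(pattern):
    """pattern[i, j] = True iff a_ij != 0; returns the reordered pattern."""
    P = pattern.copy()
    n = P.shape[0]
    i, j = 0, 0                                   # i rows and j columns fixed so far
    while j < n and i < n:
        # Step 2: first column (among j..n-1) with minimal active count
        # with index i, i.e. counting only the rows below row i.
        active = P[i:, j:].sum(axis=0)
        k = j + int(np.argmin(active))            # argmin returns the first minimum
        P[:, [j, k]] = P[:, [k, j]]               # column interchange
        # Step 3: move the q rows (below row i) with a non-zero in column j
        # up to the positions directly after row i.
        rows_below = np.arange(i, n)
        hit = rows_below[P[i:, j]]
        miss = rows_below[~P[i:, j]]
        P[i:, :] = P[np.concatenate([hit, miss]), :]
        q = len(hit)
        # Step 4: stopping check.
        if i + q >= n:
            break
        # Steps 5-6: the active counts with index i+q are recomputed at the
        # top of the loop; advance the indices.
        i += q
        j += 1
    return P
```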

3.2 Implementation of LORA by Means of Binary Trees

If LORA is implemented in a straightforward way, then O(n^2) operations are required. For sparse matrix algorithms this is not a negligible cost and, therefore, it is highly desirable to reduce it. The crucial point is to reduce the cost of Step 2. The use of binary trees in the effort to reduce the total number of operations needed to perform LORA is discussed in this section. The discussion is based on a set of fundamental concepts from graph theory which are assumed to be known; this set includes nodes, edges, paths, roots, parents, children, ancestors, descendants, trees, subtrees, etc. (a good description of these concepts can be found, for example, in [3, 4]). Several definitions are needed in order to introduce the concept of binary trees with leveled nodes; a binary tree with leveled nodes can be considered as a modification of the leftist tree discussed in Knuth [39].

Definition 4. A tree is called a binary tree if each of its nodes has at most two children.

Definition 5. Consider a binary tree T(V, E), where V is the set of nodes and E is the set of edges, and assume that V contains n nodes. Sometimes it is necessary to keep the nodes of T in a certain order. If this is so, then a special function must be attached to the nodes of T; the value of this function for a given node j, j ∈ V = {1, 2, ..., n}, will be called the key of node j.

Definition 6. Consider again T(V, E) from the previous definition. The minimal distance between a given node j, j ∈ V = {1, 2, ..., n}, and one of its descendants with less than two children is called the level of node j. The level of a node with less than two children is equal to zero.

Some modifications of binary trees are often carried out many times consecutively during the solution of a given problem; this is needed, for example, if LORA is implemented by using binary trees. Such modifications can be performed efficiently if some extra information about the binary tree is stored and used. A special class of binary trees, the binary trees with leveled nodes, can be very useful for some special kinds of modifications. Five extra arrays, each of them of length n, are necessary to store the information and to update it efficiently when binary trees with leveled nodes are to be used. The contents of the j-th component of these arrays are given below:

1. LEFT(j) contains any of the children of the j-th node, or LEFT(j) = 0 if node j has no children.
2. RIGHT(j): if node j has two children, then RIGHT(j) contains the child that is not stored in LEFT(j); RIGHT(j) = 0 when node j has only one child or no child.
3. PREV(j) contains the parent of the j-th node; PREV(j) = 0 for the root.
4. KEY(j) contains the key of the j-th node.
5. LEV(j) contains the level of the j-th node.

Definition 7. A binary tree T(V, KEY, LEV, PREV, LEFT, RIGHT) is called a binary tree with leveled nodes if the following relations are satisfied:

1. KEY(PREV(j)) ≤ KEY(j) for all j ∈ V, j ≠ root;
2. if LEFT(j) = m ≠ 0 then PREV(m) = j; if RIGHT(j) = k ≠ 0 then PREV(k) = j;
3. if LEFT(j) ≠ 0 then LEFT(j) ≠ RIGHT(j);
4. if LEFT(j) = 0 or RIGHT(j) = 0 then LEV(j) = 0;
5. if LEFT(j) ≠ 0 and RIGHT(j) ≠ 0 then LEV(j) = min(LEV(LEFT(j)), LEV(RIGHT(j))) + 1.

[Figure 3.4: A binary tree with ten leveled nodes; the key and the level of each node are shown.]

A binary tree with leveled nodes is shown in Fig. 3.4. The relations stated by (1)-(5) in Definition 7 can easily be checked using the information about the key and the level of each node, which is also given in Fig. 3.4. A collection of properties of binary trees with leveled nodes can be found in [24].

Consider a binary tree T(V, KEY, LEV, PREV, LEFT, RIGHT) with n leveled nodes. If such a tree is to be used in connection with LORA, then the nodes of the tree must be associated with the columns of matrix A. The active count of column j is to be associated with the value of KEY(j). These counts are modified in LORA; this leads to the necessity of modifying T(V, KEY, LEV, PREV, LEFT, RIGHT) after the choice of each column (Step 2 of LORA). Appropriate values of the active count of column j with some index i are to be used as KEY(j) during the consecutive modifications of T(V, KEY, LEV, PREV, LEFT, RIGHT). These remarks suffice to explain how binary trees with leveled nodes are used to implement LORA efficiently. The main result about the complexity of LORA is given in the theorem below; the proof can be found in [24].

Theorem 3. The reordering algorithm LORA can be implemented in O(NZ log n) operations when binary trees with leveled nodes are used.

In the above considerations we assumed for simplicity that KEY(k) represents the active count of the k-th column. In this particular case a more straightforward implementation exists, using doubly linked lists (see [17, Sect. 2.8]), and this implementation leads to an algorithm ...
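The binary trees with leveled nodes themselves are not reproduced here. As a stand-in, the following sketch uses a plain binary heap (Python's heapq) with lazy invalidation to illustrate the role such a structure plays inside LORA, namely keeping the column with minimal active count retrievable in O(log n) while active counts are decreased; this is a swapped-in data structure for illustration, not the report's:

```python
# Stand-in for the leveled binary tree of Section 3.2: a binary heap with
# lazy deletion keeps the column of minimal active count available in
# O(log n) per update. This is NOT the report's data structure, only an
# illustration of the priority-queue role it plays inside LORA.
import heapq

class ActiveCountQueue:
    def __init__(self, counts):
        # counts[j] = initial active count c_j of column j
        self.count = dict(counts)
        self.heap = [(c, j) for j, c in counts.items()]
        heapq.heapify(self.heap)

    def decrease(self, j, amount):
        """Decrease the active count of column j (Step 5 of LORA)."""
        self.count[j] -= amount
        heapq.heappush(self.heap, (self.count[j], j))   # stale entries stay, ignored later

    def pop_min(self):
        """Return a column with minimal active count (Step 2 of LORA)."""
        while self.heap:
            c, j = heapq.heappop(self.heap)
            if j in self.count and self.count[j] == c:  # skip stale entries
                del self.count[j]
                return j, c
        raise IndexError("no columns left")
```

Each selection and each count update then costs O(log n), which is the kind of bound that leads to the O(NZ log n) complexity of Theorem 3.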


More information

FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION

FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION EDMOND CHOW AND AFTAB PATEL Abstract. This paper presents a new fine-grained parallel algorithm for computing an incomplete LU factorization. All nonzeros

More information

ITERATIVE METHODS FOR SPARSE LINEAR SYSTEMS

ITERATIVE METHODS FOR SPARSE LINEAR SYSTEMS ITERATIVE METHODS FOR SPARSE LINEAR SYSTEMS YOUSEF SAAD University of Minnesota PWS PUBLISHING COMPANY I(T)P An International Thomson Publishing Company BOSTON ALBANY BONN CINCINNATI DETROIT LONDON MADRID

More information

Calculus and linear algebra for biomedical engineering Week 3: Matrices, linear systems of equations, and the Gauss algorithm

Calculus and linear algebra for biomedical engineering Week 3: Matrices, linear systems of equations, and the Gauss algorithm Calculus and linear algebra for biomedical engineering Week 3: Matrices, linear systems of equations, and the Gauss algorithm Hartmut Führ fuehr@matha.rwth-aachen.de Lehrstuhl A für Mathematik, RWTH Aachen

More information

Mathematics Research Report No. MRR 003{96, HIGH RESOLUTION POTENTIAL FLOW METHODS IN OIL EXPLORATION Stephen Roberts 1 and Stephan Matthai 2 3rd Febr

Mathematics Research Report No. MRR 003{96, HIGH RESOLUTION POTENTIAL FLOW METHODS IN OIL EXPLORATION Stephen Roberts 1 and Stephan Matthai 2 3rd Febr HIGH RESOLUTION POTENTIAL FLOW METHODS IN OIL EXPLORATION Stephen Roberts and Stephan Matthai Mathematics Research Report No. MRR 003{96, Mathematics Research Report No. MRR 003{96, HIGH RESOLUTION POTENTIAL

More information

Algorithm for Sparse Approximate Inverse Preconditioners in the Conjugate Gradient Method

Algorithm for Sparse Approximate Inverse Preconditioners in the Conjugate Gradient Method Algorithm for Sparse Approximate Inverse Preconditioners in the Conjugate Gradient Method Ilya B. Labutin A.A. Trofimuk Institute of Petroleum Geology and Geophysics SB RAS, 3, acad. Koptyug Ave., Novosibirsk

More information

Linear Algebra and Eigenproblems

Linear Algebra and Eigenproblems Appendix A A Linear Algebra and Eigenproblems A working knowledge of linear algebra is key to understanding many of the issues raised in this work. In particular, many of the discussions of the details

More information

Incomplete Block LU Preconditioners on Slightly Overlapping. E. de Sturler. Delft University of Technology. Abstract

Incomplete Block LU Preconditioners on Slightly Overlapping. E. de Sturler. Delft University of Technology. Abstract Incomplete Block LU Preconditioners on Slightly Overlapping Subdomains for a Massively Parallel Computer E. de Sturler Faculty of Technical Mathematics and Informatics Delft University of Technology Mekelweg

More information

Numerical Methods in Matrix Computations

Numerical Methods in Matrix Computations Ake Bjorck Numerical Methods in Matrix Computations Springer Contents 1 Direct Methods for Linear Systems 1 1.1 Elements of Matrix Theory 1 1.1.1 Matrix Algebra 2 1.1.2 Vector Spaces 6 1.1.3 Submatrices

More information

Iterative Methods for Solving A x = b

Iterative Methods for Solving A x = b Iterative Methods for Solving A x = b A good (free) online source for iterative methods for solving A x = b is given in the description of a set of iterative solvers called templates found at netlib: http

More information

Lecture 9: Numerical Linear Algebra Primer (February 11st)

Lecture 9: Numerical Linear Algebra Primer (February 11st) 10-725/36-725: Convex Optimization Spring 2015 Lecture 9: Numerical Linear Algebra Primer (February 11st) Lecturer: Ryan Tibshirani Scribes: Avinash Siravuru, Guofan Wu, Maosheng Liu Note: LaTeX template

More information

Numerical Methods I Non-Square and Sparse Linear Systems

Numerical Methods I Non-Square and Sparse Linear Systems Numerical Methods I Non-Square and Sparse Linear Systems Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 MATH-GA 2011.003 / CSCI-GA 2945.003, Fall 2014 September 25th, 2014 A. Donev (Courant

More information

An exploration of matrix equilibration

An exploration of matrix equilibration An exploration of matrix equilibration Paul Liu Abstract We review three algorithms that scale the innity-norm of each row and column in a matrix to. The rst algorithm applies to unsymmetric matrices,

More information

FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION

FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION EDMOND CHOW AND AFTAB PATEL Abstract. This paper presents a new fine-grained parallel algorithm for computing an incomplete LU factorization. All nonzeros

More information

ADDITIVE SCHWARZ FOR SCHUR COMPLEMENT 305 the parallel implementation of both preconditioners on distributed memory platforms, and compare their perfo

ADDITIVE SCHWARZ FOR SCHUR COMPLEMENT 305 the parallel implementation of both preconditioners on distributed memory platforms, and compare their perfo 35 Additive Schwarz for the Schur Complement Method Luiz M. Carvalho and Luc Giraud 1 Introduction Domain decomposition methods for solving elliptic boundary problems have been receiving increasing attention

More information

SOR as a Preconditioner. A Dissertation. Presented to. University of Virginia. In Partial Fulllment. of the Requirements for the Degree

SOR as a Preconditioner. A Dissertation. Presented to. University of Virginia. In Partial Fulllment. of the Requirements for the Degree SOR as a Preconditioner A Dissertation Presented to The Faculty of the School of Engineering and Applied Science University of Virginia In Partial Fulllment of the Reuirements for the Degree Doctor of

More information

Linear Regression and Its Applications

Linear Regression and Its Applications Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start

More information

Improvements for Implicit Linear Equation Solvers

Improvements for Implicit Linear Equation Solvers Improvements for Implicit Linear Equation Solvers Roger Grimes, Bob Lucas, Clement Weisbecker Livermore Software Technology Corporation Abstract Solving large sparse linear systems of equations is often

More information

Fine-grained Parallel Incomplete LU Factorization

Fine-grained Parallel Incomplete LU Factorization Fine-grained Parallel Incomplete LU Factorization Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology Sparse Days Meeting at CERFACS June 5-6, 2014 Contribution

More information

The Solution of Linear Systems AX = B

The Solution of Linear Systems AX = B Chapter 2 The Solution of Linear Systems AX = B 21 Upper-triangular Linear Systems We will now develop the back-substitution algorithm, which is useful for solving a linear system of equations that has

More information

A JACOBI-DAVIDSON ITERATION METHOD FOR LINEAR EIGENVALUE PROBLEMS. GERARD L.G. SLEIJPEN y AND HENK A. VAN DER VORST y

A JACOBI-DAVIDSON ITERATION METHOD FOR LINEAR EIGENVALUE PROBLEMS. GERARD L.G. SLEIJPEN y AND HENK A. VAN DER VORST y A JACOBI-DAVIDSON ITERATION METHOD FOR LINEAR EIGENVALUE PROBLEMS GERARD L.G. SLEIJPEN y AND HENK A. VAN DER VORST y Abstract. In this paper we propose a new method for the iterative computation of a few

More information

Power System Analysis Prof. A. K. Sinha Department of Electrical Engineering Indian Institute of Technology, Kharagpur. Lecture - 21 Power Flow VI

Power System Analysis Prof. A. K. Sinha Department of Electrical Engineering Indian Institute of Technology, Kharagpur. Lecture - 21 Power Flow VI Power System Analysis Prof. A. K. Sinha Department of Electrical Engineering Indian Institute of Technology, Kharagpur Lecture - 21 Power Flow VI (Refer Slide Time: 00:57) Welcome to lesson 21. In this

More information

Numerical Linear Algebra

Numerical Linear Algebra Numerical Linear Algebra Decompositions, numerical aspects Gerard Sleijpen and Martin van Gijzen September 27, 2017 1 Delft University of Technology Program Lecture 2 LU-decomposition Basic algorithm Cost

More information

Direct solution methods for sparse matrices. p. 1/49

Direct solution methods for sparse matrices. p. 1/49 Direct solution methods for sparse matrices p. 1/49 p. 2/49 Direct solution methods for sparse matrices Solve Ax = b, where A(n n). (1) Factorize A = LU, L lower-triangular, U upper-triangular. (2) Solve

More information

Program Lecture 2. Numerical Linear Algebra. Gaussian elimination (2) Gaussian elimination. Decompositions, numerical aspects

Program Lecture 2. Numerical Linear Algebra. Gaussian elimination (2) Gaussian elimination. Decompositions, numerical aspects Numerical Linear Algebra Decompositions, numerical aspects Program Lecture 2 LU-decomposition Basic algorithm Cost Stability Pivoting Cholesky decomposition Sparse matrices and reorderings Gerard Sleijpen

More information

Math 471 (Numerical methods) Chapter 3 (second half). System of equations

Math 471 (Numerical methods) Chapter 3 (second half). System of equations Math 47 (Numerical methods) Chapter 3 (second half). System of equations Overlap 3.5 3.8 of Bradie 3.5 LU factorization w/o pivoting. Motivation: ( ) A I Gaussian Elimination (U L ) where U is upper triangular

More information

Research Reports on Mathematical and Computing Sciences

Research Reports on Mathematical and Computing Sciences ISSN 1342-284 Research Reports on Mathematical and Computing Sciences Exploiting Sparsity in Linear and Nonlinear Matrix Inequalities via Positive Semidefinite Matrix Completion Sunyoung Kim, Masakazu

More information

Technical University Hamburg { Harburg, Section of Mathematics, to reduce the number of degrees of freedom to manageable size.

Technical University Hamburg { Harburg, Section of Mathematics, to reduce the number of degrees of freedom to manageable size. Interior and modal masters in condensation methods for eigenvalue problems Heinrich Voss Technical University Hamburg { Harburg, Section of Mathematics, D { 21071 Hamburg, Germany EMail: voss @ tu-harburg.d400.de

More information

Applied Mathematics 205. Unit V: Eigenvalue Problems. Lecturer: Dr. David Knezevic

Applied Mathematics 205. Unit V: Eigenvalue Problems. Lecturer: Dr. David Knezevic Applied Mathematics 205 Unit V: Eigenvalue Problems Lecturer: Dr. David Knezevic Unit V: Eigenvalue Problems Chapter V.4: Krylov Subspace Methods 2 / 51 Krylov Subspace Methods In this chapter we give

More information

6.4 Krylov Subspaces and Conjugate Gradients

6.4 Krylov Subspaces and Conjugate Gradients 6.4 Krylov Subspaces and Conjugate Gradients Our original equation is Ax = b. The preconditioned equation is P Ax = P b. When we write P, we never intend that an inverse will be explicitly computed. P

More information

Numerical methods part 2

Numerical methods part 2 Numerical methods part 2 Alain Hébert alain.hebert@polymtl.ca Institut de génie nucléaire École Polytechnique de Montréal ENE6103: Week 6 Numerical methods part 2 1/33 Content (week 6) 1 Solution of an

More information

Notes for CS542G (Iterative Solvers for Linear Systems)

Notes for CS542G (Iterative Solvers for Linear Systems) Notes for CS542G (Iterative Solvers for Linear Systems) Robert Bridson November 20, 2007 1 The Basics We re now looking at efficient ways to solve the linear system of equations Ax = b where in this course,

More information

Minisymposia 9 and 34: Avoiding Communication in Linear Algebra. Jim Demmel UC Berkeley bebop.cs.berkeley.edu

Minisymposia 9 and 34: Avoiding Communication in Linear Algebra. Jim Demmel UC Berkeley bebop.cs.berkeley.edu Minisymposia 9 and 34: Avoiding Communication in Linear Algebra Jim Demmel UC Berkeley bebop.cs.berkeley.edu Motivation (1) Increasing parallelism to exploit From Top500 to multicores in your laptop Exponentially

More information

Jordan Journal of Mathematics and Statistics (JJMS) 5(3), 2012, pp A NEW ITERATIVE METHOD FOR SOLVING LINEAR SYSTEMS OF EQUATIONS

Jordan Journal of Mathematics and Statistics (JJMS) 5(3), 2012, pp A NEW ITERATIVE METHOD FOR SOLVING LINEAR SYSTEMS OF EQUATIONS Jordan Journal of Mathematics and Statistics JJMS) 53), 2012, pp.169-184 A NEW ITERATIVE METHOD FOR SOLVING LINEAR SYSTEMS OF EQUATIONS ADEL H. AL-RABTAH Abstract. The Jacobi and Gauss-Seidel iterative

More information

have invested in supercomputer systems, which have cost up to tens of millions of dollars each. Over the past year or so, however, the future of vecto

have invested in supercomputer systems, which have cost up to tens of millions of dollars each. Over the past year or so, however, the future of vecto MEETING THE NVH COMPUTATIONAL CHALLENGE: AUTOMATED MULTI-LEVEL SUBSTRUCTURING J. K. Bennighof, M. F. Kaplan, y M. B. Muller, y and M. Kim y Department of Aerospace Engineering & Engineering Mechanics The

More information

c 2015 Society for Industrial and Applied Mathematics

c 2015 Society for Industrial and Applied Mathematics SIAM J. SCI. COMPUT. Vol. 37, No. 2, pp. C169 C193 c 2015 Society for Industrial and Applied Mathematics FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION EDMOND CHOW AND AFTAB PATEL Abstract. This paper

More information

Parallelization of Multilevel Preconditioners Constructed from Inverse-Based ILUs on Shared-Memory Multiprocessors

Parallelization of Multilevel Preconditioners Constructed from Inverse-Based ILUs on Shared-Memory Multiprocessors Parallelization of Multilevel Preconditioners Constructed from Inverse-Based ILUs on Shared-Memory Multiprocessors J.I. Aliaga 1 M. Bollhöfer 2 A.F. Martín 1 E.S. Quintana-Ortí 1 1 Deparment of Computer

More information

Numerical Linear Algebra

Numerical Linear Algebra Numerical Linear Algebra The two principal problems in linear algebra are: Linear system Given an n n matrix A and an n-vector b, determine x IR n such that A x = b Eigenvalue problem Given an n n matrix

More information

Block-tridiagonal matrices

Block-tridiagonal matrices Block-tridiagonal matrices. p.1/31 Block-tridiagonal matrices - where do these arise? - as a result of a particular mesh-point ordering - as a part of a factorization procedure, for example when we compute

More information

OUTLINE ffl CFD: elliptic pde's! Ax = b ffl Basic iterative methods ffl Krylov subspace methods ffl Preconditioning techniques: Iterative methods ILU

OUTLINE ffl CFD: elliptic pde's! Ax = b ffl Basic iterative methods ffl Krylov subspace methods ffl Preconditioning techniques: Iterative methods ILU Preconditioning Techniques for Solving Large Sparse Linear Systems Arnold Reusken Institut für Geometrie und Praktische Mathematik RWTH-Aachen OUTLINE ffl CFD: elliptic pde's! Ax = b ffl Basic iterative

More information

Parallel Iterative Methods for Sparse Linear Systems. H. Martin Bücker Lehrstuhl für Hochleistungsrechnen

Parallel Iterative Methods for Sparse Linear Systems. H. Martin Bücker Lehrstuhl für Hochleistungsrechnen Parallel Iterative Methods for Sparse Linear Systems Lehrstuhl für Hochleistungsrechnen www.sc.rwth-aachen.de RWTH Aachen Large and Sparse Small and Dense Outline Problem with Direct Methods Iterative

More information

problem Au = u by constructing an orthonormal basis V k = [v 1 ; : : : ; v k ], at each k th iteration step, and then nding an approximation for the e

problem Au = u by constructing an orthonormal basis V k = [v 1 ; : : : ; v k ], at each k th iteration step, and then nding an approximation for the e A Parallel Solver for Extreme Eigenpairs 1 Leonardo Borges and Suely Oliveira 2 Computer Science Department, Texas A&M University, College Station, TX 77843-3112, USA. Abstract. In this paper a parallel

More information

Incomplete Cholesky preconditioners that exploit the low-rank property

Incomplete Cholesky preconditioners that exploit the low-rank property anapov@ulb.ac.be ; http://homepages.ulb.ac.be/ anapov/ 1 / 35 Incomplete Cholesky preconditioners that exploit the low-rank property (theory and practice) Artem Napov Service de Métrologie Nucléaire, Université

More information

Chapter 7. Iterative methods for large sparse linear systems. 7.1 Sparse matrix algebra. Large sparse matrices

Chapter 7. Iterative methods for large sparse linear systems. 7.1 Sparse matrix algebra. Large sparse matrices Chapter 7 Iterative methods for large sparse linear systems In this chapter we revisit the problem of solving linear systems of equations, but now in the context of large sparse systems. The price to pay

More information

Numerical Linear Algebra

Numerical Linear Algebra Numerical Linear Algebra By: David McQuilling; Jesus Caban Deng Li Jan.,31,006 CS51 Solving Linear Equations u + v = 8 4u + 9v = 1 A x b 4 9 u v = 8 1 Gaussian Elimination Start with the matrix representation

More information

Key words. conjugate gradients, normwise backward error, incremental norm estimation.

Key words. conjugate gradients, normwise backward error, incremental norm estimation. Proceedings of ALGORITMY 2016 pp. 323 332 ON ERROR ESTIMATION IN THE CONJUGATE GRADIENT METHOD: NORMWISE BACKWARD ERROR PETR TICHÝ Abstract. Using an idea of Duff and Vömel [BIT, 42 (2002), pp. 300 322

More information

Numerical Programming I (for CSE)

Numerical Programming I (for CSE) Technische Universität München WT 1/13 Fakultät für Mathematik Prof. Dr. M. Mehl B. Gatzhammer January 1, 13 Numerical Programming I (for CSE) Tutorial 1: Iterative Methods 1) Relaxation Methods a) Let

More information

Bindel, Fall 2016 Matrix Computations (CS 6210) Notes for

Bindel, Fall 2016 Matrix Computations (CS 6210) Notes for 1 Iteration basics Notes for 2016-11-07 An iterative solver for Ax = b is produces a sequence of approximations x (k) x. We always stop after finitely many steps, based on some convergence criterion, e.g.

More information

Lab 1: Iterative Methods for Solving Linear Systems

Lab 1: Iterative Methods for Solving Linear Systems Lab 1: Iterative Methods for Solving Linear Systems January 22, 2017 Introduction Many real world applications require the solution to very large and sparse linear systems where direct methods such as

More information

Scientific Computing: An Introductory Survey

Scientific Computing: An Introductory Survey Scientific Computing: An Introductory Survey Chapter 2 Systems of Linear Equations Prof. Michael T. Heath Department of Computer Science University of Illinois at Urbana-Champaign Copyright c 2002. Reproduction

More information

Lecture Note 7: Iterative methods for solving linear systems. Xiaoqun Zhang Shanghai Jiao Tong University

Lecture Note 7: Iterative methods for solving linear systems. Xiaoqun Zhang Shanghai Jiao Tong University Lecture Note 7: Iterative methods for solving linear systems Xiaoqun Zhang Shanghai Jiao Tong University Last updated: December 24, 2014 1.1 Review on linear algebra Norms of vectors and matrices vector

More information

Solving Ax = b, an overview. Program

Solving Ax = b, an overview. Program Numerical Linear Algebra Improving iterative solvers: preconditioning, deflation, numerical software and parallelisation Gerard Sleijpen and Martin van Gijzen November 29, 27 Solving Ax = b, an overview

More information

Solving PDEs with CUDA Jonathan Cohen

Solving PDEs with CUDA Jonathan Cohen Solving PDEs with CUDA Jonathan Cohen jocohen@nvidia.com NVIDIA Research PDEs (Partial Differential Equations) Big topic Some common strategies Focus on one type of PDE in this talk Poisson Equation Linear

More information

Course Notes: Week 1

Course Notes: Week 1 Course Notes: Week 1 Math 270C: Applied Numerical Linear Algebra 1 Lecture 1: Introduction (3/28/11) We will focus on iterative methods for solving linear systems of equations (and some discussion of eigenvalues

More information

Conceptual Questions for Review

Conceptual Questions for Review Conceptual Questions for Review Chapter 1 1.1 Which vectors are linear combinations of v = (3, 1) and w = (4, 3)? 1.2 Compare the dot product of v = (3, 1) and w = (4, 3) to the product of their lengths.

More information

Lecture 18 Classical Iterative Methods

Lecture 18 Classical Iterative Methods Lecture 18 Classical Iterative Methods MIT 18.335J / 6.337J Introduction to Numerical Methods Per-Olof Persson November 14, 2006 1 Iterative Methods for Linear Systems Direct methods for solving Ax = b,

More information

In order to solve the linear system KL M N when K is nonsymmetric, we can solve the equivalent system

In order to solve the linear system KL M N when K is nonsymmetric, we can solve the equivalent system !"#$% "&!#' (%)!#" *# %)%(! #! %)!#" +, %"!"#$ %*&%! $#&*! *# %)%! -. -/ 0 -. 12 "**3! * $!#%+,!2!#% 44" #% &#33 # 4"!#" "%! "5"#!!#6 -. - #% " 7% "3#!#3! - + 87&2! * $!#% 44" ) 3( $! # % %#!!#%+ 9332!

More information

Algebraic Multigrid as Solvers and as Preconditioner

Algebraic Multigrid as Solvers and as Preconditioner Ò Algebraic Multigrid as Solvers and as Preconditioner Domenico Lahaye domenico.lahaye@cs.kuleuven.ac.be http://www.cs.kuleuven.ac.be/ domenico/ Department of Computer Science Katholieke Universiteit Leuven

More information

Solution of Linear Systems

Solution of Linear Systems Solution of Linear Systems Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico May 12, 2016 CPD (DEI / IST) Parallel and Distributed Computing

More information

Parallel Programming. Parallel algorithms Linear systems solvers

Parallel Programming. Parallel algorithms Linear systems solvers Parallel Programming Parallel algorithms Linear systems solvers Terminology System of linear equations Solve Ax = b for x Special matrices Upper triangular Lower triangular Diagonally dominant Symmetric

More information

Finite-choice algorithm optimization in Conjugate Gradients

Finite-choice algorithm optimization in Conjugate Gradients Finite-choice algorithm optimization in Conjugate Gradients Jack Dongarra and Victor Eijkhout January 2003 Abstract We present computational aspects of mathematically equivalent implementations of the

More information

Today s class. Linear Algebraic Equations LU Decomposition. Numerical Methods, Fall 2011 Lecture 8. Prof. Jinbo Bi CSE, UConn

Today s class. Linear Algebraic Equations LU Decomposition. Numerical Methods, Fall 2011 Lecture 8. Prof. Jinbo Bi CSE, UConn Today s class Linear Algebraic Equations LU Decomposition 1 Linear Algebraic Equations Gaussian Elimination works well for solving linear systems of the form: AX = B What if you have to solve the linear

More information

Sparse Linear Systems. Iterative Methods for Sparse Linear Systems. Motivation for Studying Sparse Linear Systems. Partial Differential Equations

Sparse Linear Systems. Iterative Methods for Sparse Linear Systems. Motivation for Studying Sparse Linear Systems. Partial Differential Equations Sparse Linear Systems Iterative Methods for Sparse Linear Systems Matrix Computations and Applications, Lecture C11 Fredrik Bengzon, Robert Söderlund We consider the problem of solving the linear system

More information

Review of Linear Algebra

Review of Linear Algebra Review of Linear Algebra Definitions An m n (read "m by n") matrix, is a rectangular array of entries, where m is the number of rows and n the number of columns. 2 Definitions (Con t) A is square if m=

More information

A Comparison of Parallel Solvers for Diagonally. Dominant and General Narrow-Banded Linear. Systems II.

A Comparison of Parallel Solvers for Diagonally. Dominant and General Narrow-Banded Linear. Systems II. A Comparison of Parallel Solvers for Diagonally Dominant and General Narrow-Banded Linear Systems II Peter Arbenz 1, Andrew Cleary 2, Jack Dongarra 3, and Markus Hegland 4 1 Institute of Scientic Computing,

More information

The amount of work to construct each new guess from the previous one should be a small multiple of the number of nonzeros in A.

The amount of work to construct each new guess from the previous one should be a small multiple of the number of nonzeros in A. AMSC/CMSC 661 Scientific Computing II Spring 2005 Solution of Sparse Linear Systems Part 2: Iterative methods Dianne P. O Leary c 2005 Solving Sparse Linear Systems: Iterative methods The plan: Iterative

More information

Review: From problem to parallel algorithm

Review: From problem to parallel algorithm Review: From problem to parallel algorithm Mathematical formulations of interesting problems abound Poisson s equation Sources: Electrostatics, gravity, fluid flow, image processing (!) Numerical solution:

More information

MTH Linear Algebra. Study Guide. Dr. Tony Yee Department of Mathematics and Information Technology The Hong Kong Institute of Education

MTH Linear Algebra. Study Guide. Dr. Tony Yee Department of Mathematics and Information Technology The Hong Kong Institute of Education MTH 3 Linear Algebra Study Guide Dr. Tony Yee Department of Mathematics and Information Technology The Hong Kong Institute of Education June 3, ii Contents Table of Contents iii Matrix Algebra. Real Life

More information

LU Factorization. Marco Chiarandini. DM559 Linear and Integer Programming. Department of Mathematics & Computer Science University of Southern Denmark

LU Factorization. Marco Chiarandini. DM559 Linear and Integer Programming. Department of Mathematics & Computer Science University of Southern Denmark DM559 Linear and Integer Programming LU Factorization Marco Chiarandini Department of Mathematics & Computer Science University of Southern Denmark [Based on slides by Lieven Vandenberghe, UCLA] Outline

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 23: GMRES and Other Krylov Subspace Methods; Preconditioning

AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 23: GMRES and Other Krylov Subspace Methods; Preconditioning AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 23: GMRES and Other Krylov Subspace Methods; Preconditioning Xiangmin Jiao SUNY Stony Brook Xiangmin Jiao Numerical Analysis I 1 / 18 Outline

More information

Enhancing Scalability of Sparse Direct Methods

Enhancing Scalability of Sparse Direct Methods Journal of Physics: Conference Series 78 (007) 0 doi:0.088/7-6596/78//0 Enhancing Scalability of Sparse Direct Methods X.S. Li, J. Demmel, L. Grigori, M. Gu, J. Xia 5, S. Jardin 6, C. Sovinec 7, L.-Q.

More information

Parallel Sparse Matrix Vector Multiplication (PSC 4.3)

Parallel Sparse Matrix Vector Multiplication (PSC 4.3) Parallel Sparse Matrix Vector Multiplication (PSC 4.) original slides by Rob Bisseling, Universiteit Utrecht, accompanying the textbook Parallel Scientific Computation adapted for the lecture HPC Algorithms

More information