Parallel Solution of Sparse Linear Systems


Murat Manguoglu

Abstract Many simulations in science and engineering give rise to sparse linear systems of equations. It is a well known fact that the cost of the simulation process is almost always governed by the solution of the linear systems, especially for large-scale problems. The emergence of extreme-scale parallel platforms, along with the increasing number of processing cores available on a single chip, poses significant challenges for algorithm development. Machines with tens of thousands of multicore processors place tremendous constraints on the communication as well as memory access requirements of algorithms. The increase in the number of cores in a processing unit without an increase in memory bandwidth aggravates an already significant memory bottleneck. Sparse linear algebra kernels are well known for their poor processor utilization. This is a result of limited memory reuse, which renders data caching less effective. In view of emerging hardware trends, it is necessary to develop algorithms that strike a more meaningful balance between memory accesses, communication, and computation. Specifically, an algorithm that performs more floating point operations at the expense of reduced memory accesses and communication is likely to yield better performance. We present two alternative variations of DS factorization based methods for the solution of sparse linear systems on parallel computing platforms. Performance comparisons to traditional LU factorization based parallel solvers are also discussed. We show that by combining iterative methods with direct solvers and using the DS factorization, one can achieve better scalability and shorter time to solution.

Murat Manguoglu
Department of Computer Engineering, Middle East Technical University, Ankara, Turkey
e-mail: manguoglu@ceng.metu.edu.tr

1 Introduction

Many simulations in science and engineering give rise to sparse linear systems of equations. It is a well known fact that the cost of the solution process is almost always governed by the solution of the linear systems, especially for large-scale problems. The emergence of extreme-scale parallel platforms, along with the increasing number of processing cores available on a single chip, poses significant challenges for algorithm development. Machines with tens of thousands of multicore processors place tremendous constraints on the communication as well as memory access requirements of algorithms. The increase in the number of cores in a processing unit without an increase in memory bandwidth aggravates an already significant memory bottleneck. Sparse linear algebra kernels are well known for their poor processor utilization. This is a result of limited memory reuse, which renders data caching less effective. In view of emerging hardware trends, it is necessary to develop algorithms that strike a more meaningful balance between memory accesses, communication, and computation. Specifically, an algorithm that performs more floating point operations at the expense of reduced memory accesses and communication is likely to yield better performance.

A significant amount of effort has been devoted to the design and implementation of parallel sparse linear system solvers. Existing parallel sparse direct solvers, such as MUMPS [3, 2, 1], Pardiso [39, 40], SuperLU [22], and WSMP [14, 13], are based on LU factorization. Therefore, the speed improvements realized by such solvers are often limited due to the inherent limitations of sparse LU factorizations and sparse triangular forward-backward sweeps. Iterative solvers, such as preconditioned Krylov subspace methods with sparse approximate inverse or incomplete LU factorization based preconditioners, are, on the other hand, often more scalable but not as robust as direct solvers. We present two robust hybrid algorithms based on the DS factorization for the parallel solution of general sparse linear systems. At the cost of increased computation, using the DS factorization to solve the system allows us to minimize interprocess communication and, hence, enhances concurrency.

The remainder of this chapter is organized as follows. In Section 2, we present banded and sparse variations of the DS factorization. In Section 3, we develop two hybrid general sparse linear system solvers that use the DS factorization to solve the preconditioned system.

2 Banded and sparse parallel DS factorizations

A number of banded solvers have been proposed and implemented in software packages such as LAPACK [4] for uniprocessors, and ScaLAPACK [7] and Spike [36, 34, 21, 6, 9, 30, 31, 37] for parallel architectures. The central idea of Spike is to partition the matrix so that each process (or processing element) can work on its own part of the matrix, with the processes communicating only during the

solution of the common reduced system. The size of the reduced system is determined by the bandwidth of the matrix and the number of partitions. Unlike classical sequential LU factorization of the coefficient matrix A for solving a banded linear system Ax = f, the Spike scheme employs the factorization

A = DS, (1)

where D is the block diagonal of A for a given number of partitions. The factor S, given by D^{-1}A (assuming D is nonsingular) and called the spike matrix, consists of the block diagonal identity matrix modified by spikes to the right and left of each partition. The process of solving Ax = f then reduces to the following sequence of steps:

g ← D^{-1} f (modification of the right-hand side)
S ← D^{-1} A (forming the spike system coefficient matrix)
x̂ ← Ŝ^{-1} ĝ (solving a smaller independent reduced system)
x ← S^{-1} g (retrieving the full solution)

All the steps of the solution process can be executed in perfect parallelism with the exception of the solution of the small reduced system. The size of the reduced system increases as we increase the number of partitions (or processors). Furthermore, each step can be accomplished using one of several available methods, depending on the specific parallel architecture and the linear system at hand. This gives rise to a family of optimized variants of the basic Spike algorithm. For a small banded system we illustrate the partitioning of the system among three processors in Figure 1. Figure 2 depicts the spike system Sx = g. Finally, Figure 3 shows the smaller reduced system. Dashed boxes in the figures show those elements which could be near machine precision if A is a diagonally dominant system. Ignoring those small elements gives rise to the truncated variant of the algorithm, allowing more parallelism in the solution of the reduced system.

For the solution of general sparse linear systems (where A is sparse) a new algorithm has been proposed in [24, 28]. Inspired by the banded Spike solver, this algorithm also uses the DS factorization and partitioning of the system (see Figures 4 and 5). The resulting S matrix, however, is not banded and consists of sparse spikes. Nevertheless, a smaller reduced system can still be obtained (see Figure 6). The solution stages follow the banded case exactly. Both banded and sparse DS factorization based algorithms can be adapted for the efficient and scalable solution of general sparse linear systems, as will be discussed in Section 3.
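The DS solution process can be made concrete with a small example. Below is a minimal dense NumPy sketch of the four steps for a tridiagonal system split into three partitions; the function name ds_solve, the equal partition sizes, and the dense representation are illustrative assumptions, not the optimized parallel Spike implementation.

```python
# Minimal dense sketch of a DS-factorization solve; illustrative only,
# not the optimized parallel Spike solver. Equal partition sizes and a
# nonsingular block diagonal D are assumed.
import numpy as np

def ds_solve(A, f, p):
    n = A.shape[0]
    m = n // p                              # equal partitions assumed
    D = np.zeros_like(A)
    for i in range(p):                      # D = block diagonal of A
        s = slice(i * m, (i + 1) * m)
        D[s, s] = A[s, s]
    T = np.linalg.solve(D, A - D)           # S = I + T, with T = D^{-1}R
    g = np.linalg.solve(D, f)               # g <- D^{-1} f
    c = np.flatnonzero(np.abs(T).sum(axis=0))      # coupling columns of T
    S_hat = np.eye(len(c)) + T[np.ix_(c, c)]       # reduced system matrix
    x_hat = np.linalg.solve(S_hat, g[c])           # x_hat <- S_hat^{-1} g_hat
    return g - T[:, c] @ x_hat              # retrieve the full solution

rng = np.random.default_rng(0)
n, p = 12, 3
A = np.triu(np.tril(rng.standard_normal((n, n)), 1), -1) + 5.0 * np.eye(n)
f = rng.standard_normal(n)
x = ds_solve(A, f, p)
assert np.allclose(A @ x, f)
```

For a banded matrix, c collects only the few columns adjacent to each partition interface, so the reduced system stays small (compare Figures 3 and 6); the truncated variant additionally drops the near-zero spike tips.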

Fig. 1 Partitioning of the system (Ax = f) into three parts; boxes represent nonzeros

Fig. 2 Spike system Sx = g, where g = D^{-1} f and S = D^{-1} A, for the system in Figure 1

3 Solution of general sparse linear systems

In this section, we present two hybrid methods for the solution of general sparse linear systems which use an iterative method as the outer layer with preconditioning. In both methods, the solution of the preconditioned linear systems is handled by one of the variations of the DS factorization based banded or sparse solvers described in Section 2. The first method relies on reordering the system so that the large entries in the coefficient matrix are moved closer to the main diagonal. After reordering, an effective banded preconditioner, M, can be extracted and used for solving the system with an outer iterative layer. M can be treated as dense or sparse within the band, and hence the diagonal blocks can be handled by a variety of algorithms. The second method, on the other hand, eliminates the need for the reordering with weights by dropping small elements when forming the preconditioner. An outer

Fig. 3 Reduced system Ŝx̂ = ĝ obtained from the spike system in Figure 2

Fig. 4 Partitioning of the system (Ax = f) into three parts; boxes represent nonzeros

iterative method is also used for the second method, and the preconditioned systems are handled by the sparse variation of the DS factorization.

3.1 Weighted reordering and banded preconditioning

Given a linear system of equations, Ax = f, we first apply a nonsymmetric row permutation as follows:

QAx = Q f. (2)

Fig. 5 Spike system Sx = g, where g = D^{-1} f and S = D^{-1} A, for the system in Figure 4

Fig. 6 Reduced system Ŝx̂ = ĝ obtained from the spike system in Figure 5

Here, Q is the row permutation matrix that either maximizes the number of nonzeros on the diagonal of A, or maximizes the product of the absolute values of the diagonal entries [11]. The first algorithm is known as maximum transversal search, while the second provides scaling factors so that the absolute values of the diagonal entries are equal to one and all other elements are less than or equal to one. This scaling can be applied as follows:

(Q D_2 A D_1)(D_1^{-1} x) = (Q D_2 f). (3)

Both algorithms are implemented in subroutine MC64 [10] of the HSL [17] library.
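As a quick illustration of the first option, the sketch below builds a row permutation that places nonzeros on the diagonal, using SciPy's bipartite matching as a stand-in for MC64 (which has no standard Python interface); the small test matrix and the assumption of structural nonsingularity are for the example only.

```python
# Hedged sketch: a row permutation Q for Eq. (2) that puts nonzeros on
# the diagonal, via SciPy's bipartite matching in place of HSL's MC64.
# A is assumed structurally nonsingular.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import maximum_bipartite_matching

A = csr_matrix(np.array([[0., 2., 0.],
                         [3., 0., 1.],
                         [0., 4., 5.]]))
perm = maximum_bipartite_matching(A, perm_type='row')  # row matched to column j
QA = A[perm, :]                        # QA = Q A
assert np.all(QA.diagonal() != 0)      # the diagonal is now zero-free
```

MC64 additionally returns the scaling factors D_1 and D_2 of Eq. (3) when the product of diagonal magnitudes is maximized; the matching alone only guarantees a zero-free diagonal.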

Following the above nonsymmetric reordering and optional scaling, we apply the symmetric permutation P as follows:

(P Q D_2 A D_1 P^T)(P D_1^{-1} x) = (P Q D_2 f). (4)

The permutation P can be chosen such that the magnitudes of the nonzeros close to the main diagonal are larger than those of the nonzeros far away. Such a reordering can be obtained by solving an eigenvalue problem, as follows. The second smallest eigenvalue and the corresponding eigenvector of the Laplacian of a graph have been used in a number of application areas, including matrix reordering [26, 23, 5], graph partitioning [32, 33], machine learning [29], protein analysis and data mining [16, 42, 20], and web search [15]. The second smallest eigenvalue of the Laplacian of a graph is sometimes called the algebraic connectivity of the graph, and the corresponding eigenvector is known as the Fiedler vector, due to the work of Fiedler [12]. For a given n × n sparse symmetric matrix A, or an undirected weighted graph with positive weights, one can form the weighted Laplacian matrix L_w as follows:

L_w(i, j) = Σ_{ĵ≠i} A(i, ĵ) if i = j, and L_w(i, j) = −A(i, j) if i ≠ j. (5)

Since the Fiedler vector can be computed independently for disconnected graphs, we assume that the graph is connected. The eigenvalues of L_w are then all greater than zero except λ_1 = 0. The eigenvector x_2 corresponding to the smallest nontrivial eigenvalue λ_2 is called the Fiedler vector. Since we assume a connected graph, the trivial eigenvector x_1 is a vector of all ones. In case the matrix A is nonsymmetric, one can use (|A| + |A|^T)/2 instead.

A state-of-the-art multilevel solver for computing the Fiedler vector, called MC73_Fiedler [18], is implemented in the Harwell Subroutine Library (HSL) [17]. It uses a series of levels of coarser graphs, where the eigenvalue problem corresponding to the coarsest level is solved via the Lanczos method to estimate the Fiedler vector. The results are then prolongated to the finer graphs, and Rayleigh quotient iteration (RQI) with shift and invert is used for refining the eigenvector. Linear systems encountered in RQI are solved via the SYMMLQ algorithm. We consider MC73_Fiedler to be one of the best uniprocessor implementations for determining the Fiedler vector. A new parallel algorithm, TraceMin-Fiedler, has been developed based on the trace minimization algorithm (TraceMin) [38, 35]; parallel results comparing it to MC73_Fiedler are presented in [25].

We consider solving the standard symmetric eigenvalue problem

Lx = λx, (6)

where L denotes the weighted Laplacian, using the TraceMin scheme for obtaining the Fiedler vector. The basic TraceMin algorithm can be summarized as follows. Let X_k be an approximation of the eigenvectors corresponding to the p smallest eigenvalues such that X_k^T L X_k = Σ_k and X_k^T X_k = I, where Σ_k = diag(ρ_1^(k), ρ_2^(k), ..., ρ_p^(k)). The updated approximation is obtained by solving the minimization problem

min tr((X_k − Δ_k)^T L (X_k − Δ_k)), subject to Δ_k^T X_k = 0. (7)

This in turn leads to the need for solving, in each iteration of the TraceMin algorithm, a saddle point problem of the form

[ L      X_k ] [ Δ_k ]   [ L X_k ]
[ X_k^T  0   ] [ N_k ] = [ 0     ].  (8)

We first solve the Schur complement system (X_k^T L^{-1} X_k) N_k = X_k^T X_k to obtain N_k. After Δ_k is retrieved, (X_k − Δ_k) is used to obtain X_{k+1}, which forms the section

X_{k+1}^T L X_{k+1} = Σ_{k+1},  X_{k+1}^T X_{k+1} = I. (9)

The TraceMin-Fiedler algorithm, which is based on the basic TraceMin algorithm, is given in Figure 7.

Algorithm 1:
Data: L is the n × n Laplacian matrix defined in Eq. (5), ε_out is the stopping criterion for the relative residual of the eigenvalue problem, p is the number of eigenpairs to be computed, and q is the dimension of the search space.
Result: x_2, the eigenvector corresponding to the second smallest eigenvalue of L.
p ← 2; q ← p + 2; n_conv ← 0; X_conv ← [ ];
L̂ ← L + ‖L‖ I; D ← the diagonal of L; D̂ ← the diagonal of L̂;
X_1 ← rand(n, q);
for k = 1, 2, ..., max_it do
1. Orthonormalize X_k into V_k;
2. Compute the interaction matrix H_k ← V_k^T L V_k;
3. Compute the eigendecomposition H_k Y_k = Y_k Σ_k of H_k; the eigenvalues Σ_k are arranged in ascending order and the eigenvectors are chosen to be orthogonal;
4. Compute the corresponding Ritz vectors X_k ← V_k Y_k; note that X_k is a section, i.e. X_k^T L X_k = Σ_k and X_k^T X_k = I;
5. Compute the relative residual ‖L X_k − X_k Σ_k‖ / ‖L‖;
6. Test for convergence: if the relative residual of an approximate eigenvector is less than ε_out, move that vector from X_k to X_conv and increment n_conv by one; if n_conv ≥ p, stop;
7. Deflate: if n_conv > 0, X_k ← X_k − X_conv (X_conv^T X_k);
8. If n_conv = 0, solve the linear system L̂ W_k = X_k approximately with relative residual ε_in via the PCG scheme using the diagonal preconditioner D̂; otherwise, solve the linear system L W_k = X_k approximately with relative residual ε_in via the PCG scheme using the diagonal preconditioner D;
9. Form the Schur complement S_k ← X_k^T W_k;
10. Solve the linear system S_k N_k = X_k^T X_k for N_k;
11. Update X_{k+1} ← X_k − Δ_k = W_k N_k;
end
Fig. 7 TraceMin-Fiedler algorithm.
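The algebra of steps 8-11 can be sketched compactly. In the sketch below, dense solves stand in for Algorithm 1's PCG inner solves, a QR factorization plus a small eigendecomposition provides the section of steps 1-4, and a simple diagonal shift makes the toy Laplacian nonsingular in place of the deflation logic; the path-graph Laplacian, sizes, and iteration count are illustrative assumptions.

```python
# Dense sketch of the TraceMin update X_{k+1} = W_k N_k (Eqs. (7)-(9)).
# Dense solves replace the PCG inner solves, and a shifted Lhat replaces
# the deflation logic of Algorithm 1. Toy sizes are assumptions.
import numpy as np

def tracemin_step(L, X):
    W = np.linalg.solve(L, X)           # inner solve: W_k <- L^{-1} X_k
    S = X.T @ W                         # Schur complement S_k = X_k^T L^{-1} X_k
    N = np.linalg.solve(S, X.T @ X)     # solve S_k N_k = X_k^T X_k
    return W @ N                        # X_{k+1} = X_k - Delta_k = W_k N_k

def section(L, X):
    """Steps 1-4: return Ritz vectors V with V^T L V diagonal, V^T V = I."""
    Q, _ = np.linalg.qr(X)
    _, Y = np.linalg.eigh(Q.T @ L @ Q)  # eigenvalues in ascending order
    return Q @ Y

n = 12                                  # path graph: a small connected example
G = np.diag(np.full(n - 1, -1.0), 1); G = G + G.T
Lap = np.diag(-G.sum(axis=1)) + G       # weighted Laplacian of Eq. (5)
Lhat = Lap + np.linalg.norm(Lap, 1) * np.eye(n)   # nonsingular shift

X = np.random.default_rng(1).standard_normal((n, 2))
for _ in range(500):
    X = section(Lhat, tracemin_step(Lhat, X))
fiedler = X[:, 1]                       # eigenvector of the 2nd smallest eigenvalue

w, V = np.linalg.eigh(Lhat)             # check against a dense eigensolver
assert abs(abs(fiedler @ V[:, 1]) - 1.0) < 1e-3
P = np.argsort(fiedler)                 # symmetric permutation used in Eq. (4)
```

Sorting the entries of the Fiedler vector yields the symmetric permutation P of Eq. (4); in practice the deflated, PCG-based Algorithm 1 is what makes the computation scale, and this sketch only mirrors its algebra.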

Using the above algorithm, speed improvements of TraceMin-Fiedler on 1, 8, 16, 32, and 64 cores over the uniprocessor MC73_Fiedler are shown in Figure 8 for matrices obtained from the University of Florida Sparse Matrix Collection [8]. The platform we use is a cluster with InfiniBand interconnect where each node consists of two six-core Intel Xeon CPUs (Westmere X5670) running at 2.93 GHz (12 cores per node).

Fig. 8 Speed improvement, Time(MC73_Fiedler) / Time(TraceMin-Fiedler), for 9 test problems: nlpkkt160, circuit5M, cage15, rajat31, circuit5M_dc, nlpkkt120, Freescale1, memchip, and kkt_power

Reordering using the Fiedler vector produces matrices in which the large elements are clustered around the main diagonal, as shown in Figure 9 for a matrix obtained from the University of Florida Sparse Matrix Collection. Once the reordered system is obtained, one can extract a banded preconditioner and solve the general system using a preconditioned iterative method. Systems involving the preconditioner are solved at each iteration using the DS factorization, in which the systems involving the diagonal blocks of D can be solved using (i) dense sequential or multithreaded banded solvers [26], or (ii) a sparse sequential or multithreaded direct solver. The second variation has been implemented in [41, 27, 23] and is called PSPIKE.
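As a small illustration of this first hybrid method, the sketch below extracts a banded preconditioner M from an (already reordered) matrix and uses it inside an outer BiCGStab; the bandwidth k, the test matrix, and the sequential sparse factorization standing in for the parallel DS solve are all illustrative assumptions.

```python
# Hedged sketch: banded preconditioner within an outer BiCGStab.
# A sequential sparse LU stands in for the parallel DS-based solve of
# the banded preconditioner M; matrix and bandwidth are assumptions.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, bicgstab, splu

def banded_preconditioner(A, k):
    """Keep only entries within half-bandwidth k of the main diagonal."""
    B = A.tocoo()
    keep = np.abs(B.row - B.col) <= k
    M = sp.csc_matrix((B.data[keep], (B.row[keep], B.col[keep])),
                      shape=A.shape)
    return splu(M)                      # stand-in for the parallel DS solve

rng = np.random.default_rng(0)
n, k = 200, 2
A = (sp.random(n, n, density=0.01, random_state=rng, format='csr')
     + sp.diags(np.full(n, 4.0))        # dominant diagonal
     + sp.diags(np.full(n - 1, -1.0), 1))
f = rng.standard_normal(n)

Mlu = banded_preconditioner(A, k)
M = LinearOperator((n, n), matvec=Mlu.solve)
x, info = bicgstab(A, f, M=M)
assert info == 0                        # outer iteration converged
```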

We obtained the g3_circuit matrix (1,585,478 unknowns and 7,660,826 nonzeros) from the University of Florida Sparse Matrix Collection. In Figure 10, the speed improvements of PSPIKE, MUMPS, and Pardiso are presented for the g3_circuit system. The platform we use is an Intel Xeon (X5560 @ 2.8 GHz) cluster with InfiniBand interconnect and 16 GB memory per node. In PSPIKE, BiCGStab [43] is used as the outer iterative solver, and the iterations are stopped when ‖f − Ax‖ / ‖f‖ falls below the stopping tolerance. The reduced system is truncated to enhance parallelism and solved directly.

Fig. 9 Sparsity plots of eurqsa: (a) original matrix, (b) after TraceMin-Fiedler reordering; red and blue indicate the largest and the smallest elements, respectively

Fig. 10 The speed improvement for g3_circuit compared to Pardiso using one core (73.5 seconds)

3.2 Domain Decomposing Parallel Sparse Solver

Given a general sparse linear system Ax = f, we partition A ∈ R^{n×n} into p block rows A = [A_1, A_2, ..., A_p]^T. Let

A = D + R, (10)

where D consists of the p diagonal blocks of A,

D = diag(A_11, A_22, ..., A_pp), (11)

and R consists of the remaining elements (i.e. R = A − D). We note that this process can be viewed as an algebraic domain decomposition. Therefore, we call the method described in this section the Domain Decomposing Parallel Sparse Solver (DDPS). The DS factorization in DDPS is used for solving the systems involving the preconditioner. Let L̃_i and Ũ_i be the incomplete or complete LU factors of A_ii, for i = 1, 2, ..., p. We define

D̃ = diag(Ã_11, Ã_22, ..., Ã_pp), (12)

in which Ã_ii = L̃_i Ũ_i. The DDPS algorithm is shown in Figure 11.

Data: Ax = f and p
Result: x
1. D + R ← A for a given p;
2. L̃_i Ũ_i ← A_ii (approximate or exact) for i = 1, 2, ..., p;
3. R̃ ← R (by dropping some elements);
4. S̃ ← D̃^{-1} R̃;
5. Identify the nonzero columns of S̃ and store their indices in the array c;
6. Solve Ax = f via a Krylov subspace method with the preconditioner P = D̃ + R̃ and stopping tolerance ε_out; each preconditioning operation solves Pz = y (D̃^{-1} P z = D̃^{-1} y, i.e. (I + S̃) z = g):
6.1. g ← D̃^{-1} y;
6.2. Ŝ ← I(c, c) + S̃(c, c); ẑ ← z(c); ĝ ← g(c);
6.3. Solve the smaller independent system Ŝ ẑ = ĝ (directly, or iteratively with stopping tolerance ε_in);
6.4. z(c) ← ẑ;
6.5. z ← g − S̃ z;
Fig. 11 DDPS algorithm.
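The stages of Figure 11 map onto a few lines of linear algebra. Below is a minimal dense sketch of the setup (Stages 1-5) and one preconditioner application Pz = y (Stage 6), with exact block LU, a priori dropping with threshold δ, and a direct reduced-system solve; the function names, equal block sizes, and dense representation are illustrative assumptions rather than the DDPS implementation.

```python
# Dense sketch of DDPS preprocessing (Stages 1-5) and one application
# of the preconditioner Pz = y (Stage 6) from Fig. 11. Assumptions:
# equal blocks, exact block LU, direct reduced-system solve.
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def ddps_setup(A, p, delta):
    n = A.shape[0]; m = n // p
    blocks = [slice(i * m, (i + 1) * m) for i in range(p)]
    R = A.copy()
    for s in blocks:
        R[s, s] = 0.0                                   # Stage 1: R = A - D
    lu = [lu_factor(A[s, s]) for s in blocks]           # Stage 2: block LU
    for s in blocks:                                    # Stage 3: a priori drop
        cn = np.abs(R[s]).max(axis=0)                   # column norms of R_i
        R[s][:, cn < delta * cn.max()] = 0.0
    S = np.vstack([lu_solve(lu[i], R[s])                # Stage 4: S = D^{-1} R
                   for i, s in enumerate(blocks)])
    c = np.flatnonzero(np.abs(S).sum(axis=0))           # Stage 5: nonzero cols
    S_hat = np.eye(len(c)) + S[np.ix_(c, c)]            # reduced system matrix
    return blocks, lu, S, c, S_hat

def ddps_apply(y, blocks, lu, S, c, S_hat):
    g = np.concatenate([lu_solve(lu[i], y[s])           # 6.1: g = D^{-1} y
                        for i, s in enumerate(blocks)])
    z_c = np.linalg.solve(S_hat, g[c])                  # 6.3 (direct here)
    return g - S[:, c] @ z_c                            # 6.4-6.5: z = g - S z

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8)) * 0.1 + 4.0 * np.eye(8)
setup = ddps_setup(A, p=2, delta=0.0)
y = rng.standard_normal(8)
z = ddps_apply(y, *setup)
assert np.allclose(A @ z, y)            # with delta = 0 and exact LU, P = A
```

Wrapped in a scipy.sparse.linalg.LinearOperator, ddps_apply can serve as the preconditioner inside an outer BiCGStab, which is how P = D̃ + R̃ is used in the discussion that follows.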

Stages 1-5 are preprocessing stages in which the right-hand side is not required. After preprocessing, we solve the system via a Krylov subspace method with a preconditioner. The major operations in a Krylov subspace method are (i) matrix-vector multiplications, (ii) inner products, and (iii) preconditioning operations of the form Pz = y. Details of the preconditioning operations for DDPS are given in Figure 11. Each stage, with the exception of Stage 6.3, can be executed with perfect parallelism, requiring no interprocessor communication. In Stage 6.3, the solution of the smaller system Ŝẑ = ĝ is the only part of the algorithm that requires communication. We solve this smaller reduced system iteratively via BiCGStab without preconditioning.

The size of Ŝ, which is determined by the number of nonzero columns of S̃, is problem dependent and is expected to influence the overall scalability of the algorithm. We employ several techniques to reduce the dimension of Ŝ. First, we use METIS [19] reordering to minimize the total communication volume, hence reducing the size of Ŝ. We also use the following dropping strategy: if, for any column j in R_i,

‖R_i(:, j)‖ ≤ δ max_l ‖R_i(:, l)‖ (i = 1, 2, ..., p),

we do not consider that column when forming Ŝ. Here R_i is the block row partition of R (i.e. R = [R_1, R_2, ..., R_p]^T). We call this dropping strategy a priori dropping and use it for obtaining the results in this section. Another possibility is to drop elements after computing S̃ (a posteriori dropping). We note that dropping elements from R in Stage 3 to reduce the size of Ŝ results in an approximation of the solution. Furthermore, we can use approximate LU factorizations of the diagonal blocks in Stage 2 and solve Ŝẑ = ĝ iteratively in Stage 6.3. Therefore, we place an outer iterative layer (e.g. BiCGStab) in which we use the above algorithm as a solver for systems involving the preconditioner P = D̃ + R̃, where R̃ consists only of the columns that are not dropped. We stop the outer iterations when the relative residual at the k-th iteration satisfies ‖r_k‖ / ‖r_0‖ ≤ ε_out.

DDPS is a direct solver if (i) nothing is dropped from R, (ii) the exact LU factorization of each A_ii is computed, and (iii) Ŝẑ = ĝ is solved exactly. When DDPS is used as a direct solver, an outer iterative scheme is not required. The choices made in Stages 2, 3, and 6.3 result in a solver that can be as robust as a direct solver, as scalable as an iterative solver, or anything in between. We note that the outer iterative layer also benefits from our partitioning strategy, as METIS minimizes the total communication volume in parallel sparse matrix-vector multiplications. We further note that Ŝ consists of dense columns, which we store as a two-dimensional array in memory; as a result, matrix-vector multiplications can be done via BLAS2 (or BLAS3 in the case of multiple right-hand sides).

We obtained the torso3 matrix (259,156 unknowns and 4,429,042 nonzeros) from the University of Florida Sparse Matrix Collection. In Figure 12 we present for this matrix the speed improvements of the DDPS and MUMPS solvers compared to Pardiso on a single core of an Intel Xeon (X5560 @ 2.8 GHz) cluster with InfiniBand interconnect and 16 GB memory per node, using a fixed outer stopping tolerance ε_out. In this example, DDPS uses Pardiso for solving systems involving the diagonal blocks of D.

Fig. 12 The speed improvement for torso3 compared to Pardiso using one core (49.4 seconds)

4 Conclusions

We have presented two alternative formulations of DS factorization based methods for the solution of sparse linear systems on parallel computing platforms. Performance comparisons to traditional LU factorization based parallel solvers show that by combining iterative methods with direct solvers and using the DS factorization, one can achieve better scalability and shorter time to solution.

Acknowledgements I would like to thank Ahmed Sameh, Ananth Grama, Faisal Saied, Eric Cox, Kenji Takizawa, Madan Sathe, Mehmet Koyuturk, Olaf Schenk, and Tayfun Tezduyar for their help and many useful discussions.

References

1. P. R. Amestoy and I. S. Duff, Multifrontal parallel distributed symmetric and unsymmetric solvers, Comput. Methods Appl. Mech. Eng., 184 (2000).
2. P. R. Amestoy, I. S. Duff, J.-Y. L'Excellent, and J. Koster, A fully asynchronous multifrontal solver using distributed dynamic scheduling, SIAM J. Matrix Anal. Appl., 23 (2001).
3. P. R. Amestoy, A. Guermouche, J.-Y. L'Excellent, and S. Pralet, Hybrid scheduling for the parallel solution of linear systems, Parallel Comput., 32 (2006).
4. E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen, LAPACK: A portable linear algebra library for high-performance computers, tech. report, Knoxville.
5. S. T. Barnard, A. Pothen, and H. Simon, A spectral algorithm for envelope reduction of sparse matrices, Numerical Linear Algebra with Applications, 2 (1995).

6. M. W. Berry and A. Sameh, Multiprocessor schemes for solving block tridiagonal linear systems, The International Journal of Supercomputer Applications, 1 (1988).
7. L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley, ScaLAPACK: a linear algebra library for message-passing computers, in Proceedings of the Eighth SIAM Conference on Parallel Processing for Scientific Computing (Minneapolis, MN, 1997), Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1997, p. 15 (electronic).
8. T. A. Davis, University of Florida sparse matrix collection, NA Digest.
9. J. J. Dongarra and A. H. Sameh, On some parallel banded system solvers, Parallel Computing, 1 (1984).
10. I. S. Duff and J. Koster, On algorithms for permuting large entries to the diagonal of a sparse matrix.
11. I. S. Duff and J. Koster, The design and use of algorithms for permuting large entries to the diagonal of sparse matrices, SIAM Journal on Matrix Analysis and Applications, 20 (1999).
12. M. Fiedler, Algebraic connectivity of graphs, Czechoslovak Mathematical Journal, 23 (1973).
13. A. Gupta, Recent advances in direct methods for solving unsymmetric sparse systems of linear equations, ACM Transactions on Mathematical Software, 28 (2002).
14. A. Gupta, S. Koric, and T. George, Sparse matrix factorization on massively parallel computers, in SC '09, ACM/IEEE, Portland, OR, USA, Nov. 2009.
15. X. He, H. Zha, C. H. Ding, and H. D. Simon, Web document clustering using hyperlink structures, Computational Statistics & Data Analysis, 41 (2002).
16. D. J. Higham, G. Kalna, and M. Kibble, Spectral clustering and its use in bioinformatics, Journal of Computational and Applied Mathematics, 204 (2007). Special issue dedicated to Professor Shinnosuke Oharu on the occasion of his 65th birthday.
17. HSL, A collection of Fortran codes for large-scale scientific computation.
18. Y. Hu and J. Scott, HSL MC73: a fast multilevel Fiedler and profile reduction code, Technical Report RAL-TR, Rutherford Appleton Laboratory.
19. G. Karypis and V. Kumar, METIS: Unstructured graph partitioning and sparse matrix ordering system, The University of Minnesota, 2 (1995).
20. S. Kundu, D. Sorensen, and G. N. Phillips Jr., Automatic domain decomposition of proteins by a Gaussian network model, Proteins: Structure, Function, and Bioinformatics, 57 (2004).
21. D. H. Lawrie and A. H. Sameh, The computation and communication complexity of a parallel banded system solver, ACM Trans. Math. Softw., 10 (1984).
22. X. Li and J. W. Demmel, SuperLU_DIST: A scalable distributed-memory sparse direct solver for unsymmetric linear systems, ACM Trans. Mathematical Software, 29 (2003).
23. M. Manguoglu, A parallel hybrid sparse linear system solver, in Computational Electromagnetics International Workshop, CEM 2009, July 2009.
24. M. Manguoglu, A domain decomposing parallel sparse linear system solver, Journal of Computational and Applied Mathematics (in review).
25. M. Manguoglu, E. Cox, F. Saied, and A. Sameh, TRACEMIN-Fiedler: A parallel algorithm for computing the Fiedler vector, High Performance Computing for Computational Science, VECPAR 2010 (2011).
26. M. Manguoglu, M. Koyuturk, A. Sameh, and A. Grama, Weighted matrix ordering and parallel banded preconditioners for iterative linear system solvers, SIAM Journal on Scientific Computing, 32 (2010).
27. M. Manguoglu, A. Sameh, and O. Schenk, PSPIKE: A parallel hybrid sparse linear system solver, Euro-Par 2009 Parallel Processing (2009).
28. M. Manguoglu, K. Takizawa, A. Sameh, and T. Tezduyar, Nested and parallel sparse algorithms for arterial fluid mechanics computations with boundary layer mesh refinement, International Journal for Numerical Methods in Fluids, 65 (2011).

29. A. Y. Ng, M. I. Jordan, and Y. Weiss, On spectral clustering: Analysis and an algorithm, in Advances in Neural Information Processing Systems 14, MIT Press, 2001.
30. E. Polizzi and A. H. Sameh, A parallel hybrid banded system solver: the Spike algorithm, Parallel Comput., 32 (2006).
31. E. Polizzi and A. H. Sameh, SPIKE: A parallel environment for solving banded linear systems, Computers & Fluids, 36 (2007).
32. A. Pothen, H. D. Simon, and K.-P. Liou, Partitioning sparse matrices with eigenvectors of graphs, SIAM J. Matrix Anal. Appl., 11 (1990).
33. H. Qiu and E. R. Hancock, Graph matching and clustering using spectral partitions, Pattern Recognition, 39 (2006).
34. S. C. Chen, D. J. Kuck, and A. H. Sameh, Practical parallel band triangular system solvers, ACM Transactions on Mathematical Software, 4 (1978).
35. A. Sameh and Z. Tong, The trace minimization method for the symmetric generalized eigenvalue problem, J. Comput. Appl. Math., 123 (2000).
36. A. H. Sameh and D. J. Kuck, On stable parallel linear system solvers, J. ACM, 25 (1978).
37. A. H. Sameh and V. Sarin, Hybrid parallel linear system solvers, International Journal of Computational Fluid Dynamics, 12 (1999).
38. A. H. Sameh and J. A. Wisniewski, A trace minimization algorithm for the generalized eigenvalue problem, SIAM Journal on Numerical Analysis, 19 (1982).
39. O. Schenk and K. Gärtner, Solving unsymmetric sparse systems of linear equations with PARDISO, Future Generation Computer Systems, 20 (2004).
40. O. Schenk and K. Gärtner, On fast factorization pivoting methods for sparse symmetric indefinite systems, Electronic Transactions on Numerical Analysis, 23 (2006).
41. O. Schenk, M. Manguoglu, A. Sameh, M. Christen, and M. Sathe, Parallel scalable PDE-constrained optimization: antenna identification in hyperthermia cancer treatment planning, Computer Science Research and Development, 23 (2009).
42. S. J. Shepherd, C. B. Beggs, and S. Jones, Amino acid partitioning using a Fiedler vector model, European Biophysics Journal, 37 (2007).
43. H. A. van der Vorst, Bi-CGSTAB: a fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems, SIAM J. Sci. Stat. Comput., 13 (1992).


More information

A dissection solver with kernel detection for symmetric finite element matrices on shared memory computers

A dissection solver with kernel detection for symmetric finite element matrices on shared memory computers A dissection solver with kernel detection for symmetric finite element matrices on shared memory computers Atsushi Suzuki, François-Xavier Roux To cite this version: Atsushi Suzuki, François-Xavier Roux.

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 23: GMRES and Other Krylov Subspace Methods; Preconditioning

AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 23: GMRES and Other Krylov Subspace Methods; Preconditioning AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 23: GMRES and Other Krylov Subspace Methods; Preconditioning Xiangmin Jiao SUNY Stony Brook Xiangmin Jiao Numerical Analysis I 1 / 18 Outline

More information

FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION

FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION EDMOND CHOW AND AFTAB PATEL Abstract. This paper presents a new fine-grained parallel algorithm for computing an incomplete LU factorization. All nonzeros

More information

Domain Decomposition-based contour integration eigenvalue solvers

Domain Decomposition-based contour integration eigenvalue solvers Domain Decomposition-based contour integration eigenvalue solvers Vassilis Kalantzis joint work with Yousef Saad Computer Science and Engineering Department University of Minnesota - Twin Cities, USA SIAM

More information

Lecture Notes to Accompany. Scientific Computing An Introductory Survey. by Michael T. Heath. Chapter 2. Systems of Linear Equations

Lecture Notes to Accompany. Scientific Computing An Introductory Survey. by Michael T. Heath. Chapter 2. Systems of Linear Equations Lecture Notes to Accompany Scientific Computing An Introductory Survey Second Edition by Michael T. Heath Chapter 2 Systems of Linear Equations Copyright c 2001. Reproduction permitted only for noncommercial,

More information

6.4 Krylov Subspaces and Conjugate Gradients

6.4 Krylov Subspaces and Conjugate Gradients 6.4 Krylov Subspaces and Conjugate Gradients Our original equation is Ax = b. The preconditioned equation is P Ax = P b. When we write P, we never intend that an inverse will be explicitly computed. P

More information

Alternative correction equations in the Jacobi-Davidson method

Alternative correction equations in the Jacobi-Davidson method Chapter 2 Alternative correction equations in the Jacobi-Davidson method Menno Genseberger and Gerard Sleijpen Abstract The correction equation in the Jacobi-Davidson method is effective in a subspace

More information

Some Geometric and Algebraic Aspects of Domain Decomposition Methods

Some Geometric and Algebraic Aspects of Domain Decomposition Methods Some Geometric and Algebraic Aspects of Domain Decomposition Methods D.S.Butyugin 1, Y.L.Gurieva 1, V.P.Ilin 1,2, and D.V.Perevozkin 1 Abstract Some geometric and algebraic aspects of various domain decomposition

More information

1. Introduction. We consider the problem of solving the linear system. Ax = b, (1.1)

1. Introduction. We consider the problem of solving the linear system. Ax = b, (1.1) SCHUR COMPLEMENT BASED DOMAIN DECOMPOSITION PRECONDITIONERS WITH LOW-RANK CORRECTIONS RUIPENG LI, YUANZHE XI, AND YOUSEF SAAD Abstract. This paper introduces a robust preconditioner for general sparse

More information

A SPARSE APPROXIMATE INVERSE PRECONDITIONER FOR NONSYMMETRIC LINEAR SYSTEMS

A SPARSE APPROXIMATE INVERSE PRECONDITIONER FOR NONSYMMETRIC LINEAR SYSTEMS INTERNATIONAL JOURNAL OF NUMERICAL ANALYSIS AND MODELING, SERIES B Volume 5, Number 1-2, Pages 21 30 c 2014 Institute for Scientific Computing and Information A SPARSE APPROXIMATE INVERSE PRECONDITIONER

More information

A METHOD FOR PROFILING THE DISTRIBUTION OF EIGENVALUES USING THE AS METHOD. Kenta Senzaki, Hiroto Tadano, Tetsuya Sakurai and Zhaojun Bai

A METHOD FOR PROFILING THE DISTRIBUTION OF EIGENVALUES USING THE AS METHOD. Kenta Senzaki, Hiroto Tadano, Tetsuya Sakurai and Zhaojun Bai TAIWANESE JOURNAL OF MATHEMATICS Vol. 14, No. 3A, pp. 839-853, June 2010 This paper is available online at http://www.tjm.nsysu.edu.tw/ A METHOD FOR PROFILING THE DISTRIBUTION OF EIGENVALUES USING THE

More information

MARCH 24-27, 2014 SAN JOSE, CA

MARCH 24-27, 2014 SAN JOSE, CA MARCH 24-27, 2014 SAN JOSE, CA Sparse HPC on modern architectures Important scientific applications rely on sparse linear algebra HPCG a new benchmark proposal to complement Top500 (HPL) To solve A x =

More information

Variants of BiCGSafe method using shadow three-term recurrence

Variants of BiCGSafe method using shadow three-term recurrence Variants of BiCGSafe method using shadow three-term recurrence Seiji Fujino, Takashi Sekimoto Research Institute for Information Technology, Kyushu University Ryoyu Systems Co. Ltd. (Kyushu University)

More information

An approximate minimum degree algorithm for matrices with dense rows

An approximate minimum degree algorithm for matrices with dense rows RT/APO/08/02 An approximate minimum degree algorithm for matrices with dense rows P. R. Amestoy 1, H. S. Dollar 2, J. K. Reid 2, J. A. Scott 2 ABSTRACT We present a modified version of the approximate

More information

From Direct to Iterative Substructuring: some Parallel Experiences in 2 and 3D

From Direct to Iterative Substructuring: some Parallel Experiences in 2 and 3D From Direct to Iterative Substructuring: some Parallel Experiences in 2 and 3D Luc Giraud N7-IRIT, Toulouse MUMPS Day October 24, 2006, ENS-INRIA, Lyon, France Outline 1 General Framework 2 The direct

More information