Enhancing Scalability of Sparse Direct Methods

X.S. Li 1, J. Demmel 2, L. Grigori 3, M. Gu 4, J. Xia 5, S. Jardin 6, C. Sovinec 7, L.-Q. Lee 8

1 One Cyclotron Road, Lawrence Berkeley National Laboratory, MS 50F-1650, Berkeley, CA 94720
2 Computer Science Division, University of California at Berkeley, Berkeley, CA 94720-1776
3 INRIA Rennes, Campus Universitaire de Beaulieu, 35042 Rennes, France
4 Mathematics Department, University of California at Berkeley, Berkeley, CA 94720-3840
5 Department of Mathematics, University of California, Los Angeles, Los Angeles, CA 90095
6 Princeton Plasma Physics Laboratory, P.O. Box 451, Princeton, NJ 08543
7 Engineering Physics Department, University of Wisconsin, Madison, WI 53706
8 Stanford Linear Accelerator Center, 2575 Sand Hill Road, Mail Stop 7, Menlo Park, CA 94025

E-mail: xsli@lbl.gov

Abstract. TOPS is providing high-performance, scalable sparse direct solvers, which have had significant impacts on the SciDAC applications, including fusion simulation (CEMM) and accelerator modeling (COMPASS), as well as many other mission-critical applications in DOE and elsewhere. Our recent developments have focused on new techniques to overcome the scalability bottlenecks of direct methods, in both time and memory. These include parallelizing the symbolic analysis phase and developing linear-complexity sparse factorization methods. The new techniques will make sparse direct methods more widely usable in large 3D simulations on highly parallel petascale computers.

1. The SciDAC applications

1.1. Fusion energy research
The Center for Extended Magnetohydrodynamic Modeling (CEMM) [1] is developing simulation codes for studying the nonlinear macroscopic dynamics of MHD-like phenomena in fusion plasmas, and for addressing critical issues facing burning plasma experiments such as ITER. Their code suite includes M3D-C1 and NIMROD. The PDEs include many more physical effects than the standard MHD equations; they involve large, multiple time scales and are temporally very stiff, therefore requiring implicit methods. The linear systems are extremely ill-conditioned, and 50-90% of the execution time is spent in the linear solvers. The TOPS sparse direct solver SuperLU has played significant roles in both codes. For the large 3D matrix-free formulation in NIMROD, SuperLU is also used as an effective preconditioner for the global GMRES solver.

1.2. Accelerator design
The Community Petascale Project for Accelerator Science and Simulation (COMPASS) [2] is developing parallel simulation tools with integrated capabilities in beam dynamics, electromagnetics, and advanced accelerator concept modeling for accelerator design, analysis, and discovery. Their code suite includes the Omega3P eigenanalysis code. The eigen-computations for cavity mode frequencies and field vectors are on the critical path of the shape optimization cycles, and need to be done repeatedly, accurately, and quickly. Another challenge is that a large number of small nonzero eigenvalues, which are tightly clustered, is desired. The matrix dimensions can be tens to hundreds of millions. The main method used is shift-invert Lanczos, for which the shifted linear systems are solved with a combination of direct and iterative methods.
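In the shift-invert pattern described above, one sparse direct factorization per shift is amortized over all of the inner Lanczos solves. A minimal serial sketch of this pattern follows, using SciPy's sparse LU purely as a stand-in for a parallel direct solver; the matrices K and M, the shift value, and the problem size are illustrative placeholders and are not taken from Omega3P.

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    n = 10000
    # Toy stand-ins for the stiffness and mass matrices of a cavity model.
    K = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
    M = sp.identity(n, format="csc")

    sigma = 0.5  # shift placed near the small, clustered eigenvalues of interest

    # One sparse direct factorization of the shifted matrix (the expensive step).
    lu = spla.splu((K - sigma * M).tocsc())

    # The factorization is reused for every inner solve of the Lanczos iteration.
    OPinv = spla.LinearOperator((n, n), matvec=lu.solve)

    # Shift-invert Lanczos: eigenvalues of K x = lambda M x closest to sigma.
    vals, vecs = spla.eigsh(K, k=10, M=M, sigma=sigma, OPinv=OPinv, which="LM")
    print(vals)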

Table 1. Characteristics of the sample matrices. Sparsity is measured as the average number of nonzeros per row (nnz(A)/N), and the fill ratio is the number of nonzeros in L+U over that in A. Here, MeTiS is used to reorder the equations to reduce fill.

Name        Codes    Type     Order (N)   nnz(A)/N   Fill-ratio
matrix8     M3D-C1   Real     589,698     6          9.
matrix      M3D-C1   Real     8,78        6          9.
cc_linear   NIMROD   Complex  59,0        09         7.5
dds5        Omega3P  Real     8,575       6          0.

Figure 1. SuperLU runtime (seconds) for the linear systems from the SciDAC applications: (a) factorization time and (b) triangular solution time, versus the number of IBM Power5 processors, on the IBM Power5 machine at NERSC. The factorization reached a 6 Gflops/s flop rate for matrix.

1.3. SuperLU efficiency with these applications
SuperLU [6] is a leading scalable solver for sparse linear systems using direct methods; its development is mainly funded through the TOPS SciDAC project (led by David Keyes) [7]. Table 1 shows the characteristics of a few typical matrices taken from these simulation codes. Figure 1 shows the parallel runtime of the two important phases of SuperLU: factorization and triangular solution. The experiments were performed on an IBM Power5 parallel machine at NERSC. In the strong-scaling sense, the factorization routine scales very well, although performance varies across applications. The triangular solution takes a very small fraction of the total time. On the other hand, it does not scale as well as factorization, mainly due to its large communication-to-computation ratio and higher degree of sequential dependency. One of our future tasks is to improve the scalability of this phase, since in these application codes the triangular solution often needs to be performed several times for each factorization.
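The factorize-once, solve-many pattern mentioned above looks roughly as follows. This is a minimal serial sketch using SciPy in place of the parallel SuperLU_DIST calls; the matrix and the number of solve steps are made up for illustration.

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    n = 200000
    # Toy sparse system standing in for one of the application matrices.
    A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")

    lu = spla.splu(A)              # symbolic + numerical factorization: done once

    for step in range(10):         # e.g. successive time steps or Newton iterations
        b = np.random.rand(n)      # a new right-hand side at each step
        x = lu.solve(b)            # only the two triangular solves are repeated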

2. Improving memory scalability of SuperLU: parallelizing symbolic factorization
Symbolic factorization is the phase that determines the nonzero locations of the L and U factors. In most parallel sparse direct solvers, this phase is performed serially, with the matrix A available on one node. The reasons for not doing it in parallel are: 1) the algorithm involves only integer operations and is very fast in practice (its complexity is larger than the nonzero count of L+U, but much smaller than the flop count of the LU decomposition), and 2) it is very difficult to design an efficient parallel algorithm, partly because of the sequentiality of the tasks (the computation of the i-th column/row depends on the results of the previous columns/rows) and partly because of the low computation-to-communication ratio. Obviously, with this sequential part, the solver's scalability will suffer by Amdahl's law. Another reason we have to do this in parallel is that, as the simulations increase in size, A itself will not fit on one node.

Motivated mainly by memory scalability, we have designed and implemented an efficient parallel algorithm for this phase. Our approach is based on graph partitioning for reordering/partitioning the input matrix. Specifically, we apply ParMeTiS [5] to the structure of A + A^T. We exploit the parallelism given by this partition (coarse level, with the separator tree) and by a block-cyclic distribution (fine level, at each separator). We also identify dense separators and dense columns of L and rows of U to decrease the algorithmic complexity; see [4] for details. The processors-to-matrix assignment is based on a top-down subtree-to-subgroup mapping scheme, as shown in Figure 2.

Figure 2. Subtree-to-subgroup processor-to-matrix mapping; the diagram on the right illustrates the separator tree.

Table 2 compares, for two matrices, the memory high-water marks before and after parallelizing the symbolic factorization phase. Considering only the symbolic phase itself, the memory reduction is dramatic, up to a factor of 5 for matrix8 and 7 for dds5. When all the other phases are included (entire solver), the new solver reduces the memory usage by nearly 5-fold for matrix8; now the numerical phase consumes more memory. Table 3 shows the runtime of the parallel symbolic algorithm; reasonable speedups are obtained. One remark worth noting is that the fill quality of ParMeTiS worsens with increasing processor count. This remains an open problem for future research.

Table 2. Memory usage (Megabytes), maximum among all processors. "Old" and "New" solver mean using the serial and the parallel symbolic factorization algorithm, respectively.

matrix8                          P = 8    P = 56
Fill (millions)   Sequential     888.     888.
                  Parallel       09.      5.
Symbolic          Sequential     65.5     65.5
                  Parallel       8.0      8.
                  Ratio Seq/Par  0        5
Entire solver     Old            5.       77.
                  New            6.8      8.

dds5                             P = 8    P = 56
Fill (millions)   Sequential     56.6     56.6
                  Parallel       58.9     58.7
Symbolic          Sequential     95.8     95.8
                  Parallel       7.       0.8
                  Ratio Seq/Par  7
Entire solver     Old            06.9     .
                  New            87.0     .
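As a small illustration of the mapping idea (a sketch only, not the actual SuperLU_DIST code), the full processor group is associated with the root separator and each half of the group is handed recursively to one child subtree. The nested-tuple tree encoding and the processor list below are assumptions made for the example.

    def map_subtrees(tree, procs, assignment=None):
        """tree: a leaf separator id, or a tuple (left_subtree, right_subtree, sep_id)."""
        if assignment is None:
            assignment = {}
        if not isinstance(tree, tuple):        # leaf of the separator tree
            assignment[tree] = procs
            return assignment
        left, right, sep_id = tree
        assignment[sep_id] = procs             # the whole group shares this separator
        half = max(1, len(procs) // 2)
        map_subtrees(left, procs[:half], assignment)                 # first half of the group
        map_subtrees(right, procs[half:] or procs[:1], assignment)   # second half (reuse a proc if unsplittable)
        return assignment

    # A 3-level separator tree (separator ids 0-6) mapped onto 4 processors:
    tree = ((0, 1, 4), (2, 3, 5), 6)
    print(map_subtrees(tree, [0, 1, 2, 3]))
    # {6: [0, 1, 2, 3], 4: [0, 1], 0: [0], 1: [1], 5: [2, 3], 2: [2], 3: [3]}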

Table 3. Parallel runtime (seconds). "Old" and "New" solver mean using the serial and the parallel symbolic factorization algorithm, respectively.

matrix8                      P = 8    P = 56
Symbolic       Sequential    6.8      6.8
               Parallel      .6       .7
Entire solver  Old           8.7      6.6
               New           59.      6.5

dds5                         P = 8    P = 56
Symbolic       Sequential    .6       .6
               Parallel      .6       0.5
Entire solver  Old           6.       .
               New           66.      .

3. Linear-complexity sparse factorization
As is well known, sparse Gaussian elimination has super-linear time complexity. For example, for a Laplacian-type PDE model problem on a 2D n x n mesh or a 3D n x n x n mesh, using a nested dissection ordering, the operation counts are O(n^3) and O(n^6), respectively. This was shown to be optimal when performing exact elimination. However, we have observed that by exploiting the numerical low-rankness of the Schur complements, we can design a nearly linear-time, yet sufficiently accurate, factorization algorithm. The idea comes from structured matrix computations; specifically, we use semi-separable matrices in our study. For example, a semi-separable matrix with 4 x 4 blocks has the following structure:

A = \begin{pmatrix}
D_1 & U_1 V_2^T & U_1 W_2 V_3^T & U_1 W_2 W_3 V_4^T \\
V_2 U_1^T & D_2 & U_2 V_3^T & U_2 W_3 V_4^T \\
V_3 W_2^T U_1^T & V_3 U_2^T & D_3 & U_3 V_4^T \\
V_4 W_3^T W_2^T U_1^T & V_4 W_3^T U_2^T & V_4 U_3^T & D_4
\end{pmatrix}

where the first and second off-diagonal blocks of A are

U_1 \begin{pmatrix} V_2^T & W_2 V_3^T & W_2 W_3 V_4^T \end{pmatrix}
\quad \text{and} \quad
\begin{pmatrix} U_1 W_2 \\ U_2 \end{pmatrix}
\begin{pmatrix} V_3^T & W_3 V_4^T \end{pmatrix}.

For general i < j, the (i, j) block entry is U_i W_{i+1} W_{i+2} \cdots W_{j-1} V_j^T, and every upper off-diagonal block has the form

\begin{pmatrix} U_1 W_2 W_3 \cdots W_i \\ U_2 W_3 \cdots W_i \\ \vdots \\ U_i \end{pmatrix}
\begin{pmatrix} V_{i+1}^T & W_{i+1} V_{i+2}^T & \cdots & W_{i+1} W_{i+2} \cdots W_{N-1} V_N^T \end{pmatrix}.

The D, U, W, and V matrices have dimensions k x k. A is n x n with n = Nk and uses only O(nk) memory, which is very economical when k << n. Examples of semi-separable matrices include banded matrices and their inverses. This representation can be constructed numerically, and we can devise many fast algorithms based on this compressed representation.
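To make the representation concrete, the short sketch below assembles a dense matrix from generators D_i, U_i, W_i, V_i using exactly the block formula above. The block sizes and generator contents are made up for illustration, and a real structured solver would of course never form the dense matrix explicitly.

    import numpy as np

    N, k = 8, 4                    # N diagonal blocks of size k x k, so n = N*k
    rng = np.random.default_rng(0)
    D = [np.eye(k) * (i + 2) for i in range(N)]                # diagonal generators
    U = [0.1 * rng.standard_normal((k, k)) for _ in range(N)]
    W = [0.1 * rng.standard_normal((k, k)) for _ in range(N)]
    V = [0.1 * rng.standard_normal((k, k)) for _ in range(N)]

    def block(i, j):
        """(i, j) block of the semi-separable matrix (0-based block indices)."""
        if i == j:
            return D[i]
        if i < j:
            prod = U[i]
            for m in range(i + 1, j):          # U_i W_{i+1} ... W_{j-1}
                prod = prod @ W[m]
            return prod @ V[j].T               # ... V_j^T
        return block(j, i).T                   # symmetric lower part

    A = np.block([[block(i, j) for j in range(N)] for i in range(N)])

    # The generators use O(N k^2) = O(n k) storage versus O(n^2) for the dense A.
    print(A.shape, sum(x.size for x in D + U + W + V), A.size)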

For example, Figure 3 compares the fast semi-separable Cholesky factorization (SS-Cholesky) to the standard Cholesky factorization in LAPACK (DPOTRF), with the rank k chosen to be 6 and 6. For large matrices, we see orders-of-magnitude speedups.

Figure 3. Speedup of the fast semi-separable Cholesky factorization over DPOTRF: log10 of the time ratio DPOTRF/SS-Cholesky versus matrix size (256 to 8192), for the two choices of k.

This SS-Cholesky kernel can be used in a sparse factorization. Consider using a nested dissection ordering for the discretization mesh; the submatrix corresponding to a separator node is essentially dense during the Schur complement updates. Furthermore, the maximum off-diagonal rank of those matrices is small and nearly constant. Based on this observation, we devised a new multifrontal factorization scheme, which compresses the frontal matrix into semi-separable form and performs SS-Cholesky on it. The superfast multifrontal method works as follows [3]:

- Perform traditional Cholesky factorization at the bottom levels of the separator tree.
- At the higher levels, convert the fill-in submatrices into semi-separable structures, and perform structured semi-separable Cholesky factorizations at these levels.

The overall operation count is reduced from O(n^3) to O(n^2 k) for 2D problems, where k is a tolerance-dependent constant. Table 4 compares the performance of the traditional multifrontal (MF) solver with that of the superfast multifrontal (fast MF) solver. As can be seen, with increasing problem size the superfast solver can be more than three times faster than the traditional one. Our fast solver framework can be extended to construct effective and robust preconditioners for more general, non-elliptic PDEs.

Table 4. Superfast solver runtime (seconds) on an SGI Altix.

Mesh      255 x 255   511 x 511   1023 x 1023   2047 x 2047   4095 x 4095
MF        0.          .           8.7           55.8          8.6
fast MF   0.          .           6.            7.0           .

4. Acknowledgments
This work was partly supported by the US Department of Energy under contract No. DE-AC03-76SF00098.

5. References
[1] Center for Extended MHD Modeling (CEMM). URL: http://w3.pppl.gov/cemm/.
[2] The Community Petascale Project for Accelerator Science and Simulation (COMPASS). URL: http://www.scidac.gov/.
[3] S. Chandrasekaran, M. Gu, X.S. Li, and J. Xia. Superfast multifrontal method for structured linear systems of equations. Technical Report LBNL-6898, Lawrence Berkeley National Laboratory, June 2007.
[4] Laura Grigori, James W. Demmel, and Xiaoye S. Li. Parallel symbolic factorization for sparse LU with static pivoting. SIAM J. Scientific Computing, 2007. To appear.
[5] G. Karypis, K. Schloegel, and V. Kumar. ParMeTiS: Parallel Graph Partitioning and Sparse Matrix Ordering Library, Version 3.1. University of Minnesota, August 2003. http://www-users.cs.umn.edu/~karypis/metis/parmetis/.
[6] Xiaoye S. Li. An overview of SuperLU: Algorithms, implementation, and user interface. ACM Trans. Mathematical Software, 31(3):302-325, September 2005. http://crd.lbl.gov/~xiaoye/lbnl-588.pdf.
[7] Towards Optimal Petascale Simulations (TOPS). URL: http://www-unix.mcs.anl.gov/scidac-tops/.