Improvements for Implicit Linear Equation Solvers

Roger Grimes, Bob Lucas, Clement Weisbecker
Livermore Software Technology Corporation

Abstract

Solving large sparse linear systems of equations is often the computational bottleneck for implicit calculations. LSTC is both continuously improving its existing solvers and continuously looking for new technology. This talk addresses both, describing improvements to the default distributed-memory linear solver used by LSTC as well as promising new research. We discuss improvements to the symbolic preprocessing that reduce the occurrences of sparse matrix reordering, a sequential bottleneck and a significant Amdahl fraction of the wall-clock time for executions with large numbers of processors. Our new approach works for large implicit models with large numbers of processes when the contact surfaces do not change, or change only slightly. We have also improved the performance of the numerical factorization, most significantly with the introduction of a tiled, or two-dimensional, distribution of frontal matrices to processors. This improves the computational efficiency of the factorization and also reduces the communication overhead. Finally, we present promising early results of research into using low-rank approximations to substantially decrease both the computational burden and the memory footprint required for sparse matrix factorization.

Reusing the Matrix Reordering and Symbolic Factorization

Solving a sparse linear system by factoring usually proceeds in four phases. In the first phase, the matrix is reordered to reduce the storage and operations that will be required to factor it. The second phase is a symbolic factorization, in which the elimination tree is created and the numerical data structures are initialized. The third phase is the sparse factorization itself. Finally, solutions are obtained by forward elimination and backward substitution.

Reordering to reduce fill-in and operations is a critical step in sparse matrix factorization. Finding the optimal ordering, one which strictly minimizes storage or operations, is provably NP-hard, so heuristics are used. The default used for large problems in LS-DYNA is Metis, a weighted nested-dissection code. Historically, reordering and symbolic factorization took relatively little time compared to factorization. However, parallel scaling of the factorization phase has been studied for over three decades and substantial progress has been made, whereas reordering remains a largely sequential process.
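The four phases above can be sketched end to end with a general-purpose sparse library. The following Python fragment is illustrative only: SciPy's reverse Cuthill-McKee ordering and SuperLU stand in for the Metis ordering and the distributed multifrontal factorization used in LS-DYNA, and the function name solve_sparse is ours.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee
from scipy.sparse.linalg import splu

def solve_sparse(A, b):
    # Phase 1: reorder to reduce fill-in.  RCM stands in here for the
    # weighted nested-dissection ordering (Metis) used by LS-DYNA.
    perm = reverse_cuthill_mckee(sp.csr_matrix(A), symmetric_mode=True)
    Ap = A[perm, :][:, perm].tocsc()
    # Phases 2 and 3: symbolic analysis and numerical factorization.
    # SuperLU bundles the two; a multifrontal solver performs them separately.
    lu = splu(Ap, permc_spec="NATURAL")      # keep the ordering chosen above
    # Phase 4: forward elimination and backward substitution.
    xp = lu.solve(b[perm])
    x = np.empty_like(xp)
    x[perm] = xp                             # undo the permutation
    return x

# usage: a 1D Laplacian as a small symmetric positive definite test matrix
A = sp.csr_matrix(sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(1000, 1000)))
b = np.ones(1000)
x = solve_sparse(A, b)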

As LS-DYNA is run on increasingly large numbers of processors, the time for reordering has emerged as the dominant cost of many problems. A relatively small example is the one-million-element NCAC Silverado model. Running on a 16-core workstation, reordering and symbolic factorization take 23 seconds, whereas sparse matrix factorization takes only 15. An LS-DYNA implicit user reported that on a larger, proprietary model, reordering and symbolic preprocessing took 1917 seconds, while factorization, running on 768 cores, took only 142. Over a simulation with hundreds of implicit time steps this represents a huge potential reduction in execution time.

LSTC is working to address the reordering problem with a new scalable algorithm. But even when that is available, we will want to minimize the number of times that we reorder the matrix. Toward that end, we have modified the code that permutes and redistributes the input matrix to allow it to use a previous ordering. If off-diagonal coefficients are dropped from the matrix, they are simply filled in as zeroes. If new off-diagonal coefficients are added to the matrix, then we check whether they are consistent with the previous symbolic factorization. If they are, the previous ordering is reused, even though it is likely suboptimal, and reordering is avoided; a sketch of this consistency check appears at the end of this section.

The key to reusing the reordering is predicting any additional nonzeroes added to the stiffness matrix due to slight changes in contact. Consider the model shown in Figure 1. As the blade moves outward due to centrifugal force, it slides slightly. If we can predict this slight change and add the appropriate terms to the stiffness matrix, the reordering can be reused. This new feature is activated with the new keyword *CONTROL_IMPLICIT_ORDERING.

Figure 1

Savings from reusing the sparse matrix reordering can be significant, but only for large models run with many MPI processes and many implicit time steps over which the reordering can be reused.
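As a concrete illustration of the consistency check described above, the following Python sketch compares the nonzero pattern of a new stiffness matrix against the pattern covered by the previous symbolic factorization. It is a simplification under our own assumptions: the production code works on its distributed symbolic data structures, and the function name can_reuse_ordering is hypothetical.

import numpy as np
import scipy.sparse as sp

def can_reuse_ordering(pattern_prev, A_new):
    """pattern_prev: sparse matrix whose nonzero pattern is the set of entries
    covered by the previous symbolic factorization.  A_new: new stiffness matrix."""
    new_pat = (A_new != 0).astype(np.int8)
    prev_pat = (pattern_prev != 0).astype(np.int8)
    # Dropped coefficients are harmless (they become explicit zeroes); only
    # entries added *outside* the previous pattern force a fresh reordering.
    added = new_pat - new_pat.multiply(prev_pat)
    return added.count_nonzero() == 0

# usage (tiny 3x3 example)
prev = sp.csr_matrix([[2.0, 1.0, 0.0],
                      [1.0, 2.0, 0.0],
                      [0.0, 0.0, 2.0]])
A_dropped = sp.csr_matrix([[2.0, 0.0, 0.0],
                           [0.0, 2.0, 0.0],
                           [0.0, 0.0, 2.0]])   # entries dropped: ordering can be reused
A_added = sp.csr_matrix([[2.0, 1.0, 1.0],
                         [1.0, 2.0, 0.0],
                         [1.0, 0.0, 2.0]])     # new entries: must reorder
print(can_reuse_ordering(prev, A_dropped))      # True
print(can_reuse_ordering(prev, A_added))        # False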

Performance Improvements to the Numerical Factorization

Research into parallel multifrontal codes goes back to the early 1980s, and distributed-memory multifrontal codes were first implemented in 1986. LSTC has been using distributed-memory versions of the multifrontal method since the introduction of scalable implicit processing at the turn of the century. We use a subtree-subcube technique to distribute the branches of the elimination tree to different processors. Near the root of the tree, where there are more processors than supernodes, multiple processors are assigned to the factorization of each of these MPP frontal matrices.

When the code for the MPP frontal matrices was first written, large-scale implicit users had O(10) processors. Designing to scale well to 32 seemed to provide plenty of headroom for future growth. It was known in the mid-1980s that distributing the matrix by columns could scale to 100 processors. Furthermore, this one-dimensional distribution localizes the evaluation of the ratio of each diagonal value to its off-diagonals, which is needed to determine whether pivoting is required. Therefore, LSTC has been using 1D MPP frontal matrices for nearly two decades. Even for jobs with over 100 processors, where performance could tail off for the supernodes near the root of the elimination tree, their children or grandchildren would benefit from the additional processors, and the overall factorization would speed up.

Today, given the exponential growth in the number of cores that we are experiencing, LS-DYNA jobs are now running on thousands of cores. To enable sparse matrix factorization to continue to scale, LSTC has had to add a new generation of tiled, or two-dimensional, frontal matrices. The processors are organized into an N by M Cartesian grid, where N*M is less than or equal to the number of processors. The frontal matrix is then distributed across the 2D grid of processes. When the factored pivot columns are distributed, there are N simultaneous MPI broadcasts, each amongst M processors. Each of these broadcasts transmits 1/N of the amount of data a 1D broadcast would convey. A downside is that pivot rows also have to be broadcast, but surprisingly, the 2D factorization kernel can be faster than the 1D kernel on as few as 4 processors using a 2 by 2 grid.

Figure 2 depicts the relative performance of 1D versus 2D kernels factoring a simulated symmetric frontal matrix. The plateaus in the 2D curve reflect processor counts that do not map to a two-dimensional grid with an acceptable aspect ratio. Examples are 13 and 14, which are turned into a 3 by 4 grid, with the extra processors idled. Forming the 2D frontal matrices is more complicated than forming their 1D counterparts, so LS-DYNA does not switch to 2D unless there are at least eight MPI ranks assigned to the supernode. As more and more cores are brought to bear, the advantage of 2D frontal matrices increases. Of course, the vast majority of supernodes in the elimination tree are assigned to individual processors, and their throughput is unchanged by the 2D kernels. Nevertheless, the frontal matrix of the root supernode is disproportionately large, so the overall impact is still significant.

Consider a 3D implicit model supplied by an LS-DYNA user, which requires 10.57 quadrillion (10^15) floating-point operations to factor. When run on all 128 cores of an 8-node, dual-socket, 8-core cluster, constraining LS-DYNA to 1D frontal matrices requires 7124 seconds for the first factorization.
Enabling 2D frontal matrices brings that down to 4931 seconds. The root supernode had 137,259 equations, and 1D throughput was 754 Gflop/s whereas 2D throughput was 1862 Gflop/s.
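The mapping from an MPI rank count to an N by M grid with an acceptable aspect ratio, with a fallback to the 1D kernel for small rank counts, can be sketched as below. The aspect-ratio limit of 2 and the helper name choose_grid are our own illustrative choices; the text only states that 13 and 14 ranks map to a 3 by 4 grid and that the 2D kernel is not used below eight ranks.

import math

def choose_grid(nranks, max_aspect=2.0, min_ranks_for_2d=8):
    """Pick an N x M process grid with N*M <= nranks and M/N <= max_aspect,
    idling the leftover ranks; otherwise fall back to the 1D kernel."""
    if nranks < min_ranks_for_2d:
        return (1, nranks)                   # small counts stay with the 1D kernel
    best = None
    for n in range(1, int(math.isqrt(nranks)) + 1):
        m = nranks // n                      # nranks - n*m ranks are idled
        if m / n <= max_aspect and (best is None or n * m > best[0] * best[1]):
            best = (n, m)
    return best if best is not None else (1, nranks)

# usage: 13 and 14 ranks both map to a 3 x 4 grid, as noted in the text,
# which explains the plateaus in the 2D curve of Figure 2
for p in (8, 13, 14, 16, 128):
    print(p, choose_grid(p))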

Figure 2: Relative performance of 1D versus 2D kernels factoring a simulated symmetric frontal matrix.

Block Low Rank Approximation

There is no limit to the imagination of LS-DYNA users, or to the size and complexity of the analyses they need to perform. Unfortunately, sparse matrix factorization scales superlinearly, both in terms of memory and operations. This scaling threatens to be a constraint on the ability of LS-DYNA users to achieve their objectives. To address that, LSTC is examining the use of block low-rank (BLR) approximations to the factors of the matrix.

Over the past twenty years, many theoretical studies have been carried out on data-sparse algorithms. They allow the numerical data involved in linear systems to be compressed algebraically by taking advantage of the properties of the underlying elliptic operator, at the price of a controllable loss of accuracy. This process transforms a dense block into a product of much smaller blocks: the low-rank representation. These blocks, in turn, allow for faster matrix-matrix products when they are combined. Many low-rank representations have been proposed in the literature. Some of them are recursive (the so-called hierarchical representations) and are obtained through complex compression algorithms. Even though theory has shown that these are the most efficient kind of representation, they are often not flexible enough to be used in pre-existing, modern, parallel, fully featured direct solvers. For this reason, simpler representations have been designed to compromise between efficiency, ease of use, and numerical stability. The Block Low-Rank (BLR) format has been shown to fulfill these requirements, and it is implemented in a research version of the multifrontal solver MUMPS. A BLR representation of a dense block is obtained through a rank-revealing factorization and consists of the product of a tall-and-skinny matrix with a short-and-wide matrix. The classic multifrontal algorithm has to be adapted to handle BLR blocks in a way that pays off in as many stages of the factorization as possible, both in terms of memory footprint and operation count.

MUMPS BLR is coupled to LS-DYNA as a research tool to evaluate the potential of low-rank technologies on our implicit problems. Preliminary studies have shown promising results, both in terms of performance and accuracy. On a 100x100x100 rubber simulation problem (yielding a linear system of 3 million unknowns), the size of the factors is reduced by a factor of 2, and the number of operations to perform the factorization is reduced by a factor of 5. Because operations on low-rank blocks are of smaller granularity, they are also slower. Consequently, the operation reduction translates into a speed-up of only 2 for the overall factorization CPU time, including the overhead due to low-rank compressions. Another decisive property of this new approach is that the larger the problem, the better the relative efficiency of the low-rank approach. For instance, on the same problem but with a 50x50x50 model, the operation count was reduced by a factor of 4, compared to 5 for the larger equivalent model.
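The compression step that produces the tall-and-skinny and short-and-wide factors can be illustrated in a few lines of Python. This is only a sketch of the idea: it uses a truncated SVD, whereas the BLR compression in MUMPS is based on rank-revealing QR-type factorizations, and the test block, tolerance, and function name compress_block are our own examples.

import numpy as np

def compress_block(B, tol=1e-8):
    """Return (X, Y) with B approximately equal to X @ Y, where X is tall and
    skinny and Y is short and wide."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    k = int(np.sum(s > tol * s[0]))          # numerical rank at tolerance tol
    X = U[:, :k] * s[:k]                     # tall-and-skinny factor
    Y = Vt[:k, :]                            # short-and-wide factor
    return X, Y

# usage: a smooth kernel evaluated between two well-separated point clusters
# gives a dense block whose numerical rank is small
x1 = np.linspace(0.0, 1.0, 256)
x2 = np.linspace(10.0, 11.0, 256)
B = 1.0 / np.abs(x1[:, None] - x2[None, :])
X, Y = compress_block(B)
print(X.shape[1])                            # numerical rank, far below 256
print((X.size + Y.size) / B.size)            # storage relative to the dense block
print(np.linalg.norm(B - X @ Y) / np.linalg.norm(B))   # controllable loss of accuracy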

Summary

LSTC is continuously improving LS-DYNA, adding new capabilities. It is also addressing the impact on its existing software of changes in the computing platforms available to our users. This talk highlighted three examples of this process, applied to multifrontal linear solvers. We have reduced the need to reorder the sparse matrix when small changes are made to the model. We have developed a new frontal matrix factorization kernel that exploits MPI Cartesian grids to reduce communication overhead and improve load balance. And finally, we are exploring the utility of block low-rank approximations, which offer the promise of substantial reductions in the storage and operations required to solve a sparse linear system of equations.