An Efficient Solver for Sparse Linear Systems based on Rank-Structured Cholesky Factorization

Transcription:

An Efficient Solver for Sparse Linear Systems based on Rank-Structured Cholesky Factorization
David Bindel, Department of Computer Science, Cornell University
15 March 2016 (TSIMF)

u = K \ f
Great for circuit simulations, 1D or 2D finite elements, etc.
Standard advice to students: just try backslash for these problems.
Standard response: what about the 3D case?
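
As a rough illustration of "just try backslash" (a minimal Python/SciPy sketch, not from the talk; the 2D Laplacian below is a hypothetical stand-in for a finite-element stiffness matrix K):

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    # Hypothetical stand-in for a 2D finite-element stiffness matrix K.
    n = 64
    T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
    K = (sp.kron(sp.identity(n), T) + sp.kron(T, sp.identity(n))).tocsc()

    f = np.ones(K.shape[0])
    u = spla.spsolve(K, f)              # the Python analogue of u = K \ f
    print(np.linalg.norm(K @ u - f))    # residual check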

Try PCG with a good preconditioner. Maybe start with the ones in PETSc. You've taken Matrix Computations, right? Blah blah yadda blah... (Not an actual student)

My own frustration (c. 2008)

Factorization with nested dissection
[Figure: sparsity patterns of A and its factor L under a nested dissection ordering.]
Cholesky: A = LL^T, with L lower triangular.
L may have more nonzeros than A (fill).
Ordering controls fill, but 3D problems still need a lot of memory.
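
The fill effect is easy to observe even without nested dissection. A minimal SciPy sketch (my own; SciPy ships neither ND nor a sparse Cholesky, so sparse LU with COLAMD serves as a stand-in fill-reducing ordering) comparing factor nonzeros under two orderings:

    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    # 2D Laplacian on an n x n grid (SPD model problem).
    n = 50
    T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
    A = (sp.kron(sp.identity(n), T) + sp.kron(T, sp.identity(n))).tocsc()

    for perm in ("NATURAL", "COLAMD"):          # the ordering controls fill
        lu = spla.splu(A, permc_spec=perm)
        fill = lu.L.nnz + lu.U.nnz
        print(f"{perm:8s}: nnz(A) = {A.nnz}, nnz(L)+nnz(U) = {fill}")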

Direct or iterative?
Conventional wisdom: Gaussian elimination scales poorly. Iterate instead!
Pro: less memory, potentially better complexity.
Con: less robust, potentially worse memory access patterns.
Commercial finite element codes still use (out-of-core) Cholesky: longer compute times, but fewer tech support hours.

Desiderata
I want a code for sparse Cholesky (A = LL^T) that:
- handles modest problems on a desktop (or laptop?), inside a loop, without trying my patience
  => does not need gobs of memory
  => makes effective use of level 3 BLAS
- requires little parameter fiddling / hand-holding
- works with general elliptic problems (esp. elasticity)

From ND to superfast ND
Nested dissection (ND) gets performance using just the graph structure:
- 2D: O(N^{3/2}) time, O(N log N) space
- 3D: O(N^2) time, O(N^{4/3}) space
Superfast ND reduces the space/time complexity further by exploiting low-rank structure.

Strategy
Start with CHOLMOD (a good supernodal left-looking Cholesky):
- supernodal data structures are compact
- the algorithm + data layout put most work in level 3 BLAS
- it is widely used already (so re-use the API!)
Incorporate compact representations for low-rank blocks:
- outer products for off-diagonal blocks
- HSS-style representations for diagonal blocks
Optimize, test, swear, fix, repeat.

Supernodal storage structure
[Figure: the columns C_j of supernode j are stored as one dense block L_j, with the diagonal block L^D_j = L(C_j, C_j) on top and the collapsed off-diagonal block L^O_j = L(R_j, C_j) below.]
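
A minimal sketch of the collapsed supernodal layout described above (names and fields are my own, not CHOLMOD's actual data structures):

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class Supernode:
        """Hypothetical collapsed supernodal storage (illustrative names only)."""
        cols: np.ndarray    # C_j: contiguous column indices of this supernode
        rows: np.ndarray    # R_j: row indices of the off-diagonal structure
        block: np.ndarray   # dense (|C_j| + |R_j|) x |C_j| block stacking [L_D; L_O]

        @property
        def L_D(self) -> np.ndarray:    # diagonal block L(C_j, C_j)
            return self.block[: len(self.cols), :]

        @property
        def L_O(self) -> np.ndarray:    # collapsed off-diagonal block L(R_j, C_j)
            return self.block[len(self.cols):, :]

A full factor would then be a list of such supernodes plus a mapping from columns to supernodes.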

Supernode factorization
  U^D_j <- A(C_j, C_j); U^O_j <- A(R_j, C_j)      (initialize storage)
  for each k in D_j:                              (pull Schur contributions)
      build dense updates from L^O_k
      scatter updates into U^D_j and U^O_j
  L^D_j <- cholesky(U^D_j)                        (finish forming L^D_j)
  L^O_j <- U^O_j (L^D_j)^{-T}
What changes in the rank-structured Cholesky?
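
For reference, the last two lines of the standard loop above amount to a dense Cholesky plus a triangular solve. A small NumPy/SciPy sketch of just that step, assuming the update blocks have already been assembled (illustrative only, not the CHOLMOD code path):

    import numpy as np
    from scipy.linalg import cholesky, solve_triangular

    def finish_supernode(U_D: np.ndarray, U_O: np.ndarray):
        """Given assembled update blocks, form L_D = chol(U_D) and L_O = U_O L_D^{-T}."""
        L_D = cholesky(U_D, lower=True)                  # U_D = L_D L_D^T
        # Solve L_D X = U_O^T so that X^T = U_O L_D^{-T}, without forming an inverse.
        L_O = solve_triangular(L_D, U_O.T, lower=True).T
        return L_D, L_O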

Off-diagonal block compression
[Figure: the collapsed off-diagonal block L^O_j is replaced by a compressed outer-product form V_j U_j^T.]
The collapsed off-diagonal block is a (nearly low-rank) dense matrix.

Off-diagonal block compression
  G <- rand(|C_j|, r + p)
  C <- (L^O_j)^T G
  for i = 1, ..., s:
      C <- L^O_j C
      C <- (L^O_j)^T C
  U_j <- orth(C)
  V_j <- L^O_j U_j
Compress without explicit L^O_j:
- probe (L^O_j)^T with a random G
- extract an orthonormal row basis U_j
- L^O_j ~= V_j U_j^T with V_j = L^O_j U_j
Where do we get the estimated rank bound r?
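
A self-contained sketch of this randomized compression, written against black-box products with L^O_j and its transpose so the block never has to be formed explicitly (my own illustration; function names and the oversampling/power-iteration defaults are assumptions, not the solver's actual interface):

    import numpy as np

    def compress_offdiag(apply_LO, apply_LOT, m, r, p=10, s=2, rng=None):
        """Randomized outer-product compression L_O ~= V @ U.T for an m-row block,
        given only products apply_LO(X) = L_O @ X and apply_LOT(Y) = L_O.T @ Y.
        r is a target rank (e.g. an empirical bound ~ alpha * k * log k),
        p is oversampling, s is the number of power iterations."""
        rng = np.random.default_rng() if rng is None else rng
        G = rng.standard_normal((m, r + p))
        C = apply_LOT(G)                   # samples the row space of L_O
        for _ in range(s):                 # power iterations sharpen the basis
            C = apply_LOT(apply_LO(C))
        U, _ = np.linalg.qr(C)             # orthonormal row basis U_j
        V = apply_LO(U)                    # V_j = L_O @ U_j, so L_O ~= V @ U.T
        return V, U

    # Example with an explicit rank-40 matrix standing in for L_O:
    rng = np.random.default_rng(0)
    LO = rng.standard_normal((300, 40)) @ rng.standard_normal((40, 120))
    V, U = compress_offdiag(lambda X: LO @ X, lambda Y: LO.T @ Y, m=300, r=40)
    print(np.linalg.norm(LO - V @ U.T))    # small, since U spans the row space

In the solver, the two apply routines would be built from the sparse block A^O_j, triangular solves with L^D_j, and the stored descendant updates, which is what lets the compression avoid an explicit L^O_j.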

Interaction rank
We could dynamically estimate the rank of L^O_j.
In practice: use an empirical rank bound r ~ α k log(k).

Optimization: Selective off-diagonal compression
[Figure: supernodes j_1, j_2, j_3 before and after compression.]
Compress off-diagonal blocks only for sufficiently large supernodes (j_1 and j_2 in the figure).

Optimization: Interior blocks
[Figure: interior blocks B_1 through B_4 within supernodes j_1, j_2, j_3.]
Don't store any of L^O_j for interior blocks.
(Represent it as L^O_j = A^O_j (L^D_j)^{-T} when needed.)
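
A small sketch of what the implicit representation buys: applying the unstored L^O_j to a vector with one triangular solve and one multiply (illustrative only, assuming the relation L^O_j = A^O_j (L^D_j)^{-T} above):

    import numpy as np
    from scipy.linalg import solve_triangular

    def apply_implicit_LO(A_O, L_D, x):
        """Compute L_O @ x with L_O = A_O @ inv(L_D).T, without storing L_O."""
        y = solve_triangular(L_D, x, lower=True, trans='T')   # y = L_D^{-T} x
        return A_O @ y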

Diagonal block compression
L^D_j is partitioned recursively into a block lower-triangular form with dense diagonal blocks L^D_{j,1}, L^D_{j,3}, L^D_{j,5}, L^D_{j,7} and off-diagonal blocks L^D_{j,2}, L^D_{j,4}, L^D_{j,6}.
Basic observation: the off-diagonal blocks are low-rank (H-matrix, semiseparable structure, quasiseparable structure, ...).
This assumes a reasonable ordering of the unknowns!
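
One way to see the low-rank observation numerically on a toy problem (my own sketch; the kernel matrix below is just a smooth SPD stand-in, not one of the talk's finite-element matrices):

    import numpy as np

    # Toy SPD matrix whose Cholesky factor has numerically low-rank
    # off-diagonal blocks: a smooth kernel plus a diagonal shift.
    n = 400
    x = np.linspace(0.0, 1.0, n)
    A = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.1) + 1e-2 * np.eye(n)
    L = np.linalg.cholesky(A)

    B = L[n // 2:, : n // 2]                  # off-diagonal block of the factor
    s = np.linalg.svd(B, compute_uv=False)
    numrank = int(np.sum(s > 1e-8 * s[0]))
    print(f"off-diagonal block is {B.shape[0]}x{B.shape[1]}, numerical rank ~ {numrank}")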

Diagonal block compression
In the compressed factor, the diagonal blocks L^D_{j,1}, L^D_{j,3}, L^D_{j,5}, L^D_{j,7} stay dense, while the off-diagonal blocks are stored as low-rank outer products V^D_{j,2}(U^D_{j,2})^T, V^D_{j,4}(U^D_{j,4})^T, V^D_{j,6}(U^D_{j,6})^T.
How do we get directly to this without forming U^D_j explicitly?

Forming compressed updates
[Figure: the block structure L^D_{j,1}, ..., L^D_{j,7} of the compressed diagonal block and the descendants D_j whose updates feed into it.]

Rank-structured supernode factorization
Basic ingredients:
- randomized algorithms to form U^D_j
- a rank-structured factorization of U^D_j
- a randomized algorithm to form L^O_j (involves solves with L^D_j)
Plus various optimizations.

Example: Large deformation of an elastic block

Example: Large deformation of an elastic block
Benchmark based on an example from deal.II:
- nearly-incompressible hyperelastic block under compression
- mixed FE formulation (pressure and dilation condensed out)
- both p = 1 and p = 2 finite elements
- two load steps, with Newton iteration on each (14-15 steps)
Experimental setup:
- 8-core Xeon X5570 with 48 GB RAM
- LAPACK/BLAS from MKL 11.0
- PCG + preconditioners from Trilinos

RSC vs standard preconditioners (p = 1, N = 50)
[Figure: relative residual vs. iterations and vs. seconds for the RSC, ICC, ML, and Jacobi preconditioners.]

RSC vs standard preconditioners (p = 2, N = 35)
[Figure: relative residual vs. iterations and vs. seconds for the RSC, ICC, ML, and Jacobi preconditioners.]

Time and memory comparisons (p = 1)
[Figure: solve time (s) and memory (GB) vs. problem size n for ICC, ML, Jacobi, RSC, and Cholesky.]

Effect of in-separator ordering
[Figure: relative residual vs. iteration for geometric, eigenvector-based, and random in-separator orderings.]
The semiseparable diagonal representation relies on the variable order; we don't want just any old order! Apply recursive bisection based on spatial coordinates: use the coordinates if they are known, otherwise assign them spectrally. A sketch of the coordinate-based bisection follows.
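
A minimal sketch of recursive bisection by spatial coordinates for ordering in-separator unknowns (my own illustration; the leaf size and the choice of the widest coordinate axis are arbitrary, and the spectral fallback is not shown):

    import numpy as np

    def coord_bisection_order(coords: np.ndarray) -> np.ndarray:
        """Recursively order points by bisecting along the coordinate axis of
        largest spread, so that nearby unknowns end up contiguous.
        Returns a permutation of range(len(coords))."""
        idx = np.arange(coords.shape[0])

        def recurse(ids):
            if len(ids) <= 8:                  # small leaf: keep as-is
                return list(ids)
            pts = coords[ids]
            axis = int(np.argmax(pts.max(axis=0) - pts.min(axis=0)))
            order = ids[np.argsort(pts[:, axis], kind="stable")]
            half = len(order) // 2
            return recurse(order[:half]) + recurse(order[half:])

        return np.array(recurse(idx))

The resulting permutation can then be applied to the in-separator unknowns before forming the semiseparable representation of the diagonal blocks.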

Example: Trabecular bone model (~1M dof)
[Figure: relative residual vs. iterations and vs. seconds for RSC1, RSC2, ML, and ICC.]

Example: Steel flange (~1.5M dof)
[Figure: relative residual vs. iterations and vs. seconds for RSC1, RSC2, ML, and ICC.]

Competition
The most direct competitor is STRUMPACK (Li, Ghysels, Rouet): http://portal.nersc.gov/project/sparse/strumpack/
- multifrontal vs. left-looking (extend-add vs. pulled updates)
- LU vs. Cholesky
- STRUMPACK is explicitly parallel
See also Sherry Li's plenary from SIAM LA 2015.
Also work by Xia and Li; HODLR (Aminfar, Ambikasaran, Darve); BLR (Amestoy, Ashcraft, Boiteau, Buttari, L'Excellent, Weisbecker); H-matrix work (Hackbusch, Bebendorf, et al.); ...

Conclusions
For more: www.cs.cornell.edu/~bindel, bindel@cs.cornell.edu
J. Chadwick and D. Bindel. An Efficient Solver for Sparse Linear Systems Based on Rank-Structured Cholesky Factorization. http://arxiv.org/abs/1507.05593
https://github.com/jeffchadwick/rank_structured_cholesky