A sparse multifrontal solver using hierarchically semi-separable frontal matrices

A sparse multifrontal solver using hierarchically semi-separable frontal matrices. Pieter Ghysels, Lawrence Berkeley National Laboratory. Joint work with: Xiaoye S. Li (LBNL), Artem Napov (ULB), François-Henry Rouet (LBNL). 2014 CBMS-NSF Conference: Fast direct solvers for elliptic PDEs, June 23-29, 2014, Dartmouth College.

Introduction Consider solving Ax = b, or M^{-1}Ax = M^{-1}b, with a preconditioned iterative method (CG, GMRES, etc.). Convergence is fast if a good preconditioner M ≈ A is available that is: 1. cheap to construct, store, apply and parallelize; 2. a good approximation of A. These are contradictory goals, so there is a trade-off. Standard design strategy: start with a direct factorization (like multifrontal LU) and add approximations to make it cheaper (cf. 1) while (hopefully/provably) affecting (2) as little as possible. Approximation idea: use low-rank approximation.

The multifrontal method [Duff & Reid '83]. Nested dissection reordering defines an elimination tree (e.g., with the SCOTCH graph partitioner). The e-tree is traversed bottom-up; at each frontal matrix, a partial factorization is performed and a contribution block (Schur complement) is computed. Parent nodes sum the contribution blocks of their children: the extend-add operation. [Figure: example domain partition with separators 1-5 and the corresponding elimination tree.]
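To make the partial factorization step concrete, here is a minimal sketch (in C++ with Eigen; the block names F11, F12, F21, F22 and the routine itself are illustrative assumptions, not the solver's actual code) of eliminating the k separator variables of one front and forming its contribution block:

    #include <Eigen/Dense>

    // Partial factorization of one front F = [F11 F12; F21 F22]:
    // factor the k x k separator block F11 with LU and return the
    // contribution block CB = F22 - F21 * F11^{-1} * F12, which the
    // parent front later receives through extend-add.
    Eigen::MatrixXd contribution_block(const Eigen::MatrixXd& F, int k) {
      const int n = static_cast<int>(F.rows());
      Eigen::PartialPivLU<Eigen::MatrixXd> lu(F.topLeftCorner(k, k));
      return F.bottomRightCorner(n - k, n - k)
             - F.bottomLeftCorner(n - k, k) * lu.solve(F.topRightCorner(k, n - k));
    }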

Low-rank property In many applications, frontal matrices exhibit low-rank blocks ("data sparsity"). Compression can be done with SVD, rank-revealing QR (RRQR), ...: SVD is optimal but too expensive; RRQR is QR with column pivoting. Compression can be exact or use a compression threshold ε. Recursively, the diagonal blocks have low-rank sub-blocks too. What to do with the off-diagonal blocks?
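As an illustration of threshold-based compression, a block B could be compressed with Eigen's column-pivoted (rank-revealing) Householder QR; the function name and the relative threshold eps are assumptions for this sketch:

    #include <Eigen/Dense>
    #include <utility>

    // Truncated RRQR compression of a block B: returns (Q, V) with B ~= Q * V,
    // where the numerical rank r is determined by the relative threshold eps.
    std::pair<Eigen::MatrixXd, Eigen::MatrixXd>
    rrqr_compress(const Eigen::MatrixXd& B, double eps) {
      Eigen::ColPivHouseholderQR<Eigen::MatrixXd> qr(B);
      qr.setThreshold(eps);
      const int r = static_cast<int>(qr.rank());
      Eigen::MatrixXd Q = qr.householderQ() * Eigen::MatrixXd::Identity(B.rows(), r);
      Eigen::MatrixXd R = qr.matrixR().topRows(r);
      R.triangularView<Eigen::StrictlyLower>().setZero();        // keep only the upper-triangular R
      Eigen::MatrixXd V = R * qr.colsPermutation().transpose();  // undo the column pivoting
      return {Q, V};
    }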

Low-rank representations Most low-rank representations belong to the class of H-matrices [Bebendorf, Börm, Hackbusch, Grasedyck, ...] and have been embedded in both dense and sparse solvers: Hierarchically Off-Diagonal Low Rank (HODLR) [Ambikasaran, Cecka, Darve, ...]; Hierarchically Semiseparable (HSS) [Chandrasekaran, Dewilde, Gu, Li, Xia, ...], with nested bases, e.g. U_3 = [U_1 0; 0 U_2] times a small transfer matrix; Block Low-Rank (BLR) [Amestoy et al.], a simple 2D partitioning, recently in MUMPS. Also: H^2 (includes HSS and FMM), SSS, ... Choice: simple representations apply to broad classes of problems but provide smaller gains in memory/operations than specialized, more complex ones.

HSS/HBS representation The structure is represented by a tree. [Figure: HSS tree with leaves 1, 2, 4, 5 holding dense diagonal blocks D_1, D_2, D_4, D_5 and generators U_τ, V_τ, and with coupling matrices B_{1,2}, B_{2,1}, B_{4,5}, B_{5,4}, B_{3,6}, B_{6,3} at the upper levels; the bases are nested, e.g. U_3 = [U_1 0; 0 U_2] times a small transfer matrix.] The number of leaves depends on the problem (geometry) and on the number of processors to be used. Building the HSS structure and all the usual operations (multiplying, ...) consist of traversals of the tree.
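A minimal sketch of how such a tree could be stored (the field names are illustrative, not STRUMPACK's actual data structures): leaves keep a dense diagonal block, every node keeps its generators, and internal nodes keep the small transfer matrices that express the nested bases.

    #include <Eigen/Dense>
    #include <memory>

    // One node of an HSS tree (illustrative layout).
    struct HSSNode {
      Eigen::MatrixXd D;          // leaf only: dense diagonal block D_tau
      Eigen::MatrixXd U, V;       // leaf: row/column bases; internal node: small
                                  // transfer matrices, so that e.g.
                                  // U_parent = [U_left 0; 0 U_right] * U_transfer
      Eigen::MatrixXd B12, B21;   // coupling blocks between the two children:
                                  // A(I_left, I_right) ~= U_left * B12 * V_right^*
      std::unique_ptr<HSSNode> left, right;
      bool is_leaf() const { return !left && !right; }
    };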

Embedding low-rank techniques in a multifrontal solver 1. Choose which frontal matrices are compressed (based on size, level, ...). 2. Low-rankness comes from weak interactions between distant variables, so each frontal matrix needs a suitable ordering/clustering. Geometric setting (3D grid): the separator is a 2D plane; we need clusters with small diameters, and for hierarchical formats the merged clusters need small diameters too. Split the domain into squares and order them with a Morton ordering (see the sketch below). [Figure: 4x4 separator plane with Morton ordering 1-16 and the merged clusters at the next level.] Algebraic setting: add some kind of halo to the (complete) graph of the separator variables and call a graph partitioner (METIS) [Amestoy et al., Napov].
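For the geometric setting, a Morton (Z-order) index can be obtained by interleaving the bits of the cell coordinates; sorting the separator cells by this key gives the ordering sketched above. A small self-contained helper (illustrative, not the solver's actual code):

    #include <cstdint>

    // Morton (Z-order) code of a 2D cell: interleave the bits of x and y.
    uint64_t morton2d(uint32_t x, uint32_t y) {
      uint64_t code = 0;
      for (int b = 0; b < 32; ++b) {
        code |= static_cast<uint64_t>((x >> b) & 1u) << (2 * b);
        code |= static_cast<uint64_t>((y >> b) & 1u) << (2 * b + 1);
      }
      return code;
    }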

Embedding HSS kernels in a multifrontal solver More complicated: HSS for the frontal matrices. Fully structured: HSS on the whole frontal matrix, no dense matrix. Partial+: HSS on the whole frontal matrix, but a dense frontal matrix is still kept. Partially structured: HSS on the F11, F12 and F21 parts only; the dense frontal matrix and the dense CB = F22 - F21 F11^{-1} F12 are kept in the stack, so more memory is used. [Figure: 2x2 blocking of the frontal matrix into F11, F12, F21, F22.] The partially structured variant can do the regular extend-add. In the partially structured variant: HSS compression of the dense frontal matrix; after HSS compression, ULV factorization of the F11 block (compared to classical LU in the dense case); low-rank Schur complement update.

HSS compression via randomized sampling [Martinsson '11, Xia '13] HSS compression of a matrix A. Ingredients: R_r and R_c, random matrices with d columns, where d = r + p with r the estimated maximum rank and p = 10 in practice; S_r = A R_r and S_c = A^* R_c, samples of the matrix A (this can benefit from a fast matvec); Interpolative Decomposition (ID): A = A(:, J) X, i.e., A is written as a linear combination of a selected subset of its columns. Two-sided ID: S_c = S_c(:, J_c) X_c and S_r = S_r(:, J_r) X_r, giving A = X_c A(I_c, I_r) X_r.
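A minimal sketch of the sampling step (Eigen; the explicit matvecs, the rank guess r and the oversampling p are assumptions for illustration, since in the solver the products come from the multifrontal structure and a fast matvec can be used instead):

    #include <Eigen/Dense>
    #include <random>

    // Draw Gaussian random matrices with d = r + p columns and form the
    // samples S_r = A * R_r and S_c = A^T * R_c that drive the randomized
    // HSS compression.
    void sample(const Eigen::MatrixXd& A, int r, int p,
                Eigen::MatrixXd& Sr, Eigen::MatrixXd& Sc) {
      const int n = static_cast<int>(A.rows()), d = r + p;
      std::mt19937 gen(0);
      std::normal_distribution<double> N01(0.0, 1.0);
      Eigen::MatrixXd Rr(n, d), Rc(n, d);
      for (int j = 0; j < d; ++j)
        for (int i = 0; i < n; ++i) { Rr(i, j) = N01(gen); Rc(i, j) = N01(gen); }
      Sr = A * Rr;               // any fast matvec could be used here
      Sc = A.transpose() * Rc;   // use the adjoint for complex matrices
    }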

HSS compression via randomized sampling (2) Algorithm (symmetric case), from fine to coarse:
Leaf node τ: 1. Sample: S_loc = S(I_τ, :) - D_τ R(I_τ, :). 2. ID: S_loc = U_τ S_loc(J_τ, :). 3. Update: S_τ = S_loc(J_τ, :), R_τ = U_τ^* R(I_τ, :), I_τ = I_τ(J_τ).
Inner node τ with children ν_1, ν_2: 1. Sample: S_loc = [S_ν1 - A(I_ν1, I_ν2) R_ν2 ; S_ν2 - A(I_ν2, I_ν1) R_ν1]. 2. ID: S_loc = U_τ S_loc(J_τ, :). 3. Update: S_τ = S_loc(J_τ, :), R_τ = U_τ^* [R_ν1 ; R_ν2], I_τ = [I_ν1, I_ν2](J_τ).

HSS compression via randomized sampling (3) If A ≠ A^*, do the same for the columns (simultaneously). The bases have a special structure: U_τ = Π_τ [E_τ ; I]. Elements are extracted from the frontal matrix: D_τ = A(I_τ, I_τ) and B_{ν1,ν2} = A(I_ν1, I_ν2). The frontal matrix is a combination of the separator part and the HSS children; extracting an element from an HSS matrix requires traversing the HSS tree and multiplying basis matrices, so limiting the number of tree traversals is crucial for performance. Benefits: the extend-add operation is simplified (it acts only on the random vectors); gains in complexity: O(r^2 N log N) instead of O(r N^2) for the non-randomized algorithm (the log N factor comes from extracting elements from the HSS matrix).

Randomized sampling extend-add Assembly in the regular multifrontal method: F_p = A_p ⊕ CB_c1 ⊕ CB_c2 (⊕ denotes extend-add). Sampling: S_p = F_p R_p = (A_p ⊕ CB_c1 ⊕ CB_c2) R_p = A_p R_p ⊕ Y_c1 ⊕ Y_c2, a 1D extend-add (only along rows), which is much simpler. Here Y_c1 and Y_c2 are the samples of the children's contribution blocks, and R_p = R_c1 ∪ R_c2 (plus random rows for the missing indices). Stages: build the random vectors from the random vectors of the children; build the sample from the samples of the children's CBs; multiply the separator part of the frontal matrix with the random vectors: A_p R_p; compress F_p using S_p and R_p.
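A minimal sketch of the 1D (rows-only) extend-add on the sample matrices; the index map child_to_parent, giving the parent-front row of each child CB row, and the sign convention are assumptions for illustration:

    #include <Eigen/Dense>
    #include <vector>

    // 1D extend-add of a child's contribution-block sample Yc into the parent
    // sample Sp: only row indices need to be matched, since parent and child
    // share the same random vectors (columns).
    void extend_add_rows(Eigen::MatrixXd& Sp, const Eigen::MatrixXd& Yc,
                         const std::vector<int>& child_to_parent) {
      for (int i = 0; i < static_cast<int>(Yc.rows()); ++i)
        Sp.row(child_to_parent[i]) += Yc.row(i);
    }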

HSS ULV factorization Exploit the structure of U_τ (from the ID) to introduce zeros: with U_τ = Π_τ [E_τ ; I] and Ω_τ = [I, -E_τ ; 0, I] Π_τ^*, we get Ω_τ U_τ = [0 ; I]. Applying these transformations to a two-child HSS block gives diag(Ω_1, Ω_2) [D_1, U_1 B_{1,2} V_2^* ; U_2 B_{2,1} V_1^*, D_2] = [W_1, (Ω_1 U_1) B_{1,2} V_2^* ; (Ω_2 U_2) B_{2,1} V_1^*, W_2] with W_τ = Ω_τ D_τ, so the leading rows of each off-diagonal block become zero. Take a (full) LQ decomposition of the corresponding rows of W_τ, W_{τ;1} = [L_τ, 0] Q_τ, and apply Q_τ^* from the right; the remaining off-diagonal couplings become B_{2,1} V_1^* Q_1^*, etc. The rows holding L_τ can then be eliminated; the other rows are passed to the parent. At the root node: LU solve of the reduced D. ULV-like: Ω_τ is not orthonormal; there are forward/backward solve phases.

Low-rank Schur complement update Schur complement update: F22 - F21 F11^{-1} F12 = F22 - (U_q B_qk V_k^*) F11^{-1} (U_k B_kq V_q^*), where k denotes the F11 (separator) part and q the F22 (contribution block) part. Computing V_k^* F11^{-1} U_k via a ULV solve costs O(r N^2). Since D_k is the reduced HSS matrix of size O(r x r), we can instead use V_k^* F11^{-1} U_k = Ṽ_k^* D_k^{-1} Ũ_k at cost O(r^3). This gives F22 - U_q B_qk (Ṽ_k^* D_k^{-1} Ũ_k) B_kq V_q^* = F22 - Ψ Φ^*, with Ψ and Φ of size O(rN). Applying U_q and V_q requires traversing the q subtree. The result is a cheap multiply with random vectors.
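For the last point, a minimal sketch (the names F22, Psi, Phi, R are assumptions) of multiplying the implicitly updated Schur complement with a block of random vectors without ever forming the rank-r update explicitly:

    #include <Eigen/Dense>

    // Multiply (F22 - Psi * Phi^T) with random vectors R using only
    // tall-skinny products, instead of forming the dense updated matrix.
    Eigen::MatrixXd sample_updated_schur(const Eigen::MatrixXd& F22,
                                         const Eigen::MatrixXd& Psi,
                                         const Eigen::MatrixXd& Phi,
                                         const Eigen::MatrixXd& R) {
      return F22 * R - Psi * (Phi.transpose() * R);
    }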

Numerical examples, general matrices GMRES(30) with right preconditioning: multifrontal with HSS vs. ILUTP from SuperLU. b = (1, 1, ...), x_0 = (0, 0, ...). Stopping criterion: ||r_k||_2 / ||b||_2 <= 10^-6. HSS compression tolerance: 10^-6.

    Matrix        descr.        N          rank   fill-ratio      factor (s)       its
                                                  HSS     ILU     HSS      ILU     HSS   ILU
    add32         circuit       4,690        0     2.1     1.3     0.01     0.01     1     2
    mchln85ks17   car tire      84,180     948    13.5    12.3    133.8    216.1     4    39
    mhd500        plasma        250,000    100    11.6    15.6      2.5      7.9     2     8
    poli_large    economics     15,575      64     4.8     1.6     0.04     0.02     1     2
    stomach       bio eng.      213,360     92    12.1     2.9     13.8     18.7     2     2
    tdr190k       accelerator   1,100,242  596    14.1      -     629.2       -      7     -
    torso3        bio eng.      259,156    136    22.6     2.4     86.7     63.7     2     2
    utm5940       tokamak       5,940      123     6.7     8.0      0.1     0.16     3    15
    wang4         device        26,068     385    45.3    23.1      4.4      6.4     3     4

Numerical example: tdr190k Maxwell equations in the frequency domain. [Figure: GMRES(30) convergence history.]

Parallel implementation We have a serial code (StruMF [Napov '11-'12]) with some performance issues: it currently generates random vectors for all nodes in the e-tree, and how to estimate the rank? Currently we guess and start over when the guess is too small. A parallel implementation is work in progress. Distributed-memory HSS compression of a dense matrix: MPI, BLACS, PBLAS, BLAS, LAPACK. Shared-memory multifrontal code: OpenMP task parallelism for the traversal of both the elimination tree and the HSS tree; the next step is parallel dense algebra.

Parallel HSS compression, MPI code Topmost separator of a 3D problem, generated with the exact multifrontal method.

    k                         100      200      300
    N                         10,000   40,000   90,000
    MPI processes / cores     64       256      1024
    Nodes                     4        16       64
    Levels                    6        7        8
    Tolerance                 1e-3     1e-3     1e-3
    Non-randomized (s)        8.3      51.5     193.4
    Randomized (s)            2.9      16.0     37.2
    Dense LU ScaLAPACK (s)    4.2      57.6     175.9

On 1024 cores we achieved 5.3 TFlops/s, a very good flop balance (min/max = 0.93), and 17% communication overhead.

Task-based parallel tree traversal (OpenMP) Postorder (leaf-to-root) tree traversal: handle siblings in parallel, wait for the children.

    void traverse(node* p) {
      if (p->left) {
        #pragma omp task        // visit the left subtree as a task
        traverse(p->left);
      }
      if (p->right) {
        #pragma omp task        // visit the right subtree as a task
        traverse(p->right);
      }
      #pragma omp taskwait      // wait until both children are done
      process(p->data);         // then process this node
    }

The run-time system schedules tasks to cores (work stealing). The root node is handled sequentially: a scaling bottleneck. Nested trees: a node of the nested dissection tree contains an HSS tree.
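For completeness, a minimal sketch of how such a traversal would be launched (root is an assumed variable); task constructs need an enclosing parallel region, with a single thread starting the recursion:

    #pragma omp parallel
    #pragma omp single nowait
    traverse(root);   // spawns tasks recursively; idle threads steal work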

Task-based parallel tree traversal (OpenMP) [Figure: timelines of thread ID versus time (s) for the HSS compression of a dense frontal matrix and for the multifrontal factorization; blue: e-tree node, red: HSS compression, green: ULV factorization.] Extraction of elements from the HSS matrix forms a bottleneck. We are considering other runtime task schedulers: Intel TBB, StarPU, QUARK. The QUARK scheduler from PLASMA could allow integration of PLASMA's parallel (tiled) BLAS/LAPACK.

Conclusions We are developing an algebraic preconditioner for general nonsymmetric matrices (from PDEs). HSS is a restricted format: large gains are possible for certain applications, but not for all. Graph partitioning difficulties: the separator is not always just a plane/line and not always a single piece, which is bad for the rank structure. Some performance issues still need to be addressed. We have separate distributed-memory and shared-memory codes: how to combine them in a hybrid MPI+X code? Prepare for the next-generation NERSC supercomputer: Intel MIC based, > 60 cores.

Thank you! Questions?