Large Scale Sparse Linear Algebra


P. Amestoy (INP-N7, IRIT), A. Buttari (CNRS, IRIT), T. Mary (University of Toulouse, IRIT), A. Guermouche (Univ. Bordeaux, LaBRI), J.-Y. L'Excellent (INRIA, LIP, ENS-Lyon), B. Uçar (CNRS, LIP, ENS-Lyon), F.-H. Rouet (LSTC, Livermore, USA), C. Weisbecker (LSTC, Livermore, USA)

Main principle
Principle: build an approximate factorization A_ε = L_ε U_ε at a given accuracy ε.

Part I: asymptotic complexity reduction
Theoretical proof and experimental validation that (3D case):
Operations: O(N^6) → O(N^5) → O(N^4)
Memory: O(N^4) → O(N^3 log N)

Part II: efficient and scalable algorithms
How to design algorithms that efficiently translate the theoretical complexity reduction into actual performance and memory gains for large-scale systems and applications?

Impact on industrial applications

[Figure: 3D seismic velocity model; axes Dip (km), Cross (km), Depth (km); velocity scale 3000-6000 m/s.]

- Structural mechanics: matrix of order 8M, required accuracy 10^-9
- Seismic imaging: matrix of order 17M, required accuracy 10^-3
- Electromagnetism: matrix of order 30M, required accuracy 10^-7

Results on 900 cores:

application     factorization time (s)            memory/proc (GB)
                MUMPS     BLR      ratio          MUMPS    BLR     gain
structural      289.3     104.9    2.5            7.9      5.9     25%
seismic         617.0     123.4    4.9            13.3     10.4    22%
electromag.     1307.4    233.8    5.3            20.6     14.4    30%

Introduction

Rank and rank-k approximation

In the following, B is a dense matrix of size m × n.

Definition 1 (Rank). The rank k of B is defined as the smallest integer such that there exist matrices X and Y of size m × k and n × k such that B = XY^T.

Definition 2. We call a rank-k approximation of B at accuracy ε any matrix B̃ of rank k such that ‖B − B̃‖ ≤ ε.

Optimal rank-k approximation

Theorem 3 (Eckart-Young). Let UΣV^T be the SVD of B and denote by σ_i = Σ_{i,i} its singular values. Then B̃ = U_{1:m,1:k} Σ_{1:k,1:k} V_{1:n,1:k}^T is the optimal rank-k approximation of B, and ‖B − B̃‖_2 = σ_{k+1}.
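As an illustration of Theorem 3 (a minimal NumPy sketch added for this write-up; the matrix, the function name best_rank_k, and the value of k are arbitrary), the optimal rank-k approximation is obtained by truncating the SVD, and its 2-norm error is σ_{k+1}:

```python
import numpy as np

def best_rank_k(B, k):
    """Optimal rank-k approximation of B (Eckart-Young), via the SVD."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    # Keep the k leading singular triplets: B_k = U[:, :k] @ diag(s[:k]) @ Vt[:k, :]
    return U[:, :k] * s[:k] @ Vt[:k, :]

rng = np.random.default_rng(0)
B = rng.standard_normal((100, 80))
k = 10
Bk = best_rank_k(B, k)
sigma = np.linalg.svd(B, compute_uv=False)
# The 2-norm error of the optimal rank-k approximation equals sigma_{k+1}
assert np.isclose(np.linalg.norm(B - Bk, 2), sigma[k])
```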

Numerical rank

Definition 4 (Numerical rank). The numerical rank Rk_ε(B) of B at accuracy ε is defined as the smallest integer k_ε such that there exists a matrix B̃ of rank k_ε with ‖B − B̃‖ ≤ ε.

Theorem 5. Let UΣV^T be the SVD of B and denote by σ_i = Σ_{i,i} its singular values. Then the numerical rank of B at accuracy ε is given by
k_ε = min { k, 1 ≤ k ≤ min(m, n) : σ_{k+1} ≤ ε }.
Proof: in exercise.

Low-rank matrices

If the numerical rank of B is equal to min(m, n), then B is said to be full-rank. Conversely, if Rk_ε(B) < min(m, n), then B is said to be rank-deficient. A class of rank-deficient matrices of particular interest are low-rank matrices, defined as follows.

Definition 6 (Low-rank matrix). B is said to be low-rank (for a given accuracy ε) if its numerical rank k_ε is small enough that its rank-k_ε approximation B̃ = XY^T requires less storage than the full-rank matrix B, i.e., if k_ε (m + n) < mn. In that case, B̃ is said to be a low-rank approximation of B and ε is called the low-rank threshold.

In the following, for the sake of simplicity, we refer to the numerical rank of a matrix at accuracy ε simply as its rank.
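A small NumPy sketch of Theorem 5 (an added illustration, not from the slides): the numerical rank at accuracy ε is simply the number of singular values larger than ε.

```python
import numpy as np

def numerical_rank(B, eps):
    """Numerical rank of B at accuracy eps (absolute 2-norm threshold),
    i.e. the smallest k such that sigma_{k+1} <= eps."""
    sigma = np.linalg.svd(B, compute_uv=False)
    return int(np.sum(sigma > eps))

# A matrix with rapidly decaying singular values is numerically low-rank
n = 200
U, _ = np.linalg.qr(np.random.randn(n, n))
V, _ = np.linalg.qr(np.random.randn(n, n))
sigma = 10.0 ** (-0.5 * np.arange(n))        # geometric decay of the spectrum
B = (U * sigma) @ V.T
print(numerical_rank(B, 1e-8))               # roughly 16 for this decay
```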

Compression kernels

The act of computing B̃ from B is called the compression of B. What are the different methods to compress B?
- SVD: optimal but expensive: O(mn min(m, n)) operations
- Truncated QR factorization: slightly less accurate but much cheaper: O(mn k_ε) operations; widely used
- Multiple other methods: randomized algorithms, adaptive cross-approximation, interpolative decomposition, CUR, etc.
In the following, we assume truncated QR is used as compression kernel.

Low-rank subblocks

Frontal matrices are not low-rank, but in some applications they exhibit low-rank blocks. A block B represents the interaction between two subdomains σ and τ. If they have a small diameter and are far away, their interaction is weak and the rank is low. The block-admissibility condition formalizes this intuition:
σ × τ is admissible ⇔ max(diam(σ), diam(τ)) ≤ η dist(σ, τ)

[Figure: rank of the block σ × τ as a function of the distance between σ and τ; the rank decreases as the distance increases.]
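To make the truncated QR compression kernel concrete, here is a hedged SciPy sketch added for this write-up (the function name truncated_qr and the stopping rule based on the diagonal of R are illustrative choices): it truncates the column-pivoted QR of B once the diagonal of R drops below ε, returning X (orthonormal) and Y with B ≈ X Y^T.

```python
import numpy as np
from scipy.linalg import qr

def truncated_qr(B, eps):
    """Compress B into X @ Y.T using column-pivoted QR, truncated at accuracy eps.
    Returns (X, Y) with X of size m x k and Y of size n x k."""
    Q, R, piv = qr(B, mode="economic", pivoting=True)
    # |R[k, k]| estimates the error of the rank-k truncation; stop when it drops below eps
    diag = np.abs(np.diag(R))
    k = int(np.sum(diag > eps))
    X = Q[:, :k]                       # orthonormal basis of the column space
    Y = np.zeros((B.shape[1], k))
    Y[piv, :] = R[:k, :].T             # undo the column permutation
    return X, Y
```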

Block Low-Rank matrices

A BLR matrix is defined by a partition P = S × S, with S = {σ_1, ..., σ_p}.

[Figure: a matrix partitioned into blocks σ_i × σ_j, i, j = 1..6; gray and white blocks.]

- Gray blocks are non-admissible and are therefore kept full-rank (FR)
- White blocks are admissible and are therefore compressed to low-rank (LR)

Standard BLR factorization: FSCU (Factor, Solve, Compress, Update)

To evaluate the complexity of the BLR factorization, we must compute the cost of these four main steps.
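The FSCU loop just described can be sketched in a few dozen lines of NumPy. This is an added illustration of the idea, not the MUMPS implementation: there is no pivoting, every off-diagonal block is treated as admissible, and an SVD-based kernel replaces the truncated QR assumed in the slides.

```python
import numpy as np

def compress(B, eps):
    """Rank-revealing compression B ~= X @ Y.T (via SVD here, for simplicity)."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    k = max(1, int(np.sum(s > eps)))
    return U[:, :k] * s[:k], Vt[:k, :].T

def lu_unpivoted(M):
    """Tiny unpivoted LU of a small dense block (assumes it is safely factorizable)."""
    n = M.shape[0]
    L, U = np.eye(n), M.astype(float).copy()
    for t in range(n - 1):
        L[t+1:, t] = U[t+1:, t] / U[t, t]
        U[t+1:, t:] -= np.outer(L[t+1:, t], U[t, t:])
    return L, U

def blr_lu_fscu(A, b, eps):
    """Simplified dense BLR LU in FSCU order (Factor, Solve, Compress, Update).
    A must be a float array of order p*b; a sketch of the algorithm only."""
    p = A.shape[0] // b
    blk = lambda i, j: A[i*b:(i+1)*b, j*b:(j+1)*b]
    diag, lowrank = {}, {}                 # diagonal LU factors, low-rank blocks
    for k in range(p):
        # Factor: LU of the diagonal block
        L, U = lu_unpivoted(blk(k, k))
        diag[k] = (L, U)
        # Solve: triangular solves on the off-diagonal blocks of block row/column k
        for i in range(k+1, p):
            blk(i, k)[:] = np.linalg.solve(U.T, blk(i, k).T).T   # A_ik U^{-1}
            blk(k, i)[:] = np.linalg.solve(L, blk(k, i))         # L^{-1} A_ki
        # Compress: store those blocks in low-rank form X @ Y.T
        for i in range(k+1, p):
            lowrank[(i, k)] = compress(blk(i, k), eps)
            lowrank[(k, i)] = compress(blk(k, i), eps)
        # Update: LR x LR products applied to the trailing submatrix
        for i in range(k+1, p):
            Xi, Yi = lowrank[(i, k)]
            for j in range(k+1, p):
                Xj, Yj = lowrank[(k, j)]
                blk(i, j)[:] -= Xi @ ((Yi.T @ Xj) @ Yj.T)
    return diag, lowrank
```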

Part I: complexity of the BLR factorization

Cost analysis of the involved steps
Let us consider two blocks A and B of size b × b and of rank bounded by r.

step        type      operation          cost
Factor      FR        A → LU
Solve       FR-FR     B ← B U^-1
Compress    LR        A → Ã
Update      FR-FR     C ← C − A B
Update      LR-FR     C ← C − Ã B
Update      FR-LR     C ← C − A B̃
Update      LR-LR     C ← C − Ã B̃

This is not enough to compute the complexity: we need to bound the number of FR blocks!

Bounding the number of FR blocks

BLR-admissibility condition of a partition P:
P is admissible ⇔ { #{σ : σ × τ ∈ P is not admissible} ≤ q,
                    #{τ : σ × τ ∈ P is not admissible} ≤ q }

[Figure: examples of a non-admissible and an admissible block partition.]

Main result
For any matrix, we can build an admissible P for q = O(1), such that the maximal rank of the admissible blocks of A is r.
Amestoy, Buttari, L'Excellent, and Mary. On the Complexity of the Block Low-Rank Multifrontal Factorization, SIAM J. Sci. Comput., 2017.

Memory complexity of the dense BLR factorization
Let us consider a dense (frontal) matrix of order m divided into p × p blocks of order b, with p = m/b. The memory complexity to store the matrix can be computed as
M_total(b, p, r) = M_FR(b, p) + M_LR(b, p, r)

Ex. 1: compute M_FR(b, p) =       and M_LR(b, p, r) =
Ex. 2: assuming b = O(m^x) and r = O(m^α), compute M_total(m, x, α) =
Ex. 3: compute the optimal block size b = O(m^x) and the resulting optimal complexity: x =, b =, and M_opt(m, r) = M_total(m, x, α) =
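For reference, one possible derivation is sketched below (added here, hedged: it assumes O(1) non-admissible blocks per block row, as in the main result above, and is consistent with the O(N^3 log N) sparse memory bound quoted in the introduction).

```latex
% Full-rank blocks: O(1) per block row, each of size b^2
M_{FR}(b,p) = O(p\,b^2) = O(mb)
% Low-rank blocks: at most p^2 of them, each stored as X Y^T with 2br entries
M_{LR}(b,p,r) = O(p^2\,br) = O\!\left(\tfrac{m^2 r}{b}\right)
% With b = O(m^x) and r = O(m^\alpha):
M_{total}(m,x,\alpha) = O\!\left(m^{1+x} + m^{2-x+\alpha}\right)
% Balancing the two terms gives x = (1+\alpha)/2, i.e. b = O(\sqrt{mr}), and
M_{opt}(m,r) = O\!\left(m^{3/2} r^{1/2}\right)
```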

Flop complexity of the dense BLR factorization
Let us consider a dense (frontal) matrix of order m divided into p × p blocks of order b, with p = m/b.

step        type      cost        number    C_step(b, p, r)    C_step(m, x, α)
Factor      FR        O(b^3)
Solve       FR-FR     O(b^3)
Compress    LR        O(b^2 r)
Update      FR-FR     O(b^3)
Update      LR-FR     O(b^2 r)
Update      LR-LR     O(b^2 r)

Ex. 1: compute C_step(b, p, r) = cost × number
Ex. 2: compute C_step(m, x, α) with b = O(m^x) and r = O(m^α).
Ex. 3: compute the total complexity (sum of all steps) C_total(m, x, α) =
Ex. 4: compute the optimal block size b = O(m^x) and the resulting optimal complexity: x =, b =, and C_opt(m, r) = C_total(m, x, α) =

Complexity of the sparse multifrontal BLR factorization

Sparse multifrontal complexity with ND
For a dense complexity C_opt(m, r), the sparse complexity is computed as
C_mf = O( Σ_{l=0}^{log_2 N} 2^{dl} C((N/2^l)^{d−1}) ),
where d is the dimension (2 or 3).

                  operations (OPC)       factor size (NNZ)
N × N grid
  FR              O(N^3)                 O(N^2 log N)
  BLR             O(N^{5/2} r^{1/2})     O(N^2)
N × N × N grid
  FR              O(N^6)                 O(N^4)
  BLR
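As an added worked example (hedged: it simply plugs the dense FSCU complexity O(m^{5/2} r^{1/2}), quoted later in these slides, into the formula above), the 3D BLR operation count is obtained as follows.

```latex
% d = 3, fronts of order m_l = (N/2^l)^2, dense cost C(m) = O(m^{5/2} r^{1/2}):
C_{mf} = O\!\left(\sum_{l=0}^{\log_2 N} 2^{3l}\,\big((N/2^l)^2\big)^{5/2} r^{1/2}\right)
       = O\!\left(N^5 r^{1/2} \sum_{l=0}^{\log_2 N} 2^{-2l}\right)
       = O\!\left(N^5 r^{1/2}\right)
```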

Experimental Setting: Matrices

1. Poisson: N^3 grid with a 7-point stencil; Δu = f, with u = 1 on the boundary ∂Ω. The rank bound is theoretically proven to be r = O(1).

2. Helmholtz: N^3 grid with a 27-point stencil; ω is the angular frequency, v(x) is the seismic velocity field, and u(x, ω) is the time-harmonic wavefield solution to the forcing term s(x, ω):
(−Δ − ω^2 / v(x)^2) u(x, ω) = s(x, ω)
ω is fixed and equal to 4 Hz. Heuristically, the rank bound can be expected to behave as r = O(N).

Experimental MF flop complexity: Poisson (ε = 10^-10)

[Figure: flop count vs mesh size N (64 to 320), with fitted complexities.]
Nested Dissection ordering (geometric): FR fit 5n^2.02, BLR (FSCU) fit 2105n^1.45
METIS ordering (purely algebraic): FR fit 3n^2.05, BLR (FSCU) fit 1068n^1.50

- good agreement with the theoretical complexity
- remains close to the ND complexity with METIS ordering
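To make the first test problem concrete, here is a small SciPy sketch added for this write-up (the helper name poisson_7pt is ours; boundary handling is kept minimal) that assembles the 7-point-stencil Poisson matrix on an N^3 grid.

```python
import scipy.sparse as sp

def poisson_7pt(N):
    """Sparse matrix of the 7-point finite-difference Laplacian on an N x N x N grid
    (Dirichlet boundary conditions eliminated), of order N^3."""
    I = sp.identity(N, format="csr")
    T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(N, N), format="csr")  # 1D Laplacian
    # 3D operator via Kronecker sums: A = T(x)I(x)I + I(x)T(x)I + I(x)I(x)T
    return (sp.kron(sp.kron(T, I), I)
            + sp.kron(sp.kron(I, T), I)
            + sp.kron(sp.kron(I, I), T)).tocsr()

A = poisson_7pt(16)          # order 4096, 7 nonzeros per interior row
print(A.shape, A.nnz)
```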

Experimental MF flop complexity: Helmholtz (ε = 10^-4)

[Figure: flop count vs mesh size N (64 to 320), with fitted complexities.]
Nested Dissection ordering (geometric): FR fit 12n^2.01, BLR (FSCU) fit 31n^1.85
METIS ordering (purely algebraic): FR fit 8n^2.04, BLR (FSCU) fit 22n^1.87

- good agreement with the theoretical complexity
- remains close to the ND complexity with METIS ordering

Experimental MF complexity: factor size

[Figure: factor size (NNZ) vs mesh size N (64 to 320), with fitted complexities.]
Poisson: FR fit 3n^1.40, BLR fit 12n^1.05 log n
Helmholtz: FR fit 15n^1.36, BLR fit 32n^1.26

- good agreement with the theoretical complexity
- remains close to the ND complexity with METIS ordering (not shown)

Experimental MF complexity: low-rank threshold ε

[Figure: flop count (OPC) vs mesh size N (64 to 320), for several thresholds ε.]
Poisson: ε = 10^-14 fit 905n^1.55; ε = 10^-10 fit 1068n^1.50; ε = 10^-6 fit 1045n^1.43; ε = 10^-2 fit 851n^1.36
Helmholtz: ε = 10^-5 fit 23n^1.89; ε = 10^-4 fit 22n^1.87; ε = 10^-3 fit 14n^1.89

- theory states ε should only play a role in the constant factor
- true for Helmholtz, but not for Poisson: why?

Influence of zero-rank blocks on the complexity

Number of full-rank (N_FR) / low-rank (N_LR) / zero-rank (N_ZR) blocks, in percentage of the total number of blocks (Poisson problem):

             N        64     128    192    256    320
ε = 10^-14   N_FR     40.8   31.3   26.4   23.6   13.4
             N_LR     59.2   68.6   73.6   76.4   86.6
             N_ZR     0.0    0.1    0.0    0.0    0.0
ε = 10^-10   N_FR     21.3   16.6   14.6   12.8   5.8
             N_LR     78.6   83.4   85.4   87.1   94.2
             N_ZR     0.0    0.1    0.0    0.0    0.0
ε = 10^-6    N_FR     2.9    3.0    2.5    2.1    0.6
             N_LR     97.0   96.7   96.4   95.3   93.3
             N_ZR     0.1    0.3    1.0    2.5    6.1
ε = 10^-2    N_FR     0.0    0.0    0.0    0.0    0.0
             N_LR     26.2   12.2   7.6    5.5    3.0
             N_ZR     73.8   87.8   92.4   94.5   97.0

- N_FR decreases with N: asymptotically negligible
- N_ZR increases with ε (as one would expect) but also with N: asymptotically dominant

Influence of the block size b on the complexity

Analysis on the root node (of size m = N^2):

[Figure: normalized flops vs block size b (128 to 640), for m = 128^2, 192^2, and 256^2.]

- large range of acceptable block sizes around the optimal b: flexibility to tune the block size for performance
- that range increases with the size of the matrix: necessity to have variable block sizes

Part II: performance of the BLR factorization

Sequential result (matrix S3)

[Figure: normalized flops and normalized time of FR vs BLR, broken down into LAI parts, Factor+Solve, Update, and Compress.]

The 7.7× gain in flops only translates into a 3.3× gain in time: why?
- lower granularity of the Update
- higher relative weight of the FR parts
- inefficient Compress

Multithreaded result on 24 threads

[Figure: normalized sequential time vs normalized multithreaded time of FR vs BLR, same breakdown.]

The 3.3× gain in sequential becomes 1.7× in multithreaded: why?
- the LAI (lower arithmetic intensity) parts have become critical
- Update and Compress are memory-bound

Exploiting tree-based multithreading in MF solvers

[Figure: elimination tree with node parallelism above the L0 layer and tree parallelism below, threads 0-3.]

L'Excellent and Sid-Lakhdar. A study of shared-memory parallelism in a multifrontal solver, Parallel Computing.

How big an impact can tree-based multithreading make?

Impact of tree-based multithreading on BLR (24 threads)

           node only              node + tree
           time (s)    % lai      time (s)    % lai
FR         509         21%        424         13%
BLR        307         35%        221         24%

- In FR, the top of the tree is dominant: tree multithreading brings little gain
- In BLR, the bottom of the tree compresses less and becomes important
- the 1.7× gain becomes 1.9× thanks to tree-based multithreading

Theoretical speedup

                  tree only     node only     node + tree
N × N grid
  FR              O(1)          O(N)          O(N^2)
  BLR             O(log N)      O(log N)      O(N log N)
N × N × N grid
  FR              O(1)          O(N^3)        O(N^4)
  BLR

Right-looking vs. left-looking analysis (24 threads)

              FR time (s)         BLR time (s)
              RL        LL        RL        LL
Update        338       336       110       67
Total         424       421       221       175

[Figure: memory-access pattern; RL factorization: read once, written at each step; LL factorization: read at each step, written once.]

- lower volume of memory transfers in LL (more critical in MT)
- the Update is now less memory-bound: the 1.9× gain becomes 2.4× in LL

LUAR variant: accumulation and recompression

FSCU (Factor, Solve, Compress, Update) → FSCU+LUAR
- better granularity in the Update operations
- potential recompression → asymptotic complexity reduction?
- designed and compared several recompression strategies
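To illustrate the difference in update pattern, here is a generic blocked-LU sketch added for this write-up (not MUMPS code; unpivoted, full-rank blocks only, A must be a float array). The right-looking variant applies each panel's update to the whole trailing submatrix immediately, whereas the left-looking variant applies all previous updates to the current block row/column just before factoring it.

```python
import numpy as np

def factor_and_solve(A, b, k):
    """Unpivoted LU of diagonal block k (stored in place as L\\U), followed by the
    triangular solves on the off-diagonal blocks of block row/column k."""
    p = A.shape[0] // b
    blk = lambda i, j: A[i*b:(i+1)*b, j*b:(j+1)*b]
    D = blk(k, k)
    for t in range(b - 1):
        D[t+1:, t] /= D[t, t]
        D[t+1:, t+1:] -= np.outer(D[t+1:, t], D[t, t+1:])
    L, U = np.tril(D, -1) + np.eye(b), np.triu(D)
    for i in range(k+1, p):
        blk(i, k)[:] = np.linalg.solve(U.T, blk(i, k).T).T   # A_ik U^{-1}
        blk(k, i)[:] = np.linalg.solve(L, blk(k, i))         # L^{-1} A_ki

def lu_right_looking(A, b):
    """Right-looking blocked LU: the trailing submatrix is written at every step."""
    p = A.shape[0] // b
    blk = lambda i, j: A[i*b:(i+1)*b, j*b:(j+1)*b]
    for k in range(p):
        factor_and_solve(A, b, k)
        for i in range(k+1, p):
            for j in range(k+1, p):
                blk(i, j)[:] -= blk(i, k) @ blk(k, j)
    return A

def lu_left_looking(A, b):
    """Left-looking blocked LU: previously factored panels are re-read at every step,
    but each block row/column is written only once, just before it is factored."""
    p = A.shape[0] // b
    blk = lambda i, j: A[i*b:(i+1)*b, j*b:(j+1)*b]
    for k in range(p):
        for l in range(k):
            for i in range(k, p):
                blk(i, k)[:] -= blk(i, l) @ blk(l, k)
            for j in range(k+1, p):
                blk(k, j)[:] -= blk(k, l) @ blk(l, j)
        factor_and_solve(A, b, k)
    return A
```

Both variants produce the same factors; the left-looking one re-reads already factored panels at every step but writes each block only once, which is the lower-memory-traffic pattern highlighted in the table above.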

Performance of the Outer Product with LUA(R) (24 threads)

[Figure: Outer Product benchmark, GF/s as a function of the size of the outer product, for b = 256 and b = 512.]

                                LL      LUA     LUAR
average size of Outer Product   16.5    61.0    32.8
flops (×10^12)  Outer Product   3.8     3.8     1.6
                Total           10.2    10.2    8.1
time (s)        Outer Product   21      14      6
                Total           175     167     160

All metrics include the recompression overhead.
Higher granularity and lower flops in the Update: the 2.4× gain becomes 2.6×.

Impact of machine properties on BLR: roofline model

                        specs                     time (s) for BLR factorization
                        peak (GF/s)   bw (GB/s)   RL      LL      LUA
grunch (28 threads)     37            57          248     228     196
brunch (24 threads)     46            102         221     175     167
(S3 matrix)

Arithmetic intensity in BLR:
- LL > RL (lower volume of memory transfers)
- LUA > LL (higher granularities → more efficient cache use)

[Figure: roofline plot (GF/s vs arithmetic intensity of the Outer Product) for brunch and grunch, with RL, LL, and LUA points.]
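As a rough sketch of the LUAR idea described above (added; not the MUMPS implementation, and the recompression strategy shown is only one possibility), low-rank updates contributing to the same block are accumulated into one wider low-rank product and optionally recompressed before being applied, which raises the granularity of the outer product and can reduce flops.

```python
import numpy as np

def recompress(X, Y, eps):
    """Recompress an accumulated low-rank product X @ Y.T (QR + SVD of the small core)."""
    Qx, Rx = np.linalg.qr(X)
    Qy, Ry = np.linalg.qr(Y)
    U, s, Vt = np.linalg.svd(Rx @ Ry.T)
    k = max(1, int(np.sum(s > eps)))
    return Qx @ (U[:, :k] * s[:k]), Qy @ Vt[:k, :].T

def apply_accumulated_updates(C, updates, eps=None):
    """LUAR-style update of block C: accumulate the low-rank contributions
    [(X1, Y1), (X2, Y2), ...] into a single product, recompress if eps is given,
    then apply one large outer product."""
    X = np.hstack([x for x, _ in updates])
    Y = np.hstack([y for _, y in updates])
    if eps is not None:
        X, Y = recompress(X, Y, eps)      # optional recompression (the "R" in LUAR)
    C -= X @ Y.T
    return C
```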

FCSU variant: Compress before Solve

FSCU (Factor, Solve, Compress, Update) → FSCU+LUAR
- better granularity in the Update operations
- potential recompression → asymptotic complexity reduction?
- designed and compared several recompression strategies

FSCU+LUAR → FCSU(+LUAR)
- restricted pivoting, e.g. to diagonal blocks
- low-rank Solve → asymptotic complexity reduction?

Performance and accuracy of FCSU vs FSCU

                   standard pivoting          restricted pivoting
                   FR        FSCU+LUAR        FR        FSCU+LUAR   FCSU+LUAR
flops (×10^12)     77.97     8.15             77.97     8.15        3.95
time (s)           424       160              404       143         111
residual           4.5e-16   1.5e-09          5.0e-16   1.9e-09     2.7e-09

- On this problem, restricted pivoting is enough to ensure stability → better BLAS-3/BLAS-2 ratio
- Compressing before the Solve has little impact on the residual → flop reduction
- the 2.6× gain becomes 3.7×

Variants improve the asymptotic complexity

We have theoretically proven that:

                 FSCU                  FSCU+LUAR               FCSU+LUAR
dense            O(m^{5/2} r^{1/2})    O(m^{7/3} r^{2/3})      O(m^2 r)
sparse (3D)      O(N^5 r^{1/2})        O(N^{14/3} r^{2/3})     O(N^4 r)

Amestoy, Buttari, L'Excellent, and Mary. On the Complexity of the Block Low-Rank Multifrontal Factorization, SIAM J. Sci. Comput., 2017.

[Figure: flop count vs mesh size N (64 to 320), with fitted complexities.]
Poisson (ε = 10^-10): FSCU fit 1068n^1.50, FSCU+LUAR fit 2235n^1.42, FCSU+LUAR fit 6175n^1.33
Helmholtz (ε = 10^-4): FSCU fit 22n^1.87, FSCU+LUAR fit 34n^1.82, FCSU+LUAR fit 60n^1.77

Multicore performance results (24 threads)

[Figure: normalized time (FR = 1) of FR, BLR, and BLR+ on matrices 5Hz, 7Hz, 10Hz, E3, E4, S3, S4, p8d, p8ar, p8cr.]

BLR:  FSCU, right-looking, node-only multithreading
BLR+: FCSU+LUAR, left-looking, node+tree multithreading

Amestoy, Buttari, L'Excellent, and Mary. Performance and Scalability of the Block Low-Rank Multifrontal Factorization on Multicore Architectures, submitted to ACM Trans. Math. Soft., 2017.

The problem with FCSU

FSCU (Factor, Solve, Compress, Update) → FSCU+LUAR
- better granularity in the Update operations
- potential recompression → asymptotic complexity reduction?
- designed and compared several recompression strategies

FSCU+LUAR → FCSU(+LUAR)
- restricted pivoting, e.g. to diagonal blocks: not acceptable in many applications
- low-rank Solve → asymptotic complexity reduction?

Compress before Solve + pivoting: the CFSU variant

[Figure: panel with diagonal block D_k and a low-rank off-diagonal block B = X Y^T.]

What is straightforward:
- column swaps on B can be performed as row swaps of Y
- the triangular solve and update can also be performed on Y

What is less straightforward:
- How to assess the quality of pivot k? We need to estimate ‖B_{:,k}‖_max from the low-rank form, assuming X is orthonormal (e.g., RRQR, SVD).
- How to deal with postponed/delayed pivots? Several strategies to merge them with the next panel.
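The first "straightforward" point can be checked in a couple of NumPy lines (an added illustration): permuting the columns of B = X Y^T is the same as permuting the rows of Y.

```python
import numpy as np

rng = np.random.default_rng(0)
X, Y = rng.standard_normal((50, 5)), rng.standard_normal((30, 5))
B = X @ Y.T
perm = rng.permutation(30)
# Column permutation of B == row permutation of Y in the low-rank form
assert np.allclose(B[:, perm], X @ Y[perm, :].T)
```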

FSCU vs FCSU vs CFSU

FSCU: standard pivoting, Compress after Solve
FCSU: restricted pivoting, Compress before Solve
CFSU: standard pivoting, Compress before Solve

[Figure: normalized flops and residuals of FR, FSCU, FCSU, and CFSU on matrices barrier2-10, Lin, para-10, kkt_power, perf009d, perf009ar.]

- When FCSU is enough (left matrices), CFSU does not degrade the compression
- When FCSU fails (right matrices), CFSU achieves both a good residual and good compression

Distributed-memory parallelism

[Figure: mapping of a front onto processes P0-P5, with LU messages and CB messages.]

- The volume of LU messages is reduced in BLR (compressed factors)
- The volume of CB messages can be reduced by compressing the CB, but it is an overhead cost

Strong scalability analysis

[Figure: time (s) of the FR and BLR factorizations, from 30 MPI × 10 cores to 90 MPI × 10 cores.]

- The compression rate is not significantly impacted by the number of processes
- Flops are reduced by 12.8× but the volume of communications only by 2.2×: higher relative weight of communications
- Load unbalance (ratio between the most and least loaded processes) increases from 1.28 to 2.57

Communication analysis

[Figure: total bytes sent for LU messages and CB messages as a function of the front size.]

- FR case: LU messages dominate
- BLR case: CB messages dominate → underwhelming reduction of communications
- CB compression allows for truly reducing the communications; it is an overhead cost but may lead to speedups depending on network speed w.r.t. processor speed

Theoretical communication analysis bounds

               W_LU              W_CB           W_tot
FR             O(n^{4/3} p)      O(n^{4/3})     O(n^{4/3} p)
BLR (CB FR)
BLR (CB LR)

Result on a very large problem

Result on matrix 15Hz (order 58×10^6, nnz 1.5×10^9) on 900 cores:

           flops (PF)   factors size (TB)   memory (GB)        elapsed time (s)
                                            avg.     max.      ana.    fac.    sol.
MUMPS      29.6         3.7                 103      120       OOM     OOM     OOM
BLR        1.3          0.7                 37       57        437     856     0.2/RHS
ratio      22.9         5.1                 2.8      2.3

References

Amestoy, Buttari, L'Excellent, and Mary. On the Complexity of the Block Low-Rank Multifrontal Factorization, SIAM J. Sci. Comput., 2017.
Amestoy, Buttari, L'Excellent, and Mary. Performance and Scalability of the Block Low-Rank Multifrontal Factorization on Multicore Architectures, submitted to ACM Trans. Math. Soft., 2017.
Amestoy, Brossier, Buttari, L'Excellent, Mary, Métivier, Miniussi, and Operto. Fast 3D frequency-domain full waveform inversion with a parallel Block Low-Rank multifrontal direct solver: application to OBC data from the North Sea, Geophysics, 2016.
Shantsev, Jaysaval, de la Kethulle de Ryhove, Amestoy, Buttari, L'Excellent, and Mary. Large-scale 3D EM modeling with a Block Low-Rank multifrontal direct solver, Geophysical Journal International, 2017.