Large Scale Sparse Linear Algebra

P. Amestoy (INP-N7, IRIT), A. Buttari (CNRS, IRIT), T. Mary (University of Toulouse, IRIT), A. Guermouche (Univ. Bordeaux, LaBRI), J.-Y. L'Excellent (INRIA, LIP, ENS-Lyon), B. Uçar (CNRS, LIP, ENS-Lyon), F.-H. Rouet (LSTC, Livermore, USA), C. Weisbecker (LSTC, Livermore, USA)

Main principle

Principle: build an approximate factorization $A_\varepsilon = L_\varepsilon U_\varepsilon$ at a given accuracy $\varepsilon$.

Part I: asymptotic complexity reduction
Theoretical proof and experimental validation that (3D case):
- Operations: $O(N^6) \to O(N^5) \to O(N^4)$
- Memory: $O(N^4) \to O(N^3 \log N)$

Part II: efficient and scalable algorithms
How can algorithms be designed to translate the theoretical complexity reduction into actual performance and memory gains for large-scale systems and applications?

Impact on industrial applications

[Figure: application images; recoverable axes are Dip (km), Cross (km), and Depth (km), with a color scale from 3000 to 6000 m/s]

- Structural mechanics: matrix of order 8M, required accuracy $10^{-9}$
- Seismic imaging: matrix of order 17M, required accuracy $10^{-3}$
- Electromagnetism: matrix of order 30M, required accuracy $10^{-7}$

Results on 900 cores:

               factorization time (s)      memory/proc (GB)
application    MUMPS     BLR      ratio    MUMPS    BLR     gain
structural     289.3     104.9    2.5      7.9      5.9     25%
seismic        617.0     123.4    4.9      13.3     10.4    22%
electromag.    1307.4    233.8    5.3      20.6     14.4    30%

Introduction

Rank and rank-k approximation

In the following, B is a dense matrix of size $m \times n$.

Definition 1 (Rank). The rank k of B is defined as the smallest integer such that there exist matrices X and Y of size $m \times k$ and $n \times k$ such that $B = XY^T$.

Definition 2. We call a rank-k approximation of B at accuracy $\varepsilon$ any matrix $\tilde{B}$ of rank k such that $\|B - \tilde{B}\| \le \varepsilon$.

Optimal rank-k approximation

Theorem 3 (Eckart-Young). Let $U \Sigma V^T$ be the SVD of B and let $\sigma_i = \Sigma_{i,i}$ denote its singular values. Then $\tilde{B} = U_{1:m,1:k} \Sigma_{1:k,1:k} V_{1:n,1:k}^T$ is the optimal rank-k approximation of B and $\|B - \tilde{B}\|_2 = \sigma_{k+1}$.
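The Eckart-Young statement can be checked numerically. The following is a minimal sketch using NumPy; the helper name `best_rank_k` and the random test matrix are illustrative choices, not part of the slides.

```python
import numpy as np

def best_rank_k(B, k):
    """Rank-k approximation of B obtained by truncating its SVD."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    B_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
    return B_k, s

rng = np.random.default_rng(0)
B = rng.standard_normal((8, 6))   # arbitrary dense test matrix
k = 3
B_k, s = best_rank_k(B, k)

# Spectral norm of the error; s[k] is sigma_{k+1} with 0-based indexing.
err = np.linalg.norm(B - B_k, 2)
print(err, s[k])                  # the two values agree up to rounding
```

Note that the error is measured in the 2-norm, as in Theorem 3; in the Frobenius norm the error would instead be the root of the sum of squares of the discarded singular values.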

Numerical rank

Definition 4 (Numerical rank). The numerical rank $\mathrm{Rk}_\varepsilon(B)$ of B at accuracy $\varepsilon$ is defined as the smallest integer $k_\varepsilon$ such that there exists a matrix $\tilde{B}$ of rank $k_\varepsilon$ such that $\|B - \tilde{B}\| \le \varepsilon$.

Theorem 5. Let $U \Sigma V^T$ be the SVD of B and let $\sigma_i = \Sigma_{i,i}$ denote its singular values. Then the numerical rank of B at accuracy $\varepsilon$ is given by

$$k_\varepsilon = \min_{1 \le k \le \min(m,n)} \{ k : \sigma_{k+1} \le \varepsilon \}.$$

Proof: in exercise.

Low-rank matrices

If the numerical rank of B is equal to $\min(m, n)$, then B is said to be full-rank. Conversely, if $\mathrm{Rk}_\varepsilon(B) < \min(m, n)$, then B is said to be rank-deficient. A class of rank-deficient matrices of particular interest are low-rank matrices, defined as follows.

Definition 6 (Low-rank matrix). B is said to be low-rank (for a given accuracy $\varepsilon$) if its numerical rank $k_\varepsilon$ is small enough that its rank-$k_\varepsilon$ approximation $\tilde{B} = XY^T$ requires less storage than the full-rank matrix B, i.e., if $k_\varepsilon (m + n) \le mn$. In that case, $\tilde{B}$ is said to be a low-rank approximation of B and $\varepsilon$ is called the low-rank threshold.

In the following, for the sake of simplicity, we refer to the numerical rank of a matrix at accuracy $\varepsilon$ simply as its rank.
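As an illustration of Definitions 4 and 6, the sketch below computes the numerical rank from the singular values and applies the storage test $k_\varepsilon (m + n) \le mn$. The function names and the test matrix (built with prescribed singular values 1, 1e-2, 1e-4, ...) are assumptions made for the example.

```python
import numpy as np

def numerical_rank(B, eps):
    """Number of singular values above eps, i.e. the smallest k
    such that sigma_{k+1} <= eps (singular values are sorted)."""
    s = np.linalg.svd(B, compute_uv=False)
    return int(np.sum(s > eps))

def is_low_rank(B, eps):
    """Storage test of Definition 6: k_eps * (m + n) <= m * n."""
    m, n = B.shape
    return numerical_rank(B, eps) * (m + n) <= m * n

# Test matrix with prescribed singular values 1, 1e-2, 1e-4, ...
rng = np.random.default_rng(0)
m, n = 40, 30
U, _ = np.linalg.qr(rng.standard_normal((m, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
sigma = 10.0 ** (-2.0 * np.arange(n))
B = (U * sigma) @ V.T

k_eps = numerical_rank(B, 1e-5)   # singular values 1, 1e-2, 1e-4 exceed 1e-5
low = is_low_rank(B, 1e-5)
```

Here three singular values exceed the threshold, so storing X and Y takes 3 × (40 + 30) entries instead of 40 × 30.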

Compression kernels

The act of computing $\tilde{B}$ from B is called the compression of B. What are the different methods to compress B?
- SVD: optimal but expensive, $O(mn \min(m, n))$ operations
- Truncated QR factorization: slightly less accurate but much cheaper, $O(mnk_\varepsilon)$ operations; widely used
- Multiple other methods: randomized algorithms, adaptive cross-approximation, interpolative decomposition, CUR, etc.

In the following, we assume truncated QR is used as the compression kernel.

Low-rank subblocks

Frontal matrices are not low-rank, but in some applications they exhibit low-rank blocks.

A block B represents the interaction between two subdomains $\sigma$ and $\tau$. If they have a small diameter and are far away from each other, their interaction is weak and the rank is low.

The block-admissibility condition formalizes this intuition:

$$\sigma \times \tau \text{ is admissible} \iff \max(\mathrm{diam}(\sigma), \mathrm{diam}(\tau)) \le \eta \, \mathrm{dist}(\sigma, \tau)$$

[Figure: rank of the block $\sigma \times \tau$ as a function of the distance between $\sigma$ and $\tau$]
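To make the link between admissibility and rank concrete, here is a hedged sketch. It compresses the interaction block between two 1D point clusters with a greedy column-pivoted Gram-Schmidt procedure, used here as a simple stand-in for a truncated QR kernel (it is not the kernel of any particular solver). The kernel $1/|x - y|$, the cluster positions, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def admissible(sigma, tau, eta=1.0):
    """Block-admissibility test for two sorted 1D point clusters."""
    diam = max(sigma[-1] - sigma[0], tau[-1] - tau[0])
    dist = max(tau[0] - sigma[-1], sigma[0] - tau[-1], 0.0)
    return diam <= eta * dist

def truncated_pivoted_qr(B, eps):
    """Greedy pivoted Gram-Schmidt: returns X (orthonormal columns) and Y
    with B ~ X @ Y.T, stopping once every residual column has norm <= eps."""
    R = np.array(B, dtype=float)       # residual, deflated at every step
    m, n = R.shape
    Q, C = [], []
    for _ in range(min(m, n)):
        norms = np.linalg.norm(R, axis=0)
        j = int(np.argmax(norms))      # pivot: largest remaining column
        if norms[j] <= eps:
            break
        q = R[:, j] / norms[j]
        c = q @ R                      # coefficients of all columns along q
        R -= np.outer(q, c)
        Q.append(q)
        C.append(c)
    X = np.array(Q).reshape(len(Q), m).T
    Y = np.array(C).reshape(len(C), n).T
    return X, Y

sigma = np.linspace(0.0, 1.0, 50)
clusters = {"far": np.linspace(100.0, 101.0, 50),    # well separated
            "near": np.linspace(1.02, 2.02, 50)}     # almost touching
eps = 1e-8
ranks, errs, adm = {}, {}, {}
for name, tau in clusters.items():
    B = 1.0 / np.abs(sigma[:, None] - tau[None, :])
    X, Y = truncated_pivoted_qr(B, eps)
    ranks[name] = X.shape[1]
    errs[name] = np.linalg.norm(B - X @ Y.T)         # Frobenius norm
    adm[name] = admissible(sigma, tau)
```

With this stopping criterion each residual column has norm at most eps, so the Frobenius error is bounded by $\sqrt{n}\,\varepsilon$. The well-separated pair passes the admissibility test and compresses to a much smaller rank than the nearby pair.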

Block Low-Rank matrices

A BLR matrix is defined by a partition $P = S \times S$, with $S = \{\sigma_1, \ldots, \sigma_p\}$.

[Figure: BLR matrix whose rows and columns are partitioned into clusters $\sigma_1, \ldots, \sigma_6$]

- Gray blocks are non-admissible and therefore kept full-rank
- White blocks are admissible and therefore compressed to low-rank

Standard BLR factorization: FSCU (Factor, Solve, Compress, Update)

To evaluate the complexity of the BLR factorization, we must compute the cost of these four main steps.
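The four FSCU steps can be sketched on a small dense matrix as follows. This is an illustrative toy, not the solver's implementation, and it makes several simplifying assumptions: no pivoting (the test matrix is diagonally dominant), every off-diagonal block is treated as admissible, a truncated SVD replaces the truncated QR kernel, and the factors are reassembled densely so the result can be checked. All names are hypothetical.

```python
import numpy as np

def lu_nopiv(A):
    """Unpivoted LU of a dense block (assumes nonzero pivots)."""
    A = A.copy()
    n = A.shape[0]
    for k in range(n - 1):
        A[k + 1:, k] /= A[k, k]
        A[k + 1:, k + 1:] -= np.outer(A[k + 1:, k], A[k, k + 1:])
    return np.tril(A, -1) + np.eye(n), np.triu(A)

def compress(B, eps):
    """Truncated SVD: B ~ X @ Y.T, dropping singular values <= eps."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    k = int(np.sum(s > eps))
    return U[:, :k] * s[:k], Vt[:k].T

def blr_lu_fscu(A, b, eps):
    """Right-looking FSCU on a matrix whose order is a multiple of b."""
    A = A.copy()
    p = A.shape[0] // b
    blk = lambda i: slice(i * b, (i + 1) * b)
    L, U = np.zeros_like(A), np.zeros_like(A)
    ranks = []
    for k in range(p):
        K = blk(k)
        Lkk, Ukk = lu_nopiv(A[K, K])                       # Factor
        L[K, K], U[K, K] = Lkk, Ukk
        Lc, Uc = {}, {}
        for i in range(k + 1, p):
            I = blk(i)
            Lik = np.linalg.solve(Ukk.T, A[I, K].T).T      # Solve: A_ik U_kk^{-1}
            Uki = np.linalg.solve(Lkk, A[K, I])            # Solve: L_kk^{-1} A_ki
            Lc[i], Uc[i] = compress(Lik, eps), compress(Uki, eps)  # Compress
            L[I, K] = Lc[i][0] @ Lc[i][1].T
            U[K, I] = Uc[i][0] @ Uc[i][1].T
            ranks += [Lc[i][0].shape[1], Uc[i][0].shape[1]]
        for i in range(k + 1, p):                          # Update (LR-LR)
            for j in range(k + 1, p):
                (Xl, Yl), (Xu, Yu) = Lc[i], Uc[j]
                A[blk(i), blk(j)] -= Xl @ ((Yl.T @ Xu) @ Yu.T)
    return L, U, ranks

n, b, eps = 64, 16, 1e-8
idx = np.arange(n)
A = 1.0 / (1.0 + np.abs(idx[:, None] - idx[None, :])) + 10.0 * np.eye(n)
L, U, ranks = blr_lu_fscu(A, b, eps)
rel_err = np.linalg.norm(A - L @ U) / np.linalg.norm(A)
```

The LR-LR update multiplies the small inner factors first, which is where the flop savings come from; a general solver call is used for the triangular solves only to keep the sketch short.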

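Part I: complexity of the BLR factorization

Cost analysis of the involved steps

Let us consider two blocks A and B of size $b \times b$ and of rank bounded by r.

step        type     operation                             cost
Factor      FR       $A \to LU$                            $O(b^3)$
Solve       FR-FR    $B \leftarrow BU^{-1}$                $O(b^3)$
Compress    LR       $A \to \tilde{A}$                     $O(b^2 r)$
Update      FR-FR    $C \leftarrow AB$                     $O(b^3)$
Update      LR-FR    $C \leftarrow \tilde{A}B$             $O(b^2 r)$
Update      FR-LR    $C \leftarrow A\tilde{B}$             $O(b^2 r)$
Update      LR-LR    $C \leftarrow \tilde{A}\tilde{B}$     $O(b^2 r)$

This is not enough to compute the complexity: we need to bound the number of FR blocks!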

Bounding the number of FR blocks

BLR-admissibility condition of a partition P:

$$P \text{ is admissible} \iff \begin{cases} \#\{\sigma,\ \sigma \times \tau \in P \text{ is not admissible}\} \le q \\ \#\{\tau,\ \sigma \times \tau \in P \text{ is not admissible}\} \le q \end{cases}$$

[Figure: example partitions, labeled Non-Admissible and Admissible]

Main result: for any matrix, we can build an admissible P for $q = O(1)$, such that the maximal rank of the admissible blocks of A is r.

Amestoy, Buttari, L'Excellent, and Mary. On the Complexity of the Block Low-Rank Multifrontal Factorization, SIAM J. Sci. Comput., 2017.

Memory complexity of the dense BLR factorization

Let us consider a dense (frontal) matrix of order m divided into $p \times p$ blocks of order b, with $p = m/b$. The memory complexity to store the matrix can be computed as

$$M_{total}(b, p, r) = M_{FR}(b, p) + M_{LR}(b, p, r)$$

Ex. 1: compute $M_{FR}(b, p) =$ and $M_{LR}(b, p, r) =$
Ex. 2: assuming $b = O(m^x)$ and $r = O(m^\alpha)$, compute $M_{total}(m, x, \alpha) =$
Ex. 3: compute the optimal block size $b^* = O(m^{x^*})$ and the resulting optimal complexity: $x^* =$, $b^* =$, and $M_{opt}(m, r) = M_{total}(m, x^*, \alpha) =$
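One way to sanity-check answers to these exercises is to evaluate a storage model numerically. The sketch below assumes a specific model that is not spelled out on the slide: at most q full-rank blocks per block-row (b × b entries each) and every remaining block stored as $XY^T$ with X and Y of size b × r. The function name and the parameter values are illustrative.

```python
import math

def blr_memory(m, b, r, q=1):
    """Entries stored for an m x m BLR matrix with p = m / b blocks per row.

    Assumed model: q full-rank blocks per block-row (b*b entries each),
    all other blocks in low-rank form (2*b*r entries each).
    """
    p = m / b
    n_fr = q * p
    n_lr = p * p - n_fr
    return n_fr * b * b + n_lr * 2 * b * r

m, r = 10**6, 10

# Brute-force search for the block size that minimizes storage.
best_b = min(range(100, 50001), key=lambda b: blr_memory(m, b, r))

# Empirical growth exponent of the optimal storage when m is multiplied by 4.
def optimal_memory(m):
    return blr_memory(m, math.sqrt(2 * m * r), r)

exponent = math.log(optimal_memory(4 * m) / optimal_memory(m)) / math.log(4)
```

Under this model the minimizer sits close to $\sqrt{2mr}$ and the optimal storage grows roughly like $m^{3/2}$ for fixed r, which can be compared with the exponents obtained analytically in Ex. 3.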

Flop complexity of the dense BLR factorization

Let us consider a dense (frontal) matrix of order m divided into $p \times p$ blocks of order b, with $p = m/b$.

step        type     cost          number    $C_{step}(b, p, r)$    $C_{step}(m, x, \alpha)$
Factor      FR       $O(b^3)$
Solve       FR-FR    $O(b^3)$
Compress    LR       $O(b^2 r)$
Update      FR-FR    $O(b^3)$
Update      LR-FR    $O(b^2 r)$
Update      LR-LR    $O(b^2 r)$

Ex. 1: compute $C_{step}(b, p, r) = \text{cost} \times \text{number}$
Ex. 2: compute $C_{step}(m, x, \alpha)$ with $b = O(m^x)$ and $r = O(m^\alpha)$
Ex. 3: compute the total complexity (sum of all steps) $C_{total}(m, x, \alpha) =$
Ex. 4: compute the optimal block size $b^* = O(m^{x^*})$ and the resulting optimal complexity: $x^* =$, $b^* =$, and $C_{opt}(m, r) = C_{total}(m, x^*, \alpha) =$

Complexity of the sparse multifrontal BLR factorization

Sparse multifrontal complexity with ND: for a dense complexity $C_{opt}(m, r)$, the sparse complexity is computed as

$$C_{mf} = O\left( \sum_{l=0}^{\log_2 N} 2^{dl} \, C\left( \left( \frac{N}{2^l} \right)^{d-1} \right) \right),$$

where d is the dimension (2 or 3) and C denotes the dense complexity as a function of the front order.

                      operations (OPC)          factor size (NNZ)
N × N grid
  FR                  $O(N^3)$                  $O(N^2 \log N)$
  BLR                 $O(N^{5/2} r^{1/2})$      $O(N^2)$
N × N × N grid
  FR                  $O(N^6)$                  $O(N^4)$
  BLR
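The level-by-level sum can be evaluated numerically to see which power of N it produces for a given dense complexity. This is a small sketch under stated assumptions: N is a power of two, the rank r is treated as a constant, and the function names are hypothetical.

```python
import math

def multifrontal_cost(N, d, dense_cost):
    """Sum over nested-dissection levels: 2^(d*l) fronts of order (N/2^l)^(d-1)."""
    levels = int(math.log2(N))
    return sum(2 ** (d * l) * dense_cost((N / 2 ** l) ** (d - 1))
               for l in range(levels + 1))

def fitted_exponent(d, dense_cost, N1=64, N2=256):
    """Empirical exponent e such that the cost behaves like N^e between N1 and N2."""
    c1 = multifrontal_cost(N1, d, dense_cost)
    c2 = multifrontal_cost(N2, d, dense_cost)
    return math.log(c2 / c1) / math.log(N2 / N1)

r = 1.0                                  # assumption: constant rank bound
fr = lambda m: m ** 3                    # dense full-rank cost
blr = lambda m: m ** 2.5 * r ** 0.5      # dense BLR (FSCU) cost

exp_fr_3d = fitted_exponent(3, fr)       # close to 6
exp_blr_3d = fitted_exponent(3, blr)     # close to 5
exp_fr_2d = fitted_exponent(2, fr)       # close to 3
```

In each of these cases the sum is dominated by the root separator (level l = 0), which is why the sparse exponent is simply the dense exponent applied to a front of order $N^{d-1}$.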

Experimental Setting: Matrices

1. Poisson: $N^3$ grid with a 7-point stencil, with u = 1 on the boundary $\partial\Omega$:

$$\Delta u = f$$

The rank bound is theoretically proven to be $r = O(1)$.

2. Helmholtz: $N^3$ grid with a 27-point stencil; $\omega$ is the angular frequency, $v(x)$ is the seismic velocity field, and $u(x, \omega)$ is the time-harmonic wavefield solution to the forcing term $s(x, \omega)$:

$$\left( -\Delta - \frac{\omega^2}{v(x)^2} \right) u(x, \omega) = s(x, \omega)$$

$\omega$ is fixed and equal to 4Hz. Heuristically, the rank bound can be expected to behave as $r = O(N)$.

Experimental MF flop complexity: Poisson ($\varepsilon = 10^{-10}$)

[Figure: flop count versus mesh size N (64 to 320), two panels.
Nested Dissection ordering (geometric): FR fit $5n^{2.02}$, BLR (FSCU) fit $2105n^{1.45}$.
METIS ordering (purely algebraic): FR fit $3n^{2.05}$, BLR (FSCU) fit $1068n^{1.50}$.]

- good agreement with theoretical complexity
- remains close to ND complexity with METIS ordering

Experimental MF flop complexity: Helmholtz ($\varepsilon = 10^{-4}$)

[Figure: flop count versus mesh size N (64 to 320), two panels.
Nested Dissection ordering (geometric): FR fit $12n^{2.01}$, BLR (FSCU) fit $31n^{1.85}$.
METIS ordering (purely algebraic): FR fit $8n^{2.04}$, BLR (FSCU) fit $22n^{1.87}$.]

- good agreement with theoretical complexity
- remains close to ND complexity with METIS ordering

Experimental MF complexity: factor size

[Figure: factor size versus mesh size N (64 to 320), two panels.
NNZ (Poisson): FR fit $3n^{1.40}$, BLR fit $12n^{1.05} \log n$.
NNZ (Helmholtz): FR fit $15n^{1.36}$, BLR fit $32n^{1.26}$.]

- good agreement with theoretical complexity
- remains close to ND complexity with METIS ordering (not shown)

Experimental MF complexity: low-rank threshold $\varepsilon$

[Figure: flop count versus mesh size N (64 to 320), two panels.
OPC (Poisson): $\varepsilon = 10^{-14}$ fit $905n^{1.55}$; $\varepsilon = 10^{-10}$ fit $1068n^{1.50}$; $\varepsilon = 10^{-6}$ fit $1045n^{1.43}$; $\varepsilon = 10^{-2}$ fit $851n^{1.36}$.
OPC (Helmholtz): $\varepsilon = 10^{-5}$ fit $23n^{1.89}$; $\varepsilon = 10^{-4}$ fit $22n^{1.87}$; $\varepsilon = 10^{-3}$ fit $14n^{1.89}$.]

- theory states that $\varepsilon$ should only play a role in the constant factor
- true for Helmholtz, but not for Poisson: why?

Influence of zero-rank blocks on the complexity

                           N           64      128     192     256     320
$\varepsilon = 10^{-14}$   $N_{FR}$    40.8    31.3    26.4    23.6    13.4
                           $N_{LR}$    59.2    68.6    73.6    76.4    86.6
                           $N_{ZR}$    0.0     0.1     0.0     0.0     0.0
$\varepsilon = 10^{-10}$   $N_{FR}$    21.3    16.6    14.6    12.8    5.8
                           $N_{LR}$    78.6    83.4    85.4    87.1    94.2
                           $N_{ZR}$    0.0     0.1     0.0     0.0     0.0
$\varepsilon = 10^{-6}$    $N_{FR}$    2.9     3.0     2.5     2.1     0.6
                           $N_{LR}$    97.0    96.7    96.4    95.3    93.3
                           $N_{ZR}$    0.1     0.3     1.0     2.5     6.1
$\varepsilon = 10^{-2}$    $N_{FR}$    0.0     0.0     0.0     0.0     0.0
                           $N_{LR}$    26.2    12.2    7.6     5.5     3.0
                           $N_{ZR}$    73.8    87.8    92.4    94.5    97.0

Number of full-rank/low-rank/zero-rank blocks as a percentage of the total number of blocks (Poisson problem).

- $N_{FR}$ decreases with N: asymptotically negligible
- $N_{ZR}$ increases with $\varepsilon$ (as one would expect) but also with N: asymptotically dominant

Influence of the block size b on the complexity

Analysis on the root node (of size $m = N^2$):

[Figure: normalized flops (1 to 1.8) versus block size b (128 to 640), for $m = 128^2$, $m = 192^2$, and $m = 256^2$]

- large range of acceptable block sizes around the optimal $b^*$: flexibility to tune the block size for performance
- that range increases with the size of the matrix: necessity to have variable block sizes

Part II: performance of the BLR factorization

Sequential result (matrix S3)

[Figure: normalized flops and normalized time for FR and BLR, each broken down into LAI parts, Factor+Solve, Update, and Compress]

A gain of 7.7 in flops only translates into a gain of 3.3 in time: why?
- lower granularity of the Update
- higher relative weight of the FR parts
- inefficient Compress

Multithreaded result on 24 threads

[Figure: normalized time (sequential) and normalized time (multithreaded) for FR and BLR, each broken down into LAI parts, Factor+Solve, Update, and Compress]

The gain of 3.3 in sequential becomes 1.7 in multithreaded: why?
- LAI parts have become critical
- Update and Compress are memory-bound

Exploiting tree-based multithreading in MF solvers

[Figure: assembly tree with an L0 layer; node parallelism (threads 0-3 on each front) above the layer and tree parallelism below it]

L'Excellent and Sid-Lakhdar. A study of shared-memory parallelism in a multifrontal solver, Parallel Computing.

How big an impact can tree-based multithreading make?

Impact of tree-based multithreading on BLR (24 threads)

(% hai: share of higher arithmetic intensity parts; % lai: share of lower arithmetic intensity parts)

         node only          node + tree
         time    % lai      time    % lai
FR       509     21%        424     13%
BLR      307     35%        221     24%

- In FR, the top of the tree is dominant: tree MT brings little gain
- In BLR, the bottom of the tree compresses less and becomes important
- The gain of 1.7 becomes 1.9 thanks to tree-based multithreading

Theoretical speedup

                    tree only       node only       node + tree
N × N grid
  FR                $O(1)$          $O(N)$          $O(N^2)$
  BLR               $O(\log N)$     $O(\log N)$     $O(N \log N)$
N × N × N grid
  FR                $O(1)$          $O(N^3)$        $O(N^4)$
  BLR

Right-looking vs. left-looking analysis (24 threads)

           FR time         BLR time
           RL      LL      RL      LL
Update     338     336     110     67
Total      424     421     221     175

[Figure: memory access patterns. RL factorization: read once, written at each step. LL factorization: read at each step, written once.]

- Lower volume of memory transfers in LL (more critical in MT)
- Update is now less memory-bound: the gain of 1.9 becomes 2.4 in LL

LUAR variant: accumulation and recompression

FSCU (Factor, Solve, Compress, Update)

FSCU+LUAR:
- Better granularity in Update operations
- Potential recompression: asymptotic complexity reduction?
- Designed and compared several recompression strategies
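The idea of accumulating low-rank updates and recompressing them can be sketched as follows. The slides do not say which recompression strategies were compared, so the QR-plus-small-SVD scheme below is only one common choice, shown here as an assumption; the names and the synthetic updates (which share a column space so that recompression has something to gain) are hypothetical.

```python
import numpy as np

def recompress(X, Y, eps):
    """Recompress X @ Y.T (X: m x k, Y: n x k) at threshold eps.

    QR of both factors reduces the problem to an SVD of a small k x k core.
    """
    Qx, Rx = np.linalg.qr(X)
    Qy, Ry = np.linalg.qr(Y)
    U, s, Vt = np.linalg.svd(Rx @ Ry.T)
    k = int(np.sum(s > eps))
    return Qx @ (U[:, :k] * s[:k]), Qy @ Vt[:k].T

rng = np.random.default_rng(0)
m, r, n_updates = 64, 4, 4
basis = np.linalg.qr(rng.standard_normal((m, r)))[0]   # shared column space
updates = [(basis @ rng.standard_normal((r, r)), rng.standard_normal((m, r)))
           for _ in range(n_updates)]

# Accumulate: the sum of the X_i @ Y_i.T equals [X_1 ... X_4] @ [Y_1 ... Y_4].T
X_acc = np.hstack([X for X, _ in updates])
Y_acc = np.hstack([Y for _, Y in updates])

# Recompress the accumulated update before applying it to the target block.
X_new, Y_new = recompress(X_acc, Y_acc, eps=1e-10)

exact = sum(X @ Y.T for X, Y in updates)
err = np.linalg.norm(exact - X_new @ Y_new.T)
```

Accumulation turns several small outer products into one larger product (better granularity), and recompression lowers its rank when the individual updates are correlated.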

Performance of Outer Product with LUA(R) (24 threads)

[Figure: Outer Product benchmark, performance in GF/s (0 to 50) versus size of the Outer Product (0 to 100), for b=256 and b=512]

                                                LL       LUA      LUAR
average size of Outer Product                   16.5     61.0     32.8
flops ($\times 10^{12}$)    Outer Product       3.8      3.8      1.6
                            Total               10.2     10.2     8.1
time (s)                    Outer Product       21       14       6
                            Total               175      167      160

All metrics include the Recompression overhead.

Higher granularity and lower flops in Update: the gain of 2.4 becomes 2.6

Impact of machine properties on BLR: roofline model

                        specs                       time (s) for BLR factorization
                        peak (GF/s)   bw (GB/s)     RL      LL      LUA
grunch (28 threads)     37            57            248     228     196
brunch (24 threads)     46            102           221     175     167

(S3 matrix)

Arithmetic Intensity in BLR:
- LL > RL (lower volume of memory transfers)
- LUA > LL (higher granularities: more efficient cache use)

[Figure: roofline plot, performance in GF/s (0 to 50) versus Arithmetic Intensity of the Outer Product, for brunch and grunch, with the RL, LL, and LUA variants marked]

FCSU variant: compress before solve

FCSU(+LUAR):
- Restricted pivoting, e.g. to diagonal blocks
- Low-rank Solve: asymptotic complexity reduction?

Performance and accuracy of FCSU vs FSCU

                            standard pivoting        restricted pivoting
                            FR         FSCU+LUAR     FR         FSCU+LUAR    FCSU+LUAR
flops ($\times 10^{12}$)    77.97      8.15          77.97      8.15         3.95
time (s)                    424        160           404        143          111
residual                    4.5e-16    1.5e-09       5.0e-16    1.9e-09      2.7e-09

- On this problem, restricted pivoting is enough to ensure stability: better BLAS-3/BLAS-2 ratio
- Compressing before the Solve has little impact on the residual: flop reduction
- The gain of 2.6 becomes 3.7

Variants improve asymptotic complexity

We have theoretically proven that:

               FSCU                      FSCU+LUAR                  FCSU+LUAR
dense          $O(m^{5/2} r^{1/2})$      $O(m^{7/3} r^{2/3})$       $O(m^2 r)$
sparse (3D)    $O(N^5 r^{1/2})$          $O(N^{14/3} r^{2/3})$      $O(N^4 r)$

Amestoy, Buttari, L'Excellent, and Mary. On the Complexity of the Block Low-Rank Multifrontal Factorization, SIAM J. Sci. Comput., 2017.

[Figure: flop count versus mesh size N (64 to 320), two panels.
Poisson ($\varepsilon = 10^{-10}$): FSCU fit $1068n^{1.50}$, FSCU+LUAR fit $2235n^{1.42}$, FCSU+LUAR fit $6175n^{1.33}$.
Helmholtz ($\varepsilon = 10^{-4}$): FSCU fit $22n^{1.87}$, FSCU+LUAR fit $34n^{1.82}$, FCSU+LUAR fit $60n^{1.77}$.]

Multicore performance results (24 threads)

[Figure: normalized time (FR = 1) of FR, BLR, and BLR+ on matrices 5Hz, 7Hz, 10Hz, E3, E4, S3, S4, p8d, p8ar, p8cr]

- BLR: FSCU, right-looking, node only multithreading
- BLR+: FCSU+LUAR, left-looking, node+tree multithreading

Amestoy, Buttari, L'Excellent, and Mary. Performance and Scalability of the Block Low-Rank Multifrontal Factorization on Multicore Architectures, submitted to ACM Trans. Math. Soft., 2017.

The problem with FCSU

FCSU(+LUAR):
- Restricted pivoting, e.g. to diagonal blocks: not acceptable in many applications
- Low-rank Solve: asymptotic complexity reduction?

Compress before Solve + pivoting: CFSU variant

[Figure: panel made of the diagonal block $D_k$ and a low-rank block $\tilde{B} = XY^T$]

What is straightforward:
- Column swaps on $\tilde{B}$ can be performed as row swaps of Y
- Triangular solve and update can also be performed on Y

What is less straightforward:
- How to assess the quality of pivot k? We need to estimate $\|B_{:,k}\|_{\max}$: $\|\tilde{B}_{:,k}\|_{\max} \le \|\tilde{B}_{:,k}\|_2 = \|Y_{k,:}\|_2$, assuming X is orthonormal (e.g. RRQR, SVD)
- How to deal with postponed/delayed pivots? Several strategies exist to merge them with the next panel
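The two "straightforward" facts, and the norm estimate used to judge a pivot, can be verified on a small low-rank block. This is a minimal NumPy sketch; the sizes, the random factors, and the variable names are illustrative assumptions, and X is given orthonormal columns through a QR factorization.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 12, 10, 3
X, _ = np.linalg.qr(rng.standard_normal((m, k)))   # orthonormal columns
Y = rng.standard_normal((n, k))
B = X @ Y.T                                        # low-rank block, m x n

# Swapping columns i and j of B is the same as swapping rows i and j of Y.
i, j = 1, 7
B_swapped = B.copy()
B_swapped[:, [i, j]] = B_swapped[:, [j, i]]
Y_swapped = Y.copy()
Y_swapped[[i, j], :] = Y_swapped[[j, i], :]

# Column 2-norms of B are available from the rows of Y alone,
# and they bound the largest entry (in magnitude) of each column.
col_norms_from_Y = np.linalg.norm(Y, axis=1)
col_norms_of_B = np.linalg.norm(B, axis=0)
max_entries = np.abs(B).max(axis=0)
```

The bound is an upper estimate only: the largest entry of a column can be much smaller than its 2-norm, which is part of why assessing pivot quality on compressed blocks is less straightforward.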

FSCU vs FCSU vs CFSU

- FSCU: standard pivoting, compress after Solve
- FCSU: restricted pivoting, compress before Solve
- CFSU: standard pivoting, compress before Solve

[Figure: normalized flops (FR, FSCU, FCSU, CFSU) and residual (FSCU, FCSU, CFSU; log scale from $10^{-15}$ to $10^{0}$) on matrices barrier2-10, Lin, para-10, kkt_power, perf009d, perf009ar]

- When FCSU is enough (left), CFSU does not degrade compression
- When FCSU fails (right), CFSU achieves both good residual and compression

Distributed-memory parallelism

[Figure: mapping of fronts onto processes P0 to P5, showing LU messages and CB messages]

- Volume of LU messages is reduced in BLR (compressed factors)
- Volume of CB messages can be reduced by compressing the CB, but it is an overhead cost

Strong scalability analysis

[Figure: time (s), from 250 to 2000, for FR and BLR versus number of MPI processes × number of cores (30x10, 45x10, 60x10, 75x10, 90x10)]

- Compression rate is not significantly impacted by the number of processes
- Flops are reduced by a factor of 12.8 but the volume of communications only by 2.2: higher relative weight of communications
- Load imbalance (ratio between the most and least loaded processes) increases from 1.28 to 2.57

Communication analysis

[Figure: total bytes sent ($10^5$ to $10^{11}$) in LU messages and CB messages versus front size ($\times 10^4$, from 0 to 8)]

- FR case: LU messages dominate
- BLR case: CB messages dominate, hence an underwhelming reduction of communications
- CB compression allows for truly reducing the communications; it is an overhead cost but may lead to speedups depending on network speed with respect to processor speed

Theoretical communication analysis bounds

                   $W_{LU}$           $W_{CB}$         $W_{tot}$
FR                 $O(n^{4/3} p)$     $O(n^{4/3})$     $O(n^{4/3} p)$
BLR ($CB_{FR}$)
BLR ($CB_{LR}$)

Result on a very large problem

Result on matrix 15Hz (order $58 \times 10^6$, nnz $1.5 \times 10^9$) on 900 cores:

          flops    factors      memory (GB)      elapsed time (s)
          (PF)     size (TB)    avg.     max.    ana.    fac.    sol.
MUMPS     29.6     3.7          103      120     OOM     OOM     OOM
BLR       1.3      0.7          37       57      437     856     0.2/RHS
ratio     22.9     5.1          2.8      2.3

References

Amestoy, Buttari, L'Excellent, and Mary. On the Complexity of the Block Low-Rank Multifrontal Factorization, SIAM J. Sci. Comput., 2017.

Amestoy, Buttari, L'Excellent, and Mary. Performance and Scalability of the Block Low-Rank Multifrontal Factorization on Multicore Architectures, submitted to ACM Trans. Math. Soft., 2017.

Amestoy, Brossier, Buttari, L'Excellent, Mary, Métivier, Miniussi, and Operto. Fast 3D frequency-domain full waveform inversion with a parallel Block Low-Rank multifrontal direct solver: application to OBC data from the North Sea, Geophysics, 2016.

Shantsev, Jaysaval, de la Kethulle de Ryhove, Amestoy, Buttari, L'Excellent, and Mary. Large-scale 3D EM modeling with a Block Low-Rank multifrontal direct solver, Geophysical Journal International, 2017.