Large Scale Sparse Linear Algebra


P. Amestoy (INP-N7, IRIT), A. Buttari (CNRS, IRIT), T. Mary (University of Toulouse, IRIT), A. Guermouche (Univ. Bordeaux, LaBRI), J.-Y. L'Excellent (INRIA, LIP, ENS-Lyon), B. Uçar (CNRS, LIP, ENS-Lyon), F.-H. Rouet (LSTC, Livermore, USA), C. Weisbecker (LSTC, Livermore, USA)

Main principle

Principle: build an approximate factorization A_ε = L_ε U_ε at a given accuracy ε.

Part I: asymptotic complexity reduction. Theoretical proof and experimental validation that (3D case):
- Operations: O(N^6) → O(N^5) → O(N^4)
- Memory: O(N^4) → O(N^3 log N)

Part II: efficient and scalable algorithms. How to design algorithms that efficiently translate the theoretical complexity reduction into actual performance and memory gains for large-scale systems and applications?

Impact on industrial applications

[Figure: example application models; axes Dip (km), Cross (km), Depth (km), color scale in m/s.]

- Structural mechanics: matrix of order 8M, required accuracy 10^-9.
- Seismic imaging: matrix of order 17M, required accuracy 10^-3.
- Electromagnetism: matrix of order 30M, required accuracy 10^-7.

Results on 900 cores: [Table: factorization time (s) with MUMPS vs. BLR and their ratio, and memory per process (GB) with MUMPS vs. BLR and the resulting gain (%), for the structural, seismic, and electromagnetism applications.]

Introduction

Rank and rank-k approximation

In the following, B is a dense matrix of size m × n.

Definition 1 (Rank). The rank k of B is defined as the smallest integer such that there exist matrices X and Y, of sizes m × k and n × k, such that B = XY^T.

Definition 2. We call a rank-k approximation of B at accuracy ε any matrix B̃ of rank k such that ‖B − B̃‖ ≤ ε.

Optimal rank-k approximation

Theorem 3 (Eckart-Young). Let UΣV^T be the SVD of B and let σ_i = Σ_{i,i} denote its singular values. Then B̃ = U_{1:m,1:k} Σ_{1:k,1:k} V_{1:n,1:k}^T is the optimal rank-k approximation of B, and ‖B − B̃‖_2 = σ_{k+1}.
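The theorem is easy to check numerically. Below is a minimal NumPy sketch (not from the slides; the matrix sizes are arbitrary) that builds the optimal rank-k approximation from the SVD and verifies that the 2-norm error is σ_{k+1}.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 60, 40, 10
# A matrix that is nearly rank k: a rank-k product plus small noise
B = rng.standard_normal((m, k)) @ rng.standard_normal((k, n)) \
    + 1e-6 * rng.standard_normal((m, n))

U, s, Vt = np.linalg.svd(B, full_matrices=False)   # B = U diag(s) V^T
Bk = (U[:, :k] * s[:k]) @ Vt[:k, :]                # optimal rank-k approximation

# Eckart-Young: the 2-norm error equals the (k+1)-th singular value
assert np.isclose(np.linalg.norm(B - Bk, 2), s[k])
```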

Numerical rank

Definition 4 (Numerical rank). The numerical rank Rk_ε(B) of B at accuracy ε is defined as the smallest integer k_ε such that there exists a matrix B̃ of rank k_ε with ‖B − B̃‖ ≤ ε.

Theorem 5. Let UΣV^T be the SVD of B and let σ_i = Σ_{i,i} denote its singular values. Then the numerical rank of B at accuracy ε is given by

k_ε = min { k : 1 ≤ k ≤ min(m, n), σ_{k+1} ≤ ε }.

Proof: in exercise.

Low-rank matrices

If the numerical rank of B is equal to min(m, n), then B is said to be full-rank. Conversely, if Rk_ε(B) < min(m, n), then B is said to be rank-deficient. A class of rank-deficient matrices of particular interest are low-rank matrices, defined as follows.

Definition 6 (Low-rank matrix). B is said to be low-rank (for a given accuracy ε) if its numerical rank k_ε is small enough that its rank-k_ε approximation B̃ = XY^T requires less storage than the full-rank matrix B, i.e., if k_ε (m + n) < mn. In that case, B̃ is said to be a low-rank approximation of B, and ε is called the low-rank threshold.

In the following, for the sake of simplicity, we refer to the numerical rank of a matrix at accuracy ε simply as its rank.
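A direct translation of Theorem 5 and Definition 6 into NumPy (a sketch; the helper names are ours, not the solver's):

```python
import numpy as np

def numerical_rank(B, eps):
    """Smallest k such that sigma_{k+1} <= eps (Theorem 5, 2-norm)."""
    s = np.linalg.svd(B, compute_uv=False)
    below = np.nonzero(s <= eps)[0]
    return int(below[0]) if below.size else min(B.shape)

def is_low_rank(B, eps):
    """Storage test of Definition 6: k_eps * (m + n) < m * n."""
    m, n = B.shape
    return numerical_rank(B, eps) * (m + n) < m * n
```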

Compression kernels

The act of computing B̃ from B is called the compression of B. What are the different methods to compress B?

- SVD: optimal but expensive: O(mn min(m, n)) operations.
- Truncated QR factorization: slightly less accurate but much cheaper: O(mn k_ε) operations; widely used.
- Multiple other methods: randomized algorithms, adaptive cross approximation, interpolative decomposition, CUR, etc.

In the following, we assume truncated QR is used as the compression kernel.

Low-rank subblocks

Frontal matrices are not low-rank, but in some applications they exhibit low-rank blocks. A block B represents the interaction between two subdomains σ and τ. If they have a small diameter and are far away from each other, their interaction is weak and the rank is low. The block-admissibility condition formalizes this intuition:

σ × τ is admissible ⇔ max(diam(σ), diam(τ)) ≤ η dist(σ, τ)

[Figure: the rank of the block σ × τ decreases as the distance between σ and τ increases.]
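As a concrete illustration of the truncated QR kernel, here is a sketch based on SciPy's column-pivoted QR (the function name `compress` is ours; a real O(mn k_ε) kernel would stop the factorization at rank k_ε instead of truncating a full one, and the diagonal-of-R criterion is a common heuristic, not the slides' exact rule):

```python
import numpy as np
from scipy.linalg import qr

def compress(B, eps):
    """Truncated (column-pivoted) QR compression: B ~ X @ Yt, X orthonormal."""
    Q, R, piv = qr(B, mode='economic', pivoting=True)
    k = int(np.sum(np.abs(np.diag(R)) > eps))   # truncation rank
    X = Q[:, :k]
    Yt = np.empty((k, B.shape[1]))
    Yt[:, piv] = R[:k, :]                       # undo the column permutation
    return X, Yt
```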

Block Low-Rank matrices

A BLR matrix is defined by a partition P = S × S, with S = {σ_1, ..., σ_p}.

[Figure: a BLR matrix partitioned by σ_1, ..., σ_6; gray blocks are full-rank, white blocks are low-rank.]

- Gray blocks are non-admissible and therefore kept full-rank.
- White blocks are admissible and therefore compressed to low-rank.

Standard BLR factorization: FSCU (Factor, Solve, Compress, Update). To evaluate the complexity of the BLR factorization, we must compute the cost of these four main steps.
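To fix ideas, a toy sketch of building a BLR representation (assumptions: square matrix, uniform block size b dividing the order, and the crude rule "diagonal blocks are non-admissible, all other blocks are admissible"; a real partition uses the admissibility condition above):

```python
def to_blr(A, b, eps, compress):
    """Compress every off-diagonal b x b block of A with `compress`
    (e.g. the truncated-QR kernel sketched earlier); keep diagonal
    blocks full-rank. Returns the blocks and the storage in entries."""
    p = A.shape[0] // b
    blocks, storage = {}, 0
    for i in range(p):
        for j in range(p):
            blk = A[i*b:(i+1)*b, j*b:(j+1)*b]
            if i == j:                    # non-admissible: keep full-rank
                blocks[i, j] = blk
                storage += blk.size
            else:                         # admissible: store X, Y^T
                X, Yt = compress(blk, eps)
                blocks[i, j] = (X, Yt)
                storage += X.size + Yt.size
    return blocks, storage
```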

Part I: complexity of the BLR factorization

Cost analysis of the involved steps

Let us consider two blocks A and B of size b × b and of rank bounded by r.

step      type    operation        cost
Factor    FR      A ← LU           O(b^3)
Solve     FR-FR   B ← B U^{-1}     O(b^3)
Compress  LR      A → Ã            O(b^2 r)
Update    FR-FR   C ← C − AB       O(b^3)
Update    LR-FR   C ← C − ÃB       O(b^2 r)
Update    FR-LR   C ← C − AB̃      O(b^2 r)
Update    LR-LR   C ← C − ÃB̃      O(b^2 r)

This is not enough to compute the complexity: we need to bound the number of FR blocks!

Bounding the number of FR blocks

BLR-admissibility condition of a partition P:

P is admissible ⇔ for every σ and every τ: #{τ : σ × τ ∈ P is not admissible} ≤ q and #{σ : σ × τ ∈ P is not admissible} ≤ q

[Figure: a non-admissible vs. an admissible partition.]

Main result: for any matrix, we can build an admissible P for q = O(1), such that the maximal rank of the admissible blocks of A is r.

Amestoy, Buttari, L'Excellent, and Mary. On the Complexity of the Block Low-Rank Multifrontal Factorization, SIAM J. Sci. Comput., 2017.

Memory complexity of the dense BLR factorization

Let us consider a dense (frontal) matrix of order m divided into p × p blocks of order b, with p = m/b. The memory complexity to store the matrix can be computed as

M_total(b, p, r) = M_FR(b, p) + M_LR(b, p, r)

Ex. 1: compute M_FR(b, p) and M_LR(b, p, r).
Ex. 2: assuming b = O(m^x) and r = O(m^α), compute M_total(m, x, α).
Ex. 3: compute the optimal block size b = O(m^x) and the resulting optimal complexity M_opt(m, r).

(A worked sketch of these exercises is given below.)
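A sketch of one way to carry out the exercises, consistent with the O(m^{3/2} r^{1/2}) bound proven in the SIAM paper cited above:

```latex
% Ex. 1: q = O(1) full-rank blocks per block-row, p block-rows of b^2 entries:
M_{FR}(b,p) = O(p\,b^2) = O(mb); \qquad
% all O(p^2) admissible blocks stored as X Y^T (2br entries each):
M_{LR}(b,p,r) = O(p^2\,br) = O(m^2 r / b).
% Ex. 2: with b = O(m^x) and r = O(m^\alpha):
M_{total}(m,x,\alpha) = O(m^{1+x} + m^{2+\alpha-x}).
% Ex. 3: the two terms balance when 1+x = 2+\alpha-x, i.e. x = (1+\alpha)/2,
% so b = O(\sqrt{mr}) and M_{opt}(m,r) = O(m^{3/2} r^{1/2}).
```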

Flop complexity of the dense BLR factorization

Let us consider a dense (frontal) matrix of order m divided into p × p blocks of order b, with p = m/b.

step      type    cost       number   C_step(b, p, r)   C_step(m, x, α)
Factor    FR      O(b^3)
Solve     FR-FR   O(b^3)
Compress  LR      O(b^2 r)
Update    FR-FR   O(b^3)
Update    LR-FR   O(b^2 r)
Update    LR-LR   O(b^2 r)

Ex. 1: compute C_step(b, p, r) = cost × number.
Ex. 2: compute C_step(m, x, α) with b = O(m^x) and r = O(m^α).
Ex. 3: compute the total complexity (sum of all steps) C_total(m, x, α).
Ex. 4: compute the optimal block size b = O(m^x) and the resulting optimal complexity C_opt(m, r).

Complexity of the sparse multifrontal BLR factorization

Sparse multifrontal complexity with ND: for a dense complexity C_opt(m, r), the sparse complexity is computed as

C_mf = O( Σ_{l=0}^{log₂ N} 2^{dl} C_opt( (N/2^l)^{d−1} ) ),

where d is the dimension (2 or 3). (A numerical check of this sum is sketched after the table below.)

                    operations (OPC)        factor size (NNZ)
N × N grid      FR  O(N^3)                  O(N^2 log N)
                BLR O(N^{5/2} r^{1/2})      O(N^2)
N × N × N grid  FR  O(N^6)                  O(N^4)
                BLR (exercise)              (exercise)
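A small numerical check of the nested-dissection sum (a sketch; taking the dense cost C(m) = m^2, i.e. the best dense BLR cost with r = O(1), is an assumption made only for illustration):

```python
import numpy as np

def C_mf(N, d, C_dense):
    """Sparse multifrontal cost from a dense per-front cost, via the
    nested dissection sum of the slide above."""
    L = int(np.log2(N))
    return sum(2**(d*l) * C_dense((N / 2**l)**(d - 1)) for l in range(L + 1))

Ns = np.array([32, 64, 128, 256])
costs = np.array([C_mf(N, 3, lambda m: m**2) for N in Ns])
slope = np.polyfit(np.log(Ns), np.log(costs), 1)[0]
print(f"fitted exponent: {slope:.2f}")   # close to 4, i.e. O(N^4)
```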

Experimental setting: matrices

1. Poisson: N^3 grid with a 7-point stencil, Δu = f on Ω, with u = 1 on the boundary. The rank bound is theoretically proven to be r = O(1). (A sketch of assembling this matrix is given at the end of this slide.)

2. Helmholtz: N^3 grid with a 27-point stencil, where ω is the angular frequency, v(x) is the seismic velocity field, and u(x, ω) is the time-harmonic wavefield solution to the forcing term s(x, ω):

(−Δ − ω²/v(x)²) u(x, ω) = s(x, ω)

ω is fixed and equal to 4 Hz. Heuristically, the rank bound can be expected to behave as r = O(N).

Experimental MF flop complexity: Poisson (ε = 10^-10)

[Plots: flop count vs. mesh size N. Nested Dissection ordering (geometric): fits FR: 5n^2.02, BLR (FSCU): 2105n^1.xx. METIS ordering (purely algebraic): fits FR: 3n^2.05, BLR (FSCU): 1068n^1.50.]

- Good agreement with the theoretical complexity.
- Remains close to the ND complexity with the METIS ordering.
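For reference, the 7-point Poisson test matrix from the experimental setting above can be assembled with SciPy as follows (a sketch; the Dirichlet boundary handling is simplified):

```python
import scipy.sparse as sp

def poisson3d(N):
    """7-point finite-difference Laplacian on an N x N x N grid."""
    I = sp.identity(N, format='csr')
    T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(N, N), format='csr')
    return (sp.kron(sp.kron(T, I), I)
            + sp.kron(sp.kron(I, T), I)
            + sp.kron(sp.kron(I, I), T)).tocsr()

A = poisson3d(16)   # order N^3 = 4096, at most 7 nonzeros per row
```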

Experimental MF flop complexity: Helmholtz (ε = 10^-4)

[Plots: flop count vs. mesh size N. Nested Dissection ordering (geometric): fits FR: 12n^2.xx, BLR (FSCU): 31n^1.xx. METIS ordering (purely algebraic): fits FR: 8n^2.xx, BLR (FSCU): 22n^1.87.]

- Good agreement with the theoretical complexity.
- Remains close to the ND complexity with the METIS ordering.

Experimental MF complexity: factor size

[Plots: factor size (NNZ) vs. mesh size N. Poisson: fits FR: 3n^1.40, BLR: 12n^1.05 log n. Helmholtz: fits FR: 15n^1.xx, BLR: 32n^1.xx.]

- Good agreement with the theoretical complexity.
- Remains close to the ND complexity with the METIS ordering (not shown).

Experimental MF complexity: low-rank threshold ε

[Plots: flop count (OPC) vs. mesh size N. Poisson fits: ε = 10^-14: 905n^1.xx, ε = 10^-10: 1068n^1.50, ε = 10^-6: 1045n^1.43, ε = 10^-2: 851n^1.xx. Helmholtz fits: ε = 10^-5: 23n^1.89, ε = 10^-4: 22n^1.87, ε = 10^-3: 14n^1.xx.]

Theory states that ε should only play a role in the constant factor. This holds for Helmholtz, but not for Poisson: why?

Influence of zero-rank blocks on the complexity

[Table: number of full-rank (N_FR), low-rank (N_LR), and zero-rank (N_ZR) blocks, as a percentage of the total number of blocks, for several mesh sizes N and thresholds ε ∈ {10^-14, 10^-10, 10^-6, 10^-2} (Poisson problem).]

- N_FR decreases with N: asymptotically negligible.
- N_ZR increases with ε (as one would expect) but also with N: asymptotically dominant.

Influence of the block size b on the complexity

Analysis on the root node (of size m = N^2):

[Plot: normalized flops vs. block size b, for three root-node sizes m.]

- There is a large range of acceptable block sizes around the optimal b: flexibility to tune the block size for performance.
- That range increases with the size of the matrix: necessity to have variable block sizes.

Part II: performance of the BLR factorization

Sequential result (matrix S3)

[Bar charts: normalized flops and normalized time, FR vs. BLR, broken down into LAI (low arithmetic intensity) parts, Factor+Solve, Update, and Compress.]

The 7.7× gain in flops only translates into a 3.3× gain in time: why?
- lower granularity of the Update;
- higher relative weight of the FR parts;
- inefficient Compress.

Multithreaded result on 24 threads

[Bar charts: normalized sequential and multithreaded time, FR vs. BLR, with the same breakdown.]

The 3.3× gain in sequential becomes 1.7× in multithreaded: why?
- the LAI parts have become critical;
- the Update and Compress are memory-bound.

Exploiting tree-based multithreading in MF solvers

[Figure: node parallelism above the L0 layer, tree parallelism below, with threads 0-3 assigned per subtree.]

L'Excellent and Sid-Lakhdar. A study of shared-memory parallelism in a multifrontal solver, Parallel Computing, 2014.

How big an impact can tree-based multithreading make?

Impact of tree-based multithreading on BLR (24 threads)

[Table: time and percentage of low-AI (lai) work, for FR and BLR, under node-only and node + tree multithreading.]

- In FR, the top of the tree is dominant: tree multithreading brings little gain.
- In BLR, the bottom of the tree compresses less and becomes important: the 1.7× gain becomes 1.9× thanks to tree-based multithreading.

Theoretical speedup

                    tree only   node only   node + tree
N × N grid      FR  O(1)        O(N)        O(N^2)
                BLR O(log N)    O(log N)    O(N log N)
N × N × N grid  FR  O(1)        O(N^3)      O(N^4)
                BLR (exercise)

Right-looking vs. left-looking analysis (24 threads)

[Table: FR and BLR Update and total times for the right-looking (RL) and left-looking (LL) factorizations. Figures: in RL, factored panels are read once and the trailing matrix is written at each step; in LL, previous panels are read at each step and the current panel is written once.]

- Lower volume of memory transfers in LL (more critical in multithreaded runs).
- The Update is now less memory-bound: the 1.9× gain becomes 2.4× in LL. (A code sketch of the two access patterns is given at the end of this slide.)

LUAR variant: accumulation and recompression

FSCU (Factor, Solve, Compress, Update) → FSCU+LUAR (Low-rank Updates Accumulation and Recompression):
- better granularity in the Update operations;
- potential recompression: asymptotic complexity reduction?

We designed and compared several recompression strategies.
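To make the RL/LL distinction concrete, here is a self-contained sketch of both variants of a blocked LU factorization (an illustration of the access patterns, not the solver's actual kernels; no pivoting, so the test matrix is made diagonally dominant):

```python
import numpy as np
from scipy.linalg import solve_triangular

def lu_unblocked(A):
    """In-place LU without pivoting; unit-lower L and U share A's storage."""
    for i in range(A.shape[0] - 1):
        A[i+1:, i] /= A[i, i]
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])

def blocked_lu(A, b, left_looking):
    """Same arithmetic, different data access: RL pushes each panel's updates
    onto the whole trailing matrix (written at each step); LL pulls all
    pending updates into the current block column (written once)."""
    n = A.shape[0]
    for K in range(0, n, b):
        k, e = slice(K, K + b), slice(K + b, n)
        if left_looking:                  # pull updates into block column K
            for J in range(0, K, b):
                j, below = slice(J, J + b), slice(J + b, n)
                A[j, k] = solve_triangular(A[j, j], A[j, k],
                                           lower=True, unit_diagonal=True)
                A[below, k] -= A[below, j] @ A[j, k]
        lu_unblocked(A[k, k])                                  # Factor
        if K + b < n:
            A[e, k] = solve_triangular(A[k, k], A[e, k].T,     # L panel
                                       lower=False, trans='T').T
            if not left_looking:          # push updates onto trailing matrix
                A[k, e] = solve_triangular(A[k, k], A[k, e],
                                           lower=True, unit_diagonal=True)
                A[e, e] -= A[e, k] @ A[k, e]

n, b = 12, 4
A0 = np.random.default_rng(1).standard_normal((n, n)) + n * np.eye(n)
A_rl, A_ll = A0.copy(), A0.copy()
blocked_lu(A_rl, b, left_looking=False)
blocked_lu(A_ll, b, left_looking=True)
assert np.allclose(A_rl, A_ll)   # identical factors, different memory traffic
```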

Performance of Outer Product with LUA(R) (24 threads)

[Plot: Outer Product benchmark, GF/s vs. size of the outer product, for two block sizes (b = 256, …). Table: average size of the outer product, flops, and time (s) of the Outer Product and of the total factorization, for the LL, LUA, and LUAR variants; all metrics include the recompression overhead.]

- Higher granularity and lower flops in the Update: the 2.4× gain becomes 2.6×.

Impact of machine properties on BLR: roofline model

[Table: peak GF/s and bandwidth GB/s of the grunch (28 threads) and brunch (24 threads) machines, and time (s) of the RL, LL, and LUA BLR factorizations of the S3 matrix.]

Arithmetic intensity in BLR:
- LL > RL (lower volume of memory transfers);
- LUA > LL (higher granularities, more efficient cache use).

[Plot: arithmetic intensity of the outer product for RL, LL, and LUA on brunch and grunch.]
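The roofline argument in one line (a sketch; the machine numbers below are made up, since the grunch/brunch specs are only referenced, not reproduced, above):

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Roofline model: performance is capped by min(peak, AI x bandwidth)."""
    return min(peak_gflops, flops_per_byte * bandwidth_gbs)

# A kernel must raise its arithmetic intensity (AI) to leave the
# bandwidth-bound regime, which is what LL and LUA do for the Update:
for ai in (0.5, 2.0, 8.0, 32.0):
    print(f"AI={ai:5.1f} -> {attainable_gflops(800.0, 60.0, ai):6.1f} GF/s")
```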

FCSU variant: compress before solve

FSCU (Factor, Solve, Compress, Update) → FSCU+LUAR:
- better granularity in the Update operations;
- potential recompression: asymptotic complexity reduction?

→ FCSU(+LUAR):
- restricted pivoting, e.g. to the diagonal blocks;
- low-rank Solve: asymptotic complexity reduction?

Performance and accuracy of FCSU vs. FSCU

[Table: flops, time (s), and residual for FR and FSCU+LUAR under standard pivoting, and FR, FSCU+LUAR, and FCSU+LUAR under restricted pivoting.]

- On this problem, restricted pivoting is enough to ensure stability, and gives a better BLAS-3/BLAS-2 ratio.
- Compressing before the Solve has little impact on the residual.
- Thanks to the flop reduction, the 2.6× gain becomes 3.7×.

Variants improve the asymptotic complexity

We have theoretically proven that:

              FSCU                 FSCU+LUAR             FCSU+LUAR
dense         O(m^{5/2} r^{1/2})   O(m^{7/3} r^{2/3})    O(m^2 r)
sparse (3D)   O(N^5 r^{1/2})       O(N^{14/3} r^{2/3})   O(N^4 r)

Amestoy, Buttari, L'Excellent, and Mary. On the Complexity of the Block Low-Rank Multifrontal Factorization, SIAM J. Sci. Comput., 2017.

[Plots: flop count vs. mesh size N. Poisson (ε = 10^-10): fits FSCU: 1068n^1.50, FSCU+LUAR: 2235n^1.42, FCSU+LUAR: 6175n^1.xx. Helmholtz (ε = 10^-4): fits FSCU: 22n^1.87, FSCU+LUAR: 34n^1.82, FCSU+LUAR: 60n^1.xx.]

Multicore performance results (24 threads)

[Bar chart: normalized time (FR = 1) of BLR and BLR+ on the matrices …Hz, 7Hz, 10Hz, E3, E4, S3, S4, p8d, p8ar, p8cr.]

- BLR: FSCU, right-looking, node-only multithreading.
- BLR+: FCSU+LUAR, left-looking, node + tree multithreading.

Amestoy, Buttari, L'Excellent, and Mary. Performance and Scalability of the Block Low-Rank Multifrontal Factorization on Multicore Architectures, submitted to ACM Trans. Math. Soft., 2017.

The problem with FCSU

FSCU → FSCU+LUAR → FCSU(+LUAR): the restricted pivoting (e.g. to the diagonal blocks) that FCSU requires is not acceptable in many applications.

Compress before Solve + pivoting: CFSU variant

[Figure: pivoting within a panel compressed as B ≈ XY^T, with candidate pivot column k.]

What is straightforward:
- column swaps on B can be performed as row swaps of Y;
- the triangular solve and update can also be performed on Y.

What is less straightforward:
- How to assess the quality of pivot k? We need to estimate ‖B_{:,k}‖_max, which can be done from Y alone assuming X is orthonormal (e.g., from RRQR or SVD).
- How to deal with postponed/delayed pivots? Several strategies to merge them with the next panel.
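A small NumPy check (not from the slides) of the "straightforward" part: with B = XY^T and X orthonormal, column swaps and column norms of B can be handled entirely on Y.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, k = 30, 20, 5
X = np.linalg.qr(rng.standard_normal((m, k)))[0]   # orthonormal basis
Y = rng.standard_normal((n, k))
B = X @ Y.T

perm = rng.permutation(n)
# Column swaps on B = X Y^T are row swaps on Y
assert np.allclose(B[:, perm], X @ Y[perm, :].T)

# Column 2-norms of B are readable from the rows of Y alone, which is the
# kind of information CFSU uses to assess pivot quality without forming B
assert np.allclose(np.linalg.norm(B, axis=0), np.linalg.norm(Y, axis=1))
```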

FSCU vs. FCSU vs. CFSU

- FSCU: standard pivoting, compress after the Solve.
- FCSU: restricted pivoting, compress before the Solve.
- CFSU: standard pivoting, compress before the Solve.

[Bar charts: normalized flops and residual of FR, FSCU, FCSU, and CFSU on the matrices barrier2-10, Lin, para-10, kkt_power, perf009d, perf009ar.]

- When FCSU is enough (left group), CFSU does not degrade the compression.
- When FCSU fails (right group), CFSU achieves both a good residual and good compression.

Distributed-memory parallelism

[Figure: 2D block-cyclic mapping of a front over processes P0-P5, with LU messages (factor panels) and CB messages (contribution block).]

- The volume of LU messages is reduced in BLR (compressed factors).
- The volume of CB messages can be reduced by compressing the CB, but this is an overhead cost.

Strong scalability analysis

[Plot: time (s) of the FR and BLR factorizations vs. number of MPI processes × number of cores, up to 90×10.]

- The compression rate is not significantly impacted by the number of processes.
- Flops are reduced by 12.8× but the volume of communications only by 2.2×: higher relative weight of the communications.
- Load imbalance (ratio between the most and least loaded processes) increases from 1.28 to 2.57.

Communication analysis

[Plot: total bytes sent in LU and CB messages as a function of the front size.]

- FR case: the LU messages dominate.
- BLR case: the CB messages dominate: underwhelming reduction of the communications.
- Compressing the CB allows truly reducing the communications; it is an overhead cost but may lead to speedups depending on the network speed relative to the processor speed.

Theoretical communication analysis bounds

              W_LU           W_CB          W_tot
FR            O(n^{4/3} p)   O(n^{4/3})    O(n^{4/3} p)
BLR (CB FR)   (exercise)
BLR (CB LR)   (exercise)

Result on a very large problem

Result on matrix 15Hz (order …, nnz …) on 900 cores:

[Table: flops (PF), factor size (TB), average and maximum memory per process (GB), and elapsed time (s) for the analysis, factorization, and solve (per RHS); full-rank MUMPS runs out of memory (OOM), while BLR completes.]

References

Amestoy, Buttari, L'Excellent, and Mary. On the Complexity of the Block Low-Rank Multifrontal Factorization, SIAM J. Sci. Comput., 39(4):A1710-A1740, 2017.
Amestoy, Buttari, L'Excellent, and Mary. Performance and Scalability of the Block Low-Rank Multifrontal Factorization on Multicore Architectures, submitted to ACM Trans. Math. Soft., 2017.
Amestoy, Brossier, Buttari, L'Excellent, Mary, Métivier, Miniussi, and Operto. Fast 3D frequency-domain full waveform inversion with a parallel Block Low-Rank multifrontal direct solver: application to OBC data from the North Sea, Geophysics, 2016.
Shantsev, Jaysaval, de la Kethulle de Ryhove, Amestoy, Buttari, L'Excellent, and Mary. Large-scale 3D EM modeling with a Block Low-Rank multifrontal direct solver, Geophysical Journal International, 209(3):1558-1571, 2017.
