An Efficient Solver for Sparse Linear Systems based on Rank-Structured Cholesky Factorization David Bindel Department of Computer Science Cornell University 15 March 2016 (TSIMF) Rank-Structured Cholesky 2016-03-15 1 / 30
u = K \ f
Great for circuit simulations, 1D or 2D finite elements, etc.
Standard advice to students: just try backslash for these problems.
Standard response: what about the 3D case?
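The backslash advice can be sketched in Python, with SciPy's sparse direct solver standing in for MATLAB's backslash. The 2D Laplacian test matrix, the grid size, and the helper name `laplacian_2d` are assumptions made for illustration; they are not from the talk.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def laplacian_2d(m):
    """5-point Laplacian on an m-by-m grid (hypothetical test problem)."""
    T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(m, m))
    I = sp.identity(m)
    return (sp.kron(I, T) + sp.kron(T, I)).tocsc()

K = laplacian_2d(30)           # 900 unknowns, sparse and SPD
f = np.ones(K.shape[0])
u = spsolve(K, f)              # sparse direct solve, the analogue of u = K \ f

# Relative residual as a sanity check on the computed solution
residual = np.linalg.norm(K @ u - f) / np.linalg.norm(f)
```

For a 2D problem of this size the direct solve is essentially instantaneous, which is the point of the advice.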
Try PCG with a good preconditioner. Maybe start with the ones in PETSc. You've taken Matrix Computations, right? Blah blah yadda blah... (Not an actual student)
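A minimal sketch of the "try PCG" advice, using SciPy rather than PETSc. The 3D Laplacian test matrix and the Jacobi (diagonal) preconditioner are assumptions chosen to keep the example short; a serious run would use a stronger preconditioner.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg, LinearOperator

# Hypothetical test problem: 7-point Laplacian on an m x m x m grid
m = 12
T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(m, m))
I = sp.identity(m)
K = (sp.kron(sp.kron(I, I), T)
     + sp.kron(sp.kron(I, T), I)
     + sp.kron(sp.kron(T, I), I)).tocsr()
f = np.ones(K.shape[0])

# Jacobi preconditioner: apply the inverse of diag(K) to a residual vector
d = K.diagonal()
M = LinearOperator(K.shape, matvec=lambda r: r / d, dtype=K.dtype)

iterations = []                # one entry appended per CG iteration
u, info = cg(K, f, M=M, callback=lambda xk: iterations.append(1))

# info == 0 signals convergence to the default tolerance
rel_residual = np.linalg.norm(K @ u - f) / np.linalg.norm(f)
```

Even in this small case the iteration count depends on the problem and the preconditioner, which is the robustness concern raised against iterative methods.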
My own frustration (c. 2008)
Factorization with nested dissection
[Figure: sparsity patterns of A and L]
Cholesky: A = L L^T, where L is lower triangular.
L may have more nonzeros than A (fill).
Ordering controls fill, but the factor still needs a lot of memory in 3D.
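How strongly the ordering controls fill can be seen on a tiny "arrow" matrix. This is a toy illustration and not nested dissection itself; the matrix, the helper names, and the zero tolerance are all assumptions.

```python
import numpy as np

def arrow_matrix(n):
    """SPD arrow matrix: dense first row/column plus a diagonal (toy example)."""
    A = n * np.eye(n)
    A[0, :] = 1.0
    A[:, 0] = 1.0
    A[0, 0] = n
    return A

def factor_nnz(A, tol=1e-12):
    """Count entries of the Cholesky factor that are larger than tol in magnitude."""
    L = np.linalg.cholesky(A)
    return int(np.count_nonzero(np.abs(L) > tol))

n = 8
A = arrow_matrix(n)

# Eliminating the "hub" variable first couples all the others: L fills in completely.
nnz_hub_first = factor_nnz(A)

# Reversing the order eliminates the hub last: no fill at all.
perm = np.arange(n)[::-1]
nnz_hub_last = factor_nnz(A[np.ix_(perm, perm)])
```

With the hub first the factor has n(n+1)/2 nonzeros; with the hub last it has 2n - 1, the same count as the lower triangle of A.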
Direct or iterative?
[Figure: sparsity patterns of A and L]
Conventional wisdom (CW): Gaussian elimination scales poorly. Iterate instead!
Pro: less memory, potentially better complexity
Con: less robust, potentially worse memory patterns
Commercial finite element codes still use (out-of-core) Cholesky. Longer compute times, but fewer tech support hours.
Desiderata
I want a code for sparse Cholesky (A = L L^T) that:
Handles modest problems on a desktop (or laptop?)
Inside a loop, without trying my patience
  => Does not need gobs of memory
  => Makes effective use of level 3 BLAS
Requires little parameter fiddling / hand-holding
Works with general elliptic problems (esp. elasticity)
From ND to superfast ND...
ND gets performance using just graph structure:
  2D: O(N^{3/2}) time, O(N log N) space.
  3D: O(N^2) time, O(N^{4/3}) space.
Superfast ND reduces space/time complexity via low-rank structure.
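To get a feel for these bounds, it may help to ask what happens when N doubles. The ratios below ignore constants and lower-order terms, so they are rough growth factors under that assumption and not measured timings.

```latex
% Doubling N under the nested dissection bounds (constants ignored)
\begin{align*}
\text{2D time:}  &\quad \frac{(2N)^{3/2}}{N^{3/2}} = 2^{3/2} \approx 2.83, \\
\text{3D time:}  &\quad \frac{(2N)^{2}}{N^{2}} = 4, \\
\text{3D space:} &\quad \frac{(2N)^{4/3}}{N^{4/3}} = 2^{4/3} \approx 2.52.
\end{align*}
```

In 3D, then, doubling the problem size roughly quadruples the factorization time and multiplies the memory by about 2.5.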
Strategy
Start with CHOLMOD (a good supernodal left-looking Cholesky):
  Supernodal data structures are compact
  Algorithm + data layout = most work in level 3 BLAS
  Widely used already (so re-use the API!)
Incorporate compact representations for low-rank blocks:
  Outer product for off-diagonal blocks
  HSS-style representations for diagonal blocks
Optimize, test, swear, fix, repeat
Supernodal storage structure
[Figure: supernode j of L, showing the diagonal block L^D_j = L(C_j, C_j) and the collapsed off-diagonal block L^O_j = L(R_j, C_j)]
Supernode factorization
  U^D_j <- A(C_j, C_j)                      (initialize storage)
  U^O_j <- A(R_j, C_j)
  for each k in D_j do                      (pull Schur contributions)
    Build dense updates from L^O_k
    Scatter updates to U^D_j and U^O_j
  L^D_j <- cholesky(U^D_j)                  (finish forming L^D_j)
  L^O_j <- U^O_j (L^D_j)^{-T}
What changes in the rank-structured Cholesky?
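The same left-looking pattern can be sketched on a dense matrix. This is an illustrative analogue and not CHOLMOD's algorithm: every earlier block column is treated as a contributor, whereas a sparse supernodal code visits only the descendants D_j and scatters each update into position. The function name and the fixed block size are assumptions.

```python
import numpy as np
from scipy.linalg import solve_triangular

def blocked_left_looking_cholesky(A, block):
    """Dense left-looking Cholesky, processing one block column at a time."""
    n = A.shape[0]
    L = np.zeros((n, n))
    starts = list(range(0, n, block))
    for c0 in starts:
        c1 = min(c0 + block, n)
        # Initialize storage with the original matrix entries
        UD = A[c0:c1, c0:c1].astype(float)
        UO = A[c1:, c0:c1].astype(float)
        # Pull Schur complement contributions from earlier block columns
        for k0 in starts:
            if k0 >= c0:
                break
            k1 = min(k0 + block, n)
            Lk_C = L[c0:c1, k0:k1]      # rows of block column k in C_j
            Lk_R = L[c1:, k0:k1]        # rows of block column k below C_j
            UD -= Lk_C @ Lk_C.T
            UO -= Lk_R @ Lk_C.T
        # Factor the diagonal block, then form the off-diagonal block
        LD = np.linalg.cholesky(UD)
        L[c0:c1, c0:c1] = LD
        if c1 < n:
            # L^O = U^O (L^D)^{-T}, computed as the transpose of (L^D)^{-1} (U^O)^T
            L[c1:, c0:c1] = solve_triangular(LD, UO.T, lower=True).T
    return L

# Small SPD test matrix (hypothetical data)
rng = np.random.default_rng(0)
B = rng.standard_normal((10, 10))
A = B @ B.T + 10 * np.eye(10)
L = blocked_left_looking_cholesky(A, block=3)
```

The inner updates are matrix-matrix products, which is why this data layout puts most of the work in level 3 BLAS.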
Off-diagonal block compression
[Figure: the collapsed block L^O_j below the diagonal block L^D_j, and its compressed form L^O_j ≈ V_j U_j^T]
The collapsed off-diagonal block is a (nearly low-rank) dense matrix.
Off-diagonal block compression
  G <- rand(|R_j|, r + p)
  C <- (L^O_j)^T G
  for i = 1, ..., s do
    C <- (L^O_j) C
    C <- (L^O_j)^T C
  U_j <- orth(C)
  V_j <- L^O_j U_j
Compress without explicit L^O_j:
  Probe (L^O_j)^T with random G
  Extract orthonormal row basis U_j
  L^O_j = V_j U_j^T => V_j = L^O_j U_j
Where do we get the estimated rank bound r?
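The probing loop above can be sketched with NumPy. The block is accessed only through two callbacks, mimicking compression without an explicit L^O_j. The callback names, the Gaussian test matrix, and the choice to keep all r + p basis columns are assumptions of this sketch.

```python
import numpy as np

def compress_offdiag(apply_B, apply_Bt, n_rows, r, p=5, s=1, seed=0):
    """Randomized compression B ~ V U^T using only products with B and B^T."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((n_rows, r + p))   # random probe matrix
    C = apply_Bt(G)                            # sample the row space of B
    for _ in range(s):                         # power iterations sharpen the basis
        C = apply_B(C)
        C = apply_Bt(C)
    U, _ = np.linalg.qr(C)                     # orthonormal row basis (orth)
    V = apply_B(U)                             # V = B U, so that B ~ V U^T
    return V, U

# Hypothetical exactly rank-4 block of size 60 x 40
rng = np.random.default_rng(1)
B = rng.standard_normal((60, 4)) @ rng.standard_normal((4, 40))
V, U = compress_offdiag(lambda X: B @ X, lambda Y: B.T @ Y, n_rows=60, r=4)
error = np.linalg.norm(V @ U.T - B) / np.linalg.norm(B)
```

Storing V and U costs (60 + 40)(r + p) numbers here instead of 60 x 40 for the dense block; the saving grows with the block size when the rank stays small.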
Interaction rank
Could dynamically estimate the rank of L^O_j.
Practice: empirical rank bound α k log(k).
Optimization: Selective off-diagonal compression
[Figure: supernodes j_1, j_2, j_3]
Compress off-diagonal blocks of sufficiently large supernodes (j_1, j_2).
Optimization: Interior blocks
[Figure: interior blocks B_1, B_2, B_3, B_4 and separator supernodes j_1, j_2, j_3]
Don't store any of L^O_j for interior blocks (represent as L^O_j = A^O_j (L^D_j)^{-T} when needed).
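A sketch of applying an unstored off-diagonal block on demand, assuming the block receives no updates from other supernodes so that L^O = A^O (L^D)^{-T} holds. The function names and the small test matrix are hypothetical.

```python
import numpy as np
from scipy.linalg import solve_triangular

def apply_LO(AO, LD, X):
    """Compute L^O X = A^O (L^D)^{-T} X without forming L^O."""
    return AO @ solve_triangular(LD, X, lower=True, trans='T')

def apply_LO_transpose(AO, LD, Y):
    """Compute (L^O)^T Y = (L^D)^{-1} (A^O)^T Y without forming L^O."""
    return solve_triangular(LD, AO.T @ Y, lower=True)

# Hypothetical SPD matrix whose leading 4 x 4 block plays the role of an interior block
rng = np.random.default_rng(2)
W = rng.standard_normal((9, 9))
A = W @ W.T + 9 * np.eye(9)
nc = 4
LD = np.linalg.cholesky(A[:nc, :nc])   # diagonal block factor
AO = A[nc:, :nc]                       # original off-diagonal entries

# Applying to the identity recovers the block a full factorization would store
LO_implicit = apply_LO(AO, LD, np.eye(nc))
LO_stored = np.linalg.cholesky(A)[nc:, :nc]
```

The trade is one extra triangular solve per application in exchange for not storing the block at all.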
Diagonal block compression
L^D_j = \begin{bmatrix} \begin{bmatrix} L^D_{j,1} & 0 \\ L^D_{j,2} & L^D_{j,3} \end{bmatrix} & 0 \\ L^D_{j,4} & \begin{bmatrix} L^D_{j,5} & 0 \\ L^D_{j,6} & L^D_{j,7} \end{bmatrix} \end{bmatrix}
Basic observation: off-diagonal blocks are low-rank (H-matrix, semiseparable structure, quasiseparable structure, ...).
Assumes reasonable ordering of unknowns!
Diagonal block compression
L^D_j ≈ \begin{bmatrix} \begin{bmatrix} L^D_{j,1} & 0 \\ V^D_{j,2} (U^D_{j,2})^T & L^D_{j,3} \end{bmatrix} & 0 \\ V^D_{j,4} (U^D_{j,4})^T & \begin{bmatrix} L^D_{j,5} & 0 \\ V^D_{j,6} (U^D_{j,6})^T & L^D_{j,7} \end{bmatrix} \end{bmatrix}
How do we get directly to this without forming U^D_j explicitly?
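One level of this structure is enough to show how a compressed diagonal block is used in a triangular solve. The sketch below assumes a single 2 x 2 split with one low-rank off-diagonal block; in the nested form the two diagonal solves would recurse. The function name and test data are hypothetical.

```python
import numpy as np
from scipy.linalg import solve_triangular

def solve_compressed_lower(L1, V, U, L3, b):
    """Solve [[L1, 0], [V U^T, L3]] x = b by block forward substitution."""
    n1 = L1.shape[0]
    x1 = solve_triangular(L1, b[:n1], lower=True)
    # The off-diagonal block is applied in factored form: V (U^T x1)
    x2 = solve_triangular(L3, b[n1:] - V @ (U.T @ x1), lower=True)
    return np.concatenate([x1, x2])

# Hypothetical test data: two triangular blocks and a rank-2 coupling
rng = np.random.default_rng(3)
n1, n2, r = 6, 5, 2
L1 = np.tril(rng.standard_normal((n1, n1))) + n1 * np.eye(n1)
L3 = np.tril(rng.standard_normal((n2, n2))) + n2 * np.eye(n2)
V = rng.standard_normal((n2, r))
U = rng.standard_normal((n1, r))
b = rng.standard_normal(n1 + n2)

x = solve_compressed_lower(L1, V, U, L3, b)

# Dense reference matrix, built only to check the result
L_dense = np.block([[L1, np.zeros((n1, n2))], [V @ U.T, L3]])
```

Applying V (U^T x1) costs O((n1 + n2) r) operations instead of O(n1 n2) for the dense block.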
Forming compressed updates
[Figure: the blocks L^D_{j,1}, ..., L^D_{j,7} of the diagonal block and the set D_j]
Rank-structured supernode factorization
Basic ingredients:
  Randomized algorithms form U^D_j
  Rank-structured factorization of U^D_j
  Randomized algorithm forms L^O_j (involves solves with L^D_j)
Plus various optimizations.
Example: Large deformation of an elastic block
Example: Large deformation of an elastic block
Benchmark based on an example from deal.II:
  Nearly incompressible hyperelastic block under compression
  Mixed FE formulation (pressure and dilation condensed out)
  Tried both p = 1 and p = 2 finite elements
  Two load steps, Newton on each (14-15 steps)
Experimental setup:
  8-core Xeon X5570 with 48 GB RAM
  LAPACK/BLAS from MKL 11.0
  PCG + preconditioners from Trilinos
RSC vs standard preconditioners (p = 1, N = 50)
[Figure: relative residual versus iterations (left) and versus seconds (right), horizontal axes in units of 10^3, for the RSC, Jacobi, ICC, and ML preconditioners]
RSC vs standard preconditioners (p = 2, N = 35)
[Figure: relative residual versus iterations (left) and versus seconds (right), horizontal axes in units of 10^3, for the RSC, ICC, ML, and Jacobi preconditioners]
Time and memory comparisons (p = 1)
[Figure: solve time (s) versus n (left) for ICC, ML, Jacobi, RSC, and Cholesky; memory (GB) versus n (right) for Cholesky and RSC; n ranges up to 1.5 x 10^6]
Effect of in-separator ordering
[Figure: relative residual versus iteration (in units of 10^3) for the orderings labeled Geo., Eig., and Random]
Semi-separable diagonal blocks rely on the variable order, so we don't want any old order!
Apply recursive bisection based on spatial coordinates:
  Use coordinates if known
  Else assign spectrally
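A minimal sketch of recursive coordinate bisection for ordering the unknowns within a separator, assuming coordinates are known. The split rule (median cut along the longest axis), the leaf size, and the function name are assumptions of this sketch; the spectral fallback is not shown.

```python
import numpy as np

def recursive_bisection_order(coords, leaf_size=8):
    """Return a permutation that keeps spatially close points close in the ordering."""
    coords = np.asarray(coords, dtype=float)

    def recurse(idx):
        if len(idx) <= leaf_size:
            return list(idx)
        pts = coords[idx]
        # Split along the axis with the largest extent, at the median point
        axis = int(np.argmax(pts.max(axis=0) - pts.min(axis=0)))
        order = idx[np.argsort(pts[:, axis], kind="stable")]
        half = len(order) // 2
        return recurse(order[:half]) + recurse(order[half:])

    return np.array(recurse(np.arange(coords.shape[0])))

# Four collinear points given out of order; leaf_size=1 sorts them along the line
line = [[3.0, 0.0], [1.0, 0.0], [2.0, 0.0], [0.0, 0.0]]
line_order = recursive_bisection_order(line, leaf_size=1)

# Random points in the plane (hypothetical separator coordinates)
rng = np.random.default_rng(4)
pts = rng.random((100, 2))
perm = recursive_bisection_order(pts)
```

With an ordering like this, the two halves of each diagonal block correspond to two spatially separated groups of unknowns, which is the setting in which low-rank off-diagonal blocks are expected.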
Example: Trabecular bone model (1M dof)
[Figure: relative residual versus iterations (left) and versus seconds (right), horizontal axes in units of 10^3, for RSC1, RSC2, ML, and ICC]
Example: Steel flange (1.5M dof)
[Figure: relative residual versus iterations (left) and versus seconds (right), horizontal axes in units of 10^3, for RSC1, RSC2, ML, and ICC]
Competition
Most direct competitor is STRUMPACK (Li, Ghysels, Rouet): http://portal.nersc.gov/project/sparse/strumpack/
  Multifrontal vs left-looking (extend-add vs pull updates)
  LU vs Cholesky
  STRUMPACK is explicitly parallel
  See also Sherry Li's plenary from SIAM LA 2015
Also work by Xia and Li; HODLR (Aminfar, Ambikasaran, Darve); BLR (Amestoy, Ashcraft, Boiteau, Buttari, L'Excellent, Weisbecker); H-matrix work (Hackbusch, Bebendorf, et al.); ...
Conclusions
For more: www.cs.cornell.edu/~bindel, bindel@cs.cornell.edu
J. Chadwick and D. Bindel. An Efficient Solver for Sparse Linear Systems Based on Rank-Structured Cholesky Factorization. http://arxiv.org/abs/1507.05593
https://github.com/jeffchadwick/rank_structured_cholesky