Sparse factorization using low rank submatrices
Cleve Ashcraft, LSTC, cleve@lstc.com
MUMPS User Group Meeting, April 15-16, 2010, Toulouse, France
ftp.lstc.com:outgoing/cleve/mumps10_Ashcraft.pdf
LSTC: Livermore Software Technology Corporation
Livermore, California 94550
- founded by John Hallquist of LLNL, 1980s
- public domain versions of the DYNA and NIKE codes
- LS-DYNA: implicit/explicit, nonlinear finite element analysis code
- multiphysics capabilities: fluid/structure interaction, thermal analysis, acoustics, electromagnetics
Multifrontal Algorithm
- very large, sparse LDL^T and LDU factorizations
- a tree structure organizes factor storage, solve and factor operations
- medium-to-large sparse linear systems located at each leaf node of the tree
- medium-sized dense linear systems located at each interior node of the tree
- dense matrix-matrix operations at each interior node
- sparse matrix-matrix adds between nodes
A sketch of this control flow appears below.
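The control flow can be mocked up in a few lines. The following dense Python sketch is illustrative only, not LSTC's or MUMPS's code: it assumes a symmetric positive definite matrix, an elimination tree given as `children`, and per-node `pivots` and `boundary` index lists.

```python
import numpy as np

def multifrontal_cholesky(A, children, pivots, boundary, node, out):
    """Factor the pivot block of `node`'s front, record its factor
    panels in `out`, and return the dense update (Schur complement)
    matrix passed to the parent. Dense sketch only: a real code keeps
    A sparse and manages fronts on a stack, in core or out of core."""
    piv, bnd = pivots[node], boundary[node]
    idx = piv + bnd                            # index set of the front
    front = A[np.ix_(idx, idx)].copy()         # assemble entries of A

    # extend-add: scatter each child's update matrix into this front
    for c in children.get(node, []):
        upd, upd_idx = multifrontal_cholesky(A, children, pivots, boundary, c, out)
        pos = [idx.index(g) for g in upd_idx]  # child boundary lives inside idx
        front[np.ix_(pos, pos)] += upd

    k = len(piv)
    L11 = np.linalg.cholesky(front[:k, :k])        # factor the pivot block
    L21 = np.linalg.solve(L11, front[k:, :k].T).T  # off-diagonal panel of L
    out[node] = (L11, L21)
    return front[k:, k:] - L21 @ L21.T, bnd        # update for the parent
```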
Multifrontal tree
[figure: elimination tree over the mesh, nodes numbered 1-71]
Multifrontal tree, polar representation
[figure: the same 71-node tree drawn radially]
Multifrontal Algorithm
- 4M equations, the present frontier at LSTC
- serial, SMP, MPP, hybrid, GPU
- large problems (1M-10M+ dof) require out-of-core storage of factor entries, even on distributed memory systems
- IO cost is largely hidden during the factorization
- IO cost is dominant during the solves
- an eigensolver needs several right hand sides; many applications, e.g., Newton's method, have a single right hand side
Low Rank Approximations
- hierarchical matrices: Hackbusch, Bebendorf, Le Borne, others
- semi-separable matrices: Gu, others
- submatrices are numerically rank deficient
- method of choice for Boundary Elements (BEM), now applied to Finite Elements (FEM)
An m x n matrix A is approximated by X Y^T, where X is m x r and Y is n x r:
storage = r(m + n) vs mn; reduction in ops = r / min(m, n).
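To make the approximation concrete, here is a minimal numpy sketch that builds X and Y by a truncated SVD; the tolerance-based rank cut is an assumption of this sketch, not a prescription from the talk.

```python
import numpy as np

def low_rank_approx(A, tol=1e-14):
    """Approximate an m x n matrix A by X @ Y.T, keeping the singular
    values above tol * sigma_max; stores r(m + n) entries instead of mn."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    r = max(1, int(np.sum(s > tol * s[0])))
    X = U[:, :r] * s[:r]        # absorb the singular values into X
    Y = Vt[:r, :].T
    return X, Y
```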
Multifrontal Algorithm + Low Rank Approximations
- at each leaf node in the multifrontal tree: use the standard multifrontal kernel
- at each interior node in the multifrontal tree: low rank matrix-matrix multiplies and sums
- between nodes: low rank matrix sums
- dramatic reduction in storage
- dramatic reduction in operations
- excellent approximation properties for finite element operators
- our experience is with potential equations, and with elasticity for solids and shells
Outline
- graph, tree, matrix perspectives
- experiments: 2-D potential equation
- low rank computations
- blocking strategies
- summary
One subgraph, one subtree, one submatrix
[figure: domains Ω_I and Ω_J separated by S, with external boundary S̄]
Ordered by domains, separator and external boundary, the matrix has the block rows
  row Ω_I:  A_{Ω_I,Ω_I}  A_{Ω_I,S}  A_{Ω_I,S̄}
  row Ω_J:  A_{Ω_J,Ω_J}  A_{Ω_J,S}  A_{Ω_J,S̄}
  row S:    A_{S,Ω_I}  A_{S,Ω_J}  A_{S,S}  A_{S,S̄}
  row S̄:   A_{S̄,Ω_I}  A_{S̄,Ω_J}  A_{S̄,S}  A_{S̄,S̄}
Portion of original matrix A
[figure: spy plot, ordered by domains, separator and external boundary; nz = 952]
Portion of factor matrix L
[figure: spy plot, ordered by domains, separator and external boundary; nz = 48899]
Schur complement matrix
After eliminating the domains, the Schur complement on the separator and external boundary is
  [ Â_{S,S}   Â_{S,S̄}
    Â_{S̄,S}  Â_{S̄,S̄} ]
[figure: spy plot; nz = 2659]
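In dense numpy terms (a sketch; the `interior` and `rest` index lists stand in for the two domains and for S ∪ S̄ respectively) the Schur complement is:

```python
import numpy as np

def schur_complement(A, interior, rest):
    """A_hat = A[rest,rest] - A[rest,int] A[int,int]^{-1} A[int,rest]:
    the matrix left on the separator and external boundary after the
    domain variables are eliminated."""
    Aii = A[np.ix_(interior, interior)]
    Ari = A[np.ix_(rest, interior)]
    Air = A[np.ix_(interior, rest)]
    return A[np.ix_(rest, rest)] - Ari @ np.linalg.solve(Aii, Air)
```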
Outline
- graph, tree, matrix perspectives
- experiments: 2-D potential equation
- low rank computations
- blocking strategies
- summary
Computational experiments
- compute L_{S,S}, L_{S̄,S} and Â_{S̄,S̄}
- find domain decompositions of S and S̄
- form block matrices, e.g., L_{S̄,S} = [ L_{K,J} ], K ⊆ S̄, J ⊆ S
- find singular value decompositions of each L_{K,J}
- collect all singular values {σ} = ∪_{K,J} { σ_i^{(K,J)}, i = 1, ..., min(|K|,|J|) }
- split the matrix L_{S̄,S} = M_{S̄,S} + N_{S̄,S} using the singular values {σ}
- we want ||N_{S̄,S}||_F ≤ 10^{-14} ||L_{S̄,S}||_F
A sketch of this split appears below.
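One possible realization of the split (my sketch, not the talk's code): take the SVD of every block, pool the singular values, and discard the smallest ones whose combined energy stays under the global Frobenius budget.

```python
import numpy as np

def split_by_blocks(L, row_blocks, col_blocks, rel_tol=1e-14):
    """Split L = M + N blockwise so that ||N||_F <= rel_tol * ||L||_F:
    pool the singular values of all blocks, find a global threshold,
    then truncate every block at that threshold."""
    svals = []
    for rb in row_blocks:
        for cb in col_blocks:
            s = np.linalg.svd(L[np.ix_(rb, cb)], compute_uv=False)
            svals.extend(s * s)                      # squared singular values
    svals = np.sort(np.array(svals))
    budget = (rel_tol * np.linalg.norm(L, 'fro'))**2
    cut = np.searchsorted(np.cumsum(svals), budget)  # discard svals[:cut]
    thresh = np.sqrt(svals[cut]) if cut < len(svals) else np.inf
    M = np.zeros_like(L)
    for rb in row_blocks:
        for cb in col_blocks:
            U, s, Vt = np.linalg.svd(L[np.ix_(rb, cb)], full_matrices=False)
            r = int(np.sum(s >= thresh))
            M[np.ix_(rb, cb)] = (U[:, :r] * s[:r]) @ Vt[:r, :]
    return M, L - M
```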
255 × 255 diagonal block factor L_{S,S}
43% dense at relative accuracy 10^{-14}
[figure: spy plot of the retained block entries of L_{S,S}]
754 × 255 lower block factor L_{S̄,S}
16% dense at relative accuracy 10^{-14}
[figure: spy plot of the retained block entries of L_{S̄,S}]
754 × 754 update matrix Â_{S̄,S̄}
21% dense at relative accuracy 10^{-14}
[figure: spy plot of the retained block entries of Â_{S̄,S̄}]
255 × 255 factor matrix L_{S,S}: storage vs accuracy
[figure: log10(||A − M||_F / ||A||_F) vs fraction of dense storage, for 2×2 through 8×8 block partitions of the separator factor matrix L_{3,3}]
754 × 255 factor matrix L_{S̄,S}: storage vs accuracy
[figure: log10(||A − M||_F / ||A||_F) vs fraction of dense storage, for 3×1, 6×2, 9×3 and 12×4 block partitions of the interface-separator factor matrix L_{4,3}]
754 × 754 update matrix Â_{S̄,S̄}: storage vs accuracy
[figure: log10(||A − M||_F / ||A||_F) vs fraction of dense storage, for 2×2 through 8×8 block partitions of the Schur complement matrix Â_{4,4}]
Outline
- graph, tree, matrix perspectives
- experiments: 2-D potential equation
- low rank computations
- blocking strategies
- summary
How to compute low rank submatrices?
SVD, the singular value decomposition: A = U Σ V^T, where U and V are orthogonal and Σ is diagonal.
The gold standard: expensive, O(n^3) ops.
QR factorization with column pivoting: A P = Q R, where Q is orthogonal, R is upper triangular, and P is a permutation matrix.
The silver standard: moderate cost, O(r n^2) ops.
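A minimal scipy sketch of the silver standard; the rank cut on |R_ii| is an assumption of this sketch.

```python
import numpy as np
from scipy.linalg import qr

def rank_revealing_qr(A, tol=1e-14):
    """Column-pivoted QR: A[:, piv] = Q @ R. The numerical rank is
    read off the diagonal of R, and a rank-r approximation A ~ X @ Y.T
    is recovered with the columns restored to their original order."""
    Q, R, piv = qr(A, mode='economic', pivoting=True)
    r = max(1, int(np.sum(np.abs(np.diag(R)) > tol * abs(R[0, 0]))))
    X = Q[:, :r]
    Y = np.zeros((A.shape[1], r))
    Y[piv, :] = R[:r, :].T       # undo the column permutation
    return X, Y, r
```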
Row norms of R vs singular values of A
754 × 754 update matrix Â_{S̄,S̄}, 94 × 94 submatrices: near, mid and far interactions
[figure: ||R_{i,:}||_2 tracks σ_i for the near, mid and far submatrices]
SVD vs QR with column pivoting
Conclusions:
- the column-pivoted QR factorization does well: the row norms of R track the singular values σ
- the numerical rank from R is greater than needed, but not that much greater
- for more accuracy and less storage, two-sided orthogonal factorizations,
  A P = U L V^T (U and V orthogonal, L triangular) and
  P A Q = U B V^T (U and V orthogonal, B bidiagonal),
  track the singular values very closely
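One way to realize the ULV variant, sketched under my own assumptions: a column-pivoted QR followed by an LQ factorization of R, with |diag(L)| compared against the singular values.

```python
import numpy as np
from scipy.linalg import qr

def ulv_diag(A):
    """A P = U L V^T with U, V orthogonal and L lower triangular:
    column-pivoted QR gives A[:, piv] = Q1 @ R, and a QR of R^T gives
    R = Lt.T @ Q2.T.  Returns |diag(L)|, which tracks sigma_i closely."""
    Q1, R, piv = qr(A, mode='economic', pivoting=True)
    Q2, Lt = qr(R.T, mode='economic')   # R^T = Q2 @ Lt  =>  R = Lt.T @ Q2.T
    return np.abs(np.diag(Lt.T))        # A[:, piv] = Q1 @ Lt.T @ Q2.T
```

Here U = Q1 and V = Q2; truncating to rank r requires the trailing rows of L to be small, which the rank-revealing column pivoting encourages.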
Types of approximation of submatrices
Submatrix L_{I,J} of L_{S,S}, ||L_{I,J}||_F = 2.92 × 10^{-2}:
  factorization | numerical rank | total entries
  none          |       -        |     4900
  QR            |      20        |     2681
  ULV           |      18        |     2680
  SVD           |      18        |     2538
Submatrix L_{I,J} of L_{S̄,S}, ||L_{I,J}||_F = 2.4 × 10^{-3}:
  factorization | numerical rank | total entries
  none          |       -        |     4900
  QR            |      14        |     1896
  ULV           |      10        |     1447
  SVD           |      10        |     1410
Operations with low rank submatrices
A = U_A V_A^T, B = U_B V_B^T, C = U_C V_C^T
(see Bebendorf, Hierarchical Matrices, Chapter 1)
Multiplication: A = B C, rank(A) ≤ min(rank(B), rank(C)):
  U_A V_A^T = (U_B V_B^T)(U_C V_C^T) = U_B (V_B^T U_C) V_C^T,
  so U_A = U_B (V_B^T U_C) and V_A = V_C.
Addition: A = B + C, rank(A) ≤ rank(B) + rank(C):
  U_A V_A^T = U_B V_B^T + U_C V_C^T = [U_B U_C] [V_B V_C]^T.
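In factored form these operations never touch full matrices; a minimal numpy sketch:

```python
import numpy as np

def lr_multiply(UB, VB, UC, VC):
    """(UB @ VB.T) @ (UC @ VC.T) = (UB @ (VB.T @ UC)) @ VC.T,
    so rank(A) <= min(rank(B), rank(C)); the cost stays in the ranks."""
    return UB @ (VB.T @ UC), VC

def lr_add(UB, VB, UC, VC):
    """UB @ VB.T + UC @ VC.T = [UB UC] @ [VB VC].T,
    so rank(A) <= rank(B) + rank(C)."""
    return np.hstack([UB, UC]), np.hstack([VB, VC])
```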
Multiplication A = B C
A = U_A V_A^T, B = U_B V_B^T, C = U_C V_C^T
With U_B: m × h, V_B^T: h × l, U_C: l × k, V_C^T: k × n, the product
  A = B C = (U_B V_B^T)(U_C V_C^T)
collapses to factors U_A: m × min(h,k) and V_A^T: min(h,k) × n.
Near-near matrix product: L_{2,1} L_{2,1}^T
[figure: singular values of L_{2,1} and of L_{2,1} L_{2,1}^T]
Mid-near matrix product: L_{2,1} L_{1,1}^T
[figure: singular values of L_{2,1}, L_{1,1} and L_{2,1} L_{1,1}^T]
Far-near matrix product: L_{5,1} L_{6,1}^T
[figure: singular values of L_{5,1}, L_{6,1} and L_{5,1} L_{6,1}^T]
Mid-mid matrix product: L_{1,1} L_{1,1}^T
[figure: singular values of L_{1,1} and L_{1,1} L_{1,1}^T]
Far-mid matrix product: L_{6,1} L_{1,1}^T
[figure: singular values of the factors and of their product]
Far-far matrix product: L_{6,1} L_{6,1}^T
[figure: singular values of L_{6,1} and L_{6,1} L_{6,1}^T]
Addition A = B + C, rank(A) ≤ rank(B) + rank(C)
  U_A V_A^T = U_B V_B^T + U_C V_C^T = [U_B U_C] [V_B V_C]^T
            = (Q_1 R_1)(Q_2 R_2)^T          (QR factorizations of the stacked factors)
            = Q_1 (R_1 R_2^T) Q_2^T
            = Q_1 (Q_3 R_3) Q_2^T           (QR factorization of R_1 R_2^T)
            = (Q_1 Q_3)(R_3 Q_2^T) = (Q_1 Q_3)(Q_2 R_3^T)^T
R_1 R_2^T usually has low numerical rank; examples follow.
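A sketch of the recompression; the slide's derivation uses a QR of the core R_1 R_2^T, while a truncated SVD of that small core is substituted here for a hard rank cut.

```python
import numpy as np
from scipy.linalg import qr

def lr_add_recompress(UB, VB, UC, VC, tol=1e-14):
    """Add two low-rank matrices and recompress:
    [UB UC][VB VC]^T = (Q1 R1)(Q2 R2)^T = Q1 (R1 @ R2.T) Q2^T,
    and the small core R1 @ R2.T usually has low numerical rank."""
    Q1, R1 = qr(np.hstack([UB, UC]), mode='economic')
    Q2, R2 = qr(np.hstack([VB, VC]), mode='economic')
    U, s, Vt = np.linalg.svd(R1 @ R2.T)
    r = max(1, int(np.sum(s > tol * s[0])))
    return Q1 @ (U[:, :r] * s[:r]), Q2 @ Vt[:r, :].T
```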
Update matrix, diagonal block
Â_{3,3} = L_{3,1} L_{3,1}^T + L_{3,2} L_{3,2}^T + L_{3,3} L_{3,3}^T
[figure: log10(σ) of the near, mid and far terms and of their sum]
Update matrix, mid-distance off-diagonal block
Â_{5,3} = L_{5,1} L_{3,1}^T + L_{5,2} L_{3,2}^T + L_{5,3} L_{3,3}^T
[figure: log10(σ) of the near, mid and far terms and of their sum]
Update matrix, far-distance off-diagonal block
Â_{7,3} = L_{7,1} L_{3,1}^T + L_{7,2} L_{3,2}^T + L_{7,3} L_{3,3}^T
[figure: log10(σ) of the near, mid and far terms and of their sum]
Outline
- graph, tree, matrix perspectives
- experiments: 2-D potential equation
- low rank computations
- blocking strategies
- summary
Blocking Strategies
Active data structures:
  [ L_{J,J}
    L_{J̄,J}  Â_{J̄,J̄} ]
- partition J and J̄ independently
- for 2-d problems, J and J̄ are 1-d manifolds
- for 3-d problems, J and J̄ are 2-d manifolds
- we need a mesh partitioning of the separator and the boundary of a region J
- each index set of a partition is a segment
Segment partition
[figure: wirebasket domain decomposition of the separator and boundary into segments]
Block partitions of the active data structures:
  L_{J,J} = [ L_{σ,τ} ],  σ, τ ⊆ J
  L_{J̄,J} = [ L_{σ,τ} ],  σ ⊆ J̄, τ ⊆ J
  Â_{J̄,J̄} = [ Â_{σ,τ} ],  σ, τ ⊆ J̄
[figure: spy plots of the three blocked matrices]
Strategy I
The partition of J (local to node J) conforms to the partitions of its ancestors K.
Advantages: update assembly is simplified,
  Â^{(p(J))}_{σ2,τ2} = Â^{(p(J))}_{σ2,τ2} + Â^{(J)}_{σ1,τ1}  where σ1 ⊆ σ2, τ1 ⊆ τ2,
so there is one destination for each Â^{(J)}_{σ1,τ1}.
Disadvantages: the partition of J can be fragmented: more segments of smaller size, less efficient storage.
Strategy II
The partition of J (local to node J) need not conform to the partitions of ancestors.
Advantages: the partition of J can be optimized better, since J is small and localized.
Disadvantages: update assembly is more complex,
  Â^{(p(J))}_{σ1∩σ2, τ1∩τ2} = Â^{(p(J))}_{σ1∩σ2, τ1∩τ2} + Â^{(J)}_{σ1∩σ2, τ1∩τ2}  where σ1∩σ2 ≠ ∅, τ1∩τ2 ≠ ∅,
so there are several destinations for a submatrix Â^{(J)}_{σ1,τ1}. A scatter-add sketch appears below.
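A scatter-add sketch of the Strategy II assembly at the level of individual indices; the names and index handling are mine, not the talk's.

```python
import numpy as np

def extend_add(parent_front, parent_idx, child_update, child_idx):
    """Scatter-add a child's update matrix into the parent's front.
    With non-conforming partitions a child segment may land in several
    parent segments; mapping global indices one by one covers every
    case. Assumes parent_idx is sorted and contains child_idx."""
    pos = np.searchsorted(parent_idx, child_idx)
    parent_front[np.ix_(pos, pos)] += child_update
    return parent_front
```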
Outline
- graph, tree, matrix perspectives
- experiments: 2-D potential equation
- low rank computations
- blocking strategies
- summary
Multifrontal Factorization
[figure: multifrontal tree over the mesh]
- each leaf node is a d-dimensional sparse FEM matrix
- each interior node is a (d − 1)-dimensional dense BEM matrix
- use low rank storage and computation at each interior node
Call to action
- progress to date in low rank factorizations has been driven by iterative methods
- work is needed from the direct methods community
- start from an industrial strength multifrontal code: serial, SMP, MPP, hybrid, GPU; pivoting for stability, out-of-core, singular systems, null spaces
Call to action
Many challenges:
- partition of space, partition of interface
- added programming complexity of low rank matrices
- challenges to pivot for stability
- challenges/opportunities to implement in parallel
The payoff will be huge:
- reduction in storage footprint
- reduction in computational work
- take direct methods to the next level of problem size