A dissection solver with kernel detection for unsymmetric matrices in FreeFem++

Size: px

Start display at page:

Download "A dissection solver with kernel detection for unsymmetric matrices in FreeFem++"

Cecilia Stanley
5 years ago
Views:

1 . p.1/21 11 Dec. 2014, LJLL, Paris FreeFem++ workshop A dissection solver with kernel detection for unsymmetric matrices in FreeFem++ Atsushi Suzuki Atsushi.Suzuki@ann.jussieu.fr Joint work with François-Xavier Roux, LJLL-UPMC, ONERA,

2 . p.2/21 Sparse direct solvers Software parallel elimination data management kernel environment strategy and pivoting detection SuperLU_MT shared super-nodal dynamic data + pivoting no Pardiso shared super-nodal dynamic data + ε-perturbation no + pivoting SuperLU_DIST distributed super-nodal static data + ε-perturbation none MUMPS distributed multi-frontal dynamic data + pivoting yes J. W. Demmel, S. C. Eisenstat, J. R. Gilbert, X. S. Li, J. W. H. Liu. A supernodal approach to sparse partial pivoting, SIAM Journal on Matrix Analysis and Applications, 20 (1999), O. Schenk, K. Gärtner. Solving unsymmetric sparse systems of liner equations with PARDISO, Future Generation of Computer Systems, 20 (2004), X. S. Li, J. W. Demmel. SuperLU_DIST : A scalable distributed-memory sparse direct solver for unsymmetric linear systems, ACM Transactions on Mathematical Software, 29 (200), P. R. Amestoy, I. S. Duff, J.-Y. L Execellent. Mutlifrontal parallel distributed symmetric and unsymmetric solvers, Computer Methods in Applied Mechanics and Engineering, 184 (2000)

3 Sparse direct solvers Software parallel elimination data management kernel environment strategy and pivoting detection SuperLU_MT shared super-nodal dynamic data + pivoting no Pardiso shared super-nodal dynamic data + ε-perturbation no + pivoting SuperLU_DIST distributed super-nodal static data + ε-perturbation none MUMPS distributed multi-frontal dynamic data + pivoting yes Dissection shared multi-frontal static data + pivoting yes J. W. Demmel, S. C. Eisenstat, J. R. Gilbert, X. S. Li, J. W. H. Liu. A supernodal approach to sparse partial pivoting, SIAM Journal on Matrix Analysis and Applications, 20 (1999), O. Schenk, K. Gärtner. Solving unsymmetric sparse systems of liner equations with PARDISO, Future Generation of Computer Systems, 20 (2004), X. S. Li, J. W. Demmel. SuperLU_DIST : A scalable distributed-memory sparse direct solver for unsymmetric linear systems, ACM Transactions on Mathematical Software, 29 (200), P. R. Amestoy, I. S. Duff, J.-Y. L Execellent. Mutlifrontal parallel distributed symmetric and unsymmetric solvers, Computer Methods in Applied Mechanics and Engineering, 184 (2000) A. Suzuki, F.-X. Roux, A dissection solver with kernel detection for symmetric finite element matrices on shared memory computers, International Journal for Numerical Methods in Engineering, 100 (2014) p./21

4 . p.4/21 Overview of Dissection solver Dissection based on graph decomposition of sparse matrix 1 dense solver a 5 b e 7 f 2 dense solver c a 5 b c 6 d e 7 f dense solver sparse solver d each leaf can be computed in parallel sparse solver (tridiag) is written with Fortran90 by F.-X. Roux? load unbalance? different sizes of subdomains? parallel computation of higher levels : smaller subdomains with many cores/processors?

5 . p.5/21 Overview : recursive generation of Schur complement 8 9 a b c d e f aa a5 a bb b5 b2 b1 Schur complement cc dd c6 d6 d c1 d1 by sparse solver ee e7 e e a 5b ff f f Schur complement by dense solver 6c 6d Schur complement a 2b 7e 7f by dense solver d e f b 1c 1d 1e dense factorization there is no 6 block in sparse matrix. 6 in Schur complement has values as result of fill-in. This is analyzed in symbolic phase.

6 . p.6/21 Overview : graph decomposition by METIS/SCOTCH matrix size = 206,76, bisection level = 8, sub-blocks = 255 METIS SCOTCH number of nonzeros in dense blocks METIS SCOTCH diag 82,659,78 66,956,787 off-diag 18,097, ,858,81 total 265,79,487 22,815,168

7 . p.7/21 Block factorization and block pivot strategy block pivot strategy is employed for parallel efficiency Π T 1 Π T L L 21 L D D U 1 1 U 1 2 U U 2 2 U Π Π Π T L 1 L 2 L D U Π k k LDU LDU LDU 2 LDU 2 LDU LDU When null-pivots appear in a block, they are sent to the last, and corresponding rows and columns are nullified for rank (n k) update. 0

8 . p.8/21 Dissection solver : parallelization of dense blocks T LDL factorization A 11 A 12 A 1 A 14 dtrsm + dcaling A 22 A 2 A 24 A A 4 A 22 A 2 A 24 A A 4 -= A 21 A 1 A 11-1 A 12 A 1 A 14 A 44 A 44 A 41 dgemm α (1) : factorization L 11 D 11 L T 11 = A(1) 11 β (1) k : DTRSM + scaling D 1 11 (L 1 11 A(1) 1 k ) γ (1) k, l : DGEMM A k, l ---= (L 1 11 A 1 k) T D 1 11 (L 1 11 A(1) 1 l ) α (2) : factorization L 22 D 22 L T 22 = A(2) 22 α (1) {β (1) 2, β(1), β(1) 4 } {γ(1) 2,2, γ(1) 2,, γ(1),,,γ(1), β(2) 4 } {γ(2),, γ(2) - updating of Schur complement depends on factorization sequential among levels 4,4 } α(2) {β (2) α (1) {β (1) 2 -γ(1) 2,2 -α(2), β (1), β(1) 4 } {γ(1) 2,, γ(1),,,γ(1) 4,4 } {β(2) -γ(2), -α(), β (2) 4 } {γ(2) + factorization of diagonal block // updating of rest of Schur complement,4, γ(2) 4,4 },4, γ(2) 4,4 }

9 . p.9/21 Task scheduling in Dissection solver n = 206, 76 4?,5?,6?,7? 2?,?,1? DTRSM DGEMM SUBTR α β γ Westmere Xeon 5680@.GHz, 6core x2, POSIX threads 2?,?,1? DTRSM α 21,1 DGEMM SUBTR β γ factorization 22, factorization 11 αβγ αβγ computation of Schur complement in level, factorization in level 2 : scheduled together. tasks on critical path : scheduled statically other tasks : executed in greedy way with large group using predicted complexity + fine greedy

10 . p.10/21 Performance comparison N =1,004,784 symmetric real matrix : 2.0GHz Dissection Intel Pardiso MUMPS + parablas core CPU elapsed / CPU elapsed / CPU elapsed / 1 5, , ,40.6 5,41.1 5, , ,64.7 2, , , ,547.5, , , ,40.9 1, , , , , , , , , , , , , ,88.7 1, automatic parallelization of BLAS by OpenMP is not efficient + Dissection uses coarse grain parallelization by Pthreads # core GFlop/s time for parallel tasks time for the numerical factorization elapsed time idle time of cores elapsed time CPU time GFlops /159.84GFlops = 60% : running on NUMA is another challenge

11 . p.11/21 Task scheduling and performance of Dissection solver n = 206, 76 Westmere Xeon 5680@.GHz, 6core x2, POSIX threads LDLt LinvU fdhlfs DTRSM fdgemm sdgemm DSUB2d DSUB2o DSUBmd DSUBmo Dfilld Dfillo SpLDLt SpSchr dealloc GFlop/s

12 . p.12/21 Detection of the kernel of matrix in Dissection solver kernel of stiffness matrix coarse space for domain decomposition theoretically the last block is singular for semi-positive definite matrices numerically all blocks may be nearly singular ε > 0 : given threshold for null pivot a i i / a i 1 i 1 < ε a i i is null pivot regular entry candidates of null pivot 11 Schur complement matrix from suspicious null pivots and additional nodes = the algorithm to compute dimension of the kernel

13 How to determine dimension of the kernel? 2 4Ã11 Ã 12 Ã 21 Ã 22 5 = 2 4Ã11 0 Ã 21 S I 1 Ã 1 0 I 2 11 Ã12 5 S22 = 0 KerÃ = 2 4Ã 1 11 Ã12 I 2 5 numerical example: symmetric semi-positive definite, m + k = = 10 by Householder-QR factorization: Matrix Computation; Golub, Van Loan, e e e e e-02-9.e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e-1-8.0e e e e e e-1-5.e e e e e e e e e e e e e-14 how to set threshold to distinguish non-zero (1.28e-02) and zero (1.2e-11) values? = an algorithm by measuring dimension of a projection onto the image space and detecting existence of inverse of the matrix using higher precision arithmetic. p.1/21

14 Detection of the kernel of matrix (1/4) A R N N, dimkera = k 1, dimima m. two parameters: l, n, which define factorization, A 11 A 12 5 = 4 A I 1 A 1 11 A 12 5 A 11 R (N l) (N l) or R (N n) (N n). A 21 A 22 A 21 S 22 0 I 2 projection : P n : RN span 2 solution in subspace, Ā N l b = Ā Ā 1 11 A 12 I Ā 1 11 b 1 0 5, A 11 R (N n) (N n) 5, A 11 R (N l) (N l), b = 2 4 b 1 5 = b b N l b l : in quadruple-precision with perturbation to simulate double-precision round-off error. n : candidate of dimension of the kernel compute for l = n 1, n, n + 1 err (n) l := max max x=[0 x l ] 0 P n (Ā N l A x x) x, max x=[x N l 0] 0 5 Ā N l A x x x A 1 N k+1 n = k + 1 err n 1 0 err n 0 err n+1 1 n = k err n 1 0 err n 0 err n+1 1 n = k 1 err n 1 0 err n 0 err n+1 1 does not exist. err(k) k 1 is computable and is similar order of Ā 1 N A N I N.. p.14/21

15 S 22 0 Ā 1 A I p.15/21 Detection of the kernel of matrix (2/4) Ā 1 : in quadruple-precision with perturbation to simulate double-precision round-off error with decomposition of A into regular part A 1 1 R m m and others A 11 A I A I 11 A 1 11 A 125 A 21 A 22 A 21 Â 1 11 I 22 0 S 22 0 I 22 A 1 11 = U 1 11 D 1 11 L artificial perturbation with ε A A I 11 Â 1 11 A 12 4Ǎ I A 21 A 22 0 I 22 0 Š 1 22 A 21 Â 1 11 I 22 Ǎ 1 11 = U 1 11 D 1 11 L 1 11 within quadruple-precision A A A 11 A I A 21 A 22 A 21 A 22 0 I 2 2 = 4 (Ǎ 1 11 A 11 I 1 ) Â 1 11 A12Š 1 22 A 21(I 1 Â 1 11 A 11) Ǎ 1 11 A 12 Â 1 11 A12Š 1 Š 1 22 A 21(I 1 Â 1 11 A 11) Š 1 22 Ŝ22 I 2 Ŝ 22 := A 22 A 21 Â 1 11 A 12 Â 1 11 A 11 I , Ǎ 1 11 A 11 I 1 10, Š 1 22 S 22 I S 22 = 0 Ŝ Ā 1 A I 0 22 Ŝ22 5

16 Detection of the kernel of matrix (/4) Stokes equations with full-neumann bc., N = 199, 808, m = 4, τ = 10 2, eigenvalues by Householder-QR [D] i : diagonal entry [D] 1 i k err (k) k 1 err (k) k err (k) k dim. of kernel = 5 dim. of kernel = 6 dim. of kernel = 7. p.16/

17 . p.17/21 Detection of the kernel of matrix (4/4) stationary Navier-Stokes equations, Re = 12,800, N = 47,886, m = 4, τ = 10 2, matrix by Householder QR factorization e e e e e e e e e e e e e e e e e e e e e-16 k err (k) k 1 err (k) k err (k) k dim. of kernel = 1 dim. of kernel =

18 . p.18/21 dependency of kernel detection on a parameter in MUMPS Threshold for null pivots by CNTL() automatic elstct e-14 kernel error 4.817e e e+0 residual e e e-15 stokes e-21 kernel NA ( ) error e e e e+1 residual e e e e-12 elstct e-16 kernel error e e+0.184e e+0 residual 1.827e e e e-15 Koutsovasilis/F e-14 kernel error 1.809e e e e+0 residual 1.557e e e e-16 ( ) stokes1 with CNTL()= 10 exceeds the memory limitation.

19 . p.19/21 memory consumption: Dissection and Intel Pardiso real 206,76 sym Dissection Intel Pardiso min max avg min max avg 1 thread fact (+2%).977 (+25%) fw/bw (+25%) threads fact (+2%) (+17%) fw/bw (+16%) threads fact (+12%) (+7%) fw/bw (+6%) threads fact (+17%) (+5%) fw/bw (+0%) real 206,76 unsym Dissection Intel Pardiso min max avg min max avg 1 thread fact (+17%) (+10%) fw/bw (+9%) threads fact (+1%) (+8%) fw/bw (+7%) threads fact (+11%) (+4%) fw/bw (+%) threads fact (+9%) (+1%) fw/bw (-%)

20 . p.20/21 FreeFem++ example stationary Navier-Stokes equations by Newton-Raphson iteration : cavitynewtow.edp find (u, p) V g L 2 0 (Ω) a(u, v) + a 1 (u, u, v) + b(v, p) = 0 b(u, q) ε(p, q) = 0 ε = load "Dissection" defaulttodissection(); XXMh [up1,up2,pp]; varf vdns ([u1,u2,p],[v1,v2,q])= int2d(th)(nu*(dx(u1)*dx(v1)+dy(u1)*dy(v1) + dx(u2)*dx(v2)+dy(u2)*dy(v2)) // + p*q*1.e-6 + p*dx(v1)+p*dy(v2) - dx(u1)*q-dy(u2)*q + Ugrad(u1,u2,up1,up2) *[v1,v2] + Ugrad(up1,up2,u1,u2) *[v1,v2]) + on(1,2,,4,u1=0,u2=0); //... up1[]=u1[]; matrix Ans=vDNS(XXMh,XXMh,tgv=-1); set(ans,solver=sparsesolver,tgv=-1, numthreads=4,tolpivot=1.0e-2); real[int] b=vns(0,xxmh,tgv=-1); real[int] w=ansˆ-1*b; u1[]-=w;

21 . p.21/21 summary diss_int(pointer dslv, real_or_complex, numthreads) diss_s_fact(pointer dslv, dim, CSR_symboic_data, sym, decomposer, level) diss_n_fact(pointer dslv, matrix_coef, scaling, tolpivot) diss_get_kern_dim(pointer dslv, kern_dim) diss_get_kern_vecs(pointer dslv, vectors) diss_fee(pointer dslv) written in Dissection.cpp for FreeFem++ load module. sparse matrix is stored in CSR format with structurally symmetric pattern. symmetric matrix is stored in either upper or lower. real symmetric/unsymmetric with kernel detection. complex general matrix without kernel detection. symbolic factorization contains scheduling of parallel tasks in addition to ordering by SCOTCH/METIS and fill-in analysis. SCOTCH/METIS sometimes does not produce aligned bisection tree. symbolic factorization phase is slower than one of UMFPACK double-double quadruple precision arithmetic library will replace Fortran90 REAL(16) usage.

A dissection solver with kernel detection for symmetric finite element matrices on shared memory computers

A dissection solver with kernel detection for symmetric finite element matrices on shared memory computers Atsushi Suzuki, François-Xavier Roux To cite this version: Atsushi Suzuki, François-Xavier Roux.