SPARSE SOLVERS FOR THE POISSON EQUATION
Margreet Nool
CWI, Multiscale Dynamics
November 9, 2015

OUTLINE OF THIS TALK
1. FISHPACK, LAPACK, PARDISO
2. SYSTEM OVERVIEW OF CARTESIUS
3. POISSON EQUATION
4. SOLVERS
5. EXAMPLE OF USE
6. CONCLUSIONS AND REMARKS

2D POISSON PROBLEM
[Figure: "2D Poisson problem solution at Cartesius", plotted against $n_x = n_y$ from $10^3$ to $10^4$. Legend: pardiso 1 thread, pardiso 12 threads, pardiso 24 threads, fishpack, lapack mkl.]
Results on 1 node of Cartesius with 24 cores:
- LAPACK: fastest implementation on Cartesius
- PARDISO: shared-memory multiprocessing parallel direct sparse solver by Olaf Schenk [00-04], optimized for Intel®

2D POISSON PROBLEM
[Figure: "2D Poisson problem accuracy", residual plotted against $n_x = n_y$ from $10^3$ to $10^4$. Legend: pardiso, fishpack, lapack.]
- LAPACK: maximum problem size $n_x = n_y = 1300$
- FISHPACK: convergence up to problem size $n_x = n_y = 1400$
- PARDISO: maximum problem size $n_x = n_y = 5600$

CLUSTER MACHINE
Cartesius, the Dutch supercomputer at SURFsara, is a cluster machine.

Node type  Number  Cores  CPU         Clock    Memory
thin       1080    24     E5-2690 v3  2.6 GHz   64 GB
thin        540    24     E5-2695 v2  2.4 GHz   64 GB
fat          32    32     E5-4650     2.7 GHz  256 GB
gpu          64    16     E5-2450 v2  2.5 GHz   96 GB

- 40,960 cores + 132 GPUs: 1.559 Pflop/s (peak performance)
- 117 TB memory (CPU + GPGPU)
- Fat nodes have 4 times more memory than thin nodes, but are slower

NODES AND CORES
- A Cartesius node can have 24 or 32 cores
- Within a node: shared memory
- Across nodes: distributed memory
- Nodes can be configured in different ways
[Figure: three configurations of 1 node with 8 cores: (a) 8 MPI processes, (b) 8 OpenMP threads, (c) 4 MPI processes.]

SOFTWARE MKL LIBRARY
Intel® Math Kernel Library is a library of optimized math routines for science, engineering, and financial applications. Core math functions include BLAS, LAPACK, ScaLAPACK, sparse solvers, fast Fourier transforms, and vector math. The routines in MKL are hand-optimized specifically for Intel® processors.
Sparse solvers:
- MKL PARDISO: Parallel Direct Sparse Solver Interface
- Parallel Direct Sparse Solver for Clusters Interface
- Direct Sparse Solver (DSS) Interface Routines
- Iterative Sparse Solvers (based on the Reverse Communication Interface)

SOFTWARE
Intel® Poisson solvers for a single node:
- Two-dimensional Helmholtz problem on a Cartesian plane
- Two-dimensional Poisson problem on a Cartesian plane
- Two-dimensional Laplace problem on a Cartesian plane
- Helmholtz problem on a sphere
- Poisson problem on a sphere
- Three-dimensional Helmholtz problem
- Three-dimensional Poisson problem
- Three-dimensional Laplace problem

1D CELL CENTERED DIRICHLET BC
Hundsdorfer and Verwer: consider a cell-centered grid with nodes
$$x_i = (i - \tfrac{1}{2})h, \qquad i = 1, \ldots, M, \qquad h = 1/M.$$
For Dirichlet BC we need, in $x_0 = -\tfrac{1}{2}h$ and in $x_{M+1} = 1 + \tfrac{1}{2}h$, the virtual values $u_0$ and $u_{M+1}$, such that
$$\tfrac{1}{2}(u_0 + u_1) = \gamma_0, \qquad \tfrac{1}{2}(u_M + u_{M+1}) = \gamma_M.$$
We obtain the following semi-discrete system:
$$\begin{aligned}
u_1' &= \frac{1}{h^2}(-3u_1 + u_2) + \frac{2}{h^2}\gamma_0, \\
u_i' &= \frac{1}{h^2}(u_{i-1} - 2u_i + u_{i+1}), \qquad 2 \le i \le M-1, \\
u_M' &= \frac{1}{h^2}(u_{M-1} - 3u_M) + \frac{2}{h^2}\gamma_M.
\end{aligned}$$

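1D CELL CENTERED DIRICHLET BC
The 1D Poisson matrix $A$ of size $M$ and the RHS vector $b$ are defined by
$$A = \frac{1}{h^2}\begin{pmatrix}
3 & -1 & & & & \\
-1 & 2 & -1 & & & \\
 & -1 & 2 & -1 & & \\
 & & \ddots & \ddots & \ddots & \\
 & & & -1 & 2 & -1 \\
 & & & & -1 & 3
\end{pmatrix}, \qquad
b = \begin{pmatrix}
b_1 + \frac{2}{h^2}\gamma_0 \\ b_2 \\ b_3 \\ \vdots \\ b_{M-1} \\ b_M + \frac{2}{h^2}\gamma_M
\end{pmatrix}$$
Here $A$ discretizes $-u''$, so the boundary terms $\frac{2}{h^2}\gamma_0$ and $\frac{2}{h^2}\gamma_M$ move to the right-hand side with a plus sign.
- Note: the Poisson matrix is symmetric positive definite
- Note: correction on the first and last entries of the RHS vector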
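The matrix and RHS correction above can be sketched in a few lines of plain Python. This is only an illustration, assuming the sign convention $-u'' = b$ (positive diagonal) and $M \ge 2$; the function names `poisson_1d` and `rhs_1d` are hypothetical and not taken from the talk.

```python
def poisson_1d(M):
    """Dense 1D cell-centered Poisson matrix for -u'' = b, h = 1/M (assumes M >= 2)."""
    h = 1.0 / M
    A = [[0.0] * M for _ in range(M)]
    for i in range(M):
        A[i][i] = 2.0 / h**2
        if i > 0:
            A[i][i - 1] = -1.0 / h**2
        if i < M - 1:
            A[i][i + 1] = -1.0 / h**2
    # Cells next to a Dirichlet boundary get 3 instead of 2 on the diagonal.
    A[0][0] = 3.0 / h**2
    A[M - 1][M - 1] = 3.0 / h**2
    return A


def rhs_1d(f, gamma0, gammaM, M):
    """RHS sampled at the cell centers x_i = (i - 1/2) h, with boundary correction."""
    h = 1.0 / M
    b = [f((i + 0.5) * h) for i in range(M)]  # i is zero-based here
    b[0] += 2.0 * gamma0 / h**2
    b[-1] += 2.0 * gammaM / h**2
    return b
```

A quick sanity check is the linear function $u(x) = x$ with $\gamma_0 = 0$, $\gamma_M = 1$ and zero source term: it satisfies the discrete system exactly.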

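2D AND 3D CELL CENTERED DIRICHLET BC
The 2D Poisson matrix $A$ of size $M^2 \times M^2$ is, for $M = 4$, defined by
$$A = \frac{1}{h^2}\begin{pmatrix}
B_1 & -I & & \\
-I & B_2 & -I & \\
 & -I & B_2 & -I \\
 & & -I & B_1
\end{pmatrix}, \qquad
B_1 = \begin{pmatrix}
6 & -1 & & \\
-1 & 5 & -1 & \\
 & -1 & 5 & -1 \\
 & & -1 & 6
\end{pmatrix}, \qquad
B_2 = \begin{pmatrix}
5 & -1 & & \\
-1 & 4 & -1 & \\
 & -1 & 4 & -1 \\
 & & -1 & 5
\end{pmatrix}$$
with $I$ the $4 \times 4$ identity matrix.
For the 3D case we distinguish 3 diagonal parts:
- [9 8 8 8 8 9] for cells on edges
- [8 7 7 7 7 8] for cells on surfaces
- [7 6 6 6 6 7] for inner cells
supplemented with 3 subdiagonals and 3 superdiagonals.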
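The 6/5/4 diagonal pattern follows from adding the 1D operator in the x-direction and in the y-direction. The sketch below builds the matrix scaled by $h^2$ (so entries are small integers), assuming row-wise numbering of the cells and the positive-diagonal sign convention; `t1d` and `poisson_2d` are hypothetical helper names.

```python
def t1d(M):
    """1D cell-centered operator scaled by h^2: diagonal [3, 2, ..., 2, 3], off-diagonals -1."""
    T = [[0] * M for _ in range(M)]
    for i in range(M):
        T[i][i] = 3 if i in (0, M - 1) else 2
        if i > 0:
            T[i][i - 1] = -1
        if i < M - 1:
            T[i][i + 1] = -1
    return T


def poisson_2d(M):
    """Dense M^2 x M^2 matrix h^2 * A, cell (i, j) numbered as p = j * M + i."""
    T = t1d(M)
    N = M * M
    A = [[0] * N for _ in range(N)]
    for j in range(M):
        for i in range(M):
            p = j * M + i
            for i2 in range(M):
                A[p][j * M + i2] += T[i][i2]  # coupling in the x-direction
            for j2 in range(M):
                A[p][j2 * M + i] += T[j][j2]  # coupling in the y-direction
    return A


diagonal = [row[p] for p, row in enumerate(poisson_2d(4))]
print(diagonal)
```

Corner cells touch two Dirichlet boundaries (3 + 3), edge cells one (3 + 2), and inner cells none (2 + 2). The same construction with a third direction gives the 9/8/7/6 pattern of the 3D case.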

POISSON SOLVER FOR LARGE 2D AND 3D SIMULATIONS
Poisson solvers:
- PARDISO (MKL)
- CLUSTER_SPARSE_SOLVER (MKL)
- MUMPS Release 5.0.1

ANALYSIS, FACTORIZATION, SOLVE
To solve $A x = b$ we factorize $A$ into $A = L D L^T$.
For PARDISO, CLUSTER_SPARSE_SOLVER and MUMPS alike we can distinguish three main phases:
1. analysis and reordering
2. factorization
3. solution
Note 1: Each phase can be called independently (not for FISHPACK).
Note 2: Once the matrix has been factorized, we may restrict ourselves to the solution phase.
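To make the split between factorization and solution concrete, here is a minimal dense $L D L^T$ sketch in plain Python. It uses no pivoting and no reordering, which is adequate for a small symmetric positive definite example but is not how the sparse solvers named above work internally; `ldlt_factor` and `ldlt_solve` are hypothetical names.

```python
def ldlt_factor(A):
    """Factorize a symmetric matrix as A = L D L^T (unit lower-triangular L, diagonal D)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    D = [0.0] * n
    for j in range(n):
        D[j] = A[j][j] - sum(L[j][k] ** 2 * D[k] for k in range(j))
        L[j][j] = 1.0
        for i in range(j + 1, n):
            s = sum(L[i][k] * L[j][k] * D[k] for k in range(j))
            L[i][j] = (A[i][j] - s) / D[j]
    return L, D


def ldlt_solve(L, D, b):
    """Solve L D L^T x = b using an existing factorization."""
    n = len(b)
    y = list(b)
    for i in range(n):  # forward substitution: L y = b
        y[i] -= sum(L[i][k] * y[k] for k in range(i))
    x = [y[i] / D[i] for i in range(n)]  # diagonal scaling: D z = y
    for i in reversed(range(n)):  # backward substitution: L^T x = z
        x[i] -= sum(L[k][i] * x[k] for k in range(i + 1, n))
    return x


# Factorize once, then reuse the factors for several right-hand sides.
A = [[3.0, -1.0, 0.0, 0.0],
     [-1.0, 2.0, -1.0, 0.0],
     [0.0, -1.0, 2.0, -1.0],
     [0.0, 0.0, -1.0, 3.0]]
L, D = ldlt_factor(A)
x1 = ldlt_solve(L, D, [1.0, 0.0, 0.0, 9.0])
x2 = ldlt_solve(L, D, [2.0, 0.0, 0.0, 2.0])
```

The second solve reuses `L` and `D` without refactorizing, which is the situation Note 2 describes.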

ANALYSIS, FACTORIZATION, SOLVE
Analysis phase
- reordering of the matrix to reduce fill-in
- choosing pivots using a selection criterion to preserve sparsity
- matrix input distributions:
  - CRS for PARDISO and CLUSTER_SPARSE_SOLVER
  - centralized assembled matrix format for MUMPS
  - matrix only on host or distributed over processes
- if desired, an analysis report is made
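As an illustration of the CRS (compressed row storage) input format, the sketch below converts a small dense matrix into the usual three arrays. Zero-based indices are used here for readability, and the `upper_only` option reflects the common practice of passing only one triangle of a symmetric matrix; both choices are assumptions, not a statement about the settings used in the talk. The name `to_crs` is hypothetical.

```python
def to_crs(A, upper_only=False):
    """Return (values, col_ind, row_ptr) for the nonzeros of a dense matrix A."""
    values, col_ind, row_ptr = [], [], [0]
    for i, row in enumerate(A):
        for j, a in enumerate(row):
            if a != 0 and (not upper_only or j >= i):
                values.append(a)
                col_ind.append(j)
        row_ptr.append(len(values))  # row i occupies values[row_ptr[i]:row_ptr[i + 1]]
    return values, col_ind, row_ptr


T = [[3, -1, 0, 0],
     [-1, 2, -1, 0],
     [0, -1, 2, -1],
     [0, 0, -1, 3]]
values, col_ind, row_ptr = to_crs(T, upper_only=True)
```

For this 4 x 4 example the upper triangle holds 7 nonzeros instead of 10, so symmetric storage saves memory before the factorization even starts.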

ANALYSIS, FACTORIZATION, SOLVE
Factorization phase
- most time-consuming phase
- most memory-consuming phase
- if desired, a report about the factorization is made
- pivot strategy required only once?

ANALYSIS, FACTORIZATION, SOLVE
Solution phase
- Post-processing: iterative refinement
- Error analysis
Compute $r = Ax - b$; then $\max_{i=1,\ldots,m} |r_i| < 1\mathrm{E}{-12}$.
Let $x_{cont}$ be the solution of the continuous problem; then
Residual: $\|x - x_{cont}\|_2$ or Residual: $\max_{i=1,\ldots,m} |x(i) - x_{cont}(i)|$
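The two error measures can be written down directly. The sketch below assumes a dense matrix stored as a list of rows and takes absolute values in the max norm; the helper names are hypothetical.

```python
import math


def residual(A, x, b):
    """Algebraic residual r = A x - b for a dense matrix A."""
    return [sum(a * xj for a, xj in zip(row, x)) - bi for row, bi in zip(A, b)]


def max_norm(v):
    """Largest absolute entry of v."""
    return max(abs(vi) for vi in v)


def two_norm(v):
    """Euclidean norm of v."""
    return math.sqrt(sum(vi * vi for vi in v))


def error_vs_continuous(x, x_cont):
    """Discretization error in both norms: (2-norm, max norm) of x - x_cont."""
    diff = [xi - ci for xi, ci in zip(x, x_cont)]
    return two_norm(diff), max_norm(diff)
```

The algebraic residual `max_norm(residual(A, x, b))` checks the solver (target below 1E-12), while `error_vs_continuous` measures how well the discretization approximates the known continuous solution. These are different quantities and should not be expected to be of the same size.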

2D POISSON PROBLEM
Solve
$$\Delta U(x, y) = \left(\frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2}\right) U(x, y)$$
using a 4-pt centered 2nd-order difference scheme.
2D POISSON PROBLEM WITH KNOWN SOLUTION
$$U(x, y) = \exp\left(-C\left((x - x_0)^2 + (y - y_0)^2\right)\right) + 1.0$$
$$\Delta U(x, y) = \left(-4C + 4C^2\left((x - x_0)^2 + (y - y_0)^2\right)\right) \exp\left(-C\left((x - x_0)^2 + (y - y_0)^2\right)\right)$$
on a uniform grid defined on $x \in [0, 1]$ and $y \in [0, 1]$, with $C \in \{\ldots, \ldots, 10^4, 10^6\}$ and $x_0 = y_0 = 0.5$.
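The known solution and its Laplacian can be cross-checked numerically with the centered difference stencil. The sketch assumes the Gaussian has a negative exponent, $\exp(-C r^2)$, which is what makes the stated right-hand side consistent; function names are hypothetical.

```python
import math


def u_exact(x, y, C, x0=0.5, y0=0.5):
    """Known solution: Gaussian bump of height 1 on top of the constant 1.0."""
    return math.exp(-C * ((x - x0) ** 2 + (y - y0) ** 2)) + 1.0


def laplace_u(x, y, C, x0=0.5, y0=0.5):
    """Analytic Laplacian of u_exact, used as right-hand side."""
    r2 = (x - x0) ** 2 + (y - y0) ** 2
    return (-4.0 * C + 4.0 * C * C * r2) * math.exp(-C * r2)


def fd_laplace(u, x, y, h):
    """Centered 2nd-order difference approximation using the 4 neighbours of (x, y)."""
    return (u(x - h, y) + u(x + h, y) + u(x, y - h) + u(x, y + h) - 4.0 * u(x, y)) / h**2


C = 1.0
approx = fd_laplace(lambda x, y: u_exact(x, y, C), 0.3, 0.6, 1e-3)
exact = laplace_u(0.3, 0.6, C)
print(abs(approx - exact))
```

The bump has a width of roughly $1/\sqrt{C}$, so for the larger values of $C$ the grid has to be fine enough to resolve it before second-order convergence becomes visible.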

2D POISSON PROBLEM
[Figure: four surface plots over X and Y, panels (d) to (g), one per value of $C$; the last two are (f) $C = 10^4$ and (g) $C = 10^6$.]

2D POISSON PROBLEM
[Figure: four panels plotted against $n_x = n_y$ from $10^3$ to $10^4$, one curve per value of $C$: (h) convergence, "2D Poisson problem accuracy", residual (2-norm); (i) reordering PARDISO, "2D Reordering phase on 1 node"; (j) factorization PARDISO, "2D Factorize phase on 1 node"; (k) solution PARDISO, "2D Solution phase on 1 node".]

3D POISSON PROBLEM
Solve
$$\Delta U(x, y, z) = \left(\frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2} + \frac{\partial^2}{\partial z^2}\right) U(x, y, z)$$
using a 6-pt centered 2nd-order difference scheme.
3D POISSON PROBLEM WITH KNOWN SOLUTION
$$U(x, y, z) = \exp\left(-C\left((x - x_0)^2 + (y - y_0)^2 + (z - z_0)^2\right)\right) + 1.0$$
$$\Delta U(x, y, z) = \left(-6C + 4C^2\left((x - x_0)^2 + (y - y_0)^2 + (z - z_0)^2\right)\right) \exp\left(-C\left((x - x_0)^2 + (y - y_0)^2 + (z - z_0)^2\right)\right)$$
on a uniform grid defined on $x \in [0, 1]$, $y \in [0, 1]$ and $z \in [0, 1]$, with $C \in \{\ldots, \ldots, 10^4, 10^6\}$ and $x_0 = y_0 = z_0 = 0.5$.
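The constant in front of $C$ in the right-hand side depends on the space dimension. A short derivation, assuming $U = \exp(-C r^2) + 1$ with $r^2 = \sum_{k=1}^{d} (x_k - x_{0,k})^2$; the symbols $d$, $r$ and $x_k$ are introduced here only for this check:

```latex
\begin{aligned}
\frac{\partial U}{\partial x_k} &= -2C\,(x_k - x_{0,k})\, e^{-C r^2}, \\
\frac{\partial^2 U}{\partial x_k^2} &= \left(-2C + 4C^2 (x_k - x_{0,k})^2\right) e^{-C r^2}, \\
\Delta U = \sum_{k=1}^{d} \frac{\partial^2 U}{\partial x_k^2} &= \left(-2dC + 4C^2 r^2\right) e^{-C r^2}.
\end{aligned}
```

For $d = 2$ the leading term is $-4C$; for $d = 3$ it is $-6C$.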

3D POISSON PROBLEM CLUSTER_SPARSE_SOLVER
[Figure: four panels for CLUSTER_SPARSE_SOLVER on 12 cores: (l) convergence, "3D Poisson problem accuracy", residual (max norm); (m) reordering, "3D Reordering phase"; (n) factorization, "3D Factorize phase"; (o) solution, "3D Solution phase".]

3D POISSON PROBLEM CLUSTER_SPARSE_SOLVER
[Figure: six panels for CLUSTER_SPARSE_SOLVER. 12 cores: (p) reordering, (q) factorization, (r) solution. 24 cores: (s) reordering, (t) factorization, (u) solution.]
FIGURE: Number of cores per node 12 (upper) and 24 (lower) figures

3D POISSON PROBLEM MUMPS
[Figure: six panels for MUMPS. Upper row: (a) reordering, (b) factorization, (c) solution. Lower row: (d) reordering, (e) factorization, (f) solution.]
FIGURE: Number of cores per node 12 (upper) and 24 (lower) figures

3D POISSON PROBLEM CLUSTER_SPARSE_SOLVER VERSUS MUMPS
[Figure: six panels. CLUSTER_SPARSE_SOLVER on 24 cores: (a) reordering, (b) factorization, (c) solution. MUMPS: (d) reordering, (e) factorization, (f) solution.]
FIGURE: CLUSTER_SPARSE_SOLVER (upper) versus MUMPS (lower) figures; number of cores per node 24

3D POISSON PROBLEM MUMPS
[Figure: three speedup panels for MUMPS: (a) reordering, "3D Reordering phase MUMPS"; (b) factorization, "3D Factorize phase MUMPS"; (c) solution, "3D Solution phase MUMPS".]
FIGURE: Speedup compared with 1 node

3D POISSON PROBLEM
Analysis report for 3D MUMPS on 64 nodes

n_x  N         NZ        operations  host MBYTES  avg MBYTES  total MBYTES
 64    262144   1036288  3.75 E+11          155          69          4465
 80    512000   2028800  1.49 E+12          217         165         10614
 96    884736   3511296  4.47 E+12          446         358         22925
112   1404928   5582080  1.17 E+13          851         659         42189
128   2097152   8339456  2.72 E+13         1644        1087         69627
160   4096000  16307200  1.06 E+13         3298        2596        166191
192   7077888  28200960  3.24 E+14         8711        5784        370210
224  11239424  44807168  8.24 E+14        14399       10114        647308
256  16777216  66912256  1.87 E+14        21989       16270       1041342
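The N and NZ columns of the report can be reproduced from the grid size alone, assuming a stencil with 6 neighbours per inner cell on an $n_x^3$ grid and storage of the upper triangle only (diagonal plus one entry per neighbouring pair). That storage assumption is inferred from the numbers, not stated in the report; the function names are hypothetical.

```python
def unknowns(n):
    """Number of unknowns N on an n x n x n grid."""
    return n**3


def nonzeros_upper(n):
    """Nonzeros in the upper triangle: n^3 diagonal entries plus one entry per neighbour pair.

    Each of the 3 directions contributes n^2 * (n - 1) neighbouring pairs.
    """
    return n**3 + 3 * n * n * (n - 1)


for n in (64, 128, 256):
    print(n, unknowns(n), nonzeros_upper(n))
```

Under these assumptions the formula matches every row of the table, so NZ grows only linearly in N; the memory columns are dominated by fill-in during the factorization rather than by the matrix itself.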

CONCLUSIONS, REMARKS AND QUESTIONS
- 2D Poisson problems up to $n_x = n_y = 5400$ on a single node
- 2D Poisson problems up to $n_x = n_y = 13000$ on 32 nodes
- 3D Poisson problems up to $n_x = n_y = n_z = 128$ on a single node
- 3D Poisson problems up to $n_x = n_y = n_z = 256$ on 64 nodes
- MUMPS is very suitable for cluster machines
- CLUSTER_SPARSE_SOLVER can handle larger problems than MUMPS
- the solution phase of CLUSTER_SPARSE_SOLVER is slower than that of MUMPS
- use MKL software where possible, also for MUMPS
- parallelization with MUMPS or CLUSTER_SPARSE_SOLVER is NOT difficult
- forget about FISHPACK: it is no longer the fastest solver
- results obtained by FISHPACK are not reliable

CONCLUSIONS, REMARKS AND QUESTIONS
- Is it possible to accelerate Anna's code?
- Is the 3D approach suitable for Anna?
- More questions