Sparse factorizations: Towards optimal complexity and resilience at exascale


1 Sparse factorizations: Towards optimal complexity and resilience at exascale
Xiaoye Sherry Li, Lawrence Berkeley National Laboratory
Challenges in 21st Century Experimental Mathematical Computation Workshop, ICERM, Brown Univ., July 21-25, 2014.

2 Introduction
- DOE SciDAC programs (Scientific Discovery through Advanced Computing)
  - FASTMath Institute (Frameworks, Algorithms, and Scalable Technologies for Mathematics)
    Software: SuperLU, PETSc, Trilinos, Chombo, mesh, (3 other Institutes)
  - Science Applications (many, mostly partial differential equations)
    - CEMM (Center for Extended MHD Modeling, fusion energy)
    - ComPASS (Community Petascale Project for Accelerator Science and Simulation)
- LBNL focuses
  - Direct solvers (SuperLU): scaling to 1000s of cores
  - Hybrid solvers (direct & iterative): scaling to 10,000 cores
  - Low-rank HSS preconditioner: nearly linear complexity for certain PDEs


3 Application 1: Burning plasma for fusion energy
- ITER: a new fusion reactor being constructed in Cadarache, France
  - International collaboration: China, the European Union, India, Japan, Korea, Russia, and the United States
  - Study how to harness fusion, creating clean energy using nearly inexhaustible hydrogen as the fuel. ITER promises to produce 10 times as much energy as it uses, but that success hinges on accurately designing the device.
- One major simulation goal is to predict microscopic MHD instabilities of burning plasma in ITER. This involves solving extended and nonlinear magnetohydrodynamics equations.

4 Application 1: ITER modeling
- Center for Extended Magnetohydrodynamic Modeling (CEMM), PI: S. Jardin, PPPL.
- Develop simulation codes to predict microscopic MHD instabilities of burning magnetized plasma in a confinement device (e.g., the tokamak used in ITER experiments).
  - Efficiency of the fusion configuration increases with the ratio of thermal to magnetic pressure, but MHD instabilities become more likely as the ratio grows.
- Code suite includes M3D-C1, NIMROD
[Figure: toroidal (R, ϕ, Z) geometry. At each ϕ = constant plane, scalar 2D data is represented using 18-degree-of-freedom quintic triangular finite elements Q_18, with coupling along the toroidal direction (S. Jardin).]

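5 ITER modeling: 2-Fluid 3D MHD Equations

Continuity:
$$\frac{\partial n}{\partial t} + \nabla\cdot(n\mathbf{V}) = 0$$
Maxwell:
$$\frac{\partial \mathbf{B}}{\partial t} = -\nabla\times\mathbf{E}, \qquad \nabla\cdot\mathbf{B} = 0, \qquad \mu_0\mathbf{J} = \nabla\times\mathbf{B}$$
Momentum:
$$n m\left(\frac{\partial \mathbf{V}}{\partial t} + \mathbf{V}\cdot\nabla\mathbf{V}\right) + \nabla p = \mathbf{J}\times\mathbf{B} - \nabla\cdot\Pi_{GV} - \nabla\cdot\Pi_{\mu}$$
Ohm's law:
$$\mathbf{E} + \mathbf{V}\times\mathbf{B} = \eta\mathbf{J} + \frac{1}{ne}\left(\mathbf{J}\times\mathbf{B} - \nabla p_e - \nabla\cdot\Pi_e\right)$$
Electron energy:
$$\frac{3}{2}\left[\frac{\partial p_e}{\partial t} + \nabla\cdot(p_e\mathbf{V})\right] = -p_e\nabla\cdot\mathbf{V} + \eta J^2 - \nabla\cdot\mathbf{q}_e + Q_\Delta$$
Ion energy:
$$\frac{3}{2}\left[\frac{\partial p_i}{\partial t} + \nabla\cdot(p_i\mathbf{V})\right] = -p_i\nabla\cdot\mathbf{V} - \Pi_{\mu}:\nabla\mathbf{V} - \nabla\cdot\mathbf{q}_i - Q_\Delta$$

The objective of the M3D-C1 code is to solve these equations as accurately as possible in 3D toroidal geometry with realistic boundary conditions, optimized for a low-β torus with a strong toroidal field.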

6 Application 2: particle accelerator cavity design
- Community Petascale Project for Accelerator Science and Simulation (ComPASS), PI: P. Spentzouris, Fermilab
- Development of a comprehensive computational infrastructure for accelerator modeling and optimization
- RF cavity: Maxwell equations in electromagnetic field
- FEM in frequency domain leads to a large sparse eigenvalue problem; needs to solve shifted linear systems (L.-Q. Lee)
  - Closed cavity (boundaries Γ_E, Γ_M): linear eigenvalue problem
    $$(K_0 - \sigma^2 M_0)\,x = M_0\,b$$
  - Open cavity (waveguide BC): nonlinear complex eigenvalue problem
    $$(K_0 + i\sigma W - \sigma^2 M_0)\,x = b$$
[Figure: RF unit in ILC]

7 Sparse: lots of zeros in matrix
- Fluid dynamics, structural mechanics, chemical process simulation, circuit simulation, electromagnetic fields, magneto-hydrodynamics, seismic imaging, economic modeling, optimization, data analysis, statistics, ...
- Example: A of dimension 10^6, 10~100 nonzeros per row
- Matlab: >> spy(A)
[Figure: sparsity patterns of Boeing/msc00726 (structural eng.) and Mallya/lhr01 (chemical eng.)]

8 Strategies of sparse linear solvers
Solving a system of linear equations Ax = b
- Sparse: many zeros in A; worth special treatment
- Iterative methods (CG, GMRES, ...)
  - A is not changed (read-only)
  - Key kernel: sparse matrix-vector multiply
  - Easier to optimize and parallelize
  - Low algorithmic complexity, but may not converge
- Direct methods
  - A is modified (factorized)
  - Harder to optimize and parallelize
  - Numerically robust, but higher algorithmic complexity
- Often use a direct method (factorization) to precondition an iterative method
  - Solve an easier system: $M^{-1} A x = M^{-1} b$
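The preconditioning idea on this slide can be illustrated with a small sketch. It assumes a symmetric positive definite test matrix and uses the simplest possible preconditioner, M = diag(A) (Jacobi), rather than a factorization-based one; the matrix, the helper names (`matvec`, `pcg`) and the tolerances are all hypothetical choices for illustration, not part of the talk.

```python
# Illustrative sketch: conjugate gradients with an optional preconditioner.
# Assumption: A is symmetric positive definite; M = diag(A) is used here
# only because it is the simplest preconditioner to write down.

def matvec(A, x):
    """Dense matrix-vector product; a real solver would use a sparse kernel."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pcg(A, b, apply_Minv, tol=1e-10, max_iter=500):
    """Preconditioned CG. Returns (x, iterations used)."""
    x = [0.0] * len(b)
    r = list(b)                       # residual b - A x for x = 0
    z = apply_Minv(r)                 # preconditioned residual
    p = list(z)
    rz = dot(r, z)
    b_norm = dot(b, b) ** 0.5
    for it in range(1, max_iter + 1):
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if dot(r, r) ** 0.5 <= tol * b_norm:
            return x, it
        z = apply_Minv(r)
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter

# Test matrix A = S B S: B = tridiag(-1, 4, -1) is well conditioned,
# the diagonal scaling S makes A badly scaled.
n = 50
scale = [10.0 ** (i % 3) for i in range(n)]
A = [[0.0] * n for _ in range(n)]
for i in range(n):
    A[i][i] = 4.0 * scale[i] ** 2
    if i + 1 < n:
        A[i][i + 1] = A[i + 1][i] = -scale[i] * scale[i + 1]
b = matvec(A, [1.0] * n)              # right-hand side for the all-ones solution

diag = [A[i][i] for i in range(n)]
x_plain, it_plain = pcg(A, b, lambda r: list(r))                 # M = I
x_jacobi, it_jacobi = pcg(A, b, lambda r: [ri / di for ri, di in zip(r, diag)])
print("iterations without / with Jacobi preconditioning:", it_plain, it_jacobi)
```

Printing the two iteration counts gives a direct comparison of the unpreconditioned and preconditioned runs. A factorization-based preconditioner would replace the `apply_Minv` callback with triangular solves.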

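9 Gaussian Elimination (GE)
- Solving a system of linear equations Ax = b
- First step of GE:
  $$A = \begin{bmatrix} \alpha & w^T \\ v & B \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ v/\alpha & I \end{bmatrix} \begin{bmatrix} \alpha & w^T \\ 0 & C \end{bmatrix}, \qquad C = B - \frac{v\,w^T}{\alpha}$$
- Repeat GE on C
- Results in the LU factorization (A = LU)
  - L lower triangular with unit diagonal, U upper triangular
- Then, x is obtained by solving two triangular systems with L and U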
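The elimination step above can be written out directly. The following is a minimal dense sketch without pivoting (so it assumes every pivot α is nonzero); the function name `lu_no_pivot` is a hypothetical helper, not a SuperLU routine.

```python
def lu_no_pivot(A):
    """Return (L, U) with A = L U, L unit lower triangular, U upper triangular.
    Assumption: no pivoting is needed (all pivots are nonzero)."""
    n = len(A)
    C = [row[:] for row in A]         # working copy, overwritten by Schur complements
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for k in range(n):
        alpha = C[k][k]
        if alpha == 0.0:
            raise ZeroDivisionError("zero pivot; pivoting would be required")
        U[k][k:] = C[k][k:]           # row [alpha, w^T] goes into U
        for i in range(k + 1, n):
            L[i][k] = C[i][k] / alpha             # column v / alpha goes into L
            for j in range(k + 1, n):
                C[i][j] -= L[i][k] * C[k][j]      # C = B - v w^T / alpha
    return L, U

A = [[2.0, 1.0, 1.0],
     [4.0, 3.0, 3.0],
     [8.0, 7.0, 9.0]]
L, U = lu_no_pivot(A)
print(L)   # [[1.0, 0.0, 0.0], [2.0, 1.0, 0.0], [4.0, 3.0, 1.0]]
print(U)   # [[2.0, 1.0, 1.0], [0.0, 1.0, 1.0], [0.0, 0.0, 2.0]]
```

Each pass of the outer loop is one "first step of GE" applied to the current trailing block C.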

10 Sparse factorization
- Store A explicitly: many sparse compressed formats
- Fill-in: new nonzeros in L & U
- Graph algorithms: directed/undirected graphs, bipartite graphs, paths, elimination trees, depth-first search, heuristics for NP-hard problems, cliques, graph partitioning, ...
- Unfriendly to high-performance, parallel computing
  - Irregular memory access, indirect addressing, strong task/data dependency
[Figure: sparsity pattern of the L and U factors]
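As one concrete instance of a compressed format, the sketch below stores a small matrix in compressed sparse row (CSR) form and multiplies it by a vector. CSR is chosen here only as a common example; the slide does not say which formats are meant, and the helper names are hypothetical.

```python
def dense_to_csr(A):
    """Convert a dense row-major matrix to CSR arrays (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in A:
        for j, a in enumerate(row):
            if a != 0:
                values.append(a)
                col_idx.append(j)
        row_ptr.append(len(values))   # end of this row in the values array
    return values, col_idx, row_ptr

def csr_matvec(values, col_idx, row_ptr, x):
    """y = A x using only the stored nonzeros."""
    y = []
    for i in range(len(row_ptr) - 1):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += values[k] * x[col_idx[k]]        # indirect access into x
        y.append(s)
    return y

A = [[4.0, 0.0, 0.0, 1.0],
     [0.0, 3.0, 0.0, 0.0],
     [2.0, 0.0, 5.0, 0.0],
     [0.0, 0.0, 0.0, 6.0]]
values, col_idx, row_ptr = dense_to_csr(A)
y = csr_matvec(values, col_idx, row_ptr, [1.0, 2.0, 3.0, 4.0])
print(values, col_idx, row_ptr, y)
```

The lookup `x[col_idx[k]]` is the indirect addressing the slide refers to: the memory access pattern depends on the data, not on the loop indices.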

11 Graph tool: reachable set, fill-path
Edge (x,y) exists in the filled graph G+ due to the path: x → 7 → 3 → 9 → y
- Finding fill-ins ⟺ finding the transitive closure of G(A)
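The fill-path statement can be checked on a toy graph. The sketch below plays the "elimination game" on an undirected graph (so it assumes a symmetric nonzero structure, whereas the slide's arrows suggest the directed case). The vertex labels x = 10 and y = 11 are hypothetical choices that make the intermediate vertices 7, 3, 9 lower-numbered than both endpoints.

```python
def elimination_game(edges, order):
    """Eliminate vertices in the given order; return the set of fill edges.
    Eliminating a vertex connects all of its not-yet-eliminated neighbors."""
    adj = {v: set() for v in order}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    fill, eliminated = set(), set()
    for v in order:
        nbrs = [u for u in adj[v] if u not in eliminated]
        for i, a in enumerate(nbrs):
            for b in nbrs[i + 1:]:
                if b not in adj[a]:
                    adj[a].add(b)
                    adj[b].add(a)
                    fill.add((min(a, b), max(a, b)))
        eliminated.add(v)
    return fill

# Path x - 7 - 3 - 9 - y with x = 10, y = 11 (hypothetical labels).
edges = [(10, 7), (7, 3), (3, 9), (9, 11)]
fill = elimination_game(edges, order=[3, 7, 9, 10, 11])
print(sorted(fill))   # [(7, 9), (9, 10), (10, 11)]
```

The original graph has no edge (10, 11); it appears only because every intermediate vertex on the path is eliminated before both endpoints.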

12 Algorithmic phases in sparse GE
1. Minimize number of fill-ins, maximize parallelism
   - Sparsity structure of L & U depends on that of A, which can be changed by row/column permutations (vertex re-labeling of the underlying graph)
   - Ordering (combinatorial algorithms; NP-complete to find optimum [Yannakakis 83]; use heuristics)
2. Predict the fill-in positions in L & U
   - Symbolic factorization (combinatorial algorithms)
3. Design efficient data structure for storage and quick retrieval of the nonzeros
   - Compressed storage schemes
4. Perform factorization and triangular solutions
   - Numerical algorithms (floating-point operations only on nonzeros)
   - Usually dominate the total runtime
- For sparse Cholesky and QR, the steps can be separate; for sparse LU with pivoting, steps 2 and 4 may be interleaved.
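Phase 1 can be made concrete with the classic "arrow" example: a hub vertex connected to every other vertex. The sketch below counts fill edges for two orderings of the same graph, assuming a symmetric structure and no numerical pivoting; `count_fill` is a hypothetical helper.

```python
def count_fill(n, edges, order):
    """Number of fill edges created when eliminating vertices 0..n-1 in `order`."""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    position = {v: k for k, v in enumerate(order)}
    fill = 0
    for v in order:
        later = [u for u in adj[v] if position[u] > position[v]]
        for i, a in enumerate(later):
            for b in later[i + 1:]:
                if b not in adj[a]:           # new nonzero in the factors
                    adj[a].add(b)
                    adj[b].add(a)
                    fill += 1
    return fill

n = 6
arrow = [(0, j) for j in range(1, n)]         # vertex 0 is the hub
hub_first = count_fill(n, arrow, [0, 1, 2, 3, 4, 5])
hub_last = count_fill(n, arrow, [1, 2, 3, 4, 5, 0])
print(hub_first, hub_last)   # 10 0
```

Eliminating the hub first makes the remaining graph complete, while eliminating it last creates no fill at all, which is why ordering heuristics such as minimum degree defer high-degree vertices.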

13 Distributed-memory parallelization
- 2D block-cyclic matrix distribution
For j = 1, 2, 3, ..., number of supernodes:
  1. Block LU factorization: L(j,j) U(j,j) ← LU(A(j,j))
  2. L update: L(k,j) ← A(k,j) U(j,j)^{-1}, k > j
  3. U update: U(j,k) ← L(j,j)^{-1} A(j,k), k > j
  4. Rank-k update: A(i,k) ← A(i,k) − L(i,j) U(j,k), i, k > j
- Scalability challenges:
  - High degree of data & task dependency (DAG)
  - Irregular, indirect memory access
  - Low arithmetic intensity
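The 2D block-cyclic distribution can be sketched as a simple index map. The code below assumes a row-major numbering of ranks on a pr x pc process grid (an assumption for illustration; the talk does not specify the numbering) and lists which ranks hold the L and U panels touched at a given step j.

```python
def owner(i, j, pr, pc):
    """Rank owning block (i, j) on a pr x pc grid (row-major ranks assumed)."""
    return (i % pr) * pc + (j % pc)

def panel_owners(j, nblocks, pr, pc):
    """Ranks holding the L panel (block column j, rows k > j)
    and the U panel (block row j, columns k > j)."""
    l_ranks = sorted({owner(k, j, pr, pc) for k in range(j + 1, nblocks)})
    u_ranks = sorted({owner(j, k, pr, pc) for k in range(j + 1, nblocks)})
    return l_ranks, u_ranks

pr, pc, nblocks = 2, 3, 6
layout = [[owner(i, j, pr, pc) for j in range(nblocks)] for i in range(nblocks)]
for row in layout:
    print(row)

l_ranks, u_ranks = panel_owners(1, nblocks, pr, pc)
print("step j=1: L panel on ranks", l_ranks, "U panel on ranks", u_ranks)
```

With this map, the L update at step j stays within one process column and the U update within one process row, while the rank-k update in step 4 involves the whole grid.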

14 SuperLU_DIST 2.5 on Cray XE6
- Profiling using IPM
- Synchronization dominates on a large number of cores
  - up to 96% of factorization time
[Figure: factorization and communication time (s) vs. number of cores. Panels: Accelerator (sym), n=2.7M, fill-ratio=12; DNA, n = 445K, fill-ratio=]

15 SuperLU_DIST 3.0: better DAG scheduling
[Figure: look-ahead window. Factorization/communication time (s) vs. number of cores, version 2.5 vs. version 3.0. Panels: Accelerator, n=2.7M, fill-ratio=12; DNA, n = 445K, fill-ratio=609]
- Implemented new static scheduling and flexible look-ahead algorithms that shortened the length of the critical path.
- Idle time was significantly reduced (speedup up to 2.6x)
- To further improve performance:
  - more sophisticated scheduling schemes
  - hybrid programming paradigms

16 Performance of larger matrices

Name        Application                       Data type  N     |A|/N (sparsity)  |L\U| (10^6)  Fill-ratio
matrix211   Fusion, MHD eqns (M3D-C1)         Real       801,
cc_linear2  Fusion, MHD eqns (NIMROD)         Complex    259,
matick      Circuit sim. MNA method (IBM)     Complex    16,
cage13      DNA electrophoresis               Real       445,

- Sparsity ordering: METIS applied to the structure of A^T + A

17 Strong scaling: MPI, Cray XE6
- 2 x 12-core AMD 'Magny-Cours' per node, 2.1 GHz processor
- Up to 1.4 Tflops factorization rate

18 Variety of node architectures
- Cray XE6: dual-socket x 2-die x 6-core, 24 cores
- Cray XC30: dual-socket x 8-core, 16 cores
- Cray XK7: 16-core AMD + K20X GPU
- Intel MIC: 16-core host cores co-processor

19 Multicore / GPU-Aware SuperLU
- New hybrid programming code: MPI+OpenMP+CUDA, able to use all the CPUs and GPUs on manycore computers.
- Algorithmic changes:
  - Aggregate small BLAS operations into larger ones.
  - CPU multithreading of Scatter/Gather operations.
  - Hide long-latency operations.
- Results: on a 100-node GPU cluster, up to 2.7x faster, 2x-5x memory saving.
- New SuperLU_DIST 4.0 release, August

20 CPU + GPU algorithm
Aggregate small blocks → GEMM of large blocks → Scatter
- GPU acceleration: software pipelining to overlap GPU execution with CPU Scatter and data transfer.

21 Software issues
- Use preprocessing to produce 4 versions {s, d, c, z}
  - Creating the macro-enabled base file the first time is clumsy; later maintenance is easier.
  - Templates in C++ are better.
- Performance portability?
  - Need to adjust the block size for each architecture
    - Larger blocks are better for a uniprocessor
    - Smaller blocks are better for parallelism and load balance
  - Open problem: automatic tuning for block size?
- Flexible interface?
  - Example: block diagonal preconditioner, $M^{-1} A x = M^{-1} b$ with $M = \mathrm{diag}(A_{11}, A_{22}, A_{33})$ → use SuperLU_DIST for each diagonal block
- No explicit funding for user support (other than SciDAC apps).


23 Towards exascale
- Exascale machines will have hierarchical organization
  - Hierarchical memory, NUMA nodes: multicore, manycore
- Exascale applications will encompass multiphysics (coupled PDEs) and multiscale (time and space)
- Hierarchical algorithms and parallelism match the features of machines and applications
Studying two classes of algorithms for sparse linear systems:
1. Domain decomposition hybrid method
   - General algebraic solver
2. Low-rank factorization employing hierarchical matrices and randomization
   - Target PDE applications

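24 1. Domain decomposition, Schur-complement (PDSLin)
1. Graph-partition into subdomains; A_11 is block diagonal:
$$\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}, \qquad \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} = \begin{bmatrix} D_1 & & & & E_1 \\ & D_2 & & & E_2 \\ & & \ddots & & \vdots \\ & & & D_k & E_k \\ F_1 & F_2 & \cdots & F_k & A_{22} \end{bmatrix}$$
2. Schur complement:
$$S = A_{22} - A_{21} A_{11}^{-1} A_{12} = A_{22} - (U_{11}^{-T} A_{21}^T)^T (L_{11}^{-1} A_{12}) = A_{22} - W \cdot G, \qquad \text{where } A_{11} = L_{11} U_{11}$$
   S corresponds to the interface (separator) variables; there is no need to form it explicitly.
3. Hybrid solution methods:
   (1) $x_2 = S^{-1}(b_2 - A_{21} A_{11}^{-1} b_1)$: iterative solver
   (2) $x_1 = A_{11}^{-1}(b_1 - A_{12} x_2)$: direct solver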
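The two-stage solve above can be sketched on a small dense example. The block sizes, the random test matrix, and the helper names below are hypothetical. A simple Richardson iteration preconditioned by A_22 stands in for the Krylov solver a real hybrid code would use on the Schur system, and a dense `numpy.linalg.solve` stands in for the subdomain direct solver.

```python
import numpy as np

rng = np.random.default_rng(0)

def block(rows, cols, shift=0.0):
    """Small random block; a diagonal shift keeps square blocks well conditioned."""
    B = 0.1 * rng.standard_normal((rows, cols))
    if shift:
        B += shift * np.eye(rows)
    return B

# Two subdomains of size 4 (block diagonal A11) and 3 interface unknowns.
D1, D2 = block(4, 4, 4.0), block(4, 4, 4.0)
A11 = np.block([[D1, np.zeros((4, 4))], [np.zeros((4, 4)), D2]])
A12, A21, A22 = block(8, 3), block(3, 8), block(3, 3, 4.0)
A = np.block([[A11, A12], [A21, A22]])
b = rng.standard_normal(11)
b1, b2 = b[:8], b[8:]

def solve_A11(rhs):
    """Stand-in for the subdomain direct solver."""
    return np.linalg.solve(A11, rhs)

def apply_S(y):
    """Apply S = A22 - A21 A11^{-1} A12 without forming S explicitly."""
    return A22 @ y - A21 @ solve_A11(A12 @ y)

# (1) Interface system S x2 = b2 - A21 A11^{-1} b1, solved iteratively.
g = b2 - A21 @ solve_A11(b1)
x2 = np.zeros(3)
for _ in range(50):
    x2 = x2 + np.linalg.solve(A22, g - apply_S(x2))   # Richardson step

# (2) Back-substitute for the subdomain unknowns with the direct solver.
x1 = solve_A11(b1 - A12 @ x2)
x = np.concatenate([x1, x2])
print(np.linalg.norm(A @ x - b))
```

Because A_11 is block diagonal, each call to `solve_A11` decouples into independent subdomain solves, which is where the subdomain-level parallelism comes from.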


25 Hierarchical parallelism
- Multiple processors per subdomain
  - one subdomain with 2x3 procs (e.g. SuperLU_DIST)
[Figure: block layout with four subdomains. D_1, E_1, F_1 on P(0:5); D_2, E_2, F_2 on P(6:11); D_3, E_3, F_3 on P(12:17); D_4, E_4, F_4 on P(18:23); interface block A_22]
- Advantages:
  - Constant number of subdomains, Schur complement size, and convergence rate, regardless of core count.
  - Need only a modest level of parallelism from the direct solver.


26 PDSLin in Omega3P: Cryomodule
PIP2 cryomodule consisting of 8 cavities
Computation parameters: 2.3M elements
- First order finite element (p = 1)
  - 39M non-zeroes, 2.5M DOFs
  - Solution time on hopper using 50 nodes and 600 cores: 863 ms (total)
- Second order finite element (p = 2)
  - 590M non-zeroes, 14M DOFs
  - Solution time on edison using 400 nodes, 4800 cores: 5:40 min (wall)
  - Using MUMPS with 400 nodes, 800 cores, solution time: 6:46 min (wall)

27 New mathematical algorithms
- K-way, multi-constraint graph partitioning
  - Small separator, similar subdomains, similar connectivity
  - Both intra- and inter-group load balance
- Sparse triangular solution with many sparse RHS (intra-subdomain):
  $$S = A_{22} - \sum_{l} (U_l^{-T} F_l^T)^T (L_l^{-1} E_l) = A_{22} - \sum_{l} W_l G_l, \qquad \text{where } D_l = L_l U_l$$
- Sparse matrix-matrix multiplication (inter-subdomain):
  $$W \leftarrow \mathrm{sparsify}(W, \sigma_1); \quad G \leftarrow \mathrm{sparsify}(G, \sigma_1)$$
  $$T^{(p)} \leftarrow W^{(p)} G^{(p)}; \quad \hat{S}^{(p)} \leftarrow A_{22}^{(p)} - \sum_{q} T^{(q)}(p); \quad S \leftarrow \mathrm{sparsify}(\hat{S}, \sigma_2)$$
I. Yamazaki, F.-H. Rouet, X.S. Li, B. Ucar, On partitioning and reordering problems in a hierarchically parallel hybrid linear solver, IPDPS / PDSEC Workshop, May 24,


28 2. HSS-embedded sparse factorization
- Dense (but data-sparse): hierarchically semi-separable structure
  - PDEs with smooth kernels: off-diagonal blocks are rank deficient
  - Recursion leads to hierarchical partitioning
  - Key to low complexity: nested bases
$$A \approx \begin{bmatrix} \begin{bmatrix} D_1 & U_1 B_1 V_2^T \\ U_2 B_2 V_1^T & D_2 \end{bmatrix} & \begin{bmatrix} U_1 R_1 \\ U_2 R_2 \end{bmatrix} B_3 \begin{bmatrix} W_4^T V_4^T & W_5^T V_5^T \end{bmatrix} \\ \begin{bmatrix} U_4 R_4 \\ U_5 R_5 \end{bmatrix} B_6 \begin{bmatrix} W_1^T V_1^T & W_2^T V_2^T \end{bmatrix} & \begin{bmatrix} D_4 & U_4 B_4 V_5^T \\ U_5 B_5 V_4^T & D_5 \end{bmatrix} \end{bmatrix}$$
[Figure: HSS tree]
- Sparse: apply HSS to dense separators/supernodes
- Nested tree-parallelism:
  - Outer tree: separator tree
  - Inner tree: HSS tree
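The rank deficiency of off-diagonal blocks can be observed numerically. The sketch below uses a hypothetical smooth kernel 1/(1 + |x − y|) on a uniform 1D grid (chosen only for illustration, not taken from the talk) and counts the singular values of one off-diagonal block above a relative tolerance.

```python
import numpy as np

n = 128
x = np.linspace(0.0, 1.0, n)
# Dense matrix from a smooth kernel (hypothetical example kernel).
K = 1.0 / (1.0 + np.abs(x[:, None] - x[None, :]))

# Off-diagonal block coupling the first half of the points to the second half.
off = K[: n // 2, n // 2:]
s = np.linalg.svd(off, compute_uv=False)

tol = 1e-8
rank = int(np.sum(s > tol * s[0]))    # numerical rank at relative tolerance tol
print("block size:", off.shape, "numerical rank:", rank)
```

An HSS representation stores such a block as a product U B V^T with U and V having only `rank` columns, and reuses (nests) these bases across the levels of the tree.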

29 3D Helmholtz
- Helmholtz equation with PML boundary:
  $$\left(-\Delta - \frac{\omega^2}{v(x)^2}\right) u(x,\omega) = s(x,\omega)$$
- N = 27M, procs = 1024
- Max rank = 1391 (tolerance = 1e-4)
[Table: columns Times (s), Gflops (peak %), Comm %, Mem (GB). Surviving entries: MF: (27.7%), 32.6 %, 3144; MF + HSS, HSS-compr: (29.2%), 41.2 %, 15.3 %]

30 New compression kernel: Randomized Sampling
- Traditional methods: SVD, rank-revealing QR
  - Difficult to scale up
  - Extend-add of HSS structures of different shapes
- Randomized sampling:
  1. Pick a random matrix Ω of size n x (k+p), with p small
  2. Sample matrix S = A Ω, with slight oversampling p
  3. Compute Q = ON-basis(S), an orthonormal basis of S
  Accuracy: with probability at least $1 - 6\,p^{-p}$,
  $$\|A - QQ^*A\| \le \left(1 + 11\sqrt{k+p}\,\sqrt{\min(m,n)}\right)\sigma_{k+1}$$
- Benefits: the kernel becomes dense matrix-matrix multiply
  - Extend-add of tall-skinny dense matrices of conforming shapes
  - Scalable and resilient algorithms exist
  - Even faster, if a fast matrix-vector multiply is available (e.g. FMM)
  - Matrix-free solver, if only the matrix-vector action is available
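The three sampling steps can be written in a few lines. In the sketch below the test matrix is constructed to have exact rank k (an assumption that makes σ_{k+1} = 0, so the projection error should sit at roundoff level); the sizes and the oversampling p = 5 are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k, p = 200, 150, 10, 5

# Test matrix with exact rank k (assumption made for a clean check).
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))

Omega = rng.standard_normal((n, k + p))   # 1. random n x (k+p) matrix
S = A @ Omega                             # 2. sample the range of A
Q, _ = np.linalg.qr(S)                    # 3. orthonormal basis of the samples

# Relative error of projecting A onto the sampled range.
rel_err = np.linalg.norm(A - Q @ (Q.T @ A), 2) / np.linalg.norm(A, 2)
print(Q.shape, rel_err)
```

The only access to A is through the product `A @ Omega`, which is why the approach also works when just a (fast) matrix-vector multiply is available.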

31 Summary, forward looking...
- Direct solvers can scale to 1000s of cores
- Domain-decomposition type of hybrid solvers can scale to 10,000s of cores
  - Can also maintain robustness
- Expect to scale further with low-rank structured factorization methods
- Extend to a general solver framework, examine feasibility with a wider class of problems
