Static scheduling and hybrid programming in SuperLU_DIST on multicore cluster systems

Static scheduling and hybrid programming in SuperLU_DIST on multicore cluster systems. Ichitaro Yamazaki (University of Tennessee, Knoxville) and Xiaoye Sherry Li (Lawrence Berkeley National Laboratory). MS49: Sparse Linear Solvers on Many-core Architectures, SIAM PP, Savannah, 2/17/2012.

SuperLU_DIST: a direct solver for general sparse linear systems on distributed-memory systems (first release in 1999), where each compute node has one or more cores and UMA.
- capable of factorizing matrices with millions of unknowns from real applications.
- used in large-scale simulations and in iterative/hybrid solvers:
  - quantum mechanics [SC'10]: low-order uncoupled systems
  - fusion energy (M3D-C1, PPPL): 2D slices of a 3D torus
  - PDSLin, a hybrid linear solver based on domain decomposition: interior subdomain solver, studied for accelerator modeling (Omega3P, SLAC)

Our testbeds at NERSC:
- Cray XE6 (Hopper): 6,384 nodes (peak 1.28 Pflop/s), No. 8 on the TOP500; two 12-core AMD Magny-Cours 2.1GHz processors (each a pair of six-core dies) + 32GB of memory per node; Cray Gemini network in a 3D torus.
- IBM iDataPlex (Carver): 1,202 nodes (peak 0.11 Pflop/s), max. 64 nodes per parallel job; two quad-core Intel Nehalem 2.7GHz processors + 24GB of memory per node; 4x QDR InfiniBand in a 2D mesh.

SuperLU_DIST version 2.5 (released Nov. 2010) on the Cray XE6.
[Figure: factorization time (s) versus number of cores, 8 to 2,048: (a) accelerator (sym), n = 2.7M, fill-ratio = 12; (b) DNA (unsym), n = 445K, fill-ratio = 69]
SuperLU_DIST often does not scale to thousands of cores. Why, and can we improve its performance?

Outline:
- Introduction to SuperLU_DIST.
- Static scheduling scheme: exploit more parallelism and reduce idle time; obtain speedups of up to 2.6x on up to 2,048 cores.
- Hybrid programming paradigm: reduce memory overhead while obtaining similar parallel efficiency; utilize more cores per node and reduce factorization time.

SuperLU_DIST: steps to solution. The factorization proceeds in three stages:
1. Matrix preprocessing: static pivoting/scaling/permutation to improve numerical stability and to preserve sparsity.
2. Symbolic factorization: computation of the elimination tree (e-tree) and the structure of L and U, plus static communication/computation scheduling; supernodes (6-50 columns) enable efficient dense block operations.
3. Numerical factorization (dominates the total time): fan-out (right-looking, outer-product) on a 2D block-cyclic MPI process grid.
The solution is then computed with forward and backward substitutions.
[Figure: supernodal block partition mapped onto the 2D block-cyclic process grid]
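
To make the three-stage flow concrete, below is a minimal calling sketch in C patterned after the pddrive.c example distributed with SuperLU_DIST. The routine and structure names (superlu_gridinit, pdgssvx, ScalePermstruct_t, LUstruct_t) match the version 3.0 era of the library, but argument lists vary slightly across releases (e.g., LUstructInit later dropped its first argument), and the construction of the distributed matrix A and right-hand sides is elided.

```c
#include "superlu_ddefs.h"   /* SuperLU_DIST double-precision interface */

/* Solve A x = b on an nprow-by-npcol MPI process grid; A and b are assumed
   to be already set up as in EXAMPLE/pddrive.c. Returns the info code. */
int solve_with_superlu_dist(int nprow, int npcol, SuperMatrix *A,
                            double *b, int ldb, int nrhs)
{
    superlu_options_t options;
    ScalePermstruct_t ScalePermstruct;
    LUstruct_t        LUstruct;
    SOLVEstruct_t     SOLVEstruct;
    SuperLUStat_t     stat;
    gridinfo_t        grid;
    int               info;
    double           *berr = doubleMalloc_dist(nrhs); /* backward errors */

    /* Stage 3 layout: map the MPI processes onto a 2D process grid. */
    superlu_gridinit(MPI_COMM_WORLD, nprow, npcol, &grid);

    /* Defaults select static pivoting and a fill-reducing permutation
       (stage 1) and a single symbolic factorization (stage 2). */
    set_default_options_dist(&options);

    ScalePermstructInit(A->nrow, A->ncol, &ScalePermstruct);
    LUstructInit(A->nrow, A->ncol, &LUstruct);
    PStatInit(&stat);

    /* One call runs preprocessing, symbolic and numerical factorization,
       then the forward/backward substitutions for the nrhs columns of b. */
    pdgssvx(&options, A, &ScalePermstruct, b, ldb, nrhs,
            &grid, &LUstruct, &SOLVEstruct, berr, &stat, &info);

    PStatFree(&stat);
    SUPERLU_FREE(berr);
    superlu_gridexit(&grid);  /* LU/solve structures freed as in pddrive.c */
    return info;
}
```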

SuperLU_DIST: numerical factorization. Fan-out (right-looking) factorization:

for j = 1, 2, ..., n_s
    panel factorization (column and row):
        factor A(j,j) and isend to P_C(j) and P_R(j)
        wait for A(j,j), then factor A(j+1:n_s, j) and send to P_R(:)
        wait for A(j,j), then factor A(j, j+1:n_s) and send to P_C(:)
    trailing matrix update:
        update A(j+1:n_s, j+1:n_s)
end for

[Figure: successive trailing submatrices on the 2D block-cyclic process grid]
There is high parallelism and good load balance in the trailing matrix updates, where most of the computation time is spent.
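
Below is the same loop in schematic C with MPI, to show where the waits occur. factor_diag, factor_column_panel, factor_row_panel, update_trailing, and the owner/in_owner queries are hypothetical stand-ins for the library's internal kernels and grid lookups, and the plain broadcasts stand in for its pipelined sends.

```c
/* Schematic fan-out (right-looking) factorization over ns supernodal panels.
   col_comm and row_comm connect this process's column P_C and row P_R of
   the 2D grid. Each iteration is serialized by the panel broadcasts. */
void fanout_factorize(int ns, MPI_Comm col_comm, MPI_Comm row_comm)
{
    for (int j = 0; j < ns; j++) {
        if (in_owner_row(j) && in_owner_column(j))
            factor_diag(j);                            /* factor A(j,j) */

        /* the owning column and row wait for the diagonal block */
        MPI_Bcast(diag_block(j), diag_count(j), MPI_DOUBLE,
                  owner_grid_row(j), col_comm);
        MPI_Bcast(diag_block(j), diag_count(j), MPI_DOUBLE,
                  owner_grid_col(j), row_comm);

        if (in_owner_column(j)) factor_column_panel(j); /* L(j+1:ns, j) */
        if (in_owner_row(j))    factor_row_panel(j);    /* U(j, j+1:ns) */

        /* everyone waits for the panels before updating */
        MPI_Bcast(col_panel(j), col_panel_count(j), MPI_DOUBLE,
                  owner_grid_col(j), row_comm);
        MPI_Bcast(row_panel(j), row_panel_count(j), MPI_DOUBLE,
                  owner_grid_row(j), col_comm);

        /* local part of A(j+1:ns, j+1:ns) -= L(:,j) * U(j,:) */
        update_trailing(j);
    }
}
```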

The same fan-out loop, however, has a sequential flow and limited parallelism in the panel factorization: poor scheduling leaves processors idle.

SuperLU_DIST version 2.5 (released Nov. 2010) on the Cray XE6.
[Figure: factorization versus communication time on 8 to 2,048 cores: (c) accelerator (sym), n = 2.7M, fill-ratio = 12; (d) DNA (unsym), n = 445K, fill-ratio = 69]
Synchronization dominates on a large number of cores (e.g., up to 96% of the factorization time).

Outline (recap): Static scheduling scheme.

SuperLU_DIST: numerical factorization, revisited. In the fan-out loop, synchronization dominates on a large number of cores:
- processors in P_C(j) and P_R(j) wait for the diagonal factorization;
- all the processors wait for the panel factorization;
- yet, due to sparsity, many panels won't be updated by the remaining panels and are already ready to be factorized.

Look-ahead in SuperLU_DIST with a fixed window size n_w. At each step j, factorize all ready panels in the window to reduce the idle time of cores, overlap communication with computation, and exploit more parallelism:

for j = 1, 2, ..., n_s
    look-ahead row factorization:
        for k = j+1, j+2, ..., j+n_w do
            if U(k,k) has arrived on P_R(k), then factor A(k, k+1:n_s) and isend to P_C(:)
    synchronizations:
        wait for U(j,j) and factor A(j, j+1:n_s) if needed
        wait for L(:,j) and U(j,:)
    look-ahead column factorization:
        for k = j+1, j+2, ..., j+n_w do
            update A(:,k); if possible, factor A(k:n_s, k) and isend to P_R(:)
    trailing matrix update:
        update the remaining A(j+n_w+1:n_s, j+n_w+1:n_s)
end for

[Figure: look-ahead window on the 2D block-cyclic process grid]
With look-ahead alone, the factorization time is reduced by only about 10%: panels that are ready to factorize are often not in the window, so performance depends on the ordering of the panels.
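
A schematic of the windowed look-ahead in C is sketched below. The arrival test for U(k,k) is expressed with MPI_Test on a pending receive; every factor/update helper and readiness predicate is a hypothetical stand-in for the corresponding SuperLU_DIST kernel or bookkeeping.

```c
/* Fixed look-ahead window of size nw over ns panels. diag_recv_req[k] is
   the outstanding receive for the diagonal block U(k,k) on this process. */
void fanout_with_lookahead(int ns, int nw, MPI_Request *diag_recv_req)
{
    for (int j = 0; j < ns; j++) {
        /* look-ahead row factorization: any panel in the window whose
           diagonal block has already arrived can be factored early */
        for (int k = j + 1; k <= j + nw && k < ns; k++) {
            int arrived = 0;
            MPI_Test(&diag_recv_req[k], &arrived, MPI_STATUS_IGNORE);
            if (arrived && in_owner_row(k) && !row_panel_done(k)) {
                factor_row_panel(k);             /* U(k, k+1:ns) */
                isend_row_panel(k);              /* to P_C(:) */
            }
        }

        /* synchronizations on panel j itself */
        MPI_Wait(&diag_recv_req[j], MPI_STATUS_IGNORE);
        if (in_owner_row(j) && !row_panel_done(j)) factor_row_panel(j);
        wait_for_panels(j);                      /* L(:,j) and U(j,:) */

        /* look-ahead column factorization: apply panel j's update inside
           the window and factor any column whose updates are complete */
        for (int k = j + 1; k <= j + nw && k < ns; k++) {
            apply_update_to_column(j, k);        /* A(:,k) -= L(:,j) U(j,k) */
            if (column_ready(k) && in_owner_column(k)) {
                factor_column_panel(k);          /* L(k:ns, k) */
                isend_column_panel(k);           /* to P_R(:) */
            }
        }

        /* update the trailing blocks outside the window */
        update_trailing_outside_window(j, nw);   /* A(j+nw+1:ns, ...) */
    }
}
```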

Keeping track of dependencies in SuperLU_DIST: a directed acyclic graph (DAG).
- one node for each panel factorization task
- an edge (k, j) for each dependency k -> j
- sources/leaves = panels ready for factorization
[Figure: example DAG over panel tasks]
Unsymmetric case: symmetrically-pruned DAG. For each k, identify the smallest s_k such that U(k, s_k) and L(s_k, k) are the first symmetrically matched non-empty blocks; then add an edge (k, j) for every non-empty U(k, j) with j <= s_k, and an edge (k, i) for every non-empty L(i, k) with i <= s_k.
Symmetric case: the elimination tree (etree).
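
The pruning rule translates directly into code. Below is a small illustrative sketch (not the library's own code): the block-nonzero indicator arrays Unz/Lnz and the add_edge callback are assumptions made for the example.

```c
/* Build the symmetrically-pruned DAG over ns panels from the block-sparsity
   pattern. Unz[k*ns + j] != 0 means block U(k,j) is non-empty, and
   Lnz[i*ns + k] != 0 means block L(i,k) is non-empty; add_edge records a
   dependency in the task graph. For each panel k, s_k is the smallest
   index > k at which U(k, s_k) and L(s_k, k) are both non-empty (the first
   symmetric match); edges beyond s_k are pruned as redundant. */
void build_pruned_dag(int ns, const unsigned char *Unz,
                      const unsigned char *Lnz, void (*add_edge)(int, int))
{
    for (int k = 0; k < ns; k++) {
        int sk = ns - 1;          /* no symmetric match: keep every edge */
        for (int m = k + 1; m < ns; m++)
            if (Unz[k * ns + m] && Lnz[m * ns + k]) { sk = m; break; }

        for (int j = k + 1; j <= sk; j++)
            if (Unz[k * ns + j]) add_edge(k, j);   /* row dependency */
        for (int i = k + 1; i <= sk; i++)
            if (Lnz[i * ns + k]) add_edge(k, i);   /* column dependency */
    }
    /* panels with no incoming edges (sources) are ready to factorize */
}
```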

Symbolic factorization in SuperLU_DIST orders the columns in postorder of the etree (of A^T + A for unsymmetric A):
- larger supernodes
- same structures/dependencies of L and U
It sets up the data structures as supernodes are identified (in postorder) and uses the same postordering during the numerical factorization for data locality. This limits the number of ready panels in the window: look-ahead covers only a subtree.
[Figure: etree with nodes numbered in postorder]

Static scheduling in SuperLU_DIST.
Symmetric case: keep track of the dependencies using the etree; schedule the tasks from the leaves to the root, with higher priority on the nodes farther away from the root (a bottom-up topological ordering).
Unsymmetric case: using the etree of A^T + A over-estimates the dependencies of the unsymmetric factorization.
[Figure: etree with node priorities]
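
The bottom-up priority for the symmetric case can be computed directly from the parent array of the etree. The sketch below is illustrative (not the library's code): it assigns each node its distance from the root, so that deeper nodes, which should be issued first, get higher priority.

```c
/* depth[v] = distance of node v from the root of the etree; scheduling
   tasks by decreasing depth gives the bottom-up topological ordering.
   parent[v] == -1 marks the root. */
void etree_depths(const int *parent, int n, int *depth)
{
    for (int v = 0; v < n; v++) depth[v] = -1;   /* -1 = not yet known */
    for (int v = 0; v < n; v++) {
        /* climb until a node with a known depth (or past the root) */
        int u = v, steps = 0;
        while (u != -1 && depth[u] < 0) { u = parent[u]; steps++; }
        int base = (u == -1) ? -1 : depth[u];
        /* climb again, filling in depths along the path */
        int d = base + steps;
        u = v;
        while (u != -1 && depth[u] < 0) { depth[u] = d--; u = parent[u]; }
    }
}
```
Sorting the panel tasks by decreasing depth issues the leaves (the sources of work) first, which is exactly the "farther from the root = higher priority" rule above.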

Static scheduling in SuperLU_DIST, unsymmetric case: use the symmetrically-pruned DAG to track both the row and the column dependencies. Source nodes represent ready-to-be-factorized columns; statically schedule the tasks from the sources to the sinks, with higher priority on the nodes farther away from a sink.
[Figure: symmetrically-pruned DAG with node priorities]

SuperLU_DIST versions 2.5 and 3.0 on the Cray XE6.
[Figure: factorization/communication time of version 2.5 versus version 3.0 on 8 to 2,048 cores: (e) accelerator (sym), n = 2.7M, fill-ratio = 12; (f) DNA (unsym), n = 445K, fill-ratio = 68]
The idle time was significantly reduced (speedups of up to 2.6x). To further improve the performance: more sophisticated scheduling schemes and hybrid programming paradigms.

Outline (recap): Hybrid programming paradigm.

Hybrid programming in SuperLU_DIST. The computation is dominated by the trailing matrix updates, in which each MPI process updates independent supernodal blocks; OpenMP threads are used to update these blocks.
[Figure: assignment of threads to local blocks: (g) a 1D block partition if there are enough columns, (h) a 2D cyclic partition otherwise]
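
A minimal sketch of the threaded update follows, assuming each MPI process has gathered the list of its local supernodal blocks touched at step j. local_block_t, panel_t, and block_update are hypothetical names standing in for the library's internal types and its GEMM-based block kernel.

```c
#include <omp.h>

/* Each MPI process updates its independent local blocks of the trailing
   submatrix with OpenMP threads. Because the blocks are independent and
   unevenly sized, a dynamic schedule balances the per-thread work. */
void threaded_trailing_update(local_block_t *blocks, int nblocks,
                              const panel_t *Lj, const panel_t *Uj)
{
    #pragma omp parallel for schedule(dynamic)
    for (int b = 0; b < nblocks; b++) {
        /* A(I_b, J_b) -= L(I_b, j) * U(j, J_b): a dense GEMM-like kernel */
        block_update(&blocks[b], Lj, Uj);
    }
}
```
The (g)/(h) partitions above determine how the blocks are enumerated into this list: a 1D block partition when a panel has enough columns per thread, a 2D cyclic partition otherwise.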

Hybrid programming on top of SuperLU_DIST version 3.0: factorization times on 16 nodes of the Cray XE6.
[Figure: factorization time with 1, 2, and 4 threads per MPI process on 16 to 256 cores: (i) accelerator (sym), n = 2.7M, fill-ratio = 12; (j) DNA (unsym), n = 445K, fill-ratio = 68]
The hybrid paradigm utilizes more cores on a node while avoiding the MPI overheads (memory/time).

Final remarks:
- Static scheduling reduced the idle time, with speedups of up to 2.6x; it is available in version 3.0: http://crd-legacy.lbl.gov/~xiaoye/superlu.
- Hybrid programming reduced the MPI overheads, utilized more cores per node, and obtained speedups at the same node count.
- More results are available in the IPDPS'12 proceedings.
Thank you!