On the design of parallel linear solvers for large scale problems

On the design of parallel linear solvers for large scale problems
Journée problème de Poisson, IHP, Paris
M. Faverge, P. Ramet
M. Faverge, Assistant Professor, Bordeaux INP / LaBRI / Inria Bordeaux - Sud-Ouest
January 26, 2015

Context: Inria HiePACS team.
Objectives: contribute to the design of effective tools for frontier simulations arising from challenging research and industrial multi-scale applications, towards exascale computing.

1 Introduction

Introduction - Guideline
Introduction
Dense linear algebra
Sparse direct solver over runtimes: PaStiX (on homogeneous and on heterogeneous shared memory nodes)
Hybrid direct/sparse solver: MaPHyS
Conclusion

Introduction - What are the modern architectures?
Today software developers face systems with:
- 1 TFlop/s of compute power per node
- 32 cores or more per node, sometimes 100+ hardware threads
- highly heterogeneous architectures (cores + specialized cores + accelerators/co-processors)
- deep memory hierarchies
- and distributed memory on top of that
The consequences for our programming paradigms are system noise, load imbalance and overheads (below 70% of peak even on dense linear algebra).

Introduction - How to harness these devices productively?
Certainly not the way we did it so far. We need to shift our focus to:
- Power (forget GHz, think MW)
- Cost (forget FLOPS, think data movement)
- Concurrency (not linear anymore)
- Memory scaling (compute grows twice as fast as memory bandwidth)
- Heterogeneity (it is not going away)
- Reliability (the machine is supposed to fail one day)

Introduction - A short history of computing paradigms

Introduction - Task-based programming
- Focus on concepts, data dependencies and flows
- Don't develop for an architecture, but for an impermeable portability layer
- Let the runtime deal with the hardware characteristics
- But provide some flexibility if possible
Examples: StarSs, Swift, Parallax, QUARK, XKaapi, StarPU, PaRSEC, ...

Introduction - Task-based programming
Workflows, bags of tasks and grids are similar concepts, but at another level of granularity (here we are talking about tens of microseconds).
Time constraint: it reduces the opportunity for complex dynamic scheduling strategies, but increases the probability of doing the wrong thing.
Separation of concerns: each layer has its own issues. Humans focus on describing the flow of data between tasks; the compiler focuses on the tasks, i.e. problems that fit in one memory level; and the runtime orchestrates the flows to maximize the throughput of the architecture.

2 Dense linear algebra

Dense linear algebra - From sequential code to DAG: sequential code for the QR factorization (listing shown on slide).

Dense linear algebra - From sequential code to DAG: the sequential C code is annotated through a specific syntax: an Insert Task call with access modes (INOUT, OUTPUT, INPUT), regions (REGION L, REGION U, REGION D, ...) and LOCALITY hints.
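As an illustration of what such annotated sequential code looks like, here is a minimal sketch of a sequential task flow for a tile Cholesky factorization (the same pattern applies to the QR code of the previous slide). The INSERT_TASK macro and the stub kernels are hypothetical placeholders for a runtime's insert-task call and for the LAPACK/BLAS tile kernels; the access-mode comments carry the information a real runtime (QUARK, StarPU, ...) would use to build the DAG.

/* Minimal sequential-task-flow sketch: tile Cholesky factorization.
 * INSERT_TASK is a hypothetical stand-in for a runtime's insert-task call; it
 * runs the kernel immediately so the file stays self-contained, whereas a real
 * runtime would record the access modes (INPUT/INOUT) and execute the tasks
 * asynchronously once their dependencies are satisfied. */
#include <stdio.h>

typedef struct { double *a; int nb; } tile_t;   /* one nb x nb tile (storage omitted) */

/* Stub kernels: a real code would call LAPACK/BLAS (dpotrf, dtrsm, dsyrk, dgemm). */
static void potrf(tile_t *Akk)                           { printf("POTRF\n"); (void)Akk; }
static void trsm (tile_t *Akk, tile_t *Aik)              { printf("TRSM\n");  (void)Akk; (void)Aik; }
static void syrk (tile_t *Aik, tile_t *Aii)              { printf("SYRK\n");  (void)Aik; (void)Aii; }
static void gemm (tile_t *Aik, tile_t *Ajk, tile_t *Aij) { printf("GEMM\n");  (void)Aik; (void)Ajk; (void)Aij; }

/* "Task insertion": immediate execution in this sketch. */
#define INSERT_TASK(kernel, ...) kernel(__VA_ARGS__)

int main(void)
{
    enum { NT = 4 };              /* number of tile rows/columns */
    static tile_t A[NT][NT];      /* the tiled matrix */

    for (int k = 0; k < NT; k++) {
        INSERT_TASK(potrf, &A[k][k]);                /* INOUT A[k][k]                */
        for (int i = k + 1; i < NT; i++)
            INSERT_TASK(trsm, &A[k][k], &A[i][k]);   /* INPUT A[k][k], INOUT A[i][k] */
        for (int i = k + 1; i < NT; i++) {
            INSERT_TASK(syrk, &A[i][k], &A[i][i]);   /* INPUT A[i][k], INOUT A[i][i] */
            for (int j = k + 1; j < i; j++)
                INSERT_TASK(gemm, &A[i][k], &A[j][k], &A[i][j]);  /* INPUT, INPUT, INOUT */
        }
    }
    /* wait_for_all_tasks();  -- barrier a real runtime would provide */
    return 0;
}

In a runtime-based version, each INSERT_TASK call becomes the runtime's insert-task call carrying the same access modes (e.g. starpu_task_insert in StarPU), and the final barrier becomes the runtime's wait-for-all primitive.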

Dense linear algebra - From sequential code to DAG: automatic dataflow analysis.

Dense linear algebra - Parameterized Task Graph (PTG) vs Sequential Task Flow (STF)
Advantages of PTG:
- Local view of the DAG: no need to unroll the full graph
- Full knowledge of the inputs/outputs of each task
- Strong separation of concerns (data/task distribution vs algorithm)
Advantages of STF:
- Knowledge of the full graph for scheduling strategies
- Handles dynamic representations of the algorithm; the submission process can be postponed to depend on data values (convergence criteria)
- Simpler to implement than PTG

Dense linear algebra - QR factorization with Chameleon (StarPU) on a heterogeneous shared memory node. [Figure: performance (GFlop/s) vs. matrix order up to 40000, in single and double precision, for configurations from 1 GPU + 1 CPU to 4 GPUs + 4 CPUs.]

Dense linear algebra - QR factorization with Chameleon (StarPU) on a heterogeneous shared memory node. [Figure: same experiment with an additional 4 GPUs + 16 CPUs configuration.] The 12 additional CPU cores bring about +200 GFlop/s, while 12 cores alone deliver about 150 GFlop/s.

Dense linear algebra - QR factorization with DPLASMA (PaRSEC) on distributed homogeneous memory nodes (60 x 8 cores, theoretical peak 4358.4 GFlop/s). [Figure: performance (GFlop/s) of ScaLAPACK, [BBD+10], [SLHD10] and HQR for tall and skinny matrices: varying N with M = 67,200, and varying M with N = 4,480.]

Dense linear algebra - Libraries for solving large dense linear systems
Chameleon (http://project.inria.fr/chameleon): Inria HiePACS/Runtime; StarPU runtime; STF programming model; smart/complex scheduling.
DPLASMA (http://icl.cs.utk.edu/dplasma): ICL, University of Tennessee; PaRSEC runtime; PTG model; simple scheduling based on data locality.
What about irregular problems such as sparse solvers?

3 Sparse direct solver over runtimes: PaStiX

3.1 Sparse direct solver over runtimes: PaStiX On homogeneous shared memory nodes

PaStiX on homogeneous shared memory nodes - Task structure. [Figure: (a) dense tile task decomposition into POTRF, TRSM, SYRK and GEMM kernels; (b) decomposition of these tasks applied while processing one sparse panel.]

PaStiX on homogeneous shared memory nodes - DAG representation. [Figure: (c) dense DAG and (d) sparse DAG representation of a sparse LDL^T factorization; the nodes are POTRF, TRSM, SYRK, GEMM and panel tasks connected by their data dependencies.]

PaStiX on homogeneous shared memory nodes - Supernodal sequential algorithm

forall supernodes S1 do
    panel(S1);                       /* update of the panel */
    forall extra-diagonal blocks Bi of S1 do
        S2 = supernode_in_front_of(Bi);
        gemm(S1, S2);                /* sparse GEMM: B_{k, k>=i} Bi^T subtracted from S2 */
    end
end

PaStiX on homogeneous shared memory nodes - StarPU task submission

forall supernodes S1 do
    submit_panel(S1);                /* update of the panel */
    forall extra-diagonal blocks Bi of S1 do
        S2 = supernode_in_front_of(Bi);
        submit_gemm(S1, S2);         /* sparse GEMM: B_{k, k>=i} Bi^T subtracted from S2 */
    end
end
wait_for_all_tasks();
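For readers unfamiliar with StarPU, the sketch below shows roughly how such a submission loop can be expressed in C with the insert-task interface. The supernode_t type, its fields and the kernel bodies are hypothetical placeholders (the real PaStiX data structures and kernels differ); only the StarPU calls themselves (codelets, starpu_task_insert, starpu_task_wait_for_all) are actual API, and details may vary across StarPU versions.

/* Hedged sketch of a PaStiX-style submission loop using StarPU's insert-task
 * interface. Data layout, kernels and supernode_t are placeholders. */
#include <starpu.h>

typedef struct {
    starpu_data_handle_t handle;   /* handle on the supernode's coefficient array          */
    int  nbr_blocks;               /* number of extra-diagonal blocks                       */
    int *facing;                   /* facing[b]: index of the supernode updated by block b  */
} supernode_t;

/* CPU implementations (bodies omitted in this sketch). */
static void panel_cpu(void *buffers[], void *cl_arg) { (void)buffers; (void)cl_arg; }
static void gemm_cpu (void *buffers[], void *cl_arg) { (void)buffers; (void)cl_arg; }

static struct starpu_codelet panel_cl = {
    .cpu_funcs = { panel_cpu },
    .nbuffers  = 1,
    .modes     = { STARPU_RW },
};
static struct starpu_codelet gemm_cl = {
    .cpu_funcs = { gemm_cpu },
    .nbuffers  = 2,
    .modes     = { STARPU_R, STARPU_RW },   /* read S1, update S2 */
};

void factorize(supernode_t *cblk, int cblknbr)
{
    for (int k = 0; k < cblknbr; k++) {
        supernode_t *S1 = &cblk[k];
        starpu_task_insert(&panel_cl, STARPU_RW, S1->handle, 0);
        for (int b = 0; b < S1->nbr_blocks; b++) {
            supernode_t *S2 = &cblk[S1->facing[b]];
            starpu_task_insert(&gemm_cl,
                               STARPU_VALUE, &b, sizeof(b),   /* which block to apply */
                               STARPU_R,  S1->handle,
                               STARPU_RW, S2->handle,
                               0);
        }
    }
    starpu_task_wait_for_all();   /* dependencies were inferred from the access modes */
}

A complete program would also call starpu_init()/starpu_shutdown() and register each supernode's coefficient array as a StarPU data handle before submitting tasks.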

PaStiX on homogeneous shared memory nodes - PaRSEC's parameterized task graph. [Figure: excerpt of the task graph with panel and gemm nodes.] Panel factorization in JDF format:

panel(j)

/* Execution Space */
j = 0 .. cblknbr-1

/* Task Locality (Owner Compute) */
: A(j)

/* Data dependencies */
RW A <- (leaf) ? A(j) : C gemm(lastbrow)
     -> A gemm(firstblock+1 .. lastblock)
     -> A(j)

PaStiX on homogeneous shared memory nodes - Matrices and machine

Table: matrix description (Z: double complex, D: double).

Matrix    Prec  Method  Size    nnz(A)  nnz(L)    TFlop
FilterV2  Z     LU      0.6e+6  12e+6    536e+6    3.6
Flan      D     LL^T    1.6e+6  59e+6   1712e+6    5.3
Audi      D     LL^T    0.9e+6  39e+6   1325e+6    6.5
MHD       D     LU      0.5e+6  24e+6   1133e+6    6.6
Geo1438   D     LL^T    1.4e+6  32e+6   2768e+6   23
Pmldf     Z     LDL^T   1.0e+6   8e+6   1105e+6   28
Hook      D     LU      1.5e+6  31e+6   4168e+6   35
Serena    D     LDL^T   1.4e+6  32e+6   3365e+6   47

Machine  Processors                               Frequency  GPUs               RAM
Mirage   Westmere Intel Xeon X5650 (2 x 6 cores)  2.67 GHz   Tesla M2070 (x 3)  36 GB

PaStiX on homogeneous shared memory nodes - CPU scaling study: GFlop/s during numerical factorization. [Figure: performance of the native PaStiX scheduler and of the StarPU and PaRSEC versions on 1, 3, 6, 9 and 12 cores, for FilterV2 (Z, LU), Flan (D, LL^T), Audi (D, LL^T), MHD (D, LU), Geo1438 (D, LL^T), Pmldf (Z, LDL^T), Hook (D, LU) and Serena (D, LDL^T).]

3.2 Sparse direct solver over runtimes: PaStiX On heterogeneous shared memory nodes

PaStiX on heterogeneous shared memory nodes - Why use GPUs on sparse problems? [Figure: two panels P1 and P2 and their compacted storage in memory.] The updates are compute intensive, but they are small, operate on non-contiguous data, and have rectangular shapes.

PaStiX on heterogeneous shared memory nodes - GPU kernel performance. [Figure: DGEMM kernel performance (GFlop/s vs. matrix size M, with N = K = 128) for three implementations using 1 stream: the cuBLAS library, the ASTRA framework, and the sparse adaptation of the ASTRA framework, compared against the cuBLAS peak.]

PaStiX on heterogeneous shared memory nodes - GPU kernel performance, 2 streams. [Figure: same comparison with 1 and 2 streams per implementation.]

PaStiX on heterogeneous shared memory nodes - GPU kernel performance, 3 streams. [Figure: same comparison with 1, 2 and 3 streams per implementation.]
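The stream counts compared in these figures amount to submitting several independent small DGEMM updates concurrently so the GPU stays busy. The sketch below (plain CUDA/cuBLAS, not the PaStiX GPU kernel) illustrates the mechanism: one cuBLAS handle, several streams, and updates dispatched round-robin; the matrix sizes and the number of updates are arbitrary, and the buffers are only zero-initialized.

/* Hedged sketch: overlapping many small, rectangular DGEMM updates (C -= A*B^T)
 * by cycling over several CUDA streams. Illustration only. */
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const int M = 4096, N = 128, K = 128;       /* tall, skinny updates as in the plot */
    const int NSTREAMS = 3, NUPDATES = 32;
    const double alpha = -1.0, beta = 1.0;

    cublasHandle_t handle;
    cublasCreate(&handle);

    cudaStream_t streams[3];
    for (int s = 0; s < NSTREAMS; s++)
        cudaStreamCreate(&streams[s]);

    /* One set of device buffers per stream so concurrent kernels do not alias. */
    double *dA[3], *dB[3], *dC[3];
    for (int s = 0; s < NSTREAMS; s++) {
        cudaMalloc((void **)&dA[s], (size_t)M * K * sizeof(double));
        cudaMalloc((void **)&dB[s], (size_t)N * K * sizeof(double));
        cudaMalloc((void **)&dC[s], (size_t)M * N * sizeof(double));
        cudaMemset(dA[s], 0, (size_t)M * K * sizeof(double));
        cudaMemset(dB[s], 0, (size_t)N * K * sizeof(double));
        cudaMemset(dC[s], 0, (size_t)M * N * sizeof(double));
    }

    /* Submit the updates round-robin over the streams. */
    for (int u = 0; u < NUPDATES; u++) {
        int s = u % NSTREAMS;
        cublasSetStream(handle, streams[s]);
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, M, N, K,
                    &alpha, dA[s], M, dB[s], N, &beta, dC[s], M);
    }
    cudaDeviceSynchronize();
    printf("submitted %d DGEMM updates on %d streams\n", NUPDATES, NSTREAMS);

    for (int s = 0; s < NSTREAMS; s++) {
        cudaFree(dA[s]); cudaFree(dB[s]); cudaFree(dC[s]);
        cudaStreamDestroy(streams[s]);
    }
    cublasDestroy(handle);
    return 0;
}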

PaStiX on heterogeneous shared memory nodes - GPU scaling study: GFlop/s for numerical factorization. [Figure: performance of the native scheduler (CPU only), StarPU (CPU only, 1, 2 and 3 GPUs), PaRSEC with 1 stream (CPU only, 1, 2 and 3 GPUs) and PaRSEC with 3 streams (1, 2 and 3 GPUs), on the same eight matrices as in the CPU study.]

4 Hybrid direct/sparse solver: MaPHyS

MaPHyS - Sparse linear solvers
Goal: solve Ax = b, where A is sparse.
The usual trade-off:
- Direct methods: robust and accurate for general problems, BLAS-3 based implementations; but memory and CPU requirements are prohibitive for large 3D problems, and weak scalability is limited.
- Iterative methods: efficiency and accuracy are problem dependent; sparse computational kernels; lower memory requirements and possibly faster; potentially high weak scalability.

MaPHyS - Hybrid linear solvers
Goals: develop robust, scalable, parallel hybrid direct/iterative linear solvers; exploit the efficiency and robustness of sparse direct solvers; develop robust parallel preconditioners for iterative solvers; take advantage of the naturally scalable parallel implementation of iterative solvers.
Domain decomposition (DD): a natural approach for PDEs, extended here to general sparse matrices. Partition the problem into subdomains (subgraphs), use a direct solver on the subdomains, and a robust preconditioned iterative solver on the interface.

MaPHyS - Preconditioning methods: global algebraic view.
Global hybrid decomposition (I: interior unknowns, Γ: interface unknowns):
$$A = \begin{pmatrix} A_{II} & A_{I\Gamma} \\ A_{\Gamma I} & A_{\Gamma\Gamma} \end{pmatrix}$$
Global Schur complement:
$$S = A_{\Gamma\Gamma} - A_{\Gamma I} A_{II}^{-1} A_{I\Gamma}$$
[Figure: partition of a graph into four subdomains Ω1, ..., Ω4 with 88 cut edges defining the interface Γ, and the corresponding reordered matrix pattern (nz = 1920).]
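To connect the two formulas above with the parallel solve described below, recall the block elimination that produces the reduced interface system (standard algebra, not specific to MaPHyS):

% Block elimination of the interior unknowns x_I from A x = b:
%   A_{II} x_I + A_{I\Gamma} x_\Gamma = b_I
%   A_{\Gamma I} x_I + A_{\Gamma\Gamma} x_\Gamma = b_\Gamma
% Solving the first equation for x_I and substituting into the second gives
\begin{align}
  x_I &= A_{II}^{-1}\left(b_I - A_{I\Gamma}\, x_\Gamma\right), \\
  \underbrace{\left(A_{\Gamma\Gamma} - A_{\Gamma I} A_{II}^{-1} A_{I\Gamma}\right)}_{S}\, x_\Gamma
      &= \underbrace{b_\Gamma - A_{\Gamma I} A_{II}^{-1} b_I}_{f}.
\end{align}
% Hence: factor A_{II} with a direct solver, solve S x_\Gamma = f iteratively
% on the interface, then recover the interior unknowns by back-substitution.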

MaPHyS - Preconditioning methods: local algebraic view.
Local hybrid decomposition:
$$A^{(i)} = \begin{pmatrix} A_{I_i I_i} & A_{I_i \Gamma_i} \\ A_{\Gamma_i I_i} & A^{(i)}_{\Gamma\Gamma} \end{pmatrix}$$
Local Schur complement:
$$S^{(i)} = A^{(i)}_{\Gamma\Gamma} - A_{\Gamma_i I_i} A_{I_i I_i}^{-1} A_{I_i \Gamma_i}$$
Algebraic additive Schwarz preconditioner:
$$M = \sum_{i=1}^{N} R_{\Gamma_i}^T \left( S^{(i)} \right)^{-1} R_{\Gamma_i}$$
[Figure: same four-subdomain partition and reordered matrix pattern as on the previous slide.]

MaPHyS - Preconditioning methods: parallel implementation.
Each subdomain matrix $A^{(i)}$ (with blocks $A_{I_i I_i}$, $A_{I_i \Gamma_i}$, $A_{\Gamma_i I_i}$, $A^{(i)}_{\Gamma\Gamma}$) is handled by one CPU core.
Concurrent partial factorizations are performed on each CPU core to form the so-called local Schur complement $S^{(i)} = A^{(i)}_{\Gamma\Gamma} - A_{\Gamma_i I_i} A_{I_i I_i}^{-1} A_{I_i \Gamma_i}$.
The reduced system $S x_\Gamma = f$ is solved using a distributed Krylov solver:
- one matrix-vector product per iteration: each CPU core computes $S^{(i)} (x_\Gamma^{(i)})^k = (y^{(i)})^k$;
- one local preconditioner application: $(M^{(i)}) (z^{(i)})^k = (r^{(i)})^k$;
- local neighbor-to-neighbor communication per iteration;
- global reductions (dot products).
Finally, the solution for the interior unknowns is computed concurrently: $A_{I_i I_i} x_{I_i} = b_{I_i} - A_{I_i \Gamma_i} x_{\Gamma_i}$.
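As a rough illustration of the distributed kernels listed above, the sketch below shows, in C with MPI, the per-iteration building blocks on each core: the local Schur matrix-vector product followed by interface assembly, the additive Schwarz preconditioner application, and a global dot product. Every name (domain_t, local_schur_matvec, assemble_interface, apply_local_schur_inverse) is a hypothetical placeholder, not the MaPHyS API; only the MPI call is real.

/* Hedged sketch: per-iteration building blocks of a distributed Krylov solve
 * on the interface system S x_Gamma = f. Placeholders only. */
#include <mpi.h>

typedef struct {
    int     n_gamma;      /* number of local interface unknowns            */
    double *S_local;      /* dense local Schur complement S^(i), n_gamma^2 */
} domain_t;

/* y = S^(i) x : dense matrix-vector product on the local Schur complement. */
static void local_schur_matvec(const domain_t *d, const double *x, double *y)
{
    for (int r = 0; r < d->n_gamma; r++) {
        double s = 0.0;
        for (int c = 0; c < d->n_gamma; c++)
            s += d->S_local[r * d->n_gamma + c] * x[c];
        y[r] = s;
    }
}

/* Placeholder: a real code exchanges and sums the entries of y shared with
 * neighbouring subdomains (point-to-point messages to neighbours only). */
static void assemble_interface(const domain_t *d, double *y, MPI_Comm comm)
{
    (void)d; (void)y; (void)comm;
}

/* Placeholder: triangular solves with the factorized (assembled) local Schur
 * complement used as preconditioner; here r is simply copied. */
static void apply_local_schur_inverse(const domain_t *d, const double *r, double *z)
{
    for (int k = 0; k < d->n_gamma; k++) z[k] = r[k];
}

/* One Krylov iteration's worth of parallel kernels. */
void iteration_kernels(const domain_t *d, const double *x, double *y,
                       const double *r, double *z, MPI_Comm comm)
{
    local_schur_matvec(d, x, y);        /* S^(i) x^(i)                      */
    assemble_interface(d, y, comm);     /* neighbour-to-neighbour exchange  */
    apply_local_schur_inverse(d, r, z); /* additive Schwarz: local solve    */
    assemble_interface(d, z, comm);     /* sum overlapping contributions    */

    /* Global reduction, e.g. for the dot products of CG/GMRES. Shared
     * interface entries must be weighted to avoid double counting (omitted). */
    double local_dot = 0.0, global_dot = 0.0;
    for (int k = 0; k < d->n_gamma; k++) local_dot += r[k] * z[k];
    MPI_Allreduce(&local_dot, &global_dot, 1, MPI_DOUBLE, MPI_SUM, comm);
    (void)global_dot;                   /* used by the Krylov recurrence    */
}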

MaPHyS - Software: modular implementation and bottlenecks
Sequential partitioners: Scotch [Pellegrini et al.], METIS [G. Karypis and V. Kumar]
Sparse direct solvers (one-level parallelism): MUMPS [P. Amestoy et al.] (with Schur option), PaStiX [P. Ramet et al.] (with Schur option)
Iterative solvers: CG/GMRES/FGMRES [V. Fraysse and L. Giraud]

MaPHyS - Software: MPI + threads on Hopper@NERSC with the Nachos4M test case, time. [Figure: time (s) vs. number of cores, from 768 to 24576, for 3, 6, 12 and 24 threads per MPI process.]

MaPHyS - Software: MPI + threads on Hopper@NERSC with the Nachos4M test case, memory. [Figure: memory peak (MB) vs. number of cores, from 768 to 24576, for 3, 6, 12 and 24 threads per MPI process.]

5 Conclusion

Conclusion - Sparse direct solver PaStiX: main features
- LL^T, LDL^T, LU: supernodal implementation (BLAS-3)
- Static pivoting + iterative refinement: CG/GMRES/BiCGstab
- Column-block or block mapping
- Single/double precision, real/complex arithmetic
- Support for external ordering libraries ((PT-)Scotch, METIS, ...)
- MPI/threads (cluster/multicore/SMP/NUMA)
- Multiple GPUs using DAG runtimes (X. Lacoste's PhD defense on February 18th, 2015)
- Multiple right-hand sides (direct factorization)
- Incomplete factorization with ILU(k) preconditioner
- Schur complement computation
- Interfaces: C++/Fortran/Python/PETSc/Trilinos/FreeFem/...

Conclusion - Sparse direct solver PaStiX: what next?
- Exploiting DAG runtimes on distributed memory architectures and other accelerators (Intel Xeon Phi)
- Ordering methods for the supernodes: reduce the number of extra-diagonal blocks
- H-matrix arithmetic (A. Buttari's talk), in collaboration with Stanford University

Conclusion - Sparse hybrid solver MaPHyS: current status and future
- Sequential version: original MaPHyS version
- Multi-threaded implementation: integrated during S. Nakov's PhD
- Hybrid MPI / multi-threaded implementation: is it worth it?
- Task-flow programming model: L. Poirel's PhD, started in October 2014
- Non-blocking algorithms (W. Vanroose's talk)
- Exploiting H-matrices for the Schur complement

Conclusion - Software
Graph/mesh partitioning and ordering: Scotch, http://scotch.gforge.inria.fr
Sparse linear system solvers: PaStiX, http://pastix.gforge.inria.fr
MaPHyS: https://wiki.bordeaux.inria.fr/maphys/doku.php

Conclusion - Software
Chameleon: http://icl.cs.utk.edu/projectsdev/morse, http://project.inria.fr/chameleon/
DPLASMA: http://icl.cs.utk.edu/dplasma/index.html

Thanks!