On the design of parallel linear solvers for large scale problems


1 On the design of parallel linear solvers for large scale problems. Journée problème de Poisson, IHP, Paris. M. Faverge, P. Ramet. M. Faverge, Assistant Professor, Bordeaux INP, LaBRI, Inria Bordeaux - Sud-Ouest. January 26, 2015

2 Context Inria HiePACS team Objectives: contribute to the design of effective tools for frontier simulations arising from challenging research and industrial multi-scale applications, towards exascale computing

3 1 Introduction

4 Introduction Guideline: Introduction; Dense linear algebra; Sparse direct solver over runtimes: PaStiX (on homogeneous shared memory nodes, on heterogeneous shared memory nodes); Hybrid direct/sparse solver: MaPHyS; Conclusion

5 Introduction What are the modern architectures? Today software developers face systems with: 1 TFlop/s of compute power per node; 32 cores or more per node, sometimes 100+ hardware threads; highly heterogeneous architectures (cores + specialized cores + accelerators/co-processors); deep memory hierarchies; and all of this distributed. The consequences for our programming paradigms are system noise, load imbalance, and overheads (< 70% of peak on dense linear algebra)

6 Introduction How to harness these devices productively? Certainly not the way we did it so far. We need to shift our focus to: Power (forget GHz, think MW); Cost (forget FLOPS, think data movement); Concurrency (not linear anymore); Memory scaling (compute grows twice as fast as memory bandwidth); Heterogeneity (it's not going away); Reliability (it is supposed to fail one day).

7 Introduction A short history of computing paradigms

12 Introduction Task-based programming Focus on concepts, data dependencies and flows. Don't develop for an architecture but for an impermeable portability layer. Let the runtime deal with the hardware characteristics, but provide some flexibility if possible. StarSs, Swift, Parallax, Quark, XKaapi, StarPU, PaRSEC,...

13 Introduction Task-based programming Workflows, bags-of-tasks, and grids are similar concepts, but at another level of granularity (we're talking tens of microseconds). Time constraint: it reduces the opportunity for complex dynamic scheduling strategies, but increases the probability of doing the wrong thing. Separation of concerns: each layer with its own issues. Humans focus on depicting the flow of data between tasks; the compiler focuses on tasks, problems that fit in one memory level; and a runtime orchestrates the flows to maximize the throughput of the architecture

14 2 Dense linear algebra

15 Dense linear algebra From sequential code to DAG Sequential code for QR factorization

16 Dense linear algebra From sequential code to DAG Sequential C code, annotated through some specific syntax: Insert_Task; INOUT, OUTPUT, INPUT; REGION_L, REGION_U, REGION_D,...; LOCALITY
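The insert-task annotations above are enough for a runtime to rebuild the DAG: tasks are submitted in program order and dependency edges are inferred from the declared access modes. A minimal sketch of that inference in Python (the class and names are illustrative, not the actual StarPU API):

```python
# Sequential-task-flow dependency inference: edges follow from access modes
# (read-after-write, write-after-read, write-after-write) in submission order.
INPUT, INOUT = "R", "RW"

class Runtime:
    def __init__(self):
        self.last_writer = {}   # data handle -> last task that wrote it
        self.readers = {}       # data handle -> readers since that write
        self.edges = []         # inferred dependency edges (pred, succ)
        self.counter = 0

    def insert_task(self, name, *accesses):
        tid = (name, self.counter)
        self.counter += 1
        for handle, mode in accesses:
            if mode == INOUT:
                for r in self.readers.pop(handle, []):   # write-after-read
                    self.edges.append((r, tid))
                if handle in self.last_writer:           # write-after-write
                    self.edges.append((self.last_writer[handle], tid))
                self.last_writer[handle] = tid
            else:
                if handle in self.last_writer:           # read-after-write
                    self.edges.append((self.last_writer[handle], tid))
                self.readers.setdefault(handle, []).append(tid)
        return tid

rt = Runtime()
t0 = rt.insert_task("panel", ("A0", INOUT))
t1 = rt.insert_task("gemm", ("A0", INPUT), ("A1", INOUT))
t2 = rt.insert_task("panel", ("A1", INOUT))
assert (t0, t1) in rt.edges and (t1, t2) in rt.edges
```

The gemm is correctly ordered after the panel that produced A0, and the next panel after the gemm that updates A1, with no explicit dependency declared by the programmer.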

17 Dense linear algebra From sequential code to DAG Automatic dataflow analysis

19 Dense linear algebra Parameterized Task Graph (PTG) vs Sequential Task Flow (STF) Advantages of PTG: local view of the DAG; no need to unroll the full graph; full knowledge of input/output for each task; strong separation of concerns (data/task distribution, algorithm). Advantages of STF: knowledge of the full graph for scheduling strategies; handles dynamic representations of the algorithm; the submission process can be postponed to depend on data values (convergence criteria); simpler to implement than PTG
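The "local view" advantage of PTG can be made concrete: when dependencies are closed-form rules over task parameters, any process can query the predecessors and successors of any task without ever materializing the graph. A toy sketch for a 1-D chain of panel/gemm tasks (illustrative rules, not PaStiX's actual graph):

```python
# Parameterized task graph sketch: the DAG is never unrolled; the
# neighbourhood of any task is evaluated on demand from its parameters.
N = 4  # chain of N panels (toy 1-D elimination)

def predecessors(task):
    kind, k = task
    if kind == "panel":
        # panel(k) consumes the update produced by gemm(k), except the root
        return [("gemm", k)] if k > 0 else []
    # gemm(k) consumes the factorized panel(k-1)
    return [("panel", k - 1)]

def successors(task):
    kind, k = task
    if kind == "panel":
        return [("gemm", k + 1)] if k + 1 < N else []
    return [("panel", k)]

# Any process can evaluate its local view of the DAG in O(1):
assert predecessors(("panel", 2)) == [("gemm", 2)]
assert successors(("gemm", 2)) == [("panel", 2)]
```

An STF runtime, by contrast, only discovers these edges by replaying the submission loop, which is what gives it the full graph but also the unrolling cost.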

20 Dense linear algebra QR factorization with Chameleon (StarPU) on a heterogeneous shared memory node. [Figure: GFlop/s vs. matrix order for 1 to 4 GPUs + CPUs, in single and double precision.]

21 Dense linear algebra QR factorization with Chameleon (StarPU) on a heterogeneous shared memory node. [Figure: GFlop/s vs. matrix order for 1 to 4 GPUs plus 4 or 16 CPUs, in single and double precision; for reference, 12 cores alone reach 150 GFlop/s.]

22 Dense linear algebra QR factorization with DPLASMA (PaRSEC) on distributed homogeneous memory nodes (60 nodes, 8 cores each). [Figure: performance (GFlop/s) vs. N (M=67,200) and vs. M (N=4,480) for ScaLAPACK, [BBD+10], [SLHD10], and HQR, against the theoretical peak.]

23 Dense linear algebra Libraries for solving large dense linear systems Chameleon (Inria HiePACS/Runtime): StarPU runtime; STF programming model; smart/complex scheduling. DPLASMA (ICL, University of Tennessee): PaRSEC runtime; PTG model; simple scheduling based on data locality.

24 Dense linear algebra Libraries for solving large dense linear systems (same slide as above) What about irregular problems such as sparse solvers?

25 3 Sparse direct solver over runtimes: PaStiX

26 3.1 Sparse direct solver over runtimes: PaStiX On homogeneous shared memory nodes

27 Sparse direct solver over runtimes: PaStiX On homogeneous shared memory nodes Tasks structure: POTRF, TRSM, SYRK, GEMM. (a) Dense tile task decomposition; (b) decomposition of the tasks applied while processing one panel.
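The four kernels named above compose the tiled Cholesky factorization in the dense case; a compact NumPy sketch, serial but with each kernel call corresponding to one task of the DAG:

```python
import numpy as np

def tiled_cholesky(A, nb):
    """Right-looking tiled Cholesky: each kernel call is one DAG task."""
    n = A.shape[0]
    T = n // nb                         # assume n is a multiple of nb
    L = np.tril(A).astype(float)
    blk = lambda i, j: (slice(i*nb, (i+1)*nb), slice(j*nb, (j+1)*nb))
    for k in range(T):
        kk = blk(k, k)
        L[kk] = np.linalg.cholesky(L[kk])                 # POTRF on diagonal tile
        for i in range(k+1, T):
            ik = blk(i, k)
            L[ik] = np.linalg.solve(L[kk], L[ik].T).T     # TRSM on panel tiles
        for i in range(k+1, T):
            ik = blk(i, k)
            L[blk(i, i)] -= L[ik] @ L[ik].T               # SYRK: trailing diagonal
            for j in range(k+1, i):
                L[blk(i, j)] -= L[ik] @ L[blk(j, k)].T    # GEMM: trailing off-diag
    return np.tril(L)

rng = np.random.default_rng(0)
B = rng.standard_normal((8, 8))
A = B @ B.T + 8 * np.eye(8)                               # SPD test matrix
assert np.allclose(tiled_cholesky(A, 2), np.linalg.cholesky(A))
```

In the task-based versions, each POTRF/TRSM/SYRK/GEMM invocation above becomes a submitted task, and the loop structure is exactly what the runtime's dataflow analysis turns into the dense DAG of the figure.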

28 Sparse direct solver over runtimes: PaStiX On homogeneous shared memory nodes DAG representation. [Figure: (c) dense DAG of POTRF/TRSM/SYRK/GEMM tasks; (d) sparse DAG representation of a sparse LDL^T factorization, with panel and gemm tasks linked by data dependencies.]

29 Sparse direct solver over runtimes: PaStiX On homogeneous shared memory nodes Supernodal sequential algorithm:

forall the supernodes S1 do
    panel(S1);                    /* update of the panel */
    forall the extra-diagonal blocks Bi of S1 do
        S2 := supernode_in_front_of(Bi);
        gemm(S1, S2);             /* sparse GEMM: B_k (k >= i) times B_i^T, subtracted from S2 */
    end
end

30 Sparse direct solver over runtimes: PaStiX On homogeneous shared memory nodes StarPU tasks submission:

forall the supernodes S1 do
    submit_panel(S1);             /* update of the panel */
    forall the extra-diagonal blocks Bi of S1 do
        S2 := supernode_in_front_of(Bi);
        submit_gemm(S1, S2);      /* sparse GEMM: B_k (k >= i) times B_i^T, subtracted from S2 */
    end
end
wait_for_all_tasks();

31 Sparse direct solver over runtimes: PaStiX On homogeneous shared memory nodes PaRSEC's parameterized task graph. [Figure: task graph with panel and gemm nodes.] Panel factorization in JDF format:

panel(j)
  /* Execution Space */
  j = 0 .. cblknbr
  /* Task Locality (Owner Compute) */
  : A(j)
  /* Data dependencies */
  RW A <- (leaf) ? A(j) : C gemm(last_brow)
       -> A gemm(first_block+1 .. last_block)
       -> A(j)

32 Sparse direct solver over runtimes: PaStiX On homogeneous shared memory nodes Matrices and machines.

Table: Matrix description (Z: double complex, D: double); some values were lost in extraction.
Matrix    Prec  Method  Size     nnz_A   nnz_L    TFlop
FilterV2  Z     LU      0.6e+6   12e+6   536e…    …
Flan      D     LL^T    1.6e+6   59e…    …        …
Audi      D     LL^T    0.9e+6   39e…    …        …
MHD       D     LU      0.5e+6   24e…    …        …
Geo1438   D     LL^T    1.4e+6   32e…    …e+6     23
Pmldf     Z     LDL^T   1.0e+6   8e…     …e+6     28
Hook      D     LU      1.5e+6   31e…    …e+6     35
Serena    D     LDL^T   1.4e+6   32e…    …e+6     47

Machine  Processors                         Frequency  GPUs              RAM
Mirage   Westmere Intel Xeon X5650 (2 x 6)  2.67 GHz   Tesla M2070 (x3)  36 GB

33 Sparse direct solver over runtimes: PaStiX On homogeneous shared memory nodes CPU scaling study: GFlop/s during numerical factorization. [Figure: performance (GFlop/s) of the native, StarPU, and PaRSEC versions on 1, 3, 6, 9, and 12 cores, for FilterV2 (Z, LU), Flan (D, LL^T), Audi (D, LL^T), MHD (D, LU), Geo1438 (D, LL^T), Pmldf (Z, LDL^T), Hook (D, LU), and Serena (D, LDL^T).]

38 3.2 Sparse direct solver over runtimes: PaStiX On heterogeneous shared memory nodes

39 Sparse direct solver over runtimes: PaStiX On heterogeneous shared memory nodes Why use GPUs on sparse problems? [Figure: two panels P1, P2 compacted in memory.] Updates are compute-intensive, but they are small, operate on non-contiguous data, and have a rectangular shape.

40 Sparse direct solver over runtimes: PaStiX On heterogeneous shared memory nodes GPU kernel performance. Figure: performance comparison (GFlop/s vs. matrix size M, with N=K=128) on the DGEMM kernel for three implementations, each with 1 stream: the cuBLAS library, the ASTRA framework, and the sparse adaptation of the ASTRA framework, against the cuBLAS peak.

41 Sparse direct solver over runtimes: PaStiX On heterogeneous shared memory nodes GPU kernel performance - 2 streams. Figure: the same DGEMM comparison (cuBLAS, ASTRA, sparse ASTRA) with 1 and 2 streams per implementation.

42 Sparse direct solver over runtimes: PaStiX On heterogeneous shared memory nodes GPU kernel performance - 3 streams. Figure: the same DGEMM comparison (cuBLAS, ASTRA, sparse ASTRA) with 1, 2, and 3 streams per implementation.

43 Sparse direct solver over runtimes: PaStiX On heterogeneous shared memory nodes GPU scaling study: GFlop/s for numerical factorization. [Figure: performance (GFlop/s) of native (CPU only), StarPU (CPU only, 1-3 GPUs), PaRSEC with 1 stream (CPU only, 1-3 GPUs), and PaRSEC with 3 streams (1-3 GPUs), on the eight test matrices.]

48 4 Hybrid direct/sparse solver: MaPHyS

49 Hybrid direct/sparse solver: MaPHyS Sparse linear solvers Goal: solving Ax = b, where A is sparse. Usual trade-offs. Direct: robust/accurate for general problems; BLAS-3 based implementations; memory/CPU prohibitive for large 3D problems; limited weak scalability. Iterative: problem-dependent efficiency/accuracy; sparse computational kernels; lower memory requirements and possibly faster; possibly high weak scalability.

55 Hybrid direct/sparse solver: MaPHyS Hybrid linear solvers Develop robust, scalable parallel hybrid direct/iterative linear solvers: exploit the efficiency and robustness of sparse direct solvers; develop robust parallel preconditioners for iterative solvers; take advantage of the naturally scalable parallel implementation of iterative solvers. Domain Decomposition (DD): a natural approach for PDEs, extended to general sparse matrices; partition the problem into subdomains/subgraphs; use a direct solver on the subdomains; robust preconditioned iterative solver.

56 Hybrid direct/sparse solver: MaPHyS Preconditioning methods MaPHyS: global algebraic view.

Global hybrid decomposition:
  A = ( A_II   A_IΓ )
      ( A_ΓI   A_ΓΓ )

Global Schur complement:
  S = A_ΓΓ - A_ΓI A_II^{-1} A_IΓ

[Figure: decomposition into four subdomains Ω1..Ω4 with interface Γ (88 cut edges); sparsity pattern with nz = 1920.]
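The reduction above can be checked numerically: eliminating the interior unknowns through the Schur complement and back-substituting gives the same solution as solving the full system. A small NumPy sketch with a random SPD matrix (sizes are illustrative):

```python
import numpy as np

# Schur-complement reduction: S = A_ΓΓ - A_ΓI A_II^{-1} A_IΓ,
# f = b_Γ - A_ΓI A_II^{-1} b_I, then solve S x_Γ = f and back-substitute.
rng = np.random.default_rng(0)
n, nI = 10, 7                        # total and interior sizes (illustrative)
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)          # SPD test matrix
b = rng.standard_normal(n)

AII, AIG = A[:nI, :nI], A[:nI, nI:]
AGI, AGG = A[nI:, :nI], A[nI:, nI:]

S = AGG - AGI @ np.linalg.solve(AII, AIG)        # Schur complement
f = b[nI:] - AGI @ np.linalg.solve(AII, b[:nI])  # reduced right-hand side

xG = np.linalg.solve(S, f)                       # interface unknowns
xI = np.linalg.solve(AII, b[:nI] - AIG @ xG)     # back-substitute interiors
assert np.allclose(np.concatenate([xI, xG]), np.linalg.solve(A, b))
```

In MaPHyS the interior solves are done by a sparse direct solver per subdomain and the reduced system is solved iteratively; this dense sketch only verifies the algebra.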

57 Hybrid direct/sparse solver: MaPHyS Preconditioning methods MaPHyS: local algebraic view.

Local hybrid decomposition:
  A^(i) = ( A_{I_i I_i}   A_{I_i Γ_i} )
          ( A_{Γ_i I_i}   A^(i)_{ΓΓ}  )

Local Schur complement:
  S^(i) = A^(i)_{ΓΓ} - A_{Γ_i I_i} A_{I_i I_i}^{-1} A_{I_i Γ_i}

Algebraic Additive Schwarz preconditioner:
  M = Σ_{i=1}^{N} R_{Γ_i}^T (S^(i))^{-1} R_{Γ_i}

[Figure: the same four-subdomain decomposition Ω1..Ω4 with interface Γ (88 cut edges); sparsity pattern with nz = 1920.]
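A key property behind the local view is that, with the interface block split among subdomains (A_ΓΓ = Σ A^(i)_ΓΓ), the local Schur complements sum to the global one. A two-subdomain NumPy sketch (illustrative sizes; a single interface shared by both subdomains, so both restrictions R_Γi are the identity here):

```python
import numpy as np

# Ordering: (interior 1, interior 2, interface Γ); interiors only couple
# through Γ, and A_ΓΓ is split evenly between the two subdomains.
rng = np.random.default_rng(1)
n1, n2, ng = 4, 5, 3
n = n1 + n2 + ng
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)
A[:n1, n1:n1+n2] = 0                 # no direct interior-interior coupling
A[n1:n1+n2, :n1] = 0

I1, I2, G = slice(0, n1), slice(n1, n1+n2), slice(n1+n2, n)
AGG1, AGG2 = 0.5 * A[G, G], 0.5 * A[G, G]   # an algebraic split of A_ΓΓ

# Local Schur complements, one per subdomain
S1 = AGG1 - A[G, I1] @ np.linalg.solve(A[I1, I1], A[I1, G])
S2 = AGG2 - A[G, I2] @ np.linalg.solve(A[I2, I2], A[I2, G])

# Global Schur complement over both interiors at once
AII = A[:n1+n2, :n1+n2]
S = A[G, G] - A[G, :n1+n2] @ np.linalg.solve(AII, A[:n1+n2, G])
assert np.allclose(S1 + S2, S)

# Algebraic Additive Schwarz preconditioner (R_Γi = I for both subdomains)
M = np.linalg.inv(S1) + np.linalg.inv(S2)
```

Each S^(i) is computed concurrently from purely local data, while S itself is never assembled; M is then applied as a sum of local solves inside the Krylov iteration.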

58 Hybrid direct/sparse solver: MaPHyS Preconditioning methods Parallel implementation. Each subdomain A^(i) is handled by one CPU core. Concurrent partial factorizations are performed on each CPU core to form the so-called local Schur complement S^(i) = A^(i)_{ΓΓ} - A_{Γ_i I_i} A_{I_i I_i}^{-1} A_{I_i Γ_i}. The reduced system S x_Γ = f is solved using a distributed Krylov solver: one matrix-vector product per iteration (each CPU core computes S^(i) (x^(i)_Γ)^k = (y^(i))^k); one local preconditioner application ((M^(i)) (z^(i))^k = (r^(i))^k); local neighbor-to-neighbor communication per iteration; one global reduction (dot products). The solution for the interior unknowns is computed simultaneously: A_{I_i I_i} x_{I_i} = b_{I_i} - A_{I_i Γ_i} x_{Γ_i}.

59 Hybrid direct/sparse solver: MaPHyS Software Modular implementation. Sequential partitioners: Scotch [Pellegrini et al.], Metis [G. Karypis and V. Kumar]. Sparse direct solvers (one-level parallelism): Mumps [P. Amestoy et al.] (with Schur option), PaStiX [P. Ramet et al.] (with Schur option). Iterative solvers: CG/GMRES/FGMRES [V. Fraysse and L. Giraud].

60 Hybrid direct/sparse solver: MaPHyS Software Modular implementation and bottlenecks (same components as the previous slide, with the sequential partitioners and the one-level parallelism of the direct solvers as bottlenecks).

63 Hybrid direct/sparse solver: MaPHyS Software MPI-thread with Nachos4M - time. [Figure: time (s) vs. number of cores for 3, 6, 12, and 24 threads per process.]

64 Hybrid direct/sparse solver: MaPHyS Software MPI-thread with Nachos4M - memory. [Figure: memory peak (MB) vs. number of cores for 3, 6, 12, and 24 threads per process.]

65 5 Conclusion

66 Conclusion Sparse direct solver: PaStiX Main features: LL^T, LDL^T, LU supernodal implementation (BLAS-3); static pivoting + refinement (CG/GMRES/BiCGstab); column-block or block mapping; simple/double precision + real/complex arithmetic; support for external ordering libraries ((PT-)Scotch, METIS,...); MPI/threads (cluster/multicore/SMP/NUMA); multiple GPUs using DAG runtimes (X. Lacoste's PhD defense on February 18th, 2015); multiple RHS (direct factorization); incomplete factorization with ILU(k) preconditioner; Schur complement computation; interfaces for C++/Fortran/Python/PETSc/Trilinos/FreeFem/...

67 Conclusion Sparse direct solver: PaStiX What next? Exploiting DAG runtimes on distributed memory architectures and other accelerators (Intel Xeon Phi); ordering methods for the supernodes to reduce the number of extra-diagonal blocks; H-matrix arithmetic (A. Buttari's talk), in collaboration with Stanford University.

68 Conclusion Sparse hybrid solver: MaPHyS Current status and future. Sequential version: the original MaPHyS version. Multi-threaded implementation: integrated during S. Nakov's PhD. Hybrid MPI/multi-threaded implementation: is it worth it? Task-flow programming model: L. Poirel's PhD, started in October 2014. Non-blocking algorithms (W. Vanroose's talk). Exploiting H-matrices for the Schur complement.

69 Conclusion Software Graph/mesh partitioner and ordering; sparse linear system solvers: MaPHyS

70 Conclusion Software Chameleon; DPLASMA

71 Thanks!


More information

Incomplete Cholesky preconditioners that exploit the low-rank property

Incomplete Cholesky preconditioners that exploit the low-rank property anapov@ulb.ac.be ; http://homepages.ulb.ac.be/ anapov/ 1 / 35 Incomplete Cholesky preconditioners that exploit the low-rank property (theory and practice) Artem Napov Service de Métrologie Nucléaire, Université

More information

MARCH 24-27, 2014 SAN JOSE, CA

MARCH 24-27, 2014 SAN JOSE, CA MARCH 24-27, 2014 SAN JOSE, CA Sparse HPC on modern architectures Important scientific applications rely on sparse linear algebra HPCG a new benchmark proposal to complement Top500 (HPL) To solve A x =

More information

From Direct to Iterative Substructuring: some Parallel Experiences in 2 and 3D

From Direct to Iterative Substructuring: some Parallel Experiences in 2 and 3D From Direct to Iterative Substructuring: some Parallel Experiences in 2 and 3D Luc Giraud N7-IRIT, Toulouse MUMPS Day October 24, 2006, ENS-INRIA, Lyon, France Outline 1 General Framework 2 The direct

More information

Multilevel low-rank approximation preconditioners Yousef Saad Department of Computer Science and Engineering University of Minnesota

Multilevel low-rank approximation preconditioners Yousef Saad Department of Computer Science and Engineering University of Minnesota Multilevel low-rank approximation preconditioners Yousef Saad Department of Computer Science and Engineering University of Minnesota SIAM CSE Boston - March 1, 2013 First: Joint work with Ruipeng Li Work

More information

1 Overview. 2 Adapting to computing system evolution. 11 th European LS-DYNA Conference 2017, Salzburg, Austria

1 Overview. 2 Adapting to computing system evolution. 11 th European LS-DYNA Conference 2017, Salzburg, Austria 1 Overview Improving LSTC s Multifrontal Linear Solver Roger Grimes 3, Robert Lucas 3, Nick Meng 2, Francois-Henry Rouet 3, Clement Weisbecker 3, and Ting-Ting Zhu 1 1 Cray Incorporated 2 Intel Corporation

More information

Using Random Butterfly Transformations to Avoid Pivoting in Sparse Direct Methods

Using Random Butterfly Transformations to Avoid Pivoting in Sparse Direct Methods Using Random Butterfly Transformations to Avoid Pivoting in Sparse Direct Methods Marc Baboulin 1, Xiaoye S. Li 2 and François-Henry Rouet 2 1 University of Paris-Sud, Inria Saclay, France 2 Lawrence Berkeley

More information

Large Scale Sparse Linear Algebra

Large Scale Sparse Linear Algebra Large Scale Sparse Linear Algebra P. Amestoy (INP-N7, IRIT) A. Buttari (CNRS, IRIT) T. Mary (University of Toulouse, IRIT) A. Guermouche (Univ. Bordeaux, LaBRI), J.-Y. L Excellent (INRIA, LIP, ENS-Lyon)

More information

Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters

Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters H. Köstler 2nd International Symposium Computer Simulations on GPU Freudenstadt, 29.05.2013 1 Contents Motivation walberla software concepts

More information

Performance and Scalability of the Block Low-Rank Multifrontal Factorization on Multicore Architectures

Performance and Scalability of the Block Low-Rank Multifrontal Factorization on Multicore Architectures Performance and Scalability of the Block Low-Rank Multifrontal Factorization on Multicore Architectures Patrick R. Amestoy, Alfredo Buttari, Jean-Yves L Excellent, Théo Mary To cite this version: Patrick

More information

A robust multilevel approximate inverse preconditioner for symmetric positive definite matrices

A robust multilevel approximate inverse preconditioner for symmetric positive definite matrices DICEA DEPARTMENT OF CIVIL, ENVIRONMENTAL AND ARCHITECTURAL ENGINEERING PhD SCHOOL CIVIL AND ENVIRONMENTAL ENGINEERING SCIENCES XXX CYCLE A robust multilevel approximate inverse preconditioner for symmetric

More information

Improvements for Implicit Linear Equation Solvers

Improvements for Implicit Linear Equation Solvers Improvements for Implicit Linear Equation Solvers Roger Grimes, Bob Lucas, Clement Weisbecker Livermore Software Technology Corporation Abstract Solving large sparse linear systems of equations is often

More information

Computing least squares condition numbers on hybrid multicore/gpu systems

Computing least squares condition numbers on hybrid multicore/gpu systems Computing least squares condition numbers on hybrid multicore/gpu systems M. Baboulin and J. Dongarra and R. Lacroix Abstract This paper presents an efficient computation for least squares conditioning

More information

Hybrid static/dynamic scheduling for already optimized dense matrix factorization. Joint Laboratory for Petascale Computing, INRIA-UIUC

Hybrid static/dynamic scheduling for already optimized dense matrix factorization. Joint Laboratory for Petascale Computing, INRIA-UIUC Hybrid static/dynamic scheduling for already optimized dense matrix factorization Simplice Donfack, Laura Grigori, INRIA, France Bill Gropp, Vivek Kale UIUC, USA Joint Laboratory for Petascale Computing,

More information

Enhancing Scalability of Sparse Direct Methods

Enhancing Scalability of Sparse Direct Methods Journal of Physics: Conference Series 78 (007) 0 doi:0.088/7-6596/78//0 Enhancing Scalability of Sparse Direct Methods X.S. Li, J. Demmel, L. Grigori, M. Gu, J. Xia 5, S. Jardin 6, C. Sovinec 7, L.-Q.

More information

Hydra: Generation and Tuning of parallel solutions for linear algebra equations. Alexandre X. Duchâteau University of Illinois at Urbana Champaign

Hydra: Generation and Tuning of parallel solutions for linear algebra equations. Alexandre X. Duchâteau University of Illinois at Urbana Champaign Hydra: Generation and Tuning of parallel solutions for linear algebra equations Alexandre X. Duchâteau University of Illinois at Urbana Champaign Collaborators Thesis Advisors Denis Barthou (Labri/INRIA

More information

QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment

QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment Emmanuel AGULLO (INRIA / LaBRI) Camille COTI (Iowa State University) Jack DONGARRA (University of Tennessee) Thomas HÉRAULT

More information

Static-scheduling and hybrid-programming in SuperLU DIST on multicore cluster systems

Static-scheduling and hybrid-programming in SuperLU DIST on multicore cluster systems Static-scheduling and hybrid-programming in SuperLU DIST on multicore cluster systems Ichitaro Yamazaki University of Tennessee, Knoxville Xiaoye Sherry Li Lawrence Berkeley National Laboratory MS49: Sparse

More information

Information Sciences Institute 22 June 2012 Bob Lucas, Gene Wagenbreth, Dan Davis, Roger Grimes and

Information Sciences Institute 22 June 2012 Bob Lucas, Gene Wagenbreth, Dan Davis, Roger Grimes and Accelerating the Multifrontal Method Information Sciences Institute 22 June 2012 Bob Lucas, Gene Wagenbreth, Dan Davis, Roger Grimes {rflucas,genew,ddavis}@isi.edu and grimes@lstc.com 3D Finite Element

More information

TR A Comparison of the Performance of SaP::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems

TR A Comparison of the Performance of SaP::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems TR-0-07 A Comparison of the Performance of ::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems Ang Li, Omkar Deshmukh, Radu Serban, Dan Negrut May, 0 Abstract ::GPU is a

More information

Block Low-Rank (BLR) approximations to improve multifrontal sparse solvers

Block Low-Rank (BLR) approximations to improve multifrontal sparse solvers Block Low-Rank (BLR) approximations to improve multifrontal sparse solvers Joint work with Patrick Amestoy, Cleve Ashcraft, Olivier Boiteau, Alfredo Buttari and Jean-Yves L Excellent, PhD started on October

More information

MagmaDNN High-Performance Data Analytics for Manycore GPUs and CPUs

MagmaDNN High-Performance Data Analytics for Manycore GPUs and CPUs MagmaDNN High-Performance Data Analytics for Manycore GPUs and CPUs Lucien Ng The Chinese University of Hong Kong Kwai Wong The Joint Institute for Computational Sciences (JICS), UTK and ORNL Azzam Haidar,

More information

Toward black-box adaptive domain decomposition methods

Toward black-box adaptive domain decomposition methods Toward black-box adaptive domain decomposition methods Frédéric Nataf Laboratory J.L. Lions (LJLL), CNRS, Alpines Inria and Univ. Paris VI joint work with Victorita Dolean (Univ. Nice Sophia-Antipolis)

More information

Some Geometric and Algebraic Aspects of Domain Decomposition Methods

Some Geometric and Algebraic Aspects of Domain Decomposition Methods Some Geometric and Algebraic Aspects of Domain Decomposition Methods D.S.Butyugin 1, Y.L.Gurieva 1, V.P.Ilin 1,2, and D.V.Perevozkin 1 Abstract Some geometric and algebraic aspects of various domain decomposition

More information

Tall and Skinny QR Matrix Factorization Using Tile Algorithms on Multicore Architectures LAPACK Working Note - 222

Tall and Skinny QR Matrix Factorization Using Tile Algorithms on Multicore Architectures LAPACK Working Note - 222 Tall and Skinny QR Matrix Factorization Using Tile Algorithms on Multicore Architectures LAPACK Working Note - 222 Bilel Hadri 1, Hatem Ltaief 1, Emmanuel Agullo 1, and Jack Dongarra 1,2,3 1 Department

More information

Level-3 BLAS on a GPU

Level-3 BLAS on a GPU Level-3 BLAS on a GPU Picking the Low Hanging Fruit Francisco Igual 1 Gregorio Quintana-Ortí 1 Robert A. van de Geijn 2 1 Departamento de Ingeniería y Ciencia de los Computadores. University Jaume I. Castellón

More information

Accelerating computation of eigenvectors in the nonsymmetric eigenvalue problem

Accelerating computation of eigenvectors in the nonsymmetric eigenvalue problem Accelerating computation of eigenvectors in the nonsymmetric eigenvalue problem Mark Gates 1, Azzam Haidar 1, and Jack Dongarra 1,2,3 1 University of Tennessee, Knoxville, TN, USA 2 Oak Ridge National

More information

Modeling and Tuning Parallel Performance in Dense Linear Algebra

Modeling and Tuning Parallel Performance in Dense Linear Algebra Modeling and Tuning Parallel Performance in Dense Linear Algebra Initial Experiences with the Tile QR Factorization on a Multi Core System CScADS Workshop on Automatic Tuning for Petascale Systems Snowbird,

More information

Jacobi-Based Eigenvalue Solver on GPU. Lung-Sheng Chien, NVIDIA

Jacobi-Based Eigenvalue Solver on GPU. Lung-Sheng Chien, NVIDIA Jacobi-Based Eigenvalue Solver on GPU Lung-Sheng Chien, NVIDIA lchien@nvidia.com Outline Symmetric eigenvalue solver Experiment Applications Conclusions Symmetric eigenvalue solver The standard form is

More information

INITIAL INTEGRATION AND EVALUATION

INITIAL INTEGRATION AND EVALUATION INITIAL INTEGRATION AND EVALUATION OF SLATE PARALLEL BLAS IN LATTE Marc Cawkwell, Danny Perez, Arthur Voter Asim YarKhan, Gerald Ragghianti, Jack Dongarra, Introduction The aim of the joint milestone STMS10-52

More information

Parallelism in FreeFem++.

Parallelism in FreeFem++. Parallelism in FreeFem++. Guy Atenekeng 1 Frederic Hecht 2 Laura Grigori 1 Jacques Morice 2 Frederic Nataf 2 1 INRIA, Saclay 2 University of Paris 6 Workshop on FreeFem++, 2009 Outline 1 Introduction Motivation

More information

Solving PDEs with CUDA Jonathan Cohen

Solving PDEs with CUDA Jonathan Cohen Solving PDEs with CUDA Jonathan Cohen jocohen@nvidia.com NVIDIA Research PDEs (Partial Differential Equations) Big topic Some common strategies Focus on one type of PDE in this talk Poisson Equation Linear

More information

Intel Math Kernel Library (Intel MKL) LAPACK

Intel Math Kernel Library (Intel MKL) LAPACK Intel Math Kernel Library (Intel MKL) LAPACK Linear equations Victor Kostin Intel MKL Dense Solvers team manager LAPACK http://www.netlib.org/lapack Systems of Linear Equations Linear Least Squares Eigenvalue

More information

Accelerating computation of eigenvectors in the dense nonsymmetric eigenvalue problem

Accelerating computation of eigenvectors in the dense nonsymmetric eigenvalue problem Accelerating computation of eigenvectors in the dense nonsymmetric eigenvalue problem Mark Gates 1, Azzam Haidar 1, and Jack Dongarra 1,2,3 1 University of Tennessee, Knoxville, TN, USA 2 Oak Ridge National

More information

Domain Decomposition-based contour integration eigenvalue solvers

Domain Decomposition-based contour integration eigenvalue solvers Domain Decomposition-based contour integration eigenvalue solvers Vassilis Kalantzis joint work with Yousef Saad Computer Science and Engineering Department University of Minnesota - Twin Cities, USA SIAM

More information

Leveraging Task-Parallelism in Energy-Efficient ILU Preconditioners

Leveraging Task-Parallelism in Energy-Efficient ILU Preconditioners Leveraging Task-Parallelism in Energy-Efficient ILU Preconditioners José I. Aliaga Leveraging task-parallelism in energy-efficient ILU preconditioners Universidad Jaime I (Castellón, Spain) José I. Aliaga

More information

Sparse factorizations: Towards optimal complexity and resilience at exascale

Sparse factorizations: Towards optimal complexity and resilience at exascale Sparse factorizations: Towards optimal complexity and resilience at exascale Xiaoye Sherry Li Lawrence Berkeley National Laboratory Challenges in 21st Century Experimental Mathematical Computation Workshop,

More information

ERLANGEN REGIONAL COMPUTING CENTER

ERLANGEN REGIONAL COMPUTING CENTER ERLANGEN REGIONAL COMPUTING CENTER Making Sense of Performance Numbers Georg Hager Erlangen Regional Computing Center (RRZE) Friedrich-Alexander-Universität Erlangen-Nürnberg OpenMPCon 2018 Barcelona,

More information

Saving Energy in the LU Factorization with Partial Pivoting on Multi-Core Processors

Saving Energy in the LU Factorization with Partial Pivoting on Multi-Core Processors 20th Euromicro International Conference on Parallel, Distributed and Network-Based Special Session on Energy-aware Systems Saving Energy in the on Multi-Core Processors Pedro Alonso 1, Manuel F. Dolz 2,

More information

Scheduling Trees of Malleable Tasks for Sparse Linear Algebra

Scheduling Trees of Malleable Tasks for Sparse Linear Algebra Scheduling Trees of Malleable Tasks for Sparse Linear Algebra Abdou Guermouche, Loris Marchal, Bertrand Simon, Frédéric Vivien To cite this version: Abdou Guermouche, Loris Marchal, Bertrand Simon, Frédéric

More information

Scalable Domain Decomposition Preconditioners For Heterogeneous Elliptic Problems

Scalable Domain Decomposition Preconditioners For Heterogeneous Elliptic Problems Scalable Domain Decomposition Preconditioners For Heterogeneous Elliptic Problems Pierre Jolivet, F. Hecht, F. Nataf, C. Prud homme Laboratoire Jacques-Louis Lions Laboratoire Jean Kuntzmann INRIA Rocquencourt

More information

Parallel Asynchronous Hybrid Krylov Methods for Minimization of Energy Consumption. Langshi CHEN 1,2,3 Supervised by Serge PETITON 2

Parallel Asynchronous Hybrid Krylov Methods for Minimization of Energy Consumption. Langshi CHEN 1,2,3 Supervised by Serge PETITON 2 1 / 23 Parallel Asynchronous Hybrid Krylov Methods for Minimization of Energy Consumption Langshi CHEN 1,2,3 Supervised by Serge PETITON 2 Maison de la Simulation Lille 1 University CNRS March 18, 2013

More information

Energy Consumption Evaluation for Krylov Methods on a Cluster of GPU Accelerators

Energy Consumption Evaluation for Krylov Methods on a Cluster of GPU Accelerators PARIS-SACLAY, FRANCE Energy Consumption Evaluation for Krylov Methods on a Cluster of GPU Accelerators Serge G. Petiton a and Langshi Chen b April the 6 th, 2016 a Université de Lille, Sciences et Technologies

More information

Case Study: Quantum Chromodynamics

Case Study: Quantum Chromodynamics Case Study: Quantum Chromodynamics Michael Clark Harvard University with R. Babich, K. Barros, R. Brower, J. Chen and C. Rebbi Outline Primer to QCD QCD on a GPU Mixed Precision Solvers Multigrid solver

More information

GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic

GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic Jan Verschelde joint work with Xiangcheng Yu University of Illinois at Chicago

More information

HPMPC - A new software package with efficient solvers for Model Predictive Control

HPMPC - A new software package with efficient solvers for Model Predictive Control - A new software package with efficient solvers for Model Predictive Control Technical University of Denmark CITIES Second General Consortium Meeting, DTU, Lyngby Campus, 26-27 May 2015 Introduction Model

More information

Tile QR Factorization with Parallel Panel Processing for Multicore Architectures

Tile QR Factorization with Parallel Panel Processing for Multicore Architectures Tile QR Factorization with Parallel Panel Processing for Multicore Architectures Bilel Hadri, Hatem Ltaief, Emmanuel Agullo, Jack Dongarra Department of Electrical Engineering and Computer Science, University

More information

GPU accelerated Arnoldi solver for small batched matrix

GPU accelerated Arnoldi solver for small batched matrix 15. 09. 22 GPU accelerated Arnoldi solver for small batched matrix Samsung Advanced Institute of Technology Hyung-Jin Kim Contents - Eigen value problems - Solution - Arnoldi Algorithm - Target - CUDA

More information

Accelerating interior point methods with GPUs for smart grid systems

Accelerating interior point methods with GPUs for smart grid systems Downloaded from orbit.dtu.dk on: Dec 18, 2017 Accelerating interior point methods with GPUs for smart grid systems Gade-Nielsen, Nicolai Fog Publication date: 2011 Document Version Publisher's PDF, also

More information

Block AIR Methods. For Multicore and GPU. Per Christian Hansen Hans Henrik B. Sørensen. Technical University of Denmark

Block AIR Methods. For Multicore and GPU. Per Christian Hansen Hans Henrik B. Sørensen. Technical University of Denmark Block AIR Methods For Multicore and GPU Per Christian Hansen Hans Henrik B. Sørensen Technical University of Denmark Model Problem and Notation Parallel-beam 3D tomography exact solution exact data noise

More information

Large-scale Electronic Structure Simulations with MVAPICH2 on Intel Knights Landing Manycore Processors

Large-scale Electronic Structure Simulations with MVAPICH2 on Intel Knights Landing Manycore Processors Large-scale Electronic Structure Simulations with MVAPICH2 on Intel Knights Landing Manycore Processors Hoon Ryu, Ph.D. (E: elec1020@kisti.re.kr) Principal Researcher / Korea Institute of Science and Technology

More information

An Integrative Model for Parallelism

An Integrative Model for Parallelism An Integrative Model for Parallelism Victor Eijkhout ICERM workshop 2012/01/09 Introduction Formal part Examples Extension to other memory models Conclusion tw-12-exascale 2012/01/09 2 Introduction tw-12-exascale

More information

Algorithm for Sparse Approximate Inverse Preconditioners in the Conjugate Gradient Method

Algorithm for Sparse Approximate Inverse Preconditioners in the Conjugate Gradient Method Algorithm for Sparse Approximate Inverse Preconditioners in the Conjugate Gradient Method Ilya B. Labutin A.A. Trofimuk Institute of Petroleum Geology and Geophysics SB RAS, 3, acad. Koptyug Ave., Novosibirsk

More information

Direct Self-Consistent Field Computations on GPU Clusters

Direct Self-Consistent Field Computations on GPU Clusters Direct Self-Consistent Field Computations on GPU Clusters Guochun Shi, Volodymyr Kindratenko National Center for Supercomputing Applications University of Illinois at UrbanaChampaign Ivan Ufimtsev, Todd

More information

Multipréconditionnement adaptatif pour les méthodes de décomposition de domaine. Nicole Spillane (CNRS, CMAP, École Polytechnique)

Multipréconditionnement adaptatif pour les méthodes de décomposition de domaine. Nicole Spillane (CNRS, CMAP, École Polytechnique) Multipréconditionnement adaptatif pour les méthodes de décomposition de domaine Nicole Spillane (CNRS, CMAP, École Polytechnique) C. Bovet (ONERA), P. Gosselet (ENS Cachan), A. Parret Fréaud (SafranTech),

More information

Feeding of the Thousands. Leveraging the GPU's Computing Power for Sparse Linear Algebra

Feeding of the Thousands. Leveraging the GPU's Computing Power for Sparse Linear Algebra SPPEXA Annual Meeting 2016, January 25 th, 2016, Garching, Germany Feeding of the Thousands Leveraging the GPU's Computing Power for Sparse Linear Algebra Hartwig Anzt Sparse Linear Algebra on GPUs Inherently

More information

Direct and Incomplete Cholesky Factorizations with Static Supernodes

Direct and Incomplete Cholesky Factorizations with Static Supernodes Direct and Incomplete Cholesky Factorizations with Static Supernodes AMSC 661 Term Project Report Yuancheng Luo 2010-05-14 Introduction Incomplete factorizations of sparse symmetric positive definite (SSPD)

More information

A parallel tiled solver for dense symmetric indefinite systems on multicore architectures

A parallel tiled solver for dense symmetric indefinite systems on multicore architectures A parallel tiled solver for dense symmetric indefinite systems on multicore architectures Marc Baboulin, Dulceneia Becker, Jack Dongarra INRIA Saclay-Île de France, F-91893 Orsay, France Université Paris

More information

Parallel Preconditioning Methods for Ill-conditioned Problems

Parallel Preconditioning Methods for Ill-conditioned Problems Parallel Preconditioning Methods for Ill-conditioned Problems Kengo Nakajima Information Technology Center, The University of Tokyo 2014 Conference on Advanced Topics and Auto Tuning in High Performance

More information

Parallel sparse linear solvers and applications in CFD

Parallel sparse linear solvers and applications in CFD Parallel sparse linear solvers and applications in CFD Jocelyne Erhel Joint work with Désiré Nuentsa Wakam () and Baptiste Poirriez () SAGE team, Inria Rennes, France journée Calcul Intensif Distribué

More information

A Numerical Study of Some Parallel Algebraic Preconditioners

A Numerical Study of Some Parallel Algebraic Preconditioners A Numerical Study of Some Parallel Algebraic Preconditioners Xing Cai Simula Research Laboratory & University of Oslo PO Box 1080, Blindern, 0316 Oslo, Norway xingca@simulano Masha Sosonkina University

More information

Parallel Multilevel Low-Rank approximation preconditioners Yousef Saad Department of Computer Science and Engineering University of Minnesota

Parallel Multilevel Low-Rank approximation preconditioners Yousef Saad Department of Computer Science and Engineering University of Minnesota Parallel Multilevel Low-Rank approximation preconditioners Yousef Saad Department of Computer Science and Engineering University of Minnesota ICME, Stanford April 2th, 25 First: Partly joint work with

More information

Saving Energy in Sparse and Dense Linear Algebra Computations

Saving Energy in Sparse and Dense Linear Algebra Computations Saving Energy in Sparse and Dense Linear Algebra Computations P. Alonso, M. F. Dolz, F. Igual, R. Mayo, E. S. Quintana-Ortí, V. Roca Univ. Politécnica Univ. Jaume I The Univ. of Texas de Valencia, Spain

More information

Multilevel Low-Rank Preconditioners Yousef Saad Department of Computer Science and Engineering University of Minnesota. Modelling 2014 June 2,2014

Multilevel Low-Rank Preconditioners Yousef Saad Department of Computer Science and Engineering University of Minnesota. Modelling 2014 June 2,2014 Multilevel Low-Rank Preconditioners Yousef Saad Department of Computer Science and Engineering University of Minnesota Modelling 24 June 2,24 Dedicated to Owe Axelsson at the occasion of his 8th birthday

More information

Fine-Grained Parallel Algorithms for Incomplete Factorization Preconditioning

Fine-Grained Parallel Algorithms for Incomplete Factorization Preconditioning Fine-Grained Parallel Algorithms for Incomplete Factorization Preconditioning Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology, USA SPPEXA Symposium TU München,

More information

Efficiently solving large sparse linear systems on a distributed and heterogeneous grid by using the multisplitting-direct method

Efficiently solving large sparse linear systems on a distributed and heterogeneous grid by using the multisplitting-direct method Efficiently solving large sparse linear systems on a distributed and heterogeneous grid by using the multisplitting-direct method S. Contassot-Vivier, R. Couturier, C. Denis and F. Jézéquel Université

More information

Panorama des modèles et outils de programmation parallèle

Panorama des modèles et outils de programmation parallèle Panorama des modèles et outils de programmation parallèle Sylvain HENRY sylvain.henry@inria.fr University of Bordeaux - LaBRI - Inria - ENSEIRB April 19th, 2013 1/45 Outline Introduction Accelerators &

More information

ExaGeoStat: A High Performance Unified Software for Geostatistics on Manycore Systems

ExaGeoStat: A High Performance Unified Software for Geostatistics on Manycore Systems IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. XX, NO. X, M YYYY 1 ExaGeoStat: A High Performance Unified Software for Geostatistics on Manycore Systems Sameh Abdulah, Hatem Ltaief, Ying Sun,

More information

A dissection solver with kernel detection for symmetric finite element matrices on shared memory computers

A dissection solver with kernel detection for symmetric finite element matrices on shared memory computers A dissection solver with kernel detection for symmetric finite element matrices on shared memory computers Atsushi Suzuki, François-Xavier Roux To cite this version: Atsushi Suzuki, François-Xavier Roux.

More information

Parallel Algorithms for Solution of Large Sparse Linear Systems with Applications

Parallel Algorithms for Solution of Large Sparse Linear Systems with Applications Parallel Algorithms for Solution of Large Sparse Linear Systems with Applications Murat Manguoğlu Department of Computer Engineering Middle East Technical University, Ankara, Turkey Prace workshop: HPC

More information

Presentation of XLIFE++

Presentation of XLIFE++ Presentation of XLIFE++ Eigenvalues Solver & OpenMP Manh-Ha NGUYEN Unité de Mathématiques Appliquées, ENSTA - Paristech 25 Juin 2014 Ha. NGUYEN Presentation of XLIFE++ 25 Juin 2014 1/19 EigenSolver 1 EigenSolver

More information

The new challenges to Krylov subspace methods Yousef Saad Department of Computer Science and Engineering University of Minnesota

The new challenges to Krylov subspace methods Yousef Saad Department of Computer Science and Engineering University of Minnesota The new challenges to Krylov subspace methods Yousef Saad Department of Computer Science and Engineering University of Minnesota SIAM Applied Linear Algebra Valencia, June 18-22, 2012 Introduction Krylov

More information

Review for the Midterm Exam

Review for the Midterm Exam Review for the Midterm Exam 1 Three Questions of the Computational Science Prelim scaled speedup network topologies work stealing 2 The in-class Spring 2012 Midterm Exam pleasingly parallel computations

More information

Heterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry

Heterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry Heterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry and Eugene DePrince Argonne National Laboratory (LCF and CNM) (Eugene moved to Georgia Tech last week)

More information

Multilevel spectral coarse space methods in FreeFem++ on parallel architectures

Multilevel spectral coarse space methods in FreeFem++ on parallel architectures Multilevel spectral coarse space methods in FreeFem++ on parallel architectures Pierre Jolivet Laboratoire Jacques-Louis Lions Laboratoire Jean Kuntzmann DD 21, Rennes. June 29, 2012 In collaboration with

More information

Preconditioned Parallel Block Jacobi SVD Algorithm

Preconditioned Parallel Block Jacobi SVD Algorithm Parallel Numerics 5, 15-24 M. Vajteršic, R. Trobec, P. Zinterhof, A. Uhl (Eds.) Chapter 2: Matrix Algebra ISBN 961-633-67-8 Preconditioned Parallel Block Jacobi SVD Algorithm Gabriel Okša 1, Marián Vajteršic

More information

PRECONDITIONING IN THE PARALLEL BLOCK-JACOBI SVD ALGORITHM

PRECONDITIONING IN THE PARALLEL BLOCK-JACOBI SVD ALGORITHM Proceedings of ALGORITMY 25 pp. 22 211 PRECONDITIONING IN THE PARALLEL BLOCK-JACOBI SVD ALGORITHM GABRIEL OKŠA AND MARIÁN VAJTERŠIC Abstract. One way, how to speed up the computation of the singular value

More information

Communication-avoiding LU and QR factorizations for multicore architectures

Communication-avoiding LU and QR factorizations for multicore architectures Communication-avoiding LU and QR factorizations for multicore architectures DONFACK Simplice INRIA Saclay Joint work with Laura Grigori INRIA Saclay Alok Kumar Gupta BCCS,Norway-5075 16th April 2010 Communication-avoiding

More information