Using low-rank compression to reduce the complexity of the PaStiX solver


26 September 2018 - JCAD 2018 - Lyon
Grégoire Pichon, Mathieu Faverge, Pierre Ramet, Jean Roman

Outline
1. Context
2. Block Low-Rank solver
   - From full-rank to low-rank
   - Impact of ordering techniques
3. Application: HORSE
4. Conclusion and future works

1. Context

The PASTIX solver
Factorization
- Sparse direct solver, real and complex arithmetic
- LU, LL^T, LDL^T, LL^H and LDL^H factorizations
- Static pivoting
- Refinement with BiCGSTAB, CG, GMRES
Parallelism
- Supernodal approach
- Native static scheduling (used in the following experiments)
- Runtime systems: PaRSEC and StarPU
- Heterogeneous architectures (GPUs, KNL)
- Distributed memory (ongoing work)
https://gitlab.inria.fr/solverstack/pastix

Objectives
Current sparse direct solver costs for 3D problems
- Θ(n^2) time complexity
- Θ(n^{4/3}) memory complexity
- BLAS Level 3 operations with block computations
Current status
- Direct solvers suffer from these high complexities
- Iterative or hybrid methods are often problem-dependent
- Machine precision is not required when the input measurements are themselves inaccurate
Objectives: low-rank compression in a sparse direct solver
- Introduce compression of dense blocks
- Reduce time and/or memory complexity
- Keep the same level of parallelism

2. Block Low-Rank solver

2.1 From full-rank to low-rank

Low-rank compression
A dense block M ∈ R^{n×n} is approximated as M ≈ U V^T with U, V ∈ R^{n×r}, so the storage drops to 2nr entries instead of n^2.
[Figure: compression of an image with n = 500, shown for the original picture, r = 10 and r = 50.]
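The compression kernel can be pictured in a few lines of NumPy. This is a minimal sketch of the SVD variant, assuming a plain dense block; the `svd_compress` helper and the random test matrix are illustrative, not PaStiX code.

```python
# Minimal sketch (not PaStiX code): compress a dense block M into M ~= U @ V.T
# with a truncated SVD, storing 2*n*r entries instead of n*n.
import numpy as np

def svd_compress(M, tol):
    """Return U, V such that M ~= U @ V.T at relative tolerance tol."""
    Uf, s, Vt = np.linalg.svd(M, full_matrices=False)
    r = max(1, int(np.sum(s > tol * s[0])))   # numerical rank at tolerance tol
    return Uf[:, :r] * s[:r], Vt[:r, :].T     # fold singular values into U

rng = np.random.default_rng(0)
n, true_rank = 500, 10
M = rng.standard_normal((n, true_rank)) @ rng.standard_normal((true_rank, n))
U, V = svd_compress(M, 1e-8)
print(U.shape[1], "recovered rank")                      # ~ 10
print(2 * n * U.shape[1], "entries instead of", n * n)
print(np.linalg.norm(M - U @ V.T) / np.linalg.norm(M))   # ~ machine precision
```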

BLR compression: symbolic factorization (full-rank baseline)
Approach
- Large supernodes are split, which increases the level of parallelism
Operations
- Dense diagonal blocks
- TRSM performed on dense off-diagonal blocks
- GEMM performed between dense off-diagonal blocks

BLR compression: symbolic factorization (low-rank)
Approach
- Large supernodes are split
- Large off-diagonal blocks are stored in low-rank form
Operations
- Dense diagonal blocks
- TRSM performed on low-rank off-diagonal blocks
- GEMM performed between low-rank off-diagonal blocks
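The experiments below select the rank with a rank-revealing QR (RRQR) at tolerance τ rather than an SVD. Here is a hedged sketch built on SciPy's column-pivoted QR; the real PaStiX kernels stop the factorization early instead of truncating a complete one.

```python
# Sketch of an RRQR-style compression at tolerance tau, using SciPy's
# column-pivoted QR; illustrative only, not the PaStiX kernel.
import numpy as np
from scipy.linalg import qr

def rrqr_compress(A, tau):
    """Return U, V with A ~= U @ V.T, rank read off the pivoted-QR diagonal."""
    Q, R, perm = qr(A, mode='economic', pivoting=True)
    diag = np.abs(np.diag(R))
    r = max(1, int(np.sum(diag > tau * diag[0])))  # numerical rank at tau
    inv_perm = np.argsort(perm)                    # undo the column pivoting
    return Q[:, :r], R[:r, :][:, inv_perm].T
```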

Strategy Just-In-Time: compress L
1. Eliminate each column block:
   1.1 Factorize the dense diagonal block, then compress the off-diagonal blocks belonging to the supernode
   1.2 Apply a TRSM on the low-rank blocks (cheaper)
   1.3 Apply the low-rank update to dense matrices (LR2GE extend-add)
2. Solve triangular systems with low-rank blocks
[Figure: animated block elimination, blocks coloured by operation: Compression, GETRF (Facto), TRSM (Solve), LR2LR (Update), LR2GE (Update).]
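To make the control flow concrete, here is a runnable toy of the Just-In-Time loop on a dense matrix cut into tiles. The names `compress` and `blr_lu_jit` and the smooth test kernel are made up for the illustration, a truncated SVD stands in for RRQR, and PaStiX of course works on sparse supernodes rather than dense tiles.

```python
# Toy Just-In-Time BLR factorization: tiled LU without pivoting across tiles.
# Off-diagonal tiles are compressed right after the diagonal GETRF (step 1.1),
# TRSMs act only on the thin low-rank factors (step 1.2), and the low-rank
# update is applied to still-dense trailing tiles, i.e. LR2GE (step 1.3).
import numpy as np
from scipy.linalg import lu, solve_triangular

def compress(M, tau):                          # truncated SVD stand-in for RRQR
    Uf, s, Vt = np.linalg.svd(M, full_matrices=False)
    r = max(1, int(np.sum(s > tau * s[0])))
    return Uf[:, :r] * s[:r], Vt[:r, :].T      # M ~= U @ V.T

def blr_lu_jit(A, nb, tau):
    bs = A.shape[0] // nb
    T = [[A[i*bs:(i+1)*bs, j*bs:(j+1)*bs].copy() for j in range(nb)]
         for i in range(nb)]
    ranks = []
    for k in range(nb):
        P, L, U = lu(T[k][k])                  # 1.1 factorize diagonal block
        lo = {i: compress(T[i][k], tau) for i in range(k + 1, nb)}  # compress
        up = {j: compress(T[k][j], tau) for j in range(k + 1, nb)}
        for i, (Ui, Vi) in lo.items():         # 1.2 TRSM touches only r columns
            lo[i] = (Ui, solve_triangular(U, Vi, trans='T', lower=False))
        for j, (Uj, Vj) in up.items():
            up[j] = (solve_triangular(L, P.T @ Uj, lower=True), Vj)
        for i, (Ui, Vi) in lo.items():         # 1.3 LR2GE: small product only,
            for j, (Uj, Vj) in up.items():     #     accumulated in a dense tile
                T[i][j] -= Ui @ (Vi.T @ Uj) @ Vj.T
        ranks += [UV[0].shape[1] for UV in lo.values()]
    return ranks

idx = np.arange(240, dtype=float)              # smooth kernel: compressible
A = 240 * np.eye(240) + 1.0 / (1.0 + np.abs(np.subtract.outer(idx, idx)))
print(max(blr_lu_jit(A, nb=4, tau=1e-8), default=0), "max off-diagonal rank")
```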


Summary of the Just-In-Time strategy
Assets
- The most expensive operation, the update, becomes cheaper with the LR2GE kernel
- Forming the dense update and applying it remains inexpensive
- The size of the factors is reduced, and with it the cost of the solve
A limitation of this approach
- All blocks are allocated in full-rank form before being compressed
- Lifting this constraint (as the Minimal Memory strategy does) reduces the level of parallelism


Strategy Minimal Memory: compress A
1. Compress the large off-diagonal blocks of A (exploiting sparsity)
2. Eliminate each column block:
   2.1 Factorize the dense diagonal block
   2.2 Apply a TRSM on the low-rank blocks (cheaper)
   2.3 Apply the low-rank update to low-rank matrices (LR2LR extend-add)
3. Solve triangular systems with low-rank blocks
[Figure: animated block elimination, blocks coloured by operation: Compression, GETRF (Facto), TRSM (Solve), LR2LR (Update), LR2GE (Update).]
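The distinguishing kernel is the LR2LR extend-add: the target block stays compressed, and the update is accumulated by concatenating factors and recompressing a small core. Below is a hedged sketch with a textbook orthogonalize-then-SVD recompression, not PaStiX's exact kernel; in the real solver the update also has to be extended to the row/column support of the target block, which this toy ignores.

```python
# Sketch of an LR2LR update: accumulate -U @ V.T into a block already stored
# as UB @ VB.T without ever forming the dense block.
import numpy as np

def lr2lr(UB, VB, U, V, tau):
    """Return compressed factors of UB @ VB.T - U @ V.T."""
    Qu, Ru = np.linalg.qr(np.hstack([UB, -U]))  # rank rB + r concatenation
    Qv, Rv = np.linalg.qr(np.hstack([VB, V]))
    Us, s, Vt = np.linalg.svd(Ru @ Rv.T)        # core is (rB+r) x (rB+r) only
    r = max(1, int(np.sum(s > tau * s[0])))
    return Qu @ (Us[:, :r] * s[:r]), Qv @ Vt[:r, :].T
```

Since every operand has at most rB + r columns, the cost scales with the ranks rather than with the block size; this is why the strategy saves memory but can cost time when ranks grow, as the results below show.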


Experimental setup - architectures
- GENCI Occigen: 2 × 12-core Intel Xeon E5-2680v3 at 2.50 GHz, 128 GB
- MCIA Avakas: 2 × 6-core Intel Xeon X5675 at 3.06 GHz, 48 GB
- PlaFRIM Miriel: 2 × 12-core Intel Xeon E5-2680v3 at 2.50 GHz, 128 GB

Experimental setup - matrices
Large set (33 matrices): square matrices from the SuiteSparse Matrix Collection, real or complex, 50K ≤ N ≤ 5M, requiring more than 1 TFlop of factorization operations; web and DNA matrices were removed.
Small set (6 matrices):

Matrix      Fact.   Size        Field
atmosmodj   LU      1,270,432   atmospheric model
audi        LL^T    943,695     structural problem
Hook        LDL^T   1,498,023   model of a steel hook
Serena      LDL^T   1,391,349   gas reservoir simulation
Geo1438     LL^T    1,437,960   geomechanical model of earth
lap120      LDL^T   120^3       Poisson problem

Performance of RRQR/Just-In-Time wrt the full-rank version
[Figure: ratio of BLR factorization time to full-rank PaStiX time on lap120, atmosmodj, audi, Geo1438, Hook and Serena, for τ = 10^-4, 10^-8 and 10^-12; bars are annotated with the resulting backward error, e.g. 6.1e-10 on lap120 at τ = 10^-8.]
Factorization time is reduced by a factor of 2 for τ = 10^-8.

Behavior of RRQR/Minimal Memory wrt the full-rank version
[Figure: two panels over the same six matrices for τ = 10^-4, 10^-8 and 10^-12 - factorization-time ratio (left) and memory ratio (right) of BLR over full-rank PaStiX.]
Performance: time increases by a factor of 1.9 for τ = 10^-8; the behavior improves at lower accuracies.
Memory footprint: reduced by a factor of 1.7 for τ = 10^-8, close to the results obtained using SVD.

2.2 Impact of ordering techniques

Ordering for sparse matrices
One can reorder the unknowns within a separator without modifying the fill-in.
[Figure: symbolic factorization of an 8 × 8 × 8 Laplacian.]
Enhance the ordering:
- Modify the partitioning process to enhance compressibility
- Reorder unknowns within a separator, to enhance both the clustering within the separator and the coupling part

Clustering techniques: existing solutions
- k-way partitioning (dense, multifrontal)
[Figure: (a) symbolic factorization and (b) first-separator clustering of an 8 × 8 × 8 Laplacian partitioned using SCOTCH and k-way clustering on the first separator.]

Clustering techniques: existing solutions
- k-way partitioning (dense, multifrontal)
- Supernode reordering techniques (supernodal)
[Figure: (a) symbolic factorization and (b) first-separator clustering of an 8 × 8 × 8 Laplacian partitioned using SCOTCH and the reordering-based clustering on the first separator.]

Performance gain of the TSP ordering in full-rank
Results
- Reduces the number of off-diagonal blocks, and thus the overhead associated with low-rank updates
- Leads to larger data blocks, better suited to modern architectures
- Always increases performance with respect to the plain SCOTCH ordering

Architecture   Nb. units                 Mean gain   Max gain
Westmere       12 cores                  2%          6%
Xeon E5        24 cores                  7%          13%
Fermi          12 cores + 1 to 3 M2070   10%         20%
Kepler         24 cores + 1 to 4 K40     15%         40%
Xeon Phi       64 cores                  20%         40%

Performance gain when using the PaRSEC runtime system with TSP.
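As a rough illustration of the idea behind the TSP ordering: a greedy nearest-neighbour tour over the rows of a separator, with a symmetric-difference distance between sparsity patterns, already groups similar rows so their contributions merge into fewer, larger off-diagonal blocks. The actual heuristic in PaStiX is more elaborate; this stand-in is only a sketch.

```python
# Toy stand-in for the TSP reordering: order the unknowns of a separator so
# that consecutive rows have similar off-diagonal patterns.
import numpy as np

def greedy_tsp_order(patterns):
    """patterns: boolean (n, m) array, one off-diagonal row pattern per unknown."""
    n = patterns.shape[0]
    order, left = [0], set(range(1, n))
    while left:
        last = patterns[order[-1]]
        nxt = min(left, key=lambda i: int(np.count_nonzero(patterns[i] ^ last)))
        left.remove(nxt)
        order.append(nxt)
    return order

pats = np.array([[1, 1, 0, 0], [0, 0, 1, 1], [1, 1, 1, 0], [0, 1, 1, 1]], bool)
print(greedy_tsp_order(pats))                  # groups similar rows together
```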

Performance profile of the factorization time
[Figure: performance profiles (number of matrices vs overhead with respect to the optimal method) for Minimal-Memory and Just-In-Time at τ = 10^-8 and 10^-12, combined with the TSP, k-way and Projections orderings.]
Method
- For each matrix, the overhead of each method is computed with respect to the best method
- Overheads are sorted increasingly; a curve close to x = 1 is close to optimal
Results
- K-way outperforms TSP by 6%
- Projections outperforms k-way by 6% for the Minimal Memory strategy
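For reference, the profile curves on this slide and the next can be rebuilt from raw timings with a few lines; the three-method dictionary below is made-up data, not the measurements from the talk.

```python
# How a performance profile is built: divide each method's time by the
# per-matrix best over all methods, then sort each method's ratios.
import numpy as np

def performance_profile(times):
    """times: dict mapping method name -> list of times (one per matrix)."""
    best = np.vstack(list(times.values())).min(axis=0)   # per-matrix optimum
    return {m: np.sort(np.asarray(t) / best) for m, t in times.items()}

profiles = performance_profile({
    "TSP":         [10.0, 5.2, 7.1],          # illustrative numbers only
    "K-way":       [9.1, 5.5, 6.3],
    "Projections": [9.4, 4.9, 6.5],
})
# A curve that stays close to ratio 1.0 across all matrices is near-optimal.
print(profiles["K-way"])
```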

Performance profile of the memory consumption
[Figure: same performance-profile setup as above, applied to memory consumption.]
Results
- K-way outperforms TSP by 4%
- K-way outperforms Projections by 2%

Behavior of RRQR/Minimal Memory with PARSEC and Projections
[Figure: factorization time relative to full-rank static scheduling on the six small matrices, for τ = 10^-4, 10^-8 and 10^-12, before and after the changes.]
Adding the Projections clustering and using the PaRSEC runtime system reduces the factorization time.

3. Application: HORSE

Hybridizable DG method in 3D
Scattering of a plane wave by a jet
- Unstructured tetrahedral mesh: 1,645,874 elements / 3,521,251 faces
- Frequency: 600 MHz; wavelength: λ ≈ 0.5 m; penalty parameter: τ = 1

HDG method   # DoF Λ field   # DoF EM field
HDG-P1       21,127,506      39,500,976
HDG-P2       42,255,012      98,752,440
HDG-P3       70,425,020      197,504,880

[Figure: contour lines of |E| for HDG-P1 to HDG-P3.]

Impact on HORSE - Case HDG-P2

Subs  τ     Method          Fact (s)  Nb. of iters  Refinement (s)  HDGM (s)  Memory (GB)
32    -     Full-rank       526.0     9             254.5           814.0     41.7
32    1e-4  Just-In-Time    209.3     9             164.8           405.8     41.7 (22.3)
32    1e-4  Minimal-Memory  516.7     15            273.2           822.0     24.5
32    1e-8  Just-In-Time    325.2     9             192.9           550.4     41.7 (29.4)
32    1e-8  Minimal-Memory  600.1     9             193.2           825.8     30.5
48    -     Full-rank       237.3     8             118.6           374.6     24.9
48    1e-4  Just-In-Time    112.2     9             106.1           237.1     24.9 (14.1)
48    1e-4  Minimal-Memory  229.3     13            159.7           407.5     15.3
48    1e-8  Just-In-Time    171.5     8             105.8           296.1     24.9 (18.1)
48    1e-8  Minimal-Memory  319.4     8             109.8           447.9     18.8
64    -     Full-rank       179.8     9             104.7           298.3     17.4
64    1e-4  Just-In-Time    79.7      10            91.1            184.1     17.4 (10.0)
64    1e-4  Minimal-Memory  138.1     13            120.7           272.2     11.0
64    1e-8  Just-In-Time    124.6     9             90.0            228.2     17.4 (12.9)
64    1e-8  Minimal-Memory  239.2     9             91.5            344.5     13.7

4. Conclusion and future works

Conclusion
- Block low-rank compression reduces the time and memory complexity of the sparse direct solver
- The final accuracy can be controlled through the tolerance τ
- The same level of parallelism as the full-rank solver is exploited
- The runtime-system implementation benefits from dynamic scheduling for free
- It may accelerate hybrid solvers such as HORSE

Future work
Ordering strategies - E. Korkmaz (PhD)
- Target hierarchical compression formats
- Try to apply the strategy inside the ordering libraries (Scotch, Metis)
Others
- Distributed version (load balancing)
- Looking for an intern student/engineer
- Low-rank operations on GPUs
- Use other low-rank compression kernels, such as randomization
- Integration in new applications

Thank you. Questions?