Block Low-Rank (BLR) approximations to improve multifrontal sparse solvers


Clément Weisbecker, ENSEEIHT-IRIT, University of Toulouse, France
Joint work with Patrick Amestoy, Cleve Ashcraft, Olivier Boiteau, Alfredo Buttari and Jean-Yves L'Excellent
PhD started on October 1st, 2010 and financed by EDF
MUMPS Users Group Meeting, Clamart, France, May 29-30, 2013

Introduction

Multifrontal solver
- direct solver for large linear systems
- objective: A = LU

Low-rank approximations
- compression and flop reduction
- accuracy controlled by a numerical parameter

Combine these two notions to improve multifrontal solvers (in the context of MUMPS).

The multifrontal method

Separator tree (Schreiber 1982)
- dependencies between variables
- topological order

Frontal matrices
At each node, a partial factorization of the frontal matrix is performed:
- assembly: the contribution blocks of the children (CB 1, CB 2) are assembled into the front
- elimination: the FS variables are eliminated, which produces a new CB
- CBs are stacked while waiting for assembly (peak of CB stack)
- FS: separator variables; NFS: border variables

Low-rank approximations

Low-rank matrices

Low-rank matrix (Bebendorf): let $A$ be an $m \times n$ matrix and let $k_\varepsilon$ be the approximate numerical rank of $A$ at accuracy $\varepsilon$. $A$ is said to be a low-rank matrix if there exist matrices $X$ of size $m \times k_\varepsilon$, $Y$ of size $n \times k_\varepsilon$ and $E$ of size $m \times n$ such that

$$A = X Y^T + E, \quad \text{where } \|E\|_2 \le \varepsilon \text{ and } k_\varepsilon (m + n) < mn.$$

Low-rank product: $X_1 (Y_1^T X_2) Y_2^T$, where the inner factor $C = Y_1^T X_2$ is a small $k_1 \times k_2$ matrix.
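The definition above can be illustrated with a minimal NumPy sketch. It uses a truncated SVD to build $X$ and $Y$, which is one possible choice and not necessarily the compression kernel used in the talk (later slides mention a QR dropping parameter). The function names and the test kernel $1/|x - y|$ on two well-separated point sets are assumptions made for illustration only.

```python
import numpy as np

def low_rank_factors(A, eps):
    """Return X (m x k) and Y (n x k) with ||A - X @ Y.T||_2 <= eps (up to rounding).

    Hypothetical helper: truncated SVD, keeping the singular values above eps.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = int(np.sum(s > eps))      # numerical rank k_eps at accuracy eps
    X = U[:, :k] * s[:k]          # absorb the singular values into X
    Y = Vt[:k, :].T
    return X, Y

def is_low_rank(A, eps):
    """Storage test of the definition: k_eps * (m + n) < m * n."""
    m, n = A.shape
    k = low_rank_factors(A, eps)[0].shape[1]
    return k * (m + n) < m * n

def low_rank_product(X1, Y1, X2, Y2):
    """Product (X1 Y1^T)(X2 Y2^T), returned in low-rank form (X, Y)."""
    C = Y1.T @ X2                 # small k1 x k2 inner factor
    return X1 @ C, Y2

# Illustrative data: interactions between well-separated 1D point sets.
x = np.linspace(0.0, 1.0, 60)
y = np.linspace(3.0, 4.0, 50)
z = np.linspace(6.0, 7.0, 40)
A1 = 1.0 / np.abs(x[:, None] - y[None, :])   # 60 x 50
A2 = 1.0 / np.abs(y[:, None] - z[None, :])   # 50 x 40
eps = 1e-8
X1, Y1 = low_rank_factors(A1, eps)
X2, Y2 = low_rank_factors(A2, eps)
X, Y = low_rank_product(X1, Y1, X2, Y2)
```

The product never forms a dense intermediate larger than $k_1 \times k_2$, which is where the flop reduction of the low-rank update comes from.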

Where can I find low-rank matrices?
- Problem: matrices are usually not low-rank!
- Solution (Börm, Bebendorf): well-defined subblocks $A_b = A_{\sigma\tau}$ can be low-rank, with $b = \sigma \times \tau \subset I \times I$.
- Define a clustering $C$ of the index set $I$ (for example into $I_1$, $I_2$, $I_3$) to obtain low-rank blocks.

Admissibility condition

How to find a good clustering? Admissibility condition for elliptic PDEs (Börm, Grasedyck, Hackbusch): let $\sigma \subset I$ and $\tau \subset I$ be clusters of indices. The block $(\sigma, \tau)$ is said to be admissible if $\sigma$ and $\tau$ satisfy

$$\operatorname{diam}(\sigma) + \operatorname{diam}(\tau) \le 2 \eta \, \operatorname{dist}(\sigma, \tau),$$

where $\eta$ is a fixed, problem-dependent parameter (usually 1 or 0.5).
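As a small illustration of the condition, the sketch below evaluates it for two clusters given as arrays of point coordinates. Measuring `diam` and `dist` with Euclidean distances between points is an assumption; the talk later works on the graph rather than on the mesh, and all names here are hypothetical.

```python
import numpy as np

def diam(points):
    """Largest Euclidean distance between two points of a cluster."""
    d = points[:, None, :] - points[None, :, :]
    return float(np.sqrt((d ** 2).sum(axis=-1)).max())

def dist(p, q):
    """Smallest Euclidean distance between a point of p and a point of q."""
    d = p[:, None, :] - q[None, :, :]
    return float(np.sqrt((d ** 2).sum(axis=-1)).min())

def is_admissible(p, q, eta=1.0):
    """Check diam(p) + diam(q) <= 2 * eta * dist(p, q)."""
    return diam(p) + diam(q) <= 2.0 * eta * dist(p, q)

# Illustrative 1D clusters stored as (n_points, 1) arrays.
sigma = np.linspace(0.0, 1.0, 20)[:, None]
tau_far = np.linspace(3.0, 4.0, 20)[:, None]    # well separated from sigma
tau_near = np.linspace(1.0, 2.0, 20)[:, None]   # touches sigma
```

With these sets, `is_admissible(sigma, tau_far)` holds while `is_admissible(sigma, tau_near)` does not, because the two touching clusters are at distance zero.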

Illustration (1)
[Figure: two domains $D_1$ and $D_2$.]

Illustration (2)
[Figure: rank (vertical axis, 80 to 200) as a function of the distance between the two domains (horizontal axis, 0 to 100); annotation: room for compression.]

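Exploiting low-rank subblocks: H-matrices (Börm, Bebendorf, Hackbusch, Grasedyck)

$$\tilde{A} = \begin{pmatrix} \begin{pmatrix} D_{11} & X_{12} Y_{12}^T \\ X_{21} Y_{21}^T & D_{22} \end{pmatrix} & X_{36} Y_{36}^T \\ X_{63} Y_{63}^T & \begin{pmatrix} D_{44} & X_{45} Y_{45}^T \\ X_{54} Y_{54}^T & D_{55} \end{pmatrix} \end{pmatrix}$$

Block cluster tree: $I_7 \times I_7$ splits into $I_3 \times I_3$, $I_3 \times I_6$, $I_6 \times I_3$ and $I_6 \times I_6$; $I_3 \times I_3$ splits over $I_1$ and $I_2$, and $I_6 \times I_6$ over $I_4$ and $I_5$.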

Exploiting low-rank subblocks: HSS matrices (Li, Napov, Xia, Gu, Wang, Rouet, Gillman, Martinsson)

$$\tilde{A} = \begin{pmatrix} D_1 & X_1 B_1 Y_2^T & X_1 R_1 B_3 W_4^T Y_4^T & X_1 R_1 B_3 W_5^T Y_5^T \\ X_2 B_2 Y_1^T & D_2 & X_2 R_2 B_3 W_4^T Y_4^T & X_2 R_2 B_3 W_5^T Y_5^T \\ X_4 R_4 B_6 W_1^T Y_1^T & X_4 R_4 B_6 W_2^T Y_2^T & D_4 & X_4 B_4 Y_5^T \\ X_5 R_5 B_6 W_1^T Y_1^T & X_5 R_5 B_6 W_2^T Y_2^T & X_5 B_5 Y_4^T & D_5 \end{pmatrix}$$

[Figure: HSS tree with root 7 and children 3 and 6 (coupled by $B_3$, $B_6$); node 3 has leaves 1 and 2 (coupled by $B_1$, $B_2$) and node 6 has leaves 4 and 5 (coupled by $B_4$, $B_5$). Each leaf $i$ carries $D_i$, $X_i$, $Y_i$ and the translation operators $R_i$, $W_i$.]

Exploiting low-rank subblocks: BLR matrices

$$\tilde{A} = \begin{pmatrix} D_1 & X_{12} Y_{12}^T & X_{13} Y_{13}^T & X_{14} Y_{14}^T \\ X_{21} Y_{21}^T & D_2 & X_{23} Y_{23}^T & X_{24} Y_{24}^T \\ X_{31} Y_{31}^T & X_{32} Y_{32}^T & D_3 & X_{34} Y_{34}^T \\ X_{41} Y_{41}^T & X_{42} Y_{42}^T & X_{43} Y_{43}^T & D_4 \end{pmatrix}$$

- particular case of H-matrices
- no tree
- natural matrix structure

Idea: representing fronts with BLR matrices.
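The flat BLR layout can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: fixed-size index blocks stand in for a real clustering, the truncated SVD stands in for whatever compression kernel the solver uses, and the test matrix $1/(1 + |x_i - x_j|)$ is a made-up smooth kernel. None of the function names come from MUMPS.

```python
import numpy as np

def compress_block(B, eps):
    """Truncated SVD of B: returns X, Y with ||B - X @ Y.T||_2 <= eps (up to rounding)."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    k = int(np.sum(s > eps))
    return U[:, :k] * s[:k], Vt[:k, :].T

def blr_compress(A, block_size, eps):
    """Split a square matrix into a flat grid of blocks.

    Diagonal blocks stay dense; an off-diagonal block is stored as (X, Y)
    only when that is cheaper than the dense block.
    """
    n = A.shape[0]
    bounds = list(range(0, n, block_size)) + [n]
    blocks = {}
    for i in range(len(bounds) - 1):
        for j in range(len(bounds) - 1):
            B = A[bounds[i]:bounds[i + 1], bounds[j]:bounds[j + 1]]
            if i != j:
                X, Y = compress_block(B, eps)
                m, p = B.shape
                if X.shape[1] * (m + p) < m * p:
                    blocks[i, j] = ("lowrank", (X, Y))
                    continue
            blocks[i, j] = ("dense", B.copy())
    return bounds, blocks

def blr_storage(blocks):
    """Number of stored entries in the BLR representation."""
    total = 0
    for kind, data in blocks.values():
        total += data.size if kind == "dense" else data[0].size + data[1].size
    return total

def blr_to_dense(bounds, blocks):
    """Rebuild the (approximate) dense matrix from its BLR blocks."""
    n = bounds[-1]
    A = np.zeros((n, n))
    for (i, j), (kind, data) in blocks.items():
        B = data if kind == "dense" else data[0] @ data[1].T
        A[bounds[i]:bounds[i + 1], bounds[j]:bounds[j + 1]] = B
    return A

# Illustrative kernel matrix on 256 points of [0, 10], four blocks of 64.
x = np.linspace(0.0, 10.0, 256)
A = 1.0 / (1.0 + np.abs(x[:, None] - x[None, :]))
bounds, blocks = blr_compress(A, block_size=64, eps=1e-8)
fraction = blr_storage(blocks) / A.size   # fraction of dense storage
```

`fraction` plays the role of the "fraction of dense storage" metric used in the comparative study, here for a toy matrix rather than a frontal matrix.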

Comparative study: compression rates
Compression rates of the frontal matrix at the root of a multifrontal tree, on two 3D stencils (discretization of a 128 x 128 x 128 cube).
[Figure, two panels (Laplacian; 27-point Geoazur stencil): fraction of dense storage as a function of the low-rank threshold, from $10^{-14}$ to $10^{-2}$, for the H, HSS and BLR formats.]

Comparative study: compression cost
Compression cost of the frontal matrix at the root of a multifrontal tree, on two 3D stencils (discretization of a 128 x 128 x 128 cube).
[Figure, two panels (Laplacian; 27-point Geoazur stencil): $\log_{10}$ of the number of operations for compression as a function of the low-rank threshold, from $10^{-14}$ to $10^{-2}$, for the H, HSS and BLR formats.]

Adaptation to a multifrontal solver
- no relative order
- efficient compression and decompression
- flat, non-hierarchical structure
- efficient clustering
- distribution
- assembly
- pivoting

Clustering variables

Admissible clustering
Constraint: the admissibility condition should be satisfied.
- large diameters: fraction of memory used 83%
- small diameters: fraction of memory used 57%

Halo algorithm for the clustering of a separator
- designed to catch the geometry of the problem
- computed with the graph instead of the mesh
- coupled with a third-party partitioning tool

Steps:
1. The separator
2. The halo
3. Extraction of the halo
4. Partition of the halo
5. Partition of the separator (block size is fixed)

[Figure: the resulting parts $P_1$, $P_2$, $P_3$.]

Clustering of the variables of a front
front = separator + border
1. Separator: halo algorithm
2. Border: two choices
   - EXPLICIT
   - INHERITED (top down)

Structural comparison
- optimal with optimal = optimal block
- small with optimal = large enough block
- small with small = too small block
- reclustering strategies

Illustration with a top-level separator from SCOTCH
[Figure: side view, face view and perspective view.]

Block Low-Rank multifrontal method

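General idea
Fronts are approximated with low-rank matrices.
[Figure: a front whose diagonal blocks are factored as $L^i_{1,1} U^i_{1,1}$, $L^i_{2,2} U^i_{2,2}$, ..., $L^i_{d_i,d_i} U^i_{d_i,d_i}$, with an off-diagonal block indexed by the clusters $\sigma$ and $\tau$, and the contribution block CB.]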

Switch level (1)
[Figure: percentage of the entries of the factor computed in fronts larger than N, as a function of the mesh size ($2^{10}$ to $2^{16}$), for N = 200, 400, 600, 800 and 1000.]
A large ratio of the memory consumption is targeted.
[Figure: percentage of operations performed in fronts larger than N, as a function of the mesh size ($2^{10}$ to $2^{16}$), for the same values of N.]
A large ratio of the flop count is targeted.

Switch level (2)
Low-rank techniques are only used on large fronts.

mesh size   2^10   2^11   2^12   2^13   2^14   2^15
N = 200     1.61   1.68   1.71   1.73   1.74   1.74
N = 400     0.37   0.40   0.42   0.42   0.43   0.43
N = 600     0.17   0.20   0.21   0.22   0.23   0.23
N = 800     0.07   0.09   0.10   0.11   0.11   0.11
N = 1000    0.05   0.06   0.06   0.06   0.06   0.06

(Proportion of fronts larger than N with respect to the total number of fronts in the multifrontal tree, for different mesh sizes.)

Only a few fronts are actually compressed!

Factorization algorithms

task            operation type                  dense      low-rank
Factor (F)      B = L U^T                       (2/3)n^3   (2/3)n^3
Compress (C)    C = X Y^T                                  k n^2
Solve (S)       D = X (Y^T L^{-1})              n^3        k n^2
CB update (U)   D = D - X_1 (Y_1^T X_2) Y_2^T   2n^3       2k n^2

- Standard FR algorithm: Factor, Solve, Update
- BLR FSCU algorithm: Factor, Solve, Compress, Update
- FCSU version: more efficient, less stability
- FSUC version: no flop reduction, more stability

High algorithmic flexibility.
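The trade-off between the variants can be made concrete with a small cost model built from the table. The sketch assumes a single block of size n and rank k, and assumes that a task uses the low-rank cost only when Compress has already run in that variant; both are simplifications for illustration, not the solver's actual accounting.

```python
def variant_costs(n, k):
    """Leading-order flop counts per block for each task ordering.

    Assumption: Solve and Update cost n^3 and 2n^3 on a dense block,
    k n^2 and 2k n^2 on a compressed one; Compress costs k n^2.
    """
    factor = (2.0 / 3.0) * n ** 3
    compress = k * n ** 2
    solve_dense, solve_lr = n ** 3, k * n ** 2
    update_dense, update_lr = 2 * n ** 3, 2 * k * n ** 2
    return {
        "FR": factor + solve_dense + update_dense,               # no compression
        "FSCU": factor + solve_dense + compress + update_lr,     # compress after Solve
        "FCSU": factor + compress + solve_lr + update_lr,        # compress before Solve
        "FSUC": factor + solve_dense + update_dense + compress,  # compress last
    }

costs = variant_costs(n=100, k=10)
```

Under these assumptions FCSU is the cheapest ordering and FSUC costs slightly more than FR, which matches the "more efficient" and "no flop reduction" remarks on the slide; in FSUC the compression only reduces storage.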

Experiments

Set of problems

Name             Prop       Arith   N (10^6)   NZ (10^9)   mem LU (GB)   flops LU (10^12)   CSR     appli
Curl 5000^2      2D/sym     D       50         0.2         29            5                  2E-15
Geoazur 128^3    3D/unsym   Z       2          55          54            60                 3E-4    wave prop
EDF_A_MECA_R12   2D/sym     D       134        1           151           200                4E-15   mechanics
EDF_D_THER_R7    3D/sym     D       8          118         229           100                8E-15   thermics

- CSR = Componentwise Scaled Residual
- Code_Aster tpll01{a,d} test cases (refined)
- large matrices
- applicative problems

Metrics
- memory:
  - L = fraction of FR factors storage obtained with BLR (%)
  - CB = fraction of FR maximum size of CB stack obtained with BLR (%)
- flops: fraction of FR operations needed for the BLR factorization (in percent or absolute data, including the compression cost)

Clustering strategy (inh = inherited, exp = explicit)

             memory L        memory CB       flops facto     time clustering   time full-rank analysis
             inh     exp     inh     exp     inh     exp     inh      exp
Curl5000     63.7%   62.4%   7.0%    5.5%    10.9%   11.1%   13 s     27 s     897 s
Geoazur128   79.0%   77.0%   47.0%   45.0%   60.8%   59.1%   5 s      42 s     62 s
TH_RAFF7     34.1%   30.7%   17.5%   16.2%   7.2%    6.6%    34 s     206 s    387 s
ME_RAFF12    52.9%   51.1%   4.8%    4.1%    6.1%    6.1%    121 s    239 s    1971 s

- the inherited version is more than 2 times faster
- same results on $L_{11}$
- comparable results on $L_{21}$
- slightly worse on CBs
- inherited clustering is used for all the experiments

Global ordering of the matrix: general results
Results with different orderings on the Geoazur128 problem (FR columns: full-rank memory, flops and peak; LR columns: L, CB and flops as fractions of FR).

            FR                            LR
ordering    memory   flops     peak      L       CB      flops
AMD         109GB    3.9E+14   92GB      73.5%   40.0%   59.4%
AMF         72GB     1.8E+14   45GB      69.9%   53.5%   48.3%
PORD        55GB     1.0E+14   28GB      70.4%   49.0%   48.9%
METIS       46GB     6.2E+13   20GB      78.7%   46.4%   62.6%
SCOTCH      49GB     6.6E+13   21GB      79.4%   48.1%   63.4%

Algebraic approach.

Global ordering of the matrix: AMF tree
[Figure: multifrontal tree obtained with AMF; visualization tool developed by M. Bremond, talk tomorrow 11:30.]

Global ordering of the matrix: SCOTCH tree
[Figure: multifrontal tree obtained with SCOTCH; same visualization tool.]

Influence of the size of the problem
Different refinements of problem EDF_A_MECA, done with Homard.

                            memory          flops
Refinement   N              L       CB      facto
R8           2,101,258      71%     43%     37%
R11          33,570,826     59%     29%     12%
R12          134,250,506    53%     24%     7%

- the larger the problem, the more efficient the method
- target = challenging problems

Scalability with respect to the size of the problem
Laplacian problem, $\varepsilon = 10^{-14}$.
[Figure: flops for the factorization (logarithmic scale) as a function of the mesh size M, from 32 to 256, for Full Rank, $O(N^2)$, and BLR, $O(N^{4/3})$.]
$O(N^{4/3})$ complexity.
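To read the two exponents against the mesh-size axis, assume (this is not stated on the slide) that the 3D Laplacian is discretized on an $M \times M \times M$ grid, so that $N = M^3$. Then, up to constants,

$$\frac{N^2}{N^{4/3}} = N^{2/3} = M^2, \qquad \frac{(2M)^6}{M^6} = 64, \qquad \frac{(2M)^4}{M^4} = 16.$$

Under that assumption, doubling the mesh size multiplies the full-rank flop count by about 64 and the BLR flop count by about 16, so the gap between the two curves widens by a factor of 4 at each doubling.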

Influence of the size of the clusters

             memory          flops
block size   L       CB      facto   compr     other
100          88%     55%     75%     9.8E+11   4.6E+13
150          83%     50%     68%     1.5E+12   4.1E+13
200          79%     46%     63%     1.9E+12   3.7E+13
250          76%     45%     60%     2.4E+12   3.4E+13
300          75%     44%     58%     2.7E+12   3.3E+13
350          74%     44%     56%     3.1E+12   3.2E+13
400          73%     45%     56%     3.4E+12   3.1E+13
450          73%     44%     56%     3.7E+12   3.1E+13
500          73%     44%     57%     4.4E+12   3.1E+13

- flexibility (no need for fine tuning)
- can be dynamically adapted for pivoting, parallelism, ...

Global results: 2D problems
[Figure, two panels (EDF_A_MECA_R12; CURL-CURL 2D 5000x5000): L, CB and factorization flops (% of FR) and CSR as functions of the QR dropping parameter, from $10^{-14}$ to $10^{-2}$.]

Name               N        mem LU   flops LU   CSR
EDF_A_MECA_R12     134E+6   151 GB   2E+14      4E-15
Curl-Curl 5000^2   50E+6    29 GB    5E+12      2E-15

Global results: 3D problems
[Figure, two panels (EDF_D_THER_R7; GEOAZUR 3D 128x128x128): L, CB and factorization flops (% of FR) and CSR as functions of the QR dropping parameter, from $10^{-14}$ to $10^{-2}$.]

Name            N      mem LU   flops LU   CSR
EDF_D_THER_R7   8E+6   229 GB   1E+14      8E-15
Geoazur 128^3   2E+6   54 GB    6E+13      3E-4

Application to geophysics (1)
Helmholtz equation for seismic modeling (SEISCOPE project), EAGE overthrust ground model.
[Figure: velocity model, 3000 to 6000 m/s; axes Dip (km), 0 to 20, Cross (km), 0 to 20, and Depth (km), 0 to 4.]

fqcy   Flops LU    Mem LU   Peak memory
2 Hz   8.957E+11   3 GB     4 GB
4 Hz   1.639E+13   22 GB    25 GB
8 Hz   5.769E+14   247 GB   283 GB

Application to geophysics (2)
[Figure: panels a) to l), each with axes Dip (km), Cross (km) and Depth (km).]

Application to geophysics (3)
[Figure: panels a) to d), each with axes Dip (km), Cross (km) and Depth (km).]

                 flops    memory
ε        fqcy    facto    F        CB
10^-5    2 Hz    41.8%    61.8%    32.3%
         4 Hz    27.4%    50.0%    24.4%
         8 Hz    21.8%    41.6%    23.9%
10^-4    2 Hz    32.9%    53.4%    23.9%
         4 Hz    20.0%    42.2%    21.7%
         8 Hz    15.2%    28.9%    19.4%
10^-3    2 Hz    24.6%    44.7%    16.8%
         4 Hz    13.8%    34.5%    19.0%
         8 Hz    9.8%     21.3%    15.9%

Preconditioning with BLR: set of problems

Name       N       NZ       Cond     application
Piston     13E+6   547E+6   51E+5    external pressure force on the top
perf001d   20E+6   758E+6   15E+11   cavity hook subjected to internal pressure force (challenging for EDF)

Preconditioning with BLR: number of iterations
Number of iterations needed for convergence. FR is MUMPS 4.10 in single precision; a run counts as not converged after 1,000 seconds.

Columns: FR, then ε = 10^-8, 10^-7, 10^-6, 10^-5, 10^-4, 10^-3, 10^-2
Piston:   3 3 3 3 4 7 115
perf001d: 47 65 65 65 65 77

perf001d: only 2 iterations to convergence with a BLR double-precision preconditioner at ε = 10^-8. A good solution, versus the coarse sword of the usual single precision!

Preconditioning with BLR: Piston (compr and time)
[Figure: L, CB and factorization flops (% of FR) as functions of the QR dropping parameter, from $10^{-8}$ to $10^{-3}$.]
[Figure: total execution time, factorization time and solves time (GCPC), in seconds, as functions of the QR dropping parameter. FR reference: total 313 s, factorization 308 s, solves 5 s.]

Preconditioning with BLR: perf001d (compr and time)
[Figure: L, CB and factorization flops (% of FR) as functions of the QR dropping parameter, from $10^{-8}$ to $10^{-4}$.]
[Figure: total execution time, solves time (GCPC) and factorization time, in seconds, as functions of the QR dropping parameter. FR reference: total 569 s, solves 117 s, factorization 452 s.]

Software work
- new FR factorization with two levels of blocking:
  - internal panels: efficiency of BLAS 3 operations
  - external panels: correspond to low-rank blocks
- done in the factorization algorithm:
  - BLR seq unsymm with all MUMPS features and double panel
  - BLR seq symm prototype (single panel, no pivoting)
  - actual flop compression is achieved in both cases
  - MPI symm and unsymm: panel facto rewritten for BLR
- very short term: BLR seq symm with all MUMPS features and double panel
- short term: BLR MPI symm and unsymm
- long term: BLR solution phase, actual memory gains

Conclusion & perspectives

Conclusion
- efficient method on various applicative problems
- considerable memory reduction & substantial decrease in computations
- $O(N^{4/3})$ complexity, comparable to HSS
- can be used as a preconditioner or as a direct solver
- good potential for parallelism
- code is stable on tested problems

Perspectives
- MPI
- study on larger and more difficult problems
- error propagation study
- absolute or relative dropping parameter? relative to what? (work with S. Gratton, M. Ngom and D. Titley-Peloquin started)

Acknowledgements
- O. Boiteau, B. Quinnez and N. Tardieu (EDF R&D)
- S. Operto, R. Brossier and J. Virieux (SEISCOPE Project) for their contribution to the geophysics study
- the Toulouse Computing Center (CICT) and N. Renon
- S. Li, A. Napov and F.-H. Rouet (LBNL Berkeley)

Details on this work can be found in: P. Amestoy, C. Ashcraft, O. Boiteau, A. Buttari, J.-Y. L'Excellent and C. Weisbecker, Improving multifrontal methods by means of block low-rank representations, IRIT technical report RT/APO/12/6, INRIA technical report INRIA/RR-8199, submitted to SIAM SISC, http://weisbecker.perso.enseeiht.fr/documents/rt_apo_12_6_blr.pdf, 2012.

Thank you! Any questions?