
Parallel Preconditioning of Linear Systems based on ILUPACK for Multithreaded Architectures

J.I. Aliaga (1), M. Bollhöfer (2), A.F. Martín (1), E.S. Quintana-Ortí (1)

(1) Department of Computer Science and Engineering, Univ. Jaume I (Spain), {aliaga,martina,quintana}@icc.uji.es
(2) Institute of Computational Mathematics, TU Braunschweig (Germany), m.bollhoefer@tu-braunschweig.de

March 2009

Motivation and Introduction

We consider the efficient solution of linear systems Ax = b, where
- A ∈ R^{n×n} is a large and sparse SPD coefficient matrix,
- x ∈ R^n is the sought-after solution,
- b ∈ R^n is a given right-hand-side vector.

The solution of this problem
- arises in very many large-scale application problems, and
- is frequently the most time-consuming part of the computation.

Motivation and Introduction

We propose to solve Ax = b iteratively using
- the Preconditioned Conjugate Gradient (PCG) method, among the best iterative approaches for SPD linear systems, and
- ILUPACK preconditioning techniques:
  - Incomplete Cholesky (IC) factorization-based preconditioners;
  - Multilevel IC (MIC) preconditioners following the inverse-based approach;
  - successful for a wide range of large-scale application problems with up to one million equations.
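
For concreteness, here is a minimal PCG iteration written in C. It is a textbook sketch, not ILUPACK's API: spmv and apply_prec are assumed user-supplied kernels standing in for the sparse matrix-vector product and the (M)IC preconditioner solve.

#include <math.h>
#include <stdlib.h>

/* Assumed, user-supplied kernels (not ILUPACK's actual interface):
   spmv:       y := A*x for the sparse SPD matrix A
   apply_prec: z := M^{-1}*r, the (M)IC preconditioner solve */
void spmv(int n, const void *A, const double *x, double *y);
void apply_prec(int n, const void *M, const double *r, double *z);

static double dot(int n, const double *x, const double *y) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += x[i] * y[i];
    return s;
}

/* Preconditioned Conjugate Gradient: solves A*x = b starting from the
   initial guess in x; returns the number of iterations performed. */
int pcg(int n, const void *A, const void *M,
        const double *b, double *x, double tol, int maxit) {
    double *r = malloc(n * sizeof *r), *z = malloc(n * sizeof *z);
    double *p = malloc(n * sizeof *p), *q = malloc(n * sizeof *q);
    spmv(n, A, x, q);                            /* q = A*x0            */
    for (int i = 0; i < n; i++) r[i] = b[i] - q[i];
    apply_prec(n, M, r, z);                      /* z = M^{-1}*r        */
    for (int i = 0; i < n; i++) p[i] = z[i];
    double rz = dot(n, r, z), nb = sqrt(dot(n, b, b));
    int k = 0;
    while (k < maxit && sqrt(dot(n, r, r)) > tol * nb) {
        spmv(n, A, p, q);                        /* q = A*p             */
        double alpha = rz / dot(n, p, q);
        for (int i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * q[i]; }
        apply_prec(n, M, r, z);
        double rz_new = dot(n, r, z);
        double beta = rz_new / rz;
        rz = rz_new;
        for (int i = 0; i < n; i++) p[i] = z[i] + beta * p[i];
        k++;
    }
    free(r); free(z); free(p); free(q);
    return k;
}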

Motivation and Introduction

Focus: the large scale of the problems faced by ILUPACK makes evident the need to reduce the time-to-solution via parallel computing techniques.

Development of ILUPACK-based solvers on shared-memory architectures:
- parallelism revealed by nested dissection on the adjacency graph G(A), through recursive application of vertex-separator partitioning;
- dynamic scheduling to improve load balancing during execution, tuned at run time for a specific platform/application problem;
- PCG operations conformal with the MIC parallelization;
- implemented using standard tools: threads and OpenMP.

The parallelization approach preserves the semantics of the MIC and delivers a high degree of concurrency for shared-memory architectures.

Numerical results for two large-scale 3D application problems.

Outline

1 Motivation and Introduction
2 ILUPACK inverse-based MIC factorization preconditioners
3 Parallel inverse-based MIC factorization preconditioners
4 Experimental Results
5 Conclusions and Future Work

ILUPACK inverse-based MIC factorization preconditioners
Outline

1 Motivation and Introduction
2 ILUPACK inverse-based MIC factorization preconditioners
3 Parallel inverse-based MIC factorization preconditioners
4 Experimental Results
5 Conclusions and Future Work

ILUPACK inverse-based MIC factorization preconditioners
IC factorization preconditioners

The (root-free) sparse Cholesky decomposition computes L ∈ R^{n×n}, sparse unit lower triangular, and D ∈ R^{n×n}, diagonal with positive entries, such that A = L D L^T.

In the IC factorization, A is factored as A = L D L^T + E, where E ∈ R^{n×n} is a small perturbation matrix consisting of those entries dropped during the factorization.

M = L D L^T is applied to the original system Ax = b in order to accelerate the convergence of the PCG solver.
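
Applying M = L D L^T inside PCG amounts to a forward solve with L, a diagonal scaling with D, and a backward solve with L^T. A minimal sketch follows, assuming the strictly lower triangular part of L is stored column-wise in CSC arrays (colptr, rowind, val); ILUPACK's internal storage differs.

/* Apply z = (L D L^T)^{-1} r, with unit lower triangular L whose strictly
   lower part is stored column-wise (CSC). Illustrative sketch only. */
void apply_ic_prec(int n, const int *colptr, const int *rowind,
                   const double *val, const double *d,
                   const double *r, double *z) {
    for (int i = 0; i < n; i++) z[i] = r[i];
    /* forward solve L y = r (y overwrites z); unit diagonal */
    for (int k = 0; k < n; k++)
        for (int idx = colptr[k]; idx < colptr[k+1]; idx++)
            z[rowind[idx]] -= val[idx] * z[k];
    /* diagonal solve D w = y */
    for (int i = 0; i < n; i++) z[i] /= d[i];
    /* backward solve L^T z = w; row k of L^T is column k of L */
    for (int k = n - 1; k >= 0; k--)
        for (int idx = colptr[k]; idx < colptr[k+1]; idx++)
            z[k] -= val[idx] * z[rowind[idx]];
}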

ILUPACK inverse-based MIC factorization preconditioners
Inverse-based IC factorization preconditioners

The PCG solver converges quickly if the preconditioned matrix is close to I:

  D^{-1/2} L^{-1} A L^{-T} D^{-1/2} = I + D^{-1/2} L^{-1} E L^{-T} D^{-1/2} = I + F

F = D^{-1/2} L^{-1} E L^{-T} D^{-1/2} must be small to construct effective preconditioners.

Inverse-based IC factorizations construct M = L D L^T such that ||L^{-1}|| ≤ ν.
- ν = 5 or ν = 10 are good choices for many problems.
- Pivoting and multilevel methods are required in general.

ILUPACK inverse-based MIC factorization preconditioners
Multilevel inverse-based IC factorization

[Figure: 1st algebraic level of the factorization; at the current step, rows/columns with ||e_k^T L^{-1}|| ≤ ν continue to be factorized, those with ||e_k^T L^{-1}|| > ν are postponed as bad pivots, until the level is completed and only bad pivots remain.]

The IC decomposition employs inverse-based pivoting:
- row/column k of L / L^T is obtained at step k, along with an estimate t_k ≈ ||e_k^T L^{-1}||;
- rows/columns with t_k > ν are moved to the bottom/right end of the matrix: they are considered bad pivots with respect to the constraint ||L^{-1}|| ≤ ν;
- the IC factorization stops when the trailing principal block contains only bad pivots.

The computation of the approximate Schur complement S_c completes...

ILUPACK inverse-based MIC factorization preconditioners
Multilevel inverse-based IC factorization

[Figure: same 1st-level factorization diagram as on the previous slide.]

... a partial IC decomposition of a permuted system,

  P̂^T [ B  F^T ; F  C ] P̂ = [ L_B  0 ; L_F  I ] [ D_B  0 ; 0  S_c ] [ L_B^T  L_F^T ; 0  I ] + E,

satisfying ||L_B^{-1}|| ≤ ν.

The whole method is restarted on S_c, leading to a multilevel approach.

ILUPACK inverse-based MIC factorization preconditioners
Multilevel inverse-based IC factorization

The MIC factorization can be expressed algorithmically as follows:
1 Reorder A: P^T A P = Â, by some fill-reducing ordering matrix P.
2 Compute a partial IC decomposition as a by-product of inverse-based pivoting:
    P̂^T [ B  F^T ; F  C ] P̂ = [ L_B  0 ; L_F  I ] [ D_B  0 ; 0  S_c ] [ L_B^T  L_F^T ; 0  I ] + E
3 Proceed to the next level by repeating steps 1 and 2 with A := S_c, until S_c is void or dense enough to be handled by a dense Cholesky solver.
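
This recursion can be sketched in C as follows. mat_size, reorder, partial_ic and dense_chol are hypothetical helpers introduced only for illustration (not ILUPACK's actual interface), with nu the inverse bound ν and tau the drop tolerance.

#include <stddef.h>

typedef struct SparseMat SparseMat;   /* opaque sparse matrix type */
typedef struct Level {
    struct Level *next;   /* factorization of the next level (on S_c)     */
    /* the factors L_B, D_B, L_F and the permutations would live here     */
} Level;

/* Hypothetical helpers: */
int        mat_size(const SparseMat *A);
SparseMat *reorder(const SparseMat *A);                  /* fill-reducing P  */
Level     *partial_ic(const SparseMat *Ahat, double nu,  /* inverse-based    */
                      double tau, SparseMat **Sc);       /* pivoting, step 2 */
Level     *dense_chol(const SparseMat *A);               /* dense Cholesky   */

Level *mic_factor(SparseMat *A, double nu, double tau, int dense_cut) {
    if (A == NULL || mat_size(A) == 0)          /* S_c void: recursion ends */
        return NULL;
    if (mat_size(A) <= dense_cut)               /* dense enough: last level */
        return dense_chol(A);
    SparseMat *Ahat = reorder(A);               /* step 1: P^T A P = Ahat   */
    SparseMat *Sc = NULL;
    Level *lvl = partial_ic(Ahat, nu, tau, &Sc);    /* step 2               */
    lvl->next = mic_factor(Sc, nu, tau, dense_cut); /* step 3: recurse      */
    return lvl;
}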

ILUPACK inverse-based MIC factorization preconditioners
Multilevel inverse-based IC factorization

[Figure: sparsity patterns of the successive MIC levels.]

MIC factorization of the five-point matrix arising from the Laplace PDE discretization: ν = 5 and τ = 10^{-3} lead to a MIC factorization with 5 levels.

Parallel inverse-based MIC factorization preconditioners
Outline

1 Motivation and Introduction
2 ILUPACK inverse-based MIC factorization preconditioners
3 Parallel inverse-based MIC factorization preconditioners
4 Experimental Results
5 Conclusions and Future Work

Parallel inverse-based MIC factorization preconditioners
Nested dissection preprocessing step

[Figure: A → Π^T A Π and the adjacency graph G(A) partitioned by vertex separators; the resulting task dependency tree has leaves (1,1), (1,2), (1,3), (1,4), separators (2,1), (2,2) and root (3,1).]

- Parallelism is revealed by nested dissection on the adjacency graph G(A); fast and efficient multilevel (MLND) variants are provided by the SCOTCH package.
- Parallelization is driven by the dependencies captured in the task tree: the subdomains are first eliminated independently, then the separators (2,1) and (2,2) in parallel, and finally the root separator (3,1).
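
A possible representation of this task tree as a data structure (illustrative only; SCOTCH provides the actual permutation and separator information):

/* Task-dependency tree implied by nested dissection: each node owns an
   index set (a subdomain at the leaves, a separator above) and becomes
   ready once both children have finished. Hypothetical sketch. */
typedef struct Task {
    int *indices;                 /* rows/columns owned by this node      */
    int  n_indices;
    struct Task *left, *right;    /* children (NULL at the leaves)        */
    struct Task *parent;          /* NULL at the root separator           */
    int pending_children;         /* unresolved dependencies (0, 1 or 2)  */
} Task;

/* A node may be scheduled as soon as all of its children have finished. */
int task_is_ready(const Task *t) { return t->pending_children == 0; }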

Parallel inverse-based MIC factorization preconditioners
Parallel Multilevel inverse-based IC factorization

[Figure: Π^T A Π disassembled into the sum A^(1,1) + A^(1,2) + A^(1,3) + A^(1,4) of leaf submatrices with their contribution blocks, guided by the task tree.]

- Disassemble Π^T A Π into a sum of submatrices, one per leaf node.
- The factorization of the diagonal blocks of Π^T A Π can then proceed in parallel, as can the updates on the blocks corresponding to all ancestor nodes.
- The contribution blocks of each submatrix hold the updates on ancestor nodes performed during the factorization of the leading block.

Parallel inverse-based MIC factorization preconditioners
Parallel Multilevel inverse-based IC factorization

The tasks compute a local MIC restricted to the leading block of their submatrix.

[Figure: 1st local algebraic level of a leaf task; pivots with ||e_k^T L^{-1}|| ≤ ν are factorized, those with ||e_k^T L^{-1}|| > ν are postponed within the leading block, and the contribution blocks hold the locally updated entries.]

The IC decomposition employs restricted inverse-based pivoting:
- bad pivots are only moved to the bottom/right end of the leading block;
- the IC decomposition stops if the trailing part of the leading block contains only bad pivots;
- postponed pivots are still resolved inside the task, by entering the next local MIC level.

Parallel inverse-based MIC factorization preconditioners
Parallel Multilevel inverse-based IC factorization

The tasks compute a local MIC restricted to the leading block of their submatrix.

[Figure: k-th local algebraic level; the restricted pivoting is repeated until only bad pivots remain on the leading block, with the locally updated entries kept in the contribution blocks.]

- When the set of bad pivots is small enough, the local MIC stops.
- The task then sends the local contributions to the parent task.

Parallel inverse-based MIC factorization preconditioners
Parallel Multilevel inverse-based IC factorization

[Figure: merge step; the contributions A^(2,1) of the children (1,1) and (1,2) are combined into the submatrix of the separator task (2,1), together with the S_c blocks.]

- The parent task merges the contributions resulting from its children.
- A new submatrix is constructed for the local MIC decomposition of the separator task:
  - bad pivots resulting from its children are moved to the top of the leading block;
  - the contribution blocks are added.

Parallel inverse-based MIC factorization preconditioners
Parallel Multilevel inverse-based IC factorization

The computation of a leaf node (i, j) can be expressed algorithmically as follows:
1 Construct the task submatrix A^(i,j) from Π^T A Π.
2 Reorder A^(i,j): (P^(i,j))^T A^(i,j) P^(i,j) = Â^(i,j), by some fill-reducing ordering matrix P^(i,j) restricted to the leading block of A^(i,j).
3 Compute a partial IC decomposition as a by-product of inverse-based pivoting restricted to the leading block of A^(i,j).
4 Proceed to the next local level by repeating steps 2 and 3 until the set of bad pivots is small enough.
5 Send the local contributions to the parent task.

Parallel inverse-based MIC factorization preconditioners
Parallel Multilevel inverse-based IC factorization

For a separator intermediate task (i, j)...
1 Construct the task submatrix A^(i,j) by merging the contributions of its children.
2 Reorder A^(i,j): (P^(i,j))^T A^(i,j) P^(i,j) = Â^(i,j), by some fill-reducing ordering matrix P^(i,j) restricted to the leading block of A^(i,j).
3 Compute a partial IC decomposition as a by-product of inverse-based pivoting restricted to the leading block of A^(i,j).
4 Proceed to the next local level by repeating steps 2 and 3 until the set of bad pivots is small enough.
5 Send the local contributions to the parent task.

Parallel inverse-based MIC factorization preconditioners
Parallel Multilevel inverse-based IC factorization

For the separator root task (i, j)...
1 Construct the task submatrix A^(i,j) by merging the contributions of its children.
2 Reorder A^(i,j): (P^(i,j))^T A^(i,j) P^(i,j) = Â^(i,j), by some fill-reducing ordering matrix P^(i,j) restricted to the leading block of A^(i,j).
3 Compute a partial IC decomposition as a by-product of inverse-based pivoting restricted to the leading block of A^(i,j).
4 Proceed to the next local level by repeating steps 2 and 3 until the approximate Schur complement is dense enough or void.
5 The MIC factorization is finished.

A sketch of the per-task computation common to the three cases follows below.
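
All helper functions in this sketch are hypothetical placeholders, and Task and SparseMat are the types from the earlier sketches; it only mirrors the step lists above.

#include <stddef.h>

SparseMat *extract_leaf_submatrix(const Task *t);     /* from Pi^T A Pi     */
SparseMat *merge_child_contributions(const Task *t);  /* children's results */
SparseMat *reorder_leading_block(SparseMat *A);       /* local ordering P   */
SparseMat *partial_ic_restricted(SparseMat *A, double nu, double tau);
int  few_bad_pivots(const SparseMat *A);
int  schur_dense_or_void(const SparseMat *A);
void send_contributions(Task *t, SparseMat *A);

void execute_task(Task *t, double nu, double tau) {
    /* step 1: construct the task submatrix A^(i,j) */
    SparseMat *A_t = (t->left == NULL)
        ? extract_leaf_submatrix(t)        /* leaf: disassembled block     */
        : merge_child_contributions(t);    /* separator/root: merge step   */
    /* steps 2-4: local MIC levels restricted to the leading block */
    do {
        A_t = partial_ic_restricted(reorder_leading_block(A_t), nu, tau);
    } while (t->parent != NULL ? !few_bad_pivots(A_t)
                               : !schur_dense_or_void(A_t));
    /* step 5: non-root tasks pass bad pivots and contribution blocks up */
    if (t->parent != NULL)
        send_contributions(t, A_t);
}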

Parallel inverse-based MIC factorization preconditioners

[Figure: side-by-side comparison of the sequential MIC, where bad pivots are postponed to the end of the whole matrix, and the parallel MIC, where bad pivots are constrained to each task's leading block.]

Parallel inverse-based MIC factorization preconditioners
How does MIC behave when applied to a system reordered by MLND?

||e_k^T L^{-1}|| is defined as the maximum of |e_k^T L^{-1} z| over ||z|| = 1.
Estimates t_k ≈ ||e_k^T L^{-1}|| are computed step by step along with the IC decomposition:
- compute a specific vector ẑ such that t_k = |e_k^T L^{-1} ẑ| is close to the maximum of |e_k^T L^{-1} z| over ||z|| = 1;
- ẑ is computed while solving L y = ẑ by column-oriented forward substitution.

At step k of this forward substitution process...
1. t_k = |y_k| is computed by selecting ẑ_k = +1 or ẑ_k = -1.
2. The vector y is updated by y_j := y_j - l_jk * y_k for all j > k such that l_jk ≠ 0.

ẑ_k (and hence y_k) is selected to maximize |y_j| after the update:
- it is likely that y_j becomes large if it is involved in many updates;
- separator nodes are expected to lead to more fill-in than subdomain nodes;
- hence, when the MIC is applied to a system reordered by MLND, it is likely that separator nodes are initially rejected more often.
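
One step of this estimator might look as follows in C: a simplified sketch of the greedy sign choice described above, operating on the strictly lower part of column k of L in CSC form (colptr, rowind, val are assumed arrays, as in the earlier sketches).

#include <math.h>

/* One step of the incremental condition estimator during the
   column-oriented forward solve L y = zhat (unit diagonal): choose
   zhat_k in {+1, -1} greedily so that the entries y_j, j > k, grow as
   much as possible; t_k = |y_k| then estimates ||e_k^T L^{-1}||. */
double condest_step(int k, const int *colptr, const int *rowind,
                    const double *val, double *y) {
    double yp = 1.0 + y[k];                  /* y_k if zhat_k = +1       */
    double ym = -1.0 + y[k];                 /* y_k if zhat_k = -1       */
    double gp = 0.0, gm = 0.0;               /* predicted growth in y_j  */
    for (int idx = colptr[k]; idx < colptr[k+1]; idx++) {
        int j = rowind[idx];
        gp += fabs(y[j] - val[idx] * yp);
        gm += fabs(y[j] - val[idx] * ym);
    }
    y[k] = (gp >= gm) ? yp : ym;             /* greedy choice of zhat_k  */
    for (int idx = colptr[k]; idx < colptr[k+1]; idx++)
        y[rowind[idx]] -= val[idx] * y[k];   /* y_j := y_j - l_jk * y_k  */
    return fabs(y[k]);                       /* estimate t_k             */
}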

Parallel inverse-based MIC factorization preconditioners
MIC applied to an MLND-reordered finite-difference grid discretization of the 2D Laplace PDE

[Figure: grid nodes marked as separator/subdomain and rejected/accepted at the first two MIC levels.]

- 1st MIC level: 87% of the separator nodes are rejected, but only 3% of the subdomain nodes.
- 2nd MIC level: 92% of the separator nodes are rejected, but only 35% of the subdomain nodes.
- This tendency continues until the bulk of the subdomain nodes has been accepted; only after that does the MIC begin to accept most of the separator nodes.
- In the parallel MIC, separator nodes are rejected by construction.

Experimental Results
Outline

1 Motivation and Introduction
2 ILUPACK inverse-based MIC factorization preconditioners
3 Parallel inverse-based MIC factorization preconditioners
4 Experimental Results
5 Conclusions and Future Work

Experimental Results
Experimental Framework

SGI Altix 350 ccNUMA shared-memory multiprocessor:
- 8 nodes, 2 processors per node, Intel Itanium2 at 1.5 GHz;
- 32 GBytes of RAM (4 GBytes per node) shared via an SGI NUMAlink interconnect.

Intel Compiler:
- OpenMP 2.5 compliance;
- -O3 optimization level.

One thread bound per CPU, with no thread migration during execution.
IEEE double precision.

Experimental Results
Experimental Framework

[Figure: combinations of MLND and HAMD orderings defining the ND-HAMD-A and ND-HAMD-B strategies.]

Two different approaches for the MLND preprocessing step and the MIC reordering:
- ND-HAMD-A:
  - subdomains reordered by MLND recursively in the preprocessing step;
  - HAMD as MIC reordering, except for the initial level of the leaf nodes.
- ND-HAMD-B:
  - truncated MLND: stops once the degree of hardware parallelism is matched;
  - HAMD as MIC reordering;
  - subdomains ordered by HAMD at the initial MIC level of the leaf nodes;
  - this can already be done in parallel, exploiting implicit parallelism.

Experimental Results
First Example Benchmark

We consider the Laplace PDE -Δu = f in a 3D unit cube Ω = [0, 1]^3, with Dirichlet boundary conditions u = g on ∂Ω.

For the discretization...
- Ω is replaced by an N_x × N_y × N_z uniform grid;
- u is approximated by centered finite differences (seven-point-star difference stencil).

We consider four SPD benchmark linear systems from this problem:

  N_x = N_y = N_z          n    nnz(A)/n
       100         1,000,000           7
       125         1,953,125           7
       150         3,375,000           7
       200         8,000,000           7
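
As a side illustration, a seven-point-star matrix of this kind can be assembled in COO form as sketched below (the Entry layout is hypothetical; the weights 6 and -1 are the standard stencil coefficients for the negative Laplacian, up to the 1/h^2 scaling), which reproduces nnz(A)/n ≈ 7 from the table.

/* Assemble the 7-point finite-difference Laplacian on an Nx x Ny x Nz
   grid in COO form; boundary values of g would enter the r.h.s. */
typedef struct { int i, j; double v; } Entry;

int assemble_laplace7(int Nx, int Ny, int Nz, Entry *A /* size >= 7*n */) {
    int nnz = 0;
    for (int z = 0; z < Nz; z++)
      for (int y = 0; y < Ny; y++)
        for (int x = 0; x < Nx; x++) {
            int row = (z * Ny + y) * Nx + x;      /* natural ordering      */
            A[nnz++] = (Entry){row, row, 6.0};    /* diagonal entry        */
            if (x > 0)      A[nnz++] = (Entry){row, row - 1,       -1.0};
            if (x < Nx - 1) A[nnz++] = (Entry){row, row + 1,       -1.0};
            if (y > 0)      A[nnz++] = (Entry){row, row - Nx,      -1.0};
            if (y < Ny - 1) A[nnz++] = (Entry){row, row + Nx,      -1.0};
            if (z > 0)      A[nnz++] = (Entry){row, row - Nx * Ny, -1.0};
            if (z < Nz - 1) A[nnz++] = (Entry){row, row + Nx * Ny, -1.0};
        }
    return nnz;   /* close to 7*n, matching nnz(A)/n = 7 in the table */
}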

Experimental Results ILUPACK-based solvers execution time 3 25 ILUPACK based solvers performance for laplace_25 matrix ND HAMD A MIC ND HAMD A PCG ND HAMD A ND HAMD B MIC ND HAMD B PCG ND HAMD B 5 45 4 ILUPACK based solvers performance for laplace_5 matrix ND HAMD A MIC ND HAMD A PCG ND HAMD A ND HAMD B MIC ND HAMD B PCG ND HAMD B Computational time (secs.) 2 5 Computational time (secs.) 35 3 25 2 5 5 5 2 4 8 6 #processors 2 4 8 6 #processors 25 25 25 Grid 5 5 5 Grid n =,953,25 n =3,375, J. I. Aliaga et. al. CSE9@MS97-Preconditioning Techniques 27 / 37

Experimental Results

The nnz changes only slightly, and the number of iterations increases only slightly with p: the parallelization approach preserves the semantics of the MIC.

Experimental Results
Second Example Benchmark

The second example addresses an irregular 3D PDE problem, -div(A grad u) = f in Ω, where A(x, y, z) is chosen with positive random coefficients.

For the discretization...
- linear finite elements are used;
- Ω is replaced by a NETGEN-tool-generated mesh;
- 5 levels of mesh refinement: VC, C, M, F, VF;
- each mesh is refined further, up to 3 times.

Experimental Results

We consider 12 SPD benchmark linear systems from this second example:

  Id.  Code  Initial mesh   # refs.          n       nnz(A)   nnz(A)/n
   1   VC    very coarse       0         1,709       16,669       9.75
   2   C     coarse            0        19,583      210,563      10.75
   3   M     moderate          0        32,429      412,025      12.71
   4   F     fine              0       101,296    1,368,594      13.51
   5   VC2   very coarse       2       271,272    3,686,268      13.59
   6   M1    moderate          1       297,927    4,134,255      13.88
   7   VF    very fine         0       658,690    9,294,172      14.11
   8   F1    fine              1       882,824   12,562,880      14.23
   9   C2    coarse            2       906,882   12,854,824      14.17
  10   VC3   very coarse       3     2,382,864   34,128,924      14.32
  11   M2    moderate          2     2,539,954   36,768,088      14.48
  12   VF1   very fine         1     5,413,520   78,935,174      14.58

Experimental Results ILUPACK-based solvers execution time 35 3 ILUPACK based solvers performance for very_coarse_refined_3 matrix ND HAMD A MIC ND HAMD A PCG ND HAMD A ND HAMD B MIC ND HAMD B PCG ND HAMD B 9 8 ILUPACK based solvers performance for very_coarse_refined_3 matrix ND HAMD A MIC ND HAMD A PCG ND HAMD A ND HAMD B MIC ND HAMD B PCG ND HAMD B 25 7 Computational time (secs.) 2 5 Computational time (secs.) 6 5 4 3 2 5 2 4 8 6 #processors 2 4 8 6 #processors VC3 (3 refinements) VF ( refinement) n=2,382,864 n=5,43,52 J. I. Aliaga et. al. CSE9@MS97-Preconditioning Techniques 3 / 37

Experimental Results

Again, the nnz changes only slightly, and the number of iterations increases only slightly with p: the parallelization approach preserves the semantics of the MIC.

Experimental Results
MIC and PCG stages parallel efficiency

[Figure: speed-up of the parallel MIC (left) and parallel PCG (right) stages with respect to their sequential counterparts, for p = 2, 4, 8, 16, over the matrix identifiers 1-12 of the second example.]

Experimental Results
PARDISO vs. ILUPACK-based solvers

[Figure: computational time (secs.) vs. number of processors (1, 2, 4, 8, 16) of PARDISO and the ILUPACK-based solver for the fine matrix (F mesh, n = 101,296) and the moderate_refined_1 matrix (M1, 1 refinement, n = 297,927).]

Experimental Results
PARDISO vs. ILUPACK-based solvers

[Figure: computational time (secs.) vs. number of processors (1, 2, 4, 8, 16) of PARDISO and the ILUPACK-based solver for the laplace_100 matrix (100 × 100 × 100 grid, n = 1,000,000) and the laplace_125 matrix (125 × 125 × 125 grid, n = 1,953,125).]

Conclusions and Future Work
Conclusions and Future Work

The OpenMP ILUPACK parallelization...
- preserves the semantics of the method in ILUPACK, and
- delivers a high degree of concurrency on shared-memory multiprocessors for the MIC and PCG stages;
- however, the partitioning stage starts dominating the overall computation time as p increases.

Ongoing work...
- Use parallel solutions for partitioning sparse matrices (PT-SCOTCH, ParMETIS).
- Extend the parallelization approach to the indefinite case.
- Move to platforms with a higher number of processors and larger-scale problems, to analyze the challenges to be faced and the limits of the parallelization approach.

Conclusions and Future Work

Questions?

Conclusions and Future Work

Algorithm: computes the parallel MIC decomposition of the reordered system Π^T A Π

    [Π, T] := nested_dissection(G(A))     // obtain task tree T and permutation Π
    Q := { leaves(T) }                    // initialize Q with all leaves of T
    mark all tasks of T as not executed
    Begin parallel region
        pid := get_process_identifier()
        repeat
            while pending tasks in Q do
                tid := dequeue(Q)         // remove ready task from the head of Q
                map[tid] := pid           // process pid in charge of task tid
                execute(tid)              // construct tid's submatrix and compute local MIC
                mark tid as executed
                if all dependencies of parent(tid) have been resolved then
                    enqueue(parent(tid), Q)   // insert new ready task at the tail of Q
                end
            end while
        until all tasks executed
    End parallel region
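
A possible OpenMP realization of this scheduling loop, reusing the Task node sketched earlier. TaskQueue, queue_pop and queue_push are assumed thread-safe and are purely illustrative names, not ILUPACK's API; for brevity the sketch uses OpenMP 3.1 atomics, whereas the talk targets OpenMP 2.5.

#include <omp.h>
#include <stddef.h>

typedef struct TaskQueue TaskQueue;          /* assumed thread-safe queue */
Task *queue_pop(TaskQueue *Q);               /* NULL if no ready task     */
void  queue_push(TaskQueue *Q, Task *t);
void  execute_task(Task *t, double nu, double tau);  /* local MIC of t    */

void parallel_mic(TaskQueue *Q, int n_tasks, double nu, double tau) {
    int executed = 0;                        /* tasks completed so far    */
    #pragma omp parallel shared(executed)
    {
        for (;;) {
            int done;
            #pragma omp atomic read
            done = executed;
            if (done == n_tasks) break;      /* all tasks executed        */
            Task *t = queue_pop(Q);          /* head of the ready queue   */
            if (t == NULL) continue;         /* dependencies still pending */
            execute_task(t, nu, tau);        /* submatrix + local MIC     */
            #pragma omp atomic
            executed++;
            if (t->parent != NULL) {         /* resolve parent dependency */
                int remaining;
                #pragma omp atomic capture
                remaining = --t->parent->pending_children;
                if (remaining == 0)
                    queue_push(Q, t->parent);    /* parent becomes ready  */
            }
        }
    }
}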