Algorithm for Sparse Approximate Inverse Preconditioners in the Conjugate Gradient Method

Ilya B. Labutin
A.A. Trofimuk Institute of Petroleum Geology and Geophysics SB RAS,
3, acad. Koptyug Ave., Novosibirsk 630090, Russia
ilya.labutin@gmail.com

Irina V. Surodina
Institute of Computational Mathematics and Mathematical Geophysics SB RAS,
6, acad. Lavrentiev Ave., Novosibirsk 630090, Russia
sur@ommfao1.sscc.ru

Submitted: February 15, 2013; Revised: November 29, 2013; Accepted: December 6, 2013.

Abstract

We propose a method for preconditioner construction and parallel implementations of the Preconditioned Conjugate Gradient algorithm on GPU platforms. The preconditioning matrix is an approximate inverse derived from an algorithm for the iterative improvement of a solution to linear equations. Because it is applied through a sparse matrix-vector product, our preconditioner is well suited for the massively parallel GPU architecture. We present numerical experiments and comparisons with CPU implementations.

Keywords: sparse approximate inverse, preconditioned conjugate gradient, GPU
AMS subject classifications: 68W10

1 Introduction

Our work is motivated by, but not limited to, finite-difference methods applied to 2D and 3D potential field problems arising in mathematical geophysics from well resistivity logging applications. A numerical solution of the potential field problem leads to large sparse linear systems, which are usually solved by iterative methods rather than direct ones.

The Conjugate Gradient (CG) algorithm is one of the best known iterative methods for solving linear systems with a symmetric positive definite matrix [1]. The performance of CG can be increased dramatically with a suitable preconditioner. The idea of preconditioning in iterative methods is to transform the original system into an equivalent system with the same solution but with better properties. A good preconditioner allows an iterative solution to be computed more quickly for the transformed system than for the original problem [9]. However, the computational overhead of applying the preconditioner must not cancel out the benefit of fewer iterations (see, e.g., [3, 6]).

Modern parallel implementations of the preconditioned conjugate gradient (PCG) method on graphics processing units (GPUs) use sparse approximate inverse (AINV) preconditioners because of their attractive features. First, the columns or rows of the approximate inverse matrix can be generated in parallel. Second, the preconditioner is applied in PCG through matrix-vector multiplications, which are easy to parallelize [6].

In this work, we present an algorithm for building a series of AINV preconditioners with (potentially) arbitrarily high approximation accuracy. The algorithm derives from the Schulz-Hotelling algorithm for the iterative improvement of inverse matrices and solutions to linear equations (see, e.g., [5, 8]). We propose parallel implementations of the preconditioners obtained for the first, third, and seventh order approximations of the inverse matrix.

2 Preconditioned Conjugate Gradient Algorithm

The potential field problem is solved using a finite difference method with the standard five- and seven-point stencils for the 2D and 3D equations, respectively. The approximation leads to a system of linear equations Ax = b, where A is a symmetric positive definite matrix of order N containing at most five (for 2D) or seven (for 3D) nonzero elements per row, N is the number of potential unknowns, and b is a finite difference approximation of the right-hand side of the equation.

A commonly used and well examined iterative algorithm for solving sparse symmetric positive definite linear systems is the conjugate gradient (CG) method. The convergence rate of the CG method depends not only on the matrix size, but also on the condition number of the matrix,

    cond(A) = \lambda_{max} / \lambda_{min},

where \lambda_{max} and \lambda_{min} are the maximum and minimum eigenvalues of A, respectively [7]. A matrix is called well conditioned if it is not too far from the identity matrix, which has cond(A) = 1. A matrix A with a smaller condition number leads to faster convergence of the CG method applied to the system Ax = b. In typical well resistivity logging applications, the matrix A has a large condition number, i.e., cond(A) >> 1.

The effectiveness of the CG method can be increased by applying a preconditioning technique that improves the properties of the matrix of the system. Suppose that M is a symmetric positive definite matrix that approximates A but is easier to invert. We can solve Ax = b indirectly by solving

    M^{-1} A x = M^{-1} b.    (1)

If cond(M^{-1}A) << cond(A), or if the eigenvalues of M^{-1}A are better clustered than those of A, we can iteratively solve equation (1) more quickly than the original problem [9].
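To make the effect of preconditioning on the condition number concrete, consider the small numpy sketch below. It is purely illustrative: the symmetrically scaled 1D Laplacian is a stand-in test matrix, not one of the resistivity-logging matrices used later, and the helper name cond_spd is ours. Since M^{-1}A is similar to the symmetric matrix J^{1/2} A J^{1/2}, we measure the condition number of the latter.

```python
# Illustration: cond(A) = lambda_max / lambda_min for a badly scaled SPD
# matrix versus its Jacobi-preconditioned counterpart.
import numpy as np

def cond_spd(B):
    """Condition number of a symmetric positive definite matrix."""
    w = np.linalg.eigvalsh(B)
    return w[-1] / w[0]

n = 200
L = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1D Laplacian, SPD
s = np.random.default_rng(0).uniform(0.1, 10.0, n)       # nonuniform scaling
A = L * np.outer(s, s)                                   # A = S L S, SPD

# Jacobi preconditioner M = diag(A); J^{1/2} A J^{1/2} is similar to M^{-1}A.
j_half = np.diag(1.0 / np.sqrt(np.diag(A)))
print("cond(A)       = %.3e" % cond_spd(A))
print("cond(M^{-1}A) = %.3e" % cond_spd(j_half @ A @ j_half))
```

The second printed value is dramatically smaller: the diagonal scaling removes the artificial ill-conditioning introduced by the nonuniform grid weights.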

The iteration scheme of the preconditioned CG (PCG) algorithm is

    Initialization: x_0,  r_0 = b - A x_0,  M z_0 = r_0,  p_0 = z_0;

    1. q_i = A p_i,   alpha_i = (z_i^T r_i) / (p_i^T q_i);
    2. x_{i+1} = x_i + alpha_i p_i,   r_{i+1} = r_i - alpha_i q_i;
    3. M z_{i+1} = r_{i+1};                                                  (2)
    4. beta_i = (z_{i+1}^T r_{i+1}) / (z_i^T r_i),   p_{i+1} = z_{i+1} + beta_i p_i.

Each iteration requires one matrix-vector product, two inner products, three vector updates, and the solution of the linear system at Step 3. For the NVIDIA GPU implementation, all the required vector operations are available in the standard CUBLAS library, and the matrix-vector product can be implemented efficiently on a GPU for sparse matrices (see, e.g., [2, 6]).
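For reference, scheme (2) reads as follows in a minimal sequential Python sketch (illustrative only; the GPU implementations discussed in this paper use CUBLAS and CUDA kernels). The callable apply_Minv performs Step 3 and, for the approximate inverse preconditioners constructed in the next section, reduces to one or more sparse matrix-vector products.

```python
# Minimal PCG following scheme (2); apply_Minv(r) returns z with M z = r.
import numpy as np

def pcg(A, b, apply_Minv, tol=1e-8, max_iter=10000):
    x = np.zeros_like(b)
    r = b - A @ x                       # r_0 = b - A x_0
    z = apply_Minv(r)                   # M z_0 = r_0
    p = z.copy()                        # p_0 = z_0
    for i in range(max_iter):
        q = A @ p                       # Step 1: q_i = A p_i
        alpha = (z @ r) / (p @ q)       #         alpha_i = z_i^T r_i / p_i^T q_i
        x = x + alpha * p               # Step 2: x_{i+1} = x_i + alpha_i p_i
        r_new = r - alpha * q           #         r_{i+1} = r_i - alpha_i q_i
        if np.linalg.norm(r_new) <= tol * np.linalg.norm(b):
            return x, i + 1
        z_new = apply_Minv(r_new)       # Step 3: M z_{i+1} = r_{i+1}
        beta = (z_new @ r_new) / (z @ r)        # Step 4
        p = z_new + beta * p            #         p_{i+1} = z_{i+1} + beta_i p_i
        r, z = r_new, z_new
    return x, max_iter
```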

3 Approximate Inverse Preconditioner

To eliminate the linear system solve at Step 3, M^{-1} can be computed directly as an approximation to the inverse of A. In this case, Step 3 is replaced with a matrix-vector product. We derive our algorithm for constructing AINV preconditioners from the procedure developed by Schulz [8] and Hotelling [5] for the iterative improvement of inverse matrices and solutions to linear equations.

Let us assume that D_0 is an initial approximation of A^{-1} satisfying the condition

    ||R_0|| <= k < 1,   R_0 = I - A D_0,    (3)

where the matrix norm is any norm that is easy to compute. Next, consider the formal manipulation

    A^{-1} = (D_0 D_0^{-1}) A^{-1} = D_0 (D_0^{-1} A^{-1}) = D_0 (A D_0)^{-1}
           = D_0 (I - R_0)^{-1} = D_0 (I + R_0 + R_0^2 + R_0^3 + ...).

We can define the sequence of matrices D_n by

    D_{n+1} = D_n (I + R_n),   where R_n = I - A D_n.

Moreover,

    R_n = I - A D_{n-1} (I + R_{n-1}) = I - (I - R_{n-1})(I + R_{n-1})
        = R_{n-1}^2 = R_{n-2}^4 = ... = R_0^{2^n}.

Hence, we have

    D_n = A^{-1} (I - R_0^{2^n}).    (4)

With (3) and (4), we can build an iterative process which converges to A^{-1} [4], with the error bound

    ||D_n - A^{-1}|| <= ||D_0|| k^{2^n} / (1 - k).    (5)

In addition, the algorithm preserves the symmetry of the approximation to A^{-1} if the initial approximation is a symmetric matrix. Assuming A^T = A and D_{n-1}^T = D_{n-1}, and using D_n = D_{n-1}(I + R_{n-1}) = 2 D_{n-1} - D_{n-1} A D_{n-1}, we obtain

    D_n^T = 2 D_{n-1}^T - (D_{n-1} A D_{n-1})^T
          = 2 D_{n-1}^T - D_{n-1}^T A^T D_{n-1}^T
          = 2 D_{n-1} - D_{n-1} A D_{n-1} = D_n.

With a reasonable initial approximation D_0, we can use D_n as the preconditioning matrix M^{-1} at Step 3 of the PCG algorithm (2).
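The quadratic collapse of the residual, R_n = R_0^{2^n}, is easy to observe numerically. The dense numpy sketch below is purely illustrative: the test matrix is constructed with unit diagonal so that the Jacobi initial guess D_0 = J satisfies condition (3) with k = 0.5 exactly in the spectral norm.

```python
# Schulz-Hotelling iteration D_{n+1} = D_n (I + R_n), R_n = I - A D_n.
import numpy as np

def schulz(A, D0, steps):
    """Run the iteration, printing ||R_n||_2 (assumes ||R_0|| < 1)."""
    I = np.eye(A.shape[0])
    D = D0.copy()
    for n in range(steps):
        R = I - A @ D                            # R_n = I - A D_n
        print("n = %d   ||R_n||_2 = %.3e" % (n, np.linalg.norm(R, 2)))
        D = D @ (I + R)                          # D_{n+1} = D_n (I + R_n)
    return D

# SPD test matrix with unit diagonal: D_0 = J = I and ||R_0||_2 = 0.5.
rng = np.random.default_rng(1)
C = rng.standard_normal((50, 50))
C = (C + C.T) / 2.0
np.fill_diagonal(C, 0.0)
C *= 0.5 / np.linalg.norm(C, 2)
A = np.eye(50) + C

D4 = schulz(A, np.diag(1.0 / np.diag(A)), steps=4)
print("||I - A D_4||_2 = %.3e" % np.linalg.norm(np.eye(50) - A @ D4, 2))
```

Each step squares the residual norm (0.5, 0.25, 0.0625, ...), matching (4) and the error bound (5).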

4 Numerical Experiments

For the numerical experiments, we use the well-known Jacobi preconditioner

    D_0 = J = diag{a_{11}^{-1}, a_{22}^{-1}, ..., a_{nn}^{-1}}

as the initial approximation to A^{-1}. In this case, the matrix D_1 has the form

    D_1 = J + J(I - AJ).

According to (5), the matrix D_1 has better properties than D_0 = J. Moreover, D_1 has the same structure as the matrix A, so the same matrix-vector product implementation can be used for the iterations (Step 1) and for the preconditioning (Step 3) in the PCG algorithm (2). Next, the matrices D_2 and D_3 can be expressed in factorized form via D_1:

    D_2 = D_1 + D_1(I - A D_1) = 2 D_1 - D_1 A D_1,
    D_3 = D_2 + D_2(I - A D_2)
        = (2 D_1 - D_1 A D_1)(2I - A(2 D_1 - D_1 A D_1))                    (6)
        = 2(2 D_1 - D_1 A D_1) - (2 D_1 - D_1 A D_1) A (2 D_1 - D_1 A D_1).

This factorized form is well suited for computations, because we know the structure of the matrices A and D_1. The explicit form of the matrices D_i with index 2 and higher is hard to use, because the number of fill-ins grows accordingly: for a 2D finite difference grid, the matrix D_1 contains 5 diagonals, D_2 contains 25 diagonals, and D_3 contains 113 diagonals.

On the other hand, D_1, D_2, and D_3 can be expressed as

    D_1 = D_0(I + R_0),
    D_2 = D_1(I + R_1) = D_1(I + R_0^2) = D_0(I + R_0)(I + R_0^2)
        = D_0(I + R_0 + R_0^2 + R_0^3),
    D_3 = D_2(I + R_2) = D_2(I + R_0^4) = D_0(I + R_0 + R_0^2 + R_0^3)(I + R_0^4)
        = D_0(I + R_0 + R_0^2 + R_0^3 + R_0^4 + R_0^5 + R_0^6 + R_0^7).

We can see that D_1, D_2, and D_3 are the first, third, and seventh order approximations of A^{-1}, respectively.
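In code, the construction of D_1 and the factorized application of D_2 and D_3 look as follows. This is a scipy.sparse sketch for illustration only (the function names are ours); the actual implementation performs the same sparse matrix-vector products with CUDA kernels on the GPU. Note that applying D_2 costs three matrix-vector products and applying D_3 costs seven, which matters in the timings below.

```python
# Factorized AINV preconditioners from Jacobi D_0 = J, per Section 4.
import numpy as np
import scipy.sparse as sp

def make_D1(A):
    """D_1 = J + J(I - A J), J = diag(A)^{-1}; same sparsity pattern as A."""
    J = sp.diags(1.0 / A.diagonal())
    I = sp.identity(A.shape[0], format="csr")
    return (J + J @ (I - A @ J)).tocsr()

def apply_D2(A, D1, r):
    """z = D_2 r = (2 D_1 - D_1 A D_1) r, per (6): three matvecs."""
    t = D1 @ r
    return 2.0 * t - D1 @ (A @ t)

def apply_D3(A, D1, r):
    """z = D_3 r = (2 D_2 - D_2 A D_2) r, per (6): seven matvecs."""
    w = apply_D2(A, D1, r)                   # 3 matvecs
    return 2.0 * w - apply_D2(A, D1, A @ w)  # + 1 + 3 matvecs = 7 in total
```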

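The sketches can be combined as follows, using pcg, make_D1, apply_D2, and apply_D3 from above on a 2D five-point Laplacian assembled from Kronecker products. The matrix is a small stand-in problem; the iteration counts it produces are purely illustrative and unrelated to the grids in Tables 2 and 3 below.

```python
# Example run: iteration counts of PCG with D_0 = J, D_1, D_2, D_3.
import numpy as np
import scipy.sparse as sp

m = 64                                           # m x m grid, n = m^2 unknowns
T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(m, m))
A = (sp.kron(sp.identity(m), T) + sp.kron(T, sp.identity(m))).tocsr()
b = np.ones(A.shape[0])

D1 = make_D1(A)
preconditioners = [
    ("J  ", lambda r: r / A.diagonal()),         # D_0 = J, one diagonal scaling
    ("D_1", lambda r: D1 @ r),                   # one matvec
    ("D_2", lambda r: apply_D2(A, D1, r)),       # three matvecs, see (6)
    ("D_3", lambda r: apply_D3(A, D1, r)),       # seven matvecs, see (6)
]
for name, apply_Minv in preconditioners:
    x, iters = pcg(A, b, apply_Minv, tol=1e-8)
    print(name, "iterations:", iters)
```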
We first investigate the behaviour of our AINV preconditioners on a matrix arising from a 2D potential field problem discretized by a finite difference method on a nonuniform grid. The resulting matrix has n = 17,139. Table 1 shows the condition numbers of the initial matrix and of the preconditioned matrices. We note that the D_3(J) preconditioner is significantly better than the Jacobi preconditioner.

    Matrix      Condition number
    A           4.5377 * 10^7
    AJ          3.0542 * 10^5
    A D_1(J)    7.6542 * 10^4
    A D_2(J)    3.8271 * 10^4
    A D_3(J)    1.9135 * 10^4

    Table 1: Condition number of A M^{-1}.

Next, we compare the generated preconditioners with a sequential Jacobi-preconditioned CG implementation on the same finite difference grids produced for the 2D and 3D potential field problems. For the numerical experiments, we used the following hardware: CPU: Intel Xeon X5670, 2.93 GHz (using Intel Fortran 11.0); GPU: NVIDIA Tesla M2090, 512 cores, 4 GB RAM (using CUDA).

    Method      n = 17,139            n = 37,846            n = 76,136
                iter.  time,  speed-  iter.  time,  speed-  iter.  time,  speed-
                count  sec    up      count  sec    up      count  sec    up
    CG(J)       2835   1.21   -       4523   3.75   -       7204   14.34  -
    D_0 = J     2684   0.26   5       4073   0.41   9       5731   0.73   20
    D_1(J)      1325   0.16   8       2038   0.27   14      2866   0.52   26
    D_2(D_1)    942    0.12   10      1451   0.22   17      2055   0.46   31
    D_3(D_1)    668    0.11   11      1038   0.19   20      1458   0.40   36

    Table 2: Performance of the PCG algorithm and speed-up; 2D potential field problem.

    Method      n = 1,458,600                    n = 4,111,848
                iter. count  time, sec  speed-up  iter. count  time, sec  speed-up
    CG(J)       4769         190        -         7445         673        -
    D_0 = J     4981         6.98       27        7754         28.43      24
    D_1(J)      2495         4.35       44        3666         17.50      38
    D_2(D_1)    1885         5.56       34        2692         21.85      30

    Table 3: Performance of the PCG algorithm and speed-up; 3D potential field problem.

In Table 2, we compare GPU implementations of the D_0, D_1, D_2, and D_3 preconditioners with a CPU implementation of CG with the Jacobi preconditioner for the 2D potential field problem. We use three finite difference grids of sizes typical for well resistivity logging applications. As expected, the lower the condition number of the preconditioned matrix, the smaller the number of iterations, and the speed-up increases accordingly. For the largest problem, the GPU implementation of D_3 is 36 times faster than the CG algorithm on the CPU. However, the speed-up of D_2 is roughly the same as that of D_3, despite the significant difference in iteration counts. This is expected, because applying the D_2 preconditioner requires only three matrix-vector products (see (6)), while D_3 requires seven.

In Table 3, we compare GPU implementations of the D_0, D_1, and D_2 preconditioners with a CPU implementation of CG with the Jacobi preconditioner for the 3D potential field problem. For the largest problem, the GPU implementation of D_1 is 38 times faster than the CG algorithm on the CPU. For large matrices, the number of matrix-vector products in the preconditioner application becomes important: D_1 is faster than D_2 for the 3D grid because it requires only one matrix-vector product, whereas for the 2D grid D_2 is faster than D_1 (see Table 2).

In summary, the higher the index of the preconditioner obtained from iteration (4), the larger the number of fill-ins in the resulting matrix. The choice of preconditioner is therefore a trade-off between the quality of the resulting matrix in the sense of (5) and the cost of the preconditioning procedure, which depends on the sparsity pattern and on the number of matrix-vector products in the factorized form of the preconditioner.

5 Conclusion

We have presented an algorithm for generating preconditioners, together with parallel GPU implementations of the preconditioned conjugate gradient algorithm for linear systems with symmetric positive definite matrices. Our preconditioners, derived from the Schulz-Hotelling algorithm for the iterative improvement of inverse matrices and solutions to linear equations, are approximate inverses and therefore can be applied in the PCG algorithm through sparse matrix-vector products. In the future, we plan to investigate other preconditioners generated by the algorithm presented here, but with different initial approximations of A^{-1}.

References

[1] R. Barrett, M. W. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. van der Vorst, Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, Society for Industrial and Applied Mathematics, Philadelphia, 1993. URL: http://www.netlib.org/templates/templates.pdf

[2] N. Bell and M. Garland, Efficient Sparse Matrix-Vector Multiplication on CUDA, Technical Report, NVIDIA, 2008.

[3] J. Dongarra, I. Duff, D. Sorensen, and H. van der Vorst, Numerical Linear Algebra for High-Performance Computers, Society for Industrial and Applied Mathematics, Philadelphia, 1998.

[4] D. K. Faddeev and V. N. Faddeeva, Computational Methods of Linear Algebra, Fizmatgiz, Moscow, 1963 (in Russian).

[5] H. Hotelling, Analysis of a complex of statistical variables into principal components, Journal of Educational Psychology, 1933, pp. 417-441.

[6] R. Li and Y. Saad, GPU-Accelerated Preconditioned Iterative Linear Solvers, Technical Report, Minnesota Supercomputer Institute, University of Minnesota, Minneapolis, 2010.

[7] Y. Saad, Iterative Methods for Sparse Linear Systems, 2nd edition, Society for Industrial and Applied Mathematics, Philadelphia, 2003.

[8] G. Schulz, Iterative Berechnung der reziproken Matrix, Zeitschrift für Angewandte Mathematik und Mechanik (ZAMM), 1933, pp. 57-59.

[9] J. R. Shewchuk, An Introduction to the Conjugate Gradient Method Without the Agonizing Pain, School of Computer Science, Carnegie Mellon University, Pittsburgh, 1994.