CME342 Parallel Methods in Numerical Analysis. Matrix Computation: Iterative Methods II. Sparse Matrix-vector Multiplication.

Outline: CG and its parallelization; sparse matrix-vector multiplication.

Basic iterative methods for Ax = b
r = b − Ax (residual).
Split the matrix: A = M − N (with A = D − L − U, where D is the diagonal, −L the strictly lower and −U the strictly upper part).
Solve M x^{k+1} = N x^k + b, i.e.
  x^{k+1} = x^k + M^{-1}(b − A x^k) = x^k + M^{-1} r^k.
Jacobi: M = D, N = L + U.
GS: M = D + L, N = U.
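
As a concrete illustration of the splitting iteration above, here is a minimal NumPy sketch of the Jacobi update x^{k+1} = x^k + D^{-1} r^k (the test matrix, right-hand side, tolerance, and iteration limit are made-up example values, not from the course code):

import numpy as np

def jacobi(A, b, x0, tol=1e-8, max_iter=500):
    # Splitting A = M - N with M = D (the diagonal of A):
    # x_{k+1} = x_k + M^{-1} r_k, where r_k = b - A x_k.
    D = np.diag(A)
    x = x0.copy()
    for k in range(max_iter):
        r = b - A @ x
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        x = x + r / D
    return x, k

# Example: a small diagonally dominant system (made-up data).
A = np.array([[4.0, -1.0, 0.0],
              [-1.0, 4.0, -1.0],
              [0.0, -1.0, 4.0]])
b = np.array([1.0, 2.0, 3.0])
x, iters = jacobi(A, b, np.zeros(3))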

Summary: Jacobi vs. GS
Jacobi is parallel. GS is sequential because of its data dependency, which can be fixed by a coloring technique.

Comparison:
              Jacobi   GS
Parallelism   Good     Poor
Convergence   Slow     Fast

Which one is faster on parallel machines?

Jacobi vs. Red-Black (RB) Gauss-Seidel
RB GS converges twice as fast as Jacobi, but requires twice as many parallel steps; the run times are about the same in practice.
Parallel efficiency alone is not sufficient to determine overall performance. We also need fast-converging algorithms.

Conjugate Gradient Method
For A SPD, i.e. symmetric: A = A^T, and positive definite: x^T A x > 0 for all x ≠ 0.
Consider the quadratic function
  Φ(x) = (1/2) x^T A x − x^T b.
Its unique minimizer x satisfies Ax = b. Hence, solving Ax = b is equivalent to minimizing Φ(x).
Given x^{k-1} and a search direction d^k, let x^k = x^{k-1} + α d^k, where α is the step length.

CG (cont.)
(Figure: the iterate x^k = x^{k-1} + α d^k along the search direction d^k, relative to the solution x.)
Determine α by minimizing Φ along d^k: min over α of Φ(x^{k-1} + α d^k).
By differentiation, the optimal step α_k is given by
  α_k = (r^{k-1})^T d^k / ((d^k)^T A d^k),
where r^{k-1} = b − A x^{k-1} is the residual vector.
The search direction d^k is chosen to be A-orthogonal (conjugate) to all previous directions, i.e. (d^k)^T A d^j = 0 for j = 1, ..., k−1.

Using this fact, it can be shown that
  α_k = (r^{k-1})^T r^{k-1} / ((d^k)^T A d^k).
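
For completeness, here is the differentiation step behind the first formula for α_k, in the notation above; the final simplification uses d^k = r^{k-1} + β_{k-1} d^{k-1} together with (d^{k-1})^T r^{k-1} = 0:

\[
0 = \frac{d}{d\alpha}\,\Phi\!\left(x^{k-1} + \alpha d^k\right)
  = (d^k)^T A\!\left(x^{k-1} + \alpha d^k\right) - (d^k)^T b
  = \alpha\,(d^k)^T A d^k - (d^k)^T r^{k-1}
\;\Longrightarrow\;
\alpha_k = \frac{(d^k)^T r^{k-1}}{(d^k)^T A d^k}
         = \frac{(r^{k-1})^T r^{k-1}}{(d^k)^T A d^k}.
\]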

CG Algorithm
k = 0; r^0 = b − A x^0; ρ_0 = ||r^0||_2^2;
while ( √ρ_k > ε ||r^0||_2 ) do
    k = k + 1;
    ρ_{k-1} = (r^{k-1})^T r^{k-1};
    if (k = 1)
        d^1 = r^0;
    else
        β_{k-1} = ρ_{k-1} / ρ_{k-2};
        d^k = r^{k-1} + β_{k-1} d^{k-1};
    endif
    e^k = A d^k;
    α_k = ρ_{k-1} / ((d^k)^T e^k);
    x^k = x^{k-1} + α_k d^k;
    r^k = r^{k-1} − α_k e^k;
    ρ_k = (r^k)^T r^k;
end
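
A direct NumPy transcription of the pseudocode above, as a minimal sketch (the tridiagonal test system at the end is made-up example data; in practice A would only be applied through a sparse matrix-vector product):

import numpy as np

def cg(A_mul, b, x0, eps=1e-8, max_iter=1000):
    # Conjugate gradient following the pseudocode above.
    # A_mul(v) must return A @ v; A itself is never formed here.
    x = x0.copy()
    r = b - A_mul(x)
    rho = r @ r                    # rho_0 = ||r^0||_2^2
    stop = eps * np.sqrt(rho)      # eps * ||r^0||_2
    rho_prev = rho
    d = np.zeros_like(x)
    for k in range(1, max_iter + 1):
        if np.sqrt(rho) <= stop:
            break
        if k == 1:
            d = r.copy()
        else:
            beta = rho / rho_prev  # beta_{k-1} = rho_{k-1} / rho_{k-2}
            d = r + beta * d
        e = A_mul(d)               # e^k = A d^k (the only use of A)
        alpha = rho / (d @ e)
        x = x + alpha * d
        r = r - alpha * e
        rho_prev = rho
        rho = r @ r
    return x

# Example: SPD tridiagonal system (made-up data).
n = 100
A = (np.diag(2.0 * np.ones(n))
     + np.diag(-1.0 * np.ones(n - 1), 1)
     + np.diag(-1.0 * np.ones(n - 1), -1))
b = np.ones(n)
x = cg(lambda v: A @ v, b, np.zeros(n))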

Properties of CG
No need to form A: only the product Av is needed.
Minimal storage: short recurrences for the updates; need only store x^k, r^k, d^k, A d^k.
Minimal error: define K_k(A; r^0), the Krylov subspace of dimension k,
  K_k(A; r^0) ≡ span{r^0, A r^0, ..., A^{k-1} r^0}.
It can be proved that
  K_k(A; r^0) = span{r^0, r^1, ..., r^{k-1}} = span{d^1, d^2, ..., d^k}.
The iterate x^k ∈ x^0 + K_k(A; r^0) minimizes the error e^k ≡ x^k − x in the A-norm:
  ||x^k − x||_A ≤ ||x̃ − x||_A for all x̃ ∈ x^0 + K_k(A; r^0),
where ||w||_A = sqrt(w^T A w).

Properties of CG (cont.)
In exact arithmetic, CG reaches the exact solution in at most n steps (finite termination). In practice, an accurate approximation is obtained long before n steps, so CG is usually used as an iterative method.
Convergence of CG. Upper bound:
  ||x − x^k||_A ≤ 2 ( (√κ − 1)/(√κ + 1) )^k ||x − x^0||_A,
where κ is the condition number of A:
  κ ≡ ||A|| ||A^{-1}|| = λ_max(A)/λ_min(A).
In practice, the rate of convergence is usually much faster than the upper bound, especially when the eigenvalues of A are located in clusters.
For a 2D mesh with N unknowns (Poisson equation), CG converges in O(√N) steps, the same as optimal SOR.

Parallel CG (p = # of processors)
DAXPY: scalar·vector + vector (BLAS 1). 3 daxpy operations per iteration (updates of d^k, x^k, r^k). O(n/p) flops; no communication.
Inner products (BLAS 1). 2 inner products per iteration: (r^{k-1})^T r^{k-1} and (e^k)^T d^k. O(n/p) flops; communication for the reduction (collecting partial sums from all processors), sketched below.
Matrix-vector product (essentially BLAS 1). Major cost of CG; efficiency depends on the sparse structure of A.
Minor note: {x^k} are not needed in the CG recurrences themselves. We may delay the update of x^k, ..., x^{k+j} by storing d^k, ..., d^{k+j}.
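
To illustrate where the reduction occurs, here is a minimal mpi4py sketch of a distributed inner product and a daxpy, assuming a row-wise block distribution of the vectors (the names and sizes are illustrative, not from the course code):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

def dot(u_local, v_local):
    # Local partial sum, then one global reduction (the synchronization point).
    return comm.allreduce(float(u_local @ v_local), op=MPI.SUM)

def daxpy(alpha, x_local, y_local):
    # Purely local: no communication needed.
    return alpha * x_local + y_local

# Each rank owns n/p entries of every vector (made-up local size).
n_local = 4
rng = np.random.default_rng(comm.Get_rank())
r_local = rng.standard_normal(n_local)
rho = dot(r_local, r_local)   # one allreduce across all ranks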

Overlapping Communication
Want to overlap communication time with useful computation.
In standard CG there are 2 synchronization points (the inner products). It is possible to reduce this to 1 synchronization point by rearranging terms; a version is given on the next slide.
Cost: 1 additional daxpy operation. The 2 inner products can be computed in parallel ⇒ 1 synchronization point.
The main disadvantage is possible numerical instability.

Variant CG Algorithm
r^0 = b − A x^0; q^{-1} = p^{-1} = 0; β_{-1} = 0;
s^0 = A r^0; ρ_0 = (r^0)^T r^0; µ_0 = (s^0)^T r^0; α_0 = ρ_0/µ_0;
for k = 0, 1, ...
    p^k = r^k + β_{k-1} p^{k-1};
    q^k = s^k + β_{k-1} q^{k-1};
    x^{k+1} = x^k + α_k p^k;
    r^{k+1} = r^k − α_k q^k;
    check convergence; continue if necessary;
    s^{k+1} = A r^{k+1};
    ρ_{k+1} = (r^{k+1})^T r^{k+1};
    µ_{k+1} = (s^{k+1})^T r^{k+1};
    β_k = ρ_{k+1}/ρ_k;
    α_{k+1} = ρ_{k+1}/(µ_{k+1} − ρ_{k+1} β_k/α_k);
end
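
A NumPy sketch of this variant, kept as close to the recurrences above as possible (the convergence test and any test data are illustrative choices; in a distributed implementation the two dot products per iteration would share a single reduction):

import numpy as np

def cg_single_sync(A_mul, b, x0, eps=1e-8, max_iter=1000):
    # Variant CG with one synchronization point per iteration:
    # both inner products use r^{k+1}, so they can share one reduction.
    x = x0.copy()
    r = b - A_mul(x)
    p = np.zeros_like(x)
    q = np.zeros_like(x)
    beta = 0.0
    s = A_mul(r)
    rho = r @ r
    mu = s @ r
    alpha = rho / mu
    for k in range(max_iter):
        p = r + beta * p
        q = s + beta * q              # q^k = A p^k, maintained by recurrence
        x = x + alpha * p
        r = r - alpha * q
        if np.linalg.norm(r) <= eps * np.linalg.norm(b):   # illustrative test
            break
        s = A_mul(r)
        rho_new = r @ r               # these two dot products can be
        mu = s @ r                    # combined into one reduction
        beta = rho_new / rho
        alpha = rho_new / (mu - rho_new * beta / alpha)
        rho = rho_new
    return x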

CG for Nonsymmetric Matrices
Ax = b, where A is not SPD:
  Normal equations: solve A^T A x = A^T b (minimal residual).
  Or set x = A^T y, solve A A^T y = b, and recover x from y (minimal error).
When A is square, the condition number of A^T A is κ(A)^2 (in the 2-norm), much bigger than the condition number of A: slower convergence.

Krylov Subspace Methods
A can be any nonsymmetric matrix. Let V_k = [v^1, ..., v^k] be a basis for K_k(A; r^0). Look for an approximate solution x^k ∈ K_k(A; r^0), i.e. x^k = V_k y for some y ∈ R^k.
Four projection-type approaches to pick x^k:
  Ritz-Galerkin (CG, Lanczos, FOM, GENCG): b − A x^k ⊥ K_k(A; r^0).
  Minimum Residual (GMRES, MINRES): min over x^k ∈ K_k of ||b − A x^k||_2.
  Petrov-Galerkin (BiCG, BiCGSTAB, QMR): b − A x^k ⊥ S_k, e.g. S_k = K_k(A^T; s^0).
  Minimum Error (SYMMLQ, GMERR): min over x^k ∈ K_k of ||x − x^k||_2.

Sparse Matrix Computations
Computations are carried out only on the nonzero entries; thus, only nonzero entries are stored.
Storage formats:
  Coordinate format: A = {I, J, VAL}, where I, J, VAL are nnz × 1 arrays; nnz = # of nonzeros.
  Compressed sparse row (CSR) format: A = {I, J, VAL}.
    VAL = nnz × 1 array of nonzero entries.
    J = nnz × 1 array of column indices.
    I = n × 1 array; its ith entry points to the first entry of the ith row in VAL and J.
  Ellpack-Itpack format: m = max # of nonzeros in any row. A = {VAL, J}.
    VAL = n × m array of nonzero entries.
    J = n × m array of column indices.

Example: storage formats

A = [ 1 0 0 2
      3 4 0 0
      0 5 6 0
      7 0 8 9 ]

Coordinate format:
  I   = [1 1 2 2 3 3 4 4 4]
  J   = [1 4 1 2 2 3 1 3 4]
  VAL = [1 2 3 4 5 6 7 8 9]

CSR format:
  VAL = [1 2 3 4 5 6 7 8 9]
  J   = [1 4 1 2 2 3 1 3 4]
  I   = [1 3 5 7]

Ellpack-Itpack format (m = 3):
  VAL = [ 1 2 0        J = [ 1 4 1
          3 4 0              1 2 1
          5 6 0              2 3 1
          7 8 9 ]            1 3 4 ]
(Padded entries in VAL are zero; their column indices in J are set to 1.)
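
A small Python sketch of CSR storage and the corresponding matrix-vector product, using the 4 × 4 example above (note the sketch uses 0-based indexing and the common convention of a trailing row-pointer entry equal to nnz, which differs slightly from the slide's 1-based arrays):

import numpy as np

# CSR arrays for the 4x4 example (0-based, with a trailing pointer = nnz).
val = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
col = np.array([0, 3, 0, 1, 1, 2, 0, 2, 3])
ptr = np.array([0, 2, 4, 6, 9])   # row i occupies val[ptr[i]:ptr[i+1]]

def csr_matvec(val, col, ptr, x):
    # y_i = sum of a_ij * x_j over the nonzeros stored in row i.
    n = len(ptr) - 1
    y = np.zeros(n)
    for i in range(n):
        for k in range(ptr[i], ptr[i + 1]):
            y[i] += val[k] * x[col[k]]
    return y

x = np.ones(4)
y = csr_matvec(val, col, ptr, x)   # equals A @ x = [3, 7, 11, 24]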

Sparse Matrix-Vector Multiplication
For iterative methods such as Jacobi, Gauss-Seidel, CG, etc., the major computational cost per iteration is Av. Sparse matrices usually have a special structure, which should be exploited for maximum efficiency.
Consider 3 cases:
  Matrices whose nonzero entries lie on a few diagonals.
  Unstructured matrices: the locations of the nonzero entries can be anywhere.
  Banded matrices: the nonzero elements are confined within a band.
Key idea: data mapping.

Pentadiagonal Matrices
Decompose the matrix into 3 parts: A = I + II + III.
(Figure: part I is the main diagonal, part II the two adjacent diagonals at offsets ±1, part III the two outer diagonals at offsets ±√n.)
Partition A by rows: P_i contains rows (i−1)(n/p)+1 to i(n/p), and vector elements (i−1)(n/p)+1 to i(n/p).
Part I requires no communication.
Part II requires communication between neighboring processors, i.e. P_i communicates with P_{i−1} for vector element (i−1)(n/p) and with P_{i+1} for element i(n/p)+1.
Timing: Tcomm = 2(t_s + t_w),
where t_s = startup time and t_w = time per word. Assume a hypercube architecture.

Pentadiagonal Matrices (cont.)
(Figure: processor P_i communicates with P_{i−p/√n} and P_{i+p/√n}.)
Part III, p ≤ √n: communication between neighboring processors, exchanging √n elements.
Timing: Tcomm = 2(t_s + t_w √n) (including Tcomm for part II).
Part III, p > √n: need communication with P_{i±p/√n} for n/p elements.
Timing: Tcomm = 2(t_s + t_w n/p + t_h log p),
where t_h = time per hop between processors.
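
As a serial point of reference for the diagonal-structured case, here is a minimal sketch of a matrix-vector product in an offset-based diagonal storage format, assuming a pentadiagonal matrix with offsets 0, ±1, ±√n (illustrative only, not the course's distributed code):

import numpy as np

def dia_matvec(diags, offsets, x):
    # y = A x with A stored by diagonals: diags[d][i] holds A[i, i + offsets[d]];
    # entries that would fall outside the matrix are simply skipped.
    n = len(x)
    y = np.zeros(n)
    for vals, off in zip(diags, offsets):
        for i in range(max(0, -off), min(n, n - off)):
            y[i] += vals[i] * x[i + off]
    return y

# Pentadiagonal example from an m x m grid (made-up coefficients).
m = 4                       # grid dimension, so offsets are 0, +-1, +-m
n = m * m
offsets = [0, -1, 1, -m, m]
diags = [4.0 * np.ones(n)] + [-np.ones(n) for _ in range(4)]
y = dia_matvec(diags, offsets, np.ones(n))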

Checkerboard (Domain) Partitioning
(Figure: 3 × 3 array of processors P_1, ..., P_9; each subdomain is √(n/p) × √(n/p) grid points.)
Each processor stores √(n/p) clusters of matrix rows, i.e. n/p elements in total.
Communication is needed only for boundary data, of length 4√(n/p).
Timing: Tcomm = 4(t_s + t_w √(n/p)).

Unstructured Sparse Matrices
Partition by rows. In general each processor requires the entire vector, hence an all-to-all broadcast. Then perform the local matrix-vector multiplication.
Parallel run time (hypercube):
  T_p = t_c m n/p + t_s log p + t_w n = O(n),
where t_c = computation time per data item and m = max number of nonzeros per row.
(Sequential run time = O(n).) Hence, no gain in parallel.

Checkerboard Matrix Partitioning
Hypercube: Tcomm = t_s log p + (3/2) t_w (n/√p) log p.
Average computation = m n / p, with m = average number of nonzeros per row.
Parallel run time, speedup, and efficiency:
  T_p = t_c m n/p + t_s log p + (3/2) t_w (n/√p) log p
  S = m t_c p n / ( m t_c n + t_s p log p + (3/2) t_w n √p log p )
  E = T_1/(p T_p) = m t_c / ( m t_c + (t_s p log p)/n + (3/2) t_w √p log p ).
E does not approach 1 as n → ∞ (the last term in the denominator is independent of n) ⇒ unscalable.
However, T_p = O(log p) + O((n/√p) log p), which is better than the sequential run time for large p.

Planar Graph
Suppose A is a sparse matrix with symmetric structure, i.e. a_{i,j} ≠ 0 iff a_{j,i} ≠ 0, and let G(A) be the graph of A.
A scalable parallel implementation of matrix-vector multiplication exists if G(A) is planar: communication then occurs only at the boundaries of each processor's partition. Hence, by reducing the number of partitions, or equivalently increasing their size, it is possible to increase the computation-to-communication ratio of the processors.
Partitioning a graph to minimize interprocessor communication is NP-hard.

Banded Unstructured Sparse Matrices
Suppose bandwidth = w, Ellpack-Itpack format, the matrix partitioned by rows, and average # of nonzeros per row = m. Assume n/p ≤ w.
(Figure: the band of width w; processor P_i communicates with P_{i−wp/(2n)}, ..., P_{i+wp/(2n)}.)
Row i needs vector elements with indices from i−(w−1)/2 to i+(w−1)/2, so P_i communicates with about (w−1)p/n processors, i.e. P_{i−wp/(2n)}, ..., P_{i+wp/(2n)}.