Level-1 BLAS: SAXPY

BLAS notation:
  S  single precision (D for double, C for complex)
  A  α, a scalar
  X  vector
  P  plus operation
  Y  vector

SAXPY: y = αx + y

Vectorization of SAXPY (αx + y) by pipelining.
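A minimal runnable sketch of the SAXPY semantics in NumPy (the function name saxpy is ours; the actual BLAS routine is the Fortran SAXPY):

```python
import numpy as np

def saxpy(alpha, x, y):
    """y <- alpha*x + y: the Level-1 BLAS AXPY operation."""
    y += alpha * x          # elementwise; NumPy vectorizes the loop internally
    return y

x = np.array([1.0, 2.0, 3.0], dtype=np.float32)  # single precision: the "S"
y = np.array([4.0, 5.0, 6.0], dtype=np.float32)
print(saxpy(2.0, x, y))                          # [ 6.  9. 12.]
```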
SAXPY Parallelization by Partitioning

{1, 2, …, n} = [1, n] = I_1 ∪ I_2 ∪ … ∪ I_R

x = (x_1, …, x_n)^T is split into short vectors X_1, …, X_R, and y = (y_1, …, y_n)^T into Y_1, …, Y_R, each of length n/R.

Each processor P_j gets the partial vectors X_j and Y_j and computes

Y_j = αX_j + Y_j,  j = 1, 2, …, R

Result: SAXPY is very well vectorizable and parallelizable.
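The partitioning in a sketch (written sequentially here; the per-block updates are independent, so each could run on its own processor P_j):

```python
import numpy as np

def partitioned_saxpy(alpha, x, y, R):
    """Y_j = alpha*X_j + Y_j on the disjoint index sets I_1, ..., I_R."""
    blocks = np.array_split(np.arange(len(x)), R)  # index sets I_j
    for I in blocks:                               # independent: one block per P_j
        y[I] += alpha * x[I]
    return y
```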
Further Level-1 BLAS Routines

SCOPY: y = x, i.e. y ← x (compare SAXPY)

DOT product: x^T y = Σ_{i=1}^n x_i y_i

Norm: ‖x‖_2 = sqrt(Σ_{j=1}^n x_j²) = sqrt(x^T x) (compare DOT product)
Level-2 BLAS

Matrix-vector operations with O(n²) operations (sequentially).

BLAS notation:
  S   single precision
  GE  general matrix
  MV  matrix-vector

This defines SGEMV, the matrix-vector product y = αAx + βy.

Other Level-2 BLAS: solving a triangular system Lx = b with a triangular matrix L.
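The GEMV semantics in a NumPy sketch (the function name is ours; a real code would call the BLAS routine directly, e.g. through scipy.linalg.blas):

```python
import numpy as np

def gemv(alpha, A, x, beta, y):
    """y <- alpha*A@x + beta*y: the Level-2 BLAS GEMV operation."""
    return alpha * (A @ x) + beta * y

A = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)
x = np.array([1.0, 1.0], dtype=np.float32)
y = np.array([1.0, 1.0], dtype=np.float32)
print(gemv(2.0, A, x, 0.5, y))   # [ 6.5 14.5]
```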
Level-3 BLAS

Matrix-matrix operations with O(n³) operations (sequentially).

BLAS notation:
  S   single precision
  GE  general matrix
  MM  matrix-matrix

This defines SGEMM, the matrix-matrix product C = αAB + βC.
Granularity for BLAS

BLAS level | operation  | formula   | memory  | granularity
BLAS-1     | AXPY: 2n   | αx + y    | 2n + 1  | < 1
BLAS-2     | GEMV: 2n²  | αAx + βy  | n² + 2n | ≈ 2
BLAS-3     | GEMM: 2n³  | αAB + βC  | 4n²     | n/2

BLAS-3 has the best operations-to-memory ratio!
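The table's ratios, evaluated for a concrete n (a plain computation of operations per memory access):

```python
def granularity(ops, mem):
    """Operations per memory access; higher means better data reuse."""
    return ops / mem

n = 1000
print(granularity(2 * n,    2 * n + 1))    # AXPY: just under 1
print(granularity(2 * n**2, n**2 + 2 * n)) # GEMV: about 2
print(granularity(2 * n**3, 4 * n**2))     # GEMM: n/2 = 500
```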
2.2 Analysis of the Matrix-Vector Product

A = (a_ij) ∈ R^{n×m}, b ∈ R^m, c ∈ R^n, c = Ab:

c_i = Σ_{j=1}^m a_ij b_j = a_i1 b_1 + … + a_im b_m,  i = 1, …, n

2.2.1 Vectorization

Two views of c = Ab:

c = Ab = Σ_{j=1}^m b_j (a_1j, …, a_nj)^T

= n DOT products of length m (row i of A with b)
= m SAXPYs of length n, each updating c with a column of A scaled by b_j (GAXPY)
Pseudocode: ij-form

c = 0;
for i = 1, …, n
  for j = 1, …, m
    c_i = c_i + a_ij b_j    ← DOT product
  end
end

The inner loop computes c_i = A_i b, the DOT product of the i-th row of A with the vector b.
Pseudocode: ji-form

c = 0;
for j = 1, …, m
  for i = 1, …, n
    c_i = c_i + a_ij b_j    ← SAXPY
  end
end

The inner loop is a SAXPY updating the vector c with the j-th column of A.

GAXPY: a sequence of SAXPYs related to the same vector.
Advantage: the vector c that is being updated can be kept in fast memory; no additional data transfer.
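Both loop orders as runnable code, with explicit loops so the access patterns stay visible (function names are ours; in practice one would of course just call A @ b):

```python
import numpy as np

def matvec_ij(A, b):
    """ij-form: the inner loop is a DOT product of row A_i with b."""
    n, m = A.shape
    c = np.zeros(n)
    for i in range(n):
        for j in range(m):
            c[i] += A[i, j] * b[j]
    return c

def matvec_ji(A, b):
    """ji-form: the inner loop is a SAXPY updating c with column j of A,
    a GAXPY since every SAXPY targets the same vector c."""
    n, m = A.shape
    c = np.zeros(n)
    for j in range(m):
        for i in range(n):
            c[i] += A[i, j] * b[j]
    return c
```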
GAXPY (repetition)

SAXPY: y := y + αx

GAXPY:
y = y_0
for i = 1 : n
  y := y + α_i x_i
end

A series of SAXPYs regarding the same vector y; length(GAXPY) = length(y).
Advantage: less data transfer!
2.2.2 Parallelization by Building Blocks

Reduce the matrix-vector product to smaller matrix-vector products:

{1, 2, …, n} = [1, n] = I_1 ∪ I_2 ∪ … ∪ I_R, disjoint: I_j ∩ I_k = ∅ for j ≠ k
{1, 2, …, m} = [1, m] = J_1 ∪ J_2 ∪ … ∪ J_S, disjoint: J_j ∩ J_k = ∅ for j ≠ k

Use a 2-dimensional array of processors P_rs.
P_rs gets the matrix block A_rs := A(I_r, J_s), b_s := b(J_s), c_r := c(I_r):

c_r = Σ_{s=1}^S A_rs b_s =: Σ_{s=1}^S c_r^(s)
Pseudocode

for r = 1, …, R
  for s = 1, …, S
    c_r^(s) = A_rs b_s;
  end
end

for r = 1, …, R
  c_r = 0
  for s = 1, …, S
    c_r = c_r + c_r^(s);
  end
end

Phase 1: small, independent matrix-vector products; no communication necessary during the computations!
Phase 2: blockwise collection and addition of the partial vectors; rowwise communication (fan-in).
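A sketch of the two phases (sequential stand-in for the processor array; each A_rs @ b_s is independent and would run on processor P_rs):

```python
import numpy as np

def blocked_matvec(A, b, R, S):
    """Block matrix-vector product: P_rs computes c_r^(s) = A_rs @ b_s,
    then the partial results are summed per row block (fan-in)."""
    n, m = A.shape
    I = np.array_split(np.arange(n), R)   # row index sets I_1..I_R
    J = np.array_split(np.arange(m), S)   # column index sets J_1..J_S
    c = np.zeros(n)
    for r in range(R):
        # phase 1: independent small products (no communication)
        partial = [A[np.ix_(I[r], J[s])] @ b[J[s]] for s in range(S)]
        # phase 2: rowwise collection and addition (fan-in)
        c[I[r]] = np.sum(partial, axis=0)
    return c
```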
Blocking: Special Cases

S = 1: c = (A_1; …; A_R) b, i.e. c_r = A_r b. The computation of each A_r b is vectorizable by GAXPYs. No communication necessary between the processors P_1, …, P_R.

R = 1: c = (A_1 A_2 …) (b_1; b_2; …) = A_1 b_1 + A_2 b_2 + …; the products A_s b_s are independent. Afterwards, collection of the partial results from the processors P_1, …, P_S by fan-in. The final sum in one processor is vectorizable by GAXPYs.
Rules

1. Inner loops of a program should be simple and vectorizable.
2. Outer loops of a program should be substantial, independent, and parallelizable.
3. Reuse data (cache, minimal data transfer, blocking).
2.2.3 c = Ab for a Banded Matrix

Bandwidth β (symmetric): 2β + 1 diagonals = main diagonal + β subdiagonals + β superdiagonals.
β = 1: tridiagonal matrix.
Notation: Banded Matrices A and Ã

A ∈ R^{n×n} banded: only the entries a_{i,i+s} with −β ≤ s ≤ β can be nonzero, from a_11, …, a_{1,β+1} in the first row down to a_{n,n−β}, …, a_nn in the last row.

Ã ∈ R^{n×(2β+1)} stores the band diagonalwise: Ã = (ã_{i,s}), i = 1, …, n, s = −β, …, β, where column s holds the s-th diagonal of A (s = 0: main diagonal). Entries with i + s outside [1, n] are zero (padding in the upper left and lower right corners).
c = Ab for a Banded Matrix

Storing the entries diagonalwise: an n × (2β+1) array instead of n × n:

ã_{i,s} = a_{i,i+s}

Valid index pairs: 1 ≤ i ≤ n, −β ≤ s ≤ β, and 1 ≤ i + s ≤ n. Hence

in row i:       s ∈ [l_i, r_i] = [max{−β, 1−i}, min{β, n−i}]
in diagonal s:  i ∈ [l̃_s, r̃_s] = [max{1, 1−s}, min{n, n−s}]
Computation of the matrix-vector product, based on this storage scheme, on vector CPUs.

For i = 1, …, n:

c_i = A_i b = Σ_j a_ij b_j = Σ_{s=l_i}^{r_i} a_{i,i+s} b_{i+s} = Σ_{s=l_i}^{r_i} ã_{i,s} b_{i+s}

Diagonalwise, a general TRIAD, no SAXPY:
for s = −β : β
  for i = max{1−s, 1} : min{n−s, n}
    c_i = c_i + ã_{i,s} b_{i+s}
  end
end

or rowwise, partial DOT products:
for i = 1 : n
  for s = max{−β, 1−i} : min{β, n−i}
    c_i = c_i + ã_{i,s} b_{i+s}
  end
end

Sparsity: fewer operations, but also a loss of efficiency.
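The diagonalwise (TRIAD) form as runnable code, assuming Ã is an array of shape (n, 2β+1) with column β+s holding diagonal s (0-based indices shift the slide's 1-based bounds by one):

```python
import numpy as np

def banded_matvec(At, b, beta):
    """c = A b with A stored diagonalwise: At[i, beta+s] = a_{i,i+s}."""
    n = len(b)
    c = np.zeros(n)
    for s in range(-beta, beta + 1):           # loop over diagonals (TRIAD form)
        lo, hi = max(1 - s, 1), min(n - s, n)  # 1-based row bounds from the slide
        i = np.arange(lo - 1, hi)              # 0-based row indices
        c[i] += At[i, beta + s] * b[i + s]
    return c

# tridiagonal example (beta = 1): A = tridiag(-1, 2, -1)
n, beta = 5, 1
At = np.zeros((n, 2 * beta + 1))
At[:, 1] = 2.0        # main diagonal (s = 0)
At[1:, 0] = -1.0      # subdiagonal   (s = -1), rows 2..n
At[:-1, 2] = -1.0     # superdiagonal (s = +1), rows 1..n-1
print(banded_matvec(At, np.ones(n), beta))  # [1. 0. 0. 0. 1.]
```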
Band Ab in Parallel

Partitioning: [1, n] = ∪_{r=1}^R I_r, disjoint.

for i ∈ I_r:
  c_i = Σ_{s=l_i}^{r_i} ã_{i,s} b_{i+s}
end

Processor P_r gets the rows with index set I_r := [m_r, M_r] in order to compute its part of the final vector c.

What part of the vector b does processor P_r need in order to compute its part of c?
Band Ab in Parallel (cont.)

Necessary for I_r are the entries b_j with j = i + s, i ∈ I_r:

smallest index: j = i + s ≥ m_r + l_{m_r} = m_r + max{−β, 1−m_r} = max{m_r − β, 1}
largest index:  j = i + s ≤ M_r + r_{M_r} = M_r + min{β, n−M_r} = min{M_r + β, n}

Processor P_r with index set I_r needs from b the indices

j ∈ [max{1, m_r − β}, min{n, M_r + β}]
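The needed b-range in one line of code: processor P_r's own row range, widened by the bandwidth on both sides and clipped to [1, n]:

```python
def b_range(m_r, M_r, beta, n):
    """1-based, inclusive index range of b needed by processor P_r
    owning rows I_r = [m_r, M_r] of a matrix with bandwidth beta."""
    return max(1, m_r - beta), min(n, M_r + beta)

print(b_range(m_r=5, M_r=8, beta=2, n=10))  # (3, 10)
```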
2.3 Analysis of the Matrix-Matrix Product

A = (a_ij) ∈ R^{n×m}, B = (b_ij) ∈ R^{m×q}, C = AB = (c_ij) ∈ R^{n×q}:

for i = 1 : n
  for j = 1 : q
    c_ij = Σ_{k=1}^m a_ik b_kj
  end
end
2.3.1 Vectorization

Algorithm 1: (ijk)-form:

for i = 1 : n
  for j = 1 : q
    for k = 1 : m
      c_ij = c_ij + a_ik b_kj    ← DOT product of length m
    end
  end
end

c_ij = A_i B_j for all i, j (row i of A times column j of B).
All entries c_ij are fully computed, one after another.
Access to A and C is rowwise, to B columnwise (determined by the innermost loops!).
Another View of the Matrix-Matrix Product

A matrix can be considered as a combination of its columns or of its rows:

A = A_1 e_1^T + … + A_m e_m^T = (A_1 0 …) + (0 A_2 0 …) + … + (… 0 A_m)   (columns A_j)
B = e_1 b_1 + … + e_m b_m   (rows b_k, stacked)

AB = (Σ_{j=1}^m A_j e_j^T)(Σ_{k=1}^m e_k b_k) = Σ_{k,j} A_j (e_j^T e_k) b_k = Σ_{k=1}^m A_k b_k

since e_j^T e_k = δ_jk. Hence AB is a sum of m full n×q matrices A_k b_k, each the outer product of the k-th column of A with the k-th row of B.
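The outer-product view as code (a minimal sketch; each rank-1 term is a full n×q matrix):

```python
import numpy as np

def matmul_outer(A, B):
    """C = AB as a sum of m outer products: column k of A times row k of B."""
    n, m = A.shape
    _, q = B.shape
    C = np.zeros((n, q))
    for k in range(m):
        C += np.outer(A[:, k], B[k, :])  # full n x q rank-1 update
    return C
```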
Algorithm 2: (jki)-form

for j = 1 : q
  for k = 1 : m
    for i = 1 : n
      c_ij = c_ij + a_ik b_kj
    end
  end
end

Inner loop: vector update c_j = c_j + a_k b_kj (SAXPY), with a_k the k-th column of A and c_j the j-th column of C.
A sequence of SAXPYs for the same vector (GAXPY): c_j = Σ_k b_kj a_k.
C is computed columnwise; access to A is columnwise.
Access to B is columnwise, but delayed.
Algorithm 3: (kji)-form

for k = 1 : m
  for j = 1 : q
    for i = 1 : n
      c_ij = c_ij + a_ik b_kj
    end
  end
end

Inner loop: vector update c_j = c_j + a_k b_kj (SAXPY).
A sequence of SAXPYs for different vectors c_j (no GAXPY).
Access to A is columnwise; access to B is rowwise and delayed.
C is computed via intermediate values c_ij^(k), built up columnwise.
Overview of the Different Forms

                 ijk      ikj      kij      jik      jki      kji
                 (Alg 1)                             (Alg 2)  (Alg 3)
Access to A      row      row      column   row      column   column
Access to B      column   row      row      column   column   row
Comp. of c_ij    direct   delayed  delayed  direct   delayed  delayed
Vector operation DOT      GAXPY    SAXPY    DOT      GAXPY    SAXPY
Vector length    m        q        q        m        n        n

Better: GAXPY (longer vector length), and access to the matrices according to the storage scheme (rowwise or columnwise).
2.3.2 Matrix-Matrix Product in Parallel

[1, n] = ∪_{r=1}^R I_r,  [1, m] = ∪_{s=1}^S K_s,  [1, q] = ∪_{t=1}^T J_t

Distribute the blocks relative to the index sets I_r, K_s, and J_t to a processor array P_rst:

1. Processor P_rst computes the small matrix-matrix product c_rt^(s) = A_rs B_st; all processors in parallel.
2. Compute the sum by fan-in over s: c_rt = Σ_{s=1}^S c_rt^(s).
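A sequential sketch of the block scheme (the innermost sum over s stands in for the fan-in; each A_rs @ B_st would be computed by processor P_rst):

```python
import numpy as np

def blocked_matmul(A, B, R, S, T):
    """C = AB with an R x S x T block decomposition: P_rst computes
    A_rs @ B_st; the S partial products per (r, t) are reduced (fan-in)."""
    n, m = A.shape
    _, q = B.shape
    I = np.array_split(np.arange(n), R)
    K = np.array_split(np.arange(m), S)
    J = np.array_split(np.arange(q), T)
    C = np.zeros((n, q))
    for r in range(R):
        for t in range(T):
            for s in range(S):  # fan-in: sum of S independent partial products
                C[np.ix_(I[r], J[t])] += A[np.ix_(I[r], K[s])] @ B[np.ix_(K[s], J[t])]
    return C
```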
Matrix-Matrix Product in Parallel: Special Case S = 1

Each processor P_rt can compute its part c_rt of C independently, without communication.

Each processor needs the full block of rows of A relative to the index set I_r, and the full block of columns of B relative to the index set J_t, in order to compute c_rt relative to the rows I_r and columns J_t.
Matrix-Matrix Product in Parallel: Special Case S = 1 (cont.)

With n·q processors, each processor has to compute one DOT product

c_rt = Σ_{k=1}^m a_rk b_kt

with O(m) parallel time steps.

Fan-in with m·n·q additional processors for all the DOT products reduces the number of parallel time steps to O(log(m)).
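The fan-in idea as a pairwise (tree) reduction; the code is sequential, but all additions on one level are independent, so the sum needs only O(log₂ m) parallel steps:

```python
def fan_in_sum(values):
    """Pairwise (tree) reduction: O(log2 m) parallel levels instead of O(m)."""
    values = list(values)
    while len(values) > 1:
        mid = (len(values) + 1) // 2
        # level: element i is added to element mid+i; all pairs independent
        values = [values[i] + values[mid + i] if mid + i < len(values)
                  else values[i] for i in range(mid)]
    return values[0]

print(fan_in_sum([1, 2, 3, 4, 5]))  # 15, in ceil(log2(5)) = 3 levels
```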
1D Parallelization of A·B

1D: p processors in a line; each processor gets the full matrix A and a column slice of B, and computes the related column slice of C = AB.

Communication: N²·p for A and (N · N/p) · p = N² for B.
Granularity: N³ / (N²(1 + p)) = N / (1 + p)

Blocking only in i, the columns of B!

for i = 1 : n
  for j = 1 : n
    for k = 1 : n
      C_ji = C_ji + A_jk B_ki
2D Parallelization of A·B

2D: p processors in a square array, q := √p; each processor gets a row slice of A and a column slice of B, and computes a full subblock of C = AB.

Communication: N²·√p for A and N²·√p for B.
Granularity: N³ / (2N²√p) = N / (2√p)

Blocking in i and j, the columns of B and the rows of A!

for i = 1 : n
  for j = 1 : n
    for k = 1 : n
      C_ji = C_ji + A_jk B_ki
3D Parallelization of A·B

3D: p processors in a cube; each processor gets a subblock of A and a subblock of B, and computes part of a subblock of C = AB. An additional fan-in collects the parts into the full subblock of C (q = p^(1/3)).

Communication: N²·p^(1/3) for A and for B (= p · N²/p^(2/3) = p · blocksize); fan-in: N²·p^(1/3).
Granularity: N³ / (3N²·p^(1/3)) = N / (3·p^(1/3))

Blocking in i, j, and k!

for i = 1 : n
  for j = 1 : n
    for k = 1 : n
      C_ji = C_ji + A_jk B_ki
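Comparing the three granularities numerically, straight from the formulas above:

```python
def granularity_1d(N, p): return N / (1 + p)
def granularity_2d(N, p): return N / (2 * p**0.5)
def granularity_3d(N, p): return N / (3 * p**(1/3))

N, p = 4096, 64
print(granularity_1d(N, p))  # ~63
print(granularity_2d(N, p))  # 256
print(granularity_3d(N, p))  # ~341: 3D blocking reuses data best
```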
Parallel Numerics, WT 2014/2015

3 Linear Systems of Equations with Dense Matrices
Contents

1 Introduction
  1.1 Computer Science Aspects
  1.2 Numerical Problems
  1.3 Graphs
  1.4 Loop Manipulations
2 Elementary Linear Algebra Problems
  2.1 BLAS: Basic Linear Algebra Subroutines
  2.2 Matrix-Vector Operations
  2.3 Matrix-Matrix Product
3 Linear Systems of Equations with Dense Matrices
  3.1 Gaussian Elimination
  3.2 Parallelization
  3.3 QR Decomposition with Householder Matrices
4 Sparse Matrices
  4.1 General Properties, Storage
  4.2 Sparse Matrices and Graphs
  4.3 Reordering
  4.4 Gaussian Elimination for Sparse Matrices
5 Iterative Methods for Sparse Matrices
  5.1 Stationary Methods
  5.2 Nonstationary Methods
  5.3 Preconditioning
6 Domain Decomposition
  6.1 Overlapping Domain Decomposition
  6.2 Non-overlapping Domain Decomposition
  6.3 Schur Complements
3.1 Gaussian Elimination: Basic Properties

Linear system of equations:

a_11 x_1 + … + a_1n x_n = b_1
  ⋮
a_n1 x_1 + … + a_nn x_n = b_n

Solve Ax = b:

(a_11 … a_1n; …; a_n1 … a_nn)(x_1; …; x_n) = (b_1; …; b_n)

Generate simpler linear systems (matrices): transform A into triangular form,

A = A^(1) → A^(2) → … → A^(n) = U
Transformation to Upper Triangular Form

Starting from A = (a_ij), the row transformations

(2) ← (2) − (a_21/a_11)·(1), …, (n) ← (n) − (a_n1/a_11)·(1)

(where (i) denotes row i) lead to

A^(2) = [a_11  a_12      a_13      …  a_1n;
         0     a_22^(2)  a_23^(2)  …  a_2n^(2);
         0     a_32^(2)  a_33^(2)  …  a_3n^(2);
         ⋮     ⋮         ⋮            ⋮;
         0     a_n2^(2)  a_n3^(2)  …  a_nn^(2)]

Next transformations:

(3) ← (3) − (a_32^(2)/a_22^(2))·(2), …, (n) ← (n) − (a_n2^(2)/a_22^(2))·(2)
Transformation to Triangular Form (cont.)

A^(3) = [a_11  a_12      a_13      …  a_1n;
         0     a_22^(2)  a_23^(2)  …  a_2n^(2);
         0     0         a_33^(3)  …  a_3n^(3);
         ⋮     ⋮         ⋮            ⋮;
         0     0         a_n3^(3)  …  a_nn^(3)]

Next transformations:

(4) ← (4) − (a_43^(3)/a_33^(3))·(3), …, (n) ← (n) − (a_n3^(3)/a_33^(3))·(3)

Finally:

A^(n) = [a_11  a_12      a_13      …  a_1n;
         0     a_22^(2)  a_23^(2)  …  a_2n^(2);
         0     0         a_33^(3)  …  a_3n^(3);
         ⋮     ⋮         ⋮         ⋱  ⋮;
         0     0         0         …  a_nn^(n)] = U
Pseudocode: Gaussian Elimination (GE)

Simplification: assume that no pivoting is necessary, i.e. a_kk^(k) ≠ 0 (or |a_kk^(k)| ≥ ρ > 0) for k = 1, 2, …, n.

for k = 1 : n−1
  for i = k+1 : n
    l_ik = a_ik / a_kk
  end
  for i = k+1 : n
    for j = k+1 : n
      a_ij = a_ij − l_ik a_kj
    end
  end
end

In practice: include pivoting and the right-hand side b.
There still remains a triangular system in U to solve!
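The pseudocode as runnable NumPy (a minimal sketch without pivoting, exactly as on the slide; the two inner loops collapse into one rank-1 update):

```python
import numpy as np

def gaussian_elimination(A):
    """LU factorization without pivoting (assumes nonzero pivots).
    Returns unit lower triangular L and upper triangular U with L @ U == A."""
    A = A.astype(float).copy()
    n = A.shape[0]
    L = np.eye(n)
    for k in range(n - 1):
        L[k+1:, k] = A[k+1:, k] / A[k, k]                   # multipliers l_ik
        A[k+1:, k+1:] -= np.outer(L[k+1:, k], A[k, k+1:])   # rank-1 update
    return L, np.triu(A)

A = np.array([[4., 3.], [6., 3.]])
L, U = gaussian_elimination(A)
print(np.allclose(L @ U, A))  # True
```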
Intermediate Systems

A^(k), k = 1, 2, …, n, with A = A^(1) and U = A^(n):

A^(k) = [a_11^(1)  …  a_{1,k−1}^(1)      a_{1,k}^(1)      …  a_{1,n}^(1);
         0         ⋱  ⋮                  ⋮                   ⋮;
         0         …  a_{k−1,k−1}^(k−1)  a_{k−1,k}^(k−1)  …  a_{k−1,n}^(k−1);
         0         …  0                  a_{k,k}^(k)      …  a_{k,n}^(k);
         ⋮            ⋮                  ⋮                   ⋮;
         0         …  0                  a_{n,k}^(k)      …  a_{n,n}^(k)]

The first k−1 rows are final; the lower right block is the still active part.
Define Auxiliary Matrices

L = [1     0  …           0;
     l_21  1  ⋱           ⋮;
     ⋮        ⋱           0;
     l_n1  …  l_{n,n−1}   1]    and    U = A^(n)

L_k := the matrix that is zero except for the entries l_{k+1,k}, …, l_{n,k} in column k below the diagonal; then

L = I + Σ_k L_k
Elimination Step in Terms of the Auxiliary Matrices

A^(k+1) = (I − L_k) A^(k) = A^(k) − L_k A^(k)

U = A^(n) = (I − L_{n−1}) A^(n−1) = … = (I − L_{n−1}) ⋯ (I − L_1) A^(1) =: L̃ A,
L̃ := (I − L_{n−1}) ⋯ (I − L_1)

⟹ A = L̃^{−1} U with U upper triangular and L̃ lower triangular.

Theorem 2: L̃^{−1} = L and therefore A = LU.

Advantage: every further problem Ax = b_j can be reduced to (LU)x = b_j for arbitrary b_j.
Solve two triangular problems: Ly = b_j and Ux = y.
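This reuse of the factorization for many right-hand sides, shown with SciPy's LU routines (which, unlike the slide's simplification, do pivot):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

A = np.array([[4., 3.], [6., 3.]])
lu, piv = lu_factor(A)               # factor once: O(n^3)
for b in (np.array([1., 0.]), np.array([0., 1.])):
    x = lu_solve((lu, piv), b)       # per right-hand side: two O(n^2) triangular solves
    print(np.allclose(A @ x, b))     # True
```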
Theorem 2: L̃^{−1} = L ⟹ A = LU

For i ≤ j: L_i L_j = 0, since L_i is nonzero only in column i while L_j is nonzero only in rows j+1, …, n, and i < j+1.

In particular L_j² = 0, hence

(I + L_j)(I − L_j) = I + L_j − L_j − L_j² = I  ⟹  (I − L_j)^{−1} = I + L_j

and, for i ≤ j,

(I + L_i)(I + L_j) = I + L_i + L_j + L_i L_j = I + L_i + L_j

Therefore

L̃^{−1} = [(I − L_{n−1}) ⋯ (I − L_1)]^{−1}
       = (I − L_1)^{−1} ⋯ (I − L_{n−1})^{−1}
       = (I + L_1)(I + L_2) ⋯ (I + L_{n−1})
       = I + L_1 + L_2 + … + L_{n−1} = L  ∎
3.2 GE in Parallel: Blockwise

Main idea: blocking of GE to avoid data transfer between processors.

Basic concept: replace GE, i.e. the large LU decomposition of the full matrix, by a sequence of small block operations:
- solving collections of small triangular systems L U_k = B_k (parallelism in the columns of U)
- updating the remaining matrix, A ← A − L·U (also easy to parallelize)
- small LU decompositions B = LU (parallelism in the rows of B)
How to Choose the Blocks in L/U Satisfying LU = A

[L_11  0     0;     [U_11  U_12  U_13;     [A_11  A_12  A_13;
 L_21  L_22  0;      0     U_22  U_23;  =   A_21  A_22  A_23;
 L_31  L_32  L_33]   0     0     U_33]      A_31  A_32  A_33]

= [L_11 U_11   L_11 U_12               L_11 U_13;
   L_21 U_11   L_21 U_12 + L_22 U_22   L_21 U_13 + L_22 U_23;
   L_31 U_11   L_31 U_12 + L_32 U_22   L_31 U_13 + L_32 U_23 + L_33 U_33]

Different ways of computing L and U depending on:
- the start (assume the first entry/row/column of L/U as given)
- how to compute a new entry/row/column of L/U
- the update of the block structure of L/U, by grouping into known blocks, blocks to compute now, and blocks to compute later
Crout Form
Crout Form (cont.)

1. Solve for L_22, L_32, and U_22 by a small LU decomposition of the modified part of A.
2. Solve for U_23 by solving small triangular systems of equations in L_22.

Initial steps:

L_11 U_11 = A_11,   (L_21; L_31) U_11 = (A_21; A_31),   L_11 (U_12 U_13) = (A_12 A_13)
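A compact sketch of a blockwise LU decomposition (a right-looking variant rather than the Crout ordering above, without pivoting; the block size r and the helper names are ours). Each step does exactly the three block operations listed on the previous slides: one small LU of the diagonal block, triangular solves for the L/U panels, and a matrix-matrix update of the trailing part:

```python
import numpy as np
from scipy.linalg import solve_triangular

def lu_nopiv(A):
    """Unblocked LU without pivoting (assumes nonzero pivots)."""
    A = A.copy()
    n = A.shape[0]
    for k in range(n - 1):
        A[k+1:, k] /= A[k, k]
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
    return np.tril(A, -1) + np.eye(n), np.triu(A)

def block_lu(A, r):
    """Blockwise LU with block size r: A = L U (no pivoting)."""
    A = A.astype(float).copy()
    n = A.shape[0]
    L, U = np.eye(n), np.zeros((n, n))
    for k in range(0, n, r):
        e = min(k + r, n)
        # small LU decomposition of the current diagonal block
        L[k:e, k:e], U[k:e, k:e] = lu_nopiv(A[k:e, k:e])
        if e < n:
            # L_{ik} U_{kk} = A_{ik}: triangular solves, parallel over block rows
            L[e:, k:e] = solve_triangular(U[k:e, k:e].T, A[e:, k:e].T, lower=True).T
            # L_{kk} U_{kj} = A_{kj}: triangular solves, parallel over block columns
            U[k:e, e:] = solve_triangular(L[k:e, k:e], A[k:e, e:], lower=True)
            # update the remaining matrix (a GEMM, also easy to parallelize)
            A[e:, e:] -= L[e:, k:e] @ U[k:e, e:]
    return L, U

A = np.array([[4., 3., 2.], [6., 3., 1.], [8., 5., 3.]])
L, U = block_lu(A, r=2)
print(np.allclose(L @ U, A))  # True
```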