BLAS: Basic Linear Algebra Subroutines. Analysis of the Matrix-Vector Product. Analysis of the Matrix-Matrix Product.


Level-1 BLAS: SAXPY. BLAS notation: S = single precision (D for double, C for complex), A = α (scalar), X = vector, P = plus operation, Y = vector. SAXPY computes y = αx + y. Vectorization of SAXPY (αx + y) is achieved by pipelining.
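As a minimal sketch (not the reference BLAS implementation), a SAXPY can be written either as an explicit loop or as a single vectorized NumPy statement; the array names and sizes below are illustrative:

import numpy as np

def saxpy(alpha, x, y):
    # Explicit-loop version of the Level-1 operation y = alpha*x + y
    out = y.copy()
    for i in range(len(x)):
        out[i] += alpha * x[i]
    return out

x = np.arange(4, dtype=np.float32)
y = np.ones(4, dtype=np.float32)
# Vectorized form, which is what a pipelined vector CPU effectively computes:
assert np.allclose(saxpy(2.0, x, y), 2.0 * x + y)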

SAXPY Parallelization by Partitioning. Partition the index set {1, 2, ..., n} = I_1 ∪ I_2 ∪ ... ∪ I_R and split the vectors accordingly: x = (x_1, ..., x_n)^T = (X_1, ..., X_R)^T and y = (y_1, ..., y_n)^T = (Y_1, ..., Y_R)^T, using short vectors X_j and Y_j of length n/R. Each processor P_j gets the partial vectors X_j and Y_j and computes Y_j = αX_j + Y_j, j = 1, 2, ..., R. Result: SAXPY is very well vectorizable and parallelizable.

Further Level-1 BLAS Routines. SCOPY: y = x, i.e. y ← x (compare SAXPY). DOT product: x^T y = Σ_{i=1}^{n} x_i y_i. Norm: ||x||_2 = sqrt(Σ_j x_j^2) = sqrt(x^T x) (compare DOT product).

Level-2 BLAS: matrix-vector operations with O(n^2) operations (sequentially). BLAS notation: S = single precision, GE = general matrix, MV = matrix-vector; this defines SGEMV, the matrix-vector product y = αAx + βy. Other Level-2 BLAS: solving a triangular system Lx = b with triangular matrix L.

Level-3 BLAS: matrix-matrix operations with O(n^3) operations (sequentially). BLAS notation: S = single precision, GE = general matrix, MM = matrix-matrix; this defines SGEMM, the matrix-matrix product C = αAB + βC.
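A short NumPy sketch of what these routines compute; it only illustrates the GEMV/GEMM semantics and does not call an actual BLAS interface directly:

import numpy as np

def gemv(alpha, A, x, beta, y):
    # Semantics of xGEMV: returns alpha*A@x + beta*y
    return alpha * (A @ x) + beta * y

def gemm(alpha, A, B, beta, C):
    # Semantics of xGEMM: returns alpha*A@B + beta*C
    return alpha * (A @ B) + beta * C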

Granularity for BLAS

BLAS level   operation     formula      memory      granularity
BLAS-1       AXPY: 2n      αx + y       2n + 1      < 1
BLAS-2       GEMV: 2n^2    αAx + βy     n^2 + 2n    ≈ 2
BLAS-3       GEMM: 2n^3    αAB + βC     4n^2        ≈ n/2

BLAS-3 has the best operations-to-memory ratio!

2.2 Analysis of the Matrix-Vector Product. A = (a_ij), i = 1, ..., n, j = 1, ..., m, with A ∈ R^{n×m}, b ∈ R^m, c ∈ R^n.

2.2.1 Vectorization. The product c = Ab can be read in two ways:

c_i = Σ_{j=1}^{m} a_ij b_j = a_i1 b_1 + ... + a_im b_m, for i = 1, ..., n   (n DOT products of length m)

c = Σ_{j=1}^{m} b_j (a_1j, ..., a_nj)^T   (m SAXPYs of length n, i.e. a GAXPY)

Pseudocode: ij-form

c = 0;
for i = 1, ..., n
  for j = 1, ..., m
    c_i = c_i + a_ij * b_j
  end
end

The inner loop is a DOT product: c_i = A_i b, the DOT product of the ith row of A with the vector b.
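A direct, runnable translation of the ij-form (illustrative only; the row slice A[i, :] plays the role of the row A_i):

import numpy as np

def matvec_ij(A, b):
    n, m = A.shape
    c = np.zeros(n)
    for i in range(n):          # outer loop over rows
        for j in range(m):      # inner loop is a DOT product of row i with b
            c[i] += A[i, j] * b[j]
    return c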

Pseudocode: ji-form

c = 0;
for j = 1, ..., m
  for i = 1, ..., n
    c_i = c_i + a_ij * b_j
  end
end

The inner loop is a SAXPY, updating the vector c with the jth column of A. GAXPY: a sequence of SAXPYs related to the same vector. Advantage: the vector c that is updated can be kept in fast memory; no additional data transfer.
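The corresponding column-oriented sketch in NumPy; the whole loop is one GAXPY because every SAXPY updates the same vector c:

import numpy as np

def matvec_ji(A, b):
    n, m = A.shape
    c = np.zeros(n)              # c stays "hot": it is the target of every update
    for j in range(m):
        c += b[j] * A[:, j]      # one SAXPY with the jth column of A
    return c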

GAXPY (repetition)

SAXPY: y := y + αx

GAXPY:
y = y_0
for i = 1 : n
  y := y + α_i x_i
end

A series of SAXPYs regarding the same vector y; length(GAXPY) = length(y). Advantage: less data transfer!

2.2.2 Parallelization by Building Blocks. Reduce the matrix-vector product to smaller matrix-vector products. Partition {1, 2, ..., n} = I_1 ∪ I_2 ∪ ... ∪ I_R, disjoint: I_j ∩ I_k = ∅ for j ≠ k, and {1, 2, ..., m} = J_1 ∪ J_2 ∪ ... ∪ J_S, disjoint: J_j ∩ J_k = ∅ for j ≠ k. Use a 2-dimensional array of processors P_rs. Processor P_rs gets the matrix block A_rs := A(I_r, J_s), b_s := b(J_s), c_r := c(I_r), and

c_r = Σ_{s=1}^{S} A_rs b_s =: Σ_{s=1}^{S} c_r^(s)

Pseudocode

for r = 1, ..., R
  for s = 1, ..., S
    c_r^(s) = A_rs * b_s;
  end
end

for r = 1, ..., R
  c_r = 0
  for s = 1, ..., S
    c_r = c_r + c_r^(s);
  end
end

Small, independent matrix-vector products: no communication is necessary during the computations! Blockwise collection and addition of the partial vectors: rowwise communication (fan-in).
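A serial sketch of this blocking (the partitions row_blocks and col_blocks are illustrative, e.g. np.array_split(np.arange(n), R); in a parallel code each (r, s) product would run on processor P_rs):

import numpy as np

def block_matvec(A, b, row_blocks, col_blocks):
    # row_blocks / col_blocks: lists of index arrays partitioning the rows / columns
    c = np.zeros(A.shape[0])
    for I_r in row_blocks:
        partial = np.zeros(len(I_r))
        for J_s in col_blocks:                        # each term would live on P_rs
            partial += A[np.ix_(I_r, J_s)] @ b[J_s]   # c_r^(s) = A_rs b_s
        c[I_r] = partial                              # fan-in (sum) over s
    return c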

Blocking: Special Cases

S = 1 (row blocks only): the computation of each A_r b is vectorizable by GAXPYs,

c = (A_1; A_2; ...) b = (A_1 b; A_2 b; ...),

and no communication is necessary between the processors P_1, ..., P_R.

R = 1 (column blocks only): the products A_s b_s are independent,

c = (A_1 A_2 ...) (b_1; b_2; ...) = A_1 b_1 + A_2 b_2 + ...,

followed by collection of the partial results from the processors P_1, ..., P_S (fan-in). The final sum in one processor is vectorizable by GAXPYs.

Rules
1. Inner loops of a program should be simple and vectorizable.
2. Outer loops of a program should be substantial, independent, and parallelizable.
3. Reuse data (cache, minimal data transfer, blocking).

2.2.3 c = Ab for a Banded Matrix. Bandwidth β (symmetric): 2β+1 diagonals, the main diagonal plus β subdiagonals plus β superdiagonals. β = 1: tridiagonal.

Notation: Banded Matrices A and Ã

A ∈ R^{n×n} is the band matrix with a_ij = 0 for |i − j| > β; its first row is (a_11, ..., a_{1,β+1}, 0, ..., 0) and its last row is (0, ..., 0, a_{n,n−β}, ..., a_nn).

Ã ∈ R^{n×(2β+1)} stores the diagonals of A rowwise: row i of Ã contains ã_{i,−β}, ..., ã_{i,0}, ..., ã_{i,β} with ã_{i,s} = a_{i,i+s} (and zeros where i+s lies outside 1, ..., n), so ã_{i,0} is the diagonal entry a_ii.

c = Ab for a Banded Matrix. Storing the entries diagonalwise gives an n×(2β+1) array instead of n×n:

ã_{i,s} = a_{i,i+s} for 1 ≤ i ≤ n, −β ≤ s ≤ β, and 1 ≤ i+s ≤ n.

Equivalently, 1−i ≤ s ≤ n−i and −β ≤ s ≤ β, so in row i the offset s ranges over [l_i, r_i] = [max{−β, 1−i}, min{β, n−i}]; and 1−s ≤ i ≤ n−s and 1 ≤ i ≤ n, so in diagonal s the row index i ranges over [l_s, r_s] = [max{1, 1−s}, min{n, n−s}].

Computation of the matrix-vector product based on the storage scheme on vector CPUs. For i = 1, ..., n:

c_i = A_i b = Σ_j a_ij b_j = Σ_{s=l_i}^{r_i} a_{i,i+s} b_{i+s} = Σ_{s=l_i}^{r_i} ã_{i,s} b_{i+s}

Diagonalwise (general TRIAD, no SAXPY):

for s = −β : β
  for i = max{1−s, 1} : min{n−s, n}
    c_i = c_i + ã_{i,s} * b_{i+s}
  end
end

or rowwise (partial DOT product):

for i = 1 : n
  for s = max{−β, 1−i} : min{β, n−i}
    c_i = c_i + ã_{i,s} * b_{i+s}
  end
end

Sparsity: fewer operations, but also a loss of efficiency.
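A NumPy sketch of the diagonalwise product; the storage layout Atilde[i, beta + s] = a_{i,i+s} (0-based i, zero padding outside the band) is an assumption made for illustration:

import numpy as np

def band_matvec(Atilde, b, beta):
    # Atilde: n x (2*beta+1) array, Atilde[i, beta + s] = a_{i,i+s}, zero-padded outside the band
    n = Atilde.shape[0]
    c = np.zeros(n)
    for s in range(-beta, beta + 1):           # loop over diagonals
        lo, hi = max(0, -s), min(n, n - s)     # rows i with 0 <= i+s < n
        c[lo:hi] += Atilde[lo:hi, beta + s] * b[lo + s:hi + s]   # one TRIAD per diagonal
    return c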

Band Ab in Parallel. Partitioning: {1, ..., n} = ∪_{r=1}^{R} I_r, disjoint.

for i ∈ I_r
  c_i = Σ_{s=l_i}^{r_i} ã_{i,s} b_{i+s}
end

Processor P_r gets the rows of the index set I_r := [m_r, M_r] in order to compute its part of the final vector c. What part of the vector b does processor P_r need in order to compute its part of c?

Band Ab in Parallel. Necessary for I_r are the entries b_j = b_{i+s} with

j = i + s ≥ m_r + l_{m_r} = m_r + max{−β, 1−m_r} = max{m_r − β, 1}
j = i + s ≤ M_r + r_{M_r} = M_r + min{β, n − M_r} = min{M_r + β, n}

Processor P_r with index set I_r needs from b the indices j ∈ [max{1, m_r − β}, min{n, M_r + β}].

2.3 Analysis of the Matrix-Matrix Product. A = (a_ij) ∈ R^{n×m}, B = (b_ij) ∈ R^{m×q}, C = AB = (c_ij) ∈ R^{n×q}:

for i = 1 : n
  for j = 1 : q
    c_ij = Σ_{k=1}^{m} a_ik b_kj
  end
end

2.3.1 Vectorization. Algorithm 1, (ijk)-form:

for i = 1 : n
  for j = 1 : q
    for k = 1 : m
      c_ij = c_ij + a_ik * b_kj     (inner loop: DOT product of length m)
    end
  end
end

c_ij = A_i B_j for all i, j: the DOT product of the ith row of A with the jth column of B. All entries c_ij are fully computed, one after another. Access to A and C is rowwise, to B columnwise (determined by the innermost loops!).
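A runnable version of the (ijk)-form, for illustration only (a library GEMM would be used in practice):

import numpy as np

def matmul_ijk(A, B):
    n, m = A.shape
    m2, q = B.shape
    assert m == m2
    C = np.zeros((n, q))
    for i in range(n):
        for j in range(q):
            for k in range(m):           # DOT product of row i of A with column j of B
                C[i, j] += A[i, k] * B[k, j]
    return C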

Other View on the Matrix-Matrix Product. Matrix A can be considered as a combination of its columns or of its rows:

A = A_1 e_1^T + ... + A_m e_m^T = (A_1 0 ... 0) + (0 A_2 0 ... 0) + ... + (0 ... 0 A_m)   (columns A_k)
A = e_1 a_1 + ... + e_n a_n   (rows a_i)

AB = (Σ_{j=1}^{m} A_j e_j^T)(Σ_{k=1}^{m} e_k b_k) = Σ_{k,j} A_j (e_j^T e_k) b_k = Σ_{k=1}^{m} A_k b_k

so C = AB is a sum of full n×q matrices A_k b_k, each the outer product of the kth column of A with the kth row of B.
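The outer-product view translates into a sequence of rank-1 updates; a short illustrative sketch:

import numpy as np

def matmul_outer(A, B):
    n, m = A.shape
    _, q = B.shape
    C = np.zeros((n, q))
    for k in range(m):
        C += np.outer(A[:, k], B[k, :])   # rank-1 update: kth column of A times kth row of B
    return C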

Algorithm 2, (jki)-form:

for j = 1, ..., q
  for k = 1, ..., m
    for i = 1, ..., n
      c_ij = c_ij + a_ik * b_kj
    end
  end
end

Vector update: C_j = C_j + A_k b_kj (columns of C and A). A sequence of SAXPYs for the same vector, i.e. a GAXPY: C_j = Σ_k b_kj A_k. C is computed columnwise; access to A is columnwise. Access to B is columnwise, but delayed.

Algorithm 3, (kji)-form:

for k = 1, ..., m
  for j = 1, ..., q
    for i = 1, ..., n
      c_ij = c_ij + a_ik * b_kj
    end
  end
end

Vector update: C_j = C_j + A_k b_kj. A sequence of SAXPYs for different vectors C_j (no GAXPY). Access to A is columnwise, access to B rowwise and delayed. C is computed with intermediate values c_ij^(k), which are built up columnwise.

Overview of the Different Forms

                        ijk      ikj      kij      jik      jki      kji
                        (Alg 1)                             (Alg 2)  (Alg 3)
Access to A by          row      row      column   row      column   column
Access to B by          column   row      row      column   column   row
Computation of c_ij     direct   delayed  delayed  direct   delayed  delayed
Vector operation        DOT      GAXPY    SAXPY    DOT      GAXPY    SAXPY
Vector length           m        q        q        m        n        n

Better: GAXPY (longer vector length). Access the matrices according to the storage scheme (rowwise or columnwise).

2.3.2 Matrix-Matrix Product in Parallel. Partition {1, ..., n} = ∪_{r=1}^{R} I_r, {1, ..., m} = ∪_{s=1}^{S} K_s, {1, ..., q} = ∪_{t=1}^{T} J_t. Distribute the blocks relative to the index sets I_r, K_s, and J_t to the processor array P_rst:

1. Each processor P_rst computes a small matrix-matrix product; all processors in parallel: C_rt^(s) = A_rs B_st.
2. Compute the sum by fan-in over s: C_rt = Σ_{s=1}^{S} C_rt^(s).
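A serial sketch of this blocked product (the three partitions are illustrative index-array lists; in a parallel code each innermost product would run on processor P_rst and the sum over s would be a fan-in):

import numpy as np

def block_matmul(A, B, row_blocks, inner_blocks, col_blocks):
    # row_blocks (I_r), inner_blocks (K_s), col_blocks (J_t): index arrays partitioning n, m, q
    n, q = A.shape[0], B.shape[1]
    C = np.zeros((n, q))
    for I_r in row_blocks:
        for J_t in col_blocks:
            C_rt = np.zeros((len(I_r), len(J_t)))
            for K_s in inner_blocks:           # each product would live on processor P_rst
                C_rt += A[np.ix_(I_r, K_s)] @ B[np.ix_(K_s, J_t)]   # fan-in over s
            C[np.ix_(I_r, J_t)] = C_rt
    return C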

Matrix-Matrix Product in Parallel: Special Case S = 1. Each processor P_rt can compute its part C_rt of C independently, without communication. Each processor needs the full block of rows of A relative to the index set I_r and the full block of columns of B relative to the index set J_t in order to compute C_rt, i.e. the block of C relative to the rows I_r and the columns J_t.

Matrix-Matrix Product in Parallel: Special Case S = 1 (continued). With n·q processors, each processor has to compute one DOT product, c_rt = Σ_{k=1}^{m} a_rk b_kt, with O(m) parallel time steps. Fan-in with m·n·q additional processors for all DOT products reduces the number of parallel time steps to O(log(m)).

1D-Parallelization of A·B. 1D: p processors in a line; each processor gets the full matrix A and a column slice of B and computes the related column slice of C = AB.

Communication: N^2·p for A and (N · N/p) · p = N^2 for B.
Granularity: N^3 / (N^2 (1 + p)) = N / (1 + p).

Blocking only in i, the columns of B:

for i = 1 : n
  for j = 1 : n
    for k = 1 : n
      C_j,i = C_j,i + A_j,k * B_k,i

2D-Parallelization of A·B. 2D: p processors in a square array, q := sqrt(p); each processor gets a row slice of A and a column slice of B and computes a full subblock of C = AB.

Communication: N^2·sqrt(p) for A and N^2·sqrt(p) for B.
Granularity: N^3 / (2 N^2 sqrt(p)) = N / (2 sqrt(p)).

Blocking in i and j, the columns of B and the rows of A:

for i = 1 : n
  for j = 1 : n
    for k = 1 : n
      C_j,i = C_j,i + A_j,k * B_k,i

3D-Parallelization of A·B. 3D: p processors in a cubic array (q = p^(1/3)); each processor gets a subblock of A and a subblock of B and computes part of a subblock of C = AB. An additional fan-in collects the parts into the full subblock of C.

Communication: N^2·p^(1/3) for A and for B (= p · N^2/p^(2/3) = p · blocksize); fan-in: N^2·p^(1/3).
Granularity: N^3 / (3 N^2 p^(1/3)) = N / (3 p^(1/3)).

Blocking in i, j, and k:

for i = 1 : n
  for j = 1 : n
    for k = 1 : n
      C_j,i = C_j,i + A_j,k * B_k,i

Parallel Numerics, WT 2014/2015
3 Linear Systems of Equations with Dense Matrices

Contents
1 Introduction
  1.1 Computer Science Aspects
  1.2 Numerical Problems
  1.3 Graphs
  1.4 Loop Manipulations
2 Elementary Linear Algebra Problems
  2.1 BLAS: Basic Linear Algebra Subroutines
  2.2 Matrix-Vector Operations
  2.3 Matrix-Matrix Product
3 Linear Systems of Equations with Dense Matrices
  3.1 Gaussian Elimination
  3.2 Parallelization
  3.3 QR-Decomposition with Householder Matrices
4 Sparse Matrices
  4.1 General Properties, Storage
  4.2 Sparse Matrices and Graphs
  4.3 Reordering
  4.4 Gaussian Elimination for Sparse Matrices
5 Iterative Methods for Sparse Matrices
  5.1 Stationary Methods
  5.2 Nonstationary Methods
  5.3 Preconditioning
6 Domain Decomposition
  6.1 Overlapping Domain Decomposition
  6.2 Non-overlapping Domain Decomposition
  6.3 Schur Complements

3.1 Linear Systems of Equations with Dense Matrices
3.1.1 Gaussian Elimination: Basic Properties

Linear system of equations:
a_11 x_1 + ... + a_1n x_n = b_1
...
a_n1 x_1 + ... + a_nn x_n = b_n

Solve Ax = b with A = (a_ij) ∈ R^{n×n}, x = (x_1, ..., x_n)^T, b = (b_1, ..., b_n)^T. Generate simpler linear equations (matrices): transform A into triangular form, A = A^(1) → A^(2) → ... → A^(n) = U.

Transformation to Upper Triangular Form

Starting from A = A^(1) with entries a_ij, apply the row transformations
(row 2) ← (row 2) − (a_21/a_11)·(row 1), ..., (row n) ← (row n) − (a_n1/a_11)·(row 1),
which leads to

A^(2) = [ a_11  a_12      a_13      ...  a_1n ;
          0     a^(2)_22  a^(2)_23  ...  a^(2)_2n ;
          0     a^(2)_32  a^(2)_33  ...  a^(2)_3n ;
          ...
          0     a^(2)_n2  a^(2)_n3  ...  a^(2)_nn ]

The next transformations are
(row 3) ← (row 3) − (a^(2)_32 / a^(2)_22)·(row 2), ..., (row n) ← (row n) − (a^(2)_n2 / a^(2)_22)·(row 2).

Transformation to Triangular Form (cont.)

A^(3) = [ a_11  a_12      a_13      ...  a_1n ;
          0     a^(2)_22  a^(2)_23  ...  a^(2)_2n ;
          0     0         a^(3)_33  ...  a^(3)_3n ;
          ...
          0     0         a^(3)_n3  ...  a^(3)_nn ]

The next transformations are
(row 4) ← (row 4) − (a^(3)_43 / a^(3)_33)·(row 3), ..., (row n) ← (row n) − (a^(3)_n3 / a^(3)_33)·(row 3),
and after n − 1 steps

A^(n) = [ a_11  a_12      a_13      ...  a_1n ;
          0     a^(2)_22  a^(2)_23  ...  a^(2)_2n ;
          0     0         a^(3)_33  ...  a^(3)_3n ;
          ...
          0     0         0         ...  a^(n)_nn ] = U

Pseudocode: Gaussian Elimination (GE)

Simplification: assume that no pivoting is necessary, i.e. a^(k)_kk ≠ 0 (or better |a^(k)_kk| ≥ ρ > 0) for k = 1, 2, ..., n.

for k = 1 : n−1
  for i = k+1 : n
    l_i,k = a_i,k / a_k,k
  end
  for i = k+1 : n
    for j = k+1 : n
      a_i,j = a_i,j − l_i,k * a_k,j
    end
  end
end

In practice: include pivoting and include the right-hand side b. A triangular system in U still remains to be solved!
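A compact, runnable version of this elimination; it is a sketch without pivoting (so it assumes well-behaved pivots), overwrites a copy of A, and returns L and U:

import numpy as np

def gaussian_elimination(A):
    # LU factorization by GE without pivoting: returns L (unit lower triangular) and U with A = L @ U
    A = A.astype(float).copy()
    n = A.shape[0]
    L = np.eye(n)
    for k in range(n - 1):
        for i in range(k + 1, n):
            L[i, k] = A[i, k] / A[k, k]            # multiplier l_{i,k}
            A[i, k+1:] -= L[i, k] * A[k, k+1:]     # update row i
            A[i, k] = 0.0
    return L, np.triu(A)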

Intermediate Systems

A^(k), k = 1, 2, ..., n, with A = A^(1) and U = A^(n):

A^(k) = [ a^(1)_11  ...  a^(1)_1,k−1      a^(1)_1,k        ...  a^(1)_1,n ;
          ...
          0         ...  a^(k−1)_k−1,k−1  a^(k−1)_k−1,k    ...  a^(k−1)_k−1,n ;
          0         ...  0                a^(k)_k,k        ...  a^(k)_k,n ;
          ...
          0         ...  0                a^(k)_n,k        ...  a^(k)_n,n ]

The first k−1 rows are already in triangular form; the trailing (n−k+1)×(n−k+1) block still has to be eliminated.

Define Auxiliary Matrices

L = [ 1       0    ...      0 ;
      l_2,1   1             0 ;
      ...           ...       ;
      l_n,1   ...  l_n,n−1  1 ]      and      U = A^(n)

L_k := the n×n matrix that is zero except for the entries l_k+1,k, ..., l_n,k in column k below the diagonal, so that L = I + Σ_k L_k.

Elimination Step in Terms of Auxiliary Matrices

A^(k+1) = (I − L_k) A^(k) = A^(k) − L_k A^(k)

U = A^(n) = (I − L_{n−1}) A^(n−1) = ... = (I − L_{n−1}) ··· (I − L_1) A^(1) = L̃ A,   with L̃ := (I − L_{n−1}) ··· (I − L_1)

Hence A = L̃^(−1) U with U upper triangular and L̃^(−1) lower triangular.

Theorem 2: L̃^(−1) = L, and therefore A = LU.

Advantage: every further problem Ax = b_j can be reduced to (LU)x = b_j for arbitrary j. Solve two triangular problems: Ly = b and Ux = y.
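A sketch of the two triangular solves that reuse a precomputed factorization A = LU (illustrative loop versions; library routines would normally be used, and L is assumed to have a unit diagonal as produced by GE):

import numpy as np

def solve_lu(L, U, b):
    n = len(b)
    y = np.zeros(n)
    for i in range(n):                     # forward substitution: L y = b (unit diagonal in L)
        y[i] = b[i] - L[i, :i] @ y[:i]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):         # back substitution: U x = y
        x[i] = (y[i] - U[i, i+1:] @ x[i+1:]) / U[i, i]
    return x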

Theorem 2: L̃^(−1) = L, hence A = LU.

For i ≤ j: L_i L_j = 0 (the only nonzero column of L_j is column j, and it meets only zero entries of L_i).

(I + L_j)(I − L_j) = I + L_j − L_j − L_j^2 = I, so (I − L_j)^(−1) = I + L_j.

(I + L_i)(I + L_j) = I + L_i + L_j + L_i L_j = I + L_i + L_j for i ≤ j.

Therefore
L̃^(−1) = [(I − L_{n−1}) ··· (I − L_1)]^(−1) = (I − L_1)^(−1) ··· (I − L_{n−1})^(−1) = (I + L_1)(I + L_2) ··· (I + L_{n−1}) = I + L_1 + L_2 + ... + L_{n−1} = L.

3.2 GE in Parallel: Blockwise

Main idea: blocking of GE to avoid data transfer between processors.

Basic concept: replace GE, i.e. the large LU decomposition of the full matrix, by small intermediate steps (a sequence of small block operations):
- solving a collection of small triangular systems L U_k = B_k (parallelism in the columns of U),
- updating matrices A ← A − L·U (also easy to parallelize),
- small LU decompositions B = LU (parallelism in the rows of B).

How to Choose the Blocks in L/U Satisfying LU = A

[ L_11  0     0    ]   [ U_11  U_12  U_13 ]   [ A_11  A_12  A_13 ]
[ L_21  L_22  0    ] · [ 0     U_22  U_23 ] = [ A_21  A_22  A_23 ]
[ L_31  L_32  L_33 ]   [ 0     0     U_33 ]   [ A_31  A_32  A_33 ]

  [ L_11 U_11   L_11 U_12               L_11 U_13                          ]
= [ L_21 U_11   L_21 U_12 + L_22 U_22   L_21 U_13 + L_22 U_23              ]
  [ L_31 U_11   L_31 U_12 + L_32 U_22   L_31 U_13 + L_32 U_23 + L_33 U_33  ]

There are different ways of computing L and U, depending on
- the start (assume the first entry/row/column of L/U as given),
- how to compute a new entry/row/column of L/U,
- the update of the block structure of L/U by grouping into known blocks, blocks to compute now, and blocks to be computed later.

Crout Form (block picture of the partitioned L, U, and A; figure not transcribed)

Crout Form (cont.)

1. Compute L_22, L_32, and U_22 by a small LU decomposition of the modified part of A.
2. Compute U_23 by solving small triangular systems of equations in L_22.

Initial steps: L_11 U_11 = A_11,

[ L_21 ]            [ A_21 ]
[ L_31 ] · U_11  =  [ A_31 ],      L_11 · (U_12  U_13) = (A_12  A_13)
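A self-contained sketch of these block operations, written as a right-looking block variant rather than the exact Crout ordering; the helper lu_nopivot and the block size b are illustrative, and scipy.linalg.solve_triangular handles the small triangular systems:

import numpy as np
from scipy.linalg import solve_triangular

def lu_nopivot(A):
    # Small unpivoted LU (illustrative helper): A = L @ U with unit lower-triangular L
    A = A.astype(float).copy()
    n = A.shape[0]
    L = np.eye(n)
    for k in range(n - 1):
        L[k+1:, k] = A[k+1:, k] / A[k, k]
        A[k+1:, k:] -= np.outer(L[k+1:, k], A[k, k:])
    return L, np.triu(A)

def block_lu(A, b):
    # Blockwise LU without pivoting (sketch), block size b; returns L, U with A = L @ U
    A = A.astype(float).copy()
    n = A.shape[0]
    L, U = np.eye(n), np.zeros((n, n))
    for k in range(0, n, b):
        e = min(k + b, n)
        L[k:e, k:e], U[k:e, k:e] = lu_nopivot(A[k:e, k:e])       # small LU of the diagonal block
        if e < n:
            U[k:e, e:] = solve_triangular(L[k:e, k:e], A[k:e, e:],
                                          lower=True, unit_diagonal=True)   # L11 * U12 = A12
            L[e:, k:e] = solve_triangular(U[k:e, k:e].T, A[e:, k:e].T,
                                          lower=True).T                     # L21 * U11 = A21
            A[e:, e:] -= L[e:, k:e] @ U[k:e, e:]                  # update the trailing block
    return L, U

The small LU, the triangular solves, and the trailing update correspond to the three block operations above, which is where the parallelism across columns of U and rows of L can be exploited.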