Computation of the mtx-vec product based on storage scheme on vector CPUs


Computation of the mtx-vec product based on storage scheme on vector CPUs

For $i = 1, \dots, n$:

$c_i = A_i b = \sum_j a_{ij} b_j = \sum_{s=l_i}^{r_i} a_{i,i+s} b_{i+s} = \sum_{s=l_i}^{r_i} \tilde{a}_{i,s} b_{i+s}$

General TRIAD, no SAXPY:

for s = -β : β
   for i = max{1-s, 1} : min{n-s, n}
      c_i = c_i + ã_{i,s} b_{i+s}
   end
end

or, partial DOT-product:

for i = 1 : n
   for s = max{-β, 1-i} : min{β, n-i}
      c_i = c_i + ã_{i,s} b_{i+s}
   end
end

Sparsity: fewer operations, but also a loss of efficiency.
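To make the two loop orders concrete, here is a minimal NumPy sketch, assuming the diagonal-wise band storage ã_{i,s} = a_{i,i+s} in an n x (2β+1) array (the function names and layout are my own):

import numpy as np

def band_matvec_triad(A_band, beta, b):
    # c = A b in TRIAD form: outer loop over the 2*beta+1 diagonals.
    # A_band[i, s + beta] stores a_{i,i+s}; Python indices are 0-based.
    n = len(b)
    c = np.zeros(n)
    for s in range(-beta, beta + 1):
        lo, hi = max(1 - s, 1), min(n - s, n)      # 1-based bounds, as above
        i = np.arange(lo - 1, hi)                  # 0-based row indices
        c[i] += A_band[i, s + beta] * b[i + s]     # one vector TRIAD per diagonal
    return c

def band_matvec_dot(A_band, beta, b):
    # The same product as partial DOT-products: outer loop over the rows.
    n = len(b)
    c = np.zeros(n)
    for i in range(1, n + 1):                      # 1-based row index
        for s in range(max(-beta, 1 - i), min(beta, n - i) + 1):
            c[i - 1] += A_band[i - 1, s + beta] * b[i - 1 + s]
    return c

# Tiny check against a dense tridiagonal matrix (beta = 1):
n, beta = 5, 1
A = np.diag(np.full(n, 2.)) + np.diag(-np.ones(n-1), 1) + np.diag(-np.ones(n-1), -1)
A_band = np.zeros((n, 2*beta + 1))
for s in range(-beta, beta + 1):
    for i in range(max(-s, 0), min(n - s, n)):
        A_band[i, s + beta] = A[i, i + s]
b = np.arange(1., n + 1)
assert np.allclose(band_matvec_triad(A_band, beta, b), A @ b)
assert np.allclose(band_matvec_dot(A_band, beta, b), A @ b)

On a vector CPU the TRIAD form offers long vectors (length about n per diagonal), while the DOT form works row by row with short vectors (length at most 2β+1).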

Band A·b in Parallel

Partitioning: $[1, n] = \bigcup_{r=1}^{R} I_r$, $I_r$ disjoint.

for i ∈ I_r
   $c_i = \sum_{s=l_i}^{r_i} \tilde{a}_{i,s} b_{i+s}$
end

Processor P_r gets the rows with index set $I_r := [m_r, M_r]$ in order to compute its part of the final vector c.

What part of vector b does processor P_r need in order to compute its part of c?

Necessary for $I_r$: $b_j = b_{i+s}$ with

$j = i + s \geq m_r + \max\{-\beta, 1 - m_r\} = \max\{m_r - \beta, 1\}$
$j = i + s \leq M_r + \min\{\beta, n - M_r\} = \min\{M_r + \beta, n\}$

Processor P_r with index set I_r needs from b the indices

$j \in [\max\{1, m_r - \beta\}, \min\{n, M_r + \beta\}]$
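In code this bound is a one-liner; a small sketch with 1-based indices (the helper name b_range is mine):

def b_range(m_r, M_r, n, beta):
    # Indices of b (1-based, inclusive) that processor P_r with rows
    # I_r = [m_r, M_r] needs for the banded product c = A b.
    return max(1, m_r - beta), min(n, M_r + beta)

# Example: n = 100, bandwidth beta = 3, P_r owns rows 41..60:
print(b_range(41, 60, 100, 3))   # (38, 63): its own rows plus beta on each side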

2.6. Analysis of Matrix-Matrix Product

$A = (a_{ij})_{i=1,\dots,n;\ j=1,\dots,m} \in \mathbb{R}^{n \times m}$, $B = (b_{ij})_{i=1,\dots,m;\ j=1,\dots,q} \in \mathbb{R}^{m \times q}$,

$C = AB = (c_{ij})_{i=1,\dots,n;\ j=1,\dots,q} \in \mathbb{R}^{n \times q}$

for i = 1 : n
   for j = 1 : q
      $c_{ij} = \sum_{k=1}^{m} a_{ik} b_{kj}$
   end
end

2.6.1. Vectorization

Algorithm 1: (ijk)-form:

for i = 1 : n
   for j = 1 : q
      for k = 1 : m
         c_{ij} = c_{ij} + a_{ik} b_{kj}    } DOT-product of length m
      end
   end
end

$c_{ij} = A_i B_j$ (row i of A times column j of B) for all i, j.

All entries c_{ij} are fully computed, one after another. Access to A and C is rowwise, to B columnwise (depends on the innermost loops!).

Other View on the Matrix-Matrix Product

Matrix A considered as a combination of columns $A_j$ or of rows $a_i$:

$A = A_1 e_1^T + \dots + A_m e_m^T = (A_1\ 0\ \dots\ 0) + (0\ A_2\ 0\ \dots) + \dots + (0\ \dots\ 0\ A_m)$

$A = e_1 a_1 + \dots + e_n a_n = \begin{pmatrix} a_1 \\ 0 \\ \vdots \end{pmatrix} + \dots + \begin{pmatrix} \vdots \\ 0 \\ a_n \end{pmatrix}$

$AB = \left( \sum_{j=1}^{m} A_j e_j^T \right) \left( \sum_{k=1}^{m} e_k b_k \right) = \sum_{k,j} A_j (e_j^T e_k) b_k = \sum_{k=1}^{m} \underbrace{A_k b_k}_{\text{full } n \times q \text{ matrices}}$

AB as a sum of full matrices $A_k b_k$, each the outer product of the k-th column of A and the k-th row of B.
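A quick NumPy check of this outer-product identity (shapes chosen arbitrarily):

import numpy as np

rng = np.random.default_rng(0)
n, m, q = 4, 3, 5
A = rng.standard_normal((n, m))
B = rng.standard_normal((m, q))

# AB as the sum of m rank-1 matrices: outer product of column k of A
# with row k of B.
C = sum(np.outer(A[:, k], B[k, :]) for k in range(m))
assert np.allclose(C, A @ B)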

Algorithm 2: (jki)-form

for j = 1 : q
   for k = 1 : m
      for i = 1 : n
         c_{ij} = c_{ij} + a_{ik} b_{kj}
      end
   end
end

Vector update: $c_j = c_j + a_k b_{kj}$

Sequence of SAXPYs for the same vector: $c_j = \sum_k b_{kj} a_k$ (GAXPY)

C computed columnwise; access to A columnwise. Access to B columnwise, but delayed.

Algorithm 3: (kji)-form

for k = 1 : m
   for j = 1 : q
      for i = 1 : n
         c_{ij} = c_{ij} + a_{ik} b_{kj}
      end
   end
end

Vector update: $c_j = c_j + a_k b_{kj}$

Sequence of SAXPYs for different vectors $c_j$ (no GAXPY).

Access to A columnwise. Access to B rowwise + delayed. C computed with intermediate values $c_{ij}^{(k)}$ which are computed columnwise.

Overview of Different Forms

                      ijk       ikj       kij       jik       jki       kji
                      (Alg. 1)                                (Alg. 2)  (Alg. 3)
Access to A           row       row       column    row       column    column
Access to B           column    row       row       column    column    row
Computation of c_ij   direct    delayed   delayed   direct    delayed   delayed
Vector operation      DOT       GAXPY     SAXPY     DOT       GAXPY     SAXPY
Vector length         m         q         q         m         n         n

Better: GAXPY (longer vector length). Access to the matrices according to the storage scheme (rowwise or columnwise).
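As a sanity check, a small pure-Python/NumPy sketch of the loop reordering (plain triple loops just to make the orders concrete; real code would call BLAS):

import numpy as np

def matmul(A, B, order="ijk"):
    # Triple-loop C = A B with a configurable nesting of the i, j, k loops.
    n, m = A.shape
    q = B.shape[1]
    C = np.zeros((n, q))
    ranges = {"i": range(n), "j": range(q), "k": range(m)}
    for x in ranges[order[0]]:
        for y in ranges[order[1]]:
            for z in ranges[order[2]]:
                idx = dict(zip(order, (x, y, z)))
                i, j, k = idx["i"], idx["j"], idx["k"]
                C[i, j] += A[i, k] * B[k, j]
    return C

A = np.random.rand(4, 3)
B = np.random.rand(3, 5)
for order in ("ijk", "jki", "kji"):
    assert np.allclose(matmul(A, B, order), A @ B)   # all orders agree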

2.6.2. Matrix-Matrix Product in Parallel

$[1, n] = \bigcup_{r=1}^{R} I_r$, $[1, m] = \bigcup_{s=1}^{S} K_s$, $[1, q] = \bigcup_{t=1}^{T} J_t$

Distribute the blocks relative to the index sets I_r, K_s, and J_t to the processor array P_rst:

1. Processor P_rst computes a small matrix-matrix product; all processors in parallel:
   $c_{rt}^{(s)} = A_{rs} B_{st}$

2. Compute the sum by fan-in in s:
   $c_{rt} = \sum_{s=1}^{S} c_{rt}^{(s)}$
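A serial NumPy sketch of this scheme (the loops stand in for the R·S·T processors; array_split and the function name are my choices):

import numpy as np

def block_matmul(A, B, R=2, S=2, T=2):
    # Split [1,n], [1,m], [1,q] into R, S, T contiguous index sets.
    # Each 'processor' P_rst forms A_rs @ B_st; the partial products
    # are then summed over s (a fan-in tree on a real machine).
    n, m = A.shape
    q = B.shape[1]
    I = np.array_split(np.arange(n), R)
    K = np.array_split(np.arange(m), S)
    J = np.array_split(np.arange(q), T)
    C = np.zeros((n, q))
    for r in range(R):
        for t in range(T):
            C[np.ix_(I[r], J[t])] = sum(
                A[np.ix_(I[r], K[s])] @ B[np.ix_(K[s], J[t])] for s in range(S))
    return C

A = np.random.rand(5, 6)
B = np.random.rand(6, 7)
assert np.allclose(block_matmul(A, B), A @ B)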

Mtx-Mtx in Parallel: Special Case S = 1

Each processor P_rt can compute its part c_rt of C independently, without communication.

Each processor needs the full block of rows of A relative to index set I_r, and the full block of columns of B relative to index set J_t, to compute c_rt relative to rows I_r and columns J_t.

Mtx-Mtx in Parallel: Special Case S = 1 (cont.)

With $n \cdot q$ processors, each processor has to compute one DOT-product

$c_{rt} = \sum_{k=1}^{m} a_{rk} b_{kt}$

with O(m) parallel time steps.

Fan-in by $m \cdot n \cdot q$ additional processors for all DOT-products reduces the number of parallel time steps to O(log(m)).

1D-Parallelization of A · B

1D: p processors in a line; each processor gets the full A and a column slice of B, computing the related column slice of C = AB.

Communication: $N^2 \cdot p$ for A and $(N \cdot \frac{N}{p}) \cdot p = N^2$ for B

Granularity: $\frac{N^3}{N^2 (1 + p)} = \frac{N}{1 + p}$

Blocking only in i, the columns of B!

for i = 1 : n
   for j = 1 : n
      for k = 1 : n
         C_{j,i} = C_{j,i} + A_{j,k} B_{k,i}

2D-Parallelization of A · B

2D: p processors in a square, q := $\sqrt{p}$; each processor gets a row slice of A and a column slice of B, computing a full subblock of C = AB.

Communication: $N^2 \sqrt{p}$ for A and $N^2 \sqrt{p}$ for B

Granularity: $\frac{N^3}{2 N^2 \sqrt{p}} = \frac{N}{2 \sqrt{p}}$

Blocking in i and j, the columns of B and the rows of A!

for i = 1 : n
   for j = 1 : n
      for k = 1 : n
         C_{j,i} = C_{j,i} + A_{j,k} B_{k,i}

3D-Parallelization of A · B

3D: p processors in a cube, q := $p^{1/3}$; each processor gets a subblock of A and a subblock of B, computing part of a subblock of C = AB. Additional fan-in to collect the parts into a full subblock of C.

Communication: $N^2 p^{1/3}$ for A and for B ($= p \cdot \frac{N^2}{p^{2/3}} = p \cdot$ blocksize); fan-in: $N^2 p^{1/3}$

Granularity: $\frac{N^3}{3 N^2 p^{1/3}} = \frac{N}{3 p^{1/3}}$

Blocking in i, j, and k!

for i = 1 : n
   for j = 1 : n
      for k = 1 : n
         C_{j,i} = C_{j,i} + A_{j,k} B_{k,i}
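Plugging numbers into the three granularity formulas shows why the 3D layout scales best (N and p are arbitrary illustration values of mine):

# Granularity = computation/communication for the 1D, 2D, 3D layouts:
# N/(1+p), N/(2*sqrt(p)), N/(3*p**(1/3)), per the formulas above.
N, p = 4096, 64
print("1D:", N / (1 + p))            # ~63
print("2D:", N / (2 * p ** 0.5))     # 256.0
print("3D:", N / (3 * p ** (1/3)))   # ~341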

3. Linear Systems of Equations with Dense Matrices

Contents

1 Introduction
 1.1 Computer Science Aspects
 1.2 Numerical Problems
 1.3 Graphs
 1.4 Loop Manipulations
2 Elementary Linear Algebra Problems
 2.1 BLAS: Basic Linear Algebra Subroutines
 2.2 Matrix-Vector Operations
 2.3 Matrix-Matrix-Product
3 Linear Systems of Equations with Dense Matrices
 3.1 Gaussian Elimination
 3.2 Parallelization
 3.3 QR-Decomposition with Householder matrices
4 Sparse Matrices
 4.1 General Properties, Storage
 4.2 Sparse Matrices and Graphs
 4.3 Reordering
 4.4 Gaussian Elimination for Sparse Matrices
5 Iterative Methods for Sparse Matrices
 5.1 Stationary Methods
 5.2 Nonstationary Methods
 5.3 Preconditioning
6 Domain Decomposition
 6.1 Overlapping Domain Decomposition
 6.2 Non-overlapping Domain Decomposition
 6.3 Schur Complements

3.1. Gaussian Elimination
3.1.1. Basic Properties

Linear system of equations:

a_{11} x_1 + ... + a_{1n} x_n = b_1
   ...
a_{n1} x_1 + ... + a_{nn} x_n = b_n

Solve Ax = b:

$\begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{n1} & \cdots & a_{nn} \end{pmatrix} \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} = \begin{pmatrix} b_1 \\ \vdots \\ b_n \end{pmatrix}$

Generate simpler linear equations (matrices). Transform A into triangular form:

$A = A^{(1)} \rightarrow A^{(2)} \rightarrow \dots \rightarrow A^{(n)} = U$

Transformation to Upper Triangular Form

$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix}$

Row transformations $(2) \leftarrow (2) - \frac{a_{21}}{a_{11}} (1), \dots, (n) \leftarrow (n) - \frac{a_{n1}}{a_{11}} (1)$ lead to

$A^{(2)} = \begin{pmatrix} a_{11} & a_{12} & a_{13} & \cdots & a_{1n} \\ 0 & a_{22}^{(2)} & a_{23}^{(2)} & \cdots & a_{2n}^{(2)} \\ 0 & a_{32}^{(2)} & a_{33}^{(2)} & \cdots & a_{3n}^{(2)} \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & a_{n2}^{(2)} & a_{n3}^{(2)} & \cdots & a_{nn}^{(2)} \end{pmatrix}$

Next transformations: $(3) \leftarrow (3) - \frac{a_{32}^{(2)}}{a_{22}^{(2)}} (2), \dots, (n) \leftarrow (n) - \frac{a_{n2}^{(2)}}{a_{22}^{(2)}} (2)$

Transformation to Triangular Form (cont.)

$A^{(3)} = \begin{pmatrix} a_{11} & a_{12} & a_{13} & \cdots & a_{1n} \\ 0 & a_{22}^{(2)} & a_{23}^{(2)} & \cdots & a_{2n}^{(2)} \\ 0 & 0 & a_{33}^{(3)} & \cdots & a_{3n}^{(3)} \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & a_{n3}^{(3)} & \cdots & a_{nn}^{(3)} \end{pmatrix}$

Next transformations: $(4) \leftarrow (4) - \frac{a_{43}^{(3)}}{a_{33}^{(3)}} (3), \dots, (n) \leftarrow (n) - \frac{a_{n3}^{(3)}}{a_{33}^{(3)}} (3)$

$A^{(n)} = \begin{pmatrix} a_{11} & a_{12} & a_{13} & \cdots & a_{1n} \\ 0 & a_{22}^{(2)} & a_{23}^{(2)} & \cdots & a_{2n}^{(2)} \\ 0 & 0 & a_{33}^{(3)} & \cdots & a_{3n}^{(3)} \\ \vdots & & & \ddots & \vdots \\ 0 & 0 & \cdots & 0 & a_{nn}^{(n)} \end{pmatrix} = U$

Pseudocode Gaussian Elimination (GE)

Simplification: assume that no pivoting is necessary, i.e. $a_{kk}^{(k)} \neq 0$ (or $|a_{kk}^{(k)}| \geq \rho > 0$) for k = 1, 2, ..., n.

for k = 1 : n-1
   for i = k+1 : n
      l_{i,k} = a_{i,k} / a_{k,k}
   end
   for i = k+1 : n
      for j = k+1 : n
         a_{i,j} = a_{i,j} - l_{i,k} a_{k,j}
      end
   end
end

In practice: include pivoting and include the right-hand side b. There still remains a triangular system in U to solve!
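A direct NumPy transcription of this pseudocode (no pivoting, so it presumes nonzero pivots; lu_nopivot is my name):

import numpy as np

def lu_nopivot(A):
    # Gaussian elimination, kij form: returns unit lower triangular L and U.
    A = A.astype(float).copy()
    n = A.shape[0]
    L = np.eye(n)
    for k in range(n - 1):
        L[k+1:, k] = A[k+1:, k] / A[k, k]                   # multipliers l_{i,k}
        A[k+1:, k+1:] -= np.outer(L[k+1:, k], A[k, k+1:])   # update trailing block
        A[k+1:, k] = 0.0                                    # eliminated entries
    return L, np.triu(A)

A = np.array([[4., 3., 2.],
              [2., 4., 1.],
              [1., 2., 3.]])
L, U = lu_nopivot(A)
assert np.allclose(L @ U, A)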

Intermediate Systems

$A^{(k)}$, k = 1, 2, ..., n, with $A = A^{(1)}$ and $U = A^{(n)}$:

$A^{(k)} = \begin{pmatrix} a_{11}^{(1)} & \cdots & a_{1,k-1}^{(1)} & a_{1,k}^{(1)} & \cdots & a_{1,n}^{(1)} \\ 0 & \ddots & \vdots & \vdots & & \vdots \\ \vdots & & a_{k-1,k-1}^{(k-1)} & a_{k-1,k}^{(k-1)} & \cdots & a_{k-1,n}^{(k-1)} \\ 0 & \cdots & 0 & a_{k,k}^{(k)} & \cdots & a_{k,n}^{(k)} \\ \vdots & & \vdots & \vdots & & \vdots \\ 0 & \cdots & 0 & a_{n,k}^{(k)} & \cdots & a_{n,n}^{(k)} \end{pmatrix}$

Define Auxiliary Matrices

$L = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ l_{2,1} & 1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ l_{n,1} & \cdots & l_{n,n-1} & 1 \end{pmatrix}$ and $U = A^{(n)}$

$L_k$ contains the multipliers $l_{k+1,k}, \dots, l_{n,k}$ in column k below the diagonal and zeros everywhere else:

$L_k := \begin{pmatrix} 0 & \cdots & 0 & 0 & \cdots & 0 \\ \vdots & & \vdots & \vdots & & \vdots \\ 0 & \cdots & 0 & 0 & \cdots & 0 \\ 0 & \cdots & l_{k+1,k} & 0 & \cdots & 0 \\ \vdots & & \vdots & \vdots & & \vdots \\ 0 & \cdots & l_{n,k} & 0 & \cdots & 0 \end{pmatrix}, \qquad L = I + \sum_k L_k$

Elimination Step in Terms of Auxiliary Matrices

$A^{(k+1)} = (I - L_k) A^{(k)} = A^{(k)} - L_k A^{(k)}$

$U = A^{(n)} = (I - L_{n-1}) A^{(n-1)} = \dots = (I - L_{n-1}) \cdots (I - L_1) A^{(1)} =: \tilde{L} A$

with $\tilde{L} := (I - L_{n-1}) \cdots (I - L_1)$, hence $A = \tilde{L}^{-1} U$ with U upper triangular and $\tilde{L}^{-1}$ lower triangular.

Theorem 2: $\tilde{L}^{-1} = L$ and therefore $A = LU$.

Advantage: every further problem $Ax = b_j$ can be reduced to $(LU) x = b_j$ for arbitrary j. Solve two triangular problems: $Ly = b$ and $Ux = y$.
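A quick NumPy verification of Theorem 2 for arbitrary multipliers (a numerical check, not a proof; the variable names are mine):

import numpy as np

rng = np.random.default_rng(1)
n = 5
# Unit lower triangular L with arbitrary multipliers l_{i,k}.
L = np.eye(n) + np.tril(rng.standard_normal((n, n)), -1)

# L_k: column k of the strict lower triangle of L, zeros elsewhere.
Lk = []
for k in range(n - 1):
    M = np.zeros((n, n))
    M[k+1:, k] = L[k+1:, k]
    Lk.append(M)

# Ltilde = (I - L_{n-1}) ... (I - L_1): build by multiplying from the left.
Ltilde = np.eye(n)
for M in Lk:
    Ltilde = (np.eye(n) - M) @ Ltilde

# Theorem 2: inv(Ltilde) = I + L_1 + ... + L_{n-1} = L.
assert np.allclose(np.linalg.inv(Ltilde), L)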

Theorem 2: $\tilde{L}^{-1} = L$, i.e. $A = LU$

Proof. For $i \leq j$ we have $L_i L_j = 0$: the nonzeros of $L_j$ sit in rows $j+1, \dots, n$ of column j, while $L_i$ has nonzero entries only in column $i \leq j$. Hence:

$(I + L_j)(I - L_j) = I + L_j - L_j - L_j^2 = I \ \Rightarrow\ (I - L_j)^{-1} = I + L_j$

$(I + L_i)(I + L_j) = I + L_i + L_j + \underbrace{L_i L_j}_{=0} = I + L_i + L_j$

Therefore

$\tilde{L}^{-1} = [(I - L_{n-1}) \cdots (I - L_1)]^{-1} = (I - L_1)^{-1} \cdots (I - L_{n-1})^{-1} = (I + L_1)(I + L_2) \cdots (I + L_{n-1}) = I + L_1 + L_2 + \dots + L_{n-1} = L$

Vectorization of GE

(kij)-form (standard form):

for k = 1 : n-1
   for i = k+1 : n
      l_{i,k} = a_{i,k} / a_{k,k};
   end
   for i = k+1 : n
      for j = k+1 : n
         a_{i,j} = a_{i,j} - l_{i,k} a_{k,j};
      end
   end
end

Vector operation: SAXPY in rows a_i and a_k. No GAXPY.

U is computed rowwise, L columnwise.

[Figure: partitioning of A^{(k)}: the finished part of U (already computed, remains unchanged), the finished part of L (already computed, not used anymore), and the trailing block of A^{(k)} (newly computed, updated in every step).]

The standard (kij) form is also called right-looking GE.

[Figures, right-looking elimination steps:]
First elimination step: compute the first column of L; update A^{(1)}.
Second step: compute the second column of L; update A^{(2)}.
Third step: compute the third column of L; update A^{(3)}.
(k-1)-st step: compute the k-th column of L; update A^{(k)}.

Rules for the different (i,j,k)-forms:

In the following we again interchange the k, i, j loops. Necessary conditions:

$1 \leq k < i \leq n, \qquad 1 \leq k < j \leq n$

Furthermore:
- The innermost index (i, j, or k) determines whether the computation is done row-, column-, or block-wise.
- The outermost index shows how the final parts are derived.
- The weights $l_{j,k}$ have to be computed before they are used to eliminate the related entries.

(ikj)-form:

for i = 2 : n
   for k = 1 : i-1
      l_{i,k} = a_{i,k} / a_{k,k};
      for j = k+1 : n
         a_{i,j} = a_{i,j} - l_{i,k} a_{k,j};
      end
   end
end

GAXPY in row $a_{i,\cdot}$

[Figure: in the (ikj)-form, the finished rows of L (already computed, not used anymore) and of U (already computed and used), row i newly computed, the rest of A unchanged and not yet used.]

L and U are computed rowwise. Compute $l_{i,1}$, then a SAXPY with the 1st and the i-th row; then $l_{i,2}$, and so on.

[Figures: first step on A^{(1)}; second step on A^{(2)}; ...; (k-1)-st step with the finished parts of L and U and the remaining A^{(k-1)}.]

(ijk)-form:

for i = 2 : n
   for j = 2 : i
      l_{i,j-1} = a_{i,j-1} / a_{j-1,j-1};      % new row of L
      for k = 1 : j-1
         a_{i,j} = a_{i,j} - l_{i,k} a_{k,j};   % DOT-product, left part
      end
   end
   for j = i+1 : n
      for k = 1 : i-1
         a_{i,j} = a_{i,j} - l_{i,k} a_{k,j};   % DOT-product, right part
      end
   end
end

Compute $l_{i,1}$ and update $a_{i,2}$; then compute $l_{i,2}$ and update $a_{i,3}$ (now using $l_{i,1}$ and $l_{i,2}$); and so on. The entries $a_{i,j}$ are accumulated via DOT-products.

(jki)-form:

for j = 2 : n
   for k = j : n
      l_{k,j-1} = a_{k,j-1} / a_{j-1,j-1};      % new column of L
   end
   for k = 1 : j-1
      for i = k+1 : n
         a_{i,j} = a_{i,j} - l_{i,k} a_{k,j};
      end
   end
end

GAXPY in column $a_{\cdot,j}$

Left-looking GE

[Figure: the finished part of U (computed, not used), the finished part of L (already computed and used), column j-1 newly computed, the rest of A unchanged and not yet used.]

[Figures: first step; second step; ...; (k-1)-st step of left-looking GE, with the finished parts of U and L and the untouched rest of A.]

Overview

                    kij      kji      ikj      ijk      jki      jik
Access to A and U   row      column   row      column   column   column
Access to L         ---      column   ---      row      column   row
Computation of U    row      row      row      row      column   column
Computation of L    column   column   row      row      column   column
Vector operation    SAXPY    SAXPY    GAXPY    DOT      GAXPY    DOT
Vector length       2n/3     2n/3     2n/3     n/3      2n/3     n/3

Vector length = average of the occurring vector lengths.

The optimal form depends on the storage of the matrices and on the vector length.
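To contrast two of these forms in runnable code, a NumPy sketch of the right-looking kij form next to a left-looking jki-style form; both yield the same packed LU factors (no pivoting; the names are mine):

import numpy as np

def lu_kij(A):
    # Right-looking GE: at step k, scale column k, then rank-1 update.
    A = A.astype(float).copy(); n = len(A)
    for k in range(n - 1):
        A[k+1:, k] /= A[k, k]                   # column k of L, stored in A
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
    return A                                    # L (strict lower) and U, packed

def lu_jki(A):
    # Left-looking, jki-style: finish column j using all previous columns of L.
    A = A.astype(float).copy(); n = len(A)
    for j in range(n):
        for k in range(j):                      # sequence of SAXPYs (a GAXPY)
            A[k+1:, j] -= A[k+1:, k] * A[k, j]
        if j < n - 1:
            A[j+1:, j] /= A[j, j]               # new column of L
    return A

A = np.random.rand(6, 6) + 6 * np.eye(6)        # safely nonzero pivots
assert np.allclose(lu_kij(A), lu_jki(A))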

3.2. GE in Parallel: Blockwise

Main idea: blocking of GE to avoid data transfer between processors.

Basic concept: replace GE, i.e. one large LU-decomposition of the full matrix, by a sequence of small block operations:
- solving collections of small triangular systems $L U_k = B_k$ (parallelism in the columns of U)
- updating matrices, $A \leftarrow A - L U$ (also easy to parallelize)
- small LU-decompositions $B = LU$ (parallelism in the rows of B)

How to Choose Blocks in L, resp. U, Satisfying LU = A

$\begin{pmatrix} L_{11} & 0 & 0 \\ L_{21} & L_{22} & 0 \\ L_{31} & L_{32} & L_{33} \end{pmatrix} \begin{pmatrix} U_{11} & U_{12} & U_{13} \\ 0 & U_{22} & U_{23} \\ 0 & 0 & U_{33} \end{pmatrix} = \begin{pmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{pmatrix}$

$= \begin{pmatrix} L_{11} U_{11} & L_{11} U_{12} & L_{11} U_{13} \\ L_{21} U_{11} & L_{21} U_{12} + L_{22} U_{22} & L_{21} U_{13} + L_{22} U_{23} \\ L_{31} U_{11} & L_{31} U_{12} + L_{32} U_{22} & L_{31} U_{13} + L_{32} U_{23} + L_{33} U_{33} \end{pmatrix}$

Different ways of computing L and U, depending on:
- the start (assume the first entry/row/column of L/U as given)
- how to compute a new entry/row/column of L/U
- the update of the block structure of L/U by grouping into known blocks, blocks newly to compute, and blocks to be computed later

Crout Form

[Figure: block partitioning of L, U, and A for the Crout form.]

Crout Form (cont.)

1. Solve for $L_{22}$, $L_{32}$, and $U_{22}$ by a small LU-decomposition of the modified part of A.
2. Solve for $U_{23}$ by solving small triangular systems of equations in $L_{22}$.

Initial steps: $L_{11} U_{11} = A_{11}$, $\begin{pmatrix} L_{21} \\ L_{31} \end{pmatrix} U_{11} = \begin{pmatrix} A_{21} \\ A_{31} \end{pmatrix}$, $L_{11} (U_{12}\ U_{13}) = (A_{12}\ A_{13})$
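A serial NumPy sketch of a blockwise LU in this spirit (a right-looking blocked variant rather than the exact Crout bookkeeping; the block size nb and the function names are mine, and the explicit inverses stand in for the small triangular solves):

import numpy as np

def lu_unblocked(A):
    # Small in-place LU without pivoting; L (strict lower) and U packed in A.
    A = A.copy(); n = len(A)
    for k in range(n - 1):
        A[k+1:, k] /= A[k, k]
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
    return A

def block_lu(A, nb=2):
    # Per block step: small LU of the diagonal block, triangular solves for
    # the L- and U-panels (parallelizable over their rows/columns), then one
    # matrix-matrix update of the trailing part.
    A = A.astype(float).copy(); n = len(A)
    for k in range(0, n, nb):
        e = min(k + nb, n)
        A[k:e, k:e] = lu_unblocked(A[k:e, k:e])
        L11 = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
        U11 = np.triu(A[k:e, k:e])
        if e < n:
            A[e:, k:e] = A[e:, k:e] @ np.linalg.inv(U11)   # L21 U11 = A21
            A[k:e, e:] = np.linalg.inv(L11) @ A[k:e, e:]   # L11 U12 = A12
            A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]           # trailing update
    return A

A = np.random.rand(6, 6) + 6 * np.eye(6)
assert np.allclose(block_lu(A, nb=2), lu_unblocked(A))     # same factors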

New Partitioning

Combine the already computed parts from the second column of L and the second row of U into the first column of L and the first row of U. Split the until now ignored parts $L_{33}$ and $U_{33}$ into new columns/rows.

Repeat this overall procedure until L and U are fully computed.

Block Structure

Intermediate block structure: solve for the red blocks. Then reconfigure the block structure and repeat until done.

Left-Looking GE

Solve $L_{11} U_{12} = A_{12}$ by a couple of parallel triangular solves, and

$\begin{pmatrix} L_{22} \\ L_{32} \end{pmatrix} U_{22} = \begin{pmatrix} A_{22} \\ A_{32} \end{pmatrix} - \begin{pmatrix} L_{21} \\ L_{31} \end{pmatrix} U_{12} =: \begin{pmatrix} \hat{A}_{22} \\ \hat{A}_{32} \end{pmatrix}$

i.e. update this part of A and perform a small LU-decomposition.

Reorder the blocks and repeat until ready.

Start: $L_{11} U_{11} = A_{11}$, $L_{21} U_{11} = A_{21}$, and $L_{31} U_{11} = A_{31}$.

Block Structure

Intermediate block structure: solve for the red blocks. Then reconfigure the block structure and repeat until done.