Computation of the mtx-vec product based on storage scheme on vector CPUs


Computation of the mtx-vec product based on storage scheme on vector CPUs

For $i = 1, \dots, n$:

$c_i = A_i b = \sum_j a_{ij} b_j = \sum_{s=l_i}^{r_i} a_{i,i+s} b_{i+s} = \sum_{s=l_i}^{r_i} \tilde{a}_{i,s} b_{i+s}$

General TRIAD, no SAXPY:

for s = -β : β
   for i = max{1-s, 1} : min{n-s, n}
      c_i = c_i + ã_{i,s} b_{i+s}
   end
end

or, partial DOT-product:

for i = 1 : n
   for s = max{-β, 1-i} : min{β, n-i}
      c_i = c_i + ã_{i,s} b_{i+s}
   end
end

Sparsity: fewer operations, but also a loss of efficiency.
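To make the two loop orders concrete, here is a minimal NumPy sketch, assuming the diagonal-wise band storage ã_{i,s} = a_{i,i+s} in an n x (2β+1) array (the function names and layout are my own):

import numpy as np

def band_matvec_triad(A_band, beta, b):
    # c = A b in TRIAD form: outer loop over the 2*beta+1 diagonals.
    # A_band[i, s + beta] stores a_{i,i+s}; Python indices are 0-based.
    n = len(b)
    c = np.zeros(n)
    for s in range(-beta, beta + 1):
        lo, hi = max(1 - s, 1), min(n - s, n)      # 1-based bounds, as above
        i = np.arange(lo - 1, hi)                  # 0-based row indices
        c[i] += A_band[i, s + beta] * b[i + s]     # one vector TRIAD per diagonal
    return c

def band_matvec_dot(A_band, beta, b):
    # The same product as partial DOT-products: outer loop over the rows.
    n = len(b)
    c = np.zeros(n)
    for i in range(1, n + 1):                      # 1-based row index
        for s in range(max(-beta, 1 - i), min(beta, n - i) + 1):
            c[i - 1] += A_band[i - 1, s + beta] * b[i - 1 + s]
    return c

# Tiny check against a dense tridiagonal matrix (beta = 1):
n, beta = 5, 1
A = np.diag(np.full(n, 2.)) + np.diag(-np.ones(n-1), 1) + np.diag(-np.ones(n-1), -1)
A_band = np.zeros((n, 2*beta + 1))
for s in range(-beta, beta + 1):
    for i in range(max(-s, 0), min(n - s, n)):
        A_band[i, s + beta] = A[i, i + s]
b = np.arange(1., n + 1)
assert np.allclose(band_matvec_triad(A_band, beta, b), A @ b)
assert np.allclose(band_matvec_dot(A_band, beta, b), A @ b)

On a vector CPU the TRIAD form offers long vectors (length about n per diagonal), while the DOT form works row by row with short vectors (length at most 2β+1).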

Band A·b in Parallel

Partitioning: $[1, n] = \bigcup_{r=1}^{R} I_r$, $I_r$ disjoint.

for i ∈ I_r
   $c_i = \sum_{s=l_i}^{r_i} \tilde{a}_{i,s} b_{i+s}$
end

Processor P_r gets the rows with index set $I_r := [m_r, M_r]$ in order to compute its part of the final vector c.

What part of vector b does processor P_r need in order to compute its part of c?

Necessary for $I_r$: $b_j = b_{i+s}$ with

$j = i + s \geq m_r + \max\{-\beta, 1 - m_r\} = \max\{m_r - \beta, 1\}$
$j = i + s \leq M_r + \min\{\beta, n - M_r\} = \min\{M_r + \beta, n\}$

Processor P_r with index set I_r needs from b the indices

$j \in [\max\{1, m_r - \beta\}, \min\{n, M_r + \beta\}]$
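In code this bound is a one-liner; a small sketch with 1-based indices (the helper name b_range is mine):

def b_range(m_r, M_r, n, beta):
    # Indices of b (1-based, inclusive) that processor P_r with rows
    # I_r = [m_r, M_r] needs for the banded product c = A b.
    return max(1, m_r - beta), min(n, M_r + beta)

# Example: n = 100, bandwidth beta = 3, P_r owns rows 41..60:
print(b_range(41, 60, 100, 3))   # (38, 63): its own rows plus beta on each side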

2.6. Analysis of Matrix-Matrix Product

$A = (a_{ij})_{i=1,\dots,n;\ j=1,\dots,m} \in \mathbb{R}^{n \times m}$, $B = (b_{ij})_{i=1,\dots,m;\ j=1,\dots,q} \in \mathbb{R}^{m \times q}$,

$C = AB = (c_{ij})_{i=1,\dots,n;\ j=1,\dots,q} \in \mathbb{R}^{n \times q}$

for i = 1 : n
   for j = 1 : q
      $c_{ij} = \sum_{k=1}^{m} a_{ik} b_{kj}$
   end
end

2.6.1. Vectorization

Algorithm 1: (ijk)-form:

for i = 1 : n
   for j = 1 : q
      for k = 1 : m
         c_{ij} = c_{ij} + a_{ik} b_{kj}    } DOT-product of length m
      end
   end
end

$c_{ij} = A_i B_j$ (row i of A times column j of B) for all i, j.

All entries c_{ij} are fully computed, one after another. Access to A and C is rowwise, to B columnwise (depends on the innermost loops!).

Other View on the Matrix-Matrix Product

Matrix A considered as a combination of columns $A_j$ or of rows $a_i$:

$A = A_1 e_1^T + \dots + A_m e_m^T = (A_1\ 0\ \dots\ 0) + (0\ A_2\ 0\ \dots) + \dots + (0\ \dots\ 0\ A_m)$

$A = e_1 a_1 + \dots + e_n a_n = \begin{pmatrix} a_1 \\ 0 \\ \vdots \end{pmatrix} + \dots + \begin{pmatrix} \vdots \\ 0 \\ a_n \end{pmatrix}$

$AB = \left( \sum_{j=1}^{m} A_j e_j^T \right) \left( \sum_{k=1}^{m} e_k b_k \right) = \sum_{k,j} A_j (e_j^T e_k) b_k = \sum_{k=1}^{m} \underbrace{A_k b_k}_{\text{full } n \times q \text{ matrices}}$

AB as a sum of full matrices $A_k b_k$, each the outer product of the k-th column of A and the k-th row of B.
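A quick NumPy check of this outer-product identity (shapes chosen arbitrarily):

import numpy as np

rng = np.random.default_rng(0)
n, m, q = 4, 3, 5
A = rng.standard_normal((n, m))
B = rng.standard_normal((m, q))

# AB as the sum of m rank-1 matrices: outer product of column k of A
# with row k of B.
C = sum(np.outer(A[:, k], B[k, :]) for k in range(m))
assert np.allclose(C, A @ B)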

Algorithm 2: (jki)-form

for j = 1 : q
   for k = 1 : m
      for i = 1 : n
         c_{ij} = c_{ij} + a_{ik} b_{kj}
      end
   end
end

Vector update: $c_j = c_j + a_k b_{kj}$

Sequence of SAXPYs for the same vector: $c_j = \sum_k b_{kj} a_k$ (GAXPY)

C computed columnwise; access to A columnwise. Access to B columnwise, but delayed.

Algorithm 3: (kji)-form

for k = 1 : m
   for j = 1 : q
      for i = 1 : n
         c_{ij} = c_{ij} + a_{ik} b_{kj}
      end
   end
end

Vector update: $c_j = c_j + a_k b_{kj}$

Sequence of SAXPYs for different vectors $c_j$ (no GAXPY).

Access to A columnwise. Access to B rowwise + delayed. C computed with intermediate values $c_{ij}^{(k)}$ which are computed columnwise.

Overview of Different Forms

                      ijk       ikj       kij       jik       jki       kji
                      (Alg. 1)                                (Alg. 2)  (Alg. 3)
Access to A           row       row       column    row       column    column
Access to B           column    row       row       column    column    row
Computation of c_ij   direct    delayed   delayed   direct    delayed   delayed
Vector operation      DOT       GAXPY     SAXPY     DOT       GAXPY     SAXPY
Vector length         m         q         q         m         n         n

Better: GAXPY (longer vector length). Access to the matrices according to the storage scheme (rowwise or columnwise).
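As a sanity check, a small pure-Python/NumPy sketch of the loop reordering (plain triple loops just to make the orders concrete; real code would call BLAS):

import numpy as np

def matmul(A, B, order="ijk"):
    # Triple-loop C = A B with a configurable nesting of the i, j, k loops.
    n, m = A.shape
    q = B.shape[1]
    C = np.zeros((n, q))
    ranges = {"i": range(n), "j": range(q), "k": range(m)}
    for x in ranges[order[0]]:
        for y in ranges[order[1]]:
            for z in ranges[order[2]]:
                idx = dict(zip(order, (x, y, z)))
                i, j, k = idx["i"], idx["j"], idx["k"]
                C[i, j] += A[i, k] * B[k, j]
    return C

A = np.random.rand(4, 3)
B = np.random.rand(3, 5)
for order in ("ijk", "jki", "kji"):
    assert np.allclose(matmul(A, B, order), A @ B)   # all orders agree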

2.6.2. Matrix-Matrix Product in Parallel

$[1, n] = \bigcup_{r=1}^{R} I_r$, $[1, m] = \bigcup_{s=1}^{S} K_s$, $[1, q] = \bigcup_{t=1}^{T} J_t$

Distribute the blocks relative to the index sets I_r, K_s, and J_t to the processor array P_rst:

1. Processor P_rst computes a small matrix-matrix product; all processors in parallel:
   $c_{rt}^{(s)} = A_{rs} B_{st}$

2. Compute the sum by fan-in in s:
   $c_{rt} = \sum_{s=1}^{S} c_{rt}^{(s)}$
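A serial NumPy sketch of this scheme (the loops stand in for the R·S·T processors; array_split and the function name are my choices):

import numpy as np

def block_matmul(A, B, R=2, S=2, T=2):
    # Split [1,n], [1,m], [1,q] into R, S, T contiguous index sets.
    # Each 'processor' P_rst forms A_rs @ B_st; the partial products
    # are then summed over s (a fan-in tree on a real machine).
    n, m = A.shape
    q = B.shape[1]
    I = np.array_split(np.arange(n), R)
    K = np.array_split(np.arange(m), S)
    J = np.array_split(np.arange(q), T)
    C = np.zeros((n, q))
    for r in range(R):
        for t in range(T):
            C[np.ix_(I[r], J[t])] = sum(
                A[np.ix_(I[r], K[s])] @ B[np.ix_(K[s], J[t])] for s in range(S))
    return C

A = np.random.rand(5, 6)
B = np.random.rand(6, 7)
assert np.allclose(block_matmul(A, B), A @ B)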

Mtx-Mtx in Parallel: Special Case S = 1

Each processor P_rt can compute its part c_rt of C independently, without communication.

Each processor needs the full block of rows of A relative to index set I_r, and the full block of columns of B relative to index set J_t, to compute c_rt relative to rows I_r and columns J_t.

Mtx-Mtx in Parallel: Special Case S = 1 (cont.)

With $n \cdot q$ processors, each processor has to compute one DOT-product

$c_{rt} = \sum_{k=1}^{m} a_{rk} b_{kt}$

with O(m) parallel time steps.

Fan-in by $m \cdot n \cdot q$ additional processors for all DOT-products reduces the number of parallel time steps to O(log(m)).

1D-Parallelization of A · B

1D: p processors in a line; each processor gets the full A and a column slice of B, computing the related column slice of C = AB.

Communication: $N^2 \cdot p$ for A and $(N \cdot \frac{N}{p}) \cdot p = N^2$ for B

Granularity: $\frac{N^3}{N^2 (1 + p)} = \frac{N}{1 + p}$

Blocking only in i, the columns of B!

for i = 1 : n
   for j = 1 : n
      for k = 1 : n
         C_{j,i} = C_{j,i} + A_{j,k} B_{k,i}

2D-Parallelization of A · B

2D: p processors in a square, q := $\sqrt{p}$; each processor gets a row slice of A and a column slice of B, computing a full subblock of C = AB.

Communication: $N^2 \sqrt{p}$ for A and $N^2 \sqrt{p}$ for B

Granularity: $\frac{N^3}{2 N^2 \sqrt{p}} = \frac{N}{2 \sqrt{p}}$

Blocking in i and j, the columns of B and the rows of A!

for i = 1 : n
   for j = 1 : n
      for k = 1 : n
         C_{j,i} = C_{j,i} + A_{j,k} B_{k,i}

3D-Parallelization of A · B

3D: p processors in a cube, q := $p^{1/3}$; each processor gets a subblock of A and a subblock of B, computing part of a subblock of C = AB. Additional fan-in to collect the parts into a full subblock of C.

Communication: $N^2 p^{1/3}$ for A and for B ($= p \cdot \frac{N^2}{p^{2/3}} = p \cdot$ blocksize); fan-in: $N^2 p^{1/3}$

Granularity: $\frac{N^3}{3 N^2 p^{1/3}} = \frac{N}{3 p^{1/3}}$

Blocking in i, j, and k!

for i = 1 : n
   for j = 1 : n
      for k = 1 : n
         C_{j,i} = C_{j,i} + A_{j,k} B_{k,i}
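Plugging numbers into the three granularity formulas shows why the 3D layout scales best (N and p are arbitrary illustration values of mine):

# Granularity = computation/communication for the 1D, 2D, 3D layouts:
# N/(1+p), N/(2*sqrt(p)), N/(3*p**(1/3)), per the formulas above.
N, p = 4096, 64
print("1D:", N / (1 + p))            # ~63
print("2D:", N / (2 * p ** 0.5))     # 256.0
print("3D:", N / (3 * p ** (1/3)))   # ~341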

3. Linear Systems of Equations with Dense Matrices

Contents

1 Introduction
 1.1 Computer Science Aspects
 1.2 Numerical Problems
 1.3 Graphs
 1.4 Loop Manipulations
2 Elementary Linear Algebra Problems
 2.1 BLAS: Basic Linear Algebra Subroutines
 2.2 Matrix-Vector Operations
 2.3 Matrix-Matrix-Product
3 Linear Systems of Equations with Dense Matrices
 3.1 Gaussian Elimination
 3.2 Parallelization
 3.3 QR-Decomposition with Householder matrices
4 Sparse Matrices
 4.1 General Properties, Storage
 4.2 Sparse Matrices and Graphs
 4.3 Reordering
 4.4 Gaussian Elimination for Sparse Matrices
5 Iterative Methods for Sparse Matrices
 5.1 Stationary Methods
 5.2 Nonstationary Methods
 5.3 Preconditioning
6 Domain Decomposition
 6.1 Overlapping Domain Decomposition
 6.2 Non-overlapping Domain Decomposition
 6.3 Schur Complements

3.1. Gaussian Elimination
3.1.1. Basic Properties

Linear system of equations:

a_{11} x_1 + ... + a_{1n} x_n = b_1
   ...
a_{n1} x_1 + ... + a_{nn} x_n = b_n

Solve Ax = b:

$\begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{n1} & \cdots & a_{nn} \end{pmatrix} \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} = \begin{pmatrix} b_1 \\ \vdots \\ b_n \end{pmatrix}$

Generate simpler linear equations (matrices). Transform A into triangular form:

$A = A^{(1)} \rightarrow A^{(2)} \rightarrow \dots \rightarrow A^{(n)} = U$

Transformation to Upper Triangular Form

$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix}$

Row transformations $(2) \leftarrow (2) - \frac{a_{21}}{a_{11}} (1), \dots, (n) \leftarrow (n) - \frac{a_{n1}}{a_{11}} (1)$ lead to

$A^{(2)} = \begin{pmatrix} a_{11} & a_{12} & a_{13} & \cdots & a_{1n} \\ 0 & a_{22}^{(2)} & a_{23}^{(2)} & \cdots & a_{2n}^{(2)} \\ 0 & a_{32}^{(2)} & a_{33}^{(2)} & \cdots & a_{3n}^{(2)} \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & a_{n2}^{(2)} & a_{n3}^{(2)} & \cdots & a_{nn}^{(2)} \end{pmatrix}$

Next transformations: $(3) \leftarrow (3) - \frac{a_{32}^{(2)}}{a_{22}^{(2)}} (2), \dots, (n) \leftarrow (n) - \frac{a_{n2}^{(2)}}{a_{22}^{(2)}} (2)$

Transformation to Triangular Form (cont.)

$A^{(3)} = \begin{pmatrix} a_{11} & a_{12} & a_{13} & \cdots & a_{1n} \\ 0 & a_{22}^{(2)} & a_{23}^{(2)} & \cdots & a_{2n}^{(2)} \\ 0 & 0 & a_{33}^{(3)} & \cdots & a_{3n}^{(3)} \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & a_{n3}^{(3)} & \cdots & a_{nn}^{(3)} \end{pmatrix}$

Next transformations: $(4) \leftarrow (4) - \frac{a_{43}^{(3)}}{a_{33}^{(3)}} (3), \dots, (n) \leftarrow (n) - \frac{a_{n3}^{(3)}}{a_{33}^{(3)}} (3)$

$A^{(n)} = \begin{pmatrix} a_{11} & a_{12} & a_{13} & \cdots & a_{1n} \\ 0 & a_{22}^{(2)} & a_{23}^{(2)} & \cdots & a_{2n}^{(2)} \\ 0 & 0 & a_{33}^{(3)} & \cdots & a_{3n}^{(3)} \\ \vdots & & & \ddots & \vdots \\ 0 & 0 & \cdots & 0 & a_{nn}^{(n)} \end{pmatrix} = U$

Pseudocode Gaussian Elimination (GE)

Simplification: assume that no pivoting is necessary, i.e. $a_{kk}^{(k)} \neq 0$ (or $|a_{kk}^{(k)}| \geq \rho > 0$) for k = 1, 2, ..., n.

for k = 1 : n-1
   for i = k+1 : n
      l_{i,k} = a_{i,k} / a_{k,k}
   end
   for i = k+1 : n
      for j = k+1 : n
         a_{i,j} = a_{i,j} - l_{i,k} a_{k,j}
      end
   end
end

In practice: include pivoting and include the right-hand side b. There still remains a triangular system in U to solve!
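A direct NumPy transcription of this pseudocode (no pivoting, so it presumes nonzero pivots; lu_nopivot is my name):

import numpy as np

def lu_nopivot(A):
    # Gaussian elimination, kij form: returns unit lower triangular L and U.
    A = A.astype(float).copy()
    n = A.shape[0]
    L = np.eye(n)
    for k in range(n - 1):
        L[k+1:, k] = A[k+1:, k] / A[k, k]                   # multipliers l_{i,k}
        A[k+1:, k+1:] -= np.outer(L[k+1:, k], A[k, k+1:])   # update trailing block
        A[k+1:, k] = 0.0                                    # eliminated entries
    return L, np.triu(A)

A = np.array([[4., 3., 2.],
              [2., 4., 1.],
              [1., 2., 3.]])
L, U = lu_nopivot(A)
assert np.allclose(L @ U, A)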

Intermediate Systems

$A^{(k)}$, k = 1, 2, ..., n, with $A = A^{(1)}$ and $U = A^{(n)}$:

$A^{(k)} = \begin{pmatrix} a_{11}^{(1)} & \cdots & a_{1,k-1}^{(1)} & a_{1,k}^{(1)} & \cdots & a_{1,n}^{(1)} \\ 0 & \ddots & \vdots & \vdots & & \vdots \\ \vdots & & a_{k-1,k-1}^{(k-1)} & a_{k-1,k}^{(k-1)} & \cdots & a_{k-1,n}^{(k-1)} \\ 0 & \cdots & 0 & a_{k,k}^{(k)} & \cdots & a_{k,n}^{(k)} \\ \vdots & & \vdots & \vdots & & \vdots \\ 0 & \cdots & 0 & a_{n,k}^{(k)} & \cdots & a_{n,n}^{(k)} \end{pmatrix}$

Define Auxiliary Matrices

$L = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ l_{2,1} & 1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ l_{n,1} & \cdots & l_{n,n-1} & 1 \end{pmatrix}$ and $U = A^{(n)}$

$L_k$ contains the multipliers $l_{k+1,k}, \dots, l_{n,k}$ in column k below the diagonal and zeros everywhere else:

$L_k := \begin{pmatrix} 0 & \cdots & 0 & 0 & \cdots & 0 \\ \vdots & & \vdots & \vdots & & \vdots \\ 0 & \cdots & 0 & 0 & \cdots & 0 \\ 0 & \cdots & l_{k+1,k} & 0 & \cdots & 0 \\ \vdots & & \vdots & \vdots & & \vdots \\ 0 & \cdots & l_{n,k} & 0 & \cdots & 0 \end{pmatrix}, \qquad L = I + \sum_k L_k$

Elimination Step in Terms of Auxiliary Matrices

$A^{(k+1)} = (I - L_k) A^{(k)} = A^{(k)} - L_k A^{(k)}$

$U = A^{(n)} = (I - L_{n-1}) A^{(n-1)} = \dots = (I - L_{n-1}) \cdots (I - L_1) A^{(1)} =: \tilde{L} A$

with $\tilde{L} := (I - L_{n-1}) \cdots (I - L_1)$, hence $A = \tilde{L}^{-1} U$ with U upper triangular and $\tilde{L}^{-1}$ lower triangular.

Theorem 2: $\tilde{L}^{-1} = L$ and therefore $A = LU$.

Advantage: every further problem $Ax = b_j$ can be reduced to $(LU) x = b_j$ for arbitrary j. Solve two triangular problems: $Ly = b$ and $Ux = y$.
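A quick NumPy verification of Theorem 2 for arbitrary multipliers (a numerical check, not a proof; the variable names are mine):

import numpy as np

rng = np.random.default_rng(1)
n = 5
# Unit lower triangular L with arbitrary multipliers l_{i,k}.
L = np.eye(n) + np.tril(rng.standard_normal((n, n)), -1)

# L_k: column k of the strict lower triangle of L, zeros elsewhere.
Lk = []
for k in range(n - 1):
    M = np.zeros((n, n))
    M[k+1:, k] = L[k+1:, k]
    Lk.append(M)

# Ltilde = (I - L_{n-1}) ... (I - L_1): build by multiplying from the left.
Ltilde = np.eye(n)
for M in Lk:
    Ltilde = (np.eye(n) - M) @ Ltilde

# Theorem 2: inv(Ltilde) = I + L_1 + ... + L_{n-1} = L.
assert np.allclose(np.linalg.inv(Ltilde), L)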

Theorem 2: $\tilde{L}^{-1} = L$, i.e. $A = LU$

Proof. For $i \leq j$ we have $L_i L_j = 0$: the nonzeros of $L_j$ sit in rows $j+1, \dots, n$ of column j, while $L_i$ has nonzero entries only in column $i \leq j$. Hence:

$(I + L_j)(I - L_j) = I + L_j - L_j - L_j^2 = I \ \Rightarrow\ (I - L_j)^{-1} = I + L_j$

$(I + L_i)(I + L_j) = I + L_i + L_j + \underbrace{L_i L_j}_{=0} = I + L_i + L_j$

Therefore

$\tilde{L}^{-1} = [(I - L_{n-1}) \cdots (I - L_1)]^{-1} = (I - L_1)^{-1} \cdots (I - L_{n-1})^{-1} = (I + L_1)(I + L_2) \cdots (I + L_{n-1}) = I + L_1 + L_2 + \dots + L_{n-1} = L$

Vectorization of GE

(kij)-form (standard form):

for k = 1 : n-1
   for i = k+1 : n
      l_{i,k} = a_{i,k} / a_{k,k};
   end
   for i = k+1 : n
      for j = k+1 : n
         a_{i,j} = a_{i,j} - l_{i,k} a_{k,j};
      end
   end
end

Vector operation: SAXPY in rows a_i and a_k. No GAXPY.

U is computed rowwise, L columnwise.

[Figure: partitioning of A^{(k)}: the finished part of U (already computed, remains unchanged), the finished part of L (already computed, not used anymore), and the trailing block of A^{(k)} (newly computed, updated in every step).]

The standard (kij) form is also called right-looking GE.

[Figures, right-looking elimination steps:]
First elimination step: compute the first column of L; update A^{(1)}.
Second step: compute the second column of L; update A^{(2)}.
Third step: compute the third column of L; update A^{(3)}.
(k-1)-st step: compute the k-th column of L; update A^{(k)}.

Rules for the different (i,j,k)-forms:

In the following we again interchange the k, i, j loops. Necessary conditions:

$1 \leq k < i \leq n, \qquad 1 \leq k < j \leq n$

Furthermore:
- The innermost index (i, j, or k) determines whether the computation is done row-, column-, or block-wise.
- The outermost index shows how the final parts are derived.
- The weights $l_{j,k}$ have to be computed before they are used to eliminate the related entries.

(ikj)-form:

for i = 2 : n
   for k = 1 : i-1
      l_{i,k} = a_{i,k} / a_{k,k};
      for j = k+1 : n
         a_{i,j} = a_{i,j} - l_{i,k} a_{k,j};
      end
   end
end

GAXPY in row $a_{i,\cdot}$

[Figure: in the (ikj)-form, the finished rows of L (already computed, not used anymore) and of U (already computed and used), row i newly computed, the rest of A unchanged and not yet used.]

L and U are computed rowwise. Compute $l_{i,1}$, then a SAXPY with the 1st and the i-th row; then $l_{i,2}$, and so on.

[Figures: first step on A^{(1)}; second step on A^{(2)}; ...; (k-1)-st step with the finished parts of L and U and the remaining A^{(k-1)}.]

(ijk)-form:

for i = 2 : n
   for j = 2 : i
      l_{i,j-1} = a_{i,j-1} / a_{j-1,j-1};      % new row of L
      for k = 1 : j-1
         a_{i,j} = a_{i,j} - l_{i,k} a_{k,j};   % DOT-product, left part
      end
   end
   for j = i+1 : n
      for k = 1 : i-1
         a_{i,j} = a_{i,j} - l_{i,k} a_{k,j};   % DOT-product, right part
      end
   end
end

Compute $l_{i,1}$ and update $a_{i,2}$; then compute $l_{i,2}$ and update $a_{i,3}$ (now using $l_{i,1}$ and $l_{i,2}$); and so on. The entries $a_{i,j}$ are accumulated via DOT-products.

(jki)-form:

for j = 2 : n
   for k = j : n
      l_{k,j-1} = a_{k,j-1} / a_{j-1,j-1};      % new column of L
   end
   for k = 1 : j-1
      for i = k+1 : n
         a_{i,j} = a_{i,j} - l_{i,k} a_{k,j};
      end
   end
end

GAXPY in column $a_{\cdot,j}$

Left-looking GE

[Figure: the finished part of U (computed, not used), the finished part of L (already computed and used), column j-1 newly computed, the rest of A unchanged and not yet used.]

[Figures: first step; second step; ...; (k-1)-st step of left-looking GE, with the finished parts of U and L and the untouched rest of A.]

Overview

                    kij      kji      ikj      ijk      jki      jik
Access to A and U   row      column   row      column   column   column
Access to L         ---      column   ---      row      column   row
Computation of U    row      row      row      row      column   column
Computation of L    column   column   row      row      column   column
Vector operation    SAXPY    SAXPY    GAXPY    DOT      GAXPY    DOT
Vector length       2n/3     2n/3     2n/3     n/3      2n/3     n/3

Vector length = average of the occurring vector lengths.

The optimal form depends on the storage of the matrices and on the vector length.
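To contrast two of these forms in runnable code, a NumPy sketch of the right-looking kij form next to a left-looking jki-style form; both yield the same packed LU factors (no pivoting; the names are mine):

import numpy as np

def lu_kij(A):
    # Right-looking GE: at step k, scale column k, then rank-1 update.
    A = A.astype(float).copy(); n = len(A)
    for k in range(n - 1):
        A[k+1:, k] /= A[k, k]                   # column k of L, stored in A
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
    return A                                    # L (strict lower) and U, packed

def lu_jki(A):
    # Left-looking, jki-style: finish column j using all previous columns of L.
    A = A.astype(float).copy(); n = len(A)
    for j in range(n):
        for k in range(j):                      # sequence of SAXPYs (a GAXPY)
            A[k+1:, j] -= A[k+1:, k] * A[k, j]
        if j < n - 1:
            A[j+1:, j] /= A[j, j]               # new column of L
    return A

A = np.random.rand(6, 6) + 6 * np.eye(6)        # safely nonzero pivots
assert np.allclose(lu_kij(A), lu_jki(A))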

3.2. GE in Parallel: Blockwise

Main idea: blocking of GE to avoid data transfer between processors.

Basic concept: replace GE, i.e. one large LU-decomposition of the full matrix, by a sequence of small block operations:
- solving collections of small triangular systems $L U_k = B_k$ (parallelism in the columns of U)
- updating matrices, $A \leftarrow A - L U$ (also easy to parallelize)
- small LU-decompositions $B = LU$ (parallelism in the rows of B)

How to Choose Blocks in L, resp. U, Satisfying LU = A

$\begin{pmatrix} L_{11} & 0 & 0 \\ L_{21} & L_{22} & 0 \\ L_{31} & L_{32} & L_{33} \end{pmatrix} \begin{pmatrix} U_{11} & U_{12} & U_{13} \\ 0 & U_{22} & U_{23} \\ 0 & 0 & U_{33} \end{pmatrix} = \begin{pmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{pmatrix}$

$= \begin{pmatrix} L_{11} U_{11} & L_{11} U_{12} & L_{11} U_{13} \\ L_{21} U_{11} & L_{21} U_{12} + L_{22} U_{22} & L_{21} U_{13} + L_{22} U_{23} \\ L_{31} U_{11} & L_{31} U_{12} + L_{32} U_{22} & L_{31} U_{13} + L_{32} U_{23} + L_{33} U_{33} \end{pmatrix}$

Different ways of computing L and U, depending on:
- the start (assume the first entry/row/column of L/U as given)
- how to compute a new entry/row/column of L/U
- the update of the block structure of L/U by grouping into known blocks, blocks newly to compute, and blocks to be computed later

Crout Form

[Figure: block partitioning of L, U, and A for the Crout form.]

Crout Form (cont.)

1. Solve for $L_{22}$, $L_{32}$, and $U_{22}$ by a small LU-decomposition of the modified part of A.
2. Solve for $U_{23}$ by solving small triangular systems of equations in $L_{22}$.

Initial steps: $L_{11} U_{11} = A_{11}$, $\begin{pmatrix} L_{21} \\ L_{31} \end{pmatrix} U_{11} = \begin{pmatrix} A_{21} \\ A_{31} \end{pmatrix}$, $L_{11} (U_{12}\ U_{13}) = (A_{12}\ A_{13})$
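A serial NumPy sketch of a blockwise LU in this spirit (a right-looking blocked variant rather than the exact Crout bookkeeping; the block size nb and the function names are mine, and the explicit inverses stand in for the small triangular solves):

import numpy as np

def lu_unblocked(A):
    # Small in-place LU without pivoting; L (strict lower) and U packed in A.
    A = A.copy(); n = len(A)
    for k in range(n - 1):
        A[k+1:, k] /= A[k, k]
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
    return A

def block_lu(A, nb=2):
    # Per block step: small LU of the diagonal block, triangular solves for
    # the L- and U-panels (parallelizable over their rows/columns), then one
    # matrix-matrix update of the trailing part.
    A = A.astype(float).copy(); n = len(A)
    for k in range(0, n, nb):
        e = min(k + nb, n)
        A[k:e, k:e] = lu_unblocked(A[k:e, k:e])
        L11 = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
        U11 = np.triu(A[k:e, k:e])
        if e < n:
            A[e:, k:e] = A[e:, k:e] @ np.linalg.inv(U11)   # L21 U11 = A21
            A[k:e, e:] = np.linalg.inv(L11) @ A[k:e, e:]   # L11 U12 = A12
            A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]           # trailing update
    return A

A = np.random.rand(6, 6) + 6 * np.eye(6)
assert np.allclose(block_lu(A, nb=2), lu_unblocked(A))     # same factors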

New Partitioning

Combine the already computed parts from the second column of L and the second row of U into the first column of L and the first row of U. Split the until now ignored parts $L_{33}$ and $U_{33}$ into new columns/rows.

Repeat this overall procedure until L and U are fully computed.

Block Structure

Intermediate block structure: solve for the red blocks. Then reconfigure the block structure and repeat until done.

Left-Looking GE

Solve $L_{11} U_{12} = A_{12}$ by a couple of parallel triangular solves, and

$\begin{pmatrix} L_{22} \\ L_{32} \end{pmatrix} U_{22} = \begin{pmatrix} A_{22} \\ A_{32} \end{pmatrix} - \begin{pmatrix} L_{21} \\ L_{31} \end{pmatrix} U_{12} =: \begin{pmatrix} \hat{A}_{22} \\ \hat{A}_{32} \end{pmatrix}$

i.e. update this part of A and perform a small LU-decomposition.

Reorder the blocks and repeat until ready.

Start: $L_{11} U_{11} = A_{11}$, $L_{21} U_{11} = A_{21}$, and $L_{31} U_{11} = A_{31}$.

Block Structure

Intermediate block structure: solve for the red blocks. Then reconfigure the block structure and repeat until done.