Parallel Numerics. Prof. Dr. Thomas Huckle. July 2, 2006. Technische Universität München, Institut für Informatik



Contents

1 Introduction
  1.1 Computer Science Aspects of Parallel Numerics
    1.1.1 Parallelism in CPU
    1.1.2 Memory Organization
    1.1.3 Parallel Processors
    1.1.4 Performance Analysis
    1.1.5 Further Keywords
  1.2 Numerical Problems
  1.3 Data Dependency Graphs
    1.3.1 Directed Graph G = (E,V)
    1.3.2 Dependency Graphs of Iterative Algorithms
    1.3.3 Dependency graph for solving a triangular linear system
2 Elementary Linear Algebra Problems
  2.1 BLAS: Basic Linear Algebra Subroutines (program package)
  2.2 Analysis of the Matrix-Vector product
    2.2.1 Vectorization
    2.2.2 Parallelization by building blocks
    2.2.3 c = Ab for banded matrix
  2.3 Analysis of the Matrix-Matrix product
3 Linear Equations with dense matrices
  3.1 Gaussian Elimination: Basic facts
  3.2 Vectorization of the Gaussian Elimination
  3.3 Gaussian Elimination in Parallel
    3.3.1 Crout method
    3.3.2 Left looking GE
    3.3.3 Right looking GE / standard Gaussian Elimination
  3.4 QR-Decomposition with Householder matrices
    3.4.1 QR-decomposition
    3.4.2 Householder method for QR
    3.4.3 Householder method in parallel
4 Linear Equations with sparse matrices
  4.1 General properties of sparse matrices
    4.1.1 Storage in coordinate form
    4.1.2 Compressed Sparse Row Format: CSR
    4.1.3 Improving CSR
    4.1.4 Diagonalwise storage
    4.1.5 Rectangular, rowwise storage scheme
    4.1.6 Jagged diagonal form
  4.2 Sparse Matrices and Graphs
    4.2.1 A = A^T > 0 (n x n matrix), symmetric
    4.2.2 A non-symmetric: directed graph
    4.2.3 Dissection form preserved during GE
  4.3 Reordering
    4.3.1 Smaller bandwidth by the Cuthill-McKee algorithm
    4.3.2 Dissection reordering
    4.3.3 Algebraic pivoting during GE
  4.4 Gaussian Elimination in the Graph
  4.5 Different direct solvers
5 Iterative methods for sparse matrices
  5.1 Stationary methods
    5.1.1 Richardson Iteration
    5.1.2 Better splitting of A
    5.1.3 Jacobi (Diagonal) Splitting
    5.1.4 Gauss-Seidel method by improving convergence
  5.2 Nonstationary Methods
    5.2.1 A symmetric positive definite: A = A^T > 0 (spd)
    5.2.2 Improving the gradient method: conjugate gradients
    5.2.3 GMRES for general matrix A, not spd
    5.2.4 Convergence of cg or GMRES
6 Collection of remaining problems
  6.1 Domain Decomposition Methods for Solving PDE
  6.2 Parallel Computation of the Discrete Fourier Transformation
  6.3 Parallel Computation of Eigenvalues

Literature: Dongarra, Duff, Sorensen, van der Vorst: Numerical Linear Algebra for High Performance Computers.

1 Introduction

1.1 Computer Science Aspects of Parallel Numerics

1.1.1 Parallelism in CPU

Elementary operations in the CPU are carried out in pipelines: divide a task into a sequence of smaller tasks; each small task is executed on a piece of hardware that operates concurrently with the other stages of the pipeline. Example: multiplication.
Advantage: once the pipeline is filled, one result comes out per clock cycle. All multiplications should be organized such that the pipeline is always filled! If the pipeline is (partly) empty, it is not efficient.
Special case: vector instruction: the same operation has to be executed for a whole set of data, e.g. α · (x_1, ..., x_n).
Cost: startup time + vector length · clock period

Chaining: combine pipelines directly. Advantage: total cost = (longer) startup time + vector length · clock period.
Problem: data dependency. Fibonacci: x_0 = 0, x_1 = 1, x_2 = x_1 + x_0, ..., x_i = x_{i-1} + x_{i-2}. Each pair of operands has to wait until the previous result (e.g. x_2) has left the pipeline, so the pipeline is (nearly) empty in each step.

1.1.2 Memory Organization

Cache idea: a small, fast buffer between the large, slow main memory and the CPU. By considering the flow of the last used data, we try to predict which data will be requested in the next step:
- keep the last used data in the cache for fast access
- keep also the neighbourhood of this data in the cache
Cache hit: the CPU looks for data and finds it in the cache. Cache miss: the data is not in the cache; look in main memory and copy the new page into the cache.

1.1.3 Parallel Processors

MIMD architecture: Multiple Instruction, Multiple Data (P = processors, M = memories). Either (global) shared memory: all processors P_1, ..., P_n access one global memory; or distributed memory: each processor P_j has its own local memory M_j holding its data; or virtual shared memory: physically distributed data, but organized as shared memory.
Topology of processors/memory (interconnection). Bus (shared memory): the processors P_1, ..., P_n, each with cache and local memory, are connected via an I/O bus to the global memory.

Mesh (distributed memory). Hypercube: data dependency/communication distance log(n).
Shared memory: communication between different processors by synchronization, e.g. a barrier: p_1, ..., p_n halt and continue only if all have completed.
MPI: Message Passing Interface: communication library for C, C++, FORTRAN.
Compiling: mpicc <options> prog.c
Start: mpirun -arch <architecture> -np <np> prog
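A minimal MPI program in C, as it could be compiled and started with the commands above; MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Bcast, MPI_Barrier and MPI_Finalize are standard MPI calls, the rest of the sketch is illustrative.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        double data = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* own processor number  */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of processors  */

        if (rank == 0) data = 3.14;             /* root provides the data */
        MPI_Bcast(&data, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD); /* send to all */
        MPI_Barrier(MPI_COMM_WORLD);            /* synchronization point  */
        printf("process %d of %d received %f\n", rank, size, data);

        MPI_Finalize();
        return 0;
    }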

Commands: MPI_Send, MPI_Bcast, MPI_Recv, MPI_Gather, MPI_Barrier, ...

1.1.4 Performance Analysis

Computation speed: r = N/t Mflops for N floating point operations in t microseconds; or, with known speed, t = N/r.
Amdahl's law: the algorithm takes N flops; a fraction f is carried out with speed V Mflops (well-suited for parallel execution), the fraction 1-f with speed S Mflops (not well-suited). Total CPU time:
t = f·N/V + (1-f)·N/S = N (f/V + (1-f)/S) microseconds.
Overall speed (performance):
r = N/t = 1 / (f/V + (1-f)/S) Mflops   (Amdahl's law).
Interpretation: f must be close to 1 in order to benefit significantly from parallelism.
Speedup by using p parallel processors for a given job: t_j := wall clock time to execute the job on j parallel processors.
Speedup: S_p = t_1 / t_p (ideal: t_1 = p·t_p).
Efficiency: E_p = S_p / p, 0 <= E_p <= 1; E_p close to 1: very well parallelizable, t_p = t_1/p, the problem scales.
With the parallel fraction f:
t_p = f·t_1/p + (1-f)·t_1 = t_1 (f + (1-f)p) / p,
S_p = t_1 / t_p = p / (f + (1-f)p)   (Ware's law),
E_p = 1 / (f + (1-f)p),  and lim_{p→∞} E_p = 0.
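A small C illustration of these formulas: it tabulates S_p and E_p according to Ware's law for a fixed parallel fraction f (0.95 is an arbitrary example value; the function name is a choice of this sketch).

    #include <stdio.h>

    /* S_p = p / (f + (1-f)*p) for parallel fraction f and p processors */
    double ware_speedup(double f, int p)
    {
        return p / (f + (1.0 - f) * p);
    }

    int main(void)
    {
        for (int p = 1; p <= 1024; p *= 2) {
            double s = ware_speedup(0.95, p);
            printf("p = %4d   S_p = %7.2f   E_p = %5.3f\n", p, s, s / p);
        }
        return 0;
    }

For f = 0.95 the speedup saturates near 20, illustrating that f must be close to 1 to profit from many processors.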

Gustafson's law: assume that the given problem can be solved in 1 unit of time on a parallel machine with p processors. A uniprocessor would need (1-f) + f·p units. Speedup:
S_{p,f} = t_1 / t_p = ((1-f) + f·p) / 1 = p + (1-p)(1-f)   (Gustafson's law),
E_{p,f} = S_{p,f} / p = (1-f)/p + f → f for p → ∞   (only of theoretical use!).

1.1.5 Further Keywords

- An algorithm is scaling iff with p processors we can reduce the operation time by a factor of p, or a larger problem can be solved in the same time by using more processors (this means: speedup ≈ p, efficiency ≈ 1).
- Load balancing: the job has to be distributed over the processors such that all processors are busy: avoid idle processors.
- Deadlock: two or more processors are waiting indefinitely for an event that can be caused only by one of the waiting processors (each waiting for the result of the other one).
- Data dependency: compute (1) C = A + B, (2) Z = C·X + Y; the dependency graph has an edge 1 → 2.

(2) can be computed only after (1). Example loop:
for (i = 1; i <= n; i++) a[i] = b[i] + a[i-1] + c[i];
(strongly sequential)

1.2 Numerical Problems

Vectors x, y ∈ R^n:
dot product (inner product): x^T y = (x_1, ..., x_n)(y_1, ..., y_n)^T = Σ_{i=1}^n x_i y_i
sum of vectors: x + αy = (x_1 + αy_1, ..., x_n + αy_n)^T
outer product (for x ∈ R^n, y ∈ R^m): x y^T = (x_i y_j), the n x m matrix with entries x_i y_j
matrix product: A ∈ R^{n,k}, B ∈ R^{k,m}, C = A·B ∈ R^{n,m} with c_ij = Σ_{r=1}^k a_ir b_rj, i = 1, ..., n, j = 1, ..., m
Solving linear equations, e.g. triangular: an upper triangular matrix (a_ij = 0 for i > j) applied to x = (x_1, ..., x_n)^T equal to b = (b_1, ..., b_n)^T:

a_11 x_1 + ... + a_1n x_n = b_1
          a_22 x_2 + ... + a_2n x_n = b_2
                    ...
                              a_nn x_n = b_n
Solution: x_n = b_n / a_nn, x_{n-1} = (b_{n-1} - a_{n-1,n} x_n) / a_{n-1,n-1}, ...; general form:
x_j = (b_j - Σ_{k=j+1}^n a_jk x_k) / a_jj   for j = n, ..., 1.
Further numerical problems: Gaussian Elimination, LU-decomposition (Cholesky-decomposition); least squares problem (normal equations) min_x ||Ax - b||_2; QR-decomposition; differential equations (PDE); eigenvalues, singular values; FFT.

1.3 Data Dependency Graphs

1.3.1 Directed Graph G = (E,V)

with edges E and vertices/nodes V. Example: computation of (x_1 + x_2)(x_2 + x_3) = x_1 x_2 + x_2^2 + x_1 x_3 + x_2 x_3.

Input: x_1, x_2, x_3. Data flow: first the sums x_1 + x_2 and x_2 + x_3, then the product (x_1 + x_2)(x_2 + x_3). Sequentially this takes 3 time steps. In parallel, x_1 + x_2 and x_2 + x_3 can be computed independently, so the result is obtained in 2 time steps. The second, equivalent formula x_1 x_2 + x_2^2 + x_1 x_3 + x_2 x_3 can also be evaluated in parallel (all products in one time step, followed by the additions).

1.3.2 Dependency Graphs of Iterative Algorithms

Given: a function f and a start vector x^(0); iterate x^(k+1) = f(x^(k)). Notation: x^(k+1) corresponds to x(k+1). If x^(k) → x̄ for k → ∞, then x̄ = f(x̄) is a fixed point of f. Compare Newton's method for g(x̄) = 0: x_{k+1} = x_k - g(x_k)/g'(x_k).
In vector form: x^(k+1) = f(x^(k)), i.e. x_i(k+1) = f_i(x_1(k), ..., x_n(k)), i = 1, ..., n.
Example:
x_1(k+1) = f_1(x_1(k), x_3(k))
x_2(k+1) = f_2(x_1(k), x_2(k))
x_3(k+1) = f_3(x_2(k), x_3(k), x_4(k))
x_4(k+1) = f_4(x_2(k), x_4(k))
Edge i → j iff for x_i^(k+1) we need x_j^(k).

Parallel computation: dependency graph for the iteration: single-step or Jacobi iteration. Very nice in parallel; but the convergence x^(k) → x̄ is slow. Idea for accelerating the convergence: always use the newest available information:
x_1(k+1) = f_1(x_1(k), x_3(k))
x_2(k+1) = f_2(x_1(k+1), x_2(k))
x_3(k+1) = f_3(x_2(k+1), x_3(k), x_4(k))
x_4(k+1) = f_4(x_2(k+1), x_4(k))
This leads to much faster convergence.

Full-step or Gauss-Seidel method (drawback: loss of parallelism). In this form the iteration depends on the ordering of the variables x_1, ..., x_n:
x_1(k+1) = f_1(x_1(k), x_3(k))
x_3(k+1) = f_3(x_2(k), x_3(k), x_4(k))
x_4(k+1) = f_4(x_2(k), x_4(k))
x_2(k+1) = f_2(x_1(k+1), x_2(k))

Better parallelism, but slower convergence. Find an optimal ordering with fast convergence that is also good in parallel.
Colouring algorithms for dependency graphs:
- use k colours for the vertices of the graph
- vertices of the same colour can be computed in parallel
- optimal colouring: minimal k, but without cycles connecting vertices of the same colour
If in the subset of vertices of one colour there are no cycles, the subgraph is a tree; order it by starting with the leaves and ending with the root. Example:

x_3 does not depend on x_1; x_4 does not depend on x_3; x_2 uses the newly computed x_1, x_3, x_4 and needs one further time step. The computation that uses only old information runs in parallel in one time step.

Theorem 1: The following two statements are equivalent:
(a) There exists an ordering such that one Gauss-Seidel iteration step takes k (time) levels.
(b) There exists a colouring with k colours such that there is no cycle of edges of the same colour.
Proof: colouring with no cycles in each one-colour subgraph → each subgraph is a tree → ordering from leaves to root → no data dependency within the subgraph → each subgraph can be processed in parallel.

Graph of the discretization of physical problems with neighbour connections: k = 2 colours give the red-black Gauss-Seidel ordering for PDEs, 2 time steps (checkerboard colouring r/b of the grid points).

1.3.3 Dependency graph for solving a triangular linear system

a_11 x_1                                        = b_1
a_21 x_1 + a_22 x_2                             = b_2
a_31 x_1 + a_32 x_2 + a_33 x_3                  = b_3
a_41 x_1 + a_42 x_2 + a_43 x_3 + a_44 x_4       = b_4

or, in matrix form, the lower triangular system
( a_11   0     0     0   )   ( x_1 )   ( b_1 )
( a_21   a_22  0     0   ) · ( x_2 ) = ( b_2 )
( a_31   a_32  a_33  0   )   ( x_3 )   ( b_3 )
( a_41   a_42  a_43  a_44)   ( x_4 )   ( b_4 )
Solution:
x_1 = b_1 / a_11
x_2 = (b_2 - a_21 x_1) / a_22
x_3 = (b_3 - a_31 x_1 - a_32 x_2) / a_33
x_4 = (b_4 - a_41 x_1 - a_42 x_2 - a_43 x_3) / a_44
A strongly sequential problem. In general:
x_k = (b_k - Σ_{j=1}^{k-1} a_kj x_j) / a_kk   for k = 1, ..., n.
Dependency graph: assume a_jj = 1; the graph has 2n-1 time steps.
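A C sketch of this forward substitution (row-major storage of L, nonzero diagonal); the function name is illustrative. The inner loop over j is exactly the dot product with the already computed components, which makes the sequential dependency visible.

    /* Solve L x = b for a lower triangular n x n matrix L (row-major),
       following x_k = (b_k - sum_{j<k} a_kj x_j) / a_kk. */
    void forward_subst(int n, const double *L, const double *b, double *x)
    {
        for (int k = 0; k < n; k++) {
            double s = b[k];
            for (int j = 0; j < k; j++)
                s -= L[k*n + j] * x[j];     /* uses all previous x_j */
            x[k] = s / L[k*n + k];
        }
    }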

2 Elementary Linear Algebra Problems

(dense matrices, parallel/vectorized)

2.1 BLAS: Basic Linear Algebra Subroutines (program package)

Sum s = Σ_{i=1}^n a_i by a fan-in process:
a^(k) = (a_1^(k), ..., a_{2^(N-k)}^(k))   with   a_j^(k) = a_j^(k-1) + a_{j+2^(N-k)}^(k-1).
Grouping: a_1 + ... + a_8 = [(a_1 + a_5) + (a_3 + a_7)] + [(a_2 + a_6) + (a_4 + a_8)]
for (k = 1; k <= N; k++)
  for (j = 1; j <= 2^(N-k); j++)
    a_j = a_j + a_{j+2^(N-k)};
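A runnable C version of the fan-in loop above (n = 2^N values summed in place; the inner loop over j consists of independent additions and could run in parallel). The function name is an illustrative choice.

    /* Fan-in summation of n = 2^N numbers in place, N >= 1. */
    void fanin_sum(double *a, int N)
    {
        int stride = 1 << (N - 1);            /* 2^(N-k) for k = 1 */
        for (int k = 1; k <= N; k++) {
            for (int j = 0; j < stride; j++)  /* independent additions */
                a[j] += a[j + stride];
            stride /= 2;
        }
        /* afterwards a[0] = a_1 + ... + a_n */
    }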

The fan-in corresponds to a full binary tree with n = 2^N leaves; depth corresponds to time = log n = N (sequentially: O(n)).
Level 1 BLAS: Basic Linear Algebra Subroutines for O(n) problems (vectors only), e.g. the DOT product by fan-in: s = x^T y = Σ_{j=1}^n x_j y_j. Parallelization of the dot product: the products x_j y_j in parallel, then fan-in for the sum. The dot product is not very good in parallel or in vectorization.
Another way of computing the DOT product on a special architecture: distribute the data on a linear, one-dimensional processor array with r = n/k processors; break x_1, ..., x_n (and the second vector) into r small vectors of length k. Each processor computes a_j1 b_j1 + ... + a_jk b_jk.

Time for this parallel computation: k·(add + mult), i.e. k times the time for one addition/multiplication. After computing this part, processor P_1 / P_r sends its result to its right/left neighbour, which adds the new data to its own result and sends the new result on to its right/left neighbour, until P_{r/2} holds the final number.
Total time (depending on n and r):
f(r) = k·(add + mult) + (r/2)·send + (r/2)·add = (n/r)(a + m) + r·(a + s)/2.
Minimize the total time f(r):
0 = f'(r) = -(a + m)·n/r^2 + (a + s)/2   ⟹   r = sqrt(2(a + m)n/(a + s)) = O(sqrt(n)).
Optimal: with sqrt(n) processors the time is O(sqrt(n)): f(sqrt(n)) = sqrt(n)(a + m) + sqrt(n)(a + s)/2 = O(sqrt(n)).
With the fan-in tree and n processors the total time is O(log n).
Further level-1 BLAS problems: SAXPY (S single precision, A α, X x, P plus, Y y): y = αx + y, computed by pipelining; vectorization by chaining: while the multiply pipeline produces αx_i, the add pipeline already combines αx_j + y_j for earlier components.
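A sketch of the AXPY operation in C (double precision, i.e. DAXPY in BLAS naming; the function name here is illustrative). The single loop has no dependencies between iterations and therefore pipelines and vectorizes well.

    /* y = alpha*x + y for vectors of length n */
    void axpy(int n, double alpha, const double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            y[i] += alpha * x[i];    /* independent per component */
    }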

Parallelization by partitioning: {1, 2, 3, ..., n} = I_1 ∪ I_2 ∪ ... ∪ I_R, x = (x_1; ...; x_R), y = (y_1; ...; y_R). Each processor p_j, j = 1, ..., R, gets x_j and y_j and computes αx_j + y_j: very well vectorizable and parallelizable. Similarly:
SCOPY: y = x
NORM: ||x||_2 = sqrt(Σ_{j=1}^n x_j^2), compare DOT.
Level-2 BLAS: matrix-vector operations, O(n^2) sequentially; e.g. SGEMV (S single precision, GE general matrix, MV matrix-vector): y = αAx + βy; or solving a triangular system Lx = b with L a lower triangular matrix.
Level-3 BLAS: matrix-matrix operations, O(n^3); e.g. SGEMM (S single precision, GE general matrix, MM matrix-matrix): C = αAB + βC.
Based on BLAS: LAPACK subroutines for solving linear equations, least squares problems, QR-decomposition, eigenvalues, eigenvectors.

2.2 Analysis of the Matrix-Vector product

A = (a_ij)_{i=1..n, j=1..m} ∈ R^{n,m}, b ∈ R^m, c ∈ R^n, c = Ab.

2.2.1 Vectorization

c = Ab can be written as c_i = Σ_{j=1}^m a_ij b_j (a collection of DOT products with the rows of A) or as c = Σ_{j=1}^m b_j · (j-th column of A) (a collection of SAXPYs with the columns of A).
(ij) form:
for i = 1, ..., n
  for j = 1, ..., m
    c_i = c_i + a_ij b_j
DOT products (the entries of c), where c_i = (i-th row of A) · b.
(ji) form:
for j = 1, ..., m
  for i = 1, ..., n
    c_i = c_i + a_ij b_j
SAXPY: c = c + b_j · (j-th column of A); GAXPY = a number of SAXPYs accumulating into the same vector c. Advantage of GAXPY: keep c in fast register memory. SAXPY/GAXPY are well vectorizable.
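The two loop orderings, sketched in C for a row-major matrix; for column-major (Fortran/BLAS) storage the roles are reversed and the (ji) form gets unit-stride access to A. Function names are illustrative.

    /* (ij) form: c_i as DOT product with row i of A (A row-major n x m) */
    void matvec_ij(int n, int m, const double *A, const double *b, double *c)
    {
        for (int i = 0; i < n; i++) {
            double s = 0.0;
            for (int j = 0; j < m; j++)
                s += A[i*m + j] * b[j];
            c[i] = s;
        }
    }

    /* (ji) form: c = c + b_j * (column j of A), a GAXPY into the vector c */
    void matvec_ji(int n, int m, const double *A, const double *b, double *c)
    {
        for (int i = 0; i < n; i++) c[i] = 0.0;
        for (int j = 0; j < m; j++)
            for (int i = 0; i < n; i++)
                c[i] += A[i*m + j] * b[j];
    }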

2.2.2 Parallelization by building blocks

Reduce the matrix-vector product to smaller matrix-vector products on the processors:
{1, ..., n} = I_1 ∪ I_2 ∪ ... ∪ I_R disjoint (I_j ∩ I_k = ∅), {1, ..., m} = J_1 ∪ ... ∪ J_S (J_j ∩ J_k = ∅ for j ≠ k).
Processor P_rs gets the block A_rs := A(I_r, J_s), b_s = b(J_s), c_r = c(I_r), so that
c_r = Σ_{s=1}^S A_rs b_s = Σ_{s=1}^S c_r^(s):
for r = 1, ..., R
  for s = 1, ..., S
    c_r^(s) = A_rs b_s
for r = 1, ..., R
  c_r = 0
  for s = 1, ..., S
    c_r = c_r + c_r^(s)
The first loop needs no communication and is totally parallel; the second is a blockwise collection and addition of vectors with rowwise communication (parallel).
Special case S = 1: c = (A_1; A_2; ...; A_R) b, i.e. c_r = A_r b: small, independent matrix-vector products, no communication between processors (processors independent of each other); compute each A_r b in vectorizable form by GAXPYs.
Special case R = 1: c = (A_1 A_2 ... A_S)(b_1; ...; b_S) = A_1 b_1 + A_2 b_2 + ...: the products A_i b_i are independent; afterwards the results of P_1, ..., P_S have to be collected and added (not so good in parallel).
Rule: 1) vectorization/pipelining (innermost loops), 2) cache, 3) parallelization (outermost loops).

26 223 c = Ab for banded matrix b 0 0 bandwidth b eg β = 1 : tridiagonal matrix (0: main diagonal, +/-1 first upper/lower) 0 0 A = 0 0 notation 0 0 ã 10 ã 11 ã 1β 0 0 ã 2, 1 ã 20 0 Ã = ãn β,β ã β+1, β ã n, β ã n0 0 0 ã 1,0 ã 1,β 0 ã β+1, β ã n β,β 0 ã n, β ã n, n = (2β + 1) O(n)

27 ã is = a i,i+s for row i = 1,, n 1 i + s n S [l i, r i ] = [max{ β, 1 i}, min{β, n i}] Therefore we get the inequality eg 1 i S n i, β S β, 1 S i n S row i = 1 : S [0, β] row i = β + 1 : S [ β, β] i = n β : S [ β, β] i = n : S [ β, 0] computation of matrix-vector-product C = A b on vector processor C i = A ij b = j a ij b j = r i a i,i+s b i + S S=l i }{{} j = r i S=l i ã i,s b i+s for i = 1,,n Algorithm: for s = - β : 1 : β for i = max{1-s, 1} : 1 : min {n-s,n} c i = c i + ã ij b i+s parallel computation: 1, n = for i I r c i = r i R r=1 general triade (no SAXPY) I r s=l i ã is b i+s Processor P r gets rows to index set I r := [m r, M r ] to compute its part of C What part of vector b is necessary to process P r? 26

28 b j for j = i + s m r + l mr = m r + max{ β, 1 m r } = max{m r β, 1} j = i + s M r + r Mr = M r + min{β, n M r } = min{m r + β, n} Hence processor P r ( I r ) needs b j for j [max{1, m r β}, min{n, M r + β}] 23 Analysis of the Matrix-Matrix-product A = (a ij ) i=1n j=1m B = (b ij ) i=1m j=1q C = A B = (c ij ) i=1n j=1q for i = 1n, c ij = for j=1q: m a ik b kj = k=1 a i1 a im b 1j b mj = c ij Algorithm 1 (ijk) - form: for i = 1:n for j = 1:q for k = 1:m c ij = c ij + a ik b kj DOT-product c ij = A i B j All entries c ij are fully computed, one after another Access to A rowwise, to B columnwise Algorithm 2 (jki) - form for j = 1:q k = 1:m for i = 1:n c ij = c ij + a ik b kj SAXPY c j = c j + a k b kj vector c j c computed columnwise; access to A columnwise GAXPY c j = k b kja k 27

29 Algorithm 3 (kji) - form for k = 1:m for j = 1:q for i = 1:n c ij = c ij + a ik b kj SAXPY NO GAXPY because different c j There are computed intermediate values c (k) ij Access to A columnwise ijk ikj kij jik jki kji Access to A row row column column Access to B column row row column Computation of c row row row column column column c ij direct delayed delayed direct delayed delayed vector operation DOT GAXPY SAXPY DOT GAXPY SAXPY with vector length m q q m m m usually GAXPY better; longer vector length better; choose the right access to A,B, deping on the storage Matrix-Matrix-product in parallel 1, n = 1, m = 1, q = R r=1 S s=1 T t=1 I r K s J t Distribute blocks relative to index sets I r, K s, J t to processor P rst : K s J t J t I r A rs K s B st = I r c (s) rt 28

30 1 process P rst : c (s) rt = A rs B st small matrix-matrix-product 2 sum: Special case S = 1: I r c rt = S s=1 J t all processors indepently c (s) rt fan-in in S = J t c rt I r Each process computes a block of c indepently without communication Each process needs full block of rows of A( I r ) and block of columns of B( J t ), to compute the block c rt with n q processor: each processor has to compute one DOT-product c rt = k a rk b kt in O(m) If we use more processors to compute all these DOT-products by fan-in, we can reduce the parallel complexity to O(log m) 3 Linear Equations with dense matrices 31 Gaussian Elimination: Basic facts Linear equations a 11 x a 1n x n = b 1 a n1 x a nn x n = b n a 11 a 1n x 1 b 1 = a n1 a nn x n b n }{{} Ax = b A = A (1) 29

Solving triangular systems is easy, so we try to transform the given system into triangular form. In the first step the multiples l_i1 = a_i1^(1)/a_11^(1) of row 1 are subtracted from the rows i = 2, ..., n, turning A = A^(1) into A^(2), which has zeros below a_11^(2) in the first column; the next step uses the multipliers a_i2^(2)/a_22^(2) and produces A^(3), and so on, until A^(n) is reached: the upper triangular form with a_11, ..., a_nn on the diagonal and zeros below.
No pivoting (we assume a_kk^(k) ≠ 0 for all k); we ignore the right-hand side b.
Algorithm:
for k = 1:n-1
  for i = k+1:n
    l_ik = a_ik / a_kk
  for i = k+1:n
    for j = k+1:n
      a_ij = a_ij - l_ik a_kj
Intermediate system: after k-1 steps, A^(k) is already upper triangular in its first k-1 columns, while the trailing block A^(k)(k:n, k:n) is still full and is updated in the following steps.
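A C sketch of the (kij) elimination above, assuming row-major storage and no pivoting; the function name is illustrative. The multipliers l_ik are stored in the strict lower triangle of A, so afterwards A holds L (below the diagonal, with unit diagonal implied) and U (on and above the diagonal).

    /* Gaussian elimination without pivoting, (kij) form; A is n x n, row-major */
    void ge_kij(int n, double *A)
    {
        for (int k = 0; k < n - 1; k++) {
            for (int i = k + 1; i < n; i++)
                A[i*n + k] /= A[k*n + k];            /* l_ik = a_ik / a_kk */
            for (int i = k + 1; i < n; i++)
                for (int j = k + 1; j < n; j++)
                    A[i*n + j] -= A[i*n + k] * A[k*n + j];
        }
    }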

32 Define matrix with entries l ik from above algorithm l 21 1 L = and L k = l k+1,k l n,1 l n,n 1 1 l n,k 0 0 Each Elimination step in Gaussian Elimination can be written in the form A (k+1) = (1 L k ) A (k) = A (k) L k A (k) (3) with A (1) = A and A (n) = U = upper triangular U = A (n) = (1 l n 1 )A (n 1) = = (1 l n 1 ) (1 l 1 ) A }{{} (1) = L A L with L := (1 l n 1 ) (1 l 1 ) 1 l j lower triangular L lower triangular L 1 lower triangular A = L 1 U with L 1 lower and U upper triangular Theorem 2 L 1 = L Proof: i j 0 = l i l j = 0 0 i 0 0 j 0 0 Therefore (1 + l j )(1 l j ) = 1 + l j l j lj 2 = I and (1 l j ) 1 = 1 + l j L 1 [(1 l n 1 ) (1 l 1 )] 1 = (1 l 1 ) 1 (1 l n 1 ) 1 = (1 + l 1 )(1 + l 2 ) (1 + l n 1 ) = 1 + l 1 + l l n 1 = L because eg (1 + l 1 )(1 + l 2 ) = 1 + l 1 + l 2 + l 1 l }{{} 2 = 1 + l 1 + l 2 =0 Total: A = L U with L lower and U upper triangular 31

33 32 Vectorization of the Gaussian Elimination (kij)-form (standard) for k = 1:n-1 for i = k+1:n l i,k = a ik a kk for i = k+1:n for j = k+1:n a ij = a ij l ik a kj Vector operation α x SAXPY in row a i and a k U computed rowwise, columnwise In the following, we want to interchange the kij-loops: No GAXPY already computed unchanged no more computed L U A (n) newly computed updated in every step right looking GE Necessary condition: (ikj)-form: for i = 2:n for k = 1:i-1 l ik = a ik a kk j = k+1:n a ij = a ij l ik a kj 1 k < i n 1 k < j n GAXPY in a ii compute l i1 by SAXPY combine the 1st row and the i-th row, then compute l 12, and so on L and U are computed rowwise 32

34 already computed not used L i li1 A U already computed used newly computed unchanged (ijk)-form: for i = 2:n for j = 2:i l i,j 1 = a i,j 1 a j 1,j 1 for k = 1:j-1 a ij = a ij l ik a kj for j = i+1:n for k = 1:i-1 a ij = a ij l ik a kj (jki)-form for j = 2:n for k = j:n l k,j 1 = a k,j 1 a j 1,j 1 for k = 1:j-1 a ij = a ij l ik a kj α x DOT (upper left) DOT (upper right) GAXPY in a ij 33

35 computed j U already computed used L A unchanged not used left looking GE newly computed kij kji ikj ijk jki jik Access to AU row column row column column column Access to L column row column row Computation of U row row row row column column Computation of L column column row row column column Vector Operation SAXPY SAXPY GAXPY DOT GAXPY DOT 2 Vector Length Vector length = average of occurring vector lengths 33 Gaussian Elimination in Parallel: Blockwise GE (better in environment) (i) solve triangular system L : U = A indepently columns of U (ii) A 22 LU updating blocks (easy parallelize) (iii) small LU-decomposition l u 11 u 12 u 13 l 21 l u 22 u 23 l 31 l 32 l u 33 = = l 11 u 11 l 11 u 12 l 11 u 13 l 21 u 11 l 21 u 12 + l 22 u 22 l 21 u 13 + l 22 u 23 l 31 u 11 l 31 u 12 + l 32 u 22 A 11 A 12 A 13 A 21 A 22 A 23 A 31 A 32 A 33 Different ways of computing L and U, deping on ordering: different algorithm 34

36 331 Crout method l u 11 u 12 u 13 l 21 l u 22 u 23 l 31 l In Bold: already computed In italics: has to be computed in this step ( ) ( ) ( ) l22 u 22 l 22 u 23! A22 l = 21 u 12 A 23 l 21 u 13 Â22 Â = 23 l 32 u 22 A 32 l 31 u 12 Â 32 ( ) ( ) l22 Â22 (1) U 22 = by small LU-decomposition gives l 22, l 32, and U 22 l 32 Â 32 (2) l 22 u 23 = Â23 by solving triangular system in l 22 l 11 U 11 U 12 U 13 in total: l 21 l 22 U 22 U 23 = A l 31 l 32 l 33 U 33 Put the computed parts in the first row/column blocks of L and U Split l 33 and U 33 in new parts l 22, l 32, U 22, U 23 and repeat 332 left looking GE: U L l l 21 l 22 0 l 31 l 32 l 33 u 11 u 12 u 13 u 22 u 23 u 33 = A In Bold: already computed In italics: has to be computed in this step equations: l 11 u 12 = A 12 can be solved by triangular gives u 12 ( ) ( ) ( ) Â22 A22 l21 Compute = U Â 32 A 32 l 12 by matrix multiplication 31 ( ) ( ) l22 Â22 and U 22 = small LU-decomposition l 22, l 32, U 22 l 32 Â 32 35

37 333 Right looking / Gaussian Elimination standard l 11 u 11 u 12 u 13 l 21 l 22 u 22 u 23 = A l 31 l 32 l 33 u 33 In Bold: already computed In italics: has to be computed in this step A 11 = l 11 u 11 (small LU-decomposition) with equations: l 21 u 11 = A 21 l 21 ; l 11 u 12 = A 12 u 12 triangular solve l 22 u 22 = A 22 l 21 u 12 = Â22 by LU-decomposition of Â22 In comparison, all variants have nearly the same efficiency in parallel, flops in Matrix-Matrix-Multiplication, triangular solve and LU-decomposition 34 QR-Decomposition with Householder matrices 341 QR-decomposition Similar to LU-decomposition (numerically not stable) by Gaussian-Elimination We are interested in A = QR with Q orthogonal and R upper triangular b = Ax = QRx Rx = Q T b for solving linear system QR has advantages for ill-conditioned A Application for overdetermined systems A x! = b Ax = b has no solution best approximate solution by solving min Ax b 2 2 = min(x T A T Ax 2x T A T b + b T b) x x gradient equal zero leads to A T Ax = A T b (normal equation) A T A has a larger condition number than A Advantages of QR-decomposition: ( ) R1 A = QR, R =, cond(r 0 1 ) = cond(a) A T Ax = A T b (QR) T QRx = (QR) T b R T Rx = R T Q T b }{{} ( R1 T 0 ) ( ) R 1 x = ( R1 0 T 0 ) ˆb R T 1 R 1 x = ( R1 T 0 ) (ˆb1 ) ˆb2 R T 1 R 1 x = R T 1 ˆb 1 R 1 x = ˆb 1 ˆb 36

38 342 Householder method for QR u vector R n with length 1, u 2 = 1 H := 1 2uu T is called Householder matrix (rank-1-perturbation of identity) H is orthogonal (H T H = 1) ; H = H T : H T H = H 2 = (1 2uu T )(1 2uu T ) = 1 2uu T 2uu T + 4u u T u u T = I }{{} 1 First step: use H to transform the first column of A in upper triangular form: H 1 A = (1 2u 1 u T 1 )(a 1 ) = (a 1 2(u T 1 a 1 )u 1 )! = Hence we have to find u 1 of length 1 with a 1 2(u T! 1 a 1 )u 1 = αe 1 α H 1 is orthogonal, therefore a 1 2 = 0 : = α 0 We can set α = a 1, therefore u 1 = a 1 a 1 2 e 1 2(u T 1 a 1) = a 1 a 1 2 e 1 a 1 a 1 2 e 1 2 H 1 A 1 = (1 2u 1 u T 1 )A = Apply the same procedure on A 2 : * H 2 A 2 = (1 2u 2 u T 2 )A 2 = 0 0 a 1 * * 0 A 2 0 A 3 α 0 : 0 V 1 := u 1 (1 2u 2 u T 2 ) dimension n-1 ( ) 0 ext u 2 to vector of length n : v 2 :=, u 2 0 Hence H 2 H 1 A = (1 2v 2 v2 T )(1 2v 1 v1 T )A = 0 A

39 Total: H n 1 H 2 H }{{} 1 A = R = upper triangular Q T A = QR with Q = (H n 1 H 2 H 1 ) T = (H 1 H n 1 ) 343 Householder method in parallel Idea: Compute u 1 u k, but application of H k H 1 A in blocked form for elimination of first k columns Question: What is the structure of H k H i =: V k {}}{ A = ( A 1 A 2 ) =? QR compute u 1, H 1 = I 2u 1 u T 1, H 1 A 1 compute u 2, H 2 = I 2u 2 u T 2, H 2 (H 1 A 1 ) usw vḳ T Theorem 3 H k H i = (1 2v k vk T ) (1 2v ivi T ) = I (v k v i )T i vi T with T i upper triangular matrix Proof [ by Induction: (1 2vk vk T ) (1 2v i vi T ) ] (1 2v i 1 v }{{} i 1) T Assumption vḳ T = I (v k v i )T i (1 2v i 1 vi 1) T vi T vḳ T vk T = I 2v i 1 vi 1 T v i 1 (v k v i )T i + 2(v k v i ) T i vi 1 T vi T vi T v i 1 }{{} y vḳ ( ) T Ti -2y = I (v k v i v i 1 ) 0 2 vi T vi 1 T ( R * Computation of H k H i A as (I Y T Y T )A = A Y T (Y T A) = 0 à with y = (u 1,, u k ) and then repeat with à ) 38
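For illustration, a compact C sketch of one Householder step as derived above: it computes the vector u for the first column of an m x n block (with α = ||a_1||_2) and applies I - 2uu^T columnwise. The function name, the row-major layout with leading dimension lda, and the caller-provided workspace u are assumptions of this sketch; in practice the sign of α is chosen to avoid cancellation.

    #include <math.h>

    /* One Householder step on the leading column of the m x n block A */
    void householder_step(double *A, int m, int n, int lda, double *u)
    {
        double alpha = 0.0;
        for (int i = 0; i < m; i++) alpha += A[i*lda] * A[i*lda];
        alpha = sqrt(alpha);                     /* alpha = ||a_1||_2      */
        for (int i = 0; i < m; i++) u[i] = A[i*lda];
        u[0] -= alpha;                           /* v = a_1 - alpha*e_1    */
        double nv = 0.0;
        for (int i = 0; i < m; i++) nv += u[i] * u[i];
        nv = sqrt(nv);
        if (nv == 0.0) return;                   /* column already alpha*e_1 */
        for (int i = 0; i < m; i++) u[i] /= nv;  /* ||u||_2 = 1            */
        for (int j = 0; j < n; j++) {            /* A := A - 2 u (u^T A)   */
            double s = 0.0;
            for (int i = 0; i < m; i++) s += u[i] * A[i*lda + j];
            for (int i = 0; i < m; i++) A[i*lda + j] -= 2.0 * s * u[i];
        }
    }

After this step the first column of the block equals (α, 0, ..., 0)^T, and the same routine can be applied to the remaining (m-1) x (n-1) subblock.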

40 4 Linear Equations with sparse matrices 41 General properties of sparse matrices Full n n matrix: O(n 2 )storage O(n 3 )solution } too costly Formulate the given problem such that the resulting linear system is sparse O(n) storage O(n) solution? example: tridiagonal: most of the entries are zero Example matrix: A = ; n = 5, nnz (number of nonzero entries) = Storage in coordinate form values AA row JR column JC Superfluous information: (storage nnz floating point numbers, 2nnz+2 integer numbers) Computation of C = Ab for j = 1 : nnz(a) C JR(j) = C JR(j) + AA (j) b JC(j); }{{} a JR(j),JC(j) indirect addressing (indexing) no c and b jumping in memory (Disadvantage) Advantage: does not prefer rows or columns 39

4.1.2 Compressed Sparse Row Format: CSR

AA: the nonzero values, stored row by row (row 1, row 2, ..., row n); JA: the corresponding column indices; IA: pointers to the beginning of each row in AA/JA (IA(n+1) points behind the last entry of the last row). Storage: nnz floating point numbers; about nnz + n + 1 integer numbers (JA plus the n+1 row pointers).
c = Ab:
for i = 1 : n
  for j = IA(i) : IA(i+1) - 1
    c(i) = c(i) + AA(j) b(JA(j))
Only indirect addressing and jumps in b. Analogously: Compressed Sparse Column format.

4.1.3 Improving CSR

by extracting the main diagonal entries: AA holds first the n diagonal values, then the non-diagonal entries rowwise as in CSR; JA holds first the pointers to the beginning of each row, then the column indices of the non-diagonal entries. Storage: about 2(nnz(A) + 1).
c = Ab:
for i = 1 : n
  c(i) = AA(i) b(i)
  for j = JA(i) : JA(i+1) - 1
    c(i) = c(i) + AA(j) b(JA(j))
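A C sketch of the CSR matrix-vector product above, using 0-based indices as usual in C; AA, JA and IA follow the storage scheme just described, and the function name is an illustrative choice.

    /* c = A*b with A in CSR format: AA (nnz values), JA (nnz column
       indices), IA (n+1 row pointers), all 0-based. */
    void csr_matvec(int n, const double *AA, const int *JA, const int *IA,
                    const double *b, double *c)
    {
        for (int i = 0; i < n; i++) {
            double s = 0.0;
            for (int j = IA[i]; j < IA[i+1]; j++)
                s += AA[j] * b[JA[j]];     /* indirect addressing into b */
            c[i] = s;
        }
    }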

42 414 Diagonalwise storage, eg for band matrices example: only efficient for band matrices 415 rectangular, rowwise storage scheme by compressing from the right gives COEF (values) = JCOEF (columnination) = storage: n }{{} =5 * nnz of longest row of A }{{} nl=3 41

43 C = Ab ; C = 0 for i = 1 : n for j = 1 : nl C(i) = C(i) + COEF F (i, j) b(jcoef F (i, j)) ELLPACK 416 Jagged diagonal form First step: Sort rows after their length: Storage for PA in the form: DJ values: 3} 6 1{{ 9 11} 4} 7 2{{ 10 12} 5 8 first jagged diagonal second Column indices: JDIAG: IDIAG: C = Ab ; C = 0 j = 1 : NDIAG for i = 1 : length of j jagged diagonals {}}{ IDIAG(j + 1) IDIAG(j) k {}}{{}}{ C(i) = C(i) + DJ( IDIAG(j) +i 1)b(JDIAG( IDIAG(j) + i 1)) }{{} similar to SAXP Y Operations on local block data! k 42

44 42 Sparse Matrices and Graphs 421 A = A T > 0 (n n - matrix) symmetric, define Graph G(A): Knots, vertices: e 1,, e n ; edges (e i, e j ) a ij 0 0 example: A = G(A) = undirected graph: Graph G(A) has adjacency matrix A(G(A)) = has exactly the structure of A Symmetric permutation P AP T, by permuting row and columns of A in the same way: renumbering of the knots (renumbering 3 4 means, that r 3 r 4 and c 3 c 4 ) 43

45 422 A non symmetric: directed graph good sparsity pattern: Block Diagonal: 0 0 Example: A = = e e e e Graph splits into two subgraphs; use permutation that groups together edges in the same subgraph: 2 3 new G(A): P AP T = = ( A1 0 0 A 2 ) block diagonal Reduce the solution of the large given matrix to the solution of small block parts The block pattern is not disturbed in Gauss-Elimination 44

46 ( ) 1 ( ) A1 0 A = 0 A 2 0 A 1 2 a 11 a 1p 0 0 a Band Matrix: A = q a nn Gaussian Elimination without pivoting maintains this pattern cols: O(n pq) and A = LU with l u 11 u 1p l L = q1 0 0 and U = l nn 0 0 u nn with pivoting u will have a larger bandwidth Similarly: A= structure is preserved by GE 423 Dissection form preserved during GE 0 0 (no fill-in GE) 45

47 Schur Complement for Block Matrices: Reduce to smaller matrices: ( ) ( ) ( ) ( B1 B 2 B 1 1 D I B B 3 B 4 0 S 1 = 1 D + B 2 S 1! I 0 B 3 B1 1 B 3 D + B 4 S 1 = I ) Therefore B 1 D + B 2 S 1 =! 0 = D = B1 1 B 2 S 1 and B 3 D + B 4 S 1 =! I = I = B 3 B1 1 B 2 S 1 + B 4 S 1 = I = (B 4 B 3 B 1 1 B 2 )S 1 = S = B 4 B 3 B1 1 B 2 (Schur Complement) ( ) ( ) ( ) B1 B 2 I 0 B1 B = B 3 B 4 B 3 B S 1 I Instead of solving LE in B, we have to solve small systems in B 1 and S Application in Dissection form: A 1 0 F 1 0 A 2 F 2 G 1 G 2 A 3 ( ) A1 0 Schur complement relative to : 0 A 2 S = A 3 ( ) ( A G 1 G 2 0 A 1 2 = A 3 G 1 A 1 1 F 1 G 2 A 1 2 F 2 ) ( F1 F 2 ) Linear Equation in Dissection form: A 1 0 F 1 x 1 A 1 x 1 + F 1 x 3 = b 1 0 A 2 F 2 x 2 = A 2 x 2 + F 2 x 3 = b 2 G 1 G 2 A 3 x 3 G 1 x 1 + G 2 x 2 + A 3 x 3 = b 3 = x 1 = A 1 1 b 1 A 1 1 F 1 x 3 x 2 = A 1 2 b 2 A 1 2 F 2 x 3 = (G 1 A 1 1 b 1 G 1 A 1 1 F 1 x 3 ) + (G 2 A 1 2 b 2 G 2 A 1 2 F 2 x 3 ) + A 3 x 3 = b 3 = (A 3 G 1 A 1 1 F 1 G 2 A 1 2 F 2 )x 3 = b 3 G 1 A 1 1 b 1 G 2 A 1 2 b 2 Sx 3 = ˆb 3 46

48 1 Compute S by using A 1 1 and A Solve Sx 3 = b 3 3 Compute x 1 and x 2 by using A 1 1 and A 1 2 Sometimes S is full or too expensive to compute Then use iterative method for Sx 3 = ˆb 3, that uses only s*vector, which can be computed easily with F 1, F 2, G 1, G 2 and A 1 1, A Reordering 431 Smaller Bandwidth by Cuthill Mckee-Algorithm Given sparse matrix A, G(A) Define level sets: S 1 = {1} S 2 = set of new edges connected to S 1 by vertex {2, 3, 4} S 3 = set of new edges connected to S 2 by vertex {5, 6, 7} S 4 = {8, 9, 10} S 5 = {11} Starting from one chosen vertex, according distance to the start knot First edge in S 1 gets number 1 In each level set we sort and order the knots such that the first group of entries in S i are the neighbours of the first entry in S i 1, and the second group of entries in S i+1 are the neighbours of the second entry in S i, and so on 47

49 often Cuthill McKee-Algorithm ordering is reversed: Reverse Cuthill McKee 48

50 432 Dissection Reordering A, G(A), eg leads to pattern = 0 A 1 0 F 1 0 A 2 F 2 G 1 G 2 A

51 433 Algebraic pivoting: during GE (Numerical pivoting: choose largest a ij a kk as pivot element) Algebraic pivoting: choose largest a ij from sparse row/column a kk as pivot element (small fill-in during GE-step) Minimum degree re-ordering for A = A T > 0 first step: define r j = # entries in row j (non-zero) in the G(A) = # edges connected with vertex j Repeat choose i such that r i = min r j j (nearly empty now) choose a ii pivot element i 1 by symmetric permutation Do the elimination step in GE reduce the matrix by one Generalization to nonsymmetric case: Markowitz-Criterion define r j = # entries in row j (non-zero) c k = # entries in column k (non-zero) minimizes: min j,k (r j 1)(c k 1) choose a jk as pivot element Apply permutation to put a j,k in diagonal position In practise: mixtures of algebraic and numerical pivoting: include a condition, that a i,s should be not too small! 0 * * * Example: GE 0 * * 0 * 0 * * * full G(A): n-1 n 50

52 Cuthill-Mckee with starting edge {1} : S 1 = {1}, S 1 = {1,, n} given no improvement with start {2} : S 1 = {1}, S 2 = {1}, S 3 = {2,, n} Permutation such that smallest bandwidth also not very helpful Matching: set of edges, for each row/column index there is exactly one edge Matching gives a permutation of the rows nonzero diagonal entries Example: ω(π) = log a ij i,j nonzero move here for example (1,3) to (3,3) and (2,1) to (1,1) Perfect matching, maximizing ω(π) heuristic methods to get approximal solutions For symmetric matrix we need a symmetric permutation P AP T low permutation perfect matching ( ) ( ) ( 1 3 ) ( 3 2 ) : ( ) ( ) 1 3 2, 3 1,

53 bandwidth n/2 } * * * * * * 0 0 * * Minimum Degree: is very good Choose edge 1 with degree n-1 Therefore replace 1 by 2 with degree GE 0 * * * * * * 0 0 * Next pivot 3, and so on ; works in O(n) Global reordering: be permutation 1 n ; GE in O(n) * * * Change the numbering such that indices in a 2 2 permutation have subsequent numbers: Apply symmetric permutation The large entries appear in 2 2 diagonal blocks 52

54 44 Gaussian Elimination in Graph A = A T > 0 symmetric positive definite No need for numerical pivoting Example G(A): Choose as pivot * * * * 7 * * * * Fill in: pattern of row J is added to non zero entries in column 7 = indices connected with row 7 give a dense submatrix here the submatrix to row/column 3, 6, 8 and 11 gets dense leads to fill in 53

55 New graph: one step GE in the graph consists in - remove edge 7 - add vertices such that all neighbours of 7 get fully connected Definition: A fully connected graph is called clique, eg or or or or In each elimination step the pivot knot is removed (pivot row and column are removed) and a subclique in the graph is generated Connecting all neighbours of the pivot entry Next step in GE: with pivot 6: neighbours: 2, 3, 5, 8, 10,

56 Gaussian elimination can be modelled without numerical computations only by computing the graphs algebraically Advantages: - algebraic prestep is cheap - gives information on the data structure (pattern) of resulting matrices - shows whether Gaussian Elimination makes sense - formulation in cliques, because in the course of GE, there will appear more and more cliques: cliques give short discretization of the graphs 45 Different direct solvers Frontal methods for band matrices b b b 0 frontal matrix of size ( b+1)x(2 b+1) treated as dense matrix 0 - Apply first GE step with column pivoting in dense frontal matrix - compute next row/column and move frontal matrix one step right+down Multifrontal method for general sparse matrices 0 example: A = d 11 first pivot element is related to first frontal matrix, that contain all numbers related to one step GE with a 11 : a i1 a 1j a 11 : in dense submatrix: a 11 a 13 a 14 a a 31 a 13 a 31 a a 11 a 11 a a 41 a 13 a 41 a a 11 a 11 55

57 Because a 12 = 0, wee can in parallel consider a 22 and the frontal matrix, related to the one step GE with a 22 : a 22 a 23 a 24 a a 32 a 23 a 32 a a 22 a 22 a a 42 a 23 a 42 a a 22 a 22 The computations a ij a ij a i1a 1j a 11 and a ij a ij a i2a 2j a 22 are indepent and can be done in parallel 56

58 5 Iterative methods for sparse matrices X 0 initial guess (eg X 0 = 0) Iteration function φ : x k+1 = φ(x k ) gives sequence x 0, x 1, x 2, x 3, x 4, k should converge x k x = A 1 b (fast convergence) Advantage: computation of φ(x) needs only matrix-vector products Do not change the pattern It is easy to parallelize Big question: fast convergence? 51 stationary methods 511 Richardson Iteration for Solving Ax = b : x := A 1 b b = (A I + I)x = (A I)x + x = x = b + (I A)x = b + Nx Fix point iteration x = φ(x) with φ(x) = b + Nx x 0 start x k+1 = φ(x k ) = b + Nx k if x k convergent, x k x, then x = b + N x A x = b ˆx = x other formulation: φ(x) = b + x Ax = x + (b Ax) = x + r(x) r - residual Convergence analysis via Neumann Series x k = b + Nx k 1 = b + N(b + Nx k 2 ) = b + Nb + Nx k 2 = b + Nb + N 2 b + Nx k 3 = = b + Nb + + N k 1 b + N k x 0 = ( k 1 i=0 N i )b + N k x 0 Special case: x 0 = 0 : x k = ( k 1 j=0 N j )b x k span(b, Nb, N 2 b,, N k 1 b) = span(b, Ab, A 2 b,, A k 1 b) = K k (A, b) = Krylov-row of dimension k to matrix A and vector b, assume N < 1: then k 1 j=0 N j convergence j=0 N j = (I N) 1 = A 1 ( q j = 1 ) 1 q j=0 x k ( N j )b = (I N) 1 b = (I (I A)) 1 b = A 1 b = x j=0 Richardson gives convergent sequence if A=I Error: e k := x k ˆx e k+1 = x k+1 ˆx = (b + Nx k ) (b + N ˆx) = N(x }{{}}{{} k ˆx) = Ne k φ(x k ) φ(ˆx) 57

59 e k N e k 1 N 2 e k 2 N k e 0 N < 1 N k k 0 e k k 0 ; ρ(n) = ρ(i A) < 1 largest absolute value of an eigenvalue < 1 define a norm with A < 1 Eigenvalues of A have to be in a circle around 1 with radius Better splitting of A A := M N Modifications of Richardson to get better convergence b = Ax = (M N)x = Mx Nx x = M 1 b + M 1 Nx new φ(x) = M 1 b + M 1 Nx = M 1 b + M 1 (M A)x = M 1 (b Ax) + x = x + M 1 r(x) M should be simple (easy to solve) x k+1 = M 1 b + M 1 Nx k = x k + M 1 (b Ax k ) = x k + M 1 r k is equivalent to Richardson applied on M 1 Ax = M 1 b Therefore convergent for ρ(m 1 N) = ρ(i M 1 A) < 1 M is also called a precondition, because M 1 A should be better conditioned than A itself: M 1 A I 513 Jacobi (Diagonal) - Splitting: A = M N = D (L + U) with L: lower triangular, U: upper triangular, D: diagonal part of A -U = -L D x k+1 = D 1 b + D 1 (L + U)x k = D 1 b + D 1 (D A)x k = x k + D 1 r k convergent if ρ(m 1 N) = ρ(i D 1 A) < 1 58

Elementwise:
x_j^(k+1) = (1/a_jj) (b_j - Σ_{m=1, m≠j}^n a_jm x_m^(k))
or
a_jj x_j^(k+1) = b_j - Σ_{m=1}^{j-1} a_jm x_m^(k) - Σ_{m=j+1}^n a_jm x_m^(k).
To improve convergence: x_{k+1} = x_k + D^{-1} r_k, with D^{-1} r_k as correction step; include damping with step length ω: damped Jacobi
x_{k+1} = x_k + ω D^{-1} r_k = x_k + ω D^{-1}(b - A x_k) = (I - ω D^{-1} A) x_k + ω D^{-1} b
        = (I - ω D^{-1}(D - L - U)) x_k + ω D^{-1} b = ω D^{-1} b + [(1-ω) I + ω D^{-1}(L + U)] x_k,
convergent if ρ((1-ω) I + ω D^{-1}(L + U)) < 1 (the iteration matrix tends to I for ω → 0). For ω = 1: plain Jacobi; otherwise look for the optimal ω.
The Jacobi method is easy to parallelize: only A times vector and D^{-1} times vector are needed. To improve convergence further: Block Jacobi, where D is taken as the block diagonal part of the splitting A = D - L - U.
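One damped Jacobi sweep, sketched in C for a dense matrix for clarity; ω = 1 gives the plain Jacobi step, and every component of x_new can be computed in parallel. Function and variable names are illustrative.

    /* x_new = x + omega * D^{-1} (b - A x), A dense n x n, row-major */
    void jacobi_sweep(int n, const double *A, const double *b,
                      const double *x, double *x_new, double omega)
    {
        for (int j = 0; j < n; j++) {
            double r = b[j];
            for (int m = 0; m < n; m++)
                r -= A[j*n + m] * x[m];          /* residual component r_j */
            x_new[j] = x[j] + omega * r / A[j*n + j];
        }
    }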

5.1.4 Gauss-Seidel method (improving convergence)

a_jj x_j^(k+1) = b_j - Σ_{m=1}^{j-1} a_jm x_m^(k+1) - Σ_{m=j+1}^n a_jm x_m^(k),   j = 1, 2, ..., n.
Try to use the newest available information in each step. Advantage: fast convergence.
In matrix form: D x_{k+1} = b + L x_{k+1} + U x_k, or (D - L) x_{k+1} = b + U x_k, which corresponds to the splitting A = M - N with M = D - L, N = U: the Gauss-Seidel method. In each step we have to solve a triangular linear system: a disadvantage in parallel!
Data dependency graphs for iteration methods → reorder A → colouring of the graph, e.g. red-black: a compromise between convergence and parallelism. Convergence depends on ρ(I - (D - L)^{-1} A) < 1, i.e. Richardson applied to (D - L)^{-1} A x = (D - L)^{-1} b. Damping: x_{k+1} = x_k + ω (D - L)^{-1} r_k.
In general, stationary methods can be written in the form x_{k+1} = c + B x_k with a constant vector c and iteration matrix B, or x_{k+1} = x_k + F r_k with a preconditioner F; ρ(B) < 1 gives convergence, and B = I - F A.

5.2 Nonstationary Methods

5.2.1 A symmetric positive definite: A = A^T > 0 (spd)

Consider the function φ(x) = (1/2) x^T A x - b^T x. Derivative: ∇φ(x) = Ax - b (gradient); the graph of φ is a paraboloid. The minimum of φ is unique, with ∇φ(x̄) = A x̄ - b = 0, i.e. A x̄ = b. Compute x̄, the solution of Ax = b, by approximating the minimum of φ iteratively.

62 x k last iterate: find x k+1 = x k + λv with search direction v and stepsize λ, such that φ(x k+1 ) < φ(x k ) and hence x k+1 is nearer to minimum Search direction v: d φ (x dλ k + λv) λ=0 = φ(x k )v (directional derivative) <! 0 Optimal search direction v = φ(x k ) = b Ax k = r K x k+1 = x k + λr k stepsize λ : finding min λ φ(x k + λr K ) is a simple 1D-problem d φ(x dλ k + λr k ) = d ( 1 dλ 2 (xt k + λrt k )A(x k + λr k ) b T (x k + λr k )) = d dλ ( 1 2 xt k Ax k + λr T k Ax k + λ2 2 rt k Ar k b T x k λb T r k ) = r T k Ax k b T r k + λr T k Ar k = r T k r k + λr T k Ar k! = 0 λ = rt k r k r T k Ar k v k = r k Algorithm: x k+1 = x k + rt k r k r T k Ar k r k with r k = b Ax k Gradient Method, steepest decent locally optimal search directions are not globally optimal, if paraboloid is very distorted very small and large eigenvalues if condition(a) = A 2 A 1 2 = λmax λ min >> 1 cond(a) >> 1 guaranteed convergence, but mostly slow! To analyse this slow convergence also theoretically, we introduce the following norm (so-called A-norm) x A := x T Ax Then it holds for the error x x with x = A 1 b 61

63 x x 2 A = x A 1 b 2 A = (x A 1 b) T A(x A 1 b) = = x T Ax 2b T x + b T A 1 b = 2φ(x) + b T A 1 b Hence minimizing φ is equivalent to minimizing the error in the A-norm with x j+1 := x j + λ j r j, r j = b Ax j and λ j = rt j r j r T j Ar j we get the following inequality between φ(x j+1 ) and φ(x j ): φ(x j+1 ) = 1 2 (xt j + λ j r T j )A(x j + λ j r j ) (x T j + λ j r T j )b = 1 2 xt j Ax j + λ j x T j Ar j + λ2 j 2 rt j Ar j x T j b λ j r T j b = φ(x j ) + λ j r T j (Ax j b) + λ2 j 2 rt j Ar j (r T j r j) 2 = φ(x j ) rt j r j rj T Ar rt j j r j + 1 r T 2 (rj T Ar j) 2 j Ar j = φ(x j ) 1 2 = φ(x j ) 1 2 (r T j r j) 2 (r T j Ar j) (r T j r j) 2 rj T Ar j rj T A 1 rj T A 1 r j r j }{{} ρ j = φ(x j ) 1 2 ρ j(b Ax j ) T A 1 (b Ax j ) = φ(x j ) ρ j 2 (bt A 1 b + x T j Ax j 2b T x j ) = φ(x j ) ρ j (φ(x j ) bt A 1 b) φ(x j+1 ) bt A 1 b = φ(x j ) bt A 1 b ρ j (φ(x j ) bt A 1 b) x j+1 x 2 A = x j x 2 A (1 ρ j) error in next step ρ j = (rj T r j) 2 1 rj T Ar jrj T A 1 r j λ max 1/λ min }{{} λmax(a 1 ) = 1 cond(a) (range(a) = { rt j Ar j r T j r j r j A} [λ min (A), λ max (A)]) x j+1 x 2 A = (1 1 cond(a) ) x j x 2 A Is therefore cond(a) >> 1, then the improvement in every iteration step is nearly nothing Therefore we have very slow convergence! 62

5.2.2 Improving the gradient method: conjugate gradients

Ansatz: x_{k+1} = x_k + α_k p_k (α_k step size, p_k search direction). As search direction we do not use the gradient itself, but a modification of the gradient: choose the new search direction such that p_k is A-conjugate to the previous directions p_j, i.e. p_k^T A p_j = 0. We choose the new search direction as the projection of the gradient onto the A-conjugate subspace relative to the previous p_k; α_k is derived by 1-dimensional minimization as before.
Algorithm (conjugate gradient method):
x_0 = 0, r_0 = b - A x_0
for k = 1, 2, ...:
  β_{k-1} = r_{k-1}^T r_{k-1} / r_{k-2}^T r_{k-2}   (β_0 = 0)
  p_k = r_{k-1} + β_{k-1} p_{k-1}   (p_k A-conjugate to p_{k-1}, p_{k-2}, ...)
  α_k = r_{k-1}^T r_{k-1} / p_k^T A p_k
  x_k = x_{k-1} + α_k p_k   (1-dimensional minimization)
  r_k = r_{k-1} - α_k A p_k
  if ||r_k|| < ε: stop
Main properties of the computed vectors:
p_j^T A p_k = 0 = r_j^T r_k for j ≠ k,
span(p_1, ..., p_j) = span(r_0, ..., r_{j-1}) = span(r_0, A r_0, ..., A^{j-1} r_0) = K_j(A, r_0) (Krylov subspaces),
especially for x_0 = 0: span(b, Ab, ..., A^{j-1} b) = K_j(A, b).
Main property: x_k is the best approximate solution in the subspace K_k(A, b); for x_0 = 0, x_k ∈ span(b, Ab, ..., A^{k-1} b) and
||x_k - x̄||_A = min_{x ∈ K_k(A,b)} ||x - x̄||_A   with x̄ = A^{-1} b.
Choosing these special search directions, the 1D minimization gives the best solution relative to a k-dimensional subspace: in each step the optimal solution in larger and larger subspaces! Consequence: after n steps K_n(A, b) = R^n, hence x_n = x̄ in exact arithmetic, or min_{x_k ∈ K_n} ||x_k - x̄||_A = 0. Unfortunately, this is only true in exact arithmetic; also, convergence only after n steps would not be good enough.
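A compact C sketch of the cg algorithm above, with x_0 = 0 and a dense matrix-vector product for simplicity (for sparse A one would plug in, e.g., the CSR product from Section 4.1.2); the helper names matvec, dot and cg are illustrative.

    #include <math.h>
    #include <stdlib.h>
    #include <string.h>

    static void matvec(int n, const double *A, const double *x, double *y)
    {
        for (int i = 0; i < n; i++) {
            y[i] = 0.0;
            for (int j = 0; j < n; j++) y[i] += A[i*n + j] * x[j];
        }
    }

    static double dot(int n, const double *x, const double *y)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++) s += x[i] * y[i];
        return s;
    }

    /* Conjugate gradients for A = A^T > 0, starting from x_0 = 0 */
    void cg(int n, const double *A, const double *b, double *x,
            double eps, int maxit)
    {
        double *r = malloc(n * sizeof *r), *p = malloc(n * sizeof *p);
        double *Ap = malloc(n * sizeof *Ap);
        memset(x, 0, n * sizeof *x);                /* x_0 = 0              */
        memcpy(r, b, n * sizeof *r);                /* r_0 = b - A x_0 = b  */
        memcpy(p, r, n * sizeof *p);                /* p_1 = r_0            */
        double rho = dot(n, r, r);
        for (int k = 1; k <= maxit && sqrt(rho) >= eps; k++) {
            matvec(n, A, p, Ap);
            double alpha = rho / dot(n, p, Ap);     /* 1D minimization      */
            for (int i = 0; i < n; i++) x[i] += alpha * p[i];
            for (int i = 0; i < n; i++) r[i] -= alpha * Ap[i];
            double rho_new = dot(n, r, r);
            double beta = rho_new / rho;            /* beta_k               */
            for (int i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
            rho = rho_new;
        }
        free(r); free(p); free(Ap);
    }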

65 error estimation for x 0 = 0 : e k A = x k x k = min x K k (a,b) x x A = min α j k 1 j=0 α j (A j b) x A = min p k 1 (A)b x A = min p k 1 (A)A x x A p k 1 (x) p k 1 (x) = min q k (A) ( x x 0 ) q k (0)=1 }{{} A = e k A e 0 a spd A = UΛU T (Λ: diagonal matrix/eigenvalues, in U: eigenvectors) we can write: e 0 = n ξ j u j (u 1,, u n are ONB of eigenvectors of A) j=1 { e }} 0 { n e k A = min q k (0)=1 q k (A) ξ j u j A j=1 = min n ξ j q k (A)u j A q k (0)=1 j=1 = min q k (0)=1 n ξ j q k (λ j )u j A j=1 min [ q k (0)=1 n q k(λ j ) n j=1 max n j=1 = min [ max q k(λ j ) ] e 0 A q k (0)=1 j=1 ξ j u j ] by choosing any polynomial with q k (0) = 1 and degree k, we can derive estimates for the error e k eg: q k (x) := (1 2 λ max+λ min x) K leads to: e k A max n k(λ j ) e 0 A = max n 2 λ j K e 0 j=1 j=1 λ max + λ min = 2λ max (1 ) k e 0 A λ max + λ min = ( λ min λ max λ max + λ min ) k e 0 A = ( cond(a) 1 cond(a) + 1 )k e 0 A Better estimates by normalized Chebychev polynomials: T n (x) = cos(n arccos(x)) ( ) k 1 e k A 2 cond(a) 1 T k ( cond(a)+1 cond(a) 1 ) cond(a)+1 64

66 eg assume that A has only two eigenvalues λ 1 and λ 2 set q 2 (x) := (λ 1 x)(λ 2 x) λ 1 λ 2 q 2 (0) = 1 e 2 A max j=1,2 q 2(λ j ) e 0 A = 0 convergence of cg-method after 2 steps! Similar behaviour for eigenvalue clusters After 2 steps: small error 523 GMRES for General Matrix A, not spd Consider small subspaces U m and determine optimal approximate solutions in these subspaces for Ax = b in U m so we restrict x to the form x = U m y: min x U m Ax b 2 = min y A(U m y) b 2 could be solved by normal equations U T ma T AU m y = U T ma T b What subspace U m should we choose? (relative to Ax=b): U m := U m (A, b) = span(b, Ab,, A m 1 b) (bad basis for U m ) First step: provide Orthonormal basis for U m (A, b) : u 1 := b/ b 2 for j = 2 : m ũ j := Au j 1 j 1 (u T k Au j 1 ) u k=1 }{{} k ũ j u 1,, u j 1 h k,j 1 u j := ũ j / ũ j }{{} 2 h j,j 1 j 1 j 1 Au j 1 = (u T k Au j 1 )u k + ũ j = h k,j 1 u k + h j,j 1 u j = k=1 k=1 j h k,j 1 u k k=1 AU m = A(u 1,, u m ) = (u 1,, u m+1 ) H m+1,m = Ũm H m+1,m with H m+1,m = h 11 h 1m h 21 0 hm,m 0 0 h m+1,m Upper m+1 m Hessenberg form 65

67 Now we can solve the minimization problem: min Ax b 2 = min A(U m y) b 2 x U m y = min y U m H(m+1,m) y b u 1 2 = min y U m ( H (m+1,m) y b e 1 ) 2 = min y H (m+1,m) y b e 1 2 because U m is part of an orthogonal matrix (invariant) We can use Givens rotation to compute a QR-decomposition of the upper Hessenberg matrix H (m+1,m) 0 G 1, G 2 0,, G m 0 ( ) gives Q H (m+1m) = G m G 2 G 1 H(m+1m) = R = Rm = min Ax b 2 = min x U m = min y = min y = min y H (m+1,m) y b e 1 2 y ( Rm y 0 ) y b G m G }{{} 1 e 1 2 bm ( ) Rm y 0 b 2 m ( Rm y b ) 1 b 2 2 Solution: GMRES: R m y = b 1 Y - Compute H (m+1,m) by Arnoldi-orthogonalization - compute QR-factorization - solve least squares problem x k 66

68 Iterative: enlarge U m to A m U m new column in H (m+1,m) new Givens matrix update QR new column in R solve enlarged LS by updating x k Gets very costly after 50 steps Restarted version: GMRES(20) x A( x x) = b Ax = b A x = r Call GMRES(20) for Ax = r r m 2 := Ax m b 2 = min Ax b 2 = min x U m = min p m 1 Ap m 1 (A)b b 2 = V 2 V 1 2 b }{{} 2 [ max j=1 r j 2 n x j α j (A j b)] b 2 m 1 A[ j=0 min q m (A)b 2 = q m(0)=1 min q m(0)=1 524 Convergence of cg or GMRES min V q m (1)V 1 b 2 q m(0)=1 q m (λ j ) ] = cond V r j 2 min max q m(λ j ) q m(0)=1 j=1 }{{} like cg Convergence of cg/gmres deps strongly on the position of eigenvalues A = 1 0 n X X X X X X X X X X X X GMRES needs n steps! Preconditioning: Improve the eigenvalue location of A: P 1 Ax = P 1 b (implicit) (P A) replace the given Ax = b by or MAx = Mb (explicit) (M A 1 ) (P1 1 AP2 1 )(P 2 x) = P1 1 b à x = b 67

69 symmetric: (P1 1 AP1 T )(P1 T x) = (P 1 b) (Ã should have clustered eigenvalues) stationary methods: A = M N ; b = Ax = (M N)x = Mx Nx x k+1 = M 1 b + M 1 Nx k = M 1 b + (I M 1 A)x k convergent iff I M 1 A < 1 eigenvalues of M 1 A are clustered near 1 Good splitting good precondition Improve stationary methods by using the related splitting as preconditions in cg or GMRES: (i) Jacobi-splitting with D = diag(a) Jacobi preconditioner M := D (ii) Gauss-Seidel splitting M := L + D (iii) ILU = incomplete LU decomposition: Apply GE-algorithm, but reduced cg on the pattern of the sparse matrix A 0 A = to L =, U = 0 A = LU + R Modification: ILU(0) related pattern of A ; ILU(1) related to L(0)U(0) ILUT: Treshhold ILU: Apply standard GE, but in each step sparsification by deleting all entries less or equal MILU: Modified ILU: = Apply GE with sparsification Move all deleted entries to the diagonal IC (incomplete cholesky - symmetric form of ILU) implicite preconditioners disadvantage: in each step we have to solve sparse triangular system ILU L,U hard to parallelize Idea: explicit preconditioner M A 1 to minimize AM I? choose Frobenius norm B 2 F = n (B j ) 2 2 = n (B i ) 2 2 j=1 i=1 68

70 choose matrix class polynomial preconditioner: A 1 (A n +γ n 1 A n 1 + +γ 1 A+γ 0 ) 0 γ 0 A 1 = A n 1 γ n 1 A n 2 γ 1 min I p m (A)A : min p m P m max 1 p m (λ)λ λ 1,λ m Assume that the eigenvalues in interval: 0 < c λ d < for min max 1 p m (λ)λ solution: p m (x) = T m+1( d c) T d+c m+1( d+c 2x p m P m λ [c,d] xt( d+c (transformation from oscillation [ 1, 1] to [c, d]) cg, GMRES are optimal in Krylov spaces (b, Ab, A 2 b, ) p m (A)b is easy to parallelize, but not optimal choose M: sparse matrices, same sparcity as A min AM I 2 F = min n (AM I)e j 2 2 = n M P (A) M P (A) j=1 min j=1 M P (A) d c) AM j e }{{} j vectors n indepent minimization problems for computing M 1, M 2,, M n d c ) 2 2 A( :, I j )M(I j ) e j (I j are the non-zero entry indices of M j ) J j := indices of non-zero rows of A( :, I j ) min A(J j, I j )M(I j ) e j (J j ) Least squares problem can be solved by QR-method, Givens or Householder Solve n indepent small LS problems, to get M To apply this preconditioner in cg or GMRES: we only have to multiply sparse M times vector 69

71 6 Collection remaining problems 61 Domain Decomposition Methods for Solving PDE G W region Ω with boundary Γ Given PDE, eg Laplace equation: u = u xx + u yy = δ2 u + δ2 u! = f(x, y) δx 2 δy 2 in Ω and u Γ = q Dirichlet problem How to parallelize? W 1 and G ~ ~ G 1 2 W 2 overlapping W Γ 1 boundary of Ω 1 with Γ 1 unknown values Γ 2 boundary of Ω 2 with Γ 2 unknown values Idea: Solve PDE Ω 1 with some Solve PDE Ω 2 with some estimated boundary values for Γ 1 some estimated boundary values for Γ 2 Exchange boundary values on Γ 1 and Γ 2 overlapping Domain Decomposition 70

72 Second approach: Nonoverlapping DD Dissection method: W 2 W 1 ~ G A 1 0 F 1 0 A 2 F 2 G 1 G 2 A 3 u 1 û 2 û 3 = f 1 f 2 f 3 Solve by Schur complement or preconditioner (S = A 3 G 1 A 1 1 F 1 G 2 A 1 2 F 2 ): A 1 1 M = A 1 2 to solving PDE in Ω 1 and Ω Parallel Computation of the Discrete Fourier Transformation Definition: ω n = e 2πi n ω n ω n (n 1) Y = 1 ω n (n 1) ω n (n 1)(n 1) ; y = DF T (x) ; y k = n 1 x 0 x n 1 ω kj j=0 n x j (k = 0, 1,, n 1) f 1 x f 2 x = f n x collection of n indepent dot-products For a dot-product we can use fan-in algorithm: n-processors O(logn) steps In total: n n processors O(logn) steps for DFT Complexity in parallel for DFT is O(logn) Sequentially the complexity of the DFT is O(n logn) (FFT-method) What is a good sparsity pattern for M? ( ) 1 ( ) A B = B A A priori patterns for M: A, A 2, A 3, triangular ; A T, A 2T, A 3T, orthogonal, (A T A), (A T A) 2 Sparsification: Delete small entries in A, (ɛ), A ɛ, A 2 ɛ, A T ɛ, 71

A priori: a static pattern for computing M. Dynamic minimization, which finds a good pattern automatically: start with the diagonal pattern or with A_ε, for example, and solve the n related LS problems, giving M_j. How to find a better pattern?
min_{µ_k} ||A(M_j + µ_k e_k) - e_j||_2^2 = min_{µ_k} ||(A M_j - e_j) + µ_k A e_k||_2^2 = min_{µ_k} ||r_j + µ_k A e_k||_2^2
with r_j := A M_j - e_j; setting d/dµ_k ||r_j + µ_k A e_k||_2^2 = 0 gives µ_k = - r_j^T A e_k / ||A e_k||_2^2 (in most cases µ_k = 0).
Improvement: ||r_j + µ_k A e_k||_2^2 = ||r_j||_2^2 - (r_j^T A e_k)^2 / ||A e_k||_2^2.
Factorized Sparse Approximate Inverses: for spd A = A^T > 0, A = L_A^T L_A (Cholesky factorization); approximate A^{-1} ≈ L L^T by min_L ||L_A L - I||_F over a prescribed sparsity structure of L, columnwise min ||L_A(I, J) L_K(J) - e_K(I)||. Normal equations:
L_A^T(I, J) L_A(I, J) L_K(J) = L_A^T(I, J) e_K(I), i.e. A(J, J) L_K(J) = L_{A,KK} e_K(I) (diagonal scaling).
First we compute L under the assumption L_{A,KK} = 1 (diagonal entries); then set D = L^T A L and replace L by L D^{-1/2}.
Recursive inverse DFT (ω = e^{2πi/n}):
function (v_0, ..., v_{n-1}) = IDFT(c_0, ..., c_{n-1}, n)
  if n == 1 then v_0 = c_0;
  else
    m = n/2;
    z1 = IDFT(c_0, c_2, c_4, ..., c_{n-2}, m)
    z2 = IDFT(c_1, c_3, c_5, ..., c_{n-1}, m)
    for j = 0, ..., m-1
      v_j = z1_j + ω^j z2_j;
      v_{m+j} = z1_j - ω^j z2_j;
    end for
  end if
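A runnable C sketch of the recursive radix-2 IDFT above, assuming n is a power of two and using C99 complex arithmetic; the function name idft and the use of variable-length arrays for the temporaries are choices of this sketch (for large n one would allocate them on the heap).

    #include <complex.h>
    #include <math.h>

    /* v = IDFT(c) of length n, n a power of two, following the recursion above */
    void idft(const double complex *c, double complex *v, int n)
    {
        if (n == 1) { v[0] = c[0]; return; }

        int m = n / 2;
        double complex ce[m], co[m], z1[m], z2[m];   /* small n assumed */
        for (int j = 0; j < m; j++) {
            ce[j] = c[2*j];                          /* even coefficients */
            co[j] = c[2*j + 1];                      /* odd coefficients  */
        }
        idft(ce, z1, m);
        idft(co, z2, m);

        const double PI = 3.14159265358979323846;
        double complex omega = cexp(2.0 * PI * I / n);   /* e^{2 pi i / n} */
        double complex w = 1.0;
        for (int j = 0; j < m; j++) {
            v[j]     = z1[j] + w * z2[j];
            v[m + j] = z1[j] - w * z2[j];
            w *= omega;                              /* w = omega^j */
        }
    }

The two recursive calls are independent and can be computed in parallel, which is the basis of the parallel FFT with O(log n) depth discussed above.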


More information

Contents. Preface... xi. Introduction...

Contents. Preface... xi. Introduction... Contents Preface... xi Introduction... xv Chapter 1. Computer Architectures... 1 1.1. Different types of parallelism... 1 1.1.1. Overlap, concurrency and parallelism... 1 1.1.2. Temporal and spatial parallelism

More information

Lab 1: Iterative Methods for Solving Linear Systems

Lab 1: Iterative Methods for Solving Linear Systems Lab 1: Iterative Methods for Solving Linear Systems January 22, 2017 Introduction Many real world applications require the solution to very large and sparse linear systems where direct methods such as

More information

OUTLINE ffl CFD: elliptic pde's! Ax = b ffl Basic iterative methods ffl Krylov subspace methods ffl Preconditioning techniques: Iterative methods ILU

OUTLINE ffl CFD: elliptic pde's! Ax = b ffl Basic iterative methods ffl Krylov subspace methods ffl Preconditioning techniques: Iterative methods ILU Preconditioning Techniques for Solving Large Sparse Linear Systems Arnold Reusken Institut für Geometrie und Praktische Mathematik RWTH-Aachen OUTLINE ffl CFD: elliptic pde's! Ax = b ffl Basic iterative

More information

Scientific Computing WS 2018/2019. Lecture 9. Jürgen Fuhrmann Lecture 9 Slide 1

Scientific Computing WS 2018/2019. Lecture 9. Jürgen Fuhrmann Lecture 9 Slide 1 Scientific Computing WS 2018/2019 Lecture 9 Jürgen Fuhrmann juergen.fuhrmann@wias-berlin.de Lecture 9 Slide 1 Lecture 9 Slide 2 Simple iteration with preconditioning Idea: Aû = b iterative scheme û = û

More information

Scientific Computing with Case Studies SIAM Press, Lecture Notes for Unit VII Sparse Matrix

Scientific Computing with Case Studies SIAM Press, Lecture Notes for Unit VII Sparse Matrix Scientific Computing with Case Studies SIAM Press, 2009 http://www.cs.umd.edu/users/oleary/sccswebpage Lecture Notes for Unit VII Sparse Matrix Computations Part 1: Direct Methods Dianne P. O Leary c 2008

More information

Solving linear systems (6 lectures)

Solving linear systems (6 lectures) Chapter 2 Solving linear systems (6 lectures) 2.1 Solving linear systems: LU factorization (1 lectures) Reference: [Trefethen, Bau III] Lecture 20, 21 How do you solve Ax = b? (2.1.1) In numerical linear

More information

Lecture 8: Fast Linear Solvers (Part 7)

Lecture 8: Fast Linear Solvers (Part 7) Lecture 8: Fast Linear Solvers (Part 7) 1 Modified Gram-Schmidt Process with Reorthogonalization Test Reorthogonalization If Av k 2 + δ v k+1 2 = Av k 2 to working precision. δ = 10 3 2 Householder Arnoldi

More information

LINEAR SYSTEMS (11) Intensive Computation

LINEAR SYSTEMS (11) Intensive Computation LINEAR SYSTEMS () Intensive Computation 27-8 prof. Annalisa Massini Viviana Arrigoni EXACT METHODS:. GAUSSIAN ELIMINATION. 2. CHOLESKY DECOMPOSITION. ITERATIVE METHODS:. JACOBI. 2. GAUSS-SEIDEL 2 CHOLESKY

More information

Iterative Methods for Solving A x = b

Iterative Methods for Solving A x = b Iterative Methods for Solving A x = b A good (free) online source for iterative methods for solving A x = b is given in the description of a set of iterative solvers called templates found at netlib: http

More information

Solving Ax = b, an overview. Program

Solving Ax = b, an overview. Program Numerical Linear Algebra Improving iterative solvers: preconditioning, deflation, numerical software and parallelisation Gerard Sleijpen and Martin van Gijzen November 29, 27 Solving Ax = b, an overview

More information

Iterative methods for Linear System of Equations. Joint Advanced Student School (JASS-2009)

Iterative methods for Linear System of Equations. Joint Advanced Student School (JASS-2009) Iterative methods for Linear System of Equations Joint Advanced Student School (JASS-2009) Course #2: Numerical Simulation - from Models to Software Introduction In numerical simulation, Partial Differential

More information

Linear Algebra. Brigitte Bidégaray-Fesquet. MSIAM, September Univ. Grenoble Alpes, Laboratoire Jean Kuntzmann, Grenoble.

Linear Algebra. Brigitte Bidégaray-Fesquet. MSIAM, September Univ. Grenoble Alpes, Laboratoire Jean Kuntzmann, Grenoble. Brigitte Bidégaray-Fesquet Univ. Grenoble Alpes, Laboratoire Jean Kuntzmann, Grenoble MSIAM, 23 24 September 215 Overview 1 Elementary operations Gram Schmidt orthonormalization Matrix norm Conditioning

More information

9.1 Preconditioned Krylov Subspace Methods

9.1 Preconditioned Krylov Subspace Methods Chapter 9 PRECONDITIONING 9.1 Preconditioned Krylov Subspace Methods 9.2 Preconditioned Conjugate Gradient 9.3 Preconditioned Generalized Minimal Residual 9.4 Relaxation Method Preconditioners 9.5 Incomplete

More information

Math 577 Assignment 7

Math 577 Assignment 7 Math 577 Assignment 7 Thanks for Yu Cao 1. Solution. The linear system being solved is Ax = 0, where A is a (n 1 (n 1 matrix such that 2 1 1 2 1 A =......... 1 2 1 1 2 and x = (U 1, U 2,, U n 1. By the

More information

Iterative methods for Linear System

Iterative methods for Linear System Iterative methods for Linear System JASS 2009 Student: Rishi Patil Advisor: Prof. Thomas Huckle Outline Basics: Matrices and their properties Eigenvalues, Condition Number Iterative Methods Direct and

More information

Lecture 18 Classical Iterative Methods

Lecture 18 Classical Iterative Methods Lecture 18 Classical Iterative Methods MIT 18.335J / 6.337J Introduction to Numerical Methods Per-Olof Persson November 14, 2006 1 Iterative Methods for Linear Systems Direct methods for solving Ax = b,

More information

Numerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization /36-725

Numerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization /36-725 Numerical Linear Algebra Primer Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: proximal gradient descent Consider the problem min g(x) + h(x) with g, h convex, g differentiable, and h simple

More information

Scientific Computing: An Introductory Survey

Scientific Computing: An Introductory Survey Scientific Computing: An Introductory Survey Chapter 2 Systems of Linear Equations Prof. Michael T. Heath Department of Computer Science University of Illinois at Urbana-Champaign Copyright c 2002. Reproduction

More information

Lecture Note 7: Iterative methods for solving linear systems. Xiaoqun Zhang Shanghai Jiao Tong University

Lecture Note 7: Iterative methods for solving linear systems. Xiaoqun Zhang Shanghai Jiao Tong University Lecture Note 7: Iterative methods for solving linear systems Xiaoqun Zhang Shanghai Jiao Tong University Last updated: December 24, 2014 1.1 Review on linear algebra Norms of vectors and matrices vector

More information

Preliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012

Preliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012 Instructions Preliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012 The exam consists of four problems, each having multiple parts. You should attempt to solve all four problems. 1.

More information

1 Multiply Eq. E i by λ 0: (λe i ) (E i ) 2 Multiply Eq. E j by λ and add to Eq. E i : (E i + λe j ) (E i )

1 Multiply Eq. E i by λ 0: (λe i ) (E i ) 2 Multiply Eq. E j by λ and add to Eq. E i : (E i + λe j ) (E i ) Direct Methods for Linear Systems Chapter Direct Methods for Solving Linear Systems Per-Olof Persson persson@berkeleyedu Department of Mathematics University of California, Berkeley Math 18A Numerical

More information

Notes on PCG for Sparse Linear Systems

Notes on PCG for Sparse Linear Systems Notes on PCG for Sparse Linear Systems Luca Bergamaschi Department of Civil Environmental and Architectural Engineering University of Padova e-mail luca.bergamaschi@unipd.it webpage www.dmsa.unipd.it/

More information

Conjugate gradient method. Descent method. Conjugate search direction. Conjugate Gradient Algorithm (294)

Conjugate gradient method. Descent method. Conjugate search direction. Conjugate Gradient Algorithm (294) Conjugate gradient method Descent method Hestenes, Stiefel 1952 For A N N SPD In exact arithmetic, solves in N steps In real arithmetic No guaranteed stopping Often converges in many fewer than N steps

More information

G1110 & 852G1 Numerical Linear Algebra

G1110 & 852G1 Numerical Linear Algebra The University of Sussex Department of Mathematics G & 85G Numerical Linear Algebra Lecture Notes Autumn Term Kerstin Hesse (w aw S w a w w (w aw H(wa = (w aw + w Figure : Geometric explanation of the

More information

Topics. The CG Algorithm Algorithmic Options CG s Two Main Convergence Theorems

Topics. The CG Algorithm Algorithmic Options CG s Two Main Convergence Theorems Topics The CG Algorithm Algorithmic Options CG s Two Main Convergence Theorems What about non-spd systems? Methods requiring small history Methods requiring large history Summary of solvers 1 / 52 Conjugate

More information

Math 471 (Numerical methods) Chapter 3 (second half). System of equations

Math 471 (Numerical methods) Chapter 3 (second half). System of equations Math 47 (Numerical methods) Chapter 3 (second half). System of equations Overlap 3.5 3.8 of Bradie 3.5 LU factorization w/o pivoting. Motivation: ( ) A I Gaussian Elimination (U L ) where U is upper triangular

More information

Lecture Notes to Accompany. Scientific Computing An Introductory Survey. by Michael T. Heath. Chapter 2. Systems of Linear Equations

Lecture Notes to Accompany. Scientific Computing An Introductory Survey. by Michael T. Heath. Chapter 2. Systems of Linear Equations Lecture Notes to Accompany Scientific Computing An Introductory Survey Second Edition by Michael T. Heath Chapter 2 Systems of Linear Equations Copyright c 2001. Reproduction permitted only for noncommercial,

More information

FEM and sparse linear system solving

FEM and sparse linear system solving FEM & sparse linear system solving, Lecture 9, Nov 19, 2017 1/36 Lecture 9, Nov 17, 2017: Krylov space methods http://people.inf.ethz.ch/arbenz/fem17 Peter Arbenz Computer Science Department, ETH Zürich

More information

AM205: Assignment 2. i=1

AM205: Assignment 2. i=1 AM05: Assignment Question 1 [10 points] (a) [4 points] For p 1, the p-norm for a vector x R n is defined as: ( n ) 1/p x p x i p ( ) i=1 This definition is in fact meaningful for p < 1 as well, although

More information

Course Notes: Week 1

Course Notes: Week 1 Course Notes: Week 1 Math 270C: Applied Numerical Linear Algebra 1 Lecture 1: Introduction (3/28/11) We will focus on iterative methods for solving linear systems of equations (and some discussion of eigenvalues

More information

Applied Mathematics 205. Unit V: Eigenvalue Problems. Lecturer: Dr. David Knezevic

Applied Mathematics 205. Unit V: Eigenvalue Problems. Lecturer: Dr. David Knezevic Applied Mathematics 205 Unit V: Eigenvalue Problems Lecturer: Dr. David Knezevic Unit V: Eigenvalue Problems Chapter V.4: Krylov Subspace Methods 2 / 51 Krylov Subspace Methods In this chapter we give

More information

The amount of work to construct each new guess from the previous one should be a small multiple of the number of nonzeros in A.

The amount of work to construct each new guess from the previous one should be a small multiple of the number of nonzeros in A. AMSC/CMSC 661 Scientific Computing II Spring 2005 Solution of Sparse Linear Systems Part 2: Iterative methods Dianne P. O Leary c 2005 Solving Sparse Linear Systems: Iterative methods The plan: Iterative

More information

Lecture 9: Numerical Linear Algebra Primer (February 11st)

Lecture 9: Numerical Linear Algebra Primer (February 11st) 10-725/36-725: Convex Optimization Spring 2015 Lecture 9: Numerical Linear Algebra Primer (February 11st) Lecturer: Ryan Tibshirani Scribes: Avinash Siravuru, Guofan Wu, Maosheng Liu Note: LaTeX template

More information

CME 302: NUMERICAL LINEAR ALGEBRA FALL 2005/06 LECTURE 0

CME 302: NUMERICAL LINEAR ALGEBRA FALL 2005/06 LECTURE 0 CME 302: NUMERICAL LINEAR ALGEBRA FALL 2005/06 LECTURE 0 GENE H GOLUB 1 What is Numerical Analysis? In the 1973 edition of the Webster s New Collegiate Dictionary, numerical analysis is defined to be the

More information

Numerical Linear Algebra

Numerical Linear Algebra Numerical Linear Algebra Decompositions, numerical aspects Gerard Sleijpen and Martin van Gijzen September 27, 2017 1 Delft University of Technology Program Lecture 2 LU-decomposition Basic algorithm Cost

More information

Program Lecture 2. Numerical Linear Algebra. Gaussian elimination (2) Gaussian elimination. Decompositions, numerical aspects

Program Lecture 2. Numerical Linear Algebra. Gaussian elimination (2) Gaussian elimination. Decompositions, numerical aspects Numerical Linear Algebra Decompositions, numerical aspects Program Lecture 2 LU-decomposition Basic algorithm Cost Stability Pivoting Cholesky decomposition Sparse matrices and reorderings Gerard Sleijpen

More information

Stabilization and Acceleration of Algebraic Multigrid Method

Stabilization and Acceleration of Algebraic Multigrid Method Stabilization and Acceleration of Algebraic Multigrid Method Recursive Projection Algorithm A. Jemcov J.P. Maruszewski Fluent Inc. October 24, 2006 Outline 1 Need for Algorithm Stabilization and Acceleration

More information

FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION

FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION EDMOND CHOW AND AFTAB PATEL Abstract. This paper presents a new fine-grained parallel algorithm for computing an incomplete LU factorization. All nonzeros

More information

COURSE Numerical methods for solving linear systems. Practical solving of many problems eventually leads to solving linear systems.

COURSE Numerical methods for solving linear systems. Practical solving of many problems eventually leads to solving linear systems. COURSE 9 4 Numerical methods for solving linear systems Practical solving of many problems eventually leads to solving linear systems Classification of the methods: - direct methods - with low number of

More information

APPLIED NUMERICAL LINEAR ALGEBRA

APPLIED NUMERICAL LINEAR ALGEBRA APPLIED NUMERICAL LINEAR ALGEBRA James W. Demmel University of California Berkeley, California Society for Industrial and Applied Mathematics Philadelphia Contents Preface 1 Introduction 1 1.1 Basic Notation

More information

Ax = b. Systems of Linear Equations. Lecture Notes to Accompany. Given m n matrix A and m-vector b, find unknown n-vector x satisfying

Ax = b. Systems of Linear Equations. Lecture Notes to Accompany. Given m n matrix A and m-vector b, find unknown n-vector x satisfying Lecture Notes to Accompany Scientific Computing An Introductory Survey Second Edition by Michael T Heath Chapter Systems of Linear Equations Systems of Linear Equations Given m n matrix A and m-vector

More information

Scientific Computing: Solving Linear Systems

Scientific Computing: Solving Linear Systems Scientific Computing: Solving Linear Systems Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 Course MATH-GA.2043 or CSCI-GA.2112, Spring 2012 September 17th and 24th, 2015 A. Donev (Courant

More information

A hybrid reordered Arnoldi method to accelerate PageRank computations

A hybrid reordered Arnoldi method to accelerate PageRank computations A hybrid reordered Arnoldi method to accelerate PageRank computations Danielle Parker Final Presentation Background Modeling the Web The Web The Graph (A) Ranks of Web pages v = v 1... Dominant Eigenvector

More information

Solving Linear Systems of Equations

Solving Linear Systems of Equations 1 Solving Linear Systems of Equations Many practical problems could be reduced to solving a linear system of equations formulated as Ax = b This chapter studies the computational issues about directly

More information

Iterative Methods and Multigrid

Iterative Methods and Multigrid Iterative Methods and Multigrid Part 3: Preconditioning 2 Eric de Sturler Preconditioning The general idea behind preconditioning is that convergence of some method for the linear system Ax = b can be

More information

Boundary Value Problems - Solving 3-D Finite-Difference problems Jacob White

Boundary Value Problems - Solving 3-D Finite-Difference problems Jacob White Introduction to Simulation - Lecture 2 Boundary Value Problems - Solving 3-D Finite-Difference problems Jacob White Thanks to Deepak Ramaswamy, Michal Rewienski, and Karen Veroy Outline Reminder about

More information

Equality: Two matrices A and B are equal, i.e., A = B if A and B have the same order and the entries of A and B are the same.

Equality: Two matrices A and B are equal, i.e., A = B if A and B have the same order and the entries of A and B are the same. Introduction Matrix Operations Matrix: An m n matrix A is an m-by-n array of scalars from a field (for example real numbers) of the form a a a n a a a n A a m a m a mn The order (or size) of A is m n (read

More information

PRECONDITIONING IN THE PARALLEL BLOCK-JACOBI SVD ALGORITHM

PRECONDITIONING IN THE PARALLEL BLOCK-JACOBI SVD ALGORITHM Proceedings of ALGORITMY 25 pp. 22 211 PRECONDITIONING IN THE PARALLEL BLOCK-JACOBI SVD ALGORITHM GABRIEL OKŠA AND MARIÁN VAJTERŠIC Abstract. One way, how to speed up the computation of the singular value

More information

Iterative Methods for Linear Systems

Iterative Methods for Linear Systems Iterative Methods for Linear Systems 1. Introduction: Direct solvers versus iterative solvers In many applications we have to solve a linear system Ax = b with A R n n and b R n given. If n is large the

More information

Direct solution methods for sparse matrices. p. 1/49

Direct solution methods for sparse matrices. p. 1/49 Direct solution methods for sparse matrices p. 1/49 p. 2/49 Direct solution methods for sparse matrices Solve Ax = b, where A(n n). (1) Factorize A = LU, L lower-triangular, U upper-triangular. (2) Solve

More information

Poisson Solvers. William McLean. April 21, Return to Math3301/Math5315 Common Material.

Poisson Solvers. William McLean. April 21, Return to Math3301/Math5315 Common Material. Poisson Solvers William McLean April 21, 2004 Return to Math3301/Math5315 Common Material 1 Introduction Many problems in applied mathematics lead to a partial differential equation of the form a 2 u +

More information

Multilevel low-rank approximation preconditioners Yousef Saad Department of Computer Science and Engineering University of Minnesota

Multilevel low-rank approximation preconditioners Yousef Saad Department of Computer Science and Engineering University of Minnesota Multilevel low-rank approximation preconditioners Yousef Saad Department of Computer Science and Engineering University of Minnesota SIAM CSE Boston - March 1, 2013 First: Joint work with Ruipeng Li Work

More information

5.1 Banded Storage. u = temperature. The five-point difference operator. uh (x, y + h) 2u h (x, y)+u h (x, y h) uh (x + h, y) 2u h (x, y)+u h (x h, y)

5.1 Banded Storage. u = temperature. The five-point difference operator. uh (x, y + h) 2u h (x, y)+u h (x, y h) uh (x + h, y) 2u h (x, y)+u h (x h, y) 5.1 Banded Storage u = temperature u= u h temperature at gridpoints u h = 1 u= Laplace s equation u= h u = u h = grid size u=1 The five-point difference operator 1 u h =1 uh (x + h, y) 2u h (x, y)+u h

More information

Review of matrices. Let m, n IN. A rectangle of numbers written like A =

Review of matrices. Let m, n IN. A rectangle of numbers written like A = Review of matrices Let m, n IN. A rectangle of numbers written like a 11 a 12... a 1n a 21 a 22... a 2n A =...... a m1 a m2... a mn where each a ij IR is called a matrix with m rows and n columns or an

More information

Computational Linear Algebra

Computational Linear Algebra Computational Linear Algebra PD Dr. rer. nat. habil. Ralf Peter Mundani Computation in Engineering / BGU Scientific Computing in Computer Science / INF Winter Term 2017/18 Part 2: Direct Methods PD Dr.

More information

Jordan Journal of Mathematics and Statistics (JJMS) 5(3), 2012, pp A NEW ITERATIVE METHOD FOR SOLVING LINEAR SYSTEMS OF EQUATIONS

Jordan Journal of Mathematics and Statistics (JJMS) 5(3), 2012, pp A NEW ITERATIVE METHOD FOR SOLVING LINEAR SYSTEMS OF EQUATIONS Jordan Journal of Mathematics and Statistics JJMS) 53), 2012, pp.169-184 A NEW ITERATIVE METHOD FOR SOLVING LINEAR SYSTEMS OF EQUATIONS ADEL H. AL-RABTAH Abstract. The Jacobi and Gauss-Seidel iterative

More information

Gaussian Elimination for Linear Systems

Gaussian Elimination for Linear Systems Gaussian Elimination for Linear Systems Tsung-Ming Huang Department of Mathematics National Taiwan Normal University October 3, 2011 1/56 Outline 1 Elementary matrices 2 LR-factorization 3 Gaussian elimination

More information

FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION

FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION EDMOND CHOW AND AFTAB PATEL Abstract. This paper presents a new fine-grained parallel algorithm for computing an incomplete LU factorization. All nonzeros

More information

Solving linear equations with Gaussian Elimination (I)

Solving linear equations with Gaussian Elimination (I) Term Projects Solving linear equations with Gaussian Elimination The QR Algorithm for Symmetric Eigenvalue Problem The QR Algorithm for The SVD Quasi-Newton Methods Solving linear equations with Gaussian

More information

Bindel, Fall 2016 Matrix Computations (CS 6210) Notes for

Bindel, Fall 2016 Matrix Computations (CS 6210) Notes for 1 Iteration basics Notes for 2016-11-07 An iterative solver for Ax = b is produces a sequence of approximations x (k) x. We always stop after finitely many steps, based on some convergence criterion, e.g.

More information

Numerical Methods I Solving Square Linear Systems: GEM and LU factorization

Numerical Methods I Solving Square Linear Systems: GEM and LU factorization Numerical Methods I Solving Square Linear Systems: GEM and LU factorization Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 MATH-GA 2011.003 / CSCI-GA 2945.003, Fall 2014 September 18th,

More information

The Solution of Linear Systems AX = B

The Solution of Linear Systems AX = B Chapter 2 The Solution of Linear Systems AX = B 21 Upper-triangular Linear Systems We will now develop the back-substitution algorithm, which is useful for solving a linear system of equations that has

More information

Lecture # 20 The Preconditioned Conjugate Gradient Method

Lecture # 20 The Preconditioned Conjugate Gradient Method Lecture # 20 The Preconditioned Conjugate Gradient Method We wish to solve Ax = b (1) A R n n is symmetric and positive definite (SPD). We then of n are being VERY LARGE, say, n = 10 6 or n = 10 7. Usually,

More information

Numerical Linear Algebra And Its Applications

Numerical Linear Algebra And Its Applications Numerical Linear Algebra And Its Applications Xiao-Qing JIN 1 Yi-Min WEI 2 August 29, 2008 1 Department of Mathematics, University of Macau, Macau, P. R. China. 2 Department of Mathematics, Fudan University,

More information

M.A. Botchev. September 5, 2014

M.A. Botchev. September 5, 2014 Rome-Moscow school of Matrix Methods and Applied Linear Algebra 2014 A short introduction to Krylov subspaces for linear systems, matrix functions and inexact Newton methods. Plan and exercises. M.A. Botchev

More information

Boundary Value Problems and Iterative Methods for Linear Systems

Boundary Value Problems and Iterative Methods for Linear Systems Boundary Value Problems and Iterative Methods for Linear Systems 1. Equilibrium Problems 1.1. Abstract setting We want to find a displacement u V. Here V is a complete vector space with a norm v V. In

More information

The Lanczos and conjugate gradient algorithms

The Lanczos and conjugate gradient algorithms The Lanczos and conjugate gradient algorithms Gérard MEURANT October, 2008 1 The Lanczos algorithm 2 The Lanczos algorithm in finite precision 3 The nonsymmetric Lanczos algorithm 4 The Golub Kahan bidiagonalization

More information

CLASSICAL ITERATIVE METHODS

CLASSICAL ITERATIVE METHODS CLASSICAL ITERATIVE METHODS LONG CHEN In this notes we discuss classic iterative methods on solving the linear operator equation (1) Au = f, posed on a finite dimensional Hilbert space V = R N equipped

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra)

AMS526: Numerical Analysis I (Numerical Linear Algebra) AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 21: Sensitivity of Eigenvalues and Eigenvectors; Conjugate Gradient Method Xiangmin Jiao Stony Brook University Xiangmin Jiao Numerical Analysis

More information

Fundamentals of Numerical Linear Algebra

Fundamentals of Numerical Linear Algebra Fundamentals of Numerical Linear Algebra Seongjai Kim Department of Mathematics and Statistics Mississippi State University Mississippi State, MS 39762 USA Email: skim@math.msstate.edu Updated: November

More information

MAT 610: Numerical Linear Algebra. James V. Lambers

MAT 610: Numerical Linear Algebra. James V. Lambers MAT 610: Numerical Linear Algebra James V Lambers January 16, 2017 2 Contents 1 Matrix Multiplication Problems 7 11 Introduction 7 111 Systems of Linear Equations 7 112 The Eigenvalue Problem 8 12 Basic

More information

Scientific Computing: Dense Linear Systems

Scientific Computing: Dense Linear Systems Scientific Computing: Dense Linear Systems Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 Course MATH-GA.2043 or CSCI-GA.2112, Spring 2012 February 9th, 2012 A. Donev (Courant Institute)

More information

14.2 QR Factorization with Column Pivoting

14.2 QR Factorization with Column Pivoting page 531 Chapter 14 Special Topics Background Material Needed Vector and Matrix Norms (Section 25) Rounding Errors in Basic Floating Point Operations (Section 33 37) Forward Elimination and Back Substitution

More information

Numerical Linear Algebra

Numerical Linear Algebra Numerical Linear Algebra The two principal problems in linear algebra are: Linear system Given an n n matrix A and an n-vector b, determine x IR n such that A x = b Eigenvalue problem Given an n n matrix

More information

Classical iterative methods for linear systems

Classical iterative methods for linear systems Classical iterative methods for linear systems Ed Bueler MATH 615 Numerical Analysis of Differential Equations 27 February 1 March, 2017 Ed Bueler (MATH 615 NADEs) Classical iterative methods for linear

More information

Matrix Computations: Direct Methods II. May 5, 2014 Lecture 11

Matrix Computations: Direct Methods II. May 5, 2014 Lecture 11 Matrix Computations: Direct Methods II May 5, 2014 ecture Summary You have seen an example of how a typical matrix operation (an important one) can be reduced to using lower level BS routines that would

More information

The Conjugate Gradient Method

The Conjugate Gradient Method The Conjugate Gradient Method Classical Iterations We have a problem, We assume that the matrix comes from a discretization of a PDE. The best and most popular model problem is, The matrix will be as large

More information

DEN: Linear algebra numerical view (GEM: Gauss elimination method for reducing a full rank matrix to upper-triangular

DEN: Linear algebra numerical view (GEM: Gauss elimination method for reducing a full rank matrix to upper-triangular form) Given: matrix C = (c i,j ) n,m i,j=1 ODE and num math: Linear algebra (N) [lectures] c phabala 2016 DEN: Linear algebra numerical view (GEM: Gauss elimination method for reducing a full rank matrix

More information

Jae Heon Yun and Yu Du Han

Jae Heon Yun and Yu Du Han Bull. Korean Math. Soc. 39 (2002), No. 3, pp. 495 509 MODIFIED INCOMPLETE CHOLESKY FACTORIZATION PRECONDITIONERS FOR A SYMMETRIC POSITIVE DEFINITE MATRIX Jae Heon Yun and Yu Du Han Abstract. We propose

More information