IV-1 Parallel Scientific Computing

Matrix-vector multiplication. Matrix-matrix multiplication.
Direct method for solving a linear system: Gaussian elimination.
Iterative methods for solving a linear system: Jacobi, Gauss-Seidel.
Sparse linear systems and differential equations.
IV-2 Matrix-Matrix Multiplication

Problem: $C = A \times B$, where $A$ and $B$ are $n \times n$ matrices.

Sequential code:

    for i = 1 to n do
      for j = 1 to n do
        sum = 0;
        for k = 1 to n do
          sum = sum + a[i,k] * b[k,j];
        endfor
        c[i,j] = sum;
      endfor
    endfor
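IV-3 An example of $A \times B$

\[
A \times B =
\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}
\begin{pmatrix} 5 & 7 \\ 6 & 8 \end{pmatrix}
=
\begin{pmatrix}
1\cdot 5 + 2\cdot 6 & 1\cdot 7 + 2\cdot 8 \\
3\cdot 5 + 4\cdot 6 & 3\cdot 7 + 4\cdot 8
\end{pmatrix}
=
\begin{pmatrix} 17 & 23 \\ 39 & 53 \end{pmatrix}
\]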
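The sequential triple loop can be sketched in Python and checked against the 2 x 2 example. This is a minimal illustration using 0-based lists of rows; the function name `matmul` is chosen here and is not part of the slides.

```python
def matmul(a, b):
    """Triple-loop product of two n x n matrices stored as lists of rows."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = 0
            for k in range(n):
                total += a[i][k] * b[k][j]
            c[i][j] = total
    return c


A = [[1, 2], [3, 4]]
B = [[5, 7], [6, 8]]
print(matmul(A, B))  # [[17, 23], [39, 53]]
```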
IV-4 Task graph of $C = A \times B$

Partitioned code:

    for i = 1 to n do
      T_i:  for j = 1 to n do
              sum = 0;
              for k = 1 to n do
                sum = sum + a[i,k] * b[k,j];
              endfor
              c[i,j] = sum;
            endfor
    endfor

Task $T_i$: reads row $A_i$ and matrix $B$; writes row $C_i$.

Task graph: $T_1, T_2, T_3, \ldots, T_n$ are $n$ independent tasks.
IV-5 Task and data mapping for $C = A \times B$

SPMD code:

    for i = 1 to n
      if proc_map(i) = me, do T_i

Data mapping:
A is partitioned using row-wise block mapping.
C is partitioned using row-wise block mapping.
B is duplicated on all processors.

Changes in the code of $T_i$ (the global row index becomes a local index):
$a_{i,k} \rightarrow a_{\mathrm{local}(i),k}$ and $c_{i,j} \rightarrow c_{\mathrm{local}(i),j}$.
IV-6 Parallel SPMD code of $C = A \times B$

    for i = 1 to n do
      if proc_map(i) = me do
        for j = 1 to n do
          sum = 0;
          for k = 1 to n do
            sum = sum + a[local(i),k] * b[k,j];
          endfor
          c[local(i),j] = sum;
        endfor
      endif
    endfor
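The SPMD code can be imitated in a single Python process by looping over the processor id `me`. This is only a sketch: real SPMD code would run the loop body once per processor (for example under MPI), and the concrete forms of `proc_map` and `local` below assume a row-wise block mapping in which p divides n and indices are 0-based.

```python
def proc_map(i, n, p):
    """Owner of global row i under row-wise block mapping (assumes p divides n)."""
    return i // (n // p)


def local(i, n, p):
    """Position of global row i inside its owner's block."""
    return i % (n // p)


def spmd_matmul(a, b, p):
    n = len(a)
    # Each simulated processor stores only its own rows of A and C; B is duplicated.
    a_local = [[a[i] for i in range(n) if proc_map(i, n, p) == me] for me in range(p)]
    c_local = [[[0] * n for _ in range(n // p)] for _ in range(p)]
    for me in range(p):  # stands in for p processors running the same program
        for i in range(n):
            if proc_map(i, n, p) == me:
                li = local(i, n, p)
                for j in range(n):
                    total = 0
                    for k in range(n):
                        total += a_local[me][li][k] * b[k][j]
                    c_local[me][li][j] = total
    # Concatenate the row blocks in processor order to recover C.
    return [row for block in c_local for row in block]
```

Calling `spmd_matmul(a, b, p)` with p = 1 reduces to the sequential algorithm; larger p only changes which simulated processor touches which rows.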
IV-7 Parallel algorithm with 1D partitioning

Partitioned code:

    for i = 1 to n do
      for j = 1 to n do
        T_{i,j}:  sum = 0;
                  for k = 1 to n do
                    sum = sum + a(i,k) * b(k,j);
                  endfor
                  c(i,j) = sum;
      endfor
    endfor

Data access: each task $T_{i,j}$ reads row $A_i$ and column $B_j$ and writes the element $c_{i,j}$.
Task graph: $n^2$ independent tasks:

    T_{1,1}  T_{1,2}  ...  T_{1,n}
    T_{2,1}  T_{2,2}  ...  T_{2,n}
    ...
    T_{n,1}  T_{n,2}  ...  T_{n,n}

IV-8 Mapping

Matrix A is partitioned using row-wise block mapping.
Matrix C is partitioned using row-wise block mapping.
Matrix B is partitioned using column-wise block mapping.
Task $T_{i,j}$ is mapped to the processor that owns row $i$ of matrix A.

    Cluster 1:  T_{1,1}  T_{1,2}  ...  T_{1,n}
    Cluster 2:  T_{2,1}  T_{2,2}  ...  T_{2,n}
    ...
    Cluster n:  T_{n,1}  T_{n,2}  ...  T_{n,n}
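IV-9 Parallel algorithm:

    For j = 1 to n
      Broadcast column B_j to all processors.
      Do tasks T_{1,j}, T_{2,j}, ..., T_{n,j} in parallel.
    Endfor

Evaluation:
Each multiplication or addition counts as one time unit $\omega$.
Each task $T_{i,j}$ costs $2n\omega$ ($n$ multiplications and $n$ additions).
Assume that each broadcast costs $(\alpha + \beta n)\log p$.

\[
PT = \sum_{j=1}^{n}\Bigl((\alpha+\beta n)\log p + \frac{n}{p}\,2n\omega\Bigr)
   = n(\alpha+\beta n)\log p + \frac{2n^3\omega}{p}.
\]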
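The cost model can be evaluated numerically to see how the communication and computation terms trade off. The sketch below assumes that each task costs 2nω (n multiplications plus n additions) and that the logarithm is base 2; the slides do not state the base, and the function name `parallel_time` is hypothetical.

```python
import math


def parallel_time(n, p, alpha, beta, omega):
    """Predicted time: n broadcasts of a length-n column plus 2n^3/p arithmetic operations."""
    communication = n * (alpha + beta * n) * math.log2(p)  # base 2 is an assumption
    computation = 2 * n**3 * omega / p
    return communication + computation


# Hypothetical machine parameters, chosen only to illustrate the trend.
for p in (1, 2, 4, 8):
    print(p, parallel_time(n=512, p=p, alpha=100.0, beta=1.0, omega=1.0))
```

With these made-up parameters the computation term shrinks as 1/p while the broadcast term grows with log p, so the predicted speedup flattens for large p.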
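IV-10 Gaussian Elimination: Direct Method for Solving a Linear System

\[
\begin{aligned}
(1)\quad & 4x_1 - 9x_2 + 2x_3 = 2\\
(2)\quad & 2x_1 - 4x_2 + 4x_3 = 3\\
(3)\quad & -x_1 + 2x_2 + 2x_3 = 1
\end{aligned}
\]

Eliminate $x_1$ from (2) and (3):

\[
\begin{aligned}
(2) - (1)\cdot\tfrac{2}{4}:\quad & \tfrac{1}{2}x_2 + 3x_3 = 2 & (4)\\
(3) - (1)\cdot\bigl(-\tfrac{1}{4}\bigr):\quad & -\tfrac{1}{4}x_2 + \tfrac{5}{2}x_3 = \tfrac{3}{2} & (5)
\end{aligned}
\]

Eliminate $x_2$ from (5):

\[
(5) - (4)\cdot\bigl(-\tfrac{1}{2}\bigr):\quad 4x_3 = \tfrac{5}{2}
\]

Resulting upper triangular system:

\[
\begin{aligned}
4x_1 - 9x_2 + 2x_3 &= 2\\
\tfrac{1}{2}x_2 + 3x_3 &= 2\\
4x_3 &= \tfrac{5}{2}
\end{aligned}
\]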
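IV-11 Backward substitution:

\[
\begin{aligned}
x_3 &= \frac{5/2}{4} = \frac{5}{8}\\
x_2 &= \frac{2 - 3x_3}{1/2} = \frac{1}{4}\\
x_1 &= \frac{2 + 9x_2 - 2x_3}{4} = \frac{3}{4}
\end{aligned}
\]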
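IV-12 GE on Augmented Matrices

Use an augmented matrix to express the elimination process for solving $Ax = b$.
Augmented matrix: $(A \mid b)$.

\[
\left(\begin{array}{rrr|r}
4 & -9 & 2 & 2\\
2 & -4 & 4 & 3\\
-1 & 2 & 2 & 1
\end{array}\right)
\xrightarrow[(3)=(3)-(1)\cdot\left(-\frac{1}{4}\right)]{(2)=(2)-(1)\cdot\frac{2}{4}}
\left(\begin{array}{rrr|r}
4 & -9 & 2 & 2\\
0 & \frac{1}{2} & 3 & 2\\
0 & -\frac{1}{4} & \frac{5}{2} & \frac{3}{2}
\end{array}\right)
\xrightarrow{(3)=(3)-(2)\cdot\left(-\frac{1}{2}\right)}
\left(\begin{array}{rrr|r}
4 & -9 & 2 & 2\\
0 & \frac{1}{2} & 3 & 2\\
0 & 0 & 4 & \frac{5}{2}
\end{array}\right)
\]

Column $n+1$ of the augmented matrix stores the vector $b$.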
IV-13 Gaussian Elimination Algorithm

Forward Elimination

    For k = 1 to n-1
      For i = k+1 to n
        a[i,k] = a[i,k] / a[k,k];
        For j = k+1 to n+1
          a[i,j] = a[i,j] - a[i,k] * a[k,j];
        endfor
      endfor
    endfor

Loop k controls the elimination steps. Loop i selects the row being updated, and loop j selects the column within that row.
IV-14 Backward Substitution

Note that $x_i$ uses the storage of $a_{i,n+1}$.

    For i = n to 1
      For j = i+1 to n
        x[i] = x[i] - a[i,j] * x[j];
      Endfor
      x[i] = x[i] / a[i,i];
    Endfor
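Forward elimination and backward substitution can be combined into one short routine operating on the augmented matrix. The sketch below uses 0-based indices and exact `Fraction` arithmetic so the result can be compared with the worked example; it does no pivoting, so it assumes every pivot is nonzero. The name `gaussian_elimination` is chosen for illustration.

```python
from fractions import Fraction


def gaussian_elimination(aug):
    """Solve Ax = b given the n x (n+1) augmented matrix (A | b); no pivoting."""
    n = len(aug)
    a = [[Fraction(v) for v in row] for row in aug]  # exact working copy
    # Forward elimination: the multiplier a[i][k] is stored in place.
    for k in range(n - 1):
        for i in range(k + 1, n):
            a[i][k] = a[i][k] / a[k][k]
            for j in range(k + 1, n + 1):
                a[i][j] -= a[i][k] * a[k][j]
    # Backward substitution: x_i reuses the storage of column n+1 (index n).
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n):
            a[i][n] -= a[i][j] * a[j][n]
        a[i][n] /= a[i][i]
    return [row[n] for row in a]


x = gaussian_elimination([[4, -9, 2, 2], [2, -4, 4, 3], [-1, 2, 2, 1]])
print([str(v) for v in x])  # ['3/4', '1/4', '5/8']
```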
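IV-15 Algorithm Complexity

Each division, multiplication, or subtraction counts as one time unit $\omega$. Ignore loop overhead.

Number of operations in forward elimination:

\[
\sum_{k=1}^{n-1}\sum_{i=k+1}^{n}\Bigl(1+\sum_{j=k+1}^{n+1}2\Bigr)\omega
= \sum_{k=1}^{n-1}\sum_{i=k+1}^{n}\bigl(2(n-k)+3\bigr)\omega
\approx 2\omega\sum_{k=1}^{n-1}(n-k)^2
\approx \frac{2n^3}{3}\,\omega
\]

Number of operations in backward substitution:

\[
\sum_{k=1}^{n}\Bigl(1+\sum_{i=k+1}^{n}2\Bigr)\omega
\approx 2\omega\sum_{k=1}^{n}(n-k)
\approx n^2\,\omega
\]

Total number of operations: $\approx \frac{2n^3}{3}\,\omega$.
Total space: $\approx n^2$ double-precision numbers.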
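The leading term of the forward-elimination count can be checked by counting operations directly. The sketch below charges one unit for the division and two units (multiply and subtract) for each column update, under the same assumptions as the slide; the function name is hypothetical.

```python
def forward_elimination_ops(n):
    """Count divisions, multiplications and subtractions in forward elimination."""
    ops = 0
    for k in range(1, n):                # k = 1 .. n-1
        for i in range(k + 1, n + 1):    # i = k+1 .. n
            ops += 1                     # a_ik = a_ik / a_kk
            ops += 2 * (n + 1 - k)       # j = k+1 .. n+1, two operations each
    return ops


for n in (10, 100, 1000):
    exact = forward_elimination_ops(n)
    print(n, exact, exact / (2 * n**3 / 3))  # the ratio approaches 1 as n grows
```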
IV-16 Parallel Row-Oriented GE

    For k = 1 to n-1
      For i = k+1 to n
        T_k^i:  a[i,k] = a[i,k] / a[k,k]
                For j = k+1 to n+1
                  a[i,j] = a[i,j] - a[i,k] * a[k,j]
                EndFor

Task $T_k^i$: reads rows $A_k$ and $A_i$; writes row $A_i$.

Dependence graph (one level per elimination step):

    k = 1:    T_1^2  T_1^3  T_1^4  ...  T_1^n
    k = 2:           T_2^3  T_2^4  ...  T_2^n
    k = 3:                  T_3^4  ...  T_3^n
    ...
    k = n-1:                            T_{n-1}^n
IV-17 Parallelism and Scheduling

Parallelism: for a fixed $k$, tasks $T_k^{k+1}, T_k^{k+2}, \ldots, T_k^{n}$ are independent.

Parallel algorithm (basic idea):

    For k = 1 to n-1
      Do T_k^{k+1}, T_k^{k+2}, ..., T_k^n in parallel on p processors.
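IV-18 Task Mapping

Define n clusters, where cluster $C_i$ holds the tasks that update row $i$:

    C_1 = (empty)
    C_2 = { T_1^2 }
    C_3 = { T_1^3, T_2^3 }
    ...
    C_n = { T_1^n, T_2^n, ..., T_{n-1}^n }

Map the n clusters to p processors, $C_k \rightarrow \mathrm{proc\_map}(k)$, using either cyclic or block mapping.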
IV-19 Block vs. Cyclic Mapping

If block mapping is used:
Profile the computation load of $C_2, C_3, \ldots, C_n$.
[Figure: load per cluster for $C_2, C_3, C_4, \ldots, C_n$]
Then $\mathrm{Load}(P_0) < \mathrm{Load}(P_1) < \cdots < \mathrm{Load}(P_{p-1})$.
Load is NOT balanced among processors!

If cyclic mapping is used:
Load is balanced.
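The imbalance can be made concrete by adding up the work in each cluster. The sketch below assumes that cluster $C_i$ holds tasks $T_k^i$ for $k = 1, \ldots, i-1$ and that each such task costs $2(n-k)+3$ operations, as in the complexity analysis; the helper names are hypothetical.

```python
def cluster_load(i, n):
    """Operations in cluster C_i: tasks T_k^i for k = 1 .. i-1."""
    return sum(2 * (n - k) + 3 for k in range(1, i))


def processor_loads(n, p, mapping):
    """Total load per processor under 'block' or 'cyclic' mapping of clusters."""
    loads = [0] * p
    for i in range(1, n + 1):
        if mapping == "block":
            owner = (i - 1) * p // n   # consecutive clusters share a processor
        else:
            owner = (i - 1) % p        # clusters are dealt out round-robin
        loads[owner] += cluster_load(i, n)
    return loads


print(processor_loads(16, 4, "block"))   # [190, 622, 926, 1102]
print(processor_loads(16, 4, "cyclic"))  # [592, 676, 752, 820]
```

For n = 16 and p = 4 the heaviest block-mapped processor does almost six times the work of the lightest, while the cyclic loads stay within a factor of about 1.4 of each other.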
Parallel Algorithm:

    Proc 0 broadcasts Row 1.
    For k = 1 to n-1
      Do T_k^{k+1}, ..., T_k^n in parallel (T_k^i runs on proc_map(i)).
      Broadcast row k+1.
    endfor

IV-20 SPMD Code:

    me = mynode();
    For i = 1 to n
      if proc_map(i) == me, initialize Row i;
    If proc_map(1) == me, broadcast Row 1 else receive it;
    For k = 1 to n-1
      For i = k+1 to n
        If proc_map(i) == me, do T_k^i
      If proc_map(k+1) == me, then broadcast Row k+1 else receive it.
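The row-oriented parallel algorithm can be simulated in one Python process to check that the owner-computes rule and the row broadcasts produce the same triangular system as the sequential code. This is a sketch only: the loop over `me` stands in for p concurrent processors, the "broadcast" is a plain copy, the mapping is assumed cyclic, and indices are 0-based.

```python
from fractions import Fraction


def parallel_row_ge(aug, p):
    """Simulate row-oriented forward elimination on p processors (cyclic mapping)."""
    n = len(aug)

    def proc_map(i):
        return i % p

    # local_rows[me] holds only the rows owned by processor me.
    local_rows = [
        {i: [Fraction(v) for v in aug[i]] for i in range(n) if proc_map(i) == me}
        for me in range(p)
    ]
    pivot = list(local_rows[proc_map(0)][0])      # owner broadcasts the first row
    for k in range(n - 1):
        for me in range(p):                       # conceptually concurrent
            for i, row in local_rows[me].items():
                if i > k:                         # task T_k^i on its owner
                    row[k] /= pivot[k]
                    for j in range(k + 1, n + 1):
                        row[j] -= row[k] * pivot[j]
        pivot = list(local_rows[proc_map(k + 1)][k + 1])  # broadcast next pivot row
    return [local_rows[proc_map(i)][i] for i in range(n)]


rows = parallel_row_ge([[4, -9, 2, 2], [2, -4, 4, 3], [-1, 2, 2, 1]], p=2)
print([str(v) for v in rows[2][2:]])  # ['4', '5/2']
```

As in the sequential code, the entries below the diagonal end up holding the multipliers rather than zeros.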
IV-21 Column-Oriented GE

Interchange loops i and j of the row-oriented GE.

    For k = 1 to n-1
      For i = k+1 to n
        a[i,k] = a[i,k] / a[k,k]
      EndFor
      For j = k+1 to n+1
        For i = k+1 to n
          a[i,j] = a[i,j] - a[i,k] * a[k,j]
        EndFor
      EndFor
    EndFor
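Impact on data accessing patterns

IV-22 Example

\[
\left(\begin{array}{rrr|r}
4 & -9 & 2 & 2\\
2 & -4 & 4 & 3\\
-1 & 2 & 2 & 1
\end{array}\right)
\xrightarrow[(3)=(3)-(1)\cdot\left(-\frac{1}{4}\right)]{(2)=(2)-(1)\cdot\frac{2}{4}}
\left(\begin{array}{rrr|r}
4 & -9 & 2 & 2\\
0 & \frac{1}{2} & 3 & 2\\
0 & -\frac{1}{4} & \frac{5}{2} & \frac{3}{2}
\end{array}\right)
\]

Order in which the eight entries of rows 2 and 3 are written during this step.

Data writing sequence for row-oriented GE:

    1 2 3 4
    5 6 7 8

Data writing sequence for column-oriented GE:

    1 3 5 7
    2 4 6 8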
IV-23 Column-oriented backward substitution

Interchange loops i and j in the row-oriented backward substitution code.

    For j = n to 1
      x[j] = x[j] / a[j,j];
      For i = j-1 to 1
        x[i] = x[i] - a[i,j] * x[j];
      Endfor
    EndFor

For example, given:

\[
\begin{aligned}
4x_1 - 9x_2 + 2x_3 &= 2\\
0.5x_2 + 3x_3 &= 2\\
4x_3 &= \tfrac{5}{2}
\end{aligned}
\]
IV-24 The row-oriented algorithm performs:

    x_3 = 5/8
    x_2 = 2 - 3 x_3
    x_2 = x_2 / 0.5
    x_1 = 2 + 9 x_2
    x_1 = x_1 - 2 x_3
    x_1 = x_1 / 4

The column-oriented algorithm performs:

    x_3 = 5/8
    x_2 = 2 - 3 x_3
    x_1 = 2 - 2 x_3
    x_2 = x_2 / 0.5
    x_1 = x_1 + 9 x_2
    x_1 = x_1 / 4
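The column-oriented order of operations can be reproduced with a short routine. The sketch below takes the upper triangular matrix and the right-hand side as separate arguments (a choice made here for readability), uses 0-based indices, and works in exact fractions.

```python
from fractions import Fraction


def column_back_substitution(u, rhs):
    """Column-oriented backward substitution for an upper triangular system."""
    n = len(u)
    x = [Fraction(v) for v in rhs]
    for j in range(n - 1, -1, -1):
        x[j] = x[j] / u[j][j]            # x_j is final after this division
        for i in range(j - 1, -1, -1):
            x[i] -= u[i][j] * x[j]       # remove x_j's contribution from rows above
    return x


U = [[4, -9, 2], [0, Fraction(1, 2), 3], [0, 0, 4]]
rhs = [2, 2, Fraction(5, 2)]
print([str(v) for v in column_back_substitution(U, rhs)])  # ['3/4', '1/4', '5/8']
```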
IV-25 Parallel Column-Oriented GE

Partitioned code:

    For k = 1 to n-1
      T_k^k:  For i = k+1 to n
                a[i,k] = a[i,k] / a[k,k]
      For j = k+1 to n+1
        T_k^j:  For i = k+1 to n
                  a[i,j] = a[i,j] - a[i,k] * a[k,j]
IV-26 Task graph:

    k = 1:    T_1^1  ->  T_1^2  T_1^3  T_1^4  ...  T_1^{n+1}
    k = 2:    T_2^2  ->  T_2^3  T_2^4  ...  T_2^{n+1}
    k = 3:    T_3^3  ->  T_3^4  ...  T_3^{n+1}
    ...
    k = n-1:  T_{n-1}^{n-1}  ->  T_{n-1}^n  T_{n-1}^{n+1}

Schedule: ?
SPMD code: ?
IV-27 Column-oriented backward substitution

Partitioning:

    For j = n to 1
      S_j^x:  x[j] = x[j] / a[j,j];
              For i = j-1 to 1
                x[i] = x[i] - a[i,j] * x[j];
              Endfor
    EndFor

Dependence: $S_n^x \rightarrow S_{n-1}^x \rightarrow \cdots \rightarrow S_1^x$.
IV-28 Parallel Algorithm:

Execute all these tasks ($S_j^x$, $j = n, \ldots, 1$) one after another on the processor that owns $x$ (column $n+1$).

    For j = n to 1
      If owner(column x) == me then
        Receive column j if not available.
        Do S_j^x.
      Else
        If owner(column j) == me, send column j to the owner of column x.
    EndFor
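IV-29 Problems with the GE Method

Problem 1: $a_{k,k} = 0$.

\[
\begin{aligned}
(1)\quad & 0 + x_2 + x_3 = 2\\
(2)\quad & 3x_1 + 2x_2 - 3x_3 = 2\\
(3)\quad & x_1 + 5x_2 - x_3 = 5
\end{aligned}
\qquad\Longleftrightarrow\qquad
\begin{pmatrix} 0 & 1 & 1\\ 3 & 2 & -3\\ 1 & 5 & -1 \end{pmatrix} x
= \begin{pmatrix} 2\\ 2\\ 5 \end{pmatrix}
\]

Using Gaussian elimination requires $\mathrm{Eq}(2) - (1)\cdot\frac{3}{0}$ and $\mathrm{Eq}(3) - (1)\cdot\frac{1}{0}$, which divide by zero.

Solution: at stage $k$, interchange rows so that $|a_{k,k}|$ is the maximum in the lower portion of column $k$.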
IV-30 Gaussian Elimination with Pivoting

Row-oriented Forward Elimination

    For k = 1 to n-1
      Find m such that |a[m,k]| = max over i >= k of |a[i,k]|;
      If a[m,k] = 0, no unique solution, stop;
      Swap row(k) with row(m);
      For i = k+1 to n
        a[i,k] = a[i,k] / a[k,k];
        For j = k+1 to n
          a[i,j] = a[i,j] - a[i,k] * a[k,j];
        endfor
        b[i] = b[i] - a[i,k] * b[k];
      endfor
    endfor
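An example of GE with Pivoting

IV-31

\[
\left(\begin{array}{rrr|r}
0 & 1 & 1 & 2\\
3 & 2 & -3 & 2\\
1 & 5 & -1 & 5
\end{array}\right)
\xrightarrow{(1)\leftrightarrow(2)}
\left(\begin{array}{rrr|r}
3 & 2 & -3 & 2\\
0 & 1 & 1 & 2\\
1 & 5 & -1 & 5
\end{array}\right)
\xrightarrow{(3)-(1)\cdot\frac{1}{3}}
\left(\begin{array}{rrr|r}
3 & 2 & -3 & 2\\
0 & 1 & 1 & 2\\
0 & \frac{13}{3} & 0 & \frac{13}{3}
\end{array}\right)
\]

\[
\xrightarrow{(2)\leftrightarrow(3)}
\left(\begin{array}{rrr|r}
3 & 2 & -3 & 2\\
0 & \frac{13}{3} & 0 & \frac{13}{3}\\
0 & 1 & 1 & 2
\end{array}\right)
\xrightarrow{(3)-(2)\cdot\frac{3}{13}}
\left(\begin{array}{rrr|r}
3 & 2 & -3 & 2\\
0 & \frac{13}{3} & 0 & \frac{13}{3}\\
0 & 0 & 1 & 1
\end{array}\right)
\]

Backward substitution gives $x_1 = 1$, $x_2 = 1$, $x_3 = 1$.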
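The pivoting steps of this example can be reproduced with a small routine. The sketch below works on the augmented matrix (so the right-hand side is swapped and updated together with the rows), uses 0-based indices and exact fractions, and raises an error when no nonzero pivot exists; the function name is chosen for illustration.

```python
from fractions import Fraction


def ge_partial_pivoting(aug):
    """Solve Ax = b from the augmented matrix (A | b) using partial pivoting."""
    n = len(aug)
    a = [[Fraction(v) for v in row] for row in aug]
    for k in range(n - 1):
        # Row m holds the largest |a[i][k]| among rows k .. n-1.
        m = max(range(k, n), key=lambda r: abs(a[r][k]))
        if a[m][k] == 0:
            raise ValueError("no unique solution")
        a[k], a[m] = a[m], a[k]
        for i in range(k + 1, n):
            a[i][k] = a[i][k] / a[k][k]
            for j in range(k + 1, n + 1):
                a[i][j] -= a[i][k] * a[k][j]
    if a[n - 1][n - 1] == 0:
        raise ValueError("no unique solution")
    # Backward substitution on the last column.
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n):
            a[i][n] -= a[i][j] * a[j][n]
        a[i][n] /= a[i][i]
    return [row[n] for row in a]


print(ge_partial_pivoting([[0, 1, 1, 2], [3, 2, -3, 2], [1, 5, -1, 5]]))
# [Fraction(1, 1), Fraction(1, 1), Fraction(1, 1)]
```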
IV-32 Column-Oriented GE with Pivoting

    For k = 1 to n-1
      Find m such that |a[m,k]| = max over i >= k of |a[i,k]|;
      If a[m,k] = 0, no unique solution, stop.
      Swap row(k) with row(m);
      For i = k+1 to n
        a[i,k] = a[i,k] / a[k,k]
      EndFor
      For j = k+1 to n+1
        For i = k+1 to n
          a[i,j] = a[i,j] - a[i,k] * a[k,j]
        EndFor
      EndFor
    EndFor
IV-33 Parallel column-oriented GE with pivoting

Partitioned forward elimination:

    For k = 1 to n-1
      P_k^k:  Find m such that |a[m,k]| = max over i >= k of |a[i,k]|;
              If a[m,k] = 0, no unique solution, stop.
      For j = k to n+1
        S_k^j:  Swap a[k,j] with a[m,j];
      Endfor
      T_k^k:  For i = k+1 to n
                a[i,k] = a[i,k] / a[k,k]
              endfor
IV-34 (partitioned forward elimination, continued)

      For j = k+1 to n+1
        T_k^j:  For i = k+1 to n
                  a[i,j] = a[i,j] - a[i,k] * a[k,j]
                endfor
      endfor
IV-35 Dependence structure for iteration k

    P_k^k                                Find the maximum element.
      |   broadcast swapping positions
    S_k^k  S_k^{k+1}  ...  S_k^{n+1}     Swap each column.
    T_k^k                                Scale column k.
      |   broadcast column k
    T_k^{k+1}  ...  T_k^{n+1}            Update columns k+1, k+2, ..., n+1.
IV-36 Combining messages and merging tasks

Define task $U_k^k$ as performing $P_k^k$, $S_k^k$, and $T_k^k$.
Define task $U_k^j$ as performing $S_k^j$ and $T_k^j$ ($k+1 \le j \le n+1$).

    U_k^k = { P_k^k, S_k^k, T_k^k }      Find the maximum element. Swap column k. Scale column k.
      |   broadcast swapping positions and column k
    U_k^{k+1}  ...  U_k^{n+1}            U_k^j = { S_k^j, T_k^j }: swap and update columns k+1, k+2, ..., n+1.
IV-37 Parallel algorithm for pivoting

    For k = 1 to n-1
      The owner of column k does U_k^k and broadcasts the swapping positions and column k.
      Do U_k^{k+1}, ..., U_k^{n+1} in parallel.
    endfor
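IV-38 Iterative Methods for Solving Ax = b

Example:

\[
\begin{aligned}
(1)\quad & 6x_1 - 2x_2 + x_3 = 11\\
(2)\quad & -2x_1 + 7x_2 + 2x_3 = 5\\
(3)\quad & x_1 + 2x_2 - 5x_3 = -1
\end{aligned}
\]

Solving equation $i$ for $x_i$:

\[
\begin{aligned}
x_1 &= \tfrac{11}{6} - \tfrac{1}{6}(-2x_2 + x_3)\\
x_2 &= \tfrac{5}{7} - \tfrac{1}{7}(-2x_1 + 2x_3)\\
x_3 &= \tfrac{1}{5} + \tfrac{1}{5}(x_1 + 2x_2)
\end{aligned}
\]

This gives the iteration:

\[
\begin{aligned}
x_1^{(k+1)} &= \tfrac{1}{6}\bigl(11 - (-2x_2^{(k)} + x_3^{(k)})\bigr)\\
x_2^{(k+1)} &= \tfrac{1}{7}\bigl(5 - (-2x_1^{(k)} + 2x_3^{(k)})\bigr)\\
x_3^{(k+1)} &= \tfrac{1}{-5}\bigl(-1 - (x_1^{(k)} + 2x_2^{(k)})\bigr)
\end{aligned}
\]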
IV-39 Initial Approximation: $x_1 = 0$, $x_2 = 0$, $x_3 = 0$

    Iter   0    1      2      3      4      8
    x_1    0    1.833  2.038  2.085  2.004  2.000
    x_2    0    0.714  1.181  1.053  1.001  1.000
    x_3    0    0.2    0.852  1.080  1.038  1.000

Stop when $\|x^{(k+1)} - x^{(k)}\| < 10^{-4}$.
This requires a definition of the norm $\|x^{(k+1)} - x^{(k)}\|$.
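IV-40 Iterative methods in a matrix format

\[
\begin{pmatrix} x_1\\ x_2\\ x_3 \end{pmatrix}^{(k+1)}
=
\begin{pmatrix}
0 & \frac{2}{6} & -\frac{1}{6}\\
\frac{2}{7} & 0 & -\frac{2}{7}\\
\frac{1}{5} & \frac{2}{5} & 0
\end{pmatrix}
\begin{pmatrix} x_1\\ x_2\\ x_3 \end{pmatrix}^{(k)}
+
\begin{pmatrix} \frac{11}{6}\\ \frac{5}{7}\\ \frac{1}{5} \end{pmatrix}
\]

General iterative method:

    Assign an initial value to x^(0)
    k = 0
    Do
      x^(k+1) = H * x^(k) + d
    until || x^(k+1) - x^(k) || < epsilon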
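IV-41 Norm of a Vector

Given $x = (x_1, x_2, \ldots, x_n)$:

\[
\|x\|_1 = \sum_{i=1}^{n} |x_i|, \qquad
\|x\|_2 = \sqrt{\sum_{i=1}^{n} |x_i|^2}, \qquad
\|x\|_\infty = \max_i |x_i|
\]

Example: $x = (-1, 1, 2)$

\[
\|x\|_1 = 4, \qquad
\|x\|_2 = \sqrt{1 + 1 + 2^2} = \sqrt{6}, \qquad
\|x\|_\infty = 2
\]

Application: measuring the error, $\|\mathrm{Error}\| \le \varepsilon$.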
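The three norms are one-liners in Python, which makes it easy to check the example. The function names below are chosen for illustration.

```python
import math


def norm1(x):
    """Sum of absolute values."""
    return sum(abs(v) for v in x)


def norm2(x):
    """Euclidean length."""
    return math.sqrt(sum(v * v for v in x))


def norm_inf(x):
    """Largest absolute value."""
    return max(abs(v) for v in x)


x = [-1, 1, 2]
print(norm1(x), norm2(x), norm_inf(x))  # 4, sqrt(6) (about 2.449), 2
```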
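IV-42 Jacobi Method for Ax = b

\[
x_i^{k+1} = \frac{1}{a_{ii}}\Bigl(b_i - \sum_{j \ne i} a_{ij}\,x_j^{k}\Bigr), \qquad i = 1, \ldots, n
\]

Example:

\[
\begin{aligned}
(1)\quad & 6x_1 - 2x_2 + x_3 = 11\\
(2)\quad & -2x_1 + 7x_2 + 2x_3 = 5\\
(3)\quad & x_1 + 2x_2 - 5x_3 = -1
\end{aligned}
\qquad\Longrightarrow\qquad
\begin{aligned}
x_1 &= \tfrac{11}{6} - \tfrac{1}{6}(-2x_2 + x_3)\\
x_2 &= \tfrac{5}{7} - \tfrac{1}{7}(-2x_1 + 2x_3)\\
x_3 &= \tfrac{1}{5} + \tfrac{1}{5}(x_1 + 2x_2)
\end{aligned}
\]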
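IV-43 Jacobi method in a matrix-vector form

\[
\begin{pmatrix} x_1\\ x_2\\ x_3 \end{pmatrix}^{k+1}
=
\begin{pmatrix}
0 & \frac{2}{6} & -\frac{1}{6}\\
\frac{2}{7} & 0 & -\frac{2}{7}\\
\frac{1}{5} & \frac{2}{5} & 0
\end{pmatrix}
\begin{pmatrix} x_1\\ x_2\\ x_3 \end{pmatrix}^{k}
+
\begin{pmatrix} \frac{11}{6}\\ \frac{5}{7}\\ \frac{1}{5} \end{pmatrix}
\]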
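The Jacobi update can be sketched directly from the component formula. The code below assumes the infinity norm for the stopping test and a zero initial vector; both are choices made here for illustration, and the function names are hypothetical.

```python
def jacobi_step(a, b, x):
    """One Jacobi sweep: every component uses only values from the previous iterate."""
    n = len(a)
    return [
        (b[i] - sum(a[i][j] * x[j] for j in range(n) if j != i)) / a[i][i]
        for i in range(n)
    ]


def jacobi(a, b, eps=1e-4, max_iter=100):
    """Iterate until the infinity norm of the change drops below eps."""
    x = [0.0] * len(a)
    for it in range(1, max_iter + 1):
        x_new = jacobi_step(a, b, x)
        if max(abs(u - v) for u, v in zip(x_new, x)) < eps:
            return x_new, it
        x = x_new
    return x, max_iter


A = [[6, -2, 1], [-2, 7, 2], [1, 2, -5]]
b = [11, 5, -1]
print([round(v, 3) for v in jacobi_step(A, b, [0, 0, 0])])  # [1.833, 0.714, 0.2]
x, iterations = jacobi(A, b)
print([round(v, 3) for v in x], iterations)
```

The first sweep reproduces the iteration-1 column of the table, and the final iterate approaches (2, 1, 1).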
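IV-44 Parallel Jacobi Method

In general, writing $A = D + B$ with $D$ the diagonal part of $A$, the Jacobi iteration is

\[
x^{k+1} = -D^{-1}B\,x^{k} + D^{-1}b,
\qquad\text{that is,}\qquad
x^{k+1} = Hx^{k} + d.
\]

Parallel solution:
Distribute the rows of $H$ to processors.
Perform computation based on the owner-computes rule.
Perform an all-to-all broadcast after each iteration.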
IV-45 If the iterative matrix is sparse

If the matrix contains many zeros, the code design should take advantage of this:
Do not store the known zeros.
Explicitly skip the operations applied to zero elements.

Example: $y_0 = y_{n+1} = 0$.

\[
\begin{aligned}
y_0 - 2y_1 + y_2 &= h^2\\
y_1 - 2y_2 + y_3 &= h^2\\
&\;\;\vdots\\
y_{n-1} - 2y_n + y_{n+1} &= h^2
\end{aligned}
\]
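IV-46 This set of equations can be rewritten as:

\[
\begin{pmatrix}
-2 & 1 & & & \\
1 & -2 & 1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & 1 & -2 & 1\\
 & & & 1 & -2
\end{pmatrix}
\begin{pmatrix} y_1\\ y_2\\ \vdots\\ y_{n-1}\\ y_n \end{pmatrix}
=
\begin{pmatrix} h^2\\ h^2\\ \vdots\\ h^2\\ h^2 \end{pmatrix}
\]

The Jacobi method in a matrix format:

\[
\begin{pmatrix} y_1\\ y_2\\ \vdots\\ y_{n-1}\\ y_n \end{pmatrix}^{k+1}
= 0.5
\begin{pmatrix}
0 & 1 & & & \\
1 & 0 & 1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & 1 & 0 & 1\\
 & & & 1 & 0
\end{pmatrix}
\begin{pmatrix} y_1\\ y_2\\ \vdots\\ y_{n-1}\\ y_n \end{pmatrix}^{k}
- 0.5
\begin{pmatrix} h^2\\ h^2\\ \vdots\\ h^2\\ h^2 \end{pmatrix}
\]

Multiplying by the entire iteration matrix costs too much time and space!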
IV-47 Correct solution: write the Jacobi method as:

    Repeat
      For i = 1 to n
        y_i^new = 0.5 * (y_{i-1}^old + y_{i+1}^old - h^2)
      Endfor
    Until || y^new - y^old || < epsilon
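The stencil form of the Jacobi sweep needs only two vectors of length n + 2 and never stores the tridiagonal matrix. The sketch below assumes the infinity norm for the stopping test and keeps the boundary values y_0 and y_{n+1} fixed at zero inside the array; the function name and default tolerances are hypothetical.

```python
def jacobi_tridiagonal(n, h, eps=1e-10, max_iter=100000):
    """Jacobi iteration for y[i-1] - 2*y[i] + y[i+1] = h^2 with y[0] = y[n+1] = 0."""
    y_old = [0.0] * (n + 2)          # indices 0 and n+1 hold the boundary values
    for _ in range(max_iter):
        y_new = y_old[:]             # copies the boundary zeros as well
        for i in range(1, n + 1):
            y_new[i] = 0.5 * (y_old[i - 1] + y_old[i + 1] - h * h)
        change = max(abs(y_new[i] - y_old[i]) for i in range(1, n + 1))
        y_old = y_new
        if change < eps:
            break
    return y_old


y = jacobi_tridiagonal(n=5, h=1.0)
print([round(v, 4) for v in y[1:6]])
```

For this model problem the exact discrete solution is $y_i = h^2\, i\,(i - n - 1)/2$, which gives a convenient check on the iterates: for n = 5 and h = 1 the middle value y_3 should approach -4.5.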
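IV-48 Gauss-Seidel Method

Utilize new solutions as soon as they are available.

\[
\begin{aligned}
(1)\quad & 6x_1 - 2x_2 + x_3 = 11\\
(2)\quad & -2x_1 + 7x_2 + 2x_3 = 5\\
(3)\quad & x_1 + 2x_2 - 5x_3 = -1
\end{aligned}
\]

Jacobi method:

\[
\begin{aligned}
x_1^{k+1} &= \tfrac{1}{6}\bigl(11 - (-2x_2^{k} + x_3^{k})\bigr)\\
x_2^{k+1} &= \tfrac{1}{7}\bigl(5 - (-2x_1^{k} + 2x_3^{k})\bigr)\\
x_3^{k+1} &= \tfrac{1}{-5}\bigl(-1 - (x_1^{k} + 2x_2^{k})\bigr)
\end{aligned}
\]

Gauss-Seidel method:

\[
\begin{aligned}
x_1^{k+1} &= \tfrac{1}{6}\bigl(11 - (-2x_2^{k} + x_3^{k})\bigr)\\
x_2^{k+1} &= \tfrac{1}{7}\bigl(5 - (-2x_1^{k+1} + 2x_3^{k})\bigr)\\
x_3^{k+1} &= \tfrac{1}{-5}\bigl(-1 - (x_1^{k+1} + 2x_2^{k+1})\bigr)
\end{aligned}
\]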
IV-49 $\varepsilon = 10^{-4}$

    Iter   0    1      2      3      4      5
    x_1    0    1.833  2.069  1.998  1.999  2.000
    x_2    0    1.238  1.002  0.995  1.000  1.000
    x_3    0    1.062  1.015  0.998  1.000  1.000

It converges faster than Jacobi's method.
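Gauss-Seidel differs from Jacobi only in that each component is overwritten as soon as it is computed, so later components in the same sweep already see the new values. The sketch below assumes a zero initial vector and the infinity norm for the stopping test; the function name is chosen for illustration.

```python
def gauss_seidel(a, b, eps=1e-4, max_iter=100):
    """Gauss-Seidel iteration: x is updated in place within each sweep."""
    n = len(a)
    x = [0.0] * n
    for it in range(1, max_iter + 1):
        change = 0.0
        for i in range(n):
            # x[j] for j < i already holds the value from the current sweep.
            s = sum(a[i][j] * x[j] for j in range(n) if j != i)
            new = (b[i] - s) / a[i][i]
            change = max(change, abs(new - x[i]))
            x[i] = new
        if change < eps:
            return x, it
    return x, max_iter


A = [[6, -2, 1], [-2, 7, 2], [1, 2, -5]]
b = [11, 5, -1]
first, _ = gauss_seidel(A, b, max_iter=1)
print([round(v, 3) for v in first])  # [1.833, 1.238, 1.062]
x, iterations = gauss_seidel(A, b)
print([round(v, 3) for v in x], iterations)
```

The single sweep reproduces the iteration-1 column of the table above.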