Review: From problem to parallel algorithm


1 Review: From problem to parallel algorithm
Mathematical formulations of interesting problems abound.
Poisson's equation. Sources: electrostatics, gravity, fluid flow, image processing (!)
Numerical solution: discretize and solve Ax = b.
Many methods, which serve as building blocks for other problems and algorithms.
Jacobi's method: an easy-to-parallelize iterative method with slow convergence.

2 Recall: Algorithms for 2-D (3-D) Poisson, N = n^2 (= n^3)

Algorithm        | Serial            | PRAM                     | Memory            | # procs
-----------------|-------------------|--------------------------|-------------------|------------
Dense LU         | N^3               | N                        | N^2               | N^2
Band LU          | N^2 (N^{7/3})     | N                        | N^{3/2} (N^{5/3}) | N (N^{4/3})
Jacobi           | N^2 (N^{5/3})     | N (N^{2/3})              | N                 | N
Explicit inverse | N^2               | log N                    | N^2               | N^2
Conj. grad.      | N^{3/2} (N^{4/3}) | N^{1/2} (N^{1/3}) log N  | N                 | N
RB SOR           | N^{3/2} (N^{4/3}) | N^{1/2} (N^{1/3})        | N                 | N
Sparse LU        | N^{3/2} (N^2)     | N^{1/2}                  | N log N (N^{4/3}) | N
FFT              | N log N           | log N                    | N                 | N
Multigrid        | N                 | log^2 N                  | N                 | N
Lower bound      | N                 | log N                    | N                 |

Entries in parentheses are the 3-D (N = n^3) costs. PRAM = idealized parallel model with zero communication cost. Source: Demmel (1997)

3 Recall: 2-D Poisson equation: graph and stencil
[Figure: the n x n grid graph with the 5-point stencil, and the corresponding sparse matrix T (4 on the diagonal, -1 for each grid neighbor)]

4 Recall: Jacobi's method is easy to parallelize
Parallelism: update all points independently.
Block partition the domain: n^2/p elements per block.
Communicate at boundaries: n/p values per neighbor; small if n >> p.
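To make the update rule concrete, here is a minimal serial sketch of one Jacobi sweep (Python/NumPy; the array conventions and function name are mine, not the lecture's). A parallel version would give each processor one block of u and exchange boundary rows/columns between sweeps.

```python
import numpy as np

def jacobi_sweep(u, f, h):
    """One Jacobi sweep for the 5-point discrete Poisson problem.

    u : (n+2) x (n+2) array including boundary values
    f : (n+2) x (n+2) array of source-term values
    h : mesh spacing
    Every interior point is computed from the *old* values only,
    so all n^2 updates are independent (perfectly parallel).
    """
    u_new = u.copy()
    u_new[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                                u[1:-1, :-2] + u[1:-1, 2:] +
                                h**2 * f[1:-1, 1:-1])
    return u_new
```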

5 I: More Poisson. II: Performance metrics and models
Prof. Richard Vuduc, Georgia Institute of Technology
CSE/CS 8803 PNA, Spring 2008 [L.04]
Thursday, January 17, 2008

6 Sources for today's material
CS 267 (Yelick & Demmel, UCB)
Sourcebook, eds. Dongarra et al.
Mike Heath at UIUC
"Intro to the CG method w/o the agonizing pain," by Jonathan Shewchuk (UCB)

7 Algorithms for 2-D (3-D) Poisson, N = n^2 (= n^3)
[Same table as slide 2; repeated for reference. Source: Demmel (1997)]

8 Can we speed up the rate at which information propagates?
[Figure: four panels comparing the right-hand side, the true solution, 5 steps of Jacobi, and 5 steps of another technique]

9 Can we speed up info propagation?
Jacobi:
u^{t+1}_{i,j} = (1/4) ( u^t_{i-1,j} + u^t_{i+1,j} + u^t_{i,j-1} + u^t_{i,j+1} + h^2 f_{i,j} )
If processing in lexicographic order, we can use the most recent values:
u^{t+1}_{i,j} = (1/4) ( u^{t+1}_{i-1,j} + u^t_{i+1,j} + u^{t+1}_{i,j-1} + u^t_{i,j+1} + h^2 f_{i,j} )
This is the Gauss-Seidel algorithm.
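A lexicographic Gauss-Seidel sweep in the same sketch style (same assumed array conventions as the Jacobi snippet above): updating u in place is exactly what lets the new values at (i-1, j) and (i, j-1) feed into later points of the same sweep.

```python
def gauss_seidel_sweep(u, f, h):
    """One lexicographic Gauss-Seidel sweep, updating u in place.

    By the time (i, j) is visited, points (i-1, j) and (i, j-1)
    already hold their step-(t+1) values, so they are used directly.
    """
    n = u.shape[0] - 2
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            u[i, j] = 0.25 * (u[i - 1, j] + u[i + 1, j] +
                              u[i, j - 1] + u[i, j + 1] +
                              h**2 * f[i, j])
    return u
```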

10 Red-black Gauss-Seidel
Alternately update the red and black subsets.
General graphs? Not much improvement.
Convergence: (only) 2x faster.
PRAM: 2x parallel steps.
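A sketch of one red-black sweep under the same conventions. Each half-sweep touches only points of one color, whose stencils read only the other color, so each half-sweep is as parallel as Jacobi.

```python
def red_black_sweep(u, f, h):
    """One red-black Gauss-Seidel sweep, updating u in place.

    Red points (i + j even) are updated first from black neighbors;
    black points then see the freshly updated red values.
    """
    n = u.shape[0] - 2
    for parity in (0, 1):          # 0 = red half-sweep, 1 = black
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                if (i + j) % 2 == parity:
                    u[i, j] = 0.25 * (u[i - 1, j] + u[i + 1, j] +
                                      u[i, j - 1] + u[i, j + 1] +
                                      h**2 * f[i, j])
    return u
```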

11 Successive overrelaxation (SOR)
Rewrite Jacobi as the original value plus a correction:
u^{t+1}_{i,j} = u^t_{i,j} + Δ_{i,j}
If the correction is a good direction, accelerate by a relaxation factor ω > 1:
u^{t+1}_{i,j} = u^t_{i,j} + ω Δ_{i,j}
Red-black SOR: alternately apply the following to the red and black subsets:
u^{t+1}_{i,j} = (1 − ω) u^t_{i,j} + (ω/4) ( u^t_{i-1,j} + u^t_{i+1,j} + u^t_{i,j-1} + u^t_{i,j+1} + h^2 f_{i,j} )

12 Red-black SOR
Red-black SOR: alternately apply the following to the red and black subsets:
u^{t+1}_{i,j} = (1 − ω) u^t_{i,j} + (ω/4) ( u^t_{i-1,j} + u^t_{i+1,j} + u^t_{i,j-1} + u^t_{i,j+1} + h^2 f_{i,j} )
Can show, for Poisson, that the error is minimized when [Demmel (1997)]:
1 < ω = 2 / (1 + sin(π/(N+1))) < 2
Can also show the number of steps to converge is O(n), vs. Jacobi's O(n^2).
Serial complexity: O(n^3 = N^{3/2}), vs. Jacobi's O(n^4 = N^2). [PRAM?]
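A sketch of the red-black SOR update with the slide's relaxation factor. Reading the slide's N in the ω formula as the grid dimension is my assumption; everything else follows the update formula above.

```python
import numpy as np

def sor_sweep(u, f, h, omega):
    """One red-black SOR sweep, updating u in place.

    Each point is relaxed toward its Gauss-Seidel value:
    u_new = (1 - omega) * u_old + omega * (Gauss-Seidel update).
    """
    n = u.shape[0] - 2
    for parity in (0, 1):
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                if (i + j) % 2 == parity:
                    gs = 0.25 * (u[i - 1, j] + u[i + 1, j] +
                                 u[i, j - 1] + u[i, j + 1] +
                                 h**2 * f[i, j])
                    u[i, j] = (1 - omega) * u[i, j] + omega * gs
    return u

n = 31                                          # grid dimension (assumed reading of N)
omega = 2.0 / (1.0 + np.sin(np.pi / (n + 1)))   # the slide's optimal relaxation factor
```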

13 Algorithms for 2-D (3-D) Poisson, N = n^2 (= n^3)
[Same table as slide 2; repeated for reference. Source: Demmel (1997)]

14 Ax = b solves a minimization problem
If A is symmetric positive definite, then the quadratic form
φ(x) = (1/2) x^T A x − x^T b
is minimized when Ax = b.
Intuition? Consider
A = [ 3  2 ]      b = [  2 ]
    [ 2  6 ]          [ -8 ]
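A quick numerical check (NumPy) that the minimizer of φ is the solution of Ax = b for this example. The −8 entry of b follows Shewchuk's worked example, which this lecture cites; the original transcription lost the sign.

```python
import numpy as np

A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])

def phi(x):
    """Quadratic form phi(x) = 1/2 x^T A x - x^T b."""
    return 0.5 * x @ A @ x - x @ b

x_star = np.linalg.solve(A, b)   # minimizer; equals [2, -2]
print(x_star, phi(x_star))
# Any perturbation increases phi, since A is positive definite:
print(phi(x_star + np.array([0.1, -0.1])) > phi(x_star))   # True
```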

15 φ(x) = (1/2) x^T A x − x^T b
[Figure: surface plot of the quadratic form φ]

16 φ(x) = (1/2) x^T A x − x^T b
[Figure: contour plot of φ; each curve is a set of constant φ]

17 [Figure: quadratic-form surfaces for four cases: positive-definite, negative-definite, singular (positive-indefinite), and indefinite A]

18 Why the quadratic form?
φ(x) = (1/2) x^T A x − x^T b
Consider the error between the approximate and true solutions at step k of some method:
e = x_approx − x
‖e‖_A^2 = e^T A e
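The link the slide is hinting at takes one line to write out: if x* solves Ax* = b and x = x* + e, then, using the symmetry of A,

φ(x* + e) = (1/2)(x* + e)^T A (x* + e) − (x* + e)^T b = φ(x*) + (1/2) e^T A e,

so minimizing φ over x is exactly minimizing the squared A-norm of the error.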

19 Some additional observations about this minimization problem
In general, an iterative numerical optimization method has the form
x_{k+1} = x_k + α s_k
Choose α to minimize φ(x_k + α s_k).
The negative gradient is the residual vector:
−∇φ(x) = b − Ax ≡ r
Can show analytically that
α = (r_k^T s_k) / (s_k^T A s_k)
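A minimal steepest-descent sketch (Python/NumPy) that instantiates these formulas with s_k = r_k; the tolerance and iteration cap are arbitrary choices of mine.

```python
import numpy as np

def steepest_descent(A, b, x0, tol=1e-8, max_iter=1000):
    """Solve Ax = b (A symmetric positive definite) by steepest descent.

    Each step moves along s_k = r_k with the exact line-search step
    alpha = (r_k^T s_k) / (s_k^T A s_k) from the slide.
    """
    x = x0.copy()
    for _ in range(max_iter):
        r = b - A @ x                    # residual = negative gradient of phi
        if np.linalg.norm(r) < tol:
            break
        alpha = (r @ r) / (r @ (A @ r))  # s_k = r_k here
        x = x + alpha * r
    return x

A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
print(steepest_descent(A, b, np.zeros(2)))   # approx. [2, -2]
```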

20 Conjugate gradient algorithm for solving linear systems
[Figure: CG pseudocode; the dominant kernel in each iteration is a (sparse) matrix-vector multiply]
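Since the slide's pseudocode survives only as an image, here is a standard unpreconditioned CG sketch in the same NumPy style, with the single matrix-vector product per iteration marked; this is textbook CG, not necessarily the exact variant on the slide.

```python
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-8, max_iter=None):
    """Unpreconditioned CG for Ax = b, A symmetric positive definite."""
    x = x0.copy()
    r = b - A @ x
    s = r.copy()                      # first search direction
    rr = r @ r
    for _ in range(max_iter or len(b)):
        As = A @ s                    # the (sparse) matrix-vector multiply
        alpha = rr / (s @ As)
        x = x + alpha * s
        r = r - alpha * As
        rr_new = r @ r
        if np.sqrt(rr_new) < tol:
            break
        s = r + (rr_new / rr) * s     # keep search directions A-conjugate
        rr = rr_new
    return x
```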

21 Refer to the Templates book for a broad survey of iterative linear solvers
[Decision tree from Templates, roughly:]
Is A symmetric?
  No: is A^T available?
    Yes: try QMR
    No: is storage expensive?
      No: try GMRES
      Yes: try CGS or Bi-CGSTAB
  Yes: is A definite?
    No: try MINRES or CG
    Yes: are eigenvalues known?
      No: try CG
      Yes: try CG or Chebyshev

22 Algorithms for 2-D (3-D) Poisson, N = n^2 (= n^3)
[Same table as slide 2; repeated for reference. Source: Demmel (1997)]

23 Administrivia

24 Administrative stuff
Accounts: apparently, you already have them or will soon (!)
Try logging into warp1 with your UNIX account password.
If it doesn't work, go see the TSO Help Desk (and good luck!): CCB 148 / M-F 7a-5p / AIM: tsohlpdsk
Summer internships at national and industrial research labs

25 Metrics and models of efficiency and scalability

26 Outline
Parallel efficiency: effectiveness of a parallel algorithm compared to serial
Scalability: definitions, problem scaling, isoefficiency
Simple models
Slides in this section taken from Heath (UIUC)

27 Parallel efficiency: 4 scenarios
Consider load balance, concurrency, and overhead.

28 (a) Perfect load balance and concurrency

29 (b) Good initial concurrency but poor load balance

30 (c) Good load balance but poor concurrency

31 (d) Good load balance and concurrency, but with overheads

32 Basic definitions

Symbol | Quantity                 | Meaning
-------|--------------------------|---------------------------------------------------
M      | Memory complexity        | Storage for given problem (e.g., words)
W      | Computational complexity | Amount of work for given problem (e.g., flops)
V      | Processor speed          | Ops / time (e.g., flop/s)
T      | Execution time           | Elapsed wallclock (e.g., secs)
C      | Computational cost       | (No. procs) x (exec. time) [e.g., processor-hours]

33 Basic definitions (continued)
[Same table as slide 32.]
Subscripts denote the number of processors used, e.g., T_1 = serial time, W_p = work for p procs.

34 Basic definitions (continued)
[Same table as slide 32.]
Assumptions: M_p ≥ M_1 and W_p ≥ W_1.

35 Quantities may be functions of one another
Write W(M) to indicate that work depends on memory complexity.
Example: multiplying two n x n matrices:
M = O(n^2), W = O(n^3)  ⇒  W = O(M^{3/2})

36 Comments on processor speed, V
Processor speed will depend on M, due to memory hierarchies: V(M) ?= V(N)?
Homogeneous vs. heterogeneous processors: V_p(M) ?= V_1(M)?
Aggregate speed: p · V(M/p), vs. V(M)?

37 Execution time vs. cost
Serial and parallel execution time: work / speed:
T_1 = W_1 / V(M)
T_p = W_p / (p · V(M/p))
Cost = (no. procs) × (execution time):
C_1 ≡ T_1
C_p ≡ p · T_p = W_p / V(M/p)

38 Efficiency and speedup
Efficiency:
E_p ≡ C_1 / C_p = T_1 / (p · T_p) = (W_1 / W_p) · (V(M/p) / V(M))
Speedup:
S_p ≡ T_1 / T_p = p · E_p
Question: when might superlinear speedup occur?

39 Example: Summation using a tree algorithm
Memory usage is the same: M_1 = M_p = n

40 Example: Summation using a tree algorithm
Work: the parallel case does more work, for the intermediate sums:
W_1 ≈ n
W_p ≈ n + p log p

41 Example: Summation using a tree algorithm
Time: assume perfect load balance and concurrency:
T_1 ≈ n
T_p ≈ n/p + log p

42 Example: Summation using a tree algorithm
Cost: (no. procs) × (time):
C_1 ≈ n
C_p ≈ n + p log p

43 Example: Summation using a tree algorithm
Efficiency:
E_p ≡ C_1 / C_p ≈ n / (n + p log p) = 1 / (1 + (p/n) log p)

44 Example: Summation using a tree algorithm
Speedup:
S_p ≡ T_1 / T_p ≈ n / (n/p + log p) = p / (1 + (p/n) log p)
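A tiny evaluation (Python) of the efficiency and speedup formulas just derived, to see the (p/n) log p penalty grow with p; the base-2 logarithm and the sample sizes are illustrative assumptions.

```python
import math

def tree_sum_metrics(n, p):
    """Model efficiency and speedup of tree summation of n numbers on p procs."""
    overhead = (p / n) * math.log2(p)   # the (p/n) log p term
    E = 1.0 / (1.0 + overhead)
    S = p * E
    return E, S

for p in (16, 256, 4096):
    E, S = tree_sum_metrics(n=10**6, p=p)
    print(f"p={p:5d}  efficiency={E:.4f}  speedup={S:.1f}")
```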

45 Parallel scalability
An algorithm is scalable if E_p = Θ(1) as p → ∞.
Why use more processors?
Solve a fixed problem in less time.
Solve a larger problem in the same time (or in any given time).
Obtain sufficient aggregate memory.
Tolerate latency and/or use all available bandwidth (Little's Law).

46 Problem scaling: fixed serial work
Adding more processors eventually hits diminishing returns.
The summation algorithm is not scalable in the fixed-work case:
E_p ≡ C_1 / C_p ≈ n / (n + p log p) = 1 / (1 + (p/n) log p) → 0 as p grows with n fixed.

47 Problem scaling: fixed execution time
Relevant when a strict time limit applies.
T_1 = W_1 / V(M),  T_p = W_p / (p · V(M/p))
The algorithm scales only if work scales linearly with p.
The summation algorithm does not scale in this scenario:
T_1 ≈ n,  T_p ≈ n/p + log p

48 Problem scaling: scaled speedup
Fixed work per processor:
T_1 = W_1 / V(M),  T_p = W_p / (p · V(M/p))
The summation algorithm does not scale in this scenario:
E_p ≈ W_1 / W_p = pn / (pn + p log p) = 1 / (1 + (log p)/n) → 0 as p → ∞.
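The same model in the scaled-speedup regime, with a fixed n numbers per processor; efficiency decays like 1/(1 + (log p)/n), slowly but inevitably toward 0, which the snippet below makes concrete (again with an assumed base-2 log).

```python
import math

def scaled_efficiency(n_per_proc, p):
    """Scaled-speedup efficiency of tree summation: 1 / (1 + log2(p)/n)."""
    return 1.0 / (1.0 + math.log2(p) / n_per_proc)

for p in (16, 1024, 2**20):
    print(p, round(scaled_efficiency(n_per_proc=1000, p=p), 6))
```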

49 Problem scaling (continued)
Other scenarios, next time:
Fixed memory per processor
Fixed accuracy
Fixed efficiency (isoefficiency)

50 In conclusion
