Parallel programming using MPI. Analysis and optimization. Bhupender Thakur, Jim Lupo, Le Yan, Alex Pacheco
2 Outline
- Parallel programming: basic definitions
- Choosing right algorithms: optimal serial and parallel
- Load balancing
- Rank ordering, domain decomposition
- Blocking vs non-blocking
- Overlap computation and communication
- MPI-IO and avoiding I/O bottlenecks
- Hybrid programming model: MPI + OpenMP, MPI + accelerators for GPU clusters
3 Choosing right algorithms: How does your serial algorithm scale?

Notation        Name          Example
O(1)            constant      determining if a number is even or odd; using a hash table
O(log log n)    double log    finding an item using interpolation search in a sorted array
O(log n)        log           searching in a kd-tree
O(n)            linear        finding an item in an unsorted list or in an unsorted array
O(n log n)      loglinear     performing a fast Fourier transform; heapsort, quicksort
O(n^2)          quadratic     naive bubble sort
O(n^c), c>1     polynomial    matrix multiplication, inversion
O(c^n), c>1     exponential   finding the (exact) solution to the travelling salesman problem
O(n!)           factorial     generating all unrestricted permutations of a poset
4 Parallel programming concepts: Basic definitions
5 Parallel programming concepts: Performance metrics
Speedup: ratio of serial to parallel execution time, S = t_s / t_p
Efficiency: ratio of speedup to the number of processors, E = S / p
Work: product of parallel time and processors, W = p t_p
Parallel overhead: time wasted in parallel execution, T_o = p t_p - t_s
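The slides give no code here; the four metrics can be sketched in Python as follows (the timing values in the example are invented for illustration):

```python
def metrics(t_s, t_p, p):
    """Speedup, efficiency, work, and parallel overhead from the definitions above."""
    S = t_s / t_p          # speedup: serial over parallel execution time
    E = S / p              # efficiency: speedup per processor
    W = p * t_p            # work: processors times parallel time
    T_o = p * t_p - t_s    # parallel overhead: work beyond the serial time
    return S, E, W, T_o

S, E, W, T_o = metrics(t_s=100.0, t_p=10.0, p=16)
print(S, E, W, T_o)   # 10.0 0.625 160.0 60.0
```

Note that E < 1 here precisely because the overhead T_o is nonzero.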
6 Parallel programming concepts: Analyzing serial vs parallel algorithms
Ask yourself:
- What fraction of the code can you completely parallelize? f?
- How does the problem size n scale as the processor count p grows? n(p)?
- How does the parallel overhead grow? T_o(p)?
- Does the memory requirement scale? M(p)?
Serial drawbacks: latency, single core. Parallel drawbacks: extra computation, communication/idling, contention (memory, network), serial percentage.
7 Parallel programming concepts: Analyzing serial vs parallel algorithms
Amdahl's law. Let f be the serial fraction of the code (the part you cannot parallelize).
Serial time: t_s
Parallel time: t_p = f t_s + (1 - f) t_s / p
S = t_s / t_p = t_s / ( f t_s + (1 - f) t_s / p ) = 1 / ( f + (1 - f)/p )
8 Parallel programming concepts: Analyzing serial vs parallel algorithms
Quiz: if the serial fraction is 5%, what is the maximum speedup you can achieve?
Serial time = 100 secs, serial percentage = 5%. Maximum speedup?
9 Parallel programming concepts: Analyzing serial vs parallel algorithms
Amdahl's law. Serial time = 100 secs, serial percentage = 5%, so f = 0.05.
S = 1 / ( f + (1 - f)/p ) approaches 1/f = 20 as p grows: the maximum speedup is 20.
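A short Python check of Amdahl's law (a sketch; the function name is ours) confirms the quiz answer:

```python
def amdahl_speedup(f, p):
    """Amdahl's law: f is the serial fraction, p the processor count."""
    return 1.0 / (f + (1.0 - f) / p)

# With a 5% serial fraction the speedup saturates at 1/f = 20,
# no matter how many processors you throw at the problem.
for p in (10, 100, 10**6):
    print(p, amdahl_speedup(0.05, p))
```

Even with a million processors the speedup never exceeds 20.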
10 Parallel programming concepts: Analyzing serial vs parallel algorithms
Amdahl's law approximately suggests: suppose a car is traveling between two cities 60 miles apart, and has already spent one hour traveling half the distance at 30 mph. No matter how fast you drive the last half, it is impossible to achieve a 90 mph average before reaching the second city.
Gustafson's law approximately states: suppose a car has already been traveling for some time at less than 90 mph. Given enough time and distance to travel, the car's average speed can always eventually reach 90 mph, no matter how long or how slowly it has already traveled.
11 Parallel programming concepts: Analyzing serial vs parallel algorithms
Communication overhead. Simplest model:
Transfer time = startup time + hop time (node latency) + message length / bandwidth = t_s + t_h l + t_w m
where l is the number of hops, m is the message length, and the per-word time t_w is the inverse of the bandwidth.
Send one big message instead of several small messages. Reduce the total number of bytes. Bandwidth depends on the protocol.
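As an illustration of why aggregating messages pays off, here is a sketch of the cost model above in Python (all parameter values are invented for illustration):

```python
def transfer_time(t_s, t_w, m, t_h=0.0, hops=0):
    """Transfer time = startup + per-hop latency + per-word time * message length."""
    return t_s + t_h * hops + t_w * m

# Sending n words as k separate messages pays the startup latency k times.
t_s, t_w, n, k = 1e-5, 1e-9, 10**6, 1000
one_big = transfer_time(t_s, t_w, n)
many_small = k * transfer_time(t_s, t_w, n // k)
print(one_big, many_small)   # the k small messages cost noticeably more
```

With these (made-up) constants the aggregated message is roughly ten times cheaper, even though the same number of bytes moves in both cases.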
12 Parallel programming concepts: Analyzing serial vs parallel algorithms
Collective overhead (m is the message size per process):
- Point to point (MPI_Send): t_s + t_w m
- All-to-all broadcast (MPI_Allgather): t_s log2 p + (p - 1) t_w m
- All-reduce (MPI_Allreduce): (t_s + t_w m) log2 p
- Scatter and gather (MPI_Scatter): (t_s + t_w m) log2 p
- All-to-all (personalized): (p - 1)(t_s + t_w m)
13 Parallel programming: Basic definitions
Isoefficiency: can we maintain the efficiency/speedup of the algorithm? How should the problem size scale with p to keep efficiency constant?
E = 1 / ( 1 + T_o / W )
Maintain the ratio T_o(W, p) / W of overhead to parallel work constant.
14 Parallel programming: Basic definitions
Isoefficiency relation: to keep efficiency constant you must increase the problem size such that T(n,1) >= C T_o(n,p).
Procedure:
1. Get the sequential time T(n,1)
2. Get the parallel time T(n,p)
3. Calculate the overhead T_o = p T(n,p) - T(n,1)
How does the overhead compare to the useful work being done?
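The three-step procedure can be sketched in Python for the adding-numbers example that follows (operation counts rather than wall-clock times; the function names are ours):

```python
import math

def t_serial(n):
    """T(n,1): n additions."""
    return n

def t_parallel(n, p):
    """T(n,p): n/p local additions plus 2*log2(p) communicate-and-add steps."""
    return n / p + 2 * math.log2(p)

def overhead(n, p):
    """Step 3: T_o = p*T(n,p) - T(n,1)."""
    return p * t_parallel(n, p) - t_serial(n)

print(overhead(10**6, 64))   # 2 * 64 * log2(64) = 768.0
```

The useful work (10^6 additions) dwarfs the 768 steps of overhead here, which is why this problem scales well.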
15 Parallel programming: Basic definitions
Isoefficiency relation: to keep efficiency constant you must increase the problem size such that T(n,1) >= C T_o(n,p).
Scalability: do you have enough resources (memory) to scale to that size?
Maintain the ratio T_o(W, p) / W of overhead to parallel work constant.
16 Parallel programming: Basic definitions
Scalability: do you have enough resources (memory) to scale to that size?
[Figure: memory needed per processor vs number of processors. If the required memory per processor grows, e.g. as C p log p or C p, efficiency cannot be maintained; if it stays constant at C, efficiency can be maintained.]
17 Adding numbers
Each processor has n/p numbers, e.g. 4,3,9 | 1,3,7 | 7,5,8 | 5,2,9
Serial time: n
Parallel time: n/p local additions, then log p steps to communicate plus log p additions of partial sums
18 Adding numbers
Each processor has n/p numbers. Steps to communicate and add are 2 log p.
Speedup: S = n / ( n/p + 2 log p )
Efficiency: E = n / ( n + 2 p log p )
Isoefficiency: if you increase the problem size n as O(p log p), then the efficiency can remain constant.
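A quick numerical check of the isoefficiency claim (a sketch; c is an arbitrary constant):

```python
import math

def efficiency(n, p):
    """E = n / (n + 2 p log2 p) for the parallel sum."""
    return n / (n + 2 * p * math.log2(p))

# Growing n as c * p * log2(p) pins the efficiency at c / (c + 2),
# independent of the processor count.
c = 32
for p in (4, 64, 1024):
    n = c * p * math.log2(p)
    print(p, efficiency(n, p))
```

Every line prints the same efficiency, c/(c+2), confirming that n = O(p log p) holds E constant.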
19 Adding numbers
Each processor has n/p numbers. Steps to communicate and add are 2 log p.
Efficiency: E = n / ( n + 2 p log p )
Scalability: M(n)/p = n/p = (p log p)/p = log p, so the memory per processor grows only as log p.
20 Sorting Each processor has n/p numbers. 4, 3, 9 1, 3, 7 7,5,8 5,2,9 Our plan 1. Split list into parts 2. Sort parts individually 3. Merge lists to get sorted list
21 Sorting
Each processor has n/p numbers, e.g. 4,3,9 | 1,3,7 | 7,5,8 | 5,2,9
Background:
- Bubble sort: O(n^2), stable, exchanging
- Selection sort: O(n^2), unstable, selection
- Insertion sort: O(n^2), stable, insertion
- Merge sort: O(n log n), stable, merging
- Quick sort: O(n log n), unstable, partitioning
22 Choosing right algorithms: Optimal serial and parallel
Case study: Bubble sort
Main loop: for i : 1 to length_of(A) - 1
  Secondary loop: for j : i+1 to length_of(A)
    Compare and swap so the smaller element is to the left:
    if ( A[j] < A[i] ) swap( A[i], A[j] )
23 Choosing right algorithms: Optimal serial and parallel
Case study: Bubble sort
Main loop: for i : 1 to length_of(A) - 1
  Secondary loop: for j : i+1 to length_of(A)
    if ( A[j] < A[i] ) swap( A[i], A[j] )
Compare-and-swap trace (swap?): (i=1, j=2) Y, (i=1, j=3) N, (i=1, j=4) N, (i=2, j=3) Y, (i=2, j=4) Y, (i=3, j=4) Y
24 Choosing right algorithms: Optimal serial and parallel
Case study: Bubble sort
Main loop: for i : 1 to length_of(A) - 1
  Secondary loop: for j : i+1 to length_of(A)
    if ( A[j] < A[i] ) swap( A[i], A[j] )
Compare-and-swap trace (swap?): (i=1, j=2) Y, (i=1, j=3) N, (i=1, j=4) N, (i=2, j=3) Y, (i=2, j=4) Y, (i=3, j=4) Y
N(N-1)/2 = O(N^2) comparisons
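A runnable Python sketch of the pseudocode above, counting comparisons (note this compare-any-pair variant is sometimes called exchange sort rather than classic bubble sort):

```python
def compare_swap_sort(a):
    """Sort as in the slide's pseudocode, counting comparisons."""
    a = list(a)
    comparisons = 0
    for i in range(len(a) - 1):
        for j in range(i + 1, len(a)):
            comparisons += 1
            if a[j] < a[i]:            # keep the smaller element on the left
                a[i], a[j] = a[j], a[i]
    return a, comparisons

result, count = compare_swap_sort([4, 3, 9, 1])
print(result, count)   # [1, 3, 4, 9] 6, i.e. N(N-1)/2 comparisons for N = 4
```

The comparison count is always exactly N(N-1)/2, matching the O(N^2) bound regardless of input order.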
25 Choosing right algorithms: Optimal serial and parallel Case Study: Merge sort Recursively merge lists having one element each Split [ ] [ 1 5 ] [ 4 2 ] [ 1 ] [ 5 ] [ 4 ] [ 2 ] Merge [ 1 5 ] [ 2 4 ] [ ]
26 Choosing right algorithms: Optimal serial and parallel Case Study: Merge sort Recursively merge lists having one element each Split [ ] [ 1 5 ] [ 4 2 ] [ 1 ] [ 5 ] [ 4 ] [ 2 ] Merge Best sorting algorithms need O( N log N) [ 1 5 ] [ 2 4 ] [ ]
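A minimal Python sketch of the split-and-merge scheme pictured above:

```python
def merge(left, right):
    """Merge two sorted lists into one sorted list."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

def merge_sort(a):
    """Split down to single-element lists, then merge back up: O(n log n)."""
    if len(a) <= 1:
        return list(a)
    mid = len(a) // 2
    return merge(merge_sort(a[:mid]), merge_sort(a[mid:]))

print(merge_sort([1, 5, 4, 2]))   # [1, 2, 4, 5]
```

Each of the log n levels merges n elements in total, giving the O(n log n) bound.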
27 Choosing right algorithms: Parallel sorting technique
Merging: merge p lists having n/p elements each.
Sub-optimal: pop-push merge on 1 processor, O(np): repeat n - 1 times, popping the smallest head among the p lists and pushing it onto the output list, scanning all p heads each time.
28 Choosing right algorithms: Parallel sorting technique
Merging: merge p lists having n/p elements each.
Optimal: recursively, or tree based: merge 2 lists at a time, then 4, then 8, over log p rounds.
Best merge algorithms need O(n log p).
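The tree-based merge can be sketched serially in Python, pairing lists over log2(p) rounds (a sketch of the idea, not an MPI implementation):

```python
def merge_two(left, right):
    """Merge two sorted lists."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

def tree_merge(lists):
    """Merge p sorted lists pairwise over log2(p) rounds."""
    while len(lists) > 1:
        lists = [merge_two(lists[k], lists[k + 1]) if k + 1 < len(lists)
                 else lists[k] for k in range(0, len(lists), 2)]
    return lists[0]

parts = [sorted(x) for x in ([4, 3, 9], [1, 3, 7], [7, 5, 8], [5, 2, 9])]
print(tree_merge(parts))   # [1, 2, 3, 3, 4, 5, 5, 7, 7, 8, 9, 9]
```

In the parallel version each round halves the number of active processes, which is where the log p factor comes from.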
29 Sorting
Each processor has n/p numbers, e.g. 4,3,9 | 1,3,7 | 7,5,8 | 5,2,9
Serial time: n log n
Parallel local sort time: (n/p) log(n/p)
Merge time: (n-1)(n/p) sub-optimally, or n log p with a tree merge
Total parallel time: (n/p) log(n/p) + n log p
30 Sorting
Each processor has n/p numbers.
Overhead: T_o = p t_p - t_s = p [ (n/p) log(n/p) + n log p ] - n log n = np log p - n log p
Isoefficiency: n log n ~ c (n log p) => n ~ p^c
Scalability: M = n/p = p^(c-1). Low for c > 2.
31 Data decomposition 1D : Row-wise R R R
32 Data decomposition 1D : Column-wise C C C
33 Data decomposition 2D : Block-wise B1 B2 B3 B4 B5 B6 B7 B8 B9
34 Data decomposition
Laplace solver: (n x n) mesh with p processors
Time to communicate 1 cell: t_comm_cell = t_s + t_w
Time to evaluate the stencil once: t_comp_cell = 5 t_float
The 5-point stencil at (i, j) uses a(i-1, j), a(i+1, j), a(i, j-1), a(i, j+1).
35 Data decomposition
Laplace solver: 1D row-wise, (n x n) mesh with p processors
[Figure: rows split between Proc 0 and Proc 1; the stencil a(i-1, j), a(i+1, j), a(i, j-1), a(i, j+1) straddles the boundary, so neighboring boundary rows must be exchanged.]
36 Data decomposition
Laplace solver: 1D row-wise, (n x n) mesh with p processors
Serial time: t_seq = n^2 t_comp_cell
Parallel computation time: t_process = (n^2 / p) t_comp_cell
Ghost communication: t_comm = 2n t_comm_cell (one boundary row to each neighbor)
37 Data decomposition
Laplace solver: 1D row-wise, (n x n) mesh with p processors
Overhead: T_o = p t_comm = 2pn ~ pn
Isoefficiency: n^2 ~ c n p => n ~ c p
Scalability: M = n^2 / p = c^2 p^2 / p = C p. Memory per processor grows linearly with p: poor scaling.
38 Data decomposition
Laplace solver: 2D block-wise
[Figure: mesh split into blocks among Proc 0..3; each block exchanges its four boundary edges with its neighbors for the stencil a(i-1, j), a(i+1, j), a(i, j-1), a(i, j+1).]
39 Data decomposition
Laplace solver: 2D block-wise, (n x n) mesh with p processors
Serial time: t_seq = n^2 t_comp_cell
Parallel computation time: t_process = (n^2 / p) t_comp_cell
Ghost communication: t_comm = (4n / sqrt(p)) t_comm_cell (four edges of an (n/sqrt(p)) x (n/sqrt(p)) block)
40 Data decomposition
Laplace solver: 2D block-wise, (n x n) mesh with p processors
Overhead: T_o = p (4n / sqrt(p)) = 4n sqrt(p) ~ n sqrt(p)
Isoefficiency: n^2 ~ c n sqrt(p) => n ~ c sqrt(p)
Scalability: M = n^2 / p = (c sqrt(p))^2 / p = C. Memory per processor stays constant: perfect scaling.
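The contrast between the two decompositions can be made concrete with the cost model (a sketch; ghost-cell words exchanged per iteration per processor, mesh size invented):

```python
import math

def ghost_words_1d(n, p):
    """1D row-wise: two full boundary rows of n cells each."""
    return 2 * n

def ghost_words_2d(n, p):
    """2D block-wise: four edges of an (n/sqrt(p)) x (n/sqrt(p)) block."""
    return 4 * n / math.sqrt(p)

n = 4096
for p in (16, 64, 256):
    print(p, ghost_words_1d(n, p), ghost_words_2d(n, p))
```

The 1D communication volume is independent of p, while the 2D volume shrinks as p grows; that is exactly why 2D maintains efficiency and 1D does not.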
41 Data decomposition
Matrix vector multiplication: 1D row-wise decomposition
Computation: each processor computes n/p elements of the output, n multiplies + (n-1) adds for each: O(n^2 / p)
Communication: allgather at the end so each processor has a full copy of the output vector:
log p startup steps, and sum over i = 1..log p of 2^(i-1) (n/p) = n(p-1)/p words
42 Data decomposition
Matrix vector multiplication: 1D row-wise decomposition
Algorithm:
1. Collect the vector using MPI_Allgather
2. Local matrix multiplication to get the output vector
Wastes much memory: every processor stores a full copy of the vector.
43 Data decomposition
Matrix vector multiplication: 1D row-wise decomposition
Computation: each processor computes n/p elements, n multiplies + (n-1) adds for each: O(n^2 / p)
Communication: allgather so each processor has a full copy of the output vector: t_w n + t_s log p
Overhead: T_o = t_s p log p + t_w n p
44 Data decomposition
Matrix vector multiplication: 1D row-wise decomposition
Speedup: S = p / ( 1 + p (t_s log p + t_w n) / (t_c n^2) )
Isoefficiency: n^2 ~ p log p + n p => n ~ c p
Scalability: M(p) = n^2 / p = c^2 p
Not scalable!
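A serial Python stand-in for the 1D row-wise algorithm (the "ranks" are simulated in a loop, and the MPI_Allgather step becomes a plain concatenation):

```python
def matvec_rowwise(A, x, p):
    """Each of p 'ranks' owns n/p rows and computes its slice of y = A x."""
    n = len(A)
    rows_per = n // p                     # assumes p divides n
    pieces = []
    for rank in range(p):
        block = A[rank * rows_per:(rank + 1) * rows_per]
        pieces.append([sum(a * b for a, b in zip(row, x)) for row in block])
    # the MPI_Allgather step: every rank ends up with the full output vector
    return [v for piece in pieces for v in piece]

A = [[1, 0, 2, 0], [0, 1, 0, 0], [3, 0, 0, 1], [0, 0, 1, 1]]
x = [1, 2, 3, 4]
print(matvec_rowwise(A, x, p=2))   # [7, 2, 7, 7]
```

Any p that divides n gives the same answer as the serial product, which is a useful sanity check when porting to real MPI.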
45 Data decomposition Matrix vector multiplication: 1D column-wise decomposition Serial Computation? Parallel Computation? Proc 0 Proc 1 Proc 2 Overhead? Isoefficiency? Scalability?
46 Data decomposition
Matrix vector multiplication: 2D decomposition
Algorithm: uses p processors on a sqrt(p) x sqrt(p) grid; block A_ij lives on the process in grid position (i, j), and the vector is split into parts v_0, v_1, v_2:
A00 (proc 0)  A01 (proc 1)  A02 (proc 2)   x   v_0
A10 (proc 3)  A11 (proc 4)  A12 (proc 5)       v_1
A20 (proc 6)  A21 (proc 7)  A22 (proc 8)       v_2
47 Data decomposition
Matrix vector multiplication: 2D decomposition
Algorithm step 0: copy part-vector v_i to the diagonal process (i, i)
48 Data decomposition
Matrix vector multiplication: 2D decomposition
Algorithm step 1: broadcast the vector part down each column, so process (i, j) holds v_j
49 Data decomposition
Matrix vector multiplication: 2D decomposition
Algorithm step 2: local computation on each processor: process (i, j) computes A_ij v_j
50 Data decomposition
Matrix vector multiplication: 2D decomposition
Algorithm step 3: reduce across each row to form u_i = sum over j of A_ij v_j
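Steps 0-3 can be simulated serially in Python: blocks[i][j] plays the role of A_ij on process (i, j), and the column broadcast and row reduce become plain loops (a sketch of the data flow, not MPI code):

```python
def matvec_2d(blocks, vparts):
    """2D algorithm: process (i, j) computes A_ij v_j; each grid row reduces to u_i."""
    q = len(blocks)                        # q x q process grid, q = sqrt(p)
    u = []
    for i in range(q):                     # step 3: reduce across grid row i
        row_sum = None
        for j in range(q):                 # step 1 gave process (i, j) the part v_j
            local = [sum(a * b for a, b in zip(r, vparts[j]))   # step 2
                     for r in blocks[i][j]]
            row_sum = local if row_sum is None else \
                [s + t for s, t in zip(row_sum, local)]
        u.extend(row_sum)
    return u

# 2x2 matrix as 1x1 blocks on a 2x2 grid; v split into two one-element parts
blocks = [[[[1]], [[2]]],
          [[[3]], [[4]]]]
vparts = [[5], [6]]
print(matvec_2d(blocks, vparts))   # [1*5 + 2*6, 3*5 + 4*6] = [17, 39]
```

In real MPI, step 1 is a broadcast over a column sub-communicator and step 3 a reduce over a row sub-communicator, which is where the log(sqrt(p)) factors in the next slides come from.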
51 Data decomposition
Matrix vector multiplication: 2D decomposition
Computation: each processor computes n^2 / p elements: t_c n^2 / p
Communication: (2n / sqrt(p)) log(sqrt(p)) + n / sqrt(p) ~ (n / sqrt(p)) log p
Overhead: T_o = n sqrt(p) log p
52 Data decomposition
Matrix vector multiplication: 2D decomposition
Isoefficiency: n^2 ~ c n sqrt(p) log p => n ~ c sqrt(p) log p
Scalability: M(p) = n^2 / p = c^2 (log p)^2
Scales better than 1D!
53 Let's look at the code
54 Summary
- Know your algorithm!
- Don't expect the unexpected!
- Pay attention to parallel design and implementation right from the outset. It will save you a lot of labor.
More informationTiming Results of a Parallel FFTsynth
Purdue University Purdue e-pubs Department of Computer Science Technical Reports Department of Computer Science 1994 Timing Results of a Parallel FFTsynth Robert E. Lynch Purdue University, rel@cs.purdue.edu
More informationDirect Self-Consistent Field Computations on GPU Clusters
Direct Self-Consistent Field Computations on GPU Clusters Guochun Shi, Volodymyr Kindratenko National Center for Supercomputing Applications University of Illinois at UrbanaChampaign Ivan Ufimtsev, Todd
More informationArnaud Legrand, CNRS, University of Grenoble. November 1, LIG laboratory, Theoretical Parallel Computing. A.
Theoretical Arnaud Legrand, CNRS, University of Grenoble LIG laboratory, arnaudlegrand@imagfr November 1, 2009 1 / 63 Outline Theoretical 1 2 3 2 / 63 Outline Theoretical 1 2 3 3 / 63 Models for Computation
More informationParallel Numerics. Scope: Revise standard numerical methods considering parallel computations!
Parallel Numerics Scope: Revise standard numerical methods considering parallel computations! Required knowledge: Numerics Parallel Programming Graphs Literature: Dongarra, Du, Sorensen, van der Vorst:
More informationCS 310 Advanced Data Structures and Algorithms
CS 310 Advanced Data Structures and Algorithms Runtime Analysis May 31, 2017 Tong Wang UMass Boston CS 310 May 31, 2017 1 / 37 Topics Weiss chapter 5 What is algorithm analysis Big O, big, big notations
More informationDivisible Load Scheduling
Divisible Load Scheduling Henri Casanova 1,2 1 Associate Professor Department of Information and Computer Science University of Hawai i at Manoa, U.S.A. 2 Visiting Associate Professor National Institute
More informationTheoretical Computer Science
Theoretical Computer Science 412 (2011) 1484 1491 Contents lists available at ScienceDirect Theoretical Computer Science journal homepage: wwwelseviercom/locate/tcs Parallel QR processing of Generalized
More informationPetaBricks: Variable Accuracy and Online Learning
PetaBricks: Variable Accuracy and Online Learning Jason Ansel MIT - CSAIL May 4, 2011 Jason Ansel (MIT) PetaBricks May 4, 2011 1 / 40 Outline 1 Motivating Example 2 PetaBricks Language Overview 3 Variable
More informationFinite difference methods. Finite difference methods p. 1
Finite difference methods Finite difference methods p. 1 Overview 1D heat equation u t = κu xx +f(x,t) as a motivating example Quick intro of the finite difference method Recapitulation of parallelization
More informationScalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver
Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver Sherry Li Lawrence Berkeley National Laboratory Piyush Sao Rich Vuduc Georgia Institute of Technology CUG 14, May 4-8, 14, Lugano,
More informationEfficiently solving large sparse linear systems on a distributed and heterogeneous grid by using the multisplitting-direct method
Efficiently solving large sparse linear systems on a distributed and heterogeneous grid by using the multisplitting-direct method S. Contassot-Vivier, R. Couturier, C. Denis and F. Jézéquel Université
More informationLecture 22: Multithreaded Algorithms CSCI Algorithms I. Andrew Rosenberg
Lecture 22: Multithreaded Algorithms CSCI 700 - Algorithms I Andrew Rosenberg Last Time Open Addressing Hashing Today Multithreading Two Styles of Threading Shared Memory Every thread can access the same
More informationSWE Anatomy of a Parallel Shallow Water Code
SWE Anatomy of a Parallel Shallow Water Code CSCS-FoMICS-USI Summer School on Computer Simulations in Science and Engineering Michael Bader July 8 19, 2013 Computer Simulations in Science and Engineering,
More informationIntroduction to communication avoiding algorithms for direct methods of factorization in Linear Algebra
Introduction to communication avoiding algorithms for direct methods of factorization in Linear Algebra Laura Grigori Abstract Modern, massively parallel computers play a fundamental role in a large and
More informationExtending Parallel Scalability of LAMMPS and Multiscale Reactive Molecular Simulations
September 18, 2012 Extending Parallel Scalability of LAMMPS and Multiscale Reactive Molecular Simulations Yuxing Peng, Chris Knight, Philip Blood, Lonnie Crosby, and Gregory A. Voth Outline Proton Solvation
More informationParallel PIPS-SBB Multi-level parallelism for 2-stage SMIPS. Lluís-Miquel Munguia, Geoffrey M. Oxberry, Deepak Rajan, Yuji Shinano
Parallel PIPS-SBB Multi-level parallelism for 2-stage SMIPS Lluís-Miquel Munguia, Geoffrey M. Oxberry, Deepak Rajan, Yuji Shinano ... Our contribution PIPS-PSBB*: Multi-level parallelism for Stochastic
More informationData Structures and Algorithms CSE 465
Data Structures and Algorithms CSE 465 LECTURE 8 Analyzing Quick Sort Sofya Raskhodnikova and Adam Smith Reminder: QuickSort Quicksort an n-element array: 1. Divide: Partition the array around a pivot
More informationAnalysis of Algorithms CMPSC 565
Analysis of Algorithms CMPSC 565 LECTURES 38-39 Randomized Algorithms II Quickselect Quicksort Running time Adam Smith L1.1 Types of randomized analysis Average-case analysis : Assume data is distributed
More informationdata structures and algorithms lecture 2
data structures and algorithms 2018 09 06 lecture 2 recall: insertion sort Algorithm insertionsort(a, n): for j := 2 to n do key := A[j] i := j 1 while i 1 and A[i] > key do A[i + 1] := A[i] i := i 1 A[i
More informationScalability Metrics for Parallel Molecular Dynamics
Scalability Metrics for arallel Molecular Dynamics arallel Efficiency We define the efficiency of a parallel program running on processors to solve a problem of size W Let T(W, ) be the execution time
More informationParallelization of Multilevel Preconditioners Constructed from Inverse-Based ILUs on Shared-Memory Multiprocessors
Parallelization of Multilevel Preconditioners Constructed from Inverse-Based ILUs on Shared-Memory Multiprocessors J.I. Aliaga 1 M. Bollhöfer 2 A.F. Martín 1 E.S. Quintana-Ortí 1 1 Deparment of Computer
More informationVerifying Computations in the Cloud (and Elsewhere) Michael Mitzenmacher, Harvard University Work offloaded to Justin Thaler, Harvard University
Verifying Computations in the Cloud (and Elsewhere) Michael Mitzenmacher, Harvard University Work offloaded to Justin Thaler, Harvard University Goals of Verifiable Computation Provide user with correctness
More informationThe Performance Evolution of the Parallel Ocean Program on the Cray X1
The Performance Evolution of the Parallel Ocean Program on the Cray X1 Patrick H. Worley Oak Ridge National Laboratory John Levesque Cray Inc. 46th Cray User Group Conference May 18, 2003 Knoxville Marriott
More informationParallel Matrix Factorization for Recommender Systems
Under consideration for publication in Knowledge and Information Systems Parallel Matrix Factorization for Recommender Systems Hsiang-Fu Yu, Cho-Jui Hsieh, Si Si, and Inderjit S. Dhillon Department of
More informationCA-SVM: Communication-Avoiding Support Vector Machines on Distributed System
CA-SVM: Communication-Avoiding Support Vector Machines on Distributed System Yang You 1, James Demmel 1, Kent Czechowski 2, Le Song 2, Richard Vuduc 2 UC Berkeley 1, Georgia Tech 2 Yang You (Speaker) James
More informationModel Order Reduction via Matlab Parallel Computing Toolbox. Istanbul Technical University
Model Order Reduction via Matlab Parallel Computing Toolbox E. Fatih Yetkin & Hasan Dağ Istanbul Technical University Computational Science & Engineering Department September 21, 2009 E. Fatih Yetkin (Istanbul
More informationA Nested Dissection Parallel Direct Solver. for Simulations of 3D DC/AC Resistivity. Measurements. Maciej Paszyński (1,2)
A Nested Dissection Parallel Direct Solver for Simulations of 3D DC/AC Resistivity Measurements Maciej Paszyński (1,2) David Pardo (2), Carlos Torres-Verdín (2) (1) Department of Computer Science, AGH
More informationImpression Store: Compressive Sensing-based Storage for. Big Data Analytics
Impression Store: Compressive Sensing-based Storage for Big Data Analytics Jiaxing Zhang, Ying Yan, Liang Jeff Chen, Minjie Wang, Thomas Moscibroda & Zheng Zhang Microsoft Research The Curse of O(N) in
More information0-1 Knapsack Problem in parallel Progetto del corso di Calcolo Parallelo AA
0-1 Knapsack Problem in parallel Progetto del corso di Calcolo Parallelo AA 2008-09 Salvatore Orlando 1 0-1 Knapsack problem N objects, j=1,..,n Each kind of item j has a value p j and a weight w j (single
More informationERLANGEN REGIONAL COMPUTING CENTER
ERLANGEN REGIONAL COMPUTING CENTER Making Sense of Performance Numbers Georg Hager Erlangen Regional Computing Center (RRZE) Friedrich-Alexander-Universität Erlangen-Nürnberg OpenMPCon 2018 Barcelona,
More informationLecture 4: Linear Algebra 1
Lecture 4: Linear Algebra 1 Sourendu Gupta TIFR Graduate School Computational Physics 1 February 12, 2010 c : Sourendu Gupta (TIFR) Lecture 4: Linear Algebra 1 CP 1 1 / 26 Outline 1 Linear problems Motivation
More information