Parallel Programming (1/28)
pauldj@aices.rwth-aachen.de
Collective Communication (2/28)
Barrier, Broadcast, Reduce, Scatter, Gather, Allgather, Reduce-scatter, Allreduce, Alltoall.
References:
- Collective Communication: Theory, Practice, and Experience. Chan, Heimlich, Purkayastha, van de Geijn (FLAME working note #22).
- Collective Communications in MPI.
Collective Communication (3/28)
Synchronization: Barrier (almost never needed!)
Data movement: Broadcast, Scatter, Gather, Allgather, Alltoall
Reductions: Reduce, Reduce-scatter, Allreduce, Scan, ...
For all collectives: no tags; blocking.
MPI_Bcast (4/28)
int MPI_Bcast(...)
Before: the root holds α. After: all p processes hold α.
MPI_Reduce (5/28)
int MPI_Reduce(...)
Before: process i holds δ_i (i = 0, ..., p-1). After: the root holds Op_{i=0}^{p-1} δ_i.
MPI_Op: MPI_MAX, MPI_MIN, MPI_SUM, MPI_PROD, MPI_LAND, MPI_BAND, ...
MPI_Datatype: MPI_CHAR, MPI_INT, MPI_UNSIGNED, MPI_FLOAT, MPI_DOUBLE, ...
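The Reduce semantics above can be modeled in a few lines of plain Python (no MPI involved; `mpi_reduce_model` and its argument names are illustrative, not part of any MPI API). Each "process" i contributes δ_i, and after the call only the root holds the combined value:

```python
from functools import reduce as fold

def mpi_reduce_model(delta, op, root):
    """Per-process buffers after a Reduce: only `root` gets Op over all delta_i."""
    result = fold(op, delta)          # Op_{i=0}^{p-1} delta_i
    after = [None] * len(delta)
    after[root] = result              # non-root buffers are unchanged/undefined
    return after

# MPI_SUM and MPI_MAX correspond to these Python operators:
buffers = mpi_reduce_model([3, 1, 4, 1], lambda a, b: a + b, root=0)
```

With `root=0` and the sum operator, `buffers` is `[9, None, None, None]`: the reduction lands only on the root, unlike Allreduce below.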
MPI_Scatter (6/28)
int MPI_Scatter(...)
Before: the root holds v[0], v[1], v[2], v[3]. After: process i holds v[i].
MPI_Gather (7/28)
int MPI_Gather(...)
Before: process i holds v[i]. After: the root holds v[0], v[1], v[2], v[3].
MPI_Allgather (8/28)
int MPI_Allgather(...)
Before: process i holds v[i]. After: every process holds v[0], v[1], v[2], v[3].
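A minimal sketch of the Allgather semantics, again as a pure-Python model (the name `allgather_model` is illustrative): every rank starts with its own chunk and ends with a copy of the full, rank-ordered vector.

```python
def allgather_model(chunks):
    """chunks[i] is what rank i contributes; returns each rank's buffer after."""
    full = list(chunks)                      # concatenation in rank order
    return [list(full) for _ in chunks]      # every rank receives a full copy

after = allgather_model(["v0", "v1", "v2", "v3"])
```

After the call, `after[i] == ["v0", "v1", "v2", "v3"]` for every rank i, which is exactly the "After" picture on this slide.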
MPI_Reduce_scatter (9/28)
int MPI_Reduce_scatter(...)
Before: process i holds the vector v_i[0], v_i[1], v_i[2], v_i[3]. After: process j holds Op_i v_i[j].
MPI_Allreduce (10/28)
int MPI_Allreduce(...)
Before: process i holds δ_i. After: every process holds Op_{i=0}^{p-1} δ_i.
MPI_Alltoall (11/28)
int MPI_Alltoall(...)
Before: process i holds the vector v_i[0], ..., v_i[3]. After: process j holds v_0[j], v_1[j], v_2[j], v_3[j] (a distributed transpose).
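The "distributed transpose" reading of Alltoall can be checked directly with a small model (pure Python, `alltoall_model` is an illustrative name): rank j ends up with element j of every rank's send buffer, so applying the operation twice returns the original layout.

```python
def alltoall_model(send):
    """send[i][j] = what rank i sends to rank j; returns each rank's recv buffer."""
    p = len(send)
    return [[send[i][j] for i in range(p)] for j in range(p)]

recv = alltoall_model([[f"v{i}[{j}]" for j in range(4)] for i in range(4)])
```

Here `recv[1]` is `["v0[1]", "v1[1]", "v2[1]", "v3[1]"]`, matching the slide's "After" row, and `alltoall_model(alltoall_model(m)) == m` for any square `m` (transpose is an involution).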
More Collectives (12/28)
Variable length: MPI_Scatterv, MPI_Gatherv, MPI_Allgatherv, MPI_Alltoallv
Partial reduction: MPI_Scan
Non-blocking collectives: MPI_I*
Multiple communicators:
Example 1) World (everybody) + Workers (everybody - masters) + Masters (everybody - workers)
Example 2) r × c mesh of processes; communication within rows/columns only
MPI_Comm_create(...), MPI_Comm_split(...), ...
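For the r × c mesh example, MPI_Comm_split groups ranks by a `color` and orders each group by a `key`. The grouping logic can be sketched without MPI (the helper `comm_split_model` is illustrative; real MPI returns communicator handles, not lists):

```python
def comm_split_model(ranks, color, key):
    """Group `ranks` by color(rank); order each group by key(rank)."""
    groups = {}
    for rank in ranks:
        groups.setdefault(color(rank), []).append(rank)
    return {c: sorted(g, key=key) for c, g in groups.items()}

r, c = 2, 3
ranks = range(r * c)  # row-major placement: rank = i*c + j
rows = comm_split_model(ranks, color=lambda rk: rk // c, key=lambda rk: rk % c)
cols = comm_split_model(ranks, color=lambda rk: rk % c, key=lambda rk: rk // c)
```

With row-major ranks, `rows` is `{0: [0, 1, 2], 1: [3, 4, 5]}` and `cols` is `{0: [0, 3], 1: [1, 4], 2: [2, 5]}`: one communicator per row and per column, as the slide requires.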
Collective Communication: Lower Bounds (13/28)
Cost of communication: α + nβ. Cost of computation: γ · #ops.
α = latency (startup), β = 1/bandwidth, γ = cost of 1 flop, n = length of the message, p = # of processes.

Primitive        Latency      Bandwidth        Computation
Broadcast        log2(p) α    n β              -
Reduce           log2(p) α    n β              ((p-1)/p) n γ
Scatter          log2(p) α    ((p-1)/p) n β    -
Gather           log2(p) α    ((p-1)/p) n β    -
Allgather        log2(p) α    ((p-1)/p) n β    -
Reduce-scatter   log2(p) α    ((p-1)/p) n β    ((p-1)/p) n γ
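A few rows of the lower-bound table written out as a cost model makes the α/β/γ terms concrete. This is a sketch with made-up machine parameters; the function names are illustrative, and the formulas are exactly the table rows above:

```python
from math import log2

def bcast_cost(p, n, alpha, beta):
    """Broadcast lower bound: log2(p)*alpha + n*beta."""
    return log2(p) * alpha + n * beta

def reduce_cost(p, n, alpha, beta, gamma):
    """Reduce lower bound: log2(p)*alpha + n*beta + ((p-1)/p)*n*gamma."""
    return log2(p) * alpha + n * beta + (p - 1) / p * n * gamma

def scatter_cost(p, n, alpha, beta):
    """Scatter lower bound: log2(p)*alpha + ((p-1)/p)*n*beta."""
    return log2(p) * alpha + (p - 1) / p * n * beta
```

For example, with p = 8, n = 100, α = 1.0, β = 0.1 (arbitrary units), broadcast costs 3α + 10 = 13.0 while scatter costs 3α + (7/8)·10 = 11.75, showing why the scatter bandwidth term is slightly cheaper.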
Implementation of Bcast and Reduce (14/28)
IDEA: recursive doubling / Minimum Spanning Tree (MST). At each step, double the number of active processes.
How to map the idea to the specific topology?
- ring: linear doubling
- (2d) mesh: 1 dimension first, then another, then another, ...
- hypercube: obvious, same as mesh
Cost? # steps: log2(p); cost(step): α + nβ; total time: log2(p)α + log2(p)nβ.
Lower bound: log2(p)α + nβ. Note: cost(p²) = 2·cost(p)!
Reduce = Bcast in reverse; cost(computation)?
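The doubling idea can be simulated directly: starting from the root, every process that already holds the data sends it to one that does not, so the set of holders doubles each step until all p are covered. A pure-Python sketch (`mst_bcast` is an illustrative name, and the partner choice `q + size` is one valid MST mapping, not the only one):

```python
def mst_bcast(p):
    """Simulate MST broadcast; return (#steps, set of processes holding data)."""
    has = {0}                 # only the root holds the data initially
    steps = 0
    while len(has) < p:
        size = len(has)
        # each current holder q sends to partner q + size (if it exists)
        has |= {q + size for q in has if q + size < p}
        steps += 1
    return steps, has
```

For p = 8 this takes exactly 3 = log2(8) steps; for p = 5 it takes ceil(log2(5)) = 3 steps, confirming the log2(p) step count claimed above.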
Implementation of Gather (and Scatter) (15/28)
IDEA: MST again. At step i, only a 1/2^i-th of the message is sent.
# steps: log2(p)
cost(step i): α + (n/2^i)β
total time: Σ_{i=1}^{log2(p)} (α + (n/2^i)β) = log2(p)α + ((p-1)/p)nβ
lower bound: log2(p)α + ((p-1)/p)nβ — optimal!
Implementation of Allgather (and Reduce-scatter) (16/28)
IDEA: recursive doubling (bidirectional exchange). Recursive allgather of half the data + exchange of data between disjoint nodes.
# steps: log2(p)
total time: Σ_{i=1}^{log2(p)} (α + (n/2^i)β) = log2(p)α + ((p-1)/p)nβ
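The bidirectional exchange can be simulated for a power-of-two p (an assumption of this sketch; `rd_allgather` is an illustrative name): at step k, each rank swaps everything it owns with the partner obtained by flipping bit k of its rank, so the owned data doubles per step.

```python
def rd_allgather(p):
    """Simulate recursive-doubling allgather; p must be a power of two."""
    owned = [{i} for i in range(p)]       # rank i starts with chunk i
    steps = 0
    while any(len(o) < p for o in owned):
        d = 1 << steps                    # partner distance doubles each step
        owned = [owned[i] | owned[i ^ d] for i in range(p)]
        steps += 1
    return steps, owned
```

For p = 8 every rank holds all 8 chunks after exactly 3 = log2(8) exchange steps, matching the step count above (the per-step message size, not simulated here, doubles as in the cost formula).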
Another Implementation of Allgather (17/28)
IDEA: cyclic (ring) algorithm.
# steps: p - 1
total time: Σ_{i=1}^{p-1} (α + (n/p)β) = (p-1)α + ((p-1)/p)nβ
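The cyclic algorithm pipelines chunks around a ring: in each of the p-1 steps, rank i forwards to rank i+1 the chunk it received in the previous step. A simulation sketch (`ring_allgather` is an illustrative name):

```python
def ring_allgather(p):
    """Simulate the cyclic allgather; return (#steps, owned chunks per rank)."""
    owned = [{i} for i in range(p)]
    in_flight = list(range(p))            # chunk each rank will send next
    steps = 0
    for _ in range(p - 1):
        # rank i receives from rank i-1 (mod p) ...
        received = [in_flight[(i - 1) % p] for i in range(p)]
        for i in range(p):
            owned[i].add(received[i])
        in_flight = received              # ... and forwards it next step
        steps += 1
    return steps, owned
```

For p = 5 this takes 4 = p-1 steps, each moving one n/p-sized chunk per link: more latency steps than recursive doubling, but the same total bandwidth term, which is why it is attractive when α is small relative to nβ.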
Another Implementation of Bcast (18/28)
IDEA: Scatter + cyclic algorithm (allgather).
Dense Matrix-Vector Product: Scalability (19/28)
1D matrix distribution. y := Ax, with x, y ∈ R^n, A ∈ R^{n×n}.
A is partitioned by rows and distributed over p processes: process i owns the block of rows A_i. Similarly for x: process i owns x_i.
Goal: compute y and distribute it the same way as x.
A = [A_0; A_1; ...; A_{p-1}], x = [x_0; x_1; ...; x_{p-1}], and y = [y_0; y_1; ...; y_{p-1}]
Note: A_i, x_i and y_i indicate a block of rows, as opposed to a single one.
Dense Matrix-Vector Product: Scalability (20/28)
1D matrix distribution. Algorithm:
1. x = Allgather(x_i)  — x becomes available to every process
2. y_i = A_i x         — local computation
Parallel cost (lower bound for T_p(n)):
1. log2(p)α + ((p-1)/p)nβ ≈ log2(p)α + nβ
2. 2(n²/p)γ
Sequential cost: T_1(n) = 2n²γ
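The two-step algorithm can be sketched with plain Python lists (no MPI; `gemv_1d_rows` and its arguments are illustrative names): p ranks each own a block of rows of A and a chunk of x, allgather the full x, then multiply locally.

```python
def gemv_1d_rows(A_blocks, x_chunks):
    """A_blocks[i] = rows owned by rank i; x_chunks[i] = x block of rank i."""
    x = [xj for chunk in x_chunks for xj in chunk]      # step 1: Allgather(x_i)
    y_chunks = []
    for A_i in A_blocks:                                # step 2: y_i = A_i x
        y_chunks.append([sum(a * xj for a, xj in zip(row, x)) for row in A_i])
    return y_chunks

A = [[1, 2, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 1]]
y = gemv_1d_rows([A[0:2], A[2:4]], [[1, 1], [1, 1]])    # p = 2 ranks
```

With this A and x = (1, 1, 1, 1), the result is y = (3, 1, 1, 1), distributed as [[3, 1], [1, 1]]: each rank ends up owning the chunk of y that matches its chunk of x, as the goal requires.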
GEMV: Scalability (21/28)
1D matrix distribution.
Speedup: S_p(n) = T_1(n)/T_p(n) = 2n²γ / (log2(p)α + nβ + 2(n²/p)γ)
Efficiency: E_p(n) = S_p(n)/p = 1 / (1 + (p·log2(p)/(2n²))·(α/γ) + (p/(2n))·(β/γ))
Strong scalability: lim_{p→∞} E_p(n) = 0
Weak scalability: M = local memory; Mp = combined memory; n_M = largest problem solvable with p processes; n_M² = Mp.
lim_{p→∞} E_p(n_M) = lim_{p→∞} 1 / (1 + (log2(p)/(2M))·(α/γ) + (√p/(2√M))·(β/γ)) = 0
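Evaluating the efficiency formula numerically makes the strong-scaling conclusion visible. This is a sketch with invented machine parameters (α, β, γ values below are illustrative, not measured):

```python
from math import log2

def efficiency_1d(p, n, alpha, beta, gamma):
    """E_p(n) = T_1 / (p * T_p) for the 1D row-distributed GEMV."""
    t1 = 2 * n * n * gamma
    tp = log2(p) * alpha + n * beta + 2 * n * n / p * gamma
    return t1 / (p * tp)

# Strong scaling: fixed n = 1000, growing p drives efficiency toward zero.
e = [efficiency_1d(p, 1000, 1e-6, 1e-9, 1e-10) for p in (2, 16, 128, 1024)]
```

With these (arbitrary) parameters the efficiencies decrease monotonically as p grows, which is the lim_{p→∞} E_p(n) = 0 statement above in numbers: for fixed n, the log2(p)α and nβ terms eventually dominate the shrinking 2n²γ/p work term.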
Exercise: y = Ax, A distributed by columns (22/28)
A is partitioned by columns and distributed over p processes: process i owns the block of columns A_i, and process i owns x_i.
Goal: compute y and distribute it the same way as x.
A = [A_0 A_1 ... A_{p-1}], x = [x_0; x_1; ...; x_{p-1}], and y = [y_0; y_1; ...; y_{p-1}]
Scalability (23/28)
Algorithm:
1. y^(i) = A_i x_i            — local computation
2. y = Reduce-scatter(y^(i))  — reduction-sum of the y^(i)'s + scatter
Parallel cost:
1. 2(n²/p)γ
2. log2(p)α + ((p-1)/p)nβ + ((p-1)/p)nγ ≈ log2(p)α + n(β + γ)
Lower bound for T_p(n) = log2(p)α + n(β + γ) + 2(n²/p)γ
T_p(n) has one extra term (nγ) with respect to the case of A partitioned by rows; therefore this algorithm is also not scalable.
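The column-distributed algorithm can be sketched the same way as the row-distributed one (pure Python; `gemv_1d_cols` is an illustrative name): each rank forms a full-length partial product y^(i) = A_i x_i, then a reduce-scatter sums the partials elementwise and leaves rank i with its chunk of y.

```python
def gemv_1d_cols(A_col_blocks, x_chunks):
    """A_col_blocks[i] = n x (n/p) column block of rank i; x_chunks[i] = its x."""
    n = len(A_col_blocks[0])
    p = len(x_chunks)
    # step 1: local partial products, each of full length n
    partials = [[sum(row[k] * x_i[k] for k in range(len(x_i))) for row in A_i]
                for A_i, x_i in zip(A_col_blocks, x_chunks)]
    # step 2: reduce-scatter -- elementwise sum, then chunk j goes to rank j
    y = [sum(part[j] for part in partials) for j in range(n)]
    nb = n // p
    return [y[i * nb:(i + 1) * nb] for i in range(p)]

A_0 = [[1, 2], [0, 1], [0, 0], [0, 0]]   # columns 0-1 of the 4x4 example A
A_1 = [[0, 0], [0, 0], [1, 0], [0, 1]]   # columns 2-3
y = gemv_1d_cols([A_0, A_1], [[1, 1], [1, 1]])
```

The result matches the row-distributed version, y = (3, 1, 1, 1) as [[3, 1], [1, 1]], but note that step 2 here models the extra nγ reduction work that makes this variant costlier.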
GEMV: Scalability (24/28)
2D matrix distribution. A is partitioned according to a 2D p = r × c mesh; the (i, j) process (P_ij) owns the block A_ij. x and y are partitioned in p chunks; x is mapped to the mesh by columns, y by rows.
Example: p = 2 × 3.
A = [A_00 A_01 A_02; A_10 A_11 A_12], x = [x_0; x_1; ...; x_5], and y = [y_0; y_1; ...; y_5]
P_ij owns A_ij.
x: P_00 owns x_0, P_10 owns x_1, P_01 owns x_2, P_11 owns x_3, P_02 owns x_4, P_12 owns x_5.
y: P_00 owns y_0, P_01 owns y_1, P_02 owns y_2, P_10 owns y_3, P_11 owns y_4, P_12 owns y_5.
GEMV: Scalability (25/28)
2D matrix distribution. Algorithm:
1. x_I = Allgather(x_i) within columns  — x_I is a block
2. y_J = A_ij x_I                       — local computation
3. y_j = Reduce-scatter(y_J) within rows
Parallel cost (lower bound for T_p(n)):
1. log2(r)α + ((r-1)/p)nβ ≈ log2(r)α + (n/c)β
2. 2(n²/p)γ
3. log2(c)α + ((c-1)/p)nβ + ((c-1)/p)nγ ≈ log2(c)α + (n/r)β + (n/r)γ
Sequential cost: T_1(n) = 2n²γ
GEMV: Scalability (26/28)
2D matrix distribution. Parallel cost, assuming r = c = √p:
T_p(n) = 2(n²/p)γ + (log2(r) + log2(c))α + (n/c + n/r)β + (n/r)γ = 2(n²/p)γ + log2(p)α + (n/√p)(2β + γ)
Speedup: S_p(n) = 2n²γ / T_p(n)
Efficiency: E_p(n) = S_p(n)/p = 1 / (1 + (p·log2(p)/(2n²))·(α/γ) + (√p/(2n))·((2β+γ)/γ))
Strong scalability: lim_{p→∞} E_p(n) = 0
Weak scalability: M = local memory; Mp = combined memory; n_M = largest problem solvable with p processes; n_M² = Mp.
lim_{p→∞} E_p(n_M) = lim_{p→∞} 1 / (1 + (log2(p)/(2M))·(α/γ) + (1/(2√M))·((2β+γ)/γ)) = 0
(the bandwidth term is now bounded; only the log2(p) latency term still grows, and only logarithmically)
Exercise (27/28)
2D matrix distribution. A is partitioned according to a 2D p = r × c mesh; the (i, j) process (P_ij) owns the block A_ij. Both x and y are partitioned in p chunks, and mapped to the mesh by columns.
Example: p = 2 × 3.
A = [A_00 A_01 A_02; A_10 A_11 A_12], x = [x_0; x_1; ...; x_5], and y = [y_0; y_1; ...; y_5]
P_ij owns A_ij.
x: P_00 owns x_0, P_10 owns x_1, P_01 owns x_2, P_11 owns x_3, P_02 owns x_4, P_12 owns x_5.
y: P_00 owns y_0, P_10 owns y_1, P_01 owns y_2, P_11 owns y_3, P_02 owns y_4, P_12 owns y_5.
Parallel Matrix Distribution (28/28)
How to store a (large) matrix using p = r × c processes?
- 1D, block of rows (or columns): bad idea
- 1D (block) cyclic, either by rows or columns: bad idea
- 2D, r × c quadrants: not so good
- 2D (block) cyclic: good idea!
References:
- Introduction to High-Performance Scientific Computing, Victor Eijkhout (free download).
- A Comprehensive Approach to Parallel Linear Algebra Libraries (Technical Report, University of Texas), Chapter 3.
Tradeoffs between synchronization, communication, and work in parallel linear algebra computations Edgar Solomonik, Erin Carson, Nicholas Knight, and James Demmel Department of EECS, UC Berkeley February,
More informationScientific Computing with Case Studies SIAM Press, Lecture Notes for Unit VII Sparse Matrix
Scientific Computing with Case Studies SIAM Press, 2009 http://www.cs.umd.edu/users/oleary/sccswebpage Lecture Notes for Unit VII Sparse Matrix Computations Part 1: Direct Methods Dianne P. O Leary c 2008
More informationStrassen s Algorithm for Tensor Contraction
Strassen s Algorithm for Tensor Contraction Jianyu Huang, Devin A. Matthews, Robert A. van de Geijn The University of Texas at Austin September 14-15, 2017 Tensor Computation Workshop Flatiron Institute,
More informationDivisible Load Scheduling
Divisible Load Scheduling Henri Casanova 1,2 1 Associate Professor Department of Information and Computer Science University of Hawai i at Manoa, U.S.A. 2 Visiting Associate Professor National Institute
More informationChapter 3 - From Gaussian Elimination to LU Factorization
Chapter 3 - From Gaussian Elimination to LU Factorization Maggie Myers Robert A. van de Geijn The University of Texas at Austin Practical Linear Algebra Fall 29 http://z.cs.utexas.edu/wiki/pla.wiki/ 1
More informationImage Reconstruction And Poisson s equation
Chapter 1, p. 1/58 Image Reconstruction And Poisson s equation School of Engineering Sciences Parallel s for Large-Scale Problems I Chapter 1, p. 2/58 Outline 1 2 3 4 Chapter 1, p. 3/58 Question What have
More informationModel Order Reduction via Matlab Parallel Computing Toolbox. Istanbul Technical University
Model Order Reduction via Matlab Parallel Computing Toolbox E. Fatih Yetkin & Hasan Dağ Istanbul Technical University Computational Science & Engineering Department September 21, 2009 E. Fatih Yetkin (Istanbul
More informationLecture 4: Linear Algebra 1
Lecture 4: Linear Algebra 1 Sourendu Gupta TIFR Graduate School Computational Physics 1 February 12, 2010 c : Sourendu Gupta (TIFR) Lecture 4: Linear Algebra 1 CP 1 1 / 26 Outline 1 Linear problems Motivation
More information2. One-To-All Broadcast and All-To-One Reduction. 1. Chapter 4 : Efficient Collective Communication
1. Chater : Efficient Collective Communication Collective communication: comm amongst collection of nodes (not just sender & recver. One-to-all (bcast, all-to-one (reduc, all-to-all, scatter/gather, etc.
More informationV. Adamchik 1. Recurrences. Victor Adamchik Fall of 2005
V. Adamchi Recurrences Victor Adamchi Fall of 00 Plan Multiple roots. More on multiple roots. Inhomogeneous equations 3. Divide-and-conquer recurrences In the previous lecture we have showed that if the
More informationDetermine the size of an instance of the minimum spanning tree problem.
3.1 Algorithm complexity Consider two alternative algorithms A and B for solving a given problem. Suppose A is O(n 2 ) and B is O(2 n ), where n is the size of the instance. Let n A 0 be the size of the
More informationCALU: A Communication Optimal LU Factorization Algorithm
CALU: A Communication Optimal LU Factorization Algorithm James Demmel Laura Grigori Hua Xiang Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-010-9
More information