The Message Passing Interface (MPI) TMA4280 Introduction to Supercomputing


1 The Message Passing Interface (MPI) TMA4280 Introduction to Supercomputing NTNU, IMF February

2 Recap: Parallelism with MPI
An MPI execution is started on a set of processes P with: mpirun -n N_P ./exe, where N_P = card(P) is the number of processes. The program is executed in parallel on every process (Process 0, Process 1, Process 2, ...), and each process in P = {P0, P1, P2} has access to data in its own memory space: remote data is exchanged through the interconnect. An ordered set of processes defines a Group (MPI_Group): the initial group (WORLD) consists of all processes.

3 Recap: Parallelism with MPI
An MPI execution is enclosed between calls to functions that initialize and finalize the subsystem: MPI_Init, ... { Program body } ..., MPI_Finalize. Processes in a Group can communicate through a Communicator (MPI_Comm) with attributes:
rank: MPI_Comm_rank(MPI_COMM_WORLD, &rank);
size: MPI_Comm_size(MPI_COMM_WORLD, &size);
The default communicator for all processes is MPI_COMM_WORLD. New Groups can be created on subsets of P to restrict communication to selected processes. Examples: Monte-Carlo simulation, structured implementation of adjacent exchanges.
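A minimal sketch of this structure in C (the printed message is illustrative):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    /* Initialize the MPI subsystem before any other MPI call. */
    MPI_Init(&argc, &argv);

    /* Query this process' rank and the communicator size. */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("Hello from process %d of %d\n", rank, size);

    /* Finalize before exiting; no MPI calls are allowed afterwards. */
    MPI_Finalize();
    return 0;
}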

4 Recap: Parallelism with MPI
Processes communicate using the Message Passing paradigm:
1. Envelope: how to process the message,
2. Body: data to exchange.
MPI routines are categorized by the relation between the processes involved: Point-to-Point (one-to-one) and Collective (one-to-all, all-to-one, all-to-all), where "all" is defined by the Communicator used. Point-to-Point operations are the base of Message Passing: MPI_Send and MPI_Recv to exchange data. Collective operations are implemented in terms of Point-to-Point operations but may be optimized.

5 Message buffers
MPI_Send(buffer, count, datatype, dest, tag, comm);
Data: buffer (memory location defining the beginning of the buffer), count (number of elements of type datatype in the send buffer), datatype (MPI data type). Envelope: dest (receiving process rank), tag (arbitrary identifier), comm (communicator).
MPI_Recv(buffer, count, datatype, source, tag, comm, &status);
Data: buffer (memory location defining the beginning of the buffer), count (maximum number of elements of type datatype that fit in the receive buffer, which may be larger than the actual message), datatype (MPI data type). Envelope: source (sending process rank), tag (arbitrary identifier), comm (communicator), status (metadata, e.g. the actual length received).
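A minimal sketch of such a point-to-point exchange (buffer size and tag are illustrative):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    double buf[4] = {0.0, 1.0, 2.0, 3.0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Send 4 elements of MPI_DOUBLE to rank 1 with tag 100. */
        MPI_Send(buf, 4, MPI_DOUBLE, 1, 100, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Status status;
        /* count is the capacity of the receive buffer, in elements. */
        MPI_Recv(buf, 4, MPI_DOUBLE, 0, 100, MPI_COMM_WORLD, &status);
        printf("received buf[3] = %f\n", buf[3]);
    }

    MPI_Finalize();
    return 0;
}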

6 Recap: Parallelism with MPI
Two types of MPI communication: blocking and non-blocking. Example timeline: Process 0: send(x, 1); Process 1: recv(y, 0), req = irecv(y, 2), wait(req); Process 2: req = isend(x, 1), wait(req).
1. blocking: the function returns once the buffer is processed (*),
2. non-blocking: the function returns immediately.
MPI implementations use an internal buffer: data may be buffered internally on both sides, and the sending process can modify the send buffer once it has been copied to the internal buffer, regardless of the receiver status.
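A sketch of the non-blocking part of this pattern (run with at least 3 processes; the value sent is illustrative):

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    double x = 3.14, y = 0.0;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 2) {
        /* Start the send and return immediately. */
        MPI_Isend(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
        /* ... computation that does not touch x may overlap here ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* x may be reused after this */
    } else if (rank == 1) {
        /* Post the receive early, then wait for completion. */
        MPI_Irecv(&y, 1, MPI_DOUBLE, 2, 0, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* y is valid after this */
    }

    MPI_Finalize();
    return 0;
}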

7 Recap: Parallelism with MPI
(*) What "processed" means depends on the communication mode: MPI may provide several versions of a routine for different communication modes, e.g. MPI_Send (standard), MPI_Ssend (synchronous), MPI_Rsend (ready), MPI_Bsend (buffered).
Synchronous: the send blocks until the handshake is done and the matching receive has started.
Ready: the send assumes the matching receive has already started (no handshake).
Buffered: the send returns as soon as the data is buffered internally.
Standard: Synchronous or Buffered depending on message size and resources.
Important: do not assume anything about the implementation; different handling of buffers can hide bugs!

8 Parallelism
Pairwise send-receive: Process 0: send(x1, 1); Process 1: send(x2, 2), recv(y0, 0); Process 2: recv(y1, 1).
A safe choice for Point-to-Point communication, and optimized by vendors for their hardware.
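One way to express such a paired exchange is MPI_Sendrecv, which combines a send and a receive in one safe call; a sketch with an illustrative ring shift:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    double x, y;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    x = (double) rank;
    int right = (rank + 1) % size;          /* destination of our send */
    int left  = (rank - 1 + size) % size;   /* source of our receive   */

    /* Send to the right neighbour and receive from the left neighbour
       in a single call, avoiding manual ordering of Send/Recv pairs. */
    MPI_Sendrecv(&x, 1, MPI_DOUBLE, right, 0,
                 &y, 1, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d received %f\n", rank, y);
    MPI_Finalize();
    return 0;
}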

9 Collective operations
Collective operations must, in general, involve all of the processes in a group, and are more efficient and less tedious to use than point-to-point communication:
1. Barrier synchronization
2. Broadcast (one-to-all)
3. Scatter (one-to-all)
4. Gather (all-to-one)
5. Global reduction operations: min, max, sum (all-to-all)
6. All-to-all data exchange
7. Scan across all processes
An example (synchronization between processes): MPI_Barrier(comm);

10 Collective operations
Figure: data layout across processes. One-to-all broadcast (MPI_Bcast): the root's block A_0 is replicated on all processes. All-to-one gather (MPI_Gather): the blocks A_0, A_1, A_2, A_3, one per process, are collected on the root.
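A sketch of these two collectives (buffer sizes and the choice of root 0 are illustrative):

#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Broadcast: root 0 sends one double to every process. */
    double a = (rank == 0 ? 42.0 : 0.0);
    MPI_Bcast(&a, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Gather: every process contributes one double, root 0 collects all. */
    double mine = (double) rank;
    double *all = NULL;
    if (rank == 0) {
        all = malloc(size * sizeof(double));
    }
    MPI_Gather(&mine, 1, MPI_DOUBLE, all, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    free(all);   /* free(NULL) is a no-op on the non-root ranks */
    MPI_Finalize();
    return 0;
}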

11 Global reduction
Figure: four processes p = 0, ..., 3 with initial data. After MPI_Reduce(..., MPI_MIN, 0, ...) the minimum is known to the root (process 0) only; after MPI_Allreduce(..., MPI_MIN, ...) the minimum is known to every process.

12 Global reduction
MPI_Reduce(sbuf, rbuf, count, datatype, op, root, comm);
MPI_Allreduce(sbuf, rbuf, count, datatype, op, comm);
where the first arguments describe the data and the last the envelope.
Examples of predefined operations (C): MPI_SUM, MPI_PROD, MPI_MIN, MPI_MAX.

13 Numerical integration
Figure: the area under f(x) on [0, 1] approximated by rectangles of width h centred at the points x_i.
A_i = f(x_i) h, with x_i = (i + 1/2) h, where i = 0, ..., n-1, and h = 1/n.

14 Numerical integration
π = ∫_0^1 4/(1 + x^2) dx ≈ h Σ_{i=0}^{n-1} 4/(1 + x_i^2) = π_n

15 Calculating pi with MPI in C

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        printf("Requires argument: number of intervals.\n");
        return 1;
    }

    int nprocs, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int nintervals = atoi(argv[1]);

    double time_start;
    if (rank == 0) {
        time_start = MPI_Wtime();
    }

16 Calculating pi with MPI in C (cont.)

    double h = 1.0 / (double) nintervals;
    double sum = 0.0;
    int i;
    for (i = rank; i < nintervals; i += nprocs) {
        double x = h * ((double) i + 0.5);
        sum += 4.0 / (1.0 + x * x);
    }
    double my_pi = h * sum;

    double pi;
    MPI_Reduce(&my_pi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        double duration = MPI_Wtime() - time_start;
        double error = fabs(pi - 4.0 * atan(1.0));
        printf("pi=%e, error=%e, duration=%e\n", pi, error, duration);
    }

    MPI_Finalize();
    return 0;
}

17 Things to consider
1. Is the program correct, e.g., is the convergence rate as expected?
2. Is the program load-balanced?
3. Do we get the same value of π_n for different values of P?
4. Is the program scalable?

18 Convergence test
Table: the error |π − π_n| for increasing n. Hence, |π − π_n| = O(h^2) where h = 1/n.

19 Scalability: timing results on Vilje
Figure: speedup T_1/T_P versus the number of processes P, with curves for the ideal speedup, n = 10^4, and a second value of n.

20 Inner product
σ = x · y = x^T y = Σ_{m=0}^{N-1} x_m y_m
Figure: the vectors x = (x_0, ..., x_9) and y = (y_0, ..., y_9).

21 Distribution of work
Figure: the products x_m y_m are distributed over four processes:
ω_0 = Σ_{m=0}^{2} x_m y_m, ω_1 = Σ_{m=3}^{5} x_m y_m, ω_2 = Σ_{m=6}^{7} x_m y_m, ω_3 = Σ_{m=8}^{9} x_m y_m.

22 Program on processor p
x, y: vectors of dimension N.
ω_p = Σ_{m ∈ N_p} x_m y_m
Send ω_p to every processor q ≠ p. Receive ω_q from every other processor q.
σ = Σ_{q=0}^{P-1} ω_q.

23 Local and global numbering
Global indices: N = {0, 1, 2, ..., N-1}. Cardinality |N| = N.
Subdivision: N = ∪_{p=0}^{P-1} N_p, with N_p a subset of global indices and N_p ∩ N_q = ∅ for p ≠ q.
N_p = |N_p|: the number of global indices assigned to process p.
I_p = {0, 1, ..., N_p - 1}: a local index set. Note |I_p| = |N_p| = N_p.

24 Local and global numbering
The mapping between global and local indices:
µ : m ↦ (p, i), m ∈ N,
µ^{-1} : (p, i) ↦ m, 0 ≤ p < P, i ∈ I_p.

25 Local and global numbering
Requirements for the numbering of entities:
The local-to-global numbering is a one-to-one relationship.
Local indices are found in the interval I_p = {0, 1, ..., N_p - 1}.
Global indices are not necessarily always contiguous: this is the case when the global number of entities is not known.
The entities are usually renumbered globally such that global indices are packed contiguously. Each process p is then assigned a range [i_0^p, i_0^p + N_p[, with:
1. i_0^p the process range offset,
2. N_p the local size.
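For a contiguous block distribution this offset and local size can be computed directly; a small sketch (the sizes N and P are illustrative):

/* Contiguous block partition of N global indices over P processes:
   the first N % P processes receive one extra entity. */
#include <stdio.h>

int main(void)
{
    int N = 10, P = 4;   /* illustrative global size and process count */
    for (int p = 0; p < P; p++) {
        int Np     = N / P + (p < N % P ? 1 : 0);             /* local size N_p     */
        int offset = p * (N / P) + (p < N % P ? p : N % P);   /* range offset i_0^p */
        printf("process %d: offset %d, local size %d\n", p, offset, Np);
    }
    return 0;
}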

26 Local and global numbering
How to compute the offset with MPI?
1. MPI-1: MPI_Scan
2. MPI-2: MPI_Exscan
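A sketch of the offset computation with MPI_Exscan, the exclusive prefix sum (the local size Np used here is illustrative; with MPI-1, MPI_Scan gives the inclusive sum, from which the local contribution must be subtracted):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int Np = rank + 1;   /* illustrative local size on this process */
    int offset = 0;

    /* Exclusive prefix sum: offset on rank p = sum of Np over ranks 0..p-1.
       The result on rank 0 is undefined by the standard, so keep it at 0. */
    MPI_Exscan(&Np, &offset, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0) offset = 0;

    printf("rank %d: local size %d, offset %d\n", rank, Np, offset);
    MPI_Finalize();
    return 0;
}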

27 Distribution of work and data
Figure: each process stores only its local vectors x̂ and ŷ and computes a local partial sum:
ω_0 = Σ_{i=0}^{2} x̂_i ŷ_i, ω_1 = Σ_{i=0}^{2} x̂_i ŷ_i, ω_2 = Σ_{i=0}^{1} x̂_i ŷ_i, ω_3 = Σ_{i=0}^{1} x̂_i ŷ_i.

28 Program on processor p
x̂, ŷ: vectors of dimension |I_p|.
ω_p = Σ_{i=0}^{|I_p|-1} x̂_i ŷ_i = Σ_{i=0}^{|I_p|-1} x_{µ^{-1}(p,i)} y_{µ^{-1}(p,i)} = Σ_{m ∈ N_p} x_m y_m.
Send ω_p to every processor q ≠ p. Receive ω_q from every other processor q.
σ = Σ_{q=0}^{P-1} ω_q.

29 Global reduction
Reduction algorithm for the global sum σ = Σ_{q=0}^{P-1} ω_q:
MPI_Reduce(ω, σ, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD) (the answer will be known to process zero only), or
MPI_Allreduce(ω, σ, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD) (the answer will be known to every process).
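A sketch of the whole distributed inner product using MPI_Allreduce (vector length and contents are illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int N = 10;                                          /* global length */
    int Np = N / nprocs + (rank < N % nprocs ? 1 : 0);   /* local size    */

    /* Local parts of x and y; here simply filled with ones. */
    double *x = malloc(Np * sizeof(double));
    double *y = malloc(Np * sizeof(double));
    double omega = 0.0, sigma = 0.0;
    for (int i = 0; i < Np; i++) {
        x[i] = 1.0;
        y[i] = 1.0;
        omega += x[i] * y[i];                            /* local partial sum */
    }

    /* Global sum of the partial results, known to every process. */
    MPI_Allreduce(&omega, &sigma, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) printf("sigma = %f\n", sigma);        /* expect N = 10 */

    free(x);
    free(y);
    MPI_Finalize();
    return 0;
}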

30 Global sum
P = 4, MPI_Reduce (tree reduction). Figure: starting from ω_0, ω_1, ω_2, ω_3 on p = 0, ..., 3, the first step forms ω_0 + ω_1 and ω_2 + ω_3, and the second step forms ω_0 + ω_1 + ω_2 + ω_3 = σ on the root. Can be completed in log_2 P steps.

31 Global sum
P = 4, MPI_Allreduce (butterfly exchange). Figure: starting from ω_0, ω_1, ω_2, ω_3 on p = 0, ..., 3, the first step exchanges between pairs (ω_0 + ω_1, ω_1 + ω_0, ω_2 + ω_3, ω_3 + ω_2), and the second step yields ω_0 + ω_1 + ω_2 + ω_3 = σ on every process, each process accumulating the terms in a different order. Can be completed in log_2 P steps.

32 Program on processor p
x̂, ŷ: vectors of dimension |I_p|.
σ = Σ_{i=0}^{|I_p|-1} x̂_i ŷ_i = Σ_{i=0}^{|I_p|-1} x_{µ^{-1}(p,i)} y_{µ^{-1}(p,i)} = Σ_{m ∈ N_p} x_m y_m.
for d = 0, ..., log_2 P - 1
    Send σ to processor q = p ⊕ 2^d
    Receive σ_q from processor q = p ⊕ 2^d
    σ = σ + σ_q
end
Here, ⊕ is exclusive or: p and p ⊕ 2^d differ only in bit d.
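A sketch of this recursive-doubling sum using MPI_Sendrecv, assuming P is a power of two (the local contribution is illustrative):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double sigma = (double) rank;   /* illustrative local contribution ω_p */

    /* Recursive doubling: at step d, exchange partial sums with the process
       whose rank differs only in bit d (requires nprocs a power of two). */
    for (int d = 1; d < nprocs; d <<= 1) {
        int q = rank ^ d;           /* partner q = p XOR 2^d */
        double sigma_q;
        MPI_Sendrecv(&sigma, 1, MPI_DOUBLE, q, 0,
                     &sigma_q, 1, MPI_DOUBLE, q, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        sigma += sigma_q;
    }

    /* Every process now holds the global sum 0 + 1 + ... + (P-1). */
    printf("rank %d: sigma = %f\n", rank, sigma);
    MPI_Finalize();
    return 0;
}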
