Parallel Algorithms. Arnaud Legrand, CNRS, University of Grenoble. LIG laboratory, October 2.

1 Parallel Algorithms. Arnaud Legrand, CNRS, University of Grenoble. LIG laboratory, October 2. 1 / 235

2 Outline Part I: Models. Part II: Communications on a Ring. Part III: Speedup and Efficiency. Part IV: Algorithms on a Ring. Part V: Algorithm on a Heterogeneous Ring. Part VI: Algorithms on a Grid. 2 / 235

3 Part I: Models 3 / 235

4 Motivation Scientific computing: large needs in computation or storage resources. Need to use systems with several processors: parallel computers with shared/distributed memory, clusters, heterogeneous clusters, clusters of clusters of workstations, the Grid, desktop grids. When modeling a platform, communication modeling seems to be the most controversial part. Two kinds of people produce communication models: those who are concerned with scheduling and those who are concerned with performance evaluation. All these models are imperfect and intractable. 4 / 235

5 Outline 1 Point-to-Point Communication Models: Hockney; LogP and Friends; TCP. 2 Modeling Concurrency: Multi-port; Single-port (Pure and Full Duplex); Flows. 3 Remind This is a Model, Hence Imperfect. 4 Topology: A Few Examples; Virtual Topologies. 5 / 235

6 UET-UCT Unit Execution Time - Unit Communication Time. Hem... This one is mainly used by scheduling theoreticians to prove that their problem is hard and to know whether there is some hope of proving some clever result or not. Some people have introduced a model with a cost of ε for local communications and 1 for communications with the outside. 6 / 235

7 Hockney Model Hockney [Hoc94] proposed the following model for performance evaluation of the Paragon. A message of size m from P_i to P_j requires: t_{i,j}(m) = L_{i,j} + m/b_{i,j}. In scheduling, there are three types of corresponding models: (1) communications are not splittable and each communication k is associated with a communication time t_k (accounting for message size, latency, bandwidth, middleware, ...); (2) communications are splittable but latency is considered negligible (linear divisible model): t_{i,j}(m) = m/b_{i,j}; (3) communications are splittable and latency cannot be neglected (affine divisible model): t_{i,j}(m) = L_{i,j} + m/b_{i,j}. 7 / 235
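
To make the model concrete, here is a small C sketch (ours, not from the slides) that evaluates the Hockney transfer time; the latency and bandwidth values are invented placeholders.

    #include <stdio.h>

    /* Hockney model: time to send an m-byte message from P_i to P_j,
     * with latency L_ij (seconds) and bandwidth B_ij (bytes/s). */
    double hockney_time(double m, double L_ij, double B_ij) {
        return L_ij + m / B_ij;
    }

    int main(void) {
        double L = 50e-6;   /* 50 us latency (assumed)          */
        double B = 125e6;   /* ~1 Gb/s, i.e. 125 MB/s (assumed) */
        for (double m = 1e3; m <= 1e7; m *= 10)
            printf("m = %8.0f bytes -> t = %g s\n", m, hockney_time(m, L, B));
        return 0;
    }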

8-9 LogP The LogP model [CKP + 96] is defined by 4 parameters: L is the network latency; o is the middleware overhead (message splitting and packing, buffer management, connection, ...) for a message of size w; g is the gap (the minimum time between two packets) between two messages of size w; P is the number of processors/modules. [Figure: timeline of the overheads o, gaps g and latency L on the sender, the two network cards and the receiver.] Sending m bytes with packets of size w takes 2o + L + (ceil(m/w) - 1) max(o, g). Occupation on the sender and on the receiver: o + L + (ceil(m/w) - 1) max(o, g). 8 / 235
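
The following C sketch (ours) evaluates the LogP send time as reconstructed above; the values of L, o, g and w are invented placeholders.

    #include <math.h>
    #include <stdio.h>

    /* LogP estimate of the time to send m bytes as packets of size w. */
    double logp_send_time(double m, double w, double L, double o, double g) {
        double packets = ceil(m / w);
        double step = (o > g) ? o : g;          /* max(o, g) */
        return 2.0 * o + L + (packets - 1.0) * step;
    }

    int main(void) {
        double L = 5e-6, o = 2e-6, g = 3e-6, w = 1024;   /* assumed values */
        printf("1 KiB: %g s\n", logp_send_time(1024.0, w, L, o, g));
        printf("1 MiB: %g s\n", logp_send_time(1048576.0, w, L, o, g));
        return 0;
    }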

10 LogGP & plogp Parallel P2P Communication Hockney LogP and Friends TCP Modeling Concurency Multi-port Single-port (Pure and Full Duplex) Flows Imperfection Topology A Few Examples Virtual Topologies The previous model works fine for short messages. However, many parallel machines have special support for long messages, hence a higher bandwidth. LogGP [AISS97] is an extension of LogP: G captures the bandwidth for long messages: short messages 2o + L + m w max(o, g) long messages 2o + L + (m 1)G There is no fundamental difference... OK, it works for small and large messages. Does it work for averagesize messages? plogp [KBV00] is an extension of LogP when L, o and g depends on the message size m. They also have introduced a distinction between o s and o r. This is more and more precise but concurency is still not taken into account. 9 / 235

11-12 Bandwidth as a Function of Message Size With the Hockney model, the transfer time is m -> L + m/B, hence an effective bandwidth of m/(L + m/B). [Figure: measured bandwidth (Mbit/s) as a function of message size (1 KB to 16 MB), MPICH over TCP with Gigabit Ethernet, with and without MPICH optimization.] 10 / 235

13 What About TCP-based Networks? The previous models work fine for parallel machines, but most networks use TCP, which has a fancy flow-control mechanism and slow start. Is it valid to use an affine model for such networks? The answer seems to be yes, but the latency and bandwidth parameters have to be carefully measured [LQDB05]. Probing for m = 1 B and m = 1 MB leads to bad results. The whole middleware layer should be benchmarked (the theoretical latency is useless because of middleware; the theoretical bandwidth is useless because of middleware and latency). The slow start does not seem to be too harmful. Most people forget that the round-trip time has a huge impact on the bandwidth. 11 / 235

14 Outline 1 Point-to-Point Communication Models: Hockney; LogP and Friends; TCP. 2 Modeling Concurrency: Multi-port; Single-port (Pure and Full Duplex); Flows. 3 Remind This is a Model, Hence Imperfect. 4 Topology: A Few Examples; Virtual Topologies. 12 / 235

15 Multi-ports A given processor can communicate with as many other processors as it wishes without any degradation. This model is widely used by scheduling theoreticians (think of all the DAG-with-communications scheduling problems) to prove that their problem is hard and to know whether there is some hope of proving some clever result or not. This model is borderline, especially when allowing duplication, when one communicates with everybody at the same time, or when trying to design algorithms with super-tight approximation ratios. Frankly, such a model is totally unrealistic. Using MPI and synchronous communications, it may not be an issue. However, with multi-core, multi-processor machines, it cannot be ignored... [Figure: hosts A, B and C exchanging data under the multi-port model.] 13 / 235

16 Bounded Multi-port Assume now that we have threads or multi-core processors. We can bound the sum of the throughputs of all communications (incoming and outgoing). Such a model is OK for wide-area communications [HP04]. Remember, the bounds due to the round-trip time must not be forgotten! [Figure: hosts A, B and C under the bounded multi-port (β) model, each flow getting β/2 (numbers in Mb/s).] 14 / 235

17 Single-port (Pure) A process can communicate with only one other process at a time. This constraint is generally written as a constraint on the sum of communication times and is thus rather easy to use in a scheduling context (even though it makes problems more complex). This model makes sense when using non-threaded versions of communication libraries (e.g., MPI). As soon as you are allowed to use threads, bounded multi-port seems a more reasonable option (both for performance and scheduling complexity). [Figure: hosts A, B and C under the pure 1-port model, each of the three transfers getting 1/3 of the time.] 15 / 235

18 Single-port (Full-Duplex) At a given time, a process can be engaged in at most one emission and one reception. This constraint is generally written as two constraints: one on the sum of incoming communication times and one on the sum of outgoing communication times. [Figure: hosts A, B and C under the full-duplex 1-port model, each transfer getting 1/2 (numbers in Mb/s).] 16 / 235

19 Single-port (Full-Duplex) This model somehow makes sense when using networks like Myrinet that have few multiplexing units, with protocols without flow control [Mar07]. Even if it does not model complex situations well, such a model is not harmful. 17 / 235

20-21 Fluid Modeling When using TCP-based networks, it is generally reasonable to use flows to model bandwidth sharing [MR99, Low03]. The rates ϱ_r of the requests r ∈ R must satisfy the capacity constraints: for every link l ∈ L, the sum of ϱ_r over the requests r going through l is at most c_l. Classical objectives: Income maximization: maximize Σ_{r∈R} ϱ_r. Max-min fairness: maximize min_{r∈R} ϱ_r (ATM). Proportional fairness: maximize Σ_{r∈R} log(ϱ_r) (TCP Vegas). Potential delay minimization: minimize Σ_{r∈R} 1/ϱ_r. Some weird function: minimize Σ_{r∈R} arctan(ϱ_r) (TCP Reno). 18 / 235

22 Flows Extensions Note that this model is a multi-port model with capacity constraints (like the bounded multi-port above). When latencies are large, using multiple connections makes it possible to get more bandwidth. As a matter of fact, there is very little to lose in using multiple connections... Therefore many people enforce a sometimes artificial (but less intrusive) bound on the maximum number of connections per link [Wag05, MYCR06]. 19 / 235

23 Outline 1 Point-to-Point Communication Models: Hockney; LogP and Friends; TCP. 2 Modeling Concurrency: Multi-port; Single-port (Pure and Full Duplex); Flows. 3 Remind This is a Model, Hence Imperfect. 4 Topology: A Few Examples; Virtual Topologies. 20 / 235

24 Remind This is a Model, Hence Imperfect The previous sharing models are nice, but you generally do not know the other flows... Communications use the memory bus and hence interfere with computations. Taking such interferences into account may become more and more important with multi-core architectures. Interferences between communications are sometimes... surprising. Modeling is an art. You have to know your platform and your application to know what is negligible and what is important. Even if your model is imperfect, you may still derive interesting results. 21 / 235

25 Outline 1 Point-to-Point Communication Models: Hockney; LogP and Friends; TCP. 2 Modeling Concurrency: Multi-port; Single-port (Pure and Full Duplex); Flows. 3 Remind This is a Model, Hence Imperfect. 4 Topology: A Few Examples; Virtual Topologies. 22 / 235

26 Various Topologies Used in the Literature [Figure: examples of interconnection topologies.] 23 / 235

27 Beyond MPI_Comm_rank()? So far, MPI gives us a unique number for each processor. With this one can do anything, but it is pretty inconvenient precisely because one can do anything with it. Typically, one likes to impose constraints about which processor/process can talk to which other processor/process. With such constraints, one can then think of the algorithm in simpler terms: there are fewer options for communications between processors, so there are fewer choices when implementing an algorithm. 24 / 235

28 Virtual Topologies? MPI provides an abstraction over physical computers: each host has an IP address; MPI hides this address behind a convenient number; there can be multiple such numbers mapped to the same IP address; all numbers can talk to each other. A virtual topology provides an abstraction over MPI: each process has a number, which may be different from the MPI number, and there are rules about which numbers a number can talk to. A virtual topology is defined by specifying the neighbors of each process. 25 / 235

29-31 Implementing a Virtual Topology [Figure: a binary tree over MPI ranks 0 to 7; the node of rank r at level i and position j is labeled (i,j), from (0,0) at the root down to (3,0).] (i,j) = (floor(log2(rank+1)), rank - 2^i + 1) and rank = 2^i + j - 1. my_parent(i,j) = (i-1, floor(j/2)); my_left_child(i,j) = (i+1, j*2), if any; my_right_child(i,j) = (i+1, j*2+1), if any. One then communicates with, e.g., MPI_Send(..., rank(my_parent(i,j)), ...) and MPI_Recv(..., rank(my_left_child(i,j)), ...). 26-28 / 235
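
A minimal C sketch (ours, not from the slides) of these rank/coordinate translations for the binary-tree virtual topology; the helper names and the struct are our own.

    #include <math.h>
    #include <stdio.h>

    typedef struct { int i, j; } coord;   /* (level, position within level) */

    coord rank_to_coord(int rank) {
        coord c;
        c.i = (int)floor(log2((double)(rank + 1)));
        c.j = rank - (1 << c.i) + 1;
        return c;
    }

    int coord_to_rank(coord c) { return (1 << c.i) + c.j - 1; }

    int parent_rank(int rank) {
        if (rank == 0) return -1;                 /* the root has no parent */
        coord c = rank_to_coord(rank);
        coord p = { c.i - 1, c.j / 2 };
        return coord_to_rank(p);
    }

    /* Children of (i,j) are (i+1, 2j) and (i+1, 2j+1), i.e. ranks 2r+1 and 2r+2. */
    int left_child_rank(int rank, int nprocs)  { int r = 2*rank + 1; return r < nprocs ? r : -1; }
    int right_child_rank(int rank, int nprocs) { int r = 2*rank + 2; return r < nprocs ? r : -1; }

    int main(void) {
        for (int rank = 0; rank < 8; rank++) {
            coord c = rank_to_coord(rank);
            printf("rank %d -> (%d,%d)  parent %d  children %d/%d\n", rank, c.i, c.j,
                   parent_rank(rank), left_child_rank(rank, 8), right_child_rank(rank, 8));
        }
        return 0;
    }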

32 Typical Topologies Common topologies (see Section 3.1.2): linear array, ring, 2-D grid, 2-D torus, one-level tree, fully connected graph, arbitrary graph. Two options for all topologies: monodirectional links (more constrained but simpler) or bidirectional links (less constrained but potentially more complicated). By complicated we typically mean more bug-prone. We will look at the ring and the grid in detail. 29 / 235

33 Main Assumption and Big Question The main assumption is that once we have defined the virtual topology, we forget it is virtual and write parallel algorithms assuming it is physical: we assume that communications on different (virtual) links do not interfere with each other, and that computations on different (virtual) processors do not interfere with each other. The big question: how well do these assumptions hold? The question is mostly about the network. There are two possible bad cases. Case #1: the assumptions do not hold and there are interferences. We will most likely achieve bad performance, our performance models will be broken, and reasoning about performance improvements will be difficult. Case #2: the assumptions do hold, but we leave a lot of the network resources unutilized. We could perhaps do better with another virtual topology. 30 / 235

34 Which Virtual Topology to Pick We will see that some topologies are really well suited to certain algorithms. The question is whether they are well suited to the underlying architecture. The goal is to strike a good compromise: not too bad given the algorithm, not too bad given the platform. Fortunately, many platforms these days use switches, which naturally support many virtual topologies because they allow concurrent communications between disjoint pairs of processors. As part of a programming assignment, you will explore whether some virtual topology makes sense on our cluster. 31 / 235

35 Topologies and Data Distribution One of the common steps when writing a parallel algorithm is to distribute some data (array, data structure, etc.) among the processors in the topology. Typically, one distributes the data in a way that matches the topology; e.g., if the data is 3-D, then it is nice to have a 3-D virtual topology. One question that arises then is: how is the data distributed across the topology? In the next set of slides we look at our first topology: a ring. 32 / 235

36 Part II: Communications on a Ring 33 / 235

37 Outline 5 Assumptions 6 Broadcast 7 Scatter 8 All-to-All 9 Broadcast: Going Faster 34 / 235

38 Ring Topology (Section 3.3) The p processors P_0, P_1, ..., P_{p-1} are arranged in a ring. Each processor is identified by a rank, MY_NUM(). There is a way to find the total number of processors, NUM_PROCS(). Each processor can send a message to its successor, SEND(addr, L), and receive a message from its predecessor, RECV(addr, L). We will just use the above pseudo-code rather than MPI. Note that this is much simpler than the example tree topology we saw in the previous set of slides. 35 / 235
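
For reference, a minimal sketch (ours, not from the slides) of how these primitives could be mapped onto MPI; the tag, the use of MPI_COMM_WORLD and the blocking semantics are our choices.

    #include <mpi.h>

    static int MY_NUM(void)    { int r; MPI_Comm_rank(MPI_COMM_WORLD, &r); return r; }
    static int NUM_PROCS(void) { int p; MPI_Comm_size(MPI_COMM_WORLD, &p); return p; }

    /* Send `len` bytes to the successor on the ring. */
    static void SEND(void *addr, int len) {
        MPI_Send(addr, len, MPI_BYTE, (MY_NUM() + 1) % NUM_PROCS(), 0, MPI_COMM_WORLD);
    }

    /* Receive `len` bytes from the predecessor on the ring. */
    static void RECV(void *addr, int len) {
        MPI_Recv(addr, len, MPI_BYTE, (MY_NUM() - 1 + NUM_PROCS()) % NUM_PROCS(),
                 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }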

39 Virtual vs. Physical Topology Now that we have chosen to consider a ring topology, we pretend our physical topology is a ring. We can always implement a virtual ring topology (see the previous set of slides, and read Section 4.6), so we can write many ring algorithms. It may be that another virtual topology is better suited to our physical topology, but the ring topology makes for very simple programs and is known to be reasonably good in practice. So it is a good candidate for our first look at parallel algorithms. 36 / 235

40 Cost of Communication It is actually difficult to precisely model the cost of communication; e.g., MPI implementations perform various optimizations depending on the message size. We will be using a simple model: Time = L + m b, where L is the start-up cost or latency, b = 1/B is the inverse of the bandwidth B, and m is the message size. We assume that if a message of length m is sent from P_0 to P_q, then the communication cost is q(L + m b). There are many assumptions in our model, some not very realistic, but we will discuss them later. 37 / 235

41 Assumptions about Communications Several options: (1) both SEND() and RECV() are blocking, called rendez-vous; very old-fashioned systems. (2) RECV() is blocking, but SEND() is not; pretty standard; MPI supports it. (3) Both RECV() and SEND() are non-blocking; pretty standard as well; MPI supports it. 38 / 235

42 Assumptions about Concurrency One important question is: can the processor do multiple things at the same time? Typically we will assume that the processor can send, receive, and compute at the same time (call MPI_Irecv(), call MPI_Isend(), compute something). This of course implies that the three operations are independent: e.g., you do not want to send the result of the computation, and you do not want to send what you are receiving (forwarding). When writing parallel algorithms (in pseudo-code), we will simply indicate concurrent activities with a || sign. 39 / 235

43 Collective Communications To write a parallel algorithm, we will need collective operations: broadcasts, etc. Now, MPI provides those, and they likely do not use the ring logical topology and do utilize the physical resources well. Let us still go through the exercise of writing some collective communication algorithms. We will see that for some algorithms we really want to do these communications by hand on our virtual topology rather than using the MPI collectives. 40 / 235

44 Outline 5 Assumptions 6 Broadcast 7 Scatter 8 All-to-All 9 Broadcast: Going Faster 41 / 235

45 Broadcast (Section 3.3.1) We want to write a program that has P_k send the same message of length m to all other processors: Broadcast(k, addr, m). On the ring, we just send to the next processor, and so on, with no parallel communications whatsoever. This is of course not the way one should implement a broadcast in practice if the physical topology is not merely a ring: MPI uses some type of tree topology. 42 / 235

46 Broadcast (Section 3.3.1)

    Broadcast(k, addr, m)
      q = MY_NUM()
      p = NUM_PROCS()
      if (q == k)
        SEND(addr, m)
      else if (q == k - 1 mod p)
        RECV(addr, m)
      else
        RECV(addr, m)
        SEND(addr, m)
      endif

This assumes a blocking receive; sending may be non-blocking. The broadcast time is (p-1)(L + m b). 43 / 235
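
For concreteness, here is a minimal MPI version of this ring broadcast (our own sketch, reusing the SEND/RECV mapping shown earlier; error handling omitted).

    #include <mpi.h>

    /* Processor k sends the m bytes at addr around the ring. */
    void ring_broadcast(int k, void *addr, int m) {
        int q, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &q);
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        int succ = (q + 1) % p, pred = (q - 1 + p) % p;

        if (q == k) {                          /* the root only sends              */
            MPI_Send(addr, m, MPI_BYTE, succ, 0, MPI_COMM_WORLD);
        } else if (q == (k - 1 + p) % p) {     /* the last processor only receives */
            MPI_Recv(addr, m, MPI_BYTE, pred, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {                               /* everybody else forwards          */
            MPI_Recv(addr, m, MPI_BYTE, pred, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(addr, m, MPI_BYTE, succ, 0, MPI_COMM_WORLD);
        }
    }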

47 Outline 5 Assumptions 6 Broadcast 7 Scatter 8 All-to-All 9 Broadcast: Going Faster 44 / 235

48 Scatter (Section 3.3.2) Processor k sends a different message to all other processors (and to itself). P_k stores the message destined to P_q at address addr[q], including a message at addr[k]. At the end of the execution, each processor holds the message it has received in msg. The principle is just to pipeline communications by starting with the message destined to P_{k-1}, the most distant processor. 45 / 235

49 Scatter (Section 3.3.2)

    Scatter(k, msg, addr, m)
      q = MY_NUM()
      p = NUM_PROCS()
      if (q == k)
        for i = 0 to p - 2
          SEND(addr[k + p - 1 - i mod p], m)
        msg <- addr[k]
      else
        RECV(tempR, m)
        for i = 1 to k - 1 - q mod p
          tempS <-> tempR
          SEND(tempS, m) || RECV(tempR, m)
        msg <- tempR

Same execution time as the broadcast: (p-1)(L + m b). The send buffer and receive buffer (pointers) are swapped at each step, and sending and receiving happen in parallel, with a non-blocking send. 46 / 235

50 Scatter (Section 3.3.2) Example with k = 2 and p = 4. Processor q=2 (the root) sends addr[1], then addr[0], then addr[3], and keeps msg = addr[2]. Processor q=3 receives addr[1], loops (2-1-3) mod 4 = 2 times (it forwards addr[1] while receiving addr[0], then forwards addr[0] while receiving addr[3]), and keeps msg = addr[3]. Processor q=0 receives addr[1], loops once (forwards addr[1] while receiving addr[0]), and keeps msg = addr[0]. Processor q=1 loops zero times: it receives addr[1] and keeps msg = addr[1]. 47 / 235

51 Outline 5 Assumptions 6 Broadcast 7 Scatter 8 All-to-All 9 Broadcast: Going Faster 48 / 235

52 All-to-All (Section 3.3.3)

    All2All(my_addr, addr, m)
      q = MY_NUM()
      p = NUM_PROCS()
      addr[q] <- my_addr
      for i = 1 to p - 1
        SEND(addr[q - i + 1 mod p], m) || RECV(addr[q - i mod p], m)

Same execution time as the scatter: (p-1)(L + m b). 49 / 235
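
A minimal MPI sketch (ours) of this all-to-all exchange; MPI_Sendrecv plays the role of the concurrent SEND || RECV, and the blocks are stored contiguously in a buffer of p*m bytes.

    #include <mpi.h>
    #include <string.h>

    /* Every process contributes the m bytes at my_addr; on return,
     * addr + r*m holds the block of rank r, for every r. */
    void ring_all2all(const void *my_addr, char *addr, int m) {
        int q, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &q);
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        int succ = (q + 1) % p, pred = (q - 1 + p) % p;

        memcpy(addr + q * m, my_addr, m);              /* addr[q] <- my_addr */
        for (int i = 1; i <= p - 1; i++) {
            int s = ((q - i + 1) % p + p) % p;         /* block to forward   */
            int r = ((q - i) % p + p) % p;             /* block to receive   */
            MPI_Sendrecv(addr + s * m, m, MPI_BYTE, succ, 0,
                         addr + r * m, m, MPI_BYTE, pred, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }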

53 Outline 5 Assumptions 6 Broadcast 7 Scatter 8 All-to-All 9 Broadcast: Going Faster 50 / 235

54 A Faster Broadcast? How can we improve performance? One can cut the message into many small pieces, say r pieces (where m is divisible by r). The root processor then just sends r messages. The performance is as follows. Consider the last processor to get the last piece of the message: p-1 steps are needed for the first piece to arrive, which takes (p-1)(L + m b / r); then the remaining r-1 pieces arrive one after another, which takes (r-1)(L + m b / r). The total is (p + r - 2)(L + m b / r). 51 / 235

55 A Faster Broadcast? The question is: what value of r minimizes (p + r - 2)(L + m b / r)? One can view this expression as (c + a r)(d + b/r), with four constants a, b, c, d. The non-constant part of the expression is then a d r + c b / r, which must be minimized. It is known that this is minimized for r = sqrt(c b / (a d)), so r_opt = sqrt(m (p-2) b / L), with the optimal time (sqrt((p-2) L) + sqrt(m b))², which tends to m b when m is large, independently of p! 52 / 235
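
A small C sketch (ours) that evaluates this model and compares the plain ring broadcast with the pipelined one at r = r_opt; all numeric parameter values are invented.

    #include <math.h>
    #include <stdio.h>

    /* Pipelined ring broadcast cost: (p + r - 2)(L + m*b/r). */
    double pipelined_time(double p, double r, double L, double m, double b) {
        return (p + r - 2.0) * (L + m * b / r);
    }

    int main(void) {
        double p = 64, L = 50e-6, b = 1.0 / 125e6, m = 64e6;  /* assumed values */
        double r_opt = sqrt(m * (p - 2) * b / L);
        printf("plain broadcast     : %g s\n", (p - 1) * (L + m * b));
        printf("pipelined, r ~ %.0f : %g s\n", r_opt, pipelined_time(p, r_opt, L, m, b));
        printf("lower bound m*b     : %g s\n", m * b);
        return 0;
    }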

56 Well-known Principle We have seen that if we cut a (large) message into many (small) messages, then we can send the message over multiple hops (in our case p-1) almost as fast as we can send it over a single hop. This is a fundamental principle of IP networks: we cut messages into IP frames and send them over many routers, yet really go as fast as the slowest router. 53 / 235

57 Part III: Speedup and Efficiency 54 / 235

58 Outline 10 Speedup 11 Amdahl's Law 55 / 235

59 Speedup We need a metric to quantify the impact of a performance enhancement. Speedup: ratio of old time to new time. If the old time is 2h and the new time is 1h, the speedup is 2h / 1h = 2. Sometimes one talks about a slowdown, in case the enhancement is not beneficial. Happens more often than one thinks. 56 / 235

60 Performance The notion of speedup is completely generic: by using a rice cooker I have achieved a 1.20 speedup for rice cooking. For parallel programs one defines the parallel speedup (we will just say speedup): the parallel program takes time T_1 on 1 processor and time T_p on p processors, and the parallel speedup is S(p) = T_1 / T_p. In the ideal case, if my sequential program takes 2 hours on 1 processor, it takes 1 hour on 2 processors: this is called linear speedup. 57 / 235

61-63 Speedup [Figure: speedup as a function of the number of processors, with superlinear, linear and sub-linear curves.] Superlinear speedup? There are several possible causes. Algorithms: with optimization problems, throwing many processors at the problem increases the chance that one of them will get lucky and find the optimum fast. Hardware: with many processors, it is possible that the entire application data resides in cache (vs. RAM) or in RAM (vs. disk). 58 / 235

64 Outline 10 Speedup 11 Amdahl's Law 59 / 235

65 Bad News: Amdahl's Law Consider a program whose execution consists of two phases: (1) one sequential phase, T_seq = (1 - f) T_1; (2) one phase that can be perfectly parallelized (linear speedup), T_par = f T_1. Therefore T_p = T_seq + T_par / p = (1 - f) T_1 + f T_1 / p. Amdahl's Law: S_p = 1 / (1 - f + f/p). [Figure: speedup as a function of the number of processors for f = 10%, 20%, 50% and 80%.] 60 / 235
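
As a quick numerical illustration (ours), the following C program tabulates Amdahl's bound for the parallel fractions plotted on the slide.

    #include <stdio.h>

    /* Amdahl's law: S_p = 1 / ((1 - f) + f/p), f = parallelizable fraction. */
    double amdahl(double f, double p) { return 1.0 / ((1.0 - f) + f / p); }

    int main(void) {
        double fracs[] = { 0.10, 0.20, 0.50, 0.80 };
        int    procs[] = { 2, 8, 64, 1024 };
        for (int i = 0; i < 4; i++) {
            for (int j = 0; j < 4; j++)
                printf("f=%2.0f%% p=%4d S=%5.2f   ",
                       100 * fracs[i], procs[j], amdahl(fracs[i], procs[j]));
            printf("\n");
        }
        return 0;
    }

Even with f = 80% of the time perfectly parallelized, the speedup can never exceed 1/(1-f) = 5, however many processors are used.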

66 Lessons from Amdahl's Law It is a law of diminishing returns. If a significant fraction of the code (in terms of time spent in it) is not parallelizable, then parallelization is not going to be good. It sounds obvious, but people new to high performance computing often forget how bad Amdahl's law can be. Luckily, many applications can be almost entirely parallelized, so the non-parallelizable fraction 1 - f is small. 61 / 235

67 Efficiency Efficiency is defined as Eff(p) = S(p)/p. It is typically < 1, unless the speedup is linear or superlinear. It is used to measure how well the processors are utilized. If increasing the number of processors by a factor of 10 increases the speedup only by a factor of 2, perhaps it is not worth it: efficiency drops by a factor of 5. This is important when purchasing a parallel machine, for instance: if, due to the application's behavior, efficiency is low, forget buying a large cluster. 62 / 235

68 Scalability A measure of the effort needed to maintain efficiency while adding processors. Efficiency also depends on the problem size: Eff(n, p). Isoefficiency: at which rate does the problem size need to increase to maintain efficiency? n_c(p) is such that Eff(n_c(p), p) = c. By making a problem ridiculously large, one can typically achieve good efficiency. Problem: is that how the machine/code will be used? 63 / 235

69 Part IV: Algorithms on a Ring 64 / 235

70 Outline 12 Matrix-Vector Product: OpenMP; First MPI; Distributing Matrices; Second MPI; Third MPI; Mixed Parallelism. 13 Matrix Multiplication. 14 Stencil Application: Principle; Greedy; Reducing the Granularity. 15 LU Factorization: Gaussian Elimination; LU. 65 / 235

71 Parallel Matrix-Vector Product We want to compute y = A x. Let n be the size of the matrix.

    int a[n][n];
    int x[n], y[n];
    for i = 0 to n - 1 {
      y[i] = 0;
      for j = 0 to n - 1
        y[i] = y[i] + a[i,j] * x[j];
    }

How do we do this in parallel? (See Section 4.1 in the book.) [Figure: matrix a[n][n] and vectors x[n] and y[n].] 66 / 235

72 Parallel Matrix-Vector Product How do we do this in parallel? For example: the computations of the elements of vector y are independent, and each of these computations requires one row of matrix a and the whole vector x. In shared memory:

    #pragma omp parallel for private(i,j)
    for i = 0 to n - 1 {
      y[i] = 0;
      for j = 0 to n - 1
        y[i] = y[i] + a[i,j] * x[j];
    }

67 / 235

73 Parallel Matrix-Vector Product In distributed memory, one possibility is that each process has a full copy of matrix a and of vector x, and each processor declares a vector y of size n/p (we assume that p divides n). The code can then just be:

    load(a); load(x)
    p = NUM_PROCS(); r = MY_RANK();
    for (i = r*n/p; i < (r+1)*n/p; i++) {
      for (j = 0; j < n; j++)
        y[i - r*n/p] = y[i - r*n/p] + a[i][j] * x[j];
    }

It is embarrassingly parallel. What about the result? 68 / 235

74 What About the Result? After the processes complete the computation, each process has a piece of the result. One probably wants to, say, write the result to a file. This requires synchronization so that the I/O is done correctly. For example:

    if (r != 0) {
      recv(&token, 1);
    }
    open(file, "append");
    for (j = 0; j < n/p; j++)
      write(file, y[j]);
    send(&token, 1);
    close(file);
    barrier();   // optional

One could also use a gather so that the entire vector is returned to processor 0, provided vector y fits in the memory of a single node. 69 / 235

75 What if Matrix a Is Too Big? Matrix a may not fit in memory, which is one motivation for distributed-memory implementations. In this case, each processor can store only a piece of matrix a. For the matrix-vector multiply, each processor can just store n/p rows of the matrix. Conceptually the matrix is A[n][n], but the program declares a[n/p][n]. This raises the (annoying) issue of global indices versus local indices. 70 / 235

76 Global vs. Local Indices When an array is split among processes, a global index (I, J) references an element of the whole matrix, while a local index (i, j) references an element of the local array that stores a piece of the matrix. Translation between global and local indices: think of the algorithm in terms of global indices, but implement it in terms of local indices. [Figure: the n x n matrix split into p = 3 block rows of n/p rows, owned by P0, P1 and P2; global element A[5][3] is local element a[1][3] on process P1, and in general a[i][j] = A[(n/p)*rank + i][j].] 71 / 235

77 Global Index Computation Real-world parallel code often implements actual translation functions GlobalToLocal() and LocalToGlobal(). This may be a good idea in your code, although for the ring topology the computation is pretty easy and writing functions may be overkill. We will see more complex topologies with more complex associated data distributions, and then it is probably better to implement such functions. 72 / 235
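
A minimal sketch (ours) of what such translation functions could look like for the block-row distribution used here; the struct and the signatures are our own choices, and only row indices need translating since columns are not distributed.

    /* Block-row distribution: each process owns n/p contiguous rows
     * (we assume p divides n). */
    typedef struct { int rank; int local_row; } local_index;

    /* Global row I lives on process I / (n/p), at local row I mod (n/p). */
    local_index GlobalToLocal(int I, int n, int p) {
        local_index loc;
        int rows_per_proc = n / p;
        loc.rank      = I / rows_per_proc;
        loc.local_row = I % rows_per_proc;
        return loc;
    }

    /* Local row i on process `rank` is global row (n/p)*rank + i. */
    int LocalToGlobal(int rank, int i, int n, int p) {
        return (n / p) * rank + i;
    }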

78 Distributions of Arrays At this point we have: the 2-D array a, distributed; the 1-D array y, distributed; the 1-D array x, replicated. Having distributed arrays makes it possible to partition work among processes, but it makes the code more complex due to global/local index translations, and it may require synchronization to load/save the array elements from/to a file. 73 / 235

79 All Vectors Distributed? So far array x is replicated. It is usual to try to have all arrays involved in the same computation distributed in the same way: it makes it easier to read the code without constantly keeping track of what is distributed and what is not. E.g., if local indices for array y are different from the global ones while local indices for array x are the same as the global ones, this will lead to bugs. What one would like is for each process to have: n/p rows of matrix A in an array a[n/p][n]; n/p components of vector x in an array x[n/p]; n/p components of vector y in an array y[n/p]. It turns out there is an elegant solution to do this. 74 / 235

80-85 Principle of the Algorithm [Figures: an 8x8 matrix A and a vector x distributed over p = 4 processors (n = 8, n/p = 2); initially P_q holds rows 2q and 2q+1 of A and components x_{2q}, x_{2q+1}. Steps 0 to 3 show that at each step every processor multiplies the 2x2 block of its rows of A that matches the piece of x it currently holds, then the pieces of x are shifted along the ring; after p steps every y_i is complete and x is back to its initial distribution.] The final exchange of vector x is not strictly necessary, but one may want to have it distributed at the end of the computation as it was distributed at the beginning. 75-80 / 235

86 The Algorithm The algorithm uses two buffers: tempS for sending and tempR for receiving.

    float A[n/p][n], x[n/p], y[n/p];
    r = n/p
    tempS <- x   /* my piece of the vector (n/p elements) */
    for (step = 0; step < p; step++) {   /* p steps */
      SEND(tempS, r)
      RECV(tempR, r)
      for (i = 0; i < n/p; i++)
        for (j = 0; j < n/p; j++)
          y[i] <- y[i] + a[i, (rank - step mod p) * n/p + j] * tempS[j]
      tempS <-> tempR
    }

In our example, the process of rank 2 at step 3 would work with the 2x2 block starting at column ((2 - 3) mod 4) * 8/4 = 3 * 8/4 = 6. 81 / 235

87 A Few General Principles Large data needs to be distributed among processes (running on different nodes of a cluster, for instance), which causes many arithmetic expressions for index computation. People who do this for a living always end up writing local_to_global() and global_to_local() functions. Data may need to be loaded/written before/after the computation, which requires some type of synchronization among processes. It is typically a good idea to have all data structures distributed similarly, to avoid confusion about which indices are global and which are local; in our case, all indices are local. In the end the code looks much more complex than the equivalent OpenMP implementation. 82 / 235

88 Performance There are p identical steps. During each step each processor performs three activities: computation, receiving and sending. Computation: r² w, where w is the time to perform one += * operation. Receiving: L + r b. Sending: L + r b. Hence T(p) = p (r² w + 2L + 2 r b). 83 / 235

89 Asymptotic Performance T(p) = p (r² w + 2L + 2 r b). Speedup(p) = n² w / (p (r² w + 2L + 2 r b)) = n² w / (n² w / p + 2 p L + 2 n b). Eff(p) = n² w / (n² w + 2 p² L + 2 p n b). For p fixed, when n is large, Eff(p) tends to 1. Conclusion: the algorithm is asymptotically optimal. 84 / 235
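
The following C snippet (ours) evaluates this performance model; the values of n, w, L and b are invented placeholders, chosen only to show how efficiency behaves as p grows.

    #include <stdio.h>

    /* Ring matrix-vector product model: T(p) = p (r^2 w + 2L + 2 r b), r = n/p. */
    int main(void) {
        double n = 4096, w = 1e-9, L = 50e-6, b = 8e-9;   /* assumed values */
        double t_seq = n * n * w;
        for (int p = 1; p <= 64; p *= 2) {
            double r = n / p;
            double T = p * (r * r * w + 2 * L + 2 * r * b);
            printf("p=%2d  T=%8.5f s  speedup=%6.2f  efficiency=%.2f\n",
                   p, T, t_seq / T, t_seq / (T * p));
        }
        return 0;
    }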

90 Performance (2) Note that an algorithm that initially broadcasts the entire vector to all processors and then has every processor compute independently would run in time (p-1)(L + n b) + p r² w. One could use the pipelined broadcast, which has the same asymptotic performance, is a simpler algorithm, wastes only a tiny little bit of memory, and is arguably much less elegant. It is important to think of simple solutions and see what works best given the expected matrix size, etc. 85 / 235

91 Back to the Algorithm In the code of slide 86, at each iteration the SEND, the RECV, and the computation can all be done in parallel. Therefore, one can overlap communication and computation by using non-blocking SEND and RECV if available. MPI provides MPI_Isend() and MPI_Irecv() for this purpose. 86 / 235

92 More Concurrent Algorithm Using || to denote concurrent activities:

    float A[n/p][n], x[n/p], y[n/p];
    tempS <- x   /* my piece of the vector (n/p elements) */
    r = n/p
    for (step = 0; step < p; step++) {   /* p steps */
      SEND(tempS, r)
      || RECV(tempR, r)
      || for (i = 0; i < n/p; i++)
           for (j = 0; j < n/p; j++)
             y[i] <- y[i] + a[i, (rank - step mod p) * n/p + j] * tempS[j]
      tempS <-> tempR
    }

87 / 235

93 Better Performance There are p identical steps. During each step each processor performs three activities in parallel: computation, receiving and sending. Computation: r² w. Receiving: L + r b. Sending: L + r b. Hence T(p) = p max(r² w, L + r b). Same asymptotic performance as above, but better performance for smaller values of n. 88 / 235

94 Hybrid Parallelism We have said many times that multi-core architectures are about to become the standard. When building a cluster, the nodes you will buy will be multi-core. Question: how to exploit the multiple cores, or in our case how to exploit the multiple processors in each node? Option #1: run multiple processes per node. This causes more overhead and more communication; in fact it will cause network communication among processes within a node, since MPI will not know that the processes are co-located. 89 / 235

95 OpenMP + MPI Program Option #2: run a single multi-threaded process per node. Much lower overhead, fast communication within a node. This is done by combining MPI with OpenMP: just write your MPI program and add OpenMP pragmas around loops. Let us look back at our matrix-vector multiplication example. 90 / 235

96 Hybrid Parallelism

    float A[n/p][n], x[n/p], y[n/p];
    tempS <- x   /* my piece of the vector (n/p elements) */
    for (step = 0; step < p; step++) {   /* p steps */
      SEND(tempS, r) || RECV(tempR, r)
      #pragma omp parallel for private(i,j)
      for (i = 0; i < n/p; i++)
        for (j = 0; j < n/p; j++)
          y[i] <- y[i] + a[i, (rank - step mod p) * n/p + j] * tempS[j]
      tempS <-> tempR
    }

This is called hybrid parallelism: communication via the network among nodes, and communication via shared memory within nodes. 91 / 235

97 Outline 12 Matrix-Vector Product: OpenMP; First MPI; Distributing Matrices; Second MPI; Third MPI; Mixed Parallelism. 13 Matrix Multiplication. 14 Stencil Application: Principle; Greedy; Reducing the Granularity. 15 LU Factorization: Gaussian Elimination; LU. 92 / 235

98 Matrix Multiplication on the Ring See Section 4.2. It turns out one can do matrix multiplication in a way very similar to matrix-vector multiplication. A matrix multiplication is just the computation of n² scalar products, not just n. We have three matrices, A, B, and C, and we want to compute C = A*B. We distribute the matrices so that each processor owns a block row of each matrix. This is easy to do if row-major order is used, because all the matrix elements owned by a processor are then contiguous in memory. 93 / 235

99 Data Distribution [Figure: matrices A, B and C, each split into p block rows of r = n/p rows; each processor owns one block row of each matrix.] 94 / 235

100 First Step [Figure: p = 4, focusing on processor P_1, which owns block row (A_{1,0}, A_{1,1}, A_{1,2}, A_{1,3}) of A and block row (B_{1,0}, B_{1,1}, B_{1,2}, B_{1,3}) of B; at the first step it accumulates A_{1,1} x B_{1,0}, A_{1,1} x B_{1,1}, A_{1,1} x B_{1,2} and A_{1,1} x B_{1,3} into the four blocks of its row of C.] 95 / 235

101 Shifting of Block Rows of B [Figure: p = 4, processor P_q keeps its block row (A_{q,0}, A_{q,1}, A_{q,2}, A_{q,3}) of A while the block rows of B are shifted along the ring between steps.] 96 / 235


More information

NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1

NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1 NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1 Multi-processor vs. Multi-computer architecture µp vs. DSP RISC vs. DSP RISC Reduced-instruction-set Register-to-register operation Higher throughput by using

More information

Scheduling divisible loads with return messages on heterogeneous master-worker platforms

Scheduling divisible loads with return messages on heterogeneous master-worker platforms Scheduling divisible loads with return messages on heterogeneous master-worker platforms Olivier Beaumont, Loris Marchal, Yves Robert Laboratoire de l Informatique du Parallélisme École Normale Supérieure

More information

Parallelization of the QC-lib Quantum Computer Simulator Library

Parallelization of the QC-lib Quantum Computer Simulator Library Parallelization of the QC-lib Quantum Computer Simulator Library Ian Glendinning and Bernhard Ömer VCPC European Centre for Parallel Computing at Vienna Liechtensteinstraße 22, A-19 Vienna, Austria http://www.vcpc.univie.ac.at/qc/

More information

Review: From problem to parallel algorithm

Review: From problem to parallel algorithm Review: From problem to parallel algorithm Mathematical formulations of interesting problems abound Poisson s equation Sources: Electrostatics, gravity, fluid flow, image processing (!) Numerical solution:

More information

Time Synchronization

Time Synchronization Massachusetts Institute of Technology Lecture 7 6.895: Advanced Distributed Algorithms March 6, 2006 Professor Nancy Lynch Time Synchronization Readings: Fan, Lynch. Gradient clock synchronization Attiya,

More information

B629 project - StreamIt MPI Backend. Nilesh Mahajan

B629 project - StreamIt MPI Backend. Nilesh Mahajan B629 project - StreamIt MPI Backend Nilesh Mahajan March 26, 2013 Abstract StreamIt is a language based on the dataflow model of computation. StreamIt consists of computation units called filters connected

More information

Introduction to Parallel Programming in OpenMP Dr. Yogish Sabharwal Department of Computer Science & Engineering Indian Institute of Technology, Delhi

Introduction to Parallel Programming in OpenMP Dr. Yogish Sabharwal Department of Computer Science & Engineering Indian Institute of Technology, Delhi Introduction to Parallel Programming in OpenMP Dr. Yogish Sabharwal Department of Computer Science & Engineering Indian Institute of Technology, Delhi Lecture - 33 Parallel LU Factorization So, now, how

More information

PageRank: The Math-y Version (Or, What To Do When You Can t Tear Up Little Pieces of Paper)

PageRank: The Math-y Version (Or, What To Do When You Can t Tear Up Little Pieces of Paper) PageRank: The Math-y Version (Or, What To Do When You Can t Tear Up Little Pieces of Paper) In class, we saw this graph, with each node representing people who are following each other on Twitter: Our

More information

Convolutional Coding LECTURE Overview

Convolutional Coding LECTURE Overview MIT 6.02 DRAFT Lecture Notes Spring 2010 (Last update: March 6, 2010) Comments, questions or bug reports? Please contact 6.02-staff@mit.edu LECTURE 8 Convolutional Coding This lecture introduces a powerful

More information

Barrier. Overview: Synchronous Computations. Barriers. Counter-based or Linear Barriers

Barrier. Overview: Synchronous Computations. Barriers. Counter-based or Linear Barriers Overview: Synchronous Computations Barrier barriers: linear, tree-based and butterfly degrees of synchronization synchronous example : Jacobi Iterations serial and parallel code, performance analysis synchronous

More information

Model Order Reduction via Matlab Parallel Computing Toolbox. Istanbul Technical University

Model Order Reduction via Matlab Parallel Computing Toolbox. Istanbul Technical University Model Order Reduction via Matlab Parallel Computing Toolbox E. Fatih Yetkin & Hasan Dağ Istanbul Technical University Computational Science & Engineering Department September 21, 2009 E. Fatih Yetkin (Istanbul

More information

An Intuitive Introduction to Motivic Homotopy Theory Vladimir Voevodsky

An Intuitive Introduction to Motivic Homotopy Theory Vladimir Voevodsky What follows is Vladimir Voevodsky s snapshot of his Fields Medal work on motivic homotopy, plus a little philosophy and from my point of view the main fun of doing mathematics Voevodsky (2002). Voevodsky

More information

Matrix Assembly in FEA

Matrix Assembly in FEA Matrix Assembly in FEA 1 In Chapter 2, we spoke about how the global matrix equations are assembled in the finite element method. We now want to revisit that discussion and add some details. For example,

More information

MITOCW ocw-18_02-f07-lec17_220k

MITOCW ocw-18_02-f07-lec17_220k MITOCW ocw-18_02-f07-lec17_220k The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free.

More information

[Disclaimer: This is not a complete list of everything you need to know, just some of the topics that gave people difficulty.]

[Disclaimer: This is not a complete list of everything you need to know, just some of the topics that gave people difficulty.] Math 43 Review Notes [Disclaimer: This is not a complete list of everything you need to know, just some of the topics that gave people difficulty Dot Product If v (v, v, v 3 and w (w, w, w 3, then the

More information

Class Notes: Solving Simultaneous Linear Equations by Gaussian Elimination. Consider a set of simultaneous linear equations:

Class Notes: Solving Simultaneous Linear Equations by Gaussian Elimination. Consider a set of simultaneous linear equations: METR/OCN 465: Computer Programming with Applications in Meteorology and Oceanography Dr Dave Dempsey Dept of Geosciences, SFSU Class Notes: Solving Simultaneous Linear Equations by Gaussian Elimination

More information

Arnaud Legrand, CNRS, University of Grenoble. November 1, LIG laboratory, Theoretical Parallel Computing. A.

Arnaud Legrand, CNRS, University of Grenoble. November 1, LIG laboratory, Theoretical Parallel Computing. A. Theoretical Arnaud Legrand, CNRS, University of Grenoble LIG laboratory, arnaudlegrand@imagfr November 1, 2009 1 / 63 Outline Theoretical 1 2 3 2 / 63 Outline Theoretical 1 2 3 3 / 63 Models for Computation

More information

Lecture 5. 1 Review (Pairwise Independence and Derandomization)

Lecture 5. 1 Review (Pairwise Independence and Derandomization) 6.842 Randomness and Computation September 20, 2017 Lecture 5 Lecturer: Ronitt Rubinfeld Scribe: Tom Kolokotrones 1 Review (Pairwise Independence and Derandomization) As we discussed last time, we can

More information

Descriptive Statistics (And a little bit on rounding and significant digits)

Descriptive Statistics (And a little bit on rounding and significant digits) Descriptive Statistics (And a little bit on rounding and significant digits) Now that we know what our data look like, we d like to be able to describe it numerically. In other words, how can we represent

More information

High Performance Computing

High Performance Computing Master Degree Program in Computer Science and Networking, 2014-15 High Performance Computing 2 nd appello February 11, 2015 Write your name, surname, student identification number (numero di matricola),

More information

At the start of the term, we saw the following formula for computing the sum of the first n integers:

At the start of the term, we saw the following formula for computing the sum of the first n integers: Chapter 11 Induction This chapter covers mathematical induction. 11.1 Introduction to induction At the start of the term, we saw the following formula for computing the sum of the first n integers: Claim

More information

INF 4140: Models of Concurrency Series 3

INF 4140: Models of Concurrency Series 3 Universitetet i Oslo Institutt for Informatikk PMA Olaf Owe, Martin Steffen, Toktam Ramezani INF 4140: Models of Concurrency Høst 2016 Series 3 14. 9. 2016 Topic: Semaphores (Exercises with hints for solution)

More information

INF2270 Spring Philipp Häfliger. Lecture 8: Superscalar CPUs, Course Summary/Repetition (1/2)

INF2270 Spring Philipp Häfliger. Lecture 8: Superscalar CPUs, Course Summary/Repetition (1/2) INF2270 Spring 2010 Philipp Häfliger Summary/Repetition (1/2) content From Scalar to Superscalar Lecture Summary and Brief Repetition Binary numbers Boolean Algebra Combinational Logic Circuits Encoder/Decoder

More information

Introductory MPI June 2008

Introductory MPI June 2008 7: http://people.sc.fsu.edu/ jburkardt/presentations/ fdi 2008 lecture7.pdf... John Information Technology Department Virginia Tech... FDI Summer Track V: Parallel Programming 10-12 June 2008 MPI: Why,

More information

Thanks to: University of Illinois at Chicago NSF DMS Alfred P. Sloan Foundation

Thanks to: University of Illinois at Chicago NSF DMS Alfred P. Sloan Foundation Building circuits for integer factorization D. J. Bernstein Thanks to: University of Illinois at Chicago NSF DMS 0140542 Alfred P. Sloan Foundation I want to work for NSA as an independent contractor.

More information

Parallel Scientific Computing

Parallel Scientific Computing IV-1 Parallel Scientific Computing Matrix-vector multiplication. Matrix-matrix multiplication. Direct method for solving a linear equation. Gaussian Elimination. Iterative method for solving a linear equation.

More information

These are special traffic patterns that create more stress on a switch

These are special traffic patterns that create more stress on a switch Myths about Microbursts What are Microbursts? Microbursts are traffic patterns where traffic arrives in small bursts. While almost all network traffic is bursty to some extent, storage traffic usually

More information

Behavioral Simulations in MapReduce

Behavioral Simulations in MapReduce Behavioral Simulations in MapReduce Guozhang Wang, Marcos Vaz Salles, Benjamin Sowell, Xun Wang, Tuan Cao, Alan Demers, Johannes Gehrke, Walker White Cornell University 1 What are Behavioral Simulations?

More information

Lecture 6: Introducing Complexity

Lecture 6: Introducing Complexity COMP26120: Algorithms and Imperative Programming Lecture 6: Introducing Complexity Ian Pratt-Hartmann Room KB2.38: email: ipratt@cs.man.ac.uk 2015 16 You need this book: Make sure you use the up-to-date

More information

Dynamic resource sharing

Dynamic resource sharing J. Virtamo 38.34 Teletraffic Theory / Dynamic resource sharing and balanced fairness Dynamic resource sharing In previous lectures we have studied different notions of fair resource sharing. Our focus

More information

Solving PDEs with CUDA Jonathan Cohen

Solving PDEs with CUDA Jonathan Cohen Solving PDEs with CUDA Jonathan Cohen jocohen@nvidia.com NVIDIA Research PDEs (Partial Differential Equations) Big topic Some common strategies Focus on one type of PDE in this talk Poisson Equation Linear

More information

The Performance Evolution of the Parallel Ocean Program on the Cray X1

The Performance Evolution of the Parallel Ocean Program on the Cray X1 The Performance Evolution of the Parallel Ocean Program on the Cray X1 Patrick H. Worley Oak Ridge National Laboratory John Levesque Cray Inc. 46th Cray User Group Conference May 18, 2003 Knoxville Marriott

More information

EECS 358 Introduction to Parallel Computing Final Assignment

EECS 358 Introduction to Parallel Computing Final Assignment EECS 358 Introduction to Parallel Computing Final Assignment Jiangtao Gou Zhenyu Zhao March 19, 2013 1 Problem 1 1.1 Matrix-vector Multiplication on Hypercube and Torus As shown in slide 15.11, we assumed

More information

上海超级计算中心 Shanghai Supercomputer Center. Lei Xu Shanghai Supercomputer Center San Jose

上海超级计算中心 Shanghai Supercomputer Center. Lei Xu Shanghai Supercomputer Center San Jose 上海超级计算中心 Shanghai Supercomputer Center Lei Xu Shanghai Supercomputer Center 03/26/2014 @GTC, San Jose Overview Introduction Fundamentals of the FDTD method Implementation of 3D UPML-FDTD algorithm on GPU

More information

Improvements for Implicit Linear Equation Solvers

Improvements for Implicit Linear Equation Solvers Improvements for Implicit Linear Equation Solvers Roger Grimes, Bob Lucas, Clement Weisbecker Livermore Software Technology Corporation Abstract Solving large sparse linear systems of equations is often

More information

CSCI Final Project Report A Parallel Implementation of Viterbi s Decoding Algorithm

CSCI Final Project Report A Parallel Implementation of Viterbi s Decoding Algorithm CSCI 1760 - Final Project Report A Parallel Implementation of Viterbi s Decoding Algorithm Shay Mozes Brown University shay@cs.brown.edu Abstract. This report describes parallel Java implementations of

More information

Welcome to MCS 572. content and organization expectations of the course. definition and classification

Welcome to MCS 572. content and organization expectations of the course. definition and classification Welcome to MCS 572 1 About the Course content and organization expectations of the course 2 Supercomputing definition and classification 3 Measuring Performance speedup and efficiency Amdahl s Law Gustafson

More information

Massively parallel semi-lagrangian solution of the 6d Vlasov-Poisson problem

Massively parallel semi-lagrangian solution of the 6d Vlasov-Poisson problem Massively parallel semi-lagrangian solution of the 6d Vlasov-Poisson problem Katharina Kormann 1 Klaus Reuter 2 Markus Rampp 2 Eric Sonnendrücker 1 1 Max Planck Institut für Plasmaphysik 2 Max Planck Computing

More information

Designing Information Devices and Systems I Fall 2018 Lecture Notes Note Positioning Sytems: Trilateration and Correlation

Designing Information Devices and Systems I Fall 2018 Lecture Notes Note Positioning Sytems: Trilateration and Correlation EECS 6A Designing Information Devices and Systems I Fall 08 Lecture Notes Note. Positioning Sytems: Trilateration and Correlation In this note, we ll introduce two concepts that are critical in our positioning

More information

Math 350: An exploration of HMMs through doodles.

Math 350: An exploration of HMMs through doodles. Math 350: An exploration of HMMs through doodles. Joshua Little (407673) 19 December 2012 1 Background 1.1 Hidden Markov models. Markov chains (MCs) work well for modelling discrete-time processes, or

More information

Outline. EECS Components and Design Techniques for Digital Systems. Lec 18 Error Coding. In the real world. Our beautiful digital world.

Outline. EECS Components and Design Techniques for Digital Systems. Lec 18 Error Coding. In the real world. Our beautiful digital world. Outline EECS 150 - Components and esign Techniques for igital Systems Lec 18 Error Coding Errors and error models Parity and Hamming Codes (SECE) Errors in Communications LFSRs Cyclic Redundancy Check

More information

Slope Fields: Graphing Solutions Without the Solutions

Slope Fields: Graphing Solutions Without the Solutions 8 Slope Fields: Graphing Solutions Without the Solutions Up to now, our efforts have been directed mainly towards finding formulas or equations describing solutions to given differential equations. Then,

More information

Notes for CS542G (Iterative Solvers for Linear Systems)

Notes for CS542G (Iterative Solvers for Linear Systems) Notes for CS542G (Iterative Solvers for Linear Systems) Robert Bridson November 20, 2007 1 The Basics We re now looking at efficient ways to solve the linear system of equations Ax = b where in this course,

More information

CS475: Linear Equations Gaussian Elimination LU Decomposition Wim Bohm Colorado State University

CS475: Linear Equations Gaussian Elimination LU Decomposition Wim Bohm Colorado State University CS475: Linear Equations Gaussian Elimination LU Decomposition Wim Bohm Colorado State University Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution

More information

CRYSTAL in parallel: replicated and distributed (MPP) data. Why parallel?

CRYSTAL in parallel: replicated and distributed (MPP) data. Why parallel? CRYSTAL in parallel: replicated and distributed (MPP) data Roberto Orlando Dipartimento di Chimica Università di Torino Via Pietro Giuria 5, 10125 Torino (Italy) roberto.orlando@unito.it 1 Why parallel?

More information

Parallelization of the QC-lib Quantum Computer Simulator Library

Parallelization of the QC-lib Quantum Computer Simulator Library Parallelization of the QC-lib Quantum Computer Simulator Library Ian Glendinning and Bernhard Ömer September 9, 23 PPAM 23 1 Ian Glendinning / September 9, 23 Outline Introduction Quantum Bits, Registers

More information

Efficient algorithms for symmetric tensor contractions

Efficient algorithms for symmetric tensor contractions Efficient algorithms for symmetric tensor contractions Edgar Solomonik 1 Department of EECS, UC Berkeley Oct 22, 2013 1 / 42 Edgar Solomonik Symmetric tensor contractions 1/ 42 Motivation The goal is to

More information

FPGA Implementation of a Predictive Controller

FPGA Implementation of a Predictive Controller FPGA Implementation of a Predictive Controller SIAM Conference on Optimization 2011, Darmstadt, Germany Minisymposium on embedded optimization Juan L. Jerez, George A. Constantinides and Eric C. Kerrigan

More information

The Gram-Schmidt Process

The Gram-Schmidt Process The Gram-Schmidt Process How and Why it Works This is intended as a complement to 5.4 in our textbook. I assume you have read that section, so I will not repeat the definitions it gives. Our goal is to

More information

Communication Engineering Prof. Surendra Prasad Department of Electrical Engineering Indian Institute of Technology, Delhi

Communication Engineering Prof. Surendra Prasad Department of Electrical Engineering Indian Institute of Technology, Delhi Communication Engineering Prof. Surendra Prasad Department of Electrical Engineering Indian Institute of Technology, Delhi Lecture - 41 Pulse Code Modulation (PCM) So, if you remember we have been talking

More information

Qualitative vs Quantitative metrics

Qualitative vs Quantitative metrics Qualitative vs Quantitative metrics Quantitative: hard numbers, measurable Time, Energy, Space Signal-to-Noise, Frames-per-second, Memory Usage Money (?) Qualitative: feelings, opinions Complexity: Simple,

More information

Cosets and Lagrange s theorem

Cosets and Lagrange s theorem Cosets and Lagrange s theorem These are notes on cosets and Lagrange s theorem some of which may already have been lecturer. There are some questions for you included in the text. You should write the

More information

CS 700: Quantitative Methods & Experimental Design in Computer Science

CS 700: Quantitative Methods & Experimental Design in Computer Science CS 700: Quantitative Methods & Experimental Design in Computer Science Sanjeev Setia Dept of Computer Science George Mason University Logistics Grade: 35% project, 25% Homework assignments 20% midterm,

More information

Building a Multi-FPGA Virtualized Restricted Boltzmann Machine Architecture Using Embedded MPI

Building a Multi-FPGA Virtualized Restricted Boltzmann Machine Architecture Using Embedded MPI Building a Multi-FPGA Virtualized Restricted Boltzmann Machine Architecture Using Embedded MPI Charles Lo and Paul Chow {locharl1, pc}@eecg.toronto.edu Department of Electrical and Computer Engineering

More information

Introduction to Algebra: The First Week

Introduction to Algebra: The First Week Introduction to Algebra: The First Week Background: According to the thermostat on the wall, the temperature in the classroom right now is 72 degrees Fahrenheit. I want to write to my friend in Europe,

More information

Parallelization of the Dirac operator. Pushan Majumdar. Indian Association for the Cultivation of Sciences, Jadavpur, Kolkata

Parallelization of the Dirac operator. Pushan Majumdar. Indian Association for the Cultivation of Sciences, Jadavpur, Kolkata Parallelization of the Dirac operator Pushan Majumdar Indian Association for the Cultivation of Sciences, Jadavpur, Kolkata Outline Introduction Algorithms Parallelization Comparison of performances Conclusions

More information

Compiling Techniques

Compiling Techniques Lecture 11: Introduction to 13 November 2015 Table of contents 1 Introduction Overview The Backend The Big Picture 2 Code Shape Overview Introduction Overview The Backend The Big Picture Source code FrontEnd

More information

1.20 Formulas, Equations, Expressions and Identities

1.20 Formulas, Equations, Expressions and Identities 1.0 Formulas, Equations, Expressions and Identities Collecting terms is equivalent to noting that 4 + 4 + 4 + 4 + 4 + 4 can be written as 6 4; i.e., that multiplication is repeated addition. It s wise

More information

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Matroids and Greedy Algorithms Date: 10/31/16

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Matroids and Greedy Algorithms Date: 10/31/16 60.433/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Matroids and Greedy Algorithms Date: 0/3/6 6. Introduction We talked a lot the last lecture about greedy algorithms. While both Prim

More information

7.2 Linear equation systems. 7.3 Linear least square fit

7.2 Linear equation systems. 7.3 Linear least square fit 72 Linear equation systems In the following sections, we will spend some time to solve linear systems of equations This is a tool that will come in handy in many di erent places during this course For

More information

Parallel Program Performance Analysis

Parallel Program Performance Analysis Parallel Program Performance Analysis Chris Kauffman CS 499: Spring 2016 GMU Logistics Today Final details of HW2 interviews HW2 timings HW2 Questions Parallel Performance Theory Special Office Hours Mon

More information

Logical Time. 1. Introduction 2. Clock and Events 3. Logical (Lamport) Clocks 4. Vector Clocks 5. Efficient Implementation

Logical Time. 1. Introduction 2. Clock and Events 3. Logical (Lamport) Clocks 4. Vector Clocks 5. Efficient Implementation Logical Time Nicola Dragoni Embedded Systems Engineering DTU Compute 1. Introduction 2. Clock and Events 3. Logical (Lamport) Clocks 4. Vector Clocks 5. Efficient Implementation 2013 ACM Turing Award:

More information

Speculative Parallelism in Cilk++

Speculative Parallelism in Cilk++ Speculative Parallelism in Cilk++ Ruben Perez & Gregory Malecha MIT May 11, 2010 Ruben Perez & Gregory Malecha (MIT) Speculative Parallelism in Cilk++ May 11, 2010 1 / 33 Parallelizing Embarrassingly Parallel

More information

Burst Scheduling Based on Time-slotting and Fragmentation in WDM Optical Burst Switched Networks

Burst Scheduling Based on Time-slotting and Fragmentation in WDM Optical Burst Switched Networks Burst Scheduling Based on Time-slotting and Fragmentation in WDM Optical Burst Switched Networks G. Mohan, M. Ashish, and K. Akash Department of Electrical and Computer Engineering National University

More information

Computational complexity and some Graph Theory

Computational complexity and some Graph Theory Graph Theory Lars Hellström January 22, 2014 Contents of todays lecture An important concern when choosing the algorithm to use for something (after basic requirements such as correctness and stability)

More information

Q: How can quantum computers break ecryption?

Q: How can quantum computers break ecryption? Q: How can quantum computers break ecryption? Posted on February 21, 2011 by The Physicist Physicist: What follows is the famous Shor algorithm, which can break any RSA encryption key. The problem: RSA,

More information

CS425: Algorithms for Web Scale Data

CS425: Algorithms for Web Scale Data CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original slides can be accessed at: www.mmds.org Challenges

More information

TIME DEPENDENCE OF SHELL MODEL CALCULATIONS 1. INTRODUCTION

TIME DEPENDENCE OF SHELL MODEL CALCULATIONS 1. INTRODUCTION Mathematical and Computational Applications, Vol. 11, No. 1, pp. 41-49, 2006. Association for Scientific Research TIME DEPENDENCE OF SHELL MODEL CALCULATIONS Süleyman Demirel University, Isparta, Turkey,

More information

Cyclops Tensor Framework

Cyclops Tensor Framework Cyclops Tensor Framework Edgar Solomonik Department of EECS, Computer Science Division, UC Berkeley March 17, 2014 1 / 29 Edgar Solomonik Cyclops Tensor Framework 1/ 29 Definition of a tensor A rank r

More information

VLSI System Design Part V : High-Level Synthesis(2) Oct Feb.2007

VLSI System Design Part V : High-Level Synthesis(2) Oct Feb.2007 VLSI System Design Part V : High-Level Synthesis(2) Oct.2006 - Feb.2007 Lecturer : Tsuyoshi Isshiki Dept. Communications and Integrated Systems, Tokyo Institute of Technology isshiki@vlsi.ss.titech.ac.jp

More information