Parallel Algorithms. Arnaud Legrand, CNRS, University of Grenoble. LIG laboratory, October 2.

1 Parallel Algorithms. Arnaud Legrand, CNRS, University of Grenoble. LIG laboratory, October 2. 1 / 235

2 Outline Part I: Models. Part II: Communications on a Ring. Part III: Speedup and Efficiency. Part IV: Algorithms on a Ring. Part V: Algorithm on a Heterogeneous Ring. Part VI: Algorithms on a Grid. 2 / 235

3 Part I: Models 3 / 235

4 Motivation Scientific computing: large needs in computation or storage resources. Need to use systems with several processors: parallel computers with shared/distributed memory, clusters, heterogeneous clusters, clusters of clusters of workstations, the Grid, desktop grids. When modeling a platform, communication modeling seems to be the most controversial part. Two kinds of people produce communication models: those who are concerned with scheduling and those who are concerned with performance evaluation. All these models are imperfect and intractable. 4 / 235

5 Outline 1 Point-to-Point Communication Models: Hockney; LogP and Friends; TCP. 2 Modeling Concurrency: Multi-port; Single-port (Pure and Full Duplex); Flows. 3 Remind This is a Model, Hence Imperfect. 4 Topology: A Few Examples; Virtual Topologies. 5 / 235

6 UET-UCT Unit Execution Time - Unit Communication Time. Hem... This one is mainly used by scheduling theoreticians to prove that their problem is hard and to know whether there is some hope of proving some clever result or not. Some people have introduced a model with a cost of ε for local communications and 1 for communications with the outside. 6 / 235

7 Hockney Model Hockney [Hoc94] proposed the following model for performance evaluation of the Paragon. A message of size m from P_i to P_j requires: t_{i,j}(m) = L_{i,j} + m/b_{i,j}. In scheduling, there are three types of corresponding models: (1) communications are not splittable and each communication k is associated with a communication time t_k (accounting for message size, latency, bandwidth, middleware, ...); (2) communications are splittable but latency is considered negligible (linear divisible model): t_{i,j}(m) = m/b_{i,j}; (3) communications are splittable and latency cannot be neglected (affine divisible model): t_{i,j}(m) = L_{i,j} + m/b_{i,j}. 7 / 235
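
To make the model concrete, here is a small C sketch (ours, not from the slides) that evaluates the Hockney transfer time; the latency and bandwidth values are invented placeholders.

    #include <stdio.h>

    /* Hockney model: time to send an m-byte message from P_i to P_j,
     * with latency L_ij (seconds) and bandwidth B_ij (bytes/s). */
    double hockney_time(double m, double L_ij, double B_ij) {
        return L_ij + m / B_ij;
    }

    int main(void) {
        double L = 50e-6;   /* 50 us latency (assumed)          */
        double B = 125e6;   /* ~1 Gb/s, i.e. 125 MB/s (assumed) */
        for (double m = 1e3; m <= 1e7; m *= 10)
            printf("m = %8.0f bytes -> t = %g s\n", m, hockney_time(m, L, B));
        return 0;
    }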

8-9 LogP The LogP model [CKP + 96] is defined by 4 parameters: L is the network latency; o is the middleware overhead (message splitting and packing, buffer management, connection, ...) for a message of size w; g is the gap (the minimum time between two packets) between two messages of size w; P is the number of processors/modules. [Figure: timeline of the overheads o, gaps g and latency L on the sender, the two network cards and the receiver.] Sending m bytes with packets of size w takes 2o + L + (ceil(m/w) - 1) max(o, g). Occupation on the sender and on the receiver: o + L + (ceil(m/w) - 1) max(o, g). 8 / 235
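
The following C sketch (ours) evaluates the LogP send time as reconstructed above; the values of L, o, g and w are invented placeholders.

    #include <math.h>
    #include <stdio.h>

    /* LogP estimate of the time to send m bytes as packets of size w. */
    double logp_send_time(double m, double w, double L, double o, double g) {
        double packets = ceil(m / w);
        double step = (o > g) ? o : g;          /* max(o, g) */
        return 2.0 * o + L + (packets - 1.0) * step;
    }

    int main(void) {
        double L = 5e-6, o = 2e-6, g = 3e-6, w = 1024;   /* assumed values */
        printf("1 KiB: %g s\n", logp_send_time(1024.0, w, L, o, g));
        printf("1 MiB: %g s\n", logp_send_time(1048576.0, w, L, o, g));
        return 0;
    }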

10 LogGP & plogp Parallel P2P Communication Hockney LogP and Friends TCP Modeling Concurency Multi-port Single-port (Pure and Full Duplex) Flows Imperfection Topology A Few Examples Virtual Topologies The previous model works fine for short messages. However, many parallel machines have special support for long messages, hence a higher bandwidth. LogGP [AISS97] is an extension of LogP: G captures the bandwidth for long messages: short messages 2o + L + m w max(o, g) long messages 2o + L + (m 1)G There is no fundamental difference... OK, it works for small and large messages. Does it work for averagesize messages? plogp [KBV00] is an extension of LogP when L, o and g depends on the message size m. They also have introduced a distinction between o s and o r. This is more and more precise but concurency is still not taken into account. 9 / 235

11-12 Bandwidth as a Function of Message Size With the Hockney model, the transfer time is m -> L + m/B, hence an effective bandwidth of m/(L + m/B). [Figure: measured bandwidth (Mbit/s) as a function of message size (1 KB to 16 MB), MPICH over TCP with Gigabit Ethernet, with and without MPICH optimization.] 10 / 235

13 What About TCP-based Networks? The previous models work fine for parallel machines, but most networks use TCP, which has a fancy flow-control mechanism and slow start. Is it valid to use an affine model for such networks? The answer seems to be yes, but the latency and bandwidth parameters have to be carefully measured [LQDB05]. Probing for m = 1 B and m = 1 MB leads to bad results. The whole middleware layer should be benchmarked (the theoretical latency is useless because of middleware; the theoretical bandwidth is useless because of middleware and latency). The slow start does not seem to be too harmful. Most people forget that the round-trip time has a huge impact on the bandwidth. 11 / 235

14 Outline 1 Point-to-Point Communication Models: Hockney; LogP and Friends; TCP. 2 Modeling Concurrency: Multi-port; Single-port (Pure and Full Duplex); Flows. 3 Remind This is a Model, Hence Imperfect. 4 Topology: A Few Examples; Virtual Topologies. 12 / 235

15 Multi-ports A given processor can communicate with as many other processors as it wishes without any degradation. This model is widely used by scheduling theoreticians (think of all the DAG-with-communications scheduling problems) to prove that their problem is hard and to know whether there is some hope of proving some clever result or not. This model is borderline, especially when allowing duplication, when one communicates with everybody at the same time, or when trying to design algorithms with super-tight approximation ratios. Frankly, such a model is totally unrealistic. Using MPI and synchronous communications, it may not be an issue. However, with multi-core, multi-processor machines, it cannot be ignored... [Figure: hosts A, B and C exchanging data under the multi-port model.] 13 / 235

16 Bounded Multi-port Assume now that we have threads or multi-core processors. We can bound the sum of the throughputs of all communications (incoming and outgoing). Such a model is OK for wide-area communications [HP04]. Remember, the bounds due to the round-trip time must not be forgotten! [Figure: hosts A, B and C under the bounded multi-port (β) model, each flow getting β/2 (numbers in Mb/s).] 14 / 235

17 Single-port (Pure) A process can communicate with only one other process at a time. This constraint is generally written as a constraint on the sum of communication times and is thus rather easy to use in a scheduling context (even though it makes problems more complex). This model makes sense when using non-threaded versions of communication libraries (e.g., MPI). As soon as you are allowed to use threads, bounded multi-port seems a more reasonable option (both for performance and scheduling complexity). [Figure: hosts A, B and C under the pure 1-port model, each of the three transfers getting 1/3 of the time.] 15 / 235

18 Single-port (Full-Duplex) At a given time, a process can be engaged in at most one emission and one reception. This constraint is generally written as two constraints: one on the sum of incoming communication times and one on the sum of outgoing communication times. [Figure: hosts A, B and C under the full-duplex 1-port model, each transfer getting 1/2 (numbers in Mb/s).] 16 / 235

19 Single-port (Full-Duplex) This model somehow makes sense when using networks like Myrinet that have few multiplexing units, with protocols without flow control [Mar07]. Even if it does not model complex situations well, such a model is not harmful. 17 / 235

20-21 Fluid Modeling When using TCP-based networks, it is generally reasonable to use flows to model bandwidth sharing [MR99, Low03]. The rates ϱ_r of the requests r ∈ R must satisfy the capacity constraints: for every link l ∈ L, the sum of ϱ_r over the requests r going through l is at most c_l. Classical objectives: Income maximization: maximize Σ_{r∈R} ϱ_r. Max-min fairness: maximize min_{r∈R} ϱ_r (ATM). Proportional fairness: maximize Σ_{r∈R} log(ϱ_r) (TCP Vegas). Potential delay minimization: minimize Σ_{r∈R} 1/ϱ_r. Some weird function: minimize Σ_{r∈R} arctan(ϱ_r) (TCP Reno). 18 / 235

22 Flows Extensions Note that this model is a multi-port model with capacity constraints (like the bounded multi-port above). When latencies are large, using multiple connections makes it possible to get more bandwidth. As a matter of fact, there is very little to lose in using multiple connections... Therefore many people enforce a sometimes artificial (but less intrusive) bound on the maximum number of connections per link [Wag05, MYCR06]. 19 / 235

23 Outline 1 Point-to-Point Communication Models: Hockney; LogP and Friends; TCP. 2 Modeling Concurrency: Multi-port; Single-port (Pure and Full Duplex); Flows. 3 Remind This is a Model, Hence Imperfect. 4 Topology: A Few Examples; Virtual Topologies. 20 / 235

24 Remind This is a Model, Hence Imperfect The previous sharing models are nice, but you generally do not know the other flows... Communications use the memory bus and hence interfere with computations. Taking such interferences into account may become more and more important with multi-core architectures. Interferences between communications are sometimes... surprising. Modeling is an art. You have to know your platform and your application to know what is negligible and what is important. Even if your model is imperfect, you may still derive interesting results. 21 / 235

25 Outline 1 Point-to-Point Communication Models: Hockney; LogP and Friends; TCP. 2 Modeling Concurrency: Multi-port; Single-port (Pure and Full Duplex); Flows. 3 Remind This is a Model, Hence Imperfect. 4 Topology: A Few Examples; Virtual Topologies. 22 / 235

26 Various Topologies Used in the Literature [Figure: examples of interconnection topologies.] 23 / 235

27 Beyond MPI_Comm_rank()? So far, MPI gives us a unique number for each processor. With this one can do anything, but it is pretty inconvenient precisely because one can do anything with it. Typically, one likes to impose constraints about which processor/process can talk to which other processor/process. With such constraints, one can then think of the algorithm in simpler terms: there are fewer options for communications between processors, so there are fewer choices when implementing an algorithm. 24 / 235

28 Virtual Topologies? MPI provides an abstraction over physical computers: each host has an IP address; MPI hides this address behind a convenient number; there can be multiple such numbers mapped to the same IP address; all numbers can talk to each other. A virtual topology provides an abstraction over MPI: each process has a number, which may be different from the MPI number, and there are rules about which numbers a number can talk to. A virtual topology is defined by specifying the neighbors of each process. 25 / 235

29-31 Implementing a Virtual Topology [Figure: a binary tree over MPI ranks 0 to 7; the node of rank r at level i and position j is labeled (i,j), from (0,0) at the root down to (3,0).] (i,j) = (floor(log2(rank+1)), rank - 2^i + 1) and rank = 2^i + j - 1. my_parent(i,j) = (i-1, floor(j/2)); my_left_child(i,j) = (i+1, j*2), if any; my_right_child(i,j) = (i+1, j*2+1), if any. One then communicates with, e.g., MPI_Send(..., rank(my_parent(i,j)), ...) and MPI_Recv(..., rank(my_left_child(i,j)), ...). 26-28 / 235
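
A minimal C sketch (ours, not from the slides) of these rank/coordinate translations for the binary-tree virtual topology; the helper names and the struct are our own.

    #include <math.h>
    #include <stdio.h>

    typedef struct { int i, j; } coord;   /* (level, position within level) */

    coord rank_to_coord(int rank) {
        coord c;
        c.i = (int)floor(log2((double)(rank + 1)));
        c.j = rank - (1 << c.i) + 1;
        return c;
    }

    int coord_to_rank(coord c) { return (1 << c.i) + c.j - 1; }

    int parent_rank(int rank) {
        if (rank == 0) return -1;                 /* the root has no parent */
        coord c = rank_to_coord(rank);
        coord p = { c.i - 1, c.j / 2 };
        return coord_to_rank(p);
    }

    /* Children of (i,j) are (i+1, 2j) and (i+1, 2j+1), i.e. ranks 2r+1 and 2r+2. */
    int left_child_rank(int rank, int nprocs)  { int r = 2*rank + 1; return r < nprocs ? r : -1; }
    int right_child_rank(int rank, int nprocs) { int r = 2*rank + 2; return r < nprocs ? r : -1; }

    int main(void) {
        for (int rank = 0; rank < 8; rank++) {
            coord c = rank_to_coord(rank);
            printf("rank %d -> (%d,%d)  parent %d  children %d/%d\n", rank, c.i, c.j,
                   parent_rank(rank), left_child_rank(rank, 8), right_child_rank(rank, 8));
        }
        return 0;
    }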

32 Typical Topologies Common topologies (see Section 3.1.2): linear array, ring, 2-D grid, 2-D torus, one-level tree, fully connected graph, arbitrary graph. Two options for all topologies: monodirectional links (more constrained but simpler) or bidirectional links (less constrained but potentially more complicated). By complicated we typically mean more bug-prone. We will look at the ring and the grid in detail. 29 / 235

33 Main Assumption and Big Question The main assumption is that once we have defined the virtual topology, we forget it is virtual and write parallel algorithms assuming it is physical: we assume that communications on different (virtual) links do not interfere with each other, and that computations on different (virtual) processors do not interfere with each other. The big question: how well do these assumptions hold? The question is mostly about the network. There are two possible bad cases. Case #1: the assumptions do not hold and there are interferences. We will most likely achieve bad performance, our performance models will be broken, and reasoning about performance improvements will be difficult. Case #2: the assumptions do hold, but we leave a lot of the network resources unutilized. We could perhaps do better with another virtual topology. 30 / 235

34 Which Virtual Topology to Pick We will see that some topologies are really well suited to certain algorithms. The question is whether they are well suited to the underlying architecture. The goal is to strike a good compromise: not too bad given the algorithm, not too bad given the platform. Fortunately, many platforms these days use switches, which naturally support many virtual topologies because they allow concurrent communications between disjoint pairs of processors. As part of a programming assignment, you will explore whether some virtual topology makes sense on our cluster. 31 / 235

35 Topologies and Data Distribution One of the common steps when writing a parallel algorithm is to distribute some data (array, data structure, etc.) among the processors in the topology. Typically, one distributes the data in a way that matches the topology; e.g., if the data is 3-D, then it is nice to have a 3-D virtual topology. One question that arises then is: how is the data distributed across the topology? In the next set of slides we look at our first topology: a ring. 32 / 235

36 Part II: Communications on a Ring 33 / 235

37 Outline 5 Assumptions 6 Broadcast 7 Scatter 8 All-to-All 9 Broadcast: Going Faster 34 / 235

38 Ring Topology (Section 3.3) The p processors P_0, P_1, ..., P_{p-1} are arranged in a ring. Each processor is identified by a rank, MY_NUM(). There is a way to find the total number of processors, NUM_PROCS(). Each processor can send a message to its successor, SEND(addr, L), and receive a message from its predecessor, RECV(addr, L). We will just use the above pseudo-code rather than MPI. Note that this is much simpler than the example tree topology we saw in the previous set of slides. 35 / 235
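
For reference, a minimal sketch (ours, not from the slides) of how these primitives could be mapped onto MPI; the tag, the use of MPI_COMM_WORLD and the blocking semantics are our choices.

    #include <mpi.h>

    static int MY_NUM(void)    { int r; MPI_Comm_rank(MPI_COMM_WORLD, &r); return r; }
    static int NUM_PROCS(void) { int p; MPI_Comm_size(MPI_COMM_WORLD, &p); return p; }

    /* Send `len` bytes to the successor on the ring. */
    static void SEND(void *addr, int len) {
        MPI_Send(addr, len, MPI_BYTE, (MY_NUM() + 1) % NUM_PROCS(), 0, MPI_COMM_WORLD);
    }

    /* Receive `len` bytes from the predecessor on the ring. */
    static void RECV(void *addr, int len) {
        MPI_Recv(addr, len, MPI_BYTE, (MY_NUM() - 1 + NUM_PROCS()) % NUM_PROCS(),
                 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }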

39 Virtual vs. Physical Topology Now that we have chosen to consider a ring topology, we pretend our physical topology is a ring. We can always implement a virtual ring topology (see the previous set of slides, and read Section 4.6), so we can write many ring algorithms. It may be that another virtual topology is better suited to our physical topology, but the ring topology makes for very simple programs and is known to be reasonably good in practice. So it is a good candidate for our first look at parallel algorithms. 36 / 235

40 Cost of Communication It is actually difficult to precisely model the cost of communication; e.g., MPI implementations perform various optimizations depending on the message size. We will be using a simple model: Time = L + m b, where L is the start-up cost or latency, b = 1/B is the inverse of the bandwidth B, and m is the message size. We assume that if a message of length m is sent from P_0 to P_q, then the communication cost is q(L + m b). There are many assumptions in our model, some not very realistic, but we will discuss them later. 37 / 235

41 Assumptions about Communications Several options: (1) both SEND() and RECV() are blocking, called rendez-vous; very old-fashioned systems. (2) RECV() is blocking, but SEND() is not; pretty standard; MPI supports it. (3) Both RECV() and SEND() are non-blocking; pretty standard as well; MPI supports it. 38 / 235

42 Assumptions about Concurrency One important question is: can the processor do multiple things at the same time? Typically we will assume that the processor can send, receive, and compute at the same time (call MPI_Irecv(), call MPI_Isend(), compute something). This of course implies that the three operations are independent: e.g., you do not want to send the result of the computation, and you do not want to send what you are receiving (forwarding). When writing parallel algorithms (in pseudo-code), we will simply indicate concurrent activities with a || sign. 39 / 235

43 Collective Communications To write a parallel algorithm, we will need collective operations: broadcasts, etc. Now, MPI provides those, and they likely do not use the ring logical topology and do utilize the physical resources well. Let us still go through the exercise of writing some collective communication algorithms. We will see that for some algorithms we really want to do these communications by hand on our virtual topology rather than using the MPI collectives. 40 / 235

44 Outline 5 Assumptions 6 Broadcast 7 Scatter 8 All-to-All 9 Broadcast: Going Faster 41 / 235

45 Broadcast (Section 3.3.1) We want to write a program that has P_k send the same message of length m to all other processors: Broadcast(k, addr, m). On the ring, we just send to the next processor, and so on, with no parallel communications whatsoever. This is of course not the way one should implement a broadcast in practice if the physical topology is not merely a ring: MPI uses some type of tree topology. 42 / 235

46 Broadcast (Section 3.3.1)

    Broadcast(k, addr, m)
      q = MY_NUM()
      p = NUM_PROCS()
      if (q == k)
        SEND(addr, m)
      else if (q == k - 1 mod p)
        RECV(addr, m)
      else
        RECV(addr, m)
        SEND(addr, m)
      endif

This assumes a blocking receive; sending may be non-blocking. The broadcast time is (p-1)(L + m b). 43 / 235
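
For concreteness, here is a minimal MPI version of this ring broadcast (our own sketch, reusing the SEND/RECV mapping shown earlier; error handling omitted).

    #include <mpi.h>

    /* Processor k sends the m bytes at addr around the ring. */
    void ring_broadcast(int k, void *addr, int m) {
        int q, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &q);
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        int succ = (q + 1) % p, pred = (q - 1 + p) % p;

        if (q == k) {                          /* the root only sends              */
            MPI_Send(addr, m, MPI_BYTE, succ, 0, MPI_COMM_WORLD);
        } else if (q == (k - 1 + p) % p) {     /* the last processor only receives */
            MPI_Recv(addr, m, MPI_BYTE, pred, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {                               /* everybody else forwards          */
            MPI_Recv(addr, m, MPI_BYTE, pred, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(addr, m, MPI_BYTE, succ, 0, MPI_COMM_WORLD);
        }
    }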

47 Outline 5 Assumptions 6 Broadcast 7 Scatter 8 All-to-All 9 Broadcast: Going Faster 44 / 235

48 Scatter (Section 3.3.2) Processor k sends a different message to all other processors (and to itself). P_k stores the message destined to P_q at address addr[q], including a message at addr[k]. At the end of the execution, each processor holds the message it has received in msg. The principle is just to pipeline communications by starting with the message destined to P_{k-1}, the most distant processor. 45 / 235

49 Scatter (Section 3.3.2)

    Scatter(k, msg, addr, m)
      q = MY_NUM()
      p = NUM_PROCS()
      if (q == k)
        for i = 0 to p - 2
          SEND(addr[k + p - 1 - i mod p], m)
        msg <- addr[k]
      else
        RECV(tempR, m)
        for i = 1 to k - 1 - q mod p
          tempS <-> tempR
          SEND(tempS, m) || RECV(tempR, m)
        msg <- tempR

Same execution time as the broadcast: (p-1)(L + m b). The send buffer and receive buffer (pointers) are swapped at each step, and sending and receiving happen in parallel, with a non-blocking send. 46 / 235

50 Scatter (Section 3.3.2) Example with k = 2 and p = 4. Processor q=2 (the root) sends addr[1], then addr[0], then addr[3], and keeps msg = addr[2]. Processor q=3 receives addr[1], loops (2-1-3) mod 4 = 2 times (it forwards addr[1] while receiving addr[0], then forwards addr[0] while receiving addr[3]), and keeps msg = addr[3]. Processor q=0 receives addr[1], loops once (forwards addr[1] while receiving addr[0]), and keeps msg = addr[0]. Processor q=1 loops zero times: it receives addr[1] and keeps msg = addr[1]. 47 / 235

51 Outline 5 Assumptions 6 Broadcast 7 Scatter 8 All-to-All 9 Broadcast: Going Faster 48 / 235

52 All-to-All (Section 3.3.3)

    All2All(my_addr, addr, m)
      q = MY_NUM()
      p = NUM_PROCS()
      addr[q] <- my_addr
      for i = 1 to p - 1
        SEND(addr[q - i + 1 mod p], m) || RECV(addr[q - i mod p], m)

Same execution time as the scatter: (p-1)(L + m b). 49 / 235
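
A minimal MPI sketch (ours) of this all-to-all exchange; MPI_Sendrecv plays the role of the concurrent SEND || RECV, and the blocks are stored contiguously in a buffer of p*m bytes.

    #include <mpi.h>
    #include <string.h>

    /* Every process contributes the m bytes at my_addr; on return,
     * addr + r*m holds the block of rank r, for every r. */
    void ring_all2all(const void *my_addr, char *addr, int m) {
        int q, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &q);
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        int succ = (q + 1) % p, pred = (q - 1 + p) % p;

        memcpy(addr + q * m, my_addr, m);              /* addr[q] <- my_addr */
        for (int i = 1; i <= p - 1; i++) {
            int s = ((q - i + 1) % p + p) % p;         /* block to forward   */
            int r = ((q - i) % p + p) % p;             /* block to receive   */
            MPI_Sendrecv(addr + s * m, m, MPI_BYTE, succ, 0,
                         addr + r * m, m, MPI_BYTE, pred, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }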

53 Outline 5 Assumptions 6 Broadcast 7 Scatter 8 All-to-All 9 Broadcast: Going Faster 50 / 235

54 A Faster Broadcast? How can we improve performance? One can cut the message into many small pieces, say r pieces (where m is divisible by r). The root processor then just sends r messages. The performance is as follows. Consider the last processor to get the last piece of the message: p-1 steps are needed for the first piece to arrive, which takes (p-1)(L + m b / r); then the remaining r-1 pieces arrive one after another, which takes (r-1)(L + m b / r). The total is (p + r - 2)(L + m b / r). 51 / 235

55 A Faster Broadcast? The question is: what value of r minimizes (p + r - 2)(L + m b / r)? One can view this expression as (c + a r)(d + b/r), with four constants a, b, c, d. The non-constant part of the expression is then a d r + c b / r, which must be minimized. It is known that this is minimized for r = sqrt(c b / (a d)), so r_opt = sqrt(m (p-2) b / L), with the optimal time (sqrt((p-2) L) + sqrt(m b))², which tends to m b when m is large, independently of p! 52 / 235
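
A small C sketch (ours) that evaluates this model and compares the plain ring broadcast with the pipelined one at r = r_opt; all numeric parameter values are invented.

    #include <math.h>
    #include <stdio.h>

    /* Pipelined ring broadcast cost: (p + r - 2)(L + m*b/r). */
    double pipelined_time(double p, double r, double L, double m, double b) {
        return (p + r - 2.0) * (L + m * b / r);
    }

    int main(void) {
        double p = 64, L = 50e-6, b = 1.0 / 125e6, m = 64e6;  /* assumed values */
        double r_opt = sqrt(m * (p - 2) * b / L);
        printf("plain broadcast     : %g s\n", (p - 1) * (L + m * b));
        printf("pipelined, r ~ %.0f : %g s\n", r_opt, pipelined_time(p, r_opt, L, m, b));
        printf("lower bound m*b     : %g s\n", m * b);
        return 0;
    }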

56 Well-known Principle We have seen that if we cut a (large) message into many (small) messages, then we can send the message over multiple hops (in our case p-1) almost as fast as we can send it over a single hop. This is a fundamental principle of IP networks: we cut messages into IP frames and send them over many routers, yet really go as fast as the slowest router. 53 / 235

57 Part III: Speedup and Efficiency 54 / 235

58 Outline 10 Speedup 11 Amdahl's Law 55 / 235

59 Speedup We need a metric to quantify the impact of a performance enhancement. Speedup: ratio of old time to new time. If the old time is 2h and the new time is 1h, the speedup is 2h / 1h = 2. Sometimes one talks about a slowdown, in case the enhancement is not beneficial. Happens more often than one thinks. 56 / 235

60 Performance The notion of speedup is completely generic: by using a rice cooker I have achieved a 1.20 speedup for rice cooking. For parallel programs one defines the parallel speedup (we will just say speedup): the parallel program takes time T_1 on 1 processor and time T_p on p processors, and the parallel speedup is S(p) = T_1 / T_p. In the ideal case, if my sequential program takes 2 hours on 1 processor, it takes 1 hour on 2 processors: this is called linear speedup. 57 / 235

61-63 Speedup [Figure: speedup as a function of the number of processors, with superlinear, linear and sub-linear curves.] Superlinear speedup? There are several possible causes. Algorithms: with optimization problems, throwing many processors at the problem increases the chance that one of them will get lucky and find the optimum fast. Hardware: with many processors, it is possible that the entire application data resides in cache (vs. RAM) or in RAM (vs. disk). 58 / 235

64 Outline 10 Speedup 11 Amdahl's Law 59 / 235

65 Bad News: Amdahl's Law Consider a program whose execution consists of two phases: (1) one sequential phase, T_seq = (1 - f) T_1; (2) one phase that can be perfectly parallelized (linear speedup), T_par = f T_1. Therefore T_p = T_seq + T_par / p = (1 - f) T_1 + f T_1 / p. Amdahl's Law: S_p = 1 / (1 - f + f/p). [Figure: speedup as a function of the number of processors for f = 10%, 20%, 50% and 80%.] 60 / 235
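
As a quick numerical illustration (ours), the following C program tabulates Amdahl's bound for the parallel fractions plotted on the slide.

    #include <stdio.h>

    /* Amdahl's law: S_p = 1 / ((1 - f) + f/p), f = parallelizable fraction. */
    double amdahl(double f, double p) { return 1.0 / ((1.0 - f) + f / p); }

    int main(void) {
        double fracs[] = { 0.10, 0.20, 0.50, 0.80 };
        int    procs[] = { 2, 8, 64, 1024 };
        for (int i = 0; i < 4; i++) {
            for (int j = 0; j < 4; j++)
                printf("f=%2.0f%% p=%4d S=%5.2f   ",
                       100 * fracs[i], procs[j], amdahl(fracs[i], procs[j]));
            printf("\n");
        }
        return 0;
    }

Even with f = 80% of the time perfectly parallelized, the speedup can never exceed 1/(1-f) = 5, however many processors are used.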

66 Lessons from Amdahl's Law It is a law of diminishing returns. If a significant fraction of the code (in terms of time spent in it) is not parallelizable, then parallelization is not going to be good. It sounds obvious, but people new to high performance computing often forget how bad Amdahl's law can be. Luckily, many applications can be almost entirely parallelized, so the non-parallelizable fraction 1 - f is small. 61 / 235

67 Efficiency Efficiency is defined as Eff(p) = S(p)/p. It is typically < 1, unless the speedup is linear or superlinear. It is used to measure how well the processors are utilized. If increasing the number of processors by a factor of 10 increases the speedup only by a factor of 2, perhaps it is not worth it: efficiency drops by a factor of 5. This is important when purchasing a parallel machine, for instance: if, due to the application's behavior, efficiency is low, forget buying a large cluster. 62 / 235

68 Scalability A measure of the effort needed to maintain efficiency while adding processors. Efficiency also depends on the problem size: Eff(n, p). Isoefficiency: at which rate does the problem size need to increase to maintain efficiency? n_c(p) is such that Eff(n_c(p), p) = c. By making a problem ridiculously large, one can typically achieve good efficiency. Problem: is that how the machine/code will be used? 63 / 235

69 Part IV: Algorithms on a Ring 64 / 235

70 Outline 12 Matrix-Vector Product: OpenMP; First MPI; Distributing Matrices; Second MPI; Third MPI; Mixed Parallelism. 13 Matrix Multiplication. 14 Stencil Application: Principle; Greedy; Reducing the Granularity. 15 LU Factorization: Gaussian Elimination; LU. 65 / 235

71 Parallel Matrix-Vector Product We want to compute y = A x. Let n be the size of the matrix.

    int a[n][n];
    int x[n], y[n];
    for i = 0 to n - 1 {
      y[i] = 0;
      for j = 0 to n - 1
        y[i] = y[i] + a[i,j] * x[j];
    }

How do we do this in parallel? (See Section 4.1 in the book.) [Figure: matrix a[n][n] and vectors x[n] and y[n].] 66 / 235

72 Parallel Matrix-Vector Product How do we do this in parallel? For example: the computations of the elements of vector y are independent, and each of these computations requires one row of matrix a and the whole vector x. In shared memory:

    #pragma omp parallel for private(i,j)
    for i = 0 to n - 1 {
      y[i] = 0;
      for j = 0 to n - 1
        y[i] = y[i] + a[i,j] * x[j];
    }

67 / 235

73 Parallel Matrix-Vector Product In distributed memory, one possibility is that each process has a full copy of matrix a and of vector x, and each processor declares a vector y of size n/p (we assume that p divides n). The code can then just be:

    load(a); load(x)
    p = NUM_PROCS(); r = MY_RANK();
    for (i = r*n/p; i < (r+1)*n/p; i++) {
      for (j = 0; j < n; j++)
        y[i - r*n/p] = y[i - r*n/p] + a[i][j] * x[j];
    }

It is embarrassingly parallel. What about the result? 68 / 235

74 What About the Result? After the processes complete the computation, each process has a piece of the result. One probably wants to, say, write the result to a file. This requires synchronization so that the I/O is done correctly. For example:

    if (r != 0) {
      recv(&token, 1);
    }
    open(file, "append");
    for (j = 0; j < n/p; j++)
      write(file, y[j]);
    send(&token, 1);
    close(file);
    barrier();   // optional

One could also use a gather so that the entire vector is returned to processor 0, provided vector y fits in the memory of a single node. 69 / 235

75 What if Matrix a Is Too Big? Matrix a may not fit in memory, which is one motivation for distributed-memory implementations. In this case, each processor can store only a piece of matrix a. For the matrix-vector multiply, each processor can just store n/p rows of the matrix. Conceptually the matrix is A[n][n], but the program declares a[n/p][n]. This raises the (annoying) issue of global indices versus local indices. 70 / 235

76 Global vs. Local Indices When an array is split among processes, a global index (I, J) references an element of the whole matrix, while a local index (i, j) references an element of the local array that stores a piece of the matrix. Translation between global and local indices: think of the algorithm in terms of global indices, but implement it in terms of local indices. [Figure: the n x n matrix split into p = 3 block rows of n/p rows, owned by P0, P1 and P2; global element A[5][3] is local element a[1][3] on process P1, and in general a[i][j] = A[(n/p)*rank + i][j].] 71 / 235

77 Global Index Computation Real-world parallel code often implements actual translation functions GlobalToLocal() and LocalToGlobal(). This may be a good idea in your code, although for the ring topology the computation is pretty easy and writing functions may be overkill. We will see more complex topologies with more complex associated data distributions, and then it is probably better to implement such functions. 72 / 235
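
A minimal sketch (ours) of what such translation functions could look like for the block-row distribution used here; the struct and the signatures are our own choices, and only row indices need translating since columns are not distributed.

    /* Block-row distribution: each process owns n/p contiguous rows
     * (we assume p divides n). */
    typedef struct { int rank; int local_row; } local_index;

    /* Global row I lives on process I / (n/p), at local row I mod (n/p). */
    local_index GlobalToLocal(int I, int n, int p) {
        local_index loc;
        int rows_per_proc = n / p;
        loc.rank      = I / rows_per_proc;
        loc.local_row = I % rows_per_proc;
        return loc;
    }

    /* Local row i on process `rank` is global row (n/p)*rank + i. */
    int LocalToGlobal(int rank, int i, int n, int p) {
        return (n / p) * rank + i;
    }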

78 Distributions of Arrays At this point we have: the 2-D array a, distributed; the 1-D array y, distributed; the 1-D array x, replicated. Having distributed arrays makes it possible to partition work among processes, but it makes the code more complex due to global/local index translations, and it may require synchronization to load/save the array elements from/to a file. 73 / 235

79 All Vectors Distributed? So far array x is replicated. It is usual to try to have all arrays involved in the same computation distributed in the same way: it makes it easier to read the code without constantly keeping track of what is distributed and what is not. E.g., if local indices for array y are different from the global ones while local indices for array x are the same as the global ones, this will lead to bugs. What one would like is for each process to have: n/p rows of matrix A in an array a[n/p][n]; n/p components of vector x in an array x[n/p]; n/p components of vector y in an array y[n/p]. It turns out there is an elegant solution to do this. 74 / 235

80-85 Principle of the Algorithm [Figures: an 8x8 matrix A and a vector x distributed over p = 4 processors (n = 8, n/p = 2); initially P_q holds rows 2q and 2q+1 of A and components x_{2q}, x_{2q+1}. Steps 0 to 3 show that at each step every processor multiplies the 2x2 block of its rows of A that matches the piece of x it currently holds, then the pieces of x are shifted along the ring; after p steps every y_i is complete and x is back to its initial distribution.] The final exchange of vector x is not strictly necessary, but one may want to have it distributed at the end of the computation as it was distributed at the beginning. 75-80 / 235

86 The Algorithm The algorithm uses two buffers: tempS for sending and tempR for receiving.

    float A[n/p][n], x[n/p], y[n/p];
    r = n/p
    tempS <- x   /* my piece of the vector (n/p elements) */
    for (step = 0; step < p; step++) {   /* p steps */
      SEND(tempS, r)
      RECV(tempR, r)
      for (i = 0; i < n/p; i++)
        for (j = 0; j < n/p; j++)
          y[i] <- y[i] + a[i, (rank - step mod p) * n/p + j] * tempS[j]
      tempS <-> tempR
    }

In our example, the process of rank 2 at step 3 would work with the 2x2 block starting at column ((2 - 3) mod 4) * 8/4 = 3 * 8/4 = 6. 81 / 235

87 A Few General Principles Large data needs to be distributed among processes (running on different nodes of a cluster, for instance), which causes many arithmetic expressions for index computation. People who do this for a living always end up writing local_to_global() and global_to_local() functions. Data may need to be loaded/written before/after the computation, which requires some type of synchronization among processes. It is typically a good idea to have all data structures distributed similarly, to avoid confusion about which indices are global and which are local; in our case, all indices are local. In the end the code looks much more complex than the equivalent OpenMP implementation. 82 / 235

88 Performance There are p identical steps. During each step each processor performs three activities: computation, receiving and sending. Computation: r² w, where w is the time to perform one += * operation. Receiving: L + r b. Sending: L + r b. Hence T(p) = p (r² w + 2L + 2 r b). 83 / 235

89 Asymptotic Performance T(p) = p (r² w + 2L + 2 r b). Speedup(p) = n² w / (p (r² w + 2L + 2 r b)) = n² w / (n² w / p + 2 p L + 2 n b). Eff(p) = n² w / (n² w + 2 p² L + 2 p n b). For p fixed, when n is large, Eff(p) tends to 1. Conclusion: the algorithm is asymptotically optimal. 84 / 235
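
The following C snippet (ours) evaluates this performance model; the values of n, w, L and b are invented placeholders, chosen only to show how efficiency behaves as p grows.

    #include <stdio.h>

    /* Ring matrix-vector product model: T(p) = p (r^2 w + 2L + 2 r b), r = n/p. */
    int main(void) {
        double n = 4096, w = 1e-9, L = 50e-6, b = 8e-9;   /* assumed values */
        double t_seq = n * n * w;
        for (int p = 1; p <= 64; p *= 2) {
            double r = n / p;
            double T = p * (r * r * w + 2 * L + 2 * r * b);
            printf("p=%2d  T=%8.5f s  speedup=%6.2f  efficiency=%.2f\n",
                   p, T, t_seq / T, t_seq / (T * p));
        }
        return 0;
    }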

90 Performance (2) Note that an algorithm that initially broadcasts the entire vector to all processors and then has every processor compute independently would run in time (p-1)(L + n b) + p r² w. One could use the pipelined broadcast, which has the same asymptotic performance, is a simpler algorithm, wastes only a tiny little bit of memory, and is arguably much less elegant. It is important to think of simple solutions and see what works best given the expected matrix size, etc. 85 / 235

91 Back to the Algorithm In the code of slide 86, at each iteration the SEND, the RECV, and the computation can all be done in parallel. Therefore, one can overlap communication and computation by using non-blocking SEND and RECV if available. MPI provides MPI_Isend() and MPI_Irecv() for this purpose. 86 / 235

92 More Concurrent Algorithm Using || to denote concurrent activities:

    float A[n/p][n], x[n/p], y[n/p];
    tempS <- x   /* my piece of the vector (n/p elements) */
    r = n/p
    for (step = 0; step < p; step++) {   /* p steps */
      SEND(tempS, r)
      || RECV(tempR, r)
      || for (i = 0; i < n/p; i++)
           for (j = 0; j < n/p; j++)
             y[i] <- y[i] + a[i, (rank - step mod p) * n/p + j] * tempS[j]
      tempS <-> tempR
    }

87 / 235

93 Better Performance There are p identical steps. During each step each processor performs three activities in parallel: computation, receiving and sending. Computation: r² w. Receiving: L + r b. Sending: L + r b. Hence T(p) = p max(r² w, L + r b). Same asymptotic performance as above, but better performance for smaller values of n. 88 / 235

94 Hybrid Parallelism We have said many times that multi-core architectures are about to become the standard. When building a cluster, the nodes you will buy will be multi-core. Question: how to exploit the multiple cores, or in our case how to exploit the multiple processors in each node? Option #1: run multiple processes per node. This causes more overhead and more communication; in fact it will cause network communication among processes within a node, since MPI will not know that the processes are co-located. 89 / 235

95 OpenMP + MPI Program Option #2: run a single multi-threaded process per node. Much lower overhead, fast communication within a node. This is done by combining MPI with OpenMP: just write your MPI program and add OpenMP pragmas around loops. Let us look back at our matrix-vector multiplication example. 90 / 235

96 Hybrid Parallelism

    float A[n/p][n], x[n/p], y[n/p];
    tempS <- x   /* my piece of the vector (n/p elements) */
    for (step = 0; step < p; step++) {   /* p steps */
      SEND(tempS, r) || RECV(tempR, r)
      #pragma omp parallel for private(i,j)
      for (i = 0; i < n/p; i++)
        for (j = 0; j < n/p; j++)
          y[i] <- y[i] + a[i, (rank - step mod p) * n/p + j] * tempS[j]
      tempS <-> tempR
    }

This is called hybrid parallelism: communication via the network among nodes, and communication via shared memory within nodes. 91 / 235

97 Outline 12 Matrix-Vector Product: OpenMP; First MPI; Distributing Matrices; Second MPI; Third MPI; Mixed Parallelism. 13 Matrix Multiplication. 14 Stencil Application: Principle; Greedy; Reducing the Granularity. 15 LU Factorization: Gaussian Elimination; LU. 92 / 235

98 Matrix Multiplication on the Ring See Section 4.2. It turns out one can do matrix multiplication in a way very similar to matrix-vector multiplication. A matrix multiplication is just the computation of n² scalar products, not just n. We have three matrices, A, B, and C, and we want to compute C = A*B. We distribute the matrices so that each processor owns a block row of each matrix. This is easy to do if row-major order is used, because all the matrix elements owned by a processor are then contiguous in memory. 93 / 235

99 Data Distribution [Figure: matrices A, B and C, each split into p block rows of r = n/p rows; each processor owns one block row of each matrix.] 94 / 235

100 First Step [Figure: p = 4, focusing on processor P_1, which owns block row (A_{1,0}, A_{1,1}, A_{1,2}, A_{1,3}) of A and block row (B_{1,0}, B_{1,1}, B_{1,2}, B_{1,3}) of B; at the first step it accumulates A_{1,1} x B_{1,0}, A_{1,1} x B_{1,1}, A_{1,1} x B_{1,2} and A_{1,1} x B_{1,3} into the four blocks of its row of C.] 95 / 235

101 Shifting of Block Rows of B [Figure: p = 4, processor P_q keeps its block row (A_{q,0}, A_{q,1}, A_{q,2}, A_{q,3}) of A while the block rows of B are shifted along the ring between steps.] 96 / 235


More information

NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1

NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1 NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1 Multi-processor vs. Multi-computer architecture µp vs. DSP RISC vs. DSP RISC Reduced-instruction-set Register-to-register operation Higher throughput by using

More information

Scheduling divisible loads with return messages on heterogeneous master-worker platforms

Scheduling divisible loads with return messages on heterogeneous master-worker platforms Scheduling divisible loads with return messages on heterogeneous master-worker platforms Olivier Beaumont, Loris Marchal, Yves Robert Laboratoire de l Informatique du Parallélisme École Normale Supérieure

More information

Parallelization of the QC-lib Quantum Computer Simulator Library

Parallelization of the QC-lib Quantum Computer Simulator Library Parallelization of the QC-lib Quantum Computer Simulator Library Ian Glendinning and Bernhard Ömer VCPC European Centre for Parallel Computing at Vienna Liechtensteinstraße 22, A-19 Vienna, Austria http://www.vcpc.univie.ac.at/qc/

More information

Review: From problem to parallel algorithm

Review: From problem to parallel algorithm Review: From problem to parallel algorithm Mathematical formulations of interesting problems abound Poisson s equation Sources: Electrostatics, gravity, fluid flow, image processing (!) Numerical solution:

More information

Time Synchronization

Time Synchronization Massachusetts Institute of Technology Lecture 7 6.895: Advanced Distributed Algorithms March 6, 2006 Professor Nancy Lynch Time Synchronization Readings: Fan, Lynch. Gradient clock synchronization Attiya,

More information

B629 project - StreamIt MPI Backend. Nilesh Mahajan

B629 project - StreamIt MPI Backend. Nilesh Mahajan B629 project - StreamIt MPI Backend Nilesh Mahajan March 26, 2013 Abstract StreamIt is a language based on the dataflow model of computation. StreamIt consists of computation units called filters connected

More information

Introduction to Parallel Programming in OpenMP Dr. Yogish Sabharwal Department of Computer Science & Engineering Indian Institute of Technology, Delhi

Introduction to Parallel Programming in OpenMP Dr. Yogish Sabharwal Department of Computer Science & Engineering Indian Institute of Technology, Delhi Introduction to Parallel Programming in OpenMP Dr. Yogish Sabharwal Department of Computer Science & Engineering Indian Institute of Technology, Delhi Lecture - 33 Parallel LU Factorization So, now, how

More information

PageRank: The Math-y Version (Or, What To Do When You Can t Tear Up Little Pieces of Paper)

PageRank: The Math-y Version (Or, What To Do When You Can t Tear Up Little Pieces of Paper) PageRank: The Math-y Version (Or, What To Do When You Can t Tear Up Little Pieces of Paper) In class, we saw this graph, with each node representing people who are following each other on Twitter: Our

More information

Convolutional Coding LECTURE Overview

Convolutional Coding LECTURE Overview MIT 6.02 DRAFT Lecture Notes Spring 2010 (Last update: March 6, 2010) Comments, questions or bug reports? Please contact 6.02-staff@mit.edu LECTURE 8 Convolutional Coding This lecture introduces a powerful

More information

Barrier. Overview: Synchronous Computations. Barriers. Counter-based or Linear Barriers

Barrier. Overview: Synchronous Computations. Barriers. Counter-based or Linear Barriers Overview: Synchronous Computations Barrier barriers: linear, tree-based and butterfly degrees of synchronization synchronous example : Jacobi Iterations serial and parallel code, performance analysis synchronous

More information

Model Order Reduction via Matlab Parallel Computing Toolbox. Istanbul Technical University

Model Order Reduction via Matlab Parallel Computing Toolbox. Istanbul Technical University Model Order Reduction via Matlab Parallel Computing Toolbox E. Fatih Yetkin & Hasan Dağ Istanbul Technical University Computational Science & Engineering Department September 21, 2009 E. Fatih Yetkin (Istanbul

More information

An Intuitive Introduction to Motivic Homotopy Theory Vladimir Voevodsky

An Intuitive Introduction to Motivic Homotopy Theory Vladimir Voevodsky What follows is Vladimir Voevodsky s snapshot of his Fields Medal work on motivic homotopy, plus a little philosophy and from my point of view the main fun of doing mathematics Voevodsky (2002). Voevodsky

More information

Matrix Assembly in FEA

Matrix Assembly in FEA Matrix Assembly in FEA 1 In Chapter 2, we spoke about how the global matrix equations are assembled in the finite element method. We now want to revisit that discussion and add some details. For example,

More information

MITOCW ocw-18_02-f07-lec17_220k

MITOCW ocw-18_02-f07-lec17_220k MITOCW ocw-18_02-f07-lec17_220k The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free.

More information

[Disclaimer: This is not a complete list of everything you need to know, just some of the topics that gave people difficulty.]

[Disclaimer: This is not a complete list of everything you need to know, just some of the topics that gave people difficulty.] Math 43 Review Notes [Disclaimer: This is not a complete list of everything you need to know, just some of the topics that gave people difficulty Dot Product If v (v, v, v 3 and w (w, w, w 3, then the

More information

Class Notes: Solving Simultaneous Linear Equations by Gaussian Elimination. Consider a set of simultaneous linear equations:

Class Notes: Solving Simultaneous Linear Equations by Gaussian Elimination. Consider a set of simultaneous linear equations: METR/OCN 465: Computer Programming with Applications in Meteorology and Oceanography Dr Dave Dempsey Dept of Geosciences, SFSU Class Notes: Solving Simultaneous Linear Equations by Gaussian Elimination

More information

Arnaud Legrand, CNRS, University of Grenoble. November 1, LIG laboratory, Theoretical Parallel Computing. A.

Arnaud Legrand, CNRS, University of Grenoble. November 1, LIG laboratory, Theoretical Parallel Computing. A. Theoretical Arnaud Legrand, CNRS, University of Grenoble LIG laboratory, arnaudlegrand@imagfr November 1, 2009 1 / 63 Outline Theoretical 1 2 3 2 / 63 Outline Theoretical 1 2 3 3 / 63 Models for Computation

More information

Lecture 5. 1 Review (Pairwise Independence and Derandomization)

Lecture 5. 1 Review (Pairwise Independence and Derandomization) 6.842 Randomness and Computation September 20, 2017 Lecture 5 Lecturer: Ronitt Rubinfeld Scribe: Tom Kolokotrones 1 Review (Pairwise Independence and Derandomization) As we discussed last time, we can

More information

Descriptive Statistics (And a little bit on rounding and significant digits)

Descriptive Statistics (And a little bit on rounding and significant digits) Descriptive Statistics (And a little bit on rounding and significant digits) Now that we know what our data look like, we d like to be able to describe it numerically. In other words, how can we represent

More information

High Performance Computing

High Performance Computing Master Degree Program in Computer Science and Networking, 2014-15 High Performance Computing 2 nd appello February 11, 2015 Write your name, surname, student identification number (numero di matricola),

More information

At the start of the term, we saw the following formula for computing the sum of the first n integers:

At the start of the term, we saw the following formula for computing the sum of the first n integers: Chapter 11 Induction This chapter covers mathematical induction. 11.1 Introduction to induction At the start of the term, we saw the following formula for computing the sum of the first n integers: Claim

More information

INF 4140: Models of Concurrency Series 3

INF 4140: Models of Concurrency Series 3 Universitetet i Oslo Institutt for Informatikk PMA Olaf Owe, Martin Steffen, Toktam Ramezani INF 4140: Models of Concurrency Høst 2016 Series 3 14. 9. 2016 Topic: Semaphores (Exercises with hints for solution)

More information

INF2270 Spring Philipp Häfliger. Lecture 8: Superscalar CPUs, Course Summary/Repetition (1/2)

INF2270 Spring Philipp Häfliger. Lecture 8: Superscalar CPUs, Course Summary/Repetition (1/2) INF2270 Spring 2010 Philipp Häfliger Summary/Repetition (1/2) content From Scalar to Superscalar Lecture Summary and Brief Repetition Binary numbers Boolean Algebra Combinational Logic Circuits Encoder/Decoder

More information

Introductory MPI June 2008

Introductory MPI June 2008 7: http://people.sc.fsu.edu/ jburkardt/presentations/ fdi 2008 lecture7.pdf... John Information Technology Department Virginia Tech... FDI Summer Track V: Parallel Programming 10-12 June 2008 MPI: Why,

More information

Thanks to: University of Illinois at Chicago NSF DMS Alfred P. Sloan Foundation

Thanks to: University of Illinois at Chicago NSF DMS Alfred P. Sloan Foundation Building circuits for integer factorization D. J. Bernstein Thanks to: University of Illinois at Chicago NSF DMS 0140542 Alfred P. Sloan Foundation I want to work for NSA as an independent contractor.

More information

Parallel Scientific Computing

Parallel Scientific Computing IV-1 Parallel Scientific Computing Matrix-vector multiplication. Matrix-matrix multiplication. Direct method for solving a linear equation. Gaussian Elimination. Iterative method for solving a linear equation.

More information

These are special traffic patterns that create more stress on a switch

These are special traffic patterns that create more stress on a switch Myths about Microbursts What are Microbursts? Microbursts are traffic patterns where traffic arrives in small bursts. While almost all network traffic is bursty to some extent, storage traffic usually

More information

Behavioral Simulations in MapReduce

Behavioral Simulations in MapReduce Behavioral Simulations in MapReduce Guozhang Wang, Marcos Vaz Salles, Benjamin Sowell, Xun Wang, Tuan Cao, Alan Demers, Johannes Gehrke, Walker White Cornell University 1 What are Behavioral Simulations?

More information

Lecture 6: Introducing Complexity

Lecture 6: Introducing Complexity COMP26120: Algorithms and Imperative Programming Lecture 6: Introducing Complexity Ian Pratt-Hartmann Room KB2.38: email: ipratt@cs.man.ac.uk 2015 16 You need this book: Make sure you use the up-to-date

More information

Dynamic resource sharing

Dynamic resource sharing J. Virtamo 38.34 Teletraffic Theory / Dynamic resource sharing and balanced fairness Dynamic resource sharing In previous lectures we have studied different notions of fair resource sharing. Our focus

More information

Solving PDEs with CUDA Jonathan Cohen

Solving PDEs with CUDA Jonathan Cohen Solving PDEs with CUDA Jonathan Cohen jocohen@nvidia.com NVIDIA Research PDEs (Partial Differential Equations) Big topic Some common strategies Focus on one type of PDE in this talk Poisson Equation Linear

More information

The Performance Evolution of the Parallel Ocean Program on the Cray X1

The Performance Evolution of the Parallel Ocean Program on the Cray X1 The Performance Evolution of the Parallel Ocean Program on the Cray X1 Patrick H. Worley Oak Ridge National Laboratory John Levesque Cray Inc. 46th Cray User Group Conference May 18, 2003 Knoxville Marriott

More information

EECS 358 Introduction to Parallel Computing Final Assignment

EECS 358 Introduction to Parallel Computing Final Assignment EECS 358 Introduction to Parallel Computing Final Assignment Jiangtao Gou Zhenyu Zhao March 19, 2013 1 Problem 1 1.1 Matrix-vector Multiplication on Hypercube and Torus As shown in slide 15.11, we assumed

More information

上海超级计算中心 Shanghai Supercomputer Center. Lei Xu Shanghai Supercomputer Center San Jose

上海超级计算中心 Shanghai Supercomputer Center. Lei Xu Shanghai Supercomputer Center San Jose 上海超级计算中心 Shanghai Supercomputer Center Lei Xu Shanghai Supercomputer Center 03/26/2014 @GTC, San Jose Overview Introduction Fundamentals of the FDTD method Implementation of 3D UPML-FDTD algorithm on GPU

More information

Improvements for Implicit Linear Equation Solvers

Improvements for Implicit Linear Equation Solvers Improvements for Implicit Linear Equation Solvers Roger Grimes, Bob Lucas, Clement Weisbecker Livermore Software Technology Corporation Abstract Solving large sparse linear systems of equations is often

More information

CSCI Final Project Report A Parallel Implementation of Viterbi s Decoding Algorithm

CSCI Final Project Report A Parallel Implementation of Viterbi s Decoding Algorithm CSCI 1760 - Final Project Report A Parallel Implementation of Viterbi s Decoding Algorithm Shay Mozes Brown University shay@cs.brown.edu Abstract. This report describes parallel Java implementations of

More information

Welcome to MCS 572. content and organization expectations of the course. definition and classification

Welcome to MCS 572. content and organization expectations of the course. definition and classification Welcome to MCS 572 1 About the Course content and organization expectations of the course 2 Supercomputing definition and classification 3 Measuring Performance speedup and efficiency Amdahl s Law Gustafson

More information

Massively parallel semi-lagrangian solution of the 6d Vlasov-Poisson problem

Massively parallel semi-lagrangian solution of the 6d Vlasov-Poisson problem Massively parallel semi-lagrangian solution of the 6d Vlasov-Poisson problem Katharina Kormann 1 Klaus Reuter 2 Markus Rampp 2 Eric Sonnendrücker 1 1 Max Planck Institut für Plasmaphysik 2 Max Planck Computing

More information

Designing Information Devices and Systems I Fall 2018 Lecture Notes Note Positioning Sytems: Trilateration and Correlation

Designing Information Devices and Systems I Fall 2018 Lecture Notes Note Positioning Sytems: Trilateration and Correlation EECS 6A Designing Information Devices and Systems I Fall 08 Lecture Notes Note. Positioning Sytems: Trilateration and Correlation In this note, we ll introduce two concepts that are critical in our positioning

More information

Math 350: An exploration of HMMs through doodles.

Math 350: An exploration of HMMs through doodles. Math 350: An exploration of HMMs through doodles. Joshua Little (407673) 19 December 2012 1 Background 1.1 Hidden Markov models. Markov chains (MCs) work well for modelling discrete-time processes, or

More information

Outline. EECS Components and Design Techniques for Digital Systems. Lec 18 Error Coding. In the real world. Our beautiful digital world.

Outline. EECS Components and Design Techniques for Digital Systems. Lec 18 Error Coding. In the real world. Our beautiful digital world. Outline EECS 150 - Components and esign Techniques for igital Systems Lec 18 Error Coding Errors and error models Parity and Hamming Codes (SECE) Errors in Communications LFSRs Cyclic Redundancy Check

More information

Slope Fields: Graphing Solutions Without the Solutions

Slope Fields: Graphing Solutions Without the Solutions 8 Slope Fields: Graphing Solutions Without the Solutions Up to now, our efforts have been directed mainly towards finding formulas or equations describing solutions to given differential equations. Then,

More information

Notes for CS542G (Iterative Solvers for Linear Systems)

Notes for CS542G (Iterative Solvers for Linear Systems) Notes for CS542G (Iterative Solvers for Linear Systems) Robert Bridson November 20, 2007 1 The Basics We re now looking at efficient ways to solve the linear system of equations Ax = b where in this course,

More information

CS475: Linear Equations Gaussian Elimination LU Decomposition Wim Bohm Colorado State University

CS475: Linear Equations Gaussian Elimination LU Decomposition Wim Bohm Colorado State University CS475: Linear Equations Gaussian Elimination LU Decomposition Wim Bohm Colorado State University Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution

More information

CRYSTAL in parallel: replicated and distributed (MPP) data. Why parallel?

CRYSTAL in parallel: replicated and distributed (MPP) data. Why parallel? CRYSTAL in parallel: replicated and distributed (MPP) data Roberto Orlando Dipartimento di Chimica Università di Torino Via Pietro Giuria 5, 10125 Torino (Italy) roberto.orlando@unito.it 1 Why parallel?

More information

Parallelization of the QC-lib Quantum Computer Simulator Library

Parallelization of the QC-lib Quantum Computer Simulator Library Parallelization of the QC-lib Quantum Computer Simulator Library Ian Glendinning and Bernhard Ömer September 9, 23 PPAM 23 1 Ian Glendinning / September 9, 23 Outline Introduction Quantum Bits, Registers

More information

Efficient algorithms for symmetric tensor contractions

Efficient algorithms for symmetric tensor contractions Efficient algorithms for symmetric tensor contractions Edgar Solomonik 1 Department of EECS, UC Berkeley Oct 22, 2013 1 / 42 Edgar Solomonik Symmetric tensor contractions 1/ 42 Motivation The goal is to

More information

FPGA Implementation of a Predictive Controller

FPGA Implementation of a Predictive Controller FPGA Implementation of a Predictive Controller SIAM Conference on Optimization 2011, Darmstadt, Germany Minisymposium on embedded optimization Juan L. Jerez, George A. Constantinides and Eric C. Kerrigan

More information

The Gram-Schmidt Process

The Gram-Schmidt Process The Gram-Schmidt Process How and Why it Works This is intended as a complement to 5.4 in our textbook. I assume you have read that section, so I will not repeat the definitions it gives. Our goal is to

More information

Communication Engineering Prof. Surendra Prasad Department of Electrical Engineering Indian Institute of Technology, Delhi

Communication Engineering Prof. Surendra Prasad Department of Electrical Engineering Indian Institute of Technology, Delhi Communication Engineering Prof. Surendra Prasad Department of Electrical Engineering Indian Institute of Technology, Delhi Lecture - 41 Pulse Code Modulation (PCM) So, if you remember we have been talking

More information

Qualitative vs Quantitative metrics

Qualitative vs Quantitative metrics Qualitative vs Quantitative metrics Quantitative: hard numbers, measurable Time, Energy, Space Signal-to-Noise, Frames-per-second, Memory Usage Money (?) Qualitative: feelings, opinions Complexity: Simple,

More information

Cosets and Lagrange s theorem

Cosets and Lagrange s theorem Cosets and Lagrange s theorem These are notes on cosets and Lagrange s theorem some of which may already have been lecturer. There are some questions for you included in the text. You should write the

More information

CS 700: Quantitative Methods & Experimental Design in Computer Science

CS 700: Quantitative Methods & Experimental Design in Computer Science CS 700: Quantitative Methods & Experimental Design in Computer Science Sanjeev Setia Dept of Computer Science George Mason University Logistics Grade: 35% project, 25% Homework assignments 20% midterm,

More information

Building a Multi-FPGA Virtualized Restricted Boltzmann Machine Architecture Using Embedded MPI

Building a Multi-FPGA Virtualized Restricted Boltzmann Machine Architecture Using Embedded MPI Building a Multi-FPGA Virtualized Restricted Boltzmann Machine Architecture Using Embedded MPI Charles Lo and Paul Chow {locharl1, pc}@eecg.toronto.edu Department of Electrical and Computer Engineering

More information

Introduction to Algebra: The First Week

Introduction to Algebra: The First Week Introduction to Algebra: The First Week Background: According to the thermostat on the wall, the temperature in the classroom right now is 72 degrees Fahrenheit. I want to write to my friend in Europe,

More information

Parallelization of the Dirac operator. Pushan Majumdar. Indian Association for the Cultivation of Sciences, Jadavpur, Kolkata

Parallelization of the Dirac operator. Pushan Majumdar. Indian Association for the Cultivation of Sciences, Jadavpur, Kolkata Parallelization of the Dirac operator Pushan Majumdar Indian Association for the Cultivation of Sciences, Jadavpur, Kolkata Outline Introduction Algorithms Parallelization Comparison of performances Conclusions

More information

Compiling Techniques

Compiling Techniques Lecture 11: Introduction to 13 November 2015 Table of contents 1 Introduction Overview The Backend The Big Picture 2 Code Shape Overview Introduction Overview The Backend The Big Picture Source code FrontEnd

More information

1.20 Formulas, Equations, Expressions and Identities

1.20 Formulas, Equations, Expressions and Identities 1.0 Formulas, Equations, Expressions and Identities Collecting terms is equivalent to noting that 4 + 4 + 4 + 4 + 4 + 4 can be written as 6 4; i.e., that multiplication is repeated addition. It s wise

More information

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Matroids and Greedy Algorithms Date: 10/31/16

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Matroids and Greedy Algorithms Date: 10/31/16 60.433/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Matroids and Greedy Algorithms Date: 0/3/6 6. Introduction We talked a lot the last lecture about greedy algorithms. While both Prim

More information

7.2 Linear equation systems. 7.3 Linear least square fit

7.2 Linear equation systems. 7.3 Linear least square fit 72 Linear equation systems In the following sections, we will spend some time to solve linear systems of equations This is a tool that will come in handy in many di erent places during this course For

More information

Parallel Program Performance Analysis

Parallel Program Performance Analysis Parallel Program Performance Analysis Chris Kauffman CS 499: Spring 2016 GMU Logistics Today Final details of HW2 interviews HW2 timings HW2 Questions Parallel Performance Theory Special Office Hours Mon

More information

Logical Time. 1. Introduction 2. Clock and Events 3. Logical (Lamport) Clocks 4. Vector Clocks 5. Efficient Implementation

Logical Time. 1. Introduction 2. Clock and Events 3. Logical (Lamport) Clocks 4. Vector Clocks 5. Efficient Implementation Logical Time Nicola Dragoni Embedded Systems Engineering DTU Compute 1. Introduction 2. Clock and Events 3. Logical (Lamport) Clocks 4. Vector Clocks 5. Efficient Implementation 2013 ACM Turing Award:

More information

Speculative Parallelism in Cilk++

Speculative Parallelism in Cilk++ Speculative Parallelism in Cilk++ Ruben Perez & Gregory Malecha MIT May 11, 2010 Ruben Perez & Gregory Malecha (MIT) Speculative Parallelism in Cilk++ May 11, 2010 1 / 33 Parallelizing Embarrassingly Parallel

More information

Burst Scheduling Based on Time-slotting and Fragmentation in WDM Optical Burst Switched Networks

Burst Scheduling Based on Time-slotting and Fragmentation in WDM Optical Burst Switched Networks Burst Scheduling Based on Time-slotting and Fragmentation in WDM Optical Burst Switched Networks G. Mohan, M. Ashish, and K. Akash Department of Electrical and Computer Engineering National University

More information

Computational complexity and some Graph Theory

Computational complexity and some Graph Theory Graph Theory Lars Hellström January 22, 2014 Contents of todays lecture An important concern when choosing the algorithm to use for something (after basic requirements such as correctness and stability)

More information

Q: How can quantum computers break ecryption?

Q: How can quantum computers break ecryption? Q: How can quantum computers break ecryption? Posted on February 21, 2011 by The Physicist Physicist: What follows is the famous Shor algorithm, which can break any RSA encryption key. The problem: RSA,

More information

CS425: Algorithms for Web Scale Data

CS425: Algorithms for Web Scale Data CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original slides can be accessed at: www.mmds.org Challenges

More information

TIME DEPENDENCE OF SHELL MODEL CALCULATIONS 1. INTRODUCTION

TIME DEPENDENCE OF SHELL MODEL CALCULATIONS 1. INTRODUCTION Mathematical and Computational Applications, Vol. 11, No. 1, pp. 41-49, 2006. Association for Scientific Research TIME DEPENDENCE OF SHELL MODEL CALCULATIONS Süleyman Demirel University, Isparta, Turkey,

More information

Cyclops Tensor Framework

Cyclops Tensor Framework Cyclops Tensor Framework Edgar Solomonik Department of EECS, Computer Science Division, UC Berkeley March 17, 2014 1 / 29 Edgar Solomonik Cyclops Tensor Framework 1/ 29 Definition of a tensor A rank r

More information

VLSI System Design Part V : High-Level Synthesis(2) Oct Feb.2007

VLSI System Design Part V : High-Level Synthesis(2) Oct Feb.2007 VLSI System Design Part V : High-Level Synthesis(2) Oct.2006 - Feb.2007 Lecturer : Tsuyoshi Isshiki Dept. Communications and Integrated Systems, Tokyo Institute of Technology isshiki@vlsi.ss.titech.ac.jp

More information