EECS 358 Introduction to Parallel Computing Final Assignment

Jiangtao Gou, Zhenyu Zhao
March 19

1 Problem 1

1.1 Matrix-vector Multiplication on Hypercube and Torus

As shown in slide 15.11, we assumed row-wise striping for an $n \times n$ matrix $A$ and an $n \times 1$ vector $x$, and computed $Ax = y$. Without loss of generality, we assumed that $p \le n$. In the special case of one row per processor, $p = n$.

Processor $P_i$ initially stored the vector elements $x[in/p], \ldots, x[(i+1)n/p - 1]$ and the matrix elements $A[in/p, 0], \ldots, A[in/p, n-1]$, $A[in/p+1, 0], \ldots, A[in/p+1, n-1]$, $\ldots$, $A[(i+1)n/p - 1, 0], \ldots, A[(i+1)n/p - 1, n-1]$, and was responsible for calculating $y[in/p], \ldots, y[(i+1)n/p - 1]$.

The first step was an all-to-all broadcast, because every processor needed the entire vector. The second step was a local computation inside each processor. The matrix-vector multiplication itself contained $n^2$ multiplications and $n(n-1)$ additions. Assuming the sequential run time $W = n^2$, each processor spent $n^2/p$ time multiplying its own $n/p$ rows by the vector to get its $n/p$ elements of the result vector.
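The two steps above (all-to-all broadcast of the vector, then a local multiply on each processor's rows) can be sketched as a small single-process simulation. This is only an illustrative sketch: the function name `row_striped_matvec` and the use of Python lists to stand in for per-processor memory are assumptions, and no real message passing takes place.

```python
def row_striped_matvec(A, x, p):
    """Simulate y = A x with row-wise striping over p 'processors'.

    Assumption: p divides n, so every processor owns exactly n/p rows.
    """
    n = len(A)
    assert n % p == 0, "this sketch assumes p divides n"
    rows = n // p

    # Initial distribution: processor i owns rows and vector
    # elements i*n/p .. (i+1)*n/p - 1.
    local_A = [A[i * rows:(i + 1) * rows] for i in range(p)]
    local_x = [x[i * rows:(i + 1) * rows] for i in range(p)]

    # Step 1: all-to-all broadcast, emulated by giving every
    # processor a copy of the concatenated vector pieces.
    full_x = [v for block in local_x for v in block]
    gathered = [list(full_x) for _ in range(p)]

    # Step 2: each processor multiplies its own n/p rows by the full vector.
    local_y = [
        [sum(a * b for a, b in zip(row, gathered[i])) for row in local_A[i]]
        for i in range(p)
    ]

    # Concatenate the per-processor results in processor order.
    return [v for block in local_y for v in block]


if __name__ == "__main__":
    A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    x = [1, 0, -1, 2]
    print(row_striped_matvec(A, x, p=2))
```

Each simulated processor performs $(n/p) \cdot n = n^2/p$ multiplications in step 2, which matches the per-processor compute time used in the analysis.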

1.1.1 Hypercube

The communication time of the all-to-all broadcast on the hypercube was

$$T^{H}_{\mathrm{all2all}} = \sum_{i=1}^{\log_2 p} \left( t_s + t_h + 2^{i-1} (n/p)\, t_w \right) = t_s \log_2 p + t_h \log_2 p + t_w (n/p)(p - 1).$$

Neglecting the per-hop time $t_h$, the communication time was

$$T^{H}_{\mathrm{all2all}} = t_s \log_2 p + t_w (n/p)(p - 1),$$

so the run time on the hypercube is

$$T_P = n^2/p + t_s \log_2 p + t_w (n - n/p).$$

1.1.2 Torus

The first phase of $\sqrt{p}$ simultaneous ring-style all-to-all broadcasts consumed

$$T_1 = \left( t_s + t_h + t_w (n/p) \right) (\sqrt{p} - 1).$$

The second phase of ring-style all-to-all broadcasts ran along the other dimension and took

$$T_2 = \left( t_s + t_h + t_w (\sqrt{p}\, n/p) \right) (\sqrt{p} - 1).$$

The total communication time was

$$T^{T}_{\mathrm{all2all}} = T_1 + T_2 = 2(t_s + t_h)(\sqrt{p} - 1) + t_w (\sqrt{p} - 1)(n/p + n/\sqrt{p}) = 2(t_s + t_h)(\sqrt{p} - 1) + t_w (n - n/p) \approx 2 t_s (\sqrt{p} - 1) + t_w (n - n/p),$$

so the run time on the torus is

$$T_P = n^2/p + 2 t_s (\sqrt{p} - 1) + t_w (n - n/p).$$
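As a sanity check on the closed forms above, the following sketch sums the per-step communication costs directly and compares them with the simplified expressions. The function names and the sample values of $t_s$, $t_h$ and $t_w$ are assumptions chosen only for illustration. The sketch also assumes that $p$ is a power of two for the hypercube and a perfect square for the torus.

```python
import math


def hypercube_all2all(n, p, ts, th, tw):
    """Sum the log2(p) steps; step i exchanges 2^(i-1) * n/p elements."""
    steps = int(round(math.log2(p)))
    return sum(ts + th + 2 ** (i - 1) * (n / p) * tw for i in range(1, steps + 1))


def hypercube_all2all_closed(n, p, ts, th, tw):
    """Closed form: (ts + th) log2(p) + tw (n/p)(p - 1)."""
    return (ts + th) * math.log2(p) + tw * (n / p) * (p - 1)


def torus_all2all(n, p, ts, th, tw):
    """Two ring phases of sqrt(p) - 1 steps each (T1 + T2)."""
    r = math.sqrt(p)
    t1 = (ts + th + tw * (n / p)) * (r - 1)
    t2 = (ts + th + tw * (r * n / p)) * (r - 1)
    return t1 + t2


def torus_all2all_closed(n, p, ts, th, tw):
    """Closed form: 2 (ts + th)(sqrt(p) - 1) + tw (n - n/p)."""
    return 2 * (ts + th) * (math.sqrt(p) - 1) + tw * (n - n / p)


def runtime_hypercube(n, p, ts, tw):
    """T_P on the hypercube with the per-hop time neglected."""
    return n * n / p + ts * math.log2(p) + tw * (n - n / p)


def runtime_torus(n, p, ts, tw):
    """T_P on the torus with the per-hop time neglected."""
    return n * n / p + 2 * ts * (math.sqrt(p) - 1) + tw * (n - n / p)


if __name__ == "__main__":
    n, p, ts, th, tw = 64, 16, 10.0, 1.0, 2.0  # illustrative values only
    print(hypercube_all2all(n, p, ts, th, tw), hypercube_all2all_closed(n, p, ts, th, tw))
    print(torus_all2all(n, p, ts, th, tw), torus_all2all_closed(n, p, ts, th, tw))
```

Both models share the same $t_w (n - n/p)$ term. They differ only in the start-up term, $t_s \log_2 p$ for the hypercube against $2 t_s (\sqrt{p} - 1)$ for the torus.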

1.2 Matrix Transposition on Ring

We considered three different ring structures, shown in Figure 1 for a 16-processor parallel machine. With the first ring structure, the longest path during a matrix transposition covered 7 links. With the second, it covered 6 links. With the third, it covered 3 links. Because the third ring structure was significantly better than the other two for matrix transposition, we decided to use it.

Assuming that the number of processors $p$ is less than $n^2$, the transpose of the entire matrix was computed in two phases. In the first phase, we transposed the square matrix blocks. The longest path contained $\sqrt{p} - 1$ links, so the communication time between processors was

$$T^{R} = t_s (\sqrt{p} - 1) + t_w (\sqrt{p} - 1)\, n^2/p \approx t_s \sqrt{p} + t_w n^2/\sqrt{p},$$

where we assumed that the per-hop time $t_h$ was negligible. In the second phase, each processor performed a local exchange. Each processor held an $(n/\sqrt{p}) \times (n/\sqrt{p})$ matrix block, and transposing it took time $n^2/(2p)$. The total parallel run time on the ring, using our specific ring connection, was

$$T_P = n^2/(2p) + t_s \sqrt{p} + t_w n^2/\sqrt{p}.$$

2 Problem 2

2.1 Algorithm Description

We applied a top-down greedy algorithm to find the partition with the minimal cost.

Figure 1: Ring Structures

2.1.1 Sequential Algorithm, 1 processor

Our Greedy Partition Algorithm is shown in Figure 2. In each step, we compared the horizontal partition and the vertical partition that divide the points of the given intermediate quadrant equally, and chose the one with the smaller cost. We kept partitioning until we reached the pre-specified number of quadrants.

Comparing the horizontal and vertical partitions requires computing their costs. The trick is that we did not need to compute within-group costs, only between-group costs, as shown in Figure 3. Assume that there were $M = 2^m$ points in this quadrant. We had two partition choices: a vertical partition dividing the quadrant into areas 1 and 2, and a horizontal partition dividing it into areas 3 and 4.

The cost of the vertical partition was the sum of (1) the cost within group 13, (2) the cost within group 14, (3) the cost within group 23, (4) the cost within group 24, (5) the cost between groups 13 and 14, and (6) the cost between groups 23 and 24.

The cost of the horizontal partition was the sum of (1) the cost within group 13, (2) the cost within group 14, (3) the cost within group 23, (4) the cost within group 24, (5) the cost between groups 13 and 23, and (6) the cost between groups 14 and 24.

The four within-group terms are common to both sums, so to make the comparison we only needed to compute the between-group costs. Since each of the areas 1, 2, 3 and 4 contributed groups of $M/4$ points, the comparison required computing $4 (M/4)^2 = M^2/4$ distances.

Assume that there were a total of $N = 2^n$ points in a 2-dimensional coordinate system (here $N = 524{,}288$ and $n = 19$ in the find quadrants input given by the professor, or $N = 1{,}048{,}576$ and $n = 20$ in the description of homework 4). We assumed that in one unit of time one processor can compute the distance between two points (which involves a square root, two multiplications and three additions). The number of quadrants was $Q = 2^q$, with $q = 6, 7, 8$. The numbers of processors were $p = 1, 2, 4, 8, 16$.
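The between-group trick can be illustrated with a short sketch. Several things here are assumptions rather than details taken from the report: the cost of a group is taken to be the sum of pairwise Euclidean distances, areas 1 and 2 are the left and right halves, areas 3 and 4 are the bottom and top halves, the halves are formed by median splits of distinct points, and all function names are hypothetical.

```python
import math
from itertools import combinations


def dist(a, b):
    """Euclidean distance between two 2-D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def within_cost(group):
    """Sum of pairwise distances inside one group (assumed cost measure)."""
    return sum(dist(a, b) for a, b in combinations(group, 2))


def between_cost(g1, g2):
    """Sum of distances over all pairs with one point in each group."""
    return sum(dist(a, b) for a in g1 for b in g2)


def four_groups(points):
    """Split points into groups 13, 14, 23, 24 by median x and median y."""
    half = len(points) // 2
    left = set(sorted(points)[:half])                                  # area 1
    bottom = set(sorted(points, key=lambda p: (p[1], p[0]))[:half])    # area 3
    g13 = [p for p in points if p in left and p in bottom]
    g14 = [p for p in points if p in left and p not in bottom]
    g23 = [p for p in points if p not in left and p in bottom]
    g24 = [p for p in points if p not in left and p not in bottom]
    return g13, g14, g23, g24


def choose_split(points):
    """Pick the cheaper partition using between-group costs only."""
    g13, g14, g23, g24 = four_groups(points)
    vertical = between_cost(g13, g14) + between_cost(g23, g24)
    horizontal = between_cost(g13, g23) + between_cost(g14, g24)
    choice = "vertical" if vertical <= horizontal else "horizontal"
    return choice, vertical, horizontal


if __name__ == "__main__":
    # Two clusters far apart in x: cutting between them vertically is cheaper.
    print(choose_split([(0, 0), (0, 1), (10, 0), (10, 1)]))
```

Because the four within-group terms cancel, the difference between the two between-group sums equals the difference between the two full partition costs, so the choice is unchanged while the number of distance evaluations drops.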

Figure 2: Greedy Partition Algorithm

Figure 3: Cost Comparison

In order to compare costs, we needed to compute

$$T_1 = \frac{N^2}{4} + 2 \cdot \frac{(N/2)^2}{4} + 4 \cdot \frac{(N/4)^2}{4} + \cdots = \sum_{i=0}^{q-1} 2^i \, \frac{(N/2^i)^2}{4} = \frac{1}{4} \sum_{i=0}^{q-1} \frac{N^2}{2^i} = \frac{N^2}{2} \left( 1 - \frac{1}{2^q} \right).$$

So when $q$ is relatively large, $T_1$ is almost the same for different values of $q$. Since we needed the total cost, in the last step we also had to compute the within-group distances, which took

$$T_2 = 2^{q-1} \, \frac{(N/2^{q-1})^2}{4} = \frac{N^2}{2^{q+1}}.$$

So the time spent computing costs is

$$T_{\mathrm{cost}} = T_1 + T_2 = \frac{N^2}{2},$$

which is independent of $q$.

When partitioning, we needed to sort the points by either x-coordinate or y-coordinate. We used quick sort, whose average complexity of $1.39\, N \log_2 N$ is significantly smaller than $T_{\mathrm{cost}}$, so we concluded that the sequential algorithm took

$$T_S = \frac{N^2}{2}$$

to search for the partition with minimum cost.

2.1.2 Parallel Algorithm, p processors

When using multiple processors, we used an interleaved schedule to assign the cost computation to the different processors. With $p$ processors, it took $N^2/(2p)$ time to compute the cost.
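The claim that $T_1 + T_2 = N^2/2$ regardless of $q$ can be checked exactly with rational arithmetic. The sketch below is only a numerical check of the sums as written above. The function names are illustrative assumptions.

```python
from fractions import Fraction


def t1(N, q):
    """Comparison cost: level i has 2^i quadrants, each costing (N/2^i)^2 / 4."""
    return sum(2 ** i * Fraction(N, 2 ** i) ** 2 / 4 for i in range(q))


def t2(N, q):
    """Final within-group cost: 2^(q-1) * (N/2^(q-1))^2 / 4."""
    return 2 ** (q - 1) * Fraction(N, 2 ** (q - 1)) ** 2 / 4


def t_cost(N, q):
    """Total distance-evaluation time of the sequential algorithm."""
    return t1(N, q) + t2(N, q)


def per_processor_cost(N, q, p):
    """Share of the cost work per processor when it is divided evenly."""
    return t_cost(N, q) / p


if __name__ == "__main__":
    N = 2 ** 19
    for q in (6, 7, 8):
        print(q, t_cost(N, q) == Fraction(N * N, 2))
```

With an even division of the distance evaluations, `per_processor_cost` reduces to $N^2/(2p)$, the figure used for the parallel algorithm.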

Between every two steps, we needed one single-node accumulation to sum up the cost, and one one-to-all broadcast to let every processor know the decision between the two possible partitions. The decision was made by one processor, say $P_0$. After receiving the decision about the better partition, each processor partitioned the points on its own, which first required sorting the list of points again. The sorting process ($O(N \log N)$) took much less time than computing the cost ($O(N^2)$), so we could omit the time spent sorting. Assuming that the accumulation and the broadcast each took $\log p$ time, we conclude that

$$T_P = \frac{N^2}{2p} + 2q \log p.$$

2.2 Scalability

Speedup

$$S = \frac{N^2/2}{\dfrac{N^2}{2p} + 2q \log p} = \frac{p N^2}{N^2 + 4qp \log p}.$$

Efficiency

$$E = S/p = \frac{N^2}{N^2 + 4qp \log p}.$$

Note that

$$S = \frac{p N^2}{N^2 + 4qp \log p} = \frac{p}{1 + 4qp \log p / N^2} = \frac{p}{1 + (p - 1) \cdot \dfrac{4qp \log p}{(p - 1) N^2}}.$$

So

$$\alpha = \frac{4qp \log p}{(p - 1) N^2}.$$

Since $\alpha \to 0$ as $N \to \infty$, this algorithm is effective. Meanwhile, note that

$$S = \frac{p N^2}{N^2 + 4qp \log p},$$

and

$$\frac{(2p) N^2}{N^2 + 4q(2p) \log(2p)} \approx \frac{2p N^2}{N^2 + 4qp \log p} = 2S,$$

so it is scalable.

2.3 Results: Run time and Partition cost

We used the setting $N = 524{,}288$ and $n = 19$. Outputs are shown in Figures 5, 6, 7, 8, 9, 10, 11, 12 and 13. We printed the coordinates of the four corners of each quadrant, the cost of the quadrant distribution, and the wall time, as shown in the figures. Results are summarized in Table 1 and Table 2.

Table 1: Cost
Number of Quadrants    Cost

Table 2: Run time on lab machine (seconds)
Number of Processors    64 Quadrants    128 Quadrants    256 Quadrants

Running on the 358smp host machine took almost three times as long as running on the Wilkinson lab machines.
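The run-time model of Section 2.2 can be evaluated for the processor counts used in the experiments. The sketch below only evaluates the analytical formulas in the report's abstract unit of time (one distance computation). It does not reproduce any measured value. The logarithm is assumed to be base 2, and the function names are illustrative.

```python
import math


def parallel_time(N, q, p):
    """Model T_P = N^2 / (2p) + 2 q log2(p), in unit distance computations."""
    return N * N / (2 * p) + 2 * q * math.log2(p)


def speedup(N, q, p):
    """S = T_S / T_P with T_S = N^2 / 2."""
    return (N * N / 2) / parallel_time(N, q, p)


def efficiency(N, q, p):
    """E = S / p."""
    return speedup(N, q, p) / p


def alpha(N, q, p):
    """Overhead fraction in S = p / (1 + (p - 1) * alpha); needs p > 1."""
    return 4 * q * p * math.log2(p) / ((p - 1) * N * N)


if __name__ == "__main__":
    N = 2 ** 19  # 524,288 points, as in the experiments
    for q in (6, 7, 8):
        for p in (1, 2, 4, 8, 16):
            print(q, p, speedup(N, q, p), efficiency(N, q, p))
```

For these problem sizes the communication term $2q \log p$ is tiny compared with $N^2/(2p)$, so the model predicts efficiency very close to 1.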

Figure 4: Time vs Number of Processors

Figure 5: Results (Lab machine): 64 quadrants, 1 and 2 processors

Figure 6: Results (Lab machine): 64 quadrants, 4 and 8 processors

Figure 7: Results (Lab machine): 64 quadrants, 16 processors

Figure 8: Results (Lab machine): 128 quadrants, 1 and 2 processors

Figure 9: Results (Lab machine): 128 quadrants, 4 and 8 processors

Figure 10: Results (Lab machine): 128 quadrants, 16 processors

Figure 11: Results (Lab machine): 256 quadrants, 1 and 2 processors

Figure 12: Results (Lab machine): 256 quadrants, 4 and 8 processors

Figure 13: Results (Lab machine): 256 quadrants, 16 processors
