Speedup for Multi-Level Parallel Computing

Size: px

Start display at page:

Download "Speedup for Multi-Level Parallel Computing"

Dwayne Gardner
5 years ago
Views:

1 Seedu for Multi-Level Parallel Comuting Shaniang Tang, Bu-Sung Lee,2, Bingsheng He School of Comuter Engineering 2 Service Platform Lab Nanyang Technological University HP Labs Singaore {stang5, ebslee, bshe}@ntu.edu.sg francis.lee@h.com Abstract This aer studies the seedu for multi-level arallel comuting. Two models of arallel seedu are considered, namely, fixed-size seedu and fixed-time seedu. Based on these two models, we start with the seedu formulation that takes into account uneven allocation and communication latency, and gives an accurate estimation. Next, we roose a highlevel abstract case with roviding a global view of ossible erformance enhancement, namely E-Amdahl s Law for fixed-size seedu and E-Gustafson s Law for fixed-time seedu. These two laws demonstrate seemingly oosing views about the seedu of multi-level arallel comuting. Our study illustrates that they are not contradictory but unified and comlementary. The results lead to a better understanding in the erformance and scalability of multi-level arallel comuting. The exerimental results show that E-Amdahl s Law can be alied as a rediction model as well as a guide for the erformance otimization in multi-level arallel comuting. Keywords-E-Amdahl s Law, E-Gustafson s Law, Multi-Level Parallel Comuting I. INTRODUCTION High erformance comuting systems continue to grow raidly over the ast few years. The raid develoment of multi/many-core technology in recent years has made it a denser arallel architecture (i.e. more rocessing elements er hysical node) []. For instance, more than 8% of the systems of the To5 suercomuters [2] belong to the multi/many-core rocessor family. That is, most systems have featured a hierarchical hardware design: comuting nodes with multi/many-core CPUs are connected via network infrastructure. They have evolved to suort arallelism of multile levels (e.g., at levels of hysical machines, CPUs and cores), in contrast with a single level of hysical machines. To fully exloit the otential of multi-level arallelism of hardware architecture, it is natural to emloy the multi-level arallel comuting aradigm which takes rocesses for coarsegrained arallelism across the nodes and threads for finegrained arallelism within the node at the same time [3]. For examle, it is quite suitable for a cluster of SMP nodes that MPI is needed for arallelism across nodes and OenMP can be used to exloit the arallelism within a node [4]. Seedu has been a classical metric to measuring the erformance and scalability of arallel rocessing [5]. It is generally defined as sequential execution time over arallel execution time. There are two well-known seedu models, Amdahl s Law [6] and Gustafson s Law [7]. Amdahl s Law models the seedu of arallelized imlementations under the assumtion that the roblem size remains the same when arallelized. In contrast, Gustafson s Law models the seedu involving arbitrarily large data sets can be efficiently arallelized. Both seedus catured by Amdahl s Law and Gustafson s Law are indeed based on the single-level arallelism, whereas multi-level arallelism is not considered. In comarison to single-level arallelism, multi-level arallelism is more comlicated, involving a nested arallelism with granularity from coarse to fine. Neither Amdahl s Law nor Gustafson s Law is able to differentiate the varying granularity of multi-level arallel comuting, from coarse-grained arallelism to finegrained arallelism. To evaluate the seedu for the multi-level arallel comuting, we need to combine seedus at different levels of arallelism. That motivates us to study seedu for multi-level arallel comuting. In this aer, we roose a general multi-level arallelism model. Based on this model, the fixed-size seedu and fixedtime seedu are studied. With both uneven workload allocation and communication latency considered, the formulations of generalized fixed-sized seedu and fixed-time seedu are derived, giving accurate erformance estimations for different alications. In order to facilitate model the seedu of multilevel arallelism, we roose a fixed-size seedu named E- Amdahl s Law and a fixed-time seedu named E-Gustafson s Law, with assumtions that the communication overhead is zero and the workload only consists of sequential and erfectly arallel ortion. Those models can be viewed as extensions of Amdahl s Law and Gustafson s Law resectively to multi-level arallel comuting. We further study the imlications of E-Amdahl s Law and E-Gustafson s Law on erformance otimization of multilevel arallel comuting. E-Amdahl s Law indicates that if the degree of arallelism at the first level is not large, increasing the degree of arallelism at the second level will not significantly increase the overall erformance. It is imortant but easily ignored by users during the rogramming and otimization of the multi-level comuting. For examle, in multi-gpu rogramming [8], rogrammers often focus most of their attentions on otimizing intra-gpu arallelism, since they generally regard the intra-gpu arallelism as the key art for erformance otimization. Thus, the otimization work of arallelism across different GPUs might be neglected. On the other hand, E-Gustafson s Law suggests that the seedu of multi-level arallel comuting is unbounded. While the two laws look conflictive, they are unified and comlementary with each other, and reflect different viewoints and different tyes of alications.

2 Since E-Gustafson s Law and E-Amdahl s Law are unified, we take E-Amdahl s Law as a case study to evaluate the erformance of NAS arallel benchmark with multi-zone version [4]. The results show that the estimated seedu based on E-Amdahl s Law is much more accurate than that with Amdahl s Law for multi-level arallel comuting. Moreover, E-Amdahl s Law always gives out the uer bound for the seedu. Thus, E-Amdahl s Law can be used as a rediction model as well as a guide to users on erformance otimization of multi-level arallel comuting. This aer is organized as follows. Section II reviews the related work. Section III resents the multi-level arallelism model, followed by the descrition of the generalized multilevel arallel seedus. The high-level abstract multi-level arallel seedus are resented in Section V. Section VI resents the erformance evaluation. Section VII concludes the aer and gives out the future work. III. MODEL AND MOTIVATION In this section, we resent the concet of multi-level arallel comuting and the motivation for new seedu models. A. Multi-level Parallel Comuting II. RELATED WORK Seedu is a key erformance metric for arallel comuting. Two basic definitions of seedu are available, namely, absolute seedu and relative seedu [9]. The absolute seedu refers to how much faster a roblem can be solved with N rocessors by comaring the best sequential algorithm with the arallel algorithm under consideration. The relative seedu is defined as the ratio of elased time of the arallel algorithm on one rocessor to elased time of the arallel algorithm on N rocessors. It focuses on the inherent arallelism of the arallel algorithm under consideration. Moreover, there are other kinds of seedu definitions such as real seedu and asymtotic seedu. All tyes of seedu mentioned, together with their advantages and disadvantages, are surveyed in [6]. However, all those definitions of seedu are based on the the single-level arallelism. Assuming that the roblem size is fixed, Amdahl s Law [6] was roosed. It describes how much erformance enhancement can be achieved when increasing the number of rocessing elements. It essimistically indicates that the maximum seedu is bounded by the sequential fraction of the workload and massively arallel rocessing may not obtain high erformance. Later, Gustafson s Law [7] was roosed. It is concerned with the fixed-time scenario and interested in the scale of a roblem that can be handled within a given time. It otimistically suggests that the seedu is unbounded and increases linearly with the number of rocessing elements available. The above laws are simle and have great influences on the erformance otimizations of arallel comuting. Sun and Ni s work [5], [] has also described other seedu models for memory-bounded scenarios. However, all the revious works of seedu model are imlicitly based on the single-level arallelism, which are unsuitable for the multilevel arallelism. In contrast, our work in this aer extends the fixed-size and fixed-time seedu models for the multilevel arallelism, giving both the generalized formulations as well as high-level abstract formulations. Fig. : Multi-level arallelism model. Figure illustrates a general arallelism model for the multi-level arallel comuting. It is a nested arallelism from coarser granularity to finer granularity. The number of arallelism levels is m(m ). We call each rocessing art as a arallelism unit. Let PE i, denote the arallelism unit with the id of at the i th level. The model assumes that a arallel rogram can be searated into two arts: sequential art and arallel art. For the arallel art, we arallelize it in multile levels of arallelism. The arallel art of the coarse-grained arallelism can be further arallelized with fine-grained arallelism imlementations. For examle, we can write a two-level rogram with MPI/OenMP rogramming aradigm for SMP clusters. The first-level arallelism (L) isa rocess-level arallelism with MPI across the comute nodes. The second-level arallelism (L2) is a thread-level arallelism using OenMP within a rocess. More levels of arallelism can also be considered, e.g., instruction-level arallelism from the comiler asect. B. A Motivating Examle Multi-level arallelization has been widely adoted in many alications. To motivate the imortance of extending the traditional concet and formula of seedu for the multi-level arallelism, we take as a case study benchmark LU-MZ [4], imlemented with MPI/OenMP rogramming aradigm. We comare the erformance estimation using Amdahl s Law and with our roosed E-Amdahl s Law, as illustrated in Figure 2. The details on E-Amdahl s Law is resented in Section V. The erformance of the benchmark is measured on a SMP cluster consisting of 8 comute nodes, each with two quad-core chis. Let denote the number of rocesses Amdahl s Law: seedu F + N F, where F is the arallel fraction of the workload, N is the number of rocessors used.

3 Fig. 3: Parallelism rofile of a hyothetical alication. t 5 t 4 t 3 t 2 t Fig. 4: Shae of the alication. Fig. 2: Exerimental and estimated seedus for NAS multi-level arallel comuting benchmark LU-MZ, imlemented with hybrid MPI/OenMP rogramming aradigm. denotes the number of rocesses, t denotes the number of threads er rocess. We estimate the seedu with E-Amdahl s Law by using Formula (7). We estimate that α.9892,β.86 with Algorithm (in Section VI(A)), where β is the arallel fraction of the thread-level arallelism. and t denote the number of threads er rocess. With the ersective of the number of rocessors (i.e. CPU cores) used for the multi-level arallelism, we estimate the seedu based on Amdahl s Law by using the formula: α+, where α α t is the arallel ortion at the rocess-level arallelism. Note that Amdahl s Law cannot well estimate the seedu for the benchmark with multi-level arallelism. Particularly, Amdahl s Law is unable to differentiate the coarse-grained and finegrained arallelism. For examle, there is no difference in seedu when t 8, 2 4, 4 2, 8 using Amdahl s Law. Moreover, the estimated seedu of Amdahl s Law becomes more inaccurate when the number of threads er rocess (t) increase. For instance, the estimation error at oint t 8 8 is much larger than that of t 8 4. In contrast, E-Amdahl s Law is able to well distinguish the coarse-grained and fine-grained arallelism, and estimate the seedu for the multi-level arallelism. Secifically, in this examle, the average ratio of estimation error 2 for Amdahl s Law is 55%. In comarison, the average ratio of estimation error for E-Amdahl s Law is %. This significant estimation error difference motivates us to model the seedu of multilevel arallelism. IV. GENERALIZED MULTI-LEVEL PARALLEL SPEEDUP The arallelism within an alication can be characterized in different ways [5]. Two maor degradation factors for effecient arallelism are considered in the aer, uneven allocation (load unbalance) and communication latency. Both degradations should be well considered in order to get an accurate estimation. The uneven allocation is measured by degree of arallelism [5]. 2 Average ratio of estimation error R E, where R is the n R exerimental result, E is the estimated result, n is the number of testings. Definition. The degree of arallelism is an integer which indicates the number of rocessing elements in a arallelism level that are busy during the execution of the rogram, given an unbounded number of available rocessing elements (e.g. comuting nodes, CPUs). Figure 3 shows a grah of the degree of arallelism in a arallelism level over the execution time for a hyothetical arallel alication. We refer to this as the arallelism rofile. It can be rearranged to form the shae (See Figure 4) of the alication execution by gathering the time taken at each degree of arallelism []. In Figure, we assume that arallelism units at the same level are identical. Thus we only need to consider one multi- level arallelism ath (e.g. PEi,, ( i m)) for the whole comutation. Let W be the whole amount of comutation (work) for an alication. Let (i) be the number of rocessing elements used to execute the arallelism units at the i th level, which are sawned from the same PE.For examle, in Figure, we have (),(2) 2,(3) 4. Let P reresent the whole rocessing elements. Formally, the seedu for a multi-level hardware architecture with P rocessing elements and with the total amount of work W is defined as SP P (W ) T (W ) T P (W ), () where T i (W ) is the time needed to comlete W amount of work on i rocessing elements. Let W i, be the amount of work for each node at the i th level with the degree of arallelism, and m i be the maximum degree of arallelism at the i th level. Thus, W m W,. For each PE i,i ( i<m), the sequential ortion of work W i, is comuted and the arallel ortion of work W i, ( <<m i ) is distributed to its low-level nodes. Thus, it has: m i 2 W i, m i+ W i+,, ( i<m). (2) Let be the comuting caacity of each rocessing element, the execution time on a single rocessing element for the total work W will be T (W ) W. (3) The multi-level arallelism is a recursive master-slave execution rocess. When one node PE i,i ( i<m) comletes

4 its sequential execution, it will wait until its arallel ortion is solved by its low-level nodes. In contrast, PE m,i at the bottom level will not disatch its arallel ortion, but instead it comletes its execution of sequential and arallel ortions by itself. By the definition of degree of arallelism, any two subtasks (e.g. W i, and W i,k, k) at the i th level with different degrees of arallelism cannot be solved simultaneously. The total execution time on a multi-level hardware latform with an unbounded number of rocessing elements available will be T (W ) W, + W 2, + + W m, Thus, the seedu will be m m + SP (W ) T (W ) T (W ) W W, +W 2, + +W m, + m m W m, W m i W i, + m m m W, m i W i, + m m W m, W m, W m,. (4). (5) When the communication overhead is considered and the number of rocessing elements for each node in each arallelism level is finite, the seedu will be smaller than the seedu SP. For each node PE i,i at the i th level, if the number of its arallel rocessing elements is (i) and (i) <, then some rocessing elements have to do W i, (i) work and the remaining ones will do W i, (i) work. We assume that at each level, the allocation of work strictly follows an order of the ids of PE from small to large and distributes the work with Wi, Wi, (i) first and later the work with (i). Then for the nodes in the ath PE i,, ( i m), it holds: m i 2 W i, m i+ (i) W i+,, ( i<m). (6) The execution time on the multi-level hardware architecture, with work W and limited number of rocessing elements, is T P (W ) W, + W 2, + + W m, Hence, the seedu is m m + W m, (m). (7) SP P (W ) T (W ) T P (W ) W,+W 2,+ +W m, m W W i W i, + m m m W, m i W i, + m m + m m W m, (m) W m, (m) W m, (8) (m). Another imortant degradation factor of erformance is communication latency. In contrast to unbalance workload, communication latency is communication network deendent [], e.g, routing schemes and switching techniques, etc. Let Q P (W ) be the communication overhead of the multilevel arallel comuting when P rocessing elements of the multi-level hardware architecture are used to comlete P amount of work. Q P (W ) deends on lots of factors including the communication attern, message sizes of the alication, system-deendent communication latency, etc. Assuming the degree of arallelism is not affected by the communication overhead, the general seedu is SP P (W ) m i W i, + m m m W, W m, (m) + Q P (W ). (9) Note that all above derivations for seedu of the multilevel arallel comuting are indeed for the fixed roblem size, or the fixed workload. The seedu emhasizes the time reduction of a given roblem. This seedu formulation is called fixed-size seedu []. Equation (9) is the general seedu formula for fixed-size seedu. It is suitable for many alications whose sizes cannot be scaled. In contrast, another seedu model called fixed-time seedu [7], where the workload of alication is scaled u with the number of rocessing elements available. One tyical alication for this is weather broadcasting [2]. Given more comutation ower, we may not want to get the result earlier. Instead, we may want to increase the roblem size by adding more relevant factors into the weather model and obtain a more accurate solution, which in turn gives a more recise forecast. Moreover, it assumes that the workload scaling occurs only at the arallel ortion. Let W be the total amount of scaled work. Let W i, be the amount of scaled work for each node at the i th arallelism level with the degree of arallelism, andm i be the maximum degree of arallelism of the scaled workload with (i) rocessing elements available at the i th level. Then it has: W m W, and W i, W i,, ( i m). For the nodes in the ath PE i,, ( i m), it holds: m i 2 W i, m i+ (i) W i+,, ( i<m). ()

5 m i W i, 2 m i+ W i+,, ( i<m). () In the lowest level, i m, in order to kee the same turnaround time as the sequential version, its scaled workload must hold m m W m, m m W m,, (i m). (2) (m) Thus, based on Equations (), (), (2), the generalized seedu formula for fixed-time seedu is SP P (W ) T (W ) T (W ) W, +W 2, + +W m, W + m W m m, (m) + Q P (W ) W m i W i, + m W m m, (m) + Q P (W ) W m i W i, + m m W m, + Q P (W ) W W + Q P (W ). (3) V. HIGH-LEVEL ABSTRACT MULTI-LEVEL PARALLEL SPEEDUP The generalized seedu formulations have been roosed for the fixed-size seedu model and the fixed-time seedu model. The formulations consider the erformance degradation due to uneven allocation and communication latency. The generalized seedus are alication deendent and much closer to the real seedu. However, it is very comlicated and difficult to understand and to have a global view of ossible erformance gains since they contain lots of detailed information for each alication. In this section, we make some assumtions to have a high-level seedu model, focusing on the essential characteristic of the multi-level arallel comuting. We assume that the communication overhead cost is zero (i.e. Q P ), and that the workload at each arallelism level only consists of two ortions, a sequential ortion and a erfectly arallel ortion. Thus, W i, W i,, for i m, and (i). We call the high-level abstract fixed-size seedu as E-Amdahl s Law, as an extension of Amdahl s Law and the high-level abstract fixed-time seedu as E-Gustafson s Law, as an extension of Gustafson s Law for the multi-level arallel comuting. In the following, let f(i) be the ortion of the workload at the i th level that can be arallelized, and s(i) denote the multi-level seedu at the i th level. A. E-Amdahl s Law The multi-level seedu can be viewed as the relative comuting caacity of a multi-level hardware latform with resect to an unirocessor. To figure out the high-level abstract fixedsize seedu for the multi-level arallel comuting model, a bottom-u method is adoted. That is, we first comute the lowest-level result and then deduce the nearest uer-level one based on current level result. Such rocess continues until the final result(first-level seedu) is obtained. Secifically, it is as follows: ). When i m, it is the bottom level. Since there is no more level below it, its seedu can be obtained directly in terms of Amdahl s Law, that is: s(i), (i m). (4) f(m)+ f(m) (m) 2).When i<m, the seedu at the i th level directly deends on the (i +) th level. It is: s(i) f(i)+ f(i) (i)s(i+), ( i<m). (5) In summary, we obtain E-Amdahl s Law with the following formula: s(i) f(m)+ f(m) (m) f(i) f(i)+ (i)s(i+) (i m) ( i<m). (6) Generally, we often consider the multi-level arallel comuting with two levels(m 2). Let s consider clusters of SMP nodes. It often takes hybrid MPI/OenMP rogramming model [3], which uses MPI for arallelism across the nodes and OenMP for arallelism within a node. Suose it is one rocess er node and the number of rocesses(nodes) is. For each rocess(node), we assume the number of threads is t. In the rocess level, the fraction of comutation that can concurrently execute is α( α ). The arallelizable ortion of comutation in the thread level is β( β ). In terms of Formula (6), we have the multi-level seedu as follows: ŝ(α, β,, t) s(i ). (7) α + α( β+ β t ) Some roerties can be attained for Formula (7): a). ŝ(α, β,, ) : The sequential condition holds. b). ŝ(α, β,, ) α+ : It turns to be a traditionally α single-level Amdahl s Law with arallelizable ercent of α when t. c). ŝ(α, β,,t) : It becomes a single-level αβ+ αβ t Amdahl s Law with arallelizable ortion of αβ when. Figure 5 resents fixed-size seedu curves under E- Amdahl s Law of Equation (7) when m 2. The x-axis denotes the number of rocesses (). The y-axis gives the seedu for the multi-level arallel comuting. Curves within

6 Seedu Seedu Seedu α.9, t (a) α.9, t6 5 (d) α.9, t64 (g) Seedu Seedu Seedu α.975, t (b) α.975, t (e) α.975, t64 (h) Seedu Seedu Seedu α.999, t (c) α.999, t (f) α.999, t64 β β β β Fig. 5: Seedu under E-Amdahl s Law. The grahs from left to right in each row give the variation of seedu when increasing the value of α. The grahs from to to down in each column resent the variation of seedu when we increase the number of threads (t). The curves within each grah show seedus under different values of β. each grah show seedus under different values of β. The grahs from left to right in each row give the variation of seedu when increasing the value of α. The grahs from to to down in each column resent the variation of seedu when we increase the number of threads (t). Result : For fixed-size alications, the results of E- Amadahl s Law indicate that exloring arallelism at each level of the multi-level arallel comuting is crucial for erformance imrovement. However, if the degree of arallelism (α) at the first (rocess) level is not large, increasing the degree of arallelism at the second (thread) level cannot have a significant contribution to the whole erformance imrovement. For examle, in Figure 5(a), the curves are very close to each other when increasing β from.5 to.999 in the case of α.9. In contrast, it will have a significant erformance imrovement when increasing the arallelism β at the second level if the arallelism (α) at the first level is large enough. Considering α.999,,t 8 in Figure 5(c) for instance, the erformance is significantly imroved when the arallelism β at the second level increases from.5 to.999. Result 2: The maximum fixed-size seedu is bounded by the degree of arallelism at the first level. For examle, if α.9, its maximum seedu is. As shown in Figure 5(a),5(d),5(g), no matter how to increase the number of rocesses (), the number of threads (t), and the degree of arallelism β at the second level, the seedu infinitely aroximates to. (i) B. E-Gustafson s Law The fixed-time seedu can be regarded as the ratio of scaled workload in arallel hardware to the workload of unirocessor. A bottom-u method is adoted to get the highlevel abstract fixed-time seedu for the multi-level arallel comuting model,. We first figure out the fixed-time seedu at the bottom level and then deduce the nearest uer-level one based on current level result. The rocess continues until the final result (first-level seedu) is obtained. That is, ). When i m, it is the bottom level. Its seedu can be obtained directly in terms of Gustafson s Law 3. that is: s(i) f(m)+f(m)(m), (i m). (8) 2).When i<m, the seedu(workload) at the i th level directly deends on the (i +) th level. Note that Equation (8) can be viewed as the normalized scaled workload when the workload of unirocessor is. Then it holds that s(i) f(i)+f(i)(i)s(i +), ( i<m). (9) Therefore, we obtain E-Gustafson s Law with the following formula: f(m)+f(m)(m) (i m) s(i) f(i)+f(i)(i)s(i +) ( i<m). (2) In general, it is common to have two levels of arallelism for the multi-level arallel comuting(m 2). Given clusters of SMP nodes, for examle, it often takes hybrid MPI/OenMP rogramming model [3]. We assume the number of rocesses is and the number of threads er rocess is t. The ortion of comutation at the first(rocess) level that can be arallelized is α( α ) and that at the second(thread) level is β( β ). According to Formula (2), the multi-level fixed-time seedu is: ŝ(α, β,, t) s(i ) α +( β + βt)α. (2) There are some roerties for Formula (2) as follows: a). ŝ(α, β,, ) : The sequential condition holds. b). ŝ(α, β,, ) α + α: It turns out to be a traditionally single-level Gustafson s Law with arallelizable ercent of α when t. c). ŝ(α, β,,t) αβ + αβt: It becomes a single-level Gustafson s Law with arallelizable ortion of αβ when. Equation (2) indicates that there is a ositive linear relationshi between the seedu and erformance factor θ {α, β,, t}. Figure 6 gives fixed-time seedu curves under E- Gustafson s Law of Equation (2). From the fixed-time oint of view, E-Gustafson s Law imlies that, 3 Gustafson s Law: seedu F + F N, where F is the arallel fraction of the workload, N is the number of rocessors used.

7 Seedu Seedu Seedu α.9, t (a) α.9, t6 (d) α.9, t64 (g) Seedu Seedu Seedu α.975, t (b) α.975, t (e) α.975, t64 (h) Seedu Seedu Seedu 5 5 α.999, t (c) α.999, t (f) α.999, t64 β β β β Fig. 6: Seedu under E-Gustafson s Law. The grahs from left to right in each row give the variation of seedu when increasing the value of α. The grahs from to to down in each column resent the variation of seedu when we increase the number of threads (t). The curves within each grah show seedus under different values of β. Result 3: For alications with scaled workload, the maximum seedu of the multi-level arallel comuting is unbounded. For instance, in Figure 6(a), the seedu is scaled u linearly with. Note that E-Amdahl s Law(Result 2) and E-Gustafson s Law(Result 3) give totally oosite viewoints towards the maximum seedu of the multi-level arallel comuting. However, E-Amdahl s Law and E-Gustafson s Law are not conflictive but unified. They ust consider the seedu from different ersectives. Indeed, E-Gustafson s Law is imlicitly based on the assumtion that the roblem size is scaled and the arallel ortion is unfixed, whereas E-Amdahl s Law suoses that the roblem size is fixed and the arallel ortion is unchanged. The equivalent roof is given in APPENDIX A. (i) In the following, we first resent the argument estimation for E-Amdahl s Law, and then resent the exerimental results. A. Argument Estimation To evaluate the seedu based on E-Amdahl s Law of Formula (7), it needs to estimate the exact values of arguments α, β for a given alication first. It is a challenging issue and no otimal solution is readily available, since it deends on lots of factors such as rogram imlementation, inut data and arguments, hardware latform, etc. In this aer, we give a ossible argument estimation method in our exeriment. This method tries to estimate the values of α, β through the multi-level rogram execution. It figures out the estimated value of α, β by solving Equation (7) with samling exerimental results. This method can well contain various overhead costs such as rocess creation cost, communication and synchronization cost, etc. Its estimated result is much close to the actual value. The detailed estimation rocess is given by Algorithm. Algorithm Argument Estimation for α, β. : Execute the rogram k times in the multi-level arallel execution mode with the chosen arameters (,t ), ( 2,t 2 ),..., ( k,t k ) resectively. Then it gets (,t,s ), ( 2,t 2,s 2 ),..., ( k,t k,s k ). 2: Choose all ossible combinations of two arrays ( i,t i,s i ) and (,t,s ) to figure out the value (α s,β s ) based on Equation (7), where s denotes the s th combination and i, k & i. 3: Check all ossible airs of value (α s,β s ) to guarantee that the air of estimated values is valid(i.e. α s, β s ). Otherwise, discards it. 4: Collect all valid airs of (α i,β i ), (i, 2,..., k ), where k reresents the number of valid airs. Remove the noise airs by clustering with the guard condition: α i α <ε & β i β <ε. 5: The exact value of α, β can thereby be estimated with the formula: α k k i α i β k k i β i, where k is the number of clustered airs. VI. EXPERIMENTAL EVALUATION In the revious section, we have demonstrated the equivalence of E-Amdahl s Law and E-Gustafson s Law. Therefore, we take E-Amdahl s Law as a case study to evaluate the seedu model. In articular, we choose the NAS Parallel Benchmark (NPB) Multi-Zone (MZ) [4], [5] codes BT-MZ, SP-MZ and LU-MZ, with hybrid MPI+OenMP rogramming model. They are executed on a Linux cluster consisting of eight comute nodes, each with two 3. GHz Intel Xeon quadcore chis and 6 GB of memory. We do exeriments by setting one MPI rocess er comute node and increasing the number of OenMP threads for each rocess from u to 8. Notable, the correct choice of i and t i is very crucial for estimation. Particularly, we should avoid those airs which may cause workload unbalance during the arallel execution. For examle, for the benchmark SP-MZ, the suitable value of i and t i should be, 2, 4, 8, 6,..., since the workload is well balanced in these cases. In contrast, it is unbalanced when i or t i equals to 3, 7. B. Results Evaluation The exerimental and estimated seedu results are shown in Figure 7. To estimate the values of α, β with Algorithm, the values of i and t i are set to be, 2, 4 resectively, and ε

8 .. The grahs in the first column resent the exerimental seedus. The estimated seedu results with E-Amdahl s Law are given in the second column. The comarison grahs of the corresonding exerimental and estimated results are shown in the third column resectively. (a) Exerimental result of BT- MZ (class W). (d) Exerimental result of SP- MZ (class A). (g) Exerimental result of LU- MZ (class A). (b) Estimated result of BT-MZ. α.977,β.5822 (e) Estimated result of SP-MZ. α.979,β.7263 (h) Estimated result of LU-MZ. α.9892,β.86 (c) Comarison result of (a) and (b). (f) Comarison result of (d) and (e). (i) Comarison result of (g) and (h) Fig. 7: Exerimental and estimated seedu results for NAS Parallel Benchmarks of Multi-Zone Versions. is the number of MPI rocesses sawned. t is the number of OenMP threads for each rocess. The estimated seedu is based on the roosed E-Amdahl s Law. Figures 7(a) 7(c) resent seedu results for the alication benchmark BT-MZ with class W 4. It is a block tridiagonal simulated CFD algorithm. The number of zones for class W is 4 4. However, the size of zones varies significantly, with a ratio of about 2 between the largest and smallest zone. This oses a roblem for workload balancing. Its estimated coarse-grained arallelism ortion is α.977 and finegrained arallelism ortion is β With E-Amdahl s Law, the uer bound seedu result can be estimated. The comarison result in Figure 7(c) indicates that the workload unbalance roblem is becoming increasingly serious as the number of rocesses increases. In this scenario, E-Amdahl s Law can be used to evaluate the quality of the algorithm. Meanwhile, it can be a guidance to users on how much erformance imrovement sace is available for erformance otimization. SP-MZ is a scalar enta-diagonal simulated CFD algorithm. LU-MZ is a lower-uer symmetric gauss-seidel simulated CFD algorithm. In contrast to BT-MZ, the artitioned zones are identical in size for the two algorithms. The workload should be more balanced than BT-MZ. The number of zones for class A is 4 4. The corresonding seedu results for each algorithm are shown in Figure 7(d) 7(f), 7(g) 7(i) resectively. The estimated arallelism ortions are α.979,β.7263 for SP-MZ, and α.9892,β.86 for LU-MZ. The workloads are well balanced when the number of zones (e.g. 6) is divisible by the number of rocesses. The comarison results in Figure 7(f) and 7(i) show that the exerimental results are very close to estimated results when, 2, 4, 8, indicating that the erformance is fine and E-Amdahl s Law can be alied as a rediction model in these cases. However, the workload is not balanced (i.e. the number of zones cannot be evenly distributed among rocesses) when the number of rocesses equals to 3, 5, 6, 7. It leads to a negatively big imact on the overall arallel erformance (as illustrated in Figure 7(d),7(g)). With E-Amdahl s Law, users can have an intuitive understanding of how much erformance influence it has through comarisons of exerimental and estimated results as shown in Figure 7(f) and 7(i). C. Comarison on Estimated Seedu Figure 8 shows the erformance results of NPB-MZ benchmarks under different rocess-thread combinations with a fixed number of 8 CPUs. The exerimental seedus as well as the estimated seedus based on Amdahl s Law and our roosed E-Amdahl s Law are illustrated. We estimate seedu based on Amdahl s Law with the formula: α+. The α t estimated seedu based on E-Amdahl s Law is figured out by using Formula (7). It can be noted that for Amdahl s Law, there is no any difference in estimated seedus for diverse combinations of t under a fixed number of rocessors. The reason is that, Amdahl s Law is based on the single-level arallelism. There is no concet of combined coarse-grained and fine-grained arallelism in the single-level arallelism. For Amdahl s Law, it treats all sizes of granularity of arallelism to be the same. However, the multi-level arallelism is a nested arallelism with the granularity of arallelism from coarse to fine. For the multi-level arallelism, Amdahl s Law is unable to differentiate its coarse-grained and fine-grained arallelism. The seedu results based on Amdahl s Law become more inaccurate when there are more number of rocessors used for fine-grained arallelism. For examle, for benchmark SP-MZ, shown in Figure 8(b), when t 8, 4 2, 2 4, 8, the ratio of estimation errors 5 are.6%, 3.%, 86.7%, 27.5% resectively. In contrast, E-Amdahl s Law can well identify the coarse-grained and fine-grained arallelism and accurately estimate the seedu for the multi-level arallelism. Note that the 4 Class W refers to a tye of roblem size in NAS benchmark. It classifies the tyes of roblem size into S, W, A, B, C, etc 5 Ratio of estimation error R E, where R is the exerimental result, R E is the estimated result.

9 (a) α.977,β.5822 (b) α.979,β.7263 (c) α.9892,β.86 Fig. 8: Exerimental and estimated seedus of NPB-MZ for different combinations of t under a given total number of 8 rocessors. The seedu based on Amdahl s Law is estimated with the formula α+ t α. The seedu based on E-Amdahl s Law is estimated by using Formula (7). seedu curves of E-Amdahl s Law in Figure 8 are very close to the exerimental results. For the same examle of benchmark SP-MZ, under seedu estimation of E-Amdahl s Law, the ratio of estimation errors are.6%, 6.2%, 9.8%, 6.7% resectively for t 8, 4 2, 2 4, 8. Moreover, for these samling cases, the average ratio of estimation errors with E-Amdahl s Law for benchmarks BT-MZ, SP- MZ and LU-MZ are 25.5%, 8.3%, 3.% resectively. In contrast, the average ratio of estimation errors by using Amdahl s Law for benchmarks BT-MZ, SP-MZ and LU-MZ are 34.5%, 8.5%, 62.5% resectively. Particularly, we can note that the estimated seedu curves of BT-MZ are a bit far from the exerimental results in comarison to other two benchmarks based on E-Amdahl s Law. This is because the data size of zones for BT-MZ varies significantly, causing the workload unbalanced among rocesses and in turn reducing the overall erformance. In contrast, for benchmarks SP-MZ and LU-MZ, the data size of zones are the same, making workload well balanced among rocesses and thereby with a good erformance in arallelism. VII. CONCLUSION AND FUTURE WORK This aer studies seedu for multi-level arallel comuting. Both fixed-sized seedu model and fixed-time seedu model are considered. Moreover, we have derived E-Amdahl s Law for the high-level abstract fixed-size seedu, and E- Gustafson s Law for the high-level abstract fixed-time seedu for multi-level arallel comuting. The exeriment results illustrate that E-Amdahl s Law is much more accurate in erformance estimation than Amdahl s Law for multi-level arallelism in NPB benchmarks. Develoers can use E-Amdahl s Law in erformance rediction and erformance otimization for multi-level arallelism. In this aer, our multi-level seedu models is targeted at the homogeneous multi-level arallelism, which requires that all the rocessing elements of hardware architectures are identical. It is our future work to extend the seedu model to the heterogeneous multi-level arallelism by taking into account the different comuting caacities of heterogeneous rocessing elements. For examle, GPGPU has attracted significant amount of research in HPC systems [8], [9], [2]. The multi-gpus arallelism with the heterogeneous aradigm [2] belongs to a heterogeneous multi-level arallelism. Consider a GPU cluster [7] of comuting nodes each equied with multile GPUs. User need to consider different comuting caacities of CPU cores and GPUs in such heterogeneous rocessing environments. VIII. ACKNOWLEDGMENT This work was suorted by the roect User and Domain Driven Data Analytics, granted by A*STAR Thematic Strategic Research Programme. REFERENCES [] P. Balai, D. Buntinas, et al, Fine-Grained Multithreading Suort for Hybrid Threaded MPI Programming, International Journal of High Performance Comuting Alications, vol 24, , 2. [2] To5 suercomuter sites. htt:// //, 2. [3] R. Rabenseifner, G. Hager, G. Jost, Hybrid MPI/OenMP arallel rogramming on clusters of multi-core SMP nodes, 7th Euromicro International Conference on Parallel, Distributed and Network-based Processing, , 29. [4] G. Jost, H.Q. Jin, D.A. Mey, F.F. Hatay, Comaring the OenMP, MPI, and Hybrid Programming Paradigms on an SMP Cluster, NAS Technical Reort NAS-3-9, 23. [5] X.H. Sun, L.M. Ni, Another view on arallel seedu, Proceedings of Suercomuting 9 (Cat. No.9CH296-5), , 99. [6] G. M. Amdahl, Validity of the Single-Processor Aroach to Achieving Large Scale Comuting Caabilities, In AFIPS Conference Proceedings, , 967. [7] J.L. Gustafson, Reevaluating Amdahl s law, Communications of the ACM, v 3, n 5, , May 988. [8] NVIDIA CUDA C Programming Guide Version 4., htt://develoer.download.nvidia.com/comute/cuda/4 /toolkit/ docs/cuda C Programming Guide.df. May, 2. [9] X.H. Sun, J.L. Gustafson,Toward a better arallel erformance metric, Parallel Comuting, v 7, n -, 93-9, Dec. 99.

10 [] K.C. Sevcik, Characterizations of arallelism in alications and their use in scheduling, International Conference on Measurement and Modeling of Comuter Systems, PP. 7-8, May 989. [] X.H. Sun, L.M. Ni, Scalable roblems and memory-bounded seedu, Journal of Parallel and Distributed Comuting, v 9, n, 27-37, Set [2] L. Wolters, G. Cats, N. Gustafsson, Data-arallel numerical weather forecasting, Scientific Programming, v 9, n 3,. 59-7, Nov [3] B.M. Chaman, Parallel Alication Develoment with the Hybrid MPI+OenMP Programming Model, Proceedings of the 9th Euroean PVM/MPI Users Grou Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface,. 3, 22. [4] R.F. Van Der Wingaart, H. Jin, NAS Parallel Benchmarks, Multi-Zone Versions, NAS Technical Reort NAS-3-, NASA Ames Research Center, Moffett Field, CA, 23. [5] H. Q. Jin, R. F. Wingaart, Performance characteristics of the multi-zone NAS arallel benchmarks, 8th International arallel and distributed rocessing symosium,. 26-3, Aril, 24. [6] S. Sahni, V. Thanvantri, Performance metrics: keeing the focus on runtime, IEEE Parallel and Distributed Technology: Systems and Alications, v 4, no., , 996. [7] V.V. Kindratenko, J.J. Enos, et al, GPU clusters for higherformance comuting, 29 IEEE International Conference on Cluster Comuting and Workshos (CLUSTER),. -8, Set. 29. [8] W. Fang, B. He, Q. Luo, G. Naga, Mars: Accelerating MaReduce with Grahics Processors, Parallel and Distributed Systems, IEEE Transactions on, , 2. [9] He, B., Fang, W., Luo, Q., Govindarau, N.K., and Wang, T., Mars: a MaReduce framework on grahics rocessors, In Proceedings of PACT. 28, , 28. [2] Bingsheng He, Ke Yang, Rui Fang et al. Relational oins on grahics rocessors, In Proceedings of the 28 ACM SIGMOD international conference on Management of data (28), [2] Y. S. Tan, B.-S. Lee, R. H.Cambell, B. He. A Ma-Reduce Based Framework for Heterogeneous Processing Element Cluster Environments. CCGrid 22,.-8. Note that Equation (23) is identical to Equation (2). Thereby, Equation (6) and (2) are equivalent when i m. (ii). Induction Ste: Suose it is true that Equation (6) and (2) are equivalent for i k, ( <k m). We need to rove the roosition is true when i k : In terms of Equation (2), the scaled ortion of arallelism at i k level is f f(k )(k )s(k) (i k ) f(k ) + f(k )(k )s(k). (24) Based on E-Amdahl s Law of Equation (6), there is, s(i k ) f (k ) + f (k ) (k )s(k) f(k )(k )s(k) f(k )+f(k )(k )s(k) + f(k )(k )s(k) f(k )+f(k )(k )s(k) (k )s(k) f(k ) + f(k )(k )s(k). (25) Note that when i k, Equation (2) is equivalent to Equation (25). Hence, the roosition is true when i k. (iii). Conclusion Ste: according to (i) and (ii), it is true that Equation (6) and (2) are equivalent for i m. APPENDIX A EQUIVALENCE PROOF OF E-AMDAHL S LAW AND E-GUSTAFSON S LAW The following is the roof of the equivalence between Equation (6) and (2) by the induction method with the reverse order. Proof: (i). Base Case: When i m, in terms of Equation (2), the scaled ortion of arallelism is f f(m)(m) (m) f(m)+f(m)(m). (22) According to E-Amdahl s Law of Equation (6), it holds: s(i m) f (m)+ f (m) (m) f(m)(m) f(m)+f(m)(m) + f(m)(m) f(m)+f(m)(m) (m) f(m)+f(m)(m). (23)

Analysis of execution time for parallel algorithm to dertmine if it is worth the effort to code and debug in parallel

Performance Analysis Introduction Analysis of execution time for arallel algorithm to dertmine if it is worth the effort to code and debug in arallel Understanding barriers to high erformance and redict