Scalable Scheduling in Parallel Processors


2002 Conference on Information Sciences and Systems, Princeton University, March 20, 2002

Jui Tsun Hung (1), Hyoung Joong Kim (2), Thomas G. Robertazzi (3)

(1) Jui Tsun Hung, Department of Electrical and Computer Engineering, University at Stony Brook, trent@ece.sunysb.edu
(2) Hyoung Joong Kim, Department of Control and Instrumentation Engineering, Kangwon National University, hj@cc.kangwon.ac.kr
(3) Thomas G. Robertazzi, Senior Member, IEEE, Associate Professor, Department of Electrical and Computer Engineering, University at Stony Brook, Stony Brook, NY 11794, tom@ece.sunysb.edu

Abstract

A study of scalable data intensive scheduling involving load distribution on multi-level fat tree networks is presented. It is shown that the speedup of processing divisible loads concurrently on a homogeneous single level tree increases linearly as the number of processors is increased. This behavior is markedly different from that of past research on sequential scheduling, where speedup saturates beyond a certain number of processors. It is demonstrated that, on a multi-level fat tree, the ultimate limiting factor for speedup is the hardware, not the scheduling strategy.

I. Introduction

The processing of massive amounts of data on distributed and parallel networks is becoming more and more common. Applications include large scientific experiments, database applications, image processing, and sensor data processing. Over the past dozen years, a number of researchers have mathematically modeled such processing using a divisible load scheduling model [] that is useful for data parallelism applications. Divisible loads are ones that consist of data that can be arbitrarily partitioned among a number of processors interconnected through some network. Divisible load modeling usually assumes no precedence relations amongst the data. Due to the linearity of the divisible model, optimal scheduling strategies under a variety of environments have been devised.

The majority of the divisible load scheduling literature has appeared in computer engineering periodicals. However, divisible load modeling should be of interest to the networking community, as it models both computation and network communication in a completely seamless, integrated manner, and it is tractable with its linearity assumption. It has been used to accurately and directly model such features as specific network topologies [-0], computation versus communication load intensity [][], time varying inputs [10], multiple job submission [], and monetary cost optimization [11][12].

However, researchers in this field have noted an important performance saturation limit. If speedup (or solution time) is considered as a function of the number of processors, an asymptotic constant is reached as the number of processors is increased. Beyond a certain point, adding processors results in minimal performance improvement. In other words, the scheduling strategies considered to date have not been scalable. For the first interconnection topology considered in the literature, the linear daisy chain [], the saturation limit is usually explained by noting that, if load originates at a processor at a boundary of the chain, data must be transmitted and retransmitted i times from processor to processor before it arrives at the ith processor (assuming nodes with store and forward transmission). However, for subsequent interconnection topologies considered (e.g. bus, single level tree, hypercube), the reason for this lack of scalability has been less obvious.

In this paper, it is demonstrated that the saturation occurs because of the assumption that a node distributes load sequentially, to one of its children at a time. This is true for both the single and multi-installment scheduling strategies discussed to date [][5]. It is shown here that in a single level tree (star topology), if a processor can distribute load to all of its children concurrently, the speedup is a linear function of the number of processors. The only scalability
limitations are a proportionality constant, which depends on system parameters, and the ability of a processor to distribute loads concurrently to all of its outgoing links.

How might one implement the strategy that a processor distributes load concurrently to all its neighbors? A direct method, similar to what is done in packet switches, is to envision that a processor has a CPU and an output buffer for each output link. Scalability can be achieved as long as the CPU can effectively distribute load to all of its output buffers concurrently. Naturally, as the number of neighbors of a processor node is increased, a point will be reached where the CPU cannot load the buffers as fast as those buffers are being emptied. However, for a given CPU, this architecture allows a better use of the time shared CPU, up to a large number of children nodes, than the earlier assumption that a CPU distributes load sequentially and completely to one output buffer and link at a time. Certainly, one can also envision high performance architectures such as multiple CPUs within the same node. Through a consideration of concurrent load distribution by a node to its neighbors, one can see that the ultimate limitation is on hardware, not scheduling, as it should be.

Note that previous related work on scalability issues for parallel processing includes an experimental study of several real time load balancing schemes [13] and an experimental study of scalable scheduling for function parallelism on distributed memory multiprocessors [14]. It has been known on an intuitive basis that network elements should be kept constantly busy for good performance [15]. This research quantifies that intuition mathematically.

This paper begins with the development of some notation and analytic background in section II. The case where parent nodes do both load distribution and data computation is treated in section III. The conclusion appears in section V.

II. Model, Notation and Load Distribution

In this paper, a homogeneous multi-level fat tree network where root processors are equipped with a front-end processor for off-loading communications is considered. Root nodes, called intelligent roots, process a fraction of the load as well as distribute the remaining load to their children processors (see Fig. 1).

Figure 1: Homogeneous multi-level fat tree with intelligent root.

First of all, a heterogeneous single level fat tree, level $i+1$, with intelligent root is described as follows. All the children processors are connected to the root (parent) processor via communication links. Fig. 2 shows that an intelligent root processes a fraction of the load as well as distributes the remaining load to its children processors.

Figure 2: A heterogeneous single level fat tree, level $i+1$, with intelligent root.

Note that each child processor starts computing and transmitting immediately after receiving its assigned fraction of load and continues without any interruption until all of its assigned load fraction has been processed. This is a store and forward mode of operation for computation and communication. The root can begin processing at time 0, the time when all the load is assumed to be present at the root.

The notations for a single heterogeneous tree are

$\alpha_0$: The load fraction assigned to the root processor.
$\alpha_i$: The load fraction assigned to the $i$th link-processor pair.
$w_i$: The inverse of the computing speed of the $i$th processor.
$z_i$: The inverse of the link speed of the $i$th link.
$T_{cp}$: Computing intensity constant. The entire load can be processed in $w_i T_{cp}$ seconds on the $i$th processor.
$T_{cm}$: Communication intensity constant. The entire load can be transmitted in $z_i T_{cm}$ seconds over the $i$th link.
$T_f$: The finish time, i.e. the time at which the last processor completes its computation.

Therefore, $\alpha_i w_i T_{cp}$ is the time to process the fraction $\alpha_i$ of the entire load on the $i$th processor. Note that the units of $\alpha_i w_i T_{cp}$ are [load] × [sec/load] × [dimensionless quantity] = [seconds].

For a multi-level homogeneous fat tree, the notations are
$\alpha^j_0$: The load fraction assigned to the root processor of an equivalent $j$th level tree.
$\alpha^j_i$: The load fraction assigned to the $i$th link-processor pair on an equivalent $j$th level tree.
$w_{eqi}$: The inverse of the equivalent computing speed of the $i$th level tree (from level $i$ descending to level 1).
$p_i$: The multiplier of the inverse of the expanded capacity of the links of level $i+1$ with respect to the inverse of the capacity of the links on level 1.

The value of the multiplier, $p_i$, is the inverse of the total number of children processors descended from this link. In other words, $p_i = \left(\sum_{j=0}^{i} m^j\right)^{-1}$, and $0 < p_i \le 1$.

The following assumptions are initially made:

The interconnection network used is a star network (single level tree network).
The computing and communication loads are divisible (i.e. perfectly partitioned with no precedence constraints) [].
Transmission and computation time are proportional (linear) to the size of the problem.
Each node transmits load concurrently (simultaneously) to its children.
Store and forward is the method of transmission from level to level.

III. Intelligent Root Fat Tree

Single Level Tree, Level $i+1$, with Intelligent Root

A single level tree network, level $i+1$, with intelligent root has $m+1$ processors and $m$ links (see Fig. 2). All children processors are connected to the root processor via direct communication links. The intelligent root processor, assumed to be the only processor at which the divisible load arrives, partitions the total processing load into $m+1$ fractions, keeps its own fraction $\alpha_0$, and distributes the other fractions $\alpha_1, \alpha_2, \ldots, \alpha_m$ to the children processors respectively and concurrently. Each processor begins computing immediately after receiving its assigned fraction of load and continues without any interruption until all of its assigned load fraction has been processed. In order to minimize the processing finish time, all of the utilized processors in the network must finish computing at the same time [].

The process of load distribution can be represented by Gantt chart-like timing diagrams, as illustrated in Fig. 3. Note that this is a completely deterministic model. From the timing diagram, Fig. 3, an equation for the root and 1st child's solution time is

$$\alpha_0 w_0 T_{cp} = \alpha_1 z_1 p_i T_{cm} + \alpha_1 w_1 T_{cp} \tag{1}$$

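Figure 3: Timing diagram of single level fat tree, level $i+1$, with intelligent root.

The fundamental recursive equations of the system can be formulated as follows:

$$\alpha_1 z_1 p_i T_{cm} + \alpha_1 w_1 T_{cp} = \alpha_2 z_2 p_i T_{cm} + \alpha_2 w_2 T_{cp} \tag{2}$$

$$\alpha_{i-1} z_{i-1} p_i T_{cm} + \alpha_{i-1} w_{i-1} T_{cp} = \alpha_i z_i p_i T_{cm} + \alpha_i w_i T_{cp} \tag{3}$$

$$\alpha_{m-1} z_{m-1} p_i T_{cm} + \alpha_{m-1} w_{m-1} T_{cp} = \alpha_m z_m p_i T_{cm} + \alpha_m w_m T_{cp} \tag{4}$$

In these equations the subscript of $p_i$ denotes the level of the tree, while the subscripts of $\alpha$, $w$ and $z$ index the children of the root.

The normalization equation for the single level tree with intelligent root is

$$\alpha_0 + \alpha_1 + \alpha_2 + \cdots + \alpha_m = 1 \tag{5}$$

This gives $m+1$ linear equations with $m+1$ unknowns.

For a multi-level fat tree with intelligent root (see Fig. 1), the normalization equation for each level $j$ (equivalent to a single level tree) is

$$\alpha^j_0 + \alpha^j_1 + \alpha^j_2 + \cdots + \alpha^j_m = 1, \qquad j = 1, 2, \ldots, k \tag{6}$$

Here $\alpha^j_i$ is the fraction of load which one of layer $j$'s processors (one root node in level $j$) distributes to the $i$th child processor.

Now, one can manipulate equations (2)-(4) to yield the solution

$$\alpha_i = \left(\frac{p_i z_{i-1} T_{cm} + w_{i-1} T_{cp}}{p_i z_i T_{cm} + w_i T_{cp}}\right) \alpha_{i-1}, \qquad i = 2, 3, \ldots, m \tag{7}$$

Let

$$f_i = \frac{p_i z_{i-1} T_{cm} + w_{i-1} T_{cp}}{p_i z_i T_{cm} + w_i T_{cp}} \tag{8}$$

then

$$\alpha_i = f_i \alpha_{i-1} = \left(\prod_{j=2}^{i} f_j\right) \alpha_1 \tag{9}$$

$$\alpha_i = \left(\frac{p_i z_1 T_{cm} + w_1 T_{cp}}{p_i z_i T_{cm} + w_i T_{cp}}\right) \alpha_1, \qquad i = 2, 3, \ldots, m \tag{10}$$

From equation (8), $\prod_{l=2}^{i} f_l$ can be simplified, since the product telescopes, as

$$\prod_{l=2}^{i} f_l = \frac{p_i z_1 T_{cm} + w_1 T_{cp}}{p_i z_i T_{cm} + w_i T_{cp}}, \qquad i = 2, 3, \ldots, m \tag{11}$$

In order to solve the set of equations, $\alpha_0$ is defined as:

$$\alpha_0 = \frac{1}{w_0 T_{cp}} \left(p_i z_1 T_{cm} + w_1 T_{cp}\right) \alpha_1 \tag{12}$$

If one substitutes this equation into the normalization equation, the normalization equation becomes

$$\frac{1}{w_0 T_{cp}} \left(p_i z_1 T_{cm} + w_1 T_{cp}\right) \alpha_1 + \alpha_1 + f_2 \alpha_1 + \cdots + f_2 f_3 \cdots f_m \alpha_1 = 1 \tag{13}$$

Utilizing equation (11) and solving once again for $\alpha_1$:

$$\alpha_1 = \frac{1}{\frac{1}{w_0 T_{cp}} \left(p_i z_1 T_{cm} + w_1 T_{cp}\right) + 1 + \sum_{n=2}^{m} \prod_{l=2}^{n} f_l} = \frac{1}{\left(p_i z_1 T_{cm} + w_1 T_{cp}\right) \left[\frac{1}{w_0 T_{cp}} + \sum_{l=1}^{m} \frac{1}{p_i z_l T_{cm} + w_l T_{cp}}\right]}$$

Accordingly,

$$\alpha_0 = \frac{\frac{1}{w_0 T_{cp}}}{\frac{1}{w_0 T_{cp}} + \sum_{l=1}^{m} \frac{1}{p_i z_l T_{cm} + w_l T_{cp}}} \tag{14}$$

More generally, if we define $\prod_{j=2}^{1} f_j = 1$, then

$$\alpha_i = \frac{\frac{1}{p_i z_i T_{cm} + w_i T_{cp}}}{\frac{1}{w_0 T_{cp}} + \sum_{l=1}^{m} \frac{1}{p_i z_l T_{cm} + w_l T_{cp}}} \tag{15}$$

for $i = 1, 2, \ldots, m$. From Fig. 4, the finish time at which a solution is achieved is:

$$T_{f,m} = \alpha_1 \left(p_i z_1 T_{cm} + w_1 T_{cp}\right) = \frac{1}{\frac{1}{w_0 T_{cp}} + \sum_{l=1}^{m} \frac{1}{p_i z_l T_{cm} + w_l T_{cp}}} \tag{16}$$

As a special case, consider the situation of a homogeneous network where all children processors have the same inverse computing speed and all links have the same inverse transmission speed (i.e. $w_i = w$ and $z_i = z$ for $i = 1, 2, \ldots, m$). Therefore, from (8), $f_i$ is equal to 1 (for $i = 2, 3, \ldots, m$). Note that, for the root, $w_0$ can be different from $w_i$.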
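The closed-form load fractions and finish time derived above for the single level tree are easy to evaluate numerically. The following Python sketch is an illustration only and is not taken from the paper: the function name `single_level_allocation` and its argument names are assumptions, `p` stands for the level multiplier $p_i$, and `t_cp` and `t_cm` default to 1 purely for convenience.

```python
from typing import List, Sequence, Tuple


def single_level_allocation(
    w0: float,
    w: Sequence[float],
    z: Sequence[float],
    p: float = 1.0,
    t_cp: float = 1.0,
    t_cm: float = 1.0,
) -> Tuple[float, List[float], float]:
    """Return (alpha_0, [alpha_1..alpha_m], finish_time) for a single level
    tree whose intelligent root sends load to all children concurrently.

    Hypothetical helper: w0 is the root's inverse computing speed, w[i] and
    z[i] are the inverse computing and link speeds of child i+1, and p is the
    link multiplier of the level.
    """
    if len(w) != len(z):
        raise ValueError("w and z must have one entry per child")
    if w0 <= 0 or any(x <= 0 for x in w) or any(x < 0 for x in z):
        raise ValueError("inverse speeds must be positive (links non-negative)")

    # Time child i needs per unit of load: receive it (store and forward),
    # then compute it.
    per_unit = [p * zi * t_cm + wi * t_cp for wi, zi in zip(w, z)]

    # Every processor finishes at the same instant T_f, so
    # alpha_0 = T_f / (w0 * T_cp) and alpha_i = T_f / per_unit[i].
    # Normalization (fractions sum to 1) then fixes T_f.
    rate_sum = 1.0 / (w0 * t_cp) + sum(1.0 / c for c in per_unit)
    finish_time = 1.0 / rate_sum

    alpha0 = finish_time / (w0 * t_cp)
    alphas = [finish_time / c for c in per_unit]
    return alpha0, alphas, finish_time


if __name__ == "__main__":
    # Heterogeneous example: two children, the second computes half as fast.
    a0, a, tf = single_level_allocation(1.0, [1.0, 2.0], [1.0, 1.0])
    print("alpha_0 =", a0, "alphas =", a, "finish time =", tf)

    # Homogeneous example: all children receive equal fractions.
    a0, a, tf = single_level_allocation(1.0, [1.0] * 4, [1.0] * 4)
    print("alpha_0 =", a0, "alphas =", a, "finish time =", tf)
```

With unit constants, `w = [1, 2]`, `z = [1, 1]` and `w0 = 1`, the fractions work out to 6/11, 3/11 and 2/11, and each processor is busy until time 6/11, which is the simultaneous-finish condition the derivation relies on.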

For a single level tree, let $T^h_{f,0}$ be the solution time for the entire divisible load solved on the root processor and let $T^h_{f,m}$ be the solution time solved on the whole tree. Here,

$$T^h_{f,0} = w_0 T_{cp}, \qquad T^h_{f,m} = \frac{1}{\frac{1}{w_0 T_{cp}} + \frac{m}{p_i z T_{cm} + w T_{cp}}} \tag{17}$$

Consequently,

$$\text{Speedup} = \frac{T^h_{f,0}}{T^h_{f,m}} = w_0 T_{cp} \left(\frac{1}{w_0 T_{cp}} + \frac{m}{p_i z T_{cm} + w T_{cp}}\right) = 1 + \frac{w_0 T_{cp}}{p_i z T_{cm} + w T_{cp}}\, m \tag{18}$$

Here, speedup is the effective processing gain in using $m+1$ processors. Our finding is that the speedup of the single level homogeneous tree is equal to $\Theta(m)$, that is, proportional to the number of children per node, $m$. Speedup is linear as long as the root CPU can concurrently (simultaneously) transmit load to all of its children. That is, the speedup of the single level tree does not saturate (in contrast to the sequential load distribution as in []).

Multiple Level Fat Tree with Intelligent Root

Consider a homogeneous multi-level fat tree network where all processors have the same inverse computing speed, $w$, and links of level $i+1$ have the inverse transmission speed $z p_i$ (see Fig. 1), where

$$p_i = \left[\sum_{j=0}^{i} m^j\right]^{-1} \tag{19}$$

The process of load distribution for the multi-level fat tree network using the store and forward strategy for computing and communicating can be represented by Gantt chart-like timing diagrams (see Fig. 4).

Figure 4: Timing diagram of multi-level fat tree using store and forward strategy.

For the lowest single level tree, level 1 (see Fig. 5), the inverse computational speed of an equivalent processor is defined as $w_{eq1}$. This is a valid concept as the model is a linear one (as in a Norton's equivalent queue, for instance [16]). Therefore, from equation (17), the computation time of level 1 is:

$$w_{eq1} T_{cp} = \frac{w T_{cp}}{1 + m q_0} \tag{20}$$

for $q_0 = w T_{cp} / (z T_{cm} + w T_{cp})$. Let $\sigma = z T_{cm} / (w T_{cp})$, then

$$q_0 = \frac{1}{1 + \sigma} \tag{21}$$

If $w_{eq0}$ is defined as $w$, $\gamma_0$ can be defined as $w_{eq0} / w = 1$. Hence, equation (21) can be transformed to

$$q_0 = \frac{1}{1 + \sigma} = \frac{1}{\gamma_0 + \sigma} \tag{22}$$

Figure 5: Level 1 of multi-level fat tree with intelligent root.

The goal here is to find an expression for an equivalent processor that has the same load processing characteristics as the entire homogeneous fat tree. Our strategy is to replace each of the lowest most single level tree networks (which we call level 1) with an equivalent processor. Proceeding recursively up the tree, we replace each of the current lowest most single level subtrees with an equivalent processor. This continues until the entire homogeneous fat tree network is replaced by
a single equivalent processor, with inverse processing speed $w_{eqk}$. Here, $k$ is the $k$th level. Levels here are numbered from the bottom level upwards. In terms of notation, this is done from level 1 (the two bottom most layers), level 2 (the current next bottom most two layers), up to the top level $k$ (the top two layers) (see Fig. 1). Note that for the entire initial (1st level) equivalent processor replacement, both parent and children processors have the same inverse speed, $w$ (see Fig. 5). At the $k$th level (equivalent to a single level tree), the parent will have inverse speed $w$, and its children will have equivalent inverse speed $w_{eq(k-1)}$ (see Fig. 6).

Figure 6: Level $k$ of multi-level fat tree with intelligent root.

Referring to equations (20) and (22), the equivalent computation time for the 1st level can be defined as:

$$w_{eq1} T_{cp} = w T_{cp}\, \frac{\gamma_0 + \sigma}{m + \gamma_0 + \sigma} \tag{23}$$

For level 2 (see Fig. 7), the equivalent inverse computational speed is defined as $w_{eq2}$. Therefore, from equation (17), the computation time is

$$w_{eq2} T_{cp} = \frac{w T_{cp}}{1 + m q_1} \tag{24}$$

Figure 7: Level 2 of multi-level fat tree with intelligent root.

Here, with $w_0 = w$ and $w_1 = \cdots = w_m = w_{eq1}$,

$$q_1 = \frac{w T_{cp}}{p_1 z T_{cm} + w_{eq1} T_{cp}} \tag{25}$$

Let $\gamma_1 = w_{eq1} / w$, then

$$q_1 = \frac{1}{\frac{w_{eq1}}{w} + p_1 \sigma} = \frac{1}{\gamma_1 + p_1 \sigma} \tag{26}$$

Referring to equation (24), the equivalent computation time of level 2 is given as follows:

$$w_{eq2} T_{cp} = w T_{cp}\, \frac{\gamma_1 + p_1 \sigma}{m + \gamma_1 + p_1 \sigma} \tag{27}$$

Therefore, for a $k$th level subtree (see Fig. 6), the equivalent computation time is

$$w_{eqk} T_{cp} = w T_{cp}\, \frac{\gamma_{k-1} + p_{k-1} \sigma}{m + \gamma_{k-1} + p_{k-1} \sigma} \tag{28}$$

Referring to equation (28),

$$\gamma_k = \frac{w_{eqk}}{w} = \frac{\gamma_{k-1} + p_{k-1} \sigma}{m + \gamma_{k-1} + p_{k-1} \sigma} \tag{29}$$

$$\gamma_k = \frac{\gamma_{k-1} + \left(\sum_{j=0}^{k-1} m^j\right)^{-1} \sigma}{m + \gamma_{k-1} + \left(\sum_{j=0}^{k-1} m^j\right)^{-1} \sigma} \tag{30}$$

Consequently, $\gamma_k$ is a recursive function. The value $1/\gamma_k$ is the speedup of a multi-level fat tree network with concurrent load distribution on each level and with store and forward computation and communication from level to level.

Let $T^e_{f,0}$ be the equivalent solution time for the entire divisible load solved on only one processor and let $T^{e,k}_{f,m}$ be the equivalent solution time of a whole homogeneous $k$-level fat tree network, on which each level has $m$ children processors as well as the root processor. Then, for the entire load, $T^e_{f,0} = w T_{cp}$ and $T^{e,k}_{f,m} = w_{eqk} T_{cp}$. Consequently,

$$\text{Speedup} = \frac{T^e_{f,0}}{T^{e,k}_{f,m}} = \frac{w T_{cp}}{w_{eqk} T_{cp}} = \frac{1}{\gamma_k} \tag{31}$$

$$\text{Speedup} = \frac{m + \gamma_{k-1} + \left(\sum_{j=0}^{k-1} m^j\right)^{-1} \sigma}{\gamma_{k-1} + \left(\sum_{j=0}^{k-1} m^j\right)^{-1} \sigma} \tag{32}$$

Several special cases are worth noting. If $m = 1$ and $p_i = 1$, this model is the same as a linear network with store and forward strategy. If $m = 2$, this model is a binary fat tree. If $m = 3$, this model is a ternary fat tree. If $p_i = 1$, this model is not a fat tree: each link in this model has the same transmission speed.
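The recursion for $\gamma_k$ can be iterated directly from the bottom level upwards. The short Python sketch below is only an illustration under stated assumptions, not code from the paper: the function name `fat_tree_speedup` and its parameters are hypothetical, `m` is the number of children per node, `levels` is $k$, and `sigma` is $\sigma = z T_{cm} / (w T_{cp})$.

```python
def fat_tree_speedup(m: int, levels: int, sigma: float) -> float:
    """Speedup 1/gamma_k of a homogeneous k-level fat tree with intelligent
    roots, concurrent distribution and store and forward between levels.

    Hypothetical helper that iterates
        gamma_k = (gamma_{k-1} + p_{k-1} * sigma) / (m + gamma_{k-1} + p_{k-1} * sigma)
    starting from gamma_0 = 1.
    """
    if m < 1 or levels < 0 or sigma < 0:
        raise ValueError("need m >= 1, levels >= 0 and sigma >= 0")

    gamma = 1.0        # gamma_0 = w_eq0 / w = 1 (a single processor)
    geometric_sum = 0  # running value of sum_{j=0}^{k-1} m**j

    for k in range(1, levels + 1):
        # Link multiplier of level k: p_{k-1} = 1 / sum_{j=0}^{k-1} m**j.
        geometric_sum += m ** (k - 1)
        p = 1.0 / geometric_sum

        # One step of the recursion: replace the level-k subtree by an
        # equivalent processor.
        x = gamma + p * sigma
        gamma = x / (m + x)

    return 1.0 / gamma


if __name__ == "__main__":
    # Speedup of binary fat trees of increasing depth for two values of sigma.
    for sigma in (0.0, 1.0):
        for levels in range(0, 5):
            print(sigma, levels, fat_tree_speedup(2, levels, sigma))
```

As a sanity check on the sketch, a single level with `m = 2` and `sigma = 1` returns a speedup of 2, which agrees with the single level homogeneous result $1 + m/(1+\sigma)$ when $w_0 = w$ and $p_0 = 1$.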

If $\left(\sum_{j=0}^{k-1} m^j\right)^{-1} \sigma$ approaches zero, the model approaches an ideal case: each node can receive the load instantly and compute the data immediately. Under this assumption, the recursive function (30) can be simplified as

$$\gamma_k = \frac{\gamma_{k-1}}{m + \gamma_{k-1}} \tag{33}$$

A closed form solution is

$$\gamma_k = \frac{1}{m^0 + m^1 + m^2 + \cdots + m^k} \tag{34}$$

$$\text{Speedup} = \frac{1}{\gamma_k} = \sum_{j=0}^{k} m^j \tag{35}$$

Speedup is proportional to the total number of nodes, which is $m^0 + m^1 + m^2 + \cdots + m^k$. Note that, from (33), we can derive

$$\text{Speedup} = \frac{1}{\gamma_k} = 1 + m \left(\frac{1}{\gamma_{k-1}}\right) \tag{36}$$

This equation expresses that the speedup of a $k$-level fat tree is the sum of the speedup of the root and the speedups of its $m$ children. The speedup of the $k$-level equivalent tree is $\Theta(m^k)$, where $m$ is the number of children per node. As the number of levels of a tree increases, the speedup approaches a linear function of the number of processors. Therefore, the multi-level fat tree will not saturate.

IV. Kim Type Scheduling

We note that the use of Kim type scheduling [17], where processing at a child node commences as soon as load begins to be received, can be analyzed in a similar manner to that described here. Performance should improve somewhat because of the expedited computing in this case.

V. Discussion and Conclusions

This research confirms two important points. Firstly, up to the limit of CPU speed, concurrent load distribution for a single level tree leads to a linear speedup as a function of the number of children. Secondly, the use of store and forward load distribution for a fat tree leads to a speedup which approaches a linear speedup.

Acknowledgments

The support of NSF grant CCR9933 in the course of this research is gratefully acknowledged.

References

[1] V. Bharadwaj, D. Ghose, V. Mani, and T. G. Robertazzi, Scheduling Divisible Loads in Parallel and Distributed Systems, IEEE Computer Society Press, Los Alamitos, CA, 1996.
[2] Y. C. Cheng and T. G. Robertazzi, Distributed computation with communication delays, IEEE Transactions on Aerospace and Electronic Systems, vol. , pp. 60 79, 1988.
[3] G. D. Barlas, Collection aware optimum sequencing of operations and closed form solutions for the distribution of divisible load on arbitrary processor trees, IEEE Transactions on Parallel and Distributed Systems, vol. 9, no. 5, pp. 49 44, May 1998.
[4] S. Bataineh and T. G. Robertazzi, Bus oriented load sharing for a network of sensor driven processors, IEEE Transactions on Systems, Man and Cybernetics, vol. , no. 5, p. 05, 99.
[5] V. Bharadwaj, D. Ghose, and V. Mani, Multi-installment load distribution in tree networks with delay, IEEE Transactions on Aerospace and Electronic Systems, vol. 3, no. , pp. , 1995.
[6] V. Bharadwaj, D. Ghose, and V. Mani, An efficient load distribution strategy for a distributed linear network of processors with communication delays, Computers and Mathematics with Applications, vol. 9, no. 9, pp. 95, 1995.
[7] J. Blazewicz and M. Drozdowski, Scheduling divisible jobs on hypercubes, Parallel Computing, vol. , no. , pp. , 1995.
[8] J. Blazewicz and M. Drozdowski, The performance limits of a two dimensional network of load sharing processors, Foundations of Computing and Decision Sciences, vol. , no. , pp. 3 5, 1996.
[9] J. Blazewicz and M. Drozdowski, Distributed processing of divisible jobs with communication start-up costs, Discrete Applied Mathematics, vol. 76, no. -3, pp. 4, May 1997.
[10] J. Sohn and T. G. Robertazzi, Optimal time varying load sharing for divisible loads, IEEE Transactions on Aerospace and Electronic Systems, vol. 34, no. 3, pp. , July 1998.
[11] J. Sohn, T. G. Robertazzi, and S. Luryi, Optimizing computing costs using divisible load analysis, IEEE Transactions on Parallel and Distributed Systems, vol. 9, pp. 5 34, March 1998.
[12] S. Charcranoon, T. G. Robertazzi, and S. Luryi, Cost efficient load sequencing in single-level tree networks, in Proceedings 1998 Conference on Information Sciences and Systems, Princeton University, Princeton, NJ, March 1998.
[13] V. Kumar, A. Y. Grama, and N. R. Vempaty, Scalable load balancing techniques for parallel computers, Journal of Parallel and Distributed Computing, vol. , pp. 60 79, 1994.
[14] S. Pande, D. P. Agrawal, and J. Mauney, A scalable scheduling scheme for functional parallelism on distributed memory multiprocessor systems, IEEE Transactions on Parallel and Distributed Systems, vol. 6, pp. , 1995.
[15] David E. Culler and Jaswinder Pal Singh, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann Publishers, 1999.
[16] U. Herzog, L. Woo, and K. M. Chandy, Solution of Queueing Problems by a Recursive Technique, IBM Journal of Research and Development, pp. , May 1975.
[17] H.-J. Kim, A Novel Load Distribution Algorithm for Divisible Loads, Special Issue of Cluster Computing on Divisible Load Scheduling, Summer/Fall, 00.
