
DECENTRALIZED LIST SCHEDULING

MARC TCHIBOUKDJIAN, NICOLAS GAST, AND DENIS TRYSTRAM

arXiv:1107.3734v1 [cs.DC] 19 Jul 2011

ABSTRACT. Classical list scheduling is a very popular and efficient technique for scheduling jobs on parallel and distributed platforms. It is inherently centralized. However, with the increasing number of processors, the cost of managing a single centralized list becomes prohibitive. A suitable approach to reduce the contention is to distribute the list among the computational units: each processor has only a local view of the work to execute. Thus, the scheduler is no longer greedy and standard performance guarantees are lost. The objective of this work is to study the extra cost that must be paid when the list is distributed among the computational units. We first present a general methodology for computing the expected makespan based on the analysis of an adequate potential function which represents the load imbalance between the local lists. We obtain an equation on the evolution of the potential by computing its expected decrease in one step of the schedule. Our main theorem shows how to solve such equations to bound the makespan. Then, we apply this method to several scheduling problems, namely unit independent tasks, weighted independent tasks and tasks with precedence constraints. More precisely, we prove that the time for scheduling a global workload $W$ composed of independent unit tasks on $m$ processors is equal to $W/m$ plus an additional term proportional to $\log_2 W$. We provide a lower bound which shows that this is optimal up to a constant. This result is extended to the case of weighted independent tasks. In the last setting, precedence task graphs, our analysis leads to an improvement on the bound of Arora et al (2001). We finally provide some experiments using a simulator. The distribution of the makespan is shown to fit existing probability laws.
Moreover, the simulations give better insight into the additive term, whose value is shown to be around $3 \log_2 W$, confirming the tightness of our analysis.

1. INTRODUCTION

1.1. Context and motivations. Scheduling is a crucial issue when designing efficient parallel algorithms on new multi-core platforms. The problem consists in distributing the tasks of an application (which we call the load) among the available computational units and determining at what time they will be executed. The most common objective is to minimize the completion time of the latest task to be executed (called the makespan and denoted by $C_{\max}$). It is a hard, challenging problem which has received a lot of attention during the last decade (Leung, 2004). Two new books have been published recently on the topic (Drozdowski, 2009; Robert and Vivien, 2009), which confirms how active the area is.

List scheduling is one of the most popular techniques for scheduling the tasks of a parallel program. This algorithm was introduced by Graham (1969) and has been used with profit in many further works (for instance the earliest task first heuristic, which extends the analysis to communication delays in Hwang et al (1989), to uniform machines in Chekuri and Bender (2001), or to parallel rigid jobs in Schwiegelshohn et al (2008)). Its principle is to build a list of ready tasks and schedule them as soon as there exist available resources. List scheduling algorithms are low-cost (greedy) algorithms whose performance is not too far from that of optimal solutions. Most proposed list algorithms differ in the way they consider the priority of the tasks when building the list, but they always assume a centralized management of the list. However, today's parallel and distributed platforms involve more and more processors. Thus, the time needed for managing such a centralized data structure can no longer be ignored. In practice, implementing such schedulers induces synchronization overheads when several processors access the list concurrently. Such overheads involve low-level synchronization mechanisms.

1.2. Related works. Most related works dealing with scheduling consider centralized list algorithms. However, at execution time, the cost of managing the list is neglected. To our knowledge, the only approach that takes this extra management cost into account is work stealing (Blumofe and Leiserson, 1999), denoted by WS for short. Contrary to classical centralized scheduling techniques, WS is by nature a distributed algorithm. Each processor manages its own list of tasks. When a processor becomes idle, it randomly chooses another processor and steals some work. To model contention overheads, processors that request work on the same remote list are in competition and only one can succeed. WS has been implemented in many languages and parallel libraries including Cilk (Frigo et al, 1998), TBB (Robison et al, 2008) and KAAPI (Gautier et al, 2007). It was analyzed in a seminal paper of Blumofe and Leiserson (1999), where they show that the expected makespan of a series-parallel precedence graph with $W$ unit tasks on $m$ processors is bounded by $E[C_{\max}] \le W/m + O(D)$, where $D$ is the critical path of the graph (its depth). This analysis has been improved in Arora et al (2001) using a proof based on a potential function. The case of varying processor speeds has been analyzed in Bender and Rabin (2002).
However, in all these previous analyses, the precedence graph is constrained to have only one source and out-degree at most 2, which does not easily model the basic case of independent tasks. Simulating independent tasks with a binary tree of precedences gives a bound of $W/m + O(\log W)$, as a complete binary tree of $W$ nodes has a depth of $D \le \log_2 W$. However, with this approach, the structure of the binary tree dictates which tasks are stolen. Our approach achieves a bound of the same order with a better constant, and processors are free to choose which tasks to steal. Notice that there exist other ways to analyze work stealing, where the work generation is probabilistic and which target steady-state results (Berenbrink et al, 2003; Mitzenmacher, 1998; Gast and Gaujal, 2010).

Another related approach which deals with distributed load balancing is balls-into-bins games (Azar et al, 1999; Berenbrink et al, 2008). The principle is to study the maximum load when $n$ balls are randomly thrown into $m$ bins. This is a simple distributed algorithm which is different from the scheduling problems we are interested in. First, it seems hard to extend this kind of analysis to tasks with precedence constraints. Second, as the load balancing is done in one phase at the beginning, the cost of computing the schedule is not considered. Adler et al (1995) study parallel allocations but still do not take into account contention on the bins. Our approach, like WS, considers contention on the lists.

Some works have been proposed for the analysis of algorithms in data structures and combinatorial optimization (including variants of scheduling) using potential functions. Our analysis is also based on a potential function representing the load imbalance between the local queues. This technique has been successfully used for analyzing convergence to Nash equilibria in game theory (Berenbrink et al, 2007), load diffusion on graphs (Berenbrink et al, 2009) and WS (Arora et al, 2001).

1.3. Contributions. List scheduling is centralized in nature. The purpose of this work is to study the effects of decentralization on list scheduling. The main result is a new framework for analyzing distributed list scheduling algorithms (DLS). Based on the analysis of the load balancing between two processors during a work request, it is possible to deduce the total expected number of work requests and then to derive a bound on the expected makespan. This methodology is generic and it is applied in this paper to several relevant variants of the scheduling problem.

First, we show that the expected makespan of DLS applied to $W$ unit independent tasks is equal to the absolute lower bound $W/m$ plus an additive term in $3.65 \log_2 W$. We propose a lower bound which shows that the analysis is tight up to a constant factor. This analysis is refined and applied to several variants of the problem. In particular, a slight change in the potential function improves the multiplicative factor from 3.65 to 3.24. Then, we study the possibility for processors to cooperate while requesting tasks in the same list. Finally, we study the initial repartition of the tasks and show that a balanced initial allocation induces fewer work requests.

Second, the previous analysis is extended to the weighted case of many unknown processing times. The analysis achieves the same bound as before with an extra term involving $p_{\max}$ (the maximal value of the processing times).

Third, we provide a new analysis of the WS algorithm of Arora et al (2001) for scheduling DAGs that improves the bound on the number of work requests from $32mD$ to $5.5mD$.

Fourth, we developed a complete experimental campaign that gives statistical evidence that the makespan of DLS follows known probability distributions depending on the considered variant. Moreover, the experiments show that the theoretical analysis for independent tasks is almost tight: the overhead to $W/m$ is less than 37% away from the exact value.

1.4. Content.
We start by introducing the model and we recall the analysis of classical list scheduling in Section 2. Then, we present the principle of the analysis in Section 3 and we apply this analysis to unit independent tasks in Section 4. Section 5 discusses variations on the unit tasks model: improvements on the potential function and cooperation among thieves. We extend the analysis to weighted independent tasks in Section 6 and to tasks with dependencies in Section 7. Finally, we present and analyze simulation experiments in Section 8.

2. MODEL AND NOTATIONS

2.1. Platform and workload characteristics. We consider a parallel platform composed of $m$ identical processors and a workload of $n$ tasks with processing times $p_j$. The total work of the computation is denoted by $W = \sum_{j=1}^{n} p_j$. The tasks can be independent or constrained by a directed acyclic graph (DAG) of precedences. In this case, we denote by $D$ the critical path of the DAG (its depth). We consider an online model where the processing times and precedences are discovered during the computation. More precisely, we learn the processing time of a task when its execution is terminated and we discover new tasks in the DAG only when all their precedences have been satisfied. The problem is to study the maximum completion time (the makespan, denoted by $C_{\max}$), taking into account the scheduling cost.

FIGURE 1. A typical execution of $W = 2000$ unit independent tasks on $m = 25$ processors using distributed list scheduling. Grey area represents idle times due to steal requests.

2.2. Centralized list scheduling. Let us briefly recall the principle of list scheduling as it was introduced by Graham (1969). The analysis states that the makespan of any list algorithm is not greater than twice the optimal makespan. One way of proving this bound is to use a geometric argument on the Gantt chart: $C_{\max} = (W + S_{idle})/m$, where the last term is the surface of idle periods (represented in grey in Figure 1). Depending on the scheduling problem (with or without precedence constraints, unit tasks or not), there are several ways to compute $S_{idle}$. With precedence constraints, $S_{idle} \le (m-1) \cdot D$. For independent tasks, the result can be written as $S_{idle} \le (m-1) \cdot p_{\max}$, where $p_{\max}$ is the maximum of the processing times. For unit independent tasks, it is straightforward to obtain an optimal algorithm where the load is evenly balanced. Thus $S_{idle} \le m - 1$, i.e. at most one slot of the schedule contains idle times.

2.3. Decentralized list scheduling. When the list of ready tasks is distributed among the processors, the analysis is more complex, even in the elementary case of unit independent tasks. In this case, the extra $S_{idle}$ term is induced by the distributed nature of the problem. Processors can be idle even when ready tasks are available. Figure 1 is an example of a schedule obtained using distributed list scheduling, which shows the complicated repartition of the idle times $S_{idle}$.

2.4. Model of the distributed list. We now describe precisely the behavior of the distributed list. Each processor $i$ maintains its own local queue $Q_i$ of tasks ready to execute. At the beginning of the execution, ready tasks can be arbitrarily spread among the queues. While $Q_i$ is not empty, processor $i$ picks a task and executes it.
When this task has been executed, it is removed from the queue and another one starts being processed. When $Q_i$ is empty, processor $i$ sends a steal request to another processor $k$ chosen uniformly at random. If $Q_k$ is empty or contains only one task (currently executed by processor $k$), then the request fails and processor $i$ will send a new request at the next time step. If $Q_k$ contains more than one task, then $i$ is given half of the tasks and it will restart a normal execution at the next step. To model the contention on the queues, no more than one steal request per processor can succeed in the same time slot. If several requests target the same processor, a random one succeeds and all the others fail. This assumption will be relaxed in Section 5.2. A steal request is said to be successful if the target queue contains more than one task and the request is not aborted due to contention. In all the other cases, the steal request is said to be unsuccessful.

This is a high-level model of a distributed list but it accurately models the case of independent tasks and the WS algorithm of Arora et al (2001). We justify here some choices of this model. There is no explicit communication cost since WS algorithms most often target shared memory platforms. In addition, a steal request is done in constant time independently of the amount of tasks transferred. This assumption is not restrictive as the description of a large number of tasks can be very short. In the case of independent tasks, a whole subpart of an array of tasks can be represented in a compact way by the range of the corresponding indices, each cell containing the effective description of a task (a STL transform in Traoré et al (2008)). For more general cases with precedence constraints, it is usually enough to transfer a task which represents a part of the DAG. More details on the DAG model are provided in Section 7. Finally, there is no contention between a processor executing a task from its own queue and a processor stealing from the same queue. Indeed, one can use queue data structures allowing these two operations to happen concurrently (Frigo et al, 1998).

2.5. Properties of the work. At time $t$, let $w_i(t)$ represent the amount of work in queue $Q_i$ (cf. Figure 2). $w_i(t)$ may be defined as the sum of processing times of all tasks in $Q_i$, as in Section 4, but can differ, as in Sections 6 and 7. In all cases, the definition of $w_i(t)$ satisfies the following properties.

1) When $w_i(t) > 0$, processor $i$ is active and executes some work: $w_i(t+1) \le w_i(t)$.

2) When $w_i(t) = 0$, processor $i$ is idle and sends a steal request to a random processor $k$. If the steal request is successful, a certain amount of work is transferred from processor $k$ to processor $i$ and we have $\max\{w_i(t+1), w_k(t+1)\} < w_k(t)$.

3) The execution terminates when there is no more work in the system, i.e. $\forall i,\ w_i(t) = 0$.

We also denote the total amount of work on all processors by $w(t) = \sum_{i=1}^{m} w_i(t)$ and the number of processors sending steal requests by $r_t$, with $r_t \in [0, m-1]$ while the execution is in progress. Notice that when $r_t = m$, all queues are empty and thus the execution is complete.
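As an illustration, the model of Sections 2.4 and 2.5 can be sketched for unit independent tasks as follows. This is a minimal sketch and not the simulator used in Section 8: the function name `simulate_dls`, its `seed` parameter and the list-based representation of the loads are assumptions made for this example, and the split of the victim's queue follows the rounding convention stated in Section 4 (the victim keeps the larger half of what remains after executing one task).

```python
import random

def simulate_dls(initial_loads, seed=0):
    """Simulate distributed list scheduling on unit independent tasks.

    initial_loads[i] is the number of tasks initially in queue i (m >= 2).
    Returns (makespan, total number of steal requests).
    """
    rng = random.Random(seed)
    w = list(initial_loads)
    m = len(w)
    steps = 0
    requests = 0
    while sum(w) > 0:
        # Idle processors each target another processor uniformly at random.
        targets = {}
        for i in range(m):
            if w[i] == 0:
                k = rng.choice([p for p in range(m) if p != i])
                targets.setdefault(k, []).append(i)
                requests += 1
        # Active processors execute one unit task during this time step.
        new_w = [max(x - 1, 0) for x in w]
        for k, thieves in targets.items():
            if w[k] > 1:  # the victim holds more than the task it executes
                winner = rng.choice(thieves)  # contention: one request succeeds
                rest = w[k] - 1
                new_w[k] = (rest + 1) // 2    # victim keeps the larger half
                new_w[winner] = rest // 2     # thief receives the smaller half
        w = new_w
        steps += 1
    return steps, requests
```

For instance, `simulate_dls([2000] + [0] * 24)` reproduces the setting of Figure 1 (all tasks initially on one of 25 processors). Since every processor either executes a task or sends a request at each step, the returned values satisfy `m * makespan == W + requests`.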
[Figure 2, two panels: (a) workload $w_1(t), \dots, w_4(t)$ at time $t$; (b) workload $w_1(t+1), \dots, w_4(t+1)$ at time step $t+1$.]

FIGURE 2. Evolution of the workload of the different processors during a time step. At time $t$, processors 2 and 3 are idle and they both choose processor 1 to steal from. At time $t+1$, only processor 2 succeeds in stealing some of the work of processor 1. The work is split between the two processors. Processors 1 and 4 both execute some work during this time step (represented by a shaded zone).

3. PRINCIPLE OF THE ANALYSIS AND MAIN THEOREM

This section presents the principle of the analysis. The main result is Theorem 1, which gives bounds on the expected number of steal requests done by the schedule as well as on the probability that the number of work requests exceeds this bound.

As a processor is either executing or requesting work, the number of work requests plus the total amount of tasks to be executed is equal to $m \cdot C_{\max}$, where $C_{\max}$ is the total completion time. The makespan can therefore be derived from the total number of work requests $R$:
$$C_{\max} = \frac{W}{m} + \frac{R}{m}. \tag{1}$$

The main idea of our analysis is to study the decrease of a potential $\Phi_t$. The potential $\Phi_t$ depends on the load on all processors at time $t$, $w(t)$. The precise definition of $\Phi_t$ varies depending on the scenario (see Sections 4 to 7). For example, the potential function used in Section 4 is $\Phi_t = \sum_{i=1}^{m} (w_i(t) - w(t)/m)^2$. For each scenario, we will prove that the diminution of the potential during one time step depends on the number of steal requests, $r_t$. More precisely, we will show that there exists a function $h : \{0 \dots m\} \to [0; 1]$ such that the average value of the potential at time $t+1$ is at most $h(r_t)\,\Phi_t$.

Using the expected diminution of the potential, we derive a bound on the number of steal requests until $\Phi_t$ becomes less than one, $R = \sum_{s=0}^{\tau-1} r_s$, where $\tau$ denotes the first time that $\Phi_t$ is less than 1. If all $r_t$ were equal to $r$ and the potential decrease were deterministic, the number of time steps before $\Phi_t \le 1$ would be $\log_2 \Phi_0 / (-\log_2 h(r))$ and the number of steal requests would be $r/(-\log_2 h(r)) \cdot \log_2 \Phi_0$. As $r$ can vary between 1 and $m$, the worst case for this bound is $m \lambda \log_2 \Phi_0$, where $\lambda = \max_{1 \le r \le m} r/(-m \log_2 h(r))$. The next theorem shows that the number of steal requests is indeed bounded by $m \lambda \log_2 \Phi_0$ plus an additive term due to the stochastic nature of $\Phi_t$. The fact that $\lambda$ corresponds to the worst choice of $r_t$ at each time step makes the bound looser than the real constant. However, we show in Section 8 that the gap between the obtained bound and the values obtained by simulation is small.
Moreover, the computation of the constant $\lambda$ is simple and makes this analysis applicable in several scenarios, such as the ones presented in Sections 4 to 7.

In the following theorem and its proof, we use the following notations. $\mathcal{F}_t$ denotes the knowledge of the system up to time $t$ (namely, the filtration associated with the process $w(t)$). For a random variable $X$, the conditional expectation of $X$ knowing $\mathcal{F}_t$ is denoted $E[X \mid \mathcal{F}_t]$. Finally, the notation $\mathbf{1}_A$ denotes the random variable equal to 1 if the event $A$ is true and 0 otherwise. In particular, this means that the probability of an event $A$ is $P\{A\} = E[\mathbf{1}_A]$.

Theorem 1. Assume that there exists a function $h : \{0 \dots m\} \to [0, 1]$ such that the potential satisfies:
$$E[\Phi_{t+1} \mid \mathcal{F}_t] \le h(r_t) \cdot \Phi_t.$$
Let $\Phi_0$ denote the potential at time 0 and let $\lambda$ be defined as:
$$\lambda \stackrel{\text{def}}{=} \max_{1 \le r \le m} \frac{r}{-m \log_2(h(r))}.$$
Let $\tau$ be the first time that $\Phi_t$ is less than 1, $\tau \stackrel{\text{def}}{=} \min\{t : \Phi_t < 1\}$. The number of steal requests until $\tau$, $R = \sum_{s=0}^{\tau-1} r_s$, satisfies:

(i) $P\{R \ge m \lambda \log_2 \Phi_0 + m + u\} \le 2^{-u/(m\lambda)}$

(ii) $E[R] \le m \lambda \log_2(\Phi_0) + m\left(1 + \frac{\lambda}{\ln 2}\right)$.
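To make the statement concrete, the constant $\lambda$ and the right-hand side of (ii) can be evaluated numerically for any given $h$. The sketch below is illustrative only: the names `lam` and `expected_requests_bound` are assumptions of this example, and the function $h(r) = 2^{-r/(cm)}$ used in the usage note is a hypothetical choice (not one of the scenarios of Sections 4 to 7), picked because it gives $\lambda = c$ exactly.

```python
import math

def lam(h, m):
    """lambda of Theorem 1: max over 1 <= r <= m of r / (-m * log2(h(r)))."""
    best = 0.0
    for r in range(1, m + 1):
        hr = h(r)
        if hr >= 1.0:
            return math.inf  # no guaranteed decrease for this r: bound is void
        if hr > 0.0:         # h(r) = 0 contributes a ratio of 0
            best = max(best, r / (-m * math.log2(hr)))
    return best

def expected_requests_bound(h, m, phi0):
    """Right-hand side of Theorem 1 (ii) for an initial potential phi0 >= 1."""
    l = lam(h, m)
    if math.isinf(l):
        return math.inf
    return m * l * math.log2(phi0) + m * (1 + l / math.log(2))
```

With the hypothetical $h(r) = 2^{-r/(cm)}$, $m = 8$ and $c = 2$, every ratio $r/(-m \log_2 h(r))$ equals $c$, so `lam` returns 2 up to rounding and the bound for $\Phi_0 = 1024$ is $8 \cdot 2 \cdot 10 + 8(1 + 2/\ln 2)$.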

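Proof. For two time steps $t \le T$, we call $R_t^T$ the number of steal requests between $t$ and $T$:
$$R_t^T \stackrel{\text{def}}{=} \sum_{s=t}^{\min\{\tau, T\} - 1} r_s.$$
The number of steal requests until $\Phi_t < 1$ is $R = \sum_{s=0}^{\tau-1} r_s = \lim_{T \to \infty} R_0^T$. We show by a backward induction on $t$ that for all $t \le T$:
$$\text{if } \Phi_t \ge 1, \text{ then } \forall u \in \mathbb{R}: \quad E\left[\mathbf{1}_{R_t^T \ge m\lambda \log_2 \Phi_t + m + u} \mid \mathcal{F}_t\right] \le 2^{-u/(m\lambda)}. \tag{2}$$

For $t = T$, $R_T^T = 0$ and, for $u > 0$, $E\left[\mathbf{1}_{R_T^T \ge m\lambda \log_2 \Phi_T + m + u} \mid \mathcal{F}_T\right] = 0$. Thus, (2) is true for $t = T$.

Assume that (2) holds for some $t + 1 \le T$ and suppose that $\Phi_t \ge 1$. Let $u > 0$ (if $u \le 0$, the right-hand side of (2) is at least 1 and the bound is trivial). Since $R_t^T = r_t + R_{t+1}^T$, the probability $P\{R_t^T \ge m\lambda \log_2 \Phi_t + m + u \mid \mathcal{F}_t\}$ is equal to
$$E\left[\mathbf{1}_{R_t^T \ge m\lambda \log_2 \Phi_t + m + u} \mid \mathcal{F}_t\right] = E\left[\mathbf{1}_{r_t + R_{t+1}^T \ge m\lambda \log_2 \Phi_t + m + u} \mid \mathcal{F}_t\right] \tag{3}$$
$$= E\left[\mathbf{1}_{r_t + R_{t+1}^T \ge m\lambda \log_2 \Phi_t + m + u}\, \mathbf{1}_{\Phi_{t+1} \ge 1} \mid \mathcal{F}_t\right] \tag{4}$$
$$+\ E\left[\mathbf{1}_{r_t + R_{t+1}^T \ge m\lambda \log_2 \Phi_t + m + u}\, \mathbf{1}_{\Phi_{t+1} < 1} \mid \mathcal{F}_t\right]. \tag{5}$$

If $\Phi_{t+1} < 1$, then $R_{t+1}^T = 0$. Since $r_t \le m$ and $\Phi_t \ge 1$, $m\lambda \log_2 \Phi_t + m + u - r_t > 0$. This shows that the term of Equation (5) is equal to zero.

(4) is the probability that $\Phi_{t+1} \ge 1$ and that $R_{t+1}^T$ is greater than
$$m\lambda \log_2 \Phi_t + m + u - r_t = m\lambda \log_2 \Phi_{t+1} + m + u - r_t - m\lambda \log_2(\Phi_{t+1}/\Phi_t).$$
Therefore, using the induction hypothesis, (4) is equal to
$$E\left[\mathbf{1}_{R_{t+1}^T \ge m\lambda \log_2 \Phi_t + m + u - r_t}\, \mathbf{1}_{\Phi_{t+1} \ge 1} \mid \mathcal{F}_t\right] \le E\left[2^{-\frac{u - r_t - m\lambda \log_2(\Phi_{t+1}/\Phi_t)}{m\lambda}}\, \mathbf{1}_{\Phi_{t+1} \ge 1} \mid \mathcal{F}_t\right]$$
$$= 2^{-\frac{u - r_t}{m\lambda}}\, E\left[\frac{\Phi_{t+1}}{\Phi_t}\, \mathbf{1}_{\Phi_{t+1} \ge 1} \mid \mathcal{F}_t\right] \le 2^{-\frac{u - r_t}{m\lambda}}\, h(r_t) = 2^{-\frac{u}{m\lambda}} \cdot 2^{\,r_t/(m\lambda) + \log_2 h(r_t)},$$
where at the first line we used both the fact that for a random variable $X$, $E[X \mid \mathcal{F}_t] = E[E[X \mid \mathcal{F}_{t+1}] \mid \mathcal{F}_t]$ and the induction hypothesis. If $r_t = 0$, $2^{\,r_t/(m\lambda) + \log_2 h(r_t)} = h(r_t) \le 1$. Otherwise, by definition of $\lambda = \max_{1 \le r \le m} r/(-m \log_2 h(r))$, $r_t/(m\lambda) + \log_2 h(r_t) \le 0$ and $2^{\,r_t/(m\lambda) + \log_2 h(r_t)} \le 1$. This shows that (2) holds for $t$. Therefore, by induction on $t$, (2) holds for $t = 0$: for all $u \ge 0$,
$$E\left[\mathbf{1}_{R_0^T \ge m\lambda \log_2 \Phi_0 + m + u} \mid \mathcal{F}_0\right] \le 2^{-u/(m\lambda)}.$$
As $r_t \ge 0$, the sequence $(R_0^T)_T$ is increasing and converges to $R$. Therefore, the sequence $\mathbf{1}_{R_0^T \ge m\lambda \log_2 \Phi_0 + m + u}$ is increasing in $T$ and converges to $\mathbf{1}_{R \ge m\lambda \log_2 \Phi_0 + m + u}$. Thus, by Lebesgue's monotone convergence theorem,
$$P\{R \ge m\lambda \log_2 \Phi_0 + m + u\} = \lim_{T \to \infty} E\left[\mathbf{1}_{R_0^T \ge m\lambda \log_2 \Phi_0 + m + u}\right] \le 2^{-u/(m\lambda)}.$$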

The second part of the theorem, (ii), is a direct consequence of (i). Indeed,
$$E[R] = \int_0^\infty P\{R \ge u\}\, du \le m\lambda \log_2 \Phi_0 + m + \int_0^\infty P\{R \ge m\lambda \log_2 \Phi_0 + m + u\}\, du$$
$$\le m\lambda \log_2 \Phi_0 + m + \int_0^\infty 2^{-u/(m\lambda)}\, du = m\lambda \log_2 \Phi_0 + m\left(1 + \frac{\lambda}{\ln 2}\right).$$

4. UNIT INDEPENDENT TASKS

We apply the analysis presented in the previous section to the case of independent unit tasks. In this case, each processor $i$ maintains a local queue $Q_i$ of tasks to execute. At every time slot, if the local queue $Q_i$ is not empty, processor $i$ picks a task and executes it. When $Q_i$ is empty, processor $i$ sends a steal request to a random processor $j$. If $Q_j$ is empty or contains only one task (currently executed by processor $j$), then the request fails and processor $i$ will have to send a new request at the next slot. If $Q_j$ contains more than one task, then $i$ is given half of the tasks (after the task executed at time $t$ by processor $j$ has been removed from $Q_j$). The amount of work on processor $i$ at time $t$, $w_i(t)$, is the number of tasks in $Q_i(t)$. At the beginning of the execution, $w(0) = W$ and tasks can be arbitrarily spread among the queues.

4.1. Potential function and expected decrease. Applying the method presented in Section 3, the first step of the analysis is to define the potential function and compute the potential decrease when a steal occurs. For this example, $\Phi(t)$ is defined by:
$$\Phi(t) = \sum_{i=1}^{m} \left(w_i(t) - \frac{w(t)}{m}\right)^2 = \sum_{i=1}^{m} w_i(t)^2 - \frac{w^2(t)}{m}.$$
This potential represents the load imbalance in the system. If all queues have the same load $w_i(t) = w(t)/m$, then $\Phi(t) = 0$. $\Phi(t) \le 1$ implies that there is at most one processor with at most one more task than the others. In that case, there will be no steal until there is just one processor with 1 task and all others idle. Moreover, the potential function is maximal when all the work is concentrated on a single queue. That is, $\Phi(t) \le w(t)^2 - w(t)^2/m = (1 - 1/m)\, w^2(t)$.

Three events contribute to a variation of the potential: successful steals, task executions and the decrease of $w^2(t)/m$.
1) If queue $i$ has $w_i(t) \ge 1$ tasks and receives one or more steal requests, it chooses a processor $j$ among the thieves. At time $t+1$, $i$ has executed one task and the rest of the work is split between $i$ and $j$. Therefore, $w_i(t+1) = \lceil (w_i(t) - 1)/2 \rceil$ and $w_j(t+1) = \lfloor (w_i(t) - 1)/2 \rfloor$. Thus, we have:
$$w_i(t+1)^2 + w_j(t+1)^2 = \left\lceil \frac{w_i(t) - 1}{2} \right\rceil^2 + \left\lfloor \frac{w_i(t) - 1}{2} \right\rfloor^2 \le \frac{w_i(t)^2}{2} - w_i(t) + 1.$$
Therefore, this generates a difference of potential of
$$\delta_i(t) \ge \frac{w_i(t)^2}{2} + w_i(t) - 1. \tag{6}$$

2) If $i$ has $w_i(t) \ge 1$ tasks and receives zero steal requests, its potential goes from $w_i(t)^2$ to $(w_i(t) - 1)^2$, generating a potential decrease of $2 w_i(t) - 1$.

3) As there are $m - r_t$ active processors, $\left(\sum_{i=1}^{m} w_i(t)\right)^2 / m$ goes from $w(t)^2/m$ to $w(t+1)^2/m = (w(t) - m + r_t)^2/m$, generating a potential increase of $2(m - r_t)\, w(t)/m - (m - r_t)^2/m$.

Recall that at time $t$, there are $r_t$ processors that send steal requests. A processor $i$ receives zero steal requests if each of the $r_t$ thieves chooses another processor. These events are independent and each happens with probability $(m-2)/(m-1)$. Therefore, the probability for the processor to receive one or more steal requests is $q(r_t)$, where
$$q(r_t) = 1 - \left(1 - \frac{1}{m-1}\right)^{r_t}.$$

If $\Phi_t = \Phi$ and $r_t = r$, by summing the expected decrease $\delta_i$ on each active processor, the expected potential decrease is greater than:
$$\sum_{i : w_i(t) > 0} \left[ q(r) \left( \frac{w_i(t)^2}{2} + w_i(t) - 1 \right) + (1 - q(r)) (2 w_i(t) - 1) \right] - \frac{2 w(t) (m - r)}{m} + \frac{(m - r)^2}{m}$$
$$= \frac{q(r)}{2} \sum_{i} w_i(t)^2 - q(r)\, w(t) + 2 w(t) - (m - r) - \frac{2 w(t) (m - r)}{m} + \frac{(m - r)^2}{m}.$$
Using that $2 w(t) - 2 w(t) \frac{m - r}{m} = 2 w(t) \frac{r}{m}$, that $-(m - r) + \frac{(m - r)^2}{m} = -\frac{(m - r)\, r}{m}$ and that $\sum_i w_i(t)^2 = \Phi + \frac{w(t)^2}{m}$, this equals:
$$\frac{q(r)}{2} \Phi + \frac{q(r)}{2m} w(t)^2 - q(r)\, w(t) + \frac{2 w(t)\, r}{m} - \frac{(m - r)\, r}{m}$$
$$= \frac{q(r)}{2} \Phi + \frac{q(r)}{2m} w(t)^2 - q(r)\, w(t) + \frac{r}{m} \left( 2 w(t) - m + r \right)$$
$$= \frac{q(r)}{2} \Phi + \frac{q(r)\, w(t)}{2m} \left( w(t) - 2m + \frac{2r}{q(r)} \right) + \frac{r}{m} \left( w(t) - m + r \right).$$
By concavity of $x \mapsto 1 - (1 - x)^r$, we have $1 - (1 - x)^r \le r \cdot x$. This shows that $q(r) = 1 - \left(1 - \frac{1}{m-1}\right)^r \le r/(m-1)$. Thus, $r/q(r) \ge m - 1$. Moreover, as $m - r$ is the number of active processors, $w(t) \ge m - r$ (each active processor has at least one task). This shows that the expected decrease of potential is greater than:
$$\frac{q(r)}{2} \Phi + \frac{q(r)\, w(t)}{2m} \left( w(t) - 2m + 2(m - 1) \right) = \frac{q(r)}{2} \Phi + \frac{q(r)\, w(t)}{2m} \left( w(t) - 2 \right).$$
If $w(t) \ge 2$, then the expected decrease of potential is greater than $q(r_t)\, \Phi_t / 2$. If $w(t) < 2$, this means that $w(t) = 1$ and $w(t+1) = 0$ and therefore $\Phi_{t+1} = 0$. Thus, for all $t$:
$$E[\Phi_{t+1} \mid \mathcal{F}_t] \le \left(1 - \frac{q(r_t)}{2}\right) \Phi_t. \tag{7}$$

4.2. Bound on the makespan. Using Theorem 1 of the previous section, we can solve equation (7) and conclude the analysis.

Theorem 2.
Let $C_{\max}$ be the makespan of $W = n$ unit independent tasks scheduled by DLS and $\Phi_0 \stackrel{\text{def}}{=} \sum_i \left(w_i - \frac{W}{m}\right)^2$ the potential when the schedule starts. Then:

(i) $E[C_{\max}] \le \frac{W}{m} + \frac{1}{1 - \log_2(1 + \frac{1}{e})} \left( \log_2 \Phi_0 + \frac{1}{\ln 2} \right) + 1$

(ii) $P\left\{ C_{\max} \ge \frac{W}{m} + \frac{1}{1 - \log_2(1 + \frac{1}{e})} \left( \log_2 \Phi_0 + \log_2 \frac{1}{\epsilon} \right) + 1 \right\} \le \epsilon$

In particular:

(iii) $E[C_{\max}] \le \frac{W}{m} + \frac{2}{1 - \log_2(1 + \frac{1}{e})} \left( \log_2 W + \frac{1}{2 \ln 2} \right) + 1$

These bounds are optimal up to a constant factor in $\log_2 W$.

Proof. Equation (7) shows that $E[\Phi_{t+1} \mid \mathcal{F}_t] \le h(r_t)\, \Phi_t$ with $h(r) = 1 - q(r)/2$. Defining $\Phi'_t = \Phi_t / (1 - 1/m)$, the potential function $\Phi'_t$ also satisfies (7). Therefore, $\Phi'_t$ satisfies the conditions of Theorem 1. This shows that the number of work requests $R$ until $\Phi'_t < 1$ satisfies
$$E[R] \le m \lambda \log_2(\Phi_0) + m \left(1 + \frac{\lambda}{\ln 2}\right),$$
with $\lambda = \max_{1 \le r \le m-1} r/(-m \log_2 h(r))$. One can show that $r/(-m \log_2 h(r))$ is increasing in $r$. Thus its maximum is attained for $r = m - 1$. This shows that $\lambda \le 1/(1 - \log_2(1 + \frac{1}{e}))$.

The minimal non-zero value for $\Phi_t$ is attained when one processor has one task and the others zero. In that case, $\Phi_t = 1 - 1/m$. Therefore, when $\Phi'_t < 1$, this means that $\Phi_t = 0$ and the schedule is finished.

As pointed out in Equation (1), at each time step of the schedule, a processor is either computing one task or stealing work. Thus, the number of steal requests plus the number of tasks to be executed is equal to $m \cdot C_{\max}$, i.e. $m \cdot C_{\max} = W + R$. This shows that
$$E[C_{\max}] \le \frac{W}{m} + \frac{1}{1 - \log_2(1 + \frac{1}{e})} \left( \log_2 \Phi_0 + \frac{1}{\ln 2} \right) + 1.$$
This concludes the proof of (i). The proof of (i) applies mutatis mutandis to prove the bound in probability (ii) using Theorem 1 (i). Item (iii) follows from (i) since $\Phi_0 \le W^2$.

We now give a lower bound for this problem. Consider $W = 2^{k+1}$ tasks and $m = 2^k$ processors, all the tasks being on the same processor at the beginning. In the best case, all steal requests target processors with the highest loads. In this case the makespan is $C_{\max} = k + 2$: the first $k = \log_2 m$ steps for each processor to get some work; one step where all processors are active; and one last step where only one processor is active. In that case, $C_{\max} \ge \frac{W}{m} + \log_2 W - 1$.

This theorem shows that the factor before $\log_2 W$ is bounded between 1 and $2/(1 - \log_2(1 + 1/e)) < 3.65$.
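The constant of Theorem 2 can be checked numerically. The sketch below, which is only an illustration, evaluates $q(r)$, $h(r) = 1 - q(r)/2$ and the ratios $r/(-m \log_2 h(r))$ for $1 \le r \le m-1$; the names `q`, `h`, `ratios`, `lam_unit` and `LIMIT` are assumptions of this example.

```python
import math

def q(r, m):
    """Probability that a given processor receives at least one of r requests."""
    return 1.0 - (1.0 - 1.0 / (m - 1)) ** r

def h(r, m):
    """Contraction factor of Equation (7)."""
    return 1.0 - q(r, m) / 2.0

def ratios(m):
    """r / (-m * log2 h(r)) for r = 1, ..., m - 1 (requires m >= 2)."""
    return [r / (-m * math.log2(h(r, m))) for r in range(1, m)]

def lam_unit(m):
    """lambda of the proof of Theorem 2 for m processors."""
    return max(ratios(m))

# Value of 1 / (1 - log2(1 + 1/e)), the bound on lambda used in Theorem 2.
LIMIT = 1.0 / (1.0 - math.log2(1.0 + 1.0 / math.e))
```

For the values of $m$ we tried, `ratios(m)` is non-decreasing in $r$, `lam_unit(m)` stays below `LIMIT`, and `2 * LIMIT` evaluates to slightly less than 3.65, in agreement with item (iii).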
Simulations reported in Section 8 seem to indicate that the factor of $\log_2 W$ is slightly less than 3.65. This shows that the constants obtained by our analysis are sharp.

4.3. Influence of the initial repartition of tasks. In the worst case, all tasks are in the same queue at the beginning of the execution and $\Phi_0 = (1 - 1/m)\, W^2 \le W^2$. This leads to a bound on the number of work requests in $3.65 \log_2 W$ (see item (iii) of Theorem 2). However, using bounds in terms of $\Phi_0$, our analysis is able to capture the difference in the number of work requests when the initial repartition is more balanced. One can show that a more balanced initial repartition ($\Phi_0 \ll W^2$) leads to fewer steal requests on average.

Suppose for example that the initial repartition is a balls-and-bins assignment: each task is assigned to a processor at random. In this case, the initial number of tasks in queue $i$, $w_i(0)$, follows a binomial distribution $B(W, 1/m)$. The expected value of $\Phi_0$ is:
$$E[\Phi_0] = \sum_i E[w_i^2] - \frac{W^2}{m} = \sum_i \left( \mathrm{Var}[w_i] + E[w_i]^2 \right) - \frac{W^2}{m} = \left(1 - \frac{1}{m}\right) W.$$
Since the number of work requests is proportional to $\log_2 \Phi_0$, this initial repartition of tasks reduces the number of steal requests by a factor of 2 on average. This leads to a better bound on the makespan in $W/m + 1.83 \log_2 W + 3.63$.

5. GOING FURTHER ON THE UNIT TASKS MODEL

In this section, we provide two further analyses of the unit tasks model of the previous section. We first show how the use of a different potential function ($\Phi_t = \sum_i w_i(t)^\nu$ for some $\nu > 1$) leads to a better bound on the number of work requests. Then we show how cooperation among thieves leads to a reduction of the bound on the number of work requests by 12%. The latter is corroborated by our simulation, which shows a decrease in the number of work requests between 10% and 15%.

5.1. Improving the analysis by changing the potential function. We consider the same model of unit tasks as in Section 4. The potential function of our system is defined as
$$\Phi_t = \sum_{i=1}^{m} w_i(t)^\nu,$$
where $\nu > 1$ is a constant. When an idle processor steals from a processor with $w_i(t)$ tasks, the potential decreases by
$$\delta_i = w_i(t)^\nu - \left\lceil \frac{w_i(t) - 1}{2} \right\rceil^\nu - \left\lfloor \frac{w_i(t) - 1}{2} \right\rfloor^\nu \ge w_i(t)^\nu - \left(\frac{w_i(t)}{2}\right)^\nu - \left(\frac{w_i(t)}{2}\right)^\nu = (1 - 2^{1-\nu})\, w_i(t)^\nu.$$
This shows that the expected value of the potential at time $t+1$ satisfies
$$E[\Phi_{t+1}] \le \left(1 - q(r)(1 - 2^{1-\nu})\right) \Phi_t,$$
where $q(r)$ is the probability for a processor to receive at least one work request when $r$ processors are stealing, $q(r) = 1 - \left(1 - \frac{1}{m-1}\right)^r$. Following the analysis of the previous part, and as $\Phi_0 \le W^\nu$, the expected makespan is bounded by:
$$\frac{W}{m} + \lambda(\nu) \left( \log_2 \Phi_0 + \frac{1}{\ln 2} \right) + 1 \le \frac{W}{m} + \nu \lambda(\nu) \left( \log_2 W + \frac{1}{\nu \ln 2} \right) + 1,$$
where $\lambda(\nu)$ is a constant depending on $\nu$ equal to:
$$\lambda(\nu) \stackrel{\text{def}}{=} \max_{1 \le r \le m-1} \left\{ \frac{r}{-m \log_2\left(1 - q(r)(1 - 2^{1-\nu})\right)} \right\}. \tag{8}$$
As for $\nu = 2$ in Section 4, it can be shown that the maximum in Equation (8) is attained for $r = m - 1$. The constant factor in front of $\log_2 W$ is $\nu \lambda(\nu)$.
Numerically, the minimum of $\nu \lambda(\nu)$ is attained for $\nu \approx 2.94$ and is less than 3.24.

Theorem 3. Let $C_{\max}$ be the makespan of $W = n$ unit independent tasks scheduled by DLS. Then:
$$E[C_{\max}] \le \frac{W}{m} + 3.24 \left( \log_2 W + \frac{1}{2 \ln 2} \right) + 1.$$
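The numerical minimization mentioned above can be reproduced with a simple grid search. The following is a sketch under stated assumptions, not the computation actually performed for the paper: the names `lam_nu` and `best_exponent`, the finite value of $m$, and the grid bounds and step are all choices made for this example.

```python
import math

def lam_nu(nu, m):
    """lambda(nu) of Equation (8), maximized over 1 <= r <= m - 1 (m >= 2)."""
    shrink = 1.0 - 2.0 ** (1.0 - nu)
    best = 0.0
    for r in range(1, m):
        q = 1.0 - (1.0 - 1.0 / (m - 1)) ** r
        best = max(best, r / (-m * math.log2(1.0 - q * shrink)))
    return best

def best_exponent(m, lo=1.5, hi=4.5, step=0.01):
    """Grid search for the exponent nu minimizing nu * lambda(nu).

    Returns the pair (minimal constant, minimizing nu).
    """
    count = int(round((hi - lo) / step)) + 1
    grid = [lo + i * step for i in range(count)]
    return min((nu * lam_nu(nu, m), nu) for nu in grid)
```

For example, `best_exponent(200)` returns a constant below 3.24 with a minimizer close to 3, while `2 * lam_nu(2.0, 200)` stays just below the constant 3.65 of Section 4.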

In Section 4, we have shown that the makespan was bounded by
$$\frac{W}{m} + \lambda(2) \left( \log_2 \Phi_0 + \frac{1}{\ln 2} \right) + 1 \le \frac{W}{m} + 3.65 \left( \log_2 W + \frac{1}{2 \ln 2} \right) + 1.$$
Theorem 3 improves the constant factor in front of $\log_2 W$. However, we lose the information on the initial repartition of tasks $\Phi_0$.

5.2. Cooperation among thieves. In this section, we modify the protocol for managing the distributed list. Previously, when $k > 1$ steal requests were sent to the same processor, only one of them could be served due to contention on the list. We now allow the $k$ requests to be served in unit time. This model has been implemented in the middleware Kaapi (Gautier et al, 2007). When $k$ steal requests target the same processor, the work is divided into $k + 1$ pieces. In practice, allowing concurrent thieves increases the cost of a steal request, but we neglect this additional cost here. We assume that the $k$ concurrent steal requests can be served in unit time. We study the influence of this new protocol on the number of steal requests in the case of unit independent tasks.

We define the potential of the system at time $t$ to be:
$$\Phi(t) = \sum_{i=1}^{m} \left( w_i(t)^\nu - w_i(t) \right).$$
Let us first compute the decrease of the potential when processor $i$ receives $k \ge 1$ steal requests. If $w_i(t) > 0$, it can be written $w_i(t) = (k+1) q + b$ with $0 \le b < k + 1$. We neglect the decrease of potential due to the execution of tasks ($\nu > 1$ implies that the execution of tasks decreases the potential). After one time step and $k$ steal requests, the work will be divided into $b$ parts with $q + 1$ tasks and $k + 1 - b$ parts with $q$ tasks. $\sum_i w_i(t)$ does not vary during the stealing phase. Therefore, the difference of potential due to these $k$ work requests is
$$\delta_i^k = \left((k+1) q + b\right)^\nu - b (q + 1)^\nu - (k + 1 - b)\, q^\nu.$$
Let us denote $\alpha \stackrel{\text{def}}{=} b/(k+1) \in [0; 1)$ and let
$$f(x) = (x + \alpha)^\nu + (1 - 2^{1-\nu})(x + \alpha) - (1 - \alpha)\, x^\nu - \alpha (x + 1)^\nu.$$
The first derivative of $f$ is $f'(x) = \nu (x + \alpha)^{\nu - 1} + (1 - 2^{1-\nu}) - \nu (1 - \alpha)\, x^{\nu - 1} - \nu \alpha (x + 1)^{\nu - 1}$ and the derivative of $f'$ is $f''(x) = \nu (\nu - 1) \left( (x + \alpha)^{\nu - 2} - (1 - \alpha)\, x^{\nu - 2} - \alpha (x + 1)^{\nu - 2} \right)$.
As $\nu < 3$, the function $x \mapsto x^{\nu-2}$ is concave, which implies that $f''(x) \ge 0$. Therefore, $f'$ is increasing. Moreover, $f'(0) = \nu(\alpha^{\nu-1} - \alpha) + 1 - 2^{1-\nu} \ge 0$. This shows that for all $x$, $f'(x) \ge 0$ and that $f$ is increasing. The value of $f$ at 0 is $f(0) = \alpha^\nu + (1 - 2^{1-\nu})\alpha - \alpha = \alpha^\nu\left(1 - (2\alpha)^{1-\nu}\right) \ge 0$, which implies that for all $x$, $f(x) \ge 0$.

Recall that $w_i(t) = (k+1)q + b$ and $\alpha = b/(k+1)$. Using the notation $f$ and the fact that $(k+1)^{1-\nu} \le 2^{1-\nu}$, the decrease of potential $\delta_i^k$ can be bounded as
$$\delta_i^k \ge \left(1 - (k+1)^{1-\nu}\right)\left(w_i(t)^\nu - w_i(t)\right) + (k+1)\,f(q) \ge \left(1 - (k+1)^{1-\nu}\right)\left(w_i(t)^\nu - w_i(t)\right). \qquad (9)$$
Let $q_k(r)$ be the probability for a processor to receive $k$ work requests when $r$ processors are stealing. $q_k(r)$ is equal to:
$$q_k(r) = \binom{r}{k} \left(\frac{1}{m-1}\right)^{k} \left(\frac{m-2}{m-1}\right)^{r-k}.$$
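The final bound of Equation (9) and the request distribution $q_k(r)$ can be sanity-checked numerically. The sketch below is only an illustration with hypothetical function names: it computes the exact decrease $\delta_i^k$ for a cooperative split of $w$ unit tasks into $k+1$ parts, compares it with the lower bound $(1-(k+1)^{1-\nu})(w^\nu - w)$ for exponents $2 \le \nu \le 3$, and evaluates the binomial law $q_k(r)$.

```python
from math import comb


def potential_drop(w, k, nu):
    """Exact decrease delta_i^k when w tasks are split into k+1 near-equal parts."""
    q, b = divmod(w, k + 1)              # b parts of size q+1, k+1-b parts of size q
    return w ** nu - b * (q + 1) ** nu - (k + 1 - b) * q ** nu


def drop_lower_bound(w, k, nu):
    """Right-hand side of Equation (9)."""
    return (1.0 - (k + 1) ** (1.0 - nu)) * (w ** nu - w)


def q_k(k, r, m):
    """Probability that a fixed processor receives exactly k of r requests,
    each request choosing uniformly among the m-1 other processors."""
    p = 1.0 / (m - 1)
    return comb(r, k) * p ** k * (1.0 - p) ** (r - k)


if __name__ == "__main__":
    for w, k in [(5, 1), (5, 2), (12, 3)]:
        print(w, k, potential_drop(w, k, 3.0), drop_lower_bound(w, k, 3.0))
    print(sum(q_k(k, 20, 64) for k in range(21)))   # probabilities sum to 1
```

For $k = 1$ and $\nu = 3$ the bound is tight when $w$ is odd, which is why a small floating-point tolerance is advisable when comparing the two quantities.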

The expected decrease of the potential caused by the steals on processor $i$ is equal to $\sum_{k=0}^{r} \delta_i^k\, q_k(r)$. Using equation (9), we can bound the expected potential at time $t + 1$ by
$$E[\Phi_t - \Phi_{t+1} \mid F_t] = \sum_{i=1}^{m} \sum_{k=0}^{r} \delta_i^k\, q_k(r),$$
$$E[\Phi_{t+1} \mid F_t] \le \left(1 - \sum_{k=0}^{r} \left(1 - (k+1)^{1-\nu}\right) q_k(r)\right) \Phi_t.$$

Theorem 4. The makespan $C^{coop}_{\max}$ of $W = n$ unit independent tasks scheduled with cooperative work stealing satisfies:
(i) $E[C^{coop}_{\max}] \le \frac{W}{m} + 2.88 \log_2 W + 3.4$
(ii) $P\left\{ C^{coop}_{\max} \ge \frac{W}{m} + 2.88 \log_2 W + 2 + \log_2\left(\frac{1}{\epsilon}\right) \right\} \le \epsilon$.

Proof. The proof is very similar to the one of Theorem 2. Let
$$h(r) \stackrel{\text{def}}{=} 1 - \sum_{k=0}^{r} \left(1 - (k+1)^{1-\nu}\right) q_k(r) \quad \text{and} \quad \lambda^{coop}(\nu) \stackrel{\text{def}}{=} \max_{1 \le r \le m} \frac{r}{-m \log_2 h(r)}.$$
Using Theorem 1, we have:
$$E[C^{coop}_{\max}] \le \frac{W}{m} + \nu\lambda^{coop}(\nu) \log_2 W + \frac{\lambda^{coop}(\nu)}{\ln 2} + 1.$$
In the general case the exact computation of $h(r)$ is intractable. However, by a numerical computation, one can show that $3\lambda^{coop}(3) < 2.88$. When $\Phi_t < 1$, we have $\sum_i \left(w_i(t)^\nu - w_i(t)\right) < 1$. This implies that for every processor $i$, $w_i(t)$ equals 0 or 1. This adds (at most) one step of computation at the end of the schedule. As $\lambda^{coop}(3)/\ln(2) + 1 + 1 \le 3.4$, we obtain the claimed bound.

Compared to the situation with no cooperation among thieves, the number of steal requests is reduced by a factor $3.24/2.88$, i.e. about 12%. We will see in Section 8 that this is close to the value obtained by simulation.

Remark. The exact computation can be accomplished for $\nu = 2$ (Tchiboukdjian et al, 2010) and leads to a constant factor of $2\lambda^{coop}(2) \le -2/\log_2\left(1 - \frac{1}{e}\right) \approx 3.02$.

6. WEIGHTED INDEPENDENT TASKS

In this section, we analyze the number of work requests for weighted independent tasks. Each task $j$ has a processing time $p_j$ which is unknown. When an idle processor attempts to steal from a processor, half of the tasks of the victim are transferred from the active processor to the idle one. A task that is currently executed by a processor cannot be stolen. If the victim has $2k$ (+1) tasks (plus one for the task that is currently executed), the work is split in $k$ (+1), $k$.
If the victim has $2k + 1$ (+1) tasks, the work is split in $k$ (+1), $k + 1$. In all this analysis, we consider that the scheduler does not know the weights $p_j$ of the different tasks. Therefore, when the work is split in two parts, we do not assume that the work is split fairly (see for example Figure 3) but only that the number of tasks is split in two equal parts.
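The splitting rule can be written as a few lines of code. The sketch below is an illustration with hypothetical names. It counts the task in execution among the victim's tasks, lets the victim keep the larger half, and (as an assumption, since the text does not say which tasks are taken) hands the thief the last tasks of the victim's list.

```python
def split_on_steal(n_tasks):
    """Return (kept_by_victim, given_to_thief) task counts.

    n_tasks includes the task currently in execution, which cannot be stolen.
    """
    stolen = n_tasks // 2
    return n_tasks - stolen, stolen


def steal_half(tasks):
    """Split a list of processing times; tasks[0] is the task in execution.

    Assumption: the thief receives the tail of the list. The scheduler does not
    look at the processing times, so the two workloads may be very different.
    """
    kept, stolen = split_on_steal(len(tasks))
    return tasks[:kept], tasks[kept:]


if __name__ == "__main__":
    kept, stolen = steal_half([1, 10, 10, 1, 1])
    print(kept, sum(kept))      # three tasks stay on the victim
    print(stolen, sum(stolen))  # two tasks go to the thief
```

With five tasks the counts are split 3 and 2, as in Figure 3, while the total processing times of the two parts can be far apart.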

[Figure 3: two panels showing tasks $p_1, \dots, p_5$ and the work executed by processor 1: (a) Workload at time $t$; (b) Workload at time $t + 1$.]
FIGURE 3. Evolution of the repartition of tasks during one time step. At time $t$, one processor has all the tasks. $p_1$ cannot be stolen since processor 1 has already started executing it. After one work request done by the second processor, one processor has 3 tasks and one has 2 tasks, but the workload may be very different, depending on the processing times $p_j$.

6.1. Definition of the potential function and expected decrease. As the processing times are unknown, the work cannot be shared evenly between both processors and can be as bad as one processor getting all the smallest tasks and one all the biggest tasks (see Figure 3). Let us call $w_i(t)$ the number of tasks possessed by processor $i$. The potential of the system at time $t$ is defined as:
$$\Phi_t \stackrel{\text{def}}{=} \sum_i \left( w_i(t)^\nu - w_i(t) \right). \qquad (10)$$
During a work request, half of the tasks are transferred from an active processor to the idle processor. If processor $j$ is stealing tasks from processor $i$, the numbers of tasks possessed by $i$ and $j$ at time $t + 1$ are $w_j(t+1) = \lfloor w_i(t)/2 \rfloor$ and $w_i(t+1) = \lceil w_i(t)/2 \rceil$. Therefore, the decrease of potential is equal to the one of the cooperative steal of Equation 9 for $k = 1$:
$$\delta_i \ge (1 - 2^{1-\nu}) \left( w_i(t)^\nu - w_i(t) \right).$$
Following the analysis of Section 5.2, this shows that on average:
$$E[\Phi_{t+1}] \le \left(1 - (1 - 2^{1-\nu})\, q(r)\right) \Phi_t. \qquad (11)$$

6.2. Bound on the makespan. Equation 11 allows us to apply Theorem 1 to derive a bound on the makespan of weighted tasks by the distributed list scheduling algorithm. This bound differs from the one for unit tasks only by an additive term of $p_{\max}$.

Theorem 5. Let $p_{\max} \stackrel{\text{def}}{=} \max_j p_j$ be the maximum processing time. The expected makespan to schedule $n$ weighted tasks of total processing time $W = \sum_j p_j$ by DLS is bounded by
$$E[C_{\max}] \le \frac{W}{m} + \frac{m-1}{m}\, p_{\max} + 3.24\left(\log_2 n + \frac{1}{2\ln 2}\right) + 1.$$

Proof. Let $\Phi_t$ be the potential defined by Equation 10.
At time $t = 0$, the potential of the system is bounded by $W^\nu - W$. Therefore, by Theorem 1, the number of work requests before $\Phi_t < 1$ is bounded by
$$m\,\lambda(\nu)\left(\log_2 \Phi_0 + \frac{1}{\ln 2}\right) + m \le m\,\nu\lambda(\nu)\left(\log_2 W + \frac{1}{2\ln 2}\right) + m,$$
where $\nu\lambda(\nu) < 3.24$ is the same constant as the bound for the unit tasks with the potential function $\sum_i w_i^\nu$ of Theorem 3.

As $\Phi_t \in \mathbb{N}$, $\Phi_t < 1$ implies that $\Phi_t = 0$. Moreover, by definition of $\Phi_t$, this implies that for all $i$: $w_i(t)^\nu - w_i(t) = 0$, which implies that for all $i$: $w_i(t) \le 1$. Therefore, once $\Phi_t$ is equal to 0, there is at most one task per processor. This phase can last for at most $p_{\max}$ units of time, generating at most $(m-1)\,p_{\max}$ work requests.

Remark. The same analysis applies for the cooperative stealing scheme of Section 5.2, leading to the same improved bound in $2.88 \log_2 n$ instead of $3.24 \log_2 n$.

7. TASKS WITH PRECEDENCES

In this section, we show how the well known non-blocking work stealing of Arora et al (2001) (denoted ABP in the sequel) can be analyzed with our method, which provides tighter bounds for the makespan. We first recall the WS scheduler of ABP, then we show how to define the amount of work on a processor $w_i(t)$, and finally we apply the analysis of Section 3 to bound the makespan.

7.1. ABP work-stealing scheduler. Following Arora et al (2001), a multithreaded computation is modeled as a directed acyclic graph $G$ with $W$ unit tasks; edges define precedence constraints. There is a single source task and the out-degree is at most 2. The critical path of $G$ is denoted by $D$. ABP schedules the DAG $G$ as follows. Each processor $i$ maintains a double-ended queue (called a deque) $Q_i$ of ready tasks. At each slot, an active processor $i$ with a non-empty deque executes the task at the bottom of its deque $Q_i$; once its execution is completed, this task is popped from the bottom of the deque, enabling (i.e. making ready) 0, 1 or 2 child tasks that are pushed at the bottom of $Q_i$. At each slot, an idle processor $j$ with an empty deque $Q_j$ becomes a thief: it performs a steal request on another randomly chosen victim deque; if the victim deque contains ready tasks, then its top-most task is popped and pushed into the deque of one of its concurrent thieves. If $j$ becomes active just after its steal request, the steal request is said to be successful.
Otherwise, $Q_j$ remains empty and the steal request fails, which may occur in the three following situations: either the victim deque $Q_i$ is empty; or $Q_i$ contains only one task, currently in execution on $i$; or, due to contention, another thief performs a successful steal request on $i$ simultaneously.

7.2. Definition of $w_i(t)$. Let us first recall the definition of the enabling tree of Arora et al (2001). If the execution of task $u$ enables task $v$, then the edge $(u, v)$ of $G$ is an enabling edge. The sub-graph of $G$ consisting of only enabling edges forms a rooted tree called the enabling tree. We denote by $h(u)$ the height of a task $u$ in the enabling tree. The root of the DAG has height $D$. Moreover, it has been shown in Arora et al (2001) that tasks in the deque have strictly decreasing height from top to bottom, except for the two bottom-most tasks which can have equal heights.

We now define $w_i(t)$, the amount of work on processor $i$ at time $t$. Let $h_t$ be the maximum height of all tasks in the deque. If the deque contains at least two tasks (including the one currently executing) we define $w_i(t) = (2\sqrt{2})^{h_t}$. If the deque contains only one task (currently executing) we define $w_i(t) = \frac{1}{2}(2\sqrt{2})^{h_t}$. The following lemma states that this definition of $w_i(t)$ behaves in a similar way to the one used for the independent unit tasks analysis of Section 4.

Lemma 1. For any active processor $i$, we have $w_i(t+1) \le w_i(t)$. Moreover, after any successful steal request from a processor $j$ on $i$, $w_i(t+1) \le w_i(t)/2$ and $w_j(t+1) \le w_i(t)/2$, and if all steal requests are unsuccessful we have $w_i(t+1) \le w_i(t)/\sqrt{2}$.
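The definition of $w_i(t)$ and the ratios claimed in Lemma 1 can be illustrated numerically. The sketch below uses a hypothetical helper `work(max_height, n_tasks)` that encodes the two cases of the definition, and checks the two configurations used in the lemma: a deque with a single task whose children have height at most $h_t - 1$, and a deque that loses its top-most task to a thief.

```python
import math

BASE = 2.0 * math.sqrt(2.0)   # the base 2*sqrt(2) used in the definition of w_i(t)


def work(max_height, n_tasks):
    """Amount of work w_i(t) for a deque whose tallest task has the given height."""
    if n_tasks == 0:
        return 0.0
    if n_tasks == 1:
        return 0.5 * BASE ** max_height   # only the task currently executing
    return BASE ** max_height             # at least two tasks


if __name__ == "__main__":
    h = 5
    before = work(h, 1)
    after = work(h - 1, 2)                # the single task enabled two children
    print(after / before, 1 / math.sqrt(2))

    victim_before = work(h, 3)
    thief_after = work(h, 1)              # the stolen task has maximum height
    victim_after = work(h - 1, 2)         # remaining tasks are strictly lower
    print(thief_after / victim_before, victim_after / victim_before)
```

The first ratio equals $1/\sqrt{2}$ up to rounding, and both ratios in the steal case are at most $1/2$, which is the behavior the potential analysis relies on.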

Proof. We first analyze the execution of one task $u$ at the bottom of the deque. Executing task $u$ enables at most two tasks and these tasks are the children of $u$ in the enabling tree. If the deque contains more than one task, the top-most task has height $h_t$ and this task is still in the deque at time $t + 1$. Thus the maximum height does not change and $w_i(t) = w_i(t+1)$. If the deque contains only one task, we have $w_i(t) = \frac{1}{2}(2\sqrt{2})^{h_t}$ and $w_i(t+1) \le (2\sqrt{2})^{h_t - 1}$. Thus $w_i(t+1) \le w_i(t)$.

We now analyze a successful steal from processor $j$. In this case, the deque of processor $i$ contains at least two tasks and $w_i(t) = (2\sqrt{2})^{h_t}$. The stolen task is one with the maximum height and is the only task in the deque of processor $j$, thus $w_j(t+1) = \frac{1}{2}(2\sqrt{2})^{h_t} \le w_i(t)/2$. For processor $i$, either its deque contains only one task after the steal, with height at most $h_t$, and $w_i(t+1) \le \frac{1}{2}(2\sqrt{2})^{h_t} \le w_i(t)/2$, or there are still at least 2 tasks and $w_i(t+1) \le (2\sqrt{2})^{h_t - 1} < w_i(t)/2$.

Finally, if all steal requests are unsuccessful, the deque of processor $i$ contains at most one task. If the deque is empty, $w_i(t+1) = w_i(t) = 0$ and thus $w_i(t+1) \le w_i(t)/\sqrt{2}$. If the deque contains exactly one task, $w_i(t) = \frac{1}{2}(2\sqrt{2})^{h_t}$ and $w_i(t+1) \le (2\sqrt{2})^{h_t - 1}$, thus $w_i(t+1) \le w_i(t)/\sqrt{2}$.

7.3. Bound on the makespan. To study the number of steals, we follow the analysis presented in Section 3 with the potential function $\Phi(t) = \sum_i w_i(t)^2$. Using results from Lemma 1, we compute the decrease of the potential $\delta_i(t)$ due to steal requests on processor $i$ by distinguishing two cases. If there is a successful steal from processor $j$,
$$\delta_i(t) = w_i(t)^2 - w_i(t+1)^2 - w_j(t+1)^2 \ge w_i(t)^2 - 2\left(\frac{w_i(t)}{2}\right)^2 \ge \frac{1}{2}\, w_i(t)^2.$$
If all steals are unsuccessful, the decrease of the potential is
$$\delta_i(t) = w_i(t)^2 - w_i(t+1)^2 \ge w_i(t)^2 - \left(\frac{w_i(t)}{\sqrt{2}}\right)^2 \ge \frac{1}{2}\, w_i(t)^2.$$
In all cases, $\delta_i(t) \ge w_i(t)^2/2$.
We obtain the expected potential at time $t + 1$ by summing the expected decrease on each active processor:
$$E[\Phi_t - \Phi_{t+1} \mid F_t] \ge \sum_{i=1}^{m} q(r_t)\, \frac{w_i(t)^2}{2},$$
$$E[\Phi_{t+1} \mid F_t] \le \left(1 - \frac{q(r_t)}{2}\right) \Phi(t).$$
Finally, we can state the following theorem.

Theorem 6. On a DAG composed of $W$ unit tasks, with critical path $D$, one source and out-degree at most 2, the makespan of ABP work stealing verifies:
(i) $E[C_{\max}] \le \frac{W}{m} + \frac{3}{1 - \log_2\left(1 + \frac{1}{e}\right)}\, D + 1 < \frac{W}{m} + 5.5\, D + 1$.
(ii) $P\left\{ C_{\max} \ge \frac{W}{m} + \frac{3}{1 - \log_2\left(1 + \frac{1}{e}\right)} \left( D + \log_2 \frac{1}{\epsilon} \right) + 1 \right\} \le \epsilon$.

Proof. The proof is a direct application of Theorem 1. As in the initial step there is only one non-empty deque, containing the root task with height $D$, the initial potential is
$$\Phi(0) = \left( \frac{1}{2} (2\sqrt{2})^{D} \right)^2.$$

Thus the expected number of steal requests before $\Phi(t) < 1$ is bounded by
$$E[R] \le \lambda m \log_2\left[ \left( \frac{1}{2} (2\sqrt{2})^{D} \right)^2 \right] + m\left(1 + \frac{\lambda}{\ln 2}\right) = 2\lambda m D \log_2(2\sqrt{2}) + m\left(1 + \frac{\lambda}{\ln 2} - 2\lambda\right) \le 3\lambda m D \quad \left(\text{as } 1 + \lambda/\ln 2 - 2\lambda < 0\right),$$
where $\lambda = \left(1 - \log_2(1 + 1/e)\right)^{-1}$ is the same constant as the bound for the unit tasks of Section 4. Moreover, when $\Phi(t) < 1$, we have $\forall i,\ w_i(t) < 1$. There is at most one task of height 0 in each deque, i.e. a leaf of the enabling tree which cannot enable any other task. This last step generates at most $m - 1$ additional steal requests. In total, the expected number of steal requests is bounded by $E[R] \le 3\lambda m D + m - 1$. The bound on the makespan is obtained using the relation $C_{\max} = \frac{W}{m} + \frac{R}{m}$. The proof of (i) applies mutatis mutandis to prove the bound in probability (ii).

Remark. In Arora et al (2001), the authors established the upper bounds
$$E[C_{\max}] \le \frac{W}{m} + 32\, D \quad \text{and} \quad P\left\{ C_{\max} \ge \frac{W}{m} + 64\, D + 16 \log_2 \frac{1}{\epsilon} \right\} \le \epsilon$$
(in Section 4.3, proof of Theorem 9). Our bounds greatly improve the constant factors of this previous result.

8. EXPERIMENTAL STUDY

The theoretical analysis gives upper bounds on the expected value of the makespan and on the deviation from the mean for the various models we considered. In this section, we study experimentally the distribution of the makespan. Statistical tests give evidence that the makespan for independent tasks follows a generalized extreme value (gev) distribution (Kotz and Nadarajah, 2001). This was expected since such a distribution arises when dealing with the maximum of random variables. For tasks with dependencies, it depends on the structure of the graph: DAGs with a short critical path still follow a gev distribution, but when the critical path grows, the distribution tends to a gaussian. We also study in more detail the overhead to $W/m$ and show that it is approximately $2.37 \log_2 W$ for unit independent tasks, which is close to the theoretical result of $3.24 \log_2 W$ (cf. Section 5).

We developed a simulator that strictly follows our model.
At the beginning, all the tasks are given to processor 0 in order to be in the worst case, i.e. when the initial potential $\Phi_0$ is maximum. Each pair $(m, W)$ is simulated 10000 times to get accurate results, with a coefficient of variation of about 2%.

8.1. Distribution of the makespan. We consider here a fixed workload $W = 2^{17}$ on $m = 2^{10}$ processors for independent tasks and $m = 2^{7}$ processors for tasks with dependencies. For the weighted model, processing times were generated randomly and uniformly between 1 and 10. For the DAG model, graphs have been generated using a layer-by-layer method. We generated two types of DAGs, one with a short critical path (close to the minimum possible $\log_2 W$) and the other one with a long critical path (around $W/(4m)$, in order to keep enough tasks per processor per layer). Fig. 4 presents histograms for $C_{\max} - W/m$. The distributions of the first three models (a, b, c in Fig. 4) are clearly not gaussian: they are asymmetrical with a heavier right tail. To fit these three models, we use the generalized extreme value (gev) distribution (Kotz and Nadarajah, 2001). In the same way as the

normal distribution arises when studying the sum of independent and identically distributed (iid) random variables, the gev distribution arises when studying the maximum of iid random variables.

[Figure 4: four histograms of density versus makespan: (a) Unit Tasks; (b) Weighted Tasks; (c) DAG (short D); (d) DAG (long D).]
FIGURE 4. Distribution of the makespan for unit independent tasks (4a), weighted independent tasks (4b) and tasks with dependencies (4c and 4d). The first three models follow a gev distribution (blue curves), the last one is gaussian (red curve).

The extreme value theorem, an equivalent of the central limit theorem for maxima, states that the maximum of iid random variables converges in distribution to a gev distribution. In our setting, the random variables measuring the load of each processor are not independent, thus the extreme value theorem cannot be applied directly. However, it is possible to fit the distribution of the makespan to a gev distribution. In Fig. 4, the fitted distributions (blue curves) closely follow the histograms. To confirm this graphical approach, we performed a goodness of fit test. The $\chi^2$ test is well-suited to our data because the distribution of the makespan is discrete. We compared the results of the best fitted gev to the best fitted gaussian. The $\chi^2$ test strongly rejects the gaussian hypothesis but does not reject the gev hypothesis, with a p-value of more than 0.5. This is consistent with the makespan following a gev distribution. We fitted the last model, DAG with long critical path, with a gaussian (red curve in Fig. 4(d)). In this last case, the completion time of each layer of the DAG should correspond to a gev distribution but the total makespan, the sum over all layers, should tend to a gaussian by the central limit theorem.
Indeed, the $\chi^2$ test does not reject the gaussian hypothesis, with a p-value around 0.3.

8.2. Study of the $\log_2 W$ term. We focus now on unit independent tasks, as the other models rely on too many parameters (the choice of the processing times for weighted tasks and the structure of the DAG for tasks with dependencies). We want to show that the number of work requests is proportional to $\log_2 W$ and study the proportionality constant. We first launch simulations with a fixed number of processors and a wide range of work in successive powers of 10. A linear regression confirms the linear dependency in $\log_2 W$ with a coefficient of determination (r squared; the closer to 1, the better) greater than 0.9999. Then, we obtain the slope of the regression for various numbers of processors. The value of the slope tends to a limit around 2.37 (cf. Fig. 5 (left)). This shows that the theoretical analysis of Theorem 3 is almost accurate, with a constant of approximately 3.24. We also study the constant factor of $\log_2 W$ for the cooperative steal of Section 5. The theoretical value of 2.88 is again close to the value obtained by simulation, 2.08 (cf. Figure 5 (left)). The difference between the theoretical and the practical values can be explained by the worst case analysis on the number of steal requests per time step in Theorem 1.
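The simulator itself is not listed in the paper. The following is a minimal sketch of what such an experiment could look like for unit independent tasks, under assumptions that are ours and not the authors': time is slotted, every processor holding work executes one task per slot, every idle processor sends one request to a uniformly chosen other processor, at most one request per victim is served, and the served thief receives half (rounded down) of the victim's remaining tasks. The function name `simulate_dls` is hypothetical.

```python
import math
import random


def simulate_dls(W, m, rng):
    """Run one simulation; return (makespan, number_of_steal_requests)."""
    w = [0] * m
    w[0] = W                                  # worst case: all tasks on processor 0
    remaining, steps, requests = W, 0, 0
    while remaining > 0:
        idle = [i for i in range(m) if w[i] == 0]
        active = {i for i in range(m) if w[i] > 0}
        for i in active:                      # each active processor runs one unit task
            w[i] -= 1
        remaining -= len(active)
        targets = {}
        for j in idle:                        # each idle processor picks another processor
            v = rng.randrange(m - 1)
            if v >= j:
                v += 1
            targets.setdefault(v, []).append(j)
        requests += len(idle)
        for v, thieves in targets.items():    # contention: one request served per victim
            if v in active and w[v] > 0:
                thief = rng.choice(thieves)
                stolen = w[v] // 2
                w[v] -= stolen
                w[thief] += stolen
        steps += 1
    return steps, requests


if __name__ == "__main__":
    rng = random.Random(1)
    W, m, runs = 4096, 64, 200
    overheads = [simulate_dls(W, m, rng)[0] - W / m for _ in range(runs)]
    mean = sum(overheads) / runs
    print("mean overhead:", mean, "ratio to log2(W):", mean / math.log2(W))
```

In this sketch every processor either executes a task or issues a request in each slot, so a run satisfies $m \cdot C_{\max} = W + R$, the relation used in the analysis. Repeating the experiment for several values of $W$ and regressing the mean overhead against $\log_2 W$ gives an estimate of the slope under these modeling assumptions, which need not coincide with the values reported above.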
