List Scheduling: The Price of Distribution

Size: px

Start display at page:

Download "List Scheduling: The Price of Distribution"

Avice Marsh
5 years ago
Views:

List Scheduling: The Price of Distribution Marc Tchiboukdjian, Denis Trystra, Jean-Louis Roch, Julien Bernard To cite this version: Marc Tchiboukdjian, Denis Trystra, Jean-Louis Roch, Julien Bernard.

1 List Scheduling: The Price of Distribution Marc Tchiboukdjian, Denis Trystra, Jean-Louis Roch, Julien Bernard To cite this version: Marc Tchiboukdjian, Denis Trystra, Jean-Louis Roch, Julien Bernard. List Scheduling: The Price of Distribution. [Research Report] RR-7208, INRIA. 2010, pp.20. <inria > HAL Id: inria Subitted on 19 Feb 2010 HAL is a ulti-disciplinary open access archive for the deposit and disseination of scientific research docuents, whether they are published or not. The docuents ay coe fro teaching and research institutions in France or abroad, or fro public or private research centers. L archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de docuents scientifiques de niveau recherche, publiés ou non, éanant des établisseents d enseigneent et de recherche français ou étrangers, des laboratoires publics ou privés.

2 INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET EN AUTOMATIQUE List Scheduling: The Price of Distribution Marc Tchiboukdjian Denis Trystra Jean-Louis Roch Julien Bernard N 7208 February 2010 Distributed and High Perforance Coputing apport de recherche ISSN ISRN INRIA/RR FR+ENG

4 List Scheduling: The Price of Distribution Marc Tchiboukdjian, Denis Trystra, Jean-Louis Roch, Julien Bernard Thee : Distributed and High Perforance Coputing Équipe-Projet MOAIS Rapport de recherche n 7208 February pages Abstract: Classical list scheduling is a very popular and efficient technique for scheduling jobs in parallel and distributed platfors. It is inherently centralized. However, with the increasing nuber of processors in new parallel platfors, the cost for anaging a single centralized list becoes too prohibitive. A suitable approach to reduce the contention is to distribute the list aong the coputational units. Thus each processor has only a local view of the work to execute. The objective of this work is to study the extra cost that ust be paid when the list is distributed aong the coputational units. We present a general ethodology for coputing the expected akespan based on the analysis of an adequate potential function which represents the load unbalance between the local lists. It is applied to several scheduling probles, naely, for arbitrary divisible load, for unit independent tasks, for weighted independent tasks and for tasks with dependencies. It is presented in detail for the siplest case of divisible load, and then extended to the other cases. More precisely, we prove that the tie for scheduling a global workload W on processors is equal to W/ with an additional ter in 4e e 1 log 2W. We provide a lower bound which shows that this is optial up to a constant factor in log 2 W. This result is extended to the case of independent unit tasks and independent weighted tasks using a siilar analysis. Moreover, we show how to adapt this ethodology to the case of precedence task graphs. This analysis allows to iprove the constant factor in the additive ter of the bound of Arora, Bluofe and Plaxton. We finally provide soe experients using a siulator. We are able to fit the distribution of the akespan with existing probability laws. Moreover, we study the additive factor and show that it is around 3log 2 W which indicate that our analysis is tight up to a factor less than3. Key-words: Scheduling, List algoriths, Distributed data, Work stealing arc.tchiboukdjian@iag.fr, CNRS and CEA/DAM,DIF denis.trystra@iag.fr, Grenoble University jean-louis.roch@iag.fr, Grenoble University julien.bernard@lifc.univ-fcote.fr, Université de Franche-Coté Centre de recherche INRIA Grenoble Rhône-Alpes 655, avenue de l Europe, Montbonnot Saint Isier Téléphone : Télécopie

5 Ordonnanceent par liste : cout de la répartition de la liste Résué : Les algorithes de liste sont une technique classique et efficace pour l ordonnanceent de tâches sur des systèes parallèles et distribués. Cette approche est fondaentaleent centralisée. A cause de l augentation du nobre de processeurs dans les nouvelles platefores parallèles, le coût de gestion d une seule liste centralisée devient prohibitif. Une technique pour diinuer la contention est de répartir la liste sur les unités de calculs. Dans ce cas, chaque processeur a seuleent une vue locale du travail à exécuter. L objectif de cet article est d étudier le surcout induit par la répartition de la liste de tâches sur les différents processeurs. On présente une éthode générale pour calculer l espérance du teps de coplétion basée sur l analyse d une fonction potentielle qui représente le déséquilibre de charge entre les listes locales. On applique cette éthode à plusieurs problèes d ordonnanceent : travail divisible, tâches unitaires indépendantes, tâches pondérées indépendantes et tâches avec dépendances. On présente dans un preier teps la éthode dans le cas siple du travail divisible puis on l étend aux autres cas. On prouve que le teps de calcul d une charge globale W sur processeurs est égal à W/ plus un tere additionnel en 4e e 1 log 2W. On présente une borne inférieure qui ontre que ce résultat est optial à un facteur constant près en log 2 W. On étend ce résultat au cas des tâches indépendantes avec une analyse siilaire. De plus, on ontre coent adapter cette éthode aux cas des graphes de dépendances. Cette analyse peret d aéliorer le facteur constant dans le tere additif de la borne de Arora, Bluofe et Plaxton. Enfin, on présente des résultats expérientaux en utilisant un siulateur. La distribution du teps de coplétion correspond à une loi de probabilité connue. De plus, on étudie le tere additif et on ontre qu il est environ de 3log 2 W indiquant que notre analyse est précise à oins d un facteur3. Ordonnanceent, Algorithes de liste, Vol de Travail, Données dis- Mots-clés : tribuées

6 List Scheduling: The Price of Distribution 3 1 Introduction Scheduling is a key issue while designing efficient parallel algoriths. The proble is to distribute the tasks of an application load) aong available coputational units and deterine at what tie they will be executed. In ost cases, the target objective is to iniize the copletion tie of the latest task to be executed called the akespan and usually denoted byc ax ). It is a hard challenging proble which received a lot of attention during the last decades [17]. Two new books have been published few onths ago on the topic [11, 20], which shows how still active the area is. List scheduling is one of the ost popular technique for scheduling the tasks of a parallel progra. This algorith has been introduced by Graha [14] and was used with profit in any further works for instance in ETF [15] with counication ties, in HEFT [23] for heterogeneous unrelated) processors, for unifor achines in [10], for parallel rigid jobs in [22]). Its principle is to build a list of ready tasks and schedule the as soon as there exist available resources. Lists scheduling are low-cost greedy) algoriths that are not too far fro optial solutions. Most proposed list algoriths differ in the way of considering the priority of the tasks for building the list, but they always consider a centralized anageent of the list. However, today the parallel and distributed platfors involve ore and ore processors. Thus, the tie needed for anaging such a centralized data structure can not be ignored anyore. We describe below the underlying coputational odel and define inforally the proble to solve. The parallel syste is coposed of a set of interconnected identical processors, that ust execute a set of tasks. The total aount of load isw and can represent a global divisible load, W unit independent tasks, weighted tasks whose su is equal to W, tasks with dependencies. Each processor owns its local queue of ready tasks and no one has a global view of the syste. When the processor becoes idle, it can ake a request of load to other processors in order to get soe tasks. If the request fails, another request can be sent at the next tie slot. Otherwise, a certain aount of load is transfered to this processor. There is no explicit counication cost in this odel, it is iplicitly taken into account in the cost of a load request. A load transfer is done in constant tie, independently of the size of the load. This assuption is not restrictive: for the case of independent tasks, the description of a large nuber of tasks can be very short. For instance a whole subpart of an array of tasks can be represented in a copact way by the range of the corresponding indices, each cell containing the effective description of a task a STL transfor in [24]). For ore general cases with dependencies, it is usually enough to transfer a root task which represents a part of the graph [3]. More details are provided in section 3. In other words, the proble can be considered as a distributed version of the classical problesp C ax andp prec,p j = 1 C ax [17]. Related works. Most related works dealing with scheduling consider centralized list algoriths. However, at execution tie, the cost for anaging the list is neglected. Practically, ipleenting such schedulers induces synchronization overheads when several processors access the list concurrently. To our knowledge, the only approach that takes into account this extra anageent cost is work stealing [9] denoted by WS in short). Contrary to classical centralized scheduling techniques, WS is a distributed algorith. Each processor anages its own list of tasks. When a processor becoes idle, it randoly chooses another processor and requests soe work. WS has been iple-

7 List Scheduling: The Price of Distribution 4 ented in any languages and parallel libraries including Cilk [12], TBB [21] and KAAPI [13]. It has been analyzed in a seinal paper of Bluofe and Leiserson [9] where they show that the expected akespan of series-parallel precedence graph with unit tasks is bounded by E[C ax ] W/ + OT ) where T is the critical path of the graph. However the precedence graph is constrained to have only one source and out-degree at ost 2 which does not easily odel the basic case of independent tasks. Siulating independent tasks with a binary tree of dependencies gives a bound of W/+OlogW) as a coplete binary tree ofw nodes has a depth oft log 2 W. However, with this approach, the structure of the binary tree dictates which tasks are stolen. Our approach achieves a siilar bound with a better constant and processors are free to choose which tasks to steal. This is essential when tasks have affinity to certain processors because of cache considerations for instance [1]. Notice that there exists other ways to analyze work stealing where the work generation is probabilist and that targets steady state results [5, 18]. Another related approach which deals with load balancing is balls into bins gaes [4, 8]. The principle is to study the axiu load whennballs are randoly thrown into bins. This is a siple distributed algorith which is far fro the scheduling probles we are interested in. First, it sees hard to extend this kind of analysis for tasks with dependencies. Second, as the load balancing is done in one phase at the beginning, the cost of coputing the schedule is not considered. Soe works [2] study parallel allocations but still do not take into account contention on the bins. Our approach, like WS, considers contention on the queues. Processors that request load on the sae reote queue are in copetition and only one can succeed. Soe works have been proposed for the analysis of algoriths in data structures and cobinatorial optiization including variants of scheduling) using potential functions. Our analysis is also based on a potential function representing the load unbalance between the local queues. This technique has been successfully used for analyzing basic load balancing [6] and WS [3] which iproved the result of [9]. There exist several other connected works for instance convergence tie to Nash equilibria in gae theory [6] or load diffusion on graphs [7] that are not presented in this paper. Contributions. List scheduling is centralized in nature. The purpose of this work is to study the effects of decentralization on list scheduling. The ain result is to present a new fraework for analyzing the distributed list scheduling algorith DLS). Based on the analysis of the load balancing between two processors during a work request, the total expected nuber of work requests can be deduced and a bound on the expected akespan is derived. The ethodology is generic and it is applied in this paper on several relevant variants of the scheduling proble. We show first that the expected akespan of DLS applied on W unit independent tasks is equal to the absolute lower bound W/ plus an additive ter in 4e e 1 log 2W 6.33log 2 W. We propose a lower bound which shows that the analysis is tight up to a constant factor. We take into account the fact that tasks are not fully divisible which akes the analysis difficult when the nuber of tasks per queue is sall. Second, we extend the previous analysis to the weighted case with unknown processing ties. The additive ter becoes O pax p in logw) where p in and p ax are the extreal values of the processing ties. Third, we provide a new analysis for the WS odel for scheduling DAGs fro [9] that slightly iproves the bound on the nuber of work requests to 4e 1) e 1 T 6.33T. Fourth, we

8 List Scheduling: The Price of Distribution 5 Figure 1: A typical execution ofw = 2000 unit independent tasks on = 25 processors using distributed list scheduling. Grey area represents idle ties. show that a balanced initial allocation induces less work requests. Finally, we give statistical evidence that the akespan of DLS on independent tasks follows a known distribution. Moreover, the experients show that the theoretical analysis for independent tasks is alost tight: the overhead to W/ is less than a factor of 3 away of the exact value. Roadap. We start by presenting the principle of the analysis in section 4, we apply this analysis on the siplest odel, naely the divisible load odel in section 5 and we adapt the proof for unit independent tasks in section 6. Then we extend the analysis for weighted independent tasks in section 7 and for tasks with dependencies in section 8. Section 9 discusses the initial repartition of tasks. We present and analyze siulation experients in section Classical List Scheduling Let us recall briefly the principle of list scheduling as it was introduced by Graha [14]. The proble is to schedule a precedence graph of n tasks whose processing ties are denoted p j for 1 j n. The analysis states that the akespan of any list algorith is not greater than twice the optial akespan. One way of proving this bound is to use a geoetric arguent on the Gantt chart: C ax = 1 j n p j +S idle where the last ter is the surface of idle periods represented in grey in figure 1). Depending on the scheduling proble with or without precedence constraints, unit tasks or not), there are several ways to copute S idle. With precedence constraints, S idle 1) T where T is the length of the longest path of the graph. For independent tasks, the results can be written as S idle 1) p ax where p ax is the axiu of the processing ties. For unit independent tasks, it is straightforward to obtain an optial algorith where the global load is evenly balanced. S idle = W/ W 1, at ost one slot of the schedule contains idle ties. When the list of ready tasks is distributed aong the processors, the analysis is ore coplicated even in the eleentary case of unit independent tasks. In this case, the extra S idle ter is induced by the distributed nature of the proble. Processors can be idle even when ready tasks are available. Fig. 1 is an exaple of a schedule obtained using distributed list scheduling which shows the coplex repartition of the idle ties S idle.

9 List Scheduling: The Price of Distribution 6 3 Model of the Distributed List Let consider a parallel platfor coposed of identical processors and a workload of n tasks with processing ties p j. The tasks can be independent or constrained by a directed acyclic graph DAG) of dependencies. We consider a non-clairvoyant odel where the processing ties and dependencies are unknown at the beginning of the coputation. The proble is to study the axiu copletion tie akespan denoted byc ax ) taking into account the scheduling cost. Instead of storing ready tasks in a centralized queue like in classical list scheduling, each processoriaintains its own local queueq i of tasks to execute. When a processor finishes the execution of a task, this task is reoved fro the queue and another one starts being processed. At the beginning of the execution, ready tasks can be arbitrarily spread aong the queues. We now define the odel of execution for anaging the distributed list. At every tie slot, if the local queue Q i is not epty, processor i picks a task and executes it. When Q i is epty, processor i sends a work request to a rando processor k. If Q k is epty or contains only one task currently executed by processor k), then the request fails and processor i will have to send a new request at the next slot. If Q k contains ore than one task, theniis given half of the tasks and it will restart a noral execution at the next slot. It is iportant to notice that the underlying echaniss to send work requests and share work are not specified and could be ipleented in any ways. To odel the contention on the queues, which was pointed out before as the bottleneck in centralized list scheduling, no ore than one request per queue can succeed in the sae tie slot. If several requests target the sae queue, a rando one succeeds and all the others fail. We now introduce soe notations. At tie t, let w i t) be the aount of work in queue Q i, i.e. w i t) = j Q p it) j, and let wt) = i=1 w it) be the total work in all queues. The initial workload W is equal to w0). At tie t, let αt) be the nuber of active processors, i.e. the nuber of processors with a non-epty queue. 4 Principle of the Analysis The analysis is based on a potential function Φ representing the load unbalance between the queues. At each tie slot, both tasks execution by busy processors e Φ in lea 5.1) and work requests by idle processors r Φ in lea 5.2) contribute to the diinution of the potential. To copute the potential decrease due to work requests, we copute the expected decrease of the potential due to work requests that targets the sae active processor δ i in eq. 1). When the potential reaches0no ore work requests occurs until the end of the schedule. Using a probabilistic arguent, we bound the expected nuber of work requets E[R] theore 5.3) which leads to an upper bound on the expected akespan E[C ax ] theore 5.4) with the geoetric analysis presented in section 2. First, we define a potential function Φt) that is, to a ultiplicative factor, the variance of the queue sizes. Definition 4.1 At tiet, the potential functionφis defined as: Φt) = w i t) wt) i=1 ) 2

10 List Scheduling: The Price of Distribution 7 Moreover, let Φt) = Φt) Φt+1). We report below three useful properties on the potential functionφ. Property 4.2 When Φt) = 0, we have i w i t) = wt)/. No ore work requests occurs until the end of the schedule as each processor has the sae aount of work. Property 4.3 The potential function is axial when all the work is in a single queue. Φt) 1 1 ) wt) ) W 2 Property 4.4 The potential function can also be written: Φt) = wit) 2 w2 t) 5 Divisible Workload i=1 In this section, we detail the analysis of DLS on the siplest odel, i.e. when the tasks are in a whole divisible workload. It is like independent unit tasks except that the tasks can be divided evenly when a work request occurs, even if the nuber of tasks is odd cf. eq. 1). Lea 5.1 Ipact of tasks execution on Φ) The execution of one slot of the schedule, without taking into account work requests, decreases the potential function by: e Φt) = 1 αt) ) ) 2 wt) αt) 0 This proof is obtained by a direct calculus see appendix A). Lea 5.2 Ipact of work requests onφ) The expected diinution of the potential function due to work requests can be bounded by [ ] E r Φt) ) αt) ) Φt) 1 Proof. Let δ i t) denote the decrease of the potential function due to work requests on processor i. If i is not requested then δ i t) = 0. If i receives at least one request, we have: δ i t) = w i t) wt) ) wt) ) 2 ) wi t) 2 = w it) 2 2 wt) ) 2 wi t) + 2 wt) ) 2 ) The probability p r that a processor receives a work request if αt) processors are idle is 1 { } p r = 1 P i is not requested by k k,w k t)=0 = ) αt) 2) 1 { } 1 P event denotes the probability of an event 1)

11 List Scheduling: The Price of Distribution 8 Fro equations 1 and 2, we can deduce the expected diinution due to work requests on processori. [ ] E δ i t) = w it) 2 p r 2 Using the linearity of expectation, we copute the expected diinution of the potential due to all work requests. [ ] [ ] E r Φt) = E δ i t) = p r w 2 2 it) 3) i,w it) 1 i Using property 4.4, we get [ ] E r Φt) = 1 2 p r Φt)+ w2 t) ) 1 2 p r Φt) 4) Theore 5.3 Work Requests) The expected nuber of work requestsris at ost [ ] E R 2e 1) log e 1 2 Φ0)+ 1 Proof. We divide the schedule into phases fro 1 to N where the potential function is halved in each phase. In phasei, we have and thus, as e Φt) 0 cf. lea 5.1), Φ0) 2 i < Φt) Φ0) 2 i 1 Φt) = e Φt)+ r Φt) r Φt) ) αt) ) Φ0) 2 i 5) Let us denote the function in the right hand side of eq. 5 byψα). We analyze now the expected nuber of work requests R i in phase i. We do not take into account tie steps where αt) = as they do not produce work requests. Notice that Φt) = 0 when αt) = cf. lea 5.1). Let t denote the tie steps when αt). Let T i be the rando variable indicating the end of phasei, { T i = in t Φt) Φ0) 2 i } The nuber of work requests is equal to the nuber of idle processors at each tie step, R i = T i t=t i 1 αt) As it is difficult to study the sequence ofαt), we will assue that an adversary can choose eachαt) knowing the history of the syste but not the future rando choices. ] We construct a Markov decision process which leads to an upper bound on E [R i

12 List Scheduling: The Price of Distribution 9 as follows see [19] for ore details on Markov decision process). The state of the syste is Φt). At each epoch, the adversary chooses an action αt) and is rewarded αt) which corresponds to the nuber of work requests. The syste goes in state Φt+1) = Φt) Φt,αt)) where Φt,αt)) is a rando variable corresponding to the diinution of the potential function. Let u t Φt)) the expected reward obtained following the optial policy fro state Φt) at decision epochs t,t + 1,...,T. u t satisfies the optiality equation Chapter4of [19]): u t Φt)) = { ax αt)+ 1 αt) 1 δ We prove in appendix B using backwards induction ontthat whereλis such that u t Φt)) λ Φt) Φ0) ) 2 i { }} u t+1 Φt) δ)p Φt,αt)) = δ 1 α 1, α λψα) 0 6) As the optial policy of the Markov decision process optiizes the expected nuber of work requests over all sequences ofαt), the expected nuber of work requests for a fixed sequence ofαt) is bounded by the optial policy. ] Φ0) ) E [R i u 0 2 i 1 λ Φ0) 2 i Φ0) 2 i ax 1 α 1 2 e 1) e 1 α 1 2 p r Φ0) 2 i cf. appendix C) 7) Overall, we can bound the total nuber of work requestsr untilφt) 1. [ E R ] = N i=1 ] E [R i log 2 Φ0) 2e 1) e 1 In the last phase, i.e. Φt) 1, when a processor becoes idle, we have i,w i t) 1. Thus, there are at ost 1 work requests in the last phase, which leads to the claied bound. Finally, we are now able to bound the expected value of the akespan. Theore 5.4 The expected akespan for a divisible workload W scheduled by DLS is bounded by ] E [C ax W + 4e e 1 log 2W +1 Proof. As Φ0) 1 1 ) W2 W 2, we can bound the expected nuber of work requests by [ ] E R log 2 W 2 2e 1) + 1 e 1 We obtain the claied bound by using the basic relationc ax = W +R.

13 List Scheduling: The Price of Distribution 10 6 Independent Unit Tasks The previous analysis assues all queues can be evenly divided during a work request. This is not exactly true for queues of odd size. When we take this issue into account, we obtain the following theore. Theore 6.1 The expected akespan for W unit independent tasks scheduled by DLS is bounded by ] E [C ax W + 4e e 1 log 2W + Olog) This is optial up to a constant factor inlog 2 W. Proof. In the previous analysis in section 5, eq. 1 can be rewritten for unit non-divisible tasks into δ i t) = w i t) 2 wi t) 2 wi t) 2 w i t) Eq. 4 becoes [ ] E r Φt) 1 2 p r Φt)+ 1 ) wt)2 αt) The additional ter in αt) is due to the su of 1 2 on all active processors. As αt), the lower bound of lea 5.2 still holds as long as wt). When wt) <, we can obtain a relaxed version of lea 5.2 [ ] E r Φt) 3 8 p r Φt) 8) as long as αt) wt)/2 see proof in appendix D). As αt) > wt)/2 can happen only log 2 ties when wt) <, we obtain the claied bound. The overhead in Olog) coes fro the relaxed equation 8 applied when wt) < and the additionallog 2 when equation 8 is not valid. We now give a lower bound for this proble. Consider W = 2 k+1 tasks and = 2 k processors. All the tasks are on the sae processor at the beginning. In the best case, all work requests target processors with highest loads, the akespan is C ax = k + 2. The first k = log 2 steps for each processor to get soe work, one step where all processors are active and one last step where only one processor is active. ThusC ax W +log 2W 1. 7 Weighted independent tasks In this section, we analyze the nuber of work requests for weighted independent tasks. Each task j has a processing tie p j which is unknown. Let p in and p ax be the iniu and axiu processing ties. During a work request, half of the tasks are transfered fro the active processor to the idle processor. As the processing ties are unknown, the work cannot be shared evenly between both processors and can be as bad as getting all the sallest tasks. However, we can adapt the previous analysis to the case of weighted tasks and state the following theore. Theore 7.1 The expected akespan for weighted independent tasks of total processing tiew scheduled by DLS is bounded by ] E [C ax W + O pax ) logw +p ax log p in

14 List Scheduling: The Price of Distribution 11 Proof. We first exaine the load balancing during a work request fro processor j to another processor i. Assue i has at least two tasks in its queue at least one task can be transfered). Letγ be the fraction of work sent to processorj. Eq. 1 becoes δ i t) = w i t) wt) ) wt) ) 2 γw i t) wt ) 2 1 γ)w i t) wt) = 2γ1 γ)wit) 2 Processorj receives half of the tasks of processori, but in the worst case we have p in γ = 2p ax +p in i just started processing one of its two tasks of weight p ax and j gets the last task of weightp in. Whenihas only one task currently in process no task can be transfered) the potential function does not decrease. As this task can be of weightp ax, we have δ i t) 2γ1 γ) wit) p 2 2 ax) p in p ax 4 2p ax +p in ) 2 w2 it) p 2 ax) 4 9 pin p ax w 2 it) p 2 ax) We obtain a new version of lea 5.2, [ ] E r Φt) 4 9 pin p r p ax As long aswt) p ax, 4 9 pin p ax p r i,w it) 1 ) 2 ) wit) p 2 2 ax Φt)+ w2 t) αp2 ax [ ] E r Φt) 4 9 pin p r Φt) p ax and when wt) < p ax, we can obtain a relaxed version of the bound like in section 6. The overhead due to non-divisible tasks becoes Op ax log) and we obtain the expected akespan. ) ] E [C ax W + 9e 2e 1) p ax p in log 2 Φ 0 + Op ax log) As in theore 5.4, we can boundφ 0 byw 2, which leads to the result. 8 Tasks with Precedences In this section, we show how the well known non-blocking work stealing of Arora et al. [3] denoted ABP in the sequel) can be analyzed with our ethod which gives a slightly tighter bound for the expected akespan. Following [3], a ultithreaded

15 List Scheduling: The Price of Distribution 12 coputation is odeled as a directed acyclic graph G with a single root node; each node corresponds to a unit task and edges define precedence constraints. The outdegree of each node is either 0, 1 or 2. The critical path of G is denoted T and W is its total nuber of nodes. ABP schedules the DAG G as follows. Each process i aintains a double-ended queue deque Q i of ready nodes the notion of process here corresponds to our processors). At each top, an active process i with a non-epty deque executes the node at the botto of its deque Q i ; once its execution is copleted, this node is popped fro the botto of the deque, enabling i.e. aking ready 0, 1 or 2 child nodes that are pushed at the botto of Q i. At each top, an idle process j with an epty deque Q j becoes a theft: it perfors a work request on another randoly chosen victi deque; if the victi deque contains ready nodes, then its top-ost node is popped and pushed into the deque of one of its concurrent thefts. If j becoes active just after its work request, the work request is said successful. Otherwise,Q j reains epty and the work request failed which ay occur in the three following situations: either the victi deque Q i is epty; or, Q i contains only one node currently in execution on i; or, due to contention, another theft perfors a successful work request onisiultaneously. Let us first recall the definition of the enabling tree of [3]. If the execution of node u enables node v, then the edge u,v) of G is an enabling edge. The sub-graph of G consisting of only enabling edges fors a rooted tree called the enabling tree. Following [3], if du) is the depth of a node u in the enabling tree, then its weight is defined asωu) = T du). The weight of the root node ist. To represent the aount of work on each processor, we use a potential workw i t) = 2 ax{wu):u Qit)}, i.e. the axiu nuber of nodes that can be activated by a task inq i. We first need to study the repartition of the potential work during a work request the equivalent of eq. 1). Lea 8.1 For any active processi, we havew i t+1) w i t). Moreover, after any work request fro a process j on i, w i t+1) w it) 2 and w j t+1) w it) 2. Proof. The proof is derived fro [3], corollary 4 in section 3: if at t, Q i contains the k + 1 nodes v 0,v 1,...,v k fro botto to top, then ωv 0 ) ωv 1 ) <... < ωv k 1 ) < ωv k ). After the execution of a node u, the axiu weight of its two enabled children is less thanωu) 1. Thus the potential work w i cannot increase. We now state that the potential is halved after any work request by distinguishing two cases. First, when a successful steal occurs on i fro j, then the node v k has been stolen and executed by j. Thus, either w i t + 1) = 0 if Q i is epty at t + 1; or w i t+1) = 2 ωvk 1) 2 ωvk) 1 w i t)/2. Besides, after execution of v k by j, w j t + 1) 2 ωvk) 1 w i t)/2. Secondly, if all work requests that occur on i are unsuccessful, then there was only one node v 0 in Q i whose execution was processed byi. Then, att+1,w i t+1) 2 ωv0) 1 w i t)/2 and w j t+1) = 0. We can now state the following theore. Theore 8.2 The expected akespan verifies: ] E [C ax W + 4e e 1 T +1 < W +6.33T +1.

16 List Scheduling: The Price of Distribution 13 Proof. The proof is a direct application of theores 5.3 and 5.4 using a odified potential function. Note that we cannot use directly the sae as before defined in 4.1) because the total aount of potential work ay be reduced when a steal occurs 2. Yet let Φt) = i w2 i t) be the potential. Fro lea 8.1, we know that the reduction of the [ potential ] due to a work request oniis at leastδ i t) wi 2 t)/2, thus eq. 3 becoes E r Φt) p r Φt)/2. As Φ0) = 2 2T ), fro theore 5.3, we get the expected [ ] nuber of work requests E R ] E [C ax 4e 1) e 1 W + 4e e 1 T +1 which copletes the proof. T + 1 and the expected akespan Reark. Additionally, [3] stated that [ C ax ] W = OT ) with high probability, but also established the upper bound E C ax W < 32T see [3], proof of theore [ ] 9 in section 4.3). Yet, the upper bound E C ax W < 6.33T + 1 iproves the previous upper bound fro [3]. 9 Initial Repartition of Tasks In the previous sections, we assued that all tasks are on the sae processor at the beginning of the execution. In this section, we use the bounds in ters of Φ 0 to show that a ore balanced initial repartition leads to fewer work requests on average. Let us first focus on unit independent tasks. Recall the result of theore 6.1 for the worst initial repartition. We take a balls-and-bins assignent as the initial repartition. For each task, we choose uniforly a rando processor and put the task in its queue. We copute the expected value ofφ 0 using property 4.4. ] E [Φ 0 = [ E wi ] 2 W2 i = ] ] 2 Var [w i +E [w i ) W2 i ] = Var [w 1 = 1 1 ) W as w i follows a binoial distribution. Since the nuber of work requests is proportional to log 2 Φ 0, this initial distribution of tasks reduces the nuber of work requests by a factor of2on average. Let now consider weighted tasks. One can show that putting task j on a rando processor increasesφ 0 by 1 1 ) p2 j on average and thus, we obtain ] E [Φ 0 = 1 1 ) p 2 j j 1 1 ) p j p ax j 1 1 ) W p ax 2 This can happen when the sibling of the stolen task has only one child.

17 List Scheduling: The Price of Distribution 14 Density Makespan Density Makespan Density Makespan Density Makespan a) Unit Tasks b) Weighted Tasks c) DAG short T ) d) DAG long T ) Figure 2: Distribution of the akespan for unit independent tasks 2a), weighted independent tasks 2b) and tasks with dependencies 2c) and 2d). The first three odels follow a gev distribution blue curves), the last one is gaussian red curve). which reduces the nuber of work requests of theore 7.1 by roughly a factor of 2 when p ax is uch lower thanw. The case with precedence constraints is not analyzed here because there is only one task initially the unique source of the DAG). 10 Experiental Study The theoretical analysis gives an upper bound on the expected value of the akespan for the various odels we considered. In this section, we study experientally the distribution of the akespan. Statistical tests give evidence that the akespan for independent tasks follows a generalized extree value gev) distribution [16]. This was expected since such a distribution arises when dealing with axiu of rando variables. For tasks with dependencies, it depends on the structure of the graph: DAGs with short critical path still follow a gev distribution but when the critical path grows, it tends to a gaussian distribution. We also study in ore details the overhead tow/ and show that it is approxiately3log 2 W for unit independent tasks which is close to 4e the theoretical result of e 1 log 2W. We developed a siulator that strictly follows our odel. At the beginning, all the tasks are given to processor 0 in order to be in the worst case, i.e. when the initial potentialφ 0 is axiu. Distribution of the Makespan. We consider here a fixed workload W = 2 17 on = 2 10 processors for independent tasks and = 2 7 processors for tasks with dependencies. For the weighted odel, processing ties were generated randoly and uniforly between 1 and 10. For the DAG odel, graphs have been generated using a layer by layer ethod. We generated two types of DAGs, one with a short critical path close to the iniu possible log 2 W ) and the other one with a long critical path around W/4 in order to keep enough tasks per processor per layer). For each workload unit, weighted and DAG), we executed the siulator 10, 000 ties to obtain the distribution of the akespan. Fig. 2 presents histogras forc ax W/. The distributions of the first three odels a,b,c in fig. 2) are clearly not gaussian: they are asyetrical with an heavier right tail. To fit these three odels, we use the generalized extree value gev) distribution [16]. In the sae way as the noral distribution arises when studying the su of independent and identically distributed iid)

18 List Scheduling: The Price of Distribution 15 Slope Nuber of Processors Mean 95% 99% Figure 3: Proportionality constant oflog 2 W against the nuber of processors in logarithic scale). rando variables, the gev distribution arises when studying the axiu of iid rando variables. The extree value theore, an equivalent of the central liit theore for axia, states that the axiu of iid rando variables converges in distribution to a gev distribution. In our setting, the rando variables easuring the load of each processor are not independent, thus the extree value theore cannot apply directly. However, it is possible to fit the distribution of the akespan to a gev distribution. In fig. 2, the fitted distributions blue curve) closely follow the histogras. To confir this graphical approach, we perfored a goodness of fit test. The χ 2 test is well-suited to our data because the distribution of the akespan is discrete. We copared the results of the best fitted gev to the best fitted gaussian. The χ 2 test strongly rejects the gaussian hypothesis but does not reject the gev hypothesis with a p-value of ore than 0.5. This confirs that the akespan follows a gev distribution. We fitted the last odel, DAG with long critical path, with a gaussian red curve in fig. 2d)). In this last case, the copletion tie of each layer of the DAG should correspond to a gev distribution but the total akespan, the sus of all layers, should tend to a gaussian by the central liit theore. Indeed theχ 2 test does not reject the gaussian hypothesis with a p-value around 0.3. Study of the log 2 W ter. We focus now on unit independent tasks as the other odels rely on too any paraeters the choice of the processing ties for weighted tasks and the structure of the DAG for tasks with dependencies). We want to show that the nuber of work requests is proportional to log 2 W and study the proportionality constant. We first launch siulations with a fixed nuber of processors and a wide range of work in successive powers of two. A linear regression confirs the linear dependency in log 2 W with a coefficient of deterination r squared ) greater than Then, we obtain the slope of the regression for various nuber of processors. The value of the slope tends to a liit between 2 and 3 cf. solid curve in fig. 3). This shows that the theoretical analysis of theore 6.1 is alost accurate with a constant 4e of e The difference between the theoretical and the practical value can be explained by the use of an adversary in theore 5.3. We also copute the slope taking the quantile at 99% instead of the ean i.e. 99% of the 10, 000 siulations have a saller akespan than the considered value). 3 the closest to 1, the better

19 List Scheduling: The Price of Distribution 16 This experient shows that the constant factor of log 2 W is less than 3 in 99% of all siulations cf. dotted curve in fig. 3). 11 Concluding Rearks We have presented in this paper a new analysis of the distributed list scheduling algoriths. It was presented in detail for the case of a global divisible workload W on processors. The ain result was to prove that the expected akespan is no ore than the best possible absolute lower boundw/ plus an additive ter in 4e e 1 log 2W very close to the value obtained by siulation. This technique was successfully extended for unit and weighted independent tasks and for tasks with dependencies. In all cases, we succeeded to characterize precisely the overhead due to the decentralization of the list of ready tasks. We also believe that this work will help to clarify the links between classical list scheduling and work stealing. Another interesting issue is that we developed a siulator for better characterizing the behaviour of the DLS algoriths. In particular, we gave statistical evidence that the distribution of the akespan for independent tasks follows a gev distribution. Based on the results of this paper, it reains challenging open probles to study. First, in the case of weighted independent tasks, it would be interesting to study the ipact of local policies to anage the queues on the global perforance. It would also be of interest to analyze a odified work stealing algorith where idle processors steal tasks using an affinity criterion. We believe that our analysis is well-suited for affinity scheduling because processors are free to steal any tasks of the queue. References [1] U. A. Acar, G. E. Blelloch, and R. D. Bluofe. The data locality of work stealing. In Proceedings of SPAA 00, pages 1 12, [2] M. Adler, S. Chakrabarti, M. Mitzenacher, and L. Rasussen. Parallel randoized load balancing. In Proceedings of STOC 95, pages , [3] N. S. Arora, R. D. Bluofe, and C. G. Plaxton. Thread scheduling for ultiprograed ultiprocessors. Theory of Coputing Systes, 342): , [4] Y. Azar, A. Z. Broder, A. R. Karlin, and E. Upfal. Balanced allocations. SIAM Journal on Coputing, 291): , [5] P. Berenbrink, T. Friedetzky, and L. A. Goldberg. The natural work-stealing algorith is stable. SIAM Journal of Coputing, 325): , [6] P. Berenbrink, T. Friedetzky, L. A. Goldberg, P. W. Goldberg, Z. Hu, and R. Martin. Distributed selfish load balancing. SIAM Journal on Coputing, 374): , [7] P. Berenbrink, T. Friedetzky, and Z. Hu. A new analytical ethod for parallel, diffusion-type load balancing. Journal of Parallel and Distributed Coputing, 691):54 61, [8] P. Berenbrink, T. Friedetzky, Z. Hu, and R. Martin. On weighted balls-into-bins gaes. Theoretical Coputer Science, 4093): , 2008.

20 List Scheduling: The Price of Distribution 17 [9] R. D. Bluofe and C. E. Leiserson. Scheduling ultithreaded coputations by work stealing. Journal of the ACM, 465): , [10] C. Chekuri and M. Bender. An efficient approxiation algorith for iniizing akespan on uniforly related achines. Journal of Algoriths, 412): , [11] M. Drozdowski. Scheduling for Parallel Processing. Springer, [12] M. Frigo, C. E. Leiserson, and K. H. Randall. The ipleentation of the Cilk-5 ultithreaded language. In Proceedings of PLDI 98, [13] T. Gautier, X. Besseron, and L. Pigeon. KAAPI: A thread scheduling runtie syste for data flow coputations on cluster of ulti-processors. In Proceedings of PASCO 07, pages 15 23, [14] R. L. Graha. Bounds on ultiprocessing tiing anoalies. SIAM Journal on Applied Matheatics, 17: , [15] J.-J. Hwang, Y.-C. Chow, F. D. Anger, and C.-Y. Lee. Scheduling precedence graphs in systes with interprocessor counication ties. SIAM Journal on Coputing, 182): , [16] S. Kotz and S. Nadarajah. Extree Value Distributions: Theory and Applications. World Scientific Publishing Copany, [17] J. Leung. Handbook of Scheduling: Algoriths, Models, and Perforance Analysis. CRC Press, [18] M. Mitzenacher. Analyses of load stealing odels based on differential equations. In Proceedings of SPAA 98, pages , [19] M. L. Puteran. Markov Decision Processes : Discrete Stochastic Dynaic Prograing. Wiley, [20] Y. Robert and F. Vivien. Introduction to Scheduling. Chapan & Hall/CRC Press, [21] A. Robison, M. Voss, and A. Kukanov. Optiization via reflection on work stealing in TBB. In Proceedings of IPDPS 08, pages 1 8, [22] U. Schwiegelshohn, A. Tchernykh, and R. Yahyapour. Online scheduling in grids. In Proceedings of IPDPS 08, April [23] H. Topcuoglu, S. Hariri, and M.-Y. Wu. Task scheduling algoriths for heterogeneous processors. In Proceedings of the Eighth Heterogeneous Coputing Workshop, page 3, [24] D. Traoré, J.-L. Roch, N. Maillard, T. Gautier, and J. Bernard. Deque-free workoptial parallel stl algoriths. In Proceedings of Euro-Par 08, pages , 2008.

21 List Scheduling: The Price of Distribution 18 A Proof of lea 5.1 Each of theαt) active processors execute one task,wt+1) = wt) αt). Φt+1) = Thus, i,q it) 1 = i,q it) 1 q i t) 1 wt) αt) q i t) wt) 1 αt) )) 2 + αt)) 0 wt) + αt) = Φt) 2 1 αt) ) i,q it) 1 ) 2 ) 2 + αt)) q i t) wt) αt) ) 2 αt) + αt)) 2 wt) ) e Φt) = 2 1 αt) ) i,q it) 1 q i t) wt) ) +αt) ) αt) αt) ) 2 αt) αt)) +2 wt) ) = 1 αt) ) 2 q i t) wt) ) = αt) 1 αt) i,q it) 1 ) αt)2 1 αt) ) 2 wt) 2 αt) wt) αt) +2 αt)+ αt)2 αt)2 = 1 αt) ) ) 2 wt) αt) +2 αt) wt) ) wt) ) Moreover, as αt) andwt) αt), we have e Φt) 0. B Induction step of theore 5.3 We prove using backwards induction ontthat whereλis such that u t Φt)) λ Φt) Φ0) ) 2 i 1 α 1, α λψα) 0 0 wt) αt) 1 αt) 1 αt) ) 2 ) 2 ) 2

22 List Scheduling: The Price of Distribution 19 At the end of phase i, u T Φ0) 2 ) = 0 as there are no ore work requests in this i phase. u T satisfies the base case. We prove the induction step in the following. u t Φt)) = ax 1 αt) 1 ax 1 αt) 1 = ax 1 αt) 1 { αt)+ { }} u t+1 Φt) δ)p Φt,αt)) = δ δ { αt)+ λ Φt) δ Φ0) ) { }} 2 i P Φt,αt)) = δ δ { αt)+λ Φt) Φ0) ) { } 2 i P Φt,αt)) = δ δ { }} δp Φt,αt)) = δ λ δ { = ax αt)+λ Φt) Φ0) ) 1 αt) 1 2 i { = ax αt)+λ Φt) Φ0) ) 1 αt) 1 2 i = λ Φt) Φ0) ) 2 i λ Φt) Φ0) ) 2 i [ ]} λe Φt,αt)) } λψαt)) { + ax 1 αt) 1 αt) λψαt)) by definition ofλin eq. 6) C Proof of inequality 7 of theore 5.3 Let define f by α fα) = ) α and copute the derivativef ) α ) 2 f α) 1 = ) α ) 1 + α) ln 1 1 ) 1 1 ) α 1 1 = ) α 1+ α)ln 1 1 )) ) α 1 ) 1+ α) ) α α As f is non increasing for1 α 1, ax 1 α 1 α ) α = f1) = } ) 1 Moreover 1 1 ) 1 = exp 1)ln 1 1 )) 1 )) exp 1) e

23 List Scheduling: The Price of Distribution 20 and thus ax 1 α 1 α ) α e e 1) e 1 D Proof of eq. 8 When αt) processors have an epty queue, we can boundφt) by Φt) αt)) 1 w 2 αt) 1 ) 0 wt) ) 2 wt) +αt) αt) wt) ) 2 9) We study in the following the sign of expressiona which as the sae sign as A = αt) wt)2 wt)2 1 4 αt) 1 ) αt) 2 wt)2 1 1 ) α wt)2 4 4 We copute the of this polynoial inαt) = wt) ) 2 +wt) 2 > 0 4 We denote the negative root byα 1 and the positive root byα 2. α 2 = 1 wt) ) + ) > wt) 4 2 as > wt) 2 ) So expressionais negative whenαt) α 2 and positive whenαt) > α 2. Thus when αt) wt)/2, αt) wt)2 wt)2 1 4 αt) 1 ) 0 αt) wt)2 1 Φt) 0 c.f. eq. 9) 4 Φt)+ wt)2 αt) 3 4 Φt) [ ] E r Φt) ) αt) ) Φt) 1

24 Centre de recherche INRIA Grenoble Rhône-Alpes 655, avenue de l Europe Montbonnot Saint-Isier France) Centre de recherche INRIA Bordeaux Sud Ouest : Doaine Universitaire - 351, cours de la Libération Talence Cedex Centre de recherche INRIA Lille Nord Europe : Parc Scientifique de la Haute Borne - 40, avenue Halley Villeneuve d Ascq Centre de recherche INRIA Nancy Grand Est : LORIA, Technopôle de Nancy-Brabois - Capus scientifique 615, rue du Jardin Botanique - BP Villers-lès-Nancy Cedex Centre de recherche INRIA Paris Rocquencourt : Doaine de Voluceau - Rocquencourt - BP Le Chesnay Cedex Centre de recherche INRIA Rennes Bretagne Atlantique : IRISA, Capus universitaire de Beaulieu Rennes Cedex Centre de recherche INRIA Saclay Île-de-France : Parc Orsay Université - ZAC des Vignes : 4, rue Jacques Monod Orsay Cedex Centre de recherche INRIA Sophia Antipolis Méditerranée : 2004, route des Lucioles - BP Sophia Antipolis Cedex Éditeur INRIA - Doaine de Voluceau - Rocquencourt, BP Le Chesnay Cedex France) ISSN

arxiv: v1 [cs.dc] 19 Jul 2011

arxiv: v1 [cs.dc] 19 Jul 2011 DECENTRALIZED LIST SCHEDULING MARC TCHIBOUKDJIAN, NICOLAS GAST, AND DENIS TRYSTRAM arxiv:1107.3734v1 [cs.dc] 19 Jul 2011 ABSTRACT. Classical list scheduling is a very popular and efficient technique for