Real-Time Scheduling of Parallel Tasks under a General DAG Model

Size: px

Start display at page:

Download "Real-Time Scheduling of Parallel Tasks under a General DAG Model"

Josephine Pierce
6 years ago
Views:

1 Washngton Unversty n St. Lous Washngton Unversty Open Scholarshp All Computer Scence and Engneerng Research Computer Scence and Engneerng Report Number: WUCSE Real-Tme Schedulng of Parallel Tasks under a General DAG Model Authors: Abusayeed Safullah, Davd Ferry, Chenyang Lu, and Chrstopher Gll Due to ther potental to delver ncreased performance over sngle-core processors, mult-core processors have become manstream n processor desgn. Computaton-ntensve real-tme systems must explot ntratask parallelsm to take full advantage of mult-core processng. However, exstng results n real-tme schedulng of parallel tasks focus on restrctve task models such as the synchronous model where a task s a sequence of alternatng parallel and sequental segments, and parallel segments have threads of executon that are of equal length. In ths paper, we address a general model for determnstc parallel tasks, where a task s represented as a DAG wth dfferent nodes havng dfferent executon requrements. We make several key contrbutons towards both preemptve and non-preemptve realtme schedulng of DAG tasks on mult-core processors. Frst, we propose a task decomposton that splts a DAG nto sequental tasks. Second, we prove that parallel tasks, upon decomposton, can be scheduled usng preemptve global EDF wth a resource augmentaton bound of 4. Ths bound s as good as the best known bound for more restrctve models, and s the frst... Read complete abstract on page. Follow ths and addtonal works at: Part of the Computer Engneerng Commons, and the Computer Scences Commons Recommended Ctaton Safullah, Abusayeed; Ferry, Davd; Lu, Chenyang; and Gll, Chrstopher, "Real-Tme Schedulng of Parallel Tasks under a General DAG Model" Report Number: WUCSE (01). All Computer Scence and Engneerng Research. Department of Computer Scence & Engneerng - Washngton Unversty n St. Lous Campus Box St. Lous, MO ph: (314)

2 Real-Tme Schedulng of Parallel Tasks under a General DAG Model Complete Abstract: Due to ther potental to delver ncreased performance over sngle-core processors, mult-core processors have become manstream n processor desgn. Computaton-ntensve real-tme systems must explot ntratask parallelsm to take full advantage of mult-core processng. However, exstng results n real-tme schedulng of parallel tasks focus on restrctve task models such as the synchronous model where a task s a sequence of alternatng parallel and sequental segments, and parallel segments have threads of executon that are of equal length. In ths paper, we address a general model for determnstc parallel tasks, where a task s represented as a DAG wth dfferent nodes havng dfferent executon requrements. We make several key contrbutons towards both preemptve and non-preemptve realtme schedulng of DAG tasks on mult-core processors. Frst, we propose a task decomposton that splts a DAG nto sequental tasks. Second, we prove that parallel tasks, upon decomposton, can be scheduled usng preemptve global EDF wth a resource augmentaton bound of 4. Ths bound s as good as the best known bound for more restrctve models, and s the frst for a general DAG model. Thrd, we prove that the decomposton has a resource augmentaton bound of 4 plus a non-preempton overhead for non-preemptve global EDF schedulng. To our knowledge, ths s the frst resource augmentaton bound for nonpreemptve schedulng of parallel tasks. Through smulatons, we demonstrate that the acheved bounds are safe and suffcent. Ths techncal report s avalable at Washngton Unversty Open Scholarshp:

3 Department of Computer Scence & Engneerng Real-Tme Schedulng of Parallel Tasks under a General DAG Model Authors: Abusayeed Safullah, Davd Ferry, Kunal Agrawal, Chenyang Lu, and Chrstopher Gll Correspondng Author: safullah@wustl.edu Web Page: Abstract: Due to ther potental to delver ncreased performance over sngle-core processors, mult-core processors have become manstream n processor desgn. Computaton-ntensve real-tme systems must explot ntra-task parallelsm to take full advantage of mult-core processng. However, exstng results n real-tme schedulng of parallel tasks focus on restrctve task models such as the synchronous model where a task s a sequence of alternatng parallel and sequental segments, and parallel segments have threads of executon that are of equal length. In ths paper, we address a general model for determnstc parallel tasks, where a task s represented as a DAG wth dfferent nodes havng dfferent executon requrements. We make several key contrbutons towards both preemptve and non-preemptve realtme schedulng of DAG tasks on mult-core processors. Frst, we propose a task decomposton that splts a DAG nto sequental tasks. Second, we prove that parallel tasks, upon decomposton, can be scheduled usng preemptve global EDF wth a resource augmentaton bound of 4. Ths bound s as good as the best known bound for more restrctve models, and s the frst for a general DAG model. Thrd, we prove that the decomposton has a resource augmentaton bound of 4 plus a non-preempton overhead for non-preemptve global EDF schedulng. To our knowledge, ths s the frst resource augmentaton bound for nonpreemptve schedulng of parallel tasks. Through smulatons, Type of Report: Other Department of Computer Scence & Engneerng - Washngton Unversty n St. Lous Campus Box St. Lous, MO ph: (314)

4 Real-Tme Schedulng of Parallel Tasks under a General DAG Model Abusayeed Safullah, Davd Ferry, Kunal Agrawal, Chenyang Lu, and Chrstopher Gll Department of Computer Scence and Engneerng Washngton Unversty n St. Lous Abstract Due to ther potental to delver ncreased performance over sngle-core processors, mult-core processors have become manstream n processor desgn. Computaton-ntensve real-tme systems must explot ntra-task parallelsm to take full advantage of mult-core processng. However, exstng results n real-tme schedulng of parallel tasks focus on restrctve task models such as the synchronous model where a task s a sequence of alternatng parallel and sequental segments, and parallel segments have threads of executon that are of equal length. In ths paper, we address a general model for determnstc parallel tasks, where a task s represented as a DAG wth dfferent nodes havng dfferent executon requrements. We make several key contrbutons towards both preemptve and non-preemptve realtme schedulng of DAG tasks on mult-core processors. Frst, we propose a task decomposton that splts a DAG nto sequental tasks. Second, we prove that parallel tasks, upon decomposton, can be scheduled usng preemptve global EDF wth a resource augmentaton bound of 4. Ths bound s as good as the best known bound for more restrctve models, and s the frst for a general DAG model. Thrd, we prove that the decomposton has a resource augmentaton bound of 4 plus a non-preempton overhead for non-preemptve global EDF schedulng. To our knowledge, ths s the frst resource augmentaton bound for nonpreemptve schedulng of parallel tasks. Through smulatons, we demonstrate that the acheved bounds are safe and suffcent. I. INTRODUCTION Due to slowng down of the rate of ncrease of clock frequences, most processor chp manufacturers have recently moved to ncreasng performance of processors by ncreasng the number of cores on each chp. Intel s 80-core Teraflops Research Chp [1], Tlera s 100-core TILE-Gx processor, AMD s 1-core Opteron processor [], and a 96-core processor developed by ClearSpeed [3] are some notable examples of mult-core chps. Wth the rapd evoluton of mult-core processor technology, however, real-tme system software and programmng models have faled to keep pace. In partcular, most classc results n real tme schedulng concentrate on sequental tasks runnng on multple processors or cores [4]. Whle these systems allow many tasks to execute on the same mult-core host, they do not allow an ndvdual task to run any faster on a mult-core machne than on a sngle-core one. If we want to scale the capabltes of ndvdual tasks wth the number of cores, t s essental to develop new approaches for tasks wth ntra-task parallelsm, where real-tme tasks themselves are parallel tasks whch can utlze multple cores at the same tme. Such ntra-task parallelsm may enable more strngent tmng guarantees for complex real-tme systems that requre heavy computaton such as vdeo survellance, computer vson, radar trackng, and hybrd real-tme structural testng [5] whose strngent tmng constrants are dffcult to meet on tradtonal sngle-core processors. There has been some recent work on real-tme schedulng for parallel tasks, but t has been mostly restrcted to the synchronous task model [6], [7]. In the synchronous model, each task conssts of a sequence of segments wth synchronzaton ponts at the end of each segment. In addton, each segment of a task contans threads of executon that are of equal length. For such synchronous tasks, our prevous result [6] proves a resource augmentaton bound of 4. Whle the synchronous task model represents the knd of tasks generated by the parallel for loop construct that s common to many parallel languages such as OpenMP [8] and ClkPlus [9], most parallel languages also have other constructs for generatng parallel programs, notably fork-jon constructs. A program that uses fork-jon constructs wll generate a non-synchronous task, generally represented as a Drected Acyclc Graph (DAG), where each thread (sequence of nstructons) s a node and edges represent dependences between threads. Our prevous work [6] consders a restrcted verson of the DAG model, where each node (thread) requres unt computaton. For the unt-node DAG model, the scheduler frst converts each task to a synchronous task, and then apples the analyss followed for a synchronous model. All prevous work on parallel real-tme tasks consders preemptve schedulng, where threads are allowed to preempt each other n the mddle of executon. Whle ths s a reasonable model, preempton can often be a hgh-overhead operaton snce t often nvolves a system call and a context swtch. An alternatve schedulng model s to consder node-level nonpreemptve schedulng (smply called non-preemptve schedulng n ths paper), where once the executon of a partcular node (thread) starts, the thread cannot be preempted by any other thread. Most parallel languages and lbrares have yeld ponts at the end of threads (nodes of the DAG), allowng lowcost, user-space preempton at these yeld ponts. For these languages and lbrares, schedulers that requre preempton only when threads end (n other words, where threads do not preempt each other) can be mplemented entrely n user-space (wthout nteracton wth the kernel), and therefore have low overheads. In addton, ths model also has cache benefts. In ths paper, we generalze the prevous work n two ways. Frst, we consder a general task model, where tasks are represented by general DAGs where threads (nodes) can

5 have arbtrary executon requrements. Second, we address both preemptve and node-level non-preemptve schedulng for these DAGs. Note that f the decomposton proposed n [6] for unt-node DAG s appled to a general DAG, every thread (node) wll further splt nto smaller threads. Snce all subtasks of a segment synchronze at ts end, there s no easy way of assurng non-preempton of a thread. In partcular, ths paper makes the followng contrbutons. We propose a novel task decomposton to transform the nodes of a general DAG nto sequental tasks. Ths decomposton does not convert non-synchronous tasks to synchronous tasks and therefore, unlke that n [6], t does not requre splttng threads nto shorter threads. Hence, our proposed decomposton allows nonpreemptve schedulng, where threads (nodes of the DAG) are never preempted. We prove that parallel tasks n the general DAG model, upon decomposton, can be scheduled usng preemptve global EDF wth a resource augmentaton bound of 4. Ths bound s as good as the best known bound for more restrctve models [6] and, to our knowledge, s the frst for a general DAG model. We prove that the proposed decomposton requres a resource augmentaton bound of 4 plus a non-preempton overhead of the tasks when usng non-preemptve global EDF schedulng. To our knowledge, ths s the frst bound for non-preemptve schedulng of parallel real-tme tasks. Our prelmnary, short-scale smulatons ndcate that the bounds are safe. For most task sets, the resource augmentaton requred s at most for preemptve schedulng and 3 for non-preemptve schedulng, whch s sgnfcantly smaller than theoretcal bound. The rest of the paper s organzed as follows. Secton II revews related work. Secton III descrbes the task model. Secton IV presents the new task decomposton. Sectons V and VI present analyses for preemptve and non-preemptve global EDF schedulng, respectvely. Secton VII presents the smulaton results. Secton VIII offers conclusons. II. RELATED WORK There has been a substantal amount of work on tradtonal multprocessor real-tme schedulng focused on sequental tasks [4]. Some work has addressed schedulng for parallel tasks [10] [16], but t does not consder task deadlnes. Soft real-tme schedulng (where the goal s to meet a certan subset of deadlnes based on applcaton-specfc crtera) has been studed for varous parallel task models and for varous optmzaton crtera [17] []. For example, many nvestgatons [17] [0] focus on cache performance for multthreaded tasks, where the number of parallel threads n a task cannot exceed the number of cores. Others consder task models where a task s executed on up to a gven number of processors, and focus on metrcs such as the makespan [1] and total work done by tasks that meet ther deadlnes []. Hard real-tme schedulng (where the goal s to meet all task deadlnes) s ntractable for most cases of parallel tasks wthout resource augmentaton [3]. Some early work makes smplfyng assumptons about task models [4] [8]. For example, [4] [6] address the schedulng of malleable tasks, where tasks can execute on varyng number of processors wthout loss n effcency. The study n [7] consders nonpreemptve EDF schedulng of moldable tasks, where the actual number of processors used by a partcular task s determned before startng the system and remans unchanged. Gang EDF schedulng [8] of moldable parallel tasks requres users to select (at submsson tme) a fxed number of processors upon whch ther task wll run, and the task must then always use that number of threads. Recently, preemptve real-tme schedulng has been studed n [6], [7] for synchronous parallel tasks wth mplct deadlnes. In [7], every task s an alternate sequence of parallel and sequental segments wth each parallel segment consstng of multple threads of equal length that synchronze at the end of the segment. All parallel segments n a task have an equal number of threads whch cannot exceed the number of processor cores. It transforms every thread to a subtask, and proves a resource augmentaton bound of 3.4 under parttoned Deadlne Monotonc (DM) schedulng. For the synchronous model wth arbtrary numbers of threads n segments, our earler work n [6] proves a resource augmentaton bound of 4 and 5 for global EDF and parttoned DM schedulng, respectvely. For the unt-node DAG model where each node has unt executon requrement, ths approach converts each task to a synchronous task, and then apples the same approach. In ths paper, we consder a more general model of determnstc parallel real-tme tasks where each task s modeled as a DAG, and dfferent nodes of the DAG may have dfferent executon requrements. For preemptve schedulng, n partcular, we prove the same resource augmentaton bound of 4 as [6]. In addton, we consder non-preemptve global EDF schedulng, and prove a resource augmentaton bound whch, to our knowledge, s the frst bound for non-preemptve schedulng of parallel tasks. III. PARALLEL TASK MODEL We consder n perodc parallel tasks to be scheduled on a mult-core platform consstng of m dentcal cores. The task set s represented by τ = {τ 1,τ,,τ n }. Each task τ, 1 n, s represented as a Drected Acyclc Graph (DAG), where the nodes stand for dfferent executon requrements, and the edges represent dependences between the nodes. A node n τ s denoted by W j, 1 j n, wth n beng the total number of nodes n τ. The executon requrement of node W j s denoted by Ej. A drected edge from node W j to node W k, denoted as W j W k, mples that the executon of W k cannot start unless W j has fnshed executon. W j,n ths case, s called a parent of W k, whle W k s ts chld. A node may have 0 or more parents or chldren. A node can start executon only after all of ts parents have fnshed executon. Fgure 1 shows a task τ wth n =10nodes.

6 Fg. 1. A parallel task τ represented as a DAG The total executon requrement of τ s the sum of the executon requrements of all of ts nodes, and s denoted by C (tme unts). The perod of task τ s denoted by T. The deadlne D of each task τ s consdered mplct,.e., D = T. Task set τ s sad to be schedulable by algorthm A, fa can schedule τ such that every τ τ can meet deadlne D. IV. TASK DECOMPOSITION We consder schedulng parallel tasks by decomposng them nto sequental subtasks. Ths strategy allows us to leverage exstng schedulablty analyss for multprocessor schedulng (both preemptve and non-preemptve). In ths secton, we present the decomposton of a parallel task under general DAG model. The method decomposes a task nto nodes. Thus, each node of a task becomes a sequental subtask wth executon requrement equal to the executon requrement of the node. All nodes of a DAG are assgned approprate deadlnes and release offsets such that when they execute as ndvdual subtasks all dependences among them n the DAG (.e., n the orgnal task) are preserved. Thus, an mplct deadlne DAG s decomposed nto a set of constraned deadlne sequental subtasks wth each subtask correspondng to a node of the DAG. We use the terms subtask and node nterchangeably. Note that for schedulablty analyss of parallel tasks, conventonal utlzaton bound approaches are not useful [6], [7]. Instead, determnng a resource augmentaton bound represents a promsng approach [6], [7]. A resource augmentaton quantfes how much we have to ncrease the processor (core) speed, wth respect to an optmal algorthm for the orgnal task set, to guarantee the schedulablty of the decomposed tasks. Analyss for boundng ths value s mostly based on the denstes of the decomposed tasks. In the followng, we frst present termnology used n decomposton. Then, we present the proposed technque for decomposton, followed by a densty analyss of the decomposed tasks. A. Termnology The executon requrement (.e., the work) C of task τ s the sum of the executon requrements of all nodes n τ. Thus, C s the maxmum executon tme of task τ on a mult-core platform where each processor core has unt speed. That s, C s ts executon tme on a unt-speed sngle-core processor f t s never preempted. We use C,ν to denote the maxmum executon tme of task τ on a mult-core platform where each processor core has speed ν. Forτ wth n nodes, each wth executon requrement E j, C and C,ν are expressed as n C = E j ; C,ν = 1 n E j ν = C (1) ν j=1 j=1 For task τ, the crtcal path length, denoted by P, s the sum of executon requrements of the nodes on a crtcal path. A crtcal path s a drected path that has the maxmum executon requrement among all other paths n DAG τ. Thus, P s the mnmum executon tme of task τ meanng that t needs at least P tme unts on unt-speed processor cores even when the number of cores m s nfnte. Therefore, ts deadlne T (.e., perod) must be no less than P. T P () We use P,ν to denote the crtcal path length of task τ on a mult-core platform where each processor core has speed ν, whch s expressed as P,ν = P ν. The utlzaton u of task τ, and the total utlzaton u sum (τ) for the set of n tasks τ are defned as follows: u = C T ; u sum (τ) = n =1 If the total utlzaton u sum s greater than m, then no algorthm can schedule τ on m dentcal unt-speed processor cores. The densty δ of task τ, and the total densty δ sum (τ) and the maxmum densty δ max (τ) for the task set τ are gven by δ = C D ; δ sum (τ) = C T n δ ; δ max (τ) =max{δ 1 n} =1 The demand bound functon (DBF) of a task τ s the largest cumulatve executon requrement of all jobs generated by τ that have both arrval tmes and deadlnes wthn a contguous nterval of t tme unts. For τ, the DBF s gven by DBF(τ,t)=max ( 0, ( t D T +1 )C ) Based on the DBF, the load of the set of n tasks τ, denoted by λ(τ), s defned as follows n DBF(τ,t) λ(τ) =max =1 t>0 t (4) B. Decomposton Technque In our decomposton, each node of a task becomes an ndvdual sequental subtask wth ts own executon requrement and an assgned constraned deadlne. To preserve the dependences n the orgnal DAG, each node s assgned a release offset. Snce a node cannot start executon untl all of ts parents fnsh, ts release offset s equal to the maxmum sum of the release offset and deadlne among ts parents. That s, a node starts after ts latest parent fnshes. The (relatve) deadlnes of the nodes are assgned by dstrbutng (3) 3

7 # " (a) τ : a tmng dagram for DAG τ (of Fgure 1) when t executes on an nfnte number of processor cores " (b) Slack dstrbuton n τ syn (a synchronous model wth equal length threads n each segment for τ )! # (c) Calculatng offset and deadlne for each node of τ by removng ntermedate subdeadlnes n the node determned n τ syn Fg.. Decomposton of τ nto nodes by assgnng an offset and deadlne to each node 4

8 the avalable slack of the task. We calculate the slack for each task consderng a mult-core platform where each processor core has speed. The slack for task τ, denoted by L,s defned as the dfference between ts deadlne and ts crtcal path length on -speed processor cores. That s, L = D P, = T P, = T P (5) For task τ, the deadlne and the offset assgned to node W j are denoted by D j and Φj, respectvely. Snce we assgn slack consderng -speed processor cores, deadlne D j and offset Φ j are also based on -speed processor cores. That s, these deadlnes may not necessarly be met on unt-speed processor cores. Once approprate values of D j and Φj are determned for each node W j (respectng the dependences n the DAG), task τ s decomposed nto nodes. Upon decomposton, the dependences n the DAG need not be consdered, and each node can execute as a tradtonal multprocessor task. Hence, the decomposton technque for τ bols down to determnng D j and Φj for each node W j. We now present steps to determne D j and Φj for each node W j of τ. Each step s also followed by an example usng the DAG τ of Fgure 1. To do so, we assgn an example executon requrement E j to each node W j as E1 =4, E =, E3 =4, E 4 =5, E5 =3, E6 =4, E7 =, E8 =, E9 =3, E10 =3. Ths gves C =3, and P =14. Perod T s set to 1. Frst, we represent DAG τ as a tmng dagram τ (Fgure (a)) that shows ts executon tme on nfnte number of unt-speed processor cores. Specfcally, τ ndcates the earlest start tme and the earlest fnshng tme of each node when m =. For any node W j that has no parents, the earlest start tme and the earlest fnshng tme are 0 and E j, respectvely. For every other node W j, the earlest start tme s the latest fnshng tme among ts parents, and the earlest fnshng tme s E j tme unts after that. For example, n τ of Fgure 1, nodes W 1, W, and W 3 can start executon at tme 0, and ther earlest fnshng tmes are 4,, and 4, respectvely. Node W 4 can start after W 1 and W complete, and fnsh after 5 tme unts at ts earlest, and so on. Thus, Fgure (a) shows τ of the DAG τ of Fgure 1. Next, based on τ, the calculaton of D j and Φ j (see Fgure (a)) for each node W j nvolves the followng two steps. In Step 1, for each node, we dstrbute slack among dfferent parts of the node. In Step, the total slack assgned to dfferent parts of the node s assgned as the node s slack. 1) Step 1 (slack dstrbuton): In DAG τ, a node can execute wth dfferent numbers of nodes n parallel at dfferent tme. Such a degree of parallelsm can be approxmated based on τ. For example, n Fgure (a), node W 5 executes wth W 1 and W 3 n parallel for the frst tme unts, and then executes wth W 4 n parallel for the next tme unt. In ths way, we frst dentfy the degrees of parallelsm at dfferent parts of each node. Intutvely, the parts of a node that may execute wth a large number of nodes n parallel demand more slack. Therefore, dfferent parts of a node are assgned dfferent amounts of slack consderng ther degrees of parallelsm and executon requrements. Later, the sum of slack of all parts of a node s assgned to the node tself. To dentfy the degree of parallelsm for dfferent portons of a node based on τ, we assgn slack to a node n dfferent (consecutve) segments. In dfferent segments of a node, the task may have dfferent degrees of parallelsm. In τ, startng from the left, we draw a vertcal lne at every tme nstant where a node starts or ends (as shown n Fgure (b)). Ths s done n lnear tme usng a breadth-frst search over the DAG. The vertcal lnes now splt τ nto segments. For example, n Fgure (b), τ s splt nto 7 segments (numbered n ncreasng order from left to rght). Once τ s splt nto segments, each segment conssts of an equal amount of executon by the nodes that le n the segment. Parts of dfferent nodes n the same segment can now be thought of threads that can run n parallel, and the threads n a segment can start only after those n the precedng one fnsh. Such a model s thus smlar to the synchronous task model used n [6]. We denote ths model by τ syn.we frst assgn slack to the segments, and fnally we add all slack assgned to dfferent segments of a node to calculate ts overall slack. Note that τ s never converted to a synchronous model; the procedure only dentfes segments to determne slack for nodes, and does not decompose the task at ths stage. We dstrbute slack among the nodes based on the number of threads and executon requrement of the segments where a node les n τ syn. We frst calculate slack for each segment. Let τ syn be a sequence of s segments, where the j-th segment s represented by e j,mj, wth mj beng the number of threads n the segment, and e j beng the executon requrement of each thread n the segment (see Fgure (b)). Snce τ syn has the same crtcal path and total executon requrements as those of τ, we can now defne P and C n terms of τ syn s P = e j ; C s = m j.ej j=1 j=1 : For every j-th segment of τ syn, we calculate a value d j, called an ntermedate subdeadlne, so that the segment s assgned a slack value of d j ej. That s, each thread n the segment gets ths extra tme d j ej beyond ts executon tme ej on -speed processor cores. In the rest of Step 1, we calculate the values d j based on the technque used n [6]. The total slack s L (Equaton 5). For every j-th segment, a fracton f j of L s determned so that each thread n the segment s assgned slack ej f j, and ntermedate subdeadlne d j = ej + ej f j = ej (1 + f j ) (6) The densty of each thread on -speed cores then becomes e j d j = e j = e j (1 + f j ) 1 1+f j Snce any j-th segment conssts of m j threads, the segment s densty on -speed processor cores s then mj. 1+f j 5

9 The segments wth larger numbers of threads and wth longer threads are computaton ntensve, and demand more slack. Therefore, for each j-th segment, we determne ts slack fracton f j by consderng both mj and ej. Each j-th segment wth m j > C, T P, s classfed as a heavy segment whle other segments are called lght segments. Ths leads us to two dfferent scenaros: when τ syn has no heavy segments, and when τ syn has some heavy segments. Therefore, two dfferent approaches are followed for two scenaros to determne f j. (a) When τ syn has no heavy segments: Snce each segment has a smaller number of threads ( C, T P, ), we only consder the length of a thread n each segment, and assgn the slack proportonally among all segments. That s, for j-th segment, f j = L P, (7) Then, the ntermedate subdeadlne d j s gven by Equaton 6. (b) When τ syn has some (or all) heavy segments: In ths case, no slack s assgned to the lght segments. All avalable slack L s dstrbuted among the heavy segments n a way so that each heavy segment can acheve the same densty. Let τ syn have a total of s h heavy segments, each k-th heavy segment denoted e k,h,m k,h, where 1 k s h (superscrpt h standng for heavy ). Smlarly, let t have a total of s l lght segments, each j-th lght segment denoted e j,l,m j,l, where 1 j s l (superscrpt l standng for lght ). For any j-th lght segment, the slack fracton f j,l =0. For heavy ones, slack fracton f j,h s determned so that m 1,h 1+f 1,h = m,h 1+f,h = m3,h 1+f 3,h = = ms h,h 1+f sh,h In addton, snce all the slack s dstrbuted among the heavy segments, the followng equalty must hold. e 1,h 1,h.f + e,h,h.f + e3,h 3,h.f h + + es,h.f s h,h Solvng Equatons 8 and 9 gves (see [6] for detals): f j,h = mj,h (T P, l ) C, C, l 1, s l P, l = 1 e l,j and C, l = 1 j=1 j, 1 j s h, where s l m l,j.e l,j j=1 Thus, for any j-th segment n τ syn, the slack fracton s 0; f m j C, f j = m j (T P l, ) C, C l, T P, (8) = L (9) 1; f m j > C, T P, (10) Then, ntermedate subdeadlne d j s gven by Equaton 6. Fgure 3(a) shows an example for calculatng slacks for dfferent segments of τ syn when T =1. ) Step (calculatng deadlne and offset for nodes): We have assgned ntermedate subdeadlnes to (the threads of) each segment of τ syn n Step 1. Snce a node may be splt nto multple (consecutve) segments n τ syn, now we have to remove all ntermedate subdeadlnes of a node. Namely, we add all ntermedate subdeadlnes of a node, and assgn the total as the node s deadlne. Now let a node W j of τ belong to segments k to r (1 k r s )nτ syn. Therefore, the deadlne D j of node W j s calculated as follows (as shown n Fgure (c)). D j = dk + d k d r (11) Note that the executon requrement E j of node W j s E j = ek + e k e r (1) Node W j cannot start untl all of ts parents complete. Hence, ts release offset Φ j s determned as follows (Fgure (c)). { Φ j = 0; f W j has no parent max{φ l + Dl W l s a parent of W j }; otherwse. Now that we have assgned approprate deadlne D j and release offset Φ j to each node W j of τ, the DAG τ s now decomposed nto nodes. Each node W j s now an ndvdual (sequental) multprocessor subtask wth an executon requrement E j, a constraned deadlne Dj, and a release offset Φj. Fgure 3(b) shows an example of decomposton of τ. C. Densty Analyss after Decomposton After decomposton, let τ dec denote all subtasks (.e., nodes) that τ generates. Note that the denstes of all such subtasks comprse the densty of τ dec. Now we analyze the densty of τ dec whch wll later be used to analyze schedulablty (n terms of resource augmentaton bound) upon decomposton. Let node W j of τ belong to segments k to r (1 k r s )nτ syn. Snce W j has been assgned deadlne D j,by Equatons 11 and 1, ts densty δ j, after decomposton on -speed processor cores s δ j, = Ej / D j = (ek + ek e r )/ d k + dk d r (13) Let τ dec be the set of all generated subtasks of all orgnal DAG tasks, and δ max, be the maxmum densty among all subtasks n τ dec on -speed processor cores. By Equatons 7 and 10, the value of the slack assgned to each subtask W j l n τ dec s non-negatve,.e., Ej l Dj l. Hence, δ max, =max{δ j l, W j l s a subtask n τ dec } 1 (14) τ syn Note that we represent a DAG τ as τ syn n Step 1. Ths s a sequence of segments, each segment consstng of a set of equal-length threads (see Fgure (b)). As noted, τ syn s exactly the same as the synchronous task model used n [6]. In Step 1, we assgn subdeadlnes to dfferent segments of τ syn usng the same approach as [6]. Accordng to [6], τ syn can be decomposed nto threads as follows: each thread becomes 6

10 !!!!! " # % $ & (a) Calculatng slacks for dfferent segments of τ syn! # % (b) Calculatng deadlne and offset for nodes of τ Fg. 3. An example for decomposton of τ (shown n Fgure 1) when T =1 a sequental subtask; all threads of j-th segment are assgned executon requrement e j, and deadlne dj ; all threads of the frst segment are assgned release offset 0, and those of any other j-th segment are assgned offset d 1 + d + + dj 1. Theorem 1 states the densty of τ syn denoted by δ syn, after such decomposton on -speed processor cores as proved n [6]. Theorem 1. (From [6]) If any τ syn nto threads n all segments, and f δ syn, decomposed threads of τ syn δ syn, C/ T P /., 1 n, s decomposed s the densty of these on -speed processor cores, then Theorem proves that, after our proposed decomposton of a DAG τ nto nodes, ts densty remans no greater than δ syn, on -speed processors cores. Theorem. Let a DAG τ, 1 n, wth perod T, crtcal path length P, and maxmum executon requrement C be decomposed nto subtasks (nodes) denoted τ dec usng the proposed decomposton. The densty of τ dec on -speed processor cores s at most C / T P /. Proof: Snce we decompose τ nto nodes (.e., subtasks), the denstes of all decomposed nodes W j, 1 j n, comprse the densty of τ dec. In Step 1, every node W j of, and each thread s assgned an ntermedate subdeadlne. In Step, we remove the ntermedate subdeadlnes n the node, and ther total s assgned as the node s deadlne. By Theorem 1, f we decompose wthout removng the ntermedate subdeadlnes n the nodes, then the densty of τ after such decomposton τ s splt nto threads n dfferent segments of τ syn on -speed processor cores s δ syn, C/ T P /. Hence, t s suffcent to prove that removng ntermedate subdeadlnes n the nodes does not ncrease the task s overall densty. That s, t s suffcent to prove that the densty δ j, (Equaton 13) of any node W j after removng ts ntermedate subdeadlnes s no greater than the densty δ j,syn, that t had before removng 7

11 ts ntermedate subdeadlnes. Let node W j of τ be splt nto threads n segments k to r (1 k r s )nτ syn. Snce the total densty of any set of tasks s an upper bound on ts load (proven n [9]), the load of the threads of W j must be no greater than the total densty of these threads. Snce each of these threads s executed only once n the nterval of D j, by Equaton 3, the DBF of the thread, thread l, n segment l, k l r, n the nterval Dj on -speed processor cores s gven by DBF(thread l,d j )=el Therefore, usng Equaton 4, the load, denoted by λ j,syn,,of the threads of W j n τ syn on -speed cores for nterval D j s λ j,syn, e k D j + e k+1 D j + + e r D j = Ej / D j = δ j, j, syn j, syn Snce δ, λ,, for any W j syn,wehaveδj,, δ j,. Let δ sum, be the total densty of all subtasks τ dec on -speed processor cores. Then, from Theorem, n C / δ sum, (15) T P / =1 V. PREEMPTIVE GLOBAL EDF SCHEDULING Once all DAG tasks are decomposed nto nodes (.e., subtasks), we consder schedulng the nodes. Snce every node after decomposton becomes a sequental multprocessor task, we schedule them usng tradtonal multprocessor schedulng polces. In ths secton, we consder preemptve global Earlest Deadlne Frst (EDF) schedulng of the decomposed subtasks. Lemma 3. For any set of DAG model parallel tasks τ = {τ 1,,τ n }, let τ dec be the decomposed task set. If τ dec s schedulable under some preemptve schedulng, then τ s also preemptvely schedulable. Proof: In each τ dec, a node (.e., a subtask) s released only after all of ts parents fnsh executon. Hence, the precedence relatons n orgnal task τ are retaned n τ dec. Besdes, for each τ dec, the deadlne and the executon requrement are the same as those of orgnal task τ. Hence, f τ dec s preemptvely schedulable, then a preemptve schedule must exst for τ where each task n τ meets ts deadlne. To schedule the decomposed subtasks τ dec, the EDF polcy s the same as the tradtonal global EDF polcy where jobs wth earler absolute deadlnes have hgher prortes. Due to the preemptve polcy, a job can be suspended (preempted) at any tme by arrvng hgher-prorty jobs, and s later resumed wth (n theory) no cost or penalty. Under preemptve global EDF, we now present a schedulablty analyss for τ dec n terms of a resource augmentaton bound whch, by Lemma 3, s also a suffcent analyss for the orgnal DAG task set τ. For a task set, the resource augmentaton bound ν of a schedulng polcy A on a mult-core processor wth m cores represents a processor speedup factor. That s, f there exsts any way to schedule the task set on m dentcal unt-speed processor cores, then A s guaranteed to successfully schedule t on an m-core processor wth each processor core beng ν tmes as fast as the orgnal. Our analyss hnges on a result (Theorem 4) for preemptve global EDF schedulng of constraned deadlne sporadc tasks on tradtonal multprocessor platform [30]. Ths result s a generalzaton of the result for mplct deadlne tasks [31]. Theorem 4. (From [30]) Any constraned deadlne sporadc task set π wth total densty δ sum (π) and maxmum densty δ max (π) s schedulable usng preemptve global EDF strategy on m unt-speed processor cores f δ sum (π) m (m 1)δ max (π) Snce τ dec also conssts of constraned deadlne (sub)tasks that are perodc (wth offsets), the above result holds for τ dec. We now use the results of densty analyss from Subsecton IV-C and prove n Theorem 5 that τ dec s guaranteed to be schedulable wth a resource augmentaton of at most 4. The proof of Theorem 5 s smlar to the proof used n [6]. Theorem 5. For any set of DAG model parallel tasks τ = {τ 1,τ,,τ n }, let τ dec be the decomposed task set. If there exsts any algorthm that can schedule τ on m unt-speed processor cores, then τ dec s schedulable under preemptve global EDF on m processor cores, each of speed 4. Proof: If τ s schedulable on m dentcal unt-speed processor cores, the followng condton must hold. n C m (16) T =1 We decompose tasks consderng that each processor core has speed. To be able to schedule the decomposed tasks τ dec, suppose we need to ncrease the speed of each core ν tmes further. That s, we need each core to be of speed ν. Onan m-core platform where each core has speed ν, let the total densty and the maxmum densty of task set τ dec be denoted by δ sum,ν and δ max,ν, respectvely. From 14, we have δ max,ν = δ max, 1 (17) ν ν Based on Equatons and 16, when each processor core s of speed ν, the total densty of τ dec can be wrtten from 15 as δ ν sum, n =1 C ν T P C ν T T = 1 ν n C m T ν =1 (18) Usng Equatons 17 and 18 n Theorem 4, τ dec s schedulable under preemptve EDF on m cores each of speed ν f m ν m (m 1) 1 ν ν 1 mν 1 From the above condton, τ dec must be schedulable f ν 1 ν ν 4 8

12 VI. NON-PREEMPTIVE GLOBAL EDF SCHEDULING We now consder non-preemptve global EDF schedulng. The orgnal task set τ s scheduled based on node-level nonpreempton. In node-level non-preemptve schedulng, whenever the executon of a node n a DAG starts, the node s executon cannot be preempted by any task. Most parallel languages and lbrares have yeld ponts at the ends of threads (nodes of the DAG). Therefore, they allow low cost, userspace preempton at the end of threads. For these languages and lbrares, schedulers that requre preempton only when threads end can be mplemented entrely n user-space (wthout nteracton wth the kernel), and therefore have low overheads. The decomposton converts each node of a DAG to a tradtonal multprocessor (sub)task. Therefore, we consder fully non-preemptve global EDF schedulng of the decomposed tasks. Namely, once a job of a decomposed (sub)task starts executon, t cannot be preempted by any other job. Lemma 6. For a set of DAG parallel tasks τ = {τ 1,,τ n }, let τ dec be the decomposed task set. If τ dec s schedulable under some fully non-preemptve schedulng, then τ s schedulable under node-level non-preempton. Proof: Snce the decomposton converts each node of a DAG to an ndvdual task, a fully non-preemptve schedulng of τ dec preserves the node-level non-preemptve behavor of task set τ. The rest of the proof follows from Lemma 3. Under non-preemptve global EDF, we now present a schedulablty analyss for τ dec n terms of a resource augmentaton bound whch, by Lemma 6, s also a suffcent analyss for the DAG task set τ. Ths analyss explots Theorem 7 for non-preemptve global EDF schedulng of constraned deadlne perodc tasks on tradtonal multprocessor. The theorem s a generalzaton of the result for mplct deadlne tasks [3]. For a task set π, let C max (π) and D mn (π) be the maxmum executon requrement and the mnmum deadlne among all tasks n π. In non-preemptve schedulng, C max (π) represents the maxmum blockng tme that a task may experence, and plays major role n schedulablty. Hence, a non-preempton overhead [3] ρ(π) = Cmax(π) D. mn(π) Theorem 7. (From [3]) Any constraned deadlne perodc task set π wth total densty δ sum (π), maxmum densty δ max (π), and non-preempton overhead ρ(π) s schedulable usng non-preemptve global EDF on m unt-speed cores f δ sum (π) m ( 1 ρ(π) ) (m 1)δ max (π) Let E max and E mn be the maxmum and mnmum executon requrement, respectvely, among all nodes of all DAG tasks. In non-preemptve schedulng of decomposed subtasks τ dec, the non-preempton overhead ρ on -speed processor cores s gven by ρ Emax E mn. The overhead on unt-speed processor cores s then ρ. Usng an analyss smlar to Secton V, Theorem 8 derves a resource augmentaton bound of 4+ρ for non-preemptve global EDF schedulng of τ dec. Theorem 8. For DAG model parallel tasks τ = {τ 1,,τ n }, let τ dec be the decomposed task set wth non-preempton overhead ρ. If there exsts any way to schedule τ on m untspeed processor cores, then τ dec s schedulable under nonpreemptve global EDF on m cores, each of speed 4+ρ. Proof: Smlar to Theorem 5, suppose we need each processor core to be of speed ν to be able to schedule the decomposed tasks τ dec. Snce the non-preempton overhead of τ dec on -speed cores s ρ, onν-speed cores t s ρ/ν. Usng Equatons 17 and 18 n Theorem 7, τ dec s schedulable under non-preemptve EDF on m cores each of speed ν f m ν m(1 ρ ν ) (m 1) 1 ν +ρ 1 ν mν 1 From the above condton, task set τ dec s schedulable f +ρ ν 1 ν +ρ ν 4+ρ VII. EVALUATION In ths secton, we descrbe some prelmnary smulaton studes we have conducted to valdate our bounds. Whle these are small-scale studes, they seem to ndcate that not only are the theoretcal bounds easly met, but also they are n fact qute loose, prmarly for non-preemptve schedulng. In partcular, n our experments most task sets requre augmentaton of less than and all requre augmentaton of less than 3. In our studes, DAGs are generated by frst fxng the number of nodes n the graph and then addng edges untl t becomes weakly connected. Nodes are assgned random executon requrements from a gven range. Each task s assgned a vald harmonc perod. To generate a task set, we keep addng tasks to the set as long as ther total utlzaton upper bound (Equaton 16) s stll satsfed. Each result s generated usng at least 1000 task sets. For the frst set of smulatons, executon requrements of the nodes n DAGs are n a range [50, 100] (makng the nonpreempton overhead ρ = ), and the average parallelsm of tasks (C /P ) s about 3.4. We test usng 4, 8, and 16 processor cores, and task sets have an average utlzaton of 3.13, 7.15, and 15.03, respectvely. For every case, the decomposed subtasks are scheduled under both preemptve and non-preemptve EDF consderng dfferent speeds of the cores. Fgure 4 shows the falure rates (.e., the rato of the number of unschedulable task sets to the total number of task sets) as the processor speed ncreases. Under preemptve EDF, all task sets are schedulable at speed 1.0, 0.9, and 0.96 respectvely for 4, 8, and 16 processor cores. Under nonpreemptve schedulng, the tasks requre an augmentaton of 3 (not shown to preserve resoluton),, and 1.3 respectvely. In the second set of smulatons, we set the number of cores to 16 and test the effect of non-preempton overhead (ρ) on our decomposton (results shown n Fgure 5). To acheve a value of 1,, 5, and 10 for ρ, we assgn executon requrements from ranges [50, 50], [50, 100], [50, 50], and [50, 500], respectvely. Our results ndcate that all tasks are schedulable at speed of just, except when ρ =1where a few test cases requred speed more than (up to 3). Surprsngly, contrary to the theoretcal bounds, hgher values of ρ requre a smaller augmentaton. We suspect that ths mght be due to 9

13 0.1 Preemptve Non-Preemptve 0.1 Preemptve Non-Preemptve 0.1 Preemptve Non-Preemptve Falure Rate Falure Rate Falure Rate Processor Speed (a) m =4 Fg Processor Speed Processor Speed (b) m =8 (c) m =16 Falure rate under varyng processor speed-up factor Fg Falure Rate ρ=1 ρ= ρ=5 ρ= Processor Speed Falure rate under dfferent non-preempton overheads the partcular method we use to generate DAGs, snce n our method, when ρ s smaller, the number of tasks n the task set may be larger, makng them more dffcult to schedule. VIII. CONCLUSIONS As mult-core technology becomes manstream n processor desgn, real-tme schedulng of parallel tasks s crucal to explot ts potental. In ths paper, we consder a general task model and through a novel task decomposton, we prove a resource augmentaton bound of 4 for preemptve schedulng and 4 plus a non-preempton overhead for non-preemptve EDF schedulng. To our knowledge, these are the frst bounds for real-tme schedulng of general DAG model tasks. Through smulatons, we have observed that bounds n practce are sgnfcantly smaller than the theoretcal bounds. These results suggest many drectons of future work. Frst, the smulatons ndcate that the bounds may be loose, especally for non-preemptve schedulng. We can try to provde better bounds and/or provde lower bound arguments that suggest that the bounds are n fact tght. Second, we can study the effect of caches on schedulng overhead. Requrng non-preempton mtgates ths problem to a certan extent, but more can be done to optmze cache-localty. Fnally, we have gnored the effects of locks and other forms of nondetermnstc synchronzaton n ths paper. Generalzng these bounds to some of those models would be very nterestng. REFERENCES [1] [] [3] Ace php. [4] R. I. Davs and A. Burns, A survey of hard real-tme schedulng for multprocessor systems, ACM Comp. Surv., vol. 43, pp. 35:1 44, 011. [5] H.-M. Huang, T. Tdwell, C. Gll, C. Lu, X. Gao, and S. Dyke, Cyberphyscal systems for real-tme hybrd structural testng: a case study, n ICCPS 10. [6] A. Safullah, K. Agrawal, C. Lu, and C. Gll, Mult-core real-tme schedulng for generalzed parallel task models, n RTSS 11. [7] K. Lakshmanan, S. Kato, and R. R. Rajkumar, Schedulng parallel realtme tasks on mult-core processors, n RTSS 10. [8] OpenMP, [9] Intel ClkPlus, [10] C. D. Polychronopoulos and D. J. Kuck, Guded self-schedulng: A practcal schedulng scheme for parallel supercomputers, IEEE Transactons on Computers, vol. C-36, no. 1, pp , [11] M. Drozdowsk, Real-tme schedulng of lnear speedup parallel tasks, Inf. Process. Lett., vol. 57, no. 1, pp , [1] X. Deng, N. Gu, T. Brecht, and K. Lu, Preemptve schedulng of parallel jobs on multprocessors, n SODA 96. [13] N. S. Arora, R. D. Blumofe, and C. G. Plaxton, Thread schedulng for multprogrammed multprocessors, n SPAA 98. [14] N. Bansal, K. Dhamdhere, J. Konemann, and A. Snha, Non-clarvoyant schedulng for mnmzng mean slowdown, Algorthmca, vol. 40, no. 4, pp , 004. [15] J. Edmonds, D. D. Chnn, T. Brecht, and X. Deng, Non-clarvoyant multprocessor schedulng of jobs wth changng executon characterstcs, Journal of Schedulng, vol. 6, no. 3, pp , 003. [16] K. Agrawal, Y. He, W. J. Hsu, and C. E. Leserson, Adaptve task schedulng wth parallelsm feedback, n PPoPP 06. [17] J. M. Calandrno and J. H. Anderson, On the desgn and mplementaton of a cache-aware multcore real-tme scheduler, n ECRTS 09. [18], Cache-aware real-tme schedulng on multcore platforms: Heurstcs and a case study, n ECRTS 08. [19] J. M. Calandrno, J. H. Anderson, and D. P. Baumberger, A hybrd real-tme schedulng approach for large-scale multcore platforms, n ECRTS 07. [0] J. H. Anderson and J. M. Calandrno, Parallel real-tme task schedulng on multcore platforms, n RTSS 06. [1] Q. Wang and K. H. Cheng, A heurstc of schedulng parallel tasks and ts analyss, SIAM J. Comput., vol. 1, no., pp , 199. [] O.-H. Kwon and K.-Y. Chwa, Schedulng parallel tasks wth ndvdual deadlnes, Theor. Comput. Sc., vol. 15, no. 1-, pp. 09 3, [3] C.-C. Han and K.-J. Ln, Schedulng parallelzable jobs on multprocessors, n RTSS 89. [4] K. Jansen, Schedulng malleable parallel tasks: An asymptotc fully polynomal tme approxmaton scheme, Algorthmca, vol. 39, no. 1, pp , 004. [5] W. Y. Lee and H. Lee, Optmal schedulng for real-tme parallel tasks, IEICE Trans. Inf. Syst., vol. E89-D, no. 6, pp , 006. [6] S. Collette, L. Cucu, and J. Goossens, Integratng job parallelsm n real-tme schedulng theory, Inf. Process. Lett., vol. 106, no. 5, pp , 008. [7] G. Manmaran, C. S. R. Murthy, and K. Ramamrtham, A new approach for schedulng of parallelzable tasks nreal-tme multprocessor systems, Real-Tme Syst., vol. 15, no. 1, pp , [8] S. Kato and Y. Ishkawa, Gang EDF schedulng of parallel task systems, n RTSS 09. [9] N. Fsher, T. P. Baker, and S. Baruah, Algorthms for determnng the demand-based load of a sporadc task system, n RTCSA 06. [30] S. Baruah, Technques for multprocessor global schedulablty analyss, n RTSS 07. [31] J. Goossens, S. Funk, and S. Baruah, Prorty-drven schedulng of perodc task systems on multprocessors, Real-Tme Syst., vol. 5, no. -3, pp , 003. [3] S. Baruah, The non-preemptve schedulng of perodc tasks upon multprocessors, Real-Tme Syst., vol. 3, pp. 9 0,

Parallel Real-Time Scheduling of DAGs

Parallel Real-Time Scheduling of DAGs Washngton Unversty n St. Lous Washngton Unversty Open Scholarshp All Computer Scence and Engneerng Research Computer Scence and Engneerng Report Number: WUCSE-013-5 013 Parallel Real-Tme Schedulng of DAGs